SlightlyLoony
Tera Contributor
10-27-2011
07:06 AM
Here's our exercise for today: write a regex and accompanying script that will allow us to extract the parenthetical text from this sample:
Once upon a time (a long time ago), there was a purple frog living
in a yellow lake. A princess came along and thought (if that's the
right term to apply) to herself "I think I'll kiss this purple frog!" (have
you ever met such a princess?). Right away (as in immediately), the
frog morphed into a handsome prince, who ignored the princess
and started chasing the flies hovering over the lake (the end).
Note that each line of this sample text is terminated by a newline.
How would you do this?
Oh, dear, now look what I've done — I've made Judy cry...
Well, with what I've told you about so far, you might try writing a function and test code something like this:
var text = 'Once upon a time (a long time ago), there was a purple frog living\n';
text += 'in a yellow lake. A princess came along and thought (if that\'s the\n';
text += 'right term to apply) to herself "I think I\'ll kiss this purple frog!" (have\n';
text += 'you ever met such a princess?). Right away (as in immediately), the\n';
text += 'frog morphed into a handsome prince, who ignored the princess\n';
text += 'and started chasing the flies hovering over the lake (the end).\n';
JSUtil.logObject(getParentheticals(text));
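function getParentheticals(text) {
    var parser = /([a-zA-Z .,]{1,})/g;
    var results = [];
    var x;
    // exec returns null once there are no more matches
    while ((x = parser.exec(text)) != null) {
        results.push(x[1]);  // x[1] is the text captured by the first group
    }
    return results;
}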
Actually, I snuck something new in here: the g flag in the regex literal — see that at the end? That flag stands for global, and it tells the regex that we want to find all the occurrences of a match in our text. If you leave that flag off, bad things happen: the while loop never terminates, because it keeps searching for (and finding!) the first occurrence of our pattern.
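If you'd like to see what the g flag changes without risking an endless loop, here's a minimal sketch on a made-up two-word string. The variable names are mine, and console.log stands in for JSUtil.logObject on the assumption that you're trying this somewhere other than a ServiceNow instance. With the flag, each call to exec picks up where the previous one stopped; without it, every call starts over from the beginning.

```javascript
var sample = 'one two';

// With the g flag, the regex remembers where its last match ended (lastIndex).
var globalRe = /[a-z]+/g;
var g1 = globalRe.exec(sample)[0];
var g2 = globalRe.exec(sample)[0];
var g3 = globalRe.exec(sample);  // no more matches, so exec returns null

// Without the g flag, every call to exec starts searching at index 0 again.
var plainRe = /[a-z]+/;
var p1 = plainRe.exec(sample)[0];
var p2 = plainRe.exec(sample)[0];

console.log(g1, g2, g3); // one two null
console.log(p1, p2);     // one one
```

That repeated "one" is why the while loop would spin forever without the flag: exec never gets around to returning null.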
Anyway, what you're thinking is that this regex will match the left parenthesis, then one or more of any of those characters in the character class, then the right parenthesis. But when you run it, you don't get what you expected:
[0]: string = Once upon a time
[1]: string = a long time ago
[2]: string = , there was a purple frog living
[3]: string = in a yellow lake. A princess came along and thought
[4]: string = if that
[5]: string = s the
[6]: string = right term to apply
[7]: string = to herself
[8]: string = I think I
[9]: string = ll kiss this purple frog
[10]: string =
[11]: string = have
[12]: string = you ever met such a princess
[13]: string = . Right away
[14]: string = as in immediately
[15]: string = , the
[16]: string = frog morphed into a handsome prince, who ignored the princess
[17]: string = and started chasing the flies hovering over the lake
[18]: string = the end
[19]: string = .
Just look at the very first result — it matched something with no parentheses at all! What's the matter with this stupid regex!?!?!
Well, the dumb thing is doing exactly what you told it to do. Remember that parentheses are used by regexes to define capture groups? That's what's happening here — it's not matching those parentheses, it's just defining the capture group — which, btw, you're using in that code to extract the matched text and put it in the results. So how do you tell a regex that you want to match parentheses, and not use them to define a capture group? The general rule in regexes is that if you want the regex to interpret a character as just the character, and not as a metacharacter, you escape it by preceding it with a backslash ("\"). So if we change our regex to this:
/\(([a-zA-Z .,]{1,})\)/g
and run it again, we get this:
[0]: string = a long time ago
[1]: string = as in immediately
[2]: string = the end
Now that's more like it! But still not right — our example text contains 5 parentheticals, but we only got 3 of them in the output. Can you see why? The problem is that our character class is missing several characters contained in the parentheticals that we missed: the newline, an apostrophe, and a question mark. If we add them to the character class:
/\(([a-zA-Z .,'?\n]{1,})\)/g
and run it again, we get this:
[0]: string = a long time ago
[1]: string = if that's the
right term to apply
[2]: string = have
you ever met such a princess?
[3]: string = as in immediately
[4]: string = the end
That's what we want! Yay!
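For anyone who wants to try this outside of a ServiceNow instance, here's a small self-contained sketch of the same function with the escaped parentheses and the widened character class. The sample string is a shortened, made-up one (not the frog story), and console.log is an assumption standing in for JSUtil.logObject.

```javascript
// Shortened, made-up sample with one parenthetical that spans a newline.
var sample = 'A frog (a purple one) sat by a lake (wasn\'t it\nyellow?) all day.';

function getParentheticals(text) {
    // \( and \) match literal parentheses; the inner ( ) is the capture group.
    var parser = /\(([a-zA-Z .,'?\n]{1,})\)/g;
    var results = [];
    var x;
    while ((x = parser.exec(text)) !== null) {
        results.push(x[1]);  // keep only the text between the parentheses
    }
    return results;
}

var found = getParentheticals(sample);
console.log(found.length); // 2
console.log(found[0]);     // a purple one
```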
But…we've actually written a pretty lame regex here. As soon as we have some text with some new character that we didn't anticipate in a parenthetical, this will not work correctly. What we really need is a way to match any character. There is a shorthand character class built into regex that matches almost anything: the dot (a period, or .), which is roughly equivalent to [^\n] (in JavaScript it also refuses to match a few other line terminators, such as \r). This little guy will match any character except a line break. So if we change our regex to:
/\((.{1,})\)/g
and run it, we get this:
[0]: string = a long time ago
[1]: string = as in immediately
[2]: string = the end
Note that it once again doesn't match the two parentheticals that cross a line boundary.
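Here's a tiny sketch that isolates the dot's behavior at a line boundary, using two of the parentheticals from the story as standalone strings. The variable names are just for illustration, and console.log is assumed in place of JSUtil.logObject.

```javascript
var oneLine = '(the end)';
var twoLines = '(have\nyou ever met such a princess?)';

// No g flag needed here, since we only look for a single match in each string.
var dotRe = /\((.{1,})\)/;

var m1 = dotRe.exec(oneLine);   // matches, and m1[1] holds the captured text
var m2 = dotRe.exec(twoLines);  // the dot won't cross the \n, so there is no match

console.log(m1[1]); // the end
console.log(m2);    // null
```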
That dot is a handy little shorthand class, and there are lots of good uses for it, but it doesn't solve today's problem (though it does fail on fewer characters than our previous fail :-)). So how can we get the regex to match any character? There's a cute little trick that has become the de facto standard (in JavaScript) for this problem: build a character class using a shorthand class and its opposite. For example, [\d\D] is a character class that says "match any character that is a digit or is not a digit", which is a strange way of saying "match anything at all". A newline is not a digit, so it matches; an exclamation point is not a digit, so it matches; and so on. By convention (and for no other reason), the shorthand character class used for this purpose is for whitespace or not whitespace, like this: [\s\S]. There's really nothing special about this particular choice. So if we rewrite the regex like this:
/\(([\s\S]{1,})\)/g
and run it, we get:
[0]: string = a long time ago), there was a purple frog living
in a yellow lake. A princess came along and thought (if that's the
right term to apply) to herself "I think I'll kiss this purple frog!" (have
you ever met such a princess?). Right away (as in immediately), the
frog morphed into a handsome prince, who ignored the princess
and started chasing the flies hovering over the lake (the end
Whoa! What just happened?
We are the victim of something called greedy quantification, which is the default behavior of a regex. What's happened here is that the regex has matched as many characters as it possibly could. When it found the first opening parenthesis, it then matched every following character up to the last closing parenthesis. This didn't happen before because we used a character class that didn't include parentheses — but with our "match anything" character class, we can even match them. It's called greedy because of this take-everything-I-possibly-can behavior, and it's a behavior that confuses many a regex writer. Understanding greediness is important.
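Greediness is easier to see on a short string. In this sketch (a made-up sample, with console.log assumed in place of JSUtil.logObject), the "match anything" class runs all the way to the end of the text and then backs up only as far as the last closing parenthesis.

```javascript
var sample = 'x (one) y (two) z';

// Same shape as our regex, minus the g flag since one match makes the point.
var greedy = /\(([\s\S]{1,})\)/;
var captured = greedy.exec(sample)[1];

console.log(captured); // one) y (two
```

The capture starts right after the first opening parenthesis and stops just before the last closing one, swallowing everything in between.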
But right now we have a case where we really don't want greediness. What we want is reluctance — we want it to match the fewest possible characters and still match the pattern. There's an app for that! Well, not an app — a character. If I add one little old question mark in the right place:
/\(([\s\S]{1,}?)\)/g
I'll get what I want:
[0]: string = a long time ago
[1]: string = if that's the
right term to apply
[2]: string = have
you ever met such a princess?
[3]: string = as in immediately
[4]: string = the end
That's a perfectly good regex for matching parentheticals — but I've got one more thing to show you today. Just as there are shorthands for character classes, there are also shorthands for commonly used quantifiers — three of them:
- ? is a shorthand for {0,1}
- + is a shorthand for {1,}
- * is a shorthand for {0,}
One of those is exactly what we're using, and if we rewrite our regex to:
/\(([\s\S]+?)\)/g
and run it, we still get the correct results.
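If you'd rather check that claim than take my word for it, here's a sketch that runs both spellings of the regex over the full sample text and compares the results. The collect helper is a hypothetical name of my own, and console.log is assumed in place of JSUtil.logObject.

```javascript
var text = 'Once upon a time (a long time ago), there was a purple frog living\n';
text += 'in a yellow lake. A princess came along and thought (if that\'s the\n';
text += 'right term to apply) to herself "I think I\'ll kiss this purple frog!" (have\n';
text += 'you ever met such a princess?). Right away (as in immediately), the\n';
text += 'frog morphed into a handsome prince, who ignored the princess\n';
text += 'and started chasing the flies hovering over the lake (the end).\n';

// Hypothetical helper: run a global regex over the input and collect capture group 1.
function collect(parser, input) {
    var results = [];
    var x;
    while ((x = parser.exec(input)) !== null) {
        results.push(x[1]);
    }
    return results;
}

var longForm = collect(/\(([\s\S]{1,}?)\)/g, text);  // spelled-out quantifier
var shortForm = collect(/\(([\s\S]+?)\)/g, text);    // shorthand quantifier

console.log(shortForm.length);                           // 5
console.log(longForm.join('|') === shortForm.join('|')); // true
```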
There's a good example in the preceding of a context-sensitive metacharacter, of which there are several in regexes. The question mark, which we used earlier to indicate reluctance, also appears in the list above as a shorthand for a quantifier. Question marks are interpreted by regexes based on what kind of "thing" they follow (that's what's meant by context-sensitive). In our regex the question mark follows a quantifier (first {1,}, then +), and in that context it indicates reluctance. On the other hand, if a question mark follows a character (or a character class, or a group), then it is itself a quantifier: the shorthand for {0,1}. The only hard part about this, really, is learning to read the regex gobbledygook like the regex itself does. Then it's easy!
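A quick sketch of the question mark's two personalities, using made-up strings that have nothing to do with frogs. The variable names are mine, and console.log is assumed in place of JSUtil.logObject.

```javascript
// After a character, ? is the quantifier {0,1}: the "u" is optional.
var optionalU = /colou?r/;
var matchesColor = optionalU.test('color');
var matchesColour = optionalU.test('colour');

// After a quantifier, ? makes that quantifier reluctant.
var greedyDigits = /\d+/.exec('12345')[0];
var reluctantDigits = /\d+?/.exec('12345')[0];

// Both at once: the first ? is the quantifier, the second makes it reluctant,
// so the regex prefers to match zero a's.
var both = /a??/.exec('a')[0];

console.log(matchesColor, matchesColour); // true true
console.log(greedyDigits);                // 12345
console.log(reluctantDigits);             // 1
console.log(both === '');                 // true
```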
Did I hurt your brain today? I hope so, 'cause that was my intent!