Regex: Stop the Madness!

SlightlyLoony · 10-25-2011

Yesterday I showed you a regular expression (regex) that would find an SSN in text:


/(^|[^0-9])([0-9]{3}-[0-9]{2}-[0-9]{4})($|[^0-9])/m

I didn't tell you anything at all about how it worked, though. Let's take a whack at understanding it:

This is a JavaScript regex literal. Much like you can quote a string to make a string literal, you can surround a regex with slashes ("/") to make a regex literal. Unlike a string literal, though, a regex literal can be followed by so-called flag characters. In the case of our example, we've got an m — that's the multiline flag, which I'll explain later in this post.
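If you'd like to poke at the literal-plus-flag idea yourself, here's a minimal sketch. The variable names and the tiny ^abc$ pattern are made up for illustration; .source and .multiline are standard properties of JavaScript regex objects.

```javascript
// A regex literal: the pattern sits between the slashes, flags follow the closing slash.
var withFlag = /^abc$/m;
// The same pattern with no flags at all.
var withoutFlag = /^abc$/;

// .source holds the pattern text; .multiline reports whether the m flag was given.
console.log(withFlag.source);       // ^abc$
console.log(withFlag.multiline);    // true
console.log(withoutFlag.multiline); // false

// The RegExp constructor builds an equivalent object from plain strings.
var built = new RegExp("^abc$", "m");
console.log(built.source === withFlag.source); // true
```

The constructor form is handy when the pattern has to be assembled at run time; the literal form is easier to read when the pattern is fixed.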

The stuff between the slashes is our actual regex. Here it is, piece by piece:

[0-9] This is an example of a character class, which you can recognize by the enclosing square brackets. Character classes define what characters should match at a particular place. This particular character class says "match any one character between 0 and 9, inclusive." In other words, it says "match any digit." The dash ("-") indicates a range. We could have specified this character class as [0123456789], which means exactly the same thing (but is more work to write out!).
[^0-9] This looks very much like the preceding example, except for that little caret character. The caret changes everything: placed right after the opening bracket, it means "match any character except the ones listed." In this case, it means "match anything that's not a digit."
{3} This is called a quantifier, specifically, an interval quantifier. Quantifiers define how many times the preceding item (here, a character class) must match. An interval quantifier (recognizable by the surrounding curly braces) specifies either an exact count ({n}) or an inclusive range of allowable counts ({n,m}, written with no spaces). If the last number is missing ({n,}), it means n or more. Some regex flavors also accept a missing first number ({,m}) to mean any number from 0 to m, but JavaScript does not treat that as a quantifier; there, write {0,m} instead. In our example's case, it means match 3 characters. The absence of any quantifier (interval or otherwise) means "match one character."
[0-9]{3} This puts a couple of the preceding points together. It means "match three digits." Really! That's all it means.
([0-9]{3}-[0-9]{2}-[0-9]{4}) This is a big part of the entire regex, but you should be able to read it now. It means "match three digits, dash, two digits, dash, four digits." Those dashes that it's matching are outside a character class, and in that context the dash has no special meaning, and is simply directly matched. That description sounds like the entire problem, doesn't it? We've described the pattern of an SSN. But if you think about it, the SSN must have something other than a digit before it — otherwise we might think that a sentence containing '0123-45-6789' contained an SSN, when it doesn't. Similarly, it has to have a non-digit following it as well. There's one other thing here that I've just sort of ignored: the parentheses surrounding it. In a regex, those parentheses designate a capture group. That's not a military unit, but rather a directive that says "capture any text that matches between these parentheses." You can have any number of capture groups in a regex. I'll show you how you use the captured text a little later in this post. Our example has three capture groups — see 'em?
(^|[^0-9]) This little beauty means "match the beginning of a line or a non-digit (one non-digit, because there is no quantifier saying otherwise)." The first caret character here is outside of a character class (it's not inside square brackets), and has a different meaning in this context — it means "beginning of line". The vertical bar ("|") means match what's on the left side of it or what's on the right. The whole thing is in a capture group not because we actually want to capture it, but rather to define the limits of the or. It still will be captured, but that's only a side-effect in this case.
($|[^0-9]) This is much like the preceding one — the only new bit is the dollar sign ("$"), which means "match the end of line." Taken all together, this bit means "match the end of a line or a non-digit."
(^|[^0-9])([0-9]{3}-[0-9]{2}-[0-9]{4})($|[^0-9]) Now you're ready for the whole thing. This means "capture and match either the beginning of a line or a non-digit, then capture and match three digits, then a dash, then two digits, then a dash, then four digits, then finally capture and match either the end of a line or a non-digit." When you "run" your regex (see the .exec() method below), the text you're searching through is scanned from the first character through the last, looking for anything that matches the pattern of text that your regex has specified. Piece of cake!
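To see the pieces above in action, here's a small sketch you can paste into any JavaScript console. The variable names and the sample strings are invented for this illustration; .test() is the standard regex method that simply answers true or false.

```javascript
// The building blocks, each as its own little regex.
var digit = /[0-9]/;             // any one digit
var nonDigit = /[^0-9]/;         // any one character that is not a digit
var threeDigits = /[0-9]{3}/;    // three digits in a row
var twoOrThree = /^[0-9]{2,3}$/; // a whole string of two or three digits

console.log(digit.test("a7b"));          // true
console.log(nonDigit.test("12345"));     // false
console.log(threeDigits.test("ab12"));   // false
console.log(threeDigits.test("ab123"));  // true
console.log(twoOrThree.test("1"));       // false
console.log(twoOrThree.test("12"));      // true
console.log(twoOrThree.test("1234"));    // false

// The whole SSN pattern from the post.
var ssn = /(^|[^0-9])([0-9]{3}-[0-9]{2}-[0-9]{4})($|[^0-9])/m;

console.log(ssn.test("SSN: 123-45-6789.")); // true
console.log(ssn.test("123-45-6789"));       // true
console.log(ssn.test("0123-45-6789"));      // false, a digit precedes the pattern
console.log(ssn.test("123-45-67890"));      // false, a digit follows the pattern
```

The last two lines are the reason for the first and third groups: without them, both of those strings would be reported as containing an SSN.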

The capture groups become useful when we start trying to use the text we've matched. This little snippet of code from yesterday's example makes use of them:


var parser = /(^|[^0-9])([0-9]{3}-[0-9]{2}-[0-9]{4})($|[^0-9])/m;
var ans = parser.exec(text);
return (ans == null) ? null : ans[2];

The first line just creates a regex object (contained in parser) from the regex literal, making it ready for use. The second line uses it (via the .exec() method) to search the contents of the variable text. The .exec() method returns null if it didn't match anything, but if it did match, then it returns an array of useful information. The first ([0]) entry of the array contains the entire matched text. In yesterday's example, the text we were searching was:


'I have written my social security number (123-45-6789) in here, like this.'

So the first entry of the array would contain:


0: '(123-45-6789)'

That's our SSN, plus the preceding and succeeding non-digits.
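Here's a runnable sketch of what .exec() hands back, using the same regex and sample text. The second, digit-free sentence is made up for the example; the index property is a standard part of the array that .exec() returns.

```javascript
var parser = /(^|[^0-9])([0-9]{3}-[0-9]{2}-[0-9]{4})($|[^0-9])/m;
var text = 'I have written my social security number (123-45-6789) in here, like this.';

// A successful match returns an array; entry 0 is the entire matched text.
var ans = parser.exec(text);
console.log(ans[0]); // (123-45-6789)

// The array also records where the match started in the searched text.
console.log(ans.index === text.indexOf("(")); // true

// No match at all returns null, which is why the code checks for it first.
var miss = parser.exec("No numbers in this sentence.");
console.log(miss); // null
```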

The other entries in the array are for each capture group. You find a capture group's number by counting the left parentheses from left to right (counting the opening parentheses matters because capture groups can be nested). Our example has three capture groups, and their results will show up in entries 1, 2, and 3 as follows:


0: '(123-45-6789)'
1: '('
2: '123-45-6789'
3: ')'

The [2] entry has our SSN, and that code above uses ans[2] to get it.
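If you want to check the group numbering for yourself, this sketch prints every entry of the result array. The nested pattern /((a)(b))c/ and the bare-SSN string are my own additions, there only to show the left-parenthesis counting rule and what a group holds when it matches the start or end of a line.

```javascript
var parser = /(^|[^0-9])([0-9]{3}-[0-9]{2}-[0-9]{4})($|[^0-9])/m;
var ans = parser.exec('I have written my social security number (123-45-6789) in here, like this.');

// Entry 0 is the whole match; entries 1 through 3 are the capture groups.
for (var i = 0; i < ans.length; i++) {
    console.log(i + ": '" + ans[i] + "'");
}

// Nested groups are numbered by their opening parenthesis, left to right.
var nested = /((a)(b))c/.exec("abc");
console.log(nested[1]); // ab  (the outer group opens first)
console.log(nested[2]); // a
console.log(nested[3]); // b

// When the SSN is the whole string, groups 1 and 3 match ^ and $,
// which consume no characters, so they capture empty strings.
var bare = parser.exec("123-45-6789");
console.log(bare[1] === "" && bare[3] === ""); // true
console.log(bare[2]);                          // 123-45-6789
```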

Now that wasn't so hard, was it?

Way back in the beginning of this post, I promised I'd explain the multiline flag. It controls the way the beginning of line ("^") and end of line ("$") metacharacters work. Without the multiline flag, they match only at the beginning and end of the entire string. With the multiline flag, they also match at the beginning and end of each line within the text. Our example would actually work just fine either way, because a line break is itself a non-digit and so satisfies [^0-9], but I included the flag to introduce you to the notion of a flag character. We'll run into some more of these fine beasties in later posts.
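A quick sketch of the difference, if you'd like to try it. The two-line and three-line sample strings and the helper variable names are invented for this example.

```javascript
var twoLines = "first\nsecond";

// Without m, ^ and $ only match at the very start and end of the string.
console.log(/^second$/.test(twoLines));  // false
// With m, they also match at the start and end of each line.
console.log(/^second$/m.test(twoLines)); // true

// The SSN regex finds the number on its own line with or without the flag,
// since the surrounding line breaks count as non-digits.
var threeLines = "line one\n123-45-6789\nline three";
var withM = /(^|[^0-9])([0-9]{3}-[0-9]{2}-[0-9]{4})($|[^0-9])/m.exec(threeLines);
var withoutM = /(^|[^0-9])([0-9]{3}-[0-9]{2}-[0-9]{4})($|[^0-9])/.exec(threeLines);
console.log(withM[2]);    // 123-45-6789
console.log(withoutM[2]); // 123-45-6789
```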

Ok, now go wrap your head in cold, wet towels. This will help with the brain overheating...
