Regex: Modified Madness...

SlightlyLoony · 10-26-2011

Yesterday I showed you how this regular expression (regex) worked to find SSNs in text:


/(^|[^0-9])([0-9]{3}-[0-9]{2}-[0-9]{4})($|[^0-9])/m

Hopefully you didn't suffer too much brain damage in the process. Today I'm going to focus on something a bit easier: some different ways to write this regex, and some ways to make it work better. Along the way, you'll learn a few new things about regexes...

Let's start with the character class [0123456789], which means "match any of the characters (which happen to be all the digits) within the square brackets." The square brackets designate a character class, and the characters between those brackets are the members of that character class. The meaning of a character class is that the regex will match any one of its members at that position.

Yesterday I showed you a shortcut in writing a character class, when all (or some) of its members are sequential. The digits 0 through 9 are sequential, so you can write the character class that includes all of them like this: [0-9], which you can read as "0 through 9." A moment ago I said sequential, and that's actually a key concept here. Sequential, to a regex, means "in Unicode order", that is, in order of the numeric code assigned to each character. If that doesn't mean anything to you, don't worry about it; generally speaking, the only useful sequences are the digits 0-9, the lower case letters a-z, and the upper case letters A-Z.

Just for completeness, let me note that character class ranges can span parts of a sequence with no problem: [a-g] is a perfectly legitimate way to specify the lower case characters a-g. Something that might not be so obvious is that using a range where one end is a lower case letter and the other is an upper case letter might not do what you expect. In Unicode order the upper case letters come before the lower case ones, so JavaScript rejects [v-D] outright as a backwards range, while [A-z] quietly picks up a handful of punctuation characters that sit between Z and a. Best not to try anything like that unless you really do understand Unicode ordering! You can also create character classes that mix up individual characters and ranges. For example, [15$%e-hC-G] is the exact same thing as [15$%efghCDEFG]: a strange set of characters to match, but by gum it would work.
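If you'd like to poke at character class ranges yourself, here's a small sketch. The sample string and the variable names are made up purely for illustration; the only real things in it are JavaScript's regex literals, the RegExp constructor, and the match() and test() methods.

```javascript
// Hypothetical sample text, chosen only to exercise the character class.
const sample = "a1b$Ce%hZ5g";

// The class written with ranges, and the same class spelled out in full.
const withRanges = sample.match(/[15$%e-hC-G]/g).join("");
const spelledOut = sample.match(/[15$%efghCDEFG]/g).join("");

console.log(withRanges);                // 1$Ce%h5g
console.log(withRanges === spelledOut); // true

// A range from upper case to lower case also covers the punctuation
// characters that sit between 'Z' and 'a', such as the underscore.
const underscoreMatches = /[A-z]/.test("_");
console.log(underscoreMatches);         // true

// A backwards range is rejected when the regex is compiled.
let backwardsRangeError = null;
try {
  new RegExp("[v-D]");
} catch (err) {
  backwardsRangeError = err;
}
console.log(backwardsRangeError instanceof SyntaxError); // true
```

The RegExp constructor is used for the backwards range only so the error can be caught at run time; written as a regex literal, the same pattern would stop the whole script from loading.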

Certain character classes are used so frequently that there are special regex "shorthand" codes for them. The character class for digits is one of them: in JavaScript, [0123456789], [0-9], and \d are three ways of saying the same exact thing. If you're familiar with assembly language or C, this is your old friend: a macro. If you're not familiar with macros, think of \d being converted to [0123456789] before the regex is actually run. That's a good mental model for how character class shorthands work in a regex. There are several of these macros (you can read more about them here), and they follow a pattern that's worth remembering: if you change the letter to upper case, it takes on the opposite meaning. So \D (note the upper case 'D') means "any character that's not a digit," which is the same as [^0-9]. Got that?
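Here's a quick way to convince yourself that the shorthand and the long form really do agree. This is just a sketch; the sample text and variable names are invented for the example.

```javascript
// Hypothetical sample text for comparing the three spellings of "a digit".
const roomText = "Room 42, floor 7";

const digitsLong = roomText.match(/[0123456789]/g).join("");
const digitsRange = roomText.match(/[0-9]/g).join("");
const digitsShort = roomText.match(/\d/g).join("");

console.log(digitsLong, digitsRange, digitsShort); // 427 427 427

// The upper case shorthand is the negated class: strip every non-digit.
const strippedShort = roomText.replace(/\D/g, "");
const strippedLong = roomText.replace(/[^0-9]/g, "");

console.log(strippedShort, strippedLong); // 427 427
```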

Using the shorthand character classes, we can now write yesterday's regex like this:


/(^|\D)(\d{3}-\d{2}-\d{4})($|\D)/m

JavaScript doesn't care which way you write it, and there's no difference in performance or behavior. You should feel perfectly free to write your regexes either with or without shorthand character classes; it's really just a matter of taste or preference. Similarly, if you have a quantifier with a fixed count (like {3}), you can choose to write your regex with or without the quantifier. The regex above could be written like this:


/(^|\D)(\d\d\d-\d\d-\d\d\d\d)($|\D)/m

It will behave exactly the same way, no difference at all. Again, it's a matter of taste or preference. If you think one is clearer, or prettier, or smells better — then go for it!
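If you want to see that for yourself, a sketch like this runs all three spellings against the same text and compares what .exec() hands back. The sample sentence and the variable names are made up for the example.

```javascript
// Hypothetical sample text containing one SSN-shaped string.
const ssnText = "My SSN is (123-45-6789), honest.";

// The same regex, written three ways.
const variants = [
  /(^|[^0-9])([0-9]{3}-[0-9]{2}-[0-9]{4})($|[^0-9])/m,
  /(^|\D)(\d{3}-\d{2}-\d{4})($|\D)/m,
  /(^|\D)(\d\d\d-\d\d-\d\d\d\d)($|\D)/m,
];

// Copy each result into a plain array so the results are easy to compare.
const variantResults = variants.map((re) => Array.from(re.exec(ssnText)));

for (const result of variantResults) {
  console.log(JSON.stringify(result));
}
// Each line prints: ["(123-45-6789)","(","123-45-6789",")"]
```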

Now for one last little thing today. Yesterday I showed you how the .exec() method would return this array of results:


0: '(123-45-6789)'  // the entire matched text
1: '('              // capture group 1
2: '123-45-6789'    // capture group 2
3: ')'              // capture group 3

I also mentioned that the first and third capture groups had parentheses not because we really wanted (or needed!) to capture them, but because we needed to define the limits of the or character's action. Well, there's a way to make the or character work correctly, but without actually capturing the text. It's called a non-capturing group. No, that's not a failed military unit. It's a group that has all the elements of groupiness except that it doesn't capture any text, and won't result in an array entry in the results. You can write the first one from yesterday like this: (?:^|[^0-9]). I've added a ?: just after the opening parenthesis, and that tells the regex that this is a non-capturing group. It still works exactly as before, but it doesn't capture any matching text. So if we rewrote our regex from yesterday like this:


/(?:^|\D)(\d{3}-\d{2}-\d{4})(?:$|\D)/m

and then ran it, we'd get this result:


0: '(123-45-6789)'
1: '123-45-6789'

We've eliminated those two unnecessary capture groups. In our example it really doesn't hurt anything to have those unnecessary capture groups — it's just not as tidy as you might like. There are cases, however, where you really don't want the matched text to be captured, and that little piece of regex syntax comes in mighty handy for them.
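To see the difference side by side, here's a small sketch that runs the capturing and non-capturing versions against the same text. The sample sentence and the variable names are invented for the example.

```javascript
// Hypothetical sample text containing one SSN-shaped string.
const noteText = "My SSN is (123-45-6789), honest.";

// All three groups capture.
const capturing = /(^|\D)(\d{3}-\d{2}-\d{4})($|\D)/m.exec(noteText);

// Only the middle group captures; the outer two just scope the "or".
const nonCapturing = /(?:^|\D)(\d{3}-\d{2}-\d{4})(?:$|\D)/m.exec(noteText);

console.log(capturing.length);    // 4
console.log(nonCapturing.length); // 2
console.log(nonCapturing[0]);     // (123-45-6789)
console.log(nonCapturing[1]);     // 123-45-6789
```

Notice that the entire matched text in entry 0 still includes the surrounding parentheses from the sample; the non-capturing groups still take part in the match, they just don't get array entries of their own.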

Now look at you: you're actually understanding that line of unintelligible gobbledygook up there! Woot woot! How many people in your town could tell you that's a regex for finding SSNs? This is better than the cross-your-heart-and-hope-to-die double-secret handshakes you had back in elementary school. You know stuff now!

I think that's just about enough for one day. Are you feeling even a little more competent with regexes than you did before? If so, well, then enjoy it … 'cause I'm gonna take care of that but good tomorrow…
