Regular expressions in JavaScript – Part 1

1. Creating a Regular Expression

A regular expression is a type of object. It can be either constructed with the RegExp constructor or written as a literal value by enclosing a pattern in forward slash (/) characters.

let re1 = new RegExp("abc");

let re2 = /abc/;

Both of those regular expression objects represent the same pattern: an a character followed by a b followed by a c.

When using the RegExp constructor, the pattern is written as a normal string, so the usual rules apply for backslashes.

The second notation, where the pattern appears between slash characters, treats backslashes somewhat differently. First, since a forward slash ends the pattern, we need to put a backslash before any forward slash that we want to be part of the pattern. In addition, backslashes that aren’t part of special character codes (like \n) will be preserved, rather than ignored as they are in strings, and change the meaning of the pattern. Some characters, such as question marks and plus signs, have special meanings in regular expressions and must be preceded by a backslash if they are meant to represent the character itself.

let eighteenPlus = /eighteen\+/;
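The backslash rules above can be sketched as follows. This is a minimal illustration, and the binding names are made up for the example: a pattern given as a string needs its backslashes doubled, while a literal keeps them as written, and a forward slash needs escaping only in the literal form.

```javascript
// The same pattern (one or more digits) written both ways.
// In a string, "\\d" is needed to produce the two characters \d.
const fromString = new RegExp("\\d+");
const fromLiteral = /\d+/;
console.log(fromString.source === fromLiteral.source);

// A forward slash must be escaped inside a literal, but not in a string.
const slashLiteral = /a\/b/;
const slashString = new RegExp("a/b");
console.log(slashLiteral.test("a/b"), slashString.test("a/b"));

// Forgetting to double the backslash silently changes the pattern:
// in a string, "\d" is just the letter d.
const mistaken = new RegExp("\d+");
console.log(mistaken.test("123"), mistaken.test("ddd"));
```

The last case is the easy one to trip over: no error is raised, the pattern simply means something else.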

2. Testing for Matches

Regular expression objects have a number of methods. The simplest one is test. If you pass it a string, it will return a Boolean telling you whether the string contains a match of the pattern in the expression.

console.log(/abc/.test("abcde"));

// → true

console.log(/abc/.test("abxde"));

// → false

A regular expression consisting of only nonspecial characters simply represents that sequence of characters. If abc occurs anywhere in the string we are testing against (not just at the start), test will return true.

3. Sets of Characters

Finding out whether a string contains abc could just as well be done with a call to indexOf. Regular expressions allow us to express more complicated patterns.

Say we want to match any number. In a regular expression, putting a set of characters between square brackets makes that part of the expression match any of the characters between the brackets.

Both of the following expressions match all strings that contain a digit:

console.log(/[0123456789]/.test("in 1992"));

// → true

console.log(/[0-9]/.test("in 1992"));

// → true

Within square brackets, a hyphen (-) between two characters can be used to indicate a range of characters, where the ordering is determined by the character’s Unicode number. Characters 0 to 9 sit right next to each other in this ordering (codes 48 to 57), so [0-9] covers all of them and matches any digit.

A number of common character groups have their own built-in shortcuts. Digits are one of them: \d means the same thing as [0-9].

\d Any digit character

\w An alphanumeric character (“word character”)

\s Any whitespace character (space, tab, newline, and similar)

\D A character that is not a digit

\W A nonalphanumeric character

\S A nonwhitespace character

. Any character except for newline

So you could match a date and time format like 01-30-2003 15:20 with the following expression:

let dateTime = /\d\d-\d\d-\d\d\d\d \d\d:\d\d/;

console.log(dateTime.test("01-30-2003 15:20"));

// → true

console.log(dateTime.test("30-jan-2003 15:20"));

// → false

That looks completely awful, doesn’t it? Half of it is backslashes, producing a background noise that makes it hard to spot the actual pattern expressed. We’ll see a slightly improved version of this expression in the next section.

These backslash codes can also be used inside square brackets. For example, [\d.] means any digit or a period character. But the period itself, between square brackets, loses its special meaning. The same goes for other special characters, such as +.
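A small sketch of how special characters behave inside and outside square brackets. The binding name digitOrPeriod is chosen for this illustration only.

```javascript
// Inside brackets, \d keeps its meaning but the period is an ordinary character.
const digitOrPeriod = /[\d.]/;
console.log(digitOrPeriod.test("3")); // a digit
console.log(digitOrPeriod.test(".")); // a literal period
console.log(digitOrPeriod.test("x")); // neither, so no match

// Outside brackets, the period matches any non-newline character.
console.log(/./.test("x"));

// A plus sign inside brackets is also taken literally.
console.log(/[+]/.test("1+1"), /[+]/.test("11"));
```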

To invert a set of characters—that is, to express that you want to match any character except the ones in the set—you can write a caret (^) character after the opening square bracket.

let notBinary = /[^01]/;

console.log(notBinary.test("1100100010100110"));

// → false

console.log(notBinary.test("1100100010200110"));

// → true

4. Repeating Parts of a Pattern

We now know how to match a single digit. What if we want to match a whole number—a sequence of one or more digits?

When you put a plus sign (+) after something in a regular expression, it indicates that the element may be repeated more than once. Thus, /\d+/ matches one or more digit characters.

console.log(/'\d+'/.test("'123'"));

// → true

console.log(/'\d+'/.test("''"));

// → false

console.log(/'\d*'/.test("'123'"));

// → true

console.log(/'\d*'/.test("''"));

// → true

The star (*) has a similar meaning but also allows the pattern to match zero times. Something with a star after it never prevents a pattern from matching—it’ll just match zero instances if it can’t find any suitable text to match.

A question mark makes a part of a pattern optional, meaning it may occur zero times or one time. In the following example, the u character is allowed to occur, but the pattern also matches when it is missing.

let neighbor = /neighbou?r/;

console.log(neighbor.test("neighbour"));

// → true

console.log(neighbor.test("neighbor"));

// → true

To indicate that a pattern should occur a precise number of times, use braces. Putting {4} after an element, for example, requires it to occur exactly four times. It is also possible to specify a range this way: {2,4} means the element must occur at least twice and at most four times.

Here is another version of the date and time pattern that allows both single- and double-digit days, months, and hours. It is also slightly easier to decipher.

let dateTime = /\d{1,2}-\d{1,2}-\d{4} \d{1,2}:\d{2}/;

console.log(dateTime.test("1-30-2003 8:45"));

// → true

You can also specify open-ended ranges when using braces by omitting the number after the comma. So, {5,} means five or more times.
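The three brace forms can be compared side by side. This is only a sketch: the binding names are invented for the example, and the ^ and $ markers (covered in section 8) are used here so that the whole string has to match.

```javascript
const exactlyFour = /^\d{4}$/;  // exactly four digits
const twoToFour = /^\d{2,4}$/;  // two, three, or four digits
const fiveOrMore = /^\d{5,}$/;  // five digits or more

for (const input of ["1", "12", "1234", "12345", "1234567"]) {
  console.log(input,
              exactlyFour.test(input),
              twoToFour.test(input),
              fiveOrMore.test(input));
}
```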

5. Grouping Subexpressions

To use an operator like * or + on more than one element at a time, you have to use parentheses. A part of a regular expression that is enclosed in parentheses counts as a single element as far as the operators following it are concerned.

let cartoonCrying = /boo+(hoo+)+/i;

console.log(cartoonCrying.test("Boohoooohoohooo"));

// → true

The first and second + characters apply only to the second o in boo and hoo, respectively. The third + applies to the whole group (hoo+), matching one or more sequences like that.

The i at the end of the expression in the example makes this regular expression case insensitive, allowing it to match the uppercase B in the input string, even though the pattern is itself all lowercase.
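To see what the parentheses change, compare a pattern with and without a group. The names below are illustrative, and ^ and $ (explained in section 8) are used so that the whole string must match.

```javascript
// Without parentheses, + applies only to the character right before it.
const noGroup = /^ha+$/;
// With parentheses, + applies to the whole group "ha".
const withGroup = /^(ha)+$/;

console.log(noGroup.test("haaa"), noGroup.test("hahaha"));
console.log(withGroup.test("haaa"), withGroup.test("hahaha"));

// The i flag makes the same grouped pattern case insensitive.
const anyCase = /^(ha)+$/i;
console.log(withGroup.test("HaHa"), anyCase.test("HaHa"));
```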

6. Matches and Groups

The test method is the absolute simplest way to match a regular expression. It tells you only whether it matched and nothing else. Regular expressions also have an exec (execute) method that will return null if no match was found and return an object with information about the match otherwise.

let match = /\d+/.exec("one two 100");

console.log(match);

// → ["100"]

console.log(match.index);

// → 8

An object returned from exec has an index property that tells us where in the string the successful match begins. Other than that, the object looks like (and in fact is) an array of strings, whose first element is the string that was matched. In the previous example, this is the sequence of digits that we were looking for.

String values have a match method that behaves similarly.

console.log("one two 100".match(/\d+/));

// → ["100"]
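Because exec returns null when nothing matches, code that reads properties from the result usually checks for that case first. A minimal sketch, using a hypothetical helper named firstNumber:

```javascript
// Hypothetical helper: returns the first run of digits in text together
// with its position, or null when the text contains no digits.
function firstNumber(text) {
  const match = /\d+/.exec(text);
  if (match === null) return null;
  return {value: match[0], index: match.index};
}

console.log(firstNumber("one two 100"));
console.log(firstNumber("no digits here"));
```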

When the regular expression contains subexpressions grouped with parentheses, the text that matched those groups will also show up in the array. The whole match is always the first element. The next element is the part matched by the first group (the one whose opening parenthesis comes first in the expression), then the second group, and so on.

let quotedText = /'([^']*)'/;

console.log(quotedText.exec("she said 'hello'"));

// → ["'hello'", "hello"]

When a group does not end up being matched at all (for example, when followed by a question mark), its position in the output array will hold undefined. Similarly, when a group is matched multiple times, only the last match ends up in the array.

console.log(/bad(ly)?/.exec(“bad”));

// → ["bad", undefined]

console.log(/(\d)+/.exec("123"));

// → ["123", "3"]

Groups can be useful for extracting parts of a string. If we don’t just want to verify whether a string contains a date but also extract it and construct an object that represents it, we can wrap parentheses around the digit patterns and directly pick the date out of the result of exec.

But first we’ll take a brief detour, in which we discuss the built-in way to represent date and time values in JavaScript.

7. The Date Class

JavaScript has a standard class for representing dates—or, rather, points in time. It is called Date. If you simply create a date object using new, you get the current date and time.

console.log(new Date());

// → Sat Sep 01 2018 15:24:32 GMT+0200 (CEST)

You can also create an object for a specific time.

console.log(new Date(2009, 11, 9));

// → Wed Dec 09 2009 00:00:00 GMT+0100 (CET)

console.log(new Date(2009, 11, 9, 12, 59, 59, 999));

// → Wed Dec 09 2009 12:59:59 GMT+0100 (CET)

JavaScript uses a convention where month numbers start at zero (so December is 11), yet day numbers start at one. This is confusing and silly. Be careful.

The last four arguments (hours, minutes, seconds, and milliseconds) are optional and taken to be zero when not given.

Timestamps are stored as the number of milliseconds since the start of 1970, in the UTC time zone. This follows a convention set by “Unix time,” which was invented around that time. You can use negative numbers for times before 1970. The getTime method on a date object returns this number. It is big, as you can imagine.

console.log(new Date(2013, 11, 19).getTime());

// → 1387407600000

console.log(new Date(1387407600000));

// → Thu Dec 19 2013 00:00:00 GMT+0100 (CET)

If you give the Date constructor a single argument, that argument is treated as such a millisecond count. You can get the current millisecond count by creating a new Date object and calling getTime on it or by calling the Date.now function.

Date objects provide methods such as getFullYear, getMonth, getDate, getHours, getMinutes, and getSeconds to extract their components. Besides getFullYear there’s also getYear, which gives you the year minus 1900 (98 or 119) and is mostly useless.
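A short sketch of the component getters and of the millisecond round trip described above. The binding names are illustrative; the getters report local time, so they return the same values that were passed to the constructor.

```javascript
const moment = new Date(2009, 11, 9, 12, 59, 59);

console.log(moment.getFullYear()); // 2009
console.log(moment.getMonth());    // 11, which means December
console.log(moment.getDate());     // 9 (day of the month)
console.log(moment.getHours(), moment.getMinutes(), moment.getSeconds());

// A single numeric argument is treated as a millisecond count, so a date
// can be rebuilt from the result of getTime.
const copy = new Date(moment.getTime());
console.log(copy.getTime() === moment.getTime());

// Date.now gives the current millisecond count as a plain number.
console.log(typeof Date.now());
```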

Putting parentheses around the parts of the expression that we are interested in, we can now create a date object from a string.

function getDate(string) {

  let [_, month, day, year] = /(\d{1,2})-(\d{1,2})-(\d{4})/.exec(string);

  return new Date(year, month - 1, day);

}

console.log(getDate("1-30-2003"));

// → Thu Jan 30 2003 00:00:00 GMT+0100 (CET)

The _ (underscore) binding is ignored and used only to skip the full match element in the array returned by exec.
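One thing to keep in mind is that exec returns null when the string holds no date, in which case the destructuring in getDate throws a TypeError. A possible variant that reports the failure instead is sketched below; the name getDateOrNull is made up for this example and is not part of the original code.

```javascript
// Hypothetical variant of getDate that returns null instead of throwing
// when the string contains nothing that looks like a date.
function getDateOrNull(string) {
  const match = /(\d{1,2})-(\d{1,2})-(\d{4})/.exec(string);
  if (match === null) return null;
  // Skip the full match (first element) and keep the three groups.
  const [, month, day, year] = match;
  return new Date(Number(year), Number(month) - 1, Number(day));
}

console.log(getDateOrNull("1-30-2003"));
console.log(getDateOrNull("no date in here"));
```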

8. Word and String Boundaries

Unfortunately, getDate will also happily extract the nonsensical date 00-1-3000 from the string “100-1-30000”. A match may happen anywhere in the string, so in this case, it’ll just start at the second character and end at the second-to-last character.

If we want to enforce that the match must span the whole string, we can add the markers ^ and $. The caret matches the start of the input string, whereas the dollar sign matches the end. So, /^\d+$/ matches a string consisting entirely of one or more digits, /^!/ matches any string that starts with an exclamation mark, and /x^/ does not match any string (there cannot be an x before the start of the string).
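The three patterns mentioned in this paragraph can be tried directly. The binding names are illustrative.

```javascript
// Only strings made up entirely of digits match.
const allDigits = /^\d+$/;
console.log(allDigits.test("12345"), allDigits.test("123a45"));

// The exclamation mark has to be the very first character.
const startsWithBang = /^!/;
console.log(startsWithBang.test("!important"), startsWithBang.test("not!"));

// Nothing can come before the start of the string, so this never matches.
const impossible = /x^/;
console.log(impossible.test("x"), impossible.test("xx^"));
```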

If, on the other hand, we just want to make sure the date starts and ends on a word boundary, we can use the marker \b. A word boundary can be the start or end of the string or any point in the string that has a word character (as in \w) on one side and a nonword character on the other.

console.log(/cat/.test(“concatenate”));

// → true

console.log(/\bcat\b/.test(“concatenate”));

// → false

Note that a boundary marker doesn’t match an actual character. It just enforces that the regular expression matches only when a certain condition holds at the place where it appears in the pattern.
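Applied to the date pattern, word boundaries rule out the nonsensical match from the string "100-1-30000" while still allowing a date inside a longer sentence. A sketch, with illustrative binding names:

```javascript
const loose = /\d{1,2}-\d{1,2}-\d{4}/;
const bounded = /\b\d{1,2}-\d{1,2}-\d{4}\b/;

// Without boundaries, the match starts and ends in the middle of numbers.
console.log(loose.exec("100-1-30000")[0]); // "00-1-3000"

// With boundaries, the digits on either side prevent any match.
console.log(bounded.test("100-1-30000"));

// A date surrounded by spaces or punctuation is still found.
console.log(bounded.test("due on 1-30-2003."));
```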

9. Choice Patterns

Say we want to know whether a piece of text contains not only a number but a number followed by one of the words pig, cow, or chicken, or any of their plural forms.

We could write three regular expressions and test them in turn, but there is a nicer way. The pipe character (|) denotes a choice between the pattern to its left and the pattern to its right. So I can say this:

let animalCount = /\b\d+ (pig|cow|chicken)s?\b/;

console.log(animalCount.test("15 pigs"));

// → true

console.log(animalCount.test("15 pigchickens"));

// → false

Parentheses can be used to limit the part of the pattern that the pipe operator applies to, and you can put multiple such operators next to each other to express a choice between more than two alternatives.
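The effect of limiting the pipe with parentheses shows up clearly once anchors are involved. In this sketch (binding names are illustrative), the unparenthesized pipe splits the entire pattern, anchors included.

```javascript
// Read as: (starts with "cat") OR (ends with "dog").
const unscoped = /^cat|dog$/;
// Read as: the whole string is "cat" or "dog".
const scoped = /^(cat|dog)$/;

console.log(unscoped.test("catalog"), scoped.test("catalog"));
console.log(unscoped.test("hotdog"), scoped.test("hotdog"));
console.log(unscoped.test("dog"), scoped.test("dog"));
```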

10. The Mechanics of Matching

Conceptually, when you use exec or test, the regular expression engine looks for a match in your string by trying to match the expression first from the start of the string, then from the second character, and so on, until it finds a match or reaches the end of the string. It’ll either return the first match that can be found or fail to find any match at all.

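To do the actual matching, the engine treats a regular expression something like a flow diagram. In the diagram for the livestock expression in the previous example (not reproduced here), the boxes from left to right are a word boundary, a digit that can loop back on itself, a single space, a three-way branch for pig, cow, or chicken, an optional s, and a final word boundary.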

Our expression matches if we can find a path from the left side of the diagram to the right side. We keep a current position in the string, and every time we move through a box, we verify that the part of the string after our current position matches that box.

So if we try to match “the 3 pigs” from position 4, our progress through the flow chart would look like this:

  • At position 4, there is a word boundary, so we can move past the first box.
  • Still at position 4, we find a digit, so we can also move past the second box.
  • At position 5, one path loops back to before the second (digit) box, while the other moves forward through the box that holds a single space character. There is a space here, not a digit, so we must take the second path.
  • We are now at position 6 (the start of pigs) and at the three-way branch in the diagram. We don’t see cow or chicken here, but we do see pig, so we take that branch.
  • At position 9, after the three-way branch, one path skips the s box and goes straight to the final word boundary, while the other path matches an s. There is an s character here, not a word boundary, so we go through the s box.
  • We’re at position 10 (the end of the string) and can match only a word boundary. The end of a string counts as a word boundary, so we go through the last box and have successfully matched this string.
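The positions in the walk-through above can be checked against what exec reports for the same string:

```javascript
const animalCount = /\b\d+ (pig|cow|chicken)s?\b/;
const match = animalCount.exec("the 3 pigs");

console.log(match.index);                   // 4, where the match starts
console.log(match[0]);                      // "3 pigs", the whole match
console.log(match[1]);                      // "pig", the branch that was taken
console.log(match.index + match[0].length); // 10, the end of the string
```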

11. Backtracking

The regular expression /\b([01]+b|[\da-f]+h|\d+)\b/ matches either a binary number followed by a b, a hexadecimal number (that is, base 16, with the letters a to f standing for the digits 10 to 15) followed by an h, or a regular decimal number with no suffix character. In the corresponding diagram (not reproduced here), the three alternatives form parallel branches between the two word boundary boxes, with the binary branch on top.

When matching this expression, it will often happen that the top (binary) branch is entered even though the input does not actually contain a binary number. When matching the string “103”, for example, it becomes clear only at the 3 that we are in the wrong branch. The string does match the expression, just not the branch we are currently in.

So the matcher backtracks. When entering a branch, it remembers its current position (in this case, at the start of the string, just past the first boundary box in the diagram) so that it can go back and try another branch if the current one does not work out. For the string “103”, after encountering the 3 character, it will start trying the branch for hexadecimal numbers, which fails again because there is no h after the number. So it tries the decimal number branch. This one fits, and a match is reported after all.

The matcher stops as soon as it finds a full match. This means that if multiple branches could potentially match a string, only the first one (ordered by where the branches appear in the regular expression) is used.

Backtracking also happens for repetition operators like + and *. If you match /^.*x/ against “abcxe”, the .* part will first try to consume the whole string. The engine will then realize that it needs an x to match the pattern. Since there is no x past the end of the string, the star operator tries to match one character less. But the matcher doesn’t find an x after abcx either, so it backtracks again, matching the star operator to just abc. Now it finds an x where it needs it and reports a successful match from positions 0 to 4.
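The results of this backtracking can be observed from the outside, even though the intermediate steps cannot. A small sketch (binding names are illustrative):

```javascript
const number = /\b([01]+b|[\da-f]+h|\d+)\b/;

// The binary and hexadecimal branches are tried and abandoned first.
console.log(number.exec("103")[0]);  // "103", via the decimal branch
console.log(number.exec("101b")[0]); // "101b", via the binary branch
console.log(number.exec("ffh")[0]);  // "ffh", via the hexadecimal branch

// When several alternatives could match, the first one listed wins.
console.log(/a|ab/.exec("ab")[0]);   // "a"
console.log(/ab|a/.exec("ab")[0]);   // "ab"

// The star first takes the whole string, then gives characters back
// until an x can follow.
const starMatch = /^.*x/.exec("abcxe");
console.log(starMatch[0]);           // "abcx"
```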

It is possible to write regular expressions that will do a lot of backtracking. This problem occurs when a pattern can match a piece of input in many different ways. For example, if we get confused while writing a binary-number regular expression, we might accidentally write something like /([01]+)+b/.

If that tries to match some long series of zeros and ones with no trailing b character, the matcher first goes through the inner loop until it runs out of digits. Then it notices there is no b, so it backtracks one position, goes through the outer loop once, and gives up again, trying to backtrack out of the inner loop once more. It will continue to try every possible route through these two loops. This means the amount of work doubles with each additional character. For even just a few dozen characters, the resulting match will take practically forever.
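The nested loop is not needed here: a single [01]+ accepts exactly the same strings. The sketch below compares the two patterns on a deliberately short input; timeTest is a hypothetical helper written for this illustration, and the measured times depend on the machine and JavaScript engine, so none are shown.

```javascript
const risky = /([01]+)+b/; // nested repetition, many ways to split the digits
const safe = /[01]+b/;     // same strings accepted, only one way to match

// Both patterns agree on whether a string matches.
for (const input of ["1011b", "b", "10x"]) {
  console.log(input, risky.test(input), safe.test(input));
}

// Hypothetical helper: milliseconds spent on a single test call.
function timeTest(regexp, input) {
  const start = Date.now();
  regexp.test(input);
  return Date.now() - start;
}

// 18 digits and no trailing b. Kept short on purpose: each extra digit
// roughly doubles the work done by the risky pattern.
const digits = "01".repeat(9);
console.log("risky:", timeTest(risky, digits), "ms");
console.log("safe:", timeTest(safe, digits), "ms");
```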

12. The replace Method

String values have a replace method that can be used to replace part of the string with another string.

console.log("papa".replace("p", "m"));

// → mapa

The first argument can also be a regular expression, in which case the first match of the regular expression is replaced. When a g option (for global) is added to the regular expression, all matches in the string will be replaced, not just the first.

console.log("Borobudur".replace(/[ou]/, "a"));

// → Barobudur

console.log("Borobudur".replace(/[ou]/g, "a"));

// → Barabadar

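It would have been sensible if the choice between replacing one match or all matches was made through an additional argument to replace or by providing a different method, replaceAll. But for some unfortunate reason, the choice relies on a property of the regular expression instead. (Newer versions of JavaScript, from ES2021 on, do provide a replaceAll method on strings, though with a regular expression argument it still requires the g option.)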

The real power of using regular expressions with replace comes from the fact that we can refer to matched groups in the replacement string. For example, say we have a big string containing the names of people, one name per line, in the format Lastname, Firstname. If we want to swap these names and remove the comma to get a Firstname Lastname format, we can use the following code:

console.log(

  "Liskov, Barbara\nMcCarthy, John\nWadler, Philip"

    .replace(/(\w+), (\w+)/g, "$2 $1"));

// → Barbara Liskov

//  John McCarthy

//  Philip Wadler

The $1 and $2 in the replacement string refer to the parenthesized groups in the pattern. $1 is replaced by the text that matched against the first group, $2 by the second, and so on, up to $9. The whole match can be referred to with $&.
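Two more small illustrations of these placeholders. The input strings are made up for the example.

```javascript
// $& stands for the whole match, here used to wrap every p in brackets.
console.log("papa".replace(/p/g, "[$&]"));
// → [p]a[p]a

// Groups can be reused in any order to rearrange parts of the match.
console.log("2003-01-30".replace(/(\d{4})-(\d{2})-(\d{2})/, "$2/$3/$1"));
// → 01/30/2003
```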

It is possible to pass a function—rather than a string—as the second argument to replace. For each replacement, the function will be called with the matched groups (as well as the whole match) as arguments, and its return value will be inserted into the new string.

Here’s a small example:

let s = "the cia and fbi";

console.log(s.replace(/\b(fbi|cia)\b/g,

str => str.toUpperCase()));

// → the CIA and FBI

Here’s a more interesting one:

let stock = "1 lemon, 2 cabbages, and 101 eggs";

function minusOne(match, amount, unit) {

  amount = Number(amount) - 1;

  if (amount == 1) {

    // only one left, remove the 's'

    unit = unit.slice(0, unit.length - 1);

  } else if (amount == 0) {

    amount = "no";

  }

  return amount + " " + unit;

}

console.log(stock.replace(/(\d+) (\w+)/g, minusOne));

// → no lemon, 1 cabbage, and 100 eggs

This takes a string, finds all occurrences of a number followed by an alphanumeric word, and returns a string wherein every such occurrence is decremented by one.

The (\d+) group ends up as the amount argument to the function, and the (\w+) group gets bound to unit. The function converts amount to a number—which always works since it matched \d+—and makes some adjustments in case there is only one or zero left.

13. Greed

It is possible to use replace to write a function that removes all comments from a piece of JavaScript code. Here is a first attempt:

function stripComments(code) {

  return code.replace(/\/\/.*|\/\*[^]*\*\//g, "");

}

console.log(stripComments("1 + /* 2 */3"));

// → 1 + 3

console.log(stripComments("x = 10;// ten!"));

// → x = 10;

console.log(stripComments("1 /* a */+/* b */ 1"));

// → 1  1

The part before the or operator matches two slash characters followed by any number of non-newline characters. The part for multiline comments is more involved. We use [^] (any character that is not in the empty set of characters) as a way to match any character. We cannot just use a period here because block comments can continue on a new line, and the period character does not match newline characters.

But the output for the last line appears to have gone wrong. Why?

The [^]* part of the expression, as I described in the section on backtracking, will first match as much as it can. If that causes the next part of the pattern to fail, the matcher moves back one character and tries again from there. In the example, the matcher first tries to match the whole rest of the string and then moves back from there. It will find an occurrence of */ after going back four characters and match that. This is not what we wanted: the intention was to match a single comment, not to go all the way to the end of the code and find the end of the last block comment.

Because of this behavior, we say the repetition operators (+, *, ?, and {}) are greedy, meaning they match as much as they can and backtrack from there. If you put a question mark after them (+?, *?, ??, {}?), they become nongreedy and start by matching as little as possible, matching more only when the remaining pattern does not fit the smaller match.

And that is exactly what we want in this case. By having the star match the smallest stretch of characters that brings us to a */, we consume one block comment and nothing more.

function stripComments(code) {

  return code.replace(/\/\/.*|\/\*[^]*?\*\//g, "");

}

console.log(stripComments("1 /* a */+/* b */ 1"));

// → 1 + 1

A lot of bugs in regular expression programs can be traced to unintentionally using a greedy operator where a nongreedy one would work better. When using a repetition operator, consider the nongreedy variant first.
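The difference between the two variants is easy to see on a string with more than one possible end point. The markup string here is just a made-up example.

```javascript
const markup = "<b>bold</b> and <i>italic</i>";

// Greedy: .+ runs to the end of the string and backs up to the last >.
console.log(markup.match(/<.+>/)[0]);
// → <b>bold</b> and <i>italic</i>

// Nongreedy: .+? stops at the first > that lets the pattern succeed.
console.log(markup.match(/<.+?>/)[0]);
// → <b>

// The same holds for a plain repetition with nothing after it.
console.log("aaa".match(/a+/)[0], "aaa".match(/a+?/)[0]);
// → aaa a
```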

Source: Haverbeke Marijn (2018), Eloquent JavaScript: A Modern Introduction to Programming, No Starch Press; 3rd edition.
