Regular expressions in JavaScript – Part 2

1. Dynamically Creating RegExp Objects

There are cases where you might not know the exact pattern you need to match against when you are writing your code. Say you want to look for the user’s name in a piece of text and enclose it in underscore characters to make it stand out. Since you will know the name only once the program is actually running, you can’t use the slash-based notation.

But you can build up a string and use the RegExp constructor on that. Here’s an example:

let name = "harry";

let text = "Harry is a suspicious character.";

let regexp = new RegExp("\\b(" + name + ")\\b", "gi");

console.log(text.replace(regexp, "_$1_"));

// → _Harry_ is a suspicious character.

When creating the \b boundary markers, we have to use two backslashes because we are writing them in a normal string, not a slash-enclosed regular expression. The second argument to the RegExp constructor contains the options for the regular expression, in this case "gi" for global and case insensitive.
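To see why the doubling matters, here is a small sketch that compares a regular expression built from a string with the equivalent slash-based one. The binding names are made up for illustration.

```javascript
// In a string, "\\b" is a backslash followed by b, which the RegExp
// constructor then reads as a word boundary.
let fromString = new RegExp("\\bcat\\b");
let fromLiteral = /\bcat\b/;
console.log(fromString.source === fromLiteral.source);
// → true

// With a single backslash, "\b" is a backspace character, so this
// pattern looks for backspace, "cat", backspace instead.
let mistaken = new RegExp("\bcat\b");
console.log(fromString.test("a cat sat"), mistaken.test("a cat sat"));
// → true false
```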

But what if the name is “dea+hl[]rd” because our user is a nerdy teenager? That would result in a nonsensical regular expression that won’t actually match the user’s name.

To work around this, we can add backslashes before any character that has a special meaning.

let name = "dea+hl[]rd";

let text = "This dea+hl[]rd guy is super annoying.";

let escaped = name.replace(/[\\[.+*?(){|^$]/g, "\\$&");

let regexp = new RegExp("\\b" + escaped + "\\b", "gi");

console.log(text.replace(regexp, "_$&_"));

// → This _dea+hl[]rd_ guy is super annoying.
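The escaping step can be wrapped in a reusable function. The following is a minimal sketch; escapeRegExp and highlight are hypothetical names, and the character class is a slightly extended variant of the one above that also escapes ], ), and }.

```javascript
// Hypothetical helper: put a backslash before every character that has a
// special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[\\[\].+*?(){}|^$]/g, "\\$&");
}

// Hypothetical helper: wrap every whole-word occurrence of name in
// underscores, ignoring case.
function highlight(text, name) {
  let regexp = new RegExp("\\b" + escapeRegExp(name) + "\\b", "gi");
  return text.replace(regexp, "_$&_");
}

console.log(highlight("This dea+hl[]rd guy is super annoying.", "dea+hl[]rd"));
// → This _dea+hl[]rd_ guy is super annoying.
console.log(highlight("Harry is a suspicious character.", "harry"));
// → _Harry_ is a suspicious character.
```

Note that the \b markers only behave as intended when the name starts and ends with a word character, as both names here do.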

2. The search Method

The indexOf method on strings cannot be called with a regular expression. But there is another method, search, that does expect a regular expression. Like indexOf, it returns the first index on which the expression was found, or -1 when it wasn’t found.

console.log("  word".search(/\S/));

// → 2

console.log("    ".search(/\S/));

// → -1

Unfortunately, there is no way to indicate that the match should start at a given offset (like we can with the second argument to indexOf), which would often be useful.
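One possible workaround is to search in a slice of the string and add the offset back afterward. This is only a sketch, and searchFrom is a hypothetical helper, not a built-in method.

```javascript
// Hypothetical helper: like search, but only considers matches that start
// at or after the given offset.
function searchFrom(string, regexp, start) {
  let index = string.slice(start).search(regexp);
  return index == -1 ? -1 : index + start;
}

console.log(searchFrom("  word  word", /\S/, 6));
// → 8
console.log(searchFrom("  word", /\S/, 6));
// → -1
```

Because the pattern only sees the sliced string, markers such as ^ and \b treat the offset as the start of the text, which may or may not be what you want.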

3. The lastIndex Property

The exec method similarly does not provide a convenient way to start searching from a given position in the string. But it does provide an inconvenient way.

Regular expression objects have properties. One such property is source, which contains the string that expression was created from. Another property is lastIndex, which controls, in some limited circumstances, where the next match will start.

Those circumstances are that the regular expression must have the global (g) or sticky (y) option enabled, and the match must happen through the exec method. Again, a less confusing solution would have been to just allow an extra argument to be passed to exec, but confusion is an essential feature of JavaScript’s regular expression interface.

let pattern = /y/g;

pattern.lastIndex = 3;

let match = pattern.exec("xyzzy");

console.log(match.index);

// → 4

console.log(pattern.lastIndex);

// → 5

If the match was successful, the call to exec automatically updates the lastIndex property to point after the match. If no match was found, lastIndex is set back to zero, which is also the value it has in a newly constructed regular expression object.

The difference between the global and the sticky options is that, when sticky is enabled, the match will succeed only if it starts directly at lastIndex, whereas with global, it will search ahead for a position where a match can start.

let global = /abc/g;

console.log(global.exec("xyz abc"));

// → ["abc"]

let sticky = /abc/y;

console.log(sticky.exec("xyz abc"));

// → null
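A sticky expression does match when lastIndex points exactly at the start of a match. The short sketch below illustrates this; the binding names are chosen for the example.

```javascript
let stickyAbc = /abc/y;

// "abc" starts at index 4 of the string, so pointing lastIndex there works.
stickyAbc.lastIndex = 4;
let found = stickyAbc.exec("xyz abc");
console.log(found.index, stickyAbc.lastIndex);
// → 4 7

// Index 3 holds a space, so the sticky match fails instead of searching ahead.
stickyAbc.lastIndex = 3;
let missed = stickyAbc.exec("xyz abc");
console.log(missed, stickyAbc.lastIndex);
// → null 0
```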

When using a shared regular expression value for multiple exec calls, these automatic updates to the lastIndex property can cause problems. Your regular expression might be accidentally starting at an index that was left over from a previous call.

let digit = /\d/g;

console.log(digit.exec("here it is: 1"));

// → ["1"]

console.log(digit.exec("and now: 1"));

// → null
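One way to avoid this, sketched below, is to reset lastIndex before reusing the expression on a different string. The binding names are made up for the example.

```javascript
let sharedDigit = /\d/g;

let first = sharedDigit.exec("here it is: 1");
console.log(first.index, sharedDigit.lastIndex);
// → 12 13

// Without this reset, the next search would start at index 13, which is
// past the end of the second string.
sharedDigit.lastIndex = 0;
let second = sharedDigit.exec("and now: 1");
console.log(second.index);
// → 9
```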

Another interesting effect of the global option is that it changes the way the match method on strings works. When called with a global expression, instead of returning an array similar to that returned by exec, match will find all matches of the pattern in the string and return an array containing the matched strings.

console.log("Banana".match(/an/g));

// → ["an", "an"]
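For comparison, this small sketch shows the same call with and without the global option. Only the non-global result carries an index property.

```javascript
let single = "Banana".match(/an/);
console.log(single[0], single.index);
// → an 1

let all = "Banana".match(/an/g);
console.log(all.length, all.index);
// → 2 undefined
```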

So be cautious with global regular expressions. The cases where they are necessary—calls to replace and places where you want to explicitly use lastIndex—are typically the only places where you want to use them.

4. Looping Over Matches

A common thing to do is to scan through all occurrences of a pattern in a string, in a way that gives us access to the match object in the loop body. We can do this by using lastIndex and exec.

let input = "A string with 3 numbers in it... 42 and 88.";

let number = /\b\d+\b/g;

let match;

while (match = number.exec(input)) {

  console.log("Found", match[0], "at", match.index);

}

// → Found 3 at 14

//   Found 42 at 33

//   Found 88 at 40

This makes use of the fact that the value of an assignment expression (=) is the assigned value. So by using match = number.exec(input) as the condition in the while statement, we perform the match at the start of each iteration, save its result in a binding, and stop looping when no more matches are found.
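Newer JavaScript versions (ES2020 and later) also offer a matchAll method on strings, which is not part of the approach described above but produces the same match objects. A sketch, assuming an environment that supports it:

```javascript
let text = "A string with 3 numbers in it... 42 and 88.";
let found = [];

// matchAll requires a global expression and yields one match object
// per occurrence, so no manual lastIndex handling is needed.
for (let match of text.matchAll(/\b\d+\b/g)) {
  found.push([match[0], match.index]);
  console.log("Found", match[0], "at", match.index);
}
// → Found 3 at 14
//   Found 42 at 33
//   Found 88 at 40
```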

5. Parsing an INI File

To conclude the chapter, we’ll take a look at a problem that calls for regular expressions. Imagine we are writing a program to automatically collect information about our enemies from the internet. (We will not actually write that program here, just the part that reads the configuration file. Sorry.) The configuration file looks like this:

searchengine=https://duckduckgo.com/?q=$1

spitefulness=9.7

; comments are preceded by a semicolon...

; each section concerns an individual enemy

[larry]

fullname=Larry Doe

type=kindergarten bully

website=http://www.geocities.com/CapeCanaveral/11451

[davaeorn]

fullname=Davaeorn

type=evil wizard

outputdir=/home/marijn/enemies/davaeorn 

The exact rules for this format (which is a widely used format, usually called an INI file) are as follows:

  • Blank lines and lines starting with semicolons are ignored.
  • Lines wrapped in [ and ] start a new section.
  • Lines containing an alphanumeric identifier followed by an = character add a setting to the current section.
  • Anything else is invalid.

Our task is to convert a string like this into an object whose properties hold strings for settings written before the first section header and subobjects for sections, with those subobjects holding the section’s settings.

Since the format has to be processed line by line, splitting up the file into separate lines is a good start. We saw the split method in “Strings and Their Properties” on page 72. Some operating systems, however, use not just a newline character to separate lines but a carriage return character followed by a newline (“\r\n”). Given that the split method also allows a regular expression as its argument, we can use a regular expression like /\r?\n/ to split in a way that allows both “\n” and “\r\n” between lines.
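As a quick illustration, the sketch below splits the same three lines with both kinds of line endings. The sample strings are made up for the example.

```javascript
let unixText = "name=Vasilis\n[address]\ncity=Tessaloniki";
let windowsText = "name=Vasilis\r\n[address]\r\ncity=Tessaloniki";

let unixLines = unixText.split(/\r?\n/);
let windowsLines = windowsText.split(/\r?\n/);
console.log(unixLines.join("|") === windowsLines.join("|"));
// → true

// Splitting on "\n" alone leaves a carriage return at the end of each line.
console.log(windowsText.split("\n")[0].length, windowsLines[0].length);
// → 13 12
```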

function parseINI(string) {

  // Start with an object to hold the top-level fields

  let result = {};

  let section = result;

  string.split(/\r?\n/).forEach(line => {

    let match;

    if (match = line.match(/^(\w+)=(.*)$/)) {

      section[match[1]] = match[2];

    } else if (match = line.match(/^\[(.*)\]$/)) {

      section = result[match[1]] = {};

    } else if (!/^\s*(;.*)?$/.test(line)) {

      throw new Error("Line '" + line + "' is not valid.");

    }

  });

  return result;

}

console.log(parseINI(`

name=Vasilis

[address]

city=Tessaloniki`));

// → {name: "Vasilis", address: {city: "Tessaloniki"}}

The code goes over the file’s lines and builds up an object. Properties at the top are stored directly into that object, whereas properties found in sections are stored in a separate section object. The section binding points at the object for the current section.

There are two kinds of significant lines—section headers or property lines. When a line is a regular property, it is stored in the current section. When it is a section header, a new section object is created, and section is set to point at it.

Note the recurring use of ^ and $ to make sure the expression matches the whole line, not just part of it. Leaving these out results in code that mostly works but behaves strangely for some input, which can be a difficult bug to track down.
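The following sketch shows the kind of strange behavior that leaving out the markers can cause. The unanchored pattern and the sample line are made up for illustration.

```javascript
let anchored = /^(\w+)=(.*)$/;
let unanchored = /(\w+)=(.*)/;
let badLine = "??? name=Vasilis";

// The anchored pattern rejects the line because of the leading junk.
console.log(anchored.test(badLine));
// → false

// The unanchored pattern quietly skips the junk and accepts the line.
console.log(unanchored.exec(badLine)[1]);
// → name
```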

The pattern if (match = string.match(…)) is similar to the trick of using an assignment as the condition for while. You often aren’t sure that your call to match will succeed, so you can access the resulting object only inside an if statement that tests for this. To not break the pleasant chain of else if forms, we assign the result of the match to a binding and immediately use that assignment as the test for the if statement.

If a line is not a section header or a property, the function checks whether it is a comment or an empty line using the expression /^\s*(;.*)?$/. Do you see how it works? The part between the parentheses will match comments, and the ? makes sure it also matches lines containing only whitespace. When a line doesn’t match any of the expected forms, the function throws an exception.
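To check that reading of the expression, here is a small sketch that tries it on a few sample lines. The lines and the binding name are chosen for the example.

```javascript
let blankOrComment = /^\s*(;.*)?$/;
let samples = ["", "   ", "; a comment", "  ; indented comment",
               "name=Vasilis", "[address]"];

for (let line of samples) {
  console.log(JSON.stringify(line), blankOrComment.test(line));
}
// → "" true
//   "   " true
//   "; a comment" true
//   "  ; indented comment" true
//   "name=Vasilis" false
//   "[address]" false
```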

6. International Characters

Because of JavaScript’s initial simplistic implementation and the fact that this simplistic approach was later set in stone as standard behavior, JavaScript’s regular expressions are rather dumb about characters that do not appear in the English language. For example, as far as JavaScript’s regular expressions are concerned, a “word character” is only one of the 26 characters in the Latin alphabet (uppercase or lowercase), decimal digits, and, for some reason, the underscore character. Things like é or β, which most definitely are word characters, will not match \w (and will match uppercase \W, the nonword category).

By a strange historical accident, \s (whitespace) does not have this problem and matches all characters that the Unicode standard considers whitespace, including things like the nonbreaking space and, in engines that follow older versions of the Unicode standard, the Mongolian vowel separator.
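A short sketch of both behaviors follows. The characters are written as escape sequences so the example does not depend on file encoding.

```javascript
let eAcute = "\u00e9";          // é
let beta = "\u03b2";            // β
let nonbreakingSpace = "\u00a0";

console.log(/\w/.test(eAcute), /\w/.test(beta));
// → false false
console.log(/\W/.test(eAcute));
// → true
console.log(/\s/.test(nonbreakingSpace));
// → true
```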

Another problem is that, by default, regular expressions work on code units (as discussed in “Strings and Character Codes” on page 92), not actual characters. This means characters that are composed of two code units behave strangely.
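The sketch below shows the kind of surprising result this leads to. The apple and rose emoji are simply convenient examples of characters that take up two code units.

```javascript
let threeApples = /🍎{3}/.test("🍎🍎🍎");
console.log(threeApples);
// → false

let roseInBrackets = /<.>/.test("<🌹>");
console.log(roseInBrackets);
// → false

// Each of these emoji occupies two code units in a string.
console.log("🌹".length);
// → 2
```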

The problem is that an emoji such as 🍎 is treated as two code units, so in a pattern like /🍎{3}/ the {3} part is applied only to the second one. Similarly, the dot matches a single code unit, not the two that make up the rose emoji (🌹).

You must add a u option (for Unicode) to your regular expression to make it treat such characters properly. The wrong behavior remains the default, unfortunately, because changing that might cause problems for existing code that depends on it.
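As a brief sketch of the difference the option makes, these tests use emoji that take up two code units each and succeed once u is added.

```javascript
let applesWithU = /🍎{3}/u.test("🍎🍎🍎");
console.log(applesWithU);
// → true

let roseWithU = /<.>/u.test("<🌹>");
console.log(roseWithU);
// → true
```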

Though this was only just standardized and is, at the time of writing, not widely supported yet, it is possible to use \p in a regular expression (that must have the Unicode option enabled) to match all characters to which the Unicode standard assigns a given property.

console.log(/\p{Script=Greek}/u.test("α"));

// → true

console.log(/\p{Script=Arabic}/u.test("α"));

// → false

console.log(/\p{Alphabetic}/u.test("α"));

// → true

console.log(/\p{Alphabetic}/u.test("!"));

// → false

Unicode defines a number of useful properties, though finding the one that you need may not always be trivial. You can use the \p{Property=Value} notation to match any character that has the given value for that property. If the property name is left off, as in \p{Name}, the name is assumed to be either a binary property such as Alphabetic or a category such as Number.
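As one possible use, the sketch below combines a binary property and a category into a word pattern that is not limited to the Latin alphabet. The unicodeWord binding is a made-up name, and the example assumes an environment that supports \p.

```javascript
// Letters and numbers from any script, plus the underscore.
let unicodeWord = /^[\p{Alphabetic}\p{Number}_]+$/u;

let cafe = "caf\u00e9";                  // café
let beta = "\u03b2\u03b7\u03c4\u03b1";   // βητα
console.log(unicodeWord.test(cafe), unicodeWord.test(beta));
// → true true
console.log(unicodeWord.test("two words"));
// → false

// The Property=Value form, here with the Cyrillic letter ж.
console.log(/\p{Script=Cyrillic}/u.test("\u0436"));
// → true
```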

Source: Haverbeke Marijn (2018), Eloquent JavaScript: A Modern Introduction to Programming, No Starch Press; 3rd edition.
