
RegExp

Strings are among the most commonly used data structures in programming, and the need to manipulate them is almost ubiquitous. For instance, to determine whether a string is a valid email address, one could extract the substrings before and after the '@' and check each part separately. However, this approach is cumbersome and makes the code hard to reuse.

Regular expressions are a powerful tool for matching strings. They define rules for strings in a descriptive language; any string that matches the rules is considered valid.

To check if a string is a valid email, we:

  1. Create a regex pattern for email matching.
  2. Use this pattern to validate user input.

Since regular expressions are themselves written as strings of characters, we must first understand how they describe characters.

In regex, a literal character matches itself exactly. \d matches one digit, and \w matches one letter, digit, or underscore. For example:

  • '00\d' matches '007', but not '00A'.
  • '\d\d\d' matches '010'.
  • '\w\w' matches 'js'.
  • . matches any character, so 'js.' matches 'jsp', 'jss', 'js!', etc.
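
The patterns in the list above can be tried directly with test(). This is a minimal sketch; the names checks and basicResults are made up for illustration and are not part of any API.

```javascript
// Each row: [pattern, input, expected result of pattern.test(input)].
const checks = [
    [/00\d/, '007', true],   // \d accepts the digit 7
    [/00\d/, '00A', false],  // A is not a digit
    [/\d\d\d/, '010', true],
    [/\w\w/, 'js', true],
    [/js./, 'jsp', true],
    [/js./, 'js!', true],    // . accepts punctuation too
    [/js./, 'js', false],    // . still needs one character after "js"
];

// Run every pattern against its input.
const basicResults = checks.map(([re, input]) => re.test(input));

checks.forEach(([re, input], idx) => {
    console.log(String(re), JSON.stringify(input), basicResults[idx]);
});
```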

To match a variable number of characters, regex uses quantifiers, each of which applies to the element just before it:

  • * for any number of characters (including zero),
  • + for at least one character,
  • ? for zero or one character,
  • {n} for exactly n characters,
  • {n,m} for between n and m characters.
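
As a rough illustration of the quantifiers listed above, the sketch below applies each one to a single preceding character. The patterns are wrapped in ^ and $ so that the whole string has to match; the object name quantifierDemo is only illustrative.

```javascript
const quantifierDemo = {
    // * : zero or more "b"
    starZero: /^ab*c$/.test('ac'),            // true
    starMany: /^ab*c$/.test('abbbc'),         // true
    // + : at least one "b"
    plusZero: /^ab+c$/.test('ac'),            // false
    plusOne: /^ab+c$/.test('abc'),            // true
    // ? : zero or one "u"
    optAbsent: /^colou?r$/.test('color'),     // true
    optPresent: /^colou?r$/.test('colour'),   // true
    optTwice: /^colou?r$/.test('colouur'),    // false
    // {n} : exactly three digits
    exactOk: /^\d{3}$/.test('123'),           // true
    exactTooLong: /^\d{3}$/.test('1234'),     // false
    // {n,m} : three to eight digits
    rangeTooShort: /^\d{3,8}$/.test('12'),        // false
    rangeOk: /^\d{3,8}$/.test('1234'),            // true
    rangeTooLong: /^\d{3,8}$/.test('123456789'),  // false
};

console.log(quantifierDemo);
```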

For a complex example: \d{3}\s+\d{3,8}.

Breaking it down:

  • \d{3} matches three digits, e.g., '010'.
  • \s matches one whitespace character (a space, tab, newline, and so on), so \s+ matches at least one whitespace character.
  • \d{3,8} matches three to eight digits, e.g., '1234567'.

Together, this regex matches phone numbers in which a three-digit area code is separated from the number by one or more whitespace characters.
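
A quick way to see this pattern at work is to test it against a few inputs. The sample strings below are invented for illustration.

```javascript
const spacedPhone = /\d{3}\s+\d{3,8}/;

const spacedPhoneResults = {
    oneSpace: spacedPhone.test('010 12345'),        // true
    manySpaces: spacedPhone.test('010    1234567'), // true: \s+ absorbs all the spaces
    tab: spacedPhone.test('010\t12345'),            // true: a tab is whitespace
    noSpace: spacedPhone.test('01012345'),          // false: \s+ needs at least one
    tooShort: spacedPhone.test('010 12'),           // false: fewer than three digits follow
};

console.log(spacedPhoneResults);
```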

For a format like '010-12345', the pattern is \d{3}\-\d{3,8}. The hyphen is special only inside [], where it denotes a range, so the escape is optional here: \d{3}-\d{3,8} matches the same strings.

However, this pattern does not match '010 - 12345', because it makes no allowance for the spaces around the hyphen; that requires a more complex pattern.
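
One possible way to allow for the spaces, offered only as a sketch and not as the pattern the text has in mind, is to permit optional whitespace on both sides of the hyphen with \s*:

```javascript
// \s* allows zero or more whitespace characters around the hyphen.
const looseHyphenPhone = /\d{3}\s*-\s*\d{3,8}/;

const looseResults = {
    spaced: looseHyphenPhone.test('010 - 12345'),   // true
    tight: looseHyphenPhone.test('010-12345'),      // true
    noHyphen: looseHyphenPhone.test('010 12345'),   // false: the hyphen is still required
};

console.log(looseResults);
```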

Advanced Matching

For more precise matching, you can use [] to define ranges:

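  • [0-9a-zA-Z_] matches one digit, letter, or underscore.
  • [0-9a-zA-Z_]+ matches a string of at least one digit, letter, or underscore.
  • [a-zA-Z_$][0-9a-zA-Z_$]* matches a letter, underscore, or $ followed by any number of digits, letters, underscores, or $, which is the shape of an ASCII JavaScript variable name.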
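
The variable-name pattern can be checked as follows. This sketch adds ^ and $ so the entire string must fit the pattern. Note that it only checks the shape of the name: reserved words such as class pass the test, and non-ASCII identifiers do not.

```javascript
const identifierShape = /^[a-zA-Z_$][0-9a-zA-Z_$]*$/;

const goodNames = ['count', '_private', '$el', 'x1'];
const badNames = ['1abc', 'my-var', ''];

const goodResults = goodNames.map((name) => identifierShape.test(name));
const badResults = badNames.map((name) => identifierShape.test(name));

console.log(goodResults); // every entry is true
console.log(badResults);  // every entry is false
console.log(identifierShape.test('class')); // true, although "class" is a reserved word
```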

A|B matches A or B, so (J|j)ava(S|s)cript matches different cases of "JavaScript."
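
As a small sketch of alternation, the pattern above accepts the four capitalization variants it spells out and nothing else. A character class such as [Jj] is a common shorthand for a single-character alternative.

```javascript
const withAlternation = /(J|j)ava(S|s)cript/;
const withClasses = /[Jj]ava[Ss]cript/; // same strings, written with character classes

const variants = ['JavaScript', 'Javascript', 'javaScript', 'javascript'];

const alternationResults = variants.map((v) => withAlternation.test(v));
const classResults = variants.map((v) => withClasses.test(v));
const allCaps = withAlternation.test('JAVASCRIPT'); // false: only J/j and S/s may vary

console.log(alternationResults, classResults, allCaps);
```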

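^ matches the start of the string and $ matches the end (with the m flag, the start and end of each line). For example, ^\d means the string must start with a digit, and \d$ means it must end with one.

Without anchors, js matches anywhere inside a string, so it also matches 'jsp'. Written as ^js$, the pattern must match the whole string, so only 'js' matches.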
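
The effect of the anchors can be seen in a short sketch (the object name anchorDemo is illustrative):

```javascript
const anchorDemo = {
    unanchored: /js/.test('jsp'),        // true: "js" occurs inside "jsp"
    anchoredPartial: /^js$/.test('jsp'), // false: the whole string must be "js"
    anchoredExact: /^js$/.test('js'),    // true
    startsWithDigit: /^\d/.test('7up'),  // true
    digitNotFirst: /^\d/.test('up7'),    // false
    endsWithDigit: /\d$/.test('up7'),    // true
};

console.log(anchorDemo);
```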

Using RegExp in JavaScript

JavaScript has two ways to create a regular expression:

  1. Directly using /regex/.
  2. Using new RegExp('regex').

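Both forms produce equivalent regular expressions. Note that in the string passed to new RegExp, each backslash must itself be escaped, so \- is written as \\-: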

javascript
let re1 = /ABC\-001/;
let re2 = new RegExp('ABC\\-001');
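
The constructor form is mainly useful when part of the pattern is only known at run time. The sketch below first confirms that the two forms hold the same pattern text, then builds a pattern from a variable; the variable prefix and its value are assumptions made for the example.

```javascript
const fromLiteral = /ABC\-001/;
const fromString = new RegExp('ABC\\-001');

// Both objects carry the same pattern source.
const sameSource = fromLiteral.source === fromString.source;

// Building a pattern from a run-time value (hypothetical prefix).
const prefix = 'ABC';
const dynamic = new RegExp('^' + prefix + '-\\d{3}$');

const dynamicResults = [
    dynamic.test('ABC-001'), // true
    dynamic.test('ABC-01'),  // false: only two digits
    dynamic.test('XYZ-001'), // false: wrong prefix
];

console.log(sameSource, dynamicResults);
```

If the run-time value could contain regex metacharacters such as . or *, it would need escaping before being concatenated into the pattern.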

To check whether a regex matches a string, call its test() method:

javascript
let re = /^\d{3}\-\d{3,8}$/;
re.test('010-12345'); // true
re.test('010-1234x'); // false
re.test('010 12345'); // false

Splitting Strings

Using regex to split strings is more flexible than using fixed characters:

javascript
'a b   c'.split(' '); // ['a', 'b', '', '', 'c']
'a b   c'.split(/\s+/); // ['a', 'b', 'c']
'a,b, c  d'.split(/[\s\,]+/); // ['a', 'b', 'c', 'd']
'a,b;; c  d'.split(/[\s\,\;]+/); // ['a', 'b', 'c', 'd']

Grouping

Besides simple matching, regex can extract substrings using groups, denoted by ():

javascript
let re = /^(\d{3})-(\d{3,8})$/;
re.exec('010-12345'); // ['010-12345', '010', '12345']
re.exec('010 12345'); // null

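When the match succeeds, exec() returns an array whose first element is the entire matched string and whose remaining elements are the substrings captured by each group, in order. When the match fails, it returns null.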
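
In practice the captured groups are often unpacked into named variables. The helper below, parsePhone, is a hypothetical function written for this sketch; it checks for null before destructuring the result of exec().

```javascript
const phoneWithGroups = /^(\d{3})-(\d{3,8})$/;

// Hypothetical helper: returns the two captured parts, or null if there is no match.
function parsePhone(text) {
    const match = phoneWithGroups.exec(text);
    if (match === null) {
        return null;
    }
    // Skip index 0 (the whole match) and take the two groups.
    const [, areaCode, localNumber] = match;
    return { areaCode, localNumber };
}

console.log(parsePhone('010-12345')); // { areaCode: '010', localNumber: '12345' }
console.log(parsePhone('010 12345')); // null
```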

Greedy Matching

Regex matches greedily by default, meaning it tries to match as much as possible. For instance:

javascript
let re = /^(\d+)(0*)$/;
re.exec('102300'); // ['102300', '102300', '']

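Because \d+ is greedy, it consumes every digit, including the trailing zeros, so 0* is left with only the empty string. To make \d+ non-greedy, so that it matches as few digits as possible, append ? to the quantifier: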

javascript
let re = /^(\d+?)(0*)$/;
re.exec('102300'); // ['102300', '1023', '00']
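
The same difference shows up with other quantifiers. As a further sketch, using an invented input string, a greedy .+ runs to the last > while the non-greedy .+? stops at the first one:

```javascript
const tags = '<a><b>';

const greedyMatch = /<.+>/.exec(tags);   // .+ takes as much as it can
const lazyMatch = /<.+?>/.exec(tags);    // .+? takes as little as it can

console.log(greedyMatch[0]); // '<a><b>'
console.log(lazyMatch[0]);   // '<a>'
```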

JavaScript regular expressions also accept flags. The g flag enables global matching:

javascript
let r1 = /test/g;
let r2 = new RegExp('test', 'g');

With the g flag, each call to exec() resumes the search at re.lastIndex and then updates lastIndex to the position just past the latest match:

javascript
let s = 'JavaScript, VBScript, JScript and ECMAScript';
let re = /[a-zA-Z]+Script/g;

re.exec(s); // ['JavaScript']
re.lastIndex; // 10
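
Calling exec() repeatedly until it returns null walks through every match. The loop below is a minimal sketch of that idiom; the array name found is illustrative.

```javascript
const text = 'JavaScript, VBScript, JScript and ECMAScript';
const scriptRe = /[a-zA-Z]+Script/g;

const found = [];
let m;
// Each successful exec() advances scriptRe.lastIndex past the match.
while ((m = scriptRe.exec(text)) !== null) {
    found.push({ word: m[0], lastIndex: scriptRe.lastIndex });
}

console.log(found);
// JavaScript (lastIndex 10), VBScript (20), JScript (29), ECMAScript (44)

// After the failed final call, lastIndex is reset to 0.
console.log(scriptRe.lastIndex); // 0
```

Without the g flag, lastIndex is never advanced, so the same loop would find the first match over and over and never terminate.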

The i flag makes matching case-insensitive, and the m flag enables multiline matching, in which ^ and $ match at the start and end of each line.
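
A brief sketch of both flags, using invented sample strings:

```javascript
// i: letter case is ignored.
const caseSensitive = /javascript/.test('JavaScript');    // false
const caseInsensitive = /javascript/i.test('JavaScript'); // true

// m: ^ and $ also match at line breaks inside the string.
const twoLines = 'first\nsecond';
const withoutM = /^second$/.test(twoLines); // false: "second" is not the whole string
const withM = /^second$/m.test(twoLines);   // true: "second" is a whole line

console.log(caseSensitive, caseInsensitive, withoutM, withM);
```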

Summary

Regular expressions are powerful, and a single section cannot cover everything. If you work with them often, a dedicated reference book is worth having.

Exercise

Try writing a regex that validates email addresses. Version one should accept addresses like those in should_pass and reject those in should_fail:

javascript
let re = /^$/;

// Tests:
let i,
    success = true,
    should_pass = ['someone@gmail.com', 'bill.gates@microsoft.com', 'tom@voyager.org', 'bob2015@163.com'],
    should_fail = ['test#gmail.com', 'bill@microsoft', 'bill%gates@ms.com', '@voyager.org'];
for (i = 0; i < should_pass.length; i++) {
    if (!re.test(should_pass[i])) {
        console.log('Test failed: ' + should_pass[i]);
        success = false;
        break;
    }
}
for (i = 0; i < should_fail.length; i++) {
    if (re.test(should_fail[i])) {
        console.log('Test failed: ' + should_fail[i]);
        success = false;
        break;
    }
}
if (success) {
    console.log('Test passed!');
}

Version two should validate an email address that carries a display name and extract both the name and the address:

javascript
let re = /^$/;

// Test:
let r = re.exec('<Tom Paris> tom@voyager.org');
if (r === null || r.toString() !== ['<Tom Paris> tom@voyager.org', 'Tom Paris', 'tom@voyager.org'].toString()) {
    console.log('Test failed!');
} else {
    console.log('Test succeeded!');
}