RegExp
Strings are among the most commonly used data structures in programming, and the need to manipulate them is almost ubiquitous. For instance, to determine whether a string is a valid email address, one could extract the substrings before and after the '@' and check each part separately. However, this approach is cumbersome and makes the code hard to reuse.
Regular expressions are a powerful tool for matching strings. They define rules for strings in a descriptive language; any string that matches the rules is considered valid.
To check if a string is a valid email, we:
- Create a regex pattern for email matching.
- Use this pattern to validate user input.
Since regular expressions are also strings, we must first understand how to describe characters using them.
In regex, a literal character matches itself exactly. \d matches a digit, and \w matches a letter, digit, or underscore. For example:
- '00\d' matches '007', but not '00A'.
- '\d\d\d' matches '010'.
- '\w\w' matches 'js'.
- . matches any single character other than a line break, so 'js.' matches 'jsp', 'jss', 'js!', etc.
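The matches listed above can be checked with a short script. This is only a sketch: it relies on the test() method, which is introduced later in this section, and it adds the ^ and $ anchors (explained under Advanced Matching) so that the whole string has to match. The variable names are illustrative.

```javascript
// Anchored versions of the patterns above: the entire string must match.
const zeroZeroDigit = /^00\d$/;  // '00' followed by exactly one digit
const threeDigits = /^\d\d\d$/;  // exactly three digits
const twoWordChars = /^\w\w$/;   // two letters, digits, or underscores
const jsPlusAny = /^js.$/;       // 'js' followed by any one character

console.log(zeroZeroDigit.test('007')); // true
console.log(zeroZeroDigit.test('00A')); // false
console.log(threeDigits.test('010'));   // true
console.log(twoWordChars.test('js'));   // true
console.log(jsPlusAny.test('jsp'));     // true
console.log(jsPlusAny.test('js!'));     // true
console.log(jsPlusAny.test('js'));      // false: the dot needs a character
```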
To match a variable number of characters, regex uses quantifiers, which apply to the item that precedes them:
- * for any number of repetitions (including zero),
- + for at least one,
- ? for zero or one,
- {n} for exactly n,
- {n,m} for between n and m.
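As a rough illustration of the quantifiers listed above, the sketch below applies each one to a small pattern. The patterns and variable names are made up for this example, and test() and the ^ and $ anchors are covered later in this section.

```javascript
const zeroOrMore = /^ab*$/;      // 'a' followed by any number of 'b'
const oneOrMore = /^ab+$/;       // 'a' followed by at least one 'b'
const zeroOrOne = /^ab?$/;       // 'a' followed by at most one 'b'
const exactlyThree = /^\d{3}$/;  // exactly three digits
const threeToFive = /^\d{3,5}$/; // three, four, or five digits

console.log(zeroOrMore.test('a'));      // true
console.log(zeroOrMore.test('abbb'));   // true
console.log(oneOrMore.test('a'));       // false
console.log(zeroOrOne.test('ab'));      // true
console.log(zeroOrOne.test('abb'));     // false
console.log(exactlyThree.test('1234')); // false
console.log(threeToFive.test('1234'));  // true
```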
A more complex example: \d{3}\s+\d{3,8}.
Reading it from left to right:
- \d{3} matches three digits, e.g., '010'.
- \s matches one whitespace character (a space, a tab, and so on), so \s+ matches at least one whitespace character.
- \d{3,8} matches between three and eight digits, e.g., '1234567'.
Together, this regex matches a phone number whose area code is separated from the rest by one or more whitespace characters.
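The breakdown above can be tried out as follows. This is a minimal sketch that wraps the pattern in ^ and $ (explained under Advanced Matching) and uses test() (introduced later); the name phone is illustrative.

```javascript
// Three digits, at least one whitespace character, then three to eight digits.
const phone = /^\d{3}\s+\d{3,8}$/;

console.log(phone.test('010 1234567')); // true
console.log(phone.test('010   12345')); // true: several spaces are fine
console.log(phone.test('010\t12345'));  // true: a tab also counts as whitespace
console.log(phone.test('01012345'));    // false: no whitespace separator
console.log(phone.test('010 12'));      // false: fewer than three digits after the gap
```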
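For a format like '010-12345', the pattern becomes \d{3}\-\d{3,8}. The hyphen is escaped here to be safe; strictly speaking, '-' has a special meaning only inside [], so a plain - would also work.
However, this pattern still won't match '010 - 12345' because of the spaces around the hyphen; that requires a more complex pattern.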
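One possible "more complex pattern" allows optional whitespace on either side of the hyphen with \s*. The sketch below is an assumption about what such a pattern could look like, not the only way to write it; the name dashedPhone is illustrative.

```javascript
// \s* permits zero or more whitespace characters around the hyphen.
const dashedPhone = /^\d{3}\s*\-\s*\d{3,8}$/;

console.log(dashedPhone.test('010-12345'));   // true
console.log(dashedPhone.test('010 - 12345')); // true
console.log(dashedPhone.test('010 -12345'));  // true
console.log(dashedPhone.test('010 12345'));   // false: the hyphen is required
```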
Advanced Matching
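For more precise matching, use [] to define a set or range of characters:
- [0-9a-zA-Z_] matches one digit, letter, or underscore.
- [0-9a-zA-Z_]+ matches a string of at least one digit, letter, or underscore.
- [a-zA-Z_$][0-9a-zA-Z_$]* matches a string that starts with a letter, underscore, or $ and continues with any number of digits, letters, underscores, or $, which is the form of a typical JavaScript variable name.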
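To see the variable-name pattern at work, here is a small sketch. It anchors the pattern with ^ and $ (explained just below) and uses test() (introduced later). Note that it only covers ASCII names: real JavaScript identifiers may also contain Unicode letters and may not be reserved words, which this pattern does not check.

```javascript
// First character: letter, underscore, or $. Remaining characters may also be digits.
const identifier = /^[a-zA-Z_$][0-9a-zA-Z_$]*$/;

console.log(identifier.test('userName')); // true
console.log(identifier.test('_private')); // true
console.log(identifier.test('$el'));      // true
console.log(identifier.test('2fast'));    // false: starts with a digit
console.log(identifier.test('my-var'));   // false: '-' is not in the set
```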
A|B matches A or B, so (J|j)ava(S|s)cript matches 'JavaScript', 'Javascript', 'javaScript', or 'javascript'.
^ denotes the start of the string (or of each line in multiline mode), while $ denotes the end. For example, ^\d means the string must start with a digit.
You may have noticed that js also matches 'jsp', because an unanchored pattern can match anywhere inside a string. Written as ^js$, the pattern must match the whole string, so only 'js' qualifies.
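The effect of alternation and anchors can be checked with a few calls to test(), which is introduced in the next part. The sketch below uses illustrative variable names.

```javascript
// Alternation: each group accepts an upper- or lower-case letter.
const anyCase = /^(J|j)ava(S|s)cript$/;
console.log(anyCase.test('JavaScript')); // true
console.log(anyCase.test('javascript')); // true
console.log(anyCase.test('JAVASCRIPT')); // false: only J/j and S/s may vary

// Anchors: without them the pattern may match anywhere in the string.
const loose = /js/;
const exact = /^js$/;
console.log(loose.test('jsp')); // true
console.log(exact.test('jsp')); // false
console.log(exact.test('js'));  // true

const startsWithDigit = /^\d/;
console.log(startsWithDigit.test('7up')); // true
console.log(startsWithDigit.test('up7')); // false
```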
Using RegExp in JavaScript
JavaScript has two ways to create a regex:
- Writing a literal directly: /regex/.
- Constructing a RegExp object: new RegExp('regex').
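Both forms produce the same regex. In the second form the pattern is an ordinary string, so each backslash must itself be escaped: '\\-' in the string becomes \- in the pattern: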
javascript
let re1 = /ABC\-001/;
let re2 = new RegExp('ABC\\-001');
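The equivalence of the two forms can be confirmed through the source property, which holds the pattern text. The sketch below also shows what happens when the backslash is not doubled in the string form; the variable names are illustrative.

```javascript
const literal = /ABC\-001/;
const constructed = new RegExp('ABC\\-001');

console.log(literal.source);                        // ABC\-001
console.log(constructed.source === literal.source); // true
console.log(constructed.test('ABC-001'));           // true

// A single backslash is consumed by the string literal: '\d' is just 'd'.
const broken = new RegExp('^\d{3}$');
console.log(broken.source);      // ^d{3}$
console.log(broken.test('123')); // false
console.log(broken.test('ddd')); // true
```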
To check whether a string matches a regex, call the regex's test() method:
javascript
let re = /^\d{3}\-\d{3,8}$/;
re.test('010-12345'); // true
re.test('010-1234x'); // false
re.test('010 12345'); // false
Splitting Strings
Using regex to split strings is more flexible than using fixed characters:
javascript
'a b   c'.split(' '); // ['a', 'b', '', '', 'c']
'a b   c'.split(/\s+/); // ['a', 'b', 'c']
'a,b, c d'.split(/[\s\,]+/); // ['a', 'b', 'c', 'd']
'a,b;; c d'.split(/[\s\,\;]+/); // ['a', 'b', 'c', 'd']
Grouping
Besides simple matching, regex can extract substrings using groups, which are denoted by ():
javascript
let re = /^(\d{3})-(\d{3,8})$/;
re.exec('010-12345'); // ['010-12345', '010', '12345']
re.exec('010 12345'); // null
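When the pattern defines groups, exec() returns an array on success: the first element is the entire matched string, followed by the substring captured by each group. On failure it returns null.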
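In practice the array returned by exec() is often unpacked right away. The sketch below wraps the phone pattern in a helper; splitPhone and the field names areaCode and local are hypothetical names chosen for this example.

```javascript
const phoneParts = /^(\d{3})-(\d{3,8})$/;

// Hypothetical helper: returns the two captured groups, or null if there is no match.
function splitPhone(text) {
    const match = phoneParts.exec(text);
    if (match === null) {
        return null;
    }
    // Skip index 0 (the whole match) and name the two groups.
    const [, areaCode, local] = match;
    return { areaCode, local };
}

console.log(splitPhone('010-12345')); // { areaCode: '010', local: '12345' }
console.log(splitPhone('010 12345')); // null
```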
Greedy Matching
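Regex quantifiers are greedy by default: they match as many characters as possible. In the example below, \d+ consumes every digit, including the trailing zeros, so 0* can only match the empty string: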
javascript
let re = /^(\d+)(0*)$/;
re.exec('102300'); // ['102300', '102300', '']
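To make \d+ non-greedy (matching as few characters as possible) so that the trailing zeros go to the second group, add ? after the quantifier: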
javascript
let re = /^(\d+?)(0*)$/;
re.exec('102300'); // ['102300', '1023', '00']
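The same difference shows up whenever a pattern has a closing delimiter that occurs more than once. The sketch below is an illustration made up for this purpose: it uses the string method match() on a small tag-like string, and the variable names are illustrative.

```javascript
const html = '<b>bold</b>';

// Greedy: .+ runs to the last '>' in the string.
const greedyMatch = html.match(/<.+>/)[0];
// Non-greedy: .+? stops at the first '>' it can.
const lazyMatch = html.match(/<.+?>/)[0];

console.log(greedyMatch); // <b>bold</b>
console.log(lazyMatch);   // <b>
```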
Global Search
JavaScript regexes accept flags. The most common is g, for global matching:
javascript
let r1 = /test/g;
let r2 = new RegExp('test', 'g');
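A global regex can run exec() repeatedly on the same string. After each match, the regex object updates its lastIndex property to the index where the match ended, and the next call resumes searching from there: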
javascript
let s = 'JavaScript, VBScript, JScript and ECMAScript';
let re = /[a-zA-Z]+Script/g;
re.exec(s); // ['JavaScript']
re.lastIndex; // 10
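Calling exec() in a loop until it returns null collects every match. The following sketch continues the example with the same string and pattern; the names text, scriptName, and found are illustrative.

```javascript
const text = 'JavaScript, VBScript, JScript and ECMAScript';
const scriptName = /[a-zA-Z]+Script/g;

const found = [];
let match;
// Each call resumes at scriptName.lastIndex; null means no further match.
while ((match = scriptName.exec(text)) !== null) {
    found.push({ name: match[0], lastIndex: scriptName.lastIndex });
}

console.log(found.map((f) => f.name));
// ['JavaScript', 'VBScript', 'JScript', 'ECMAScript']
console.log(found.map((f) => f.lastIndex));
// [10, 20, 29, 44]
console.log(scriptName.lastIndex); // 0: reset after the failed final search
```

Because lastIndex is stored on the regex object, reusing one global regex on a different string without resetting lastIndex can skip matches.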
The i flag makes matching case-insensitive, and the m flag enables multiline matching.
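A brief sketch of both flags follows; the sample strings and variable names are made up for this example.

```javascript
// i: letter case is ignored.
const anyLetterCase = /^javascript$/i;
console.log(anyLetterCase.test('JAVASCRIPT')); // true
console.log(/^javascript$/.test('JAVASCRIPT')); // false without the flag

// m: ^ and $ match at the start and end of each line, not only of the whole string.
const twoLines = 'first\nsecond';
console.log(/^second$/.test(twoLines));  // false
console.log(/^second$/m.test(twoLines)); // true
```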
Summary
Regex is powerful, and covering everything in one session is impossible. For frequent regex issues, consider a dedicated reference book.
Exercise
Try writing a regex that validates email addresses. Version one should accept addresses like the ones in should_pass and reject the ones in should_fail:
javascript
let re = /^$/;
// Tests:
let i,
    success = true,
    should_pass = ['someone@gmail.com', 'bill.gates@microsoft.com', 'tom@voyager.org', 'bob2015@163.com'],
    should_fail = ['test#gmail.com', 'bill@microsoft', 'bill%gates@ms.com', '@voyager.org'];
// Every address in should_pass must match.
for (i = 0; i < should_pass.length; i++) {
    if (!re.test(should_pass[i])) {
        console.log('Test failed: ' + should_pass[i]);
        success = false;
        break;
    }
}
// No address in should_fail may match.
for (i = 0; i < should_fail.length; i++) {
    if (re.test(should_fail[i])) {
        console.log('Test failed: ' + should_fail[i]);
        success = false;
        break;
    }
}
if (success) {
    console.log('Test passed!');
}
Version two should also accept an address that is preceded by a name in angle brackets, and extract both the name and the address:
javascript
let re = /^$/;
// Test:
let r = re.exec('<Tom Paris> tom@voyager.org');
if (r === null || r.toString() !== ['<Tom Paris> tom@voyager.org', 'Tom Paris', 'tom@voyager.org'].toString()) {
    console.log('Test failed!');
} else {
    console.log('Test succeeded!');
}