RegExp
Strings are among the most commonly used data types in programming, and the need to manipulate them is nearly universal. For instance, to determine whether a string is a valid email address, one could extract the substrings before and after the '@' and check each part separately. However, this approach is cumbersome and makes the code hard to reuse.
Regular expressions are a powerful tool for matching strings. They define rules for strings in a descriptive language; any string that matches the rules is considered valid.
To check if a string is a valid email, we:
- Create a regex pattern for email matching.
- Use this pattern to validate user input.
Since regular expressions are also strings, we must first understand how to describe characters using them.
In regex, a literal character matches exactly that character. \d matches one digit, and \w matches one letter, digit, or underscore. For example:
- '00\d' matches '007', but not '00A'.
- '\d\d\d' matches '010'.
- '\w\w' matches 'js'.
- . matches any single character (except a line break), so 'js.' matches 'jsp', 'jss', 'js!', etc.
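The basic matches above can be checked with the test() method of JavaScript's RegExp objects, which is covered later in this section. The following is a minimal sketch; the variable names are illustrative only.

```javascript
// Each pattern is written as a regex literal. test() returns true
// if the pattern matches somewhere in the given string.
const digitAfter00 = /00\d/;
const anyAfterJs = /js./;

const results = [
  digitAfter00.test('007'), // true: '7' is a digit
  digitAfter00.test('00A'), // false: 'A' is not a digit
  /\d\d\d/.test('010'),     // true: three digits
  /\w\w/.test('js'),        // true: two word characters
  anyAfterJs.test('jsp'),   // true: '.' matches 'p'
  anyAfterJs.test('js!'),   // true: '.' matches '!'
  anyAfterJs.test('js'),    // false: nothing is left for '.' to match
];
console.log(results);
```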
To match variable-length characters, regex uses:
- * for any number of characters (including zero),
- + for at least one character,
- ? for zero or one character,
- {n} for exactly n characters,
- {n,m} for between n and m characters.
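To see how each quantifier behaves, the sketch below applies it to a single character or to \d. The patterns are wrapped in ^ and $ (introduced later in this section) so that the whole string has to match; the object name quantifierChecks is made up for this illustration.

```javascript
// Each entry holds two test() results for one quantifier.
const quantifierChecks = {
  star: [/^ab*$/.test('a'), /^ab*$/.test('abbb')],            // zero or more 'b'
  plus: [/^ab+$/.test('a'), /^ab+$/.test('abbb')],            // one or more 'b'
  optional: [/^ab?$/.test('ab'), /^ab?$/.test('abb')],        // zero or one 'b'
  exact: [/^\d{3}$/.test('123'), /^\d{3}$/.test('1234')],     // exactly three digits
  range: [/^\d{3,8}$/.test('12'), /^\d{3,8}$/.test('12345')], // three to eight digits
};
console.log(quantifierChecks);
```

Only 'a' against b+ , 'abb' against b?, '1234' against \d{3}, and '12' against \d{3,8} fail; every other check succeeds.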
For a complex example: \d{3}\s+\d{3,8}.
Breaking it down:
- \d{3} matches three digits, e.g., '010'.
- \s matches one whitespace character (a space, a tab, and so on), so \s+ matches at least one whitespace character.
- \d{3,8} matches three to eight digits, e.g., '1234567'.
Together, this regex matches a phone number whose area code is separated from the rest of the number by one or more whitespace characters.
For a format like '010-12345', we can write \d{3}\-\d{3,8}. The '-' is escaped here as a precaution; it is special only inside [], so \d{3}-\d{3,8} matches the same strings.
However, it won't match '010 - 12345' due to the space, requiring a more complex pattern.
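One way to also accept spaces around the hyphen is to allow optional whitespace with \s* on each side. This is a sketch of one possible pattern rather than the only solution; the names strictPhone and loosePhone are hypothetical, and ^ and $ (explained later in this section) are added so the whole string must match.

```javascript
// Hyphen only, no surrounding whitespace allowed.
// A hyphen outside [] does not need to be escaped.
const strictPhone = /^\d{3}-\d{3,8}$/;
// Same pattern, but \s* permits zero or more whitespace
// characters on either side of the hyphen.
const loosePhone = /^\d{3}\s*-\s*\d{3,8}$/;

console.log(strictPhone.test('010-12345'));   // true
console.log(strictPhone.test('010 - 12345')); // false
console.log(loosePhone.test('010-12345'));    // true
console.log(loosePhone.test('010 - 12345'));  // true
```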
Advanced Matching
For more precise matching, you can use [] to define ranges:
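- [0-9a-zA-Z\_] matches one digit, letter, or underscore.
- [0-9a-zA-Z\_]+ matches a string of at least one digit, letter, or underscore.
- [a-zA-Z\_\$][0-9a-zA-Z\_\$]* matches a string that starts with a letter, underscore, or $, followed by any number of digits, letters, underscores, or $ characters; this is the form of a JavaScript variable name.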
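The variable-name pattern can be tried out as follows. This is a rough sketch: the helper looksLikeIdentifier is a hypothetical name, the pattern is anchored with ^ and $ (explained later in this section), and it only checks the shape of the name, so reserved words such as 'let' would still pass.

```javascript
// Inside [], '_' and '$' need no escaping, so the escapes are dropped here.
const identifierLike = /^[a-zA-Z_$][0-9a-zA-Z_$]*$/;

// Returns true if the name has the shape of an ASCII JavaScript variable name.
function looksLikeIdentifier(name) {
  return identifierLike.test(name);
}

console.log(looksLikeIdentifier('_count')); // true
console.log(looksLikeIdentifier('$el'));    // true
console.log(looksLikeIdentifier('item2'));  // true
console.log(looksLikeIdentifier('2items')); // false: starts with a digit
console.log(looksLikeIdentifier('my-var')); // false: '-' is not allowed
```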
A|B matches A or B, so (J|j)ava(S|s)cript matches 'JavaScript', 'Javascript', 'javaScript', or 'javascript'.
^ denotes the start of a line, while $ denotes the end. For example, ^\d means the string must start with a digit, and \d$ means it must end with a digit.
Note that the pattern js also matches 'jsp', because a pattern only needs to match part of the string. Writing ^js$ forces the whole string to match, so only 'js' qualifies.
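The alternation and anchor rules can be verified with a few calls to test(). The sketch below uses made-up variable names and only the patterns discussed above.

```javascript
// Alternation: either case is accepted for the 'J' and the 'S' only.
const anyCase = /^(J|j)ava(S|s)cript$/;
const spellings = ['JavaScript', 'Javascript', 'javaScript', 'javascript'];
const allSpellingsMatch = spellings.every((s) => anyCase.test(s));
console.log(allSpellingsMatch);          // true
console.log(anyCase.test('JAVASCRIPT')); // false: other letters are case-sensitive

// Anchors: without them a pattern may match just part of the string.
console.log(/js/.test('jsp'));   // true: 'js' is found inside 'jsp'
console.log(/^js$/.test('jsp')); // false: the whole string must be 'js'
console.log(/^js$/.test('js'));  // true
console.log(/^\d/.test('2nd'));  // true: starts with a digit
console.log(/^\d/.test('no2'));  // false: the digit is not at the start
```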
Using RegExp in JavaScript
JavaScript has two ways to create regex:
- Directly using a literal: /regex/.
- Using a RegExp object: new RegExp('regex').
Both methods are equivalent:
javascript
let re1 = /ABC\-001/;
let re2 = new RegExp('ABC\\-001'); // inside a string, the backslash itself must be written as \\
To check whether a regex matches a string, use the test() method:
javascript
let re = /^\d{3}\-\d{3,8}$/;
re.test('010-12345'); // true
re.test('010-1234x'); // false
re.test('010 12345'); // false
Splitting Strings
Using regex to split strings is more flexible than using fixed characters:
javascript
'a b   c'.split(' '); // ['a', 'b', '', '', 'c']
'a b   c'.split(/\s+/); // ['a', 'b', 'c']
'a,b, c d'.split(/[\s\,]+/); // ['a', 'b', 'c', 'd']
'a,b;; c d'.split(/[\s\,\;]+/); // ['a', 'b', 'c', 'd']
Grouping
Besides simple matching, regex can extract substrings using groups, denoted by ():
javascript
let re = /^(\d{3})-(\d{3,8})$/;
re.exec('010-12345'); // ['010-12345', '010', '12345']
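re.exec('010 12345'); // null
On a successful match, exec() returns an array whose first element is the entire matched string, followed by the substring captured by each group. On failure it returns null.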
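Because the result of exec() is an ordinary array, the captured groups can be unpacked with array destructuring. The following sketch wraps this in a hypothetical helper, parsePhone, and the property names areaCode and localNumber are chosen here purely for illustration.

```javascript
const phonePattern = /^(\d{3})-(\d{3,8})$/;

// Returns the captured parts as an object, or null if the text does not match.
function parsePhone(text) {
  const match = phonePattern.exec(text);
  if (match === null) {
    return null;
  }
  // match[0] is the whole match; match[1] and match[2] are the two groups.
  const [whole, areaCode, localNumber] = match;
  return { whole, areaCode, localNumber };
}

console.log(parsePhone('010-12345')); // { whole: '010-12345', areaCode: '010', localNumber: '12345' }
console.log(parsePhone('010 12345')); // null
```

Checking for null before destructuring matters, since destructuring null would throw a TypeError.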
Greedy Matching
Regex matches greedily by default, meaning it tries to match as much as possible. For instance:
javascript
let re = /^(\d+)(0*)$/;
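re.exec('102300'); // ['102300', '102300', '']
Because \d+ is greedy, it consumes all the digits, trailing zeros included, and 0* is left to match the empty string. To make \d+ non-greedy, that is, to match as few characters as possible, add ? after it: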
javascript
let re = /^(\d+?)(0*)$/;
re.exec('102300'); // ['102300', '1023', '00']
Global Search
JavaScript regex has special flags, with g for global matching:
javascript
let r1 = /test/g;
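let r2 = new RegExp('test', 'g');
With the g flag, exec() can be called repeatedly on the same string; after each match the regex updates its lastIndex property to the position where that match ended: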
javascript
let s = 'JavaScript, VBScript, JScript and ECMAScript';
let re = /[a-zA-Z]+Script/g;
re.exec(s); // ['JavaScript']
re.lastIndex; // 10
Flags i and m are for case-insensitive and multiline matching, respectively.
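Calling exec() in a loop is a common way to collect every match from a global regex. The sketch below continues the example string above and also gives a small illustration of the i and m flags; all variable names are illustrative.

```javascript
const text = 'JavaScript, VBScript, JScript and ECMAScript';
const scriptName = /[a-zA-Z]+Script/g;

// exec() resumes from lastIndex on each call and returns null
// once no further match is found (lastIndex is then reset to 0).
const found = [];
let match;
while ((match = scriptName.exec(text)) !== null) {
  found.push({ name: match[0], lastIndex: scriptName.lastIndex });
}
console.log(found);
// [ { name: 'JavaScript', lastIndex: 10 }, { name: 'VBScript', lastIndex: 20 },
//   { name: 'JScript', lastIndex: 29 }, { name: 'ECMAScript', lastIndex: 44 } ]

// i: letter case is ignored.
const ignoresCase = /javascript/i.test('JAVASCRIPT'); // true

// m: ^ and $ also match at the start and end of each line.
const firstWordsWithM = 'one\ntwo'.match(/^\w+/gm);   // ['one', 'two']
const firstWordsWithoutM = 'one\ntwo'.match(/^\w+/g); // ['one']
console.log(ignoresCase, firstWordsWithM, firstWordsWithoutM);
```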
Summary
Regex is powerful, and covering everything in one session is impossible. For frequent regex issues, consider a dedicated reference book.
Exercise
Try writing a regex for validating email addresses. Version one should validate addresses like those in the test below:
javascript
let re = /^$/;
// Tests:
let should_pass = ['someone@gmail.com', 'bill.gates@microsoft.com', 'tom@voyager.org', 'bob2015@163.com'];
let should_fail = ['test#gmail.com', 'bill@microsoft', 'bill%gates@ms.com', '@voyager.org'];
let i, success = true;
for (i = 0; i < should_pass.length; i++) {
    if (!re.test(should_pass[i])) {
        console.log('Test failed: ' + should_pass[i]);
        success = false;
        break;
    }
}
for (i = 0; i < should_fail.length; i++) {
    if (re.test(should_fail[i])) {
        console.log('Test failed: ' + should_fail[i]);
        success = false;
        break;
    }
}
if (success) {
    console.log('Test passed!');
}
Version two can validate and extract named email addresses:
javascript
let re = /^$/;
// Test:
let r = re.exec('<Tom Paris> tom@voyager.org');
if (r === null || r.toString() !== ['<Tom Paris> tom@voyager.org', 'Tom Paris', 'tom@voyager.org'].toString()) {
    console.log('Test failed!');
} else {
    console.log('Test succeeded!');
}