Matching Rules
Regular expressions match patterns from left to right according to specified rules. First, let's look at how to perform exact matches using regular expressions.
The regular expression `abc` matches only the exact string `"abc"`. It cannot match any other string, such as `"ab"`, `"Abc"`, or `"abcd"`.
If a regular expression contains special characters, they need to be escaped using `\`. For example, in the regular expression `a\&c`, the sequence `\&` matches the special character `&`, so the expression matches the string `"a&c"` exactly but cannot match `"ac"`, `"a-c"`, `"a&&c"`, etc.
Note: Regular expressions in Java are also strings, so for the regular expression `a\&c`, the corresponding Java string is `"a\\&c"`, because `\` is also an escape character in Java strings. The two characters `\\` in the string literal represent a single `\`.
```java
// Exact matching with String.matches(), which tests the whole string
public class Main {
    public static void main(String[] args) {
        String re1 = "abc";
        System.out.println("abc".matches(re1)); // true
        System.out.println("Abc".matches(re1)); // false
        System.out.println("abcd".matches(re1)); // false

        String re2 = "a\\&c"; // Corresponds to the regex a\&c
        System.out.println("a&c".matches(re2)); // true
        System.out.println("a-c".matches(re2)); // false
        System.out.println("a&&c".matches(re2)); // false
    }
}
```
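To see why two backslashes are needed, it can help to print the Java string itself. The following is a minimal sketch (the class name `EscapeDemo` and the helper `regexText` are made up for illustration) showing that the literal `"a\\&c"` holds only four characters, `a\&c`, which is the text the regex engine receives:

```java
// Illustrative sketch: the class and method names are hypothetical.
class EscapeDemo {
    // Returns the string that the Java literal "a\\&c" actually contains.
    static String regexText() {
        return "a\\&c";
    }

    public static void main(String[] args) {
        String re = regexText();
        System.out.println(re);                // prints a\&c
        System.out.println(re.length());       // prints 4
        System.out.println("a&c".matches(re)); // prints true
    }
}
```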
If you want to match non-ASCII characters, such as Chinese characters, use their hexadecimal Unicode representation in the form `\u####`, where `####` stands for four hexadecimal digits. For example, `a\u548cc` matches the string `"a和c"`, where the Chinese character "和" has the Unicode code point `548c`.
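As a small illustration, the sketch below writes the same pattern in two ways. With a doubled backslash, the regex engine itself interprets the Unicode escape; with a single backslash, the Java compiler substitutes the character before the regex is ever compiled. Both forms should match the same input. The class name `UnicodeEscapeDemo` and its constants are hypothetical, and the input string is also written with an escape so that the source file stays pure ASCII.

```java
// Illustrative sketch: the class and constant names are hypothetical.
class UnicodeEscapeDemo {
    // The doubled backslash leaves the escape for the regex engine to interpret.
    static final String BY_REGEX = "a\\u548cc";
    // Here the Java compiler substitutes the character itself.
    static final String BY_COMPILER = "a\u548cc";

    public static void main(String[] args) {
        String input = "a\u548cc"; // three characters: a, U+548C, c
        System.out.println(input.matches(BY_REGEX));    // prints true
        System.out.println(input.matches(BY_COMPILER)); // prints true
        System.out.println("abc".matches(BY_REGEX));    // prints false
    }
}
```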
Matching Any Character
Exact matches are not very useful, since we can simply use `String.equals()` instead. In most cases, we want more flexible matching rules. We can use `.` to match any single character.
For example, the regular expression `a.c` matches any three-character string that starts with `a` and ends with `c`, whatever the middle character is:
- `"abc"`, because `.` matches the character `b`.
- `"a&c"`, because `.` matches the character `&`.
- `"acc"`, because `.` matches the character `c`.
However, it cannot match `"ac"` or `"a&&c"`, because `.` matches exactly one character.
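The cases above can be checked with a short program. This is only a sketch; the class name `DotDemo` and its helper method are invented for illustration. The last line adds one case not covered in the text: in Java's default mode, `.` does not match a line terminator such as a newline.

```java
// Illustrative sketch: the class and method names are hypothetical.
class DotDemo {
    // Tests whether the whole input matches the regex a.c
    static boolean matches(String input) {
        return input.matches("a.c");
    }

    public static void main(String[] args) {
        String[] inputs = {"abc", "a&c", "acc", "ac", "a&&c"};
        for (String s : inputs) {
            // Prints true for the first three inputs, false for the last two
            System.out.println(s + " -> " + matches(s));
        }
        // By default, '.' does not match a line terminator
        System.out.println(matches("a\nc")); // prints false
    }
}
```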
Matching Digits
Using `.` matches any character, which is often too broad. If we only want to match the digits `0`-`9`, we can use `\d`. For example, the regular expression `00\d` can match:
- `"007"`, because `\d` matches the character `7`.
- `"008"`, because `\d` matches the character `8`.
It cannot match `"00A"` or `"0077"`, because `\d` matches only a single digit character.
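A quick way to confirm these four cases is the sketch below. The class name `DigitDemo` and its helper are assumptions made for this example; note that the regex `00\d` has to be written as `"00\\d"` in Java source.

```java
// Illustrative sketch: the class and method names are hypothetical.
class DigitDemo {
    // Tests whether the whole input matches the regex 00\d
    static boolean matches(String input) {
        return input.matches("00\\d");
    }

    public static void main(String[] args) {
        System.out.println(matches("007"));  // prints true
        System.out.println(matches("008"));  // prints true
        System.out.println(matches("00A"));  // prints false: A is not a digit
        System.out.println(matches("0077")); // prints false: one digit too many
    }
}
```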
Matching Word Characters
Using `\w` matches a letter, digit, or underscore. The `w` stands for "word". For example, the regular expression `java\w` can match:
- `"javac"`, because `\w` matches the letter `c`.
- `"java9"`, because `\w` matches the digit `9`.
- `"java_"`, because `\w` matches the underscore `_`.
It cannot match `"java#"`, `"java "`, etc., because `\w` does not match `#`, the space, or other such characters.
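The following sketch runs the examples above. The class name `WordCharDemo` and its helper are hypothetical. The final line is an extra case: with Java's default settings, `\w` covers only ASCII letters, digits, and the underscore, so a Chinese character (written here as a Unicode escape for U+4E2D) is not matched.

```java
// Illustrative sketch: the class and method names are hypothetical.
class WordCharDemo {
    // Tests whether the whole input matches the regex java\w
    static boolean matches(String input) {
        return input.matches("java\\w");
    }

    public static void main(String[] args) {
        System.out.println(matches("javac")); // prints true
        System.out.println(matches("java9")); // prints true
        System.out.println(matches("java_")); // prints true
        System.out.println(matches("java#")); // prints false
        System.out.println(matches("java ")); // prints false
        // Default \w is ASCII-only, so U+4E2D is not a word character
        System.out.println(matches("java\u4e2d")); // prints false
    }
}
```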
Matching Whitespace Characters
Using `\s` matches a whitespace character. Note that whitespace characters include not only the space but also the tab character (written as `\t` in Java), among others. For example, the regular expression `a\sc` can match:
- `"a c"`, because `\s` matches the space character.
- `"a\tc"`, because `\s` matches the tab character.
It cannot match `"ac"`, `"abc"`, etc.
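As a sketch of the behavior just described (the class name `WhitespaceDemo` and its helper are invented for this example), the program below checks the space and tab cases and adds two more: a newline also counts as whitespace for `\s`, and two spaces do not match because `\s` stands for exactly one character.

```java
// Illustrative sketch: the class and method names are hypothetical.
class WhitespaceDemo {
    // Tests whether the whole input matches the regex a\sc
    static boolean matches(String input) {
        return input.matches("a\\sc");
    }

    public static void main(String[] args) {
        System.out.println(matches("a c"));  // prints true: space
        System.out.println(matches("a\tc")); // prints true: tab
        System.out.println(matches("a\nc")); // prints true: newline
        System.out.println(matches("ac"));   // prints false: nothing between a and c
        System.out.println(matches("abc"));  // prints false: b is not whitespace
        System.out.println(matches("a  c")); // prints false: two spaces
    }
}
```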
Matching Non-Digits
While `\d` matches a digit, `\D` matches a non-digit. For example, the regular expression `00\D` can match:
- `"00A"`, because `\D` matches the non-digit character `A`.
- `"00#"`, because `\D` matches the non-digit character `#`.
However, `00\D` cannot match `"007"`, `"008"`, etc., because `\D` does not match digits.
Similarly, `\W` matches any character that `\w` does not match, and `\S` matches any character that `\s` does not match. Each uppercase form is the exact opposite of its lowercase counterpart.
```java
// Matching digits (\d) and non-digits (\D)
public class Main {
    public static void main(String[] args) {
        String re1 = "java\\d"; // Corresponds to the regex java\d
        System.out.println("java9".matches(re1)); // true
        System.out.println("java10".matches(re1)); // false
        System.out.println("javac".matches(re1)); // false

        String re2 = "java\\D"; // Corresponds to the regex java\D
        System.out.println("javax".matches(re2)); // true
        System.out.println("java#".matches(re2)); // true
        System.out.println("java5".matches(re2)); // false
    }
}
```
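The other two negated forms, `\W` and `\S`, behave the same way. The sketch below is one possible check; the class name `NegatedClassDemo` and both helper methods are assumptions made for illustration.

```java
// Illustrative sketch: the class and method names are hypothetical.
class NegatedClassDemo {
    // Tests whether the whole input matches the regex java\W
    static boolean endsWithNonWord(String input) {
        return input.matches("java\\W");
    }

    // Tests whether the whole input matches the regex a\Sc
    static boolean hasNonSpaceMiddle(String input) {
        return input.matches("a\\Sc");
    }

    public static void main(String[] args) {
        System.out.println(endsWithNonWord("java#")); // prints true
        System.out.println(endsWithNonWord("java ")); // prints true
        System.out.println(endsWithNonWord("javac")); // prints false
        System.out.println(hasNonSpaceMiddle("abc")); // prints true
        System.out.println(hasNonSpaceMiddle("a c")); // prints false
    }
}
```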
Repeating Matches
`\d` matches a single digit; for example, `A\d` can match `"A0"` and `"A1"`. If you want to match multiple digits, such as `"A380"`, what should you do?
The modifier `*` matches the preceding character any number of times, including zero. `A\d*` can match:
- `"A"`, because `\d*` matches zero digits.
- `"A0"`, because `\d*` matches one digit, `0`.
- `"A380"`, because `\d*` matches multiple digits, `380`.
The modifier `+` matches the preceding character at least once. `A\d+` can match:
- `"A0"`, because `\d+` matches one digit, `0`.
- `"A380"`, because `\d+` matches multiple digits, `380`.
But it cannot match `"A"`, because the `+` modifier requires at least one character.
The modifier `?` matches the preceding character zero times or once. `A\d?` can match:
- `"A"`, because `\d?` matches zero digits.
- `"A0"`, because `\d?` matches one digit, `0`.
But it cannot match `"A380"`, because the `?` modifier cannot match more than one character.
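To compare the three modifiers side by side, the sketch below tries each pattern against the same three inputs. The class name `QuantifierDemo` and the helper `matches` are hypothetical names chosen for this example.

```java
// Illustrative sketch: the class and method names are hypothetical.
class QuantifierDemo {
    // Tests whether the whole input matches the given regex
    static boolean matches(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        String[] regexes = {"A\\d*", "A\\d+", "A\\d?"};
        String[] inputs = {"A", "A0", "A380"};
        for (String re : regexes) {
            for (String s : inputs) {
                System.out.println(re + " vs " + s + " -> " + matches(re, s));
            }
        }
        // Expected results for the inputs A, A0, A380:
        //   A\d*  true,  true, true
        //   A\d+  false, true, true
        //   A\d?  true,  true, false
    }
}
```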
If you want to specify exactly n characters, use the modifier `{n}`. For example, `A\d{3}` can exactly match:
- `"A380"`, because `\d{3}` matches three digits, `380`.
If you want to specify a range of n to m characters, use the modifier `{n,m}`. For example, `A\d{3,5}` can exactly match:
- `"A380"`, because `\d{3,5}` matches three digits, `380`.
- `"A3800"`, because `\d{3,5}` matches four digits, `3800`.
- `"A38000"`, because `\d{3,5}` matches five digits, `38000`.
If there is no upper limit, the modifier `{n,}` matches at least n characters.
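The counted modifiers can be explored the same way. In this sketch (the class name `CountedRepeatDemo` and its helper are made up for illustration), each pattern is tried against inputs with two, three, five, and six digits after the `A`.

```java
// Illustrative sketch: the class and method names are hypothetical.
class CountedRepeatDemo {
    // Tests whether the whole input matches the given regex
    static boolean matches(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        String[] regexes = {"A\\d{3}", "A\\d{3,5}", "A\\d{3,}"};
        String[] inputs = {"A38", "A380", "A38000", "A380000"};
        for (String re : regexes) {
            for (String s : inputs) {
                System.out.println(re + " vs " + s + " -> " + matches(re, s));
            }
        }
        // Expected results for the inputs A38, A380, A38000, A380000:
        //   A\d{3}    false, true, false, false
        //   A\d{3,5}  false, true, true,  false
        //   A\d{3,}   false, true, true,  true
    }
}
```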
Exercise
Please write a regular expression to match domestic Chinese phone numbers: a 3-4 digit area code followed by a 7-8 digit phone number, separated by a hyphen. For example: `010-12345678`.
```java
// Exercise: validate phone numbers of the form <area code>-<number>
import java.util.List;

public class Main {
    public static void main(String[] args) {
        // 3-4 digit area code, a hyphen, then a 7-8 digit number
        String re = "\\d{3,4}-\\d{7,8}";
        // These inputs must match
        for (String s : List.of("010-12345678", "020-9999999", "0755-7654321")) {
            if (!s.matches(re)) {
                System.out.println("Test Failed: " + s);
                return;
            }
        }
        // These inputs must not match
        for (String s : List.of("010 12345678", "A20-9999999", "0755-7654.321")) {
            if (s.matches(re)) {
                System.out.println("Test Failed: " + s);
                return;
            }
        }
        System.out.println("All tests passed!");
    }
}
```
Advanced
Domestic area codes must start with `0`, and phone numbers cannot start with `0`. Modify the regular expression to match these rules more precisely.
Hint: Simple rules like `\d` and `\D` are not sufficient for this. We need more complex rules, which will be explained in detail later.
Summary
Single Character Matching Rules:
| Regular Expression | Rule | Can Match |
|---|---|---|
| `A` | Specific character | A |
| `\u548c` | Specific Unicode character | 和 |
| `.` | Any character | a, b, &, 0 |
| `\d` | Digit 0-9 | 0-9 |
| `\w` | Letter, digit, or underscore | a-z, A-Z, 0-9, _ |
| `\s` | Whitespace (space, Tab) | Space, Tab |
| `\D` | Non-digit | a, A, &, _, ... |
| `\W` | Non-`\w` character | &, @, 中, ... |
| `\S` | Non-`\s` character | a, A, &, _, ... |
Multiple Character Matching Rules:
| Regular Expression | Rule | Can Match |
|---|---|---|
| `A*` | Any number of characters | Empty, A, AA, AAA, ... |
| `A+` | At least one character | A, AA, AAA, ... |
| `A?` | Zero or one character | Empty, A |
| `A{3}` | Exactly three characters | AAA |
| `A{2,3}` | Between two and three characters | AA, AAA |
| `A{2,}` | At least two characters | AA, AAA, AAAA, ... |
| `A{0,3}` | At most three characters | Empty, A, AA, AAA |
By understanding and utilizing these regular expression rules, you can efficiently perform complex string matching and validation tasks in your Java programs without the need for extensive and repetitive code.