
Matching Rules

Regular expressions match patterns from left to right according to specified rules. First, let's look at how to perform exact matches using regular expressions.

The regular expression abc matches only the string "abc". It does not match any other string, such as "ab", "Abc", or "abcd".

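If a regular expression contains special characters, they need to be escaped using \. For example, in the regular expression a\&c, \& matches a literal &, so the expression matches the string "a&c" but not "ac", "a-c", or "a&&c". (In Java, & has no special meaning outside character classes, so the escape is optional here. Characters such as ., *, +, ?, and \ itself must be escaped to be matched literally.)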

Note: In Java, a regular expression is written as a string, so the regular expression a\&c corresponds to the Java string "a\\&c", because \ is also the escape character in Java string literals. The two characters \\ in the literal represent a single \.

java
// regex
public class Main {
    public static void main(String[] args) {
        String re1 = "abc";
        System.out.println("abc".matches(re1));    // true
        System.out.println("Abc".matches(re1));    // false
        System.out.println("abcd".matches(re1));   // false

        String re2 = "a\\&c"; // Corresponds to the regex a\&c
        System.out.println("a&c".matches(re2));    // true
        System.out.println("a-c".matches(re2));    // false
        System.out.println("a&&c".matches(re2));   // false
    }
}

If you want to match non-ASCII characters, such as Chinese characters, use their four-digit hexadecimal Unicode representation, \u####. For example, a\u548cc matches the string "a和c", where the Chinese character "和" has the Unicode code point U+548C.
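As a minimal sketch of the Unicode escape described above, the following passes the regex a\u548cc to String.matches(). The class name UnicodeEscapeDemo and the helper matchesHe are hypothetical names chosen for illustration. Note that the Java string is written "a\\u548cc", so the \u548c escape reaches the regex engine rather than being resolved by the Java compiler.

```java
class UnicodeEscapeDemo {
    // Hypothetical helper: true if the whole input matches the regex
    // made of 'a', the escape for U+548C, and 'c'.
    static boolean matchesHe(String input) {
        // The doubled backslash keeps the escape intact for the regex engine.
        return input.matches("a\\u548cc");
    }

    public static void main(String[] args) {
        // The input below is 'a', the character U+548C, then 'c'.
        System.out.println(matchesHe("a\u548cc")); // true
        System.out.println(matchesHe("abc"));      // false
        System.out.println(matchesHe("ac"));       // false
    }
}
```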

Matching Any Character

Exact matches are not very useful since we can directly use String.equals(). In most cases, we want more flexible matching rules. We can use . to match any single character.

For example, the regular expression a.c matches any three-character string that starts with a, ends with c, and has any single character in between:

  • "abc", because . matches the character b.
  • "a&c", because . matches the character &.
  • "acc", because . matches the character c.

However, it cannot match "ac" or "a&&c" because . matches exactly one character.
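The cases above can be checked with a short sketch. The class name DotDemo and the helper matchesDot are hypothetical. The last line shows one caveat that is easy to miss: by default, . in Java does not match line terminators such as \n.

```java
class DotDemo {
    // Hypothetical helper: true if the whole input matches the regex a.c
    static boolean matchesDot(String input) {
        return input.matches("a.c");
    }

    public static void main(String[] args) {
        System.out.println(matchesDot("abc"));   // true
        System.out.println(matchesDot("a&c"));   // true
        System.out.println(matchesDot("acc"));   // true
        System.out.println(matchesDot("ac"));    // false: no middle character
        System.out.println(matchesDot("a&&c"));  // false: two middle characters
        System.out.println(matchesDot("a\nc"));  // false: . skips line terminators by default
    }
}
```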

Matching Digits

Using . can match any character, which is too broad. If we only want to match digits 0-9, we can use \d. For example, the regular expression 00\d can match:

  • "007", because \d matches the character 7.
  • "008", because \d matches the character 8.

It cannot match "00A" or "0077" because \d only matches a single digit character.
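A small sketch of the 00\d examples, assuming a hypothetical class DigitDemo with a helper matchesAgent. As before, the backslash is doubled in the Java string.

```java
class DigitDemo {
    // Hypothetical helper: true if the whole input matches the regex 00\d
    static boolean matchesAgent(String input) {
        return input.matches("00\\d");
    }

    public static void main(String[] args) {
        System.out.println(matchesAgent("007"));  // true
        System.out.println(matchesAgent("008"));  // true
        System.out.println(matchesAgent("00A"));  // false: A is not a digit
        System.out.println(matchesAgent("0077")); // false: \d matches only one digit
    }
}
```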

Matching Word Characters

Using \w matches a letter, digit, or underscore. The w stands for "word". For example, the regular expression java\w can match:

  • "javac", because \w matches the letter c.
  • "java9", because \w matches the digit 9.
  • "java_", because \w matches the underscore _.

It cannot match "java#", "java ", etc., because \w does not match #, space, and other characters.
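The java\w examples can be verified with a sketch like the following. WordCharDemo and matchesJavaWord are hypothetical names used only for this illustration.

```java
class WordCharDemo {
    // Hypothetical helper: true if the whole input matches the regex java\w
    static boolean matchesJavaWord(String input) {
        return input.matches("java\\w");
    }

    public static void main(String[] args) {
        System.out.println(matchesJavaWord("javac")); // true: letter
        System.out.println(matchesJavaWord("java9")); // true: digit
        System.out.println(matchesJavaWord("java_")); // true: underscore
        System.out.println(matchesJavaWord("java#")); // false
        System.out.println(matchesJavaWord("java ")); // false: space is not a word character
    }
}
```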

Matching Whitespace Characters

Using \s matches a whitespace character. Note that whitespace characters include spaces and tab characters (represented as \t in Java). For example, the regular expression a\sc can match:

  • "a c", because \s matches the space character.
  • "a\tc", because \s matches the tab character.

It cannot match "ac" or "abc", etc.
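A brief sketch of the a\sc examples follows. The class WhitespaceDemo and its helper matchesGap are hypothetical. It also shows that a newline counts as whitespace for \s.

```java
class WhitespaceDemo {
    // Hypothetical helper: true if the whole input matches the regex a\sc
    static boolean matchesGap(String input) {
        return input.matches("a\\sc");
    }

    public static void main(String[] args) {
        System.out.println(matchesGap("a c"));   // true: space
        System.out.println(matchesGap("a\tc"));  // true: tab
        System.out.println(matchesGap("a\nc"));  // true: newline is also whitespace
        System.out.println(matchesGap("ac"));    // false: nothing between a and c
        System.out.println(matchesGap("abc"));   // false: b is not whitespace
    }
}
```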

Matching Non-Digits

While \d matches a digit, \D matches a non-digit. For example, the regular expression 00\D can match:

  • "00A", because \D matches the non-digit character A.
  • "00#", because \D matches the non-digit character #.

However, 00\D cannot match "007" or "008", etc., because \D does not match digits.

Similarly, \W matches characters that \w does not match, and \S matches characters that \s does not match. These are exact opposites.

java
// regex
public class Main {
    public static void main(String[] args) {
        String re1 = "java\\d"; // Corresponds to the regex java\d
        System.out.println("java9".matches(re1));    // true
        System.out.println("java10".matches(re1));   // false
        System.out.println("javac".matches(re1));    // false

        String re2 = "java\\D";
        System.out.println("javax".matches(re2));     // true
        System.out.println("java#".matches(re2));     // true
        System.out.println("java5".matches(re2));     // false
    }
}
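The same idea applies to \W and \S. The sketch below uses a hypothetical class NegatedClassDemo with a generic helper check to show each negated rule rejecting exactly what its lowercase counterpart accepts.

```java
class NegatedClassDemo {
    // Hypothetical helper: true if the whole input matches the given regex
    static boolean check(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // java\W: "java" followed by one non-word character
        System.out.println(check("java\\W", "java#")); // true
        System.out.println(check("java\\W", "java ")); // true
        System.out.println(check("java\\W", "javac")); // false: c is a word character

        // a\Sc: a, one non-whitespace character, c
        System.out.println(check("a\\Sc", "abc"));     // true
        System.out.println(check("a\\Sc", "a c"));     // false: space is whitespace
    }
}
```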

Repeating Matches

Using \d matches a single digit, for example, A\d can match "A0", "A1". If you want to match multiple digits, such as "A380", what should you do?

  • The modifier * matches the preceding element any number of times, including zero. Using A\d* can match:

    • "A", because \d* matches zero digits.
    • "A0", because \d* matches one digit 0.
    • "A380", because \d* matches multiple digits 380.
  • The modifier + matches the preceding element at least once. Using A\d+ can match:

    • "A0", because \d+ matches one digit 0.
    • "A380", because \d+ matches multiple digits 380.

    But it cannot match "A", because the + modifier requires at least one character.

  • The modifier ? matches the preceding element zero times or once. Using A\d? can match:

    • "A", because \d? matches zero digits.
    • "A0", because \d? matches one digit 0.

    But it cannot match "A380", because the ? modifier cannot match more than one character.
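The three modifiers above can be compared side by side with a sketch like this one. QuantifierDemo and its helper check are hypothetical names. Each pattern is tried against "A", "A0", and "A380".

```java
class QuantifierDemo {
    // Hypothetical helper: true if the whole input matches the given regex
    static boolean check(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // A\d* : zero or more digits after A
        System.out.println(check("A\\d*", "A"));     // true
        System.out.println(check("A\\d*", "A0"));    // true
        System.out.println(check("A\\d*", "A380"));  // true

        // A\d+ : one or more digits after A
        System.out.println(check("A\\d+", "A"));     // false
        System.out.println(check("A\\d+", "A0"));    // true
        System.out.println(check("A\\d+", "A380"));  // true

        // A\d? : zero or one digit after A
        System.out.println(check("A\\d?", "A"));     // true
        System.out.println(check("A\\d?", "A0"));    // true
        System.out.println(check("A\\d?", "A380"));  // false
    }
}
```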

If you want to match exactly n repetitions, use the modifier {n}. For example, A\d{3} can exactly match:

  • "A380", because \d{3} matches three digits 380.

If you want to match between n and m repetitions, use the modifier {n,m}. For example, A\d{3,5} can exactly match:

  • "A380", because \d{3,5} matches three digits 380.
  • "A3800", because \d{3,5} matches four digits 3800.
  • "A38000", because \d{3,5} matches five digits 38000.

If there is no upper limit, use the modifier {n,} to match at least n repetitions. For example, A\d{3,} matches "A380", "A3800", and any longer run of digits after A.
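A minimal sketch of the counted modifiers {n}, {n,m}, and {n,}, assuming a hypothetical class CountedRepeatDemo with a generic helper check:

```java
class CountedRepeatDemo {
    // Hypothetical helper: true if the whole input matches the given regex
    static boolean check(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // A\d{3} : exactly three digits
        System.out.println(check("A\\d{3}", "A380"));      // true
        System.out.println(check("A\\d{3}", "A38"));       // false: only two digits
        System.out.println(check("A\\d{3}", "A3800"));     // false: four digits

        // A\d{3,5} : three to five digits
        System.out.println(check("A\\d{3,5}", "A380"));    // true
        System.out.println(check("A\\d{3,5}", "A38000"));  // true
        System.out.println(check("A\\d{3,5}", "A380000")); // false: six digits

        // A\d{3,} : three or more digits
        System.out.println(check("A\\d{3,}", "A38"));      // false
        System.out.println(check("A\\d{3,}", "A3800000")); // true
    }
}
```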

Exercise

Please write a regular expression to match domestic Chinese phone numbers: 3-4 digit area code followed by a 7-8 digit phone number, separated by a hyphen. For example: 010-12345678.

java
// regex
import java.util.*;

public class Main {
    public static void main(String[] args) throws Exception {
        String re = "\\d{3,4}-\\d{7,8}";
        for (String s : List.of("010-12345678", "020-9999999", "0755-7654321")) {
            if (!s.matches(re)) {
                System.out.println("Test Failed: " + s);
                return;
            }
        }
        for (String s : List.of("010 12345678", "A20-9999999", "0755-7654.321")) {
            if (s.matches(re)) {
                System.out.println("Test Failed: " + s);
                return;
            }
        }
        System.out.println("All tests passed!");
    }
}


Advanced

Domestic area codes must start with 0, and phone numbers cannot start with 0. Modify the regular expression to more precisely match these rules.

Hint: Simple rules like \d and \D are not sufficient for this. We need more complex rules, which will be explained in detail later.

Summary

Single Character Matching Rules:

Regular Expression | Rule | Can Match
--- | --- | ---
A | Specific character | A
\u548c | Specific Unicode character | 和
. | Any character | a, b, &, 0
\d | Digit 0-9 | 0-9
\w | Letter, digit, or underscore | a-z, A-Z, 0-9, _
\s | Whitespace (space, Tab) | Space, Tab
\D | Non-digit | a, A, &, _, ...
\W | Non-\w character | &, @, 中, ...
\S | Non-\s character | a, A, &, _, ...

Multiple Character Matching Rules:

Regular Expression | Rule | Can Match
--- | --- | ---
A* | Any number of characters | Empty, A, AA, AAA, ...
A+ | At least one character | A, AA, AAA, ...
A? | Zero or one character | Empty, A
A{3} | Exactly three characters | AAA
A{2,3} | Between two and three characters | AA, AAA
A{2,} | At least two characters | AA, AAA, AAAA, ...
A{0,3} | At most three characters | Empty, A, AA, AAA

With these rules, many string matching and validation checks in Java can be written as a single pattern instead of hand-written, character-by-character code.
