Regular Expressions

Strings are among the most commonly used data types in programming, and manipulating them is a routine task. For instance, to check whether a string is a valid email address, one could extract the substrings before and after the '@' symbol and verify each part separately. However, this approach is cumbersome and hard to reuse.

Regular expressions are a powerful tool for matching strings. The idea is to describe a rule for strings in a small pattern language. Any string that conforms to the rule is said to match; any other string does not.

To determine if a string is a valid email, we can:

Create a regular expression that matches email formats.
Use this regex to check if the user input is valid.

Since regular expressions are also represented as strings, we first need to understand how characters can describe characters.

In regex, a literal character matches itself exactly. \d matches a digit, and \w matches a letter, digit, or underscore. For example:

'00\d' matches '007' but not '00A'.
'\d\d\d' matches '010'.
'\w\w\d' matches 'py3'.
. matches any single character except a newline, so 'py.' matches 'pyc', 'pyo', 'py!', etc.
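
The examples above can be checked with Python's re module, which is introduced later in this section. The following is a minimal sketch; the helper name `matches` is hypothetical and only wraps `re.fullmatch`, which succeeds when the whole string matches the pattern.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if `text` matches `pattern` in its entirety."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'00\d', '007'))    # True
print(matches(r'00\d', '00A'))    # False: 'A' is not a digit
print(matches(r'\w\w\d', 'py3'))  # True
print(matches(r'py.', 'py!'))     # True
print(matches(r'py.', 'py'))      # False: '.' requires one more character
```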

To match a variable number of characters, place a quantifier after the element to repeat:

* for any number of repetitions (including zero).
+ for at least one repetition.
? for zero or one repetition.
{n} for exactly n repetitions.
{n,m} for between n and m repetitions (inclusive).

For a complex example: \d{3}\s+\d{3,8} can be interpreted as:

\d{3} matches 3 digits, e.g., '010'.
\s+ matches at least one whitespace (space, tab, etc.).
\d{3,8} matches 3 to 8 digits, e.g., '1234567'.

Overall, this regex matches a phone number whose area code is separated from the local number by one or more whitespace characters.
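
As a quick illustration of how the quantifiers combine, the sketch below tries the pattern against a few sample strings. The sample inputs are made up for this example.

```python
import re

pattern = r'\d{3}\s+\d{3,8}'

# Made-up sample inputs; each is tested against the whole pattern.
samples = ['010 12345', '010   1234567', '01012345', '010 12']
results = {s: re.fullmatch(pattern, s) is not None for s in samples}

for text, ok in results.items():
    print(repr(text), ok)
```

The third sample fails because \s+ needs at least one whitespace character, and the fourth fails because \d{3,8} needs at least three digits.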

To match a number like '010-12345', we can write \d{3}\-\d{3,8}. Outside of [], '-' is not a special character, so the backslash is optional here, but escaping it is harmless.

However, this still won't match '010 - 12345' because of the spaces around the hyphen. Thus, we need a more complex pattern.
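
One possible pattern that tolerates spaces around the hyphen is sketched below. It is an assumption for illustration, not a pattern given in this text: \s* allows zero or more whitespace characters on each side of the hyphen.

```python
import re

# Assumed pattern: optional whitespace on either side of the hyphen.
flexible = r'\d{3}\s*-\s*\d{3,8}'

print(re.fullmatch(flexible, '010-12345') is not None)    # True
print(re.fullmatch(flexible, '010 - 12345') is not None)  # True
print(re.fullmatch(flexible, '010 12345') is not None)    # False: the hyphen is required
```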

Advanced Techniques

To be more precise about which characters are allowed, we can use [] to define a character set, which may include ranges:

[0-9a-zA-Z\_] matches a digit, letter, or underscore.
[0-9a-zA-Z\_]+ matches strings composed of at least one digit, letter, or underscore, e.g., 'a100', '0_Z', 'Py3000'.
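[a-zA-Z\_][0-9a-zA-Z\_]* matches strings that start with a letter or underscore, followed by any number of digits, letters, or underscores. This is the form of an ASCII Python variable name.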
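
The variable-name pattern can be tried out as follows. This is a sketch under two assumptions: the function name `looks_like_variable_name` is hypothetical, and the extra `keyword.iskeyword` check is added here because the pattern alone would also accept reserved words such as 'class'.

```python
import keyword
import re

# The underscore needs no backslash inside [], so it is written plainly here.
IDENTIFIER = r'[a-zA-Z_][0-9a-zA-Z_]*'

def looks_like_variable_name(name):
    """Hypothetical helper: ASCII identifier shape, excluding Python keywords."""
    return re.fullmatch(IDENTIFIER, name) is not None and not keyword.iskeyword(name)

print(looks_like_variable_name('Py3000'))  # True
print(looks_like_variable_name('_tmp'))    # True
print(looks_like_variable_name('0_Z'))     # False: starts with a digit
print(looks_like_variable_name('class'))   # False: reserved keyword
```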

A|B matches either A or B, so (P|p)ython matches 'Python' or 'python'.

^ indicates the start of a line; for instance, ^\d means it must start with a digit.

$ indicates the end of a line; for example, \d$ means it must end with a digit.

Note that py also matches 'python', because a pattern may match just part of a string, whereas ^py$ matches only the exact string 'py'.
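
The effect of alternation and anchors can be seen in a short sketch. It uses `re.search`, which looks for the pattern anywhere in the string; the sample strings are made up for this example.

```python
import re

partial = re.search(r'py', 'python')       # finds 'py' at the start of 'python'
exact_fail = re.search(r'^py$', 'python')  # None: extra characters follow 'py'
exact_ok = re.search(r'^py$', 'py')        # matches the whole string

# Alternation inside a group: either 'P' or 'p', then 'ython'.
either = [bool(re.fullmatch(r'(P|p)ython', s)) for s in ('Python', 'python', 'PYTHON')]
print(either)  # [True, True, False]

starts_with_digit = bool(re.search(r'^\d', '3 apples'))
ends_with_digit = bool(re.search(r'\d$', 'apples 3'))
print(starts_with_digit, ends_with_digit)  # True True
```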

The `re` Module

With this knowledge, we can use regex in Python. The re module provides the regex functionality. Because Python string literals also use \ as an escape character, backslashes in a pattern need care:

python

s = 'ABC\\-001'  # Python string
# Corresponding regex string becomes 'ABC\-001'

We strongly recommend using Python's r prefix to avoid escape issues:

python

s = r'ABC\-001'  # Python string
# Corresponding regex string remains: 'ABC\-001'
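
A small check, shown here only as an illustration, confirms that the two spellings produce the same pattern string:

```python
import re

escaped = 'ABC\\-001'  # ordinary literal: the backslash is written twice
raw = r'ABC\-001'      # raw literal: the backslash is written once

print(escaped == raw)  # True: both are the same 8-character string
print(len(raw))        # 8
print(re.match(raw, 'ABC-001') is not None)  # True
```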

To check if a regex matches:

python

import re
re.match(r'^\d{3}\-\d{3,8}$', '010-12345')

re.match() tries to match the pattern at the beginning of the string and returns a Match object on success or None otherwise. A common way to check is:

python

test = 'User input string'
if re.match(r'Regex pattern', test):
    print('ok')
else:
    print('failed')
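
The sketch below fills in the placeholders with a concrete pattern and shows that re.match only looks at the beginning of the string. The helper name `describe` is hypothetical; `m.group(0)` returns the text that was matched.

```python
import re

def describe(pattern, text):
    """Hypothetical helper: report whether `pattern` matches at the start of `text`."""
    m = re.match(pattern, text)
    if m is None:
        return 'failed'
    return 'ok: matched ' + repr(m.group(0))

print(describe(r'^\d{3}-\d{3,8}$', '010-12345'))  # ok: matched '010-12345'
print(describe(r'^\d{3}-\d{3,8}$', '010 12345'))  # failed
print(describe(r'\d{3}', '010abc'))               # ok: matched '010'
print(describe(r'\d{3}', 'abc010'))               # failed: digits are not at the start
```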

Splitting Strings

Using regex to split strings is more flexible than using fixed characters:

python

'a b   c'.split(' ')  # Result: ['a', 'b', '', '', 'c']

Consecutive spaces produce empty strings in the result. Using regex:

python

re.split(r'\s+', 'a b   c')  # Result: ['a', 'b', 'c']

This works regardless of the number of spaces. Other separators can be added to the character set:

python

re.split(r'[\s\,]+', 'a,b, c  d')  # Result: ['a', 'b', 'c', 'd']
re.split(r'[\s\,\;]+', 'a,b;; c  d')  # Result: ['a', 'b', 'c', 'd']

When users enter a list of tags, a regex split like this is a convenient way to turn irregular input into a clean list.
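
A tag parser along these lines might look like the sketch below. The function name `parse_tags` and the sample input are assumptions for illustration. Note that a separator at the very start or end of the input makes re.split return an empty string, so the sketch filters those out.

```python
import re

def parse_tags(raw):
    """Hypothetical helper: split on runs of whitespace, commas, or semicolons."""
    parts = re.split(r'[\s,;]+', raw.strip())
    # Drop empty strings left by leading or trailing separators.
    return [p for p in parts if p]

print(parse_tags(' python, regex;;  tutorial ,'))  # ['python', 'regex', 'tutorial']
print(parse_tags(''))                              # []
```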

Grouping

Regex also allows extracting substrings using groups, indicated by ():

python

^(\d{3})-(\d{3,8})$

This defines two groups for area code and local number. For example:

python

m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
m.group(0)  # '010-12345'
m.group(1)  # '010'
m.group(2)  # '12345'

group(0) always returns the entire match, while group(1), group(2), etc., return the respective substrings.
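
In practice the match may fail, in which case re.match returns None and calling group() on it raises an error. A minimal sketch of a guarded version follows; the function name `parse_phone` is hypothetical.

```python
import re

def parse_phone(text):
    """Hypothetical helper: return (area_code, local_number), or None if no match."""
    m = re.match(r'^(\d{3})-(\d{3,8})$', text)
    if m is None:
        return None
    # groups() returns all captured groups as a tuple, in order.
    area_code, local_number = m.groups()
    return area_code, local_number

print(parse_phone('010-12345'))  # ('010', '12345')
print(parse_phone('010-12'))     # None
```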

Extracting substrings can be very useful. Here's a more complex example:

python

t = '19:05:30'
m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
m.groups()  # ('19', '05', '30')

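This regex validates a time of day. Some checks, however, are hard to express with regex alone. Dates are an example: a pattern can easily restrict the month to 1-12 and the day to 1-31, but it cannot reasonably reject impossible dates such as February 30 or April 31, so that kind of validation is better finished in code.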
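
One way to handle dates is to let the regex check only the shape of the input and leave the calendar rules to the standard library. The sketch below assumes a YYYY-M-D input format and a hypothetical function name `is_valid_date`; `datetime.date` raises ValueError for dates that do not exist.

```python
import re
from datetime import date

def is_valid_date(text):
    """Hypothetical helper: regex checks the shape, datetime.date checks the calendar."""
    m = re.match(r'^(\d{4})-(\d{1,2})-(\d{1,2})$', text)
    if m is None:
        return False
    year, month, day = (int(g) for g in m.groups())
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True

print(is_valid_date('2024-02-29'))  # True: 2024 is a leap year
print(is_valid_date('2023-02-29'))  # False
print(is_valid_date('2024-4-31'))   # False: April has 30 days
```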

Greedy Matching

Regex matching is greedy by default, meaning each quantifier matches as many characters as possible. For instance, suppose we want to capture the trailing zeros of a number:

python

re.match(r'^(\d+)(0*)$', '102300').groups()  # ('102300', '')

Here, \d+ is greedy and consumes all the digits, leaving 0* to match an empty string. To make \d+ non-greedy (match as few characters as possible) so that 0* can capture the trailing zeros, add a ? after the quantifier:

python

re.match(r'^(\d+?)(0*)$', '102300').groups()  # ('1023', '00')
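
To see how the non-greedy version behaves on other inputs, the sketch below wraps it in a hypothetical helper named `split_trailing_zeros`. Because \d+? must still match at least one digit, an all-zero input keeps its first zero in the first group.

```python
import re

def split_trailing_zeros(digits):
    """Hypothetical helper: return (leading part, trailing zeros), or None if no match."""
    m = re.match(r'^(\d+?)(0*)$', digits)
    return m.groups() if m else None

print(split_trailing_zeros('102300'))  # ('1023', '00')
print(split_trailing_zeros('1000'))    # ('1', '000')
print(split_trailing_zeros('123'))     # ('123', '')
print(split_trailing_zeros('000'))     # ('0', '00')
```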

Compilation

When using regex in Python, the re module does two things:

Compiles the regex; an error occurs if the regex string is invalid.
Matches the string using the compiled regex.

If a regex is used many times, it is more efficient to compile it once in advance and reuse the compiled object:

python

import re
# Compile:
re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
# Usage:
re_telephone.match('010-12345').groups()  # ('010', '12345')

Compiling creates a Regular Expression object that includes the regex, so you don’t need to provide the regex string again when calling methods.
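
The sketch below reuses one compiled pattern across several inputs and also shows the compile-time error mentioned above. The sample numbers and the deliberately broken pattern are made up for this example; an invalid pattern raises `re.error`.

```python
import re

re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')

# Made-up inputs; the compiled pattern is reused for every item.
numbers = ['010-12345', '021-7654321', '10-12345', '010-12']
parsed = [m.groups() for m in map(re_telephone.match, numbers) if m]
print(parsed)  # [('010', '12345'), ('021', '7654321')]

# An invalid pattern fails at compile time, before any string is matched.
compile_error = None
try:
    re.compile(r'(\d{3}')  # unbalanced parenthesis
except re.error as exc:
    compile_error = str(exc)
print(compile_error is not None)  # True
```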

Summary

Regular expressions are incredibly powerful, and covering everything in a single section is impossible. A comprehensive guide to regex could fill a book. If you frequently deal with regex issues, consider getting a reference book on the topic.

Exercise

Try to write a regex to validate email addresses. Version one should accept addresses such as someone@gmail.com and bill.gates@microsoft.com, and pass the tests below:

python

import re

def is_valid_email(addr):
    return True

# Tests:
assert is_valid_email('someone@gmail.com')
assert is_valid_email('bill.gates@microsoft.com')
assert not is_valid_email('bob#example.com')
assert not is_valid_email('mr-bob@example.com')
print('ok')

Version two should extract names from email addresses:

<Tom Paris> tom@voyager.org => Tom Paris
bob@example.com => bob

python

import re

def name_of_email(addr):
    return None

# Tests:
assert name_of_email('<Tom Paris> tom@voyager.org') == 'Tom Paris'
assert name_of_email('tom@voyager.org') == 'tom'
print('ok')
