Regular Expressions
Strings are among the most commonly used data types in programming, and the need to manipulate them is ubiquitous. For instance, to check whether a string is a valid email address, one could extract the substrings before and after the '@' symbol and verify each part separately. However, this approach is cumbersome and makes the code hard to reuse.
Regular expressions are a powerful tool for matching strings. The idea is to define a rule for strings using a descriptive language. Any string that conforms to this rule is considered "matched"; otherwise, it does not match and is treated as invalid.
To determine if a string is a valid email, we can:
- Create a regular expression that matches email formats.
- Use this regex to check if the user input is valid.
Since regular expressions are also represented as strings, we first need to understand how characters can describe characters.
In regex, a literal character matches exactly that character. \d matches a digit, and \w matches a letter, digit, or underscore. For example:
- '00\d' matches '007' but not '00A'.
- '\d\d\d' matches '010'.
- '\w\w\d' matches 'py3'.
- . matches any single character (by default, anything except a newline), so 'py.' matches 'pyc', 'pyo', 'py!', etc.
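The examples above can be checked with Python's re module, which this section covers in detail further on. The following is a minimal sketch; the helper name `matches` is made up for illustration, and it uses `re.fullmatch` so that the whole string has to fit the pattern.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'00\d', '007'))    # True
print(matches(r'00\d', '00A'))    # False: 'A' is not a digit
print(matches(r'\d\d\d', '010'))  # True
print(matches(r'\w\w\d', 'py3'))  # True
print(matches(r'py.', 'py!'))     # True: '.' accepts any single character
print(matches(r'py.', 'py\n'))    # False: '.' skips newlines by default
```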
To match a variable number of characters, regex uses quantifiers, each of which applies to the item just before it:
- * for any number of repetitions (including zero).
- + for at least one repetition.
- ? for zero or one repetition.
- {n} for exactly n repetitions.
- {n,m} for between n and m repetitions.
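As a quick illustration of the quantifiers listed above, the sketch below applies each one to a single literal character. The helper `matches` is a hypothetical name used only here; it relies on `re.fullmatch`, so a pattern succeeds only if it accounts for the entire string.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'ab*', 'a'))         # True: zero 'b' is allowed
print(matches(r'ab*', 'abbb'))      # True
print(matches(r'ab+', 'a'))         # False: at least one 'b' is required
print(matches(r'ab?', 'abb'))       # False: at most one 'b' is allowed
print(matches(r'a{3}', 'aaa'))      # True: exactly three 'a'
print(matches(r'a{2,4}', 'aaaaa'))  # False: five 'a' is more than four
```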
For a more complex example, \d{3}\s+\d{3,8} can be read from left to right as:
- \d{3} matches 3 digits, e.g., '010'.
- \s+ matches at least one whitespace character (space, tab, etc.).
- \d{3,8} matches 3 to 8 digits, e.g., '1234567'.
Overall, this regex matches a phone number whose area code is separated from the local number by one or more whitespace characters.
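To see the three parts of this pattern working together, here is a small sketch that tries it against a few sample strings. The sample inputs are invented for illustration, and `fullmatch` is used so that extra leading or trailing text counts as a failure.

```python
import re

phone = re.compile(r'\d{3}\s+\d{3,8}')

# Sample inputs are illustrative only.
samples = ['010 12345', '010    1234567', '01012345', '010 12']
results = {text: phone.fullmatch(text) is not None for text in samples}

for text, ok in results.items():
    print(repr(text), ok)
# '010 12345' True
# '010    1234567' True
# '01012345' False   (no whitespace between the two parts)
# '010 12' False     (fewer than 3 digits after the whitespace)
```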
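To match a number like '010-12345', we put a '-' between the two parts. Strictly speaking, '-' is only special inside the [] ranges covered below, so escaping it here is optional but harmless; written with the escape, the pattern is \d{3}\-\d{3,8}.
However, this still won't match '010 - 12345' because of the spaces around the hyphen. Thus, we need a more flexible pattern.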
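One possible way to tolerate spaces around the hyphen, offered here only as a sketch and not as the pattern this tutorial goes on to build, is to allow optional whitespace with \s* on both sides of the '-'.

```python
import re

# \s* permits zero or more whitespace characters on either side of the hyphen.
flexible = re.compile(r'\d{3}\s*-\s*\d{3,8}')

samples = ['010-12345', '010 - 12345', '010 -12345', '010 12345']
results = {text: flexible.fullmatch(text) is not None for text in samples}

for text, ok in results.items():
    print(repr(text), ok)
# '010-12345' True
# '010 - 12345' True
# '010 -12345' True
# '010 12345' False  (the hyphen itself is still required)
```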
Advanced Techniques
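For more precise matching, we can use [] to define a set or range of allowed characters, such as:
- [0-9a-zA-Z\_] matches a single digit, letter, or underscore.
- [0-9a-zA-Z\_]+ matches a string composed of at least one digit, letter, or underscore, e.g., 'a100', '0_Z', 'Py3000'.
- [a-zA-Z\_][0-9a-zA-Z\_]* matches a string that starts with a letter or underscore, followed by any number of digits, letters, or underscores; that is, a valid (ASCII) Python variable name.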
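The variable-name pattern can be tried out as follows. This is a sketch under two assumptions worth stating: it covers ASCII names only (Python 3 also accepts many Unicode letters), and it does not reject reserved words such as `class`. The sample names are invented.

```python
import re

# Inside [], the underscore needs no escape, so '_' and '\_' behave the same.
identifier = re.compile(r'[a-zA-Z_][0-9a-zA-Z_]*')

samples = ['a100', '_private', 'Py3000', '3rd', 'my-var']
results = {name: identifier.fullmatch(name) is not None for name in samples}

for name, ok in results.items():
    print(name, ok)
# a100 True
# _private True
# Py3000 True
# 3rd False     (starts with a digit)
# my-var False  ('-' is not in the allowed set)
```

For real code, the built-in `str.isidentifier()` method is usually a simpler way to run this check.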
A|B matches either A or B, so (P|p)ython matches 'Python' or 'python'.
^ indicates the start of a line; for instance, ^\d means the line must start with a digit.
$ indicates the end of a line; for example, \d$ means the line must end with a digit.
Note that py can also match 'python', but ^py$ has to match the whole line, so it matches only 'py'.
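The effect of |, ^, and $ can be observed with `re.search`, which looks for the pattern anywhere in the string, so the anchors are what pin it to the start or end. The sample strings below are made up for illustration.

```python
import re

def found(pattern, text):
    """Hypothetical helper: True if the pattern occurs anywhere in text."""
    return re.search(pattern, text) is not None

print(found(r'py', 'python'))                # True: 'py' occurs at the start
print(found(r'^py$', 'python'))              # False: the whole line must be 'py'
print(found(r'^py$', 'py'))                  # True
print(found(r'(P|p)ython', 'I like Python')) # True
print(found(r'^\d', '3 apples'))             # True: starts with a digit
print(found(r'\d$', 'apples 3'))             # True: ends with a digit
print(found(r'^\d', 'apples 3'))             # False
```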
The re Module
With this knowledge, we can use regex in Python. The re module provides all of the regex functionality. Since Python strings themselves also use \ for escaping, backslashes need special care:
python
s = 'ABC\\-001' # Python string
# Corresponding regex string becomes 'ABC\-001'
We strongly recommend using Python's r prefix (a raw string), so that backslashes need no extra escaping:
python
s = r'ABC\-001' # Python string
# Corresponding regex string remains: 'ABC\-001'
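To confirm that the two spellings above denote the same regex string, the short sketch below compares them directly.

```python
plain = 'ABC\\-001'  # backslash escaped by hand
raw = r'ABC\-001'    # raw string: the backslash is kept as written

print(plain == raw)  # True
print(len(raw))      # 8: A, B, C, \, -, 0, 0, 1
print(raw)           # ABC\-001
```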
To check if a regex matches:
python
import re
re.match(r'^\d{3}\-\d{3,8}$', '010-12345')
The match() method returns a Match object if the pattern matches at the beginning of the string, or None otherwise. A common way to check is:
python
test = 'User input string'
if re.match(r'Regex pattern', test):
    print('ok')
else:
    print('failed')
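Because match() only anchors the pattern at the start of the string, the trailing $ used earlier matters. The sketch below illustrates the difference and, for comparison, uses `re.search`, a related function in the same module that scans the whole string. The sample strings are invented.

```python
import re

# Without '$', match() accepts any string that merely starts with the pattern.
prefix_match = re.match(r'\d{3}-\d{3,8}', '010-12345abc')
print(prefix_match.group(0))  # 010-12345

# With '$', the pattern has to run to the end of the string.
strict_match = re.match(r'\d{3}-\d{3,8}$', '010-12345abc')
print(strict_match)  # None

# match() does not skip leading text, whereas search() does.
leading_match = re.match(r'\d{3}-\d{3,8}', 'tel: 010-12345')
leading_search = re.search(r'\d{3}-\d{3,8}', 'tel: 010-12345')
print(leading_match)            # None
print(leading_search.group(0))  # 010-12345
```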
Splitting Strings
Using regex to split strings is more flexible than using fixed characters:
python
'a b   c'.split(' ') # Result: ['a', 'b', '', '', 'c']
The plain split() cannot collapse consecutive spaces. Using regex:
python
re.split(r'\s+', 'a b   c') # Result: ['a', 'b', 'c']
This works regardless of the number of spaces. We can also add other separators, such as , and ;:
python
re.split(r'[\s\,]+', 'a,b, c d') # Result: ['a', 'b', 'c', 'd']
re.split(r'[\s\,\;]+', 'a,b;; c d') # Result: ['a', 'b', 'c', 'd']
If users enter tags as free text, remember that a regex split like this can turn the irregular input into a clean list.
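A tag-cleaning step along these lines might look like the sketch below. The function name `parse_tags` and the sample input are assumptions made for illustration. One detail worth noting is that `re.split` yields empty strings when the input starts or ends with a separator, so they are filtered out.

```python
import re

def parse_tags(raw):
    """Hypothetical helper: split free-form tag input into a clean list."""
    # Split on runs of whitespace, commas, or semicolons, then drop the
    # empty strings produced by leading or trailing separators.
    return [tag for tag in re.split(r'[\s,;]+', raw) if tag]

print(parse_tags(' python, regex;;  tutorial ,'))
# ['python', 'regex', 'tutorial']
```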
Grouping
Besides matching, regex can also extract substrings using groups, which are marked with ():
python
^(\d{3})-(\d{3,8})$
This defines two groups for area code and local number. For example:
python
m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
m.group(0) # '010-12345'
m.group(1) # '010'
m.group(2) # '12345'
group(0) always returns the entire match, while group(1), group(2), etc., return the first, second, and subsequent groups.
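In practice the input may not match at all, in which case match() returns None and calling group() on it would fail. The sketch below wraps the extraction in a function that checks for this first; the name `split_phone` is hypothetical.

```python
import re

def split_phone(text):
    """Hypothetical helper: return (area_code, local_number) or None."""
    m = re.match(r'^(\d{3})-(\d{3,8})$', text)
    if m is None:
        return None
    # groups() returns every captured group as a tuple, without group(0).
    area_code, local_number = m.groups()
    return area_code, local_number

print(split_phone('010-12345'))  # ('010', '12345')
print(split_phone('010 12345'))  # None
```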
Extracting substrings can be very useful. Here's a more complex example:
python
t = '19:05:30'
m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
m.groups() # ('19', '05', '30')
This regex accepts only valid times of day. However, some checks, such as date validation, are very hard to express with regex alone and are better completed in ordinary code.
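One way to combine the two, shown here as a sketch and not as the tutorial's own solution, is to let a regex check the shape of the input and extract the numbers, then let the standard `datetime.date` constructor decide whether the date exists. The YYYY-MM-DD format and the name `is_valid_date` are assumptions made for this example.

```python
import re
from datetime import date

# Assumed input format: YYYY-MM-DD, with one or two digits for month and day.
DATE_RE = re.compile(r'^(\d{4})-(\d{1,2})-(\d{1,2})$')

def is_valid_date(text):
    """Hypothetical helper: regex checks the shape, date() checks the calendar."""
    m = DATE_RE.match(text)
    if m is None:
        return False
    year, month, day = (int(part) for part in m.groups())
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True

print(is_valid_date('2024-02-29'))  # True: 2024 is a leap year
print(is_valid_date('2023-02-29'))  # False
print(is_valid_date('2023-04-31'))  # False: April has 30 days
print(is_valid_date('2023/01/01'))  # False: wrong separator
```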
Greedy Matching
Regex matching is greedy by default, meaning each quantifier matches as many characters as possible. For instance, suppose we want to capture the trailing zeros of a number:
python
re.match(r'^(\d+)(0*)$', '102300').groups() # ('102300', '')
Here, \d+ is greedy and consumes all the digits, leaving 0* to match only the empty string. To make \d+ non-greedy (matching as few characters as possible), so that 0* can capture the trailing zeros, add a ? after it:
python
re.match(r'^(\d+?)(0*)$', '102300').groups() # ('1023', '00')
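The same difference shows up in other patterns. As an additional illustration (the sample text is invented, and `re.findall` is a function of the re module that returns all non-overlapping matches), compare a greedy and a non-greedy search for angle-bracketed tags:

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: '.+' runs from the first '<' all the way to the last '>'.
greedy = re.findall(r'<.+>', text)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: '.+?' stops at the nearest '>' each time.
lazy = re.findall(r'<.+?>', text)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```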
Compilation
When we use a regex in Python, the re module does two things internally:
- Compiles the regex; an error occurs if the regex string is invalid.
- Matches the string using the compiled regex.
If a regex is used repeatedly, it is more efficient to pre-compile it once and reuse the result:
python
import re
# Compile:
re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
# Usage:
re_telephone.match('010-12345').groups() # ('010', '12345')
Compiling creates a Regular Expression object that includes the regex, so you don’t need to provide the regex string again when calling methods.
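The sketch below reuses one compiled pattern across several inputs and also shows the compile-time error mentioned above, which the re module raises as `re.error`. The sample numbers and the deliberately broken pattern are invented for illustration.

```python
import re

re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')

# Reuse the compiled object for every input instead of passing the pattern again.
numbers = ['010-12345', '021-7654321', '12-345', 'abc']
parsed = {}
for number in numbers:
    m = re_telephone.match(number)
    parsed[number] = m.groups() if m else None
print(parsed)

# An invalid pattern fails at compile time, before any matching happens.
try:
    re.compile(r'(\d{3}')  # missing closing parenthesis
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print('invalid pattern:', exc)
```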
Summary
Regular expressions are incredibly powerful, and covering everything in a single section is impossible. A comprehensive guide to regex could fill a book. If you frequently deal with regex issues, consider getting a reference book on the topic.
Exercise
Try to write a regex to validate email addresses. Version one should accept addresses such as someone@gmail.com and bill.gates@microsoft.com, and pass the tests below:
python
import re
def is_valid_email(addr):
    # Placeholder: replace this with a regex-based check.
    return True
# Tests:
assert is_valid_email('someone@gmail.com')
assert is_valid_email('bill.gates@microsoft.com')
assert not is_valid_email('bob#example.com')
assert not is_valid_email('mr-bob@example.com')
print('ok')
Version two should extract the name from an email address that may carry a display name:
- <Tom Paris> tom@voyager.org => Tom Paris
- bob@example.com => bob
python
import re
def name_of_email(addr):
    # Placeholder: replace this with a regex-based extraction.
    return None
# Tests:
assert name_of_email('<Tom Paris> tom@voyager.org') == 'Tom Paris'
assert name_of_email('tom@voyager.org') == 'tom'
print('ok')