Understanding Python Regex Functions, with Examples

    Ini Arthur

    Regular expressions (regex) are special sequences of characters used to find or match patterns in strings, as this introduction to regex explains. We’ve previously shown how to use regular expressions with JavaScript and PHP. The focus of this article is Python regex, with the goal of helping you better understand how to manipulate regular expressions in Python.

    You’ll learn how to use Python regex functions and methods effectively in your programs as we cover the nuances involved in handling Python regex objects.

    Regular Expression Modules in Python: re and regex

    Python has two modules, re and regex, that facilitate working with regular expressions. The re module is built in to Python, while the regex module was developed by Matthew Barnett and is available on PyPI. The regex module is designed to be backward-compatible with the built-in re module while offering additional functionality, so the two differ mainly in implementation and feature set. The built-in re module is the more popular of the two, so we’ll be working with that module here.

    Python’s Built-in re Module

    More often than not, Python developers use the re module when executing regular expressions. The general construct of regular expression syntax remains the same (characters and symbols), but the module provides some functions and methods to effectively execute regex in a Python program.

    Before we can use the re module, we have to import it into our file like any other Python module or library:

    import re
    

    This makes the module available in the current file so that Python’s regex functions and methods are easily accessible. With the re module, we can create Python regex objects, manipulate matched objects, and apply flags where necessary.

    A Selection of re Functions

    The re module has functions such as re.search(), re.match(), and re.compile(), which we’ll discuss first.

    re.search(pattern, string, flags=0) vs re.match(pattern, string, flags=0)

    The re.search() and re.match() functions search a string for a Python regex pattern and return a match object if one is found, or None otherwise.

    Both functions return only the first match found in a given string, and both take an optional flags argument that defaults to 0. But while search() scans through the entire string to find a match, match() only looks for a match at the beginning of the string.

    Python’s re.search() documentation:

    Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

    Python’s re.match() documentation:

    If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

    Let’s see some code examples to further clarify:

    search_result = re.search(r'\d{2}', 'I live at 22 Garden Road, East Legon')
    
    print(search_result)
    
    print(search_result.group())
    
    >>>>
    
    <re.Match object; span=(10, 12), match='22'>
    
    22
    
    match_result = re.match(r'\d{2}', 'I live at 22 Garden Road, East Legon')
    
    print(match_result)
    
    print(match_result.group())
    
    >>>>
    
    None
    
    Traceback (most recent call last):
    
    File "/home/ini/Dev./sitepoint/regex.py", line 4, in <module>
    
    print(match_result.group())
    
    AttributeError: 'NoneType' object has no attribute 'group'
    

    From the above example, None was returned because there was no match at the beginning of the string. An AttributeError was raised when the group() method was called, because there’s no match object:

    match_result = re.match(r'\d{2}', "45 cars were used for the president's convoy")
    
    print(match_result)
    
    print(match_result.group())
    
    >>>>
    
    <re.Match object; span=(0, 2), match='45'>
    
    45
    

    Because 45 sits at the very beginning of the string, match() returns a match object and group() works as expected.
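
    Since both functions return None when nothing matches, it’s usually worth checking the result before calling group(). Here’s a minimal sketch of that check; the helper name first_two_digits is made up for illustration:

```python
import re

def first_two_digits(text):
    """Return the first run of two digits in text, or None if there is none."""
    result = re.search(r'\d{2}', text)
    # search() returns None when nothing matches, so check before calling group()
    if result is None:
        return None
    return result.group()

found = first_two_digits('I live at 22 Garden Road, East Legon')
missing = first_two_digits('No digits here')

print(found)    # 22
print(missing)  # None
```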

    re.compile(pattern, flags=0)

    The compile() function takes a given regular expression pattern and compiles it into a regular expression object that can be used to find matches in a string or text. It also accepts flags as an optional second argument. This function is useful because the regex object can be assigned to a variable and reused later in our Python code. It’s good practice to write the pattern as a raw string r"..." so that backslashes reach the regex engine unchanged.

    Here’s an example of how it works:

    regex_object = re.compile(r'b[ae]t')
    
    mo = regex_object.search('I bet, you would not let a bat be your president')
    
    print(regex_object)
    
    >>>>
    
    re.compile('b[ae]t')
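
    To illustrate the reuse mentioned above, here’s a short sketch that compiles the pattern once and applies it to several strings. The extra sentences are invented for the example:

```python
import re

# Compile once, then reuse the same regex object on several strings
regex_object = re.compile(r'b[ae]t')

sentences = [
    'I bet, you would not let a bat be your president',
    'The bat flew away',
    'Nothing to see here',
]

matches = []
for sentence in sentences:
    mo = regex_object.search(sentence)
    # Store the matched text, or None when the pattern is not found
    matches.append(mo.group() if mo else None)

print(matches)  # ['bet', 'bat', None]
```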
    

    re.fullmatch(pattern, string, flags=0)

    This function takes two required arguments, a regular expression pattern and a string to search, plus an optional flags argument. A match object is returned if the entire string matches the given regex pattern. If there’s no match, it returns None:

    regex_object = re.compile(r'Tech is the future')
    
    mo = regex_object.fullmatch('Tech is the future, join now')
    
    print(mo)
    
    print(mo.group())
    
    >>>>
    
    None
    
    Traceback (most recent call last):
    
    File "/home/ini/Dev./sitepoint/regex.py", line 16, in <module>
    
    print(mo.group())
    
    AttributeError: 'NoneType' object has no attribute 'group'
    

    The string contains extra text after the pattern, so fullmatch() returns None. Calling group() on None then raises an AttributeError.
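
    For contrast, here’s a sketch of a case where fullmatch() succeeds. It checks only the shape of a YYYY-MM-DD date, not whether the date is valid, and the name date_pattern is an assumption for this example:

```python
import re

# Hypothetical validator: checks the YYYY-MM-DD shape only, not calendar validity
date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')

valid = date_pattern.fullmatch('2022-10-05')
partial = date_pattern.fullmatch('2022-10-05 at noon')

print(valid.group())  # 2022-10-05
print(partial)        # None
```

    This makes fullmatch() a natural fit for input validation, where trailing characters should cause a rejection.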

    re.findall(pattern, string, flags=0)

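    The findall() function returns a list of all non-overlapping matches found in a given string. The list contains strings (or tuples of strings if the pattern has more than one group), not match objects. It scans the string left to right and returns the matches in the order they’re found. See the code snippet below: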

    regex_object = re.compile(r'[A-Z]\w+')
    
    mo = regex_object.findall('Pick out all the Words that Begin with a Capital letter')
    
    print(mo)
    
    >>>>
    
    ['Pick', 'Words', 'Begin', 'Capital']
    

    In the code snippet above, the regex consists of the character class [A-Z] followed by one or more word characters (\w+), which ensures that each matched substring begins with a capital letter.
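
    When the pattern contains capturing groups, findall() returns the groups rather than the whole match. Here’s a small sketch of the difference, using a made-up settings string:

```python
import re

text = 'width=80, height=24, depth=3'

# No groups: each list item is the whole match
without_groups = re.findall(r'\w+=\d+', text)

# Two groups: each list item is a tuple of the captured groups
with_groups = re.findall(r'(\w+)=(\d+)', text)

print(without_groups)  # ['width=80', 'height=24', 'depth=3']
print(with_groups)     # [('width', '80'), ('height', '24'), ('depth', '3')]
```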

    re.sub(pattern, repl, string, count=0, flags=0)

    Parts of a string can be substituted with another substring with the help of the sub() function. It takes at least three arguments: the search pattern, the replacement string, and the string to be worked on. The original string is returned unchanged if no matches are found. The count argument defaults to 0, which means every occurrence of the pattern is replaced. A positive count limits the number of replacements.

    Here’s an example:

    regex_object = re.compile(r'disagreed')
    
    mo = regex_object.sub('agreed',"The founder and the CEO disagreed on the company's new direction, the investors disagreed too.")
    
    print(mo)
    
    >>>>
    
    The founder and the CEO agreed on the company's new direction, the investors agreed too.
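
    Here’s a brief sketch of two other common uses of sub(): limiting replacements with count, and referring to captured groups in the replacement string. The date string is invented for the example:

```python
import re

sentence = ("The founder and the CEO disagreed on the company's new direction, "
            "the investors disagreed too.")

# count=1 replaces only the first occurrence
first_only = re.sub(r'disagreed', 'agreed', sentence, count=1)

# \1, \2 and \3 in the replacement refer to the captured groups
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', '2022-10-05')

print(first_only)
print(reordered)  # 05/10/2022
```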
    

    re.subn(pattern, repl, string, count=0, flags=0)

    The subn() function performs the same operation as sub(), but it returns a tuple containing the new string and the number of replacements made. See the code snippet below:

    regex_object = re.compile(r'disagreed')
    
    mo = regex_object.subn('agreed',"The founder and the CEO disagreed on the company's new direction, the investors disagreed too.")
    
    print(mo)
    
    >>>>
    
    ("The founder and the CEO agreed on the company's new direction, the investors agreed too.", 2)
    

    Match Objects and Methods

    A match object is returned when a regex pattern matches a given string via the regex object’s search() or match() method. Match objects have several methods that prove useful when working with regex in Python.

    Match.group([group1, …])

    This method returns one or more subgroups of a match object. A single argument returns a single subgroup as a string; multiple arguments return a tuple of subgroups, based on their indexes. By default, the group() method returns the entire matched substring. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised.

    Here’s an example:

    regex_object = re.compile(r'(\+\d{3}) (\d{2} \d{3} \d{4})')
    
    mo = regex_object.search('Pick out the country code from the phone number: +233 54 502 9074')
    
    print(mo.group(1))
    
    >>>>
    
    +233
    

    The argument 1 passed into the group() method, as seen in the above example, picks out the country code for Ghana, +233. Calling the method without an argument, or with 0 as the argument, returns the entire matched substring:

    regex_object = re.compile(r'(\+\d{3}) (\d{2} \d{3} \d{4})')
    
    mo = regex_object.search('Pick out the phone number: +233 54 502 9074')
    
    print(mo.group())
    
    >>>>
    
    +233 54 502 9074
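
    As a sketch of the remaining cases described above, the snippet below passes several indexes to group() and then asks for a group that doesn’t exist:

```python
import re

regex_object = re.compile(r'(\+\d{3}) (\d{2} \d{3} \d{4})')
mo = regex_object.search('Pick out the phone number: +233 54 502 9074')

# Several arguments return a tuple with one item per requested group
both = mo.group(1, 2)
print(both)  # ('+233', '54 502 9074')

# The pattern defines only two groups, so asking for a third raises IndexError
try:
    mo.group(3)
    raised = False
except IndexError:
    raised = True

print(raised)  # True
```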
    

    Match.groups(default=None)

    groups() returns a tuple containing all the subgroups of the match. Groups are defined in a regex pattern with parentheses, (), and the text each group captures is returned as an element of the tuple when there’s a match:

    regex_object = re.compile(r'(\+\d{3}) (\d{2}) (\d{3}) (\d{4})')
    
    mo = regex_object.search('Pick out the phone number: +233 54 502 9074')
    
    print(mo.groups())
    
    >>>>
    
    ('+233', '54', '502', '9074')
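
    The default argument in the signature controls what is reported for groups that didn’t take part in the match. Here’s a minimal sketch with an optional group; the version_pattern name and input are assumptions for illustration:

```python
import re

# The second group is optional, so it may not take part in the match
version_pattern = re.compile(r'(\d+)(?:\.(\d+))?')
mo = version_pattern.fullmatch('24')

unmatched_as_none = mo.groups()
unmatched_as_zero = mo.groups('0')

print(unmatched_as_none)  # ('24', None)
print(unmatched_as_zero)  # ('24', '0')
```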
    

    Match.start([group]) & Match.end([group])

    The start() method returns the start index, while the end() method returns the end index of the match object:

    regex_object = re.compile(r'\s\w+')
    
    mo = regex_object.search('Match any word after a space')
    
    print('Match begins at', mo.start(), 'and ends', mo.end())
    
    print(mo.group())
    
    >>>>
    
    Match begins at 5 and ends 9
    
    any
    

    The example above has a regex pattern that matches a whitespace character followed by one or more word characters. A match was found (' any', including the leading space), starting at position 5 and ending at 9.
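
    Both methods also accept an optional group argument, as the signatures above indicate. The sketch below uses it to locate a single subgroup in the original string:

```python
import re

text = 'Pick out the phone number: +233 54 502 9074'
mo = re.search(r'(\+\d{3}) (\d{2} \d{3} \d{4})', text)

# start(1) and end(1) give the boundaries of the first subgroup only
country_code = text[mo.start(1):mo.end(1)]
print(country_code)  # +233

# span(group) returns the same two indexes as a tuple
same_indexes = mo.span(1) == (mo.start(1), mo.end(1))
print(same_indexes)  # True
```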

    Pattern.search(string[, pos[, endpos]])

    The pos value indicates the index position where the search for a match object should begin. endpos indicates where the search for a match should stop. The value for both pos and endpos can be passed as arguments in the search() or match() methods after the string. This is how it works:

    regex_object = re.compile(r'[a-z]+[0-9]')
    
    mo = regex_object.search('find the alphanumeric character python3 in the string', 20, 40)
    
    print(mo.group())
    
    >>>>
    
    python3
    

    The pattern above matches one or more lowercase letters followed by a digit.

    The search begins at index position 20 and stops before index 40, a range that contains python3 (positions 32 to 38).
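
    To make the effect of pos and endpos easier to see, here’s a small sketch on a short, made-up string:

```python
import re

regex_object = re.compile(r'[a-z]+[0-9]')
text = 'abc1 def2'

from_start = regex_object.search(text)       # finds 'abc1'
from_pos = regex_object.search(text, 4)      # starts after 'abc1', finds 'def2'
too_short = regex_object.search(text, 0, 3)  # only 'abc' is searched, so no match

print(from_start.group())                 # abc1
print(from_pos.group(), from_pos.start()) # def2 5
print(too_short)                          # None
```

    Note that the reported positions are still relative to the full string, not to pos.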

    re Regex Flags

    Python allows the use of flags with re module functions like search() and match(), which modify how a regular expression is interpreted. The flags are optional arguments that specify how the Python regex engine finds a match.

    re.I (re.IGNORECASE)

    This flag is used when performing a case-insensitive match. The regex engine will ignore uppercase and lowercase differences between the pattern and the string:

    regex_object = re.search('django', 'My tech stack comprises of python, Django, MySQL, AWS, React', re.I)
    
    print(regex_object.group())
    
    >>>>
    
    Django
    

    The re.I ensures that a match object is found, regardless of whether it’s in uppercase or lowercase.

    re.S (re.DOTALL)

    By default, the '.' special character matches any character except a newline. With this flag set, '.' also matches a newline in a block of text or string. First, see what happens without the flag:

    regex_object = re.search('.+', 'What is your favourite coffee flavor \nI prefer the Mocha')
    
    print(regex_object.group())
    
    >>>>
    
    What is your favourite coffee flavor
    

    The '.' character only finds a match from the beginning of the string and stops at the newline. Introducing the re.DOTALL flag will match a newline character. See the example below:

    regex_object = re.search('.+', 'What is your favourite coffee flavor \nI prefer the Mocha', re.S)
    
    print(regex_object.group())
    
    >>>>
    
    What is your favourite coffee flavor
    
    I prefer the Mocha
    

    re.M (re.MULTILINE)

    By default the '^' special character only matches the beginning of a string. With this flag introduced, the function searches for a match at the beginning of each line. The '$' character only matches patterns at the end of the string. But the re.M flag ensures it also finds matches at the end of each line:

    regex_object = re.search(r'^J\w+', 'Popular programming languages in 2022: \nPython \nJavaScript \nJava \nRust \nRuby', re.M)
    
    print(regex_object.group())
    
    >>>>
    
    JavaScript
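
    The '$' behavior described above can be sketched the same way, here with findall() on a made-up two-line string:

```python
import re

text = 'one two\nthree four'

# Without re.M, '$' matches only at the end of the whole string
last_word_default = re.findall(r'\w+$', text)

# With re.M, '$' also matches at the end of each line
last_word_each_line = re.findall(r'\w+$', text, re.M)

print(last_word_default)    # ['four']
print(last_word_each_line)  # ['two', 'four']
```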
    

    re.X (re.VERBOSE)

    Sometimes, Python regex patterns can get long and messy. The re.X flag helps out when we need to add comments within our regex pattern. We can use the ''' string format to create a multiline regex with comments:

    email_regex = re.search(r'''
    
    [a-zA-Z0-9._%+-]+ # username composed of alphanumeric characters
    
    @ # @ symbol
    
    [a-zA-Z0-9.-]+ # domain name has word characters
    
    (\.[a-zA-Z]{2,4}) # dot-something
    
    ''', 'extract the email address in this string kwekujohnson1@gmail.co and send an email', re.X)
    
    print(email_regex.group())
    
    >>>>
    
    kwekujohnson1@gmail.co
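
    The flags covered in this section can also be combined with the bitwise OR operator (|). Here’s a short sketch that applies re.I and re.M together; the lowercase "javaScript" is deliberate, to show the case-insensitive match:

```python
import re

text = 'Popular programming languages in 2022: \nPython \njavaScript \nJava \nRust'

# re.M lets '^' match at the start of each line; re.I ignores case
matches = re.findall(r'^j\w+', text, re.I | re.M)

print(matches)  # ['javaScript', 'Java']
```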
    

    Practical Examples of Regex in Python

    Let’s now dive into some more practical examples.

    Python password strength test regex

    One of the most popular use cases for regular expressions is to test for password strength. When signing up for any new account, there’s a check to ensure we input an appropriate combination of letters, numbers, and characters to ensure a strong password.

    Here’s a sample regex pattern for checking password strength:

    password_regex = re.match(r"""
    
    ^(?=.*?[A-Z]) # this ensures user inputs at least one uppercase letter
    
    (?=.*?[a-z]) # this ensures user inputs at least one lowercase letter
    
    (?=.*?[0-9]) # this ensures user inputs at least one digit
    
    (?=.*?[#?!@$%^&*-]) # this ensures user inputs one special character
    
    .{8,}$ #this ensures that password is at least 8 characters long
    
    """, '@Sit3po1nt', re.X)
    
    print('Your password is', password_regex.group())
    
    >>>>
    
    Your password is @Sit3po1nt
    

    Note the use of '^' and '$' to ensure that the entire input string (the password), not just part of it, matches the regex.
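
    In a real program the match can fail, so the check is easier to reuse when wrapped in a function that returns True or False. Here’s one possible sketch using the same pattern; the names PASSWORD_PATTERN and is_strong_password are hypothetical:

```python
import re

# Same rules as the pattern above, compiled once so it can be reused
PASSWORD_PATTERN = re.compile(r"""
    ^(?=.*?[A-Z])        # at least one uppercase letter
    (?=.*?[a-z])         # at least one lowercase letter
    (?=.*?[0-9])         # at least one digit
    (?=.*?[#?!@$%^&*-])  # at least one special character
    .{8,}$               # at least 8 characters long
""", re.X)

def is_strong_password(password):
    """Return True if password satisfies every rule in PASSWORD_PATTERN."""
    return PASSWORD_PATTERN.match(password) is not None

results = {candidate: is_strong_password(candidate)
           for candidate in ['@Sit3po1nt', 'password', 'Sh0rt!']}

print(results)  # {'@Sit3po1nt': True, 'password': False, 'Sh0rt!': False}
```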

    Python search and replace in file regex

    Here’s our goal for this example:

    • Create a file ‘pangram.txt’.
    • Add some simple text to the file: "The five boxing wizards climb quickly."
    • Write a simple Python regex to search and replace “climb” to “jump” so we have a pangram.

    Here’s some code for doing that:

    # importing the regex module
    import re

    file_path = "pangram.txt"
    text = "climb"
    subs = "jump"

    # defining the replace function
    def search_and_replace(file_path, text, subs, flags=0):
        with open(file_path, "r+") as file:
            # read the file contents
            file_contents = file.read()
            # escape the search text so it's matched literally
            text_pattern = re.compile(re.escape(text), flags)
            file_contents = text_pattern.sub(subs, file_contents)
            # rewind, clear the file, and write the updated contents
            file.seek(0)
            file.truncate()
            file.write(file_contents)

    # calling the search_and_replace function
    search_and_replace(file_path, text, subs)
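
    As an alternative sketch of the same idea, the version below uses pathlib and subn() so that the function can report how many replacements were made. It writes to a temporary directory so it can be run without creating pangram.txt by hand, and the function name replace_in_file is an assumption for this example:

```python
import re
import tempfile
from pathlib import Path

def replace_in_file(path, text, subs, flags=0):
    """Replace every literal occurrence of text in the file; return the count."""
    path = Path(path)
    pattern = re.compile(re.escape(text), flags)
    new_contents, count = pattern.subn(subs, path.read_text())
    # Only rewrite the file when something actually changed
    if count:
        path.write_text(new_contents)
    return count

with tempfile.TemporaryDirectory() as tmp:
    pangram = Path(tmp) / "pangram.txt"
    pangram.write_text("The five boxing wizards climb quickly.")
    replacements = replace_in_file(pangram, "climb", "jump")
    result = pangram.read_text()

print(replacements)  # 1
print(result)        # The five boxing wizards jump quickly.
```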
    

    Python web scraping regex

    Sometimes you might need to harvest some data on the Internet or automate simple tasks like web scraping. Regular expressions are very useful when extracting certain data online. Below is an example:

    import re
    import urllib.request
    
    phone_number_regex = r'\(\d{3}\) \d{3}-\d{4}'
    
    url = 'https://www.summet.com/dmsi/html/codesamples/addresses.html'
    
    # get response
    
    response = urllib.request.urlopen(url)
    
    # convert response to string
    
    string_object = response.read().decode("utf8")
    
    # use regex to extract phone numbers
    
    regex_object = re.compile(phone_number_regex)
    
    mo = regex_object.findall(string_object)
    
    # print the first 5 phone numbers
    
    print(mo[:5])
    
    >>>>
    
    ['(257) 563-7401', '(372) 587-2335', '(786) 713-8616', '(793) 151-6230', '(492) 709-6392']
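
    The example above depends on the page being reachable. To try the same pattern offline, here’s a sketch that runs it against a small, made-up HTML snippet and uses finditer() to get the position of each match as well:

```python
import re

# Hypothetical sample standing in for a downloaded page
html = '''
<li>Alice: (257) 563-7401</li>
<li>Bob: (372) 587-2335</li>
<li>Carol: no phone listed</li>
'''

phone_number_regex = re.compile(r'\(\d{3}\) \d{3}-\d{4}')

# finditer() yields match objects, so positions are available too
found = [(mo.start(), mo.group()) for mo in phone_number_regex.finditer(html)]
numbers = [number for _, number in found]

print(numbers)  # ['(257) 563-7401', '(372) 587-2335']
```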
    

    Conclusion

    Regular expressions can vary from simple to complex. They’re a vital part of programming, as the examples above demonstrate. To better understand regex in Python, it’s good to begin by getting familiar with things like character classes, special characters, anchors, and grouping constructs.

    There’s a lot further we can go to deepen our understanding of regex in Python. The Python re module makes it easier to get up and running quickly.

    Regex significantly reduces the amount of code we need to write to do things like validate input and implement search algorithms.

    It’s also good to be able to answer questions about the use of regular expressions, as they often come up in technical interviews for software engineers and developers.