AlgoDaily - An Interactive Introduction to Regular Expressions

Have you ever wondered how search engines look up data in such a short amount of time? Or how text editors instantly find words when their search option is used? Or ever thought about how they deal with such a large amount of data and find only the relevant information?

Search engines and text editors simplify this task by using regular expressions.

In this lesson, we will learn about regular expressions, with a focus on the following key points:

What are regular expressions and where are they used?
Special characters and common patterns that are used in regular expressions.
Implementing simple regular expressions in Python.

What are Regular Expressions?

A regular expression, often shortened to regex, is a sequence of characters that defines a search pattern. The pattern is then used to find text that matches it.

You might be wondering: why do we need another way to match text when we can already do it using:

the equality operator (==),
indexing, or
string methods available in most programming languages?

It's because regular expressions are a more powerful (and often much shorter!) way to match strings. They let you match text against custom-defined patterns, which is difficult or impossible with the other methods.

Regular expressions can ease our day-to-day programming as well. Suppose you receive a file of personal records for people at your university, and you want to extract every email address from it. Doing this by hand would take a long time. But don't worry! A regular expression can often solve this problem in a single line.
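
As a rough sketch of that idea in Python: the sample records below are invented, and the pattern is a deliberately simplified assumption about what an email address looks like (real addresses allow more forms than this). It uses re.findall, which returns every non-overlapping match in the text.

```python
import re

# Hypothetical sample data; the names and addresses are invented for illustration.
records = "Ada Park <ada.park@example.edu>, Ben Cole <bcole@example.edu>; phone 555-0100"

# Simplified pattern: a run of word characters, dots, plus signs or hyphens,
# then '@', then a domain name containing at least one dot.
email_pattern = r"[\w.+-]+@[\w-]+\.[\w.-]+"

# The "single line" that does the extraction.
emails = re.findall(email_pattern, records)

print(emails)  # ['ada.park@example.edu', 'bcole@example.edu']
```

The pattern relies on metacharacters that are introduced later in this lesson, so it is fine to come back to it after reading the rest.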

Python provides support for regular expressions in the re module. To use regex in a program, we need to import this module at the start of our program. This module provides helpful functions that make the matching process easier.

Let's review basic Python with a short question: is import re the correct way to import the re module? Yes. The import keyword loads a module by name, so import re is a valid statement.

We start our introduction to regular expressions in Python with search, a basic matching function provided by the re module.

re.search(regex, string) takes a regular expression and a string as input and scans through the string for the first location where the pattern matches. It returns a match object if a match is found, and None otherwise.
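
A minimal sketch of calling re.search in Python and inspecting what it returns. The string 'room204' and the patterns are illustrative choices, not part of the lesson.

```python
import re

s = "room204"

# re.search returns a match object on success, or None if nothing matches.
match = re.search("204", s)
if match:
    print("Found a match!")
    print(match.group())  # the matched text: 204
    print(match.start())  # index where the match begins: 4
else:
    print("Did not find a match.")

# A pattern that does not occur in the string yields None.
missing = re.search("999", s)
print(missing)  # None
```

Because None is falsy and a match object is truthy, the result can be used directly in an if statement.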

Since this is an introduction to regular expressions, we will only use this function from the re module. The module has other functions that provide more functionality, but hold that thought for now.

Let's look at a simple example of this search method.

Since the string contains '123', the pattern '123' matches the string s in this example.

// JavaScript version: String.prototype.search returns the index of the
// first match, or -1 if there is none.
function findMatch() {
    let s = 'hello123';
    if (s.search('123') !== -1) {
        console.log("Found a match!");
    } else {
        console.log("Did not find a match.");
    }
}

// Driver code to execute the function
findMatch();

Sure, the previous example could easily have been written with a plain string method. The difference between ordinary search functions and regular expressions becomes much more evident when special characters (or metacharacters) are used. Each metacharacter gives the expression a special meaning. In this lesson, we will discuss some of the commonly used ones.

The most basic of these is a pair of square brackets ([ ]), which defines a character class. A character class matches any single character listed inside the brackets.

let s = 'john479';
// [0-9] matches any single digit, so test() returns true for 'john479'.
if (/[0-9]/.test(s))
    console.log("Found a match!");
else
    console.log("Did not find a match.");

This code snippet prints "Found a match!" because the string contains at least one digit (in the range 0 to 9). We can place several character classes next to each other to obtain more interesting results, such as matching consecutive characters.

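[a-z] matches any character between 'a' and 'z', and [0-9] matches any character between '0' and '9'. When the two classes are placed right next to each other, as in '[a-z][0-9]', the pattern matches only where a lowercase letter is immediately followed by a digit.

Every part of such a pattern must match in sequence. For example, in a three-part pattern such as '[a-z][0-9][a-z]', if the character after the letter and the digit is not a lowercase letter, the third part, [a-z], fails to match the third consecutive character of the string. As a result, no match is found.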
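
The two cases above can be sketched in Python as follows. The patterns and test strings are illustrative assumptions, since the lesson's original examples for this step are not shown here.

```python
import re

# Two classes side by side: a lowercase letter immediately followed by a digit.
two_part = bool(re.search("[a-z][0-9]", "john479"))
print(two_part)          # True: 'n4' matches

# Three classes: letter, digit, letter. In 'john479' the character after
# 'n4' is '7', which is not a lowercase letter, so the third class fails.
three_part_fail = bool(re.search("[a-z][0-9][a-z]", "john479"))
print(three_part_fail)   # False

# The same pattern succeeds when a letter follows the digit.
three_part_ok = bool(re.search("[a-z][0-9][a-z]", "john4x9"))
print(three_part_ok)     # True: 'n4x' matches
```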

Characters and digits can also be combined for matching, such as in the following example.
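
One way to read this, sketched in Python under the assumption that "combined" means either putting letters and digits in a single class or following literal characters with a class. The strings used are invented for illustration.

```python
import re

# A single class that accepts either a lowercase letter or a digit.
mixed_hit = bool(re.search("[a-z0-9]", "#7!"))
mixed_miss = bool(re.search("[a-z0-9]", "#?!"))
print(mixed_hit, mixed_miss)      # True False

# Literal characters followed by a digit class.
literal_hit = bool(re.search("room[0-9]", "room7"))
literal_miss = bool(re.search("room[0-9]", "roomA"))
print(literal_hit, literal_miss)  # True False
```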

Let's see another metacharacter, the period (.). A period matches any single character at that position in the string, except a newline character.

The regex 'chips.dip' matches any string that contains 'chips' and 'dip' separated by exactly one character, such as 'chipsndip' or 'chipsmdip'.

// The period in /chips.dip/ stands for exactly one character ('n' here).
function findMatch() {
    let s = 'chipsndip';
    if (/chips.dip/.test(s)) {
        console.log("Found a match!");
    } else {
        console.log("Did not find a match.");
    }
}

// Driver code to execute the function
findMatch();
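
A short Python sketch of the same pattern with re.search, including a few strings where it does not match. The extra test strings are illustrative additions.

```python
import re

pattern = "chips.dip"
candidates = ["chipsndip", "chips-dip", "chipsdip", "chipsanddip", "chips\ndip"]

# Map each candidate string to whether the pattern matches it.
results = {s: bool(re.search(pattern, s)) for s in candidates}

for s, found in results.items():
    print(repr(s), found)
# 'chipsndip' True
# 'chips-dip' True
# 'chipsdip' False      (no character between 'chips' and 'dip')
# 'chipsanddip' False   (three characters in between, not one)
# 'chips\ndip' False    (the period does not match a newline)
```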

A caret (^) is another metacharacter. It anchors the match to the start of the string, so the pattern that follows it must appear at the very beginning. This is helpful when we need to match multiple strings that start with the same characters.

All the given strings begin with 'It', and hence are successfully matched by the regular expression '^It'.

// /^It/ matches only when 'It' appears at the very start of the string.
function findMatch() {
    let s1 = 'It is rainy.';
    let s2 = 'It is cloudy.';
    let s3 = 'It is sunny.';
    if (/^It/.test(s1) && /^It/.test(s2) && /^It/.test(s3)) {
        console.log("Found a match!");
    } else {
        console.log("Did not find a match.");
    }
}

// Driver code to execute the function
findMatch();
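
A minimal Python sketch of the caret with re.search. The third sentence is an invented counterexample in which 'It' appears, but not at the start.

```python
import re

sentences = ["It is rainy.", "It is cloudy.", "Today It is sunny."]

# '^It' only matches when 'It' is at position 0 of the string.
starts_with_it = [bool(re.search("^It", s)) for s in sentences]

print(starts_with_it)  # [True, True, False]
```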

Let's look at a repetition-based metacharacter, the plus (+). It matches one or more consecutive occurrences of the character that comes immediately before it in the pattern.

// 'o+' matches one or more consecutive 'o' characters after 'Z'.
function findMatch() {
    let s = "Zoooooootopia";
    let pattern = /Zo+topia/;
    if (pattern.test(s)) {
        console.log("Found a match!");
    } else {
        console.log("Did not find a match.");
    }
}

// Driver code to execute the function
findMatch();
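
The "one or more" rule can be sketched in Python like this. The strings 'Zotopia' and 'Ztopia' are illustrative additions that show the lower bound.

```python
import re

pattern = "Zo+topia"
words = ["Zotopia", "Zoooooootopia", "Ztopia"]

# '+' needs at least one 'o', so 'Ztopia' is rejected.
plus_results = {w: bool(re.search(pattern, w)) for w in words}

print(plus_results)
# {'Zotopia': True, 'Zoooooootopia': True, 'Ztopia': False}
```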

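In this example, it is important to write 'o' before '+', because + always applies to the character immediately before it and requires at least one occurrence of that character. The asterisk (*) metacharacter works the same way, except that it allows zero or more repetitions, so 'Zo*topia' would also match 'Ztopia'. If you do not know in advance which character repeats, you can combine the asterisk with a period: '.*' matches any run of characters (other than newlines).

The pattern 'Z.*topia' matches the string without having to know the repeated character beforehand. (The pattern 'Z*topia' would also report a match here, but only because 'Z*' can match zero characters and 'topia' appears later in the string.)

// '.*' matches any run of characters between 'Z' and 'topia'.
function findMatch() {
    let s = 'Zoooooootopia';
    if (/Z.*topia/.test(s)) {
        console.log("Found a match!");
    } else {
        console.log("Did not find a match.");
    }
}

// Driver code to execute the function
findMatch();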
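
A small Python sketch of how the asterisk behaves with re.search: it means zero or more repetitions of the preceding character, and pairs with the period as '.*' to match any run of characters. The test strings and the anchored pattern are illustrative assumptions.

```python
import re

s = "Zoooooootopia"

# '*' allows zero or more of the preceding character.
zero_os = bool(re.search("Zo*topia", "Ztopia"))
many_os = bool(re.search("Zo*topia", s))
print(zero_os, many_os)        # True True

# '.*' matches any run of characters, so the repeated letter need not be known.
any_run = bool(re.search("Z.*topia", s))
other_run = bool(re.search("Z.*topia", "Zaaatopia"))
print(any_run, other_run)      # True True

# 'Z*topia' matches only because search may start anywhere: 'Z*' matches
# zero characters and 'topia' is found later in the string.
unanchored = bool(re.search("Z*topia", s))
# Anchored to the start, the same pattern fails: after 'Z' comes 'o', not 't'.
anchored = bool(re.search("^Z*topia", s))
print(unanchored, anchored)    # True False
```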

This lesson covers only the basics of regular expressions. There are many other metacharacters available, and their combinations allow us to match complex patterns in text.

If you are interested in learning more, below is a list of some more metacharacters (and the ones we studied in this lesson) used in regular expressions, along with their usage. Try experimenting with these metacharacters to create unique regular expressions.

Metacharacter	Character Name	Usage
[ ]	Square brackets	Matches any one of the characters specified within them
.	Period	Matches any single character except newline
^	Caret	Matches the start of the string
$	Dollar	Matches the end of the string
*	Asterisk	Matches zero or more repetitions of the preceding character
+	Plus	Matches one or more repetitions of the preceding character
\w	Lowercase w	Matches a single letter, digit, or underscore
\W	Uppercase W	Matches any single character that \w does not match
\s	Lowercase s	Matches a single whitespace character
\S	Uppercase S	Matches any single character that \s does not match
\d	Lowercase d	Matches a decimal digit in the range 0-9
\D	Uppercase D	Matches any single character that \d does not match
\t	Lowercase t	Matches a tab character
\n	Lowercase n	Matches a newline character
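
To experiment with the entries in this table, here is a hedged Python sketch that tries several of them against one invented sample string. Raw strings (the r prefix) are used so that Python passes the backslashes to the re module unchanged.

```python
import re

# Invented sample text containing a space, digits, and a tab.
text = "Order 66 shipped\ton Monday"

first_digit = re.search(r"\d", text).group()     # a single digit
all_digits = re.search(r"\d+", text).group()     # one or more digits
first_word = re.search(r"\w+", text).group()     # a run of letters/digits/underscores
first_space = re.search(r"\s", text).group()     # first whitespace character
non_digits = re.search(r"\D+", text).group()     # a run of non-digit characters
has_tab = bool(re.search(r"\t", text))           # the string contains a tab
ends_monday = bool(re.search(r"Monday$", text))  # 'Monday' at the end of the string
starts_shipped = bool(re.search(r"^shipped", text))  # 'shipped' is not at the start

print(first_digit, all_digits, first_word)   # 6 66 Order
print(repr(first_space), repr(non_digits))   # ' ' 'Order '
print(has_tab, ends_monday, starts_shipped)  # True True False
```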

One Pager Cheat Sheet

Search engines and text editors use regular expressions to quickly locate data and pick out the relevant information.
A regular expression, or regex, is a sequence of characters that defines a pattern for matching text; it can be more powerful and more concise than the equality operator (==), indexing, or other string methods.
The import keyword is used to load the re module in Python, and therefore import re is a valid statement.
The re module provides a basic search function, re.search(regex, string), which scans through a string and looks for the first location that matches the given regular expression.
Special characters (or metacharacters) give a regular expression a special meaning; the most basic is the character class, defined with square brackets ([ ]).
With character classes such as [a-z] and [0-9] and the period (.), we can build patterns that match character ranges, digits, consecutive characters, and any single character.
The ^ metacharacter anchors a match to the start of the string, which helps when matching multiple strings that start with the same characters.
The + metacharacter matches one or more repetitions of the preceding character.
The * metacharacter matches zero or more repetitions of the preceding character; combined with a period, as in '.*', it matches any run of characters without naming the repeated character.
This lesson covered only the basics of regular expressions, but with metacharacters such as square brackets, period, caret, dollar, asterisk, plus, \w, \W, \s, \S, \d, \D, \t and \n, one can match complex patterns in text.
