Have you ever wondered how search engines look up data in such a short amount of time? Or how text editors instantly find words when their search option is used? Or ever thought about how they deal with such a large amount of data and find only the relevant information?

Search engines or text editors ease this process by using regular expressions.

In this lesson, we will learn about regular expressions, with a focus on the following key points:

  1. What are regular expressions and where are they used?
  2. Special characters and common patterns that are used in regular expressions.
  3. Implementing simple regular expressions in Python.

What are Regular Expressions?

A regular expression, also known by the shorthand regex, is a special sequence of characters that defines a pattern. This sequence is then used to match text against that pattern.

You might be wondering why we need another method for text matching when we can already do it using:

  1. the equality operator (==),
  2. indexing, or
  3. string methods in any programming language.

It's because regular expressions are a more powerful (and much shorter!) method of performing the string matching operation, one where you can match strings even with custom-defined patterns, a functionality that you might not be able to perform using other methods.

Regular expressions can ease our day-to-day programming as well. Suppose you receive a file of personal information about people at your university, and you want to extract all the email addresses from it. Doing this manually would surely take quite some time. But don't worry! Regular expressions make it easy, often solving the problem in a single line.

Python provides support for regular expressions in the re module. To use regex in a program, we need to import this module at the start of our program. This module provides helpful functions that make the matching process easier.

Let's review basic Python with a short question: is import re a correct way to import the re module in Python? It is. The import keyword loads a module, so import re is a valid statement.

We start our introduction to regular expressions in Python with a basic search method that the re module provides for matching strings.

re.search(regex, string) takes a regular expression and a string as input, scans through the string, and looks for the first location where the regex pattern matches. It returns a match object if such a location exists, and None otherwise.

Since this is an introduction to regular expressions, we will only use this method from the re module. There are other methods in the module that provide more functionality. However, hold that thought for now.

Let's look at a simple example of this search method.

Since the string s contains '123', the pattern '123' matches the string in this example.
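
A minimal sketch of such a search is shown here. The contents of s are a hypothetical input chosen for illustration; any string containing '123' would behave the same way.

```python
import re

# Hypothetical input string; it contains the substring '123'.
s = "abc123xyz"

# re.search returns a match object on success and None otherwise.
match = re.search("123", s)

if match:
    print("Found a match!")
else:
    print("No match found.")
```

The match object is truthy, so it can be used directly in an if statement, which is the pattern the remaining sketches in this lesson follow.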

Sure, the previous example could just as easily have been written with a string method. The difference between ordinary search functions and regular expressions becomes much more evident when special characters (or metacharacters) are used. A metacharacter is not matched literally; it has a special meaning inside the expression. In this lesson, we will discuss some of the commonly used metacharacters.

The most basic among these is the use of square brackets ([ ]) to define a character class. A character class matches any single character listed inside the brackets.
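
As a rough sketch, a character class for digits could be used like this. The input string s is a made-up example, not taken from the lesson.

```python
import re

# Hypothetical input string containing a single digit.
s = "My lucky number is 7"

# [0-9] is a character class: it matches any one digit from 0 to 9.
result = re.search("[0-9]", s)

if result:
    print("Found a match!")
else:
    print("No match found.")
```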

This code snippet prints "Found a match!" because a single digit (in the range 0 to 9) is present in the string. We can combine these square brackets to obtain more interesting results (such as matching consecutive characters) as illustrated below.

Special Characters Part 2

[a-z] matches any character between 'a' and 'z', [0-9] matches any character between '0' and '9'. Since they are placed right next to each other, the match is found consecutively.
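
A small sketch of two classes placed side by side, assuming a hypothetical input string:

```python
import re

# Hypothetical input; 'b7' is the only place where a lowercase
# letter is immediately followed by a digit.
s = "Seat b7"

# [a-z][0-9] needs a lowercase letter directly followed by a digit.
consecutive = re.search("[a-z][0-9]", s)

if consecutive:
    print("Found a match:", consecutive.group())
else:
    print("No match found.")
```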

Here the third part of the expression, [a-z], did not match the third consecutive character of the string. As a result, no match was found.
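
A failed match of this kind might look like the sketch below. The pattern and inputs are assumptions made for illustration, built so that only the third class decides the outcome.

```python
import re

# Three classes in a row: letter, digit, letter.
pattern = "[a-z][0-9][a-z]"

# 'a12' fails: 'a' and '1' fit, but '2' is not a lowercase letter.
no_match = re.search(pattern, "a12")

# 'a1b' succeeds: every position fits its class.
full_match = re.search(pattern, "a1b")

print(no_match)            # None, because the third character is a digit
print(full_match.group())  # the matched text
```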

Letters and digits can also be combined for matching, such as in the following example.
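
Two plausible ways of combining them are sketched here, both with hypothetical patterns and inputs: literal characters followed by digit classes, and letters and digits listed inside a single class.

```python
import re

# Literal characters combined with digit classes:
# the text 'user' followed by exactly two digits.
literal_and_class = re.search("user[0-9][0-9]", "login: user42")

# Letters and digits combined inside one class:
# any single lowercase letter or digit.
mixed_class = re.search("[a-z0-9]", "#7!")

print(literal_and_class.group())
print(mixed_class.group())
```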

Let's see another metacharacter, the period (.). A period matches any single character occurring at that specific place in a string (except a newline character).

The regex 'chips.dip' matches any string that has exactly one character between 'chips' and 'dip', such as the inputs 'chipsndip' or 'chipsmdip'.
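
The behavior described above can be sketched as follows. The helper name matches_chips_dip is hypothetical, and the last two inputs are added only to show where the period does not match.

```python
import re

def matches_chips_dip(text):
    """Return True if 'chips', any one character, then 'dip' occurs in text."""
    return re.search("chips.dip", text) is not None

print(matches_chips_dip("chipsndip"))    # True: 'n' fills the period
print(matches_chips_dip("chipsmdip"))    # True: 'm' fills the period
print(matches_chips_dip("chipsdip"))     # False: no character in between
print(matches_chips_dip("chips\ndip"))   # False: a period skips newlines
```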

A caret (^) is another metacharacter. It anchors the pattern to the start of the string, so the characters that follow it must appear at the very beginning. This is helpful when we need to match multiple strings that start with similar characters.

All the given strings begin with 'It', and hence are successfully matched with the given regular expression '^It'.
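
A short sketch of this check, using made-up sentences as input. The final string is included as a contrast, since it contains 'It' but not at the start.

```python
import re

# Hypothetical inputs; the first three start with 'It'.
strings = [
    "It is raining",
    "It was a long day",
    "Its color is red",
    "Is It true?",
]

# '^It' only matches when 'It' appears at the very start of the string.
results = [re.search("^It", text) is not None for text in strings]

for text, found in zip(strings, results):
    print(text, "->", "match" if found else "no match")
```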

Let's look at a repetition-based metacharacter, the plus (+). It matches one or more consecutive occurrences of the character that comes immediately before it in the pattern.
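
One way to see this in action is the sketch below, where the pattern 'go+gle' and its inputs are assumptions chosen for illustration.

```python
import re

# 'o+' means one or more 'o' characters in a row.
pattern = "go+gle"

one_o = re.search(pattern, "gogle")       # one 'o'
many_o = re.search(pattern, "goooogle")   # four 'o's
zero_o = re.search(pattern, "ggle")       # no 'o' at all

print(one_o is not None)   # True
print(many_o is not None)  # True
print(zero_o is not None)  # False: '+' needs at least one 'o'
```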

In this example, it is important to write 'o' before '+', because the metacharacter applies to the element immediately before it and needs at least one occurrence of that element to match. The asterisk ('*') metacharacter relaxes this requirement: it matches zero or more repetitions of the preceding element, so the repeated character may be missing from the string altogether.

Combined with a period, as in '.*', the asterisk matches any run of characters, so the string can be matched without knowing beforehand which character repeats.
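
A brief sketch of how the asterisk behaves, with hypothetical patterns and inputs. Note that * still applies to the element written before it; pairing it with a period ('.*') is what removes the need to name a specific character.

```python
import re

# '*' allows zero repetitions, so 'ggle' matches 'go*gle'...
star_zero = re.search("go*gle", "ggle")

# ...while '+' requires at least one 'o', so the same string fails.
plus_zero = re.search("go+gle", "ggle")

# '.*' matches any run of characters (except newlines) between the two words.
any_run = re.search("chips.*dip", "chips and dip")

print(star_zero is not None)  # True
print(plus_zero is not None)  # False
print(any_run.group())
```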

This lesson covers only the basics of regular expressions. There are many other metacharacters available, and their combinations allow us to match complex patterns in text.

If you are interested in learning more, below is a list of some more metacharacters (and the ones we studied in this lesson) used in regular expressions, along with their usage. Try experimenting with these metacharacters to create unique regular expressions.

| Metacharacter | Character Name | Usage |
| --- | --- | --- |
| [ ] | Square brackets | Matches set of characters specified within them |
| . | Period | Matches any single character except newline |
| ^ | Caret | Matches the start of string |
| $ | Dollar | Matches the end of string |
| * | Asterisk | Matches if there are zero or more repetitions |
| + | Plus | Matches if there are one or more repetitions |
| \w | Lowercase w | Matches a single letter, digit, or underscore |
| \W | Uppercase W | Matches any character which is not a part of \w |
| \s | Lowercase s | Matches single whitespace character |
| \S | Uppercase S | Matches any character which is not a part of \s |
| \d | Lowercase d | Matches decimal digit in the range 0-9 |
| \D | Uppercase D | Matches any character which is not a part of \d |
| \t | Lowercase t | Matches tab |
| \n | Lowercase n | Matches newline character |
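
As a starting point for experimenting, here is a small sketch that tries a few of the listed metacharacters. All patterns and inputs are hypothetical. The patterns are written as raw strings (with an r prefix) so that Python passes the backslashes through to the re module unchanged.

```python
import re

# \d+ : one or more decimal digits.
digits = re.search(r"\d+", "Room 101 is open")

# \w+\s\w+ : two runs of word characters separated by one whitespace character.
two_words = re.search(r"\w+\s\w+", "hello world!")

# end$ : 'end' must be the last thing in the string.
at_end = re.search(r"end$", "the end")
not_at_end = re.search(r"end$", "endless")

print(digits.group())
print(two_words.group())
print(at_end is not None, not_at_end is not None)
```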

One Pager Cheat Sheet

  • Search engines and text editors use **regular expressions** to quickly locate data and filter out the relevant information.
  • Regular expressions, or regex, are a special sequence of characters that define a pattern to match text, which can be more powerful and shorter than the ‘equal’ operator (==), indexing, or other string methods.
  • The import keyword is used to load the re module in Python, and therefore import re is a valid statement.
  • The re module provides a basic search method, re.search(regex, string), which scans through a string and looks for a match to the provided regular expression, as demonstrated in a simple example.
  • Using special characters (or metacharacters) in regular expressions adds unique meaning to the expression, typically demonstrated by defining a character class with square brackets ([ ]).
  • Using square brackets such as [a-z] and [0-9], together with the period (.), we can create regex patterns that match character ranges, digits, and consecutive characters.
  • The ^ metacharacter allows us to match multiple strings that start with the same characters.
  • The metacharacter + checks if the previous character appears one or more times.
  • The * metacharacter matches zero or more repetitions of the preceding element, so the repeated character does not have to be present in the string at all.
  • The basics of regular expressions were just covered, but with metacharacters such as Square Brackets, Period, Caret, Dollar, Asterisk, Plus, \w, \W, \s, \S, \d, \D, \t and \n, one can match complex patterns in text.