Regular expressions in Python are one of the more difficult concepts for beginners to wrap their heads around. They can be very confusing at first and take some practice to master. Once mastered, though, they are a very handy tool to have and can be used in many different cases:
- Web Scraping
- Working with date time format
- Natural language processing
- Extracting information from text and reformatting it
While learning Python regular expressions, I found that there were too many different parts to learn at once, so I divided the whole process into steps:
- Package requirement
- Defining regular expression pattern
- Creating a regular expression object using the above pattern
- Matching the regular expression pattern
- Extracting those matched strings
Now, we will look at each of these steps and its associated methods in detail.
Package Requirement
As with most things in Python, we need to import a package before we can work with regular expressions. The required package is “re”, which is part of the standard library, so there is nothing extra to install.
import re
Defining a Python regular expression pattern
This is the most important step in the whole process. Unless we write an accurate and suitable pattern for our search, we will not be able to extract the relevant information.
Let us first define the tools available for writing patterns, and then we will look at some usage.
- \d – It represents any numeric digit i.e. 0 to 9.
- \D – It represents any character that is not a numeric digit i.e. apart from 0 to 9.
- \w – It represents any letter, numeric digit or the underscore character (_).
- \W – It represents any character that is not a letter, numeric digit or the underscore character (_).
- \s – It represents any whitespace character, such as a space, tab or newline.
- \S – It represents any character that is not whitespace, i.e. not a space, tab or newline.
As the list shows, a lowercase letter matches a particular class of characters, and the corresponding uppercase letter matches everything outside that class.
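The shorthand classes above can be tried out directly with re.findall(). This is only a small sketch: the sample string and the variable names are made up for illustration.

```python
import re

# Illustrative sample text (an assumption, not from any real data set).
sample = "id_7 costs 30 USD"

digits = re.findall(r"\d", sample)       # each digit: ['7', '3', '0']
non_digits = re.findall(r"\D+", sample)  # runs of non-digits: ['id_', ' costs ', ' USD']
words = re.findall(r"\w+", sample)       # letters, digits, underscore: ['id_7', 'costs', '30', 'USD']
separators = re.findall(r"\W", sample)   # everything \w rejects: [' ', ' ', ' ']
spaces = re.findall(r"\s", sample)       # whitespace: [' ', ' ', ' ']
chunks = re.findall(r"\S+", sample)      # runs of non-whitespace: ['id_7', 'costs', '30', 'USD']

print(digits, words, chunks)
```

Note how each uppercase class returns exactly what its lowercase partner leaves out.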
- \n – It represents a single newline character.
- () – Parentheses create groups. Groups let us treat different parts of a match separately, for example by repeating one part or extracting it on its own, which makes pattern matching very flexible.
- ? – It marks the preceding group as optional. The pattern matches whether or not that part of the text is present; in other words, the group may appear zero times or once.
- Star or asterisk (*) – It matches zero or more occurrences of the preceding group, so the pattern matches whether the group is absent, present once, or repeated many times.
- Plus (+) – It matches one or more occurrences of the preceding group. Unlike the asterisk, it requires the group to be present at least once.
- {} – It matches a specific number of repetitions of the preceding group, e.g. {3} matches exactly three. We can also give a range, e.g. {3,5}, which matches three, four or five repetitions. Leaving out the lower or upper bound makes the range unbounded in that direction: {,5} matches zero to five repetitions and {3,} matches three or more.
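- [] – Square brackets define our own character class, which matches any single character listed inside. For example, [aeiou] matches any one lowercase vowel, and a hyphen defines a range, as in [a-z] or [0-9].
- Pipe (|) – It works like an “or” operator and lets a pattern look for more than one expression. The pattern matches if any one of the expressions matches.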
- Caret character (^) – It has two different uses. Placed just after the opening bracket of a character class, as in [^aeiou], it makes that class a negative character class, which matches any character not listed. Placed at the start of the pattern, it indicates that the match must occur at the beginning of the searched text.
- Dollar sign ($) – Placed at the end of the pattern, it indicates that the searched text must end with a match for the pattern.
- Dot (.) – It is called a wildcard and matches any single character except a newline. It can be combined with *, + or {} to match more than one character.
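To see how these pieces behave, here is a short sketch that exercises each of them once. All the patterns and test strings are invented for illustration, so treat them as examples rather than recommended patterns.

```python
import re

# () with ?: the group "u" may appear zero times or once.
optional = re.compile(r"colo(u)?r")
both_spellings = [bool(optional.search(w)) for w in ("color", "colour")]

# * accepts zero repetitions of the group; + needs at least one.
star_on_c = re.search(r"(ab)*c", "c") is not None   # True
plus_on_c = re.search(r"(ab)+c", "c") is not None   # False

# {3,5} with ^ and $: the whole text must be 3 to 5 repetitions of "ha".
three_to_five = re.compile(r"^(ha){3,5}$")
ha_results = [bool(three_to_five.search("ha" * n)) for n in range(2, 7)]

# [] defines a character class; [^] negates it.
vowels = re.findall(r"[aeiou]", "regular")       # ['e', 'u', 'a']
non_vowels = re.findall(r"[^aeiou]", "regular")  # ['r', 'g', 'l', 'r']

# | matches either alternative; search() reports the first one found in the text.
first_pet = re.search(r"cat|dog", "a dog chased a cat").group()  # 'dog'

# . matches any single character except a newline.
dot_matches = re.findall(r".at", "cat hat\nat")  # ['cat', 'hat']

print(both_spellings, ha_results, first_pet, dot_matches)
```

In the last line, the final "at" is not returned because the only character before it is a newline, which the dot does not match.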
Creating a Python regular expression object using the above pattern
We can create a regular expression object by passing the pattern string to the re.compile() function. The returned object provides the methods used to match the pattern against text.
re.compile()
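As a minimal sketch of this step, the snippet below compiles one pattern. The date pattern and the name date_pattern are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical pattern for illustration: a date written as YYYY-MM-DD.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

# The compiled object remembers the pattern string it was built from
# and exposes the matching methods used in the next step.
print(date_pattern.pattern)
print(hasattr(date_pattern, "search"), hasattr(date_pattern, "findall"))
```

The pattern is written as a raw string (the r prefix) so that Python passes backslashes such as \d to the regex engine unchanged instead of treating them as string escapes.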
Matching the regular expression pattern
We can use different methods of the object created above to match it against text. One thing to keep in mind is that matching is case sensitive by default.
- search() – It searches the given text for the pattern and returns a match object for the first match, or None if there is no match. Even if the text contains more than one match, only the first one is returned.
- findall() – It returns a list containing the string of every match in the searched text. If the pattern contains groups, the list holds the text of the groups instead.
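The difference between the two methods is easiest to see side by side. The following is a sketch with an invented invoice-number pattern and sample text; it also shows the re.IGNORECASE flag as one way to switch off case sensitivity.

```python
import re

# Made-up sample text and pattern, for illustration only.
text = "Invoices: INV-1001, INV-1002 and inv-1003"
invoice = re.compile(r"INV-\d{4}")

first = invoice.search(text)               # match object for the first match
first_text = first.group()                 # 'INV-1001'
all_matches = invoice.findall(text)        # ['INV-1001', 'INV-1002']
no_match = invoice.search("nothing here")  # None when nothing matches

# Matching is case sensitive by default, so 'inv-1003' was skipped above.
# Passing re.IGNORECASE to re.compile() makes the pattern ignore case.
invoice_any_case = re.compile(r"INV-\d{4}", re.IGNORECASE)
all_any_case = invoice_any_case.findall(text)

print(first_text, all_matches, all_any_case)
```

Because search() can return None, it is worth checking the result before calling any method on it.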
Matching is greedy by default: where a quantifier such as {3,5} leaves a choice, the pattern matches the longest string possible. Adding ? after the quantifier gives the non-greedy version, which matches the shortest string possible.
re.compile(r'(ha){3,5}') – greedy search
re.compile(r'(ha){3,5}?') – non-greedy search
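Running both patterns on the same text makes the difference visible. This is a small sketch: the test strings are made up, and the second pair of patterns (with .* and .*?) is an extra illustration that is not part of the two lines above.

```python
import re

greedy = re.compile(r"(ha){3,5}")
non_greedy = re.compile(r"(ha){3,5}?")

text = "hahahahaha"  # "ha" repeated five times

greedy_result = greedy.search(text).group()          # 'hahahahaha' (takes all five)
non_greedy_result = non_greedy.search(text).group()  # 'hahaha' (stops at three)

# The same rule applies to * and +: adding ? makes them stop as early as possible.
tags = "<a><b>"
greedy_tag = re.search(r"<.*>", tags).group()       # '<a><b>'
non_greedy_tag = re.search(r"<.*?>", tags).group()  # '<a>'

print(greedy_result, non_greedy_result, greedy_tag, non_greedy_tag)
```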
Extracting those matched strings
We have now written the regular expression pattern, compiled it into a regex object and searched for it. The next step is to extract the matched strings from the match object.
- group() – It returns the entire matched string from the text.
- group(0) – It is the same as group() and returns the entire matched string.
- group(1) – It returns the matched text for the first group in the pattern; group(2) returns the second, and so on.
- groups() – It returns the text of all the groups together as a tuple.
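Putting the steps together, the sketch below extracts the parts of a date and reformats them. The pattern, the sample sentence and the output format are assumptions chosen for illustration.

```python
import re

# Three groups: year, month and day of a date written as YYYY-MM-DD.
date = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
match = date.search("Released on 2021-03-07.")

whole = match.group()        # '2021-03-07'
also_whole = match.group(0)  # same as group()
year_only = match.group(1)   # '2021'
parts = match.groups()       # ('2021', '03', '07')

# Unpack the groups and rearrange them as DD/MM/YYYY.
year, month, day = parts
reformatted = f"{day}/{month}/{year}"  # '07/03/2021'

print(whole, parts, reformatted)
```

Unpacking groups() into separate names is a convenient way to reformat extracted text, which is one of the use cases listed at the start.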
Now we know how to work with Python regular expressions. In the next post, we will work out some examples.