In the last post (Beginner’s Guide to Python Regular Expression), we learnt about Python regular expressions. If you don’t know their basic syntax and structure, it is better to read that post first. If you do, let’s practice some of the concepts it covered. We will work through some examples to understand how to use Python regular expressions to solve problems.
We need a sample text to practice on. I have chosen a Wikipedia article (on Game of Thrones). Let’s copy its first paragraph. You can also copy it from below:
'Game of Thrones is an American fantasy drama television series created by David Benioff and D. B. Weiss. It is an adaptation of A Song of Ice and Fire, George R. R. Martin series of fantasy novels, the first of which is A Game of Thrones. It is filmed in Belfast and elsewhere in the United Kingdom, Canada, Croatia, Iceland, Malta, Morocco, Spain, and the United States. The series premiered on HBO in the United States on April 17, 2011, and its seventh season ended on August 27, 2017. The series will conclude with its eighth season premiering in 2019.'
First of all, let’s save it in a variable named ‘para’, so that it is easy to pass to any function.
# Saving the text in a variable
para = '''Game of Thrones is an American fantasy drama television series created by David Benioff and D. B. Weiss. It is an adaptation of A Song of Ice and Fire, George R. R. Martin series of fantasy novels, the first of which is A Game of Thrones. It is filmed in Belfast and elsewhere in the United Kingdom, Canada, Croatia, Iceland, Malta, Morocco, Spain, and the United States. The series premiered on HBO in the United States on April 17, 2011, and its seventh season ended on August 27, 2017. The series will conclude with its eighth season premiering in 2019.'''
Example 1: Extract all characters from the paragraph using Python regular expressions.
import re
pattern = r'.'
pattern_regex = re.compile(pattern)
result = pattern_regex.findall(para)
print(result)
# output:- ['G', 'a', 'm', 'e', ' ', 'o', 'f', ' ', 'T', 'h',
#           'r', 'o', 'n', 'e', 's', ' ', 'i' etc...]
- The code above extracts every character and displays it.
- To use Python regular expressions, we need the re module.
- Notice that the pattern is saved as a raw string (r'.'). In a raw string, Python does not treat the backslash as an escape character, so sequences such as \w and \d reach the regex engine unchanged.
- I have structured the code to follow the same flow as the last tutorial post on Python Regular Expression.
- We first define the pattern, then compile it into a regex object, and then search for the pattern in the stored paragraph.
- We use findall() because we need all of the matches. search() would give us only the first match.
- The output also contains spaces, because the dot (.) matches any character except a newline, and a space is a character too.
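To see the difference between search() and findall(), and how the dot treats a newline, here is a small sketch. The two-line string sample is made up for this illustration and is not part of the paragraph above; re.DOTALL is the standard flag that lets the dot match a newline as well.

```python
import re

# Hypothetical two-line sample, used only for this illustration
sample = "Ice\nFire"

dot_regex = re.compile(r'.')

# search() returns a match object for the first match only (or None)
first_match = dot_regex.search(sample)
print(first_match.group())  # I

# findall() returns every match as a list of strings;
# by default the dot skips the newline character
all_matches = dot_regex.findall(sample)
print(all_matches)  # ['I', 'c', 'e', 'F', 'i', 'r', 'e']

# With re.DOTALL the dot matches the newline too
dotall_matches = re.compile(r'.', re.DOTALL).findall(sample)
print(len(all_matches), len(dotall_matches))  # 7 8
```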
Example 2: Change the above code so that it extracts only word characters
import re
pattern = r'\w'
pattern_regex = re.compile(pattern)
result = pattern_regex.findall(para)
print(result)
# output:- ['G', 'a', 'm', 'e', 'o', 'f', 'T', 'h', 'r', 'o', etc..]
- This extracts all of the word characters, so spaces and punctuation are dropped.
- We are using ‘\w’ to represent any letter, numeric digit or the underscore character.
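The underscore is easy to forget here. A quick sketch on a made-up string (sample is not from the paragraph) shows what \w keeps, and how a custom class such as [a-zA-Z] differs if you want letters only:

```python
import re

# Hypothetical sample containing an underscore, digits and punctuation
sample = "season_8 airs in 2019!"

# \w keeps letters, digits and the underscore
word_chars = re.findall(r'\w', sample)
print(''.join(word_chars))  # season_8airsin2019

# A custom class keeps letters only
letters_only = re.findall(r'[a-zA-Z]', sample)
print(''.join(letters_only))  # seasonairsin
```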
Example 3: Extract only numeric digits from it
import re
pattern = r'\d'
pattern_regex = re.compile(pattern)
result = pattern_regex.findall(para)
print(result)
# output:- ['1', '7', '2', '0', '1', '1', '2', '7' etc..]
- We are using \d to represent any numeric digit.
Example 4: Extract all of the words and numbers
import re
pattern = r'\w+'
pattern_regex = re.compile(pattern)
result = pattern_regex.findall(para)
print(result)
# output:- ['Game', 'of', 'Thrones', 'is', 'an', 'American', etc..]
- We have used ‘+’, which denotes one or more. So, the pattern extracts every run of one or more word characters.
- It will extract both numbers and words.
Example 5: Extract only numbers
import re
pattern = r'\d+'
pattern_regex = re.compile(pattern)
result = pattern_regex.findall(para)
print(result)
# output:- ['17', '2011', '27', '2017', '2019']
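The matches come back as strings, so they need converting before you can do any arithmetic with them. A small sketch on a shortened sentence (sample is a name made up for this illustration); treating every standalone four-digit number as a year is an assumption that happens to hold for this text:

```python
import re

# Shortened sample sentence, used only for this illustration
sample = "premiered on April 17, 2011, and ended on August 27, 2017"

number_strings = re.findall(r'\d+', sample)
print(number_strings)  # ['17', '2011', '27', '2017']

# Convert the string matches to integers
numbers = [int(n) for n in number_strings]
print(max(numbers))  # 2017

# Assumption: a standalone four-digit number is a year
years = re.findall(r'\b\d{4}\b', sample)
print(years)  # ['2011', '2017']
```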
Example 6: Extract the beginning word
import re
pattern = r'^\w+'
pattern_regex = re.compile(pattern)
result = pattern_regex.findall(para)
print(result)
# output:- ['Game']
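Here ^ anchors the pattern to the start of the whole string, which is why only one word comes back. If your text has several lines, the standard re.MULTILINE flag makes ^ match at the start of every line. A quick sketch on a made-up two-line sample:

```python
import re

# Hypothetical two-line sample, used only for this illustration
sample = "Game of Thrones\nA Song of Ice and Fire"

# Default: ^ matches only at the very start of the string
first_word = re.findall(r'^\w+', sample)
print(first_word)  # ['Game']

# re.MULTILINE: ^ also matches right after each newline
line_starts = re.findall(r'^\w+', sample, re.MULTILINE)
print(line_starts)  # ['Game', 'A']
```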
Example 7: Extract first two characters from each word (not the numbers)
import re
pattern = r'\b[a-zA-Z].'
pattern_regex = re.compile(pattern)
result = pattern_regex.findall(para)
print(result)
# output:- ['Ga', 'of', 'Th', 'is', 'an', 'Am' etc..]
- We didn’t use \w for the first character, as it would have also matched digits.
- We have defined our own character class using [].
- As the search is case-sensitive, we had to include both upper and lower case letters.
- We are also using \b, which marks a word boundary, so the pattern matches only at the beginning of each word.
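To see what \b is doing, compare the same pattern with and without it. The sketch below uses a short made-up string (sample) rather than the full paragraph. It also shows one thing to watch for: because the dot matches any character, a one-letter word such as D. comes back with its full stop. The stricter variant with {2} is just one possible alternative, not the only fix.

```python
import re

# Hypothetical sample, used only for this illustration
sample = "by D. B. Weiss in 2011"

# Without \b the pattern also matches in the middle of words
without_boundary = re.findall(r'[a-zA-Z].', sample)
print(without_boundary)  # ['by', 'D.', 'B.', 'We', 'is', 's ', 'in']

# With \b it matches only at the start of each word
with_boundary = re.findall(r'\b[a-zA-Z].', sample)
print(with_boundary)  # ['by', 'D.', 'B.', 'We', 'in']

# Stricter variant: require two letters, so 'D.' and 'B.' are skipped
two_letters = re.findall(r'\b[a-zA-Z]{2}', sample)
print(two_letters)  # ['by', 'We', 'in']
```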
Example 8: Make the above search case-insensitive so that we can define our character class using only lower case
import re
pattern = r'\b[a-z].'
pattern_regex = re.compile(pattern, re.IGNORECASE)
matched_object = pattern_regex.findall(para)
print(matched_object)
- This will give the same output as example 7.
- We have passed a second argument (a flag) to re.compile() to make the search case-insensitive.
- re.compile() takes a single flags argument. If you need more than one flag, combine them with the pipe (|) operator.
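As a rough sketch of combining flags, the snippet below joins re.IGNORECASE and re.MULTILINE with the pipe. The two-line string sample is made up for this illustration.

```python
import re

# Hypothetical two-line sample, used only for this illustration
sample = "Game of Thrones\ngame over"

# One flag: case-insensitive, but ^ still matches only at the string start
one_flag = re.compile(r'^game', re.IGNORECASE).findall(sample)
print(one_flag)  # ['Game']

# Two flags combined with |: case-insensitive and ^ matches at every line start
two_flags = re.compile(r'^game', re.IGNORECASE | re.MULTILINE).findall(sample)
print(two_flags)  # ['Game', 'game']
```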
Example 9: Find all of the words that start with a vowel.
import re
pattern = r'\b[aeiou]\w+'
pattern_regex = re.compile(pattern, re.IGNORECASE)
result = pattern_regex.findall(para)
print(result)
# output:- ['of', 'is', 'an', 'American', 'and', 'It', etc..]
- Check the use of re.IGNORECASE and \b.
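One detail worth checking: \w+ needs at least one more word character after the vowel, so a single-letter word such as A is skipped. If you want those too, \w* (zero or more) is one option. A small sketch on a fragment of the paragraph (the variable name sample is just for this illustration):

```python
import re

# Fragment of the paragraph, used only for this illustration
sample = "It is an adaptation of A Song of Ice and Fire"

# \w+ requires at least two characters in total, so 'A' is skipped
two_or_more = re.findall(r'\b[aeiou]\w+', sample, re.IGNORECASE)
print(two_or_more)
# ['It', 'is', 'an', 'adaptation', 'of', 'of', 'Ice', 'and']

# \w* also accepts single-letter words
one_or_more = re.findall(r'\b[aeiou]\w*', sample, re.IGNORECASE)
print(one_or_more)
# ['It', 'is', 'an', 'adaptation', 'of', 'A', 'of', 'Ice', 'and']
```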
Example 10: Find all of the words that start with a consonant
import re
pattern = r'\b[^aeiou0-9 ]\w+'
pattern_regex = re.compile(pattern, re.IGNORECASE)
result = pattern_regex.findall(para)
print(result)
# output:- ['Game', 'Thrones', 'fantasy', 'drama', etc..]
- We have used ^ inside [] to create a negated character class.
- By excluding vowels, digits and the space, we are left with consonants as the first character for this paragraph.
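A negated class matches everything that is not listed, which includes punctuation and the underscore. That does not matter for this paragraph, but it can on other text. The sketch below uses a made-up string (sample) to show the effect, and an explicit consonant class as one possible alternative:

```python
import re

# Hypothetical sample with a hyphen and an underscore
sample = "a well-known _draft of Fire"

# The negated class lets '-' and '_' act as the first character
negated = re.findall(r'\b[^aeiou0-9 ]\w+', sample, re.IGNORECASE)
print(negated)  # ['well', '-known', '_draft', 'Fire']

# Listing the consonants explicitly avoids that
explicit = re.findall(r'\b[b-df-hj-np-tv-z]\w+', sample, re.IGNORECASE)
print(explicit)  # ['well', 'known', 'Fire']
```

Note that _draft disappears from the second result because the underscore is a word character, so there is no word boundary between _ and d.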
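Example 11: Count the total number of a, an and the
import re
pattern = r'\b(?:the|an|a)\b'
pattern_regex = re.compile(pattern, re.IGNORECASE)
result = pattern_regex.findall(para)
print(len(result))
- This will give you the count of a, an and the.
- The \b on both sides restricts the matches to whole words. Without it, the pattern would also match the letter a inside other words, and the alternative an would never be reached because a matches first.
- findall() returns a list, so we can use list functions such as len() for calculations.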
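When counting whole words, word boundaries make a big difference. The sketch below compares a plain alternation with a bounded one on a made-up sentence (sample), and uses collections.Counter from the standard library to get a count per word. Lower-casing the matches before counting is a choice made for this illustration.

```python
import re
from collections import Counter

# Hypothetical sample sentence, used only for this illustration
sample = "The cat sat on a mat and ate an apple near the barn"

# Plain alternation: also counts every letter 'a' inside other words
loose = re.findall(r'the|a|an', sample, re.IGNORECASE)
print(len(loose))  # 12

# Bounded alternation: counts whole words only
exact = re.findall(r'\b(?:the|an|a)\b', sample, re.IGNORECASE)
print(exact)  # ['The', 'a', 'an', 'the']

# Per-word counts, ignoring case
counts = Counter(word.lower() for word in exact)
print(counts['the'], counts['a'], counts['an'])  # 2 1 1
```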
Example 12: We can see that there are dates in Month Day, Year format. Let’s extract these dates.
import re
pattern = r'(\w+)(\s)(\d+)([,]\s)(\d+)'
pattern_regex = re.compile(pattern)
result = pattern_regex.findall(para)
print(result)
# output:- [('April', ' ', '17', ', ', '2011'),
#           ('August', ' ', '27', ', ', '2017')]
- This example also shows the strength of Python regular expressions in retrieving information from unstructured data.
- Because the pattern contains groups, findall() returns one tuple per match, so the output does not look like a date yet and needs some work to get it into the format you want.
- Let’s change the format of the dates and print them in a more readable form.
# printing the dates
dates_found = []
for dates in result:
    dates_found.append("-".join([dates[2], dates[0], dates[4]]))
print(dates_found)
# output:- ['17-April-2011', '27-August-2017']
- We are using join() to re-format the dates.
- As result is a list, we can use a for loop to iterate over its elements and work on each one.
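If you need real date objects rather than strings, one option is to name the groups and hand each match to datetime.strptime(). This is only a sketch under some assumptions: the sentence in sample is a shortened version of the paragraph, the pattern is a stricter variant written for this illustration (capitalised month, 1 or 2 digit day, 4 digit year), and the %B directive expects English month names, which depends on your locale.

```python
import re
from datetime import datetime

# Shortened sample sentence, used only for this illustration
sample = ("The series premiered on HBO on April 17, 2011, "
          "and its seventh season ended on August 27, 2017.")

# Named groups make the pieces easier to refer to than tuple indexes
date_regex = re.compile(
    r'(?P<month>[A-Z][a-z]+)\s(?P<day>\d{1,2}),\s(?P<year>\d{4})'
)

dates_found = []
iso_dates = []
for match in date_regex.finditer(sample):
    # Re-format using the group names
    dates_found.append("-".join(
        [match.group('day'), match.group('month'), match.group('year')]
    ))
    # Parse the whole match, e.g. 'April 17, 2011' (assumes an English locale)
    parsed = datetime.strptime(match.group(0), '%B %d, %Y').date()
    iso_dates.append(parsed.isoformat())

print(dates_found)  # ['17-April-2011', '27-August-2017']
print(iso_dates)    # ['2011-04-17', '2017-08-27']
```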
These were a few examples that will help you understand Python regular expressions clearly. I will also try to build a practical application with them. Let me know if you have anything specific in mind. Also, don’t forget to share, like and comment on this post.