Scraping FIFA Men’s Ranking data using Python


When the World Cup started, I got curious about the teams and their FIFA World Ranking, and I wanted to see how the ranks of different countries have changed over time. The ranking was introduced in December 1992 and uses a points-based system: points are awarded based on the results of FIFA-recognised international matches. FIFA revamped the ranking system after the 2006 World Cup, and the first ranking under the new system was issued in July 2006. The FIFA Men’s Ranking is published on the FIFA website, but there is no API to download the data, so we need to scrape it from each ranking page. In this project, we will scrape the FIFA Men’s Ranking data from 1992 to the latest release. You can check FIFA Men’s ranking here.

 

FIFA Men’s ranking

Before starting to code, let’s try to understand the structure of this website and its ranking pages. We need to understand the logic so that our code loads only the required pages and scrapes data from them. The URL for the current FIFA ranking page is https://www.fifa.com/fifa-world-ranking/ranking-table/men/index.html. Let’s browse to some of the past rankings by clicking the arrow on the right-hand side of the page and check the URLs.

  • 17th May 2018 – https://www.fifa.com/fifa-world-ranking/ranking-table/men/rank=286/index.html
  • 12th Apr 2018 –  https://www.fifa.com/fifa-world-ranking/ranking-table/men/rank=285/index.html
  • 15th Mar 2018 – https://www.fifa.com/fifa-world-ranking/ranking-table/men/rank=284/index.html
  • 15th Feb 2018 – https://www.fifa.com/fifa-world-ranking/ranking-table/men/rank=283/index.html

We can clearly see the pattern here, which is “https://www.fifa.com/fifa-world-ranking/ranking-table/men/rank=(number)/index.html”.

We can use this pattern to build each URL and loop over the pages from rank=2 to rank=287. We start from 2 because the pages for rank=0 and rank=1 do not show any ranking.
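
As a quick check of this idea, here is a minimal sketch that builds the full list of page URLs from the pattern. The function name build_ranking_urls is only my own illustration, and it assumes that the pattern above holds for every rank number in the range.

```python
# Assumption: every ranking page follows the URL pattern observed above.
FIFA_URL = "https://www.fifa.com/fifa-world-ranking/ranking-table/men/rank={}/index.html"


def build_ranking_urls(first_page, last_page):
    """Return one URL per rank number, including both end points."""
    return [FIFA_URL.format(i) for i in range(first_page, last_page + 1)]


urls = build_ranking_urls(2, 287)
print(len(urls))   # 286 pages in total
print(urls[0])     # the oldest ranking page (rank=2)
print(urls[-1])    # the newest ranking page (rank=287)
```

In the actual scraper we do not need to keep the whole list. We build one URL at a time inside the loop, which does the same job.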

Now that we know the structure of the pages, we can use it to scrape data from them. Scraping data from any website can be divided into a few steps:

  1. Navigating to each page and rendering all the information
  2. Converting page into a suitably parsed format
  3. Finding the required data from the parsed format and storing it

 

Python Code for Scraping FIFA Men’s Ranking

So, let’s write some Python code to perform all of the above steps.

# loading required packages
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import datetime

# initializing needed variables
fifa_url = 'https://www.fifa.com/fifa-world-ranking/ranking-table/men/rank={}/index.html'
first_page = 2
last_page = 287
list_downloaded_page = []
full_data = []

# launching chrome webdriver
driver = webdriver.Chrome()

try:
    for i in range(first_page, last_page+1):
        # build the URL of this ranking page and load it
        target_url = fifa_url.format(i)
        driver.get(target_url)
        time.sleep(10)
        # click the last pagination link so that every rank is rendered
        # (newer Selenium 4 releases no longer have this helper; there, use
        # driver.find_element(By.LINK_TEXT, '201-211') instead)
        driver.find_element_by_link_text('201-211').click()
        time.sleep(3)
        # parse the rendered page
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        time.sleep(1)
        # release date of this ranking
        rank_date = soup.find('div', {'class':['slider-wrap']}).find('li').text.strip()
        rank_date = datetime.datetime.strptime(rank_date, '%d %B %Y')

        # one row per team, with the release date appended
        for table in soup.find_all('table', 'table tbl-ranking table-striped'):
            for tr in table.find_all('tr', 'anchor'):
                row_data = []
                for td in tr.find_all('td'):
                    try:
                        res = td.text.strip()
                        row_data += [res]
                    except TypeError:
                        pass
                row_data += [rank_date]
                full_data.append(row_data)
finally:
    # always close the browser, even if something failed
    driver.quit()

fifa_rank = pd.DataFrame(full_data)
fifa_rank.to_csv('fifa_ranking.csv', index=False, encoding='utf-8')

Let’s look at the different parts of the code and see what is happening in each.

 


import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import datetime

In this part of the code, I am importing the packages required for this project.

  • pandas:- It provides easy-to-use data structures and data analysis tools for the Python language.
  • bs4 or BeautifulSoup:- It helps in navigating, searching and pulling data out of HTML and XML documents.
  • selenium:- It helps in automating web browsers. Here, we use it to automate the process of rendering the different pages.
  • time:- This module provides various time-related functions. I use it mostly for its sleep() function.
  • datetime:- It provides classes for manipulating dates and times.

fifa_url = 'https://www.fifa.com/fifa-world-ranking/ranking-table/men/rank={}/index.html'
first_page = 2
last_page = 287
list_downloaded_page = []
full_data = []

Here we initialize some variables. I have created the fifa_url variable with a {} placeholder in place of the rank number, and I will use it to browse through all of the ranking pages in a FOR loop. We are scraping pages 2 to 287. You can change the value of the last_page variable based on the date you run the code and the ranks available at that time. All of the scraped data will be stored in the full_data list.

driver = webdriver.Chrome()

We need a web browser to browse through and render all the ranking pages. webdriver.Chrome() launches an instance of the Chrome browser that is controlled by our code and can be referenced through the variable driver. To drive Chrome, Selenium needs a separate ChromeDriver executable (chromedriver.exe on Windows), which you will need to download from here. Before running this code, make sure to put the downloaded .exe file in the working directory. You can reference it from any other location, but I find it easier to run it like this. Webdrivers are available for other browsers too, e.g. Firefox and Internet Explorer. Just make sure you have downloaded the right webdriver and are using the right function to call that particular browser.

try:
    # Main code
finally:
    driver.quit()

I am using a try-finally block to make sure that the Chrome browser is closed even if an error occurs. It is a good practice to follow.
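
To see this behaviour without opening a real browser, here is a small sketch with a made-up FakeDriver class. It is not part of Selenium and only imitates a driver whose get() fails. The point is that quit() still runs when the error is raised.

```python
class FakeDriver:
    """Hypothetical stand-in for a Selenium driver, used only in this demo."""

    def __init__(self):
        self.closed = False

    def get(self, url):
        # Simulate a page that fails to load.
        raise RuntimeError("could not load " + url)

    def quit(self):
        self.closed = True


def scrape_once(driver, url):
    """Visit one URL and always close the driver afterwards."""
    try:
        driver.get(url)
    finally:
        driver.quit()


fake = FakeDriver()
try:
    scrape_once(fake, "https://example.com/")
except RuntimeError as err:
    # The error still reaches the caller; finally does not swallow it.
    print("error propagated:", err)

print("driver closed:", fake.closed)  # driver closed: True
```

Note that finally does not hide the error. It only guarantees the cleanup, so a failed run will still stop with a traceback after the browser has been closed.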

for i in range(first_page, last_page+1):
    # code lines

I am using a FOR loop to go through all the numbers from first_page to last_page, i.e. 2 to 287. We will use these numbers to browse through the different pages by inserting each one into the fifa_url string.

target_url = fifa_url.format(i)

In this line, I am placing the value of “i” into the string. format() does not change fifa_url itself. It returns a new string with the number from the FOR loop in place of {}, and we store that in target_url. This URL will direct the driver to a different ranking page on each iteration. For example, for i = 2, the value of target_url will be “https://www.fifa.com/fifa-world-ranking/ranking-table/men/rank=2/index.html”.

driver.get(target_url)

In the previous line, we created the URL that we would like to render. Now we need to direct our driver to that URL. For this, I am using the driver.get() method, which navigates the driver to the target_url supplied as an argument.

time.sleep(10)

time.sleep() pauses our Python code for the given number of seconds. It is very useful in web scraping projects, as it gives our driver sufficient time to render the web page and can account for internet speed and other factors. The pause time can be increased or decreased based on your internet connection etc.
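
A fixed pause either wastes time on a fast connection or is too short on a slow one. An alternative is to check repeatedly until the page is ready and give up after a limit. Below is a minimal sketch of that idea in plain Python. The helper wait_until and the page_ready condition are hypothetical names of mine, not part of the scraper above. Selenium also ships its own WebDriverWait class (in selenium.webdriver.support.ui) for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Demo with a stand-in condition that becomes true on the third check.
attempts = []


def page_ready():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(page_ready, timeout=2.0, poll_interval=0.01)
print(len(attempts))  # 3
```

In the scraper, the condition could be something like "the ranking table is present in driver.page_source", but that is an assumption about the page and would need to be tested against the real site.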

driver.find_element_by_link_text('201-211').click()

We are using two different functions in this line, i.e. driver.find_element_by_link_text() and click(). The different components of an HTML page are called elements. When we know the text of a link (the text inside an anchor tag), we can use find_element_by_link_text(), which returns the first element whose link text matches. After getting the link element, we can click on it using click(). It is just like clicking on a link with your mouse.

Now we need to understand the importance of this line. A web page only renders the portion that is currently shown, i.e. any data that sits behind a link won’t be rendered until that link is clicked. Before proceeding further, we need to make sure that all the information is present on the page. If you visit any FIFA ranking page, you will notice that only the top 50 ranks are shown. We need all the ranks and their details to be on the page, which can be done by clicking the 201-211 link. It shows all the rankings for that particular month.

soup = BeautifulSoup(driver.page_source, 'html.parser')

Till now, we have built our URL and navigated to it. We have also made sure that all the information is present on the web page, mostly using the selenium package. Now we need to parse the whole page into a suitable format and store the required data from it. We will use the BeautifulSoup package for this.

We are using two things here, i.e. BeautifulSoup() and driver.page_source.

  • driver.page_source – It is a property (not a method call) that returns the source of the page currently open in the driver.
  • BeautifulSoup() – It transforms a complex HTML document into a tree of Python objects by parsing it with the parser named in the second argument. We are using “html.parser” here, but other parsers are available too. It returns a BeautifulSoup object.

rank_date = soup.find('div', {'class': ['slider-wrap']}).find('li').text.strip()
rank_date = datetime.datetime.strptime(rank_date, '%d %B %Y')

We will be downloading the whole history of the FIFA ranking, so we need to be able to identify the date on which each ranking was released. For this, we save the date in the rank_date variable. soup.find() finds the first matching object and returns it. We then take its text and strip the surrounding whitespace. As the result is a string, we convert it into a datetime object with strptime(), where the format '%d %B %Y' matches dates such as “17 May 2018”.
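
Here is the date conversion on its own, as a small sketch. The raw_text value is only my assumption about what the scraped text looks like (a date with some whitespace around it). Note that %B expects the full month name in the current locale, so this assumes an English locale.

```python
import datetime

# Assumed shape of the scraped text: a date surrounded by whitespace.
raw_text = "\n   17 May 2018  \n"

# strip() removes the surrounding whitespace, strptime() parses the rest.
rank_date = datetime.datetime.strptime(raw_text.strip(), "%d %B %Y")
print(rank_date)                     # 2018-05-17 00:00:00
print(rank_date.date().isoformat())  # 2018-05-17

# A string in a different layout does not match the format and raises ValueError.
try:
    datetime.datetime.strptime("2018-05-17", "%d %B %Y")
except ValueError as err:
    print("could not parse:", err)
```

If FIFA ever changes how the date is displayed, this is the line that will fail first, so it also works as an early warning that the page layout has changed.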

for table in soup.find_all('table', 'table tbl-ranking table-striped'):
    for tr in table.find_all('tr', 'anchor'):
        row_data = []
        for td in tr.find_all('td'):
            try:
                res = td.text.strip()
                row_data += [res]
            except TypeError:
                pass
        row_data += [rank_date]
        full_data.append(row_data)

Before looking at this nested FOR loop, let’s summarize what we have done till now and what is left to do. We have navigated to the required FIFA ranking page and converted the page source into a BeautifulSoup object. We have also stored the date on which that particular FIFA ranking was released. What is left is to pull out the ranking data and store it in our full_data variable.

The ranking data is stored in a table, so we first find the table, then go through each row separately and scrape the data from each cell, which holds the value of one column in that row. There is a single table with the mentioned class on each web page, but I still find it first, as it sets a boundary for the rows and makes the code easier to read. We have used find() before. In this loop we use find_all(), because we need all the matching objects, not just the first one.

After finding all the cells in the innermost loop, we use .text and strip() to pull out their contents. The try-except block makes sure that the code moves on to the next value if a cell turns out to be empty and throws an error because of that. Once we have pulled out all the data of a single row, we add the release date of the ranking to it and append the row to our full_data list. We will convert full_data into a dataframe after all the loops are done, which makes it easier to write it to a CSV file and work on the dataset.
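
To try the nested loop without a browser, we can run it on a small hand-written HTML snippet. The snippet below is made up. It only reuses the class names from the code above, and the team names, points and date are placeholders, not real FIFA data.

```python
from bs4 import BeautifulSoup

# Made-up HTML that only mimics the class names used in the scraper.
SAMPLE_HTML = """
<table class="table tbl-ranking table-striped">
  <tr><th>Rank</th><th>Team</th><th>Points</th></tr>
  <tr class="anchor"><td> 1 </td><td>Team A</td><td>1500</td></tr>
  <tr class="anchor"><td> 2 </td><td>Team B</td><td>1400</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rank_date = "2018-05-17"  # placeholder for the parsed release date

rows = []
for table in soup.find_all("table", "table tbl-ranking table-striped"):
    # Only rows with the class "anchor" hold team data; the header row is skipped.
    for tr in table.find_all("tr", "anchor"):
        row = [td.text.strip() for td in tr.find_all("td")]
        row.append(rank_date)
        rows.append(row)

for row in rows:
    print(row)
# ['1', 'Team A', '1500', '2018-05-17']
# ['2', 'Team B', '1400', '2018-05-17']
```

The header row has no "anchor" class, so it never reaches our list. Every value comes back as a string, which is one of the things we will have to clean up in the dataframe later.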

 

Now we have the whole history of the FIFA Men’s ranking. We will need to do a few data manipulations on the dataframe to make it a better dataset. In the next post, I will do data exploration and manipulation on this dataset, which will give a glimpse of what needs to be done when starting with a new dataset in order to understand it better. If you want to practice, you can scrape the Women’s ranking from the FIFA website. Let me know how you found this post and whether it was helpful or not. Also, don’t forget to share it on your social media pages.


2 responses to “Scraping FIFA Men’s Ranking data using python”

  1. Hi, thanks for your post. I am trying to scrape data from the FIFA website and saw your post. It looks like FIFA recently changed their rules for page IDs and paginated pages. I am having some issues. For example, when I load a new URL with the get method, the browser URL is updated but the information within the web page remains from the previous one. I guess it is due to caching, which is why the scraper doesn’t work properly on some pages.
