Project from scratch – Scraping real estate prices & visualizing them


real_estate_pricing_toronto_Web_Scrapping

Web scraping is one of the most important tools for collecting data for data science projects. Python (Selenium & BeautifulSoup) can be very useful for scraping websites. I was looking to code one project to practice web scraping & data visualization.

The aim of this project is to scrape real estate prices from the realtor.ca website for the Toronto area and analyze them at the postal code or FSA level (the FSA, or Forward Sortation Area, is the first three characters of a Canadian postal code). I wanted to see which areas have costlier listings. I have limited this project to the Toronto area, but the code can be used for any city (if data is available).

Project Details

We will do the following actions –

  • Scrape real estate listing prices from the website for the Toronto area
  • As the postal code is missing from the website, we will add it by scraping data from Google Maps search results
  • Data manipulation and data visualization

Packages Used

  • pandas
  • os
  • time
  • selenium
  • BeautifulSoup
  • seaborn

Scraping the realtor.ca website

For scraping any website, we first need to understand the page flow so that we can direct our actions to the relevant pages. Luckily for us, realtor.ca has custom pages for popular cities, which helps because we can navigate directly to the relevant city page.

Once we are on the relevant city page, we can see a few of the real estate listings for that area. We will need to click through the page numbers for subsequent pages. As we won’t know the total number of pages beforehand, we will need to code the script to go through all the available pages.

import pandas as pd
import os
import time
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from bs4 import BeautifulSoup

So, we are importing all the relevant packages. These packages will be used for –

  • pandas – for storing the extracted data in a DataFrame
  • os – for accessing our folder structure
  • time – for adding sleep calls that pause our code for a few seconds
  • selenium – for automating the website navigation flow and downloading HTML content
  • bs4 – for extracting relevant data from the HTML page source

def scrapping_function(current_url, click_next_button):
    driver.get(current_url)
    i = 1
    next_page_true = True
    while next_page_true:
        # save the HTML source of the current results page to a local file
        page = driver.page_source
        with open('saved_pages/toronto/page_{}.html'.format(i), 'w', encoding='utf-8') as file_:
            file_.write(page)
        # remember the current URL, then click the "next page" arrow
        page_url = driver.current_url
        elem = driver.find_element_by_xpath(click_next_button)
        elem.click()
        time.sleep(2)
        page_url_next = driver.current_url
        # on the last page the click changes nothing, so the URL stays the same
        if page_url == page_url_next:
            next_page_true = False
        else:
            i += 1

The above code downloads all the real estate listing pages for a particular city into a local folder. I am saving the pages so that we can process each page separately offline. The following actions are taken in this function –

  • driver.get opens the relevant website page. We can pass the desired area link as current_url to the function.
  • The while loop saves all the available pages. I am comparing the current page URL with the next page URL to find the last page. On the last page, nothing happens even if we click the next arrow button, and the URL stays the same.
  • driver.page_source returns all the HTML content of the page, which is saved as an HTML file in the local folder.
  • driver.current_url returns the current page URL.
  • find_element_by_xpath finds the next arrow button by the given XPath (passed in as a function parameter). I am saving this element in the elem variable and then clicking on it using elem.click().
# initializing needed variables
realtor_url = 'https://www.realtor.ca/on/toronto/real-estate-for-sale'
click_next_button = ("/html/body/form/div[5]/div[2]/span/div/div[4]/"
                     "div[3]/span/span/div/a[3]")
# launching the Firefox webdriver
binary = FirefoxBinary('Location_for_firefox_installation')
driver = webdriver.Firefox(firefox_binary=binary)
# calling the scraping function
scrapping_function(realtor_url, click_next_button)
driver.close()
  • Saving the relevant URL in the realtor_url variable. If you want to scrape data for a different city, change the URL.
  • Saving the XPath for the next arrow button element on the page. We will navigate to the next page by clicking on it.
  • Using the Firefox web driver for this. I am passing the Firefox installation folder location to the binary variable.
  • Calling the scraping function and passing the URL and XPath to it.
  • driver.close() to close the web driver. Always remember to close it.

The above code stores all the listing pages in a folder. Now we can go through each page and extract the relevant data from it.

# getting data from each page

toronto_df = pd.DataFrame()
for file in os.listdir('saved_pages/toronto'):
    html_file = 'saved_pages/toronto/{}'.format(file)
    soup = BeautifulSoup(open(html_file, encoding='utf-8'), "html.parser")
    # every listing card on the page has the "cardCon" class
    data = soup.find_all(class_="cardCon")

    for i, elem in enumerate(data):
        try:
            price = elem.find(class_="listingCardPrice").get_text()
            address = elem.find(class_="listingCardAddress").get_text()
            rooms = elem.find_all(class_="listingCardIconNum")
            bedrooms = rooms[0].get_text()
            bathrooms = rooms[1].get_text()
            row_df = pd.DataFrame({'price': [price],
                                   'address': [address],
                                   'bedrooms': [bedrooms],
                                   'bathrooms': [bathrooms],
                                   'page': [file]})
            toronto_df = pd.concat([toronto_df, row_df])
            del row_df
        except Exception:
            print("Error in record_{}".format(i))


# cleaning the dataframe
def cleaning_address(x):
    return x.replace("\n", "").strip()

toronto_df['address'] = toronto_df['address'].apply(cleaning_address)


# saving the data file
toronto_df.to_csv('output/toronto.csv', index=False)
  • Iterating through each saved page using os.listdir.
  • I am using BeautifulSoup’s HTML parser.
  • We can inspect each real estate card in the browser. All these cards have the “cardCon” class, which contains the required data, i.e. price, address, and number of bedrooms & bathrooms.
  • We are selecting all the elements with the “cardCon” class using soup.find_all and then extracting the data one by one.
  • Adding the data row-wise to a pandas DataFrame and saving it as a CSV file for later use.

Finding the Postal Code Using Google Maps

As we can see in the data, the postal code is missing from the addresses. I wanted it in our dataset, so the search continued. I tried the Canada Post address completion page and Google search but wasn’t able to automate the process for every address. Finally, the Google Maps search worked and extracted postal codes for the whole dataset.

Once we have finalized the navigation flow, the whole process becomes simple. For getting the postal code from a Google Maps search result, the flow is –

  • Iterate through the address list
  • Open the Google Maps URL
  • Enter the address in the search box
  • Press Enter
  • Extract the postal code from the search result
  • Save it in the dataset

def google_map_postal_code(x):
    # x is a row of [postal_code, modified_address]; the address is in x[1]
    search_string = x[1]
    binary = FirefoxBinary('C:/Users/Prince.Atul/AppData/Local/Mozilla Firefox/firefox')
    driver = webdriver.Firefox(firefox_binary=binary)
    map_url = 'https://www.google.ca/maps/'
    driver.get(map_url)
    time.sleep(2)
    # type the address into the search box and press Enter (u'\ue007')
    driver.find_element_by_xpath('//*[@id="searchboxinput"]').send_keys(search_string)
    time.sleep(1)
    driver.find_element_by_xpath('//*[@id="searchboxinput"]').send_keys(u'\ue007')
    time.sleep(3)
    try:
        # the postal code appears in the heading of the search result panel
        postal_code = driver.find_element_by_xpath('/html/body/jsl/div[3]/div[9]/div[8]/div/div[1]/div/div/div[2]/div[1]/h2/span').text
        driver.close()
        return postal_code
    except Exception:
        driver.close()
        return 'No PostalCode'

# modifying the address and keeping only the street name
def add_correction(x):
    # drop the unit/house number that appears before the hyphen
    x = x.split('-')
    return x[-1]


def find_postal_code(file_path):
    data = pd.read_csv(file_path)
    data['modified_address'] = 'no_address'
    data['modified_address'] = data['address'].apply(add_correction)
    data['postal_code'] = 'NA'
    # look up each modified address on Google Maps (one browser session per row)
    data['postal_code'] = data[['postal_code', 'modified_address']].apply(google_map_postal_code, axis=1)
    return data

# calling the function and saving the results in a csv file
output_df = find_postal_code('output/toronto.csv')
output_df.to_csv('output/t_2.csv', index=False)
  • Passing the address as a parameter to the function
  • Starting the Firefox web driver
  • Navigating to the Google Maps URL
  • Finding the search box using its XPath and typing our address there
  • Pressing Enter while the focus is still in the search box
  • Selecting the postal code from the search result using its XPath
  • Closing the driver
  • I am doing some address corrections before passing the address to Google Maps: I am removing the unit or house number and keeping only the street or area name. My assumption is that this won’t matter for finding the postal code and will help us get more search hits, as illustrated in the snippet after this list.
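
For example (a made-up address, just to show the hyphen split that add_correction relies on):

add_correction('#1203 - 25 Telegram Mews')
# -> ' 25 Telegram Mews'  (unit number dropped, street name kept)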

Reformatting the data and visualizing it

Congratulations, we have the data now and can use it any way we want. Just be careful before using scraped data for any commercial purpose, as I am unclear about its fair-use policy.
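
Before plotting, the scraped price strings need converting to numbers and the FSA needs deriving from the postal code. Here is a minimal sketch of that clean-up, assuming the column names from the CSV saved above (the exact steps in my notebook may differ):

import pandas as pd

# a minimal sketch of the reformatting step, not the exact notebook code
df = pd.read_csv('output/t_2.csv')

# "$1,234,000" -> 1234000.0
df['price_num'] = pd.to_numeric(
    df['price'].str.replace(r'[^0-9.]', '', regex=True), errors='coerce')

# FSA = first three characters of a Canadian postal code, e.g. "M5V";
# assuming the scraped string contains a full postal code like "M5V 3L9"
df['fsa'] = df['postal_code'].str.extract(r'([A-Z]\d[A-Z])', expand=False)
df = df.dropna(subset=['price_num', 'fsa'])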

I am using seaborn to visualize the data, e.g. the number of listings per area, the average price, etc. I have also used Folium to plot a geoJSON area map with price as an indicator. That part is not my original work; I have adapted the code from here (also check the author's GitHub repo).
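
For reference, here is a rough seaborn sketch of the kind of area-wise charts described above, building on the cleaned df from the previous snippet; the actual plots (and the Folium map) are in the GitHub repo:

import seaborn as sns
import matplotlib.pyplot as plt

# number of listings per FSA
plt.figure(figsize=(12, 4))
sns.countplot(x='fsa', data=df, order=df['fsa'].value_counts().index)
plt.xticks(rotation=90)
plt.title('Number of listings by FSA')
plt.tight_layout()
plt.show()

# average listing price per FSA
avg_price = (df.groupby('fsa', as_index=False)['price_num']
               .mean()
               .sort_values('price_num', ascending=False))
plt.figure(figsize=(12, 4))
sns.barplot(x='fsa', y='price_num', data=avg_price)
plt.xticks(rotation=90)
plt.ylabel('Average price')
plt.title('Average listing price by FSA')
plt.tight_layout()
plt.show()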

You can check some of the resulting graphs below. If you are interested in the code for that part, please check this GitHub Repo.

If you like this project, please follow the GitHub Repo so that you always get the updated code. Also, if you have any comments or questions, use the comment section on GitHub.

