Screen-scraping libraries are popular and easy to use in a variety of languages (the Symfony DomCrawler component in PHP, Scrapy in Python, and so on). As these libraries are HTML parsers rather than browsers, they cannot render JavaScript. As web pages grow more sophisticated, JavaScript has become a core part of the user experience, so the tools we use to scrape these pages must be able to emulate that experience.

Introducing Selenium

Selenium is a tool for browser automation; in principle, anything a user can see and interact with can also be automated with Selenium. Selenium has a browser IDE for basic automation, and also has bindings for a variety of programming languages. We are going to be using the Python language bindings, but there are also bindings for Java, C#, Ruby, and JavaScript (Node.js).

Installing the Python Language Bindings and the Chrome Driver

The easiest way to install the Python Bindings is via pip. The following command will get them installed on your system:

pip install selenium

The latest versions of ChromeDriver can be downloaded from the ChromeDriver downloads page. The archive can be extracted anywhere on your system, but make a note of the path. For the sake of this guide, we will assume the binary is at /home/michael/Downloads/chromedriver. It is also possible to automate headless browsers such as PhantomJS.

Creating a basic script

Now we need to create a Python script to bring these components together. The script will initialise ChromeDriver and navigate to google.com.

# Import the Selenium web driver
from selenium import webdriver

# The path to the Chrome web driver
path_to_chromedriver = '/home/michael/Downloads/chromedriver'
# Initialise the Chrome driver
browser = webdriver.Chrome(executable_path=path_to_chromedriver)
# Navigate to a website address
browser.get('http://www.google.com')

When you run this script, it will launch Chrome, and navigate to google.com.

Scraping complex pages

For this guide, we need a JavaScript-heavy website. I’ve chosen https://tradeblock.com, which provides real-time market data from various Bitcoin exchanges, and is a good example of a site that wouldn’t be “scrape-able” without browser automation. Selenium has similar functions to other HTML parsers for finding elements and navigating pages; below are some of the most useful:

browser.find_element_by_css_selector('selector')  # Return a single element matching a CSS selector
browser.find_elements_by_css_selector('selector')  # Return multiple elements matching a CSS selector
browser.find_element_by_xpath('selector')  # Return a single element matching an XPath selector
browser.find_elements_by_xpath('selector')  # Return multiple elements matching an XPath selector

browser.find_element_by_id('id')  # Return a single element matching an ID
browser.find_elements_by_tag_name('tag')  # Return multiple elements matching an HTML tag
browser.find_element_by_link_text('Link')  # Return a single element with specific link text
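Each of these calls returns a WebElement (or a list of them), whose data is read via the `.text` property and the `.get_attribute()` method. As a small illustrative sketch (the `link_summary` helper is our own, not part of Selenium):

```python
def link_summary(elements):
    """Return (text, href) pairs for a list of anchor WebElements."""
    return [(el.text, el.get_attribute('href')) for el in elements]

# Typical usage against a live browser session:
# links = browser.find_elements_by_tag_name('a')
# for text, href in link_summary(links):
#     print(text, href)
```

The helper accepts anything with `.text` and `.get_attribute()`, so it works unchanged on elements found by tag name, CSS selector, or XPath.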

Below is the code we will use to visit the site, navigate to the markets page, and extract a piece of data. The code has been annotated line-by-line with comments.

# Import the Selenium web driver
from selenium import webdriver

# The path to the Chrome web driver
path_to_chromedriver = '/home/michael/Downloads/chromedriver'
# Initialise the Chrome driver
browser = webdriver.Chrome(executable_path=path_to_chromedriver)
# Navigate to a website address
browser.get('https://tradeblock.com')
# Maximise the window, otherwise it displays the mobile site
browser.maximize_window()
# Find the markets link, and click
browser.find_element_by_link_text('Markets').click()
# Tell the driver to poll for up to 2 seconds when locating elements,
# giving the page time to finish loading
browser.implicitly_wait(2)
# Print the current price from the Bitstamp exchange
print(browser.find_element_by_css_selector('.exchanges tr:first-child td:nth-child(4)').text)

# Output at the time of writing: 530.93

When you run this script, it will open Chrome and navigate to https://tradeblock.com. It will then maximise the window and click the “Markets” link. Finally, it will wait for the price element to appear, then output its value. At the time of writing, the script printed “530.93” to the terminal (the exact value changes frequently).

The same techniques can be used to interact with, and extract data from, any page with heavy JavaScript/AJAX content.
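Interaction works through the same element handles as extraction: Selenium’s `send_keys()` types into a field, `clear()` empties it, and `click()` presses buttons and links. A minimal sketch of a form-filling helper (the `fill_form` function and the field names in the usage example are our own illustration, not part of Selenium):

```python
def fill_form(browser, values):
    """Type each value into the input element with the matching name attribute.

    `browser` is any object exposing Selenium's find_element_by_name();
    `values` maps input names to the text to enter.
    """
    for name, text in values.items():
        field = browser.find_element_by_name(name)
        field.clear()          # Empty any pre-filled text
        field.send_keys(text)  # Type the new value

# Typical usage against a live browser session:
# fill_form(browser, {'username': 'michael', 'city': 'London'})
# browser.find_element_by_css_selector('input[type=submit]').click()
```

Because the helper only depends on the element interface, the same code drives a login form, a search box, or any other input the browser can render.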
