
Screen scraping JavaScript and AJAX heavy pages with Selenium

Screen scraping libraries are popular and easy to use in a variety of languages (such as Symfony Crawler in PHP and Scrapy in Python). Because these libraries are HTML parsers and not browsers, they cannot execute JavaScript, so any content a page builds or loads with JavaScript is invisible to them. As web pages get more sophisticated, JavaScript has become a core part of the user experience; the tools we use for scraping these pages must therefore be able to emulate that experience.

Introducing Selenium

Selenium is a tool for browser automation; theoretically, anything the user can see and interact with can also be automated with Selenium. Selenium has a browser IDE for basic automation, and also has bindings for a variety of programming languages. We are going to use the Python language bindings, but there are also bindings for Java, C#, Ruby, and Node.

Installing the Python Language Bindings and the Chrome Driver

The easiest way to install the Python bindings is via pip. The following command will install them on your system:

    pip install selenium

The latest versions of ChromeDriver can be found here. The download can be extracted anywhere on your system, but make a note of the path. For the sake of this guide, we will assume the path is /home/michael/Downloads/chromedriver. It is also possible to automate headless browsers such as PhantomJS.

Creating a basic script

Now we need to create a Python script to bring these components together. The script will initialise ChromeDriver and navigate to google.com.
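A minimal sketch of such a script, assuming the Selenium 4 API, where the ChromeDriver path is passed through a Service object (older releases accepted the path directly). The helper names make_chrome_driver, open_page and main are illustrative, not part of Selenium.

```python
# Path assumed earlier in this guide; change it to wherever ChromeDriver lives.
CHROMEDRIVER_PATH = "/home/michael/Downloads/chromedriver"


def make_chrome_driver(driver_path=CHROMEDRIVER_PATH):
    """Start Chrome through ChromeDriver (assumes the Selenium 4 API)."""
    # Imported here so the file can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    return webdriver.Chrome(service=Service(executable_path=driver_path))


def open_page(driver, url):
    """Navigate to url and return the title of the loaded page."""
    driver.get(url)  # blocks until the browser reports the page as loaded
    return driver.title


def main():
    driver = make_chrome_driver()
    try:
        print(open_page(driver, "https://www.google.com"))
    finally:
        driver.quit()  # always close the browser, even if navigation fails
```

Calling main() opens a Chrome window, loads the page, prints its title and closes the browser again.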

When you run this script, it will launch Chrome and navigate to google.com.

Scraping complex pages

For this guide, we need a JavaScript-heavy website. I’ve chosen https://tradeblock.com, which provides real-time market data from various Bitcoin exchanges, and is a good example of a site that wouldn’t be “scrape-able” without browser automation. Selenium has similar functions to other HTML parsers to find elements and navigate pages. Below are some of the most useful:
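As a rough illustration of these lookups, here is a sketch assuming the Selenium 4 API, where find_element and find_elements take a By strategy plus a value (older releases exposed the same lookups as find_element_by_* methods). The id, link text and selector used here are placeholders, not taken from a real page, and find_examples is a hypothetical helper.

```python
def find_examples(driver):
    """Run a few common lookups against a page the driver has already loaded."""
    # Imported here so the file can be loaded without Selenium installed.
    from selenium.webdriver.common.by import By

    # find_element returns the first match and raises NoSuchElementException
    # when nothing matches.
    search_box = driver.find_element(By.ID, "search")
    markets_link = driver.find_element(By.LINK_TEXT, "Markets")

    # find_elements returns a list of every match (empty if there are none).
    rows = driver.find_elements(By.CSS_SELECTOR, "table tr")

    # Elements expose their tag name, attributes and visible text.
    return {
        "search_tag": search_box.tag_name,
        "markets_href": markets_link.get_attribute("href"),
        "row_texts": [row.text for row in rows],
    }
```

Other strategies include By.NAME, By.CLASS_NAME, By.TAG_NAME, By.XPATH and By.PARTIAL_LINK_TEXT. Elements also offer click() and send_keys() for interacting with the page.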

Below is the code we will use to visit the site, navigate to the markets page, and extract a piece of data. The code has been annotated line-by-line with comments.
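The following is a sketch of those steps rather than a definitive script. It assumes the Selenium 4 API and uses an implicit wait to give the JavaScript-rendered content time to appear. The function name scrape_market_value and its value_selector parameter are hypothetical: the selector for the value has to be found by inspecting the live page.

```python
def scrape_market_value(driver, value_selector, timeout_seconds=10):
    """Open tradeblock.com, follow the "Markets" link and return one value as text.

    value_selector is a CSS selector (hypothetical, supplied by the caller)
    for the element that holds the value to extract.
    """
    # Imported here so the file can be loaded without Selenium installed.
    from selenium.webdriver.common.by import By

    # Make element lookups retry for up to timeout_seconds before failing,
    # so content that is rendered after the page loads can still be found.
    driver.implicitly_wait(timeout_seconds)

    # Visit the site and maximise the browser window.
    driver.get("https://tradeblock.com")
    driver.maximize_window()

    # Find the "Markets" link by its visible text and click it.
    driver.find_element(By.LINK_TEXT, "Markets").click()

    # Look up the element holding the value and return its visible text.
    return driver.find_element(By.CSS_SELECTOR, value_selector).text
```

With a driver created by webdriver.Chrome, printing the result of scrape_market_value(driver, selector) writes the extracted value to the terminal; call driver.quit() afterwards to close the browser.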

When you run this script, it will open Chrome and navigate to https://tradeblock.com. It will then maximise the window, and click the “Markets” link. Finally, it will wait for the page to load, then output a value. At the time of writing the script outputs the value “530.93” to the terminal (the exact value will change frequently).

The same techniques can be used to interact with pages that have lots of JavaScript/AJAX content and to extract any data you like from them.