Screen-scraping libraries are popular and easy to use in a variety of languages (such as Symfony's DomCrawler in PHP and Scrapy in Python). Because these libraries are HTML parsers rather than browsers, they cannot render JavaScript. As web pages become more sophisticated, JavaScript has become a core part of the user experience, so the tools we use to scrape these pages must be able to emulate that experience.
Introducing Selenium
Selenium is a tool for browser automation; in theory, anything a user can see and interact with can also be automated with Selenium. Selenium has a browser IDE for basic automation, and also has bindings for a variety of programming languages. We are going to use the Python bindings, but bindings also exist for Java, C#, Ruby, and JavaScript (Node.js).
Installing the Python Language Bindings and the Chrome Driver
The easiest way to install the Python bindings is via pip. The following command will install them on your system:
pip install selenium
The latest version of ChromeDriver can be downloaded from the ChromeDriver downloads page. The archive can be extracted anywhere on your system, but make a note of the path. For the sake of this guide, we will assume the binary is at /home/michael/Downloads/chromedriver. It is also possible to automate headless browsers such as PhantomJS.
Creating a basic script
Now we need to create a Python script to bring these components together. The script will initialise ChromeDriver and navigate to google.com.
# Import the Selenium web driver
from selenium import webdriver

# The path to the ChromeDriver binary
path_to_chromedriver = '/home/michael/Downloads/chromedriver'

# Initialise the Chrome driver
browser = webdriver.Chrome(executable_path=path_to_chromedriver)

# Navigate to a website address
browser.get('http://www.google.com')
When you run this script, it will launch Chrome, and navigate to google.com.
Scraping complex pages
For this guide, we need a JavaScript-heavy website. I’ve chosen https://tradeblock.com, which provides real-time market data from various Bitcoin exchanges, and is a good example of a site that wouldn’t be “scrape-able” without browser automation. Selenium offers functions similar to other HTML parsers for finding elements and navigating pages; below are some of the most useful:
browser.find_element_by_css_selector('selector')   # Return the first element matching a CSS selector
browser.find_elements_by_css_selector('selector')  # Return all elements matching a CSS selector
browser.find_element_by_xpath('selector')          # Return the first element matching an XPath expression
browser.find_elements_by_xpath('selector')         # Return all elements matching an XPath expression
browser.find_element_by_id('id')                   # Return the element with a given ID
browser.find_elements_by_tag_name('tag')           # Return all elements with a given HTML tag
browser.find_element_by_link_text('Link')          # Return the first link with the given link text
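These lookups raise an exception when nothing matches, so scrapers often wrap them in a small fallback helper that tries several selectors in order. The helper below, first_match, is my own sketch, not part of Selenium's API:

```python
# A small fallback lookup: try several CSS selectors in order and
# return the first element that matches, or None if none do.
# `first_match` is a hypothetical helper, not part of Selenium.

def first_match(driver, selectors):
    for selector in selectors:
        try:
            return driver.find_element_by_css_selector(selector)
        except Exception:  # Selenium raises NoSuchElementException here
            continue
    return None

# Usage with the browser object from the scripts in this guide:
# price_cell = first_match(browser, ['.exchanges td.price', '.exchanges td'])
```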
Below is the code we will use to visit the site, navigate to the markets page, and extract a piece of data. The code has been annotated line-by-line with comments.
# Import the Selenium web driver
from selenium import webdriver

# The path to the ChromeDriver binary
path_to_chromedriver = '/home/michael/Downloads/chromedriver'

# Initialise the Chrome driver
browser = webdriver.Chrome(executable_path=path_to_chromedriver)

# Tell the driver to wait up to 2 seconds for elements to appear
browser.implicitly_wait(2)

# Navigate to a website address
browser.get('https://tradeblock.com')

# Maximise the window, otherwise it displays the mobile site
browser.maximize_window()

# Find the Markets link, and click it
browser.find_element_by_link_text('Markets').click()

# Print the current price from the Bitstamp exchange
print(browser.find_element_by_css_selector('.exchanges tr:first-child td:nth-child(4)').text)

# Output at the time of writing: 530.93
When you run this script, it will open Chrome and navigate to https://tradeblock.com. It will then maximise the window and click the “Markets” link. Finally, it will wait for the element to appear and print its value. At the time of writing the script printed “530.93” to the terminal (the exact value changes frequently).
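Note that the scraped text arrives as a string, so it usually needs cleaning before it can be used as a number. A small hypothetical helper (my own sketch, not part of Selenium) that strips currency symbols and thousands separators before converting to a float:

```python
# Convert scraped price text such as "$1,530.93" into a float.
# `parse_price` is a hypothetical helper for this guide.

def parse_price(text):
    cleaned = text.strip().replace('$', '').replace(',', '')
    return float(cleaned)

print(parse_price('530.93'))     # 530.93
print(parse_price('$1,530.93'))  # 1530.93

# Usage with the script above:
# price = parse_price(browser.find_element_by_css_selector(
#     '.exchanges tr:first-child td:nth-child(4)').text)
```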
The same techniques can be used to interact and extract any data you like from pages with lots of JavaScript/AJAX content.
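For example, to extract a whole table rather than a single cell, a common pattern is to read each row's cells and zip them with the column headers. The Selenium calls in the commented section are real, but the overall structure, the rows_to_records helper, and the column names are a sketch of my own:

```python
# Turn a list of header names and a list of row-cell lists into one
# dictionary per row. `rows_to_records` is a hypothetical helper.

def rows_to_records(headers, rows):
    return [dict(zip(headers, row)) for row in rows]

# With Selenium, the cell text could be gathered like this (a sketch,
# reusing the .exchanges selector from the example above):
# rows = []
# for tr in browser.find_elements_by_css_selector('.exchanges tr'):
#     cells = [td.text for td in tr.find_elements_by_tag_name('td')]
#     if cells:  # skip header-only rows with no <td> cells
#         rows.append(cells)
# records = rows_to_records(['exchange', 'price'], rows)

print(rows_to_records(['exchange', 'price'], [['Bitstamp', '530.93']]))
# [{'exchange': 'Bitstamp', 'price': '530.93'}]
```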