Scraping data from a webpage after JavaScript/AJAX requests have run (PHP/PhantomJS Cloud)

Scraping data from static webpages is generally quite straightforward, but what if we need to extract data from a page that generates additional content through JavaScript or AJAX requests after it first loads? If we used PHP’s file_get_contents() function, we would get the HTML of the page before any scripts have had a chance to run.

In this short post I will show you how to overcome this problem easily, in a few lines of PHP.

The best way to do this is to use a headless browser: a fully functional browser without a GUI, controlled instead through the command line or network requests. A headless browser can determine whether JavaScript is running on the webpage and, through network monitoring, whether any AJAX requests are still outstanding. This lets us grab the page contents (the raw HTML) after all scripts and asynchronous requests have finished executing.

One of the most popular headless browsers is the open-source PhantomJS. To make your life even easier, though, we will be using PhantomJS Cloud, which provides the functionality of PhantomJS as a service, so we don’t have to go through the installation process. We can make requests to PhantomJS Cloud and work with the response. If you sign up you can make up to 500 requests a day for free; however, we will be using the generic API key, which allows up to 100 requests a day per IP address, so there’s no need to sign up.

To demonstrate this, we will be using OptiMap Route Planner. It allows us to enter a number of addresses and returns the optimal (quickest) route that visits each one. We can specify the addresses in the URL, and I have chosen five restaurants in Scotland for our example.

The code below makes a request to PhantomJS Cloud with our specified URL:
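A minimal version of that request might look like the sketch below. The OptiMap base URL and the exact address strings are assumptions pieced together from the route output shown later, and the endpoint uses PhantomJS Cloud’s generic demo key; the listing is laid out so that the line numbers referenced in the text (line 2, line 5, and so on) line up.

```php
<?php
$pageUrl = 'http://gebweb.net/optimap/?loc0=The+Rocks,+Marine+Road,+Dunbar&loc1=Pizza+Express,+North+Bridge,+Edinburgh&loc2=The+Voodoo+Rooms,+West+Register+Street,+Edinburgh&loc3=Lourenzos,+St.+Margarets+Street,+Dunfermline&loc4=Nandos,+Livingston';

// Package the target URL and render type as a JSON object
$request = json_encode(['url' => $pageUrl, 'renderType' => 'html']);

$apiUrl = 'https://phantomjscloud.com/api/browser/v2/a-demo-key-with-low-quota-per-ip-address/';

$options = [
    'http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: application/json\r\n",
        'content' => $request,
    ],
];

$context = stream_context_create($options);
$result  = file_get_contents($apiUrl, false, $context);
```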

 

 

If you inspect the URL in line 2 you can see it is made up of five addresses, denoted by loc0 to loc4. If you copy the URL and paste it into your browser’s address bar, you will see that after the page loads, a ‘Calculating route’ popup appears and it takes a few seconds to generate and display the route; this is why simply using file_get_contents() won’t work. We specify the render type as HTML and package the request as a JSON object (line 5).
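If you’d rather build the loc0–loc4 query string programmatically than hard-code one long URL, PHP’s http_build_query() handles the URL encoding for you. A small sketch (the gebweb.net/optimap base URL is an assumption):

```php
<?php
// The five addresses, in the order loc0..loc4
$addresses = [
    'The Rocks, Marine Road, Dunbar',
    'Pizza Express, North Bridge, Edinburgh',
    'The Voodoo Rooms, West Register Street, Edinburgh',
    'Lourenzos, St. Margarets Street, Dunfermline',
    'Nandos, Livingston',
];

// Turn the list into loc0=..., loc1=..., etc.
$params = [];
foreach ($addresses as $i => $address) {
    $params['loc' . $i] = $address;
}

// http_build_query URL-encodes each value for us
$pageUrl = 'http://gebweb.net/optimap/?' . http_build_query($params);
```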

On line 7 you will see the PhantomJS Cloud URL with the generic API key at the end, and lines 9 to 15 create an array, $options, which is used to create a stream context (line 17). There’s a great post on SitePoint if you want to learn about streams and stream contexts in PHP. Finally, in line 18 we make the request and load the HTML we get back into $result.

That’s it. If you’re already comfortable with scraping webpages in PHP you can tell all your friends about this tutorial and move on. Otherwise, feel free to keep reading – I’ll give you a quick example of extracting some data from the response we just got.

We now need to sift through the HTML in $result to extract the data we want. If you opened the URL from line 2 in your browser and looked at the calculated route, you will have seen that in the textual route displayed below the map, each location name appears in a dark gray box, with the directions in between. Let’s suppose we want to scrape only the location names, since we’re not interested in the directions, just the order in which we have to visit the places.

We are going to use XPath to parse the HTML data in $result. XPath allows us to navigate XML/HTML documents, and PHP has built-in XPath support. If all of this is new to you, feel free to have a read of this article to learn the basics of web scraping in PHP. Note that there’s no error checking in my code, to keep the example clearer and shorter.
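If XPath itself is new to you, here is a tiny self-contained illustration, separate from OptiMap; the HTML string and the `place` class are made up purely for the example:

```php
<?php
// A toy HTML snippet to query against
$html = '<ul><li class="place">Dunbar</li><li class="place">Edinburgh</li></ul>';

libxml_use_internal_errors(true); // suppress warnings about imperfect HTML

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//li[@class='place']"); // select every <li> with class "place"

foreach ($nodes as $node) {
    echo $node->nodeValue . "\n"; // the text inside each matched element
}
```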

Take a look at the following code:
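Again, a sketch that continues the request listing above (its line numbers carry on from 18, so the references below line up). The XPath selector for the dark gray location boxes is hypothetical; inspect OptiMap’s markup to find the real class name:

```php
// ...continued from the request listing above (this is line 19)
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($result);
$xpath = new DOMXPath($dom);

$nodes = $xpath->query("//div[@class='locationName']"); // hypothetical selector

// Print each location name, numbered in visiting order
foreach ($nodes as $i => $node) {
    echo ($i + 1) . '. ' . $node->nodeValue . '<br>';
}
```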

 

 

In line 20 we disable libxml errors so they aren’t output to the screen: as we are parsing an HTML document there is a high chance it isn’t fully XML-compliant, but we aren’t interested in that, so we can safely ignore any errors. In lines 22-24 we go on to create a DOMXPath object from the HTML in $result. Line 26 runs our XPath query, giving us a DOMNodeList object, $nodes. I’m not going to cover how to form an XPath query, as the article I just mentioned touches on this, and I also won’t be talking about how to determine the CSS selectors that identify the required data.

In lines 29-31 we iterate over $nodes and, for each DOMNode object, access its text (the location name) via the nodeValue attribute and print it to the webpage. The output will look like this:

  1. The Rocks, Marine Road, Dunbar
  2. Pizza Express, North Bridge, Edinburgh
  3. The Voodoo Rooms, West Register Street, Edinburgh
  4. Lourenzos, St. Margarets Street, Dunfermline
  5. Nandos, Livingston
  6. The Rocks, Marine Road, Dunbar

There you have it: if you copy the code into a PHP file and run it on localhost you should get the same result. Hopefully this example has shown that scraping data from a webpage which relies on JavaScript/AJAX to dynamically generate content can be straightforward. Just remember that there can be legal issues surrounding web scraping, so it’s always a good idea to check the website’s Ts & Cs or get in touch with the owner.
