Thursday, August 9, 2012

Webscraping with CasperJS and PhantomJS

Update: check out this short CasperJS screencast with accompanying exercises

I needed to scrape some data from a website without an API for my latest project. The target website used JavaScript for all of its navigation, working hard against the grain of HTML and other standards. This was anti-hypermedia: there were absolutely no regular HTML anchor tags that I could use to navigate the interface (all of them were hooked to onclick JavaScript events which computed the next URL to open dynamically) and there were barely any CSS selectors I could target. To make this work, I needed my webscraper to execute real JavaScript code to simulate a real user navigating with a real browser.

I thought for a while that I was going to have to use Selenium for this task. Because the project runs on Heroku, that would have added some very heavyweight dependencies to the project, or forced me onto dedicated hosting. I know you can run Selenium in a headless mode via Xvfb, but I'd still have to get a GUI-based browser compiled for Heroku. If that sounds like what you need to do, check out the Vulcan build server.

PhantomJS to the Rescue

Happily, I found another way: PhantomJS, a truly headless WebKit browser that runs JavaScript code and has no dependencies. I downloaded the binaries from the website and dumped 'em right into my app's vendor directory.

I loved how easy it was to get started with PhantomJS, but I found it cumbersome for a multi-step webscraping project. You have to code via a series of callbacks that can get hard to manage: for example, when you ask for a new page to be fetched, you have to register a callback to be notified when and if that page load succeeds. The best feature of PhantomJS, besides being able to execute JavaScript, is its screen capture mode, which is very helpful for debugging.
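To make the callback problem concrete, here is a rough sketch of what opening two pages in a row looks like with the raw PhantomJS webpage API. The function name openTwoPages, the injected createPage factory, the URLs, and the debug.png filename are all made up for illustration; only page.open, page.render, and phantom.exit are PhantomJS calls.

```javascript
// Open two pages in sequence using PhantomJS-style nested callbacks.
// `createPage` is a hypothetical injected factory so the flow can be tried
// with a stub; under PhantomJS it wraps require('webpage').create().
function openTwoPages(createPage, firstUrl, secondUrl, done) {
  var page = createPage();
  page.open(firstUrl, function (status) {
    if (status !== 'success') {
      return done(new Error('failed to load ' + firstUrl));
    }
    // The second navigation has to be nested inside the first callback.
    page.open(secondUrl, function (status) {
      if (status !== 'success') {
        return done(new Error('failed to load ' + secondUrl));
      }
      page.render('debug.png'); // screen capture, handy for debugging
      done(null, page);
    });
  });
}

// This part only runs under PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
  openTwoPages(
    function () { return require('webpage').create(); },
    'https://example.com/login',   // assumed URL
    'https://example.com/reports', // assumed URL
    function (err) {
      if (err) { console.log(err.message); }
      phantom.exit(err ? 1 : 0);
    }
  );
}
```

Every additional step adds another level of nesting and another status check, which is where it starts to get hard to manage.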

The Missing Piece: CasperJS 

In my googling for a solution to this pain I found another awesome project built on top of PhantomJS: CasperJS. Casper gives you a nice DSL to control PhantomJS and hide the callback spaghetti code. It reminds me of how EM-Synchrony wraps up EventMachine's callbacks so you can write asynchronous code in a synchronous style.
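If you're wondering how a step-based DSL can hide nested callbacks at all, here is a toy step queue in plain JavaScript. It is not how CasperJS is implemented, just a sketch of the general idea; StepQueue and fakeOpen are names I made up, and fakeOpen merely pretends to load a page.

```javascript
// Toy step queue: NOT CasperJS's implementation, only the general idea.
function StepQueue() {
  this.steps = [];
}

// Register a step. Each step receives a `next` callback to signal completion.
StepQueue.prototype.then = function (step) {
  this.steps.push(step);
  return this; // allow chaining
};

// Run the registered steps one after another, then call `onDone` if given.
StepQueue.prototype.run = function (onDone) {
  var steps = this.steps;
  var index = 0;
  function next() {
    if (index >= steps.length) {
      if (onDone) { onDone(); }
      return;
    }
    var step = steps[index];
    index += 1;
    step(next);
  }
  next();
};

// Hypothetical stand-in for an asynchronous page load.
function fakeOpen(url, callback) {
  setTimeout(function () { callback('success'); }, 0);
}

// The steps read top to bottom even though each one finishes asynchronously.
var visited = [];
new StepQueue()
  .then(function (next) {
    fakeOpen('/login', function () { visited.push('/login'); next(); });
  })
  .then(function (next) {
    fakeOpen('/reports', function () { visited.push('/reports'); next(); });
  })
  .run(function () {
    console.log('visited: ' + visited.join(', '));
  });
```

The nesting hasn't gone away; it has moved into run(), so the scraping code itself stays flat.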

Once I got the hang of Casper my code basically wrote itself. Below is a sanitized and minimally-formatted version of the production code, also posted as a GitHub gist. Hope it helps you get started with these powerful tools.

var system = require('system');

if (system.args.length < 5) {
  console.info("You need to pass in account name, username, password, and path to casperJS as arguments to this code.");
  phantom.exit();
}

var account = system.args[1];
var username = system.args[2];
var password = system.args[3];
var base_uri = "https://example.com/" + account;
phantom.casperPath = system.args[4];

phantom.injectJs(phantom.casperPath + '/bin/bootstrap.js');

var utils = require('utils');

var casper = require('casper').create({
  verbose: true,
  logLevel: 'debug'
});

casper.on('error', function(msg,backtrace) {
  this.echo("=========================");
  this.echo("ERROR:");
  this.echo(msg);
  this.echo(backtrace);
  this.echo("=========================");
});

casper.on("page.error", function(msg, backtrace) {
  this.echo("=========================");
  this.echo("PAGE.ERROR:");
  this.echo(msg);
  this.echo(backtrace);
  this.echo("=========================");
});

casper.start(base_uri + "/login", function () {
  this.fill("form[name='login_form']", { username: username, password: password }, true); // true submits the form
});

// can't click the reports button as that causes a weird file:// link problem
casper.thenOpen(base_uri + "/reports");

casper.then(function() {
  var url = this.evaluate(function() {
    var href = __utils__.getElementByXPath("//a[contains(@href,'Account') and contains(@href,'Report')]").href;

    // winOpenTransform is a function provided by the page, so it has to be called here,
    // inside evaluate(), where the page's globals are visible. It's brittle for us to invoke
    // it this way instead of clicking a button, but when you click the button, it pops up a
    // new window, and PhantomJS doesn't currently support popup windows. Thus I have to call
    // this function directly to avoid a popup.
    return winOpenTransform(href.match(/http:.+?(?=')/)[0]);
  });

  this.open(url);
});

casper.then(function() {
  casper.page.injectJs('jquery.min.js'); // so we can pick an option with the select item below
});

casper.thenEvaluate(function() {
  document.report.runtimeCondition.value = "ExampleField IS NOT NULL";
  document.report.condition.value = "'','','examplename IS NOT NULL','21'";
  document.report.target = "report";
  document.report.submit();
});

casper.thenOpen("https://example.com/" + account + "/Report/reports");

casper.then(function() {
  var url = this.evaluate(function() {
    return __utils__.getElementByXPath("//a[./text()='CSV']")["href"];
  });

  this.echo("GETTING " + url);
  this.download(url,"data.csv","GET");
});

casper.run();
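Two spots in a script like this are easy to get wrong: gluing URLs together with string concatenation, and pulling a URL out of a JavaScript-style href with a regex. Below is a small sketch of both as plain functions. The names joinUrl and extractQuotedUrl are mine, and the sample href is only a guess at what such a link might look like, not the real site's markup.

```javascript
// Join a base URL and a path with exactly one slash between them.
function joinUrl(base, path) {
  return base.replace(/\/+$/, '') + '/' + path.replace(/^\/+/, '');
}

// Return the first http(s) URL that is followed by a single quote,
// or null when the href contains no such URL.
function extractQuotedUrl(href) {
  var match = href.match(/https?:.+?(?=')/);
  return match ? match[0] : null;
}

// Assumed example values, for illustration only.
var base = 'https://example.com/acme';
var sampleHref = "javascript:openReport('http://example.com/acme/Account/Report?id=1','report')";

console.log(joinUrl(base, 'reports'));
console.log(joinUrl(base + '/', '/login'));
console.log(extractQuotedUrl(sampleHref));
console.log(extractQuotedUrl('javascript:void(0)'));
```

Returning null instead of indexing straight into the match result means a changed page layout shows up as an explicit check you can log, rather than a TypeError in the middle of a step.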

6 comments:

Anonymous said...

Hi. Thanks for the article. But is there no way to write the result into a file? I mean, download is only useful when we specify the URL, but what if I want to store the content of a variable in a file?

Anonymous said...

You can use the PhantomJS filesystem module in your code:

var fs = require('fs');
fs.write('outfile.txt','Write this', 'w');

Unknown said...

How can I use PhantomJS on my shared hosting through PHP? When I use shell_exec I get a permission denied message. Is there any workaround?

tomelam@gmail.com said...

This article explains how to do *exactly* what I want to do on Heroku. Thanks, Mike! -Tom

Altryne said...

Hi Mike,
I want to understand something.
How exactly are you running this application on Heroku?
I have a Casper script that works fine locally, and I want to publish it to Heroku so I can trigger it via a GET/POST from another service.
Is this possible?

Wojtek Zeglin said...

That's fantastic, Mike. Your method actually enabled me to use proxy servers with CasperJS. I simply pass the --proxy option to PhantomJS and run the CasperJS script after that. Perhaps someone else will find it useful.