Showing posts with label Javascript. Show all posts

Thursday, August 9, 2012

Webscraping with CasperJS and PhantomJS

Update: check out this short CasperJS screencast with accompanying exercises

I needed to scrape some data from a website without an API for my latest project. The target website used JavaScript for all of its navigation, working hard against the grain of HTML and other standards. This was anti-hypermedia: there were absolutely no regular HTML anchor tags that I could use to navigate the interface (all of them were hooked to onclick JavaScript events which computed the next URL to open dynamically) and there were barely any CSS selectors I could target. To make this work, I needed my webscraper to execute real JavaScript code to simulate a real user navigating with a real browser.

For a while I thought I was going to have to use Selenium for this task. Because the project runs on Heroku, I expected that to add some very heavyweight dependencies to the project, or force me to use dedicated hosting. I know you can run Selenium in a headless mode via Xvfb, but I'd still have to get a GUI-based browser compiled for Heroku - if that sounds like what you need to do, check out the Vulcan build server.

PhantomJS to the Rescue

Happily, I found another way: PhantomJS, a truly headless WebKit browser that runs JavaScript code and has no dependencies. I downloaded the binaries from the website and dumped 'em right into my app's vendor directory.

I loved how easy it was to get started with PhantomJS but I found it to be cumbersome for a multi-step webscraping project. You have to code via a series of callbacks that can get hard to manage: for example when you ask for a new page to be fetched, you have to register a callback to be notified when and if that page load succeeds. The best feature of PhantomJS, besides being able to execute JavaScript, is its screen capture mode - very helpful for debugging.

The Missing Piece: CasperJS 

In my googling for a solution to this pain I found another awesome project built on top of PhantomJS: CasperJS. Casper gives you a nice DSL to control PhantomJS and hide the callback spaghetti code - it reminds me of how EM-Synchrony wraps up EventMachine's callbacks so you can write asynchronous code in a synchronous style.
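To make the difference concrete, here is a toy sketch in plain JavaScript. It is not how CasperJS is implemented; fakeOpen, createQueue, nestedStyle and queuedStyle are all made-up names for illustration, and fakeOpen only pretends to load a page. It contrasts nesting one callback per page load with registering steps up front and running them in order.

```javascript
// Toy illustration only: this is NOT how CasperJS is implemented.
// fakeOpen is a hypothetical stand-in for an asynchronous page load.
function fakeOpen(url, callback) {
  setTimeout(function () { callback("success"); }, 0);
}

// Callback style: every new page load nests one level deeper.
function nestedStyle(done) {
  fakeOpen("/login", function (status) {
    fakeOpen("/reports", function (status) {
      done();
    });
  });
}

// Queue style: register the steps up front, then run them in order.
function createQueue() {
  var steps = [];
  var queue = {
    then: function (step) { steps.push(step); return queue; },
    run: function (done) {
      var i = 0;
      function next() {
        if (i >= steps.length) { if (done) { done(); } return; }
        steps[i++](next); // hand each step the function that starts the following one
      }
      next();
    }
  };
  return queue;
}

function queuedStyle(done) {
  createQueue()
    .then(function (next) { fakeOpen("/login", next); })
    .then(function (next) { fakeOpen("/reports", next); })
    .run(done);
}
```

With two steps the nesting is tolerable; with five or six, the flat list of then() calls is much easier to read and reorder.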

Once I got the hang of Casper my code basically wrote itself. Below is a sanitized and minimally-formatted version of the production code, also posted as a GitHub gist. Hope it helps you get started with these powerful tools.

var system = require('system');

if (system.args.length < 5) {
  console.info("You need to pass in account name, username, password, and path to casperJS as arguments to this code.");
  phantom.exit();
}

var account = system.args[1];
var username = system.args[2];
var password = system.args[3];
var base_uri = "https://example.com/" + account;
phantom.casperPath = system.args[4];

phantom.injectJs(phantom.casperPath + '/bin/bootstrap.js');

var utils = require('utils');

var casper = require('casper').create({
  verbose: true,
  logLevel: 'debug'
});

casper.on('error', function(msg,backtrace) {
  this.echo("=========================");
  this.echo("ERROR:");
  this.echo(msg);
  this.echo(backtrace);
  this.echo("=========================");
});

casper.on("page.error", function(msg, backtrace) {
  this.echo("=========================");
  this.echo("PAGE.ERROR:");
  this.echo(msg);
  this.echo(backtrace);
  this.echo("=========================");
});

casper.start(base_uri + "/login", function () {
  this.fill("form[name='login_form']", { username: username, password: password },true);
});

// can't click the reports button as that causes a weird file:// link problem
casper.thenOpen(base_uri + "/reports");

casper.then(function() {
  // winOpenTransform is a function provided by the page, so it only exists in the page
  // context and has to be called inside evaluate(). It's brittle for us to invoke it
  // this way instead of clicking a button, but when you click the button, it pops up a new
  // window, and PhantomJS doesn't currently support popup windows. Thus I have to call
  // this function directly to avoid a popup.
  var url = this.evaluate(function() {
    var href = __utils__.getElementByXPath("//a[contains(@href,'Account') and contains(@href,'Report')]").href;
    return winOpenTransform(href.match(/http:.+?(?=')/)[0]);
  });

  this.open(url);
});

casper.then(function() {
  casper.page.injectJs('jquery.min.js'); // so we can pick an option with the select item below
});

casper.thenEvaluate(function() {
  document.report.runtimeCondition.value = "ExampleField IS NOT NULL";
  document.report.condition.value = "'','','examplename IS NOT NULL','21'";
  document.report.target = "report";
  document.report.submit();
});

casper.thenOpen("https://example.com/" + account + "/Report/reports");

casper.then(function() {
  var url = this.evaluate(function() {
    return __utils__.getElementByXPath("//a[./text()='CSV']")["href"];
  });

  this.echo("GETTING " + url);
  this.download(url,"data.csv","GET");
});

casper.run();
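
The regular expression used to pull the report URL out of the link's href is easy to get wrong, so it can help to try it in isolation first. The sketch below assumes the href looks like a javascript: call with the URL in single quotes; that sample string and the extractQuotedUrl name are made up for illustration, since the real site's markup isn't shown here.

```javascript
// Hypothetical href; the real site's markup is not shown in this post.
var sampleHref = "javascript:winOpen('http://example.com/acme/Report/AccountReport?id=1','report')";

function extractQuotedUrl(href) {
  // Lazily match from "http:" up to, but not including, the next single quote.
  var match = href.match(/http:.+?(?=')/);
  return match ? match[0] : null;
}

console.log(extractQuotedUrl(sampleHref));
```

Note that the pattern only matches http: URLs. An https: link inside the quotes would not match, and the helper returns null in that case rather than throwing the way an unguarded [0] lookup would.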

Saturday, August 22, 2009

Vote up JavaScript talks for SXSW 2010

Attending JSConf 2009 was a real eye-opener for me: there are a lot of people using JavaScript in really interesting ways to do phenomenal things inside and outside of the browser. Projects like Phonegap, Titanium, CouchDB, SproutCore, and Cappuccino have the potential to transform many aspects of computing or at least point the way to some new paradigms.

To keep the conversation going, this led me to propose one talk and one panel for SXSW 2010:

Building Rich Web Apps with SproutCore
The JavaScript Applications Renaissance

If those sound intriguing, please consider voting at the above links, or better yet leave a comment expressing your support.

There are several other JavaScript related talks that I hope you will check out as well:

http://panelpicker.sxsw.com/ideas/index/4/q:javascript

Monday, April 27, 2009

On Saturday I gave an "Introduction to SproutCore" talk at JSConf, the world's first all-JavaScript conference.  It was a great conference and highly recommended.  Videos will be posted over the next several weeks.  For now, for those interested, my slides are below. 

Tuesday, February 12, 2008

Unobtrusive Firefox Plugin Click-to-Install

I've been working on a really cool new project, to be announced soon, where I've built a Rails-based web app that has two interfaces: one for humans to interact with inside of a browser, and a RESTful API for browser plugins to interact with via GETs and POSTs.  We want people to be able to interact with our site while visiting other sites.


I had seen other Firefox plugins that were click-to-install, but I had a hard time figuring out how to make it work for our plugin. Firefox users were having to "Save Link As..." and open the downloaded .xpi file manually. Very old-fashioned. So here's a quick note to help future Mozilla or Firefox developers who need to create a click-to-install plugin:


1) It's all done through JavaScript, so anyone without JavaScript will have to install your plugin the old-fashioned way. The Mozilla site documents the API call you need to make.


2) I'm a huge proponent of unobtrusive JavaScript (UJS), which I learned by using Dan Webb's excellent LowPro framework. Thus I wanted to make sure that the click-to-install JavaScript was offered as a progressive enhancement to the normal HTML links we provided. That way, everyone could have a link to the plugin file itself for manual installation, but people with JavaScript could enjoy click-to-install.


In this part of the site, we weren't using any other JavaScript libraries, so it seemed like overkill to include Prototype and LowPro just for this one effect. So it was a great chance to learn how to roll my own UJS without library support. I whipped up a quick UJS click-to-install technique following inspiration from this presentation. Here's what I came up with:

<script defer="defer" type="text/javascript">
//<![CDATA[
function doXPITrigger() {
  // bail out quietly on browsers without basic DOM support
  if (!document.getElementsByTagName) return false;
  var links = document.getElementsByTagName("a");
  for (var i = 0; i < links.length; i++) {
    if (links[i].className.match("firefox")) {
      links[i].onclick = function() {
        var xpi = {'Awesome New Project Toolbar': '/downloads/awesome_project.xpi'};
        InstallTrigger.install(xpi);
        return false; // cancel the normal link navigation
      };
    }
  }
}

window.onload = doXPITrigger;
//]]>
</script>

3) I've seen other advice recommending you configure your web server to serve .xpi files with the appropriate MIME type. I did this but it didn't make much difference in my case. Still, it's probably worth doing. I added this line to our Apache config:


AddType application/x-xpinstall .xpi
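
If you want to tighten up the script from step 2, one possible variation is sketched below. It is not what we shipped: hasClass and attachInstallHandlers are names I'm making up here. It matches the "firefox" class as a whole token (className.match("firefox") would also hit a class like "firefoxy"), and it only wires up the handlers when InstallTrigger actually exists, so other browsers keep the plain download link.

```javascript
// Returns true only when `name` is one of the whitespace-separated class tokens.
function hasClass(element, name) {
  var tokens = (element.className || "").split(/\s+/);
  for (var i = 0; i < tokens.length; i++) {
    if (tokens[i] === name) return true;
  }
  return false;
}

// Attaches the click-to-install handler to every link carrying the "firefox" class.
// `installer` is expected to behave like Firefox's InstallTrigger; it is passed in
// as a parameter so the function can be exercised outside a browser.
function attachInstallHandlers(links, installer, xpi) {
  var attached = 0;
  for (var i = 0; i < links.length; i++) {
    if (!hasClass(links[i], "firefox")) continue;
    links[i].onclick = function () {
      installer.install(xpi);
      return false; // cancel the normal link navigation
    };
    attached++;
  }
  return attached;
}

// Browser wiring: skipped entirely when there is no window or no InstallTrigger.
if (typeof window !== "undefined" && typeof InstallTrigger !== "undefined") {
  window.onload = function () {
    attachInstallHandlers(
      document.getElementsByTagName("a"),
      InstallTrigger,
      {'Awesome New Project Toolbar': '/downloads/awesome_project.xpi'}
    );
  };
}
```

Passing the installer in as an argument is just a convenience for trying the function out with a stand-in object; in the page itself it is always InstallTrigger.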