Last night I gave a short talk on using PhantomJS and CasperJS for webscraping at the Baltimore JavaScript Users Group. The talk is embedded below, and the exercises are posted to my GitHub account.
Thursday, August 30, 2012
Wednesday, August 29, 2012
Use a long passphrase for your wi-fi network
I was really chilled by this article on how easy it is to crack even seemingly-unguessable wi-fi passwords. I haven't kept up with the state of the art, and didn't realize how convenient the password cracking tools have become, nor did I know about the "deauth" trick the author describes.
The bottom line is, your basic Internet security hygiene now needs to include using uncrackable wi-fi passwords, particularly if you live or work in a wi-fi dense area that presents a "target-rich" environment to an attacker (such as the one where the author's office is located).
But what are you supposed to do, tell every house or business guest who wants to get on your wi-fi that the password is "Heyw00Dj@buZZoffe!!!69ass353q1"? That's lame. Instead I recommend choosing a random, distinctive passphrase of easy-to-spell unrelated words, separated by spaces.
For a business, I would also recommend changing the password on a regular basis. Many routers also have a "guest network" function that provides an isolated network to guests. That way you don't have to give guests access to your internal network in order for them to hop on the Internet.
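If you'd rather not pick the words yourself, here's a minimal Ruby sketch of the passphrase idea above. The word list, the method names, and the four-word default are all assumptions made for illustration; a real list should contain thousands of words, since each word only contributes log2(list size) bits of entropy. The length check reflects the 8 to 63 character limit on WPA/WPA2 passphrases.

```ruby
require "securerandom"

# Hypothetical, deliberately tiny word list for illustration only.
# Swap in a list of several thousand easy-to-spell words for real use.
SAMPLE_WORDS = %w[
  anchor basket candle dolphin ember falcon garden harbor
  island jacket kettle lantern meadow nickel orchard pebble
].freeze

# WPA/WPA2 pre-shared passphrases must be 8 to 63 characters long.
WPA_LENGTH = (8..63)

# Picks `count` words uniformly at random (with a cryptographically secure
# generator) and joins them with spaces.
def passphrase(words: SAMPLE_WORDS, count: 4)
  phrase = Array.new(count) { words[SecureRandom.random_number(words.size)] }.join(" ")
  raise ArgumentError, "passphrase length must be 8..63 characters" unless WPA_LENGTH.cover?(phrase.length)
  phrase
end

# Entropy in bits of a passphrase of `count` words drawn from `list_size` words.
def entropy_bits(list_size, count)
  count * Math.log2(list_size)
end
```

Calling `passphrase` returns four space-separated words from the list, and `entropy_bits` gives a rough way to compare list sizes and word counts before you settle on a scheme.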
Friday, August 24, 2012
My Fall 2012 technical speaking schedule
I'm excited about the chance to present a bunch of tech talks this fall about things I've been hacking on this year. Hope to see you there!
- 8/29/12: Remote Control Browsing at Baltimore JavaScript Users Group
- 9/11/12: Ruby Dependencies, Notifications, and Adjustments at Bmoreonrails
- 9/15/12: I'll be an instructor at RailsGirlsDC
- 9/22/12: Intro to Arduino at the Betascape physical computing workshop
- 10/11/12: Ruby Dependencies, Notifications, and Adjustments at DCRUG
Wednesday, August 22, 2012
Security tip: don't send passwords over email or IM
I always insist that people not send passwords or other sensitive data to me over email or instant messaging. Chances are good that your email or my email will someday get hacked, and one of the first things an attacker will do is search your mailboxes for words like "password", "account ID", and "credit card number". It's not a wise risk to take.
Here's what I do instead:
KEYVAULT
I usually ask people to use Keyvault, which lets you send encrypted, short-lived, self-destructing messages accessible only by a unique key. The recipient receives a message from Keyvault with a link to the message referencing the unique key.
Of course this requires you to trust the people who make Keyvault not to be bad actors. I'm not so worried about this because the service was recommended to me by a friend, Scott Paley.
You also have to trust Keyvault to handle their security properly, to really delete the messages after the self-destruct time limit expires, and so forth. So I recommend sending the most sensitive data out-of-band, using one of the methods below.
This is a tradeoff of course; there's still a risk in trusting Keyvault, but it's a lot less risky than trusting your email provider.
I've thought about building an open-source version of Keyvault that comes with instructions for deploying it to your own personal Heroku installation, so that you don't have to trust anyone else.
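To make the idea concrete, here's a rough Ruby sketch of the core of such a service: an in-memory store for encrypted, read-once, expiring messages. This is not how Keyvault itself works (I don't know its internals); the class name `SecretDrop`, its methods, and the injectable clock are all made up for illustration, and a deployable version would need persistent storage and a web front end. The point is that the store keeps only ciphertext, while the key travels with the link.

```ruby
require "openssl"
require "securerandom"

# Hypothetical sketch of a self-destructing message store.
class SecretDrop
  Entry = Struct.new(:iv, :tag, :ciphertext, :expires_at)

  # `clock` is injectable so expiry can be exercised without sleeping.
  def initialize(clock: -> { Time.now })
    @entries = {}
    @clock = clock
  end

  # Encrypts the message with a fresh AES-256-GCM key.
  # Returns [id, hex_key]; only the ciphertext is kept in the store.
  def put(plaintext, ttl_seconds:)
    cipher = OpenSSL::Cipher.new("aes-256-gcm")
    cipher.encrypt
    key = cipher.random_key
    iv = cipher.random_iv
    ciphertext = cipher.update(plaintext) + cipher.final
    id = SecureRandom.urlsafe_base64(16)
    @entries[id] = Entry.new(iv, cipher.auth_tag, ciphertext, @clock.call + ttl_seconds)
    [id, key.unpack1("H*")]
  end

  # Deletes the entry on the first attempt (successful or not), then returns
  # the plaintext, or nil if the entry is missing, expired, or the key is wrong.
  def take(id, hex_key)
    entry = @entries.delete(id)
    return nil if entry.nil? || @clock.call >= entry.expires_at

    decipher = OpenSSL::Cipher.new("aes-256-gcm")
    decipher.decrypt
    decipher.key = [hex_key].pack("H*")
    decipher.iv = entry.iv
    decipher.auth_tag = entry.tag
    (decipher.update(entry.ciphertext) + decipher.final).force_encoding(Encoding::UTF_8)
  rescue OpenSSL::Cipher::CipherError, ArgumentError
    nil
  end
end
```

Deleting the entry before decrypting means a wrong guess also destroys the message, which seemed like the safer default for a sketch like this.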
SMS
This isn't as secure an option as it once was, now that people can back up their smartphones to the cloud or to a computer, but if you don't identify what the password is in the text message I feel it's fairly secure. The way to do this is to send an email that says "my account ID is ####. I'll text the password to you shortly." Then you just send a text with your password, by itself.
SNAIL MAIL
For really sensitive stuff, like the passphrase to unlock your company's PGP keys, I write the sensitive data on a piece of paper, stuff it in an envelope, and mail it to the recipient, making sure they know to keep the envelope in a safe place.
Monday, August 20, 2012
Some useful ImageMagick snippets
I try and stay out of Photoshop as much as possible. For most non-design image manipulation tasks I've found ImageMagick to be very useful, though it can be hard to sort through the documentation to find the transformation you need. Here are the ImageMagick commands I find most useful:
    # Flatten a transparent image with a white background:
    convert -flatten img1.png img1-white.png

    # Make an image transparent
    convert -transparent '#FFFFFF' nontransparent.gif transparent.png

    # convert an image into tiles
    convert -size 3200x3200 tile:single_tile.png final.png

    # making a montage from a collection of images
    montage -geometry +0+0 -background transparent *png montage.png

    # inverting colors
    convert before.png -negate after.png

    # generating a favicon
    convert large_image.png -resize 16x16! favicon.ico
I also once needed to add numbers to individual parts of a tiled image. I found it easier to script this with Ruby than to use the shell's looping constructs:
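    # adding numbers to a tiled image
    # (325 tiles of 32x32 pixels, laid out 25 per row)
    cmd = (0..324).to_a.inject([]) do |args, n|
      y = (n / 25 * 32) + 15   # row of the tile, offset into the tile
      x = ((n % 25) * 32) + 15 # column of the tile, offset into the tile
      args << "-draw 'fill red text #{x},#{y} \"#{n}\"'"
    end
    `convert img.png #{cmd.join(' ')} annotated_img.png`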
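The coordinate arithmetic in that script is easier to check when it lives in its own method. Here's a sketch of the same calculation with the grid dimensions pulled out as parameters; the method names and keyword arguments are my own invention, and the defaults (25 columns, 32-pixel tiles, 15-pixel inset) are just the numbers from the snippet above. Building the command as an array also sidesteps shell quoting when it's passed to `system`.

```ruby
# Pixel position of the label for tile `n`, counting tiles row by row.
def label_position(n, columns: 25, tile_size: 32, inset: 15)
  row, col = n.divmod(columns)
  [col * tile_size + inset, row * tile_size + inset]
end

# One "-draw" option pair per tile, as separate argv entries.
def annotate_args(tile_count, **grid)
  (0...tile_count).flat_map do |n|
    x, y = label_position(n, **grid)
    ["-draw", %(fill red text #{x},#{y} "#{n}")]
  end
end

# Full argv for ImageMagick's convert; pass it to system(*argv) to run it.
def annotate_command(input, output, tile_count, **grid)
  ["convert", input, *annotate_args(tile_count, **grid), output]
end
```

For the image above you'd run `system(*annotate_command("img.png", "annotated_img.png", 325))`.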
Monday, August 13, 2012
How to create OmniAuth provider aliases
I'm working on a project that uses OmniAuth to authenticate with many third-party services, including a few different Google products. There's an omniauth-google-oauth2 strategy that makes this easy, but I wanted to make an alias for this strategy so I could treat access to a user's AdWords account separately from their Analytics or Profile data. Some users will only be connecting our service to one of these products, and I didn't want to request access to everything at once.
My solution was to make aliases for the google_oauth2 strategy. I didn't see a built-in way to do this, nor did web searches reveal other solutions, so here's what I came up with:
    # lib/omniauth-adwords-oauth2.rb
    require "omniauth-google-oauth2"

    class AdwordsOauth2 < OmniAuth::Strategies::GoogleOauth2
      option :name, 'adwords_oauth2'
    end

    # config/initializers/omniauth.rb
    require "omniauth-adwords-oauth2"

    Rails.application.config.middleware.use OmniAuth::Builder do
      provider :adwords_oauth2, ENV['GOOGLE_API_CLIENT_ID'], ENV['GOOGLE_API_CLIENT_SECRET'], {
        scope: "https://adwords.google.com/api/adwords/" # could also be adwords-sandbox.google.com
      }
    end

    OmniAuth.config.logger = Rails.logger
Thursday, August 9, 2012
Webscraping with CasperJS and PhantomJS
Update: check out this short CasperJS screencast with accompanying exercises
I needed to scrape some data from a website without an API for my latest project. The target website used JavaScript for all of its navigation, working hard against the grain of HTML and other standards. This was anti-hypermedia: there were absolutely no regular HTML anchor tags that I could use to navigate the interface (all of them were hooked to onclick JavaScript events which computed the next URL to open dynamically) and there were barely any CSS selectors I could target.
To make this work, I needed my webscraper to execute real JavaScript code to simulate a real user navigating with a real browser.
I thought for a while I was going to have to use Selenium for this task. Because the project runs on Heroku, I thought this was going to add some very heavyweight dependencies to the project, or force me to use dedicated hosting. I know you can run Selenium in a headless mode via Xvfb, but I'd still have to get a GUI-based browser compiled for Heroku. If that sounds like what you need to do, check out the Vulcan build server.
PhantomJS to the Rescue
Happily, I found another way: PhantomJS, a truly headless WebKit browser that runs JavaScript code and has no dependencies. I downloaded the binaries from the website and dumped 'em right into my app's vendor directory.
I loved how easy it was to get started with PhantomJS, but I found it cumbersome for a multi-step webscraping project. You have to code via a series of callbacks that can get hard to manage: for example, when you ask for a new page to be fetched, you have to register a callback to be notified when and if that page load succeeds. The best feature of PhantomJS, besides being able to execute JavaScript, is its screen capture mode, which is very helpful for debugging.
The Missing Piece: CasperJS
In my googling for a solution to this pain I found another awesome project built on top of PhantomJS: CasperJS. Casper gives you a nice DSL to control PhantomJS and hide the callback spaghetti code. It reminds me of how EM-Synchrony wraps up EventMachine's callbacks so you can write asynchronous code in a synchronous style.
Once I got the hang of Casper my code basically wrote itself. Below is a sanitized and minimally-formatted version of the production code, also posted as a GitHub gist. Hope it helps you get started with these powerful tools.
    var system = require('system');

    if (system.args.length < 5) {
      console.info("You need to pass in account name, username, password, and path to casperJS as arguments to this code.");
      phantom.exit();
    }

    var account = system.args[1];
    var username = system.args[2];
    var password = system.args[3];
    var base_uri = "https://example.com/" + account;

    phantom.casperPath = system.args[4];
    phantom.injectJs(phantom.casperPath + '/bin/bootstrap.js');

    var utils = require('utils');
    var casper = require('casper').create({
      verbose: true,
      logLevel: 'debug'
    });

    casper.on('error', function(msg, backtrace) {
      this.echo("=========================");
      this.echo("ERROR:");
      this.echo(msg);
      this.echo(backtrace);
      this.echo("=========================");
    });

    casper.on("page.error", function(msg, backtrace) {
      this.echo("=========================");
      this.echo("PAGE.ERROR:");
      this.echo(msg);
      this.echo(backtrace);
      this.echo("=========================");
    });

    casper.start(base_uri + "/login", function () {
      this.fill("form[name='login_form']", {
        username: username,
        password: password
      }, true);
    });

    // can't click the reports button as that causes a weird file:// link problem
    casper.thenOpen(base_uri + "/reports");

    casper.then(function() {
      var url = this.evaluate(function() {
        return __utils__.getElementByXPath("//a[contains(@href,'Account') and contains(@href,'Report')]").href;
      });
      // winOpenTransform is a function provided by the page; it's brittle for us to invoke it
      // this way instead of clicking a button, but when you click the button, it pops up a new
      // window, and PhantomJS doesn't currently support popup windows. Thus I have to call
      // this function directly to avoid a popup.
      url = winOpenTransform(url.match(/http:.+?(?=')/)[0]);
      this.open(url);
    });

    casper.then(function() {
      casper.page.injectJs('jquery.min.js'); // so we can pick an option with the select item below
    });

    casper.thenEvaluate(function() {
      document.report.runtimeCondition.value = "ExampleField IS NOT NULL";
      document.report.condition.value = "'','','examplename IS NOT NULL','21'";
      document.report.target = "report";
      document.report.submit();
    });

    casper.thenOpen("https://example.com/" + account + "/Report/reports");

    casper.then(function() {
      var url = this.evaluate(function() {
        return __utils__.getElementByXPath("//a[./text()='CSV']")["href"];
      });
      this.echo("GETTING " + url);
      this.download(url, "data.csv", "GET");
    });

    casper.run();
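Since the script takes its four arguments on the command line, the Ruby side of the app has to shell out to PhantomJS. Here's a sketch of one way to do that; the three paths and both method names are hypothetical placeholders (adjust them to wherever the binaries and script live under your app's vendor directory), and this is not the launcher from the production code.

```ruby
require "open3"

# Hypothetical locations; the post only says the binaries live under vendor.
PHANTOMJS_BIN  = File.join("vendor", "phantomjs", "bin", "phantomjs")
CASPERJS_PATH  = File.join("vendor", "casperjs")
SCRAPER_SCRIPT = File.join("lib", "scrapers", "report.js")

# Builds the argv in the order the script reads system.args[1..4]:
# account, username, password, path to CasperJS.
def scraper_command(account:, username:, password:)
  [PHANTOMJS_BIN, SCRAPER_SCRIPT, account, username, password, CASPERJS_PATH]
end

# Runs the scraper and returns its combined stdout/stderr, raising on failure.
# Passing an argv array (not one string) avoids shell quoting problems.
def run_scraper(**credentials)
  output, status = Open3.capture2e(*scraper_command(**credentials))
  raise "scrape failed (exit #{status.exitstatus}):\n#{output}" unless status.success?
  output
end
```

Two caveats: a password passed as an argument is visible in the process list on a shared machine, and the script's own argument check calls `phantom.exit()` with no exit code, so a missing argument will not show up as a failure here.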
Wednesday, August 8, 2012
Giving Away $4500 In Ignition Grants at Ignite Baltimore #11
Just a quick note to alert everyone to the fact that we're giving away $4500 in Ignition Grants at the next Ignite Baltimore. Kate Bladow, Scott Burkholder, and Andrea Snyder have taken over leadership of this project, along with our new fiscal sponsor gb.tc.
If you know someone that's got an idea for a new product, service, or project that would make Baltimore a better place to live and work, please encourage them to apply. More details here.
Also, I recently appeared on gb.tc's weekly podcast discussing the call for proposals to speak at Ignite. In case you're wondering what kinds of talks we're looking for, or if you just want to know what's up with Baltimore tech in general, check out the video of the show:
Monday, August 6, 2012
Sidekiq, Ruby's non-atomic require, and multithreading
I just moved the workers for my newest project from Resque to Sidekiq, and it's working beautifully. I'm saving memory and enjoying a performance boost. This is part of my overall goal to use multi-threading in Ruby as much as possible (I'm also moving the web component of this project from Unicorn to Puma). I've never liked EventMachine and I've been greatly influenced to favor threads over processes and over EM by David Bryant Copeland, who gave this excellent talk about multithreaded Ruby at RubyNation 2012.
But today I stumbled onto an interesting, tricky bug, which exemplifies one of the downsides of multithreaded programming.
Background
I had created a new worker that used the mechanize gem for webscraping. The worker was complicated and used several different classes to get the work done. I had to require "mechanize" in a few different files, mainly so I could reference Mechanize::Error in a couple of exception handlers. This was super well-tested code that worked great on my dev machine, but things went to hell in production.
The Bug
This worker would just get stuck with zero information in the log files - the whole thread would just deadlock. Sidekiq has a TTIN signal handler that helps you figure out where your code is stuck, but unfortunately the workers run on Heroku, and Heroku does not let you send arbitrary signals to your processes, so I couldn't use it. Instead I had to insert a bunch of logging probes in my code to see exactly what line of code was causing things to freeze.
It turns out my code was freezing on a require statement, where I required the first class which required the mechanize gem. I remembered that in Ruby require is not atomic, so I was able to zero in on the problem.
The Solution
Once I moved the require "mechanize" statement into an initialization step, before my workers were loaded, everything performed beautifully.
Lesson Learned
Quoting this Stack Overflow answer, because of the potential for require to cause deadlocks like this: "require everything you need before starting a thread if there's any potential for deadlock in your app."
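Here's a small sketch of that pattern: load every library up front on the main thread, and only then start the threads that use them. The file name, the constant, and the `run_workers` method are made up for this example, and I've used standard-library stand-ins so the snippet runs anywhere; in my app the list would include "mechanize", and Sidekiq would be the one starting the threads.

```ruby
# config/initializers/eager_requires.rb (hypothetical file name)

# Stand-ins from the standard library; the real list would name the gems
# your workers reference, e.g. "mechanize".
EAGER_LIBRARIES = %w[json time set].freeze

# Require everything on the main thread, before any worker thread exists.
EAGER_LIBRARIES.each { |lib| require lib }

# Starts `count` threads that only use libraries loaded above, so no
# thread ever calls require while another one is running.
def run_workers(count)
  threads = Array.new(count) do |i|
    Thread.new do
      JSON.generate("worker" => i, "finished_at" => Time.now.utc.iso8601)
    end
  end
  threads.map(&:value) # wait for each thread and collect its result
end
```

The worker classes themselves can then drop their own require calls, or keep them as harmless no-ops, since the libraries are already loaded by the time any job runs.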