Mike Subelsky: ruby


Monday, May 12, 2014

Fixing "SocketError: getaddrinfo: Name or service not known" with Ruby's resolv-replace.rb

Update 6/3/2014: there are other advantages to using a pure-Ruby DNS implementation.

Recently we had a problem with DNS lookups for a particular API that only afflicted our Heroku-based workers (host name changed to protect the innocent):

irb(main):009:0> Socket.gethostbyname("example.net")
SocketError: getaddrinfo: Name or service not known

That line of code works fine in other places I tried, like my dev machine. The specific error had to do with a call that Ruby's net/http library was making. I filed a ticket with Heroku support using the above example, and they suggested I try using Ruby's Resolv library, something I had not encountered before. This post is a little breadcrumb to help others who may have the same problem.

Resolv uses Ruby code to do DNS lookups, instead of relying on the system's libc installation like gethostbyname does. Rather than having to monkeypatch net/http, all I had to do was add require "resolv-replace.rb" to our startup code, which automatically applied this fix. I never found out why the normal lookup process wasn't working on Heroku, but this worked. Thanks to the thoughtful soul who wrote resolv-replace.rb!
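A minimal sketch of what that looks like in practice. Where the require goes (a boot or initializer file) is an assumption about your app, and the real lookup is left commented out because it needs network access:

```ruby
# Load once, early in the boot process, before any network code runs.
# resolv-replace swaps the name-resolution hooks on Ruby's socket classes
# for the pure-Ruby Resolv library.
require "resolv-replace"

# A literal IP address needs no DNS query, so this line is safe to run anywhere.
local = IPSocket.getaddress("127.0.0.1")
puts local

# A real lookup goes through Resolv instead of libc (requires network access):
# Resolv.getaddress("example.net")
```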

Friday, May 9, 2014

Looking for a Rubyist

Staq is growing fast. There is so much demand for our products that we need to add another developer to our team to keep up. If you are a Ruby programmer who has experience building large web apps, who is excited about working in a culture of delegated responsibility, who can help coach the rest of the team and help with system design, please get in touch!

We offer all of the usual startup perquisites: salary, plenty of equity, autonomy, and the opportunity to grow your career along with the company.

PS We also write a lot of JavaScript and even a smattering of Python.

Thursday, February 20, 2014

Slides for Writing Clean, Concise, and Confident Ruby Code

Today I will be giving a talk at NET/WORK called Writing Clean, Concise, and Confident Ruby Code. Since it's very code-focused, I once again used Stefan Otte's presenting.vim. The slides (which include links to all of my references) are all up on GitHub.

I'm indebted to Avdi Grimm for providing much of the inspiration for this approach to Ruby, and to my Staq colleagues for letting me try out my armchair programming philosophy on them!

Monday, February 3, 2014

Staq's Ruby Apprentice Program Was a Great Success!

Back in September 2013 we announced Staq's apprentice program, where we offered paid work to novice Ruby programmers. Today we're announcing that we just completed the program by hiring four of the apprentices who went through the program as full-time software developers at Staq!

Me, my partner James (2nd from left) and the new developers

Over the fall and winter we fielded many questions about how the training went, whether we will do it again, and so on, so I thought the Internet might like to find out how everything worked out.

We received over 100 applications from around the world, many of which were from very well-qualified candidates. I was very surprised at the number of people with strong computer science and engineering backgrounds who saw this as a chance to jumpstart their careers or re-engage with programming after a period of time doing something else.

We selected 12 candidates to attend in-person training sessions at our office in Hampden. We conducted four sessions of two hours each where we dived right in and showed the students the guts of how Staq's data extraction technology works, along with the basics of Ruby and rspec. In retrospect, our syllabus was far too ambitious; the next time we do this we will just focus on using Ruby to write web-and-API scrapers using mechanize and typhoeus, along with a gentle introduction to CasperJS. I should have introduced Staq's proprietary framework later in the process.

This experience also opened my eyes to the importance of good documentation: we have written a lot of specialty code that would be a lot easier to understand with good in-line documentation. So now everything we write includes extensive YARD annotations.

I regret not having more resources and time to devote to the training, since we're still a small company at this point. The next time we do this, I want to involve a professional Ruby trainer who could give more structure and rigor to the program (are you reading this, Jeff Casimir?). It would also be cool to coordinate our efforts with Betamore Academy. I wish I had been able to prepare a more thoughtful syllabus, or present in a less harried, rapid-fire manner, but such are the exigencies of startup life.

We chose five students to become Ruby apprentices, based on their performance during the training classes. The apprentices started out as hourly, part-time employees, who scheduled their work hours around other commitments. We started weekly training sessions for them (which every programmer in the company began attending), but most of the training was hands-on. The apprentices made many heroic contributions to our company, cleaning up messes, digging into details, writing critical revenue-generating code, and exhibiting great professionalism in a demanding, low-supervision work environment.

The apprentices graduated in January and we made full-time job offers to all five of them. One graduate decided to move to New York, and got a great job offer with a well-known, successful software company. The other four accepted.

We absolutely will do this again, but we've got to grow the business some more first. Watch this space for details!

Thursday, November 1, 2012

Ruby Dependencies, Notifications, and Adjustments

Note: I originally submitted this article as a draft to the Practicing Ruby online journal. Gregory liked my concept but ultimately asked if I would mind if he wrote his own version. He did an awesome job, and if you're interested in this subject I definitely recommend you read his article.

Introduction

Object-oriented programming helps organize your code in a resilient, easy-to-change way. This article aims to explore one of the concepts that trips up beginner and more experienced object-oriented programmers: how to sensibly connect a set of objects together to perform a complex task. How do you put instances of your information-hiding, single-responsibility-discharging, message-passing classes in touch with one another?

I became confused about the smartest ways to do this when I started building Ruby apps that involved fetching large amounts of data from external services. In these projects, a Rails or Sinatra web application acted as a facade for workers querying a large set of APIs. Each API was different from the last, requiring different approaches and different dependencies. Some APIs involved five or six different steps, and in some cases each step needed to be handled by a different object.

I felt I understood object-oriented programming pretty well, yet I struggled with specifying the relationships between objects so that each object knew just enough about its peers to get the job done. My style was inconsistent. Sometimes I would inject a dependency using the constructor, and sometimes I would use a setter method. At other times it seemed more natural to have an object directly instantiate new instances of whatever objects it needed, on the fly.

All of the code examples referenced below can be found in this gist.

Object Peer Stereotypes

All of this changed when someone turned me on to the book Growing Object-Oriented Software, Guided by Tests by Steve Freeman and Nat Pryce. The book has a chapter on object-oriented design styles, and includes a description of “Object Peer Stereotypes” that addressed my conundrum perfectly.

The authors divide an object’s peers into three categories: Dependencies, Notifications, and Adjustments (DNA). These are rough categories, because an object peer could fit into more than one category, but I found it to be a useful distinction. We’ll explore each of these categories as they pertain to Ruby code using an example from my real production code: a wrapper for Typhoeus I wrote called HttpRequest.

By the way, Gregory wrote about a related topic (what types of arguments to pass into a method) back in Issue 2.14. As your objects become more sophisticated you’ll find you end up passing fewer basic object types like strings, symbols, or numbers, and more of the Argument, Selector, or Curried objects that Gregory describes.

Dependencies

“Services that the object requires from its peers so it can perform its responsibilities. The object cannot function without these services. It should not be possible to create the object without them.”

“...we insist on dependencies being passed in to the constructor, but notifications and adjustment can be set to defaults and reconfigured later.”

I wrote the HttpRequest class so that I could set on_success and on_failure callbacks (where Typhoeus only provides an on_complete callback) and to encapsulate my dependency on the Typhoeus gem, in case I want to switch to another HTTP library later.

HttpRequest objects have two Dependencies: the URL of the request and a set of options for telling Typhoeus how to make the request. Here's a link to an example of HttpRequest code.

Note that I’m supplying a default for the options argument, since it’s just a hash that gets passed on to Typhoeus::Request, and it’s something you’ll have available at the same time you have the URL. Because there is a sensible default (an empty hash), you could argue that this argument is more of an Adjustment, described below.

If options was a more complex object, something that might have peers of its own, I would probably treat it as an Adjustment. I find test-driven development really helpful in cases like this because often the tests can help you feel out which approach is more appropriate (which is the whole premise of the Growing Object Oriented Software book).
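To make the idea concrete, here is a minimal sketch of constructor-injected Dependencies. It is not the code from the gist: the Typhoeus wrapping is left out, and the validation rule and example URL are assumptions for illustration.

```ruby
# Hypothetical sketch: the real HttpRequest wraps Typhoeus, which is omitted here.
class HttpRequest
  attr_reader :url, :options

  # url is a Dependency: the object cannot be created without it.
  # options has a sensible default, which is why it reads more like an Adjustment.
  def initialize(url, options = {})
    raise ArgumentError, "url is required" if url.to_s.empty?

    @url = url
    @options = options
  end
end

request = HttpRequest.new("http://example.net/data", { timeout: 5 })
minimal = HttpRequest.new("http://example.net/data")
```

Raising in the constructor is one way to honor the rule that it "should not be possible to create the object" without its Dependencies.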

Notifications

“Peers that need to be kept up to date with the object’s activity. The object will notify interested peers whenever it changes state or performs a significant action. Notifications are ‘fire and forget’; the object neither knows nor cares which peers are listening.”

In the HttpRequest example, success_callbacks and failure_callbacks are the notifications. Another object can register for success notifications like this.
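As a rough sketch of how such callback registration could work (this is not the gist code; the complete method is a hypothetical stand-in for the point where the HTTP library reports a finished request):

```ruby
# Hypothetical sketch of success/failure Notifications.
class HttpRequest
  def initialize(url)
    @url = url
    @success_callbacks = []
    @failure_callbacks = []
  end

  # Peers register interest; the request never needs to know who they are.
  def on_success(&block)
    @success_callbacks << block
    self
  end

  def on_failure(&block)
    @failure_callbacks << block
    self
  end

  # Stand-in for a single on_complete hook: route to one list or the other.
  def complete(status, body)
    callbacks = (200..299).cover?(status) ? @success_callbacks : @failure_callbacks
    callbacks.each { |callback| callback.call(status, body) }
  end
end

received = []
request = HttpRequest.new("http://example.net/data")
request.on_success { |status, body| received << [:ok, status, body] }
request.on_failure { |status, _body| received << [:failed, status] }
request.complete(200, "hello")
request.complete(500, "")
```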

Logging is another canonical notification example. Here’s a pattern I use a lot for logging.
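One common shape for a logging Notification, sketched under the assumption that a silent logger is an acceptable default (the DataFetcher class body here is hypothetical):

```ruby
require "logger"
require "stringio"

# Hypothetical sketch: the logger is a Notification peer with a silent default.
class DataFetcher
  attr_writer :logger

  # Logger.new(nil) discards every message, so callers who don't care
  # about logging never have to configure anything.
  def logger
    @logger ||= Logger.new(nil)
  end

  def fetch
    logger.info("fetch started")
    :done
  end
end

# A caller who does care swaps in a real logger after construction.
log_output = StringIO.new
fetcher = DataFetcher.new
fetcher.logger = Logger.new(log_output)
fetcher.fetch
```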

Notifications can also be sent as arguments to a method call. I often pass a block for error handling. I find this usually involves fewer lines of code than returning a status object that must be tested for success or failure. Here's an example.
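A small sketch of the block-as-error-handler idea. ReportParser and its parsing rule are invented for illustration and are not from the gist:

```ruby
# Hypothetical sketch: failures are reported to the caller's block
# instead of being returned as a status object.
class ReportParser
  def parse(text)
    Integer(text)
  rescue ArgumentError => e
    yield e.message if block_given?
    nil
  end
end

errors = []
parser = ReportParser.new
value = parser.parse("42") { |message| errors << message }
missing = parser.parse("n/a") { |message| errors << message }
```

The caller never has to inspect a result object: the happy path returns the value, and the block only runs when something went wrong.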

Adjustments

“Peers that adjust the object’s behavior to the wider needs of the system. This includes policy objects that make decisions on the object’s behalf...and component parts of the object if it’s a composite.”

Most of my Adjustments involve component parts of a composite object. For the API-intense project where I’m using HttpRequest, I always have one class that has overall responsibility for getting all of the data we need for each API. That “master” class just does one thing: it coordinates the activities of a set of Adjustment peers, all of which are set to sensible defaults.
This also enables simple unit testing because you can so easily set the adjustments to mock objects provided by the tests.
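Here is a minimal sketch of a coordinating class whose component parts default sensibly but can be replaced. All of the class names are hypothetical:

```ruby
# Hypothetical component parts with trivial behavior.
class DefaultFetcher
  def fetch
    "a,b,c"
  end
end

class DefaultParser
  def parse(raw)
    raw.split(",")
  end
end

# The coordinator only wires its Adjustment peers together.
class ApiCoordinator
  attr_writer :fetcher, :parser

  def fetcher
    @fetcher ||= DefaultFetcher.new
  end

  def parser
    @parser ||= DefaultParser.new
  end

  def run
    parser.parse(fetcher.fetch)
  end
end

default_result = ApiCoordinator.new.run

# In a unit test, an Adjustment is swapped for a stand-in via its setter.
class FakeFetcher
  def fetch
    "x,y"
  end
end

coordinator = ApiCoordinator.new
coordinator.fetcher = FakeFetcher.new
stubbed_result = coordinator.run
```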

If you use the strategy pattern, where peer objects make decisions for your object, your Adjustment might look like this.

It could be that AdminChecker is more of a Dependency than an Adjustment, depending on how many different kinds of AdminCheckers there are and how central admin-checking is to your code. If there’s no normal default for the admin_checker value, and if you really can’t make a DataFetcher without knowing what kind of checking policies it’ll be working with, you should probably inject your admin_checker in the constructor, marking it as an important Dependency.
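A sketch of the strategy-pattern Adjustment discussed above, assuming users are plain hashes with a :role key (an invented detail) and that a default AdminChecker makes sense:

```ruby
# Hypothetical policy object: decides on the fetcher's behalf.
class AdminChecker
  def admin?(user)
    user[:role] == "admin"
  end
end

# An alternative policy with the same interface.
class EveryoneIsAdmin
  def admin?(_user)
    true
  end
end

class DataFetcher
  attr_writer :admin_checker

  # Adjustment: sensible default, reconfigurable after construction.
  def admin_checker
    @admin_checker ||= AdminChecker.new
  end

  def fetch_for(user)
    admin_checker.admin?(user) ? :full_report : :summary
  end
end

guest = { role: "guest" }
fetcher = DataFetcher.new
with_default = fetcher.fetch_for(guest)
fetcher.admin_checker = EveryoneIsAdmin.new
with_override = fetcher.fetch_for(guest)
```

If no default made sense, the same peer would move into the constructor's argument list and become a Dependency.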

HttpRequestService

There’s one other facet to my HttpRequest object that I thought Practicing Ruby readers might find interesting. Because Typhoeus is concurrent, you have to queue up requests onto a shared Typhoeus::Hydra object. The requests don’t run until you invoke the hydra’s #run method. I experimented with storing the Hydra object in various places and ended up creating a factory for HttpRequest objects called HttpRequestService, below. Can you spot the dependencies and adjustments? It doesn’t have notifications, but I could see adding some instrumentation to measure HttpRequest times.

Instances of the HttpRequestService end up as Adjustment peers for the objects responsible for fetching data.
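Since the original listing lives in the gist, here is only a rough sketch of the factory idea. FakeHydra is a stand-in that mimics the queue and run methods of Typhoeus::Hydra so the example runs without the gem; the request and run method names on the service are assumptions.

```ruby
# Stand-in for Typhoeus::Hydra, which likewise exposes #queue and #run.
class FakeHydra
  def initialize
    @queued = []
  end

  def queue(request)
    @queued << request
  end

  def run
    @queued.shift.perform until @queued.empty?
  end
end

# Simplified request that just records whether it was performed.
class HttpRequest
  attr_reader :url, :performed

  def initialize(url)
    @url = url
    @performed = false
  end

  def perform
    @performed = true
  end
end

class HttpRequestService
  attr_writer :hydra

  # Adjustment: the shared queue defaults sensibly but can be replaced.
  def hydra
    @hydra ||= FakeHydra.new
  end

  # Factory method: build a request and queue it on the shared hydra.
  def request(url)
    HttpRequest.new(url).tap { |req| hydra.queue(req) }
  end

  def run
    hydra.run
  end
end

service = HttpRequestService.new
first = service.request("http://example.net/a")
second = service.request("http://example.net/b")
performed_before_run = first.performed
service.run
```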

Dependency Injection Containers

Rubyists generally eschew dependency injection containers but they complement the DNA style quite well. I use dependency injection containers as the single place where my code can pull in dependencies from different sources. These dependencies sometimes involve extra setup steps or massaging, depending on whether the code is running in production mode or not, and the container is a convenient place to consolidate that kind of housekeeping code. It often provides the sensible default for notifications and adjustments, and it’s an important part of the boot process for most of my Ruby code.

I’ve created a simple gem for this purpose called dim based on Jim Weirich’s article. If you’re interested in the topic, I highly recommend that article. Here’s a snippet of one of my container definitions.
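To give a feel for the idea without depending on the gem, here is a tiny container sketch. It is not dim's implementation or API: MiniContainer, its fetch method, and the service names are all hypothetical.

```ruby
# Hypothetical minimal container: services are registered as blocks
# and built lazily, once, on first use.
class MiniContainer
  def initialize
    @blocks = {}
    @cache = {}
  end

  def register(name, &block)
    @blocks[name] = block
    self
  end

  def fetch(name)
    @cache.fetch(name) do
      block = @blocks.fetch(name) { raise KeyError, "unknown service: #{name}" }
      # The block receives the container so it can look up other services.
      @cache[name] = block.call(self)
    end
  end
end

container = MiniContainer.new
container.register(:environment) { "test" }
container.register(:api_url) do |c|
  c.fetch(:environment) == "production" ? "https://api.example.net" : "http://localhost:3000"
end

api_url = container.fetch(:api_url)
```

The production-versus-test branching lives in one place, which is the housekeeping role described above.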

Conclusion

Don’t hold too rigidly to these classifications; they’re more like heuristics. As Steve Freeman and Nat Pryce wrote:

“What matters most is the context in which the collaborating objects are used. For example, in one application an auditing log could be a dependency, because auditing is a legal requirement for the business and no object should be created without an audit trail. Elsewhere, it could be a notification, because auditing is a user choice and objects will function perfectly well without it.”

When considering how to organize object peers I recommend you favor what’s most understandable and flexible, even if it means deviating from the DNA pattern.

Monday, October 15, 2012

Ruby Dependencies, Notifications, and Adjustments

Here are the slides for the Ruby Dependencies, Notifications, and Adjustments talk I've recently given at DCRUG and Bmore On Rails:

https://github.com/subelsky/tkn/blob/master/examples/ruby_dna.rb

The talk discusses the object peer stereotype concept introduced in the book Growing Object Oriented Software, Guided by Tests. I also mention Jim Weirich's post about Ruby Dependency Injection.

I used the experimental terminal keynote software to make the slides. It was pretty fun writing my slides in Ruby, and I'd definitely recommend using it for code-heavy talks like this one. But if you need diagrams or something more visually-stimulating I'd recommend sticking with GUI-based software.

Wednesday, September 12, 2012

Some great software design and architecture books

A friend and fellow self-taught programmer asked me to recommend some good books he and his team could read to buttress their Ruby design and architecture knowledge. I definitely recognized my friend's situation: after you spend years consuming short-form content like blog posts, tech talks, and screencasts, it becomes hard to organize your store of knowledge. You need labels for concepts, you need semantic layers of knowing. You need to think more deeply and comprehensively, with longer examples. You need books!

So here are some of my favorite software design/architecture books, not all of which are Ruby specific:

Design Patterns in Ruby
Ruby Best Practices
Objects on Rails
The RSpec Book
Code Complete
Clean Code

I haven't read these yet but they're on my list:

Clean Ruby
Practical Object-Oriented Design in Ruby

Also, Building Hypermedia APIs with HTML5 and Node is very relevant to web developers in any language. AdStaq is going to have a hypermedia API; in fact we're making it a central part of our sales pitch.

Monday, August 20, 2012

Some useful ImageMagick snippets

I try and stay out of Photoshop as much as possible. For most non-design image manipulation tasks I've found ImageMagick to be very useful, though it can be hard to sort through the documentation to find the transformation you need. Here are the ImageMagick commands I find most useful:

# Flatten a transparent image with a white background:
convert -flatten img1.png img1-white.png

# Make an image transparent
convert -transparent '#FFFFFF' nontransparent.gif transparent.png

# convert an image into tiles
convert -size 3200x3200 tile:single_tile.png final.png

# making a montage from a collection of images
montage -geometry +0+0 -background transparent *png montage.png

# inverting colors
convert before.png -negate after.png

# generating a favicon
convert large_image.png -resize 16x16! favicon.ico

I also once needed to add numbers to the individual parts of a tiled image. I found it easier to script this with Ruby rather than use the shell's looping constructs:

# adding numbers to a tiled image
cmd = (0..324).to_a.inject([]) do |cmd,n| 
  y=(n/25*32)+15; x=((n%25)*32)+15
  cmd << "-draw 'fill red text #{x},#{y} \"#{n}\"'"
end

`convert img.png #{cmd.join(' ')} annotated_img.png`

Monday, August 13, 2012

How to create OmniAuth provider aliases

I'm working on a project that uses OmniAuth to authenticate with many third-party services, including a few different Google products. There's an omniauth-google-oauth2 strategy that makes this easy, but I wanted to make an alias for this strategy so I could treat access to a user's AdWords account separately from their Analytics or Profile data. Some users will only be connecting our service to one of these products, and I didn't want to request access to everything at once.

My solution was to make aliases for the google_oauth2 strategy. I didn't see a built-in way to do this, nor did web searches reveal other solutions, so here's what I came up with:

# lib/omniauth-adwords-oauth2.rb
require "omniauth-google-oauth2"

class AdwordsOauth2 < OmniAuth::Strategies::GoogleOauth2
  option :name, 'adwords_oauth2'
end

# config/initializers/omniauth.rb
require "omniauth-adwords-oauth2"

Rails.application.config.middleware.use OmniAuth::Builder do
  provider :adwords_oauth2, ENV['GOOGLE_API_CLIENT_ID'], ENV['GOOGLE_API_CLIENT_SECRET'], {
    scope: "https://adwords.google.com/api/adwords/" # could also be adwords-sandbox.google.com
  }
end

OmniAuth.config.logger = Rails.logger

Monday, August 6, 2012

Sidekiq, Ruby's non-atomic require, and multithreading

I just moved the workers for my newest project from Resque to Sidekiq, and it's working beautifully. I'm saving memory and enjoying a performance boost. This is part of my overall goal to use multi-threading in Ruby as much as possible (I'm also moving the web component of this project from Unicorn to Puma). I've never liked EventMachine and I've been greatly influenced to favor threads over processes and over EM by David Bryant Copeland, who gave this excellent talk about multithreaded Ruby at RubyNation 2012.

But today I stumbled onto an interesting, tricky bug, which exemplifies one of the downsides of multithreaded programming.

Background

I had created a new worker that used the mechanize gem for webscraping. The worker was complicated and used several different classes to get the work done. I had to require "mechanize" in a few different files, mainly so I could reference Mechanize::Error in a couple of exception handlers. This was super well-tested code that worked great on my dev machine, but things went to hell in production.

The Bug

This worker would just get stuck with zero information in the log files - the whole thread would just deadlock. Sidekiq has a TTIN signal handler that helps you figure out where your code is stuck, but unfortunately the workers run on Heroku, and Heroku does not let you send arbitrary signals to your processes, so I couldn't use it. Instead I had to insert a bunch of logging probes in my code to see exactly what line of code was causing things to freeze.

It turns out my code was freezing on a require statement, where I required the first class which required the mechanize gem. I remembered that in Ruby require is not atomic, so I was able to zero in on the problem.

The Solution

Once I moved the require "mechanize" statement into an initialization step, before my workers were loaded, everything performed beautifully.
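The shape of the fix, sketched with standard-library files standing in for mechanize so it runs anywhere (the worker body is invented for illustration):

```ruby
# Load every library the workers need on the main thread,
# before any worker threads exist.
require "json"
require "set"

results = Queue.new

threads = 3.times.map do |n|
  Thread.new do
    # No require calls in here: the constants are already loaded,
    # so threads never contend over loading the same file.
    results << JSON.generate("worker" => n, "seen" => Set.new([n]).to_a)
  end
end

threads.each(&:join)
```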

Lesson Learned

Quoting this Stack Overflow answer, because of the potential for require to cause deadlocks like this:

"require everything you need before starting a thread if there's any potential for deadlock in your app."

Tuesday, July 17, 2012

Video of Coding for Uncertainty at Ruby Nation

Here's the video from my Coding for Uncertainty talk at Ruby Nation earlier this year, where I talk about techniques I use to build durable, changeable Ruby code:

Thursday, June 21, 2012

dependency injection minimal (dim) gem v1.2 released

I just released v1.2 of the dependency injection minimal gem (dim). Originally this project was just a gemification of Jim Weirich's example code, but v1.2 marks my first meaningful feature addition.

dim aims to provide a simple, Ruby-esque way to handle dependency injection. I find myself using it in all of my projects as a way to consolidate into one file all of the configuration that my apps need.

I noticed a common pattern, though. I had started to use dim to encapsulate ENV variables, so that my code would not need to know the source of a configuration variable (usually an API key or a URI for accessing a third-party service); in the test environment, the source might be a hard-coded literal string, but in production it might come from an actual ENV variable.

So I added a register_env method to complement the register method that Jim originally added. Below is an example of how I'm using it.

# attempts to read ENV["API_PASSWORD"], otherwise makes sure that the parent container has
# a service named api_password registered
ServerContainer.register_env(:api_password)

The above code will fail if you don't have ENV["API_PASSWORD"] defined, or if ServerContainer doesn't have a parent container with :api_password set. Typically I'm using a YAML file to populate ServerContainer's parent with sensitive values that I want to have in my development environment (and then I make sure to ignore that file in source control).
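The lookup order described above can be sketched like this. SketchContainer is a hypothetical illustration of the behavior, not dim's source, and the fetch and registered? methods are assumptions:

```ruby
# Hypothetical sketch of "ENV first, then parent container, else fail".
class SketchContainer
  def initialize(parent = nil)
    @parent = parent
    @services = {}
  end

  def register(name, &block)
    @services[name] = block
  end

  def registered?(name)
    @services.key?(name) || (!@parent.nil? && @parent.registered?(name))
  end

  def fetch(name)
    return @services[name].call(self) if @services.key?(name)
    raise KeyError, "unknown service: #{name}" if @parent.nil?

    @parent.fetch(name)
  end

  def register_env(name)
    value = ENV[name.to_s.upcase]
    parent_has_it = !@parent.nil? && @parent.registered?(name)

    if value
      register(name) { value }
    elsif !parent_has_it
      raise ArgumentError, "#{name} is not in ENV or a parent container"
    end
  end
end

# The parent plays the role of the YAML-populated container.
parent = SketchContainer.new
parent.register(:api_password) { "from-yaml" }
child = SketchContainer.new(parent)

ENV.delete("API_PASSWORD")
child.register_env(:api_password)
from_parent = child.fetch(:api_password)

ENV["API_PASSWORD"] = "from-env"
child.register_env(:api_password)
from_env = child.fetch(:api_password)
```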

See the docs or the source code for more details.

Thursday, June 3, 2010

RailsConf Baltimore Things To See and Do

I love Baltimore and I love Ruby, so like my fellow Bmoreonrails colleagues, I am super-psyched for RailsConf 2010 being held at the Baltimore Convention Center!

If you're new to our city, here's a short list of things to see and do, optimized for people staying near the convention center (e.g. this is not necessarily the list I would give you if you were staying with me, had ready access to a car, etc.)

SOUTH

Your best bet in this direction is the Federal Hill neighborhood which has a lot of activity, bars, and nice restaurants. During the day you can climb to the top of Federal Hill itself, and then visit the incredibly awesome American Visionary Art Museum: this is the ideal art museum for hackers, because it celebrates self-taught artists.

On Wednesday or Thursday evening I recommend checking out Illusions Magic Bar & Lounge. If you're around on Friday or Saturday night, there's an awesome magic show featuring an upside-down straitjacket escape. I guarantee there's nothing like this place back where you come from!

(not actually guaranteed)

WEST

Right next to the convention center you'll find Geppi's Entertainment Museum: all about pop culture, comic books, toys, etc.

Paul Barry has organized a RailsConf group attendance package for the Yankees vs. Orioles game on Wednesday night. Camden Yards is a very nice ballpark, so get out there and enjoy it before we get conned into building another one a few years from now! You'll get free beer and hotdogs!

NORTH

On Monday, Bmoreonrails is having its monthly Pub Night at Pratt Street Ale House, right across from the convention center, a very fun place to hang out.

Walking farther north, Maryland's signature dish is the crab cake, and one of the best versions is made at the Faidley's stand in Lexington Market which is worth visiting to soak up all the vibrant activity of a city market.

If you have a car or can spring for a cab ride, definitely visit Mt. Vernon: a beautiful neighborhood with brownstone homes and great restaurants. The Brewer's Art bar was named "Best Bar in America" by Esquire, and they brew an excellent ale called Resurrection.

If you have a car, you may also want to check out the Hampden neighborhood which tends to get a lot of attention by people writing articles like this one - John Waters recently described it as a mix of "hipster culture and redneck culture".

EAST

The Inner Harbor is our ubertouristy area, but it's very nice if you've never been there. Besides our great National Aquarium, you can catch a water taxi from there to the Fells Point neighborhood farther east, which has a ton of bars and restaurants and cool shops and coffee shops.

If you have access to a car or taxi, after stopping in Fells Point you might want to visit Canton, one of the city's technology hubs. The Beehive coworking facility is well-worth a visit if you have time to kill before or after the conference.

RUNNING

There is a red brick path going all the way around the harbor that makes for a great running route. If you're a runner also check out the Bmoreonrails recommended running route.

THE WIRE

Did you hear about this gritty, realistic, little-known but super amazing cop show on HBO a few years back? It was so much more than a cop show. It was called The Wire and it was a tremendous work of art, but also very entertaining. If you have heard of it you may want to visit some of the shooting locations which I have catalogued previously in "The Wire tour".

MORE IDEAS

Check out the recent New York Times sightseeing guide and our alt-weekly's guide The Baltimanual.

Monday, May 31, 2010

Real World Ruby and Cassandra

Introduction

At OtherInbox I recently built a QA system using the Cassandra datastore. I really like this technology and so far I would recommend it, but the learning curve for Rubyists is still pretty high. There are some good examples online (especially the canonical article by Evan Weaver) but nothing showing more intermediate, real-world usage. Hence, this article.

The system requires us to log millions of events per day. I could have built it using a traditional relational database like MySQL (which we use for the main application), but these factors led me to consider a NoSQL database:

We're only interested in large patterns in data, so we don't need 100% ACID assurance that every single write will succeed. The system would be useful to us even if it only caught 80% of the events.
Since we perform these actions millions of times per day, write speed is the prime consideration.
The QA reports are generated offline, once per day. We don't mind if reads happen more slowly, or if we need to do some extra programming to build reports because we can't use SQL.
The sheer volume of events made me less excited about punishing a MySQL table. We already do a lot of extra work, via sharding, to keep MySQL healthy while it performs OtherInbox's main functions.
I was curious to see how a schema-less datastore would change the way I solved programming problems.

I will assume you have read Evan's article as well as the very useful 'WTF is a supercolumn?'. You may also want to read through the Twissandra Python code as well as the tests for Evan's cassandra gem.

I've been playing with the technology only for a few months so I'm sure I'll need to correct some parts of this article as I learn more - please comment if something is unclear or incorrect.

It's Sorta Like an Ordered Multi-Dimensional Hash

Rubyists can think of Cassandra like a hash of ordered hashes, or a hash of ordered hashes of ordered hashes, requiring up-front planning to use. You don't have to specify your schema, but you do need to tell Cassandra how your keys and columns will be organized. That affects how the data is stored on disk and how you'll read the data later.

Since the columns are stored in sorted order, Cassandra can answer queries very quickly (which is why it's in use at sites like Facebook and Digg). I had to change the way I built keys and column names several times before I got it right. Anytime you change how the data is stored on disk you need to restart Cassandra.

Columns and ColumnFamilies

ColumnFamilies store a set of columns (which you can think of as key-value pairs) partitioned by a row key. The column names can be arbitrary strings, long integers, or UUIDs; at start time you have to tell Cassandra how to sort the column names but beyond that you have complete freedom to create column names that will be useful to you.

If each row has the same data, you might think of it like this:

{ user_id => {'email'=>'sarah@example.com', 'last_name'=>'Jones' }}

Where user_id is the row key (which Cassandra hashes and uses to determine which nodes should store the columns for this piece of data), and 'email' and 'last_name' are column names. Using the gem, your code would look like:

@cassandra.insert(:Users,user_id,{ 'email'=>'sarah@example.com', 'last_name' => 'Jones'})

@cassandra.get(:Users,user_id)

But you can also store useful data in the column names. This is useful when there are many columns and you want to be able to select a particular range of columns. For the QA system, we page through a large range of columns within each key, and assigning smart column names helps this go faster. The data might look like this:

{ user_id => { UUID => "Hey Sarah here's a question for you..", UUID2 => 'When are we going to meet up?' }}

@cassandra.insert(:Users,user_id, { UUID.new => "Hey Sarah here's a question for you.."})

@cassandra.insert(:Users,user_id, { UUID.new => 'When are we going to meet up?' })

In this case we are storing messages for a particular user, and we're using unique identifiers for columns that we can query later in ranges. If your data has a temporal component you might use time-based UUIDs (where the most significant bits are a timestamp and the less significant bits are entropy) so that you query only columns that fall within a particular range of times.
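To illustrate why sortable column names make range queries cheap, here is a plain-Ruby model of one row. It does not use the cassandra gem; the timestamp-prefixed names stand in for time-based UUIDs, and the third message is invented sample data.

```ruby
# One row's columns, keyed by a sortable, time-prefixed name.
columns = {
  "2010-05-31T14:05:00Z/9f2c" => "When are we going to meet up?",
  "2010-05-31T09:00:00Z/1a7e" => "Hey Sarah here's a question for you..",
  "2010-06-01T08:30:00Z/c3d0" => "Did you get my last message?"
}

# Cassandra keeps columns sorted by name; sorting the hash mimics that.
sorted = columns.sort.to_h

# Select every column whose name falls in [start, finish).
def column_range(sorted_columns, start, finish)
  sorted_columns.select { |name, _value| name >= start && name < finish }
end

may_31 = column_range(sorted, "2010-05-31", "2010-06-01")
```

Because the timestamp leads the name, "all messages from May 31" becomes a contiguous slice of the sorted columns rather than a scan.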

You do need to tell Cassandra how your column names should be sorted on disk, which happens in the configuration file for each ColumnFamily:

<Keyspaces>
  <Keyspace Name="OtherInbox">
    <ColumnFamily CompareWith="LexicalUUIDType" Name="Users"/>
  </Keyspace>
</Keyspaces>

In the first example, I'd use "LongType" since user_id is probably an integer. In the second example I'd use "LexicalUUIDType", as shown, or "TimeUUIDType".

SuperColumnFamilies

For more structured, nested data, you should consider using a SuperColumnFamily, which lets you store columns of columns. Examples:

{ user_id => { 'details' => { 'email' => 'sarah@example.com', 'last_name' => 'Jones'}, 
                    'preferences' => { 'expert_controls' => 'true' }}}

@cassandra.insert(:Users,user_id,  { 'details' => { 'email' => 'sarah@example.com', 'last_name' => 'Jones'}, 
                    'preferences' => { 'expert_controls' => 'true' }})

@cassandra.get(:Users,user_id,'details')

@cassandra.get(:Users,user_id,'preferences','expert_controls')

'details' and 'preferences' are super columns containing columns 'email', 'last_name', and 'expert_controls'. Just as with regular column families, you can encode arbitrary data in the column names, or just set them to UUIDs. When you define a SuperColumnFamily, you tell Cassandra how to sort and store the column names and the subcolumn names:

<ColumnFamily ColumnType="Super" CompareSubcolumnsWith="UTF8Type" CompareWith="LongType" Name="Users"/>

One key consideration: as of this writing the current version of Cassandra (0.6.2) does not do any indexing of the subcolumns, which means when you load a supercolumn, all of its subcolumns are loaded into memory. If you expect to have more than a few thousand subcolumns, you would be better off using a regular column family and overloading the row keys and column names with your nested data. In our example, your column names could be something like "sarah@example.com/Jones/true", and it would be up to you to split the data on retrieval.

There is an open ticket to address this in a future release.
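The composite column name workaround described above could look something like this. It is only a sketch: `pack_column_name` and `unpack_column_name` are hypothetical helpers, and it assumes none of the values contain the separator.

```ruby
SEPARATOR = '/'

# Joins the nested values into a single column name.
def pack_column_name(email, last_name, expert_controls)
  [email, last_name, expert_controls].join(SEPARATOR)
end

# Splits a composite column name back into its parts on retrieval.
def unpack_column_name(name)
  email, last_name, expert_controls = name.split(SEPARATOR, 3)
  { 'email' => email,
    'last_name' => last_name,
    'expert_controls' => expert_controls == 'true' }
end

name = pack_column_name('sarah@example.com', 'Jones', true)
puts name
puts unpack_column_name(name).inspect
```

The cost of this approach is that every reader and writer has to agree on the field order and the separator.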

Key Names vs. Column Names

For the QA system, everything we keep track of is associated with a timestamp. The most natural partitioning of the data seems to be by day, hour, and whether we synced the message or not. The reporting system runs once per day, iterating over each hour and each synced state for the previous day. This gives us rows that are small enough for Cassandra to easily distribute across nodes without loading up too many columns in any one row. Our keys look like this:

key = "#{time.strftime("%Y-%m-%d-%H")}*#{is_synced}"
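As a rough illustration of how the reporting job could enumerate one day's rows, the sketch below builds every key for a given day. `row_keys_for_day` is a hypothetical helper, and it assumes `is_synced` is a plain boolean.

```ruby
# Hypothetical helper: builds the row keys for one day, one per hour and
# per synced state (24 hours x 2 states = 48 keys).
def row_keys_for_day(day)
  (0..23).flat_map do |hour|
    time = Time.utc(day.year, day.month, day.day, hour)
    [true, false].map do |is_synced|
      "#{time.strftime('%Y-%m-%d-%H')}*#{is_synced}"
    end
  end
end

keys = row_keys_for_day(Time.utc(2010, 5, 1))
puts keys.size
puts keys.first
puts keys.last
```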

Since I only need to track 4 or 5 properties about each sync/nosync decision, I decided to use supercolumns. Actually, I first used the composite column approach described above, but I found supercolumns made for better-looking, slightly more efficient code. The columns look like this:

{ 'example.com*6f1ed002ab5595859014ebf0951522d9' => { 'from_address' => 'marketing@example.com', 'is_system_merchant' => true }}

Each supercolumn name is a composite of the domain name of the message we examined and an MD5 hash of the message header. This ensures we don't store duplicate records if the same message gets processed twice. It also means I can drill down on specific senders in the future if needed by using range queries with partial column names, as shown next.
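Here is a small sketch of why that naming scheme prevents duplicates, using a plain Hash in place of a Cassandra row. The helper `supercolumn_name` and the sample header are made up for illustration.

```ruby
require 'digest/md5'

# Hypothetical helper: domain, an asterisk, then the MD5 hex digest of the
# raw message header.
def supercolumn_name(domain, raw_header)
  "#{domain}*#{Digest::MD5.hexdigest(raw_header)}"
end

header = "From: marketing@example.com\r\nSubject: Sale!\r\n"
row = {}

# Processing the same message twice produces the same name, so the second
# write overwrites the first instead of adding a duplicate record.
2.times do
  row[supercolumn_name('example.com', header)] = {
    'from_address' => 'marketing@example.com',
    'is_system_merchant' => true
  }
end

puts row.size
```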

Range Queries

I don't know the optimal number of columns that Cassandra can serve up in one request, but in our system one row (meaning one hour's worth of sync and nonsync events) could comprise tens of thousands of columns, more than we would want to request at once. But since the columns are stored in sorted order, it's easy to fetch them with a range query. Here is a super simplified version of what we do:

I divide up the previous day into 48 keys (synced events for each hour and nonsynced events for each hour).
I then thread these requests, 24 at a time. According to the docs, "a good rule of thumb is 2 concurrent reads per processor core", so 3 machines times four cores times 2 reads per core = 24 concurrent reads. Each node has its ConcurrentRead property set to 8. I may not be doing the math correctly so feel free to chime in with a correcting comment.
Each thread executes the following (simplified) code:
```
start = ''
loop do
  # count is completely arbitrary, need to experiment with what's best
  columns = cass.get(:MessageSyncing,key,:start => start,:count => 2500)
  break if columns.empty?

  columns.each do |column_name,column_values|
    # supercolumn names look like "example.com*<md5>", so split on the asterisk
    fqdn = column_name.split('*')[0]
    from_address = column_values['from_address']
    is_system_merchant = column_values['is_system_merchant']
    # increment counters/manipulate data here
  end

  # succ returns a new string, leaving the (possibly frozen) hash key untouched
  start = columns.keys.last.succ
end
```
This code uses a range query to page through all the columns within one row. Since the columns are sorted by UTF8Type, I can just increment the last column name I received and know that I'll get the next chunk of columns. You can also query with partial range keys, so that if I wanted to see all of the data for a domain, I could range query with the column start as "example.com*".
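The paging pattern can be tried without a cluster. In the sketch below, `FakeRow` is an in-memory stand-in and not the cassandra gem's API, and `each_column` is a hypothetical helper. One deliberate difference from the loop above: `String#succ` can jump past names when it carries (for example `"19".succ` is `"20"`, which sorts after `"1a"`), so this sketch appends a NUL byte to the last name instead, which gives the smallest string that sorts strictly after it.

```ruby
# In-memory stand-in for one Cassandra row (NOT the cassandra gem's API).
class FakeRow
  def initialize(columns)
    @columns = columns.sort_by { |name, _| name }
  end

  # Returns up to `count` columns whose names are >= start, in sorted order.
  def slice(start, count)
    @columns.select { |name, _| name >= start }.first(count).to_h
  end
end

# Hypothetical helper: yields every column in the row, one page at a time.
def each_column(row, page_size)
  start = ''
  loop do
    page = row.slice(start, page_size)
    break if page.empty?

    page.each { |name, values| yield name, values }
    # Smallest string that sorts strictly after the last name we saw.
    start = page.keys.last + "\0"
  end
end

row = FakeRow.new(
  'example.com*a20' => { 'from_address' => 'c@example.com' },
  'example.com*a19' => { 'from_address' => 'a@example.com' },
  'example.com*a1a' => { 'from_address' => 'b@example.com' }
)

seen = []
each_column(row, 1) { |name, _values| seen << name }
puts seen.inspect
```

Another option is to pass the last name itself as the next start and drop the first column of each later page.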

I also have some pretty complicated code that collates the results of those 48 threads, which are themselves composites of all the range queries I ran within each row. I realized after writing it that I had essentially re-implemented my own half-assed map-reduce.
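A stripped-down version of that fan-out and collation might look like the following. This is a sketch, not the production code: `collate` is a hypothetical helper, the row keys are placeholders, and the block stands in for the per-row paging and counting.

```ruby
# Hypothetical helper: runs the worker block for each key, at most
# `concurrency` threads at a time, and sums the counters each thread returns.
def collate(keys, concurrency, &worker)
  totals = Hash.new(0)
  keys.each_slice(concurrency) do |batch|
    threads = batch.map { |key| Thread.new(key) { |k| worker.call(k) } }
    # Thread#value waits for the thread and returns its block's result, so
    # the merge happens on the calling thread and needs no lock.
    threads.each do |thread|
      thread.value.each { |name, count| totals[name] += count }
    end
  end
  totals
end

# Placeholder keys: 24 hours x 2 synced states = 48 rows.
keys = (0..23).flat_map { |hour| ["hour-#{hour}*true", "hour-#{hour}*false"] }

totals = collate(keys, 24) do |key|
  # A real worker would page through the row here and count events.
  key.end_with?('*true') ? { 'synced' => 1 } : { 'unsynced' => 1 }
end

puts totals.inspect
```

Each thread returns its own Hash of counters (the "map" step) and the caller sums them (the "reduce" step).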

Happily, while I was implementing this code, Cassandra 0.6 came out, which includes built-in support for Hadoop. Cassandra has a Pig load function, so it should eventually be possible to replace the above code and my half-assed map-reduce with something much more elegant, maybe just a few lines of Pig. For now, this works great. Of course you don't need any of this if you aren't using your datastore for reporting.

Notes on EC2

I don't have enough experience yet to recommend whether you should use EC2 or not. I originally built this to use one xlarge instance, but I found that it could not keep up with the network load. There were a lot of timeouts from the nodes reporting to Cassandra. As soon as I split it into three smaller Cassandra nodes, the timeouts went way down. It might even make more sense to split into six small nodes.

Each node has two EBS volumes, a smaller one for the commit log and a large one for the data. The commit log is append-only and is used to replay writes in case Cassandra crashes before the data in memory can be written to the data disk. Keeping them separate improves throughput so one operation doesn't block the other. It might make more sense to use an ephemeral store for the commitlog; I haven't had time to explore.

I definitely recommend following the documentation's advice: use at least three nodes in production.

Notes on Adding Nodes

Adding each node was easy, and that's one of Cassandra's key features. All you have to do is tell the new node the address of at least one other node and set its AutoBootstrap value to true.

The only problem I had was that the first server was getting hammered so hard by all these requests, many of which were timing out, that it took a while for the second node to finish gossiping with the first node and start bootstrapping.

CassandraObject

Michael Koziarski wrote a cool ActiveModel interface to Cassandra called CassandraObject. I haven't played with it yet, but it offers a higher-level abstraction for accessing data beyond just "a hash of hashes". He's presenting it at RailsConf this year and I'll definitely be in attendance for that talk.

Further Reading

I found these articles/sites particularly helpful:

Incidentally, Stack Overflow is starting to become the site I go to first when I'm searching for the solution to a technical question. Google searches return at least 50% garbage or duplicate mailing list content for a lot of technical topics. That makes me more interested in the future of niche/vertical search engines. It is definitely possible to out-Google Google within a niche.

Wednesday, February 17, 2010

Using Ruby's Queue class to manage inter-thread communication

Bob Potter just showed me an awesome way to use Ruby's Queue class to communicate between two threads. I try to avoid multi-threaded programming as much as possible since I don't feel super confident with concurrency (at least with a language like Ruby). It can add a lot of complexity and headaches even when you know what you're doing, but there are several cases in OtherInbox where it makes a lot of sense.

In this case, I needed to log a very frequently-occurring event to SimpleDB. We don't need these logs to be 100% accurate, and the data gets processed offline for reports, so there's no need to keep the data in our main relational database. My first naive version of this code pushed a new item to SimpleDB each time the event occurred, but this ended up chewing up a lot of memory. To avoid slowing down the main process, I had been handing these one-off SimpleDB calls to an EventMachine deferred block, which caused references to all the objects I was using to generate the SimpleDB items to stick around for too long.

That's when Bob suggested I use a Queue to batch the work into chunks of 25 (the SimpleDB BatchPutAttributes limit) on a separate standard Ruby thread. The resulting code looks like this (simplified for readability):

require 'thread' # provides Queue on older Rubies

class Tracker
  def track_event(a,b,c)
    @queue ||= Queue.new
    start_pushing_thread
    hash = { :a => a, :b => b, :c => c }
    @queue << hash
  end

  private

  def start_pushing_thread
    return if @pushing_thread
    @pushing_thread = Thread.new do
      items = []

      loop do
        # pop will block until there are queue items available
        items << @queue.pop

        # flush once a full batch of 25 has accumulated
        if items.size > 24
          SdbWrapper::batch_log_to_sdb(items,'tracking_domain')
          items.clear
        end
      end
    end
  end

end

Once 25 items have accumulated in the array, the thread calls code that we wrote to handle batch puts, then clears the array and waits for more.

We use an inter-process/inter-machine version of this paradigm all the time at OtherInbox: we make heavy use of SQS queues to hand off work from one box to another, but that wouldn't have helped here because then we'd just be introducing yet another web service dependency. So if you're in a situation where you need to keep a high-volume process running quickly while handing off less important work in chunks, check out Queues!

Monday, January 18, 2010

Ruby for Startups at Lone Star Ruby Conference 2009

Recently Confreaks posted video from the Lone Star Ruby Conference talk I gave in August, called Ruby for Startups.

I've refined the talk a few times since then but it's still a good representation of my current state of mind about what I've learned about Ruby and about software design while building OtherInbox. Check it out if you'd like to see more! The slides are posted here.

Wednesday, November 11, 2009

Lean startup tools for Rails apps

A few months ago I was invited to dinner with the Geeks on a Plane crew when they stopped in Washington, and had the opportunity to meet one of my heroes, Eric Ries, author of the Startup Lessons Learned blog. His descriptions of lean startup techniques and philosophies have had a big influence on the way I design and build software.

Eric asked me what specific tools Rails developers could use when building a lean startup. Here's a revised version of what I emailed in response, garnered while building OtherInbox and other Rails apps:

First and foremost, I find Rails itself to be very useful for building a product for a lean startup; the expressiveness of Ruby and the conventions of Rails help developers more quickly build a minimum viable product. I'm still able to surprise people who have worked with me for a while with how quickly I can turn their feature ideas into something they can play with and show customers.

For continuous integration, like many Rails shops, we use CruiseControl.rb - it's easy to install and customize, but I don't think people are especially enamored of it. But it works. Two very promising alternatives to rolling your own CI server are Devver and RunCodeRun. I've tried both but our app has grown too complex for either of them (we use a lot of gems and have a pretty customized test environment). For Github users, RunCodeRun exposes a post-commit hook so the tests run automatically after every commit which is pretty useful.

Speaking of Github, I'd love for them to implement pre-commit hooks, because then you could prevent developers from making commits any time the build is broken, which prevents people from ignoring problems in a large code base and just tunneling on building their one little feature. Right now if you want to implement this, you need to set up your own source control repository that can talk to the continuous integration system.

To monitor exceptions, we use the excellent Hoptoad app. This tool adds a lot of context to bug discussions, because nonprogrammers can reference the exact Hoptoad URL containing all of the information a programmer needs to fix a bug (including backtraces). Because it has an RSS feed, one could also use "Unresolved Hoptoad Errors" as a metric influencing the continuous deployment system. If a release goes out and suddenly there are new errors, that's a sign that the batch was bad and needs to be reverted. There's a drop-in Rails plugin but this service could be used with any language.

There are two A/B testing plugins for Rails: 7 Minute ABs and A/Bingo. I haven't used either and am not sure they are any better than rolling your own. Update: Charlie Park in the comments mentions the new plugin Vanity which came out just after I wrote this.

There are several good unit testing frameworks for Rails, each with its own devotees. What comes with Ruby and Rails is plenty good enough, but some people like the greater expressiveness afforded by things like Rspec and Shoulda. For my next product I'd like to try Shoulda because it uses Ruby's built-in test facilities (making for one less dependency to worry about than with Rspec) and because I really like the examples I've seen on the Thoughtbot blog.

We love Cucumber for higher level testing. When coupled with Webrat, which simulates a real browser, you get a pretty nice mechanism for exercising the entire app in a test environment before deployment. Webrat even has a Selenium adapter, so you actually can run your tests inside of a browser to make sure all is well before deploying a change.

This isn't Rails-specific, but we use SimpleDB for dumping a lot of important raw data, like performance measurements, which we then turn into metrics. This keeps us from having to hit our database too much. The RightAWS gem makes interacting with SimpleDB pretty easy.

For our metrics we mainly use Graphite, the graphing and analysis system open-sourced last year by Orbitz. Our whole team is falling more in love with it every day.

New Relic has a nice plugin for Rails that reports all kinds of useful data to their service, which could become part of a cluster immune system for Rails apps. They measure average responsiveness and compute an aggregate statistic called Apdex which can be a good indicator of site health (at minimum, any significant changes to Apdex should be investigated, and they will email you alerts about such changes).

Capistrano is still the standard way most developers deploy Rails apps (but also check out Vlad the Deployer), but I think there's an opportunity for someone to build a Ruby library to facilitate continuous deployment that would work better than Capistrano. It would definitely be built using Rake, like Vlad is.

Update: I haven't used Heroku, but as Ryan points out in the comments it definitely belongs on this list as a compelling, get-up-and-running fast deployment platform.

Friday, August 28, 2009

Ruby for Startups Talk at Lone Star Ruby Conference

Today I gave a talk about my experiences writing the code for OtherInbox, called Ruby for Startups. The slides are below. Hope you find them interesting!

The design principles from the first slide come from Russ Olsen's book, Design Patterns in Ruby (and originally from Design Patterns: Elements of Reusable Object-Oriented Software).

Saturday, November 15, 2008

random_data v1.5 Released

The random_data gem provides methods for generating random test data including names, mailing addresses, dates, phone numbers, e-mail addresses, and text. v1.5 includes a primitive Markov text generator and an array "roulette" function.

Thanks to Hugh Sasse for contributing code for the new features!

Tuesday, September 16, 2008

Video of my talk at Lone Star Ruby Conference

Confreaks has posted my entire Ruby in the Cloud talk from Lone Star Ruby Conference. Check it out and let me know what you think!

Here's a 30 second synopsis filmed by Gregg Pollack: