Mike Subelsky

Wednesday, August 11, 2010

Why You Might Not Want That Cybersecurity Job

Update: I receive occasional inquiries for cybersecurity career advice because of this post. I haven't worked in this field in years, so I recommend you read this advice if you're trying to get a cybersecurity job.

Cybersecurity, while offering lucrative job opportunities, might not be an ultimately rewarding career for Maryland technologists. I worked in this sector for about eight years as a military officer, government civilian, and government contractor in a variety of different roles, and here's what I want to say about it.

Maryland's business press, government officials, and various tech organizations have lately been enthusiastically banging the gong for cybersecurity. I can appreciate why - there's a lot of money at stake, and a lot of it comes from Maryland's foremost benefactor, the federal government. This is a recession-proof, guaranteed-to-grow industry, and Maryland is already home to many successful cybersecurity companies like Sourcefire. The government and private companies employ many thousands of people and contribute many millions of dollars to our tax base.

So it makes sense for our government to be pursuing these opportunities, but does it make sense for you, Maryland hacker? Here are some things to consider; these are obviously generalizations extrapolated from my experience. Feel free to leave comments if you feel this is a gross distortion.

Cyber defense is often the opposite of a creative activity; in many of these jobs you're going to find yourself acting as an enforcer, a mere gatekeeper. You'll be telling the creative people in your organization all the things they can't do or aren't allowed to have. Often you'll be restricting them not for policy reasons but because it's too hard to figure out how to allow what they want within the regime you are enforcing, or because it's just easier to say "no". (Naturally, this does not apply if you work for a company that builds the tools the enforcers use.)
In classified settings, you are severely restricted in the sources and kinds of technologies you use. You'll be leaving your smartphone and your iPad in your car or in a locker outside the SCIF. You won't have admin permissions on the machine you're working on. Forget installing Chrome with the latest extensions; you'll be lucky to get version 2 of Firefox! Or you might not have access to the Internet at all! Also, forget about telecommuting or riding your bike to work; your job will be in a well-defended federal facility or an anonymous office park in the suburbs.
Because cybersecurity is so tied to "the enterprise", you'll almost certainly be living in Microsoft land, which may or may not be a problem for you.
Many of the government organizations in this field are gigantic, top-down, and super-hierarchical. You will be made to turn as a soulless cog in a giant machine. There are plenty of smaller, more enlightened companies out there, of course, but the highest-paying jobs will probably be offered by big contractors.
The federal government has crazy monopsony power over this sector. Besides the usual and expected bureaucratic games you'll endure, if you work for a private company that does much business with the government you are going to see some brutally depressing market distortions that arise from this monopsony. You may find yourself working on a product or a program that nobody in your client agency cares about, or wants to succeed, except that they need to spend up their budget dollars so Congress doesn't take the money away next year. Or you might find your job in limbo because the sales cycle for getting government contracts is so long, and it can take forever for the company to actually have money in hand. There's some truth to the myths about the Pentagon spending $10K on toilet seats - it probably does cost about $9950 in sales salaries to sell a $50 toilet seat to the Department of Defense!

I was well-paid as a cybersecurity analyst, and often I did enjoy the work, and parts of it involved amazingly cool, James-Bond-like exploits. But those are the reasons I ultimately chose to leave. Now I am working on my own startup. My job is less glamorous (I'm not "saving the world" every day) but because my individual contribution counts thousands of times more in a small company which I own a piece of, and because every second and every dollar counts, it's an infinitely more satisfying way to spend my time. My labors are simply more meaningful. So that's what I wanted you to know.

UPDATE 8/16/10: Please check out @NetSecGuy's post where he further elaborates on these issues.

POSTSCRIPT FOR MARYLAND GOVERNMENT AND BUSINESS LEADERS

I applaud you for positioning the state to take advantage of the "cyber doom boom". I'm sure it will help many of my fellow citizens in the short term. But I wonder how much wealth you think cybersecurity is ultimately going to create in Maryland, especially if it accrues to big consulting companies like Booz-Allen that aren't even based here. Also, what's going to happen when this sector matures, when Internet security gets better, and spending declines? Who's going to fill up those office parks and abandoned SCIFs?

I implore you not to neglect other parts of Maryland's Internet tech economy, because it's product companies like Advertising.com, BillMeLater, Millennial Media, Localist, Ipiqi, Common Curriculum, Figure53, Replyz, Deconstruct Media, and a bunch of others I can't think of right now that are building a new, sustainable post-industrial base in our state.

Thursday, June 3, 2010

RailsConf Baltimore Things To See and Do

I love Baltimore and I love Ruby, so like my fellow Bmoreonrails colleagues, I am super-psyched for RailsConf 2010 being held at the Baltimore Convention Center!

If you're new to our city, here's a short list of things to see and do, optimized for people staying near the convention center (i.e., this is not necessarily the list I would give you if you were staying with me, had ready access to a car, etc.).

SOUTH

Your best bet in this direction is the Federal Hill neighborhood which has a lot of activity, bars, and nice restaurants. During the day you can climb to the top of Federal Hill itself, and then visit the incredibly awesome American Visionary Art Museum: this is the ideal art museum for hackers, because it celebrates self-taught artists.

On Wednesday or Thursday evening I recommend checking out Illusions Magic Bar & Lounge. If you're around on Friday or Saturday night, there's an awesome magic show featuring an upside-down straitjacket escape. I guarantee there's nothing like this place back where you come from!

(not actually guaranteed)

WEST

Right next to the convention center you'll find Geppi's Entertainment Museum: all about pop culture, comic books, toys, etc.

Paul Barry has organized a RailsConf group attendance package for the Yankees vs. Orioles game on Wednesday night. Camden Yards is a very nice ballpark, so get out there and enjoy it before we get conned into building another one a few years from now! You'll get free beer and hotdogs!

NORTH

On Monday, Bmoreonrails is having its monthly Pub Night at Pratt Street Ale House, right across from the convention center, a very fun place to hang out.

Walking farther north, Maryland's signature dish is the crab cake, and one of the best versions is made at the Faidley's stand in Lexington Market which is worth visiting to soak up all the vibrant activity of a city market.

If you have a car or can spring for a cab ride, definitely visit Mt. Vernon: a beautiful neighborhood with brownstone homes and great restaurants. The Brewer's Art bar was named "Best Bar in America" by Esquire, and they brew an excellent ale called Resurrection.

If you have a car, you may also want to check out the Hampden neighborhood, which tends to get a lot of attention from people writing articles like this one - John Waters recently described it as a mix of "hipster culture and redneck culture".

EAST

The Inner Harbor is our ubertouristy area, but it's very nice if you've never been there. Besides our great National Aquarium, you can catch a water taxi from there to the Fells Point neighborhood farther east, which has a ton of bars and restaurants and cool shops and coffee shops.

If you have access to a car or taxi, after stopping in Fells Point you might want to visit Canton, one of the city's technology hubs. The Beehive coworking facility is well worth a visit if you have time to kill before or after the conference.

RUNNING

There is a red brick path going all the way around the harbor that makes for a great running route. If you're a runner also check out the Bmoreonrails recommended running route.

THE WIRE

Did you hear about this gritty, realistic, little-known but super amazing cop show on HBO a few years back? It was so much more than a cop show. It was called The Wire and it was a tremendous work of art, but also very entertaining. If you have heard of it you may want to visit some of the shooting locations which I have catalogued previously in "The Wire tour".

MORE IDEAS

Check out the recent New York Times sightseeing guide and our alt-weekly's guide The Baltimanual.

Monday, May 31, 2010

Real World Ruby and Cassandra

Introduction

At OtherInbox I recently built a QA system using the Cassandra datastore. I really like this technology and so far I would recommend it, but the learning curve for Rubyists is still pretty high. There are some good examples online (especially the canonical article by Evan Weaver) but nothing showing more intermediate, real-world usage. Hence, this article.

The system requires us to log millions of events per day. I could have built it using a traditional relational database like MySQL (which we use for the main application), but these factors led me to consider a NoSQL database:

We're only interested in large patterns in data, so we don't need 100% ACID assurance that every single write will succeed. The system would be useful to us even if it only caught 80% of the events.
Since we perform these actions millions of times per day, write speed is the prime consideration.
The QA reports are generated offline, once per day. We don't mind if reads happen more slowly, or if we need to do some extra programming to build reports because we can't use SQL.
The sheer volume of events made me less excited about punishing a MySQL table. We already do a lot of extra work, via sharding, to keep MySQL healthy while it performs OtherInbox's main functions.
I was curious to see how a schema-less datastore would change the way I solved programming problems.

I will assume you have read Evan's article as well as the very useful 'WTF is a supercolumn?'. You may also want to read through the Twissandra Python code as well as the tests for Evan's cassandra gem.

I've been playing with the technology only for a few months so I'm sure I'll need to correct some parts of this article as I learn more - please comment if something is unclear or incorrect.

It's Sorta Like an Ordered Multi-Dimensional Hash

Rubyists can think of Cassandra like a hash of ordered hashes, or a hash of ordered hashes of ordered hashes, requiring up-front planning to use. You don't have to specify your schema, but you do need to tell Cassandra how your keys and columns will be organized. That affects how the data is stored on disk and how you'll read the data later.

Since the columns are stored in sorted order, Cassandra can answer queries very quickly (which is why it's in use at sites like Facebook and Digg). I had to change the way I built keys and column names several times before I got it right. Anytime you change how the data is stored on disk you need to restart Cassandra.
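
To make the "hash of ordered hashes" picture concrete, here is a minimal plain-Ruby sketch. It does not talk to Cassandra at all; FakeColumnFamily is a made-up class that only imitates the idea of rows whose columns stay sorted by name.

```ruby
# Plain-Ruby stand-in for a ColumnFamily: row key => columns kept in sorted order.
# FakeColumnFamily is a hypothetical name used only for illustration.
class FakeColumnFamily
  def initialize
    @rows = Hash.new { |rows, key| rows[key] = {} }
  end

  # Merge new columns into the row, then re-sort by column name,
  # imitating how the columns of a row are kept in order.
  def insert(key, columns)
    merged = @rows[key].merge(columns)
    @rows[key] = merged.sort.to_h
  end

  # Return the sorted columns for a row, or an empty hash if the row is unknown.
  def get(key)
    @rows.fetch(key, {})
  end
end

users = FakeColumnFamily.new
users.insert(42, 'last_name' => 'Jones')
users.insert(42, 'email' => 'sarah@example.com')
p users.get(42).keys # => ["email", "last_name"]
```

The columns come back in name order regardless of insertion order, which is the property the range queries later in this article rely on.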

Columns and ColumnFamilies

ColumnFamilies store a set of columns (which you can think of as key-value pairs) partitioned by a row key. The column names can be arbitrary strings, long integers, or UUIDs; at start time you have to tell Cassandra how to sort the column names, but beyond that you have complete freedom to create column names that will be useful to you.

If each row has the same data, you might think of it like this:

{ user_id => {'email'=>'sarah@example.com', 'last_name'=>'Jones' }}

Where user_id is the row key (which Cassandra hashes and uses to determine which nodes should store the columns for this piece of data), and 'email' and 'last_name' are column names. Using the gem, your code would look like:

@cassandra.insert(:Users,user_id,{ 'email'=>'sarah@example.com', 'last_name' => 'Jones'})

@cassandra.get(:Users,user_id)

But you can also store useful data in the column names. This is useful when there are many columns and you want to be able to select a particular range of columns. For the QA system, we page through a large range of columns within each key, and assigning smart column names helps this go faster. The data might look like this:

{ user_id => { UUID => "Hey Sarah here's a question for you..", UUID2 => 'When are we going to meet up?' }}

@cassandra.insert(:Users,user_id, { UUID.new => "Hey Sarah here's a question for you.."})

@cassandra.insert(:Users,user_id, { UUID.new => 'When are we going to meet up?' })

In this case we are storing messages for a particular user, and we're using unique identifiers for columns that we can query later in ranges. If your data has a temporal component you might use time-based UUIDs (where the most significant bits are a timestamp and the less significant bits are entropy) so that you query only columns that fall within a particular range of times.
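
Here is a rough plain-Ruby sketch of why time-ordered column names make time-range queries cheap. It is an illustration under assumptions: time_ordered_name is a hypothetical helper that builds a sortable string from a zero-padded microsecond timestamp plus random hex. It is a stand-in for a time-based UUID, not the real UUID layout, and the hash below stands in for one row.

```ruby
require 'securerandom'

# Hypothetical helper: zero-padded microsecond timestamp plus random hex.
# A stand-in for a time-based UUID, not the actual UUID format.
def time_ordered_name(time)
  micros = (time.to_r * 1_000_000).to_i
  format('%020d-%s', micros, SecureRandom.hex(8))
end

base = Time.utc(2010, 5, 31, 12, 0, 0)
columns = {}
[0, 60, 120, 3600].each do |offset|
  columns[time_ordered_name(base + offset)] = "message sent at +#{offset}s"
end

# Keep only the columns written in the first two minutes, the way a
# range query would slice a row that is sorted by column name.
from = format('%020d', (base.to_r * 1_000_000).to_i)
to   = format('%020d', ((base + 120).to_r * 1_000_000).to_i)
recent = columns.select { |name, _| name >= from && name < to }
puts recent.values
```

Because the timestamp leads the name, comparing names is the same as comparing times, so a start and end bound are all that is needed to pick out a window.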

You do need to tell Cassandra how your column names should be sorted on disk, which happens in the configuration file for each ColumnFamily:

<Keyspaces>
  <Keyspace Name="OtherInbox">
    <ColumnFamily CompareWith="LexicalUUIDType" Name="Users"/>
  </Keyspace>
</Keyspaces>

In the first example, I'd use "UTF8Type", since the column names are plain strings like 'email' and 'last_name' (this setting governs how column names are sorted, not row keys). In the second example I'd use "LexicalUUIDType", as shown, or "TimeUUIDType".

SuperColumnFamilies

For more structured, nested data, you should consider using a SuperColumnFamily, which lets you store columns of columns. Examples:

{ user_id => { 'details' => { 'email' => 'sarah@example.com', 'last_name' => 'Jones'}, 
                    'preferences' => { 'expert_controls' => 'true' }}}

@cassandra.insert(:Users,user_id,  { 'details' => { 'email' => 'sarah@example.com', 'last_name' => 'Jones'}, 
                    'preferences' => { 'expert_controls' => 'true' }})

@cassandra.get(:Users,user_id,'details')

@cassandra.get(:Users,user_id,'preferences','expert_controls')

'details' and 'preferences' are super columns containing columns 'email', 'last_name', and 'expert_controls'. Just as with regular column families, you can encode arbitrary data in the column names, or just set them to UUIDs. When you define a SuperColumnFamily, you tell Cassandra how to sort and store the column names and the subcolumn names:

<ColumnFamily ColumnType="Super" CompareWith="UTF8Type" CompareSubcolumnsWith="UTF8Type" Name="Users"/>

One key consideration: as of this writing, the current version of Cassandra (0.6.2) does not do any indexing of the subcolumns, which means when you load a supercolumn, all of its subcolumns are loaded into memory. If you expect to have more than a few thousand subcolumns, you would be better off using a regular column family and overloading the row keys and column names with your nested data. In our example, your column names could be something like "sarah@example.com/Jones/true", and it would be up to you to split the data on retrieval.
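
A minimal sketch of that composite-name approach, in plain Ruby. pack_column_name and unpack_column_name are hypothetical helpers, and the check for stray separators is my own assumption about what you'd want, not something the gem does for you.

```ruby
# Hypothetical helpers for packing nested values into one column name.
SEPARATOR = '/'

# Join the parts into one name, refusing parts that contain the separator
# (otherwise the name could not be split back reliably).
def pack_column_name(*parts)
  parts.each do |part|
    raise ArgumentError, "part contains separator: #{part}" if part.to_s.include?(SEPARATOR)
  end
  parts.join(SEPARATOR)
end

# Split a packed name back into the fields from the example above.
def unpack_column_name(name)
  email, last_name, expert_controls = name.split(SEPARATOR)
  { 'email' => email, 'last_name' => last_name, 'expert_controls' => expert_controls }
end

name = pack_column_name('sarah@example.com', 'Jones', 'true')
p name
p unpack_column_name(name)
```

Note that everything comes back as a string, so values like 'true' need converting after the split.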

There is an open ticket to address this in a future release.

Key Names vs. Column Names

For the QA system, everything we keep track of is associated with a timestamp. The most natural partitioning of the data seems to be by day, hour, and whether we synced the message or not. The reporting system runs once per day, iterating over each hour and each synced state for the previous day. This gives us rows that are small enough for Cassandra to easily distribute across nodes without loading up too many columns in any one row. Our keys look like this:

key = "#{time.strftime("%Y-%m-%d-%H")}*#{is_synced}"
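
For a report that walks a whole day, the full set of row keys can be generated up front. This is a small sketch assuming UTC timestamps and a boolean is_synced; row_keys_for_day is a hypothetical helper name.

```ruby
# Build the row keys for every hour of one UTC day, for both sync states.
# row_keys_for_day is a hypothetical helper used only for illustration.
def row_keys_for_day(day_start)
  (0...24).flat_map do |hour|
    time = day_start + hour * 3600
    [true, false].map { |is_synced| "#{time.strftime('%Y-%m-%d-%H')}*#{is_synced}" }
  end
end

keys = row_keys_for_day(Time.utc(2010, 5, 30))
puts keys.size   # 48
puts keys.first  # 2010-05-30-00*true
```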

Since I only need to track 4 or 5 properties about each sync/nosync decision, I decided to use supercolumns. Actually, I first used the composite column approach described above, but I found supercolumns made for better-looking, slightly more efficient code. The columns look like this:

{ 'example.com*6f1ed002ab5595859014ebf0951522d9' => { 'from_address' => 'marketing@example.com', 'is_system_merchant' => true }}

Each supercolumn name is a composite of the domain name of the message we examined and an MD5 hash of the message header. This ensures we don't store duplicate records if the same message gets processed twice. It also means I can drill down on specific senders in the future if needed by using range queries with partial column names, as shown next.
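
Here is a short sketch of how such a name could be built and why it deduplicates. The helper name and the sample header are made up for illustration, and a plain hash stands in for the row.

```ruby
require 'digest/md5'

# Hypothetical helper: domain plus MD5 of the raw header, joined with '*'.
def supercolumn_name(domain, raw_header)
  "#{domain}*#{Digest::MD5.hexdigest(raw_header)}"
end

row = {}
header = "From: marketing@example.com\r\nSubject: Sale!\r\n"

# Processing the same message twice writes to the same supercolumn name,
# so the second write overwrites the first instead of adding a duplicate.
2.times do
  row[supercolumn_name('example.com', header)] = {
    'from_address' => 'marketing@example.com',
    'is_system_merchant' => true
  }
end
puts row.size
```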

Range Queries

I don't know the optimal number of columns that Cassandra can serve up in one request, but in our system one row (meaning one hour's worth of sync and nonsync events) could comprise tens of thousands of columns, more than we would want to request at once. But since the columns are stored in sorted order, it's easy to fetch them with a range query. Here is a super simplified version of what we do:

I divide up the previous day into 48 keys (synced events for each hour and nonsynced events for each hour).
I then thread these requests, 24 at a time. According to the docs, "a good rule of thumb is 2 concurrent reads per processor core", so 3 machines times 4 cores times 2 reads per core = 24 concurrent reads. Each node has its ConcurrentReads property set to 8. I may not be doing the math correctly, so feel free to chime in with a correcting comment.
Each thread executes the following (simplified) code:
```
start = ''
loop do
  # :count is completely arbitrary, need to experiment with what's best
  columns = cass.get(:MessageSyncing, key, :start => start, :count => 2500)
  break if columns.empty?

  columns.each do |column_name, column_values|
    # supercolumn names look like "example.com*<md5 of message header>"
    fqdn = column_name.split('*')[0]
    from_address = column_values['from_address']
    is_system_merchant = column_values['is_system_merchant']
    # increment counters/manipulate data here
  end

  # start the next page just past the last column name we saw
  # (succ rather than succ!, so the returned key itself is not modified)
  start = columns.keys.last.succ
end
```
This code uses a range query to page through all the columns within one row. Since the columns are sorted as UTF8Type, I can just increment the last column name and know that I'll get the next chunk of columns. You can also query with partial column names, so that if I wanted to see all of the data for a domain, I could range query with the column start as "example.com*".
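
To experiment with the paging logic without a cluster, here is a plain-Ruby simulation. ROW and fake_slice are hypothetical stand-ins for one sorted row and for a get with :start and :count. One deliberate difference from the loop above: the next page starts at the last name plus a NUL byte, which is the smallest string that sorts after it under plain byte-wise comparison. I have not tested that against a real cluster, so treat it as an assumption.

```ruby
# In-memory stand-in for one row whose columns are kept sorted by name.
# ROW and fake_slice are hypothetical names used only for illustration.
ROW = {
  'example.org*ddd' => 4, 'example.com*aaa' => 1, 'example.org*ccc' => 3,
  'example.com*bbb' => 2, 'example.org*eee' => 5
}.sort.to_h

# Imitates a slice: up to count columns whose names are >= start.
def fake_slice(row, start:, count:)
  row.select { |name, _| name >= start }.first(count).to_h
end

# Page through the whole row, two columns at a time.
seen = []
start = ''
loop do
  page = fake_slice(ROW, start: start, count: 2)
  break if page.empty?
  seen.concat(page.keys)
  start = page.keys.last + "\0" # smallest string sorting after the last name
end

# Partial-name query: start at the prefix and stop once names stop matching.
prefix = 'example.com*'
domain_columns = fake_slice(ROW, start: prefix, count: 100)
                 .take_while { |name, _| name.start_with?(prefix) }.to_h

p seen
p domain_columns.keys
```

Every column is visited exactly once, and the prefix query returns only the two example.com columns.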

I also have some pretty complicated code that collates the results of those 48 threads, which are themselves composites of all the range queries I ran within each row. I realized after writing it that I had essentially re-implemented my own half-assed map-reduce.
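
The overall shape of that threading and collating can be sketched as follows. This is a simplified illustration, not the production code: count_row is a hypothetical stand-in for the per-row paging loop and just returns a small hash of counts.

```ruby
# Hypothetical stand-in for the per-row paging loop: returns counts for
# one row key. A real version would page through the datastore instead.
def count_row(key)
  { 'rows' => 1, (key.end_with?('true') ? 'synced_rows' : 'unsynced_rows') => 1 }
end

# 48 row keys: 24 hours x 2 sync states.
keys = (0...24).flat_map do |hour|
  [true, false].map { |synced| format('2010-05-30-%02d*%s', hour, synced) }
end

CONCURRENT_READS = 3 * 4 * 2 # machines x cores x reads per core = 24

totals = Hash.new(0)
keys.each_slice(CONCURRENT_READS) do |batch|
  # "Map": one thread per row key in this batch.
  threads = batch.map { |key| Thread.new { count_row(key) } }
  # "Reduce": Thread#value joins the thread and returns the block's result.
  threads.each do |thread|
    thread.value.each { |name, n| totals[name] += n }
  end
end
p totals
```

Merging in the main thread keeps the shared totals hash free of locking, at the cost of waiting for each batch to finish before the next one starts.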

Happily, while I was implementing this code, Cassandra 0.6 came out, which includes built-in support for Hadoop. Cassandra has a Pig load function, so it should eventually be possible to replace the above code and my half-assed map-reduce with something much more elegant, maybe just a few lines of Pig. For now, this works great. Of course you don't need any of this if you aren't using your datastore for reporting.

Notes on EC2

I don't have enough experience yet to recommend whether you should use EC2 or not. I originally built this to use one xlarge instance, but I found that it could not keep up with the network load. There were a lot of timeouts from the nodes reporting to Cassandra. As soon as I split it into three smaller Cassandra nodes, the timeouts went way down. It might even make more sense to split into six small nodes.

Each node has two EBS volumes, a smaller one for the commit log and a large one for the data. The commit log is append-only and is used to replay writes in case Cassandra crashes before the data in memory can be written to the data disk. Keeping them separate improves throughput so one operation doesn't block the other. It might make more sense to use an ephemeral store for the commitlog; I haven't had time to explore.

I definitely suggest following the documentation's advice: use at least three nodes in production.

Notes on Adding Nodes

Adding each node was easy, and that's one of Cassandra's key features. All you have to do is tell the new node the address of at least one other node and set its AutoBootstrap value to true.

The only problem I had was that the first server was getting hammered so hard by all these requests, many of which were timing out, that it took a while for the second node to finish gossiping with the first node and start bootstrapping.

CassandraObject

Michael Koziarski wrote a cool ActiveModel interface to Cassandra called CassandraObject. I haven't played with it yet, but it offers a higher-level abstraction for accessing data beyond just "a hash of hashes". He's presenting it at RailsConf this year and I'll definitely be in attendance for that talk.

Further Reading

I found these articles/sites particularly helpful:

Incidentally, Stackoverflow is starting to become the site I go to first when I'm searching for the solution to a technical question. Google searches return at least 50% garbage or duplicate mailing list content for a lot of technical topics. That makes me more interested in the future of niche/vertical search engines. It is definitely possible to out-Google Google within a niche.

Wednesday, May 26, 2010

Links and slides for Social Media for Everybody

Last night I gave the Social Media for Everybody talk at SKYLOFTS, an amazingly nice space in Highlandtown. As promised, below are the slides and the links I discussed. It was definitely a firehose of information! Thanks to all who attended.

Social Media for Everybody


Tools and References

http://bmorefiber.com
http://blogger.com
http://wordpress.com
http://campaignmonitor.com
http://ping.fm
http://tubemogul.com
http://postling.com
http://followcost.com
http://pleaserobme.com
http://www.reclaimprivacy.org/facebook
http://reader.google.com
http://readitlaterlist.com/
http://www.tweetdeck.com/
http://cotweet.com
http://www.google.com/alerts
http://feedburner.google.com/
http://www.google.com/analytics/
http://getsatisfaction.com/
http://bit.ly
http://news.ycombinator.com/
http://whoshouldifollow.com/
http://mrtweet.com/
http://www.meetup.com/
http://www.facebook.com/advertising/?pages
http://www.linkedin.com/
http://friendfeed.com/
http://www.ustream.tv/
http://docs.google.com/
http://groups.google.com/

Further Reading and Watching

http://www.technotheory.com/how-to-use-social-media-guide/
http://www.kk.org/thetechnium/archives/2008/03/1000_true_fans.php
http://www.socialmediaexaminer.com
http://www.technotheory.com/2009/01/6-concrete-lessons-learned-in-online-relationship-building-as-presented-to-the-gbtc/
http://www.brandinteractivism.com/2010/05/creating-online-grassroots-movements-lessons-from-the-million-mom-march-10-years-later.html

Turning complainers into champions presentation by Josh Baer:

http://blog.asmartbear.com/how-i-got-6000-rss-subscribers-in-12-months.html
http://amzn.to/bmMN2i
http://twitter.com/chris_ashworth

Thursday, May 13, 2010

Social media for everyone!

I'll be giving a super personal, super passionate talk about social media at the SKYLOFT artistic community in Highlandtown on May 25th. I was invited by two interesting new veterans' groups that have formed in Baltimore: the Veteran Artist Program, of which I am a board member, and The Sixth Branch.

These guys have big, creative plans, so my talk will be about how to get the word out while being a responsible community member (i.e., this is not about marketing per se). What I plan to talk about should be useful to people from all walks of life (entrepreneurs, artists, employees, nonprofit types, etc.).

You can see more details or RSVP here.

Monday, April 5, 2010

How Twitter Integrates New Architecture

I've been experimenting with the Cassandra datastore lately and came across an interesting interview with Ryan King discussing Twitter's use of Cassandra. I thought this tidbit was worth sharing even if you don't care about Cassandra:

A philosophical note here — our process for rolling out new major infrastructure can be summed up as “integrate first, then iterate”. We try to get new systems integrated into the application code base as early in their development as possible (but likely only activated for a small number of people). This allows us to iterate on many fronts in parallel: design, engineering, operations, etc.

He also describes in detail how they used that philosophy to migrate Twitter status updates (the toughest part of their app to scale) to Cassandra. This also reminded me of Flickr's Feature Flags ... all of which can be used to support continuous deployment, something I've become very interested in after encountering it in the lean startup literature.
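
As a rough illustration of "activated for a small number of people", here is a minimal percentage-rollout flag in Ruby. It is my own sketch, not how Twitter or Flickr implement it; the flag name, the ROLLOUT_PERCENT table, and the use of CRC32 as a stable hash are all assumptions.

```ruby
require 'zlib'

# Hypothetical flag table: feature name => percentage of users who get it.
ROLLOUT_PERCENT = { 'cassandra_statuses' => 5 }

# A feature is on for a user when a stable hash of (feature, user id)
# falls under the rollout percentage, so each user gets a consistent answer.
def feature_enabled?(feature, user_id)
  percent = ROLLOUT_PERCENT.fetch(feature, 0)
  Zlib.crc32("#{feature}:#{user_id}") % 100 < percent
end

enabled = (1..10_000).count { |id| feature_enabled?('cassandra_statuses', id) }
puts "enabled for #{enabled} of 10000 users"
```

Raising the percentage over time widens the audience without a redeploy of the code path itself, which is what lets the new system be integrated early and iterated on in production.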

I'm planning to write up my findings on using Cassandra with Ruby sometime soon if there's interest.

Wednesday, March 24, 2010

Ada Lovelace Day: Profile of Heather Sarkissian

In honor of Ada Lovelace Day, I wrote a short profile of Baltimore tech leader Heather Sarkissian, CEO of mp3car.com and founder of BmoreSmart. The full article is here, but I've included an excerpt below:

She's the CEO of an important mobile computing technology company here in town, mp3Car.com, which Gus has written about before. The company combines several different facets in a compelling way: it's a popular forum, a crowd-sourced design company, a consultancy with Fortune 500 customers, and a niche ecommerce store. Their office in the Emerging Technology Center includes a small warehouse of electronic parts, making it one of those rare web businesses that has actual inventory and real-world relevance. They do it all with a small staff led by Heather.