Testing your database backups: the test environment database refresh pattern

When did you last try restoring your database backups? A month ago, a week ago? A year ago? Never? When was the last time you refreshed the data in your test environments? When I joined Songkick, one of the first things I asked was when we last tested a restore of our database backups. The answer, pleasingly, was at 03:00 UK time that morning and not coincidentally, that’s when we last refreshed the data in our test environments.

Here’s how we get the warm and fuzzy feeling of knowing that our backups contain data that can be restored and makes sense.

  1. Every morning, our database servers run their scheduled backups, copying the resulting images to a backup server in the data centre.
  2. Overnight those backups get copied to the office, giving us an offsite copy.
  3. In the small hours, when most of us are asleep, each of the database servers in our staging environment retrieves the backups, erases its local data files and then restores the production backups over the top of them.
  4. We perform sanitisation on the data, to make it suitable for use in a testing environment.
  5. And finally, and most importantly, we use the databases in our testing.

By doing this, we identified one case when our backups seemed to work, produced plausible looking backups, but MySQL failed to apply InnoDB log records during recovery. It was inconvenient to discover this problem in our staging environment, but far less inconvenient than discovering it when we needed the backups to put our production system back into operation.

Here are some practical tips based on our experience implementing and managing this system at Songkick:

Back all databases up at the same time

If your system is composed of services backed by independent databases on different machines, it’s possible that there’s some implicit consistency between them. For example, a common situation at Songkick is to have an accounts service responsible for storing user accounts and another service that stores user data keyed against a user. There’s an expectation that those databases have some degree of consistency.

If you back them up at different times, you’ll find inconsistencies: for example, a service might have a reference to a user that doesn’t yet exist. If the ID of the user is exposed to other services and that ID can be reused, you may find that newly created users in your test environment have existing data associated with them, and this can cause significant problems in testing.

It’s worth noting that, in the case of a production restore, these issues would need to be diagnosed and solved in the heat of the moment. By finding them in your test environment, you’re giving yourself the space to solve them earlier, under less pressure.

Design the backups to be regularly exercised

Some types of backups are more amenable to being restored regularly in test environments. For example, our initial MongoDB database backups performed snapshots of our MongoDB database path. These proved difficult to restore, because they included local databases which contained information on replica set membership. This means that on startup, our staging MongoDB server would forget its existing replica set membership and try to talk to the production servers instead.

We switched to using mongodump to take a logical export of the database, simply so that we could restore it on the primary member of our existing staging replica set and update the entire replica set.

Sanitisation tips

After we’ve restored the databases, there are certain things we do to make them safe and usable in our testing environments.

  • Remove or obfuscate email addresses. We’re not fond of accidentally emailing people with test events we’ve created in staging, so we change people’s email addresses to be unusable, so that can’t happen. We leave people’s email addresses alone if they work at Songkick, so we can test email features by emailing ourselves.
  • Remove or obfuscate payment tokens. If it’s uncool to accidentally email people, accidentally charging them is positively hostile. Anything that’s used for payment needs to be removed.
  • Fix or replace information about the environment. It’s best to avoid keeping references to your technical environment in the same database as your application data, but sometimes it’s tricky to work around. For example, our MogileFS installation needs to be kept in sync with our production one, to avoid problems with missing media. This means that we need to manually update the database to substitute the hostnames of the MogileFS servers.
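
As a rough illustration of the email step, here is a minimal sketch. The helper name, the list of internal domains and the replacement address format are all assumptions made for this example, not our actual sanitisation rules.

```ruby
# Hypothetical helper: the domain list and replacement format are
# assumptions for illustration only.
INTERNAL_DOMAINS = ['songkick.com'].freeze

def sanitise_email(user_id, email)
  domain = email.to_s.split('@').last.to_s.downcase
  # Staff addresses are left alone so email features can still be tested.
  return email if INTERNAL_DOMAINS.include?(domain)

  # The .invalid TLD is reserved (RFC 2606), so this mail can never be delivered.
  "user-#{user_id}@example.invalid"
end

sanitise_email(42, 'fan@gmail.com')      # => "user-42@example.invalid"
sanitise_email(7,  'dev@songkick.com')   # => "dev@songkick.com"
```

Deriving the replacement from the user ID keeps addresses unique, which matters if the column has a uniqueness constraint.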

Write code that can withstand the database going away

Unless you’ve put some work in, almost no database driver will gracefully handle the disappearance of a database server and then its re-appearance some time later. If the restore in your test environment is the first time you’ve tried this, you may find that you need to manually restart services, even after the database re-appears on the network.

The solution will vary depending on the database client being used, but often it’s a case of catching an exception, or changing some options when you establish the connection.
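
The "catch an exception" approach might look something like the sketch below. ConnectionLost and FlakyConnection are stand-ins invented for this example; with a real driver you would rescue its own connection error and call its own reconnect method.

```ruby
# Stand-ins for a real driver, invented for this sketch.
class ConnectionLost < StandardError; end

class FlakyConnection
  attr_reader :reconnects

  def initialize(failures)
    @failures   = failures   # number of queries that fail before recovery
    @reconnects = 0
  end

  def reconnect!
    @reconnects += 1
  end

  def query(sql)
    if @failures > 0
      @failures -= 1
      raise ConnectionLost, 'server has gone away'
    end
    "result of #{sql}"
  end
end

# Run the block, reconnecting and retrying when the connection drops.
# Gives up and re-raises after `attempts` failed tries.
def with_reconnect(connection, attempts = 3, delay = 0)
  tries = 0
  begin
    yield connection
  rescue ConnectionLost
    tries += 1
    raise if tries >= attempts
    sleep delay
    connection.reconnect!
    retry
  end
end

db = FlakyConnection.new(2)
with_reconnect(db) { |conn| conn.query('SELECT 1') }   # => "result of SELECT 1"
```

In a real service you would probably want a non-zero delay with some back-off, so a restoring database is not hammered with reconnect attempts.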

By making your applications reconnect to the database with no manual input, you are again fixing a problem that will eventually occur in production – a much more stressful time for it to be diagnosed and fixed.

Summary

Testing your database backups by restoring them automatically and regularly in your test environments is a great way to battle-harden your backups and applications and to make sure that your test environment looks like the real production environment.

Safely dealing with magical text

Boy, what a week it’s been. A remote-code-execution bug was discovered in Ruby on Rails, and we’ve all been scrambling to patch our servers (please patch your apps before reading any further, there is an automated exploit out there that gives people a shell on your boxes otherwise).

What the Ruby community, and those of other dynamic languages, must realize from recent Rails security blunders is that very similar problems can easily exist in any non-trivial web application. Indeed, I found a remote-execution bug in my own open-source project Faye yesterday, 3.5 years into the life of the project (again: patch before reading on).

There are a lot of lessons to be had from recent Rails security blunders, since they involve so many co-operating factors: excessive trust of user input, insufficient input validation and output encoding, the behavioural capabilities of Ruby objects and certain Rails classes, ignorance of cryptography and the computational complexity of data transport formats. In this post I’d like to focus on one in particular: safely encoding data for output and execution.

Ugh, do I have to?

I know, I know, booooooring, but so many people are still getting this really badly wrong and it continues to punish end users by exposing their data to malicious manipulation.

Robert Hansen and Meredith Patterson have a really good slide deck on stopping injection attacks with computational theory. One core message in that paper is that injection exploits (including SQL injection and cross-site scripting) involve crafting input such that it creates new and unexpected syntactic elements in code executed by the software, essentially introducing new instructions for the software to execute. Let’s look at a simple example.

Learn you a query string

I found the code that prompted me to write this post while updating some Google Maps URLs on our site this afternoon. Some of this code was constructing URLs by doing something like this:

def maps_url(lat, lng, zoom, width, height)
  params = [ "center=#{lat},#{lng}",
             "zoom=#{zoom}",
             "size=#{width}x#{height}" ]

  "http://maps.google.com/?" + params.join("&amp;")
end

maps_url(51.4651204, -0.1148897, 15, 640, 220)

# => "http://maps.google.com/?center=51.4651204,-0.1148897&amp; ...
#                             zoom=15&amp; ...
#                             size=640x220"

You can see the intent here: whoever wrote this code assumes the URL is going to end up being embedded in HTML, and so they have encoded the query string delimiters as &amp; entities. But this doesn’t fix the problem entities are designed to solve, namely: safely representing characters that usually have special meaning in HTML. What is telling is that the comma in the query string should really also be encoded as %2C, but isn’t.

So although the ampersands are being encoded, the actual query data is not, and that means anyone calling this function can use it to inject HTML, for example:

link = '<a href="' +
           maps_url(0, 0, 1, 0, '"><script>alert("Hello!")</script>') +
           '">Link text</a>'

# => '<a href="http://maps.google.com/?center=0,0&amp; ...
#                                      zoom=1&amp; ...
#                                      size=0x"> ...
#         <script>alert("Hello!")</script> ...
#         ">Link text</a>'

By abusing the maps_url() function, I have managed to inject characters with special meaning — <, >, etc. — into the output and thereby added new HTML elements to the output that shouldn’t be there. By passing unexpected input I’ve created a lovely little cross-site scripting exploit and stolen all your users’ sessions!

Note that you cannot cleanly fix this by using an HTML-escaping function like ERB::Util.h() on the output of maps_url(), because this would serve to re-encode the ampersands, leaving strings like &amp;amp; in the href attribute.
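
To see the double-encoding concretely, here is a small sketch using ERB::Util.html_escape (the method that h() aliases) on a string that already contains a hand-written entity:

```ruby
require 'erb'

# A URL that has already had its ampersand entity-encoded by hand.
already_encoded = 'http://maps.google.com/?center=0,0&amp;zoom=1'

# Escaping it again encodes the "&" that begins the existing entity.
double_encoded = ERB::Util.html_escape(already_encoded)
# => "http://maps.google.com/?center=0,0&amp;amp;zoom=1"
```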

Stacks of languages

Meredith Patterson of the above-linked paper gave another presentation at 28C3 called The Science of Insecurity. I’ve been telling absolutely everyone to watch it recently, so here it is.

This talk describes how we should think of data transfer formats, network protocols and the like as languages, because in fact that’s what they are. It covers the different levels of language power – regular languages, context-free languages and Turing-complete languages – and how use of each affects the security of our systems. It also explains why, if your application relies on Turing-complete protocols, it will take an infinite amount of time to secure it.

When you build HTML pages, you are using a handful of languages that all run together in the same document. There’s HTML itself, and embedded URLs, and CSS, and JavaScript, and JavaScript embedded in CSS, and CSS selectors embedded in CSS and JavaScript, and base64 encoded images, and … well this list is long. All of these are languages and have formal definitions about how to parse them, and your browser needs to know which type of data it’s dealing with whenever it’s parsing your code.

Every character of output you generate is an instruction that tells the browser what to do next. If it’s parsing an HTML attribute and sees the " character, it truncates the attribute at that point. If it thinks it’s reading a text node and sees a <, it starts parsing the input as an HTML tag.

Instead of thinking of your pages as data, you should think of them as executable language.

Back to reality

Let’s apply this idea to our URL:

http://maps.google.com/?center=51.4651204,-0.1148897&amp;zoom=15&amp;size=640x220

Outside of an HTML document, the meaning of this list of characters changes: those &amp; blobs only have meaning when interpreting HTML, and if we treat this query string verbatim we get these parameters out:

{
  'center'   => '51.4651204,-0.1148897',
  'amp;zoom' => '15',
  'amp;size' => '640x220'
}

(This assumes your URL parser doesn’t treat ; as a value delimiter, or complain that the comma is not encoded.)

We’ve seen what happens when we embed HTML-related characters in the URL: inserting the characters "> chops the <a> tag short and allows injection of new HTML elements. But that behaviour comes from HTML, not from anything about URLs; when the browser is parsing an href attribute, it just reads until it hits the closing quote symbol and then HTML-decodes whatever it read up to that point to get the attribute value. It could be a URL, or any other text value, the browser does not care. At that level of parsing, it only matters that the text is HTML-encoded.

In fact, you could have a query string like foo=true&bar="> and parsing it with a URL parser will give you the data {'foo' => 'true', 'bar' => '">'}. The characters "> mean something in the HTML language, but not in the query string language.
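
Here is a deliberately naive query string parser to illustrate the point. parse_query is a hypothetical helper written for this example (real parsers handle repeated keys, and some also split on ;), but it is enough to show that HTML-significant characters are just data at this layer:

```ruby
require 'uri'

# Naive parser for illustration: split pairs on "&" only,
# then split each pair on the first "=".
def parse_query(query)
  query.split('&').each_with_object({}) do |pair, params|
    key, value = pair.split('=', 2)
    params[URI.decode_www_form_component(key)] =
      URI.decode_www_form_component(value.to_s)
  end
end

parse_query('foo=true&bar=">')['bar']
# => "\">"

parse_query('center=51.4651204,-0.1148897&amp;zoom=15&amp;size=640x220').keys
# => ["center", "amp;zoom", "amp;size"]
```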

So, we have a stack of languages, each nested inside the other. Symbols with no special meaning at one level can gain meaning at the next. What to do?

Stacks of encodings

What we’re really doing here is taking a value and putting it into a query string inside a URL, then putting that URL inside an HTML document.

                                +-------------------------+
                                | "51.4651204,-0.1148897" |
                                +------------+------------+
                                             |
    +----------------------------------------|--------+
    |                                +-------V------+ |
    | http://maps.google.com/?center=| centre_value | |
    |                                +--------------+ |
    +------------------------+------------------------+
                             |
                       +-----V-----+
              <a href="| url_value |">Link</a>
                       +-----------+

At each layer, the template views the value being injected in as an opaque string — it doesn’t care what it is, it just needs to make sure it’s encoded properly. The problem with our original example is that it pre-emptively applies HTML encoding to data because it anticipates that the value will be used in HTML, but does not apply encodings relevant to the task at hand, namely URL construction. This is precisely backwards: considering the problem as above we see that we should instead:

  1. Decide what type of string we’re creating — is it a URL, an HTML doc, etc.
  2. Apply all encoding relevant to the type of string being made
  3. Do not apply encodings for languages further up the stack

In other words, we should make a URL-constructing function apply URL-related encoding to its inputs, and an HTML-constructing function should apply HTML encoding. This means each layer’s functions can be recombined with others and still work correctly, because their outputs don’t make assumptions about where they will be used. So we would rewrite our code as:

require 'cgi'
require 'erb'

def maps_url(lat, lng, zoom, width, height)
  params = { "center" => "#{lat},#{lng}",
             "zoom"   => zoom,
             "size"   => "#{width}x#{height}" }

  query = params.map do |key, value|
    "#{CGI.escape key.to_s}=#{CGI.escape value.to_s}"
  end
  "http://maps.google.com/?" + query.join("&")
end

url = maps_url(51.4651204, -0.1148897, 15, 640, 220)

# => "http://maps.google.com/?center=51.4651204%2C-0.1148897& ...
#                             zoom=15& ...
#                             size=640x220"

html = '<a href="' + ERB::Util.h(url) + '">Link</a>'

# => '<a href="http://maps.google.com/?center=51.4651204%2C-0.1148897&amp; ...
#                                      zoom=15&amp; ...
#                                      size=640x220">Link</a>'

Now we see that we get two valid pieces of data: url is a valid URL with all its query parameters correctly encoded but no HTML entities present, and html is a valid HTML fragment with its attributes correctly entity-encoded.

Also, note how we have treated all incoming data as literal (i.e. not already encoded for the task at hand), and we have not hand-written any encoding ourselves (e.g. hand-writing entities like &amp;). You should deal with data assuming it contains the literal information it represents and use library functions to encode it correctly. There’s a very good chance you don’t know all the text transformations required by each layer.
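
As a sketch of how the layers compose, the example below replays the earlier injection attempt against two small helpers. build_url and link_to are hypothetical names for this example, and it uses URI.encode_www_form from the standard library rather than the hand-rolled CGI.escape loop above:

```ruby
require 'uri'
require 'erb'

# URL layer: applies URL encoding only, and knows nothing about HTML.
def build_url(base, params)
  "#{base}?#{URI.encode_www_form(params)}"
end

# HTML layer: applies HTML encoding only, and treats the URL as an opaque string.
def link_to(url, text)
  %(<a href="#{ERB::Util.html_escape(url)}">#{ERB::Util.html_escape(text)}</a>)
end

evil = '"><script>alert("Hello!")</script>'
url  = build_url('http://maps.google.com/', 'center' => '0,0', 'size' => "0x#{evil}")
html = link_to(url, 'Link')

url.include?('<')       # => false, the payload is percent-encoded
html.scan('<').length   # => 2, only the opening and closing <a> tags
```

Neither helper needs to know about the other: the malicious value is neutralised at the URL layer, and the HTML layer only has to deal with the ampersands that the URL layer legitimately introduced.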

Thinking in types

At this point you’re probably thinking that I’ve made something quite simple seem very complicated. But thinking in terms of types of strings, treating your output as a language stack and following the bullet list above is a good discipline to follow if you want to make sure you handle data safely.

There are some systems that do this for you, for example Rails 3 automatically HTML-escapes any value you insert into an ERB template by default. I’m working on a more general version of this idea: Coping is a templating language that checks your templates conform to the language you’re producing, and doesn’t let input introduce new syntactic elements.

If you’re feeling very brave, I recommend taking the Coursera Compilers course. Although it doesn’t seem immediately relevant to web devs, many concepts from parser theory, type checking and code generation can be applied to security and are well worth learning.

Above all, learn from other people’s security failures and consider where you may have made similar mistakes.

validates_uniqueness_of :nothing

Warning: this article contains rather a lot of silly decisions.

I’ve recently been working out some bugs in our OAuth implementation, including our OAuth2::Provider library. One of the biggest gotchas I found while diagnosing problems with our client apps was the existence of duplicate Authorization records.

An Authorization is a link between a ResourceOwner (i.e. a Songkick user) and a Client, for example our iPhone application. It represents that the user has granted the client access to their resources on Songkick. There should only be one of these per owner-client pair, and somehow we had a few thousand duplicates in our database. Getting more concrete, the table’s columns include the following:

+---------------------+--------------+
| Field               | Type         |
+---------------------+--------------+
| resource_owner_type | varchar(255) |
| resource_owner_id   | int(11)      |
| client_id           | int(11)      |
+---------------------+--------------+

Each combination of values for these three columns must only appear once in the table.

A series of unfortunate events

Now the Rails Way to make such guarantees is to use validates_uniqueness_of, or use a find_or_create_by_* call to check if something exists before creating it. And that’s basically what I’d done; OAuth2::Provider has a method called Authorization.for(owner, client) that would either find a suitable record or create a new one.

But despite implementing this, we were still getting duplicates. I removed an alternative code path for getting Authorization records, and still the duplicates continued. I figured something in our applications must be creating them, so I made new() and create() private on the Authorization model. No dice.

And then I remembered: concurrency! Trying to enforce uniqueness on the client doesn’t work, unless all the clients subscribe to a distributed decision-making protocol. If two requests are in flight, both can run a SELECT query, find there’s no existing record, and then both decide to create the record. Something like this:

             User 1                 |               User 2
------------------------------------+--------------------------------------
# User 1 checks whether there's     |
# already a comment with the title  |
# 'My Post'. This is not the case.  |
SELECT * FROM comments              |
WHERE title = 'My Post'             |
                                    |
                                    | # User 2 does the same thing and also
                                    | # infers that his title is unique.
                                    | SELECT * FROM comments
                                    | WHERE title = 'My Post'
                                    |
# User 1 inserts his comment.       |
INSERT INTO comments                |
(title, content) VALUES             |
('My Post', 'hi!')                  |
                                    |
                                    | # User 2 does the same thing.
                                    | INSERT INTO comments
                                    | (title, content) VALUES
                                    | ('My Post', 'hello!')
                                    |
                                    | # ^^^^^^
                                    | # Boom! We now have a duplicate
                                    | # title!

This may look familiar to you. In fact, I lifted it straight out of the ActiveRecord source where it explains why validates_uniqueness_of doesn’t work when you have concurrent requests.
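
The same interleaving can be replayed deterministically in plain Ruby. This is only a simulation: FakeTable is an in-memory stand-in invented for this sketch, and the two "requests" are just statements run in the problematic order.

```ruby
# In-memory stand-in for a table with no unique index (hypothetical).
class FakeTable
  attr_reader :rows

  def initialize
    @rows = []
  end

  def exists?(key)
    @rows.include?(key)
  end

  def insert(key)
    @rows << key
  end
end

key   = ['User', 1, 99]   # resource_owner_type, resource_owner_id, client_id
table = FakeTable.new

# Both requests run their existence check before either one inserts.
request_1_sees_row = table.exists?(key)   # false
request_2_sees_row = table.exists?(key)   # false

table.insert(key) unless request_1_sees_row
table.insert(key) unless request_2_sees_row

table.rows.length   # => 2, a duplicate even though both checks passed
```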

Users do the funniest things

I agree with you – in theory. In theory, communism works. In theory.

— Homer J. Simpson

There can be a tendency among some programmers to dismiss these arguments as things that probably won’t be a problem in practice. Why would two requests arrive at the same time, close enough to cause this race condition in the database, for the same user’s resources? This is the same thinking that tells you timing attacks are impossible over the Internet.

And I subscribed to this belief for a long time. Not that I thought it was impossible, I just thought there were likelier causes – hence all the attempts to shut down record creation code paths. But I was wrong, and here’s why:

People double-click on things on the Web.

Over time, we designers of software systems have instilled some confusing habits in the people who use our products, and one of those habits means that there is a set of people that always double-click links and form buttons on web pages. Looking at the updated_at timestamps on the duplicate records showed that most of them were modified very close together in time, certainly close enough to cause database race conditions. This fact by itself makes client-enforced uniqueness checks a waste of time. Even if you’re not getting a lot of requests, one little user action can blow your validation.

This is the database’s job

Here’s how this thing should be done, even if you think you’re not at risk:

class AddUniqueIndexToThings < ActiveRecord::Migration
  def self.up
    add_index :oauth_authorizations,
              [:client_id, :resource_owner_type, :resource_owner_id],
              :unique => true
  end
  
  def self.down
    remove_index :oauth_authorizations,
                 [:client_id, :resource_owner_type, :resource_owner_id]
  end
end

Then, when you try to create a record, you should catch the potential exception that this index will throw if the new record violates the uniqueness constraint. Rails 3 introduced a new exception called ActiveRecord::RecordNotUnique for its core adapters, but if you’re still supporting older Rails versions you need to catch ActiveRecord::StatementInvalid and check the error message. Here’s how our OAuth library does things.

DUPLICATE_RECORD_ERRORS = [
  /^Mysql::Error:\s+Duplicate\s+entry\b/,
  /^PG::Error:\s+ERROR:\s+duplicate\s+key\b/,
  /\bConstraintException\b/
]

def self.duplicate_record_error?(error)
  error.class.name == 'ActiveRecord::RecordNotUnique' or
  DUPLICATE_RECORD_ERRORS.any? { |re| re =~ error.message }
end

In the Authorization.for(owner, client) method, there’s a rescue clause that uses duplicate_record_error? to check the exception raised. If it’s a duplicate record error, we retry the method call since the second time it should find the new record that was inserted since the first call started.
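
The shape of that rescue-and-retry logic is roughly as follows. This is a sketch rather than the library’s actual code: RecordNotUnique and AuthorizationStore are in-memory stand-ins invented for the example, and simulate_race exists only to force the losing side of the race.

```ruby
# Stand-ins invented for this sketch: RecordNotUnique mimics the database
# error, AuthorizationStore mimics a table with a unique index.
class RecordNotUnique < StandardError; end

class AuthorizationStore
  attr_reader :rows

  def initialize
    @rows = []
    @stale_reads = 0
  end

  # Test hook: make the next n lookups miss, as if another request
  # inserted the row between our SELECT and our INSERT.
  def simulate_race(n = 1)
    @stale_reads = n
  end

  def find(key)
    if @stale_reads > 0
      @stale_reads -= 1
      return nil
    end
    @rows.find { |row| row == key }
  end

  def create(key)
    raise RecordNotUnique, 'Duplicate entry' if @rows.include?(key)
    @rows << key
    key
  end

  # Find or create; on a uniqueness violation, retry once so that the
  # second lookup returns the row the competing request inserted.
  def find_or_create(owner_type, owner_id, client_id)
    key = [owner_type, owner_id, client_id]
    attempts = 0
    begin
      find(key) || create(key)
    rescue RecordNotUnique
      attempts += 1
      raise if attempts > 1
      retry
    end
  end
end

store = AuthorizationStore.new
store.find_or_create('User', 1, 99)   # creates the row
store.simulate_race
store.find_or_create('User', 1, 99)   # lookup misses, insert is rejected, retry finds the row
store.rows.length                     # => 1
```

Capping the number of retries means a genuinely broken lookup surfaces as an error instead of looping forever.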

Get your objects out of my session

Last week I had the pleasant job of fixing a feature that broke due to a change in a third-party API. Specifically, Twitter changed part of their authentication API and this broke our ‘post your attendance to Twitter’ feature. After a while spelunking through several layers of HTTP indirection inside the twitter and oauth gems, it became apparent that an upgrade was in order – we implemented this feature so long ago that our twitter gem was lagging four major releases behind the current version.

But this isn’t about Twitter, or OAuth, or even those specific Ruby libraries. It’s about an antipattern I was reminded of while updating our code and reading the OAuth gem documentation. Here is how it suggests you start the authorization process in your Twitter client app:

@callback_url = "http://127.0.0.1:3000/oauth/callback"
@consumer = OAuth::Consumer.new("key", "secret", :site => "https://agree2")
@request_token = @consumer.get_request_token(:oauth_callback => @callback_url)
session[:request_token] = @request_token
redirect_to @request_token.authorize_url(:oauth_callback => @callback_url)

This code contains a bug that’s bitten me so many times it jumped right off the page:

session[:request_token] = @request_token

Here’s the bug: you just stored the Marshal.dump of some random object in the session. One day, you will refactor this object – change its class name, adjust its instance variables – and next time you deploy, no-one will be able to access your site. It doesn’t matter whether the session is stored in the cookie (and therefore on the user’s computer) or on your servers, the problem is that you’ve stored a representation of state that’s tightly coupled to its implementation.

A simple example

Let’s see this in action. Imagine we have a little Sinatra app with two endpoints. One of these endpoints puts an object in the session, and another one retrieves data from the stored object:

require 'sinatra'
set :sessions, true
set :session_secret, 'some very large random value'

class State
  def initialize(params = {})
    @params = params
  end

  def get
    @params.values.first
  end
end

get '/' do
  session[:state] = State.new(:flow => 'sign_up')
  'Hello'
end

get '/state' do
  session[:state].get
end

We boot the app, and see that it works:

$ curl -i localhost:4567/
HTTP/1.1 200 OK
Content-Type: text/html;charset=utf-8
Content-Length: 5
Set-Cookie: rack.session=BAh7CEk...; path=/; HttpOnly

Hello

$ curl localhost:4567/state -H 'Cookie: rack.session=BAh7CEk...'
sign_up

A little change

So, this seems to work, and we leave the site running like this for a while, and people visit the site and create sessions. Then one day we decide we need to refactor the State class, by changing that hash into an array:

class State
  def initialize(params = [])
    @params = params
  end

  def get
    @params.last
  end
end

get '/' do
  session[:state] = State.new(['sign_up'])
  'Hello'
end

Now if we retry our request we find this buried among the stack traces:

$ curl localhost:4567/state -H 'Cookie: rack.session=BAh7CEk...'

NoMethodError at /state
undefined method `last' for {:flow=>"sign_up"}:Hash

A peek at Rack’s guts

To understand why this happens you need to see how Rack represents the session. Basically, it takes the session hash, such as {:state => State.new(:flow => 'sign_up')}, runs it through Marshal.dump and base64-encodes the result. Here’s what Marshal emits:

>> session = {:state => State.new(:flow => 'sign_up')}
=> {:state=>#<State:0x... @params={:flow=>"sign_up"}>}
>> Marshal.dump session
=> "\x04\b{\x06:\nstateo:\nState\x06:\f@params{\x06:\tflowI\"\fsign_up\x06:\x06ET"

Marshal produces a literal representation of the object – its class, its instance variables and their values. It is a snapshot of the object that can be completely reconstructed later through Marshal.load.

When you store objects in the session, you are dumping part of your program’s implementation into storage and, if you use cookie-stored sessions, sending that representation to the user for them to give back later. Now, fortunately, cookies are signed by Rack using HMAC-SHA1 so the user should not be able to construct arbitrary Marshal output and inject objects into your program – don’t forget to set :session_secret unless you want people sending forged objects to you! But there is still the problem that your code is effectively injecting objects into processes running in the future, when those objects may no longer be valid.
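
The signing idea can be sketched in a few lines. This is a simplified illustration of an HMAC-signed value, not Rack’s actual implementation: the helper names, the separator handling and the plain == comparison are assumptions of this example.

```ruby
require 'openssl'

# Placeholder secret for the sketch; use a long random value in practice.
SECRET = 'some very large random value'

def hmac_for(data)
  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA1'), SECRET, data)
end

# Append a digest that only someone holding the secret can compute.
def sign(data)
  "#{data}--#{hmac_for(data)}"
end

# Return the data if the digest matches, otherwise nil.
def verify(cookie)
  data, _, digest = cookie.rpartition('--')
  # A production implementation should use a constant-time comparison here.
  digest == hmac_for(data) ? data : nil
end

cookie = sign('BAh7CEk')
verify(cookie)                        # => "BAh7CEk"
verify(cookie.sub('BAh7', 'XXXX'))    # => nil, the data no longer matches the digest
```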

If you change the name of a class, then Marshal.load will fail, and you’ll get an empty session object. But if all the types referenced in the session dump still exist, it will happily reconstruct all those objects and their state may not reflect what the current process expects.
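
Both failure modes can be reproduced without a web server. The sketch below reuses the State class from the example above, dumps an instance, and then plays the part of a later deployment:

```ruby
class State
  def initialize(params = {})
    @params = params
  end

  def get
    @params.values.first
  end
end

# "Old" process: serialise the object, as the session store would.
dumped = Marshal.dump(State.new(:flow => 'sign_up'))

# "New" deployment: same class name, but get now expects @params to be an Array.
class State
  def get
    @params.last
  end
end

restored = Marshal.load(dumped)   # succeeds, because the class still exists

stale_error =
  begin
    restored.get
    nil
  rescue NoMethodError => e
    e   # @params is still the old Hash, which has no `last` method
  end

# Renaming (here: removing) the class makes Marshal.load itself fail.
Object.send(:remove_const, :State)

load_error =
  begin
    Marshal.load(dumped)
    nil
  rescue ArgumentError => e
    e   # the dump refers to a class that no longer exists
  end
```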

And as a bonus, once you’ve deployed the session-breaking change, you can’t revert it, because recent visitors will have the new representation in their session. We’ve got various classes in our codebase with multiple names to work around times when we made this mistake.

A better way

In light of the above, you should treat your sessions with a certain degree of paranoia. You should treat them with the same care as a public API, making sure you only put stable representations of state into them. Personally I stick to Ruby’s core data types – strings, numbers, booleans, arrays, hashes. I don’t put user-defined classes (including anything from stdlib or gems) in there. Similarly, you should not assume any given session key exists, since the session may become corrupt, the user may delete their cookies, and so on. Always check for nil values before using any session data, unless you want your site to become unreachable.
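
In practice that can be as simple as the following sketch, where session is any Hash-like object. The helper names, the :flow key and the list of known flows are assumptions made for this example.

```ruby
# Values we are prepared to accept back out of the session (hypothetical list).
KNOWN_FLOWS = %w[sign_up sign_in].freeze

# Store a plain String rather than a State object.
def store_flow(session, flow)
  session[:flow] = flow.to_s
end

# Missing, corrupt or unexpected data all come back as nil.
def current_flow(session)
  flow = session[:flow]
  KNOWN_FLOWS.include?(flow) ? flow : nil
end

session = {}
current_flow(session)            # => nil, nothing stored yet
store_flow(session, :sign_up)
current_flow(session)            # => "sign_up"
```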

A future-proof Twitter client

So how should you use the Twitter gem and avoid these problems? Easy – just store the credentials from the request token, and reconstruct the token when Twitter calls you back:

Twitter.configure do |c|
  c.consumer_key    = 'twitter_key'
  c.consumer_secret = 'twitter_secret'
end

def consumer
  OAuth::Consumer.new('twitter_key',
                      'twitter_secret',
                      :site => 'https://www.example.com')
end

def callback_url
  'https://www.example.com/auth/twitter/callback'
end

get '/auth/twitter' do
  request_token = consumer.get_request_token(:oauth_callback => callback_url)
  session[:request_token] = request_token.token
  session[:request_secret] = request_token.secret
  redirect request_token.authorize_url(:oauth_callback => callback_url)
end

get '/auth/twitter/callback' do
  token  = session[:request_token]
  secret = session[:request_secret]

  halt 400 unless token and secret
  session[:request_token] = session[:request_secret] = nil
  
  request_token = OAuth::RequestToken.from_hash(consumer,
                      :oauth_token => token,
                      :oauth_token_secret => secret)
  
  access_token = request_token.get_access_token(:oauth_verifier => params[:oauth_verifier])
  
  client = Twitter::Client.new(
               :oauth_token => access_token.token,
               :oauth_token_secret => access_token.secret)
  
  user_details = client.verify_credentials
  
  store_twitter_tokens(user_details.screen_name,
                       access_token.token,
                       access_token.secret)
  
  redirect '/auth/twitter/success'
end

Note how we only store strings in the session and the database, and we store just enough of the credentials that we can construct an OAuth or Twitter client later, whenever we need one.

This approach only stores stable representations – tokens used in the OAuth protocol – and constructs objects by hand when they are needed rather than relying on Marshal dumps. This makes the application more resilient when the libraries you depend on inevitably need upgrading.

A month at Songkick

I love Songkick.

Not in a soppy “no you hang up first” kinda way, but in a “I haven’t missed a great gig in over a year” way. Which is why when I was given the opportunity to work here, I jumped at it.

After working at Songkick for a few weeks now, I thought I’d write about my experiences so far, from the interview process through to day-to-day development.

Here are the six simple steps I took to Songkick happiness.

Step 1 – Network

I’ve been a fan of Songkick’s service for a long time, and after I met some of the team at the Silicon Milkroundabout event in May 2012, I was invited to start the interview process. This was great news (Songkick are awesome[1]).

Initially, I did have a few concerns about my technical compatibility with the company; I’ve spent the last few years in a Windows and .NET environment, and Songkick are a long way from that. I was soon to find that these worries were misplaced.

Step 2 – Code

To kick off the interview process, I received an email from Songkick – “Hey Aaron, You seem pretty rad, fancy taking a technical test?”. At least that’s how I remember it.

The rules:

  • Complete an hour long programming challenge
  • From home, at a time that suited you
  • In a programming language of your choice

I let them know when I could set an hour aside, and at the agreed time I was emailed a PDF describing the challenge. I can’t give too much away, but the challenge was really interesting, and Songkick-specific.

I hacked away in C#, making use of third-party libraries as required, and after the hour was up, emailed my solution. I didn’t have time to fully complete the challenge, but I had concentrated on getting a clean design, stubbing all core interfaces, classes and methods, and adding comments and pseudo-code where necessary.

After a few days, I received an email informing me that I was through to round two.

Step 3 – More Code

I was invited to have a couple of face-to-face interviews, and sit another coding test. This time I was to complete a 90-minute pair-programming exercise, in Ruby.

The test was a little daunting as I was a complete Ruby novice. However, with it being a pair-programming exercise, I had a friendly developer (Sabrina) sitting with me to help with syntax questions. Any time I was unaware of the syntax in Ruby (quite a lot!), I could scribble on a notepad how I would solve the problem in C#, and Sabrina would show me the equivalent syntax in Ruby.

This was a test-driven development exercise, and I was introduced to the challenge with a brief overview of the task, and a collection of failing Cucumber tests. I wrote code to gradually pass each test, until all passed – and in the nick of time too. I had a couple of minutes to discuss my solution and what I would add to it if I had more time, and the 1.5 hours were up.

Step 4 – Meet and Greet

As a firm believer in The Joel Test, I agree that writing code during the interview process is important, but equally important is the rapport between yourself and your potential colleagues.

During the interview process, I met a large percentage of the company over a number of interviews, including a coffee and chat with the entire development team. It’s pretty intimidating stuff, but it gives both parties the opportunity to make sure each will be a good fit for the other.

After a few more days of waiting, I received the call I was hoping for.

Step 5 – On-boarding

Joining Songkick was a super-smooth operation. We run a tight ship (as I was to find out), and my first few days were as follows.

Day 1

I spent the morning being shown around the office: an open plan environment with everything a professional developer needs to maintain a high level of productivity (ping pong table, foosball table, a fully-stocked kitchen and a proper coffee machine).

I was provided with a mentor for the week – Robin. Having someone to sit with you, explain the development environment and application design really helped me to become productive quickly. In fact, I made my first code commit on day one.

Day 2 & 3

I spent the next two days divided between coding (with Robin) and various presentations from the different departments in Songkick. These ranged from the data science team (who handle making sense of the huge amounts of data we have), to QA and infrastructure.

Day 4

The whole company boarded a vintage Routemaster bus, and we were taken to End of the Road festival for the weekend. Did I mention Songkick are awesome[1]?

Step 6 – Develop

By far the biggest change (and probably worry) in my move to Songkick was the development environment. I’ve been working in a .NET ecosystem for a number of years; the framework is stable and Visual Studio is, in my opinion, a great IDE: it’s feature-rich and has some useful plugins. On the other hand, Songkick’s development environment is entirely Unix-based, making use of (and contributing back to) lots of open-source projects.

I do have experience developing in a Linux environment, but haven’t touched it for a few years, so had a feeling I was going to be rusty. After a few days, I was pleasantly surprised to see how far the tools and frameworks have come. Again, having a mentor to guide me through this transition was crucial; I could ask questions and receive answers immediately.

All in all, joining Songkick has been an amazing experience. I’m surrounded by different teams of people (ranging from developers and testers, through to UX experts and designers), all of whom are the best at what they do (but don’t take my word for it, check out the team page). Having a passion for the product is essential, but if you love live music, Songkick is for you.

[1] How about developing for a platform that has millions of users, and enables fans from across the world to see their favourite artists live? And the perks are pretty amazing too: great office, free food and drink, table tennis and foosball, monthly ticket allowance, annual festival trip for the company, etc. I could go on, but you should probably just apply.

Run the right tests at the right time

Way back in June, Dan Crow posted about some of the key principles that we at Songkick believe in. One that I spend some time thinking about every day is, ‘ship early, ship often’. We firmly believe that code should be shipped as soon as it’s ready. From a development point of view this just makes sense. From a user’s point of view this just makes sense. From a testing point of view this proves to be a bit of a challenge.

Shipping fast doesn’t mean shipping untested code and hoping for the best. Every single thing that we release has been tested extensively. Obviously the only way we manage to ship often is by keeping the build/test/release cycle as short as possible. All builds are managed in Jenkins. Pushing code will automatically trigger our unit and integration test suites. If all the tests pass we end up with a green build which can be manually deployed to our test environment. Finally a suite of Acceptance tests run through the browser using Capybara and the Selenium Web Driver to confirm we haven’t broken any of our critical user journeys. These tests are pretty slow, taking roughly 4 minutes to run a handful of scenarios but this is the first check that the user will actually be able to interact with the website.

Only after all these tests have passed will we deploy code to Production. This applies to all new features, bug fixes and even changes to the tests themselves.

The problem

Despite our best intentions we were still struggling to ship changes as soon as they were ready:

In June 2011 we made 7 releases.

In the best case it took 3 hours to build, test and ship code. In reality we were spending around 2 days preparing each release. Something had to change.

Dan Lucraft wrote an excellent post about how we reduced the time it takes to run our tests. It feels pretty obvious to say you can increase release speed if you make your tests run faster but this was only part of the solution. Keeping the test suites fast requires constant diligence. Aiming for 100% test coverage is a distraction. Not only will you never achieve it but if you even came close then your builds would likely be taking far longer than needed to run.

Run the right tests

We took the step of identifying which features we wouldn’t want to break and plotting them against the overhead of running tests. In the case of unit tests you can pretty much add as many tests as you like without too much overhead. Integration tests need to be things that you actually care about. If you discovered a feature was broken during manual testing but wouldn’t hold a release to fix it then you shouldn’t have an automated test for that feature in your build (well, unless it was a super quick unit test).

An example of this is our automatic tweets when authenticated users mark their attendance to an event. It is a valid and highly used service that we wouldn’t want to be without but it is not business critical. If we were to have an automated test for this we would need a test which set up a user who appears authenticated with Twitter. The test user would then mark their attendance to an event and the test would need to check whether the tweet was fired for the correct event.

Not only is that a fair bit of work to write and maintain but the resulting test would be pretty slow to execute. The alternative, to push to production and monitor errors in the logs whilst also keeping an eye on the Songkick twitter feed (something we’re already monitoring), means we have one fewer test to run and maintain. The feedback comes later (post release rather than pre) but since we wouldn’t hold a release even if we knew that we had broken this feature, the actual time to fix is roughly the same.

At the right time

To allow the team to ship fast we need to keep the release channel clear. Builds run through the test suites as cleanly and as quickly as possible to free up the channel for the next release. Part of our process involves establishing up-front how we will test a code change. Usually this will mean adding or modifying automated tests to cover the new functionality. However some of our changes need more than just an automated build run against them so we needed to come up with a way to separate testing from the actual releases.

Our solution was to use what we call Flippers, additional code which lets admins control whether a feature is visible to users. We can then turn features on and off on the live site without needing to make additional releases. As well as giving us a fast way to turn off problem features this has the benefit of allowing us to turn features on for a particular type of user. High risk or extensively changed features are released to production behind a flipper that makes them visible to admin users only. This means we can run the code on the live servers, using live data but test them as if we were working on a test environment.
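
To make the idea concrete, here is a minimal sketch of a flipper in plain Ruby. It is an illustration under assumptions, not our actual implementation: the Flipper class, the three states and the User struct are all hypothetical names chosen for this example.

```ruby
# Hypothetical sketch of a feature flipper; all names here are illustrative.
class Flipper
  STATES = [:off, :admins_only, :on].freeze

  def initialize
    # Features nobody has configured default to being switched off.
    @features = Hash.new(:off)
  end

  # Change a feature's state at runtime, without a release.
  def set(feature, state)
    raise ArgumentError, "unknown state: #{state}" unless STATES.include?(state)
    @features[feature] = state
  end

  # Decide whether a given user (possibly nil, i.e. logged out) sees the feature.
  def visible?(feature, user)
    case @features[feature]
    when :on          then true
    when :admins_only then !user.nil? && user.admin?
    else false
    end
  end
end

# Minimal stand-in for a user record.
User = Struct.new(:name, :admin) do
  def admin?
    admin
  end
end

flipper = Flipper.new
flipper.set(:new_ticket_component, :admins_only)

admin   = User.new('alice', true)
visitor = User.new('bob', false)
```

View code would then wrap the risky feature in a check such as flipper.visible?(:new_ticket_component, current_user), so the same deployed code serves admins and regular users differently until the flipper is switched to :on.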

Fix bugs fast

One problem with testing code on Production is that the bugs you find are also on Production. Obviously many of these bugs aren’t visible to users thanks to the flippers but there will always be some bugs in live code. Our approach is a cultural one: yes, we move fast and accept that things might break, but we don’t leave them like that. We fix bugs as fast as possible.

Sounds interesting but does it work?

We spent 12 months looking at our tests, our process and probably ourselves. Changes were made and in June 2012 we made 113 releases. 14 of those were on the same day. In fact we released on every single working day that month (and there were a few sneaky weekend releases too!).

Our object-based Rails frontend

Part of the rewrite of Songkick’s website was a re-architecture of the main client application, affectionately known as skweb (pronounced /skwɛb/, not /ɛskeɪwɛb/). Skweb, as has been mentioned in other posts, had grown into a monster, not just in size but also in complexity. I was asked to suggest an improved structure for the new simplified application. Based on my observations working on our application and the one I’d worked on at the Guardian, I noticed that a lot of complexity was introduced to make rendering web pages easier. It was as if, since we were so focused on modelling the business logic of the company, we had neglected to model a core function of a web site: presenting HTML pages to the user.

With this in mind I proposed splitting out the modelling of webpages into ‘page models’ that would sit alongside the application models and focus on taking Songkick’s data and turning it into web pages. Each type of page on the website would have a ‘page model’ responsible for rendering the page. This separation would eventually lead naturally to suggesting that we use services to drive skweb, since the page models were built to be agnostic about where their data came from so we could migrate away from our single database more easily.

These days, all the business logic that drives Songkick is contained within internal web services, and skweb’s main job is creating web pages from that information. Certainly there are pages about artists and concerts with tickets and venues so all that vocabulary remains, but it is not the business model of Songkick we are modelling. What we are concerned with is presenting that information in web pages.

Pages, Components, Elements

Once we settled on having page models, it became straightforward to break the page up into its constituent parts. A page has a collection of components, and the components consist of elements. The component is given any data it needs by its enclosing page. Any sufficiently complex components can have their own models that the page model invokes when needed.

The default behaviour for a component which has no data to render is to render nothing. For example if the service that provides data to the component is down, the component should contain the error and emit no output. There should be no stray markup hanging around on the page, and if components need to display something when empty it is up to the page to allow this.

What makes a component?

A component is a discrete module of functionality on the page, that can function independently of other components. Typically you can easily draw a box around a component and it will probably contain a heading and some supporting information. I decided (somewhat arbitrarily) that components are not nestable: you cannot have components inside components. While this constraint is not a technical one, I imposed it to try and reduce complexity in the design. Since components aren’t nestable, if we do need to break them into parts or share code between components then we use elements instead. Components that appear on more than one type of page are called shared components.

An element is something smaller and usually less complex than a component, and may appear in more than one component (if this happens it is called a shared element). An example of this is the attendance buttons that appear all over our site and appear both in the event listings like those found on an artist page and on the individual event pages.

We arrange the view code around pages and components with each page having its own stylesheet, and each component having its own stylesheet, JavaScript and images. We use the same name for each page model and its associated assets, so it’s easy to understand which static assets the component depends on. An advantage of this approach is when a component is removed or refactored there is no ambiguity about which images, CSS files, and JavaScript must be removed or updated.

So how does all this work in practice?

Let’s examine how this works, by following one component through its rendering process. I’m going to use the Map component on the Venue page.

Skweb is still a Rails app and still has the familiar layout, but we’ve added some conventions of our own. First, all pages have a type – ‘venue’, for example – that also provides the name for the CSS file for the page to link to. The page provides methods that expose its components, and it constructs each component by passing in whatever data that component needs. Components have no access to databases, services or the HTTP request; everything they need is given to them via the page model and controller. By convention the name of the component is also the name of the template in the views folder. In fact, it is the use of common names that makes understanding component dependencies easier.

A small fragment of our app might look like this:

skweb/
    app/
        controllers/
            venues_controller.rb
        models/
            page_models/
                venue.rb
            skweb/
                models/
                    venue.rb
        views/
            shared/
                components/
                    _calendar_summary.html.erb
                elements/
                    _attendance_buttons_element.html.erb
                    _event_listings.html.erb
            venues/
                _brief.html.erb
                _map.html.erb
                show.html.erb
    public/
        javascripts/
            songkick/
                component/
                    tickets.js
        stylesheets/
            components/
                venue-brief.css
                venue-map.css
            shared/
                elements/
                    pagination.css
                components/
                    brief.css
            venue.css

When a user visits a Venue page the controller creates a new page object:

class VenuesController < ApplicationController
  def show
    @page = PageModels::Venue.new(venue, logged_in_user)
  end
end

The page model for the Venue includes something to this effect:

module PageModels
  class Venue < PageModels::Base
    def initialize(venue, logged_in_user)
      @venue = venue
      @logged_in_user = logged_in_user
    end

    def brief
      Brief.new(@venue, upcoming_events.total_entries, @logged_in_user)
    end
  end
end

The Brief component is responsible for displaying the venue’s address, map, image and so on, but the Ruby objects only expose data. Markup is confined to the view templates, and rendering is performed by glueing a page model and a view template together.

module PageModels
  class Venue
    class Brief
      def geolocation
        @venue.geolocation
      end
    end
  end
end

Moving to the view, the ‘show’ page for a venue might look like this:

<div class="primary col">
  <%= component('brief', @page.brief) %>
  <%= component('map', @page.brief.geolocation) %>
  <%= shared_component('calendar_summary',   @page.calendar_summary) %>
  <%= shared_component('media_summary',      @page.media_summary) %>
  <%= shared_component('media_links',        @page.media_links) %>
  <%= shared_component('gigography_summary', @page.gigography_summary) %>
</div>

component() and shared_component() are defined in ApplicationHelper and look like this:

def component(component_name, object)
  return '' if object.nil?
  render :partial => component_name, :object => object
end

def shared_component(component_name, object)
  component("shared/components/#{component_name}", object)
end

As you can see, these are really just thin wrappers around partials, but they also enforce that we do not render anything if there is no data to give to the component.

The content of the component is pretty standard ERB:

<div class="component venue-map">
  <a href="<%= google_maps_url(map, :zoom => 15) %>" target="_blank">
    <img src="<%= static_google_maps_image_url(map, :width => 640, :height => 220, :zoom => 15) %>">
  </a>
</div>

As a convenience, the object passed in to the component by its page will have the same name as the component. That is where map comes from in the above code. This is also useful in shared components as they don’t need to know anything about the context in which they are being used and what instance variables it might be using.
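
Both behaviours (rendering nothing for missing data, and exposing the object under the component’s own name) can be tried outside Rails. The following is only a sketch: the TEMPLATES hash is a hypothetical stand-in for Rails partial lookup, the markup is simplified, and ERB#result_with_hash assumes Ruby 2.5 or later.

```ruby
require 'erb'

# Hypothetical stand-in for partial lookup: templates keyed by component name.
TEMPLATES = {
  'map' => ERB.new('<div class="component venue-map"><%= map %></div>')
}.freeze

def component(component_name, object)
  # No data means no output at all, not even an empty wrapper element.
  return '' if object.nil?

  # Expose the object to the template under the component's own name,
  # mirroring what :object => object does for a Rails partial.
  TEMPLATES.fetch(component_name).result_with_hash(component_name.to_sym => object)
end

with_data    = component('map', '51.5,-0.1')
without_data = component('map', nil)
```

Here with_data contains the wrapper div with the value interpolated, while without_data is an empty string, so a page built from several such calls never ends up with stray markup for a component that had nothing to show.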

The Venue page will link to its venue.css file, which looks like:

@import 'shared/components/brief.css';
@import 'components/venue-brief.css';
@import 'components/venue-map.css';
@import 'shared/components/media-summary.css';
@import 'shared/components/event-listings.css';

And the venue-map.css file is short and sweet:

.venue-map
{
  padding: 0;
  position: relative;
  z-index: 5;
  -webkit-box-shadow: 0 4px 2px -2px rgba(0, 0, 0, 0.2);
     -moz-box-shadow: 0 4px 2px -2px rgba(0, 0, 0, 0.2);
          box-shadow: 0 4px 2px -2px rgba(0, 0, 0, 0.2);
}

.venue-map img
{
  vertical-align: bottom;
}

.venue-map
{
  margin-bottom: 26px;
}

@media only screen and (max-width: 767px)
{
  .mobile-enabled .venue-map img
  {
    width: 100%
  }

  .mobile-enabled .venue-map
  {
    padding-left: 0;
    padding-right: 0;
  }
}

The CSS file contains only the CSS that this component needs and includes any CSS for the small screen rendering of that component.

What is that called?

Another aspect of the design was to use pervasive language. The idea is that everyone at Songkick – product managers, designers, and developers – uses the same name for pages and components on the website. The advantage of having a shared language across the company comes through when talking about the site. If someone says, ‘the ticket component is broken,’ I know exactly what they mean. It will correspond to a file called tickets.html.erb in the views; the page model for the component will be called Tickets; its CSS will live in stylesheets/components/tickets.css; the HTML class name on the component is tickets; and any JavaScript needed for the component lives in javascripts/songkick/component/tickets.js. The strong naming convention makes navigating around the project easy and makes finding dependencies very straightforward.
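
One way to picture the convention is as a function from a component name to everything that depends on it. The helper below is hypothetical (nothing like it is claimed to exist in skweb), and the hyphenation of multi-word names is an assumption inferred from file names such as venue-map.css.

```ruby
# Hypothetical helper: given a component name, list where its parts would live
# under the naming convention described above.
def component_conventions(name)
  {
    :model_class => name.split('_').map(&:capitalize).join,
    :template    => "#{name}.html.erb",
    # Assumption: CSS files and HTML classes use hyphens for multi-word names.
    :stylesheet  => "stylesheets/components/#{name.tr('_', '-')}.css",
    :javascript  => "javascripts/songkick/component/#{name}.js",
    :html_class  => name.tr('_', '-')
  }
end

tickets = component_conventions('tickets')
```

The practical benefit is that removing or renaming a component becomes a mechanical search for one name across a handful of predictable locations.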

What does this give us?

The page/component/element structure makes deciding where to put code easier by having very strong conventions. The page models made migrating skweb onto services simpler as it provided a separation between the rendering stack and the source of the data it uses. We were able to behave like we were building on top of services when in some cases the services didn’t exist yet.

We have now also used this architecture on a new application and again the clear demarcation of responsibilities makes deciding where to put code and how to structure it easier and more predictable. That’s not to say that there aren’t costs to this approach: certainly some find the sheer number of files, especially for CSS, difficult to navigate. Others find the insistence on rigidly mapping names across types of files excessive. While this is somewhat down to personal taste, in our experience having a predictable structure of small files with focussed responsibilities has made it easier to maintain our codebase.

The path to SOA

So far, James has explained what Songkick’s current Service Oriented Architecture looks like. I want to step back and talk about one of the hardest things we had to do: once we decided to undertake such a big change, how did we take the first step?

In our case, it made sense to start where it hurt the most: rewriting our biggest project, the songkick.com Rails app, to be a simpler web app without direct access to the ActiveRecord domain models. This would also give us the opportunity to understand the types of resources and API endpoints needed, so the services could later be built based on how they were used by clients. Another benefit of starting with the Rails app itself, instead of the services, was that we would have the immediate benefits of a simpler, decoupled web app.

The plan was for an “inside-out rewrite”, that is, we didn’t start a new project from scratch. Instead, we went template after template on Songkick’s website and re-wrote it end to end, from the models and controller to the views, CSS and JavaScript. This way, our code was continuously integrated, which meant the benefits and flaws of our design were seen as soon as a template was done, instead of emerging with a completely new project months later. The drawback of this approach is that it takes a lot of effort to work with evolving code. However, I think that this is an important skill for us to learn as developers.

We started crossing the SOA chasm by creating application-specific “client model” classes that wrapped ActiveRecord models, and “service” classes that would call the respective methods on those models, decoupling the domain model from the presentation layer.

For example, if this is how an event was loaded on an event page:

class EventsController < ApplicationController
  def show
    @event = Event.find(params[:id])
  end
end

class Event < ActiveRecord::Base
end

This was rewritten to be:

class EventsController < ApplicationController
  def show
    @event = Services::EventListings.event_from_id(params[:id])
  end
end

module Services
  class EventListings
    def self.event_from_id(event_id)
      active_record_event = Event.find(event_id)
      ClientModels::Event.new(active_record_event.to_hash)
    end
  end
end

module ClientModels
  class Event
    def initialize(event_info)
      @id   = event_info['id']
      @date = Date.parse(event_info['date'])
      # etc.
    end
  end
end

class Event < ActiveRecord::Base
  def to_hash
    {
      'id'   => id, 
      'date' => date.to_s, 
      # etc.
    }
  end
end

Instead of accessing an ActiveRecord instance directly, all code in our Rails app would access it via the “service” classes. Those were the only classes allowed to talk to ActiveRecord models. Any response returned by those classes must be a client model instance that is initialized with the same information we would eventually return from our internal APIs.

Starting out like this meant we could easily change the data returned by the “to_hash” method to suit our needs, and still have the benefits of encapsulating what would eventually be the service client code.

When the time came and the services were ready, we simply changed the client service classes over to use HTTP:

module Services
  class EventListings
    def self.event_from_id(event_id)
      event_hash = JSON.parse(http.get("/events/#{event_id}").body)
      ClientModels::Event.new(event_hash)
    end
  end
end

And that’s it! All the application code talking to the service and client model classes remains completely unchanged.
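
The reason this works is that both versions of the service class hand back the same client model. The sketch below shows that property in isolation; LocalEventListings, HttpEventListings and FakeHttp are hypothetical stand-ins (an in-memory hash replaces ActiveRecord and a canned JSON response replaces the real service), so it runs without a database or a network.

```ruby
require 'date'
require 'json'

module ClientModels
  class Event
    attr_reader :id, :date

    def initialize(event_info)
      @id   = event_info['id']
      @date = Date.parse(event_info['date'])
    end
  end
end

# Phase one stand-in: the hash comes from a local record (like to_hash).
class LocalEventListings
  def initialize(records)
    @records = records
  end

  def event_from_id(event_id)
    ClientModels::Event.new(@records.fetch(event_id))
  end
end

# Phase two stand-in: the same hash arrives as JSON over an HTTP-like object.
class HttpEventListings
  def initialize(http)
    @http = http
  end

  def event_from_id(event_id)
    ClientModels::Event.new(JSON.parse(@http.get("/events/#{event_id}")))
  end
end

# Fake transport returning canned JSON for whatever id appears in the path.
class FakeHttp
  def get(path)
    JSON.generate({ 'id' => Integer(path.split('/').last), 'date' => '2012-06-01' })
  end
end

# Application code depends only on the client model's interface.
def event_summary(service, event_id)
  event = service.event_from_id(event_id)
  "Event #{event.id} on #{event.date.strftime('%d %b %Y')}"
end

local  = LocalEventListings.new(42 => { 'id' => 42, 'date' => '2012-06-01' })
remote = HttpEventListings.new(FakeHttp.new)
```

Calling event_summary with either service gives the same string, which is the property that let the controllers and views stay untouched when the backing implementation changed.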

Understanding your product and the domain you are modelling is crucial to being successful on an effort like this. Songkick’s product and design team were essential parts of this project. We were simplifying our technical architecture, but also simplifying and focusing Songkick’s proposition.

Once we had a plan, it took us around 10 weeks to rewrite our Rails app so that every single controller and view was using the new client models. During this period, we also rewrote our front end code to have an architecture that mirrors more closely the pages and visual components used on the website. Stay tuned for more details!

The client side of SOA

This article is part of a series on Songkick’s migration to a service-oriented architecture.

Following on from my previous article on what our backend services look like, it’s time to talk about the client side. How do our user-facing applications use the services, and how is it different from using ActiveRecord?

The nice thing about Rails is it doesn’t force you into using ActiveRecord. If you do, then a lot of conveniences are made available to you, but you’re really free to do whatever you want in your Rails controllers. So, instead of speaking to ActiveRecord models, our applications make HTTP calls to several backend services.

HTTP, do you speak it?

The first bit of the problem is, how do we make HTTP calls? We want this to be extremely convenient for people writing application code, which means avoiding as much boilerplate as possible. We don’t want application code cluttered with stuff like this:

uri = URI.parse("http://accounts-service/users/#{name}")
http = Net::HTTP.new(uri.host, uri.port)
response = http.request_get(uri.path)
if response.code == '200'
  JSON.parse(response.body)
else
  raise NotFound
end

when we could just write:

http_client.get("/users/#{name}").data

And that’s the simple case. When making HTTP calls, you have to deal with a lot of complexity: serializing parameters, query strings vs entity bodies, multipart uploads, content types, service hostname lookups, keep-alive or not, response parsing and several classes of error detection: DNS failure, refused connections, timeouts, HTTP failure responses, user input validation errors, malformed or interrupted output formats… and good luck changing all that if you ever want to switch to a different HTTP library.

So, the first thing we did is create an abstract HTTP API with several implementations, and released it as open-source. Songkick::Transport gives us a terse HTTP interface with backends based on Curb, HTTParty and Rack::Test, all with the same high-level feature set. This lets us switch HTTP library easily, and we’ve used this to tweak the performance of our internal code.

You use it by making a connection to a host, and issuing requests. It assumes anything but a 200, 201, 204 or 409 is a software error and raises an exception, otherwise it parses the response for you and returns it:

http = Songkick::Transport::Curb.new('http://accounts-service')
user = http.get('/users/jcoglan').data
# => {'id' => 18787, 'username' => 'jcoglan'}
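
The status-code rule described above is simple enough to sketch on its own. This is not the library’s code: UpstreamError, Response and process_response are hypothetical names for this example, and the real library has its own classes and does considerably more.

```ruby
require 'json'

# Hypothetical error type; the real library defines its own exception classes.
class UpstreamError < StandardError
  attr_reader :status

  def initialize(status)
    @status = status
    super("unexpected HTTP status #{status}")
  end
end

# Hypothetical response wrapper exposing parsed data, like .data above.
Response = Struct.new(:status, :data)

# Statuses treated as a normal outcome; anything else is a software error.
ACCEPTED_STATUSES = [200, 201, 204, 409].freeze

def process_response(status, body)
  raise UpstreamError.new(status) unless ACCEPTED_STATUSES.include?(status)

  # A 204 (or any empty body) carries no data to parse.
  data = body.to_s.empty? ? nil : JSON.parse(body)
  Response.new(status, data)
end

ok = process_response(200, '{"id":18787,"username":"jcoglan"}')
```

Treating 409 as a non-exceptional response fits validation failures, where the caller wants to read the returned error details rather than rescue an exception.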

Songkick::Transport also has some useful reporting facilities built into it. For example, it makes it easy to record all the backend service requests made during a single call to our user-facing Rails app, and log the total time spent calling services, much like Rails does for DB calls. More details in the README.

Who needs FakeWeb?

The nice thing about having a simple flat API for doing HTTP is that it’s really easy to test clients built on top of Songkick::Transport, as opposed to something like FakeWeb that fakes the whole complicated Net::HTTP interface. In each application, we have clients built on top of Songkick::Transport that take an HTTP client as a constructor argument. When they make an HTTP call, they wrap the response data in a model object, which allows the application to shield itself from potential changes to the API wire format.

module Services
  class AccountsClient
    def initialize(http_client)
      @http = http_client
    end
    
    def find_user(username)
      data = @http.get("/users/#{username}").data
      Models::User.new(data)
    end
  end
end

module Models
  class User
    def initialize(data)
      @data = data
    end

    def username
      @data['username']
    end
  end
end

This approach makes it really easy to stub out the response of a backend service for a test:

before do
  @http   = mock('Transport')
  @client = Services::AccountsClient.new(@http)
end

it "returns a User" do
  response = mock('Response', :data => {'username' => 'jcoglan'})
  @http.stub(:get).with('/users/jcoglan').and_return(response)
  @client.find_user('jcoglan').username.should == 'jcoglan'
end

It also makes mock-based testing really easy:

it "tells the service to delete a User" do
  @http.should_receive(:delete).with('/users/jcoglan')
  @client.delete_user('jcoglan')
end

Being able to stub HTTP calls like this is very powerful, especially when query strings or entity bodies are involved. Your backend probably treats foo=bar&something=else and something=else&foo=bar the same, and it’s much easier to mock/stub on such parameter sets when they’re expressed as a hash, as in

http.get '/', :foo => 'bar', :something => 'else'

rather than as an order-sensitive string:

http.get '/?foo=bar&something=else'
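
The difference is easy to demonstrate with nothing but the standard library. In this small sketch, the two query strings are the ones from the paragraph above; everything else is plain Ruby.

```ruby
require 'uri'

a = 'foo=bar&something=else'
b = 'something=else&foo=bar'

# Compared as strings, the two queries differ because order matters.
strings_equal = (a == b)

# Parsed into hashes, they are equal: Hash equality ignores insertion order.
params_a = URI.decode_www_form(a).to_h
params_b = URI.decode_www_form(b).to_h
hashes_equal = (params_a == params_b)
```

A stub that matches on a hash of parameters therefore keeps working however the client happens to order them, whereas a stub on the raw string would break on a harmless reordering.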

It’s also worth noting that the models are basically inert data objects, and in many cases they are immutable values. They don’t know anything about the services, or any other I/O device, they just accept and expose data. This means you can use real data objects in other tests, rather than hard-to-maintain fakes, and still your tests run fast.

Convenience vs flexibility

Nice as it is to be able to choose which HTTP implementation you use, most of the time the application developer does not want to write

http   = Songkick::Transport::Curb.new('http://accounts-service')
client = Services::AccountsClient.new(http)
user   = client.find_user(params[:username])

every time they need to look up a record. The flexibility helps with testing and deployment concerns, but it’s not convenient. So, we put a layer of sugar over these flexible building blocks that means most of the things an application needs to do are one-liners. We have a Services module that provides canonical instances of all the service clients; it deals with knowing which hostnames to connect to, which HTTP library to use, and which client object to construct for each service.

module Services
  def self.accounts
    @accounts ||= begin
      http = Songkick::Transport::Curb.new('http://accounts-service')
      AccountsClient.new(http)
    end
  end
end

With this layer of sugar, getting a user account is one line:

user = Services.accounts.find_user(params[:username])

In our Cucumber tests, we tend to stub out methods on these canonical instances, or make a Services method return an entirely fake instance. The cukes are not complete full-stack tests; they are integration tests of the current project, rather than of the entire stack, and the lack of backend I/O keeps them very fast. The stability of the underlying service APIs means we aren’t taking a big risk with these fakes, and we have a few acceptance tests that run against our staging and production sites to make sure we don’t break anything really important.
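To give a feel for the "entirely fake instance" approach, here is a minimal plain-Ruby sketch. FakeAccountsClient, FakeUser and the accounts= writer are all hypothetical names invented for this example; with a stubbing library you would more likely stub Services.accounts directly.

```ruby
# Hypothetical stand-in for a user model.
FakeUser = Struct.new(:username)

# Hypothetical in-memory replacement for Services::AccountsClient.
class FakeAccountsClient
  def initialize(users)
    @users = users
  end

  # Same method name as the real client, but no HTTP involved.
  def find_user(username)
    FakeUser.new(@users.fetch(username)['username'])
  end
end

# Simplified stand-in for the Services module; the writer is an assumption
# added so that a test can swap in a fake.
module Services
  class << self
    attr_writer :accounts

    def accounts
      @accounts or raise 'no accounts client configured'
    end
  end
end

Services.accounts = FakeAccountsClient.new('jcoglan' => {'username' => 'jcoglan'})
user = Services.accounts.find_user('jcoglan')
```

Application code keeps calling Services.accounts.find_user as before, and nothing touches the network.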

What about error handling?

We want it to be as easy as possible to deal with errors, since messy error handling can hamper the maintainability of a project and introduce mistakes that make things harder for end users. For this reason, we made any response other than 200, 201, 204 or 409 from a backend raise an exception. For example, if the accounts service returns a 404 for this call, an exception is raised:

Services.accounts.find_user('santa_claus')

The exception raised by Songkick::Transport contains information about the request and response. This means you can put a catch-all error handler in your Rails or Sinatra app to catch Songkick::Transport::HttpError, and forward the 404 from the backend out to the user. This removes a lot of error handling code from the application.
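One way to picture a catch-all handler is as a small Rack-style middleware. The sketch below is self-contained, so it uses a stand-in error class rather than the real Songkick::Transport::HttpError; the status reader and the ForwardBackendErrors name are assumptions made for illustration.

```ruby
# Stand-in for Songkick::Transport::HttpError; the `status` reader is an
# assumption made for this sketch, not the library's documented API.
class BackendHttpError < StandardError
  attr_reader :status

  def initialize(status)
    @status = status
    super("backend responded with #{status}")
  end
end

# Rack-style middleware: any backend error that escapes the app becomes a
# response with the same status code.
class ForwardBackendErrors
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  rescue BackendHttpError => e
    [e.status, {'Content-Type' => 'text/plain'}, ["Backend error: #{e.status}"]]
  end
end

# A toy app whose backend lookup fails with a 404.
failing_app = lambda { |env| raise BackendHttpError.new(404) }
status, _headers, _body = ForwardBackendErrors.new(failing_app).call({})
```

With something like this in place, individual actions do not need their own rescue clauses for the common case.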

In some cases though, you don’t want this behaviour. For example, say we’re rendering an artist’s page and we have a sidebar module showing related artists. If the main artist gives a 404, then the whole page response should be a 404. But if we can’t get the related artists, or their profile images, then we don’t want the whole page to fail, just that sidebar module. Such cases tend to be the minority in our applications, and it’s easy enough to catch the service exception and render nothing if the services backing a non-core component fail. Using an object model of our user interface helps to isolate these failures, and we hope to cover that in a future post.
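A minimal sketch of isolating a non-core component might look like the following. ServiceError stands in for whatever exception the service client raises, and render_optional is a hypothetical helper, not part of any library.

```ruby
# Stand-in for the exception raised when a backend call fails.
class ServiceError < StandardError; end

# Hypothetical helper: render a non-core component, or nothing if its
# backend fails.
def render_optional
  yield
rescue ServiceError
  ''
end

main    = '<h1>Grandaddy</h1>'
sidebar = render_optional { raise ServiceError, 'recommendations unavailable' }
page    = main + sidebar
```

The main artist lookup is deliberately left unwrapped, so a failure there still propagates to the catch-all handler and fails the whole page.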

Repeat after me: sometimes, you should repeat yourself

One open question when we moved to this model was: should we maintain client libraries for each service, or just make whatever calls we need in each application? The DRY principle suggests the former is obviously the best, but it’s worth asking this question if you do a project like this.

We went with the latter, for several reasons. First, since the services and Songkick::Transport encapsulate a lot of business and wire logic, the client and model classes in each application end up being pretty thin wrappers, and it isn’t hard to build just what you need in each project. Second, we got burned by having too many things depending on in-process Ruby APIs, where any change to a shared library would require us to re-test and re-start all downstream applications. This coupling tended to slow us down, and we found that sharing in-process code isn’t worth it unless it’s encapsulating substantial complexity.

Each application is free to tweak how it interacts with the service APIs, without affecting any other application, and this is a big win for us. It means no change to one application can have side effects or block work on another application, and we haven’t actually found ourselves reinventing substantial pieces of logic, since that’s all hidden behind the HTTP APIs.

And finally, having per-application service clients gives you a really accessible picture of what data each application actually relies on. Having one catch-all domain library made this sort of reasoning really difficult, and made it hard to assess the cost of changing anything.

Wrapping up

So that’s our architecture these days. If you decide to go down this route, remember there’s no ‘one right way’ to do things. You have to make trade-offs all the time, and the textbook engineering answer doesn’t always give your team the greatest velocity. Examine why you’re making each change, focus on long-term productivity, and you won’t go far wrong.

SOA: what our services look like

This article is part of a series on Songkick’s migration to a service-oriented architecture.

Since I began mentioning to people that Songkick is migrating its user-facing Rails app and lots more supporting components to service-oriented architecture, I’ve been asked many times to explain how we’re doing it. Truth is, it took us a while to figure this out. Any departure from the Rails Way suddenly requires all sorts of dangerous things like Decisions and Creativity and Disagreement. What we have today is the mostly-stable result of rounds of iteration, trial-and-error and debate.

What do you mean by services, exactly?

When we say SOA, we mean we’re replacing all the ActiveRecord-based data access and business logic in our applications with a number of orthogonal web services:

  • event-listings handles data relating to concerts, artists, venues, and so on
  • accounts handles users’ account data and authentication
  • taste-imports processes sets of artists uploaded by various sources of user taste data
  • caltrak handles users’ taste data and calendar generation
  • attendance stores concerts users have said they are going to
  • media handles and stores file uploads – photos, videos and the like
  • recommendations determines sets of similar artists

These are all just Sinatra apps that return JSON for the most part. They encapsulate all the business logic previously held in our ActiveRecord models; indeed, they are still based on these models at present. But they don’t simply mirror the ActiveRecord APIs: they reflect how data is used rather than how it’s stored.

ActiveRecord models tend to reflect the design of normalized databases, which reflect the static properties of entities involved. Let’s take an example. Say I ask you to design an ActiveRecord schema for modelling concerts. Most people would come up with something close to our actual model, which is:

  • A polymorphic type Event with two subtypes Concert and Festival
  • The Event class has a date property and optionally an end_date
  • The Event belongs to a Venue, which belongs to a City, which belongs to a MetroArea, which belongs to a Country, and all of these entities have a name
  • The Event has many Performances
  • Each Performance belongs to an Artist and has a billing, either headline or support
  • All Artists have a name, and other metadata like popularity

This makes sense as a database design, but doesn’t reflect how the data is used. Usually, when dealing with an event, you want all of the above information, which means accessing about seven tables. Hope you didn’t miss a JOIN somewhere!

So, we could have exposed all these as distinct resources in our services, with links from each resource to those related to it, but that would be a giant waste of HTTP requests when you always want all this information all at once. It also makes it harder to write client code for the common case – you’d need to write code to follow all those links in every app you build on top of such a service. That’s what I mean when I say the services should reflect how data is used rather than how it is stored. Here’s a request I just made to find out all about Grandaddy’s upcoming show at the Shepherds Bush Empire in September.

$ curl appserver:9101/events/12511498

{
    "id":           12511498,
    "type":         "Concert",
    "status":       "ok",
    "path":         "/concerts/12511498-grandaddy-at-o2-shepherds-bush-empire",
    "date":         "2012-09-04",
    "startTime":    "2012-09-04T19:00:00+0000",
    "endDate":      null,
    "upcoming":     true,
    "profileImage": {"id": 523306, "type": "Image"},
    "series": null,
    "performances": [{
        "artist": {
            "id":           63366,
            "name":         "Grandaddy",
            "path":         "/artists/63366-grandaddy",
            "popularity":   0.044921,
            "active":       true,
            "profileImage": {"id": 523306, "type": "Image"},
            "upcomingEventsCount": 11
        },
        "id":       24380668,
        "billing":  "headline"
    }],
    "venue": {
        "id":         38320,
        "internalId": 38320,
        "name":       "O2 Shepherd's Bush Empire",
        "path":       "/venues/38320-o2-shepherds-bush-empire",
        "smallCityLongName": "London, UK",
        "unknown":    false
    }
}

Everything an app wants to know about an event, in one HTTP call. I’m reminded of this quote, which always springs to mind when I’m putting a boundary between my business logic and a user interface:

Remember that the job of your model layer is not to represent objects but to answer questions. Provide an API that answers the questions your application has, as simply and efficiently as possible. Sometimes these answers will be painfully specific, in a way that seems “wrong” to even a seasoned OO developer.

ORM is an anti-pattern

As well as encapsulating common queries, the services encapsulate operations we often need to perform, such as recording that someone likes At The Drive-In, or creating a new concert. The services are not a thin skin over the database; they encapsulate all our domain logic so it does not get replicated in various applications. The amount of code you need in an app in order to access this logic is fairly minimal, and I’ll explain what it looks like in a future post.
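In the meantime, here is a rough sketch of how little an app needs to do once it has the event document shown above. It parses a trimmed copy of that JSON with Ruby's standard json library; the variable names are illustrative and this is not the client code Songkick uses.

```ruby
require 'json'

# A trimmed copy of the event document shown above.
body = <<-EOS
{
  "id": 12511498,
  "type": "Concert",
  "date": "2012-09-04",
  "performances": [
    {"artist": {"id": 63366, "name": "Grandaddy"}, "billing": "headline"}
  ],
  "venue": {"id": 38320, "name": "O2 Shepherd's Bush Empire"}
}
EOS

event = JSON.parse(body)

# Everything needed is already in the one document: no further requests.
headliners = event['performances'].
  select { |perf| perf['billing'] == 'headline' }.
  map    { |perf| perf['artist']['name'] }

summary = "#{headliners.join(', ')} at #{event['venue']['name']} on #{event['date']}"
```

The artist, venue and date all come from the single response, which is the point of shaping the API around how the data is used.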

What is this buying us?

The core maintainability problem with a large monolithic application, like our songkick-domain library, is that internal coupling tends to creep in over time, rendering it hard to change one thing without affecting a lot of unexpected components. Every time you commit a change to the monolithic core, all the apps depending on it need re-testing and re-starting.

Monolithic database abstractions in particular are problematic because they’re coupled to, well, a monolithic database. If you have everything in one big MySQL DB, chances are parts of that DB are under much heavier load than others. It’s hard to add more capacity in this situation without replicating the whole database; you’d rather have your data split into chunks that can be horizontally scaled independently. This both makes scaling easier and reduces cost, since you’re not wasting DB machine capacity on lots of data that probably doesn’t need replicating (yet).

Creating a set of decoupled services gives us a way to deal with that: by creating an explicit boundary layer that’s designed to be kept stable, we can change the internals of the services without breaking the apps downstream, and do it faster than if the apps were still coupled to the Ruby implementation of this logic. As our applications are moved off of ActiveRecord and onto these service APIs, the volume of code coupled to our models is going down by orders of magnitude, so we can more easily chip away at these models and begin to split them up, assigning them just to the services that need them.

I mentioned in my previous post that, because of the amount of coupled Ruby code living in one process, we’ve been stuck on old versions of Ruby, Rails and other libraries for some time. Splitting our code up like this greatly reduces the amount of code living in the same process, and makes it easier for us to upgrade our dependencies, or totally change what language or hosting platform a service runs on.

The boundary creates an awareness among the team that this is a deliberate stable API, and makes the abstraction boundary more obvious than it is with a bag of Ruby APIs that all live in the same process. But we can only do this because we understand the problem domain sufficiently. We’ve been working on Songkick for five years, and so we have a much better understanding of how to divide the domain up than when we started. Of course, when you start a project, you have no idea about half the stuff that’s going to end up in it, so this migration should be seen as refactoring, rather than as a cookie-cutter architecture to adopt from day one.