Service-oriented Songkick

For a few months now, we’ve been transitioning Songkick to a service-oriented architecture (SOA). This is the first in what will hopefully be a series of articles on what that means, how we’re doing it, and what benefits it’s bringing us. But first, some history.

In the beginning

Songkick has, for its five-year history, been a Rails app. (Well, there was a prototype in PHP but you didn’t hear that from me, right.) It was still a Rails app by the time I joined, two years into the project in 2009. And I mean it was a Rails app and nothing else. Although the system consisted of a database, a website, message queues, background processing, file storage, daily tasks like email notifications, and so on, it was all one big project.

Oh sure, we told ourselves all the non-web components were separate projects, but they were included in the Rails app via git submodules. They all shared code from app/models and lib. They all changed together. If you changed our background job processor you had to bump a submodule in the Rails app, run all the tests and deploy the entire thing to all the machines.
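
For anyone who hasn’t lived through this, the ritual for every change to a shared component looked roughly like the following (the submodule path is hypothetical, but the steps are standard git):

    # Inside the Rails app, bump the submodule to the new version
    cd songkick-website
    cd vendor/songkick-queue        # hypothetical submodule path
    git checkout master && git pull
    cd ../..
    git add vendor/songkick-queue
    git commit -m "Bump songkick-queue submodule"
    # ...then run the full build and redeploy the whole app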

Oh and did I mention the build took two hours spread across a handful of Jenkins (then Hudson) build slaves? It’s a wonder we ever shipped anything.

Time for some house-cleaning

If you’ve worked on any early-stage, rapidly growing product you probably recognize this scenario. You’ve been adding features and tests all over the place, you’re not sure which ones have value but you keep all of them anyway, and you focus on releasing as fast as possible. We went through two major versions of the product like this, and it’s fine when your team and the codebase are relatively small. Everyone knows where everything is, it’s not that hard to maintain.

But in the medium and long term, this doesn’t scale. The Big Ball of Mud makes it increasingly hard to experiment, to bring new hires up to speed, or to deal with sudden scaling issues. We needed to do something.

Step 1: get organized

We began this process in mid-2010 by extracting the shared part of our codebase into a couple of libraries, songkick-core and songkick-domain. Core mostly contains fairly generic infrastructure stuff: APIs for locating and connecting to databases and message queues, managing per-environment config, logging/diagnostics support etc. Domain contains all the shared business logic, which given our history means all our ActiveRecord models and observers, and anything else we needed to share between the website and the background processes: the web routes, third-party API clients, file management, factory_girl definitions and shared cucumber steps, etc.

This was a really useful step forward since it let us take all our background process code out of the Rails app and deploy it separately. Each project just included Core and Domain as submodules and through them got access to all our infrastructure and business logic. Happy days. It was a great feeling not to have all that stuff gunking up our website’s codebase, and it meant it didn’t need re-testing and re-deploying quite so often.

Step 2: encourage small projects

One great way to keep development sustainable is to favour small components: libraries with focused responsibilities that you can easily reuse. Encouraging this style of development means making it easy to develop, integrate, and deploy many small components rather than one big ball. The easier this is, the more likely people are to create such libraries.

Unfortunately, despite our restructured codebase this was nowhere near easy enough. Using git submodules meant that any time Core or Domain was changed, one had to bump those submodules in all downstream projects, re-test and re-deploy them. We needed something more dynamic that would ease this workload.

The first thing we tried was Rubygems. We started packaging Core as a gem and a Debian package, which is how we distribute all the libraries we rely on. We thought that by using semantic versioning we could force ourselves to pay better attention to our API design. This turned out to be wishful thinking: this is a core component on which everything depends, and it has to change fairly frequently. It’s the sort of thing that should be deployed from git by Capistrano, not through formal versioning and apt-get. The fact that it was now a globally installed library also made it really hard to test changes and roll them out incrementally. Long story short, we ended up at version 0.3.27 before giving up on this system.

(I can already hear everyone saying we should have used Bundler. Another consequence of the time we started the project is that we run Rails 2.2 on Ruby 1.8.7 and Rubygems 1.3.x, and making Bundler work has proved more trouble than it’s worth. Upgrading Rails and Ruby is, let’s say, Somewhat Difficult, especially with the volume of code and libraries we have, and at a startup there’s always something more urgent to do. These days we have a bunch of apps and services running on modern Ruby stacks, but it’s still not pervasive. Part of this process is about decoupling things so we can change their runtimes more easily.)

Step 3: tools, tools, tools

So we needed a migration path to get to a more sustainable model. In 2011 we built a dependency tracker called Janda (don’t ask) to make it easier to manage and encourage lots of small projects. It was based on a few key ideas borrowed from Bundler and elsewhere:

  • Every project declares which others it depends on
  • Circular dependencies are not allowed
  • Dependencies can be loaded from a global location or vendored inside the project
  • A project cannot load code from anything not in its dependency graph
  • Versioning is done with git
  • Builds are run by checking dependencies out into the project itself, and the system tracks which versions of components have been tested together
  • We only deploy one version of each project to production at any time
  • The deployment system makes sure the set of versions we deploy are mutually compatible, based on build results
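
Janda probably deserves a post of its own, but at its heart it’s dependency-graph bookkeeping. As a flavour of the kind of check it performs, here’s a minimal sketch in plain Ruby (illustrative only, not Janda’s actual internals) that computes a load order from declared dependencies and rejects cycles:

    # A minimal sketch of dependency resolution: compute a load order
    # from each project's declared dependencies, rejecting cycles.
    # Illustrative only -- this is not Janda's actual code.
    def load_order(deps)
      order, visiting, visited = [], {}, {}
      visit = lambda do |project|
        raise "circular dependency involving #{project}" if visiting[project]
        return if visited[project]
        visiting[project] = true
        (deps[project] || []).each { |dep| visit.call(dep) }
        visiting.delete(project)
        visited[project] = true
        order << project
      end
      deps.keys.each { |project| visit.call(project) }
      order
    end

    deps = {
      'songkick-website' => ['songkick-domain', 'songkick-core'],
      'songkick-domain'  => ['songkick-core'],
      'songkick-core'    => []
    }
    p load_order(deps)
    # => ["songkick-core", "songkick-domain", "songkick-website"]

The build and deploy tooling then works over the same graph: when a component changes, everything downstream of it needs re-testing, and nothing else does.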

This gave us several important things: a system for dynamically locating and loading dependencies, which let us stop using submodules and manually updating them; a dependency-aware build and deployment system that made it easy to check what needed testing as a result of every change; and a framework imposing some light restrictions on how code could be structured.

Building this tool exposed dozens of places in our code where we had implicit and circular dependencies we weren’t aware of. To make our software work with this system, we had to get it into better shape through refactoring. This process itself led to several new libraries being extracted so they could be safely shared and tracked. It was a big step forward, and helped us ship code faster and with more confidence.

Step 4: break the dependencies

That probably sounds like a weird thing to say after spending all that effort on a dependency tracker. But in truth it was always going to be an interim measure; we want to be using the same Ruby toolchain everyone else is, it’s just easier that way. Plus, we have mounting pressure in other areas. Domain is still a big project, full of dozens of classes that know too much about each other. Every ActiveRecord model we have is free to interact with the others. It’s hard to change it without breaking anything downstream, and it’s making it harder for us to split our monolithic database into chunks that can scale independently. All familiar scaling woes.

So, since late last year we’ve been working on the current stage of growing our codebase: replacing all our couplings to ActiveRecord, and the Domain project as a whole, with web services. We have a handful of services that expose JSON representations of various facets of our domain. One service handles concert data, one handles user accounts, one deals with uploaded media, and so on. Long-term, the aim is to get to a stage where we can change the internals of these services – both their code and their datastores – independently of each other, and independently of the apps that use them.

These services put an explicit, stable boundary layer into our stack that makes it easier to work on components on either side of the line independently. They reduce coupling, because apps now make HTTP calls to stable, language-agnostic APIs rather than loading giant globs of Ruby code, and they simplify deployment: if you change a service, you don’t need to restart its clients, because there’s no code for them to reload.
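
To make that concrete, here’s roughly the shape of the boundary from a client app’s point of view. Everything here (the service name, host and route) is made up for illustration; later posts will dig into the real thing:

    require 'net/http'
    require 'json'

    # Illustrative sketch: instead of loading ActiveRecord models from
    # Domain, the website asks a service for a JSON representation.
    class ConcertService
      def initialize(host)
        @host = host
      end

      # e.g. GET /concerts/123 -> {"id":123,"artist":"...","venue":"..."}
      def find_concert(id)
        uri = URI.parse("http://#{@host}/concerts/#{id}")
        response = Net::HTTP.get_response(uri)
        raise "concert service returned #{response.code}" unless response.code == '200'
        JSON.parse(response.body)
      end
    end

    concerts = ConcertService.new('concerts.internal.example.com')
    # concerts.find_concert(123)['artist']  -- plain data, no shared classes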

Enough pontificating, show us the code!

We’re going to get into the details of how we’re implementing this in later articles. There’s a lot we can talk about, so if you have any questions you should drop us a line on Twitter.

From 15 hours to 15 seconds: reducing a crushing build time

Over the past year we have reduced our website test suite build time by over 99.9%.

  • Build time a year ago: 15 hours.
    Across 15 EC2 build slaves it took “only” 1 hour of real time.

  • Build time today: 15 seconds.
    On my laptop.

Having a build that took over an hour to run crippled the productivity of our team.

So, how did we make such a drastic improvement? There were no quick fixes, though Lord knows we tried to find them. Instead we have had to completely change the way we test.

Rather than any brilliant new techniques, what we found were three big mistakes that had created such a monster build time in the first place. We went down a wrong path, and it took a lot of time and effort to fix it later.

Bad Practice #1: We favoured integration tests over unit tests

We used to be extremely thorough in our integration tests. We used them to test everything, usually instead of unit tests, which were comparatively thin on the ground. Since integration tests are far, far slower than unit tests, this caused a lot of unnecessary work.

To fix this we looked at each integration test in turn and either:

  • ditched it (i.e. we increased our tolerance for broken things in exchange for having a faster build)
  • rewrote it as a unit test on a specific class (see the sketch after this list)
  • kept it, as we still needed a few integration tests for each component
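
To give a flavour of the second option, here’s a simplified, hypothetical example of such a rewrite. Instead of booting the whole stack to check that a formatted date appears on a page, we test the class that owns that behaviour directly (we use RSpec; the class is invented for illustration):

    # Before: a Cucumber scenario that exercised the full stack just to
    # check one piece of presentation logic:
    #
    #   Scenario: Concert date is shown
    #     Given a concert on 21 June 2012
    #     When I visit the concert page
    #     Then I should see "Thursday 21 June 2012"

    # After: a plain unit test on a (hypothetical) class extracted from
    # the view code. No browser, no database, and it runs in milliseconds.
    require 'date'

    class DateFormatter
      def initialize(date)
        @date = date
      end

      def to_display
        @date.strftime('%A %d %B %Y')
      end
    end

    describe DateFormatter do
      it 'formats a concert date for display' do
        formatter = DateFormatter.new(Date.new(2012, 6, 21))
        formatter.to_display.should == 'Thursday 21 June 2012'
      end
    end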

Bad Practice #2: We had many, many features that were relatively unimportant

Many of the less used or less strategic features on songkick.com have gone. This was an extremely painful decision to make, and we made it for bigger reasons than just improving our build time. But it certainly improved the build time a lot.

Fixing this and the previous point has turned a library of 1642 Cucumber scenarios into just 200.

Bad Practice #3: Our integration tests were actually acceptance tests

This test suite used to integrate over our website, domain library, database, message bus and background job workers. Each was spun up as a separate process in the test environment. We basically ran all our website tests against a working version of our entire system. Remember I said we tested virtually every code path? This added up to a lot of time.

Nowadays, our website integration tests are really only integration tests. They integrate over code inside a single project. Every interface to another project is stubbed.

All our database access code is isolated in a domain library behind a thin service layer and is stubbed in the website project.
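
In practice the stubbing can be as simple as swapping the real service client for an in-memory fake. A sketch, with hypothetical names (how the fake gets wired in depends on your app):

    # The website's tests replace the real service client with a fake,
    # so no database, message bus or background workers need to run.
    class FakeConcertService
      def find_concert(id)
        { 'id' => id, 'artist' => 'Blur', 'venue' => 'Hyde Park' }
      end
    end

    # Wherever the website would build a real client, the test
    # environment supplies the fake instead:
    concerts = FakeConcertService.new
    concerts.find_concert(123)['artist']   # => "Blur"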

Instead of over a thousand acceptance tests, we now have fewer than 10. They run against our staging and production environments, after deployment, instead of slowly booting up a test environment during the build.
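
Those few remaining acceptance tests can stay very simple, because all they have to prove is that the deployed system hangs together. A minimal sketch (the host and the check are illustrative):

    require 'net/http'
    require 'test/unit'

    # A post-deploy smoke test: hit the real staging or production site,
    # rather than booting up a test environment during the build.
    class SmokeTest < Test::Unit::TestCase
      HOST = ENV['SMOKE_HOST'] || 'www.songkick.com'

      def test_homepage_responds
        response = Net::HTTP.get_response(URI.parse("http://#{HOST}/"))
        assert_equal '200', response.code
      end
    end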

Six months later

Productivity is up! Morale is up! It’s amazing just how much a faster build has improved our working experience.

Remember that the suite described above was only one of our builds. We had multiple projects with builds that took more than 30 minutes to run. Today none of our test builds takes longer than 5 minutes, and even that is considered “slow”.

These mistakes are far clearer to us in hindsight than they were at the time, so I’d recommend looking carefully to make sure you are not infected by any of these bad practices. If you are, take the time to fix the problem. If you’re not, congratulations on avoiding the pitfalls we fell into!

PS: Bonus quick fixes for reading this far

Ok, so there were one or two things we did that probably count as Quick Fixes:

  • tweak our Ruby interpreter’s GC settings by trial and error (we use REE)
  • run the build entirely on tmpfs, the in-memory file system

Both of these gave surprisingly significant improvements, and accounted for perhaps 10% of the speedup.
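
For the curious: REE’s GC is tuned through environment variables, and the tmpfs trick is a one-line mount. The variable names below are REE’s documented settings, but treat the values as illustrative; ours came out of the trial and error mentioned above.

    # REE GC tuning: variable names are REE's, values are illustrative
    export RUBY_HEAP_MIN_SLOTS=600000
    export RUBY_HEAP_SLOTS_INCREMENT=300000
    export RUBY_HEAP_SLOTS_GROWTH_FACTOR=1
    export RUBY_GC_MALLOC_LIMIT=60000000
    export RUBY_HEAP_FREE_MIN=100000

    # Run the build from an in-memory file system (path is illustrative)
    sudo mount -t tmpfs -o size=2g tmpfs /var/build-workspace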

The poetry of doors

This is not an article about Jim Morrison. This is about the other sort of door:

door: (n) a hinged, sliding, or revolving barrier at the entrance to a building, room, or vehicle, or in the framework of a cupboard.

More specifically it’s about locking and unlocking doors, and what that might teach us about usability design.

The Songkick office door gets a new lock

We’ve recently upgraded the Songkick front door from a traditional hardware lock to a fancy new electronic entry system. This is great. I no longer have to fumble through my keys, finding the right one, fitting it into the old and gnarly lock, which half the time stuck, while holding a laptop, a latte and a donut in the other hand. Now I have a sleek keyfob which is much smaller and lighter than the key it replaces. I just swipe the fob near the electronic reader and I’m in. No more spilled lattes or dropped donuts. Huzzah.

Ingress, egress, but not egrets

So, getting into the Songkick office is much improved. What about getting out? Is it just as easy?

Sadly, the answer is “no”.

The lock is there to stop unauthorized people getting into the office and stealing our lovely stuff. Having the door locked is necessary, but it should be as easy as possible for authorized people to get in. The electronic lock meets these requirements nicely. Mission accomplished.

But, making it hard to leave the office is not a goal. Once I’m inside, getting out should be as easy as possible – especially if there is an emergency.

So does the new Songkick electronic lock make leaving easy? It does not. Here’s a view of the door from the inside. How would you open the door?

As you approach a door, you’re naturally looking right at it: people tend to look at the object they want to operate in order to figure out how to operate it (well, d’uh). So you’ll look at the door and see the small brass handle next to the window. Here’s our first problem: in fitting the new lock we left the old handle in place, but it doesn’t work anymore. It’s very confusing for people who have used the door before, and it trips up new users because it’s placed exactly where you’d expect the mechanism that unlocks the door to be.

So here’s Lesson One: don’t leave behind old stuff that doesn’t work any more, it distracts users from the new, working stuff.

Okay, so after some fumbling, we realize that the handle is in fact just there to mock us. How do we open the door?

The wall to the right of the door is covered in stuff. There’s a phone, for example. Don’t worry, you don’t have to call a doorman to get permission to exit; that’s the entry phone for the door downstairs. Below that are two similar-sized boxes. If you’re trying to exit, you will end up looking through all these objects to the right and eventually notice the box marked “Press to exit”, with a light switch on it. Really? This is how I open the door? Yes, yes it is.

What is wrong with this arrangement? Here are a few problems:

  • The exit switch is in the wrong place. To open the door I have to hunt for the switch, because it’s not in the obvious place: on the door. Lesson two: put the control that operates an object on the object itself.
  • The switch is lost in the clutter of signs and switches. It isn’t easy to pick out from the jumble of other non-door related stuff. Everyone needs to open the door, very few need to read Songkick’s fire alarm details, so move the fire alarm and its notice somewhere else and make the switch more visible. Lesson three: make the important, frequently used controls stand out from the clutter.
  • I can only tell that this is the door switch by reading the label. If it weren’t for the words “Press to exit”, I’d have no clue this switch would unlock the door. It’s a standard light switch; why would I believe it opens the door rather than, say, turning on the lights? Lesson four: don’t use a control that is associated with one operation for a different operation; it’s confusing.

But for all these detailed flaws in the choice of switch and its location, there is an even better way to solve this design problem. We don’t need a switch for the door at all. All I should have to do is push the door open: that is a natural and obvious mechanism for opening this door.

So the real design for the door cuts through all the confusion of the current “design” and just lets me do the obvious thing. How much better would that be?

Interestingly, a door that pushes open is better both for experienced users and novice users – in this case both Songkick employees who used the old mechanism and infrequent visitors to our office. People often complain that designers tend to think only about one group or the other, but often simple, thoughtful design can cater well to both groups.

Stop moaning already

There are two counter-arguments to all this:

First, maybe it’s hard for the lock mechanism to distinguish between a push to open it from the inside and a pull from the outside, which should not open the door. That’s an engineering problem, not a design one: we should start with the design that works best for users and figure out if there is a plausible engineering solution, not the other way around. Only compromise the design if you really can’t solve the engineering problem. I can guarantee you that is not what’s happened here. I’ve used doors that unlock on a push and are secure in the other direction. It’s really not rocket science.

Second objection: why all this fuss about a door? People can figure it out, you’re overthinking this, it will take you a day or two to learn to press that exit button. Stop sweating the small stuff.

In design, whether of doors or software, the small stuff matters a lot. When people talk about Apple’s products being polished, they mean that Apple pay attention to exactly these sorts of details. Great design is all about making the user experience as simple as possible, but no simpler. This is not the iPad of door mechanisms. This might just be the Maylong M-150 of doors.

It’s amazing what you can learn from a door.

Further reading

One of my favorite design books is Don Norman’s classic The Psychology of Everyday Things (later editions were renamed The Design of Everyday Things, which rather spoiled everyone’s habit of referring to the book as POET). Norman looks at how people use everyday objects like teapots and doors and examines how their design can help, or hinder, their use. It’s a short, simple, well-written book that anyone can enjoy, not just design nerds.

I was lucky enough to work briefly with Don when we were both at Apple in the 1990s. He’s a very smart guy. POET is still one of the best, most accessible books on design. Anyone who builds products, or is interested in learning more about how the objects in our world work, should read it.

Update

The awesome team at Songkick took pity on me and provided this helpful solution:

Songkick rocks!