Recent talks on Songkick Engineering

Since I joined Songkick a little over four years ago, our development team has done some amazing things. Our technology, process and culture have improved an enormous amount.

We’ve always been eager to share our progress on this blog and elsewhere, and we often talk about what we’ve learned and where we are still trying to improve.

Here are some recent talks given by members of our team discussing various aspects of how we work.

The cultural side of continuous deployment

We’ve written multiple posts about how we reduced our build time and optimised our tests. Moving to continuous integration (CI) and continuous deployment (CD) allowed us to remove many of the frustrations we had with our build and deploy process. On its own a fast build was not going to move us to continuous deployment but it was a pretty significant enabler. We knew that we had plenty more that we could be improving; we were still reliant on having the right people around to sign off features before releasing and we still depended on manual testing to supplement our automation.

We wanted to avoid drawing a process diagram and then having to police it, so we focused on changes that felt natural to the way we already worked but still improved our process as much as possible.

Don’t aim for perfection

One of our major hold-ups was our attempts to make every feature and every release perfect. We were spending days perfecting pixels and copy only to find out that the feature didn’t have the anticipated impact. There is a huge benefit in getting feedback from users on what works and what doesn’t before you invest a whole load of time in making it look perfect on multiple browsers. Over time we have moved from releasing features and then tweaking them to planning and running A/B tests to gather the information we need before we start designing the final feature.

QA has a key role to play in working with the Product and Design teams to define exactly how much breakage is acceptable. We were moving from a process where every release was tested and it was expected that almost all bugs would have been spotted and fixed before going to production. Now we were relying on our developers and our automation to keep things in a ‘good enough’ state. When something went wrong we stepped back and looked at what was missing – in most cases it was an up-front conversation about risks and expectations.

Of course this is not an excuse for having a website full of badly designed and half-working features. We accept that bugs will end up on production but we work hard to make sure they get fixed as soon as possible.

Managing how many more bugs went to production was a job for our automated tests. Reviewing all the tests as part of our ‘make all the tests fast’ overhaul started to convince us that we had decent coverage. Deciding that we were going to trust the tests gave us the freedom to say that any green build was a releasable build. If this turned out not to be the case, either because manual testing discovered a bug or because of an issue in production, then we amended the tests. Regular reviews and conversations, particularly between developers and QA, help us to keep the tests maintained and testing the right things.

Avoid red builds

Historically Songkick has had an unnatural tolerance for red builds. They didn’t appear to slow us down that much so we didn’t take the time to really try to avoid them. Once we started to seriously look at adopting continuous integration we realised that this would have to change. Frequent check-ins will only work if the builds are green. Loud and visible alerts that go to the whole team when a build fails not only mean someone looks into the failure quickly but also help us to view red builds as a delay. This, coupled with having a very simple, and fast, way to run the tests on a dev environment before checking code in, keeps our red builds to a minimum.

Integrate small changes frequently

A key part of CI is integrating frequently. In an ideal world you probably have everyone working off the master branch. We are careful to maintain a releasable master branch but opted for individual freedom around working on individual branches or directly off master. We like CI because it allows developers the freedom to work in a way that suits them whilst still having enough safeguards to keep the site running. Once we had a fast and painless way to integrate and release most developers naturally started integrating small changes on a more frequent basis.

Have a shared understanding of your goals

Make sure you, and everyone in the team, understand what you’re trying to achieve at each stage of the build pipeline. At Songkick we expect to be able to build and test features on a local dev environment. If we discover something that forces us to test on a real test environment, such as missing data or missing services, then work gets prioritised to change that for next time.

Green builds have been tested on the CI server so we assume that a green build has the minimum required functionality to be releasable.

We use the test environment to test that the build can be deployed, and that the website works as we expect it to when running on multiple servers with lifelike data. Acceptance tests running with Selenium check that agreed business-critical functionality has not been broken. We have separated our build and deploy pipeline from feature launches so passing acceptance tests are our green flag to deploy to production.

Manual acceptance testing takes place on the production environment with the aid of feature flippers to control who can see which features. Once a feature has been tested we manually change the flipper to ‘launch’ the feature to the users.

Keep on learning

CI and CD are difficult to implement, and one of the hardest parts is imagining what the process will actually look like. Rather than trying to pin down the final process we introduced changes gradually, focusing on removing the biggest bottlenecks first. Once one bottleneck was removed it was pretty easy to see what the next one was. Speaking up when you feel frustrated along with analysing problems using the 5-Whys method has helped us improve the process to where we are today. It is fine to make a mistake but at least make sure it is an original one.

Run the right tests at the right time

Way back in June, Dan Crow posted about some of the key principles that we at Songkick believe in. One that I spend some time thinking about every day is, ‘ship early, ship often’. We firmly believe that code should be shipped as soon as it’s ready. From a development point of view this just makes sense. From a user’s point of view this just makes sense. From a testing point of view this proves to be a bit of a challenge.

Shipping fast doesn’t mean shipping untested code and hoping for the best. Every single thing that we release has been tested extensively. Obviously the only way we manage to ship often is by keeping the build/test/release cycle as short as possible. All builds are managed in Jenkins. Pushing code will automatically trigger our unit and integration test suites. If all the tests pass we end up with a green build which can be manually deployed to our test environment. Finally, a suite of acceptance tests runs through the browser using Capybara and Selenium WebDriver to confirm we haven’t broken any of our critical user journeys. These tests are pretty slow, taking roughly 4 minutes to run a handful of scenarios, but this is the first check that the user will actually be able to interact with the website.

Only after all these tests have passed will we deploy code to Production. This applies to all new features, bug fixes and even changes to the tests themselves.
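The gating described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our Jenkins configuration: the function name run_pipeline, the stage names and the lambda stand-ins for real checks are all hypothetical. The point it shows is that each stage only runs if every earlier stage passed, and nothing is releasable until the last one is green.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a check that returns True when the stage is green.
# Both are hypothetical stand-ins for real test suites and deploy steps.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first red one.

    Returns (releasable, names of the stages that actually ran).
    """
    ran: List[str] = []
    for name, check in stages:
        ran.append(name)
        if not check():
            # Red: later stages never run and nothing goes to production.
            return False, ran
    return True, ran


if __name__ == "__main__":
    stages = [
        ("unit tests", lambda: True),
        ("integration tests", lambda: True),
        ("deploy to test environment", lambda: True),
        ("acceptance tests", lambda: False),  # pretend a critical journey broke
    ]
    releasable, ran = run_pipeline(stages)
    print(releasable, ran)
```

Putting the cheapest checks first means a broken build is usually caught before the slow browser-driven tests are ever started.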

The problem

Despite our best intentions we were still struggling to ship changes as soon as they were ready:

In June 2011 we made 7 releases.

In the best case it took 3 hours to build, test and ship code. In reality we were spending around 2 days preparing each release. Something had to change.

Dan Lucraft wrote an excellent post about how we reduced the time it takes to run our tests. It feels pretty obvious to say you can increase release speed if you make your tests run faster but this was only part of the solution. Keeping the test suites fast requires constant diligence. Aiming for 100% test coverage is a distraction. Not only will you never achieve it but if you even came close then your builds would likely be taking far longer than needed to run.

Run the right tests

We took the step of identifying which features we wouldn’t want to break and plotting them against the overhead of running tests. In the case of unit tests you can pretty much add as many tests as you like without too much overhead. Integration tests need to be things that you actually care about. If you discovered a feature was broken during manual testing but wouldn’t hold a release to fix it then you shouldn’t have an automated test for that feature in your build (well, unless it was a super quick unit test).
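The rule of thumb above is simple enough to write down as code. The following is a minimal sketch, assuming each candidate test can be described by its kind and by whether a failure would actually hold a release; the names CandidateTest and belongs_in_build are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CandidateTest:
    feature: str
    kind: str                 # "unit" or "integration" (assumed categories)
    would_hold_release: bool  # would we stop a release to fix this feature?


def belongs_in_build(test: CandidateTest) -> bool:
    """Decide whether an automated test earns its place in the build."""
    if test.kind == "unit":
        # Quick unit tests add very little overhead, so they can stay.
        return True
    # Slower tests are only worth running if a failure would block shipping.
    return test.would_hold_release


if __name__ == "__main__":
    candidates = [
        CandidateTest("sign up", "integration", would_hold_release=True),
        CandidateTest("nice-to-have widget", "integration", would_hold_release=False),
        CandidateTest("date formatting", "unit", would_hold_release=False),
    ]
    for candidate in candidates:
        print(candidate.feature, belongs_in_build(candidate))
```

The useful part is the question it forces for every slow test: would a failure here really stop the release?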

An example of this is our automatic tweets when authenticated users mark their attendance to an event. It is a valid and highly used service that we wouldn’t want to be without but it is not business critical. If we were to have an automated test for this we would need a test which set up a user who appears authenticated with Twitter. The test user would then mark their attendance to an event and the test would need to check whether the tweet was fired for the correct event.

Not only is that a fair bit of work to write and maintain but the resulting test would be pretty slow to execute. The alternative, to push to production and monitor errors in the logs whilst also keeping an eye on the Songkick Twitter feed (something we’re already monitoring), means we have one fewer test to run and maintain. The feedback comes later (post release rather than pre) but since we wouldn’t hold a release even if we knew that we had broken this feature, the actual time to fix is roughly the same.

At the right time

To allow the team to ship fast we need to keep the release channel clear. Builds run through the test suites as cleanly and as quickly as possible to free up the channel for the next release. Part of our process involves establishing up-front how we will test a code change. Usually this will mean adding or modifying automated tests to cover the new functionality. However some of our changes need more than just an automated build run against them so we needed to come up with a way to separate testing from the actual releases.

Our solution was to use what we call Flippers, additional code which lets admins control whether a feature is visible to users. We can then turn features on and off on the live site without needing to make additional releases. As well as giving us a fast way to turn off problem features this has the benefit of allowing us to turn features on for a particular type of user. High risk or extensively changed features are released to production behind a flipper that makes them visible to admin users only. This means we can run the code on the live servers, using live data but test them as if we were working on a test environment.
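To make the idea concrete, here is a minimal sketch of a flipper in Python. It is not our production code: the class names, the three visibility levels and the in-memory storage are all assumptions made for the example. A real flipper would keep its state somewhere admins can change it on the live site without a release.

```python
from enum import Enum


class Visibility(Enum):
    OFF = "off"              # nobody sees the feature
    ADMINS_ONLY = "admins"   # visible to admin users only, for testing on production
    EVERYONE = "everyone"    # the feature is launched


class User:
    def __init__(self, name: str, is_admin: bool = False) -> None:
        self.name = name
        self.is_admin = is_admin


class Flippers:
    """Hypothetical in-memory flipper store, for illustration only."""

    def __init__(self) -> None:
        self._states = {}

    def set(self, feature: str, visibility: Visibility) -> None:
        self._states[feature] = visibility

    def enabled_for(self, feature: str, user: User) -> bool:
        # Features nobody has flipped yet default to OFF.
        state = self._states.get(feature, Visibility.OFF)
        if state is Visibility.EVERYONE:
            return True
        if state is Visibility.ADMINS_ONLY:
            return user.is_admin
        return False


if __name__ == "__main__":
    flippers = Flippers()
    flippers.set("new_event_page", Visibility.ADMINS_ONLY)
    print(flippers.enabled_for("new_event_page", User("qa", is_admin=True)))
    print(flippers.enabled_for("new_event_page", User("fan")))
```

Launching a feature is then a call to set with Visibility.EVERYONE, and turning off a problem feature is the same call with Visibility.OFF; neither needs a new release.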

Fix bugs fast

One problem with testing code on Production is that the bugs you find are also on Production. Obviously many of these bugs aren’t visible to users thanks to the flippers but there will always be some bugs in live code. Our approach is a cultural one: yes, we move fast and accept that things might break, but we don’t leave them like that. We fix bugs as fast as possible.

Sounds interesting but does it work?

We spent 12 months looking at our tests, our process and probably ourselves. Changes were made and in June 2012 we made 113 releases. 14 of those were on the same day. In fact we released on every single working day that month (and there were a few sneaky weekend releases too!).

What we believe in

Recently, the Songkick development team wrote down some of the principles we follow when building software. We originally did this for the latest Silicon Milkroundabout job fair. I thought I’d share them with you, to give you a flavor of what it’s like to work here.

Ship early, ship often

Our development process is optimized for speed of building, releasing and iterating. Five minutes is way too long to wait for a build, so we work hard to make it as fast as possible. Shipping fast means we deliver more great features to our users. We practice continuous integration and continuous deployment. We have high test coverage and we run tests automatically. If the build is green, ship it to production; if it’s red, fix it now.

Architecture is fundamental

Simple, versatile software abstractions are the key to building maintainable software. Our service oriented architecture is easy to extend, and is scalable and elegant. Each service has a clearly defined role and a simple, stable API. New clients are quickly assembled from these services. Client teams focus on creating awesome products for their users: our iPhone app looks different from our Spotify client, which looks different from our website, but they are all built on the same services. Each team can operate like they are their own startup.

Automate away the trivial

We have 10 engineers supporting more than six million users. To manage this, we automate away the operational burden. Services restart automatically in production. Our machines are managed by Puppet, our deployments by Jenkins and Capistrano. Our production monitoring tells us if something goes wrong, so we can fix it fast. Automation and good tools make our lives easier.

Bad code kills startups

It’s too easy to be lazy: don’t reuse that existing method, roll yet another version; don’t refactor, you can do that another day; don’t remove old code, it’s not like it gets in the way. Over months and years, these small indiscretions add up and can strangle a startup like bindweed. We don’t let this happen: we hold ourselves to a higher standard and actively tend our codebase.

Explore, listen, learn

Every engineer spends time learning. Recently a group of us did the Stanford Machine Learning course together. Everyone gets a budget and work time to attend conferences, to speak or listen. Getting better at what you do is a key part of your job. We hold regular dojos and tech talks to share and explore new technologies and ideas, and regular retrospectives to make sure we always do better next time.

Work together

We value a broad range of skills and a broad range of perspectives. Our engineers come from across the globe and have many different technical interests. Some of us dig machine learning, others are into client side coding, we have testing gurus and security experts, some are architecture geeks. We share a common passion for building the best live music service in the world. We respect our differences and believe that makes us better.

Always be playing

What fun would it be if we knew all the answers? We’re always looking for new ideas and new approaches. So we hold regular innovation days where everyone in the company can try out any idea. It might be a new product, a novel automation tool, a better user interface, a different way of building software. These are not just idle experiments: we use many of them in our day to day lives or launch them to our users.

Hire the best

Songkick is an outstanding place to work because we have an outstanding team. We’ve hired the best and the brightest from around the world: the UK, France, Germany, New Zealand, America, Brazil, South Africa, Portugal; from universities such as Imperial, Cambridge, Durham, and Stanford; and from companies including Apple, Google, Yahoo!, The Guardian and the BBC. The things we look for in great engineers are skills, passion and a desire to be part of a great team.