Move fast, but test the code

At Songkick we believe code only starts adding value when it’s out in production, and being used by real users. Using Continuous Deployment helps us ship quickly and frequently. Code is pushed to Git, automatically built, checked, and if all appears well, deployed to production.

Automated pipelines make sure that every release goes through all of our defined steps. We don’t need to remember to trigger test suites, and we don’t need to merge features between branches. Our pipeline contains enough automated checks for us to be confident releasing the code to production.

However, our automated checks are not enough to confirm if a feature is actually working as it should be. For that we need to run through all our defined acceptance criteria and implicit requirements, and see the feature being used in the real world by real users.

In a previous life we tried to perform all of our testing in the build/test/release pipeline. Not only was this slow, inefficient, and dependent on lots of different people being available at the same time, but we often found that features behaved very differently in production. Real users do unexpected things and it’s difficult to create truly realistic test environments.

Our motivation to get features out to real users as quickly as possible drove our adoption of Continuous Deployment. Having manual acceptance testing within the release pipeline slowed us down and made processes unpredictable. It was hard to define a process that relied on so many different people. We treated everyday events such as meetings and other work priorities as exceptional events which made things even more delay-prone and frustrating.

Eventually we decided that the build and release pipeline must be fully automated. We wanted developers to be able to push code and know that if Jenkins passed the build, it was safe for them to deploy to production. That does not mean automating all testing, which is neither achievable nor desirable. Firstly, automated tests are expensive to build and maintain. Secondly, testing, as opposed to checking, is not something that can be automated.

When we check something we are comparing the system against a known outcome. For example checking a button launches the expected popup when clicked, or checking a date displays in the specified format. Things like this can be, and should be automated.
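As a minimal sketch of what such a check can look like, the Python below compares a date formatter against a fully specified expected value. The function name `format_event_date` and the day/month/year format are illustrative assumptions, not Songkick's actual code.

```python
from datetime import date


def format_event_date(d: date) -> str:
    """Hypothetical formatter: render a date as DD/MM/YYYY."""
    return d.strftime("%d/%m/%Y")


def check_event_date_format() -> bool:
    """A check: the expected outcome is known in advance, so a machine can compare."""
    return format_event_date(date(2012, 6, 2)) == "02/06/2012"
```

The check can only confirm the outcome we thought to specify. It says nothing about whether that format is the right one for users, which is where testing comes in.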

Testing is more involved and relies on a human making a judgement. Testing involves exploring the system in creative ways in order to discover the things that you forgot about, the things that are unexpected, or difficult to completely define. It’s hard to predict how time and specific data combinations will affect computer systems, and testing is a good way to uncover what actually happens. Removing the constraint of needing fully defined expected outcomes allows us to explore the system as a user might.

In practical terms this means running automated checks in our release pipeline and performing testing before code is committed, and post release. Taking testing out of the release pipeline removes the time pressures and allows us freedom to test everything as deeply as we require.

Songkick's Test and Release Process

Small, informal meetings called kick-offs help involve everyone in defining and designing the feature. We discuss what we’re building and why, plan how to test and release the code, and consider ways to measure success. Anything more complicated than a simple bug fix gets a kick-off before we start writing code. Understanding the context is important for helping us do the right thing. If we know that there are deadlines or business risks associated with a change then we’re likely to act differently than in a situation that has technical risks.

Coming out of the kick-off meeting we know how risky we consider the feature to be. We will have decided on the best approach to testing and releasing the code. As part of developing the feature we’ll also write or update our automated checks to make sure we don’t break the feature further down the line. Our process is intentionally flexible to allow us to treat each change appropriately depending on risk and need to ship.

Consider a recently released feature to store promoter details against ticket allocations as an example. The feature kick-off meeting identified risks and we discussed what to test and how to test it. We identified ways to break down the work into smaller pieces that could be developed and released independently, each hidden behind a feature flipper to keep it invisible to real users.

Developers and testers paired together to decide on specific areas to test. The tester’s testing expertise, and the developer’s deep understanding of the code feed into an informal collection of test ideas based on risk. Usually these are represented in a visual mind map for easy reference.

The developers, guided by the mind map, tested the feature and added automated unit and integration tests as they went. Front-end changes were overseen by a designer working closely with one of the developers to come up with the best, feasible, design. Once we had all the pieces of the feature the whole team jumped in to do some testing, and update our automated acceptance tests.

The feature required a bit of data backfilling so the development team were able to use the functionality in production, in ways we expect real users to use it. Of course we found some bugs but by working with small releases we were able to quickly locate the source of the problem. Fast release pipelines allow fixes to be deployed within minutes, making the cost of most bugs tolerably low.

Once the feature had been fully released and switched on for all users we used monitoring to check for unexpected issues. Reviewing features after a week or two of real world usage allows us to make informed decisions about the technical implementation and user experience. Taking the time to review how much value features are adding allows us to quickly spot and respond to problems.

Testing a feature involves many experts. Testers must be on hand to aid the developers in their testing, often by creating a mind map of test ideas to guide testing. We try to use our previous experience of releasing similar features to focus the testing on areas that are typically complex or easy to break. Designers and UX people get involved to make sure the UX works as hoped, and the design looks good on all our supported devices and browsers. Product managers make sure the features actually do what they want them to do. High risk features have additional deep testing from the test team and in certain cases we throw in some focused performance or security testing.

Most of our bugs come from forgetting use cases or not understanding existing functionality in the system. Testing gives us a chance to use the system in an investigative way to hopefully find these bugs. Moving testing outside of our release pipeline gives us space to perform enough testing for each feature whilst maintaining a fully automated, and fast, release pipeline.

The cultural side of continuous deployment

We’ve written multiple posts about how we reduced our build time and optimised our tests. Moving to continuous integration (CI) and continuous deployment (CD) allowed us to remove many of the frustrations we had with our build and deploy process. On its own a fast build was not going to move us to continuous deployment but it was a pretty significant enabler. We knew that we had plenty more that we could be improving; we were still reliant on having the right people around to sign off features before releasing and we still depended on manual testing to supplement our automation.

We wanted to avoid drawing a process diagram and then having to police it so we focused on a process which was natural to the way we worked but that improved our process as much as possible.

Don’t aim for perfection

One of our major hold-ups was our attempts to make every feature and every release perfect. We were spending days perfecting pixels and copy only to find out that the feature didn’t have the anticipated impact. There is a huge benefit in getting feedback from users on what works and what doesn’t before you invest a whole load of time in making it look perfect on multiple browsers. Over time we have moved from releasing features and then tweaking them to planning and running A/B tests to gather the information we need before we start designing the final feature.

QA has a key role to play in working with the Product and Design teams to define exactly how much breakage is acceptable. We were moving from a process where every release was tested and it was expected that almost all bugs would have been spotted and fixed before going to production. Now we were relying on our developers and our automation to keep things in a ‘good enough’ state. When something went wrong we stepped back and looked at what was missing – in most cases it was an up-front conversation about risks and expectations.

Of course this is not an excuse for having a website full of badly designed and half-working features. We accept that bugs will end up on production but we work hard to make sure they get fixed as soon as possible.

Managing how many more bugs went to production was a job for our automated tests. Reviewing all the tests as part of our ‘make all the tests fast’ overhaul started to convince us that we had decent coverage. Deciding that we were going to trust the tests gave us the freedom to say that any green build was a releasable build. If this turned out not to be the case, either because manual testing discovered a bug or because of an issue in production, then we amended the tests. Regular reviews and conversations, particularly between developers and QA, help us to keep the tests maintained and testing the right things.

Avoid red builds

Historically Songkick has had an unnatural tolerance for red builds. They didn’t appear to slow us down that much so we didn’t take the time to really try to avoid them. Once we started to seriously look at adopting continuous integration we realised that this would have to change. Frequent check-ins will only work if the builds are green. Loud and visible alerts that go to the whole team when a build fails not only mean that someone looks into the failure quickly but also help us to view red builds as a delay. This, coupled with a very simple and fast way to run the tests on a dev environment before checking code in, keeps our red builds to a minimum.

Integrate small changes frequently

A key part of CI is integrating frequently. In an ideal world you probably have everyone working off the master branch. We are careful to maintain a releasable master branch but opted for individual freedom around working on individual branches or directly off master. We like CI because it allows developers the freedom to work in a way that suits them whilst still having enough safeguards to keep the site running. Once we had a fast and painless way to integrate and release most developers naturally started integrating small changes on a more frequent basis.

Have a shared understanding of your goals

Make sure that you and everyone in the team understand what you’re trying to achieve at each stage of the build pipeline. At Songkick we expect to be able to build and test features on a local dev environment. If we discover something that forces us to test on a real test environment, such as missing data or missing services, then work gets prioritised to change that for next time.

Green builds have been tested on the CI server so we assume that a green build has the minimum required functionality to be releasable.

We use the test environment to test that the build can be deployed, and that the website works as we expect it to when running on multiple servers with lifelike data. Acceptance tests running with Selenium check that agreed business-critical functionality has not been broken. We have separated our build and deploy pipeline from feature launches so passing acceptance tests are our green flag to deploy to production.

Manual acceptance testing takes place on the production environment with the aid of feature flippers to control who can see which features. Once a feature has been tested we manually change the flipper to ‘launch’ the feature to the users.

Keep on learning

CI and CD are difficult to implement, and one of the hardest parts is imagining what the process will actually look like. Rather than trying to pin down the final process we introduced changes gradually, focusing on removing the biggest bottlenecks first. Once one bottleneck was removed it was pretty easy to see what the next one was. Speaking up when you feel frustrated along with analysing problems using the 5-Whys method has helped us improve the process to where we are today. It is fine to make a mistake but at least make sure it is an original one.

Run the right tests at the right time

Way back in June, Dan Crow posted about some of the key principles that we at Songkick believe in. One that I spend some time thinking about every day is, ‘ship early, ship often’. We firmly believe that code should be shipped as soon as it’s ready. From a development point of view this just makes sense. From a user’s point of view this just makes sense. From a testing point of view this proves to be a bit of a challenge.

Shipping fast doesn’t mean shipping untested code and hoping for the best. Every single thing that we release has been tested extensively. Obviously the only way we manage to ship often is by keeping the build/test/release cycle as short as possible. All builds are managed in Jenkins. Pushing code will automatically trigger our unit and integration test suites. If all the tests pass we end up with a green build which can be manually deployed to our test environment. Finally a suite of acceptance tests runs through the browser using Capybara and Selenium WebDriver to confirm we haven’t broken any of our critical user journeys. These tests are pretty slow, taking roughly 4 minutes to run a handful of scenarios, but this is the first check that the user will actually be able to interact with the website.

Only after all these tests have passed will we deploy code to Production. This applies to all new features, bug fixes and even changes to the tests themselves.
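The gating described above can be sketched in a few lines of Python. This is an illustration of the idea only, not our Jenkins configuration: `run_pipeline` and the stage names are hypothetical, and each stage is reduced to a function that returns whether it passed.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a function that returns True when the stage passes.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (releasable, names of the stages that actually ran).
    """
    ran: List[str] = []
    for name, stage in stages:
        ran.append(name)
        if not stage():
            # A failed stage blocks everything after it, including the deploy.
            return False, ran
    return True, ran


# Hypothetical stage list mirroring the order described in the text.
example_stages: List[Stage] = [
    ("unit tests", lambda: True),
    ("integration tests", lambda: True),
    ("deploy to test environment", lambda: True),
    ("acceptance tests", lambda: True),
]
```

The cheap suites run first, so a broken build is caught before any time is spent on the slow browser-based acceptance tests.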

The problem

Despite our best intentions we were still struggling to ship changes as soon as they were ready:

In June 2011 we made 7 releases.

In the best case it took 3 hours to build, test and ship code. In reality we were spending around 2 days preparing each release. Something had to change.

Dan Lucraft wrote an excellent post about how we reduced the time it takes to run our tests. It feels pretty obvious to say you can increase release speed if you make your tests run faster but this was only part of the solution. Keeping the test suites fast requires constant diligence. Aiming for 100% test coverage is a distraction. Not only will you never achieve it but if you even came close then your builds would likely be taking far longer than needed to run.

Run the right tests

We took the step of identifying which features we wouldn’t want to break and plotting them against the overhead of running tests. In the case of unit tests you can pretty much add as many tests as you like without too much overhead. Integration tests need to be things that you actually care about. If you discovered a feature was broken during manual testing but wouldn’t hold a release to fix it then you shouldn’t have an automated test for that feature in your build (well, unless it was a super quick unit test).

An example of this is our automatic tweets when authenticated users mark their attendance to an event. It is a valid and highly used service that we wouldn’t want to be without but it is not business critical. If we were to have an automated test for this we would need a test which set up a user who appears authenticated with Twitter. The test user would then mark their attendance to an event and the test would need to check whether the tweet was fired for the correct event.

Not only is that a fair bit of work to write and maintain but the resulting test would be pretty slow to execute. The alternative, to push to production and monitor errors in the logs whilst also keeping an eye on the Songkick Twitter feed (something we’re already monitoring), means we have one fewer test to run and maintain. The feedback comes later (post release rather than pre) but since we wouldn’t hold a release even if we knew that we had broken this feature, the actual time to fix is roughly the same.
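The rule of thumb from this section can be written down as a tiny predicate. The sketch below is only a way of making the rule explicit; the `CandidateTest` type and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CandidateTest:
    """Hypothetical description of a test we are thinking of adding to the build."""
    name: str
    would_hold_release: bool  # would we delay a release to fix this feature?
    is_fast_unit_test: bool   # cheap enough that the overhead is negligible


def belongs_in_build(test: CandidateTest) -> bool:
    """Keep a test in the build if a failure would block a release,
    or if it is a unit test quick enough to cost almost nothing."""
    return test.would_hold_release or test.is_fast_unit_test


# The tweet-on-attendance example: valued, but we would not hold a release for it,
# and the test would be a slow end-to-end scenario.
tweet_on_attendance = CandidateTest(
    name="tweet when a user marks attendance",
    would_hold_release=False,
    is_fast_unit_test=False,
)
```

Under this rule the tweet scenario stays out of the build and is covered by production monitoring instead.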

At the right time

To allow the team to ship fast we need to keep the release channel clear. Builds run through the test suites as cleanly and as quickly as possible to free up the channel for the next release. Part of our process involves establishing up-front how we will test a code change. Usually this will mean adding or modifying automated tests to cover the new functionality. However some of our changes need more than just an automated build run against them so we needed to come up with a way to separate testing from the actual releases.

Our solution was to use what we call Flippers, additional code which lets admins control whether a feature is visible to users. We can then turn features on and off on the live site without needing to make additional releases. As well as giving us a fast way to turn off problem features this has the benefit of allowing us to turn features on for a particular type of user. High risk or extensively changed features are released to production behind a flipper that makes them visible to admin users only. This means we can run the code on the live servers, using live data but test them as if we were working on a test environment.
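A minimal sketch of the flipper idea, assuming a simple in-memory store, is shown below. The names `Flippers`, `Audience` and `User` are hypothetical and chosen for illustration; this is not our production code, and a real implementation would need to persist its state so admins can change it without a release.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Audience(Enum):
    OFF = "off"            # nobody sees the feature
    ADMINS = "admins"      # visible to admin users only
    EVERYONE = "everyone"  # launched to all users


@dataclass
class User:
    name: str
    is_admin: bool = False


@dataclass
class Flippers:
    """Hypothetical in-memory flipper store."""
    _state: Dict[str, Audience] = field(default_factory=dict)

    def set(self, feature: str, audience: Audience) -> None:
        # Changing the audience needs no new release, only a state change.
        self._state[feature] = audience

    def enabled(self, feature: str, user: User) -> bool:
        # Unknown features default to OFF so unfinished code stays hidden.
        audience = self._state.get(feature, Audience.OFF)
        if audience is Audience.EVERYONE:
            return True
        if audience is Audience.ADMINS:
            return user.is_admin
        return False
```

Application code would wrap the new behaviour in a call such as `flippers.enabled("promoter_details", current_user)`, where the feature name is again just an example. Launching the feature then means moving it from `ADMINS` to `EVERYONE`, and switching off a problem feature means setting it back to `OFF`.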

Fix bugs fast

One problem with testing code on Production is that the bugs you find are also on Production. Obviously many of these bugs aren’t visible to users thanks to the flippers but there will always be some bugs in live code. Our approach is a cultural one: yes, we move fast and accept that things might break, but we don’t leave them like that. We fix bugs as fast as possible.

Sounds interesting but does it work?

We spent 12 months looking at our tests, our process and probably ourselves. Changes were made and in June 2012 we made 113 releases. 14 of those were on the same day. In fact we released on every single working day that month (and there were a few sneaky weekend releases too!).