How Docker is changing the way we develop, test & ship apps at Songkick

We’re really excited to have shipped our first app that uses Docker throughout our entire release cycle: from development, through running tests on our CI server, to our production environment. This article explains a bit about why we came to choose Docker, how we’re using it, and the benefits it brings.

Since Songkick and Crowdsurge merged last year we’ve had a mix of infrastructures, and in a long-term quest to consolidate platforms we’ve been looking at how to create a great development experience that would work cross-platform. We started by asking what a great development environment looks like, and came up with the following requirements:

  • Isolate dependencies (trying to run two different versions of a language or database on the same machine isn’t fun!)
  • Match production accurately
  • Fast to set up, and fast to work with day-to-day
  • Simple to use (think make run)
  • Easy for developers to change

We’ve aspired to create a development environment that gets out of the way and allows developers to focus on building great products. We believe that if you want a happy, productive development team it’s essential to get this right, and with the right decisions and a bit of work Docker is a great tool to achieve that.

Below we’ve broken down some advice and examples based on how we’re using Docker for one of our new internal apps.

Install the Docker Toolbox

The Docker Toolbox provides you with all the right tools to work with Docker on Mac or Windows.

A few of us have also been playing with Docker for Mac, which provides a more native experience. It’s still in beta, but it’s a fantastic step forward compared to Docker Toolbox and docker-machine.

Use VMware Fusion instead of VirtualBox

Although Docker Toolbox comes with VirtualBox included, we chose to use VMware Fusion instead. File change notifications are significantly better with VMware Fusion, allowing features like Rails auto-reloading to work properly.

Creating a different Docker machine is simple:

$ docker-machine create --driver vmwarefusion default
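
Once the machine is created, point your Docker client at it before building or running anything:

$ eval $(docker-machine env default)   # sets DOCKER_HOST etc. to point at the new machine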

Use existing services where possible

In development we connect directly to our staging database, removing a set of dependencies (running a local database, seeding structure and data) and giving us a useful, rich dataset to develop against.

Having a production-like set of data to develop and test against is really important, helping us catch bugs, edge-cases and data-related UX problems early.

Test in isolation

For testing we use docker-compose to run the tests against an ephemeral local database, making our tests fast and reliable.
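
As a rough sketch of how this fits together (the compose file and service names here are placeholders, not our exact setup):

$ docker-compose -f docker-compose.test.yml up -d db
$ docker-compose -f docker-compose.test.yml run --rm app bundle exec rspec
$ docker-compose -f docker-compose.test.yml down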

Because you may not want to run your entire test suite each time, we also have a test shell that’s ideal for running specific sets of tests:

$ make shell ENV=test
$ rspec spec/controllers/

Proper development tooling

As well as running the Ruby web server through Docker, we also provide a development shell container, aliased for convenience. This is great for trying out commands in the Rails console or installing new gems without needing Ruby or other dependencies on your Mac.

$ make shell ENV=dev
$ bundle install
$ rails console

Use separate Dockerfiles for development and production

We build our development and production images slightly differently. They both declare the same system dependencies but differ in how they install gems and handle assets. Let’s run through each one and see how they work:

Dockerfile.dev

FROM ruby:2.3.1-slim

RUN mkdir -p /app

RUN apt-get update && \
 apt-get install -y \
 build-essential \
 pkg-config \
 libxml2-dev \
 libxslt-dev \
 libmysqlclient-dev \
 mysql-client \
 libssl-dev \
 libreadline-dev \
 git \
 libfontconfig \
 wget && \
 apt-get clean && \
 rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

# Add our Gemfile to the app directory; this is here so that if it changes,
# the bundle install is triggered again
WORKDIR /app
COPY Gemfile* /app/
COPY vendor/cache /app/vendor/cache

RUN bundle config build.nokogiri --use-system-libraries \
 && bundle install --local

COPY . /app

EXPOSE 8080

CMD ["rails", "server", "-b", "0.0.0.0", "-p", "8080"]

Here we deliberately copy the Gemfile, corresponding lock file and the vendor/cache directory first, then run bundle install.

When a step in the Dockerfile changes, Docker re-runs only that step and the steps after it, reusing cached layers for everything before. This means we only run `bundle install` when the Gemfile or the cached gems change; when other files in the app change we can skip this step, significantly speeding up build time.

We deliberately chose to cache the gems rather than install them afresh from RubyGems.org each time, for three reasons. First, it removes a deployment dependency: when you’re deploying several times a day, it’s not great to rely on more external services than necessary. Second, it means we don’t have to authenticate to install private or Git-based gems from inside containers. Finally, installing gems from the filesystem is much faster, and the --local flag avoids hitting RubyGems altogether.
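
For completeness, the cache itself is just Bundler’s standard vendor/cache; something along these lines keeps it up to date before building the image (a sketch, not our exact workflow):

$ bundle package --all          # copies .gem files, including git gems, into vendor/cache
$ git add vendor/cache
$ git commit -m "Update cached gems"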

Dockerfile.prod

FROM ruby:2.3.1-slim

# Create our app directory
RUN mkdir -p /app

RUN apt-get update && \
 apt-get install -y \
 build-essential \
 ...
 apt-get clean && \
 rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

WORKDIR /app
COPY . /app

RUN bundle config build.nokogiri --use-system-libraries \
 && bundle install --local --without development test

RUN RAILS_ENV=production bundle exec rake assets:precompile

EXPOSE 8080

CMD ["rails", "server", "-b", "0.0.0.0", "-p", "8080", "--pid", "/tmp/rails.pid"]

For production we install our gems differently, skipping test and development groups and precompiling assets into the image.

Deployment

To release this image we tag it both as latest and with the git SHA, then push it to our private ECR repository.

We deliberately deploy that specific version of the image, meaning rolling back is as simple as re-deploying a previous version from Jenkins.
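
The release steps from CI look roughly like this; the image name and registry URL are placeholders, not our real ones.

$ SHA=$(git rev-parse --short HEAD)
$ docker build -f Dockerfile.prod -t our-app:$SHA .
$ docker tag our-app:$SHA <account-id>.dkr.ecr.eu-west-1.amazonaws.com/our-app:$SHA
$ docker tag our-app:$SHA <account-id>.dkr.ecr.eu-west-1.amazonaws.com/our-app:latest
$ docker push <account-id>.dkr.ecr.eu-west-1.amazonaws.com/our-app:$SHA
$ docker push <account-id>.dkr.ecr.eu-west-1.amazonaws.com/our-app:latest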

Running in production

For running containers in production, we’re doing the simplest possible thing: using Docker to solve a dependency-management problem only.

We’re running one container per node, using host networking and managing the process using upstart. When deploying we simply tell the upstart service to restart, which pulls the relevant image from the registry, stops the existing container and starts the new one.
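
In effect, a restart of the upstart job does something like the following; this is a sketch of the idea rather than our actual job definition, and the image name is a placeholder.

$ docker pull <registry>/our-app:<version>
$ docker stop our-app || true
$ docker rm our-app || true
$ docker run --name our-app --net=host <registry>/our-app:<version>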

This isn’t the most scalable or resource-efficient way of running containers but for a low-traffic internal app it’s a great balance of simplicity and effectiveness.

Next steps

One thing we’re still missing in production is zero-downtime deploys. Amazon’s ECS handles this automatically (by spinning up a new set of containers and swapping them into the load balancer before retiring the old ones), so we’re looking to move towards using that instead.

We’re still learning a lot about using Docker but so far it’s been a powerful, reliable and enjoyable tool to use for both developers and ops.

Migrating to a new Puppet certification authority

At Songkick all our servers are managed using Puppet, an open source configuration management tool. We use it in client-server mode and recently had the need to replace the certification authority certificates on all our nodes. I couldn’t find much information on how to do this without logging onto every machine, so I’ve documented my method.

What is this Puppet CA anyway?

If you’re using puppet in its typical client-server or agent-master setup, then when the puppet master is first started it will create a certification authority (CA) which all clients that connect to it need to be trusted by and must trust. This usually happens transparently, so often people aren’t aware that this certification authority exists.

The CA exists to establish trust between the agents and the master: it prevents an attacker from setting up a malicious puppet master and telling puppet agents to do their bidding, and it stops malicious clients from seeing configuration data intended for other clients. Agents should only connect to masters whose certificates are signed by their CA, and masters should only send configuration information to clients whose certificates are signed by the same CA.

There’s a more comprehensive explanation of Puppet SSL written by Brice Figureau which goes into far more detail than we have space for. The main thing to understand is that the CA is an important part of maintaining security and that you can only have one CA across a set of machines that access the same secured resources.

Why would I want to migrate to a new CA?

  • Your current CA certificate is about to expire. By default, CA certificates have a validity period of 5 years, so fairly early adopters of puppet will need to replace them.
  • You’ve had multiple CAs in use and need to consolidate on one.
  • You believe that your certificates and private keys are in the hands of people who could cause mischief with them.
  • You have fallen foul of bugs relating to the fact that you use a CA created in an older version of puppet.

It was in fact the second of these reasons that applied to Songkick; we’d previously been using multiple puppet masters, each with its own CA. We wanted to start using exported resources, stored in the same PuppetDB instance for all nodes. This meant that each master needed to be trusted by the same CA that had signed the PuppetDB instance’s certificate; hence we needed to consolidate on one CA.

How do I do it?

Set up new puppet master(s)

Set up at least one new puppet master server, with a new CA certificate.

If you have a lot of existing hosts managed by puppet, then it’s worth considering enabling the autosign option, even if only temporarily, as you’ll have a number of certificate requests to approve manually otherwise.
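
If you do enable naive autosigning, one way to do it is shown below; this is equivalent to setting autosign = true in the [master] section of the new master’s puppet.conf, and is best treated as temporary.

$ puppet config set autosign true --section master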

Configure agents to connect to the new master(s)

We’re assuming here that you’re managing the puppet agent configuration through puppet, and that changes to the puppet configuration cause an automatic restart of the puppet agent.

Change the configuration of your puppet agents, to connect to the new master(s) and use a different ssldir:

[main]
server = <new server hostname> 
ssldir = /var/lib/puppet/ssl2

Be careful not to apply this change to your newly created puppet master.

Your clients should reconfigure themselves and restart; when they start up they will connect to your new puppet master, forgetting their old SSL configuration, including the CA certificates.

If you have autodiscovery records for puppet in DNS, e.g. an A record for ‘puppet’ or SRV records, you should leave them in place for now; agents that have not yet been migrated to the new CA may still need them.

It is a good idea to test this on a handful of nodes and check that it works in a completely automated fashion before applying to every node.
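
On a test node, something like the following confirms that the agent runs cleanly and that its new certificate was issued by the new CA (assuming the certname is the node’s FQDN and the ssldir configured above):

$ puppet agent --test
$ openssl x509 -issuer -noout -in /var/lib/puppet/ssl2/certs/$(hostname -f).pem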

Tidying up (part 1)

Once every node has checked in with the new master and been issued with a new certificate, it’s time to start tidying up. It’s a good idea to revert to using the default ssldir, so that when agents bootstrap themselves with the default config they don’t then switch to the new ssldir and forget the certificates they were just issued; that would cause the master to refuse to talk to them, as it looks like a spoofing attempt.

On each client, we mirror the new ssldir to the old one:

file { '/var/lib/puppet/ssl': 
  source => 'file:///var/lib/puppet/ssl2',
  recurse => true, 
  purge => true, 
  force => true, 
}

Be careful not to apply this change to your newly created puppet master.

Tidying up (part 2)

Once that’s shipped everywhere, we remove the ssldir configuration, falling back on the default ssldir, and remove the above resource definition that copies the ssldir.

Tidying up (part 3)

You can now update your autodiscovery DNS entries to point to the new servers, and remove the autosign configuration if desired.

Finally, we ship a change to the clients that removes the temporary /var/lib/puppet/ssl2 directory.
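
That final change just ensures the directory is absent; in Puppet terms it’s equivalent to the one-liner below, though in practice it lives in our manifests rather than being run by hand.

$ puppet apply -e "file { '/var/lib/puppet/ssl2': ensure => absent, force => true }"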

And that’s it, everything has been migrated to the new CA, with no need to do anything outside of puppet.

Testing your database backups: the test environment database refresh pattern

When did you last try restoring your database backups? A month ago, a week ago? A year ago? Never? When was the last time you refreshed the data in your test environments? When I joined Songkick, one of the first things I asked was when we had last tested a restore of our database backups. The answer, pleasingly, was at 03:00 UK time that morning, and, not coincidentally, that’s also when we last refreshed the data in our test environments.

Here’s how we get the warm and fuzzy feeling of knowing that our backups contain data that can be restored and makes sense.

  1. Every morning, our database servers run their scheduled backups, copying the resulting images to a backup server in the data centre.
  2. Overnight those backups get copied to the office, giving us an offsite copy.
  3. In the small hours, when most of us are asleep, each of the database servers in our staging environment retrieves the backups, erases its local data files and then restores the production backups over the top of them (see the sketch after this list).
  4. We perform sanitisation on the data, to make it suitable for use in a testing environment.
  5. And finally, and most importantly, we use the databases in our testing.
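
Step 3 is the interesting one. For one of our MySQL nodes it boils down to something like the sketch below; this is illustrative only, assuming Percona XtraBackup-style physical backups, and the real scripts handle paths, permissions and error checking.

$ sudo service mysql stop
$ sudo rm -rf /var/lib/mysql/*
$ xtrabackup --prepare --target-dir=/backups/mysql/latest
$ xtrabackup --copy-back --target-dir=/backups/mysql/latest --datadir=/var/lib/mysql
$ sudo chown -R mysql:mysql /var/lib/mysql
$ sudo service mysql start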

By doing this, we identified one case where our backup process seemed to work and produced plausible-looking backups, but MySQL failed to apply InnoDB log records during recovery. It was inconvenient to discover this problem in our staging environment, but far less inconvenient than discovering it when we needed the backups to put our production system back into operation.

Here are some practical tips based on our experience implementing and managing this system at Songkick:

Back all databases up at the same time

If your system is composed of services backed by independent databases on different machines, there’s probably some implicit consistency expected between them. For example, a common situation at Songkick is to have an accounts service responsible for storing user accounts and another service that stores data keyed against a user; there’s an expectation that those two databases have some degree of consistency.

If you back them up at different times, you’ll find inconsistencies, such as a service holding a reference to a user that doesn’t yet exist. If the ID of the user is exposed to other services and that ID can be reused, you may find that newly created users in your test environment have existing data associated with them, and this can cause significant problems in testing.

It’s worth noting that, in the case of a production restore, these issues would need to be diagnosed and solved in the heat of the moment. By finding them in your test environment, you’re giving yourself the space to solve them earlier, under less pressure.

Design the backups to be regularly exercised

Some types of backups are more amenable than others to being restored regularly in test environments. For example, our initial MongoDB backups were snapshots of the MongoDB data path. These proved difficult to restore, because they included the local database, which contains information on replica set membership. This meant that on startup our staging MongoDB server would forget its existing replica set membership and try to talk to the production servers instead.

We switched to using mongodump to take a logical export of the database, simply so that we could restore it on the primary member of our existing staging replica set and let it replicate out to the rest of the set.
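
The export and restore are along these lines; hostnames and paths are placeholders rather than our real ones.

$ mongodump --host prod-mongo-primary --out /backups/mongo/latest
$ mongorestore --host staging-mongo-primary --drop /backups/mongo/latest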

Sanitisation tips

After we’ve restored the databases, there are certain things we do to make them safe and usable in our testing environments.

  • Remove or obfuscate email addresses. We’re not fond of accidentally emailing people about test events we’ve created in staging, so we change people’s email addresses to something unusable so that can’t happen (see the sketch after this list). We leave email addresses alone if they belong to someone who works at Songkick, so we can test email features by emailing ourselves.
  • Remove or obfuscate payment tokens. If it’s uncool to accidentally email people, accidentally charging them is positively hostile. Anything that’s used for payment needs to be removed.
  • Fix or replace information about the environment. It’s best to avoid keeping references to your technical environment in the same database as your application data, but sometimes that’s tricky to work around. For example, our MogileFS installation needs to be kept in sync with the production one, to avoid problems with missing media. This means we need to manually update the database to substitute the hostnames of the MogileFS servers.
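
As an illustration of the email obfuscation, the query below captures the idea; the database, table and column names are made up for the example, not our real schema.

$ mysql app_staging -e "UPDATE users
    SET email = CONCAT('user-', id, '@example.invalid')
    WHERE email NOT LIKE '%@songkick.com'"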

Write code that can withstand the database going away

Unless you’ve put some work in, almost no database driver will gracefully handle the disappearance of a database server and then its re-appearance some time later. If the restore in your test environment is the first time you’ve tried this, you may find that you need to manually restart services, even after the database re-appears on the network.

The solution will vary depending on the database client being used, but often it’s a case of catching an exception, or changing some options when you establish the connection.

By making your applications reconnect to the database with no manual input, you are again fixing a problem that will eventually occur in production – a much more stressful time for it to be diagnosed and fixed.

Summary

Testing your database backups by restoring them automatically and regularly in your test environments is a great way to battle-harden your backups and applications and to make sure that your test environment looks like the real production environment.


If you’ve liked what you’ve read, why not head over to our jobs page? We’re looking for a Systems Engineer to add more touches like these to our infrastructure.