How Docker is changing the way we develop, test & ship apps at Songkick

We’re really excited to have shipped our first app that uses Docker throughout our entire release cycle: from development, through to running tests on our CI server, and finally to our production environment. This article explains a bit about why we came to choose Docker, how we’re using it, and the benefits it brings.

Since Songkick and Crowdsurge merged last year we’ve had a mix of infrastructures, and in a long-term quest to consolidate platforms we’ve been looking at how to create a great development experience that would work cross-platform. We started by asking what a great development environment looks like, and came up with the following requirements:

  • Isolate dependencies (trying to run two different versions of a language or database on the same machine isn’t fun!)
  • Match production accurately
  • Fast to set up, and fast to work with day-to-day
  • Simple to use (think make run)
  • Easy for developers to change

We’ve aspired to create a development environment that gets out of the way and allows developers to focus on building great products. We believe that if you want a happy, productive development team it’s essential to get this right, and with the right decisions and a bit of work Docker is a great tool to achieve that.

Below we’ve broken down some advice and examples based on how we’re using Docker for one of our new internal apps.

Install the Docker Toolbox

The Docker Toolbox provides you with all the right tools to work with Docker on Mac or Windows.

A few of us have also been playing with Docker for Mac, which provides a more native experience. It’s still in beta but it’s a fantastic step forward compared to the Docker Toolbox and docker-machine.

Use VMWare Fusion instead of Virtualbox

Although Docker Toolbox comes with Virtualbox included, we chose to use VMWare Fusion instead. File change notifications are significantly better using VMWare Fusion, allowing features like Rails auto-reloading to work properly.

Creating a different Docker machine is simple:

$ docker-machine create --driver vmwarefusion default

Use existing services where possible

In development we connect directly to our staging database, removing a set of dependencies (running a local database, seeding structure and data) and giving us a useful, rich dataset to develop against.

Having a production-like set of data to develop and test against is really important, helping us catch bugs, edge-cases and data-related UX problems early.
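Concretely, this just means pointing the app’s development environment at the staging host. A minimal sketch of the idea in config/database.yml (the hostname and variable names here are illustrative, not our real config):

development:
  adapter: mysql2
  host: staging-db.internal.example.com
  database: ourapp
  username: <%= ENV['STAGING_DB_USER'] %>
  password: <%= ENV['STAGING_DB_PASSWORD'] %>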

Test in isolation

For testing we use docker-compose to run the tests against an ephemeral local database, making our tests fast and reliable.
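As a sketch, the docker-compose file looks something like this (service names and image versions are illustrative, not our exact config):

web:
  build: .
  dockerfile: Dockerfile.dev
  command: bundle exec rspec
  links:
    - db
  environment:
    RAILS_ENV: test
    DATABASE_HOST: db

db:
  image: mysql:5.6
  environment:
    MYSQL_ALLOW_EMPTY_PASSWORD: "yes"

Because the db container holds no persistent volumes, every test run starts from a clean database.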

Because you may not want to run your entire test suite each time, we also have a test shell ideal for running specific sets of tests:

$ make shell ENV=test
$ rspec spec/controllers/

Proper development tooling

As well as running the Ruby web server through Docker, we also provide a development shell container, aliased for convenience. This is great for trying out commands in the Rails console or installing new gems without needing Ruby or other dependencies on your Mac.

$ make shell ENV=dev
$ bundle install
$ rails console
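For reference, the make targets above don’t need to be anything fancy; a sketch of the idea (our actual Makefile has a few more conveniences):

ENV ?= dev

run:
	docker-compose up web

shell:
	docker-compose run --rm -e RAILS_ENV=$(ENV) web bash

Keeping these behind make means developers don’t need to remember docker-compose flags, and we can change the underlying commands without anyone retraining their fingers.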

Use separate Dockerfiles for development and production

We build our development and production images slightly differently. They both declare the same system dependencies but differ in how they install gems and handle assets. Let’s run through each one and see how they work:

Dockerfile.dev

FROM ruby:2.3.1-slim

RUN mkdir -p /app

RUN apt-get update && \
 apt-get install -y \
 build-essential \
 pkg-config \
 libxml2-dev \
 libxslt-dev \
 libmysqlclient-dev \
 mysql-client \
 libssl-dev \
 libreadline-dev \
 git \
 libfontconfig \
 wget && \
 apt-get clean && \
 rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

# Add the Gemfile and lock file to the app directory first, so that the
# bundle install step below is only re-run when they change
WORKDIR /app
COPY Gemfile* /app/
COPY vendor/cache /app/vendor/cache

RUN bundle config build.nokogiri --use-system-libraries \
 && bundle install --local

COPY . /app

EXPOSE 8080

CMD ["rails", "server", "-b", "0.0.0.0", "-p", "8080"]

Here we deliberately copy the Gemfile, corresponding lock file and the vendor/cache directory first, then run bundle install.

Docker caches the result of each step in the Dockerfile; when a step changes, only that step and the ones after it are re-run. This means we only run `bundle install` when there’s a change to the Gemfile or the cached gems, but when other files in the app change we can skip this step, significantly speeding up build time.

We deliberately chose to cache the gems rather than install afresh from Rubygems.org each time, for three reasons. First, it removes a deployment dependency: when you’re deploying several times a day, it’s not great having to rely on more external services than necessary. Second, it means we don’t have to authenticate for installing private or Git-based gems from inside containers. Finally, it’s also much faster to install gems from the filesystem, using the --local flag to avoid hitting Rubygems altogether.
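The vendor/cache directory itself is generated on the host before the image is built; with Bundler that’s a single command (the --all flag also vendors git- and path-sourced gems):

$ bundle package --all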

Dockerfile.prod

FROM ruby:2.3.1-slim

# Create our app directory
RUN mkdir -p /app

RUN apt-get update && \
 apt-get install -y \
 build-essential \
 ...
 apt-get clean && \
 rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

WORKDIR /app
COPY . /app

RUN bundle config build.nokogiri --use-system-libraries \
 && bundle install --local --without development test

RUN RAILS_ENV=production bundle exec rake assets:precompile

EXPOSE 8080

CMD ["rails", "server", "-b", "0.0.0.0", "-p", "8080", "--pid", "/tmp/rails.pid"]

For production we install our gems differently, skipping test and development groups and precompiling assets into the image.

Deployment

To release this image we tag it both as the latest version and with the git SHA. It’s then pushed to our private ECR repository.

We deliberately deploy that specific version of the image, meaning rolling back is as simple as re-deploying a previous version from Jenkins.
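The release steps themselves are just a handful of docker commands, along these lines (the registry variable and app name are placeholders, not our actual values):

$ SHA=$(git rev-parse --short HEAD)
$ docker build -f Dockerfile.prod -t ourapp .
$ docker tag ourapp $ECR_REPO/ourapp:$SHA
$ docker tag ourapp $ECR_REPO/ourapp:latest
$ docker push $ECR_REPO/ourapp:$SHA
$ docker push $ECR_REPO/ourapp:latest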

Running in production

For running containers in production, we’re doing the simplest possible thing: using Docker to solve a dependency management problem only.

We’re running one container per node, using host networking and managing the process using upstart. When deploying we simply tell the upstart service to restart, which pulls the relevant image from the registry, stops the existing container and starts the new one.
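As an illustrative sketch (the job name, registry and file paths here are invented, not our production config), the upstart job looks roughly like this:

# /etc/init/ourapp.conf
description "ourapp Docker container"

start on started docker
stop on runlevel [!2345]
respawn

env REGISTRY=example.dkr.ecr.eu-west-1.amazonaws.com

script
  TAG=$(cat /etc/ourapp.release)  # the git SHA written out by the deploy job
  docker pull "$REGISTRY/ourapp:$TAG"
  docker rm -f ourapp || true
  exec docker run --name ourapp --net host "$REGISTRY/ourapp:$TAG"
end script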

This isn’t the most scalable or resource-efficient way of running containers but for a low-traffic internal app it’s a great balance of simplicity and effectiveness.

Next steps

One thing we’re still missing in production is zero-downtime deploys. Amazon’s ECS handles this automatically (by spinning up a new pool of containers before automatically swapping them out in the load balancer) so we’re looking to move towards using that instead.

We’re still learning a lot about using Docker but so far it’s been a powerful, reliable and enjoyable tool to use for both developers and ops.

Safely dealing with magical text

Boy, what a week it’s been. A remote-code-execution bug was discovered in Ruby on Rails, and we’ve all been scrambling to patch our servers (please patch your apps before reading any further, there is an automated exploit out there that gives people a shell on your boxes otherwise).

What the Ruby community, and those of other dynamic languages, must realize from recent Rails security blunders is that very similar problems can easily exist in any non-trivial web application. Indeed, I found a remote-execution bug in my own open-source project Faye yesterday, 3.5 years into the life of the project (again: patch before reading on).

There are a lot of lessons to be had from recent Rails security blunders, since they involve so many co-operating factors: excessive trust of user input, insufficient input validation and output encoding, the behavioural capabilities of Ruby objects and certain Rails classes, ignorance of cryptography and the computational complexity of data transport formats. In this post I’d like to focus on one in particular: safely encoding data for output and execution.

Ugh, do I have to?

I know, I know, booooooring, but so many people are still getting this really badly wrong and it continues to punish end users by exposing their data to malicious manipulation.

Robert Hansen and Meredith Patterson have a really good slide deck on stopping injection attacks with computational theory. One core message in that deck is that injection exploits (including SQL injection and cross-site scripting) involve crafting input such that it creates new and unexpected syntactic elements in code executed by the software, essentially introducing new instructions for the software to execute. Let’s look at a simple example.

Learn you a query string

I found the code that prompted me to write this post while updating some Google Maps URLs on our site this afternoon. Some of this code was constructing URLs by doing something like this:

def maps_url(lat, lng, zoom, width, height)
  params = [ "center=#{lat},#{lng}",
             "zoom=#{zoom}",
             "size=#{width}x#{height}" ]

  "http://maps.google.com/?" + params.join("&")
end

maps_url(51.4651204, -0.1148897, 15, 640, 220)

# => "http://maps.google.com/?center=51.4651204,-0.1148897& ...
#                             zoom=15& ...
#                             size=640x220"

You can see the intent here: whoever wrote this code assumes the URL is going to end up being embedded in HTML, and so they have encoded the query string delimiters as &amp; entities. But this doesn’t fix the problem entities are designed to solve, namely: safely representing characters that usually have special meaning in HTML. What is telling is that the comma in the query string should really also be encoded as %2C, but isn’t.

So although the ampersands are being encoded, the actual query data is not, and that means anyone calling this function can use it to inject HTML, for example:

link = '<a href="' + maps_url(0, 0, 1, 0, '"><script>alert("Hello!")</script>') +
           '">Link text</a>'

# => '<a href="http://maps.google.com/?center=0,0&amp; ... 
#     zoom=1&amp; ... 
#     size=0x"> ...
#     <script>alert("Hello!")</script> ...
#     ">Link text</a>'

By abusing the maps_url() function, I have managed to inject characters with special meaning — <, >, etc. — into the output and thereby added new HTML elements to the output that shouldn’t be there. By passing unexpected input I’ve created a lovely little cross-site scripting exploit and stolen all your users’ sessions!

Note that you cannot cleanly fix this by using an HTML-escaping function like ERB::Util.h() on the output of maps_url(), because this would serve to re-encode the ampersands, leaving strings like &amp;amp; in the href attribute.

Stacks of languages

Meredith Patterson, co-author of the slide deck linked above, gave another presentation at 28C3 called The Science of Insecurity. I’ve been telling absolutely everyone to watch it recently, and I recommend you do too.

This talk describes how we should think of data transfer formats, network protocols and the like as languages, because in fact that’s what they are. It covers the different levels of language power – regular languages, context-free languages and Turing-complete languages – and how use of each affects the security of our systems. It also explains why, if your application relies on Turing-complete protocols, it will take an infinite amount of time to secure it.

When you build HTML pages, you are using a handful of languages that all run together in the same document. There’s HTML itself, and embedded URLs, and CSS, and JavaScript, and JavaScript embedded in CSS, and CSS selectors embedded in CSS and JavaScript, and base64 encoded images, and … well this list is long. All of these are languages and have formal definitions about how to parse them, and your browser needs to know which type of data it’s dealing with whenever it’s parsing your code.

Every character of output you generate is an instruction that tells the browser what to do next. If it’s parsing an HTML attribute and sees the " character, it truncates the attribute at that point. If it thinks it’s reading a text node and sees a <, it starts parsing the input as an HTML tag.

Instead of thinking of your pages as data, you should think of them as executable language.

Back to reality

Let’s apply this idea to our URL:

http://maps.google.com/?center=51.4651204,-0.1148897&amp;zoom=15&amp;size=640x220

Outside of an HTML document, the meaning of this list of characters changes: those &amp; blobs only have meaning when interpreting HTML, and if we treat this query string verbatim we get these parameters out:

{
  'center'   => '51.4651204,-0.1148897',
  'amp;zoom' => '15',
  'amp;size' => '640x220'
}

(This assumes your URL parser doesn’t treat ; as a value delimiter, or complain that the comma is not encoded.)

We’ve seen what happens when we embed HTML-related characters in the URL: inserting the characters "> chops the <a> tag short and allows injection of new HTML elements. But that behaviour comes from HTML, not from anything about URLs; when the browser is parsing an href attribute, it just reads until it hits the closing quote symbol and then HTML-decodes whatever it read up to that point to get the attribute value. It could be a URL, or any other text value, the browser does not care. At that level of parsing, it only matters that the text is HTML-encoded.

In fact, you could have a query string like foo=true&bar="> and parsing it with a URL parser will give you the data {'foo' => 'true', 'bar' => '">'}. The characters "> mean something in the HTML language, but not in the query string language.
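You can check this with Ruby’s standard library; a quick sketch:

require 'cgi'

CGI.parse('foo=true&bar=">')
# => {"foo"=>["true"], "bar"=>["\">"]}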

So, we have a stack of languages, each nested inside the other. Symbols with no special meaning at one level can gain meaning at the next. What to do?

Stacks of encodings

What we’re really doing here is taking a value and putting it into a query string inside a URL, then putting that URL inside an HTML document.

                                +-------------------------+
                                | "51.4651204,-0.1148897" |
                                +------------+------------+
                                             |
    +----------------------------------------|--------+
    |                                +-------V------+ |
    | http://maps.google.com/?center=| centre_value | |
    |                                +--------------+ |
    +------------------------+------------------------+
                             |
                       +-----V-----+
              <a href="| url_value |">Link</a>
                       +-----------+

At each layer, the template views the value being injected in as an opaque string — it doesn’t care what it is, it just needs to make sure it’s encoded properly. The problem with our original example is that it pre-emptively applies HTML encoding to data because it anticipates that the value will be used in HTML, but does not apply encodings relevant to the task at hand, namely URL construction. This is precisely backwards: considering the problem as above, we see that we should instead:

  1. Decide what type of string we’re creating — is it a URL, an HTML doc, etc.
  2. Apply all encoding relevant to the type of string being made
  3. Do not apply encodings for languages further up the stack

In other words, we should make a URL-constructing function apply URL-related encoding to its inputs, and an HTML-constructing function should apply HTML encoding. This means each layer’s functions can be recombined with others and still work correctly, because their outputs don’t make assumptions about where they will be used. So we would rewrite our code as:

require 'cgi'

def maps_url(lat, lng, zoom, width, height)
  params = { "center" => "#{lat},#{lng}",
             "zoom"   => zoom,
             "size"   => "#{width}x#{height}" }

  query = params.map do |key, value|
    "#{CGI.escape key.to_s}=#{CGI.escape value.to_s}"
  end
  "http://maps.google.com/?" + query.join("&")
end

url = maps_url(51.4651204, -0.1148897, 15, 640, 220)

# => "http://maps.google.com/?center=51.4651204%2C-0.1148897& ...
#                             zoom=15& ...
#                             size=640x220"

require 'erb'

html = '<a href="' + ERB::Util.h(url) + '">Link</a>'

# => '<a href="http://maps.google.com/?center=51.4651204%2C-0.1148897
#     &amp; ...
#     zoom=15&amp; ...
#     size=640x220">Link</a>'

Now we see that we get two valid pieces of data: url is a valid URL with all its query parameters correctly encoded but no HTML entities present, and html is a valid HTML fragment with its attributes correctly entity-encoded.

Also, note how we have treated all incoming data as literal (i.e. not already encoded for the task at hand), and we have not hand-written any encoding ourselves (e.g. hand-writing entities like &amp;). You should deal with data assuming it contains the literal information it represents and use library functions to encode it correctly. There’s a very good chance you don’t know all the text transformations required by each layer.

Thinking in types

At this point you’re probably thinking that I’ve made something quite simple seem very complicated. But thinking in terms of types of strings, treating your output as a language stack and following the bullet list above is a good discipline to follow if you want to make sure you handle data safely.

There are some systems that do this for you, for example Rails 3 automatically HTML-escapes any value you insert into an ERB template by default. I’m working on a more general version of this idea: Coping is a templating language that checks your templates conform to the language you’re producing, and doesn’t let input introduce new syntactic elements.

If you’re feeling very brave, I recommend taking the Coursera Compilers course. Although it doesn’t seem immediately relevant to web devs, many concepts from parser theory, type checking and code generation can be applied to security and are well worth learning.

Above all, learn from other people’s security failures and consider where you may have made similar mistakes.

Streaming modules in Rails 3

Rails 3 has limited built-in support for progressive rendering (or template streaming). It can stream your header, page, and footer when each has completed rendering. This can have a big impact on your pages’ speed. In our case it’s brought our average first response time from 1s to 400ms.

However, this doesn’t go as far as we’d like. In previous versions of songkick.com, we used a plugin that activated module-by-module template streaming. This means that the page streams each module of the page as it is rendered, not just in three big blocks.

This doesn’t seem to be possible in Rails 3 (although it is apparently planned for Rails 4), but with a bit of fiddling we have managed to make it work. It’s currently activated on most pages on www.songkick.com; if you head over there and find a slow page, you can watch it come in a module at a time.

I’m sharing it here both because I thought it might be useful, and because it’s a bit of a hack and I’d like feedback on it.

In the action you want to stream, you can turn on standard Rails 3 streaming with:

def show
  render :stream => true
end

But to enable module by module streaming there are two things to do. First, in your layout:

<% $streaming_output_buffer = @output_buffer %>
<%= yield %>

And then in the page template (here show.html.erb) wrap the entire contents in a block:

<% with_output_buffer($streaming_output_buffer) do %>
  … all of ERB template in here
<% end %>

This will cause this ERB template to write to the streaming buffer rather than its own output buffer each time it renders a portion of the page.

You should only add this block in the top-level page templates. Don’t put it in partials, otherwise weird things will happen. Also, if your entire page is a render call to just one other template (say the page just contained one big partial) then you won’t see any benefit, because it’s only this template that streams, not recursively called ones.

Told you it was a bit of a hack. Hopefully it will tide us over until Rails 4 implements this properly.

validates_uniqueness_of :nothing

Warning: this article contains rather a lot of silly decisions.

I’ve recently been working out some bugs in our OAuth implementation, including our OAuth2::Provider library. One of the biggest gotchas I found while diagnosing problems with our client apps was the existence of duplicate Authorization records.

An Authorization is a link between a ResourceOwner (i.e. a Songkick user) and a Client, for example our iPhone application. It represents that the user has granted the client access to their resources on Songkick. There should only be one of these per owner-client pair, and somehow we had a few thousand duplicates in our database. Getting more concrete, the table’s columns include the following:

+---------------------+--------------+
| Field               | Type         |
+---------------------+--------------+
| resource_owner_type | varchar(255) |
| resource_owner_id   | int(11)      |
| client_id           | int(11)      |
+---------------------+--------------+

Each combination of values for these three columns must only appear once in the table.

A series of unfortunate events

Now the Rails Way to make such guarantees is to use validates_uniqueness_of, or use a find_or_create_by_* call to check if something exists before creating it. And that’s basically what I’d done; OAuth2::Provider has a method called Authorization.for(owner, client) that would either find a suitable record or create a new one.
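For illustration, the naive logic looks something like this (a simplified sketch, not the library’s actual code):

class Authorization < ActiveRecord::Base
  def self.for(owner, client)
    where(:resource_owner_type => owner.class.name,
          :resource_owner_id   => owner.id,
          :client_id           => client.id).first ||
      create(:resource_owner => owner, :client => client)
  end
end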

But despite implementing this, we were still getting duplicates. I removed an alternative code path for getting Authorization records, and still the duplicates continued. I figured something in our applications must be creating them, so I made new() and create() private on the Authorization model. No dice.

And then I remembered: concurrency! Trying to enforce uniqueness on the client doesn’t work, unless all the clients subscribe to a distributed decision-making protocol. If two requests are in flight, both can run a SELECT query, find there’s no existing record, and then both decide to create the record. Something like this:

             User 1                 |               User 2
------------------------------------+--------------------------------------
# User 1 checks whether there's     |
# already a comment with the title  |
# 'My Post'. This is not the case.  |
SELECT * FROM comments              |
WHERE title = 'My Post'             |
                                    |
                                    | # User 2 does the same thing and also
                                    | # infers that his title is unique.
                                    | SELECT * FROM comments
                                    | WHERE title = 'My Post'
                                    |
# User 1 inserts his comment.       |
INSERT INTO comments                |
(title, content) VALUES             |
('My Post', 'hi!')                  |
                                    |
                                    | # User 2 does the same thing.
                                    | INSERT INTO comments
                                    | (title, content) VALUES
                                    | ('My Post', 'hello!')
                                    |
                                    | # ^^^^^^
                                    | # Boom! We now have a duplicate
                                    | # title!

This may look familiar to you. In fact, I lifted it straight out of the ActiveRecord source, where it explains why validates_uniqueness_of doesn’t work when you have concurrent requests.

Users do the funniest things

I agree with you – in theory. In theory, communism works. In theory.

— Homer J. Simpson

There can be a tendency among some programmers to dismiss these arguments as things that probably won’t be a problem in practice. Why would two requests arrive at the same time, close enough to cause this race condition in the database, for the same user’s resources? This is the same thinking that tells you timing attacks are impossible over the Internet.

And I subscribed to this belief for a long time. Not that I thought it was impossible, I just thought there were likelier causes – hence all the attempts to shut down record creation code paths. But I was wrong, and here’s why:

People double-click on things on the Web.

Over time, we designers of software systems have instilled some confusing habits in the people who use our products, and one of those habits means that there is a set of people that always double-click links and form buttons on web pages. Looking at the updated_at timestamps on the duplicate records showed that most of them were modified very close together in time, certainly close enough to cause database race conditions. This fact by itself makes client-enforced uniqueness checks a waste of time. Even if you’re not getting a lot of requests, one little user action can blow your validation.

This is the database’s job

Here’s how this thing should be done, even if you think you’re not at risk:

class AddUniqueIndexToThings < ActiveRecord::Migration
  def self.up
    add_index :oauth_authorizations,
              [:client_id, :resource_owner_type, :resource_owner_id],
              :unique => true
  end
  
  def self.down
    remove_index :oauth_authorizations,
                 [:client_id, :resource_owner_type, :resource_owner_id]
  end
end

Then, when you try to create a record, you should catch the potential exception that this index will throw if the new record violates the uniqueness constraint. Rails 3 introduced a new exception called ActiveRecord::RecordNotUnique for its core adapters, but if you’re still supporting older Rails versions you need to catch ActiveRecord::StatementInvalid and check the error message. Here’s how our OAuth library does things.

DUPLICATE_RECORD_ERRORS = [
  /^Mysql::Error:\s+Duplicate\s+entry\b/,
  /^PG::Error:\s+ERROR:\s+duplicate\s+key\b/,
  /\bConstraintException\b/
]

def self.duplicate_record_error?(error)
  error.class.name == 'ActiveRecord::RecordNotUnique' or
  DUPLICATE_RECORD_ERRORS.any? { |re| re =~ error.message }
end

In the Authorization.for(owner, client) method, there’s a rescue clause that uses duplicate_record_error? to check the exception raised. If it’s a duplicate record error, we retry the method call since the second time it should find the new record that was inserted since the first call started.
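In outline, the pattern looks like this (a simplified sketch of the approach, not the library’s exact code; find_or_create_for is a hypothetical helper):

def self.for(owner, client)
  find_or_create_for(owner, client)  # SELECT first, INSERT if nothing found
rescue ActiveRecord::ActiveRecordError => error
  raise unless duplicate_record_error?(error)
  retry  # the losing INSERT retries, and this time the SELECT finds the winner's row
end

The unique index makes the database the single arbiter of uniqueness; the application code just has to handle losing the race gracefully.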