How we do product discovery

A few weeks ago, I gave a talk at the Future of Web Apps conference on how we do product discovery at Songkick. I had such an overwhelming response to it that I thought it might be useful to share it with the rest of the world.

Apologies for the whack formatting, but SlideShare doesn’t support Keynote files and I didn’t have time to redo the slides in PowerPoint to include the notes in a better way.

I’d love to hear how you guys go about product discovery and any tips / tricks on how to do it better.

Testing iOS apps

We recently released an update to our iPhone app. The app was originally developed by a third-party, so releasing an update required bringing the app development and testing in-house. We develop our projects in a continuous build environment, with automated builds, unit and acceptance tests. It allows us to develop fast and release often, and we wanted the iPhone project to work in the same way.

This article covers some of the tools we used to handle building and testing the app.

Build Automation

We use Git for our version control system, and Jenkins for our continuous integration server. Automating the project build (i.e. building the project to check for compilation errors) seemed like a basic step and a good place to start.

A prerequisite to this was to create a Mac Jenkins Build Slave, which is outside of the scope of this blog post (but if you’re interested, I followed the “master launches slave agent via SSH” instructions of the Jenkins site).

A quick search of the Jenkins plugins page revealed an Xcode plugin which allows for building Objective-C applications. Setting up the plugin was a snap – search and install the “XCode integration” plugin from the Jenkins server plugin page, point the plugin to your project directory on the build slave, enable keychain access, and save.

Now for every commit I made to the project, this task would automatically run, and send me a rude email if project compilation failed. In practice I found that this was an excellent way of reminding me of any files I had forgotten to check in to Git; the project would compile on my laptop but fail on the CI server due to missing classes, images, etc.

Unit testing

I looked briefly into the unit testing framework Apple provides, which ships with Xcode. I added a unit test project to the Songkick app, and looked into creating mocks using OCMock, an Objective-C implementation of mock objects.

We already have fairly extensive API tests to test for specific iPhone-related user-flows (such as signing up, tracking an artist, etc), and due to time constraints we opted to concentrate on building acceptance tests, and revisit unit tests if we had time.

Acceptance Testing

There are a bunch of acceptance testing applications available for iOS apps. Here are a few of the tools I looked into in detail:

Frank

Frank is an iOS acceptance testing application which supports a Cucumber-style test syntax. I was interested in Frank as we already make use of Cucumber to test our Ruby projects, so the familiarity of the domain-specific language would have been a benefit.

I downloaded the project and got a sample test up-and-running fairly quickly. Frank ships with some useful tools, including a web inspector (“Symbiote”) which allows for inspecting app UI elements using the browser, and a “Frank console” for running ad-hoc commands against an iPhone simulator from the command line.

Frank seems to be a pretty feature-rich application. The drawbacks for me were that Frank could not be run on real hardware (as of March 2013, this appears to now be possible), and that Frank requires recompiling your application to make a special “Frankified” version to work with the testing framework.

Instruments

Apple provides an application called Instruments to handle testing, profiling and analysis of applications written with Xcode. Instruments allows for recording and editing UIAutomation scripts – runnable JavaScript test files for use against a simulated iOS app or a real hardware install.

Being able to launch your app with Instruments, perform some actions from within the app, and have those actions automatically converted into a runnable test script was a really quick and easy way of defining tests. Instruments also supports running scripts via the command line.

The drawback of test scripts created with Instruments is that they can be particularly verbose, and Instruments does not provide a convenient way of formatting and defining individual test files (outside of a single UIAutomation script per unique action).

Tuneup_js

Designed to be used as an accompaniment to UIAutomation scripts created using Instruments, Tuneup_js is a JavaScript library that helps to ease the pain of working with the long-winded UIAutomation syntax.

It provides a basic test structure for organising test steps, and a bunch of user-friendly assertions built on top of the standard ones supported by Instruments.

I found that recording tests in Instruments, and then converting them into the Tuneup_js test syntax was a really quick way of building acceptance tests for iOS apps. These tests could then be run using a script provided with the Tuneup_js package.

Scenarios

I settled on using Instruments and Tuneup_js to handle acceptance testing. Instruments because of the ability to quickly record acceptance test steps, and Tuneup_js because it could be used to wrap recorded test steps into repeatable tests and allowed for a nicer test syntax than offered out-of-the-box with UIAutomation. What was missing with these applications was a way to handle running the test files in an easily repeatable fashion, and against the iOS simulator as well as hardware devices.

I couldn’t find an existing application to do this, so I wrote Scenarios (Scenar-iOS, see what I did there?) to handle this task. Scenarios is a simple console Ruby app that performs the following steps:

  • Cleans any previous app installs from the target test device
  • Builds the latest version of the app
  • Installs the app on the target test device
  • Runs Tuneup_js-formatted tests against the installed app
  • Reports the test results
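The sequence above can be sketched as a small Ruby runner that executes named steps in order and stops at the first failure. This is not the actual Scenarios code: the step names mirror the list, while `run_steps` and the placeholder lambdas are hypothetical stand-ins for the real clean, build, install and test commands.

```ruby
# Hypothetical sketch: run named steps in order, stopping at the first failure.
# Each step is a callable that returns true on success and false on failure.
def run_steps(steps)
  completed = []
  steps.each do |name, action|
    return { ok: false, completed: completed, failed: name } unless action.call
    completed << name
  end
  { ok: true, completed: completed, failed: nil }
end

# Placeholder actions; a real runner would shell out to build and test tools here.
steps = {
  "clean"   => -> { true },
  "build"   => -> { true },
  "install" => -> { true },
  "test"    => -> { false } # simulate a failing acceptance test
}

# Reporting is the final step: print the outcome of the run.
result = run_steps(steps)
puts result[:ok] ? "All steps passed" : "Failed at step: #{result[:failed]}"
```

Stopping at the first failure keeps the report unambiguous: there is no point running tests against an app that failed to build or install.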

Scenarios accepts command-line parameters, such as the option to target the simulator or a hardware device (with the option of auto-detecting the hardware, or supplying a device ID). Scenarios also adds a couple of extra functions on top of the UIAutomation library:

  • withTimeout – Can be used for potentially long-running calls (e.g. a button click to login, where the API call may be slow):
    withTimeout(function(){
      app.mainWindow().buttons()["Login"].tap();
    });
  • slowTap – Allows for slowing down the speed at which taps are executed. Instruments can run test steps very fast, and sometimes it helps to slow down tests to see what they are doing, and to create a more realistic simulated user experience:
    app.toolbar().buttons()["Delete"].slowTap();

Scenarios ships with a sample project (app and tests) that can be run using the simulator or hardware.

Jenkins Pipeline

Now that I had build and acceptance tests in place, it was time to hook the tests up to Jenkins. I created the following Jenkins projects:

  • “ios-app” – runs the build automation
  • “ios-app-acceptance-tests-simulator” – runs the app (via Scenarios) on a simulator
  • “ios-app-acceptance-tests-iPhone3GS” – runs the app (via Scenarios) on an iPhone 3GS

Committing a code change to the iOS app Git repo caused the projects in the Jenkins pipeline to build the app, run the acceptance tests against the simulator, and finally run the acceptance tests on an iPhone 3GS. If any stage of the pipeline failed, I received an email informing me I had broken something.

Manual testing with TestFlight

As well as an automated setup, we also made use of the excellent TestFlight service, which enables over-the-air distribution of apps to testers. We had 12 users and 16 devices set up in TestFlight, and I was releasing builds (often daily) over-the-air. It enabled us to get some real-user feedback on the app, something that build and acceptance tests cannot replace.

Jenkins also has a TestFlight plugin, which enables you to automatically deploy a build to TestFlight as part of the pipeline. Very cool, but as we were committing code changes often throughout the day (and only wanted to release to TestFlight once a day), we decided to skip this step for the time being.

Overall, I think that the tools (both open-source and proprietary) available today for automated testing of iOS apps are feature rich (even if some are still in their infancy), and I’m pretty happy with our development setup at Songkick.

Migrating to a new Puppet certification authority

At Songkick all our servers are managed using Puppet, an open source configuration management tool. We use it in client-server mode and recently had the need to replace the certification authority certificates on all our nodes. I couldn’t find much information on how to do this without logging onto every machine, so I’ve documented my method.

What is this Puppet CA anyway?

If you’re using puppet in its typical client-server or agent-master setup, then when the puppet master is first started it will create a certification authority (CA) which all clients that connect to it need to be trusted by and must trust. This usually happens transparently, so often people aren’t aware that this certification authority exists.

The CA exists to establish trust between the agents and the master. It stops an attacker from setting up a malicious puppet master and telling puppet agents to do his or her bidding, and it prevents malicious clients from seeing configuration data for other clients. Agents should only connect to masters that have certificates signed by their CA, and masters should only send configuration information to clients that have certificates signed by the same CA.

There’s a more comprehensive explanation of Puppet SSL written by Brice Figureau which goes into far more detail than we have space for. The main thing to understand is that the CA is an important part of maintaining security and that you can only have one CA across a set of machines that access the same secured resources.

Why would I want to migrate to a new CA?

  • Your current CA certificate is about to expire. By default, CA certificates have a validity period of 5 years, so fairly early adopters of puppet will need to replace them.
  • You’ve had multiple CAs in use and need to consolidate on one.
  • You believe that your certificates and private keys are in the hands of people who could cause mischief with them.
  • You have fallen foul of bugs relating to the fact that you use a CA created in an older version of puppet.

It was in fact the second of these reasons that applied to Songkick; we’d previously been using multiple puppet masters, each with their own CA. We wanted to start using exported resources, stored in the same PuppetDB instance for all nodes. This meant that each master needed to be trusted by the same CA that signed the PuppetDB instance; hence we needed to consolidate on one CA.
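For the first reason in the list above, it helps to know how long a CA certificate has left. The following is a minimal Ruby sketch using the standard `openssl` library; `days_until_expiry` is a hypothetical helper, and the certificate is built in memory with made-up dates so the example runs anywhere. In practice you would load your own CA certificate file instead (its location depends on your ssldir and is not assumed here).

```ruby
require "openssl"

# Hypothetical helper: whole days left before the certificate's notAfter date.
def days_until_expiry(cert, now = Time.now)
  ((cert.not_after - now) / 86_400).floor
end

# In practice, something like:
#   cert = OpenSSL::X509::Certificate.new(File.read(path_to_your_ca_cert))
# Here we build an in-memory certificate with illustrative dates only.
cert = OpenSSL::X509::Certificate.new
cert.not_before = Time.utc(2013, 1, 1)
cert.not_after  = Time.utc(2018, 1, 1) # five years later, as with the default validity

puts days_until_expiry(cert, Time.utc(2017, 12, 2))
# prints: 30
```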

How do I do it?

Set up new puppet master(s)

Set up at least one new puppet master server, with a new CA certificate.

If you have a lot of existing hosts managed by puppet, then it’s worth considering enabling the autosign option, even if only temporarily, as you’ll have a number of certificate requests to approve manually otherwise.

Configure agents to connect to the new master(s)

We’re assuming here that you’re managing the puppet agent configuration through puppet, and that changes to the puppet configuration cause an automatic restart of the puppet agent.

Change the configuration of your puppet agents, to connect to the new master(s) and use a different ssldir:

[main]
server = <new server hostname> 
ssldir = /var/lib/puppet/ssl2

Be careful not to apply this change to your newly created puppet master.

Your clients should reconfigure themselves, restart and when they start up, connect to your new puppet master, forgetting their old ssl configuration, including the CA certificates.

If you have autodiscovery records for puppet in DNS, e.g. an A record for ‘puppet’ or the SRV records, then you should leave them in place for now. Agents that have not been migrated to the new CA may need it.

It is a good idea to test this on a handful of nodes and check that it works in a completely automated fashion before applying to every node.

Tidying up (part 1)

Once every node has checked in with the new master and been issued with a new certificate, it’s time to start tidying up. It’s a good idea to revert to the default ssldir. Otherwise, an agent that bootstraps itself with the default config would later switch to the new ssldir and forget the certificates it had been using; the master would then refuse to talk to it, as this looks like a spoofing attempt.

On each client, we mirror the new ssldir to the old one:

file { '/var/lib/puppet/ssl': 
  source => 'file:///var/lib/puppet/ssl2',
  recurse => true, 
  purge => true, 
  force => true, 
}

Be careful not to apply this change to your newly created puppet master.

Tidying up (part 2)

Once that’s shipped everywhere, we remove the ssldir configuration, fall back on the default ssldir and remove the above resource definition that copies the ssldir.

Tidying up (part 3)

You can now update your autodiscovery DNS entries, to point to the new servers and remove the autosign configuration, if desired.

Finally, we ship a change to the clients that removes the temporary /var/lib/puppet/ssl2 directory.

And that’s it, everything has been migrated to the new CA, with no need to do anything outside of puppet.

Testing your database backups: the test environment database refresh pattern

When did you last try restoring your database backups? A month ago, a week ago? A year ago? Never? When was the last time you refreshed the data in your test environments? When I joined Songkick, one of the first things I asked was when we last tested a restore of our database backups. The answer, pleasingly, was at 03:00 UK time that morning and not coincidentally, that’s when we last refreshed the data in our test environments.

Here’s how we get the warm and fuzzy feeling of knowing that our backups contain data that can be restored and makes sense.

  1. Every morning, our database servers run their scheduled backups, copying the resulting images to a backup server in the data centre.
  2. Overnight those backups get copied to the office, giving us an offsite copy.
  3. In the small hours, when most of us are asleep, each of the database servers in our staging environment retrieves the backups, erases its local data files and then restores the production backups over the top of them.
  4. We perform sanitisation on the data, to make it suitable for use in a testing environment.
  5. And finally, and most importantly, we use the databases in our testing.

By doing this, we identified one case when our backups seemed to work, produced plausible looking backups, but MySQL failed to apply InnoDB log records during recovery. It was inconvenient to discover this problem in our staging environment, but far less inconvenient than discovering it when we needed the backups to put our production system back into operation.

Here are some practical tips based on our experience implementing and managing this system at Songkick:

Back all databases up at the same time

If your system is composed of services backed by independent databases on different machines, it’s possible that there’s some implicit consistency between them. For example, a common situation at Songkick is to have an accounts service responsible for storing user accounts and another service that stores user data keyed against a user. In that case, there’s an expectation that those databases have some degree of consistency.

If you back them up at different times, you’ll find inconsistencies: a service might, for example, have a reference to a user that doesn’t yet exist. If the ID of the user is exposed to other services and that ID can be reused, you may find that newly created users in your test environment have existing data associated with them and this can cause significant problems in testing.

It’s worth noting that, in the case of a production restore, these issues would need to be diagnosed and solved in the heat of the moment. By finding them in your test environment, you’re giving yourself the space to solve them earlier, under less pressure.
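One way to spot this kind of inconsistency after a restore is to compare the user IDs one service knows about with the IDs another service refers to. The Ruby sketch below is only an illustration: `dangling_user_ids` is a hypothetical helper and the ID lists are made-up sample data, standing in for whatever queries you would run against the two restored databases.

```ruby
require "set"

# Hypothetical check: which referenced user IDs are missing from the accounts data?
def dangling_user_ids(account_ids, referenced_ids)
  (referenced_ids.to_set - account_ids.to_set).to_a.sort
end

accounts   = [1, 2, 3]       # IDs present in the restored accounts database
references = [2, 3, 4, 4, 5] # user IDs referenced by another restored service

p dangling_user_ids(accounts, references)
# prints: [4, 5]
```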

Design the backups to be regularly exercised

Some types of backups are more amenable to being restored regularly in test environments. For example, our initial MongoDB database backups performed snapshots of our MongoDB database path. These proved difficult to restore, because they included local databases which contained information on replica set membership. This means that on startup, our staging MongoDB server would forget its existing replica set membership and try to talk to the production servers instead.

We switched to using mongodump to take a logical export of the database, simply so that we could restore it on the primary member of our existing staging replica set and update the entire replica set.

Sanitisation tips

After we’ve restored the databases, there are certain things we do to make them safe and usable in our testing environments.

  • Remove or obfuscate email addresses. We’re not fond of accidentally emailing people with test events we’ve created in staging, so we change people’s email addresses to be unusable, so that can’t happen. We leave people’s email addresses alone if they work at Songkick, so we can test email features by emailing ourselves.
  • Remove or obfuscate payment tokens. If it’s uncool to accidentally email people, accidentally charging them is positively hostile. Anything that’s used for payment needs to be removed.
  • Fix or replace information about the environment. It’s best to avoid keeping references to your technical environment in the same database as your application data, but sometimes it’s tricky to work around. For example, our MogileFS installation needs to be kept in sync with our production one, to avoid problems with missing media. This means that we need to manually update the database to substitute the hostnames of the mogilefs servers.
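The email rule in the list above might look something like the following in Ruby. This is a sketch under assumptions rather than our actual sanitisation code: the `sanitise_email` helper, the internal domain value and the replacement address format are all hypothetical. The `.invalid` top-level domain is reserved, so mail sent to the rewritten addresses cannot be delivered.

```ruby
require "digest"

INTERNAL_DOMAIN = "songkick.com" # assumption: the domain whose addresses are left alone

# Hypothetical helper: keep internal addresses, make every other address undeliverable.
def sanitise_email(email)
  _local, domain = email.split("@", 2)
  return email if domain && domain.downcase == INTERNAL_DOMAIN

  # A stable hash keeps addresses unique per user without revealing the original.
  "user-#{Digest::SHA256.hexdigest(email)[0, 12]}@example.invalid"
end

puts sanitise_email("alice@songkick.com")
puts sanitise_email("bob@example.org")
```

Deriving the replacement from a hash means the same input always maps to the same output, which keeps any uniqueness constraints on the email column intact.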

Write code that can withstand the database going away

Unless you’ve put some work in, almost no database driver will gracefully handle the disappearance of a database server and then its re-appearance some time later. If the restore in your test environment is the first time you’ve tried this, you may find that you need to manually restart services, even after the database re-appears on the network.

The solution will vary depending on the database client being used, but often it’s a case of catching an exception, or changing some options when you establish the connection.

By making your applications reconnect to the database with no manual input, you are again fixing a problem that will eventually occur in production – a much more stressful time for it to be diagnosed and fixed.
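As a rough illustration of the "catch an exception and try again" approach, here is a Ruby sketch of a retry wrapper with exponential backoff. `ConnectionLost` is a hypothetical stand-in for whatever error your database client raises when the server disappears, and `with_reconnect` is not part of any real driver; many clients have their own reconnect options, which are worth checking first.

```ruby
# Hypothetical stand-in for the driver-specific "connection lost" error.
class ConnectionLost < StandardError; end

# Retries the block when the connection is lost, waiting longer after each failure.
# The sleeper argument exists so tests can skip the real delay.
def with_reconnect(attempts: 5, base_delay: 0.5, sleeper: Kernel.method(:sleep))
  tries = 0
  begin
    tries += 1
    yield
  rescue ConnectionLost
    raise if tries >= attempts # give up and surface the error
    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end

# Usage: simulate a database that is unavailable for the first two calls.
calls = 0
result = with_reconnect(sleeper: ->(_seconds) {}) do
  calls += 1
  raise ConnectionLost, "database is restoring" if calls < 3
  :query_result
end
puts "succeeded after #{calls} attempts: #{result}"
```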

Summary

Testing your database backups by restoring them automatically and regularly in your test environments is a great way to battle-harden your backups and applications and to make sure that your test environment looks like the real production environment.



Safely dealing with magical text

Boy, what a week it’s been. A remote-code-execution bug was discovered in Ruby on Rails, and we’ve all been scrambling to patch our servers (please patch your apps before reading any further, there is an automated exploit out there that gives people a shell on your boxes otherwise).

What the Ruby community, and those of other dynamic languages, must realize from recent Rails security blunders is that very similar problems can easily exist in any non-trivial web application. Indeed, I found a remote-execution bug in my own open-source project Faye yesterday, 3.5 years into the life of the project (again: patch before reading on).

There are a lot of lessons to be had from recent Rails security blunders, since they involve so many co-operating factors: excessive trust of user input, insufficient input validation and output encoding, the behavioural capabilities of Ruby objects and certain Rails classes, ignorance of cryptography and the computational complexity of data transport formats. In this post I’d like to focus on one in particular: safely encoding data for output and execution.

Ugh, do I have to?

I know, I know, booooooring, but so many people are still getting this really badly wrong and it continues to punish end users by exposing their data to malicious manipulation.

Robert Hansen and Meredith Patterson have a really good slide deck on stopping injection attacks with computational theory. One core message in that paper is that injection exploits (including SQL injection and cross-site scripting) involve crafting input such that it creates new and unexpected syntactic elements in code executed by the software, essentially introducing new instructions for the software to execute. Let’s look at a simple example.

Learn you a query string

I found the code that prompted me to write this post while updating some Google Maps URLs on our site this afternoon. Some of this code was constructing URLs by doing something like this:

def maps_url(lat, lng, zoom, width, height)
  params = [ "center=#{lat},#{lng}",
             "zoom=#{zoom}",
             "size=#{width}x#{height}" ]

  "http://maps.google.com/?" + params.join("&amp;")
end

maps_url(51.4651204, -0.1148897, 15, 640, 220)

# => "http://maps.google.com/?center=51.4651204,-0.1148897&amp; ...
#                             zoom=15&amp; ...
#                             size=640x220"

You can see the intent here: whoever wrote this code assumes the URL is going to end up being embedded in HTML, and so they have encoded the query string delimiters as &amp; entities. But this doesn’t fix the problem entities are designed to solve, namely: safely representing characters that usually have special meaning in HTML. What is telling is that the comma in the query string should really also be encoded as %2C, but isn’t.

So although the ampersands are being encoded, the actual query data is not, and that means anyone calling this function can use it to inject HTML, for example:

link = '<a href="' +
           maps_url(0, 0, 1, 0, '"><script>alert("Hello!")</script>') +
           '">Link text</a>'

# => '<a href="http://maps.google.com/?center=0,0&amp; ...
#                                      zoom=1&amp; ...
#                                      size=0x"> ...
#         <script>alert("Hello!")</script> ...
#         ">Link text</a>'

By abusing the maps_url() function, I have managed to inject characters with special meaning — <, >, etc. — into the output and thereby added new HTML elements to the output that shouldn’t be there. By passing unexpected input I’ve created a lovely little cross-site scripting exploit and stolen all your users’ sessions!

Note that you cannot cleanly fix this by using an HTML-escaping function like ERB::Util.h() on the output of maps_url(), because this would serve to re-encode the ampersands, leaving strings like &amp;amp; in the href attribute.
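The double-encoding problem is easy to reproduce with Ruby's standard `erb` library. The input string below is just an illustrative, already entity-encoded URL of the kind the original function returns:

```ruby
require "erb"

# A URL whose ampersands were already hand-encoded as HTML entities.
already_encoded = "http://maps.google.com/?center=0,0&amp;zoom=1"

# HTML-escaping it again encodes the ampersand inside each entity.
puts ERB::Util.h(already_encoded)
# prints: http://maps.google.com/?center=0,0&amp;amp;zoom=1
```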

Stacks of languages

Meredith Patterson of the above-linked paper gave another presentation at 28C3 called The Science of Insecurity. I’ve been telling absolutely everyone to watch it recently.

This talk describes how we should think of data transfer formats, network protocols and the like as languages, because in fact that’s what they are. It covers the different levels of language power – regular languages, context-free languages and Turing-complete languages – and how use of each affects the security of our systems. It also explains why, if your application relies on Turing-complete protocols, it will take an infinite amount of time to secure it.

When you build HTML pages, you are using a handful of languages that all run together in the same document. There’s HTML itself, and embedded URLs, and CSS, and JavaScript, and JavaScript embedded in CSS, and CSS selectors embedded in CSS and JavaScript, and base64 encoded images, and … well this list is long. All of these are languages and have formal definitions about how to parse them, and your browser needs to know which type of data it’s dealing with whenever it’s parsing your code.

Every character of output you generate is an instruction that tells the browser what to do next. If it’s parsing an HTML attribute and sees the " character, it truncates the attribute at that point. If it thinks it’s reading a text node and sees a <, it starts parsing the input as an HTML tag.

Instead of thinking of your pages as data, you should think of them as executable language.

Back to reality

Let’s apply this idea to our URL:

http://maps.google.com/?center=51.4651204,-0.1148897&amp;zoom=15&amp;size=640x220

Outside of an HTML document, the meaning of this list of characters changes: those &amp; blobs only have meaning when interpreting HTML, and if we treat this query string verbatim we get these parameters out:

{
  'center'   => '51.4651204,-0.1148897',
  'amp;zoom' => '15',
  'amp;size' => '640x220'
}

(This assumes your URL parser doesn’t treat ; as a value delimiter, or complain that the comma is not encoded.)

We’ve seen what happens when we embed HTML-related characters in the URL: inserting the characters "> chops the <a> tag short and allows injection of new HTML elements. But that behaviour comes from HTML, not from anything about URLs; when the browser is parsing an href attribute, it just reads until it hits the closing quote symbol and then HTML-decodes whatever it read up to that point to get the attribute value. It could be a URL, or any other text value, the browser does not care. At that level of parsing, it only matters that the text is HTML-encoded.

In fact, you could have a query string like foo=true&bar="> and parsing it with a URL parser will give you the data {'foo' => 'true', 'bar' => '">'}. The characters "> mean something in the HTML language, but not in the query string language.
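Both of these parses can be checked with `URI.decode_www_form` from Ruby's standard library, which (in current Ruby versions) splits only on `&`. This is one parser among many, so treat it as a sketch of the behaviour described above rather than a statement about every URL library:

```ruby
require "uri"

# The HTML-encoded query string, parsed as a plain query string.
html_encoded = "center=51.4651204,-0.1148897&amp;zoom=15&amp;size=640x220"
mangled = URI.decode_www_form(html_encoded).to_h
puts mangled.keys.inspect # the "amp;" prefix ends up in the parameter names

# Characters that are special in HTML are ordinary data to the query string parser.
harmless = URI.decode_www_form('foo=true&bar=">').to_h
puts harmless["bar"]
```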

So, we have a stack of languages, each nested inside the other. Symbols with no special meaning at one level can gain meaning at the next. What to do?

Stacks of encodings

What we’re really doing here is taking a value and putting it into a query string inside a URL, then putting that URL inside an HTML document.

                                +-------------------------+
                                | "51.4651204,-0.1148897" |
                                +------------+------------+
                                             |
    +----------------------------------------|--------+
    |                                +-------V------+ |
    | http://maps.google.com/?center=| centre_value | |
    |                                +--------------+ |
    +------------------------+------------------------+
                             |
                       +-----V-----+
              <a href="| url_value |">Link</a>
                       +-----------+

At each layer, the template views the value being injected in as an opaque string: it doesn’t care what it is, it just needs to make sure it’s encoded properly. The problem with our original example is that it pre-emptively applies HTML encoding to data because it anticipates that the value will be used in HTML, but does not apply encodings relevant to the task at hand, namely URL construction. This is precisely backwards: considering the problem as above we see that we should instead:

  1. Decide what type of string we’re creating — is it a URL, an HTML doc, etc.
  2. Apply all encoding relevant to the type of string being made
  3. Do not apply encodings for languages further up the stack

In other words, we should make a URL-constructing function apply URL-related encoding to its inputs, and an HTML-constructing function should apply HTML encoding. This means each layer’s functions can be recombined with others and still work correctly, because their outputs don’t make assumptions about where they will be used. So we would rewrite our code as:

require "cgi" # provides CGI.escape
require "erb" # provides ERB::Util.h, used further down

def maps_url(lat, lng, zoom, width, height)
  params = { "center" => "#{lat},#{lng}",
             "zoom"   => zoom,
             "size"   => "#{width}x#{height}" }

  query = params.map do |key, value|
    "#{CGI.escape key.to_s}=#{CGI.escape value.to_s}"
  end
  "http://maps.google.com/?" + query.join("&")
end

url = maps_url(51.4651204, -0.1148897, 15, 640, 220)

# => "http://maps.google.com/?center=51.4651204%2C-0.1148897& ...
#                             zoom=15& ...
#                             size=640x220"

html = '<a href="' + ERB::Util.h(url) + '">Link</a>'

# => '<a href="http://maps.google.com/?center=51.4651204%2C-0.1148897&amp; ...
#                                      zoom=15&amp; ...
#                                      size=640x220">Link</a>'

Now we see that we get two valid pieces of data: url is a valid URL with all its query parameters correctly encoded but no HTML entities present, and html is a valid HTML fragment with its attributes correctly entity-encoded.
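To see that this closes the hole, we can feed the earlier attack string through the rewritten function. The sketch below repeats `maps_url` so that it runs on its own; the attack input is the same one used in the injection example.

```ruby
require "cgi"
require "erb"

# Same function as above, repeated so this example is self-contained.
def maps_url(lat, lng, zoom, width, height)
  params = { "center" => "#{lat},#{lng}",
             "zoom"   => zoom,
             "size"   => "#{width}x#{height}" }
  query = params.map { |key, value| "#{CGI.escape key.to_s}=#{CGI.escape value.to_s}" }
  "http://maps.google.com/?" + query.join("&")
end

attack = '"><script>alert("Hello!")</script>'
url    = maps_url(0, 0, 1, 0, attack)
html   = '<a href="' + ERB::Util.h(url) + '">Link text</a>'

puts url  # the quotes and angle brackets are percent-encoded
puts html # the only tags present are the <a> and </a> we wrote ourselves
```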

Also, note how we have treated all incoming data as literal (i.e. not already encoded for the task at hand), and we have not hand-written any encoding ourselves (e.g. hand-writing entities like &amp;). You should deal with data assuming it contains the literal information it represents and use library functions to encode it correctly. There’s a very good chance you don’t know all the text transformations required by each layer.
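Taking the "use library functions" advice one step further, Ruby's standard library can build the whole query string, so there is no hand-written joining either. This is an alternative sketch, not the code we shipped; note that `URI.encode_www_form` uses form encoding, so it writes spaces as `+`.

```ruby
require "uri"

# Keys and values are passed as literal data; the library does all the encoding.
query = URI.encode_www_form({ "center" => "51.4651204,-0.1148897",
                              "zoom"   => 15,
                              "size"   => "640x220" })
puts query
# prints: center=51.4651204%2C-0.1148897&zoom=15&size=640x220
```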

Thinking in types

At this point you’re probably thinking that I’ve made something quite simple seem very complicated. But thinking in terms of types of strings, treating your output as a language stack and following the bullet list above is a good discipline to follow if you want to make sure you handle data safely.

There are some systems that do this for you, for example Rails 3 automatically HTML-escapes any value you insert into an ERB template by default. I’m working on a more general version of this idea: Coping is a templating language that checks your templates conform to the language you’re producing, and doesn’t let input introduce new syntactic elements.

If you’re feeling very brave, I recommend taking the Coursera Compilers course. Although it doesn’t seem immediately relevant to web devs, many concepts from parser theory, type checking and code generation can be applied to security and are well worth learning.

Above all, learn from other people’s security failures and consider where you may have made similar mistakes.

Introducing Aspec: A black box API testing DSL

Caltrak is the service that stores Songkick users’ tracked artists and cities. It has no other service dependencies. You put data into the Caltrak box, then you get it back out.

For instance, you might make two POST requests to store artist trackings, and then want to retrieve them, which would look like this:

# create and retrieve artist trackings
POST /users/7/artists/1    204
POST /users/7/artists/2    204
 GET /users/7/artists      200    application/json   [1, 2]

Did you understand basically what that was saying? I hope so, because that’s an executable spec from the Caltrak tests.

It’s pretty simple. Every line is both a request and an assertion. Every line says “If I make this request then I expect to get this back”.

This works because the behaviour of this service can be entirely described through the REST API. There are no “side effects” that are not visible through the API itself.

Here is a longer portion from the aspec file.

# no users have pending notifications
   GET /users/with-pending-notifications                200  application/json  []

# users with events on their calendar have pending notifications
  POST /users/764/metro-areas/999                       204
  POST /users/764/artists/123                           204
  POST /events/5?artist_ids=123&metro_area_id=999       204
  POST /events/5/enqueue-notifications                  204
   GET /users/with-pending-notifications                200  application/json  [[764, "ep"]]

# users are unique in the response
  POST /users/764/artists/123                           204
  POST /users/764/artists/456                           204
  POST /users/764/metro-areas/999                       204
  POST /events/5?artist_ids=123,456&metro_area_id=999   204
  POST /events/5/enqueue-notifications                  204
   GET /users/with-pending-notifications                200  application/json  [[764, "ep"]]

Some aspects:

  • Each line has the format Verb, Url (with Params), Status, Content Type, Body separated by whitespace. These are the only things that can be asserted about the service responses.
  • Each “paragraph” is a separate test. The database is cleared in-between.
  • Lines beginning with # are comments.
  • Aspec stubs time, so that the first line of the test occurs precisely on the epoch and each subsequent line occurs 2s after that. This allows us to test responses with creation timestamps in them.
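To make the line format concrete, here is a minimal sketch of how one spec line could be split into its five fields. It is not Aspec’s real implementation: the names Step, parse_aspec_line and stubbed_time are invented for illustration, and it assumes the body is always JSON.

```ruby
require 'json'

# Hypothetical names throughout; this is not Aspec's real implementation.
Step = Struct.new(:verb, :url, :status, :content_type, :body)

# Turn one spec line into a Step, or nil for blank lines and comments.
def parse_aspec_line(line)
  stripped = line.strip
  return nil if stripped.empty? || stripped.start_with?('#')

  # Limit the split to five fields so spaces inside the JSON body survive.
  verb, url, status, content_type, body = stripped.split(/\s+/, 5)
  Step.new(verb, url, Integer(status), content_type, body && JSON.parse(body))
end

# Stubbed clock: the first line of a test runs at the epoch,
# and each later line runs two seconds after the previous one.
def stubbed_time(line_index)
  Time.at(line_index * 2).utc
end

step = parse_aspec_line(' GET /users/7/artists      200    application/json   [1, 2]')
puts step.verb    # GET
puts step.status  # 200
p step.body       # [1, 2]
```

Lines that only assert a status, such as the POST lines above, simply come back with a nil content type and body.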

Motivation

When we began developing Caltrak, I wasn’t happy with the process of writing tests for this service.

I wanted the test framework to expose the simple nature of the API. You could make something almost as simple in RSpec or Cucumber with judicious use of helpers and so on, but you would end up with a DSL that obscured the underlying REST API.

In an Aspec file, there is no syntax that does not express real data either sent or received from the service. You’re basically writing down the actual HTTP requests and responses with lots of parts omitted. It is technical, but it is very readable. I think it is better documentation than most service tests.

Also, there is no context that is not immediately visible, as there might be with nested RSpec contexts, for example, where in a large test file the setup may be very distant from the test and assertion.

Implementation

NB This project is very immature. Use at your own risk.

Aspec assumes your project uses Rack, and uses Rack/Test to talk to it. The code is published on GitHub and there is a tiny example API project.

It is very similar to RSpec in operation. You write a .aspec file, and put an aspec_helper.rb next to it.

Then run

aspec aspec/my_service.aspec

I’d be interested in hearing your thoughts on this testing style.

A Second Here, A Second There…

A Mysterious Performance Problem

In production, we were seeing lots of requests to our backend services taking a long time. They would typically be over 1000ms when we would expect them to be much, much faster. It was intermittent too. Sometimes the same endpoints were fast, sometimes they were slow.

When we investigated, we were able to reduce the performance problem to the following mystifying test case, which should happen on your system too.

With this Sinatra app:

require 'sinatra'

post "/foo" do
  status 200
  "There are #{params.length} params"
end

Making this request:

curl -d "a=1" "localhost:8080/foo"

takes almost exactly 0 seconds.

(Note, we use libcurl to talk to services instead of Ruby’s Net::HTTP, so this is very close to a real service request from one of our applications.)

On the other hand, making this request:

curl -id "a=$A_KB_OF_DATA" "localhost:8080/foo"

takes almost exactly 1 second.

Now, it shouldn’t take 1 second to make such a simple request just because I’m sending 1K of data.

So what on earth is happening here? If you already know the answer then much respect to you; but this was a head scratcher for me.

After spending a long time in the Sinatra, Rack and Unicorn stack, I was able to trace the long wait to the socket read call in the Unicorn request object.

At which point, Graham (our Head of Tech Ops) suggested that we dump the packets to examine what Curl was actually sending to the server and receiving back. The command we used for this was:

tcpdump -i lo -s 0 port 8080

(Which says, capture all packets on the loopback interface, in their entirety, to and from port 8080.)

Inspecting the contents of these packets led to us (or at least me) learning some new things about HTTP.

I Suck At HTTP

Here’s something I didn’t know about HTTP, from the RFC. An HTTP client can omit the request body and add this header to the request:

Expect: 100-continue

Which is a way of saying “I’m not sure if you want this request body, do you?”

If the server does want the request body, it should respond with a 100 response, and the client can then upload the body and the request/response continues as normal from there.

If it does not want the request body, it can respond with whatever other HTTP code.

Now, this is handy if there are circumstances where you want to reject a request before you go to the trouble of uploading a perhaps massive request body. (Maybe an enormous image file being posted to an endpoint that is not accepting uploads right now.)

It’s entirely up to the client whether to include this header. You can not bother at all, or only bother for very large files, or do it all the time. All valid approaches.

Curl says: if your post data is larger than 1K, I’ll send the Expect header. I say: OK Curl. 1 kilobyte seems pretty small, but whatever, fine.

So far so good.

Here’s the problem: Ruby web servers do not implement 100 Continue automatically.

And they shouldn’t, either. It’s up to the application to make that choice, receive or not receive, so application authors need to implement that themselves in general. And we didn’t.

So what happens? Curl sits there waiting. “Come on, come on, do you want this request body or not?”

Curl makes a Weird Choice

So true, our app was technically broken. It happens. And we would have fixed it if Curl had timed out waiting for the 100, as you might expect. Because that would have been an error we would have noticed and fixed.

But Curl doesn’t time out waiting. If it has sent the Expect header, Curl will wait for the 100 response, but only for exactly one second. After one second has passed, it will send the request body anyway.

So from our app’s point of view, everything is fine. It was just a bit slow. No errors raised or reported. Just a slow service. Sometimes.

I guess from the point of view of Curl, it’s best to make every effort to have a successful request, even with performance degradation. From our point of view, failing hard would have been better than months of silent performance regression.

Working Around It

So we now have two options. The first and best is to actually implement the 100 Continue behaviour correctly. The second is to make Curl not behave like this.

Because we control all the clients to our internal services and wanted a quick fix, we decided to go with the second approach, which I’ll describe first.

If you set the Expect header yourself, even to a blank string, that overrides Curl’s behaviour and forces it to send the entire request immediately:

Expect: 

We did this and fixed all the slow requests we were seeing instantly.

Fixing it Properly (With Unicorn)

It’s not entirely obvious how to implement the required behaviour in a Sinatra application. How would you respond with 100 and then continue processing a request in Sinatra?

Fortunately, Unicorn has support for 100 Continue built in. If you respond with 100 from your application, it will return that to the client, and then immediately turn around and call your app a second time with the full data.

So we can make our application detect the Expect header and respond with 100 if appropriate.

We’d like to be able to make this decision from Sinatra, but I believe that Sinatra eagerly reads all of the request, so the delay is triggered before your action block is run.

So instead, include the logic in a Rack middleware, like this. Although here there is no logic, we simply request the body always and immediately if we detect that the client is waiting for 100 response:

class AlwaysRequestBody
  def initialize(app)
    @app = app
  end

  def call(env)
    if env["HTTP_EXPECT"] =~ /100-continue/
      [100, {}, [""]]
    else
      @app.call(env)
    end
  end
end

use AlwaysRequestBody
run Sinatra::Application

With this in place, the previously slow curl request returns instantly, and you can see here Curl logging the intermediate 100 response as well.

$ time curl -id "a=$A_KB_OF_DATA" "localhost:8080/foo"
HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Date: Tue, 27 Nov 2012 17:32:52 GMT
Status: 200 OK
Connection: close
X-Frame-Options: sameorigin
Content-Type: text/html;charset=utf-8
X-XSS-Protection: 1; mode=block
Content-Length: 18
There are 1 params

real    0m0.008s
user    0m0.008s
sys     0m0.008s

Problem solved.
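If you do want some logic in that decision, the same shape of middleware can refuse a body before it is uploaded. The following is only a sketch under some assumptions: the class name RequestBodyIfSmall and the 1 MB limit are made up, it trusts the client’s Content-Length header, and it relies (as the middleware above does) on the server no longer passing the Expect header when it calls the app the second time.

```ruby
# Hypothetical variant of the middleware above: only ask for the body
# when the declared size is under a limit, otherwise reject up front.
class RequestBodyIfSmall
  MAX_BODY_BYTES = 1_048_576 # assumed limit, pick your own

  def initialize(app, max_bytes = MAX_BODY_BYTES)
    @app = app
    @max_bytes = max_bytes
  end

  def call(env)
    # No Expect header: the body is already on its way, so carry on as normal.
    return @app.call(env) unless env["HTTP_EXPECT"] =~ /100-continue/i

    if env["CONTENT_LENGTH"].to_i > @max_bytes
      # Reject before the client uploads anything.
      [413, { "Content-Type" => "text/plain" }, ["Request body too large\n"]]
    else
      # Tell the client to go ahead and send the body.
      [100, {}, [""]]
    end
  end
end
```

In config.ru it would be wired up the same way as before, for example with use RequestBodyIfSmall, 512 * 1024 in place of use AlwaysRequestBody.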

Conclusion

Hence my initial assertion that your app is probably broken. If a client sends the Expect header it has every right to expect a correct response.

Streaming modules in Rails 3

Rails 3 has limited built-in support for progressive rendering (or template streaming). It can stream your header, page, and footer when each has completed rendering. This can have a big impact on your pages’ speed. In our case it’s brought our average first response time from 1s to 400ms.

However, this doesn’t go as far as we’d like. In previous versions of songkick.com, we used a plugin that activated module-by-module template streaming. This means that the page streams each module of the page as it is rendered, not just in three big blocks.

This doesn’t seem to be possible in Rails 3 (although it is apparently planned for Rails 4), but with a bit of fiddling we have managed to make it work. It’s currently activated on most pages on www.songkick.com; if you head over there and find a slow page, you can watch the page come in a module at a time.

I’m sharing it here both because I thought it might be useful, and because it’s a bit of a hack and I’d like feedback on it.

In the action you want to stream, you can turn on standard Rails 3 streaming with:

def show
  render :stream => true
end

But to enable module by module streaming there are two things to do. First, in your layout:

<% $streaming_output_buffer = @output_buffer %>
<%= yield %>

And then in the page template (here show.html.erb) wrap the entire contents in a block:

<% with_output_buffer($streaming_output_buffer) do %>
  … all of ERB template in here
<% end %>

This will cause this ERB template to write to the streaming buffer rather than its own output buffer each time it renders a portion of the page.

You should only add this block in the top level page templates. Don’t put it in partials or whatever, otherwise weird things will happen. Also, if your entire page is a render call to just one other template (say the page just contained one big partial) then you won’t see any benefit, because it’s only this template that streams, not the ones it calls recursively.

Told you it was a bit of a hack. Hopefully it will tide us over until Rails 4 implements this properly.

The cultural side of continuous deployment

We’ve written multiple posts about how we reduced our build time and optimised our tests. Moving to continuous integration (CI) and continuous deployment (CD) allowed us to remove many of the frustrations we had with our build and deploy process. On its own a fast build was not going to move us to continuous deployment but it was a pretty significant enabler. We knew that we had plenty more that we could be improving; we were still reliant on having the right people around to sign off features before releasing and we still depended on manual testing to supplement our automation.

We wanted to avoid drawing a process diagram and then having to police it so we focused on a process which was natural to the way we worked but that improved our process as much as possible.

Don’t aim for perfection

One of our major hold-ups was our attempts to make every feature and every release perfect. We were spending days perfecting pixels and copy only to find out that the feature didn’t have the anticipated impact. There is a huge benefit in getting feedback from users on what works and what doesn’t before you invest a whole load of time in making it look perfect on multiple browsers. Over time we have moved from releasing features and then tweaking them to planning and running A/B tests to gather the information we need before we start designing the final feature.

QA has a key role to play in working with the Product and Design teams to define exactly how much breakage is acceptable. We were moving from a process where every release was tested and it was expected that almost all bugs would have been spotted and fixed before going to production. Now we were relying on our developers and our automation to keep things in a ‘good enough’ state. When something went wrong we stepped back and looked at what was missing – in most cases it was an up-front conversation about risks and expectations.

Of course this is not an excuse for having a website full of badly designed and half-working features. We accept that bugs will end up on production but we work hard to make sure they get fixed as soon as possible.

Managing how many more bugs went to production was a job for our automated tests. Reviewing all the tests as part of our ‘make all the tests fast’ overhaul started to convince us that we had decent coverage. Deciding that we were going to trust the tests gave us the freedom to say that any green build was a releasable build. If this turned out not to be the case, either because manual testing discovered a bug or because of an issue in production, then we amended the tests. Regular reviews and conversations, particularly between developers and QA, help us to keep the tests maintained and testing the right things.

Avoid red builds

Historically Songkick has had an unnatural tolerance for red builds. They didn’t appear to slow us down that much so we didn’t take the time to really try to avoid them. Once we started to seriously look at adopting continuous integration we realised that this would have to change. Frequent check-ins will only work if the builds are green. Loud and visible alerts that go to the whole team when a build fails not only mean someone looks into the failure quickly but also help us to view red builds as a delay. This, coupled with having a very simple, and fast, way to run the tests on a dev environment before checking code in, keeps our red builds to a minimum.

Integrate small changes frequently

A key part of CI is integrating frequently. In an ideal world you probably have everyone working off the master branch. We are careful to maintain a releasable master branch but opted for individual freedom around working on individual branches or directly off master. We like CI because it allows developers the freedom to work in a way that suits them whilst still having enough safeguards to keep the site running. Once we had a fast and painless way to integrate and release most developers naturally started integrating small changes on a more frequent basis.

Have a shared understanding of your goals

Make sure you, and everyone in the team, understand what you’re trying to achieve at each stage of the build pipeline. At Songkick we expect to be able to build and test features on a local dev environment. If we discover something that forces us to test on a real test environment, such as missing data or missing services, then work gets prioritised to change that for next time.

Green builds have been tested on the CI server so we assume that a green build has the minimum required functionality to be releasable.

We use the test environment to test that the build can be deployed, and that the website works as we expect it to when running on multiple servers with lifelike data. Acceptance tests running with Selenium check that agreed business-critical functionality has not been broken. We have separated our build and deploy pipeline from feature launches so passing acceptance tests are our green flag to deploy to production.

Manual acceptance testing takes place on the production environment with the aid of feature flippers to control who can see which features. Once a feature has been tested we manually change the flipper to ‘launch’ the feature to the users.
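For readers who haven’t met feature flippers before, the idea can be sketched in a few lines. This is a hypothetical illustration, not the code we run: the FeatureFlipper class, its method names and the in-memory storage are all assumptions.

```ruby
# Hypothetical sketch; the class and method names are made up for illustration.
class FeatureFlipper
  def initialize
    @launched = {} # feature name => true once launched to everyone
    @testers  = Hash.new { |hash, key| hash[key] = [] }
  end

  # Let one user see a feature that has not been launched yet.
  def allow_tester(feature, user_id)
    @testers[feature] << user_id
  end

  # 'Launch' the feature to all users.
  def launch(feature)
    @launched[feature] = true
  end

  # A feature is visible if it is launched, or if this user is a tester for it.
  def enabled?(feature, user_id)
    @launched.fetch(feature, false) || @testers[feature].include?(user_id)
  end
end
```

Code for an unfinished feature can then be deployed to production behind an enabled? check, tested there by the people on the tester list, and launched later without another deploy.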

Keep on learning

CI and CD are difficult to implement, and one of the hardest parts is imagining what the process will actually look like. Rather than trying to pin down the final process we introduced changes gradually, focusing on removing the biggest bottlenecks first. Once one bottleneck was removed it was pretty easy to see what the next one was. Speaking up when you feel frustrated along with analysing problems using the 5-Whys method has helped us improve the process to where we are today. It is fine to make a mistake but at least make sure it is an original one.

validates_uniqueness_of :nothing

Warning: this article contains rather a lot of silly decisions.

I’ve recently been working out some bugs in our OAuth implementation, including our OAuth2::Provider library. One of the biggest gotchas I found while diagnosing problems with our client apps was the existence of duplicate Authorization records.

An Authorization is a link between a ResourceOwner (i.e. a Songkick user) and a Client, for example our iPhone application. It represents that the user has granted the client access to their resources on Songkick. There should only be one of these per owner-client pair, and somehow we had a few thousand duplicates in our database. Getting more concrete, the table’s columns include the following:

+---------------------+--------------+
| Field               | Type         |
+---------------------+--------------+
| resource_owner_type | varchar(255) |
| resource_owner_id   | int(11)      |
| client_id           | int(11)      |
+---------------------+--------------+

Each combination of values for these three columns must only appear once in the table.

A series of unfortunate events

Now the Rails Way to make such guarantees is to use validates_uniqueness_of, or use a find_or_create_by_* call to check if something exists before creating it. And that’s basically what I’d done; OAuth2::Provider has a method called Authorization.for(owner, client) that would either find a suitable record or create a new one.

But despite implementing this, we were still getting duplicates. I removed an alternative code path for getting Authorization records, and still the duplicates continued. I figured something in our applications must be creating them, so I made new() and create() private on the Authorization model. No dice.

And then I remembered: concurrency! Trying to enforce uniqueness on the client doesn’t work, unless all the clients subscribe to a distributed decision-making protocol. If two requests are in flight, both can run a SELECT query, find there’s no existing record, and then both decide to create the record. Something like this:

             User 1                 |               User 2
------------------------------------+--------------------------------------
# User 1 checks whether there's     |
# already a comment with the title  |
# 'My Post'. This is not the case.  |
SELECT * FROM comments              |
WHERE title = 'My Post'             |
                                    |
                                    | # User 2 does the same thing and also
                                    | # infers that his title is unique.
                                    | SELECT * FROM comments
                                    | WHERE title = 'My Post'
                                    |
# User 1 inserts his comment.       |
INSERT INTO comments                |
(title, content) VALUES             |
('My Post', 'hi!')                  |
                                    |
                                    | # User 2 does the same thing.
                                    | INSERT INTO comments
                                    | (title, content) VALUES
                                    | ('My Post', 'hello!')
                                    |
                                    | # ^^^^^^
                                    | # Boom! We now have a duplicate
                                    | # title!

This may look familiar to you. In fact, I lifted it straight out of the ActiveRecord source, where it explains why validates_uniqueness_of doesn’t work when you have concurrent requests.

Users do the funniest things

I agree with you – in theory. In theory, communism works. In theory.

— Homer J. Simpson

There can be a tendency among some programmers to dismiss these arguments as things that probably won’t be a problem in practice. Why would two requests arrive at the same time, close enough to cause this race condition in the database, for the same user’s resources? This is the same thinking that tells you timing attacks are impossible over the Internet.

And I subscribed to this belief for a long time. Not that I thought it was impossible, I just thought there were likelier causes – hence all the attempts to shut down record creation code paths. But I was wrong, and here’s why:

People double-click on things on the Web.

Over time, we designers of software systems have instilled some confusing habits in the people who use our products, and one of those habits means that there is a set of people that always double-click links and form buttons on web pages. Looking at the updated_at timestamps on the duplicate records showed that most of them were modified very close together in time, certainly close enough to cause database race conditions. This fact by itself makes client-enforced uniqueness checks a waste of time. Even if you’re not getting a lot of requests, one little user action can blow your validation.

This is the database’s job

Here’s how this thing should be done, even if you think you’re not at risk:

class AddUniqueIndexToThings < ActiveRecord::Migration
  def self.up
    add_index :oauth_authorizations,
              [:client_id, :resource_owner_type, :resource_owner_id],
              :unique => true
  end
  
  def self.down
    remove_index :oauth_authorizations,
                 [:client_id, :resource_owner_type, :resource_owner_id]
  end
end

Then, when you try to create a record, you should catch the potential exception that this index will throw if the new record violates the uniqueness constraint. Rails 3 introduced a new exception called ActiveRecord::RecordNotUnique for its core adapters, but if you’re still supporting older Rails versions you need to catch ActiveRecord::StatementInvalid and check the error message. Here’s how our OAuth library does things.

DUPLICATE_RECORD_ERRORS = [
  /^Mysql::Error:\s+Duplicate\s+entry\b/,
  /^PG::Error:\s+ERROR:\s+duplicate\s+key\b/,
  /\bConstraintException\b/
]

def self.duplicate_record_error?(error)
  error.class.name == 'ActiveRecord::RecordNotUnique' or
  DUPLICATE_RECORD_ERRORS.any? { |re| re =~ error.message }
end

In the Authorization.for(owner, client) method, there’s a rescue clause that uses duplicate_record_error? to check the exception raised. If it’s a duplicate record error, we retry the method call since the second time it should find the new record that was inserted since the first call started.
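The find, create, rescue and retry flow can be sketched without a database. Everything here is a stand-in rather than the library’s actual code: FakeAuthorizationTable plays the part of a table with a unique index, RecordNotUnique stands in for ActiveRecord::RecordNotUnique, and authorization_for is a hypothetical version of Authorization.for.

```ruby
# Stand-in for ActiveRecord::RecordNotUnique (hypothetical, no Rails needed).
class RecordNotUnique < StandardError; end

# In-memory stand-in for a table with a unique index on the key.
class FakeAuthorizationTable
  def initialize
    @rows = {}
    @mutex = Mutex.new
  end

  def find(key)
    @mutex.synchronize { @rows[key] }
  end

  # Like an INSERT against a unique index: raises if the key already exists.
  def insert(key)
    @mutex.synchronize do
      raise RecordNotUnique, "Duplicate entry" if @rows.key?(key)
      @rows[key] = { key: key }
    end
  end

  def count
    @mutex.synchronize { @rows.size }
  end
end

# Find the record or create it. If another request wins the race between
# our find and our insert, the retry finds the row that request inserted.
def authorization_for(table, owner_type, owner_id, client_id)
  key = [client_id, owner_type, owner_id]
  table.find(key) || table.insert(key)
rescue RecordNotUnique
  retry
end

table = FakeAuthorizationTable.new
threads = 20.times.map do
  Thread.new { authorization_for(table, 'User', 7, 1) }
end
results = threads.map(&:value)
puts table.count # 1
```

The check in application code is only an optimisation here; it is the uniqueness constraint that guarantees a single row, however the requests interleave.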