A Second Here, A Second There…

A Mysterious Performance Problem

In production, we were seeing lots of requests to our backend services taking a long time. They would typically be over 1000ms when we would expect them to be much, much faster. It was intermittent too. Sometimes the same endpoints were fast, sometimes they were slow.

When we investigated, we were able to reduce the performance problem to the following mystifying test case, which you should be able to reproduce on your system too.

With this Sinatra app:

require 'sinatra'

post "/foo" do
  status 200
  "There are #{params.length} params"
end
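
(We run our backend services under Unicorn, which is where the port 8080 in the examples below comes from: Unicorn listens on 8080 by default. If you want to reproduce this yourself, a minimal config.ru along these lines will do; the app.rb file name is just my assumption about where the Sinatra code lives.)

# config.ru -- start with `unicorn config.ru`
require './app'          # the Sinatra app above, saved as app.rb
run Sinatra::Application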

Making this request:

curl -d "a=1" "localhost:8080/foo"

takes almost exactly 0 seconds.

(Note, we use libcurl to talk to services instead of Ruby’s Net::HTTP, so this is very close to a real service request from one of our applications.)

On the other hand, making this request:

curl -id "a=$A_KB_OF_DATA" "localhost:8080/foo"

takes almost exactly 1 second.

Now, it shouldn’t take 1 second to make such a simple request just because I’m sending 1K of data.

So what on earth is happening here? If you already know the answer then much respect to you; but this was a head scratcher for me.

After spending a long time in the Sinatra, Rack and Unicorn stack, I was able to trace the long wait to the socket read call in the Unicorn request object.

At which point, Graham (our Head of Tech Ops) suggested that we dump the packets to examine what Curl was actually sending to the server and receiving back. The command we used for this was:

tcpdump -i lo -s 0 port 8080

(Which says, capture all packets on the loopback interface, in their entirety, to and from port 8080.)

Inspecting the contents of these packets led to us (or at least me) learning some new things about HTTP.

I Suck At HTTP

Here’s something I didn’t know about HTTP, from the RFC. An HTTP client can omit the request body and add this header to the request:

Expect: 100-continue

Which is a way of saying “I’m not sure if you want this request body, do you?”

If the server does want the request body, it should respond with a 100 response, and the client can then upload the body and the request/response continues as normal from there.

If it does not want the request body, it can respond immediately with whatever other HTTP status code is appropriate.

Now, this is handy if there are circumstances where you want to reject a request before you go to the trouble of uploading a perhaps massive request body. (Maybe an enormous image file being posted to an endpoint that is not accepting uploads right now.)

It’s entirely up to the client whether to include this header. You can not bother at all, or only bother for very large files, or do it all the time. All valid approaches.

Curl says: if your post data is larger than 1K, I’ll send the Expect header. I say: OK Curl. 1 kilobyte seems pretty small, but whatever, fine.

So far so good.

Here’s the problem: Ruby web servers do not implement 100 Continue automatically.

And they shouldn’t, either. It’s up to the application to make that choice, to receive the body or not, so application authors need to implement that behaviour themselves. And we didn’t.

So what happens? Curl sits there waiting. “Come on, come on, do you want this request body or not?”

Curl makes a Weird Choice

So, true, our app was technically broken. It happens. And we would have fixed it if Curl had timed out waiting for the 100, as you might expect. Because that would have been an error we would have noticed and fixed.

But Curl doesn’t time out waiting. If it has sent the Expect header, Curl will wait for the 100 response, but only for exactly one second. After one second has passed, it will send the request body anyway.

So from our app’s point of view, everything was fine. It was just a bit slow. No errors raised or reported. Just a slow service. Sometimes.

I guess from the point of view of Curl, it’s best to make every effort to have a successful request, even with performance degradation. From our point of view, failing hard would have been better than months of silent performance regression.

Working Around It

So we now have two options. The first and best is to actually implement the 100 Continue behaviour correctly. The second is to make Curl not behave like this.

Because we control all the clients to our internal services and wanted a quick fix, we decided to go with the second approach, which I’ll describe first.

If you set the Expect header yourself, even to a blank string, that overrides Curl’s behaviour and forces it to send the entire request immediately:

Expect: 

We did this and fixed all the slow requests we were seeing instantly.
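
In case it helps, here is roughly what that looks like from Ruby using the curb bindings for libcurl (an assumption on my part; our actual client code differs, but any libcurl wrapper lets you set the header the same way):

require 'curb'

data = "x" * 1024   # over 1K of POST data, enough to trigger Expect: 100-continue

curl = Curl::Easy.new("http://localhost:8080/foo")
# Supplying our own blank Expect header stops libcurl adding
# "Expect: 100-continue", so the body is sent straight away.
curl.headers["Expect"] = ""
curl.http_post(Curl::PostField.content("a", data))

puts curl.response_code   # => 200, with no one-second pause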

Fixing it Properly (With Unicorn)

It’s not entirely obvious how to implement the required behaviour in a Sinatra application. How would you respond with 100 and then continue processing a request in Sinatra?

Fortunately, Unicorn has support for 100 Continue built in. If you respond with 100 from your application, it will return that to the client, and then immediately turn around and call your app a second time with the full data.

So we can make our application detect the Expect header and respond with 100 if appropriate.

We’d like to be able to make this decision from Sinatra, but I believe that Sinatra eagerly reads all of the request, so the delay is triggered before your action block is run.

So instead, we put the logic in a Rack middleware, like this. Although here there is no real logic: we simply request the body always and immediately if we detect that the client is waiting for a 100 response:

class AlwaysRequestBody
  def initialize(app)
    @app = app
  end

  def call(env)
    if env["HTTP_EXPECT"] =~ /100-continue/
      # The client is waiting for permission to send the body; respond with
      # 100 and Unicorn will call the app again once the full body has arrived.
      [100, {}, [""]]
    else
      @app.call(env)
    end
  end
end

use AlwaysRequestBody
run Sinatra::Application

With this in place, the previously slow curl request returns instantly, and you can see here Curl logging the intermediate 100 response as well.

$ time curl -id "a=$A_KB_OF_DATA" "localhost:8080/foo"
HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Date: Tue, 27 Nov 2012 17:32:52 GMT
Status: 200 OK
Connection: close
X-Frame-Options: sameorigin
Content-Type: text/html;charset=utf-8
X-XSS-Protection: 1; mode=block
Content-Length: 18

There are 1 params

real    0m0.008s
user    0m0.008s
sys     0m0.008s

Problem solved.

Conclusion

Hence my initial assertion that your app is probably broken. If a client sends the Expect header, it has every right to expect a correct response.

Streaming modules in Rails 3

Rails 3 has limited built-in support for progressive rendering (or template streaming). It can stream your header, page, and footer when each has completed rendering. This can have a big impact on your pages’ speed. In our case it’s brought our average first response time from 1s to 400ms.

However, this doesn’t go as far as we’d like. In previous versions of songkick.com, we used a plugin that activated module-by-module template streaming. This means that the page streams each module of the page as it is rendered, not just in three big blocks.

This doesn’t seem to be possible in Rails 3 (although it is apparently planned for Rails 4), but with a bit of fiddling we have managed to make it work. It’s currently activated on most pages on www.songkick.com, if you head over there and find a slow page you can watch the page come in a module at a time.

I’m sharing it here both because I thought it might be useful, and because it’s a bit of a hack and I’d like feedback on it.

In the action you want to stream, you can turn on standard Rails 3 streaming with:

def show
  render :stream => true
end

But to enable module-by-module streaming there are two things to do. First, in your layout:

<% $streaming_output_buffer = @output_buffer %>
<%= yield %>

And then in the page template (here show.html.erb) wrap the entire contents in a block:

<% with_output_buffer($streaming_output_buffer) do %>
  … all of ERB template in here
<% end %>

This will cause this ERB template to write to the streaming buffer rather than its own output buffer each time it renders a portion of the page.

You should only add this block in the top-level page templates. Don’t put it in partials or whatever, otherwise weird things will happen. Also, if your entire page is a render call to just one other template (say the page just contained one big partial) then you won’t see any benefit, because it’s only this template that streams, not any templates it renders in turn.
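
To make that concrete, a top-level page template using this hack might look something like the following (the partial names are invented for illustration):

<% with_output_buffer($streaming_output_buffer) do %>
  <%# Each module is flushed to the client as soon as its render completes %>
  <%= render :partial => 'event_header' %>
  <%= render :partial => 'ticket_listings' %>
  <%= render :partial => 'similar_artists' %>
<% end %>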

Told you it was a bit of a hack. Hopefully it will tide us over until Rails 4 implements this properly.

The cultural side of continuous deployment

We’ve written multiple posts about how we reduced our build time and optimised our tests. Moving to continuous integration (CI) and continuous deployment (CD) allowed us to remove many of the frustrations we had with our build and deploy process. On its own a fast build was not going to move us to continuous deployment but it was a pretty significant enabler. We knew that we had plenty more that we could be improving; we were still reliant on having the right people around to sign off features before releasing and we still depended on manual testing to supplement our automation.

We wanted to avoid drawing a process diagram and then having to police it, so we focused on a process that was natural to the way we worked while still improving things as much as possible.

Don’t aim for perfection

One of our major hold-ups was our attempts to make every feature and every release perfect. We were spending days perfecting pixels and copy only to find out that the feature didn’t have the anticipated impact. There is a huge benefit in getting feedback from users on what works and what doesn’t before you invest a whole load of time in making it look perfect on multiple browsers. Over time we have moved from releasing features and then tweaking them to planning and running A/B tests to gather the information we need before we start designing the final feature.

QA has a key role to play in working with the Product and Design teams to define exactly how much breakage is acceptable. We were moving from a process where every release was tested and it was expected that almost all bugs would have been spotted and fixed before going to production. Now we were relying on our developers and our automation to keep things in a ‘good enough’ state. When something went wrong we stepped back and looked at what was missing – in most cases it was an up-front conversation about risks and expectations.

Of course this is not an excuse for having a website full of badly designed and half-working features. We accept that bugs will end up on production but we work hard to make sure they get fixed as soon as possible.

Managing how many more bugs went to production was a job for our automated tests. Reviewing all the tests as part of our ‘make all the tests fast’ overhaul started to convince us that we had decent coverage. Deciding that we were going to trust the tests gave us the freedom to say that any green build was a releasable build. If this turned out not to be the case, either because manual testing discovered a bug or because of an issue in production, then we amended the tests. Regular reviews and conversations, particularly between developers and QA, help us to keep the tests maintained and testing the right things.

Avoid red builds

Historically Songkick has had an unnatural tolerance for red builds. They didn’t appear to slow us down that much so we didn’t take the time to really try to avoid them. Once we started to seriously look at adopting continuous integration we realised that this would have to change. Frequent check-ins will only work if the builds are green. Loud and visible alerts that go to the whole team when a build fails not only meant that someone looked into the failure quickly, but also helped us to view red builds as a delay. This, coupled with having a very simple and fast way to run the tests on a dev environment before checking code in, keeps our red builds to a minimum.

Integrate small changes frequently

A key part of CI is integrating frequently. In an ideal world you probably have everyone working off the master branch. We are careful to maintain a releasable master branch but opted for individual freedom around working on individual branches or directly off master. We like CI because it allows developers the freedom to work in a way that suits them whilst still having enough safeguards to keep the site running. Once we had a fast and painless way to integrate and release most developers naturally started integrating small changes on a more frequent basis.

Have a shared understanding of your goals

Make sure you, and everyone in the team understands what you’re trying to achieve at each stage of the build pipeline. At Songkick we expect to be able to build and test features on a local dev environment. If we discover something that forces us to test on a real test environment, such as missing data or missing services, then work gets prioritised to change that for next time.

Green builds have been tested on the CI server so we assume that a green build has the minimum required functionality to be releasable.

We use the test environment to test that the build can be deployed, and that the website works as we expect it to when running on multiple servers with lifelike data. Acceptance tests running with Selenium check that agreed business-critical functionality has not been broken. We have separated our build and deploy pipeline from feature launches so passing acceptance tests are our green flag to deploy to production.

Manual acceptance testing takes place on the production environment with the aid of feature flippers to control who can see which features. Once a feature has been tested we manually change the flipper to ‘launch’ the feature to the users.

Keep on learning

CI and CD are difficult to implement, and one of the hardest parts is imagining what the process will actually look like. Rather than trying to pin down the final process we introduced changes gradually, focusing on removing the biggest bottlenecks first. Once one bottleneck was removed it was pretty easy to see what the next one was. Speaking up when you feel frustrated along with analysing problems using the 5-Whys method has helped us improve the process to where we are today. It is fine to make a mistake but at least make sure it is an original one.

validates_uniqueness_of :nothing

Warning: this article contains rather a lot of silly decisions.

I’ve recently been working out some bugs in our OAuth implementation, including our OAuth2::Provider library. One of the biggest gotchas I found while diagnosing problems with our client apps was the existence of duplicate Authorization records.

An Authorization is a link between a ResourceOwner (i.e. a Songkick user) and a Client, for example our iPhone application. It represents that the user has granted the client access to their resources on Songkick. There should only be one of these per owner-client pair, and somehow we had a few thousand duplicates in our database. Getting more concrete, the table’s columns include the following:

+---------------------+--------------+
| Field               | Type         |
+---------------------+--------------+
| resource_owner_type | varchar(255) |
| resource_owner_id   | int(11)      |
| client_id           | int(11)      |
+---------------------+--------------+

Each combination of values for these three columns must only appear once in the table.

A series of unfortunate events

Now the Rails Way to make such guarantees is to use validates_uniqueness_of, or use a find_or_create_by_* call to check if something exists before creating it. And that’s basically what I’d done; OAuth2::Provider has a method called Authorization.for(owner, client) that would either find a suitable record or create a new one.
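
In model terms, that approach looks roughly like this (a sketch, not the actual library code):

class Authorization < ActiveRecord::Base
  belongs_to :client
  belongs_to :resource_owner, :polymorphic => true

  # One Authorization per owner-client pair, but enforced only in Ruby,
  # which is exactly the problem described below.
  validates_uniqueness_of :client_id,
    :scope => [:resource_owner_type, :resource_owner_id]
end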

But despite implementing this, we were still getting duplicates. I removed an alternative code path for getting Authorization records, and still the duplicates continued. I figured something in our applications must be creating them, so I made new() and create() private on the Authorization model. No dice.

And then I remembered: concurrency! Trying to enforce uniqueness on the client doesn’t work, unless all the clients subscribe to a distributed decision-making protocol. If two requests are in flight, both can run a SELECT query, find there’s no existing record, and then both decide to create the record. Something like this:

             User 1                 |               User 2
------------------------------------+--------------------------------------
# User 1 checks whether there's     |
# already a comment with the title  |
# 'My Post'. This is not the case.  |
SELECT * FROM comments              |
WHERE title = 'My Post'             |
                                    |
                                    | # User 2 does the same thing and also
                                    | # infers that his title is unique.
                                    | SELECT * FROM comments
                                    | WHERE title = 'My Post'
                                    |
# User 1 inserts his comment.       |
INSERT INTO comments                |
(title, content) VALUES             |
('My Post', 'hi!')                  |
                                    |
                                    | # User 2 does the same thing.
                                    | INSERT INTO comments
                                    | (title, content) VALUES
                                    | ('My Post', 'hello!')
                                    |
                                    | # ^^^^^^
                                    | # Boom! We now have a duplicate
                                    | # title!

This may look familiar to you. In fact, I lifted it straight out of the ActiveRecord source, where it explains why validates_uniqueness_of doesn’t work when you have concurrent requests.

Users do the funniest things

I agree with you – in theory. In theory, communism works. In theory.

— Homer J. Simpson

There can be a tendency among some programmers to dismiss these arguments as things that probably won’t be a problem in practice. Why would two requests arrive at the same time, close enough to cause this race condition in the database, for the same user’s resources? This is the same thinking that tells you timing attacks are impossible over the Internet.

And I subscribed to this belief for a long time. Not that I thought it was impossible, I just thought there were likelier causes – hence all the attempts to shut down record creation code paths. But I was wrong, and here’s why:

People double-click on things on the Web.

Over time, we designers of software systems have instilled some confusing habits in the people who use our products, and one of those habits is that a certain set of people will always double-click links and form buttons on web pages. Looking at the updated_at timestamps on the duplicate records showed that most of them were modified very close together in time, certainly close enough to cause database race conditions. This fact by itself makes client-enforced uniqueness checks a waste of time. Even if you’re not getting a lot of requests, one little user action can blow your validation.

This is the database’s job

Here’s how this thing should be done, even if you think you’re not at risk:

class AddUniqueIndexToThings < ActiveRecord::Migration
  def self.up
    add_index :oauth_authorizations,
              [:client_id, :resource_owner_type, :resource_owner_id],
              :unique => true
  end
  
  def self.down
    remove_index :oauth_authorizations,
                 [:client_id, :resource_owner_type, :resource_owner_id]
  end
end

Then, when you try to create a record, you should catch the exception that will be raised if the new record violates the uniqueness constraint enforced by this index. Rails 3 introduced a new exception called ActiveRecord::RecordNotUnique for its core adapters, but if you’re still supporting older Rails versions you need to catch ActiveRecord::StatementInvalid and check the error message. Here’s how our OAuth library does things.

DUPLICATE_RECORD_ERRORS = [
  /^Mysql::Error:\s+Duplicate\s+entry\b/,
  /^PG::Error:\s+ERROR:\s+duplicate\s+key\b/,
  /\bConstraintException\b/
]

def self.duplicate_record_error?(error)
  error.class.name == 'ActiveRecord::RecordNotUnique' or
  DUPLICATE_RECORD_ERRORS.any? { |re| re =~ error.message }
end

In the Authorization.for(owner, client) method, there’s a rescue clause that uses duplicate_record_error? to check the exception raised. If it’s a duplicate record error, we retry the method call since the second time it should find the new record that was inserted since the first call started.
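
In outline, the pattern looks something like this (a sketch rather than the exact code in OAuth2::Provider; first_for stands in for the real lookup):

class Authorization < ActiveRecord::Base
  def self.for(owner, client)
    # Find an existing grant, or try to insert one if the lookup comes up empty.
    first_for(owner, client) ||
      create!(:resource_owner => owner, :client => client)
  rescue ActiveRecord::StatementInvalid => error
    # In Rails 3, RecordNotUnique is a subclass of StatementInvalid, so this
    # rescue catches duplicate-key errors on old and new versions alike.
    raise unless duplicate_record_error?(error)
    # Another request inserted the same row between our SELECT and INSERT;
    # retrying the method means the lookup will now find that record.
    retry
  end
end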