Statistics for fun and profit (and analyzing split tests)

This is a post about split testing. Split testing, sometimes known as A/B testing, is a way of figuring out which of two (or more) versions of a site performs better. The idea is simple: divide visitors to your site into groups at random and present each group with one of the versions under test. You then measure the effectiveness separately for each group and compare results. The big advantage of running things this way rather than, say, showing everyone version A on Monday followed by version B on Tuesday is that it automatically corrects for external confounding factors. What if Monday was a public holiday, for example?

So far, so good. It all sounds pretty simple, and implementation can be as straightforward as setting a cookie and counting entries in the server logs. However, things get a little more complicated when it comes to analyzing the results.

For example, how do you know when to stop collecting data and make a decision? Leaving the test running for too long is a waste of time, which is something that most start-ups don’t exactly have a lot of, but not collecting enough data has more subtle consequences. Each group of users will display a wide variety of behavior for all sorts of reasons that have nothing to do with the change you’re making. Suppose that by pure chance the average age of visitors in group A was much higher than in group B; in this case you could easily imagine that their behavior would differ regardless of the versions of the site they had seen. Put another way, how can you be confident that the difference you observe reflects a fundamental difference between the versions rather than simply being explained by random chance? This topic is known as statistical significance.

There are a few ways to approach this question. One common approach is frequentist hypothesis testing, which I’m not going to discuss here. Instead I’ll focus on an approach based on Bayesian modelling.

As the name would suggest, at the core of this approach is a mathematical model of the data observed during the test. To be a little more precise, by mathematical model I mean a statement about the relationship between various quantities. A non-statistical example of this is Ohm’s law, which is a model of electrical conductivity in idealized materials, and states that three quantities, current (I), voltage (V) and resistance (R), are related by

V = I \times R

Statistical models generalize this by introducing random variables into the mix. A random variable is a variable which, rather than having a single fixed value, is represented by a distribution of possible values with associated probabilities; we may be 90% sure that the number of beers left in the fridge is 10, but we can’t quite remember who drank what last night, so there’s a 10% chance that the number is 9. The exact meaning of probabilities is an interesting philosophical discussion in its own right, but intuitively a probability is a measure of the strength of our belief, represented as a real number between 0 and 1. Values with probability 0 can never happen, values with probability 1 are certain, and everything in between may or may not be true.

Models for split tests

How do we apply this kind of model to the results of a split test? Let’s start by modelling the behavior of a single group of users.

As a concrete example, let’s say we want to improve the number of users successfully filling in our sign-up form. In this case, over some period n visitors land on the form, of which k successfully fill it in and hit ‘submit’. A third relevant quantity is p, which is the conversion rate, i.e. the probability that a randomly chosen individual from the entire population, when presented with the form, will sign up. The emphasis here is important; we want to be able to generalize to future visitors, so calculating a value for p based purely on the participants in the test, while easy, isn’t good enough.

Before we can make any inferences we need to relate these quantities to each other via a statistical model. In this case a binomial model is appropriate. This uses a Binomial distribution, which has the following probability mass function (PMF) for the value of k given a certain n and p:

f\left(k; n, p\right) = {n \choose k}\times p^k\times \left(1-p\right)^{n-k}

The PMF allows us to take a value of k and find the probability of that value occurring under the model. Graphically it looks like this:

where the red, green and blue curves are for p=0.5, p=0.9 and p=0.1 respectively (n=100 in all cases).

The binomial distribution is often described using the example of a biased coin. Suppose I have such a coin with a known probability, p, of turning up heads; the binomial distribution represents the probability of seeing k heads if I flip it n times. Note the use of random variables: even if n and p are known with certainty we still can’t do better than assigning a distribution over a range of possible values for k. Hopefully it’s not too much of a stretch to relate this scenario to the sign-up conversion problem.
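To make the PMF a little more concrete, here is a minimal Python sketch that evaluates the formula above directly. It is an illustration only: the function name binomial_pmf is my own choice, and math.comb needs Python 3.8 or later. It uses the same n=100 and the same three values of p as the curves described above.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 100
for p in (0.1, 0.5, 0.9):
    # Find the k with the highest probability; it sits close to n * p.
    most_likely_k = max(range(n + 1), key=lambda k: binomial_pmf(k, n, p))
    # Summing the PMF over every possible k should give (almost exactly) 1.
    total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
    print(f"p={p}: most likely k={most_likely_k}, total probability={total:.6f}")
```

Note that this direct form multiplies a very large integer by a very small float, so it is only suitable for moderate n; for n in the thousands a log-space version would be the safer choice.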

Inference and Bayes’ theorem

Let’s write the probability of a particular value of k as P(k | p, n). In this notation the bar (‘|’) represents conditional probability. In other words, this is an expression for the distribution over possible values of k if p and n have known, fixed values, and in this case is exactly the binomial PMF given by f(.) above.

This isn’t quite what we want. Given the results of a test, k is known, but p isn’t, so we want to know a distribution over p given the observed data, or P(p | k, n). Fortunately, Bayes’ theorem tells us how to compute precisely that:

P\left(p\mid k,n\right) = {P\left(k\mid p,n\right) \times P\left(p\mid n\right) \over P\left(k\mid n\right)}

There are a couple of other quantities here, P(p | n), which is known as the prior, and P(k | n). The prior represents our beliefs about p in the absence of data. Given we know nothing in that case it’s not unreasonable to model it as a flat distribution (i.e. a constant). P(k | n) is dependent only on fixed, observed quantities, so can also be treated as a constant for this analysis, hence:

P\left(p\mid k,n\right) = {1 \over Z} P\left(k\mid p,n\right)

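Probability distributions must sum to one (i.e. we know that we’ll certainly get one of the possible values), so Z isn’t free to vary arbitrarily: it is fixed by the requirement that P(p | k, n) sums to one over all candidate values of p. In practice that means evaluating P(k | p, n) for each candidate p and taking Z to be the sum of those values.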

All of this can easily be done numerically, either with a small script in your language of choice or using a spreadsheet. Excel has a function BINOMDIST which gives P(k | p, n), so you can use something like this:
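For the script route, a minimal Python sketch might look like the following. It is an illustration under stated assumptions rather than a definitive implementation: the function names, the grid of 1001 evenly spaced candidate values for p, and the example data (k=30 sign-ups from n=100 visitors) are all hypothetical choices of mine, and the prior is taken to be flat as described above.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(k | p, n) under the binomial model."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def grid_posterior(k, n, steps=1000):
    """Approximate P(p | k, n) on an evenly spaced grid, assuming a flat prior.

    Returns the candidate values of p and their posterior probabilities.
    """
    grid = [i / steps for i in range(steps + 1)]
    likelihood = [binomial_pmf(k, n, p) for p in grid]
    z = sum(likelihood)  # the normalising constant Z
    posterior = [value / z for value in likelihood]
    return grid, posterior


# Hypothetical data: 30 sign-ups out of 100 visitors.
grid, posterior = grid_posterior(k=30, n=100)
best = max(range(len(grid)), key=posterior.__getitem__)
print(f"most probable p on the grid: {grid[best]:.3f}")
print(f"posterior sums to: {sum(posterior):.6f}")
```

The two returned lists play the same role as two spreadsheet columns: one of candidate values for p and one of the probability assigned to each.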

Comparing test groups

In a split test we treat each group as a separate population, with separate conversion rates, pa and pb. Each of these can be analysed as above, so we’ll end up with a distribution for each. Numerically, this will be a set of discrete values for each with probabilities assigned to each, probably represented as two columns if you’re using a spreadsheet.

We’ll treat the two groups as independent. For independent variables ‘and’ queries correspond to multiplying probabilities, so the probability of group A having conversion rate pa and group B having conversion rate pb is

P\left(p_a\mid k_a, n_a\right)\times P\left(p_b\mid k_b, n_b\right)

Finding the probability that A wins is then just a matter of finding all of the pairs (pa, pb) where pa > pb, multiplying the corresponding values for each and then adding them all up. Using mathematical notation, this is the same as saying

\sum_{p_a > p_b} P\left(p_a\mid k_a, n_a\right)\times P\left(p_b\mid k_b, n_b\right)
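As a rough illustration of this sum, here is a self-contained Python sketch. It assumes a flat prior and a shared grid of candidate conversion rates for both groups; the function names and the example counts (60/100 for A, 40/100 for B) are hypothetical. Rather than looping over every pair, it keeps a running total of the posterior mass of p_b below the current p_a, which gives the same result with far fewer multiplications.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(k | p, n) under the binomial model."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def grid_posterior(k, n, steps=1000):
    """Posterior over p on an evenly spaced grid, assuming a flat prior."""
    grid = [i / steps for i in range(steps + 1)]
    likelihood = [binomial_pmf(k, n, p) for p in grid]
    z = sum(likelihood)
    return grid, [value / z for value in likelihood]


def prob_a_beats_b(k_a, n_a, k_b, n_b, steps=1000):
    """Sum P(p_a | k_a, n_a) * P(p_b | k_b, n_b) over grid pairs with p_a > p_b."""
    _, post_a = grid_posterior(k_a, n_a, steps)
    _, post_b = grid_posterior(k_b, n_b, steps)
    total = 0.0
    mass_b_below = 0.0  # posterior mass of p_b strictly below the current p_a
    for weight_a, weight_b in zip(post_a, post_b):
        total += weight_a * mass_b_below
        mass_b_below += weight_b
    return total


# Hypothetical results: 60/100 sign-ups for A, 40/100 for B.
print(f"P(A beats B) = {prob_a_beats_b(60, 100, 40, 100):.4f}")
```

Because the comparison is strict, pairs where both rates land on the same grid point count for neither side, so the probabilities of A winning and of B winning add up to slightly less than one.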

It turns out that this sort of calculation doesn’t really lend itself to spreadsheets, but it’s pretty straightforward in most programming languages. We’ve actually put some of the scripts we use for this kind of analysis on GitHub:

To make a decision you first need to decide how confident you want to be. If the answer you get from the above is 0.95 and you’re happy to accept a 5% chance of being wrong, you should choose to roll out version A, and if it’s 0.05 you probably want to pick B.

If you get something close to 0.5 you need to work out whether to declare neither A nor B the winner (i.e. they’re as good as each other), or wait a bit longer and gather more data. To help with this you can vary the above sum to consider pairs where pa and pb are within some small distance of each other (say a 1% difference). If the probability mass for these pairs is high it’s very likely that there is little difference between A and B, but if not you just don’t have enough data to draw a conclusion either way.
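A minimal Python sketch of this variation follows. As before it assumes a flat prior and a shared grid for both groups, and all names and example counts are hypothetical. I have also assumed that "a 1% difference" means an absolute difference of 0.01 in conversion rate; a relative difference would need a different condition.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(k | p, n) under the binomial model."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def grid_posterior(k, n, steps=1000):
    """Posterior over p on an evenly spaced grid, assuming a flat prior."""
    grid = [i / steps for i in range(steps + 1)]
    likelihood = [binomial_pmf(k, n, p) for p in grid]
    z = sum(likelihood)
    return grid, [value / z for value in likelihood]


def prob_within(k_a, n_a, k_b, n_b, margin=0.01, steps=1000):
    """Posterior mass of grid pairs (p_a, p_b) with |p_a - p_b| <= margin."""
    _, post_a = grid_posterior(k_a, n_a, steps)
    _, post_b = grid_posterior(k_b, n_b, steps)
    # Express the margin in whole grid steps to avoid float comparisons.
    width = round(margin * steps)
    total = 0.0
    for i, weight_a in enumerate(post_a):
        low = max(0, i - width)
        high = min(steps, i + width)
        total += weight_a * sum(post_b[low:high + 1])
    return total


# Hypothetical results with identical observed rates but different sample sizes.
print(f"small test: {prob_within(5, 10, 5, 10):.3f}")
print(f"larger test: {prob_within(500, 1000, 500, 1000):.3f}")
```

With identical observed rates, the larger test should put noticeably more mass on "A and B are about the same" than the small one, which is the signal that more data has narrowed things down.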