This is a post about split testing. Split testing, sometimes known as A/B testing, is a way of figuring out which of two (or more) versions of a site performs better. The idea is simple: divide visitors to your site into groups at random and present each group with one of the versions under test. You then measure the effectiveness separately for each group and compare results. The big advantage of running things this way rather than, say, showing everyone version A on Monday followed by version B on Tuesday is that it automatically corrects for external confounding factors; what if Monday was a public holiday, for example.

So far, so good. It all sounds pretty simple, and implementation can be as straightforward as setting a cookie and counting entries in the server logs. However, things get a little more complicated when it comes to analyzing the results.
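As a rough illustration of the cookie approach, here is a minimal Python sketch. The cookie name, the variant labels and the plain-dict representation of cookies are all assumptions made for the example, not part of any particular web framework.

```python
import random

VARIANTS = ("A", "B")
COOKIE_NAME = "split_test_variant"  # hypothetical cookie name


def assign_variant(cookies, rng=random):
    """Return (variant, cookies), reusing a stored assignment when there is one."""
    variant = cookies.get(COOKIE_NAME)
    if variant not in VARIANTS:
        # First visit: pick a group at random and remember it in the cookie.
        variant = rng.choice(VARIANTS)
        cookies = {**cookies, COOKIE_NAME: variant}
    return variant, cookies


variant, cookies = assign_variant({})         # new visitor
again, _ = assign_variant(cookies)            # returning visitor keeps the same group
print(variant, again)
```

Storing the assignment means a returning visitor always sees the same version, so each visitor is counted in exactly one group.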

For example, how do you know when to stop collecting data and make a decision? Leaving the test running for too long is a waste of time, which is something that most start-ups don’t exactly have a lot of, but not collecting enough data has more subtle consequences. Each group of users will display a wide variety of behavior for all sorts of reasons that have nothing to do with the change you’re making. Suppose that by pure chance the average age of visitors in group A was much higher than in group B; in this case you could easily imagine that their behavior would differ regardless of the versions of the site they had seen. Put another way, how can you be confident that the difference you observe implies a fundamental difference between the versions rather than simply being explained by random chance? This topic is known as *statistical significance*.

There are a few ways to approach this question. One common approach is frequentist hypothesis testing, which I’m not going to discuss here. Instead I’ll focus on an approach based on Bayesian modelling.

As the name would suggest, at the core of this approach is a mathematical model of the data observed during the test. To be a little more precise, by mathematical model I mean a statement about the relationship between various quantities. A non-statistical example of this is Ohm’s law, which is a model of electrical conductivity in idealized materials, and states that three quantities, current (*I*), voltage (*V*) and resistance (*R*), are related by

$$V = IR$$

Statistical models generalize this by introducing *random variables* into the mix. A random variable is a variable which, rather than having a single fixed value, is represented by a distribution of possible values with associated probabilities; we may be 90% sure that the number of beers left in the fridge is 10, but we can’t quite remember who drank what last night, so there’s a 10% chance that the number is 9. The exact meaning of probabilities is an interesting philosophical discussion in its own right, but intuitively it’s a measure of the strength of our belief represented as a real number between 0 and 1. Values with probability 0 can never happen, values with probability 1 are certain, and everything in between may or may not be true.

**Models for split tests**

How do we apply it to the results of a split test? Let’s start by modelling the behavior of a single group of users.

As a concrete example, let’s say we want to improve the number of users successfully filling in our sign-up form. In this case, over some period *n* visitors land on the form, of which *k* successfully fill it in and hit ‘submit’. A third relevant quantity is *p*, which is the conversion rate, i.e. the probability that a randomly chosen individual *from the entire population*, when presented with the form, will sign up. The emphasis here is important; we want to be able to generalize to future visitors, so calculating a value for *p* based purely on the participants in the test, while easy, isn’t good enough.

Before we can make any inferences we need to relate these quantities to each other via a statistical model. In this case a binomial model is appropriate. This uses a Binomial distribution, which has the following *probability mass function* (PMF) for the value of *k* given a certain *n* and *p*:

$$f(k; n, p) = \binom{n}{k} p^k (1 - p)^{n - k}$$

The PMF allows us to take a value of *k* and find the probability of that value occurring under the model. Graphically it looks like this:

where the red, green and blue curves are for *p=0.5*, *p=0.9* and *p=0.1* respectively (*n=100* in all cases).
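For readers who prefer numbers to curves, the binomial PMF can be evaluated directly. The following is a small Python sketch using only the standard library; the function name `binomial_pmf` is my own choice for the example. It evaluates the PMF for the same three values of *p* with *n=100* and reports where each curve peaks.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 100
for p in (0.5, 0.9, 0.1):
    pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
    # The most likely k is the position of the peak of the curve.
    peak = max(range(n + 1), key=pmf.__getitem__)
    print(f"p={p}: peak at k={peak}, probabilities sum to {sum(pmf):.6f}")
```

Each curve peaks close to *n* times *p*, and the probabilities over all possible values of *k* add up to one.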

The binomial distribution is often described using the example of a biased coin. Suppose I have such a coin with a known probability, *p*, of turning up heads; the binomial distribution represents the probability of seeing *k* heads if I flip it *n* times. Note the use of random variables: even if *n* and *p* are known with certainty we still can’t do better than assigning a distribution over a range of possible values for *k*. Hopefully it’s not too much of a stretch to relate this scenario to the sign-up conversion problem.

**Inference and Bayes’ theorem**

Let’s write the probability of a particular value of *k* as *P(k | p, n)*. In this notation the bar (‘|’) represents *conditional probability*. In other words, this is an expression for the distribution over possible values of *k* if *p* and *n* have known, fixed values, and in this case is exactly the binomial PMF given by *f(.)* above.

This isn’t quite what we want. Given the results of a test, *k* is known, but *p* isn’t, so we want to know a distribution over *p* given the observed data, or *P(p | k, n)*. Fortunately, Bayes’ theorem tells us how to compute precisely that:

$$P(p \mid k, n) = \frac{P(k \mid p, n)\, P(p \mid n)}{P(k \mid n)}$$

There are a couple of other quantities here, *P(p | n)*, which is known as the prior, and *P(k | n)*. The prior represents our beliefs about *p* in the absence of data. Since we know nothing in that case it’s not unreasonable to model it as a flat distribution (i.e. a constant). *P(k | n)* is dependent only on fixed, observed quantities, so can also be treated as a constant for this analysis, hence:

$$P(p \mid k, n) = \frac{P(k \mid p, n)}{Z}$$

where the constant *Z* absorbs both the flat prior and *P(k | n)*.

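Probability distributions must sum to one (i.e. we know that we’ll certainly get one of the possible values), so the normalising constant *Z* isn’t free to vary arbitrarily: it is whatever value makes *P(p | k, n)* sum to one over all possible values of *p*.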

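All of this can easily be done numerically, either with a small script in your language of choice or using a spreadsheet. Excel has a function BINOMDIST which gives *P(k | p, n)*, so you can use something like `=BINOMDIST(k, n, p, FALSE)`, evaluated for a column of candidate values of *p* and then divided by the column total so that the results sum to one.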
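The "small script" route might look something like the following Python sketch. It assumes a flat prior, evaluates the binomial PMF on an evenly spaced grid of candidate values of *p*, and normalises so the results sum to one. The function names, the grid size and the example counts (30 sign-ups from 100 visitors) are illustrative assumptions only.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(k | p, n) under the binomial model."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def grid_posterior(k, n, steps=1000):
    """Approximate P(p | k, n) on a grid of cell midpoints, assuming a flat prior."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    likelihood = [binomial_pmf(k, n, p) for p in grid]
    z = sum(likelihood)  # the normalising constant Z
    return grid, [w / z for w in likelihood]


# Made-up example data: 30 sign-ups out of 100 visitors.
grid, posterior = grid_posterior(k=30, n=100)
mean = sum(p * w for p, w in zip(grid, posterior))
print(f"posterior mean of p is roughly {mean:.3f}")
```

The two lists returned by `grid_posterior` play the same role as two spreadsheet columns: one of candidate conversion rates and one of the probability assigned to each.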

**Comparing test groups**

In a split test we treat each group as a separate population, with separate conversion rates, *p_a* and *p_b*. Each of these can be analysed as above, so we’ll end up with a distribution for each. Numerically, this will be a set of discrete values of *p* for each group, with a probability assigned to each value, probably represented as two columns if you’re using a spreadsheet.

We’ll treat the two groups as independent. For independent variables ‘and’ queries correspond to multiplying probabilities, so the probability of group A having conversion rate *p_a* *and* group B having conversion rate *p_b* is

$$P(p_a, p_b) = P(p_a \mid k_a, n_a)\, P(p_b \mid k_b, n_b)$$

where *k_a* and *n_a* are the sign-ups and visitors observed in group A, and likewise for group B.

Finding the probability that A wins is then just a matter of finding all of the pairs *(p_a, p_b)* where *p_a > p_b*, multiplying the corresponding values for each and then adding them all up. Using mathematical notation, this is the same as saying

$$P(p_a > p_b) = \sum_{p_a > p_b} P(p_a \mid k_a, n_a)\, P(p_b \mid k_b, n_b)$$

It turns out that this sort of calculation doesn’t really lend itself to spreadsheets, but it’s pretty straightforward in most programming languages. We’ve actually put some of the scripts we use for this kind of analysis on GitHub: https://github.com/songkick/skab.
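To give a flavour of what such a script involves, here is a self-contained Python sketch of the pairwise sum. It is not the code from the repository above; the function names, the grid size and the example counts are assumptions made purely for illustration. The `keep` argument selects which pairs contribute to the sum, so the same function can answer other questions about the pair of conversion rates.

```python
from math import comb


def grid_posterior(k, n, steps=250):
    """Posterior over p on a grid of cell midpoints, assuming a flat prior."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    weights = [comb(n, k) * p**k * (1 - p) ** (n - k) for p in grid]
    z = sum(weights)
    return grid, [w / z for w in weights]


def pair_probability(post_a, post_b, grid, keep):
    """Add up P(p_a) * P(p_b) over the grid pairs for which keep(p_a, p_b) is true."""
    return sum(
        wa * wb
        for pa, wa in zip(grid, post_a)
        for pb, wb in zip(grid, post_b)
        if keep(pa, pb)
    )


# Made-up example data: 45/300 sign-ups in group A, 30/300 in group B.
grid, post_a = grid_posterior(k=45, n=300)
_, post_b = grid_posterior(k=30, n=300)

p_a_wins = pair_probability(post_a, post_b, grid, lambda pa, pb: pa > pb)
p_close = pair_probability(post_a, post_b, grid, lambda pa, pb: abs(pa - pb) < 0.01)
print(f"P(A beats B) is roughly {p_a_wins:.3f}")
print(f"P(rates within 0.01 of each other) is roughly {p_close:.3f}")
```

The double loop visits every pair on the grid, so the cost grows with the square of the grid size; for a single comparison that is rarely a problem.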

To make a decision you first need to decide how confident you want to be. If the answer you get from the above is 0.95 and you’re happy with a 5% margin of error you should choose to roll out version A, and if it’s 0.05 you probably want to pick B.

If you get something close to 0.5 you need to work out whether to declare neither A nor B the winner (i.e. they’re as good as each other), or wait a bit longer and gather more data. To help with this you can vary the above sum to consider pairs where *p_a* and *p_b* are within some small distance of each other (say a 1% difference). If the probability mass for these pairs is high it’s very likely that there is little difference between A and B, but if not you just don’t have enough data to draw a conclusion either way.

Interesting post. It seems to me that you’re basically doing a Monte Carlo integration. I’m not sure you’re generating the samples from the posterior correctly. Sampling uniformly from the prior will not give an unbiased estimate of the posterior. You should be using something like rejection sampling to adjust your samples, methinks. Alternatively you could represent the posterior directly if you were to estimate conversion rate: then you can use a beta prior over the conversion rate and since this is conjugate the posterior is also beta. Sampling from the beta distribution is easy.

Hope that makes sense! I can’t leave without the obligatory plug for our A/B testing product Myna, which avoids all these issues (http://mynaweb.com/)

Thanks for the comment. We’re not actually doing Monte Carlo here – it’s just plain numerical integration. Do you still think that biases our results?

The relationship to the beta distribution is useful. If I’m understanding that correctly it’s essentially the same as what we’re doing in the post, but it relates the integral that we’re approximating with a sum to the regularized incomplete beta function, which in turn can be related to functions commonly available in spreadsheets. I think that should help simplify the tooling for this analysis.

Ok, I read more closely. In higher dimensions this wouldn’t work but in 1D you’re (probably) fine.

You don’t need to do this, however. What I said rather badly above is that the beta distribution is the conjugate prior for the Bernoulli (and hence binomial). If you start with a beta prior you will end up with a beta posterior. You can then sample directly from the two posteriors to approximate the final integral via Monte Carlo integration.
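A minimal Python sketch of that idea, using only the standard library, might look like this. It assumes a flat Beta(1, 1) prior, in which case the posterior after *k* conversions from *n* visitors is Beta(1 + k, 1 + n - k). The function name, the number of draws and the example counts are illustrative assumptions.

```python
import random


def prob_a_beats_b_mc(k_a, n_a, k_b, n_b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(p_a > p_b) under flat Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Draw one plausible conversion rate for each group from its posterior.
        p_a = rng.betavariate(1 + k_a, 1 + n_a - k_a)
        p_b = rng.betavariate(1 + k_b, 1 + n_b - k_b)
        wins += p_a > p_b
    return wins / draws


# Made-up example data: 45/300 sign-ups in group A, 30/300 in group B.
print(f"P(A beats B) is roughly {prob_a_beats_b_mc(45, 300, 30, 300):.3f}")
```

The estimate is the fraction of paired draws in which A's sampled rate exceeds B's, so its accuracy improves with the number of draws.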

You might then note you can 1) update your posterior after every view of a variant and 2) use your posterior to adapt the proportion in which you show A and B while your experiment is running. What you now have is an algorithm for solving the so-called bandit problem. This algorithm is called Thompson sampling. Myna is an approach to A/B testing that uses this idea (amongst others).
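For the curious, a bare-bones version of Thompson sampling with beta posteriors can be sketched in a few lines of Python. This is only an illustration of the general idea and makes no claim about how Myna works; the class name, the flat Beta(1, 1) prior and the simulated conversion rates are all assumptions made for the example.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of variants."""

    def __init__(self, variants, seed=None):
        self.rng = random.Random(seed)
        # [conversions, non-conversions] per variant; the Beta(1, 1) prior
        # is added when sampling.
        self.counts = {v: [0, 0] for v in variants}

    def choose(self):
        """Draw one rate per variant from its posterior and show the largest."""
        draws = {
            v: self.rng.betavariate(1 + s, 1 + f)
            for v, (s, f) in self.counts.items()
        }
        return max(draws, key=draws.get)

    def update(self, variant, converted):
        """Record the outcome of one view of a variant."""
        self.counts[variant][0 if converted else 1] += 1


# Simulation with made-up "true" conversion rates, unknown to the sampler.
true_rates = {"A": 0.15, "B": 0.10}
sampler = ThompsonSampler(true_rates, seed=1)
world = random.Random(2)
shown = {"A": 0, "B": 0}
for _ in range(5000):
    v = sampler.choose()
    shown[v] += 1
    sampler.update(v, world.random() < true_rates[v])
print(shown)
```

Because each variant is shown in proportion to how often its sampled rate comes out on top, traffic drifts towards the better variant as evidence accumulates, while the weaker one still gets occasional views.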