# What is A/B Testing?

• Develop two versions of a page
• Randomly show different versions to users
• Track how users perform
• Evaluate (that's the tricky part)
• Use the better version

# Why A/B test?

• You don't know what drives user behaviour
• They can't tell you
• Subtle changes make a big difference

# What can you A/B test?

• Removing form fields
• Anything

# A/B tests do not substitute for

• Talking to users
• Usability tests
• Thinking

# What is chi-square?

• A method for evaluating A/B tests
• Only answers yes/no questions (but you pick the question)
• Only handles 2 versions (there is a workaround)
• Requires independence in samples
• Does not do confidence intervals
• There are alternatives

# What to measure

• Divide your users into groups A and B
• Decide whether each user did what you want
• Reduce your results to 4 numbers (\$a_yes, \$a_no, \$b_yes, \$b_no)

|       | Yes      | No      | Total    |
|-------|----------|---------|----------|
| A     | \$a_yes  | \$a_no  | \$a      |
| B     | \$b_yes  | \$b_no  | \$b      |
| Total | \$yes    | \$no    | \$total  |
• \$a_yes = # in A who are yes
• \$a_no = # in A who are no
• \$b_yes = # in B who are yes
• \$b_no = # in B who are no
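A minimal sketch of this reduction in Perl. The `@users` records, with hypothetical `group` and `converted` fields, stand in for however you actually track users:

```perl
use strict;
use warnings;

# Hypothetical input: one record per user, with the group the user
# was assigned to and whether they did what we wanted.
my @users = (
    { group => 'A', converted => 1 },
    { group => 'A', converted => 0 },
    { group => 'B', converted => 1 },
    { group => 'B', converted => 1 },
);

my ($a_yes, $a_no, $b_yes, $b_no) = (0, 0, 0, 0);
for my $u (@users) {
    if ($u->{group} eq 'A') {
        $u->{converted} ? $a_yes++ : $a_no++;
    }
    else {
        $u->{converted} ? $b_yes++ : $b_no++;
    }
}
print "$a_yes $a_no $b_yes $b_no\n";   # prints: 1 1 2 0
```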

# Scary Math Part 1 - Addition

|       | Yes      | No      | Total    |
|-------|----------|---------|----------|
| A     | \$a_yes  | \$a_no  | \$a      |
| B     | \$b_yes  | \$b_no  | \$b      |
| Total | \$yes    | \$no    | \$total  |
• \$a = \$a_yes + \$a_no
• \$b = \$b_yes + \$b_no
• \$yes = \$a_yes + \$b_yes
• \$no = \$a_no + \$b_no
• \$total = \$a + \$b (or \$yes + \$no)

# Scary Math Part 2 - Expectations

|       | Yes        | No        | Total    |
|-------|------------|-----------|----------|
| A     | \$e_a_yes  | \$e_a_no  | \$a      |
| B     | \$e_b_yes  | \$e_b_no  | \$b      |
| Total | \$yes      | \$no      | \$total  |
• \$e_a_yes = \$a * \$yes / \$total
• \$e_a_no = \$a * \$no / \$total
• \$e_b_yes = \$b * \$yes / \$total
• \$e_b_no = \$b * \$no / \$total
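In Perl, parts 1 and 2 are only a handful of lines. The counts here are made up for illustration (200 of 1,000 conversions in A, 250 of 1,000 in B):

```perl
use strict;
use warnings;

# Made-up counts for illustration
my ($a_yes, $a_no, $b_yes, $b_no) = (200, 800, 250, 750);

# Part 1 - Addition
my $a     = $a_yes + $a_no;      # 1000
my $b     = $b_yes + $b_no;      # 1000
my $yes   = $a_yes + $b_yes;     # 450
my $no    = $a_no  + $b_no;      # 1550
my $total = $a + $b;             # 2000

# Part 2 - Expectations: what each cell would be if A and B
# converted at the same overall rate
my $e_a_yes = $a * $yes / $total;  # 225
my $e_a_no  = $a * $no  / $total;  # 775
my $e_b_yes = $b * $yes / $total;  # 225
my $e_b_no  = $b * $no  / $total;  # 775
```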

# Scary Math Part 3 - Chi-square

• A basic chi-square term looks like this:
(\$measured - \$expected)**2 / \$expected
• It was invented by Karl Pearson in 1900.
• Statisticians know its distribution very well
• We will treat it as magic

# Scary Math Part 4 - Calculation

We have 4 measurements and 4 expectations. So we have 4 chi-square terms. We add them:
```
my $chi_square =
      ($a_yes - $e_a_yes)**2 / $e_a_yes
    + ($a_no  - $e_a_no )**2 / $e_a_no
    + ($b_yes - $e_b_yes)**2 / $e_b_yes
    + ($b_no  - $e_b_no )**2 / $e_b_no;
```

# Scary Math Part 5: Interpretation

```
use Statistics::Distributions qw(chisqrprob);
my $p = chisqrprob(1, $chi_square);
```
1. If the samples are all independent...
2. and the expected predictions are all at least 10...
3. and the real performance of A and B is the same...
4. then \$p ≈ prob(chi-square should be > \$chi_square)
5. If \$p is "small", conclude #3 likely wrong
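Putting parts 1 through 4 together: a self-contained sketch using the same made-up counts as before (200 of 1,000 conversions in A, 250 of 1,000 in B). The chisqrprob call is left as a comment so the sketch needs nothing beyond core Perl; on these numbers it would report a p of roughly 0.007, below a 0.01 cutoff:

```perl
use strict;
use warnings;

# Made-up counts: 200 of 1000 in A converted, 250 of 1000 in B
my ($a_yes, $a_no, $b_yes, $b_no) = (200, 800, 250, 750);

my $a     = $a_yes + $a_no;
my $b     = $b_yes + $b_no;
my $yes   = $a_yes + $b_yes;
my $no    = $a_no  + $b_no;
my $total = $a + $b;

my %measured = (a_yes => $a_yes, a_no => $a_no,
                b_yes => $b_yes, b_no => $b_no);
my %expected = (
    a_yes => $a * $yes / $total,
    a_no  => $a * $no  / $total,
    b_yes => $b * $yes / $total,
    b_no  => $b * $no  / $total,
);

# Sum one chi-square term per cell
my $chi_square = 0;
$chi_square += ($measured{$_} - $expected{$_})**2 / $expected{$_}
    for keys %measured;

printf "chi-square = %.3f\n", $chi_square;   # about 7.168
# chisqrprob(1, $chi_square) would report p of roughly 0.007 here
```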

# How Small Is "Small"?

• Increased certainty needs a larger sample
• You'd have to be unlucky to be wrong when you decide...
• But you get multiple chances to be unlucky
• I push for 99% confidence (p < 0.01)

# Recap of A/B setup

• Develop two versions of a page
• Randomly divide users into two groups
• Show each group a different version
• Track how those users perform

# Recap of chi-square evaluation

• Select a yes/no question about users
• Divide users in A and B into yes/no
• Perform complicated chi-square calculation
• Make decision if \$p is small and sample is large enough

# A/B test simulation

• I ran several simulations of 100,000 parallel A/B tests...
• with known conversion rates and confidence levels.
• Found odds of a wrong conclusion...
• ...and how long until 25%, 50%, etc of tests finished

# Best Case Example

## A's true conversion: 50%, B's true conversion: 55%

Error Rate and Final Sample Size
| confidence | p(wrong) | 25%   | 50%   | 75%   | 90%   |
|------------|----------|-------|-------|-------|-------|
| 95%        | 4.6%     | 146   | 570   | 1,500 | 2,750 |
| 98%        | 1.9%     | 383   | 1,170 | 2,430 | 3,920 |
| 99%        | 0.9%     | 670   | 1,680 | 3,120 | 4,770 |
| 99.5%      | 0.4%     | 1,020 | 2,200 | 3,770 | 5,500 |

# Low Conversion Example

## A's true conversion: 10%, B's true conversion: 11%

Error Rate and Final Sample Size
| confidence | p(wrong) | 25%   | 50%    | 75%    | 90%    |
|------------|----------|-------|--------|--------|--------|
| 95%        | 7.2%     | 790   | 4,150  | 12,700 | 24,500 |
| 98%        | 3.3%     | 2,740 | 9,980  | 21,700 | 35,900 |
| 99%        | 1.5%     | 5,620 | 15,100 | 28,500 | 45,500 |
| 99.5%      | 0.8%     | 8,900 | 20,000 | 34,900 | 51,500 |

# Low Lift Example

## A's true conversion: 50%, B's true conversion: 51%

Error Rate and Final Sample Size
| confidence | p(wrong) | 25%    | 50%    | 75%    | 90%     |
|------------|----------|--------|--------|--------|---------|
| 95%        | 16.2%    | 257    | 3,680  | 23,700 | 55,100  |
| 98%        | 8.0%     | 2,300  | 20,000 | 50,600 | 90,800  |
| 99%        | 4.5%     | 9,540  | 34,300 | 72,000 | 116,000 |
| 99.5%      | 2.4%     | 18,400 | 50,000 | 90,100 | 132,000 |

# A/B test Scaling Principles

• You can't really predict how long a test will take
• You should be prepared for tens of thousands
• 1/5 the conversion rate takes about 5x as long
• 1/5 the lift takes about 25x as long (with wide variation)

# A/B test scaling tips

• Test where you have volume
• Test a high converting question
• Be willing to stop a test that is going nowhere

# More A/B testing tips

• Give QA a way to force A versus B
• Test multiple questions in parallel
• Automate reporting
• Provide an interactive A/B calculator
• Standardize testing metrics

# Compare apples to apples

• Make sure there are no differences between A and B
• Traffic behaves differently at different times
• Friday night ≠ Monday morning
• First week in month ≠ last week in month
• Do not try to compare people from different times

# A/B ratio need not be 50/50

• A and B can receive unequal traffic
• But do not change the mix they get
• Wrong: change (90/10) A vs B to (80/20)
• Right: change (10/10/80) A vs B vs Untested to (20/20/60)

# Beware of hidden correlations

• Correlations increase variability, and therefore \$chi_square
• ...even if you don't see them
• Mistake: divide people into A vs B, then test orders started versus orders completed
• Wrong because different orders by the same person are correlated with each other
• Fix: Test whether users completed an order

# Tip: Use rand()

• Can you assign A vs B based on \$user_id % 2?
• Yes, but assigning based on rand() is better
• Easier to write QA hook to force A vs B
• Can run any number of parallel A/B tests
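A minimal sketch of rand()-based assignment with a QA hook. The `ab_version` function and `%assignment` store are hypothetical names; in real life the assignment would live in a database or cookie so it sticks across requests:

```perl
use strict;
use warnings;

my %assignment;         # in real life: a database or cookie
my $b_fraction = 0.5;   # need not be 50/50, but keep it constant

sub ab_version {
    my ($user_id, $force) = @_;
    # QA hook: force a particular version
    return $assignment{$user_id} = $force if defined $force;
    # Assign on first sight, remember it afterwards
    $assignment{$user_id} //= rand() < $b_fraction ? 'B' : 'A';
    return $assignment{$user_id};
}

my $v = ab_version(42);
print ab_version(42) eq $v ? "sticky\n" : "broken\n";   # prints: sticky
print ab_version(7, 'B'), "\n";                         # prints: B
```

Because each test draws its own rand(), any number of parallel A/B tests stay independent, which \$user_id % 2 cannot give you.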

# A/B/C... testing the wrong way

• Statistics books say chi-square works for more than 2x2
• With \$n versions, and \$m possible answers, you can set up the table like we did...
• do the same calculations (with more variables)...
• and \$p = chisqrprob((\$n-1)*(\$m-1), \$chi_square)

# Why is this wrong?

• The statistics books are right...
• You know that there is a difference, but not what it is

# Extreme example

• Suppose we have 10 million versions
• Half convert at 51%, half at 49%
• At 21 samples each you know they are different
• But you can't tell the 51% versions from the 49% versions!

# A/B/C... testing the right way

• With \$n versions there are (\$n choose 2) = \$n*(\$n-1)/2 ways to get an unlikely result
• So compare the best with the worst...
• and multiply \$p by (\$n choose 2)
• This overestimates \$p (but not by much when \$p is small)
• Remove one version at a time
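The correction itself is simple arithmetic. A sketch with made-up numbers: 4 versions, and a raw p of 0.002 from the 2x2 test of the best version against the worst:

```perl
use strict;
use warnings;

my $n     = 4;       # number of versions (made up)
my $raw_p = 0.002;   # p from best vs worst 2x2 test (made up)

my $comparisons = $n * ($n - 1) / 2;       # (n choose 2) = 6
my $adjusted_p  = $raw_p * $comparisons;   # 0.012
$adjusted_p = 1 if $adjusted_p > 1;        # a probability cannot exceed 1

# 0.002 looked convincing, but the corrected 0.012 fails a
# 99% confidence (p < 0.01) cutoff - keep collecting data
```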

# What is the chi-square distribution?

• Suppose that X is a standard normal
• Then X**2 has a 1-dimensional chi-square distribution
• The sum of n independent 1-dim chi-squares has an n-dimensional chi-square distribution
• Pearson proved that for our 2x2 table we should use the 1-dimensional version (that is the 1 we passed to chisqrprob)

# Comparison with G-test

• Instead of the basic chi-square term use:
2 * \$measured * log( \$measured / \$expected )
• Otherwise exactly like chi-square - even the same distribution! (see example)
• More accurate than chi-square for small sample sizes
• Chi-square is far more widely known
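For comparison, here is the earlier made-up example (200 of 1,000 conversions in A, 250 of 1,000 in B) evaluated with G-test terms. Summing over all four cells lands near the chi-square value of about 7.17 for the same data:

```perl
use strict;
use warnings;

# Same made-up counts as the chi-square example
my ($a_yes, $a_no, $b_yes, $b_no) = (200, 800, 250, 750);
my ($a, $b)     = ($a_yes + $a_no,  $b_yes + $b_no);
my ($yes, $no)  = ($a_yes + $b_yes, $a_no  + $b_no);
my $total       = $a + $b;

# [measured, expected] pairs, one per cell
my @cells = (
    [$a_yes, $a * $yes / $total],
    [$a_no,  $a * $no  / $total],
    [$b_yes, $b * $yes / $total],
    [$b_no,  $b * $no  / $total],
);

# Sum one G-test term per cell
my $g = 0;
$g += 2 * $_->[0] * log($_->[0] / $_->[1]) for @cells;

printf "G = %.3f\n", $g;   # about 7.180, versus 7.168 for chi-square
```

The resulting \$g goes through chisqrprob exactly as \$chi_square did.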

# Testing non-yes/no questions

• The following few slides give a way to directly test questions like "Which makes more money/person?"
• I developed it and believe it works
• Be warned, while I have a strong math background, I am not a statistician
• Please see the example program

# Some basic terms

• Suppose X is a random variable
• E(X), called the expected value, is what you expect the arithmetic average of many samples to be
• Var(X), called the variance, is a measure of variability. Officially defined as
Var(X) = E((X - E(X))**2)
• The square root of the variance is the standard deviation

# Basic properties

• Let X and Y be independent random variables, and k be a constant
• Then the following hold
• E(X+Y) = E(X) + E(Y), and Var(X+Y) = Var(X) + Var(Y)
• E(k+X) = k + E(X), and Var(k+X) = Var(X)
• E(k*X) = k * E(X), and Var(k*X) = k**2 * Var(X)

# Estimating E(X) and Var(X)

• Suppose x1, x2, ..., xn are independent observations of a random variable X
• Then the arithmetic average is an estimate of expected value
E(X) ≈ (x1 + x2 + ... + xn)/n
• If m is the arithmetic average
Var(X) ≈ ((x1 - m)**2 + (x2 - m)**2 + ... + (xn - m)**2)/(n - 1)

# Central Limit Theorem

• Suppose x1, x2, ..., xn are independent observations of a random variable X
• Then x1 + x2 + ... + xn has approximately a normal distribution with expected value n*E(X) and variance n*Var(X)
• This is one of the most important math theorems of the 1800s
• It underlies most of statistics

# A/B Testing Setup

• Divide people into A and B
• Track them, figure out how well each performed (e.g. revenue per person)
• Create arrays @a and @b of their performances. Don't forget to include the 0's!
• That is our raw data

# Variance calculation

• Estimate the variance of (@a, @b) and assign that to \$var
• Make @c be (@a, @b) minus the largest value. Estimate its variance in \$var_c
• Let \$w = \$var/\$var_c - 1. This indicates how much of the overall variation is caused by the largest outlier.
• If \$w*(@a+@b)/@a < 0.1 and \$w*(@a+@b)/@b < 0.1 then continue (remember that @a in scalar context is the size of the array)
• Otherwise we can continue, but shouldn't trust the results

# The difference of the averages is..?

• Let \$m_a = sum(@a)/@a and \$m_b = sum(@b)/@b.
• \$m_a is approximately normally distributed with variance \$var/@a
• \$m_b is approximately normally distributed with variance \$var/@b
• (\$m_a - \$m_b) is approximately normally distributed with variance (\$var/@a + \$var/@b)

# We can test that!

• If A and B are the same, then this calculates a p-value:
```
use Statistics::Distributions qw(uprob);
# time passes, the previous calculations happen...
my $m_var = $var/@a + $var/@b;
# The 2 is because we're doing a 2-tailed test
my $p = 2 * uprob( abs($m_a - $m_b) / sqrt($m_var) );
```
• Conversely Statistics::Distributions::udistr can calculate confidence intervals
• See example

# Final technical note

• Purists will say we should estimate the variance of @a and @b separately
• Theoretically that's right
• But is error-prone if you have large outliers
• In revenue data, large outliers are common
• Combining them gives reliable answers sooner