A/B Testing
for Fun and Profit
Ben Tilly
Pictage
(These slides use S5.)
Sample Programs
This is for after the presentation
What is A/B Testing?
 Develop two versions of a page
 Randomly show different versions to users
 Track how users perform
 Evaluate (that's the tricky part)
 Use the better version
Why A/B test?
 You don't know what drives user behaviour
 They can't tell you
 Subtle changes make a big difference
 How about +40%?
What can you A/B test?
 Removing form fields
 Adding relevant form fields
 Adding/removing explanations
 Adding/removing interstitial pages
 Anything
A/B tests do not substitute for
 Talking to users
 Usability tests
 Thinking
What is chi-square?
 A method for evaluating A/B tests
 Only answers yes/no questions (but you pick the question)
 Only handles 2 versions (there is a workaround)
 Requires independence in samples
 Does not do confidence intervals
 There are alternatives
What to measure
 Start your A/B test
 Divide your users into groups A and B
 Decide whether each user did what you want
 Reduce your results to 4 numbers ($a_yes, $a_no, $b_yes, $b_no)
Arrange your measurements

            Yes      No
   A     $a_yes   $a_no       $a
   B     $b_yes   $b_no       $b
           $yes     $no   $total

 $a_yes = # in A who are yes
 $a_no = # in A who are no
 $b_yes = # in B who are yes
 $b_no = # in B who are no

Scary Math Part 1: Addition

            Yes      No
   A     $a_yes   $a_no       $a
   B     $b_yes   $b_no       $b
           $yes     $no   $total
 $a = $a_yes + $a_no
 $b = $b_yes + $b_no
 $yes = $a_yes + $b_yes
 $no = $a_no + $b_no
 $total = $a + $b (or $yes + $no)

Scary Math Part 2: Expectations

              Yes        No
   A     $e_a_yes   $e_a_no       $a
   B     $e_b_yes   $e_b_no       $b
             $yes       $no   $total

 $e_a_yes = $a * $yes / $total
 $e_a_no = $a * $no / $total
 $e_b_yes = $b * $yes / $total
 $e_b_no = $b * $no / $total

Scary Math Part 3: Chi-square
 A basic chi-square term looks like this:
($measured - $expected)**2 / $expected

It was invented by Karl Pearson in 1900.
 Statisticians know its distribution very well
 We will treat it as magic
Scary Math Part 4: Calculation
We have 4 measurements and 4 expectations, so we have 4 chi-square
terms. We add them:
my $chi_square =
    ($a_yes - $e_a_yes)**2 / $e_a_yes
  + ($a_no  - $e_a_no )**2 / $e_a_no
  + ($b_yes - $e_b_yes)**2 / $e_b_yes
  + ($b_no  - $e_b_no )**2 / $e_b_no;
Scary Math Part 5: Interpretation
use Statistics::Distributions qw(chisqrprob);
my $p = chisqrprob(1, $chi_square);
 If the samples are all independent...
 and the expected predictions are all at least 10...
 and the real performance of A and B is the same...
 then $p ≈ prob(chi-square > $chi_square)
 If $p is "small", conclude #3 (same real performance) is likely wrong
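The arithmetic of Parts 1-5 can be sketched in Python (the function name chi_square_p is illustrative; for 1 degree of freedom, math.erfc gives the same tail probability that chisqrprob(1, ...) returns):

```python
import math

def chi_square_p(a_yes, a_no, b_yes, b_no):
    """Chi-square test for a 2x2 A/B table; returns ($chi_square, $p)."""
    # Part 1: addition
    a, b = a_yes + a_no, b_yes + b_no
    yes, no = a_yes + b_yes, a_no + b_no
    total = a + b
    # Parts 2-4: expectations and the four chi-square terms
    chi_square = 0.0
    for measured, row, col in ((a_yes, a, yes), (a_no, a, no),
                               (b_yes, b, yes), (b_no, b, no)):
        expected = row * col / total
        chi_square += (measured - expected) ** 2 / expected
    # Part 5: with 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p
```

For example, 100 of 200 converting in A against 120 of 200 in B gives a chi-square of about 4.04, which is not small enough for 99% confidence.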
How Small Is "Small"?
 Increased certainty needs a larger sample
 You'd have to be unlucky to be wrong when you decide...
 But you get multiple chances to be unlucky
 I push for 99% confidence (p < 0.01)
Recap of A/B setup
 Develop two versions of a page
 Randomly divide users into two groups
 Show each group a different version
 Track how those users perform
Recap of chi-square evaluation
 Select a yes/no question about users
 Divide users in A and B into yes/no
 Perform the complicated chi-square calculation
 Make a decision if $p is small and the sample is large enough
A/B test simulation
 I ran several simulations of 100,000 parallel A/B tests...
 with known conversion rates and confidence levels.
 Found the odds of a wrong conclusion...
 ...and how long until 25%, 50%, etc. of the tests finished
Best Case Example
A's true conversion: 50%, B's true conversion: 55%
Error Rate and Final Sample Size

 confidence   p(wrong)     25%     50%     75%     90%
        95%       4.6%     146     570   1,500   2,750
        98%       1.9%     383   1,170   2,430   3,920
        99%       0.9%     670   1,680   3,120   4,770
      99.5%       0.4%   1,020   2,200   3,770   5,500

(The 25%-90% columns give the sample size by which that fraction of the simulated tests had finished.)
Low Conversion Example
A's true conversion: 10%, B's true conversion: 11%
Error Rate and Final Sample Size

 confidence   p(wrong)     25%     50%      75%      90%
        95%       7.2%     790   4,150   12,700   24,500
        98%       3.3%   2,740   9,980   21,700   35,900
        99%       1.5%   5,620  15,100   28,500   45,500
      99.5%       0.8%   8,900  20,000   34,900   51,500
Low Lift Example
A's true conversion: 50%, B's true conversion: 51%
Error Rate and Final Sample Size

 confidence   p(wrong)      25%     50%     75%      90%
        95%      16.2%      257   3,680  23,700   55,100
        98%       8.0%    2,300  20,000  50,600   90,800
        99%       4.5%    9,540  34,300  72,000  116,000
      99.5%       2.4%   18,400  50,000  90,100  132,000
A/B test scaling principles
 You can't really predict how long a test will take
 You should be prepared for tens of thousands
 1/5 the conversion rate takes about 5x as long
 1/5 the lift takes about 25x as long (with wide variation)
A/B test scaling tips
 Test where you have volume
 Test a high-converting question
 (e.g. "Did they register?" instead of "Did they buy?")
 Be willing to stop a test that is going nowhere
More A/B testing tips
 Give QA a way to force A versus B
 Test multiple questions in parallel
 Automate reporting
 Provide interactive A/B calculator
 Standardize testing metrics
Compare apples to apples
 Make sure there are no differences between A and B
 Traffic behaves differently at different times
 Friday night ≠ Monday morning
 First week in month ≠ last week in month
 Do not try to compare people from different times
A/B ratio need not be 50/50
 A and B can receive unequal traffic
 But do not change the mix they get
 Wrong: change (90/10) A vs B to (80/20) A vs B
 Right: change (10/10/80) A vs B vs Untested to (20/20/60)
Beware of hidden correlations
 Correlations increase variability, and therefore $chi_square
 ...even if you don't see them
 Mistake: divide people into A vs B, then test orders started vs orders completed
 Wrong because different orders by the same person are correlated with each other
 Fix: test whether users completed an order
Tip: Use rand()
 Can you assign A vs B based on $user_id % 2?
 Yes, but assigning based on rand() is better
 Easier to write QA hook to force A vs B
 Can run any number of parallel A/B tests
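A minimal sketch of rand()-based assignment (shown in Python; the helper name and the QA override parameter are illustrative):

```python
import random

def assign_group(qa_override=None, weights=None):
    """Pick a test group by random draw rather than $user_id % 2.

    qa_override lets a QA hook force "A" or "B"; weights maps
    group -> traffic share and defaults to a 50/50 split.  Each call
    is an independent draw, so any number of parallel tests can use
    this without their group assignments correlating.
    """
    if qa_override is not None:
        return qa_override
    weights = weights or {"A": 0.5, "B": 0.5}
    r = random.random()
    cumulative = 0.0
    for group, share in weights.items():
        cumulative += share
        if r < cumulative:
            return group
    return group  # guard against floating-point rounding at r near 1.0
```

The weights argument also covers the unequal-traffic splits discussed above, e.g. {"A": 0.1, "B": 0.1, "Untested": 0.8}.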
A/B/C... testing the wrong way
 Statistics books say chi-square works for more than 2x2
 With $n versions and $m possible answers, you can set up the table like we did...
 do the same calculations (with more variables)...
 and $p = chisqrprob(($n-1)*($m-1), $chi_square)
Why is this wrong?
 The statistics books are right...
 but you can't interpret your answer!
 You know that there is a difference, but not what it is
Extreme example
 Suppose we have 10 million versions
 Half convert at 51%, half at 49%
 At 21 samples each you know they are different
 But you can't tell the 51% versions from the 49% versions!
A/B/C... testing the right way
 With $n versions there are ($n choose 2) = $n*($n-1)/2 ways to get an unlikely result
 So compare the best with the worst...
 and multiply $p by ($n choose 2)
 This overestimates $p (but not by much when $p is small)
 Remove one version at a time
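The best-vs-worst correction can be sketched like this (Python; function names are illustrative, and the 2x2 p-value is the same 1-degree-of-freedom calculation as before):

```python
import math

def pairwise_p(yes1, no1, yes2, no2):
    """p-value for one 2x2 chi-square comparison (1 degree of freedom)."""
    row1, row2 = yes1 + no1, yes2 + no2
    yes, no = yes1 + yes2, no1 + no2
    total = row1 + row2
    chi_square = sum((m - r * c / total) ** 2 / (r * c / total)
                     for m, r, c in ((yes1, row1, yes), (no1, row1, no),
                                     (yes2, row2, yes), (no2, row2, no)))
    return math.erfc(math.sqrt(chi_square / 2))

def best_vs_worst(counts):
    """counts: {version: (yes, no)}.  Compare the best and worst
    converters, then multiply $p by ($n choose 2) because we picked
    the most extreme of n*(n-1)/2 possible pairs."""
    rates = {v: y / (y + n) for v, (y, n) in counts.items()}
    best = max(rates, key=rates.get)
    worst = min(rates, key=rates.get)
    n = len(counts)
    p = pairwise_p(*counts[best], *counts[worst]) * n * (n - 1) / 2
    return best, worst, min(1.0, p)
```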
Questions?
(or we can continue on for advanced material)
What is the chi-square distribution?
 Suppose that X is a standard normal
 Then X**2 has a 1-dimensional chi-square distribution
 The sum of n independent 1-dim chi-squares has an n-dimensional chi-square distribution
 Pearson proved that we should use the 1-dimensional version (that is the 1 we passed to chisqrprob)
Comparison with G-test
 Instead of the basic chi-square term use:
2 * $measured * log( $measured / $expected )
 Otherwise exactly like chi-square, even the same distribution! (see the example program)
 More accurate than chi-square for small sample sizes
 Chi-square is far more widely known
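The G statistic can be sketched like so (Python; measured and expected are the four table cells in matching order):

```python
import math

def g_statistic(measured, expected):
    """G-test statistic: 2 * sum of m * ln(m/e) over the table cells.
    It is compared against the same chi-square distribution."""
    return 2 * sum(m * math.log(m / e) for m, e in zip(measured, expected))
```

On the 2x2 example above (measured 100/100/120/80, expected 110/90/110/90) it comes out near 4.05, very close to the chi-square value of 4.04.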
Testing nonyes/no questions
 The following few slides give a way to directly test questions like "Which makes more money per person?"
 I developed it and believe it works
 Be warned: while I have a strong math background, I am not a statistician
 Please see the example program
Some basic terms
 Suppose X is a random variable
 E(X), called the expected value, is what you expect the arithmetic average of many samples to be
 Var(X), called the variance, is a measure of variability. Officially defined as
Var(X) = E((X - E(X))**2)
 The square root of the variance is the standard deviation
Basic properties
 Let X and Y be independent random variables, and k be a constant
 Then the following hold
 E(X+Y) = E(X) + E(Y), and Var(X+Y) = Var(X) + Var(Y)
 E(k+X) = k + E(X), and Var(k+X) = Var(X)
 E(k*X) = k * E(X), and Var(k*X) = k**2 * Var(X)
Estimating E(X) and Var(X)
 Suppose x_1, x_2, ..., x_n are independent observations of a random variable X
 Then the arithmetic average is an estimate of the expected value:
E(X) ≈ (x_1 + x_2 + ... + x_n)/n
 If m is the arithmetic average:
Var(X) ≈ (Σ_{i=1}^{n} (x_i - m)**2)/(n - 1)
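The two estimates above, as a small Python helper (illustrative name):

```python
def mean_and_variance(xs):
    """Estimate E(X) and Var(X) from independent observations, using
    the arithmetic average and the (n - 1) denominator shown above."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m, var
```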
Central Limit Theorem
 Suppose x_1, x_2, ..., x_n are independent observations of a random variable X
 Then x_1 + x_2 + ... + x_n has approximately a normal distribution with expected value n*E(X) and variance n*Var(X)
 This is one of the most important math theorems of the 1800s
 It underlies most of statistics
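A quick, purely illustrative simulation of that claim, using uniform(0,1) draws (E(X) = 0.5, Var(X) = 1/12):

```python
import random

def clt_demo(n=30, trials=2000, seed=1):
    """Sum n uniform(0,1) draws, many times over.  The sums should
    average about n*E(X) = n/2 with variance about n*Var(X) = n/12."""
    rng = random.Random(seed)
    sums = [sum(rng.random() for _ in range(n)) for _ in range(trials)]
    mean = sum(sums) / trials
    var = sum((s - mean) ** 2 for s in sums) / (trials - 1)
    return mean, var
```

With n = 30 the mean of the sums comes out near 15 and their variance near 2.5.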
A/B Testing Setup
 Divide people into A and B
 Track them, figure out how well each performed (e.g. revenue per person)
 Create arrays @a and @b of their performances. Don't forget to include the 0's!
 That is our raw data
Variance calculation
 Estimate the variance of (@a, @b) and assign that to $var
 Make @c be (@a, @b) minus the largest value. Estimate its variance in $var_c
 Let $w = $var/$var_c - 1. This indicates how much of the overall variation is caused by the largest outlier.
 If $w*(@a+@b)/@a < 0.1 and $w*(@a+@b)/@b < 0.1 then continue (remember that @a in scalar context is the size of the array)
 Otherwise we can continue, but shouldn't trust the results
The difference of the averages is..?
 Let $m_a = sum(@a)/@a and $m_b = sum(@b)/@b
 $m_a is approximately normally distributed with variance $var/@a
 $m_b is approximately normally distributed with variance $var/@b
 ($m_a - $m_b) is approximately normally distributed with variance ($var/@a + $var/@b)
We can test that!
 If A and B are the same, then this calculates a p-value:
use Statistics::Distributions qw(uprob);
# time passes, the previous calculations happen...
my $m_var = $var/@a + $var/@b;
# The 2 is because we're doing a 2-tailed test
my $p = 2 * uprob( abs($m_a - $m_b) / sqrt($m_var) );
 Conversely, Statistics::Distributions::udistr can calculate confidence intervals
 See the example program
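The same calculation sketched in Python (illustrative; 2 * uprob(z) is replaced by the equivalent erfc form for twice the upper normal tail):

```python
import math

def revenue_test(a, b):
    """Two-tailed test for a difference in average performance.
    a and b are the per-user performance lists (zeros included)."""
    data = a + b
    n = len(data)
    grand_mean = sum(data) / n
    # pooled variance estimate with the (n - 1) denominator
    var = sum((x - grand_mean) ** 2 for x in data) / (n - 1)
    m_a = sum(a) / len(a)
    m_b = sum(b) / len(b)
    m_var = var / len(a) + var / len(b)
    z = abs(m_a - m_b) / math.sqrt(m_var)
    # 2 * uprob(z): erfc(z / sqrt(2)) is twice the upper normal tail
    p = math.erfc(z / math.sqrt(2))
    return m_a - m_b, p
```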
Final technical note
 Purists will say we should estimate the variance of @a and @b separately
 Theoretically that's right
 But that is error-prone if you have large outliers
 In revenue data, large outliers are common
 Combining them gives reliable answers sooner