A/B Testing
for Fun and Profit
Ben Tilly
Pictage
Sample Programs
This is for after the presentation
What is A/B Testing?
- Develop two versions of a page
- Randomly show different versions to users
- Track how users perform
- Evaluate (that's the tricky part)
- Use the better version
Why A/B test?
- You don't know what drives user behaviour
- They can't tell you
- Subtle changes make a big difference
- How about +40%?
What can you A/B test?
- Removing form fields
- Adding relevant form fields
- Adding/removing explanations
- Adding/removing interstitial pages
- Anything
A/B tests do not substitute for
- Talking to users
- Usability tests
- Thinking
What is chi-square?
- A method for evaluating A/B tests
- Only answers yes/no questions (but you pick the question)
- Only handles 2 versions (there is a workaround)
- Requires independence in samples
- Does not do confidence intervals
- There are alternatives
What to measure
- Start your A/B test
- Divide your users into groups A and B
- Decide whether each user did what you want
- Reduce your results to 4 numbers ($a_yes, $a_no, $b_yes, $b_no)
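The reduction to 4 numbers might look like the sketch below. The record layout (a `group` and a `converted` field per user) is a hypothetical one chosen for illustration, not something the slides specify.

```perl
use strict;
use warnings;

# Hypothetical per-user records: which group the user was in, and
# whether the user did what we wanted (1) or not (0).
my @users = (
    { group => 'A', converted => 1 },
    { group => 'A', converted => 0 },
    { group => 'B', converted => 1 },
    { group => 'B', converted => 1 },
    { group => 'B', converted => 0 },
);

# Tally exactly one yes/no answer per user.
my %count = (A => { yes => 0, no => 0 }, B => { yes => 0, no => 0 });
for my $user (@users) {
    my $answer = $user->{converted} ? 'yes' : 'no';
    $count{ $user->{group} }{$answer}++;
}

my ($a_yes, $a_no) = @{ $count{A} }{qw(yes no)};
my ($b_yes, $b_no) = @{ $count{B} }{qw(yes no)};
print "A: $a_yes yes, $a_no no; B: $b_yes yes, $b_no no\n";
```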
Arrange your measurements
|   | Yes | No | Total |
|---|---|---|---|
| A | $a_yes | $a_no | $a |
| B | $b_yes | $b_no | $b |
| Total | $yes | $no | $total |

- $a_yes = # in A who are yes
- $a_no = # in A who are no
- $b_yes = # in B who are yes
- $b_no = # in B who are no
Scary Math Part 1 - Addition
|   | Yes | No | Total |
|---|---|---|---|
| A | $a_yes | $a_no | $a |
| B | $b_yes | $b_no | $b |
| Total | $yes | $no | $total |

- $a = $a_yes + $a_no
- $b = $b_yes + $b_no
- $yes = $a_yes + $b_yes
- $no = $a_no + $b_no
- $total = $a + $b (or $yes + $no)
Scary Math Part 2 - Expectations
|   | Yes | No | Total |
|---|---|---|---|
| A | $e_a_yes | $e_a_no | $a |
| B | $e_b_yes | $e_b_no | $b |
| Total | $yes | $no | $total |

- $e_a_yes = $a * $yes / $total
- $e_a_no = $a * $no / $total
- $e_b_yes = $b * $yes / $total
- $e_b_no = $b * $no / $total
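As a worked example of the totals and expectations, here is a small sketch with made-up counts. The row totals are named `$a_total` and `$b_total` rather than `$a` and `$b`, because Perl uses `$a` and `$b` as its special sort variables; that renaming is a choice of this sketch.

```perl
use strict;
use warnings;

# Made-up counts for illustration only.
my ($a_yes, $a_no, $b_yes, $b_no) = (100, 900, 130, 870);

# Row, column, and grand totals.
my $a_total = $a_yes + $a_no;
my $b_total = $b_yes + $b_no;
my $yes     = $a_yes + $b_yes;
my $no      = $a_no + $b_no;
my $total   = $a_total + $b_total;

# Expected counts if A and B really perform the same: each group gets
# its share of the overall yes and no answers.
my $e_a_yes = $a_total * $yes / $total;
my $e_a_no  = $a_total * $no / $total;
my $e_b_yes = $b_total * $yes / $total;
my $e_b_no  = $b_total * $no / $total;

printf "expected: A yes %.1f, A no %.1f, B yes %.1f, B no %.1f\n",
    $e_a_yes, $e_a_no, $e_b_yes, $e_b_no;
```

With these counts both groups have 1,000 users and 230 of 2,000 said yes overall, so each group is expected to have 115 yes and 885 no.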
Scary Math Part 3 - Chi-square
- A basic chi-square term looks like this:
  ($measured - $expected)**2 / $expected
- It was invented by Karl Pearson in 1900
- Statisticians know its distribution very well
- We will treat it as magic
Scary Math Part 4 - Calculation
We have 4 measurements and 4 expectations, so we have 4 chi-square
terms. We add them:
    my $chi_square =
        ($a_yes - $e_a_yes)**2 / $e_a_yes
      + ($a_no  - $e_a_no )**2 / $e_a_no
      + ($b_yes - $e_b_yes)**2 / $e_b_yes
      + ($b_no  - $e_b_no )**2 / $e_b_no;
Scary Math Part 5: Interpretation
    use Statistics::Distributions qw(chisqrprob);
    my $p = chisqrprob(1, $chi_square);
- If the samples are all independent...
- and the expected counts are all at least 10...
- and the real performance of A and B is the same...
- then $p ≈ the probability that chance alone would produce a chi-square larger than $chi_square
- If $p is "small", conclude that the third assumption (A and B perform the same) is likely wrong
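Putting the calculation and the interpretation together, a minimal end-to-end sketch could look like this. To stay free of CPAN dependencies it swaps in a different route to the p-value: for 1 degree of freedom the upper tail of chi-square equals `erfc(sqrt($chi_square / 2))`, which should agree with `chisqrprob(1, $chi_square)`. The sub names and the counts are made up for illustration.

```perl
use strict;
use warnings;
use POSIX qw(erfc);    # erfc is provided by POSIX in Perl 5.22 and later

# Pearson chi-square statistic for a 2x2 table of yes/no counts.
sub chi_square_2x2 {
    my ($a_yes, $a_no, $b_yes, $b_no) = @_;
    my $a_total = $a_yes + $a_no;
    my $b_total = $b_yes + $b_no;
    my $yes     = $a_yes + $b_yes;
    my $no      = $a_no + $b_no;
    my $total   = $a_total + $b_total;

    my $chi_square = 0;
    for my $cell (
        [ $a_yes, $a_total * $yes / $total ],
        [ $a_no,  $a_total * $no / $total ],
        [ $b_yes, $b_total * $yes / $total ],
        [ $b_no,  $b_total * $no / $total ],
    ) {
        my ($measured, $expected) = @$cell;
        $chi_square += ($measured - $expected)**2 / $expected;
    }
    return $chi_square;
}

# Upper-tail probability of a chi-square value with 1 degree of freedom.
sub p_value_1df {
    my ($chi_square) = @_;
    return erfc(sqrt($chi_square / 2));
}

my $chi_square = chi_square_2x2(100, 900, 130, 870);    # made-up counts
my $p          = p_value_1df($chi_square);
printf "chi-square = %.3f, p = %.4f\n", $chi_square, $p;
```

For these counts the chi-square comes out near 4.42 and $p near 0.035, which would pass a 95% threshold but not a 99% one.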
How Small Is "Small"?
- Increased certainty needs a larger sample
- You'd have to be unlucky to be wrong when you decide...
- But you get multiple chances to be unlucky
- I push for 99% confidence (p < 0.01)
Recap of A/B setup
- Develop two versions of a page
- Randomly divide users into two groups
- Show each group a different version
- Track how those users perform
Recap of chi-square evaluation
- Select a yes/no question about users
- Divide users in A and B into yes/no
- Perform complicated chi-square calculation
- Make decision if $p is small and sample is large enough
A/B test simulation
- I ran several simulations of 100,000 parallel A/B tests...
- with known conversion rates and confidence levels.
- Found odds of a wrong conclusion...
- ...and how long until 25%, 50%, etc of tests finished
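The slides do not say how the simulations were implemented, so the following is only a guess at the general shape, scaled down to run quickly. Every parameter here (number of tests, batch size, cutoff for giving up, checking after every batch, equal group sizes) is an assumption of this sketch. The value 6.635 is the standard chi-square cutoff for p < 0.01 with 1 degree of freedom.

```perl
use strict;
use warnings;

# All parameters are illustrative assumptions, not the slides' setup.
my ($rate_a, $rate_b) = (0.50, 0.55);    # true conversion rates
my $critical    = 6.635;    # chi-square cutoff for p < 0.01, 1 degree of freedom
my $trials      = 200;      # number of simulated A/B tests
my $batch       = 10;       # users added to each group between checks
my $max_per_arm = 5000;     # give up on a test after this many users per group

# Chi-square for a 2x2 table, or 0 while any expected count is below 10.
sub chi_square_2x2 {
    my ($a_yes, $a_no, $b_yes, $b_no) = @_;
    my ($a_n, $b_n) = ($a_yes + $a_no, $b_yes + $b_no);
    my ($yes, $no)  = ($a_yes + $b_yes, $a_no + $b_no);
    my $total = $a_n + $b_n;
    my $sum   = 0;
    for my $cell ([ $a_yes, $a_n * $yes ], [ $a_no, $a_n * $no ],
                  [ $b_yes, $b_n * $yes ], [ $b_no, $b_n * $no ]) {
        my $expected = $cell->[1] / $total;
        return 0 if $expected < 10;    # too little data to judge yet
        $sum += ($cell->[0] - $expected)**2 / $expected;
    }
    return $sum;
}

srand(42);    # fixed seed so that reruns are repeatable
my ($finished, $wrong) = (0, 0);
my @sizes;
for my $trial (1 .. $trials) {
    my ($a_yes, $b_yes, $n) = (0, 0, 0);
    while ($n < $max_per_arm) {
        for (1 .. $batch) {
            $a_yes++ if rand() < $rate_a;
            $b_yes++ if rand() < $rate_b;
        }
        $n += $batch;
        next if chi_square_2x2($a_yes, $n - $a_yes, $b_yes, $n - $b_yes) <= $critical;
        $finished++;
        $wrong++ if $a_yes > $b_yes;    # B is truly better in this setup
        push @sizes, 2 * $n;            # total users across both groups
        last;
    }
}

@sizes = sort { $a <=> $b } @sizes;
my $median = @sizes ? $sizes[ $#sizes / 2 ] : 'n/a';
print "finished $finished of $trials, wrong $wrong, median total sample $median\n";
```

Checking after every batch is what gives a test "multiple chances to be unlucky", so the observed error rate need not match the nominal confidence level.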
Best Case Example
A's true conversion: 50%, B's true conversion: 55%
Error Rate and Final Sample Size (columns 25% to 90%: sample size by which that share of tests had finished)

| confidence | p(wrong) | 25% | 50% | 75% | 90% |
|---|---|---|---|---|---|
| 95% | 4.6% | 146 | 570 | 1,500 | 2,750 |
| 98% | 1.9% | 383 | 1,170 | 2,430 | 3,920 |
| 99% | 0.9% | 670 | 1,680 | 3,120 | 4,770 |
| 99.5% | 0.4% | 1,020 | 2,200 | 3,770 | 5,500 |
Low Conversion Example
A's true conversion: 10%, B's true conversion: 11%
Error Rate and Final Sample Size (columns 25% to 90%: sample size by which that share of tests had finished)

| confidence | p(wrong) | 25% | 50% | 75% | 90% |
|---|---|---|---|---|---|
| 95% | 7.2% | 790 | 4,150 | 12,700 | 24,500 |
| 98% | 3.3% | 2,740 | 9,980 | 21,700 | 35,900 |
| 99% | 1.5% | 5,620 | 15,100 | 28,500 | 45,500 |
| 99.5% | 0.8% | 8,900 | 20,000 | 34,900 | 51,500 |
Low Lift Example
A's true conversion: 50%, B's true conversion: 51%
Error Rate and Final Sample Size (columns 25% to 90%: sample size by which that share of tests had finished)

| confidence | p(wrong) | 25% | 50% | 75% | 90% |
|---|---|---|---|---|---|
| 95% | 16.2% | 257 | 3,680 | 23,700 | 55,100 |
| 98% | 8.0% | 2,300 | 20,000 | 50,600 | 90,800 |
| 99% | 4.5% | 9,540 | 34,300 | 72,000 | 116,000 |
| 99.5% | 2.4% | 18,400 | 50,000 | 90,100 | 132,000 |
A/B test Scaling Principles
- You can't really predict how long a test will take
- You should be prepared for tens of thousands
- 1/5 the conversion rate takes about 5x as long
- 1/5 the lift takes about 25x as long (with wide variation)
A/B test scaling tips
- Test where you have volume
- Test a high converting question
- (eg "Did they register?" instead of "Did they buy?")
- Be willing to stop a test that is going nowhere
More A/B testing tips
- Give QA a way to force A versus B
- Test multiple questions in parallel
- Automate reporting
- Provide interactive A/B calculator
- Standardize testing metrics
Compare apples to apples
- Make sure there are no differences between A and B
- Traffic behaves differently at different times
- Friday night ≠ Monday morning
- First week in month ≠ last week in month
- Do not try to compare people from different times
A/B ratio need not be 50/50
- A and B can receive unequal traffic
- But do not change the mix they get
- Wrong: change (90/10) A vs B to (80/20)
- Right: change (10/10/80) A vs B vs Untested to (20/20/60)
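To see why changing the mix matters, consider this small arithmetic sketch. It assumes two made-up periods with equal traffic in which A and B convert identically, but the second period converts better overall. The numbers and the `blended_rate` helper are hypothetical.

```perl
use strict;
use warnings;

# Overall conversion rate for one arm across several periods.
sub blended_rate {
    my ($arm, @periods) = @_;
    my ($users, $yes) = (0, 0);
    for my $period (@periods) {
        my $n = $period->{traffic} * $period->{share}{$arm};
        $users += $n;
        $yes   += $n * $period->{rate};    # same rate for A and B
    }
    return $yes / $users;
}

# Wrong: the A/B mix changes from 90/10 to 80/20 between periods.
my @wrong = (
    { traffic => 1000, rate => 0.10, share => { A => 0.9, B => 0.1 } },
    { traffic => 1000, rate => 0.20, share => { A => 0.8, B => 0.2 } },
);
# Right: A and B grow together, from 10/10/80 to 20/20/60.
my @right = (
    { traffic => 1000, rate => 0.10, share => { A => 0.1, B => 0.1 } },
    { traffic => 1000, rate => 0.20, share => { A => 0.2, B => 0.2 } },
);

for my $plan ([ wrong => \@wrong ], [ right => \@right ]) {
    my ($name, $periods) = @$plan;
    printf "%-5s A %.1f%%  B %.1f%%\n", $name,
        100 * blended_rate('A', @$periods),
        100 * blended_rate('B', @$periods);
}
```

Under the wrong plan A appears to convert at about 14.7% and B at about 16.7%, even though the two versions are identical, because B's users come disproportionately from the better period. Under the right plan both come out the same.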
Beware of hidden correlations
- Correlations increase variability, and therefore $chi_square
- ...even if you don't see them
- Mistake: Divide people into A vs B, then compare orders started to orders completed
- Wrong because different orders by the same person are correlated with each other
- Fix: Test whether users completed an order
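The fix amounts to collapsing the order log to one yes/no answer per user before counting. A minimal sketch, assuming a hypothetical order log with `user_id`, `group`, and `completed` fields:

```perl
use strict;
use warnings;

# Hypothetical order log: one row per order started.
my @orders = (
    { user_id => 1, group => 'A', completed => 0 },
    { user_id => 1, group => 'A', completed => 1 },
    { user_id => 2, group => 'A', completed => 0 },
    { user_id => 3, group => 'B', completed => 1 },
    { user_id => 3, group => 'B', completed => 1 },
    { user_id => 4, group => 'B', completed => 0 },
);

# Collapse to one answer per user: did this user complete any order?
my (%group_of, %completed_any);
for my $order (@orders) {
    my $id = $order->{user_id};
    $group_of{$id} = $order->{group};
    $completed_any{$id} ||= $order->{completed};
}

# Count users, not orders.
my %count = (A => { yes => 0, no => 0 }, B => { yes => 0, no => 0 });
for my $id (keys %group_of) {
    my $answer = $completed_any{$id} ? 'yes' : 'no';
    $count{ $group_of{$id} }{$answer}++;
}
printf "A: %d yes, %d no; B: %d yes, %d no\n",
    @{ $count{A} }{qw(yes no)}, @{ $count{B} }{qw(yes no)};
```

Six orders become four independent samples, one per user, which is what the chi-square test needs.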
Tip: Use rand()
- Can you assign A vs B based on $user_id % 2?
- Yes, but assigning based on rand() is better
- Easier to write QA hook to force A vs B
- Can run any number of parallel A/B tests
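One possible shape for rand()-based assignment with a QA hook is sketched below. The `version_for` sub and the in-memory `%assignment` hash are hypothetical; in practice the assignment would have to be stored somewhere persistent so that a user keeps seeing the same version.

```perl
use strict;
use warnings;

# Stand-in for persistent storage (database row, cookie, ...).
# This hash is an assumption of the sketch.
my %assignment;

sub version_for {
    my ($user_id, $test_name, $forced) = @_;

    # QA hook: a forced version wins and is not recorded.
    return $forced if defined $forced;

    # Draw once per user and per test, then remember the answer.
    # Separate draws per test keep parallel tests independent.
    return $assignment{$test_name}{$user_id} //= rand() < 0.5 ? 'A' : 'B';
}

print "user 42, signup test: ",   version_for(42, 'signup_form'),   "\n";
print "user 42, checkout test: ", version_for(42, 'checkout_page'), "\n";
print "QA forcing B: ",           version_for(42, 'signup_form', 'B'), "\n";
```

With `$user_id % 2`, the same users would land in A for every test, so parallel tests would be confounded with each other.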
A/B/C... testing the wrong way
- Statistics books say chi-square works for more than 2x2
- With $n versions and $m possible answers, you can set up the table like we did...
- do the same calculations (with more variables)...
- and $p = chisqrprob(($n-1)*($m-1), $chi_square)
Why is this wrong?
- The statistics books are right...
- but you can't interpret your answer!
- You know that there is a difference, but not what it is
Extreme example
- Suppose we have 10 million versions
- Half convert at 51%, half at 49%
- At 21 samples each you know they are different
- But you can't tell the 51% versions from the 49% versions!
A/B/C... testing the right way
- With $n versions there are ($n choose 2) = $n*($n-1)/2 ways to get an unlikely result
- So compare the best with the worst...
- and multiply $p by ($n choose 2)
- This overestimates $p (but not by much when $p is small)
- Remove one version at a time
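A single round of the best-versus-worst comparison could be sketched as follows. The counts are made up, the p-value uses the `erfc` identity for 1 degree of freedom in place of `chisqrprob`, and capping the adjusted value at 1 is an addition of this sketch.

```perl
use strict;
use warnings;
use POSIX qw(erfc);         # available from POSIX in Perl 5.22 and later
use List::Util qw(min);

sub chi_square_2x2 {
    my ($a_yes, $a_no, $b_yes, $b_no) = @_;
    my ($a_n, $b_n) = ($a_yes + $a_no, $b_yes + $b_no);
    my ($yes, $no)  = ($a_yes + $b_yes, $a_no + $b_no);
    my $total = $a_n + $b_n;
    my $sum   = 0;
    for my $cell ([ $a_yes, $a_n * $yes ], [ $a_no, $a_n * $no ],
                  [ $b_yes, $b_n * $yes ], [ $b_no, $b_n * $no ]) {
        my $expected = $cell->[1] / $total;
        $sum += ($cell->[0] - $expected)**2 / $expected;
    }
    return $sum;
}

# Conversion rate of a [yes, no] pair.
sub rate {
    my ($counts) = @_;
    return $counts->[0] / ($counts->[0] + $counts->[1]);
}

# Made-up counts: [yes, no] for each version.
my %versions = (A => [ 100, 900 ], B => [ 130, 870 ], C => [ 115, 885 ]);

# Order the version names by conversion rate, worst first.
my @by_rate = sort { rate($versions{$a}) <=> rate($versions{$b}) } keys %versions;
my ($worst, $best) = @by_rate[ 0, -1 ];

my $n     = keys %versions;          # number of versions
my $pairs = $n * ($n - 1) / 2;       # ($n choose 2)
my $chi   = chi_square_2x2(@{ $versions{$best} }, @{ $versions{$worst} });
my $p     = erfc(sqrt($chi / 2));    # 1 degree of freedom
my $adjusted_p = min(1, $pairs * $p);

printf "best %s vs worst %s: p = %.4f, adjusted p = %.4f\n",
    $best, $worst, $p, $adjusted_p;
```

If the adjusted value is small enough, the worst version can be dropped and the comparison repeated on the remaining versions.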
Questions?
(or we can continue on for advanced material)
What is the chi-square distribution?
- Suppose that X is a standard normal
- Then X**2 has a 1-dimensional chi-square distribution
- The sum of n independent 1-dim chi-squares has an n-dimensional chi-square distribution
- Pearson proved that we should use the 1-dimensional version (that is the 1 we passed to chisqrprob)
Comparison with G-test
- Instead of the basic chi-square term use:
  2 * $measured * log( $measured / $expected )
- Otherwise exactly like chi-square - even same distribution! (See example)
- More accurate than chi-square for small sample sizes
- Chi-square is far more widely known
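To compare the two statistics side by side, here is a small sketch that sums either term over the four cells of a 2x2 table. The helper name and the counts are made up, and treating a cell with a measured count of 0 as contributing 0 to the G statistic is a convention assumed here.

```perl
use strict;
use warnings;

# Sum a per-cell term (a code reference taking measured and expected)
# over the four cells of a 2x2 table.
sub sum_over_cells {
    my ($term, $a_yes, $a_no, $b_yes, $b_no) = @_;
    my ($a_n, $b_n) = ($a_yes + $a_no, $b_yes + $b_no);
    my ($yes, $no)  = ($a_yes + $b_yes, $a_no + $b_no);
    my $total = $a_n + $b_n;
    my $sum   = 0;
    for my $cell ([ $a_yes, $a_n * $yes ], [ $a_no, $a_n * $no ],
                  [ $b_yes, $b_n * $yes ], [ $b_no, $b_n * $no ]) {
        $sum += $term->($cell->[0], $cell->[1] / $total);
    }
    return $sum;
}

my $chi_term = sub {
    my ($measured, $expected) = @_;
    return ($measured - $expected)**2 / $expected;
};
my $g_term = sub {
    my ($measured, $expected) = @_;
    return 0 if $measured == 0;    # avoid log(0); the term is taken as 0
    return 2 * $measured * log($measured / $expected);
};

my @counts     = (100, 900, 130, 870);    # made-up counts
my $chi_square = sum_over_cells($chi_term, @counts);
my $g          = sum_over_cells($g_term,   @counts);
printf "chi-square = %.3f, G = %.3f\n", $chi_square, $g;
```

For a sample of this size the two values are close (about 4.42 and 4.43), so the choice mostly matters when counts are small.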
Testing non-yes/no questions
- The following few slides give a way to directly test questions like "Which makes more money/person?"
- I developed it and believe it works
- Be warned: while I have a strong math background, I am not a statistician
- Please see the example program
Some basic terms
- Suppose X is a random variable
- E(X), called the expected value, is what you expect the arithmetic average of many samples to be
- Var(X), called the variance, is a measure of variability. Officially defined as
  Var(X) = E((X - E(X))**2)
- The square root of the variance is the standard deviation
Basic properties
- Let X and Y be independent random variables, and k be a constant
- Then the following hold
- E(X+Y) = E(X) + E(Y), and Var(X+Y) = Var(X) + Var(Y)
- E(k+X) = k + E(X), and Var(k+X) = Var(X)
- E(k*X) = k * E(X), and Var(k*X) = k**2 * Var(X)
Estimating E(X) and Var(X)
- Suppose x_1, x_2, ..., x_n are independent observations of a random variable X
- Then the arithmetic average is an estimate of the expected value
  E(X) ≈ (x_1 + x_2 + ... + x_n)/n
- If m is the arithmetic average
  Var(X) ≈ (Σ_{i=1}^{n} (x_i - m)**2)/(n - 1)
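Both estimates are short to write in Perl. A minimal sketch, with sub names and sample data chosen here for illustration:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Arithmetic average: the estimate of E(X).
sub mean {
    my @x = @_;
    return sum(@x) / scalar(@x);
}

# Estimate of Var(X): squared deviations from the average, divided by n - 1.
sub variance {
    my @x = @_;
    my $m = mean(@x);
    return sum(map { ($_ - $m)**2 } @x) / (scalar(@x) - 1);
}

my @x = (2, 4, 4, 4, 5, 5, 7, 9);    # made-up observations
printf "E(X) ~ %.3f, Var(X) ~ %.3f, standard deviation ~ %.3f\n",
    mean(@x), variance(@x), sqrt(variance(@x));
```

For this data the average is 5 and the squared deviations sum to 32, so the variance estimate is 32/7.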
Central Limit Theorem
- Suppose x_1, x_2, ..., x_n are independent observations of a random variable X
- Then x_1 + x_2 + ... + x_n has approximately a normal distribution with expected value n*E(X) and variance n*Var(X)
- This is one of the most important math theorems of the 1800s
- It underlies most of statistics
A/B Testing Setup
- Divide people into A and B
- Track them, figure how well each performed (eg revenue per person)
- Create arrays @a and @b of their performances. Don't forget to include the 0's!
- That is our raw data
Variance calculation
- Estimate the variance of (@a, @b) and assign that to $var
- Make @c be (@a, @b) minus the largest value. Estimate its variance in $var_c
- Let $w = $var/$var_c - 1. This indicates how much of the overall variation is caused by the largest outlier.
- If $w*(@a+@b)/@a < 0.1 and $w*(@a+@b)/@b < 0.1 then continue (remember that @a in scalar context is the size of the array)
- Otherwise we can continue, but shouldn't trust the results
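The outlier check can be sketched roughly as below. The `outlier_check` sub and its return convention are hypothetical, and the sketch assumes the data still varies after the largest value is removed (otherwise $var_c would be 0).

```perl
use strict;
use warnings;
use List::Util qw(sum max first);

sub variance {
    my @x = @_;
    my $m = sum(@x) / scalar(@x);
    return sum(map { ($_ - $m)**2 } @x) / (scalar(@x) - 1);
}

# Returns ($w, $trustworthy) for two array references of performances.
sub outlier_check {
    my ($group_a, $group_b) = @_;
    my ($n_a, $n_b) = (scalar @$group_a, scalar @$group_b);
    my @all = (@$group_a, @$group_b);
    my $var = variance(@all);

    # @c is the combined data minus one copy of the largest value.
    my $largest = max(@all);
    my $index   = first { $all[$_] == $largest } 0 .. $#all;
    my @c = @all;
    splice @c, $index, 1;
    my $var_c = variance(@c);    # assumed to be non-zero

    my $w = $var / $var_c - 1;
    my $trustworthy =
           $w * ($n_a + $n_b) / $n_a < 0.1
        && $w * ($n_a + $n_b) / $n_b < 0.1;
    return ($w, $trustworthy ? 1 : 0);
}

# Made-up revenue per person, including the 0's. One huge value dominates.
my @group_a = (0, 0, 0, 10, 20);
my @group_b = (0, 0, 5, 15, 200);
my ($w, $trustworthy) = outlier_check(\@group_a, \@group_b);
printf "w = %.2f, %s\n", $w, $trustworthy ? "ok to continue" : "do not trust the results";
```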
The difference of the averages is..?
- Let $m_a = sum(@a)/@a and $m_b = sum(@b)/@b.
- $m_a is approximately normally distributed with variance $var/@a
- $m_b is approximately normally distributed with variance $var/@b
- ($m_a - $m_b) is approximately normally distributed with variance ($var/@a + $var/@b)
We can test that!
- If A and B are the same, then this calculates a p-value:
    use Statistics::Distributions qw(uprob);
    # time passes, the previous calculations happen..
    my $m_var = $var/@a + $var/@b;
    # The 2 is because we're doing a 2-tailed test
    my $p = 2 * uprob( abs($m_a - $m_b) / sqrt($m_var) );
- Conversely, Statistics::Distributions::udistr can calculate confidence intervals
- See example
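For readers without the slides' example program at hand, the whole test can be sketched in one self-contained piece. This is not that program. It avoids the CPAN dependency by using the identity 2 * uprob(z) = erfc(z / sqrt(2)) for a standard normal, and the sub names and data are made up.

```perl
use strict;
use warnings;
use POSIX qw(erfc);        # available from POSIX in Perl 5.22 and later
use List::Util qw(sum);

sub mean {
    my @x = @_;
    return sum(@x) / scalar(@x);
}

sub variance {
    my @x = @_;
    my $m = mean(@x);
    return sum(map { ($_ - $m)**2 } @x) / (scalar(@x) - 1);
}

# Two-tailed p-value for the difference of the group averages, using one
# variance estimate taken over both groups combined.
sub difference_p_value {
    my ($group_a, $group_b) = @_;
    my ($n_a, $n_b) = (scalar @$group_a, scalar @$group_b);
    my $var   = variance(@$group_a, @$group_b);
    my $m_var = $var / $n_a + $var / $n_b;
    my $z     = abs(mean(@$group_a) - mean(@$group_b)) / sqrt($m_var);
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return erfc($z / sqrt(2));
}

# Made-up revenue per person, including the 0's.
my @group_a = (0, 0, 0, 12, 0, 30, 0, 0, 8, 0);
my @group_b = (0, 15, 0, 0, 25, 0, 40, 0, 0, 10);
printf "p = %.3f\n", difference_p_value(\@group_a, \@group_b);
```

Samples this small are only for showing the mechanics; the normal approximation relies on having many observations.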
Final technical note
- Purists will say we should estimate the variance of @a and @b separately
- Theoretically that's right
- But is error-prone if you have large outliers
- In revenue data, large outliers are common
- Combining them gives reliable answers sooner