# PyData London 2014 Martin Goodson - Most A/B Testing Results are Illusory

Published on March 5, 2014

Author: PyData

Source: slideshare.net

## Description

Most A/B Testing Results are Illusory. Martin Goodson, Skimlinks

These are my opinions, not those of my employer!

What’s an A/B test? Example: free delivery. A: control. B: variant.

‘How can you talk for 40 minutes about A/B testing?’

A/B tests are very easy to get wrong

What my experience is based on

What this talk is about: three statistical concepts; errors and their consequences. These errors are exactly how A/B testing software works.

What this talk is about: statistical power, multiple testing, and regression to the mean.

What is Statistical Power? The probability that you will detect a true difference between two samples.

What is Statistical Power? Example: are men taller than women, on average?

What is Statistical Power? Example: free delivery on a website

Why is Statistical Power important? Low power leads to: 1. False negatives. 2. False positives.

Precision: the proportion of true positives among the positive results. It's a function of power, significance level, and prevalence.

If you have good power? Out of 100 tests, 10 really drive uplift. You detect 8 of them. The other 90 tests, at a 5% significance level, produce about 5 false positives. So 8/13 of positive tests are real.

If you have bad power? Out of 100 tests, 10 really drive uplift. You detect only 3 of them. You still get about 5 false positives. So only 3/8 of winning tests are real!
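The two scenarios above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function name `precision` and its parameters are illustrative, and the power values of 80% and 30% are inferred from "detect 8 of 10" and "detect 3 of 10" rather than stated on the slides.

```python
def precision(power: float, alpha: float, prevalence: float) -> float:
    """Share of positive test results that reflect a real effect."""
    # Fraction of all tests that are real effects AND get detected.
    true_positives = power * prevalence
    # Fraction of all tests that have no effect but come out significant anyway.
    false_positives = alpha * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# 10 out of 100 tests really drive uplift, significance level 5%.
good = precision(power=0.8, alpha=0.05, prevalence=0.10)
bad = precision(power=0.3, alpha=0.05, prevalence=0.10)
print(f"good power: {good:.2f}, bad power: {bad:.2f}")
# good power: 0.64, bad power: 0.40
```

The slides round the 4.5 expected false positives up to 5, which is why they quote 8/13 (about 0.62) and 3/8 (about 0.38) instead of 0.64 and 0.40.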

Marketer: ‘We need results in 2 weeks’ time.’ Me: ‘We can’t run this test for only two weeks; we won’t get robust results.’ Marketer: ‘Why are you being so negative?’

Calculating Power. Alpha: the probability of a positive result when the null hypothesis is true (5%). Beta: the probability of not seeing a positive result when the null hypothesis is false, i.e. when there really is a difference. Power = 1 - Beta (80-90%).

Calculating Power. Use a power calculator: online, R (power.prop.test), or Python (statsmodels.stats.power).

Approximate sample sizes. Using a power calculator and asking for 80% power and a significance level of 5%: 6000 conversions to detect a 5% uplift; 1600 conversions to detect a 10% uplift.
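As a rough cross-check of those figures, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions, written with only the Python standard library. It is not the calculator used in the talk. The 5% baseline conversion rate is an assumption chosen for illustration, and so is the reading of the slide's numbers as conversions per arm.

```python
from statistics import NormalDist


def conversions_needed(baseline_rate, relative_uplift, alpha=0.05, power=0.80):
    """Approximate control-arm conversions needed for a two-sided z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_uplift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    visitors_per_arm = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    # Convert visitors to the expected number of conversions in the control arm.
    return visitors_per_arm * p1


# Hypothetical 5% baseline conversion rate.
conv_5 = conversions_needed(0.05, 0.05)
conv_10 = conversions_needed(0.05, 0.10)
print(f"5% uplift: ~{conv_5:.0f} conversions, 10% uplift: ~{conv_10:.0f} conversions")
```

Under these assumptions the results land in the same ballpark as the 6000 and 1600 quoted above. Halving the detectable uplift roughly quadruples the required sample.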

Multiple testing

Effect of multiple testing: if you run 20 tests at a significance level of 5%, you will obtain 1 win on average, just by chance, even if none of the variants has any real effect.
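The arithmetic behind this is short. The sketch below assumes the 20 tests are independent and that none of them has a real effect; the variable names are illustrative.

```python
n_tests = 20
alpha = 0.05

# Expected number of tests that reach significance purely by chance.
expected_false_wins = n_tests * alpha

# Probability that at least one of the tests is a chance "win".
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"expected false wins: {expected_false_wins:.1f}")
print(f"P(at least one false win): {p_at_least_one:.2f}")
# expected false wins: 1.0
# P(at least one false win): 0.64
```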

Giving targets for successful tests.

Stopping tests early

Stopping tests early: simulations show that stopping an A/A test as soon as you see a positive result will produce a ‘successful’ test 41% of the time.

Stopping tests early: that works out to a precision of 20%.

Negative uplift: stopping early in an A/B test where the variant has a negative effect still results in a win 9% of the time!
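The effect of peeking can be illustrated with a small A/A simulation. This is a sketch under stated assumptions, not the simulation from the talk: the conversion rate, batch size, number of peeks, and the pooled two-proportion z-test are all illustrative choices, and any significant difference in either direction is counted as a "positive result". The exact false positive rate depends heavily on how often you peek, so it is not expected to reproduce the 41% figure.

```python
import random
from statistics import NormalDist


def is_significant(conv_a, conv_b, n_per_arm, z_crit):
    """Pooled two-proportion z-test with equal arm sizes."""
    pooled = (conv_a + conv_b) / (2 * n_per_arm)
    se = (2 * pooled * (1 - pooled) / n_per_arm) ** 0.5
    if se == 0:
        return False
    return abs(conv_a - conv_b) / n_per_arm / se > z_crit


def simulate_aa(n_sims=400, n_peeks=20, batch=200, rate=0.10, alpha=0.05, seed=0):
    """Return (rate of wins when stopping at first significance, rate at the end)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    early_wins = final_wins = 0
    for _sim in range(n_sims):
        conv_a = conv_b = n = 0
        stopped = False
        for _peek in range(n_peeks):
            # Both arms share the same true conversion rate (A/A test).
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            if not stopped and is_significant(conv_a, conv_b, n, z_crit):
                stopped = True  # a peeking experimenter would stop here
        early_wins += stopped
        final_wins += is_significant(conv_a, conv_b, n, z_crit)
    return early_wins / n_sims, final_wins / n_sims


peeking_rate, fixed_rate = simulate_aa()
print(f"peeking: {peeking_rate:.2f}, fixed sample size: {fixed_rate:.2f}")
```

With a fixed sample size the false positive rate stays near the nominal 5%, while checking after every batch and stopping at the first significant result inflates it several times over.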

A True Story

Regression to the mean. Give 100 students a true/false test. They all answer randomly. Take only the top-scoring 10% of the class and test them again. What will the results be?
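The thought experiment is easy to simulate. In the minimal sketch below, the 20-question test length and the function name `retest_top_scorers` are assumptions made for illustration; the slide specifies only the 100 students and the top 10%.

```python
import random


def retest_top_scorers(n_students=100, n_questions=20, top_fraction=0.10, seed=0):
    """Return (mean first-test score, mean retest score) for the top scorers."""
    rng = random.Random(seed)

    def random_score():
        # Every answer is a coin flip, so the expected score is 50%.
        return sum(rng.random() < 0.5 for _ in range(n_questions)) / n_questions

    first = [random_score() for _ in range(n_students)]
    n_top = max(1, round(n_students * top_fraction))
    top = sorted(range(n_students), key=lambda i: first[i], reverse=True)[:n_top]

    first_mean = sum(first[i] for i in top) / n_top
    # The same students guess again; their earlier luck does not carry over.
    second_mean = sum(random_score() for _ in top) / n_top
    return first_mean, second_mean


first_mean, second_mean = retest_top_scorers()
print(f"top 10% first test: {first_mean:.2f}, retest: {second_mean:.2f}")
```

The selected group looks well above average on the first test only because it was selected for being lucky. On the retest it falls back towards 50%, which is the same reason a winning A/B test tends to show a smaller uplift when it is repeated.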

Estimates of uplift are generally wrong.

What you need to do to get it right:

- Do a power calculation first to estimate sample size
- Use a valid hypothesis - don’t use a scattergun approach
- Do not stop the test early
- Perform a second ‘validation’ test

My details: martingoodson@gmail.com, @martingoodson. Download my whitepaper on A/B testing: http://goo.gl/jvhwmB

Skimlinks After Party! Levante Bar, 5 minutes away. Come hungry! Invites and map at the booth. http://skimlinks.com/jobs

