A/B TESTING
MISTAKES
AND HOW TO AVOID THEM
TABLE OF CONTENTS
0 INTRODUCTION
1 7 BEGINNER MISTAKES
You’re most likely already doing A/B tests. How come?

So grab a cup of coffee (or 3) and let’s begin!
7 BEGINNER MISTAKES
Let’s start off with 7 mistakes most beginners make when A/B testing:
1. They start with complicated tests
2. They don’t have a hypothesis for each test
3. They don’t have a process and a roadmap
4. They don’t prioritize their tests
5. They don’t optimize for the right KPIs
6. They ignore small gains
7. They’re not testing at ALL times
For your first tests ever, start simple. Being successful at A/B Testing is all about process, so it’s important that you first go through the motions.

See how theory compares with reality, what works for you and what doesn’t, and where you have problems, be it in the implementation of the tests, coming up with ideas, analyzing the results, etc.

Think about how you’ll scale your testing practice, or whether you’ll need new hires, for example.

Starting with A/B Testing is a lot like starting weight training seriously. You don’t start with your maximum load and complicated exercises. That would be the best way to injure yourself badly.

You start light, and you focus 100% of your attention on the movement itself to achieve perfect form, with a series of checkpoints to avoid all the ways you can get injured or develop bad habits that’ll end up hurting in the long run.

By doing that, you’ll imprint it in your muscle memory, so when you need to be focused on the effort itself, you won’t even have to think about the movement. Your body will instinctively do it.
Exact. Same. Thing. With. A/B Testing.

You start simple, focus all your attention on each step of the process, set up safeguards, and adjust as you go so you don’t have to worry about it later. Otherwise, you could get overwhelmed, get false results, and get discouraged.

Another benefit of starting with simple tests is that you’ll get quick wins.

Getting bummed out when your first tests fail (and most do, even when done by the best experts out there) is often why people give up or struggle to convince their boss/colleagues that split testing is indeed worth the time and investment.

Starting with quick wins allows you to create momentum and rally people to the practice inside your team/company.

Here are a couple of examples of things you could test to start with:

• Test copy on your offers, product pages, and landing pages (make it focused on benefits, not features, and be sure that what you mean is crystal clear)
• Removing distractions on key pages (is this slider really necessary, or all these extra buttons?)
[Figure: Version A with a “BUY” button vs. Version B with an “ADD TO BASKET” button]
It’s easy to fail. So you need safeguards: steps that you will go through each and every time without having to think. No need to reinvent the wheel every time.

When you feel like you lost your focus, use Brian Balfour’s question:

With a process, a roadmap, and the development of a testing culture inside your company, you’ll have a list of test ideas longer than your arm.

For the sake of argument, let’s say they did have this crazy lift.

First thing first:
“Go big or go home.” It’s true, as we’ve just said, that you should focus your tests on high-impact changes first.

What you won’t hear us say anywhere, though, is “if your test results in a small gain, drop it and move on.”

Why? Let’s take a quick example.

If a test ends with your variation winning with 5% more conversions, and each month you get similar results, that’s an 80% improvement over a year (1.05^12 ≈ 1.80). How’s that for small!

Also, the more you test, the more you’ll improve your website, and the less often you’ll get big results. Don’t be sad if you only get small lifts: it could mean your website is already good. It’s pretty rare to get big lifts on a “normal” site.

Don’t be disheartened by small gains, it’ll pay off big time over time.
[Figure: a 5% monthly lift compounding to roughly 80% over 12 months]
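To check the math yourself, here’s a minimal sketch; the 5% monthly lift is just the figure from the example above:

```python
# Compound a 5% monthly conversion lift over 12 months.
monthly_lift = 0.05

yearly_multiplier = (1 + monthly_lift) ** 12
print(f"Improvement over a year: {yearly_multiplier - 1:.0%}")  # -> 80%
```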
“There’s a way to do it better—find it.”
Thomas Edison
YOU’RE STOPPING
YOUR TESTS TOO EARLY
This is, without a doubt, the most common and one of the most potent A/B Testing mistakes. And the answer to “when should I stop my test?” is really not intuitive.

We were going to write a chapter on “when to stop your A/B test for both main methods of A/B Testing, frequentist and Bayesian”. But we encountered two problems.

First problem: most people doing A/B Testing don’t know (and don’t care) whether their tool uses frequentist or Bayesian statistics.

Second problem: when you dig a bit into the different solutions, you find that no two tools use exactly the same statistical method.

So, how could we write something helpful?
Here is what we came up with. We will do our best to answer the following
question: What concepts do I need to understand not to stop my A/B Test too
early?
Note: None of these elements is a stopping rule on its own, but having a better
grasp of them will allow you to make better decisions.
Well … no.

You need a sample that is representative of your overall audience (ignore that if you want to target a particular segment with your test) and large enough not to be vulnerable to the natural variability of the data.

When you do A/B Testing, you can’t measure your “true conversion rate”. You arbitrarily choose a portion of your audience with the assumption that the behavior of the selected visitors will correlate with what would have happened with your entire audience.

So, what you need to ask yourself is: how do I determine if my sample is representative of my entire audience, in proportions and composition?

You need to really know your audience. Conduct a thorough analysis of your visitors before launching your A/B Tests.

Another issue if your sample is too small is the impact your outliers will have on your experiment. The smaller your sample is, the higher the variation between measures will be.
1. We toss a coin 10 times, noting heads (H) or tails (T), and repeat the experiment 5 times:

1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th | %H
T   H   T   H   T   H   T   H   H   T    | 50
T   H   T   T   T   H   T   H   T   T    | 30
H   T   H   T   T   H   H   H   H   H    | 70
H   T   H   T   T   T   T   H   T   H    | 40
H   H   H   T   H   T   H   H   H   H    | 80

With only 10 tosses, the share of heads swings from 30% to 80%.

2. Same experiment, but we toss the coin 100 times instead of 10:

Tosses | %H
100    | 50
100    | 49
100    | 50
100    | 54
100    | 47

The outcomes now only vary from 47% to 54%. The larger your sample size is, the closer your result gets to the “true” value.
The real result could be the exact opposite for all you know. You made a business decision based on false data. Whoops …

How big should your sample be?

There is no magical number that will solve all your problems, sorry. It comes down to how much of an improvement you want to be able to detect: the bigger the lift you want to detect, the smaller the sample size you’ll need.

One thing is true for all methods though: the more data you collect, the more accurate or “trustworthy” your results will be. We sometimes shoot for 1,000 conversions if our client’s traffic allows it. The larger, the better, as we saw.

Let us insist on the fact that those numbers might not be optimal for you if your tool doesn’t use frequentist statistics.

That said, we advise our clients to use a calculator like this one (we have one in our solution, but this one is extremely good too). You’ll need to input the current conversion rate of your page and the minimum lift you want to detect (i.e. the minimum improvement you’d be happy with).

It gives an easy-to-read number, without you having to worry about the math too much. And it prevents you from being tempted to stop your test prematurely, as you’ll know that until this sample size is reached, you shouldn’t even look at your data.

And even if you have Google-like traffic, reaching a given sample size isn’t a stopping condition on its own. Kudos on your traffic, but sorry, no again … We’ll see that next.
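As a rough illustration of what such a calculator does under the hood, here is a sketch of the classic frequentist two-proportion formula. The defaults (alpha = 0.05, power = 0.80) are common conventions, not values from this guide:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(baseline_rate: float, min_relative_lift: float,
                              alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per variation to detect a given relative lift
    over the baseline conversion rate (two-sided z-test)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # power threshold
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(variance * ((z_alpha + z_beta) / (p2 - p1)) ** 2)

# Example: 3% baseline conversion rate, smallest lift we care about is +10%.
print(sample_size_per_variation(0.03, 0.10))  # roughly 53,000 visitors per variation
```

Notice how the required sample explodes as the lift you want to detect shrinks, which is exactly why “the bigger the lift, the smaller the sample” holds.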
3. Duration
WARNING! IS THE
WORLD SABOTAGING
YOUR A/B TESTS?
A/B Testing is great, but let’s be honest here: if you don’t pay attention, it’s pretty easy to fail. You have to create an entire process, learn how to make hypotheses, analyze data, know when it’s safe to stop a test or not, understand a bit of statistics …

And … did you know that even if you do all of the above perfectly, you might still get imaginary results?

Did you raise an eyebrow? Are you slightly worried? Good. Let’s look into it together.

What (or who) in the world is screwing up your A/B Tests when you execute them correctly? The World is. Well, actually, what are called validity threats. And you can’t ignore them.
1. You don’t send all test data to your analytics tool
2. You don’t make sure your sample is representative of your overall traffic
3. You run too many tests at the same time
4. You don’t know about the Flicker Effect
5. You run tests for too long
6. You don’t pay attention to real-world events
7. You don’t check browser/device compatibility
When you do A/B Testing, you take a sample of your overall traffic and expose these selected visitors to your experiment. You then take the results as representative of your overall traffic, and the conversion rate you measured as sufficiently close to what the “true” value would be.

This is why you should have traffic from all sources in your sample: new and returning visitors, social, mobile, email, etc., mirroring what you currently have in your regular traffic. Unless you’re targeting a specific source of traffic for your A/B test, that is, of course.

New visitors don’t know you yet, whereas returning visitors, or worse, your subscribers, will be much more likely to convert. They already know you, trust you, and maybe love you (I hope you have some that do love you; as Paul Graham, co-founder of Y Combinator, said: “It’s better to have 100 people that love you than a million people that just sort of like you”).

One other thing to keep in mind: if your competitors are running some kind of big campaign, it could impact your traffic as well.

So pay close attention to your sample, and make sure it’s not polluted in any way.
[Figure: breakdown of traffic sources for new visitors — organic search, direct, referral, paid, other]
You’re maybe familiar with Sean Ellis’ High Tempo Testing framework. Or you just have enough traffic to be able to run several tests at the same time (well done, you!).

BUT, by doing that, you’re increasing the difficulty of the whole process, and betting that all your tests actually increase conversions. Let’s not forget that each test has a chance to DECREASE conversions. Yup. (*shudder*)

You might also be messing up your traffic distribution. My what now?
[Figure: two tests running simultaneously — variations A/B and C/D, each combining a price (“XX €”) with either a “BUY” or a “BACK ORDER” button, so visitors can land on any combination]
“Don’t stop too early, don’t test for too long” … Give me a break, right?

Why is testing for too long a problem?

Cookies. No, not the chocolate ones (those are never a problem, hmm, cookies). If you let your test run for too long, your visitors’ cookies will expire: returning visitors can then be re-assigned to a different variation and counted as new visitors. That’s a real problem.

You could also get penalized by Google if you run a test longer than they expect a classic experiment to last.
We already talked about how conversion rates vary between weekdays. That’s actually just one case among the many things, outside of your control, that can influence conversion rates and traffic.
ARE YOU
MISINTERPRETING YOUR
TEST RESULTS?
Once you have a process, know how to formulate a hypothesis, set up your A/B tests, and know when to press stop, you’re all good, right? You’re on top of your A/B Testing!

Nope.

A/B Testing is about learning, and about making informed decisions based on your results. So let’s make sure we don’t mess that up!

In this chapter, we’ll look into ways you’re possibly misinterpreting your test results:
1. You don’t know about false positives
2. You’re not checking your segments
3. You’re testing too many variables at once
4. You’re giving up on a test after it fails the first time
Are you aware that there are actually 4 possible outcomes to an A/B Test? What do you mean, it’s either a win or a loss, no?

Nope, it can be:

• A false positive (you detect a winner when there is none)
• A false negative (you don’t detect a winner when there is one)
• No difference between A & B (inconclusive)
• A win (either A or B converts more)

(If you’re a bit hardcore and want to know more about this, check out hypothesis testing. It’s the actual mathematical method used for (frequentist) A/B Testing.)

“Yes, it’s true that a team at Google couldn’t decide between two blues, so they’re testing 41 shades between each blue to see which one performs better. I had a recent debate over whether a border should be 3, 4, or 5 pixels wide, and was asked to prove my case. I can’t operate in an environment like that. I’ve grown tired of debating such minuscule design decisions...”

(You can read the full article here.)

Whether or not you agree with him that it’s wrong from a design standpoint, it’s also mathematically wrong, depending on how you do it. You have two ways of approaching this:
[Figure: chaining pairwise tests compounds the risk of a false positive — A vs B: 5% chance; add B vs C: ~9%; keep chaining through … K vs …: ~40%. A single A/B/n test (A vs B vs C vs D … vs K) keeps one overall 5% chance of a false positive.]
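Here is a quick sketch of where those percentages come from, assuming each test runs at a 5% significance level and the tests are independent:

```python
# Familywise error rate: probability of at least one false positive
# across k independent tests, each run at significance level alpha.
alpha = 0.05

for k in (1, 2, 10):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:2d} test(s) -> {fwer:.0%} chance of at least one false positive")
```

One test gives 5%, two chained tests already give around 10%, and ten chained tests around 40%. That is why chaining pairwise tests is far riskier than one properly set-up A/B/n test.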
Don’t make Avinash Kaushik sad (one of Web Analytics’ daddies, if you’re wondering).
He has a rule:
You got the message: you need to test high-impact changes.

So you change the CTA and the headline, add a video and a testimonial, and rewrite the text. Then you test it all against your current page. And it wins. Good, right?

But which change won it for you? What if one of those changes positively impacted conversions and the others dragged it down? And had the variation lost, you would have counted the test as a failure when it wasn’t one.

Make sure to clearly specify what success looks like and that you’re set up to measure it.
[Figure: the control page vs. a variation with several elements changed at once]
If you followed our guidelines on how to craft an informed hypothesis, each of your tests should be derived from one of these sources (ideally a combination of several):

• Web Analytics
• Heatmaps
• Usability tests
• User interviews
• Heuristic analysis

For example, you have analytics data showing people staying a while on your product page, then leaving. You also have an on-page survey where visitors told you they weren’t quite convinced that your product answered their need. And your heuristic analysis showed you had clarity issues.

So you test a clearer version of the page, and it fails. Do you put a mark in the “too bad” column, conclude clarity wasn’t in fact an issue, and move on to another test?

No, of course you don’t. A/B Testing is an iterative process.

Take another look at your data and devise ways to improve your page:

• You could add testimonials
• You could remove information not relevant to the product
• You could add a video
• …

As you now know not to do cascade testing (which is completely different from iterative testing, since you don’t test X versions of the same headline/picture against the winner of a previous test), or to test everything at once, you can embrace iterative testing.
[Quote: Jeff Bezos]
Let’s be a tad extreme to illustrate this. When your internet cuts off, what do you do? If you’re plugged in through an ethernet cable, maybe you try unplugging and re-plugging it. And if that doesn’t work, you try something else, and something else again.

Same thing with your A/B tests! Don’t give up or jump to conclusions as soon as something doesn’t work. Look for other solutions and test again, and again.
YOUR BRAIN
IS YOUR WORST ENEMY
WHEN A/B TESTING
Did you know that we, humans, SUCK at statistical reasoning? We’re also irrational, flawed, and subjective.

Why? Because we’re influenced by a list of cognitive biases longer than your arm.

You can live perfectly well (but biased) without knowing about them. But if you’re here, it means you’re A/B Testing or contemplating starting, and cognitive biases are then a real threat.

To get that out of the way: cognitive biases are personal opinions, beliefs, and preferences that influence your ability to reason, remember, and evaluate information.

Let’s go down the rabbit-brain (sorry, had to do it) and make sure we’re not subjectively influencing our tests too much by being our flawed (but lovable) selves.
Remember how we talked about external validity threats? Well, if you didn’t know about them, you could assume that the lift you see was indeed caused by the pink CTA you put in your variation, and not, for example, by an incoming storm that scared people into buying your product.

You’d have been a victim of the illusory correlation bias: you perceived a relationship between two unrelated events.

Why an A/B Test worked or not isn’t straightforward. Be careful not to rush your test analysis. Our brain jumps to conclusions like there is no tomorrow. (A great book on the subject is “Thinking, Fast and Slow” by Daniel Kahneman.)

Your results are what you’ll use to make business decisions, so don’t rush your analysis.
When we talked about fixing a sample size before testing, we were actually also partially preventing another bias, called insensitivity to sample size.

A certain town is served by two hospitals. In the larger hospital about 45 babies are born each day, and in the smaller hospital about 15 babies are born each day. As you know, about 50% of all babies are boys. However, the exact percentage varies from day to day. Sometimes it may be higher than 50%, sometimes lower.

For a period of 1 year, each hospital recorded the days on which more than 60% of the babies born were boys. Which hospital do you think recorded more such days? (1) The larger hospital, (2) the smaller hospital, or (3) about the same?

“56% of subjects chose option 3, and 22% of subjects respectively chose options 1 or 2.”

The correct answer is the smaller hospital: the smaller the daily sample, the more often it strays far from the true 50%.
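You can verify this with a quick simulation. A minimal sketch; 45 and 15 births per day are the numbers from the problem:

```python
import random

def days_above_60_percent_boys(births_per_day: int, days: int = 365) -> int:
    """Count the days in a year where more than 60% of births are boys."""
    count = 0
    for _ in range(days):
        boys = sum(random.random() < 0.5 for _ in range(births_per_day))
        if boys / births_per_day > 0.60:
            count += 1
    return count

print("Large hospital (45/day):", days_above_60_percent_boys(45))  # roughly 20 days
print("Small hospital (15/day):", days_above_60_percent_boys(15))  # roughly 55 days
```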
The clustering illusion is the intuition that random events which occur in clusters are not really random.

A fun story illustrating it is the one about the Texas Sharpshooter: a Texan shoots at the blank wall of his barn, draws a target centered where his shots are most clustered, and then proceeds to brag about his shooting skills.

Just because you see similarities doesn’t mean there is a pattern. Nor does having made some good guesses in the past mean you’ll keep making them. Flipping a coin 10 times and getting 7 tails doesn’t necessarily mean the coin is biased. It just means you got tails 7 times out of 10.

Okay, let’s keep flipping coins. Let’s say we flip another coin 39 times and get 39 heads in a row. What is the probability of getting heads again on the 40th flip?

50%. Just like any other coin flip.

If you were a bit confused, you fell prey to the gambler’s fallacy (or its mirror image, the hot hand fallacy). You thought that because you got heads so many times in a row, it would somehow influence the probability of the next throw. It doesn’t: each flip is independent.
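If your brain still protests, here is a minimal simulation sketch; the streak length of 5 heads is arbitrary:

```python
import random

# Flip a fair coin many times, find every streak of 5 heads,
# and check how often the *next* flip is heads too.
flips = [random.random() < 0.5 for _ in range(1_000_000)]  # True = heads

next_flips = [flips[i + 5] for i in range(len(flips) - 5)
              if all(flips[i:i + 5])]

print("Flips that follow a 5-heads streak:", len(next_flips))
print(f"Share of heads among them: {sum(next_flips) / len(next_flips):.1%}")  # ~50%
```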
This is what D. Kahneman calls “what you see is all there is” in his book: the notion that we draw conclusions based on the information available to us, i.e. what’s in front of our eyes.
[Figure: a bat and a ball adding up to $1.10]
A bat and a ball together cost $1.10. The bat costs $1.00 more than the ball. How much does the ball cost?

50% of the students who were asked this simple question, students attending either Harvard or Yale, got it wrong. 80% of the students from other universities got it wrong.

(The intuitive answer, 10 cents, is wrong: if the ball costs x, then x + (x + 1.00) = 1.10, so x = 0.05. The ball costs 5 cents.)

Your brain is wired to look for patterns and to draw conclusions with what you have. Except it sometimes jumps the gun. Just because you’ve got 2 pieces of data under your nose doesn’t mean they’re all you need to draw a sensible conclusion.
Called the anchoring bias, it’s the fact that we allocate more importance to the first piece of information we’re given.

Here is an example from a study by Fritz Strack and Thomas Mussweiler: 2 groups of people were asked about Gandhi’s age when he died. The first group was asked if it was before 9 years old or after. The second, if it was before 140 or after. Both answers are pretty obvious.

But what was very interesting were the answers from both groups when they were then asked to guess Gandhi’s actual age when he died. Answers from the first group had an average of 50, vs 67 for the second. Why such a difference? Because they were subconsciously influenced by their respective first questions.

The same thing happens when you negotiate a salary. When the number given is precise, people tend to negotiate in smaller increments than with round numbers (studies on the topic here). And if the interviewer went first with his highest bid, or what he said was his highest bid, I’d wager you used it as a base for your counter-offer instead of what you thought you were worth.

By now you must be getting weirdly suspicious of your own brain. Good. Being aware that we’re built to jump to conclusions, consider only what’s in front of our eyes, and ignore the big picture is the first step in the right direction.

Be extra careful with your numbers and tests! When you feel you’re sure about a result, pause and check again. Run the test a second time if needed.
Then you test it. And it flops. Badly. Ouch …

What do you do?

“Screw these people, my design is perfect, they don’t know what they’re talking about!”

Or, when you bring the news to your boss, he says: “No way, this design is clearly better, go with this one.”

Being able to throw hours, or even days, of work out the window if your data says so is a sign you’re truly becoming data-driven.
Called the curse of knowledge, it’s when you’ve been so absorbed by a subject that you have a hard time thinking about problems like someone who has little to no knowledge of it.

When you know something is there, say a new button or a new picture, that’s all you see on the page. But your visitors could just as well not even see the difference.

Don’t ask someone from your team, though. You could all be victims of the bandwagon effect: members of a group influence each other, and the more people do something, the more other people might be influenced to do the same.

If you don’t regularly receive external feedback, you might have built yourself a distorted reality.
10. Overestimating the degree to which people agree with you