At Microsoft I experienced how A/B testing grew from being occasionally used by a few teams in Bing and MSN several years ago to becoming widely used by many Microsoft products, including Office, Windows, Xbox, Skype, and Visual Studio. In some products it is already a standard, required part of the software release process, helping ensure software quality, understand customer value, and make better data-driven decisions. In other products it is growing steadily. At Microsoft, A/B testing is winning and will soon be part of everyone's daily job. However, when I left Microsoft to join Outreach, a startup that makes sales automation software, I was exposed to a different world. Even though Outreach provided A/B testing functionality, it was rarely used, and the usage was often incorrect. While the need for trustworthy decision making through A/B testing in sales was clear, it was also clear that simply giving sales teams an A/B testing system like the one we had at Microsoft would not be enough. I learned that there is a big difference between a Microsoft engineer and a sales representative with respect to what they need to use A/B testing successfully. In this talk I will discuss the gaps: what are our experimentation platforms, tools, and processes, which were built for highly trained engineers, missing to make A/B testing truly available to everyone? I will also discuss ongoing work and future research directions to fill these gaps. While required to make A/B testing a success in sales, I believe that solving these problems will also help increase adoption and successful usage of A/B testing in the software industry.
A/B Testing for Everyone
2. A/B Testing for Everyone
Pavel Dmitriev
Some slides taken from talks by Ronny Kohavi
3. About Me
B.S. Applied Math @ Moscow State University, Russia
Ph.D. Computer Science @ Cornell University, focused on applied Machine Learning
3 years @ Yahoo!, worked on web crawling and indexing optimization
8 years @ Microsoft, worked on experimentation in Bing, MSN, O365, Skype, Windows
5 months @ Outreach, working on experimentation, ML, NLP
5. Outline
• Intro to A/B testing
• Examples of real experiments
• Experimentation adoption across industries
• Five challenges preventing faster adoption, in Sales and in Software
6. The Life of a Great Idea – True Bing Story
Control – the existing display. Treatment – a new idea called Long Ad Titles.
7. The Life of a Great Idea
• It was one of hundreds of ideas on the table, and it seemed… meh…
• Stayed in the backlog for months (February through June)
• Many features were prioritized above it; it was clear the idea was not going to make it any time soon
• The engineer thought it was trivial to implement. He implemented it and started an A/B test.
• Immediately an alert fired: revenue was abnormally high (usually this indicates a bug)
• But in this case there was no bug. The idea increased Bing’s revenue by 12% (over $100M/year), without hurting user experience metrics!
8. We are bad at assessing the value of ideas
• The best revenue-generating idea in Bing history was badly rated and delayed for months!
At Microsoft, we ran a study in Bing and found that only ~1/3 of the ideas developed were actually good for users and the business, ~1/3 were neutral, and ~1/3 were bad
• Only in Software Engineering?
In Sales, contradicting “best practices” are abundant. For example, the best day to contact a prospect is …
In Medicine, correctly evaluating an idea, e.g. a new drug, is a matter of life and death. The FDA and EMA do not trust expert opinions and mandate the use of Randomized Controlled Trials
We can’t trust our gut! To make the right choices we need data from real users!
10. Collecting Usage Data
• Companies have always collected data to learn what their users appear to value
Interviews, focus groups, questionnaires, and other similar techniques are great at revealing what users say they do
Although rich with qualitative information, the learnings from these techniques are typically based on small samples and risk being biased, making it hard to generalize
• With internet-connected products, companies can collect feedback data to learn what their customers actually value
Telemetry and logging reveal what the customers actually do
11. Use Data Correctly – Correlation is not Causation
• Seattle is known for its rain
• Whenever I see people on the street carrying umbrellas, very soon it starts raining
• I may conclude that umbrellas cause the rain, and decide to ban them
• Banning umbrellas, however, won’t stop the rain; it will just make everyone wetter
Photo by Mike Waller, taken from Flickr
Relying on correlations isn’t just neutral, it’s often harmful to the business!
12. Correlation is not Causation – Real Example
• You observe the churn rates for users using/not using your feature:
25% of new users who do NOT use your feature churn (stop using the product 30 days later); only 10% of new users who use your feature churn
• [Wrong] Conclusion: your feature reduces churn and is thus critical for retention
Flaw: The relationship between the feature and retention is correlational; the data above is insufficient for any causal conclusion (see the simulation sketch after this slide)
• Example: Users who see error messages in Office 365 churn less.
This does NOT mean we should show more error messages. They are just heavier users of Office 365
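To make the flaw concrete, here is a minimal, hypothetical Python simulation (all numbers invented for illustration, not Office 365 data): a hidden confounder, how heavily someone uses the product, drives both feature adoption and retention, producing exactly this kind of misleading gap even though the feature has no causal effect.

```python
import random

random.seed(42)

# Hypothetical simulation: the feature has NO causal effect on churn.
# A hidden confounder (heavy vs. light product usage) drives both
# feature adoption and retention.
users = []
for _ in range(100_000):
    heavy = random.random() < 0.3                      # 30% heavy users
    uses_feature = random.random() < (0.8 if heavy else 0.2)
    churns = random.random() < (0.05 if heavy else 0.30)
    users.append((uses_feature, churns))

def churn_rate(group):
    return sum(churned for _, churned in group) / len(group)

feature_users = [u for u in users if u[0]]
non_users = [u for u in users if not u[0]]

print(f"churn among feature users:     {churn_rate(feature_users):.1%}")
print(f"churn among non-feature users: {churn_rate(non_users):.1%}")
# Feature users churn far less, yet by construction the feature does
# nothing. Only randomizing who gets the feature (an A/B test) would
# reveal the true (zero) effect.
```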
13. Using Data Correctly – Before and After
Flaw: This approach misses time-related factors such as external events, weekends, holidays, seasonality, etc.
[Chart: Amazon Kindle sales over time for Website A vs. Website B, a before-and-after example. Annotation: Oprah calls Kindle “her new favorite thing”. When run side by side, the new site (B) is always worse than the original (A), the opposite of what the observational before/after data suggests.]
14. A/B Tests in One Slide
• Other names: Controlled Experiments, Randomized Clinical Trials (RCTs)
• Can have more than two variants: A/B/C/etc. tests are common
• Must run statistical tests to confirm differences are not due to chance (a minimal sketch follows this slide)
A/B Tests are the best scientific way to prove causality!
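As a minimal sketch of the “statistical tests” bullet (not code from the talk; all counts are invented), a two-proportion z-test comparing conversion rates in A and B could look like this in Python:

```python
# Two-proportion z-test on A/B conversion counts (invented numbers).
from statsmodels.stats.proportion import proportions_ztest

conversions = [1_210, 1_315]   # successes in control (A), treatment (B)
samples = [50_000, 50_000]     # users assigned to each variant

z_stat, p_value = proportions_ztest(conversions, samples)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

# Common convention: p < 0.05 means the observed difference is
# unlikely to be due to chance alone.
if p_value < 0.05:
    print("Statistically significant difference between A and B.")
else:
    print("No statistically significant difference detected.")
```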
15. Real Examples
• Three experiments
• Each had enough users for statistical validity
• For each experiment I’ll tell you the success metric
• Your job is to guess the result
Please stand up
You’ll choose between three options: raise your left hand, raise your right hand, or leave both hands down
If you get it wrong, please sit down
• Since there are 3 choices for each question, random guessing implies 100%/3³ ≈ 4% will get all three questions right. Let’s see how much better than random you can do.
16. Example 1: Outreach Email (Step 9, Day 7)
• Success metric: Reply Rate
Hey {{first_name}},
In short, we're a sales automation platform that makes your
reps life a lot easier. Our average companies (based on 1100+
companies) have tripled their reply rates on cold outbound
emails and boosted rep productivity by 2x.
We take what your best reps are doing and automate that
across your entire team so your weaker reps can work at the
highest possible same level. We also solve the issue of follow
up falling through the cracks and reps not going deep enough.
When can I get a few minutes on your calendar to discuss?
{{sender.first_name}}
{{first_name}},
I'm sure in your role you get a ton of sales-driven emails, probably most of which are
spam you have no interest in. My goal is to provide enough value to warrant a 15 minute
call with you.
What we do is put your sales process into a structured series of touch points which takes
care of your follow-up process for you. This ramps up reps activities and ensures that every
lead is thoroughly worked, never gets lost and receives the 5 to 12 touches where 80% of
sales happen.
Second, we do all the administrative work for in your CRM (Salesforce). This frees up your
reps time, logs their activities, and gives you 100% accurate reporting.
Finally, we open up the "Black Box" of sales and show you in real time how each rep is
performing, what activities they're doing, and what is and isn't working. This provides a solid
foundation to accurately forecast results, improve your outreach and train your team.
Over 1100 companies (like CenturyLink, Adobe, and Marketo) use us and their average rep
saves 2 hrs a day, and 2X's their productivity.
If you see value here can we set up a time next Tuesday or Wednesday to discuss?
{{sender.first_name}}
• Left: shorter, more “salesy”
• Right: longer, more “social”
• Raise your left hand if you think the Left version wins (stat-sig)
• Raise your right hand if you think the Right version wins (stat-sig)
• Don’t raise your hand if you think they are about the same (no stat-sig difference)
17. Example 1: Outreach Email (Step 9, Day 7)
(The same two email templates as on the previous slide.)
• The Left template has a 70% higher reply rate… However, most of those replies are negative or unsubscribe requests. The Right template has a higher positive reply rate
• If you did not raise your hand, sit down…
• If you raised your right hand, sit down…
18. Example 2: SERP Truncation
• SERP is a Search Engine Result Page (shown on the right)
• Success Metric: Clickthrough Rate on the first SERP (ignore issues with click/back, page 2, etc.)
• Version A: show 10 algorithmic results
• Version B: show 8 algorithmic results by removing the last two results (shown on the right)
• All else stays the same: task pane, ads, related searches
• Raise your left hand if you think version A wins (10 results)
• Raise your right hand if you think version B wins (8 results)
• Don’t raise your hand if you think they are about the same
19. Example 2: SERP Truncation
• If you raised your left hand, sit down…
• If you raised your right hand, sit down…
• With over 3M users in each variant, we could not detect a stat-sig delta. Users simply shifted their clicks from the last two algorithmic results to other elements of the page.
• Rule of Thumb: Shifting clicks is easy. Reducing abandonment is hard.
20. Example 3: Windows Search Box
• The search box in the lower left corner of the screen on Windows machines
• Success metric: more searches (and thus more Bing revenue)
• Raise your left hand if you think the Left version wins
• Raise your right hand if you think the Right version wins
• Don’t raise your hand if you think they are about the same
21. Example 3: Windows Search Box
• If you did not raise your hand, sit down…
• If you raised your left hand, sit down…
• The four variants we actually tested in order of performance are:
Type here to search (winner)
What can I help you find?
Ask me anything (Control - the design that shipped with Windows 10)
Search the web and Windows (worst)
Stop guessing – get the data!
23. Experimentation Adoption: Software Industry
• http://www.exp-growth.com/ – a survey to determine the state of experimentation maturity (Fabijan et al., ICSE 2017, SEAA 2018)
[Chart: State of Exp Growth – number of surveyed companies in each experimentation maturity stage: Crawl, Walk, Run, Fly]
24. Other industries? Let’s look at Sales
• Most of Outreach’s ~2,500 customers fall into the Crawl stage, with many not doing any A/B testing at all
• Few sales organizations have a systematic experimentation program
• Huge potential: some experiments we ran doubled reply rates!
25. What are the reasons for low adoption in Sales?
• A few facts about sales
A very traditional industry (some say the oldest profession on earth), slow to change
No formal education or degrees required; considered entry level and pays low
Requires extreme mental toughness. You are constantly ignored and told no. You’ve got a monthly quota, and if you don’t meet it 3 months in a row, you are fired
• There’s a fear of change: sales managers are afraid to try new ideas, fearing they may cause harm and result in missing their quota
26. What are the reasons for low adoption in Sales?
• Inadequate support for experimentation in sales tools, leading to most tests being invalid, and an inability to confidently make decisions even on valid tests
[Screenshot callouts:]
• No statistical testing
• Any user can turn the variants on/off at any time during the test
• Any user can edit the email being tested at any time during the test
• The vast majority of tests are broken (e.g., imbalance in deliveries; a sample ratio mismatch check is sketched below)
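One concrete trustworthiness check behind the “imbalance in deliveries” callout is the Sample Ratio Mismatch (SRM) test. Here is a minimal sketch (not Outreach’s actual implementation; counts are invented) using a chi-squared goodness-of-fit test:

```python
# Sample Ratio Mismatch (SRM) check: did traffic actually split the
# way the experiment was configured? Counts here are invented.
from scipy.stats import chisquare

observed = [50_912, 48_331]        # emails delivered to A and B
expected_ratio = [0.5, 0.5]        # the configured 50/50 split
total = sum(observed)
expected = [r * total for r in expected_ratio]

stat, p_value = chisquare(observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p_value:.2e}")

# A very small p-value (a common threshold is p < 0.001) means the
# delivery counts are inconsistent with the configured split: the
# randomization is likely broken and results should not be trusted.
if p_value < 0.001:
    print("SRM detected: do not trust this experiment's results.")
```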
27. How to increase the adoption?
• We need to make experimentation
Trustworthy – results are correct and easily understood
Safe – the impact of testing bad ideas is limited
Easy to use – enables non-technical sales managers and executives to answer their questions
• These are the same things I worked on trying to increase adoption of experimentation at Microsoft!
Except… the bar is higher!!!
28. Five Gaps Between the Needs of the Sales Industry and the Experimentation State of the Art
1. No open source trustworthy A/B testing solution
2. Difficult to come up with the right metrics
3. Small sample sizes
4. Difficult to understand results of statistical tests
5. Hard to translate business questions into experiment designs
Solving these issues will help accelerate experimentation adoption in Sales, Software, and other domains
29. #1. Open Source A/B Testing Platform
• Pretty much every kind of platform is available as open source, except A/B testing
Wasabi, the only option, is not maintained
• Our http://exp-growth.com survey showed that most companies build their own platform from scratch (Fabijan et al., SEAA 2018).
This is hard – a big investment few companies can afford.
[Chart: Type of Exp Platform – share of surveyed companies using an internally developed platform, a third-party platform, or no platform (manual coding of experiments)]
• There’s a need for an open source A/B testing solution that is easy to deploy, integrate, and use, supports several common experiment designs, and provides safety features
30. #2. Determining the right metrics
• How to judge the result of an A/B test?
OEC = Overall Evaluation Criteria, or OMTM = One Metric That Matters
A single metric, or a few key metrics, with well-defined decision criteria
• Two key properties:
1. Alignment with long-term company goals (directionality)
2. Ability to impact (sensitivity)
• Finding a good OMTM is hard, in Sales and in Software Products
Simple metrics like Opens or Replies to sales emails are not predictive of future sales (fail directionality)
Long-term metrics like Sales or Revenue take too long to measure – a typical sales cycle takes months – and are hard to impact via small changes like email content (fail sensitivity)
Outreach’s solution – Positive Replies, where “positive” is determined via an ML classifier (a sketch of this OEC follows below)
See the A/B Testing at Scale tutorial for examples from the Software industry
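As a sketch of how such an OEC could be computed (hypothetical code; `classify_reply` stands in for whatever ML model labels replies as positive, which is my assumption, not Outreach’s actual API):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Reply:
    text: str
    variant: str  # "A" or "B"

def positive_reply_rate(
    replies: List[Reply],
    emails_sent: int,
    classify_reply: Callable[[str], bool],  # hypothetical ML classifier
) -> float:
    """OEC sketch: positive replies per email sent for one variant."""
    positives = sum(1 for r in replies if classify_reply(r.text))
    return positives / emails_sent

# Usage sketch with a trivial stand-in classifier (a real system would
# use a trained model to separate interest from "unsubscribe me").
fake_classifier = lambda text: "meeting" in text.lower()
replies_a = [Reply("Sure, let's set up a meeting", "A"),
             Reply("Please unsubscribe me", "A")]
print(positive_reply_rate(replies_a, emails_sent=500,
                          classify_reply=fake_classifier))
```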
31. #3. Small Sample Sizes
• A typical 2-week A/B test for a mid-size Outreach customer will only have hundreds to thousands of data points in each variant
This translates into being able to detect only changes of ~20% or more (see the power calculation sketch after this slide)
• Solutions:
Run bigger tests (at Outreach we recommend always running 50/50 tests)
Select more sensitive metrics: a 20% increase in Revenue is hard, a 20% increase in Positive Replies is easier
Start by focusing on bigger changes rather than small tweaks. As the company grows and the volume of sales activity increases, you can focus on smaller and smaller changes
Implement smarter experiment designs (e.g., cross-over designs) and analysis methods (e.g., CUPED)
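Here is a minimal sketch of the power calculation behind the “~20% or more” claim (illustrative only, assuming a 10% baseline reply rate and the usual alpha = 0.05 / 80% power conventions), using statsmodels:

```python
# Minimum detectable effect (MDE) for a two-proportion test at
# alpha = 0.05 and 80% power. All numbers are illustrative.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def mde(baseline: float, n_per_variant: int) -> float:
    """Smallest relative lift detectable with 80% power."""
    solver = NormalIndPower()
    # Search over candidate lifts until one is detectable.
    for lift_pct in range(1, 201):
        lift = lift_pct / 100
        effect = proportion_effectsize(baseline * (1 + lift), baseline)
        power = solver.power(effect_size=effect, nobs1=n_per_variant,
                             alpha=0.05, ratio=1.0)
        if power >= 0.8:
            return lift
    return float("inf")

# A 10% baseline reply rate:
print(f"n=500 per variant:   detectable lift ~ {mde(0.10, 500):.0%}")
print(f"n=5,000 per variant: detectable lift ~ {mde(0.10, 5_000):.0%}")
# With hundreds of data points only very large lifts are detectable;
# with thousands, roughly ~20% lifts become detectable.
```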
32. #4. Understanding Experiment Results
• The standard way of evaluating experiments via Null Hypothesis Testing can easily be misinterpreted, leading to wrong conclusions
See Steve Goodman’s A Dirty Dozen for 12 ways to get it wrong
We can’t show p-values to sales reps; we need an easier way to interpret results
• The treatment effect may differ across sub-populations
Results may vary depending on country, browser, location, prospect persona, sales step, etc.
How can we automatically detect and visualize such heterogeneous results? (One naive approach is sketched below.)
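A naive sketch of such automatic detection (hypothetical; a production system would need better multiple-testing control and variance estimation) is to re-run the statistical test per segment with a Bonferroni correction:

```python
# Naive heterogeneous-effect scan: re-test each segment separately and
# apply a Bonferroni correction for the number of segments tested.
from statsmodels.stats.proportion import proportions_ztest

# segment -> (successes_A, n_A, successes_B, n_B); numbers invented
segments = {
    "US":     (300, 5_000, 390, 5_000),
    "UK":     ( 80, 2_000,  85, 2_000),
    "mobile": (150, 3_000, 140, 3_000),
}

alpha = 0.05 / len(segments)  # Bonferroni-corrected threshold
for name, (sa, na, sb, nb) in segments.items():
    _, p = proportions_ztest([sa, sb], [na, nb])
    flag = "HETEROGENEOUS?" if p < alpha else "no stat-sig delta"
    print(f"{name:7s} A={sa/na:.1%} B={sb/nb:.1%} p={p:.3f} -> {flag}")
```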
33. #4. Understanding Experiment Results
• Each experiment needs a clear success criterion, mapping unambiguously to positive/negative outcomes
• Summarize results and learnings in an easy-to-understand visual way (Fabijan et al., SEAA 2018)
34. #5. Answering Business Questions
• Traditionally, A/B testing has been used to answer simple yes/no questions like
Does my new medicine help?
Should I ship my new feature?
Is my new email subject line better?
• However, managers and execs think about bigger, more difficult questions
Does embedding videos in e-mails help?
How urgently should sales reps reply to prospects?
How much should I invest in improving the performance of my site?
• Using A/B testing to help answer these questions can greatly accelerate adoption of experimentation
Run a series of experiments on embedding video across all key scenarios
Run a series of experiments notifying users to reply with different delays across multiple scenarios
Run a series of “slowdown” experiments to estimate the impact of performance on revenue
• Need to develop design patterns for such “learning experiment series”
35. Summary
• We are bad at assessing the value of our ideas. Don’t trust experts – get the data!
• A/B testing is the best scientific way to measure causal impact of your work on users and business
• Experimentation adoption is growing in the Software industry, but is very low in other industries like Sales
• Five challenges slowing down the adoption:
1. No open source trustworthy A/B testing solution
2. Difficult to come up with the right metrics
3. Small sample sizes
4. Difficult to understand results of statistical tests
5. Hard to translate business questions into experiment designs
• Solving these challenges will not only help Sales, it will accelerate experimentation adoption in Software and other industries, bringing experimentation to Everyone!
I wanted to enable experimentation for Sales. On the one hand, there was already A/B testing support in the Outreach product, so it looked like I had a head start. On the other hand, few organizations were using it.
Today I’m going to share with you what I learned while working on increasing adoption of A/B testing in Sales, what challenges we need to solve to do it, and how solving these challenges will actually help us grow adoption of experimentation in Software and other industries.
References for similar observations made by many others
“Google ran approximately 12,000 randomized experiments in 2009, with [only] about 10 percent of these leading to business changes” was stated in Uncontrolled by Jim Manzi
“80% of the time you/we are wrong about what a customer wants” was stated in Experimentation and Testing: A Primer by Avinash Kaushik, author of Web Analytics and Web Analytics 2.0
QualPro tested 150,000 ideas over 22 years and founder Charles Holland stated, “75 percent of important business decisions and business improvement ideas either have no impact on performance or actually hurt performance…” in Breakthrough Business Results With MVT
At Amazon, half of the experiments failed to show improvement
What customers say they like during a study may be different from what they actually like in their daily life.
Example: In user studies customers always prefer richer web pages (with more images/videos/carousels/etc.). In real life, when page load time and the speed of getting to the result matter, richer pages slow users down, introduce distractions from the task at hand, and often end up doing worse.
It’s not enough to collect the data; we need to use it correctly.
At Microsoft, experimentation is winning. When I left 5 months ago, more than 20 large Microsoft products were actively running experiments.
In the software industry at large, experimentation adoption is growing fast. More and more companies are adopting an “experiment with everything” culture. Experimentation is becoming the norm, with many successful companies like Netflix, Pinterest, Intuit, LinkedIn, etc. citing it as one of the key factors in their success. This peer pressure pushes even companies that do not have a good understanding of the benefits of experimentation to adopt it.
In sales, adoption of experimentation is low.
Cultural issues
Technical issues
Outreach had “state of the art” experimentation capabilities. Other sales tools are similar or worse.
What these issues mean is that even if someone tries to run an experiment, they are very likely to fail to obtain correct, actionable results, which in turn creates an even bigger barrier to trying again.
Even though on the surface the sales industry is very different from Software Engineering, the issues with respect to experimentation adoption are the same.
For example, you can teach Microsoft engineers how statistical testing works, give them p-values and confidence intervals, and reasonably trust they will interpret it all correctly; this doesn’t work for salespeople. Salespeople need a clear answer: which variant wins and why.
You will see that, while I’m going to be using the example of Sales, these are again the challenges faced by the Software industry as well. By solving them, we can increase experimentation adoption not only in Sales but in Software and other industries.
From hosting of services, to data collection and processing, to data analysis and real time machine learning and AI – there are open source solutions supporting all of that.
It is a lot of work to develop a trustworthy A/B testing platform. Most companies end up with very basic, often untrustworthy solutions, like the one we had at Outreach.
If we are to increase the adoption of experimentation in any industry, we need a quality open source solution.
If we run an A/B test but do not have a clear way to determine success, the value is much less.
Often the reason for complexity is that the experimentation system does not know what the success criteria are, so it has to return everything and have the user understand and interpret the results. If, on the other hand, the experimentation system knows the precise success criteria, it can do the analysis automatically and just give the user the answer. Interpretation difficulties and the shortcomings of NHST can be greatly reduced in this case.
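As a sketch of what “just give the user the answer” could look like (a hypothetical design, not an actual Outreach feature): once the success metric and decision rule are declared up front, the system can translate the statistics into a plain-language verdict:

```python
# Hypothetical sketch of an "answer, not statistics" report: the user
# declares the success metric and decision rule up front; the system
# returns a verdict instead of raw p-values.
from statsmodels.stats.proportion import proportions_ztest

def verdict(pos_a: int, n_a: int, pos_b: int, n_b: int,
            alpha: float = 0.05) -> str:
    pa, pb = pos_a / n_a, pos_b / n_b
    _, p_value = proportions_ztest([pos_a, pos_b], [n_a, n_b])
    if p_value >= alpha:
        return "No clear winner yet: keep the test running or call it a tie."
    winner, lift = ("B", pb / pa - 1) if pb > pa else ("A", pa / pb - 1)
    return (f"Variant {winner} wins: {lift:.0%} more positive replies "
            f"(unlikely to be due to chance, p = {p_value:.3f}).")

# Invented counts: positive replies out of emails sent in each variant
print(verdict(pos_a=52, n_a=1_000, pos_b=85, n_b=1_000))
```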
Translation of the video question: Should I pay for a video tool and train my team on it?
It is executives and managers who decide on adopting experimentation program. If A/B testing can help them answer the questions they really care about, it will greatly accelerate the adoption.