This document summarizes lessons learned from combining human judgment with algorithmic recommendations at Stitch Fix. It outlines three key lessons: 1) success can be measured in multiple ways, such as agreement between humans and algorithms, stylist efficiency, and user experience; 2) it pays to model both item selection and item success, and disagreements between the two models provide useful feedback; 3) humans can decline algorithmic recommendations, and this selective non-compliance complicates experiments. The overall message is that combining human and algorithmic systems can be very effective, but it requires careful thought about objectives and about how humans interact with, and give feedback to, the algorithms.
Combining Statistics and Expert Judgment for Better Recommendations
1. Combining Statistics and
Expert Human Judgment
for Better Recommendations
Brad Klingenberg, Stitch Fix
brad@stitchfix.com
MLconf San Francisco 2015
2. Three lessons
3–6. Lessons from having humans in the loop
Humans in the loop: it works really well, but it's complicated
Lesson 1: There's more than one way to measure success
Lesson 2: You have to think carefully about what you're predicting
Lesson 3: Humans can say "no", and this complicates experiments
18. Measuring success
In the end, you are usually interested in optimizing the overall probability of success, and this may make sense for the combined system. But when optimizing an algorithm, it is important to consider selection by the human in the loop.
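One way to make the role of selection explicit (my notation, not the slides'): an item can only succeed if a stylist actually selects and sends it, so the overall success probability factors as

```latex
P(\text{success}) = P(\text{selected}) \cdot P(\text{success} \mid \text{selected})
```

Optimizing only the second factor silently conditions on the humans' choices.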
19–22. Optimizing interaction
For a set of algorithms with the same marginal performance, we generally prefer the algorithms that
● increase agreement and reduce needed searching (credible and useful recommendations)
● make the humans more efficient (effortless curation)
● have a better user experience (fewer bad or annoying recommendations)
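A toy sketch of what such interaction metrics might look like in code. This is my illustration, not Stitch Fix's pipeline: `agreement_rate` and `mean_search_depth` are hypothetical helpers, and the session logs are made up.

```python
# Toy illustration of Lesson 1's interaction metrics. Everything here is a
# made-up stand-in, not Stitch Fix's actual pipeline or data.

def agreement_rate(ranked_lists, stylist_picks, k=10):
    """Fraction of sessions where the stylist's pick was in the algorithm's top k."""
    hits = sum(pick in ranked[:k] for ranked, pick in zip(ranked_lists, stylist_picks))
    return hits / len(stylist_picks)

def mean_search_depth(ranked_lists, stylist_picks):
    """Average rank at which the stylist found her pick (lower = less searching).
    Picks the algorithm never surfaced get the maximum penalty."""
    depths = [ranked.index(pick) + 1 if pick in ranked else len(ranked) + 1
              for ranked, pick in zip(ranked_lists, stylist_picks)]
    return sum(depths) / len(depths)

# Toy session logs: the algorithm's ranked list and what the stylist chose
ranked_lists = [["dress_1", "top_4", "jeans_2"], ["top_4", "dress_1", "skirt_9"]]
stylist_picks = ["top_4", "skirt_9"]

print(agreement_rate(ranked_lists, stylist_picks, k=2))  # 0.5
print(mean_search_depth(ranked_lists, stylist_picks))    # (2 + 3) / 2 = 2.5
```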
24. Lesson 2: You have to think carefully about what you're predicting
25. Training a model
What should you predict?
Naive approach: ignore selection and train on success data
Advantages
● “traditional” supervised problem
● simple historical data
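A minimal sketch of the naive approach, assuming standard scikit-learn tooling and synthetic stand-in data:

```python
# Minimal sketch of the naive approach: fit a success model only on items
# that stylists actually sent, ignoring how those items were selected.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_sent = rng.normal(size=(1000, 5))      # features of shipped items (synthetic)
y_kept = rng.integers(0, 2, size=1000)   # 1 = client kept the item (synthetic)

success_model = LogisticRegression().fit(X_sent, y_kept)
p_success = success_model.predict_proba(X_sent)[:, 1]
# Caveat: these estimates are conditioned on stylist selection, so they may
# not transfer to items a stylist would never have sent in the first place.
```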
31. Training a model
You should probably consider both. It is most interesting when they disagree:
selection model vs. success model
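One way this comparison might be operationalized, assuming both models expose a scikit-learn-style predict_proba (the thresholds are arbitrary assumptions); whether a flagged item is a good or a bad disagreement, as the next slides distinguish, still takes human judgment:

```python
import numpy as np

def find_disagreements(X_candidates, selection_model, success_model,
                       select_lo=0.2, success_hi=0.7):
    """Indices of items the success model likes but stylists rarely pick."""
    p_select = selection_model.predict_proba(X_candidates)[:, 1]   # P(stylist picks it)
    p_success = success_model.predict_proba(X_candidates)[:, 1]    # P(client keeps it)
    return np.flatnonzero((p_success >= success_hi) & (p_select <= select_lo))
```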
32–33. Good disagreement
Ignoring an inappropriate recommendation
Client request: "I need an outfit for a glamorous night out!"
34–35. Bad disagreement
Stylist not choosing something that would be successful
Could lack trust in the recommendation: importance of transparency
[Image: an item with predicted probability of success = 85%, based on her recent purchase, passed over by the stylist]
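As a tiny illustration of the transparency point (everything here is made up), one could surface the evidence behind a confident score so the stylist can trust it, or knowledgeably override it:

```python
# Hypothetical: show the "why" next to the score, not the score alone
recommendation = {
    "item_id": "blouse_7",
    "p_success": 0.85,
    "reason": "similar to a recent purchase the client kept",
}
print(f"{recommendation['item_id']}: {recommendation['p_success']:.0%} "
      f"({recommendation['reason']})")
# blouse_7: 85% (similar to a recent purchase the client kept)
```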
36. Lesson 3: Humans can say "no", and this complicates experiments
-or-
“the downside of free will”
37–38. Testing with humans in the loop
Toy example: Suppose we want to test a (bad) new policy
New rule: all fixes must contain polka dots!
40. Selective non-compliance
Humans may not comply. Or, they may comply only selectively.
[Image: a stylist in the test group (Polka Dots Rule) sees client X's note, "Please don't send me any polka dots", and thinks "Hmm, no"]
43–44. Selective non-compliance
Humans help avoid bad choices - this is great for the client!
But, this can obscure the effect you are trying to measure.
Helpful analogy: non-compliance in clinical trials. This has been intensively studied.
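To make the analogy concrete, here is a hedged toy simulation (all effect sizes invented) of the polka-dot test. The intent-to-treat estimate measures the deployed system, stylists' vetoes included, and comes out much smaller than what the policy would do under full compliance: the vetoes protect clients and obscure the policy's own effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
test = rng.random(n) < 0.5          # randomized to the polka-dot rule
hates_dots = rng.random(n) < 0.3    # clients who would react badly

# Selective non-compliance: stylists apply the rule only when it looks safe
treated = test & ~hates_dots

def simulate(applied):
    # Assumed outcomes: dots cost 25 points of success for dot-haters,
    # 5 points for everyone else; baseline success is 50%
    p = 0.50 - 0.25 * (applied & hates_dots) - 0.05 * (applied & ~hates_dots)
    return rng.random(n) < p

kept = simulate(treated)
itt = kept[test].mean() - kept[~test].mean()
print(f"Measured (intent-to-treat) effect: {itt:+.3f}")   # about -0.035

kept_full = simulate(test)  # counterfactual: everyone complies
full = kept_full[test].mean() - kept_full[~test].mean()
print(f"Effect under full compliance:      {full:+.3f}")  # about -0.110
```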
45. Lessons from having humans in the loop
Humans in the loop: it works really well, but it's complicated
Lesson 1: There’s more than one way to measure success
Lesson 2: You have to think carefully about what you’re predicting
Lesson 3: Humans can say “no”, and this complicates experiments