This document summarizes lessons learned from combining human judgment with algorithmic recommendations at Stitch Fix. It outlines three key lessons: 1) success can be measured in multiple ways, such as agreement between humans and algorithms, stylist efficiency, and user experience; 2) it pays to model both item selection and item success, and disagreements between the two models provide useful feedback; 3) humans can decline algorithmic recommendations, and this selective non-compliance complicates experiments. The overall message is that combining human and algorithmic systems can be very effective, but it requires careful thought about objectives and about how humans interact with, and give feedback to, the algorithms.
Combining Statistics and Expert Judgment for Better Recommendations
1. Combining Statistics and
Expert Human Judgment
for Better Recommendations
Brad Klingenberg, Stitch Fix
brad@stitchfix.com
MLconf San Francisco 2015
2. Three lessons
3–6. Lessons from having humans in the loop
Humans in the loop: it works really well, but it's complicated
Lesson 1: There's more than one way to measure success
Lesson 2: You have to think carefully about what you're predicting
Lesson 3: Humans can say "no", and this complicates experiments
18. Measuring success
In the end, you are usually interested in optimizing the overall probability of success, and this may make sense for the combined system. But when optimizing an algorithm, it is important to consider selection by the human in the loop.
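One way to make the role of selection explicit (my notation, not the slides'): an item can only succeed if a stylist actually selects and sends it, so the overall success probability factors as

```latex
P(\text{success}) = P(\text{selected}) \cdot P(\text{success} \mid \text{selected})
```

Optimizing only the second factor silently conditions on the humans' choices.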
19–22. Optimizing interaction
For a set of algorithms with the same marginal performance, we generally prefer the algorithms that
● increase agreement and reduce needed searching (credible and useful recommendations)
● make the humans more efficient (effortless curation)
● have a better user experience (fewer bad or annoying recommendations)
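A toy sketch of what such interaction metrics might look like in code. This is my illustration, not Stitch Fix's pipeline: `agreement_rate` and `mean_search_depth` are hypothetical helpers, and the session logs are made up.

```python
# Toy illustration of Lesson 1's interaction metrics. Everything here is a
# made-up stand-in, not Stitch Fix's actual pipeline or data.

def agreement_rate(ranked_lists, stylist_picks, k=10):
    """Fraction of sessions where the stylist's pick was in the algorithm's top k."""
    hits = sum(pick in ranked[:k] for ranked, pick in zip(ranked_lists, stylist_picks))
    return hits / len(stylist_picks)

def mean_search_depth(ranked_lists, stylist_picks):
    """Average rank at which the stylist found her pick (lower = less searching).
    Picks the algorithm never surfaced get the maximum penalty."""
    depths = [ranked.index(pick) + 1 if pick in ranked else len(ranked) + 1
              for ranked, pick in zip(ranked_lists, stylist_picks)]
    return sum(depths) / len(depths)

# Toy session logs: the algorithm's ranked list and what the stylist chose
ranked_lists = [["dress_1", "top_4", "jeans_2"], ["top_4", "dress_1", "skirt_9"]]
stylist_picks = ["top_4", "skirt_9"]

print(agreement_rate(ranked_lists, stylist_picks, k=2))  # 0.5
print(mean_search_depth(ranked_lists, stylist_picks))    # (2 + 3) / 2 = 2.5
```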
24. Lesson 2: You have to think carefully about what you're predicting
25. Training a model
What should you predict?
Naive approach: ignore selection and train on success data
Advantages
● “traditional” supervised problem
● simple historical data
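A minimal sketch of the naive approach, assuming standard scikit-learn tooling and synthetic stand-in data:

```python
# Minimal sketch of the naive approach: fit a success model only on items
# that stylists actually sent, ignoring how those items were selected.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_sent = rng.normal(size=(1000, 5))      # features of shipped items (synthetic)
y_kept = rng.integers(0, 2, size=1000)   # 1 = client kept the item (synthetic)

success_model = LogisticRegression().fit(X_sent, y_kept)
p_success = success_model.predict_proba(X_sent)[:, 1]
# Caveat: these estimates are conditioned on stylist selection, so they may
# not transfer to items a stylist would never have sent in the first place.
```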
31. Training a model
You should probably consider both. It is most interesting when they disagree:
selection model vs. success model
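One way this comparison might be operationalized, assuming both models expose a scikit-learn-style predict_proba (the thresholds are arbitrary assumptions); whether a flagged item is a good or a bad disagreement, as the next slides distinguish, still takes human judgment:

```python
import numpy as np

def find_disagreements(X_candidates, selection_model, success_model,
                       select_lo=0.2, success_hi=0.7):
    """Indices of items the success model likes but stylists rarely pick."""
    p_select = selection_model.predict_proba(X_candidates)[:, 1]   # P(stylist picks it)
    p_success = success_model.predict_proba(X_candidates)[:, 1]    # P(client keeps it)
    return np.flatnonzero((p_success >= success_hi) & (p_select <= select_lo))
```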
32–33. Good disagreement
Ignoring an inappropriate recommendation
Client request: "I need an outfit for a glamorous night out!"
34–35. Bad disagreement
Stylist not choosing something that would be successful
Could lack trust in the recommendation: importance of transparency
[Image: an item with predicted probability of success = 85%, based on her recent purchase, passed over by the stylist]
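As a tiny illustration of the transparency point (everything here is made up), one could surface the evidence behind a confident score so the stylist can trust it, or knowledgeably override it:

```python
# Hypothetical: show the "why" next to the score, not the score alone
recommendation = {
    "item_id": "blouse_7",
    "p_success": 0.85,
    "reason": "similar to a recent purchase the client kept",
}
print(f"{recommendation['item_id']}: {recommendation['p_success']:.0%} "
      f"({recommendation['reason']})")
# blouse_7: 85% (similar to a recent purchase the client kept)
```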
36. Lesson 3: Humans can say "no", and this complicates experiments
-or-
“the downside of free will”
37–38. Testing with humans in the loop
Toy example: Suppose we want to test a (bad) new policy
New rule: all fixes must contain polka dots!
40. Selective non-compliance
Humans may not comply. Or, they may comply only selectively.
[Image: a stylist in the test group (Polka Dots Rule) sees client X's note, "Please don't send me any polka dots", and thinks "Hmm, no"]
43–44. Selective non-compliance
Humans help avoid bad choices - this is great for the client!
But, this can obscure the effect you are trying to measure.
Helpful analogy: non-compliance in clinical trials. This has been intensively studied.
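To make the analogy concrete, here is a hedged toy simulation (all effect sizes invented) of the polka-dot test. The intent-to-treat estimate measures the deployed system, stylists' vetoes included, and comes out much smaller than what the policy would do under full compliance: the vetoes protect clients and obscure the policy's own effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
test = rng.random(n) < 0.5          # randomized to the polka-dot rule
hates_dots = rng.random(n) < 0.3    # clients who would react badly

# Selective non-compliance: stylists apply the rule only when it looks safe
treated = test & ~hates_dots

def simulate(applied):
    # Assumed outcomes: dots cost 25 points of success for dot-haters,
    # 5 points for everyone else; baseline success is 50%
    p = 0.50 - 0.25 * (applied & hates_dots) - 0.05 * (applied & ~hates_dots)
    return rng.random(n) < p

kept = simulate(treated)
itt = kept[test].mean() - kept[~test].mean()
print(f"Measured (intent-to-treat) effect: {itt:+.3f}")   # about -0.035

kept_full = simulate(test)  # counterfactual: everyone complies
full = kept_full[test].mean() - kept_full[~test].mean()
print(f"Effect under full compliance:      {full:+.3f}")  # about -0.110
```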
45. Lessons from having humans in the loop
Humans in the loop: it works really well, but it's complicated
Lesson 1: There’s more than one way to measure success
Lesson 2: You have to think carefully about what you’re predicting
Lesson 3: Humans can say “no”, and this complicates experiments