2. Background
• Large ecommerce platform
• 240K+ current customers
• Many more shops created (churned or never reached customer status)
4. Problem
● In most cases we have no information about a shop's industry
1st solution
● Ask them
2nd solution
● We have HTML product descriptions for each shop
● We have labelled data (Mechanical Turk)
● Train a classifier
5. Context
• Started during a Shopify Hack Day
• Pursued as a side project at work
• Used sk-learn and
• Moved to Spark MLlib for full-scale testing and production
• Now in production
7. Getting Label Data
• Asked Amazon Mechanical Turkers to assess 80K stores
• Each store had to be assigned to one of 15 verticals
• Involved hundreds of turkers
8. Input (80K shops)

Shop  Aggregated product data
1     “Nice octopolo shirt !…”
2     “Nice hat and nice shirt …”
3     “Set of <b> tires </b> …”
4     “Beef and more beef…”
5     “Tire set for bikes”
...   ...
9. Cleaning (80K shops)

Shop  Text
1     “nice octopolo shirt…”
2     “nice hat and nice shirt…”
3     “set tire…”
4     “beef beef…”
5     “tire set bike”
...   ...

• HTML code removed
• Stop words removed
• Words stemmed
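The three cleaning steps above can be sketched in plain Python. This is only an illustration: the stop-word list and the `toy_stem` helper are hypothetical stand-ins (the deck does not say which stop-word list or stemmer was used), and a real pipeline would more likely rely on a proper stemmer such as Porter's.

```python
import html
import re

# Tiny illustrative stop-word list (assumption: the real list is not given).
STOP_WORDS = {"a", "an", "and", "for", "more", "of", "the"}

TAG_RE = re.compile(r"<[^>]+>")      # matches HTML tags such as <b> or </b>
TOKEN_RE = re.compile(r"[a-z]+")     # keeps alphabetic tokens only


def toy_stem(word: str) -> str:
    """Strip a trailing plural 's' (hypothetical stand-in for a real stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean(raw: str) -> str:
    """Remove HTML, lowercase, drop stop words, then stem each word."""
    text = TAG_RE.sub(" ", raw)             # HTML code removed
    text = html.unescape(text).lower()      # decode entities such as &amp;
    tokens = TOKEN_RE.findall(text)
    kept = [toy_stem(t) for t in tokens if t not in STOP_WORDS]
    return " ".join(kept)


print(clean("Set of <b> tires </b>"))
```

With these toy rules, the shop 3 input from the slide comes out as the "set tire" text shown in the cleaned column.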
10. Term Frequency

Shops  nice  octopolo  shirt  hat  set  tires  beef  bike  ...  label
1      1     1         1      0    0    0      0     0     ...  Apparel
2      2     0         1      1    0    0      0     0     ...  Apparel
3      0     0         0      0    1    1      0     0     ...  Auto
4      0     0         0      0    0    0      2     0     ...  Food
5      0     0         0      0    1    1      0     1     ...  Auto
...    ...   ...       ...    ...  ...  ...    ...   ...   ...  ...

Columns: 10K words (8 in the example)
Rows: 80K shops
Labels: joined from Mechanical Turk
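A minimal sketch of how such a term-frequency table could be built and joined with the Mechanical Turk labels, using only the standard library. The variable names (`cleaned`, `turk_labels`, `rows`) are illustrative assumptions, and the five shops are the toy examples from the slides, not real data.

```python
from collections import Counter

# Cleaned text per shop id (toy examples from the slides).
cleaned = {
    1: "nice octopolo shirt",
    2: "nice hat nice shirt",
    3: "set tire",
    4: "beef beef",
    5: "tire set bike",
}

# Labels collected from Mechanical Turk, keyed by shop id.
turk_labels = {1: "Apparel", 2: "Apparel", 3: "Auto", 4: "Food", 5: "Auto"}

# Fixed column order for the table: one column per distinct word.
vocabulary = sorted({word for text in cleaned.values() for word in text.split()})

rows = []
for shop_id, text in cleaned.items():
    if shop_id not in turk_labels:
        continue  # inner join: keep only shops that have a label
    counts = Counter(text.split())
    rows.append((shop_id, [counts[word] for word in vocabulary], turk_labels[shop_id]))

print(vocabulary)
for row in rows:
    print(row)
```

At full scale the same counting would be done with sparse vectors, since most of the 10K columns are zero for any given shop.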
12. Naïve Bayes Model

Term frequencies (80K shops):

Shops  nice  octopolo  shirt  hat  set  tires  beef  bike  label
1      1     1         1      0    0    0      0     0     Apparel
2      2     0         1      1    0    0      0     0     Apparel
3      0     0         0      0    1    1      0     0     Auto
4      0     0         0      0    0    0      2     0     Food
5      0     0         0      0    1    1      0     1     Auto

Model parameters (15 labels, one row per label, one cell per word plus a prior):

Shops 1, 2 · label Apparel · prior P(apparel)
  P(nice | apparel), P(octopolo | apparel), P(shirt | apparel), P(hat | apparel),
  P(set | apparel), P(tires | apparel), P(beef | apparel), P(bike | apparel)

Shops 3, 5 · label Auto · prior P(auto)
  P(nice | auto), P(octopolo | auto), P(shirt | auto), P(hat | auto),
  P(set | auto), P(tires | auto), P(beef | auto), P(bike | auto)

Shop 4 · label Food · prior P(food)
  P(nice | food), P(octopolo | food), P(shirt | food), P(hat | food),
  P(set | food), P(tires | food), P(beef | food), P(bike | food)
13. What and why
• These are the model parameters
• Needed as input to the prediction formula
\[ \text{predicted class} = \arg\max_{\mathrm{cl}} P(\mathrm{cl} \mid \mathrm{doc}) \]
14. What and why
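\[ P(\mathrm{cl} \mid \mathrm{doc}) = \frac{P(\mathrm{cl}) \, P(\mathrm{doc} \mid \mathrm{cl})}{P(\mathrm{doc})} \]
(Bayes Theorem)
\[ \propto P(\mathrm{cl}) \, P(\mathrm{doc} \mid \mathrm{cl}) \]
(the denominator is the same for every class, so it does not matter when comparing likelihoods)
\[ = P(\mathrm{cl}) \, P(\mathrm{wd}_1 \mid \mathrm{cl}) \, P(\mathrm{wd}_2 \mid \mathrm{cl}) \cdots P(\mathrm{wd}_n \mid \mathrm{cl}) \]
(with the conditional independence assumption, which is actually violated in practice)
\[ \text{predicted class} = \arg\max_{\mathrm{cl}} P(\mathrm{cl} \mid \mathrm{doc}) \]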
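As a rough illustration of the prediction formula, the sketch below scores each class with P(cl) times the product of P(wd | cl) and takes the argmax. All probabilities in `priors` and `likelihoods` are made-up numbers chosen for the example, not values from the real model.

```python
from math import prod

# Hypothetical model parameters (illustrative values only).
priors = {"Apparel": 0.4, "Auto": 0.4, "Food": 0.2}
likelihoods = {
    "Apparel": {"nice": 0.30, "shirt": 0.30, "tire": 0.05},
    "Auto": {"nice": 0.05, "shirt": 0.05, "tire": 0.40},
    "Food": {"nice": 0.10, "shirt": 0.05, "tire": 0.05},
}


def unnormalised_posterior(words, cl):
    """P(cl) * P(wd_1 | cl) * ... * P(wd_n | cl), i.e. Bayes without the denominator."""
    return priors[cl] * prod(likelihoods[cl][w] for w in words)


def predict(words):
    """Return the argmax class and the normalised posteriors."""
    scores = {cl: unnormalised_posterior(words, cl) for cl in priors}
    total = sum(scores.values())  # plays the role of P(doc)
    posteriors = {cl: score / total for cl, score in scores.items()}
    return max(scores, key=scores.get), posteriors


label, posteriors = predict(["nice", "shirt"])
print(label, posteriors)
```

Dividing by `total` rescales every class by the same amount, which is why the argmax can be taken on the unnormalised scores.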
15. Numerical Limitation
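• Multiplying many values close to 0 -> float underflow
\[ P(\mathrm{cl} \mid \mathrm{doc}) \propto P(\mathrm{cl}) \, P(\mathrm{wd}_1 \mid \mathrm{cl}) \, P(\mathrm{wd}_2 \mid \mathrm{cl}) \cdots P(\mathrm{wd}_n \mid \mathrm{cl}) \]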
16. Numerical limitation
From before:
\[ P(\mathrm{cl} \mid \mathrm{doc}) \propto P(\mathrm{cl}) \, P(\mathrm{wd}_1 \mid \mathrm{cl}) \, P(\mathrm{wd}_2 \mid \mathrm{cl}) \cdots P(\mathrm{wd}_n \mid \mathrm{cl}) \]
• Way around: take the log, which leads to a summation instead of a multiplication
\[ \log P(\mathrm{cl} \mid \mathrm{doc}) = \log P(\mathrm{cl}) + \log P(\mathrm{wd}_1 \mid \mathrm{cl}) + \log P(\mathrm{wd}_2 \mid \mathrm{cl}) + \dots + \log P(\mathrm{wd}_n \mid \mathrm{cl}) + \text{const} \]
• No impact on comparisons across classes: the log is monotonic, and the constant, -log P(doc), is the same for every class
• The model table now stores Log(P(..)) in every word cell and Log(P(..)) for every prior (Apparel, Auto, Food)
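The underflow and the log workaround can be reproduced in a few lines. The numbers below are hypothetical: a 400-word document in which every word has the same probability under a given class, which is unrealistic but enough to show the effect.

```python
import math

# Hypothetical per-word probabilities and priors (illustrative values only).
p_word = {"Apparel": 1e-3, "Auto": 1e-4}
prior = {"Apparel": 0.5, "Auto": 0.5}
n_words = 400


def product_score(cl):
    """Direct product of probabilities: underflows to 0.0 for long documents."""
    score = prior[cl]
    for _ in range(n_words):
        score *= p_word[cl]
    return score


def log_score(cl):
    """Same quantity in log space: a sum of logs that stays finite."""
    return math.log(prior[cl]) + n_words * math.log(p_word[cl])


print({cl: product_score(cl) for cl in prior})  # both products collapse to 0.0
print({cl: log_score(cl) for cl in prior})      # log scores remain comparable
print(max(prior, key=log_score))
```

Both direct products collapse to zero, so the classes can no longer be ranked, whereas the log scores still separate them.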
17. Getting cell probabilities
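Priors:
\[ P(\mathrm{cl}) = \frac{N_{\mathrm{cl}}}{N} \]
Word cells:
\[ P(\mathrm{wd}_n \mid \mathrm{cl}) = \frac{N_{\mathrm{cl},\mathrm{wd}_n}}{\sum_{\mathrm{wd} \in \mathrm{words}} N_{\mathrm{cl},\mathrm{wd}}} \approx \frac{N_{\mathrm{cl},\mathrm{wd}_n} + 1}{\sum_{\mathrm{wd} \in \mathrm{words}} \left( N_{\mathrm{cl},\mathrm{wd}} + 1 \right)} = \frac{N_{\mathrm{cl},\mathrm{wd}_n} + 1}{\sum_{\mathrm{wd} \in \mathrm{words}} N_{\mathrm{cl},\mathrm{wd}} + \mathrm{vocab}} \]
The +1 deals with P(wd|cl)=0, which would make P(cl|doc)=0 regardless of the other words.
Here N_cl is the number of shops labelled cl, N the total number of labelled shops, N_{cl,wd} the count of word wd across shops labelled cl, and vocab the number of distinct words.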
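The add-one estimate above can be sketched as a small multinomial Naïve Bayes trainer. This is a toy, standard-library version run on the five example shops from the slides, not the production Spark MLlib job. Ignoring words that were never seen in training is an assumption made here for simplicity.

```python
import math
from collections import Counter, defaultdict

# (cleaned text, label) pairs: the five toy shops from the slides.
TRAINING = [
    ("nice octopolo shirt", "Apparel"),
    ("nice hat nice shirt", "Apparel"),
    ("set tire", "Auto"),
    ("beef beef", "Food"),
    ("tire set bike", "Auto"),
]


def train(docs):
    """Estimate log priors and add-one smoothed log likelihoods."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for text, label in docs:
        word_counts[label].update(text.split())
    vocab = sorted({w for counts in word_counts.values() for w in counts})

    log_prior = {cl: math.log(n / len(docs)) for cl, n in class_counts.items()}
    log_likelihood = {}
    for cl, counts in word_counts.items():
        denom = sum(counts.values()) + len(vocab)  # sum of counts + vocab size
        log_likelihood[cl] = {w: math.log((counts[w] + 1) / denom) for w in vocab}
    return log_prior, log_likelihood


def predict(text, log_prior, log_likelihood):
    """Argmax over classes of log P(cl) + sum of log P(wd | cl)."""
    scores = {}
    for cl in log_prior:
        known = [w for w in text.split() if w in log_likelihood[cl]]  # skip unseen words
        scores[cl] = log_prior[cl] + sum(log_likelihood[cl][w] for w in known)
    return max(scores, key=scores.get)


log_prior, log_likelihood = train(TRAINING)
print(predict("nice beef shirt", log_prior, log_likelihood))
```

Without the +1, the word "beef" would give Apparel a probability of zero for "nice beef shirt" even though the other two words point strongly to it.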
21. Code

class LabeledDataFilter:
    ...
class Featurizer:
    ...
class Trainer:
    ...
class Evaluator:
    ...
class Predictor:
    ...
class verticalPredictor:
    # uses Featurizer() and Predictor()
    ...

Training job (every 7 days): product_data -> model, accuracy
Prediction job (every day): product_data, model -> shop+industry
22. Change in Training Set
• Start of the home card
• It lets us ask shops for their industry on a voluntary basis
• Quickly grew to 50K shops
• Advantage: the training set keeps growing over time
• Issue: the training set is not fully random
23. Shop Dimension
In the Data Warehouse, updated daily
Shop Name
Shop URL
Shop Address
Shop City
…
Shop Predicted Industry
…
24. Results
Shops    top category  turker 1  turker 2  turker 3
Chive    Apparel       Apparel   Apparel   Art
Lackers  Sports        Sports    Apparel   Sports
Tesla    Auto          Auto      Auto      Sports
...      ...           ...       ...
60-80%
25. Results
Shops    top category  turker 1  turker 2  turker 3  algo top1  algo top2  algo top3
Chive    Apparel       Apparel   Apparel   Art       Apparel    Sport      Art
Lackers  Sports        Sports    Apparel   Sports    Sports     Apparel    Food
Tesla    Auto          Auto      Auto      Sports    Fashion    Auto       Electro
...      ...           ...       ...
60-80% ~65%
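One way to score ranked predictions like these is top-k accuracy: a shop counts as a hit if the reference category appears among the first k predictions. The sketch below assumes the "top category" column is the reference. The function name `top_k_accuracy` is illustrative, and the three rows are just the example shops, so the resulting figures are not the percentages reported on the slides.

```python
# Reference category per shop (the "top category" column) and ranked predictions.
reference = {"Chive": "Apparel", "Lackers": "Sports", "Tesla": "Auto"}
ranked = {
    "Chive": ["Apparel", "Sport", "Art"],
    "Lackers": ["Sports", "Apparel", "Food"],
    "Tesla": ["Fashion", "Auto", "Electro"],
}


def top_k_accuracy(reference, ranked, k):
    """Share of shops whose reference category is within the first k predictions."""
    hits = sum(reference[shop] in ranked[shop][:k] for shop in reference)
    return hits / len(reference)


for k in (1, 2, 3):
    print(k, top_k_accuracy(reference, ranked, k))
```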
26. Results
Shops    top category  turker 1  turker 2  turker 3  algo top1  algo top2  algo top3
Chive    Apparel       Apparel   Apparel   Art       Apparel    Sport      Art
Lackers  Sports        Sports    Apparel   Sports    Sports     Apparel    Food
Tesla    Auto          Auto      Auto      Sports    unknown    Auto       Electro
...      ...           ...       ...
90%
~75%
27. Business Use
Management or product teams:
• What are the biggest industries by shop count and by sales?
• How does that evolve over time?
Theme team:
• We want to develop new themes for a given vertical. Can we see the top stores in this vertical to understand trends?
Event team:
• We want to take part in an event in the music business. Can we get a list of interesting shops in this field?
28. Could be improved
● More metrics: add multiclass precision/recall
○ Now available in MLlib
● Better performance: rerun for combinations of parameters
○ Also added recently to MLlib, but still missing some components
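For reference, per-class precision and recall can be computed from true and predicted labels as in the sketch below. It is a plain-Python illustration on made-up labels, not the MLlib implementation, and the function name `precision_recall` is an assumption.

```python
from collections import Counter


def precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for a multiclass problem."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        report[label] = (precision, recall)
    return report


# Made-up labels for five shops (illustrative only).
y_true = ["Apparel", "Apparel", "Auto", "Food", "Auto"]
y_pred = ["Apparel", "Auto", "Auto", "Food", "Auto"]
print(precision_recall(y_true, y_pred))
```

Per-class figures like these show which verticals the classifier confuses, which a single overall accuracy number hides.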