Shop vertical classification - Meetup Presentation

Shop Vertical Classification
Arthur Prévot
Meetup Machine Learning – Toronto – March 1st 2016

Background
• Large ecommerce platform
• 240K+ current customers
• Many more shops created (churned or didn’t make it to customer status)

Problem
● No information about a shop's industry in most cases
1st solution
● Ask them
2nd solution
● We have HTML product descriptions for each shop
● We have labelled data (Mechanical Turk)
● So: build a classifier

Context
• Started during a Shopify Hack Day
• Pursued as a side project at work
• Used scikit-learn for early tests
• Moved to Spark MLlib for full-scale testing and production
• Now in production

Getting Label Data
• Asked Amazon Mechanical Turk workers (Turkers) to assess 80K stores
• Each Turker had to choose among 15 verticals
• Involved hundreds of Turkers

Input (80K shops)
Shop  Aggregated product data
1     “Nice octopolo shirt !…”
2     “Nice hat and nice shirt …”
3     “Set of <b> tires </b> …”
4     “Beef and more beef…”
5     “Tire set for bikes”
...   ...

Cleaning (80K shops)
Shop  Text
1     “nice octopolo shirt…”
2     “nice hat and nice shirt…”
3     “set tire…”
4     “beef beef…”
5     “tire set bike”
...   ...
• HTML code removed
• Stop words removed
• Words stemmed
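The three cleaning steps above can be sketched in a few lines of Python. This is only an illustration, not the code used in the talk: the stop-word list is a tiny hypothetical one, and `stem` is a crude plural stripper standing in for a real stemmer such as Porter's.

```python
import re
from html import unescape

# Tiny illustrative stop-word list (an assumption); a real job would use a larger one.
STOP_WORDS = {"a", "an", "and", "for", "of", "more", "the"}

def strip_html(text):
    """Drop tags such as <b>...</b> and decode entities such as &amp;."""
    return unescape(re.sub(r"<[^>]+>", " ", text))

def stem(word):
    """Crude plural stripping; a stand-in for a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def clean(text):
    """HTML removed, lower-cased, stop words removed, words stemmed."""
    tokens = re.findall(r"[a-z]+", strip_html(text).lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(clean("Set of <b> tires </b>"))
```

Applied to the example rows, `clean("Tire set for bikes")` gives the tokens `tire`, `set`, `bike`, matching the cleaned text shown on the slide.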

Term Frequency (80K shops × 10K words, 8 words in this example; labels joined from Mechanical Turk)
Shop  nice  octopolo  shirt  hat  set  tires  beef  bike  ...  label
1     1     1         1      0    0    0      0     0     ...  Apparel
2     2     0         1      1    0    0      0     0     ...  Apparel
3     0     0         0      0    1    1      0     0     ...  Auto
4     0     0         0      0    0    0      2     0     ...  Food
5     0     0         0      0    1    1      0     1     ...  Auto
...   ...   ...       ...    ...  ...  ...    ...   ...   ...  ...
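A minimal sketch of how such a term-frequency table could be built from the cleaned tokens, using only the standard library. The helper names `build_vocabulary` and `term_frequency` are hypothetical, and the columns come out in alphabetical order rather than the order used on the slide.

```python
from collections import Counter

# Cleaned token lists for the five example shops (taken from the slides).
docs = {
    1: ["nice", "octopolo", "shirt"],
    2: ["nice", "hat", "nice", "shirt"],
    3: ["set", "tire"],
    4: ["beef", "beef"],
    5: ["tire", "set", "bike"],
}

def build_vocabulary(docs):
    """Sorted list of every distinct word across all shops."""
    return sorted({word for tokens in docs.values() for word in tokens})

def term_frequency(tokens, vocabulary):
    """Count vector for one shop, aligned with the vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vocabulary = build_vocabulary(docs)
tf = {shop: term_frequency(tokens, vocabulary) for shop, tokens in docs.items()}

for shop, row in tf.items():
    print(shop, dict(zip(vocabulary, row)))
```

At the real scale (80K shops, 10K words) most cells are zero, so a sparse representation would be the natural choice.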

Model
• Ran a few quick tests with scikit-learn and settled on Naïve Bayes

Naïve Bayes Model
The term-frequency table (80K shops) is reduced to one row of parameters per label (15 labels):
Shops  label    prior       word likelihoods
1, 2   Apparel  P(apparel)  P(nice | apparel), P(octopolo | apparel), P(shirt | apparel), P(hat | apparel), P(set | apparel), P(tires | apparel), P(beef | apparel), P(bike | apparel)
3, 5   Auto     P(auto)     P(nice | auto), P(octopolo | auto), P(shirt | auto), P(hat | auto), P(set | auto), P(tires | auto), P(beef | auto), P(bike | auto)
4      Food     P(food)     P(nice | food), P(octopolo | food), P(shirt | food), P(hat | food), P(set | food), P(tires | food), P(beef | food), P(bike | food)

What and why
• These are the model parameters
• Needed as input to the prediction formula
$$\text{predicted class} = \arg\max_{cl} P(cl \mid doc)$$

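What and why
$$\begin{aligned}
P(cl \mid doc) &= \frac{P(cl)\,P(doc \mid cl)}{P(doc)} \\
&\propto P(cl)\,P(doc \mid cl) \\
&= P(cl)\,P(wd_1 \mid cl)\,P(wd_2 \mid cl) \cdots P(wd_n \mid cl)
\end{aligned}$$
• First line: Bayes' theorem
• Second line: the denominator is not needed to compare likelihoods across classes
• Third line: conditional independence assumption, which is actually violated
$$\text{predicted class} = \arg\max_{cl} P(cl \mid doc)$$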

Numerical Limitation
• Multiplying many values close to 0 leads to float underflow
$$P(cl \mid doc) \propto P(cl)\,P(wd_1 \mid cl)\,P(wd_2 \mid cl) \cdots P(wd_n \mid cl)$$

Numerical limitation
• Workaround: take the log, which turns the multiplication into a summation
• No impact on comparisons across classes
• Each cell of the parameter table then stores Log(P(..)) instead of P(..)
From before:
$$P(cl \mid doc) \propto P(cl)\,P(wd_1 \mid cl)\,P(wd_2 \mid cl) \cdots P(wd_n \mid cl)$$
so, up to an additive constant that is the same for every class:
$$\log P(cl \mid doc) = \log P(cl) + \log P(wd_1 \mid cl) + \log P(wd_2 \mid cl) + \dots + \log P(wd_n \mid cl) + \text{const}$$
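The underflow and its workaround can be reproduced with a small sketch. The document length (500 words) and the per-word likelihoods are made-up numbers chosen only to trigger the problem.

```python
import math

def product_score(prior, likelihoods):
    """Naive product P(cl) * P(wd_1|cl) * ... * P(wd_n|cl)."""
    score = prior
    for p in likelihoods:
        score *= p
    return score

def log_score(prior, likelihoods):
    """Same quantity in log space: a sum of logs instead of a product."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# Hypothetical 500-word document: every word has likelihood 1e-4 under
# class A and 2e-4 under class B; both classes share the same prior.
class_a = [1e-4] * 500
class_b = [2e-4] * 500

print(product_score(0.5, class_a), product_score(0.5, class_b))
print(log_score(0.5, class_a), log_score(0.5, class_b))
```

Both products collapse to zero, so the classes cannot be ranked, while the log scores stay finite and still rank class B above class A.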

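Getting cell probabilities
$$P(cl) = \frac{N_{cl}}{N}$$
$$P(wd_n \mid cl) = \frac{N_{cl,wd_n}}{\sum_{wd \in words} N_{cl,wd}} \approx \frac{N_{cl,wd_n} + 1}{\sum_{wd \in words} (N_{cl,wd} + 1)} = \frac{N_{cl,wd_n} + 1}{\sum_{wd \in words} N_{cl,wd} + vocab}$$
where N_cl is the number of shops in class cl, N the total number of shops, N_{cl,wd} the count of word wd in the shops of class cl, and vocab the number of distinct words.
The +1 deals with P(wd|cl)=0, which would make P(cl|doc)=0 regardless of the other words.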
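The add-one estimate is small enough to write directly. The sketch below uses exact fractions so the result can be compared with hand calculations; `smoothed_likelihood` is a hypothetical helper name, and the counts are those of the Apparel class in the five-shop example (7 words in total, 8-word vocabulary).

```python
from fractions import Fraction

def smoothed_likelihood(word_count, class_total, vocab_size):
    """Add-one (Laplace) estimate of P(wd | cl) as an exact fraction."""
    return Fraction(word_count + 1, class_total + vocab_size)

# Apparel counts for: nice, octopolo, shirt, hat, set, tires, beef, bike.
apparel_counts = [3, 1, 2, 1, 0, 0, 0, 0]
apparel = [smoothed_likelihood(c, sum(apparel_counts), len(apparel_counts))
           for c in apparel_counts]

print(apparel[0])    # P(nice | apparel)
print(sum(apparel))  # the smoothed likelihoods still sum to 1
```

Words never seen in a class get a small non-zero probability instead of zero, so a single unseen word no longer wipes out the whole product.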

Worked example (5 shops and 8 words here; 80K shops and 15 labels in the real data)
Counts come from the term-frequency table. Each word cell is (count of the word in the class + 1) / (total word count in the class + 8); the prior is the share of shops with that label.
Shops  label    prior  nice         octopolo     shirt        hat          set          tires        beef         bike
1, 2   Apparel  2/5    (3+1)/(7+8)  (1+1)/(7+8)  (2+1)/(7+8)  (1+1)/(7+8)  (0+1)/(7+8)  (0+1)/(7+8)  (0+1)/(7+8)  (0+1)/(7+8)
3, 5   Auto     2/5    (0+1)/(5+8)  (0+1)/(5+8)  (0+1)/(5+8)  (0+1)/(5+8)  (2+1)/(5+8)  (2+1)/(5+8)  (0+1)/(5+8)  (1+1)/(5+8)
4      Food     1/5    (0+1)/(2+8)  (0+1)/(2+8)  (0+1)/(2+8)  (0+1)/(2+8)  (0+1)/(2+8)  (0+1)/(2+8)  (2+1)/(2+8)  (0+1)/(2+8)
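Putting the pieces together, the five-shop example can be trained and queried with a short pure-Python sketch. It is not the scikit-learn or Spark MLlib code used in the talk; `train` and `predict` are hypothetical names, and skipping words unseen in training is a simplifying assumption.

```python
import math
from collections import Counter, defaultdict

# Cleaned tokens and labels for the five example shops from the slides.
TRAINING = [
    (["nice", "octopolo", "shirt"], "Apparel"),
    (["nice", "hat", "nice", "shirt"], "Apparel"),
    (["set", "tires"], "Auto"),
    (["beef", "beef"], "Food"),
    (["tires", "set", "bike"], "Auto"),
]

def train(examples):
    """Return log priors and add-one smoothed log likelihoods per class."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocabulary = sorted({w for tokens, _ in examples for w in tokens})
    log_prior = {c: math.log(n / len(examples)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / total) for w in vocabulary
        }
    return log_prior, log_likelihood

def predict(tokens, log_prior, log_likelihood):
    """Pick the class with the highest log posterior (up to a constant)."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped (a simplifying assumption).
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

log_prior, log_likelihood = train(TRAINING)
print(predict(["nice", "shirt"], log_prior, log_likelihood))
print(predict(["tires", "bike"], log_prior, log_likelihood))
```

The learned parameters match the table: for instance `exp(log_likelihood["Apparel"]["nice"])` is (3+1)/(7+8).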

Code
class LabeledDataFilter:
    ...

class Featurizer:
    ...

class Trainer:
    ...

class Evaluator:
    ...

class Predictor:
    ...

class verticalPredictor:
    # The prediction job reuses the featurizer and the predictor.
    def __init__(self):
        self.featurizer = Featurizer()
        self.predictor = Predictor()
    ...

Training job (every 7 days): product_data → model, accuracy
Prediction job (every day): product_data + model → shop + industry

Change in Training Set
• Start of the home card
• Allowed asking shops for their industry on a voluntary basis
• Quickly grew to 50K shops
• Advantage: keeps growing over time
• Issue: the training set is not fully random

In the Data Warehouse
Shop Dimension (updated daily):
• Shop Name
• Shop URL
• Shop Address
• Shop City
• …
• Shop Predicted Industry
• …

Results
Shops    top category  turker 1  turker 2  turker 3
Chive    Apparel       Apparel   Apparel   Art
Lackers  Sports        Sports    Apparel   Sports
Tesla    Auto          Auto      Auto      Sports
...      ...           ...       ...       ...
60-80% (turkers)

Results
Shops    top category  turker 1  turker 2  turker 3  algo top1  algo top2  algo top3
Chive    Apparel       Apparel   Apparel   Art       Apparel    Sport      Art
Lackers  Sports        Sports    Apparel   Sports    Sports     Apparel    Food
Tesla    Auto          Auto      Auto      Sports    Fashion    Auto       Electro
...      ...           ...       ...       ...       ...        ...        ...
60-80% (turkers)    ~65% (algo)
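Since the table lists the algorithm's top 1 to top 3 guesses, one natural way to score it is top-k accuracy. The sketch below is an assumption about how such a number could be computed, not a description of how the percentages on the slide were obtained; it uses only the three example rows shown.

```python
def top_k_accuracy(true_labels, ranked_predictions, k):
    """Share of shops whose true label appears in the first k predictions."""
    hits = sum(
        truth in ranked[:k]
        for truth, ranked in zip(true_labels, ranked_predictions)
    )
    return hits / len(true_labels)

# The three example rows from the slide: "top category" versus algo top1..top3.
true_labels = ["Apparel", "Sports", "Auto"]
ranked_predictions = [
    ["Apparel", "Sport", "Art"],
    ["Sports", "Apparel", "Food"],
    ["Fashion", "Auto", "Electro"],
]

for k in (1, 2, 3):
    print(k, top_k_accuracy(true_labels, ranked_predictions, k))
```

On these three rows the top-1 guess is right for two shops out of three, and the Tesla row is recovered once the second guess is allowed.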

Results
Shops    top category  turker 1  turker 2  turker 3  algo top1  algo top2  algo top3
Chive    Apparel       Apparel   Apparel   Art       Apparel    Sport      Art
Lackers  Sports        Sports    Apparel   Sports    Sports     Apparel    Food
Tesla    Auto          Auto      Auto      Sports    unknown    Auto       Electro
...      ...           ...       ...       ...       ...        ...        ...
90%
~75%

Business Use
Management or product teams:
• What are the biggest industries per shop count, per sales made?
• How does that evolve over time?
Theme team:
• We want to develop new themes for a given vertical. Can we see the top stores in this vertical to understand trends?
Event team:
• We want to be part of an event in the music business. Can we get interesting shops in this field?

Could be improved
● More metrics: add multiclass precision/recall
○ Now available in MLlib
● Better performance: rerun for combinations of parameters
○ Also added recently to MLlib, but missing some components
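For reference, per-class precision and recall for a multiclass classifier can be computed as in the sketch below. It is a plain-Python illustration rather than the MLlib API, `precision_recall` is a hypothetical helper, and the six labels are made up.

```python
from collections import Counter

def precision_recall(true_labels, predicted_labels):
    """Per-class (precision, recall) for a multiclass classifier."""
    true_positive = Counter()
    predicted_total = Counter(predicted_labels)
    actual_total = Counter(true_labels)
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == predicted:
            true_positive[truth] += 1
    report = {}
    for label in set(true_labels) | set(predicted_labels):
        precision = (true_positive[label] / predicted_total[label]
                     if predicted_total[label] else 0.0)
        recall = (true_positive[label] / actual_total[label]
                  if actual_total[label] else 0.0)
        report[label] = (precision, recall)
    return report

# Hypothetical labels for six shops, for illustration only.
truth = ["Apparel", "Apparel", "Auto", "Auto", "Food", "Food"]
preds = ["Apparel", "Auto", "Auto", "Auto", "Food", "Apparel"]
report = precision_recall(truth, preds)

for label, (precision, recall) in sorted(report.items()):
    print(label, round(precision, 2), round(recall, 2))
```

Unlike a single accuracy figure, this shows which verticals are over-predicted (low precision) and which are missed (low recall).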
