Shop Vertical Classification
Arthur Prévot
Meetup Machine Learning – Toronto – March 1st 2016
Background
• Large ecommerce platform
• 240K+ current customers
• Many more shops created (churned or didn’t make it to customer status)
Problem
● No information about their industry in most cases
1st solution
● ask them
2nd solution
● We have HTML product descriptions for each shop
● We have labelled data (Mechanical Turk)
→ Classifier
Context
• Started during a Shopify Hack Day
• Pursued as a side project at work
• Used scikit-learn (sklearn) first
• Moved to Spark MLlib for full-scale testing and production
• Now in production
Product Description
Getting Label Data
• Asked Amazon Mechanical Turk workers (Turkers) to assess 80K stores
• Each had to choose among 15 verticals
• Involved hundreds of Turkers
Input (80K shops)
Shop  Aggregated product data
1     “Nice octopolo shirt !…”
2     “Nice hat and nice shirt …”
3     “Set of <b> tires </b> …”
4     “Beef and more beef…”
5     “Tire set for bikes”
...   ...
Cleaning (80K shops)
• HTML code removed
• Stop words removed
• Words stemmed
Shop  Text
1     “nice octopolo shirt…”
2     “nice hat nice shirt…”
3     “set tire…”
4     “beef beef…”
5     “tire set bike”
...   ...
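The three cleaning steps can be sketched in a few lines of standard-library Python. This is only an illustration, not the code used in the talk: the stop-word list is a tiny hypothetical one, and `naive_stem` is a crude stand-in for a real stemmer that only drops a trailing plural "s".

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "for", "more", "of", "the"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    """Return the text of `raw` with HTML tags removed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)


def naive_stem(word):
    """Crude stand-in for a stemmer: drop a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean(raw):
    """HTML removed, lowercased, stop words removed, words stemmed."""
    text = strip_html(raw).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(clean("Set of <b> tires </b> …"))
print(clean("Tire set for bikes"))
```

Keeping each step in its own small function makes it easy to swap in a proper stemmer or a fuller stop-word list later.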
Term Frequency (80K shops, 10K words, 8 shown in the example), joined with the Mechanical Turk labels
Shops  nice  octopolo  shirt  hat  set  tires  beef  bike  ...  label
1      1     1         1      0    0    0      0     0     ...  Apparel
2      2     0         1      1    0    0      0     0     ...  Apparel
3      0     0         0      0    1    1      0     0     ...  Auto
4      0     0         0      0    0    0      2     0     ...  Food
5      0     0         0      0    1    1      0     1     ...  Auto
...    ...   ...       ...    ...  ...  ...    ...   ...   ...  ...
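As a rough sketch, the term-frequency rows of the toy example can be built with `collections.Counter`. The token lists below are assumptions reconstructed from the five example shops, with "tires" kept in the plural to match the column heading; `VOCAB` and `term_frequency` are illustrative names, not the author's code.

```python
from collections import Counter

# The 8 example vocabulary words, in the column order of the slide.
VOCAB = ["nice", "octopolo", "shirt", "hat", "set", "tires", "beef", "bike"]

# Cleaned tokens and Mechanical Turk label per shop (toy data, assumed).
shops = {
    1: (["nice", "octopolo", "shirt"], "Apparel"),
    2: (["nice", "hat", "nice", "shirt"], "Apparel"),
    3: (["set", "tires"], "Auto"),
    4: (["beef", "beef"], "Food"),
    5: (["tires", "set", "bike"], "Auto"),
}


def term_frequency(tokens, vocab):
    """Count how often each vocabulary word occurs in `tokens`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


# One row per shop: (term-frequency vector, label).
tf_rows = {
    shop: (term_frequency(tokens, VOCAB), label)
    for shop, (tokens, label) in shops.items()
}

for shop, (row, label) in tf_rows.items():
    print(shop, row, label)
```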
Model
• A few quick tests using sklearn, then settled on Naïve Bayes
Naïve Bayes Model
The term-frequency table (80K shops) is reduced to one row of parameters per label (15 labels):
Shops  Label    Prior       Word likelihoods, one per vocabulary word w
1, 2   Apparel  P(apparel)  P(w | apparel) for w in nice, octopolo, shirt, hat, set, tires, beef, bike
3, 5   Auto     P(auto)     P(w | auto) for w in nice, octopolo, shirt, hat, set, tires, beef, bike
4      Food     P(food)     P(w | food) for w in nice, octopolo, shirt, hat, set, tires, beef, bike
What and why
• These are the model parameters
• Needed as input to the prediction formula
$$\text{Predicted class} = \arg\max_{cl} P(cl \mid doc)$$
What and why
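Bayes Theorem:
$$P(cl \mid doc) = \frac{P(cl) \cdot P(doc \mid cl)}{P(doc)}$$
The denominator is not important when comparing likelihoods across classes:
$$P(cl \mid doc) \propto P(cl) \cdot P(doc \mid cl)$$
With the conditional independence assumption (actually violated):
$$P(cl) \cdot P(doc \mid cl) = P(cl) \cdot P(wd_1 \mid cl) \cdot P(wd_2 \mid cl) \cdots P(wd_n \mid cl)$$
$$\text{Predicted class} = \arg\max_{cl} P(cl \mid doc)$$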
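A small worked example of the prediction rule, as a sketch only. The priors and word likelihoods are the smoothed toy values for the words "set" and "bike" that appear in the example tables later in the deck; the function names are hypothetical.

```python
from fractions import Fraction as F
from math import prod

# Toy parameters for three of the labels (values taken from the example tables).
priors = {"Apparel": F(2, 5), "Auto": F(2, 5), "Food": F(1, 5)}
likelihoods = {
    "Apparel": {"set": F(1, 15), "bike": F(1, 15)},
    "Auto": {"set": F(3, 13), "bike": F(2, 13)},
    "Food": {"set": F(1, 10), "bike": F(1, 10)},
}


def unnormalized_posterior(words, label):
    """P(cl) * P(wd_1 | cl) * ... * P(wd_n | cl); P(doc) is left out."""
    return priors[label] * prod(likelihoods[label][w] for w in words)


def predict(words):
    """argmax over labels of the unnormalized posterior."""
    return max(priors, key=lambda label: unnormalized_posterior(words, label))


doc = ["set", "bike"]
for label in priors:
    print(label, unnormalized_posterior(doc, label))
print("predicted:", predict(doc))
```

Because P(doc) is the same for every label, leaving it out does not change which label wins.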
Numerical Limitation
• Multiplying many values close to 0 -> float underflow
$$P(cl \mid doc) \propto P(cl) \cdot P(wd_1 \mid cl) \cdot P(wd_2 \mid cl) \cdots P(wd_n \mid cl)$$
Parameter table, now stored as logs: every word cell holds Log(P(word | label)) and every prior holds Log(P(label)), for Apparel (shops 1, 2), Auto (shops 3, 5) and Food (shop 4).
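Numerical limitation
From before:
$$P(cl \mid doc) \propto P(cl) \cdot P(wd_1 \mid cl) \cdot P(wd_2 \mid cl) \cdots P(wd_n \mid cl)$$
so:
$$\log P(cl \mid doc) = \log P(cl) + \log P(wd_1 \mid cl) + \log P(wd_2 \mid cl) + \dots + \log P(wd_n \mid cl) + \text{const}$$
where the constant, $-\log P(doc)$, is the same for every class.
• Way around: take log -> leads to summation instead of multiplication
• No impact on comparisons across classes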
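The underflow and the log workaround are easy to reproduce. The sketch below uses made-up likelihood values (100 words at 1e-5 or 2e-5 each) purely for illustration: the plain products both collapse to 0.0, while the log scores stay finite and still rank the two classes.

```python
import math


def product_score(prior, likelihoods):
    """Direct product P(cl) * P(wd_1 | cl) * ... (prone to underflow)."""
    score = prior
    for p in likelihoods:
        score *= p
    return score


def log_score(prior, likelihoods):
    """Same quantity on the log scale: a sum instead of a product."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)


# Hypothetical documents of 100 words with tiny per-word likelihoods.
class_a = [1e-5] * 100
class_b = [2e-5] * 100

print(product_score(0.5, class_a), product_score(0.5, class_b))
print(log_score(0.5, class_a), log_score(0.5, class_b))
```

Since the logarithm is monotonic, the class with the largest log score is also the class with the largest product, so the argmax is unchanged.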
Getting cell probabilities
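$$P(wd_n \mid cl) = \frac{N_{cl,wd_n}}{\sum_{w \in vocab} N_{cl,w}} \approx \frac{N_{cl,wd_n} + 1}{\sum_{w \in vocab} (N_{cl,w} + 1)} = \frac{N_{cl,wd_n} + 1}{\sum_{w \in vocab} N_{cl,w} + |Vocab|}$$
Here $N_{cl,w}$ is the number of occurrences of word $w$ in shops of class $cl$. The +1 deals with P(wd|cl)=0, which makes P(cl|doc)=0 regardless of other words.
$$P(cl) = \frac{N_{cl}}{N}$$
Here $N_{cl}$ is the number of shops labelled $cl$ and $N$ is the total number of labelled shops.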
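The add-one estimates can be checked on the five-shop toy example with exact fractions. This is a minimal sketch: the `training` list is reconstructed from the example term-frequency table, and `train` is a hypothetical helper, not the Spark MLlib implementation used in production.

```python
from collections import Counter
from fractions import Fraction

VOCAB = ["nice", "octopolo", "shirt", "hat", "set", "tires", "beef", "bike"]

# (label, word counts) for the five example shops.
training = [
    ("Apparel", {"nice": 1, "octopolo": 1, "shirt": 1}),
    ("Apparel", {"nice": 2, "shirt": 1, "hat": 1}),
    ("Auto", {"set": 1, "tires": 1}),
    ("Food", {"beef": 2}),
    ("Auto", {"set": 1, "tires": 1, "bike": 1}),
]


def train(rows, vocab):
    """Return (priors, likelihoods) with add-one smoothing."""
    shops_per_label = Counter()
    word_counts = {}
    for label, counts in rows:
        shops_per_label[label] += 1
        word_counts.setdefault(label, Counter()).update(counts)

    # P(cl) = N_cl / N
    priors = {
        label: Fraction(n, len(rows)) for label, n in shops_per_label.items()
    }
    # P(wd | cl) = (N_cl,wd + 1) / (sum over vocab of N_cl,w + |Vocab|)
    likelihoods = {}
    for label, counts in word_counts.items():
        total = sum(counts[w] for w in vocab)
        likelihoods[label] = {
            w: Fraction(counts[w] + 1, total + len(vocab)) for w in vocab
        }
    return priors, likelihoods


priors, likelihoods = train(training, VOCAB)
print(priors["Apparel"], likelihoods["Apparel"]["nice"])
print(likelihoods["Food"]["beef"], likelihoods["Auto"]["bike"])
```

With smoothing, every word keeps a non-zero likelihood in every class, and the likelihoods of each class still sum to 1 over the vocabulary.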
Filling in the cells from the term-frequency counts of the five example shops, with add-one smoothing: each cell is (count of the word in the label + 1) / (total word count in the label + 8 vocabulary words). The real model has 15 labels.
Shops  Label    Prior  nice         octopolo     shirt        hat          set          tires        beef         bike
1, 2   Apparel  2/5    (3+1)/(7+8)  (1+1)/(7+8)  (2+1)/(7+8)  (1+1)/(7+8)  (0+1)/(7+8)  (0+1)/(7+8)  (0+1)/(7+8)  (0+1)/(7+8)
3, 5   Auto     2/5    (0+1)/(5+8)  (0+1)/(5+8)  (0+1)/(5+8)  (0+1)/(5+8)  (2+1)/(5+8)  (2+1)/(5+8)  (0+1)/(5+8)  (1+1)/(5+8)
4      Food     1/5    (0+1)/(2+8)  (0+1)/(2+8)  (0+1)/(2+8)  (0+1)/(2+8)  (0+1)/(2+8)  (0+1)/(2+8)  (2+1)/(2+8)  (0+1)/(2+8)
Code
class LabeledDataFilter:
    ...

class Featurizer:
    ...

class Trainer:
    ...

class Evaluator:
    ...

class Predictor:
    ...

class verticalPredictor:
    # uses Featurizer() and Predictor()
    ...

Training job (every 7 days): product_data -> model, accuracy
Prediction job (every day): product_data, model -> shop+industry
Change in Training Set
• Start of home card
• Allowed asking for industry in a voluntary way
• Quickly grew to 50K shops
• Advantage: growing over time
• Issue: training set is not fully random
In the Data Warehouse
Shop Dimension (updated daily):
Shop Name
Shop URL
Shop Address
Shop City
…
Shop Predicted Industry
…
Results
Shops    top category  turker 1  turker 2  turker 3
Chive    Apparel       Apparel   Apparel   Art
Lackers  Sports        Sports    Apparel   Sports
Tesla    Auto          Auto      Auto      Sports
...      ...           ...       ...
60-80%
Results
Shops    top category  turker 1  turker 2  turker 3  algo top1  algo top2  algo top3
Chive    Apparel       Apparel   Apparel   Art       Apparel    Sport      Art
Lackers  Sports        Sports    Apparel   Sports    Sports     Apparel    Food
Tesla    Auto          Auto      Auto      Sports    Fashion    Auto       Electro
...      ...           ...       ...
60-80% ~65%
Results
Shops    top category  turker 1  turker 2  turker 3  algo top1  algo top2  algo top3
Chive    Apparel       Apparel   Apparel   Art       Apparel    Sport      Art
Lackers  Sports        Sports    Apparel   Sports    Sports     Apparel    Food
Tesla    Auto          Auto      Auto      Sports    unknown    Auto       Electro
...      ...           ...       ...
90%
~75%
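One way to score ranked predictions like "algo top1/top2/top3" is top-k accuracy: a shop counts as correct if its top category appears among the first k predictions. The sketch below uses only the three example rows from the table, so it illustrates the metric and does not reproduce the percentages on the slide; `top_k_accuracy` is a hypothetical helper.

```python
# (top category, ranked algorithm predictions) for the three example shops.
rows = [
    ("Apparel", ["Apparel", "Sport", "Art"]),    # Chive
    ("Sports", ["Sports", "Apparel", "Food"]),   # Lackers
    ("Auto", ["unknown", "Auto", "Electro"]),    # Tesla
]


def top_k_accuracy(examples, k):
    """Share of examples whose true label is among the first k predictions."""
    hits = sum(1 for truth, ranked in examples if truth in ranked[:k])
    return hits / len(examples)


for k in (1, 2, 3):
    print(k, top_k_accuracy(rows, k))
```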
Business Use
Management or product teams:
• What are the biggest industries by shop count and by sales made?
• How does that evolve over time?
Theme team:
• We want to develop new themes for a given vertical; can we see the top stores in this vertical to understand trends?
Event team:
• We want to be part of an event in the music business; can we get interesting shops in this field?
Could be improved
● More metrics: add multiclass precision/recall
○ Now available in MLlib
● Better performance: rerun for combinations of parameters
○ Also added recently to MLlib, but missing some components
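Multiclass precision and recall can be computed per label from true and predicted labels. The following is a plain-Python sketch with made-up labels, not the MLlib metrics API: precision is the share of predictions of a label that are right, recall is the share of shops of that label that are found.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label seen."""
    true_pos, predicted, actual = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        actual[truth] += 1
        predicted[pred] += 1
        if truth == pred:
            true_pos[truth] += 1

    report = {}
    for label in sorted(set(actual) | set(predicted)):
        # Fall back to 0.0 when a label was never predicted or never occurs.
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label] if actual[label] else 0.0
        report[label] = (precision, recall)
    return report


# Hypothetical labels for four shops.
y_true = ["Apparel", "Sports", "Auto", "Apparel"]
y_pred = ["Apparel", "Sports", "Apparel", "Apparel"]
report = per_class_precision_recall(y_true, y_pred)
for label, (precision, recall) in report.items():
    print(label, precision, recall)
```

Per-class numbers show which verticals the classifier confuses, which a single overall accuracy figure hides.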
DEMO
THE END
