SlideShare ist ein Scribd-Unternehmen logo
1 von 15
Downloaden Sie, um offline zu lesen
RecSys Challenge 2015: ensemble learning with
categorical features
Peter Romov, Evgeny Sokolov
•  Logs from e-commerce website: collection of sessions
•  Session
•  sequence of clicks on the item pages
•  could end with or without purchase
•  Click
•  Timestamp
•  ItemID (≈54k unique IDs)
•  CategoryID (≈350 unique IDs)
•  Purchase
•  set of bought items with price and quantity
•  Known purchases for the train-set, need to predict on the test-set
Problem statement
2
Problem statement
3
Clicks from session s
Purchase (actual)
Purchase (predicted)
c(s) = (c1(s), . . . , cn(s)(s))
h(s) ⇡ y(s)
y(s) =
(
; — no purchase
{i1, . . . , im(s)} (bought items) — otherwise
Evaluation measure:
— Jaccard distance between two sets
Sb
test
Stest
where
— all sessions from test-set
— sessions from test-set with purchase
J(y(s), h(s)) =
|y(s)  h(s)|
|y(s) [ h(s)|
Q(h, Stest) =
X
s2Stest:
|h(s)|>0
8
<
:
|Sb
test|
|Stest| + J(y(s), h(s)), if y(s) 6= ;
|Sb
test|
|Stest| , otherwise
Problem statement
4
First observations (from the task):
•  the task is uncommon (set prediction with specific loss function)
•  evaluation measure could be rewritten
•  the original problem can be divided into two well-known
binary classification problems;
1.  predict purchase given session
optimize Purchase score
2.  predict bought items given session with purchase
optimize Jaccard score
P(y(s) 6= ;|s)
Q(h, Stest) =
|Sb
test|
|Stest|
(TP FP)
| {z }
purchase score
+
X
s2Stest
J(y(s), h(s))
| {z }
Jaccard score
,
P(i 2 y(s)|s, y(s) 6= ;)
•  Two-stage prediction
•  Two binary classification models learned on the train-set
•  Both classifiers require thresholds
•  Set up thresholds to optimize Purchase score and Jaccard score
using hold-out subsample of the train-set
Solution schema
5
Strain
purchase classifier
bought item classifier
Slearn
90%
Svalid
10%
s 7! P(purchase|s)
(s, i) 7! P(i 2 y(s)|s, y(s) 6= ;)
classifier
thresholds
Some relationships from data
6
Next observations (from the data):
•  Buying rate strongly depends on time features
•  Buying rate varies highly between categorical features
he
er-
87
e-
s)
es-
me
s.
he
or
ht
is
)) ,
ne
ty
Figure 1: Dynamics of the buying rate in time
Figure 2: Buying rate versus number of clicked
items (left) and ID of the item with the maximum
number of clicks in session (right)
The total number of item IDs and category IDs is 54, 287
and 347 correspondingly. Both training and test sets be-
long to an interval of 6 months. The target function y(s)
corresponds to the set of items that were bought in the ses-
sion s 1
. In other words, the target function gives some
subset of the universal item set I for each user session s.
We are given sets of bought items y(s) for all sessions s the
training set Strain, and are required to predict these sets for
test sessions s ∈ Stest.
2.2 Evaluation Measure
Denote by h(s) a hypothesis that predicts a set of bought
items for any user session s. The score of this hypothesis is
measured by the following formula:
Q(h, Stest) =
s∈Stest:
|h(s)|>0
|Sb
test|
|Stest|
(−1)isEmpty(y(s))
+J(y(s), h(s)) ,
where Sb
test is the set of all test sessions with at least one
purchase event, J(A, B) = |A∩B|
|A∪B|
is the Jaccard similarity
measure. It is easy to rewrite this expression as
Q(h, Stest) =
|Sb
test|
|Stest|
(TP − FP)
purchase score
+
s∈Stest
J(y(s), h(s))
Jaccard score
, (1)
where TP is the number of sessions with |y(s)| > 0 and |h(s)| >
0 (i.e. true positives), and FP is the number of sessions
with |y(s)| = 0 and |h(s)| > 0 (i.e. false positives). Now it
is easy to see that the score consists of two parts. The first
one gives a reward for each correctly guessed session with
buy events and a penalty for each false alarm; the absolute
values of penalty and reward are both equal to
|Sb
test|
|Stest|
. The
second part calculates the total similarity of predicted sets
of bought items to the real sets.
2.3 Purchase statistics
Figure 2: Buying rate versus number of clicked
items (left) and ID of the item with the maximum
number of clicks in session (right)
is that a higher number of items clicked during the session
leads to a higher chance of a purchase.
Lots of information could be extracted from the data by
considering item identifiers and categories clicked during the
Buying rate — fraction of buyer sessions in some subset of sessions
Feature extraction
7
•  Purchase classifier: features from session (sequence of clicks)
•  Bought item classifier: features from pair session+itemID
•  Observation: bought item is a clicked item
•  We use two types of features
•  Numerical: real number, e.g. seconds between two clicks
•  Categorical: element of the unordered set of values (levels),
e.g. ItemID
Feature extraction: session
8
1.  Start/end of the session (month, day, hour, etc.)
[numerical + categorical with few levels]
2.  Number of clicks, unique items, categories, item-category pairs
[numerical]
3.  Top 10 items and categories by the number of clicks
[categorical with ≈50k levels]
4.  ID of the first/last item clicked at least k times
[categorical with ≈50k levels]
5.  Click counts for 100 items and 50 categories that were most
popular in the whole training set
[sparse numerical]
Feature extraction:
session+ItemID
9
1.  All session features
2.  ItemID
[categorical with ≈50k levels]
3.  Timestamp of the first/last click on the item (month, day, hour, etc.)
[numerical + categorical with few levels]
4.  Number of clicks in the session for the given item
[numerical]
5.  Total duration (by analogy with dwell time) of the clicks on the item
[numerical]
•  GBM and similar ensemble learning techniques
•  useful with numerical features
•  one-hot encoding of categorical features doesn’t perform well
•  Matrix decompositions, FM
•  useful with categorical features
•  hard to incorporate numerical features because of rough (bi-linear)
model
•  We used our internal learning algorithm: MatrixNet
•  GBM with oblivious decision trees
•  trees properly handle categorical features (multi-split decision trees)
•  SVD-like decompositions for new feature value combinations
Classification method
10
Classification method
11
Oblivious decision tree with categorical features
duration > 20
user
item item
[numerical]
[categorical]
[categorical]
yes no
…
user
item item…
… … ……
•  Training classifiers
•  GB with 10k trees for each classifier
•  ≈12 hours to train both models on 150 machines
•  Making predictions
•  We made 4000 predictions per second per thread
Classification method: speed
12
Threshold optimization
13
We optimized thresholds using validation set (10% hold-out
from train-set)
1)  Maximize Jaccard score
2)  Maximize Purchase+Jaccard scores using fixed bought
item threshold
Q(h, Svalid) =
|Sb
valid|
|Svalid|
(TP FP)
| {z }
purchase score
+
X
s2Svalid
J(y(s), h(s))
| {z }
Jaccard score
,
Figure 3: Item detection threshold (above) and pur-
chase detection threshold (below) quality on the val-
idation set.
we train purchase detection and purchased item detection
classifiers. The purchase detection classifier hp(s) predicts
the outcome of the function yp(s) = isNotEmpty(y(s)) and
uses the entire training set in the learning phase. The item
detection classifier hi(s, j) approximates the indicator func-
tion yi(s, j) = I(j ∈ y(s)) and uses only sessions with bought
items in the learning phase. Of course, it would be wise to
use classifiers that output probabilities rather than binary
predictions, because in this case we will be able to select
thresholds that directly optimize evaluation metric (1) in-
stead of the classifier’s internal quality measure. So, our
final expression for the hypothesis can be written as
h(s) =
∅ if hp(s) < αp,
{j ∈ I | hi(s, j) > αi} if hp(s) ≥ αp.
(2)
3.2 Feature Extraction
We have outlined two groups of features: one describes a
session and the other describes a session-item pair. The pur-
chase detection classifier uses only session features and the
item detection classifier uses both groups. The full feature
listing can be found in Table 1; for further details, please
refer to our code2
. We give some comments on our feature
extraction decisions below.
One could use sophisticated aggregations to extract nu-
merical features that describe items and categories. How-
Figure 3: Item detection threshold (above) and pur-
chase detection threshold (below) quality on the val-
idation set.
•  Leaderboard: 63102 (1st place)
•  Purchase detection on validation (10% hold-out):
•  16% precision
•  77% recall
•  AUC 0.85
•  Purchased item detection on validation:
•  Jaccard measure 0.765
•  Features / datasets used to learn classifiers / evaluation process
can be reproduced, see our code1
Final results
14 1https://github.com/romovpa/ydf-recsys2015-challenge
1.  Observations from the problem statement
›  The task is complex but decomposable into two well-known:
binary classification of sessions and (session, ItemID)-pairs
2.  Observations from the data (user click sessions)
›  Features from sessions and (session, ItemID)-pairs
›  Easy to develop many meaningful categorical features
3.  The algorithm
›  Gradient boosting on trees with categorical features
›  No sophisticated mixtures of Machine Learning techniques: one
algorithm to work with many numerical and categorical features
Summary / Questions?
15

Weitere ähnliche Inhalte

Was ist angesagt?

K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...Simplilearn
 
SGP 2023 graduate school - A quick journey into geometry processing
SGP 2023 graduate school - A quick journey into geometry processingSGP 2023 graduate school - A quick journey into geometry processing
SGP 2023 graduate school - A quick journey into geometry processingBruno Levy
 
Wrapper feature selection method
Wrapper feature selection methodWrapper feature selection method
Wrapper feature selection methodAmir Razmjou
 
K means clustering | K Means ++
K means clustering | K Means ++K means clustering | K Means ++
K means clustering | K Means ++sabbirantor
 
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018Massimo Quadrana
 
Simple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in MahoutSimple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in MahoutData Science London
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender SystemsYves Raimond
 
Graph Neural Network in practice
Graph Neural Network in practiceGraph Neural Network in practice
Graph Neural Network in practicetuxette
 
MovieTweetings: a movie rating dataset collected from twitter
MovieTweetings: a movie rating dataset collected from twitterMovieTweetings: a movie rating dataset collected from twitter
MovieTweetings: a movie rating dataset collected from twitterSimon Dooms
 
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...Simplilearn
 
Multi-armed bandit by Joni Turunen
Multi-armed bandit by Joni TurunenMulti-armed bandit by Joni Turunen
Multi-armed bandit by Joni TurunenFrosmo
 
Introduction to Uplift Modelling
Introduction to Uplift ModellingIntroduction to Uplift Modelling
Introduction to Uplift ModellingPierre Gutierrez
 
Knn Algorithm presentation
Knn Algorithm presentationKnn Algorithm presentation
Knn Algorithm presentationRishavSharma112
 
Session-based recommendations with recurrent neural networks
Session-based recommendations with recurrent neural networksSession-based recommendations with recurrent neural networks
Session-based recommendations with recurrent neural networksZimin Park
 
Narrative Design, the case of “Horizon Zero Dawn”
Narrative Design, the case of “Horizon Zero Dawn”Narrative Design, the case of “Horizon Zero Dawn”
Narrative Design, the case of “Horizon Zero Dawn”Nelson Zagalo
 
Support vector machine
Support vector machineSupport vector machine
Support vector machineRishabh Gupta
 
Deep learning based object detection basics
Deep learning based object detection basicsDeep learning based object detection basics
Deep learning based object detection basicsBrodmann17
 

Was ist angesagt? (20)

K nearest neighbor
K nearest neighborK nearest neighbor
K nearest neighbor
 
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
 
SGP 2023 graduate school - A quick journey into geometry processing
SGP 2023 graduate school - A quick journey into geometry processingSGP 2023 graduate school - A quick journey into geometry processing
SGP 2023 graduate school - A quick journey into geometry processing
 
Wrapper feature selection method
Wrapper feature selection methodWrapper feature selection method
Wrapper feature selection method
 
K means clustering | K Means ++
K means clustering | K Means ++K means clustering | K Means ++
K means clustering | K Means ++
 
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
 
Simple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in MahoutSimple Matrix Factorization for Recommendation in Mahout
Simple Matrix Factorization for Recommendation in Mahout
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
 
Graph Neural Network in practice
Graph Neural Network in practiceGraph Neural Network in practice
Graph Neural Network in practice
 
MovieTweetings: a movie rating dataset collected from twitter
MovieTweetings: a movie rating dataset collected from twitterMovieTweetings: a movie rating dataset collected from twitter
MovieTweetings: a movie rating dataset collected from twitter
 
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...
 
Multi-armed bandit by Joni Turunen
Multi-armed bandit by Joni TurunenMulti-armed bandit by Joni Turunen
Multi-armed bandit by Joni Turunen
 
Introduction to Uplift Modelling
Introduction to Uplift ModellingIntroduction to Uplift Modelling
Introduction to Uplift Modelling
 
Knn Algorithm presentation
Knn Algorithm presentationKnn Algorithm presentation
Knn Algorithm presentation
 
Session-based recommendations with recurrent neural networks
Session-based recommendations with recurrent neural networksSession-based recommendations with recurrent neural networks
Session-based recommendations with recurrent neural networks
 
K Nearest Neighbor Algorithm
K Nearest Neighbor AlgorithmK Nearest Neighbor Algorithm
K Nearest Neighbor Algorithm
 
Narrative Design, the case of “Horizon Zero Dawn”
Narrative Design, the case of “Horizon Zero Dawn”Narrative Design, the case of “Horizon Zero Dawn”
Narrative Design, the case of “Horizon Zero Dawn”
 
Support vector machine
Support vector machineSupport vector machine
Support vector machine
 
Deep learning based object detection basics
Deep learning based object detection basicsDeep learning based object detection basics
Deep learning based object detection basics
 
Decision trees
Decision treesDecision trees
Decision trees
 

Andere mochten auch

Fields of Experts (доклад)
Fields of Experts (доклад)Fields of Experts (доклад)
Fields of Experts (доклад)romovpa
 
A Simple yet Efficient Method for a Credit Card Upselling Prediction
A Simple yet Efficient Method for a Credit Card Upselling PredictionA Simple yet Efficient Method for a Credit Card Upselling Prediction
A Simple yet Efficient Method for a Credit Card Upselling Predictionromovpa
 
Categorical Data Analysis in Python
Categorical Data Analysis in PythonCategorical Data Analysis in Python
Categorical Data Analysis in PythonJaidev Deshpande
 
Fast detection of Android malware: machine learning approach
Fast detection of Android malware: machine learning approachFast detection of Android malware: machine learning approach
Fast detection of Android malware: machine learning approachYury Leonychev
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsMark Peng
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitionsOwen Zhang
 

Andere mochten auch (7)

Fields of Experts (доклад)
Fields of Experts (доклад)Fields of Experts (доклад)
Fields of Experts (доклад)
 
A Simple yet Efficient Method for a Credit Card Upselling Prediction
A Simple yet Efficient Method for a Credit Card Upselling PredictionA Simple yet Efficient Method for a Credit Card Upselling Prediction
A Simple yet Efficient Method for a Credit Card Upselling Prediction
 
Categorical Data Analysis in Python
Categorical Data Analysis in PythonCategorical Data Analysis in Python
Categorical Data Analysis in Python
 
Fast detection of Android malware: machine learning approach
Fast detection of Android malware: machine learning approachFast detection of Android malware: machine learning approach
Fast detection of Android malware: machine learning approach
 
Xgboost
XgboostXgboost
Xgboost
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle Competitions
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
 

Ähnlich wie RecSys Challenge 2015: ensemble learning with categorical features

EE660 Project_sl_final
EE660 Project_sl_finalEE660 Project_sl_final
EE660 Project_sl_finalShanglin Yang
 
SMART Seminar Series: "Optimisation of closed loop supply chain decisions usi...
SMART Seminar Series: "Optimisation of closed loop supply chain decisions usi...SMART Seminar Series: "Optimisation of closed loop supply chain decisions usi...
SMART Seminar Series: "Optimisation of closed loop supply chain decisions usi...SMART Infrastructure Facility
 
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Md. Main Uddin Rony
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataAbhishek M Shivalingaiah
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word countJeff Patti
 
[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies台灣資料科學年會
 
Observations
ObservationsObservations
Observationsbutest
 
Machine Learning : why we should know and how it works
Machine Learning : why we should know and how it worksMachine Learning : why we should know and how it works
Machine Learning : why we should know and how it worksKevin Lee
 
Analysis of Algorithm full version 2024.pptx
Analysis of Algorithm  full version  2024.pptxAnalysis of Algorithm  full version  2024.pptx
Analysis of Algorithm full version 2024.pptxrajesshs31r
 
Design Analysis of Alogorithm 1 ppt 2024.pptx
Design Analysis of Alogorithm 1 ppt 2024.pptxDesign Analysis of Alogorithm 1 ppt 2024.pptx
Design Analysis of Alogorithm 1 ppt 2024.pptxrajesshs31r
 
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...PAPIs.io
 
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docx
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docxWeek 2 iLab TCO 2 — Given a simple problem, design a solutio.docx
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docxmelbruce90096
 
The ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxThe ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxRuby Shrestha
 
Oracle ebs r12eam part2
Oracle ebs r12eam part2Oracle ebs r12eam part2
Oracle ebs r12eam part2jcvd12
 
Performance evaluation of IR models
Performance evaluation of IR modelsPerformance evaluation of IR models
Performance evaluation of IR modelsNisha Arankandath
 
The relationship between test and production code quality (@ SIG)
The relationship between test and production code quality (@ SIG)The relationship between test and production code quality (@ SIG)
The relationship between test and production code quality (@ SIG)Maurício Aniche
 
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning AlgorithmsIRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning AlgorithmsIRJET Journal
 
Lecture 2 role of algorithms in computing
Lecture 2   role of algorithms in computingLecture 2   role of algorithms in computing
Lecture 2 role of algorithms in computingjayavignesh86
 

Ähnlich wie RecSys Challenge 2015: ensemble learning with categorical features (20)

EE660 Project_sl_final
EE660 Project_sl_finalEE660 Project_sl_final
EE660 Project_sl_final
 
SMART Seminar Series: "Optimisation of closed loop supply chain decisions usi...
SMART Seminar Series: "Optimisation of closed loop supply chain decisions usi...SMART Seminar Series: "Optimisation of closed loop supply chain decisions usi...
SMART Seminar Series: "Optimisation of closed loop supply chain decisions usi...
 
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word count
 
[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies
 
Lec13 Clustering.pptx
Lec13 Clustering.pptxLec13 Clustering.pptx
Lec13 Clustering.pptx
 
Observations
ObservationsObservations
Observations
 
Machine Learning : why we should know and how it works
Machine Learning : why we should know and how it worksMachine Learning : why we should know and how it works
Machine Learning : why we should know and how it works
 
Analysis of Algorithm full version 2024.pptx
Analysis of Algorithm  full version  2024.pptxAnalysis of Algorithm  full version  2024.pptx
Analysis of Algorithm full version 2024.pptx
 
Design Analysis of Alogorithm 1 ppt 2024.pptx
Design Analysis of Alogorithm 1 ppt 2024.pptxDesign Analysis of Alogorithm 1 ppt 2024.pptx
Design Analysis of Alogorithm 1 ppt 2024.pptx
 
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
 
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docx
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docxWeek 2 iLab TCO 2 — Given a simple problem, design a solutio.docx
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docx
 
The ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxThe ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptx
 
Oracle ebs r12eam part2
Oracle ebs r12eam part2Oracle ebs r12eam part2
Oracle ebs r12eam part2
 
Ali upload
Ali uploadAli upload
Ali upload
 
Performance evaluation of IR models
Performance evaluation of IR modelsPerformance evaluation of IR models
Performance evaluation of IR models
 
The relationship between test and production code quality (@ SIG)
The relationship between test and production code quality (@ SIG)The relationship between test and production code quality (@ SIG)
The relationship between test and production code quality (@ SIG)
 
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning AlgorithmsIRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
 
Lecture 2 role of algorithms in computing
Lecture 2   role of algorithms in computingLecture 2   role of algorithms in computing
Lecture 2 role of algorithms in computing
 

Mehr von romovpa

Машинное обучение для ваших игр и бизнеса
Машинное обучение для ваших игр и бизнесаМашинное обучение для ваших игр и бизнеса
Машинное обучение для ваших игр и бизнесаromovpa
 
Applications of Machine Learning in DOTA2: Literature Review and Practical Kn...
Applications of Machine Learning in DOTA2: Literature Review and Practical Kn...Applications of Machine Learning in DOTA2: Literature Review and Practical Kn...
Applications of Machine Learning in DOTA2: Literature Review and Practical Kn...romovpa
 
Проекты для студентов ФКН ВШЭ
Проекты для студентов ФКН ВШЭПроекты для студентов ФКН ВШЭ
Проекты для студентов ФКН ВШЭromovpa
 
Dota Science: Роль киберспорта в обучении анализу данных
Dota Science: Роль киберспорта в обучении анализу данныхDota Science: Роль киберспорта в обучении анализу данных
Dota Science: Роль киберспорта в обучении анализу данныхromovpa
 
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...romovpa
 
Машинное обучение с элементами киберспорта
Машинное обучение с элементами киберспортаМашинное обучение с элементами киберспорта
Машинное обучение с элементами киберспортаromovpa
 
Факторизационные модели в рекомендательных системах
Факторизационные модели в рекомендательных системахФакторизационные модели в рекомендательных системах
Факторизационные модели в рекомендательных системахromovpa
 
Структурный SVM и отчет по курсовой
Структурный SVM и отчет по курсовойСтруктурный SVM и отчет по курсовой
Структурный SVM и отчет по курсовойromovpa
 
Структурное обучение и S-SVM
Структурное обучение и S-SVMСтруктурное обучение и S-SVM
Структурное обучение и S-SVMromovpa
 
Глобальная дискретная оптимизация при помощи разрезов графов
Глобальная дискретная оптимизация при помощи разрезов графовГлобальная дискретная оптимизация при помощи разрезов графов
Глобальная дискретная оптимизация при помощи разрезов графовromovpa
 

Mehr von romovpa (10)

Машинное обучение для ваших игр и бизнеса
Машинное обучение для ваших игр и бизнесаМашинное обучение для ваших игр и бизнеса
Машинное обучение для ваших игр и бизнеса
 
Applications of Machine Learning in DOTA2: Literature Review and Practical Kn...
Applications of Machine Learning in DOTA2: Literature Review and Practical Kn...Applications of Machine Learning in DOTA2: Literature Review and Practical Kn...
Applications of Machine Learning in DOTA2: Literature Review and Practical Kn...
 
Проекты для студентов ФКН ВШЭ
Проекты для студентов ФКН ВШЭПроекты для студентов ФКН ВШЭ
Проекты для студентов ФКН ВШЭ
 
Dota Science: Роль киберспорта в обучении анализу данных
Dota Science: Роль киберспорта в обучении анализу данныхDota Science: Роль киберспорта в обучении анализу данных
Dota Science: Роль киберспорта в обучении анализу данных
 
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
 
Машинное обучение с элементами киберспорта
Машинное обучение с элементами киберспортаМашинное обучение с элементами киберспорта
Машинное обучение с элементами киберспорта
 
Факторизационные модели в рекомендательных системах
Факторизационные модели в рекомендательных системахФакторизационные модели в рекомендательных системах
Факторизационные модели в рекомендательных системах
 
Структурный SVM и отчет по курсовой
Структурный SVM и отчет по курсовойСтруктурный SVM и отчет по курсовой
Структурный SVM и отчет по курсовой
 
Структурное обучение и S-SVM
Структурное обучение и S-SVMСтруктурное обучение и S-SVM
Структурное обучение и S-SVM
 
Глобальная дискретная оптимизация при помощи разрезов графов
Глобальная дискретная оптимизация при помощи разрезов графовГлобальная дискретная оптимизация при помощи разрезов графов
Глобальная дискретная оптимизация при помощи разрезов графов
 

Kürzlich hochgeladen

Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 

Kürzlich hochgeladen (20)

Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 

RecSys Challenge 2015: ensemble learning with categorical features

  • 1. RecSys Challenge 2015: ensemble learning with categorical features Peter Romov, Evgeny Sokolov
  • 2. •  Logs from e-commerce website: collection of sessions •  Session •  sequence of clicks on the item pages •  could end with or without purchase •  Click •  Timestamp •  ItemID (≈54k unique IDs) •  CategoryID (≈350 unique IDs) •  Purchase •  set of bought items with price and quantity •  Known purchases for the train-set, need to predict on the test-set Problem statement 2
  • 3. Problem statement 3 Clicks from session s Purchase (actual) Purchase (predicted) c(s) = (c1(s), . . . , cn(s)(s)) h(s) ⇡ y(s) y(s) = ( ; — no purchase {i1, . . . , im(s)} (bought items) — otherwise Evaluation measure: — Jaccard distance between two sets Sb test Stest where — all sessions from test-set — sessions from test-set with purchase J(y(s), h(s)) = |y(s) h(s)| |y(s) [ h(s)| Q(h, Stest) = X s2Stest: |h(s)|>0 8 < : |Sb test| |Stest| + J(y(s), h(s)), if y(s) 6= ; |Sb test| |Stest| , otherwise
  • 4. Problem statement 4 First observations (from the task): •  the task is uncommon (set prediction with specific loss function) •  evaluation measure could be rewritten •  the original problem can be divided into two well-known binary classification problems; 1.  predict purchase given session optimize Purchase score 2.  predict bought items given session with purchase optimize Jaccard score P(y(s) 6= ;|s) Q(h, Stest) = |Sb test| |Stest| (TP FP) | {z } purchase score + X s2Stest J(y(s), h(s)) | {z } Jaccard score , P(i 2 y(s)|s, y(s) 6= ;)
  • 5. •  Two-stage prediction •  Two binary classification models learned on the train-set •  Both classifiers require thresholds •  Set up thresholds to optimize Purchase score and Jaccard score using hold-out subsample of the train-set Solution schema 5 Strain purchase classifier bought item classifier Slearn 90% Svalid 10% s 7! P(purchase|s) (s, i) 7! P(i 2 y(s)|s, y(s) 6= ;) classifier thresholds
  • 6. Some relationships from data 6 Next observations (from the data): •  Buying rate strongly depends on time features •  Buying rate varies highly between categorical features he er- 87 e- s) es- me s. he or ht is )) , ne ty Figure 1: Dynamics of the buying rate in time Figure 2: Buying rate versus number of clicked items (left) and ID of the item with the maximum number of clicks in session (right) The total number of item IDs and category IDs is 54, 287 and 347 correspondingly. Both training and test sets be- long to an interval of 6 months. The target function y(s) corresponds to the set of items that were bought in the ses- sion s 1 . In other words, the target function gives some subset of the universal item set I for each user session s. We are given sets of bought items y(s) for all sessions s the training set Strain, and are required to predict these sets for test sessions s ∈ Stest. 2.2 Evaluation Measure Denote by h(s) a hypothesis that predicts a set of bought items for any user session s. The score of this hypothesis is measured by the following formula: Q(h, Stest) = s∈Stest: |h(s)|>0 |Sb test| |Stest| (−1)isEmpty(y(s)) +J(y(s), h(s)) , where Sb test is the set of all test sessions with at least one purchase event, J(A, B) = |A∩B| |A∪B| is the Jaccard similarity measure. It is easy to rewrite this expression as Q(h, Stest) = |Sb test| |Stest| (TP − FP) purchase score + s∈Stest J(y(s), h(s)) Jaccard score , (1) where TP is the number of sessions with |y(s)| > 0 and |h(s)| > 0 (i.e. true positives), and FP is the number of sessions with |y(s)| = 0 and |h(s)| > 0 (i.e. false positives). Now it is easy to see that the score consists of two parts. The first one gives a reward for each correctly guessed session with buy events and a penalty for each false alarm; the absolute values of penalty and reward are both equal to |Sb test| |Stest| . The second part calculates the total similarity of predicted sets of bought items to the real sets. 2.3 Purchase statistics Figure 2: Buying rate versus number of clicked items (left) and ID of the item with the maximum number of clicks in session (right) is that a higher number of items clicked during the session leads to a higher chance of a purchase. Lots of information could be extracted from the data by considering item identifiers and categories clicked during the Buying rate — fraction of buyer sessions in some subset of sessions
  • 7. Feature extraction 7 •  Purchase classifier: features from session (sequence of clicks) •  Bought item classifier: features from pair session+itemID •  Observation: bought item is a clicked item •  We use two types of features •  Numerical: real number, e.g. seconds between two clicks •  Categorical: element of the unordered set of values (levels), e.g. ItemID
  • 8. Feature extraction: session 8 1.  Start/end of the session (month, day, hour, etc.) [numerical + categorical with few levels] 2.  Number of clicks, unique items, categories, item-category pairs [numerical] 3.  Top 10 items and categories by the number of clicks [categorical with ≈50k levels] 4.  ID of the first/last item clicked at least k times [categorical with ≈50k levels] 5.  Click counts for 100 items and 50 categories that were most popular in the whole training set [sparse numerical]
  • 9. Feature extraction: session+ItemID 9 1.  All session features 2.  ItemID [categorical with ≈50k levels] 3.  Timestamp of the first/last click on the item (month, day, hour, etc.) [numerical + categorical with few levels] 4.  Number of clicks in the session for the given item [numerical] 5.  Total duration (by analogy with dwell time) of the clicks on the item [numerical]
  • 10. •  GBM and similar ensemble learning techniques •  useful with numerical features •  one-hot encoding of categorical features doesn’t perform well •  Matrix decompositions, FM •  useful with categorical features •  hard to incorporate numerical features because of rough (bi-linear) model •  We used our internal learning algorithm: MatrixNet •  GBM with oblivious decision trees •  trees properly handle categorical features (multi-split decision trees) •  SVD-like decompositions for new feature value combinations Classification method 10
  • 11. Classification method 11 Oblivious decision tree with categorical features duration > 20 user item item [numerical] [categorical] [categorical] yes no … user item item… … … ……
  • 12. •  Training classifiers •  GB with 10k trees for each classifier •  ≈12 hours to train both models on 150 machines •  Making predictions •  We made 4000 predictions per second per thread Classification method: speed 12
  • 13. Threshold optimization 13 We optimized thresholds using validation set (10% hold-out from train-set) 1)  Maximize Jaccard score 2)  Maximize Purchase+Jaccard scores using fixed bought item threshold Q(h, Svalid) = |Sb valid| |Svalid| (TP FP) | {z } purchase score + X s2Svalid J(y(s), h(s)) | {z } Jaccard score , Figure 3: Item detection threshold (above) and pur- chase detection threshold (below) quality on the val- idation set. we train purchase detection and purchased item detection classifiers. The purchase detection classifier hp(s) predicts the outcome of the function yp(s) = isNotEmpty(y(s)) and uses the entire training set in the learning phase. The item detection classifier hi(s, j) approximates the indicator func- tion yi(s, j) = I(j ∈ y(s)) and uses only sessions with bought items in the learning phase. Of course, it would be wise to use classifiers that output probabilities rather than binary predictions, because in this case we will be able to select thresholds that directly optimize evaluation metric (1) in- stead of the classifier’s internal quality measure. So, our final expression for the hypothesis can be written as h(s) = ∅ if hp(s) < αp, {j ∈ I | hi(s, j) > αi} if hp(s) ≥ αp. (2) 3.2 Feature Extraction We have outlined two groups of features: one describes a session and the other describes a session-item pair. The pur- chase detection classifier uses only session features and the item detection classifier uses both groups. The full feature listing can be found in Table 1; for further details, please refer to our code2 . We give some comments on our feature extraction decisions below. One could use sophisticated aggregations to extract nu- merical features that describe items and categories. How- Figure 3: Item detection threshold (above) and pur- chase detection threshold (below) quality on the val- idation set.
  • 14. •  Leaderboard: 63102 (1st place) •  Purchase detection on validation (10% hold-out): •  16% precision •  77% recall •  AUC 0.85 •  Purchased item detection on validation: •  Jaccard measure 0.765 •  Features / datasets used to learn classifiers / evaluation process can be reproduced, see our code1 Final results 14 1https://github.com/romovpa/ydf-recsys2015-challenge
  • 15. 1.  Observations from the problem statement ›  The task is complex but decomposable into two well-known: binary classification of sessions and (session, ItemID)-pairs 2.  Observations from the data (user click sessions) ›  Features from sessions and (session, ItemID)-pairs ›  Easy to develop many meaningful categorical features 3.  The algorithm ›  Gradient boosting on trees with categorical features ›  No sophisticated mixtures of Machine Learning techniques: one algorithm to work with many numerical and categorical features Summary / Questions? 15