Play with Kaggle

•

1 gefällt mir•2,685 views

Matthieu Scordia

Slides de ma présentation du 11 décembre au Meetup de Machine Learning.

Technologie Design

Kaggle?

Kaggle is a platform for predictive modeling competitions.

“We're making data science into a sport.”

The Data

Noteworthy characteristics of the dataset:
●

Unique queries:

21,073,569

●

Unique urls:

703,484,26

●

Unique users:

5,736,333

●

Training sessions:

34,573,630

●

Test sessions:

797,867

●

Clicks in the training data:

64,693,054

Total records in the log: 167,413,039 (=15Go!)

Logs format.

Session 0

Session 1
Session 2
Session 3

Session 4

Sessions format.

session 5
day 15
user 1

Term IDs

(Url,Domain) ranked

Session metadata

Url clicked
time passed

Evaluation.

The URLs are labeled using 3 grades of relevance: {0, 1, 2}
The labeling is done automatically, based on dwell-time.

0 : irrelevant - no clicks and clicks with dwell time < 50 time units.
1 : relevant - clicks with dwell time > 50 and < 399 time units.
2 : highly relevant - clicks with dwell time > 400 time units.

How to beat Yandex?

Clicks
count

Rank

So we have to sort better than that!

Step 1 : Reshape it!

For each user we would estimate his click probability on an url

Step 2 : Cross validate

Split your dataset like Yandex did:
- On the last 3 days.
- Only one session by user.

Goal: auto-evaluate our model.

Step 3 : Add new features

We add some informations on each user:
- Did he see this url in the past?
- Did he click on it?
- How many times?
- Did he skip it?
- Had he ever click on a rank 9 url in the past?

Step 3 : Add new features

The thing is, we don't want to re-rank all...
So we add click entropy:

For each query:

Where p(x) is the percentage of clicks on document x among all clicks.

Example:
Small click entropy query: yahoo, youtube.
Large click entropy query: photos, jobs.

Step 4 : the model.

Goal: Predict the probability of click of an user on an url.
Our training set:
session

url

features...

We use logistic regression and random forest.

target

Thanks !

If you want to enter with us in a future challenge:

contact@dataiku.com

Empfohlen

Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntEugene Yan Ziyou

Kaggle: Crowd Sourcing for Data AnalyticsJeffrey Funk Business Models

Winning Kaggle 101: Introduction to StackingTed Xiao

Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorVivian S. Zhang

Winning data science competitions, presented by Owen ZhangVivian S. Zhang

Tips for data science competitionsOwen Zhang

Florian Douetteau @ DataikuPAPIs.io

Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku

Empfohlen

Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntEugene Yan Ziyou

Kaggle: Crowd Sourcing for Data AnalyticsJeffrey Funk Business Models

Winning Kaggle 101: Introduction to StackingTed Xiao

Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorVivian S. Zhang

Winning data science competitions, presented by Owen ZhangVivian S. Zhang

Tips for data science competitionsOwen Zhang

Florian Douetteau @ DataikuPAPIs.io

Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku

Welcome Webinar SlidesSumo Logic

SEMNE Google Analytics Master Class - 15 Oct 2014Jay Murphy

Recsys2016 Tutorial by Xavier and DeepakDeepak Agarwal

Cloudera Movies Data Science Project On Big DataAbhishek M Shivalingaiah

Google Analytics Newport Interactive Marketers

Iterative Methodology for Personalization Models OptimizationSonya Liberman

moma-django overview --> Django + MongoDB: building a custom ORM layerGadi Oren

Supercharging your Organic CTRPhil Pearce

Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough dataMail.ru Group

kdd2015Deepak Agarwal

Search Engine OptimizationSD Sharma

How can a data layer help my seoPhil Pearce

Mozilla Foundation Metrics - presentation to engineersJohn Schneider

Profiler for Smartphone Users Interests Using Modified Hierarchical Agglomera...Lippo Group Digital

A new algorithm for inferring user search goals with feedback sessionsJPINFOTECH JAYAPRAKASH

Recommender Systems @ Scale - PyData 2019Sonya Liberman

A new algorithm for inferring user search goals with feedback sessionsJPINFOTECH JAYAPRAKASH

Find and be Found: Information Retrieval at LinkedInDaniel Tunkelang

Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Spark Summit

Intro to Google Analytics and Google AdWords (March 19 2013)Chester County Marketing Group

E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxnull - The Open Security Community

The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2

Weitere ähnliche Inhalte

Ähnlich wie Play with Kaggle

Welcome Webinar SlidesSumo Logic

SEMNE Google Analytics Master Class - 15 Oct 2014Jay Murphy

Recsys2016 Tutorial by Xavier and DeepakDeepak Agarwal

Cloudera Movies Data Science Project On Big DataAbhishek M Shivalingaiah

Google Analytics Newport Interactive Marketers

Iterative Methodology for Personalization Models OptimizationSonya Liberman

moma-django overview --> Django + MongoDB: building a custom ORM layerGadi Oren

Supercharging your Organic CTRPhil Pearce

Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough dataMail.ru Group

kdd2015Deepak Agarwal

Search Engine OptimizationSD Sharma

How can a data layer help my seoPhil Pearce

Mozilla Foundation Metrics - presentation to engineersJohn Schneider

Profiler for Smartphone Users Interests Using Modified Hierarchical Agglomera...Lippo Group Digital

A new algorithm for inferring user search goals with feedback sessionsJPINFOTECH JAYAPRAKASH

Recommender Systems @ Scale - PyData 2019Sonya Liberman

A new algorithm for inferring user search goals with feedback sessionsJPINFOTECH JAYAPRAKASH

Find and be Found: Information Retrieval at LinkedInDaniel Tunkelang

Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Spark Summit

Intro to Google Analytics and Google AdWords (March 19 2013)Chester County Marketing Group

Ähnlich wie Play with Kaggle (20)

Welcome Webinar Slides

SEMNE Google Analytics Master Class - 15 Oct 2014

Recsys2016 Tutorial by Xavier and Deepak

Cloudera Movies Data Science Project On Big Data

Google Analytics

Iterative Methodology for Personalization Models Optimization

moma-django overview --> Django + MongoDB: building a custom ORM layer

Supercharging your Organic CTR

Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data

kdd2015

Search Engine Optimization

How can a data layer help my seo

Mozilla Foundation Metrics - presentation to engineers

Profiler for Smartphone Users Interests Using Modified Hierarchical Agglomera...

A new algorithm for inferring user search goals with feedback sessions

Recommender Systems @ Scale - PyData 2019

A new algorithm for inferring user search goals with feedback sessions

Find and be Found: Information Retrieval at LinkedIn

Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...

Intro to Google Analytics and Google AdWords (March 19 2013)

Kürzlich hochgeladen

E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxnull - The Open Security Community

The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2

CloudStudio User manual (basic edition):comworks

DevEX - reference for building teams, processes, and platformsSergiu Bodiu

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays

Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity

Training state-of-the-art general text embeddingZilliz

"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada

Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz

Powerpoint exploring the locations used in television show Time Clashcharlottematthew16

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada

WordPress Websites for Engineers: Elevate Your Brandgvaughan

SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106

What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett

DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy

My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar

Kürzlich hochgeladen (20)

E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx

The Future of Software Development - Devin AI Innovative Approach.pdf

CloudStudio User manual (basic edition):

DevEX - reference for building teams, processes, and platforms

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack

Dev Dives: Streamline document processing with UiPath Studio Web

Training state-of-the-art general text embedding

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024

Vector Databases 101 - An introduction to the world of Vector Databases

Powerpoint exploring the locations used in television show Time Clash

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024

WordPress Websites for Engineers: Elevate Your Brand

SIP trunking in Janus @ Kamailio World 2024

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics

What's New in Teams Calling, Meetings and Devices March 2024

DevoxxFR 2024 Reproducible Builds with Apache Maven

My Hashitalk Indonesia April 2024 Presentation

Play with Kaggle

1. ith y w Pla Matthieu Scordia

2. Kaggle? Kaggle is a platform for predictive modeling competitions. “We're making data science into a sport.”

3. Let's enter a challenge!

4. The Data Noteworthy characteristics of the dataset: ● Unique queries: 21,073,569 ● Unique urls: 703,484,26 ● Unique users: 5,736,333 ● Training sessions: 34,573,630 ● Test sessions: 797,867 ● Clicks in the training data: 64,693,054 Total records in the log: 167,413,039 (=15Go!)

5. Let's shake the data

6. Logs format. Session 0 Session 1 Session 2 Session 3 Session 4

7. Sessions format. session 5 day 15 user 1 Term IDs (Url,Domain) ranked Session metadata Url clicked time passed

8. Evaluation. The URLs are labeled using 3 grades of relevance: {0, 1, 2} The labeling is done automatically, based on dwell-time. 0 : irrelevant - no clicks and clicks with dwell time < 50 time units. 1 : relevant - clicks with dwell time > 50 and < 399 time units. 2 : highly relevant - clicks with dwell time > 400 time units.

9. How to beat Yandex? Clicks count Rank So we have to sort better than that!

10. Step 1 : Reshape it! For each user we would estimate his click probability on an url

11. Step 2 : Cross validate Split your dataset like Yandex did: - On the last 3 days. - Only one session by user. Goal: auto-evaluate our model.

12. Step 3 : Add new features We add some informations on each user: - Did he see this url in the past? - Did he click on it? - How many times? - Did he skip it? - Had he ever click on a rank 9 url in the past?

13. Step 3 : Add new features The thing is, we don't want to re-rank all... So we add click entropy: For each query: Where p(x) is the percentage of clicks on document x among all clicks. Example: Small click entropy query: yahoo, youtube. Large click entropy query: photos, jobs.

14. Step 4 : the model. Goal: Predict the probability of click of an user on an url. Our training set: session url features... We use logistic regression and random forest. target

15. The leaderboard.

16. Thanks ! If you want to enter with us in a future challenge: contact@dataiku.com