Avito Duplicate Ads Detection @ kaggle
Alexey Grigorev
Team ololobhi (Abhishek & ololo)
Data set
● ~3 mln train pairs, ~1 mln test pairs
● ~10.8 mln images (~45 GB)
● Target: duplicate / not duplicate for each pair of ads
● Evaluation metric: AUC
● Ad fields: Title, Category_ID, Pictures, Price, Description, locationID, attrsJSON
● No seller data
Process Overview
● Preprocessing: ItemInfo train/test and images go through text cleaning and image raw feature extraction; results are stored in mongo
● Computing features: text comparisons and image comparisons produce CSVs with pair features (then train the models on them)
● Features computed in small batches (4k-20k rows per batch)
Features 1
● Simple Features
○ CategoryID (plain, no OHE)
○ Number of images
○ Absolute price difference
● Simple Text Features
○ Num of Rus/Eng/Digits chars
○ Length of Title, Description
○ 2-4 ngram similarity on char level
○ Fuzzy string matches (via FuzzyWuzzy)
● Simple Picture Features
○ Channel statistics (min, mean, max, etc)
○ File size differences
○ Geometry matches
○ Num of exact matches via md5 hash
● Simple GEO Features
○ MetroID
○ LocationID
○ Euclidean distance
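Figure: comparing the values of one ad (10, 5, 15) with the values of the other ad (1, 50). All pairwise absolute differences |10 - 1|, |5 - 1|, |15 - 1|, |10 - 50|, |5 - 50|, |15 - 50| = 9, 4, 14, 40, 45, 35 are reshaped into one vector, then summarised with stats.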
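The pairwise comparison illustrated by these numbers can be sketched in a few lines. This is only a minimal illustration: the function name `pairwise_abs_diff_stats` and the choice of min/mean/max as the "stats" are assumptions, not taken from the original code.

```python
from itertools import product
from statistics import mean


def pairwise_abs_diff_stats(values_a, values_b):
    """Summarise all |a - b| differences between two ads' value lists."""
    # One difference per (value of ad B, value of ad A) combination,
    # flattened into a single list (the "reshape" step).
    diffs = [abs(a - b) for b, a in product(values_b, values_a)]
    if not diffs:
        # An ad without values (e.g. no images) yields no pairs.
        return {"min": None, "mean": None, "max": None}
    return {"min": min(diffs), "mean": mean(diffs), "max": max(diffs)}


# Values from the slide: one ad has 10, 5, 15; the other has 1, 50.
print(pairwise_abs_diff_stats([10, 5, 15], [1, 50]))
# {'min': 4, 'mean': 24.5, 'max': 45}
```

Reducing a variable number of differences to a fixed set of statistics gives every pair of ads the same number of columns, regardless of how many values each ad has.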
Features 2: Attributes
● Regularized Jaccard of values and key=value pairs
● Number of fields both ads didn’t fill
● TF-IDF on key=value pairs
○ dot product in TF-IDF space (norm=None) was better than cosine
● Cosine in SVD of TF-IDF
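A possible reading of the regularized Jaccard on key=value pairs is sketched below. The slides do not give the formula, so the regularization used here (a constant `reg` added to the denominator) is an assumption, and the attribute names in the example are made up.

```python
import json


def key_value_pairs(attrs_json):
    """Turn an attrsJSON string into a set of 'key=value' strings."""
    if not attrs_json:
        return set()
    return {f"{k}={v}" for k, v in json.loads(attrs_json).items()}


def regularized_jaccard(a, b, reg=1.0):
    """Jaccard similarity with a constant added to the denominator.

    Assumed form of the regularization: adding `reg` shrinks the score of
    very small sets, so one shared pair out of one counts less than ten
    shared pairs out of ten.
    """
    return len(a & b) / (len(a | b) + reg)


# Hypothetical attributes of two ads.
ad1 = key_value_pairs('{"Brand": "Apple", "Color": "black"}')
ad2 = key_value_pairs('{"Brand": "Apple", "Color": "white"}')
print(regularized_jaccard(ad1, ad2))  # 1 shared pair, 3 distinct, reg=1 -> 0.25
```

With this form, two ads with empty attributes score 0 instead of raising a division by zero.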
Features 3: Text
● Jaccard & Cosine on digits only and on English tokens only
● Russian chars in English words
○ E.g. “о” in “iphоne” is Cyrillic
● Cosine in TF, TF-IDF, BM25, cosine of SVD of them
● Common tokens & differences:
○ Text1: “продам iphone”, Text2: “продам айфон”
○ Common: {продам}, Difference: {iphone, айфон}
○ Cosine in TF (binary), SVD of it
● Word2Vec & GloVe
○ Cosine and Manhattan distance between average title vectors
○ Stats of pairwise cosines between all tokens excluding the same ones
○ Tokens from title, description, title + description, nouns only
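Two of the text features above, Cyrillic characters inside English words and the common/difference token sets, could look roughly like this. It is a sketch under simple assumptions: whitespace tokenization, lowercasing, and script detection through Unicode character names; the function names are illustrative.

```python
import unicodedata


def script_of(ch):
    """Return 'CYRILLIC', 'LATIN' or None for a single character."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("CYRILLIC"):
        return "CYRILLIC"
    if name.startswith("LATIN"):
        return "LATIN"
    return None


def mixed_script_tokens(text):
    """Tokens that contain both Latin and Cyrillic letters."""
    mixed = []
    for token in text.lower().split():
        scripts = {script_of(ch) for ch in token} - {None}
        if {"CYRILLIC", "LATIN"} <= scripts:
            mixed.append(token)
    return mixed


def common_and_difference(text1, text2):
    """Shared tokens and tokens that occur in only one of the two texts."""
    t1, t2 = set(text1.lower().split()), set(text2.lower().split())
    return t1 & t2, t1 ^ t2


# "\u043e" is the Cyrillic letter that looks like a Latin "o".
print(mixed_script_tokens("продам iph\u043ene"))
print(common_and_difference("продам iphone", "продам айфон"))
```

The common and difference sets can then be vectorized separately, which is how the slide arrives at "Cosine in TF (binary), SVD of it".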
Features 3 cont’d: Word Mover’s Distance
● “True” WMD is complex and slow
● “Poor Man’s” WMD is faster:
○ WMD(A, B): For each term in doc A take distance to closest term in doc B, sum over them
○ WMD_sym(A, B) = WMD(A, B) + WMD(B, A)
figure from https://github.com/mkusner/wmd
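The "Poor Man's" WMD defined above can be written directly from its description. In this sketch the embeddings are a toy dictionary of 2-d vectors and the distance is Euclidean; both are assumptions for illustration, as is returning infinity when a document has no known terms.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def poor_mans_wmd(doc_a, doc_b, vectors):
    """Sum over terms of A of the distance to the closest term of B."""
    terms_a = [t for t in doc_a if t in vectors]
    terms_b = [t for t in doc_b if t in vectors]
    if not terms_a or not terms_b:
        return float("inf")  # no comparable terms; a convention chosen here
    return sum(
        min(euclidean(vectors[a], vectors[b]) for b in terms_b)
        for a in terms_a
    )


def poor_mans_wmd_sym(doc_a, doc_b, vectors):
    """WMD_sym(A, B) = WMD(A, B) + WMD(B, A)."""
    return poor_mans_wmd(doc_a, doc_b, vectors) + poor_mans_wmd(doc_b, doc_a, vectors)


# Toy vectors standing in for Word2Vec or GloVe embeddings.
vectors = {"продам": (0.0, 0.0), "iphone": (1.0, 0.0), "айфон": (1.0, 1.0)}
print(poor_mans_wmd_sym(["продам", "iphone"], ["продам", "айфон"], vectors))  # 2.0
```

The one-directional score is not symmetric in general, which is why the slide adds both directions.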
Features 3 cont’d: Misspellings
● Idea: same author can make same types of mistakes
○ No space after dot/comma (“продам айфон.дешево”)
○ Morphological errors
○ And others
● Represent ads as “Bag of Misspellings”
● Use Regularized Jaccard and Cosine
● Misspellings extracted with languagetool.org
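Once each ad is reduced to a "Bag of Misspellings", comparing two ads is an ordinary bag comparison. The sketch below shows the cosine part only; the mistake labels are hypothetical placeholders, not actual rule names reported by languagetool.org.

```python
import math
from collections import Counter


def cosine(bag_a, bag_b):
    """Cosine similarity between two bags (Counters) of mistake types."""
    dot = sum(count * bag_b[key] for key, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an ad without detected mistakes matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical mistake labels, counted per ad.
ad1 = Counter({"NO_SPACE_AFTER_PUNCT": 2, "MORPHOLOGY": 1})
ad2 = Counter({"NO_SPACE_AFTER_PUNCT": 1})
print(cosine(ad1, ad2))
```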
Features 4: Images
● Stuff that everybody used
○ Image hashes from imagehash library and forums
○ Chi2 & Bhattacharyya on histograms (with openIMAJ)
○ SIFT keypoints + matching (with openIMAJ)
○ Structural Similarity (computed with pyssim)
● Perceptual hashes computed with imagemagick
○ Hashes computed on each channel separately and on the mean channel
● Image moments: Centroids (“Ellipses” in imagemagick)
○ Centroids = centers of mass of each channel
○ Distances between image centroids in each channel
● Image moment invariants (imagemagick)
○ 7 moments, invariant to translation, scale and rotation
○ Put all 7 invariants in a vector, compute cosine and distance
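The centroid feature from this section (center of mass of each channel, then the distance between centroids) can be illustrated without imagemagick. This sketch assumes a channel is given as a list of rows of pixel intensities; it is not the imagemagick computation itself.

```python
import math


def centroid(channel):
    """Center of mass (x, y) of one channel given as rows of intensities."""
    total = sum(sum(row) for row in channel)
    if total == 0:
        return None  # an all-black channel has no defined center of mass
    x = sum(col * v for row in channel for col, v in enumerate(row)) / total
    y = sum(r * v for r, row in enumerate(channel) for v in row) / total
    return x, y


def centroid_distance(channel_a, channel_b):
    """Euclidean distance between the centroids of two channels."""
    ca, cb = centroid(channel_a), centroid(channel_b)
    if ca is None or cb is None:
        return None
    return math.dist(ca, cb)


bright_bottom_right = [[0, 0], [0, 10]]
bright_top_left = [[10, 0], [0, 0]]
print(centroid(bright_bottom_right))
print(centroid_distance(bright_bottom_right, bright_top_left))
```

For images of different sizes it may make sense to divide the coordinates by width and height first; the slides do not say whether this was done.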
Features 5: GEO
● Reverse geocode (lat, lon) to a location
● Features like same_region, same_city, same_zip
● |zip1 - zip2|
Figure: (53.87, 27.66) → OpenStreetMap → (region, city, zip)
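Given two already reverse-geocoded locations, the pair features listed above reduce to a few comparisons. The sketch assumes each location is a `(region, city, zip)` tuple; the values in the example are placeholders, not real geocoding output.

```python
def geo_pair_features(loc1, loc2):
    """Compare two (region, city, zip) tuples from a reverse geocoder."""
    region1, city1, zip1 = loc1
    region2, city2, zip2 = loc2
    try:
        zip_diff = abs(int(zip1) - int(zip2))
    except (TypeError, ValueError):
        zip_diff = None  # missing or non-numeric zip code
    return {
        "same_region": int(region1 == region2),
        "same_city": int(city1 == city2),
        "same_zip": int(zip1 == zip2),
        "zip_diff": zip_diff,
    }


# Placeholder locations for two ads.
print(geo_pair_features(("Region A", "City X", "220100"),
                        ("Region A", "City Y", "220050")))
```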
Feature Selection
● Correlation
○ A lot of features. Many correlated ones
○ Find feature groups of 0.90 correlation
○ Keep only one of the features
● XGBoost Feature Importance
○ https://github.com/Far0n/xgbfi
○ Run xgb on a sample with 100 trees
○ Use xgbfi to extract most important features
● Combined
○ In a correlated group, choose the most important feature using xgbfi output
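The combined selection rule can be approximated with a greedy pass: visit features from most to least important and drop any feature that correlates strongly with one already kept. This is a simplified stand-in for explicit group finding, and the importance scores are assumed to be available as a plain dictionary (for example, parsed from xgbfi output).

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)


def select_features(columns, importance, threshold=0.90):
    """Keep the most important feature among strongly correlated ones.

    columns: dict name -> list of values
    importance: dict name -> importance score
    """
    kept = []
    for name in sorted(columns, key=lambda n: importance[n], reverse=True):
        # Keep the feature only if it is not too correlated with a kept one.
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]}
importance = {"a": 0.5, "b": 0.9, "c": 0.1}
print(select_features(columns, importance))  # ['b', 'c']
```

Here `a` and `b` are perfectly correlated, so only the more important `b` survives, while the weakly correlated `c` is kept.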
Most Important Features
SVM fit on common & diff tokens
Models & Ensembling
● Parameter tuning
○ Random search
● My best model: 0.939 public LB
○ XGB with depth=8 and 2.5k trees
○ Trained a few days
● Ensembling:
○ Sample group of features
○ Randomly choose the parameters
○ Build ETs and XGBs
○ Stack with Log Reg (L2 regularization with low C)
● Our final model:
○ Neural network on ETs and XGBs outputs + some selected 1st level features
Lessons Learned
● It’s important to get CV right
○ My scheme: shuffled 3-fold (leaky)
○ Couldn’t use some nice features because of it
○ CV score of the ensemble was too good
○ Result: CV via LB
○ The right one: by connected components
● A lot of features is not always good
○ Computed too many features
○ Had a hard time managing to use them all
○ Had to start stacking early
That’s a graph!
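Splitting "by connected components" treats the pairs as edges of a graph over ads and keeps every component inside a single fold, presumably so that the same ad never appears on both sides of a split. The sketch below is one way to do it with union-find and greedy fold balancing; the function names and the balancing strategy are assumptions, not the author's code.

```python
from collections import defaultdict


def connected_components(pairs):
    """Map every item ID to a component ID using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return {item: find(item) for item in list(parent)}


def component_folds(pairs, n_folds=3):
    """Assign each pair to a fold so that no component spans two folds."""
    component_of = connected_components(pairs)
    # Both items of a pair share a component, so the first item suffices.
    sizes = defaultdict(int)
    for a, _ in pairs:
        sizes[component_of[a]] += 1
    # Greedy balancing: biggest components first, each into the emptiest fold.
    fold_of_component, load = {}, [0] * n_folds
    for comp, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        fold = load.index(min(load))
        fold_of_component[comp] = fold
        load[fold] += size
    return [fold_of_component[component_of[a]] for a, _ in pairs]


# Pairs (1, 2) and (2, 3) share ad 2, so they must land in the same fold.
pairs = [(1, 2), (2, 3), (4, 5), (6, 7)]
print(component_folds(pairs, n_folds=2))  # [0, 0, 1, 1]
```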
https://github.com/alexeygrigorev/avito-duplicates-kaggle
Questions?

Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 

Avito Duplicate Ads Detection @ kaggle

  • 1. Alexey Grigorev Team ololobhi (Abhishek & ololo)
  • 2. Data set
      ● ~3 million train pairs, ~1 million test pairs
      ● ~10.8 million images (~45 GB)
      ● Evaluation metric: AUC
      [Figure label: Target]
  • 4. Process Overview
      ● Input: ItemInfo train/test, images
      ● Preprocessing: text cleaning, image raw features
      ● Intermediate storage: mongo
      ● Computing features: text comparisons, image comparisons (features computed in small batches, 4k-20k rows per batch)
      ● Output: CSVs with pair features (then train the models on them)
  • 5. Features 1
      ● Simple Features
        ○ CategoryID (plain, no OHE)
        ○ Number of images
        ○ Absolute price difference
      ● Simple Text Features
        ○ Num of Rus/Eng/Digits chars
        ○ Length of Title, Description
        ○ 2-4 ngram similarity on char level
        ○ Fuzzy string matches (via FuzzyWuzzy)
      ● Simple Picture Features
        ○ Channel statistics (min, mean, max, etc)
        ○ File size differences
        ○ Geometry matches
        ○ Num of exact matches via md5 hash
      ● Simple GEO Features
        ○ MetroID
        ○ LocationID
        ○ Euclidean distance
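  • 6. [Figure: pairwise absolute differences between the values of one ad (1, 50) and the values of the other ad (10, 5, 15): |10 - 1|, |5 - 1|, |15 - 1|, |10 - 50|, |5 - 50|, |15 - 50| = 9, 4, 14, 40, 45, 35; the result is reshaped into one vector and summarized with stats]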
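The figure on slide 6 can be reproduced with a few lines of NumPy. This is only a sketch: the function name `pairwise_abs_diff_stats` and the particular statistics (min, mean, max, std) are assumptions, since the slide only says "stats".

```python
import numpy as np

def pairwise_abs_diff_stats(values_a, values_b):
    """Summarize |b_j - a_i| over all pairs of values from two ads."""
    a = np.asarray(values_a, dtype=float)
    b = np.asarray(values_b, dtype=float)
    if a.size == 0 or b.size == 0:
        # No pairs to compare (e.g. one ad has no images).
        return {"min": np.nan, "mean": np.nan, "max": np.nan, "std": np.nan}
    # Broadcasting builds a (len(a), len(b)) matrix of absolute differences.
    diffs = np.abs(b[None, :] - a[:, None])
    # Reshape to a flat vector, as in the slide, then aggregate.
    flat = diffs.reshape(-1)
    return {
        "min": float(flat.min()),
        "mean": float(flat.mean()),
        "max": float(flat.max()),
        "std": float(flat.std()),
    }

# Values from the slide: one ad has (1, 50), the other has (10, 5, 15).
stats = pairwise_abs_diff_stats([1, 50], [10, 5, 15])
```

The benefit of this scheme is that ads with different numbers of values (for example, different numbers of images) still produce a fixed-length feature vector.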
  • 7. Features 2: Attributes
      ● Regularized Jaccard of values and key=value pairs
      ● Number of fields both ads didn’t fill
      ● TF-IDF on key=value pairs
        ○ Dot product in TF-IDF space (norm=None) was better than cosine
      ● Cosine in SVD of TF-IDF
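A rough sketch of the first two attribute features follows. The slides do not define "regularized" Jaccard, so the additive constant `reg` in the denominator is an assumption, as are the function names and the example attribute dictionaries.

```python
def key_value_tokens(attrs):
    """Turn an attribute dict into a set of 'key=value' tokens."""
    return {f"{key}={value}" for key, value in attrs.items()}

def regularized_jaccard(set_a, set_b, reg=1.0):
    """Jaccard similarity with a constant added to the denominator.

    ASSUMPTION: this is one common way to regularize; it shrinks the score
    for small sets so that tiny overlaps are not over-trusted.
    """
    a, b = set(set_a), set(set_b)
    return len(a & b) / (len(a | b) + reg)

def count_both_missing(attrs_a, attrs_b, all_keys):
    """Number of known fields that neither ad filled in."""
    return len(set(all_keys) - set(attrs_a) - set(attrs_b))

# Hypothetical parsed attrsJSON of two ads.
ad1 = {"Brand": "Apple", "Color": "black"}
ad2 = {"Brand": "Apple", "Color": "white"}

jaccard_values = regularized_jaccard(ad1.values(), ad2.values())
jaccard_pairs = regularized_jaccard(key_value_tokens(ad1), key_value_tokens(ad2))
both_missing = count_both_missing(ad1, ad2, ["Brand", "Color", "Memory", "Condition"])
```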
  • 8. Features 3: Text
      ● Jaccard & Cosine on digits only and on English tokens only
      ● Russian chars in English words
        ○ E.g. “о” in “iphоne” is Cyrillic
      ● Cosine in TF, TF-IDF, BM25, cosine of SVD of them
      ● Common tokens & differences:
        ○ Text1: “продам iphone”, Text2: “продам айфон”
        ○ Common: {продам}, Difference: {iphone, айфон}
        ○ Cosine in TF (binary), SVD of it
      ● Word2Vec & GloVe
        ○ Cosine and manhattan b/w average title vectors
        ○ Stats of pairwise cosines between all tokens excluding the same ones
        ○ Tokens from title, description, title + description, nouns only
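Two of the text features above are easy to illustrate with the standard library. The sketch below assumes a simple `\w+` tokenizer and lowercase normalization, neither of which is specified in the slides; the function names are hypothetical.

```python
import re

CYRILLIC = re.compile(r"[а-яё]")
LATIN = re.compile(r"[a-z]")

def tokenize(text):
    """Lowercase and split on non-word characters (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())

def mixed_script_tokens(text):
    """Tokens that contain both Cyrillic and Latin letters."""
    return [t for t in tokenize(text) if CYRILLIC.search(t) and LATIN.search(t)]

def common_and_difference(text1, text2):
    """Split the tokens of two texts into shared and non-shared sets."""
    t1, t2 = set(tokenize(text1)), set(tokenize(text2))
    return t1 & t2, t1 ^ t2

# "\u043e" is the Cyrillic letter "о" hidden inside a Latin word.
suspicious = mixed_script_tokens("продам iph\u043ene")
common, difference = common_and_difference("продам iphone", "продам айфон")
```

Here `suspicious` holds the single mixed-script token, `common` is `{"продам"}`, and `difference` is `{"iphone", "айфон"}`, matching the slide's example.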
  • 9. Features 3 cont’d: Word Mover’s Distance
      ● “True” WMD is complex and slow
      ● “Poor Man’s” WMD is faster:
        ○ WMD(A, B): for each term in doc A, take the distance to the closest term in doc B, and sum over them
        ○ WMD_sym(A, B) = WMD(A, B) + WMD(B, A)
      Figure from https://github.com/mkusner/wmd
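The "Poor Man's" WMD described on this slide can be sketched as follows. The slide does not say which distance is used between word vectors, so Euclidean distance is an assumption here, and the toy 2-d `vectors` dictionary stands in for real Word2Vec or GloVe embeddings.

```python
import math

def poor_mans_wmd(doc_a, doc_b, vectors):
    """For each term in doc_a, add the distance to its closest term in doc_b."""
    terms_a = [t for t in doc_a if t in vectors]
    terms_b = [t for t in doc_b if t in vectors]
    if not terms_a or not terms_b:
        return float("nan")  # nothing to compare
    return sum(
        min(math.dist(vectors[ta], vectors[tb]) for tb in terms_b)
        for ta in terms_a
    )

def poor_mans_wmd_sym(doc_a, doc_b, vectors):
    """Symmetric version: WMD(A, B) + WMD(B, A)."""
    return poor_mans_wmd(doc_a, doc_b, vectors) + poor_mans_wmd(doc_b, doc_a, vectors)

# Toy embeddings (hypothetical values, for illustration only).
vectors = {"sell": (0.0, 0.0), "iphone": (1.0, 0.0), "phone": (1.0, 1.0)}
score = poor_mans_wmd_sym(["sell", "iphone"], ["sell", "phone"], vectors)
```

Unlike the true WMD, this does not solve a transport problem: several terms of A may map to the same term of B, which is why the one-directional score is not symmetric and the two directions are added.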
  • 10. Features 3 cont’d: Misspellings
      ● Idea: same author can make same types of mistakes
        ○ No space after dot/comma (“продам айфон.дешево”)
        ○ Morphological errors
        ○ And others
      ● Represent ads as “Bag of Misspellings”
      ● Use Regularized Jaccard and Cosine
      ● Misspellings extracted with languagetool.org
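To make the "Bag of Misspellings" idea concrete, here is a minimal sketch. The team extracted misspellings with LanguageTool; the two regex rules below are hypothetical stand-ins for its rule matches, used only so that the example is self-contained.

```python
import math
import re
from collections import Counter

# ASSUMPTION: toy rules standing in for LanguageTool rule IDs.
RULES = {
    "no_space_after_punct": re.compile(r"[^\W\d_][.,][^\W\d_]"),
    "double_space": re.compile(r" {2,}"),
}

def bag_of_misspellings(text):
    """Count how many times each mistake type occurs in the text."""
    bag = Counter()
    for name, pattern in RULES.items():
        hits = len(pattern.findall(text))
        if hits:
            bag[name] = hits
    return bag

def cosine(bag_a, bag_b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(bag_a[k] * bag_b[k] for k in bag_a.keys() & bag_b.keys())
    norm_a = math.sqrt(sum(v * v for v in bag_a.values()))
    norm_b = math.sqrt(sum(v * v for v in bag_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

ad1 = bag_of_misspellings("продам айфон.дешево")
ad2 = bag_of_misspellings("куплю ноутбук,срочно")
ad3 = bag_of_misspellings("продам  айфон")
```

Ads 1 and 2 share the same mistake type and get a cosine of 1.0, while ad 3 makes a different mistake and scores 0.0 against ad 1.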
  • 11. Features 4: Images
      ● Stuff that everybody used
        ○ Image hashes from imagehash library and forums
        ○ Chi2 & Bhattacharyya on histograms (with openIMAJ)
        ○ SIFT keypoints + matching (with openIMAJ)
        ○ Structural Similarity (computed with pyssim)
      ● Perceptual hashes computed with imagemagick
        ○ Hashes computed on each channel separately and on the mean channel
      ● Image moments: Centroids (“Ellipses” in imagemagick)
        ○ Centroids = centers of mass of each channel
        ○ Distances between image centroids in each channel
      ● Image moment invariants (imagemagick)
        ○ 7 moments, invariant to translation, scale and rotation
        ○ Put all 7 invariants in a vector, compute cosine and distance
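The centroid feature can be illustrated in plain Python. The team computed centroids with imagemagick; the code below is only a sketch of the underlying definition (intensity-weighted mean pixel position), with a channel represented as a hypothetical list of rows of pixel intensities.

```python
import math

def channel_centroid(channel):
    """Center of mass (x, y) of one image channel, weighted by intensity."""
    total = x_sum = y_sum = 0.0
    for y, row in enumerate(channel):
        for x, value in enumerate(row):
            total += value
            x_sum += x * value
            y_sum += y * value
    if total == 0:
        return None  # an all-black channel has no defined centroid
    return (x_sum / total, y_sum / total)

def centroid_distance(channel_a, channel_b):
    """Euclidean distance between the centroids of the same channel in two images."""
    ca, cb = channel_centroid(channel_a), channel_centroid(channel_b)
    if ca is None or cb is None:
        return float("nan")
    return math.dist(ca, cb)

# Two tiny 2x2 "channels" with all mass in opposite corners.
img1_red = [[0, 0], [0, 1]]
img2_red = [[1, 0], [0, 0]]
distance = centroid_distance(img1_red, img2_red)
```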
  • 12. Features 5: GEO
      ● Reverse geocode (lat, lon) to a location with OpenStreetMap, e.g. (53.87, 27.66) → (region, city, zip)
      ● Features like same_region, same_city, same_zip
      ● |zip1 - zip2|
  • 13. Feature Selection
      ● Correlation
        ○ A lot of features. Many correlated ones
        ○ Find feature groups of 0.90 correlation
        ○ Keep only one of the features
      ● XGBoost Feature Importance
        ○ https://github.com/Far0n/xgbfi
        ○ Run xgb on a sample with 100 trees
        ○ Use xgbfi to extract most important features
      ● Combined
        ○ In a correlated group, choose the most important feature using xgbfi output
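The combined selection step could look roughly like the sketch below. It assumes that "groups" are connected components of the graph linking features with |Pearson correlation| >= 0.90 (the slides do not say how groups were formed), and it takes importance scores as a plain dictionary standing in for xgbfi output. All names and numbers are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(columns, importance, threshold=0.90):
    """Keep the most important feature from each correlated group."""
    names = list(columns)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Link every pair of features whose correlation exceeds the threshold.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                parent[find(a)] = find(b)

    best = {}
    for name in names:
        root = find(name)
        if root not in best or importance[name] > importance[best[root]]:
            best[root] = name
    return sorted(best.values())

# Toy data: "a" and "b" are nearly collinear, "c" is not.
columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.1], "c": [4, 1, 3, 2]}
importance = {"a": 0.2, "b": 0.5, "c": 0.1}
kept = select_features(columns, importance)
```

With the toy data, `a` and `b` fall into one group and `b` wins on importance, so `kept` is `["b", "c"]`.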
  • 14. Most Important Features SVM fit on common & diff tokens
  • 15. Models & Ensembling
      ● Parameter tuning
        ○ Random search
      ● My best model: 0.939 public LB
        ○ XGB with depth=8 and 2.5k trees
        ○ Trained for a few days
      ● Ensembling:
        ○ Sample group of features
        ○ Randomly choose the parameters
        ○ Build ETs and XGBs
        ○ Stack with Log Reg (L2 regularization with low C)
      ● Our final model:
        ○ Neural network on ETs and XGBs outputs + some selected 1st level features
  • 16. Lessons Learned
      ● It’s important to get CV right
        ○ My scheme: shuffle 3 fold (leaky)
        ○ Couldn’t use some nice features because of it
        ○ CV score of the ensemble was too good
        ○ Result: CV via LB
        ○ The right one: by connected components (the pairs form a graph!)
      ● Having a lot of features is not always good
        ○ Computed too many features
        ○ Had a hard time managing to use them all
        ○ Had to start stacking early
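Splitting by connected components means treating every ad as a node and every train pair as an edge, then keeping each component inside a single fold so that no ad leaks across folds. A minimal sketch, assuming pairs are given as `(item_id_1, item_id_2)` tuples and using a simple round-robin fold assignment (which does not balance fold sizes):

```python
def component_folds(pairs, n_folds=3):
    """Assign each pair to a fold so that connected ads share a fold."""
    parent = {}

    def find(item):
        # Union-find lookup with path halving; unseen items become roots.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    # Merge the two ads of every pair into one component.
    for a, b in pairs:
        parent[find(a)] = find(b)

    fold_of_root = {}
    folds = []
    for a, _ in pairs:
        root = find(a)
        if root not in fold_of_root:
            # Round-robin over components in order of first appearance.
            fold_of_root[root] = len(fold_of_root) % n_folds
        folds.append(fold_of_root[root])
    return folds

# Ads 1, 2, 3 and 8 are linked through shared pairs, so they stay together.
pairs = [(1, 2), (2, 3), (4, 5), (6, 7), (3, 8)]
folds = component_folds(pairs)
```

A plain shuffled K-fold would likely put pair (1, 2) and pair (2, 3) into different folds, so ad 2 would appear in both training and validation data, which is the leak the slide refers to.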