SlideShare ist ein Scribd-Unternehmen logo
1 von 45
Downloaden Sie, um offline zu lesen
Target Leakage in ML
Yuriy Guts
Machine Learning Engineer
Automated ML, time series forecasting, NLP
Lecturer
AI, Machine Learning, Summer/Winter ML Schools
Compete sometimes
Currently hold an Expert rank, top 2% worldwide
https://github.com/YuriyGuts/odsc-target-leakage-workshop
Follow my presentation and code at:
Data known at prediction time Data unknown at prediction time
timeprediction point
Leakage in a Nutshell
contamination
Data known at prediction time Data unknown at prediction time
timeprediction point
Leakage in a Nutshell
contamination
Training on contaminated data leads to overly optimistic
expectations about model performance in production
“But I always validate on random K-fold CV. I should be fine, right?”
Data
Collection
Data
Preparation
Feature
Engineering
Partitioning
Training and
Tuning
Evaluation
Leakage can happen anywhere during the project lifecycle
Leakage in Data Collection
Where is the leakage?
EmployeeID Title ExperienceYears MonthlySalaryGBP AnnualIncomeUSD
315981 Data Scientist 3 5,000.00 78,895.44
4691 Data Scientist 4 5,500.00 86,784.98
23598 Data Scientist 5 6,200.00 97,830.35
Target is a function of another column
EmployeeID Title ExperienceYears MonthlySalaryGBP AnnualIncomeUSD
315981 Data Scientist 3 5,000.00 78,895.44
4691 Data Scientist 4 5,500.00 86,784.98
23598 Data Scientist 5 6,200.00 97,830.35
The target can have different formatting or measurement units in different columns.
Forgetting to remove the copies will introduce target leakage.
Check out the example: example-01-data-collection.ipynb
Where is the leakage?
SubscriberID Group DailyVoiceUsage DailySMSUsage DailyDataUsage Gender
24092091 M18-25 15.31 25 135.10 0
4092034091 F40-60 35.81 3 5.01 1
329815 F25-40 13.09 32 128.52 1
94721835 M25-40 18.52 21 259.34 0
Feature is an aggregate of the target
SubscriberID Group DailyVoiceUsage DailySMSUsage DailyDataUsage Gender
24092091 M18-25 15.31 25 135.10 0
4092034091 F40-60 35.81 3 5.01 1
329815 F25-40 13.09 32 128.52 1
94721835 M25-40 18.52 21 259.34 0
E.g., the data can have derived columns created after the fact for reporting purposes
Where is the leakage?
Education Married AnnualIncome Purpose LatePaymentReminders IsBadLoan
1 Y 80k Car Purchase 0 0
3 N 120k Small Business 3 1
1 Y 85k House Purchase 5 1
2 N 72k Marriage 1 0
Mutable data due to lack of snapshot-ability
Education Married AnnualIncome Purpose LatePaymentReminders IsBadLoan
1 Y 80k Car Purchase 0 0
3 N 120k Small Business 3 1
1 Y 85k House Purchase 5 1
2 N 72k Marriage 1 0
Database records get overwritten as more facts become available.
But these later facts won’t be available at prediction time.
Leakage in Feature Engineering
My model is sensitive to feature scaling...
Dataset StandardScaler
Partition into
CV/Holdout
Train
Evaluate
My model is sensitive to feature scaling...
Dataset StandardScaler
Partition into
CV/Holdout
Train
Evaluate
OOPS. WE’RE LEAKING THE TEST FEATURE DISTRIBUTION INFO
INTO THE TRAINING SET
Check out the example: example-02-data-prep.ipynb
Removing leakage in feature engineering
Dataset Partition
Train
StandardScaler
(fit)
Evaluate
StandardScaler
(transform)
train
eval
apply mean and std
obtained on train
Obtain feature engineering/transformation parameters only on the training set
Apply them to transform the evaluation sets (CV, holdout, backtests, …)
Encoding of different variable types
Text:
Learn DTM columns from the training set only, then transform the evaluation sets
(avoid leaking possible out-of-vocabulary words into the training pipeline)
Categoricals:
Create mappings on the training set only, then transform the evaluation sets
(avoid leaking cardinality/frequency info into the training pipeline)
Leakage in Partitioning
https://twitter.com/AndrewYNg/status/931026446717296640
Group Leakage
OOPS. THERE ARE FOUR TIMES MORE UNIQUE IMAGES THAN PATIENTS
Paper v1 (AUC) Paper v3 (AUC)
The Cold Start Problem
Observe:
Predict:
Group Partitioning, Out-of-Group Validation
Training: Validation:
Fold 1
Fold 2
Fold 3
Leakage in oversampling / augmentation
Random
Partitioning
Train
Evaluate
Original Data Oversampling or
Augmentation
Leakage in oversampling / augmentation
Random
Partitioning
Train
Evaluate
Original Data Oversampling or
Augmentation
Random
Partitioning
Train
Evaluate
Pairwise Data
(features for
A and B)
Augmentation
(AB → BA)
Leakage in oversampling / augmentation
Random
Partitioning
Train
Evaluate
Original Data Oversampling or
Augmentation
Random
Partitioning
Train
Evaluate
Pairwise Data
(features for
A and B)
Augmentation
(AB → BA)
OOPS. WE MAY GET COPIES SPLIT BETWEEN TRAINING AND EVALUATION
Leakage in oversampling / augmentation
Random
Partitioning
Train
Evaluate
Original Data Oversampling or
Augmentation
Random
Partitioning
Train
Evaluate
Pairwise Data
(features for
A and B)
Augmentation
(AB → BA)
First partition, then augment the training data.
Random Partitioning for Time-Aware Models
Random Partitioning for Time-Aware Models
Out-of-Time Validation (OTV)
Leakage in Training & Tuning
Reusing a CV split for multiple tasks
V
V
V
V
V
H
H
H
H
H
Fold 1
Fold 2
Fold 3
Fold 4
Fold 5
Feature selection, hyperparameter tuning, model selection...
Reusing a CV split for multiple tasks
V
V
V
V
V
H
H
H
H
H
Fold 1
Fold 2
Fold 3
Fold 4
Fold 5
Feature selection, hyperparameter tuning, model selection...
OOPS. CAN OVERTUNE TO THE VALIDATION SET
BETTER USE DIFFERENT SPLITS FOR DIFFERENT TASKS
Model stacking on in-sample predictions
Model 1
Train Predict
Model 2
Train Predict
2nd Level Model
Train
Model stacking on in-sample predictions
Model 1
Train Predict
Model 2
Train Predict
2nd Level Model
Train
OOPS. WILL LEAK THE TARGET IN THE META-FEATURES
Better way to stack
Model 1
Train Predict
V
Model 2
Train Predict
V
2nd Level Model
Train
Compute all meta-features only out-of-fold
Leakage in Competitive ML
Case Studies
● Removing user IDs does not necessarily mean data anonymization
(Kaggle: Expedia Hotel Recommendations, 2016)
● Anonymizing feature names does not mean anonymization either
(Kaggle: Santander Value Prediction competition, 2018)
● Target can sometimes be recovered from the metadata or external datasets
(Kaggle: Dato “Truly Native?” competition, 2015)
● Overrepresented minority class in pairwise data enables reverse engineering
(Kaggle: Quora Question Pairs competition, 2017)
Prevention Measures
Leakage prevention checklist (not exhaustive!)
● Split the holdout away immediately and do not preprocess it in any way before final model
evaluation.
● Make sure you have a data dictionary and understand the meaning of every column, as well
as unusual values (e.g. negative sales) or outliers.
● For every column in the final feature set, try answering the question:
“Will I have this feature at prediction time in my workflow? What values can it have?”
● Figure out preprocessing parameters on the training subset, freeze them elsewhere.
● Treat feature selection, model tuning, model selection as separate “machine learning
models” that need to be validated separately.
● Make sure your validation setup represents the problem you need to solve with the model.
● Check feature importance and prediction explanations: do top features make sense?
Thank you!
yuriy.guts@gmail.com
linkedin.com/in/yuriyguts

Weitere ähnliche Inhalte

Was ist angesagt?

【卒業論文】B2Bオークションにおけるユーザ別 入札行動予測に関する研究
【卒業論文】B2Bオークションにおけるユーザ別 入札行動予測に関する研究【卒業論文】B2Bオークションにおけるユーザ別 入札行動予測に関する研究
【卒業論文】B2Bオークションにおけるユーザ別 入札行動予測に関する研究harmonylab
 
backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networksAkash Goel
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter TuningJon Lederman
 
Feedforward neural network
Feedforward neural networkFeedforward neural network
Feedforward neural networkSopheaktra YONG
 
Neural Networks: Multilayer Perceptron
Neural Networks: Multilayer PerceptronNeural Networks: Multilayer Perceptron
Neural Networks: Multilayer PerceptronMostafa G. M. Mostafa
 
ディリクレ過程に基づく無限混合線形回帰モデル in 機械学習プロフェッショナルシリーズ輪読会
ディリクレ過程に基づく無限混合線形回帰モデル in 機械学習プロフェッショナルシリーズ輪読会ディリクレ過程に基づく無限混合線形回帰モデル in 機械学習プロフェッショナルシリーズ輪読会
ディリクレ過程に基づく無限混合線形回帰モデル in 機械学習プロフェッショナルシリーズ輪読会Shotaro Sano
 
AIシステムの要求とプロジェクトマネジメント-前半:機械学習工学概論
AIシステムの要求とプロジェクトマネジメント-前半:機械学習工学概論AIシステムの要求とプロジェクトマネジメント-前半:機械学習工学概論
AIシステムの要求とプロジェクトマネジメント-前半:機械学習工学概論Nobukazu Yoshioka
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networksSi Haem
 
Intro to Deep Learning for Computer Vision
Intro to Deep Learning for Computer VisionIntro to Deep Learning for Computer Vision
Intro to Deep Learning for Computer VisionChristoph Körner
 
Types of Machine Learning
Types of Machine LearningTypes of Machine Learning
Types of Machine LearningSamra Shahzadi
 
20190619 オートエンコーダーと異常検知入門
20190619 オートエンコーダーと異常検知入門20190619 オートエンコーダーと異常検知入門
20190619 オートエンコーダーと異常検知入門Kazuki Motohashi
 
Best Python Libraries For Data Science & Machine Learning | Edureka
Best Python Libraries For Data Science & Machine Learning | EdurekaBest Python Libraries For Data Science & Machine Learning | Edureka
Best Python Libraries For Data Science & Machine Learning | EdurekaEdureka!
 
PR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual RepresentationsPR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual RepresentationsJinwon Lee
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learningKien Le
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
機械学習 / Deep Learning 大全 (1) 機械学習基礎編
機械学習 / Deep Learning 大全 (1) 機械学習基礎編機械学習 / Deep Learning 大全 (1) 機械学習基礎編
機械学習 / Deep Learning 大全 (1) 機械学習基礎編Daiyu Hatakeyama
 

Was ist angesagt? (20)

【卒業論文】B2Bオークションにおけるユーザ別 入札行動予測に関する研究
【卒業論文】B2Bオークションにおけるユーザ別 入札行動予測に関する研究【卒業論文】B2Bオークションにおけるユーザ別 入札行動予測に関する研究
【卒業論文】B2Bオークションにおけるユーザ別 入札行動予測に関する研究
 
backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networks
 
supervised learning
supervised learningsupervised learning
supervised learning
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
 
Feedforward neural network
Feedforward neural networkFeedforward neural network
Feedforward neural network
 
Clustering
ClusteringClustering
Clustering
 
Neural Networks: Multilayer Perceptron
Neural Networks: Multilayer PerceptronNeural Networks: Multilayer Perceptron
Neural Networks: Multilayer Perceptron
 
ディリクレ過程に基づく無限混合線形回帰モデル in 機械学習プロフェッショナルシリーズ輪読会
ディリクレ過程に基づく無限混合線形回帰モデル in 機械学習プロフェッショナルシリーズ輪読会ディリクレ過程に基づく無限混合線形回帰モデル in 機械学習プロフェッショナルシリーズ輪読会
ディリクレ過程に基づく無限混合線形回帰モデル in 機械学習プロフェッショナルシリーズ輪読会
 
AIシステムの要求とプロジェクトマネジメント-前半:機械学習工学概論
AIシステムの要求とプロジェクトマネジメント-前半:機械学習工学概論AIシステムの要求とプロジェクトマネジメント-前半:機械学習工学概論
AIシステムの要求とプロジェクトマネジメント-前半:機械学習工学概論
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 
Intro to Deep Learning for Computer Vision
Intro to Deep Learning for Computer VisionIntro to Deep Learning for Computer Vision
Intro to Deep Learning for Computer Vision
 
Machine Learning by Rj
Machine Learning by RjMachine Learning by Rj
Machine Learning by Rj
 
Types of Machine Learning
Types of Machine LearningTypes of Machine Learning
Types of Machine Learning
 
Mask R-CNN
Mask R-CNNMask R-CNN
Mask R-CNN
 
20190619 オートエンコーダーと異常検知入門
20190619 オートエンコーダーと異常検知入門20190619 オートエンコーダーと異常検知入門
20190619 オートエンコーダーと異常検知入門
 
Best Python Libraries For Data Science & Machine Learning | Edureka
Best Python Libraries For Data Science & Machine Learning | EdurekaBest Python Libraries For Data Science & Machine Learning | Edureka
Best Python Libraries For Data Science & Machine Learning | Edureka
 
PR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual RepresentationsPR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual Representations
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learning
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
機械学習 / Deep Learning 大全 (1) 機械学習基礎編
機械学習 / Deep Learning 大全 (1) 機械学習基礎編機械学習 / Deep Learning 大全 (1) 機械学習基礎編
機械学習 / Deep Learning 大全 (1) 機械学習基礎編
 

Ähnlich wie Target Leakage in ML Detection and Prevention

Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Yuriy Guts
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science ChallengeMark Nichols, P.E.
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupDoug Needham
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Rodney Joyce
 
Scaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With LuminaireScaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With LuminaireDatabricks
 
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...All Things Open
 
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)dtz001
 
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)Jasjeet Thind
 
2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptxgdgsurrey
 
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...Databricks
 
Model Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaModel Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaDatabricks
 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10Roger Barga
 
Customer Segmentation with R - Deep Dive into flexclust
Customer Segmentation with R - Deep Dive into flexclustCustomer Segmentation with R - Deep Dive into flexclust
Customer Segmentation with R - Deep Dive into flexclustJim Porzak
 
Demystifying Data Science Webinar - February 14, 2018
Demystifying Data Science Webinar - February 14, 2018Demystifying Data Science Webinar - February 14, 2018
Demystifying Data Science Webinar - February 14, 2018Analytics8
 
Human-Centered Interpretable Machine Learning
Human-Centered Interpretable  Machine LearningHuman-Centered Interpretable  Machine Learning
Human-Centered Interpretable Machine LearningPrzemek Biecek
 
[Tutorial] building machine learning models for predictive maintenance applic...
[Tutorial] building machine learning models for predictive maintenance applic...[Tutorial] building machine learning models for predictive maintenance applic...
[Tutorial] building machine learning models for predictive maintenance applic...PAPIs.io
 
Open06
Open06Open06
Open06butest
 
Business Applications of Predictive Modeling at Scale
Business Applications of Predictive Modeling at ScaleBusiness Applications of Predictive Modeling at Scale
Business Applications of Predictive Modeling at ScaleSongtao Guo
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine LearningYuriy Guts
 

Ähnlich wie Target Leakage in ML Detection and Prevention (20)

Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup Group
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
 
Scaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With LuminaireScaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With Luminaire
 
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
 
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
 
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
 
2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx
 
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
 
Validation Is (Not) Easy
Validation Is (Not) EasyValidation Is (Not) Easy
Validation Is (Not) Easy
 
Model Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaModel Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and Verta
 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10
 
Customer Segmentation with R - Deep Dive into flexclust
Customer Segmentation with R - Deep Dive into flexclustCustomer Segmentation with R - Deep Dive into flexclust
Customer Segmentation with R - Deep Dive into flexclust
 
Demystifying Data Science Webinar - February 14, 2018
Demystifying Data Science Webinar - February 14, 2018Demystifying Data Science Webinar - February 14, 2018
Demystifying Data Science Webinar - February 14, 2018
 
Human-Centered Interpretable Machine Learning
Human-Centered Interpretable  Machine LearningHuman-Centered Interpretable  Machine Learning
Human-Centered Interpretable Machine Learning
 
[Tutorial] building machine learning models for predictive maintenance applic...
[Tutorial] building machine learning models for predictive maintenance applic...[Tutorial] building machine learning models for predictive maintenance applic...
[Tutorial] building machine learning models for predictive maintenance applic...
 
Open06
Open06Open06
Open06
 
Business Applications of Predictive Modeling at Scale
Business Applications of Predictive Modeling at ScaleBusiness Applications of Predictive Modeling at Scale
Business Applications of Predictive Modeling at Scale
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
 

Mehr von Yuriy Guts

Paraphrase Detection in NLP
Paraphrase Detection in NLPParaphrase Detection in NLP
Paraphrase Detection in NLPYuriy Guts
 
UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2Yuriy Guts
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)Yuriy Guts
 
Experiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG LvivExperiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG LvivYuriy Guts
 
A Developer Overview of Redis
A Developer Overview of RedisA Developer Overview of Redis
A Developer Overview of RedisYuriy Guts
 
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in ScalaYuriy Guts
 
Redis for .NET Developers
Redis for .NET DevelopersRedis for .NET Developers
Redis for .NET DevelopersYuriy Guts
 
Aspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NETAspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NETYuriy Guts
 
Non-Functional Requirements
Non-Functional RequirementsNon-Functional Requirements
Non-Functional RequirementsYuriy Guts
 
Introduction to Software Architecture
Introduction to Software ArchitectureIntroduction to Software Architecture
Introduction to Software ArchitectureYuriy Guts
 
UML for Business Analysts
UML for Business AnalystsUML for Business Analysts
UML for Business AnalystsYuriy Guts
 
Intro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT AudienceIntro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT AudienceYuriy Guts
 
ELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash CourseELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash CourseYuriy Guts
 
ELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - DatabasesELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - DatabasesYuriy Guts
 
ELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - MultithreadingELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - MultithreadingYuriy Guts
 
ELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and MemoryELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and MemoryYuriy Guts
 

Mehr von Yuriy Guts (17)

Paraphrase Detection in NLP
Paraphrase Detection in NLPParaphrase Detection in NLP
Paraphrase Detection in NLP
 
UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)NoSQL (ELEKS DevTalks #1 - Jan 2015)
NoSQL (ELEKS DevTalks #1 - Jan 2015)
 
Experiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG LvivExperiments with Machine Learning - GDG Lviv
Experiments with Machine Learning - GDG Lviv
 
A Developer Overview of Redis
A Developer Overview of RedisA Developer Overview of Redis
A Developer Overview of Redis
 
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
[JEEConf 2015] Lessons from Building a Modern B2C System in Scala
 
Redis for .NET Developers
Redis for .NET DevelopersRedis for .NET Developers
Redis for .NET Developers
 
Aspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NETAspect-Oriented Programming (AOP) in .NET
Aspect-Oriented Programming (AOP) in .NET
 
Non-Functional Requirements
Non-Functional RequirementsNon-Functional Requirements
Non-Functional Requirements
 
Introduction to Software Architecture
Introduction to Software ArchitectureIntroduction to Software Architecture
Introduction to Software Architecture
 
UML for Business Analysts
UML for Business AnalystsUML for Business Analysts
UML for Business Analysts
 
Intro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT AudienceIntro to Software Engineering for non-IT Audience
Intro to Software Engineering for non-IT Audience
 
ELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash CourseELEKS DevTalks #4: Amazon Web Services Crash Course
ELEKS DevTalks #4: Amazon Web Services Crash Course
 
ELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - DatabasesELEKS Summer School 2012: .NET 09 - Databases
ELEKS Summer School 2012: .NET 09 - Databases
 
ELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - MultithreadingELEKS Summer School 2012: .NET 06 - Multithreading
ELEKS Summer School 2012: .NET 06 - Multithreading
 
ELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and MemoryELEKS Summer School 2012: .NET 04 - Resources and Memory
ELEKS Summer School 2012: .NET 04 - Resources and Memory
 

Kürzlich hochgeladen

GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyDrAnita Sharma
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 

Kürzlich hochgeladen (20)

The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomology
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 

Target Leakage in ML Detection and Prevention

  • 1. Target Leakage in ML Yuriy Guts
  • 2. Machine Learning Engineer Automated ML, time series forecasting, NLP Lecturer AI, Machine Learning, Summer/Winter ML Schools Compete sometimes Currently hold an Expert rank, top 2% worldwide
  • 4. Data known at prediction time Data unknown at prediction time timeprediction point Leakage in a Nutshell contamination
  • 5. Data known at prediction time Data unknown at prediction time timeprediction point Leakage in a Nutshell contamination Training on contaminated data leads to overly optimistic expectations about model performance in production
  • 6. “But I always validate on random K-fold CV. I should be fine, right?”
  • 7.
  • 9. Leakage in Data Collection
  • 10. Where is the leakage? EmployeeID Title ExperienceYears MonthlySalaryGBP AnnualIncomeUSD 315981 Data Scientist 3 5,000.00 78,895.44 4691 Data Scientist 4 5,500.00 86,784.98 23598 Data Scientist 5 6,200.00 97,830.35
  • 11. Target is a function of another column EmployeeID Title ExperienceYears MonthlySalaryGBP AnnualIncomeUSD 315981 Data Scientist 3 5,000.00 78,895.44 4691 Data Scientist 4 5,500.00 86,784.98 23598 Data Scientist 5 6,200.00 97,830.35 The target can have different formatting or measurement units in different columns. Forgetting to remove the copies will introduce target leakage. Check out the example: example-01-data-collection.ipynb
  • 12. Where is the leakage? SubscriberID Group DailyVoiceUsage DailySMSUsage DailyDataUsage Gender 24092091 M18-25 15.31 25 135.10 0 4092034091 F40-60 35.81 3 5.01 1 329815 F25-40 13.09 32 128.52 1 94721835 M25-40 18.52 21 259.34 0
  • 13. Feature is an aggregate of the target SubscriberID Group DailyVoiceUsage DailySMSUsage DailyDataUsage Gender 24092091 M18-25 15.31 25 135.10 0 4092034091 F40-60 35.81 3 5.01 1 329815 F25-40 13.09 32 128.52 1 94721835 M25-40 18.52 21 259.34 0 E.g., the data can have derived columns created after the fact for reporting purposes
  • 14. Where is the leakage? Education Married AnnualIncome Purpose LatePaymentReminders IsBadLoan 1 Y 80k Car Purchase 0 0 3 N 120k Small Business 3 1 1 Y 85k House Purchase 5 1 2 N 72k Marriage 1 0
  • 15. Mutable data due to lack of snapshot-ability Education Married AnnualIncome Purpose LatePaymentReminders IsBadLoan 1 Y 80k Car Purchase 0 0 3 N 120k Small Business 3 1 1 Y 85k House Purchase 5 1 2 N 72k Marriage 1 0 Database records get overwritten as more facts become available. But these later facts won’t be available at prediction time.
  • 16. Leakage in Feature Engineering
  • 17. My model is sensitive to feature scaling... Dataset StandardScaler Partition into CV/Holdout Train Evaluate
  • 18. My model is sensitive to feature scaling... Dataset StandardScaler Partition into CV/Holdout Train Evaluate OOPS. WE’RE LEAKING THE TEST FEATURE DISTRIBUTION INFO INTO THE TRAINING SET Check out the example: example-02-data-prep.ipynb
  • 19. Removing leakage in feature engineering Dataset Partition Train StandardScaler (fit) Evaluate StandardScaler (transform) train eval apply mean and std obtained on train Obtain feature engineering/transformation parameters only on the training set Apply them to transform the evaluation sets (CV, holdout, backtests, …)
  • 20. Encoding of different variable types Text: Learn DTM columns from the training set only, then transform the evaluation sets (avoid leaking possible out-of-vocabulary words into the training pipeline) Categoricals: Create mappings on the training set only, then transform the evaluation sets (avoid leaking cardinality/frequency info into the training pipeline)
  • 22.
  • 24. Group Leakage OOPS. THERE ARE FOUR TIMES MORE UNIQUE IMAGES THAN PATIENTS
  • 25. Paper v1 (AUC) Paper v3 (AUC)
  • 26. The Cold Start Problem Observe: Predict:
  • 27. Group Partitioning, Out-of-Group Validation Training: Validation: Fold 1 Fold 2 Fold 3
  • 28. Leakage in oversampling / augmentation Random Partitioning Train Evaluate Original Data Oversampling or Augmentation
  • 29. Leakage in oversampling / augmentation Random Partitioning Train Evaluate Original Data Oversampling or Augmentation Random Partitioning Train Evaluate Pairwise Data (features for A and B) Augmentation (AB → BA)
  • 30. Leakage in oversampling / augmentation Random Partitioning Train Evaluate Original Data Oversampling or Augmentation Random Partitioning Train Evaluate Pairwise Data (features for A and B) Augmentation (AB → BA) OOPS. WE MAY GET COPIES SPLIT BETWEEN TRAINING AND EVALUATION
  • 31. Leakage in oversampling / augmentation Random Partitioning Train Evaluate Original Data Oversampling or Augmentation Random Partitioning Train Evaluate Pairwise Data (features for A and B) Augmentation (AB → BA) First partition, then augment the training data.
  • 32. Random Partitioning for Time-Aware Models
  • 33. Random Partitioning for Time-Aware Models
  • 36. Reusing a CV split for multiple tasks V V V V V H H H H H Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Feature selection, hyperparameter tuning, model selection...
  • 37. Reusing a CV split for multiple tasks V V V V V H H H H H Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Feature selection, hyperparameter tuning, model selection... OOPS. CAN OVERTUNE TO THE VALIDATION SET BETTER USE DIFFERENT SPLITS FOR DIFFERENT TASKS
  • 38. Model stacking on in-sample predictions Model 1 Train Predict Model 2 Train Predict 2nd Level Model Train
  • 39. Model stacking on in-sample predictions Model 1 Train Predict Model 2 Train Predict 2nd Level Model Train OOPS. WILL LEAK THE TARGET IN THE META-FEATURES
  • 40. Better way to stack Model 1 Train Predict V Model 2 Train Predict V 2nd Level Model Train Compute all meta-features only out-of-fold
  • 42. Case Studies ● Removing user IDs does not necessarily mean data anonymization (Kaggle: Expedia Hotel Recommendations, 2016) ● Anonymizing feature names does not mean anonymization either (Kaggle: Santander Value Prediction competition, 2018) ● Target can sometimes be recovered from the metadata or external datasets (Kaggle: Dato “Truly Native?” competition, 2015) ● Overrepresented minority class in pairwise data enables reverse engineering (Kaggle: Quora Question Pairs competition, 2017)
  • 44. Leakage prevention checklist (not exhaustive!) ● Split the holdout away immediately and do not preprocess it in any way before final model evaluation. ● Make sure you have a data dictionary and understand the meaning of every column, as well as unusual values (e.g. negative sales) or outliers. ● For every column in the final feature set, try answering the question: “Will I have this feature at prediction time in my workflow? What values can it have?” ● Figure out preprocessing parameters on the training subset, freeze them elsewhere. ● Treat feature selection, model tuning, model selection as separate “machine learning models” that need to be validated separately. ● Make sure your validation setup represents the problem you need to solve with the model. ● Check feature importance and prediction explanations: do top features make sense?