Counteracting Selection
Bias in Machine Learning
Noam Finkelstein
MLConf SF
November 8th 2019
Overview
➣ Data are collected in all kinds of ways
➣ We pretend they are collected “At Random”
➣ This creates poor predictive performance in important
regions of the input space
➣ We can model the collection process to improve
performance
Takeaways
➣ Understand the importance of selection bias in ML
○ Not discussed as much in ML as in statistics
➣ Be able to identify when our data might have this problem.
➣ Learn how to model data collection.
➣ Learn how to use our data to learn about selection bias when
possible.
Data Collection Step 1:
Things happen
Data Collection Step 2:
Some of them get recorded
Bias in Data Collection Step 1
Things happen
➣ Selection bias: Correlation between how likely we are to
see a data point (X, Y), and the outcome Y
➣ Example 1:
○ We are asked to create a tool to help project managers
predict profit of software projects
○ Our data include all software projects previously
undertaken at the company
○ PMs are good at their jobs, so projects that lose money
rarely appear in the data. They just don’t happen.
Bias in Data Collection Step 1
Things happen
[Figure: profit vs. project complexity, showing approved projects]
Bias in Data Collection Step 1
Some Things Don’t Happen
[Figure: profit vs. project complexity, showing approved and rejected projects]
Bias in Data Collection
➣ No ML model can learn about the
“complexity boundary”, even
though we have access to all the
projects that were undertaken.
Nothing is “missing”.
➣ This is a very bad way to fail!
Our model will do badly specifically
where we want it to protect us from
poor decisions.
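The failure described above can be reproduced with a small simulation. This is a minimal sketch under stated assumptions, not the talk's own experiment: the profit curve `true_profit`, the hard approval rule `profit > 0`, and the k-nearest-neighbour predictor are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_profit(complexity):
    # Hypothetical ground truth: profit peaks at moderate complexity, then collapses.
    return 2.0 * complexity - 0.5 * complexity**2

# Step 1: every proposal that could happen.
complexity = rng.uniform(0.0, 8.0, size=4000)
profit = true_profit(complexity) + rng.normal(0.0, 1.0, size=complexity.size)

# Assumed selection rule: only projects that end up profitable are undertaken,
# so loss-making projects never reach the data set.
approved = profit > 0.0
x_train, y_train = complexity[approved], profit[approved]

def knn_predict(query, k=25):
    # Flexible model: average the profit of the k approved projects
    # whose complexity is closest to each query point.
    dist = np.abs(query[:, None] - x_train[None, :])
    nearest = np.argpartition(dist, k, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

low_c = np.linspace(1.0, 3.0, 21)    # region with plenty of approved projects
high_c = np.linspace(6.0, 8.0, 21)   # region beyond the "complexity boundary"

pred_high = knn_predict(high_c)
err_low = float(np.mean(np.abs(knn_predict(low_c) - true_profit(low_c))))
err_high = float(np.mean(np.abs(pred_high - true_profit(high_c))))

print(f"mean abs error, complexity 1-3: {err_low:.2f}")
print(f"mean abs error, complexity 6-8: {err_high:.2f}")
```

Every approved project has positive profit, so the model predicts positive profit for highly complex proposals, exactly where the true profit is most negative.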
Modeling the Data Collection Process
➣ We know proposals that are
unlikely to be profitable are unlikely
to occur in the data.
➣ We can incorporate that
knowledge about the data
collection process into our model
to address this problem.
Bias in Data Collection Step 2
We don’t see everything
[Figure: white blood cell count vs. weeks]
➣ We want to know how patients are doing when they’re away from the clinic
➣ Patients come in when they’re feeling unwell, which is when WBC tends to be elevated
➣ We’ll generally predict that they’re worse off than they are
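A short simulation makes the direction of this bias concrete. It is a sketch only: the WBC distribution and the logistic visit model `p_visit` are assumptions chosen for illustration, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical weekly WBC values (10^9 cells/L), pooled over many patient-weeks.
wbc = rng.normal(7.0, 2.0, size=5000)

# Assumed visit model: the higher the WBC, the more likely a clinic visit that week.
p_visit = 1.0 / (1.0 + np.exp(-(wbc - 9.0)))
visited = rng.random(wbc.size) < p_visit

true_mean = float(wbc.mean())              # what we would like to know
clinic_mean = float(wbc[visited].mean())   # what the recorded data show

print(f"mean WBC over all weeks:     {true_mean:.2f}")
print(f"mean WBC seen at the clinic: {clinic_mean:.2f}")
```

Because visits are more likely when WBC is high, the clinic average sits above the average over all weeks, so a model trained on clinic data alone overstates how unwell patients are between visits.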
Prediction in Machine Learning
➣ We generally model the conditional expectation of the outcome Y given the inputs X with a function g
➣ g is our favourite class of functions for regression or
classification, indexed by a set of parameters
➣ “Easy” to do because Y is one dimensional, and
expectations are summary statistics
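In symbols, this setup is commonly written as follows. The parameter symbol θ, the per-example loss ℓ, and the sample size n are notational assumptions, not necessarily the notation used on the slide.

```latex
\mathbb{E}[Y \mid X = x] = g(x; \theta),
\qquad
\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} \ell\bigl(y_i,\, g(x_i; \theta)\bigr)
```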
Modeling Data Collection
➣ Modeling the probability of observing some data point (X, Y) directly
is too hard (w/ finite data)!
➣ X is high dimensional
➣ Densities are complicated
Modeling Data Collection
➣ In many problems we care about, the probability of making
an observation is a function only of the outcome.
➣ Then the probability of making an observation is:
➣ Which, for (X, Y) pairs we don’t see, can be approximated:
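A plausible way to write the two expressions this slide refers to, as a sketch under assumed notation: O is a hypothetical indicator that a pair (X, Y) is recorded, p is the selection function, and g(X; θ) is the outcome model's prediction with assumed parameters θ.

```latex
% Selection depends on the outcome only:
P(O = 1 \mid X, Y) = p(Y)

% For pairs we never see, Y is unknown, so plug in the model's prediction:
P(O = 1 \mid X) \approx p\bigl(g(X; \theta)\bigr)
```

The second line replaces E[p(Y) | X] with p(E[Y | X]), which is exact only when p is linear; that is why it is an approximation.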
Incorporating Knowledge on Data Collection
➣ If we’re being frequentists, we can define a loss function
that captures both how well we do on prediction outcome,
and how well we do on predicting observation:
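One possible form of such a loss is sketched below. Everything in it is an assumption made for illustration: the hinge-shaped outcome model `g`, the logistic selection function `p_observe`, the trade-off `weight`, and the idea of supplying inputs `x_unobs` from regions where nothing was recorded. It is not the loss shown in the talk.

```python
import numpy as np

def g(x, theta):
    # Hypothetical outcome model: linear, with an extra slope beyond x = 4.
    return theta[0] + theta[1] * x + theta[2] * np.maximum(x - 4.0, 0.0)

def p_observe(y, steepness=2.0):
    # Assumed known selection function: high outcomes are more likely to be recorded.
    return 1.0 / (1.0 + np.exp(-steepness * y))

def combined_loss(theta, x_obs, y_obs, x_unobs, weight=1.0):
    eps = 1e-12  # guards against log(0)
    pred_obs = g(x_obs, theta)
    pred_unobs = g(x_unobs, theta)
    # How well we predict the outcome where we have data.
    prediction_loss = np.sum((y_obs - pred_obs) ** 2)
    # How well we predict observation: recorded points should look likely to be
    # recorded, empty regions should look unlikely to be recorded.
    observation_loss = -(
        np.sum(np.log(p_observe(pred_obs) + eps))
        + np.sum(np.log(1.0 - p_observe(pred_unobs) + eps))
    )
    return float(prediction_loss + weight * observation_loss)

x_obs = np.array([0.5, 1.5, 2.5, 3.5])
y_obs = np.array([1.2, 0.8, 1.1, 0.9])
x_unobs = np.array([5.0, 6.0, 7.0, 8.0])  # inputs where nothing was recorded

theta_flat = (1.0, 0.0, 0.0)    # extrapolates the observed level
theta_drop = (1.0, 0.0, -3.0)   # same fit on the data, outcome falls beyond x = 4

loss_flat = combined_loss(theta_flat, x_obs, y_obs, x_unobs)
loss_drop = combined_loss(theta_drop, x_obs, y_obs, x_unobs)
print(f"flat: {loss_flat:.2f}  drop: {loss_drop:.2f}")
```

Both parameter settings fit the recorded points equally well, so a prediction-only loss cannot tell them apart. The observation term prefers the setting that predicts low outcomes where no data were recorded.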
Modeling Data Collection
➣ We can now learn from what we don’t see.
➣ We know there are regions of the input space w/ no data
➣ We know we’re less likely to see data w/ low profit
➣ Therefore: profit must be low in those regions
[Figure: profit vs. project complexity, showing approved projects]
What if we don’t know the data
collection process?
➣ We can’t learn p entirely from data - would require us to
know the outcome specifically where we don’t observe it
(in most cases).
➣ If we have beliefs about p and g, we can be Bayesian about
things.
➣ If we have a few data points collected “at random” - i.e. not
according to p - then we can learn p
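The last point can be sketched with a density ratio. By Bayes' rule, the density of recorded outcomes is proportional to the population density times p(y), so the ratio of the two densities recovers p up to a constant. The simulation below is a hypothetical illustration: the outcome distribution, the logistic `selection_prob`, the sample sizes, and the four-bin histogram estimator are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def selection_prob(y):
    # Ground-truth selection function; unknown to the analyst in practice.
    return 1.0 / (1.0 + np.exp(-2.0 * y))

# Large data set recorded through the biased process.
population = rng.normal(0.0, 1.0, size=50_000)
biased = population[rng.random(population.size) < selection_prob(population)]

# A small sample recorded "at random", i.e. regardless of the outcome.
random_sample = rng.normal(0.0, 1.0, size=500)

bins = np.linspace(-2.0, 2.0, 5)  # four outcome bins
biased_frac = np.histogram(biased, bins=bins)[0] / biased.size
random_frac = np.histogram(random_sample, bins=bins)[0] / random_sample.size

# Up to a constant, p(y) is the ratio of the biased to the unbiased outcome density.
relative_p = biased_frac / random_frac
relative_p = relative_p / relative_p.max()

for lo, hi, rp in zip(bins[:-1], bins[1:], relative_p):
    print(f"outcome in [{lo:+.0f}, {hi:+.0f}): relative selection probability {rp:.2f}")
```

Only the shape of p is identified this way; the overall recording rate needs extra information, such as how many events went unrecorded.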
A Worked Example
➣ We have data collected according to some unknown,
non-random process p
[Figure: white blood cell count vs. weeks]
A Worked Example
➣ Functions compatible with this data will have different
behavior in unobserved regions
[Figure: white blood cell count vs. weeks]
A Worked Example
➣ We assume all data are “observed at random”, as usual. Fit
looks good!
➣ Validation data collected by the same process will not help!
[Figure: white blood cell count vs. weeks]
A Worked Example
➣ But it turns out the data was not collected at random -
we’re systematically way off in unobserved regions!
[Figure: white blood cell count vs. weeks]
A Worked Example
➣ What if we know how much more likely we are to make an
observation when the outcome is high?
[Figure: white blood cell count vs. weeks]
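One standard way to use a known selection function is inverse probability weighting: each recorded point is weighted by 1 / p(y) so that under-recorded outcomes count for more. This technique is swapped in here as an illustration and is not necessarily what the talk does, since the slides describe folding p into the loss instead. The WBC distribution and the bounded `p_observe` below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def p_observe(y):
    # Assumed known selection function, bounded away from zero so weights stay finite.
    return 0.1 + 0.8 / (1.0 + np.exp(-(y - 7.0)))

# Hypothetical WBC values over all patient-weeks, and the subset that gets recorded.
wbc = rng.normal(7.0, 2.0, size=20_000)
seen = rng.random(wbc.size) < p_observe(wbc)
y_seen = wbc[seen]

true_mean = float(wbc.mean())
naive_mean = float(y_seen.mean())

# Weight each recorded value by the inverse of its probability of being recorded.
weights = 1.0 / p_observe(y_seen)
ipw_mean = float(np.sum(weights * y_seen) / np.sum(weights))

print(f"true mean:     {true_mean:.2f}")
print(f"naive mean:    {naive_mean:.2f}")
print(f"weighted mean: {ipw_mean:.2f}")
```

The weighting only works where p(y) is strictly positive. Outcomes that are never recorded cannot be recovered by reweighting, which is where a model of the outcome itself is needed.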
A Worked Example
➣ What if we don’t know anything about data collection, but
get a few observations “at random”?
[Figure: white blood cell count vs. weeks]
Conclusions
➣ Selection bias hurts us in ML in ways we can’t detect
through normal validation procedures
➣ If we know something about the data collection process
we can incorporate it into our model to improve prediction.
➣ If we happen to have some data collected “at random”, we
can use it to learn about selection bias elsewhere in our
data.
Thank you!
Get in touch
noam@jhu.edu
@nsfinkelstein

Noam Finkelstein - The Importance of Modeling Data Collection

  • 1. Counteracting Selection Bias in Machine Learning. Noam Finkelstein, MLConf SF, November 8th 2019
  • 2. Overview ➣ Data are collected in all kinds of ways ➣ We pretend they are collected “at random” ➣ This creates poor predictive performance in important regions of the input space ➣ We can model the collection process to improve performance
  • 3. Takeaways ➣ Understand the importance of selection bias in ML ○ Not discussed as much in ML as in statistics ➣ Be able to identify when our data might have this problem ➣ Learn how to model data collection ➣ Learn how to use our data to learn about selection bias when possible
  • 4. Data Collection Step 1: Things happen
  • 5. Data Collection Step 2: Some of them get recorded
  • 6. Bias in Data Collection Step 1: Things happen ➣ Selection bias: correlation between how likely we are to see a data point (X, Y) and the outcome Y ➣ Example 1: ○ We are asked to create a tool to help project managers predict the profit of software projects ○ Our data include all software projects previously undertaken at the company ○ PMs are good at their jobs, so projects that lose money rarely appear in the data. They just don’t happen.
  • 7. Bias in Data Collection Step 1: Things happen [Figure: Profit plotted against Project Complexity; legend: Approved Projects]
  • 8. Bias in Data Collection Step 1: Some things don’t happen [Figure: Profit plotted against Project Complexity; legend: Approved Projects, Rejected Projects]
  • 9. Bias in Data Collection ➣ No ML model can learn about the “complexity boundary”, even though we have access to all the projects that were undertaken. Nothing is “missing”. ➣ This is a very bad way to fail! Our model will do badly specifically where we want it to protect us from poor decisions.
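The failure described on slide 9 can be illustrated with a small simulation. This is a minimal sketch under made-up assumptions: the linear profit curve, the noise level, and the hard approval rule (only proposals that end up profitable are undertaken) are all hypothetical and not taken from the talk.

```python
import random


def simulate_projects(n=20000, seed=0):
    """Simulate proposals; only those that turn a profit are undertaken."""
    rng = random.Random(seed)
    approved, everything = [], []
    for _ in range(n):
        complexity = rng.uniform(0.0, 10.0)
        # Hypothetical ground truth: profit falls as complexity rises.
        profit = 5.0 - complexity + rng.gauss(0.0, 2.0)
        everything.append((complexity, profit))
        # Hypothetical approval rule: money-losing projects never happen.
        if profit > 0.0:
            approved.append((complexity, profit))
    return approved, everything


def mean_profit(projects, lo, hi):
    """Average profit of projects whose complexity lies in [lo, hi)."""
    values = [p for c, p in projects if lo <= c < hi]
    return sum(values) / len(values)


approved, everything = simulate_projects()
observed_mean = mean_profit(approved, 7.0, 8.0)
true_mean = mean_profit(everything, 7.0, 8.0)
print(f"complexity 7-8: approved-only mean profit {observed_mean:.2f}, "
      f"all-proposals mean profit {true_mean:.2f}")
```

Any model trained only on `approved` sees positive profits at high complexity, even though the typical high-complexity proposal in this simulation loses money.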
  • 10. Modeling the Data Collection Process ➣ We know proposals that are unlikely to be profitable are unlikely to occur in the data. ➣ We can incorporate that knowledge about the data collection process into our model to address this problem.
  • 11. Bias in Data Collection Step 2: We don’t see everything [Figure: White Blood Cell Count plotted against Weeks] ➣ We want to know how patients are doing when they’re away from the clinic ➣ Patients come in when they’re feeling unwell and their WBC is elevated ➣ So we’ll generally predict that they’re worse off than they are
  • 12. Prediction in Machine Learning ➣ We generally model [equation missing] ➣ g is our favourite class of functions for regression or classification, parameterized by [symbol missing] ➣ “Easy” to do because Y is one-dimensional, and expectations are summary statistics
  • 13. Modeling Data Collection ➣ Modeling the probability of observing some data is too hard (with finite data)! ➣ X is high-dimensional ➣ Densities are complicated
  • 14. Modeling Data Collection ➣ In many problems we care about, the probability of making an observation is a function only of the outcome. ➣ Then the probability of making an observation is: [equation missing] ➣ Which, for (X, Y) pairs we don’t see, can be approximated: [equation missing]
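The formulas on slides 12 and 14 did not survive in this transcript. One plausible reading is sketched below. Only the names g and p appear in the slide text; the parameter symbol \theta and the observation indicator O \in \{0, 1\} are assumptions introduced here for illustration.

```latex
\begin{align}
\mathbb{E}[Y \mid X = x] &= g(x; \theta)
  && \text{(slide 12: the usual prediction model)} \\
P(O = 1 \mid X = x, Y = y) &= p(y)
  && \text{(slide 14: observation depends only on the outcome)} \\
P(O = 1 \mid X = x, Y = y) &\approx p\big(g(x; \theta)\big)
  && \text{(slide 14: when $y$ is not seen, plug in the prediction)}
\end{align}
```

The third line is what lets the absence of data carry information: where nothing was recorded, the predicted outcome should be one that p makes unlikely to be observed.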
  • 15. Incorporating Knowledge on Data Collection ➣ If we’re being frequentists, we can define a loss function that captures both how well we do on predicting the outcome and how well we do on predicting observation: [equation missing]
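The loss on slide 15 is not recoverable from this transcript, so the following is only a sketch of one loss with the two ingredients the slide names. It assumes a squared-error prediction term, a known logistic observation probability, and a list of inputs known to have produced no data. The names `observation_probability`, `combined_loss`, and `weight` are hypothetical.

```python
import math


def observation_probability(y):
    """Hypothetical known collection process: higher outcomes are likelier to be recorded."""
    return 1.0 / (1.0 + math.exp(-y))


def combined_loss(g, observed, unobserved_inputs, weight=1.0):
    """Prediction loss on observed pairs plus observation loss on inputs with no data."""
    # How well g predicts the outcomes we did record.
    prediction_loss = sum((y - g(x)) ** 2 for x, y in observed) / len(observed)
    # Negative log-likelihood of "nothing was recorded" at each empty input,
    # using the predicted outcome g(x) in place of the unseen outcome.
    observation_loss = sum(
        -math.log(1.0 - observation_probability(g(x))) for x in unobserved_inputs
    ) / len(unobserved_inputs)
    return prediction_loss + weight * observation_loss


observed = [(0.0, 3.0), (1.0, 3.1), (2.0, 2.9)]
unobserved_inputs = [6.0, 7.0, 8.0]


def flat(x):
    """Candidate that extrapolates the observed level everywhere."""
    return 3.0


def dropping(x):
    """Candidate that predicts a low outcome where no data exist."""
    return 3.0 if x <= 2.0 else -3.0


loss_flat = combined_loss(flat, observed, unobserved_inputs)
loss_dropping = combined_loss(dropping, observed, unobserved_inputs)
print(f"flat: {loss_flat:.3f}  dropping: {loss_dropping:.3f}")
```

Both candidates fit the observed points equally well, so a prediction-only loss cannot tell them apart. The observation term prefers `dropping`, because high outcomes in the empty region would probably have been recorded.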
  • 16. Modeling Data Collection ➣ We can now learn from what we don’t see. ➣ We know there are regions of the input space with no data ➣ We know we’re less likely to see data with low profit ➣ Therefore: profit must be low in those regions [Figure: Profit plotted against Project Complexity; legend: Approved Projects]
  • 17. What if we don’t know the data collection process? ➣ We can’t learn p entirely from data: in most cases that would require us to know the outcome specifically where we don’t observe it. ➣ If we have beliefs about p and g, we can be Bayesian about things. ➣ If we have a few data points collected “at random”, i.e. not according to p, then we can learn p
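The last point on slide 17 can be sketched as follows. This is an illustrative estimator, not the method used in the talk: it compares the odds of a high outcome in the selectively collected sample with the same odds in a small random sample, which estimates how much likelier a high outcome is to be recorded. The 0.8 and 0.2 recording rates, the threshold, and the sample sizes are made up.

```python
import random


def relative_observation_rate(biased_outcomes, random_outcomes, threshold):
    """Estimate p(high) / p(low) from a biased sample and a random sample."""
    def odds_high(outcomes):
        high = sum(1 for y in outcomes if y > threshold)
        return high / (len(outcomes) - high)
    # Selection multiplies the odds of "high" by p(high) / p(low).
    return odds_high(biased_outcomes) / odds_high(random_outcomes)


rng = random.Random(0)
population = [rng.gauss(0.0, 1.0) for _ in range(50000)]
# Hypothetical collection process: high outcomes are recorded 4x as often as low ones.
biased = [y for y in population if rng.random() < (0.8 if y > 0.0 else 0.2)]
# A comparatively small sample collected "at random".
at_random = rng.sample(population, 2000)

estimate = relative_observation_rate(biased, at_random, threshold=0.0)
print(f"estimated p(high) / p(low): {estimate:.2f} (simulated truth: 4)")
```

The estimate gets noisier as the random sample shrinks, and it only recovers p up to a constant factor, which is the quantity slide 22 asks about.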
  • 18. A Worked Example ➣ We have data collected according to some unknown, non-random process p [Figure: White Blood Cell Count plotted against Weeks]
  • 19. A Worked Example ➣ Functions compatible with this data will have different behavior in unobserved regions [Figure: White Blood Cell Count plotted against Weeks]
  • 20. A Worked Example ➣ We assume all data are “observed at random”, as usual. Fit looks good! ➣ Validation data collected by the same process will not help! [Figure: White Blood Cell Count plotted against Weeks]
  • 21. A Worked Example ➣ But it turns out the data were not collected at random: we’re systematically way off in unobserved regions! [Figure: White Blood Cell Count plotted against Weeks]
  • 22. A Worked Example ➣ What if we know how much more likely we are to make an observation when the outcome is high? [Figure: White Blood Cell Count plotted against Weeks]
  • 23. A Worked Example ➣ What if we don’t know anything about data collection, but get a few observations “at random”? [Figure: White Blood Cell Count plotted against Weeks]
  • 24. A Worked Example ➣ What if we don’t know anything about data collection, but get a few observations “at random”? [Figure: White Blood Cell Count plotted against Weeks]
  • 25. Conclusions ➣ Selection bias hurts us in ML in ways we can’t detect through normal validation procedures ➣ If we know something about the data collection process, we can incorporate it into our model to improve prediction. ➣ If we happen to have some data collected “at random”, we can use it to learn about selection bias elsewhere in our data.
  • 26. Thank you! Get in touch: noam@jhu.edu, @nsfinkelstein