The Importance of Modeling Data Collection
Data sets used in machine learning are often collected in a systematically biased way: certain data points are more likely to be collected than others. We call this "observation bias". For example, in health care, we are more likely to see lab tests when a patient is feeling unwell than otherwise. Failing to account for observation bias can result in poor predictions on new data; properly accounting for it allows us to make better use of the data we do have.
In this presentation, we discuss practical and theoretical approaches to dealing with observation bias. When the nature of the bias is known, there are simple adjustments we can make to nonparametric function estimation techniques, such as Gaussian Process models. We also discuss the scenario where the data collection model is unknown. In this case, there are steps we can take to estimate it from observed data. Finally, we demonstrate that having a small subset of data points known to be collected at random (that is, in an unbiased way) can vastly improve our ability to account for observation bias in the rest of the data set.
My hope is that attendees of this presentation will be aware of the perils of observation bias in their own work, and be equipped with tools to address it.
2. Overview
➣ Data are collected in all kinds of ways
➣ We pretend they are collected “At Random”
➣ This creates poor predictive performance in important regions of the input space
➣ We can model the collection process to improve performance
3. Takeaways
➣ Understand the importance of selection bias in ML
○ Not discussed as much in ML as in statistics
➣ Be able to identify when our data might have this problem.
➣ Learn how to model data collection.
➣ Learn how to use our data to learn about selection bias when possible.
6. Bias in Data Collection Step 1
Things happen
➣ Selection bias: correlation between how likely we are to see a data point (X, Y) and the outcome Y
➣ Example 1:
○ We are asked to create a tool to help project managers predict the profit of software projects
○ Our data include all software projects previously undertaken at the company
○ PMs are good at their jobs, so projects that would lose money rarely appear in the data. They just don’t happen.
7. Bias in Data Collection Step 1
Things happen
[Figure: Profit vs. Project Complexity, showing approved projects]
8. Bias in Data Collection Step 1
Some Things Don’t Happen
[Figure: Profit vs. Project Complexity, showing approved projects and rejected projects]
9. Bias in Data Collection
➣ No ML model can learn about the “complexity boundary”, even though we have access to all the projects that were undertaken. Nothing is “missing”.
➣ This is a very bad way to fail! Our model will do badly specifically where we want it to protect us from poor decisions.
10. Modeling the Data Collection Process
➣ We know proposals that are unlikely to be profitable are unlikely to occur in the data.
➣ We can incorporate that knowledge about the data collection process into our model to address this problem.
11. Bias in Data Collection Step 2
We don’t see everything
[Figure: white blood cell count vs. weeks]
➣ We want to know how patients are doing when they’re away from the clinic
➣ Patients come in when they’re feeling unwell, i.e. with elevated WBC
➣ We’ll generally predict that they’re worse off than they are
12. Prediction in Machine Learning
➣ We generally model E[Y | X] = g_θ(X)
➣ g is our favourite class of functions for regression or classification, parameterized by θ
➣ “Easy” to do because Y is one dimensional, and expectations are summary statistics
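With g_θ a linear function fit by least squares, this “easy” setup is a few lines (a hypothetical sketch; the data and parameter values below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# E[Y | X] modeled with a simple parametric family:
# g_theta(x) = theta[0] + theta[1] * x, fit by least squares.
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=200)

# Design matrix with an intercept column.
A = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
print("estimated theta:", theta)
```

Because the data here really are collected at random, the estimate recovers the true parameters; the rest of the talk is about what happens when that assumption fails.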
13. Modeling Data Collection
➣ Modeling the probability of observing some data, P(observed | X, Y), is too hard (w/ finite data)!
➣ X is high dimensional
➣ Densities are complicated
14. Modeling Data Collection
➣ In many problems we care about, the probability of making an observation is a function only of the outcome.
➣ Then the probability of making an observation is: P(observed | X, Y) = p(Y)
➣ Which, for (X, Y) pairs we don’t see, can be approximated: p(Y) ≈ p(g_θ(X))
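A quick simulation makes outcome-only observation concrete (a sketch; the distributions and the sigmoid collection rule below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: weekly white blood cell counts around a baseline of 7.
weeks = rng.uniform(0, 52, size=5000)
wbc = 7.0 + 1.5 * np.sin(weeks / 8.0) + rng.normal(0.0, 1.0, size=5000)

# Outcome-only collection rule p(Y): patients are more likely to come in
# (and be measured) when their WBC is elevated.
def p_observe(y):
    return 1.0 / (1.0 + np.exp(-(y - 8.0)))

observed = rng.random(5000) < p_observe(wbc)

print(f"true mean WBC:     {wbc.mean():.2f}")
print(f"observed mean WBC: {wbc[observed].mean():.2f}")
```

The observed mean sits well above the population mean: training on the observed points alone would systematically overestimate how sick patients are.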
15. Incorporating Knowledge on Data Collection
➣ If we’re being frequentists, we can define a loss function that captures both how well we do on predicting the outcome, and how well we do on predicting observation:
L(θ) = Σ_{i observed} ℓ(y_i, g_θ(x_i)) − λ Σ_{j unobserved} log(1 − p(g_θ(x_j)))
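One way such a loss could look in code (a sketch under strong assumptions: p is known, the prediction loss is squared error, and we know the inputs of the unobserved points; all numbers and names below are invented):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Ground truth: profit falls with project complexity.
x_all = rng.uniform(0, 4, size=400)
y_all = 10.0 - 3.0 * x_all + rng.normal(0.0, 1.0, size=400)

# Known collection model p(y): low-profit projects rarely get approved.
def p_obs(y):
    return 1.0 / (1.0 + np.exp(-y))

seen = rng.random(400) < p_obs(y_all)
x_obs, y_obs = x_all[seen], y_all[seen]
x_mis = x_all[~seen]  # inputs of rejected projects; their outcomes are unknown

def g(theta, x):  # linear predictor, theta = (intercept, slope)
    return theta[0] + theta[1] * x

def naive_loss(theta):
    # Prediction term only, ignoring how the data were collected.
    return np.mean((y_obs - g(theta, x_obs)) ** 2)

def combined_loss(theta, lam=1.0):
    fit = np.mean((y_obs - g(theta, x_obs)) ** 2)
    # Each unseen project was *not* observed; under the collection model that
    # event has probability 1 - p(g(x)), with g(x) plugged in for the unknown y.
    absence = -np.mean(np.log(1.0 - p_obs(g(theta, x_mis)) + 1e-9))
    return fit + lam * absence

naive = minimize(naive_loss, x0=[0.0, 0.0]).x
adjusted = minimize(combined_loss, x0=[0.0, 0.0]).x
print("naive slope:   ", naive[1])
print("adjusted slope:", adjusted[1])
```

The naive fit flattens the slope, because the low-profit, high-complexity projects were censored; the absence term pushes predictions back down exactly where observations are missing.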
16. Modeling Data Collection
➣ We can now learn from what we don’t see.
➣ We know there are regions of the input space w/ no data
➣ We know we’re less likely to see data w/ low profit
➣ Therefore: profit must be low in those regions
[Figure: Profit vs. Project Complexity, showing approved projects]
17. What if we don’t know the data collection process?
➣ We can’t learn p entirely from data: that would require us to know the outcome specifically where we don’t observe it (in most cases).
➣ If we have beliefs about p and g, we can be Bayesian about things.
➣ If we have a few data points collected “at random”, i.e. not according to p, then we can learn p
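A sketch of that last idea: the biased sample’s outcome density is proportional to p(y) times the population density, while an at-random sample follows the population density alone, so a simple histogram ratio recovers p up to a constant (all distributions below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Population outcomes, and a collection rule p(y) we pretend not to know.
y_pop = rng.normal(7.0, 1.5, size=20000)
def p_true(y):
    return 1.0 / (1.0 + np.exp(-(y - 8.0)))

biased = y_pop[rng.random(y_pop.size) < p_true(y_pop)]   # large biased sample
random_sub = rng.choice(y_pop, size=500, replace=False)  # small at-random set

# Biased density is proportional to p(y) * f(y); at-random density to f(y).
# Their ratio therefore recovers p(y) up to a constant.
bins = np.linspace(5.0, 10.0, 11)
h_biased, _ = np.histogram(biased, bins=bins, density=True)
h_random, _ = np.histogram(random_sub, bins=bins, density=True)
ratio = h_biased / np.maximum(h_random, 1e-9)

centers = 0.5 * (bins[:-1] + bins[1:])
for c, r in zip(centers, ratio):
    print(f"y = {c:4.2f}  estimated p(y), unnormalised: {r:.2f}")
```

The estimated ratio rises with y, mirroring the true sigmoid collection rule, even though p was never given to the estimator.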
18. A Worked Example
➣ We have data collected according to some unknown, non-random process p
[Figure: white blood cell count vs. weeks]
19. A Worked Example
➣ Functions compatible with this data will have different behavior in unobserved regions
[Figure: white blood cell count vs. weeks]
20. A Worked Example
➣ We assume all data are “observed at random”, as usual. Fit looks good!
➣ Validation data collected by the same process will not help!
[Figure: white blood cell count vs. weeks]
21. A Worked Example
➣ But it turns out the data were not collected at random: we’re systematically way off in unobserved regions!
[Figure: white blood cell count vs. weeks]
22. A Worked Example
➣ What if we know how much more likely we are to make an observation when the outcome is high?
[Figure: white blood cell count vs. weeks]
23. A Worked Example
➣ What if we don’t know anything about data collection, but get a few observations “at random”?
[Figure: white blood cell count vs. weeks]
24. A Worked Example
➣ What if we don’t know anything about data collection, but get a few observations “at random”?
[Figure: white blood cell count vs. weeks]
25. Conclusions
➣ Selection bias hurts us in ML in ways we can’t detect through normal validation procedures
➣ If we know something about the data collection process, we can incorporate it into our model to improve prediction.
➣ If we happen to have some data collected “at random”, we can use it to learn about selection bias elsewhere in our data.