Healthcare data is messy. Tree-based models provide robust first-cut solutions for such data. I introduce several kinds of trees and explain how they differ from one another. Once you understand these trees, you can build better custom models of your own.
2. Who am I
• Co-founder and Chief Technology Officer of Accordion Health, Inc.
• PhD from the University of Texas at Austin
• Advisor: Professor Joydeep Ghosh
• Studied Machine Learning and Data Mining, with a special focus on healthcare data
• Involved in various industry data mining projects
• USAA: Lifetime modeling of customers
• SK Telecom: Smartphone purchase prediction, usage pattern analysis
• LinkedIn Corp.: Related search keywords recommendation
• Whole Foods Market: Price elasticity modeling
• …
5. Healthcare Data is Messy
• Data structure
• Unstructured data such as EHR
• Structured data such as claims
• Location
• Doctors’ offices, insurance companies, governments, etc.
• Data definition
• Different definitions for different communities
• Data format
• Various industry formats
• Data complexity
• Patients going in and out of systems
• Incomplete data
• Regulations & requirements
• Source: Health Catalyst
9. Various Kinds of Trees – C4.5, CART
1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets
Information Gain → C4.5
Gini Impurity, Variance Reduction → CART
- Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
- Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.
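Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not the C4.5 or CART implementation: the function names and the toy data are made up, candidate cut-points are assumed to be midpoints between consecutive distinct values, and C4.5's normalization of information gain into a gain ratio is left out.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits; its decrease after a split is the information gain."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels, impurity=gini):
    """Search every feature and cut-point; return (feature, cut_point, impurity_decrease)."""
    n = len(rows)
    parent = impurity(labels)
    best = (None, None, 0.0)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        # Assumed candidate cut-points: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= cut]
            right = [y for r, y in zip(rows, labels) if r[j] > cut]
            child = (len(left) * impurity(left) + len(right) * impurity(right)) / n
            if parent - child > best[2]:
                best = (j, cut, parent - child)
    return best

# Made-up toy data: feature 0 is noisy, feature 1 separates the classes.
rows = [[1, 2], [3, 8], [2, 3], [4, 9], [1, 7], [3, 1]]
labels = [0, 1, 0, 1, 1, 0]

gini_split = best_split(rows, labels, impurity=gini)
entropy_split = best_split(rows, labels, impurity=entropy)
```

Passing `impurity=entropy` turns the same search into an information-gain search, which shows that the two families differ mainly in the scoring function.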
11. Various Kinds of Forests – Bagged Trees
1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets
Sample with replacement and grow many trees → Bagged Trees
- Breiman, L. (1996b). Bagging predictors. Machine Learning, 24:2, 123–140.
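The bagging idea changes Step 1 only: each tree starts from its own bootstrap sample, and the trees vote. The sketch below assumes a one-split "stump" as the base learner to keep it short; `fit_stump`, `fit_bagged`, `predict_bagged`, and the toy data are hypothetical names and values, not part of any library.

```python
import random
from collections import Counter

def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) examples with replacement."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]

def fit_stump(rows, labels):
    """Hypothetical base learner: the single split with the fewest training errors."""
    best = None
    for j in range(len(rows[0])):
        for cut in sorted(set(r[j] for r in rows)):
            left = [y for r, y in zip(rows, labels) if r[j] <= cut]
            right = [y for r, y in zip(rows, labels) if r[j] > cut]
            if not right:
                continue  # a cut at the maximum value splits nothing
            l_pred = Counter(left).most_common(1)[0][0]
            r_pred = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_pred for y in left) + sum(y != r_pred for y in right)
            if best is None or errors < best[0]:
                best = (errors, j, cut, l_pred, r_pred)
    if best is None:  # every feature is constant in this sample
        majority = Counter(labels).most_common(1)[0][0]
        return lambda row: majority
    _, j, cut, l_pred, r_pred = best
    return lambda row: l_pred if row[j] <= cut else r_pred

def fit_bagged(rows, labels, n_trees=25, seed=0):
    """Fit one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(*bootstrap_sample(rows, labels, rng)) for _ in range(n_trees)]

def predict_bagged(models, row):
    """Majority vote over the ensemble."""
    return Counter(m(row) for m in models).most_common(1)[0][0]

# Made-up toy data: class 0 has low values, class 1 has high values.
rows = [[1, 2], [2, 1], [3, 3], [2, 2], [1, 3],
        [7, 8], [8, 7], [9, 9], [8, 8], [7, 9]]
labels = [0] * 5 + [1] * 5

models = fit_bagged(rows, labels)
low_prediction = predict_bagged(models, [0, 0])
high_prediction = predict_bagged(models, [10, 10])
```

In practice the base learner would be a full tree rather than a stump; the sampling and voting code stays the same.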
12. Various Kinds of Forests – Random Subspace
1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets
Select a random subset of features
Then find the best feature/cut-point
- Ho, T. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:8, 832–844.
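A rough sketch of the random subspace idea follows, assuming the feature subset is drawn once per model and every model sees all rows. The base learner `fit_threshold` is a deliberately tiny, hypothetical stand-in for a tree learner (it assumes binary labels 0/1), and all names and data are illustrative.

```python
import random
from collections import Counter

def draw_subspace(n_features, k, rng):
    """Pick k distinct feature indices uniformly at random."""
    return sorted(rng.sample(range(n_features), k))

def project(row, features):
    """Keep only the chosen feature columns of one row."""
    return [row[j] for j in features]

def fit_subspace_ensemble(fit, rows, labels, k, n_models=15, seed=0):
    """Train n_models learners, each on all rows but only its own k random features."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_models):
        features = draw_subspace(len(rows[0]), k, rng)
        model = fit([project(r, features) for r in rows], labels)
        ensemble.append((features, model))
    return ensemble

def predict_subspace_ensemble(ensemble, row):
    """Each model votes using only the features it was trained on."""
    votes = Counter(model(project(row, features)) for features, model in ensemble)
    return votes.most_common(1)[0][0]

def fit_threshold(rows, labels):
    """Hypothetical stand-in for a tree learner: threshold the first projected
    column at the midpoint between the two class means (labels 0/1 assumed)."""
    mean = lambda vals: sum(vals) / len(vals)
    m0 = mean([r[0] for r, y in zip(rows, labels) if y == 0])
    m1 = mean([r[0] for r, y in zip(rows, labels) if y == 1])
    cut = (m0 + m1) / 2
    hi_label = 1 if m1 > m0 else 0
    return lambda row: hi_label if row[0] > cut else 1 - hi_label

# Made-up toy data: features 0 and 1 are informative, feature 2 is noise.
rows = [[1, 2, 5], [2, 1, 9], [3, 3, 1], [7, 8, 4], [8, 7, 8], [9, 9, 2]]
labels = [0, 0, 0, 1, 1, 1]

ensemble = fit_subspace_ensemble(fit_threshold, rows, labels, k=2)
low_prediction = predict_subspace_ensemble(ensemble, [0, 0, 5])
high_prediction = predict_subspace_ensemble(ensemble, [10, 10, 5])
```

Because `fit` is passed in as an argument, any learner that maps `(rows, labels)` to a prediction function can be dropped in place of the stand-in.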
13. Various Kinds of Forests – Random Forests
1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets
Sample with replacement
Select a random subset of features
Then find the best feature/cut-point
- Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
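Combining the two ideas gives a compact random forest sketch: a bootstrap sample per tree, and a fresh random feature subset at every node. This is an illustration under assumptions, not Breiman's reference code: the square-root rule for the subset size, the depth limit, and all names and data are choices made here.

```python
import math
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(rows, labels, n_try, rng, depth=0, max_depth=5):
    """Grow one tree; at every node only n_try randomly chosen features are searched."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority  # leaf: predict the majority class
    best = None
    for j in rng.sample(range(len(rows[0])), n_try):
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [i for i, r in enumerate(rows) if r[j] <= cut]
            right = [i for i, r in enumerate(rows) if r[j] > cut]
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, cut, left, right)
    if best is None:  # the sampled features are all constant at this node
        return majority
    _, j, cut, left, right = best
    branch = lambda idx: grow_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                   n_try, rng, depth + 1, max_depth)
    return (j, cut, branch(left), branch(right))

def predict_tree(node, row):
    """Walk from the root to a leaf; internal nodes are tuples, leaves are labels."""
    while isinstance(node, tuple):
        j, cut, left, right = node
        node = left if row[j] <= cut else right
    return node

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_try = max(1, round(math.sqrt(len(rows[0]))))  # assumed subset size
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], n_try, rng))
    return forest

def predict_forest(forest, row):
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

# Made-up toy data: class 0 has low values, class 1 has high values.
rows = [[1, 2], [2, 1], [3, 3], [2, 2], [1, 3],
        [7, 8], [8, 7], [9, 9], [8, 8], [7, 9]]
labels = [0] * 5 + [1] * 5

forest = fit_forest(rows, labels)
low_prediction = predict_forest(forest, [0, 0])
high_prediction = predict_forest(forest, [10, 10])
```

Note that leaves are stored as bare labels, so this sketch assumes class labels are never tuples.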
14. Various Kinds of Trees – ExtraTrees
1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets
Select a random subset of (feature, cut-point) pairs
Then find the best (feature, cut-point) pair
- Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63:1, 3–42.
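The split step for ExtraTrees might look like the sketch below: instead of scanning every cut-point, draw one random cut-point per sampled feature and keep the best of those candidates. The function names, the Gini scoring, and the toy data are assumptions for illustration. In the original paper each tree is grown on the whole training sample rather than a bootstrap sample.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def random_candidates(rows, n_try, rng):
    """Draw up to n_try (feature, cut-point) pairs: random non-constant features,
    each with one cut-point drawn uniformly between its min and max at this node."""
    usable = [j for j in range(len(rows[0])) if len(set(r[j] for r in rows)) > 1]
    pairs = []
    for j in rng.sample(usable, min(n_try, len(usable))):
        lo = min(r[j] for r in rows)
        hi = max(r[j] for r in rows)
        pairs.append((j, rng.uniform(lo, hi)))
    return pairs

def pick_split(rows, labels, n_try, rng):
    """Score only the random candidates; return (score, feature, cut_point) or None."""
    best = None
    for j, cut in random_candidates(rows, n_try, rng):
        left = [y for r, y in zip(rows, labels) if r[j] < cut]
        right = [y for r, y in zip(rows, labels) if r[j] >= cut]
        if not left or not right:
            continue  # a degenerate draw that separates nothing
        score = len(left) * gini(left) + len(right) * gini(right)
        if best is None or score < best[0]:
            best = (score, j, cut)
    return best

# Made-up toy data: class 0 has low values, class 1 has high values.
rows = [[1, 2], [2, 1], [3, 3], [7, 8], [8, 7], [9, 9]]
labels = [0, 0, 0, 1, 1, 1]

candidates = random_candidates(rows, 2, random.Random(0))
split = pick_split(rows, labels, 2, random.Random(0))
```

Because no sorting or exhaustive scan is needed, this step is cheaper per node than the exhaustive search, at the cost of noisier individual trees.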
26. Predict MedAdh Scores
• Where can I find the data?
• Download from the CMS Part C and D Performance Data webpage
• Constructing datasets
• MedAdh Data from 2012, 2013 → Training Features, Xtrain
• MedAdh Data from 2015 → Training Label, Ytrain
• MedAdh Data from 2013, 2014 → Test Features, Xtest
• MedAdh Data from 2016 → Test Label, Ytest
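The year-to-dataset mapping above could be assembled roughly as follows. This sketch assumes the downloaded files have already been parsed into a dictionary keyed by year and then by an identifier such as a contract ID; the `build_xy` helper, the identifiers, and every score value are made up for illustration and are not real CMS data.

```python
def build_xy(scores_by_year, feature_years, label_year):
    """Return (ids, X, y) for the identifiers present in every requested year."""
    years = list(feature_years) + [label_year]
    ids = sorted(set.intersection(*(set(scores_by_year[yr]) for yr in years)))
    X = [[scores_by_year[yr][c] for yr in feature_years] for c in ids]
    y = [scores_by_year[label_year][c] for c in ids]
    return ids, X, y

# Made-up placeholder values, NOT real CMS data; the IDs are hypothetical.
scores_by_year = {
    2012: {"H0001": 70.0, "H0002": 75.0, "H0003": 68.0},
    2013: {"H0001": 72.0, "H0002": 76.0, "H0003": 69.0},
    2014: {"H0001": 74.0, "H0002": 77.0},
    2015: {"H0001": 75.0, "H0002": 79.0, "H0003": 71.0},
    2016: {"H0001": 77.0, "H0002": 80.0},
}

# Features come from two consecutive years; the label is the score two years later.
train_ids, Xtrain, Ytrain = build_xy(scores_by_year, [2012, 2013], 2015)
test_ids, Xtest, Ytest = build_xy(scores_by_year, [2013, 2014], 2016)
```

Keeping only identifiers that appear in every requested year is one simple way to handle entities that enter or leave the data; imputing the missing years would be an alternative.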