This document discusses big data and data science projects at a Center for Data Science. It provides an overview of various research areas like healthcare informatics, intelligent systems, social computing, and big data security. It also describes technologies used for big data like machine learning, distributed databases, and data integration. Specific projects are summarized, such as predicting hospital readmissions for congestive heart failure patients and detecting malware activity based on domain names. The document outlines the steps involved in building predictive models, from data understanding to predictive modeling. Performance of initial models is discussed, with areas for improvement noted.
2. Center for Data Science: Societal Impact
Health Informatics
• Bioinformatics
• Health and Wellness
• Predictive Analytics
Geo-Spatial Data Management
• Distributed Systems
• Databases
• Geo-Spatial
• Embedded Systems
Intelligent Systems
• Machine Learning
• Data Mining
• Computational Intelligence
• Computer Vision
Social Computing
• Web
• Devices
• Mobile Networks
• UX / UI
Big Data Security
• Cryptology
• Secure Machine Learning
Big Data Infrastructure
• Engineering
• Dev-Ops
4. Data Mining: 1989 - 2010
• Data Science and Applications move and transform sizeable amounts of data out of the native database or file systems.
[Diagram: Applications; SQL/ODBC/JDBC Data Access; four Distributed Database nodes (Multi-Core, Columnar, Key-Value); Data Science using R, SAS, SPSS, Weka, MAHOUT. Flow labels: high volume, high latency, high volume.]
Application Ecosystem Integration
5. Big Data Science
• Data Science uses native data representation and inherent distribution and parallelism
• Minimal data movement
• Rapid Application development using data science constructs
[Diagram: Application Ecosystem Integration. Applications; SQL/ODBC/JDBC Data Access; Data Science (internal algorithms for clustering, classification, regression); Distributed Database (Multi-Core, Columnar, Key-Value). Flow labels: lower volume, lower latency, high volume, low latency.]
Big Data Science Components
6. A Short History of (Big) Data Technology
1970: Codd publishes “A Relational Model of Data for Large Shared Data Banks”
1985: Copeland, Decomposition Storage Model (essentially the first columnar store)
1989: Shared-Nothing Architecture
2004: Google MapReduce
2005: C-Store (eventually Vertica), layers WS/RS
2007: Materialization optimizations in columnar stores and Hadoop implementation
2005-07: Star-Schema Benchmark + Hadoop
2008: Attempts to backport columnar advances to row storage, not very effective
Today: BIG DATA
7. Technology Decisions
Columnar vs Relational Storage Technologies
Infinite scale using commodity hardware
Private or Public Cloud
Massively Distributed and Parallel Architecture: Hadoop
Stream Query Processing for trillions of events and petabytes of data
Real-time classification and clustering: Approximate scoring and segmentation + Reporting and Data Visualization
8. Flat Files CSV Claims X12 Clinical HL7
Distance Compute Library
Instance Selection
RNGE Drop 3
Fuzzy Rough Set
Approximation
CHF Risk of
Readmission
Geo
Routing
Random Forests KNN
Industry Partners and Domain Experts
Other
Solutions
HDFS NUMA
MPI Grappa
Census US Gov Unstructured CCD
Bayesian
Networks
Support Vector
Machines
Cost of Chronic
Interventions
Age/Gender
Prediction
Malware
Analytics
Personalized
Cancer Therapy
ETL Tools
Raw Data from Sources (SID, OSHPD, HCUP, Edifecs, MHS, CMS, LINCS, Industry)
Sqoop
9. iTornado
Routing Service With Real World Severe
Weather
Demo Paper in ACM SIGSPATIAL 2014
(Best Demo paper award)
Fatalities Stats by Weather Related Hazards
http://www.nws.noaa.gov, June 2014.
11. PreGo
Dynamic Multi-‐Preference Routing
Routing approaches by attribute count and time dependence:
Time-Homogeneous, Single Attribute: Dijkstra, A*
Time-Homogeneous, Multiple Attribute: Stewart et al. 91
Time-Variant, Single Attribute: Betsy et al. 07
Time-Variant, Multiple Attribute: ?
[Figure: example graph with nodes s, a, b, c, d, e, f, g, and h. Pairs such as <0,0>, <1,1>, <2,2>, <3,4>, <4,4>, and <5,7> annotate the graph; each edge carries two five-element vectors T and R, for example T=[5,1,3,4,5] and R=[7,1,2,4,5].]
12. Special Needs Education: Teacher Trainer Effectiveness Analysis
Customized Surveys
Training Registration
Survey Management
To support streamlined data collection and
performance evaluation across the State Needs
Projects.
Project Stakeholders
Office of the Superintendent of Public
Instruction
Center for Data Science
Data Dashboard Purpose Report Generation
Geographic Distribution Maps
Demographic Reports
Brad Porter, Aniruddha Desai, Yitao Li, David Hazel,
Michelle Maike, Greg Benner, Ankur Teredesai, Leslie Pyper, Vickie Green
13. Systems Biology
Predictive Models
and software
Applications: Personalized
medicine, drug discovery
Focus: Develop machine learning
methods and tools to effectively
integrate multiple big data sources in
biology.
15. Detecting Malware Activity based on
Automatically Generated Domains
Command & Control
xyz.com xyz.com
Infected node
Partnering with NIARA we obtained a large dataset of Automatically Generated Domains.
Based on the intercepted domain features we
are able to identify the malware infecting a
network.
16. (March 2012)
• Will this Heart Failure patient
get readmitted within 30 days?
• Yes or No (Binary Classification)
Reduce CHF
Readmission
Readmission ?
Machine Learning?
Joint NSF / NIH Solicitation on Health Care and Big Data
17. Affordable Care Act => Avoidable Costs
Readmissions are AVOIDABLE
20%
32%
30 days
60 days
75%
25% Non CHF
CHF
• Readmissions national cost $17 billion
annually
• 76 % considered avoidable
Readmissions
Congestive Heart Failure (CHF)
Source: www.presidency.ucsb.edu, cdc.gov, tmz.com
18. CHF ROR: 30-Day Hospital Readmission Risk Prediction
[Diagram. Building the model: patient feature vectors with class labels (No readmission, Readmission) are passed to machine learning algorithms. Scoring the tuple: the feature vector of a new patient is labeled No readmission or Readmission.]
19. Some of the Steps
• Data Understanding and Integration
• Data Cleaning
• Data Transformation
Extracting data from Epic - 16 data marts and 200 views:
Heart Failure Inpatient Summary
Encounter.Flowsheet
PatientEncounterHospital
vs
20. Public Data: State Inpatient Dataset 2009-2012
AGE  ZIP    RACE  ATYPE  NCHRONIC  LOS  FEMALE  DXCCS1  PRCCS1  TOTCHG
52   98122  1     3      12        3    0       153     212     56,511
87   98109  1     3      7         1    1       162     -       12,687
26   98028  4     3      1         30   1       139     195     127,300
• Washington State Inpatient Data
• Admission level Claims
• ~400 attributes
• Demographics
• ICD9 Diagnosis codes
• ICD9 Procedure codes
• Charges
• Admissions by year
• 2009 – 652702
• 2010 – 651783
• 2011 – 648079
• 2012 – 648092
21. Variety and Volume (2/3 V’s of Big Data)
Pre Admission Post Admission Pre-‐ Discharge Discharge
-‐ Demographics
-‐ Vital Sign
-‐Prior Hospitalization
Pulse rate
Blood pressure
Respiration rate
BMI
Number of prior admissions
Prior length of stay
+ Demographics
Sodium level
Glucose level
Hemoglobin level
Creatinine level
Hematocrit level
Neutrophils level
Ejection Fraction
BUN level
+ Vital Sign
+ Prior Hospitalization
-‐ Lab Test
+ Vital Sign
+ Prior Hospitalization
+ Demographics
+ Lab Test
-‐ Diagnosis Information
Number of secondary diagnosis
Chronic systolic heart failure
Acute kidney failure
Chest pain
Hyper potassemia
Bronchopneumonia
Other chronic pulmonary heart diseases
Syncope and collapse …
+ Prior Hospitalization
+ Demographics
-‐ Comorbidities
Acute coronary syndrome Asthma
COPD Ulcer Dialysis Dementia
Arrhythmias Mal Nutrition
Vascular Depression
-‐ Discharge/Admit codes
Admit /Discharge type
Severity Of illness Risk Of Mortality
-‐ Utilization Information
Operating room CTSCAN
Emergency Room CCU
Marital status Age
Racial group
Gender
22. (Dec 2012) Initial Models
Data integration
Feature Construction
Predictive modeling
• Logistic Regression
• Naïve Bayes
• Support Vector Machines
[Bar chart, Area Under the Curve (AUC), axis 0.54 to 0.74: Yale Model (Comparative …) 0.6; Amarasingham et al. 0.72; Our current Result 0.64]
Several Rejects:
KDD Industry Track
2013
AMIA 2013
JAMIA 2013
2012
23. (July 2013) (much better) & Some Papers
§ Improved data exploration
§ S.-C. Chin, K. Zolfaghar, S. Basu Roy, A. Teredesai, and P. Amoroso, "Divide-n-Discover -- Discretization based Data Exploration Framework for Healthcare Analytics," 7th International Conference on Health Informatics (HEALTHINF Short Paper), Angers, France, 2014
§ N. Meadem, N. Verbiest, K. Zolfaghar, J. Agarwal, S.-C. Chin, S. Basu Roy, A. Teredesai, D. Hazel, P. Amoroso, and L. Reed, "Predicting Risk of Readmission for Congestive Heart Failure Patients," Workshop on Data Mining for Healthcare (DMH), Chicago, IL, 2013
[Bar chart, Area Under the Curve (AUC), axis 0 to 0.8: Yale Model (Comparative Baseline) 0.6; Amarasingham et al. 0.72; Our 2012 Result 0.64; Our current Result 0.74]
§Improved Modeling Effort
24. (Dec 2013) Prototype or a possible Product?
& yes, More Papers
§ Successful Deployment
§ K. Zolfaghar, J. Agarwal, D. Sistla, S.-C. Chin, S. Basu Roy, and N. Verbiest, "Risk-O-Meter: An Intelligent Clinical Risk Calculator," 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Chicago, IL, 2013
§ Kiyana Zolfaghar, Naren Meadem, Ankur Teredesai, Senjuti Basu Roy, Si-Chi Chin, Brian Muckian: Big data solutions for predicting risk-of-readmission for congestive heart failure patients. BigData Conference 2013: 64-71
25. Multi Layer Classifier: Automatically Detecting Classification Windows
Will patient ever readmit?
Will patient readmit
within 30 days?
YES NO
YES NO
KNN
LR
NB
SVM
KNN
32% of all data
Only 5% of the patients who return within 30 days are filtered out
26. Generalizing the 30,60,90 Day readmission
§ Automatic design of time prediction hierarchy
§ Feature selection and factor analysis at each layer
§ Different classification algorithms in each layer and satisfying different
quality metrics
28. Simple 3 Layer Example
• Stage 1: Design a predictive model for the patients who are likely to
come back within a time window of (X, K), where X is the maximum
number of days until next readmission
• Stage 2: Design a predictive model for time window of (K, 30)
• Stage 3: Design a predictive model for time window of <30 days of
readmission
HOW TO AUTOMATICALLY DETECT THE MIDDLE CUTPOINT K?
29. Hill Climbing Algorithm to Detect K
§ Generate a random number K between X and 30
§ Compute C1= Centroid(X,K) , C2= Centroid(K+1,30)
§ Compute the KLCurrent = KLDiv(C1,C2)
§ K’=K+i K”=K-‐i
§ Find a point K2 between (K’,K’’) , and check
§ If KLDiv( Centroid(X,K2), Centroid(K2,30)) > KLCurrent
§ If the above condition is satisfied, then K=K2
§ KLCurrent = KLDiv( Centroid(X,K2), Centroid(K2,30))
§ Repeat the above steps until no further check is possible
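The hill-climbing search above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout (days until readmission paired with a non-negative feature vector), the rescaling of each centroid into a probability distribution so that KL divergence is defined, the neighbourhood radius `step` (the slide's i), and every function name are hypothetical.

```python
import math
import random

def centroid(rows):
    """Mean feature vector, rescaled to sum to 1 so KL divergence applies (assumption)."""
    dims = len(rows[0])
    mean = [sum(row[d] for row in rows) / len(rows) for d in range(dims)]
    total = sum(mean)
    return [m / total for m in mean]

def kl_div(p, q, eps=1e-9):
    """Kullback-Leibler divergence KL(p || q), smoothed to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def split_score(records, x_max, k, floor):
    """KLDiv between the centroids of windows (k, x_max] and (floor, k]."""
    far = [vec for days, vec in records if k < days <= x_max]
    near = [vec for days, vec in records if floor < days <= k]
    if not far or not near:
        return float("-inf")  # an empty window cannot be a useful split
    return kl_div(centroid(far), centroid(near))

def hill_climb_cutpoint(records, x_max, floor=30, step=5, seed=0):
    """Return (K, score), where K is a local maximiser of split_score."""
    rng = random.Random(seed)
    k = rng.randint(floor + 1, x_max - 1)          # random start between 30 and X
    current = split_score(records, x_max, k, floor)
    while True:
        neighbours = [k2 for k2 in range(k - step, k + step + 1)
                      if floor < k2 < x_max and k2 != k]
        scored = [(split_score(records, x_max, k2, floor), k2) for k2 in neighbours]
        best_score, best_k = max(scored, default=(float("-inf"), k))
        if best_score <= current:                  # no neighbour improves: stop
            return k, current
        k, current = best_k, best_score

# Synthetic cohort: patients returning within 90 days look different from later ones.
records = [(d, [9.0, 1.0]) if d <= 90 else (d, [1.0, 9.0]) for d in range(31, 181)]
k, score = hill_climb_cutpoint(records, x_max=180)
print(k, round(score, 3))
```

Because the search only accepts strictly improving moves, it stops at a local maximum; on real data several random restarts would likely be needed.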
32. All in one Package – Risk-‐O-‐Meter (KDD 2013)
33. Pre Admission Post Admission Pre -‐ Discharge Discharge
Post-‐Discharge
Care
Management
Pipeline
“White Gap”PCP HF Service
Care
Management
Payer
ChroniRisk Continuous Readmission Risk Assessment Across Continuum of Care*
78%*
42%*
Service Line EMRPCP Tools
Psycho-‐social risk
scoring
2013 HF Readmission Statistics
• 7.1 M Readmits
• 5.3 M Avoidable
• $13,000 each
• $13 B opportunity cost
Patient Encounters Scored
+18,000 (HF cohort)
34. Risk – Done
Cost – Done
Next?
Actionable Interventions
If we can predict can we recommend?
Rui Liu, Kiyana Zolfaghar, SC Chin, Senjuti Basu Roy, and Ankur Teredesai, "A Framework to Recommend Interventions for 30-Day Heart Failure Readmission Risk," 2014 IEEE International Conference on Data Mining (ICDM), 2014, pp. 911-916. DOI: 10.1109/ICDM.2014.89
35. A real and common Chronic Readmission
75-‐year old, female
Chronic pulmonary disease,
depression, hypertension
and diastolic heart failure
High Risk
Medium Risk
Low Risk
Readmit!
Intervention Plan 1
Major Operating Room, Chest X-‐ray and others
Intervention Plan 2
Echocardiology, CCU and others
Intervention Plan 3
Emergency Room and others
36. Risk will be
lower when the
interventions
are performed
The patient is
not readmitted
Intervention Rule Generation
Readmission
Age Gender
Pneumonia
DX486
Acute respiratory failure
DX51881
CHF
DX4280
Cont inv mec ven
<96 hrs
PR9671
Venous cath NEC
PR3893
Packed cell
transfusion
PR9904 Rule
Repository
Valid Rule 1
Female, Diabetes, Major Operating Room,
Chest X-‐ray and others
Valid Rule 2
Male, Hypertension, Echocardiology, CCU and
others
Invalid Rule 3
Female, Depression, Emergency Room and
others
Invalid Rule 4
Male, COPD, Emergency Room and others
Bayesian Network
Construction
Intervention Rule
Generation
Intervention
Recommendation
Evaluation
Compute patient risk using only non-procedural attributes
Compute patient risk using procedural attributes
Compare the difference between the two probabilities
Store the rules where the risk is reduced after introducing the procedures
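The rule-generation steps above can be sketched as follows. This is an illustrative sketch only: in the deck the risk comes from a Bayesian network, whereas `toy_risk` here is a made-up stand-in with invented effect sizes, and the function names and rule layout are assumptions.

```python
def generate_rules(candidates, risk):
    """Keep candidate rules whose procedures lower the estimated readmission risk."""
    repository = []
    for profile, procedures in candidates:
        baseline = risk(profile, frozenset())   # non-procedural attributes only
        treated = risk(profile, procedures)     # same profile plus the procedures
        if treated < baseline:                  # risk reduced: store the rule
            repository.append((profile, procedures, baseline - treated))
    return repository

# Toy stand-in for the Bayesian network's readmission probability (made-up numbers).
EFFECT = {"Echocardiology": -0.15, "CCU": -0.05, "Emergency Room": 0.05}

def toy_risk(profile, procedures):
    return 0.5 + sum(EFFECT.get(p, 0.0) for p in procedures)

candidates = [
    (frozenset({"Male", "Hypertension"}), frozenset({"Echocardiology", "CCU"})),
    (frozenset({"Female", "Depression"}), frozenset({"Emergency Room"})),
]
rules = generate_rules(candidates, toy_risk)
for profile, procedures, reduction in rules:
    print(sorted(profile), sorted(procedures), round(reduction, 2))
```

Storing the size of the risk reduction alongside each rule is a design choice of this sketch; it would allow rules to be ranked later.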
37. Recommendation for New Patient
Intervention Plan 1
Major Operating Room, Chest X-‐ray and others
Intervention Plan 2
Echocardiology, CCU and others
Intervention Plan 3
Emergency Room and others
Top 3 intervention plans
Rule Repository
New Patient Attributes
Summarized Intervention Plan
Major Operating Room, Echocardiology , Chest
X-‐ray and others
Summarize
The Rule Repository is HUGE! (over
30k rules)
Parallel Solution!
Compute similarity between established attribute profile and a given patient profile
Identify rules where the established attribute is most similar to the patient input
Recommend interventions extracted from the established rules
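A minimal sketch of the recommendation step, assuming attribute profiles are sets and using Jaccard similarity as the similarity measure. The slides do not name the measure, so Jaccard is an assumption, as are the function names and the example rules and patient below.

```python
def jaccard(a, b):
    """Set overlap in [0, 1]; an assumed choice of profile similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(patient, repository, top_n=3):
    """Return the top_n most similar rules' plans and their summarized union."""
    ranked = sorted(repository, key=lambda rule: jaccard(patient, rule[0]), reverse=True)
    plans = [procedures for _, procedures in ranked[:top_n]]
    summary = set().union(*plans) if plans else set()
    return plans, summary

# Hypothetical rule repository: (attribute profile, interventions).
repository = [
    (frozenset({"Female", "Diabetes"}), frozenset({"Major Operating Room", "Chest X-ray"})),
    (frozenset({"Male", "Hypertension"}), frozenset({"Echocardiology", "CCU"})),
    (frozenset({"Female", "Hypertension", "COPD"}), frozenset({"Echocardiology"})),
    (frozenset({"Male", "COPD"}), frozenset({"CT scan"})),
]
patient = frozenset({"Female", "Diabetes", "Hypertension"})
plans, summary = recommend(patient, repository)
print(sorted(summary))
```

With a repository of tens of thousands of rules, the similarity scan is the part that would be parallelised; the ranking and union stay the same.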
38. Validation – Data Highlights
• State Inpatient Database (SID) of Washington State heart failure cohort in year 2010
(67967 patients) for training and 2011 (52021 patients) for testing
• 3908 diagnosis and 2049 procedure codes are involved.
• Feature Selection is performed using chi-‐square test.
Demographics Age, Gender, Race
Comorbidity & Diagnosis 21 comorbidities and 90 diagnosis
Utilization & Interventions 21 health service utilization flags and 70 interventions
Others Length of Stay, # of diagnosis and interventions
High Dimensional
Bayesian Network
Construction
Intervention Rule
Generation
Intervention
Recommendation
Evaluation
Extract patients from the test set who were not
readmitted within 30 days
Compute the evaluation metrics between the recommended interventions
and the actual interventions
40. Back to the Chronic Readmission Case
75-‐year old, female
Chronic pulmonary disease,
depression, hypertension
and diastolic heart failure
No-‐readmit!
Cardiac catheterization lab, CT scan, echo-‐
cardiology, echo-‐cardiogram,
Cardiac catheterization lab, CT scan, echo-‐
cardiology, echo-‐cardiogram
41. Accountable Care Organizations
Cost/Charge Prediction
Marquardt James, Newman Stacey, Hattarki Deepa, Srinivasan Rajagopalan, Sushmita Shanu, Ram Prabhu, Prasad Viren, Hazel David, Ramesh Archana, De Cock Martine, Teredesai Ankur, "HealthSCOPE: An Interactive Distributed Data Mining Framework for Scalable Prediction of Healthcare Costs," IEEE Data Mining Conference Demo Track, 2014, pp. 1227-1230. DOI: 10.1109/ICDMW.2014.45
42. Motivation: ACO Cost Prediction
QUESTIONS
• What are healthcare costs for assigned population in 2015?
• Why is the cost so high or low?
• How does the cost distribute across demographics?
DATA SCIENCE
DATA
APPLICATIONS
Demographics
Diagnosis
Codes
Procedure
Codes
Drugs
Lab Results
Clinical
Claims
Sources : SID, OSHPD, MEPS Source : MultiCare Collaboration
Charges
Vitals
Population Predictive
Modeling
Feature Prioritization
Health Prediction
Care Management
Individual Predictive
Modeling
Chandola et. al, KDD 2013
43. Cost/Charge Prediction: Problem Description
• Goal: predict the future healthcare cost of individuals based on their past medical and cost information.
• Supervised machine learning problem.
• Input:
• Previous health information (e.g. diagnosis, comorbidities, etc).
• General demographics (age, gender, race)
• Previous healthcare cost
• {X} = (x1, x2, x3 ......xp)
• Output:
• Y = Future healthcare cost
44. Four Scenarios for predicting cost
• Three Months of Historical data (Medical, Demographic and Cost) → Cost of Following Nine months (1Q)
• Six Months of Historical data (Medical, Demographic and Cost) → Cost of Following Six months (2Q)
• Nine Months of Historical data (Medical, Demographic and Cost) → Cost of Following Three months (3Q)
• Twelve Months of Historical data (Medical, Demographic and Cost) → Cost of Following Twelve months (4Q)
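The four scenarios amount to different history and target windows over a patient's monthly costs. The sketch below shows only the cost part of the split, assuming costs are available as a list of monthly totals; the `SCENARIOS` mapping mirrors the slide, while the function name and data layout are hypothetical, and the medical and demographic features are omitted.

```python
# (months of history used as input, months of future cost to predict)
SCENARIOS = {"1Q": (3, 9), "2Q": (6, 6), "3Q": (9, 3), "4Q": (12, 12)}

def make_example(monthly_costs, scenario):
    """Split one patient's monthly costs into (history features, future cost target)."""
    history, horizon = SCENARIOS[scenario]
    if len(monthly_costs) < history + horizon:
        raise ValueError("not enough months of data for scenario " + scenario)
    features = list(monthly_costs[:history])
    target = sum(monthly_costs[history:history + horizon])
    return features, target

costs = list(range(1, 25))            # 24 months of made-up costs: 1, 2, ..., 24
for name in SCENARIOS:
    x, y = make_example(costs, name)
    print(name, len(x), y)
```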
45. Non-‐ Gaussian Distribution of Healthcare Costs
This makes it a challenging and interesting problem for research
46. Existing Cost prediction Methods
• Limited to Rule based or Multiple Linear Regression methods
• Rule Based methods
• Requires domain knowledge
• Expensive
• Multiple Linear Regression
• Multi-‐collinearity Issue
• Sensitive to extreme values (outliers)
• Evaluation
• Estimate the mean cost of the given sampling distribution.
• Often in-‐sample data used to report predictive performance.
• R2 evaluation metric (not a true indicator)
47. Our Contributions
• Investigate the utility of state-of-the-art machine learning algorithms for the cost prediction problem.
• We empirically evaluate three algorithms:
• Regression Trees
• M5 Model Trees
• Random Forest
48. Regression Tree
[Figure: example regression tree with Yes/No splits on Age > 60?, Has Asthma?, and Gender = Female?, and leaf values 21,000, 46,000, 62,000, and 85,000.]
49. M5 Model Tree
[Figure: example M5 model tree with Yes/No splits on Age > 60?, Has Asthma?, and Gender = Female?.]
50. Random Forest
[Figure: random forest of three example trees with Yes/No splits. One tree splits on Had Procedure X?, Age > 18?, and Gender = Male?; one on # Admits > 3?, Race = White?, and Has CHF?; one on Age > 60?, Has Asthma?, and Gender = Female?. Each tree has leaf values 21,000, 46,000, 62,000, and 85,000.]
52. MAE Results – SID Data (3Q Scenario)
[Bar chart: MAE ($), axis 0 to 30,000. Models compared: Average Baseline, Previous Cost Regression, Multiple Linear Regression, Regression tree, Random Forest, Model Tree, grouped as Baselines and Advanced Models.]
53. MAE Results – MEPS Data
[Bar chart: MAE ($), axis 0 to 14,000. Models compared: Average Baseline, Previous Cost Regression, Multiple Linear Regression, Regression tree, Random Forest, Model Tree, grouped as Baselines and Advanced Models.]
54. Prediction Error Results – M5 Model Trees
[Chart: Error ($), axis 0 to 90,000, showing MAE and RMSE for the 1Q, 2Q, 3Q, and 4Q scenarios.]
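For reference, the two error measures reported here can be computed as below. This is a generic sketch of the standard definitions of mean absolute error and root mean squared error; the cost figures are made up for illustration.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average size of the prediction error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: penalises large errors more heavily than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [100, 200, 300, 1000]        # made-up costs in dollars
predicted = [110, 190, 300, 600]      # one large miss on the high-cost patient
print(mae(actual, predicted), round(rmse(actual, predicted), 1))
```

A single badly mispredicted high-cost patient moves RMSE far more than MAE, which is one reason to report both for skewed healthcare costs.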
55. Error Distribution: WA State SID Data
For a large fraction of the population (75%), we were able to predict with higher accuracy using these algorithms.
[Chart: Maximum Prediction Error ($), axis 0 to 18,000, against Portion of Population (0%, 25%, 50%, 75%) for Multiple Linear Regression, Regression Tree, Random Forest, and Model Tree.]
57. Most difficult cohort to predict
[Bar chart: MAE ($), axis 0 to 35,000, for the Asthma, Diabetes, CHF, COPD, Coronary, and Over 65 cohorts, comparing model trees with linear regression.]
59. Thu, Nov 7, 2013 at 10:50 AM
-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐ Forwarded message -‐-‐-‐-‐-‐-‐-‐-‐-‐-‐
From: Windows Azure Pass System Admin <wapadmin@microsoft.com>
Date: Thu, Nov 7, 2013 at 10:50 AM
Subject: Gifting Letter for Windows Azure Research Pass
To: "Ankur M. Teredesai" <ankurt@uw.edu>
Cc: "Azure4Research (RFP External)" <azurerfp@microsoft.com>
Dear Ankur M. Teredesai ,
We have approved your application for a Windows Azure Research Pass Grant. In
order to receive your pass, download the Microsoft gifting letter from the following
link:
61. Scale Issues: Cost Prediction as a Service
[Architecture diagram with numbered data flows ① to ④.
Clients: Web App for ACOs (Beneficiary Claims, Population Batch/Individual; Predicted, Previous year, Historic population Costs + population statistics) and Web App for Individual (① Beneficiary Claims for individual; ④ Predicted cost for the individual).
Cost Prediction API: Model Selector; Individual Beneficiary Feature Vector; Individual Beneficiary Predicted Cost.
Cost Prediction Engine: Model Bank deployed on ADAPA with models A (Linear Regression), B (Regression Trees), and C (M5 Model Trees); R; Spark; Big Data Stack.
Data Sources: WA-SID Claims / MEPS Survey (for training).]
63. Apache Spark
HDFS
Slave 1
Slave 1
Master
Driver RDD
In Memory Data
Partition 1
In Memory Data
Partition 2
Spark
Spark
Spark
Data Partition1
Replica Data
Partition2
Data Partition2
Replica Data
Partition2
64. Weighted k-‐NN for Regression
Data
Partition 1
kNN1
Predicted Cost
kNN2
2k NN
kNN
Node 1
Data
Partition 2
Node 2
Test
Instance Top k
Group
& Sort
Group & Sort
Weighted
Average
Compute
kNN
Compute
kNN
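The diagram's flow (local kNN on each data partition, merge the candidates, keep the global top k, then take a weighted average) can be sketched without a cluster. This is a single-process illustration, not the Spark implementation: partitions are plain lists, the distance is Euclidean, and inverse-distance weighting is an assumed choice since the slide does not specify the weights.

```python
import heapq
import math

def local_knn(partition, query, k):
    """k nearest (distance, cost) pairs within one data partition (one node)."""
    scored = ((math.dist(features, query), cost) for features, cost in partition)
    return heapq.nsmallest(k, scored)

def predict_cost(partitions, query, k, eps=1e-9):
    """Merge per-partition candidates, keep the global top k, average by 1/distance."""
    candidates = [pair for part in partitions for pair in local_knn(part, query, k)]
    top_k = heapq.nsmallest(k, candidates)     # with two partitions: best k of 2k
    weights = [1.0 / (distance + eps) for distance, _ in top_k]
    return sum(w * cost for w, (_, cost) in zip(weights, top_k)) / sum(weights)

# Two made-up partitions of (feature vector, cost) training instances.
partitions = [
    [((0.0,), 100.0), ((10.0,), 900.0)],
    [((2.0,), 300.0), ((11.0,), 950.0)],
]
print(predict_cost(partitions, query=(1.0,), k=2))
```

Each partition only ever sends k candidates to the merge step, so the amount of data moved between nodes does not grow with partition size.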
65. Rough Set
• Rough set theory is an ML framework that
is especially suitable for information
systems with inconsistencies.
• Rough set theory handles discrete
attributes.
• Lower approximation: instances that
necessarily belong to the class
• Upper approximation: instances that
possibly belong to the class
Patient   Age ≥ 50   Alcohol Disorder Visit   Cost
P1        Yes        Yes                      High
P2        Yes        Yes                      High
P3        Yes        No                       Low
P4        Yes        No                       High
P5        No         No                       Low
P6        No         Yes                      High
Similar Patients but belong to
different classes!
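The lower and upper approximations of the High cost class for the table above can be computed as in this sketch. It groups patients by their condition attributes and applies the two definitions from the slide; the function name and dictionary layout are assumptions.

```python
from collections import defaultdict

def approximations(table, target):
    """Crisp rough-set lower and upper approximations of one decision class."""
    blocks = defaultdict(set)                 # patients sharing the same conditions
    for patient, (conditions, _) in table.items():
        blocks[conditions].add(patient)
    members = {p for p, (_, decision) in table.items() if decision == target}
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= members:                  # every look-alike is in the class
            lower |= block
        if block & members:                   # at least one look-alike is in the class
            upper |= block
    return lower, upper

# (Age >= 50, Alcohol Disorder Visit) -> Cost, copied from the table above.
table = {
    "P1": (("Yes", "Yes"), "High"),
    "P2": (("Yes", "Yes"), "High"),
    "P3": (("Yes", "No"), "Low"),
    "P4": (("Yes", "No"), "High"),
    "P5": (("No", "No"), "Low"),
    "P6": (("No", "Yes"), "High"),
}
lower, upper = approximations(table, "High")
print(sorted(lower), sorted(upper))
```

P3 and P4 share the same attribute values but differ in class, so both fall in the upper approximation and neither in the lower one.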
66. Fuzzy Rough Set
• Uses fuzzy logic to handle continuous
attributes.
• Similarity matrix contains values
between 0 and 1.
• Inconsistent instances are highly
related but have a different class.
Patient   Age   Alcohol Disorder Visits   Cost
P1        52    1                         $13335
P2        59    4                         $277966
P3        55    0                         $8139
P4        50    0                         $66058
P5        34    0                         $5815
P6        26    1                         $38526
Similarity matrix:
      P1    P2    P3    P4    P5    P6
P1    1     0.52  0.83  0.84  0.60  0.61
P2    0.52  1.00  0.44  0.36  0.12  0.13
P3    0.83  0.44  1     0.92  0.68  0.44
P4    0.84  0.36  0.92  1     0.76  0.51
P5    0.60  0.12  0.68  0.76  1     0.75
P6    0.61  0.13  0.44  0.51  0.75  1
67. Fuzzy Rough Set
• Let $r_{j,i}$ be the degree of similarity of instances $i$ and $j$.
• Let $c_i$ be the degree to which instance $i$ belongs to the class.
• Then the degree to which instance $j$ belongs to the:
• Lower approximation of the class is: $\min\{\max(1 - r_{j,i},\, c_i) \mid i = 1,\dots,n\}$
• Upper approximation of the class is: $\max\{\min(r_{j,i},\, c_i) \mid i = 1,\dots,n\}$
• Current implementations can handle only up to 100,000 instances
because they keep the similarity matrix in memory.
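The two formulas can be evaluated directly on the six-patient similarity matrix from the previous slide. In this sketch the class memberships are an assumption: they reuse the High/Low labels of the crisp rough-set example (High = 1, Low = 0), which the fuzzy slide does not state explicitly.

```python
def fuzzy_rough_approximations(similarity, membership):
    """Per-instance lower and upper approximation degrees (min-max formulas)."""
    n = len(membership)
    lower = [min(max(1.0 - similarity[j][i], membership[i]) for i in range(n))
             for j in range(n)]
    upper = [max(min(similarity[j][i], membership[i]) for i in range(n))
             for j in range(n)]
    return lower, upper

similarity = [
    [1.00, 0.52, 0.83, 0.84, 0.60, 0.61],
    [0.52, 1.00, 0.44, 0.36, 0.12, 0.13],
    [0.83, 0.44, 1.00, 0.92, 0.68, 0.44],
    [0.84, 0.36, 0.92, 1.00, 0.76, 0.51],
    [0.60, 0.12, 0.68, 0.76, 1.00, 0.75],
    [0.61, 0.13, 0.44, 0.51, 0.75, 1.00],
]
high_cost = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0]   # assumed labels for P1..P6

lower, upper = fuzzy_rough_approximations(similarity, high_cost)
for name, lo, up in zip(["P1", "P2", "P3", "P4", "P5", "P6"], lower, upper):
    print(name, round(lo, 2), round(up, 2))
```

Under these assumed labels, P4 gets a low lower-approximation degree because it is highly similar (0.92) to P3, which belongs to the other class.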
70. Implementation
• The construction of the similarity matrix can be done in a parallel manner, making each of K compute nodes calculate n/K columns of the similarity matrix.
• No need to store the similarity matrix as a whole.
• The construction of the similarity matrix does not have to be finished before (partial) computation of the lower and upper approximations can begin.
Node 1 Node 2
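The column-block idea can be sketched in a single process: each "node" handles a block of columns, computes similarities on demand, and keeps only running minima and maxima, which are then merged. The similarity function (one numeric attribute, scaled absolute difference), the class labels, and all names are assumptions made for illustration; a real deployment would dispatch the blocks to separate workers.

```python
def similarity(a, b, span):
    """Assumed similarity in [0, 1] for a single numeric attribute."""
    return 1.0 - abs(a - b) / span

def partial_approximations(values, membership, columns, span):
    """Lower/upper contributions from one block of columns i (one compute node)."""
    n = len(values)
    lower, upper = [1.0] * n, [0.0] * n
    for i in columns:
        for j in range(n):
            r = similarity(values[j], values[i], span)   # computed on demand, never stored
            lower[j] = min(lower[j], max(1.0 - r, membership[i]))
            upper[j] = max(upper[j], min(r, membership[i]))
    return lower, upper

def merge(partials):
    """Combine per-node results: elementwise min for lower, max for upper."""
    lowers, uppers = zip(*partials)
    return [min(col) for col in zip(*lowers)], [max(col) for col in zip(*uppers)]

ages = [52, 59, 55, 50, 34, 26]                 # ages from the fuzzy rough set example
high_cost = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0]      # assumed class memberships
span = max(ages) - min(ages)
blocks = [range(0, 3), range(3, 6)]             # K = 2 nodes, n/K = 3 columns each

merged = merge([partial_approximations(ages, high_cost, b, span) for b in blocks])
single = partial_approximations(ages, high_cost, range(len(ages)), span)
print(merged == single)
```

Because min and max are associative, the merged result matches a single pass over all columns, and each node needs memory proportional to n rather than to the full n-by-n matrix.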
74. Readmission Application
• Android
• Windows Phone
• Patient View
• what is my risk
• Doctor View
• who are my risky patients?
• alerts
• Interventions
76. 0.6 AUC
Yale Model
(Baseline)
Milestones: Readmission Risk
0.64 AUC
UW 2012
Result
Ensemble
method,
Hierarchical
classification
Dec 2012
0.74 AUC
UW 2014 Result
Lab results
+
New
Algorithm
(Adaboost)
Feb 2014
QlikView
Readmission
App
Dec 2013
Machine Learning
Process to Target
New Chronic
Diseases
Aug 2014 -‐> Moving Forward
Integrating
care pathway
March 2014
Bayesian
Network
Learning
AUC – Accuracy measure
(Area Under Curve)
Real Time
Care
Factors &
Pathways
July 2014
with
EPIC
Post-‐Discharge
(Clinical data)
June 2013
Risk-‐o-‐Meter
Development
+
Big data Efforts
Pre-‐Admission
(Clinical data)
Post-‐Discharge
(Claim data)
Post-‐Admission
(Clinical data)
IEEE Big Data
REF #3
KDD
REF #1 & 2
HEALTHINF
REF #4 & 5
KDD
REF #6
ICDM 2014
REF #6
77. Milestones: Cost Prediction
Problem Exploration
H-‐SCOPE I
SID Data
June 2014
H-‐SCOPE IV
SID + MEPS
data
Nov. 2014
H-‐SCOPE III
Adapa Scoring
Engine
Spark
Framework
Sept. 2014 Aug 2015 -‐> Moving Forward
H-‐SCOPE V
Five Cohort
Dec. 2014
M5 Model
Trees
Random
Forest
Regression
Trees
Health
SCOPE VI
July 2015
Admit Level
August 2014
H-‐SCOPE II
Population View
(ACO)
OSHPD Data
Application
Beneficiary
Level
Beneficiary
View
Four Future
Scenario
ICDM 2014 KDD-‐2015 AMIA-‐2015
Sub-‐
Population
Deep
Learning
Time &
Cost Of
Hospital
readmission
H-‐SCOPE VII
AHRQ Private
data
WWW-‐Digital
Health-‐2015
Time, Cost
And
Illness (Alignment)
Prediction
78. Milestones: Merging Threads
AUC: Accuracy measure (Area Under Curve)
Timeline: 2012, 2013, 2014, 2015, 2016 and beyond
Risk of Readmission (Clinical, Sociological & Claims)
2014 2015
Cost Prediction (Claims and secondary data sources)
2015
Risk & Cost Convergence