SlideShare ist ein Scribd-Unternehmen logo
1 von 1
Downloaden Sie, um offline zu lesen
Patient-Level Data Integration of De-Identified Healthcare Databases
to Support Improved Predictive Analytics
Yang Yang, Reza Sharifi Sedeh, Min Xue, Nandini Raghavan, Daniel Elgort
Contact: yang.yang_1@philips.com; reza.sharifi.sedeh@philips.com; daniel.elgort@philips.com
1, Introduction
Various types of de-identified healthcare databases, from
clinical and administrative to utilization, have emerged
recently, which enable researchers to perform analyses in
each individual domain. However, in the absence of pa-
tient identifying data features, current methods do not al-
low for patient record level integration across these de-
identified databases.
In this paper, we propose a novel approach to over-
come this limitation and integrate multiple de-identified
databases on the patient record level so that inter-domain
research problems become addressable. In addition,
we have developed a scalable healthcare data analytics
pipeline, which incorporates multiple machine learning
methods, including penalized and splined linear models,
logistic regression, random forest, and survival models.
Based on the nature of the integrated database and the an-
alytics purpose, users are provided with options to use any
combination of the available machine learning methods in
a timely manner. Adopting this strategy, users could ob-
tain more meaningful findings from the integrated dataset
compared with using a single database or relying on a sin-
gle analytical method.
2, Data Integration Approach
Many aggregated healthcare databases are strictly de-
identified that all of the hospital and patient identifiers
are removed before any secondary use by researchers
[Meystre et al., 2010]. This fact makes the integration
across databases highly challenging. Recently we devel-
oped a hierarchical approach to integrate de-identified
databases on the patient level using non-uniquely identi-
fying patient features. For example, age, sex, weight, pri-
mary diagnosis and length of hospital stay. The general
approach follows:
• generate UID from features for each patient.
• calculate patient rarity score for each patient.
• use rarest patients to identify the same hospitals
across databases.
• match patients belonging to the same hospital across
databases and repeat it for all matched hospitals.
• categorize matching results into confident, impossi-
ble and possible matches.
Below is an example for a patient with UID
1.5.1.122.18, and his calculated rarity score.
Table 1.: Calculating rarity score for the Native American, 18-year-old,
male patient who has a LOS of 122 days and has died in hospital.
The rarity score 4.5 ∗ 10−11
can be interpreted as, in every
22 billion patients from the hospital population, there is
only one patient with the same UID as him.
3, Data Integration Approach con’t
After generating UIDs for patients, we further added
the diagnosis codes to reduce duplicated patient
matches. The ICD9 [for Medicare et al., 2011]
codes was collapsed to Clinical Classifications [Cost
et al., 2010] for better accuracy and robustness of the
matching. The general rules of matching two pa-
tients can be summarized:
• the patients have the exact, same patient UID.
• the patients share at least 50% of the diagnosis
codes of the patient with less number of diag-
nosis codes. For example, if six and ten diagno-
sis codes have been assigned to Patient A in the
"clinical" database and Patient B in the "claims"
database respectively, then Patient A and Pa-
tient B must share at least three diagnosis codes
to convince us there is a match.
Finally, we summarized the hierarchical match-
ing algorithm into the following flowchart:
A
B
C
Set the rarity coefficient
threshold, r, to 10-10.
Match the patients of the “clinical”
hospital X with coefficients less than
r to the “claims” patients.
Are there five patient
matches and 30% matching
rate between the “clinical”
hospital X and any “claims”
Hospital Y?
YES
Link the de-identified hospital IDs
in the two databases.
NO
Increase r by a
factor of 10.
Our matching criteria of two
patients’records are defined as:
• Same patient ID.
• Share at least 50% diagnosis.
patient Diagnosis patientID
A 1, 2, 3, 3, 5 12345678
a 2, 3 , , 12345678
Patient matching using basic
features
Identified
one-to-one
matched
records
No matched
record
Multiple
matched
records
Age
Gender
Race
Primary Diagnosis
LOS
Mortality
Using secondary feature to
narrow the possible pairs
Confident Matching
DL
AS
Impossible Matching Possible Matching
yes
no
yes
no
Within single year
and single hospital
Figure 1.: A. Integration of eICU and HCUP using pro-
vided common features (eICU and HCUP are two differ-
ent healthcare databases. See Section 5. Data Application);
B. Hospital matching algorithm flowchart; C. Individual
patient matching algorithm flowchart.
4, Analytics Pipeline
PhilipsHealthcareBDS is an automated pipeline which
gives the user opportunities to execute a range of statis-
tical/machine learning models on a specific dataset in a
neat and fast manner. The whole pipeline is written in
R language and it is a Linux command-line based pro-
gram. The pipeline contains five modules in a flowchart
(Figure 2). The pipeline features flexible parallel/serial
scheme, flexible model parameter tuning, robustness to
different datasets with mixed types of explanatory and
response variables, complete logging and error collecting
system, and the ease to add more models in the future.
Currently the pipeline contains ten models/algorithms, in-
cluding Generalized Linear Model with stepwise variable
selection; Lasso, Ridge and Elastic Net algorithm; Group
Elastic Net algorithm; SCAD/MCP algorithm; Random
Forest; Random Survival Forest; Quantile Regression and
Normal-Probit Bivariate Model.
Figure 2.: Flowchart of PhilipsHealthcareBDS. Module within square
parentheses is optional.
6, Results
4e5
2e5
0
-2e5
Residual
ErrorRate
1e5 2e5 3e5 4e5 5e5 6e50
Predicted Value # of Trees Variable Importance
A B
Figure 3.: (A) Linear regression residual plot of in-hospital expen-
diture. (B) Random forest tree error rate (left panel) and variable
importance rank (right panel). The two variables with the highest
effects (blue box) on in-hospital expenditure were plotted versus
the in-hospital expenditure (the two rightmost plots).
Summary of conclusions
• We found a significant correlation between
the actually observed values of mortality or
length of stay (from eICU) and the in-hospital
expenditure (from HCUP).
• We learned that the in-hospital expenditures
(HCUP) of the patients who died in hospital
(eICU) are higher than those alive.
• We found the patients in either extremely
bad condition or excellent condition, inferred
from "Predicted Hospital/ICU mortality" or
"Acute Physiology Score" (eICU), have higher
in-hospital expenditures than patients with
moderate condition. These two variables
were ranked as the top two predictors of ex-
penditure by a random forest method (Figure
3B).
In addition, there are several other findings: Asian
or Pacific Islander patients paid more; patients with
more interventional procedures paid more; patients
with longer actual hospital/ICU lengths of stay paid
more; patients admitted from other health facilities
paid less.
5, Data Application
We integrated patients from Philips eICU database and
Healthcare Cost and Utilization Project (HCUPa
) State In-
patient Database (SID) for Massachusetts between 2008
and 2011. From this full dataset, by "DX1" (primary diag-
nosis ICD-9 code) values we further extracted those with
Heart Disease (i.e., Heart Failure and Cardiovascular My-
ocardial Infarction). The variables available are clinical
variables, utilization variables, billing variables, demo-
graphic variables and hospital characteristics.
We selected and applied five analytical methods on the real
data including: 1, Linear regression with stepwise variable
selection by AIC criteria; 2, Penalized linear model such
as elastic net, SCAD and MCP; 3, Group based penalized
linear model; 4, Random Forest; 5, Quantile Regression.
aDisclaimer: Study design, Data sources, analysis and findings de-
scribed in this paper were executed in compliance with the Data Use Agree-
ment of HCUP.
References
Healthcare Cost, Utilization Project, et al. Clinical classifications
software (ccs) for icd-9-cm. Rockville, MD: Agency for Healthcare
Research and Quality, 2010.
Centers for Medicare, Medicaid Services, et al. Icd-9-cm official
guidelines for coding and reporting. US GPO, Washington, DC,
2011.
Stephane M Meystre, F Jeffrey Friedlin, Brett R South, Shuying
Shen, and Matthew H Samore. Automatic de-identification
of textual documents in the electronic health record: a review
of recent research. BMC medical research methodology, 10(1):70,
2010.

Weitere ähnliche Inhalte

Was ist angesagt?

Disease prediction and doctor recommendation system
Disease prediction and doctor recommendation systemDisease prediction and doctor recommendation system
Disease prediction and doctor recommendation systemsabafarheen
 
PERFORMANCE ANALYSIS OF MULTICLASS SUPPORT VECTOR MACHINE CLASSIFICATION FOR ...
PERFORMANCE ANALYSIS OF MULTICLASS SUPPORT VECTOR MACHINE CLASSIFICATION FOR ...PERFORMANCE ANALYSIS OF MULTICLASS SUPPORT VECTOR MACHINE CLASSIFICATION FOR ...
PERFORMANCE ANALYSIS OF MULTICLASS SUPPORT VECTOR MACHINE CLASSIFICATION FOR ...ijcsa
 
Decision Tree Models for Medical Diagnosis
Decision Tree Models for Medical DiagnosisDecision Tree Models for Medical Diagnosis
Decision Tree Models for Medical Diagnosisijtsrd
 
Chapter 7 & 14 HIT 109
Chapter 7  & 14  HIT 109Chapter 7  & 14  HIT 109
Chapter 7 & 14 HIT 109Angela49938
 
Ascendable Clarification for Coronary Illness Prediction using Classification...
Ascendable Clarification for Coronary Illness Prediction using Classification...Ascendable Clarification for Coronary Illness Prediction using Classification...
Ascendable Clarification for Coronary Illness Prediction using Classification...ijtsrd
 
Big Data Analytics for Healthcare
Big Data Analytics for HealthcareBig Data Analytics for Healthcare
Big Data Analytics for HealthcareChandan Reddy
 
Prognosis of Diabetes by Performing Data Mining of HbA1c
 Prognosis of Diabetes by Performing Data Mining of HbA1c Prognosis of Diabetes by Performing Data Mining of HbA1c
Prognosis of Diabetes by Performing Data Mining of HbA1cIJCSIS Research Publications
 
prediction of heart disease using machine learning algorithms
prediction of heart disease using machine learning algorithmsprediction of heart disease using machine learning algorithms
prediction of heart disease using machine learning algorithmsINFOGAIN PUBLICATION
 
IRJET- Heart Disease Prediction System
IRJET- Heart Disease Prediction SystemIRJET- Heart Disease Prediction System
IRJET- Heart Disease Prediction SystemIRJET Journal
 
Hybrid Technique for Associative Classification of Heart Diseases
Hybrid Technique for Associative Classification of Heart DiseasesHybrid Technique for Associative Classification of Heart Diseases
Hybrid Technique for Associative Classification of Heart DiseasesJagdeep Singh Malhi
 
C omparative S tudy of D iabetic P atient D ata’s U sing C lassification A lg...
C omparative S tudy of D iabetic P atient D ata’s U sing C lassification A lg...C omparative S tudy of D iabetic P atient D ata’s U sing C lassification A lg...
C omparative S tudy of D iabetic P atient D ata’s U sing C lassification A lg...Editor IJCATR
 
Future of Healthcare Forum (Digital Health 2017) - Andrew Satz
Future of Healthcare Forum (Digital Health 2017) - Andrew SatzFuture of Healthcare Forum (Digital Health 2017) - Andrew Satz
Future of Healthcare Forum (Digital Health 2017) - Andrew SatzOZ Digital Consulting
 
Blood bank in network
Blood bank in networkBlood bank in network
Blood bank in networkNirikshith LN
 
Prediction of Heart Disease using Machine Learning Algorithms: A Survey
Prediction of Heart Disease using Machine Learning Algorithms: A SurveyPrediction of Heart Disease using Machine Learning Algorithms: A Survey
Prediction of Heart Disease using Machine Learning Algorithms: A Surveyrahulmonikasharma
 
Heart disease prediction using machine learning algorithm
Heart disease prediction using machine learning algorithm Heart disease prediction using machine learning algorithm
Heart disease prediction using machine learning algorithm Kedar Damkondwar
 

Was ist angesagt? (18)

Disease prediction and doctor recommendation system
Disease prediction and doctor recommendation systemDisease prediction and doctor recommendation system
Disease prediction and doctor recommendation system
 
PERFORMANCE ANALYSIS OF MULTICLASS SUPPORT VECTOR MACHINE CLASSIFICATION FOR ...
PERFORMANCE ANALYSIS OF MULTICLASS SUPPORT VECTOR MACHINE CLASSIFICATION FOR ...PERFORMANCE ANALYSIS OF MULTICLASS SUPPORT VECTOR MACHINE CLASSIFICATION FOR ...
PERFORMANCE ANALYSIS OF MULTICLASS SUPPORT VECTOR MACHINE CLASSIFICATION FOR ...
 
Decision Tree Models for Medical Diagnosis
Decision Tree Models for Medical DiagnosisDecision Tree Models for Medical Diagnosis
Decision Tree Models for Medical Diagnosis
 
Chapter 7 & 14 HIT 109
Chapter 7  & 14  HIT 109Chapter 7  & 14  HIT 109
Chapter 7 & 14 HIT 109
 
Ascendable Clarification for Coronary Illness Prediction using Classification...
Ascendable Clarification for Coronary Illness Prediction using Classification...Ascendable Clarification for Coronary Illness Prediction using Classification...
Ascendable Clarification for Coronary Illness Prediction using Classification...
 
Big Data Analytics for Healthcare
Big Data Analytics for HealthcareBig Data Analytics for Healthcare
Big Data Analytics for Healthcare
 
Prognosis of Diabetes by Performing Data Mining of HbA1c
 Prognosis of Diabetes by Performing Data Mining of HbA1c Prognosis of Diabetes by Performing Data Mining of HbA1c
Prognosis of Diabetes by Performing Data Mining of HbA1c
 
prediction of heart disease using machine learning algorithms
prediction of heart disease using machine learning algorithmsprediction of heart disease using machine learning algorithms
prediction of heart disease using machine learning algorithms
 
IRJET- Heart Disease Prediction System
IRJET- Heart Disease Prediction SystemIRJET- Heart Disease Prediction System
IRJET- Heart Disease Prediction System
 
Bloodbank
BloodbankBloodbank
Bloodbank
 
Stroke Prediction
Stroke PredictionStroke Prediction
Stroke Prediction
 
BLOOD BANK SOFTWARE PRESENTATION
BLOOD BANK SOFTWARE PRESENTATIONBLOOD BANK SOFTWARE PRESENTATION
BLOOD BANK SOFTWARE PRESENTATION
 
Hybrid Technique for Associative Classification of Heart Diseases
Hybrid Technique for Associative Classification of Heart DiseasesHybrid Technique for Associative Classification of Heart Diseases
Hybrid Technique for Associative Classification of Heart Diseases
 
C omparative S tudy of D iabetic P atient D ata’s U sing C lassification A lg...
C omparative S tudy of D iabetic P atient D ata’s U sing C lassification A lg...C omparative S tudy of D iabetic P atient D ata’s U sing C lassification A lg...
C omparative S tudy of D iabetic P atient D ata’s U sing C lassification A lg...
 
Future of Healthcare Forum (Digital Health 2017) - Andrew Satz
Future of Healthcare Forum (Digital Health 2017) - Andrew SatzFuture of Healthcare Forum (Digital Health 2017) - Andrew Satz
Future of Healthcare Forum (Digital Health 2017) - Andrew Satz
 
Blood bank in network
Blood bank in networkBlood bank in network
Blood bank in network
 
Prediction of Heart Disease using Machine Learning Algorithms: A Survey
Prediction of Heart Disease using Machine Learning Algorithms: A SurveyPrediction of Heart Disease using Machine Learning Algorithms: A Survey
Prediction of Heart Disease using Machine Learning Algorithms: A Survey
 
Heart disease prediction using machine learning algorithm
Heart disease prediction using machine learning algorithm Heart disease prediction using machine learning algorithm
Heart disease prediction using machine learning algorithm
 

Andere mochten auch

MITProfessionalX DSx Certificate _ MIT Professional Education Digital Programs
MITProfessionalX DSx Certificate _ MIT Professional Education Digital ProgramsMITProfessionalX DSx Certificate _ MIT Professional Education Digital Programs
MITProfessionalX DSx Certificate _ MIT Professional Education Digital ProgramsYang Yang
 
Juego para ninos de expresion corporal
Juego para ninos de expresion corporalJuego para ninos de expresion corporal
Juego para ninos de expresion corporalBetty de la Cruz
 
Plataformas virtuales
Plataformas virtualesPlataformas virtuales
Plataformas virtualesAna Balbi
 
El uso educativo en las redes sociales
El uso educativo en las redes socialesEl uso educativo en las redes sociales
El uso educativo en las redes socialesJorge Carranza
 
Introducción a la tecnología educativa
Introducción a la tecnología educativaIntroducción a la tecnología educativa
Introducción a la tecnología educativaFrank Félix De La Cruz
 
NWS_M001_Tue08SEP2015.PDF
NWS_M001_Tue08SEP2015.PDFNWS_M001_Tue08SEP2015.PDF
NWS_M001_Tue08SEP2015.PDFElaine Cooney
 
Clothes
ClothesClothes
ClothesalanV8
 
Tormentas solares udes diapositivas
Tormentas solares   udes diapositivasTormentas solares   udes diapositivas
Tormentas solares udes diapositivasNydia30
 
Research study presentation
Research study presentationResearch study presentation
Research study presentationCrusher James
 
Experiências com uso da adubação verde na recuperação dos espaços produtivos ...
Experiências com uso da adubação verde na recuperação dos espaços produtivos ...Experiências com uso da adubação verde na recuperação dos espaços produtivos ...
Experiências com uso da adubação verde na recuperação dos espaços produtivos ...Gilberto Fugimoto
 
디미컨 어린이컴퓨터교육 4주차
디미컨 어린이컴퓨터교육 4주차디미컨 어린이컴퓨터교육 4주차
디미컨 어린이컴퓨터교육 4주차jiyein
 
Dont Try Teach A Pig To Sing
Dont Try  Teach A Pig To SingDont Try  Teach A Pig To Sing
Dont Try Teach A Pig To SingReeni
 

Andere mochten auch (18)

MITProfessionalX DSx Certificate _ MIT Professional Education Digital Programs
MITProfessionalX DSx Certificate _ MIT Professional Education Digital ProgramsMITProfessionalX DSx Certificate _ MIT Professional Education Digital Programs
MITProfessionalX DSx Certificate _ MIT Professional Education Digital Programs
 
Juego para ninos de expresion corporal
Juego para ninos de expresion corporalJuego para ninos de expresion corporal
Juego para ninos de expresion corporal
 
actividad nº3
actividad nº3actividad nº3
actividad nº3
 
Plataformas virtuales
Plataformas virtualesPlataformas virtuales
Plataformas virtuales
 
Helena Carter
Helena CarterHelena Carter
Helena Carter
 
El uso educativo en las redes sociales
El uso educativo en las redes socialesEl uso educativo en las redes sociales
El uso educativo en las redes sociales
 
Introducción a la tecnología educativa
Introducción a la tecnología educativaIntroducción a la tecnología educativa
Introducción a la tecnología educativa
 
NWS_M001_Tue08SEP2015.PDF
NWS_M001_Tue08SEP2015.PDFNWS_M001_Tue08SEP2015.PDF
NWS_M001_Tue08SEP2015.PDF
 
Clothes
ClothesClothes
Clothes
 
SLS maquetes 2015-Portfólio
SLS maquetes 2015-PortfólioSLS maquetes 2015-Portfólio
SLS maquetes 2015-Portfólio
 
Tormentas solares udes diapositivas
Tormentas solares   udes diapositivasTormentas solares   udes diapositivas
Tormentas solares udes diapositivas
 
Seresvivo e não vivos
Seresvivo e não vivosSeresvivo e não vivos
Seresvivo e não vivos
 
Research study presentation
Research study presentationResearch study presentation
Research study presentation
 
Experiências com uso da adubação verde na recuperação dos espaços produtivos ...
Experiências com uso da adubação verde na recuperação dos espaços produtivos ...Experiências com uso da adubação verde na recuperação dos espaços produtivos ...
Experiências com uso da adubação verde na recuperação dos espaços produtivos ...
 
Televisión
TelevisiónTelevisión
Televisión
 
Smvd booklet(2016.09)
Smvd booklet(2016.09)Smvd booklet(2016.09)
Smvd booklet(2016.09)
 
디미컨 어린이컴퓨터교육 4주차
디미컨 어린이컴퓨터교육 4주차디미컨 어린이컴퓨터교육 4주차
디미컨 어린이컴퓨터교육 4주차
 
Dont Try Teach A Pig To Sing
Dont Try  Teach A Pig To SingDont Try  Teach A Pig To Sing
Dont Try Teach A Pig To Sing
 

Ähnlich wie poster_INFORMS_healthcare_2015 - condensed

Classifying Readmissions of Diabetic Patient Encounters
Classifying Readmissions of Diabetic Patient EncountersClassifying Readmissions of Diabetic Patient Encounters
Classifying Readmissions of Diabetic Patient EncountersMayur Srinivasan
 
readmission paper applied health econ
readmission paper applied health econreadmission paper applied health econ
readmission paper applied health econPamela Bonifay Peele
 
Re-admit Historical using SAS Visual Analytics
Re-admit Historical  using SAS Visual AnalyticsRe-admit Historical  using SAS Visual Analytics
Re-admit Historical using SAS Visual AnalyticsMonika Mishra
 
PREDICTION OF ANEMIA USING MACHINE LEARNING ALGORITHMS
PREDICTION OF ANEMIA USING MACHINE LEARNING ALGORITHMSPREDICTION OF ANEMIA USING MACHINE LEARNING ALGORITHMS
PREDICTION OF ANEMIA USING MACHINE LEARNING ALGORITHMSAIRCC Publishing Corporation
 
Prediction of Anemia using Machine Learning Algorithms
Prediction of Anemia using Machine Learning AlgorithmsPrediction of Anemia using Machine Learning Algorithms
Prediction of Anemia using Machine Learning AlgorithmsAIRCC Publishing Corporation
 
Machine learning and operations research to find diabetics at risk for readmi...
Machine learning and operations research to find diabetics at risk for readmi...Machine learning and operations research to find diabetics at risk for readmi...
Machine learning and operations research to find diabetics at risk for readmi...John Frias Morales, DrBA, MS
 
Heart disease prediction system
Heart disease prediction systemHeart disease prediction system
Heart disease prediction systemSWAMI06
 
Multivariate sample similarity measure for feature selection with a resemblan...
Multivariate sample similarity measure for feature selection with a resemblan...Multivariate sample similarity measure for feature selection with a resemblan...
Multivariate sample similarity measure for feature selection with a resemblan...IJECEIAES
 
Diabetes Mellitus Prediction System Using Data Mining
Diabetes Mellitus Prediction System Using Data MiningDiabetes Mellitus Prediction System Using Data Mining
Diabetes Mellitus Prediction System Using Data Miningpaperpublications3
 
BLOOD TUMOR PREDICTION USING DATA MINING TECHNIQUES
BLOOD TUMOR PREDICTION USING DATA MINING TECHNIQUESBLOOD TUMOR PREDICTION USING DATA MINING TECHNIQUES
BLOOD TUMOR PREDICTION USING DATA MINING TECHNIQUEShiij
 
BLOOD TUMOR PREDICTION USING DATA MINING TECHNIQUES
BLOOD TUMOR PREDICTION USING DATA MINING TECHNIQUESBLOOD TUMOR PREDICTION USING DATA MINING TECHNIQUES
BLOOD TUMOR PREDICTION USING DATA MINING TECHNIQUEShiij
 
Cross the Finish Line First
Cross the Finish Line FirstCross the Finish Line First
Cross the Finish Line FirstVitreosHealth
 
coad_machine_learning
coad_machine_learningcoad_machine_learning
coad_machine_learningFord Sleeman
 
Clinical_Decision_Support_For_Heart_Disease
Clinical_Decision_Support_For_Heart_DiseaseClinical_Decision_Support_For_Heart_Disease
Clinical_Decision_Support_For_Heart_DiseaseSunil Kakade
 
How to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - StatsworkHow to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - StatsworkStats Statswork
 
IMS Health Enriched Real-World Data Study
IMS Health Enriched Real-World Data StudyIMS Health Enriched Real-World Data Study
IMS Health Enriched Real-World Data StudyIMSHealthRWES
 
Predicting Diabetic Readmission Rates: Moving Beyond HbA1c
Predicting Diabetic Readmission Rates: Moving Beyond HbA1cPredicting Diabetic Readmission Rates: Moving Beyond HbA1c
Predicting Diabetic Readmission Rates: Moving Beyond HbA1cDamian R. Mingle, MBA
 
Machine Learning applied to heart failure readmissions
Machine Learning applied to heart failure readmissionsMachine Learning applied to heart failure readmissions
Machine Learning applied to heart failure readmissionsJohn Frias Morales, DrBA, MS
 

Ähnlich wie poster_INFORMS_healthcare_2015 - condensed (20)

Classifying Readmissions of Diabetic Patient Encounters
Classifying Readmissions of Diabetic Patient EncountersClassifying Readmissions of Diabetic Patient Encounters
Classifying Readmissions of Diabetic Patient Encounters
 
readmission paper applied health econ
readmission paper applied health econreadmission paper applied health econ
readmission paper applied health econ
 
Re-admit Historical using SAS Visual Analytics
Re-admit Historical  using SAS Visual AnalyticsRe-admit Historical  using SAS Visual Analytics
Re-admit Historical using SAS Visual Analytics
 
PREDICTION OF ANEMIA USING MACHINE LEARNING ALGORITHMS
PREDICTION OF ANEMIA USING MACHINE LEARNING ALGORITHMSPREDICTION OF ANEMIA USING MACHINE LEARNING ALGORITHMS
PREDICTION OF ANEMIA USING MACHINE LEARNING ALGORITHMS
 
Prediction of Anemia using Machine Learning Algorithms
Prediction of Anemia using Machine Learning AlgorithmsPrediction of Anemia using Machine Learning Algorithms
Prediction of Anemia using Machine Learning Algorithms
 
20170215 daniel gartner session 3_a1_kss_2016_paper_31
20170215 daniel gartner session 3_a1_kss_2016_paper_3120170215 daniel gartner session 3_a1_kss_2016_paper_31
20170215 daniel gartner session 3_a1_kss_2016_paper_31
 
BFRG AI Investor Aug 2023
BFRG AI Investor Aug 2023BFRG AI Investor Aug 2023
BFRG AI Investor Aug 2023
 
Machine learning and operations research to find diabetics at risk for readmi...
Machine learning and operations research to find diabetics at risk for readmi...Machine learning and operations research to find diabetics at risk for readmi...
Machine learning and operations research to find diabetics at risk for readmi...
 
Heart disease prediction system
Heart disease prediction systemHeart disease prediction system
Heart disease prediction system
 
Multivariate sample similarity measure for feature selection with a resemblan...
Multivariate sample similarity measure for feature selection with a resemblan...Multivariate sample similarity measure for feature selection with a resemblan...
Multivariate sample similarity measure for feature selection with a resemblan...
 
Diabetes Mellitus Prediction System Using Data Mining
Diabetes Mellitus Prediction System Using Data MiningDiabetes Mellitus Prediction System Using Data Mining
Diabetes Mellitus Prediction System Using Data Mining
 
BLOOD TUMOR PREDICTION USING DATA MINING TECHNIQUES
BLOOD TUMOR PREDICTION USING DATA MINING TECHNIQUESBLOOD TUMOR PREDICTION USING DATA MINING TECHNIQUES
BLOOD TUMOR PREDICTION USING DATA MINING TECHNIQUES
 
BLOOD TUMOR PREDICTION USING DATA MINING TECHNIQUES
BLOOD TUMOR PREDICTION USING DATA MINING TECHNIQUESBLOOD TUMOR PREDICTION USING DATA MINING TECHNIQUES
BLOOD TUMOR PREDICTION USING DATA MINING TECHNIQUES
 
Cross the Finish Line First
Cross the Finish Line FirstCross the Finish Line First
Cross the Finish Line First
 
coad_machine_learning
coad_machine_learningcoad_machine_learning
coad_machine_learning
 
Clinical_Decision_Support_For_Heart_Disease
Clinical_Decision_Support_For_Heart_DiseaseClinical_Decision_Support_For_Heart_Disease
Clinical_Decision_Support_For_Heart_Disease
 
How to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - StatsworkHow to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - Statswork
 
IMS Health Enriched Real-World Data Study
IMS Health Enriched Real-World Data StudyIMS Health Enriched Real-World Data Study
IMS Health Enriched Real-World Data Study
 
Predicting Diabetic Readmission Rates: Moving Beyond HbA1c
Predicting Diabetic Readmission Rates: Moving Beyond HbA1cPredicting Diabetic Readmission Rates: Moving Beyond HbA1c
Predicting Diabetic Readmission Rates: Moving Beyond HbA1c
 
Machine Learning applied to heart failure readmissions
Machine Learning applied to heart failure readmissionsMachine Learning applied to heart failure readmissions
Machine Learning applied to heart failure readmissions
 

poster_INFORMS_healthcare_2015 - condensed

  • 1. Patient-Level Data Integration of De-Identified Healthcare Databases to Support Improved Predictive Analytics Yang Yang, Reza Sharifi Sedeh, Min Xue, Nandini Raghavan, Daniel Elgort Contact: yang.yang_1@philips.com; reza.sharifi.sedeh@philips.com; daniel.elgort@philips.com 1, Introduction Various types of de-identified healthcare databases, from clinical and administrative to utilization, have emerged recently, which enable researchers to perform analyses in each individual domain. However, in the absence of pa- tient identifying data features, current methods do not al- low for patient record level integration across these de- identified databases. In this paper, we propose a novel approach to over- come this limitation and integrate multiple de-identified databases on the patient record level so that inter-domain research problems become addressable. In addition, we have developed a scalable healthcare data analytics pipeline, which incorporates multiple machine learning methods, including penalized and splined linear models, logistic regression, random forest, and survival models. Based on the nature of the integrated database and the an- alytics purpose, users are provided with options to use any combination of the available machine learning methods in a timely manner. Adopting this strategy, users could ob- tain more meaningful findings from the integrated dataset compared with using a single database or relying on a sin- gle analytical method. 2, Data Integration Approach Many aggregated healthcare databases are strictly de- identified that all of the hospital and patient identifiers are removed before any secondary use by researchers [Meystre et al., 2010]. This fact makes the integration across databases highly challenging. Recently we devel- oped a hierarchical approach to integrate de-identified databases on the patient level using non-uniquely identi- fying patient features. For example, age, sex, weight, pri- mary diagnosis and length of hospital stay. The general approach follows: • generate UID from features for each patient. • calculate patient rarity score for each patient. • use rarest patients to identify the same hospitals across databases. • match patients belonging to the same hospital across databases and repeat it for all matched hospitals. • categorize matching results into confident, impossi- ble and possible matches. Below is an example for a patient with UID 1.5.1.122.18, and his calculated rarity score. Table 1.: Calculating rarity score for the Native American, 18-year-old, male patient who has a LOS of 122 days and has died in hospital. The rarity score 4.5 ∗ 10−11 can be interpreted as, in every 22 billion patients from the hospital population, there is only one patient with the same UID as him. 3, Data Integration Approach con’t After generating UIDs for patients, we further added the diagnosis codes to reduce duplicated patient matches. The ICD9 [for Medicare et al., 2011] codes was collapsed to Clinical Classifications [Cost et al., 2010] for better accuracy and robustness of the matching. The general rules of matching two pa- tients can be summarized: • the patients have the exact, same patient UID. • the patients share at least 50% of the diagnosis codes of the patient with less number of diag- nosis codes. For example, if six and ten diagno- sis codes have been assigned to Patient A in the "clinical" database and Patient B in the "claims" database respectively, then Patient A and Pa- tient B must share at least three diagnosis codes to convince us there is a match. Finally, we summarized the hierarchical match- ing algorithm into the following flowchart: A B C Set the rarity coefficient threshold, r, to 10-10. Match the patients of the “clinical” hospital X with coefficients less than r to the “claims” patients. Are there five patient matches and 30% matching rate between the “clinical” hospital X and any “claims” Hospital Y? YES Link the de-identified hospital IDs in the two databases. NO Increase r by a factor of 10. Our matching criteria of two patients’records are defined as: • Same patient ID. • Share at least 50% diagnosis. patient Diagnosis patientID A 1, 2, 3, 3, 5 12345678 a 2, 3 , , 12345678 Patient matching using basic features Identified one-to-one matched records No matched record Multiple matched records Age Gender Race Primary Diagnosis LOS Mortality Using secondary feature to narrow the possible pairs Confident Matching DL AS Impossible Matching Possible Matching yes no yes no Within single year and single hospital Figure 1.: A. Integration of eICU and HCUP using pro- vided common features (eICU and HCUP are two differ- ent healthcare databases. See Section 5. Data Application); B. Hospital matching algorithm flowchart; C. Individual patient matching algorithm flowchart. 4, Analytics Pipeline PhilipsHealthcareBDS is an automated pipeline which gives the user opportunities to execute a range of statis- tical/machine learning models on a specific dataset in a neat and fast manner. The whole pipeline is written in R language and it is a Linux command-line based pro- gram. The pipeline contains five modules in a flowchart (Figure 2). The pipeline features flexible parallel/serial scheme, flexible model parameter tuning, robustness to different datasets with mixed types of explanatory and response variables, complete logging and error collecting system, and the ease to add more models in the future. Currently the pipeline contains ten models/algorithms, in- cluding Generalized Linear Model with stepwise variable selection; Lasso, Ridge and Elastic Net algorithm; Group Elastic Net algorithm; SCAD/MCP algorithm; Random Forest; Random Survival Forest; Quantile Regression and Normal-Probit Bivariate Model. Figure 2.: Flowchart of PhilipsHealthcareBDS. Module within square parentheses is optional. 6, Results 4e5 2e5 0 -2e5 Residual ErrorRate 1e5 2e5 3e5 4e5 5e5 6e50 Predicted Value # of Trees Variable Importance A B Figure 3.: (A) Linear regression residual plot of in-hospital expen- diture. (B) Random forest tree error rate (left panel) and variable importance rank (right panel). The two variables with the highest effects (blue box) on in-hospital expenditure were plotted versus the in-hospital expenditure (the two rightmost plots). Summary of conclusions • We found a significant correlation between the actually observed values of mortality or length of stay (from eICU) and the in-hospital expenditure (from HCUP). • We learned that the in-hospital expenditures (HCUP) of the patients who died in hospital (eICU) are higher than those alive. • We found the patients in either extremely bad condition or excellent condition, inferred from "Predicted Hospital/ICU mortality" or "Acute Physiology Score" (eICU), have higher in-hospital expenditures than patients with moderate condition. These two variables were ranked as the top two predictors of ex- penditure by a random forest method (Figure 3B). In addition, there are several other findings: Asian or Pacific Islander patients paid more; patients with more interventional procedures paid more; patients with longer actual hospital/ICU lengths of stay paid more; patients admitted from other health facilities paid less. 5, Data Application We integrated patients from Philips eICU database and Healthcare Cost and Utilization Project (HCUPa ) State In- patient Database (SID) for Massachusetts between 2008 and 2011. From this full dataset, by "DX1" (primary diag- nosis ICD-9 code) values we further extracted those with Heart Disease (i.e., Heart Failure and Cardiovascular My- ocardial Infarction). The variables available are clinical variables, utilization variables, billing variables, demo- graphic variables and hospital characteristics. We selected and applied five analytical methods on the real data including: 1, Linear regression with stepwise variable selection by AIC criteria; 2, Penalized linear model such as elastic net, SCAD and MCP; 3, Group based penalized linear model; 4, Random Forest; 5, Quantile Regression. aDisclaimer: Study design, Data sources, analysis and findings de- scribed in this paper were executed in compliance with the Data Use Agree- ment of HCUP. References Healthcare Cost, Utilization Project, et al. Clinical classifications software (ccs) for icd-9-cm. Rockville, MD: Agency for Healthcare Research and Quality, 2010. Centers for Medicare, Medicaid Services, et al. Icd-9-cm official guidelines for coding and reporting. US GPO, Washington, DC, 2011. Stephane M Meystre, F Jeffrey Friedlin, Brett R South, Shuying Shen, and Matthew H Samore. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC medical research methodology, 10(1):70, 2010.