Working With Large-Scale 
Clinical Datasets 
Craig Smail, MA, MSc (@craigsmail) 
KU Medical Center 
9th October 2014 
Background: http://jsgamingtv.com/wp-content/uploads/2014/07/server-room-hd-free-23325111.jpg
Disclosures 
• Industry grant funding: 
– Merck 
– Mallinckrodt 
– Sanofi
Overview 
• Target audience: anyone involved (directly 
or indirectly) in clinical data extraction, 
validation, and standardization 
• Sections: 
1. Data extraction: planning 
2. Data extraction 
3. Data standardization 
4. Data transfer
Data Extraction: Planning 
• Dataset type 
– Most common: limited and de-identified 
– Difference: limited can contain some personal 
information (DOB, DOD, city, state, age) 
• Legal agreements 
– Data Use Agreement (DUA) 
– Business Associate Agreement (BAA) 
– Institutional Review Board (IRB) 
• Usually only if IRB considers activity Human Subjects 
Research
Data Extraction: Planning 
• Important to finalize list of data elements 
before pull 
– Time-consuming to repull 
– Reallocation of resources (e.g. programmer time) 
• Summary statistics are helpful in planning 
stage 
– e.g. death status is requested a lot, but is very rarely 
available in the EHR
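One simple planning-stage summary statistic is the share of patients with a populated value for each requested data element. The sketch below is illustrative only: the data frame `patients`, its columns, and the helper `percent_populated` are made up for this example and are not from the talk.

```r
# Hypothetical extract: one row per patient, NA where the EHR has no value.
patients <- data.frame(
  patient_id    = 1:6,
  dob           = c("1950-01-02", "1962-07-30", NA, "1948-11-11",
                    "1971-03-05", "1939-09-17"),
  hba1c         = c(7.1, NA, 6.4, 8.0, NA, 7.7),
  date_of_death = c(NA, NA, NA, "2013-05-01", NA, NA),
  stringsAsFactors = FALSE
)

# Percentage of patients with a populated (non-NA) value, per data element.
percent_populated <- function(df) {
  round(100 * colMeans(!is.na(df)), 1)
}

coverage <- percent_populated(patients[, c("dob", "hba1c", "date_of_death")])
print(coverage)
```

A table like this, run before the list of data elements is finalized, shows which elements (here, date of death) are too sparsely populated to be worth requesting.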
Data Extraction: Planning 
• Use of data proxy correlated with data 
element of interest 
– sometimes need to develop proxies for data 
points of interest (e.g. severity of pain; 
hypoglycemic events) 
– Example use case: aspirin as a proxy for 
antiphospholipid antibodies lab [1] 
• Proxy data elements should be supported by 
data 
[1] Frankovich, J., Longhurst, C., Sutherland, S. Evidence-Based Medicine in the EMR Era, N Engl J Med 
2011; 365:1758-1759
Example: Proxy for Death Status 
• Data extracted from large multi-specialty 
clinic on the east coast 
• 300,000 patients in EHR 
• ~10,000 with date-of-death (we’ll take this as 
gold-standard) 
• Is days since last encounter a good proxy?
Example: Proxy for Death Status 
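# glm() ships with base R (stats package), so no extra library is needed 
# import data 
# setwd("[dir here]")  # set to the directory that holds lastenc.csv 
# assumed layout: column 1 = last encounter date (mm/dd/yyyy), column 2 = death status (0/1) 
Encs = read.csv("lastenc.csv", header = FALSE) 
# find days since last encounter (vectorized, no loop needed) 
Encs[, 3] = as.numeric(as.Date("2014-09-02") - as.Date(Encs[, 1], "%m/%d/%Y")) 
# binarize (no encounter in last 1000 days = 1, <= 1000 = 0 – also tried 180, 265, 750) 
Encs[, 4] = ifelse(Encs[, 3] > 1000, 1, 0) 
# clean up table: keep death status and the binarized proxy 
Encs = Encs[ , c(2, 4)] 
names(Encs) = c("dead", "noRecentEnc") 
# fit model (logistic regression – but could use something else) 
fit = glm(dead ~ noRecentEnc, data = Encs, family = "binomial") 
# rows = predicted class (both levels kept even if one is never predicted), columns = observed status 
confusionMatrix = table(factor(round(fit$fitted.values), levels = 0:1), Encs$dead) 
misclassRate = (confusionMatrix[1,2] + confusionMatrix[2,1]) / sum(confusionMatrix) # 0.34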
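A single misclassification rate can hide how a proxy fails, especially when the outcome is rare (here roughly 10,000 of 300,000 patients have a date of death). The sketch below reads sensitivity, specificity and positive predictive value from a 2x2 table, and compares the error rate with the trivial rule "predict nobody has died". The counts are made up, chosen only so that the error rate comes out at 34%. They are not the clinic's data, and the proxy is used directly as the classifier rather than through a fitted model.

```r
# Made-up counts for illustration only.
# Rows = proxy (1 = no encounter in > 1000 days), columns = gold-standard death status.
tab <- matrix(c(650, 20,    # proxy = 0: alive, dead
                320, 10),   # proxy = 1: alive, dead
              nrow = 2, byrow = TRUE,
              dimnames = list(proxy = c("0", "1"), dead = c("0", "1")))

proxy_metrics <- function(tab) {
  tn <- tab["0", "0"]; fn <- tab["0", "1"]
  fp <- tab["1", "0"]; tp <- tab["1", "1"]
  n  <- sum(tab)
  c(sensitivity    = tp / (tp + fn),
    specificity    = tn / (tn + fp),
    ppv            = tp / (tp + fp),
    error_rate     = (fp + fn) / n,
    # error rate of always predicting "alive" equals the prevalence of death
    baseline_error = (tp + fn) / n)
}

metrics <- proxy_metrics(tab)
print(round(metrics, 3))
```

With a rare outcome, a proxy should at least beat `baseline_error`. In this made-up table it does not, which is one concrete way to say the proxy is poor.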
Example: Proxy for Death Status 
• Is days since last encounter a good proxy? 
No (error rate = 34%) 
• Consequences:
Data Extraction: Planning 
• Cohort definition 
– Spell out cohort definitions explicitly, including all 
assumptions 
– Real-world example: 
• ‘Two consecutive eGFRs >= 15 and < 60 occurring at least 90 
days apart’ 
• Further restriction specified ‘if any value > 60 in between 90 
days, then throw out’ 
• But the word ‘consecutive’ means no values between the two eGFRs will 
be considered at all 
– If any other eGFR value occurs between them, then the 
patient does not meet the first restriction
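The ambiguity above is easy to see once the two readings are written as code. This is a minimal sketch with hypothetical lab results. The function names and the `egfr` data frame are invented for illustration and are not from the original extraction.

```r
# Hypothetical eGFR results for one patient, one row per lab result.
egfr <- data.frame(
  date  = as.Date(c("2013-01-10", "2013-02-20", "2013-05-01", "2013-09-15")),
  value = c(45, 52, 48, 70)
)

# Strict reading: two back-to-back results (nothing measured in between),
# both >= 15 and < 60, at least `min_gap` days apart.
meets_consecutive_rule <- function(labs, min_gap = 90) {
  labs <- labs[order(labs$date), ]
  if (nrow(labs) < 2) return(FALSE)
  in_range <- labs$value >= 15 & labs$value < 60
  gap_days <- as.numeric(diff(labs$date))
  first    <- in_range[-length(in_range)]
  second   <- in_range[-1]
  any(first & second & gap_days >= min_gap)
}

# Looser reading: any two in-range results at least `min_gap` days apart,
# regardless of what was measured in between.
meets_any_pair_rule <- function(labs, min_gap = 90) {
  in_range_dates <- labs$date[labs$value >= 15 & labs$value < 60]
  length(in_range_dates) >= 2 &&
    as.numeric(max(in_range_dates) - min(in_range_dates)) >= min_gap
}

strict <- meets_consecutive_rule(egfr)
loose  <- meets_any_pair_rule(egfr)
print(c(consecutive = strict, any_pair = loose))
```

The same patient is excluded under the strict reading and included under the looser one, which is why the definition needs to be spelled out before the pull.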
Data Extraction: Planning 
• Final thought on planning: 
“Not everything that counts can be counted and 
not everything that can be counted counts.” 
—Albert Einstein (or William Bruce Cameron, 
depends who you believe) 
• some data elements are well populated, but 
reflect things like coding bias (e.g. ‘up-coding’ 
to a code with larger reimbursement)
Data Extraction 
• What are data extractions being used for in the 
NRN? 
– Pharmaceutical companies: data on 143,057 
patients from 8 health-care organizations/health 
care systems 
– Federally-funded research (NIH, AHRQ): data on 
~100,000 patients 
– Health IT vendors: work with Cerner to produce 
performance reports for use by participating 
providers 
• Clinicians like performance feedback; if your EHR cannot 
provide it, they will go elsewhere (i.e. switch to another 
vendor)
Data Extraction 
• Longitudinal data important 
– look at trends over time in same 
patient 
– during EHR transitions, some EHR vendors will 
import all data, but restrict full access to only the 
last 18/24/26 months; clinicians don’t like this, 
they want to be able to access all data
Data Validation 
• Date parameters (e.g. look at min and max dates of encounter 
in dataset; with 1000s of patients in the dataset, would expect the 
dates to span the requested range) 
– Percentage of distinct patients in extraction vs. overall practice count: 
cohort percentages are quite stable across practices 
» e.g. ‘all patients over age 18 with a diagnosis of type-2 diabetes 
defined by ICD-9 code xx.xxx’ 
– Caveat: doesn’t work well with small practices (< 2,000 distinct 
patients)
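Both checks above can be scripted. The sketch below uses invented data (`encounters`, `practice_totals`), an invented requested date window, and an arbitrary 3-percentage-point threshold, so it is only an outline of the idea.

```r
# Hypothetical extraction: one row per encounter.
encounters <- data.frame(
  practice_id    = c("A", "A", "A", "B", "B", "C"),
  patient_id     = c(1, 1, 2, 10, 11, 20),
  encounter_date = as.Date(c("2010-01-05", "2013-12-20", "2011-06-30",
                             "2010-03-14", "2013-11-02", "2012-08-08"))
)
# Hypothetical overall distinct-patient counts reported by each practice.
practice_totals <- c(A = 40, B = 25, C = 400)

# Check 1: do the min and max encounter dates fall inside the requested window?
requested <- as.Date(c("2010-01-01", "2013-12-31"))
observed  <- range(encounters$encounter_date)
dates_ok  <- observed[1] >= requested[1] && observed[2] <= requested[2]

# Check 2: distinct patients in the extraction as a percentage of the practice total.
cohort_n   <- tapply(encounters$patient_id, encounters$practice_id,
                     function(ids) length(unique(ids)))
cohort_pct <- 100 * cohort_n / practice_totals[names(cohort_n)]

# Flag practices whose percentage is far from the median (threshold is arbitrary).
flagged <- names(cohort_pct)[abs(cohort_pct - median(cohort_pct)) > 3]

print(observed)
print(round(cohort_pct, 2))
print(flagged)
```

A flagged practice is a prompt to look at the extraction logic for that site. It is not proof of an error, particularly for small practices.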
Data Standardization 
• Open-source data models (e.g. Observational Medical Outcomes 
Partnership, OMOP) 
• Script data out of database (e.g. SQL view) 
• Map labs/procedures to standardized concept list 
– Why? different string labels referring to creatinine blood test from 
three data feeds, with frequency of occurrence…
Note: source values with counts < 100 were censored
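Mapping free-text source labels to one standardized concept can be done with a lookup table. In the sketch below the source labels, the `concept_map` lookup and the concept name `creatinine_blood` are placeholders invented for illustration. They are not the actual feed values or a real vocabulary code.

```r
# Hypothetical source labels for the same creatinine blood test from different feeds.
labs <- data.frame(
  source_label = c("Creatinine", "CREATININE, SERUM", "creat", "Creatinine",
                   "Creat (blood)", "CREATININE, SERUM"),
  stringsAsFactors = FALSE
)

# Hand-maintained lookup from normalized source label to a standard concept name.
concept_map <- c(
  "creatinine"        = "creatinine_blood",
  "creatinine, serum" = "creatinine_blood",
  "creat"             = "creatinine_blood"
)

# Normalize case and surrounding whitespace before looking up.
normalize <- function(x) tolower(trimws(x))
labs$concept <- unname(concept_map[normalize(labs$source_label)])

# Frequency of each source label, and labels that still need a mapping.
label_counts <- sort(table(labs$source_label), decreasing = TRUE)
unmapped     <- unique(labs$source_label[is.na(labs$concept)])

print(label_counts)
print(unmapped)
```

Reviewing `label_counts` from the most frequent label down is a practical way to prioritize mapping work, and `unmapped` shows what is still left over.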
Data Transfer 
• HIPAA requirements 
• Usually FTP to secure site (e.g. Egnyte) 
Ref: http://www.hhs.gov/ocr/privacy/hipaa/enforcement/examples/
Concluding Thoughts 
• Extracted data is treated as gold-standard, since it is pulled 
directly from the data source (i.e. EHR). In practice, data often comes 
from an intermediate product (such as a registry product, like the 
one DARTNet provides), and we usually don’t have control 
over the data mapping from EHR to registry 
• The EHR of the future (?): 
– Genetic data (WGS or WES) 
» WGS = ~100 GB 
» WES = ~8 GB 
– Integration with consumer wearable devices (e.g. FitBit; iPhone ECG) 
– Further down the road: human microbiome; home microbiome
Always question 
the data 
Pic ref: http://www.yoyowall.com/wp-content/uploads/2013/07/Gandalf-The-Grey-The-Lord-Of-The-Rings.jpg
Questions? 
• Slides available from slideshare 
(URL @craigsmail) 
• Email: csmail@aafp.org

Working With Large-Scale Clinical Datasets

  • 1. Working With Large-Scale Clinical Datasets Craig Smail, MA, MSc ( @craigsmail) KU Medical Center 9th October 2014 Background: http://jsgamingtv.com/wp-content/uploads/2014/07/server-room-hd-free-23325111.jpg
  • 2. Disclosures • Industry grant funding: – Merck – Mallinckrodt – Sanofi
  • 3. Overview • Targeted audience: anyone involved (directly or indirectly) in clinical data extraction, validation, and standardization • Sections: 1. Data extraction: planning 2. Data extraction 3. Data standardization 4. Data transfer
  • 4. Data Extraction: Planning • Dataset type – Most common: limited and de-identified – Difference: limited can contain some personal information (DOB, DOD, city, state, age) • Legal agreements – Data Use Agreement (DUA) – Business Associates Agreement (BAA) – Institutional Review Board (IRB) • Usually only if IRB considers activity Human Subjects Research
  • 5. Data Extraction: Planning • Important to finalize list of data elements before pull – Time-consuming to repull – Reallocation of resources (e.g. programmer time) • Summary statistics are helpful in planning stage • e.g. death status requested a lot, but is very rarely available in the EHR
  • 6. Data Extraction: Planning • Use of data proxy correlated with data element of interest – sometimes need to develop proxies for data points of interest (e.g. severity of pain; hypoglycemic events) – Example use case: aspirin as a proxy for antiphospholipid antibodies lab1 • Proxy data elements should be supported by data 1 Frankovich, J., Longhurst, C., Sutherland, S. Evidence-Based Medicine in the EMR Era, N Engl J Med 2011; 365:1758-1759N
  • 7. Example: Proxy for Death Status • Data extracted from large multi-specialty clinic on the east coast • 300,000 patients in EHR • ~10,000 with date-of-death (we’ll take this as gold-standard) • Is days since last encounter a good proxy?
  • 8. Example: Proxy for Death Status library(glm2) # import data setwd([dir here]) Encs = read.csv("lastenc.csv", header= FALSE) # find days since last encounter for (i in 1:nrow(Encs)) { Encs[i,3] = as.Date("2014-09-02") - as.Date(Encs[i, 1], "%m/%d/%Y") } # binarize (no encounter in last 1000 days = 1, <= 1000 = 0 – also tried 180, 265, 750) for (i in 1:nrow(Encs)) { Encs[i, 4] = ifelse(Encs[i, 3] > 1000, 1, 0) } # clean up table Encs = Encs[ , c(2, 4)] # fit model (logistic regression – but could use something else) fit = glm(Encs[, 1] ~ Encs[, 2], data = Encs, family = "binomial") confusionMatrix = table(round(fit$fitted.values), Encs[,1]) misclassRate = (confusionMatrix[1,2] + confusionMatrix[2,1]) / sum(confusionMatrix) # 0.34
  • 9. Example: Proxy for Death Status • Is days since last encounter a good proxy? No (error rate = 34%) • Consequences:
  • 10. Data Extraction: Planning • Cohort definition – Spell out cohort definitions explicitly, including all assumptions – Real-world example: • ‘Two consecutive eGFRs >= 15 and < 60 occurring at least 90 days apart’ • Further restriction specified ‘if any value > 60 in between 90 days, then throw out’ • Word ‘consecutive’ means no values in between 90 days will be considered at all – If any another eGFR value occurs between 90 days, then the patient does not meet the first restriction
  • 11. Data Extraction: Planning • Final thought on planning: “Not everything that counts can be counted and not everything that can be counted counts.” —Albert Einstein (or William Bruce Cameron, depends who you believe) • some data elements are well populated, but reflect things like coding bias (e.g. ‘up-coding’ to a code with larger reimbursement)
  • 12. Data Extraction • What are data extractions being used for in the NRN? – Pharmaceutical companies: data on 143,057 patients from 8 health-care organizations/health-care systems – Federally-funded research (NIH, AHRQ): data on ~100,000 patients – Health IT vendors: work with Cerner to produce performance reports for use by participating providers • Clinicians like performance feedback; if your EHR cannot provide it, they will go elsewhere (i.e. switch to another vendor)
  • 13. Data Extraction • Longitudinal data is important – look at trends over time in the same patient – during EHR transitions, some EHR vendors will import all data, but restrict full access to only the last 18/24/26 months – clinicians don’t like this; they want to be able to access all data
  • 14. Data Validation • Date parameters (e.g. check the min and max encounter dates in the dataset; with 1000s of patients in the dataset, you would expect the dates to span the requested range) – Percentage of distinct patients in extraction vs. overall practice count: cohort percentages are quite stable across practices » e.g. ‘all patients over age 18 with a diagnosis of type-2 diabetes defined by ICD-9 code xx.xxx’ – Caveat: doesn’t work well with small practices (< 2,000 distinct patients)
  • 15. Data Standardization • Open-source models (Observational Medical Outcomes Partnership) • Script data out of database (e.g. SQL view) • Map labs/procedures to standardized concept list – Why? different string labels referring to creatinine blood test from three data feeds, with frequency of occurrence…
  • 16. Note: source values with counts < 100 were censored
  • 17. Data Transfer • HIPAA requirements • Usually FTP to a secure site (e.g. Egnyte) Ref: http://www.hhs.gov/ocr/privacy/hipaa/enforcement/examples/
  • 18. Concluding Thoughts • Extracted data is treated as gold-standard, since it is pulled directly from the data source (i.e. the EHR). In practice, data often comes from an intermediate product (such as a registry product, like the one DARTNet provides), and we usually don’t have control over the data mapping from EHR to registry • The EHR of the future (?): – Genetic data (WGS or WES) » WGS = ~100 GB » WES = ~8 GB – Integration with consumer wearable devices (e.g. FitBit; iPhone ECG) – Further down the road: human microbiome; home microbiome
  • 19. Always question the data Pic ref: http://www.yoyowall.com/wp-content/uploads/2013/07/Gandalf-The-Grey-The-Lord-Of-The-Rings.jpg
  • 20. Questions? • Slides available from slideshare (URL @craigsmail) • Email: csmail@aafp.org
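
The cohort definition on slide 10 can be written down as code, which makes the meaning of 'consecutive' explicit. The following is only a minimal sketch in base R on made-up data: the helper name `meets_ckd_definition`, the column names, and the four example patients are all hypothetical and are not taken from the original study. It reads 'consecutive' as 'adjacent results in date order', as the slide does.

```r
# Hypothetical helper (not from the original slides): TRUE if a patient has
# two consecutive eGFR results, both >= 15 and < 60, taken at least
# `min_gap` days apart. "Consecutive" = adjacent results in date order.
meets_ckd_definition <- function(dates, egfr, min_gap = 90) {
  ord <- order(dates)
  dates <- dates[ord]
  egfr <- egfr[ord]
  n <- length(egfr)
  if (n < 2) return(FALSE)
  in_range <- egfr >= 15 & egfr < 60
  # days between each result and the next one
  gap_days <- as.numeric(diff(dates), units = "days")
  any(in_range[-n] & in_range[-1] & gap_days >= min_gap)
}

# Made-up example data (assumption, for illustration only)
labs <- data.frame(
  patient = c("A", "A",
              "B", "B", "B",
              "C", "C",
              "D", "D", "D"),
  date = as.Date(c("2014-01-01", "2014-04-15",
                   "2014-01-01", "2014-02-01", "2014-04-15",
                   "2014-01-01", "2014-02-15",
                   "2014-01-01", "2014-01-31", "2014-04-15")),
  egfr = c(45, 50,
           45, 72, 50,
           45, 50,
           45, 48, 50)
)

# Apply the definition patient by patient
in_cohort <- sapply(split(labs, labs$patient),
                    function(p) meets_ckd_definition(p$date, p$egfr))
print(in_cohort)
```

Patient D is the interesting case: every value is in range and the first and last results are more than 90 days apart, but an extra result falls in between, so no pair of consecutive results is 90 days apart and the patient is excluded. That is the behaviour the slide warns about, and a reason to agree on the wording before the pull.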

Editor's notes

  1. Repulls waste everyone’s time
  2. Used aspirin as a proxy for the antiphospholipid antibodies lab (because the site routinely prescribed aspirin to these patients) when treating a 13-year-old girl with systemic lupus erythematosus (SLE)
  3. Audience participation: ask what other factors might explain a gap in encounters (e.g. moved out-of-town, changed provider)
  4. Only 13 lines of code. Binarized time since last encounter (tried 180, 365, 750)
  5. CKD study: ~100,000 in dataset; if the same ratio holds (3% of individuals in EHR are dead), that gives 3,000 names for the NDI. Cost: $350 + ($0.15 * 3,000 * 10) = $4,850. So you want to make sure the cohort you send to the NDI is right!
  6. What are data extractions being used for in the NRN? Pharmaceutical companies: type-2 diabetes study looking at drug prescribing habits of primary-care physicians for patients with type-2 diabetes (data on 143,057 patients from 8 health-care organizations/health-care systems). Federally-funded research (NIH, AHRQ): decision support for chronic kidney disease, working with the National Kidney Foundation (data on ~100,000 patients). Health IT vendors: we work with Cerner to produce performance reports for use by participating providers, used to compare performance on several metrics (e.g. blood pressure targets; accuracy of ICD-9 coding). Clinicians like performance feedback; if your EHR cannot provide it, they will go elsewhere (i.e. switch)
  7. Overall take-away
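
The cost arithmetic in note 5 can be reproduced in a few lines of R, which also makes it easy to see how the bill scales with the size of the cohort submitted. This is a rough sketch only: the variable names and the helper `ndi_cost` are hypothetical, and reading the factor of 10 as the number of years searched is an assumption, since the note does not say what it represents.

```r
# Figures taken from note 5
ehr_patients    <- 100000  # patients in the CKD dataset
death_ratio     <- 0.03    # share of EHR patients assumed to be dead
service_fee     <- 350     # flat fee in dollars
per_record_year <- 0.15    # dollars per name per unit of the multiplier

# ASSUMPTION: the note's factor of 10 is treated here as years searched
years_searched <- 10

# Hypothetical helper: total cost for a given number of submitted names
ndi_cost <- function(n_names, years = years_searched) {
  service_fee + per_record_year * n_names * years
}

names_sent <- ehr_patients * death_ratio
cat(sprintf("Names sent: %s\n", format(names_sent, big.mark = ",")))
cat(sprintf("Estimated cost: $%s\n",
            format(ndi_cost(names_sent), big.mark = ",")))

# Sending a cohort twice as large roughly doubles the variable part of the bill
cat(sprintf("Cost if cohort doubles: $%s\n",
            format(ndi_cost(2 * names_sent), big.mark = ",")))
```

With the note's inputs, the variable part is $0.15 * 3,000 * 10 = $4,500, and adding the $350 flat fee gives $4,850 in total.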