1. Working With Large-Scale
Clinical Datasets
Craig Smail, MA, MSc (@craigsmail)
KU Medical Center
9th October 2014
Background: http://jsgamingtv.com/wp-content/uploads/2014/07/server-room-hd-free-23325111.jpg
3. Overview
• Target audience: anyone involved (directly or indirectly) in clinical data extraction, validation, and standardization
• Sections:
1. Data extraction: planning
2. Data extraction
3. Data standardization
4. Data transfer
4. Data Extraction: Planning
• Dataset type
– Most common: limited and de-identified
– Difference: a limited dataset can contain some personal information (DOB, DOD, city, state, age)
• Legal agreements
– Data Use Agreement (DUA)
– Business Associate Agreement (BAA)
– Institutional Review Board (IRB)
• Usually only if the IRB considers the activity Human Subjects Research
5. Data Extraction: Planning
• Important to finalize the list of data elements before the pull
– Time-consuming to repull
– Reallocation of resources (e.g. programmer time)
• Summary statistics are helpful in the planning stage
• e.g. death status is requested a lot, but is very rarely available in the EHR
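A minimal sketch of the kind of summary statistic that helps at the planning stage: the share of patients with a populated value for each requested data element. The data frame and its column names are hypothetical, not taken from any real extract.

```r
# Hypothetical extract: one row per patient, NA where the EHR has no value
ehr <- data.frame(
  patient_id = 1:6,
  dob        = as.Date(c("1950-01-01", "1962-05-12", NA, "1980-03-30", "1975-11-02", "1948-07-19")),
  hba1c      = c(7.1, NA, 6.4, 8.0, NA, 7.7),
  death_date = as.Date(c(NA, NA, NA, "2013-06-01", NA, NA))
)

# Share of patients with a populated value, per data element
populated <- colMeans(!is.na(ehr))
print(round(populated, 2))
```

A sparsely populated element (here the made-up death_date column) is a signal to discuss a proxy or an external source before the pull.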
6. Data Extraction: Planning
• Use of a data proxy correlated with the data element of interest
– Sometimes need to develop proxies for data points of interest (e.g. severity of pain; hypoglycemic events)
– Example use case: aspirin as a proxy for the antiphospholipid antibodies lab [1]
• Proxy data elements should be supported by data
[1] Frankovich, J., Longhurst, C., Sutherland, S. Evidence-Based Medicine in the EMR Era. N Engl J Med 2011; 365:1758-1759.
7. Example: Proxy for Death Status
• Data extracted from a large multi-specialty clinic on the east coast
• 300,000 patients in EHR
• ~10,000 with date-of-death (we’ll take this as gold-standard)
• Is days since last encounter a good proxy?
8. Example: Proxy for Death Status
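# glm() ships with base R (stats package), so no extra library is needed
# import data; assumes column 1 = last encounter date (mm/dd/yyyy), column 2 = death status (0/1)
# setwd([dir here])
Encs <- read.csv("lastenc.csv", header = FALSE, stringsAsFactors = FALSE)
names(Encs)[1:2] <- c("lastEnc", "dead")
# find days since last encounter (vectorized; reference date is 2014-09-02)
Encs$daysSince <- as.numeric(as.Date("2014-09-02") - as.Date(Encs$lastEnc, "%m/%d/%Y"))
# binarize (no encounter in last 1000 days = 1, <= 1000 = 0; also tried 180, 365, 750)
Encs$noRecentEnc <- ifelse(Encs$daysSince > 1000, 1, 0)
# fit model (logistic regression, but could use something else)
fit <- glm(dead ~ noRecentEnc, data = Encs, family = "binomial")
# confusion matrix: rows = predicted, columns = observed; fixed levels keep it 2 x 2
predicted <- factor(round(fit$fitted.values), levels = c(0, 1))
observed <- factor(Encs$dead, levels = c(0, 1))
confusionMatrix <- table(predicted, observed)
misclassRate <- (confusionMatrix[1, 2] + confusionMatrix[2, 1]) / sum(confusionMatrix) # 0.34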
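The slide mentions that several cutoffs were tried. A rough sketch of such a sweep follows, assuming the cutoff is applied directly as the predicted death status rather than through a fitted model (a simplification of the approach on the slide). The function name and the toy vectors are hypothetical and are not the clinic data.

```r
# Misclassification rate when "no encounter for more than `cutoff` days"
# is used directly as the predicted death status (no model fitting)
proxyErrorRate <- function(daysSince, dead, cutoff) {
  predicted <- as.integer(daysSince > cutoff)
  mean(predicted != dead)
}

# Hypothetical toy data, not the clinic extract
daysSince <- c(30, 200, 400, 800, 1200, 1500, 90, 2000)
dead      <- c(0,  0,   0,   1,   1,    0,    0,  1)

# Cutoffs mentioned on the slide and in the notes
cutoffs <- c(180, 365, 750, 1000)
rates <- sapply(cutoffs, function(k) proxyErrorRate(daysSince, dead, k))
names(rates) <- cutoffs
print(rates)
```

Comparing the rates side by side makes it easier to see whether any cutoff is good enough to justify the proxy.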
9. Example: Proxy for Death Status
• Is days since last encounter a good proxy?
No (error rate = 34%)
• Consequences:
10. Data Extraction: Planning
• Cohort definition
– Spell out cohort definitions explicitly, including all assumptions
– Real-world example:
• “Two consecutive eGFRs >= 15 and < 60 occurring at least 90 days apart”
• Further restriction specified: “if any value > 60 in between 90 days, then throw out”
• The word “consecutive” means no values in between the 90 days will be considered at all
– If any other eGFR value occurs within the 90 days, then the patient does not meet the first restriction
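One way to make the "consecutive" reading explicit is to write it as code. The sketch below checks only adjacent measurements for a single patient; the function name and the example values are hypothetical, and a real study would need to confirm this interpretation with the requester.

```r
# TRUE if two adjacent (consecutive) eGFR results are both >= 15 and < 60
# and were taken at least 90 days apart
meetsCkdRule <- function(dates, egfr) {
  ord <- order(dates)
  dates <- dates[ord]
  egfr <- egfr[ord]
  n <- length(egfr)
  if (n < 2) return(FALSE)
  inRange <- egfr >= 15 & egfr < 60
  gapDays <- as.numeric(diff(dates))  # days between each result and the next
  any(inRange[-n] & inRange[-1] & gapDays >= 90)
}

# Two in-range values 120 days apart, nothing in between
a <- meetsCkdRule(as.Date(c("2013-01-01", "2013-05-01")), c(45, 50))

# Same two values, but another eGFR falls between them
b <- meetsCkdRule(as.Date(c("2013-01-01", "2013-02-15", "2013-05-01")), c(45, 50, 50))

print(c(a, b))
```

Under this reading the second patient is excluded even though the intermediate value is in range, because the first and last results are no longer consecutive.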
11. Data Extraction: Planning
• Final thought on planning:
“Not everything that counts can be counted and not everything that can be counted counts.”
– Albert Einstein (or William Bruce Cameron, depending on who you believe)
• Some data elements are well populated, but reflect things like coding bias (e.g. “up-coding” to a code with larger reimbursement)
12. Data Extraction
• What are data extractions being used for in the NRN?
– Pharmaceutical companies: data on 143,057 patients from 8 health-care organizations/health-care systems
– Federally funded research (NIH, AHRQ): data on ~100,000 patients
– Health IT vendors: work with Cerner to produce performance reports for use by participating providers
• Clinicians like performance feedback; if your EHR cannot provide it, they will go elsewhere (i.e. switch to another vendor)
13. Data Extraction
• Longitudinal data important
– Look at trends over time in the same patient
– During EHR transitions, some EHR vendors will import all data, but restrict full access to only the last 18/24/26 months; clinicians don’t like this, they want to be able to access all data
14. Data Validation
• Date parameters (e.g. look at the min and max encounter dates in the dataset; with 1000s of patients in the dataset, you would expect the dates to match the requested range)
– Percentage of distinct patients in extraction vs. overall practice count: cohort percentages are quite stable across practices
» e.g. “all patients over age 18 with a diagnosis of type-2 diabetes defined by ICD-9 code xx.xxx”
– Caveat: doesn’t work well with small practices (< 2,000 distinct patients)
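The two checks above can be sketched in a few lines of R. Everything in the example is hypothetical: the extract, the requested date window, and the practice count are made-up values for illustration.

```r
# Hypothetical extract and request parameters
extract <- data.frame(
  patient_id = c(1, 1, 2, 3, 3, 4),
  enc_date   = as.Date(c("2012-01-03", "2013-12-30", "2012-06-15",
                         "2013-03-09", "2012-01-10", "2013-11-21"))
)
requestedStart   <- as.Date("2012-01-01")
requestedEnd     <- as.Date("2013-12-31")
practicePatients <- 50  # overall distinct patients in the practice

# 1. Min and max encounter dates should fall inside the requested window
dateRange <- range(extract$enc_date)
withinWindow <- dateRange[1] >= requestedStart && dateRange[2] <= requestedEnd

# 2. Cohort size as a percentage of the practice, for comparison across practices
cohortPct <- 100 * length(unique(extract$patient_id)) / practicePatients

print(dateRange)
print(withinWindow)
print(cohortPct)
```

A practice whose cohort percentage sits far from the others is worth a second look before the data is used.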
15. Data Standardization
• Open-source models (e.g. Observational Medical Outcomes Partnership, OMOP)
• Script data out of database (e.g. SQL view)
• Map labs/procedures to standardized concept list
– Why? Different string labels referring to creatinine blood test from three data feeds, with frequency of occurrence…
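A minimal sketch of mapping raw lab labels to one standardized concept. The source labels, the concept name, and the lookup table are all hypothetical; they are not the labels from the three data feeds, and a real project would map to the concept identifiers of whichever model it adopts.

```r
# Hypothetical raw labels as they might arrive from different feeds
labs <- data.frame(
  source_label = c("CREATININE", "Creat, serum", "creatinine (blood)", "CREATININE", "HbA1c"),
  stringsAsFactors = FALSE
)

# Hand-built lookup: raw label -> standardized concept (names are made up)
conceptMap <- c(
  "CREATININE"         = "creatinine_blood",
  "Creat, serum"       = "creatinine_blood",
  "creatinine (blood)" = "creatinine_blood"
)

labs$concept <- unname(conceptMap[labs$source_label])

# Frequency of each raw label, and labels that still need a mapping
labelCounts <- table(labs$source_label)
unmapped <- unique(labs$source_label[is.na(labs$concept)])

print(labelCounts)
print(unmapped)
```

Reviewing the unmapped labels by frequency is a practical way to decide which mappings to add first.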
17. Data Transfer
• HIPAA requirements
• Usually FTP to a secure site (e.g. Egnyte)
Ref: http://www.hhs.gov/ocr/privacy/hipaa/enforcement/examples/
18. Concluding Thoughts
• Extracted data is treated as gold-standard, since it is pulled directly from the data source (i.e. EHR). In practice, data often comes from an intermediate product (such as a registry product, like the one DARTNet provides), and we usually don’t have control over the data mapping from EHR to registry
• The EHR of the future (?):
– Genetic data (WGS or WES)
» WGS = ~100 GB
» WES = ~8 GB
– Integration with consumer wearable devices (e.g. FitBit; iPhone ECG)
– Further down the road: human microbiome; home microbiome
19. Always question
the data
Pic ref: http://www.yoyowall.com/wp-content/uploads/2013/07/Gandalf-The-Grey-The-Lord-Of-The-Rings.jpg
20. Questions?
• Slides available from slideshare (URL @craigsmail)
• Email: csmail@aafp.org
Editor's notes
Repulls waste everyone’s time
Used aspirin as a proxy for the antiphospholipid antibodies lab (due to the practice of prescribing aspirin to these patients at the site) in treating a 13-year-old girl with systemic lupus erythematosus (SLE)
Audience participation: ask what other factors might explain a gap in encounters (e.g. moved out-of-town, changed provider)
Only 13 lines of code
Binarized time since last encounter (tried 180, 365, 750)
CKD study: ~100,000 in dataset; if the same ratio holds (~3% of individuals in the EHR are dead), that gives 3,000 names for the National Death Index (NDI)
Cost: $350 + ($0.15 * 3,000 * 10) = $4,850
So you want to make sure the cohort you send to the NDI is right!
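The cost arithmetic in the note can be wrapped in a small helper so it is easy to rerun for other cohort sizes. This is only a sketch: the function and argument names are hypothetical, the fee figures are the ones quoted in the note, and reading the factor of 10 as a number of years searched is an assumption.

```r
# Estimated search cost: base fee plus a per-record, per-year charge.
# Default fees are the figures quoted in the note; nYears is assumed to be
# the number of years searched.
ndiCost <- function(nRecords, nYears, baseFee = 350, perRecordYear = 0.15) {
  baseFee + perRecordYear * nRecords * nYears
}

# 3,000 names over 10 years, as in the note
estimate <- ndiCost(3000, 10)
print(estimate)
```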
What are data extractions being used for in the NRN?
Pharmaceutical companies: type-2 diabetes study looking at drug prescribing habits of primary-care physicians for patients with type-2 diabetes (data on 143,057 patients from 8 health-care organizations/health care systems)
Federally-funded research (NIH, AHRQ ): decision support for chronic kidney disease, working with National Kidney Foundation (data on ~100,000 patients)
Health IT vendors: we work with Cerner to produce performance reports for use by participating providers, used to compare performance on several metrics (e.g. blood pressure targets; accuracy of ICD-9 coding)
Clinicians like performance feedback, if your EHR cannot provide it they will go elsewhere (i.e. switch)