Big Data at Geisinger Health System: Big Wins in a Short Time

Geisinger Health System:
Mark Mossel, Director of Data Team Operations
Dhruv Mathrawala, Senior Data Architect

Integrated health services organization
Innovative care delivery models
Serves >3 million residents in 45 counties
>30,000 employees
>1,500 employed physicians
12 hospital campuses
551,000 member health plan

A good first start.
 Data assembled in a central location
 Allowed for self-service
 Could link disparate data
Data Warehouse sources: Health Record, Surveys, Cardiology, Oncology, Financials, Codesets, External Data, Claims

“There are too many undocumented data sources.”
“There is no documented understanding of business requirements for CDIS business analytics.”
“We don’t have the transformations that the business users really need.”
“Cannot provide data that is fit for purpose.”
“Data dictionary does not exist today.”
“Can’t ‘match’ from encounters to bills to claim.”
“Much of my group’s time is spent entering data manually.”
“The platform/architecture in place for CDIS analytics is not correct for the types of work being performed.”
“Clinical data quality problems related to patient safety exist.”
“Hierarchies exist at many levels.”
“The level of detail that I need is not there in the data.”
“There are too many pockets of data.”
“The CDIS ‘lift and shift’ model perpetuates the problem with too many views/analytics.”

• If data isn’t accurate, it is worse than nothing.
• Incomplete data isn’t useful.
• Data that isn’t timely is less than desirable.
• When multiple versions of data exist, relying on the wrong value can lead to bad decisions.
• There must be ONE source of truth for data.
• Data without documentation is of questionable value.
Often, the first exposure of new data highlights data quality issues.

A unified data architecture (UDA) is a more comprehensive view of the overall enterprise architecture: a collection of services, platforms, applications, and tools that help customers define and deploy an architecture that makes the best use of available technologies to unleash the optimal value of data. (TDWI, Jun 6, 2013)
The UDA at Geisinger Health System is the integration of key analytic platforms (e.g., Hadoop, EDW, EHR) with a common semantic layer, all operating under the umbrella of the same Data Governance structure.

• Less expensive due to commodity hardware
  • It could be as little as 10% of the cost of our traditional EDW.
• Faster ingestion of data
  • In traditional data warehousing, early binding means that mapping, modeling, etc. are typically done upfront. Hadoop's late binding allows the data to simply be loaded without detailed analysis and preparation.
• Multiple views of the data
  • Our multi-zoned Hadoop system allows for many views of the data, including temporal, modeled, etc.
• Unstructured and semi-structured data
  • Hadoop is not confined to structured data in discrete fields, as is the case with traditional analytic platforms.
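The late-binding point above can be illustrated with a minimal sketch in plain Python (not Hadoop itself). The raw extract, its column names, and the `read_with_schema` helper are hypothetical and chosen only for illustration: the data is stored exactly as received, and each consumer applies its own column selection and types at read time.

```python
import csv
import io

# Hypothetical raw extract, landed exactly as the source system produced it.
RAW_EXTRACT = (
    "pat_id|visit_date|weight_kg\n"
    "1001|2017-03-02|81.5\n"
    "1002|2017-03-04|\n"
)

def read_with_schema(raw_text, schema):
    """Bind column names and types only when the data is read (late binding)."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter="|")
    rows = []
    for record in reader:
        typed = {}
        for column, cast in schema.items():
            value = record.get(column)
            # Empty or missing fields become None instead of failing the load.
            typed[column] = None if value in (None, "") else cast(value)
        rows.append(typed)
    return rows

# Two consumers bind different schemas to the same stored text.
visits = read_with_schema(RAW_EXTRACT, {"pat_id": int, "visit_date": str})
weights = read_with_schema(RAW_EXTRACT, {"pat_id": int, "weight_kg": float})
```

With early binding, the missing weight in the second row would typically have to be resolved before the load; here it only matters to the consumer that asks for `weight_kg`.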

THE V’S OF BIG DATA
Controlling Data Volume, Velocity, and Variety

Volume: Scale of data
• 600 TB
• 184 M clinical notes
• 9,000 Epic Clarity tables
• >136,000 patient-participants for exome sequencing

Velocity: Speed of ingestion
• Late binding
• DAYS VERSUS MONTHS
• Real-time capabilities
• <2 seconds to search all clinical notes

Variety: Different forms and views
• Non-traditional sources
• Home devices
• KeyHIE
• Social media
• Patient apps
• Device integration
• Genomics
• struct
• Multi-zoned
• Lawson

Veracity: Uncertainty of data
• Encryption at rest
• PHI masked
• Appropriate authentication, authorization, and access
• Single source of TRUTH

Value: Cost and resources
• $20,000 vs $500K
• 10TB
• Open source
• Commodity hardware
• Can use NLP

• ROI: Use the open-source, commodity-hardware argument
• Change: The SQL team is unfamiliar with the Big Data ecosystem
• Data Load: Load EVERYTHING into Hadoop by building prototypes, not use cases
• Self-service: Push for self-service as much as possible
• Adoption: Develop valuable early wins; invest in visualization (e.g., Tableau)
• Data Zones: Create separate data zones; split PHI from non-PHI data
• Surge capacity: Burst to cloud-based options when surge capacity is needed

PRODUCTION FOOTPRINT
CDIS
Teradata production server
– Version 14.10
– ~13TB uncompressed
– ~30TB compressed
Hadoop
Production cluster
– Hortonworks Data Platform v2.6
– 30 nodes
– 600TB total
– 200TB usable (3 copies)
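The relationship between the total and usable figures above follows from block replication: with three copies of every block, usable space is roughly the raw total divided by three. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the calculation ignores operating-system, temporary-space, and compression effects.

```python
def usable_capacity_tb(raw_tb, replication_factor):
    """Approximate usable capacity when every block is stored `replication_factor` times."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return raw_tb / replication_factor

# Figures from the slide: 600TB total, 3 copies.
usable = usable_capacity_tb(600, 3)
```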

MAJOR DATA SOURCES
Traditional EDW
• Health Record (clinical) data
• Financial
• Claims
• Pulmonary
• Pathology
• Oncology
Hadoop
• All EDW sources, plus:
• Lawson
– Fin, supply chain, A/P
• RIS (Radiology)
• Microbiology
• KeyHIE (Health Info Exchange)
• Lab System Data
• Phone Systems
• Lumedx (Cardiology)

LLAP STATISTICS
Configuration
• Running on 10 nodes
• Using 40% of the cluster
• 100GB cache available
Teradata vs LLAP
• Queries under 1 minute: 80% of queries performed better than on Teradata
• Queries over 1 minute: 95% of queries performed better than on Teradata

Epic data flow
• Epic Cache: primary clinical dataset containing patient health records
• Epic Clarity: clinical reporting DB
• EDW: traditional enterprise data warehouse
• Hadoop: new Big Data platform, loaded from the .ext files (ETL files feeding the clinical reporting database)
• Results in data available hours before the traditional EDW

• More tables loaded nightly
  • ~1,100 in Teradata
  • ~7,200 in Hadoop
• Incremental EXTs (~3,500 EXT files/night)
• Automated Epic loading process using MapReduce and Java

Data zones: Landing Zone → Raw Zone → Refined Zone → Current Zone → Integrated Zone
• Source system pushes to landing zone
• Stored separately by source system
• Securely transferred
• Auditing, traceability, compliance and lineage
• New source data is appended, not deleted
• Partitioned by load date
• Compressed
• Data still temporal
• Data types match source
• Partitioned by load date
• Organized by business attributes and load date
• Current snapshot (temporal history is merged to give the latest view)
• Purpose-built datasets for quicker analytics
• Patient/member uniquely identified across systems
• Encryption at rest for Hadoop data
• Authentication/Authorization
  • LDAPS and AD integration using Ranger/Knox
• Connections
  • SSL endpoint encryption active for all network connections
  • ODBC – SSL secured
  • JDBC – SSL secured
• Data
  • Appropriate access and roles as required. These roles will continue to be defined by the Data Manager or a designee.
  • All PHI data will be masked in the Development environment
• Kerberos Authentication: To thwart impersonation threats
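One way to picture the masking of PHI in a development environment is a keyed, deterministic token per value, so that joins still work while the original values are hidden. The sketch below is an assumption for illustration, not the approach described in the slides: the column names, the key, and the `mask_record` helper are hypothetical, and it is not a complete de-identification method.

```python
import hashlib
import hmac

# Hypothetical list of PHI columns and a placeholder key; both are assumptions.
PHI_COLUMNS = {"patient_name", "ssn", "mrn"}
SECRET_KEY = b"replace-with-a-managed-secret"

def mask_value(value):
    """Replace a value with a keyed, deterministic token."""
    digest = hmac.new(SECRET_KEY, str(value).encode("utf-8"), hashlib.sha256)
    return "MASKED_" + digest.hexdigest()[:12]

def mask_record(record):
    """Mask PHI columns and pass every other column through unchanged."""
    return {
        column: mask_value(value) if column in PHI_COLUMNS and value is not None else value
        for column, value in record.items()
    }

masked = mask_record({"mrn": "12345", "patient_name": "Jane Doe", "lab_value": 4.2})
```

Because the same input always yields the same token, two masked tables can still be joined on `mrn` in development without exposing the real identifier.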

• Bundled Payments Care Initiative
• Data Model
• De-identification of PHI/BSI
• Natural Language Processing
• Sepsis
• O.R. Workflows
• Bactec
• Social Security Death File
• Supply Chain
• Registries
• MPOG, AAA, Ortho Infection, Ortho Trauma

• Lung Nodules
• Abdominal Aortic Aneurysms
• RetrospectOR
• Check Please

• Problem
• Patients with lung nodules found on imaging are lost to follow-up
• Solution
• Ingestion of data from radiology imaging notes
• NLP
• Value
• Identify lung nodules

LUNG NODULES – TEXT ANALYTICS WORKFLOW
• Radiology notes (~10 million notes)
• NLP and Dictionary annotator: annotates with UMLS concept codes
• Lung nodule Filter annotator: identifies lung nodule notes
• Lung nodule in note?
  • NO: ~9.7 million notes
  • YES: ~300 thousand notes
• Negation Annotator
• Measurement/Lung RADS Calculator
. . .
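The filter and negation steps of the workflow can be approximated with a small rule-based sketch. This is a simplified stand-in using regular expressions, not the UMLS-based annotators named above; the patterns, the negation cue list, and the function names are all assumptions made for illustration.

```python
import re

# Simplified, hypothetical patterns; a real pipeline would use concept codes.
NODULE_PATTERN = re.compile(r"\b(?:lung|pulmonary)\s+nodules?\b", re.IGNORECASE)
NEGATION_CUE = re.compile(r"\b(?:no|without|negative for)\b", re.IGNORECASE)

def mentions_nodule(note):
    """Filter step: does the note mention a lung nodule at all?"""
    return NODULE_PATTERN.search(note) is not None

def has_affirmed_nodule(note):
    """Negation step: is at least one mention not preceded by a negation cue?"""
    for sentence in re.split(r"[.;\n]+", note):
        match = NODULE_PATTERN.search(sentence)
        if match is None:
            continue
        if NEGATION_CUE.search(sentence[:match.start()]) is None:
            return True
    return False

notes = [
    "There is a 6 mm pulmonary nodule in the right upper lobe.",
    "No pulmonary nodules are seen.",
    "Heart size is normal.",
]
flags = [(mentions_nodule(n), has_affirmed_nodule(n)) for n in notes]
```

The second note shows why the negation step matters: it passes the filter because it mentions a nodule, but the mention is negated and should not trigger follow-up.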

                 Actual Yes        Actual No
Predicted Yes    True Positive     False Positive
Predicted No     False Negative    True Negative
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
• Accuracy = (TP + TN) / (TP + TN + FP + FN)
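The four formulas above translate directly into code. The sketch below uses made-up confusion-matrix counts purely to exercise the arithmetic; they are not results from the lung nodule project, and the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts.

    Assumes the denominators are non-zero (at least one predicted positive
    and at least one actual positive).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
```

Note the parentheses around `tp + tn` in the accuracy line: without them, only `tn` would be divided by the total.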

LUNG NODULES
• Precision: 0.87
• Recall: 0.95
• Accuracy: 0.91

• Problem
• Patients with AAA are lost to follow up
• Solution
• Ingestion of data from radiology imaging notes
• Use NLP and care-gap closure technologies
• Value
• Ensure proper follow up

502 patients identified
23 required urgent surgery

• Use case
• Provide capabilities to perform retrospective analysis of OR data
• Solution
• Ingest key data elements and metrics into a data model on Hadoop
• Provide advanced visualization and drill down capabilities using Tableau
• Value
• Improve OR utilization and quality of care using learnings from retrospective
analysis

• Scheduled vs Actual Analysis
• OR Staff Summary Information
• Various filters to slice and dice
the data in different ways
• Next day data availability

• Use case
• Understand the supply costs associated with OR procedures and variance by
provider/service/location
• Solution
• Ingest key data elements from EMR, Billing and Supply Chain systems
• Provide advanced visualization and drill down capabilities using Tableau
• Value
• Identify areas of greatest potential variance/opportunity to manage costs
• Opportunities to isolate data issues, share best practices across platforms, optimize supply chain costs, and improve processes

• Compare supply cost for multiple
providers for same procedure
• Cost band indicates +/- 1 standard
deviation
• Compare cost for same procedure
by surgical role
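A cost band of +/- 1 standard deviation around the mean can be computed as in the sketch below. The provider labels and dollar amounts are invented for illustration, the helper names are hypothetical, and the use of the sample standard deviation is an assumption, since the slides do not say which variant the dashboard uses.

```python
from statistics import mean, stdev

def cost_band(case_costs):
    """Return (low, high) bounds at one sample standard deviation around the mean."""
    center = mean(case_costs)
    spread = stdev(case_costs)
    return center - spread, center + spread

def providers_outside_band(costs_by_provider):
    """List providers whose mean supply cost falls outside the band for the procedure."""
    all_costs = [cost for costs in costs_by_provider.values() for cost in costs]
    low, high = cost_band(all_costs)
    return sorted(
        provider
        for provider, costs in costs_by_provider.items()
        if not low <= mean(costs) <= high
    )

# Hypothetical supply costs per case for one procedure.
costs = {"A": [1000, 1100], "B": [1200, 1250], "C": [1900, 2000]}
outliers = providers_outside_band(costs)
```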

• Heatmap of cost variance across all
service lines
• Heatmap of cost variance by
service lines
• Can be filtered by lead procedures
per case
• Drill down capability to show
implants/explants and supply cost
per procedure and per case
