Geisinger Health System:
Mark Mossel, Director of Data Team Operations
Dhruv Mathrawala, Senior Data Architect
Integrated health services organization
Innovative care delivery models
Serves >3 million residents in 45 counties
>30,000 employees
>1,500 employed physicians
12 hospital campuses
551,000 member health plan
A good first start.
• Data assembled in a central location
• Allowed for self-service
• Could link disparate data
[Diagram: Data Warehouse fed by Health Record, Surveys, Cardiology, Oncology, Financials, Codesets, External Data, Claims]
“There are too many undocumented data sources.”
“There is no documented understanding of business requirements for CDIS business analytics.”
“We don’t have the transformations that the business users really need.”
“Cannot provide data that is fit for purpose.”
“Data dictionary does not exist today.”
“Can’t ‘match’ from encounters to bills to claim.”
“Much of my group’s time is spent entering data manually.”
“The platform/architecture in place for CDIS analytics is not correct for the types of work being performed.”
“Clinical data quality problems related to patient safety exist.”
“Hierarchies exist at many levels.”
“The level of detail that I need is not there in the data.”
“There are too many pockets of data.”
“The CDIS ‘lift and shift’ model perpetuates the problem with too many views/analytics.”
• If data isn’t accurate, it is worse than nothing.
• Incomplete data isn’t useful.
• Data that isn’t timely is less than desirable.
• When multiple versions of data exist, relying on the wrong value can lead to bad decisions.
• There must be ONE source of truth for data.
• Data without documentation is of questionable value.
Often, the first exposure of new data highlights data quality issues.
A unified data architecture (UDA) is a more comprehensive view of the overall enterprise architecture; a collection of services, platforms, applications, and tools that help customers define and deploy an architecture that makes the best use of available technologies to unleash the optimal value of data. (TDWI: Jun 6, 2013)
The UDA at Geisinger Health System is the integration of key analytic platforms (e.g., Hadoop, EDW, EHR, etc.) with a common semantic layer, all operating under the umbrella of the same Data Governance structure.
• Less expensive due to commodity hardware
– It could be as little as 10% of the cost of our traditional EDW.
• Faster ingestion of data
– Traditional data warehousing uses early binding: any mapping, modeling, etc. is typically done upfront. Hadoop's late binding allows the data to simply be loaded without detailed analysis and preparation.
• Multiple views of the data
– Our multi-zoned Hadoop system allows for many views of the data, including temporal, modeled, etc.
• Unstructured and semi-structured data
– Hadoop is not confined to structured data in discrete fields, as is the case with traditional analytic platforms.
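The early-binding versus late-binding distinction can be sketched in a few lines. This is only an illustration and not Geisinger's loader: the record layout, the field names, and the helpers `load_raw` and `read_with_schema` are all hypothetical. Raw records are stored exactly as received, and a schema is applied only when a query reads them.

```python
import json

# Hypothetical raw records, kept exactly as the source system emitted them.
RAW_LINES = [
    '{"patient_id": "A1", "weight_kg": "81.5", "note": "follow-up in 6 months"}',
    '{"patient_id": "B2", "weight_kg": "67", "device": "home scale"}',
]

def load_raw(lines):
    """Late binding: ingest every record as-is, with no upfront modeling."""
    return [json.loads(line) for line in lines]

def read_with_schema(records, schema):
    """Bind a schema at read time: select and cast only the fields a query needs."""
    return [
        {field: cast(rec[field]) if field in rec else None
         for field, cast in schema.items()}
        for rec in records
    ]

raw = load_raw(RAW_LINES)
weights = read_with_schema(raw, {"patient_id": str, "weight_kg": float})
print(weights)
```

Fields that no schema asked for (such as `device`) are still present in the raw records, so a later view can bind to them without reloading the data.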
THE V’S OF BIG DATA
Controlling Data Volume, Velocity, and Variety
Volume (scale of data)
• 600 TB
• 184M clinical notes
• 9,000 Epic Clarity tables
• >136,000 patient-participants for exome sequencing
Velocity (speed of ingestion)
• Late binding: days versus months
• Real-time capabilities
• <2 seconds to search all clinical notes
Variety (different forms and views)
• Non-traditional sources
• Home devices
• KeyHIE
• Social media
• Patient apps
• Device integration
• Genomics
• struct
• Multi-zoned
• Lawson
Veracity (uncertainty of data)
• Encryption at rest
• PHI masked
• Appropriate Authentication, Authorization, and Access
• Single source of TRUTH
Value (cost and resources)
• $20,000 vs $500K
• 10TB
• Open source
• Commodity hardware
• NLP can use
• ROI: Use the open-source, commodity-hardware argument
• Change: The SQL team is unfamiliar with the Big Data ecosystem
• Data Load: Load EVERYTHING into Hadoop by building prototypes, not use cases
• Self-service: Push for self-service as much as possible
• Adoption: Develop valuable early wins; invest in visualization (e.g., Tableau)
• Data Zones: Create separate data zones; split PHI from non-PHI data
• Surge capacity: Burst to cloud-based options when surge capacity is needed
PRODUCTION FOOTPRINT
CDIS
Teradata production server
– Version 14.10
– ~13TB uncompressed
– ~30TB compressed
Hadoop
Production cluster
– Hortonworks Data Platform v2.6
– 30 nodes
– 600TB total
– 200TB usable (3 copies)
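The relationship between the 600TB total and the 200TB usable figures follows from keeping three copies of every block. A minimal sketch of that arithmetic, assuming plain replication with no allowance for temporary or non-HDFS space (the function name is hypothetical):

```python
def usable_capacity_tb(raw_tb: float, replication_factor: int) -> float:
    """Usable capacity when every block is stored replication_factor times."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return raw_tb / replication_factor

# 600TB of raw disk with 3 copies of each block.
print(usable_capacity_tb(600, 3))  # 200.0
```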
MAJOR DATA SOURCES
Traditional EDW
• Health Record (clinical) data
• Financial
• Claims
• Pulmonary
• Pathology
• Oncology
Hadoop
• All EDW sources, plus:
• Lawson
– Fin, supply chain, A/P
• RIS (Radiology)
• Microbiology
• KeyHIE (Health Info Exchange)
• Lab System Data
• Phone Systems
• Lumedx (Cardiology)
LLAP STATISTICS
Configuration
• Running on 10 nodes
• Using 40% of the cluster
• 100GB Cache availability
Teradata vs LLAP
• Queries under 1 minute: 80% of queries performed better than on Teradata
• Queries over 1 minute: 95% of queries performed better than on Teradata
[Diagram: Epic data flow into the EDW and Hadoop]
• Epic Cache: primary clinical dataset containing patient health records
• Epic Clarity: clinical reporting DB
• EDW: traditional enterprise data warehouse
• Hadoop: new Big Data platform, fed by .ext files (ETL files feeding the clinical reporting database)
• Results in data available hours before the traditional EDW
• More tables loaded nightly
– ~1,100 in Teradata
– ~7,200 in Hadoop
• Incremental EXTs (~3,500 EXT files/night)
• Automated Epic loading process using MapReduce and Java
Landing Zone
• Source system pushes to landing zone
• Stored separately by source system
• Securely transferred
Raw Zone
• Auditing, traceability, compliance and lineage
• New source data is appended, not deleted
• Partitioned by load date
• Compressed
Refined Zone
• Data still temporal
• Data types match source
• Partitioned by load date
Current Zone
• Organized by business attributes and load date
• Current snapshot (temporal history is merged to give the latest view)
Integrated Zone
• Purpose-built datasets for quicker analytics
• Patient/member uniquely identified across systems
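The step from temporal history to a current snapshot can be sketched as "keep the most recently loaded row per key". This is an illustration under assumed names, not the production merge: the `patient_id` key, the `load_date` column, and the sample rows are all hypothetical.

```python
from datetime import date

# Hypothetical temporal history: rows are appended on every load, never deleted.
history = [
    {"patient_id": "A1", "load_date": date(2018, 1, 1), "address": "Old St"},
    {"patient_id": "A1", "load_date": date(2018, 3, 1), "address": "New Ave"},
    {"patient_id": "B2", "load_date": date(2018, 2, 1), "address": "Elm Rd"},
]

def current_snapshot(rows, key="patient_id", version="load_date"):
    """Merge temporal history into the latest view: one row per key."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[version] > latest[k][version]:
            latest[k] = row
    return sorted(latest.values(), key=lambda r: r[key])

snapshot = current_snapshot(history)
for row in snapshot:
    print(row["patient_id"], row["address"])
```

Because the history is only ever appended to, the snapshot can be rebuilt at any time, or rebuilt as of an earlier date by filtering on `load_date` first.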
• Encryption at rest for Hadoop data
• Authentication/Authorization
– LDAPS and AD integration using Ranger/Knox
• Connections
– SSL endpoint encryption active for all network connections
– ODBC: SSL secured
– JDBC: SSL secured
• Data
– Appropriate access and roles as required. These roles will continue to be defined by the Data Manager or his designate.
– All PHI data will be masked in the Development environment
• Kerberos authentication: to thwart impersonation threats
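Masking PHI before data reaches a development environment might look like the sketch below. It is an assumption-laden illustration, not the mechanism used at Geisinger: the field list, the salt, and `mask_record` are hypothetical, and a salted hash on its own is not a complete de-identification method.

```python
import hashlib

# Hypothetical list of fields treated as PHI.
PHI_FIELDS = {"name", "ssn", "mrn"}

def mask_record(record, phi_fields=PHI_FIELDS, salt="dev-only-salt"):
    """Replace PHI values with a stable token; leave other fields untouched."""
    masked = {}
    for field, value in record.items():
        if field in phi_fields:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            masked[field] = "MASKED-" + digest[:12]
        else:
            masked[field] = value
    return masked

example = {"name": "Jane Doe", "ssn": "123-45-6789", "visit_count": 3}
masked = mask_record(example)
print(masked)
```

The token is deterministic for a given salt, so joins on a masked column still line up across tables in the development copy.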
• Bundled Payments Care Initiative
• Data Model
• De-identification of PHI/BSI
• Natural Language Processing
• Sepsis
• O.R. Workflows
• Bactec
• Social Security Death File
• Supply Chain
• Registries
– MPOG, AAA, Ortho Infection, Ortho Trauma
• Lung Nodules
• Abdominal Aortic Aneurysms
• RetrospectOR
• Check Please
• Problem
– Patients with lung nodules found on imaging are lost to follow-up
• Solution
– Ingestion of data from radiology imaging notes
– NLP
• Value
– Identify lung nodules
LUNG NODULES – TEXT ANALYTICS WORKFLOW
[Diagram: Radiology notes (~10 million notes) → NLP and Dictionary annotator (annotates with UMLS concept codes) → Lung nodule Filter annotator (identifies lung nodule notes) → Lung nodule in note? NO: ~9.7 million notes; YES: ~300 thousand notes → Negation Annotator → Measurement/Lung RADS Calculator → . . .]
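The filter-then-negation idea in this workflow can be illustrated with a toy version. This is not the UMLS-based annotator pipeline named on the slide: the regular expressions, the negation cues, and `mentions_nodule` are hypothetical stand-ins, and real clinical negation detection is considerably more involved.

```python
import re

# Hypothetical dictionary of terms and negation cues, for illustration only.
NODULE_TERM = re.compile(r"\b(?:lung|pulmonary)\s+nodules?\b", re.IGNORECASE)
NEGATION_CUE = re.compile(r"\b(?:no|not|without|negative for)\b", re.IGNORECASE)

def mentions_nodule(note: str) -> bool:
    """True if some sentence mentions a lung nodule with no negation cue before it."""
    for sentence in re.split(r"[.!?]\s*", note):
        match = NODULE_TERM.search(sentence)
        if match is None:
            continue  # the filter step: sentence is not about lung nodules
        if not NEGATION_CUE.search(sentence[:match.start()]):
            return True  # the negation step: mention is affirmed
    return False

notes = [
    "There is a 6 mm pulmonary nodule in the right upper lobe.",
    "No evidence of lung nodules. Heart size is normal.",
    "Clear lungs. Normal study.",
]
flags = [mentions_nodule(n) for n in notes]
print(flags)
```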
                 Actual Yes        Actual No
Predicted Yes    True Positive     False Positive
Predicted No     False Negative    True Negative
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• 0.87 precision
• 0.95 recall
• 0.91 accuracy
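The four metrics can be computed directly from confusion-matrix counts. The sketch below uses made-up counts purely to exercise the formulas; they are not the counts behind the figures reported for the lung nodule model, which the slides do not give.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note the parentheses around `tp + tn` in the accuracy line: without them, operator precedence would divide only `tn` by the total.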
LUNG NODULES
• Problem
– Patients with AAA are lost to follow-up
• Solution
– Ingestion of data from radiology imaging notes
– Use NLP and care-gap closure technologies
• Value
– Ensure proper follow-up
502 patients identified
23 required urgent surgery
• Use case
– Provide capabilities to perform retrospective analysis of OR data
• Solution
– Ingest key data elements and metrics into a data model on Hadoop
– Provide advanced visualization and drill-down capabilities using Tableau
• Value
– Improve OR utilization and quality of care using learnings from retrospective analysis
• Scheduled vs. actual analysis
• OR staff summary information
• Various filters to slice and dice the data in different ways
• Next-day data availability
• Use case
– Understand the supply costs associated with OR procedures and variance by provider/service/location
• Solution
– Ingest key data elements from EMR, Billing and Supply Chain systems
– Provide advanced visualization and drill-down capabilities using Tableau
• Value
– Identify areas of greatest potential variance/opportunity to manage costs
– Opportunities for isolation of data issues, best practices across platforms, supply chain cost optimization and process improvement
• Compare supply cost for multiple providers for same procedure
• Cost band indicates +/- 1 standard deviation
• Compare cost for same procedure by surgical role
• Heatmap of cost variance across all service lines
• Heatmap of cost variance by service lines
• Can be filtered by lead procedures per case
• Drill-down capability to show implants/explants and supply cost per procedure and per case
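The +/- 1 standard deviation cost band can be sketched as follows. The provider names and costs are invented for illustration, and the use of the sample standard deviation is an assumption, since the slides do not say which variant the dashboard uses.

```python
from statistics import mean, stdev

# Hypothetical average supply cost per case for one procedure, by provider.
costs = {
    "Provider A": 1200.0,
    "Provider B": 1350.0,
    "Provider C": 1100.0,
    "Provider D": 2100.0,
}

def cost_band(values):
    """Return (low, high) bounds at one sample standard deviation around the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return mu - sigma, mu + sigma

def outside_band(costs_by_provider):
    """Providers whose cost falls outside the +/- 1 standard deviation band."""
    low, high = cost_band(list(costs_by_provider.values()))
    return sorted(p for p, c in costs_by_provider.items() if c < low or c > high)

low, high = cost_band(list(costs.values()))
print(f"band: {low:.2f} to {high:.2f}")
print(outside_band(costs))
```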
Big Data at Geisinger Health System: Big Wins in a Short Time

Big Data at Geisinger Health System: Big Wins in a Short Time

  • 1. Geisinger Health System: Mark Mossel, Director of Data Team Operations Dhruv Mathrawala, Senior Data Architect
  • 2. Integrated health services organization Innovative care delivery models Serves >3 million residents in 45 counties >30,000 employees >1,500 employed physicians 12 hospital campuses 551,000 member health plan
  • 3.
  • 4. A good first-start.  Data assembled in a central location  Allowed for self-service  Could link disparate data Health Record Data Warehouse Surveys Cardiology Oncology Financials Codesets External Data Claims
  • 5. “There are too many undocumented data sources.” “There is no documented understanding of business requirements for CDIS business analytics.” “We don’t have the transformations that the business users really need.” “Cannot provide data that is fit for purpose.” “Data dictionary does not exist today.” “Can’t “match” from encounters to bills to claim.” “Much of my group’s time is spent entering data manually” “The platform/ architecture in place for CDIS analytics is not correct for the types of work being performed.” “Clinical data quality problems related to patient safety exist.” “Hierarchies exist at many levels.” “The level of detail that I need is not there in the data.” “There are too many pockets of data.” “The CDIS “lift and shift” model perpetuates the problem with too many views/analytics”
  • 6. • If Data isn’t accurate, it is worse than nothing. • Incomplete isn’t useful. • Data that isn’t timely is less than desirable. • When multiple versions of data exist, relying on the wrong value can lead to bad decisions. •There must be ONE source of truth for data •Data without documentation is of questionable value Often, the first exposure of new data highlights data quality issues.
  • 7. A unified data architecture (UDA) is a more comprehensive view of the overall enterprise architecture; a collection of services, platforms, applications, and tools that help customers define and deploy an architecture that makes the best use of available technologies to unleash the optimal value of data. TDWI: Jun 6, 2013 The UDA at Geisinger Health System is the integration of key analytic platforms (e.g., Hadoop, EDW EHR, etc.) with a common semantic layer, and all performing under the umbrella of the same Data Governance structure.
  • 8. • Less expensive due to commodity hardware • It could be as little as 10% of the cost of our traditional EDW. • Faster ingestion of data • Because of early binding, any mapping, modeling, etc. is typically done upfront in traditional data warehousing. Late binding of Hadoop allows for the data to simply be loaded without detailed analysis and preparation. • Multiple views of the data • Our multi-zoned Hadoop system allows for many views of the data, including temporal, modeled, etc. • Unstructured and semi-structured data • Hadoop is not confined to structured data in discreet fields, as is the case with traditional analytic platforms.
  • 9.
  • 10. THE V’S OF BIG DATA Controlling Data Volume, Velocity, and Variety
  • 11. VolumeScale of data 600 TB 184clinical notes M 9,000Epic clarity tables >136,000 patient-participants for exome sequencing
  • 13. VarietyDifferent forms and views non- traditional sources home devices KeyHIE social media patient apps Device integration genomics struct multi- zoned Lawson
  • 14. VeracityUncertainty of data Encryption at rest PHI m a s ke d Appropriate Authentication, Authorization, And Access single source of TRUTH
  • 16. • ROI: use open-source, commodity hardware argument • Change: SQL team are unfamiliar with Big Data ecosystem • Data Load: Load EVERYTHING into Hadoop by building prototypes, not use cases • Self-service: Push for self-serve as much as possible, • Adoption: Develop valuable early wins, invest in visualization (e.g. Tableau) • Data Zones: Create separate data zones, split PHI from non-PHI data • Surge capacity: Pop-off to cloud-based options at surge capacity needs
  • 17. PRODUCTION FOOTPRINT CDIS Teradata production server – Version 14.10 – ~13TB uncompressed – ~30TB compressed Hadoop Production cluster – Hortonworks Data Platform v2.6 – 30 nodes – 600TB total – 200TB usable (3 copies)
  • 18. MAJOR DATA SOURCES Traditional EDW • Health Record (clinical) data • Financial • Claims • Pulmonary • Pathology • Oncology Hadoop • All EDW sources, plus: • Lawson – Fin, supply chain, A/P • RIS (Radiology) • Microbiology • KeyHIE (Health Info Exchange) • Lab System Data • Phone Systems • Lumedx (Cardiology)
  • 19. LLAP STATISTICS Configuration • Running on 10 nodes • Using 40% of the cluster • 100GB Cache availability Teradata vs LLAP • Query under 1 minute : 80% queries performed better than Teradata • Query over 1 minute : 95% queries performed better than Teradata
  • 20. Epic Cache Epic Clarity Hadoop .ext files (ETL files feeding the clinical reporting database) EDW Primary Clinical dataset containing patient health records Clinical reporting DB Traditional Ent. Data Warehouse New Big Data Platform Results in data available hours before the traditional EDW
  • 21. • More tables loaded nightly • ~1100 in Teradata • ~7200 in Hadoop • Incremental EXT’s (~3,500 EXT files/night) • Automated Epic loading process using Map Reduce and Java
  • 22. Landing Zone Raw Zone Refined Zone Current Zone Integrated Zone • Source system pushes to landing zone • Stored separately by source system • Securely transferred • Auditing, traceability, compliance and lineage • New source data is appended, not deleted • Partitioned by load date • Compressed • Data still temporal • Data types match source • Partitioned by load date • Organized by business attributes and load date • Current snapshot (temporal history is merged to give the latest view) • Purpose-built datasets for quicker analytics • Patient/member uniquely identified across systems
  • 23. • Encryption at rest for Hadoop data • Authentication/Authorization • LDAPS and AD Integration using Ranger/Knox • Connections • SSL endpoint encryption active for all network connections • ODBC – SSL Secured • JDBC – SSL Secured • Data • Appropriate access and roles as required. These roles will continue to be defined by the Data Manager or his designate. • All PHI data will be masked in the Development environment • Kerberos Authentication: To thwart impersonation threats
  • 24. • Bundled Payments Care Initiative • Data Model • De-identification of PHI/BSI • Natural Language Processing • Sepsis • O.R. Workflows • Bactec • Social Security Death File • Supply Chain • Registries • MPOG, AAA, Ortho Infection, Ortho Trauma
  • 25. • Lung Nodules • Abdominal Aortic Aneurysms • RetrospectOR • Check Please
  • 26. • Problem • Patients with lung nodules found on imaging are lost to follow-up • Solution • Ingestion of data from radiology imaging notes • NLP • Value • Identify lung nodules
  • 27. LUNG NODULES – TEXT ANALYTICS WORKFLOW: Radiology notes (~10 million notes) → NLP and Dictionary annotator (annotates with UMLS concept codes) → Lung nodule Filter annotator (identifies lung nodule notes) → Lung nodule in note? NO: ~9.7 million notes; YES: ~300 thousand notes → Negation Annotator → Measurement/Lung RADS Calculator → . . .
  • 28. Confusion matrix: • Predicted Yes, Actual Yes: True Positive (TP) • Predicted Yes, Actual No: False Positive (FP) • Predicted No, Actual Yes: False Negative (FN) • Predicted No, Actual No: True Negative (TN) • Precision = TP / (TP + FP) • Recall = TP / (TP + FN) • F1 Score = 2 * (Precision * Recall) / (Precision + Recall) • Accuracy = (TP + TN) / (TP + TN + FP + FN)
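The four metrics on this slide can be computed directly from confusion-matrix counts. The Python sketch below is illustrative only: the function name `classification_metrics` and the example counts are hypothetical, not figures from the presentation. Accuracy is taken as (TP + TN) divided by all four counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1 and accuracy from confusion-matrix counts."""
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a rare positive class, as in the lung nodule workflow where most notes contain no nodule, accuracy alone can look high even when recall is poor, which is why precision, recall and F1 are reported alongside it.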
  • 30. • Problem • Patients with AAA are lost to follow up • Solution • Ingestion of data from radiology imaging notes • Use NLP and care-gap closure technologies • Value • Ensure proper follow up
  • 31. 502 patients identified 23 required urgent surgery
  • 32. • Use case • Provide capabilities to perform retrospective analysis of OR data • Solution • Ingest key data elements and metrics into a data model on Hadoop • Provide advanced visualization and drill down capabilities using Tableau • Value • Improve OR utilization and quality of care using learnings from retrospective analysis
  • 33. • Scheduled vs Actual Analysis • OR Staff Summary Information • Various filters to slice and dice the data in different ways • Next day data availability
  • 34. • Use case • Understand the supply costs associated with OR procedures and variance by provider/service/location • Solution • Ingest key data elements from EMR, Billing and Supply Chain systems • Provide advanced visualization and drill down capabilities using Tableau • Value • Identify areas of greatest potential variance/opportunity to manage costs • Opportunities for Isolation of data issues, best practices across platforms, supply chain cost optimization and process improvement
  • 35. • Compare supply cost for multiple providers for same procedure • Cost band indicates +/- 1 standard deviation • Compare cost for same procedure by surgical role
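The cost band of +/- 1 standard deviation can be sketched in a few lines of Python. This is an assumption-laden illustration: the dashboard itself is built in Tableau, the provider labels and costs below are invented, and the sample standard deviation (`statistics.stdev`) is assumed because the slide does not say which variant is used.

```python
from statistics import mean, stdev


def cost_band(costs):
    """Return (low, high) bounds at mean +/- 1 sample standard deviation."""
    m = mean(costs)
    s = stdev(costs)
    return m - s, m + s


def providers_outside_band(cost_by_provider):
    """List providers whose supply cost falls outside the band."""
    low, high = cost_band(list(cost_by_provider.values()))
    return sorted(
        p for p, c in cost_by_provider.items() if c < low or c > high
    )


# Hypothetical supply cost per case for one procedure, by provider.
cost_by_provider = {"A": 100, "B": 110, "C": 120, "D": 130, "E": 240}

low, high = cost_band(list(cost_by_provider.values()))
outliers = providers_outside_band(cost_by_provider)
print(f"band: {low:.2f} to {high:.2f}, outside band: {outliers}")
```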
  • 36. • Heatmap of cost variance across all service lines • Heatmap of cost variance by service lines • Can be filtered by lead procedures per case • Drill down capability to show implants/explants and supply cost per procedure and per case

Editor's Notes

  1. Brief introduction about Geisinger
  2. EHR in mid-90s. By 2006, leadership wanted EDW. CDIS (clin dec intel syst) live in 2008. Big win early. Few Healthcare orgs had this integration platform at this time. Internally, depts. (research) no longer had to request extracts from Epic for analytics. One platform of data (clin, fin, claims) for analytics, to transform the delivery of care. It has gone through a number of iterations, and currently supports much of the analytics running our day-to-day operations. Over 2100 users. 2012, switched to TD (higher performance). 2016, UDA. Integrate all key analytics platforms (Hadoop, Cerner, Epic EDW)
  3. Next phase of our analytics platform: Hadoop (Big Data)
  4. Late binding of Hadoop allows for the data to simply be loaded without detailed analysis and preparation up-front.
  5. Our multi-zoned Hadoop system allows for many views of the data, including temporal, modeled, etc. Hadoop is not confined to structured data in discrete fields, as is the case with traditional analytic platforms.
  6. LDAP and AD integration using Ranger/Knox. Encryption at rest. SSL endpoint encryption active for all network connections. Kerberos authentication: to thwart impersonation threats. Appropriate access and roles as required. These roles will continue to be defined by the Data Manager or his designate. All PHI data will be masked in the Development environment.
  7. Less costly hardware for storing increasing data (structured and unstructured). 5 million to purchase new Teradata hardware. Prevent “one-off” data systems (e.g. IoT data capture, ICU real-time data capture, Cybersecurity).
  8. Lung nodules are commonly identified in free text within radiology reports and can easily be lost to follow-up, with potential for delayed cancer diagnosis. A treasure trove of useful, relevant, and unstructured clinical information in the form of text blobs and semi-templated data is locked inside EHRs. We used Solr, a component of the Apache Hadoop ecosystem, to expose the data and let users perform rapid search. It gives the ability to sort through over 184M clinical notes across 20 years' worth of in/outpatient records. It also serves as a framework to run cTAKES and other Natural Language Processing programs to find signal in the text noise and make the data actionable.
  9. UMLS: Unified Medical Language System. Negations: nearly 30% of identified lung nodule notes are negative results. The NLP engine constructs a grammar tree and associates negation words with the identified lung nodule text. Calculate Lung-RADS scores based on nodule size and description. Future tasks: measure accuracy of predicted Lung-RADS scores and improve performance.
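To make the negation step concrete, here is a deliberately simplified Python sketch. It is not the cTAKES pipeline and does not build a grammar tree: it swaps in a much cruder rule, assumed for illustration, that flags a nodule mention as negated when a cue word appears earlier in the same sentence. The cue list, the function names and the sample note are all hypothetical. It also pulls out a size in millimetres, the kind of measurement a Lung-RADS calculation would need, but it does not compute a Lung-RADS score.

```python
import re

# Hypothetical cue list; a production negation annotator is far richer.
NEGATION_CUES = ("no", "without", "negative for", "no evidence of")


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def nodule_mentions(note):
    """Return one record per sentence that mentions a nodule."""
    results = []
    for sentence in split_sentences(note):
        lower = sentence.lower()
        match = re.search(r"\bnodules?\b", lower)
        if not match:
            continue
        # Only text before the mention is scanned for negation cues.
        prefix = lower[: match.start()]
        negated = any(
            re.search(r"\b" + re.escape(cue) + r"\b", prefix)
            for cue in NEGATION_CUES
        )
        size = re.search(r"(\d+(?:\.\d+)?)\s*mm\b", lower)
        results.append(
            {
                "sentence": sentence,
                "negated": negated,
                "size_mm": float(size.group(1)) if size else None,
            }
        )
    return results


# Invented example text, not taken from any real report.
note = (
    "There is a 6 mm nodule in the right upper lobe. "
    "No pulmonary nodule is seen on the left."
)
mentions = nodule_mentions(note)
for m in mentions:
    print(m["negated"], m["size_mm"], m["sentence"])
```

A window-based rule like this misses negations that follow the mention or span clauses, which is one reason a parser-based approach such as the one described in the notes is preferable for real clinical text.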