As part of its big data program, MD Anderson Cancer Center implemented Hadoop to help manage and analyze big data. The implementation included building Hadoop clusters to store and process structured and unstructured data from many sources. Lessons learned: implementing Hadoop is complex and a journey; leverage existing strengths, collaborate openly, learn from experts, start with one cluster serving multiple use cases, and follow best practices. Next steps include expanding the Hadoop platform, ingesting more data types, identifying high-value use cases, and developing and training people in new big data skills.
Starting the Hadoop Journey at a Global Leader in Cancer Research
1. Vamshi Punugoti & Bryan Lari
MD Anderson Cancer Center
June 2016
HDP @ MD ANDERSON
2. Agenda
• About MD Anderson
• Big Data Program
• Our Hadoop Implementation
• Lessons Learned
• Next Steps
3. • Who we are
– One of the world's largest centers devoted exclusively to cancer care
– Created by the Texas legislature in 1941
– Named one of the nation's top two hospitals for cancer care every
year since the survey began in 1990
• Mission
– MD Anderson’s mission is to eliminate cancer in Texas, the nation and
the world through exceptional programs that integrate patient care,
research and prevention.
About MD Anderson
5. Moon Shots Program
• Launched in 2012 – to make a giant leap for patients
• Accelerating the pace of converting scientific discoveries into
clinical advances that reduce cancer deaths
• Transdisciplinary team-science approach
• Transformative professional platforms
List of Moon Shots (12 total):
• B-cell Lymphoma
• Breast Cancer
• Colorectal Cancer
• Glioblastoma
• HPV-Related Cancers
• Leukemia (CLL, MDS, AML)
• Lung Cancer
• Melanoma
• Multiple Myeloma
• Ovarian Cancer
• Pancreatic Cancer
• Prostate Cancer
http://www.cancermoonshots.org
9. Goals of Big Data Program
• Data-driven organization
• All “types” of data
• “Access” for all customers
  – Clinicians
  – Researchers
  – Administrative / Operational
• Enable discovery of “insights”
  – Improve patient care
  – Increase research discoveries
  – Improve operations
• Govern data like an asset
• Provide a platform / environment to enable all these things
10. Goal
To provide the right information to the right people at the right time with the right tools: turning data into insight.
13. What are we doing today?
• FIRE Enterprise Data Warehouse
• Natural Language Processing (NLP)
• Data Governance
• Hadoop / NoSQL
• Cognitive Computing
• Data Visualization
• Evolving our Platform / Architecture
• Identifying big data use cases
• Training & Skills
14. • Federated Institutional Reporting Environment
• Centralized data repository supporting analytics,
decision making, and business intelligence
• Central repository for historical and operational data
• Break down data silos
FIRE Program
[Diagram: source systems (Epic / Clarity, labs, radiology, genomic, legacy systems) flow into the enterprise repository; analytics & reporting (dashboards, KPIs, analytic reports) sit on top, driving discoveries, improved patient care, and quality / performance improvements.]
15. • Vast amounts of unstructured data are
stored on MDACC servers.
• Conventional ETL tools are not designed
to mine unstructured data.
• A suite of tools makes up the NLP pipeline
• Dictionaries were created to help Epic
go-live (Provider Friendly Terminology)
• Other examples:
• Diagnosis from the pathology reports
• Comorbidities
• Family Cancer History
• Cytogenetics
• Obituary text
• ICD10 Coding
• Structured results feeding Moonshot TRA and OEA
• Etc.
NLP Pipeline - Overview
[Diagram: unstructured data sources (IBM ECM) feed the NLP engine; structured output lands in a post-NLP database and flows into the HDWF (FIRE).]
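The dictionary-driven extraction described above can be sketched as follows. This is a minimal illustration with a hypothetical terminology dictionary and a fabricated report snippet, not MD Anderson's actual pipeline or vocabulary:

```python
# Minimal sketch of dictionary-based concept extraction, in the spirit of an
# NLP pipeline that pulls structured facts (e.g. diagnosis terms) out of
# free-text pathology reports. Dictionary entries and the sample report are
# hypothetical, not actual MD Anderson terminology.

# Hypothetical dictionary: surface form (lowercase) -> canonical concept
DIAGNOSIS_DICT = {
    "invasive ductal carcinoma": "Breast Cancer - Invasive Ductal Carcinoma",
    "glioblastoma multiforme": "Glioblastoma",
    "chronic lymphocytic leukemia": "Leukemia (CLL)",
}

def extract_diagnoses(report_text: str) -> list[str]:
    """Return canonical diagnosis concepts whose surface forms appear in the text."""
    text = report_text.lower()
    return [concept for term, concept in DIAGNOSIS_DICT.items() if term in text]

report = "FINAL DIAGNOSIS: Invasive ductal carcinoma, left breast, grade 2."
print(extract_diagnoses(report))  # ['Breast Cancer - Invasive Ductal Carcinoma']
```

A production pipeline would add tokenization, negation handling ("no evidence of..."), and mapping to standard vocabularies, but the core idea of matching curated surface forms to canonical concepts is the same.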
16. Big Data for Analytics & Cognitive Computing
[Diagram: systems landscape]
• Systems of Record
  – Enterprise / Business: PeopleSoft, Kronos, Point of Sale, Volunteer Services, Rotary House, MyHR, UTPD, Facilities, Parking Garages
  – Clinical: Clinic Station, Epic, Lab, GE IDX, Cerner, CARE, Pharmacy
  – Research: LCDR, Melcore, Gemini, IPCT
  – Big Data: Facebook, Twitter, LinkedIn, YouTube, Yelp!, Google, Reuters, oracle.com, UPS, Center for Disease Control, The Weather Channel, U.S. Census, Medical Devices, Medical Equipment, Building Controls, Campus Video, Real-time Location Service, Wayfinding
• Systems of Reporting: FIRE, EIW, EPM, Hyperion, Oracle Business Intelligence, Smart View, Web Analytics, Business Objects, Crystal, Hyperion Interactive Reporting
• Systems of Insights / Presentation: Data Visualization, Ad Hoc, Cognitive Computing, Cohort Explorer
17. Data Governance
• Data Stewardship
• Data Portal
• Data Profiling and Quality
• Data Standardization
• Compliance
• Metadata and Business Glossary
• Master Data Management
23. Our Hadoop Implementation cont.
• Average number of messages per day: 1,556,688
• Estimated storage increase per day: 5.7 GB
• Number of channels currently in use: 24
• Estimated daily message processing capacity: 4,320,000
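These figures imply comfortable headroom; a quick sanity check of the arithmetic (the inputs are the slide's figures, the derived numbers are simple calculation, not from the source):

```python
# Back-of-the-envelope check of the throughput figures above: daily volume
# versus stated processing capacity, and the implied average message size.
msgs_per_day = 1_556_688
capacity_per_day = 4_320_000
storage_per_day_gb = 5.7

utilization = msgs_per_day / capacity_per_day              # ~36% of stated capacity
capacity_per_second = capacity_per_day / 86_400            # 50 messages/second
avg_msg_kb = storage_per_day_gb * 1024**2 / msgs_per_day   # ~3.8 KB stored per message

print(f"utilization: {utilization:.0%}")          # utilization: 36%
print(f"capacity: {capacity_per_second:.0f} msg/s")  # capacity: 50 msg/s
print(f"avg size: {avg_msg_kb:.1f} KB/msg")       # avg size: 3.8 KB/msg
```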
24. Our Hadoop Implementation cont.
Medical Device Data Flow
[Diagram: medical devices → data capture (Capsule, Capsule DB) → integration hub (Cloverleaf engine, with Epic supplying patient-ID validation) → data ingestion (TCP-based data listener built on Flume) → MDA Big Data / data lake (processing channels, data loader into HBase; Hive, Pig, Hunk, Sqoop) → access portals (analytics / visualization, FIRE / Big Data) → end users. Raw HL7 from Capsule is cleansed and transformed; validated HL7 with patient ID comes from Epic.]
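The raw-versus-validated HL7 handling in this flow can be illustrated with a minimal parser sketch. The sample message, segment handling, and field positions are simplified and fabricated; a real pipeline would use a dedicated HL7 library and handle encoding characters and repeated segments properly:

```python
# Minimal sketch of splitting a raw HL7 v2 message into segments and pulling
# the patient ID from the PID segment -- the kind of field the pipeline above
# validates against Epic. The sample message is fabricated.

SAMPLE_HL7 = "\r".join([
    "MSH|^~\\&|CAPSULE|ICU|CLOVERLEAF|MDA|20160601120000||ORU^R01|12345|P|2.3",
    "PID|1||MRN0001||DOE^JANE",
    "OBX|1|NM|HR^Heart Rate||72|bpm",
])

def parse_segments(message: str) -> dict[str, list[str]]:
    """Map each segment name (MSH, PID, ...) to its pipe-delimited fields.
    Simplification: repeated segments (e.g. multiple OBX) would overwrite."""
    segments = {}
    for line in message.split("\r"):
        fields = line.split("|")
        segments[fields[0]] = fields
    return segments

def patient_id(segments: dict[str, list[str]]) -> str:
    # PID-3 (patient identifier list) is field index 3 after the segment name
    return segments["PID"][3]

segs = parse_segments(SAMPLE_HL7)
print(patient_id(segs))  # MRN0001
```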
25. Our Hadoop Implementation cont.
[Diagram: development & deployment cycle across HDP Dev, QA, and Prod clusters]
• Development cycle: developer workstation / sandbox with daily check-in / check-out against SVN (source control server); smoke test before updating task status.
• Bamboo (build server) runs periodic integration and validation: build, unit test, and notify on error.
• On dev lead approval: build, unit test, deploy & tag.
• On successful UAT and release approval: deploy per the last successful build tag.
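The approval gates in this cycle can be modeled as a simple promotion pipeline. This is an illustrative sketch only; the gate names are hypothetical and the actual process is driven by SVN and Bamboo:

```python
# Sketch of the deployment-cycle gates above as a promotion pipeline: a
# tagged build moves Dev -> QA -> Prod only while each approval gate has
# passed. Names and structure are illustrative, not the Bamboo configuration.

CLUSTERS = ["Dev", "QA", "Prod"]

# Gate that must pass before the build reaches each cluster (hypothetical names)
GATES = {
    "Dev": "dev_lead_approval",   # build, unit test, deploy & tag
    "QA": "successful_uat",
    "Prod": "release_approval",   # deploy per last successful build tag
}

def promote(build_tag: str, approvals: set[str]) -> list[str]:
    """Return the clusters the tagged build is allowed to reach, in order."""
    deployed = []
    for cluster in CLUSTERS:
        if GATES[cluster] not in approvals:
            break  # stop at the first failed gate; later stages stay untouched
        deployed.append(cluster)
    return deployed

print(promote("build-42", {"dev_lead_approval", "successful_uat"}))  # ['Dev', 'QA']
```

The key property the diagram implies, and the sketch preserves, is that production only ever receives a build tag that has already passed every earlier gate.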
26. Lessons Learned – what went well (people & process)
1. It’s complex
2. It’s a journey
3. Leverage existing strengths
4. Collaborate openly
5. Learn from experts
6. One cluster – multiple use cases
7. Follow best practices
27. 1. Continue to expand/evolve our platform
2. Ingest more data and data types
3. Identify high value use cases
4. Develop/Train people with new skills
Next Steps
28. Train People with New Skills
• Accessing data
• Computing data
• Visualizing data
• Insights & Cognitive Computing