As part of its big data program, MD Anderson Cancer Center implemented Hadoop to help manage and analyze big data. The implementation included building Hadoop clusters to store and process structured and unstructured data from many sources. Lessons learned: implementing Hadoop is complex and a journey; leverage existing strengths, collaborate openly, learn from experts, start with one cluster serving multiple use cases, and follow best practices. Next steps include expanding the Hadoop platform, ingesting more data types, identifying high-value use cases, and developing and training people in new big data skills.
Starting the Hadoop Journey at a Global Leader in Cancer Research
1. Vamshi Punugoti & Bryan Lari
MD Anderson Cancer Center
June 2016
HDP @ MD ANDERSON
2. Agenda
• About MD Anderson
• Big Data Program
• Our Hadoop Implementation
• Lessons Learned
• Next Steps
3. • Who we are
– One of the world's largest centers devoted exclusively to cancer care
– Created by the Texas legislature in 1941
– Named one of the nation's top two hospitals for cancer care every
year since the survey began in 1990
• Mission
– MD Anderson’s mission is to eliminate cancer in Texas, the nation and
the world through exceptional programs that integrate patient care,
research and prevention.
About MD Anderson
5. Moon Shots Program
• Launched in 2012 – to make a giant leap for patients
• Accelerating the pace of converting scientific discoveries into
clinical advances that reduce cancer deaths
• Transdisciplinary team-science approach
• Transformative professional platforms
List of Moon Shots (12 total):
• B-cell Lymphoma
• Breast Cancer
• Colorectal Cancer
• Glioblastoma
• HPV-Related Cancers
• Leukemia (CLL, MDS, AML)
• Lung Cancer
• Melanoma
• Multiple Myeloma
• Ovarian Cancer
• Pancreatic Cancer
• Prostate Cancer
http://www.cancermoonshots.org
9. Goals of Big Data Program
• Data-driven organization
• All “types” of data
• “Access” for all customers
  – Clinicians
  – Researchers
  – Administrative / Operational
• Enable discovery of “insights”
  – Improve patient care
  – Increase research discoveries
  – Improve operations
• Govern data like an asset
• Provide a platform / environment to enable all these things
10. Goal
To provide the right information to the right people at the right time with the right tools: turning data into insight.
13. What are we doing today?
• FIRE Enterprise Data Warehouse
• Natural Language Processing (NLP)
• Data Governance
• Hadoop / NoSQL
• Cognitive Computing
• Data Visualization
• Evolving our Platform / Architecture
• Identifying big data use cases
• Training & Skills
14. • Federated Institutional Reporting Environment
• Centralized data repository supporting analytics,
decision making, and business intelligence
• Central repository for historical and operational data
• Break down data silos
FIRE Program
[Diagram: source systems (Epic / Clarity, labs, radiology, genomic, legacy systems) flow into the enterprise repository; analytics & reporting (dashboards, KPIs, analytic reports) sit on top, driving discoveries, improved patient care, and quality / performance improvements.]
15. • Vast amounts of unstructured data are
stored on MDACC servers.
• Conventional ETL tools are not designed
to mine unstructured data.
• A suite of tools makes up the NLP pipeline
• Dictionaries were created to help Epic
go-live (Provider Friendly Terminology)
• Other examples:
• Diagnosis from the pathology reports
• Comorbidities
• Family Cancer History
• Cytogenetics
• Obituary text
• ICD10 Coding
• Structured results feeding Moonshot TRA and OEA
• Etc.
NLP Pipeline - Overview
[Diagram: unstructured data sources (IBM ECM) feed the NLP engine; structured output lands in a post-NLP database and flows into the HDWF (FIRE).]
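The dictionary-driven extraction described above can be sketched as follows. This is a minimal illustration with a hypothetical terminology dictionary and a fabricated report snippet, not MD Anderson's actual pipeline or vocabulary:

```python
# Minimal sketch of dictionary-based concept extraction, in the spirit of an
# NLP pipeline that pulls structured facts (e.g. diagnosis terms) out of
# free-text pathology reports. Dictionary entries and the sample report are
# hypothetical, not actual MD Anderson terminology.

# Hypothetical dictionary: surface form (lowercase) -> canonical concept
DIAGNOSIS_DICT = {
    "invasive ductal carcinoma": "Breast Cancer - Invasive Ductal Carcinoma",
    "glioblastoma multiforme": "Glioblastoma",
    "chronic lymphocytic leukemia": "Leukemia (CLL)",
}

def extract_diagnoses(report_text: str) -> list[str]:
    """Return canonical diagnosis concepts whose surface forms appear in the text."""
    text = report_text.lower()
    return [concept for term, concept in DIAGNOSIS_DICT.items() if term in text]

report = "FINAL DIAGNOSIS: Invasive ductal carcinoma, left breast, grade 2."
print(extract_diagnoses(report))  # ['Breast Cancer - Invasive Ductal Carcinoma']
```

A production pipeline would add tokenization, negation handling ("no evidence of..."), and mapping to standard vocabularies, but the core idea of matching curated surface forms to canonical concepts is the same.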
16. Big Data for Analytics & Cognitive Computing
[Diagram: systems landscape]
• Systems of Record
  – Enterprise / Business: PeopleSoft, Kronos, Point of Sale, Volunteer Services, Rotary House, MyHR, UTPD, Facilities, Parking Garages
  – Clinical: Clinic Station, Epic, Lab, GE IDX, Cerner, CARE, Pharmacy
  – Research: LCDR, Melcore, Gemini, IPCT
  – Big Data: Facebook, Twitter, LinkedIn, YouTube, Yelp!, Google, Reuters, oracle.com, UPS, Center for Disease Control, The Weather Channel, U.S. Census, Medical Devices, Medical Equipment, Building Controls, Campus Video, Real-time Location Service, Wayfinding
• Systems of Reporting: FIRE, EIW, EPM, Hyperion, Oracle Business Intelligence, Smart View, Web Analytics, Business Objects, Crystal, Hyperion Interactive Reporting
• Systems of Insights / Presentation: Data Visualization, Ad Hoc, Cognitive Computing, Cohort Explorer
17. Data Governance
• Data Stewardship
• Data Portal
• Data Profiling and Quality
• Data Standardization
• Compliance
• Metadata and Business Glossary
• Master Data Management
23. Our Hadoop Implementation cont.
• Average number of messages per day: 1,556,688
• Estimated storage increase per day: 5.7 GB
• Number of channels currently in use: 24
• Estimated daily message processing capacity: 4,320,000
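These figures imply comfortable headroom; a quick sanity check of the arithmetic (the inputs are the slide's figures, the derived numbers are simple calculation, not from the source):

```python
# Back-of-the-envelope check of the throughput figures above: daily volume
# versus stated processing capacity, and the implied average message size.
msgs_per_day = 1_556_688
capacity_per_day = 4_320_000
storage_per_day_gb = 5.7

utilization = msgs_per_day / capacity_per_day              # ~36% of stated capacity
capacity_per_second = capacity_per_day / 86_400            # 50 messages/second
avg_msg_kb = storage_per_day_gb * 1024**2 / msgs_per_day   # ~3.8 KB stored per message

print(f"utilization: {utilization:.0%}")          # utilization: 36%
print(f"capacity: {capacity_per_second:.0f} msg/s")  # capacity: 50 msg/s
print(f"avg size: {avg_msg_kb:.1f} KB/msg")       # avg size: 3.8 KB/msg
```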
24. Our Hadoop Implementation cont.
Medical Device Data Flow
[Diagram: medical devices → data capture (Capsule, Capsule DB) → integration hub (Cloverleaf engine, with Epic supplying patient-ID validation) → data ingestion (TCP-based data listener built on Flume) → MDA Big Data / data lake (processing channels, data loader into HBase; Hive, Pig, Hunk, Sqoop) → access portals (analytics / visualization, FIRE / Big Data) → end users. Raw HL7 from Capsule is cleansed and transformed; validated HL7 with patient ID comes from Epic.]
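The raw-versus-validated HL7 handling in this flow can be illustrated with a minimal parser sketch. The sample message, segment handling, and field positions are simplified and fabricated; a real pipeline would use a dedicated HL7 library and handle encoding characters and repeated segments properly:

```python
# Minimal sketch of splitting a raw HL7 v2 message into segments and pulling
# the patient ID from the PID segment -- the kind of field the pipeline above
# validates against Epic. The sample message is fabricated.

SAMPLE_HL7 = "\r".join([
    "MSH|^~\\&|CAPSULE|ICU|CLOVERLEAF|MDA|20160601120000||ORU^R01|12345|P|2.3",
    "PID|1||MRN0001||DOE^JANE",
    "OBX|1|NM|HR^Heart Rate||72|bpm",
])

def parse_segments(message: str) -> dict[str, list[str]]:
    """Map each segment name (MSH, PID, ...) to its pipe-delimited fields.
    Simplification: repeated segments (e.g. multiple OBX) would overwrite."""
    segments = {}
    for line in message.split("\r"):
        fields = line.split("|")
        segments[fields[0]] = fields
    return segments

def patient_id(segments: dict[str, list[str]]) -> str:
    # PID-3 (patient identifier list) is field index 3 after the segment name
    return segments["PID"][3]

segs = parse_segments(SAMPLE_HL7)
print(patient_id(segs))  # MRN0001
```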
25. Our Hadoop Implementation cont.
[Diagram: development & deployment cycle across HDP Dev, QA, and Prod clusters]
• Development cycle: developer workstation / sandbox with daily check-in / check-out against SVN (source control server); smoke test before updating task status.
• Bamboo (build server) runs periodic integration and validation: build, unit test, and notify on error.
• On dev lead approval: build, unit test, deploy & tag.
• On successful UAT and release approval: deploy per the last successful build tag.
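The approval gates in this cycle can be modeled as a simple promotion pipeline. This is an illustrative sketch only; the gate names are hypothetical and the actual process is driven by SVN and Bamboo:

```python
# Sketch of the deployment-cycle gates above as a promotion pipeline: a
# tagged build moves Dev -> QA -> Prod only while each approval gate has
# passed. Names and structure are illustrative, not the Bamboo configuration.

CLUSTERS = ["Dev", "QA", "Prod"]

# Gate that must pass before the build reaches each cluster (hypothetical names)
GATES = {
    "Dev": "dev_lead_approval",   # build, unit test, deploy & tag
    "QA": "successful_uat",
    "Prod": "release_approval",   # deploy per last successful build tag
}

def promote(build_tag: str, approvals: set[str]) -> list[str]:
    """Return the clusters the tagged build is allowed to reach, in order."""
    deployed = []
    for cluster in CLUSTERS:
        if GATES[cluster] not in approvals:
            break  # stop at the first failed gate; later stages stay untouched
        deployed.append(cluster)
    return deployed

print(promote("build-42", {"dev_lead_approval", "successful_uat"}))  # ['Dev', 'QA']
```

The key property the diagram implies, and the sketch preserves, is that production only ever receives a build tag that has already passed every earlier gate.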
26. Lessons Learned – what went well (people & process)
1. It’s complex
2. It’s a journey
3. Leverage existing strengths
4. Collaborate openly
5. Learn from experts
6. One cluster – multiple use cases
7. Follow best practices
27. 1. Continue to expand/evolve our platform
2. Ingest more data and data types
3. Identify high value use cases
4. Develop/Train people with new skills
Next Steps
28. Train People with New Skills
• Accessing data
• Computing data
• Visualizing data
• Insights & Cognitive Computing