SlideShare a Scribd company logo
1 of 69
Download to read offline
Joel Saltz MD, PhD
Emory University
February 2013
Data Science, Big Data and You
CenterforComprehensiveInformatics
Big Data
• Social media—
analysis of tweets
and Facebook to
observed trends in
real time
• Local Walgreens
stock their shelves
according to local
tweets about cold
symptoms
• Credit card fraud—lost
of transactions, but
yet you get a flag that
you shopped in a store
that does not fit your
profile—and within
minutes your card is
blocked.
CenterforComprehensiveInformatics
Big Data in Commerce - Fraud Detection
• Seek unexpected data – outliers
• Lots of data – all Amex, Visa or Mastercard
transactions
• Look for individual outliers – e.g. credit
transaction involving large amount of money
purchasing unusual product
• Look for sequence data with temporal or
spatial relationship -- find unusual
sequence e.g., intrusion detection and cyber
security
CenterforComprehensiveInformatics
• Define the ―typical‖ regions in a data set – may be
difficult
• ―Typical‖ behavior may change with time. What is
typical today may be considered anomalous in
future and vice versa.
• (Smart) crooks will make ―keep under the radar‖ to
try to stay undetected
CenterforComprehensiveInformatics
Approaches
• Sometimes build a model from the training data
and apply the model to detect outliers
• Sometimes use the existing data directly to detect
outliers
CenterforComprehensiveInformatics
Big Data Ecosystem
6
Credit: http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
CenterforComprehensiveInformatics
Science and Engineering Applications
Sloan Sky Survey
CenterforComprehensiveInformatics
Early Big Data 1922 -Lewis Richardson
Weather Forecasting
CenterforComprehensiveInformatics
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/‖Big Data‖ Computers,
Systems Software
• Analysis of Patient Populations
CenterforComprehensiveInformatics
CenterforComprehensiveInformatics
Scientific Big Data Targets
• Multi-dimensional spatial-temporal datasets
– Biomedicine
– Oil Reservoir Simulation/Carbon
Sequestration/Groundwater Pollution Remediation
– Biomass monitoring and disaster surveillance
– Weather prediction
– Analysis of Results from Large Scale Simulations
• Correlative and cooperative analysis of data from
multiple sensor modalities and sources
• What-if scenarios and multiple design choices or
initial conditions
Emory In Silico Center for Brain Tumor
Research (PI = Dan Brat, PD= Joel Saltz)
CenterforComprehensiveInformatics
Integrative Cancer Research with Digital Pathology
histology neuroimaging
clincalpathology
Integrated
Analysis
molecular
High-resolution whole-slide microscopy
Multiplex IHC
Integrative Analysis: OSU BISTI NBIB Center
Big Data (2005)
Associate genotype with
phenotype
Big science experiments on
cancer, heart disease,
pathogen host response
Tissue specimen -- 1 cm3
0.3 μ resolution – roughly 1013
bytes
Molecular data (spatial location)
can add additional significant
factor; e.g. 102
Multispectral imaging, laser
captured microdissection,
Imaging Mass Spec, Multiplex
QD
Multiple tissue specimens; another
factor of 103
Total: 1018 bytes – exabyte per big
science experiment
A Data Intense Challenge:
The Instrumented Oil Field of the Future
The Tyranny of Scale
(Tinsley Oden - U Texas)
process scale
field scale
km
cm
simulation scale
mm
pore scale
Why Applications Get Big
• Physical world or simulation results
• Detailed description of two, three (or more)
dimensional space
• High resolution in each dimension, lots of
timesteps
• e.g. oil reservoir code -- simulate 100 km by
100 km region to 1 km depth at resolution of
100 cm:
– 10^6*10^6*10^4 mesh points, 10^2 bytes per
mesh point, 10^6 timesteps --- 10^24 bytes
(Yottabyte) of data!!!
Detect and track changes in data during production
Invert data for reservoir properties
Detect and track reservoir changes
Assimilate data & reservoir properties into
the evolving reservoir model
Use simulation and optimization to guide future production
Oil Field Management – Joint ITR with Mary Wheeler,
Paul Stoffa
Coupled Ground Water and Surface Water Simulations
Multiple codes -- e.g. fluid code, contaminant
transport code
Different space and time scales
Data from a given fluid code run is used in different
contaminant transport code scenarios
Bioremediation Simulation
Microbe colonies
(magenta)
Dissolved NAPL (blue)
Mineral oxidation
products (green)
abiotic reactions
compete with
microbes,
reduce extent of
biodegradation
National Science Foundation Grand Challenge
in Land Cover Dynamics - 1994
• Remote sensing analysis of
high resolution satellite
images.
• Databases of land cover
dynamics are essential for
global carbon models,
biogeochemical cycling,
hydrological modeling and
ecosystem response
modeling
• Maps of the world's tropical
rain forest during the past
three decades.
Larry Davis , Rama Chellappa , Joel Saltz , Alan Sussman , John
Townshend
CenterforComprehensiveInformatics
Analysis of Computational Data; Uncertainty
Quantification, Comparisons with Experimental Results
Dimitri Mavriplis, Raja Das, Joel Saltz -- 1990’s
CenterforComprehensiveInformatics
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/‖Big Data‖ Computers,
Systems Software
• Analysis of Patient Populations
CenterforComprehensiveInformatics
Whole Slide Imaging: Scale
CenterforComprehensiveInformatics
Using TCGA Data to Study
Glioblastoma
Diagnostic Improvement
Molecular Classification
Predictors of Progression
Digital Pathology
Neuroimaging
TCGA Network
CenterforComprehensiveInformatics
Morphological Tissue Classification
Nuclei Segmentation
Cellular Features
Lee Cooper,
Jun Kong
Whole Slide Imaging
Oligodendroglioma Astrocytoma
Nuclear Qualities
Can we use image analysis of TCGA GBMs TO INFORM
diagnostic criteria based on molecular or clinical
endpoints?
Application: Oligodendroglioma Component in GBM
Millions of Nuclei Defined by n Features
• Top-down analysis: use the features
with existing diagnostic constructs
• Bottom-up analysis: let features define
and drive the analysis
TCGA Whole Slide Images
Jun Kong
Step 1:
Nuclei
Segmentation
• Identify individual nuclei
and their boundaries
Nuclear Analysis Workflow
• Describe individual nuclei in terms of size,
shape, and texture
Step 2:
Feature
Extraction
Step 1:
Nuclei
Segmentation
Oligodendroglioma Astrocytoma
Nuclear Qualities
1 10
Step 3:
Nuclei
Classification
Survival Analysis
Human Machine
Gene Expression Correlates of High Oligo-Astro
Ratio on Machine-based Classification
Oligo Related Genes
Myelin Basic Protein
Proteolipoprotein
HoxD1
Nuclear features most
Associated with Oligo
Signature Genes:
Circularity (high)
Eccentricity (low)
Millions of Nuclei Defined by n Features
• Top-down analysis: analyze features in
context of existing diagnostic constructs
• Bottom-up analysis: let nuclear features
define and drive the analysis
CenterforComprehensiveInformatics
Direct Study of Relationship Between
vs
Lee Cooper,
Carlos Moreno
CenterforComprehensiveInformatics
Consensus clustering of morphological
signatures
Study includes 200 million nuclei taken from 480
slides corresponding to 167 distinct patients
Each possibility evaluated using 2000 iterations of K-
means to quantify co-clustering
Nuclear Features Used to Classify GBMs
3 2 1
20 40 60 80 100 120 140 160
20
40
60
80
100
120
140
160
2 3 4 5 6 7
25
30
35
40
45
50
# Clusters
SilhouetteArea
0 0.5 1
1
2
3
Silhouette Value
Cluster
CenterforComprehensiveInformatics
Clustering identifies three morphological groups
• Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)
• Named for functions of associated genes:
Cell Cycle (CC), Chromatin Modification (CM),
Protein Biosynthesis (PB)
• Prognostically-significant (logrank p=4.5e-4)
FeatureIndices
CC CM PB
10
20
30
40
50
0 500 1000 1500 2000 2500 3000
0
0.2
0.4
0.6
0.8
1
Days
Survival
CC
CM
PB
Molecular Correlates of MR Features Using TCGA Data
MRIs of TCGA GBMs reviewed by 3-6 neuroradiologists using VASARI feature set and In
Vivo Imaging tools
MR Features compared to TCGA Transcriptional Classes and Genetic Alterations
David Gutman
VASARI
Feature Set
CenterforComprehensiveInformatics
46
Principal Investigator and Director: Haian Fu
Co-Directors: Fadlo R. Khuri, Joel Saltz
Project Manager: Margaret Johns
Aim 1 Leader
Yuhong Du
Aim 2 Leader
Carlos Moreno
Cancer
genomics-
based HT PPI
network
discovery &
validation
Genomics
informatics and
data integration
Emory CTD2 Center:
High throughput protein-protein interaction interrogation in cancer
Winship
Cancer
Institute
Center for
Comprehensive
InformaticsEmory
Chemical Biology
Discovery Center
Emory Molecular Interaction Center
for Functional Genomics (MicFG)
CenterforComprehensiveInformatics
a.k.a ―Big Data‖
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/”Big Data” Computers,
Systems Software
• Analysis of Patient Populations
CenterforComprehensiveInformatics
Titan – Peak Speed 30,000,000,000,000,000
floating point operations per second!
CenterforComprehensiveInformatics
CenterforComprehensiveInformatics
HPC Segmentation and Feature Extraction Pipeline
Tony Pan and George Teodoro
CenterforComprehensiveInformatics
Large Scale Data Management
 Represented by a complex data model capturing
multi-faceted information including markups,
annotations, algorithm provenance, specimen, etc.
 Support for complex relationships and spatial
query: multi-level granularities, relationships
between markups and annotations, spatial and
nested relationships
 Highly optimized spatial query and analyses
 Implemented in a variety of ways including
optimized CPU/GPU, Hadoop/HDFS and IBM DB2
Spatial Centric – Pathology Imaging “GIS”
Point query: human marked point
inside a nucleus
.
Window query: return markups
contained in a rectangle
Spatial join query: algorithm
validation/comparison
Containment query: nuclear feature
aggregation in tumor regions
CenterforComprehensiveInformatics
a.k.a ―Big Data‖
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/‖Big Data‖ Computers,
Systems Software
• Analysis of Patient Populations
CenterforComprehensiveInformatics
• Example Project: Find hot spots in readmissions
within 30 days
– What fraction of patients with a given principal diagnosis will
be readmitted within 30 days?
– What fraction of patients with a given set of diseases will be
readmitted within 30 days?
– How does severity and time course of co-morbidities affect
readmissions?
– Geographic analyses
• Compare and contrast with UHC Clinical Data Base
– Repeat analyses across all UHC hospitals
– Are we performing the same?
– How are UHC-curated groupings of patients (e.g., product
lines) useful?
Clinical Phenotype Characterization and the Emory
Analytic Information Warehouse
Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod
CenterforComprehensiveInformatics
Overall System
I2b2 Web
Server
I2b2
Database
Source
data
Database
Mapper
Source
data
Source
data
Data
Processing
Metadata
Manager
Metadata
Repository
Query
Specification
Investigator
Data Analyst
Data Analyst
Data Modeler
Investigator
Query tools
Study-
specific
Database
Investigator
CenterforComprehensiveInformatics
5-year Datasets from Emory and
University Healthcare Consortium
• EUH, EUHM and WW (inpatient encounters)
• Removed encounter pairs with chemotherapy and radiation
therapy readmit encounters (CDW data)
• Encounter location (down to unit for Emory)
• Providers (Emory only)
• Discharge disposition
• Primary and secondary ICD9 codes
• Procedure codes
• DRGs
• Medication orders (Emory only)
• Labs (Emory only)
• Vitals (Emory only)
• Geographic information (CDW only + US Census and American
Community Survey)
Analytic Information
CenterforComprehensiveInformatics
Using Emory & UHC Data to Find
Associations With 30-day Readmits
• Problem: ―Raw‖ clinical and administrative variables
are difficult to use for associative data mining
– Too many diagnosis codes, procedure codes
– Continuous variables (e.g., labs) require interpretation
– Temporal relationships between variables are implicit
• Solution: Transform the data into a much smaller set
of variables using heuristic knowledge
– Categorize diagnosis and procedure codes using code
hierarchies
– Classify continuous variables using standard
interpretations (e.g., high, normal, low)
– Identify temporal patterns (e.g., frequency, duration,
sequence)
– Apply standard data mining techniques
Analytic Information
Warehouse
CenterforComprehensiveInformatics
30-Day Readmission Rates for Derived
Variables
Emory Health Care
CenterforComprehensiveInformatics
Geographic Analyses
UHC Medicine General Product Line (#15)
Analytic Information Warehouse
CenterforComprehensiveInformatics
Predictive Modeling for Readmission
• Random forests (ensemble of decision trees)
– Create a decision tree using a random subset of the
variables in the dataset
– Generate a large number of such trees
– All trees vote to classify each test example in a
training dataset
– Generate a patient-specific readmission risk for each
encounter
• Rank the encounters by risk for a subsequent 30-
day readmission
Sharath Cholleti
CenterforComprehensiveInformatics
Emory Readmission Rates for High and
Low Risk Groups Generated with
Random Forest
CenterforComprehensiveInformatics
Predictive Modeling for 180 UHC Hospitals, 35 Million Patients
Identify High Risk Patients!
Readmission fraction of top 10% high risk patients
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
9
17
25
33
41
49
57
65
73
81
89
97
105
113
121
129
137
145
153
161
169
177
185
All Hospital Model
Individual Hospital
Model
Quasi-real-time display and analysis of physiologic
data from Emory University Hospital SICU
Burst of tachycardia,
no desaturation
Two episodes of
desaturation, no change
in heart rate
HR
SpO2
This slide is for orientation. Red data are the newest, green
intermediate, blue oldest. Frequency every 2 seconds.
We have started to construct alerts around
desaturation behaviors
(this image courtesy IBM)
CenterforComprehensiveInformatics
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/‖Big Data‖ Computers,
Systems Software
• Analysis of Patient Populations
CenterforComprehensiveInformatics
Thanks to:
• In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish
Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti,
Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom
Mikkelsen, Adam Flanders, Joel Saltz (Director)
• Digital Pathology R01 (s): Foran and Saltz; Jun Kong, Sharath
Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma,
David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang,
David J. Foran (Rutgers)
• Analytic Warehouse team: Andrew Post, Sharath Cholleti, Doris
Gao, Michel Monsour, Himanshu Rathod
• In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz
• NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich
Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max
Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John
Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl
Jaffe
• ACTSI Biomedical Informatics Program: Marc Overcash, Tim
Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis,
Sharon Mason, Andrew Post, Alfredo Tirado-Ramos
• ORNL HPC collaboration: Scott Klasky, David Pugmire ORNL
CenterforComprehensiveInformatics
Thanks to
• National Cancer Institute
• National Library of Medicine
• National Science Foundation
• Cardiovascular Research Grid (NHLBI)
• Minority Health Grid (ARRA)
• Emory Health Care
• Kaiser Health Care
• Winship Cancer Institute
• Oak Ridge National Laboratory
• Woodruff Health Sciences
Thanks!

More Related Content

What's hot

Detecting malaria using a deep convolutional neural network
Detecting malaria using a deep  convolutional neural networkDetecting malaria using a deep  convolutional neural network
Detecting malaria using a deep convolutional neural networkYusuf Brima
 
How deep learning reshapes medicine
How deep learning reshapes medicineHow deep learning reshapes medicine
How deep learning reshapes medicineHongyoon Choi
 
AN ANN BASED BRAIN ABNORMALITY DETECTION USING MR IMAGES
AN ANN BASED BRAIN ABNORMALITY DETECTION USING MR IMAGESAN ANN BASED BRAIN ABNORMALITY DETECTION USING MR IMAGES
AN ANN BASED BRAIN ABNORMALITY DETECTION USING MR IMAGEScscpconf
 
Integrative analysis of medical imaging and omics
Integrative analysis of medical imaging and omicsIntegrative analysis of medical imaging and omics
Integrative analysis of medical imaging and omicsHongyoon Choi
 
Identifying brain tumour from mri image using modified fcm and support
Identifying brain tumour from mri image using modified fcm and supportIdentifying brain tumour from mri image using modified fcm and support
Identifying brain tumour from mri image using modified fcm and supportIAEME Publication
 
The Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related SciencesThe Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related SciencesAshutosh Jogalekar
 
The Cancer imaging Phenomics Toolkit (CaPTk)
The Cancer imaging Phenomics Toolkit (CaPTk)The Cancer imaging Phenomics Toolkit (CaPTk)
The Cancer imaging Phenomics Toolkit (CaPTk)imgcommcall
 
Classification of Breast Masses Using Convolutional Neural Network as Feature...
Classification of Breast Masses Using Convolutional Neural Network as Feature...Classification of Breast Masses Using Convolutional Neural Network as Feature...
Classification of Breast Masses Using Convolutional Neural Network as Feature...Pinaki Ranjan Sarkar
 
Definiens In Digital Pathology Hr
Definiens In Digital Pathology HrDefiniens In Digital Pathology Hr
Definiens In Digital Pathology HrDaniel Nicolson
 
E-book Thesis Sara Carvalho
E-book Thesis  Sara CarvalhoE-book Thesis  Sara Carvalho
E-book Thesis Sara CarvalhoSara Carvalho
 
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...Kevin Mader
 
Breast Cancer Detection using Convolution Neural Network
Breast Cancer Detection using Convolution Neural NetworkBreast Cancer Detection using Convolution Neural Network
Breast Cancer Detection using Convolution Neural NetworkIRJET Journal
 
IRJET- Brain Tumor Detection using Hybrid Model of DCT DWT and Thresholding
IRJET- Brain Tumor Detection using Hybrid Model of DCT DWT and ThresholdingIRJET- Brain Tumor Detection using Hybrid Model of DCT DWT and Thresholding
IRJET- Brain Tumor Detection using Hybrid Model of DCT DWT and ThresholdingIRJET Journal
 
Large scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biologyLarge scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biologyMaté Ongenaert
 
Comparison of Image Segmentation Algorithms for Brain Tumor Detection
Comparison of Image Segmentation Algorithms for Brain Tumor DetectionComparison of Image Segmentation Algorithms for Brain Tumor Detection
Comparison of Image Segmentation Algorithms for Brain Tumor DetectionIJMTST Journal
 
An Efficient Brain Tumor Detection Algorithm based on Segmentation for MRI Sy...
An Efficient Brain Tumor Detection Algorithm based on Segmentation for MRI Sy...An Efficient Brain Tumor Detection Algorithm based on Segmentation for MRI Sy...
An Efficient Brain Tumor Detection Algorithm based on Segmentation for MRI Sy...ijtsrd
 

What's hot (20)

Grid Computing
Grid ComputingGrid Computing
Grid Computing
 
TCIA
TCIATCIA
TCIA
 
Detecting malaria using a deep convolutional neural network
Detecting malaria using a deep  convolutional neural networkDetecting malaria using a deep  convolutional neural network
Detecting malaria using a deep convolutional neural network
 
How deep learning reshapes medicine
How deep learning reshapes medicineHow deep learning reshapes medicine
How deep learning reshapes medicine
 
AN ANN BASED BRAIN ABNORMALITY DETECTION USING MR IMAGES
AN ANN BASED BRAIN ABNORMALITY DETECTION USING MR IMAGESAN ANN BASED BRAIN ABNORMALITY DETECTION USING MR IMAGES
AN ANN BASED BRAIN ABNORMALITY DETECTION USING MR IMAGES
 
Integrative analysis of medical imaging and omics
Integrative analysis of medical imaging and omicsIntegrative analysis of medical imaging and omics
Integrative analysis of medical imaging and omics
 
Identifying brain tumour from mri image using modified fcm and support
Identifying brain tumour from mri image using modified fcm and supportIdentifying brain tumour from mri image using modified fcm and support
Identifying brain tumour from mri image using modified fcm and support
 
The Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related SciencesThe Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related Sciences
 
The Cancer imaging Phenomics Toolkit (CaPTk)
The Cancer imaging Phenomics Toolkit (CaPTk)The Cancer imaging Phenomics Toolkit (CaPTk)
The Cancer imaging Phenomics Toolkit (CaPTk)
 
Classification of Breast Masses Using Convolutional Neural Network as Feature...
Classification of Breast Masses Using Convolutional Neural Network as Feature...Classification of Breast Masses Using Convolutional Neural Network as Feature...
Classification of Breast Masses Using Convolutional Neural Network as Feature...
 
Definiens In Digital Pathology Hr
Definiens In Digital Pathology HrDefiniens In Digital Pathology Hr
Definiens In Digital Pathology Hr
 
Madhavi
MadhaviMadhavi
Madhavi
 
E-book Thesis Sara Carvalho
E-book Thesis  Sara CarvalhoE-book Thesis  Sara Carvalho
E-book Thesis Sara Carvalho
 
Madhavi tippani
Madhavi tippaniMadhavi tippani
Madhavi tippani
 
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
 
Breast Cancer Detection using Convolution Neural Network
Breast Cancer Detection using Convolution Neural NetworkBreast Cancer Detection using Convolution Neural Network
Breast Cancer Detection using Convolution Neural Network
 
IRJET- Brain Tumor Detection using Hybrid Model of DCT DWT and Thresholding
IRJET- Brain Tumor Detection using Hybrid Model of DCT DWT and ThresholdingIRJET- Brain Tumor Detection using Hybrid Model of DCT DWT and Thresholding
IRJET- Brain Tumor Detection using Hybrid Model of DCT DWT and Thresholding
 
Large scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biologyLarge scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biology
 
Comparison of Image Segmentation Algorithms for Brain Tumor Detection
Comparison of Image Segmentation Algorithms for Brain Tumor DetectionComparison of Image Segmentation Algorithms for Brain Tumor Detection
Comparison of Image Segmentation Algorithms for Brain Tumor Detection
 
An Efficient Brain Tumor Detection Algorithm based on Segmentation for MRI Sy...
An Efficient Brain Tumor Detection Algorithm based on Segmentation for MRI Sy...An Efficient Brain Tumor Detection Algorithm based on Segmentation for MRI Sy...
An Efficient Brain Tumor Detection Algorithm based on Segmentation for MRI Sy...
 

Similar to Data Science, Big Data and You

Machine Learning and Deep Contemplation of Data
Machine Learning and Deep Contemplation of DataMachine Learning and Deep Contemplation of Data
Machine Learning and Deep Contemplation of DataJoel Saltz
 
High Dimensional Fused-Informatics
High Dimensional Fused-InformaticsHigh Dimensional Fused-Informatics
High Dimensional Fused-InformaticsJoel Saltz
 
Computational Pathology Workshop July 8 2014
Computational Pathology Workshop July 8 2014Computational Pathology Workshop July 8 2014
Computational Pathology Workshop July 8 2014Joel Saltz
 
Pathomics, Clinical Studies, and Cancer Surveillance
Pathomics, Clinical Studies, and Cancer SurveillancePathomics, Clinical Studies, and Cancer Surveillance
Pathomics, Clinical Studies, and Cancer SurveillanceJoel Saltz
 
BIOMAG2018 - Darren Price - CamCAN
BIOMAG2018 - Darren Price - CamCANBIOMAG2018 - Darren Price - CamCAN
BIOMAG2018 - Darren Price - CamCANRobert Oostenveld
 
Exascale Computing and Experimental Sensor Data
Exascale Computing and Experimental Sensor DataExascale Computing and Experimental Sensor Data
Exascale Computing and Experimental Sensor DataJoel Saltz
 
NCI HTAN, cancer trajectories, precision oncology
NCI HTAN, cancer trajectories, precision oncologyNCI HTAN, cancer trajectories, precision oncology
NCI HTAN, cancer trajectories, precision oncologyWarren Kibbe
 
The Importance Of Data Mining By Musa Mohd. Nordin, Noor
The Importance Of Data Mining By Musa Mohd. Nordin, NoorThe Importance Of Data Mining By Musa Mohd. Nordin, Noor
The Importance Of Data Mining By Musa Mohd. Nordin, Noormuzkara
 
Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Robert Grossman
 
Learning, Training,  Classification,  Common Sense and Exascale Computing
Learning, Training,  Classification,  Common Sense and Exascale ComputingLearning, Training,  Classification,  Common Sense and Exascale Computing
Learning, Training,  Classification,  Common Sense and Exascale ComputingJoel Saltz
 
FDA NGS and Big Data Conference September 2014
FDA NGS and Big Data Conference September 2014FDA NGS and Big Data Conference September 2014
FDA NGS and Big Data Conference September 2014Warren Kibbe
 
Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars
Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars
Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars Joel Saltz
 
Deep Learning for AI (3)
Deep Learning for AI (3)Deep Learning for AI (3)
Deep Learning for AI (3)Dongheon Lee
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesGuy Coates
 
Life sciences big data use cases
Life sciences big data use casesLife sciences big data use cases
Life sciences big data use casesGuy Coates
 
Challenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchChallenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchFranciscoJAzuajeG
 
Pathomics Based Biomarkers, Tools, and Methods
Pathomics Based Biomarkers, Tools, and MethodsPathomics Based Biomarkers, Tools, and Methods
Pathomics Based Biomarkers, Tools, and Methodsimgcommcall
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Robert Grossman
 
Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009Ian Foster
 

Similar to Data Science, Big Data and You (20)

Machine Learning and Deep Contemplation of Data
Machine Learning and Deep Contemplation of DataMachine Learning and Deep Contemplation of Data
Machine Learning and Deep Contemplation of Data
 
High Dimensional Fused-Informatics
High Dimensional Fused-InformaticsHigh Dimensional Fused-Informatics
High Dimensional Fused-Informatics
 
Computational Pathology Workshop July 8 2014
Computational Pathology Workshop July 8 2014Computational Pathology Workshop July 8 2014
Computational Pathology Workshop July 8 2014
 
Pathomics, Clinical Studies, and Cancer Surveillance
Pathomics, Clinical Studies, and Cancer SurveillancePathomics, Clinical Studies, and Cancer Surveillance
Pathomics, Clinical Studies, and Cancer Surveillance
 
BIOMAG2018 - Darren Price - CamCAN
BIOMAG2018 - Darren Price - CamCANBIOMAG2018 - Darren Price - CamCAN
BIOMAG2018 - Darren Price - CamCAN
 
Exascale Computing and Experimental Sensor Data
Exascale Computing and Experimental Sensor DataExascale Computing and Experimental Sensor Data
Exascale Computing and Experimental Sensor Data
 
NCI HTAN, cancer trajectories, precision oncology
NCI HTAN, cancer trajectories, precision oncologyNCI HTAN, cancer trajectories, precision oncology
NCI HTAN, cancer trajectories, precision oncology
 
The Importance Of Data Mining By Musa Mohd. Nordin, Noor
The Importance Of Data Mining By Musa Mohd. Nordin, NoorThe Importance Of Data Mining By Musa Mohd. Nordin, Noor
The Importance Of Data Mining By Musa Mohd. Nordin, Noor
 
01-pengantar.pdf
01-pengantar.pdf01-pengantar.pdf
01-pengantar.pdf
 
Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data
 
Learning, Training,  Classification,  Common Sense and Exascale Computing
Learning, Training,  Classification,  Common Sense and Exascale ComputingLearning, Training,  Classification,  Common Sense and Exascale Computing
Learning, Training,  Classification,  Common Sense and Exascale Computing
 
FDA NGS and Big Data Conference September 2014
FDA NGS and Big Data Conference September 2014FDA NGS and Big Data Conference September 2014
FDA NGS and Big Data Conference September 2014
 
Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars
Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars
Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars
 
Deep Learning for AI (3)
Deep Learning for AI (3)Deep Learning for AI (3)
Deep Learning for AI (3)
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciences
 
Life sciences big data use cases
Life sciences big data use casesLife sciences big data use cases
Life sciences big data use cases
 
Challenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchChallenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical research
 
Pathomics Based Biomarkers, Tools, and Methods
Pathomics Based Biomarkers, Tools, and MethodsPathomics Based Biomarkers, Tools, and Methods
Pathomics Based Biomarkers, Tools, and Methods
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
 
Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009
 

More from Joel Saltz

AI and whole slide imaging biomarkers
AI and whole slide imaging biomarkersAI and whole slide imaging biomarkers
AI and whole slide imaging biomarkersJoel Saltz
 
Integrative Everything, Deep Learning and Streaming Data
Integrative Everything, Deep Learning and Streaming DataIntegrative Everything, Deep Learning and Streaming Data
Integrative Everything, Deep Learning and Streaming DataJoel Saltz
 
Digital Pathology: Precision Medicine, Deep Learning and Computer Aided Inter...
Digital Pathology: Precision Medicine, Deep Learning and Computer Aided Inter...Digital Pathology: Precision Medicine, Deep Learning and Computer Aided Inter...
Digital Pathology: Precision Medicine, Deep Learning and Computer Aided Inter...Joel Saltz
 
Extreme Computing, Clinical Medicine and GPUs or Can GPUs Cure Cancer
Extreme Computing, Clinical Medicine and GPUs or Can GPUs Cure CancerExtreme Computing, Clinical Medicine and GPUs or Can GPUs Cure Cancer
Extreme Computing, Clinical Medicine and GPUs or Can GPUs Cure CancerJoel Saltz
 
Twenty Years of Whole Slide Imaging - the Coming Phase Change
Twenty Years of Whole Slide Imaging - the Coming Phase ChangeTwenty Years of Whole Slide Imaging - the Coming Phase Change
Twenty Years of Whole Slide Imaging - the Coming Phase ChangeJoel Saltz
 
Twenty Years of Whole Slide Imaging - the Coming Phase Change
Twenty Years of Whole Slide Imaging - the Coming Phase ChangeTwenty Years of Whole Slide Imaging - the Coming Phase Change
Twenty Years of Whole Slide Imaging - the Coming Phase ChangeJoel Saltz
 
Digital Pathology, FDA Approval and Precision Medicine
Digital Pathology, FDA Approval and Precision MedicineDigital Pathology, FDA Approval and Precision Medicine
Digital Pathology, FDA Approval and Precision MedicineJoel Saltz
 
Integrative Multi-Scale Analysis in Biomedical Data Science: Tools, Methods a...
Integrative Multi-Scale Analysis in Biomedical Data Science: Tools, Methods a...Integrative Multi-Scale Analysis in Biomedical Data Science: Tools, Methods a...
Integrative Multi-Scale Analysis in Biomedical Data Science: Tools, Methods a...Joel Saltz
 
Tools to Analyze Morphology and Spatially Mapped Molecular Data - Informatio...
Tools to Analyze Morphology and Spatially Mapped Molecular Data -  Informatio...Tools to Analyze Morphology and Spatially Mapped Molecular Data -  Informatio...
Tools to Analyze Morphology and Spatially Mapped Molecular Data - Informatio...Joel Saltz
 
Generation and Use of Quantitative Pathology Phenotype
Generation and Use of Quantitative Pathology PhenotypeGeneration and Use of Quantitative Pathology Phenotype
Generation and Use of Quantitative Pathology PhenotypeJoel Saltz
 
Big Data and Extreme Scale Computing
Big Data and Extreme Scale Computing Big Data and Extreme Scale Computing
Big Data and Extreme Scale Computing Joel Saltz
 
Spatio-­‐temporal Sensor Integration, Analysis, Classification or Can Exascal...
Spatio-­‐temporal Sensor Integration, Analysis, Classification or Can Exascal...Spatio-­‐temporal Sensor Integration, Analysis, Classification or Can Exascal...
Spatio-­‐temporal Sensor Integration, Analysis, Classification or Can Exascal...Joel Saltz
 
Data and Computational Challenges in Integrative Biomedical Informatics
Data and Computational Challenges in Integrative Biomedical InformaticsData and Computational Challenges in Integrative Biomedical Informatics
Data and Computational Challenges in Integrative Biomedical InformaticsJoel Saltz
 
Integrative Multi-Scale Analyses
Integrative Multi-Scale AnalysesIntegrative Multi-Scale Analyses
Integrative Multi-Scale AnalysesJoel Saltz
 
Biomedical Informatics Program -- Atlanta CTSA (ACTSI)
Biomedical Informatics Program -- Atlanta CTSA (ACTSI)Biomedical Informatics Program -- Atlanta CTSA (ACTSI)
Biomedical Informatics Program -- Atlanta CTSA (ACTSI)Joel Saltz
 
Role of Biomedical Informatics in Translational Cancer Research
Role of Biomedical Informatics in Translational Cancer ResearchRole of Biomedical Informatics in Translational Cancer Research
Role of Biomedical Informatics in Translational Cancer ResearchJoel Saltz
 
Extreme Spatio-Temporal Data Analysis
Extreme Spatio-Temporal Data AnalysisExtreme Spatio-Temporal Data Analysis
Extreme Spatio-Temporal Data AnalysisJoel Saltz
 
MICCAI - Workshop on High Performance and Distributed Computing for Medical I...
MICCAI - Workshop on High Performance and Distributed Computing for Medical I...MICCAI - Workshop on High Performance and Distributed Computing for Medical I...
MICCAI - Workshop on High Performance and Distributed Computing for Medical I...Joel Saltz
 
Presentation at UHC Annual Meeting
Presentation at UHC  Annual MeetingPresentation at UHC  Annual Meeting
Presentation at UHC Annual MeetingJoel Saltz
 
Indiana 4 2011 Final Final
Indiana 4 2011 Final FinalIndiana 4 2011 Final Final
Indiana 4 2011 Final FinalJoel Saltz
 

More from Joel Saltz (20)

AI and whole slide imaging biomarkers
AI and whole slide imaging biomarkersAI and whole slide imaging biomarkers
AI and whole slide imaging biomarkers
 
Integrative Everything, Deep Learning and Streaming Data
Integrative Everything, Deep Learning and Streaming DataIntegrative Everything, Deep Learning and Streaming Data
Integrative Everything, Deep Learning and Streaming Data
 
Digital Pathology: Precision Medicine, Deep Learning and Computer Aided Inter...
Digital Pathology: Precision Medicine, Deep Learning and Computer Aided Inter...Digital Pathology: Precision Medicine, Deep Learning and Computer Aided Inter...
Digital Pathology: Precision Medicine, Deep Learning and Computer Aided Inter...
 
Extreme Computing, Clinical Medicine and GPUs or Can GPUs Cure Cancer
Extreme Computing, Clinical Medicine and GPUs or Can GPUs Cure CancerExtreme Computing, Clinical Medicine and GPUs or Can GPUs Cure Cancer
Extreme Computing, Clinical Medicine and GPUs or Can GPUs Cure Cancer
 
Twenty Years of Whole Slide Imaging - the Coming Phase Change
Twenty Years of Whole Slide Imaging - the Coming Phase ChangeTwenty Years of Whole Slide Imaging - the Coming Phase Change
Twenty Years of Whole Slide Imaging - the Coming Phase Change
 
Twenty Years of Whole Slide Imaging - the Coming Phase Change
Twenty Years of Whole Slide Imaging - the Coming Phase ChangeTwenty Years of Whole Slide Imaging - the Coming Phase Change
Twenty Years of Whole Slide Imaging - the Coming Phase Change
 
Digital Pathology, FDA Approval and Precision Medicine
Digital Pathology, FDA Approval and Precision MedicineDigital Pathology, FDA Approval and Precision Medicine
Digital Pathology, FDA Approval and Precision Medicine
 
Integrative Multi-Scale Analysis in Biomedical Data Science: Tools, Methods a...
Integrative Multi-Scale Analysis in Biomedical Data Science: Tools, Methods a...Integrative Multi-Scale Analysis in Biomedical Data Science: Tools, Methods a...
Integrative Multi-Scale Analysis in Biomedical Data Science: Tools, Methods a...
 
Tools to Analyze Morphology and Spatially Mapped Molecular Data - Informatio...
Tools to Analyze Morphology and Spatially Mapped Molecular Data -  Informatio...Tools to Analyze Morphology and Spatially Mapped Molecular Data -  Informatio...
Tools to Analyze Morphology and Spatially Mapped Molecular Data - Informatio...
 
Generation and Use of Quantitative Pathology Phenotype
Generation and Use of Quantitative Pathology PhenotypeGeneration and Use of Quantitative Pathology Phenotype
Generation and Use of Quantitative Pathology Phenotype
 
Big Data and Extreme Scale Computing
Big Data and Extreme Scale Computing Big Data and Extreme Scale Computing
Big Data and Extreme Scale Computing
 
Spatio-­‐temporal Sensor Integration, Analysis, Classification or Can Exascal...
Spatio-­‐temporal Sensor Integration, Analysis, Classification or Can Exascal...Spatio-­‐temporal Sensor Integration, Analysis, Classification or Can Exascal...
Spatio-­‐temporal Sensor Integration, Analysis, Classification or Can Exascal...
 
Data and Computational Challenges in Integrative Biomedical Informatics
Data and Computational Challenges in Integrative Biomedical InformaticsData and Computational Challenges in Integrative Biomedical Informatics
Data and Computational Challenges in Integrative Biomedical Informatics
 
Integrative Multi-Scale Analyses
Integrative Multi-Scale AnalysesIntegrative Multi-Scale Analyses
Integrative Multi-Scale Analyses
 
Biomedical Informatics Program -- Atlanta CTSA (ACTSI)
Biomedical Informatics Program -- Atlanta CTSA (ACTSI)Biomedical Informatics Program -- Atlanta CTSA (ACTSI)
Biomedical Informatics Program -- Atlanta CTSA (ACTSI)
 
Role of Biomedical Informatics in Translational Cancer Research
Role of Biomedical Informatics in Translational Cancer ResearchRole of Biomedical Informatics in Translational Cancer Research
Role of Biomedical Informatics in Translational Cancer Research
 
Extreme Spatio-Temporal Data Analysis
Extreme Spatio-Temporal Data AnalysisExtreme Spatio-Temporal Data Analysis
Extreme Spatio-Temporal Data Analysis
 
MICCAI - Workshop on High Performance and Distributed Computing for Medical I...
MICCAI - Workshop on High Performance and Distributed Computing for Medical I...MICCAI - Workshop on High Performance and Distributed Computing for Medical I...
MICCAI - Workshop on High Performance and Distributed Computing for Medical I...
 
Presentation at UHC Annual Meeting
Presentation at UHC  Annual MeetingPresentation at UHC  Annual Meeting
Presentation at UHC Annual Meeting
 
Indiana 4 2011 Final Final
Indiana 4 2011 Final FinalIndiana 4 2011 Final Final
Indiana 4 2011 Final Final
 

Data Science, Big Data and You

  • 1. Joel Saltz MD, PhD Emory University February 2013 Data Science, Big Data and You
  • 2. CenterforComprehensiveInformatics Big Data • Social media— analysis of tweets and Facebook to observed trends in real time • Local Walgreens stock their shelves according to local tweets about cold symptoms • Credit card fraud—lost of transactions, but yet you get a flag that you shopped in a store that does not fit your profile—and within minutes your card is blocked.
  • 3. CenterforComprehensiveInformatics Big Data in Commerce - Fraud Detection • Seek unexpected data – outliers • Lots of data – all Amex, Visa or Mastercard transactions • Look for individual outliers – e.g. credit transaction involving large amount of money purchasing unusual product • Look for sequence data with temporal or spatial relationship -- find unusual sequence e.g., intrusion detection and cyber security
  • 4. CenterforComprehensiveInformatics • Define the ―typical‖ regions in a data set – may be difficult • ―Typical‖ behavior may change with time. What is typical today may be considered anomalous in future and vice versa. • (Smart) crooks will make ―keep under the radar‖ to try to stay undetected
  • 5. CenterforComprehensiveInformatics Approaches • Sometimes build a model from the training data and apply the model to detect outliers • Sometimes use the existing data directly to detect outliers
  • 6. CenterforComprehensiveInformatics Big Data Ecosystem 6 Credit: http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
  • 7.
  • 9. CenterforComprehensiveInformatics Early Big Data 1922 -Lewis Richardson Weather Forecasting
  • 10. CenterforComprehensiveInformatics • Integrative Spatio-Temporal Analytics • Deep Integrative Biomedical Research • High End Computing/‖Big Data‖ Computers, Systems Software • Analysis of Patient Populations
  • 12. CenterforComprehensiveInformatics Scientific Big Data Targets • Multi-dimensional spatial-temporal datasets – Biomedicine – Oil Reservoir Simulation/Carbon Sequestration/Groundwater Pollution Remediation – Biomass monitoring and disaster surveillance – Weather prediction – Analysis of Results from Large Scale Simulations • Correlative and cooperative analysis of data from multiple sensor modalities and sources • What-if scenarios and multiple design choices or initial conditions
  • 13. Emory In Silico Center for Brain Tumor Research (PI = Dan Brat, PD= Joel Saltz)
  • 14. CenterforComprehensiveInformatics Integrative Cancer Research with Digital Pathology histology neuroimaging clincalpathology Integrated Analysis molecular High-resolution whole-slide microscopy Multiplex IHC
  • 15. Integrative Analysis: OSU BISTI NBIB Center Big Data (2005) Associate genotype with phenotype Big science experiments on cancer, heart disease, pathogen host response Tissue specimen -- 1 cm3 0.3 μ resolution – roughly 1013 bytes Molecular data (spatial location) can add additional significant factor; e.g. 102 Multispectral imaging, laser captured microdissection, Imaging Mass Spec, Multiplex QD Multiple tissue specimens; another factor of 103 Total: 1018 bytes – exabyte per big science experiment
  • 16. A Data Intense Challenge: The Instrumented Oil Field of the Future
  • 17.
  • 18. The Tyranny of Scale (Tinsley Oden - U Texas) process scale field scale km cm simulation scale mm pore scale
  • 19. Why Applications Get Big • Physical world or simulation results • Detailed description of two, three (or more) dimensional space • High resolution in each dimension, lots of timesteps • e.g. oil reservoir code -- simulate 100 km by 100 km region to 1 km depth at resolution of 100 cm: – 10^6*10^6*10^4 mesh points, 10^2 bytes per mesh point, 10^6 timesteps --- 10^24 bytes (Yottabyte) of data!!!
  • 20. Detect and track changes in data during production Invert data for reservoir properties Detect and track reservoir changes Assimilate data & reservoir properties into the evolving reservoir model Use simulation and optimization to guide future production Oil Field Management – Joint ITR with Mary Wheeler, Paul Stoffa
  • 21. Coupled Ground Water and Surface Water Simulations Multiple codes -- e.g. fluid code, contaminant transport code Different space and time scales Data from a given fluid code run is used in different contaminant transport code scenarios
  • 22. Bioremediation Simulation Microbe colonies (magenta) Dissolved NAPL (blue) Mineral oxidation products (green) abiotic reactions compete with microbes, reduce extent of biodegradation
  • 23. National Science Foundation Grand Challenge in Land Cover Dynamics - 1994 • Remote sensing analysis of high resolution satellite images. • Databases of land cover dynamics are essential for global carbon models, biogeochemical cycling, hydrological modeling and ecosystem response modeling • Maps of the world's tropical rain forest during the past three decades. Larry Davis , Rama Chellappa , Joel Saltz , Alan Sussman , John Townshend
  • 24.
  • 25. CenterforComprehensiveInformatics Analysis of Computational Data; Uncertainty Quantification, Comparisons with Experimental Results Dimitri Mavriplis, Raja Das, Joel Saltz -- 1990’s
  • 26. CenterforComprehensiveInformatics • Integrative Spatio-Temporal Analytics • Deep Integrative Biomedical Research • High End Computing/‖Big Data‖ Computers, Systems Software • Analysis of Patient Populations
  • 29. Using TCGA Data to Study Glioblastoma Diagnostic Improvement Molecular Classification Predictors of Progression
  • 31. CenterforComprehensiveInformatics Morphological Tissue Classification Nuclei Segmentation Cellular Features Lee Cooper, Jun Kong Whole Slide Imaging
  • 32. Oligodendroglioma Astrocytoma Nuclear Qualities Can we use image analysis of TCGA GBMs TO INFORM diagnostic criteria based on molecular or clinical endpoints? Application: Oligodendroglioma Component in GBM
  • 33. Millions of Nuclei Defined by n Features • Top-down analysis: use the features with existing diagnostic constructs • Bottom-up analysis: let features define and drive the analysis
  • 34. TCGA Whole Slide Images Jun Kong Step 1: Nuclei Segmentation • Identify individual nuclei and their boundaries
  • 35. Nuclear Analysis Workflow • Describe individual nuclei in terms of size, shape, and texture Step 2: Feature Extraction Step 1: Nuclei Segmentation
  • 36. Oligodendroglioma Astrocytoma Nuclear Qualities 1 10 Step 3: Nuclei Classification
  • 38. Gene Expression Correlates of High Oligo-Astro Ratio on Machine-based Classification Oligo Related Genes Myelin Basic Protein Proteolipoprotein HoxD1 Nuclear features most Associated with Oligo Signature Genes: Circularity (high) Eccentricity (low)
  • 39. Millions of Nuclei Defined by n Features • Top-down analysis: analyze features in context of existing diagnostic constructs • Bottom-up analysis: let nuclear features define and drive the analysis
  • 40. CenterforComprehensiveInformatics Direct Study of Relationship Between vs Lee Cooper, Carlos Moreno
  • 41. CenterforComprehensiveInformatics Consensus clustering of morphological signatures Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients Each possibility evaluated using 2000 iterations of K- means to quantify co-clustering Nuclear Features Used to Classify GBMs 3 2 1 20 40 60 80 100 120 140 160 20 40 60 80 100 120 140 160 2 3 4 5 6 7 25 30 35 40 45 50 # Clusters SilhouetteArea 0 0.5 1 1 2 3 Silhouette Value Cluster
  • 42. CenterforComprehensiveInformatics Clustering identifies three morphological groups • Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides) • Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB) • Prognostically-significant (logrank p=4.5e-4) FeatureIndices CC CM PB 10 20 30 40 50 0 500 1000 1500 2000 2500 3000 0 0.2 0.4 0.6 0.8 1 Days Survival CC CM PB
  • 43. Molecular Correlates of MR Features Using TCGA Data MRIs of TCGA GBMs reviewed by 3-6 neuroradiologists using VASARI feature set and In Vivo Imaging tools MR Features compared to TCGA Transcriptional Classes and Genetic Alterations David Gutman
  • 46. 46 Principal Investigator and Director: Haian Fu Co-Directors: Fadlo R. Khuri, Joel Saltz Project Manager: Margaret Johns Aim 1 Leader Yuhong Du Aim 2 Leader Carlos Moreno Cancer genomics- based HT PPI network discovery & validation Genomics informatics and data integration Emory CTD2 Center: High throughput protein-protein interaction interrogation in cancer Winship Cancer Institute Center for Comprehensive InformaticsEmory Chemical Biology Discovery Center Emory Molecular Interaction Center for Functional Genomics (MicFG)
  • 47. CenterforComprehensiveInformatics a.k.a ―Big Data‖ • Integrative Spatio-Temporal Analytics • Deep Integrative Biomedical Research • High End Computing/”Big Data” Computers, Systems Software • Analysis of Patient Populations
  • 48. CenterforComprehensiveInformatics Titan – Peak Speed 30,000,000,000,000,000 floating point operations per second!
  • 50. CenterforComprehensiveInformatics HPC Segmentation and Feature Extraction Pipeline Tony Pan and George Teodoro
  • 51. CenterforComprehensiveInformatics Large Scale Data Management  Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.  Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships  Highly optimized spatial query and analyses  Implemented in a variety of ways including optimized CPU/GPU, Hadoop/HDFS and IBM DB2
  • 52. Spatial Centric – Pathology Imaging “GIS” Point query: human marked point inside a nucleus . Window query: return markups contained in a rectangle Spatial join query: algorithm validation/comparison Containment query: nuclear feature aggregation in tumor regions
  • 53. CenterforComprehensiveInformatics a.k.a ―Big Data‖ • Integrative Spatio-Temporal Analytics • Deep Integrative Biomedical Research • High End Computing/‖Big Data‖ Computers, Systems Software • Analysis of Patient Populations
  • 54. CenterforComprehensiveInformatics • Example Project: Find hot spots in readmissions within 30 days – What fraction of patients with a given principal diagnosis will be readmitted within 30 days? – What fraction of patients with a given set of diseases will be readmitted within 30 days? – How does severity and time course of co-morbidities affect readmissions? – Geographic analyses • Compare and contrast with UHC Clinical Data Base – Repeat analyses across all UHC hospitals – Are we performing the same? – How are UHC-curated groupings of patients (e.g., product lines) useful? Clinical Phenotype Characterization and the Emory Analytic Information Warehouse Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod
  • 56. CenterforComprehensiveInformatics 5-year Datasets from Emory and University Healthcare Consortium • EUH, EUHM and WW (inpatient encounters) • Removed encounter pairs with chemotherapy and radiation therapy readmit encounters (CDW data) • Encounter location (down to unit for Emory) • Providers (Emory only) • Discharge disposition • Primary and secondary ICD9 codes • Procedure codes • DRGs • Medication orders (Emory only) • Labs (Emory only) • Vitals (Emory only) • Geographic information (CDW only + US Census and American Community Survey) Analytic Information
  • 57. CenterforComprehensiveInformatics Using Emory & UHC Data to Find Associations With 30-day Readmits • Problem: ―Raw‖ clinical and administrative variables are difficult to use for associative data mining – Too many diagnosis codes, procedure codes – Continuous variables (e.g., labs) require interpretation – Temporal relationships between variables are implicit • Solution: Transform the data into a much smaller set of variables using heuristic knowledge – Categorize diagnosis and procedure codes using code hierarchies – Classify continuous variables using standard interpretations (e.g., high, normal, low) – Identify temporal patterns (e.g., frequency, duration, sequence) – Apply standard data mining techniques Analytic Information Warehouse
  • 58. CenterforComprehensiveInformatics 30-Day Readmission Rates for Derived Variables Emory Health Care
  • 59. CenterforComprehensiveInformatics Geographic Analyses UHC Medicine General Product Line (#15) Analytic Information Warehouse
  • 60. CenterforComprehensiveInformatics Predictive Modeling for Readmission • Random forests (ensemble of decision trees) – Create a decision tree using a random subset of the variables in the dataset – Generate a large number of such trees – All trees vote to classify each test example in a training dataset – Generate a patient-specific readmission risk for each encounter • Rank the encounters by risk for a subsequent 30- day readmission Sharath Cholleti
  • 61. CenterforComprehensiveInformatics Emory Readmission Rates for High and Low Risk Groups Generated with Random Forest
  • 62. CenterforComprehensiveInformatics Predictive Modeling for 180 UHC Hospitals, 35 Million Patients Identify High Risk Patients! Readmission fraction of top 10% high risk patients 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 9 17 25 33 41 49 57 65 73 81 89 97 105 113 121 129 137 145 153 161 169 177 185 All Hospital Model Individual Hospital Model
  • 63. Quasi-real-time display and analysis of physiologic data from Emory University Hospital SICU
  • 64. Burst of tachycardia, no desaturation Two episodes of desaturation, no change in heart rate HR SpO2 This slide is for orientation. Red data are the newest, green intermediate, blue oldest. Frequency every 2 seconds.
  • 65. We have started to construct alerts around desaturation behaviors (this image courtesy IBM)
  • 66. CenterforComprehensiveInformatics • Integrative Spatio-Temporal Analytics • Deep Integrative Biomedical Research • High End Computing/‖Big Data‖ Computers, Systems Software • Analysis of Patient Populations
  • 67. CenterforComprehensiveInformatics Thanks to: • In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director) • Digital Pathology R01 (s): Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers) • Analytic Warehouse team: Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod • In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz • NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe • ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos • ORNL HPC collaboration: Scott Klasky, David Pugmire ORNL
  • 68. CenterforComprehensiveInformatics Thanks to • National Cancer Institute • National Library of Medicine • National Science Foundation • Cardiovascular Research Grid (NHLBI) • Minority Health Grid (ARRA) • Emory Health Care • Kaiser Health Care • Winship Cancer Institute • Oak Ridge National Laboratory • Woodruff Health Sciences

Editor's Notes

  1. Combine with next slide.Graphical representation