2. High Dimensional Fused-Informatics
Joel Saltz MD, PhD
Chair Biomedical Informatics Stony
Brook University
Associate Director for Informatics,
Stony Brook Cancer Center
3. Integrative Biomedical Informatics Analysis
• Reproducible
anatomic/functional
characterization at fine
level (Pathology) and gross
level (Radiology)
• Integration of
anatomic/functional
characterization, multiple
types of “omic”
information, outcome
• Predict treatment outcome,
select, monitor treatments
• Integrated analysis and
presentation of
observations, features
[Diagram labels: Radiology Imaging, Patient Outcome, Pathologic Features, “Omic” Data]
4. Pathology and Radiology imaging have different
properties in roles of discovery and aggressiveness
potential
• Differences
– arise from differing capabilities & need not completely
correspond
– sampling differences & global properties
– differing purposes
• discovery, staging, IMRT/brachyRx planning
– Pathology – high spatial and increasing molecular
resolution
– Radiology – global view, temporal information,
increasing spatial resolution
Carl Jaffe
6. Correlating Imaging Phenotypes with Genomic
Signatures: Scientific Opportunities
(Imaging Genomics Workshop NCI June 2013)
Clinical Approach and Use
• Development of imaging+analysis methods to
characterize heterogeneity
• within a tumor at one time point
• evolution over time
• among different tumor types
• Development of imaging metrics that:
• can predict and detect the emergence of resistance?
• correlate with genomic heterogeneity?
• correlate with habitat heterogeneity?
• can identify more homogeneous sub-types
10. Quantitative Feature Analysis in Pathology: Emory In Silico
Center for Brain Tumor Research (PI = Dan Brat, PD= Joel Saltz)
NLM/NCI: Integrative Analysis/Digital Pathology R01LM011119,
R01LM009239 (Dual PIs Joel Saltz, David Foran)
11. Millions of Nuclei Defined by n Features
• Top-down analysis: analyze features in
context of existing diagnostic constructs
• Bottom-up analysis: let nuclear features
define and drive the analysis
12. Direct Study of Relationship Between vs
Lee Cooper,
Carlos Moreno
13. Clustering identifies three morphological groups
• Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)
• Named for functions of associated genes:
Cell Cycle (CC), Chromatin Modification (CM),
Protein Biosynthesis (PB)
• Prognostically-significant (logrank p=4.5e-4)
[Figure: feature indices (10 to 50) plotted against the CC, CM, and PB groups; survival curves (Survival, 0 to 1, vs. Days, 0 to 3000) for CC, CM, and PB]
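The slides do not say which clustering algorithm produced the three groups. As a minimal sketch only, the bottom-up idea of letting feature vectors define the groups can be illustrated with plain k-means, which is swapped in here as an assumption. The two-dimensional feature vectors and the function name `kmeans` are hypothetical.

```python
import math

def kmeans(points, k, iters=50):
    """Lloyd's k-means. Initial centroids are the first k points,
    an assumption made here so the result is deterministic."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical per-tumor feature vectors forming three obvious groups.
features = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0),
            (0.5, 0.2), (10.2, 9.8), (19.7, 0.3)]
labels, centroids = kmeans(features, k=3)
print(labels)
```

In a real study the vectors would have many more dimensions and the group count would need to be justified, for example by the survival comparison the slide reports.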
15. Millions of Nuclei Defined by n Features
• Top-down analysis: use the features
with existing diagnostic constructs
• Bottom-up analysis: let features define
and drive the analysis
16. Nuclear Analysis Workflow
• Describe individual nuclei in terms of size,
shape, and texture
Step 1: Nuclei Segmentation
Step 2: Feature Extraction
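As a rough sketch of the feature-extraction step, assuming segmentation has already produced each nucleus boundary as a polygon of (x, y) vertices, size and one shape descriptor can be computed as below. The function names and the sample outlines are hypothetical; texture features need pixel intensities and are not shown.

```python
import math

def polygon_area_perimeter(vertices):
    """Shoelace area and boundary length of a simple polygon."""
    area2 = 0.0
    perimeter = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the outline
        area2 += x0 * y1 - x1 * y0
        perimeter += math.hypot(x1 - x0, y1 - y0)
    return abs(area2) / 2.0, perimeter

def circularity(vertices):
    """4*pi*A / P^2: 1.0 for a perfect circle, lower for elongated outlines."""
    area, perimeter = polygon_area_perimeter(vertices)
    return 4.0 * math.pi * area / (perimeter ** 2)

# Two hypothetical outlines with the same area but different shape.
compact = [(0, 0), (4, 0), (4, 4), (0, 4)]
elongated = [(0, 0), (16, 0), (16, 1), (0, 1)]
print(polygon_area_perimeter(compact), circularity(compact))
print(polygon_area_perimeter(elongated), circularity(elongated))
```

Both outlines have area 16, so size alone does not separate them; the circularity value does.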
19. Gene Expression Correlates of High Oligo-Astro
Ratio on Machine-based Classification
Oligo Related Genes
Myelin Basic Protein
Proteolipoprotein
HoxD1
Nuclear features most
Associated with Oligo
Signature Genes:
Circularity (high)
Eccentricity (low)
20. Role of Microenvironment
• Necrosis in TCGA GBM tissue
samples vs. Verhaak
transcriptional class
• Mesenchymal
transcriptional class --
greater levels of necrosis
than other classes
• Gene expression signatures
of nonmesenchymal GBMs
became more similar to the
mesenchymal signature
with increasing levels of
necrosis
21. Microenvironment and Master Regulators
• Extent of Necrosis Related Expression of
Master Regulators of the Mesenchymal
Transition
Necrosis and C/EBP-β
22. Computation and Data Management:
Requirements and Challenges
• Explosion of derived data
– 10^5 x 10^5 pixels per image
– 1 million objects per image
– Hundreds to thousands of images per study
• High computational complexity
– Image analysis, feature extraction, machine learning
pipelines
– Spatial queries involve heavy duty geometric computations
23. Projection – 2025
• 100K – 1M pathology slides/hospital/year
• 2GB compressed per slide
• 1-10 slides used for Pathologist computer
aided diagnosis
• 100-10K slides used in hospital Quality control
• Groups of 100K+ slides used for clinical
research studies -- Combined with molecular,
outcome data
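The storage implied by these projections can be worked out directly. The short calculation below assumes decimal units (1 GB = 10^9 bytes), which the slide does not specify; the function name is hypothetical.

```python
GB = 10**9  # decimal gigabyte (assumption; the slide does not say GB vs GiB)

def yearly_storage_bytes(slides_per_year, bytes_per_slide=2 * GB):
    """Compressed storage needed for one hospital-year of slides."""
    return slides_per_year * bytes_per_slide

low = yearly_storage_bytes(100_000)      # 100K slides per year
high = yearly_storage_bytes(1_000_000)   # 1M slides per year
print(low / 10**12, "TB")    # 200.0 TB
print(high / 10**15, "PB")   # 2.0 PB
```

By the same arithmetic, a research study over a group of 100K slides touches about 200 TB of compressed image data before any derived features are counted.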
25. HPC Whole Slide Segmentation and
Feature Extraction Pipeline
Tony Pan, George Teodoro,
Tahsin Kurc and Scott Klasky
26. Titan – Peak Speed
30,000,000,000,000,000 floating
point operations per second!
27. Large Scale Data Management
Data model capturing multi-faceted information
including markups, annotations, algorithm
provenance, specimen, etc.
Support for complex relationships and spatial query:
multi-level granularities, relationships between
markups and annotations, spatial and nested
relationships
Highly optimized spatial query and analyses
Implemented in a variety of ways including optimized
CPU/GPU, Hadoop/HDFS and IBM DB2
28. Spatial Centric – Pathology Imaging “GIS”
Point query: human marked point
inside a nucleus
Window query: return markups
contained in a rectangle
Spatial join query: algorithm
validation/comparison
Containment query: nuclear feature
aggregation in tumor regions
Fusheng Wang
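The four query types can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the system described in the slides: markups are simple polygons held in dictionaries, the window and join queries compare bounding boxes only, and all names are hypothetical.

```python
def point_in_polygon(pt, poly):
    """Ray-casting test; points exactly on the boundary are not handled specially."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x0, y0 = poly[i]
        x1, y1 = poly[(i + 1) % n]
        if (y0 > y) != (y1 > y):  # edge crosses the horizontal line through pt
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside

def bbox(poly):
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    return min(xs), min(ys), max(xs), max(ys)

def window_query(markups, window):
    """Ids of markups whose bounding box lies inside window (xmin, ymin, xmax, ymax)."""
    wx0, wy0, wx1, wy1 = window
    hits = []
    for mid, poly in markups.items():
        x0, y0, x1, y1 = bbox(poly)
        if wx0 <= x0 and wy0 <= y0 and x1 <= wx1 and y1 <= wy1:
            hits.append(mid)
    return hits

def boxes_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def spatial_join(markups_a, markups_b):
    """Pairs from two result sets whose bounding boxes overlap (a coarse filter only)."""
    return [(ia, ib)
            for ia, pa in markups_a.items()
            for ib, pb in markups_b.items()
            if boxes_overlap(bbox(pa), bbox(pb))]

def containment_mean(nuclei, feature, region):
    """Mean feature value over nuclei whose vertices all fall inside region."""
    vals = [feature[nid] for nid, poly in nuclei.items()
            if all(point_in_polygon(v, region) for v in poly)]
    return sum(vals) / len(vals) if vals else None

# Hypothetical outputs of two segmentation algorithms, plus a tumor region.
algo_a = {"a1": [(1, 1), (3, 1), (3, 3), (1, 3)],
          "a2": [(10, 10), (12, 10), (12, 12), (10, 12)]}
algo_b = {"b1": [(2, 2), (4, 2), (4, 4), (2, 4)],
          "b2": [(20, 20), (21, 20), (21, 21), (20, 21)]}
circ = {"a1": 0.8, "a2": 0.6}
tumor = [(0, 0), (6, 0), (6, 6), (0, 6)]

print(point_in_polygon((2, 2), algo_a["a1"]))   # point query
print(window_query(algo_a, (0, 0, 5, 5)))       # window query
print(spatial_join(algo_a, algo_b))             # spatial join query
print(containment_mean(algo_a, circ, tumor))    # containment query
```

A production system would replace the nested loops with a spatial index and test exact polygon geometry after the bounding-box filter.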
29. PAIS (Pathology Analytical Imaging Standards)
• PAIS Logical Model
– 62 UML classes
– markups, annotations,
imageReferences,
provenance
• PAIS Data Representation
– XML (compressed) or HDF5
• PAIS Databases
– loading, managing and
querying and sharing data
– Native XML DBMS or
RDBMS + SDBMS
[Figure: PAIS domain model class diagram (title truncated in source: "class Domain Mo..."). Classes shown: Annotation, GeometricShape, Calculation, Observation, Specimen, ImageReference, Provenance, User, PAIS, Equipment, Group, AnatomicEntity, Subject, Field, Project, MicroscopyImageReference, DICOMImageReference, TMAImageReference, Markup, Inference, Region, WholeSlideImageReference, Patient, Surface, Collection, AnnotationReference. Associations carry multiplicities of 1, 0..1, or 0..*.]
Fusheng Wang
30. High Performance Spatial Queries
and Analytics: Hadoop-GIS
General framework to support high performance spatial
queries and analytics for spatial big data on MapReduce
and CPU-GPU hybrid platforms
• Spatial data processing methods and pipelines with spatial
partition level parallelism running on MapReduce
• Multi-level indexing methods to accelerate spatial data
processing
• Declarative spatial queries and translation into MapReduce
operations
• Utilize GPU to parallelize spatial operations and integrate them
into MapReduce
[VLDB’12, GIS’12, GIS’13, VLDB’13]
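Partition-level parallelism for a spatial join can be sketched without Hadoop. The code below is not the Hadoop-GIS API; it is a single-process illustration, with hypothetical names, of the general pattern: a "map" phase routes each bounding box to the grid tiles it touches, and a "reduce" phase joins each tile independently.

```python
from collections import defaultdict

def tiles_for(box, tile):
    """Yield (col, row) ids of every grid tile of side `tile` that the box touches."""
    x0, y0, x1, y1 = box
    for col in range(int(x0 // tile), int(x1 // tile) + 1):
        for row in range(int(y0 // tile), int(y1 // tile) + 1):
            yield col, row

def boxes_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def partitioned_join(set_a, set_b, tile=10):
    # "Map": send each box to every tile it touches, keeping the two inputs apart.
    buckets = defaultdict(lambda: ([], []))
    for key, box in set_a.items():
        for t in tiles_for(box, tile):
            buckets[t][0].append((key, box))
    for key, box in set_b.items():
        for t in tiles_for(box, tile):
            buckets[t][1].append((key, box))
    # "Reduce": tiles share no state, so each could run on a separate worker.
    pairs = set()
    for part_a, part_b in buckets.values():
        for ka, ba in part_a:
            for kb, bb in part_b:
                if boxes_overlap(ba, bb):
                    pairs.add((ka, kb))  # the set drops duplicates from multi-tile boxes
    return sorted(pairs)

# Hypothetical bounding boxes (xmin, ymin, xmax, ymax) from two analyses.
set_a = {"a1": (1, 1, 3, 3), "a2": (8, 8, 12, 12)}
set_b = {"b1": (2, 2, 4, 4), "b2": (11, 11, 13, 13), "b3": (30, 30, 31, 31)}
print(partitioned_join(set_a, set_b))
```

Box "a2" straddles four tiles, which is why the result is collected in a set: a pair found in more than one tile must be reported once.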
31. MICCAI 2014
BRAIN TUMOR
Classification and Segmentation Challenges
TCGA
TCIA
IMAGING
CHALLENGE
DIGITAL PATHOLOGY
CHALLENGE
Phase 1: Training
June 20 - July 31
Phase 2: Leader Board
Aug 1 - Aug 29
Phase 3: Test
Sept 8 - Sept 12
For more information about these challenges and a related workshop
on September 14, 2014 at MICCAI in Boston, see: cancerimagingarchive.net
MICCAI: Medical Image Computing and Computer Assisted Intervention - MICCAI2014.org
TCGA: The Cancer Genome Atlas - cancergenome.nih.gov
TCIA: The Cancer Imaging Archive - cancerimagingarchive.net
32. Digital Pathology/Brain Tumor
Image Segmentation (BRATS)
• Use data currently available through data archive resources of
the National Institutes of Health (NIH), namely, the Cancer
Genome Atlas (TCGA) and the Cancer Imaging Archive (TCIA)
• Digital Pathology challenge will use digital slides related to
patients whose genomics data are available from TCGA.
Similarly, BRATS 2014 Challenge will use clinical MRI image
data, also from the TCGA study subjects.
• Proposed outcome of RSNA/ASCP workshop
– Coordinated Pathology/Radiology 2015 challenge –
feature selection and statistical/machine learning
algorithms to leverage Radiology, Pathology and “omic”
features to predict outcome, response to treatment
10 billion pixels
1 million markups, 100 million features
Quadrillion pixels
10 trillion features
1) Metadata about images
2) Metadata about image targets, how images are derived (patient, specimen, anatomicEntity, etc.)
3) Metadata about analyses (the purpose of the analysis, who performed the analysis, etc.)
4) Image markups -- a markup delineates a spatial region (e.g., as points, lines, polygons, multi-polygons) in images
5) Annotation: image features -- a type of annotation calculated or derived from the markups
6) Annotation: observation -- an annotation associates semantic meaning to markup entities through coded or free text terms that provide explanatory or descriptive information
7) Provenance information, i.e., the derivation history of a markup or annotation, including algorithm information, parameters, and inputs
Native XML database based approach
– Small sized PAIS documents, e.g., organ, tissue, or region level annotations
– No mapping needed, support standard XML queries
Relational and spatial database approach
– For large scale PAIS documents, e.g., analysis results at cellular or subcellular level
– Data mapped into relational tables and spatial objects
– Highly efficient on storage and queries
Instead, we develop a system called Hadoop-GIS, and provide a generic framework to support high performance spatial queries and analytics for spatial big data on MapReduce and CPU-GPU hybrid systems.
Hadoop-GIS provides: …