Extreme Spatio-Temporal Data Analysis

Talk delivered at XLDB Asia, June 2012

Published in: Education, Health & Medicine, Technology

  1. Extreme Spatio Temporal Data Analysis in Biomedical Informatics. Joel Saltz MD, PhD, Director, Center for Comprehensive Informatics
  2. Contributions • Computer Science: Methods and middleware for analysis, classification of very large datasets from low dimensional spatio-temporal sensors; methods to carry out comparisons and change detection between sensor datasets • Biomedical: Mine whole slide image datasets to better predict outcome and response to treatments, generate basic insights into pathophysiology and identify new treatment targets
  3. Outline of Talk • Pathology: Analysis of Digitized Tissue for Research and Practice • Feature Clustering: Morphologic Tumor Subtypes in GBM Brain Tumors and Relationship to “omic” classifications • Whole Slide Image Analysis in Clinical Practice: Neuroblastoma • Tissue Flow: Multiplex Quantum Dot • HPC/BIGDATA Feature Pipeline • Pathology data analytic tools and techniques
  4. Whole Slide Imaging: Scale
  5. Pathology Computer Assisted Diagnosis. Shimada, Gurcan, Kong, Saltz
  6. Computerized Classification System for Grading Neuroblastoma • Background Identification • Image Decomposition (multi-resolution levels) • Image Segmentation (EMLDA) • Feature Construction (2nd order statistics, tonal features) • Feature Extraction (LDA) + Classification (Bayesian) • Multi-resolution Layer Controller (confidence region) [Flowchart, with TRAINING and TESTING paths: initialize I = L; for each image tile, test "Background?" and label it if yes; otherwise create image I(L), then segmentation, feature construction, feature extraction, classification; if the result is not within the confidence region and I > 1, set I = I - 1 and repeat. The training path runs down-sampling, segmentation, feature construction, feature extraction and classifier training on the training tiles.]
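The multi-resolution controller on this slide can be illustrated with a small sketch. It is not the authors' implementation: the intensity-threshold background test, the 2x2 averaging pyramid, the assumption that level L is the coarsest level, and the names `classify_tile`, `classify` and `min_confidence` are all hypothetical choices made for illustration. The EMLDA segmentation, LDA feature extraction and Bayesian classifier are stood in for by a caller-supplied `classify` function.

```python
from typing import Callable, List, Tuple

Tile = List[List[float]]  # grayscale intensities in [0, 1], even dimensions


def downsample(tile: Tile) -> Tile:
    """Halve the resolution by averaging each 2x2 block."""
    return [
        [(tile[r][c] + tile[r][c + 1] + tile[r + 1][c] + tile[r + 1][c + 1]) / 4.0
         for c in range(0, len(tile[0]), 2)]
        for r in range(0, len(tile), 2)
    ]


def is_background(tile: Tile, threshold: float = 0.9) -> bool:
    """Assumed background test: a tile that is almost entirely bright."""
    mean = sum(map(sum, tile)) / (len(tile) * len(tile[0]))
    return mean > threshold


def classify_tile(
    tile: Tile,
    classify: Callable[[Tile], Tuple[str, float]],
    levels: int,
    min_confidence: float = 0.8,
) -> Tuple[str, float, int]:
    """Return (label, confidence, level at which the decision was made)."""
    if is_background(tile):
        return ("background", 1.0, 0)
    # pyramid[0] is full resolution (level 1); pyramid[levels - 1] is coarsest.
    pyramid = [tile]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    level = levels  # I = L
    while True:
        label, confidence = classify(pyramid[level - 1])
        # Stop when the decision is confident or no finer level remains.
        if confidence >= min_confidence or level == 1:
            return (label, confidence, level)
        level -= 1  # I = I - 1


def stub_classifier(image: Tile) -> Tuple[str, float]:
    """Hypothetical classifier: only confident at full (4x4) resolution."""
    return ("class_a", 0.95 if len(image) >= 4 else 0.5)


tissue = [[0.2] * 4 for _ in range(4)]
blank = [[1.0] * 4 for _ in range(4)]
print(classify_tile(tissue, stub_classifier, levels=2))
print(classify_tile(blank, stub_classifier, levels=2))
```

The point of the design is that most tiles can be decided cheaply at a coarse level, and only ambiguous tiles pay for processing at finer resolution.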
  7. [Figure only; no slide text]
  8. Direct Study of Relationship Between [image] vs [image]
  9. In Silico Brain Tumor Center: Anaplastic Astrocytoma (WHO grade III), Glioblastoma (WHO grade IV)
  10. Morphological Tissue Classification: Whole Slide Imaging, Cellular Features, Nuclei Segmentation. Lee Cooper, Jun Kong
  11. Nuclear Features Used to Classify GBMs • Consensus clustering of morphological signatures • Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients • Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering [Figure: silhouette area vs. number of clusters; co-clustering heatmap; silhouette value per cluster]
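As an illustration of "iterations of K-means to quantify co-clustering", the sketch below builds a co-clustering (consensus) matrix: the fraction of runs in which two samples land in the same cluster, counted over runs in which both were drawn. It is a minimal stdlib-only sketch, not the study's code; the 80% subsampling rate, the plain Lloyd's-algorithm K-means, and the function names are assumptions.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans_labels(points: Sequence[Point], k: int, rng: random.Random,
                  iters: int = 20) -> List[int]:
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    centers = rng.sample(list(points), k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def co_clustering_matrix(points: Sequence[Point], k: int, runs: int = 50,
                         subsample: float = 0.8, seed: int = 0) -> List[List[float]]:
    """M[i][j] = runs with i and j in the same cluster / runs containing both."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    m = max(k, int(subsample * n))
    for _ in range(runs):
        idx = rng.sample(range(n), m)
        labels = kmeans_labels([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]


# Two well-separated 1-D groups: indices 0-2 and 3-5.
data = [(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)]
consensus = co_clustering_matrix(data, k=2)
print(consensus[0][1], consensus[0][3])
```

Repeating this for each candidate number of clusters and comparing how crisp the resulting matrices are is one common way to choose the cluster count.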
  12. Clustering identifies three morphological groups • Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides) • Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB) • Prognostically significant (logrank p=4.5e-4) [Figure: feature indices by cluster (CC, CM, PB); survival vs. days for CC, CM, PB]
  13. Gene Expression Class Associations • Cox proportional hazards – Gene expression class not significant p=0.58 – Morphology clustering p=5.0e-3 [Figure: subtype percentage (%) of Classical, Mesenchymal, Neural and Proneural within each cluster (CC, CM, PB)]
  14. Clustering Validation • Separate set of 84 GBMs from Henry Ford Hospital • ClusterRepro: CC p=7.2e-3, CM p=1.3e-2 [Figure: feature indices by cluster (CC, Mixed, CM); survival vs. months for CC, Mixed, CM]
  15. Associations
  16. Novel Pathology Modalities • Genomics: excellent molecular resolution, limited spatial resolution, 1000’s of genes • Imaging: excellent spatial resolution, limited molecular resolution
  17. Quantum Dots. Professor Robin Bostick
  18. Imaging Pipeline – Feature Extraction
  19. Example Application: Cancer Stem Cell Niche • Cancer stem cells – Rare(?), proliferative cells, regenerative – Do they prefer to live near blood vessels, or necrosis?
  20. Extreme Spatio-Temporal Sensor Data Analytics • Leverage exascale data and computer resources to squeeze the most out of image, sensor or simulation data • Run lots of different algorithms to derive same features • Run lots of algorithms to derive complementary features • Data models and data management infrastructure to manage data products, feature sets and results from classification and machine learning algorithms
  21. Application Targets • Multi-dimensional spatial-temporal datasets – Microscopy image analyses – Biomass monitoring using satellite imagery – Weather prediction using satellite and ground sensor data – Large scale simulations • Can we analyze 100,000+ microscopy images per hour? • Correlative and cooperative analysis of data from multiple sensor modalities and sources • What-if scenarios and multiple design choices or initial conditions
  22. Biomass Monitoring (joint with ORNL) • Investigate changes in vegetation and land use • Hierarchical, multi-resolution coarse/fine-grained analytics into a unified framework • Changes identified using high temporal/low spatial resolution MODIS data • Segmentation and classification methods used to characterize changes using higher resolution data (e.g. multitemporal AWiFS data) • Segmentation and classification to identify man-made structures.
  23. [Figure only; no slide text]
  24. Core Transformations • Data Cleaning and Low Level Transformations • Data Subsetting, Filtering, Subsampling • Spatio-temporal Mapping and Registration • Object Segmentation • Feature Extraction, Object Classification • Spatio-temporal Aggregation • Change Detection, Comparison, and Quantification
  25. Extreme DataCutter • DataCutter – Pipeline of filters connected through logical streams – In-transit processing – Flow control between filters and streams – Developed 1990s-2000s; led to IBM System S • Extreme DataCutter – Two-level hierarchical pipeline framework – In-transit processing – Coarse-grained components coordinated by a Manager that coordinates work on pipeline stages between nodes – Fine-grained pipeline operations managed at the node level – Both levels employ the filter/stream paradigm
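The filter/stream paradigm named on this slide can be sketched with threads and bounded queues. This is only an illustration of the general idea and does not use or reproduce the DataCutter API: each filter runs in its own thread, each logical stream is a `queue.Queue`, and the bounded queue size stands in for flow control (a fast producer blocks when the downstream filter falls behind). The names `run_pipeline`, `run_filter` and `capacity` are hypothetical.

```python
import queue
import threading
from typing import Callable, Iterable, List

_END = object()  # end-of-stream marker passed down the pipeline


def run_filter(fn: Callable, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Read items from one stream, apply the filter, write results downstream."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)
            return
        for result in fn(item):  # a filter may emit zero or more items
            outbox.put(result)


def run_pipeline(source: Iterable, filters: List[Callable], capacity: int = 4) -> list:
    """Connect filters with bounded streams and collect the final output."""
    streams = [queue.Queue(maxsize=capacity) for _ in range(len(filters) + 1)]
    workers = [
        threading.Thread(target=run_filter, args=(fn, streams[i], streams[i + 1]))
        for i, fn in enumerate(filters)
    ]

    def feed() -> None:
        for item in source:
            streams[0].put(item)  # blocks when the first stream is full
        streams[0].put(_END)

    workers.append(threading.Thread(target=feed))
    for worker in workers:
        worker.start()

    results = []
    while True:  # the caller drains the last stream, so filters never stall
        item = streams[-1].get()
        if item is _END:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


def double(x):
    return [x * 2]


def keep_multiples_of_four(x):
    return [x] if x % 4 == 0 else []


output = run_pipeline(range(5), [double, keep_multiples_of_four])
print(output)
```

In a two-level design like the one the slide describes, a pipeline of this kind would run inside each node while a separate coordinator hands coarse-grained work to nodes; the sketch covers only the single-node level.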
  26. Extreme DataCutter – Two Level Model
  27. Node Level Work Scheduling • Features of Node Level Architectures – Nodes contain CPUs, GPUs – Each CPU contains multiple cores – GPU has complex internal architecture – Data locality within node – Data paths between CPUs and GPUs [Figure: Keeneland node]
  28. Node Level Work Scheduling • Attempt to minimize data movement • Identify and assign operations that perform well on GPU • Balance load between CPUs and GPUs • Prefetch data • Identify and use high bandwidth CPU/GPU data paths • Schedule exclusive GPU access for components (e.g. morphological reconstruction) requiring fine grained parallelism
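Two of the goals listed here, assigning operations that perform well on the GPU and balancing load between CPUs and GPUs, can be illustrated with a simple greedy scheduler. This is a sketch under stated assumptions, not the scheduler used in the talk: each operation is assumed to come with a known CPU run time and an estimated GPU speedup, data movement and prefetching are ignored, and the operation names and numbers are invented.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, float, float]  # (name, cpu_seconds, estimated_gpu_speedup)
Device = Tuple[str, int]       # ("cpu" or "gpu", index)


def schedule(ops: List[Op], n_cpu_cores: int,
             n_gpus: int) -> Tuple[Dict[Device, List[str]], float]:
    """Greedy earliest-finish assignment; returns (plan, makespan in seconds)."""
    busy_until: Dict[Device, float] = {("cpu", i): 0.0 for i in range(n_cpu_cores)}
    busy_until.update({("gpu", i): 0.0 for i in range(n_gpus)})
    plan: Dict[Device, List[str]] = {dev: [] for dev in busy_until}

    # Place the operations that benefit most from the GPU first.
    for name, cpu_s, speedup in sorted(ops, key=lambda op: op[2], reverse=True):

        def finish(dev: Device) -> float:
            cost = cpu_s / speedup if dev[0] == "gpu" else cpu_s
            return busy_until[dev] + cost

        best = min(busy_until, key=finish)  # device that completes this op soonest
        busy_until[best] = finish(best)
        plan[best].append(name)
    return plan, max(busy_until.values())


# Hypothetical operations and timings, for illustration only.
ops = [
    ("morph_reconstruction", 10.0, 10.0),
    ("threshold", 2.0, 1.0),
    ("feature_extraction", 4.0, 2.0),
]
plan, makespan = schedule(ops, n_cpu_cores=1, n_gpus=1)
print(plan, makespan)
```

With these numbers the two GPU-friendly operations share the GPU while the operation with no speedup stays on the CPU core, so neither device sits idle for long.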
  29. Node Level Work Scheduling
  30. Brain Tumor Pipeline Scaling on Keeneland (100 Nodes)
  31. Control Structures for Handling Fine Grained/Runtime Dependent Parallelism in GPUs • Morphological Reconstruction: 8-15 fold speedup vs. one CPU core (Intel i7 2.66 GHz) on NVIDIA C2070 and GTX580 GPUs
  32. Large Scale Data Management • Implemented with IBM DB2 for large scale pathology image metadata (~million markups per slide) • Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc. • Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships • Highly optimized spatial query and analyses
  33. Spatial Centric – Pathology Imaging “GIS” • Point query: human marked point inside a nucleus • Window query: return markups contained in a rectangle • Containment query: nuclear feature aggregation in tumor regions • Spatial join query: algorithm validation/comparison
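The point and window queries listed on this slide can be illustrated in a few lines. The sketch below is an in-memory stand-in, not the database implementation described in the talk: markups are plain polygons in a dictionary, the point query uses the standard ray-casting test, and the window query checks bounding-box containment. The markup names and coordinates are invented.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]
Window = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def point_in_polygon(pt: Point, poly: Polygon) -> bool:
    """Ray casting: count edge crossings of a ray going right from pt."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through pt
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_query(markups: Dict[str, Polygon], pt: Point) -> List[str]:
    """Names of markups (e.g. nuclei) that contain a human-marked point."""
    return [name for name, poly in markups.items() if point_in_polygon(pt, poly)]


def window_query(markups: Dict[str, Polygon], window: Window) -> List[str]:
    """Names of markups lying entirely inside a rectangle."""
    xmin, ymin, xmax, ymax = window
    return [
        name for name, poly in markups.items()
        if all(xmin <= px <= xmax and ymin <= py <= ymax for px, py in poly)
    ]


# Hypothetical markups, for illustration only.
markups = {
    "nucleus_1": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "nucleus_2": [(10, 10), (12, 10), (12, 12)],
}
print(point_query(markups, (2, 2)))
print(window_query(markups, (0, 0, 5, 5)))
```

At the scale quoted in the talk (on the order of a million markups per slide) a linear scan like this would be too slow, which is why a spatial index inside the database matters.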
  34. PAIS (Pathology Analytical Imaging Standards) • Supported by caBIG, R01 and ACTSI • PAIS Logical Model – 62 UML classes – markups, annotations, imageReferences, provenance • PAIS Data Representation – XML (compressed) or HDF5 • PAIS Databases – loading, managing, querying and sharing data – Native XML DBMS or RDBMS + SDBMS [Figure: UML class diagram "class Domain Mo..."; classes shown include WholeSlideImageReference, TMAImageReference, MicroscopyImageReference, DICOMImageReference, ImageReference, Patient, Subject, Specimen, Region, AnatomicEntity, User, Group, Equipment, Project, Collection, PAIS, AnnotationReference, Markup, Annotation, GeometricShape, Surface, Field, Observation, Calculation, Inference, Provenance]
  35. PAIS: Example Queries • Example Query for Integrative Studies: find mean nuclear feature vector and covariance on tumor regions for each patient grouped by tumor subtype • SELECT c.pais_uid, pc.subtype, AVG(area), AVG(perimeter), AVG(eccentricity), COVARIANCE(area, perimeter), COVARIANCE(area, eccentricity) FROM pais.calculation_flat c, TCGA.PATIENT_CHARACTERISTIC pc, pais.patient p WHERE p.patientid = pc.patient_id AND p.pais_uid = c.pais_uid GROUP BY c.pais_uid, pc.subtype; [Figure: feature indices by cluster (Clusters 1-4); silhouette value per cluster; survival vs. days for Clusters 1-4]
  36. PAIS: Example Queries • Algorithm Validation: Intersection between Two Result Sets (Spatial Join)
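A spatial join between two result sets, as used for algorithm validation, pairs up every markup from one algorithm with every overlapping markup from another and scores the overlap. The sketch below is a deliberately simplified illustration and not the PAIS query: it approximates each markup by an axis-aligned bounding box, uses a nested-loop join, and scores pairs with the Jaccard ratio (intersection area over union area). The markup identifiers and coordinates are invented.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection(a: Box, b: Box) -> Box:
    """Overlap rectangle; has zero area when the boxes do not intersect."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def spatial_join(set_a: Dict[str, Box],
                 set_b: Dict[str, Box]) -> List[Tuple[str, str, float]]:
    """All intersecting (a, b) pairs with their Jaccard overlap ratio."""
    pairs = []
    for name_a, box_a in set_a.items():
        for name_b, box_b in set_b.items():
            inter = area(intersection(box_a, box_b))
            if inter > 0.0:
                union = area(box_a) + area(box_b) - inter
                pairs.append((name_a, name_b, inter / union))
    return pairs


# Hypothetical segmentation results from two algorithms.
algorithm_1 = {"a1": (0.0, 0.0, 2.0, 2.0)}
algorithm_2 = {"b1": (1.0, 0.0, 3.0, 2.0), "b2": (5.0, 5.0, 6.0, 6.0)}
matches = spatial_join(algorithm_1, algorithm_2)
print(matches)
```

A production system would replace the nested loop with an index-assisted join and exact polygon intersection, but the per-pair overlap score is the quantity being compared either way.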
  37. VLDB 2012 • Change Detection, Comparison, and Quantification
  38. Summary and Perspective • Large scale integrative data analytic methods and tools to integrate clinical, molecular, Pathology, Radiology data • Characterize new cancer subtypes and biomarkers, predict outcome, treatment response • Algorithms to quantify Pathology classification • HPC/BIGDATA analysis pipelines
  39. Importance: • Computer Science: general approaches to analysis and classification of very large datasets from low dimensional spatio-temporal sensors • Biomedical: generate basic insights into pathophysiology, clues to new treatments, better ways of evaluating existing treatments and core infrastructure needed for comparative effectiveness research studies
  40. Thanks to: • In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director) • caGrid Knowledge Center: Joel Saltz, Mike Caliguiri, Steve Langella co-Directors; Tahsin Kurc, Himanshu Rathod Emory leads • caBIG In vivo imaging team: Eliot Siegel, Paul Mulhern, Adam Flanders, David Channon, Daniel Rubin, Fred Prior, Larry Tarbox and many others • In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz • Emory ATC Supplement team: Tim Fox, Ashish Sharma, Tony Pan, Edi Schreibmann, Paul Pantalone • Digital Pathology R01: Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers) • NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe • ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos • NSF Scientific Workflow Collaboration: Vijay Kumar, Yolanda Gil, Mary Hall, Ewa Deelman, Tahsin Kurc, P. Sadayappan, Gaurang Mehta, Karan Vahi
  41. Thanks!
