Imaging Data Commons (IDC) - Introduction and initial approach
1. Imaging Data Commons (IDC):
Introduction and initial
approach
Andrey Fedorov, PhD, on behalf of the IDC consortium
Brigham and Women’s Hospital / Harvard Medical School
andrey.fedorov@gmail.com
Oct 7, 2019 - NCI Imaging Community call
Slides location: http://bit.ly/2019-idc-imgcommcall
2. 2
“The NCI Imaging Data Commons will be a
cloud-based resource that connects
researchers with
1. cancer image collections
2. a robust infrastructure that contains imaging data,
subject and sample metadata and experimental
metadata from disparate sources
3. resources for searching, identifying and viewing
images, and
4. additional data types contained in other Cancer
Research Data Commons nodes.”
Cancer Research Data Commons (CRDC)
Imaging Data Commons (IDC)
3. IDC timeline to the award
● April 2018: request for information on
IDC development
● December 11, 2018: IDC solicitation for
competitive proposals released
● January 31, 2019: IDC proposals due
date
● July 24, 2019: IDC contract mutually
signed between Leidos Biomed and
BWH
4. IDC: The team
● Leadership:
○ Ron Kikinis*, Principal Investigator
○ Andrey Fedorov, Technical lead and project
manager
● Imaging R&D:
○ BWH: Ron Kikinis, Andrey Fedorov, Hugo Aerts
○ Isomics Inc: Steve Pieper
○ Radical Imaging: Rob Lewis, Erik Ziegler
○ Pixelmed: David Clunie (aka “Mr. DICOM”)
○ Fraunhofer MEVIS: André Homeyer
● Cloud:
○ Institute for Systems Biology: Bill Longabaugh
● Security:
○ General Dynamics IT: David Pot
* Ron Kikinis is 50/50 between BWH and MEVIS at the moment, 100%
at BWH effective March 2020
[Team photos: Bill Longabaugh, Ron Kikinis, Andrey Fedorov, Steve Pieper, David Clunie, Hugo Aerts, David Pot, André Homeyer, Rob Lewis]
5. Team background
● Open source image computing technology
○ 3D Slicer, OHIF, pyradiomics, dcmqi
● Network projects and collaborations
○ Quantitative Imaging Network (QIN)
○ Informatics Technology for Cancer Research (ITCR)
● Collaborations with TCIA
○ Data harmonization efforts (segmentations, LIDC)
● DICOM development
○ Refinement of the standard to address quantitative
imaging use case
○ Tools, outreach, industry collaborations
● Cloud infrastructure development
○ ISB-CGC one of the 3 Cancer Genomics Cloud pilots
6. Understanding the problem
CrowdFlower 2016 Data Science Report
● Imaging data = images + annotations + clinical
data ( + analysis results)
○ Current community focus is on images
○ Data preparation is very time consuming
○ Limited effort to support “organic growth” of the datasets
● Need multi-site, multi-reader, multi-tool,
representative cohorts
○ Cannot be done without conventions for data
representation
○ Requires tools to support harmonization efforts
○ Hard if not impossible to do retrospectively
● Need harmonization for
○ Semantics, data representation, communication interfaces
● CRDC takes those problems to new levels
7. IDC vision
● First priority: resource useful for imaging researchers
● Opportunity to address limitations and develop missing components
○ Visualization, search, data harmonization (where “data” is not limited to “images”!)
● Empower imaging research with metadata
○ Harmonize imaging, image-derived and image-related data
○ Provenance
○ Search
○ FAIR
● Initial goals: radiology and pathology
● Simplify accessibility of the already popular tools
● Simplify analysis workflows
● Lead by example: development of use cases is part of the project plan
● Longer term: cross-node integration
8. IDC and The Cancer Imaging Archive (TCIA)
● “TCIA is a service which de-identifies and hosts a large archive of medical
images of cancer accessible for public download.”
● Mostly clinical imaging data (radiology and digital pathology)
● IDC will:
○ Be part of the CRDC ecosystem, with the objective to support cross-domain data integration
○ Utilize public collections of TCIA to populate first wave of content, and will host public
collections of the TCIA going forward
○ Rely on TCIA for image de-identification
○ Collaborate with TCIA on the topics of data harmonization and development of tools of shared
interest
○ Encourage users to compute on the cloud by providing various incentives (e.g., compute
credits)
○ Discourage downloads of the data by not offsetting the data egress charges
9. IDC implementation - guiding principles
● Agile development approach
○ Broad initial direction, adaptive
○ Customer involvement
○ Short development cycles - sprints
● Phased implementation plan
○ Phase I: Pre Minimal Viable Product
○ Phase II: Minimal Viable Product
○ Phase III: Production / Further development
○ Phase IV: Further development / Maintenance
● IDC will host only public datasets, will NOT de-identify your data!
● Non-restrictive open source license for everything produced by IDC
● “Bag of tools” instead of monolithic development from scratch
● Standards-based
10. Phase I: Pre Minimal Viable Product
RFP-defined broad tasks:
● Definition of imaging data model, data dictionaries and ontologies
● Initial dataset and use-case definition with cross-collection data access
● Image data, metadata and file standards
● Evaluation of existing software and tools for reuse
● IDC advisory committee selection
Target completion: October 2019
11. IDC data backbone: DICOM
● Digital Imaging and Communications in Medicine (DICOM) is the standard for
communication of medical imaging information and related data
○ emphasis on metadata standardization
○ compatibility with acquisition and archival tools
○ images (CT, MR, whole slide pathology) and analysis results (annotations, segmentation,
registration, quantification)
○ interoperability
○ coordinated with other standards (HL7/FHIR, JSON, XML, REST, WADO, BRIDG, SNOMED)
● History of development and adoption since 1983
● Adopted by virtually all manufacturers of medical imaging equipment
● Open international community of stakeholders
● DICOM is a live standard!
● IDC opportunities: raise awareness, create incentives, help with transitioning
12. DICOM for data modeling
● DICOM files: combine attributes
from several real-world entities
○ Patient
○ Equipment
○ Modality-specific attributes
● Tables of attributes based on
modules
○ Support incremental growth of content
● Unique identifiers
● Specific tasks:
○ Machine-readable representation
○ Data model search interface
○ Performance evaluation
○ Tools
http://dicom.nema.org/medical/dicom/current/output/chtml/part03/chapter_A.html
DICOM Composite Instance IOD Information Model
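The idea that every DICOM file repeats patient, study and series attributes, and that unique identifiers tie the files together, can be illustrated with a small sketch. This is not IDC code: the attribute keywords (PatientID, StudyInstanceUID, SeriesInstanceUID, SOPInstanceUID, Modality) are standard DICOM keywords, but the sample values and the helper name build_hierarchy are made up for illustration.

```python
# Each dict mimics the header of one DICOM instance (one file).
# All identifier values below are invented sample data.
INSTANCES = [
    {"PatientID": "P1", "StudyInstanceUID": "1.2.3.1",
     "SeriesInstanceUID": "1.2.3.1.1", "SOPInstanceUID": "1.2.3.1.1.1",
     "Modality": "CT"},
    {"PatientID": "P1", "StudyInstanceUID": "1.2.3.1",
     "SeriesInstanceUID": "1.2.3.1.1", "SOPInstanceUID": "1.2.3.1.1.2",
     "Modality": "CT"},
    {"PatientID": "P1", "StudyInstanceUID": "1.2.3.1",
     "SeriesInstanceUID": "1.2.3.1.2", "SOPInstanceUID": "1.2.3.1.2.1",
     "Modality": "SEG"},
    {"PatientID": "P2", "StudyInstanceUID": "1.2.3.2",
     "SeriesInstanceUID": "1.2.3.2.1", "SOPInstanceUID": "1.2.3.2.1.1",
     "Modality": "MR"},
]


def build_hierarchy(instances):
    """Group flat per-instance headers into patient -> study -> series -> [instances].

    Hypothetical helper: it relies only on the identifiers that every
    composite instance carries, so no external index is needed.
    """
    tree = {}
    for inst in instances:
        studies = tree.setdefault(inst["PatientID"], {})
        series = studies.setdefault(inst["StudyInstanceUID"], {})
        series.setdefault(inst["SeriesInstanceUID"], []).append(inst["SOPInstanceUID"])
    return tree


HIERARCHY = build_hierarchy(INSTANCES)
```

Because the grouping keys travel with each file, new instances (for example a segmentation added later) slot into the same tree without changing existing entries.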
13. DICOM gaps assessment
● Radiology images
○ Bonus: multi-frame representation
● Pathology images
○ Converters
○ Capabilities
● Image-derived data
○ Annotations, parameter maps, qualitative evaluations
○ Radiology, pathology
● Other image types: open question
● Opportunities:
○ The knowns: converters, ontologies, capabilities, learning resources, missing data types
○ The unknowns: we have to be ready and nimble to leverage those as they come up
14. Clinical data
● Basic clinical data: DICOM composite
context
● Treatment, diagnosis, ...
○ Excel spreadsheets?
○ One of a kind example: DICOM SR for
QIN-HEADNECK
● What can be reconciled and harmonized -
open question
● Approach:
○ Coordination with CRDC-wide resource: Center
for Cancer Data Harmonization (CCDH)
○ CDISC BRIDG: unifying model for clinical and
research domains (harmonized with DICOM)
https://www.cdisc.org/standards/domain-information-module/bridg
15. Image viewer: OHIF
● Browser-based (zero install!)
○ Open source, modern JavaScript
○ DICOM standard images,
segmentations, annotations
○ Professional design
● DICOMweb supported by
Google, Siemens, and open
source servers
● VTK.js WebGL visualization
● Pathology plugin development
○ DICOM Whole Slide Imaging
○ Efficient DICOMweb pyramid
access
Key collaborators: Erik Ziegler, Trinity Urban, Gordon Harris (OHIF), Markus Herrmann (BWH/MGH CCDS)
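For readers unfamiliar with DICOMweb, the sketch below shows how the standard resource paths are composed for a study search (QIDO-RS) and for retrieval of an instance or selected frames (WADO-RS). The path layout follows the DICOMweb standard; the base URL, the identifier values and the function names are placeholders chosen for this example, not an IDC endpoint or API.

```python
from urllib.parse import urlencode

# Hypothetical server root; a real deployment would publish its own.
BASE_URL = "https://example.org/dicomweb"


def qido_studies_url(base_url, **filters):
    """QIDO-RS study search: /studies?Attribute=value&..."""
    query = urlencode(filters)
    return f"{base_url}/studies?{query}" if query else f"{base_url}/studies"


def wado_instance_url(base_url, study_uid, series_uid, instance_uid):
    """WADO-RS retrieval of a single instance."""
    return (f"{base_url}/studies/{study_uid}"
            f"/series/{series_uid}/instances/{instance_uid}")


def wado_frames_url(base_url, study_uid, series_uid, instance_uid, frames):
    """WADO-RS retrieval of selected frames (1-based, comma-separated)."""
    frame_list = ",".join(str(f) for f in frames)
    instance_url = wado_instance_url(base_url, study_uid, series_uid, instance_uid)
    return f"{instance_url}/frames/{frame_list}"


SEARCH_URL = qido_studies_url(BASE_URL, PatientID="P1", limit=10)
FRAMES_URL = wado_frames_url(BASE_URL, "1.2.3.1", "1.2.3.1.1", "1.2.3.1.1.1", [1, 2])
```

Frame-level retrieval is what lets a browser viewer fetch only the tiles it needs from a large multi-frame image instead of downloading the whole object.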
16. Annotation example:
Crowds Cure Cancer
● Expert annotation of cancer images
from TCIA
● Booth at RSNA 2017, 2018 (and
2019)
● Built on OHIF, react, dcm4chee,
AWS
● Desktop and Mobile
● > 5,000 measurements collected
● Help out at crowds-cure.org!
Key collaborators: Erik Ziegler, Trinity Urban, Dan Rukas, Gustavo Lelis, Jayashree Kalpathy-Cramer,
Gordon Harris, Fred Prior, Justin Kirby, and more...
17. Institute for Systems Biology - Cancer Genomics Cloud
(ISB-CGC)
● One of three Cancer Genomics Cloud Pilots, starting in September 2014
● Since October 2017, ISB-CGC is an NCI Cloud Resource (CR) component of the
NCI Cancer Research Data Commons (CRDC)
● As a Cloud Pilot, ISB-CGC built a platform that hosted and managed
controlled-access data stored in Google Cloud Storage buckets
○ This role is now being performed by the Genomic Data Commons (GDC) and the Data
Commons Framework (DCF)
○ ISB-CGC now uses Fence for handling A&A, linking Google IDs to eRA Commons IDs to provide
ISB-CGC users with access to controlled data
https://isb-cgc.org/
18. Cloud platform
● Our existing ISB-CGC Web Application and API production code base can be
extensively reused and leveraged, and provides a low-risk path to stand up
the IDC minimum viable product (MVP) quickly
● Our knowledge of the existing CRDC ecosystem and roadmap will guide
architectural decisions
● Google is already providing imaging datasets
20. Pilot support of image viewing in ISB-CGC
The ISB-CGC Web Application prototyped integration of a pathology viewer (using caMicroscope, transitioning
to OHIF) and a radiology viewer (using OHIF Viewer) for TCGA data.
21. Google Healthcare
● Google Cloud is the platform used
for ISB-CGC
● Google initiated work with OHIF and
PixelMed
○ Google engineers have contributed
Google Cloud support to OHIF
○ DICOMweb protocols
● Google hosts TCIA images
● BigQuery tools for extracting and
interrogating DICOM metadata
● Authentication, data security,
compute, GPU, notebooks …
22. Datasets
● De-identification and curation: TCIA
● TCGA
○ Radiology
■ 1731 cases, 3022 DICOM studies, 20317 DICOM series
○ Pathology
■ 11007 cases, 11963 diagnostic images, 18304 frozen tissue images
○ Available in ISB-CGC
● Most public TCIA datasets are already replicated on Google Healthcare
○ Digital pathology excluded
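Counts such as "cases, studies, series" come directly from the DICOM identifiers stored with every image. As a rough in-memory illustration of the kind of distinct-count query one might run over a table of DICOM metadata, the sketch below uses invented rows; the function names and sample values are assumptions, and no actual IDC or BigQuery table is referenced.

```python
# One row per DICOM instance, as it might appear in a metadata table.
# All values are invented sample data.
ROWS = [
    {"PatientID": "P1", "StudyInstanceUID": "1.2.1",
     "SeriesInstanceUID": "1.2.1.1", "Modality": "CT"},
    {"PatientID": "P1", "StudyInstanceUID": "1.2.1",
     "SeriesInstanceUID": "1.2.1.1", "Modality": "CT"},
    {"PatientID": "P1", "StudyInstanceUID": "1.2.1",
     "SeriesInstanceUID": "1.2.1.2", "Modality": "SEG"},
    {"PatientID": "P2", "StudyInstanceUID": "1.2.2",
     "SeriesInstanceUID": "1.2.2.1", "Modality": "MR"},
]


def summarize(rows):
    """Count distinct cases, studies and series (like COUNT(DISTINCT ...))."""
    return {
        "cases": len({r["PatientID"] for r in rows}),
        "studies": len({r["StudyInstanceUID"] for r in rows}),
        "series": len({r["SeriesInstanceUID"] for r in rows}),
    }


def series_per_modality(rows):
    """Count distinct series for each modality."""
    seen = {}
    for r in rows:
        seen.setdefault(r["Modality"], set()).add(r["SeriesInstanceUID"])
    return {modality: len(uids) for modality, uids in seen.items()}


SUMMARY = summarize(ROWS)
BY_MODALITY = series_per_modality(ROWS)
```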
23. Other sources of data
● IDC is not intended to be limited to
radiology and pathology!
○ 3D atlases of the cellular, morphological,
molecular features of human cancers over time
● Human Tumor Atlas Network (HTAN)
○ Close coordination with David Gutman
○ IDC’s Bill Longabaugh is a member of HTAN
● CPTAC Imaging in TCIA: potential
proteomics use case
● Clinical trial groups (e.g., ECOG-ACRIN)
● Pharma datasets slated for public release
through research projects
https://www.cancer.gov/research/key-initiatives/moonshot-cancer-initiative/implementation/human-tumor-atlas
24. Approach: Analytics / applications
● Goals:
○ empower researchers to do better science (integrative, larger, faster, rigorous, traceable, enable
comparative studies)
○ metadata in - metadata out
● Computational workflows applied to large datasets
○ cover radiomics, pathomics, and genomics
○ integration with containerized computational tools
● Demonstrate capabilities by implementing representative use cases
○ in coordination with domain experts
○ batch processing tools, user-guided when needed
○ deep learning and engineered technologies
● Initial focus: reproduce previously published studies
● Later stages: investigate novel aspects of the data
25. Approach: Applications - radiomics
● Build on numerous studies based on
TCIA datasets
○ Including those integrating imaging and
genomics data
● Engineered and deep learning
● Considered use cases
○ Correlative analyses
○ Prognostication
○ Imaging-genomic studies
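To make "engineered" radiomics features concrete, here is a minimal sketch of a few first-order features computed over the voxel intensities inside a segmented region. It is a simplified stand-in, not the pyradiomics implementation: the function name, the fixed bin width of 25 and the sample intensities are assumptions made for this example.

```python
import math
from collections import Counter


def first_order_features(voxels, bin_width=25):
    """Compute simple first-order features from intensities in a region.

    Entropy is computed on a histogram with fixed-width bins; the bin
    width is an arbitrary choice for this sketch.
    """
    n = len(voxels)
    mean = sum(voxels) / n
    variance = sum((v - mean) ** 2 for v in voxels) / n
    energy = sum(v * v for v in voxels)
    # Discretize intensities, then measure histogram entropy in bits.
    bins = Counter(math.floor(v / bin_width) for v in voxels)
    entropy = -sum((c / n) * math.log2(c / n) for c in bins.values())
    return {"mean": mean, "variance": variance,
            "energy": energy, "entropy": entropy}


# Invented intensities standing in for voxels inside a tumor segmentation.
FEATURES = first_order_features([0, 0, 50, 50])
```

Features like these only become comparable across sites when the segmentation, the discretization settings and the units are recorded alongside the values, which is the "metadata in, metadata out" point made earlier.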
26. Approach: Applications - pathomics
● Academic studies + industry grade
pathomics tools
● Opportunity for open source tool
development
● Use cases considered
○ correlating texture or shape features
derived from pathology images with
malignancy or survival
○ correlating texture and shape features of
cellular structures with different end
points (histological grade, clinical stage,
metastasis, lymph node spread, survival)
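A correlative use case of this kind can be sketched with a rank correlation between an image-derived feature and an end point. The example below implements Spearman correlation in plain Python; the feature and survival values are invented, the function names are illustrative, and a real survival analysis would also have to account for censoring, which this sketch ignores.

```python
import math


def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented data: a texture feature per case and survival in months.
TEXTURE = [0.12, 0.35, 0.47, 0.80]
SURVIVAL_MONTHS = [40, 31, 22, 9]
RHO = spearman(TEXTURE, SURVIVAL_MONTHS)
```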
From: Yu et al. 2016. Predicting non-small cell lung cancer prognosis
by fully automated microscopic pathology image features. Nat
Commun.
Tile-based steatosis quantification. Homeyer et al. Focused scores
enable reliable discrimination of small differences in steatosis. Diagn.
Pathol. 13, 76 (2018).
27. Approach: Applications
- radiomics + pathomics
Evaluate potential links between
● radiomics quantifying radiographic
information, including macroscopic
heterogeneity
● pathomics signatures characterizing
the immune responses
● genetic markers
● clinical information and outcomes
Grossmann et al. 2017.
Defining the biological basis
of radiomic phenotypes in
lung cancer. Elife
Saltz et al. 2018. Spatial
Organization and Molecular
Correlation of
Tumor-Infiltrating
Lymphocytes Using Deep
Learning on Pathology
Images. Cell Rep.
28. Governance
● IDC is an NCI contract to the Frederick National Laboratory for Cancer Research
(FNLCR, operated by Leidos Biomedical Research)
● Todd Pihl, PhD as FNLCR Program Manager
● Keyvan Farahani, PhD as NCI Program Director
● Monthly reporting to Leidos Biomedical
● Weekly stakeholder meetings
● IDC Advisory Committee (TBD) to provide “guidance on IDC scope, direction,
and other governance issues including what datasets the IDC should
incorporate. This group will be composed of extramural experts in cancer
imaging, related technologies and NCI driving projects”
29. Phase II: Minimal Viable Product
RFP-defined broad tasks:
● Implementation of Gen3 integration (Fence, IndexD)
● Demonstration of IDC portal
● Cloud installation of and access to TCGA and one other collection
● Demonstration of viewer implementation for radiological images
● Demonstration of artificial cohort generation and identification
● Cross-cloud provider interoperability and standards
● User testing
● Outreach to and input from imaging and other cancer research communities
● [Support of continuous] Availability
Target completion: late Summer 2020
30. Reuse of ISB-CGC Codebase
● ISB-CGC (in the initial Cloud Pilot phase) was originally developed to handle
storing the data, finding the data, and computing on it
● The pieces of ISB-CGC that were built for the first two roles in the original pilot
phase are ideal for reuse to set up the IDC
● In the current CRDC ecosystem, the roles and functionality of the Cloud
Resources (e.g. the current version of ISB-CGC), the Cancer Data Aggregator
(CDA), and data nodes such as the IDC are distributed:
○ Compute is done in Cloud Resources using data and, e.g., Dockerized tools made available by
the IDC
○ Cohort creation involving multiple data types (i.e. Pan-*DC search) is implemented in higher
layers (e.g. the resources, using the CDA to search across nodes)
31. Outreach strategy
● Web presence (website, GitHub, mail list, Slack?)
● Interactive demonstration and learning resources: Jupyter Notebook / Google
Colab, workspaces, integration with viewers
● Publications accompanied by datasets and computational workspaces, data
descriptor publications
● Crowd-sourced annotation / analysis
● Connectathons?
● Outreach and coordination with vendors
● Tutorials at the major conferences: RSNA, MICCAI, SPIE (resources allowing!)
https://github.com/ImagingDataCommons @CancerIDC
32. Prior examples of outreach activities at BWH
https://projectweek.na-mic.org/
https://dicom4qi.readthedocs.io/
http://qiicr.org/dicom4miccai/
https://discourse.slicer.org/
33. Phase III: Production / Further development
RFP-defined broad tasks:
● ATO, FISMA compliance
● User engagement / help desk
● Work with CRDC for cross-node searching
● Demonstration of digital pathology viewer tool and other visualization
● Incorporation of additional image collections
● Support of derived datasets
● Interoperability with workspaces and cloud resources
Target completion: late May 2021
34. Security
● Initially, all imaging data is planned to be de-identified, therefore open access
● Federal Information Security Management Act (FISMA) Low security to get
Authority to Operate (ATO)
● Since design of system based on ISB-CGC (FISMA Moderate), much re-use of
security approach (and documentation) planned
● TCIA for data de-identification - no PHI data on IDC!
35. Digital pathology
● OHIF Viewer for visualization of
images and annotations
● DICOM supports digital pathology
○ Including extensive specimen metadata
● DICOM pathology annotation
capabilities will need development
● Converters
● Markus Herrmann, BWH/MGH
Center for Clinical Data Science
(CCDS) - IDC key collaborator for
the digital pathology use case
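Whole slide images in DICOM are stored as tiled, multi-resolution pyramids, which is what makes tile-by-tile access from a viewer practical. The sketch below estimates how many tiles (frames) each pyramid level holds and how a tile position maps to a frame number. It assumes levels that halve in size, a 256-pixel tile, row-major frame ordering as in a fully tiled image, and a single optical path and focal plane; the function names and slide dimensions are made up for this example.

```python
import math


def pyramid_levels(width, height, tile_size=256):
    """List pyramid levels, halving the image until it fits in one tile.

    The halving factor and tile size are assumptions of this sketch;
    real slides may use other downsampling factors.
    """
    levels = []
    w, h = width, height
    while True:
        cols = math.ceil(w / tile_size)
        rows = math.ceil(h / tile_size)
        levels.append({"width": w, "height": h,
                       "tile_columns": cols, "tile_rows": rows,
                       "frames": cols * rows})
        if cols == 1 and rows == 1:
            break
        w = max(1, math.ceil(w / 2))
        h = max(1, math.ceil(h / 2))
    return levels


def frame_number(tile_column, tile_row, tile_columns):
    """1-based frame number of a tile, assuming row-major tile order.

    tile_column and tile_row are 0-based positions within the level.
    """
    return tile_row * tile_columns + tile_column + 1


# Invented slide dimensions, kept small so the numbers are easy to check.
LEVELS = pyramid_levels(1024, 512)
```

A viewer showing a small region at a given zoom only needs the handful of frames covering that region at the matching level, rather than the full-resolution image.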
36. Approach for image-derived data
● Use standard DICOM objects
○ Segmentations, measurements, annotations,
parametric maps, ...
○ Numerous examples for radiology use cases
○ Pathology will require development
○ Other image types will need to be prioritized
● Improve/develop conversion tools
● Documentation, use cases to encourage
and support adoption
● Derived data submission procedures
● Search interface features and data
modeling considerations
37. Phase IV: Further development / Maintenance
RFP-defined broad tasks:
● Interaction with tool repositories
● Continued access and coordination of collections
● Help desk continuation
● Community engagement
Target completion: July 2023
38. FAQ (based on emails/questions so far)
● IDC vs TCIA - hopefully we covered this earlier
● What are your plans for image viewing and annotation?
○ OHIF; Image viewing and annotation visualization in MVP
● What image data will IDC be hosting besides TCIA?
○ Image data from high research value biomedical imaging projects generating public datasets
○ To be determined in coordination with the IDC Advisory Committee
● Will you be providing an API to IDC for accessing the images and annotations
and will you be supporting DICOMweb?
○ Yes
● Going forward when people contribute biomedical datasets, where do they go
— to TCIA or IDC? Or will data contributed to TCIA automatically go to IDC?
○ New imaging data should be submitted to TCIA, and will be pulled into IDC post-curation
● Once we do analyses, we will generate image metadata that could be
contributed to the community via IDC. Will IDC accept those?
○ Yes, our desire is to make the process of contributing those back as seamless as possible
39. Significance beyond Data Commons
● Scientific reproducibility and cloud/standards/containerized analysis as
components of the solution
● Routine generation of standardized data
● Raise awareness of the value of metadata, introduce tools to enable its
collection and use
● Opportunities to engage and integrate various groups of stakeholders
(industry, clinical trial groups, pharma, researchers, clinicians)
● We believe tools developed can be applicable for establishing private “mini
commons”
40. Dedication
Ed Helton
1945 - 2019
Associate Director of Clinical Trials
Programs and Products, NCI
Lawrence (Larry) Clarke
1944 - 2016
Chief of the Image Technology
Development Branch, NCI