1. Provided to you by the
Canadian Bioinformatics
Workshop series
www.bioinformatics.ca
NCRI Cancer Conference:
Cancer data and its analysis
practical workshop
November 1, 2015
3. bioinformatics.ca
NCRI Workshop 2015
NCRI Workshop 2015 – Module 1
You are free to:
Copy, share, adapt, or re-mix;
Photograph, film, or broadcast;
Blog, live-blog, or post video of;
This presentation. Provided that:
You attribute the work to its author and
respect the rights and licenses associated
with its components.
Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero.
Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at:
http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites
7. bioinformatics.ca
Schedule for Module 1:
Cancer Genomic Databases
• Introduction to the Canadian Bioinformatics
Workshop series.
• The Databases:
– The Cancer Genome Atlas (TCGA)
– The International Cancer Genome Consortium (ICGC)
• Data Access: human genomes and security and
privacy issues:
Open Data vs. Controlled Access data
• Another Database:
– The Catalogue of Somatic Mutations in Cancer (COSMIC)
10. bioinformatics.ca
Workshops planned for 2016:
http://bioinformatics.ca/workshops
1. Bioinformatics for Cancer Genomics
2. High-throughput Biology: From Sequence to Networks (2017 - CSHL)
3. Introduction to R
4. Exploratory Analysis of Biological Data using R
5. Informatics for RNA-sequence Analysis
6. Informatics on High Throughput Sequencing Data
7. Pathway and Network Analysis of -omics Data
8. Informatics and Statistics for Metabolomics
9. Analysis of Metagenomic Data
10. How to Work in the Cloud: Computing on Human Genome Data
11. Epigenomic Data Analysis
12. Big Data in Precision Genomics
13. bioinformatics.ca
Soap-Box time!
• Open Access, Open Data and Open Source are essential for good
Science.
• Openness is a responsibility, an obligation, and something that comes
with the privilege of doing publicly funded work.
Open Access
Open Source
Open Data
Opencourseware
15. bioinformatics.ca
Cancer therapy is like
beating the dog with
a stick to get rid of
his fleas.
- Anna Deavere Smith,
Let me down easy
17. bioinformatics.ca
The revolution in cancer
research can be summed up
in a single sentence:
cancer is, in essence,
a genetic disease.
- Bert Vogelstein
18. bioinformatics.ca
Cancer: a Disease of the Genome
Challenge in Treating Cancer:
Every tumour is different
Every cancer patient is different
22. NCRI Workshop 2015 – Module 1 bioinformatics.ca
TCGA
The Cancer Genome Atlas is a comprehensive and coordinated
effort to accelerate our understanding of the molecular
basis of cancer through the application of genome analysis
technologies, including large-scale genome sequencing.
23. bioinformatics.ca
About the TCGA
• National Cancer Institute (NCI)
• National Human Genome Research Institute
(NHGRI)
• Phased Structure:
– Three-year pilot in 2006 with an investment of $50 million
from each
– TCGA will collect and characterize more than 20 additional
tumour types
25. bioinformatics.ca
Division of Labour
• Biospecimen Core Resource (BCR)
– centre where samples are carefully catalogued, processed, quality-checked
and stored along with participant clinical information
• Genome Sequencing Centre (GSC)
– uses high-throughput methods to identify changes to DNA sequences that are
associated with specific cancer types
• Genome Characterization Centre (GCC)
– uses high-throughput technologies to analyze genomic changes involved in cancer
• Genome Data Analysis Centre (GDAC)
– provides novel informatics tools to the research community
– provides analysis results using TCGA data.
• Data Coordinating Centre (DCC)
– Central provider of TCGA data.
– Standardizes data formats and validates submitted data.
26. bioinformatics.ca
TCGA Data
• Sequence reads from newer sequencing
technologies are available at the Cancer Genome
Hub: https://cghub.ucsc.edu/
• Higher level sequence data (variation calls and
abundance measures) are available at the TCGA
Portal: http://cancergenome.nih.gov/
• Also integrated with ICGC data (more on this later)
28. bioinformatics.ca
Data Coordinating Centre
• Plays a central role
– Receiving data from BCR, GSC and GCC sites
– Providing access to users
– Performing analysis of data
• Responsibilities:
– Protecting participant privacy and confidentiality
– Developing data standards and controlled vocabularies
– Establishing informatics pipelines for data flow
– Developing new analytical and visualization technologies
to facilitate data analysis, for all audiences
29. bioinformatics.ca
TCGA DCC Data Portal
• Provides a platform to search, download and
analyze TCGA data sets
• Two data access tiers: Open and Controlled
• Analytic tools include: Cancer Molecular Analysis
and Cancer Genome Workbench (NCBIB),
Integrative Genomics Viewer (Broad) and
CancerGenomics Analysis (MSKCC).
30. bioinformatics.ca
TCGA Data Browser
https://tcga-data.nci.nih.gov/tcga/
Query TCGA data online using the TCGA Data Browser
31. bioinformatics.ca
The International Cancer Genome Consortium (ICGC)
• http://www.icgc.org/
• “ICGC was launched to coordinate large-scale cancer genome
studies in tumours from 50 different cancer types and/or
subtypes that are of clinical and societal importance across
the globe”
43. NCRI Workshop 2015 – Module 1 bioinformatics.ca
ICGC
TCGA
Differences between ICGC & TCGA
• Different tumour types
• Different geographic rules
• Many countries vs one jurisdiction
• Different definitions of what is controlled
• Different data access rules
44. NCRI Workshop 2015 – Module 1 bioinformatics.ca
ICGC Controlled Access Datasets
• Detailed Phenotype and Outcome data
• Gene Expression (probe-level data)
• Raw genotype calls
• Gene-sample identifier links
• Genome sequence files
• Germ line variants
ICGC Open Access Datasets
• Cancer Pathology
– Histologic type or subtype
– Histologic nuclear grade
• Patient/Person
– Gender, Age range
– Vital status, Survival time
– Relapse type, Status at follow-up
• Gene Expression (normalized)
• DNA methylation
• Computed Copy Number and Loss of Heterozygosity
• Somatic variants from Exome or WGS
http://goo.gl/w4mrV
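The open/controlled split above can be captured as a small lookup. This is only an illustrative sketch: the data type names are transcribed from the slide, while the dictionary `ICGC_ACCESS_TIERS` and the helper `icgc_access_tier` are hypothetical names and not part of any ICGC tool or API.

```python
# Data types per access tier, transcribed from the ICGC slide.
# The structure and names below are illustrative assumptions.
ICGC_ACCESS_TIERS = {
    "controlled": [
        "Detailed Phenotype and Outcome data",
        "Gene Expression (probe-level data)",
        "Raw genotype calls",
        "Gene-sample identifier links",
        "Genome sequence files",
        "Germ line variants",
    ],
    "open": [
        "Cancer Pathology",
        "Patient/Person",
        "Gene Expression (normalized)",
        "DNA methylation",
        "Computed Copy Number and Loss of Heterozygosity",
        "Somatic variants from Exome or WGS",
    ],
}


def icgc_access_tier(data_type: str) -> str:
    """Return 'open' or 'controlled' for a data type named on the slide."""
    wanted = data_type.strip().lower()
    for tier, data_types in ICGC_ACCESS_TIERS.items():
        if any(wanted == name.lower() for name in data_types):
            return tier
    raise KeyError(f"Data type not listed on the slide: {data_type!r}")
```

Note how the same kind of measurement can land in either tier: normalized gene expression is open, while probe-level gene expression is controlled.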
45. NCRI Workshop 2015 – Module 1 bioinformatics.ca
TCGA Controlled Access Datasets
• Primary sequence data (BAM and FASTQ files)
• SNP6 array level 1 and level 2 data
• Exon array level 1 and level 2 data
• Somatic variants from whole genome sequencing
• Certain information in MAFs
• A full list of controlled-access data types can be found at:
http://goo.gl/K1h7zu
TCGA Open Access Datasets
• De-identified clinical and demographic data
• Gene expression data
• Copy number alterations in regions of the genome
• Epigenetic data
• Summaries of data compiled across individuals
• Anonymized single amplicon DNA sequence data
• Somatic variants from scrubbed exome sequencing
http://goo.gl/A1rMRB
46. bioinformatics.ca
TCGA/ICGC users agreed:
• … to keep all computer systems on which controlled
access data reside, or which provide access to such
data, up to date with respect to software and
security patches.
• … to protect Controlled Access Data against
disclosure to unauthorized individuals.
• … to monitor and control which individuals have
access to Controlled Access Data.
47. bioinformatics.ca
TCGA/ICGC users agreed:
• … to destroy all copies of controlled access data
after controlled access privileges expire.
• … to only use secure transfer protocols:
e.g. https and sftp
• … to encrypt Controlled Access data in transfers
and storage
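As a rough illustration of the secure-transfer commitment, a script that downloads controlled access data could refuse any URL that does not use an approved protocol. The sketch below assumes the allow-list is exactly the two examples named on the slide (https and sftp); the names `ALLOWED_SCHEMES` and `uses_secure_transfer` are hypothetical, and a real data access agreement may permit other protocols.

```python
from urllib.parse import urlparse

# Assumption: only the two protocols given as examples on the slide.
ALLOWED_SCHEMES = {"https", "sftp"}


def uses_secure_transfer(url: str) -> bool:
    """Return True if the URL scheme is on the secure allow-list."""
    return urlparse(url).scheme.lower() in ALLOWED_SCHEMES
```

A download helper could call `uses_secure_transfer(url)` before opening any connection and stop with an error when it returns False. This checks the protocol only; encryption of stored copies has to be handled separately.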
48. NCRI Workshop 2015 – Module 1 bioinformatics.ca
What does it mean for this file?
simple_somatic_mutation.aggregated.vcf.gz
https://dcc.icgc.org/repository/icgc/release_19/Summary
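To get a feel for what such a file contains, one can count its variant records per chromosome. The following is a minimal sketch using only the Python standard library. It assumes the file follows the usual VCF layout (header lines starting with `#`, tab-separated records with the chromosome in the first column); the function names are hypothetical, and the local path in the usage note is just the file name from the slide.

```python
import gzip
from collections import Counter
from typing import Iterable


def count_variants_by_chrom(lines: Iterable[str]) -> Counter:
    """Count VCF data records per chromosome, skipping header lines."""
    counts = Counter()
    for line in lines:
        # Lines starting with '#' are VCF meta-information or the column header.
        if not line.strip() or line.startswith("#"):
            continue
        chrom = line.split("\t", 1)[0]
        counts[chrom] += 1
    return counts


def count_variants_in_file(path: str) -> Counter:
    """Open a gzip-compressed VCF as text and count its records."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_variants_by_chrom(handle)
```

For example, `count_variants_in_file("simple_somatic_mutation.aggregated.vcf.gz")` would return a `Counter` keyed by chromosome name, assuming the file has been downloaded to the working directory.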
51. NCRI Workshop 2015 – Module 1 bioinformatics.ca
Identify yourself
Fill out a detailed form, which includes:
• Contact and Project Information
• Information Technology details and procedures for keeping data secure
• Data Access Agreement
All of these documents are put into a PDF file that you print
and have your institution sign off on your behalf.
63. bioinformatics.ca
DACO/DCC User Data Access Process
• Users approved through DACO are now automatically granted access to
ICGC controlled access datasets available through the ICGC Data Portal and
the EBI’s EGA repository
DACO Web Application
DCC User Registry
DCC Data Portal
EBI EGA
application approved by DACO
user accounts activated
64. NCRI Workshop 2015 – Module 1 bioinformatics.ca
Catalogue of Somatic Mutations in Cancer (COSMIC)
• http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/
• COSMIC is designed to store and display somatic mutation
information and related details and contains information
relating to human cancers.
71. bioinformatics.ca
In closing
• Remember that all these sites have extensive
documentation.
• The field is changing quickly, and so are the portals.
• New features are planned as we speak, and so you
need to use the sites, and keep coming back.
• Don’t be afraid to explore
• Interested in learning more after today? Consider
one of the bioinformatics.ca workshops!
72. NCRI Workshop 2015 – Module 1 bioinformatics.ca
Acknowledgements:
the CBW gang
Michelle Brazas
Michael Stromberg
Marc Fiume
Michael Brudno