Creating a High Performance Cyberinfrastructure to Support Analysis of Illumina Metagenomic Data
1. “Creating a High Performance Cyberinfrastructure to
Support Analysis of Illumina Metagenomic Data”
DNA Day
Department of Computer Science and Engineering
University of California, San Diego
September 16, 2015
Dr. Larry Smarr
Director, California Institute for Telecommunications and Information Technology
Harry E. Gruber Professor,
Dept. of Computer Science and Engineering
Jacobs School of Engineering, UCSD
http://lsmarr.calit2.net
1
2. The National Science Foundation
Has Funded Over 100 Campuses to Build “Big Data Freeways”
134 awards,
128 projects
- All but 4 states
- 120+ institutions
3. Creating a “Big Data” Plane on Campus:
NSF Funded Prism@UCSD and CHeruB
Prism@UCSD, Phil Papadopoulos, SDSC, Calit2, PI
CHERuB, Mike Norman, SDSC PI
CHERuB
4. SDSC Big Data Compute/Storage Facility -
Interconnected at Over 1 Tbps
128
COMET
VM SC 2 PF
128
Gordon
Big Data SC Oasis Data Store
•
128
Source: Philip Papadopoulos, SDSC/Calit2
Arista Router
Can Switch
576 10Gps Light Paths
6000 TB
> 800 Gbps
# of Parallel 10Gbps
Optical Light Paths
128 x 10Gbps = 1.3TbpsSDSC
Supercomputers
5. Prism@UCSD Will Link Computational Mass Spectrometry
and Genome Sequencing Cores to the Big Data Freeway
ProteoSAFe: Compute-intensive
discovery MS at the click of a button
MassIVE: repository and
identification platform for all
MS data in the world
Source: proteomics.ucsd.edu
6. IDI Enhanced Cyberinfrastructure
Supporting Knight Lab
FIONA
12 Cores/GPU
128 GB RAM
3.5 TB SSD
48TB Disk
10Gbps NIC
Knight Lab
10Gbps
Gordon
Prism@UCSD
Data Oasis
7.5PB,
100GB/s
Knight 1024 Cluster
In SDSC Co-Lo
CHERuB
100Gbps
Emperor & Other Vis Tools
64Mpixel Data Analysis Wall
120Gbps
40Gbps
7. The Pacific Wave Platform
Creates a Regional Science-Driven “Big Data Freeway System”
Source:
John Hess, CENIC
Funded by NSF $5M Oct 2015-2020
Flash Disk to Flash Disk File Transfer Rate
8. Coupling Supercomputing to Illumina Metagenomics Sequencing
5 Ileal Crohn’s Patients,
3 Points in Time
2 Ulcerative Colitis Patients,
6 Points in Time
“Healthy” Individuals
Source: Jerry Sheehan, Calit2
Weizhong Li, Sitao Wu, CRBS, UCSD
Total of 27 Billion Reads
Or 2.7 Trillion Bases
Inflammatory Bowel Disease (IBD) Patients
250 Subjects
1 Point in Time
7 Points in Time
Each Sample Has 100-200 Million Illumina Short Reads (100 bases)
Larry Smarr
(Colonic Crohn’s)
9. We Created a Reference Database
Of Known Gut Genomes
• NCBI April 2013
– 2471 Complete + 5543 Draft Bacteria & Archaea Genomes
– 2399 Complete Virus Genomes
– 26 Complete Fungi Genomes
– 309 HMP Eukaryote Reference Genomes
• Total 10,741 genomes, ~30 GB of sequences
Now to Align Our 27 Billion Reads
Against the Reference Database
Source: Weizhong Li, Sitao Wu, CRBS, UCSD
10. Computational NextGen Sequencing Pipeline:
From Sequence to Taxonomy and Function
PI: (Weizhong Li, CRBS, UCSD):
NIH R01HG005978 (2010-2013, $1.1M)
11. To Map Out the Dynamics of Autoimmune Microbiome Ecology
Couples Next Generation Genome Sequencers to Big Data Supercomputers
Source: Weizhong Li, UCSD
Our Team Used 25 CPU-years
to Compute
Comparative Gut Microbiomes
Starting From
2.7 Trillion DNA Bases
of My Samples
and Healthy and IBD Controls
Illumina HiSeq 2000 at JCVI
SDSC Gordon Data Supercomputer
12. Next Step
Programmability, Scalability and Reproducibility using bioKepler
www.kepler-project.org
www.biokepler.org
National
Resources
(Gordon) (Comet)
(Stampede)(Lonestar)
Cloud
Resources
Optimized
Local Cluster
Resources
Source:
Ilkay
Altintas,
SDSC
13. We Found Major State Shifts in Microbial Ecology Phyla
Between Healthy and Two Forms of IBD
Most
Common
Microbial
Phyla
Average HE
Average Ulcerative Colitis Average LS Average Crohn’s Disease
Collapse of Bacteroidetes
Explosion of Actinobacteria
Explosion of
Proteobacteria
Hybrid of UC and CD
High Level of Archaea
14. Our Relative Abundance Results Across ~300 People
Reveal Potential Diagnostic Species
UC 100x Healthy
UC 100x CD
We Produced Similar Results for ~2500 Microbial Species
Healthy 100x CD
15. Dell Analytics Separates The 4 Patient Types in Our Data
Using Our Microbiome Species Data
Source: Thomas Hill, Ph.D.
Executive Director Analytics
Dell | Information Management Group, Dell Software
Healthy
Ulcerative Colitis
Colonic Crohn’s
Ileal Crohn’s
16. I Built on Dell Analytics to Show Dynamic Evolution of My Microbiome
Toward and Away from Healthy State – Colonic Crohn’s
Healthy
Ileal Crohn’s
Seven Time Samples Over 1.5 Years
Colonic Crohn’s
17. Time Series Reveals Oscillations in Immune Biomarkers
Associated with Time Progression of Autoimmune Disease
Immune &
Inflammation
Variables
Weekly
Symptoms
Pharma
Therapies
Stool
Samples
2009 20142013201220112010 2015
18. UC San Diego Will Be Carrying Out
a Major Clinical Study of IBD Using These Techniques
Inflammatory Bowel Disease Biobank
For Healthy and Disease Patients
Drs. William J. Sandborn, John Chang, & Brigid Boland
UCSD School of Medicine, Division of Gastroenterology
Over 200 Enrolled
Announced November 7, 2014
19. Next Step
Knight/Smarr Lab Collaboration
• Smarr Gut Microbiome Time Series
– From 7 to 50 Times Over Four Years
• Healthy Human Microbiome
– Use 255+ Raw Reads from NIH Human Microbiome Project
• IBD Patients: From 5 Crohn’s Disease and 2 Ulcerative Colitis Patients to ~100
– 50 Carefully Phenotyped Patients Drawn from Sandborn BioBank
– 43 Metagenomes from the RISK Cohort of Newly Diagnosed IBD patients,
• Illumina Reagent Grant Key
– Enables Deep Metagenomic (and 16S) Sequencing at IGM of Smarr + Sandborn Samples
• New Software Suite from Knight Lab
– Major Re-annotation of Reference Genomes, Functional and Taxonomic Variations
– Novel Assembly Algorithms from Pavel Pevzner-Very Computationally Intensive
– See Talk Later This Morning
• Supercomputer Grant On SDSC Comet (Awarded from XSEDE)
– From 25 Gordon to 100 Comet Core-Years
– Each Comet Core 40GF Peak=2x Gordon Core: 8X Increase in Compute