Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

Bionimbus: Lessons from a Petabyte-Scale Science Cloud Service Provider (CSP)

Robert Grossman
Institute for Genomics & Systems Biology
Center for Research Informatics
Computation Institute
Department of Medicine
University of Chicago
&
Open Data Group

September 11, 2012

The OSDC & Bionimbus Teams

•  Open Science Data Cloud (OSDC) Team
–  Matt Greenway, Allison Heath, Ray Powell, Rafael Suarez.
–  Major funding for the OSDC is provided by the Gordon and Betty Moore Foundation.
•  Bionimbus Team
–  Elizabeth Bartom, Casey Brown, Jason Grundstad, David Hanley, Nicolas Negre, Tom Stricker, Matt Slattery, Rebecca Spokony & Kevin White.
–  Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago and uses in part the OSDC infrastructure.

Let’s Step Back 20 Years

•  1992-96: Petabyte Access & Storage Solutions (PASS) Project for the SSC (Superconducting Super Collider).
•  It developed & benchmarked federated relational, OO DB, object stores, & column-oriented data warehouse solutions at the TB-scale.

A picture of CERN’s Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion.

Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732

Part 1. Genomics as a Big Data Science

One Million Genomes

•  Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
•  The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
•  One million genomes is about 1000 PB or 1 EB.
•  With compression, it may be about 100 PB.
•  At $1000/genome, the sequencing would cost about $1B.
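
The arithmetic on this slide can be checked with a short sketch. Decimal units (1 PB = 1000 TB) are assumed, the 10x compression ratio is inferred from the slide's 1000 PB and 100 PB figures rather than stated, and the function and parameter names are illustrative only.

```python
TB_PER_PB = 1000  # decimal units (assumption)
PB_PER_EB = 1000


def cohort_estimate(genomes, tb_per_patient=1.0,
                    compression_ratio=10.0, cost_per_genome=1000.0):
    """Rough storage and sequencing-cost estimate for a cohort."""
    raw_pb = genomes * tb_per_patient / TB_PER_PB
    return {
        "raw_pb": raw_pb,
        "raw_eb": raw_pb / PB_PER_EB,
        # compression_ratio is an assumed value, not a measured one
        "compressed_pb": raw_pb / compression_ratio,
        "sequencing_cost_usd": genomes * cost_per_genome,
    }


estimate = cohort_estimate(1_000_000)
print(estimate)
```

Changing `tb_per_patient` to 0.5 (the per-genome figure used elsewhere in the talk for NGS) halves the storage totals, which shows how sensitive the estimate is to that one input.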

[Figure: Big data driven discovery on 1,000,000 genomes and 1 EB of data. Three outcomes are shown: genomic-driven diagnosis, improved understanding of genomic science, and genomic-driven drug development. These lead to precision diagnosis and treatment, and to preventive health care.]

[Figure: two panels labeled ER+ and TNBC.]

With genomics, we can stratify diseases and treat each stratum differently. Source: White Lab, University of Chicago.

Clonal Evolution of Tumors

Tumors evolve temporally and spatially.

Source: Mel Greaves & Carlo C. Maley, Clonal evolution in cancer, Nature, Volume 481, pages 306-313, 2012.

Combinations of Rare Alleles

[Figure: penetrance (low, modest, intermediate, high) plotted against allele frequency (very rare, rare, uncommon, common; axis ticks at 0.001, 0.01, 0.1). Labeled regions:
•  Rare alleles causing Mendelian disease (high penetrance, very rare).
•  Rare examples of high-penetrance common variants influencing common disease.
•  Low-frequency variants with intermediate penetrance.
•  Rare variants of small effect, very hard to identify by genetic means.
•  Most common variants implicated in common disease by GWA (low penetrance, common).]

Source: Mark McCarthy

TCGA Analysis of Lung Cancer

•  178 cases of SQCC (lung cancer)
•  Matched tumor & normal
•  Mean of 360 exonic mutations, 323 CNV, & 165 rearrangements per tumor

Source: The Cancer Genome Atlas Research Network, Comprehensive genomic characterization of squamous cell lung cancers, Nature, 2012, doi:10.1038/nature11404.

Some Examples of Big Data Science

Discipline         Duration    Size             # Devices
HEP - LHC          10 years    15 PB/year*      One
Astronomy - LSST   10 years    12 PB/year**     One
Genomics - NGS     2-4 years   0.5 TB/genome    1000’s

One large instrument (LHC, LSST) versus many smaller instruments (NGS).

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html

Part 2. What Instrument Do We Use to Make Big Data Discoveries?

How do we build a “datascope?”

What is big data?

TB? PB? EB? ZB?

Another way: opencompute.org

Think of data as big if you measure it in MW, as in Facebook’s Prineville Data Center is 30 MW.

An algorithm and computing infrastructure is “big-data scalable” if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
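
One way to read this definition is as a check on measured runs: as racks are added, data per rack should stay about constant and the running time should stay about flat. The sketch below is a minimal illustration under that reading. The function name, the 10% tolerance, and all measurements are hypothetical and do not come from the talk.

```python
def is_big_data_scalable(runs, tolerance=0.10):
    """Check a list of (racks, data_tb, seconds) runs of one computation.

    The first run is the baseline. Each later run must keep data per rack
    roughly constant (data grows with racks) and must finish in roughly
    the baseline time.
    """
    base_racks, base_data_tb, base_seconds = runs[0]
    base_density = base_data_tb / base_racks
    for racks, data_tb, seconds in runs[1:]:
        density = data_tb / racks
        if abs(density - base_density) > tolerance * base_density:
            return False  # data did not grow in step with the racks
        if seconds > base_seconds * (1 + tolerance):
            return False  # the same computation got slower
    return True


# Hypothetical measurements, not from the talk.
flat = [(1, 950, 3600), (2, 1900, 3650), (4, 3800, 3700)]
degrading = [(1, 950, 3600), (2, 1900, 5400), (4, 3800, 9000)]

print(is_big_data_scalable(flat))
print(is_big_data_scalable(degrading))
```

The first series keeps the running time within tolerance as racks double, so it passes; the second slows down with every added rack, so it fails.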

Commercial Cloud Service Provider (CSP): 15 MW Data Center

[Figure: components of a commercial CSP.]
•  Monitoring, network security and forensics
•  Accounting and billing
•  Customer Facing Portal
•  Automatic provisioning and infrastructure management
•  Data center network

Scale:
•  100,000 servers
•  1 PB DRAM
•  100’s of PB of disk
•  ~1 Tbps egress bandwidth
•  25 operators for 15 MW Commercial Cloud

What are some of the important differences between commercial and research-focused CSPs?

Science Clouds

                Science CSP                             Commercial CSP
POV             Democratize access to data.             As long as you pay the bill;
                Integrate data to make discoveries.     as long as the business
                Long term archive.                      model holds.
Data & Storage  Data intensive computing & HP storage   Internet style scale out and
                                                        object-based storage
Flows           Large data flows in and out             Lots of small web flows
Streams         Streaming processing required           NA
Accounting      Essential                               Essential
Lock in         Moving environment between CSPs         Lock in is good
                essential

Part 3. The Open Cloud Consortium’s Open Science Data Cloud

•  U.S. based not-for-profit corporation.
•  Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.
•  Manages cloud computing testbeds: Open Cloud Testbed.

www.opencloudconsortium.org

Cloud Services Operations Centers (CSOC)

•  The OSDC operates a Cloud Services Operations Center (or CSOC).
•  It is a CSOC focused on supporting Science Clouds for researchers.
•  Compare to a Network Operations Center or NOC.
•  Both are an important part of cyberinfrastructure for big data science.

Different Styles of OSDC Racks

•  Design 1: Put cores over spindles.
•  Higher cost but easy to compute over all the data.
•  Design 2: separate (some of the) storage from the compute.

2012 OSDC rack design (draft)

•  950 TB / rack
•  600 cores / rack
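
The per-rack figures give a quick way to size a deployment. The sketch below assumes every rack follows the draft 2012 figures (950 TB and 600 cores), uses decimal units, and ignores replication or RAID overhead; the 10 PB and 100 PB capacities are the OSDC targets quoted elsewhere in this talk. Function names are illustrative.

```python
TB_PER_RACK = 950     # from the draft 2012 rack design
CORES_PER_RACK = 600  # from the draft 2012 rack design
TB_PER_PB = 1000      # decimal units (assumption)


def racks_needed(capacity_pb):
    """Smallest whole number of racks that holds capacity_pb of raw storage."""
    capacity_tb = capacity_pb * TB_PER_PB
    return -(-capacity_tb // TB_PER_RACK)  # ceiling division on integers


def cores_for(capacity_pb):
    """Cores that come along with the racks needed for that capacity."""
    return racks_needed(capacity_pb) * CORES_PER_RACK


for pb in (10, 100):
    print(pb, "PB ->", racks_needed(pb), "racks,", cores_for(pb), "cores")
```

Under these assumptions 10 PB needs 11 racks and 100 PB needs 106, so real deployments with redundancy would need more.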

Open Science Data Cloud (OSDC)

[Figure: components of the OSDC.]
•  Monitoring, compliance, & security
•  Accounting and billing
•  Customer Facing Portal (Tukey)
•  Science Cloud SW & Services
•  Automatic provisioning and infrastructure management
•  Data center network

Scale:
•  3 PB in 2011
•  10 PB in 2012
•  ~100 Gbps bandwidth
•  Able to scale to 100 PB?
•  5-12 operators to operate a 1-5 MW Science Cloud

OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT, …

OSDC Philosophy

•  We try to automate as much as possible (we automate the setup & operations of a rack).
•  We try to write as little software as possible.
•  Each project is a bit different, but in general:
•  We assign (permanent) IDs to data managed by the OSDC and manage associated metadata.
•  We assign and enforce permissions for users & groups of users and for files/objects, collections of files/objects, and collections of collections.
•  We support RESTful interfaces.
•  We do accounting for storage and core-hours.
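
The ID, permission, and accounting points above can be sketched in a few lines. This is a minimal illustration, not the OSDC implementation: the `Node` class, the rule that access granted on a collection applies to everything inside it, the UUID-based IDs, and the rates are all assumptions made for the example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Node:
    """A file/object, or a collection that contains other nodes."""
    name: str
    parent: "Node | None" = None
    readers: set = field(default_factory=set)  # users or groups granted read access
    # Permanent ID assigned once at creation (UUID is an assumed scheme).
    node_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def can_read(node, principals):
    """True if any principal (user name or group) is granted read access
    on the node or on any enclosing collection (assumed inheritance rule)."""
    while node is not None:
        if node.readers & principals:
            return True
        node = node.parent
    return False


def usage_charge(tb_months, core_hours, usd_per_tb_month, usd_per_core_hour):
    """Accounting for storage and core-hours; the rates are hypothetical."""
    return tb_months * usd_per_tb_month + core_hours * usd_per_core_hour


# A collection of collections: project -> runs -> sample.bam
project = Node("project", readers={"group:lab"})
runs = Node("runs", parent=project)
bam = Node("sample.bam", parent=runs)

print(can_read(bam, {"alice", "group:lab"}))
print(can_read(bam, {"bob"}))
print(usage_charge(2, 100, 10.0, 0.25))
```

Walking up the parent chain keeps the permission check simple for arbitrarily nested collections, at the cost of one lookup per level.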

Some Of Our Biggest Mistakes

•  Not charging those who were the largest users of our services. This resulted in a lot of bad behavior.
•  Trying to support donated equipment without adequate staff.
•  Being too optimistic about when big data software would be ready for prime time.
•  Some problems with big data software don’t show up at less than the full scale of the OSDC, but we have only one OSDC and it is difficult to test at this scale.

Essential Services for a Science CSP

•  Support for data intensive computing
•  Support for big data flows
•  Account management, authentication and authorization services
•  Health and status monitoring
•  Billing and accounting
•  Ability to rapidly provision infrastructure
•  Security services, logging, event reporting
•  Access to large amounts of public data
•  High performance storage
•  Simple data export and import services

[Figure: number of projects plotted against data size.
•  1000’s: individual scientists & small projects; small data; public infrastructure.
•  100’s: community based science via Science as a Service; medium to large data; shared community infrastructure.
•  10’s: very large projects; very large data; dedicated infrastructure.]

Part 4. Bionimbus

Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago.

Step 1. Prepare a sample.

Step 2. Log in to Bionimbus and get a Bionimbus Key.

Step 3. Send your sample to the sequencing center.

Step 4. Log in to Bionimbus and view your data.

Step 5. Use Bionimbus to perform standard and custom pipelines.

Bionimbus can launch multiple virtual machines.

Bionimbus Virtual Machine Releases

Peak Calling             MAT, MA2C, PeakSeq, MACS, SPP
Quality Control          Various
Alignment & Genotyping   Bowtie, TopHat, Samtools, Picard

Software Tools: Moving Genomes

Bionimbus Community Genomic Cloud

[Figure: what a researcher sees.]
•  Cloud for Public Data: 1K genomes, PubMed, etc.
•  Personal “dropbox” + compute

Bionimbus Private Genomic Cloud

[Figure: what a researcher sees.]
•  Cloud for Public Data: 1K genomes, PubMed, etc.
•  Cloud for Controlled Data: TCGA, dbGaP
•  Personal “dropbox” & compute

Bionimbus Private Biomedical Cloud

[Figure: what a researcher sees.]
•  Cloud for Public Data: 1K genomes, PubMed, etc.
•  Cloud for Controlled Data: TCGA, dbGaP
•  Personal “dropbox” plus compute
•  Cloud for PHI data
•  Clinical Research Data Warehouse, accessed with scatter, gather queries

[Figure: Bionimbus workflow, listed in step order.]

Step 1. Get Bionimbus ID (BID), assign project, private/community, public cloud, etc. (BID Generator)

Step 2. Send sample to be sequenced. (Internal sequencers or external sequencing partner)

Step 3a. Return raw reads.

Step 3b. Return variant calls, CNV, annotation…

Step 4. Secure data routing to appropriate cloud based upon BID.

Step 5. Cloud based analysis using IGSB and 3rd party tools and applications.

Clouds shown in the figure: Bionimbus Private Cloud UC, Bionimbus Community Cloud, Bionimbus Private Cloud XY, dbGaP, Amazon.
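
Step 4, routing by BID, can be pictured as a registry lookup. The sketch below is an illustration under assumptions, not the Bionimbus code: the record fields, the BID format, and the destination names are all hypothetical, and the only behavior taken from the slide is that the BID determines the target cloud.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BidRecord:
    bid: str      # Bionimbus ID issued in Step 1 (format is hypothetical)
    project: str
    cloud: str    # "private", "community" or "public", chosen in Step 1


# Hypothetical destination names; real endpoints are not given in the talk.
DESTINATIONS = {
    "private": "bionimbus-private-cloud",
    "community": "bionimbus-community-cloud",
    "public": "public-cloud",
}


def route(bid, registry):
    """Return the destination cloud for data tagged with this BID.

    Unknown BIDs raise an error instead of falling back to a default,
    so unregistered data is never routed anywhere.
    """
    record = registry.get(bid)
    if record is None:
        raise KeyError(f"unregistered BID: {bid}")
    return DESTINATIONS[record.cloud]


registry = {
    "BID-0001": BidRecord("BID-0001", "project-a", "private"),
    "BID-0002": BidRecord("BID-0002", "project-b", "community"),
}

print(route("BID-0001", registry))
print(route("BID-0002", registry))
```

Failing closed on an unknown BID is a design choice in this sketch: for controlled or private data, refusing to route is safer than guessing a destination.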

[Figure: Bionimbus architecture. Components shown:
•  web2py-based Front End
•  Database Services (PostgreSQL)
•  Analysis Pipelines & Re-analysis Services
•  Data Ingestion Services
•  Utility Cloud Services (Eucalyptus, OpenStack)
•  Intercloud Services (UDT, replication)
•  Data Cloud Services (Hadoop, Sector/Sphere)
One component is also annotated “(IDs, etc.)”.]

>300 ChIP datasets

-  Chromatin/RNA timecourse
-  CBP
-  PolII
-  Pho/silencers
-  HDACs
-  Insulators
-  TFs

Predictions

•  537 silencers
•  2,307 new promoters
•  12,285 enhancers
•  14,145 insulators

www.modencode.org

Negre et al. Nature 2011

Part 5. Managing One Million Genomes

[Figure: three tiers of data for one million genomes, each with candidate storage technology.]

•  Summary level (10-100 TB): relational databases. Enrich with clinical data.
•  Variation (VCF) files (1-10 PB; genomic variation): NoSQL & scientific databases.
•  Sequence (BAM) files (100-1000 PB; sequence data in binary form): NoSQL, DFS, file overlays?
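
Dividing each tier by one million genomes shows what the ranges imply per genome. The sketch below assumes decimal units and treats the slide's ranges as totals for the full cohort; the tier keys and function name are illustrative.

```python
GENOMES = 1_000_000
MB_PER_TB = 1_000_000        # decimal units (assumption)
MB_PER_PB = 1_000_000_000

# (low, high) totals in MB, taken from the ranges on the slide.
TIERS = {
    "summary": (10 * MB_PER_TB, 100 * MB_PER_TB),
    "variation_vcf": (1 * MB_PER_PB, 10 * MB_PER_PB),
    "sequence_bam": (100 * MB_PER_PB, 1000 * MB_PER_PB),
}


def per_genome_mb(tier):
    """Return the (low, high) size per genome in MB for one tier."""
    low, high = TIERS[tier]
    return low // GENOMES, high // GENOMES


for tier in TIERS:
    print(tier, per_genome_mb(tier))
```

Per genome this gives 10-100 MB of summary data, 1-10 GB of variation data, and 100 GB to 1 TB of sequence data, with the upper bound matching the roughly 1 TB per patient quoted in Part 1.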

Acknowledgements

Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities.

Additional funding for the OSDC has been provided by the following sponsors:

•  The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
•  Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
•  NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
•  OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
•  The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.

The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at info@opensciencedatacloud.org.

For more information

•  You can find some more information on my blog: rgrossman.com.
•  Some of my technical papers are also available there.
•  My email address is robert.grossman at uchicago dot edu
•  I recently wrote a popular book about computing called: The Structure of Digital Computing: From Mainframes to Big Data, which you can buy from Amazon.


Sources for images

•  The image of the hard disk is from Norlando Pobre, Creative Commons.
•  The image of the Facebook Prineville Data Center is from the Intel Free Press, www.flickr.com/photos/intelfreepress/6722296855/, Creative Commons BY 2.0.
•  The image of the LHC is from Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
