The Open Science Data Cloud: Empowering the Long Tail of Science
1. A 501(c)(3) not-for-profit operating clouds for science.
The Open Science Data Cloud: Empowering the Long Tail of Science
October 12, 2012
Robert L. Grossman
University of Chicago and Open Cloud Consortium
2. Question 1. What is the cyberinfrastructure required to manage, analyze, archive and share big data? Call this analytic infrastructure.
3. Question 2. What is the analogy of the GLIF* for analytic infrastructure?

*GLIF (www.glif.is), the Global Lambda Integrated Facility, is an international virtual organization that promotes the paradigm of lambda networking. GLIF provides lambdas internationally as an integrated facility to support data-intensive scientific research, and supports middleware development for lambda networking.
4. Number    Project type                                        Data Size        Infrastructure
   1000's    Individual scientists & small projects              Small            Public infrastructure
   100's     Community-based science via Science as a Service    Medium to Large  Shared community infrastructure
   10's      Very large projects                                 Very Large       Dedicated infrastructure
5. The long tail of data science
[Figure: a few large data science projects at the head of the distribution; many smaller data science projects in the long tail.]
6. Part 1. What Instrument Do We Use to Make Big Data Discoveries? How do we build a "datascope"?
8. Another way: opencompute.org. Think of data as big if you measure it in MW; for example, Facebook's Prineville Data Center is 30 MW.
9. An algorithm and computing infrastructure is "big-data scalable" if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
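
One way to make this precise (our notation, not the slide's): let T(D, R) be the time for the computation on data of size D spread over R racks. Big-data scalability is then the weak-scaling condition

    T(cD, cR) \approx T(D, R), \qquad c = 1, 2, 3, \ldots

that is, growing the data and the racks (with their processors) by the same factor leaves the running time roughly unchanged.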
10. Commercial Cloud Service Provider (CSP)
[Diagram: a 15 MW commercial cloud data center and the services around it.]
• Scale: 100,000 servers, 1 PB DRAM, 100's of PB of disk
• ~1 Tbps egress bandwidth
• Operations: 25 operators for 15 MW
• Services: customer facing portal; automatic provisioning and infrastructure management; monitoring and network security; accounting, billing and forensics; commercial cloud data center network
11. My vote for a datascope: a (boutique) data center scale facility with a big-data scalable analytic infrastructure. What would a global integrated facility for datascopes look like?
12. Some Examples of Big Data Science

Discipline        Duration   Size           # Devices
HEP - LHC         10 years   15 PB/year*    One
Astronomy - LSST  10 years   12 PB/year**   One
Genomics - NGS    2-4 years  0.5 TB/genome  1000's

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
14. Datascope – Science Cloud Service Provider (Sci CSP)
[Diagram: data scientists working with a datascope through Sci CSP services.]
15. What are some of the important differences between commercial and research-focused Sci CSPs?
16.           Science CSP                          Commercial CSP

POV           Democratize access to data.          As long as you pay the bill;
              Integrate data to make               as long as the business model
              discoveries. Long term archive.      holds.
Data &        Data intensive Science Clouds:       Internet style scale out and
Storage       computing & HP storage               object-based storage
Flows         Large data flows in and out          Lots of small web flows
Streams       Streaming processing required        NA
Accounting    Essential                            Essential
Lock in       Moving environment between           Lock in is good
              CSPs essential
17. Part 2. The Open Cloud Consortium's Open Science Data Cloud
18. • U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud.
• Manages cloud computing testbeds: the Open Cloud Testbed.
www.opencloudconsortium.org
19. OCC Members & Partners
• Companies: Cisco, Yahoo!, Citrix, …
• Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago, …
• Federal agencies and labs: NASA, LLNL, ORNL
• International Partners: AIST (Japan), U. Edinburgh, U. Amsterdam, …
• Partners: National Lambda Rail
20. OCC 2011 Resources

Resource               Type           Comments
OSDC Adler & Sullivan  Utility Cloud  1248 cores and 0.4 PB disk
OCC–Y                  Data Cloud     928 cores and 1.0 PB disk
OCC–Matsu              Mixed          1 rack
OSDC Root              Storage        0.8 PB

• OCC-Adler, Sullivan & Root will more than double in size in 2012.
22. One Million Genomes
• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB, or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B.
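
The back-of-the-envelope arithmetic behind these bullets, as a quick sketch (the ~10x compression ratio is implied by the slide's 1 EB vs. 100 PB figures rather than stated explicitly):

    # Order-of-magnitude sizing for one million genomes.
    genomes = 1_000_000
    tb_per_genome = 1.0                         # ~1 TB per patient (tumor + normal)

    raw_pb = genomes * tb_per_genome / 1000.0   # 1,000,000 TB -> 1000 PB
    raw_eb = raw_pb / 1000.0                    # -> 1 EB
    compressed_pb = raw_pb / 10.0               # assuming ~10x compression
    cost_usd_billions = genomes * 1000 / 1e9    # at $1000 per genome

    print(raw_pb, raw_eb, compressed_pb, cost_usd_billions)
    # -> 1000.0 1.0 100.0 1.0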
23. Big data driven discovery on 1,000,000 genomes and 1 EB of data.
[Diagram: genomic-driven diagnosis, improved understanding of genomic science, and genomic-driven drug development, leading to precision diagnosis and treatment and preventive health care.]
25. UDR
• UDT is a high performance network transport protocol.
• UDR = rsync + UDT.
• With UDR, it is easy for an average systems administrator to keep 100's of TB of distributed data synchronized.
• We are using it to distribute c. 1 PB from the OSDC.
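
A minimal sketch of driving UDR from a script (assumptions: the udr binary is installed and on the PATH, and it follows the documented pattern of prefixing an ordinary rsync command line; the hosts and paths below are hypothetical):

    import subprocess

    # UDR keeps the familiar rsync interface but carries the data over UDT,
    # which performs far better than TCP on high bandwidth-delay-product links.
    cmd = [
        "udr", "rsync", "-av", "--stats",
        "/data/osdc_public/",                        # hypothetical local dataset
        "mirror@example.org:/data/osdc_public/",     # hypothetical remote mirror
    ]
    subprocess.run(cmd, check=True)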
26. OpenFlow-Enabled Hadoop WG
• When running Hadoop, some map and reduce jobs take significantly longer than others.
• These stragglers can significantly slow down a MapReduce computation.
• Stragglers are common (a dirty secret about Hadoop).
• Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers (sketched below).
• We have a testbed for a wide area version of this project.
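
A toy sketch of the straggler-detection half of this idea (ours, not the working group's code): flag tasks whose progress rate falls well below the median, then let an OpenFlow controller give their hosts extra bandwidth. All names here are hypothetical.

    from statistics import median

    def find_stragglers(tasks, threshold=0.5):
        # tasks: dict of task_id -> (fraction_complete, elapsed_seconds).
        # Returns task_ids progressing slower than threshold * median rate.
        rates = {tid: done / secs for tid, (done, secs) in tasks.items() if secs > 0}
        cutoff = threshold * median(rates.values())
        return [tid for tid, rate in rates.items() if rate < cutoff]

    # Hypothetical snapshot of a MapReduce job's running tasks.
    snapshot = {
        "map_001": (0.90, 100),   # healthy
        "map_002": (0.85, 100),   # healthy
        "map_003": (0.20, 100),   # straggler, still pulling its input
    }
    for tid in find_stragglers(snapshot):
        # In the working group's design, an OpenFlow controller would install
        # flow rules here giving this task's host additional bandwidth.
        print(tid, "is a candidate for extra bandwidth")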
27. OSDC PIRE Project
We select OSDC PIRE Fellows (US citizens or permanent residents):
• We give them tutorials and training on big data science.
• We provide them fellowships to work with OSDC international partners.
• We give them preferred access to the OSDC.
Nominate your favorite scientist as an OSDC PIRE Fellow.
www.opensciencedatacloud.org (look for PIRE)
29. Open Science Data Cloud (OSDC)
[Diagram: the OSDC science cloud data center and the services around it.]
• Scale: 3 PB in 2011, 10 PB in 2012; able to scale to 100 PB?
• ~100 Gbps bandwidth
• Operations: 5-12 operators to operate a 1-5 MW Science Cloud
• Services: customer facing portal (Tukey); science cloud SW & services; automatic provisioning and infrastructure management; accounting and billing; monitoring, compliance, & security; data center network
• OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT, …
30. Cloud Services Operations Centers (CSOC)
• The OSDC operates a Cloud Services Operations Center (or CSOC).
• It is a CSOC focused on supporting Science Clouds for researchers.
• Compare to a Network Operations Center, or NOC.
• Both are an important part of the cyberinfrastructure for big data science.
31. OSDC Racks
• How quickly can we set up a rack?
• How efficiently can we operate a rack? (racks/admin)
2012 OSDC rack design (draft):
• 950 TB / rack
• 600 cores / rack
32. Essential Services for a Science CSP
• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services
34. Acknowledgements
Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors:
• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.
The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at info@opensciencedatacloud.org.
35. For more information
• You can find some more information on my blog: rgrossman.com.
• Some of my technical papers are also available there.
• My email address is robert.grossman at uchicago dot edu.