The Open Science Data Cloud is a petabyte-scale science cloud for managing, analyzing, and sharing large datasets. We give an overview of the Open Science Data Cloud and how it can be used for data science research.
Using the Open Science Data Cloud for Data Science Research
1. Using the Open Science Data Cloud for Data Science Research
Robert Grossman
University of Chicago
Open Cloud Consortium
June 17, 2013
2. Data (1 PB of OSDC data across several disciplines)
+ Instrument (3,000 cores / 5 PB OSDC science cloud)
+ Team (you and your colleagues)
+ correlation algorithms
→ Discoveries
3. Part 1: What Instrument Do We Use to Make Big Data Discoveries?
How do we build a “datascope?”
5. An algorithm and computing infrastructure is “big-data scalable” if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time, but over more data.
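This weak-scaling definition can be made concrete with a small sketch. The throughput figure below is a hypothetical number for illustration, not an OSDC benchmark; the point is only that when racks grow in proportion to the data, the ideal wall-clock time stays flat:

```python
def wall_clock_time(data_tb, racks, per_rack_throughput_tb_per_hr=10.0):
    """Ideal weak-scaling model: each rack processes its shard in parallel."""
    return (data_tb / racks) / per_rack_throughput_tb_per_hr

t1 = wall_clock_time(data_tb=100, racks=1)   # baseline: 100 TB on one rack
t2 = wall_clock_time(data_tb=200, racks=2)   # add a rack of data and processors
assert t1 == t2  # same computation, same time, but over more data
```

Real systems fall short of this ideal (shuffles, stragglers, and coordination all grow with cluster size), which is why "big-data scalable" is a property to be engineered for rather than assumed.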
6. Commercial Cloud Service Provider (CSP)
• 15 MW data center: 100,000 servers, 1 PB DRAM, 100s of PB of disk
• Automatic provisioning and infrastructure management
• Monitoring, network security and forensics
• Accounting and billing
• Customer-facing portal
• Data center network with ~1 Tbps egress bandwidth
• 25 operators for a 15 MW commercial cloud
7. OSDC’s vote for a datascope: a (boutique) data-center-scale facility with a big-data scalable analytic infrastructure.
9. Some Examples of Big Data Science

Discipline       | Duration  | Size          | # Devices
HEP – LHC        | 10 years  | 15 PB/year*   | One
Astronomy – LSST | 10 years  | 12 PB/year**  | One
Genomics – NGS   | 2–4 years | 0.5 TB/genome | 1000s

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
12. There Are Two Essential Characteristics of a Cloud
1. Self service
2. Scale
• Clouds enable you to compute over large amounts of data without the necessity of first downloading the data.
• Clouds can be designed to be secure and compliant.
15. Types of Clouds
• Public Clouds – Amazon
• Private Clouds – Run internally by universities or companies
• Community Clouds – Run by organizations (either formally or informally), such as the Open Cloud Consortium
16. Amazon Web Services (AWS) vs. Community Clouds, Science Clouds, etc.
Community clouds, science clouds, etc.:
• Lower cost (at medium scale)
• Data too important for a commercial cloud
• Computing over scientific data is a core competency
• Can support any required governance / security
AWS:
• Scale
• Simplicity of a credit card
• Wide variety of offerings
OCC supports AWS interop and bursting when permissible.
17. Science Clouds

           | NFP Science Clouds                                                                  | Commercial Clouds
POV        | Democratize access to data. Integrate data to make discoveries. Long-term archive.  | As long as you pay the bill; as long as the business model holds.
Data & Storage | Data-intensive computing & HP storage                                           | Internet-style scale-out and object-based storage
Flows      | Large & small data flows                                                            | Lots of small web flows
Streams    | Streaming processing required                                                       | NA
Accounting | Essential                                                                           | Essential
Lock-in    | Moving environment between CSPs essential                                           | Lock-in is good
Interop    | Critical, but difficult                                                             | Customers will drive to some degree
18. Essential Services for a Science CSP
• Support for data-intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services
19. Datascope – Science Cloud Service Provider (Sci CSP)
Data scientist; Sci CSP services
20. Cloud Services Operations Centers (CSOC)
• The OSDC operates a Cloud Services Operations Center (or CSOC).
• It is a CSOC focused on supporting science clouds for researchers.
• Compare to a Network Operations Center, or NOC.
• Both are an important part of the cyber infrastructure for big data science.
21. Datascope – Science Cloud Service Provider (Sci CSP)
Data scientist; Sci CSP services; Cloud Service Operations Center (CSOC)
23. Foundations of Data Science
• General and discipline-specific software applications and tools
• Models and algorithms
• Analytic infrastructure
• Data
• Establish best practices and strategies for data science in general, and discipline-specific data science in particular
25. Theory to Big Data Spectrum
• Mathematical theorems – no data
• Traditional statistical modeling – small data
• (Semi-)automating statistical modeling – medium data
• Simple counts and statistics over big data – big data
GB → TB → PB
OSDC Datascope: 0.5–2.0 MW
26. Part 4: The Open Science Data Cloud
www.opensciencedatacloud.org
29. Tukey
• Tukey (based in part on Horizon).
• We have factored out the digital ID service, file sharing, and transport from Bionimbus and Matsu.
30. Yates
• Automated installation of the OSDC software stack on a rack of computers.
• Based upon Chef.
• Version 0.1.
31. UDR
• UDT is a high-performance network transport protocol.
• UDR = rsync + UDT.
• It makes it easy for an average systems administrator to keep 100s of TB of distributed data synchronized.
• We are using it to distribute c. 1 PB from the OSDC.
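Because UDR wraps rsync, an existing rsync invocation can be moved onto UDT by prefixing it with the `udr` command. A minimal sketch of building such a command (the host and paths are hypothetical, and the exact options accepted depend on the UDR version you install):

```python
import subprocess  # used only when actually executing the command

def udr_rsync(src, dest, rsync_opts=("-av",)):
    """Build a UDR-wrapped rsync command line: udr rsync [opts] src dest.

    UDR tunnels rsync's data stream over UDT, which typically performs far
    better than TCP on high-bandwidth, high-latency wide-area links.
    """
    return ["udr", "rsync", *rsync_opts, src, dest]

# Hypothetical example: mirror a public data set from an OSDC host.
cmd = udr_rsync("user@example.opensciencedatacloud.org:/data/set1/",
                "/local/mirror/")
# When udr is installed: subprocess.run(cmd, check=True)
```

The appeal for operators is that all existing rsync options, filters, and habits carry over unchanged; only the transport underneath changes.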
32. Open Science Data Cloud Services
• Digital ID services
• Data sharing services
• Data transport services (UDR)
• What other core services are essential?
• Of course, working groups and applications always add their own services.
• These core services will hopefully make the OSDC attractive as a platform (PaaS) for scientific discovery.
33. www.opencloudconsortium.org
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.
• Manages cloud computing infrastructure to support medical and health care research: Biomedical Commons Cloud.
• Manages cloud computing testbeds: Open Cloud Testbed.
34. OCC Members & Partners
• Companies: Cisco, Yahoo!, Intel, …
• Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago, …
• Federal agencies and labs: NASA
• International partners: Univ. Edinburgh, AIST (Japan), Univ. Amsterdam, …
• Partners: National Lambda Rail
35. Third-party open source software
+ Tukey and Yates: open source software developed by the OCC, and open standards
+ Data center
+ Data with permissions
+ Authorization of users’ access to data
+ Policies, procedures, controls, etc.
+ Governance, legal agreements
+ Sustainability model
39. OSDC Public Data Sets
• Over 800 TB of open access data in the OSDC
• Earth sciences data
• Biological sciences data
• Social sciences data
• Digital humanities
40. Part 6: OSDC Working Groups
Just look around you
42. Matsu Architecture
• Hadoop HDFS: storage for WMS tiles and derived data products; holds Level 0, Level 1 and Level 2 images.
• Matsu MapReduce-based Tiling Service: MapReduce is used to process Level n to Level n+1 data and to partition images for the different zoom levels.
• NoSQL database: images at different zoom layers, suitable for an OGC Web Mapping Server.
• Matsu Web Map Tile Service (WMTS).
• Analytic services: NoSQL-based, streaming, and MapReduce-based analytic services.
• Presentation services: Web Coverage Processing Service (WCPS).
• Workflow services.
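The tile-partitioning step lends itself naturally to MapReduce because each pixel's tile is a pure function of its coordinates and the zoom level. A sketch of that keying logic (illustrative, not the Matsu code; 256-pixel tiles are the usual WMTS convention):

```python
TILE = 256  # tile edge in pixels

def tile_key(x_px, y_px, zoom):
    """Return the (zoom, col, row) key a mapper would emit for a pixel."""
    return (zoom, x_px // TILE, y_px // TILE)

def downsample_key(zoom, col, row):
    """Four adjacent tiles at zoom z aggregate into one tile at zoom z-1,
    which is how a reduce pass can build the coarser zoom layers."""
    return (zoom - 1, col // 2, row // 2)

# e.g. pixel (1000, 300) at zoom 4 lands in tile (4, 3, 1),
# which aggregates into tile (3, 1, 0) one level up.
```

Because mappers emit independent (zoom, col, row) keys, the shuffle groups exactly the pixels belonging to one output tile, and the pyramid of zoom layers can be built level by level with repeated passes.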
46. Analyzing Data From The Cancer Genome Atlas (TCGA)

Current practice:
1. Apply to dbGaP for access to data.
2. Hire staff, set up and operate a secure, compliant computing environment to manage 10–100+ TB of data.
3. Get the environment approved by your research center.
4. Set up analysis pipelines.
5. Download data from CGHub (takes days to weeks).
6. Begin analysis.

With the Protected Data Cloud (PDC):
1. Apply to dbGaP for access to data.
2. Use your eRA Commons credentials to log in to the PDC, select the data that you want to analyze, and the pipelines that you want to use.
3. Begin analysis.
47. One Million Genomes
• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB, or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B.
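The slide's back-of-the-envelope figures check out directly (the roughly 10x compression ratio is the one implied by the slide's 1000 PB → 100 PB estimate):

```python
genomes = 1_000_000
tb_per_patient = 1             # tumor + normal tissue, per the slide

total_tb = genomes * tb_per_patient
total_pb = total_tb / 1000     # 1 PB = 1000 TB
total_eb = total_pb / 1000     # 1 EB = 1000 PB
assert total_pb == 1000 and total_eb == 1

compressed_pb = total_pb / 10  # assumes ~10x compression, as implied
assert compressed_pb == 100

cost = genomes * 1000          # $1000 per genome
assert cost == 1_000_000_000   # about $1B
```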
48. Big data driven discovery on 1,000,000 genomes and 1 EB of data:
• Genomic-driven diagnosis
• Improved understanding of genomic science
• Genomic-driven drug development
• Precision diagnosis and treatment
• Preventive health care
49. Biomedical Commons Cloud (BCC) Working Group
• Cloud for public data
• Cloud for controlled genomic data
• Cloud for EMR, PHI data
Example: the Open Cloud Consortium’s Biomedical Commons Cloud (BCC), spanning Medical Research Centers A, B and C, and Hospital D.
50.
Resource | Who uses | Who operates
Open Science Data Cloud (OSDC) | Pan-science data for researchers | Open Cloud Consortium (OCC), supported by university OCC members
Biomedical Commons Cloud (BCC) | (International) biomedical researchers | OCC Biomedical Commons Cloud Working Group, supported by OCC university members
Bionimbus Protected Data Cloud | Genomics researchers | University of Chicago, supported by the OCC
51. OpenFlow-Enabled Hadoop WG
• When running Hadoop, some map and reduce jobs take significantly longer than others.
• These are stragglers, and they can significantly slow down a MapReduce computation.
• Stragglers are common (a dirty secret about Hadoop).
• Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers.
• We have a testbed for a wide area version of this project.
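A controller that gives stragglers extra bandwidth first has to identify them. A sketch of the kind of detection logic involved (the names and the 0.5 threshold are illustrative, not from the working group): a task whose progress rate falls well below the median of its peers is flagged so the network layer can prioritize it.

```python
from statistics import median

def find_stragglers(progress_rates, threshold=0.5):
    """Return task ids whose progress rate is below threshold x the median.

    progress_rates maps task id -> fraction of work completed per second.
    """
    med = median(progress_rates.values())
    return sorted(t for t, r in progress_rates.items() if r < threshold * med)

rates = {"task-1": 1.0, "task-2": 0.9, "task-3": 0.2, "task-4": 1.1}
assert find_stragglers(rates) == ["task-3"]
```

Comparing against the median rather than the mean keeps a single very slow task from dragging the baseline down and masking other stragglers.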
52. OSDC PIRE Project
We select OSDC PIRE Fellows (US citizens or permanent residents):
• We give them tutorials and training on big data science.
• We provide them fellowships to work with OSDC international partners.
• We give them preferred access to the OSDC.
Nominate your favorite scientist as an OSDC PIRE Fellow: www.opensciencedatacloud.org (look for PIRE).
54.
• Question 1. How can we add partner sites at other locations that extend the OSDC? In particular, how can we extend the OSDC to sites around the world? How can the OSDC interoperate with other science clouds?
• Question 2. What data can we add to the OSDC to facilitate data-intensive cross-disciplinary discoveries?
• Question 3. How can we build a plugin structure so that Tukey can be extended by other users and by other communities?
• Question 4. What tools and applications can we add to the OSDC to facilitate data-intensive cross-disciplinary discoveries?
• Question 5. How can we better integrate digital IDs and file sharing services into the OSDC?
• Question 6. What are 3–5 grand challenge questions that leverage the OSDC?
56. Robert Grossman is a faculty member at the University of Chicago. He is the Chief Research Informatics Officer for the Biological Sciences Division, a Faculty Member and Senior Fellow at the Computation Institute and the Institute for Genomics and Systems Biology, and a Professor of Medicine in the Section of Genetic Medicine. His research group focuses on big data, biomedical informatics, data science, cloud computing, and related areas. He is also the Founder and a Partner of Open Data Group, which has been building predictive models over big data for companies for over ten years. He recently wrote a book for the general reader that discusses big data (among other topics), called The Structure of Digital Computing: From Mainframes to Big Data, which can be purchased from Amazon. He blogs occasionally about big data at rgrossman.com.