What Are Science Clouds?
1. What Is So Special About Science Clouds and Why Does It Matter?
November 17, 2013
Robert L. Grossman
University of Chicago
Open Data Group
Open Cloud Consortium
7. Cloud Deployment Models
• Public Clouds – vendors offering cloud services, such as Amazon.
• Private Clouds – run internally by a company or organization, such as the University of Chicago.
• Community Clouds – run by a community of organizations (either formally or informally), such as the Open Cloud Consortium.
8. How do you measure compute capacity for science clouds?
TB? PB? EB?
100's? 1,000's? 10,000's?
9. Another way: opencompute.org. Think of science clouds as large if you measure them in MW; Facebook's Prineville Data Center, for example, is 30 MW.
13. Commercial Cloud Service Provider (CSP)
A 15 MW data center:
• Monitoring, network security and forensics
• Automatic provisioning and infrastructure management
• Accounting and billing
• Customer facing portal
• 100,000 servers
• 1 PB DRAM
• 100's of PB of disk
• ~1 Tbps egress bandwidth
• Data center network
• 25 operators for a 15 MW commercial cloud
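To put these figures in perspective, here is a quick back-of-the-envelope calculation (mine, not from the deck) relating the power, server, and staffing numbers above:

```python
# Back-of-the-envelope arithmetic implied by the 15 MW / 100,000-server / 25-operator figures.
power_mw = 15
servers = 100_000
operators = 25

watts_per_server = power_mw * 1_000_000 / servers    # ~150 W per server
servers_per_operator = servers / operators            # ~4,000 servers per operator

print(f"{watts_per_server:.0f} W per server")
print(f"{servers_per_operator:.0f} servers per operator")
```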
14. A requirement for a cloud computing infrastructure
Rack / Container Test: the addition of racks or containers of cores and disks is automated and does not require changing the software stack, but afterwards the capacity of the system has increased.
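A minimal sketch of how such a test could be automated is shown below; the inventory endpoint and field names are hypothetical, and this illustrates the idea rather than any particular cloud's tooling.

```python
# Hypothetical "rack / container test": after an automated rack addition,
# the software stack is unchanged but the reported capacity has grown.
# The inventory endpoint and its JSON fields are illustrative assumptions.
import requests

INVENTORY_URL = "https://cloud.example.org/api/inventory"  # hypothetical endpoint

def snapshot():
    """Return (stack version, total cores, total disk TB) from the inventory service."""
    inv = requests.get(INVENTORY_URL, timeout=30).json()
    return inv["stack_version"], inv["total_cores"], inv["total_disk_tb"]

def rack_container_test(add_rack):
    """add_rack is a callable that performs the fully automated rack addition."""
    stack_before, cores_before, disk_before = snapshot()
    add_rack()  # no manual steps, no software changes
    stack_after, cores_after, disk_after = snapshot()
    assert stack_after == stack_before, "software stack must not change"
    assert cores_after > cores_before and disk_after > disk_before, \
        "system capacity must increase after the rack is added"
```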
15.
• For many organizations, system administrators are just performing a service.
• It's considered a good practice to outsource the service to the lowest cost provider.
• At good cloud service providers, development and operations are integrated (devops).
• SRE/devops are considered key personnel.
19. Some Examples of the Sizes of Datasets Produced by Instruments

Discipline | Duration | Size | # Devices
HEP – LHC | 10 years | 15 PB/year* | One
Astronomy – LSST | 10 years | 12 PB/year** | One
Genomics – NGS | 2–4 years | 0.5 TB/genome | 1000's

N.B. This is just the data produced by the instrument itself. The analysis of this data produces significantly more data.

* At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

** As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
20. Diagram: a data scientist accessing Sci CSP services from a Science Cloud Service Provider (Sci CSP).
21. What are some of the important differences between commercial and research-focused Sci CSPs?
22. Community clouds, science clouds, etc. vs. Amazon Web Services (AWS)?

Community clouds, science clouds, etc.:
• Lower cost (at medium & large scale)
• Some data is too important to be stored exclusively in a commercial cloud
• Computing over scientific data is a core competency
• Can support any required governance / security model

Amazon Web Services (AWS):
• Scale
• Simplicity of a credit card
• Wide variety of offerings

It is essential that community science clouds interoperate with public clouds.
23. Science Clouds vs. Commercial Clouds

| | Science Clouds | Commercial Clouds |
| POV | Democratize access to data. Integrate data to make discoveries. Long term archive. | As long as you pay the bill; as long as the business model holds. |
| Data & Storage | In addition, data intensive computing & HP storage | Internet style scale out and object-based storage |
| Flows | Large & small data flows | Lots of small web flows |
| Accounting | Essential | Essential |
| Lock in | Moving environments between CSPs is essential | Lock in is good |
| Interop | Critical, but difficult | Customers will drive it to some degree |
24. Essential Services for a Science CSP
• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services
25. Diagram: a data scientist accessing Sci CSP services from the Datascope, a Science Cloud Service Provider (Sci CSP), operated by a Cloud Service Operations Center (CSOC).
27.
Number | Data Size | Infrastructure
1000's (individual scientists & small projects) | Small | Public infrastructure
100's (community based science via Science as a Service) | Medium to Large | Shared community infrastructure
10's (very large projects) | Very Large | Dedicated infrastructure
28. The long tail of data science: a few large data science projects, and many smaller data science projects.
29. Commercial Cloud Service Provider (CSP)
A 15 MW data center:
• Monitoring, network security and forensics
• Automatic provisioning and infrastructure management
• Accounting and billing
• Customer facing portal
• 100,000 servers
• 1 PB DRAM
• 100's of PB of disk
• ~1 Tbps egress bandwidth
• Data center network
• 25 operators for a 15 MW commercial cloud
30. Open Science Data Cloud
• Compliance & security (OCM)
• Infrastructure automation & management (Yates)
• Accounting & billing (Salesforce.com)
• Customer facing portal (Tukey)
• Science Cloud SW & services
• Cores & disks (OpenStack, GlusterFS & Hadoop)
• ~10-100 Gbps bandwidth; data center network
• 6 engineers to operate the 0.5 MW Science Cloud
• Virtual Machine (VM) containing common applications & pipelines
• Tukey (OSDC portal & middleware v0.2)
• Yates (infrastructure automation and management v0.1)
• UDR / UDT for high performance data transport
• Interoperate with other clouds (upcoming) and proprietary systems (such as Globus Online).
31. The Open Science Data Cloud (OSDC) is a production 5 PB*, 7,500 core, wide area 10G cloud.
*10 PB raw storage.
www.opensciencedatacloud.org
32.
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud.
• Manages cloud computing infrastructure to support medical and health care research: the Biomedical Commons Cloud.
• Manages cloud computing testbeds: the Open Cloud Testbed.
www.opencloudconsortium.org
33.
• Companies: Cisco, Yahoo!, Infoblox, …
• Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, LLNL, University of Illinois at Chicago, …
• Federal agencies and labs: NASA, LLNL, …
• International partners: AIST (Japan), U. Edinburgh, U. Amsterdam, …
www.opencloudconsortium.org
34. Science Cloud:
• Earth sciences
• Biological sciences
• Social sciences
• Digital humanities
• ACL, groups, etc.
Biomedical Cloud: designed to hold Protected Health Information (PHI), e.g. genomic data, electronic medical records, etc. (HIPAA, FISMA).
35. What You Get with the OSDC
• Login with your university credentials via InCommon
• Launch virtual machines, virtual clusters, access to large Hadoop clusters, etc. (a sketch of launching a VM follows this list)
• Access PB+ of open and protected data
• Manage files, collections of files, collections of collections
• Manage users, groups of users
• Manage accounts, sub-accounts
• Efficient transfer of large data (UDT, UDR)
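Because the OSDC is built on OpenStack (slide 30), launching a virtual machine programmatically looks roughly like the sketch below. This is an illustration using the present-day openstacksdk client rather than OSDC documentation; the cloud profile, image, flavor, and keypair names are assumptions.

```python
# Illustrative only: launching a VM on an OpenStack-based cloud such as the OSDC.
# The cloud profile, image, flavor, and keypair names are hypothetical.
import openstack

# Reads credentials for the "osdc" profile from clouds.yaml or environment variables.
conn = openstack.connect(cloud="osdc")

server = conn.create_server(
    name="analysis-node",
    image="ubuntu-20.04",     # hypothetical image name
    flavor="m1.medium",       # hypothetical flavor name
    key_name="my-keypair",    # SSH keypair already registered with the cloud
    wait=True,                # block until the server is ACTIVE
    auto_ip=True,             # attach a floating IP if one is available
)
print(server.name, server.status)
```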
36. Our Point of View
• We want to develop as little technology and software as possible – we want others to develop software and technology.
• We focus on providing researchers the ability to compute over large and very large datasets.
• We need open source solutions.
• We can interoperate with proprietary solutions.
• We are working to make interoperation with AWS seamless.
• Run lights out over multiple data centers connected with 10G (soon 100G) networks.
37. OSDC Cloud Services Operations Center (CSOC)
• The OSDC operates a Cloud Services Operations Center (or CSOC).
• It is a CSOC focused on supporting Science Clouds for researchers.
38. OSDC Racks
2013 OSDC rack design:
• 1 PB / rack
• 1,150 cores / rack
• How quickly can we set up a rack?
• How efficiently can we operate a rack? (racks/admin)
• How few changes does our software stack and operations require when we add new racks?
39. Tukey
• Tukey is based in part on OpenStack's Horizon dashboard.
• We have factored out the digital ID service, file sharing, and transport from the Bionimbus and Matsu Projects.
40. Yates
• Automated installation of the OSDC software stack on a rack of computers.
• Based upon Chef.
• Version 0.1.
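The actual Yates code is not shown here, but a Chef-based rack bootstrap generally follows the pattern sketched below; the hostnames, SSH user, and run-list are hypothetical.

```python
# Hypothetical sketch of Chef-style rack provisioning (not the actual Yates code).
# Each new node is bootstrapped with `knife bootstrap`, which installs the Chef
# client and applies a run-list describing the software stack for that node.
import subprocess

RACK_NODES = [f"rack01-node{i:02d}.example.org" for i in range(1, 33)]  # hypothetical hosts
RUN_LIST = "role[osdc-compute]"                                         # hypothetical run-list

for host in RACK_NODES:
    subprocess.run(
        [
            "knife", "bootstrap", host,
            "--ssh-user", "ubuntu",
            "--sudo",
            "--run-list", RUN_LIST,
            "--node-name", host.split(".")[0],
        ],
        check=True,  # stop the rollout if any node fails to bootstrap
    )
```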
41. UDR
• UDT is a high performance network transport protocol.
• UDR = rsync + UDT.
• UDR makes it easy for an average systems administrator to keep 100's of TB of distributed data synchronized.
• We are using it to distribute c. 1 PB from the OSDC.
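Because UDR wraps rsync, it is normally invoked by prefixing an ordinary rsync command with `udr`; the sketch below shows that pattern driven from a script, with the paths and host being hypothetical.

```python
# Illustrative use of UDR as an rsync wrapper running over UDT.
# The local path and remote host are hypothetical.
import subprocess

subprocess.run(
    [
        "udr", "rsync", "-av", "--progress",
        "/data/public-dataset/",                    # hypothetical local source
        "mirror@remote.example.org:/data/mirror/",  # hypothetical remote destination
    ],
    check=True,
)
```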
43. Analyzing Data From The Cancer Genome Atlas (TCGA)

Current practice:
1. Apply to dbGaP for access to data.
2. Hire staff, set up and operate a secure, compliant computing environment to manage 10 – 100+ TB of data.
3. Get the environment approved by your research center.
4. Set up analysis pipelines.
5. Download data from CGHub (takes days to weeks).
6. Begin analysis.

With the Protected Data Cloud (PDC):
1. Apply to dbGaP for access to data.
2. Use your existing NIH grant eRA credentials to log in to the PDC, select the data that you want to analyze, and the pipelines that you want to use.
3. Begin analysis.
45. Biomedical Community Cloud
Example: the Open Cloud Consortium's Biomedical Commons Cloud (BCC).
Diagram: Medical Research Centers A, B and C, Hospital D, and Company E share a Cloud for Public Data, a Cloud for Controlled Genomic Data, and a Cloud for EMR / PHI data.
47. Cyber Condo Model
• Research institutions today have access to high performance networks – 10G & 100G.
• They couldn't afford access to these networks from commercial providers.
• Over a decade ago, they got together to buy and light fiber.
• This changed how we do scientific research.
48. Cloud Condos
• The Open Cloud Consortium's Burnham Facility (in planning) is a Cloud Condo model.
• This infrastructure provides a sustainable home for large commons of research data (and an infrastructure to compute over it).
• Please join us.
49. Some Data Commons Guidelines for the Next Five Years
• There is a societal benefit when research data is available in data commons operated by a NFP (vs. sold exclusively as data products by commercial entities or only offered for download by the USG).
• Large data commons providers should peer.
• Data commons providers should develop standards for interoperating.
• Standards should not be developed ahead of open source reference implementations.
• We need a period of experimentation as we develop the best technology and practices.
• The details are hard (consent, publication, IDs, open vs. controlled access, sustainability, etc.).
50. Working with the OSDC – CSPs
• If you have a cloud, please interoperate it with the OSDC.
• Work with us to design and prototype standards so that Science Clouds and Science Data Commons can interoperate:
– Data synchronization between two clouds
– APIs to access data
– RESTful queries
– Scattering queries, gathering the results (see the sketch after this list)
– Coordinated analysis
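As an illustration of the scatter/gather pattern above, the sketch below fans one query out to several (hypothetical) cloud REST endpoints in parallel and merges the results; none of the URLs or response fields are real OSDC APIs.

```python
# Hypothetical scatter/gather of a query across several science cloud REST APIs.
# The endpoints and the JSON shape ("results" list) are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINTS = [
    "https://cloud-a.example.org/api/query",
    "https://cloud-b.example.org/api/query",
    "https://cloud-c.example.org/api/query",
]

def query_cloud(url, params):
    """Send the same query to one cloud and return its list of results."""
    resp = requests.get(url, params=params, timeout=60)
    resp.raise_for_status()
    return resp.json().get("results", [])

def scatter_gather(params):
    """Scatter one query to all clouds in parallel, then gather and merge the results."""
    with ThreadPoolExecutor(max_workers=len(ENDPOINTS)) as pool:
        partials = pool.map(lambda url: query_cloud(url, params), ENDPOINTS)
    merged = []
    for part in partials:
        merged.extend(part)
    return merged

if __name__ == "__main__":
    print(len(scatter_gather({"gene": "TP53"})))  # example query parameters
```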
51. OSDC Software Ecosystem
Diagram: organizations (CSP A, University E, Medical Research Center B, Hospital D, Startup F, Startup G) connected through a shared software ecosystem (OpenStack, GlusterFS, Hadoop, Tukey, Bionimbus, UDT, R, AWS, Globus Online).
52. Working with the OSDC – Researchers
• Apply for an account and make a discovery
• Add data to the OSDC
• Add your software to the OSDC
• Suggest someone else's data to add
• Suggest someone else's software to add
53. Data Commons
Diagram: a data commons shared by CSP A, University E, Medical Research Center B, Hospital D, Startup F, and Startup G, holding datasets such as TCGA, EO1, 1000 Genomes, census data, social sciences data, urban sciences data, EMR data, Bookworm, and earth cube data.
56. For more information
• @bobgrossman
• You can find more information on my blog: rgrossman.com.
• You can find more of my talks at slideshare.net/rgrossman.
Center for Research Informatics
57. Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors:
• The Bionimbus Protected Data Cloud is supported in part by NIH/NCI through NIH/SAIC Contract 13XS021 / HHSN261200800001E.
• The OCC-Y Hadoop Cluster (approximately 1,000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• The OSDC is supported by a 5-year (2010-2016) PIRE award (OISE – 1129076) to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher.
• Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, NIH or other funders of this research.
The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at info@opensciencedatacloud.org.