Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
1. An Introduction to Data Intensive Computing
   Chapter 1: Introduction
   Robert Grossman, University of Chicago, Open Data Group
   Collin Bennett, Open Data Group
   November 14, 2011
2. Outline
   1. Introduction (0830-0900)
      a. Data clouds (e.g. Hadoop)
      b. Utility clouds (e.g. Amazon)
   2. Managing Big Data (0900-0945)
      a. Databases
      b. Distributed File Systems (e.g. Hadoop)
      c. NoSQL databases (e.g. HBase)
   3. Processing Big Data (0945-1000 and 1030-1100)
      a. Multiple Virtual Machines & Message Queues
      b. MapReduce
      c. Streams over distributed file systems
   4. Lab using Amazon’s Elastic Map Reduce (1100-1200)
3. Our perspective is to consider data intensive computing from the viewpoint of utility and data clouds.
   For the most current version of these notes, please see: rgrossman.com
4. Section 1.1 Data Intensive Science
   [Photo: Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).]
5. Moore’s law also applies to the instruments that are producing data. This is creating new paradigms: “data intensive science” and “data intensive computing.”
7. Data is Big If It is Measured in MW
   • Data is big if you measure it in megawatts.
   • As in, a good sweet spot for a data center is 15 MW.
   • As in, Facebook’s leased data centers are typically between 2.5 MW and 6.0 MW.
   • Facebook’s new Prineville data center is 30 MW.
   • Google’s computing infrastructure uses 260 MW.
8. Some Big Data Sciences

   Discipline         Duration     Size             # Devices
   HEP - LHC          10 years     15 PB/year*      One
   Astronomy - LSST   10 years     12 PB/year**     One
   Genomics - NGS     2-4 years    0.4 TB/genome    1000’s

   *At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
   **As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
9. An algorithm and computing infrastructure is “big-data scalable” if adding a rack of data (and corresponding processors) does not increase the time required to complete the computation but increases the amount of data that can be processed.
   Add capacity with constant time (ACCT)
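   The ACCT idea above can be illustrated with a small, idealized model. This is only a sketch under stated assumptions: each rack holds a fixed amount of data and scans only its own data, and coordination overhead is ignored. The function names and the figures (100 TB per rack, 10 TB per hour per rack) are hypothetical and do not come from the slides.

```python
def completion_time_hours(racks, tb_per_rack, tb_per_hour_per_rack):
    """Idealized time to process all data when every rack works on its own data in parallel."""
    if racks < 1:
        raise ValueError("need at least one rack")
    total_data_tb = racks * tb_per_rack                  # data grows with each rack added
    total_throughput = racks * tb_per_hour_per_rack      # so does processing capacity
    return total_data_tb / total_throughput


def data_processed_tb(racks, tb_per_rack):
    """Total data covered by the computation."""
    return racks * tb_per_rack


# Hypothetical figures: 100 TB per rack, scanned at 10 TB/hour per rack.
time_one_rack = completion_time_hours(1, 100, 10)
time_eight_racks = completion_time_hours(8, 100, 10)
data_one_rack = data_processed_tb(1, 100)
data_eight_racks = data_processed_tb(8, 100)
```

   In this model the completion time is the same for one rack and for eight, while the data processed grows eightfold. Real systems fall short of this ideal to the extent that racks must exchange data or wait on each other.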
11. The Term ‘In the Cloud’ is Annoying
   • “Personally, I find the term ‘in the cloud’ pretentious and annoying. … the world’s marketers and P.R. people seem to think that ‘the cloud’ just means ‘online.’ ” David Pogue, NYT June 16, 2011.
   • More specifically he notes that you can think of the cloud as “data and application software stored on remote servers [and accessed via the Internet]”
13. Data Clouds
   [Figure labels: Large Data Cloud Services; ad targeting; Yahoo Data Center]
14. Virtualization
   [Diagram: one computer running three apps on a single OS, beside one computer running three apps, each on its own OS, on top of a hypervisor.]
15. Idea Dates Back to the 1960s
   [Diagram: apps running on guest operating systems (CMS, MVS, CMS) on IBM VM/370 on an IBM mainframe.]
   Native (Full) Virtualization. Examples: VMware ESX
   • Virtualization first widely deployed with IBM VM/370.
17. Usage Based Pricing Is New
   1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
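   The pricing claim above is simple arithmetic: under usage-based pricing the cost depends only on the number of computer-hours consumed. A minimal sketch follows, assuming a flat hourly rate; the rate of 10 cents per computer-hour and the function name are hypothetical, not taken from the slides.

```python
def usage_cost_cents(computers, hours, cents_per_computer_hour):
    """Cost under flat usage-based pricing: pay only for computer-hours consumed."""
    return computers * hours * cents_per_computer_hour


# Hypothetical rate, in integer cents to avoid floating-point rounding.
RATE_CENTS = 10

one_computer_120_hours = usage_cost_cents(1, 120, RATE_CENTS)
many_computers_1_hour = usage_cost_cents(120, 1, RATE_CENTS)
```

   Both configurations consume 120 computer-hours, so both cost the same, but the second finishes in one hour instead of 120.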
18. Simplicity is New
   + ... and you have a computer ready to work.
   Elastic, on demand provisioning.
   A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
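   To give a sense of why MapReduce is quick to learn, here is a word count written in the map, shuffle, reduce style. It is a single-process sketch in plain Python, not the Hadoop or Amazon Elastic MapReduce API; all function names are illustrative. In a real framework the shuffle step and the distribution of work across machines are handled for the programmer, who writes only the map and reduce functions.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all counts emitted for one word."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "data clouds", "big data clouds"])
```

   Here `counts` maps each word to the number of times it appears across the three input lines.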
22. NIST Definition
   • Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.
23. NIST Definition
   Essential Characteristics
   • On-demand / self-service
   • Broad network access
   • Resource pooling
   • Rapid elasticity
   • Measured service
   Deployment Models
   • Private
   • Community
   • Public
   • Hybrid
   Service Models
   • Software as a Service (SaaS): consumer runs provider's applications on cloud infrastructure
   • Platform as a Service (PaaS): consumer runs consumer-created applications on the cloud using tools supported by provider
   • Infrastructure as a Service (IaaS): consumer uses provider's processing, storage, and networks
25. Google’s Large Data Cloud
   Google’s Stack (top to bottom):
   • Applications
   • Compute Services: Google’s MapReduce
   • Data Services: Google’s BigTable
   • Storage Services: Google File System (GFS)
26. Hadoop’s Large Data Cloud
   Hadoop’s Stack (top to bottom):
   • Applications
   • Compute Services: Hadoop’s MapReduce
   • Data Services: NoSQL Databases
   • Storage Services: Hadoop Distributed File System (HDFS)