Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)

An Introduction to Data Intensive Computing

Chapter 1: Introduction

Robert Grossman
University of Chicago
Open Data Group

Collin Bennett
Open Data Group

November 14, 2011

1.  Introduction (0830-0900)
    a.  Data clouds (e.g. Hadoop)
    b.  Utility clouds (e.g. Amazon)
2.  Managing Big Data (0900-0945)
    a.  Databases
    b.  Distributed File Systems (e.g. Hadoop)
    c.  NoSQL databases (e.g. HBase)
3.  Processing Big Data (0945-1000 and 1030-1100)
    a.  Multiple Virtual Machines & Message Queues
    b.  MapReduce
    c.  Streams over distributed file systems
4.  Lab using Amazon’s Elastic MapReduce (1100-1200)

Our perspective is to consider data intensive computing from the viewpoint of utility and data clouds.

For the most current version of these notes, please see: rgrossman.com

Section 1.1: Data Intensive Science

[Figure: Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).]

Moore’s law also applies to the instruments that are producing data. This is creating new paradigms: “data intensive science” and “data intensive computing.”

Data is Big If It is Measured in MW

•  Data is big if you measure it in megawatts.
•  As in, a good sweet spot for a data center is 15 MW.
•  As in, Facebook’s leased data centers are typically between 2.5 MW and 6.0 MW.
•  Facebook’s new Prineville data center is 30 MW.
•  Google’s computing infrastructure uses 260 MW.

Some Big Data Sciences

Discipline         Duration    Size            # Devices
HEP - LHC          10 years    15 PB/year*     One
Astronomy - LSST   10 years    12 PB/year**    One
Genomics - NGS     2-4 years   0.4 TB/genome   1000’s

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries.
Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes.
Source: http://www.lsst.org/News/enews/teragrid-1004.html
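
The figures in the two footnotes can be put on the same scale as the table with a little unit arithmetic. The sketch below is only illustrative: it assumes decimal (SI) units, and the 365 observing nights per year used for LSST is an assumption made here, not a number from the sources, so the LSST results are rough upper bounds.

```python
# Decimal (SI) storage units, expressed in bytes.
GB = 10**9
TB = 10**12
PB = 10**15


def to_pb(n_bytes: int) -> float:
    """Convert a byte count to petabytes."""
    return n_bytes / PB


# LHC footnote: "more than 15 million Gigabytes of data each year".
lhc_pb_per_year = to_pb(15_000_000 * GB)

# LSST footnote: 15 TB raw and 30 TB processed per night.
# ASSUMPTION (not from the source): the telescope observes 365 nights a year.
NIGHTS_PER_YEAR = 365
lsst_raw_pb_per_year = to_pb(15 * TB * NIGHTS_PER_YEAR)
lsst_processed_pb_per_year = to_pb(30 * TB * NIGHTS_PER_YEAR)

print(f"LHC:              {lhc_pb_per_year:.2f} PB/year")
print(f"LSST (raw):       {lsst_raw_pb_per_year:.2f} PB/year")
print(f"LSST (processed): {lsst_processed_pb_per_year:.2f} PB/year")
```

With decimal units, 15 million gigabytes is 15 PB, which matches the LHC row of the table.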

An algorithm and computing infrastructure is “big-data scalable” if adding a rack of data (and corresponding processors) does not increase the time required to complete the computation but increases the amount of data that can be processed.

Add capacity with constant time (ACCT)
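
The ACCT property can be illustrated with a deliberately idealized model. The sketch below assumes that every rack holds the same amount of data, processes only its own data, and that there is no coordination overhead between racks. The function name and the per-rack figures (100 TB of data, 10 TB/hour of throughput) are hypothetical values chosen for illustration.

```python
def completion_time_hours(racks: int, tb_per_rack: float,
                          tb_per_hour_per_rack: float) -> float:
    """Idealized completion time when each rack processes its own data in parallel.

    ASSUMPTION: perfect parallelism, no network or coordination overhead.
    """
    if racks < 1:
        raise ValueError("need at least one rack")
    total_data_tb = racks * tb_per_rack
    total_rate_tb_per_hour = racks * tb_per_hour_per_rack
    return total_data_tb / total_rate_tb_per_hour


# Hypothetical per-rack figures, for illustration only.
TB_PER_RACK = 100.0
TB_PER_HOUR_PER_RACK = 10.0

for racks in (1, 2, 4, 8):
    hours = completion_time_hours(racks, TB_PER_RACK, TB_PER_HOUR_PER_RACK)
    print(f"{racks} rack(s): {racks * TB_PER_RACK:.0f} TB processed in {hours:.1f} hours")
```

In this model the data processed grows with the number of racks while the completion time stays fixed. A real system is big-data scalable only to the extent that its overheads stay close to this ideal.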

Section 1.2: What’s New with Clouds?

The Term ‘In the Cloud’ is Annoying

•  “Personally, I find the term ‘in the cloud’ pretentious and annoying. … the world’s marketers and P.R. people seem to think that ‘the cloud’ just means ‘online.’ ” David Pogue, NYT, June 16, 2011.
•  More specifically, he notes that you can think of the cloud as “data and application software stored on remote servers [and accessed via the Internet].”

Utility Clouds

Infrastructure as a Service (IaaS)

[Figure: Amazon data center]

Data Clouds

Large Data Cloud Services
Ad targeting

[Figure: Yahoo data center]

Virtualization

[Figure: two stacks side by side. Left: Apps / OS / Computer. Right: Apps / multiple OS instances / Hypervisor / Computer.]

Idea Dates Back to the 1960s

[Figure: stack of Apps / guest operating systems (CMS, MVS, CMS) / IBM VM/370 / IBM Mainframe.]

Native (Full) Virtualization
Examples: VMware ESX

•  Virtualization was first widely deployed with IBM VM/370.

Usage Based Pricing Is New

1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
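
The equivalence follows directly from billing by the computer-hour. A minimal sketch of the arithmetic is below. The hourly rate is a hypothetical value chosen for illustration, not a price from the slides or from any provider, and the model ignores storage, data transfer, and minimum billing increments.

```python
from decimal import Decimal


def usage_cost(computers: int, hours: int,
               rate_per_computer_hour: Decimal) -> Decimal:
    """Cost under pure usage-based pricing: pay only for computer-hours used."""
    return computers * hours * rate_per_computer_hour


# HYPOTHETICAL rate, for illustration only.
RATE = Decimal("0.10")

one_machine_long = usage_cost(computers=1, hours=120, rate_per_computer_hour=RATE)
many_machines_short = usage_cost(computers=120, hours=1, rate_per_computer_hour=RATE)

print(f"1 computer for 120 hours:  {one_machine_long}")
print(f"120 computers for 1 hour:  {many_machines_short}")
```

Both configurations use 120 computer-hours, so they cost the same, but the second one returns its result 120 times sooner when the work parallelizes.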

Simplicity is New

[Figure: image + image] ... and you have a computer ready to work.

Elastic, on demand provisioning.

A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
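
To give a feel for why the MapReduce model is quick to learn, here is a minimal word-count sketch. It is a single-process, pure-Python simulation of the map, shuffle, and reduce steps, not Hadoop's or Google's API. All function names are hypothetical, and a real framework would run the map and reduce steps on many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "data clouds"]))
```

The programmer writes only the map and reduce functions. In a real framework, partitioning the data, scheduling, and the shuffle are handled by the infrastructure.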

Section 1.4: Utility Clouds

[Figure: for each of IaaS, PaaS, and SaaS, a stack of Apps / Frameworks / VM / Hypervisor, network, with the layers divided between the customer’s responsibility and the cloud service provider’s responsibility.]

Amazon Style Data Cloud

[Figure: architecture diagram with a Load Balancer, the Simple Queue Service, SDB, several groups of EC2 Instances, and S3 Storage Services.]
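
One common way to use a message queue with a pool of virtual machines is for workers to pull tasks from a shared queue. The sketch below illustrates that pattern only. It uses Python's standard `queue` and `threading` modules as local stand-ins for a queue service and for instances, it does not call any AWS API, and the task (squaring a number) and all names are hypothetical.

```python
import queue
import threading
from typing import Dict, Iterable


def worker(tasks: "queue.Queue", results: "queue.Queue") -> None:
    """Pull tasks until a None sentinel arrives, pushing one result per task."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work for this worker
            break
        results.put((item, item * item))  # placeholder computation


def run(items: Iterable[int], n_workers: int = 4) -> Dict[int, int]:
    """Distribute items to n_workers threads through a shared task queue."""
    tasks: "queue.Queue" = queue.Queue()
    results: "queue.Queue" = queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for item in items:
        tasks.put(item)
    for _ in threads:  # one sentinel per worker
        tasks.put(None)
    for thread in threads:
        thread.join()
    collected: Dict[int, int] = {}
    while not results.empty():  # safe here: all workers have finished
        key, value = results.get()
        collected[key] = value
    return collected


if __name__ == "__main__":
    print(run(range(5)))
```

Because workers only talk to the queue, capacity can be added by starting more workers without changing the code that produces tasks.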

NIST Definition

•  Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.

NIST Definition

Essential Characteristics
•  On-demand / self-service
•  Broad network access
•  Resource pooling
•  Rapid elasticity
•  Measured service

Deployment Models
•  Private
•  Community
•  Public
•  Hybrid

Service Models
•  Software as a Service (SaaS) – consumer runs provider’s applications on cloud infrastructure
•  Platform as a Service (PaaS) – consumer runs consumer-created applications on the cloud using tools supported by provider
•  Infrastructure as a Service (IaaS) – consumer uses provider’s processing, storage, and networks

Section 1.5: Data Clouds

Google’s Large Data Cloud

Google’s Stack (top to bottom):
Applications
Compute Services: Google’s MapReduce
Data Services: Google’s BigTable
Storage Services: Google File System (GFS)

Hadoop’s Large Data Cloud

Hadoop’s Stack (top to bottom):
Applications
Compute Services: Hadoop’s MapReduce
Data Services: NoSQL Databases
Storage Services: Hadoop Distributed File System (HDFS)
