Hadoop pycon2011uk

Respect for the elephant: Hadoop

Aditya Sakhuja
aditya@sakhuja.us

Whoami

• Software Engineer @ Yahoo Inc.
• Web Search -> Cloud Platforms -> Display Ads Serving
• http://linkedin.com/in/adityasakhuja

9/24/11 PyCon UK 2011

Agenda

• Motivation
• History
• Ecosystem
• Daemon processes / High Level View
• Map Reduce Data Flow
• HDFS Architecture / Replication
• Can / Cannot
• Getting started yourself
• Demo
• Companies Involved
• Q&A


Motivation

• ‘Traditional’ large-scale computing systems: problems
• Desired features in an improved system
• How Hadoop addresses them


‘Traditional’ large-scale computing systems: problems

• Built for CPU-intensive rather than data-intensive work
• MPI, PVM, RPCs: parallel computation frameworks
• Programming for traditional distributed systems is complex
  – Data exchange requires synchronization
  – Temporal dependencies are complicated
  – It is difficult to deal with partial failures of the system
• Data typically stored on a SAN
• Data brought to the compute nodes at runtime


Desired Features in a Large-Scale Data System

• Data Driven
  – A new, improved system should avoid data bottlenecks
• Scalable
• Consistent
• Recoverable (Data / Processor)
• Partial Failure Support


What Hadoop offers

• Provides a high-level programming model
  – No need to worry about locking, temporal dependencies, sockets ...
• Plus the features on the desired list :) (previous slide)


History

• Hadoop is based on work done by Google in the late 1990s/early 2000s
• Specifically, on papers describing the Google File System (GFS), published in 2003, and Map/Reduce, published in 2004
• Hadoop MapReduce NextGeneration - 2011
  – http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/


Apache Hadoop Ecosystem

• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

Other Hadoop-related projects at Apache include:

• Cassandra™: A scalable multi-master database with no single points of failure.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel computation.

Source: http://hadoop.apache.org/


Hadoop Key Daemon Processes

• NameNode
• Secondary NameNode
• DataNode
• JobTracker
• TaskTracker


High level Hadoop cluster view


MapReduce Data Flow


HDFS Architecture


HDFS Replication


Map Reduce Program Components

• MapReduce programs generally consist of three portions
  – The Mapper
  – The Reducer
  – The driver code
• Additional components:
  – Combiner (often the same code as the Reducer)
  – Custom Partitioner
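
The components listed above can be sketched as a small in-memory simulation. This is not the Hadoop API: the function names (`mapper`, `reducer`, `combiner`, `partitioner`, `run_job`), the word-count task, and the first-character partitioning rule are all assumptions chosen for illustration. It only shows how the pieces relate to each other.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(key, values):
    """Reduce step: collapse all values seen for one key into a single total."""
    return key, sum(values)


# Summing is associative and commutative, so the reducer can double as the combiner.
combiner = reducer


def partitioner(key, num_reducers):
    """Custom partitioner: route keys by first character (hypothetical rule)."""
    return ord(key[0]) % num_reducers


def run_job(splits, num_reducers=2):
    """Driver: wires the pieces together and simulates the shuffle in memory."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        # Map one input split and group its output locally.
        local = defaultdict(list)
        for line in split:
            for key, value in mapper(line):
                local[key].append(value)
        # Combine per split, then send each pair to its reducer partition.
        for key, values in local.items():
            ckey, cvalue = combiner(key, values)
            partitions[partitioner(ckey, num_reducers)][ckey].append(cvalue)
    # Each simulated reducer processes its keys in sorted order.
    results = {}
    for partition in partitions:
        for key in sorted(partition):
            rkey, total = reducer(key, partition[key])
            results[rkey] = total
    return results


if __name__ == "__main__":
    splits = [["the elephant", "respect the elephant"], ["the end"]]
    print(run_job(splits))
```

Because the combiner runs once per input split, the reducers receive partial sums rather than one record per word occurrence, which is the usual reason to add one.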


Hadoop Is / Is Not

• A high-bandwidth, high-latency system
• Not a substitute for a DBMS, at least not on its own
• HDFS is not yet a highly available FS: the NameNode is a SPOF
• A “share nothing” architecture
  – Mappers do not talk to each other, and neither do Reducers


Getting started yourself

Requirements:

• Java SE SDK (download JDK 6 or higher)
• Download and install
  – Hadoop Common: 0.20.203.X - current stable version
  – Hadoop HDFS: 0.21 - stable version
  – Hadoop MapReduce: 0.21 - stable version
• Subscribe to the mailing lists for Hadoop subprojects, depending on your role
• Additionally/alternatively, one can set up VMs from Cloudera / Yahoo
• Details:
  – http://wiki.apache.org/hadoop/GettingStartedWithHadoop
  – http://developer.yahoo.com/hadoop/tutorial/module7.html#basic


Simple Demo

• Using
  – Pig
  – Map/Reduce


Streaming Jobs

• Work with any language that can read from stdin and write to stdout
• hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
      -input myInputDirs \
      -output myOutputDir \
      -mapper myMapScript.py \
      -reducer myReduceScript.py \
      -file myMapScript.py \
      -file myReduceScript.py
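
The slide names `myMapScript.py` and `myReduceScript.py` but does not show their contents. The block below is a minimal sketch of what such scripts might contain, assuming a word-count job and tab-separated `key<TAB>value` records. The function names and the `map` / `reduce` command-line switch are hypothetical, and both steps are kept in one file only to make the example self-contained. With the command above they would be two separate scripts.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: turn each input line into tab-separated "word<TAB>1" records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Hypothetical convention: the single argument "map" or "reduce" picks the step.
    steps = {"map": map_lines, "reduce": reduce_lines}
    step = steps.get(sys.argv[1]) if len(sys.argv) > 1 else None
    if step is not None:
        for record in step(sys.stdin):
            print(record)
```

The reducer relies on its input arriving sorted by key, which is why `groupby` is enough here. When trying the logic locally without a cluster, piping the mapper output through `sort` before the reducer gives the same ordering guarantee.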


Companies involved

• Yahoo - 4500-node cluster (2*4 cores, 4*1 TB disk, 16 GB RAM) - (AdServer, Search)
• Hortonworks, Cloudera
• Facebook
• A9 (Amazon Product Search)
• eBay - 532-node cluster (8 * 532 cores, 5.3 PB)
• Last.fm, Twitter …
• … many more are listed at the link below:
  http://wiki.apache.org/hadoop/PoweredBy


Useful Links

• http://wiki.apache.org/hadoop/GettingStartedWithHadoop - Getting Started
• http://hadoop.apache.org/common/docs/current/cluster_setup.html - Cluster Setup
• http://developer.yahoo.com/hadoop/tutorial/module4.html - MapReduce
• http://developer.yahoo.com/hadoop/tutorial/pigtutorial.html - PIG
• http://hadoop.apache.org/common/docs/current/api/index.html - APIs
• http://developer.yahoo.com/hadoop/tutorial/ - YDN resource on Hadoop


Q&C

Contact Information:

Aditya Sakhuja
aditya@sakhuja.us
http://twitter.com/sakhuja
http://linkedin.com/in/adityasakhuja
