1. Introduction to Big Data and Hadoop
- Presenting and defining big data
- Introducing Hadoop and its history
- Hadoop: how does it work?
- HDFS
2. Hadoop is ...
A scalable, fault-tolerant, distributed system for data storage and
processing (open source, under the Apache license)
- Core Hadoop has two main systems:
● Hadoop Distributed File System (HDFS): self-healing,
high-bandwidth clustered storage
● MapReduce: distributed, fault-tolerant resource management
and scheduling, coupled with a scalable data programming
abstraction
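The map/shuffle/reduce flow behind that programming abstraction can be sketched in a few lines of single-process Python. This is purely illustrative (the function names are invented for the sketch): real Hadoop distributes these phases across a cluster via its Java API or Hadoop Streaming.

```python
# In-process sketch of the MapReduce abstraction: word count.
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key (the framework does this between phases).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Because each map and reduce call is independent, the framework can run them on many machines in parallel, which is what makes the abstraction scale.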
5. Etymology
● Hadoop was created in 2004 by Douglass (Doug) Cutting
● It implemented ideas from Google's File System (GFS) and
MapReduce papers
● He aimed to index the web, Google-style, for the Nutch
search engine project
● He named it after his son's favourite toy, a stuffed
elephant called Hadoop
6. What is Big Data?
"In information technology, big data is a loosely defined term
used to describe data sets so large and complex that they become
awkward to work with using on-hand database management tools."
(Wikipedia)
12. How big is big?
● 2008: Google processes 20 PB a day
● 2009: eBay has 6.5 PB of user data, growing by 50 TB a day
● 2011: Yahoo! has 180-200 PB of data
● 2012: Facebook ingests 500 TB of data a day
13. Limitations of Existing Analytics Architecture
A typical stack, bottom to top:
- Instrumentation (raw data sources)
- Data collection (mostly append)
- Storage grid (where archiving = premature death of the data)
- ETL (Extract, Transform & Load)
- RDBMS (aggregated data only)
- BI reports + online apps
Two problems: the original raw data can't be explored, and moving
data from storage to compute doesn't scale.
14. Why Hadoop?
Challenge: read 1 TB of data
- 1 machine, 4 I/O channels, 100 MB/s per channel: ~45 minutes
- 10 machines, 4 I/O channels, 100 MB/s per channel: ~4.5 minutes
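The slide's numbers are a straightforward throughput calculation, sketched below (assuming decimal units, 1 TB = 1,000,000 MB; the slide rounds the results up slightly):

```python
# Back-of-the-envelope check of the "read 1 TB" comparison.
TB_IN_MB = 1_000_000
CHANNELS = 4          # I/O channels per machine
MB_PER_SECOND = 100   # throughput per channel

def minutes_to_read(machines):
    # Aggregate throughput grows linearly with the number of machines.
    throughput = machines * CHANNELS * MB_PER_SECOND  # MB/s
    return TB_IN_MB / throughput / 60

print(round(minutes_to_read(1), 1))   # ~41.7 minutes (slide: ~45)
print(round(minutes_to_read(10), 1))  # ~4.2 minutes (slide: ~4.5)
```

The point is that reading scales linearly with the number of machines, which is exactly what Hadoop exploits by spreading data across a cluster.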
16. The Key Benefit: Agility/Flexibility
Schema-On-Write (RDBMS):
- Schema must be created before any data can be loaded
- An explicit load operation transforms the data into the
database's internal structure
- New columns must be added explicitly before new data for such
columns can be loaded into the database
- Reads are fast
- Standards / governance
Schema-On-Read (Hadoop):
- Data is simply copied to the file store; no transformations
are needed
- A SerDe (Serializer/Deserializer) is applied at read time to
extract the required columns (late binding)
- New data can start flowing at any time and will appear
retroactively once the SerDe is updated to parse it
- Loads are fast
- Flexibility / agility
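The schema-on-read idea can be sketched in plain Python: raw records are stored untouched, and a parse function (standing in here for a Hive SerDe; the function names are invented for the sketch) imposes the schema only when the data is read.

```python
# Raw lines are "loaded" as-is -- no ETL, no declared schema.
raw_store = ["2012-01-01,click,42", "2012-01-02,view,7"]

def serde_v1(line):
    # Original "SerDe": only two columns are extracted at read time.
    date, event, _ = line.split(",")
    return {"date": date, "event": event}

def serde_v2(line):
    # Updated "SerDe": a third column now appears retroactively,
    # even though the stored bytes never changed.
    date, event, count = line.split(",")
    return {"date": date, "event": event, "count": int(count)}

print([serde_v1(r) for r in raw_store])
print([serde_v2(r) for r in raw_store])
```

Swapping `serde_v1` for `serde_v2` changes the visible schema of already-stored data, which is the "late binding" the slide refers to.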
19. Underlying FS Options
ext3:
- Released in 2001
- Used by Yahoo!
- Bootstrapping + formatting are slow
- Settings: noatime; tune2fs (to turn off reserved blocks)
ext4:
- Released in 2008
- Used by Google
- As fast as XFS
- Settings: delayed allocation off; noatime; tune2fs (to turn
off reserved blocks)
XFS:
- Released in 1993
- Fast
- Drawback: deleting a large number of files is slow
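The noatime and reserved-blocks settings above translate into commands roughly like the following (a sketch only: the device name /dev/sdb1 and mount point /data are placeholders, so adapt them to your hardware before running):

```shell
# Mount the data disk without access-time updates; a matching
# /etc/fstab entry would look like:
#   /dev/sdb1  /data  ext4  defaults,noatime  0 0
mount -o noatime /dev/sdb1 /data

# Turn off the root-reserved blocks (the default 5% reservation
# is wasted space on a disk that holds only HDFS data).
tune2fs -m 0 /dev/sdb1
```

Both tweaks reduce overhead on disks that see heavy, sequential Hadoop workloads: noatime avoids a metadata write on every read, and dropping the reservation reclaims capacity.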