3. Problem
● Too much data
○ 90% of all the data in the world has been generated in the last two years
○ Large Hadron Collider: 25 petabytes per year
○ Walmart: 1M transactions per hour
● Hard disks
○ Cheap!
○ But access times are still slow
○ Writes are even slower
4. Solutions
● Multiple Hard Disks
○ Work in parallel
○ We can reduce access time!
● How to deal with hardware failure?
● What if we need to combine data?
12. HDFS
● Distributed storage
○ Managed across a network of commodity machines
● Blocks
○ About 128 MB per block
○ Large data sets
● Tolerance to node failure
○ Data replication
● Streaming data access
○ Read many times
○ Write once (batch)
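The block and replication numbers are worth making concrete. A quick sketch (the 1 GB file size is a made-up example; 128 MB blocks and 3× replication are the HDFS defaults):

```python
import math

# Hypothetical example: storing a 1 GB file in HDFS.
# 128 MB is the default block size; 3 is the default replication factor.
file_size_mb = 1024
block_size_mb = 128
replication = 3

blocks = math.ceil(file_size_mb / block_size_mb)   # the file is split into blocks
replicas = blocks * replication                    # each block is copied 3 times

print(blocks, replicas)  # 8 blocks, 24 block replicas spread over the DataNodes
```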
13. HDFS
● DataNodes (Workers)
○ Store blocks
● NameNode (Master)
○ Maintains metadata
○ Knows where the blocks are located
○ Makes DataNodes fault tolerant
○ Single point of failure
○ Secondary NameNode
17. MapReduce
● Distributed processing paradigm
○ Moving computation is cheaper than moving data
● Map
○ Map(k1,v1) -> list(k2,v2)
● Reduce
○ Reduce(k2,list(v2)) -> list(v3)
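The two signatures above can be modeled in a few lines. This is a toy, single-process sketch of the paradigm (nothing Hadoop-specific; the names are ours), showing where the implicit group-and-sort step sits between map and reduce:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """map(k1, v1) -> list(k2, v2); group and sort by k2;
    reduce(k2, list(v2)) -> list(v3)."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):          # map phase
            groups[k2].append(v2)              # group by intermediate key
    out = []
    for k2 in sorted(groups):                  # sort keys, as the framework does
        out.extend(reduce_fn(k2, groups[k2]))  # reduce phase
    return out

result = run_mapreduce(
    [(1, "to be or not to be")],
    lambda k, line: [(w, 1) for w in line.split()],
    lambda w, ones: [(w, sum(ones))],
)
print(result)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```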
18. Word Counter
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));
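The pseudocode above translates almost line for line into Python. A minimal stand-alone run (a sketch, not Hadoop code: the grouping loop plays the role of the framework's shuffle, and the case-insensitive sort mirrors the ordering used on the following slides):

```python
from collections import defaultdict

def word_count_map(key, value):
    # map(Long key, String value): emit (word, 1) for each word in the line
    for word in value.split():
        yield word, 1

def word_count_reduce(word, values):
    # reduce(String word, List values): emit (word, sum(values))
    yield word, sum(values)

lines = [(1, "Java is great"), (2, "Hadoop is also great")]

groups = defaultdict(list)                   # the framework's group step
for key, value in lines:
    for word, one in word_count_map(key, value):
        groups[word].append(one)

counts = {}
for word in sorted(groups, key=str.lower):   # the framework's sort step
    for w, total in word_count_reduce(word, groups[word]):
        counts[w] = total

print(counts)  # {'also': 1, 'great': 2, 'Hadoop': 1, 'is': 2, 'Java': 1}
```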
19. Word Counter
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

Input:
Java is great
Hadoop is also great
20. Word Counter - Map
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

Input:
Key  Value
1    Java is great
2    Hadoop is also great
21. Word Counter - Map
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

map(1, “Java is great”)

Input:
Key  Value
1    Java is great
2    Hadoop is also great
22. Word Counter - Map
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

map(1, “Java is great”)

Input:
Key  Value
1    Java is great
2    Hadoop is also great

Map output:
Key    Value
Java   1
23. Word Counter - Map
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

map(1, “Java is great”)

Input:
Key  Value
1    Java is great
2    Hadoop is also great

Map output:
Key    Value
Java   1
is     1
24. Word Counter - Map
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

map(1, “Java is great”)

Input:
Key  Value
1    Java is great
2    Hadoop is also great

Map output:
Key    Value
Java   1
is     1
great  1
25. Word Counter - Map
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

map(2, “Hadoop is also great”)

Input:
Key  Value
1    Java is great
2    Hadoop is also great

Map output:
Key    Value
Java   1
is     1
great  1
26. Word Counter - Map
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

map(2, “Hadoop is also great”)

Input:
Key  Value
1    Java is great
2    Hadoop is also great

Map output:
Key    Value
Java   1
is     1
great  1
Hadoop 1
is     1
also   1
great  1
27. Word Count - Group & Sort
map(k1,v1) -> list(k2,v2)
reduce(k2,list(v2)) -> list(v3)

Map output:
Key    Value
Java   1
is     1
great  1
Hadoop 1
is     1
also   1
great  1
28. Word Count - Group & Sort
map(k1,v1) -> list(k2,v2)
reduce(k2,list(v2)) -> list(v3)

Map output:
Key    Value
Java   1
is     1
great  1
Hadoop 1
is     1
also   1
great  1

group →
Key    Value
Java   [1]
is     [1, 1]
great  [1, 1]
Hadoop [1]
also   [1]
29. Word Count - Group & Sort
map(k1,v1) -> list(k2,v2)
reduce(k2,list(v2)) -> list(v3)

Map output:
Key    Value
Java   1
is     1
great  1
Hadoop 1
is     1
also   1
great  1

group →
Key    Value
Java   [1]
is     [1, 1]
great  [1, 1]
Hadoop [1]
also   [1]

sort →
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]
30. Word Count - Reduce
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

Grouped and sorted:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]
31. Word Count - Reduce
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

reduce(“also”, [1])

Grouped and sorted:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]
32. Word Count - Reduce
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

reduce(“also”, [1])

Grouped and sorted:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]

Reduce output:
Key    Value
also   1
33. Word Count - Reduce
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

reduce(“great”, [1, 1])

Grouped and sorted:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]

Reduce output:
Key    Value
also   1
34. Word Count - Reduce
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

reduce(“great”, [1, 1])

Grouped and sorted:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]

Reduce output:
Key    Value
also   1
great  2
35. Word Count - Reduce
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

reduce(“Hadoop”, [1])

Grouped and sorted:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]

Reduce output:
Key    Value
also   1
great  2
Hadoop 1
36. Word Count - Reduce
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

reduce(“is”, [1, 1])

Grouped and sorted:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]

Reduce output:
Key    Value
also   1
great  2
Hadoop 1
is     2
37. Word Count - Reduce
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

reduce(“Java”, [1])

Grouped and sorted:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]

Reduce output:
Key    Value
also   1
great  2
Hadoop 1
is     2
Java   1
38. Distributed?
● Map tasks
○ Each input block is processed by its own map task
● Reduce tasks
○ Partitioning when grouping
39. Word Count - Partition
num partitions = 1

Map output:
Key    Value
Java   1
is     1
great  1
Hadoop 1
is     1
also   1
great  1

group →
Key    Value
Java   [1]
is     [1, 1]
great  [1, 1]
Hadoop [1]
also   [1]

sort →
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
is     [1, 1]
Java   [1]
40. Word Count - Partition
num partitions = 2

Map output:
Key    Value
Java   1
is     1
great  1
Hadoop 1
is     1
also   1
great  1

group →
Partition 1:
Key    Value
Java   [1]
is     [1, 1]

Partition 2:
Key    Value
great  [1, 1]
Hadoop [1]
also   [1]

sort →
Partition 1:
Key    Value
is     [1, 1]
Java   [1]

Partition 2:
Key    Value
also   [1]
great  [1, 1]
Hadoop [1]
41. Distributed?
● Map tasks
○ Each input block is processed by its own map task
● Reduce tasks
○ Partitioning when grouping
○ Each partition executes a reduce task
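The slides do not show how a key is assigned to a partition. A common scheme, and roughly what Hadoop's default HashPartitioner does, is hash(key) mod numPartitions. A small sketch (Python's built-in hash() is only a stand-in for Java's hashCode()):

```python
def partition(key, num_partitions):
    # Like Hadoop's default HashPartitioner: hash(key) % numReduceTasks
    return hash(key) % num_partitions

pairs = [("Java", 1), ("is", 1), ("great", 1),
         ("Hadoop", 1), ("is", 1), ("also", 1), ("great", 1)]

num_partitions = 2
buckets = {p: [] for p in range(num_partitions)}
for word, one in pairs:
    buckets[partition(word, num_partitions)].append((word, one))

# Key property: every occurrence of a word lands in the same partition,
# so each reduce task sees the complete value list for its keys.
```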
49. Hive
● Data Warehouse
● SQL-Like analysis system
SELECT word, COUNT(*)
FROM table LATERAL VIEW EXPLODE(SPLIT(line, “ ”)) t AS word
GROUP BY word
ORDER BY word ASC;
● Executes MapReduce underneath!
51. Hadoop Ecosystem
● ZooKeeper:
○ Centralized coordination system
● Pig
○ Data-flow language to analyze large data sets
● Kafka:
○ Distributed messaging system
● Sqoop:
○ Transfer between RDBMS - HDFS
● ...
53. Trovit
● What is it:
○ Vertical search engine.
○ Real estate, cars, jobs, products, vacations.
● Challenges:
○ Millions of documents to index
○ Traffic generates a huge amount of log files
54. Trovit
● Legacy:
○ Used MySQL to support document indexing
○ It didn’t scale!
● Batch processing:
○ Hadoop with a pipeline workflow
○ Problem solved!
● Real time processing:
○ Storm to improve freshness
● More challenges:
○ Content analysis
○ Traffic analysis