Distributed batch processing with Hadoop
Ferran Galí i Reniu
@ferrangali

09/01/2014
Ferran Galí i Reniu
● UPC - FIB
● Trovit
Problem
● Too much data
○ 90% of all the data in the world has been generated
in the last two years
○ Large Hadron Collider: 25 petabytes per year
○ Walmart: 1M transactions per hour

● Hard disks
○ Cheap!
○ Still slow access time
○ Write even slower
Solutions
● Multiple Hard Disks
○ Work in parallel
○ We can reduce access time!

● How to deal with hardware failure?
● What if we need to combine data?
Hadoop
● Doug Cutting & Mike Cafarella
Hadoop

The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
October 2003
Hadoop

MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
December 2004
Hadoop
● Doug Cutting & Mike Cafarella

● Yahoo!
Hadoop
● HDFS
○ Storage

● MapReduce
○ Processing

● Ecosystem
HDFS
● Distributed storage
○ Managed across a network of commodity machines

● Blocks
○ About 128 MB
○ Large data sets

● Tolerance to node failure
○ Data replication

● Streaming data access
○ Read many times
○ Write once (batch)
HDFS
● DataNodes (Workers)
○ Store blocks

● NameNode (Master)
○ Maintains metadata
○ Knows where the blocks are located
○ Makes DataNodes fault tolerant
○ Single point of failure
○ Secondary NameNode
HDFS

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
HDFS
● Interfaces
○ Java
○ Command line interface

● Load
hadoop fs -put file.csv /user/hadoop/file.csv

● Extract
hadoop fs -get /user/hadoop/file.csv file.csv
MapReduce
● Distributed processing paradigm
○ Moving computation is cheaper than moving data

● Map
○ Map(k1,v1) -> list(k2,v2)

● Reduce
○ Reduce(k2,list(v2)) -> list(v3)
Word Counter
map (Long key, String value)
for each(String word in value)
emit(word, 1);
reduce (String word, List values)
emit(word, sum(values));
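The pseudocode above can be exercised locally with a small Python sketch. This is a plain in-memory simulation, not the Hadoop API; the function names (`map_fn`, `reduce_fn`, `run_job`) are illustrative:

```python
from collections import defaultdict

def map_fn(key, value):
    # map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line
    for word in value.split():
        yield (word, 1)

def reduce_fn(word, values):
    # reduce(k2, list(v2)) -> list(v3): emit (word, sum of counts)
    yield (word, sum(values))

def run_job(lines):
    # map phase: one call per (line number, line) record
    pairs = [kv for key, value in enumerate(lines, 1)
                for kv in map_fn(key, value)]
    # shuffle: group values by key, then visit keys in sorted order
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    # reduce phase
    return {w: c for word in sorted(groups)
                 for w, c in reduce_fn(word, groups[word])}

print(run_job(["Java is great", "Hadoop is also great"]))
# {'Hadoop': 1, 'Java': 1, 'also': 1, 'great': 2, 'is': 2}
```

The shuffle here happens in a single process; on a cluster, Hadoop performs the same grouping across machines between the map and reduce tasks.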
Word Counter
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));

Input:
Java is great
Hadoop is also great
Word Counter - Map
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));

Input:
Key  Value
1    Java is great
2    Hadoop is also great
Word Counter - Map
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));

Input:
Key  Value
1    Java is great
2    Hadoop is also great

Call: map(1, “Java is great”)
Word Counter - Map
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));

Input:
Key  Value
1    Java is great
2    Hadoop is also great

Call: map(1, “Java is great”)

Map output:
Key   Value
Java  1
Word Counter - Map
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));

Input:
Key  Value
1    Java is great
2    Hadoop is also great

Call: map(1, “Java is great”)

Map output:
Key   Value
Java  1
is    1
Word Counter - Map
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));

Input:
Key  Value
1    Java is great
2    Hadoop is also great

Call: map(1, “Java is great”)

Map output:
Key    Value
Java   1
is     1
great  1
Word Counter - Map
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));

Input:
Key  Value
1    Java is great
2    Hadoop is also great

Call: map(2, “Hadoop is also great”)

Map output:
Key    Value
Java   1
is     1
great  1
Word Counter - Map
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));

Input:
Key  Value
1    Java is great
2    Hadoop is also great

Call: map(2, “Hadoop is also great”)

Map output:
Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1
Word Count - Group & Sort
map(k1,v1) -> list(k2,v2)
reduce(k2,list(v2)) -> list(v3)

Map output:
Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1
Word Count - Group & Sort
map(k1,v1) -> list(k2,v2)
reduce(k2,list(v2)) -> list(v3)

Map output:
Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1

After group:
Key     Value
Java    [1]
is      [1, 1]
great   [1, 1]
Hadoop  [1]
also    [1]
Word Count - Group & Sort
map(k1,v1) -> list(k2,v2)
reduce(k2,list(v2)) -> list(v3)

Map output:
Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1

After group:
Key     Value
Java    [1]
is      [1, 1]
great   [1, 1]
Hadoop  [1]
also    [1]

After sort:
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]
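The group & sort step shown above can be sketched in isolation. This is a plain Python sketch of the idea, not Hadoop's actual shuffle; it sorts case-insensitively to match the tables on these slides (Hadoop's default `Text` comparator is byte order, which would put the capitalized words first):

```python
from itertools import groupby

def group_and_sort(pairs):
    # sort the map output by key (case-insensitive to match the slides)
    ordered = sorted(pairs, key=lambda kv: kv[0].lower())
    # collect the values of identical adjacent keys into one list per key
    return [(key, [v for _, v in grp])
            for key, grp in groupby(ordered, key=lambda kv: kv[0])]

map_output = [("Java", 1), ("is", 1), ("great", 1),
              ("Hadoop", 1), ("is", 1), ("also", 1), ("great", 1)]
print(group_and_sort(map_output))
# [('also', [1]), ('great', [1, 1]), ('Hadoop', [1]), ('is', [1, 1]), ('Java', [1])]
```

The output is exactly the "after sort" table above: one (key, list of values) pair per distinct word, ready to be fed to reduce.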
Word Count - Reduce
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));

Input (grouped & sorted):
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]
Word Count - Reduce
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));

Input (grouped & sorted):
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]

Call: reduce(“also”, [1])
Word Count - Reduce
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));

Input (grouped & sorted):
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]

Call: reduce(“also”, [1])

Reduce output:
Key   Value
also  1
Word Count - Reduce
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));

Input (grouped & sorted):
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]

Call: reduce(“great”, [1, 1])

Reduce output:
Key   Value
also  1
Word Count - Reduce
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));

Input (grouped & sorted):
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]

Call: reduce(“great”, [1, 1])

Reduce output:
Key    Value
also   1
great  2
Word Count - Reduce
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));

Input (grouped & sorted):
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]

Call: reduce(“Hadoop”, [1])

Reduce output:
Key     Value
also    1
great   2
Hadoop  1
Word Count - Reduce
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));

Input (grouped & sorted):
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]

Call: reduce(“is”, [1, 1])

Reduce output:
Key     Value
also    1
great   2
Hadoop  1
is      2
Word Count - Reduce
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));

Input (grouped & sorted):
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]

Call: reduce(“Java”, [1])

Reduce output:
Key     Value
also    1
great   2
Hadoop  1
is      2
Java    1
Distributed?
● Map tasks
○ Each input block is processed by its own map task

● Reduce tasks
○ Partitioning when grouping
Word Count - Partition
num partitions = 1

Map output:
Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1

After group:
Key     Value
Java    [1]
is      [1, 1]
great   [1, 1]
Hadoop  [1]
also    [1]

After sort:
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]
Word Count - Partition
num partitions = 2

Map output:
Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1

Partition 1 (group & sort):
Key   Value
is    [1, 1]
Java  [1]

Partition 2 (group & sort):
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
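The split of keys across the two partitions above is decided by hashing. A minimal sketch of the idea, in plain Python (`crc32` stands in for Java's `String.hashCode`, which Hadoop's default HashPartitioner uses, so the exact assignment differs from a real cluster):

```python
from zlib import crc32

def partition(key, num_partitions):
    # hash(key) mod numPartitions, as in Hadoop's default HashPartitioner
    # (crc32 is a deterministic stand-in for Java's String.hashCode)
    return crc32(key.encode("utf-8")) % num_partitions

# every key lands deterministically in exactly one reduce partition,
# so all values for a given word reach the same reduce task
for word in ["Java", "is", "great", "Hadoop", "also"]:
    print(word, "-> partition", partition(word, 2))
```

Because the function is deterministic, every occurrence of the same word from every map task is routed to the same reducer, which is what makes the per-key grouping correct.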
Distributed?
● Map tasks
○ Each input block is processed by its own map task

● Reduce tasks
○ Partitioning when grouping
○ Each partition executes a reduce task
MapReduce
● Job Tracker
○ Dispatches Map & Reduce Tasks

● Task Tracker
○ Executes Map & Reduce Tasks
MapReduce
Example 1:
● Map
● Reduce
● Group & Partition
$> hadoop jar jug-hadoop.jar example1 /user/hadoop/input.txt /user/hadoop/output 2
$> hadoop fs -text /user/hadoop/output/part-r-*

http://github.com/ferrangali/jug-hadoop
MapReduce
Example 2:
● Sorting
● n-Job workflow

$> hadoop jar jug-hadoop.jar example2 /user/hadoop/input.txt /user/hadoop/output 2
$> hadoop fs -text /user/hadoop/output/part-r-*

http://github.com/ferrangali/jug-hadoop
Big Data
Big Data
● Too much data
○ Not a problem any more

● It’s just a matter of which tools to use
● New opportunity for businesses
Big Data Platform
[diagram: logs and DBs feed a Consumption layer, then Processing, then Serving, which produces indexes, DBs and NoSQL stores]
Hadoop Ecosystem
Hive
● Data Warehouse
● SQL-like analysis system

SELECT word, COUNT(*)
FROM table
LATERAL VIEW explode(SPLIT(line, ' ')) t AS word
GROUP BY word
ORDER BY word ASC;

● Executes MapReduce underneath!
HBase
● Based on BigTable
● Column-oriented database
● Random realtime read/write access
● Easy to bulk load from Hadoop
Hadoop Ecosystem
● ZooKeeper:
○ Centralized coordination system

● Pig:
○ Data-flow language to analyze large data sets

● Kafka:
○ Distributed messaging system

● Sqoop:
○ Transfer between RDBMS - HDFS

● ...
Hadoop - Who’s using it?
Trovit
● What is it:
○ Vertical search engine.
○ Real estate, cars, jobs, products, vacations.

● Challenges:
○ Millions of documents to index
○ Traffic generates a huge amount of log files
Trovit
● Legacy:
○ Used MySQL to support document indexing
○ Didn’t scale!

● Batch processing:
○ Hadoop with a pipeline workflow
○ Problem solved!

● Real time processing:
○ Storm to improve freshness

● More challenges:
○ Content analysis
○ Traffic analysis
Questions?
Distributed batch processing with Hadoop
@ferrangali
