Hadoop, a distributed framework for Big Data
Presented By:
Bhushan Kulkarni
T.E(I.T)
Contents
1. Introduction and Hadoop’s history
2. Architecture in detail
3. Hadoop in industry
What is Hadoop?
• An Apache top-level project: an open-source
implementation of frameworks for reliable,
scalable, distributed computing and data storage.
• It is a flexible and highly available architecture for
large-scale computation and data processing on a
network of commodity hardware.
What is Hadoop? (Cont’d)
• Hadoop is a software framework for distributed processing of
large datasets across large clusters of computers
• Large datasets → terabytes or petabytes of data
• Large clusters → hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google's
MapReduce
• Hadoop is based on a simple programming model called
MapReduce
• Hadoop is based on a simple data model; any data will fit
Brief History of Hadoop
• Google introduced the MapReduce programming model.
• Doug Cutting and his team took the approach
described by Google and started an open-source
project called Hadoop in 2005; Doug
named it after his son's toy elephant.
Hadoop’s Developers
Doug Cutting
2005: Doug Cutting and Michael J. Cafarella developed
Hadoop to support distributed processing for the Nutch search
engine project.
2006: Doug Cutting joined Yahoo, which backed the project, and
Hadoop was split out of Nutch into a separate project at the
Apache Software Foundation.
Large-Scale Data Analytics
• MapReduce computing paradigm (e.g., Hadoop) vs. traditional
database systems
• Many enterprises are turning to Hadoop
• Especially applications generating big data
• Web applications, social networks, scientific applications
Why is Hadoop able to compete?
Hadoop:
• Scalability (petabytes of data, thousands of machines)
• Flexibility in accepting all data formats (no schema)
• Inexpensive commodity hardware
• Efficient and simple fault-tolerance mechanism
Database:
• Performance (tons of indexing, tuning, data organization techniques)
• Structured data
Key Components
• The Hadoop framework consists of two main layers
• Distributed file system (HDFS)
• Execution engine (MapReduce)
Hadoop: How it Works
Hadoop Architecture
Master node (single node)
Many slave nodes
• Distributed file system (HDFS)
• Execution engine (MapReduce)
Hadoop Distributed File System (HDFS)
Centralized namenode
- Maintains metadata about files
Many datanodes (1000s)
- Store the actual data
- Files are divided into blocks (64 MB each for file F in the figure)
- Each block is replicated N times (default = 3)
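The block and replication scheme above can be sketched in a few lines of Python. This is only an illustration: the round-robin placement, the function names, and the datanode names `dn1` to `dn4` are assumptions made for this sketch, not HDFS's actual placement policy, which also takes racks into account.

```python
BLOCK_SIZE_MB = 64   # block size shown on the slide
REPLICATION = 3      # default replication factor from the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block; the last block may be smaller."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct datanodes, round-robin.

    Hypothetical policy for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)]
                   for r in range(replication)]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(200)   # a 200 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(blocks)
print(placement)
```

A 200 MB file yields three full 64 MB blocks plus one 8 MB block, and every block ends up on three different datanodes.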
Main Properties of HDFS
• Large: An HDFS instance may consist of thousands of server
machines, each storing part of the file system’s data
• Replication: Each data block is replicated several times
(default is 3)
• Failure: Failure is the norm rather than the exception
• Fault tolerance: Detection of faults and quick, automatic
recovery from them is a core architectural goal of HDFS
• The namenode constantly checks on the datanodes
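One way to picture fault detection and recovery is a namenode that tracks the last heartbeat time of each datanode and re-replicates blocks that lost a copy. The sketch below assumes a 30-second timeout and made-up node names and timestamps; these values and function names are illustrative, not HDFS defaults or APIs.

```python
HEARTBEAT_TIMEOUT_S = 30   # hypothetical timeout chosen for this sketch


def dead_datanodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT_S):
    """Return the datanodes whose last heartbeat is older than `timeout`."""
    return {node for node, seen in last_heartbeat.items()
            if now - seen > timeout}


def under_replicated(placement, dead, replication=3):
    """Map block id -> number of replicas that must be re-created."""
    missing = {}
    for block_id, nodes in placement.items():
        alive = [n for n in nodes if n not in dead]
        if len(alive) < replication:
            missing[block_id] = replication - len(alive)
    return missing


# Seconds at which each datanode last reported in (made-up values).
last_heartbeat = {"dn1": 100, "dn2": 95, "dn3": 40, "dn4": 105}
placement = {0: ["dn1", "dn2", "dn3"], 1: ["dn1", "dn2", "dn4"]}

dead = dead_datanodes(last_heartbeat, now=110)
todo = under_replicated(placement, dead)
print(dead, todo)
```

Here `dn3` has been silent for 70 seconds, so it is treated as failed and block 0 needs one new replica on a healthy node.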
What is MapReduce?
• MapReduce is a programming model and an associated
implementation for processing and generating large data sets
• Programs written in this functional style are automatically
parallelized and executed on a large cluster of commodity
machines
• MAP: a map function processes a key/value pair to generate
a set of intermediate key/value pairs
• REDUCE: a reduce function merges all intermediate values
associated with the same intermediate key
Properties of MapReduce Engine
• Job Tracker is the master node (runs with the
namenode)
• Receives the user’s job
• Decides on how many tasks will run (number of mappers)
• Decides on where to run each mapper (concept of locality)
• Example: this file has 5 blocks → run 5 map tasks
• Where to run the task reading block “1”?
• Try to run it on Node 1 or Node 3, which hold a copy of that block
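The locality decision can be sketched as follows. The function name `pick_node`, the slot counts, and the fallback rule are assumptions for illustration; the real scheduler is more involved.

```python
def pick_node(block_replicas, free_slots):
    """Choose a node for the map task that reads one block.

    Prefers a node that already stores a replica of the block (data-local);
    otherwise falls back to any node with a free slot.
    Returns (node, is_local), or (None, False) if no slot is free.
    """
    for node in block_replicas:
        if free_slots.get(node, 0) > 0:
            return node, True
    for node, slots in free_slots.items():
        if slots > 0:
            return node, False
    return None, False


# Block "1" is assumed to be stored on Node 1 and Node 3.
replicas_of_block_1 = ["node1", "node3"]

local = pick_node(replicas_of_block_1, {"node1": 0, "node2": 2, "node3": 1})
remote = pick_node(replicas_of_block_1, {"node1": 0, "node2": 2, "node3": 0})
print(local, remote)
```

In the first call Node 1 is busy but Node 3 has a free slot, so the task stays data-local. In the second call neither replica holder is free, so the task runs on Node 2 and must read the block over the network.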
Properties of MapReduce Engine (Cont’d)
• Task Tracker is the slave node (runs on each datanode)
• Receives the task from Job Tracker
• Runs the task until completion (either map or reduce task)
• Always in communication with the Job Tracker, reporting progress
The Programming Model of MapReduce
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The
MapReduce library groups together all intermediate values associated with the
same intermediate key and passes them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate key I and a set of
values for that key. It merges together these values to form a possibly smaller set of values.
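The model can be tried out in a single Python process. The sketch below is not Hadoop's API: `run_mapreduce` is a hypothetical helper that stands in for the library's grouping step, and word count is used as the user-written map and reduce pair.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group-by-key, and reduce in a single process."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)   # the library's grouping step
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}


def word_count_map(_line_no, line):
    """User-written map: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def word_count_reduce(_word, counts):
    """User-written reduce: merge the values for one key into a total."""
    return sum(counts)


lines = [(0, "Hadoop stores data"), (1, "Hadoop processes data")]
result = run_mapreduce(lines, word_count_map, word_count_reduce)
print(result)
```

The user supplies only the two small functions; grouping by key, and in a real cluster distribution and retries, are the framework's job.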
MapReduce data flow with a single reduce task
MapReduce data flow with multiple reduce tasks
MapReduce data flow with no reduce tasks
Example 1: Color Count
Job: Count the number of each color in a data set
• Input blocks on HDFS are read by four map tasks
• Map: parses and hashes its input and produces (k, v) pairs, e.g. (color, 1)
• Shuffle & sorting based on k
• Reduce (three tasks): consumes (k, [v]), e.g. (color, [1,1,1,1,1,1..]),
and produces (k’, v’), e.g. (color, 100)
• Output: Part0001, Part0002, Part0003. That’s the output file; it
has 3 parts, probably on 3 different machines
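The color-count flow, including the routing of keys to three reduce tasks, can be simulated as below. The CRC32-based `partition` function is an assumption standing in for a hash partitioner, and the input colors are made up.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3


def partition(key, num_reducers=NUM_REDUCERS):
    """Deterministically pick the reducer for a key (stand-in partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def color_count(blocks):
    # Map phase: each block emits (color, 1); pairs are routed by key.
    shuffled = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for block in blocks:
        for color in block:
            shuffled[partition(color)][color].append(1)
    # Reduce phase: each reducer sums the 1s per color and writes one part.
    return [
        {color: sum(ones) for color, ones in sorted(groups.items())}
        for groups in shuffled
    ]


blocks = [["red", "blue", "green"], ["blue", "blue"], ["green", "red"], ["red"]]
parts = color_count(blocks)
for i, part in enumerate(parts, start=1):
    print(f"Part{i:04d}", part)
```

Because every occurrence of a color is sent to the same reducer, each color's total appears in exactly one output part, and some parts may be empty.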
Example 2: Color Filter
Job: Select only the blue and the green colors
• Input blocks on HDFS are read by four map tasks
• Map produces (k, v) pairs, e.g. (color, 1)
• Each map task will select only
the blue or green colors and write its output directly to HDFS
• No need for reduce phase
• Output: Part0001, Part0002, Part0003, Part0004. That’s the output file; it
has 4 parts, probably on 4 different machines
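A map-only job like this filter can be sketched in a few lines. The function names and input colors are made up for illustration.

```python
WANTED = {"blue", "green"}


def filter_map(block):
    """Map task: keep only the wanted colors from one input block."""
    return [color for color in block if color in WANTED]


def run_map_only(blocks):
    """With zero reducers, each map task's output becomes one output part."""
    return [filter_map(block) for block in blocks]


blocks = [["red", "blue"], ["green", "red"], ["blue", "blue"], ["red"]]
parts = run_map_only(blocks)
print(parts)
```

Four input blocks give four output parts with no shuffle in between, which is why the number of parts matches the number of map tasks rather than the number of reducers.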
Why use Hadoop?
• Need to process multi-petabyte datasets
• Data may not have strict schema
• Expensive to build reliability in each application
• Need common infrastructure
• Very Large Distributed File System
• Assumes Commodity Hardware
• Optimized for Batch Processing
Who Uses MapReduce/Hadoop
• Google: Inventors of the MapReduce computing paradigm
• Yahoo: Developing Hadoop, the open-source implementation of MapReduce
• IBM, Microsoft, Oracle
• Facebook, Amazon, AOL, Netflix
• Many others + universities and research labs
THANK YOU!!
