Hadoop Distributed Computing Framework for Big Data
http://www.cyanny.com/2013/12/05/hadoop-overview/
The Motivation for Hadoop
• Hadoop is an open-source distributed computing framework for processing large-scale data sets.
• Created by Doug Cutting; it originated in Apache Nutch and was split out of Nutch in 2006.
• Based on Google's GFS paper (2003) and MapReduce paper (Jeff Dean, 2004). Google: 200 clusters, each with 1,000+ nodes.
• Yahoo: 42,000 nodes; LinkedIn: 4,100 nodes; Facebook: 1,400; eBay: 500; Taobao: 2,000 (the biggest in China).
• Ecosystem: HBase, Hive, Pig, ZooKeeper, Oozie, Mahout, ...
Why Hadoop?
• Problems in traditional big data processing (MPI, grid computing, volunteer computing):
✴ It is difficult to deal with partial failures of the system.
✴ Bandwidth is finite and precious, yet it is needed to combine data from different disks, and transfer times are very slow for large data volumes.
✴ Data exchange requires synchronization.
✴ Temporal dependencies are complicated.
How Hadoop Saves Big Data
• Hadoop supports partial failure. The Hadoop Distributed File System (HDFS) stores large data sets with high reliability and scalability.
• HDFS provides strong fault tolerance. A partial failure does not bring down the entire system, and HDFS can recover the data affected by a partial failure.
• Hadoop introduces MapReduce, which spares programmers from low-level details such as handling partial failure. The MapReduce framework detects failed tasks and reschedules them automatically.
• Hadoop provides data locality. The MapReduce framework tries to colocate computation with the nodes that hold the data. Data is local, and tasks are independent of each other, so the shared-nothing, data-local architecture saves bandwidth and avoids complicated dependencies.
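The automatic rescheduling of failed tasks can be illustrated with a small simulation. This is only a sketch of the idea, not Hadoop's scheduler: the names `run_with_retries` and `flaky_task`, the task IDs, and the limit of 4 attempts are assumptions made for the example.

```python
def run_with_retries(tasks, run_task, max_attempts=4):
    """Run every task, putting failed ones back in the queue.

    A task that keeps failing after max_attempts tries aborts the job.
    """
    results = {}
    attempts = {}
    pending = list(tasks)
    while pending:
        task = pending.pop(0)
        attempts[task] = attempts.get(task, 0) + 1
        try:
            results[task] = run_task(task, attempts[task])
        except RuntimeError:
            if attempts[task] >= max_attempts:
                raise
            pending.append(task)  # reschedule the failed task
    return results, attempts


def flaky_task(task, attempt):
    # Hypothetical task: "split-1" fails on its first attempt only,
    # standing in for a node that crashed mid-task.
    if task == "split-1" and attempt == 1:
        raise RuntimeError("simulated node failure")
    return f"{task} done"


results, attempts = run_with_retries(["split-0", "split-1", "split-2"], flaky_task)
```

Because the tasks share nothing, rerunning `split-1` does not disturb the other two, which is what makes this kind of retry cheap.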
Hadoop Basic Concepts
• The core concept of Hadoop is to distribute the data as it is initially stored in the system; this is what makes data locality possible.
• Applications are written in high-level code.
• Nodes depend on each other as little as possible.
• Data replication: data is spread among machines in advance.
Hadoop High-Level Overview
• HDFS (Hadoop Distributed File System) is a distributed file system designed to store large data sets and streaming data sets on commodity hardware with high scalability, reliability, and availability.
• MapReduce is a parallel programming model and an associated implementation for processing and generating large data sets. It provides a clean abstraction for programmers.
Master-Slave Architecture
• NameNode: manages the HDFS namespace and metadata.
• Secondary NameNode: performs housekeeping functions for the NameNode; it is not a backup or hot standby for the NameNode.
• DataNode: stores the actual HDFS data blocks. In Hadoop, a large file is split into 64 MB or 128 MB blocks.
• JobTracker: manages MapReduce jobs and distributes individual tasks to the machines running TaskTrackers.
• TaskTracker: starts and monitors each individual map and reduce task.

Each daemon runs in its own JVM.
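To make the block splitting concrete, the following sketch computes the block boundaries for a file of a given size. The function name `split_into_blocks` and the 200 MB file are assumptions for illustration; only the 64 MB and 128 MB block sizes come from the slides.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


blocks_64 = split_into_blocks(200 * MB)             # 64 MB blocks
blocks_128 = split_into_blocks(200 * MB, 128 * MB)  # 128 MB blocks
```

With 64 MB blocks the 200 MB file yields three full blocks and one 8 MB tail; with 128 MB blocks it yields one full block and one 72 MB tail.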
An Example
• WordCount
• bin/hadoop fs -copyFromLocal conf input
• bin/hadoop jar hadoop-examples-1.2.1.jar grep input output 'dfs[a-z.]+'
• bin/hadoop fs -cat output/*
• localhost:50030: check MapReduce status
• localhost:50070: check HDFS status

POSIX: Portable Operating System Interface
HDFS: Basic Concepts

• Highly fault-tolerant: handles partial failure
• Streaming data access: data stored in blocks (64 MB or 128 MB), “write once, read many times”
• Large data sets: GB, TB, PB
HDFS Architecture
• NameNode: namespace tree (logical file locations, plus physical block locations held in RAM)
• DataNode: stores the actual data blocks
• Communication: RPC
Secondary NameNode
• NameNode data persistence: FSImage and EditLog
✤ FSImage persists the filesystem tree, the mapping of files to blocks, and filesystem properties
✤ Physical block locations are not persisted; they are kept in RAM
• Checkpoint: merge the EditLog into the FSImage
• Secondary NameNode housekeeping: checkpoints periodically
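A checkpoint can be pictured as replaying a log of edits on top of a snapshot. The sketch below uses a plain dictionary for the image and tuples for log entries; these data shapes, the three operation names, and the function names are assumptions for illustration and do not reflect the real on-disk formats.

```python
def apply_edit(tree, edit):
    """Apply one logged operation to the in-memory filesystem tree."""
    op, path = edit[0], edit[1]
    if op == "create":
        tree[path] = list(edit[2])       # path -> list of block IDs
    elif op == "delete":
        tree.pop(path, None)
    elif op == "rename":
        tree[edit[2]] = tree.pop(path)   # edit[2] is the new path
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the image.

    Returns the new image and an empty log, leaving the inputs untouched.
    """
    new_image = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


fsimage = {"/a.txt": ["blk_1"]}
edit_log = [
    ("create", "/b.txt", ["blk_2", "blk_3"]),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]
new_image, new_log = checkpoint(fsimage, edit_log)
```

Note that the image only maps files to block IDs; which machines hold each block is deliberately absent, matching the point that physical locations live only in RAM.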
HDFS: Data Replica
• 3 replicas: high reliability
• The first replica goes on a node in the local rack
• The second goes on a node in a different, remote rack
• The third goes on a different node in the same remote rack
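The placement rule can be sketched as a small function over a rack topology. This is an illustration under assumptions, not the HDFS implementation: the topology dictionary, the node and rack names, the choice of the writer's own node as the local replica, and the random selection of the remote rack are all hypothetical.

```python
import random


def rack_of(node, topology):
    """Return the rack that contains the given node."""
    for rack, nodes in topology.items():
        if node in nodes:
            return rack
    raise ValueError(f"unknown node: {node}")


def place_replicas(writer_node, topology, rng=random):
    """Pick three nodes: one local, two on a single remote rack."""
    local_rack = rack_of(writer_node, topology)
    # The remote rack must be able to hold two distinct replicas.
    candidates = [
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    ]
    if not candidates:
        raise ValueError("no remote rack with at least two nodes")
    remote_rack = rng.choice(candidates)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer_node, second, third]


topology = {
    "rack-a": ["n1", "n2"],
    "rack-b": ["n3", "n4", "n5"],
    "rack-c": ["n6"],
}
replicas = place_replicas("n1", topology, random.Random(0))
```

Keeping two of the three copies on one remote rack means a write crosses racks only once, while the data still survives the loss of either rack.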
SPOF: HDFS Federation
• Scales the NameNode
• Each NameNode has a namespace volume:
✴ Namespace
✴ Block pool
• DataNode: stores blocks from different NameNodes (NN)

SPOF: Single Point of Failure
SPOF: HDFS High Availability(HA)
• An ad-hoc standby NameNode
• The active NN writes updates to shared NFS storage
• The standby NN pulls and merges the logs, keeping its in-memory state up to date
• DataNodes send block reports to both NNs
• Failover takes tens of seconds
MapReduce
• A map task processes a key/value pair to generate a set of intermediate key/value pairs.
✴ Input: the key is the offset of each line, the value is the line itself
✴ Output: <apple, 1> … <pear, 1>, <peach, 1>, written to local disk, not HDFS

• A reduce task merges all intermediate values associated with the same intermediate key.
✴ Shuffle and sort
✴ Input: the map output grouped by key, for example <apple, 1> … <apple, 1>
✴ Output: <apple, 5>, written to HDFS
• The reduce function cannot run until every map task has finished; speculative execution reruns slow tasks on other nodes to shorten that wait.
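The map, shuffle-and-sort, and reduce phases can be simulated in a single process to show how the key/value pairs flow. This is a minimal sketch of the programming model, not the Hadoop API: the function names are assumptions, and the line offset is counted in characters here rather than bytes.

```python
from collections import defaultdict


def map_task(offset, line):
    """Emit <word, 1> for every word in the line; the offset key is unused."""
    for word in line.split():
        yield word.lower(), 1


def shuffle_and_sort(pairs):
    """Group intermediate values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_task(key, values):
    """Merge all values that share one intermediate key."""
    return key, sum(values)


def word_count(text):
    intermediate = []
    offset = 0
    for line in text.splitlines(keepends=True):
        intermediate.extend(map_task(offset, line))
        offset += len(line)
    # Reduce runs only after all map output has been collected.
    return dict(reduce_task(k, v) for k, v in shuffle_and_sort(intermediate))


counts = word_count("apple pear\napple peach\napple\n")
```

Here `counts` maps `apple` to 3 and `pear` and `peach` to 1 each. In a real cluster the map calls run on different nodes and the grouping step moves data across the network, but the contract between the phases is the same.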
MapReduce v1 Framework
MapReduce v2 Framework
• YARN (Yet Another Resource Negotiator)
• Scheduler
• Applications Manager
• Application Master: monitors tasks
YARN’s Beauty
• Memory is allocated dynamically at fine granularity (1 GB to 10 GB), not in fixed slots
• No JVM reuse: each task runs in its own JVM
• MapReduce is just one kind of application
• The Application Master, not the ResourceManager, aggregates job status
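The difference between fixed slots and dynamically sized memory can be shown with a toy capacity calculation. Everything here is an assumption for illustration: the 10 GB node, the 2 GB slot size, the task requests, and the smallest-first greedy packing, which is far simpler than a real YARN scheduler.

```python
def fixed_slot_capacity(num_slots, slot_gb, requests_gb):
    """Count tasks that can run when each task occupies one fixed-size slot."""
    runnable = [r for r in requests_gb if r <= slot_gb]
    return min(num_slots, len(runnable))


def container_capacity(node_memory_gb, requests_gb):
    """Count tasks that fit when memory is granted per request.

    Greedy smallest-first packing, chosen only to keep the sketch short.
    """
    free = node_memory_gb
    granted = 0
    for request in sorted(requests_gb):
        if request > free:
            break
        free -= request
        granted += 1
    return granted


requests = [1, 1, 1, 1, 1, 1, 4]                # GB per task
with_slots = fixed_slot_capacity(5, 2, requests)  # 10 GB node cut into 5 x 2 GB
with_containers = container_capacity(10, requests)
```

With slots, five small tasks run, each wasting half of its slot, and the 4 GB task can never be placed; with per-request grants all seven tasks fit in the same 10 GB.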
When Not to Use Hadoop?
• Low-latency data access: for real-time needs, use HBase
• Structured data: use an RDBMS for ad-hoc SQL queries
• When the data isn’t that big: Hadoop is meant for TB and PB, not GB
• Too many small files
• More writes than reads
• MapReduce may not be the best choice: it suits data with no dependencies that can be processed in parallel.
Thank You!
