Hadoop Distributed Computing Framework for Big Data

http://www.cyanny.com/2013/12/05/hadoop-overview/
The Motivation for Hadoop
• Hadoop is an open-source distributed computing framework
for processing large-scale data sets.
• Created by Doug Cutting; it originated in Apache Nutch and
was split out of Nutch in 2006.
• Based on the Google GFS paper (2003) and the MapReduce paper
(Jeff Dean, 2004). Google: 200 clusters, each with 1000+ nodes.
• Yahoo: 42,000 nodes; LinkedIn: 4,100 nodes; Facebook:
1,400; eBay: 500; Taobao: 2,000 (biggest in CN)
• Ecosystem: HBase, Hive, Pig, ZooKeeper, Oozie, Mahout, …
Why Hadoop?
• Problems with traditional big data processing (MPI, grid
computing, volunteer computing):
✴ It is difficult to deal with partial failures of the system.
✴ Bandwidth is finite and precious, yet data from different
disks must be combined over the network, so transfer time is
very slow for large data volumes.
✴ Data exchange requires synchronization.
✴ Temporal dependencies are complicated.
How Hadoop Saves Big Data
• Hadoop provides partial failure support. The Hadoop Distributed File System
(HDFS) can store large data sets with high reliability and scalability.
• HDFS provides strong fault tolerance. A partial failure does not result in the
failure of the entire system, and HDFS can recover data after a partial
failure.
• Hadoop introduces MapReduce, which spares programmers from low-level
details such as partial failure. The MapReduce framework detects failed tasks
and reschedules them automatically.
• Hadoop provides data locality. The MapReduce framework tries to co-locate
computation with the data. Data is local, and tasks are independent of each
other. The shared-nothing, data-local architecture therefore saves
bandwidth and avoids complicated dependency problems.
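The automatic rescheduling of failed tasks mentioned above can be pictured with a small sketch. This is not Hadoop's scheduler: `run_with_retries`, `flaky_task`, and the attempt limit of 4 are hypothetical names and values chosen only for illustration.

```python
def run_with_retries(task_ids, run_task, max_attempts=4):
    """Run every task, rescheduling a failed attempt until max_attempts is reached."""
    results = {}
    pending = [(task_id, 1) for task_id in task_ids]
    while pending:
        task_id, attempt = pending.pop(0)
        try:
            results[task_id] = run_task(task_id, attempt)
        except RuntimeError:
            if attempt >= max_attempts:
                raise  # too many failed attempts: the whole job fails
            # Put the task back in the queue; other tasks are unaffected.
            pending.append((task_id, attempt + 1))
    return results


def flaky_task(task_id, attempt):
    # Hypothetical task: odd-numbered tasks fail on their first attempt.
    if task_id % 2 == 1 and attempt == 1:
        raise RuntimeError(f"task {task_id} failed")
    return task_id * task_id


results = run_with_retries(range(4), flaky_task)
print(results)
```

Because tasks do not depend on each other, a failed attempt only costs a rerun of that one task; the programmer never handles the failure explicitly.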
Hadoop Basic Concepts
• The core concept of Hadoop is to distribute the
data when it is initially stored in the system. That is
data locality.
• Applications are written in high-level code.
• Nodes depend on each other as little as possible.
• Data replication: data is spread among machines in
advance.
Hadoop High-Level Overview
• HDFS (Hadoop Distributed File System) is
a distributed file system designed to store large data
sets and streaming data sets on commodity
hardware with high scalability, reliability, and
availability.
• MapReduce is a parallel programming model and
an associated implementation for processing and
generating large data sets. It provides a clean
abstraction for programmers.
Master-Slave Architecture
• NameNode: manages the HDFS namespace and
metadata.
• Secondary NameNode: performs
housekeeping functions for the NameNode. It
is not a backup or hot standby for the
NameNode.
• DataNode: stores the actual HDFS data
blocks. In Hadoop, a large file is split into
64 MB or 128 MB blocks.
• JobTracker: manages MapReduce
jobs and distributes individual tasks to the machines
running TaskTrackers.
• TaskTracker: initiates and monitors
each individual Map and Reduce task.

Each daemon runs in its own JVM
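The block splitting described for the DataNode is simple arithmetic. The sketch below assumes a 64 MB block size and a hypothetical helper `split_into_blocks`; it only computes block sizes and does not talk to HDFS.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


blocks = split_into_blocks(200 * MB)
print(len(blocks), [size // MB for size in blocks])  # 4 [64, 64, 64, 8]
```

A 200 MB file therefore becomes three full 64 MB blocks plus one 8 MB block, and each block can be stored on a different DataNode.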
An Example
• WordCount

POSIX: Portable Operating System Interface

• bin/hadoop fs -copyFromLocal conf input
• bin/hadoop jar hadoop-examples-1.2.1.jar grep input output 'dfs[a-z.]+'
• bin/hadoop fs -cat output/*
• localhost:50030: check MapReduce status
• localhost:50070: check HDFS status
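To see what the grep job above computes, here is a local, single-process sketch of the same idea: find every match of the regular expression and count how often each distinct match occurs. The function `grep_count` and the sample lines are made up for illustration; the real job reads the files under `input` and runs as MapReduce tasks.

```python
import re
from collections import Counter


def grep_count(lines, pattern):
    """Count each distinct regex match across all lines, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(pattern, line))
    return counts.most_common()


# Hypothetical lines resembling Hadoop configuration files.
sample = [
    "<name>dfs.replication</name>",
    "<name>dfs.name.dir</name>",
    "dfs.replication controls the number of replicas",
]
result = grep_count(sample, r"dfs[a-z.]+")
print(result)  # [('dfs.replication', 2), ('dfs.name.dir', 1)]
```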
HDFS: Basic Concepts

• Highly fault-tolerant: handles partial failure
• Streaming data access: block data (64 MB,
128 MB), "write once, read many times"
• Large data sets: GB, TB, PB
HDFS Architecture
• NameNode: namespace tree (logical file locations, and
physical block locations held in RAM)
• DataNode: stores the actual data blocks
• Communication: RPC
Secondary NameNode
• NameNode data persistence: FSImage and EditLog
✤ The FSImage persists the filesystem tree, the mapping
of files to blocks, and filesystem properties
✤ Physical block locations are not persisted; they
are kept in RAM
• Checkpoint: merge the EditLog into the FSImage
• Secondary NameNode housekeeping: periodic
checkpoints
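The checkpoint step can be sketched as replaying a log of operations onto a snapshot. The data layout here (a dict from path to block IDs, and tuples for log entries) is a simplification invented for this example and is not the real FSImage or EditLog format.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace (path -> block IDs)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = []
    elif op == "add_block":
        namespace[path].append(args[0])
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the image; return the new image and an empty log."""
    merged = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


fsimage = {"/logs/a.txt": ["blk_1"]}
edit_log = [
    ("create", "/logs/b.txt"),
    ("add_block", "/logs/b.txt", "blk_2"),
    ("delete", "/logs/a.txt"),
]
new_image, new_log = checkpoint(fsimage, edit_log)
print(new_image, new_log)  # {'/logs/b.txt': ['blk_2']} []
```

After a checkpoint the log is empty again, so a restarting NameNode has far fewer edits to replay. Note that the sketch stores only block IDs, not block locations, matching the point that locations live in RAM only.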
HDFS: Data Replication
• 3 replicas: high reliability
• the first replica on a node in the local rack
• the second on a node in a different, remote rack
• the third on a different node in the same remote rack
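The placement rule above can be written down in a few lines. This is a deterministic sketch under stated assumptions: `place_replicas` and the `topology` dict are hypothetical, the first replica is put on the writer's own node, and the remote rack is simply the first one with two nodes, whereas a real cluster would also weigh load and free space.

```python
def place_replicas(writer_node, topology):
    """Pick three nodes: one in the local rack, two in a single remote rack.

    topology maps a rack name to the list of node names in that rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]
    # Choose a remote rack with room for two replicas on different nodes.
    remote_rack = next(
        rack
        for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    )
    second, third = topology[remote_rack][:2]
    return [writer_node, second, third]


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5"]}
replicas = place_replicas("n1", topology)
print(replicas)  # ['n1', 'n3', 'n4']
```

Losing a whole rack still leaves at least one replica, while two of the three copies share a rack so that only one transfer crosses racks.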
SPOF: HDFS Federation
• Scales the NameNode
• Each NameNode has a namespace volume:
✴ Namespace
✴ Block pool
• DataNode: stores blocks from different NameNodes

SPOF: Single Point of Failure
SPOF: HDFS High Availability(HA)
• An ad-hoc standby NameNode
• The active NN writes updates to shared NFS
• The standby NN pulls and merges the logs, keeping
an up-to-date state in memory
• DataNodes send block reports to both NNs
• Failover takes tens of seconds
MapReduce
• A map task processes a key/value pair to generate a set of
intermediate key/value pairs.
✴ Input: the key is the offset of each line, the value is the line itself
✴ Output: <apple, 1> … <pear, 1>, <peach, 1>, written to local disk, not HDFS

• A reduce task merges all intermediate values associated with the
same intermediate key.
• Shuffle and sort
• Input: the output of the map tasks with the same key, e.g. <apple, 1> … <apple, 1>
• Output: <apple, 5>, written to HDFS
• No reduce task can start until every map task has finished, so slow map tasks delay the whole job (mitigated by speculative execution)
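The map, shuffle-and-sort, and reduce steps above can be simulated in a single process. This is a teaching sketch, not the Hadoop API: all function names are hypothetical, and the offset is a character offset rather than a byte offset.

```python
from collections import defaultdict


def map_task(offset, line):
    """Emit <word, 1> for every word; the key (line offset) is not needed here."""
    for word in line.split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Group intermediate values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_task(key, values):
    """Merge all values of one key into a single count."""
    return key, sum(values)


def word_count(text):
    intermediate = []
    offset = 0
    for line in text.splitlines(keepends=True):
        intermediate.extend(map_task(offset, line))
        offset += len(line)
    # Reducing starts only after every map output has been collected.
    return [reduce_task(key, values) for key, values in shuffle_and_sort(intermediate)]


counts = word_count("apple pear\napple peach\napple\n")
print(counts)  # [('apple', 3), ('peach', 1), ('pear', 1)]
```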
MapReduce
MapReduce v1 Framework
MapReduce v2 Framework
YARN (Yet Another Resource Negotiator)
• Scheduler
• Applications Manager
• Application Master: monitors tasks
YARN's Beauty
• Memory is allocated dynamically at fine granularity (1 GB to 10 GB), not in fixed slots
• No JVM reuse: each task runs in its own JVM
• MapReduce is just one kind of application
• The Application Master aggregates job status, not the Resource Manager
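The difference between memory-based allocation and fixed slots can be illustrated with a greedy first-fit sketch. The function `allocate_containers`, the 16 GB node, and the request sizes are assumptions made for this example; the real YARN schedulers apply queue and fairness policies that are not modelled here.

```python
def allocate_containers(node_memory_gb, requests_gb):
    """Grant container requests (in GB) in order while the node has enough free memory."""
    free = node_memory_gb
    granted, waiting = [], []
    for request in requests_gb:
        if request <= free:
            granted.append(request)
            free -= request
        else:
            waiting.append(request)  # must wait until memory is released
    return granted, waiting, free


granted, waiting, free = allocate_containers(16, [1, 10, 4, 2, 1])
print(granted, waiting, free)  # [1, 10, 4, 1] [2] 0
```

With fixed slots, a 1 GB task and a 10 GB task would each occupy one identically sized slot; sizing containers by memory lets small and large tasks share the node without wasting capacity.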
When Not to Use Hadoop?
• Low-latency data access: for real-time needs, use HBase
• Structured data: use an RDBMS for ad-hoc SQL queries
• When the data isn't that big: Hadoop is meant for TB and PB, not GB
• Too many small files
• More writes than reads
• MapReduce may not be the best choice: it fits data that has no
dependencies and can be processed in parallel.
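The "too many small files" item follows from the NameNode keeping all namespace metadata in RAM. A rough estimate is sketched below. The figure of about 150 bytes per namespace object (file or block) is a commonly quoted rule of thumb and is an assumption here, not a number from these slides; `namenode_memory_bytes` is a hypothetical helper.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not an exact figure


def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Estimate NameNode heap used by files and their blocks (directories ignored)."""
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT


estimate = namenode_memory_bytes(10_000_000)
print(f"{estimate / 1e9:.1f} GB")  # 3.0 GB
```

Under that assumption, ten million single-block files cost about 3 GB of NameNode memory regardless of how small the files are, which is why many small files scale worse than a few large ones.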
Thank You!

Hadoop distributed computing framework for big data

  • 1. HADOOP⼤大数据分布式 计算框架 Hadoop Distributed Computing Framework for Big Data http://www.cyanny.com/2013/12/05/hadoop-overview/
  • 2. The Motivation for Hadoop • Hadoop is an open source distributed computing framework for large-scale data sets processing. • Created by Doug Cutting, origins in Apache Nutch, moved out from Nutch in 2006 • Based on Google GFS paper (2003) and MapReduce Paper (Jeff Dean, 2004), Google 200 clusters, each has 1000+ nodes • Yahoo : 42000nodes,LinkedIn: 4100 nodes, Facebook: 1400, eBay: 500, TaoBao: 2000(biggest in CN) • Echosystem: HBase, Hive, Pig, Zookeeper, Oozie, Mahout….
  • 3. Why Hadoop? • Problems in traditional big data processing(MPI, Grid Computing, Volunteer Computing): ✴It’s difficult to deal with partial failures of the system. ✴Finite and precious bandwidth must be available to combine data from different disks and transfer time is very slow for big data volume. ✴Data exchange requires synchronization. ✴Temporal dependencies are complicated.
  • 4. How Hadoop Save Big Data • Hadoop provide partial failure support. Hadoop Distributed File System (HDFS) can store large data sets with high reliability and scalability. • HDFS provide great fault tolerance. Partial Failure will not result in the failure of the entire system. And HDFS provide data recoverability for partial failure. • Hadoop introduce MapReduce, which spares programmers from low-level details, like partial failure. The MapReduce framework will detect failed tasks and reschedule them automatically. • Hadoop provide data locality. The MapReduce framework tries to collocate data with the compute nodes. Data is local, and tasks are separated with no dependence on each other. So the shared-nothing and data locality architecture can save more bandwidth and solve the complicated dependence problem
  • 5. Hadoop Basic Concepts • The core concepts for Hadoop are to distribute the data as it is initially stored in the system. That is data locality. • Applications are written in high-level code. • Nodes Dependency as little as possible. • Data Replica, data is spread among machines in advance
  • 6. Hadoop High-Level Overview • HDFS (Hadoop Distributed File System), which is a distributed file system designed to store large data sets and streaming data sets on commodity hardware with high scalability, reliability and availability. • MapReduce is a parallel programming model and an associated implementation for processing and generating large data sets. It provides a clean abstraction for programmers.
  • 7. Master-Slave Architecture • NameNode: HDFS namespace and metadata. • Secondary NameNode, which performs housekeeping functions for NameNode, and isn’t a backup or hot standby for the NameNode. • DataNode, which stores actual HDFS data blocks. In Hadoop, a large file is split into 64M or 128M blocks. • JobTracker, which manages MapReduce jobs, distributes individual tasks to machines running. • TaskTracker, which initiates and monitors each individual Map and Reduce tasks. Each Daemon Runs its own JVM
  • 8. An Example • WordCount POSIX: Portable Operating System Interface • fs -copyFromLocal conf input • bin/hadoop jar hadoop-examples-1.2.1.jar grep input output 'dfs[a-z.]+' • bin/hadoop fs -cat output/* • localhost:50030, check MapReduce status • localhost:50070, check HDFS status
  • 9. HDFS: Basic Concepts • Highly fault-tolerant: handle partial failure • Streaming Data Access: Block Data(64 MB, 128MB), “Write-once-read-many-times” • Large data sets: GB, TB,PB
  • 10. HDFS Architecture • NameNode: namespace tree(logical file location and physical location in RAM) • DataNode: store actual data blocks • Communication : RPC
  • 11. Secondary NameNode • NameNode Data Persistent: FSImage and EditLog ✤ FSImage persistent for filesystem tree, mapping of files and blocks, filesystem properties ✤ No persistent for block physical locations, which are in RAM • Checkpoint: Merge Editlog with FSImage • Secondary NameNode Housekeeping: Periodically checkpoint
  • 12. HDFS: Data Replica • 3 Replica: high reliability • one replica on one node in the local rack • the second one on a node in a different remote rack • the third one on a different node in the same remote rack.
  • 13. SPOF: HDFS Federation • Scale NameNode • Each NameNode has Namespace Volume: ✴NameSpace ✴Block Pool • DataNode: Stores blocks from different NN. SPOF: Single Point of Failure
  • 14. SPOF: HDFS High Availability(HA) • A ad-hoc standby NameNode • Active NN write update to shared NFS • Standby NN pulls and merges logs, up-to-date in memory • DataNodes: sends Block reports to both NN • Failover in tens of seconds
  • 15. MapReduce • Map task is to process a key/value pair to generate a set of intermediate key/value pairs. ✴ Input: key is the offset of each line, value is each line ✴ Output: <apple, 1>…<pear, 1>, <peach, 1>, written to local disk not HDFS • Reduce task is to merge all intermediate values associated with the same intermediated key • Shuffle and sort • Input: the output from map task, with the same key, like : <apple, 1> … <apple, 1> • Output: <apple, 5>, written to HDFS • No reduce task can start until every map task has finished (Speculative Execution)
  • 18.
  • 19. MapReduce v2 Framework YARN(Yet Another Resource Negotiator) Scheduler Applications Manager Application Master: monitor task
  • 20. YARN’s Beauty Memory dynamic grained(1G~10G), not fixed slots No JVM reuse, each task runs on each JVM MapReduce is kind of Application App Master Aggregates Job status, not Resource Manager
  • 21.
  • 22. When not use Hadoop? • Low-latency Data Access: real-time needs, HBase • Structured Data: RDBMS, ad-hoc sql query • When data isn’t that big: Hadoop needs TB and PB, not GB • Too many small files • Write more than read • MapReduce may be not the best choice: data no dependency, and parallel.