Hadoop Distributed File System

Dhruba Borthakur
Apache Hadoop Project Management Committee
dhruba@apache.org
June 3rd, 2008
Who Am I?
• Hadoop Developer
   – Core contributor since Hadoop’s infancy
   – Focussed on Hadoop Distributed File System
• Facebook (Hadoop)
• Yahoo! (Hadoop)
• Veritas (SANPoint Direct, VxFS)
• IBM Transarc (Andrew File System)
Hadoop, Why?
• Need to process huge datasets on large clusters
  of computers
• Very expensive to build reliability into each
  application.
• Nodes fail every day
  – Failure is expected, rather than exceptional.
  – The number of nodes in a cluster is not constant.
• Need common infrastructure
  – Efficient, reliable, easy to use
  – Open Source, Apache License
Hadoop History
• Dec 2004 – Google MapReduce paper published
• July 2005 – Nutch uses MapReduce
• Jan 2006 – Doug Cutting joins Yahoo!
• Feb 2006 – Becomes Lucene subproject
• Apr 2007 – Yahoo! on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• Feb 2008 – Yahoo! production search index
Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM : Blue Cloud?
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
Commodity Hardware



Typically in a 2-level architecture
– Nodes are commodity PCs
– 30-40 nodes/rack
– Uplink from rack is 3-4 gigabit
– Rack-internal is 1 gigabit
Goals of HDFS
• Very Large Distributed File System
  – 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
  – Files are replicated to handle hardware failure
  – Detects failures and recovers from them
• Optimized for Batch Processing
  – Data locations exposed so that computations can
  move to where data resides
  – Provides very high aggregate bandwidth
HDFS Architecture
[Diagram: Client, NameNode, Secondary NameNode, and DataNodes, with links labeled "Cluster Membership"]
NameNode: Maps a file to a file-id and list of DataNodes
DataNode: Maps a block-id to a physical location on disk
SecondaryNameNode: Periodic merge of Transaction log
Distributed File System
• Single Namespace for entire cluster
• Data Coherency
  – Write-once-read-many access model
  – Client can only append to existing files
• Files are broken up into blocks
  – Typically 128 MB block size
  – Each block replicated on multiple DataNodes
• Intelligent Client
  – Client can find location of blocks
  – Client accesses data directly from DataNode
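As a rough illustration of how a file is broken up into blocks, the sketch below splits a file length into fixed-size pieces. The 128 MB size is the typical value quoted on this slide; the function name `split_into_blocks` is hypothetical and not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size quoted above


def split_into_blocks(file_length, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_length bytes."""
    blocks = []
    offset = 0
    while offset < file_length:
        # The last block may be shorter than block_size.
        length = min(block_size, file_length - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    # A 300 MB file needs two full blocks plus one 44 MB tail block.
    for offset, length in split_into_blocks(300 * 1024 * 1024):
        print(offset, length)
```

Each (offset, length) pair would then be replicated on several DataNodes, and the client would read each piece directly from one of them.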
Functions of a NameNode
• Manages File System Namespace
  – Maps a file name to a set of blocks
  – Maps a block to the DataNodes where it resides

• Cluster Configuration Management
• Replication Engine for Blocks
NameNode Metadata
• Meta-data in Memory
  – The entire metadata is in main memory
  – No demand paging of meta-data
• Types of Metadata
  – List of files
  – List of Blocks for each file
  – List of DataNodes for each block
  – File attributes, e.g. creation time, replication factor
• A Transaction Log
  – Records file creations, file deletions, etc.
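The metadata types listed on this slide can be pictured as a few in-memory maps plus an append-only log. The following is a minimal sketch under that assumption; the class `ToyNameNode` and its method names are hypothetical and do not mirror the real NameNode code.

```python
class ToyNameNode:
    """Hypothetical in-memory model of the metadata listed above."""

    def __init__(self):
        self.files = {}            # file name -> {"blocks": [...], "replication": int}
        self.block_locations = {}  # block id -> set of DataNode names
        self.edit_log = []         # append-only record of namespace changes

    def create(self, name, replication=3):
        self.files[name] = {"blocks": [], "replication": replication}
        self.edit_log.append(("create", name, replication))

    def add_block(self, name, block_id, datanodes):
        self.files[name]["blocks"].append(block_id)
        self.block_locations[block_id] = set(datanodes)
        # In this sketch the log records the block, not where its replicas live.
        self.edit_log.append(("add_block", name, block_id))

    def delete(self, name):
        for block_id in self.files.pop(name)["blocks"]:
            self.block_locations.pop(block_id, None)
        self.edit_log.append(("delete", name))

    def locate(self, name):
        """Return [(block id, sorted DataNode names)] for a file."""
        return [(b, sorted(self.block_locations[b]))
                for b in self.files[name]["blocks"]]
```

Everything lives in ordinary dictionaries, which matches the point that the entire metadata is held in main memory.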
DataNode
• A Block Server
  – Stores data in the local file system (e.g. ext3)
  – Stores meta-data of a block (e.g. CRC)
  – Serves data and meta-data to Clients
• Block Report
  – Periodically sends a report of all existing blocks to
  the NameNode
• Facilitates Pipelining of Data
  – Forwards data to other specified DataNodes
Block Placement
• Current Strategy
  – One replica on local node
  – Second replica on a remote rack
  – Third replica on same remote rack
  – Additional replicas are randomly placed
• Clients read from nearest replica
• Would like to make this policy pluggable
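The placement strategy on this slide can be sketched as a small function. This is an illustrative approximation, not the actual HDFS placement code: the name `choose_replica_nodes`, the `topology` dictionary, and the fallback used when a rack has too few nodes are all assumptions.

```python
import random


def choose_replica_nodes(writer, topology, replication=3, rng=random):
    """Pick DataNodes for one block.

    topology maps rack name -> list of node names; writer is the node
    running the client.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    chosen = [writer]  # first replica: the local node
    remote_racks = [r for r in topology if r != rack_of[writer]]
    if replication >= 2 and remote_racks:
        remote_rack = rng.choice(remote_racks)
        # Second and third replicas: different nodes on one remote rack.
        count = min(replication - 1, 2, len(topology[remote_rack]))
        chosen += rng.sample(topology[remote_rack], count)
    # Any replicas still needed: random nodes not already chosen.
    remaining = [n for n in rack_of if n not in chosen]
    extra = min(max(replication - len(chosen), 0), len(remaining))
    chosen += rng.sample(remaining, extra)
    return chosen


if __name__ == "__main__":
    racks = {"rack1": ["a", "b"], "rack2": ["c", "d"], "rack3": ["e", "f"]}
    print(choose_replica_nodes("a", racks))
```

Passing the random source as a parameter keeps the sketch testable, and hints at how a pluggable policy might be swapped in.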
Heartbeats
• DataNodes send heartbeat to the NameNode
  – Once every 3 seconds
• NameNode uses heartbeats to detect DataNode
  failure
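One way to picture heartbeat-based failure detection is a table of last-seen timestamps. In the sketch below the 3-second interval comes from this slide, while the 30-second timeout, the class `HeartbeatMonitor`, and its methods are assumptions made for illustration.

```python
HEARTBEAT_INTERVAL = 3.0  # seconds, from the slide
DEAD_AFTER = 30.0         # assumed timeout; the slide does not state one


class HeartbeatMonitor:
    """Hypothetical tracker of the last heartbeat seen from each DataNode."""

    def __init__(self, dead_after=DEAD_AFTER):
        self.dead_after = dead_after
        self.last_seen = {}

    def heartbeat(self, datanode, now):
        # Record the time (in seconds) at which this DataNode last reported.
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        # A node is presumed dead once it has been silent longer than the timeout.
        return sorted(dn for dn, seen in self.last_seen.items()
                      if now - seen > self.dead_after)
```

Timestamps are passed in explicitly rather than read from a clock, so the behaviour can be checked without waiting.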
Replication Engine
• NameNode detects DataNode failures
  – Chooses new DataNodes for new replicas
  – Balances disk usage
  – Balances communication traffic to DataNodes
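A simplified picture of the replication engine: after a DataNode is declared dead, find blocks with too few live replicas and pick the least-loaded live nodes as new targets. The function `plan_re_replication` and its use of replica counts as a stand-in for disk usage are assumptions, not the real algorithm.

```python
def plan_re_replication(block_locations, live_nodes, target=3):
    """Return {block id: [new DataNodes]} for blocks that lost replicas."""
    # Count replicas held by each live node (a stand-in for disk usage).
    load = {node: 0 for node in live_nodes}
    for holders in block_locations.values():
        for node in holders:
            if node in load:
                load[node] += 1
    plan = {}
    for block_id in sorted(block_locations):
        alive = [n for n in block_locations[block_id] if n in load]
        missing = target - len(alive)
        if missing <= 0:
            continue
        # Prefer the least-loaded nodes that do not already hold this block.
        candidates = sorted((n for n in load if n not in alive),
                            key=lambda n: (load[n], n))
        picks = candidates[:missing]
        for node in picks:
            load[node] += 1
        plan[block_id] = picks
    return plan
```

Choosing the least-loaded candidates is one simple way to approximate the "balances disk usage" goal; balancing network traffic is not modelled here.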
Data Correctness
• Use Checksums to validate data
  – Use CRC32
• File Creation
  – Client computes checksum per 512 bytes
  – DataNode stores the checksum
• File access
  – Client retrieves the data and checksum from
  DataNode
  – If Validation fails, Client tries other replicas
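The checksum scheme on this slide can be sketched with Python's standard `zlib.crc32`. The 512-byte chunk size and CRC32 come from the slide; the helper names `compute_checksums` and `read_verified`, and the idea of passing replicas as a list of byte strings, are assumptions for illustration.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # one CRC32 per 512 bytes, as on the slide


def compute_checksums(data, chunk_size=BYTES_PER_CHECKSUM):
    """Return one CRC32 value per chunk of data."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def read_verified(replicas, checksums):
    """Return data from the first replica whose checksums all match."""
    for data in replicas:
        if compute_checksums(data) == checksums:
            return data
    raise IOError("all replicas failed checksum validation")


if __name__ == "__main__":
    original = bytes(range(256)) * 5          # 1280 bytes, so 3 chunks
    sums = compute_checksums(original)
    corrupted = b"\xff" + original[1:]        # flip the first byte
    print(read_verified([corrupted, original], sums) == original)
```

The corrupted replica is skipped because its first chunk no longer matches the stored CRC, and the read falls through to the next replica.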
NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
  – A directory on the local file system
  – A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
Data Pipelining
• Client retrieves a list of DataNodes on which to place
  replicas of a block
• Client writes block to the first DataNode
• The first DataNode forwards the data to the next
  DataNode in the Pipeline
• When all replicas are written, the Client moves on to
  write the next block in file
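The pipelining steps above can be modelled as a loop in which the client hands each packet only to the first DataNode and every node passes it along. This is a toy, single-process sketch: the function `write_block` and the 64 KB packet size are assumptions, and no networking or acknowledgement handling is shown.

```python
def write_block(block, pipeline, packet_size=64 * 1024):
    """Stream one block through a chain of DataNodes, packet by packet.

    Returns {DataNode name: bytes stored on that node}.
    """
    stored = {node: bytearray() for node in pipeline}
    for start in range(0, len(block), packet_size):
        packet = block[start:start + packet_size]
        # The client sends the packet to the first node only; each node
        # stores it and forwards it to the next node in the pipeline.
        for node in pipeline:
            stored[node].extend(packet)
    return {node: bytes(buf) for node, buf in stored.items()}


if __name__ == "__main__":
    result = write_block(b"x" * 200000, ["dn1", "dn2", "dn3"])
    print({node: len(data) for node, data in result.items()})
```

Once every node in the chain holds the full block, the client would request a fresh list of DataNodes and repeat the process for the next block.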
Rebalancer
• Goal: % disk full on DataNodes should be similar
  – Usually run when new DataNodes are added
  – Cluster is online when Rebalancer is active
  – Rebalancer is throttled to avoid network congestion
  – Command line tool
Secondary NameNode
• Copies FSImage and Transaction Log from
  NameNode to a temporary directory
• Merges FSImage and Transaction Log into a new
  FSImage in temporary directory
• Uploads new FSImage to the NameNode
  – Transaction Log on NameNode is purged
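The checkpoint cycle above amounts to replaying the log against a copy of the image. A minimal sketch, assuming the image is a dictionary of paths and the log is a list of tuples; `apply_edit`, `checkpoint`, and the two operation names are hypothetical.

```python
def apply_edit(image, edit):
    """Apply one logged operation to an in-memory image (a dict of paths)."""
    op, path = edit[0], edit[1]
    if op == "create":
        image[path] = {"replication": edit[2]}
    elif op == "delete":
        image.pop(path, None)
    else:
        raise ValueError("unknown operation: %r" % (op,))


def checkpoint(fsimage, edit_log):
    """Merge the log into a copy of the image; return (new image, purged log)."""
    # Work on a copy, like the temporary directory used by the Secondary NameNode.
    new_image = {path: dict(attrs) for path, attrs in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


if __name__ == "__main__":
    image = {"/a": {"replication": 3}}
    log = [("create", "/b", 2), ("delete", "/a")]
    print(checkpoint(image, log))
```

After the merge the log is empty, so a restart only has to load the new image rather than replay a long history.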
User Interface
• Commands for HDFS User:
  – hadoop dfs -mkdir /foodir
  – hadoop dfs -cat /foodir/myfile.txt
  – hadoop dfs -rm /foodir/myfile.txt
• Commands for HDFS Administrator:
  – hadoop dfsadmin -report
  – hadoop dfsadmin -decommission datanodename
• Web Interface
  – http://host:port/dfshealth.jsp
Hadoop Map/Reduce
• The Map-Reduce programming model
  – Framework for distributed processing of large data
  sets
  – Pluggable user code runs in generic framework
• Common design pattern in data processing
  cat * | grep | sort    | uniq -c | cat > file
  input | map  | shuffle | reduce  | output
• Natural for:
  – Log processing
   – Web search indexing
   – Ad-hoc queries
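The pipeline analogy on this slide can be imitated in a single process. The sketch below is not Hadoop code: the phase functions and `run_job` are hypothetical names that mirror grep (map), sort (shuffle), and uniq -c (reduce) on a small list of lines.

```python
from collections import defaultdict


def map_phase(lines, pattern):
    """Like grep: emit (line, 1) for every line containing pattern."""
    for line in lines:
        if pattern in line:
            yield line.strip(), 1


def shuffle_phase(pairs):
    """Like sort: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups):
    """Like uniq -c: collapse each group to a count."""
    return [(key, sum(values)) for key, values in groups]


def run_job(lines, pattern):
    return reduce_phase(shuffle_phase(map_phase(lines, pattern)))


if __name__ == "__main__":
    log = ["GET /a\n", "POST /b\n", "GET /a\n", "GET /c\n"]
    print(run_job(log, "GET"))
```

In a real cluster the map and reduce functions would be the pluggable user code, while the framework handles the shuffle and distribution across nodes.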
Hadoop Subprojects
• Pig (Initiated by Yahoo!)
   – High-level language for data analysis
• HBase (initiated by Powerset)
   – Table storage for semi-structured data
• ZooKeeper (Initiated by Yahoo!)
   – Coordinating distributed applications
• Hive (initiated by Facebook, coming soon)
   – SQL-like Query language and Metastore
• Mahout
   – Machine learning
Useful Links
• HDFS Design:
  – http://hadoop.apache.org/core/docs/current/hdfs_design.html
• Hadoop API:
  – http://hadoop.apache.org/core/docs/current/api/
