Hadoop Distributed File System

                Dhruba Borthakur
   Apache Hadoop Project Management Committee
              dhruba@apache.org
             dhruba@facebook.com
Hadoop, Why?
•  Need to process huge datasets on large clusters
   of computers
•  Nodes fail every day
   – Failure is expected, rather than exceptional.
   – The number of nodes in a cluster is not constant.
•  Very expensive to build reliability into each
   application.
•  Need common infrastructure
   – Efficient, reliable, easy to use
   – Open Source, Apache License
Hadoop History
•    Dec 2004 – Google MapReduce paper published
•    July 2005 – Nutch uses new MapReduce implementation
•    Jan 2006 – Doug Cutting joins Yahoo!
•    Feb 2006 – Hadoop becomes a new Lucene subproject
•    Apr 2007 – Yahoo! running Hadoop on 1000-node cluster
•    Jan 2008 – An Apache Top Level Project
•    Feb 2008 – Yahoo! production search index with Hadoop
•    July 2008 – First reports of a 4000-node cluster
Who uses Hadoop?
•    Amazon/A9
•    Facebook
•    Google
•    IBM
•    Joost
•    Last.fm
•    New York Times
•    PowerSet
•    Veoh
•    Yahoo!
What is Hadoop used for?
•  Search
   –  Yahoo, Amazon, Zvents
•  Log processing
   –  Facebook, Yahoo, ContextWeb, Joost, Last.fm
•  Recommendation Systems
   –  Facebook
•  Data Warehouse
   –  Facebook, AOL
•  Video and Image Analysis
   –  New York Times, Eyealike
Public Hadoop Clouds
•  Hadoop Map-reduce on Amazon EC2 instances
    –  http://wiki.apache.org/hadoop/AmazonEC2
•  IBM Blue Cloud
   –  Partnering with Google to offer web-scale infrastructure
•  Global Cloud Computing Testbed
   –  Joint effort by Yahoo, HP and Intel
Commodity Hardware



Typically in a 2-level architecture
– Nodes are commodity PCs
– 30-40 nodes/rack
– Uplink from rack is 3-4 gigabit
– Rack-internal is 1 gigabit
Goals of HDFS
•  Very Large Distributed File System
   – 10K nodes, 100 million files, 10 PB
•  Assumes Commodity Hardware
   – Files are replicated to handle hardware failure
   – Detects failures and recovers from them
•  Optimized for Batch Processing
   – Data locations exposed so that computations can
   move to where data resides
   – Provides very high aggregate bandwidth
Distributed File System
•  Single Namespace for entire cluster
•  Data Coherency
   – Write-once-read-many access model
   – Client can only append to existing files
•  Files are broken up into blocks
   – Typically 128 MB block size
   – Each block replicated on multiple DataNodes
•  Intelligent Client
   – Client can find location of blocks
   – Client accesses data directly from DataNode
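
The block layout described above can be sketched in a few lines. This is only an illustration, assuming the typical 128 MB block size from the slide; the function name split_into_blocks is hypothetical and is not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size quoted above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    one_gb = 1024 * 1024 * 1024
    print(len(split_into_blocks(one_gb)))  # 8
```

Each (offset, length) pair would be stored as one block and replicated on multiple DataNodes.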
Functions of a NameNode
•  Manages File System Namespace
   – Maps a file name to a set of blocks
   – Maps a block to the DataNodes where
   it resides
•  Cluster Configuration Management
•  Replication Engine for Blocks
NameNode Metadata
•  Meta-data in Memory
   – The entire metadata is in main memory
   – No demand paging of FS meta-data
•  Types of Metadata
   – List of files
   – List of Blocks for each file
   – List of DataNodes for each block
   – File attributes, e.g. access time, replication factor
•  A Transaction Log
   – Records file creations, file deletions, etc.
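
A toy model may help picture the two in-memory maps and the transaction log. The class ToyNameNode and its method names are hypothetical, and the sketch assumes that only namespace changes are logged while block locations are filled in from DataNode reports; it is not the real NameNode implementation.

```python
class ToyNameNode:
    """Illustrative in-memory metadata; all names here are hypothetical."""

    def __init__(self):
        self.file_to_blocks = {}   # file name -> list of block ids
        self.block_to_nodes = {}   # block id -> set of DataNode names
        self.transaction_log = []  # namespace changes, in order

    def create_file(self, name, block_ids):
        self.file_to_blocks[name] = list(block_ids)
        for block_id in block_ids:
            self.block_to_nodes.setdefault(block_id, set())
        self.transaction_log.append(("create", name))

    def delete_file(self, name):
        for block_id in self.file_to_blocks.pop(name):
            self.block_to_nodes.pop(block_id, None)
        self.transaction_log.append(("delete", name))

    def report_block(self, block_id, datanode):
        # Assumption: locations come from DataNode reports and are not logged.
        if block_id in self.block_to_nodes:
            self.block_to_nodes[block_id].add(datanode)

    def locate(self, name):
        """Return [(block id, sorted DataNode names)] for a file."""
        return [(b, sorted(self.block_to_nodes[b]))
                for b in self.file_to_blocks[name]]
```

A client would call something like locate() once and then read the blocks directly from the listed DataNodes.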
DataNode
•  A Block Server
   – Stores data in the local file system (e.g. ext3)
   – Stores meta-data of a block (e.g. CRC)
   – Serves data and meta-data to Clients
•  Block Report
   – Periodically sends a report of all existing blocks to
   the NameNode
•  Facilitates Pipelining of Data
   – Forwards data to other specified DataNodes
Block Placement
•  Current Strategy
   – One replica on random node on local rack
   – Second replica on a random remote rack
   – Third replica on same remote rack
   – Additional replicas are randomly placed
•  Clients read from nearest replica
•  Would like to make this policy pluggable
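
The placement strategy above could look roughly like this. It is a sketch under stated assumptions: there are at least two racks, the remote rack has at least two nodes, and the third replica goes to a different node than the second. The function place_replicas and the racks dictionary are hypothetical.

```python
import random


def place_replicas(racks, local_rack, replication=3, rng=random):
    """racks maps rack name -> list of node names.

    Returns a list of (rack, node) pairs, one per replica.
    """
    chosen = []

    def pick(rack):
        # Pick a random node on this rack that holds no replica yet.
        free = [n for n in racks[rack] if (rack, n) not in chosen]
        chosen.append((rack, rng.choice(free)))

    pick(local_rack)                      # first replica: local rack
    if replication >= 2:
        remote = rng.choice([r for r in racks if r != local_rack])
        pick(remote)                      # second replica: a remote rack
    if replication >= 3:
        pick(remote)                      # third replica: same remote rack
    while len(chosen) < replication:      # additional replicas: anywhere
        free = [(r, n) for r, nodes in racks.items() for n in nodes
                if (r, n) not in chosen]
        chosen.append(rng.choice(free))
    return chosen
```

Passing a seeded random.Random instance as rng makes the choice reproducible for experiments.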
Replication Engine
•  NameNode detects DataNode failures
   – Chooses new DataNodes for new
   replicas
   – Balances disk usage
   – Balances communication traffic to
   DataNodes
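
One way to picture the replication engine is as a planner that finds under-replicated blocks and picks the least-used live DataNodes as targets. This is a hypothetical simplification: plan_re_replication, its parameters, and the "least disk used first" rule are assumptions for illustration, and traffic balancing is left out.

```python
def plan_re_replication(block_to_nodes, live_usage, target=3, block_size=1):
    """Plan new replicas after DataNode failures.

    block_to_nodes: block id -> set of DataNode names holding a replica
    live_usage:     live DataNode name -> disk space used (any unit)
    Returns block id -> list of DataNodes that should receive a new replica.
    """
    usage = dict(live_usage)  # work on a copy; updated as replicas are planned
    plan = {}
    for block, nodes in sorted(block_to_nodes.items()):
        live = {n for n in nodes if n in usage}
        missing = target - len(live)
        if missing <= 0 or not live:
            continue  # fully replicated, or no live replica left to copy from
        # Prefer the least-used nodes that do not already hold the block.
        candidates = sorted((n for n in usage if n not in live),
                            key=lambda n: (usage[n], n))
        new_nodes = candidates[:missing]
        for n in new_nodes:
            usage[n] += block_size
        plan[block] = new_nodes
    return plan
```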
Data Correctness
•  Use Checksums to validate data
   – Use CRC32
•  File Creation
   – Client computes checksum per 512 bytes
   – DataNode stores the checksum
•  File access
   – Client retrieves the data and checksum
   from DataNode
   – If Validation fails, Client tries other replicas
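
The checksum scheme above can be sketched with Python's standard zlib.crc32. The helper names and the representation of a replica as a (data, checksums) pair are assumptions made for this illustration.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # one CRC32 per 512 bytes, as on the slide


def compute_checksums(data, chunk=BYTES_PER_CHECKSUM):
    """Return one CRC32 value per chunk of the data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def read_with_validation(replicas):
    """replicas: list of (data, stored_checksums), one entry per DataNode.

    Returns the data of the first replica whose checksums validate.
    """
    for data, stored in replicas:
        if compute_checksums(data) == stored:
            return data
        # Validation failed: fall through and try the next replica.
    raise IOError("all replicas failed checksum validation")
```

A corrupted replica is skipped, and the read succeeds as long as one replica still matches its stored checksums.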
NameNode Failure
•  A single point of failure
•  Transaction Log stored in multiple
   directories
   – A directory on the local file system
   – A directory on a remote file system
   (NFS/CIFS)
•  Need to develop a real HA solution
Data Pipelining
•  Client retrieves a list of DataNodes on
   which to place replicas of a block
•  Client writes block to the first DataNode
•  The first DataNode forwards the data to
   the next DataNode in the Pipeline
•  When all replicas are written, the Client
   moves on to the next block in file
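
The pipeline steps above can be simulated in memory. This is a toy sketch: ToyDataNode, write_file and choose_pipeline are hypothetical names, and a whole block is forwarded in one call, whereas a real system would stream the data in smaller pieces.

```python
class ToyDataNode:
    """Stores blocks in a dict and forwards them down the pipeline."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def receive(self, block_id, data, downstream):
        self.blocks[block_id] = data
        if downstream:
            # Forward to the next DataNode; it forwards to the rest.
            downstream[0].receive(block_id, data, downstream[1:])


def write_file(blocks, choose_pipeline):
    """blocks: list of (block_id, data); choose_pipeline(block_id) -> DataNodes."""
    for block_id, data in blocks:
        pipeline = choose_pipeline(block_id)
        # The client talks only to the first DataNode in the pipeline.
        pipeline[0].receive(block_id, data, pipeline[1:])
        # All replicas are written here, so move on to the next block.
```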
Secondary NameNode
•  Copies FSImage and Transaction Log from
   NameNode to a temporary directory
•  Merges FSImage and Transaction Log into a
   new FSImage in temporary directory
•  Uploads new FSImage to the NameNode
   – Transaction Log on NameNode is purged
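
The merge step can be pictured as replaying the log over a copy of the image. In this sketch the image is a plain dictionary and the log holds (operation, path, attributes) tuples; both formats and the function name checkpoint are assumptions, not the on-disk formats Hadoop uses.

```python
def checkpoint(fsimage, transaction_log):
    """Merge a transaction log into a copy of the image.

    fsimage:         dict mapping path -> attributes
    transaction_log: list of (op, path, attrs) with op "create" or "delete"
    Returns (new_fsimage, purged_log).
    """
    new_image = dict(fsimage)  # work in a temporary copy, as on the slide
    for op, path, attrs in transaction_log:
        if op == "create":
            new_image[path] = attrs
        elif op == "delete":
            new_image.pop(path, None)
        else:
            raise ValueError("unknown operation: %r" % (op,))
    return new_image, []  # the log is purged once the new image is in place
```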
User Interface
•  Command for HDFS User:
   – hadoop dfs -mkdir /foodir
   – hadoop dfs -cat /foodir/myfile.txt
   – hadoop dfs -rm /foodir/myfile.txt
•  Command for HDFS Administrator
   – hadoop dfsadmin -report
   – hadoop dfsadmin -decommission datanodename
•  Web Interface
   – http://host:port/dfshealth.jsp
Hadoop Map/Reduce
•  Implementation of the Map-Reduce programming model
   – Framework for distributed processing of large data sets
   – Data handled as collections of key-value pairs
   – Pluggable user code runs in generic framework
•  Very common design pattern in data processing
   – Demonstrated by a unix pipeline example:
   cat * | grep | sort | uniq -c | cat > file
   input | map | shuffle | reduce | output
•  Natural for:
    – Log processing
    – Web search indexing
    – Ad-hoc queries
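
The input | map | shuffle | reduce | output pattern can be mimicked in a single process. This is only a sketch of the programming model, not of Hadoop's Java API; map_reduce, grep_mapper and count_reducer are hypothetical names, and the log lines are made up for the example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over records, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:                # map
        for key, value in mapper(record):
            groups[key].append(value)     # shuffle: group values by key
    return {key: reducer(key, values)     # reduce, in sorted key order
            for key, values in sorted(groups.items())}


def grep_mapper(line):
    # Plays the role of "grep": keep ERROR lines, keyed by their first field.
    if "ERROR" in line:
        yield line.split()[0], 1


def count_reducer(key, values):
    # Plays the role of "uniq -c": count occurrences per key.
    return sum(values)


if __name__ == "__main__":
    logs = ["db ERROR timeout", "web INFO ok", "db ERROR disk", "web ERROR 500"]
    print(map_reduce(logs, grep_mapper, count_reducer))  # {'db': 2, 'web': 1}
```

The mapper and reducer are the pluggable user code; the grouping in between is what the framework provides.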
Hadoop Subprojects
•  Hive
  –  A Data Warehouse with SQL support
•  HBase
  –  Table storage for semi-structured data
•  ZooKeeper
  –  Coordinating distributed applications
•  Mahout
  –  Machine learning
Hadoop at Facebook
•  Hardware
  –  4800 cores, 600 machines, 16GB/8GB per
     machine – Nov 2008
  –  4 SATA disks of 1 TB each per machine
  –  2 level network hierarchy, 40 machines per
     rack
Hadoop at Facebook
•  Single HDFS cluster across all cores
   –  2 PB raw capacity
   –  Ingest rate is 2 TB compressed per day
          – 10 TB uncompressed
   –  12 Million files
Hadoop Growth at Facebook
Data Flow at Facebook
[Diagram: Web Servers, Scribe Servers, Network Storage, Hadoop Cluster, Oracle RAC, MySQL]
Hadoop Usage at Facebook
•  Statistics per day:
   –  55TB of compressed data scanned per day
   –  3200+ jobs on production cluster per day
   –  80M compute minutes per day
•  Barrier to entry is significantly reduced:
   –  SQL like language called Hive
   –  http://hadoop.apache.org/hive/
Useful Links
•  HDFS Design:
  – http://hadoop.apache.org/core/docs/current/hdfs_design.html

•  Hadoop API:
   – http://lucene.apache.org/hadoop/api/
•  Hadoop Wiki
   –  http://wiki.apache.org/hadoop/
