Hadoop and Hadoop Distributed File System (HDFS)




Public 2009/5/13



Outline

       • Introduction to Hadoop project
       • Introduction to Hadoop Distributed File System
                 – Architecture
                 – Administration
                 – Client Interface
       • Reference




                                                          Copyright 2009 - Trend Micro Inc.
What is Hadoop?

•   Hadoop is a cloud computing platform for processing and storing
    vast amounts of data.
•   Apache top-level project
•   Open source
•   Hadoop Core includes
     – Hadoop Distributed File System (HDFS)
     – MapReduce framework
•   Hadoop subprojects
     – HBase, ZooKeeper,…
•   Written in Java
•   Client interfaces are available in C++, Java, and shell scripting
•   Runs on
     – Linux, Mac OS X, Windows, and Solaris
     – Commodity hardware

[Figure: layered stack, top to bottom: Cloud Applications; MapReduce and HBase; Hadoop Distributed File System (HDFS); A Cluster of Machines]


A Brief History of Hadoop

• 2003.2
   – First MapReduce library written at Google
• 2003.10
   – Google File System paper published
• 2004.12
   – Google MapReduce paper published
• 2005.7
   – Doug Cutting reports that Nutch now uses a new MapReduce
     implementation
• 2006.2
   – Hadoop code moves out of Nutch into new Lucene sub-project
• 2006.11
   – Google Bigtable paper published

A Brief History of Hadoop (continued)

       • 2007.2
                 – First HBase code drop from Mike Cafarella
       • 2007.4
                 – Yahoo! runs Hadoop on a 1,000-node cluster
       • 2008.1
                 – Hadoop becomes an Apache top-level project




Who uses Hadoop?
     • Yahoo!
             – More than 100,000 CPUs in ~20,000 computers running Hadoop
     • Google
             – University Initiative to Address Internet-Scale Computing Challenges
     • Amazon
             – Amazon builds product search indices using the streaming API and
               pre-existing C++, Perl, and Python tools.
             – Processes millions of sessions daily for analytics
     • IBM
             – Blue Cloud Computing Clusters
     • Trend Micro
             – Threat solution research
     • More…
             – http://wiki.apache.org/hadoop/PoweredBy



Introduction to
Hadoop Distributed File System




HDFS Design

• Single Namespace for entire cluster
• Very Large Distributed File System
   – 10K nodes, 100 million files, 10 PB
• Data Coherency
   – Write-once-read-many access model
   – A file once created, written and closed need not be changed.
   – Appending writes to existing files (planned for the future)
• Files are broken up into blocks
   – Typically 128 MB block size
   – Each block replicated on multiple DataNodes
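
As a rough illustration of the block model above, the sketch below computes how many blocks a file occupies and how much raw storage its replicas consume. The 128 MB block size comes from the slide; the replication factor of 3 and the function names are assumptions made for this example and are not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size quoted above


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed for file_size bytes (the last block may be partial)."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    # Integer ceiling division avoids floating-point rounding on large files.
    return (file_size + block_size - 1) // block_size


def raw_bytes(file_size, replication=3):
    """Raw bytes stored across the cluster; replication=3 is an assumed factor."""
    return file_size * replication


one_gb = 1024 ** 3
print(block_count(one_gb))           # blocks needed for a 1 GB file
print(raw_bytes(one_gb) // one_gb)   # GB of raw storage at 3 replicas
```

A file one byte larger than a multiple of the block size needs one extra, mostly empty block; the final block only occupies as much space as the data it holds.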




HDFS Design

       • Moving computation is cheaper than moving data
                 – Data locations exposed so that computations can move to where
                   data resides
       • File replication
                 – Default is 3 copies.
                 – Configurable by clients
       • Assumes Commodity Hardware
                 – Files are replicated to handle hardware failure
                 – Detects failures and recovers from them
       • Streaming Data Access
                 – High throughput of data access rather than low latency of data
                   access
                 – Optimized for Batch Processing

NameNode

• Manages File System Namespace
   – Maps a file name to a set of blocks
   – Maps a block to the DataNodes where it resides
• Cluster Configuration Management
• Replication Engine for Blocks
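
The two mappings listed above can be pictured as a pair of in-memory dictionaries. This is a toy sketch only: the class name `Namespace`, its methods, and the DataNode names are hypothetical and do not reflect the NameNode's actual data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Namespace:
    """Toy namespace: file name -> block IDs, and block ID -> DataNodes."""
    file_blocks: Dict[str, List[int]] = field(default_factory=dict)
    block_locations: Dict[int, Set[str]] = field(default_factory=dict)

    def add_block(self, path: str, block_id: int, datanodes: Set[str]) -> None:
        # Blocks are appended in file order; each block records where it lives.
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_locations[block_id] = set(datanodes)

    def locate(self, path: str) -> List[Tuple[int, List[str]]]:
        """Return each block of the file with the DataNodes that hold it."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]


ns = Namespace()
ns.add_block("/foodir/myfile.txt", 1, {"dn1", "dn2", "dn3"})
ns.add_block("/foodir/myfile.txt", 2, {"dn2", "dn3", "dn4"})
print(ns.locate("/foodir/myfile.txt"))
```

A client that wants to read a file would ask for exactly this kind of answer: the ordered block list plus candidate DataNodes for each block.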




NameNode Metadata

• Metadata in Memory
   – The entire metadata is in main memory
   – No demand paging of metadata
• Types of Metadata
   – List of files
   – List of blocks for each file
   – List of DataNodes for each block
   – File attributes, e.g. creation time, replication factor




NameNode Metadata (cont.)


       • A Transaction Log (called EditLog)
                 – Records file creations, file deletions, etc.
       • FsImage
                 – The entire namespace, the mapping of blocks to files, and file system
                   properties are stored in a file called FsImage.
                 – NameNode can be configured to maintain multiple copies of
                   FsImage and EditLog.
       • Checkpoint
                 – Occurs at NameNode startup
                 – Reads FsImage and EditLog from disk and applies all transactions
                   from the EditLog to the in-memory representation of the
                   FsImage
                 – Then flushes the new version out to a new FsImage
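
The checkpoint steps above amount to replaying a log over a snapshot. The sketch below shows that idea with plain Python structures; the record format (`("create", path)` and `("delete", path)`) and the function name `checkpoint` are assumptions for illustration and are not the on-disk formats Hadoop uses.

```python
def checkpoint(fs_image, edit_log):
    """Apply every logged transaction to a copy of the image and return the new image."""
    namespace = dict(fs_image)  # in-memory representation loaded from FsImage
    for op, path in edit_log:
        if op == "create":
            namespace[path] = {"blocks": []}
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError("unknown operation: %s" % op)
    return namespace  # this is what would be flushed out as the new FsImage


fs_image = {"/a.txt": {"blocks": [1]}}
edit_log = [("create", "/b.txt"), ("delete", "/a.txt")]
new_image = checkpoint(fs_image, edit_log)
print(sorted(new_image))
```

Because the log is applied in order, the resulting image reflects every change recorded since the previous image was written.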


Secondary NameNode

• Copies FsImage and EditLog from the NameNode to a temporary
  directory
• Merges FsImage and EditLog into a new FsImage in the
  temporary directory
• Uploads the new FsImage to the NameNode
   – Transaction Log on NameNode is purged




[Figure: FsImage and EditLog are merged into FsImage (new)]




NameNode Failure

• A single point of failure (SPOF)
• Transaction Log stored in multiple directories
   – A directory on the local file system
   – A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution




DataNode

• A Block Server
   – Stores data in the local file system (e.g. ext3)
   – Stores metadata of a block (e.g. CRC)
   – Serves data and metadata to Clients
• Block Report
   – Periodically sends a report of all existing blocks to the
     NameNode
• Facilitates Pipelining of Data
   – Forwards data to other specified DataNodes




HDFS - Replication

• Default is 3x replication.
• The block size and replication factor are configured per file.
• Block placement algorithm is rack-aware




Block Placement

• Strategy (v0.19 doc, 2009.2.25, in progress)
   – One replica on the local node in the local rack
   – Second replica on a different node in the local rack
   – Third replica on a node in a remote rack
   – Additional replicas are randomly placed
• Clients read from the nearest replica
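
The placement strategy listed above can be sketched as a small selection function. This follows the slide's description only and is not Hadoop's placement code: the function name `place_replicas`, the `topology` layout, and the node names are hypothetical, and the sketch assumes the cluster has enough nodes in each rack to satisfy every step.

```python
import random


def place_replicas(writer, topology, replication=3, rng=random):
    """Pick DataNodes for one block following the strategy listed above.

    topology maps a rack name to the list of node names in that rack.
    """
    local_rack = next(r for r, nodes in topology.items() if writer in nodes)
    chosen = [writer]  # first replica: the local node
    if replication >= 2:
        local_others = [n for n in topology[local_rack] if n != writer]
        chosen.append(rng.choice(local_others))  # second: different node, same rack
    if replication >= 3:
        remote = [n for r, nodes in topology.items() if r != local_rack for n in nodes]
        chosen.append(rng.choice(remote))  # third: a node in a remote rack
    # Any further replicas go to randomly selected nodes not yet used.
    remaining = [n for nodes in topology.values() for n in nodes if n not in chosen]
    extra = max(0, replication - len(chosen))
    chosen.extend(rng.sample(remaining, extra))
    return chosen


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas("dn1", topology)
print(replicas)
```

Keeping two replicas in the writer's rack limits cross-rack traffic during the write, while the remote replica protects against the loss of an entire rack.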




Heartbeats

• DataNodes send heartbeats to the NameNode
   – Once every 3 seconds
• NameNode uses heartbeats to detect DataNode failure
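
Heartbeat-based failure detection can be sketched as a table of last-seen timestamps checked against a timeout. The 3-second interval comes from the slide; the timeout of ten missed intervals, the class name `HeartbeatMonitor`, and the node names are assumptions chosen for this example, not Hadoop's actual settings.

```python
HEARTBEAT_INTERVAL = 3.0               # seconds, as stated above
DEAD_AFTER = 10 * HEARTBEAT_INTERVAL   # assumed timeout, for this sketch only


class HeartbeatMonitor:
    """Tracks the last heartbeat time per DataNode and reports the silent ones."""

    def __init__(self, dead_after=DEAD_AFTER):
        self.dead_after = dead_after
        self.last_seen = {}

    def heartbeat(self, datanode, now):
        # Record the time of the most recent heartbeat from this DataNode.
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        # A node is considered failed once it has been silent longer than the timeout.
        return sorted(dn for dn, t in self.last_seen.items() if now - t > self.dead_after)


monitor = HeartbeatMonitor()
monitor.heartbeat("dn1", now=0.0)
monitor.heartbeat("dn2", now=0.0)
monitor.heartbeat("dn1", now=27.0)   # dn1 keeps reporting, dn2 goes silent
print(monitor.dead_nodes(now=31.0))
```

Timestamps are passed in explicitly so the example runs without waiting on a real clock.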




Data Correctness

• Uses checksums to validate data
   – Uses CRC32
• File creation
   – Client computes a checksum per 512 bytes
   – DataNode stores the checksums
• File access
   – Client retrieves the data and checksums from the DataNode
   – If validation fails, the Client tries other replicas
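
The checksum scheme above can be illustrated with Python's standard `zlib.crc32`. This is a minimal sketch of the idea, one CRC32 per 512-byte chunk with fallback to another replica on mismatch; the function names and the in-memory "replicas" are hypothetical and do not mirror the HDFS client code.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # chunk size quoted above


def chunk_checksums(data, chunk=BYTES_PER_CHECKSUM):
    """CRC32 of each consecutive chunk of data (the last chunk may be shorter)."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def verify(data, expected):
    """True if the data still matches the checksums recorded at write time."""
    return chunk_checksums(data) == expected


def read_with_fallback(replicas, expected):
    """Return the first replica that passes validation, trying the others on failure."""
    for data in replicas:
        if verify(data, expected):
            return data
    raise IOError("all replicas failed checksum validation")


original = bytes(range(256)) * 5     # 1280 bytes, so three chunks
sums = chunk_checksums(original)     # computed by the client at file creation
corrupted = b"\xff" + original[1:]   # simulate a flipped first byte on one replica
print(len(sums), verify(corrupted, sums))
print(read_with_fallback([corrupted, original], sums) == original)
```

Checksumming small chunks rather than whole blocks lets a reader validate data as it streams, without first fetching the entire block.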




User Interface

• API
   – Java API
   – C language wrapper (libhdfs) for the Java API is also available

• POSIX-like commands
   – hadoop dfs -mkdir /foodir
   – hadoop dfs -cat /foodir/myfile.txt
   – hadoop dfs -rm /foodir/myfile.txt

• HDFS Admin
   – bin/hadoop dfsadmin -safemode
   – bin/hadoop dfsadmin -report
   – bin/hadoop dfsadmin -refreshNodes

• Web Interface
  – Ex: http://172.16.203.136:50070


Web Interface
(http://172.16.203.136:50070)




Web Interface
     (http://172.16.203.136:50070)
     Browse the file system




POSIX-like commands




Java API
• Latest API
   – http://hadoop.apache.org/core/docs/current/api/




Reference

      • Hadoop documentation and installation
                 – Installation
                     • http://hadoop.apache.org/core/docs/r0.20.0/quickstart.html
                     • http://hadoop.apache.org/core/docs/r0.20.0/cluster_setup.html
                 – Documentation
                     • http://hadoop.apache.org/
      • Hadoop Wiki
                 – http://wiki.apache.org/hadoop/
      • Google File System Paper
                 – http://labs.google.com/papers/gfs.html




 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

Introduction to Hadoop and HDFS

  • 1. Hadoop and Hadoop Distributed File System (HDFS)
      Public, 2009/5/13
  • 2. Outline
      • Introduction to Hadoop project
      • Introduction to Hadoop Distributed File System
          – Architecture
          – Administration
          – Client Interface
      • Reference
      Copyright 2009 - Trend Micro Inc.
  • 3. What is Hadoop?
      • Hadoop is a cloud computing platform for processing and storing vast amounts of data.
      • Apache top-level project
      • Open source
      • Hadoop Core includes
          – Hadoop Distributed File System (HDFS)
          – MapReduce framework
      • Hadoop subprojects
          – HBase, ZooKeeper, …
      • Written in Java
      • Client interfaces come in C++, Java, and shell scripting
      • Runs on
          – Linux, Mac OS X, Windows, and Solaris
          – Commodity hardware
      [Diagram, top to bottom: Cloud Applications; MapReduce and HBase; Hadoop Distributed File System (HDFS); A Cluster of Machines]
  • 4. A Brief History of Hadoop
      • 2003.2
          – First MapReduce library written at Google
      • 2003.10
          – Google File System paper published
      • 2004.12
          – Google MapReduce paper published
      • 2005.7
          – Doug Cutting reports that Nutch now uses the new MapReduce implementation
      • 2006.2
          – Hadoop code moves out of Nutch into a new Lucene sub-project
      • 2006.11
          – Google Bigtable paper published
  • 5. A Brief History of Hadoop (continued)
      • 2007.2
          – First HBase code drop from Mike Cafarella
      • 2007.4
          – Yahoo! running Hadoop on a 1000-node cluster
      • 2008.1
          – Hadoop made an Apache Top Level Project
  • 6. Who uses Hadoop?
      • Yahoo!
          – More than 100,000 CPUs in ~20,000 computers running Hadoop
      • Google
          – University Initiative to Address Internet-Scale Computing Challenges
      • Amazon
          – Amazon builds product search indices using the streaming API and pre-existing C++, Perl, and Python tools.
          – Processes millions of sessions daily for analytics
      • IBM
          – Blue Cloud Computing Clusters
      • Trend Micro
          – Threat solution research
      • More…
          – http://wiki.apache.org/hadoop/PoweredBy
  • 7. Introduction to Hadoop Distributed File System
  • 8. HDFS Design
      • Single namespace for the entire cluster
      • Very large distributed file system
          – 10K nodes, 100 million files, 10 PB
      • Data coherency
          – Write-once-read-many access model
          – A file, once created, written, and closed, need not be changed.
          – Appending writes to existing files (planned for the future)
      • Files are broken up into blocks
          – Typically 128 MB block size
          – Each block is replicated on multiple DataNodes
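To make the block arithmetic above concrete, here is a minimal Python sketch. It is not HDFS code: the function names are hypothetical, the 128 MB block size is the typical value from the slide, and the replication factor of 3 is the default stated on the next slide.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size named on the slide
REPLICATION = 3                 # default replication factor; configurable per file


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes occupies."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


def raw_storage(file_size, replication=REPLICATION):
    """Total bytes stored across the cluster when every block is replicated."""
    return file_size * replication


one_gib = 1024 ** 3
blocks = split_into_blocks(one_gib + 1)
print(len(blocks), blocks[-1], raw_storage(one_gib + 1))
```

A file one byte larger than 1 GiB needs a ninth block that holds only that single byte, which illustrates that the final block of a file is usually shorter than the configured block size.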
  • 9. HDFS Design
      • Moving computation is cheaper than moving data
          – Data locations are exposed so that computations can move to where the data resides
      • File replication
          – Default is 3 copies.
          – Configurable by clients
      • Assumes commodity hardware
          – Files are replicated to handle hardware failure
          – Detects failures and recovers from them
      • Streaming data access
          – High throughput of data access rather than low latency of data access
          – Optimized for batch processing
  • 12. NameNode
      • Manages the file system namespace
          – Maps a file name to a set of blocks
          – Maps a block to the DataNodes where it resides
      • Cluster configuration management
      • Replication engine for blocks
  • 13. NameNode Metadata
      • Metadata in memory
          – The entire metadata is in main memory
          – No demand paging of metadata
      • Types of metadata
          – List of files
          – List of blocks for each file
          – List of DataNodes for each block
          – File attributes, e.g. creation time, replication factor
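The two mappings and the metadata types listed above can be pictured as a pair of in-memory dictionaries. The following Python sketch is only an illustration of that data structure; the class and method names (`FileEntry`, `NamespaceSketch`, `locate`) and the block and node identifiers are hypothetical and do not correspond to the NameNode's actual Java classes.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    """Per-file attributes named on the slide, plus the file's ordered block list."""
    creation_time: float
    replication: int
    block_ids: list = field(default_factory=list)


class NamespaceSketch:
    """Toy in-memory namespace: file -> blocks and block -> DataNodes."""

    def __init__(self):
        self.files = {}            # file path -> FileEntry
        self.block_locations = {}  # block id -> set of DataNode names

    def create_file(self, path, creation_time, replication=3):
        self.files[path] = FileEntry(creation_time, replication)

    def add_block(self, path, block_id, datanodes):
        self.files[path].block_ids.append(block_id)
        self.block_locations[block_id] = set(datanodes)

    def locate(self, path):
        """Resolve a file name to its blocks and the DataNodes holding each one."""
        return [(b, sorted(self.block_locations[b]))
                for b in self.files[path].block_ids]


ns = NamespaceSketch()
ns.create_file("/foodir/myfile.txt", creation_time=0.0)
ns.add_block("/foodir/myfile.txt", "blk_1", ["dn1", "dn2", "dn3"])
ns.add_block("/foodir/myfile.txt", "blk_2", ["dn2", "dn3", "dn4"])
print(ns.locate("/foodir/myfile.txt"))
```

Because both dictionaries live entirely in memory, a lookup never touches disk, which matches the "no demand paging" point on the slide.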
  • 14. NameNode Metadata (cont.)
      • A transaction log (called EditLog)
          – Records file creations, file deletions, etc.
      • FsImage
          – The entire namespace, the mapping of blocks to files, and file system properties are stored in a file called FsImage.
          – The NameNode can be configured to maintain multiple copies of FsImage and EditLog.
      • Checkpoint
          – Occurs when the NameNode starts up
          – Reads FsImage and EditLog from disk and applies all transactions from the EditLog to the in-memory representation of the FsImage
          – Then flushes the new version out to a new FsImage
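The checkpoint procedure described above amounts to replaying a log on top of a snapshot. Below is a minimal Python sketch under simplifying assumptions: the FsImage is modeled as a dictionary from path to block list, the EditLog as a list of `(operation, path)` tuples, and only the two operations the slide names (create and delete) are handled. None of these representations reflect the real on-disk formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged transaction to the in-memory namespace."""
    op, path = edit
    if op == "create":
        namespace[path] = []          # a new file starts with no blocks
    elif op == "delete":
        namespace.pop(path, None)     # deleting a missing path is a no-op here
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, editlog):
    """Merge an FsImage and an EditLog into a new FsImage and an empty log."""
    # Load the old image into memory without mutating the caller's copy.
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in editlog:              # replay transactions in logged order
        apply_edit(namespace, edit)
    return namespace, []              # new image; the log can now be purged


old_image = {"/a": ["blk_1"]}
log = [("create", "/b"), ("delete", "/a")]
new_image, new_log = checkpoint(old_image, log)
print(new_image, new_log)
```

Replaying in logged order matters: swapping a create and a delete of the same path would produce a different namespace.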
  • 15. Secondary NameNode
      • Copies FsImage and EditLog from the NameNode to a temporary directory
      • Merges FsImage and EditLog into a new FsImage in the temporary directory
      • Uploads the new FsImage to the NameNode
          – The transaction log on the NameNode is purged
      [Diagram labels: FsImage, EditLog, FsImage (new)]
  • 16. NameNode Failure
      • A single point of failure (SPOF)
      • Transaction log stored in multiple directories
          – A directory on the local file system
          – A directory on a remote file system (NFS/CIFS)
      • Need to develop a real HA solution
  • 17. DataNode
      • A block server
          – Stores data in the local file system (e.g. ext3)
          – Stores metadata of a block (e.g. CRC)
          – Serves data and metadata to clients
      • Block report
          – Periodically sends a report of all existing blocks to the NameNode
      • Facilitates pipelining of data
          – Forwards data to other specified DataNodes
  • 18. HDFS - Replication
      • Default is 3x replication.
      • The block size and replication factor are configured per file.
      • The block placement algorithm is rack-aware.
  • 19. Block Placement
      • Strategy (v0.19 doc, 2009.2.25, in progress)
          – One replica on the local node in the local rack
          – Second replica on a different node in the local rack
          – Third replica in a remote rack
          – Additional replicas are randomly placed
      • Clients read from the nearest replica
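The placement strategy listed above can be sketched as follows. This is an illustrative Python toy that follows the slide's four rules literally; it is not the NameNode's placement code. The `topology` dictionary (rack name to node list), the function name, and the node names are all hypothetical.

```python
import random


def place_replicas(writer, topology, replication=3, rng=None):
    """Choose DataNodes for one block, following the slide's placement rules."""
    rng = rng or random.Random()
    local_rack = next((rack for rack, nodes in topology.items() if writer in nodes), None)
    if local_rack is None:
        raise ValueError(f"{writer} is not in the topology")

    chosen = [writer]  # rule 1: first replica on the local node

    def pick(candidates):
        """Add one random unused candidate, if any remain and more replicas are needed."""
        pool = [n for n in candidates if n not in chosen]
        if pool and len(chosen) < replication:
            chosen.append(rng.choice(pool))

    pick(topology[local_rack])  # rule 2: a different node in the local rack
    remote = [n for rack, nodes in topology.items() if rack != local_rack for n in nodes]
    pick(remote)                # rule 3: a node in a remote rack

    everyone = [n for nodes in topology.values() for n in nodes]
    while len(chosen) < replication:  # rule 4: remaining replicas placed at random
        before = len(chosen)
        pick(everyone)
        if len(chosen) == before:
            break  # the cluster has fewer nodes than requested replicas
    return chosen


topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5"]}
print(place_replicas("n1", topology, replication=3, rng=random.Random(0)))
```

Keeping two replicas in the writer's rack limits cross-rack write traffic, while the remote replica protects against the loss of a whole rack.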
  • 20. Heartbeats
      • DataNodes send heartbeats to the NameNode
          – Once every 3 seconds
      • The NameNode uses heartbeats to detect DataNode failure
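Heartbeat-based failure detection can be sketched as a table of last-seen timestamps. In the Python sketch below, the 3-second interval comes from the slide, but the threshold of 10 missed heartbeats is a hypothetical value chosen only for illustration; the slide does not say how long the NameNode waits before declaring a DataNode dead. The class name is likewise an assumption.

```python
HEARTBEAT_INTERVAL = 3.0  # seconds, from the slide
MISSED_BEFORE_DEAD = 10   # hypothetical threshold, NOT taken from the slide


class HeartbeatMonitor:
    """Track the last heartbeat per DataNode and report nodes that went silent."""

    def __init__(self, interval=HEARTBEAT_INTERVAL, missed=MISSED_BEFORE_DEAD):
        self.timeout = interval * missed
        self.last_seen = {}  # DataNode name -> time of the most recent heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


monitor = HeartbeatMonitor()
monitor.heartbeat("dn1", now=0.0)
monitor.heartbeat("dn2", now=0.0)
monitor.heartbeat("dn1", now=27.0)  # dn1 keeps reporting, dn2 goes silent
print(monitor.dead_nodes(now=31.0))
```

Timestamps are passed in explicitly rather than read from the system clock so that the sketch stays deterministic.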
  • 21. Data Correctness
      • Uses checksums to validate data
          – Uses CRC32
      • File creation
          – Client computes a checksum per 512 bytes
          – DataNode stores the checksum
      • File access
          – Client retrieves the data and checksum from the DataNode
          – If validation fails, the client tries other replicas
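The checksum scheme above can be illustrated with Python's standard `zlib.crc32`. This is a sketch of the idea only: the function names are hypothetical, replicas are modeled as in-memory byte strings rather than DataNode reads, and it is an assumption that this CRC-32 variant matches the one HDFS writes to disk.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # chunk size from the slide


def compute_checksums(data, chunk=BYTES_PER_CHECKSUM):
    """Return one CRC32 value per 512-byte chunk of data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def verify(data, checksums, chunk=BYTES_PER_CHECKSUM):
    """True if every chunk of data matches its stored checksum."""
    return compute_checksums(data, chunk) == checksums


def read_with_fallback(replicas, checksums):
    """Return the first replica that passes validation, trying others on failure."""
    for data in replicas:
        if verify(data, checksums):
            return data
    raise IOError("all replicas failed checksum validation")


good = bytes(range(256)) * 5          # 1280 bytes, so three chunks
stored = compute_checksums(good)      # computed by the client at file creation
corrupt = b"\xff" + good[1:]          # first byte flipped on one replica
print(len(stored), verify(corrupt, stored),
      read_with_fallback([corrupt, good], stored) == good)
```

Checksumming small chunks rather than whole blocks means a reader can detect corruption early in a read instead of only after transferring an entire 128 MB block.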
  • 22. User Interface
      • API
          – Java API
          – A C language wrapper (libhdfs) for the Java API is also available
      • POSIX-like commands
          – hadoop dfs -mkdir /foodir
          – hadoop dfs -cat /foodir/myfile.txt
          – hadoop dfs -rm /foodir/myfile.txt
      • HDFS Admin
          – bin/hadoop dfsadmin -safemode
          – bin/hadoop dfsadmin -report
          – bin/hadoop dfsadmin -refreshNodes
      • Web Interface
          – Ex: http://172.16.203.136:50070
  • 23. Web Interface (http://172.16.203.136:50070)
  • 24. Web Interface (http://172.16.203.136:50070)
      • Browse the file system
  • 25. POSIX-like commands
  • 26. Java API
      • Latest API
          – http://hadoop.apache.org/core/docs/current/api/
  • 27. Reference
      • Hadoop documentation and installation
          – Installation
              • http://hadoop.apache.org/core/docs/r0.20.0/quickstart.html
              • http://hadoop.apache.org/core/docs/r0.20.0/cluster_setup.html
          – Documentation
              • http://hadoop.apache.org/
      • Hadoop Wiki
          – http://wiki.apache.org/hadoop/
      • Google File System paper
          – http://labs.google.com/papers/gfs.html