SlideShare ist ein Scribd-Unternehmen logo
1 von 26
Students: An Du – Tan Tran – Toan Do – Vinh Nguyen
      Instructor: Professor Lothar Piepmayer




  HDFS at a glance
Agenda

1. Design of HDFS
2.1. HDFS Concepts – Blocks
2.1. HDFS Concepts - Namenode and datanode
3.1 Dataflow - Anatomy of a read file
3.2 Dataflow - Anatomy of a write file
3.3 Dataflow - Coherency model
4. Parallel copying
5. Demo - Command line
The Design of HDFS

Very large distributed file system
  Up to 10K nodes, 1 billion files, 100PB
Streaming data access
  Write once, read many times
Commodity hardware
  Files are replicated to handle hardware failure
        Detect failures and recover from them
Worst fit with

Low-latency data access
Lots of small files
Multiple writers, arbitrary file modifications
HDFS Blocks

Normal Filesystem blocks are few kilobytes
HDFS has Large block size
    Default 64MB
    Typical 128MB
Unlike a file system for a single disk. A file in HDFS that is
 smaller than a single block does not occupy a full block
HDFS Blocks


A file is stored in blocks on various nodes in hadoop cluster.
HDFS creates several replication of the data blocks
Each and every data block is replicated to multiple nodes
 across the cluster.
HDFS Blocks




Dhruba Borthakur - Design and Evolution of the Apache Hadoop File System HDFS.pdf
Why blocks in HDFS so large?

Minimize the cost of seeks
=> Make transfer time = disk transfer rate
Benefit of Block abstraction

A file can be larger than any single disk in the network
Simplify the storage subsystem
Providing fault tolerance and availability
Namenode & Datanodes
Namenode & Datanodes

 Namenode (master)
 – manages the filesystem namespace
 – maintains the filesystem tree and metadata for all the
 files and directories in the tree.
 Datanodes (slaves)
 – store data in the local file system
 – Periodically report back to the namenode with lists of all
 existing blocks
 Clients communicate with both namenode and datanodes.
Anatomy of a File Read
Anatomy of a File Read


Benefits:
- Avoid “bottle neck”
- Multi-Clients
Writing in HDFS


Namenode
Datanode
Block
Writing in HDFS


Exeptions: Node failed
  Pipeline close, remove block and addr of failed
   node
  Namenode arrange new datanode
Coherency Model


Not visible when copying
use sync()
Apply in applications
Parallel copying in HDFS

Transfer data between clusters
   % hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
Implemented as MapReduce, each file per map
Each map take at least 256MB
Default max maps is 20 per node
The diffirent versions only supported by webhdfs protocol:
   % hadoop distcp webhdfs://namenode1:50070/foo
      webhdfs://namenode2:50070/bar
Setup

Cluster with 03 nodes:
    04 GB RAM
    02 CPU @ 2.0Ghz+
    100G HDD
Using vmWare on 03 different servers
Network: 100Mbps
Operating System: Ubuntu 11.04
    Windows: Not tested
Setup Guide - Single Node


java runtime ssh
  http://hadoop.apache.org/common/docs/r1.0.3/si
   ngle_node_setup.html
/etc/hadoop/core-site.xml
/etc/hadoop/hdfs-site.xml
Cluster


/etc/hadoop/masters
/etc/hadoop/slaves
http://hadoop.apache.org/common/docs/r1.0.3
/cluster_setup.html
Command Line

Similar to *nix
    hadoop fs -ls /
    hadoop fs -mkdir /test
    hadoop fs -rmr /test
    hadoop fs -cp /1 /2
    hadoop fs -copyFromLocal /3 hdfs://localhost/
Namedone-specific:
    hadoop namenode -format
    start-all.sh
Command Line

Sorting: Standard method to test cluster
    TeraGen: Generate dummy data
    TeraSort: Sort
    TeraValidate: Validate sort result
Command Line:
    hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar
     terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted41
Benchmark Result

2 Nodes, 1GB data: 0:03:38
3 Nodes, 1GB data: 0:03:13

2 Nodes, 10GB data: 0:38:07
3 Nodes, 10GB data: 0:31:28

Virtual Machine's harddisks are the bottle-neck
Who
wins…?
References

Hadoop The Definitive Guide

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemVaibhav Jain
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File Systemelliando dias
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHanborq Inc.
 
The basic concept of Linux FIleSystem
The basic concept of Linux FIleSystemThe basic concept of Linux FIleSystem
The basic concept of Linux FIleSystemHungWei Chiu
 
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - RedisStorage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - RedisSameer Tiwari
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemAnand Kulkarni
 
Snapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemSnapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemBhavesh Padharia
 
12 linux archiving tools
12 linux archiving tools12 linux archiving tools
12 linux archiving toolsShay Cohen
 
HDFS User Reference
HDFS User ReferenceHDFS User Reference
HDFS User ReferenceBiju Nair
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduceUday Vakalapudi
 
HDFS Trunncate: Evolving Beyond Write-Once Semantics
HDFS Trunncate: Evolving Beyond Write-Once SemanticsHDFS Trunncate: Evolving Beyond Write-Once Semantics
HDFS Trunncate: Evolving Beyond Write-Once SemanticsDataWorks Summit
 

Was ist angesagt? (20)

Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
HDFS_Command_Reference
HDFS_Command_ReferenceHDFS_Command_Reference
HDFS_Command_Reference
 
Hadoop and HDFS
Hadoop and HDFSHadoop and HDFS
Hadoop and HDFS
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
 
Anatomy of file write in hadoop
Anatomy of file write in hadoopAnatomy of file write in hadoop
Anatomy of file write in hadoop
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
 
Anatomy of file read in hadoop
Anatomy of file read in hadoopAnatomy of file read in hadoop
Anatomy of file read in hadoop
 
Hadoop File System Shell Commands,
Hadoop File System Shell Commands,Hadoop File System Shell Commands,
Hadoop File System Shell Commands,
 
The basic concept of Linux FIleSystem
The basic concept of Linux FIleSystemThe basic concept of Linux FIleSystem
The basic concept of Linux FIleSystem
 
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - RedisStorage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Snapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemSnapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File System
 
HDFS Design Principles
HDFS Design PrinciplesHDFS Design Principles
HDFS Design Principles
 
12 linux archiving tools
12 linux archiving tools12 linux archiving tools
12 linux archiving tools
 
HDFS User Reference
HDFS User ReferenceHDFS User Reference
HDFS User Reference
 
6 technical-dns-workshop-day3
6 technical-dns-workshop-day36 technical-dns-workshop-day3
6 technical-dns-workshop-day3
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
HDFS Trunncate: Evolving Beyond Write-Once Semantics
HDFS Trunncate: Evolving Beyond Write-Once SemanticsHDFS Trunncate: Evolving Beyond Write-Once Semantics
HDFS Trunncate: Evolving Beyond Write-Once Semantics
 

Ähnlich wie Hadoop at a glance

Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeAdam Kawa
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Simplilearn
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answersKalyan Hadoop
 
Introduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptxIntroduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptxsunithachphd
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Designsudhakara st
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreducesenthil0809
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFSEdureka!
 
Introduction to Hadoop Distributed File System(HDFS).pptx
Introduction to Hadoop Distributed File System(HDFS).pptxIntroduction to Hadoop Distributed File System(HDFS).pptx
Introduction to Hadoop Distributed File System(HDFS).pptxSakthiVinoth78
 
HDFS+basics.pptx
HDFS+basics.pptxHDFS+basics.pptx
HDFS+basics.pptxAyush .
 
Hadoop training institute in bangalore
Hadoop training institute in bangaloreHadoop training institute in bangalore
Hadoop training institute in bangaloreKelly Technologies
 
Hadoop training institute in hyderabad
Hadoop training institute in hyderabadHadoop training institute in hyderabad
Hadoop training institute in hyderabadKelly Technologies
 
Hadoop Distributed File System for Big Data Analytics
Hadoop Distributed File System for Big Data AnalyticsHadoop Distributed File System for Big Data Analytics
Hadoop Distributed File System for Big Data AnalyticsDrPDShebaKeziaMalarc
 
Data Analytics presentation.pptx
Data Analytics presentation.pptxData Analytics presentation.pptx
Data Analytics presentation.pptxSwarnaSLcse
 

Ähnlich wie Hadoop at a glance (20)

Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
 
Introduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptxIntroduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptx
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
module 2.pptx
module 2.pptxmodule 2.pptx
module 2.pptx
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreduce
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
Introduction to Hadoop Distributed File System(HDFS).pptx
Introduction to Hadoop Distributed File System(HDFS).pptxIntroduction to Hadoop Distributed File System(HDFS).pptx
Introduction to Hadoop Distributed File System(HDFS).pptx
 
HDFS+basics.pptx
HDFS+basics.pptxHDFS+basics.pptx
HDFS+basics.pptx
 
Hadoop training institute in bangalore
Hadoop training institute in bangaloreHadoop training institute in bangalore
Hadoop training institute in bangalore
 
Hadoop training institute in hyderabad
Hadoop training institute in hyderabadHadoop training institute in hyderabad
Hadoop training institute in hyderabad
 
Hadoop Distributed File System for Big Data Analytics
Hadoop Distributed File System for Big Data AnalyticsHadoop Distributed File System for Big Data Analytics
Hadoop Distributed File System for Big Data Analytics
 
Data Analytics presentation.pptx
Data Analytics presentation.pptxData Analytics presentation.pptx
Data Analytics presentation.pptx
 
Hdfs
HdfsHdfs
Hdfs
 

Mehr von Tan Tran

Mật thư trò chơi lớn (tóm tắt)
Mật thư trò chơi lớn (tóm tắt)Mật thư trò chơi lớn (tóm tắt)
Mật thư trò chơi lớn (tóm tắt)Tan Tran
 
Managing for results
Managing for resultsManaging for results
Managing for resultsTan Tran
 
Software estimation techniques
Software estimation techniquesSoftware estimation techniques
Software estimation techniquesTan Tran
 
Personal task management
Personal task managementPersonal task management
Personal task managementTan Tran
 
Jira in action
Jira in actionJira in action
Jira in actionTan Tran
 
Beautifying Data in the real world
Beautifying Data in the real worldBeautifying Data in the real world
Beautifying Data in the real worldTan Tran
 
BIS Vietnamese-German University
BIS Vietnamese-German UniversityBIS Vietnamese-German University
BIS Vietnamese-German UniversityTan Tran
 
Phac thao compendium
Phac thao compendiumPhac thao compendium
Phac thao compendiumTan Tran
 
Management skills in IT - Communication
Management skills in IT - CommunicationManagement skills in IT - Communication
Management skills in IT - CommunicationTan Tran
 
Internet governance and the filtering problems
Internet governance and the filtering problemsInternet governance and the filtering problems
Internet governance and the filtering problemsTan Tran
 
C# conventions & good practices
C# conventions & good practicesC# conventions & good practices
C# conventions & good practicesTan Tran
 
Tổng hợp Dâng Ngài - nhạc sĩ Thy Yên
Tổng hợp Dâng Ngài - nhạc sĩ Thy YênTổng hợp Dâng Ngài - nhạc sĩ Thy Yên
Tổng hợp Dâng Ngài - nhạc sĩ Thy YênTan Tran
 
Flash coding convention for action script 3
Flash coding convention for action script 3Flash coding convention for action script 3
Flash coding convention for action script 3Tan Tran
 
Java convention
Java conventionJava convention
Java conventionTan Tran
 
VGU - BIS2010: Integrated Information Management
VGU - BIS2010: Integrated Information ManagementVGU - BIS2010: Integrated Information Management
VGU - BIS2010: Integrated Information ManagementTan Tran
 
Scrum introduction
Scrum introductionScrum introduction
Scrum introductionTan Tran
 

Mehr von Tan Tran (16)

Mật thư trò chơi lớn (tóm tắt)
Mật thư trò chơi lớn (tóm tắt)Mật thư trò chơi lớn (tóm tắt)
Mật thư trò chơi lớn (tóm tắt)
 
Managing for results
Managing for resultsManaging for results
Managing for results
 
Software estimation techniques
Software estimation techniquesSoftware estimation techniques
Software estimation techniques
 
Personal task management
Personal task managementPersonal task management
Personal task management
 
Jira in action
Jira in actionJira in action
Jira in action
 
Beautifying Data in the real world
Beautifying Data in the real worldBeautifying Data in the real world
Beautifying Data in the real world
 
BIS Vietnamese-German University
BIS Vietnamese-German UniversityBIS Vietnamese-German University
BIS Vietnamese-German University
 
Phac thao compendium
Phac thao compendiumPhac thao compendium
Phac thao compendium
 
Management skills in IT - Communication
Management skills in IT - CommunicationManagement skills in IT - Communication
Management skills in IT - Communication
 
Internet governance and the filtering problems
Internet governance and the filtering problemsInternet governance and the filtering problems
Internet governance and the filtering problems
 
C# conventions & good practices
C# conventions & good practicesC# conventions & good practices
C# conventions & good practices
 
Tổng hợp Dâng Ngài - nhạc sĩ Thy Yên
Tổng hợp Dâng Ngài - nhạc sĩ Thy YênTổng hợp Dâng Ngài - nhạc sĩ Thy Yên
Tổng hợp Dâng Ngài - nhạc sĩ Thy Yên
 
Flash coding convention for action script 3
Flash coding convention for action script 3Flash coding convention for action script 3
Flash coding convention for action script 3
 
Java convention
Java conventionJava convention
Java convention
 
VGU - BIS2010: Integrated Information Management
VGU - BIS2010: Integrated Information ManagementVGU - BIS2010: Integrated Information Management
VGU - BIS2010: Integrated Information Management
 
Scrum introduction
Scrum introductionScrum introduction
Scrum introduction
 

Kürzlich hochgeladen

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Kürzlich hochgeladen (20)

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Hadoop at a glance

  • 1. Students: An Du – Tan Tran – Toan Do – Vinh Nguyen Instructor: Professor Lothar Piepmayer HDFS at a glance
  • 2. Agenda 1. Design of HDFS 2.1. HDFS Concepts – Blocks 2.1. HDFS Concepts - Namenode and datanode 3.1 Dataflow - Anatomy of a read file 3.2 Dataflow - Anatomy of a write file 3.3 Dataflow - Coherency model 4. Parallel copying 5. Demo - Command line
  • 3. The Design of HDFS Very large distributed file system Up to 10K nodes, 1 billion files, 100PB Streaming data access Write once, read many times Commodity hardware Files are replicated to handle hardware failure Detect failures and recover from them
  • 4. Worst fit with Low-latency data access Lots of small files Multiple writers, arbitrary file modifications
  • 5. HDFS Blocks Normal Filesystem blocks are few kilobytes HDFS has Large block size  Default 64MB  Typical 128MB Unlike a file system for a single disk. A file in HDFS that is smaller than a single block does not occupy a full block
  • 6. HDFS Blocks A file is stored in blocks on various nodes in hadoop cluster. HDFS creates several replication of the data blocks Each and every data block is replicated to multiple nodes across the cluster.
  • 7. HDFS Blocks Dhruba Borthakur - Design and Evolution of the Apache Hadoop File System HDFS.pdf
  • 8. Why blocks in HDFS so large? Minimize the cost of seeks => Make transfer time = disk transfer rate
  • 9. Benefit of Block abstraction A file can be larger than any single disk in the network Simplify the storage subsystem Providing fault tolerance and availability
  • 11. Namenode & Datanodes  Namenode (master) – manages the filesystem namespace – maintains the filesystem tree and metadata for all the files and directories in the tree.  Datanodes (slaves) – store data in the local file system – Periodically report back to the namenode with lists of all existing blocks  Clients communicate with both namenode and datanodes.
  • 12. Anatomy of a File Read
  • 13. Anatomy of a File Read Benefits: - Avoid “bottle neck” - Multi-Clients
  • 15.
  • 16. Writing in HDFS Exeptions: Node failed Pipeline close, remove block and addr of failed node Namenode arrange new datanode
  • 17. Coherency Model Not visible when copying use sync() Apply in applications
  • 18. Parallel copying in HDFS Transfer data between clusters % hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar Implemented as MapReduce, each file per map Each map take at least 256MB Default max maps is 20 per node The diffirent versions only supported by webhdfs protocol: % hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
  • 19. Setup Cluster with 03 nodes:  04 GB RAM  02 CPU @ 2.0Ghz+  100G HDD Using vmWare on 03 different servers Network: 100Mbps Operating System: Ubuntu 11.04  Windows: Not tested
  • 20. Setup Guide - Single Node java runtime ssh http://hadoop.apache.org/common/docs/r1.0.3/si ngle_node_setup.html /etc/hadoop/core-site.xml /etc/hadoop/hdfs-site.xml
  • 22. Command Line Similar to *nix  hadoop fs -ls /  hadoop fs -mkdir /test  hadoop fs -rmr /test  hadoop fs -cp /1 /2  hadoop fs -copyFromLocal /3 hdfs://localhost/ Namedone-specific:  hadoop namenode -format  start-all.sh
  • 23. Command Line Sorting: Standard method to test cluster  TeraGen: Generate dummy data  TeraSort: Sort  TeraValidate: Validate sort result Command Line:  hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted41
  • 24. Benchmark Result 2 Nodes, 1GB data: 0:03:38 3 Nodes, 1GB data: 0:03:13 2 Nodes, 10GB data: 0:38:07 3 Nodes, 10GB data: 0:31:28 Virtual Machine's harddisks are the bottle-neck