FUN WITH HADOOP FILE SYSTEMS
© Bradley Childs / bdc@redhat.com
HISTORY
•  Distributed file systems (DFS) have been around for a long time
•  Each DFS battles to optimize its trade-offs under the CAP theorem
•  Hadoop's DFS implementation is called HDFS
•  With the wide adoption of Hadoop, users were forced to use HDFS as the only option
•  HDFS has technical trade-offs and limitations
HDFS ARCHITECTURE
[Diagram: clients, one Name Node, and three Data Nodes; the Data Nodes are labeled "Store & Compute"]
HDFS ISSUES
Handy
•  Locking around metadata operations is made possible by the single Name Node
•  File locking is made possible by the single Name Node
Frustrating
•  Difficult to get data in and out (ingest)
•  The Name Node is a single point of failure
•  The Name Node is a system bottleneck
GLUSTER FILE SYSTEM
Gluster is an open source, multi-purpose DFS
Features:
•  Data striping
•  Global elastic hashing for file placement
•  Basic and geo-replication
•  Fully POSIX-compliant interface
•  Flexible architecture
•  Supports storage-resident apps: compute and data on the same machine
More info: www.gluster.org
GLUSTER ARCHITECTURE
[Diagram: clients, trusted peers each holding a data brick, and volumes; the bricks are labeled "Store & Compute"]
HCFS
HCFS: Hadoop Compatible File System
•  Implementing the o.a.h.fs.FileSystem interface is not enough for existing Hadoop jobs to run on a different file system
•  The HDFS architecture created implicit semantics and assumptions
•  HCFS defines these semantics so any file system can replace HDFS without fear of compatibility problems
•  Open, ongoing effort to define file system semantics decoupled from architecture
JIRA: issues.apache.org/jira/browse/HADOOP-9371
COMMON FILESYSTEM ATTRIBUTES
•  Hierarchical structure of directories containing directories and files
•  Files contain between 0 and MAX_SIZE bytes of data
•  Directories contain 0 or more files or directories
•  Directories have no data, only child elements
NETWORK ASSUMPTIONS
•  The final state of a file system after a network failure is undefined
•  The immediate consistency state of a file system after a network failure is undefined
•  If a network failure can be reported to the client, the failure MUST be an instance of IOException
NETWORK FAILURE
•  Any operation on a file system MAY signal an error by throwing an instance of IOException
•  File system operations MUST NOT throw RuntimeException on the failure of remote operations, authentication, or other operational problems
•  Stream read operations MAY fail if the read channel has been idle for a file-system-specific period of time
•  Stream write operations MAY fail if the write channel has been idle for a file-system-specific period of time
•  Network failures MAY be raised in the stream close() operation
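The IOException rule above can be sketched as a small wrapper. This is only an illustration, assuming a client library whose calls may throw unchecked exceptions; the class RemoteCalls and its call() helper are hypothetical names, not part of Hadoop.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

final class RemoteCalls {
    /**
     * Runs a remote operation and reports any operational failure as an
     * IOException, never as an unchecked RuntimeException.
     */
    static <T> T call(Callable<T> remoteOp) throws IOException {
        try {
            return remoteOp.call();
        } catch (IOException e) {
            throw e; // already the type the caller expects
        } catch (Exception e) {
            // Covers RuntimeException and any other exception raised by
            // the (hypothetical) underlying client library.
            throw new IOException("remote operation failed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        try {
            call(() -> { throw new IllegalStateException("connection reset"); });
        } catch (IOException e) {
            System.out.println("caught IOException: " + e.getMessage());
        }
    }
}
```

Callers then need a single catch (IOException e) block, which is what existing Hadoop jobs are written to expect.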
ATOMICITY
•  Rename of a file MUST be atomic
•  Rename of a directory SHOULD be atomic
•  Delete of a file MUST be atomic
•  Delete of an empty directory MUST be atomic
•  Recursive directory deletion MAY be atomic. Although HDFS offers atomic recursive directory deletion, none of the other file systems that Hadoop supports offers such a guarantee, including the local file systems
•  mkdir() SHOULD be atomic
•  mkdirs() MAY be atomic. [It is currently atomic on HDFS, but this is not the case for most other file systems, and it cannot be guaranteed for future versions of HDFS]
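One place where atomic file rename matters is the write-then-rename pattern, in which output becomes visible under its final name in a single step. The sketch below is a stand-in under stated assumptions: it uses java.nio.file on the local file system rather than o.a.h.fs.FileSystem, and AtomicPublish and publish() are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class AtomicPublish {
    /** Writes data to a temporary sibling file, then renames it into place. */
    static void publish(Path target, String data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, data.getBytes(StandardCharsets.UTF_8));
        // With ATOMIC_MOVE the move throws AtomicMoveNotSupportedException
        // instead of silently falling back to a non-atomic copy.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("hcfs-demo");
        Path target = dir.resolve("part-00000");
        publish(target, "hello\n");
        System.out.println("published: " + Files.exists(target));
    }
}
```

A reader that lists the directory sees either no final file or the complete one, never a half-written file under the final name.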
CONCURRENCY
•  The data added to a file during a write or append MAY be visible while the write operation is in progress
•  If a client opens a file for a read() operation while another read() operation is in progress, the second operation MUST succeed. Both clients MUST have a consistent view of the same data
•  If a file is deleted while a read() operation is in progress, the read() operation MAY complete successfully. Implementations MAY cause read() operations to fail with an IOException instead
•  Multiple writers MAY open a file for writing. If this occurs, the outcome is undefined
•  Undefined: action of delete() while a write or append operation is in progress
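The two-reader rule above can be checked with a short sketch. It assumes the local file system via java.nio.file as a stand-in for a Hadoop file system; TwoReaders, readRest() and sameView() are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class TwoReaders {
    /** Reads everything remaining in the stream. */
    static byte[] readRest(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    /** Opens the same file twice, interleaves the reads, and compares the bytes seen. */
    static boolean sameView(Path file) throws IOException {
        try (InputStream first = Files.newInputStream(file);
             InputStream second = Files.newInputStream(file)) { // second open must succeed
            int firstByte = first.read();          // first read is now in progress
            byte[] viaSecond = readRest(second);   // second reader reads the whole file
            byte[] restOfFirst = readRest(first);  // first reader finishes afterwards

            ByteArrayOutputStream viaFirst = new ByteArrayOutputStream();
            if (firstByte != -1) {
                viaFirst.write(firstByte);
            }
            viaFirst.write(restOfFirst, 0, restOfFirst.length);
            return Arrays.equals(viaFirst.toByteArray(), viaSecond);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("hcfs-readers", ".txt");
        Files.write(file, new byte[] {1, 2, 3, 4});
        System.out.println("consistent view: " + sameView(file));
    }
}
```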
CONSISTENCY
The consistency model of a Hadoop file system is one-copy-update semantics; generally that of a traditional POSIX file system.
•  Create: once the close() operation on an output stream writing a newly created file has completed, in-cluster operations querying the file metadata and contents MUST immediately see the file and its data
•  Update: once the close() operation on an output stream writing a newly created file has completed, in-cluster operations querying the file metadata and contents MUST immediately see the new data
•  Delete: once a delete() operation on a file has completed, listStatus(), open(), rename() and append() operations MUST fail
•  When a file is deleted and then overwritten, listStatus(), open(), rename() and append() operations MUST succeed: the file is visible
•  Rename: after a rename has completed, operations against the new path MUST succeed; operations against the old path MUST fail
•  The consistency semantics for out-of-cluster clients MUST be the same as for in-cluster clients: all clients calling read() on a closed file MUST see the same metadata and data until it is changed by a create(), append(), or rename() operation
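The create, rename, and delete rules above can be exercised with a small check. The sketch is a stand-in under stated assumptions: it runs against the local file system through java.nio.file, not through o.a.h.fs.FileSystem, and ConsistencyCheck, openFails() and check() are hypothetical names.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

final class ConsistencyCheck {
    /** Returns true if opening the path for reading fails because it does not exist. */
    static boolean openFails(Path p) throws IOException {
        try {
            Files.newInputStream(p).close();
            return false;
        } catch (NoSuchFileException e) {
            return true;
        }
    }

    /** Runs create, rename, and delete checks under dir; returns true if all hold. */
    static boolean check(Path dir) throws IOException {
        Path a = dir.resolve("a.txt");
        Path b = dir.resolve("b.txt");

        // Create: after close(), the file and its data are visible.
        try (OutputStream out = Files.newOutputStream(a)) {
            out.write("data".getBytes(StandardCharsets.UTF_8));
        }
        boolean created = Files.exists(a) && Files.size(a) == 4;

        // Rename: the new path succeeds, the old path fails.
        Files.move(a, b);
        boolean renamed = !openFails(b) && openFails(a);

        // Delete: operations against the deleted path fail.
        Files.delete(b);
        boolean deleted = openFails(b);

        return created && renamed && deleted;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("hcfs-consistency");
        System.out.println("all checks hold: " + check(dir));
    }
}
```

The same three checks could be pointed at any file system under test by swapping the java.nio.file calls for that file system's client API.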
REFERENCES
Apache HCFS Wiki: wiki.apache.org/hadoop/HCFS
Apache file system semantics JIRA: issues.apache.org/jira/browse/HADOOP-9371
Some of this text is taken from the working draft linked in the JIRA above; credit to Steve Loughran et al.
The opinions expressed do not necessarily represent those of Red Hat, Inc. or any of its affiliates.

AHUG Presentation: Fun with Hadoop File Systems
