The Google File System
Based on
The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
2003
I am Sergio Shevchenko
You can find me on Telegram at @sshevchenko
Hello!
2
Agenda
◉ Why GFS?
◉ System Design
◉ System Interactions
◉ Master operation
◉ Fault tolerance and diagnosis
◉ Measurements
◉ Experiences and Conclusions
3
Why GFS?
What prompted Google to build GFS
1
4
Historical context
◉ August 1996: Larry Page and Sergey Brin launch Google
◉ ...
◉ December 1998: Index of about 60 million pages
◉ ...
◉ February 2003: Index 3+ billion pages
◉ ...
5
Historical context
6
Historical context
7
System design
2
8
System design: Assumptions
◉ COTS x86 servers
◉ Few million files, typically 100MB or larger
◉ Large streaming reads, small random reads
◉ Sequential writes, append data
◉ Multiple concurrent clients that append to the same file
◉ High sustained bandwidth more important than low latency
9
System design: Interface
◉ Not POSIX-compliant
◉ Usual file operations:
open, close, create, delete, write, read
◉ Enhanced file operations:
snapshot, record append
10
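The paper does not describe the client library itself, so the following is only a rough sketch of what an interface covering the operations on the slide above could look like. Every name and signature here (GFSClient, FileHandle, the method parameters) is a hypothetical illustration, not the actual Google client API.

```python
# Hypothetical sketch of a GFS-style client interface; all names and
# signatures are illustrative assumptions, not the real client library.
from dataclasses import dataclass


@dataclass
class FileHandle:
    path: str


class GFSClient:
    # Usual file operations
    def create(self, path: str) -> FileHandle: ...
    def delete(self, path: str) -> None: ...
    def open(self, path: str) -> FileHandle: ...
    def close(self, handle: FileHandle) -> None: ...
    def read(self, handle: FileHandle, offset: int, length: int) -> bytes: ...
    def write(self, handle: FileHandle, offset: int, data: bytes) -> None: ...

    # Enhanced operations
    def snapshot(self, src_path: str, dst_path: str) -> None:
        """Instantaneous copy-on-write copy of a file or directory tree."""

    def record_append(self, handle: FileHandle, data: bytes) -> int:
        """Append atomically at least once; GFS picks and returns the offset."""
```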
System design: Architecture
11
System design: Architecture
◉ Single master
○ Simplifies the design (its state is replicated for recovery)
○ Global knowledge permits sophisticated chunk placement and
replication decisions
○ No data-plane bottleneck: GFS clients talk to
■ the master only for metadata (cached with TTL expiration)
■ chunkservers directly for reads and writes
◉ Chunk size
○ 64 MB, each chunk stored as a plain Linux file
○ Extended only as needed (lazy allocation)
12
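One consequence of the fixed 64 MB chunk size is that a client can turn a byte offset into a chunk index with a single division before contacting the master. A minimal sketch (the function name and example path are made up):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

def to_chunk_request(file_name: str, byte_offset: int) -> tuple[str, int]:
    """Translate (file, byte offset) into the (file, chunk index) pair the client
    sends to the master; the master replies with the chunk handle and the
    locations of its replicas."""
    chunk_index = byte_offset // CHUNK_SIZE
    return file_name, chunk_index

# Example: byte 200,000,000 of a file falls into chunk index 2.
print(to_chunk_request("/logs/crawl-00017", 200_000_000))  # ('/logs/crawl-00017', 2)
```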
System design: Metadata
◉ 3 types of metadata
○ File and chunk namespaces
○ Mappings from files to chunks
○ Locations of each chunk's replicas
◉ All metadata is kept in the master's RAM, which makes periodic scans cheap for
○ Re-replication
○ Chunk migration
○ GC
◉ Operation log (OpLog)
○ Defines a serial timeline for concurrent operations
○ Chunk locations are not recorded persistently (the master polls chunkservers for them)
○ Saved on disk and replicated
13
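A rough back-of-the-envelope estimate of why keeping all metadata in the master's RAM is feasible, using the paper's figure of under 64 bytes of metadata per chunk and per file name. The cluster size and file count below are assumed example values, not numbers from the paper.

```python
# Back-of-the-envelope sketch of the master's metadata footprint.
# 64 bytes per chunk / per file is the paper's upper bound; the data volume
# and file count are assumed examples.
CHUNK_SIZE = 64 * 1024 * 1024
BYTES_PER_CHUNK_META = 64
BYTES_PER_FILE_META = 64

data_bytes = 500 * 1024**4           # assume 500 TB of stored data
file_count = 1_000_000               # assume one million files

chunks = data_bytes // CHUNK_SIZE
meta_bytes = chunks * BYTES_PER_CHUNK_META + file_count * BYTES_PER_FILE_META
print(f"{chunks:,} chunks -> ~{meta_bytes / 1024**2:.0f} MB of metadata")
# 8,192,000 chunks -> ~561 MB of metadata: easily held in RAM.
```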
Consistency model
◉ File namespace mutations are atomic
○ A file region is consistent if all clients
will always see the same data,
regardless of which replicas they
read from.
○ A region is defined after a file data
mutation if it is consistent and clients
will see what the mutation writes in
its entirety.
◉ GFS doesn't guarantee against duplicates
14
15
In theoretical computer science, the CAP theorem, also named Brewer's theorem after computer
scientist Eric Brewer, states that it is impossible for a distributed data store to simultaneously provide
more than two of the following three guarantees: consistency, availability, and partition tolerance.
System interactions
2
16
Leases and mutation order
◉ Each mutation (write or append) is performed at all the chunk’s replicas
◉ Leases used to maintain a consistent mutation order across replicas
○ Master grants a chunk lease for 60 seconds to one of the replicas: the
primary replica
○ The primary replica picks a serial order for all mutations to the chunk
○ All replicas follow this order when applying mutations
17
Leases and mutation order
18
1. The client asks the master for the primary and secondary replicas of a chunk
2. The master replies with the replicas' identities (the client caches this info with a
timeout)
3. The client pushes the data to all replicas in a pipelined fashion; each chunkserver
stores the data in an internal LRU buffer cache until it is used or aged out
4. Once all replicas have acknowledged receiving the data, the client sends a
write request to the primary replica
5. The primary forwards the write request to all secondary replicas, which apply
mutations in the same serial-number order assigned by the primary
6. The secondary replicas acknowledge to the primary that they have completed
the operation
7. The primary replica replies to the client; any errors encountered at any of
the replicas are reported to the client
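The seven steps above can be condensed into a toy, in-memory simulation. The classes and method names are hypothetical stand-ins, not the real GFS RPC interface, and the "chunk" is shrunk to a few bytes.

```python
# Toy simulation of the write flow: data push first, then a serialized mutation
# ordered by the primary. All classes here are illustrative stand-ins.

class Chunkserver:
    def __init__(self, name):
        self.name = name
        self.lru_buffer = {}          # step 3: data staged before the write
        self.chunk = bytearray(64)    # tiny stand-in for a 64 MB chunk

    def push_data(self, data_id, data):
        self.lru_buffer[data_id] = data

    def apply(self, data_id, offset):
        data = self.lru_buffer.pop(data_id)
        self.chunk[offset:offset + len(data)] = data
        return True                   # acknowledgement

def write(primary, secondaries, data_id, data, offset):
    # Steps 3-7: push data to every replica, let the primary apply the mutation
    # first (it picks the order), then the secondaries apply it in that order.
    for server in [primary] + secondaries:
        server.push_data(data_id, data)
    ok = primary.apply(data_id, offset)
    acks = [s.apply(data_id, offset) for s in secondaries]
    return ok and all(acks)           # step 7: success or errors back to client

primary = Chunkserver("cs1")
secondaries = [Chunkserver("cs2"), Chunkserver("cs3")]
print(write(primary, secondaries, "d1", b"hello", offset=8))  # True
```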
System interactions:
Dataflow
◉ Data flow is decoupled from control flow to use the network efficiently
◉ Data is pushed linearly along a carefully picked chain of chunkservers, pipelined
over TCP
◉ Each machine forwards the data to the closest machine in the network topology
that has not yet received it
19
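The paper gives the ideal elapsed time for pushing B bytes through R replicas as B/T + R·L, where T is the per-machine network throughput and L is the per-hop latency. A quick worked example with the paper's 100 Mbps links and roughly 1 ms hops:

```python
# Ideal pipelined-push time from the paper: B/T + R*L.
def pipeline_time(bytes_pushed, replicas, throughput_bps, hop_latency_s):
    return bytes_pushed / throughput_bps + replicas * hop_latency_s

# 1 MB to 3 replicas over 100 Mbps links with ~1 ms hops: roughly 80 ms.
t = pipeline_time(1_000_000, 3, 100e6 / 8, 0.001)
print(f"{t * 1000:.0f} ms")  # ~83 ms
```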
Atomic record appends
◉ GFS decides the offset
◉ Maximum size of an append record is 16 MB
◉ Client push the data to all replicas of the last chunk of the file (as for write)
◉ Client sends its request to the primary replica
○ Primary replica check if the data size would cause the chunk to exceed the
chunk maximum size (64MB)
◉ Atomic records are heavily used in Google for handling multiple producers and
single consumer o MapRed operation
20
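A hedged sketch of the primary's decision logic for record append under the rules above (offset chosen by GFS, 64 MB chunk cap, records capped at 16 MB). The function and its return values are illustrative only.

```python
# Illustrative sketch of the primary's record-append decision.
CHUNK_MAX = 64 * 1024 * 1024
RECORD_MAX = CHUNK_MAX // 4          # 16 MB cap keeps padding waste bounded

def record_append(chunk_used, record_len):
    """Return ('appended', offset) or ('retry_next_chunk', padded_to)."""
    if record_len > RECORD_MAX:
        raise ValueError("records larger than 16 MB must be split by the client")
    if chunk_used + record_len > CHUNK_MAX:
        # Pad the current chunk (replicas do the same) and ask the client to
        # retry the append on a freshly allocated chunk.
        return ("retry_next_chunk", CHUNK_MAX)
    return ("appended", chunk_used)   # GFS, not the client, picks this offset

print(record_append(60 * 1024 * 1024, 8 * 1024 * 1024))  # ('retry_next_chunk', 67108864)
print(record_append(10 * 1024 * 1024, 8 * 1024 * 1024))  # ('appended', 10485760)
```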
Snapshot
◉ Makes an instantaneous copy of a file or a directory tree using standard
copy-on-write techniques
○ The master receives a snapshot request
○ The master revokes any outstanding leases on the chunks of the files to be
snapshotted
○ The master waits for the leases to be revoked or to expire
○ The master logs the snapshot operation to disk
○ The master duplicates the metadata for the source file or directory tree: the newly
created snapshot files point to the same chunks as the source files
○ When a client first writes to a chunk after the snapshot, the chunk is cloned
locally on each chunkserver holding a replica
21
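A toy, metadata-level sketch of snapshot plus copy-on-write. The dictionaries stand in for master state, and the chunk-handle names are made up.

```python
# Snapshot duplicates metadata only; the first write to a shared chunk
# triggers a local clone. Data structures here are illustrative stand-ins.
files = {"/data/a": ["chunk-1", "chunk-2"]}
refcount = {"chunk-1": 1, "chunk-2": 1}

def snapshot(src, dst):
    files[dst] = list(files[src])        # snapshot shares the source's chunks
    for handle in files[src]:
        refcount[handle] += 1

def write(path, index):
    handle = files[path][index]
    if refcount[handle] > 1:             # first write after a snapshot
        refcount[handle] -= 1
        new_handle = handle + "-clone"   # master picks a new chunk handle;
        refcount[new_handle] = 1         # chunkservers clone the data locally
        files[path][index] = new_handle
        handle = new_handle
    return handle

snapshot("/data/a", "/snap/a")
print(write("/data/a", 0))   # chunk-1-clone  (cloned on first write)
print(write("/snap/a", 1))   # chunk-2-clone
```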
Master operation
3
22
Replica placement
◉ Serves two purposes:
○ maximize data reliability and availability
○ maximize network bandwidth utilization
◉ Spread chunk replicas across different racks
○ Ensures chunks survive a rack failure (e.g. a switch failure)
○ Exploits the aggregate bandwidth of multiple racks
○ Write traffic has to flow through multiple racks
■ a tradeoff made willingly
23
Creation, Re-replication and Balancing
◉ Creation (heuristics sketched below)
○ Place new replicas on chunkservers with below-average disk space
utilization
○ Limit the number of "recent" creations on each chunkserver
○ Spread replicas of a chunk across racks
◉ Re-replication
○ The master re-replicates a chunk as soon as the number of available replicas
falls below a user-specified goal
○ Chunks to re-replicate are prioritized (e.g. by how far they are below their
replication goal and whether they block client progress)
○ Re-replication placement follows the same criteria as creation
○ The master limits the number of active clone operations both cluster-wide
and per chunkserver
24
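A hedged sketch of the creation-placement heuristics listed above (below-average disk utilization, a cap on recent creations, rack spreading). The scoring and data layout are illustrative assumptions, not Google's actual policy code.

```python
# Illustrative replica-placement heuristic for chunk creation.
from dataclasses import dataclass

@dataclass
class Chunkserver:
    name: str
    rack: str
    disk_utilization: float   # fraction of disk in use
    recent_creations: int     # creations in the recent window

def place_replicas(servers, n=3, max_recent=5):
    avg_util = sum(s.disk_utilization for s in servers) / len(servers)
    # Prefer below-average utilization and skip servers with many recent creations.
    candidates = sorted(
        (s for s in servers if s.recent_creations < max_recent),
        key=lambda s: (s.disk_utilization >= avg_util, s.disk_utilization),
    )
    chosen, racks = [], set()
    for s in candidates:                 # spread replicas across racks
        if s.rack not in racks:
            chosen.append(s)
            racks.add(s.rack)
        if len(chosen) == n:
            break
    return [s.name for s in chosen]

servers = [
    Chunkserver("cs1", "rack-a", 0.40, 1),
    Chunkserver("cs2", "rack-a", 0.30, 0),
    Chunkserver("cs3", "rack-b", 0.55, 2),
    Chunkserver("cs4", "rack-c", 0.35, 7),   # too many recent creations
    Chunkserver("cs5", "rack-c", 0.60, 1),
]
print(place_replicas(servers))  # ['cs2', 'cs3', 'cs5']
```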
Creation, Re-replication and Balancing
◉ Balancing
○ The master rebalances replicas periodically for better disk space and load
balancing
○ The master gradually fills up a new chunkserver rather than instantly swamping it
with new chunks (and the heavy write traffic that comes with them!)
○ Rebalancing placement follows the same criteria as creation
25
Garbage collection and Stale replica detection
◉ Files are not deleted instantly but lazily
○ The master logs deletions
○ The file is renamed to a hidden name that includes the deletion timestamp
○ After 3 days, hidden files are removed during the regular namespace scan
○ The same procedure applies to orphaned chunks
○ Chunkservers report the chunks they hold in heartbeat messages; chunks not
present in the master's metadata can be freely removed by the chunkserver
◉ Stale replicas
○ The master maintains a version number for each chunk, bumped whenever a new
lease is granted; replicas left behind are detected and garbage collected (see the sketch below)
26
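A minimal sketch of the version-number check mentioned above, assuming (as the paper describes) that the master increases a chunk's version whenever it grants a new lease; the data layout is illustrative.

```python
# Stale-replica detection via chunk version numbers (illustrative sketch).
master_version = {"chunk-7": 3}   # bumped each time a new lease is granted

def classify_replica(chunk, reported_version):
    current = master_version[chunk]
    if reported_version < current:
        return "stale -> garbage collect"
    if reported_version > current:
        # Master must have failed after granting a lease: adopt the higher version.
        return "master behind -> adopt higher version"
    return "up to date"

print(classify_replica("chunk-7", 2))  # stale -> garbage collect
print(classify_replica("chunk-7", 3))  # up to date
```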
Fault tolerance and diagnosis
HA and Data Integrity in GFS
4
27
High availability in GFS
◉ Chunk replication
◉ Fast recovery
◉ Master replication
28
Data integrity
READS
Before returning a block, the chunkserver verifies its checksum; if verification fails, it
returns an error to the requestor and reports the mismatch to the master
WRITES
The chunkserver verifies the checksums of the first and last data blocks that overlap the
write range before performing the write, then computes and records the new checksums
APPENDS
No checksum verification of the last block; the checksum for the last partial
checksum block is just updated incrementally
Each chunkserver uses checksumming to detect corruption of stored chunks: for each
64 KB block it keeps a 32-bit checksum
29
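A sketch of the read-path verification described above: each 64 KB block carries a 32-bit checksum, and a read verifies every block overlapping the requested range. CRC32 is used here purely for illustration; the paper does not name the checksum algorithm.

```python
# Read-path checksum verification over 64 KB blocks (illustrative sketch).
import zlib

BLOCK = 64 * 1024

def build_checksums(chunk: bytes):
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verified_read(chunk: bytes, checksums, offset: int, length: int) -> bytes:
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk[b * BLOCK:(b + 1) * BLOCK]) != checksums[b]:
            raise IOError(f"corrupt block {b}: report to master, read another replica")
    return chunk[offset:offset + length]

chunk = bytes(256 * 1024)               # small stand-in for chunk data
sums = build_checksums(chunk)
print(len(verified_read(chunk, sums, offset=100_000, length=50_000)))  # 50000
```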
Measurements
Micro and Macro benchmarks
5
30
Micro benchmarks
1 master, 2 master replicas, 16
chunkservers with 16 clients.
All the machines are configured
with dual 1.4 GHz PIII processors,
2 GB of memory, two 80 GB
5400 rpm disks, and a 100 Mbps
full-duplex Ethernet connection
to an HP 2524 switch.
31
Micro benchmarks - READS
32
Each client reads a randomly selected 4 MB region 256 times (= 1 GB of data per client) from a 320 GB file set.
The chunkservers taken together have only 32 GB of memory, so at most a 10% hit rate in the
Linux buffer cache is expected.
75-80% efficiency
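Where the efficiency figures come from: observed rates divided by the relevant network limit. The paper's setup has 100 Mbps per machine and a 1 Gbps link between the client-side and server-side switches; the 10 MB/s single-client and 94 MB/s aggregate rates below are the paper's reported measurements.

```python
# Deriving the read-benchmark efficiency from the network limits.
per_client_limit = 100e6 / 8 / 1e6      # 100 Mbps NIC -> 12.5 MB/s
switch_link_limit = 1e9 / 8 / 1e6       # 1 Gbps inter-switch link -> 125 MB/s

one_client_rate = 10.0                  # MB/s observed for a single reader
aggregate_rate = 94.0                   # MB/s observed for 16 readers

print(f"{one_client_rate / per_client_limit:.0%}")    # 80%
print(f"{aggregate_rate / switch_link_limit:.0%}")    # 75%
```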
Micro benchmarks - WRITES
33
Each client writes 1 GB of data to a new file in a series of 1 MB writes
50%
efficiency
Micro benchmarks - APPENDS
34
Each client appends simultaneously to a single file
Performance is limited by the network bandwidth of the 3 chunkservers that store the last chunk of the
file, independent of the number of clients
35-40%
efficiency
Real World clusters
35
Cluster A is used regularly for R&D by 100+ engineers
Cluster B is used for production data processing
Fault tolerance
◉ Kill 1 chunkserver in cluster B
○ 15,000 chunks on it (= 600 GB of data)
○ The cluster was limited to 91 concurrent chunk clonings (= 40% of 227 chunkservers)
○ Each clone operation consumes at most 50 Mbps
■ 15,000 chunks restored in 23.2 minutes, an effective replication rate of
440 MB/s
◉ Kill 2 chunkservers in cluster B
○ 16,000 chunks on each (= 660 GB of data)
○ This double failure reduced 266 chunks to having a single replica
■ These 266 chunks were cloned at higher priority and were all restored to at
least 2x replication within 2 minutes
36
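A quick arithmetic check of the quoted replication rate:

```python
# 600 GB of chunk data restored in 23.2 minutes.
data_mb = 600 * 1024
minutes = 23.2
print(f"{data_mb / (minutes * 60):.0f} MB/s")  # ~441 MB/s, matching the ~440 MB/s figure
```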
Experiences and Conclusions
6
37
Experiences
◉ Operational issues
○ The original design targeted a backend file system for production systems, not
R&D tasks (hence the initial lack of support for permissions and quotas…)
◉ Technical issues
○ Mismatches between disks and the Linux driver over supported IDE protocol
versions corrupted data silently: this problem motivated the use of checksums
○ fsync() in Linux 2.2 kernels was costly for the large operation log files, as its cost
was proportional to the file size rather than to the size of the modified portion
38
Conclusions
◉ GFS is:
○ Supports large-scale data processing workloads on COTS x86 servers
○ Treats component failures as the norm rather than the exception
○ Is optimized for huge files that are mostly appended to and then read sequentially
○ Achieves fault tolerance by constant monitoring, replication of crucial data, and fast,
automatic recovery (+ checksums to detect data corruption)
○ Delivers high aggregate throughput to many concurrent readers and writers
39
Conclusions
◉ GFS now:
○ Source code is closed
○ Replaced by Colossus in the 2010s
■ Smaller chunks (1 MB)
■ Eliminated the master as a single point of failure
■ Built for “real-time” big batch operations
40
Any questions ?
You can find me at
◉ @sshevchenko
◉ s.shevchenko@studenti.unisa.it
Thanks!
41