1. The Google File System
Published by:
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung (Google)
Presented by:
Manoj Samaraweera (138231B)
Azeem Mumtaz (138218R)
University of Moratuwa
2. Contents
• Distributed File Systems
• Introducing Google File System
• Design Overview
• System Interaction
• Master Operation
• Fault Tolerance and Diagnosis
• Measurements and Benchmarks
• Experience
• Related Work
• Conclusion
• References
3. Distributed File Systems
• Enables programs to store and access remote files exactly as they do local ones
• New modes of data organization on disk or across multiple servers
• Goals
  ◦ Performance
  ◦ Scalability
  ◦ Reliability
  ◦ Availability
4. Introducing Google File System
• Growing demand for Google's data processing
• Properties
  ◦ A scalable distributed file system
  ◦ For large, distributed, data-intensive applications
  ◦ Fault tolerant
  ◦ Runs on inexpensive commodity hardware
  ◦ Delivers high aggregate performance
• The design is driven by observations of workloads and the technological environment
5. Design Assumptions
• Component failures are the norm
  ◦ Commodity hardware
• Files are huge by traditional standards
  ◦ Multi-GB files
  ◦ Small files must also be supported, but need not be optimized for
• Read workloads
  ◦ Large streaming reads
  ◦ Small random reads
• Write workloads
  ◦ Large, sequential writes that append data to files
• Multiple clients concurrently append to the same file
  ◦ Requires well-defined consistency semantics
  ◦ Files are used as producer-consumer queues or for many-way merging
• High sustained bandwidth is more important than low latency
6. Design Interface
• Typical file system interface
• Hierarchical directory organization
• Files are identified by pathnames
• Operations
  ◦ Create, delete, open, close, read, and write
7. Architecture (1/2)
• Files are divided into fixed-size chunks (64 MB)
• Each chunk has a unique 64-bit chunk handle
  ◦ Immutable and globally unique
• Chunks are stored as Linux files
• Chunks are replicated over chunkservers; the copies are called replicas
  ◦ 3 replicas by default
  ◦ Different replication levels for different regions of the file namespace
• Single master
• Multiple chunkservers
  ◦ Grouped into racks
  ◦ Connected through switches
• Multiple clients
• Master/chunkserver coordination via HeartBeat messages
10. Chunk Size (1/2)
• 64 MB
• Stored as a plain Linux file on a chunkserver
• Advantages (see the sketch below)
  ◦ Reduces the clients' interaction with the single master
  ◦ Clients are likely to perform many operations on a given large chunk
    ▪ Reduces network overhead by keeping a persistent TCP connection to the chunkserver
  ◦ Reduces the size of the metadata
    ▪ Metadata can be kept in memory
  ◦ Lazy space allocation avoids internal fragmentation
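Because the chunk size is fixed, a client can translate a file offset into a chunk index locally and contact the master only for the chunk handle and replica locations. A minimal sketch of that translation (the helper name is mine, not part of the GFS client API):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

def chunk_index_and_offset(file_offset: int) -> tuple[int, int]:
    """Translate a byte offset within a file into (chunk index, offset
    inside that chunk); the client then asks the master only for the
    chunk handle and replica locations of that index."""
    return file_offset // CHUNK_SIZE, file_offset % CHUNK_SIZE

# Byte 200,000,000 of a file lies in chunk 2, about 62.7 MB into it.
print(chunk_index_and_offset(200_000_000))  # (2, 65782272)
```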
11. Chunk Size (2/2)
• Disadvantages
  ◦ Small files consist of only a few chunks, which may become hot spots
  ◦ Solutions
    ▪ Higher replication factor
    ▪ Stagger application start times
    ▪ Allow clients to read from other clients
12. Metadata (1/5)
• 3 major types (see the sketch below)
  ◦ The file and chunk namespaces
  ◦ File-to-chunk mappings
  ◦ The locations of each chunk's replicas
• Namespaces and mappings
  ◦ Persisted by logging mutations to an operation log stored on the master's local disk
  ◦ The operation log is replicated on remote machines
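A toy sketch of these three kinds of metadata, using plain Python dicts purely for illustration (the real master is a C++ server; note that only the first two kinds are persisted through the operation log, while replica locations are rebuilt by polling chunkservers):

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    version: int = 1                                    # chunk version number
    locations: list[str] = field(default_factory=list)  # chunkserver addresses (not persisted)

@dataclass
class MasterMetadata:
    # 1. File and chunk namespace: full pathname -> file attributes.
    namespace: dict[str, dict] = field(default_factory=dict)
    # 2. File -> ordered list of 64-bit chunk handles.
    file_chunks: dict[str, list[int]] = field(default_factory=dict)
    # 3. Chunk handle -> version and current replica locations.
    chunks: dict[int, ChunkInfo] = field(default_factory=dict)
```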
13. Metadata (2/5)
• All metadata is kept in the master's memory
  ◦ Improves the performance of the master
  ◦ Makes it easy to periodically scan the entire state
    ▪ Chunk garbage collection
    ▪ Re-replication in the presence of chunkserver failures
    ▪ Chunk migration to balance load and disk space
• Less than 64 bytes of metadata per 64 MB chunk (worked out below)
• File namespace data requires less than 64 bytes per file
  ◦ Prefix compression
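A back-of-the-envelope check of the memory claim, using the slide's 64-bytes-per-chunk figure (the petabyte workload is my example, not a number from the paper):

```python
chunk_size = 64 * 2**20    # 64 MB chunks
meta_per_chunk = 64        # < 64 bytes of chunk metadata per chunk
data = 2**50               # say, 1 PiB of file data

chunks = data // chunk_size           # 16,777,216 chunks
metadata = chunks * meta_per_chunk    # 2**30 bytes
print(chunks, metadata / 2**30)       # ~1 GiB of master memory per PiB
```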
14. Metadata (3/5)
• Chunk location information
  ◦ Polled from chunkservers at master startup
    ▪ Not persisted, since chunkservers join and leave the cluster
  ◦ Kept up to date through HeartBeat messages
15. Metadata (4/5)
• Operation log
  ◦ Historical record of critical metadata changes
  ◦ Logical timeline that defines the order of concurrent operations
  ◦ Changes are not visible to clients until the log records have been replicated and flushed to disk
  ◦ Flushing and replication are batched
    ▪ Reduces the impact on system throughput
16. Metadata (5/5)
• Operation log
  ◦ The master recovers its file system state by replaying the operation log (sketched below)
  ◦ Checkpoints
    ▪ Taken when the log grows beyond a threshold, keeping replay time bounded
    ▪ Built in a separate thread, to avoid delaying incoming mutations
  ◦ Stored in a compact B-tree-like structure
    ▪ Mapped directly into memory and used for namespace lookup
    ▪ No extra parsing
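A minimal sketch of checkpoint-plus-replay recovery. The JSON-lines log with per-record sequence numbers is an invented format for illustration; the real checkpoint is the compact B-tree-like image described above:

```python
import json, os

def apply_mutation(state: dict, record: dict) -> None:
    """Apply one logged namespace mutation to the in-memory state."""
    if record["op"] == "create":
        state[record["path"]] = {"chunks": []}
    elif record["op"] == "delete":
        state.pop(record["path"], None)

def recover_state(checkpoint_path: str, log_path: str) -> dict:
    """Load the latest checkpoint, then replay only the operation-log
    records written after it was taken."""
    state, applied = {}, 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            checkpoint = json.load(f)
        state, applied = checkpoint["state"], checkpoint["last_seq"]
    with open(log_path) as f:
        for line in f:                    # one mutation record per line
            record = json.loads(line)
            if record["seq"] > applied:   # skip already-checkpointed ops
                apply_mutation(state, record)
                applied = record["seq"]
    return state
```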
17. Consistency Model (1/3)
• Guarantees by GFS
  ◦ File namespace mutations (e.g., file creation) are atomic
    ▪ Namespace management and locking guarantee atomicity and correctness
    ▪ The master's operation log defines a global total order of these operations
  ◦ After a sequence of successful mutations, the mutated file region is guaranteed to be defined and to contain the data written by the last mutation. This is achieved by
    ▪ Applying mutations to all replicas in the same order
    ▪ Using chunk version numbers to detect stale replicas
18. Consistency Model (2/3)
• Relaxed consistency model
• Two types of mutations
  ◦ Writes
    ▪ Cause data to be written at an application-specified file offset
  ◦ Record appends
    ▪ Cause data to be appended atomically at least once
    ▪ The offset is chosen by GFS, not by the client
• States of a file region after a mutation
  ◦ Consistent
    ▪ All clients see the same data, regardless of which replicas they read from
  ◦ Inconsistent
    ▪ Different clients see different data at different times
  ◦ Defined
    ▪ Consistent, and all clients see what the mutation wrote in its entirety
  ◦ Undefined
    ▪ Consistent, but it may not reflect what any one mutation has written
19. Consistency Model (3/3)
• Implications for applications (a reader-side sketch follows)
  ◦ Rely on appends rather than overwrites
  ◦ Checkpointing
    ▪ To track how much data has been successfully written
  ◦ Writing self-validating records
    ▪ Checksums to detect and discard padding and corruption
  ◦ Writing self-identifying records
    ▪ Unique identifiers to detect and discard duplicates
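A sketch of a reader that follows these conventions. The record layout [length][unique id][payload][crc32] is invented for illustration; GFS leaves the record format to the application:

```python
import struct
import zlib

seen_ids: set[int] = set()   # unique IDs of records already consumed

def parse_record(buf: bytes, pos: int):
    """Parse one record at pos, returning (payload, next_pos).
    Returns (None, next_pos) for padding, corruption, or duplicates."""
    length, rec_id = struct.unpack_from(">IQ", buf, pos)
    if length == 0:                        # zeroed chunk padding: skip the rest
        return None, len(buf)
    payload = buf[pos + 12 : pos + 12 + length]
    (crc,) = struct.unpack_from(">I", buf, pos + 12 + length)
    end = pos + 16 + length
    if zlib.crc32(payload) != crc:         # self-validating: drop garbage
        return None, end
    if rec_id in seen_ids:                 # self-identifying: drop duplicate
        return None, end
    seen_ids.add(rec_id)
    return payload, end
```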
20. Lease & Mutation Order
• The master uses leases to maintain a consistent mutation order across replicas
• The primary is the chunkserver that has been granted the chunk lease
  ◦ The master delegates the authority to order mutations
  ◦ All other replicas are secondaries
• The primary defines a serial order for all mutations
  ◦ The secondary replicas follow this order
21. Writes (1/7)
• Step 1
  ◦ The client asks the master which chunkserver holds the current lease for the chunk
  ◦ And for the locations of the secondary replicas
22. Writes (2/7)
• Step 2
  ◦ The master replies with the identities of the primary and secondary replicas
  ◦ The client caches this data for future mutations, until
    ▪ The primary becomes unreachable, or
    ▪ The primary no longer holds the lease
23. Writes (3/7)
• Step 3
  ◦ The client pushes the data to all replicas
  ◦ Each chunkserver stores the data in an internal LRU buffer cache
24. Writes (4/7)
• Step 4
  ◦ The client sends a write request to the primary
  ◦ The primary assigns consecutive serial numbers to the mutations
    ▪ Serialization
  ◦ The primary applies the mutations to its own state
25. Writes (5/7)
• Step 5
  ◦ The primary forwards the write request to all secondary replicas
  ◦ Each secondary applies mutations in the same serial-number order
26. Writes (6/7)
• Step 6
  ◦ The secondary replicas inform the primary once they have completed the mutation
27. Writes (7/7)
• Step 7
  ◦ The primary replies to the client
  ◦ Steps 3 to 7 are retried in case of errors (the full path is sketched below)
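Putting the seven steps together, a pseudocode-level sketch of the write path. `master`, `primary`, and the secondaries stand in for RPC stubs, and every method name here is illustrative rather than the real GFS client API:

```python
class RetryableWriteError(Exception):
    """Signals that the client should retry steps 3-7."""

def gfs_write(master, file_name: str, chunk_index: int, data: bytes):
    # Steps 1-2: ask the master which chunkserver holds the lease and
    # where the secondaries are; clients cache this answer.
    primary, secondaries = master.find_lease_holder(file_name, chunk_index)

    # Step 3: push the data to all replicas in any order; each
    # chunkserver parks it in an internal LRU buffer cache.
    for replica in [primary, *secondaries]:
        replica.push_data(data)

    # Step 4: the primary assigns the mutation a serial number and
    # applies it to its own state.
    serial = primary.apply_write(data)

    # Steps 5-6: the primary forwards the request to every secondary,
    # which applies it in serial-number order and reports completion.
    acks = [s.apply_write_in_order(serial) for s in secondaries]

    # Step 7: the primary replies to the client; on any error the
    # client retries steps 3-7, which is how duplicates can arise.
    if not all(acks):
        raise RetryableWriteError("retry steps 3-7")
```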
28. Data Flow (1/2)
• Control flow and data flow are decoupled
• Data is pushed linearly along a chain of chunkservers in a pipelined fashion
  ◦ Fully utilizes each machine's outbound bandwidth
• Each machine forwards data to the "closest" machine; distances are accurately estimated from IP addresses
• Latency is minimized by pipelining the data transfer over TCP
29. Data Flow (2/2)
• Ideal elapsed time for transmitting B bytes to R replicas: B/T + RL
  ◦ T – network throughput
  ◦ L – latency between two machines
• At Google:
  ◦ T = 100 Mbps
  ◦ L ≤ 1 ms
  ◦ 1000 replicas
  ◦ 1 MB can be distributed in about 80 ms
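Plugging the slide's numbers into B/T + RL as a quick sanity check (the R = 3 default is my choice here; the RL term only matters for long replica chains):

```python
B = 8 * 10**6      # 1 MB expressed in bits (decimal units, to match Mbps)
T = 100 * 10**6    # network throughput: 100 Mbps
L = 1e-3           # latency between two machines: 1 ms
R = 3              # replication factor (GFS default; an assumption here)

elapsed = B / T + R * L            # ideal pipelined transfer time, seconds
print(f"{elapsed * 1000:.0f} ms")  # 83 ms -- roughly the quoted 80 ms
```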
30. Record Append
• In traditional writes
  ◦ The client specifies the offset at which data is to be written
  ◦ Concurrent writes to the same region are not serializable
• In record append (sketched below)
  ◦ The client specifies only the data
  ◦ Otherwise similar to writes
  ◦ GFS appends the data to the file at least once atomically
    ▪ The chunk is padded if appending the record would exceed the maximum chunk size
    ▪ If a record append fails at any replica, the client retries the operation, leaving possible duplicates
    ▪ Successful appends leave defined regions, possibly interspersed with inconsistent ones
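A sketch of the primary-side append decision under the stated rules: pad the chunk when the record does not fit, so a record never spans chunks, and let the client retry on a fresh chunk (the Chunk class is a stand-in for a replica's state):

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 2**20  # 64 MB

@dataclass
class Chunk:
    data: bytearray = field(default_factory=bytearray)

def record_append(chunk: Chunk, record: bytes):
    """Return the offset GFS chose, or None to make the client retry
    the append on the next chunk."""
    if len(chunk.data) + len(record) > CHUNK_SIZE:
        # Pad the remainder (done at all replicas) so no record ever
        # spans a chunk boundary, then tell the client to retry.
        chunk.data.extend(b"\0" * (CHUNK_SIZE - len(chunk.data)))
        return None
    offset = len(chunk.data)    # the offset is chosen by GFS, not the client
    chunk.data.extend(record)   # the same mutation is applied at every replica
    return offset
```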
31. Snapshot (1/2)
• Goals
  ◦ To quickly create branch copies of huge data sets
  ◦ To easily checkpoint the current state
• Copy-on-write technique
  ◦ Master receives a snapshot request
  ◦ Revokes outstanding leases on the chunks of the affected files
  ◦ Master logs the operation to disk
  ◦ Applies the log record to its in-memory state by duplicating the metadata for the source file or directory tree
  ◦ The new snapshot files initially point to the same chunks as the source files
32. Snapshot (2/2)
• After the snapshot operation (sketched below)
  ◦ A client asks the master for the current lease holder of a chunk C
  ◦ The master sees that the reference count for chunk C is > 1
  ◦ The master picks a new chunk handle C'
  ◦ The master asks each chunkserver holding a replica of C to create a new chunk C'
  ◦ The master grants one of the replicas a lease on the new chunk C' and replies to the client
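A sketch of that copy-on-write path; `master` and the chunkserver objects stand in for the real components, and the method names are illustrative:

```python
def lease_after_snapshot(master, chunk_handle: int):
    """Grant a lease on a chunk after a snapshot: if the chunk is shared
    by more than one file (reference count > 1), clone it first so the
    snapshot's copy of the data is never mutated."""
    if master.refcount[chunk_handle] > 1:
        new_handle = master.new_chunk_handle()            # pick C'
        for server in master.replica_locations[chunk_handle]:
            server.clone_chunk(chunk_handle, new_handle)  # local copy, no network
        master.refcount[chunk_handle] -= 1
        master.refcount[new_handle] = 1
        chunk_handle = new_handle
    return master.grant_lease(chunk_handle)               # then reply to the client
```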
33. Contents
• Distributed File Systems
• Introducing Google File System
• Design Overview
• System Interaction
• Master Operation
• Fault Tolerance and Diagnosis
• Measurements and Benchmarks
• Experience
• Related Work
• Conclusion
• References
35. Namespace Management and Locking
• Each master operation acquires a set of locks before it runs
  ◦ Read locks on the ancestor directories, and a read or write lock on the final pathname
• Example: creating /home/user/foo while /home/user is being snapshotted to /save/user
  ◦ The snapshot holds write locks on /home/user and /save/user; file creation needs a read lock on /home/user, so the two operations are properly serialized (see the sketch below)
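The lock set itself is easy to sketch: read locks down the directory path and a write lock on the leaf being created (the helper below is mine, for illustration):

```python
def locks_for_create(path: str) -> list[tuple[str, str]]:
    """Lock set for creating a file: read locks on every ancestor
    directory, plus a write lock on the new leaf pathname."""
    parts = path.lstrip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return [(p, "read") for p in ancestors] + [(path, "write")]

# Creating /home/user/foo needs: read /home, read /home/user,
# write /home/user/foo. A concurrent snapshot holds a WRITE lock on
# /home/user, so the two operations conflict there and are serialized.
print(locks_for_create("/home/user/foo"))
```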
36. Replica Placement
• The chunk replica placement policy serves two purposes:
  ◦ Maximize data reliability and availability
  ◦ Maximize network bandwidth utilization
37. Creation, Re-replication, Rebalancing
• Creation (heuristics sketched below)
  ◦ Place new replicas on chunkservers with below-average disk space utilization
  ◦ Limit the number of "recent" creations on each chunkserver
  ◦ Spread replicas of a chunk across racks
• Re-replication
  ◦ Triggered as soon as the number of replicas falls below the user-specified goal
• Rebalancing
  ◦ Periodically moves replicas for better disk space and load balancing
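A sketch of the creation heuristics, assuming server objects that expose disk_utilization, recent_creations, and rack attributes; the recent-creations cutoff is an invented knob, not a number from the paper:

```python
def pick_creation_targets(servers, n_replicas=3, recent_limit=5):
    """Choose chunkservers for a new chunk's replicas: prefer servers
    with below-average disk utilization, skip servers with too many
    recent creations (imminent write traffic), spread across racks."""
    avg = sum(s.disk_utilization for s in servers) / len(servers)
    candidates = [s for s in servers
                  if s.disk_utilization <= avg
                  and s.recent_creations < recent_limit]
    chosen, racks_used = [], set()
    for s in sorted(candidates, key=lambda s: s.disk_utilization):
        if s.rack not in racks_used:       # spread replicas across racks
            chosen.append(s)
            racks_used.add(s.rack)
        if len(chosen) == n_replicas:
            break
    return chosen
```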
38. Garbage Collection
• Mechanism (sketched below)
  ◦ The master logs the deletion immediately
  ◦ The file is just renamed to a hidden name
  ◦ Hidden files are removed if they have existed for more than three days
  ◦ In a regular scan of the chunk namespace, the master identifies orphaned chunks and erases their metadata
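A toy version of this lazy deletion scheme, kept entirely in memory; the hidden-name convention is invented, while the three-day retention interval is the paper's (configurable) default:

```python
import time

RETENTION = 3 * 24 * 3600   # three days, per the paper (configurable)

class Namespace:
    def __init__(self):
        self.files: dict[str, list[int]] = {}   # path -> chunk handles
        self.hidden: dict[str, float] = {}      # hidden name -> deletion time

    def delete(self, path: str) -> None:
        """Log the deletion, then just rename the file to a hidden name;
        no storage is reclaimed yet."""
        hidden_name = f".deleted.{path}"        # invented naming convention
        self.files[hidden_name] = self.files.pop(path)
        self.hidden[hidden_name] = time.time()

    def scan(self, now: float) -> None:
        """Regular namespace scan: drop hidden files older than three
        days; their chunks become orphaned and their metadata is erased
        (chunkservers later drop the replicas via HeartBeat exchanges)."""
        for name, stamp in list(self.hidden.items()):
            if now - stamp > RETENTION:
                del self.files[name]
                del self.hidden[name]
```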
39. Stale Replica Detection
• Chunk version numbers distinguish between up-to-date and stale replicas (see the sketch below)
• The master removes stale replicas in its regular garbage collection
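A minimal sketch of the version check. The master increases a chunk's version number each time it grants a new lease, so a replica that missed a mutation while its chunkserver was down reports an older number:

```python
def split_stale(master_version: int, reported: dict[str, int]):
    """Partition replicas by reported chunk version: equal is live,
    older is stale (to be garbage-collected)."""
    live = {s for s, v in reported.items() if v == master_version}
    stale = {s for s, v in reported.items() if v < master_version}
    return live, stale

# Chunkserver "C" was down during a mutation and still reports version 2.
print(split_stale(3, {"A": 3, "B": 3, "C": 2}))  # ({'A', 'B'}, {'C'})
```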
40. Fault Tolerance and Diagnosis
• High availability
  ◦ Fast recovery
    ▪ The master and chunkservers are designed to restore their state and start in seconds
  ◦ Chunk replication
    ▪ The master clones existing replicas as needed to keep each chunk fully replicated
  ◦ Master replication
    ▪ The master state is replicated for reliability
    ▪ The operation log and checkpoints are replicated on multiple machines
    ▪ "Shadow masters" provide read-only access to the file system even when the primary master is down
41. Fault Tolerance and Diagnosis (2)
• Data integrity (sketched below)
  ◦ Each chunkserver uses checksumming to detect corruption of stored data
  ◦ Each chunk is broken up into 64 KB blocks, each with a corresponding 32-bit checksum
  ◦ Checksum computation is heavily optimized for writes that append to the end of a chunk
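A sketch of block-level checksumming on the chunkserver, using zlib.crc32 as a stand-in for whatever 32-bit checksum function GFS actually uses:

```python
import zlib

BLOCK = 64 * 1024   # each chunk is divided into 64 KB blocks

def checksum_blocks(chunk: bytes) -> list[int]:
    """One 32-bit checksum per 64 KB block, kept in memory by the
    chunkserver (the real system also persists them with logging)."""
    return [zlib.crc32(chunk[i:i + BLOCK])
            for i in range(0, len(chunk), BLOCK)]

def verified_read(chunk: bytes, sums: list[int], offset: int, length: int) -> bytes:
    """Verify every block overlapping the requested range before
    returning data, so corruption never propagates to clients."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for i in range(first, last + 1):
        if zlib.crc32(chunk[i * BLOCK:(i + 1) * BLOCK]) != sums[i]:
            raise IOError(f"checksum mismatch in block {i}")  # reported to master
    return chunk[offset:offset + length]
```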
42. Fault Tolerance and Diagnosis (3)
• Diagnostic tools
  ◦ Extensive and detailed diagnostic logging aids problem isolation, debugging, and performance analysis
  ◦ GFS servers generate diagnostic logs that record many significant events and all RPC requests and replies
44. Measurements and Benchmarks (2)
• Real-world clusters
  ◦ Cluster A is used regularly for research and development
  ◦ Cluster B is primarily used for production data processing
46. Experience
• The biggest problems were disk- and Linux-related
  ◦ Many disks claimed to the Linux driver that they supported a range of IDE protocol versions, but in fact responded reliably only to the more recent ones
  ◦ Despite occasional problems, the availability of Linux code has helped in exploring and understanding system behavior
47. Related Work (1/3)
• Both GFS and AFS provide a location-independent namespace
  ◦ Data can be moved transparently for load balancing
  ◦ Fault tolerance
• Unlike AFS, GFS spreads a file's data across storage servers, in a way more akin to xFS and Swift, to deliver aggregate performance and increased fault tolerance
• GFS currently uses replication for redundancy, and so consumes more raw storage than xFS or Swift
48. Related Work (2/3)
• In contrast to systems like AFS, xFS, Frangipani, and Intermezzo, GFS does not provide any caching below the file system interface
• GFS uses a centralized approach to simplify the design, increase reliability, and gain flexibility
  ◦ Unlike Frangipani, xFS, Minnesota's GFS, and GPFS
  ◦ A central master makes it easier to implement sophisticated chunk placement and replication policies, since the master already has most of the relevant information and controls how it changes
49. Related Work (3/3)
• GFS delivers aggregate performance by focusing on the needs of Google's applications rather than building a POSIX-compliant file system, unlike Lustre
• The NASD architecture is based on network-attached disk drives; GFS similarly uses commodity machines as chunkservers
• GFS chunkservers use lazily allocated fixed-size chunks, whereas NASD uses variable-length objects
• The producer-consumer queues enabled by atomic record appends address a problem similar to that of the distributed queues in River
  ◦ River uses memory-based queues distributed across machines
50. Conclusion
• GFS demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware
• Provides fault tolerance through constant monitoring, replication of crucial data, and fast, automatic recovery
• Delivers high aggregate throughput to many concurrent readers and writers performing a variety of tasks
51. References
• Ghemawat, S., Gobioff, H., and Leung, S.-T. 2003. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP '03). ACM, New York, NY, USA, 29-43.
• Coulouris, G., Dollimore, J., and Kindberg, T. 2005. Distributed Systems: Concepts and Design (4th Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
Notes
• Fast recovery makes no distinction between normal and abnormal termination
• Shadow masters are shadows, not mirrors, in that they may lag the primary slightly
• It is impractical to detect corruption by comparing replicas across chunkservers, hence per-chunkserver checksums
• If a write overwrites an existing range of a chunk, the checksums of the first and last blocks of the range must be read and verified before the write
• Diagnostic logging helped immeasurably in problem isolation, debugging, and performance analysis, with minimal cost; logs record events such as chunkservers going up and down, and RPC logs include the exact requests and responses sent on the wire