2. Agenda
Context
Google File System
Aims
Design
Walkthrough
MapReduce
Aims
Design
Walkthrough
Conclusion
3. Context
Google's scalability strategy
Cheap hardware
Lots of it
More than 450,000 servers (NYT, 2006)
Basic Problems
Faults are the norm
Software must be massively parallelized
Writing distributed software is hard
4. GFS: Aims
Aims
Must scale to thousands of servers
Fault tolerance/Data redundancy
Optimize for large WORM (write once, read many) files
Optimize for large reads
Favor throughput over latency
Non-Aims
Space efficiency
Client-side caching (futile for streaming access patterns)
POSIX compliance
Significant deviation from standard DFS (NFS, AFS, DFS, …)
5. Design Choices
File system cluster
Spread data over thousands of machines
Provide safety by redundant storage
Master/Slave architecture
6. Design Choices
Files split into chunks
Unit of management and distribution
Replicated across servers
Subdivided into blocks for integrity checks
Accept internal fragmentation for better throughput
(Diagram: a GFS file split into chunks, stored on top of the local file system)
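A minimal sketch of this layout in Python, assuming the 64 MB chunk size and 64 KB checksum blocks described in the GFS paper; the class and function names are illustrative, not GFS's actual interfaces:

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunk, the unit of distribution
BLOCK_SIZE = 64 * 1024          # 64 KB block, the unit of checksumming

@dataclass
class Chunk:
    handle: int                           # globally unique chunk handle
    version: int                          # bumped on every mutation
    block_checksums: list = field(default_factory=list)

def chunk_index(file_offset: int) -> int:
    """A byte offset in a file maps to exactly one chunk; a fixed chunk
    size trades internal fragmentation for simple, fast lookups."""
    return file_offset // CHUNK_SIZE
```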
7. Architecture: Chunkserver
Host chunks
Store chunks as files in Linux FS
Verify data integrity
Implemented entirely in user mode
Fault tolerance
Fail requests when an integrity violation is detected
All chunks stored on at least one other server
Re-replication restores redundancy after server downtime
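A sketch of the verification path, assuming CRC32 as the per-block 32-bit checksum (the function names are hypothetical):

```python
import zlib

BLOCK_SIZE = 64 * 1024

class IntegrityError(Exception):
    """Raised to fail a request when a block checksum does not match."""

def read_verified(chunk_data: bytes, checksums: list,
                  offset: int, length: int) -> bytes:
    """Verify every block touched by a read before returning data; a
    mismatch fails the request so the client can try another replica."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for i in range(first, last + 1):
        block = chunk_data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[i]:
            raise IntegrityError(f"block {i} corrupt")
    return chunk_data[offset:offset + length]
```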
8. Architecture: Master
Namespace and metadata management
Cluster management
Chunk location registry (transient)
Chunk placement decisions
Re-replication, garbage collection
Health monitoring
Fault tolerance
Backed by shadow masters
Mirrored operations log
Periodic snapshots
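A toy sketch of the split between persistent and transient master state; class and method names are assumptions, not GFS's actual interfaces:

```python
from dataclasses import dataclass, field

@dataclass
class MasterState:
    # Namespace and file-to-chunk mapping are persisted via the
    # mirrored operations log plus periodic snapshots.
    namespace: dict = field(default_factory=dict)  # path -> [chunk handles]
    # Chunk locations are transient: rebuilt from chunkserver
    # heartbeats after a master restart, never logged.
    locations: dict = field(default_factory=dict)  # handle -> {server ids}

    def heartbeat(self, server: str, handles: list) -> None:
        """A chunkserver reports the chunks it holds; these reports are
        the only source of location information."""
        for h in handles:
            self.locations.setdefault(h, set()).add(server)
```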
9. Reading a chunk
Client sends filename and offset
Master determines file and chunk
Master returns chunk locations and version
Client chooses the closest chunkserver and requests the chunk
Chunkserver verifies the chunk version and block checksums
Chunkserver returns the data if valid, else fails the request
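The same read path as a client-side sketch; `master` and the replica objects stand in for RPC stubs, and all method names are hypothetical:

```python
class StaleVersionError(Exception): pass
class IntegrityError(Exception): pass

CHUNK_SIZE = 64 * 1024 * 1024

def read(master, filename: str, offset: int, length: int) -> bytes:
    # 1. Send filename and chunk index; the master maps them to a chunk.
    handle, version, replicas = master.lookup(filename, offset // CHUNK_SIZE)
    # 2. Try replicas in order of proximity (assumed pre-sorted here).
    for server in replicas:
        try:
            # 3. The replica checks chunk version and block checksums,
            #    returning data only if both are valid.
            return server.read(handle, version, offset % CHUNK_SIZE, length)
        except (StaleVersionError, IntegrityError):
            continue  # 4. On failure, fall back to another replica.
    raise IOError("no valid replica available")
```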
10. Writing a chunk
Client sends filename and offset
Master determines lease owner or grants a lease
Master returns chunk locations and lease owner
Client pushes data to any chunkserver
Chunkservers forward the data along the replica chain
Client sends the write request to the primary
Primary creates a mutation order and applies it
Primary forwards the write request to the secondaries
Secondaries validate blocks, apply the modifications, increment the version, acknowledge
Primary replies to the client
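A sketch of the client side of this write path; note how the data flow (step 2) is decoupled from the control flow through the primary (step 3). All stub method names are hypothetical:

```python
CHUNK_SIZE = 64 * 1024 * 1024

def write(master, filename: str, offset: int, data: bytes) -> None:
    """Client-side write path against hypothetical RPC stubs."""
    # 1. Master names the replicas and the current lease owner (primary).
    handle, primary, secondaries = master.get_lease_info(filename, offset)
    # 2. Data flow: push the bytes to any replica, which forwards them
    #    along the chain, so every machine's bandwidth is used.
    primary.push_data(handle, data, forward_to=secondaries)
    # 3. Control flow: the primary assigns a mutation order, applies the
    #    write, and forwards the request to all secondaries, which
    #    validate blocks, apply, bump the chunk version, and ack.
    primary.commit_write(handle, offset % CHUNK_SIZE, len(data))
```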
11. Appending a record (1/2)
Client sends append request
Master returns chunk locations and lease owner
Client pushes data to any chunkserver
Chunkservers forward the data along the replica chain
Client sends the append request to the primary
Primary chooses the offset and checks the space left in the chunk
If the record does not fit: primary pads the chunk and requests a retry
Client requests allocation of a new chunk
Master returns the new chunk locations and lease owner
12. Appending a record (2/2)
Client sends the append request
Primary allocates the chunk and writes the data
Primary forwards the request to the replicas
A replica fails while writing the data
Primary fails the request
Client retries: the data is pushed and written again, forwarded to the replicas, and this time all writes succeed
(Resulting replica states: record A and padding on all replicas; the failed attempt leaves record B on two replicas and garbage on the third; the retry appends B on all three, duplicating B on two of them)
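A sketch of the client-side retry loop this walkthrough implies; exception and method names are illustrative:

```python
class ChunkFullError(Exception): pass
class ReplicaFailure(Exception): pass

def record_append(master, filename: str, record: bytes) -> int:
    """Client-side record append. GFS, not the client, picks the offset;
    the return value is where the record landed. Stub names are
    hypothetical."""
    while True:
        # Master returns the last chunk's replicas and the lease owner.
        handle, primary, secondaries = master.locate_last_chunk(filename)
        primary.push_data(handle, record, forward_to=secondaries)
        try:
            # The primary checks remaining space; if the record does not
            # fit, it pads the chunk everywhere and signals a retry, after
            # which the master allocates a fresh chunk.
            return primary.append(handle, len(record))
        except (ChunkFullError, ReplicaFailure):
            continue  # retry; failed attempts remain as duplicates/garbage
```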
13. File Region Consistency
File region states:
Consistent: all clients see the same data
Defined: consistent, and clients see the mutation in its entirety
Inconsistent: clients may see different data
Consequences
Unique record IDs to identify duplicates
Records should be self-identifying and self-validating
(Example replica states: B, B, Garbage → inconsistent; B, B, B → defined; all replicas holding the same interleaved fragments of concurrent writes C and D → consistent but undefined; C, C, C → defined)
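A minimal sketch of such record framing, assuming a simple length/id/checksum header (the layout is illustrative, not GFS's actual format):

```python
import struct
import uuid
import zlib

HEADER = struct.Struct(">I16sI")   # length, record id, payload checksum

def encode_record(payload: bytes) -> bytes:
    """Make the record self-identifying (the unique id lets readers drop
    duplicates left by client retries) and self-validating (the checksum
    lets readers skip padding and garbage regions)."""
    return HEADER.pack(len(payload), uuid.uuid4().bytes,
                       zlib.crc32(payload)) + payload

def valid(buf: bytes) -> bool:
    """Reader-side check: reject garbage whose checksum does not match."""
    length, rid, crc = HEADER.unpack_from(buf)
    payload = buf[HEADER.size:HEADER.size + length]
    return len(payload) == length and zlib.crc32(payload) == crc
```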
14. Fault tolerance GFS
Automatic re-replication
Replicas are spread across racks
Fault mitigations
Data corruption → choose a different replica
Machine crash → choose a replica on a different machine
Network outage → choose a replica in a different network
Master crash → fail over to a shadow master
15. Wrap-up GFS
Proven scalability, fault tolerance and performance
Highly specialized
Distribution transparent to clients
Only partial abstraction – clients must cooperate
Network is the bottleneck
16. MapReduce: Aims
Aims
Unified model for large-scale distributed data processing
Massively parallelizable
Distribution transparent to developer
Fault tolerance
Allow moving computation close to the data
17. Higher order functions
Functional: map(f, [a, b, c, …]) applies f to every element independently, yielding [f(a), f(b), f(c), …]
Google: map(k1, v1) consumes one key/value pair and emits a list of intermediate (k2, v2) pairs
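A small Python illustration of the difference; the link-extraction mapper is a hypothetical example in the spirit of the MapReduce paper:

```python
# Functional map: f is applied to each element independently.
print(list(map(lambda x: x * x, [1, 2, 3])))        # [1, 4, 9]

# Google's map: one (k1, v1) input pair may emit any number of
# intermediate (k2, v2) pairs, not exactly one output per input.
def map_fn(source_url: str, html: str):
    for token in html.split():
        if token.startswith("http"):
            yield (token, source_url)   # (target, source) pairs

print(list(map_fn("http://a", "see http://b and http://c")))
# [('http://b', 'http://a'), ('http://c', 'http://a')]
```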
18. Higher order functions
Functional: foldl/reduce(f, [a, b, c, …], i) threads an accumulator through the list: f(f(f(i, a), b), c)
Google: reduce(k2, [a, b, c, d, …]) consumes one key and the list of all intermediate values emitted for it, producing the final value(s)
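The same contrast for reduce, continuing the hypothetical link-extraction example (illustrative code, not Google's API):

```python
from functools import reduce

# Functional foldl: thread an accumulator through the list, i.e.
# reduce(f, [a, b, c], i) computes f(f(f(i, a), b), c).
print(reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0))   # 10

# Google's reduce: invoked once per key with the list of all
# intermediate values for that key; for the link mapper above it
# simply collects the source pages pointing at each target.
def reduce_fn(target: str, sources: list):
    yield sorted(sources)

print(list(reduce_fn("http://b", ["http://a", "http://d"])))
# [['http://a', 'http://d']]
```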
19. Idea
Observation
Map can be easily parallelized
Reduce can be parallelized across keys
Idea
Design application around map and reduce scheme
Infrastructure manages scheduling and distribution
User implements map and reduce
Constraints
All data coerced into key/value pairs
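A single-process word-count sketch of the whole scheme; the real runtime distributes these same phases across machines, and the driver names here are illustrative:

```python
from collections import defaultdict

def map_fn(doc_name: str, text: str):
    for word in text.split():
        yield (word, 1)

def reduce_fn(word: str, counts: list):
    yield (word, sum(counts))

def run(inputs):
    """Everything is coerced into key/value pairs; the driver owns
    grouping (the shuffle), so user code stays purely functional."""
    groups = defaultdict(list)
    for name, text in inputs:                 # map phase
        for k, v in map_fn(name, text):
            groups[k].append(v)
    return [out for k in sorted(groups)       # reduce phase
                for out in reduce_fn(k, groups[k])]

print(run([("d1", "a b a"), ("d2", "b")]))    # [('a', 2), ('b', 2)]
```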
20. Walkthrough
Spawn master
Provide data
e.g. from GFS or BigTable
Create M splits
Spawn M mappers
Map
Once per key/value pair
Partition into R buckets
Spawn up to M × R combiners
Combine
Barrier
Spawn up to R reducers
Reduce
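A sketch of the partitioning step; hash partitioning (hash(key) mod R) is the default the MapReduce paper describes, the rest is illustrative:

```python
R = 4  # number of reduce buckets

def partition(key: str) -> int:
    """Every mapper splits its intermediate pairs into R buckets;
    reducer i later fetches bucket i from every mapper."""
    return hash(key) % R

buckets = [[] for _ in range(R)]
for k, v in [("cat", 1), ("hat", 1), ("cat", 1)]:
    buckets[partition(k)].append((k, v))   # same key -> same reducer
```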
21. Scheduling
Locality
Mappers scheduled close to data
Chunk replication improves locality
Reducers run on same machine as mappers
Choosing M and R
Load balancing & fast recovery vs. number of output files
M >> R, R small multiple of #machines
Schedule backup tasks to avoid stragglers
22. Fault tolerance MapReduce
Fault mitigations
Map crash
All intermediate data is in a questionable state
Repeat all map tasks of the crashed machine
This is the rationale for the barrier
Reduce crash
Output data is global, completed stages are marked
Repeat only the crashed reduce task
Master crash
Start over
Repeated crashes on certain records
Skip those records
23. Wrap-up MapReduce
Proven scalability
Restrict programming model for better runtime support
Tailored to Google’s needs
Programming model is fairly low-level
24. Conclusion
Rethinking infrastructure
Rigorous design for scalability and performance
Highly specialized solutions
Not generally applicable
Solutions more low-level than usual
Maintenance efforts may be significant
"We believe we get tremendous competitive advantage by essentially building our own infrastructures"
--Eric Schmidt
25. References
Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", 2004
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System", 2003
Ralf Lämmel, "Google's MapReduce Programming Model – Revisited", SCP journal, 2006
Google, "Cluster Computing and MapReduce", http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html, 2007
David F. Carr, "How Google Works", http://www.baselinemag.com/c/a/Projects-Networks-and-Storage/How-Google-Works/, 2006
Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, Christos Kozyrakis, "Evaluating MapReduce for Multi-core and Multiprocessor Systems", 2007