3. Fault tolerance
- Workers can be replicated and act as a cache for the various UFS
- Many UFS have high availability, fault tolerance guarantees
- Master becomes single point of failure
4. Journal details
- Total order log of operations
- Recover by replay
- Snapshots to efficiently store state
- Faster recovery
- Smaller size
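The bullets above can be sketched as a toy journal: a total-order log of operations, replayed on recovery, with periodic snapshots so only the log tail needs replaying. The entry format, snapshot policy, and class names below are illustrative assumptions, not Alluxio's actual on-disk format.

```python
# A minimal sketch of a totally ordered journal with periodic snapshots.
class Journal:
    def __init__(self, snapshot_every=3):
        self.log = []                 # total-order log of operations
        self.snapshot = ({}, 0)       # (materialized state, log index it covers)
        self.snapshot_every = snapshot_every

    def append(self, op):
        self.log.append(op)
        if len(self.log) % self.snapshot_every == 0:
            # Snapshot: materialize state so old entries need not be replayed.
            self.snapshot = (self.replay_from(({}, 0)), len(self.log))

    def replay_from(self, snapshot):
        state, index = snapshot
        state = dict(state)
        for key, value in self.log[index:]:
            state[key] = value        # apply each operation in log order
        return state

    def recover(self):
        # Faster recovery: start from the snapshot, replay only the tail.
        return self.replay_from(self.snapshot)
```

A snapshot also bounds journal size: entries covered by the snapshot can be discarded.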
5. Basic fault tolerance
- Create a fault tolerant journal
- If the master crashes
- Start a new master
- Replay the journal
- Start serving clients
- The system will be unavailable during this time
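The recovery steps above reduce to one function: start a fresh master, replay the surviving journal, then serve. The entry format and function name are illustrative assumptions; the point is that clients wait for the whole replay, which is the unavailability window.

```python
# A minimal sketch of cold recovery, assuming the journal itself
# survives the crash (it is the fault-tolerant piece).
def recover_master(journal_entries):
    """Start a new master: replay the journal, then serve clients."""
    state = {}
    for key, value in journal_entries:   # replay in total order
        state[key] = value
    return state                          # only now can clients be served
```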
6. Basic high availability
- Run multiple masters
- A primary master will serve requests
- Secondary master(s) will replicate the state of the primary master, and take over in case of failure
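A hot standby avoids the cold-restart replay: the secondary continuously tails the same journal, so at failover it is already caught up. This is a sketch under assumed names, not Alluxio's replication code.

```python
# A minimal sketch of a secondary master tailing the primary's journal.
class Secondary:
    def __init__(self):
        self.state, self.applied = {}, 0

    def tail(self, log):
        # Apply any journal entries not yet seen, in log order.
        for key, value in log[self.applied:]:
            self.state[key] = value
        self.applied = len(log)

    def take_over(self):
        # Already caught up, so failover is fast: just start serving.
        return self.state
```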
8. Problems to solve
- Ensure a single primary master is running at all times
- Journal needs to be
- Fault tolerant
- Must agree on a valid order of journal entries
- Consensus
10. Ensure a single primary master is running at a time
- Leader election using a ZooKeeper recipe
- Apache ZooKeeper is an open-source server which enables highly reliable distributed coordination
- File-system like abstraction built on top of an Atomic Broadcast (consensus) protocol
- Run on a cluster of nodes to provide fault tolerance/high availability
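The standard ZooKeeper leader-election recipe: each candidate creates an ephemeral *sequential* znode, the holder of the lowest sequence number is leader, and every other candidate watches only its immediate predecessor (avoiding a herd effect). The sketch below simulates the recipe in memory rather than calling a real ZooKeeper cluster; all names are illustrative.

```python
# In-memory simulation of the ZooKeeper leader-election recipe.
import itertools

class Election:
    def __init__(self):
        self.seq = itertools.count()
        self.nodes = {}                      # candidate -> sequence number

    def join(self, candidate):
        # Create an ephemeral sequential node for this candidate.
        self.nodes[candidate] = next(self.seq)

    def leave(self, candidate):
        # A lost session deletes the ephemeral node automatically.
        self.nodes.pop(candidate)

    def leader(self):
        # The lowest sequence number wins.
        return min(self.nodes, key=self.nodes.get)

    def watches(self, candidate):
        # Watch only the immediate predecessor, not the leader node,
        # so a single failure wakes a single watcher.
        lower = [c for c in self.nodes if self.nodes[c] < self.nodes[candidate]]
        return max(lower, key=self.nodes.get) if lower else None
```

When the leader's ephemeral node disappears, its watcher re-checks and becomes the new leader.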
11. UFS Journal
- Write journal entries to the UFS
- Use the availability / fault tolerance / consistency guarantees of the UFS
- HDFS
12. Does leader election solve all our problems?
- Not quite
- Due to asynchrony, two nodes may believe they are leader at the same time
- Concurrent writes to the journal
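A standard defense against these concurrent writes is an epoch (fencing token): each elected leader gets a higher epoch, and the journal rejects appends carrying a stale one. This is the general technique, sketched with assumed names, not a quote of Alluxio's code.

```python
# A minimal sketch of epoch-based fencing on the journal.
class FencedJournal:
    def __init__(self):
        self.epoch = 0      # highest leader epoch seen so far
        self.log = []

    def append(self, epoch, entry):
        if epoch < self.epoch:
            return False     # stale leader: its election was superseded
        self.epoch = epoch
        self.log.append(entry)
        return True
```

Even if the deposed leader still believes it is leader, its writes carry the old epoch and are refused.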
14. Issues
- Relies on multiple systems
- Each having their own fault tolerance/availability models
- More complicated
- Different UFS have different consistency models and performance characteristics
- May not be efficient for appending log entries
15. Additional details on the file system metadata
- RocksDB (optional)
- Log-structured merge tree
- Efficient inserts
- Key-value store
- Inode tree as a key-value map
- Efficient snapshots
- Alluxio adds in-memory cache for fast reads
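The "inode tree as a key-value map" idea can be sketched with two assumed keyspaces: an edge map `(parent id, child name) -> child id` and an inode map `id -> metadata`, so each path component resolves with one point lookup. This layout is illustrative, not Alluxio's exact RocksDB schema.

```python
# A minimal sketch of an inode tree stored as flat key-value maps.
import itertools

class InodeStore:
    def __init__(self):
        self.ids = itertools.count()
        self.edges = {}                                  # (parent_id, name) -> child_id
        self.inodes = {next(self.ids): {"dir": True}}    # id 0 is the root

    def create(self, parent_id, name, meta):
        child_id = next(self.ids)
        self.edges[(parent_id, name)] = child_id  # one key-value insert per edge
        self.inodes[child_id] = meta              # one key-value insert per inode
        return child_id

    def lookup(self, path):
        # Resolve a path one component at a time via point lookups.
        inode_id = 0
        for name in path.strip("/").split("/"):
            inode_id = self.edges[(inode_id, name)]
        return self.inodes[inode_id]
```

In an LSM-tree store like RocksDB both maps become cheap sequential inserts, and a read cache in front serves hot lookups from memory.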
17. Raft - replicated state machine
- Clients interact with the state machine as if it were a single instance (linearizability)
- Send commands and receive responses
- Fault tolerant and highly available
https://ratis.apache.org
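The replicated-state-machine idea in miniature: an entry is applied to each replica's state machine only once a majority has stored it, so every replica's key-value state converges on the same committed prefix. This is a toy model of commitment only, not the full Raft protocol (no terms, no leader election, no log repair); all names are assumptions.

```python
# A minimal sketch of majority-commit-then-apply.
class Replica:
    def __init__(self, up=True):
        self.up = up
        self.log, self.state, self.applied = [], {}, 0

    def append(self, entries):
        self.log = list(entries)           # follower stores the leader's entries
        return True                        # acknowledgement

    def apply_to(self, commit_index):
        while self.applied < commit_index: # apply committed entries in order
            key, value = self.log[self.applied]
            self.state[key] = value
            self.applied += 1

def replicate(leader_log, replicas, quorum):
    # Append to every reachable follower; count acknowledgements.
    acked = sum(1 for r in replicas if r.up and r.append(leader_log))
    if acked < quorum:
        return None                        # no majority: nothing is committed
    for r in replicas:
        if r.up:
            r.apply_to(len(leader_log))
    return len(leader_log)                 # new commit index
```

A down replica keeps an empty state machine and catches up later by replaying the committed log, which is exactly the journal-recovery story from earlier slides.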
18. The Alluxio journal and a replicated state machine
- Raft simplifications
- Handles snapshotting and recovery, the journal log, etc.
- Replicated state machine = key-value store of the file-system metadata
- Raft colocated with Alluxio masters
19. Primary master protocol
- Raft ensures a consistent and highly available journal
- Still want a single primary master
- Update the UFS and Alluxio workers
- Serve clients
- Use the leader election built into Raft, plus an additional coordination layer
21. Advantages
- Simplicity
- No external systems (Raft colocated with masters)
- Raft takes care of logging, snapshotting, recovery, etc.
- Performance
- Journal stored directly on masters
- RocksDB key-value store + cache