
Floating on a RAFT: HBase Durability with Apache Ratis

In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS predominates is that HBase's write-ahead log (WAL) has specific durability requirements, and HDFS provides that guarantee correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.

This talk will cover the design of a "Log Service" that can be embedded inside HBase and provides the level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library implementation of the Raft consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it with other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the changes to HBase that enable this. Finally, we'll discuss how the Log Service can reduce the operational burden of HBase.

Published in: Technology

  1. Floating on a Raft: HBase Durability with Apache Ratis. NoSQL Day 2019, Washington, D.C. Ankit Singhal, Josh Elser. Apache, Apache HBase, HBase, Apache Ratis, and Ratis are (registered) trademarks of the Apache Software Foundation.
  2. Distributed Consensus
     Problem: How does a collection of computers agree on state in the face of failures?
     (Diagram: computers holding A = 1, A = 2, A = 1. Image: CC BY-SA 3.0, https://upload.wikimedia.org/wikipedia/commons/thumb/b/b2/Gnome-computer.svg/1024px-Gnome-computer.svg.png)
  3. Distributed Consensus
     Goals: low-latency, high-throughput, fault-tolerant
     Algorithms: Paxos, Raft, ZooKeeper Atomic Broadcast (ZAB), Viewstamped Replication
     Variants: Multi-Paxos, Fast Paxos, Byzantine Paxos, MultiRaft
     Implementations: Chubby, Apache ZooKeeper, etcd, CockroachDB, Apache Kudu, Apache Ratis, HashiCorp Raft/Consul, RethinkDB, Akka Raft, Hazelcast Raft, Neo4j, WANdisco...
  4. Raft
     Easy to understand, easy to implement.
     “New” (2013) -- Diego Ongaro, John Ousterhout
     Proven correctness via TLA+
     Paxos is “old” (1989), but still hard
  5. Apache Ratis
     Incubating project at the Apache Software Foundation
     A library-oriented Java implementation of Raft (not a service!)
     Pluggable pieces:
     ● Transport (gRPC, Netty, Hadoop RPC)
     ● State Machine (your code!)
     ● Raft Log (in-memory, segmented files on disk)
  6. Ratis State Machines
     A StateMachine is the abstraction point for user code: an interface to query and modify “state”.
     Ratis Arithmetic Example: maintain variables (e.g. a = 1) and apply mathematical operations.
     Read expressions: add, subtract, multiply, divide
     Write expressions: assignment

         class Arithmetic implements StateMachine {
           Map<String,Double> variables;

           // Reads evaluate an expression under the read lock.
           Message query(Message req) {
             Expression exp = parseReadExp(req);
             try (ReadLock rlock = getReadLock()) {
               return exp.eval(variables);
             }
           }

           // Writes (assignments) evaluate under the write lock.
           Message update(Message req) {
             Expression exp = parseWriteExp(req);
             try (WriteLock wlock = getWriteLock()) {
               return exp.eval(variables);
             }
           }
         }
  7. Ratis LogService
     A recipe that provides the facade of a log (append-only, immutable bytes).
     Maintains little-to-no state. Storage is “provided” by the Raft Log.

         interface Reader {
           void seek(long offset);
           byte[] readMsg();
           List<byte[]> readBulk(int numMsgs);
         }

         interface Writer {
           long write(byte[] msg);
           List<Long> writeBulk(List<byte[]> msgs);
         }

         interface Client {
           List<String> list();
           Log getLog(String name);
           void archive(String name);
           void close(String name);
           void delete(String name);
         }

         interface Log {
           Reader createReader();
           Writer createWriter();
           Metadata getMetadata();
           void addListener();
         }
  8. Ratis LogService Architecture
     (Diagram: Client, Metadata, Workers. Example log names: transactions, gps_coordinates, sensors, query_durations.)
  9. LogService Testing
     Docker-compose simplicity: 3 metadata services, >=3 workers
         $ mvn package assembly:single && ./build-docker.sh
         $ docker-compose up -d
         $ ./client-env.sh
     Utilities: interactive shell, verification tool
         $ ./bin/shell -q <...>
         $ ./bin/load-test -q <...>
  10. LogService Testing
      Goal: generate some non-trivial data sizes
      Environment:
      ● Intel i5-5250U
      ● 16GB of RAM
      ● Samsung SSD 850 M.2
      ● Gentoo Linux: Kernel 4.19.27
      ● Docker 18.09.4
      ● Write ~50MB per scenario
      ● Single client program, one log/thread, no batching
      ● JDK8, 3GB LogWorker heaps (no other tuning)
  11. LogService Testing Results
      Logs/Threads | Value Size | Num Records | Duration
      1            | 50         | 1,100,000   | 5h+
      4            | 50         | 275,000     | 35m
      5            | 100        | 105,000     | 13m 30s
      5            | 500        | 22,000      | 2m 48s
      8            | 100        | 66,000      | 16m 20s
      8            | 500        | 13,200      | 2m 30s
      4            | 1000       | 13,200      | 1m 40s
  12. Does HBase want this?
      Assumption: we can run HBase more efficiently in cloud environments without HDFS for WALs.
      ● Running HDFS is expensive and hard
        ○ Data is “heavy” (tens of minutes to hours to decommission)
        ○ Unexpected DataNode failure requires slow re-replication
      ● More things to monitor -- twice as many JVMs
      Ideal case:
      ● Scale up HBase by just adding more RegionServers, then balance
      ● Scale down by gently (on the order of minutes) removing RegionServers
  13. Durability in HBase: Write Path
      (Diagram: Put, Delete, and Incr requests arrive at the RegionServer. Step 1: append and sync KVs to the WAL. Step 2: write to the MemStore of a Store in Region1 ... RegionN. Step 3: asynchronous flushing to generate HFiles (Store Files).)
  14. Life cycle of WAL
      (Diagram elements: RegionServer, WAL, WALs, ZooKeeper, Flush, Log Roller, Roll WAL, Tracking for Replication, Backup, Cleaner chore, WALs Archived.)
  15. RegionServer Recovery
      - Identification: the Master (ServerManager) deems a RegionServer dead when its ephemeral node is deleted
      - Splitting: reading the WAL and creating separate files for each region
      - Re-assignment: assigning the regions from the dead server to live RegionServers
      - Fencing: fencing off a half-dead RegionServer (one that undergoes a long GC pause and comes back after the GC finishes); currently done by renaming an HDFS directory
      - Replaying: reading the recovered edits produced by WAL splitting and replaying the edits that were not flushed
  16. RegionServer Recovery Refactoring
      - Identification: no change is required
      - Splitting:
            interface WALProvider { public Map<Region, WAL> split(WAL wal); }
      - Re-assignment: no change is required
      - Fencing:
            interface ServerFence { public void fence(ServerName server); }
        In the case of Ratis, the implementation could close the log to prevent further writes by the dead RegionServer.
      - Replaying:
            interface WALProvider { public Reader getRecoveredEditsReader(Region region); }
      Disclaimer: these interfaces are for reference only and may change during actual development.
  17. Replication
      - Async and Serial Replication rely on reading WALs
      - Need long-term storage for WALs
      - Ratis LogService uses local disk
      Proposed solution:
      - Can we upload Ratis WALs to distributed, cheap storage?
      - If we can hold onto WALs indefinitely, we don’t have to rewrite Replication.
  18. Why Ratis for WAL?
      Choices are: Apache Kafka, Distributed Log, Apache Ratis, HDFS
      ● Fully embeddable (no dependency on an external system)
      ● Low latency
      ● High throughput
      ● Enables HBase for cloud deployment
      Disclaimer: we are not suggesting Ratis is the only solution; the HBase refactoring will be done in such a way that any storage is pluggable.
  19. What’s next?
      More testing for LogService
      ● Easy to cause leader-election storms
      ● Better insight/understanding into internals
      A Ratis LogService WALProvider
      ● Wire up the LogService with the new WAL APIs
  20. References
      Ratis LogService
      ● https://github.com/apache/incubator-ratis/tree/master/ratis-logservice
      HBase WAL Refactoring
      ● https://issues.apache.org/jira/browse/HBASE-20951
      ● https://issues.apache.org/jira/browse/HBASE-20952
      Authors
      ● ankit,elserj@apache.org
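
The state machine on slide 6 depends on Ratis types that the slide does not show. As a rough, self-contained illustration of the same read/write split, the sketch below keeps variables in a map, serves read expressions under a read lock, and applies assignments under a write lock. It does not use the Ratis API: the class name `ArithmeticSketch`, the string-based expressions, and the one-operator grammar are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a state machine; NOT the Ratis StateMachine interface. */
final class ArithmeticSketch {
  private final Map<String, Double> variables = new HashMap<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  /** Read path: evaluates "<operand> <op> <operand>" without changing state. */
  double query(String expr) {
    String[] t = expr.trim().split("\\s+");
    if (t.length != 3) {
      throw new IllegalArgumentException("expected: <operand> <op> <operand>");
    }
    lock.readLock().lock();
    try {
      double left = value(t[0]);
      double right = value(t[2]);
      switch (t[1]) {
        case "+": return left + right;
        case "-": return left - right;
        case "*": return left * right;
        case "/": return left / right;
        default: throw new IllegalArgumentException("unknown operator: " + t[1]);
      }
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Write path: applies "<name> = <number>" and returns the stored value. */
  double update(String assignment) {
    String[] t = assignment.split("=");
    if (t.length != 2) {
      throw new IllegalArgumentException("expected: <name> = <number>");
    }
    String name = t[0].trim();
    double v = Double.parseDouble(t[1].trim());
    lock.writeLock().lock();
    try {
      variables.put(name, v);
      return v;
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Resolves a token as a known variable, else parses it as a numeric literal. */
  private double value(String token) {
    Double v = variables.get(token);
    return v != null ? v : Double.parseDouble(token);
  }
}
```

After `update("a = 1")` and `update("b = 2.5")`, `query("a + b")` returns 3.5. In a replicated setting only the `update` calls would need to go through the Raft log; the sketch leaves replication out entirely.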
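
To make the Reader and Writer contracts on slide 7 more concrete, here is a minimal in-memory sketch. It is not the Ratis LogService implementation and provides no durability. It assumes that `write` returns the zero-based index of the appended record and that `seek` positions a reader by record index; the slide states neither. The class name `InMemoryLogSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory log mirroring the Reader/Writer method names from the slide. */
final class InMemoryLogSketch {
  private final List<byte[]> records = new ArrayList<>();

  /** Appends one record; returns its index (assumed meaning of the returned long). */
  synchronized long write(byte[] msg) {
    records.add(msg.clone()); // defensive copy keeps stored bytes immutable
    return records.size() - 1;
  }

  /** Appends several records in order; returns their indexes. */
  synchronized List<Long> writeBulk(List<byte[]> msgs) {
    List<Long> ids = new ArrayList<>();
    for (byte[] m : msgs) {
      ids.add(write(m));
    }
    return ids;
  }

  Reader createReader() {
    return new Reader();
  }

  /** Each reader tracks its own position, starting at the first record. */
  final class Reader {
    private long position = 0;

    void seek(long offset) {
      if (offset < 0) {
        throw new IllegalArgumentException("offset must be >= 0");
      }
      position = offset;
    }

    /** Returns the next record, or null when the reader is at the end of the log. */
    byte[] readMsg() {
      synchronized (InMemoryLogSketch.this) {
        if (position >= records.size()) {
          return null;
        }
        byte[] msg = records.get((int) position).clone();
        position++;
        return msg;
      }
    }

    /** Reads up to numMsgs records, stopping early at the end of the log. */
    List<byte[]> readBulk(int numMsgs) {
      List<byte[]> out = new ArrayList<>();
      for (int i = 0; i < numMsgs; i++) {
        byte[] m = readMsg();
        if (m == null) {
          break;
        }
        out.add(m);
      }
      return out;
    }
  }
}
```

Writing three records and then calling `seek(1)` on a fresh reader makes the next `readMsg()` return the second record. Appended bytes are never modified, which matches the "append-only, immutable bytes" description.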
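
The results on slide 11 can be sanity-checked with a little arithmetic. The helper below assumes that Value Size is in bytes and that Num Records is counted per log/thread. Neither is stated on the slide, but under that reading every row totals roughly 52 to 55 MB, which agrees with the "~50MB per scenario" note on slide 10. The class and method names are hypothetical.

```java
/** Back-of-the-envelope helpers for the load-test table; names are hypothetical. */
final class LoadTestMath {
  private LoadTestMath() {}

  /** Total payload bytes, assuming valueSizeBytes is in bytes and records are per thread. */
  static long totalBytes(int threads, int valueSizeBytes, long recordsPerThread) {
    return (long) threads * valueSizeBytes * recordsPerThread;
  }

  /** Aggregate records written per second across all threads. */
  static double recordsPerSecond(int threads, long recordsPerThread, long durationSeconds) {
    if (durationSeconds <= 0) {
      throw new IllegalArgumentException("durationSeconds must be positive");
    }
    return (double) (threads * recordsPerThread) / durationSeconds;
  }
}
```

For the last row (4 threads, 1000-byte values, 13,200 records, 1m 40s), `totalBytes(4, 1000, 13_200)` gives 52,800,000 and `recordsPerSecond(4, 13_200, 100)` gives 528.0. The 4-thread, 50-byte row works out to about 524 records per second (1,100,000 records in 2,100 s). If the assumptions hold, per-record cost rather than payload size may dominate in this unbatched setup.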
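
The write path on slide 13 (append and sync to the WAL, update the MemStore, flush to store files later) can be sketched in a few lines. This is a toy model with hypothetical names, not HBase code. The WAL is a plain list, so nothing is actually synced to disk, and `flush()` is called explicitly rather than asynchronously.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of WAL-then-MemStore ordering; all names are hypothetical. */
final class WritePathSketch {
  final List<String> wal = new ArrayList<>();                         // stand-in for the durable log
  final List<TreeMap<String, String>> storeFiles = new ArrayList<>(); // stand-in for HFiles
  private TreeMap<String, String> memStore = new TreeMap<>();         // sorted in-memory edits

  /** Steps 1 and 2: record the edit in the WAL first, then apply it to the MemStore. */
  void put(String row, String value) {
    wal.add(row + "=" + value); // a real WAL would sync here before acknowledging
    memStore.put(row, value);
  }

  /** Step 3: turn the current MemStore into an immutable store file. */
  void flush() {
    if (memStore.isEmpty()) {
      return;
    }
    storeFiles.add(memStore);
    memStore = new TreeMap<>();
  }

  /** Reads check the MemStore first, then store files from newest to oldest. */
  String get(String row) {
    String v = memStore.get(row);
    if (v != null) {
      return v;
    }
    for (int i = storeFiles.size() - 1; i >= 0; i--) {
      v = storeFiles.get(i).get(row);
      if (v != null) {
        return v;
      }
    }
    return null;
  }
}
```

Because every edit reaches the WAL before the MemStore, an edit that has not yet been flushed can still be recovered from the log after a crash. That is the durability requirement the talk is concerned with.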
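
The splitting and replaying steps on slide 15 amount to grouping a dead server's WAL entries by region and then re-applying only the edits that had not been flushed. The sketch below illustrates this with hypothetical types. It assumes each edit carries a sequence id and that the last flushed sequence id of a region is known; the slides do not spell out this mechanism, and this is not HBase's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy WAL split and replay; types and fields are hypothetical. */
final class WalSplitSketch {
  private WalSplitSketch() {}

  /** One WAL entry: which region it belongs to, its sequence id, and the edit itself. */
  static final class Edit {
    final String region;
    final long seqId;
    final String row;
    final String value;

    Edit(String region, long seqId, String row, String value) {
      this.region = region;
      this.seqId = seqId;
      this.row = row;
      this.value = value;
    }
  }

  /** Splitting: group entries by region, preserving their order in the WAL. */
  static Map<String, List<Edit>> split(List<Edit> wal) {
    Map<String, List<Edit>> byRegion = new LinkedHashMap<>();
    for (Edit e : wal) {
      byRegion.computeIfAbsent(e.region, r -> new ArrayList<>()).add(e);
    }
    return byRegion;
  }

  /** Replaying: rebuild in-memory state from edits newer than the last flush. */
  static Map<String, String> replay(List<Edit> recoveredEdits, long lastFlushedSeqId) {
    Map<String, String> memStore = new TreeMap<>();
    for (Edit e : recoveredEdits) {
      if (e.seqId > lastFlushedSeqId) {
        memStore.put(e.row, e.value);
      }
    }
    return memStore;
  }
}
```

Edits at or below the flushed sequence id are skipped on the assumption that they are already in a store file, so replay only rebuilds what was lost from memory.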
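
Slide 16 suggests that, with Ratis, fencing could be implemented by closing the log so that a half-dead RegionServer can no longer write. A minimal sketch of that idea follows, using hypothetical in-memory classes (`FencingSketch`, `ClosableLog`) and plain strings in place of HBase's `ServerName`. In a real system the closed state would have to be agreed on through the replicated log, not held in a local flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Fencing by closing a server's log; all names here are hypothetical. */
final class FencingSketch {

  /** A log that rejects appends once it has been closed. */
  static final class ClosableLog {
    private final List<byte[]> records = new ArrayList<>();
    private boolean closed;

    synchronized long append(byte[] edit) {
      if (closed) {
        throw new IllegalStateException("log is closed; writer has been fenced");
      }
      records.add(edit.clone());
      return records.size() - 1;
    }

    synchronized void close() {
      closed = true;
    }

    synchronized int size() {
      return records.size();
    }
  }

  private final Map<String, ClosableLog> logsByServer = new ConcurrentHashMap<>();

  /** Returns the log a given server writes its edits to, creating it on first use. */
  ClosableLog logFor(String server) {
    return logsByServer.computeIfAbsent(server, s -> new ClosableLog());
  }

  /** Fences a server by closing its log; existing records stay readable for recovery. */
  void fence(String server) {
    ClosableLog log = logsByServer.get(server);
    if (log != null) {
      log.close();
    }
  }
}
```

If a server resumes after a long GC pause and calls `append` on its old log handle, the call fails instead of silently adding edits that recovery would never see.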
