3. Fault tolerance
- Workers can be replicated and act as a cache for the various UFS
- Many UFS have high availability, fault tolerance guarantees
- Master becomes single point of failure
4. Journal details
- Total order log of operations
- Recover by replay
- Snapshots to efficiently store state
- Faster recovery
- Smaller size
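The bullets above can be sketched as a toy journal: a total-order log of operations, replayed on recovery, with periodic snapshots so only the log tail needs replaying. The entry format, snapshot policy, and class names below are illustrative assumptions, not Alluxio's actual on-disk format.

```python
# A minimal sketch of a totally ordered journal with periodic snapshots.
class Journal:
    def __init__(self, snapshot_every=3):
        self.log = []                 # total-order log of operations
        self.snapshot = ({}, 0)       # (materialized state, log index it covers)
        self.snapshot_every = snapshot_every

    def append(self, op):
        self.log.append(op)
        if len(self.log) % self.snapshot_every == 0:
            # Snapshot: materialize state so old entries need not be replayed.
            self.snapshot = (self.replay_from(({}, 0)), len(self.log))

    def replay_from(self, snapshot):
        state, index = snapshot
        state = dict(state)
        for key, value in self.log[index:]:
            state[key] = value        # apply each operation in log order
        return state

    def recover(self):
        # Faster recovery: start from the snapshot, replay only the tail.
        return self.replay_from(self.snapshot)
```

A snapshot also bounds journal size: entries covered by the snapshot can be discarded.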
5. Basic fault tolerance
- Create a fault tolerant journal
- If the master crashes
- Start a new master
- Replay the journal
- Start serving clients
- The system will be unavailable during this time
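The recovery steps above reduce to one function: start a fresh master, replay the surviving journal, then serve. The entry format and function name are illustrative assumptions; the point is that clients wait for the whole replay, which is the unavailability window.

```python
# A minimal sketch of cold recovery, assuming the journal itself
# survives the crash (it is the fault-tolerant piece).
def recover_master(journal_entries):
    """Start a new master: replay the journal, then serve clients."""
    state = {}
    for key, value in journal_entries:   # replay in total order
        state[key] = value
    return state                          # only now can clients be served
```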
6. Basic high availability
- Run multiple masters
- A primary master will serve requests
- Secondary master(s) will replicate the state of the primary master, and take over in case of failure
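A hot standby avoids the cold-restart replay: the secondary continuously tails the same journal, so at failover it is already caught up. This is a sketch under assumed names, not Alluxio's replication code.

```python
# A minimal sketch of a secondary master tailing the primary's journal.
class Secondary:
    def __init__(self):
        self.state, self.applied = {}, 0

    def tail(self, log):
        # Apply any journal entries not yet seen, in log order.
        for key, value in log[self.applied:]:
            self.state[key] = value
        self.applied = len(log)

    def take_over(self):
        # Already caught up, so failover is fast: just start serving.
        return self.state
```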
8. Problems to solve
- Ensure a single primary master is running at all times
- Journal needs to be
- Fault tolerant
- Must agree on a valid order of journal entries
- Consensus
10. Ensure a single primary master is running at a time
- Leader election using a ZooKeeper recipe
- Apache ZooKeeper is an open-source server which enables highly reliable distributed coordination
- File-system like abstraction built on top of an Atomic Broadcast (consensus) protocol
- Run on a cluster of nodes to provide fault tolerance/high availability
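The standard ZooKeeper leader-election recipe: each candidate creates an ephemeral *sequential* znode, the holder of the lowest sequence number is leader, and every other candidate watches only its immediate predecessor (avoiding a herd effect). The sketch below simulates the recipe in memory rather than calling a real ZooKeeper cluster; all names are illustrative.

```python
# In-memory simulation of the ZooKeeper leader-election recipe.
import itertools

class Election:
    def __init__(self):
        self.seq = itertools.count()
        self.nodes = {}                      # candidate -> sequence number

    def join(self, candidate):
        # Create an ephemeral sequential node for this candidate.
        self.nodes[candidate] = next(self.seq)

    def leave(self, candidate):
        # A lost session deletes the ephemeral node automatically.
        self.nodes.pop(candidate)

    def leader(self):
        # The lowest sequence number wins.
        return min(self.nodes, key=self.nodes.get)

    def watches(self, candidate):
        # Watch only the immediate predecessor, not the leader node,
        # so a single failure wakes a single watcher.
        lower = [c for c in self.nodes if self.nodes[c] < self.nodes[candidate]]
        return max(lower, key=self.nodes.get) if lower else None
```

When the leader's ephemeral node disappears, its watcher re-checks and becomes the new leader.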
11. UFS Journal
- Write journal entries to the UFS
- Use the availability / fault tolerance / consistency guarantees of the UFS
- HDFS
12. Does leader election solve all our problems?
- Not quite
- Due to asynchrony, two nodes may believe they are leader at the same time
- Concurrent writes to the journal
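A standard defense against these concurrent writes is an epoch (fencing token): each elected leader gets a higher epoch, and the journal rejects appends carrying a stale one. This is the general technique, sketched with assumed names, not a quote of Alluxio's code.

```python
# A minimal sketch of epoch-based fencing on the journal.
class FencedJournal:
    def __init__(self):
        self.epoch = 0      # highest leader epoch seen so far
        self.log = []

    def append(self, epoch, entry):
        if epoch < self.epoch:
            return False     # stale leader: its election was superseded
        self.epoch = epoch
        self.log.append(entry)
        return True
```

Even if the deposed leader still believes it is leader, its writes carry the old epoch and are refused.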
14. Issues
- Relies on multiple systems
- Each having their own fault tolerance/availability models
- More complicated
- Different UFS have different consistency models and performance characteristics
- May not be efficient for appending log entries
15. Additional details on the file system metadata
- RocksDB (optional)
- Log-structured merge tree
- Efficient inserts
- Key-value store
- Inode tree as a key-value map
- Efficient snapshots
- Alluxio adds in-memory cache for fast reads
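The "inode tree as a key-value map" idea can be sketched with two assumed keyspaces: an edge map `(parent id, child name) -> child id` and an inode map `id -> metadata`, so each path component resolves with one point lookup. This layout is illustrative, not Alluxio's exact RocksDB schema.

```python
# A minimal sketch of an inode tree stored as flat key-value maps.
import itertools

class InodeStore:
    def __init__(self):
        self.ids = itertools.count()
        self.edges = {}                                  # (parent_id, name) -> child_id
        self.inodes = {next(self.ids): {"dir": True}}    # id 0 is the root

    def create(self, parent_id, name, meta):
        child_id = next(self.ids)
        self.edges[(parent_id, name)] = child_id  # one key-value insert per edge
        self.inodes[child_id] = meta              # one key-value insert per inode
        return child_id

    def lookup(self, path):
        # Resolve a path one component at a time via point lookups.
        inode_id = 0
        for name in path.strip("/").split("/"):
            inode_id = self.edges[(inode_id, name)]
        return self.inodes[inode_id]
```

In an LSM-tree store like RocksDB both maps become cheap sequential inserts, and a read cache in front serves hot lookups from memory.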
17. Raft - replicated state machine
- Clients interact with the state machine as if it were a single instance (linearizability)
- Send commands and receive responses
- Fault tolerant and highly available
https://ratis.apache.org
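The replicated-state-machine idea in miniature: an entry is applied to each replica's state machine only once a majority has stored it, so every replica's key-value state converges on the same committed prefix. This is a toy model of commitment only, not the full Raft protocol (no terms, no leader election, no log repair); all names are assumptions.

```python
# A minimal sketch of majority-commit-then-apply.
class Replica:
    def __init__(self, up=True):
        self.up = up
        self.log, self.state, self.applied = [], {}, 0

    def append(self, entries):
        self.log = list(entries)           # follower stores the leader's entries
        return True                        # acknowledgement

    def apply_to(self, commit_index):
        while self.applied < commit_index: # apply committed entries in order
            key, value = self.log[self.applied]
            self.state[key] = value
            self.applied += 1

def replicate(leader_log, replicas, quorum):
    # Append to every reachable follower; count acknowledgements.
    acked = sum(1 for r in replicas if r.up and r.append(leader_log))
    if acked < quorum:
        return None                        # no majority: nothing is committed
    for r in replicas:
        if r.up:
            r.apply_to(len(leader_log))
    return len(leader_log)                 # new commit index
```

A down replica keeps an empty state machine and catches up later by replaying the committed log, which is exactly the journal-recovery story from earlier slides.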
18. The Alluxio journal and a replicated state machine
- Raft simplifications
- Handles snapshotting and recovery, the journal log, etc.
- Replicated state machine = key-value store of the file-system metadata
- Raft colocated with Alluxio masters
19. Primary master protocol
- Raft ensures a consistent and highly available journal
- Still want a single primary master
- Update the UFS and Alluxio workers
- Serve clients
- Use the leader election built into Raft, plus an additional coordination layer
21. Advantages
- Simplicity
- No external systems (Raft colocated with masters)
- Raft takes care of logging, snapshotting, recovery, etc.
- Performance
- Journal stored directly on masters
- RocksDB key-value store + cache