Big Data Bellevue Meetup
May 19, 2022
For more Alluxio events: https://alluxio.io/events/
Speaker: Bin Fan (Founding Engineer & VP of Open Source, Alluxio)
Today, data engineering in modern enterprises has become increasingly complex and resource-consuming, particularly because (1) organizational data is often distributed across data centers, cloud regions, or even cloud providers, and (2) the complexity of the big data stack has grown rapidly over the past few years with an explosion of big-data analytics and machine-learning engines (MapReduce, Hive, Spark, Presto, TensorFlow, and PyTorch, to name a few).
To address these challenges, it is critical to provide a single, logical namespace that federates different storage services, on-prem or cloud-native, to abstract away data heterogeneity, while providing data locality to improve computation performance. Bin Fan will share his observations and lessons learned in designing, architecting, and implementing such a system, the Alluxio open-source project, since 2015.
Alluxio originated from the UC Berkeley AMPLab (originally called Tachyon) and was initially proposed as a daemon service enabling Spark to share RDDs across jobs for performance and fault tolerance. Today, it has become a general-purpose, high-performance, and highly available distributed file system providing a generic data service that abstracts away complexity in data and I/O. Many companies and organizations, such as Uber, Meta, Tencent, TikTok, and Shopee, use Alluxio in production as a building block in their data platforms to create a data abstraction and access layer. We will talk about the journey of this open-source project, especially its design challenges in tiered metadata storage (based on RocksDB), the embedded replicated state machine for HA (based on Raft), and the evolution of its RPC framework (based on gRPC).
Building a Distributed File System for the Cloud-Native Era
1. Building a Distributed File System for the Cloud-Native Era
Bin Fan, 05-19-2022 @ Big Data Bellevue
2. About Me
Bin Fan (https://www.linkedin.com/in/bin-fan/)
● Founding Engineer, VP Open Source @ Alluxio
● Alluxio PMC Co-Chair, Presto TSC/committer
● Email: binfan@alluxio.com
● PhD in CS @ Carnegie Mellon University
3. Alluxio Overview
● Originally a research project (Tachyon) in the UC Berkeley AMPLab, led by then-PhD student Haoyuan Li (Alluxio founder and CEO)
● Backed by top VCs (e.g., Andreessen Horowitz) with $70M raised in total; Series C ($50M) announced in 2021
● Deployed in production at large scale at Facebook, Uber, Microsoft, Tencent, TikTok, and others
● More than 1,100 contributors on GitHub; in 2021, more than 40% of commits were contributed by community users
● The 9th most critical Java-based open-source project on GitHub, per Google/OpenSSF [1]
[1] Google Comes Up With A Metric For Gauging Critical Open-Source Projects
9. Companies Using Alluxio
(Logo wall of adopters, grouped by category: internet, public cloud providers, e-commerce, technology, financial services, telco & media, and others.)
10. The Evolution from Hadoop to the Cloud-Native Era
Topology
● On-prem Hadoop → cloud-native, multi- or hybrid-cloud, multi-datacenter
Computation
● MR/Spark → Spark, Presto, Hive, TensorFlow, PyTorch, …
● More mature frameworks (fewer OOMs, etc.)
Data access pattern
● Sequential reads (e.g., scanning) on unstructured files → ad-hoc reads into structured/columnar data
● Hundreds to thousands of big files → millions of small files
11. The Evolution from Hadoop to the Cloud-Native Era
Data Storage
● On-prem, colocated HDFS → S3 and other object stores (possibly across regions like us-east & us-west), with legacy on-prem HDFS still in service
Resource/Job Orchestration
● YARN → K8s
○ Data locality is no longer a scheduling focus
12. Analytics & AI in the Hybrid and Multi-Cloud Era
13. Unified Namespace
Providing users a single and consistent logical filesystem
• Extension: a single logical Alluxio path backed by multiple storage systems
○ Example customized data policy: migrate data older than 7 days from HDFS to S3
(Diagram: Reports and Sales data mounted into the Alluxio namespace from under stores such as hdfs://host:port/directory/.)
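The mount-point idea above can be made concrete with a minimal Java sketch, assuming the Alluxio 2.x client API (FileSystem.Factory.get(), FileSystem#mount); the endpoints, bucket names, and paths are hypothetical, and in practice mounts are often created via the Alluxio CLI or at deployment time.

import java.util.List;
import alluxio.AlluxioURI;
import alluxio.client.file.FileSystem;
import alluxio.client.file.URIStatus;

public class UnifiedNamespaceExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.Factory.get();

    // Mount an on-prem HDFS directory and an S3 bucket under one logical namespace.
    // (Hypothetical endpoints and bucket names.)
    fs.mount(new AlluxioURI("/reports"), new AlluxioURI("hdfs://namenode:8020/reports"));
    fs.mount(new AlluxioURI("/sales"), new AlluxioURI("s3://example-bucket/sales"));

    // Applications now see a single logical filesystem rooted at /.
    List<URIStatus> entries = fs.listStatus(new AlluxioURI("/"));
    for (URIStatus status : entries) {
      System.out.println(status.getPath());
    }
  }
}

With both mounts in place, a policy such as "migrate data older than 7 days from HDFS to S3" only changes which under store backs a path; the logical Alluxio path seen by applications stays the same.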
14. Data Locality with Scale-out Workers
Local I/O performance for remote data with intelligent multi-tiering
● Hot / Warm / Cold data tiered across RAM / SSD / HDD
● Read & write buffering, transparent to the application
● Policies for pinning, promotion/demotion, TTL
(Diagram: model training, big data ETL, and big data query workloads spanning on-premises and public cloud storage.)
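As a rough illustration of the pinning policy above, here is a minimal Java sketch assuming the Alluxio 2.x client API (FileSystem#setAttribute with SetAttributePOptions); the dataset path is hypothetical, and TTL/promotion settings are omitted.

import alluxio.AlluxioURI;
import alluxio.client.file.FileSystem;
import alluxio.grpc.SetAttributePOptions;

public class PinningExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.Factory.get();

    // Pin a hot dataset so workers keep it resident in the upper tiers (e.g., RAM)
    // rather than evicting it under space pressure.
    AlluxioURI hotData = new AlluxioURI("/datasets/hot-training-set");
    fs.setAttribute(hotData, SetAttributePOptions.newBuilder().setPinned(true).build());

    // TTL and promotion/demotion behavior are likewise policy-driven;
    // those options are omitted in this sketch.
  }
}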
15. Metadata Locality with Scalable Masters
Synchronization of changes across clusters
● Metadata synchronization: when the under store mutates (e.g., the old file at /file1 is replaced by a new file at /file1), the Alluxio master picks up the change
● Master metadata is backed by RocksDB
(Diagram: on-premises and public-cloud workloads, model training, big data ETL, and big data query, all seeing a consistent view of /file1 after the mutation.)
16. Case study: Tencent (HKG: 0700)
Tencent has implemented a 1000-node Alluxio cluster and designed a scalable, robust, and performant architecture to speed up Ceph storage for game AI training.
https://www.alluxio.io/blog/thousand-node-alluxio-cluster-powers-game-ai-platform-production-case-study-from-tencent/

              Jobs     Failures   Failure rate
alluxio-fuse  250000   1848       0.73%
ceph-fuse     250000   7070       2.8%
17. Case study: Top Online Travel Shopping in the US
This company consolidated its compute platform onto a main data lake in one AWS region. Previously, this required copying data daily from satellite data lakes in other regions to the main region for joint queries. That process is expensive and time-consuming and, more importantly, may yield inconsistent query results. Alluxio provides a consistent view of, and "cheap" access to, remote AWS data.
(Diagram: Hive queries from Team A and Team C against the main data lake in us-east-1, with satellite data lakes in us-west-1 and us-west-2 mounted into Alluxio.)
19. Metadata Storage Challenges
• Storing the raw metadata becomes a problem with a large number of files
• On average, each file takes 1KB of on-heap storage
• 1 billion files would take 1 TB of heap space!
• A typical JVM runs with < 64GB of heap space
• GC becomes a big problem when using larger heaps
20. Metadata Serving Challenges
• Common file operations (e.g., getStatus, create) need to be fast
• On-heap data structures excel in this case
• Operations need to be optimized for high concurrency
• Generally many readers and few writers for large-scale analytics
• The metadata service also needs to sustain high load
• A cluster of 100 machines can easily house over 5k concurrent clients!
• Connection life cycles need to be managed well
• Connection handshake is expensive
• Holding an idle connection is also detrimental
24. Working with RocksDB
• Abstract the metadata storage layer
• Redesign the data structure representation of the filesystem tree
• Each inode is represented by a numerical ID
• Edge table maps <parent ID, child name> to <ID of child>, e.g., <1, "foo"> → 2
• Inode table maps <ID> to <metadata blob of inode>, e.g., 2 → proto
• The two-table design provides good performance for common operations
• One lookup for listing, using a prefix scan
• Path-depth lookups for tree traversal
• Constant number of inserts for updates/deletes/creates
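To make the two-table layout above concrete, below is a rough, self-contained Java sketch using the RocksDB Java API. The column-family names, key encoding, and helper methods are illustrative assumptions rather than Alluxio's actual schema.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.rocksdb.ColumnFamilyDescriptor;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.DBOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;

public class TwoTableMetadataStore implements AutoCloseable {
  static { RocksDB.loadLibrary(); }

  private final RocksDB db;
  private final ColumnFamilyHandle edges;   // <parent ID, child name> -> child ID
  private final ColumnFamilyHandle inodes;  // inode ID -> serialized metadata (e.g., a proto)

  public TwoTableMetadataStore(String path) throws Exception {
    List<ColumnFamilyDescriptor> descriptors = Arrays.asList(
        new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY),
        new ColumnFamilyDescriptor("edges".getBytes(StandardCharsets.UTF_8)),
        new ColumnFamilyDescriptor("inodes".getBytes(StandardCharsets.UTF_8)));
    List<ColumnFamilyHandle> handles = new ArrayList<>();
    DBOptions options = new DBOptions()
        .setCreateIfMissing(true)
        .setCreateMissingColumnFamilies(true);
    db = RocksDB.open(options, path, descriptors, handles);
    edges = handles.get(1);
    inodes = handles.get(2);
  }

  // Edge key = 8-byte big-endian parent ID followed by the child name.
  private static byte[] edgeKey(long parentId, String childName) {
    byte[] name = childName.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(8 + name.length).putLong(parentId).put(name).array();
  }

  // Creating a file is a constant number of writes: one edge entry plus one inode entry.
  public void create(long parentId, String name, long id, byte[] metadata) throws Exception {
    db.put(edges, edgeKey(parentId, name), ByteBuffer.allocate(8).putLong(id).array());
    db.put(inodes, ByteBuffer.allocate(8).putLong(id).array(), metadata);
  }

  // Resolving a path takes one edge lookup per component (path-depth lookups).
  public long resolve(long rootId, String path) throws Exception {
    long current = rootId;
    for (String component : path.split("/")) {
      if (component.isEmpty()) continue;
      byte[] child = db.get(edges, edgeKey(current, component));
      if (child == null) throw new IllegalArgumentException("not found: " + component);
      current = ByteBuffer.wrap(child).getLong();
    }
    return current;
  }

  // Listing a directory is a single prefix scan over the edge table.
  public List<Long> list(long dirId) {
    byte[] prefix = ByteBuffer.allocate(8).putLong(dirId).array();
    List<Long> children = new ArrayList<>();
    try (RocksIterator it = db.newIterator(edges)) {
      for (it.seek(prefix); it.isValid(); it.next()) {
        byte[] key = it.key();
        if (key.length < 8 || ByteBuffer.wrap(key).getLong() != dirId) break;
        children.add(ByteBuffer.wrap(it.value()).getLong());
      }
    }
    return children;
  }

  @Override
  public void close() {
    db.close();
  }
}

The sketch mirrors the properties listed above: create issues a constant number of writes, resolve performs one edge lookup per path component, and list is a single prefix scan over the edge table.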
25. Effects of the Inode Cache
• Generally can store up to 10M inodes
• Caching the top levels of the filesystem tree greatly speeds up read performance
• 20-50% performance loss when addressing a filesystem tree that does not mostly fit into memory
• Writes can be buffered in the cache and are asynchronously flushed to RocksDB
• No requirement for durability - that is handled by the journal
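Below is a highly simplified, conceptual Java sketch of the write-back idea described above (not Alluxio's actual cache implementation): dirty inodes are buffered in an on-heap map and flushed to RocksDB in the background, since durability is already provided by the journal.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

public class WriteBackInodeCache {
  private final Map<Long, byte[]> dirty = new ConcurrentHashMap<>();
  private final BiConsumer<Long, byte[]> rocksDbWriter;  // e.g., put into the inode table
  private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

  public WriteBackInodeCache(BiConsumer<Long, byte[]> rocksDbWriter) {
    this.rocksDbWriter = rocksDbWriter;
    // Periodically flush buffered writes to RocksDB in the background.
    flusher.scheduleWithFixedDelay(this::flush, 1, 1, TimeUnit.SECONDS);
  }

  // Writes only touch the on-heap buffer; the journal has already made them durable.
  public void put(long inodeId, byte[] metadata) {
    dirty.put(inodeId, metadata);
  }

  // Reads check the buffer first so they always see the latest un-flushed write.
  public byte[] getIfBuffered(long inodeId) {
    return dirty.get(inodeId);
  }

  private void flush() {
    dirty.forEach((id, metadata) -> {
      rocksDbWriter.accept(id, metadata);
      dirty.remove(id, metadata);  // only drop the entry if it was not overwritten meanwhile
    });
  }
}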
27. Built-in Fault Tolerance
• Alluxio Masters are run as a quorum for journal fault tolerance
• Metadata can be recovered, solving the durability problem
• This was previously done using an external fault-tolerant storage system
• Alluxio Masters leverage the same quorum to elect a leader
• Enables hot standbys for rapid recovery in case of single node failure
28. RAFT
• https://raft.github.io/
• Raft is a consensus algorithm that is designed to be easy to understand. It is equivalent to Paxos in fault tolerance and performance.
• Implemented by https://github.com/atomix/copycat
29. A New HA Mode with Self-managed Services
• Consensus achieved internally
• The leading master commits state changes
• Benefits
• Local disk for journal
• Challenges
• Performance tuning
(Diagram: a leading master replicating state changes via Raft to two standby masters.)
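The flow in the diagram above can be sketched with a few hypothetical Java interfaces (for illustration only; not Copycat's or Alluxio's actual API): the leading master proposes every metadata mutation to the Raft log, and once the entry is committed by a quorum, each master, leader and standbys alike, applies it to its local state machine.

import java.util.concurrent.CompletableFuture;

// Hypothetical interfaces for illustration only.
interface RaftLog {
  // Replicates the entry to a quorum and completes once it is committed.
  CompletableFuture<Long> propose(byte[] entry);
}

interface MetadataStateMachine {
  // Applies a committed journal entry to local state (e.g., the RocksDB-backed inode store).
  void apply(long index, byte[] entry);
}

class LeadingMaster {
  private final RaftLog log;
  private final MetadataStateMachine stateMachine;

  LeadingMaster(RaftLog log, MetadataStateMachine stateMachine) {
    this.log = log;
    this.stateMachine = stateMachine;
  }

  // Every state change goes through the replicated log before it is applied locally,
  // so standby masters replay the same sequence of entries and stay in sync.
  CompletableFuture<Void> applyStateChange(byte[] journalEntry) {
    return log.propose(journalEntry)
        .thenAccept(index -> stateMachine.apply(index, journalEntry));
  }
}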
31. Advantages
- Simplicity
  - No external systems (Raft colocated with masters)
  - Raft takes care of logging, snapshotting, recovery, etc.
- Performance
  - Journal stored directly on masters
  - RocksDB key-value store + cache