ApacheCon 2021
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Lu Qiu
Bin Fan
Alluxio’s capabilities as a Data Orchestration framework have encouraged users to onboard more of their data-driven applications to an Alluxio-powered data access layer. Driven by strong interest from our open-source community, the core team of Alluxio started to re-design an efficient and transparent way for users to leverage data orchestration through the POSIX interface. This effort has made significant progress in collaboration with engineers from Microsoft, Alibaba, and Tencent. In particular, we have introduced a new JNI-based FUSE implementation to support POSIX data access and created a more efficient way to integrate Alluxio with the FUSE service, along with many improvements to related data operations, such as a more efficient distributedLoad and optimizations for listing or summarizing directories containing massive numbers of files, which are common in model training. We will also share our engineering lessons and the roadmap for future releases to support machine learning applications.
2. Open Source Started From UC Berkeley AMPLab
1000+ contributors &
growing
5000+ Git Stars
Apache 2.0 Licensed
Million+ Downloads
GitHub’s Top 100 Most
Valuable Repositories
Out of 96 Million
Join the conversation
alluxio.io/slack
#9 among the most critical open
source Java projects
(Google OpenSSF)
3. ALLUXIO
COMPANIES USING ALLUXIO
INTERNET
PUBLIC CLOUD PROVIDERS
GENERAL
E-COMMERCE
OTHERS
TECHNOLOGY FINANCIAL SERVICES
TELCO & MEDIA
LEARN MORE
4. Bin Fan
● Founding Engineer, VP Open Source @ Alluxio
● Email: binfan@alluxio.com
● PhD in CS @ Carnegie Mellon University
5. Lu Qiu
● Machine Learning Engineer @ Alluxio
● Email: lu@alluxio.com
● Master's in Data Science @ GWU
● Responsible for integrating Alluxio with machine learning/deep learning
● Areas: Alluxio fault tolerance, journal system, metrics system, POSIX API, and Alluxio integration with the cloud
6. Agenda
● Training pain points
● Traditional data solutions for cloud training
● Accelerate cloud training with Alluxio
● Alluxio use cases
16. Good Model + Enough Data + Cloud Distributed Training + High GPU Utilization Rate
= Good Performance, Fast Speed, Low Cost
17. More powerful GPUs require higher data throughput
ResNet-50 Model Training Speed (Images/Second)
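To see why GPU speed translates into a data-throughput requirement, a rough back-of-the-envelope estimate helps (all numbers below are illustrative assumptions, not values from the chart):

```python
# Rough estimate of the read throughput a single GPU demands.
# All numbers are assumptions for illustration, not from the chart.
images_per_sec = 2000      # assumed per-GPU ResNet-50 training speed
avg_image_kb = 150         # assumed average encoded image size

throughput_mb_s = images_per_sec * avg_image_kb / 1024
print(round(throughput_mb_s), "MB/s per GPU")  # roughly 293 MB/s per GPU
```

At this rate, a multi-GPU node can easily demand more than a gigabyte per second of input data, which remote object stores struggle to sustain.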
18. Data Pain Points for Cloud Training
● High data throughput requirement (ESSENTIAL)
● Separation between data and training (ESSENTIAL)
● Data stability (ESSENTIAL)
19. Data Requirements
● Each training machine has access to training data (ESSENTIAL)
● Low latency and high throughput when accessing data (ESSENTIAL)
● Strong data stability (ESSENTIAL)
● High GPU utilization rate (ESSENTIAL)
21. Solution 1: Direct Copy
[Diagram: each GPU instance downloads a full copy of the data before training]
22. Solution 1: Direct Copy
Requirements: access to data; low latency and high throughput; strong data stability; high GPU utilization rate
● Exceeds the storage request rate
● A disk/file error can cause the whole training job to error out
● Data must be copied before training, leaving GPUs idle
23. Solution 2: Direct Access UFS
[Diagram: GPU instances get data on demand from the under storage]
24. Solution 2: Direct Access UFS
Requirements: access to data; low latency and high throughput; strong data stability; high GPU utilization rate
● Bound by network I/O: high latency, low GPU utilization rate
● Exceeding the storage request rate can cause data access to error out
28. Accessing Remote/Distributed Data as Local Directories
Connecting to:
• HDFS
• Amazon S3
• Azure
• Google Cloud
• Ceph
• NFS
• Many more
30. One Click to Mount UFS to Alluxio
All the data located in s3://<bucket_name>/ will be cached by Alluxio, providing data locality for training jobs.
$ bin/alluxio fs mount /s3 s3://<bucket_name>/ --option
aws.accessKeyId=<access_key> --option aws.secretKey=<secret_key>
One Click to Load All Training Data into Alluxio
$ bin/alluxio fs distributedLoad /s3
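Once the mounted path is also exposed through Alluxio's POSIX (FUSE) interface, training code can treat it as an ordinary local directory. A minimal sketch of such a dataset follows; the mount point `/mnt/alluxio/s3` and the class name are illustrative assumptions, not part of Alluxio's API:

```python
import os

class ImageFolderDataset:
    """Minimal dataset that lists every file under a directory tree.

    With the Alluxio FUSE mount, `root` can point at the mounted UFS
    path (e.g. /mnt/alluxio/s3 -- a hypothetical mount point), so reads
    are served from the Alluxio cache once the data has been loaded.
    """

    def __init__(self, root):
        self.paths = sorted(
            os.path.join(dirpath, name)
            for dirpath, _, names in os.walk(root)
            for name in names
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # A real training loader would decode the image here;
        # this sketch just returns the raw bytes.
        with open(self.paths[idx], "rb") as f:
            return f.read()
```

The point is that no Alluxio-specific client code is needed: the same dataset class works against a local disk or the FUSE mount.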
31. Caching Data Dynamically during Training
● Read data locally
● Read data from nearby Alluxio worker nodes
● Read data from the UFS and cache it in Alluxio to accelerate future accesses
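The read preference above can be sketched as a simple tiered lookup. This is an illustration of the policy only; the function and tier names are assumptions, not Alluxio's internal API:

```python
def tiered_read(path, local_cache, remote_cache, ufs):
    """Return (data, tier) following the read order described above:
    local cache first, then a nearby Alluxio worker, then the UFS.
    A UFS read populates the cache so future reads are faster."""
    if path in local_cache:
        return local_cache[path], "local"
    if path in remote_cache:
        return remote_cache[path], "remote_worker"
    data = ufs[path]            # slow path: read from under storage
    local_cache[path] = data    # cache for future accesses
    return data, "ufs"
```

The first epoch of training pays the UFS cost once; subsequent epochs hit the cache tiers.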
32. Speed up Training with Preload + Dynamic Cache
Solution 1 (Direct Copy): copy data, then train
Solution 2 (Direct Access UFS): train while fetching all data over the network
Solution 3 (Alluxio): pre-cache data, then train while dynamically caching the rest
33. Data Stability
● Multiple replicas of data
● Auto-retry mechanism
● Alluxio fault-tolerance mechanisms:
  ○ Master high availability for metadata safety
  ○ Worker high availability for data safety
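The auto-retry behavior can be sketched as a simple wrapper around a read function; the attempt count and backoff schedule below are illustrative assumptions, not Alluxio's actual retry policy:

```python
import time

def read_with_retry(read_fn, path, attempts=3, backoff_s=0.1):
    """Retry transient I/O errors with exponential backoff;
    re-raise the error after the final attempt."""
    for i in range(attempts):
        try:
            return read_fn(path)
        except IOError:
            if i == attempts - 1:
                raise
            time.sleep(backoff_s * (2 ** i))
```

Retrying at the data layer means a transient disk or network error no longer kills the whole training job.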
34. Solution 3: Alluxio Distributed Caching
● Supports multiple data sources and multiple training frameworks
● Supports one-click data preload and dynamic caching during training, increasing the GPU utilization rate
● High data stability, fewer I/O errors
● Exposes remote/distributed data as local directories, so data scientists can focus on training logic instead of worrying about data
35. Solution 3: Alluxio Distributed Caching
Requirements: access to data; low latency and high throughput; strong data stability; high GPU utilization rate
37. Alluxio @ Microsoft
Task
● More than 400 tasks need to read data from Azure and write data to Azure
● The total data size is larger than 1 TB
Previously, they used Solution 1: directly copying data from the cloud to the training nodes.
Challenges
● Easy to exceed the request rate: Azure blobfuse requires downloading data from Azure to local storage before starting the tasks, and uploading data back to Azure after finishing the tasks
● Large amounts of data input and output easily cause I/O errors
● GPUs sit idle while waiting for I/O operations
https://www.alluxio.io/resources/videos/speed-up-large-scale-ml-dl-offline-inference-job-with-alluxio/
38. Alluxio @ Microsoft: Alluxio Speeds up Training by 18%
Reduce I/O wait time and improve training performance:
● Use data pre-caching to improve performance
● Dynamically cache data during training
● Share data across multiple tasks
Streaming reads disperse I/O requests and avoid exceeding the cloud storage request limit.
Auto retry reduces the I/O error rate.
40. Alluxio @ BOSS Zhipin
Task
● Use Spark/Flink to process data
● Model training on top of the processed data
Previous solution
● Spark/Flink + Ceph + model training
Problems
● Writing temporary files into Ceph causes high pressure on Ceph
● Ceph read/write pressure cannot be controlled, making the cluster unstable
Solution with Alluxio
Spark/Flink + Alluxio + Ceph + Alluxio + model training
● Alluxio supports multiple data sources and multiple model training frameworks
● Controls the read/write rate from Alluxio to Ceph
● Multiple independent Alluxio clusters support multi-tenancy, customized configuration, and access control
https://www.alluxio.io/resources/videos/alluxio-k8s-cloud-native-ai-environment-bosszp-chinese/
41. Alluxio @ BOSS Zhipin
42. Alluxio @ Momo
Momo runs multiple Alluxio clusters comprising thousands of Alluxio nodes and storing more than 100 TB of data. Alluxio serves Momo's search and training tasks, and Momo continues to develop new use cases for Alluxio.
● Alluxio supports multiple under storages and multiple compute/training frameworks
● Accelerates compute/training tasks
● Reduces the metadata and data overhead on the under storage
https://www.alluxio.io/resources/videos/ml-and-query-acceleration-at-momo-with-alluxio-chinese/
43. Alluxio @ Momo
Training on billions of images
- 2 billion small files
- PyTorch + Alluxio + Ceph
- Reduced metadata and data interactions with Ceph to improve performance
44. Alluxio @ Momo
Speed up recommendation-system model loading
● Upload the recommendation-system model to HDFS
● Distributed-load the model from HDFS into Alluxio
● Recommendation-system nodes load the model from Alluxio concurrently
Speed up loading indexes for the ANN system
● Create indexes
● Upload indexes to HDFS (or an object store)
● Nodes load indexes from Alluxio
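The concurrent model-loading step above can be sketched as follows. The shard paths and thread count are illustrative assumptions; in a real deployment each node would read its shard from the Alluxio FUSE mount or client after the distributedLoad has warmed the cache:

```python
from concurrent.futures import ThreadPoolExecutor

def load_shard(path):
    # Each node/worker reads one model shard. Because the model was
    # distributedLoad-ed into Alluxio first, these reads hit the
    # Alluxio cache instead of going back to HDFS.
    with open(path, "rb") as f:
        return f.read()

def load_model(shard_paths, workers=8):
    # Load all shards concurrently, preserving shard order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_shard, shard_paths))
```

Loading shards in parallel against the cache is what makes the concurrent-load step fast; the same reads issued directly against HDFS would contend on the NameNode and DataNodes.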
45. Alluxio may help you if you have
● Distributed training
● Large amounts of data (>= TB) or large numbers of small files/images
● Network I/O that cannot satisfy GPU requirements
● Multiple data sources and multiple training/compute frameworks
● A need to keep the under storage stable and avoid exceeding request rate limits
● Data shared between multiple training tasks