ApacheCon 2021
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Lu Qiu
Bin Fan
Alluxio’s capabilities as a Data Orchestration framework have encouraged users to onboard more of their data-driven applications to an Alluxio-powered data access layer. Driven by strong interest from our open-source community, the core team of Alluxio started to re-design an efficient and transparent way for users to leverage data orchestration through the POSIX interface. This effort has made significant progress in collaboration with engineers from Microsoft, Alibaba, and Tencent. In particular, we have introduced a new JNI-based FUSE implementation to support POSIX data access and created a more efficient way to integrate Alluxio with the FUSE service, along with many improvements to related data operations, such as a more efficient distributedLoad and optimizations for listing or summarizing directories containing massive numbers of files, which are common in model training. We will also share our engineering lessons and the roadmap for future releases to support machine learning applications.
2. Open Source Started From UC Berkeley AMPLab
1000+ contributors &
growing
5000+ Git Stars
Apache 2.0 Licensed
Million+ Downloads
GitHub’s Top 100 Most
Valuable Repositories
Out of 96 Million
Join the conversation
alluxio.io/slack
#9 among the most critical open
source Java projects
(Google OpenSSF)
3. ALLUXIO
COMPANIES USING ALLUXIO
INTERNET
PUBLIC CLOUD PROVIDERS
GENERAL
E-COMMERCE
OTHERS
TECHNOLOGY FINANCIAL SERVICES
TELCO & MEDIA
LEARN MORE
4. Bin Fan
● Founding Engineer, VP Open Source @ Alluxio
● Email: binfan@alluxio.com
● PhD in CS @ Carnegie Mellon University
5. Lu Qiu
● Machine Learning Engineer @ Alluxio
● Email: lu@alluxio.com
● Master's in Data Science @ GWU
● Responsible for integrating Alluxio with machine learning/deep learning
● Areas: Alluxio fault tolerance, journal system, metrics system, POSIX API, and Alluxio integration with the cloud
6. Agenda
● Training pain points
● Traditional data solutions for cloud training
● Accelerate cloud training with Alluxio
● Alluxio use cases
16. Good Model + Enough Data + Cloud Distributed Training + High GPU Utilization Rate
= Good Performance, Fast Speed, Low Cost
17. More powerful GPUs require higher data throughput
ResNet-50 Model Training Speed (Images/Second)
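To see why GPU speed translates into a data-throughput requirement, a rough back-of-the-envelope estimate helps (all numbers below are illustrative assumptions, not values from the chart):

```python
# Rough estimate of the read throughput a single GPU demands.
# All numbers are assumptions for illustration, not from the chart.
images_per_sec = 2000      # assumed per-GPU ResNet-50 training speed
avg_image_kb = 150         # assumed average encoded image size

throughput_mb_s = images_per_sec * avg_image_kb / 1024
print(round(throughput_mb_s), "MB/s per GPU")  # roughly 293 MB/s per GPU
```

At this rate, a multi-GPU node can easily demand more than a gigabyte per second of input data, which remote object stores struggle to sustain.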
18. Data Pain Points for Cloud Training
● High data throughput requirement (ESSENTIAL)
● Separation between data and training (ESSENTIAL)
● Data stability (ESSENTIAL)
19. Data Requirements
● Each training machine has access to training data (ESSENTIAL)
● Low latency and high throughput when accessing data (ESSENTIAL)
● Strong data stability (ESSENTIAL)
● High GPU utilization rate (ESSENTIAL)
21. Solution 1: Direct Copy
[Diagram: each GPU instance downloads a full copy of the data before training]
22. Solution 1: Direct Copy
Requirements: access to data; low latency and high throughput; strong data stability; high GPU utilization rate
● Exceeds the storage request rate
● A disk/file error can cause the whole training job to error out
● Data must be copied before training, leaving GPUs idle
23. Solution 2: Direct Access UFS
[Diagram: GPU instances get data on demand from the under storage]
24. Solution 2: Direct Access UFS
Requirements: access to data; low latency and high throughput; strong data stability; high GPU utilization rate
● Bound by network I/O: high latency, low GPU utilization rate
● Exceeding the storage request rate can cause data access to error out
28. Accessing Remote/Distributed Data as Local Directories
Connecting to:
• HDFS
• Amazon S3
• Azure
• Google Cloud
• Ceph
• NFS
• Many more
30. One Click to Mount UFS to Alluxio
All the data located in s3://<bucket_name>/ will be cached by Alluxio, providing data locality for training jobs.
$ bin/alluxio fs mount /s3 s3://<bucket_name>/ --option
aws.accessKeyId=<access_key> --option aws.secretKey=<secret_key>
One Click to Load All Training Data into Alluxio
$ bin/alluxio fs distributedLoad /s3
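Once the mounted path is also exposed through Alluxio's POSIX (FUSE) interface, training code can treat it as an ordinary local directory. A minimal sketch of such a dataset follows; the mount point `/mnt/alluxio/s3` and the class name are illustrative assumptions, not part of Alluxio's API:

```python
import os

class ImageFolderDataset:
    """Minimal dataset that lists every file under a directory tree.

    With the Alluxio FUSE mount, `root` can point at the mounted UFS
    path (e.g. /mnt/alluxio/s3 -- a hypothetical mount point), so reads
    are served from the Alluxio cache once the data has been loaded.
    """

    def __init__(self, root):
        self.paths = sorted(
            os.path.join(dirpath, name)
            for dirpath, _, names in os.walk(root)
            for name in names
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # A real training loader would decode the image here;
        # this sketch just returns the raw bytes.
        with open(self.paths[idx], "rb") as f:
            return f.read()
```

The point is that no Alluxio-specific client code is needed: the same dataset class works against a local disk or the FUSE mount.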
31. Caching Data Dynamically during Training
● Read data locally
● Read data from nearby Alluxio worker nodes
● Read data from the UFS and cache it in Alluxio to accelerate future accesses
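The read preference above can be sketched as a simple tiered lookup. This is an illustration of the policy only; the function and tier names are assumptions, not Alluxio's internal API:

```python
def tiered_read(path, local_cache, remote_cache, ufs):
    """Return (data, tier) following the read order described above:
    local cache first, then a nearby Alluxio worker, then the UFS.
    A UFS read populates the cache so future reads are faster."""
    if path in local_cache:
        return local_cache[path], "local"
    if path in remote_cache:
        return remote_cache[path], "remote_worker"
    data = ufs[path]            # slow path: read from under storage
    local_cache[path] = data    # cache for future accesses
    return data, "ufs"
```

The first epoch of training pays the UFS cost once; subsequent epochs hit the cache tiers.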
32. Speed up Training with Preload + Dynamic Cache
Solution 1 (Direct Copy): copy data, then train
Solution 2 (Direct Access UFS): train while fetching all data over the network
Solution 3 (Alluxio): pre-cache data, then train while dynamically caching the rest
33. Data Stability
● Multiple replicas of data
● Auto-retry mechanism
● Alluxio fault-tolerance mechanisms:
  ○ Master high availability for metadata safety
  ○ Worker high availability for data safety
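The auto-retry behavior can be sketched as a simple wrapper around a read function; the attempt count and backoff schedule below are illustrative assumptions, not Alluxio's actual retry policy:

```python
import time

def read_with_retry(read_fn, path, attempts=3, backoff_s=0.1):
    """Retry transient I/O errors with exponential backoff;
    re-raise the error after the final attempt."""
    for i in range(attempts):
        try:
            return read_fn(path)
        except IOError:
            if i == attempts - 1:
                raise
            time.sleep(backoff_s * (2 ** i))
```

Retrying at the data layer means a transient disk or network error no longer kills the whole training job.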
34. Solution 3: Alluxio Distributed Caching
● Supports multiple data sources and multiple training frameworks
● Supports one-click data preload and dynamic caching during training, increasing the GPU utilization rate
● High data stability, fewer I/O errors
● Exposes remote/distributed data as local directories, so data scientists can focus on training logic instead of worrying about data
35. Solution 3: Alluxio Distributed Caching
Requirements: access to data; low latency and high throughput; strong data stability; high GPU utilization rate
37. Alluxio @ Microsoft
Task
● More than 400 tasks need to read data from Azure and write data to Azure
● The total data size is larger than 1 TB
Previously, they used Solution 1: directly copying data from the cloud to the training nodes.
Challenges
● Easy to exceed the request rate: Azure blobfuse requires downloading data from Azure to local storage before starting the tasks, and uploading data back to Azure after finishing the tasks
● Large amounts of data input and output easily cause I/O errors
● GPUs sit idle while waiting for I/O operations
https://www.alluxio.io/resources/videos/speed-up-large-scale-ml-dl-offline-inference-job-with-alluxio/
38. Alluxio @ Microsoft: Alluxio Speeds up Training by 18%
Reduce I/O wait time and improve training performance:
● Use data pre-caching to improve performance
● Dynamically cache data during training
● Share data across multiple tasks
Streaming reads disperse I/O requests and avoid exceeding the cloud storage request limit.
Auto retry reduces the I/O error rate.
40. Alluxio @ BOSS Zhipin
Task
● Use Spark/Flink to process data
● Model training on top of the processed data
Previous solution
● Spark/Flink + Ceph + model training
Problems
● Writing temporary files into Ceph causes high pressure on Ceph
● Ceph read/write pressure cannot be controlled, making the cluster unstable
Solution with Alluxio
Spark/Flink + Alluxio + Ceph + Alluxio + model training
● Alluxio supports multiple data sources and multiple model training frameworks
● Controls the read/write rate from Alluxio to Ceph
● Multiple independent Alluxio clusters support multi-tenancy, customized configuration, and access control
https://www.alluxio.io/resources/videos/alluxio-k8s-cloud-native-ai-environment-bosszp-chinese/
41. Alluxio @ BOSS Zhipin
42. Alluxio @ Momo
Momo runs multiple Alluxio clusters comprising thousands of Alluxio nodes and storing more than 100 TB of data. Alluxio serves Momo's search and training tasks, and Momo continues to develop new use cases for Alluxio.
● Alluxio supports multiple under storages and multiple compute/training frameworks
● Accelerates compute/training tasks
● Reduces the metadata and data overhead on the under storage
https://www.alluxio.io/resources/videos/ml-and-query-acceleration-at-momo-with-alluxio-chinese/
43. Alluxio @ Momo
Training on billions of images
- 2 billion small files
- PyTorch + Alluxio + Ceph
- Reduced metadata and data interactions with Ceph to improve performance
44. Alluxio @ Momo
Speed up recommendation-system model loading
● Upload the recommendation-system model to HDFS
● Distributed-load the model from HDFS into Alluxio
● Recommendation-system nodes load the model from Alluxio concurrently
Speed up loading indexes for the ANN system
● Create indexes
● Upload indexes to HDFS (or an object store)
● Nodes load indexes from Alluxio
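The concurrent model-loading step above can be sketched as follows. The shard paths and thread count are illustrative assumptions; in a real deployment each node would read its shard from the Alluxio FUSE mount or client after the distributedLoad has warmed the cache:

```python
from concurrent.futures import ThreadPoolExecutor

def load_shard(path):
    # Each node/worker reads one model shard. Because the model was
    # distributedLoad-ed into Alluxio first, these reads hit the
    # Alluxio cache instead of going back to HDFS.
    with open(path, "rb") as f:
        return f.read()

def load_model(shard_paths, workers=8):
    # Load all shards concurrently, preserving shard order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_shard, shard_paths))
```

Loading shards in parallel against the cache is what makes the concurrent-load step fast; the same reads issued directly against HDFS would contend on the NameNode and DataNodes.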
45. Alluxio may help you if you have
● Distributed training
● Large amounts of data (>= TB) or large numbers of small files/images
● Network I/O that cannot satisfy GPU requirements
● Multiple data sources and multiple training/compute frameworks
● A need to keep the under storage stable and avoid exceeding request rate limits
● Data shared between multiple training tasks