Accelerate Cloud Training
with Alluxio
Alluxio Day 15
Lu Qiu @ Alluxio
Lu Qiu
● Machine Learning Engineer @ Alluxio
● Alluxio PMC maintainer
● Master's in Data Science @ GWU
● Responsible for integrating Alluxio with deep learning
● Areas: Alluxio fault-tolerant system, journal system, metrics system, and POSIX API; Alluxio integration with cloud
Agenda
● Alluxio and its POSIX API
● Accelerate Cloud Training with Alluxio
○ Round 1 Storage Read Accelerating
○ Round 2 Data Preprocessing & Training
○ Round 3 Data Orchestration Layer
Alluxio
& its POSIX API
Data Orchestration for
Analytics & AI in the Cloud
Available:
DATA ACCESSIBILITY
Convert from client-side interface to native storage interface
DATA LOCALITY
Local performance for remote data with intelligent multi-tiering
Tiers: Hot (RAM), Warm (SSD), Cold (HDD)
Read & Write Buffering
Transparent to App
Policies for pinning, promotion/demotion, TTL
Environments: On-premises, Public Cloud
Workloads: Model Training, Big Data ETL, Big Data Query
METADATA LOCALITY
Synchronization of changes across clusters
Old File at path
/file1 ->
New File at path
/file1 ->
Alluxio Master
Policies for pinning, promotion/demotion, TTL
Metadata Synchronization
Mutation
Environments: On-premises, Public Cloud
Workloads: Model Training, Big Data ETL, Big Data Query
Alluxio POSIX API
HDFS #1
Obj Store
NFS
HDFS #2
Connecting to
● HDFS
● Amazon S3
● Azure
● Google Cloud
● Ceph
● NFS
● Many more
Accessing Remote/Distributed Data as Local Directories
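Because the POSIX API presents remote data as a local directory, a training script can use ordinary file calls with no storage-specific client. The following is a minimal sketch, assuming the Alluxio namespace is mounted at some local path; the mount point and the helper names `list_files` and `read_bytes` are illustrative and do not come from the talk.

```python
import os
from typing import List, Tuple


def list_files(mount_root: str,
               suffixes: Tuple[str, ...] = (".jpg", ".png")) -> List[str]:
    """Return sorted paths under mount_root whose names end with one of suffixes."""
    matches = []
    # os.walk uses plain directory listings, so it works on any POSIX mount
    for dirpath, _dirnames, filenames in os.walk(mount_root):
        for name in filenames:
            if name.lower().endswith(suffixes):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def read_bytes(path: str) -> bytes:
    """Read one file with the standard open() call."""
    with open(path, "rb") as f:
        return f.read()
```

With a mount at a hypothetical path such as `/mnt/alluxio-fuse`, `list_files("/mnt/alluxio-fuse/s3")` would enumerate the images that live in the mounted bucket.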
Accelerating Cloud
Training with Alluxio
Round 1
Accelerating under
storage data access
1. Accelerating under storage data access
Training Clusters
Data Data Data
SSD SSD SSD
Read Buffering
Transparent to App
Policies for pinning, promotion/demotion, TTL
Under Storage
Kubernetes Cloud Cluster
One Click to Mount UFS to Alluxio
$ bin/alluxio fs mount /s3 s3://<bucket_name>/ \
    --option aws.accessKeyId=<access_key> \
    --option aws.secretKey=<secret_key>
All data located in s3://<bucket_name>/ will be cached by Alluxio, which provides data locality for training jobs.
One Click to Load All Training Data into Alluxio
$ bin/alluxio fs distributedLoad /s3
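The mount and load steps can also be scripted, for example from a job launcher. The sketch below only builds the two command lines shown on this slide; the helper names are hypothetical, and the option keys are copied from the slide as-is, so they should be checked against the documentation for the Alluxio version in use.

```python
import shlex
from typing import Dict, List


def mount_command(alluxio_path: str, ufs_uri: str,
                  options: Dict[str, str]) -> List[str]:
    """Build the `fs mount` command from the slide as an argument list."""
    cmd = ["bin/alluxio", "fs", "mount", alluxio_path, ufs_uri]
    for key, value in options.items():
        cmd += ["--option", f"{key}={value}"]
    return cmd


def distributed_load_command(alluxio_path: str) -> List[str]:
    """Build the `fs distributedLoad` command from the slide."""
    return ["bin/alluxio", "fs", "distributedLoad", alluxio_path]


if __name__ == "__main__":
    steps = [
        mount_command("/s3", "s3://<bucket_name>/", {
            "aws.accessKeyId": "<access_key>",   # placeholder, as on the slide
            "aws.secretKey": "<secret_key>",     # placeholder, as on the slide
        }),
        distributed_load_command("/s3"),
    ]
    for step in steps:
        # Dry run: print each command. To execute it instead, pass the list
        # to subprocess.run(step, check=True) from the Alluxio home directory.
        print(shlex.join(step))
```

Building argument lists rather than shell strings avoids quoting problems when a secret contains shell metacharacters.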
Alluxio @ Alibaba: Improve Throughput
https://www.alluxio.io/blog/efficient-model-training-in-the-cloud-with-kubernetes-tensorflow-and-alluxio/
https://www.alluxio.io/resources/whitepapers/using-alluxio-to-optimize-and-improve-performance-of-kubernetes-based-deep-learning-in-the-cloud/
Alluxio @ Microsoft
Task
● More than 400 tasks need to read data from Azure and write data to Azure
● The total data size is larger than 1 TB
Previously, data was copied directly from the cloud to the training nodes.
Challenges
● Easy to exceed the request rate limit. Azure blob-fuse requires downloading data from Azure to local storage before starting the tasks, and uploading data to Azure after finishing the tasks.
● Large volumes of input and output data easily cause I/O errors
● GPUs sit idle while waiting for I/O operations
https://www.alluxio.io/resources/videos/speed-up-large-scale-ml-dl-offline-inference-job-with-alluxio/
Alluxio @ Microsoft
Alluxio Speeds Up Training by 18%
Reduce I/O wait time, improve training performance
● Use data pre-caching to improve performance
● Dynamically cache data during training
● Share data across multiple tasks
Stream data reads to spread out I/O requests and avoid exceeding the cloud storage request limit
Auto retry to reduce the I/O error rate
https://www.alluxio.io/resources/videos/speed-up-large-scale-ml-dl-offline-inference-job-with-alluxio/
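The two ideas on this slide, streaming reads and automatic retry, can be illustrated in a few lines. This is a generic sketch of the pattern, not the Alluxio or Microsoft implementation; the function names, chunk size, and backoff values are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Iterator


def with_retry(read_once: Callable[[], bytes], max_attempts: int = 3,
               base_delay_s: float = 0.1) -> bytes:
    """Call read_once, retrying on OSError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return read_once()
        except OSError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the I/O error
            time.sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


def stream_file(path: str, chunk_size: int = 4 * 1024 * 1024,
                max_attempts: int = 3,
                base_delay_s: float = 0.1) -> Iterator[bytes]:
    """Yield a file in chunks so requests are spread out over time."""
    offset = 0
    with open(path, "rb") as f:
        while True:
            def read_at(pos: int = offset) -> bytes:
                f.seek(pos)  # re-seek so a retry restarts from a known offset
                return f.read(chunk_size)

            chunk = with_retry(read_at, max_attempts, base_delay_s)
            if not chunk:
                return
            offset += len(chunk)
            yield chunk
```

Reading in bounded chunks keeps memory use flat, and retrying a single chunk is cheaper than restarting a whole download after a transient error.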
Round 2
Data Processing &
Training Speed Up
2. Data processing to training speed up
Big Data ETL Cluster
Training Clusters
DATA DATA DATA
Read Buffering
Transparent to App
Policies for pinning, promotion/demotion, TTL
Alluxio @ Boss Zhipin
Task
● Use Spark/Flink to process data
● Train models on top of the processed data
Previous solution
● Spark/Flink + Ceph + model training
Problems
● Writing temporary files into Ceph puts high pressure on Ceph
● Ceph read/write pressure cannot be controlled, so the cluster is unstable
Solution with Alluxio
Spark/Flink + Alluxio + Ceph + Alluxio + model training
● Alluxio supports multiple data sources and multiple compute/training frameworks
● Multiple independent Alluxio clusters support multi-tenancy, customized configuration, and access control
Alluxio in BOSSZP
Big Data ETL: HDFS Interface
Model Training: POSIX Interface
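When the ETL side writes through the HDFS-compatible interface and the training side reads through the POSIX interface, the same file is addressed by two different names. The sketch below translates one into the other under a loudly stated assumption: that the POSIX mount exposes the root of the Alluxio namespace at a local directory. The default mount path and the function name are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def alluxio_uri_to_posix(uri: str, fuse_mount: str = "/mnt/alluxio-fuse") -> str:
    """Map an alluxio:// URI to the matching path under a local mount.

    Assumes the mount exposes the Alluxio namespace root at fuse_mount.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "alluxio":
        raise ValueError(f"expected an alluxio:// URI, got {uri!r}")
    # Normalize, then drop the leading slash so join() keeps the mount prefix
    relative = posixpath.normpath(parsed.path or "/").lstrip("/")
    return posixpath.join(fuse_mount, relative) if relative else fuse_mount
```

A Spark job that wrote to a URI such as `alluxio://<master>:<port>/etl/out` would then be read by the training job from the `etl/out` directory under the mount point.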
2. Data processing to training speed up
● Improve under storage stability
● Speed up the whole pipeline from data preprocessing to training
● Launch more Alluxio clusters to meet bursts in ETL/training demand
2. Data processing to training speed up
Data Preprocessing Model Training
POSIX Interface
Round 3
Data Orchestration
Layer
Big Data ETL Cluster
Training Clusters
DATA DATA DATA
Read Buffering
Transparent to App
Policies for pinning, promotion/demotion, TTL
Under Storage System
Data Preprocessing
Big Data ETL Cluster
DATA DATA DATA
Write Buffering
Policies for pinning, promotion/demotion, TTL
Under Storage
Data Preprocessing
Training Clusters
Data Orchestration for
Analytics & AI in the Cloud
Available:
Alluxio @ Momo
Momo runs multiple Alluxio clusters comprising thousands of Alluxio nodes and storing more than 100 TB of data. Alluxio serves Momo's search and training tasks, and Momo continues to develop new use cases for Alluxio.
● Alluxio supports multiple under storage systems and multiple compute/training frameworks
● Accelerates compute/training tasks
● Reduces the metadata and data overhead on the under storage
Alluxio @ Momo
Billion-scale image training
- 2 billion small files
- PyTorch + Alluxio + Ceph
- Reduce the metadata and data interactions with Ceph to improve performance
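On the training side, small-file workloads like this are typically read through a dataset object that indexes paths once and opens files on demand. The class below is a dependency-free sketch that follows the map-style protocol (`__len__` and `__getitem__`) accepted by `torch.utils.data.DataLoader`; the class name and the `<root>/<class_name>/<file>` layout are assumptions, not Momo's actual setup.

```python
import os
from typing import List, Tuple


class MountedImageFolder:
    """Map-style dataset over <root>/<class_name>/<file> on a POSIX mount."""

    def __init__(self, root: str) -> None:
        # Each subdirectory name is treated as a class; its index is the label
        self.classes: List[str] = sorted(
            entry.name for entry in os.scandir(root) if entry.is_dir()
        )
        self.samples: List[Tuple[str, int]] = []
        for label, class_name in enumerate(self.classes):
            class_dir = os.path.join(root, class_name)
            for entry in sorted(os.scandir(class_dir), key=lambda e: e.name):
                if entry.is_file():
                    self.samples.append((entry.path, label))

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int) -> Tuple[bytes, int]:
        # Files are opened lazily, one per requested sample
        path, label = self.samples[index]
        with open(path, "rb") as f:
            return f.read(), label
```

Listing directories once in the constructor keeps metadata calls out of the training loop. At the scale of billions of files, a precomputed manifest would likely replace the directory scan, and image decoding would be added in `__getitem__`.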
Alluxio @ Momo
Speed up recommendation system model loading
● Upload the recommendation system model to HDFS
● Distributed-load the model from HDFS into Alluxio
● The recommendation system loads the model from Alluxio concurrently
Speed up loading indexes for the ANN system
● Create indexes
● Upload indexes to HDFS (or an object store)
● Nodes load indexes from Alluxio
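The concurrent loading step can be sketched with a thread pool reading model or index shards from the mounted path. This is an illustration of the pattern only; the function names, the worker count, and the idea that the model is split into shard files are assumptions rather than details from the talk.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable


def load_file(path: str) -> bytes:
    """Read one shard with a plain file read."""
    with open(path, "rb") as f:
        return f.read()


def load_shards_concurrently(paths: Iterable[str],
                             max_workers: int = 8) -> Dict[str, bytes]:
    """Read several shard files in parallel and return {path: contents}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so zip pairs each path with its contents
        return dict(zip(paths, pool.map(load_file, paths)))
```

Threads suit this workload because the time is spent waiting on I/O, so several reads can be in flight at once.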
Alluxio may help you if
● You run distributed training
● You have a large amount of data (>= TB) or a large number of small files/images
● Network I/O cannot keep up with GPU requirements
● You have multiple data sources and multiple training/compute frameworks
● You want to keep the under storage stable and avoid exceeding request rate limits
● You want to share data between multiple training tasks
Community Driven Project
● Community-driven cooperation. Special thanks to the excellent engineers from Microsoft, Shopee, Tencent, AntFinance, Alibaba, Bilibili, and Nanjing University.
● In production at Microsoft, Shopee, Bilibili, MOMO, Boss Zhipin, and others
Deployment & Usage
https://www.alluxio.io/alluxio-day/
Alluxio on Kubernetes talk on Alluxio Day XII 2022
https://docs.alluxio.io/os/user/stable/en/api/POSIX-API.html
Social Media
Twitter.com/alluxio
Linkedin.com/alluxio
Website
www.alluxio.io
Slack
http://slackin.alluxio.io/
