Alluxio Product School Webinar
August 25, 2022
For more Alluxio events: https://alluxio.io/events/
Speaker: Jingwen Ouyang
As more and more companies turn to AI / ML / DL to unlock insight, AI has become a mythical word that adds unnecessary barriers for new adopters. It is often regarded as a luxury reserved for big tech companies, but this should not be the case.
In this talk, Jingwen will first dissect the ML life cycle into five stages: data collection, data cleansing, model training, model validation, and model inference / deployment. For each stage, Jingwen will then go over its concept, functionality, characteristics, and use cases to demystify ML operations. Finally, Jingwen will showcase how Alluxio, a virtual data lake, could help simplify each stage.
3. AI vs ML vs DL
Model Performance scales with data
AI: Intelligence demonstrated by machines rather than by humans or animals
ML: Giving computers the skills to learn without explicit programming
DL: An ML subset, examining algorithms that learn and improve on their own (typically a neural network that consists of more than three layers)
4. Many Use Cases of AI
AI has a wide variety of use cases in different industries!
Image source: page 7
• Health Care
• Retail
• Automotive
• Manufacturing
• Financial Services
• Government and Defense
5. The common learning process - ML Lifecycle Stages
1. Data Collection
2. Data Preprocessing
3. Model Training
4. Model Evaluation
5. Model Inference
6. Intro to Alluxio - a Virtual Data Lake layer
Did you know? Alluxio shines in all AI lifecycle stages!
• Unified namespace provides a single point of access and eliminates silos in the data lake
• Server-side API translation from a standard client-side interface to any storage interface enables any compute to access any storage (portability)
• Cache layer accelerates data access and offloads stress from the underlying storage
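The three ideas above (one namespace over several stores, translation to each store's own interface, and a cache in front) can be illustrated with a toy model. This is a minimal sketch and not Alluxio's API: the class `ToyUnifiedNamespace`, its methods, and the two dictionary "backends" are all hypothetical names made up for illustration.

```python
from typing import Callable, Dict


class ToyUnifiedNamespace:
    """Toy model only (hypothetical; not Alluxio's API)."""

    def __init__(self) -> None:
        self._mounts: Dict[str, Callable[[str], bytes]] = {}
        self._cache: Dict[str, bytes] = {}
        self.backend_reads = 0  # counts reads that reached a backing store

    def mount(self, prefix: str, reader: Callable[[str], bytes]) -> None:
        # Each mount point maps a path prefix to a storage-specific reader.
        self._mounts[prefix.rstrip("/") + "/"] = reader

    def read(self, path: str) -> bytes:
        if path in self._cache:
            return self._cache[path]  # cache hit: the backend is not touched
        matches = [p for p in self._mounts if path.startswith(p)]
        if not matches:
            raise FileNotFoundError(path)
        prefix = max(matches, key=len)  # longest matching mount wins
        self.backend_reads += 1
        data = self._mounts[prefix](path[len(prefix):])
        self._cache[path] = data
        return data


# Two hypothetical backends standing in for different storage systems.
object_store = {"train/a.csv": b"1,2,3"}
hdfs_like = {"logs/day1": b"ok"}

ns = ToyUnifiedNamespace()
ns.mount("/s3", lambda key: object_store[key])
ns.mount("/hdfs", lambda key: hdfs_like[key])

first = ns.read("/s3/train/a.csv")   # goes to the backend
second = ns.read("/s3/train/a.csv")  # served from the cache
```

The caller only ever sees one path scheme; which store serves a path, and whether the bytes came from cache, is hidden behind `read`.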
8. Data comes from everywhere and in different forms
Dictated by business
Source: lenovo/netapp
9. Moving Data
• Data often flows from edge to core data centers / cloud for training
Image source: page 10
10. Data Collection Case Study w/ Alluxio
Before: moves PBs of data from merged subsidiaries to the parent company for analysis
• Poor performance
• Error-prone
• High S3 egress cost
• Needs synchronization
Read more: blog
With Alluxio: no-copy solution with unified namespace
• Eliminates data silos and improves manageability
• Reduces S3 egress cost (50%)
● The world's leading online travel service
● The eighth largest travel agency in the U.S.
12. “Garbage in garbage out”
Background: What is Involved in Data Preprocessing
Many approaches
• Data formatting
• Data cleansing
○ Missing data
○ Duplicates
○ Structural errors
○ Outliers
• Data aggregation
• Data sampling
• Feature engineering
• Handling categorical data
• Feature scaling
• Dimensionality reduction
• Feature selection
○ filter
○ wrapper
○ embedded
• Feature creation
Lots of data
Very complex
Compute expensive
MLEs spend most of their time on data preprocessing
30%-40% of companies name data cleansing as their pain point
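A few of the preprocessing steps listed above (duplicates, missing data, outliers, feature scaling) can be sketched in plain Python. The records, the `age` field, and the 0-120 validity range are made-up assumptions for illustration; real pipelines would typically run equivalent steps in a big data engine.

```python
from statistics import median

# Hypothetical raw records: one duplicate, one missing value, one outlier.
raw = [
    {"id": 1, "age": 34.0},
    {"id": 2, "age": None},
    {"id": 2, "age": None},
    {"id": 3, "age": 29.0},
    {"id": 4, "age": 410.0},
    {"id": 5, "age": 41.0},
]


def deduplicate(rows):
    """Keep the first row seen for each id."""
    seen, out = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            out.append(row)
    return out


def impute_missing(rows, key):
    """Replace missing values with the median of the observed ones."""
    fill = median(r[key] for r in rows if r[key] is not None)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]


def drop_outliers(rows, key, low, high):
    """Drop rows whose value falls outside an assumed valid range."""
    return [r for r in rows if low <= r[key] <= high]


def min_max_scale(rows, key):
    """Feature scaling: map values linearly onto [0, 1]."""
    values = [r[key] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [{**r, key: (r[key] - lo) / span} for r in rows]


clean = deduplicate(raw)
clean = impute_missing(clean, "age")
clean = drop_outliers(clean, "age", low=0, high=120)
scaled = min_max_scale(clean, "age")
```

Each step makes a full pass over the data, which is one reason this stage is compute expensive at scale.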
13. Feature Extraction Case Study w/ Alluxio
Read more: blog
● “Honor of Kings” - world’s largest mobile game (MOBA)
● Highest-grossing mobile game of all time
● Upward of 80 million people play it each day (high concurrency)
Architecture (diagram): 1000 application pods (Spark: Feature Extraction) read through Alluxio (Alluxio HA master + Alluxio worker pods), backed by the under file system (CephFS)
15. A Lightweight Intro to Training and Data
Typical test data split: one portion for training iterations**, one portion for evaluation
** Cross-validation is meant to cover all the data when validating the model, but for DL the iteration is sometimes too expensive, so practitioners may just assume the data is random enough and skip the iteration.
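The difference between a single split and cross-validation can be sketched with index lists. This is a minimal illustration, assuming a 20% hold-out fraction and 5 folds; the function names are hypothetical and not from the talk.

```python
import random


def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Single split: one part for training iterations, one for evaluation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_splits(n_samples, k=5, seed=0):
    """Cross-validation: every sample is used for validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


train_idx, test_idx = holdout_split(10)
folds = list(k_fold_splits(10, k=5))
```

With k folds the model is trained k times, which is the cost that makes teams skip cross-validation for large DL models.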
16. Optimization Goal of Training
• Infra team: GPU utilization rate (electricity = money) => Reduce IO stall
• Machine learning engineer: accuracy => more data, better data, bigger model, available resources
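Why reducing IO stall raises GPU utilization can be shown with simple arithmetic. This is a rough sketch that assumes data loading and compute do not overlap; the timings are made-up numbers, not measurements from the talk.

```python
def gpu_utilization(compute_seconds, io_stall_seconds):
    """Fraction of wall-clock time the GPU spends computing.

    Simplifying assumption: IO stall and compute never overlap.
    """
    total = compute_seconds + io_stall_seconds
    return compute_seconds / total if total else 0.0


# Hypothetical epoch: 60 s of compute, with 40 s vs 10 s waiting on data.
baseline = gpu_utilization(compute_seconds=60.0, io_stall_seconds=40.0)
cached = gpu_utilization(compute_seconds=60.0, io_stall_seconds=10.0)
```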
17. Model Training Case Study w/ Alluxio
Read more blog
● Video sharing (China’s YouTube)
● Almost 80 million DAU
Diagram labels, one per setup: API: S3, HDFS | API: POSIX | API: POSIX
● On restart, data needs to be redownloaded
● No more redownloading, but a single machine has limited capacity
● Distributed layer is very scalable
Compute simplification and portability
19. Intro to Model Evaluation
• What is model evaluation
○ A method of assessing the correctness of models on test data.
Different aspects of model evaluation (figure omitted)
Data split (figure): for training iterations | for evaluation
• Challenge
○ Methodology - statistical
○ Data quality and quantity
○ Compute intensive
22. Model Inference
Intro (offline and online)
● The process of running data points into a machine learning model to calculate an output, such as a single numerical score
● Similar data flow as training - same feature extractor too
Offline Inference
Characteristics
● In batch
● Large amount of data - can take advantage of big data tools like Spark
● Latency is acceptable
● Result is stored then served
Examples
● Amazon product recommendation
● Microsoft Bing search result
Online Inference
Characteristics
● At run time upon request
● Needs real time result (SLA)
● Streamed data
● Interactive
Examples
● Tesla autonomous driving on the road
● Manufacturing robotic arm (QA)
● Uber Eats estimated time
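The contrast between the two modes can be sketched with one shared feature extractor and one model. Everything here is hypothetical: the linear scorer, its weights, and the `clicks` / `purchases` fields are assumptions chosen only to make the example runnable.

```python
from typing import Dict, List


def extract_features(record: Dict[str, float]) -> List[float]:
    """Same feature extractor for training, offline, and online inference."""
    return [record["clicks"], record["purchases"]]


def model(features: List[float]) -> float:
    """Hypothetical linear scorer standing in for a trained model."""
    weights = [0.1, 0.5]
    return sum(w * x for w, x in zip(weights, features))


def offline_inference(batch: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Score a whole batch ahead of time; results are stored, then served."""
    return {uid: model(extract_features(rec)) for uid, rec in batch.items()}


def online_inference(record: Dict[str, float]) -> float:
    """Score a single request at run time, when it arrives."""
    return model(extract_features(record))


users = {
    "u1": {"clicks": 10.0, "purchases": 2.0},
    "u2": {"clicks": 0.0, "purchases": 4.0},
}
stored_scores = offline_inference(users)  # batch job, latency is acceptable
served = stored_scores["u1"]              # serving is only a lookup
live = online_inference({"clicks": 3.0, "purchases": 1.0})  # per request
```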
23. Offline Model Inference Case Study w/ Alluxio
Read more: blog
“By implementing Alluxio, we are able to speed up the inference job, reduce I/O stall, and improve performance by about 18%.”
• Prefetch with scheduler into Alluxio cache allows jobs to execute immediately without IO stall
• Alluxio provides read retry
• Alluxio allows customized cache replacement policies, making the inference job more efficient
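The three mechanisms above (prefetch, read retry, cache replacement policy) can be sketched together in a toy cache. This is not Alluxio code: `PrefetchingCache`, the LRU policy chosen here, and the `flaky_storage` function are hypothetical stand-ins for illustration.

```python
from collections import OrderedDict


class PrefetchingCache:
    """Toy cache with prefetch, read retry, and LRU replacement (hypothetical)."""

    def __init__(self, fetch, capacity=2, max_retries=3):
        self._fetch = fetch
        self._capacity = capacity
        self._max_retries = max_retries
        self._entries = OrderedDict()  # least recently used entry first
        self.misses = 0

    def _fetch_with_retry(self, key):
        last_error = None
        for _ in range(self._max_retries):
            try:
                return self._fetch(key)
            except IOError as err:  # retry transient read failures
                last_error = err
        raise last_error

    def _put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used

    def prefetch(self, keys):
        """Load data before the job starts so reads do not stall."""
        for key in keys:
            if key not in self._entries:
                self._put(key, self._fetch_with_retry(key))

    def read(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)
            return self._entries[key]
        self.misses += 1
        value = self._fetch_with_retry(key)
        self._put(key, value)
        return value


attempts = {"count": 0}


def flaky_storage(key):
    """Hypothetical under-storage read that fails on its first call."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise IOError("transient read failure")
    return f"data:{key}"


cache = PrefetchingCache(flaky_storage, capacity=2)
cache.prefetch(["part-0", "part-1"])  # scheduler warms the cache
result = cache.read("part-0")         # job reads without touching storage
```

Swapping the eviction rule in `_put` is where a customized replacement policy would plug in.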
• Largest vendor of computer software in the world.
• Leading provider of cloud computing services, video games, computer and gaming hardware, search and other online services.
26. Alluxio as a common layer
Read more blog
Focus for Alluxio - data volume + data silo / need for speed
• Large amount of data
• Heterogeneous compute / storage systems
• Heterogeneous topology (hybrid / multi-cloud + on prem)
• I/O becomes bottleneck (GPU utilization, caching)
Alluxio can be in all the stages of the ML life cycle!
Read more blog