Alluxio Product School Webinar
January 27, 2022
For more Alluxio events: https://www.alluxio.io/events/
Speaker:
Adit Madan
Data platform teams are increasingly challenged with accessing multiple data stores that are separated from compute engines such as Spark, Presto, TensorFlow, or PyTorch. Whether your data is distributed across multiple datacenters, multiple clouds, or both, a successful heterogeneous data platform requires efficient data access. Alluxio enables you to embrace the separation of storage from compute and use Alluxio data orchestration to simplify adoption of the data lake and data mesh paradigms for analytics and AI/ML workloads.
Join Alluxio’s Sr. Product Mgr., Adit Madan, to learn:
- Key challenges with architecting a successful heterogeneous data platform
- How data orchestration can overcome data access challenges in a distributed, heterogeneous environment
- How to identify ways to use Alluxio to meet the needs of your own data environment and workload requirements
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
1. Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Adit Madan
Sr. Product Manager @ Alluxio
2. About Me
Adit Madan
Sr. Product Manager, Alluxio, Inc.
PMC member, Alluxio Open Source Project
MS from Carnegie Mellon University
BS from Indian Institute of Technology - Delhi
3. Open Source Started From UC Berkeley AMPLab in 2014
• 1,000+ contributors & growing
• 5,000+ Slack Community Members
• Top 10 Most Critical Java Based Open Source Project
• GitHub’s Top 100 Most Valuable Repositories Out of 96 Million
Join the conversation on Slack: alluxio.io/slack
4. Consideration 1: Data is Shared
1. Between Compute Frameworks
For example, Spark Extract Transform Load (ETL) for data pre-processing followed by Presto for interactive queries or PyTorch for deep learning
2. Between Different Teams
If your organization spans multiple domains, Team A as producer could share data with Team B as consumer
5. Consideration 2: Processing in place is simple
Why not make data copies?
1. Copies are error-prone
It is hard to maintain consistency and to keep Total Cost of Ownership (TCO) low
2. Data Ownership and Governance
Although replication provides isolation, security compliance is complex
6. Shared Data Platform across teams and clouds
8. UNIFIED NAMESPACE
With Replication & Live Data Migration Capabilities
[Diagram labels: hdfs://host:port/directory/, Reports, Sales]
• Single Alluxio path backed by multiple storage systems
• Example policy: Migrate data older than 7 days from HDFS to S3
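The example policy above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not Alluxio's policy engine or API: `FileEntry`, `plan_migration`, and the S3 bucket name are hypothetical, and the HDFS URI is the placeholder shown on the slide. The point it tries to show is that the logical path stays the same while the backing store changes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

HDFS_ROOT = "hdfs://host:port/directory/"    # placeholder URI from the slide
S3_ROOT = "s3://example-bucket/directory/"   # hypothetical bucket name


@dataclass(frozen=True)
class FileEntry:
    logical_path: str     # path clients use; does not change on migration
    modified: datetime    # timezone-aware last-modified time
    store_root: str       # backing store that currently holds the bytes


def plan_migration(files, now, max_age=timedelta(days=7)):
    """Return (source_uri, target_uri) pairs for HDFS files older than max_age."""
    moves = []
    for f in files:
        if f.store_root == HDFS_ROOT and now - f.modified > max_age:
            rel = f.logical_path.lstrip("/")
            moves.append((HDFS_ROOT + rel, S3_ROOT + rel))
    return moves


now = datetime(2022, 1, 27, tzinfo=timezone.utc)
files = [
    FileEntry("/Reports/old.csv", now - timedelta(days=30), HDFS_ROOT),
    FileEntry("/Sales/new.csv", now - timedelta(days=1), HDFS_ROOT),
]
plan = plan_migration(files, now)
print(plan)
```

Only the 30-day-old file is selected; the one-day-old file stays on HDFS. A real system would also have to copy the bytes and update its metadata atomically, which this sketch leaves out.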
9. BRING DATA CLOSER TO COMPUTE ACROSS SILOS
Access-based data movement for compute and storage spread across environments
[Diagram labels: Region A, Region B, Private Data Centers, Datacenter 1, Datacenter 2, Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, Hive]
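One way to picture access-based data movement is a read-through cache placed next to compute: data crosses regions only the first time it is read, and later reads are served locally. The sketch below is a generic illustration under that assumption, not Alluxio's implementation; `ReadThroughCache`, `fetch_remote`, and the least-recently-used eviction choice are all hypothetical.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Keeps recently read objects near compute; fetches from remote on a miss."""

    def __init__(self, fetch_remote, capacity):
        self._fetch_remote = fetch_remote   # callable: key -> bytes (hypothetical)
        self._capacity = capacity           # maximum number of cached objects
        self._items = OrderedDict()
        self.remote_reads = 0               # counts reads that left the local region

    def read(self, key):
        if key in self._items:
            self._items.move_to_end(key)    # mark as most recently used
            return self._items[key]
        data = self._fetch_remote(key)      # first access pulls data from remote storage
        self.remote_reads += 1
        self._items[key] = data
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict the least recently used object
        return data


# A dict stands in for remote storage in another region.
remote_store = {"a": b"alpha", "b": b"beta", "c": b"gamma"}
cache = ReadThroughCache(remote_store.__getitem__, capacity=2)
for key in ["a", "a", "b", "c", "a"]:
    cache.read(key)
print(cache.remote_reads)
```

The second read of "a" is served locally, while the final read of "a" goes remote again because "a" was evicted when "c" arrived. Cache capacity relative to the working set therefore decides how much cross-region traffic is avoided.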
10. COMMON USE CASES
CASE 01: CLOUD
Consistent SLAs, Performance, and Cost Savings on cloud storage
[Diagram labels: TensorFlow, Alluxio, Public Cloud]
CASE 02: HYBRID
Hybrid Cloud Gateway to utilize on-prem compute for data in the cloud
[Diagram labels: Spark, Alluxio, On Premise, Public Cloud]
CASE 03: MULTI-DATACENTER
Cross Datacenter Access without changing Ingest Pipeline across regions
[Diagram labels: Presto, Alluxio, Datacenter 1, Datacenter 2, Ingestion]
11. Alluxio - Key Innovations, Benefits
EFFICIENT ACCESS & EASY DATA MANAGEMENT
Acceleration, efficient representation and movement of data based on policies
ENVIRONMENT AGNOSTIC & MULTI-CLOUD READY
Orchestrate a data platform with agility across regions for private, hybrid or multi-cloud
UNIFY DATA LAKES
Support multiple APIs for analytics and AI with storage abstraction and streamlined data movement across the pipeline
12. USE CASE: ALL IN CLOUD
Large Scale Deep Learning
Shared Previously
• 40%+ reduction in training stage time & cost over direct access to cloud storage
What's New in 2.7
• Optimal resource utilization with NVIDIA Data Loading Library (DALI) + Alluxio
• 8-12x performance improvement in data loading and preprocessing stages
• I/O and training can now execute in parallel, eliminating serialization delays caused by the copy-to-local approach
[Diagram label: Distributed Deep Learning]
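The claim that I/O and training can execute in parallel can be illustrated with a small prefetching loader: a background thread keeps reading upcoming batches into a bounded buffer while the consumer works on the current one. This is a generic sketch of the overlap idea only; it does not use the DALI or Alluxio APIs, and `prefetching_loader` and `load_batch` are hypothetical names.

```python
import queue
import threading


def prefetching_loader(load_batch, num_batches, depth=2):
    """Yield batches in order while a background thread loads the next ones."""
    buffer = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for i in range(num_batches):
            buffer.put(load_batch(i))   # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buffer.get()
        if batch is done:
            return
        yield batch


def load_batch(i):
    # Stand-in for a remote read; a real loader would perform I/O here.
    return [i, i + 1]


# The list comprehension plays the role of the training step.
results = [sum(batch) for batch in prefetching_loader(load_batch, num_batches=3)]
print(results)
```

With a copy-to-local approach the whole dataset is copied before the first training step starts. Here the consumer waits only when the buffer is empty, and the `depth` parameter bounds how far loading may run ahead.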
13. USER STORY: HYBRID CLOUD
WeRide uses Alluxio as a Hybrid Cloud Storage Gateway
• Network egress cost savings with cross-region access over data copy-based solutions
• Multiple locations with GPU clusters access a centralized data lake in AWS for training autonomous driving models
• Terabytes of data generated daily from simulations & test drives shared across regions
[Diagram labels: GPU training, Alluxio, On Premise, Public Cloud]
14. USE CASE: MULTI DATACENTER
Cross Datacenter Access without changing Ingest Pipeline
[Diagram labels: Trino, Alluxio, Hive, Datacenter 1, Datacenter 2]
REMOTE DATA RESULTS
• Ad-hoc SQL workloads in a secondary DC as analyst headcount reached 1800 people
• Leverage a 220+ node Alluxio cluster for compute resources outside primary DC