Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds

Adit Madan
Sr. Product Manager @ Alluxio

About Me
Adit Madan
Sr. Product Manager, Alluxio, Inc.
PMC member, Alluxio Open Source Project
MS from Carnegie Mellon University
BS from Indian Institute of Technology - Delhi

Open Source Started From UC Berkeley AMPLab in 2014
• 1,000+ contributors & growing
• 5,000+ Slack Community Members
• Top 10 Most Critical Java Based Open Source Project
• GitHub's Top 100 Most Valuable Repositories Out of 96 Million
Join the conversation on Slack: alluxio.io/slack

Consideration 1: Data is Shared
1. Between Compute Frameworks
For example, Spark Extract Transform Load (ETL) for data pre-processing followed by Presto for interactive queries or PyTorch for deep learning
2. Between Different Teams
If your organization spans multiple domains, Team A as producer could share data with Team B as consumer

Consideration 2: Processing in place is simple
Why not make data copies?
1. Copies are error-prone
Consistency is hard to maintain, and Total Cost of Ownership (TCO) is hard to keep low
2. Data Ownership and Governance
Although replication provides isolation, it complicates security compliance

Shared Data Platform across teams and clouds

DATA ACCESSIBILITY
Access any storage using any compute

UNIFIED NAMESPACE
With Replication & Live Data Migration Capabilities
[Figure: unified namespace diagram; labels: hdfs://host:port/directory/, Reports, Sales]
• Single Alluxio path backed by multiple storage systems
• Example policy: Migrate data older than 7 days from HDFS to S3
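
The two bullets above can be illustrated with a small sketch. This is not Alluxio's API: `MountTable`, `FileEntry`, `files_to_migrate`, the `/data` prefixes, and the `s3://example-bucket` URI are all hypothetical names chosen for illustration. It shows one namespace resolving to different storage systems by longest-prefix match, and an age-based rule that selects files older than 7 days as migration candidates.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileEntry:
    path: str            # path in the unified namespace
    modified: datetime   # last modification time


class MountTable:
    """Maps unified-namespace prefixes to storage URIs (hypothetical sketch)."""

    def __init__(self):
        self._mounts = {}

    def mount(self, namespace_prefix, storage_uri):
        self._mounts[namespace_prefix.rstrip("/")] = storage_uri.rstrip("/")

    def resolve(self, path):
        # Longest matching prefix wins, so nested mounts override parents.
        best = None
        for prefix in self._mounts:
            if path == prefix or path.startswith(prefix + "/"):
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is None:
            raise KeyError(f"no mount covers {path}")
        return self._mounts[best] + path[len(best):]


def files_to_migrate(files, now, max_age=timedelta(days=7)):
    """Select files older than max_age, mirroring the example policy."""
    return [f for f in files if now - f.modified > max_age]


table = MountTable()
table.mount("/data", "hdfs://host:port/directory")
table.mount("/data/archive", "s3://example-bucket/archive")  # hypothetical bucket

now = datetime(2022, 1, 15)
files = [
    FileEntry("/data/Reports/q4.parquet", datetime(2022, 1, 1)),
    FileEntry("/data/Sales/today.parquet", datetime(2022, 1, 14)),
]
old = files_to_migrate(files, now)  # only the Reports file is older than 7 days
```

Callers keep using the same namespace path before and after a migration; only the mount table entry behind it changes.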

BRING DATA CLOSER TO COMPUTE ACROSS SILOS
Access based data movement for compute and storage spread across environments
[Figure: compute and storage spread across environments; labels: public cloud Region A and Region B, private data centers (Datacenter 1, Datacenter 2), Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, Hive]
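
Access-based data movement can be pictured as a read-through cache placed next to compute: data is fetched from the remote silo only on first access and served locally afterwards. The sketch below is a minimal illustration under that assumption, not Alluxio's implementation; `ReadThroughCache`, `fetch_remote`, and the least-recently-used eviction rule are hypothetical choices.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Caches remote blocks near compute; evicts the least recently used."""

    def __init__(self, fetch_remote, capacity):
        self._fetch_remote = fetch_remote   # callable: key -> bytes
        self._capacity = capacity           # maximum number of cached blocks
        self._blocks = OrderedDict()
        self.remote_reads = 0               # counts trips to the remote store

    def read(self, key):
        if key in self._blocks:
            self._blocks.move_to_end(key)   # mark as most recently used
            return self._blocks[key]
        data = self._fetch_remote(key)      # first access: pull from remote
        self.remote_reads += 1
        self._blocks[key] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # drop least recently used
        return data


# A dict stands in for storage in another region (hypothetical data).
remote_store = {"a": b"A", "b": b"B", "c": b"C"}
cache = ReadThroughCache(remote_store.__getitem__, capacity=2)

for key in ["a", "a", "b", "c", "a"]:
    cache.read(key)
```

In this run the second read of `a` is served locally, while the final read of `a` goes remote again because `a` was evicted when `c` arrived.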

COMMON USE CASES
CASE 01: CLOUD
Consistent SLAs, Performance, and Cost Savings on cloud storage
[Figure labels: TensorFlow, Alluxio, PUBLIC CLOUD]
CASE 02: HYBRID
Hybrid Cloud Gateway to utilize on-prem compute for data in the cloud
[Figure labels: Spark, Alluxio, PUBLIC CLOUD, ON PREMISE]
CASE 03: MULTI-DATACENTER
Cross Datacenter Access without changing Ingest Pipeline across regions
[Figure labels: Presto, Alluxio, DATACENTER 1, DATACENTER 2, INGESTION]

Alluxio - Key Innovations, Benefits
EFFICIENT ACCESS & EASY DATA MANAGEMENT
Acceleration, efficient representation and movement of data based on policies
ENVIRONMENT AGNOSTIC & MULTI-CLOUD READY
Orchestrate a data platform with agility across regions for private, hybrid or multi-cloud
UNIFY DATA LAKES
Support multiple APIs for analytics and AI with storage abstraction and streamlined data movement across the pipeline

USE CASE: ALL IN CLOUD
Large Scale Deep Learning
Shared Previously
• 40%+ reduction in training stage time & cost over direct access to cloud storage
What's New in 2.7
• Optimal resource utilization with NVIDIA Data Loading Library (DALI) + Alluxio
• 8-12x performance improvement in data loading and preprocessing stages
• I/O and training can now execute in parallel, eliminating serialization delays caused by the copy-to-local approach
[Figure label: Distributed Deep Learning]
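
The point about I/O and training executing in parallel can be sketched with a background prefetch thread and a bounded queue. This is a generic illustration of overlapping loading with compute, not the DALI or Alluxio integration itself; `prefetching_loader`, `load_batch`, and `train_step` are hypothetical stand-ins.

```python
import queue
import threading


def prefetching_loader(load_batch, num_batches, depth=2):
    """Yield batches while a background thread loads the next ones."""
    buffer = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for i in range(num_batches):
            buffer.put(load_batch(i))  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def load_batch(i):
    # Stand-in for reading one batch from storage (hypothetical).
    return [i, i + 1]


def train_step(batch):
    # Stand-in for one training step (hypothetical).
    return sum(batch)


results = [train_step(b) for b in prefetching_loader(load_batch, num_batches=4)]
```

With a copy-to-local approach, training would wait for the whole dataset to land before the first step; here the consumer starts as soon as the first batch is ready, and `depth` bounds how far loading may run ahead.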

USER STORY: HYBRID CLOUD
WeRide uses Alluxio as a Hybrid Cloud Storage Gateway
• Network egress cost savings with cross-region access compared with data copy-based solutions
• Multiple locations with GPU clusters access a centralized data lake in AWS for training autonomous driving models
• Terabytes of data generated daily from simulations & test drives shared across regions
[Figure labels: GPU training, Alluxio, ON PREMISE, PUBLIC CLOUD]

USE CASE: MULTI DATACENTER
Cross Datacenter Access without changing Ingest Pipeline
• Ad-hoc SQL workloads in a secondary DC as analyst headcount reached 1800 people
• Leverage a 220+ node Alluxio cluster for compute resources outside primary DC
[Figure labels: Trino, Alluxio, Hive, DATACENTER 1, DATACENTER 2, REMOTE DATA, RESULTS]

Q&A
Social Media
Twitter.com/alluxio
Linkedin.com/alluxio
Website
www.alluxio.io
Slack
http://slackin.alluxio.io/
