4. The Alluxio Story
2013: Originated as the Tachyon project at the UC Berkeley AMPLab, created by Ph.D. student Haoyuan (H.Y.) Li, now Alluxio CEO
2015: Open-source project established & company founded to commercialize Alluxio
Goal: Orchestrate data at memory speed for the cloud, for data-driven apps such as big data analytics, ML, and AI
5. Fast-growing Open Source Community
5000+ GitHub Stars
1100+ Contributors
Join the community on Slack
alluxio.io/slack
Apache 2.0 Licensed
Contribute to source code
github.com/alluxio/alluxio
7. Alluxio is Open-Source Data Orchestration
Data orchestration for the cloud:
• Application APIs: Java File API, HDFS interface, S3 interface, REST API, POSIX interface
• Storage drivers: HDFS driver, GCS driver, S3 driver, Azure driver
13. Why Use Alluxio with Spark?
• Improve I/O with better data locality
• Enable data sharing between Spark jobs
14. Data Sharing Between Jobs
A pipeline consisting of multiple jobs writes intermediate data (/file1, /file2) to external storage.
Inter-process sharing is slowed down by network bandwidth.
15. Data Sharing Between Jobs
Instead of writing /file1 and /file2 as blocks (block 1-4) to HDFS on disk, jobs write to the closer and faster in-memory Alluxio to exchange data.
Inter-process sharing can happen at memory speed.
16. Why Use Alluxio with Spark?
• Improve I/O with better data locality
• Enable data sharing between Spark jobs
• Checkpoint data for resilience
23. API Selection
• Access data directly through the FileSystem API, but change the scheme to alluxio://
– Minimal code change
– No need to change application logic
• Example:
– val file = sc.textFile("s3a://my-bucket/myFile")
– val file = sc.textFile("alluxio://master:19998/myFile")
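The scheme swap above can be sketched as a small helper. This is purely illustrative (not part of the Alluxio API), and it assumes the S3 bucket is mounted at the root of the Alluxio namespace, as in the slide's example:

```scala
// Hypothetical helper (not part of Alluxio): rewrite an s3a:// URI into the
// equivalent alluxio:// URI, assuming the bucket is mounted at the Alluxio root.
def toAlluxioUri(s3Uri: String, master: String = "master:19998"): String = {
  // Drop "s3a://" and the bucket name, keeping only the object path.
  val path = s3Uri.stripPrefix("s3a://").dropWhile(_ != '/')
  s"alluxio://$master$path"
}
```

With such a helper, `sc.textFile(toAlluxioUri("s3a://my-bucket/myFile"))` reads through Alluxio with no other changes to the job's logic.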
24. Setup
• Spark works with the Alluxio client out of the box
• Put in spark/conf/spark-defaults.conf:
spark.driver.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-2.6.1-client.jar
spark.executor.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-2.6.1-client.jar
• More advanced settings: https://docs.alluxio.io/os/user/stable/en/compute/Spark.html
25. Example of Spark RDDs
Writing to Alluxio
rdd.saveAsTextFile("alluxio://master:19998/myPath")
rdd.saveAsObjectFile("alluxio://master:19998/myPath")
Reading from Alluxio
val rdd = sc.textFile("alluxio://master:19998/myPath")
val rdd = sc.objectFile("alluxio://master:19998/myPath")
26. Example of Spark DataFrames
Writing to Alluxio
df.write.parquet("alluxio://master:19998/myPath")
Reading from Alluxio
val df = spark.read.parquet("alluxio://master:19998/myPath")
28. DATA LOCALITY WITH SCALE-OUT WORKERS
Local performance for remote data with intelligent multi-tiering: hot data in RAM, warm on SSD, cold on HDD.
• Read & write buffering, transparent to the app
• Policies for pinning, promotion/demotion, TTL
Serves model training, big data ETL, and big data query workloads across on-premises and public cloud.
29. METADATA LOCALITY WITH SCALABLE MASTERS
Synchronization of changes across clusters: when a mutation replaces the old file at path /file1 with a new file, the Alluxio master picks up the change via metadata synchronization.
Master metadata is backed by RocksDB.
Serves model training, big data ETL, and big data query workloads across on-premises and public cloud.
30. Spark Workflow on Remote Storage (Without Alluxio)
The application's Spark context works with the cluster manager (e.g., YARN/Mesos):
1) Run Spark job
2) Allocate executors
3) Launch executors and launch tasks on worker nodes
4) Spark executors access data in remote storage (s3://data/) and compute
Takeaway: remote data, no locality
31. Step 1: Schedule Compute to the Data Cache Location
The application's Spark context uses an Alluxio client to ask the Alluxio masters where the data is cached (e.g., block1 and block2 on the worker at HostA: 196.0.0.7, with another worker at HostB: 196.0.0.8), and the cluster manager (e.g., YARN/Mesos) allocates executors accordingly:
(1) Where is block1?
(2) block1 is served at [HostA]
(3) Allocate on [HostA]
● The Alluxio client implements an HDFS-compatible API that exposes block location info
● The Alluxio masters keep track of, and serve, the list of worker nodes that currently hold the cache
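The two bullets above can be sketched as a toy model in plain Scala (this is not Alluxio's actual code; the block and host names are the ones from the diagram): the master's view maps each block to the workers caching it, and the client surfaces those hosts as preferred locations for scheduling.

```scala
// Toy model of the master's block-to-worker map (illustrative only).
val blockLocations: Map[String, Seq[String]] = Map(
  "block1" -> Seq("HostA"),          // cached on HostA (196.0.0.7)
  "block2" -> Seq("HostA", "HostB")  // cached on both workers
)

// What an HDFS-style block-location query would surface: the preferred
// hosts for a block, so the cluster manager can allocate executors
// next to the cache.
def preferredHosts(block: String): Seq[String] =
  blockLocations.getOrElse(block, Nil)
```

An uncached block yields no preferred hosts, in which case the cluster manager is free to place the executor anywhere.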
32. Step 4: Find the Local Alluxio Worker for Efficient Data Exchange
● The Spark executor finds the local Alluxio worker (e.g., on HostA: 196.0.0.7, which caches block1; the masters answer "block1? → [HostA]") by hostname comparison
● The Spark executor talks to the local Alluxio worker using either short-circuit access via the local filesystem (e.g., /mnt/ramdisk/) or a local domain socket (/opt/domain) for efficient I/O; the worker reads from the under store (s3://data/) as needed
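The locality check described above can be sketched as follows (illustrative plain Scala, not Alluxio source; the returned path strings are placeholders, not real Alluxio configuration values):

```scala
// Sketch of the executor-side decision (not Alluxio's actual code).
// If the worker runs on the same host as the executor, use short-circuit
// access through the local filesystem or a domain socket; otherwise fall
// back to a network read from the remote worker.
def isLocalWorker(executorHost: String, workerHost: String): Boolean =
  executorHost.equalsIgnoreCase(workerHost)

def accessPath(executorHost: String, workerHost: String): String =
  if (isLocalWorker(executorHost, workerHost))
    "short-circuit via /mnt/ramdisk (or domain socket /opt/domain)"
  else
    s"network read from $workerHost"
```

Short-circuit reads bypass the worker's RPC path entirely, which is why co-locating executors with cached blocks (Step 1) pays off here.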
33. Recap: Spark Architecture with Alluxio
The application's Spark context (with an Alluxio client) coordinates with the cluster manager (e.g., YARN/Mesos) and the Alluxio masters; each worker node runs a Spark executor (with its own Alluxio client) next to an Alluxio worker backed by s3://data/:
1) Run Spark job
2.1) Talk to Alluxio for where the data cache is
2.2) Allocate executors
3) Launch executors and launch tasks
4) Access Alluxio for data and compute
Step 2: Help Spark schedule compute to the data cache
Step 4: Enable Spark to access the local Alluxio cache
37. Typical Use Cases
• Hybrid Cloud Analytics: get in-memory data access for Spark, Presto, or any analytics framework on cloud storage
• Cloud Analytics Caching: get in-memory data access for Spark, Presto, or any analytics framework on cloud storage
38. Elastic Model Training: Leading Hedge Fund
Challenge – Algorithmic trading at a top data-driven hedge fund, with model training in the cloud for bursty workloads (Spark + HDFS in the public cloud). Data access was slow, costing them $$ in compute cost and lowering modeler productivity.
Solution – With Alluxio, data access is 10-30x faster.
Impact – Increased efficiency in training ML algorithms, lowered compute cost, and increased modeler productivity, resulting in a 14-day ROI on Alluxio.
39. Machine Learning Case Study
Challenge – Gain an end-to-end view of the business with a large volume of data. Queries (Spark on storage) were slow / not interactive, resulting in operational inefficiency.
Solution – ETL data from storage into Alluxio.
Impact – Faster time to market: "Now we don't have to work Sundays."
https://dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-from-Hours-to-Seconds-With-Tachyon
40. Cloud Analytics
Challenge – Queries were slow / not interactive, resulting in operational inefficiency.
Solution – Use Alluxio as a read cache for queries to S3.
Impact – 6x-11x query performance and a more scalable analytics infrastructure.
https://www.alluxio.io/blog/accelerate-spark-and-hive-jobs-on-aws-s3-by-10x-with-alluxio-tiered-storage/
41. Spark + Alluxio @ Boss直聘
https://www.alluxio.io/resources/videos/alluxio-k8s-cloud-native-ai-environment-bosszp-chinese/
Target
● Use Spark to process data
● Model training on top of the processed data
Previous solution
● Spark/Flink + Ceph + model training
Problems
● Writing temporary files to Ceph caused high pressure on Ceph
● Ceph read/write pressure could not be controlled, making the cluster unstable
Solution with Alluxio
Spark/Flink + Alluxio + Ceph + Alluxio + model training
● Alluxio supports multiple data sources and multiple model training frameworks
● Controls the read/write rate from Alluxio to Ceph
● Multiple independent Alluxio clusters support multi-tenancy, customized configuration, and access control
43. References
- Spark Performance Tuning Tips
- Accelerate Spark and Hive Jobs on AWS S3:
Use case from Bazaarvoice
- Spark + Alluxio: Tencent News Use Case
44. Why Use Alluxio with Spark?
• Improve I/O with better data locality
• Enable data sharing between Spark jobs
• Checkpoint data for resilience