Alluxio Bay Area Meetup March 14th
Join the Alluxio Meetup group: https://www.meetup.com/Alluxio
Alluxio Community slack: https://www.alluxio.org/slack
4. About Me
Calvin Jia
• Release Manager for Alluxio 2.0.0
• Contributor since Tachyon 0.4 (2012)
• Founding Engineer @ Alluxio
5. Alluxio Overview
⢠Open source, distributed storage system
⢠Commonly used for data analytics such as OLAP on Hadoop
⢠Deployed at Huya, Two Sigma, Tencent, and many others
⢠Largest deployments of over 1000 nodes
[Architecture diagram: application interfaces (Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface) on top of Alluxio; under storage drivers (HDFS Driver, Swift Driver, S3 Driver, NFS Driver) below]
8. Why 2.0
⢠Alluxio 1.x target use cases are largely addressed
⢠Three major types of feedback from users
⢠Want to support POSIX-based workloads, especially ML
⢠Want better options for data management
⢠Want to scale to larger clusters
9. Use Cases
Alluxio 1.x
• Burst compute into cloud with data on-prem
• Enable object stores for data analytics platforms
• Accelerate OLAP on Hadoop
Example
• As a data scientist, I want to be able to spin up my own elastic compute cluster that can easily and efficiently access my data stores.
New in Alluxio 2.x
• Enable ML/DL frameworks on object stores
• Data lifecycle management and data migration
Examples
• As a data scientist, I want to run my existing simulations on larger datasets stored in S3.
• As a data infrastructure engineer, I want to automatically tier data between Alluxio and the under store.
10. ML/DL Workloads
⢠Alluxio 1.x focuses primarily on Hadoop based workloads, ie. OLAP
on Hadoop
⢠Alluxio 2.x will continue to excel for these workloads
⢠New emphasis on ML frameworks such as Tensorflow
⢠Primarily accesses the same data set which Alluxio already is serving
⢠Challenges include new API and file characteristics, such as file access
pattern and file sizes
11. Data Management
⢠Finer grained control over Alluxio replication
⢠Automated and scalable async persistence
⢠Distributed data loading
⢠Mechanism for cross-mount data operations
12. Scaling
⢠Namespace scaling - scale to 1 billion files
⢠Cluster scaling - scale to 3000 worker nodes
⢠Client scaling - scale to 30,000 concurrent clients
14. Architectural Innovations in 2.0
⢠Off heap metadata storage (namespace scaling)
⢠gRPC transport layer (cluster and client scaling)
⢠Improved POSIX API (new workloads)
⢠Job Service (enable data management)
⢠Embedded Journal and Internal Leader Election (better integration
with object stores, fewer external dependencies)
15. Off Heap Metadata Storage
⢠Uses an embedded RocksDB to store inode tree
⢠Internal cache for frequently used inodes
⢠Performance is comparable to previous on-heap option when
working set can fit in cache
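A minimal sketch of the idea in Java, assuming RocksDB's Java bindings (org.rocksdb): the full inode tree lives off heap in RocksDB while a bounded, access-ordered map acts as the hot-inode cache. Class names and the key layout are illustrative, not Alluxio's actual implementation.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class RocksInodeStore implements AutoCloseable {
  private static final int CACHE_SIZE = 10_000; // hot-inode cache capacity

  private final RocksDB mDb;
  // Access-ordered LinkedHashMap gives a simple LRU eviction policy.
  private final Map<Long, byte[]> mCache =
      new LinkedHashMap<Long, byte[]>(CACHE_SIZE, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
          return size() > CACHE_SIZE;
        }
      };

  public RocksInodeStore(String dbPath) throws RocksDBException {
    RocksDB.loadLibrary();
    Options options = new Options().setCreateIfMissing(true);
    mDb = RocksDB.open(options, dbPath);
  }

  public synchronized void put(long inodeId, byte[] serializedInode) throws RocksDBException {
    mDb.put(key(inodeId), serializedInode); // off-heap, persistent
    mCache.put(inodeId, serializedInode);   // on-heap, bounded
  }

  public synchronized byte[] get(long inodeId) throws RocksDBException {
    byte[] cached = mCache.get(inodeId);
    if (cached != null) {
      return cached; // cache hit: comparable to the old on-heap store
    }
    byte[] fromDb = mDb.get(key(inodeId)); // cache miss: pay a RocksDB read
    if (fromDb != null) {
      mCache.put(inodeId, fromDb);
    }
    return fromDb;
  }

  private static byte[] key(long inodeId) {
    return Long.toString(inodeId).getBytes(StandardCharsets.UTF_8);
  }

  @Override
  public void close() {
    mDb.close();
  }
}
```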
16. gRPC Transport Layer
⢠Switch from Thrift (metadata) + Netty (data) transport to a
consolidated gRPC based transport
⢠Connection multiplexing to reduce the number of connections from
# of application threads to # of applications
⢠Threading model enables the master to serve concurrent requests
without being limited by internal threadpool size or open file
descriptors on the master
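The multiplexing idea can be sketched with gRPC's Java API: every thread in an application reuses one ManagedChannel, so concurrent requests become HTTP/2 streams over a single TCP connection instead of one connection per thread. The class below is illustrative, not Alluxio's code.

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public final class SharedChannel {
  private static volatile ManagedChannel sChannel;

  // Every thread calls this; only one underlying connection is created
  // per application process, and gRPC multiplexes requests over it.
  public static ManagedChannel get(String masterHost, int masterPort) {
    if (sChannel == null) {
      synchronized (SharedChannel.class) {
        if (sChannel == null) {
          sChannel = ManagedChannelBuilder.forAddress(masterHost, masterPort)
              .usePlaintext() // plaintext for the sketch; real deployments may use TLS
              .build();
        }
      }
    }
    return sChannel;
  }
}
```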
17. Improved POSIX API
⢠Alluxio FUSE based POSIX API
⢠Limitations such as no random write, file cannot be read until
complete
⢠Validated against Tensorflowâs image recognition and
recommendation workloads
⢠Taking suggestions for other POSIX-based workloads!
18. Job Service
⢠New process which serves as a lightweight computation framework
for Alluxio specific tasks
⢠Enables replication factor control without user input
⢠Enables faster loading/persisting of data in a distributed manner
⢠Allows users to do cross-mount operations
⢠Async through is handled automatically
19. Embedded Journal and Internal Leader Election
⢠New journaling service reliant only on Alluxio master processes
⢠No longer need an external distributed storage to store the journal
⢠Greatly benefits environments without a distributed file system
⢠Uses Raft as the consensus algorithm
⢠Consensus is used for journal integrity
⢠Consensus can also be used for leader election in high availability mode
21. Alluxio 2.0.0 Release
⢠Alluxio 2.0.0-preview is available now
⢠Any and all feedback is appreciated!
⢠File bugs and feature requests on our Github issues
⢠Alluxio 2.0.0 will be released in ~3 months
26. Overview - Big Data systems
• Separate streaming and batch platforms with a single data pre-processing pipeline; no longer a pure Lambda architecture
• Streaming data typically gets sunk into Hive tables every 5 minutes
• More ETL jobs are moving toward near real time (NRT)
[Pipeline diagram: Log → Kafka → Data Cleansing → Kafka → Augmentation → Kafka → Hive Delta / Hive Daily; streaming stages run on Storm/Flink/Spark, batch ETL on Hive/Spark]
27. The process of identifying a set of user actions ("events") across screens and touch points that contribute in some manner to a product sale, and then assigning value to each of these events.
[Diagram: example click path: front page → today's new → man's special → Product A detail → man's special → Product B detail → add cart → order]
28. Near real-time sales attribution is a very complex process
• Recompute the full day's data at each iteration:
  • ~30 minutes, worst case 2-3 hours
• Many data sources involved:
  • page view, add cart, order_with_discount, order_cookie_map, sub_order, prepay_order_goods, etc.
• Several large data sources each contain billions of records and take up 300GB-800GB on disk
• Sales path assignment is a very CPU-intensive computation
  • Written by business analysts
  • Complex SQL scripts with UDFs
Business expectation: updated results every 5-15 minutes
29. Running performance-sensitive jobs on the current batch platform is not an option
• Around 200K batch jobs executed daily in Hadoop & Spark clusters
  • HDFS: 1400+ nodes
  • SSD HDFS: 50+ nodes
  • Spark clusters: 300+ nodes
• Cluster usage is above 80% on normal days; resources are even more saturated during monthly promotion periods
• Many issues contribute to inconsistent data access times, such as NameNode RPC load being too high, slow DataNode responses, etc.
• Scheduling overhead when running M/R jobs
30. 1. Adding more compute power
  • Too expensive - not a real option
2. Improve the ETL job to process updates incrementally
3. Create a new, relatively isolated environment
  • Consistent computing resource allocation
  • Intermediate data caching
  • Faster read/write
31. ⢠Recompute the click paths for the active users in current window
⢠Merge active user paths with previous full path result
⢠Less data in computation but one more read on history data
2.Improve ETL Job to process
updates incrementally
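A minimal sketch of this strategy in Spark's Java API; the paths, column names, and the placeholder recomputation are assumptions standing in for the real SQL-with-UDF logic:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalAttribution {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("nrt-attribution").getOrCreate();

    // Events for users active in the current 5-15 minute window.
    Dataset<Row> windowEvents = spark.read().parquet("alluxio://master:19998/events/current");
    // Full path result from the previous iteration (the extra history read).
    Dataset<Row> history = spark.read().parquet("alluxio://master:19998/paths/latest");

    Dataset<Row> activeUsers = windowEvents.select("user_id").distinct();
    // Recompute click paths only for active users (placeholder for the real UDF logic).
    Dataset<Row> recomputed = history.join(activeUsers, "user_id");

    // Keep history rows for users with no new activity, add the recomputed paths.
    Dataset<Row> untouched = history.join(activeUsers,
        history.col("user_id").equalTo(activeUsers.col("user_id")), "left_anti");
    Dataset<Row> merged = untouched.unionByName(recomputed);

    merged.write().mode("overwrite").parquet("alluxio://master:19998/paths/next");
  }
}
```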
33. • A satellite Spark + Alluxio 1.8.1 cluster with 27 nodes (48 cores, 256GB memory each)
• Alluxio colocated with Spark
  • Very consistent read/write I/O times over iterations
• Alluxio Mem + HDD
  • Disable multiple copies to save space
  • Leave enough memory to the OS to improve stability
34. A. Remote HDFS cluster: 1-2x slower than Alluxio; the biggest problem is frequent latency spikes
B. Local HDFS: 30%-100% slower than Alluxio (Mem + HDD)
C. Dedicated SSD cluster: on par with Alluxio on regular days, but overall read/write latency doubles during busy days
D. Dedicated Alluxio cluster: still not as good as the co-located setup (more tests to be done)
E. Spark cache:
  • Our daily views, clicks, and path results are too big to fit into the JVM
  • Slow to create, and we have lots of "only used twice" data
  • Multiple downstream Spark apps need to share the data
35. • Move the downstream processes closer to the data; avoid duplicating large amounts of data from Alluxio to remote HDFS
• Managing NRT jobs:
  • A single big Spark Streaming job? Too many inputs and outputs at different stages
  • Split into multiple jobs? How to coordinate multiple streaming jobs?
  • NRT runs at a much higher frequency and is very sensitive to system hiccups
• Current batch job scheduling:
  • Process dependencies, executed at every fixed interval
  • When there is a severe delay, multiple batch instances for different slots run at the same time
36. • Report data readiness to a Watermark Service to manage dependencies between loosely coupled jobs
• The ultimate goal is to get the latest result fast
  • A delayed batch might consume unprocessed input blocks spanning multiple cycles
  • Output at fixed intervals is not guaranteed
• Not all inputs are mandatory; an iteration is kicked off even when optional input sources have not updated for that particular cycle (see the sketch below)
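A hypothetical sketch of that readiness check; all names are invented for illustration:

```java
import java.util.Map;
import java.util.Set;

public class IterationTrigger {
  /** inputWatermarks: source name -> latest reported watermark (epoch millis). */
  public static boolean shouldKickOff(Map<String, Long> inputWatermarks,
                                      Set<String> mandatory,
                                      long lastIterationWatermark) {
    for (String source : mandatory) {
      Long wm = inputWatermarks.get(source);
      if (wm == null || wm <= lastIterationWatermark) {
        return false; // a mandatory source has not advanced; keep waiting
      }
    }
    return true; // optional sources that did not update are simply skipped
  }
}
```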
37. • Easy to set up
• Pluggable: just a simple switch from hdfs://xxxx to alluxio://xxxx (see the sketch after this list)
• Together with Spark, it either forms a separate satellite cluster or runs on labeled machines in our big clusters
  • Within our data centers it is easier to allocate computing resources, but SSD machines are scarce
  • Spark and Alluxio on K8s: with over 1K machines, we need to shuffle those machines to run Streaming, Spark ETL, Presto ad hoc queries, or ML on different days or at different times of day
• Very stable in production
  • Over two and a half years without any major issue. A big thank you to the Alluxio engineers!
38. • Async persistence to remote HDFS
  • Avoids duplicated writes in user code/SQL (see the sketch after this list)
• Put the Hadoop /tmp/ directory on Alluxio over SSD to reduce NameNode RPC load and load on DataNodes
• Cache hot/warm data for Presto; heavy traffic and ad hoc queries are very sensitive to HDFS stability
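A sketch of the async-persist pattern through Alluxio's Hadoop-compatible client: setting the client write type to ASYNC_THROUGH lets writes complete in Alluxio and persist to the remote HDFS under store in the background. Host, port, and paths below are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AsyncPersistWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Alluxio client property: complete writes in Alluxio first,
    // persist to the under store asynchronously.
    conf.set("alluxio.user.file.writetype.default", "ASYNC_THROUGH");

    FileSystem fs = FileSystem.get(new URI("alluxio://master:19998/"), conf);
    try (FSDataOutputStream out = fs.create(new Path("/output/part-00000"))) {
      out.write("no duplicated write needed in user code".getBytes());
    } // returns without waiting for the HDFS copy
  }
}
```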