6. Alluxio Local Cache: Overview
Production Deployment
● Deployed to 3 clusters, with >200 nodes each
● Plugged in as a local library in the Presto worker
● Leverages Presto workers’ local NVMe disks
● Selective caching based on a cache filter
https://prestodb.io/blog/2020/06/16/alluxio-datacaching
8. Challenge #1: Realtime Partition Updates
● At Uber, many tables/partitions are constantly changing
○ Upserts are constantly written into Hudi tables
● The partition id alone is not sufficient as a caching key
○ The same partition may have changed in Hive while Alluxio still caches the outdated version
● Result: partitions in the cache become stale
9. Challenge #1: Realtime Partition Updates
● Solution: add the latest Hive modification time to the caching key
○ Previous caching key: hdfs://<path>
○ New caching key: hdfs://<path><mod time>
■ Concatenate the last modification time to the path
● The new partition version, carrying the latest modification time, gets cached
● Tradeoff: the outdated partition remains in the cache, wasting space until it is evicted
○ Improving the eviction strategy is a work in progress
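The key construction above can be sketched as follows; the function name and key format here are illustrative, not Alluxio's actual API, and the modification time would come from the Hive metastore:

```python
# Sketch: concatenate the partition's last Hive modification time onto the
# HDFS path, so an updated partition yields a new key (a cache miss and
# re-fetch) instead of serving stale cached bytes.
def cache_key(hdfs_path: str, last_modified_time: int) -> str:
    return f"{hdfs_path}{last_modified_time}"
```

The same path with a newer modification time produces a distinct key, which is exactly what makes the stale copy unreachable (though it still occupies cache space until evicted).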
10. Challenge #2: Cluster Membership Change
● Cached bytes are only present on certain nodes
○ SOFT_AFFINITY scheduling
● Presto worker nodes may go up or down due to operational activities
○ Node crashes
○ Nodes taken down for maintenance
○ Ad-hoc node restarts
● When membership changes, node selection may route requests to the wrong nodes
12. Challenge #2: Cluster Membership Change
[Diagram: Presto Coordinator routing key=4 across Presto Worker#0, Worker#1, and Worker#2]
Now worker#2 goes down, and the new lookup is key 4 % 2 nodes = worker#0, but worker#0 does not have the bytes (with 3 nodes, 4 % 3 = 1 had placed them on worker#1)
13. Challenge #2: Cluster Membership Change
Solution: Node-id-based consistent hashing
● All nodes are placed on a virtual ring
● The relative ordering of nodes on the ring doesn’t change when membership changes
● Always look up the key on the ring
○ Instead of using a modulo-based hash
● Use replication for better robustness
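The ring lookup above can be sketched as a minimal consistent-hashing implementation with virtual nodes. This is an illustrative sketch, not Alluxio's actual code; the virtual-node count and hash choice are assumptions:

```python
import bisect
import hashlib

VIRTUAL_NODES = 100  # illustrative; more virtual nodes smooths the key distribution

def _hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self):
        self._slots = []   # sorted hash positions on the virtual ring
        self._nodes = {}   # slot -> node id

    def add_node(self, node_id: str) -> None:
        for i in range(VIRTUAL_NODES):
            slot = _hash(f"{node_id}#{i}")
            bisect.insort(self._slots, slot)
            self._nodes[slot] = node_id

    def remove_node(self, node_id: str) -> None:
        for i in range(VIRTUAL_NODES):
            slot = _hash(f"{node_id}#{i}")
            self._slots.remove(slot)
            del self._nodes[slot]

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        i = bisect.bisect_left(self._slots, _hash(key))
        if i == len(self._slots):
            i = 0  # wrap around the ring
        return self._nodes[self._slots[i]]
```

Unlike `hash(key) % num_nodes`, removing one node here only remaps the keys that were owned by that node; all other keys keep resolving to the same worker, which is what preserves cache locality across membership changes.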
14. Challenge #3: Cache Size Restriction
● At Uber, the data accessed by Presto queries >> the disk space available on worker nodes
○ ~50PB of data accessed daily vs. 500GB of local disk space per node
○ Heavy eviction can hurt overall cache performance
● Only a selected subset of data can fit into the cache:
○ certain tables
○ a certain number of partitions per table
15. Challenge #3: Cache Size Restriction
● Solution: Cache Filter
○ A mechanism that decides whether to cache a table and how many partitions to cache
○ Based on a static JSON config that specifies:
■ which tables are eligible for caching
■ how many partitions to cache for each table
● A sample configuration:
{
  "databases": [{
    "name": "database_foo",
    "tables": [{
      "name": "table_bar",
      "maxCachedPartitions": 100
    }]
  }]
}
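The filter decision against a config shaped like the JSON above can be sketched as follows; the function name and the partition-counting argument are illustrative assumptions, not the actual Alluxio/Presto interface:

```python
# Sketch: a file is cached only if its table appears in the cache-filter
# config and the table is still within its maxCachedPartitions budget.
def should_cache(config: dict, database: str, table: str,
                 num_cached_partitions: int) -> bool:
    for db in config.get("databases", []):
        if db["name"] != database:
            continue
        for t in db.get("tables", []):
            if t["name"] == table:
                return num_cached_partitions < t["maxCachedPartitions"]
    return False  # tables absent from the config are never cached
```

With the sample config, `table_bar` is cached until 100 of its partitions are resident, and any unlisted table is rejected outright.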
16. Challenge #3: Cache Size Restriction
● Greatly increased cache hit rate
○ From ~65% to >90%
● Notes on the cache filter
○ Manual, static configuration
○ Should ideally be driven by traffic patterns, e.g.:
■ Most frequently accessed tables
■ Most common number of partitions being accessed
■ Tables that do not change too frequently
■ Ideally based on shadow-caching numbers and table-level metrics
18. Current Status and Next Steps
● Deployed to production clusters
○ 3 clusters of 200+ nodes each, all nodes with NVMe disks and 500GB of cache space per node
○ Using the cache filter to cache the ~20 most frequently accessed tables
○ Initial measurements show great improvement
■ ~1/3 of the wall time for input scans (TableScanOperator and ScanFilterProjectOperator) vs. no cache
● Next Steps
○ Onboard more tables and improve the table-onboarding process
■ E.g. shadow caching
○ Better support for changing partitions / Hudi tables
○ Other optimizations
■ E.g. load balancing between nodes
22. Persistent File Level Metadata for Local Cache
● Prevent stale caching
○ The underlying data files might be changed by third-party frameworks (rare for Hive tables, but very common for Hudi tables)
● Metadata should be recoverable after a server restart
● Support file- or partition-level eviction
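One way to make the metadata survive a restart is to persist each cached file's source modification time next to its cached pages. The layout, file names, and helpers below are illustrative assumptions, not Alluxio's actual on-disk format:

```python
import json
import os

# Sketch: persist file-level metadata (source path + last-modified time)
# alongside the cached data so that, after a worker restart, stale entries
# can be detected and evicted instead of being served.
def save_metadata(meta_path: str, source_path: str, mtime: int) -> None:
    with open(meta_path, "w") as f:
        json.dump({"source": source_path, "mtime": mtime}, f)

def is_stale(meta_path: str, current_mtime: int) -> bool:
    # Missing metadata is treated as stale, forcing a re-fetch.
    if not os.path.exists(meta_path):
        return True
    with open(meta_path) as f:
        meta = json.load(f)
    return meta["mtime"] != current_mtime
```

On restart, the worker can scan the metadata files, compare each recorded `mtime` against the source system, and evict any entry whose underlying file has changed.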
23. Future Work
● Performance Tuning
○ Improve cache efficiency
○ Optimize for SATA or mechanical hard drives
● Adopt Shadow Cache
○ Table-level working set estimation
○ Partition-level popularity estimation