4. Presto Today: Disaggregated Storage and Physics!
• Data is growing exponentially, far faster than compute usage
• The resulting industry trend is to scale storage and compute independently, e.g., Snowflake on S3, AWS EMR on S3, BigQuery on Google Cloud Storage
• This helps customers and cloud providers scale independently, reducing cost
• Data for querying and processing needs to be streamed from remote storage nodes
• New challenge for query latency: scanning huge amounts of data over the wire becomes I/O bound when the network is saturated
CAPTION: Presto Servers need to retrieve data from remote storage
The distance between compute and storage has increased, and overcoming physics is hard
5. RaptorX: Hierarchical Caching for Interactive Workloads!
• RaptorX’s goal is a no-migration query acceleration solution for existing Presto customers, so that existing workloads benefit seamlessly
• The challenge is to accelerate petabyte-scale interactive workloads without replicating data
• Top opportunities to increase performance were found through a comprehensive audit of the query lifecycle
• Caching is the obvious answer and not new; however, it is a lot of work to manage, e.g., cache invalidation!
• What’s new is ‘true no-work’ query acceleration: responses are returned up to 10x faster with no change in pipelines or queries
CAPTION: Presto with RaptorX smartly caches at every opportunity
Reduce distance between compute and storage intelligently!
6. Metastore Cache: 20% latency decrease
• Every Presto query makes a metastore call, getPartitions(), to learn about metadata (e.g., schema, partition list, and partition info)
• At FB scale, partitions are complex and can introduce latency!
• The Presto Coordinator (SQL endpoint) caches metadata to avoid calls to the metastore
• Slowly changing partitions benefit particularly from this (e.g., date-based partitions)
• The cache is versioned to confirm the validity of cached metadata
- A version number is attached to each cached key-value pair
- On every read request, the coordinator either fetches and caches partition information (if not cached)
- or confirms with the metastore that the cached information is up to date
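The versioned read path above can be sketched as follows. This is a minimal illustration, assuming a metastore that exposes a cheap version check and an expensive full fetch; FakeMetastore and the method names are hypothetical stand-ins, not Presto's actual metastore API.

```python
class FakeMetastore:
    """In-memory stand-in for a remote metastore (illustrative only)."""
    def __init__(self):
        self.version = 1
        self.partitions = ["ds=2021-03-22", "ds=2021-03-23"]
        self.fetch_calls = 0           # counts expensive full fetches

    def current_version(self, table):
        return self.version            # cheap validity check

    def fetch_partitions(self, table):
        self.fetch_calls += 1          # expensive call we want to avoid
        return self.version, list(self.partitions)


class VersionedMetastoreCache:
    def __init__(self, metastore):
        self.metastore = metastore
        self.cache = {}                # table -> (version, partitions)

    def get_partitions(self, table):
        entry = self.cache.get(table)
        if entry is not None:
            version, partitions = entry
            # Confirm the cached entry is still current via a cheap call.
            if self.metastore.current_version(table) == version:
                return partitions
        # Not cached, or stale: fetch and cache alongside its version.
        version, partitions = self.metastore.fetch_partitions(table)
        self.cache[table] = (version, partitions)
        return partitions


ms = FakeMetastore()
cache = VersionedMetastoreCache(ms)
cache.get_partitions("t")              # miss: one expensive fetch
cache.get_partitions("t")              # hit: version check only
ms.version = 2                         # partitions change in the metastore
cache.get_partitions("t")              # stale entry detected: refetched
```

The version check turns most reads into a lightweight round trip while still guaranteeing that stale metadata is never served.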
CAPTION: RaptorX caches table metadata with versioning
[Diagram: Presto Coordinator (i.e., SQL endpoint) with a versioned metadata cache, backed by the Metastore]
7. File List Cache: 100ms drop per query
• A listFile() call is used by Presto to retrieve the list of file names from the remote file system
• The Coordinator caches file lists in memory to avoid long listFile() calls to remote storage
• The challenge is applicability: caching suits partitions/directories that are compacted or sealed, i.e., no new data will be added to the partition
• However, real-time ingestion and serving depend on fresh data, i.e., partitions/directories that are open/not compacted
• For open partitions, RaptorX skips caching directories to guarantee data freshness
• Note that consistency is still maintained when a query uses a mix of compacted/sealed and open partitions
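The seal-aware caching rule above can be sketched as follows, assuming the coordinator knows whether a partition is sealed; FakeRemoteStorage and the `sealed` flag are illustrative, not Presto's actual interfaces.

```python
class FakeRemoteStorage:
    """In-memory stand-in for a remote file system (illustrative only)."""
    def __init__(self):
        self.files = {"t/ds=2021-03-22": ["a.orc", "b.orc"],
                      "t/ds=2021-03-24": ["c.orc"]}
        self.list_calls = 0

    def list_files(self, path):
        self.list_calls += 1           # slow remote listFile() call
        return list(self.files.get(path, []))


class FileListCache:
    def __init__(self, storage):
        self.storage = storage
        self.cache = {}                # path -> cached file list

    def list_files(self, path, sealed):
        if not sealed:
            # Open partition: always hit remote storage for freshness.
            return self.storage.list_files(path)
        if path not in self.cache:     # sealed: safe to cache
            self.cache[path] = self.storage.list_files(path)
        return self.cache[path]


storage = FakeRemoteStorage()
flc = FileListCache(storage)
flc.list_files("t/ds=2021-03-22", sealed=True)
flc.list_files("t/ds=2021-03-22", sealed=True)   # served from cache
flc.list_files("t/ds=2021-03-24", sealed=False)  # open: always remote
flc.list_files("t/ds=2021-03-24", sealed=False)
```

Because open partitions bypass the cache entirely, a query mixing sealed and open partitions sees cached listings only where they cannot go stale.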
CAPTION: RaptorX caches file lists to lower query latency
[Diagram: Presto Coordinator (i.e., SQL endpoint) with a file list cache, backed by Remote Storage]
8. Affinity Scheduling for Compute/Data locality
• Presto optimizes cluster utilization by assigning work uniformly to worker nodes across all running queries
• This prevents nodes from becoming overloaded, which would slow down queries as overloaded nodes become a compute bottleneck
• With affinity scheduling, the Presto Coordinator schedules requests that process a certain data/file to the same Presto worker node
• Consistently sending requests for the same data to the same worker node means fewer remote storage calls to retrieve data
• There is a high probability that this data/file is cached on that particular worker node
• The scheduling policy is "soft": if the destination worker node is too busy or unavailable, the scheduler falls back to its secondary worker node pick
• Stay tuned for results of a more sophisticated scheduling policy (currently in testing)
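The hashed-file-path scheduling with a soft fallback can be sketched as below. This is an illustrative sketch, not Presto's actual scheduler: the function name, the use of MD5, and the "next worker in the ring" secondary pick are all assumptions.

```python
import hashlib

def pick_worker(file_path, workers, is_busy):
    """Soft affinity: hash the file path to a preferred worker so the
    same file keeps landing on the same node; fall back to a
    deterministic secondary pick if that node is busy/unavailable."""
    h = int(hashlib.md5(file_path.encode()).hexdigest(), 16)
    primary = workers[h % len(workers)]
    if not is_busy(primary):
        return primary
    # Secondary pick: next worker in the ring (load balancing).
    return workers[(h + 1) % len(workers)]

workers = ["worker-0", "worker-1", "worker-2"]
never_busy = lambda w: False
first = pick_worker("s3://bucket/t/part-00.orc", workers, never_busy)
again = pick_worker("s3://bucket/t/part-00.orc", workers, never_busy)
```

Determinism is the point: as long as the primary is healthy, every split over the same file reaches the same worker, so that worker's local caches keep hitting.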
CAPTION: RaptorX does a best effort to send jobs that use
data from remote storage to nodes that have processed
jobs with the same data, reducing remote storage calls
[Diagram: Presto Coordinator (i.e., SQL endpoint) scheduler hashes the file path to send processing work to the same worker instance; load balancing kicks in if the target worker node is at capacity]
9. File Desc & Footer Cache: 40% CPU & latency decrease
• openFile() calls to remote storage are used to learn about columnar file data
• Footers have a high hit rate, as they are the indexes to the data itself
• Presto worker nodes cache file descriptors in memory to avoid long openFile() calls to remote storage
• Especially beneficial for super-wide tables that contain hundreds or thousands of columns: up to 40% CPU and latency decrease
• Presto worker nodes also cache common columnar file and stripe footers in memory
• Supported file formats are ORC, DWRF, and Parquet
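A footer cache of this shape can be sketched with a simple memoized open, assuming the columnar files are immutable so a cached footer never goes stale; the function name, the fake footer contents, and the counter are illustrative, not Presto's reader API.

```python
from functools import lru_cache

open_calls = 0  # counts expensive remote openFile() round trips

@lru_cache(maxsize=1024)
def get_file_footer(path):
    """Cache the file descriptor/footer per path so repeated reads of
    the same immutable columnar file skip the remote openFile() call."""
    global open_calls
    open_calls += 1                    # stand-in for the remote call
    return {"path": path, "format": "ORC", "num_stripes": 4}

get_file_footer("t/ds=2021-03-22/a.orc")
get_file_footer("t/ds=2021-03-22/a.orc")   # cache hit: no remote call
```

For wide tables the footer describes hundreds of column streams, so skipping the repeated fetch and parse is where the CPU savings come from.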
CAPTION: RaptorX caches file descriptors to lower query latency
[Diagram: Presto worker file descriptor cache backed by Remote Storage; ORC file layout: Header, Index Data, Row Data, Stripe Footer, Metadata, File Footer, Postscript (Optimized Row Columnar file)]
10. Data cache using Alluxio: 10X - 20X latency decrease
• Performance is improved by caching data on flash disks co-located with the Presto worker; a collaboration between the Alluxio and Presto teams created a worker-node-level embedded cache library
• The cache is transparent to Presto (standard HDFS interface); Presto falls back to the remote data source if there are disk failures
• On a cache hit, the Alluxio local cache reads data directly from the local disk and returns the cached data to Presto; otherwise, it retrieves the data from the remote data source and caches it on the local disk for follow-up queries
• The caching mechanism aligns each read to 1MB chunks, where 1MB is configurable to adapt to different storage media
• Example IO: [1.1MB, 5.6MB]
- Alluxio will issue IO [1MB, 6MB]
- It then saves the following 5 chunks on disk: [1MB, 2MB], [2MB, 3MB], [3MB, 4MB], [4MB, 5MB], and [5MB, 6MB]
- If there is another IO [4.3MB, 7.8MB], then [4.3MB, 6MB] will be fetched locally and [6MB, 8MB] will be issued and cached as two extra chunks: [6MB, 7MB] and [7MB, 8MB]
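The chunk alignment in the example above can be worked through in code. This mirrors the slide's arithmetic only; the function and its return shape are illustrative, not Alluxio's actual API.

```python
CHUNK_MB = 1  # chunk size in MB; configurable per storage medium

def align_read(start_mb, end_mb, cached_chunks):
    """Align a logical read [start, end) to chunk boundaries and split
    it into chunks served locally vs. newly fetched; newly fetched
    chunks are added to the local cache."""
    lo = int(start_mb // CHUNK_MB) * CHUNK_MB            # floor
    hi = int(-(-end_mb // CHUNK_MB)) * CHUNK_MB          # ceil
    local, fetched = [], []
    for c in range(lo, hi, CHUNK_MB):
        (local if c in cached_chunks else fetched).append((c, c + CHUNK_MB))
    cached_chunks.update(c for c, _ in fetched)
    return (lo, hi), local, fetched

cached = set()
io1, local1, fetched1 = align_read(1.1, 5.6, cached)   # first IO
io2, local2, fetched2 = align_read(4.3, 7.8, cached)   # overlapping IO
```

Running this reproduces the slide: the first IO aligns to [1MB, 6MB] and fetches five chunks; the second serves [4MB, 6MB] locally and fetches only [6MB, 7MB] and [7MB, 8MB].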
CAPTION: RaptorX workers use an embedded Alluxio cache on local flash: cache hits are served from local disk, cache misses are fetched from remote storage and cached
[Diagram: Presto Coordinator, Worker with Alluxio caching in 1MB chunks, Remote Storage; cache hits served locally, cache misses fetched remotely]
11. Fragmented Result Cache: 45% latency decrease and
75% CPU decrease
• An exact-results cache has been around for a long time; it does not help if queries differ
• RaptorX instead uses a fragmented result cache, which caches fragment results
• Especially beneficial for slice-and-dice, drill-down, sliding-window reporting and visualization use cases, or queries where customers add/remove filters and projections
• Consider two aggregate queries, Query 1 and Query 2, over an overlapping time period
• The partially computed sum for each of the 2021-03-22, 2021-03-23, and 2021-03-24 partitions (i.e., the corresponding files) is cached on Presto workers, forming a fragment result for Query 1
• A subsequent query then only needs to aggregate/compute the 2021-03-25 and 2021-03-26 partitions, reducing both compute and I/O cost
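The reuse pattern above can be sketched as follows, assuming fragments are keyed by a normalized plan plus a partition; the data values, the scan helper, and the plan key string are all illustrative, not Presto's plan representation.

```python
# Per-partition column sums (hypothetical data for the example).
DATA = {"2021-03-22": 10, "2021-03-23": 20, "2021-03-24": 30,
        "2021-03-25": 40, "2021-03-26": 50}
scan_count = 0  # counts scan + partial-aggregation fragments computed

def scan_and_sum(ds):
    """Stand-in for a Scan Node feeding a partial sum(col) AggNode."""
    global scan_count
    scan_count += 1
    return DATA[ds]

def query_sum(partitions, cache, plan_key="partial sum(col) over T"):
    total = 0
    for ds in partitions:
        key = (plan_key, ds)           # fragment = plan shape + partition
        if key not in cache:
            cache[key] = scan_and_sum(ds)
        total += cache[key]            # final agg combines partial sums
    return total

cache = {}
q1 = query_sum(["2021-03-22", "2021-03-23", "2021-03-24"], cache)
q2 = query_sum(["2021-03-22", "2021-03-23", "2021-03-24",
                "2021-03-25", "2021-03-26"], cache)
```

Query 2 covers five partitions but only scans the two that Query 1 did not touch; the other three partial sums come straight from the fragment cache.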
CAPTION: RaptorX’s fragment result cache reduces compute and I/O cost
Query 1: SELECT SUM(col) FROM T WHERE ds BETWEEN '2021-03-22' AND '2021-03-24'
Query 2: SELECT SUM(col) FROM T WHERE ds BETWEEN '2021-03-22' AND '2021-03-26'
[Diagram: Query 2 reuses the cached results for partitions 2021-03-22, 2021-03-23, and 2021-03-24; Scan Nodes and partial sum(col) AggNodes run only for 2021-03-25 and 2021-03-26, and a final sum(col) AggNode combines 03-22 through 03-26]
12. Fragmented Result Cache
• The previous example explains intelligent cache handling when filtering on partition columns
• Another query type contains non-partition column filters; cache misses for such query types are reduced by partition-statistics-based pruning
• Consider Query 3, where time is a non-partition column. NOW() is a function whose value changes all the time, so caching on the absolute value results in 0% cache hits
• The predicate time > NOW() - INTERVAL '3' DAY is a "loose" condition that is going to be true for most partitions, so the predicate can be removed from the plan for them
• For example, if today is 2021-03-24, we know that for partition ds = 2021-03-23 the predicate time > NOW() - INTERVAL '3' DAY is always true
• RaptorX derives a normalized plan shape with
- plan canonicalization/normalization
- partition column pruning
- non-partition column pruning based on partition stats
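The stats-based pruning decision from the example can be sketched as a small check, assuming every `time` value in a daily partition falls on or after that partition's day; the helper name and signature are illustrative, not Presto's planner API.

```python
from datetime import date, timedelta

def filter_always_true(partition_day, today, days=3):
    """Partition-stats-based pruning sketch: if the partition's minimum
    `time` (bounded below by its day) already exceeds NOW() - INTERVAL
    '<days>' DAY, the filter is true for every row in the partition and
    can be pruned from the normalized plan, making the fragment
    cacheable despite the volatile NOW()."""
    cutoff = today - timedelta(days=days)
    return partition_day > cutoff

today = date(2021, 3, 24)                              # cutoff: 2021-03-21
recent = filter_always_true(date(2021, 3, 23), today)  # prunable filter
old = filter_always_true(date(2021, 3, 20), today)     # filter must stay
```

Partitions where the check fails keep the filter in their plan fragment; only the fragments whose filter is provably always true get the normalized, cacheable plan shape.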
CAPTION: RaptorX’s intelligent fragmented result cache
reduces compute and I/O cost
Query 3: SELECT SUM(col) FROM T WHERE ds BETWEEN '2021-03-22' AND '2021-03-26' AND time > NOW() - INTERVAL '3' DAY
[Diagram: plan fragment Scan Node → Filter (time > NOW() - INTERVAL '3' DAY) → AggNode partial sum(col); for partition ds = 2021-03-23 the same fragment is shown with its always-true filter eligible for pruning]
13. RaptorX: 10X faster than Presto!
• We see more than a 10X increase in query performance with RaptorX in production at Facebook
• A TPC-H benchmark between Presto and RaptorX also confirms the performance difference!
• The test was run on a 114-node cluster with 1TB SSD and 4 threads per task
• The TPC-H scale factor was 100, with data in remote storage
• Scan- and aggregation-heavy queries show 10X improvement (Q1, Q6, Q12-16, Q19, and Q22)
• Join-heavy queries show between 3X and 5X improvement (Q2, Q5, Q10, and Q17)
CAPTION: Presto + Cache i.e. RaptorX is on average 10X faster
10X better performance with no change in pipelines!
14. Not a research project: RaptorX is in production!
• RaptorX is battle tested!
• RaptorX is widely deployed (10K+ machines) within Facebook for interactive workloads that need low-latency query performance
• Other low-latency query engines (with co-located storage or disaggregated row-based storage) have been consolidated into RaptorX
• RaptorX is the engine of choice for interactive queries within Facebook!