RaptorX
Rohit Jain
Software Engineer June 24th, 2021
10X faster Presto for Facebook scale petabyte workloads
Presto @ Facebook Scale
• 50K+ servers
• ~1 EB of data scanned per day
Presto Today: Disaggregated Storage and Physics!
• Data volume is growing much faster than compute usage
• The resulting industry trend is to scale storage and compute
independently, e.g., Snowflake on S3, AWS EMR on S3, BigQuery on
Google Cloud Storage
• This lets customers and cloud providers scale each tier independently,
reducing cost
• Data for querying and processing needs to be streamed from remote
storage nodes
• New challenge for query latency: scanning huge amounts of data
over the wire becomes I/O bound once the network is saturated
CAPTION: Presto Servers need to retrieve data from remote storage
Distance has increased between compute and storage and overcoming Physics is hard
RaptorX: Hierarchical Caching for Interactive Workloads!
• RaptorX’s goal is a no-migration query acceleration solution for
existing Presto customers, so that existing workloads benefit
seamlessly
• The challenge is to accelerate petabyte-scale interactive workloads
without replicating data
• Top opportunities to improve performance were found through a
comprehensive audit of the query lifecycle
• Caching is the obvious answer and is not new; however, it takes a lot of
work to manage (e.g., cache invalidation)
• What’s new is ‘true no-work’ query acceleration: responses are returned
up to 10x faster with no change to pipelines or queries
CAPTION: Presto with RaptorX smartly caches at every opportunity
Reduce distance between compute and storage intelligently!
Metastore Cache: 20% latency decrease
• Every Presto query makes a metastore call, getPartitions(), to learn about
metadata (e.g., schema, partition list, and partition info)
• Facebook-scale partitions are complex and can introduce latency!
• The Presto coordinator (SQL endpoint) caches metadata to avoid calls to the metastore.
• Slowly changing partitions particularly benefit from this (e.g., date-based
partitions)
• The cache is versioned to confirm the validity of cached metadata
- A version number is attached to each cached key-value pair.
- For every read request, the coordinator either fetches the partition
information and caches it (if it is not cached yet),
- or confirms with the metastore that the cached information is up to date
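The versioning scheme described in these bullets can be sketched roughly as follows. This is a minimal illustration, not Presto's actual implementation: `FakeMetastore`, `get_version`, and `VersionedMetadataCache` are hypothetical names, and the sketch assumes the metastore can answer a cheap "current version" request separately from the expensive full metadata fetch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Versioned:
    version: int
    partitions: List[str]


class FakeMetastore:
    """Hypothetical stand-in for the metastore; not a real Presto or Hive API."""

    def __init__(self) -> None:
        self._tables: Dict[str, Versioned] = {}
        self.full_fetches = 0  # counts expensive metadata fetches

    def put(self, table: str, partitions: List[str]) -> None:
        # Every metadata change bumps the version number.
        old = self._tables.get(table)
        version = old.version + 1 if old else 1
        self._tables[table] = Versioned(version, list(partitions))

    def get_version(self, table: str) -> int:
        # Assumed cheap call: returns only the current version number.
        return self._tables[table].version

    def get_partitions(self, table: str) -> Versioned:
        # Expensive call: returns the full partition metadata.
        self.full_fetches += 1
        entry = self._tables[table]
        return Versioned(entry.version, list(entry.partitions))


class VersionedMetadataCache:
    """Coordinator-side cache that validates entries by version."""

    def __init__(self, metastore: FakeMetastore) -> None:
        self._metastore = metastore
        self._cache: Dict[str, Versioned] = {}

    def get_partitions(self, table: str) -> List[str]:
        cached = self._cache.get(table)
        if cached is not None and cached.version == self._metastore.get_version(table):
            return cached.partitions  # cached entry confirmed up to date
        fresh = self._metastore.get_partitions(table)  # miss or stale entry
        self._cache[table] = fresh
        return fresh.partitions
```

With this shape, a repeated query against an unchanged table costs one version check instead of a full metadata fetch.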
CAPTION: RaptorX caches table metadata with versioning
DIAGRAM: Presto coordinator (i.e. SQL endpoint) with a versioned metadata cache, backed by the metastore
File List Cache: 100ms drop per query
• Presto uses a listFile() call to retrieve the list of files and their names
from the remote file system
• The coordinator caches file lists in memory to avoid long listFile calls to remote
storage.
• The cache is only applicable to partitions / directories that are compacted or
sealed, i.e. no new data will be added to the partition
• However, real-time ingestion and serving depend on fresh data, i.e. partitions /
directories that are open / not compacted
• For open partitions, RaptorX skips caching directories to guarantee data
freshness
• Note that consistency is still maintained when a query uses a mix of
compacted/sealed and open partitions
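The sealed-versus-open rule above could look roughly like this. It is a sketch under stated assumptions: `FileListCache`, the `list_files` callback, and the `is_sealed` predicate are hypothetical names for illustration, and the slides do not say how RaptorX decides that a partition is sealed.

```python
from typing import Callable, Dict, List


class FileListCache:
    """Caches directory listings for sealed partitions only."""

    def __init__(
        self,
        list_files: Callable[[str], List[str]],  # remote listing call (assumed)
        is_sealed: Callable[[str], bool],        # sealed/compacted check (assumed)
    ) -> None:
        self._list_files = list_files
        self._is_sealed = is_sealed
        self._cache: Dict[str, List[str]] = {}

    def list_files(self, directory: str) -> List[str]:
        if not self._is_sealed(directory):
            # Open partition: always go to remote storage so new files are seen.
            return self._list_files(directory)
        if directory not in self._cache:
            # Sealed partition: the listing cannot change, so cache it.
            self._cache[directory] = self._list_files(directory)
        return list(self._cache[directory])
```

A query that touches both kinds of partition gets cached listings for the sealed ones and fresh listings for the open ones, which is why a mixed query stays consistent.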
CAPTION: RaptorX caches file lists to lower query latency
DIAGRAM: Presto coordinator (i.e. SQL endpoint) with a file list cache, backed by remote storage
Affinity Scheduling for Compute/Data locality
• Presto optimizes cluster utilization by assigning work uniformly to the worker
nodes across all running queries.
• This prevents nodes from becoming overloaded and turning into a compute
bottleneck that slows queries down.
• With affinity scheduling, the Presto coordinator schedules requests that process
a given data/file on the same Presto worker node.
• Sending requests for the same data consistently to the same worker node means
fewer remote storage calls to retrieve data
• There is a high probability that this data/file is already cached on that worker node
• The scheduling policy is "soft", i.e. if the destination worker node is too busy or
unavailable, the scheduler falls back to its secondary worker node pick
• Stay tuned for results of a more sophisticated scheduling policy (currently in testing)
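A minimal sketch of soft affinity scheduling follows. It assumes a simple modulo hash of the file path and an ordered fallback list; `preferred_workers` and `pick_worker` are hypothetical helpers, and the slides do not specify which hashing scheme RaptorX actually uses.

```python
import hashlib
from typing import List, Set


def preferred_workers(file_path: str, workers: List[str]) -> List[str]:
    """Return the workers ordered by preference for this file path."""
    # A stable hash keeps the mapping identical across coordinator restarts.
    digest = hashlib.sha256(file_path.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(workers)
    # Primary pick first, then secondary, and so on.
    return workers[start:] + workers[:start]


def pick_worker(file_path: str, workers: List[str], busy: Set[str]) -> str:
    """Soft affinity: use the primary pick unless it is busy or unavailable."""
    for worker in preferred_workers(file_path, workers):
        if worker not in busy:
            return worker
    raise RuntimeError("no worker available")
```

Modulo hashing remaps most files when the worker list changes; consistent hashing is a common way to limit that, though whether it applies here is an assumption.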
CAPTION: RaptorX does a best effort to send jobs that use
data from remote storage to nodes that have processed
jobs with the same data, reducing remote storage calls
DIAGRAM: The scheduler in the Presto coordinator (i.e. SQL endpoint) hashes the file path to send processing work to the same worker instance; load balancing is done if the target worker node is at capacity
File Desc & Footer Cache: 40% CPU & latency decrease
• openFile() calls to remote storage are used to learn about columnar file data
• Footers have a high hit rate, as they are the indexes to the data itself
• Presto worker nodes cache file descriptors in memory to avoid long openFile
calls to remote storage
• Especially beneficial for super wide tables that contain hundreds or thousands
of columns: up to 40% CPU and latency decrease
• Presto worker nodes also cache common columnar file and stripe footers in
memory.
• Supported file formats are ORC, DWRF, and Parquet
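An in-memory footer cache on a worker might be sketched as below. The LRU eviction policy, the `capacity` parameter, and the `read_footer` callback are assumptions made for illustration; the slides only state that footers are cached in memory.

```python
from collections import OrderedDict
from typing import Callable


class FooterCache:
    """Keeps recently used file footers in memory (LRU eviction is assumed)."""

    def __init__(self, read_footer: Callable[[str], bytes], capacity: int = 2) -> None:
        self._read_footer = read_footer  # hypothetical remote read of a footer
        self._capacity = capacity
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.remote_reads = 0

    def get(self, path: str) -> bytes:
        if path in self._entries:
            self._entries.move_to_end(path)  # mark as most recently used
            return self._entries[path]
        self.remote_reads += 1
        footer = self._read_footer(path)
        self._entries[path] = footer
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used
        return footer
```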
CAPTION: RaptorX caches file descriptors to lower query latency
DIAGRAM: Presto with a file descriptor cache, backed by remote storage. Optimized Row Columnar (ORC) file layout: header, index data, row data, stripe footer, metadata, file footer, postscript
Data cache using Alluxio: 10X - 20X latency decrease
• Performance is improved by caching data on flash disks co-located with the Presto
worker; the Alluxio and Presto teams collaborated to create an embedded cache
library at the worker node level
• The cache is transparent to Presto (standard HDFS interface). Presto falls back to
the remote data source if there are disk failures.
• On a cache hit, the Alluxio local cache reads data directly from the local disk and
returns the cached data to Presto; otherwise, it retrieves data from the remote
data source and caches the data on the local disk for follow-up queries.
• The caching mechanism aligns each read to 1MB chunks; the chunk size is configurable
so it can be adapted to different storage media
• Example IO: [1.1MB, 5.6MB]
- Alluxio will issue IO [1MB, 6MB]
- Then save the following 5 chunks on disk: [1MB, 2MB], [2MB, 3MB], [3MB, 4MB],
[4MB, 5MB], and [5MB, 6MB]
- If there is another IO [4.3MB, 7.8MB], then [4.3MB, 6MB] will be read locally
and [6MB, 8MB] will be issued remotely and cached as two extra chunks: [6MB, 7MB]
and [7MB, 8MB]
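The chunk arithmetic in this example can be reproduced with a short sketch. `ChunkCache` and `chunk_indices` are hypothetical names, reads are treated as half-open byte ranges, and chunk `i` stands for the range [i MB, (i+1) MB]; none of this is the Alluxio API.

```python
from typing import List, Set, Tuple

MB = 1 << 20  # default chunk size; configurable in the scheme described above


def chunk_indices(start: int, end: int, chunk_size: int = MB) -> List[int]:
    """Indices of the aligned chunks that cover the byte range [start, end)."""
    if end <= start:
        return []
    first = start // chunk_size
    last = (end - 1) // chunk_size
    return list(range(first, last + 1))


class ChunkCache:
    """Tracks which aligned chunks are already on local disk."""

    def __init__(self, chunk_size: int = MB) -> None:
        self.chunk_size = chunk_size
        self.cached: Set[int] = set()

    def read(self, start: int, end: int) -> Tuple[List[int], List[int]]:
        """Return (chunks served locally, chunks fetched remotely and cached)."""
        hits: List[int] = []
        misses: List[int] = []
        for index in chunk_indices(start, end, self.chunk_size):
            (hits if index in self.cached else misses).append(index)
        self.cached.update(misses)  # remote chunks are saved for follow-up reads
        return hits, misses


cache = ChunkCache()
first = cache.read(11 * MB // 10, 56 * MB // 10)   # IO [1.1MB, 5.6MB]
second = cache.read(43 * MB // 10, 78 * MB // 10)  # IO [4.3MB, 7.8MB]
print(first)   # ([], [1, 2, 3, 4, 5])
print(second)  # ([4, 5], [6, 7])
```

The first read misses on chunks 1 to 5, and the second read is served locally for chunks 4 and 5 while chunks 6 and 7 are fetched and cached, matching the example.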
DIAGRAM: Presto coordinator, worker, and remote storage; Alluxio caching keeps 1 MB chunks on the worker, with separate paths for a cache hit and a cache miss
Fragmented Result Cache: 45% latency decrease and 75% CPU decrease
• Exact result caches have been around for a long time, but they do not help if
queries differ
• RaptorX uses a fragmented result cache, which caches the results of plan fragments
• Especially beneficial for slice-and-dice, drill-down, sliding-window reporting and
visualization use cases, or queries where customers add/remove filters and
projections
• Consider two aggregate queries over overlapping time periods, Query 1 and Query 2
• The partially computed sum for each of the 2021-03-22, 2021-03-23, and 2021-03-24
partitions (i.e. the corresponding files) is cached on Presto workers, forming a
fragment result for Query 1.
• The subsequent Query 2 only needs to compute the 2021-03-25 and
2021-03-26 partitions, reducing both compute and I/O cost
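The reuse of per-partition partial sums can be illustrated with a small sketch. `FragmentResultCache`, the `fragment_key` string, and the `scan_partition` callback are hypothetical; in particular, keying the cache by a plan-fragment string plus the partition is an assumption about how such a cache might be organized.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class FragmentResultCache:
    """Caches partial aggregates per (plan fragment, partition)."""

    def __init__(self, scan_partition: Callable[[str], List[int]]) -> None:
        self._scan = scan_partition  # hypothetical scan of one partition's column
        self._partials: Dict[Tuple[str, str], int] = {}
        self.scans = 0  # number of partitions actually scanned

    def sum_over(self, fragment_key: str, partitions: Iterable[str]) -> int:
        total = 0
        for ds in partitions:
            key = (fragment_key, ds)
            if key not in self._partials:
                # Cache miss: scan the partition and compute its partial sum.
                self.scans += 1
                self._partials[key] = sum(self._scan(ds))
            total += self._partials[key]  # final aggregation over partial sums
        return total
```

Running the equivalent of Query 1 over three partitions and then Query 2 over five would scan five partitions in total rather than eight, since the first three partial sums are reused.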
CAPTION: RaptorX’s fragment result cache reduces compute and I/O cost
Query 1:
SELECT SUM(col)
FROM T
WHERE ds BETWEEN '2021-03-22' AND '2021-03-24'
Query 2:
SELECT SUM(col)
FROM T
WHERE ds BETWEEN '2021-03-22' AND '2021-03-26'
DIAGRAM: Cached results exist for 2021-03-22, 2021-03-23 and 2021-03-24; scan nodes for 2021-03-25 and 2021-03-26 each feed an AggNode computing a partial sum(col), and a final AggNode computes sum(col) for 03-22 to 03-26
Fragmented Result Cache
• The previous example explains intelligent cache handling when filtering on partition
columns
• Another query type contains non-partition column filters; cache misses for such
queries are reduced by pruning based on partition statistics
• Consider Query 3, where time is a non-partition column. NOW() is a function whose
value changes all the time, so caching on its absolute value results in 0% cache hits
• The predicate time > NOW() - INTERVAL '3' DAY is a "loose" condition that is
true for most of the partitions, so it can be removed from the plan for them
• For example, if today is 2021-03-24, we know that for partition ds = 2021-03-23 the
predicate time > NOW() - INTERVAL '3' DAY is always true.
• RaptorX builds a normalized plan shape with
- Plan canonicalization/normalization
- Partition column pruning
- Non-partition column pruning based on partition stats
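The stats-based pruning step could be sketched like this, assuming each partition records the minimum value of its time column. `PartitionStats`, `canonical_fragment_key`, and the key format are hypothetical and only illustrate why dropping an always-true predicate makes the cache key stable.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PartitionStats:
    min_time: datetime  # smallest value of the time column (assumed statistic)


def canonical_fragment_key(
    partition: str,
    stats: PartitionStats,
    now: datetime,
    window: timedelta = timedelta(days=3),
) -> str:
    """Build a cache key for the predicate time > NOW() - INTERVAL '3' DAY."""
    threshold = now - window
    if stats.min_time > threshold:
        # Every row satisfies the predicate, so it is dropped from the key.
        # The key no longer depends on NOW() and can be reused across runs.
        return f"sum(col)|ds={partition}"
    # The predicate filters some rows here, so it must stay part of the key.
    return f"sum(col)|ds={partition}|time>{threshold.isoformat()}"
```

For ds = 2021-03-23 with today being 2021-03-24, the threshold falls on 2021-03-21, so the key is the same no matter when during the day the query runs.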
CAPTION: RaptorX’s intelligent fragmented result cache
reduces compute and I/O cost
Query 3:
SELECT SUM(col)
FROM T
WHERE ds BETWEEN '2021-03-22' AND '2021-03-26'
AND time > NOW() - INTERVAL '3' DAY
DIAGRAM: Plan fragments for Query 3; each scan node (e.g. for partition 2021-03-23) feeds a filter time > NOW() - INTERVAL '3' DAY and an AggNode computing a partial sum(col)
RaptorX: 10X faster than Presto!
• We see a more than 10X increase in query performance
with RaptorX in production at Facebook
• A TPC-H benchmark comparing Presto and RaptorX also
confirms the performance difference!
• The test was run on a 114-node cluster with 1TB SSD and 4
threads per task
• TPC-H scale factor was 100, with data in remote storage
• Scan- and aggregation-heavy queries show a 10X
improvement (Q1, Q6, Q12-16, Q19 and Q22)
• Join-heavy queries show between 3X and 5X
improvement (Q2, Q5, Q10, or Q17)
CAPTION: Presto + Cache i.e. RaptorX is on average 10X faster
10X better performance with no change in pipelines!
DIAGRAM: Chart comparing Presto and RaptorX
Not a research project: RaptorX is in production!
• RaptorX is battle tested!
• RaptorX is widely deployed (10K+ machines) within Facebook for interactive workloads that need low-latency query
performance
• Other low-latency query engines (with co-located storage or disaggregated row-based storage) have been consolidated into RaptorX
• RaptorX is the engine of choice for interactive queries within Facebook!
Come join us!
facebook.com/careers

Weitere ähnliche Inhalte

Was ist angesagt?

Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsDatabricks
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookDatabricks
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleDatabricks
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compactionMIJIN AN
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detailMIJIN AN
 
Some Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdfSome Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdfMichael Kogan
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance SmackdownDataWorks Summit
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaWhen Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaDatabricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 

Was ist angesagt? (20)

Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale Platforms
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at Facebook
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
 
Some Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdfSome Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdf
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance Smackdown
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaWhen Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu Ma
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 

Ähnlich wie RaptorX: Building a 10X Faster Presto with hierarchical cache

Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongCeph Community
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageKai Sasaki
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost
How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low CostHow The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost
How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low CostDatabricks
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network ProcessingRyousei Takano
 
Top 20 FAQs on the Autonomous Database
Top 20 FAQs on the Autonomous DatabaseTop 20 FAQs on the Autonomous Database
Top 20 FAQs on the Autonomous DatabaseSandesh Rao
 
SkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemSkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemJayjeetChakraborty
 
[ACNA2022] Hadoop Vectored IO_ your data just got faster!.pdf
[ACNA2022] Hadoop Vectored IO_ your data just got faster!.pdf[ACNA2022] Hadoop Vectored IO_ your data just got faster!.pdf
[ACNA2022] Hadoop Vectored IO_ your data just got faster!.pdfMukundThakur22
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreDataWorks Summit
 
Performance Considerations in Logical Data Warehouse
Performance Considerations in Logical Data WarehousePerformance Considerations in Logical Data Warehouse
Performance Considerations in Logical Data WarehouseDenodo
 
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...CitiusTech
 
Tuning for Oracle RAC Wait Events
Tuning for Oracle RAC Wait EventsTuning for Oracle RAC Wait Events
Tuning for Oracle RAC Wait EventsConfio Software
 
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibabahbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at AlibabaMichael Stack
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Mark Kromer
 
Presto: Fast SQL on Everything
Presto: Fast SQL on EverythingPresto: Fast SQL on Everything
Presto: Fast SQL on EverythingDavid Phillips
 
Enabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with AlluxioEnabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with AlluxioAlluxio, Inc.
 
SAP HANA System Replication (HSR) versus SAP Replication Server (SRS)
SAP HANA System Replication (HSR) versus SAP Replication Server (SRS)SAP HANA System Replication (HSR) versus SAP Replication Server (SRS)
SAP HANA System Replication (HSR) versus SAP Replication Server (SRS)Gary Jackson MBCS
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure DataTaro L. Saito
 
Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mark Kromer
 

Ähnlich wie RaptorX: Building a 10X Faster Presto with hierarchical cache (20)

Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud Storage
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost
How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low CostHow The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost
How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network Processing
 
Top 20 FAQs on the Autonomous Database
Top 20 FAQs on the Autonomous DatabaseTop 20 FAQs on the Autonomous Database
Top 20 FAQs on the Autonomous Database
 
SkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemSkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage System
 
[ACNA2022] Hadoop Vectored IO_ your data just got faster!.pdf
[ACNA2022] Hadoop Vectored IO_ your data just got faster!.pdf[ACNA2022] Hadoop Vectored IO_ your data just got faster!.pdf
[ACNA2022] Hadoop Vectored IO_ your data just got faster!.pdf
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
 
Performance Considerations in Logical Data Warehouse
Performance Considerations in Logical Data WarehousePerformance Considerations in Logical Data Warehouse
Performance Considerations in Logical Data Warehouse
 
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
 
Tuning for Oracle RAC Wait Events
Tuning for Oracle RAC Wait EventsTuning for Oracle RAC Wait Events
Tuning for Oracle RAC Wait Events
 
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibabahbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
 
Presto: Fast SQL on Everything
Presto: Fast SQL on EverythingPresto: Fast SQL on Everything
Presto: Fast SQL on Everything
 
Enabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with AlluxioEnabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with Alluxio
 
SAP HANA System Replication (HSR) versus SAP Replication Server (SRS)
SAP HANA System Replication (HSR) versus SAP Replication Server (SRS)SAP HANA System Replication (HSR) versus SAP Replication Server (SRS)
SAP HANA System Replication (HSR) versus SAP Replication Server (SRS)
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure Data
 
Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021
 

Mehr von Alluxio, Inc.

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioAlluxio, Inc.
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingAlluxio, Inc.
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLAlluxio, Inc.
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio, Inc.
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...Alluxio, Inc.
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionAlluxio, Inc.
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
RaptorX: Building a 10X Faster Presto with hierarchical cache

  • 2. RaptorX: 10X faster Presto for Facebook scale petabyte workloads. Rohit Jain, Software Engineer. June 24th, 2021
  • 3. Presto @ Facebook Scale: 50K+ servers, ~1 EB of data scanned per day
  • 4. Presto Today: Disaggregated Storage and Physics!
      • Data is growing exponentially faster than use of compute
      • The resulting industry trend is towards scaling storage and compute independently, e.g., Snowflake on S3, AWS EMR on S3, BigQuery on Google Storage, etc.
      • Helps customers and cloud providers scale independently, reducing cost
      • Data for querying and processing needs to be streamed from remote storage nodes
      • New challenge for query latency: scanning huge amounts of data over the wire becomes I/O bound when the network is saturated
      CAPTION: Presto servers need to retrieve data from remote storage
      Distance has increased between compute and storage, and overcoming physics is hard
  • 5. RaptorX: Hierarchical Caching for Interactive Workloads!
      • RaptorX's goal is to create a no-migration query acceleration solution for existing Presto customers, so that existing workloads can benefit seamlessly
      • The challenge is to accelerate interactive workloads that are petabyte scale without replicating data
      • Found the top opportunities to increase performance by doing a comprehensive audit of the query lifecycle
      • Caching is obviously the answer and is not new; however, it is a lot of work to manage, e.g., cache invalidation etc.!
      • What's new is 'true no-work' query acceleration; responses are returned up to 10x faster with no change in pipelines or queries
      CAPTION: Presto with RaptorX smartly caches at every opportunity
      Reduce distance between compute and storage intelligently!
  • 6. Metastore Cache: 20% latency decrease
      • Every Presto query makes a metastore call getPartitions() to learn about metadata (e.g., schema, partition list, and partition info)
      • FB scale partitions are complex and can introduce latency!
      • Presto Coordinator (SQL endpoint) caches metadata to avoid calls to the metastore
      • Slow changing partitions particularly benefit from this (e.g., date based partitions)
      • Cache is versioned to confirm validity of cached metadata
          - A version number is attached to each cache key-value pair
          - For every read request, the coordinator either fetches partition information for caching if it is not cached
          - or confirms with the metastore that the cached information is up to date
      CAPTION: RaptorX caches table metadata with versioning
      Diagram labels: Presto, Metastore, Coordinator i.e. SQL endpoint, metadata versioned cache
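The versioned lookup described on this slide can be sketched roughly as follows. This is a minimal illustration, not Presto's implementation: `VersionedPartitionCache`, `FakeMetastore`, and the `get_version` call are hypothetical names, and the sketch assumes the metastore can answer a cheap "what is the current version?" request separately from the expensive `getPartitions()` call.

```python
class FakeMetastore:
    """Hypothetical stand-in for a metastore; counts expensive calls."""

    def __init__(self):
        self.version = 1
        self.partitions = ["ds=2021-03-22"]
        self.full_fetches = 0

    def get_version(self, table):
        # Assumed to be cheap compared with listing partitions.
        return self.version

    def get_partitions(self, table):
        self.full_fetches += 1
        return list(self.partitions)


class VersionedPartitionCache:
    """Coordinator-side cache: each entry stores (version, partitions)."""

    def __init__(self, metastore):
        self._metastore = metastore
        self._entries = {}

    def get_partitions(self, table):
        current = self._metastore.get_version(table)
        entry = self._entries.get(table)
        if entry is not None and entry[0] == current:
            return entry[1]  # cached metadata confirmed up to date
        partitions = self._metastore.get_partitions(table)
        self._entries[table] = (current, partitions)
        return partitions
```

With this shape, repeated queries against a slow changing table pay only for the version check, and a version bump forces exactly one full refetch.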
  • 7. File List Cache: 100ms drop per query
      • A listFile() call is used by Presto to retrieve the list of files and their names from the remote file system
      • Coordinator caches file lists in memory to avoid long listFile calls to remote storage
      • The challenge is that caching is applicable only to partitions / directories that are compacted or sealed, i.e. no new data will be added to the partition
      • However, real-time ingestion and serving depend on fresh data, i.e. partitions / directories are open / not compacted
      • For open partitions, RaptorX skips caching directories to guarantee data freshness
      • Note that consistency is still maintained when a query uses a mix of compacted/sealed and open partitions
      CAPTION: RaptorX caches file lists to lower query latency
      Diagram labels: Presto, Remote Storage, Coordinator i.e. SQL endpoint, File List cache
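The sealed-versus-open rule on this slide can be illustrated with a small sketch. The class name `FileListCache`, the `list_files` callback, and the `sealed` flag are assumptions made for the example; how Presto actually learns that a partition is sealed is not stated in the slides.

```python
class FileListCache:
    """Caches directory listings, but only for sealed partitions."""

    def __init__(self, list_files):
        # list_files: callable(directory) -> list of file names (remote call)
        self._list_files = list_files
        self._cache = {}

    def get(self, directory, sealed):
        if not sealed:
            # Open partition: always list remotely so new files are visible.
            return self._list_files(directory)
        if directory not in self._cache:
            self._cache[directory] = self._list_files(directory)
        return self._cache[directory]
```

A query that touches both kinds of partitions simply gets cached listings for the sealed ones and fresh listings for the open ones.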
  • 8. Affinity Scheduling for Compute/Data locality
      • Presto optimizes cluster utilization by assigning work to the worker cluster nodes uniformly across all running queries
      • This prevents nodes from becoming overloaded, which would slow down queries because the overloaded nodes become a compute bottleneck
      • With affinity scheduling, the Presto Coordinator schedules requests that process a certain data/file to the same Presto worker node
      • Sending requests for the same data consistently to the same worker node means fewer remote storage calls to retrieve data
      • There is a high probability that this data/file is already cached on that particular worker node
      • Scheduling policy is "soft", i.e. if the destination worker node is too busy or unavailable, the scheduler will fall back to its secondary worker node pick
      • Stay tuned for results of a more sophisticated scheduling (in testing currently)
      CAPTION: RaptorX does a best effort to send jobs that use data from remote storage to nodes that have processed jobs with the same data, reducing remote storage calls
      Diagram labels: Presto, Coordinator i.e. SQL endpoint, Scheduler; hashed file path to send processing work to the same worker instance; load balancing is done if the target worker node is at capacity
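A rough sketch of "soft" affinity scheduling by hashed file path is shown below. It is an illustration under stated assumptions, not the Presto scheduler: `pick_worker` and `is_busy` are hypothetical names, plain modulo hashing is used for brevity, and the "secondary pick" is assumed to be the next worker in list order.

```python
import hashlib


def pick_worker(file_path, workers, is_busy):
    """Return the preferred worker for file_path, or a fallback if it is busy.

    workers: non-empty list of worker identifiers in a stable order.
    is_busy: callable(worker) -> bool, True if the worker is at capacity
             or unavailable.
    """
    digest = hashlib.sha256(file_path.encode("utf-8")).digest()
    preferred = int.from_bytes(digest[:8], "big") % len(workers)
    # Try the preferred worker first, then each following worker in order.
    for offset in range(len(workers)):
        candidate = workers[(preferred + offset) % len(workers)]
        if not is_busy(candidate):
            return candidate
    # Every worker reports busy: keep affinity rather than fail.
    return workers[preferred]
```

Because the hash depends only on the file path, repeated reads of the same file land on the same worker while it has capacity, which is what makes the worker-local caches effective.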
  • 9. File Desc & Footer Cache: 40% CPU & latency decrease
      • OpenFile() calls to remote storage are used to learn about columnar file data
      • High hit rate of footers as they are the indexes to the data itself
      • Presto worker nodes cache file descriptors in memory to avoid long openFile calls to remote storage
      • Especially beneficial for super wide tables that contain hundreds or thousands of columns: up to 40% CPU and latency decrease
      • Presto worker nodes also cache common columnar file and stripe footers in memory
      • Supported file formats are ORC, DWRF, and Parquet
      CAPTION: RaptorX caches file descriptors to lower query latency
      Diagram labels: Presto, Remote Storage, Coordinator i.e. SQL endpoint, File Descriptor cache; Optimized Row Columnar (ORC) file: Header, Index Data, Row Data, Stripe Footer, Metadata, File Footer, Postscript
  • 10. Data cache using Alluxio: 10X - 20X latency decrease
      • Improved performance by caching data on flash disks co-located with the Presto worker; collaboration between the Alluxio and Presto teams to create a worker node level embedded cache library
      • Cache is transparent to Presto (standard HDFS interface). Presto falls back to the remote data source if there are disk failures
      • On a cache hit, the Alluxio local cache reads data directly from the local disk and returns the cached data to Presto; otherwise, it retrieves data from the remote data source and caches the data on the local disk for follow-up queries
      • Caching mechanism aligns each read into 1MB chunks, where the 1MB size is configurable so it can be adapted to different storage media
      • Example IO: [1.1MB, 5.6MB]
          - Alluxio will issue IO [1MB, 6MB]
          - Then save the following 5 chunks on disk: [1MB, 2MB], [2MB, 3MB], [3MB, 4MB], [4MB, 5MB], and [5MB, 6MB]
          - If there is another IO [4.3MB, 7.8MB], then [4.3MB, 6MB] will be fetched locally, and [6MB, 8MB] will be issued to remote storage and cached as two extra chunks: [6MB, 7MB] and [7MB, 8MB]
      CAPTION: RaptorX does a best effort to send jobs that use data from remote storage to nodes that have processed jobs with the same data, reducing remote storage calls
      Diagram labels: Presto, Coordinator, Remote Storage, Worker, 1 MB chunks, Alluxio Caching, Cache hit, Cache miss
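The chunk alignment arithmetic from the example IO can be reproduced with a short sketch. This is not the Alluxio cache library's API; `chunk_indices` and `plan_read` are hypothetical helpers, offsets are in bytes, and chunk `i` is assumed to cover bytes `[i * chunk_size, (i + 1) * chunk_size)`.

```python
MB = 1024 * 1024


def chunk_indices(start, end, chunk_size=MB):
    """Indices of the aligned chunks that cover the byte range [start, end)."""
    first = start // chunk_size
    last = -(-end // chunk_size)  # ceiling division
    return list(range(first, last))


def plan_read(start, end, cached, chunk_size=MB):
    """Split a read into chunks served locally and chunks fetched remotely.

    cached: set of chunk indices already on local disk; it is updated in
    place with the newly fetched chunks.
    """
    needed = chunk_indices(start, end, chunk_size)
    hits = [i for i in needed if i in cached]
    misses = [i for i in needed if i not in cached]
    cached.update(misses)
    return hits, misses
```

For the slide's example, a read of [1.1MB, 5.6MB] against an empty cache yields five misses (chunks 1 to 5), and a following read of [4.3MB, 7.8MB] yields hits on chunks 4 and 5 and misses on chunks 6 and 7.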
  • 11. Fragmented Result Cache: 45% latency decrease and 75% CPU decrease
      • Exact results cache has been around for a long time; it does not help if queries differ
      • RaptorX uses a fragmented result cache, which caches fragment results
      • Especially beneficial for slice and dice, drill down, sliding window reporting and visualization use cases, or queries where customers add/remove filters and projections
      • Consider two aggregate queries over an overlapping time period, Query 1 and Query 2
      • The partially computed sum for each of the 2021-03-22, 2021-03-23, and 2021-03-24 partitions (i.e. the corresponding files) is cached on Presto workers, forming a fragment result for Query 1
      • A subsequent query will only need to aggregate/compute the 2021-03-25 and 2021-03-26 partitions, reducing both compute and I/O cost
      CAPTION: RaptorX's fragment result cache reduces compute and I/O cost
      Query 1: SELECT SUM(col) FROM T WHERE ds BETWEEN '2021-03-22' AND '2021-03-24'
      Query 2: SELECT SUM(col) FROM T WHERE ds BETWEEN '2021-03-22' AND '2021-03-26'
      Diagram labels: Cached Result 2021-03-22, Cached Result 2021-03-23, Cached Result 2021-03-24; Scan Node 2021-03-25, Scan Node 2021-03-26; AggNode partial sum(col) 2021-03-25, AggNode partial sum(col) 2021-03-26; AggNode final sum(col) 03-22 to 03-26
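The reuse of per-partition partial sums between Query 1 and Query 2 can be sketched as below. This is a simplified model, not RaptorX code: `FragmentResultCache`, `scan_partition`, and the `plan_key` string are hypothetical, and a real engine would key the cache on a canonicalized plan fragment plus the file or split identity.

```python
class FragmentResultCache:
    """Caches the partial SUM computed for each (plan fragment, partition)."""

    def __init__(self, scan_partition):
        # scan_partition: callable(ds) -> iterable of values of `col`
        self._scan_partition = scan_partition
        self._partials = {}
        self.scans = 0  # number of partitions actually scanned

    def sum_over(self, plan_key, partitions):
        total = 0
        for ds in partitions:
            key = (plan_key, ds)
            if key not in self._partials:
                # Cache miss: scan the partition and keep its partial sum.
                self.scans += 1
                self._partials[key] = sum(self._scan_partition(ds))
            total += self._partials[key]  # final aggregation step
        return total
```

Running the Query 1 date range first and the Query 2 range second scans the three shared partitions once; the second call only scans 2021-03-25 and 2021-03-26.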
  • 12. Fragmented Result Cache
      • The previous example explains intelligent cache handling when filtering on partition columns
      • Another query type is one that contains non-partition column filters; cache misses for such query types are reduced by partition statistics based pruning
      • Consider Query 3, where time is a non-partition column. NOW() is a function whose value changes all the time. Caching on the absolute value results in 0% cache hits
      • Predicate time > NOW() - INTERVAL '3' DAY is a "loose" condition that is going to be true for most of the partitions, so the predicate can be removed from the plan for those partitions
      • For example, if today is 2021-03-24, we know that for partition ds = 2021-03-23, predicate time > NOW() - INTERVAL '3' DAY is always true
      • RaptorX makes a normalized plan shape with
          - Plan canonicalization/normalization
          - Partition column pruning
          - Non-partition column pruning based on partition stats
      CAPTION: RaptorX's intelligent fragmented result cache reduces compute and I/O cost
      Query 3: SELECT SUM(col) FROM T WHERE ds BETWEEN '2021-03-22' AND '2021-03-26' AND time > NOW() - INTERVAL '3' DAY
      Diagram labels: Scan Node, Filter time > NOW() - INTERVAL '3' DAY, AggNode partial sum(col); Scan Node 2021-03-23, Filter time > NOW() - INTERVAL '3' DAY, AggNode partial sum(col)
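One way to picture the stats-based pruning of the `time > NOW() - INTERVAL '3' DAY` predicate is the sketch below. It assumes, hypothetically, that per-partition min/max statistics for the `time` column are available; `fragment_cache_key` and the key layout are illustrative only and do not reflect RaptorX's actual plan canonicalization.

```python
from datetime import datetime, timedelta


def fragment_cache_key(ds, time_min, time_max, now, window=timedelta(days=3)):
    """Return a NOW()-independent cache key for a partition, or None.

    time_min / time_max: assumed partition statistics for the `time` column.
    """
    threshold = now - window
    if time_min > threshold:
        # Predicate is true for every row: drop the filter from the fragment.
        return (ds, "sum(col)", "filter-pruned")
    if time_max <= threshold:
        # Predicate is false for every row: the fragment contributes nothing.
        return (ds, "sum(col)", "always-empty")
    # Partition straddles the threshold: result depends on NOW(), so this
    # sketch does not cache it.
    return None
```

For partition ds = 2021-03-23 with all `time` values inside that day, any NOW() on 2021-03-24 produces the same key, so repeated runs of Query 3 can hit the same cached fragment.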
  • 13. RaptorX: 10X faster than Presto!
      • We see more than 10X increase in query performance with RaptorX in production at Facebook
      • TPC-H benchmark between Presto and RaptorX also confirms the performance difference!
      • Test was run on a 114 node cluster with 1TB SSD and 4 threads per task
      • TPC-H scale factor was 100 in remote storage
      • Scan and aggregation heavy queries show 10X improvement (Q1, Q6, Q12-16, Q19 and Q22)
      • Join heavy queries show between 3X and 5X improvement (Q2, Q5, Q10, or Q17)
      CAPTION: Presto + Cache i.e. RaptorX is on average 10X faster
      10X better performance with no change in pipelines!
      Diagram labels: Presto, RaptorX
  • 14. Not a research project: RaptorX is in production!
      • RaptorX is battle tested!
      • We want to highlight that RaptorX is widely deployed (10K+ machines) within Facebook for interactive workloads that need low-latency query performance
      • Other low-latency query engines (with co-located storage or disaggregated row-based storage) have been consolidated into RaptorX
      • RaptorX is the engine of choice for interactive queries within Facebook!