2. Version 1.0
Today's topic
Why Raptor is performant
When to choose Raptor
How to load data
How to enable
Ramesh Byndoor
Big-Data Team @OLA
What is Presto Raptor
6. Raptor Connector
Raptor is a columnar store on flash.
It is designed to fit natively with Presto (it was previously called presto-native).
Shared-nothing MPP architecture.
No redundant copies; flash/disk tiered storage.
9. Raptor Table (bucketed)
CREATE TABLE raptor."partner-app".click_1 (
  _time timestamp,
  dim1 varchar,
  _actor varchar
)
WITH (
  bucket_count = 30,                   -- Number of buckets into which to divide the table
  bucketed_on = array['_actor'],       -- Table columns on which to bucket the table
  temporal_column = '_time',           -- Temporal column of the table
  ordering = array['_time', 'dim1'],   -- Sort order for each shard of the table
  distribution_name = 'user-app'       -- Shared distribution name for co-located tables
)
11. Physical Data awareness
ordering
ordering = array['_time', 'dim1']
Sorts rows within each shard, using ORC’s native sort technique.
Takes an array of columns.
Lets the reader skip parts of files for better read throughput.
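To see why sorting inside a shard helps a reader skip data, here is a minimal Python sketch. It is not Raptor or ORC code: the chunk size, the random `_time` values, and the helper names are all assumptions made for illustration. It records a min/max pair per fixed-size chunk and counts how many chunks overlap a filter range, once for unsorted data and once for sorted data.

```python
import random

def chunk_stats(values, chunk_size):
    """Split values into fixed-size chunks and record (min, max) per chunk."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    return [(min(c), max(c)) for c in chunks]

def chunks_to_read(stats, lo, hi):
    """Count chunks whose [min, max] range overlaps the filter range [lo, hi]."""
    return sum(1 for cmin, cmax in stats if cmax >= lo and cmin <= hi)

random.seed(0)
# Hypothetical _time values, stored in arrival order (unsorted).
times = [random.randrange(10_000) for _ in range(1_000)]

unsorted_stats = chunk_stats(times, chunk_size=100)
sorted_stats = chunk_stats(sorted(times), chunk_size=100)

# Filter: 2000 <= _time <= 2500
unsorted_reads = chunks_to_read(unsorted_stats, 2_000, 2_500)
sorted_reads = chunks_to_read(sorted_stats, 2_000, 2_500)
print("chunks read without ordering:", unsorted_reads)
print("chunks read with ordering:   ", sorted_reads)
```

With sorted data the per-chunk ranges barely overlap, so a range filter touches only a few chunks; with unsorted data nearly every chunk spans the whole value range and has to be read.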
12. Physical Data awareness
temporal_column
temporal_column = '_time'
Shards are created based on time.
Ensures shards do not cross temporal boundaries.
Performance boost for queries with time-based filters.
Eases management of data retention.
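The effect of a temporal column can be sketched in a few lines of Python. This is an illustration only: it assumes day-sized boundaries and uses hypothetical helper names, and it does not reproduce Raptor's shard writer. Rows are grouped so that no shard spans two days, which lets retention drop whole shards instead of scanning rows.

```python
from collections import defaultdict
from datetime import datetime, date

def shard_by_day(rows):
    """Group rows into shards keyed by the calendar day of their timestamp."""
    shards = defaultdict(list)
    for ts, payload in rows:
        shards[ts.date()].append((ts, payload))
    return dict(shards)

def apply_retention(shards, keep_from):
    """Drop whole shards older than keep_from; no row-level scan is needed."""
    return {day: rows for day, rows in shards.items() if day >= keep_from}

rows = [
    (datetime(2018, 10, 9, 23, 59), "a"),
    (datetime(2018, 10, 10, 0, 1), "b"),
    (datetime(2018, 10, 10, 12, 0), "c"),
    (datetime(2018, 10, 11, 8, 30), "d"),
]

shards = shard_by_day(rows)
kept = apply_retention(shards, keep_from=date(2018, 10, 10))
for day in sorted(kept):
    print(day, [payload for _, payload in kept[day]])
```

The same grouping explains the speedup for time filters: a query for a single day only needs the shard for that day.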
13. Physical Data awareness
Bucketing & distribution
Hash-based bucketing.
For all tables in the same distribution, a given bucket resides on the same node.
Boosts co-located local joins (funnel use case).
Avoids global shuffling (the network is a big pain in big data).
Joins on the bucket key can be faster by an order of magnitude.
Limitation:
The bucket count cannot be modified for a distribution once it is created (#6252).
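The co-location idea can be sketched as follows. This is a simplified model, not Raptor's implementation: CRC32 stands in for whatever hash Raptor uses, and the node names, the bucket-to-node mapping, and the sample rows are hypothetical. Because both tables use the same hash and the same bucket-to-node mapping, rows with the same key land on the same node and each node can join locally.

```python
import zlib

BUCKET_COUNT = 30                       # same value as bucket_count in the table example
NODES = ["node-a", "node-b", "node-c"]  # hypothetical worker names

def bucket_of(key: str) -> int:
    """Stable hash bucket for a key (CRC32 is a stand-in, not Raptor's hash)."""
    return zlib.crc32(key.encode("utf-8")) % BUCKET_COUNT

def node_of(bucket: int) -> str:
    """One shared distribution means one bucket-to-node mapping for every table."""
    return NODES[bucket % len(NODES)]

def place(rows, key_index=0):
    """Assign each row to the node that owns its bucket."""
    placement = {node: [] for node in NODES}
    for row in rows:
        placement[node_of(bucket_of(row[key_index]))].append(row)
    return placement

clicks = [("user1", "click"), ("user2", "click"), ("user3", "click")]
orders = [("user1", "order"), ("user3", "order")]

clicks_by_node = place(clicks)
orders_by_node = place(orders)

# Each node joins only the rows it already holds; nothing is shuffled between nodes.
joined = [
    (c[0], c[1], o[1])
    for node in NODES
    for c in clicks_by_node[node]
    for o in orders_by_node[node]
    if c[0] == o[0]
]
print(sorted(joined))
```

The sketch also hints at why the bucket count is fixed per distribution: changing it would change `bucket_of` for every key and break the shared placement.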
14. Physical Data awareness
Column statistics/BRIN index
Helps narrow down the splits involved in a query.
Queries only the shards that possibly contain data.
SELECT shard_uuid, bucket_number
FROM x_shards_t435
WHERE ((c1_min > 100 AND c1_max <= 200) OR c1_min IS NULL)
ORDER BY bucket_number
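Shard pruning from min/max statistics can be modelled with a short Python sketch. It uses a range-overlap check, which is an assumption and not necessarily the exact predicate Raptor generates; the `ShardStats` class and the sample shards are hypothetical. A shard without statistics is always kept, mirroring the `IS NULL` branch in the query above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ShardStats:
    shard_uuid: str
    c1_min: Optional[int]  # None means no statistics were recorded
    c1_max: Optional[int]

def may_contain(shard: ShardStats, lo: int, hi: int) -> bool:
    """True if the shard could hold rows with lo <= c1 <= hi."""
    if shard.c1_min is None or shard.c1_max is None:
        return True  # unknown range: the shard cannot be ruled out
    return shard.c1_max >= lo and shard.c1_min <= hi

shards = [
    ShardStats("s1", 0, 50),
    ShardStats("s2", 90, 150),
    ShardStats("s3", 180, 400),
    ShardStats("s4", 500, 900),
    ShardStats("s5", None, None),
]

# Query filter: c1 > 100 AND c1 <= 200, i.e. 101 <= c1 <= 200 for integers.
selected = [s.shard_uuid for s in shards if may_contain(s, 101, 200)]
print(selected)
```

Only the selected shards become splits; the others are never opened.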
16. How to load data?
INSERT INTO raptor.schema1.t1
SELECT * FROM catalog.schema1.t1
WHERE dt = '2018-10-10'
Repeated load on failure? Delete the partially loaded data first, then rerun the INSERT:
DELETE FROM raptor.schema1.t1 WHERE dt = '2018-10-10'
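The delete-then-insert pattern makes a retried load idempotent. The sketch below demonstrates it with Python's built-in `sqlite3` module as a stand-in for Presto, purely so it can run anywhere; the `source` and `target` tables and the `reload_partition` helper are hypothetical, and the transaction used here is a SQLite feature, not something the slides claim for Raptor.

```python
import sqlite3

def reload_partition(conn, dt):
    """Remove rows left by an earlier attempt, then load the partition again."""
    with conn:  # a single SQLite transaction; an assumption of this sketch
        conn.execute("DELETE FROM target WHERE dt = ?", (dt,))
        conn.execute("INSERT INTO target SELECT * FROM source WHERE dt = ?", (dt,))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (dt TEXT, event TEXT)")
conn.execute("CREATE TABLE target (dt TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)",
    [("2018-10-10", "click"), ("2018-10-10", "view"), ("2018-10-11", "click")],
)

# Simulate a partial load left behind by a failed run.
conn.execute("INSERT INTO target VALUES ('2018-10-10', 'click')")

reload_partition(conn, "2018-10-10")
reload_partition(conn, "2018-10-10")  # a retry does not duplicate rows

count = conn.execute(
    "SELECT COUNT(*) FROM target WHERE dt = '2018-10-10'"
).fetchone()[0]
print("rows for 2018-10-10:", count)
```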
18. Presto Real time Collector
● Pushes events from Kinesis/Kafka to Presto (Raptor) in real time.
● Ever-evolving schema.
○ Automatically adds new tables.
○ Automatically adds new columns at the end of the table.
● On-the-fly data type detection.
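One way such a collector could evolve the schema is sketched below. This is not the collector's actual code: the event shape, the type mapping, and the helper names are assumptions. For each incoming event it infers a SQL type per field, emits a CREATE TABLE for an unseen table, and emits ALTER TABLE ... ADD COLUMN for unseen fields, which appends them as the last columns.

```python
def detect_type(value):
    """Map a decoded event value to a SQL type name (hypothetical mapping)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "varchar"

def evolve_schema(schema, event):
    """Return the DDL statements needed before the event can be inserted.

    schema maps table name -> dict of column name -> type, in column order.
    """
    table, fields = event["table"], event["fields"]
    if table not in schema:
        schema[table] = {name: detect_type(v) for name, v in fields.items()}
        cols = ", ".join(f"{n} {t}" for n, t in schema[table].items())
        return [f"CREATE TABLE {table} ({cols})"]
    statements = []
    for name, value in fields.items():
        if name not in schema[table]:
            schema[table][name] = detect_type(value)  # appended as the last column
            statements.append(
                f"ALTER TABLE {table} ADD COLUMN {name} {schema[table][name]}"
            )
    return statements

schema = {}
first = evolve_schema(schema, {"table": "click", "fields": {"_actor": "u1", "amount": 1.5}})
second = evolve_schema(schema, {"table": "click", "fields": {"_actor": "u2", "retries": 3}})
print(first)
print(second)
```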
20. When to choose
● Hot cache for dashboards.
● Real time funnels (co-located joins are great in Raptor).
● Real-time event analytics.
21. Hive LLAP vs Raptor
LLAP: Overhead on the first query.
Raptor: No overhead on the first query. The shard recovery manager automatically pulls missing shards onto flash (storage.missing-shard-discovery-interval=5m).
LLAP: Cache misses are frequent (LRFU).
Raptor: There is no cache; everything is served from flash, backed by a backup store (S3, Gluster, etc.).
LLAP: Redistribution of tables over the network cannot be controlled; the same is true for aggregations.
Raptor: Bucketing (bucket_column) avoids data shuffling. Example: all events of the same user are on the same node.
LLAP: Physical awareness is at the partition level. Hive ends up reading the entire partition.
Raptor: Shards are files (Apache ORC as of now).
LLAP: The CBO doesn’t filter splits. It helps optimize the Apache Calcite plan.
Raptor: Raptor itself uses stats to filter shards.
22. Team
emre@rakam.io Founder @ Rakam.IO
Satendra Sahu Dev Big-Data team @OLA
Ramesh Byndoor Lead event analytics @OLA
23. References
● Release doc
○ https://prestodb.io/docs/current/release/release-0.69.html
● Raptor @facebook
○ https://www.slideshare.net/MartinTraverso/presto-at-facebook-presto-meetup-boston-1062015
● Why doesn’t Raptor have docs?
○ https://github.com/prestodb/presto/issues/2676
● Jay Tang from Facebook talks on Raptor
○ https://atscaleconference.com/videos/presto-raptor-mpp-shared-nothing-database-on-flash/
● Rakam.IO, an event analytics system.
○ https://rakam.io