Archmage, Pinterest's Real-time Analytics Platform on Druid
3. 3© 2020 Pinterest. All rights reserved.
Agenda
Motivation
Challenges
Use cases
Cluster stats
Architecture
Learnings
4. 4© 2020 Pinterest. All rights reserved.
Motivation
● Cons of the HBase-based precomputed key-value lookup system
○ The key-value data model doesn't fit analytics query patterns
○ Cardinality explosion any time a new column is added
○ Impossible to precompute all filter combinations
○ More work is needed on the application side to do aggregation
We want a better system as demand for Pinterest's analytics
use cases increases...
Why did we replace HBase with Druid for analytics use
cases?
Example key-value model:
country=usa,device=iphone,gender=male,click=123
country=china,device=iphone,gender=female,click=456
country=japan,device=android,gender=male,click=789
country=usa,device=iphone,gender=female,click=135
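As a rough illustration of the cardinality explosion above, the sketch below multiplies out some hypothetical per-dimension cardinalities (all numbers are invented for illustration, not Pinterest's actual data):

```python
# Illustrative arithmetic (hypothetical cardinalities): the number of
# precomputed keys is the product of per-dimension cardinalities.
from math import prod

cardinalities = {"country": 200, "device": 10, "gender": 3}
print(prod(cardinalities.values()))  # 6,000 keys

# Adding one more column, e.g. a 10-value age_bucket, multiplies
# the key space rather than adding to it:
cardinalities["age_bucket"] = 10
print(prod(cardinalities.values()))  # 60,000 keys

# Supporting every filter combination (any subset of dimensions) is
# worse still: each dimension contributes (cardinality + 1) choices,
# counting the "not filtered" case.
print(prod(c + 1 for c in cardinalities.values()))  # 97,284 combos
```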
5. 5© 2020 Pinterest. All rights reserved.
Challenges
What are the unique challenges of adopting Druid at
Pinterest?
● Clients expect low latency on par with a key-value store
○ Having migrated from an HBase-based key-value lookup backend, clients expect
latency to stay in the low 100ms range, while vanilla Druid only guarantees
sub-second to seconds latency
● Pinterest-scale data volume
○ Largest batch use case: 300 TB with a seconds-level SLA
○ Largest real-time use case: 500k write QPS with an SLA of 500
query QPS and 200 ms p99
● Cost effectiveness
○ We want the best possible performance at the lowest cost
6. 6© 2020 Pinterest. All rights reserved.
Use cases
Many of the company's analytics use cases are powered by
Druid
● Partner and advertiser reporting
○ Provides stats on board/Pin impressions, clicks, saves, etc.
7. 7© 2020 Pinterest. All rights reserved.
Use cases
Many of the company's analytics use cases are powered by
Druid
● Real-time spam detection
○ Detects spamming events for user login and Pin operations
8. 8© 2020 Pinterest. All rights reserved.
Use cases
Many of the company's analytics use cases are powered by
Druid
● Partner and advertiser reporting
○ Stats on board/Pin impressions, clicks, saves, etc.
● Real-time spam detection
○ Detects spamming events for user login and Pin operations
● Experiment metrics
○ A/B testing experiment metrics
● Ads delivery debugger
○ Debugging tool for Ads delivery status
● And many more ...
9. 9© 2020 Pinterest. All rights reserved.
Cluster stats
We have both online and offline clusters
● Biggest online cluster
○ 200 r4.8x historical nodes hosting 32TB, 50 i3.2x hosting 100TB
○ QPS 250
○ Query p99 ranges from 100ms to ~1.5s depending on the use case
● Biggest offline cluster
○ 160 i3en.2x historical nodes hosting 280TB
○ QPS < 1
○ p99 2s
10. 10© 2020 Pinterest. All rights reserved.
Architecture
Batch ingestion
Real-time ingestion
Archmage
11. 11© 2020 Pinterest. All rights reserved.
Architecture
Archmage
● Proxy service
○ A Thrift service that acts as a proxy between clients and Druid to ease
integration with other services at Pinterest (a minimal sketch follows this list)
○ Handles Druid service discovery by watching the broker znode on Druid's
ZooKeeper
○ Thrift-to-HTTP / HTTP-to-Thrift request/response translation
○ Metrics reporting
○ Speculative execution
○ Query optimization and rewriting
○ Shadow cluster dark traffic routing
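A minimal sketch of the proxy's core translation step. The broker's /druid/v2/sql endpoint is Druid's standard SQL API, but everything else here (class and field names, single-host routing) is a hypothetical simplification, not Archmage's actual implementation:

```python
# Sketch of Archmage's Thrift -> HTTP translation step.
# ArchmageHandler and the request/response shapes are hypothetical.
import requests

class ArchmageHandler:
    """Hypothetical Thrift handler: forwards a SQL query to a Druid broker."""

    def __init__(self, broker_hosts):
        # In the real service, broker hosts come from watching the broker
        # znode in Druid's ZooKeeper, not from a static list.
        self.broker_hosts = broker_hosts

    def query(self, request):
        # request.sql is the SQL field of the incoming Thrift request.
        broker = self.broker_hosts[0]  # real routing would load-balance
        resp = requests.post(
            f"http://{broker}/druid/v2/sql",
            json={"query": request.sql},
            timeout=5,
        )
        resp.raise_for_status()
        # Translate the HTTP JSON response back into a Thrift response
        # (represented here as a plain dict).
        return {"rows": resp.json()}
```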
12. 12© 2020 Pinterest. All rights reserved.
Architecture
Query
● Thrift API
○ Clients send a Thrift request with a SQL field to Archmage, which forwards
it to Druid
● UI
○ Use-case-specific UIs for individual clients
○ Internal UI with SQL editor tool for ad-hoc queries
○ Apache Superset for dashboarding
13. 13© 2020 Pinterest. All rights reserved.
Architecture
Ingestion
● Batch ingestion
○ Hadoop: extracted a library that bypasses Druid locking
○ Reads input from S3 and writes Druid segment files to S3
● Real-time ingestion
○ Kafka: exactly-once delivery (a sample supervisor spec follows)
○ Evaluated the push-based Tranquility library, but it has been deprecated
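A minimal sketch of what a Kafka supervisor spec submission might look like. The datasource, topic, and hosts are hypothetical, and the field names follow Druid's Kafka indexing service documentation rather than Pinterest's actual specs:

```python
# Minimal Kafka ingestion supervisor spec, submitted to the Overlord.
import requests

supervisor_spec = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "spam_events",  # hypothetical datasource
        "timestampSpec": {"column": "ts", "format": "millis"},
        "dimensionsSpec": {"dimensions": ["user_id", "event_type"]},
        "granularitySpec": {"segmentGranularity": "HOUR",
                            "queryGranularity": "MINUTE"},
    },
    "ioConfig": {
        "topic": "spam-events",
        "consumerProperties": {"bootstrap.servers": "kafka:9092"},
        "taskCount": 4,
    },
}

requests.post("http://overlord:8090/druid/indexer/v1/supervisor",
              json=supervisor_spec, timeout=10)
```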
14. 14© 2020 Pinterest. All rights reserved.
Learnings
Tiered setup
● Need disk access? Look for host types with good 4KB-page random
read IOPS
○ Disk is needed when segments are not accessed often, or when the data volume
is so large that a fully in-memory setup is too expensive
○ Druid uses mmap and abstracts a segment into a byte array. Only a specific portion
of the byte array (e.g., for a certain column) is loaded from disk at query time,
and the loading happens in 4KB pages. This means a host type with 256G RAM behaves
much the same as one with 1G RAM (excluding process memory) if 1) the 4KB random
read IOPS are the same and 2) each query is expected to scan different segments
(see the back-of-envelope sketch after this list)
○ For AWS, host types with on-instance SSDs work best: i3 > i3en >> other
instance types with an attached EBS disk
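A back-of-envelope sketch of why the 4KB random-read IOPS figure dominates when queries hit cold segments; both numbers below are hypothetical, purely for illustration:

```python
# Sketch: when every query scans uncached segments, disk IOPS caps QPS.
# All numbers are hypothetical.
random_read_iops = 400_000        # e.g., an NVMe-class on-instance SSD
pages_touched_per_query = 2_000   # 4KB pages read for the scanned columns

max_qps = random_read_iops / pages_touched_per_query
print(f"max ~{max_qps:.0f} queries/sec from disk alone")  # max ~200
```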
15. 15© 2020 Pinterest. All rights reserved.
Learnings
Tiered setup
● Recent data? All in memory
○ Recent data is expected to be queried more often, so we want to avoid
query-time disk I/O by keeping all of it in the page cache
○ Put the most recent segments (e.g., the last 3 months) on memory-intensive
instance types with a 1:1 RAM/disk ratio: r5.8x with attached EBS
○ Background threads in historical nodes read segment files (equivalent
to `cat 0000.smoosh > /dev/null`) on server bootstrap and on new segment
download, forcing the OS to load them into the page cache and avoiding
on-demand loading at query time (sketched after this list)
○ The exact window for "recent" is best determined through request
analysis; Druid real-time ingestion is a good fit for collecting those request logs
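A minimal sketch of the warm-up logic described above, assuming a hypothetical segment-cache path; the real implementation lives inside the historical process itself:

```python
# Sequentially read every segment file so the OS pulls it into the
# page cache -- the Python equivalent of `cat 0000.smoosh > /dev/null`.
import glob

def warm_page_cache(segment_dir="/mnt/druid/segment-cache"):
    # Hypothetical segment cache layout: *.smoosh files under the dir.
    for path in glob.glob(f"{segment_dir}/**/*.smoosh", recursive=True):
        with open(path, "rb") as f:
            while f.read(1 << 20):   # read 1MB at a time, discard
                pass

# Run this in a background thread on bootstrap and after each new
# segment download, so query-time reads hit the page cache, not disk.
```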
16. 16© 2020 Pinterest. All rights reserved.
Learnings
Middle managers
● Need as much attention in tuning as historical nodes
○ Monitor metrics on Kafka ingestion offset and timestamp lag
○ Increase intermediatePersistPeriod if you are sensitive to query latency
on middle managers
○ Use a custom partitioner on the Kafka producer side to improve data locality
○ Use lateMessageRejectionPeriod and earlyMessageRejectionPeriod to
keep scattered late and early events from creating lots of small segments
(see the config fragment after this list)
○ Run reindexing (compaction) jobs
○ Be careful not to use Kafka transactions on the producer side prior to Druid
0.15
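The knobs above, shown in the context of the supervisor spec sections they belong to; the field names are Druid's, while the values are hypothetical starting points rather than Pinterest's settings:

```python
# Kafka supervisor spec fragment with the tuning knobs from this slide.
tuning_fragment = {
    "tuningConfig": {
        "type": "kafka",
        # Persist less often if query latency on middle managers matters:
        # each intermediate persist competes with queries for CPU and I/O.
        "intermediatePersistPeriod": "PT30M",
    },
    "ioConfig": {
        # Drop events far outside the current window so stray late/early
        # records don't create lots of tiny segments.
        "lateMessageRejectionPeriod": "PT2H",
        "earlyMessageRejectionPeriod": "PT2H",
    },
}
```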
17. 17© 2020 Pinterest. All rights reserved.
Learnings
Group-by queries
● Tail latency
○ Many are convertible to topN if you add a limit clause (SQL sketches
after this list)
○ Add a combined dimension if there are more than 2 group-by dimensions
but they are fixed
○ Enable limit push-down to sacrifice some accuracy for performance
○ Enable parallel broker-side merge
○ Limit the number of rows to group by from the application side, if possible
○ Make sure you have enough merge buffers so you don't run out of them
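Hedged Druid SQL sketches of the first two rewrites; the table and column names are hypothetical:

```python
# 1) Adding ORDER BY + LIMIT on a single-dimension GROUP BY lets Druid
#    plan it as a cheaper topN query instead of a full groupBy:
sql_topn = """
SELECT country, SUM(clicks) AS clicks
FROM ad_events
GROUP BY country
ORDER BY clicks DESC
LIMIT 100
"""

# 2) With more than two fixed grouping dimensions, pre-combine them into
#    one dimension at ingestion time (e.g. country|device|gender) and
#    group by the single combined column:
sql_combined = """
SELECT combined_dim, SUM(clicks) AS clicks
FROM ad_events
GROUP BY combined_dim
LIMIT 100
"""
```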
18. 18© 2020 Pinterest. All rights reserved.
Learnings
Query-time pruning on a secondary dimension other than time
● Cluster computing resources are limited
○ Each segment is processed by one processing thread, and the number of
threads is usually equal to the number of cores
○ Cores are expensive and always fewer in number than segments
○ We should be cautious about which segments we scan for a query
● Shard specs with query-time partition-dimension pruning
○ Batch ingestion
■ Hash-based shard spec
■ Even-size single-dimension shard spec
○ Real-time ingestion
■ Stream-hash-based shard spec
19. 19© 2020 Pinterest. All rights reserved.
Learnings
Query-time pruning on a secondary dimension other than time
● Shard specs with query-time partition-dimension pruning
○ Batch ingestion
■ Hash-based shard spec
● Worked well in most use cases
● Added missing query-time pruning based on hashing and
partition dimensions (see the spec fragment below)
● However: skewed data leads to skewed segment sizes,
long ingestion tail latency, and query performance issues
■ Even-size single-dimension shard spec
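A sketch of what a hash-based partitionsSpec fragment looks like in a batch ingestion spec; the dimension and target size are hypothetical, and the field names follow Druid's documentation rather than Pinterest's internal specs:

```python
# Hash-based partitionsSpec fragment for batch ingestion.
hashed_partitions = {
    "partitionsSpec": {
        "type": "hashed",
        "partitionDimensions": ["advertiser_id"],  # hypothetical
        "targetRowsPerSegment": 5_000_000,
    }
}
# At query time, a filter on advertiser_id lets the broker hash the
# value and prune to the matching partitions instead of fanning out
# to every segment in the interval.
```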
20. 20© 2020 Pinterest. All rights reserved.
Learnings
Query-time pruning on a secondary dimension other than time
● Shard specs with query-time partition-dimension pruning
○ Batch ingestion
■ Hash-based shard spec
■ Even-size single-dimension shard spec
● The default single-dimension shard spec fits all data for the same partition
dimension value into a single segment (see the spec fragment after this list)
● Added a custom partitioner to distribute data for skewed partition-dimension
values across multiple segments
● Replaced the two very slow Hadoop jobs (rolling up input and counting rows
per partition-dimension value to decide partitioning) with reading the
output of a SparkSQL job
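For comparison, a single-dimension partitionsSpec fragment (native batch ingestion calls this type "single_dim"); names and sizes are again hypothetical:

```python
# Single-dimension (range) partitionsSpec fragment. Each segment stores
# its value range for the partition dimension, so a filter on that
# dimension prunes to the segments whose range contains the value.
single_dim_partitions = {
    "partitionsSpec": {
        "type": "single_dim",
        "partitionDimension": "advertiser_id",  # hypothetical
        "targetRowsPerSegment": 5_000_000,
    }
}
```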
21. 21© 2020 Pinterest. All rights reserved.
Learnings
Query-time pruning on a secondary dimension other than time
● Shard specs with query-time partition-dimension pruning
○ Real-time ingestion
■ Stream-hash-based shard spec
● Real-time ingestion defaults to the numbered shard spec, which carries no
metadata about what data it contains; every query therefore fans out to all
segments, making high query QPS very hard to support
● The stream-hash shard spec is a real-time version of the batch hash-based
shard spec
● Have the Kafka producer put records into Kafka partition IDs based on:
hash(partition dimensions) % number of Kafka partitions (sketched below)
● Con: this approach doesn't allow increasing the number of Kafka partitions,
which would lead to incorrect results during the transition period
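A sketch of the producer-side partitioning described above, using kafka-python; the topic, dimensions, and hash function are illustrative, and the hash must match whatever the stream-hash shard spec records for pruning to stay correct:

```python
# Producer-side: partition = hash(partition dimensions) % num partitions.
from kafka import KafkaProducer
import json, zlib

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)
NUM_PARTITIONS = 64  # must stay fixed; repartitioning breaks pruning

def send(event):
    # Hash the partition dimensions (hypothetical: country, device) and
    # route to an explicit Kafka partition id.
    key = f"{event['country']}|{event['device']}"
    partition = zlib.crc32(key.encode()) % NUM_PARTITIONS
    producer.send("ad-events", value=event, partition=partition)
```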
22. 22© 2020 Pinterest. All rights reserved.
Learnings
Operation tips
● druid.broker.select.tier and druid.server.priority
○ Control routing for dark reads, Druid config A/B testing, and no-downtime deploys
23. 23© 2020 Pinterest. All rights reserved.
Learnings
Operation tips
● skipCoordinatorRun
○ Use this runtime config when deploying/restarting historical nodes to keep the
coordinator from triggering unnecessary segment movements
● maxSegmentsInNodeLoadingQueue and maxSegmentsToMove
○ Segments are represented as children under a historical host's znode
○ Load-queue znodes are not compressed
○ Be careful of hitting the ZooKeeper buffer limit (defaults to a few MB) when
loading a large number of segments onto a historical node (see the sketch below)
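A sketch of setting these through the coordinator's dynamic-config endpoint; the host and values are hypothetical, while the endpoint and field names come from Druid's documentation:

```python
# Update the coordinator's dynamic config over HTTP. Keeping these
# values small bounds the size of each historical's load-queue znode.
import requests

dynamic_config = {
    "maxSegmentsInNodeLoadingQueue": 100,
    "maxSegmentsToMove": 50,
}
requests.post("http://coordinator:8081/druid/coordinator/v1/config",
              json=dynamic_config, timeout=10)
```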
25. Time for questions
@Pinterest
Thank you!
Apache Druid is an independent project of The Apache Software Foundation. More information can be found at https://druid.apache.org.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.