3. Analytics Use Case: Interactive Dashboard
select sum(pageView), time from T
where country = us,
browser = chrome,…
group by time
Slice and dice over arbitrary dimensions
Human driven queries
Use Case | Response Latency | Query Rate | Possible Solutions
Interactive dashboard | sub-second to few seconds | ~1 qps | Columnar Store
4. Analytics Use Case: Site Facing
select sum(pageView) from T
where memberId = 456,
pageKey = “profilePage”,
privacySettings in (…)
group by time,[title|geo|industry]
Pre-defined query format with different primary key values
Use Case | Response Latency | Query Rate | Possible Solutions
Site facing | 100ms (99 percentile) | 1000s qps | KV Store
5. Analytics Use Case: Anomaly Detection
for d1 in [us, ca, … ]
for d2 in [chrome, ie, … ]
…
select sum(pageView), time from T
where country = d1, browser = d2
group by time
Identifying all issues requires us to monitor all possible combinations
Periodic machine-generated queries (bursty)
Use Case | Response Latency | Query Rate | Possible Solutions
Anomaly detection | sub-second to few seconds | 10-100s qps | Streaming Engine
6. Analytics Use Cases
Use Case | Response Latency | Query Rate | Possible Solutions
Interactive dashboard | sub-second to few seconds | ~1 qps | Columnar Store
Site facing | 100ms (99 percentile) | 1000s qps | KV Store (pre-cube)
Anomaly detection | sub-second to few seconds | 10-100s qps | Streaming Engine
Same input data (Pageview)
Same OLAP style query
What makes these use cases use different solutions?
Different solutions based on different workload characteristics
Can we support all these use cases in one single system?
7. What is Pinot?
SQL-like interface with predictable latency (no joins)
Batch Data Ingestion (Hadoop)
Realtime Data Ingestion (Kafka)
Distributed, horizontally scalable
Open source! (https://github.com/linkedin/pinot)
8. Pinot @ LinkedIn
• +50 site facing use cases
• +60k queries per second
• +2000 tables
• +1.4m records ingested per second
• 300B documents per data center
• 2 trillion documents for internal use case
10. Architecture Overview
• Controller - handles cluster-wide coordination using Apache Helix and Zookeeper
• Broker - handles query fan out and query routing to servers
• Server - responds to query requests originating from the brokers
15. Query Execution: Hybrid Querying
[Diagram: Broker, offline server, and realtime server, each with segments laid out on a time axis (segment labels: t = 1, 2, 1-2, 3, 4, 5)]
Time boundary: 2
16. Query Execution: Hybrid Querying
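[Diagram: Broker, offline server, and realtime server, each with segments laid out on a time axis (segment labels: t = 1, 2, 1-2, 3, 4, 5)]
Time boundary: 2
Query received by the broker:
select sum(m) from T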
17. Query Execution: Hybrid Querying
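[Diagram: Broker, offline server, and realtime server, each with segments laid out on a time axis (segment labels: t = 1, 2, 1-2, 3, 4, 5)]
Time boundary: 2
Query received by the broker:
select sum(m) from T
Query sent to the offline server:
select sum(m) from T
where t <= 2
Query sent to the realtime server:
select sum(m) from T
where t > 2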
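The rewrite on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Pinot's broker code: the function names are hypothetical, and the string handling assumes a simple query with no group by or order by clause.

```python
def split_hybrid_query(query: str, time_column: str, time_boundary: int) -> dict:
    """Rewrite one query into an offline part and a realtime part.

    Simplifying assumption: the query has no group by / order by clause,
    so the time predicate can be appended at the end of the string.
    """
    joiner = " and " if " where " in query.lower() else " where "
    return {
        "offline": f"{query}{joiner}{time_column} <= {time_boundary}",
        "realtime": f"{query}{joiner}{time_column} > {time_boundary}",
    }


def merge_sums(offline_sum: float, realtime_sum: float) -> float:
    # sum() is additive, so the two partial results can simply be added.
    return offline_sum + realtime_sum


parts = split_hybrid_query("select sum(m) from T", "t", 2)
print(parts["offline"])   # select sum(m) from T where t <= 2
print(parts["realtime"])  # select sum(m) from T where t > 2
```

Because the two predicates are disjoint, rows that exist on both server types around the boundary are counted only once.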
18. Query Execution: Single Node
Query Optimization
• select max(col) from T: use metadata instead of scanning
• select sum(metric) from T where country = us and accountId = x: reorder the filter for better performance (apply the accountId predicate before the country predicate)
• Dynamic query planning based on column metadata, index, and dictionary
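A rough Python sketch of the two optimizations above. The `ColumnStats` class, its fields, and the cardinality-based ordering heuristic are assumptions made for illustration; the slide does not say which statistics Pinot actually consults.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    # Hypothetical per-column statistics kept in segment metadata.
    max_value: object
    cardinality: int  # number of distinct values in the segment


def max_from_metadata(stats: ColumnStats):
    # "select max(col) from T" needs no scan when the max is stored in metadata.
    return stats.max_value


def reorder_predicates(columns, stats):
    # Assumed heuristic: an equality filter on a higher-cardinality column
    # matches fewer rows on average, so evaluate it first.
    return sorted(columns, key=lambda c: stats[c].cardinality, reverse=True)


stats = {
    "country": ColumnStats(max_value="zw", cardinality=200),
    "accountId": ColumnStats(max_value=999_999, cardinality=1_000_000),
}
print(reorder_predicates(["country", "accountId"], stats))  # ['accountId', 'country']
```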
19. Anatomy of Pinot Segment
Metadata: start/end time, available indexes, partitioning info, min/max value, …
Indexes: Inverted, Sorted, Startree
Raw Data
docId | country | code
0 | us | 002
1 | ca | 001
2 | jp | 003
… | … | …
Dictionary
dictId | country | code
0 | ca | 001
1 | jp | 002
2 | us | 003
… | … | …
Forward Index
docId | country | code
0 | 2 | 1
1 | 0 | 0
2 | 1 | 2
… | … | …
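The relationship between raw data, dictionary, and forward index on this slide can be reproduced with a short Python sketch. The function name `build_column` is hypothetical, and the inverted index is included only to show how it falls out of the same encoding; real segments use compact binary layouts rather than Python lists.

```python
def build_column(values):
    """Dictionary-encode one column of raw values."""
    dictionary = sorted(set(values))                 # dictId -> value (sorted)
    dict_id = {value: i for i, value in enumerate(dictionary)}
    forward = [dict_id[value] for value in values]   # docId -> dictId
    inverted = {}                                    # dictId -> list of docIds
    for doc_id, d in enumerate(forward):
        inverted.setdefault(d, []).append(doc_id)
    return dictionary, forward, inverted


# Raw data from the slide: docId 0 = (us, 002), 1 = (ca, 001), 2 = (jp, 003)
country_dict, country_fwd, country_inv = build_column(["us", "ca", "jp"])
code_dict, code_fwd, code_inv = build_column(["002", "001", "003"])

print(country_dict, country_fwd)  # ['ca', 'jp', 'us'] [2, 0, 1]
print(code_dict, code_fwd)        # ['001', '002', '003'] [1, 0, 2]
```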
21. Recap: Analytics Use Cases
Use Case | Response Latency | Query Rate | Possible Solutions
Interactive dashboard | sub-second to few seconds | ~1 qps | Columnar Store
Site facing | 100ms (99 percentile) | 1000s qps | KV Store (pre-cube)
Anomaly detection | sub-second to few seconds | 10-100s qps | Streaming Engine
Same input data (Pageview)
Same OLAP style query
Different solutions based on different workload characteristics
22. Interactive Dashboard
Use Case | Response Latency | Query Rate | Possible Solutions
Interactive dashboard | sub-second to few seconds | ~1 qps | Columnar Store
select sum(pageView), time from T
where country = us, browser = chrome,…
group by time
[Figure: histogram of query latency (x-axis: latency in milliseconds, 0-500; y-axis: frequency) for pinot and druid]
23. Site Facing
Use Case | Response Latency | Query Rate | Possible Solutions
Site facing | 100ms (99 percentile) | 1000s qps | KV Store
select sum(pageView) from T
where memberId = xx, privacySettings in…
group by time,[title|geo|industry]
[Figure: latency in milliseconds (axis ticks 100, 1000, 10000) vs. queries per second (axis ticks 10, 100, 1000) for druid and pinot]
24. Pinot Optimizations For Site Facing Use Cases
• Optimizing Query Processing
1. Sorted Index + Dynamic execution planning
• Optimizing Scatter and Gather
1. Smart segment assignment and routing
2. Data partitioning and pruning
25. Optimizing Query Processing: Sorted Index
• Access to both forward/inverted index
• Fetch contiguous block, benefit from locality
• For item filtering, pick scanning or inverted index based on cardinality of sorted column
Sorted index on memberId:
memberId | start docId | end docId
123 | 0 | 100
456 | 101 | 300
… | … | …
Forward index:
docId | memberId
0 | 123
... | …
100 | 123
101 | 456
… | …
300 | 456
… | …
select …
where memberId = 456, item in(…)
group by …
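A minimal Python sketch of how a sorted index turns the memberId predicate into a contiguous docId range, using the values from the table on this slide. The names `SORTED_INDEX`, `doc_range`, and `filter_items`, and the `item_forward` test data, are hypothetical; the second filter is shown as a plain scan over the block, which is one of the two options the slide mentions.

```python
from bisect import bisect_left

# One entry per distinct memberId: (memberId, start docId, end docId),
# kept sorted by memberId. Values taken from the slide.
SORTED_INDEX = [(123, 0, 100), (456, 101, 300)]


def doc_range(member_id):
    """Binary-search the sorted index and return the matching docId range."""
    keys = [entry[0] for entry in SORTED_INDEX]
    pos = bisect_left(keys, member_id)
    if pos < len(keys) and keys[pos] == member_id:
        _, start, end = SORTED_INDEX[pos]
        return range(start, end + 1)  # end docId is inclusive
    return range(0)  # memberId not present in this segment


def filter_items(member_id, item_forward, wanted):
    # Scan only the contiguous block for this member, not the whole segment.
    return [d for d in doc_range(member_id) if item_forward[d] in wanted]


# Hypothetical "item" column: docId -> item id
item_forward = [d % 3 for d in range(301)]
matches = filter_items(456, item_forward, {0})
print(len(doc_range(456)), len(matches))
```

Since all rows for one memberId sit next to each other, the scan touches a single block of memory, which is where the locality benefit comes from.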
[Figure: latency in milliseconds (axis ticks 100, 1000) vs. queries per second (axis ticks 10, 100, 1000) for sorted index and inverted index]
26. Optimizing Scatter and Gather: Querying All Servers
Replica group: a set of servers that contains a complete set of all segments.
[Diagram: segments 1-4 replicated across four servers. Without replica groups, query 1 and query 2 each fan out to all servers; with replica groups RG1 and RG2, each query is served by one replica group.]
[Figure: latency in milliseconds (axis ticks 100, 1000, 10000) vs. queries per second (axis ticks 10, 100, 1000), without routing optimization and with routing optimization]
Problem | Impact | Solution
Querying all servers | 99th percentile latency is impacted by the slowest server (e.g. gc) | Control the number of servers to fan out to
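The routing idea can be sketched as follows. The server names, the segment layout, and the round-robin selection policy are all assumptions for illustration; the slide only states that a replica group holds a complete set of segments and that fan-out should be limited.

```python
import itertools


class ReplicaGroupRouter:
    """Send each query to the servers of a single replica group."""

    def __init__(self, replica_groups):
        # replica_groups: group name -> {server name -> list of segments}
        self.replica_groups = replica_groups
        self._next_group = itertools.cycle(sorted(replica_groups))

    def route(self):
        # Hypothetical policy: rotate through replica groups round-robin.
        group = next(self._next_group)
        return group, self.replica_groups[group]


# Hypothetical layout: every replica group holds segments 1-4 exactly once.
layout = {
    "RG1": {"server1": [1, 2], "server2": [3, 4]},
    "RG2": {"server3": [1, 3], "server4": [2, 4]},
}
router = ReplicaGroupRouter(layout)
group, servers = router.route()
print(group, sorted(servers))
```

With four servers in total, each query here touches two of them, so a pause on one server affects only the queries routed to its group.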
27. Optimizing Scatter and Gather: Querying All Segments
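[Diagram: without partitioning, query 1 and query 2 each touch segments S1-S4. With partitioning, S1 and S2 hold p=1 and S3 and S4 hold p=2; query 1 (mid = p1) reads only the p=1 segments and query 2 (mid = p2) reads only the p=2 segments.]
Problem | Impact | Solution
Querying all segments | More CPU work on server | Minimize the number of segments queried (partitioning and pruning)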
select …
where memberId = 456, item in(…)
group by …
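A small Python sketch of partition-based pruning for the query above. The modulo partition function and the segment-to-partition map are assumptions chosen to mirror the p=1 / p=2 labels in the diagram; the slide does not specify the partition function.

```python
NUM_PARTITIONS = 2  # assumption, matching the two partitions in the diagram


def partition_of(member_id: int) -> int:
    # Hypothetical partition function: modulo on the key, numbered from 1.
    return member_id % NUM_PARTITIONS + 1


# Partition id recorded in each segment's metadata (hypothetical values).
SEGMENT_PARTITIONS = {"S1": 1, "S2": 1, "S3": 2, "S4": 2}


def prune_segments(member_id: int):
    """Keep only segments whose partition can contain this memberId."""
    wanted = partition_of(member_id)
    return sorted(s for s, p in SEGMENT_PARTITIONS.items() if p == wanted)


print(prune_segments(456))  # ['S1', 'S2']
print(prune_segments(457))  # ['S3', 'S4']
```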
28. Anomaly Detection: Challenge
for d1 in [us, ca, …]
for d2 in [key1, key2,…]
…
select sum(pageViews) from T
where country=d1, page_key=d2,
source_app=d3, device_name=d4…
group by country, time
…
Filter | Aggregation | Latency
select … where country = us,… | Slow, scans 60-70% of data | high
select … where country = kenya,… | Scans less than 1% | low
• Latency is not predictable; it depends on the query predicate
• Monitoring all possible combinations makes the problem worse!
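To make the combinatorial growth concrete, the nested loops on this slide can be written with `itertools.product`. The dimension values below are placeholders, and the generated SQL-like strings follow the slide's query shape rather than any exact Pinot syntax.

```python
from itertools import product

# Placeholder dimension values; real lists would be far longer.
dimensions = {
    "country": ["us", "ca"],
    "page_key": ["key1", "key2"],
}


def generate_queries(dimensions):
    """Yield one monitoring query per combination of dimension values."""
    names = list(dimensions)
    for combo in product(*dimensions.values()):
        where = " and ".join(f"{n} = '{v}'" for n, v in zip(names, combo))
        yield f"select sum(pageViews) from T where {where} group by time"


queries = list(generate_queries(dimensions))
print(len(queries))  # 4
print(queries[0])
```

The number of queries is the product of the value counts per dimension, so each added dimension multiplies the load of every monitoring cycle.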
29. Time vs Space Trade-off
[Figure: latency plotted against storage requirement]
• Columnar Store: variable latency, low storage overhead
• KV Store (Pre-computed): low latency, high storage overhead
• Startree Index
30. Startree Index Generation
1. Multidimensional sort
2. Split on the column and create a node for each value
3. Create star node (aggregate metric after removing the split column)
4. Apply 1,2,3 for each node recursively and stop when number of records in node < SplitThreshold
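Raw records
docId | country | browser | … (other dimensions) | impression
0 | al | ie | … | 10
1 | ca | safari | … | 10
2 | … | … | … | …
… | us | chrome | … | 10
… | us | chrome | … | 10
… | us | ie | … | 10
N | us | safari | … | 10
Aggregated records
N+1 | * | chrome | … | 40
N+2 | * | ie | … | 20
N+3 | * | safari | … | 20
[Tree: root → country level (al, ca, …, us, *) → browser level (chrome, …, safari)]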
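The generation steps on this slide, and the lookup they enable, can be sketched in Python as below. This is a simplified illustration, not Pinot's implementation: it keeps the tree as nested dicts, sorts only on the current split column rather than doing one multidimensional sort up front, and uses a tiny hypothetical data set with two dimensions (country, browser).

```python
from collections import defaultdict

STAR = "*"


def build(records, num_dims, split_threshold, level=0):
    """records: list of (dims_tuple, metric). Returns a nested dict tree."""
    node = {"records": records, "children": {}}
    if level == num_dims or len(records) < split_threshold:
        return node  # leaf: queries scan these records directly
    # Steps 1-2: sort on the split column and create one group per value.
    groups = defaultdict(list)
    for dims, metric in sorted(records, key=lambda r: r[0][level]):
        groups[dims[level]].append((dims, metric))
    # Step 3: star node = remove the split column, then aggregate the metric.
    totals = defaultdict(int)
    for dims, metric in records:
        totals[dims[:level] + (STAR,) + dims[level + 1:]] += metric
    groups[STAR] = list(totals.items())
    # Step 4: recurse into every child, including the star node.
    for value, group in groups.items():
        node["children"][value] = build(group, num_dims, split_threshold, level + 1)
    return node


def query_sum(node, filters, level=0):
    """filters: dimension position -> required value; absent means 'any'."""
    if not node["children"]:
        return sum(m for dims, m in node["records"]
                   if all(dims[i] == v for i, v in filters.items()))
    # Follow the value child when filtered, otherwise the star child.
    child = node["children"].get(filters.get(level, STAR))
    return query_sum(child, filters, level + 1) if child else 0


raw = [(("al", "ie"), 10), (("ca", "safari"), 10), (("us", "chrome"), 10),
       (("us", "chrome"), 10), (("us", "ie"), 10), (("us", "safari"), 10)]
tree = build(raw, num_dims=2, split_threshold=2)
print(query_sum(tree, {}))             # 60
print(query_sum(tree, {1: "chrome"}))  # 20
print(query_sum(tree, {0: "us"}))      # 40
```

Queries without a filter on a dimension follow the star child, so they read pre-aggregated records instead of every raw row.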
31. Time vs Space Trade-off with Startree
[Figure: latency plotted against storage requirement]
• Columnar Store: SplitThreshold = infinity, no prematerialization
• KV Store (Pre-computed): SplitThreshold = 1, full materialization
• Startree Index: SplitThreshold = 100,000, partial, data-aware materialization
32. Startree Query Execution
Example queries:
select sum(pageViews) from T
where country = AL
select sum(pageViews) from T
where country = CA
select sum(pageViews) from T
where browser = Chrome
select sum(pageViews) from T
General form:
select sum(X)
from T
where d1=v1 and d2=v2 and …
Any query pattern will scan fewer than SplitThreshold records
[Tree: root → country level (al, ca, …, us, *) → browser level (chrome, …, safari, *); leaves hold either raw docs or aggregated docs]
33. Anomaly Detection
[Figure: latency in milliseconds (axis ticks 100, 1000, 10000) vs. queries per second (axis ticks 1, 10, 100) for druid, pinot with inverted index, and pinot with startree index]
Use Case | Response Latency | Query Throughput | Possible Solutions
Anomaly detection | sub-second to few seconds | 10-100s queries per second | Streaming Engine
34. Pinot vs Druid
Feature | Druid | Pinot
Inverted Index | Always on all columns, fixed | Configurable on a per-column basis
Query Execution Layer | Fixed plan | Split into planning and execution
Data Organization | N/A | Sorted column
Partitioning | Only available for time column | Available for any column
Controlling query fan-out | N/A | Replica group based segment assignment and routing
Smart pre-materialization | N/A | Star-tree
35. Can we support all these use cases in one single system?
Use Case | Response Latency | Query Rate | Solution
Interactive dashboard | sub-second to few seconds | ~1 qps | Pinot
Site facing | 100ms (99 percentile) | 1000s qps | Pinot
Anomaly detection | sub-second to few seconds | 10-100s qps | Pinot