Running SELECT COUNT(DISTINCT) on a database is very common. Applications typically have an analytics dashboard highlighting the number of unique items, such as unique users or unique visits. While traditional SELECT COUNT(DISTINCT) queries work well in single-machine setups, they are a difficult problem in distributed systems. With this type of query, you cannot just push the query to the workers and add up the results, because there will most likely be overlapping records on different workers.
In this talk, we will focus on the HyperLogLog (HLL) algorithm and its PostgreSQL extension, postgresql-hll. HLL can provide approximate answers to COUNT(DISTINCT) queries within mathematically provable error bounds. It is not only fast and memory-efficient but also has interesting properties that especially shine in a distributed environment. During the talk, we’ll first look at the internals of HLL to understand why the algorithm is useful for solving the distinct count problem in a scalable way, and then at how it can be applied in a distributed fashion. Finally, we will see some examples of HLL usage.
Distributed COUNT(DISTINCT) with HyperLogLog on PostgreSQL | PGConf EU 2017 | Burak Yucesoy
1. Burak Yucesoy | Citus Data | PGConf EU
Distributed COUNT(DISTINCT) with HyperLogLog on PostgreSQL
2. Burak Yucesoy | Citus Data | PGConf EU
What is COUNT(DISTINCT)?
● Number of unique elements (cardinality) in given data
● Useful to find things like…
○ Number of unique users who visited your web page
○ Number of unique products in your inventory
3. Burak Yucesoy | Citus Data | PGConf EU
What is distributed COUNT(DISTINCT)?
Coordinator
Worker Node 1: logins_001
Worker Node 2: logins_002
Worker Node 3: logins_003
4. Burak Yucesoy | Citus Data | PGConf EU
Why do we need distributed COUNT(DISTINCT)?
● Your data is too big to fit in the memory of a single machine
● The naive approach to COUNT(DISTINCT) needs too much memory
5. Burak Yucesoy | Citus Data | PGConf EU
Why is distributed COUNT(DISTINCT) difficult?
Coordinator: SELECT COUNT(*) FROM logins; → 600
Worker Node 1 (logins_001): SELECT COUNT(*) FROM ...; → 100
Worker Node 2 (logins_002): SELECT COUNT(*) FROM ...; → 200
Worker Node 3 (logins_003): SELECT COUNT(*) FROM ...; → 300
6. Burak Yucesoy | Citus Data | PGConf EU
Why is distributed COUNT(DISTINCT) difficult?
Coordinator: SELECT COUNT(DISTINCT username) FROM logins;
Worker Node 1: logins_001
Worker Node 2: logins_002
Worker Node 3: logins_003
Each worker: SELECT COUNT(DISTINCT username) FROM ...;
7. Burak Yucesoy | Citus Data | PGConf EU
Why is distributed COUNT(DISTINCT) difficult?
Worker Node 1: logins_001
 username |    date
----------+------------
 Alice    | 2017-01-02
 Bob      | 2017-01-03
 Charlie  | 2017-01-05
 Eve      | 2017-01-07
Worker Node 2: logins_002
 username |    date
----------+------------
 Bob      | 2017-02-11
 Bob      | 2017-02-13
 Dave     | 2017-02-17
 Alice    | 2017-02-19
Worker Node 3: logins_003
 username |    date
----------+------------
 Frank    | 2017-03-23
 Eve      | 2017-03-29
 Charlie  | 2017-03-02
 Charlie  | 2017-03-03
8. Burak Yucesoy | Citus Data | PGConf EU
Why is distributed COUNT(DISTINCT) difficult?
Worker Node 1: logins_001
 username |    date
----------+------------
 Alice    | 2017-01-02
 Bob      | 2017-01-03
 Charlie  | 2017-01-05
 Eve      | 2017-01-07
Worker Node 2: logins_002
 username |    date
----------+------------
 Bob      | 2017-02-11
 Bob      | 2017-02-13
 Dave     | 2017-02-17
 Alice    | 2017-02-19
Worker Node 3: logins_003
 username |    date
----------+------------
 Dave     | 2017-03-23
 Eve      | 2017-03-29
 Charlie  | 2017-03-02
 Charlie  | 2017-03-03
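A minimal Python sketch, using the usernames transcribed from the three shard tables directly above, illustrates why per-worker results can be added for COUNT(*) but not for COUNT(DISTINCT). The dictionary and variable names are illustrative only.

```python
# Usernames per shard, transcribed from the tables above.
shards = {
    "logins_001": ["Alice", "Bob", "Charlie", "Eve"],
    "logins_002": ["Bob", "Bob", "Dave", "Alice"],
    "logins_003": ["Dave", "Eve", "Charlie", "Charlie"],
}

# COUNT(*) is additive: per-shard row counts sum to the global row count.
total_rows = sum(len(rows) for rows in shards.values())

# COUNT(DISTINCT) is not additive: the same user can appear on several shards.
sum_of_distinct = sum(len(set(rows)) for rows in shards.values())
true_distinct = len(set().union(*shards.values()))

print(total_rows, sum_of_distinct, true_distinct)
```

Adding up the per-shard distinct counts gives twice the true number of users here, because users such as Alice and Bob are counted once per shard they appear on.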
9. Burak Yucesoy | Citus Data | PGConf EU
Some Possible Approaches
● Pull all distinct data to one node and count there. (Doesn’t scale)
● Repartition data on the fly. (Scales but it’s very slow)
● Use HyperLogLog. (Scales and is fast)
10. Burak Yucesoy | Citus Data | PGConf EU
HyperLogLog (HLL)
HLL is:
● Approximation algorithm
● Estimates cardinality of given data
● Mathematically proven error bounds
11. Burak Yucesoy | Citus Data | PGConf EU
Is it OK to approximate?
It depends…
12. Burak Yucesoy | Citus Data | PGConf EU
HLL
● Very fast
● Low memory footprint
● Can work with streaming data
● Can merge estimations of two separate datasets efficiently
13. Burak Yucesoy | Citus Data | PGConf EU
How does HLL work?
Steps:
1. Hash all elements
a. Ensures uniform data distribution
b. Lets us treat all data types the same way
2. Observe rare bit patterns
3. Apply stochastic averaging
14. Burak Yucesoy | Citus Data | PGConf EU
How does HLL work? - Observing rare bit patterns
Alice → hash: 645403841 → binary: 0010...001
Number of leading zeros: 2
Maximum number of leading zeros: 2
15. Burak Yucesoy | Citus Data | PGConf EU
How does HLL work? - Observing rare bit patterns
Bob → hash: 1492309842 → binary: 0101...010
Number of leading zeros: 1
Maximum number of leading zeros: 2
16. Burak Yucesoy | Citus Data | PGConf EU
How does HLL work? - Observing rare bit patterns
...
Maximum number of leading zeros: 7
Cardinality Estimation: 2^7 = 128
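The idea of observing rare bit patterns can be sketched in a few lines of Python. This is an illustration, not the extension's code: the 32-bit hash width is an assumption (it happens to reproduce the leading-zero counts shown for Alice and Bob), and `leading_zeros` is a hypothetical helper name.

```python
def leading_zeros(h: int, bits: int = 32) -> int:
    """Count leading zero bits of h when viewed as a `bits`-wide integer."""
    return bits - h.bit_length()

# Hash values as shown on the slides (32-bit width is an assumption).
hashes = {"Alice": 645403841, "Bob": 1492309842}

max_zeros = 0
for name, h in hashes.items():
    max_zeros = max(max_zeros, leading_zeros(h))

# A uniform hash has at least k leading zeros with probability 2**-k,
# so a maximum of k suggests roughly 2**k distinct hashes were seen.
estimate = 2 ** max_zeros
print(max_zeros, estimate)
```

With only Alice and Bob the maximum is 2, so the rough estimate is 4; once the maximum reaches 7, the same rule gives 2**7 = 128.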
17. Burak Yucesoy | Citus Data | PGConf EU
How does HLL work? Stochastic Averaging
Measuring the same thing repeatedly and taking the average.
20. Burak Yucesoy | Citus Data | PGConf EU
How does HLL work? Stochastic Averaging
Data
Partition 1: 7 → 2^7
Partition 2: 5 → 2^5
Partition 3: 12 → 2^12
Estimation: 228.968...
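The 228.968... figure can be reproduced by multiplying the number of partitions by the harmonic mean of the per-partition estimates. That reading is inferred from the numbers on the slide; the full HyperLogLog algorithm additionally applies a bias-correction constant, which this sketch deliberately leaves out. `raw_estimate` is a hypothetical name.

```python
def raw_estimate(registers):
    """m * harmonic mean of 2**r over the registers (no bias correction)."""
    m = len(registers)
    return m * m / sum(2.0 ** -r for r in registers)

# Maximum leading-zero counts of the three partitions on the slide.
est = raw_estimate([7, 5, 12])
print(est)
```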
21. Burak Yucesoy | Citus Data | PGConf EU
How does HLL work? Stochastic Averaging
01000101...010
First m bits decide the partition number
Remaining bits are used to count leading zeros
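A small Python sketch of this split, under stated assumptions: a 32-bit hash, 3 prefix bits (so 8 partitions), and 0-based partition indexes. The function name `split_hash` and the parameter `p` (the number of prefix bits) are illustrative.

```python
def split_hash(h: int, p: int, bits: int = 32):
    """Return (partition index, leading zeros in the remaining bits)."""
    rest_bits = bits - p
    index = h >> rest_bits                  # first p bits select the partition
    rest = h & ((1 << rest_bits) - 1)       # remaining bits
    zeros = rest_bits - rest.bit_length()   # leading zeros of the remainder
    return index, zeros

# Alice's hash from the earlier slide; its binary form starts with 00100110.
index, zeros = split_hash(645403841, p=3)
print(index, zeros)
```

For Alice's hash the first three bits, 001, select index 1 (the second partition), and the remainder starts with two zeros.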
22. Burak Yucesoy | Citus Data | PGConf EU
Error rate of HLL is damn good
● Typical error rate: 1.04 / sqrt(number of partitions)
● Memory needed: number of partitions * log(log(max. value in hash space)) bits
● Can estimate cardinalities well beyond 10^9 with a 1% error rate while using only 6 kilobytes of memory
● Memory vs. accuracy tradeoff
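The tradeoff can be made concrete with a little arithmetic. The slide does not say which parameters lead to 6 kilobytes; the values below (2^13 partitions with 6-bit registers) are an assumption that happens to land on that size, with a typical error a little above 1%.

```python
import math

def hll_error(num_partitions: int) -> float:
    """Typical relative error quoted on the slide: 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(num_partitions)

def hll_memory_bytes(num_partitions: int, bits_per_register: int) -> int:
    """Memory for the registers alone, ignoring any header or overhead."""
    return num_partitions * bits_per_register // 8

m = 2 ** 13            # assumed partition count
register_bits = 6      # assumed: enough to store counts up to 63

err = hll_error(m)
mem = hll_memory_bytes(m, register_bits)
print(f"{err:.4f}", mem)
```

Quadrupling the number of partitions halves the typical error, which is the memory vs. accuracy tradeoff in the last bullet.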
23. Burak Yucesoy | Citus Data | PGConf EU
Why does HLL work?
It turns out that combining lots of bad estimates gives a good estimate
24. Burak Yucesoy | Citus Data | PGConf EU
Some interesting examples
Alice, Alice, Alice, …, Alice
Alice → hash: 645403841 → binary: 00100110...001
Partition 1: 0 → 2^0
Partition 2: 2 → 2^2
Partition 3: 0 → 2^0
... ... ...
Harmonic mean: 1.103...
25. Burak Yucesoy | Citus Data | PGConf EU
Some interesting examples
Charlie → hash: 0 → binary: 00000000...000
Partition 1: 29 → 2^29
Partition 2: 0 → 2^0
... ... ...
Partition 8: 0 → 2^0
Harmonic mean: 1.142...
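The 1.103... and 1.142... figures on the two slides above can be reproduced with a short sketch. It assumes 8 partitions and that the partitions elided on the slides hold 0; both assumptions are inferred from the numbers, not stated in the talk.

```python
def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

def arithmetic_mean(values):
    return sum(values) / len(values)

# Assumption: 8 partitions, and the ones elided on the slides hold 0.
alice_only = [0, 2, 0, 0, 0, 0, 0, 0]      # only "Alice", repeated many times
charlie_only = [29, 0, 0, 0, 0, 0, 0, 0]   # one element whose hash is all zeros

hm_alice = harmonic_mean([2 ** r for r in alice_only])
hm_charlie = harmonic_mean([2 ** r for r in charlie_only])
am_charlie = arithmetic_mean([2 ** r for r in charlie_only])
print(hm_alice, hm_charlie, am_charlie)
```

Repeating Alice never changes a register, and a single extreme outlier such as 2^29 barely moves the harmonic mean, whereas an arithmetic mean over the same values would be dominated by it.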
26. Burak Yucesoy | Citus Data | PGConf EU
postgresql-hll
● https://github.com/aggregateknowledge/postgresql-hll
● https://github.com/citusdata/postgresql-hll
● Companies using postgresql-hll for their dashboards
○ Neustar
○ Cloudflare
27. Burak Yucesoy | Citus Data | PGConf EU
postgresql-hll on a single node
postgresql-hll uses a data structure, also called hll, to keep the maximum number of leading zeros of each partition.
● Use hll_hash_bigint to hash elements.
○ There are similar functions for other common data types.
● Use hll_add_agg to aggregate hashed elements into an hll data structure.
● Use hll_cardinality to materialize an hll data structure into an actual distinct count.
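The hash, aggregate, materialize flow can be mimicked with a toy Python class. This is a sketch of the general HyperLogLog technique, not the postgresql-hll implementation: `ToyHLL` and its methods are hypothetical names, SHA-1 stands in for the extension's hash functions, each register stores leading zeros plus one, the standard bias-correction constant is applied, and small-range corrections are omitted.

```python
import hashlib

class ToyHLL:
    """Toy HyperLogLog; not the postgresql-hll implementation."""

    def __init__(self, p: int = 12):
        self.p = p                      # number of prefix bits
        self.m = 1 << p                 # number of partitions
        self.registers = [0] * self.m

    @staticmethod
    def hash64(value) -> int:
        # Stand-in for hll_hash_*: any well-mixed 64-bit hash would do.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, hashed: int) -> None:
        # Stand-in for hll_add_agg: keep the maximum rank per partition.
        rest_bits = 64 - self.p
        index = hashed >> rest_bits
        rest = hashed & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1   # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def cardinality(self) -> float:
        # Stand-in for hll_cardinality: bias-corrected harmonic mean.
        alpha = 0.7213 / (1 + 1.079 / self.m)      # valid for m >= 128
        return alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)

sketch = ToyHLL()
for i in range(100_000):
    sketch.add(ToyHLL.hash64(i % 50_000))   # 50,000 distinct values, each twice
estimate = sketch.cardinality()
print(round(estimate))
```

With 4096 partitions the typical error is about 1.6%, so the estimate should land close to 50,000 even though every value was added twice.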
28. Burak Yucesoy | Citus Data | PGConf EU
What Happens in a Distributed Scenario?
29. Burak Yucesoy | Citus Data | PGConf EU
How to merge COUNT(DISTINCT) with HLL
Shard 1
Partition 1: 7
Partition 2: 5
Partition 3: 12
Intermediate result: HLL(7, 5, 12)
30. Burak Yucesoy | Citus Data | PGConf EU
How to merge COUNT(DISTINCT) with HLL
Shard 2
Partition 1: 11
Partition 2: 7
Partition 3: 8
Intermediate result: HLL(11, 7, 8)
31. Burak Yucesoy | Citus Data | PGConf EU
How to merge COUNT(DISTINCT) with HLL
HLL(7, 5, 12) + HLL(11, 7, 8) → hll_union_agg → HLL(11, 7, 12)
11 → 2^11
7 → 2^7
12 → 2^12
Estimation: 1053.255
32. Burak Yucesoy | Citus Data | PGConf EU
How to merge COUNT(DISTINCT) with HLL
Shard 1 + Shard 2
Shard 1 Partition 1 (7) + Shard 2 Partition 1 (11) → 11
Shard 1 Partition 2 (5) + Shard 2 Partition 2 (7) → 7
Shard 1 Partition 3 (12) + Shard 2 Partition 3 (8) → 12
Estimation: 1053.255
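Merging two sketches amounts to taking the larger value for each partition. The following Python sketch uses the register values from the slides; `merge` and `raw_estimate` are hypothetical helper names, and the estimate again omits the bias-correction constant of the full algorithm.

```python
def merge(a, b):
    """Union of two sketches: element-wise maximum of their registers."""
    return [max(x, y) for x, y in zip(a, b)]

def raw_estimate(registers):
    """m * harmonic mean of 2**r over the registers (no bias correction)."""
    m = len(registers)
    return m * m / sum(2.0 ** -r for r in registers)

shard1 = [7, 5, 12]
shard2 = [11, 7, 8]
merged = merge(shard1, shard2)
est = raw_estimate(merged)
print(merged, est)
```

The merged registers are exactly what a single sketch over both shards would hold, which is why the result is close to the 1053.255 shown on the slide and no raw data has to leave the shards.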
33. Burak Yucesoy | Citus Data | PGConf EU
postgresql-hll in a distributed environment
1. Separate data into shards.
logins_001 logins_002 logins_003
34. Burak Yucesoy | Citus Data | PGConf EU
postgresql-hll in a distributed environment
2. Put shards on separate nodes.
Coordinator
Worker Node 1: logins_001
Worker Node 2: logins_002
Worker Node 3: logins_003
35. Burak Yucesoy | Citus Data | PGConf EU
postgresql-hll in a distributed environment
3. For each shard, calculate the hll (but do not materialize it).
Shard 1
Partition 1: 7
Partition 2: 5
Partition 3: 12
Intermediate result: HLL(7, 5, 12)
36. Burak Yucesoy | Citus Data | PGConf EU
postgresql-hll in a distributed environment
4. Pull intermediate results to a single node.
Coordinator
Worker Node 1 (logins_001): HLL(6, 4, 11)
Worker Node 2 (logins_002): HLL(10, 6, 7)
Worker Node 3 (logins_003): HLL(7, 12, 5)
37. Burak Yucesoy | Citus Data | PGConf EU
postgresql-hll in a distributed environment
5. Merge the separate hll data structures and materialize the result.
HLL(11, 7, 8) + HLL(7, 5, 12) + HLL(8, 13, 6) → HLL(11, 13, 12)
11 → 2^11
13 → 2^13
12 → 2^12
Estimation: 10532.571...
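The five steps can be walked through end to end in Python. This is a sketch under stated assumptions rather than what postgresql-hll or Citus executes: SHA-1 stands in for the hash function, 16 partitions are used to keep the registers readable, each register stores leading zeros plus one, and `sketch` and `union` are hypothetical helper names. The shard contents are the example login tables from earlier in the talk.

```python
import hashlib

P = 4  # assumed: 2**4 = 16 partitions, tiny for illustration

def sketch(values, p=P):
    """Build registers (max rank per partition) for one shard."""
    registers = [0] * (1 << p)
    rest_bits = 64 - p
    for v in values:
        h = int.from_bytes(hashlib.sha1(v.encode("utf-8")).digest()[:8], "big")
        index, rest = h >> rest_bits, h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1   # leading zeros + 1
        registers[index] = max(registers[index], rank)
    return registers

def union(sketches):
    """Coordinator side: element-wise maximum over all shard sketches."""
    return [max(column) for column in zip(*sketches)]

shards = [
    ["Alice", "Bob", "Charlie", "Eve"],       # logins_001
    ["Bob", "Bob", "Dave", "Alice"],          # logins_002
    ["Dave", "Eve", "Charlie", "Charlie"],    # logins_003
]

per_shard = [sketch(rows) for rows in shards]   # step 3, on the workers
merged = union(per_shard)                       # steps 4-5, on the coordinator
all_rows = [name for rows in shards for name in rows]
print(merged == sketch(all_rows))
```

The merged sketch is identical to one built over all rows on a single node, so only the small per-shard sketches need to cross the network.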
38. Burak Yucesoy | Citus Data | PGConf EU
postgresql-hll in a distributed environment
Or use Citus :)
39. Burak Yucesoy | Citus Data | PGConf EU
Burak Yucesoy
burak@citusdata.com
@byucesoy
Thank You
citusdata.com | @citusdata