5. The Former Team @ Metamarkets
Vadim Ogievetsky, Gian Merlino, Fangjin Yang
6. Their Challenges
● Scale: when data is large, we need a lot of servers
● Speed: aiming for sub-second response time
● Complexity: data too fine-grained to precompute
● High dimensionality: tens to hundreds of dimensions
● Concurrency: many users and tenants
● Freshness: load from streams
9. Key features
● Low latency ingestion from Kafka
● Bulk load from Hadoop
● Can pre-aggregate data during ingestion
● “Schema light”
● Ad-hoc queries
● Exact and approximate algorithms
● Can keep a lot of history (years are OK)
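To illustrate pre-aggregation during ingestion, here is a minimal Python sketch of the roll-up idea. It is not Druid's implementation: the event fields (`ts`, `page`, `country`, `clicks`) and the function `roll_up` are hypothetical, and an hourly granularity is assumed.

```python
from collections import defaultdict
from datetime import datetime

def roll_up(events, dimensions, metric):
    """Pre-aggregate raw events: truncate each timestamp to the hour and
    keep one row (event count and metric sum) per distinct
    (hour, dimension values) combination."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0})
    for event in events:
        hour = datetime.fromisoformat(event["ts"]).replace(
            minute=0, second=0, microsecond=0
        )
        key = (hour.isoformat(), *(event[d] for d in dimensions))
        totals[key]["count"] += 1
        totals[key]["sum"] += event[metric]
    return dict(totals)

# Hypothetical clickstream events, for illustration only.
events = [
    {"ts": "2019-01-01T10:05:00", "page": "home", "country": "US", "clicks": 1},
    {"ts": "2019-01-01T10:40:00", "page": "home", "country": "US", "clicks": 3},
    {"ts": "2019-01-01T11:02:00", "page": "home", "country": "US", "clicks": 2},
]
rolled = roll_up(events, ["page", "country"], "clicks")
```

Three raw events collapse into two stored rows here. The trade-off is that individual events within an hour can no longer be told apart.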
15. What is Druid?
● “high performance”: low query latency, high ingest rates
● “analytics”: counting, ranking, groupBy, time trend
● “data store”: the cluster stores a copy of your data
● “event-driven data”: fact data like clickstream, network flows,
user behavior, digital marketing, server metrics, IoT
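The "counting" and "ranking" style of query can be sketched in a few lines of Python. This only illustrates the query shape and is not how Druid executes it; the `top_n` helper and the sample rows are hypothetical.

```python
from collections import Counter

def top_n(events, dimension, n):
    """Count events per value of one dimension and return the n most
    frequent values with their counts, highest first."""
    counts = Counter(event[dimension] for event in events)
    return counts.most_common(n)

# Hypothetical page-view events, for illustration only.
clicks = [
    {"page": "home"},
    {"page": "pricing"},
    {"page": "home"},
    {"page": "docs"},
    {"page": "home"},
    {"page": "docs"},
]
ranking = top_n(clicks, "page", 2)
```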
16. New class of data store
● Column oriented
● High concurrency
● Scalable to 100s of servers, millions of messages/sec
● Partition key for query pruning
● May or may not have secondary indexes
● Query through SQL
● Rapid queries on denormalized data
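A rough sketch of how a partition key enables query pruning, assuming the key is a time interval and data is stored in one chunk per day. The `Segment` class and `prune` function are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Segment:
    start: date  # inclusive
    end: date    # exclusive
    name: str

def prune(segments, query_start, query_end):
    """Keep only segments whose [start, end) interval overlaps the
    query interval [query_start, query_end); the rest are never read."""
    return [
        s for s in segments
        if s.start < query_end and query_start < s.end
    ]

# Hypothetical daily segments, for illustration only.
segments = [
    Segment(date(2019, 1, 1), date(2019, 1, 2), "seg-01"),
    Segment(date(2019, 1, 2), date(2019, 1, 3), "seg-02"),
    Segment(date(2019, 1, 3), date(2019, 1, 4), "seg-03"),
]
selected = prune(segments, date(2019, 1, 2), date(2019, 1, 3))
```

A query over a single day touches one of the three segments, so the work scales with the queried range and not with the total history kept.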
17. New class of data store
● “Operational analytics” or “big OLAP” data stores
● Examples
○ Apache Druid [incubating] (open source community)
○ Scuba (from Facebook)
○ Pinot (from LinkedIn)
○ Doris, formerly Palo (from Baidu)
○ ClickHouse (from Yandex)
21. Optimized For A Reason
● Denormalized
● Roll up or don't roll up
● Query-time vs. ingest-time aggregation
● No joins
● Lookups for slowly changing dimensions
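The lookup idea can be sketched as follows, assuming the stored rows carry a stable ID and the display name is resolved at query time from a small key-value map. The names `count_by_label`, `campaign_id`, and the sample data are hypothetical and not Druid's API.

```python
from collections import Counter

def count_by_label(rows, key, lookup, default="unknown"):
    """Sum the stored counts per display label, resolving each row's ID
    through the lookup map at query time instead of joining tables."""
    totals = Counter()
    for row in rows:
        totals[lookup.get(row[key], default)] += row["count"]
    return dict(totals)

# Hypothetical pre-aggregated rows keyed by a stable campaign ID.
rows = [
    {"campaign_id": "c1", "count": 10},
    {"campaign_id": "c2", "count": 5},
    {"campaign_id": "c1", "count": 2},
]
lookup = {"c1": "Spring Sale", "c2": "Newsletter"}

before = count_by_label(rows, "campaign_id", lookup)
lookup["c1"] = "Spring Sale 2019"  # the dimension value changes slowly
after = count_by_label(rows, "campaign_id", lookup)
```

Renaming the campaign changes only the lookup map. The stored rows are untouched, so nothing needs to be re-ingested.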
25. Download
Druid community site (current): http://druid.io/
Druid community site (new): https://druid.apache.org/
Imply distribution: https://imply.io/get-started