For analytics dashboards to deliver a smooth user experience, two key requirements are sub-second response time and data freshness. Cluster computing frameworks such as Hadoop or Hive/HBase work well for storing large volumes of data, but they are not optimized for ingesting streaming data and making it available for queries in real time. Long query latencies also make these systems sub-optimal choices for powering interactive dashboards and BI use cases.
In this talk we will present Druid as a complementary solution to existing Hadoop-based technologies. Druid is an open-source analytics data store designed from scratch for OLAP and business intelligence queries over massive data streams. It provides low-latency real-time data ingestion and flexible, ad-hoc data exploration queries with sub-second response times.
Many large companies are switching to Druid for analytics, and we will cover how Druid is able to handle massive data streams and why it is a good fit for BI use cases.
Agenda -
1) Introduction and Ideal Use cases for Druid
2) Data Architecture
3) Streaming Ingestion with Kafka
4) Demo using Druid, Kafka and Superset.
5) Recent Improvements in Druid: Moving from a Lambda Architecture to Exactly-Once Ingestion
6) Future Work
Thank you all for coming to my talk.
The title of this talk is "Druid: Sub-Second OLAP Queries over Petabytes of Streaming Data".
Sub-second means fast, interactive queries, which can be used to power interactive dashboards, fast analytics, monitoring, and alerting applications.
I am a Software Engineer at Hortonworks, a committer and PMC member on Druid, and part of the PPMC for the Superset incubation.
I am part of the Business Intelligence team at Hortonworks.
Prior to that I spent two years at Metamarkets, where I was responsible for the analytics infrastructure, including real-time analytics with Druid.
Motivation
Druid introduction and use case
Demo
Druid Architecture
Storage Internals
Recent Improvements
Initial Use Case
Power an ad-tech analytics product at Metamarkets, similar to the one shown in the picture on the right: a dashboard where you can visualize time-series data and do arbitrary filtering and grouping on any combination of dimensions.
Requirements
- Arbitrary Queries – users should be able to filter and group on any combination of dimensions.
- Scalability – should be able to handle trillions of events per day.
- Interactive – since the data store was going to power an interactive dashboard, low-latency queries were a must.
- Real-time – the time between when an event occurs and when it is visible on the dashboard should be minimal (on the order of a few seconds).
- High Availability – no central point of failure.
- Rolling Upgrades – the architecture was required to support rolling upgrades.
Motivation
Interactive real time visualizations on Complex data streams
Answer BI questions
How many unique male visitors visited my website last month ?
How many products were sold last quarter broken down by a demographic and product category ?
We are not interested in dumping the entire dataset.
Suppose I am running an ad campaign, and I want to understand:
What kinds of impressions are there?
What is my click-through rate?
How many users decided to purchase my services?
We may have a user activity stream and want to know how users are behaving.
We may have a stream of firewall events and want to detect anomalies in those streams in real time.
Also, for very large distributed clusters there is a need to answer questions about application performance.
How is each individual node in my cluster behaving?
Are there any anomalies in query response time?
All of the above use cases involve data streams that can be huge in volume, depending on the scale of the business.
How do I analyze this information?
How do I get insights from these streams of events in real time?
Druid Architecture
What is Druid ?
Column-oriented distributed datastore – data is stored in columnar format. Many datasets have a large number of dimensions (hundreds or even thousands), but most queries only need 5-10 columns, so the column-oriented format lets Druid scan only the required columns (sketched after this list).
Sub-second query times – Druid uses various techniques to achieve sub-second queries: bitmap indexes for fast filtering, memory-mapped files to serve data from memory, data summarization and compression, query caching, and highly optimized algorithms for the different query types.
Real-time streaming ingestion from almost any ETL pipeline.
Arbitrary slicing and dicing of data – no need to create pre-canned drill-downs.
Automatic data summarization – Druid can summarize data during ingestion; e.g. if my dashboard only shows events aggregated by hour, I can optionally configure Druid to pre-aggregate at ingestion time.
Approximate algorithms (HyperLogLog, theta sketches) – for fast approximate answers.
Scalable to petabytes of data
Highly available
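To make the column-orientation point concrete, here is a minimal Python illustration (not Druid internals; the sample rows are made up) contrasting row and column layouts; a query that sums one metric only needs to touch that one column:

    # Illustrative only -- not Druid's actual storage code.
    rows = [
        {"timestamp": "2017-01-01T01:00", "page": "Ke$ha", "edits": 2},
        {"timestamp": "2017-01-01T01:00", "page": "Justin Bieber", "edits": 5},
        {"timestamp": "2017-01-01T02:00", "page": "Justin Bieber", "edits": 1},
    ]

    # Row-oriented: every row (all columns) is touched to sum one metric.
    total = sum(row["edits"] for row in rows)

    # Column-oriented: each column is stored contiguously; the same query
    # scans only the "edits" column and never reads "page" or "timestamp".
    columns = {
        "timestamp": [r["timestamp"] for r in rows],
        "page":      [r["page"] for r in rows],
        "edits":     [r["edits"] for r in rows],
    }
    total = sum(columns["edits"])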
Druid Architecture
Demo: Wikipedia Real-Time Dashboard (Accelerated 30x)
Wikipedia Real-Time Dashboard: How it Works
Druid Architecture
Realtime Index Tasks –
Handle real-time ingestion; support both pull- and push-based ingestion.
Handle queries – able to serve queries as soon as data is ingested.
Store data in a write-optimized data structure on the heap.
In case you need to do any ETL, like data enrichment or joining multiple streams of data, you can do it in a separate ETL layer before sending the data to Druid.
Realtime tasks periodically persist their in-memory data to deep storage in the form of read-optimized chunks.
Deep storage can be any distributed FS and acts as a permanent backup of the data.
Historical Nodes –
Main workhorses of a Druid cluster.
Use memory-mapped files to load columnar data (sketched below).
Respond to user queries.
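As a rough illustration of how memory-mapped files let the OS page cache serve hot data from memory (a toy sketch, not Druid's actual segment format):

    import mmap
    import struct

    # Write a toy "column" of 64-bit integers to disk (illustrative format).
    with open("metric_column.bin", "wb") as f:
        f.write(struct.pack("<4q", 10, 20, 30, 40))

    # Memory-map it: reads go through the OS page cache, so frequently
    # accessed data is effectively served from memory without the process
    # managing its own buffer pool.
    with open("metric_column.bin", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        values = struct.unpack("<4q", mm[:32])
        print(sum(values))  # 100
        mm.close()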
Now let's see how data can be queried.
Broker Nodes –
Keep track of the data chunks loaded by each node in the cluster.
Able to scatter queries across multiple historical and realtime nodes.
Caching Layer
Now let's discuss another case: when you do not have streaming data but want to ingest batch data into Druid.
Batch ingestion can be done using either a Hadoop MR or Spark job, which converts your data into Druid's columnar format and persists it to deep storage.
With many historical nodes in a cluster, the load needs to be balanced across them. This is done by the Coordinator Nodes –
Use ZooKeeper for coordination.
Ask historical nodes to load or drop data.
Move data across historical nodes to balance load in the cluster.
Manage data replication.
Indexing Data
Indexing Service
Indexing is performed by
Overlord
Middle Managers
Peons
Middle Managers spawn peons, which run ingestion tasks.
Each peon runs one task.
A task definition specifies which task to run and its properties, as sketched below.
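For illustration, here is a hypothetical, abbreviated task definition submitted to the overlord's task endpoint in Python; the data source name, granularities, and host are made up, and a real spec needs a complete dataSchema, ioConfig, and tuningConfig (consult the Druid docs for your version):

    import json
    import urllib.request

    # Abbreviated, hypothetical task definition for a "pageviews" data source.
    task = {
        "type": "index",
        "spec": {
            "dataSchema": {
                "dataSource": "pageviews",
                "granularitySpec": {"segmentGranularity": "hour",
                                    "queryGranularity": "minute"},
            },
            # ioConfig / tuningConfig omitted for brevity.
        },
    }

    # The overlord exposes a task-submission endpoint (default port 8090).
    req = urllib.request.Request(
        "http://overlord:8090/druid/indexer/v1/task",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    print(urllib.request.urlopen(req).read())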
Realtime Ingestion
Done by Realtime Index Tasks
Ability to ingest streams of data
Stores data in write-optimized structure
Periodically converts the write-optimized structure to read-optimized segments.
Events are queryable as soon as they are ingested.
Both push- and pull-based ingestion.
Maintain an in-memory, row-oriented key-value store.
Data is stored on the heap in a map indexed by time and dimension values.
Persist data to disk based on a row threshold or elapsed time (sketched below).
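A highly simplified Python sketch of this write-optimized store (illustrative only; the threshold and field names are made up): events are aggregated in an on-heap map keyed by truncated timestamp plus dimension values, and flushed once a row threshold is hit.

    # Simplified illustration of a write-optimized, row-oriented store.
    MAX_ROWS = 75000  # flush threshold (configurable in real deployments)

    store = {}

    def truncate_to_minute(ts_seconds):
        return ts_seconds - ts_seconds % 60

    def ingest(event):
        # Key: (truncated time, dimension values); value: running aggregates.
        key = (truncate_to_minute(event["ts"]), event["page"], event["city"])
        agg = store.setdefault(key, {"count": 0})
        agg["count"] += event["count"]
        if len(store) >= MAX_ROWS:
            persist()

    def persist():
        # In Druid this step converts the map into a read-optimized,
        # columnar segment; here we just report and clear it.
        print("persisting", len(store), "rows")
        store.clear()

    ingest({"ts": 1500000000, "page": "Ke$ha", "city": "SF", "count": 1})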
Batch Ingestion
HadoopIndexTask
Peon launches Hadoop MR job
Mappers read data
Reducers create Druid segment files
Index Task
Suitable for smaller data sizes (< 1 GB); see the sketch below.
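For comparison, a hypothetical, abbreviated HadoopIndexTask definition (the input path and data source are made up, and real specs include a full dataSchema and tuningConfig):

    # Hypothetical, abbreviated HadoopIndexTask definition.
    hadoop_task = {
        "type": "index_hadoop",
        "spec": {
            "dataSchema": {"dataSource": "pageviews"},  # abbreviated
            "ioConfig": {
                "type": "hadoop",
                # The MR job's mappers read these files; reducers write
                # Druid segment files to deep storage.
                "inputSpec": {"type": "static",
                              "paths": "hdfs:///data/pageviews/2017-06-01/"},
            },
        },
    }

It is submitted to the overlord the same way as the task shown earlier.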
Querying Data
HTTP REST API
Queries and results expressed in JSON (a sample query is sketched after this list).
Multiple Query Types
Time Boundary
Timeseries
TopN
GroupBy
Select
Segment Metadata
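As a sketch, here is how a timeseries query might be POSTed to a broker in Python; the data source, interval, metric names, and host are assumptions:

    import json
    import urllib.request

    # Hypothetical timeseries query: hourly edit counts for one day.
    query = {
        "queryType": "timeseries",
        "dataSource": "wikipedia",
        "granularity": "hour",
        "intervals": ["2017-06-01/2017-06-02"],
        "aggregations": [
            {"type": "longSum", "name": "edits", "fieldName": "count"}
        ],
    }

    # Brokers accept native JSON queries on /druid/v2 (default port 8082).
    req = urllib.request.Request(
        "http://broker:8082/druid/v2",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    print(json.loads(urllib.request.urlopen(req).read()))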
Druid in Practice
Most Events per Day – 300 billion events/day (Metamarkets)
Most Computed Metrics – 1 billion metrics/min (Jolata)
Largest Cluster – 200 nodes (Metamarkets)
Largest Hourly Ingestion – 2 TB/hour (Netflix)
Druid emits metrics covering:
Query performance – query time, segment scan time, …
Ingestion rate – events ingested, events persisted, …
JVM health – JVM heap usage, GC stats, …
Cache – cache hits, cache misses, cache evictions, …
System – CPU, disk, network, swap usage, etc.
No Downtime
Data redundancy
Rolling upgrades
This shows some of the production users.
I can talk about some of the large ones which have common use cases.
Alibaba and eBay use Druid for e-commerce and user behavior analytics.
Cisco has a real-time analytics product for analyzing network flows.
Yahoo uses Druid for user behavior analytics and real-time cluster monitoring.
Hulu does interactive analysis of user and application behavior.
PayPal and SK Telecom use Druid for business analytics.
Recent Improvements
Built-in SQL support (CLI, HTTP, JDBC) – see the example after this list.
Kerberos Authentication
Performance improvements
Optimized evaluation of large numbers of AND/OR filters with Concise bitmaps.
Index-based evaluation of simple regex filters like 'foo%'.
~30% improvement on non-time groupBys
Apache Hive Integration – Supports full SQL, Large Joins, Batch Indexing
Apache Ambari Integration – Easy deployments and Cluster management
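As a sketch, the earlier native query expressed through the SQL HTTP endpoint; the table and column names (and host) are assumptions:

    import json
    import urllib.request

    sql = {
        "query": """
            SELECT FLOOR(__time TO HOUR) AS hr, SUM("count") AS edits
            FROM wikipedia
            GROUP BY FLOOR(__time TO HOUR)
        """
    }

    # Brokers expose Druid SQL over HTTP at /druid/v2/sql.
    req = urllib.request.Request(
        "http://broker:8082/druid/v2/sql",
        data=json.dumps(sql).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    print(json.loads(urllib.request.urlopen(req).read()))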
Future Work
Improved schema definition & management
Re-indexing/compaction without Hadoop
Closer Hive/Druid integration
Performance improvements
Select query performance improvements
JIT-friendly topN queries
Security enhancements
Row/Column level security
Integration with Apache Ranger
Work towards supporting Joins
Summary
Scalability
Horizontal Scalability.
Columnar storage, indexing and compression.
Multi-tenancy.
Real-time
Ingestion latency < seconds.
Query latency < seconds.
Arbitrarily slice and dice big data like a ninja.
No more pre-canned drill-downs.
Query at finer granularity.
High availability and Rolling deployment capabilities
Less costly to run.
Very active open source community.
Storage Internals
Druid: Segments
Data in Druid is stored in Segment Files.
Partitioned by time
Ideally, segment files are each smaller than 1GB.
If files are large, smaller time partitions are needed.
Example: Wikipedia Edit Dataset
Data Rollup
Rollup by hour
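A minimal Python sketch of hourly rollup (illustrative, not Druid code; the sample events are made up): events that share a truncated timestamp and identical dimension values collapse into one row with aggregated metrics.

    from collections import defaultdict

    events = [
        {"time": "2011-01-01T01:05", "page": "Justin Bieber", "added": 10},
        {"time": "2011-01-01T01:18", "page": "Justin Bieber", "added": 25},
        {"time": "2011-01-01T01:40", "page": "Ke$ha", "added": 17},
    ]

    rollup = defaultdict(lambda: {"count": 0, "added": 0})
    for e in events:
        hour = e["time"][:13]        # truncate the timestamp to the hour
        key = (hour, e["page"])      # dimension values form the key
        rollup[key]["count"] += 1
        rollup[key]["added"] += e["added"]

    # The two "Justin Bieber" events in hour 01 collapse into a single row.
    for (hour, page), aggs in sorted(rollup.items()):
        print(hour, page, aggs)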
Dictionary Encoding
Create and store Ids for each value
e.g. page column
Values - Justin Bieber, Ke$ha, Selena Gomez
Encoding - Justin Bieber: 0, Ke$ha: 1, Selena Gomez: 2
Column Data - [0 0 0 1 1 2]
city column - [0 0 0 1 1 1]
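The same encoding in a few lines of Python (a sketch, not Druid's actual encoder):

    # Dictionary-encode a column: map each distinct value to an integer id.
    page_column = ["Justin Bieber", "Justin Bieber", "Justin Bieber",
                   "Ke$ha", "Ke$ha", "Selena Gomez"]

    dictionary = {}           # value -> id, in order of first appearance
    encoded = []              # the column, stored as integer ids
    for value in page_column:
        encoded.append(dictionary.setdefault(value, len(dictionary)))

    print(dictionary)  # {'Justin Bieber': 0, 'Ke$ha': 1, 'Selena Gomez': 2}
    print(encoded)     # [0, 0, 0, 1, 1, 2]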
Bitmap Indices
Store Bitmap Indices for each value
Justin Bieber -> [0, 1, 2] -> [1 1 1 0 0 0]
Ke$ha -> [3, 4] -> [0 0 0 1 1 0]
Selena Gomez -> [5] -> [0 0 0 0 0 1]
Queries
Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0]
language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1]
Indexes compressed with Concise or Roaring encoding
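A minimal Python sketch of the same filtering, using plain integers in place of Concise/Roaring compressed bitmaps (illustrative only):

    # Build a bitmap per value from the dictionary-encoded column above.
    encoded = [0, 0, 0, 1, 1, 2]   # page ids: Justin Bieber=0, Ke$ha=1, ...

    bitmaps = {}
    for row, value_id in enumerate(encoded):
        # Bit i set means "row i has this value"; plain ints stand in for
        # Concise/Roaring compressed bitmaps.
        bitmaps[value_id] = bitmaps.get(value_id, 0) | (1 << row)

    justin, kesha = bitmaps[0], bitmaps[1]

    # page = 'Justin Bieber' OR page = 'Ke$ha'  ->  bitwise OR
    either = justin | kesha
    matching_rows = [r for r in range(len(encoded)) if either >> r & 1]
    print(matching_rows)  # [0, 1, 2, 3, 4]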