Druid: Sub-Second OLAP Queries over Petabytes of Streaming Data
Nishant Bangarwa
Hortonworks
Druid Committer, PMC
Superset Incubator PPMC
June 2017
Agenda
History and Motivation
Introduction
Demo
Druid Architecture – Indexing and Querying Data
Druid In Production
Recent Improvements
HISTORY AND MOTIVATION
 Druid open sourced in late 2012
 Initial use case
 Power an ad-tech analytics product
 Requirements
 Arbitrary queries
 Scalability: trillions of events/day
 Interactive: low-latency queries
 Real-time: data freshness
 High Availability
 Rolling Upgrades
MOTIVATION
 Business Intelligence Queries
 Arbitrary slicing and dicing of data
 Interactive real-time visualizations on complex data streams
 Answer BI questions
– How many unique male visitors visited my website last month?
– How many products were sold last quarter, broken down by demographic and product category?
 Not interested in dumping the entire dataset
Introduction
What is Druid?
 Column-oriented distributed datastore
 Sub-Second query times
 Realtime streaming ingestion
 Arbitrary slicing and dicing of data
 Automatic Data Summarization
 Approximate algorithms (HyperLogLog, theta sketches)
 Scalable to petabytes of data
 Highly available
Demo
Demo: Wikipedia Real-Time Dashboard (Accelerated 30x)
Wikipedia Real-Time Dashboard: How it Works
[Diagram: a Java stream reader consumes the Wikipedia edits data stream and writes it into Druid with exactly-once ingestion; the dashboard reads the data back from Druid.]
Druid Architecture
[Diagram: events stream into Realtime Index Tasks, which hand completed segments off to Historical Nodes; batch data is indexed into segments served by Historical Nodes; Broker Nodes serve queries over both.]
[Diagram: the full architecture adds Coordinator Nodes, the Metadata Store, ZooKeeper and an optional distributed query cache; queries enter through Broker Nodes, which fan out to Realtime Index Tasks and Historical Nodes.]
Indexing Data
Indexing Service
 Indexing is performed by
 Overlord
 Middle Managers
 Peons
 Middle Managers spawn peons, which run ingestion tasks
 Each peon runs one task at a time
 A task definition specifies which task to run and its properties
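To make the task-definition point concrete, here is a minimal sketch of submitting a task to the Overlord over HTTP. The datasource, interval and elided schema fields are placeholders for illustration only, and the default Overlord port 8090 is assumed.

```python
import json
import requests

# Minimal, illustrative task definition. "type" tells the indexing service
# which task to run (e.g. "index" or "index_hadoop"); the rest are
# task-specific properties. The schema below is intentionally incomplete.
task = {
    "type": "index",
    "spec": {
        "dataSchema": {
            "dataSource": "wikipedia",                     # placeholder name
            "granularitySpec": {
                "segmentGranularity": "DAY",
                "queryGranularity": "HOUR",
                "intervals": ["2011-01-01/2011-01-02"],
            },
            # parser, dimensions and metrics elided for brevity
        },
        "ioConfig": {"type": "index"},                     # input source elided
    },
}

# The Overlord accepts task submissions over HTTP and assigns them to a
# Middle Manager, which spawns a peon to run the task.
resp = requests.post(
    "http://localhost:8090/druid/indexer/v1/task",
    data=json.dumps(task),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.text)                         # task id on success
```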
Streaming Ingestion: Realtime Index Tasks
 Ability to ingest streams of data
 Stores data in a write-optimized structure
 Periodically converts the write-optimized structure to read-optimized segments
 Events are queryable as soon as they are ingested
 Both push- and pull-based ingestion
Streaming Ingestion: Tranquility
 Helper library for coordinating streaming ingestion
 Simple API to send events to Druid
 Transparently manages
 Realtime index task creation
 Partitioning and replication
 Schema evolution
 Can be used with Flink, Samza, Spark, Storm, or any other ETL framework
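As a rough illustration of the "simple API", the sketch below pushes events over HTTP, assuming a Tranquility Server instance is running in front of Druid (the port 8200 and /v1/post/<dataSource> endpoint are assumptions; the core Tranquility API itself is a Scala/Java library).

```python
import json
import requests

# One illustrative event for a hypothetical "wikipedia" datasource.
event = {
    "timestamp": "2017-06-01T00:00:00Z",
    "page": "Druid_(open-source_data_store)",
    "language": "en",
    "added": 17,
}

# Tranquility Server (assumed here) accepts batches of JSON events over HTTP
# and transparently handles realtime task creation, partitioning and replication.
resp = requests.post(
    "http://localhost:8200/v1/post/wikipedia",
    data=json.dumps([event]),
    headers={"Content-Type": "application/json"},
)
print(resp.json())   # typically reports how many events were received vs. sent
```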
Kafka Indexing Service (experimental)
 Supports exactly-once ingestion
 Messages pulled by Kafka index tasks
 Each Kafka index task consumes from a set of partitions with start and end offsets
 Each message verified to ensure sequence
 Kafka offsets and the corresponding segments persisted atomically in the same metadata transaction
 Kafka Supervisor
 Embedded inside the Overlord
 Manages Kafka index tasks
 Retries failed tasks
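A sketch of how the service is started: a supervisor spec is POSTed to the Overlord, which then creates and manages the Kafka index tasks. The topic, broker address and tuning values below are placeholders, not a complete spec.

```python
import json
import requests

# Minimal, illustrative Kafka supervisor spec; a real spec also carries the
# full dataSchema (parser, dimensions, metrics), which is elided here.
supervisor = {
    "type": "kafka",
    "dataSchema": {"dataSource": "wikipedia"},              # schema elided
    "ioConfig": {
        "topic": "wikipedia-edits",                         # placeholder topic
        "consumerProperties": {"bootstrap.servers": "kafka:9092"},
        "taskCount": 1,
        "replicas": 1,
        "taskDuration": "PT1H",
    },
}

# The Overlord hosts the Kafka Supervisor; once the spec is accepted it
# creates Kafka index tasks and retries them on failure.
resp = requests.post(
    "http://localhost:8090/druid/indexer/v1/supervisor",
    data=json.dumps(supervisor),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.text)
```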
Batch Ingestion
 HadoopIndexTask
 Peon launches a Hadoop MR job
 Mappers read data
 Reducers create Druid segment files
 Index Task
 Runs in a single JVM (the peon itself)
 Suitable for small data sizes (< 1 GB)
 Integrations with Apache Hive and Spark for batch ingestion
Querying Data
Querying Data from Druid
 Druid supports
 JSON queries over HTTP
 Built-in SQL (experimental)
 Querying libraries available for
 Python
 R
 Ruby
 JavaScript
 Clojure
 PHP
JSON Over HTTP
 HTTP REST API
 Queries and results expressed in JSON
 Multiple Query Types
 Time Boundary
 Timeseries
 TopN
 GroupBy
 Select
 Segment Metadata
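For example, a minimal Timeseries query POSTed to a Broker (a Broker on localhost:8082 and a hypothetical "wikipedia" datasource with count/sum_added metrics are assumed):

```python
import json
import requests

# A minimal Timeseries query: daily totals over a two-day interval.
query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "granularity": "day",
    "intervals": ["2011-01-01/2011-01-03"],
    "aggregations": [
        {"type": "longSum", "name": "edits", "fieldName": "count"},
        {"type": "longSum", "name": "added", "fieldName": "sum_added"},
    ],
}

# Queries are POSTed to the Broker; results come back as JSON rows.
resp = requests.post(
    "http://localhost:8082/druid/v2",
    data=json.dumps(query),
    headers={"Content-Type": "application/json"},
)
for row in resp.json():
    print(row["timestamp"], row["result"])
```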
Built-in SQL (experimental)
 Apache Calcite-based parser and planner
 Ability to connect Druid to any BI tool that supports JDBC
 SQL via JSON over HTTP
 Supports approximate queries
 APPROX_COUNT_DISTINCT(col)
 Ability to do fast approximate TopN queries
 APPROX_QUANTILE(column, probability)
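A sketch of the "SQL via JSON over HTTP" path, assuming the experimental SQL endpoint is enabled on a Broker at localhost:8082 and a hypothetical "wikipedia" datasource with a userid column:

```python
import requests

# Druid SQL accepts a JSON object with a "query" field on /druid/v2/sql.
sql = """
SELECT FLOOR(__time TO DAY) AS "day",
       APPROX_COUNT_DISTINCT(userid) AS unique_users
FROM wikipedia
GROUP BY FLOOR(__time TO DAY)
ORDER BY 1
"""

resp = requests.post("http://localhost:8082/druid/v2/sql", json={"query": sql})
for row in resp.json():     # each row is a JSON object keyed by column name
    print(row)
```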
Integrated with multiple open source UI tools
 Superset
 Developed at Airbnb
 In Apache incubation since May 2017
 Grafana – Druid plugin (https://github.com/grafana-druid-plugin/druidplugin)
 Metabase
 With built-in SQL, connects with any BI tool supporting JDBC
Superset
 Python backend
 Flask-AppBuilder
 Authentication
 Pandas for rich analytics
 SQLAlchemy as the SQL toolkit
 JavaScript frontend
 React, NVD3
 Deep integration with Druid
Superset Rich Dashboarding Capabilities: Treemaps
Superset Rich Dashboarding Capabilities: Sunburst
Superset UI Provides Powerful Visualizations
Rich library of dashboard visualizations:
Basic:
• Bar Charts
• Pie Charts
• Line Charts
Advanced:
• Sankey Diagrams
• Treemaps
• Sunburst
• Heatmaps
And More!
Druid in Production
Production readiness
 Is Druid suitable for my use case?
 Will Druid meet my performance requirements at scale?
 How complex is it to operate and manage a Druid cluster?
 How do I monitor a Druid cluster?
 High availability?
 How do I upgrade a Druid cluster without downtime?
 Security?
 Extensibility for future use cases?
Suitable Use Cases
 Powering interactive user-facing applications
 Arbitrary slicing and dicing of large datasets
 User behavior analysis
 Measuring distinct counts
 Retention analysis
 Funnel analysis
 A/B testing
 Exploratory analytics / root-cause analysis
Performance and Scalability: Fast Facts
Most events per day: 300 billion events/day (Metamarkets)
Most computed metrics: 1 billion metrics/min (Jolata)
Largest cluster: 200 nodes (Metamarkets)
Largest hourly ingestion: 2 TB per hour (Netflix)
Performance
 Query Latency
– average - 500ms
– 90%ile < 1sec
– 95%ile < 5sec
– 99%ile < 10 sec
 Query Volume
– 1000s of queries per minute
 Benchmarking code: https://github.com/druid-io/druid-benchmark
Performance: Approximate Algorithms
 Ability to store approximate data sketches for high-cardinality columns, e.g. userid
 Reduced storage size
 Use cases
 Fast approximate distinct counts
 Approximate top-K queries
 Approximate histograms
 Funnel/retention analysis
 Limitations
 Exact counts are not possible
 Cannot filter on individual row values
Simplified Druid Cluster Management with Ambari
 Install, configure and manage Druid and all external dependencies from Ambari
 Easy to enable HA, Security, Monitoring …
Monitoring a Druid Cluster
 Each Druid Node emits metrics for
 Query performance
 Ingestion Rate
 JVM Health
 Query Cache performance
 System health
 Emitted as JSON objects to a runtime log file or over HTTP to other services
 Emitters available for Ambari Metrics Server, Graphite, StatsD, Kafka
 Easy to implement your own metrics emitter
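As a sketch of the HTTP emission path, the snippet below is a tiny receiver for batches of JSON metric events; it assumes Druid's HTTP emitter is pointed at it (e.g. via druid.emitter.http.recipientBaseUrl) and that events arrive as a JSON array with fields such as service, metric and value, which should be treated as assumptions rather than a spec.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class MetricsReceiver(BaseHTTPRequestHandler):
    """Accepts batches of Druid metric events and prints a few fields."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        events = json.loads(self.rfile.read(length))        # assumed: JSON array
        for event in events:
            # Typical fields include "service", "host", "metric" and "value".
            print(event.get("service"), event.get("metric"), event.get("value"))
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9999), MetricsReceiver).serve_forever()
```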
Monitoring using Ambari Metrics Server
 HDP 2.6.1 contains pre-defined Grafana dashboards for
 Health of Druid nodes
 Ingestion
 Query performance
 Easy to create new dashboards and set up alerts
 Auto-configured when both Druid and Ambari Metrics Server are installed
High Availability
 Deploy the Coordinator/Overlord on multiple instances
 Leader election via ZooKeeper
 Broker – install multiple brokers
 Use the Druid Router or any load balancer to route queries to brokers
 Realtime index tasks – create redundant tasks
 Historical Nodes – create a load rule with replication factor >= 2 (default = 2), as sketched below
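A sketch of the load-rule step via the Coordinator API (the datasource name and tier are placeholders; a Coordinator on the default port 8081 is assumed):

```python
import json
import requests

# One "load forever" rule keeping two replicas of every segment in the
# default tier, for a hypothetical "wikipedia" datasource.
rules = [{"type": "loadForever", "tieredReplicants": {"_default_tier": 2}}]

resp = requests.post(
    "http://localhost:8081/druid/coordinator/v1/rules/wikipedia",
    data=json.dumps(rules),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code)
```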
Rolling Upgrades
 Maintain backwards compatibility
 Data redundancy
 Shared Nothing Architecture
 Rolling upgrades
 No Downtime
[Diagram: cluster nodes running versions 1, 2 and 3 side by side during a rolling upgrade.]
Security
 Supports authentication via Kerberos / SPNEGO
 Easy wizard-based Kerberos security enablement via Ambari
[Diagram: the user's browser obtains a Kerberos ticket from the KDC (kinit) and presents a SPNEGO token to Druid.]
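A sketch of querying a Kerberized Druid from a client that has already run kinit, using the requests-kerberos package (the package choice and broker hostname are assumptions, not part of the deck):

```python
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

# Assumes `kinit user` has already been run, so a Kerberos ticket is cached.
auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)

resp = requests.post(
    "http://druid-broker.example.com:8082/druid/v2/sql",   # placeholder host
    json={"query": "SELECT COUNT(*) FROM wikipedia"},
    auth=auth,                                              # negotiates a SPNEGO token
)
print(resp.json())
```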
Extending Core Druid
 Plugin-based architecture
 Leverages Guice to load extensions at runtime
 Extensions can:
 Add a new deep storage implementation
 Add a new Firehose
 Add Aggregators
 Add Complex metrics
 Add new Query types
 Add new Jersey resources
 Bundle your extension with all the other Druid extensions
Companies Using Druid
Recent Improvements
Druid 0.10.0
 Kafka Indexing Service – exactly-once ingestion
 Built-in SQL support (CLI, HTTP, JDBC)
 Numeric dimensions
 Kerberos authentication
 Performance improvements
 Optimized handling of large numbers of AND/OR filters with Concise bitmaps
 Index-based evaluation of simple LIKE/regex filters such as 'foo%'
 ~30% improvement on non-time groupBys
 Apache Hive Integration – Supports full SQL, Large Joins, Batch Indexing
 Apache Ambari Integration – Easy deployments and Cluster management
Future Work
 Improved schema definition & management
 Improvements to Hive/Druid integration
 Materialized views, pushing down more filters, support for complex columns, etc.
 Performance improvements
 Select query performance improvements
 JIT-friendly TopN queries
 Security enhancements
 Row/Column level security
 Integration with Apache Ranger
 And much more……
Community
 User google group - druid-user@googlegroups.com
 Dev google group - druid-dev@googlegroups.com
 Github - druid-io/druid
 IRC - #druid-dev on irc.freenode.net
Summary
 Easy installation and management via Ambari
 Real-time
– Ingestion latency < seconds.
– Query latency < seconds.
 Arbitrarily slice and dice big data like a ninja
– No more pre-canned drill downs.
– Query with more fine-grained granularity.
 High availability and Rolling deployment capabilities
 Secure and Production ready
 Vibrant and Active community
 Available as Tech Preview in HDP 2.6.1
Thank you! Questions?
 Twitter - @NishantBangarwa
 Email - nbangarwa@hortonworks.com
 Linkedin - https://www.linkedin.com/in/nishant-bangarwa
AtScale + Hive + Druid
 Leverage AtScale cubing capabilities
 Store aggregate tables in Druid
 Updatable dimensions in Hive
Storage Format
Druid: Segments
 Data in Druid is stored in Segment Files.
 Partitioned by time
 Ideally, segment files are each smaller than 1GB.
 If files are large, smaller time partitions are needed.
[Timeline diagram: Segment 1 (Monday), Segment 2 (Tuesday), Segment 3 (Wednesday), Segment 4 (Thursday), Segments 5_1 and 5_2 (Friday).]
Example Wikipedia Edit Dataset
timestamp page language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65
2011-01-01T00:03:63Z Justin Bieber en SF USA 15 62
2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45
2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87
2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99
2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53
Columns: timestamp | dimensions (page, language, city, country, …) | metrics (added, deleted)
Data Rollup
timestamp page language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65
2011-01-01T00:03:63Z Justin Bieber en SF USA 15 62
2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45
2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87
2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99
2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53
timestamp page language city country count sum_added sum_deleted min_added max_added ….
2011-01-01T00:00:00Z Justin Bieber en SF USA 3 57 172 10 32
2011-01-01T00:00:00Z Ke$ha en Calgary CA 2 60 186 17 43
2011-01-02T00:00:00Z Selena Gomes en Calgary CA 1 12 53 12 12
Rollup by hour
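The same hourly rollup can be sketched with pandas to make the aggregation explicit (column names follow the example above; this is purely illustrative and not how Druid implements rollup):

```python
import pandas as pd

raw = pd.DataFrame(
    [
        ("2011-01-01T00:01:35Z", "Justin Bieber", 10, 65),
        ("2011-01-01T00:03:45Z", "Justin Bieber", 15, 62),  # seconds adjusted so it parses
        ("2011-01-01T00:04:51Z", "Justin Bieber", 32, 45),
        ("2011-01-01T00:05:35Z", "Ke$ha", 17, 87),
        ("2011-01-01T00:06:41Z", "Ke$ha", 43, 99),
        ("2011-01-02T00:08:35Z", "Selena Gomes", 12, 53),
    ],
    columns=["timestamp", "page", "added", "deleted"],
)
raw["timestamp"] = pd.to_datetime(raw["timestamp"])

# Truncate timestamps to the hour and aggregate: this is the "rollup by hour".
rolled = (
    raw.groupby([raw["timestamp"].dt.floor("H"), "page"])
    .agg(
        count=("added", "size"),
        sum_added=("added", "sum"),
        sum_deleted=("deleted", "sum"),
        min_added=("added", "min"),
        max_added=("added", "max"),
    )
    .reset_index()
)
print(rolled)
```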
Dictionary Encoding
 Create and store Ids for each value
 e.g. page column
 Values - Justin Bieber, Ke$ha, Selena Gomes
 Encoding - Justin Bieber : 0, Ke$ha: 1, Selena Gomes: 2
 Column Data - [0 0 0 1 1 2]
 city column - [0 0 0 1 1 1]
timestamp page language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65
2011-01-01T00:03:63Z Justin Bieber en SF USA 15 62
2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45
2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87
2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99
2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53
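A toy Python sketch of the same encoding, using the values from the example above:

```python
# Dictionary-encode a column: map each distinct value to a small integer id,
# then store the column as a list of ids instead of repeated strings.
def dictionary_encode(column):
    ids, encoded = {}, []
    for value in column:
        if value not in ids:
            ids[value] = len(ids)          # first-seen order: Justin Bieber -> 0, ...
        encoded.append(ids[value])
    return ids, encoded

page = ["Justin Bieber", "Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha", "Selena Gomes"]
ids, encoded = dictionary_encode(page)
print(ids)        # {'Justin Bieber': 0, 'Ke$ha': 1, 'Selena Gomes': 2}
print(encoded)    # [0, 0, 0, 1, 1, 2]
```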
Bitmap Indices
 Store Bitmap Indices for each value
 Justin Bieber -> [0, 1, 2] -> [1 1 1 0 0 0]
 Ke$ha -> [3, 4] -> [0 0 0 1 1 0]
 Selena Gomes -> [5] -> [0 0 0 0 0 1]
 Queries
 Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0]
 language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1]
 Indexes compressed with Concise or Roaring encoding
timestamp page language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65
2011-01-01T00:03:63Z Justin Bieber en SF USA 15 62
2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45
2011-01-01T00:01:35Z Ke$ha en Calgary CA 17 87
2011-01-01T00:01:35Z Ke$ha en Calgary CA 43 99
2011-01-01T00:01:35Z Selena Gomes en Calgary CA 12 53
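The same bitmap arithmetic as a toy Python sketch, building on the dictionary-encoded page column above (Druid stores these bitmaps compressed with Concise or Roaring rather than as plain lists):

```python
# Build one bitmap per distinct value id (plain 0/1 lists here).
def bitmap_index(encoded_column, num_values):
    bitmaps = [[0] * len(encoded_column) for _ in range(num_values)]
    for row, value_id in enumerate(encoded_column):
        bitmaps[value_id][row] = 1
    return bitmaps

page_bitmaps = bitmap_index([0, 0, 0, 1, 1, 2], num_values=3)
# page_bitmaps[0] == [1, 1, 1, 0, 0, 0]   (Justin Bieber)
# page_bitmaps[1] == [0, 0, 0, 1, 1, 0]   (Ke$ha)

def bitmap_or(a, b):
    return [x | y for x, y in zip(a, b)]

def bitmap_and(a, b):
    return [x & y for x, y in zip(a, b)]

# page = Justin Bieber OR page = Ke$ha
print(bitmap_or(page_bitmaps[0], page_bitmaps[1]))           # [1, 1, 1, 1, 1, 0]

# language = en AND country = CA
print(bitmap_and([1, 1, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1]))    # [0, 0, 0, 1, 1, 1]
```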
Approximate Sketch Columns
timestamp page userid language city country … added deleted
2011-01-01T00:01:35Z Justin Bieber user1111111 en SF USA 10 65
2011-01-01T00:03:63Z Justin Bieber user1111111 en SF USA 15 62
2011-01-01T00:04:51Z Justin Bieber user2222222 en SF USA 32 45
2011-01-01T00:05:35Z Ke$ha user3333333 en Calgary CA 17 87
2011-01-01T00:06:41Z Ke$ha user4444444 en Calgary CA 43 99
2011-01-02T00:08:35Z Selena Gomes user1111111 en Calgary CA 12 53
timestamp page language city country count sum_added sum_deleted min_added userid_sketch ….
2011-01-01T00:00:00Z Justin Bieber en SF USA 3 57 172 10 {sketch}
2011-01-01T00:00:00Z Ke$ha en Calgary CA 2 60 186 17 {sketch}
2011-01-02T00:00:00Z Selena Gomes en Calgary CA 1 12 53 12 {sketch}
Rollup by hour
Approximate Sketch Columns
 Better rollup for high-cardinality columns, e.g. userid
 Reduced storage size
 Use cases
 Fast approximate distinct counts
 Approximate histograms
 Funnel/retention analysis
 Limitations
 Exact counts are not possible
 Cannot filter on individual row values
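To give a feel for how a sketch column behaves, here is a sketch using the Apache DataSketches Python bindings (the datasketches package and its HLL API are my assumption; Druid uses the equivalent Java sketches inside segments): one HLL sketch per rolled-up row, merged at query time for a distinct count.

```python
# pip install datasketches   (assumed package; API names per the DataSketches docs)
from datasketches import hll_sketch, hll_union

# One HLL sketch per rolled-up row, updated with the raw userids behind it.
row1 = hll_sketch(12)
for uid in ("user1111111", "user1111111", "user2222222"):
    row1.update(uid)

row2 = hll_sketch(12)
for uid in ("user3333333", "user4444444"):
    row2.update(uid)

# A distinct-count query merges the per-row sketches instead of scanning raw rows.
union = hll_union(12)
union.update(row1)
union.update(row2)
print(round(union.get_result().get_estimate()))   # ~4 unique users
```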
Weitere ähnliche Inhalte

Was ist angesagt?

Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...
Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...
Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...HostedbyConfluent
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developersconfluent
 
Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)DataWorks Summit
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseDataWorks Summit
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 
Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014Philip Fisher-Ogden
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Databricks
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaJeff Holoman
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Introduction to Apache NiFi dws19 DWS - DC 2019
Introduction to Apache NiFi   dws19 DWS - DC 2019Introduction to Apache NiFi   dws19 DWS - DC 2019
Introduction to Apache NiFi dws19 DWS - DC 2019Timothy Spann
 
Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium confluent
 
PostgreSQL Replication High Availability Methods
PostgreSQL Replication High Availability MethodsPostgreSQL Replication High Availability Methods
PostgreSQL Replication High Availability MethodsMydbops
 

Was ist angesagt? (20)

Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...
Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...
Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
 
Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Introduction to Apache NiFi dws19 DWS - DC 2019
Introduction to Apache NiFi   dws19 DWS - DC 2019Introduction to Apache NiFi   dws19 DWS - DC 2019
Introduction to Apache NiFi dws19 DWS - DC 2019
 
Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
PostgreSQL Replication High Availability Methods
PostgreSQL Replication High Availability MethodsPostgreSQL Replication High Availability Methods
PostgreSQL Replication High Availability Methods
 

Andere mochten auch

Druid at SF Big Analytics 2015-12-01
Druid at SF Big Analytics 2015-12-01Druid at SF Big Analytics 2015-12-01
Druid at SF Big Analytics 2015-12-01gianmerlino
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Uwe Printz
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with HadoopOReillyStrata
 
OLAP options on Hadoop
OLAP options on HadoopOLAP options on Hadoop
OLAP options on HadoopYuta Imai
 
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...Sudhir Tonse
 

Andere mochten auch (7)

Druid at SF Big Analytics 2015-12-01
Druid at SF Big Analytics 2015-12-01Druid at SF Big Analytics 2015-12-01
Druid at SF Big Analytics 2015-12-01
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
 
OLAP options on Hadoop
OLAP options on HadoopOLAP options on Hadoop
OLAP options on Hadoop
 
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
 
Scalable Real-time analytics using Druid
Scalable Real-time analytics using DruidScalable Real-time analytics using Druid
Scalable Real-time analytics using Druid
 

Ähnlich wie Druid: Sub-Second OLAP queries over Petabytes of Streaming Data

Scalable olap with druid
Scalable olap with druidScalable olap with druid
Scalable olap with druidKashif Khan
 
Druid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsDruid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsAaron Brooks
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveAldrin Piri
 
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks
 
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFIHarnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFIHaimo Liu
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017alanfgates
 
Unlocking insights in streaming data
Unlocking insights in streaming dataUnlocking insights in streaming data
Unlocking insights in streaming dataCarolyn Duby
 
HDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionHDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionMilind Pandit
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseHortonworks
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course WorkshopDataWorks Summit
 
Internet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop SummitInternet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop SummitDataWorks Summit
 
Social Media Monitoring with NiFi, Druid and Superset
Social Media Monitoring with NiFi, Druid and SupersetSocial Media Monitoring with NiFi, Druid and Superset
Social Media Monitoring with NiFi, Druid and SupersetThiago Santiago
 
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUGReal-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUGskumpf
 
Paris FOD meetup - Streams Messaging Manager
Paris FOD meetup - Streams Messaging ManagerParis FOD meetup - Streams Messaging Manager
Paris FOD meetup - Streams Messaging ManagerAbdelkrim Hadjidj
 
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopRescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopHortonworks
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Ashish Narasimham
 
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...Data Con LA
 
Webinar Series Part 5 New Features of HDF 5
Webinar Series Part 5 New Features of HDF 5Webinar Series Part 5 New Features of HDF 5
Webinar Series Part 5 New Features of HDF 5Hortonworks
 
Make Streaming Analytics work for you: The Devil is in the Details
Make Streaming Analytics work for you: The Devil is in the DetailsMake Streaming Analytics work for you: The Devil is in the Details
Make Streaming Analytics work for you: The Devil is in the DetailsDataWorks Summit/Hadoop Summit
 

Ähnlich wie Druid: Sub-Second OLAP queries over Petabytes of Streaming Data (20)

Scalable olap with druid
Scalable olap with druidScalable olap with druid
Scalable olap with druid
 
Druid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsDruid Scaling Realtime Analytics
Druid Scaling Realtime Analytics
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep Dive
 
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1
 
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFIHarnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
Unlocking insights in streaming data
Unlocking insights in streaming dataUnlocking insights in streaming data
Unlocking insights in streaming data
 
HDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionHDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi Introduction
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course Workshop
 
Internet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop SummitInternet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop Summit
 
Social Media Monitoring with NiFi, Druid and Superset
Social Media Monitoring with NiFi, Druid and SupersetSocial Media Monitoring with NiFi, Druid and Superset
Social Media Monitoring with NiFi, Druid and Superset
 
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUGReal-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
 
Paris FOD meetup - Streams Messaging Manager
Paris FOD meetup - Streams Messaging ManagerParis FOD meetup - Streams Messaging Manager
Paris FOD meetup - Streams Messaging Manager
 
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopRescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
 
Hadoop Summit Tokyo Apache NiFi Crash Course
Hadoop Summit Tokyo Apache NiFi Crash CourseHadoop Summit Tokyo Apache NiFi Crash Course
Hadoop Summit Tokyo Apache NiFi Crash Course
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
 
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
 
Webinar Series Part 5 New Features of HDF 5
Webinar Series Part 5 New Features of HDF 5Webinar Series Part 5 New Features of HDF 5
Webinar Series Part 5 New Features of HDF 5
 
Make Streaming Analytics work for you: The Devil is in the Details
Make Streaming Analytics work for you: The Devil is in the DetailsMake Streaming Analytics work for you: The Devil is in the Details
Make Streaming Analytics work for you: The Devil is in the Details
 

Mehr von DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 

Kürzlich hochgeladen (20)

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

Druid: Sub-Second OLAP queries over Petabytes of Streaming Data

  • 1. Druid : Sub-Second OLAP queries over Petabytes of Streaming Data Nishant Bangarwa Hortonworks Druid Committer, PMC Superset Incubator PPMC June 2017
  • 2. © Hortonworks Inc. 2011 – 2016. All Rights Reserved2 Agenda History and Motivation Introduction Demo Druid Architecture – Indexing and Querying Data Druid In Production Recent Improvements
  • 3. © Hortonworks Inc. 2011 – 2016. All Rights Reserved HISTORY AND MOTIVATION  Druid Open sourced in late 2012  Initial Use case  Power ad-tech analytics product  Requirements  Arbitrary queries  Scalability : trillions of events/day  Interactive : low latency queries  Real-time : data freshness  High Availability  Rolling Upgrades
  • 4. © Hortonworks Inc. 2011 – 2016. All Rights Reserved4 MOTIVATION  Business Intelligence Queries  Arbitrary slicing and dicing of data  Interactive real time visualizations on Complex data streams  Answer BI questions – How many unique male visitors visited my website last month ? – How many products were sold last quarter broken down by a demographic and product category ?  Not interested in dumping entire dataset
  • 5. © Hortonworks Inc. 2011 – 2016. All Rights Reserved5 Introdution
  • 6. © Hortonworks Inc. 2011 – 2016. All Rights Reserved6 What is Druid ?  Column-oriented distributed datastore  Sub-Second query times  Realtime streaming ingestion  Arbitrary slicing and dicing of data  Automatic Data Summarization  Approximate algorithms (hyperLogLog, theta)  Scalable to petabytes of data  Highly available
  • 7. © Hortonworks Inc. 2011 – 2016. All Rights Reserved7 Demo
  • 8. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Demo: Wikipedia Real-Time Dashboard (Accelerated 30x)
  • 9. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Wikipedia Real-Time Dashboard: How it Works Wikipedia Edits Data Stream Exactly-Once Ingestion Write Read Java Stream Reader
  • 10. © Hortonworks Inc. 2011 – 2016. All Rights Reserved10 Druid Architecture
  • 11. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Realtime Nodes Historical Nodes 11 Druid Architecture Batch Data Event Historical Nodes Broker Nodes Realtime Index Tasks Streaming Data Historical Nodes Handoff
  • 12. © Hortonworks Inc. 2011 – 2016. All Rights Reserved12 Druid Architecture Batch Data Queries Metadata Store Coordinator Nodes Zookeepe r Historical Nodes Broker Nodes Realtime Index Tasks Streaming Data Handoff Optional Distributed Query Cache
  • 13. © Hortonworks Inc. 2011 – 2016. All Rights Reserved13 Indexing Data
  • 14. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Indexing Service  Indexing is performed by  Overlord  Middle Managers  Peons  Middle Managers spawn peons which runs ingestion tasks  Each peon runs 1 task  Task definition defines which task to run and its properties
  • 15. © Hortonworks Inc. 2011 – 2016. All Rights Reserved15 Streaming Ingestion : Realtime Index Tasks  Ability to ingest streams of data  Stores data in a write-optimized structure  Periodically converts the write-optimized structure to read-optimized segments  Events are queryable as soon as they are ingested  Both push- and pull-based ingestion
  • 16. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Streaming Ingestion : Tranquility  Helper library for coordinating streaming ingestion  Simple API to send events to Druid  Transparently manages  Realtime index task creation  Partitioning and replication  Schema evolution  Can be used with Flink, Samza, Spark, Storm or any other ETL framework (an illustrative HTTP ingestion sketch appears after this transcript)
  • 17. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Kafka Indexing Service (experimental)  Supports exactly-once ingestion  Messages pulled by Kafka index tasks  Each Kafka index task consumes from a set of partitions with start and end offsets  Each message is verified to ensure correct ordering  Kafka offsets and the corresponding segments are persisted atomically in the same metadata transaction  Kafka Supervisor  Embedded inside the Overlord  Manages Kafka index tasks  Retries failed tasks (an illustrative supervisor-spec sketch appears after this transcript)
  • 18. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Batch Ingestion  HadoopIndexTask  Peon launches a Hadoop MR job  Mappers read data  Reducers create Druid segment files  Index Task  Runs in a single JVM, i.e. a peon  Suitable for small data sizes (<1 GB)  Integrations with Apache Hive and Spark for batch ingestion (an illustrative index-task sketch appears after this transcript)
  • 19. © Hortonworks Inc. 2011 – 2016. All Rights Reserved19 Querying Data
  • 20. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Querying Data from Druid  Druid supports  JSON queries over HTTP  Built-in SQL (experimental)  Querying libraries available for  Python  R  Ruby  JavaScript  Clojure  PHP
  • 21. © Hortonworks Inc. 2011 – 2016. All Rights Reserved21 JSON Over HTTP  HTTP REST API  Queries and results expressed in JSON  Multiple query types  Time Boundary  Timeseries  TopN  GroupBy  Select  Segment Metadata (an illustrative query sketch appears after this transcript)
  • 22. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Built-in SQL (experimental)  Apache Calcite based parser and planner  Ability to connect Druid to any BI tool that supports JDBC  SQL via JSON over HTTP  Supports approximate queries  APPROX_COUNT_DISTINCT(col)  Ability to do fast approximate TopN queries  APPROX_QUANTILE(column, probability) (an illustrative SQL sketch appears after this transcript)
  • 23. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Integrated with multiple open source UI tools  Superset –  Developed at Airbnb  In Apache incubation since May 2017  Grafana – Druid plugin (https://github.com/grafana-druid-plugin/druidplugin)  Metabase  With built-in SQL, connects to any BI tool supporting JDBC
  • 24. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Superset  Python backend  Flask App Builder  Authentication  Pandas for rich analytics  SQLAlchemy as the SQL toolkit  JavaScript frontend  React, NVD3  Deep integration with Druid
  • 25. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Superset Rich Dashboarding Capabilities: Treemaps
  • 26. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Superset Rich Dashboarding Capabilities: Sunburst
  • 27. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Superset UI Provides Powerful Visualizations Rich library of dashboard visualizations: Basic: • Bar Charts • Pie Charts • Line Charts Advanced: • Sankey Diagrams • Treemaps • Sunburst • Heatmaps And More!
  • 28. © Hortonworks Inc. 2011 – 2016. All Rights Reserved28 Druid in Production
  • 29. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Production readiness  Is Druid suitable for my Use case ?  Will Druid meet my performance requirements at scale ?  How complex is it to Operate and Manage Druid cluster ?  How to monitor a Druid cluster ?  High Availability ?  How to upgrade Druid cluster without downtime ?  Security ?  Extensibility for future Use cases ?
  • 30. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Suitable Use Cases  Powering Interactive user facing applications  Arbitrary slicing and dicing of large datasets  User behavior analysis  measuring distinct counts  retention analysis  funnel analysis  A/B testing  Exploratory analytics/root cause analysis
  • 31. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Performance and Scalability : Fast Facts Most Events per Day 300 Billion Events / Day (Metamarkets) Most Computed Metrics 1 Billion Metrics / Min (Jolata) Largest Cluster 200 Nodes (Metamarkets) Largest Hourly Ingestion 2TB per Hour (Netflix)
  • 32. © Hortonworks Inc. 2011 – 2016. All Rights Reserved32 Performance  Query Latency – average - 500ms – 90%ile < 1sec – 95%ile < 5sec – 99%ile < 10 sec  Query Volume – 1000s queries per minute  Benchmarking code  https://github.com/druid-io/druid-benchmark
  • 33. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Performance : Approximate Algorithms  Ability to store approximate data sketches for high-cardinality columns, e.g. userid  Reduced storage size  Use cases  Fast approximate distinct counts  Approximate Top-K queries  Approximate histograms  Funnel/retention analysis  Limitations  Not possible to do exact counts or to filter on individual row values
  • 34. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Simplified Druid Cluster Management with Ambari  Install, configure and manage Druid and all external dependencies from Ambari  Easy to enable HA, Security, Monitoring …
  • 35. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Simplified Druid Cluster Management with Ambari
  • 36. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Monitoring a Druid Cluster  Each Druid Node emits metrics for  Query performance  Ingestion Rate  JVM Health  Query Cache performance  System health  Emitted as JSON objects to a runtime log file or over HTTP to other services  Emitters available for Ambari Metrics Server, Graphite, StatsD, Kafka  Easy to implement your own metrics emitter
  • 37. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Monitoring using Ambari Metrics Server  HDP 2.6.1 contains pre-defined grafana dashboards  Health of Druid Nodes  Ingestion  Query performance  Easy to create new dashboards and setup alerts  Auto configured when both Druid and Ambari Metrics Server are installed
  • 38. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Monitoring using Ambari Metrics Server
  • 39. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Monitoring using Ambari Metrics Server
  • 40. © Hortonworks Inc. 2011 – 2016. All Rights Reserved High Availability  Deploy Coordinator/Overlord on multiple instances  Leader election via ZooKeeper  Broker – install multiple brokers  Use the Druid Router or any load balancer to route queries to brokers  Realtime index tasks – create redundant tasks  Historical nodes – create a load rule with replication factor >= 2 (default = 2) (an illustrative load-rule sketch appears after this transcript)
  • 41. © Hortonworks Inc. 2011 – 2016. All Rights Reserved41 Rolling Upgrades  Maintain backwards compatibility  Data redundancy  Shared-nothing architecture  Rolling upgrades  No downtime
  • 42. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Security  Supports authentication via Kerberos/SPNEGO  Easy wizard-based Kerberos security enablement via Ambari  (Diagram: user browser, KDC server, Druid – step 1: kinit user, step 2: token)
  • 43. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Extending Core Druid  Plugin-based architecture  Leverages Guice to load extensions at runtime  Possible extension points  Add a new deep storage implementation  Add a new firehose  Add aggregators  Add complex metrics  Add new query types  Add new Jersey resources  Bundle your extension with all the other Druid extensions
  • 44. © Hortonworks Inc. 2011 – 2016. All Rights Reserved44 Companies Using Druid
  • 45. © Hortonworks Inc. 2011 – 2016. All Rights Reserved45 Recent Improvements
  • 46. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid 0.10.0  Kafka Indexing Service – exactly-once ingestion  Built-in SQL support (CLI, HTTP, JDBC)  Numeric dimensions  Kerberos authentication  Performance improvements  Optimized handling of large numbers of AND/OR filters with Concise bitmaps  Index-based evaluation of simple regex filters like ‘foo%’  ~30% improvement on non-time groupBys  Apache Hive integration – supports full SQL, large joins, batch indexing  Apache Ambari integration – easy deployments and cluster management
  • 47. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Future Work  Improved schema definition & management  Improvements to Hive/Druid integration  Materialized views, push down more filters, support complex columns, etc.  Performance improvements  Select query performance improvements  JIT-friendly topN queries  Security enhancements  Row/column level security  Integration with Apache Ranger  And much more…
  • 48. © Hortonworks Inc. 2011 – 2016. All Rights Reserved48 Community  User google group - druid-user@googlegroups.com  Dev google group - druid-dev@googlegroups.com  Github - druid-io/druid  IRC - #druid-dev on irc.freenode.net
  • 49. © Hortonworks Inc. 2011 – 2016. All Rights Reserved49 Summary  Easy installation and management via Ambari  Real-time – Ingestion latency < seconds. – Query latency < seconds.  Arbitrary slicing and dicing of big data like a ninja – No more pre-canned drill-downs. – Query with more fine-grained granularity.  High availability and rolling deployment capabilities  Secure and production ready  Vibrant and active community  Available as Tech Preview in HDP 2.6.1
  • 50. © Hortonworks Inc. 2011 – 2016. All Rights Reserved50 Thank you ! Questions ?  Twitter - @NishantBangarwa  Email - nbangarwa@hortonworks.com  Linkedin - https://www.linkedin.com/in/nishant-bangarwa
  • 51. © Hortonworks Inc. 2011 – 2016. All Rights Reserved AtScale + Hive + Druid  Leverage AtScale cubing capabilities  Store aggregate tables in Druid  Updatable dimensions in Hive
  • 52. © Hortonworks Inc. 2011 – 2016. All Rights Reserved52 Storage Format
  • 53. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid: Segments  Data in Druid is stored in Segment Files.  Partitioned by time  Ideally, segment files are each smaller than 1GB.  If files are large, smaller time partitions are needed. Time Segment 1: Monday Segment 2: Tuesday Segment 3: Wednesday Segment 4: Thursday Segment 5_2: Friday Segment 5_1: Friday
  • 54. © Hortonworks Inc. 2011 – 2016. All Rights Reserved54 Example Wikipedia Edit Dataset timestamp page language city country … added deleted 2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65 2011-01-01T00:03:63Z Justin Bieber en SF USA 15 62 2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45 2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87 2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99 2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53 Timestamp Dimensions Metrics
  • 55. © Hortonworks Inc. 2011 – 2016. All Rights Reserved55 Data Rollup timestamp page language city country … added deleted 2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65 2011-01-01T00:03:63Z Justin Bieber en SF USA 15 62 2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45 2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87 2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99 2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53 timestamp page language city country count sum_added sum_deleted min_added max_added …. 2011-01-01T00:00:00Z Justin Bieber en SF USA 3 57 172 10 32 2011-01-01T00:00:00Z Ke$ha en Calgary CA 2 60 186 17 43 2011-01-02T00:00:00Z Selena Gomes en Calgary CA 1 12 53 12 12 Rollup by hour (see the rollup sketch after this transcript)
  • 56. © Hortonworks Inc. 2011 – 2016. All Rights Reserved56 Dictionary Encoding  Create and store IDs for each value  e.g. page column  Values - Justin Bieber, Ke$ha, Selena Gomes  Encoding - Justin Bieber : 0, Ke$ha: 1, Selena Gomes: 2  Column Data - [0 0 0 1 1 2]  city column - [0 0 0 1 1 1] timestamp page language city country … added deleted 2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65 2011-01-01T00:03:63Z Justin Bieber en SF USA 15 62 2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45 2011-01-01T00:05:35Z Ke$ha en Calgary CA 17 87 2011-01-01T00:06:41Z Ke$ha en Calgary CA 43 99 2011-01-02T00:08:35Z Selena Gomes en Calgary CA 12 53 (see the encoding sketch after this transcript)
  • 57. © Hortonworks Inc. 2011 – 2016. All Rights Reserved57 Bitmap Indices  Store Bitmap Indices for each value  Justin Bieber -> [0, 1, 2] -> [1 1 1 0 0 0]  Ke$ha -> [3, 4] -> [0 0 0 1 1 0]  Selena Gomes -> [5] -> [0 0 0 0 0 1]  Queries  Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0]  language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1]  Indexes compressed with Concise or Roaring encoding timestamp page language city country … added deleted 2011-01-01T00:01:35Z Justin Bieber en SF USA 10 65 2011-01-01T00:03:63Z Justin Bieber en SF USA 15 62 2011-01-01T00:04:51Z Justin Bieber en SF USA 32 45 2011-01-01T00:01:35Z Ke$ha en Calgary CA 17 87 2011-01-01T00:01:35Z Ke$ha en Calgary CA 43 99 2011-01-01T00:01:35Z Selena Gomes en Calgary CA 12 53 (see the bitmap sketch after this transcript)
  • 58. © Hortonworks Inc. 2011 – 2016. All Rights Reserved58 Approximate Sketch Columns timestamp page userid language city country … added deleted 2011-01-01T00:01:35Z Justin Bieber user1111111 en SF USA 10 65 2011-01-01T00:03:63Z Justin Bieber user1111111 en SF USA 15 62 2011-01-01T00:04:51Z Justin Bieber user2222222 en SF USA 32 45 2011-01-01T00:05:35Z Ke$ha user3333333 en Calgary CA 17 87 2011-01-01T00:06:41Z Ke$ha user4444444 en Calgary CA 43 99 2011-01-02T00:08:35Z Selena Gomes user1111111 en Calgary CA 12 53 timestamp page language city country count sum_added sum_delete d min_added Userid_sket ch …. 2011-01-01T00:00:00Z Justin Bieber en SF USA 3 57 172 10 {sketch} 2011-01-01T00:00:00Z Ke$ha en Calgary CA 2 60 186 17 {sketch} 2011-01-02T00:00:00Z Selena Gomes en Calgary CA 1 12 53 12 {sketch} Rollup by hour
  • 59. © Hortonworks Inc. 2011 – 2016. All Rights Reserved Approximate Sketch Columns  Better rollup for high-cardinality columns, e.g. userid  Reduced storage size  Use cases  Fast approximate distinct counts  Approximate histograms  Funnel/retention analysis  Limitations  Not possible to do exact counts or to filter on individual row values (an illustrative sketch-column example follows this transcript)
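
The following sketches expand on several of the slides above. They are illustrative only: all host names, ports, datasource, topic, and column names are assumptions, and the Python snippets (using the requests library) are minimal sketches rather than production code.

Streaming ingestion with Tranquility (slide 16): Tranquility is normally embedded as a Scala/Java library inside the ETL job, but the distribution also ships a standalone Tranquility Server that accepts events over HTTP. A minimal sketch, assuming a Tranquility Server listening on localhost:8200 with a "wikipedia" dataSource declared in its configuration:

    # Illustrative sketch: push one event to a Tranquility Server over HTTP.
    # Host, port and dataSource name are assumptions.
    import requests
    from datetime import datetime, timezone

    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "page": "Justin Bieber",
        "language": "en",
        "added": 10,
        "deleted": 2,
    }

    resp = requests.post(
        "http://localhost:8200/v1/post/wikipedia",  # Tranquility Server HTTP endpoint
        json=[event],                               # a JSON array of events
        timeout=10,
    )
    print(resp.json())  # reports how many events were received and sent on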
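
Kafka indexing service (slide 17): the supervisor is driven by a JSON spec posted to the Overlord. A trimmed-down, illustrative spec; the datasource, topic, broker address, columns, and tuning values are assumptions, and the exact options depend on the Druid version:

    # Illustrative sketch: submit a Kafka supervisor spec to the Overlord.
    # Overlord address (default port 8090 assumed) and all field values are examples.
    import requests

    supervisor_spec = {
        "type": "kafka",
        "dataSchema": {
            "dataSource": "wikipedia",
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {"column": "timestamp", "format": "iso"},
                    "dimensionsSpec": {"dimensions": ["page", "language", "city", "country"]},
                },
            },
            "metricsSpec": [
                {"type": "count", "name": "count"},
                {"type": "longSum", "name": "sum_added", "fieldName": "added"},
                {"type": "longSum", "name": "sum_deleted", "fieldName": "deleted"},
            ],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "HOUR",
                "queryGranularity": "MINUTE",
            },
        },
        "ioConfig": {
            "topic": "wikipedia-edits",
            "consumerProperties": {"bootstrap.servers": "kafkabroker:9092"},
            "taskCount": 1,
            "replicas": 2,          # redundant tasks for availability
            "taskDuration": "PT1H",
        },
        "tuningConfig": {"type": "kafka", "maxRowsInMemory": 75000},
    }

    resp = requests.post("http://overlord:8090/druid/indexer/v1/supervisor",
                         json=supervisor_spec, timeout=30)
    print(resp.status_code, resp.text)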
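
Batch ingestion (slide 18): the native index task (the single-JVM peon variant) can be posted to the Overlord's task endpoint. Again a sketch with assumed paths and schema:

    # Illustrative sketch: submit a native (non-Hadoop) index task to the Overlord.
    # Suitable only for small inputs; the base directory and schema are assumptions.
    import requests

    index_task = {
        "type": "index",
        "spec": {
            "dataSchema": {
                "dataSource": "wikipedia",
                "parser": {
                    "type": "string",
                    "parseSpec": {
                        "format": "json",
                        "timestampSpec": {"column": "timestamp", "format": "iso"},
                        "dimensionsSpec": {"dimensions": ["page", "language", "city", "country"]},
                    },
                },
                "metricsSpec": [
                    {"type": "count", "name": "count"},
                    {"type": "longSum", "name": "sum_added", "fieldName": "added"},
                ],
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",
                    "queryGranularity": "HOUR",
                    "intervals": ["2011-01-01/2011-01-03"],
                },
            },
            "ioConfig": {
                "type": "index",
                "firehose": {"type": "local", "baseDir": "/data/wikipedia", "filter": "*.json"},
            },
            "tuningConfig": {"type": "index"},
        },
    }

    resp = requests.post("http://overlord:8090/druid/indexer/v1/task",
                         json=index_task, timeout=30)
    print(resp.json())  # returns the task id on success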
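
JSON over HTTP (slide 21): queries are JSON documents posted to a Broker. An illustrative topN query; the Broker address (default port 8082 assumed), datasource, and column names are assumptions:

    # Illustrative sketch: run a topN query against a Broker over HTTP.
    import requests

    topn_query = {
        "queryType": "topN",
        "dataSource": "wikipedia",
        "intervals": ["2011-01-01/2011-01-03"],
        "granularity": "all",
        "dimension": "page",   # rank pages...
        "metric": "added",     # ...by total lines added
        "threshold": 5,
        "aggregations": [
            {"type": "longSum", "name": "added", "fieldName": "sum_added"},
        ],
        "filter": {"type": "selector", "dimension": "language", "value": "en"},
    }

    resp = requests.post("http://broker:8082/druid/v2/", json=topn_query, timeout=30)
    for row in resp.json():
        print(row["timestamp"], row["result"])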
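
Built-in SQL (slide 22): the experimental SQL layer also accepts statements over HTTP (and via JDBC through Calcite Avatica). A sketch of the JSON-over-HTTP form, assuming SQL is enabled on the Broker and that userid was ingested either as a regular column or as a sketch column:

    # Illustrative sketch: run Druid SQL over HTTP (experimental in Druid 0.10).
    # Broker address, table and column names are assumptions.
    import requests

    sql = """
    SELECT page,
           APPROX_COUNT_DISTINCT(userid) AS unique_users,
           SUM(sum_added)                AS total_added
    FROM wikipedia
    WHERE language = 'en'
    GROUP BY page
    ORDER BY unique_users DESC
    LIMIT 5
    """

    resp = requests.post("http://broker:8082/druid/v2/sql",
                         json={"query": sql}, timeout=30)
    print(resp.json())  # a JSON array of result rows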
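
High availability (slide 40): replication of historical data is controlled through load rules managed by the Coordinator. An illustrative sketch of setting a forever-load rule with two replicas in the default tier; the Coordinator address (default port 8081 assumed) and datasource name are assumptions:

    # Illustrative sketch: set a load rule with 2 replicas for a datasource.
    import requests

    rules = [
        {
            "type": "loadForever",
            "tieredReplicants": {"_default_tier": 2},  # keep 2 copies on historical nodes
        }
    ]

    resp = requests.post(
        "http://coordinator:8081/druid/coordinator/v1/rules/wikipedia",
        json=rules, timeout=30,
    )
    print(resp.status_code)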
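
Data rollup (slide 55): to make the rollup step concrete, here is a small self-contained Python sketch (plain Python, not Druid internals) that aggregates the slide's example rows by hour and by dimension values, producing the same kind of summary table:

    # Illustrative sketch of hourly rollup, mirroring the slide's example table.
    # Rows are taken from the slide (one invalid sample timestamp adjusted to :45).
    from collections import defaultdict

    rows = [
        ("2011-01-01T00:01:35Z", "Justin Bieber", "en", "SF", "USA", 10, 65),
        ("2011-01-01T00:03:45Z", "Justin Bieber", "en", "SF", "USA", 15, 62),
        ("2011-01-01T00:04:51Z", "Justin Bieber", "en", "SF", "USA", 32, 45),
        ("2011-01-01T00:05:35Z", "Ke$ha", "en", "Calgary", "CA", 17, 87),
        ("2011-01-01T00:06:41Z", "Ke$ha", "en", "Calgary", "CA", 43, 99),
        ("2011-01-02T00:08:35Z", "Selena Gomes", "en", "Calgary", "CA", 12, 53),
    ]

    rollup = defaultdict(lambda: {"count": 0, "sum_added": 0, "sum_deleted": 0,
                                  "min_added": None, "max_added": None})

    for ts, page, lang, city, country, added, deleted in rows:
        hour = ts[:13] + ":00:00Z"                    # truncate the timestamp to the hour
        agg = rollup[(hour, page, lang, city, country)]
        agg["count"] += 1
        agg["sum_added"] += added
        agg["sum_deleted"] += deleted
        agg["min_added"] = added if agg["min_added"] is None else min(agg["min_added"], added)
        agg["max_added"] = added if agg["max_added"] is None else max(agg["max_added"], added)

    for key, agg in sorted(rollup.items()):
        print(key, agg)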
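
Dictionary encoding (slide 56): an illustrative Python sketch of the encoding idea, not Druid's actual implementation:

    # Illustrative sketch of dictionary encoding for the "page" column.
    pages = ["Justin Bieber", "Justin Bieber", "Justin Bieber",
             "Ke$ha", "Ke$ha", "Selena Gomes"]

    dictionary = {}   # value -> integer id, assigned in order of first appearance
    encoded = []      # the column stored as integer ids
    for value in pages:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        encoded.append(dictionary[value])

    print(dictionary)  # {'Justin Bieber': 0, 'Ke$ha': 1, 'Selena Gomes': 2}
    print(encoded)     # [0, 0, 0, 1, 1, 2]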
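
Bitmap indices (slide 57): a matching sketch of per-value bitmaps combined with OR/AND; plain Python lists of 0/1 stand in for the compressed Concise/Roaring bitmaps Druid actually uses:

    # Illustrative sketch of bitmap indexes and boolean filter combination.
    pages = ["Justin Bieber", "Justin Bieber", "Justin Bieber",
             "Ke$ha", "Ke$ha", "Selena Gomes"]
    countries = ["USA", "USA", "USA", "CA", "CA", "CA"]

    def bitmap(column, value):
        """One bit per row: 1 where the row holds `value`, else 0."""
        return [1 if v == value else 0 for v in column]

    bieber = bitmap(pages, "Justin Bieber")   # [1, 1, 1, 0, 0, 0]
    kesha  = bitmap(pages, "Ke$ha")           # [0, 0, 0, 1, 1, 0]
    ca     = bitmap(countries, "CA")          # [0, 0, 0, 1, 1, 1]

    bieber_or_kesha = [a | b for a, b in zip(bieber, kesha)]  # rows matching either page
    kesha_and_ca    = [a & b for a, b in zip(kesha, ca)]      # rows matching both filters

    print(bieber_or_kesha)  # [1, 1, 1, 1, 1, 0]
    print(kesha_and_ca)     # [0, 0, 0, 1, 1, 0]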
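
Approximate sketch columns (slides 58-59): a sketch column such as userid_sketch is declared as a metric at ingestion time and queried with a matching aggregator. The use of hyperUnique (rather than the DataSketches theta sketch extension) and all names here are assumptions for illustration:

    # Illustrative sketch: declare a sketch metric at ingestion time, query it later.
    import requests

    # 1) Added to the ingestion spec's metricsSpec: store a sketch of user ids
    #    instead of the raw userid values. Shown for reference only; it belongs in
    #    the ingestion spec, not in the query payload below.
    sketch_metric = {"type": "hyperUnique", "name": "userid_sketch", "fieldName": "userid"}

    # 2) At query time, ask for an approximate distinct count of users per page.
    query = {
        "queryType": "topN",
        "dataSource": "wikipedia",
        "intervals": ["2011-01-01/2011-01-03"],
        "granularity": "all",
        "dimension": "page",
        "metric": "unique_users",
        "threshold": 5,
        "aggregations": [
            {"type": "hyperUnique", "name": "unique_users", "fieldName": "userid_sketch"},
        ],
    }

    resp = requests.post("http://broker:8082/druid/v2/", json=query, timeout=30)
    print(resp.json())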

Editor's Notes

  1. Thank you all for coming to my talk. The title of this talk is Druid: Sub-Second OLAP Queries over Petabytes of Streaming Data. Sub-second means fast, interactive queries that can power interactive dashboards, fast analytics, and monitoring and alerting applications. I am a Software Engineer at Hortonworks, a committer and PMC member in Druid, and part of the PPMC for the Superset incubation. I am part of the Business Intelligence team at Hortonworks. Prior to that, I spent 2 years at Metamarkets, where I was responsible for the analytics infrastructure, including real-time analytics with Druid.
  2. Motivation; Druid introduction and use case; demo; Druid architecture; storage internals; recent improvements.
  3. Initial use case: power the ad-tech analytics product at Metamarkets. Similar to what is shown in the picture on the right: a dashboard where you can visualize timeseries data and do arbitrary filtering and grouping on any combination of dimensions. Requirements - The data store needs to support arbitrary queries, i.e. users should be able to filter and group on any combination of dimensions. Scalability: it should be able to handle trillions of events/day. Interactive: since the data store was going to power an interactive dashboard, low-latency queries were a must. Real-time: the time between when an event occurs and when it is visible on the dashboard should be minimal (on the order of a few seconds). High availability – no central point of failure. Rolling upgrades – the architecture was required to support rolling upgrades.
  4. MOTIVATION Interactive real-time visualizations on complex data streams. Answer BI questions: How many unique male visitors visited my website last month? How many products were sold last quarter, broken down by a demographic and product category? We are not interested in dumping the entire dataset. Suppose I am running an ad campaign and I want to understand what kind of impressions there are, what my click-through rate is, and how many users decided to purchase my services. We may have a user activity stream and want to know how the users are behaving. We may have a stream of firewall events and want to detect any anomalies in those streams in real time. Also, for very large distributed clusters there is a need to answer questions about application performance: How is each individual node in my cluster behaving? Are there any anomalies in query response time? All the above use cases can have data streams that are huge in volume, depending on the scale of the business. How do I analyze this information? How do I get insights from these streams of events in real time?
  5. Druid Architecture
  6. What is Druid? Column-oriented distributed datastore – data is stored in columnar format; in general many datasets have a large number of dimensions, e.g. 100s or 1000s, but most queries only need 5-10 columns, and the column-oriented format lets Druid scan only the required columns. Sub-second query times – it uses various techniques such as bitmap indexes for fast filtering, memory-mapped files to serve data from memory, data summarization and compression, and query caching, and it has highly optimized algorithms for the different query types, which allows it to achieve sub-second query times. Realtime streaming ingestion from almost any ETL pipeline. Arbitrary slicing and dicing of data – no need to create pre-canned drill-downs. Automatic data summarization – it can summarize your data at ingestion time; e.g. if my dashboard only shows events aggregated by hour, we can optionally configure Druid to pre-aggregate at ingestion time. Approximate algorithms (HyperLogLog, theta sketches) – for fast approximate answers. Scalable to petabytes of data. Highly available.
  7. Druid Architecture
  8. Demo: Wikipedia Real-Time Dashboard (Accelerated 30x)
  9. Wikipedia Real-Time Dashboard: How it Works
  10. Druid Architecture
  11. Realtime index tasks – handle real-time ingestion and support both pull- and push-based ingestion. Handle queries – able to serve queries as soon as data is ingested. Store data in a write-optimized data structure on the heap. In case you need to do any ETL, such as data enrichment or joining multiple streams of data, you can do it in a separate ETL layer before sending the data to Druid. Realtime index tasks periodically persist their in-memory data to deep storage in the form of read-optimized chunks. Deep storage can be any distributed FS and acts as a permanent backup of the data. Historical nodes – the main workhorses of a Druid cluster; they use memory-mapped files to load columnar data and respond to user queries. Now let's see how data can be queried. Broker nodes – keep track of the data chunks being loaded by each node in the cluster, can scatter a query across multiple historical and realtime nodes, and provide a caching layer. Now let's discuss another case, when you do not have streaming data but want to ingest batch data into Druid. Batch ingestion can be done using either a Hadoop MR or Spark job, which converts your data into Druid's columnar format and persists it to deep storage.
  12. With many historical nodes in a cluster there is a need to balance the load across them; this is done by the coordinator nodes – they use ZooKeeper for coordination, ask historical nodes to load or drop data, move data across historical nodes to balance load in the cluster, and manage data replication.
  13. Indexing Data
  14. Indexing Service: indexing is performed by the Overlord, Middle Managers, and peons. Middle Managers spawn peons, which run ingestion tasks; each peon runs 1 task. The task definition defines which task to run and its properties.
  15. Realtime ingestion is done by realtime index tasks. They can ingest streams of data, store data in a write-optimized structure, and periodically convert the write-optimized structure into read-optimized segments. Events are queryable as soon as they are ingested. Both push- and pull-based ingestion are supported. The tasks maintain an in-memory, row-oriented key-value store; data is stored inside the heap within a map, indexed by time and dimension values, and persisted to disk based on a row threshold or elapsed time.
  16. Batch Ingestion: HadoopIndexTask – the peon launches a Hadoop MR job; mappers read data and reducers create Druid segment files. Index Task – suitable for small data sizes (<1 GB).
  17. Querying Data
  18. HTTP Rest API Queries and results expressed in JSON Multiple Query Types Time Boundary Timeseries TopN GroupBy Select Segment Metadata
  19. Druid in Practice
  20. Most Events per Day 300 Billion Events / Day (Metamarkets) Most Computed Metrics 1 Billion Metrics / Min (Jolata) Largest Cluster 200 Nodes (Metamarkets) Largest Hourly Ingestion 2TB per Hour (Netflix)
  21. Query Latency average - 500ms 90%ile < 1sec 95%ile < 5sec 99%ile < 10 sec Query Volume 1000s queries per minute
  22. Query performance – query time, segment scan time … Ingestion rate – events ingested, events persisted … JVM health – JVM heap usage, GC stats … Cache related – cache hits, cache misses, cache evictions … System related – CPU, disk, network, swap usage, etc.
  23. No Downtime Data redundancy Rolling upgrades
  24. This shows some of the production users. I can talk about some of the large ones, which have common use cases. Alibaba and eBay use Druid for e-commerce and user behavior analytics. Cisco has a real-time analytics product for analyzing network flows. Yahoo uses Druid for user behavior analytics and real-time cluster monitoring. Hulu does interactive analysis of user and application behavior. PayPal and SK Telecom use Druid for business analytics.
  25. Recent Improvements
  26. Built-in SQL support (CLI, HTTP, JDBC). Kerberos authentication. Performance improvements: optimized handling of large numbers of AND/OR filters with Concise bitmaps, index-based evaluation of simple regex filters like ‘foo%’, and a ~30% improvement on non-time groupBys. Apache Hive integration – supports full SQL, large joins, batch indexing. Apache Ambari integration – easy deployments and cluster management.
  27. Future work: improved schema definition & management; re-indexing/compaction without Hadoop; closer Hive/Druid integration; performance improvements (Select query performance, JIT-friendly topN queries); security enhancements (row/column level security, integration with Apache Ranger); work towards supporting joins.
  28. Summary: Scalability – horizontal scalability; columnar storage, indexing and compression; multi-tenancy. Real-time – ingestion latency < seconds; query latency < seconds. Arbitrary slicing and dicing of big data like a ninja – no more pre-canned drill-downs; query with more fine-grained granularity. High availability and rolling deployment capabilities. Less costly to run. Very active open source community.
  29. Storage Internals
  30. Druid: Segments Data in Druid is stored in Segment Files. Partitioned by time Ideally, segment files are each smaller than 1GB. If files are large, smaller time partitions are needed.
  31. Example Wikipedia Edit Dataset
  32. Data Rollup Rollup by hour
  33. Dictionary Encoding Create and store Ids for each value e.g. page column Values - Justin Bieber, Ke$ha, Selena Gomes Encoding - Justin Bieber : 0, Ke$ha: 1, Selena Gomes: 2 Column Data - [0 0 0 1 1 2] city column - [0 0 0 1 1 1]
  34. Bitmap Indices Store Bitmap Indices for each value Justin Bieber -> [0, 1, 2] -> [1 1 1 0 0 0] Ke$ha -> [3, 4] -> [0 0 0 1 1 0] Selena Gomes -> [5] -> [0 0 0 0 0 1] Queries Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0] language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1] Indexes compressed with Concise or Roaring encoding
  35. Data Rollup Rollup by hour