Benchmarking Apache Druid
July 16, 2020
Matt Sarrel (matt.sarrel@imply.io)
Developer Evangelist
Agenda:
1. Intro
2. Why Benchmark?
3. Star Schema Benchmark
4. What We Did
5. DIY Druid Benchmarking
Imply Overview
Founded by the creators of Apache Druid
Funded by Tier 1 investors
Trusted by innovative enterprises
Best-in-class revenue growth
41x ARR growth in 3 years
Leading contributor to Druid
Open core
Imply’s open engine, Druid, is becoming a standard part of modern data infrastructure.
Druid
● Next generation analytics engine
● Widely adopted
Workflow transformation
● Subsecond speed unlocks new workflows
● Self-service exploration of data patterns
● Make data fun again
Core Design
Druid combines ideas from three kinds of systems:
● Search platform: real-time ingestion, flexible schema, full text search
● Time series DB: storage optimized for time-based datasets, time-based functions
● OLAP: batch ingestion, efficient storage, fast analytic queries
Key features
● Column oriented
● High concurrency
● Scalable to 1000s of servers, millions of messages/sec
● Continuous, real-time ingest
● Query through SQL
● Target query latency sub-second to a few seconds
Druid in Data Pipeline
Raw data (clicks, ad impressions; network telemetry; application events) → Staging and processing (data lakes, message buses) → Analytics database (Druid) → End-user application
Druid Architecture
Pick your servers
Data Nodes
● Large-ish
● Scales with size of data and query volume
● Lots of cores, lots of memory, fast NVMe disk
Query Nodes
● Medium-ish
● Scales with concurrency and # of Data nodes
● Typically CPU bound
Master Nodes
● Small-ish
● Coordinator scales with # of segments
● Overlord scales with # of supervisors and tasks
Test Configs
Data Nodes: 3 × i3.2xlarge (8 CPU / 61GB RAM / 1.9TB NVMe SSD storage)
Query Nodes: 2 × m5d.large (2 CPU / 8GB RAM)
Master Nodes: 1 × m5.large (2 CPU / 8GB RAM)
Streaming Ingestion
● Kafka (supervisor type: kafka) — Druid reads directly from Apache Kafka. Can ingest late data: yes. Exactly-once guarantees: yes.
● Kinesis (supervisor type: kinesis) — Druid reads directly from Amazon Kinesis. Can ingest late data: yes. Exactly-once guarantees: yes.
● Tranquility (no supervisor) — Tranquility, a library that ships separately from Druid, is used to push data into Druid. Late data is dropped based on the windowPeriod config. No exactly-once guarantees.
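As a sketch of what the Kafka path looks like in practice, a minimal supervisor spec might resemble the following. This is a hedged outline, not the spec used in the benchmark: the datasource name, topic, and bootstrap server are placeholders, and an empty dimensions list asks Druid to discover dimensions from the data.

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": [] },
      "granularitySpec": { "segmentGranularity": "HOUR", "queryGranularity": "NONE" }
    },
    "ioConfig": {
      "topic": "events",
      "inputFormat": { "type": "json" },
      "consumerProperties": { "bootstrap.servers": "kafka01:9092" }
    }
  }
}
```

Submitting a spec like this to the Overlord starts a supervisor that manages the Kafka indexing tasks; see the Druid Kafka ingestion documentation for the full schema.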
Batch Ingestion
● Native batch (simple) — Not parallel; each task is single-threaded. Can append or overwrite (both). File formats: text (CSV, TSV, JSON). Rollup: perfect if forceGuaranteedRollup = true in the tuningConfig. Partitioning: hash-based, supported when forceGuaranteedRollup = true in the tuningConfig.
● Native batch (parallel) — Parallel if the firehose is splittable and maxNumConcurrentSubTasks > 1 in the tuningConfig (see the firehose documentation for details). Can append or overwrite (both). File formats: text (CSV, TSV, JSON). Rollup: perfect if forceGuaranteedRollup = true in the tuningConfig. Partitioning: hash-based (when forceGuaranteedRollup = true).
● Hadoop-based — Always parallel. Overwrite only. File formats: any Hadoop InputFormat. Rollup: always perfect. Partitioning: hash-based or range-based via partitionsSpec.
Is Druid Right For My Project?
Data Characteristics
● Timestamp dimension
● Streaming
● Denormalized
● Many attributes (30+ dimensions)
● High cardinality
Use Case Characteristics
● Large dataset
● Fast query response (<1s)
● Low latency data ingestion
● Interactive, ad-hoc queries
● Arbitrary slicing and dicing (OLAP)
● Query real-time & historical data
● Infrequent updates
Long Term Benchmark Plan
● Loosely follow the enterprise digital transformation journey
● Using widely accepted benchmarks, characterize query
performance on batched data
● Using widely accepted data sets and benchmarks, characterize
streaming data ingestion and query performance
● Fully characterize ingestion with respect to timing and storage
● Develop the Streaming OLAP Benchmark the world needs
Druid and Data Warehouses
● Druid is not a DW
● Druid augments DW to provide the following
○ consistent, sub-second SLA
○ pre-aggregation/metrics generation upon ingest
○ simple schema
○ high concurrency reads
● Hot and warm queries in Druid, cold queries in DW
● Druid for internal and external customers, powering real-time
visualization
● DW for internal customers
Confidential. Do not redistribute.
Realtime DW Solution Architecture
[Diagram: events from apps, storage, machines, and data centers (managed/unmanaged) flow through stream > parse > search > detect > correlate stages, feeding a custom dashboard; ETL and ML branches drive notify and control actions (block, permit, allow, prohibit, custom).]
Logical Test Architecture
Star Schema Benchmark
● Designed to evaluate database system performance of star
schema data warehouse queries
● Based on TPC-H
● Widely used since 2007
● Combines standard generated test data with 13 SQL queries
● https://www.cs.umb.edu/~poneil/StarSchemaB.PDF
Star Schema Benchmark Data Generation
● DBGEN utility generates:
● Fact table – lineorder.tbl
● Dimension tables – customer.tbl, part.tbl, supplier.tbl, date.tbl
● Scale Factor controls volume (SF=100 generates roughly 600 million lineorder rows, or about 100GB)
SSB ETL and Ingestion
● TBL files are tab delimited
● Generated on EBS, stored on S3
● Amazon Athena (Apache Hive) used to denormalize the 5 files into one
● Saved in ORC and Parquet formats for flexibility (ORC tested in Druid)
How data is structured
● Druid stores data in immutable segments
● Column-oriented compressed format
● Dictionary-encoded at column level
● Bitmap index compression: Concise and Roaring
○ Roaring is typically recommended; faster for boolean operations such
as filters
● Rollup (partial aggregation)
Optimize segment size
Ideally 300–700 MB (~5 million rows)
To control segment size
● Alter segment granularity
● Specify partition spec
● Use Automatic Compaction
Controlling Segment Size
● Segment Granularity - Increase if only 1 file per segment and <
200MB
"segmentGranularity": "HOUR"
● Max Rows Per Segment - Increase if a single segment is <
200MB
"maxRowsPerSegment": 5000000
Partitioning beyond time
● Druid always partitions by time
● Beyond that, decide which dimension to partition on
● Partition by a dimension you often filter on
● Improves locality, compression, storage size, and query performance
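Secondary partitioning on a frequently filtered dimension can be expressed with a single_dim partitionsSpec. A hedged fragment (the partition dimension shown is illustrative; the benchmark partitioned on d_yearmonth):

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "forceGuaranteedRollup": true,
    "partitionsSpec": {
      "type": "single_dim",
      "partitionDimension": "d_yearmonth",
      "targetRowsPerSegment": 5000000
    }
  }
}
```

Queries that filter on the partition dimension can then prune whole segments instead of scanning them.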
Ingestion (and the 5 million rows)
Run Rules
We ran JMeter against each platform’s HTTP API under the following conditions:
● Query cache off
● Each SSB query was run 10 times (10 samples per query)
● Each query flight consisted of all 13 SSB queries run in succession
● For each test, Average Response Time, Lowest Response Time, Highest Response Time, and Average Response Time Standard Deviation per query were calculated
● Each test was repeated five times
● The lowest and highest test results were discarded, a standard practice to remove outliers from performance testing results, leaving results from 3 test runs
● The remaining 3 results for each query were averaged to produce the final Average Response Time, Lowest Response Time, Highest Response Time, and Average Response Time Standard Deviation per query
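The per-query aggregation described above can be sketched in a few lines of Python. This is illustrative only, not the actual harness; sample counts are reduced for brevity.

```python
from statistics import mean, stdev

def summarize(test_runs):
    """test_runs: one list per test (5 tests), each holding the response
    times (ms) of the samples for a single query."""
    # Rank tests by their average response time, then discard the
    # lowest and highest test results to remove outliers.
    ranked = sorted(test_runs, key=mean)
    kept = ranked[1:-1]  # the remaining 3 test runs
    return {
        "avg": mean(mean(t) for t in kept),
        "low": mean(min(t) for t in kept),
        "high": mean(max(t) for t in kept),
        "stddev": mean(stdev(t) for t in kept),
    }
```

Called with five lists of per-sample timings for one query, it returns the four statistics reported per query.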
Star Schema Benchmark Queries
● Designed around classic DW use cases
● Select from table exactly once
● Restrictions on dimensions
● Druid supports native and SQL queries
13 Queries in Plain English
Query flight 1 has restrictions on 1 dimension and measures the revenue increase from eliminating ranges of discounts in given product order quantity intervals shipped in a given year.
Q1.1 has restrictions d_year = 1993, lo_quantity < 25, and lo_discount between 1 and 3.
Q1.2 changes the restrictions of Q1.1 to d_yearmonthnum = 199401, lo_quantity between 26 and 35, and lo_discount between 4 and 6.
Q1.3 changes the restrictions to d_weeknuminyear = 6 and d_year = 1994, lo_quantity between 36 and 40, and lo_discount between 5 and 7.
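For reference, Q1.1 in its original star-join form, roughly as given in the SSB paper (before our denormalization collapses the join away):

```sql
select sum(lo_extendedprice * lo_discount) as revenue
from lineorder, date
where lo_orderdate = d_datekey
  and d_year = 1993
  and lo_discount between 1 and 3
  and lo_quantity < 25;
```

The other twelve queries follow the same pattern: one fact-table scan with increasingly selective dimension restrictions.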
Query flight 2 has restrictions on 2 dimensions. The query compares revenues for certain product classes and suppliers in a certain region, grouped by more restrictive product classes and all years of orders.
Q2.1 has restrictions on p_category and s_region.
Q2.2 changes the restrictions of Q2.1 to p_brand1 between 'MFGR#2221' and 'MFGR#2228' and s_region to 'ASIA'.
Q2.3 changes the restrictions to p_brand1 = 'MFGR#2339' and s_region = 'EUROPE'.
Query flight 3 has restrictions on 3 dimensions. The query retrieves total revenue for lineorder transactions within a given region in a certain time period, grouped by customer nation, supplier nation, and year.
Q3.1 has restrictions c_region = 'ASIA' and s_region = 'ASIA', restricts d_year to a 6-year period, and groups by c_nation, s_nation, and d_year.
Q3.2 changes the region restrictions to c_nation = 'UNITED STATES' and s_nation = 'UNITED STATES', grouping revenue by customer city, supplier city, and year.
Q3.3 changes the restrictions on c_city and s_city to two cities in 'UNITED KINGDOM' and retrieves revenue grouped by c_city, s_city, and d_year.
Q3.4 changes the date restriction to a single month. After partitioning the 12 billion row dataset on d_yearmonth, we needed to rewrite the query for d_yearmonthnum.
Query flight 4 provides a "what-if" sequence of queries that might be generated in an OLAP-style exploration. Starting with a query with rather weak constraints on three dimension columns, we retrieve aggregate profit, sum(lo_revenue - lo_supplycost), grouped by d_year and c_nation. Successive queries modify predicate constraints by drilling down to find the source of an anomaly.
Q4.1 restricts c_region and s_region both to 'AMERICA', and p_mfgr to one of two possibilities.
Q4.2 follows a typical workflow to dig deeper into the results. We pivot away from grouping by s_nation, restrict d_year to 1997 and 1998, and drill down to group by p_category to see where the profit change arises.
Q4.3 digs deeper still, restricting s_nation to 'UNITED STATES' and p_category = 'MFGR#14', drilling down to group by s_city (in the USA) and p_brand1 (within p_category 'MFGR#14').
Query Optimization
● Date! Date! Date! The biggest impacts in optimization came from aligning dates as ingested with anticipated queries.
● Optimize SQL expressions
● Vectorize
Query Optimization Stage: Query 4.3

SSB (Original):
select d_year, s_city, p_brand1, sum(lo_revenue - lo_supplycost) as profit
from denormalized
where s_nation = 'UNITED STATES' and (d_year = 1997 or d_year = 1998) and p_category = 'MFGR#14'
group by d_year, s_city, p_brand1
order by d_year, s_city, p_brand1

Apache Druid:
select d_year, s_nation, p_category, sum(lo_revenue) - sum(lo_supplycost) as profit
from ${jmDataSource}
where c_region = 'AMERICA' and s_region = 'AMERICA'
and (FLOOR("__time" to YEAR) = TIME_PARSE('1997-01-01T00:00:00.000Z') or FLOOR("__time" to YEAR) = TIME_PARSE('1998-01-01T00:00:00.000Z'))
and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
group by d_year, s_nation, p_category
order by d_year, s_nation, p_category
Explain Plan
EXPLAIN PLAN FOR
SELECT d_year, s_city, p_brand1, sum(lo_revenue - lo_supplycost) as profit
FROM ssb_data
WHERE s_nation = 'UNITED STATES' and (d_year = 1997 or d_year = 1998) and p_category = 'MFGR#14'
GROUP BY d_year, s_city, p_brand1
ORDER BY d_year, s_city, p_brand1
JMeter Config
JMeter Queries
Apache Druid SSB Results
Now Go Do It Yourself!
● Spec out your test project thoroughly
● Representative Data
● Representative Queries
● Install a small cluster (Quickstart)
● Ingest and tune
● Query via console for functional testing
● Install JMeter (on the query server and locally)
● Run queries against the HTTP API (no GUI, query server)
● Change, rerun, measure differences and learn
● The best way to learn is to just do it!
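For the "query via the HTTP API" step: Druid brokers and routers expose a SQL endpoint at /druid/v2/sql that accepts a JSON body containing the query. A minimal helper, with the URL and datasource as placeholders for your cluster:

```python
import json
from urllib import request

def druid_sql_request(broker_url, sql):
    """Build a POST request for Druid's SQL endpoint (/druid/v2/sql)."""
    payload = json.dumps({"query": sql}).encode("utf-8")
    return request.Request(
        url=broker_url.rstrip("/") + "/druid/v2/sql",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With a running cluster, request.urlopen(req) returns the result rows as JSON:
req = druid_sql_request("http://localhost:8888", "SELECT COUNT(*) FROM ssb_data")
```

This mirrors what the JMeter test plan does: POST the SQL string to the HTTP API and time the response.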
Resources
● druid.apache.org
● druid.apache.org/community
● ASF #druid Slack channel
● jmeter.apache.org
● https://www.cs.umb.edu/~poneil/StarSchemaB.PDF
● https://github.com/lemire/StarSchemaBenchmark
● https://github.com/implydata/benchmark-tools

 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 

Benchmarking Apache Druid

  • 1. Benchmarking Apache Druid July 16, 2020 1 Matt Sarrel (matt.sarrel@imply.io) Developer Evangelist
  • 2. 2 Agenda: 1. Intro 2. Why Benchmark? 3. Star Schema Benchmark 4. What We Did 5. DIY Druid Benchmarking
  • 3. Imply Overview 3 Founded by the creators of Apache Druid Funded by Tier 1 investors Trusted by innovative enterprises Best-in-class revenue growth 41x ARR growth in 3 years Leading contributor to Druid
  • 4. Open core Imply’s open engine, Druid, is becoming a standard part of modern data infrastructure. Druid ● Next generation analytics engine ● Widely adopted Workflow transformation ● Subsecond speed unlocks new workflows ● Self-service explanations of data patterns ● Make data fun again 4
  • 5. Core Design ● Real-time ingestion ● Flexible schema ● Full text search ● Batch ingestion ● Efficient storage ● Fast analytic queries ● Optimized storage for time-based datasets ● Time-based functions SEARCH PLATFORM TIME SERIES DB OLAP
  • 6. Key features ● Column oriented ● High concurrency ● Scalable to 1000s of servers, millions of messages/sec ● Continuous, real-time ingest ● Query through SQL ● Target query latency sub-second to a few seconds 6
  • 7. Druid in Data Pipeline Data lakes Message buses Raw data Staging (and Processing) Analytics Database End User Application clicks, ad impressions network telemetry application events
  • 9. Pick your servers
    Data Nodes ● Large-ish ● Scales with size of data and query volume ● Lots of cores, lots of memory, fast NVMe disk
    Query Nodes ● Medium-ish ● Scales with concurrency and # of Data nodes ● Typically CPU bound
    Master Nodes ● Small-ish nodes ● Coordinator scales with # of segments ● Overlord scales with # of supervisors and tasks
  • 10. Test Configs
    Data Nodes: 3 × i3.2xlarge (8 CPU / 61 GB RAM / 1.9 TB NVMe SSD storage)
    Query Nodes: 2 × m5d.large (2 CPU / 8 GB RAM)
    Master Nodes: 1 × m5.large (2 CPU / 8 GB RAM)
  • 11. Streaming Ingestion
    Method: Kafka | Kinesis | Tranquility
    Supervisor type: kafka | kinesis | N/A
    How it works: Druid reads directly from Apache Kafka. | Druid reads directly from Amazon Kinesis. | Tranquility, a library that ships separately from Druid, is used to push data into Druid.
    Can ingest late data? Yes | Yes | No (late data is dropped based on the windowPeriod config)
    Exactly-once guarantees? Yes | Yes | No
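For reference, a trimmed Kafka supervisor spec of the kind the comparison above describes. This is an illustrative sketch, not the spec used in the benchmark; the datasource name, topic, and broker address are placeholders, and the dimension list is omitted for brevity:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "ssb_stream",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": [] },
      "granularitySpec": { "segmentGranularity": "HOUR", "queryGranularity": "MINUTE" }
    },
    "ioConfig": {
      "topic": "ssb-events",
      "inputFormat": { "type": "json" },
      "consumerProperties": { "bootstrap.servers": "kafka-broker:9092" }
    }
  }
}
```

Submitting this spec to the Overlord starts a supervisor that manages Kafka indexing tasks for the topic.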
  • 12. Batch Ingestion
    Method: Native batch (simple) | Native batch (parallel) | Hadoop-based
    Parallel? No; each task is single-threaded. | Yes, if the firehose is splittable and maxNumConcurrentSubTasks > 1 in the tuningConfig (see the firehose documentation for details). | Yes, always.
    Can append or overwrite? Yes, both. | Yes, both. | Overwrite only.
    File formats: Text file formats (CSV, TSV, JSON). | Text file formats (CSV, TSV, JSON). | Any Hadoop InputFormat.
    Rollup modes: Perfect if forceGuaranteedRollup = true in the tuningConfig. | Perfect if forceGuaranteedRollup = true in the tuningConfig. | Always perfect.
    Partitioning options: Hash-based partitioning when forceGuaranteedRollup = true in the tuningConfig. | Hash-based partitioning (when forceGuaranteedRollup = true). | Hash-based or range-based partitioning via partitionsSpec.
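A minimal native batch (parallel) spec illustrating the options above, with maxNumConcurrentSubTasks and forceGuaranteedRollup set in the tuningConfig. This is an assumed sketch, not the benchmark's actual spec; the datasource name, timestamp column, and file paths are placeholders:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "ssb_batch",
      "timestampSpec": { "column": "lo_orderdate", "format": "auto" },
      "dimensionsSpec": { "dimensions": [] },
      "granularitySpec": { "segmentGranularity": "MONTH", "rollup": true }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "local", "baseDir": "/data/ssb", "filter": "*.csv" },
      "inputFormat": { "type": "csv", "findColumnsFromHeader": true }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 4,
      "forceGuaranteedRollup": false
    }
  }
}
```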
  • 13. Is Druid Right For My Project? ● Timestamp dimension ● Streaming ● Denormalized ● Many attributes (30+ dimensions) ● High cardinality Data Characteristics ● Large dataset ● Fast query response (<1s) ● Low latency data ingestion ● Interactive, ad-hoc queries ● Arbitrary slicing and dicing (OLAP) ● Query real-time & historical data ● Infrequent updates Use Case Characteristics
  • 14. Long Term Benchmark Plan ● Loosely follow the enterprise digital transformation journey ● Using widely accepted benchmarks, characterize query performance on batched data ● Using widely accepted data sets and benchmarks, characterize streaming data ingestion and query performance ● Fully characterize ingestion with respect to timing and storage ● Develop the Streaming OLAP Benchmark the world needs
  • 15. Druid and Data Warehouses ● Druid is not a DW ● Druid augments DW to provide the following ○ consistent, sub-second SLA ○ pre-aggregation/metrics generation upon ingest ○ simple schema ○ high concurrency reads ● Hot and warm queries in Druid, cold queries in DW ● Druid for internal and external customer powering realtime visualization ● DW for internal customer
  • 16. Confidential. Do not redistribute. Realtime DW Solution Architecture 16 Apps Storage Machines Events Stream > Parse > Search > Detect > Correlate Custom Dashboard Notify ETL ML Block Control Permit Allow Prohibit Custom Data centers managed/unmanaged
  • 17. Logical Test Architecture
  • 18. Star Schema Benchmark ● Designed to evaluate database system performance of star schema data warehouse queries ● Based on TPC-H ● Widely used since 2007 ● Combines standard generated test data with 13 SQL queries ● https://www.cs.umb.edu/~poneil/StarSchemaB.PDF
  • 19. Star Schema Benchmark Data Generation ● DBGEN utility ● Generates: ● Fact table – lineorder.tbl ● Dimension tables – customer.tbl, part.tbl, supplier.tbl, date.tbl ● Scale Factor (SF=100) to generate 600 million rows, or roughly 100GB
  • 20. SSB ETL and Ingestion ● TBL files are tab-delimited ● Generate on EBS, store on S3 ● Amazon Athena (Apache Hive) used to denormalize 5 files into one ● Saved in ORC and Parquet formats for flexibility (ORC tested in Druid)
  • 21. How data is structured ● Druid stores data in immutable segments ● Column-oriented compressed format ● Dictionary-encoded at column level ● Bitmap index compression: Concise & Roaring ○ Roaring is typically recommended; it is faster for boolean operations such as filters ● Rollup (partial aggregation)
  • 22. Optimize segment size Ideally 300-700 MB (~5 million rows) To control segment size: ● Alter segment granularity ● Specify a partition spec ● Use Automatic Compaction
  • 23. Controlling Segment Size ● Segment Granularity - Increase if only 1 file per segment and < 200MB "segmentGranularity": "HOUR" ● Max Rows Per Segment - Increase if a single segment is < 200MB "maxRowsPerSegment": 5000000
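Putting the two settings in context: segmentGranularity lives in the dataSchema's granularitySpec, while maxRowsPerSegment lives in the tuningConfig. A skeleton ingestion-spec fragment (other required fields omitted):

```json
{
  "dataSchema": {
    "granularitySpec": { "segmentGranularity": "HOUR" }
  },
  "tuningConfig": {
    "maxRowsPerSegment": 5000000
  }
}
```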
  • 24. Partitioning beyond time ● Druid always partitions by time ● Decide which dimension to partition on… next ● Partition by some dimension you often filter on ● Improves locality, compression, storage size, query performance
  • 25. Ingestion (and the 5 million rows)
  • 26. Run Rules We ran JMeter against each platform's HTTP API under the following conditions:
    ● Query cache off
    ● Each SSB query was run 10 times (10 samples per query)
    ● Each query flight consisted of all 13 SSB queries run in succession
    ● For each test, Average Response Time, Lowest Response Time, Highest Response Time, and Average Response Time Standard Deviation were calculated per query
    ● Each test was repeated five times
    ● The lowest and highest test results were discarded (a standard practice to remove outliers from performance testing), leaving results from 3 test runs
    ● The remaining 3 results for each query were averaged to produce the final Average Response Time, Lowest Response Time, Highest Response Time, and Average Response Time Standard Deviation per query
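The outlier-trimming step of the run rules can be sketched as follows. This is a minimal illustration of the methodology; the function name and the sample numbers are ours, not from the benchmark harness:

```python
import statistics

def summarize_runs(run_averages):
    """Average one query's response time across test runs.

    Mirrors the run rules: sort the per-run averages, discard the
    lowest and highest run to remove outliers, and average the rest.
    """
    if len(run_averages) < 3:
        raise ValueError("need at least 3 runs to discard min and max")
    trimmed = sorted(run_averages)[1:-1]
    return statistics.mean(trimmed)

# Five test runs for one query (ms); the 95.0 and 300.0 outliers are dropped,
# leaving the mean of 99.0, 101.0, and 120.0.
print(summarize_runs([120.0, 95.0, 101.0, 99.0, 300.0]))
```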
  • 27. Star Schema Benchmark Queries ● Designed to be around classic DW use cases ● Select from table exactly once ● Restrictions on dimensions ● Druid supports native and SQL queries
  • 28. 13 Queries in Plain English
    Query flight 1 has restrictions on 1 dimension and measures the revenue increase from eliminating ranges of discounts in given product order quantity intervals shipped in a given year. Q1.1 restricts d_year = 1993, lo_quantity < 25, and lo_discount between 1 and 3. Q1.2 changes the restrictions to d_yearmonthnum = 199401, lo_quantity between 26 and 35, and lo_discount between 4 and 6. Q1.3 changes them to d_weeknuminyear = 6, d_year = 1994, lo_quantity between 36 and 40, and lo_discount between 5 and 7.
    Query flight 2 has restrictions on 2 dimensions. The query compares revenues for certain product classes and suppliers in a certain region, grouped by more restrictive product classes and all years of orders. Q2.1 restricts p_category and s_region. Q2.2 changes the restrictions to p_brand1 between 'MFGR#2221' and 'MFGR#2228' and s_region = 'ASIA'. Q2.3 changes them to p_brand1 = 'MFGR#2339' and s_region = 'EUROPE'.
    Query flight 3 has restrictions on 3 dimensions. The query retrieves total revenue for lineorder transactions within a given region in a certain time period, grouped by customer nation, supplier nation, and year. Q3.1 restricts c_region = 'ASIA' and s_region = 'ASIA', restricts d_year to a 6-year period, and groups by c_nation, s_nation, and d_year. Q3.2 changes the region restrictions to c_nation = 'UNITED STATES' and s_nation = 'UNITED STATES', grouping revenue by customer city, supplier city, and year. Q3.3 restricts c_city and s_city to two cities in 'UNITED KINGDOM' and retrieves revenue grouped by c_city, s_city, and d_year. Q3.4 changes the date restriction to a single month. After partitioning the 12 billion row dataset on d_yearmonth, we needed to rewrite the query for d_yearmonthnum.
    Query flight 4 provides a "what-if" sequence of queries that might be generated in an OLAP-style exploration. Starting with a query with rather weak constraints on three dimension columns, we retrieve aggregate profit, sum(lo_revenue - lo_supplycost), grouped by d_year and c_nation. Successive queries modify predicate constraints by drilling down to find the source of an anomaly. Q4.1 restricts c_region and s_region both to 'AMERICA', and p_mfgr to one of two possibilities. Q4.2 follows a typical workflow to dig deeper into the results: we pivot away from grouping by s_nation, restrict d_year to 1997 and 1998, and drill down to group by p_category to see where the profit change arises. Q4.3 digs deeper still, restricting s_nation to 'UNITED STATES' and p_category = 'MFGR#14', drilling down to group by s_city (in the USA) and p_brand1 (within p_category 'MFGR#14').
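As a concrete example, flight 1's first query against the denormalized Druid datasource might look like the following. The table name ssb_data is an assumption; the original SSB text joins lineorder to the date dimension, which the denormalization step removes:

```sql
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM ssb_data
WHERE d_year = 1993
  AND lo_discount BETWEEN 1 AND 3
  AND lo_quantity < 25
```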
  • 29. Query Optimization ● Date! Date! Date! The biggest impacts in optimization came from aligning dates as ingested with anticipated queries. ● Optimize SQL expressions ● Vectorize
    Query Optimization Stage: Query 4.3
    SSB (Original):
      select d_year, s_city, p_brand1, sum(lo_revenue - lo_supplycost) as profit
      from denormalized
      where s_nation = 'UNITED STATES'
        and (d_year = 1997 or d_year = 1998)
        and p_category = 'MFGR#14'
      group by d_year, s_city, p_brand1
      order by d_year, s_city, p_brand1
    Apache Druid:
      select d_year, s_nation, p_category, sum(lo_revenue) - sum(lo_supplycost) as profit
      from ${jmDataSource}
      where c_region = 'AMERICA'
        and s_region = 'AMERICA'
        and (FLOOR("__time" to YEAR) = TIME_PARSE('1997-01-01T00:00:00.000Z')
          or FLOOR("__time" to YEAR) = TIME_PARSE('1998-01-01T00:00:00.000Z'))
        and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
      group by d_year, s_nation, p_category
      order by d_year, s_nation, p_category
  • 30. Explain Plan
    EXPLAIN PLAN FOR
    SELECT d_year, s_city, p_brand1, sum(lo_revenue - lo_supplycost) as profit
    FROM ssb_data
    WHERE s_nation = 'UNITED STATES'
      and (d_year = 1997 or d_year = 1998)
      and p_category = 'MFGR#14'
    GROUP BY d_year, s_city, p_brand1
    ORDER BY d_year, s_city, p_brand1
  • 33. Apache Druid SSB Results
  • 34. Now Go Do It Yourself! ● Spec out your test project thoroughly ● Representative data ● Representative queries ● Install a small cluster (Quickstart) ● Ingest and tune ● Query via the console for functional testing ● Install JMeter (on the query server and locally) ● Run queries against the HTTP API (no GUI, query server) ● Change, rerun, measure differences, and learn ● The best way to learn is to just do it!
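Querying Druid's HTTP API directly, as the checklist suggests, can be sketched as below. The /druid/v2/sql endpoint path, the JSON request body shape, and the useCache/populateCache context flags are standard Druid; the host, port, and datasource name are placeholders for your cluster:

```python
import json
import urllib.request

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"  # Router's default port; adjust for your cluster

def build_sql_request(query, context=None):
    """Build the JSON body Druid's SQL endpoint expects.

    Setting useCache/populateCache to false in the query context mirrors
    the cache-off run rule used in the benchmark.
    """
    body = {"query": query}
    if context:
        body["context"] = context
    return json.dumps(body).encode("utf-8")

def run_query(query):
    """POST a SQL query to Druid and return the parsed JSON result rows."""
    req = urllib.request.Request(
        DRUID_SQL_URL,
        data=build_sql_request(query, {"useCache": False, "populateCache": False}),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Against a running cluster: run_query("SELECT COUNT(*) AS cnt FROM ssb_data")
print(build_sql_request("SELECT 1").decode("utf-8"))
```

JMeter's HTTP Request sampler sends the same POST body, which is how the benchmark drove concurrent query flights.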
  • 35. Resources ● Druid.apache.org ● Druid.apache.org/community ● ASF #druid Slack channel ● Jmeter.apache.org ● https://www.cs.umb.edu/~poneil/StarSchemaB.PDF ● https://github.com/lemire/StarSchemaBenchmark ● https://github.com/implydata/benchmark-tools