SlideShare ist ein Scribd-Unternehmen logo
1 von 93
Datadog: A Real-Time Metrics
Database for Trillions of
Points/Day
Ian NOWLAND (https://twitter.com/inowland)
VP, Metrics and Monitors
Joel BARCIAUSKAS (https://twitter.com/JoelBarciauskas)
Director, Aggregation Metrics
QCon NYC ‘19
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
datadog-metrics-db/
Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Some of Our Customers
2
Some of What We Store
3
Changing Source Lifecycle
4
Months/years Seconds
Datacenter Cloud/VM Containers
Changing Data Volume
5
100’s
10,000’s
System
Application
Per User Device
SLIs
Applying Performance Mantras
• Don't do it
• Do it, but don't do it again
• Do it less
• Do it later
• Do it when they're not looking
• Do it concurrently
• Do it cheaper
*From Craig Hanson and Pat Crain, and the performance engineering
community - see http://www.brendangregg.com/methodology.html
6
Talk Plan
1. What Are Metrics Databases?
2. Our Architecture
3. Deep Dive On Our Datastores
4. Handling Synchronization
5. Introducing Aggregation
6. Aggregation For Deeper Insights Using Sketches
7. Sketches Enabling Flexible Architecture
7
Talk Plan
1. What Are Metrics Databases?
2. Our Architecture
3. Deep Dive On Our Datastores
4. Handling Synchronization
5. Introducing Aggregation
6. Aggregation For Deeper Insights Using Sketches
7. Sketches Enabling Flexible Architecture
8
Example Metrics Query 1
“What is the system load on instance i-xyz across the last 30 minutes”
9
A Time Series
metric system.load.1
timestamp 1526382440
value 0.92
tags host:i-xyz,env:dev,...
10
Example Metrics Query 2
“Alert when the system load, averaged across our fleet in us-east-1a for a 5
minute interval, goes above 90%”
11
Example Metrics Query 2
“Alert when the system load, averaged across my fleet in us-east-1a for a 5
minute interval, goes above 90%”
12
Aggregate DimensionTake Action
Metrics Name and Tags
Name: single string dening what you are measuring, e.g.
system.cpu.user
aws.elb.latency
dd.frontend.internal.ajax.queue.length.total
Tags: list of k:v strings, used to qualify metric and add dimensions to lter/aggregate over, e.g.
['host:server-1', 'availability-zone:us-east-1a', 'kernel_version:4.4.0']
['host:server-2', 'availability-zone:us-east-1a', 'kernel_version:2.6.32']
['host:server-3', 'availability-zone:us-east-1b', 'kernel_version:2.6.32']
13
Tags for all the dimensions
Host / container: system metrics by host
Application: internal cache hit rates, timers by module
Service: hits, latencies or errors/s by path and/or response code
Business: # of orders processed, $'s per second by customer ID
14
Talk Plan
1. What Are Metrics Databases?
2. Our Architecture
3. Deep Dive On Our Datastores
4. Handling Synchronization
5. Introducing Aggregation
6. Aggregation For Deeper Insights Using Sketches
7. Sketches Enabling Flexible Architecture
15
Pipeline Architecture
16
Customer
Browser
IntakeMetrics sources
Query
System
Web frontend &
APIs
Customer
Monitors and
Alerts
Slack/Email/
PagerDuty etc
Data Stores
Data Stores
Data Stores
Performance mantras
• Don't do it
• Do it, but don't do it again
• Do it less
• Do it later
• Do it when they're not looking
• Do it concurrently
• Do it cheaper
17
Performance mantras
• Don't do it
• Do it, but don't do it again - query caching
• Do it less
• Do it later
• Do it when they're not looking
• Do it concurrently
• Do it cheaper
18
Pipeline Architecture
19
Customer
Browser
IntakeMetrics sources
Query
System
Web frontend &
APIs
Customer
Monitors and
Alerts
Slack/Email/
PagerDuty etc
Data Stores
Data Stores
Data Stores
Query
Cache
Pipeline Architecture
20
Customer
Browser
IntakeMetrics sources
Query
System
Web frontend &
APIs
Customer
Monitors and
Alerts
Slack/Email/
PagerDuty etc
Data Stores
Data Stores
Data Stores
Query
Cache
Metrics Store Characteristics
• Most metrics report with a tag set for quite some time
=> Therefore separate tag stores from time series stores
21
Pipeline Architecture
22
Customer
Browser
IntakeMetrics sources
Query
System
Web frontend &
APIs
Customer
Monitors and
Alerts
Slack/Email/
PagerDuty etc
Data Stores
Data Stores
Data Stores
Query
Cache
Kafka for Independent Storage Systems
Intake
Incoming
Data
Kafka Points
Store 1
Store 2
Kafka
Tag Sets
Tag Index
Tag
Describer
S3
S3 Writer
Query
System
Outgoing
Data
Performance mantras
• Don't do it
• Do it, but don't do it again - query caching
• Do it less
• Do it later - minimize processing on path to persistence
• Do it when they're not looking
• Do it concurrently
• Do it cheaper
24
Kafka for Independent Storage Systems
Intake
Incoming
Data
Kafka Points
Store 1
Store 2
Kafka
Tag Sets
Tag Index
Tag
Describer
S3
S3 Writer
Query
System
Outgoing
Data
Scaling through Kafka
Data is separated by partition to distribute it
Partitions are customers, or a mod hash of their metric name
This also gives us isolation.
Intake
Kafka partition:1Incoming
Data
Kafka partition:2
Kafka partition:0
Store 1
Kafka partition:3
Store 2
Store 2
Store 2
Store 1
Performance mantras
• Don't do it
• Do it, but don't do it again - query caching
• Do it less
• Do it later - minimize processing on path to persistence
• Do it when they're not looking
• Do it concurrently - use independent horizontally scalable data
stores
• Do it cheaper
27
Talk Plan
1. What Are Metrics Databases?
2. Our Architecture
3. Deep Dive On Our Datastores
4. Handling Synchronization
5. Introducing Aggregation
6. Aggregation For Deeper Insights Using Sketches
7. Sketches Enabling Flexible Architecture
28
Per Customer Volume Ballparking
29
104
Number of apps; 1,000’s hosts times 10’s containers
103
Number of metrics emitted from each app/container
100
1 point a second per metric
105
Seconds in a day (actually 86,400)
101
Bytes/point (8 byte float, amortized tags)
= 1013
10 Terabytes a Day For One Average Customer
Volume Math
• $210 to store 10 TB in S3 for a month
• $60,000 for a month rolling queryable (300 TB)
• But S3 is not for real time, high throughput queries
30
Cloud Storage Characteristics
31
Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility
DRAM1
4 TB 80 GB/s 0.08 us $1,000 Instance Reboot
SSD2
60 TB 12 GB/s 1 us $60 Instance Failures
EBS io1 432 TB 12 GB/s 40 us $400 Data Center Failures
S3 Infinite 12 GB/s3
100+ ms $214
11 nines durability
Glacier Infinite 12 GB/s3
hours $44
11 nines durability
1. X1e.32xlarge, 3 year non convertible, no upfront reserved instance
2. i3en.24xlarge, 3 year non convertible, no upfront reserved instance
3. Assumes can highly parallelize to load network card of 100Gbps instance type. Likely does not scale out.
4. Storage Cost only
Volume Math
• 80 x1e.32xlarge DRAM for a month
• $300,000 to store for a month
• This is with no indexes or overhead
• And people want to query much more than a month.
32
Performance mantras
• Don't do it
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper
33
Returning to an Example Query
“Alert when the system load, averaged across our fleet in us-east-1a for a 5
minute interval, goes above 90%”
34
Queries We Need to Support
35
DESCRIBE TAGS What tags are queryable for this metric?
TAG INDEX Given a time series id, what tags were used?
TAG INVERTED
INDEX
Given some tags and a time range, what were
the time series ingested?
POINT STORE What are the values of a time series between
two times?
Performance mantras
• Don't do it
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper
36
Performance mantras
• Don't do it
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper - use hybrid data storage types and technologies
37
Cloud Storage Characteristics
38
Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility
DRAM1
4 TB 80 GB/s 0.08 us $1,000 Instance Reboot
SSD2
60 TB 12 GB/s 1 us $60 Instance Failures
EBS io1 432 TB 12 GB/s 40 us $400 Data Center Failures
S3 Infinite 12 GB/s3
100+ ms $214
11 nines durability
Glacier Infinite 12 GB/s3
hours $44
11 nines durability
1. X1e.32xlarge, 3 year non convertible, no upfront reserved instance
2. i3en.24xlarge, 3 year non convertible, no upfront reserved instance
3. Assumes can highly parallelize to load network card of 100Gbps instance type. Likely does not scale out.
4. Storage Cost only
Cloud Storage Characteristics
39
Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility
DRAM1
4 TB 80 GB/s 0.08 us $1,000 Instance Reboot
SSD2
60 TB 12 GB/s 1 us $60 Instance Failures
EBS io1 432 TB 12 GB/s 40 us $400 Data Center Failures
S3 Infinite 12 GB/s3
100+ ms $214
11 nines durability
Glacier Infinite 12 GB/s3
hours $44
11 nines durability
1. X1e.32xlarge, 3 year non convertible, no upfront reserved instance
2. i3en.24xlarge, 3 year non convertible, no upfront reserved instance
3. Assumes can highly parallelize to load network card of 100Gbps instance type. Likely does not scale out.
4. Storage Cost only
Hybrid Data Storage Types
40
System
DESCRIBE TAGS
TAG INDEX
TAG INVERTED INDEX
POINT STORE
QUERY RESULTS
Hybrid Data Storage Types
41
System Type Persistence
DESCRIBE TAGS Local SSD Years
TAG INDEX DRAM Cache (Hours)
Local SSD Years
TAG INVERTED INDEX DRAM Hours
On SSD Days
S3 Years
POINT STORE DRAM Hours
Local SSD Days
S3 Years
QUERY RESULTS DRAM Cache (Days)
Hybrid Data Storage Technologies
42
System Type Persistence Technology Why?
DESCRIBE TAGS Local SSD Years LevelDB High performing single node k,v
TAG INDEX DRAM Cache (Hours) Redis Very high performance, in memory k,v
Local SSD Years Cassandra Horizontal scaling, persistent k,v
TAG INVERTED INDEX DRAM Hours In house Very customized index data structures
On SSD Days RocksDB + SQLite Rich and flexible queries
S3 Years Parquet Flexible Schema over time
POINT STORE DRAM Hours In house Very customized index data structures
Local SSD Days In house Very customized index data structures
S3 Years Parquet Flexible Schema over time
QUERY RESULTS DRAM Cache (Days) Redis Very high performance, in memory k,v
Talk Plan
1. What Are Metrics Databases?
2. Our Architecture
3. Deep Dive On Our Datastores
4. Handling Synchronization
5. Introducing Aggregation
6. Aggregation For Deeper Insights Using Sketches
7. Sketches Enabling Flexible Architecture
43
Alerts/Monitors Synchronization
• Level sensitive
• False positives is almost as important as false negative
• Small delay preferable to evaluating incomplete data
• Synchronization need is to be sure evaluation bucket is filled
before processing
44
Pipeline Architecture
45
Customer
Browser
IntakeMetrics sources
Query
System
Web frontend &
APIs
Customer
Monitors and
Alerts
Slack/Email/
PagerDuty etc
Data Stores
Data Stores
Data Stores
Query
Cache
Inject
heartbeat here
Pipeline Architecture
46
Customer
Browser
IntakeMetrics sources
Query
System
Web frontend &
APIs
Customer
Monitors and
Alerts
Slack/Email/
PagerDuty etc
Data Stores
Data Stores
Data Stores
Query
Cache
Inject
heartbeat here
And test it gets to
here
Heartbeats for Synchronization
Semantics:
- 1 second tick time for metrics
- Last write wins to handle agent concurrency
- Inject fake data as heartbeat through pipeline
Then:
- Monitor evaluator ensure heartbeat gets through before evaluating next period
Challenges:
- With sharding and multiple stores, lots of independent paths to make sure
heartbeats go through
47
Performance mantras
• Don't do it - build the minimal synchronization needed
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper - use hybrid data storage types and technologies
48
Talk Plan
1. What Are Metrics Databases?
2. Our Architecture
3. Deep Dive On Our Datastores
4. Handling Synchronization
5. Introducing Aggregation
6. Aggregation For Deeper Insights Using Sketches
7. Sketches Enabling Flexible Architecture
49
Types of metrics
50
Counter, aggregate by sum Gauges, aggregate by last or avg
Ex: Requests, errors/s, total
time spent (stopwatch)
Ex: CPU/network/disk use,
queue length
Aggregation
51
{0, 1, 0, 1, 0, 1, 0, 1, 0, 1}
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
{5, 5, 5, 5, 5, 5, 5, 5, 5, 5}
{0, 2, 4, 8, 16, 32, 64, 128, 256, 512}
Time
S
p
ac
e
t0
t1
t2
t3
t4
t5
t6
t7
t8
t9
Query output
Counters: {5, 40, 50, 1023}
Gauges (average): {0.5, 4, 5, 102.3}
Gauges (last): {1, 9, 5, 512}
Query characteristics
52
User:
• Bursty and unpredictable
• Latency Sensitive - ideal end user response is 100ms, 1s at most.
• Skews to recent data, but want same latency on old data
Query characteristics
53
Dashboards:
• Predictable
• Important enough to save
• Looking for step-function changes, e.g. performance regressions,
changes in usage
Focus on outputs
54
These graphs are both aggregating 70k series
Not a lot, but still output 10x to 2000x less than input!
Performance mantras
• Don't do it - build the minimal synchronization needed
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking?
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper - use hybrid data storage types and technologies
55
Pipeline Architecture
56
Customer
Browser
IntakeMetrics sources
Query
System
Web frontend &
APIs
Customer
Monitors and
Alerts
Slack/Email/
PagerDuty etc
Data Stores
Data Stores
Data Stores
Query
Cache
Aggregation
Points
Pipeline Architecture
57
Customer
Browser
IntakeMetrics sources
Query
System
Web frontend &
APIs
Customer
Monitors and
Alerts
Slack/Email/
PagerDuty etc
Data Stores
Data Stores
Data Stores
Query
Cache
Aggregation
Points
Streaming
Aggregator
Pipeline Architecture
58
Customer
Browser
IntakeMetrics sources
Query
System
Web frontend &
APIs
Customer
Monitors and
Alerts
Slack/Email/
PagerDuty etc
Data Stores
Data Stores
Data Stores
Query
Cache
Aggregation
Points
No one's looking here!
Streaming
Aggregator
Performance mantras
• Don't do it - build the minimal synchronization needed
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking - pre-aggregate
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper - use hybrid data storage types and technologies
59
Talk Plan
1. What Are Metrics Databases?
2. Our Architecture
3. Deep Dive On Our Datastores
4. Handling Synchronization
5. Introducing Aggregation
6. Aggregation For Deeper Insights Using Sketches
7. Sketches Enabling Flexible Architecture
60
Distributions
61
Aggregate by percentile or SLO
(count of values above or below a threshold)
Ex: Latency, request size
Calculating distributions
62
{0, 1, 0, 1, 0, 1, 0, 1, 0, 1}
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
{5, 5, 5, 5, 5, 5, 5, 5, 5, 5}
{0, 2, 4, 8, 16, 32, 64, 128, 256, 512}
Time
S
p
ac
e
t0
t1
t2
t3
t4
t5
t6
t7
t8
t9
{0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 2, 2, 3, 4, 4, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 6, 7, 8,
8, 9, 16, 32, 64, 128, 256,
512}
p90
p50
Performance mantras
• Don't do it - build the minimal synchronization needed
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking - pre-aggregate
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper again?
63
What are "sketches"?
64
Data structures designed for operating on streams of data
• Examine each item a limited number of times (ideally once)
• Limited memory usage (logarithmic to the size of the stream,
or xed)
Max size
Examples of sketches
HyperLogLog
• Cardinality / unique count estimation
• Used in Redis PFADD, PFCOUNT, PFMERGE
Others: Bloom lters (also for set membership), frequency
sketches (top-N lists)
65
Tradeoffs
Understand the tradeoffs - speed, accuracy, space
What other characteristics do you need?
• Well-defined or arbitrary range of inputs?
• What kinds of queries are you answering?
66
Approximation for distribution metrics
What's important for approximating distribution metrics?
• Bounded error
• Performance - size, speed of inserts
• Aggregation (aka "merging")
67
How do you compress a distribution
68
Histograms
Basic example from OpenMetrics / Prometheus
69
Histograms
Basic example from OpenMetrics / Prometheus
70
Time spent Count
<= 0.05 (50ms) 24054
<= 0.1 (100ms) 33444
<= 0.2 (200ms) 100392
<= 0.5 (500ms) 129389
<= 1s 133988
> 1s 144320
median = ~158ms (using linear interpolation)
72160
158ms
p99 = ?!
Rank and relative error
71
Rank and relative error
72
Relative error
In metrics, specically latency metrics, we care about about both
the distribution of data as well as specic values
E.g., for an SLO, I want to know, is my p99 500ms or less?
Relative error bounds mean we can answer this: Yes, within 99%
of requests are <= 500ms +/- 1%
Otherwise stated: 99% of requests are guaranteed <= 505ms
73
Fast insertion
Each insertion is just two operations - nd the bucket, increase
the count (sometimes there's an allocation)
74
Fixed Size - how?
With certain distributions, we may reach the maximum number
of buckets (in our case, 4000)
• Roll up lower buckets - lower percentiles are generally not as
interesting!*
*Note that we've yet to nd a data set that actually needs this in practice
75
Aggregation and merging
76
"a binary operation is commutative if changing the order of the
operands does not change the result"
Why is this important?
Talk Plan
1. What Are Metrics Databases?
2. Our Architecture
3. Deep Dive On Our Datastores
4. Handling Synchronization
5. Introducing Aggregation
6. Aggregation For Deeper Insights Using Sketches
7. Sketches Enabling Flexible Architecture
77
Before, during, save for later
If we have two-way mergeable sketches, we can re-aggregate
the aggregations
• Agent
• Streaming during ingestion
• At query time
• In the data store (saving partial results)
78
Pipeline Architecture
79
Customer
Browser
IntakeMetrics sources
Query
System
Web frontend &
APIs
Customer
Monitors and
Alerts
Slack/Email/
PagerDuty etc
Data Stores
Data Stores
Data Stores
Query
Cache
Aggregation
Points
Streaming
Aggregator
DDSketch
DDSketch (Distributed Distribution Sketch) is open source (part
of the agent today)
• Presenting at VLDB2019 in August
• Open-sourcing standalone versions in several languages
80
Performance mantras
• Don't do it - build the minimal synchronization needed
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking - pre-aggregate
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper - use hybrid data storage types and technologies
81
Performance mantras
• Don't do it - build the minimal synchronization needed
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking - pre-aggregate
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper - use hybrid data storage types and technologies, and
use compression techniques based on what customers really need
82
Summary
• Don't do it - build the bare minimal synchronization needed
• Do it, but don't do it again - use query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking - pre-aggregate where is cost effective
• Do it concurrently - use independent horizontally scaleable data stores
• Do it cheaper - use hybrid data storage types and technologies, and
use compression techniques based on what customers really need
83
Thank You
Challenges and opportunities of
aggregation
• Challenges:
• Accuracy
• Latency
• Opportunity:
• Orders of magnitude performance improvement on common and
highly visible queries
85
Human factors and dashboards
86
• Human-latency sensitive - high visibility
Late-arriving data makes people nervous
• Human granularity - how many lines can you reason about on a
dashboard?
Oh no...
Where aggregation happens
87
At the metric source (agent/lambda/etc)
• Counts by sum
• Gauges by last
At query time
• Arbitrary user selection (avg/sum/min/max)
• Impacts user experience
Adding a new metric type
Counters, gauges, distributions!
Used gauges for latency, etc, but aggregate by last is not what
you want
Need to update the agent, libraries, integrations
We're learning and building on what we have today
88
Building blocks
We have a way to move data around (Kafka)
We have ways to index that data (tagsets)
We know how to separate recent and historical data
Plan for the future
[Lego / puzzle with gaps]
89
Connect the dots
90
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
datadog-metrics-db/

Weitere ähnliche Inhalte

Was ist angesagt?

Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftAmazon Web Services
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftAmazon Web Services
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Deep Dive Into Elasticsearch
Deep Dive Into ElasticsearchDeep Dive Into Elasticsearch
Deep Dive Into ElasticsearchKnoldus Inc.
 
SQream DB, GPU-accelerated data warehouse
SQream DB, GPU-accelerated data warehouseSQream DB, GPU-accelerated data warehouse
SQream DB, GPU-accelerated data warehouseNAVER Engineering
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala CookbookCloudera, Inc.
 
Real-Time Anomoly Detection with Spark MLib, Akka and Cassandra by Natalino Busa
Real-Time Anomoly Detection with Spark MLib, Akka and Cassandra by Natalino BusaReal-Time Anomoly Detection with Spark MLib, Akka and Cassandra by Natalino Busa
Real-Time Anomoly Detection with Spark MLib, Akka and Cassandra by Natalino BusaSpark Summit
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache CassandraPatrick McFadin
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseTop 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseDataStax
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar SeriesMigrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar SeriesAmazon Web Services
 
Developing event-driven microservices with event sourcing and CQRS (svcc, sv...
Developing event-driven microservices with event sourcing and CQRS  (svcc, sv...Developing event-driven microservices with event sourcing and CQRS  (svcc, sv...
Developing event-driven microservices with event sourcing and CQRS (svcc, sv...Chris Richardson
 
Operationalizing Machine Learning at Scale at Starbucks
Operationalizing Machine Learning at Scale at StarbucksOperationalizing Machine Learning at Scale at Starbucks
Operationalizing Machine Learning at Scale at StarbucksDatabricks
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 

Was ist angesagt? (20)

Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Fast analytics kudu to druid
Fast analytics  kudu to druidFast analytics  kudu to druid
Fast analytics kudu to druid
 
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF Loft
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Deep Dive Into Elasticsearch
Deep Dive Into ElasticsearchDeep Dive Into Elasticsearch
Deep Dive Into Elasticsearch
 
SQream DB, GPU-accelerated data warehouse
SQream DB, GPU-accelerated data warehouseSQream DB, GPU-accelerated data warehouse
SQream DB, GPU-accelerated data warehouse
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Real-Time Anomoly Detection with Spark MLib, Akka and Cassandra by Natalino Busa
Real-Time Anomoly Detection with Spark MLib, Akka and Cassandra by Natalino BusaReal-Time Anomoly Detection with Spark MLib, Akka and Cassandra by Natalino Busa
Real-Time Anomoly Detection with Spark MLib, Akka and Cassandra by Natalino Busa
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache Cassandra
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseTop 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar SeriesMigrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
 
Developing event-driven microservices with event sourcing and CQRS (svcc, sv...
Developing event-driven microservices with event sourcing and CQRS  (svcc, sv...Developing event-driven microservices with event sourcing and CQRS  (svcc, sv...
Developing event-driven microservices with event sourcing and CQRS (svcc, sv...
 
Operationalizing Machine Learning at Scale at Starbucks
Operationalizing Machine Learning at Scale at StarbucksOperationalizing Machine Learning at Scale at Starbucks
Operationalizing Machine Learning at Scale at Starbucks
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 

Ähnlich wie Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day

Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...DATAVERSITY
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaDatabricks
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchJoe Alex
 
Dynamics CRM high volume systems - lessons from the field
Dynamics CRM high volume systems - lessons from the fieldDynamics CRM high volume systems - lessons from the field
Dynamics CRM high volume systems - lessons from the fieldStĂŠphane Dorrekens
 
Introducing Cloudian HyperStore 6.0
Introducing Cloudian HyperStore 6.0Introducing Cloudian HyperStore 6.0
Introducing Cloudian HyperStore 6.0Cloudian
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionSplunk
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentationEdward Capriolo
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Anubhav Kale
 
Leveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data WarehouseLeveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data WarehouseAmazon Web Services
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Codemotion
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACKristofferson A
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Best storage engine for MySQL
Best storage engine for MySQLBest storage engine for MySQL
Best storage engine for MySQLtomflemingh2
 
NYJavaSIG - Big Data Microservices w/ Speedment
NYJavaSIG - Big Data Microservices w/ SpeedmentNYJavaSIG - Big Data Microservices w/ Speedment
NYJavaSIG - Big Data Microservices w/ SpeedmentSpeedment, Inc.
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftAmazon Web Services
 
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAs
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAsOracle Database Performance Tuning Advanced Features and Best Practices for DBAs
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAsZohar Elkayam
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]Speedment, Inc.
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]Malin Weiss
 

Ähnlich wie Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day (20)

Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and Delta
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using Elasticsearch
 
Dynamics CRM high volume systems - lessons from the field
Dynamics CRM high volume systems - lessons from the fieldDynamics CRM high volume systems - lessons from the field
Dynamics CRM high volume systems - lessons from the field
 
Introducing Cloudian HyperStore 6.0
Introducing Cloudian HyperStore 6.0Introducing Cloudian HyperStore 6.0
Introducing Cloudian HyperStore 6.0
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentation
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark
 
Leveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data WarehouseLeveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data Warehouse
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Best storage engine for MySQL
Best storage engine for MySQLBest storage engine for MySQL
Best storage engine for MySQL
 
NYJavaSIG - Big Data Microservices w/ Speedment
NYJavaSIG - Big Data Microservices w/ SpeedmentNYJavaSIG - Big Data Microservices w/ Speedment
NYJavaSIG - Big Data Microservices w/ Speedment
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAs
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAsOracle Database Performance Tuning Advanced Features and Best Practices for DBAs
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAs
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
 

Mehr von C4Media

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoC4Media
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileC4Media
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsC4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 

Mehr von C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

KĂźrzlich hochgeladen

Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 

KĂźrzlich hochgeladen (20)

Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 

Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day

  • 1. Datadog: A Real-Time Metrics Database for Trillions of Points/Day Ian NOWLAND (https://twitter.com/inowland) VP, Metrics and Monitors Joel BARCIAUSKAS (https://twitter.com/JoelBarciauskas) Director, Aggregation Metrics QCon NYC ‘19
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ datadog-metrics-db/
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Some of Our Customers 2
  • 5. Some of What We Store 3
  • 6. Changing Source Lifecycle 4 Months/years Seconds Datacenter Cloud/VM Containers
  • 8. Applying Performance Mantras • Don't do it • Do it, but don't do it again • Do it less • Do it later • Do it when they're not looking • Do it concurrently • Do it cheaper *From Craig Hanson and Pat Crain, and the performance engineering community - see http://www.brendangregg.com/methodology.html 6
  • 9. Talk Plan 1. What Are Metrics Databases? 2. Our Architecture 3. Deep Dive On Our Datastores 4. Handling Synchronization 5. Introducing Aggregation 6. Aggregation For Deeper Insights Using Sketches 7. Sketches Enabling Flexible Architecture 7
  • 10. Talk Plan 1. What Are Metrics Databases? 2. Our Architecture 3. Deep Dive On Our Datastores 4. Handling Synchronization 5. Introducing Aggregation 6. Aggregation For Deeper Insights Using Sketches 7. Sketches Enabling Flexible Architecture 8
  • 11. Example Metrics Query 1 “What is the system load on instance i-xyz across the last 30 minutes” 9
  • 12. A Time Series metric system.load.1 timestamp 1526382440 value 0.92 tags host:i-xyz,env:dev,... 10
  • 13. Example Metrics Query 2 “Alert when the system load, averaged across our fleet in us-east-1a for a 5 minute interval, goes above 90%” 11
  • 14. Example Metrics Query 2 “Alert when the system load, averaged across my fleet in us-east-1a for a 5 minute interval, goes above 90%” 12 Aggregate DimensionTake Action
  • 15. Metrics Name and Tags Name: single string dening what you are measuring, e.g. system.cpu.user aws.elb.latency dd.frontend.internal.ajax.queue.length.total Tags: list of k:v strings, used to qualify metric and add dimensions to lter/aggregate over, e.g. ['host:server-1', 'availability-zone:us-east-1a', 'kernel_version:4.4.0'] ['host:server-2', 'availability-zone:us-east-1a', 'kernel_version:2.6.32'] ['host:server-3', 'availability-zone:us-east-1b', 'kernel_version:2.6.32'] 13
  • 16. Tags for all the dimensions Host / container: system metrics by host Application: internal cache hit rates, timers by module Service: hits, latencies or errors/s by path and/or response code Business: # of orders processed, $'s per second by customer ID 14
  • 17. Talk Plan 1. What Are Metrics Databases? 2. Our Architecture 3. Deep Dive On Our Datastores 4. Handling Synchronization 5. Introducing Aggregation 6. Aggregation For Deeper Insights Using Sketches 7. Sketches Enabling Flexible Architecture 15
  • 18. Pipeline Architecture 16 Customer Browser IntakeMetrics sources Query System Web frontend & APIs Customer Monitors and Alerts Slack/Email/ PagerDuty etc Data Stores Data Stores Data Stores
  • 19. Performance mantras • Don't do it • Do it, but don't do it again • Do it less • Do it later • Do it when they're not looking • Do it concurrently • Do it cheaper 17
  • 20. Performance mantras • Don't do it • Do it, but don't do it again - query caching • Do it less • Do it later • Do it when they're not looking • Do it concurrently • Do it cheaper 18
  • 21. Pipeline Architecture 19 Customer Browser IntakeMetrics sources Query System Web frontend & APIs Customer Monitors and Alerts Slack/Email/ PagerDuty etc Data Stores Data Stores Data Stores Query Cache
  • 22. Pipeline Architecture 20 Customer Browser IntakeMetrics sources Query System Web frontend & APIs Customer Monitors and Alerts Slack/Email/ PagerDuty etc Data Stores Data Stores Data Stores Query Cache
  • 23. Metrics Store Characteristics • Most metrics report with a tag set for quite some time => Therefore separate tag stores from time series stores 21
  • 24. Pipeline Architecture 22 Customer Browser IntakeMetrics sources Query System Web frontend & APIs Customer Monitors and Alerts Slack/Email/ PagerDuty etc Data Stores Data Stores Data Stores Query Cache
  • 25. Kafka for Independent Storage Systems Intake Incoming Data Kafka Points Store 1 Store 2 Kafka Tag Sets Tag Index Tag Describer S3 S3 Writer Query System Outgoing Data
  • 26. Performance mantras • Don't do it • Do it, but don't do it again - query caching • Do it less • Do it later - minimize processing on path to persistence • Do it when they're not looking • Do it concurrently • Do it cheaper 24
  • 27. Kafka for Independent Storage Systems Intake Incoming Data Kafka Points Store 1 Store 2 Kafka Tag Sets Tag Index Tag Describer S3 S3 Writer Query System Outgoing Data
  • 28. Scaling through Kafka Data is separated by partition to distribute it Partitions are customers, or a mod hash of their metric name This also gives us isolation. Intake Kafka partition:1Incoming Data Kafka partition:2 Kafka partition:0 Store 1 Kafka partition:3 Store 2 Store 2 Store 2 Store 1
  • 29. Performance mantras • Don't do it • Do it, but don't do it again - query caching • Do it less • Do it later - minimize processing on path to persistence • Do it when they're not looking • Do it concurrently - use independent horizontally scalable data stores • Do it cheaper 27
  • 30. Talk Plan 1. What Are Metrics Databases? 2. Our Architecture 3. Deep Dive On Our Datastores 4. Handling Synchronization 5. Introducing Aggregation 6. Aggregation For Deeper Insights Using Sketches 7. Sketches Enabling Flexible Architecture 28
  • 31. Per Customer Volume Ballparking 29 104 Number of apps; 1,000’s hosts times 10’s containers 103 Number of metrics emitted from each app/container 100 1 point a second per metric 105 Seconds in a day (actually 86,400) 101 Bytes/point (8 byte float, amortized tags) = 1013 10 Terabytes a Day For One Average Customer
  • 32. Volume Math • $210 to store 10 TB in S3 for a month • $60,000 for a month rolling queryable (300 TB) • But S3 is not for real time, high throughput queries 30
  • 33. Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB 80 GB/s 0.08 us $1,000 Instance Reboot SSD2 60 TB 12 GB/s 1 us $60 Instance Failures EBS io1 432 TB 12 GB/s 40 us $400 Data Center Failures S3 Infinite 12 GB/s3 100+ ms $214 11 nines durability Glacier Infinite 12 GB/s3 hours $44 11 nines durability 1. X1e.32xlarge, 3 year non convertible, no upfront reserved instance 2. i3en.24xlarge, 3 year non convertible, no upfront reserved instance 3. Assumes can highly parallelize to load network card of 100Gbps instance type. Likely does not scale out. 4. Storage Cost only
  • 34. Volume Math • 80 x1e.32xlarge DRAM for a month • $300,000 to store for a month • This is with no indexes or overhead • And people want to query much more than a month. 32
  • 35. Performance mantras • Don't do it • Do it, but don't do it again - query caching • Do it less - only index what you need • Do it later - minimize processing on path to persistence • Do it when they're not looking • Do it concurrently - use independent horizontally scalable data stores • Do it cheaper 33
  • 36. Returning to an Example Query “Alert when the system load, averaged across our fleet in us-east-1a for a 5 minute interval, goes above 90%” 34
  • 37. Queries We Need to Support 35 DESCRIBE TAGS What tags are queryable for this metric? TAG INDEX Given a time series id, what tags were used? TAG INVERTED INDEX Given some tags and a time range, what were the time series ingested? POINT STORE What are the values of a time series between two times?
  • 38. Performance mantras • Don't do it • Do it, but don't do it again - query caching • Do it less - only index what you need • Do it later - minimize processing on path to persistence • Do it when they're not looking • Do it concurrently - use independent horizontally scalable data stores • Do it cheaper 36
  • 39. Performance mantras • Don't do it • Do it, but don't do it again - query caching • Do it less - only index what you need • Do it later - minimize processing on path to persistence • Do it when they're not looking • Do it concurrently - use independent horizontally scalable data stores • Do it cheaper - use hybrid data storage types and technologies 37
  • 40. Cloud Storage Characteristics 38 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB 80 GB/s 0.08 us $1,000 Instance Reboot SSD2 60 TB 12 GB/s 1 us $60 Instance Failures EBS io1 432 TB 12 GB/s 40 us $400 Data Center Failures S3 Infinite 12 GB/s3 100+ ms $214 11 nines durability Glacier Infinite 12 GB/s3 hours $44 11 nines durability 1. X1e.32xlarge, 3 year non convertible, no upfront reserved instance 2. i3en.24xlarge, 3 year non convertible, no upfront reserved instance 3. Assumes can highly parallelize to load network card of 100Gbps instance type. Likely does not scale out. 4. Storage Cost only
  • 41. Cloud Storage Characteristics 39 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB 80 GB/s 0.08 us $1,000 Instance Reboot SSD2 60 TB 12 GB/s 1 us $60 Instance Failures EBS io1 432 TB 12 GB/s 40 us $400 Data Center Failures S3 Infinite 12 GB/s3 100+ ms $214 11 nines durability Glacier Infinite 12 GB/s3 hours $44 11 nines durability 1. X1e.32xlarge, 3 year non convertible, no upfront reserved instance 2. i3en.24xlarge, 3 year non convertible, no upfront reserved instance 3. Assumes can highly parallelize to load network card of 100Gbps instance type. Likely does not scale out. 4. Storage Cost only
  • 42. Hybrid Data Storage Types 40 System DESCRIBE TAGS TAG INDEX TAG INVERTED INDEX POINT STORE QUERY RESULTS
  • 43. Hybrid Data Storage Types 41 System Type Persistence DESCRIBE TAGS Local SSD Years TAG INDEX DRAM Cache (Hours) Local SSD Years TAG INVERTED INDEX DRAM Hours On SSD Days S3 Years POINT STORE DRAM Hours Local SSD Days S3 Years QUERY RESULTS DRAM Cache (Days)
  • 44. Hybrid Data Storage Technologies 42 System Type Persistence Technology Why? DESCRIBE TAGS Local SSD Years LevelDB High performing single node k,v TAG INDEX DRAM Cache (Hours) Redis Very high performance, in memory k,v Local SSD Years Cassandra Horizontal scaling, persistent k,v TAG INVERTED INDEX DRAM Hours In house Very customized index data structures On SSD Days RocksDB + SQLite Rich and flexible queries S3 Years Parquet Flexible Schema over time POINT STORE DRAM Hours In house Very customized index data structures Local SSD Days In house Very customized index data structures S3 Years Parquet Flexible Schema over time QUERY RESULTS DRAM Cache (Days) Redis Very high performance, in memory k,v
  • 45. Talk Plan 1. What Are Metrics Databases? 2. Our Architecture 3. Deep Dive On Our Datastores 4. Handling Synchronization 5. Introducing Aggregation 6. Aggregation For Deeper Insights Using Sketches 7. Sketches Enabling Flexible Architecture 43
  • 46. Alerts/Monitors Synchronization • Level sensitive • False positives is almost as important as false negative • Small delay preferable to evaluating incomplete data • Synchronization need is to be sure evaluation bucket is lled before processing 44
  • 47. Pipeline Architecture 45 Customer Browser IntakeMetrics sources Query System Web frontend & APIs Customer Monitors and Alerts Slack/Email/ PagerDuty etc Data Stores Data Stores Data Stores Query Cache Inject heartbeat here
  • 48. Pipeline Architecture 46 Customer Browser IntakeMetrics sources Query System Web frontend & APIs Customer Monitors and Alerts Slack/Email/ PagerDuty etc Data Stores Data Stores Data Stores Query Cache Inject heartbeat here And test it gets to here
  • 49. Heartbeats for Synchronization Semantics: - 1 second tick time for metrics - Last write wins to handle agent concurrency - Inject fake data as heartbeat through pipeline Then: - Monitor evaluator ensure heartbeat gets through before evaluating next period Challenges: - With sharding and multiple stores, lots of independent paths to make sure heartbeats go through 47
  • 50. Performance mantras • Don't do it - build the minimal synchronization needed • Do it, but don't do it again - query caching • Do it less - only index what you need • Do it later - minimize processing on path to persistence • Do it when they're not looking • Do it concurrently - use independent horizontally scalable data stores • Do it cheaper - use hybrid data storage types and technologies 48
  • 51. Talk Plan 1. What Are Metrics Databases? 2. Our Architecture 3. Deep Dive On Our Datastores 4. Handling Synchronization 5. Introducing Aggregation 6. Aggregation For Deeper Insights Using Sketches 7. Sketches Enabling Flexible Architecture 49
  • 52. Types of metrics 50 Counter, aggregate by sum Gauges, aggregate by last or avg Ex: Requests, errors/s, total time spent (stopwatch) Ex: CPU/network/disk use, queue length
  • 53. Aggregation 51 {0, 1, 0, 1, 0, 1, 0, 1, 0, 1} {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} {5, 5, 5, 5, 5, 5, 5, 5, 5, 5} {0, 2, 4, 8, 16, 32, 64, 128, 256, 512} Time S p ac e t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 Query output Counters: {5, 40, 50, 1023} Gauges (average): {0.5, 4, 5, 102.3} Gauges (last): {1, 9, 5, 512}
  • 54. Query characteristics 52 User: • Bursty and unpredictable • Latency Sensitive - ideal end user response is 100ms, 1s at most. • Skews to recent data, but want same latency on old data
  • 55. Query characteristics 53 Dashboards: • Predictable • Important enough to save • Looking for step-function changes, e.g. performance regressions, changes in usage
  • 56. Focus on outputs 54 These graphs are both aggregating 70k series Not a lot, but still output 10x to 2000x less than input!
  • 57. Performance mantras • Don't do it - build the minimal synchronization needed • Do it, but don't do it again - query caching • Do it less - only index what you need • Do it later - minimize processing on path to persistence • Do it when they're not looking? • Do it concurrently - use independent horizontally scalable data stores • Do it cheaper - use hybrid data storage types and technologies 55
  • 58. Pipeline Architecture 56 Customer Browser IntakeMetrics sources Query System Web frontend & APIs Customer Monitors and Alerts Slack/Email/ PagerDuty etc Data Stores Data Stores Data Stores Query Cache Aggregation Points
  • 59. Pipeline Architecture 57 Customer Browser IntakeMetrics sources Query System Web frontend & APIs Customer Monitors and Alerts Slack/Email/ PagerDuty etc Data Stores Data Stores Data Stores Query Cache Aggregation Points Streaming Aggregator
  • 60. Pipeline Architecture 58 Customer Browser IntakeMetrics sources Query System Web frontend & APIs Customer Monitors and Alerts Slack/Email/ PagerDuty etc Data Stores Data Stores Data Stores Query Cache Aggregation Points No one's looking here! Streaming Aggregator
  • 61. Performance mantras • Don't do it - build the minimal synchronization needed • Do it, but don't do it again - query caching • Do it less - only index what you need • Do it later - minimize processing on path to persistence • Do it when they're not looking - pre-aggregate • Do it concurrently - use independent horizontally scalable data stores • Do it cheaper - use hybrid data storage types and technologies 59
  • 62. Talk Plan 1. What Are Metrics Databases? 2. Our Architecture 3. Deep Dive On Our Datastores 4. Handling Synchronization 5. Introducing Aggregation 6. Aggregation For Deeper Insights Using Sketches 7. Sketches Enabling Flexible Architecture 60
  • 63. Distributions 61 Aggregate by percentile or SLO (count of values above or below a threshold) Ex: Latency, request size
  • 64. Calculating distributions 62 {0, 1, 0, 1, 0, 1, 0, 1, 0, 1} {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} {5, 5, 5, 5, 5, 5, 5, 5, 5, 5} {0, 2, 4, 8, 16, 32, 64, 128, 256, 512} Time S p ac e t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 {0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 7, 8, 8, 9, 16, 32, 64, 128, 256, 512} p90 p50
  • 65. Performance mantras • Don't do it - build the minimal synchronization needed • Do it, but don't do it again - query caching • Do it less - only index what you need • Do it later - minimize processing on path to persistence • Do it when they're not looking - pre-aggregate • Do it concurrently - use independent horizontally scalable data stores • Do it cheaper again? 63
  • 66. What are "sketches"? 64 Data structures designed for operating on streams of data • Examine each item a limited number of times (ideally once) • Limited memory usage (logarithmic to the size of the stream, or xed) Max size
  • 67. Examples of sketches HyperLogLog • Cardinality / unique count estimation • Used in Redis PFADD, PFCOUNT, PFMERGE Others: Bloom lters (also for set membership), frequency sketches (top-N lists) 65
  • 68. Tradeoffs Understand the tradeoffs - speed, accuracy, space What other characteristics do you need? • Well-dened or arbitrary range of inputs? • What kinds of queries are you answering? 66
  • 69. Approximation for distribution metrics What's important for approximating distribution metrics? • Bounded error • Performance - size, speed of inserts • Aggregation (aka "merging") 67
  • 70. How do you compress a distribution 68
  • 71. Histograms Basic example from OpenMetrics / Prometheus 69
  • 72. Histograms Basic example from OpenMetrics / Prometheus 70 Time spent Count <= 0.05 (50ms) 24054 <= 0.1 (100ms) 33444 <= 0.2 (200ms) 100392 <= 0.5 (500ms) 129389 <= 1s 133988 > 1s 144320 median = ~158ms (using linear interpolation) 72160 158ms p99 = ?!
  • 73. Rank and relative error 71
  • 74. Rank and relative error 72
  • 75. Relative error In metrics, specically latency metrics, we care about about both the distribution of data as well as specic values E.g., for an SLO, I want to know, is my p99 500ms or less? Relative error bounds mean we can answer this: Yes, within 99% of requests are <= 500ms +/- 1% Otherwise stated: 99% of requests are guaranteed <= 505ms 73
  • 76. Fast insertion Each insertion is just two operations - nd the bucket, increase the count (sometimes there's an allocation) 74
  • 77. Fixed Size - how? With certain distributions, we may reach the maximum number of buckets (in our case, 4000) • Roll up lower buckets - lower percentiles are generally not as interesting!* *Note that we've yet to nd a data set that actually needs this in practice 75
  • 78. Aggregation and merging 76 "a binary operation is commutative if changing the order of the operands does not change the result" Why is this important?
  • 79. Talk Plan 1. What Are Metrics Databases? 2. Our Architecture 3. Deep Dive On Our Datastores 4. Handling Synchronization 5. Introducing Aggregation 6. Aggregation For Deeper Insights Using Sketches 7. Sketches Enabling Flexible Architecture 77
  • 80. Before, during, save for later If we have two-way mergeable sketches, we can re-aggregate the aggregations • Agent • Streaming during ingestion • At query time • In the data store (saving partial results) 78
  • 81. Pipeline Architecture 79 Customer Browser IntakeMetrics sources Query System Web frontend & APIs Customer Monitors and Alerts Slack/Email/ PagerDuty etc Data Stores Data Stores Data Stores Query Cache Aggregation Points Streaming Aggregator
  • 82. DDSketch DDSketch (Distributed Distribution Sketch) is open source (part of the agent today) • Presenting at VLDB2019 in August • Open-sourcing standalone versions in several languages 80
  • 83. Performance mantras • Don't do it - build the minimal synchronization needed • Do it, but don't do it again - query caching • Do it less - only index what you need • Do it later - minimize processing on path to persistence • Do it when they're not looking - pre-aggregate • Do it concurrently - use independent horizontally scalable data stores • Do it cheaper - use hybrid data storage types and technologies 81
  • 84. Performance mantras • Don't do it - build the minimal synchronization needed • Do it, but don't do it again - query caching • Do it less - only index what you need • Do it later - minimize processing on path to persistence • Do it when they're not looking - pre-aggregate • Do it concurrently - use independent horizontally scalable data stores • Do it cheaper - use hybrid data storage types and technologies, and use compression techniques based on what customers really need 82
  • 85. Summary • Don't do it - build the bare minimal synchronization needed • Do it, but don't do it again - use query caching • Do it less - only index what you need • Do it later - minimize processing on path to persistence • Do it when they're not looking - pre-aggregate where is cost effective • Do it concurrently - use independent horizontally scaleable data stores • Do it cheaper - use hybrid data storage types and technologies, and use compression techniques based on what customers really need 83
  • 87. Challenges and opportunities of aggregation • Challenges: • Accuracy • Latency • Opportunity: • Orders of magnitude performance improvement on common and highly visible queries 85
  • 88. Human factors and dashboards 86 • Human-latency sensitive - high visibility Late-arriving data makes people nervous • Human granularity - how many lines can you reason about on a dashboard? Oh no...
  • 89. Where aggregation happens 87 At the metric source (agent/lambda/etc) • Counts by sum • Gauges by last At query time • Arbitrary user selection (avg/sum/min/max) • Impacts user experience
  • 90. Adding a new metric type Counters, gauges, distributions! Used gauges for latency, etc, but aggregate by last is not what you want Need to update the agent, libraries, integrations We're learning and building on what we have today 88
  • 91. Building blocks We have a way to move data around (Kafka) We have ways to index that data (tagsets) We know how to separate recent and historical data Plan for the future [Lego / puzzle with gaps] 89
  • 93. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ datadog-metrics-db/