Cortex: Prometheus as a Service, One Year On

•

4 likes•1,612 views

Presented by Tom Wilkie at PromCon 2017 With Speaker Notes: https://goo.gl/V9qGva At PromCon 2016, I presented "Project Frankenstein: A multitenant, horizontally scalable Prometheus as a service". It's now one year later, and lots has changed - not least the name! This talk will discuss what we've learnt running a Prometheus service for the past year, the architectural changes we made from the original design, and the improvements we've made to the Cortex user experience.

Technology

Cortex: Prometheus as a
Service, One Year On
Tom Wilkie, PromCon 2017
tom.wilkie@gmail.com
https://github.com/weaveworks/cortex

https://www.youtube.com/watch?v=3Tb4Wc0kfCM

Cortex: Prometheus as a Service
• Natively multi tenant; isolate different
customers in the same services.
• Different story around scaling & HA
• “Virtually inﬁnite” retention and durability
• Opportunities for performance
enhancements
Cortex
YourYourYourYour
Your Jobs
Alertmanager
Grafana Prometheus HA

Frontend
Ditributor
DynamoDB Memcache
Consul
Ingester
Write requests
Read requests
Control requests
Prometheus
Your Jobs
S3
Cortex Architecture

https://github.com/weaveworks/cortex/issues/254

Frontend
Ditributor
DynamoDB Memcache
Consul
Ingester
Write requests
Read requests
Control requests
Prometheus
Your Jobs
S3Table Manager
Cortex Architecture

Problem #2: DynamoDB Write
Throughput, again

Original schema:
• Hash Key: <user ID>:<hour>:<metric name>
• Range Key: <label name>:<label value>:<chunk ID>
New schema:
• Hash Key: <user ID>:<day>:<metric name>:<label name>
• Range Key: <chunk ID>:<chunk end time>
https://github.com/weaveworks/cortex/pull/262

Frontend
Ditributor Querier
Table Manager DynamoDB Memcache
Consul
Ingester
Write requests
Read requests
Control requests
Prometheus
Your Jobs
S3
Cortex Architecture

https://www.weave.works/blog/the-long-tail-tools-to-investigate-high-long-tail-latency/

S3 DynamoDB
IOP Cost
($/IOP)
5x10-6 2x10-7
Storage Cost
($/GB/Month)
0.023 0.250
https://github.com/weaveworks/cortex/issues/141

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05 0.055 0.06
0.005
0.01
0.015
0.02
0.025
Object size (GB)
Cost ($)
S3
DynamoDB

Frontend
Ditributor Querier
Table Manager DynamoDB Memcache
Consul
RulerIngester
Write requests
Read requests
Control requests
Prometheus
Your Jobs
Cortex Architecture

Frontend
Ditributor Querier
Table Manager BigTable Memcache
Consul
RulerIngester
Write requests
Read requests
Control requests
Prometheus
Your Jobs
Cortex Architecture

DynamoDB BigTable
99th Percentile Write
Latency (ms)
70-100 50-150
99th Percentile Read
Latency (ms)
100-2500 ~250
LOC ~2000 ~400
DynamoDB numbers courtesy of Weaveworks

1. DynamoDB Write Throughput
2. DynamoDB Write Throughput, again
3. Recording rules and alerts
4. Long tail
5. Cost
6. DynamoDB, again

Running for >12months
• Availability: querier unavailable for <12hrs ~99.9%
• Durability: lost <2 days of data >99.5%
• 99th percentile write performance ~60ms
• 99th percentile query performance <200ms

Future
• Direct chunk writes from Prometheus to Cortex Chunk Store
• Separate ingester index for better load balancing
• Use prometheus/tsdb for the ingesters
• Etcd & gossip for ring storage
• Chunks in Google Cloud Storage

I left Weaveworks at the begging of June to
focus on Prometheus & Cortex development.
Since then I’ve teamed up with David to
develop some ideas around Prometheus,
logging, and tracing.
We’re available for Prometheus hosting,
consulting, training and support.
email: hello@kausal.co

What's hot

Kong in 1.x TerritoryThibault Charbonnier

Solving some of the scalability problems at booking.comIvan Kruglov

Using Kubernetes to deploy Django in GCPWalter Liu

CNCF explore k8s api using java clientErhwen Kuo

Cortex: Horizontally Scalable, Highly Available PrometheusGrafana Labs

Cncf k8s_network_part1Erhwen Kuo

Project Frankenstein: A multitenant, horizontally scalable Prometheus as a se...Weaveworks

Self Created Load Balancer for MTA on AWSsharu1204

Cncf k8s_network_02Erhwen Kuo

Developing a user-friendly OpenResty applicationThibault Charbonnier

Breaking Prometheus (Promcon Berlin '16)Matthew Campbell

From AWS to GCP, TABLEAPP Architecture StoryYen-Wen Chen

Building Thick Clients with Tower in RustLucio Franco

Altitude NY 2018: 132 websites, 1 service: Your local news runs on FastlyFastly

A Cassandra driver from and for the Lua communityThibault Charbonnier

PubSub++ - Atmosphere 2015Krzysztof Debski

Cloud Solution Day 2016: Microservices on Mesos & Netflix OSSAWS Vietnam Community

Traefik on google kubernetes engineManuel Zapf

Ingress overviewHarshal Shah

Tableapp architecture migration story for GCPUG.TWYen-Wen Chen

What's hot (20)

Kong in 1.x Territory

Solving some of the scalability problems at booking.com

Using Kubernetes to deploy Django in GCP

CNCF explore k8s api using java client

Cortex: Horizontally Scalable, Highly Available Prometheus

Cncf k8s_network_part1

Project Frankenstein: A multitenant, horizontally scalable Prometheus as a se...

Self Created Load Balancer for MTA on AWS

Cncf k8s_network_02

Developing a user-friendly OpenResty application

Breaking Prometheus (Promcon Berlin '16)

From AWS to GCP, TABLEAPP Architecture Story

Building Thick Clients with Tower in Rust

Altitude NY 2018: 132 websites, 1 service: Your local news runs on Fastly

A Cassandra driver from and for the Lua community

PubSub++ - Atmosphere 2015

Cloud Solution Day 2016: Microservices on Mesos & Netflix OSS

Traefik on google kubernetes engine

Ingress overview

Tableapp architecture migration story for GCPUG.TW

Similar to Cortex: Prometheus as a Service, One Year On

Become a Performance Diagnostics HeroTechWell

Cloud Architecture Tutorial - Running in the Cloud (3of3)Adrian Cockcroft

Fixing twitterRoger Xia

Fixing_Twitterliujianrong

Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...smallerror

Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...xlight

5 Pitfalls to Avoid with MongoDBTim Callaghan

Top Java Performance Problems and Metrics To Check in Your PipelineAndreas Grabner

Performance & Scalability Improvements in PerforcePerforce

Avoiding Common Pitfalls: Spark Structured Streaming with KafkaHostedbyConfluent

Production Ready Serverless Java Applications in 3 Weeks AWS UG Cologne Febru...Vadym Kazulkin

MongoDB World 2018: Building a New Transactional ModelMongoDB

Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...HostedbyConfluent

Planet-scale Data Ingestion Pipeline: BigdamSATOSHI TAGOMORI

Getting Started with Amazon RedshiftAmazon Web Services

Chirp 2010: Scaling TwitterJohn Adams

Running Legacy Applications with ContainersLinuxCon ContainerCon CloudOpen China

Serverlessusecase workshop feb3_v2kartraj

Scaling habits of ASP.NETDavid Giard

Presto At Treasure DataTaro L. Saito

Similar to Cortex: Prometheus as a Service, One Year On (20)

Become a Performance Diagnostics Hero

Cloud Architecture Tutorial - Running in the Cloud (3of3)

Fixing twitter

Fixing_Twitter

Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...

5 Pitfalls to Avoid with MongoDB

Top Java Performance Problems and Metrics To Check in Your Pipeline

Performance & Scalability Improvements in Perforce

Avoiding Common Pitfalls: Spark Structured Streaming with Kafka

Production Ready Serverless Java Applications in 3 Weeks AWS UG Cologne Febru...

MongoDB World 2018: Building a New Transactional Model

Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...

Planet-scale Data Ingestion Pipeline: Bigdam

Getting Started with Amazon Redshift

Chirp 2010: Scaling Twitter

Running Legacy Applications with Containers

Serverlessusecase workshop feb3_v2

Scaling habits of ASP.NET

Presto At Treasure Data

Recently uploaded

Advanced Computer Architecture – An IntroductionDilum Bandara

What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina

Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan

A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3

The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech

How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe

TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3

Take control of your SAP testing with UiPath Test SuiteDianaGray10

DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy

unit 4 immunoblotting technique complete.pptxBkGupta21

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada

Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3

Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited

Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar

From Family Reminiscence to Scholarly Archive .Alan Dix

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos

Recently uploaded (20)

Advanced Computer Architecture – An Introduction

What is DBT - The Ultimate Data Build Tool.pdf

Generative AI for Technical Writer or Information Developers

A Deep Dive on Passkeys: FIDO Paris Seminar.pptx

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx

The Ultimate Guide to Choosing WordPress Pros and Cons

How AI, OpenAI, and ChatGPT impact business and software.

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx

Take control of your SAP testing with UiPath Test Suite

DevoxxFR 2024 Reproducible Builds with Apache Maven

unit 4 immunoblotting technique complete.pptx

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024

Moving Beyond Passwords: FIDO Paris Seminar.pdf

Ensuring Technical Readiness For Copilot in Microsoft 365

Unleash Your Potential - Namagunga Girls Coding Club

From Family Reminiscence to Scholarly Archive .

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)

Cortex: Prometheus as a Service, One Year On

1. Cortex: Prometheus as a Service, One Year On Tom Wilkie, PromCon 2017 tom.wilkie@gmail.com https://github.com/weaveworks/cortex

3. https://www.youtube.com/watch?v=3Tb4Wc0kfCM

4. Cortex: Prometheus as a Service • Natively multi tenant; isolate different customers in the same services. • Different story around scaling & HA • “Virtually inﬁnite” retention and durability • Opportunities for performance enhancements Cortex YourYourYourYour Your Jobs Alertmanager Grafana Prometheus HA

5. Frontend Ditributor DynamoDB Memcache Consul Ingester Write requests Read requests Control requests Prometheus Your Jobs S3 Cortex Architecture

6. A Year’s Evolution

7. Problem #1: DynamoDB Write Throughput

8. https://github.com/weaveworks/cortex/issues/254

9. Frontend Ditributor DynamoDB Memcache Consul Ingester Write requests Read requests Control requests Prometheus Your Jobs S3Table Manager Cortex Architecture

10. Problem #2: DynamoDB Write Throughput, again

11. Original schema: • Hash Key: <user ID>:<hour>:<metric name> • Range Key: <label name>:<label value>:<chunk ID> New schema: • Hash Key: <user ID>:<day>:<metric name>:<label name> • Range Key: <chunk ID>:<chunk end time> https://github.com/weaveworks/cortex/pull/262

12. Problem #3: Queries of Death

13. Frontend Ditributor Querier Table Manager DynamoDB Memcache Consul Ingester Write requests Read requests Control requests Prometheus Your Jobs S3 Cortex Architecture

14. Problem #3: Recording rules and alerts

15. Frontend Ditributor Querier Table Manager DynamoDB Memcache Consul Ingester Write requests Read requests Control requests Prometheus Your Jobs S3 Ruler Cortex Architecture

16. Problem #4: Long tail

17.

18. https://www.weave.works/blog/the-long-tail-tools-to-investigate-high-long-tail-latency/

19. Problem #5: Cost

20. S3 DynamoDB IOP Cost ($/IOP) 5x10-6 2x10-7 Storage Cost ($/GB/Month) 0.023 0.250 https://github.com/weaveworks/cortex/issues/141

21. 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05 0.055 0.06 0.005 0.01 0.015 0.02 0.025 Object size (GB) Cost ($) S3 DynamoDB

22. Frontend Ditributor Querier Table Manager DynamoDB Memcache Consul RulerIngester Write requests Read requests Control requests Prometheus Your Jobs Cortex Architecture

23. Problem #6: DynamoDB, again

24. Frontend Ditributor Querier Table Manager BigTable Memcache Consul RulerIngester Write requests Read requests Control requests Prometheus Your Jobs Cortex Architecture

25. DynamoDB BigTable 99th Percentile Write Latency (ms) 70-100 50-150 99th Percentile Read Latency (ms) 100-2500 ~250 LOC ~2000 ~400 DynamoDB numbers courtesy of Weaveworks

26. Closing thoughts

27. 1. DynamoDB Write Throughput 2. DynamoDB Write Throughput, again 3. Recording rules and alerts 4. Long tail 5. Cost 6. DynamoDB, again

28. Running for >12months • Availability: querier unavailable for <12hrs ~99.9% • Durability: lost <2 days of data >99.5% • 99th percentile write performance ~60ms • 99th percentile query performance <200ms

29. Future • Direct chunk writes from Prometheus to Cortex Chunk Store • Separate ingester index for better load balancing • Use prometheus/tsdb for the ingesters • Etcd & gossip for ring storage • Chunks in Google Cloud Storage

30. One more thing…

31. I left Weaveworks at the begging of June to focus on Prometheus & Cortex development. Since then I’ve teamed up with David to develop some ideas around Prometheus, logging, and tracing. We’re available for Prometheus hosting, consulting, training and support. email: hello@kausal.co

32. Metrics

33. Logs

34. Traces

35. Thank you! Questions?