FROM NOTHING TO PROMETHEUS
ONE YEAR AFTER
MEETUP CLOUD NATIVE COMPUTING PARIS - FEBRUARY 2018
Speaker ID
Antoine LEROYER
⬢ Infrastructure Engineer / SRE @ Deezer since 2016
⬢ DevOps @ EDF (2013-2016)
⬢ Sysadmin @ Netvibes (2012-2013)
Agenda
⬢ Deezer in 30 seconds
⬢ State of Deezer Infrastructure in 2016
⬢ Why Prometheus?
⬢ Let’s dive in our setup
⬢ What’s next for us in monitoring
⬢ Questions
Deezer in 30 seconds
⬢ Streaming music service
⬢ Launched in 2007
⬢ Available on multiple devices: Mobile, Desktop, TV, Speakers, etc.
⬢ 12M active users
⬢ 185+ countries
⬢ 43M tracks
(and counting)
State of Deezer Infrastructure in 2016
⬢ Fully managed by our provider
⬡ Rack and initial setup of servers
⬡ Configuration management
⬡ Monitoring
⬡ Alerting
⬢ Majority of bare metal servers (400+)
⬢ Infrastructure Team was small
⬢ Technical staff grew fast (x4 in one year)
⬢ ...so our team got new members to handle the growth :)
The new Infrastructure Team @ Deezer
If you want to manage production yourself, without your provider, you need a proper monitoring solution (and other things, but that’s not the point here).
But first, we asked ourselves:
Okay, so what are our needs?
Our needs
⬢ Have a bunch of metrics to make nice graphs
⬢ Send alerts if something goes wrong
⬢ Easy to deploy on the existing infrastructure
⬢ But also support container orchestration for the future
⬢ Be able to scale up/down without triggering alerts
Why Prometheus?
What is Prometheus?
⬢ Open-source systems monitoring and alerting toolkit
⬢ Time series database where each series is identified by a metric name and labels
⬢ Pulls time series over HTTP instead of having them pushed
⬢ Targets are discovered via service discovery
⬢ No distributed storage, nodes are autonomous
https://prometheus.io/docs/introduction/overview/
⬢ Typical time series
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
http_requests_total{method="post",code="400"} 3
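To make the format concrete, here is a minimal Python sketch that parses sample lines like the ones above into (name, labels, value) tuples. It is an illustration only: `parse_samples` is a hypothetical helper, and it ignores optional timestamps and does not unescape label values, both of which a real parser must handle.

```python
import re

# One sample line of the text exposition format: name{label="value",...} value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# One label pair inside the braces
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

EXPOSITION = """\
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
http_requests_total{method="post",code="400"} 3
"""


def parse_samples(text):
    """Return a list of (name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE metadata and blank lines carry no sample
        match = SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f"not a sample line: {line!r}")
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_samples(EXPOSITION):
    print(name, labels, value)
```

Each distinct combination of metric name and label values is its own time series, so the two lines above are two series of the same counter.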
Why Prometheus?
⬢ Designed for metrics (TSDB)
⬢ Provides alerting through Alertmanager
⬢ Grafana support
⬢ High performance
⬢ Powerful but simple query language (PromQL)
⬢ Service Discovery
⬡ Follows your infrastructure as it scales up/down
⬡ Ready for container orchestration
Let’s dive in our setup
First, a service discovery
We use Consul:
⬢ Already deployed on some servers
⬢ Supported by Prometheus
⬢ Blazing fast and lightweight
⬢ Service declaration with tags support
⬢ Bonus:
⬡ we get service checks
⬡ and a K/V store
Consul by HashiCorp (consul.io)
Then, Prometheus
⬢ 2 monstrous servers in each PoP
⬡ 32 cores
⬡ 128GB RAM
⬡ RAID 10 SSD
⬢ Currently running Prometheus 2.1
Also, an Alertmanager cluster.
And also exporters
An exporter can be:
⬢ A daemon exposing metrics through an HTTP endpoint
⬢ An HTTP endpoint inside your application
⬢ A Prometheus Pushgateway
Your endpoint must expose plain-text data in the Prometheus format.
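As an illustration of the first two options, here is a minimal sketch of an HTTP endpoint that serves metrics in the text format, using only the Python standard library. The counter values are hard-coded placeholders, `render_metrics` and `serve` are hypothetical names, and port 6110 is simply the exporter port used in the Consul example of this deck. In practice you would more likely use an official Prometheus client library than hand-render the format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(counters):
    """Render {(method, code): count} in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total The total number of HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for (method, code), count in sorted(counters.items()):
        lines.append(
            f'http_requests_total{{method="{method}",code="{code}"}} {count}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    # Placeholder values; a real exporter would read them from the service
    counters = {("post", "200"): 1027, ("post", "400"): 3}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.counters).encode("utf-8")
        self.send_response(200)
        # Content type of the Prometheus text format
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=6110):
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; Prometheus then scrapes `http://<host>:6110/metrics` at its configured interval.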
Prometheus infrastructure for one datacenter
How do I monitor a server and its services?
1. Deploy a Consul agent and a bunch of exporters on a node
2. Add services to your Consul agent with some tags
3. ????
4. Profit!!!!
Consul Agent configuration
⬢ "prod": tag to filter by environment
⬢ "exporter-6110": port where my exporter is listening
# Consul Service JSON for Apache
{
  "service": {
    "name": "apache",
    "tags": [
      "prod",
      "apache",
      "exporter-6110"
    ],
    "address": "",
    "port": 443,
    "enableTagOverride": false,
    "checks": [
      {
        "script": "apache-check.sh",
        "interval": "5s"
      }
    ]
  }
}
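If service definitions like this one are generated by configuration management, the tag convention can be captured in a small helper. The sketch below is hypothetical (`consul_service` is not part of Consul or of the original deck) and leaves out the health check for brevity; it only shows how the environment tag and the `exporter-<port>` tag fit together.

```python
import json


def consul_service(name, environment, exporter_port, service_port):
    """Build a Consul service definition following the tag convention above."""
    return {
        "service": {
            "name": name,
            # environment tag, service tag, and the port the exporter listens on
            "tags": [environment, name, f"exporter-{exporter_port}"],
            "address": "",
            "port": service_port,
            "enableTagOverride": False,
        }
    }


definition = consul_service("apache", "prod", 6110, 443)
print(json.dumps(definition, indent=2))
```

The service port (443) and the exporter port (6110) are deliberately separate: Consul checks the service itself, while Prometheus scrapes the exporter next to it.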
Prometheus relabeling: a strong feature
⬢ Before scraping, Prometheus allows you to change or create labels
⬢ You can create labels to help you identify your metrics
# Replace job label with service name
- source_labels: [__meta_consul_service]
  target_label: job
# Add datacenter label
- source_labels: [__meta_consul_dc]
  target_label: dc
# Add instance name
- source_labels: [__meta_consul_node]
  target_label: instance
# Create a group label from node name
- source_labels: [__meta_consul_node]
  regex: ^(blm|dzr|dev)-([a-z]+)-.*
  target_label: group
  replacement: ${2}
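To see what these rules produce, the following Python sketch imitates the default `replace` relabel action: join the source label values with the separator, match the fully anchored regex, and write the expanded replacement into the target label. It is a simplified model, not Prometheus code; the node name `dzr-web-042` and datacenter `dc1` are made-up examples, and Python's `re` stands in for the RE2 engine Prometheus uses.

```python
import re


def relabel(labels, source_labels, target_label,
            regex="(.*)", separator=";", replacement="$1"):
    """Apply one 'replace' relabel rule and return the updated label set."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # regexes are anchored at both ends
    if match is None:
        return dict(labels)  # no match: the rule leaves the labels untouched
    # Turn $1 / ${1} references into Python's \g<1> syntax
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    value = match.expand(template)
    updated = dict(labels)
    if value:
        updated[target_label] = value
    else:
        updated.pop(target_label, None)  # an empty result removes the label
    return updated


RULES = [
    dict(source_labels=["__meta_consul_service"], target_label="job"),
    dict(source_labels=["__meta_consul_dc"], target_label="dc"),
    dict(source_labels=["__meta_consul_node"], target_label="instance"),
    dict(source_labels=["__meta_consul_node"],
         regex=r"^(blm|dzr|dev)-([a-z]+)-.*",
         target_label="group", replacement="${2}"),
]


def apply_rules(labels, rules):
    for rule in rules:
        labels = relabel(labels, **rule)
    return labels


discovered = {
    "__meta_consul_service": "apache",
    "__meta_consul_dc": "dc1",
    "__meta_consul_node": "dzr-web-042",
}
result = apply_rules(discovered, RULES)
# Labels starting with __ are dropped before the series is stored
visible = {k: v for k, v in result.items() if not k.startswith("__")}
print(visible)
```

A node whose name does not match the `group` regex simply gets no `group` label, because a non-matching rule changes nothing.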
⬢ You can change internal labels of Prometheus
⬡ They start with __ and will be removed before storing the metric
⬡ You can override labels used for scraping to obtain a dynamic configuration
# Retrieve exporter port from consul tags
- source_labels: [__meta_consul_tags]
  regex: .*,exporter-([0-9]+),.*
  target_label: __exporter_port
  replacement: ${1}
# Define addr:port to scrape
- source_labels: [__meta_consul_address,__exporter_port]
  separator: ":"
  target_label: __address__
  replacement: ${1}
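The commas on both sides of `exporter-([0-9]+)` work because Prometheus joins the Consul tags with the tag separator (a comma by default) and also puts the separator at the start and end of the list, so every tag is surrounded by commas wherever it sits. A short Python sketch of the two rules, with a made-up node address and a hypothetical helper name:

```python
import re


def scrape_address(node_address, tags):
    """Return the addr:port a target would be scraped on, or None without an exporter tag."""
    # __meta_consul_tags looks like ",prod,apache,exporter-6110,"
    meta_tags = "," + ",".join(tags) + ","
    match = re.fullmatch(r".*,exporter-([0-9]+),.*", meta_tags)
    if match is None:
        return None
    exporter_port = match.group(1)
    # Second rule: join address and port with ":" and store it in __address__
    return f"{node_address}:{exporter_port}"


print(scrape_address("10.0.0.12", ["prod", "apache", "exporter-6110"]))
```

This sketch returns `None` when no `exporter-<port>` tag is present; the two rules on their own would instead leave `__address__` with an empty port, so services without the tag need to be filtered out separately.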
After relabeling
Just a bunch of exporters
Typical week @ Deezer
Typical day for memcached
Impact of Prometheus v2 (two graph slides comparing 1.8.2 and 2.1.0)
Some stats about Prometheus itself
⬢ We have over 2.3 million time series
⬢ It scrapes ~57k samples per second
⬢ 30s scrape interval in general
⬢ No lag so far
OS tuning
# SSD Tuning
echo 0 > /sys/block/sdX/queue/rotational
echo deadline > /sys/block/sdX/queue/scheduler
# /etc/sysctl.d/local.conf
vm.swappiness=1
# /etc/security/limits.d/00prometheus
prometheus - nofile 10000000
# If you have an Intel CPU, want consistent CPU frequencies and scaling_governor
# doesn’t work. Put this in your kernel boot args.
intel_pstate=disable
Some 1.6.x to 1.8.x settings (in case you need them)
# Equal to 2/3 of your total memory
-storage.local.target-heap-size
# Set it to 5m to reduce load on the SSD
-storage.local.checkpoint-interval
# If you have a large number of time series and a low scrape interval,
# you can easily increase this above 10k
-storage.local.num-fingerprint-mutexes
# If you have SSDs, you can set this one really high
-storage.local.checkpoint-dirty-series-limit
Source: Configuring Prometheus for High Performance [A] - Björn Rabenstein, SoundCloud Ltd.
In 2.x
New TSDB engine. Just one setting:
--storage.tsdb.retention
Prometheus will take care of the rest.
Just ensure you have enough disk space for your retention period.
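For a rough idea of "enough", the Prometheus storage documentation gives the estimate disk space = retention in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies it to the ~57k samples per second quoted in this deck; the 15-day retention is an assumption (it is the Prometheus default), and real usage also depends on churn and headroom.

```python
SECONDS_PER_DAY = 86_400


def retention_disk_bytes(retention_days, samples_per_second, bytes_per_sample):
    """Estimate the disk space needed to hold the given retention period."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


low = retention_disk_bytes(15, 57_000, 1)
high = retention_disk_bytes(15, 57_000, 2)
print(f"{low / 1e9:.1f} GB to {high / 1e9:.1f} GB")
```

Under these assumptions the estimate lands between about 74 GB and 148 GB per Prometheus server, and it scales linearly with the retention period.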
What’s next for us in monitoring?
⬢ Go beyond 15 days of retention
⬡ Use remote read/write feature to export/read back data
⬢ Experiment with remote read to have only one endpoint to read metrics from
⬢ Alerting as a Service
⬡ Try to automate Prometheus alerting rules creation
⬡ Provision Alertmanager for each team
⬢ Write some exporters :)
⬢ Kubernetes!
Questions?
Thanks!

Weitere ähnliche Inhalte

Was ist angesagt?

Linuxday.at - Lightning Talk
Linuxday.at - Lightning TalkLinuxday.at - Lightning Talk
Linuxday.at - Lightning TalkJan Gehring
 
Centralized Logging with syslog
Centralized Logging with syslogCentralized Logging with syslog
Centralized Logging with syslogamiable_indian
 
Fluentd v0.12 master guide
Fluentd v0.12 master guideFluentd v0.12 master guide
Fluentd v0.12 master guideN Masahiro
 
From Zero To Production (NixOS, Erlang) @ Erlang Factory SF 2016
From Zero To Production (NixOS, Erlang) @ Erlang Factory SF 2016From Zero To Production (NixOS, Erlang) @ Erlang Factory SF 2016
From Zero To Production (NixOS, Erlang) @ Erlang Factory SF 2016Susan Potter
 
Rihards Olups - Encrypting Daemon Traffic With Zabbix 3.0
Rihards Olups - Encrypting Daemon Traffic With Zabbix 3.0Rihards Olups - Encrypting Daemon Traffic With Zabbix 3.0
Rihards Olups - Encrypting Daemon Traffic With Zabbix 3.0Zabbix
 
Fluentd and PHP
Fluentd and PHPFluentd and PHP
Fluentd and PHPchobi e
 
{{more}} Kibana4
{{more}} Kibana4{{more}} Kibana4
{{more}} Kibana4琛琳 饶
 
Life of an Fluentd event
Life of an Fluentd eventLife of an Fluentd event
Life of an Fluentd eventKiyoto Tamura
 
Nmap Scripting Engine and http-enumeration
Nmap Scripting Engine and http-enumerationNmap Scripting Engine and http-enumeration
Nmap Scripting Engine and http-enumerationRobert Rowley
 
Elk with Openstack
Elk with OpenstackElk with Openstack
Elk with OpenstackArun prasath
 
Puppet Availability and Performance at 100K Nodes - PuppetConf 2014
Puppet Availability and Performance at 100K Nodes - PuppetConf 2014Puppet Availability and Performance at 100K Nodes - PuppetConf 2014
Puppet Availability and Performance at 100K Nodes - PuppetConf 2014Puppet
 
ELK stack at weibo.com
ELK stack at weibo.comELK stack at weibo.com
ELK stack at weibo.com琛琳 饶
 
Nessus scan report using microsoft patchs scan policy - Tareq Hanaysha
Nessus scan report using microsoft patchs scan policy - Tareq HanayshaNessus scan report using microsoft patchs scan policy - Tareq Hanaysha
Nessus scan report using microsoft patchs scan policy - Tareq HanayshaHanaysha
 
Lua tech talk
Lua tech talkLua tech talk
Lua tech talkLocaweb
 
Like loggly using open source
Like loggly using open sourceLike loggly using open source
Like loggly using open sourceThomas Alrin
 

Was ist angesagt? (20)

Linuxday.at - Lightning Talk
Linuxday.at - Lightning TalkLinuxday.at - Lightning Talk
Linuxday.at - Lightning Talk
 
Centralized Logging with syslog
Centralized Logging with syslogCentralized Logging with syslog
Centralized Logging with syslog
 
Fluentd v0.12 master guide
Fluentd v0.12 master guideFluentd v0.12 master guide
Fluentd v0.12 master guide
 
Logstash
LogstashLogstash
Logstash
 
From Zero To Production (NixOS, Erlang) @ Erlang Factory SF 2016
From Zero To Production (NixOS, Erlang) @ Erlang Factory SF 2016From Zero To Production (NixOS, Erlang) @ Erlang Factory SF 2016
From Zero To Production (NixOS, Erlang) @ Erlang Factory SF 2016
 
Rihards Olups - Encrypting Daemon Traffic With Zabbix 3.0
Rihards Olups - Encrypting Daemon Traffic With Zabbix 3.0Rihards Olups - Encrypting Daemon Traffic With Zabbix 3.0
Rihards Olups - Encrypting Daemon Traffic With Zabbix 3.0
 
Fluentd and PHP
Fluentd and PHPFluentd and PHP
Fluentd and PHP
 
{{more}} Kibana4
{{more}} Kibana4{{more}} Kibana4
{{more}} Kibana4
 
Life of an Fluentd event
Life of an Fluentd eventLife of an Fluentd event
Life of an Fluentd event
 
The basics of fluentd
The basics of fluentdThe basics of fluentd
The basics of fluentd
 
Nmap Scripting Engine and http-enumeration
Nmap Scripting Engine and http-enumerationNmap Scripting Engine and http-enumeration
Nmap Scripting Engine and http-enumeration
 
Elk with Openstack
Elk with OpenstackElk with Openstack
Elk with Openstack
 
The basics of fluentd
The basics of fluentdThe basics of fluentd
The basics of fluentd
 
Using Logstash, elasticsearch & kibana
Using Logstash, elasticsearch & kibanaUsing Logstash, elasticsearch & kibana
Using Logstash, elasticsearch & kibana
 
Puppet Availability and Performance at 100K Nodes - PuppetConf 2014
Puppet Availability and Performance at 100K Nodes - PuppetConf 2014Puppet Availability and Performance at 100K Nodes - PuppetConf 2014
Puppet Availability and Performance at 100K Nodes - PuppetConf 2014
 
ELK stack at weibo.com
ELK stack at weibo.comELK stack at weibo.com
ELK stack at weibo.com
 
Nessus scan report using microsoft patchs scan policy - Tareq Hanaysha
Nessus scan report using microsoft patchs scan policy - Tareq HanayshaNessus scan report using microsoft patchs scan policy - Tareq Hanaysha
Nessus scan report using microsoft patchs scan policy - Tareq Hanaysha
 
Fluentd meetup #2
Fluentd meetup #2Fluentd meetup #2
Fluentd meetup #2
 
Lua tech talk
Lua tech talkLua tech talk
Lua tech talk
 
Like loggly using open source
Like loggly using open sourceLike loggly using open source
Like loggly using open source
 

Ähnlich wie From nothing to Prometheus : one year after

Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Brian Brazil
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Brian Brazil
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Brian Brazil
 
Why NBC Universal Migrated to MongoDB Atlas
Why NBC Universal Migrated to MongoDB AtlasWhy NBC Universal Migrated to MongoDB Atlas
Why NBC Universal Migrated to MongoDB AtlasDatavail
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Brian Brazil
 
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기Ji-Woong Choi
 
Service Discovery using etcd, Consul and Kubernetes
Service Discovery using etcd, Consul and KubernetesService Discovery using etcd, Consul and Kubernetes
Service Discovery using etcd, Consul and KubernetesSreenivas Makam
 
Prometheus Training
Prometheus TrainingPrometheus Training
Prometheus TrainingTim Tyler
 
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...InfluxData
 
MongoDB World 2019: Why NBCUniversal Migrated to MongoDB Atlas
MongoDB World 2019: Why NBCUniversal Migrated to MongoDB AtlasMongoDB World 2019: Why NBCUniversal Migrated to MongoDB Atlas
MongoDB World 2019: Why NBCUniversal Migrated to MongoDB AtlasMongoDB
 
Fluentd - RubyKansai 65
Fluentd - RubyKansai 65Fluentd - RubyKansai 65
Fluentd - RubyKansai 65N Masahiro
 
Build reliable, traceable, distributed systems with ZeroMQ
Build reliable, traceable, distributed systems with ZeroMQBuild reliable, traceable, distributed systems with ZeroMQ
Build reliable, traceable, distributed systems with ZeroMQRobin Xiao
 
AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...
AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...
AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...Amazon Web Services
 
Building an Observability Platform in 389 Difficult Steps
Building an Observability Platform in 389 Difficult StepsBuilding an Observability Platform in 389 Difficult Steps
Building an Observability Platform in 389 Difficult StepsDigitalOcean
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
 
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...InfluxData
 
Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Brian Brazil
 
MNPHP Scalable Architecture 101 - Feb 3 2011
MNPHP Scalable Architecture 101 - Feb 3 2011MNPHP Scalable Architecture 101 - Feb 3 2011
MNPHP Scalable Architecture 101 - Feb 3 2011Mike Willbanks
 
Monitoring&Logging - Stanislav Kolenkin
Monitoring&Logging - Stanislav Kolenkin  Monitoring&Logging - Stanislav Kolenkin
Monitoring&Logging - Stanislav Kolenkin Kuberton
 

Ähnlich wie From nothing to Prometheus : one year after (20)

Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)
 
System monitoring
System monitoringSystem monitoring
System monitoring
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)
 
Why NBC Universal Migrated to MongoDB Atlas
Why NBC Universal Migrated to MongoDB AtlasWhy NBC Universal Migrated to MongoDB Atlas
Why NBC Universal Migrated to MongoDB Atlas
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
 
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기
 
Service Discovery using etcd, Consul and Kubernetes
Service Discovery using etcd, Consul and KubernetesService Discovery using etcd, Consul and Kubernetes
Service Discovery using etcd, Consul and Kubernetes
 
Prometheus Training
Prometheus TrainingPrometheus Training
Prometheus Training
 
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
 
MongoDB World 2019: Why NBCUniversal Migrated to MongoDB Atlas
MongoDB World 2019: Why NBCUniversal Migrated to MongoDB AtlasMongoDB World 2019: Why NBCUniversal Migrated to MongoDB Atlas
MongoDB World 2019: Why NBCUniversal Migrated to MongoDB Atlas
 
Fluentd - RubyKansai 65
Fluentd - RubyKansai 65Fluentd - RubyKansai 65
Fluentd - RubyKansai 65
 
Build reliable, traceable, distributed systems with ZeroMQ
Build reliable, traceable, distributed systems with ZeroMQBuild reliable, traceable, distributed systems with ZeroMQ
Build reliable, traceable, distributed systems with ZeroMQ
 
AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...
AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...
AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...
 
Building an Observability Platform in 389 Difficult Steps
Building an Observability Platform in 389 Difficult StepsBuilding an Observability Platform in 389 Difficult Steps
Building an Observability Platform in 389 Difficult Steps
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
 
Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)
 
MNPHP Scalable Architecture 101 - Feb 3 2011
MNPHP Scalable Architecture 101 - Feb 3 2011MNPHP Scalable Architecture 101 - Feb 3 2011
MNPHP Scalable Architecture 101 - Feb 3 2011
 
Monitoring&Logging - Stanislav Kolenkin
Monitoring&Logging - Stanislav Kolenkin  Monitoring&Logging - Stanislav Kolenkin
Monitoring&Logging - Stanislav Kolenkin
 

Kürzlich hochgeladen

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 

Kürzlich hochgeladen (20)

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 

From nothing to Prometheus : one year after

  • 1. FROM NOTHING TO PROMETHEUS ONE YEAR AFTER MEETUP CLOUD NATIVE COMPUTING PARIS - FEBRUARY 2018
  • 2. Speaker ID Antoine LEROYER ⬢ Infrastructure Engineer / SRE @ Deezer since 2016 ⬢ DevOps @ EDF (2013-2016) ⬢ Sysadmin @ Netvibes (2012-2013) 2 MEETUP CLOUD NATIVE COMPUTING PARIS - FEBRUARY 2018
  • 3. Agenda ⬢ Deezer in 30 seconds ⬢ State of Deezer Infrastructure in 2016 ⬢ Why Prometheus? ⬢ Let’s dive in our setup ⬢ What’s next for us in monitoring ⬢ Questions 3
  • 4. 4
  • 5. Deezer in 30 seconds 5
  • 6. Deezer in 30 seconds ⬢ Streaming music service ⬢ Launched in 2007 ⬢ Available on multiple devices: Mobile, Desktop, TV, Speakers, etc. 6 12M 185+ 43M active users countries tracks (and counting)
  • 7. State of Deezer Infrastructure In 2016 7
  • 8. ⬢ Fully managed by our provider ⬡ Rack and initial setup of servers ⬡ Configuration management ⬡ Monitoring ⬡ Alerting ⬢ Majority of bare metal servers (400+) ⬢ Infrastructure Team was small ⬢ Technical staff went big (x4 in one year) ⬢ ...so our team got new members to handle the growth :) 8 State of Deezer Infrastructure in 2016
  • 9. The new Infrastructure Team @ Deezer 9
  • 10. If you want to managed yourself the production without your provider, you need a proper monitoring solution. (and other things but that’s not the point here) 10 But first, we ask ourselves Okay, so what our needs? State of Deezer Infrastructure in 2016
  • 11. Our needs ⬢ Have a bunch of metrics to make nice graphs ⬢ Send alerts if something went wrong ⬢ Easy to deploy on the existing infrastructure ⬢ But also support container orchestration for the future ⬢ Being able to scale up/down without triggering alerts 11 State of Deezer Infrastructure in 2016
  • 13. What is Prometheus? ⬢ Open-source systems monitoring and alerting toolkit ⬢ Time series database with metric names and labels ⬢ Pulls time series over HTTP instead of having them pushed ⬢ Targets are discovered via service discovery ⬢ No distributed storage; nodes are autonomous (source: https://prometheus.io/docs/introduction/overview/)
  • 14. What is Prometheus? ⬢ Typical time series:
      # HELP http_requests_total The total number of HTTP requests.
      # TYPE http_requests_total counter
      http_requests_total{method="post",code="200"} 1027
      http_requests_total{method="post",code="400"} 3
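To make the text format on slide 14 concrete, here is a minimal sketch of how such lines could be read back into names, labels, and values. It is not the parser Prometheus uses: `parse_samples` and the two regular expressions are hypothetical helpers, and they only cover a simplified subset of the format (no escaped quotes in label values; lines carrying a timestamp are skipped).

```python
import re

# Matches `name{labels} value` or `name value`; a simplified subset of the format.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


EXAMPLE = """\
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
http_requests_total{method="post",code="400"} 3
"""

samples = parse_samples(EXAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

Each sample is one metric name plus a set of labels; the two lines in the example are therefore two distinct time series of the same counter.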
  • 15. Why Prometheus? ⬢ Designed for metrics (TSDB) ⬢ Provides alerting through Alertmanager ⬢ Grafana support ⬢ High performance ⬢ Powerful but simple query language (PromQL) ⬢ Service discovery ⬡ Follows your infrastructure as it scales up/down ⬡ Ready for container orchestration
  • 16. Let’s dive into our setup
  • 17. First, service discovery. We use Consul by HashiCorp (consul.io): ⬢ Already deployed on some servers ⬢ Supported by Prometheus ⬢ Blazing fast and lightweight ⬢ Service declaration with tag support ⬢ Bonus: ⬡ we get service checks ⬡ and a K/V store
  • 18. Then, Prometheus: ⬢ 2 monstrous servers in each PoP ⬡ 32 cores ⬡ 128 GB RAM ⬡ RAID 10 SSD ⬢ Currently running Prometheus 2.1 ⬢ Also, an Alertmanager cluster
  • 19. And also exporters. An exporter can be: ⬢ A daemon exposing metrics through an HTTP endpoint ⬢ An HTTP endpoint inside your application ⬢ A Prometheus Pushgateway. Your endpoint must expose plain-text data in the Prometheus format.
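As an illustration of the second option on slide 19 (an HTTP endpoint inside your application), the sketch below serves a counter in the plain-text format using only the Python standard library. The names `REQUEST_COUNTS`, `render_metrics`, `MetricsHandler`, and `serve` are hypothetical, and the counter values are hard-coded for the demo; a real exporter would more likely rely on a Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real application would update these as it works.
REQUEST_COUNTS = {("post", "200"): 1027, ("post", "400"): 3}


def render_metrics(counts):
    """Render the counters in the Prometheus plain-text exposition format."""
    lines = [
        "# HELP http_requests_total The total number of HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for (method, code), value in sorted(counts.items()):
        lines.append(f'http_requests_total{{method="{method}",code="{code}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port):
    """Block and serve /metrics on the given port (not called in this sketch)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


# Show what a scrape of /metrics would return.
print(render_metrics(REQUEST_COUNTS), end="")
```

Calling `serve(6110)` would expose the endpoint on the port advertised by the `exporter-6110` tag used later in the deck; the port choice here is only an example.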
  • 20. Prometheus infrastructure for one datacenter
  • 21. How do I monitor a server and its services? 1. Deploy a Consul agent and a bunch of exporters on a node 2. Add services to your Consul agent with some tags 3. ???? 4. Profit!!!!
  • 22. Consul Agent configuration. Callouts: the "prod" tag is used to filter by environment; the "exporter-6110" tag carries the port where my exporter is listening.
      # Consul Service JSON for Apache
      {
        "service": {
          "name": "apache",
          "tags": [ "prod", "apache", "exporter-6110" ],
          "address": "",
          "port": 443,
          "enableTagOverride": false,
          "checks": [
            { "script": "apache-check.sh", "interval": "5s" }
          ]
        }
      }
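If many services follow the same tagging convention, the definition on slide 22 could be generated rather than written by hand. This is only a sketch: `consul_service` is a hypothetical helper, it mirrors the field names of the Apache example, and it leaves out the `checks` section.

```python
import json


def consul_service(name, environment, exporter_port, service_port):
    """Build a Consul service definition shaped like the Apache example."""
    return {
        "service": {
            "name": name,
            # Tags: environment filter, service name, and the exporter port marker
            "tags": [environment, name, f"exporter-{exporter_port}"],
            "address": "",
            "port": service_port,
            "enableTagOverride": False,
        }
    }


definition = consul_service("apache", "prod", 6110, 443)
print(json.dumps(definition, indent=2))
```

Keeping the exporter port in a tag, separate from the service's own `port`, is what later lets the scrape address be rebuilt from service discovery data.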
  • 23. Prometheus relabeling: a strong feature ⬢ Before scraping, Prometheus allows you to change or create labels ⬢ You can create labels to help you identify your metrics
      # Replace job label with service name
      - source_labels: [__meta_consul_service]
        target_label: job
      # Add datacenter label
      - source_labels: [__meta_consul_dc]
        target_label: dc
      # Add instance name
      - source_labels: [__meta_consul_node]
        target_label: instance
      # Create a group label from node name
      - source_labels: [__meta_consul_node]
        regex: ^(blm|dzr|dev)-([a-z]+)-.*
        target_label: group
        replacement: ${2}
  • 24. Prometheus relabeling: a strong feature ⬢ You can change internal labels of Prometheus ⬡ They start with __ and will be removed before storing the metric ⬡ You can override labels used for scraping to obtain a dynamic configuration
      # Retrieve exporter port from consul tags
      - source_labels: [__meta_consul_tags]
        regex: .*,exporter-([0-9]+),.*
        target_label: __exporter_port
        replacement: ${1}
      # Define addr:port to scrape
      - source_labels: [__meta_consul_address,__exporter_port]
        separator: ":"
        target_label: __address__
        replacement: ${1}
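To see what the relabeling rules on slides 23 and 24 do to a discovered target, the sketch below simulates the `replace` action in Python. It is a simplified model, not Prometheus code: `expand` and `relabel` are hypothetical helpers, only `${N}` references are handled, and the node name and address in `discovered` are made up. It assumes, as Prometheus does for Consul, that the tag list is joined with commas and also wrapped in them, which is why the regex `.*,exporter-([0-9]+),.*` can match the first or last tag.

```python
import re


def expand(template, match):
    """Expand Go-style ${N} references using the groups of a regex match."""
    return re.sub(r"\$\{(\d+)\}", lambda m: match.group(int(m.group(1))) or "", template)


def relabel(labels, rules):
    """Apply 'replace' relabel rules in order and return the new label set."""
    labels = dict(labels)
    for rule in rules:
        separator = rule.get("separator", ";")
        source = separator.join(labels.get(name, "") for name in rule["source_labels"])
        # Prometheus anchors the regex at both ends, which fullmatch reproduces.
        match = re.fullmatch(rule.get("regex", "(.*)"), source, flags=re.DOTALL)
        if match is None:
            continue  # no match: the rule leaves the labels untouched
        labels[rule["target_label"]] = expand(rule.get("replacement", "${1}"), match)
    return labels


RULES = [
    {"source_labels": ["__meta_consul_service"], "target_label": "job"},
    {"source_labels": ["__meta_consul_node"], "regex": r"^(blm|dzr|dev)-([a-z]+)-.*",
     "target_label": "group", "replacement": "${2}"},
    {"source_labels": ["__meta_consul_tags"], "regex": r".*,exporter-([0-9]+),.*",
     "target_label": "__exporter_port", "replacement": "${1}"},
    {"source_labels": ["__meta_consul_address", "__exporter_port"], "separator": ":",
     "target_label": "__address__", "replacement": "${1}"},
]

# Hypothetical discovered target; node name and address are made up for illustration.
discovered = {
    "__meta_consul_service": "apache",
    "__meta_consul_node": "dzr-web-01",
    "__meta_consul_address": "10.0.0.12",
    "__meta_consul_tags": ",prod,apache,exporter-6110,",
}

result = relabel(discovered, RULES)
print(result["job"], result["group"], result["__address__"])
```

With these inputs the target ends up with `job` set to the service name, `group` taken from the middle part of the node name, and `__address__` pointing at the exporter port rather than the service port.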
  • 26. Just a bunch of exporters
  • 27. Typical week @ Deezer
  • 28. Typical day for memcached
  • 29. Impact of Prometheus v2 (graph: 1.8.2 vs 2.1.0)
  • 30. Impact of Prometheus v2 (graph: 1.8.2 vs 2.1.0)
  • 31. Some stats about Prometheus itself: ⬢ We have over 2.3 million time series ⬢ It scrapes ~57k samples per second ⬢ 30s scrape interval in general ⬢ Nothing running late so far
  • 32. OS tuning
      # SSD tuning
      echo 0 > /sys/block/sdX/queue/rotational
      echo deadline > /sys/block/sdX/queue/scheduler
      # /etc/sysctl.d/local.conf
      vm.swappiness=1
      # /etc/security/limits.d/00prometheus
      prometheus - nofile 10000000
      # If you have an Intel CPU, want consistent CPU frequencies, and scaling_governor
      # doesn’t work, put this in your kernel boot args.
      intel_pstate=disable
  • 33. Some 1.6.x to 1.8.x settings (in case you need them):
      # Equal to 2/3 of your total memory
      -storage.local.target-heap-size
      # Set it to 5m to reduce load on the SSD
      -storage.local.checkpoint-interval
      # If you have a large number of time series and a low scrape interval,
      # you can easily increase this above 10k
      -storage.local.num-fingerprint-mutexes
      # If you have SSDs, you can set this one really high
      -storage.local.checkpoint-dirty-series-limit
    Source: Configuring Prometheus for High Performance [A] - Björn Rabenstein, SoundCloud Ltd.
  • 34. In 2.x: new TSDB engine. Just one setting: --storage.tsdb.retention. Prometheus takes care of the rest. Just ensure you have enough disk space (how much depends on retention).
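For the disk-space point on slide 34, a rough estimate can be computed with the rule of thumb from the Prometheus storage documentation: retention time in seconds, times ingested samples per second, times bytes per sample. The sketch below plugs in the ~57k samples per second quoted in the deck. The 15-day retention and the 1 to 2 bytes per sample range are assumptions for illustration, not measurements from this setup, and `needed_disk_bytes` is a hypothetical helper.

```python
def needed_disk_bytes(retention_days, samples_per_second, bytes_per_sample):
    """Rule-of-thumb TSDB size: retention seconds x ingestion rate x sample size."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed inputs: 15 days of retention, ~57k samples/s, 1 to 2 bytes per sample.
for bytes_per_sample in (1, 2):
    estimate = needed_disk_bytes(15, 57_000, bytes_per_sample)
    print(f"{bytes_per_sample} B/sample: {estimate / 1e9:.1f} GB")
```

Since the estimate scales linearly with retention, doubling the retention period doubles the space needed at the same ingestion rate.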
  • 35. What’s next for us in monitoring?
  • 36. What’s next for us in monitoring? ⬢ Go beyond 15 days of retention ⬡ Use the remote read/write feature to export data and read it back ⬢ Experiment with remote read to have only one endpoint to read metrics from ⬢ Alerting as a Service ⬡ Try to automate the creation of Prometheus alerting rules ⬡ Provision an Alertmanager for each team ⬢ Write some exporters :) ⬢ Kubernetes!