Monitoring Infrastructure with Prometheus
- Mohd Sahnawaz, Sr. SRE
Singapore Kubernetes User Group, 4th July 2018

Currently…
➔ 600+ servers
➔ External load balancers see an average of 5k+ requests per second
➔ Internal request amplification of 8x to 12x
➔ Self-managed deployments:
◆ Elasticsearch (dynamic scaling)
◆ PostgreSQL
◆ Cassandra
◆ Kafka
◆ Redis
◆ RabbitMQ
◆ And more…
➔ Uptime of 99.95%
➔ Ability to handle AZ failures
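The scale figures above translate into a few concrete numbers. This sketch assumes "5k+" means a floor of 5,000 requests per second and uses 30-day months and 365-day years for the uptime budget. The function names are illustrative only.

```python
# Back-of-the-envelope numbers from the "Currently…" slide.
# Assumptions: 5,000 rps as the external floor, 30-day month, 365-day year.

EXTERNAL_RPS = 5_000
AMPLIFICATION = (8, 12)   # internal requests per external request
UPTIME = 0.9995           # 99.95%


def internal_rps(external_rps: int, factor: int) -> int:
    """Internal request rate implied by an amplification factor."""
    return external_rps * factor


def downtime_budget_minutes(uptime: float, days: int) -> float:
    """Minutes of downtime allowed over `days` at the given uptime target."""
    return (1 - uptime) * days * 24 * 60


low, high = (internal_rps(EXTERNAL_RPS, f) for f in AMPLIFICATION)
print(f"internal load: {low:,} to {high:,} rps")                           # 40,000 to 60,000
print(f"per 30 days:  {downtime_budget_minutes(UPTIME, 30):.1f} min")      # 21.6
print(f"per 365 days: {downtime_budget_minutes(UPTIME, 365) / 60:.2f} h")  # 4.38
```

So at these figures the internal services absorb roughly 40k to 60k requests per second, and 99.95% leaves about 21.6 minutes of downtime per 30 days.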

Monitoring
➔ Dashboard: Grafana
➔ Data store: Prometheus
➔ Metrics sources:
● Exporters
○ Node
○ Postgres
○ JVM
○ Elasticsearch
○ HAProxy
○ Hystrix
○ StatsD
○ …
● Write your own

Dashboard: Grafana (cont…)
K8S Deployment view →
Hystrix Metric view →

➔ GCE SD configurations to populate hosts
➔ Instance metadata to group nodes into clusters
Data source: Prometheus (cont…)
➔ Multiple instances with different retention periods
➔ Separate dedicated instances for APM, node metrics, ICMP and Kubernetes
➔ Grafana connects to all of these

Metrics source: GCE with exporters
➔ GCE SD configurations to populate hosts
scrape_configs:
  - job_name: node
    scrape_interval: 15s
    scrape_timeout: 15s
    gce_sd_configs:
      - project: <project_name>
        zone: <zone_name>
        port: 9100
        filter: "(name ne .*stage.*)(name ne .*test.*)"
    relabel_configs:
      - source_labels: [__meta_gce_instance_name]
        target_label: host
      - source_labels: [__meta_gce_zone]
        separator: '/'
        regex: '(.*)/(.*)'
        replacement: '${2}'
        target_label: zone
      - source_labels: [__meta_gce_metadata_cluster, __meta_gce_metadata_cluster_name]
        separator: ';'
        regex: '(.*);(.*)'
        replacement: '${1}${2}'
        target_label: cluster
      - source_labels: [__meta_gce_metadata_env]
        target_label: env
      - source_labels: [__meta_gce_metadata_component]
        target_label: component
Target Labels
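Relabeling rules run top to bottom against each discovered target. Below is a simplified Python model of the `replace` action: source labels are joined with `separator`, the regex is anchored at both ends, `${N}` is expanded, and an empty result removes the label. It is run against a hypothetical GCE instance, with `__meta_gce_zone` given as a full zone URL. This is a sketch of the semantics, not Prometheus's own code, and all names and values in it are made up.

```python
import re


def _expand(template: str, match) -> str:
    # Only the ${N} placeholders used in this deck are supported.
    return re.sub(r"\$\{(\d+)\}",
                  lambda m: match.group(int(m.group(1))) or "", template)


def replace_rule(labels, source_labels, target_label,
                 regex="(.*)", replacement="${1}", separator=";"):
    """Simplified Prometheus `action: replace` on a dict of labels."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)   # anchored at both ends, like Prometheus
    if match is None:
        return labels                    # no match: labels unchanged
    out = dict(labels)
    result = _expand(replacement, match)
    if result:
        out[target_label] = result
    else:
        out.pop(target_label, None)      # empty result removes the label
    return out


# Hypothetical discovered instance (all values made up).
target = {
    "__meta_gce_instance_name": "es-data-1",
    "__meta_gce_zone": "https://www.googleapis.com/compute/v1/projects/demo/zones/asia-southeast1-a",
    "__meta_gce_metadata_cluster": "search",
    "__meta_gce_metadata_env": "prod",
    "__meta_gce_metadata_component": "elasticsearch",
}

# The relabel_configs from the node job, in order.
rules = [
    dict(source_labels=["__meta_gce_instance_name"], target_label="host"),
    dict(source_labels=["__meta_gce_zone"], separator="/",
         regex="(.*)/(.*)", replacement="${2}", target_label="zone"),
    dict(source_labels=["__meta_gce_metadata_cluster",
                        "__meta_gce_metadata_cluster_name"],
         separator=";", regex="(.*);(.*)", replacement="${1}${2}",
         target_label="cluster"),
    dict(source_labels=["__meta_gce_metadata_env"], target_label="env"),
    dict(source_labels=["__meta_gce_metadata_component"], target_label="component"),
]

for rule in rules:
    target = replace_rule(target, **rule)

# Labels starting with "__" are removed once target relabeling is done.
final = {k: v for k, v in target.items() if not k.startswith("__")}
print(final)
```

The greedy first group in the `zone` rule throws away everything up to the last `/`, which leaves only the zone name. The `cluster` rule joins two metadata keys and concatenates both captures, so whichever of `cluster` or `cluster_name` is set ends up in the label. If both were set, their values would be concatenated.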

Metrics source: K8S API
➔ K8S SD configurations to populate pods
- job_name: 'pod-metrics'
  scrape_interval: 15s
  scrape_timeout: 5s
  kubernetes_sd_configs:
    - api_server: '<api-path>'
      role: node
      basic_auth:
        username: <user>
        password: <access_token>
      tls_config:
        insecure_skip_verify: true
    - api_server: '<api-path>'
      role: pod
      basic_auth:
        username: <user>
        password: <access_token>
      tls_config:
        insecure_skip_verify: true
  scheme: http
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: kubernetes_pod_name
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: host
    # With no regex, keep uses the default '(.*)', so this rule keeps every target.
    - source_labels: [__address__]
      action: keep
    - source_labels: [__address__]
      action: replace
      target_label: address
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_pod_container_name]
      action: replace
      target_label: kubernetes_container
    - source_labels: [__meta_kubernetes_pod_container_name]
      action: replace
      target_label: cluster
    # Only container ports numbered 9284 are scraped; all other targets are dropped.
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      regex: '9284'
      action: keep
    - source_labels: [__meta_kubernetes_pod_node_name]
      action: replace
      target_label: kubernetes_node_name
Target Labels
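The two `keep` rules in this job decide which discovered targets get scraped at all. Here is a simplified Python model of `action: keep`: the target survives only if the anchored regex matches the joined source label values. It is applied to hypothetical targets. The pod role yields one target per declared container port, while node-role targets carry no pod port label. Addresses and pod names are made up.

```python
import re


def keep_rule(labels: dict, source_labels: list, regex: str = "(.*)",
              separator: str = ";") -> bool:
    """Simplified Prometheus `action: keep`: True if the target survives."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    # The regex is anchored at both ends, hence fullmatch.
    return re.fullmatch(regex, value) is not None


def survives(labels: dict) -> bool:
    # The two keep rules from the pod-metrics job, applied in order.
    return (keep_rule(labels, ["__address__"])  # default regex: always true
            and keep_rule(labels, ["__meta_kubernetes_pod_container_port_number"],
                          regex="9284"))


# Hypothetical discovered targets.
targets = [
    {"__address__": "10.4.1.7:9284", "__meta_kubernetes_pod_name": "api-7d9f",
     "__meta_kubernetes_pod_container_port_number": "9284"},
    {"__address__": "10.4.1.7:8080", "__meta_kubernetes_pod_name": "api-7d9f",
     "__meta_kubernetes_pod_container_port_number": "8080"},
    {"__address__": "10.4.2.3:10250", "__meta_kubernetes_node_name": "node-a"},
]

kept = [t["__address__"] for t in targets if survives(t)]
print(kept)  # ['10.4.1.7:9284']
```

In this model the node-role target is dropped by the port filter because it has no `__meta_kubernetes_pod_container_port_number` label. A missing label is treated as an empty string, and an empty string does not match `9284`.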

Metrics source: Hystrix
➔ Hystrix is great for real-time monitoring.
➔ It helps in quickly identifying failures.
➔ We capture Hystrix data in Prometheus.
➔ This helps in debugging and retrospectives.
Diagram components: app, statsd-exporter, prometheus, turbine
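This pipeline relies on statsd-exporter turning statsd lines into Prometheus series. The sketch below is a minimal illustration that makes two assumptions. First, it assumes the app (or turbine) emits plain `name:value|type` lines with dot-separated names. Second, it assumes the exporter's default handling when no mapping rule matches is to replace characters that are invalid in Prometheus metric names with `_`. The helper names and the sample metric are hypothetical.

```python
import re
from typing import NamedTuple


class StatsdSample(NamedTuple):
    name: str
    value: float
    kind: str   # statsd type, e.g. "c" counter, "g" gauge, "ms" timer


def parse_statsd_line(line: str) -> StatsdSample:
    """Parse one plain statsd line of the form name:value|type."""
    name, rest = line.strip().split(":", 1)
    value, kind = rest.split("|", 1)
    kind = kind.split("|", 1)[0]   # ignore a trailing sample rate such as |@0.1
    return StatsdSample(name, float(value), kind)


def to_prometheus_name(statsd_name: str) -> str:
    # Assumption: characters invalid in Prometheus metric names (such as '.')
    # become '_' when no explicit mapping rule applies.
    return re.sub(r"[^a-zA-Z0-9_]", "_", statsd_name)


sample = parse_statsd_line("checkout.hystrix.GetUser.latencyExecute_mean:12|g")
print(to_prometheus_name(sample.name), sample.value, sample.kind)
```

Names shaped like `<cluster>_hystrix_<command>_<metric>` are what the metric relabeling rules on the following slide take apart.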

Metrics source: Hystrix
➔ Custom metrics with Hystrix
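scrape_configs:
  - job_name: 'hystrix-stats'
    scrape_interval: '15s'
    file_sd_configs:
      - files:
          - 'hystrix-stats*.yml'
        refresh_interval: 5m
    metric_relabel_configs:
      # <cluster>_hystrix_... -> hcluster="<cluster>"
      - source_labels: [__name__]
        regex: '^(.*)_hystrix.*'
        target_label: hcluster
        replacement: '${1}'
      # Strip the cluster prefix from the metric name.
      - source_labels: [__name__]
        regex: '^.*_(hystrix.*)'
        target_label: __name__
        replacement: '${1}'
      # hystrix_<command>_... -> command="<command>"
      - source_labels: [__name__]
        regex: '^hystrix_([^_]+)_.*'
        target_label: command
        replacement: '${1}'
      # hystrix_<command>_<metric> -> hystrix_<metric>
      - source_labels: [__name__]
        regex: '^hystrix_[^_]+_(.*)'
        target_label: __name__
        replacement: 'hystrix_${1}'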
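The four metric relabeling rules on this slide run in order, and each one reads the `__name__` value left by the previous rule. The sketch replays them in plain Python on the hypothetical metric name `checkout_hystrix_GetUser_latencyExecute_mean`. It assumes the `<cluster>_hystrix_<command>_<metric>` naming that the regexes imply and models only the `replace` action with `${N}` placeholders.

```python
import re

# The four metric_relabel_configs rules, as (regex, target_label, replacement).
HYSTRIX_RULES = [
    (r"^(.*)_hystrix.*", "hcluster", "${1}"),              # <cluster>_hystrix... -> hcluster
    (r"^.*_(hystrix.*)", "__name__", "${1}"),              # drop the cluster prefix
    (r"^hystrix_([^_]+)_.*", "command", "${1}"),           # hystrix_<command>_... -> command
    (r"^hystrix_[^_]+_(.*)", "__name__", "hystrix_${1}"),  # drop the command from the name
]


def _expand(template: str, match) -> str:
    # Only the ${N} placeholders used in this deck are supported.
    return re.sub(r"\$\{(\d+)\}",
                  lambda m: match.group(int(m.group(1))) or "", template)


def relabel_metric(labels: dict) -> dict:
    """Apply HYSTRIX_RULES in order, modelling Prometheus `action: replace`."""
    out = dict(labels)
    for regex, target, replacement in HYSTRIX_RULES:
        match = re.fullmatch(regex, out.get("__name__", ""))  # anchored regex
        if match is None:
            continue                    # no match: series left unchanged
        result = _expand(replacement, match)
        if result:
            out[target] = result
        else:
            out.pop(target, None)       # empty result removes the label
    return out


print(relabel_metric({"__name__": "checkout_hystrix_GetUser_latencyExecute_mean"}))
```

The result is the metric `hystrix_latencyExecute_mean` with labels `hcluster="checkout"` and `command="GetUser"`. Because of `[^_]+`, a command name that itself contains an underscore would be cut at its first underscore.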

Metrics source: Custom
➔ Official Go client: https://github.com/prometheus/client_golang
➔ Supports four metric types (counter, gauge, histogram, summary)
➔ Built-in support for common gRPC metrics
➔ Metrics are exposed at http://ip:port/metrics

Alerting: Alertmanager
Diagram: Prometheus → Alertmanager → Slack, VictorOps
Alerting process →
Alert rule →
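An alert rule pairs a PromQL expression with an optional `for` duration. While the expression returns results, the alert is pending. It fires only once the condition has held for that long, and at that point Prometheus sends it to Alertmanager, which routes it to Slack or VictorOps. The sketch models only that state transition for a single alert. `AlertState` and `evaluate` are hypothetical names, and resolved notifications, label sets and routing are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    for_seconds: float                     # the rule's `for` duration
    active_since: Optional[float] = None   # when the condition started holding

    def evaluate(self, condition: bool, now: float) -> str:
        """Return 'inactive', 'pending' or 'firing' for one rule evaluation."""
        if not condition:
            self.active_since = None       # condition cleared: reset
            return "inactive"
        if self.active_since is None:
            self.active_since = now        # first evaluation with results
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# A rule with `for: 1m`, evaluated every 15s while the condition holds.
alert = AlertState(for_seconds=60)
print([alert.evaluate(True, t) for t in (0, 15, 30, 45, 60)])
```

A longer `for` filters out short blips at the cost of slower notification. With `for` set to zero, the alert fires on the first evaluation that returns results.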

Thank you!
Q & A
We’re hiring, visit careers.carousell.com
