SlideShare ist ein Scribd-Unternehmen logo
1 von 58
Downloaden Sie, um offline zu lesen
Monitoring @ Facebook
Ran Leibman
Production Engineer
Monitoring Tools, Components, & Mentality at Facebook
Who Am I ?
Agenda
• Problems in today’s monitoring, solutions & approaches
• Facebook Monitoring Architecture
• Dive into each component
• Show Use cases
The Problems
Problems - Nagios Checks
•Good for binary checks
•Monitoring based on a “point in time” when the script executes
•Can’t monitor on a time window (can do clowny checks with a
temp file but that is not so elegant …)
•Perf data
•Not all plugins implement this
•Data gathering is coupled with the way you want to alert
•Hard to aggregate from perf data from multiple checks
Problems - Cron Scripts
•Difficult to put this scripts in “maintenance mode” while deploying
code or acknowledging an issue
•How do you know that your script actually runs?
•Who stopped crond!?
•Error handling?
Problems - Metrics R 2nd Class Citizen
•We are not always treating our metric store like we treat our
application data
•Storing metrics in clowny temp files or some unmaintained mysql
•Metric is DATA like any other and we should treat it as such!
•How & from where we are going to query it?
•What is the best way to store it?
•Retention?
Problems - Ops Ownership
•Usually only the ops teams
owns monitoring
•Even if the developers wants
to add metrics and alerts they
have a steep learning curve in
order to achieve this
Facebook Monitoring Architecture
Operational Data Store - ODS
Operational Data Store - ODS
•“key —> float” type of metrics, associated with an entity
•system.load1, system.cpu-user, system.n_eth-txbyt
•chef.run_sucess
•chef.last_run_time
•myapp.request_num
•myapp.request_median
Entity / Key-Value
Operational Data Store - ODS
•Use Gorilla in-memory TSDB for short term data (24 hours)
•Store permanent data in HBase on top of HDFS
•Aggregation
•by rack / cluster / tier
•by custom tags - app, tier name, etc …
•cross datacenter aggregations
Data Store
Collecting Metrics
Operational Data Store - ODS
• API modules for every imaginable language
• Thrift Endpoint - https://thrift.apache.org
• Implements fb303 counters
• fb303 counters are collected from the service by FBAgent
• FBAgent submit the metrics over to ODS
Retention
Operational Data Store - ODS
•Metrics is A LOT of data!
•All ODS metrics are being rolled up in the same way
•Daily - 2 Weeks (depends on you)
•Weekly - 2 Weeks (1min)
•Monthly - 1 Month (1h)
•Yearly - until the end of time (1h)
•How do I solve the data loss??
Aggregation
Operational Data Store
•Aggregate important metrics save spikes
•p50, p90, p99
•top(N)
•count
•min, max
•Aggregate by cluster | rack | custom
Scuba - Real-Time Deep Dive Log Monitoring
Whats is going on RIGHT NOW with my service?
Scuba - Real-Time Log Monitoring
•Was started as a hackathon ! today we can’t live without it
•Combine application logs from all servers & containers into a
single table
•Data is stored in memory
•Very small lag (<1min)
•Super fast queries (median of ~300ms)
•SQL like query syntax
Logging to scuba
Scuba - Real-Time Log Monitoring
• Libraries for every imaginable language
• PHP, Python, C++, Bash, etc …
• Scuba supports:
• String
• Ints
• Set of String
• Stack of strings (usually for stack traces)
So… What’s the catch ?
Scuba - Real-Time Log Monitoring
• Strict quota policy
• by size
• by time
• Use sampling in order to reduce load
• Not good for pipelining of data
Alarm System
What is an Alert?
Alarm System
•Creating an alert does not mean you’ll be notified!
•Alert is an event stating that something happened
•Can’t ssh to server
•p50 of request time is slower than 100ms 80% of the time in the last 10min
•the application tier in us-east is 50% down
•The alert should contain ALL relevant data about the event
•Alerts can be suppressed in case of maintenance
Alarm System - Alert Structure
FBAR - Facebook Auto-Remediation
Automation Automation Automation !
FBAR
• Most alarms could be auto remediate
without human intervention
• Code it once, never do it again
• Doing the work of 136,000 engineering
hours (29/04/2015)
•136,000 / 8 = 17,000 engineers a day !
FBAR
FBAR
FBAR
Notifications & Subscriptions
What should I be paged about?
Notification & Subscriptions
•Actionable alerts
•Impactful alerts
•Before you subscribe to an alert ask yourself:
•Can I automate this? (FBAR!)
•Is this actionable?
•Should an engineer wake up because of this?
Notification & Subscriptions
Dashboards
What are they good for ?
Dashboards
• Awesome tool for debugging production issues
• Making your case:
• We need to fix this code path
• If we had the BLABLA tool it would reduce this by a factor of X
• Since we deployed the last release engagement dropped by
10% in west Europe
• Dashboards are cool to look at =) it’s the best way to get an
understanding of what is going on with the service
Cubism
based on Cubism.js
? ‫לנו‬ ‫היה‬ ‫מה‬ ‫אז‬
Use data!
Think before you alert
Surfacing problems you never thought
existed
Monitoring is not an “Ops Job”
Questions?
Ran Leibman
Production Engineer
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Weitere ähnliche Inhalte

Was ist angesagt?

Logging with Elasticsearch, Logstash & Kibana
Logging with Elasticsearch, Logstash & KibanaLogging with Elasticsearch, Logstash & Kibana
Logging with Elasticsearch, Logstash & Kibana
Amazee Labs
 
[OpenStack Days Korea 2016] Track1 - 카카오는 오픈스택 기반으로 어떻게 5000VM을 운영하고 있을까?
[OpenStack Days Korea 2016] Track1 - 카카오는 오픈스택 기반으로 어떻게 5000VM을 운영하고 있을까?[OpenStack Days Korea 2016] Track1 - 카카오는 오픈스택 기반으로 어떻게 5000VM을 운영하고 있을까?
[OpenStack Days Korea 2016] Track1 - 카카오는 오픈스택 기반으로 어떻게 5000VM을 운영하고 있을까?
OpenStack Korea Community
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
Databricks
 

Was ist angesagt? (20)

Docker containers
Docker containersDocker containers
Docker containers
 
Logging with Elasticsearch, Logstash & Kibana
Logging with Elasticsearch, Logstash & KibanaLogging with Elasticsearch, Logstash & Kibana
Logging with Elasticsearch, Logstash & Kibana
 
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and LogstashKeeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
 
[OpenStack Days Korea 2016] Track1 - 카카오는 오픈스택 기반으로 어떻게 5000VM을 운영하고 있을까?
[OpenStack Days Korea 2016] Track1 - 카카오는 오픈스택 기반으로 어떻게 5000VM을 운영하고 있을까?[OpenStack Days Korea 2016] Track1 - 카카오는 오픈스택 기반으로 어떻게 5000VM을 운영하고 있을까?
[OpenStack Days Korea 2016] Track1 - 카카오는 오픈스택 기반으로 어떻게 5000VM을 운영하고 있을까?
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
 
LoadBalancer using KeepAlived
LoadBalancer using KeepAlivedLoadBalancer using KeepAlived
LoadBalancer using KeepAlived
 
Automating Network Infrastructure : Ansible
Automating Network Infrastructure : AnsibleAutomating Network Infrastructure : Ansible
Automating Network Infrastructure : Ansible
 
InfluxDB & Grafana
InfluxDB & GrafanaInfluxDB & Grafana
InfluxDB & Grafana
 
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교 및 구축 방법
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교  및 구축 방법[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교  및 구축 방법
[오픈소스컨설팅] 쿠버네티스와 쿠버네티스 on 오픈스택 비교 및 구축 방법
 
ELK Elasticsearch Logstash and Kibana Stack for Log Management
ELK Elasticsearch Logstash and Kibana Stack for Log ManagementELK Elasticsearch Logstash and Kibana Stack for Log Management
ELK Elasticsearch Logstash and Kibana Stack for Log Management
 
SeaweedFS introduction
SeaweedFS introductionSeaweedFS introduction
SeaweedFS introduction
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
High Performance Object Storage in 30 Minutes with Supermicro and MinIO
High Performance Object Storage in 30 Minutes with Supermicro and MinIOHigh Performance Object Storage in 30 Minutes with Supermicro and MinIO
High Performance Object Storage in 30 Minutes with Supermicro and MinIO
 
Understanding InfluxDB’s New Storage Engine
Understanding InfluxDB’s New Storage EngineUnderstanding InfluxDB’s New Storage Engine
Understanding InfluxDB’s New Storage Engine
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
 
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 

Andere mochten auch

How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...
DevOpsDays Tel Aviv
 

Andere mochten auch (20)

OSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin Parm
OSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin ParmOSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin Parm
OSMC 2015: Monitoring at Spotify-When things go ping in the night by Martin Parm
 
2016 metrics-as-culture
2016 metrics-as-culture2016 metrics-as-culture
2016 metrics-as-culture
 
Monitoring Challenges - Monitorama 2016 - Monitoringless
Monitoring Challenges - Monitorama 2016 - MonitoringlessMonitoring Challenges - Monitorama 2016 - Monitoringless
Monitoring Challenges - Monitorama 2016 - Monitoringless
 
Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)
 
Statistics for Engineers
Statistics for EngineersStatistics for Engineers
Statistics for Engineers
 
Production testing through monitoring
Production testing through monitoringProduction testing through monitoring
Production testing through monitoring
 
Scaling Pinterest's Monitoring
Scaling Pinterest's MonitoringScaling Pinterest's Monitoring
Scaling Pinterest's Monitoring
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
 
DevOpsDays Amsterdam - Monitoring at Service Provider Scale
DevOpsDays Amsterdam - Monitoring at Service Provider ScaleDevOpsDays Amsterdam - Monitoring at Service Provider Scale
DevOpsDays Amsterdam - Monitoring at Service Provider Scale
 
Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016
 
Bruce Lawson: Progressive Web Apps: the future of Apps
Bruce Lawson: Progressive Web Apps: the future of AppsBruce Lawson: Progressive Web Apps: the future of Apps
Bruce Lawson: Progressive Web Apps: the future of Apps
 
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...
How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOp...
 
AppSphere 15 - Mining the World’s Largest Healthcare Data Warehouse while Ens...
AppSphere 15 - Mining the World’s Largest Healthcare Data Warehouse while Ens...AppSphere 15 - Mining the World’s Largest Healthcare Data Warehouse while Ens...
AppSphere 15 - Mining the World’s Largest Healthcare Data Warehouse while Ens...
 
Metals, nonmetals, metalloids
Metals, nonmetals, metalloidsMetals, nonmetals, metalloids
Metals, nonmetals, metalloids
 
AppSphere 15 - Containers and Microservices Create New Performance Challenges
AppSphere 15 - Containers and Microservices Create New Performance ChallengesAppSphere 15 - Containers and Microservices Create New Performance Challenges
AppSphere 15 - Containers and Microservices Create New Performance Challenges
 
Visibility from user to infrastructure on AWS
Visibility from user to infrastructure on AWSVisibility from user to infrastructure on AWS
Visibility from user to infrastructure on AWS
 
AppSphere 15 - Application Analytics helping DevOps with Data Driven Decision...
AppSphere 15 - Application Analytics helping DevOps with Data Driven Decision...AppSphere 15 - Application Analytics helping DevOps with Data Driven Decision...
AppSphere 15 - Application Analytics helping DevOps with Data Driven Decision...
 
Unified Monitoring Webinar with Dustin Whittle
Unified Monitoring Webinar with Dustin WhittleUnified Monitoring Webinar with Dustin Whittle
Unified Monitoring Webinar with Dustin Whittle
 
AppSphere 2016 - Automate performance testing with AppDynamics using continuo...
AppSphere 2016 - Automate performance testing with AppDynamics using continuo...AppSphere 2016 - Automate performance testing with AppDynamics using continuo...
AppSphere 2016 - Automate performance testing with AppDynamics using continuo...
 
How to Speak "Manager"
How to Speak "Manager"How to Speak "Manager"
How to Speak "Manager"
 

Ähnlich wie Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Amazon Web Services
 
Data Care, Feeding, and Maintenance
Data Care, Feeding, and MaintenanceData Care, Feeding, and Maintenance
Data Care, Feeding, and Maintenance
Mercedes Coyle
 
EM12c Monitoring, Metric Extensions and Performance Pages
EM12c Monitoring, Metric Extensions and Performance PagesEM12c Monitoring, Metric Extensions and Performance Pages
EM12c Monitoring, Metric Extensions and Performance Pages
Enkitec
 

Ähnlich wie Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015 (20)

Building data intensive applications
Building data intensive applicationsBuilding data intensive applications
Building data intensive applications
 
Itsummit2015 blizzard
Itsummit2015 blizzardItsummit2015 blizzard
Itsummit2015 blizzard
 
Building high performance and scalable share point applications
Building high performance and scalable share point applicationsBuilding high performance and scalable share point applications
Building high performance and scalable share point applications
 
Metrics & more
Metrics & more Metrics & more
Metrics & more
 
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
 
Building an Open Source AppSec Pipeline - 2015 Texas Linux Fest
Building an Open Source AppSec Pipeline - 2015 Texas Linux FestBuilding an Open Source AppSec Pipeline - 2015 Texas Linux Fest
Building an Open Source AppSec Pipeline - 2015 Texas Linux Fest
 
Data Care, Feeding, and Maintenance
Data Care, Feeding, and MaintenanceData Care, Feeding, and Maintenance
Data Care, Feeding, and Maintenance
 
Tech for the Non Technical - Anatomy of an Application Stack
Tech for the Non Technical - Anatomy of an Application StackTech for the Non Technical - Anatomy of an Application Stack
Tech for the Non Technical - Anatomy of an Application Stack
 
Redundant devops
Redundant devopsRedundant devops
Redundant devops
 
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
 
Oracle EBS Production Support - Recommendations
Oracle EBS Production Support - RecommendationsOracle EBS Production Support - Recommendations
Oracle EBS Production Support - Recommendations
 
EM12c Monitoring, Metric Extensions and Performance Pages
EM12c Monitoring, Metric Extensions and Performance PagesEM12c Monitoring, Metric Extensions and Performance Pages
EM12c Monitoring, Metric Extensions and Performance Pages
 
Performance tuning Grails applications
 Performance tuning Grails applications Performance tuning Grails applications
Performance tuning Grails applications
 
Cashing in on logging and exception data
Cashing in on logging and exception dataCashing in on logging and exception data
Cashing in on logging and exception data
 
Building an Open Source AppSec Pipeline
Building an Open Source AppSec PipelineBuilding an Open Source AppSec Pipeline
Building an Open Source AppSec Pipeline
 
StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis
StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark SonisStatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis
StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis
 
Stateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory SpeedStateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory Speed
 
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...
 
Five Ways to Fix Your SQL Server Dev-Test Problems
Five Ways to Fix Your SQL Server Dev-Test Problems Five Ways to Fix Your SQL Server Dev-Test Problems
Five Ways to Fix Your SQL Server Dev-Test Problems
 
Ioug oow12 em12c
Ioug oow12 em12cIoug oow12 em12c
Ioug oow12 em12c
 

Mehr von DevOpsDays Tel Aviv

THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider Security
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider SecurityTHE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider Security
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider Security
DevOpsDays Tel Aviv
 
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearB
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearBHOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearB
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearB
DevOpsDays Tel Aviv
 
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, Firefly
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, FireflyDON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, Firefly
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, Firefly
DevOpsDays Tel Aviv
 

Mehr von DevOpsDays Tel Aviv (20)

YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...
YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...
YOUR OPEN SOURCE PROJECT IS LIKE A STARTUP, TREAT IT LIKE ONE, EYAR ZILBERMAN...
 
GRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, Salto
GRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, SaltoGRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, Salto
GRAPHQL TO THE RES(T)CUE, ELLA SHARAKANSKI, Salto
 
MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...
MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...
MICROSERVICES ABOVE THE CLOUD - DESIGNING THE INTERNATIONAL SPACE STATION FOR...
 
THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...
THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...
THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT ...
 
PRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDog
PRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDogPRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDog
PRINCIPLES OF OBSERVABILITY // DANIEL MAHER, DataDog
 
NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...
NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...
NUDGE AND SLUDGE: DRIVING SECURITY WITH DESIGN // J. WOLFGANG GOERLICH, Duo S...
 
(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG
(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG
(Ignite) TAKE A HIKE: PREVENTING BATTERY CORROSION - LEAH VOGEL, CHEGG
 
BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...
BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...
BUILDING A DR PLAN FOR YOUR CLOUD INFRASTRUCTURE FROM THE GROUND UP, MOSHE BE...
 
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider Security
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider SecurityTHE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider Security
THE THREE DISCIPLINES OF CI/CD SECURITY, DANIEL KRIVELEVICH, Cider Security
 
THE PLEASURES OF ON-PREM, TOMER GABEL
THE PLEASURES OF ON-PREM, TOMER GABELTHE PLEASURES OF ON-PREM, TOMER GABEL
THE PLEASURES OF ON-PREM, TOMER GABEL
 
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPackCONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
CONFIGURATION MANAGEMENT IN THE CLOUD NATIVE ERA, SHAHAR MINTZ, EggPack
 
SOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, Develeap
SOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, DeveleapSOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, Develeap
SOLVING THE DEVOPS CRISIS, ONE PERSON AT A TIME, CHRISTINA BABITSKI, Develeap
 
OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...
OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...
OPTIMIZING PERFORMANCE USING CONTINUOUS PRODUCTION PROFILING ,YONATAN GOLDSCH...
 
HOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKH
HOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKHHOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKH
HOW TO SCALE YOUR ONCALL OPERATION, AND SURVIVE TO TELL, ANTON DRUKH
 
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearB
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearBHOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearB
HOW TO OPTIMIZE NON-CODING TIME, ORI KEREN, LinearB
 
FLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, Icinga
FLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, IcingaFLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, Icinga
FLYING BLIND - ACCESSIBILITY IN MONITORING, FEU MOUREK, Icinga
 
(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY
(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY
(Ignite) WHAT'S BURNING THROUGH YOUR CLOUD BILL - GIL BAHAT, CIDER SECURITY
 
SLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.io
SLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.ioSLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.io
SLO DRIVEN DEVELOPMENT, ALON NATIV, Tomorrow.io
 
ONBOARDING IN LOCKDOWN, HILA FOX, Augury
ONBOARDING IN LOCKDOWN, HILA FOX, AuguryONBOARDING IN LOCKDOWN, HILA FOX, Augury
ONBOARDING IN LOCKDOWN, HILA FOX, Augury
 
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, Firefly
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, FireflyDON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, Firefly
DON'T PANIC: GETTING YOUR INFRASTRUCTURE DRIFT UNDER CONTROL, ERAN BIBI, Firefly
 

Kürzlich hochgeladen

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Kürzlich hochgeladen (20)

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 

Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

  • 1. Monitoring @ Facebook Ran Leibman Production Engineer Monitoring Tools, Components, & Mentality at Facebook
  • 3. Agenda • Problems in today’s monitoring, solutions & approaches • Facebook Monitoring Architecture • Dive into each component • Show Use cases
  • 5.
  • 6. Problems - Nagios Checks •Good for binary checks •Monitoring based on a “point in time” when the script executes •Can’t monitor on a time window (can do clowny checks with a temp file but that is not so elegant …) •Perf data •Not all plugins implement this •Data gathering is coupled with the way you want to alert •Hard to aggregate from perf data from multiple checks
  • 7. Problems - Cron Scripts •Difficult to put this scripts in “maintenance mode” while deploying code or acknowledging an issue •How do you know that your script actually runs? •Who stopped crond!? •Error handling?
  • 8. Problems - Metrics R 2nd Class Citizen •We are not always treating our metric store like we treat our application data •Storing metrics in clowny temp files or some unmaintained mysql •Metric is DATA like any other and we should treat it as such! •How & from where we are going to query it? •What is the best way to store it? •Retention?
  • 9. Problems - Ops Ownership •Usually only the ops teams owns monitoring •Even if the developers wants to add metrics and alerts they have a steep learning curve in order to achieve this
  • 12. Operational Data Store - ODS •“key —> float” type of metrics, associated with an entity •system.load1, system.cpu-user, system.n_eth-txbyt •chef.run_sucess •chef.last_run_time •myapp.request_num •myapp.request_median Entity / Key-Value
  • 13. Operational Data Store - ODS •Use Gorilla in-memory TSDB for short term data (24 hours) •Store permanent data in HBase on top of HDFS •Aggregation •by rack / cluster / tier •by custom tags - app, tier name, etc … •cross datacenter aggregations Data Store
  • 14. Collecting Metrics Operational Data Store - ODS • API modules for every imaginable language • Thrift Endpoint - https://thrift.apache.org • Implements fb303 counters • fb303 counters are collected from the service by FBAgent • FBAgent submit the metrics over to ODS
  • 15. Retention Operational Data Store - ODS •Metrics is A LOT of data! •All ODS metrics are being rolled up in the same way •Daily - 2 Weeks (depends on you) •Weekly - 2 Weeks (1min) •Monthly - 1 Month (1h) •Yearly - until the end of time (1h) •How do I solve the data loss??
  • 16. Aggregation Operational Data Store •Aggregate important metrics save spikes •p50, p90, p99 •top(N) •count •min, max •Aggregate by cluster | rack | custom
  • 17. Scuba - Real-Time Deep Dive Log Monitoring
  • 18. Whats is going on RIGHT NOW with my service? Scuba - Real-Time Log Monitoring •Was started as a hackathon ! today we can’t live without it •Combine application logs from all servers & containers into a single table •Data is stored in memory •Very small lag (<1min) •Super fast queries (median of ~300ms) •SQL like query syntax
  • 19. Logging to scuba Scuba - Real-Time Log Monitoring • Libraries for every imaginable language • PHP, Python, C++, Bash, etc … • Scuba supports: • String • Ints • Set of String • Stack of strings (usually for stack traces)
  • 20. So… What’s the catch ? Scuba - Real-Time Log Monitoring • Strict quota policy • by size • by time • Use sampling in order to reduce load • Not good for pipelining of data
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 27.
  • 28. What is an Alert? Alarm System •Creating an alert does not mean you’ll be notified! •Alert is an event stating that something happened •Can’t ssh to server •p50 of request time is slower than 100ms 80% of the time in the last 10min •the application tier in us-east is 50% down •The alert should contain ALL relevant data about the event •Alerts can be suppressed in case of maintenance
  • 29. Alarm System - Alert Structure
  • 30.
  • 31.
  • 32.
  • 33.
  • 34. FBAR - Facebook Auto-Remediation
  • 35. Automation Automation Automation ! FBAR • Most alarms could be auto remediate without human intervention • Code it once, never do it again • Doing the work of 136,000 engineering hours (29/04/2015) •136,000 / 8 = 17,000 engineers a day !
  • 36. FBAR
  • 37. FBAR
  • 38. FBAR
  • 40.
  • 41.
  • 42. What should I be paged about? Notification & Subscriptions •Actionable alerts •Impactful alerts •Before you subscribe to an alert ask yourself: •Can I automate this? (FBAR!) •Is this actionable? •Should an engineer wake up because of this?
  • 43.
  • 44.
  • 47. What are they good for ? Dashboards • Awesome tool for debugging production issues • Making your case: • We need to fix this code path • If we had the BLABLA tool it would reduce this by a factor of X • Since we deployed the last release engagement dropped by 10% in west Europe • Dashboards are cool to look at =) it’s the best way to get an understanding of what is going on with the service
  • 49.
  • 50.
  • 51.
  • 52. ? ‫לנו‬ ‫היה‬ ‫מה‬ ‫אז‬
  • 55. Surfacing problems you never thought existed
  • 56. Monitoring is not an “Ops Job”