SlideShare ist ein Scribd-Unternehmen logo
1 von 34
Downloaden Sie, um offline zu lesen
Nicolas Brousse
Director, Operations Engineering
Bringing Business Awareness
to Your Operation Team
Nagios World Conference 2013
About TubeMogul...
● Founded in 2006
● Formerly a video distribution and analytics platform
● TubeMogul is a Brand-Focused Video Marketing
Company
○ Build for Branding
○ Integrate real-time media buying, ad serving, targeting,
optimization and brand measurement
TubeMogul simplifies the delivery of video ads and
maximizes the impact of every dollar spent by brand
marketers
http://www.tubemogul.com
About TubeMogul...
● Monitoring between 800 and 1000 servers
● Servers spread across 6 different locations
○ 4 Amazon EC2 Regions
○ 1 Hosted (Liquidweb) & 1 VPS (Linode)
● Little monitoring resources
○ Collecting over 120,000 metrics
○ Monitoring over 20,000 services with Nagios
● Multiple billions of HTTP requests a day
○ Most of it must be served in less than 100ms
○ Lost of traffic could mean lost of business opportunity
○ Or worst, over-spending…
Our Environment
● Over 80 different server profiles
● Our stack:
○ Java (Embedded Jetty, Tomcat)
○ PHP, RoR
○ Hadoop: HDFS, M/R, Hbase, Hive
○ Couchbase
○ MySQL, Vertica
○ ElasticSearch
● Monitoring: Nagios, NSCA
● Graphing: Ganglia, sFlow, Graphite
● Configuration Management: Puppet
Amazon Cloud Environment
Amazon Cloud Environment
● We use EC2, SDB, SQS, EMR, S3, etc.
● We don’t use ELB
● We heavily use EC2 Tags
ec2-describe-instances -F tag:hostname=dev-build01
Automated Monitoring
Automated Monitoring
Process of event when starting a new host and add it to our monitoring:
1. We start a new instance using Cerveza and Cloud-init
2. Puppet configure Gmond or Host sFlow on the instance
3. Our monitoring server running Gmond and Gmetad get data from the new
instance
4. A Nagios check run every minute and check for new hosts
○ Look for new hosts using EC2 API
○ Look for EC2 tag “hostname” to confirm it’s a legit host, not a zombie / fail start
○ Look for EC2 tag “nagios_host” to see if the host belong to this monitoring instance
5. If a new host is found:
○ We build a config for the host based on a template file and doing some string replace
○ Once all config have been generated, we rebuild pre-cache objects and reload Nagios
6. If we find “Zombie” host, we generate a Warning alert
7. If the config is corrupt, we send a Critical alert
Efficient Monitoring
We reduce noise by disabling most notifications and using our "cluster check"
Efficient Monitoring
Efficient Monitoring
Efficient on-call rotation
● Follow the sun
○ OPS team is in Ukraine, no more Tier 1 night on-call for US Team
● Timeperiod and escalation are a pain to maintain
○ Nagios notification plugged to Google Calendar
■ Using our own notification script for email and paging
■ Google Calendar make it easy for each team to manage their
own on-call calendar
■ Support for multiple Tier and complex schedules
■ Caching Google Calendar info locally every hour
○ Simpler definitions and rules in Nagios contacts
○ Notify only people on-call, unless they asked for “off
call” emails
Efficient on-call rotation
Efficient on-call rotation
● Simple contact definitions
● Google Calendar info
● Tier Filter (Regex)
● Tier Interval (time to wait before escalating alert since
last tier)
● Off call email
Efficient on-call rotation
● Centralized dashboard!
○ Plugged to Google Apps
○ List all on-call contacts
Efficient on-call rotation
● Centralized View of Multiple Nagios!
Efficient on-call rotation
Efficient on-call rotation
Business Awareness!
Now that I know what is
breaking…
Which one should I fix first?
Business Awareness!
Service Health Dashboard!
Business Awareness!
● A service represent a Business Critical function
● A service can be global or limited to a region
● A service is defined by multiple Service Component
○ We use Nagios Event Handler to update services
component status
○ REST API dashboard allow easy update from third
party monitoring, QA test, scripts, crons, Nagios
● We can define service SLA and quickly see SLA
breaking (based on OLA)
● OPS Team can perform more actions, post comments,
link to Jira tickets
Business Awareness!
Business Awareness!
Business Awareness!
Business Awareness!
Business Awareness!
Business Awareness!
Business Awareness!
Business Awareness!
How do we make sure we answer the business
needs?
● Review SLAs and monitoring configurations
monthly/quarterly
● Have a checklist when launching new
product or features
○ We now have a SRE Hand-off Checklist with
detailed questions
○ How is capacity planning done?
○ Who can have an impact on traffic, storage, etc?
○ Make sure to ask about expected OLAs or SLAs
To summarize...
● We easily automate our deployment across
multiple geos
● We control the noise
● We easily schedule our on-call rotations
● In one location we know everything
● We know what is impacting the business
and how we should prioritize our actions
Team Process - Daily Kanban!
Request based on
Dashboards,
Monitoring, Paging or
Engineers
Ticket categorized in two Swimlane:
● Production Support
○ High Priority: Top to Bottom
○ On-call 24/7, OLAs, SLAs...
○ Incident are handled 1st
○ Maintenance are handled 2nd
● Developer Support
○ Best Effort: Top to Bottom
○ Long effort request moved to
Infrastructure pipeline
Team Process - Long Term Agile!
Request moved from OPS to INF
SRE Hand-off Checklist used to
define Epic, Story and Tasks
Do
Plan
Check Act
All this wouldn’t be possible without a strong
SRE Operation Engineering team:
Aleksey Mykhailov
Marylene Tanfin
Mykola Mogylenko
Nicolas Brousse
Pierre Gohon
Pierre Grandin
Stan Rudenko
OPS @ TubeMogul
We are Hiring !
http://www.tubemogul.com/jobs

Weitere ähnliche Inhalte

Ähnlich wie Nagios Conference 2013 - Nicolas Brousse - Bringing Business Awareness

Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...Ridwan Fadjar
 
JUST EAT: Embracing DevOps
JUST EAT: Embracing DevOpsJUST EAT: Embracing DevOps
JUST EAT: Embracing DevOpsPeter Mounce
 
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloud
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloudInterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloud
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloudiMasters
 
QueueMetrics - Tips and Tricks
QueueMetrics - Tips and TricksQueueMetrics - Tips and Tricks
QueueMetrics - Tips and TricksClarotech_Events
 
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...WSO2
 
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Nicolas Brousse
 
Mulesoft Meetup Milano #9 - Batch Processing and CI/CD
Mulesoft Meetup Milano #9 - Batch Processing and CI/CDMulesoft Meetup Milano #9 - Batch Processing and CI/CD
Mulesoft Meetup Milano #9 - Batch Processing and CI/CDGonzalo Marcos Ansoain
 
Splunk in Nordstrom: IT Operations
Splunk in Nordstrom: IT OperationsSplunk in Nordstrom: IT Operations
Splunk in Nordstrom: IT OperationsTimur Bagirov
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uberconfluent
 
How we leveraged Drupal to build a leading SaaS product
How we leveraged Drupal to build a leading SaaS product How we leveraged Drupal to build a leading SaaS product
How we leveraged Drupal to build a leading SaaS product Invotra
 
Aws uk ug #8 not everything that happens in vegas stay in vegas
Aws uk ug #8   not everything that happens in vegas stay in vegasAws uk ug #8   not everything that happens in vegas stay in vegas
Aws uk ug #8 not everything that happens in vegas stay in vegasPeter Mounce
 
Mds cloud saturday 2015 how to heroku
Mds cloud saturday 2015 how to herokuMds cloud saturday 2015 how to heroku
Mds cloud saturday 2015 how to herokuDavid Scruggs
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthNicolas Brousse
 
Getting Started with PHP on Engine Yard Cloud
Getting Started with PHP on Engine Yard CloudGetting Started with PHP on Engine Yard Cloud
Getting Started with PHP on Engine Yard CloudEngine Yard
 
How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using ...
How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using ...How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using ...
How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using ...InfluxData
 
Webinar slides: Backup Management for MySQL, MariaDB, PostgreSQL & MongoDB wi...
Webinar slides: Backup Management for MySQL, MariaDB, PostgreSQL & MongoDB wi...Webinar slides: Backup Management for MySQL, MariaDB, PostgreSQL & MongoDB wi...
Webinar slides: Backup Management for MySQL, MariaDB, PostgreSQL & MongoDB wi...Severalnines
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Hernan Costante
 
Eko10 workshop - OPEN SOURCE DATABASE MONITORING
Eko10 workshop - OPEN SOURCE DATABASE MONITORINGEko10 workshop - OPEN SOURCE DATABASE MONITORING
Eko10 workshop - OPEN SOURCE DATABASE MONITORINGPablo Garbossa
 
SplunkLive! Utrecht 2017 - Coolblue Customer Presentation
SplunkLive! Utrecht 2017 - Coolblue Customer PresentationSplunkLive! Utrecht 2017 - Coolblue Customer Presentation
SplunkLive! Utrecht 2017 - Coolblue Customer PresentationSplunk
 
Reactive Cloud Security | AWS Public Sector Summit 2016
Reactive Cloud Security | AWS Public Sector Summit 2016Reactive Cloud Security | AWS Public Sector Summit 2016
Reactive Cloud Security | AWS Public Sector Summit 2016Amazon Web Services
 

Ähnlich wie Nagios Conference 2013 - Nicolas Brousse - Bringing Business Awareness (20)

Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
 
JUST EAT: Embracing DevOps
JUST EAT: Embracing DevOpsJUST EAT: Embracing DevOps
JUST EAT: Embracing DevOps
 
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloud
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloudInterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloud
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloud
 
QueueMetrics - Tips and Tricks
QueueMetrics - Tips and TricksQueueMetrics - Tips and Tricks
QueueMetrics - Tips and Tricks
 
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...
 
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
 
Mulesoft Meetup Milano #9 - Batch Processing and CI/CD
Mulesoft Meetup Milano #9 - Batch Processing and CI/CDMulesoft Meetup Milano #9 - Batch Processing and CI/CD
Mulesoft Meetup Milano #9 - Batch Processing and CI/CD
 
Splunk in Nordstrom: IT Operations
Splunk in Nordstrom: IT OperationsSplunk in Nordstrom: IT Operations
Splunk in Nordstrom: IT Operations
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
 
How we leveraged Drupal to build a leading SaaS product
How we leveraged Drupal to build a leading SaaS product How we leveraged Drupal to build a leading SaaS product
How we leveraged Drupal to build a leading SaaS product
 
Aws uk ug #8 not everything that happens in vegas stay in vegas
Aws uk ug #8   not everything that happens in vegas stay in vegasAws uk ug #8   not everything that happens in vegas stay in vegas
Aws uk ug #8 not everything that happens in vegas stay in vegas
 
Mds cloud saturday 2015 how to heroku
Mds cloud saturday 2015 how to herokuMds cloud saturday 2015 how to heroku
Mds cloud saturday 2015 how to heroku
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
Getting Started with PHP on Engine Yard Cloud
Getting Started with PHP on Engine Yard CloudGetting Started with PHP on Engine Yard Cloud
Getting Started with PHP on Engine Yard Cloud
 
How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using ...
How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using ...How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using ...
How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using ...
 
Webinar slides: Backup Management for MySQL, MariaDB, PostgreSQL & MongoDB wi...
Webinar slides: Backup Management for MySQL, MariaDB, PostgreSQL & MongoDB wi...Webinar slides: Backup Management for MySQL, MariaDB, PostgreSQL & MongoDB wi...
Webinar slides: Backup Management for MySQL, MariaDB, PostgreSQL & MongoDB wi...
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
 
Eko10 workshop - OPEN SOURCE DATABASE MONITORING
Eko10 workshop - OPEN SOURCE DATABASE MONITORINGEko10 workshop - OPEN SOURCE DATABASE MONITORING
Eko10 workshop - OPEN SOURCE DATABASE MONITORING
 
SplunkLive! Utrecht 2017 - Coolblue Customer Presentation
SplunkLive! Utrecht 2017 - Coolblue Customer PresentationSplunkLive! Utrecht 2017 - Coolblue Customer Presentation
SplunkLive! Utrecht 2017 - Coolblue Customer Presentation
 
Reactive Cloud Security | AWS Public Sector Summit 2016
Reactive Cloud Security | AWS Public Sector Summit 2016Reactive Cloud Security | AWS Public Sector Summit 2016
Reactive Cloud Security | AWS Public Sector Summit 2016
 

Mehr von Nagios

Nagios XI Best Practices
Nagios XI Best PracticesNagios XI Best Practices
Nagios XI Best PracticesNagios
 
Jesse Olson - Nagios Log Server Architecture Overview
Jesse Olson - Nagios Log Server Architecture OverviewJesse Olson - Nagios Log Server Architecture Overview
Jesse Olson - Nagios Log Server Architecture OverviewNagios
 
Trevor McDonald - Nagios XI Under The Hood
Trevor McDonald  - Nagios XI Under The HoodTrevor McDonald  - Nagios XI Under The Hood
Trevor McDonald - Nagios XI Under The HoodNagios
 
Sean Falzon - Nagios - Resilient Notifications
Sean Falzon - Nagios - Resilient NotificationsSean Falzon - Nagios - Resilient Notifications
Sean Falzon - Nagios - Resilient NotificationsNagios
 
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise EditionMarcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise EditionNagios
 
Janice Singh - Writing Custom Nagios Plugins
Janice Singh - Writing Custom Nagios PluginsJanice Singh - Writing Custom Nagios Plugins
Janice Singh - Writing Custom Nagios PluginsNagios
 
Dave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical ExperienceDave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical ExperienceNagios
 
Mike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service ChecksMike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service ChecksNagios
 
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios InstallationMike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios InstallationNagios
 
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...Nagios
 
Matt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With NagiosMatt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With NagiosNagios
 
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Nagios
 
Eric Loyd - Fractal Nagios
Eric Loyd - Fractal NagiosEric Loyd - Fractal Nagios
Eric Loyd - Fractal NagiosNagios
 
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Nagios
 
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...Nagios
 
Nagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson OpeningNagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson OpeningNagios
 
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios CoreNrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios CoreNagios
 
Nagios Log Server - Features
Nagios Log Server - FeaturesNagios Log Server - Features
Nagios Log Server - FeaturesNagios
 
Nagios Network Analyzer - Features
Nagios Network Analyzer - FeaturesNagios Network Analyzer - Features
Nagios Network Analyzer - FeaturesNagios
 
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing NagiosNagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing NagiosNagios
 

Mehr von Nagios (20)

Nagios XI Best Practices
Nagios XI Best PracticesNagios XI Best Practices
Nagios XI Best Practices
 
Jesse Olson - Nagios Log Server Architecture Overview
Jesse Olson - Nagios Log Server Architecture OverviewJesse Olson - Nagios Log Server Architecture Overview
Jesse Olson - Nagios Log Server Architecture Overview
 
Trevor McDonald - Nagios XI Under The Hood
Trevor McDonald  - Nagios XI Under The HoodTrevor McDonald  - Nagios XI Under The Hood
Trevor McDonald - Nagios XI Under The Hood
 
Sean Falzon - Nagios - Resilient Notifications
Sean Falzon - Nagios - Resilient NotificationsSean Falzon - Nagios - Resilient Notifications
Sean Falzon - Nagios - Resilient Notifications
 
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise EditionMarcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
 
Janice Singh - Writing Custom Nagios Plugins
Janice Singh - Writing Custom Nagios PluginsJanice Singh - Writing Custom Nagios Plugins
Janice Singh - Writing Custom Nagios Plugins
 
Dave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical ExperienceDave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical Experience
 
Mike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service ChecksMike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service Checks
 
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios InstallationMike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
 
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
 
Matt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With NagiosMatt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With Nagios
 
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
 
Eric Loyd - Fractal Nagios
Eric Loyd - Fractal NagiosEric Loyd - Fractal Nagios
Eric Loyd - Fractal Nagios
 
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
 
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
 
Nagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson OpeningNagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson Opening
 
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios CoreNrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
 
Nagios Log Server - Features
Nagios Log Server - FeaturesNagios Log Server - Features
Nagios Log Server - Features
 
Nagios Network Analyzer - Features
Nagios Network Analyzer - FeaturesNagios Network Analyzer - Features
Nagios Network Analyzer - Features
 
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing NagiosNagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
 

Kürzlich hochgeladen

Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCzechDreamin
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekCzechDreamin
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1DianaGray10
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessUXDXConf
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityScyllaDB
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...FIDO Alliance
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?Mark Billinghurst
 
Buy Epson EcoTank L3210 Colour Printer Online.pdf
Buy Epson EcoTank L3210 Colour Printer Online.pdfBuy Epson EcoTank L3210 Colour Printer Online.pdf
Buy Epson EcoTank L3210 Colour Printer Online.pdfEasyPrinterHelp
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Patrick Viafore
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераMark Opanasiuk
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsUXDXConf
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...FIDO Alliance
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FIDO Alliance
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyUXDXConf
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfFIDO Alliance
 
THE BEST IPTV in GERMANY for 2024: IPTVreel
THE BEST IPTV in  GERMANY for 2024: IPTVreelTHE BEST IPTV in  GERMANY for 2024: IPTVreel
THE BEST IPTV in GERMANY for 2024: IPTVreelreely ones
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Julian Hyde
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsStefano
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceSamy Fodil
 

Kürzlich hochgeladen (20)

Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Buy Epson EcoTank L3210 Colour Printer Online.pdf
Buy Epson EcoTank L3210 Colour Printer Online.pdfBuy Epson EcoTank L3210 Colour Printer Online.pdf
Buy Epson EcoTank L3210 Colour Printer Online.pdf
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering Teams
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 
THE BEST IPTV in GERMANY for 2024: IPTVreel
THE BEST IPTV in  GERMANY for 2024: IPTVreelTHE BEST IPTV in  GERMANY for 2024: IPTVreel
THE BEST IPTV in GERMANY for 2024: IPTVreel
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 

Nagios Conference 2013 - Nicolas Brousse - Bringing Business Awareness

  • 1. Nicolas Brousse Director, Operations Engineering Bringing Business Awareness to Your Operation Team Nagios World Conference 2013
  • 2. About TubeMogul... ● Founded in 2006 ● Formerly a video distribution and analytics platform ● TubeMogul is a Brand-Focused Video Marketing Company ○ Build for Branding ○ Integrate real-time media buying, ad serving, targeting, optimization and brand measurement TubeMogul simplifies the delivery of video ads and maximizes the impact of every dollar spent by brand marketers http://www.tubemogul.com
  • 3. About TubeMogul... ● Monitoring between 800 and 1000 servers ● Servers spread across 6 different locations ○ 4 Amazon EC2 Regions ○ 1 Hosted (Liquidweb) & 1 VPS (Linode) ● Little monitoring resources ○ Collecting over 120,000 metrics ○ Monitoring over 20,000 services with Nagios ● Multiple billions of HTTP requests a day ○ Most of it must be served in less than 100ms ○ Lost of traffic could mean lost of business opportunity ○ Or worst, over-spending…
  • 4. Our Environment ● Over 80 different server profiles ● Our stack: ○ Java (Embedded Jetty, Tomcat) ○ PHP, RoR ○ Hadoop: HDFS, M/R, Hbase, Hive ○ Couchbase ○ MySQL, Vertica ○ ElasticSearch ● Monitoring: Nagios, NSCA ● Graphing: Ganglia, sFlow, Graphite ● Configuration Management: Puppet
  • 6. Amazon Cloud Environment ● We use EC2, SDB, SQS, EMR, S3, etc. ● We don’t use ELB ● We heavily use EC2 Tags ec2-describe-instances -F tag:hostname=dev-build01
  • 8. Automated Monitoring Process of event when starting a new host and add it to our monitoring: 1. We start a new instance using Cerveza and Cloud-init 2. Puppet configure Gmond or Host sFlow on the instance 3. Our monitoring server running Gmond and Gmetad get data from the new instance 4. A Nagios check run every minute and check for new hosts ○ Look for new hosts using EC2 API ○ Look for EC2 tag “hostname” to confirm it’s a legit host, not a zombie / fail start ○ Look for EC2 tag “nagios_host” to see if the host belong to this monitoring instance 5. If a new host is found: ○ We build a config for the host based on a template file and doing some string replace ○ Once all config have been generated, we rebuild pre-cache objects and reload Nagios 6. If we find “Zombie” host, we generate a Warning alert 7. If the config is corrupt, we send a Critical alert
  • 9. Efficient Monitoring We reduce noise by disabling most notifications and using our "cluster check"
  • 12. Efficient on-call rotation ● Follow the sun ○ OPS team is in Ukraine, no more Tier 1 night on-call for US Team ● Timeperiod and escalation are a pain to maintain ○ Nagios notification plugged to Google Calendar ■ Using our own notification script for email and paging ■ Google Calendar make it easy for each team to manage their own on-call calendar ■ Support for multiple Tier and complex schedules ■ Caching Google Calendar info locally every hour ○ Simpler definitions and rules in Nagios contacts ○ Notify only people on-call, unless they asked for “off call” emails
  • 14. Efficient on-call rotation ● Simple contact definitions ● Google Calendar info ● Tier Filter (Regex) ● Tier Interval (time to wait before escalating alert since last tier) ● Off call email
  • 15. Efficient on-call rotation ● Centralized dashboard! ○ Plugged to Google Apps ○ List all on-call contacts
  • 16. Efficient on-call rotation ● Centralized View of Multiple Nagios!
  • 19. Business Awareness! Now that I know what is breaking… Which one should I fix first?
  • 21. Business Awareness! ● A service represent a Business Critical function ● A service can be global or limited to a region ● A service is defined by multiple Service Component ○ We use Nagios Event Handler to update services component status ○ REST API dashboard allow easy update from third party monitoring, QA test, scripts, crons, Nagios ● We can define service SLA and quickly see SLA breaking (based on OLA) ● OPS Team can perform more actions, post comments, link to Jira tickets
  • 29. Business Awareness! How do we make sure we answer the business needs? ● Review SLAs and monitoring configurations monthly/quarterly ● Have a checklist when launching new product or features ○ We now have a SRE Hand-off Checklist with detailed questions ○ How is capacity planning done? ○ Who can have an impact on traffic, storage, etc? ○ Make sure to ask about expected OLAs or SLAs
  • 30. To summarize... ● We easily automate our deployment across multiple geos ● We control the noise ● We easily schedule our on-call rotations ● In one location we know everything ● We know what is impacting the business and how we should prioritize our actions
  • 31. Team Process - Daily Kanban! Request based on Dashboards, Monitoring, Paging or Engineers Ticket categorized in two Swimlane: ● Production Support ○ High Priority: Top to Bottom ○ On-call 24/7, OLAs, SLAs... ○ Incident are handled 1st ○ Maintenance are handled 2nd ● Developer Support ○ Best Effort: Top to Bottom ○ Long effort request moved to Infrastructure pipeline
  • 32. Team Process - Long Term Agile! Request moved from OPS to INF SRE Hand-off Checklist used to define Epic, Story and Tasks Do Plan Check Act
  • 33. All this wouldn’t be possible without a strong SRE Operation Engineering team: Aleksey Mykhailov Marylene Tanfin Mykola Mogylenko Nicolas Brousse Pierre Gohon Pierre Grandin Stan Rudenko OPS @ TubeMogul
  • 34. We are Hiring ! http://www.tubemogul.com/jobs