NWC 2011
     Monitoring a Cloud Infrastructure in a Multi-Region Topology

                                                      Nicolas Brousse
                                                    nicolas@tubemogul.com

                                                      September 29th 2011




2011 TubeMogul Incorporated All rights reserved.
Introduction - About the speaker
   • My name is Nicolas Brousse
   • I previously worked for several industry-leading companies in France
      – From web hosting to online video services
           (Lycos, MultiMania, Kewego, MediaPlazza...)
      – Heavy-traffic environments and large user databases
   • I have worked as a Lead Operations Engineer at TubeMogul.com since 2008
   • I help TubeMogul scale its infrastructure
      – From 20 servers to 500+ servers
      – Using 4 Amazon EC2 regions + 1 colo
      – Monitoring over 6,000 active services and 1,000 passive services with Nagios
      – Collecting over 80,000 metrics with Ganglia
      – Managing over 300 TB of data in Hadoop HDFS
      – Billions of HTTP queries a day
   • I occasionally contribute to open source projects
      – Ganglia (PHP and Perl modules)
      – PHP Judy
Introduction - About TubeMogul

   • Created in November 2006 by John Hughes and Brett Wilson
   • Formerly a video distribution and analytics platform
   • Acquired Illuminex, a Flash analytics firm, in October 2008
   • New platform called PlayTime™:
      – TubeMogul is a video marketing company
      – Built for branding
      – Integrates real-time media buying, ad serving, targeting, optimization and brand
           measurement



   TubeMogul simplifies the delivery of video ads and maximizes the impact of
                     every dollar spent by brand marketers


                                  http://www.tubemogul.com/company/about_us


Our Environment
   • 10+ servers hosted at LiquidWeb
   • A few VPS instances on Linode
   • 500+ instances on Amazon EC2
       – Over 50 different server configurations
   • Our technology stack:
       – Java, PHP
       – Hadoop : HDFS, MapReduce, HBase, Hive
       – Membase
       – Memcache
       – MySQL
       – And more...
   • Monitoring with Nagios
       – Using NSCA when possible
   • Graphing and Trending using Ganglia with Python plugins
       – Some legacy servers using Munin
   • Configuration Management using Puppet
Amazon Cloud Environment




Amazon Cloud Environment
   • We like it because...
      – We can quickly start new servers/clusters
      – We can start them in many regions
         • US East (Virginia)
         • US West (Northern California)
         • Europe (Dublin)
         • Asia Pacific (Tokyo & Singapore)
      – We can use different types of instances (RAM, CPU, disks, etc.)
      – It’s easy to automate with the EC2 API
      – It’s easy to plug into a configuration management tool

   • But...
      – It can be hard to troubleshoot some failures or network problems
      – We are occasionally notified of hardware failures after the fact
      – No multicast (though possible with Amazon VPC)
      – Bandwidth between regions can get expensive


What’s the plan?
   • Our monitoring must be able to scale
   • We need a better graphing/trending solution
   • Our monitoring configuration must be automated
      – How do we monitor a cluster whose number of servers changes every hour?
      – How do we change configuration in multiple regions without missing something?
   • A failure in one region shouldn’t impact other regions
   • We want to be woken up only when it really matters
   • We have limited resources
      – Can’t spend big bucks on monitoring
      – Small operations team




Graphing, Trending...

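   [Diagram: Munin vs. Ganglia data flow]
   • Munin: munin-update and munin-graph pull from the munin-nodes (sequential polling)
   • Ganglia: Gmond nodes push metrics; Gmetad pulls from Gmond nodes and from other Gmetad instances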

Graphing, Trending...
   • Why did we switch from Munin to Ganglia?

      – Pretty much: pull vs. push
         • The Munin server fetches data from Munin clients (munin-nodes)
            – Can quickly overload the Munin server in disk I/O and CPU
            – Data is collected sequentially, so each run is affected by the previous run's duration and by server load
         • Ganglia clients send data to representative cluster nodes. A Gmetad process
           periodically federates the data.
            – Lighter on the aggregation side
            – Clients push data at a defined interval
            – Can use thresholds to send data only when it makes sense
               » using time_threshold and value_threshold in the metric definition

      – Ganglia is designed for clusters and grids
         • You can use multiple layers of gmond/gmetad processes
         • You don’t need to manually add servers to your configuration
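
The threshold behaviour mentioned above can be illustrated with a small sketch. This is not gmond's actual code: it is a simplified model in which a value is sent when it has changed by at least value_threshold since the last send, or when time_threshold seconds have passed. The class and method names (MetricSender, should_send, observe) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricSender:
    """Simplified model of threshold-based sending (hypothetical helper)."""

    time_threshold: float   # maximum seconds allowed between two sends
    value_threshold: float  # minimum absolute change that triggers a send
    last_value: Optional[float] = None
    last_sent_at: Optional[float] = None

    def should_send(self, value: float, now: float) -> bool:
        # Always send the very first sample.
        if self.last_value is None or self.last_sent_at is None:
            return True
        # Send if we have been silent for too long.
        if now - self.last_sent_at >= self.time_threshold:
            return True
        # Otherwise send only if the value moved enough to matter.
        return abs(value - self.last_value) >= self.value_threshold

    def observe(self, value: float, now: float) -> bool:
        """Record a sample; return True if it would be sent."""
        if self.should_send(value, now):
            self.last_value = value
            self.last_sent_at = now
            return True
        return False
```

With time_threshold=60 and value_threshold=1.0, a sample that differs by 0.2 from the last sent value is skipped, while a sample that differs by 2.0 is sent immediately. That is the point of the thresholds: less traffic without losing significant changes.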



Monitoring with Nagios




Automating Nagios configuration
• Puppet configures our monitoring instance in each region
   – We use Nagios regex matching: use_regexp_matching=1
   – But we don’t use true regex matching: use_true_regexp_matching=0
   – We use NSCA with Upstart
   – We don’t use perfdata
   – We include our configuration from 3 directories
      - objects => templates, contacts, commands, event_handlers
      - servers => contains a configuration file for each server
      - clusters => contains a configuration file for each cluster
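
As a rough illustration of the settings listed above, the sketch below renders the matching lines of a Nagios main configuration file. The directive names are standard Nagios options, but the function name render_nagios_cfg and the base path are assumptions, and mapping "we don't use perfdata" to process_performance_data=0 is a guess at how it was configured.

```python
from pathlib import PurePosixPath


def render_nagios_cfg(base_dir: str) -> str:
    """Return a nagios.cfg fragment for the layout described in the slide."""
    base = PurePosixPath(base_dir)
    lines = [
        "use_regexp_matching=1",        # enable pattern matching in object definitions
        "use_true_regexp_matching=0",   # only names containing wildcards are treated as patterns
        "process_performance_data=0",   # assumption: perfdata processing turned off
    ]
    # One cfg_dir line per configuration directory named in the slide.
    for sub in ("objects", "servers", "clusters"):
        lines.append(f"cfg_dir={base / sub}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "/etc/nagios" is a placeholder path, not taken from the talk.
    print(render_nagios_cfg("/etc/nagios"), end="")
```

Keeping servers and clusters in separate directories means generated per-host files never touch the hand-maintained objects directory.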


Automating Nagios configuration
Sequence of events when a new host starts and is added to our monitoring:

1. We start a new instance using Cerveza and Cloud-init

2. Puppet configures Gmond on the instance

3. Our monitoring server running Gmetad gets data from the new instance

4. A Nagios check runs every minute and looks for new hosts in Ganglia

5. If a new host is found, the check script rebuilds the Nagios config and
 reloads Nagios

6. If the config is corrupt, the check script sends a critical alert
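
Steps 4 to 6 can be sketched roughly as follows. This is not the actual check script: it assumes Gmetad's usual RRD layout (one directory per cluster, one subdirectory per host, plus `__SummaryInfo__` directories) and one `<host>.cfg` file per server. All function names are hypothetical, and writing, validating and reloading are passed in as callables so the logic can be read on its own.

```python
import os
from typing import Callable, Set, Tuple

SUMMARY_DIR = "__SummaryInfo__"  # Gmetad's per-cluster and per-grid summary directory


def hosts_in_ganglia(rrd_root: str) -> Set[str]:
    """List hosts that have an RRD directory under any cluster."""
    hosts = set()
    for cluster in os.listdir(rrd_root):
        cluster_path = os.path.join(rrd_root, cluster)
        if cluster == SUMMARY_DIR or not os.path.isdir(cluster_path):
            continue
        for host in os.listdir(cluster_path):
            if host != SUMMARY_DIR and os.path.isdir(os.path.join(cluster_path, host)):
                hosts.add(host)
    return hosts


def hosts_in_nagios(servers_dir: str) -> Set[str]:
    """List hosts that already have a <host>.cfg file."""
    return {name[:-4] for name in os.listdir(servers_dir) if name.endswith(".cfg")}


def sync_new_hosts(
    rrd_root: str,
    servers_dir: str,
    write_config: Callable[[str], None],
    config_is_valid: Callable[[], bool],
    reload_nagios: Callable[[], None],
) -> Tuple[int, str]:
    """Return a (Nagios exit code, message) pair for this run."""
    new_hosts = sorted(hosts_in_ganglia(rrd_root) - hosts_in_nagios(servers_dir))
    if not new_hosts:
        return 0, "OK - no new hosts"
    for host in new_hosts:
        write_config(host)
    # Never reload a broken configuration: alert instead.
    if not config_is_valid():
        return 2, "CRITICAL - generated Nagios config is invalid"
    reload_nagios()
    return 0, "OK - added " + ", ".join(new_hosts)
```

In a real deployment, config_is_valid would typically wrap Nagios's own verification run (`nagios -v` on the main configuration file), and reload_nagios would call the service's reload command.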



Automating Nagios configuration
• Each server configuration is
  generated from a template
• Our Nagios plugin,
  “check_tm_clusters”, walks
  the RRD files generated
  by Ganglia
• If a new host is found, it
  simply copies the template to
  the servers config directory
  and replaces the variables with
  values reported by Ganglia and
  found in DNS entries
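
A minimal sketch of the template step, assuming a simple placeholder syntax. The real template was shown only as an image in the slides, so the layout below, the variable names and the "generic-host" parent template are illustrative. Only the Nagios directives themselves (use, host_name, address, hostgroups) are standard.

```python
from string import Template

# Hypothetical per-server template; $-placeholders are filled in per host.
HOST_TEMPLATE = Template("""\
define host {
    use        $template
    host_name  $hostname
    address    $address
    hostgroups $cluster
}
""")


def render_host_config(hostname: str, address: str, cluster: str,
                       template: str = "generic-host") -> str:
    """Fill the template with values for one host.

    hostname and cluster would come from Ganglia; address from a DNS lookup.
    """
    return HOST_TEMPLATE.substitute(
        template=template,
        hostname=hostname,
        address=address,
        cluster=cluster,
    )


if __name__ == "__main__":
    # Example values only.
    print(render_host_config("web01", "10.0.0.12", "web"), end="")
```

Template.substitute raises KeyError when a placeholder has no value, which is useful here: a host with missing data fails loudly instead of producing a half-filled config file.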




Reducing noise and false positives
• We disable most notifications and care only about cluster status




• Most of our checks are based on Ganglia RRD files




Reducing noise and false positives
• It becomes really easy to monitor any metric returned by Ganglia
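
To show why this is easy, here is a hedged sketch of the threshold logic such a check needs once a metric value has been read. Reading the value from Ganglia is left out, and the function name check_metric is hypothetical. The exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from typing import Optional, Tuple

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_metric(name: str, value: Optional[float],
                 warn: float, crit: float) -> Tuple[int, str]:
    """Map a metric value to a Nagios state; higher values are worse."""
    if value is None:
        # No sample available (e.g. host stopped reporting).
        return UNKNOWN, f"UNKNOWN - {name}: no data"
    detail = f"{name}={value:g} (warn={warn:g}, crit={crit:g})"
    if value >= crit:
        return CRITICAL, f"CRITICAL - {detail}"
    if value >= warn:
        return WARNING, f"WARNING - {detail}"
    return OK, f"OK - {detail}"
```

Because the logic is the same for every metric, one generic check plus a metric name and two thresholds covers anything Ganglia already collects.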




Reducing noise and false positives
• We can check cluster status by host/service, but also by returned
   message!
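
The idea of a cluster-level check can be sketched as below. This is an illustration only, not the plugin from the talk: it assumes per-host results are available as (state, message) pairs, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple

# host name -> (Nagios state code, plugin output message)
Results = Dict[str, Tuple[int, str]]


def cluster_status(results: Results, warn: int, crit: int) -> Tuple[int, str]:
    """Alert on the number of hosts that are not OK, not on each host."""
    failing = [host for host, (state, _) in results.items() if state != 0]
    count = len(failing)
    if count >= crit:
        code, label = 2, "CRITICAL"
    elif count >= warn:
        code, label = 1, "WARNING"
    else:
        code, label = 0, "OK"
    return code, f"{label} - {count}/{len(results)} hosts not OK"


def count_by_message(results: Results) -> Counter:
    """Group hosts by returned message to spot a shared failure cause."""
    return Counter(message for _, message in results.values())
```

For example, with four hosts of which three are not OK, thresholds of warn=2 and crit=4 give a WARNING for the cluster rather than three separate alerts.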




Reducing noise and false positives
• We use our “check_cluster” plugin extensively
• We limit email notifications as much as possible
• We use a custom variable _PAGING to identify pageable services
• We page ONLY on critical alerts for services/hosts with _PAGING=yes
• We use different contacts and time periods to send alerts to the right person
• We use Nagios Checker for Firefox and Chrome
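
The paging rule above amounts to a two-condition filter. The sketch below assumes a notification script receives the state and the object's custom variables. Nagios exposes a custom variable such as _PAGING through macros like `$_SERVICEPAGING$` or `$_HOSTPAGING$`; how those values reach the script is not described in the talk, and should_page is a hypothetical name.

```python
from typing import Mapping


def should_page(state: str, custom_vars: Mapping[str, str]) -> bool:
    """Page only for CRITICAL alerts on objects marked _PAGING=yes."""
    is_critical = state.strip().upper() == "CRITICAL"
    # Objects without the variable are treated as not pageable.
    is_pageable = custom_vars.get("_PAGING", "").strip().lower() == "yes"
    return is_critical and is_pageable
```

Treating a missing _PAGING variable as "no" keeps the default quiet: a service only wakes someone up if it was explicitly marked pageable.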




Thank You...


   TubeMogul is hiring!

   http://www.tubemogul.com/company/careers
   jobs@tubemogul.com

   Follow us on Twitter: @tubemogul, @orieg


Nagios Conference 2011 - Nicolas Brousse - Monitoring A Cloud Infrastructure In A Multi-Region Topology

  • 1. NWC 2011 Monitoring a Cloud Infrastructure in a Multi-Region Topology Nicolas Brousse nicolas@tubemogul.com September 29th 2011 2011 TubeMogul Incorporated All rights reserved. 1
  • 2. Introduction - About the speaker • My name is Nicolas Brousse • I previously worked for many industry leading company in France – From Web Hosting to Online Video services (Lycos, MultiMania, Kewego, MediaPlazza...) – Heavy traffic environment and large user databases • I work as a Lead Operations Engineer at TubeMogul.com since 2008 • I help TubeMogul to scale its infrastructure – From 20 servers to +500 servers – Using 4 Amazon EC2 Regions + 1 Colo – Monitoring with Nagios over 6,000 actives services and 1,000 passives services – Collecting over 80,000 metrics with Ganglia – Managing over 300 TB of data in Hadoop HDFS – Billions HTTP queries a day • Occasionally contribute to OpenSource projects – Ganglia (PHP and PERL module) – PHP Judy 2011 TubeMogul Incorporated All rights reserved. 2
  • 3. Introduction - About TubeMogul • Created in November 2006 by John Hughes and Brett Wilson • Formerly a video distribution and analytics platform • Acquire Illuminex - a flash analytics firm - in October 2008 • New platform call PlayTime™ : – TubeMogul is a Video Marketing Company – Built for Branding – Integrate real-time media buying, ad serving, targeting, optimization and brand measurement TubeMogul simplifies the delivery of video ads and maximizes the impact of every dollar spent by brand marketers http://www.tubemogul.com/company/about_us 2011 TubeMogul Incorporated All rights reserved. 3
  • 4. Our Environment • 10+ servers hosted at LiquidWeb • A few VPSes on Linode • 500+ instances on Amazon EC2 – Over 50 different server configurations • Our technology stack: – Java, PHP – Hadoop: HDFS, MapReduce, HBase, Hive – Membase – Memcache – MySQL – And more... • Monitoring with Nagios – Using NSCA when possible • Graphing and trending using Ganglia with Python plugins – Some legacy servers using Munin • Configuration management using Puppet
  • 5. Amazon Cloud Environment
  • 6. Amazon Cloud Environment • We like it because... – We can quickly start new servers/clusters – We can quickly start new servers/clusters in many regions • US East (Virginia) • US West (Northern California) • Europe (Dublin) • Asia Pacific (Tokyo & Singapore) – We can use different types of instances (RAM, CPU, disks, etc.) – It’s easy to automate with the EC2 API – It’s easy to plug into a configuration management tool • But... – It can be hard to troubleshoot some failures or network problems – We are occasionally notified of hardware failures after the fact – No multicast (though possible with Amazon VPC) – Bandwidth between regions can get expensive
  • 7. What’s the plan? • Our monitoring must be able to scale • We need a better graphing/trending solution • Our monitoring configuration must be automated – How do we monitor a cluster whose number of servers varies every hour? – How do we change configuration in multiple regions without missing something? • A failure in one region shouldn’t impact other regions • We want to be woken up only when it really matters • We have limited resources – Can’t spend big bucks on monitoring – Small operations team
  • 8. Graphing, Trending... [Diagram comparing Munin and Ganglia data collection. Munin labels: munin-update, munin-graph, munin-nodes, pull, sequential polling. Ganglia labels: Gmetad, Gmond, pull, push.]
  • 9. Graphing, Trending... • Why did we switch from Munin to Ganglia? – Pretty much: pull vs. push • The Munin server fetches data from Munin clients (munin-nodes) – Can quickly overload the Munin server in disk I/O and CPU – Data is collected sequentially, so each run is affected by the previous run time and server load • Ganglia clients send data to representative cluster nodes. Data is federated periodically by a Gmetad process. – Lighter on the aggregation side – Clients push data at a defined interval – Can use thresholds to send data only when it makes sense » using time_threshold and value_threshold in the metric – Ganglia is designed for clusters and grids • You can use multiple layers of gmond/gmetad processes • You don’t need to manually add servers to your configuration
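
The threshold behaviour mentioned on this slide can be illustrated with a small Python sketch. It is a simplified model, not gmond's actual implementation: the class `PushDecider` and its `observe` method are hypothetical names, and the assumed semantics are that a value is pushed when it has moved by more than `value_threshold` since the last push, or when `time_threshold` seconds have passed without a push.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PushDecider:
    """Hypothetical model of a push-side send decision for one metric."""

    time_threshold: float   # maximum seconds allowed between two pushes
    value_threshold: float  # minimum change that triggers an early push
    last_value: Optional[float] = None
    last_sent_at: Optional[float] = None

    def observe(self, value: float, now: float) -> bool:
        """Return True (and remember the push) if the value should be sent."""
        first = self.last_sent_at is None
        overdue = not first and now - self.last_sent_at >= self.time_threshold
        changed = not first and abs(value - self.last_value) > self.value_threshold
        if first or overdue or changed:
            self.last_value, self.last_sent_at = value, now
            return True
        return False
```

With `time_threshold=60` and `value_threshold=5`, a metric that stays flat is pushed about once a minute, while a sudden jump is pushed on the next collection. That is what keeps the aggregation side light compared with polling every node in sequence.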
  • 10. Monitoring with Nagios
  • 11. Automating Nagios configuration • Puppet configures our monitoring instance in each region – We use Nagios regex matching: use_regexp_matching=1 – But we don’t use true regex: use_true_regexp_matching=0 – We use NSCA with Upstart – We don’t use perfdata – We include our configurations from 3 directories - objects => templates, contacts, commands, event_handlers - servers => contains a configuration file for each server - clusters => contains a configuration file for each cluster
  • 12. Automating Nagios configuration Sequence of events when starting a new host and adding it to our monitoring: 1. We start a new instance using Cerveza and Cloud-init 2. Puppet configures Gmond on the instance 3. Our monitoring server running Gmetad gets data from the new instance 4. A Nagios check runs every minute and looks for new hosts in Ganglia 5. If a new host is found, the check script rebuilds the Nagios config and reloads Nagios 6. If the config is corrupt, the check script sends a critical alert
  • 13. Automating Nagios configuration • Each server configuration is generated from a template • Our Nagios plugin, “check_tm_clusters”, goes over the RRD files generated by Ganglia • If a new host is found, it simply copies the template to the servers config directory and replaces the variables using values reported by Ganglia and DNS entries
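
The discovery step described on these slides can be sketched in Python as follows. This is not the “check_tm_clusters” plugin itself, whose code is not shown in the talk. It assumes Gmetad's usual on-disk layout of `<rrd_root>/<cluster>/<host>/` with `__SummaryInfo__` directories holding aggregates. The function names, the template text and the `generic-host` template name are hypothetical.

```python
import os
import socket
from string import Template

SUMMARY_DIR = "__SummaryInfo__"  # aggregate data written by Gmetad, not a real host

# Hypothetical per-server template; a real one would also define services.
HOST_TEMPLATE = Template("""define host {
    use        $template
    host_name  $host
    address    $address
    hostgroups $cluster
}
""")


def list_ganglia_hosts(rrd_root):
    """Yield (cluster, host) for every host directory found under rrd_root."""
    for cluster in sorted(os.listdir(rrd_root)):
        cluster_dir = os.path.join(rrd_root, cluster)
        if cluster == SUMMARY_DIR or not os.path.isdir(cluster_dir):
            continue
        for host in sorted(os.listdir(cluster_dir)):
            if host != SUMMARY_DIR and os.path.isdir(os.path.join(cluster_dir, host)):
                yield cluster, host


def add_new_hosts(rrd_root, servers_dir, resolve=socket.gethostbyname,
                  template="generic-host"):
    """Write a config file for each host that has none yet; return the new names."""
    added = []
    for cluster, host in list_ganglia_hosts(rrd_root):
        path = os.path.join(servers_dir, host + ".cfg")
        if os.path.exists(path):
            continue  # already monitored
        text = HOST_TEMPLATE.substitute(template=template, host=host,
                                        address=resolve(host), cluster=cluster)
        with open(path, "w") as handle:
            handle.write(text)
        added.append(host)
    return added
```

A non-empty return value is the point at which a check script would validate the configuration (for example with `nagios -v`) and reload Nagios, raising a critical alert if validation fails.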
  • 14. Reducing noise and false positives • We disable most notifications and only care about cluster status • Most of our checks are based on Ganglia RRD files
  • 15. Reducing noise and false positives • It becomes really easy to monitor any metric returned by Ganglia
  • 16. Reducing noise and false positives • We can check cluster status by hosts/services but also by returned messages!
  • 17. Reducing noise and false positives • We use our “check_cluster” plugin extensively • We limit email notifications as much as possible • We use a custom variable _PAGING to identify pageable services • Paging ONLY on Critical alerts for services/hosts with _PAGING=yes • We use different contacts and time periods to send alerts to the right person • We use Nagios Checker for Firefox and Chrome
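
The cluster-level alerting and paging rule from these slides can be sketched in Python. The talk does not show the internals of “check_cluster”, so the aggregation below (counting non-OK members against warning and critical thresholds) is an assumption, and `cluster_state` and `should_page` are hypothetical names. The numeric states follow the standard Nagios plugin return codes.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin return codes


def cluster_state(member_states, warn_at, crit_at):
    """Collapse per-host (or per-service) states into one cluster state.

    The cluster turns WARNING once warn_at members are not OK and
    CRITICAL once crit_at members are not OK.
    """
    failing = sum(1 for state in member_states if state != OK)
    if failing >= crit_at:
        return CRITICAL
    if failing >= warn_at:
        return WARNING
    return OK


def should_page(state, custom_vars):
    """Page only on CRITICAL, and only for objects marked _PAGING=yes."""
    return state == CRITICAL and custom_vars.get("_PAGING", "").lower() == "yes"
```

For a ten-node cluster with `warn_at=2` and `crit_at=5`, a single failed node stays OK at the cluster level, so nobody is woken up for a failure the cluster can absorb.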
  • 18. Thank You... TubeMogul is Hiring! http://www.tubemogul.com/company/careers jobs@tubemogul.com Follow us on Twitter @tubemogul @orieg