Actionable Metrics
                      Enabling Decision-Making in Netflix’s Decentralized
                                         Environment

                                      Cloud Tech III
                                     October 6, 2012
                                      Roy Rapoport
                               @royrapoport, rsr@netflix.com

Thursday, October 18, 12
Me

                     • Been in tech for about 20 years
                     • Systems engineering, networking, software
                           development, QA, release management
                     • Time at Netflix: 1195 days (3y:3m:1w)
                     • (Current) job at Netflix: Make things better
                           (Security Monkey, Python Platform, Central Alert Gateway, Breaking Stuff ...)




Metrics Humor



                       % of instances with even public IP addresses




Technology Overview
                     • SoA, REST, Mostly Java
                     • Simple overall architecture:




Culture Overview
     • Freedom and Responsibility
     • Distributed Operations
     • Get out of the way of Developers



The Metric Lifecycle

                     •     Send
                     • Look
                     • Alert

Systems

                     • Flexible
                     • Scalable
                     • Self-Service


Telemetry
                             Flexible, Scalable, Self-Service
                   import netflix.metrics
                   [...]
                       self.nm = netflix.metrics.Metrics("core_cag")
                   [...]
                   def api(self):
                       self.nm.nfCounter("api")
                       [...]
                       self.nm.nfCounter("application_%s" % application)
                   [...]
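
A minimal sketch of what a counter client like this might do. `netflix.metrics` is an internal library whose implementation is not shown in the slides, so the `Metrics` class below is a hypothetical in-memory stand-in, and the `AlertGateway` class is invented for illustration. The namespace-prefixing behavior is an assumption, inferred from the `core_cag_api` metric name that appears in later slides.

```python
from collections import Counter


class Metrics:
    """Hypothetical in-memory stand-in for the internal netflix.metrics client."""

    def __init__(self, namespace):
        self.namespace = namespace
        self.counts = Counter()

    def nfCounter(self, name, amount=1):
        # Assumption: the metric name is the namespace plus the counter name,
        # e.g. "core_cag" + "api" -> "core_cag_api".
        self.counts["%s_%s" % (self.namespace, name)] += amount


class AlertGateway:
    """Hypothetical service that counts every API call and every caller."""

    def __init__(self):
        self.nm = Metrics("core_cag")

    def api(self, application):
        self.nm.nfCounter("api")                           # total API calls
        self.nm.nfCounter("application_%s" % application)  # calls per application


gateway = AlertGateway()
for caller in ("foo", "foo", "bar"):
    gateway.api(caller)
```

The point of the pattern is that adding a metric is a one-line change owned by the developer, with no central registration step.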




Visualization
                           Flexible, Scalable, Self-Service




Alerting
                           Flexible, Scalable, Self-Service



     • Static vs Dynamic Thresholds
     • Compare to history
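
The difference between a static threshold and a dynamic, history-based one can be sketched as follows. This is an illustration only, not the alerting system described in the talk: the function names, the averaging baseline, and the `max_deviation` parameter are all assumptions.

```python
def static_alert(value, low, high):
    """Static threshold: fire when the value leaves a fixed band."""
    return value < low or value > high


def dynamic_alert(value, history, max_deviation=0.5):
    """Dynamic threshold: fire when the value strays too far from its own history.

    `history` holds earlier observations of the same metric (for example the
    same hour on previous days); their mean serves as the baseline.
    """
    baseline = sum(history) / len(history)
    if baseline == 0:
        return value != 0
    return abs(value - baseline) / baseline > max_deviation


# A value of 160 sits inside a generous static band of 10..1000,
# but it is 60% above what this metric normally looks like.
fires_static = static_alert(160, low=10, high=1000)
fires_dynamic = dynamic_alert(160, history=[100, 90, 110])
```

The static check needs someone to pick `low` and `high` per metric; the dynamic check only needs the metric's own past.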




For Example ...
                           Last 3 hours’ core_tools.core_cag_api




                                         What the ...




For Example ...
                                  Visualization (Continued)

                           Last 4 days’ core_tools.core_cag_api




                                    even more questions!



For Example ...
                                   Visualization (Continued)

                           Last 10 days’ core_tools.core_cag_api




                                   What caused the spike?


For Example ...
                                 Visualization (Continued)

                           Show alert volume per application




                             Someone had a rough few days...


Don’t Like Surprises...
                 {
                     "alerts": [
                         {
                             "applyTo": "cluster",
                             "condition": {
                                 "minPercent": 90.0,
                                 "noise": 0.2,
                                 "maxPercent": 25.0,
                                 "type": "DoubleExponential"
                             },
                             "metricName": "core_cag_api",
                             "severity": "major"
                         }
                     ],
                     "clusters": [
                         "core_tools"
                     ]
                 }
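
The `"DoubleExponential"` condition type suggests double exponential smoothing, which tracks both the level and the trend of a series. The sketch below shows the general technique (Holt's linear method) used for anomaly flagging. It is not the alerting service's implementation: the slides do not say how `minPercent`, `maxPercent`, or `noise` are interpreted, so the sketch uses its own hypothetical `tolerance` parameter, and the smoothing factors `alpha` and `beta` are assumptions.

```python
def holt_forecasts(series, alpha=0.5, beta=0.5):
    """One-step-ahead forecasts for series[1:], via double exponential smoothing."""
    if len(series) < 2:
        raise ValueError("need at least two points")
    level = series[0]
    trend = series[1] - series[0]
    forecasts = []
    for x in series[1:]:
        forecast = level + trend          # prediction made before seeing x
        forecasts.append(forecast)
        new_level = alpha * x + (1 - alpha) * forecast
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


def anomalies(series, tolerance=0.25, alpha=0.5, beta=0.5):
    """Indexes whose value deviates from its forecast by more than `tolerance`."""
    flagged = []
    forecasts = holt_forecasts(series, alpha, beta)
    for i, (x, f) in enumerate(zip(series[1:], forecasts), start=1):
        if f != 0 and abs(x - f) / abs(f) > tolerance:
            flagged.append(i)
    return flagged


steady = [10, 20, 30, 40, 50]     # follows its trend, nothing to flag
spiky = [10, 20, 30, 40, 100]     # last point is double its forecast of 50
steady_flags = anomalies(steady)
spiky_flags = anomalies(spiky)
```

Because the forecast follows the trend, steadily growing traffic does not trip the alert, while a sudden jump does.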




Threshold Tuning


                     • An Abbreviated History ...



Threshold Tuning
                                               (in the beginning)

                     • Systems owned by IT
                     • Want an alert? Submit a ticket
                     • Want to tune an alert? Submit a ticket
                    Some priests offer their prayers to alien creatures best left
                    forgotten. This ill-advised worship twists their minds in odd
                    ways. Overlords find these warped men useful due to the
                    unnatural powers they can channel. The dark priests most
                    favored by their strange gods have powerful protections, and
                    defeating one of them is sure to bring down a terrible curse
                    upon the victor.
                      - http://www.descentinthedark.com/_d_/dark_priests.php


Threshold Tuning
                                        (It gets better)

                     • You get to configure your own threshold
                     • Freedom!
                     • Also, you have to configure your own thresholds




Threshold Tuning
                                  (Are we there yet?)

                     • Play with historical data
                     • Huge difference
                     • Still falls short



Threshold Tuning
                               (Yeah, that’s the ticket)

                     • Computers can be good at this




If Time Allows ...



Events vs Metrics

                     • Irregular Interval
                     • Point in time
                     • Lack magnitude
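
One way to picture the contrast: an event is just a name and a timestamp, with no value attached, and it arrives whenever it happens. The sketch below is an illustration under assumptions (the `Event` type and `events_per_bucket` helper are hypothetical, not part of any system named in the talk). It shows how counting events per fixed interval turns them into something metric-shaped.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A point-in-time occurrence: it has no magnitude, only a name and a time."""
    name: str
    timestamp: int  # seconds since the epoch


def events_per_bucket(events, bucket_seconds=60):
    """Count events per fixed-width time bucket, keyed by bucket start time."""
    counts = Counter()
    for event in events:
        bucket_start = event.timestamp - event.timestamp % bucket_seconds
        counts[bucket_start] += 1
    return dict(counts)


# Irregularly spaced deployments become a per-minute count.
deploys = [Event("deploy", t) for t in (5, 50, 65, 200)]
per_minute = events_per_bucket(deploys)
```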


Why Build It?

                     • Change management
                           •   Vs Change control

                     • What Changed?
                     • Better Alerting

Chronos
                     •     Rapidly Prototyped
                     •     Adapters and reporters
                     •     Easy querying
                     •     Alarming
                           •   Something happened
                           •   ... X times in Y minutes
                           •   Something didn’t happen
                     •     Medium volume
                     •     Recursive
                           •   Recursive
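
The three alarming conditions for Chronos (something happened, it happened X times in Y minutes, something didn't happen) can be sketched as checks over a list of event timestamps. This is a hypothetical illustration, not Chronos's actual API; the function names and the use of plain second-based timestamps are assumptions.

```python
def happened(timestamps, now, window):
    """True if at least one event fell inside the last `window` seconds."""
    return any(now - window <= t <= now for t in timestamps)


def happened_x_times(timestamps, now, x, window):
    """True if at least `x` events fell inside the last `window` seconds."""
    recent = sum(1 for t in timestamps if now - window <= t <= now)
    return recent >= x


def did_not_happen(timestamps, now, window):
    """True if an expected event is missing from the last `window` seconds."""
    return not happened(timestamps, now, window)


# Hypothetical event times, in seconds.
seen = [100, 130, 160]
recent_event = happened(seen, now=200, window=60)                 # 160 is in range
burst = happened_x_times(seen, now=200, x=3, window=120)          # all three in range
missed_heartbeat = did_not_happen(seen, now=400, window=60)       # nothing since 160
```

The third check is the one plain metrics handle poorly: the absence of an expected event (a nightly job, a heartbeat) is itself the signal.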



End Result
                     • Massive decrease in change control tickets
                           •   Not talking about SOX or PCI
                     • Better visibility into changes
                     • Decreased TTR
                           •   Especially for bad code deployments
                     • You should do this
I Didn’t Mention

                     • End-to-end testing and alerting
                     • External availability and performance
                     • Open Connect
                     • Jobs

Questions?




WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 

Kürzlich hochgeladen (20)

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 

Cloud Tech III: Actionable Metrics

  • 1. Actionable Metrics: Enabling Decision-Making in Netflix’s Decentralized Environment • Cloud Tech III • October 6, 2012 • Roy Rapoport • @royrapoport, rsr@netflix.com • Thursday, October 18, 12
  • 2. Me • Been in tech for about 20 years • Systems engineering, networking, software development, QA, release management • Time at Netflix: 1195 days (3y:3m:1w) • (Current) job at Netflix: Make things better (Security Monkey, Python Platform, Central Alert Gateway, Breaking Stuff ...)
  • 7. Metrics Humor • % of instances with even public IP addresses
  • 9-12. Technology Overview • SoA, REST, Mostly Java • Simple overall architecture: (diagram not included in this transcript)
  • 14-16. Culture Overview • Freedom and Responsibility • Distributed Operations • Get out of the way of Developers
  • 18-20. The Metric Lifecycle • Send • Look • Alert
  • 21. Systems • Flexible • Scalable • Self-Service
  • 22. Telemetry • Flexible, Scalable, Self-Service
        import netflix.metrics
        [...]
        self.nm = netflix.metrics.Metrics("core_cag")
        [...]
        def api(self):
            self.nm.nfCounter("api")
            [...]
            self.nm.nfCounter("application_%s" % application)
            [...]
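The calls on the telemetry slide can be tried without access to Netflix's internal library. The sketch below is a hypothetical in-memory stand-in: the `Metrics` class name, the `nfCounter` method, and the counter names come from the slide, while the underscore naming rule, the `counts` attribute, the `Gateway` class, and the application names are assumptions made for illustration, not the behavior of `netflix.metrics`.

```python
from collections import Counter


class Metrics:
    """Hypothetical in-memory stand-in for the Metrics object on the slide."""

    def __init__(self, namespace):
        self.namespace = namespace
        # Assumption: counts are kept locally; a real client would ship them
        # to a telemetry backend instead.
        self.counts = Counter()

    def nfCounter(self, name):
        # Assumed naming rule: namespace and counter name joined with "_",
        # which would yield the "core_cag_api" name seen on later slides.
        self.counts["%s_%s" % (self.namespace, name)] += 1


class Gateway:
    """Illustrative service; only api() and the counter names are from the slide."""

    def __init__(self):
        self.nm = Metrics("core_cag")

    def api(self, application):
        self.nm.nfCounter("api")                           # one tick per API call
        self.nm.nfCounter("application_%s" % application)  # one tick per caller


gateway = Gateway()
for app in ("billing", "billing", "search"):  # made-up application names
    gateway.api(app)
print(dict(gateway.nm.counts))
```

Because counter names are built at call time, a new application shows up as a new metric with no registration step, which is one way to read the "self-service" claim on the slide.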
  • 23-28. Visualization • Flexible, Scalable, Self-Service
  • 29-31. Alerting • Flexible, Scalable, Self-Service • Static vs Dynamic Thresholds • Compare to history
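The difference between a static threshold and one that compares against history can be sketched as follows. This is a minimal illustration, not the alerting system described in the talk: the function names, the standard-deviation rule, the `tolerance` parameter, and the sample numbers are all assumptions.

```python
from statistics import mean, stdev


def static_alert(value, limit):
    """Static threshold: alert whenever the value exceeds a fixed limit."""
    return value > limit


def dynamic_alert(value, history, tolerance=3.0):
    """Dynamic threshold: alert when the value strays from its own history.

    `history` holds values from comparable past periods (for example, the
    same hour on previous days). `tolerance` is the allowed number of
    standard deviations. Both are assumptions for this sketch.
    """
    baseline = mean(history)
    spread = stdev(history)
    return abs(value - baseline) > tolerance * spread


# Hypothetical request counts for the same hour on the previous five days.
same_hour_history = [980, 1010, 995, 1020, 1005]

print(static_alert(1500, limit=2000))          # False: under the fixed limit
print(dynamic_alert(1500, same_hour_history))  # True: far above the usual level
print(dynamic_alert(1000, same_hour_history))  # False: in line with history
```

The static check stays silent at 1500 because someone picked 2000 as the limit, while the history-based check flags the same value without anyone choosing a number by hand.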
  • 32. For Example ... • Last 3 hours’ core_tools.core_cag_api • What the ...
  • 33. For Example ... Visualization (Continued) • Last 4 days’ core_tools.core_cag_api • Even more questions!
  • 34. For Example ... Visualization (Continued) • Last 10 days’ core_tools.core_cag_api • What caused the spike?
  • 35. For Example ... Visualization (Continued) • Show alert volume per application • Someone had a rough few days...
  • 36. Don’t Like Surprises...
        {
          "alerts": [
            {
              "applyTo": "cluster",
              "condition": {
                "minPercent": 90.0,
                "noise": 0.2,
                "maxPercent": 25.0,
                "type": "DoubleExponential"
              },
              "metricName": "core_cag_api",
              "severity": "major"
            }
          ],
          "clusters": [ "core_tools" ]
        }
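The alert type name "DoubleExponential" suggests double exponential smoothing (Holt's linear method), which tracks both a level and a trend and forecasts the next value from them. The slide does not say how the alerting system computes its forecast or what `minPercent`, `maxPercent`, and `noise` mean, so the sketch below does not reproduce them. The function names, the `alpha` and `beta` smoothing factors, the `max_percent` band, and the sample data are all assumptions.

```python
def holt_forecast(series, alpha=0.5, beta=0.5):
    """One-step-ahead forecast using Holt's double exponential smoothing."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Initialise from the first two points: level is the second point,
    # trend is the first difference.
    level = series[1]
    trend = series[1] - series[0]
    for x in series[2:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + trend


def deviates(observed, forecast, max_percent):
    """True when observed differs from forecast by more than max_percent percent."""
    if forecast == 0:
        return observed != 0
    return abs(observed - forecast) / abs(forecast) * 100.0 > max_percent


history = [100, 110, 120, 130, 140]  # made-up, steadily growing metric
expected = holt_forecast(history)

print(expected)                                    # 150.0
print(deviates(152, expected, max_percent=25.0))   # False: within the band
print(deviates(300, expected, max_percent=25.0))   # True: a surprise
```

Because the forecast follows the trend, steady growth does not trigger the alert, while a sudden jump does.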
  • 37. Threshold Tuning • An Abbreviated History ...
  • 38-41. Threshold Tuning (in the beginning) • Systems owned by IT • Want an alert? Submit a ticket • Want to tune an alert? Submit a ticket • "Some priests offer their prayers to alien creatures best left forgotten. This ill-advised worship twists their minds in odd ways. Overlords find these warped men useful due to the unnatural powers they can channel. The dark priests most favored by their strange gods have powerful protections, and defeating one of them is sure to bring down a terrible curse upon the victor." - http://www.descentinthedark.com/_d_/dark_priests.php
  • 42-45. Threshold Tuning (It gets better) • You get to configure your own threshold • Freedom! • Also, you have to configure your own thresholds
  • 46-49. Threshold Tuning (Are we there yet?) • Play with historical data • Huge difference • Still falls short
  • 50-56. Threshold Tuning (Yeah, that’s the ticket) • Computers can be good at this
  • 57. If Time Allows ...
  • 58-61. Events vs Metrics • Irregular Interval • Point in time • Lack magnitude
  • 62-65. Why Build It? • Change management • Vs Change control • What Changed? • Better Alerting
  • 67-74. Chronos • Rapidly Prototyped • Adapters and reporters • Easy querying • Alarming (Something happened • ... X times in Y minutes • Something didn’t happen) • Medium volume • Recursive • Recursive
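The alarm conditions listed for Chronos ("something happened ... X times in Y minutes" and "something didn't happen") can be sketched over a plain list of event timestamps. The slides do not show Chronos's query interface, so the function names, arguments, and sample events here are hypothetical.

```python
from datetime import datetime, timedelta


def happened_too_often(timestamps, now, times, minutes):
    """'Something happened X times in Y minutes': count events in the window."""
    window_start = now - timedelta(minutes=minutes)
    recent = [t for t in timestamps if window_start <= t <= now]
    return len(recent) >= times


def did_not_happen(timestamps, now, minutes):
    """'Something didn't happen': no event seen within the last Y minutes."""
    window_start = now - timedelta(minutes=minutes)
    return not any(window_start <= t <= now for t in timestamps)


now = datetime(2012, 10, 6, 12, 0)
# Made-up deployment events 1, 4, 9 and 45 minutes before `now`.
deploys = [now - timedelta(minutes=m) for m in (1, 4, 9, 45)]

print(happened_too_often(deploys, now, times=3, minutes=10))  # True
print(did_not_happen(deploys, now, minutes=30))               # False
print(did_not_happen([], now, minutes=30))                    # True
```

The absence check is the one a metric threshold handles poorly: an event that never arrives produces no data point to compare against, so the alarm has to be driven by the clock rather than by incoming values.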
  • 76-81. End Result • Massive decrease in change control tickets • Not talking about SOX or PCI • Better visibility into changes • Decreased TTR • Especially for bad code deployments • You should do this
  • 82. I Didn’t Mention • End-to-end testing and alerting • External availability and performance • Open Connect • Jobs