Actionable Metrics
                      Enabling Decision-Making in Netflix’s Decentralized
                                         Environment

                                      Cloud Tech III
                                     October 6, 2012
                                      Roy Rapoport
                               @royrapoport, rsr@netflix.com

Thursday, October 18, 12
Me

                     • Been in tech for about 20 years
                     • Systems engineering, networking, software
                           development, QA, release management
                     • Time at Netflix: 1195 days (3y:3m:1w)
                     • (Current) job at Netflix: Make things better
                           (Security Monkey, Python Platform, Central Alert Gateway, Breaking Stuff.. )




Metrics Humor



                       % of instances with even public IP addresses




Technology Overview
                     • SoA, REST, Mostly Java
                     • Simple overall architecture:




Culture Overview
     • Freedom and
             Responsibility
     • Distributed
             Operations
     • Get out of the
             way of
             Developers



The Metric Lifecycle

                     •     Send
                     • Look
                     • Alert

Systems

                     • Flexible
                     • Scalable
                     • Self-Service


Telemetry
                             Flexible, Scalable, Self-Service
                   import netflix.metrics
                   [...]
                       self.nm = netflix.metrics.Metrics("core_cag")
                   [...]
                   def api(self):
                       self.nm.nfCounter("api")
                       [...]
                       self.nm.nfCounter("application_%s" % application)
                   [...]
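The `netflix.metrics` module is internal, so the following is only a minimal in-memory stand-in that mimics the counting pattern in the snippet above. The `Metrics` class body, the `counter` method, and `handle_api_call` are all hypothetical. The later slides show the metric as `core_cag_api`, which suggests the namespace is prefixed to the counter name; that naming rule is an inference, not something the slide states.

```python
from collections import Counter


class Metrics:
    """Hypothetical in-memory stand-in for a namespaced counter client."""

    def __init__(self, namespace):
        self.namespace = namespace
        self.counts = Counter()

    def counter(self, name, delta=1):
        # Assumed naming rule: prefix each counter with the namespace.
        key = "%s_%s" % (self.namespace, name)
        self.counts[key] += delta
        return self.counts[key]


def handle_api_call(metrics, application):
    # One counter for overall API traffic, one per calling application.
    metrics.counter("api")
    metrics.counter("application_%s" % application)


nm = Metrics("core_cag")
for app in ["search", "search", "billing"]:
    handle_api_call(nm, app)
```

After the loop, `nm.counts` holds one total for the API and one per application, which is the kind of breakdown the "alert volume per application" view relies on.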




Visualization
                           Flexible, Scalable, Self-Service




Alerting
                           Flexible, Scalable, Self-Service



     • Static vs Dynamic
             Thresholds
     • Compare to
             history
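As a rough illustration of the two bullets above, a static threshold can be contrasted with one derived from recent history. This is a sketch under assumptions: the function names, the sample data, and the mean plus three standard deviations rule are illustrative choices, not the alerting system described in the talk.

```python
from statistics import mean, stdev


def static_alert(value, threshold):
    # Fires whenever the value crosses a fixed, hand-picked line.
    return value > threshold


def dynamic_alert(value, history, num_stdevs=3.0):
    # Fires when the value is far from what recent history suggests is normal.
    baseline = mean(history)
    spread = stdev(history)
    return abs(value - baseline) > num_stdevs * spread


history = [100, 104, 98, 101, 97, 103, 99, 102]

# A jump to 120 stays under a static limit of 150 but is unusual given history.
static_result = static_alert(120, 150)
dynamic_result = dynamic_alert(120, history)
```

Here `static_result` is `False` and `dynamic_result` is `True`: the hand-picked limit misses a change that is obvious once the value is compared to its own past.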




For Example ...
                           Last 3 hours’ core_tools.core_cag_api




                                         What the ...




For Example ...
                                  Visualization (Continued)

                           Last 4 days’ core_tools.core_cag_api




                                    even more questions!



For Example ...
                                   Visualization (Continued)

                           Last 10 days’ core_tools.core_cag_api




                                   What caused the spike?


For Example ...
                                 Visualization (Continued)

                           Show alert volume per application




                             Someone had a rough few days...


Don’t Like Surprises...
                 {
                     "alerts": [
                         {
                             "applyTo": "cluster",
                             "condition": {
                                 "minPercent": 90.0,
                                 "noise": 0.2,
                                 "maxPercent": 25.0,
                                 "type": "DoubleExponential"
                             },
                             "metricName": "core_cag_api",
                             "severity": "major"
                         }
                     ],
                     "clusters": [
                         "core_tools"
                     ]
                 }
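The `"type": "DoubleExponential"` condition presumably refers to double exponential smoothing. The sketch below uses the textbook Holt formulation to forecast each point from the level and trend of the series so far, then flags points that deviate from the forecast by more than a percentage. The slide does not explain what `minPercent`, `maxPercent`, or `noise` mean, so the `max_percent`, `alpha`, and `beta` parameters here are hypothetical and are not claimed to match the real configuration.

```python
def double_exponential_forecasts(series, alpha=0.5, beta=0.5):
    """One-step-ahead forecasts from Holt's double exponential smoothing."""
    if len(series) < 2:
        raise ValueError("need at least two points")
    level = series[0]
    trend = series[1] - series[0]
    forecasts = [None]  # no forecast exists for the first point
    for value in series[1:]:
        forecasts.append(level + trend)
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return forecasts


def out_of_band(series, max_percent=25.0, alpha=0.5, beta=0.5):
    """Indexes where the observed value is more than max_percent off the forecast."""
    flagged = []
    forecasts = double_exponential_forecasts(series, alpha, beta)
    for i, (actual, predicted) in enumerate(zip(series, forecasts)):
        if predicted is None or predicted == 0:
            continue
        deviation = abs(actual - predicted) / abs(predicted) * 100.0
        if deviation > max_percent:
            flagged.append(i)
    return flagged


steady = out_of_band([10, 12, 14, 16, 18, 20])
spiky = out_of_band([10, 12, 14, 16, 18, 40, 22])
```

A steadily growing series produces no flags, because the trend term absorbs the growth. In the second series the spike at index 5 is flagged, and so is the point after it, because the spike pulls the next forecast upward.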




Threshold Tuning


                     • An Abbreviated History ...



Threshold Tuning
                                               (in the beginning)

                     • Systems owned by IT
                     • Want an alert? Submit a ticket
                     • Want to tune an alert? Submit a ticket
                    Some priests offer their prayers to alien creatures best left
                    forgotten. This ill-advised worship twists their minds in odd
                    ways. Overlords find these warped men useful due to the
                    unnatural powers they can channel. The dark priests most
                    favored by their strange gods have powerful protections, and
                    defeating one of them is sure to bring down a terrible curse
                    upon the victor.
                      - http://www.descentinthedark.com/_d_/dark_priests.php


Threshold Tuning
                                        (It gets better)

                     • You get to configure your own threshold
                     • Freedom!
                     • Also, you have to configure your own
                           thresholds




Threshold Tuning
                                  (Are we there yet?)

                     • Play with historical data
                     • Huge difference
                     • Still falls short
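One way to picture "playing with historical data" is to replay candidate thresholds against past values and count how often each would have fired. The function names and the sample series below are hypothetical; this is a sketch of the idea, not the tool used at Netflix.

```python
def backtest_threshold(history, threshold):
    """Return the indexes at which a 'value > threshold' alert would have fired."""
    return [i for i, value in enumerate(history) if value > threshold]


def summarize(history, candidates):
    # Map each candidate threshold to the number of alerts it would have produced.
    return {t: len(backtest_threshold(history, t)) for t in candidates}


history = [40, 42, 39, 95, 41, 43, 60, 38, 44, 97]
report = summarize(history, [50, 75, 100])
```

The report shows the trade-off directly: a threshold of 50 would have fired three times, 75 twice, and 100 never. It still leaves the choice to a person, which may be why the slide notes that this approach falls short.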



Threshold Tuning
                               (Yeah, that’s the ticket)

                     • Computers can be good at this




If Time Allows ...



Events vs Metrics

                     • Irregular Interval
                     • Point in time
                     • Lack magnitude
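The three properties above can be made concrete with a small sketch: an event is only a timestamp plus a description, and it becomes something metric-like only once events are counted per fixed interval. The `Event` type and `events_per_interval` helper are hypothetical names used for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: int      # seconds since some epoch; arrives at irregular times
    description: str    # what happened; there is no numeric magnitude


def events_per_interval(events, interval_seconds):
    """Count events in fixed-width buckets, yielding a regular-interval series."""
    buckets = Counter(e.timestamp // interval_seconds for e in events)
    return dict(sorted(buckets.items()))


events = [
    Event(5, "deploy"),
    Event(42, "config change"),
    Event(61, "deploy"),
    Event(190, "autoscale"),
]
series = events_per_interval(events, 60)
```

With 60-second buckets the four events land in buckets 0, 0, 1, and 3, so bucket 2 has no entry at all. That gap is exactly what distinguishes an event stream from a metric reported on every interval.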


Why Build It?

                     • Change management
                           •   Vs Change control

                     • What Changed?
                     • Better Alerting

Chronos
                     •     Rapidly Prototyped
                     •     Adapters and reporters
                     •     Easy querying
                     •     Alarming
                           •   Something happened
                           •   ... X times in Y minutes
                           •   Something didn’t happen
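The three alarm shapes listed for Chronos can be sketched as checks over a list of event timestamps. Chronos's actual interface is not shown in the slides, so the function names, the use of plain integer timestamps, and the half-open window are all assumptions.

```python
def count_in_window(timestamps, now, window_seconds):
    # Count events with now - window_seconds < t <= now.
    return sum(1 for t in timestamps if now - window_seconds < t <= now)


def happened(timestamps, now, window_seconds):
    """Something happened: at least one event in the trailing window."""
    return count_in_window(timestamps, now, window_seconds) >= 1


def happened_x_times(timestamps, now, x, window_seconds):
    """Something happened X times in Y minutes (Y given in seconds here)."""
    return count_in_window(timestamps, now, window_seconds) >= x


def did_not_happen(timestamps, now, window_seconds):
    """Something didn't happen: no event in the trailing window."""
    return count_in_window(timestamps, now, window_seconds) == 0


timestamps = [100, 130, 160, 400]

burst = happened_x_times(timestamps, now=200, x=3, window_seconds=120)
recent = happened(timestamps, now=600, window_seconds=300)
silent = did_not_happen(timestamps, now=600, window_seconds=120)
```

All three results are `True` for this data: three events fall in the two minutes before t=200, one event falls in the five minutes before t=600, and none in the last two minutes. The "didn't happen" case is the one a metric threshold struggles to express, since there is no data point to compare against.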




Chronos
                     •     Rapidly Prototyped
                     •     Adapters and reporters
                     •     Easy querying
                     •     Alarming
                     •     Medium volume
                     •     Recursive
                           •   Recursive



End Result
                     • Massive decrease in change control tickets
                      • Not talking about SOX or PCI
                     • Better visibility into changes
                     • Decreased TTR
                      • Especially for bad code deployments
                     • You should do this
I Didn’t Mention

                     • End-to-end testing and alerting
                     • External availability and performance
                     • Open Connect
                     • Jobs

Questions?




Thursday, October 18, 12

Weitere ähnliche Inhalte

Andere mochten auch

Canary Analyze All the Things
Canary Analyze All the ThingsCanary Analyze All the Things
Canary Analyze All the Thingsroyrapoport
 
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...Adrian Cockcroft
 
Traffic anomaly detection and attack
Traffic anomaly detection and attackTraffic anomaly detection and attack
Traffic anomaly detection and attackQrator Labs
 
Anomaly Detection for Security
Anomaly Detection for SecurityAnomaly Detection for Security
Anomaly Detection for SecurityCody Rioux
 
The Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteThe Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteAlois Reitbauer
 
Cassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSCassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSAdrian Cockcroft
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsAnomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsManojit Nandi
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteTed Dunning
 
Parallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysisParallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysisManojit Nandi
 
Monitoring large scale Docker production environments
Monitoring large scale Docker production environmentsMonitoring large scale Docker production environments
Monitoring large scale Docker production environmentsAlois Reitbauer
 
Monitoring without alerts
Monitoring without alertsMonitoring without alerts
Monitoring without alertsAlois Reitbauer
 
The Dark Art of Production Alerting
The Dark Art of Production AlertingThe Dark Art of Production Alerting
The Dark Art of Production AlertingAlois Reitbauer
 
Can a monitoring tool pass the turing test
Can a monitoring tool pass the turing testCan a monitoring tool pass the turing test
Can a monitoring tool pass the turing testAlois Reitbauer
 
The definition of normal - An introduction and guide to anomaly detection.
The definition of normal - An introduction and guide to anomaly detection. The definition of normal - An introduction and guide to anomaly detection.
The definition of normal - An introduction and guide to anomaly detection. Alois Reitbauer
 
Ruxit - How we launched a global monitoring platform on AWS in 80 days.
Ruxit - How we launched a global monitoring platform on AWS in 80 days. Ruxit - How we launched a global monitoring platform on AWS in 80 days.
Ruxit - How we launched a global monitoring platform on AWS in 80 days. Alois Reitbauer
 
Monitoring Docker Application in Production
Monitoring Docker Application in ProductionMonitoring Docker Application in Production
Monitoring Docker Application in ProductionAlois Reitbauer
 
Anomaly Detection for Global Scale at Netflix
Anomaly Detection for Global Scale at NetflixAnomaly Detection for Global Scale at Netflix
Anomaly Detection for Global Scale at NetflixExtract Data Conference
 
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...tboubez
 

Andere mochten auch (19)

Canary Analyze All the Things
Canary Analyze All the ThingsCanary Analyze All the Things
Canary Analyze All the Things
 
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
 
Traffic anomaly detection and attack
Traffic anomaly detection and attackTraffic anomaly detection and attack
Traffic anomaly detection and attack
 
Anomaly Detection for Security
Anomaly Detection for SecurityAnomaly Detection for Security
Anomaly Detection for Security
 
The Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteThe Dark of Building an Production Incident Syste
The Dark of Building an Production Incident Syste
 
Cassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSCassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWS
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsAnomaly Detection for Real-World Systems
Anomaly Detection for Real-World Systems
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
Parallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysisParallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysis
 
Monitoring large scale Docker production environments
Monitoring large scale Docker production environmentsMonitoring large scale Docker production environments
Monitoring large scale Docker production environments
 
Monitoring without alerts
Monitoring without alertsMonitoring without alerts
Monitoring without alerts
 
The Dark Art of Production Alerting
The Dark Art of Production AlertingThe Dark Art of Production Alerting
The Dark Art of Production Alerting
 
Can a monitoring tool pass the turing test
Can a monitoring tool pass the turing testCan a monitoring tool pass the turing test
Can a monitoring tool pass the turing test
 
PyGotham 2016
PyGotham 2016PyGotham 2016
PyGotham 2016
 
The definition of normal - An introduction and guide to anomaly detection.
The definition of normal - An introduction and guide to anomaly detection. The definition of normal - An introduction and guide to anomaly detection.
The definition of normal - An introduction and guide to anomaly detection.
 
Ruxit - How we launched a global monitoring platform on AWS in 80 days.
Ruxit - How we launched a global monitoring platform on AWS in 80 days. Ruxit - How we launched a global monitoring platform on AWS in 80 days.
Ruxit - How we launched a global monitoring platform on AWS in 80 days.
 
Monitoring Docker Application in Production
Monitoring Docker Application in ProductionMonitoring Docker Application in Production
Monitoring Docker Application in Production
 
Anomaly Detection for Global Scale at Netflix
Anomaly Detection for Global Scale at NetflixAnomaly Detection for Global Scale at Netflix
Anomaly Detection for Global Scale at Netflix
 
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
 

Ähnlich wie Cloud Tech III: Actionable Metrics

Falling in Love with Frontend Exception | Devon 2012
Falling in Love with Frontend Exception | Devon 2012Falling in Love with Frontend Exception | Devon 2012
Falling in Love with Frontend Exception | Devon 2012Daum DNA
 
Internship dotCloud
Internship dotCloudInternship dotCloud
Internship dotCloudJill Mee
 
“Let Me Comment on Your Video”: Supporting Personalized End-User Comments wit...
“Let Me Comment on Your Video”: Supporting Personalized End-User Comments wit...“Let Me Comment on Your Video”: Supporting Personalized End-User Comments wit...
“Let Me Comment on Your Video”: Supporting Personalized End-User Comments wit...Rodrigo Laiola Guimarães
 
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdfOpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdfOpenStack Foundation
 
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"Randy Bias
 
Cloudsearch @ ex.fm
Cloudsearch @ ex.fmCloudsearch @ ex.fm
Cloudsearch @ ex.fm__lucas
 
Java performance: What's the big deal? - Trisha Gee
Java performance: What's the big deal? - Trisha GeeJava performance: What's the big deal? - Trisha Gee
Java performance: What's the big deal? - Trisha GeeJAX London
 
Migrando do App Engine para o Heroku
Migrando do App Engine para o HerokuMigrando do App Engine para o Heroku
Migrando do App Engine para o HerokuFilipe Ximenes
 
App in the Air - Product Demo (Sep 2012)
App in the Air - Product Demo (Sep 2012)App in the Air - Product Demo (Sep 2012)
App in the Air - Product Demo (Sep 2012)Empatika
 
Retro-Fitting Atlassian Products into a Code-Cowboy Research Culture
Retro-Fitting Atlassian Products into a Code-Cowboy Research CultureRetro-Fitting Atlassian Products into a Code-Cowboy Research Culture
Retro-Fitting Atlassian Products into a Code-Cowboy Research CultureAtlassian
 
Reactive applications using Akka
Reactive applications using AkkaReactive applications using Akka
Reactive applications using AkkaMiguel Pastor
 
Bio-IT for Core Facility Managers
Bio-IT for Core Facility ManagersBio-IT for Core Facility Managers
Bio-IT for Core Facility ManagersChris Dagdigian
 
Triage: real-world error logging for web applications
Triage: real-world error logging for web applicationsTriage: real-world error logging for web applications
Triage: real-world error logging for web applicationsLuke Cawood
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkEvan Chan
 
Phpday - Automated acceptance testing with Behat and Mink
Phpday - Automated acceptance testing with Behat and MinkPhpday - Automated acceptance testing with Behat and Mink
Phpday - Automated acceptance testing with Behat and MinkRichard Tuin
 
Secrets of the asset pipeline
Secrets of the asset pipelineSecrets of the asset pipeline
Secrets of the asset pipelineKen Collins
 

Ähnlich wie Cloud Tech III: Actionable Metrics (20)

Falling in Love with Frontend Exception | Devon 2012
Falling in Love with Frontend Exception | Devon 2012Falling in Love with Frontend Exception | Devon 2012
Falling in Love with Frontend Exception | Devon 2012
 
Internship dotCloud
Internship dotCloudInternship dotCloud
Internship dotCloud
 
April JavaScript Tools
April JavaScript ToolsApril JavaScript Tools
April JavaScript Tools
 
What is SCRUM?
What is SCRUM?What is SCRUM?
What is SCRUM?
 
“Let Me Comment on Your Video”: Supporting Personalized End-User Comments wit...
“Let Me Comment on Your Video”: Supporting Personalized End-User Comments wit...“Let Me Comment on Your Video”: Supporting Personalized End-User Comments wit...
“Let Me Comment on Your Video”: Supporting Personalized End-User Comments wit...
 
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdfOpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf
 
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"
 
Cloudsearch @ ex.fm
Cloudsearch @ ex.fmCloudsearch @ ex.fm
Cloudsearch @ ex.fm
 
hello-my-name-is-software-testing-v2-pdf
hello-my-name-is-software-testing-v2-pdfhello-my-name-is-software-testing-v2-pdf
hello-my-name-is-software-testing-v2-pdf
 
KubeSecOps
KubeSecOpsKubeSecOps
KubeSecOps
 
Java performance: What's the big deal? - Trisha Gee
Java performance: What's the big deal? - Trisha GeeJava performance: What's the big deal? - Trisha Gee
Java performance: What's the big deal? - Trisha Gee
 
Migrando do App Engine para o Heroku
Migrando do App Engine para o HerokuMigrando do App Engine para o Heroku
Migrando do App Engine para o Heroku
 
App in the Air - Product Demo (Sep 2012)
App in the Air - Product Demo (Sep 2012)App in the Air - Product Demo (Sep 2012)
App in the Air - Product Demo (Sep 2012)
 
Retro-Fitting Atlassian Products into a Code-Cowboy Research Culture
Retro-Fitting Atlassian Products into a Code-Cowboy Research CultureRetro-Fitting Atlassian Products into a Code-Cowboy Research Culture
Retro-Fitting Atlassian Products into a Code-Cowboy Research Culture
 
Reactive applications using Akka
Reactive applications using AkkaReactive applications using Akka
Reactive applications using Akka
 
Bio-IT for Core Facility Managers
Bio-IT for Core Facility ManagersBio-IT for Core Facility Managers
Bio-IT for Core Facility Managers
 
Triage: real-world error logging for web applications
Triage: real-world error logging for web applicationsTriage: real-world error logging for web applications
Triage: real-world error logging for web applications
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark
 
Phpday - Automated acceptance testing with Behat and Mink
Phpday - Automated acceptance testing with Behat and MinkPhpday - Automated acceptance testing with Behat and Mink
Phpday - Automated acceptance testing with Behat and Mink
 
Secrets of the asset pipeline
Secrets of the asset pipelineSecrets of the asset pipeline
Secrets of the asset pipeline
 

Cloud Tech III: Actionable Metrics

  • 1. Actionable Metrics Enabling Decision-Making in Netflix’s Decentralized Environment Cloud Tech III October 6, 2012 Roy Rapoport @royrapoport, rsr@netflix.com Thursday, October 18, 12
  • 2. Me • Been in tech for about 20 years • Systems engineering, networking, software development, QA, release management • Time at Netflix: 1195 days (3y:3m:1w) • (Current) job at Netflix: Make things better (Security Monkey, Python Platform, Central Alert Gateway, Breaking Stuff ...)
  • 7. Metrics Humor % of instances with even public IP addresses
  • 9. Technology Overview • SoA, REST, Mostly Java
  • 10-12. Technology Overview • SoA, REST, Mostly Java • Simple overall architecture:
  • 14. Culture Overview • Freedom and Responsibility
  • 15. Culture Overview • Freedom and Responsibility • Distributed Operations
  • 16. Culture Overview • Freedom and Responsibility • Distributed Operations • Get out of the way of Developers
  • 18. The Metric Lifecycle • Send
  • 19. The Metric Lifecycle • Send • Look
  • 20. The Metric Lifecycle • Send • Look • Alert
  • 21. Systems • Flexible • Scalable • Self-Service
  • 22. Telemetry Flexible, Scalable, Self-Service import netflix.metrics [...] self.nm = netflix.metrics.Metrics("core_cag") [...] def api(self): self.nm.nfCounter("api") [...] self.nm.nfCounter("application_%s" % application) [...]
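The telemetry slide shows only fragments of the calling code, and `netflix.metrics` is an internal library that is not publicly documented. The sketch below is a hypothetical in-memory stand-in, written only to illustrate the self-service counter pattern the slide implies: a service creates a namespaced `Metrics` object and bumps named counters wherever it likes. The `Metrics` and `nfCounter` names come from the slide; the `amount` parameter, the `AlertGateway` class, and the way the namespace is prefixed onto counter names are assumptions.

```python
from collections import Counter


class Metrics:
    """Hypothetical in-memory stand-in for the slide's netflix.metrics.Metrics."""

    def __init__(self, namespace):
        self.namespace = namespace
        self.counts = Counter()

    def nfCounter(self, name, amount=1):
        # Assumption: counter names are prefixed with the namespace,
        # e.g. "core_cag" + "api" -> "core_cag_api".
        self.counts["%s_%s" % (self.namespace, name)] += amount


class AlertGateway:
    """Toy service that counts API calls overall and per calling application."""

    def __init__(self):
        self.nm = Metrics("core_cag")

    def api(self, application):
        self.nm.nfCounter("api")
        self.nm.nfCounter("application_%s" % application)


gateway = AlertGateway()
for app in ["billing", "search", "billing"]:
    gateway.api(app)

print(dict(gateway.nm.counts))
```

Because the per-application counter name is built at call time, a new caller gets its own metric without anyone registering it first, which is the self-service property the slide title emphasizes.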
  • 23-28. Visualization Flexible, Scalable, Self-Service
  • 29. Alerting Flexible, Scalable, Self-Service
  • 30. Alerting Flexible, Scalable, Self-Service • Static vs Dynamic Thresholds
  • 31. Alerting Flexible, Scalable, Self-Service • Static vs Dynamic Thresholds • Compare to history
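The alerting slides contrast static thresholds with dynamic ones that compare a value to history, without showing how either is computed. As a rough illustration only, the sketch below puts a fixed limit next to a check against the mean and standard deviation of recent samples. The function names, the `tolerance` parameter, and the sample data are all assumptions, and the standard-deviation rule is a generic technique rather than the one used at Netflix.

```python
from statistics import mean, stdev


def static_alert(value, limit):
    """Static threshold: alert whenever the value exceeds a fixed limit."""
    return value > limit


def history_alert(value, history, tolerance=3.0):
    """Dynamic threshold: alert when the value is more than `tolerance`
    standard deviations away from the mean of the historical samples."""
    mu = mean(history)
    sigma = stdev(history)
    return abs(value - mu) > tolerance * sigma


# Two hypothetical services with very different normal request rates.
quiet_history = [10, 12, 11, 9, 10, 11]
busy_history = [1000, 1100, 950, 1050, 990, 1020]

# One fixed limit of 500 fits neither service.
print(static_alert(60, 500), history_alert(60, quiet_history))
print(static_alert(1010, 500), history_alert(1010, busy_history))
```

The fixed limit misses a sixfold jump on the quiet service and fires on ordinary traffic for the busy one, while the history-based check gets both cases right with no per-service tuning.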
  • 32. For Example ... Last 3 hours’ core_tools.core_cag_api What the ...
  • 33. For Example ... Visualization (Continued) Last 4 days’ core_tools.core_cag_api even more questions!
  • 34. For Example ... Visualization (Continued) Last 10 days’ core_tools.core_cag_api What caused the spike?
  • 35. For Example ... Visualization (Continued) Show alert volume per application Someone had a rough few days...
  • 36. Don’t Like Surprises... { "alerts": [ { "applyTo": "cluster", "condition": { "minPercent": 90.0, "noise": 0.2, "maxPercent": 25.0, "type": "DoubleExponential" }, "metricName": "core_cag_api", "severity": "major" } ], "clusters": [ "core_tools" ] }
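The alert configuration above names a "DoubleExponential" condition but does not explain how it is evaluated, nor what `minPercent`, `maxPercent`, and `noise` mean. The sketch below does not try to reproduce those semantics. It only shows the general technique the name suggests, double exponential (Holt) smoothing: track a level and a trend, forecast the next sample, and flag samples that stray too far from the forecast. The function name and the `alpha`, `beta`, and `max_deviation_percent` parameters are assumptions chosen for illustration.

```python
def holt_anomalies(series, alpha=0.2, beta=0.2, max_deviation_percent=25.0):
    """Return indices of samples that deviate from a one-step-ahead
    double exponential smoothing forecast by more than the given percent."""
    if len(series) < 3:
        return []

    # Initialise the level from the first sample and the trend from the first step.
    level = series[0]
    trend = series[1] - series[0]
    anomalies = []

    for i in range(1, len(series)):
        observed = series[i]
        forecast = level + trend  # prediction made before seeing `observed`

        # The first two samples seeded the model, so start checking at index 2.
        if i >= 2 and forecast != 0:
            deviation = abs(observed - forecast) / abs(forecast) * 100.0
            if deviation > max_deviation_percent:
                anomalies.append(i)

        # Update the smoothed level and trend with the new observation.
        new_level = alpha * observed + (1 - alpha) * forecast
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level

    return anomalies


# A steadily growing metric with one surprise at index 5.
samples = [100, 102, 104, 106, 108, 200, 112]
print(holt_anomalies(samples))
```

Because the forecast follows the trend, steady growth does not trigger an alert, whereas a static threshold set for today's traffic would need retuning as the metric grows.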
  • 37. Threshold Tuning • An Abbreviated History ...
  • 38. Threshold Tuning (in the beginning) "Some priests offer their prayers to alien creatures best left forgotten. This ill-advised worship twists their minds in odd ways. Overlords find these warped men useful due to the unnatural powers they can channel. The dark priests most favored by their strange gods have powerful protections, and defeating one of them is sure to bring down a terrible curse upon the victor." - http://www.descentinthedark.com/_d_/dark_priests.php
  • 39. Threshold Tuning (in the beginning) • Systems owned by IT "Some priests offer their prayers to alien creatures best left forgotten. This ill-advised worship twists their minds in odd ways. Overlords find these warped men useful due to the unnatural powers they can channel. The dark priests most favored by their strange gods have powerful protections, and defeating one of them is sure to bring down a terrible curse upon the victor." - http://www.descentinthedark.com/_d_/dark_priests.php
  • 40. Threshold Tuning (in the beginning) • Systems owned by IT • Want an alert? Submit a ticket "Some priests offer their prayers to alien creatures best left forgotten. This ill-advised worship twists their minds in odd ways. Overlords find these warped men useful due to the unnatural powers they can channel. The dark priests most favored by their strange gods have powerful protections, and defeating one of them is sure to bring down a terrible curse upon the victor." - http://www.descentinthedark.com/_d_/dark_priests.php
  • 41. Threshold Tuning (in the beginning) • Systems owned by IT • Want an alert? Submit a ticket • Want to tune an alert? Submit a ticket "Some priests offer their prayers to alien creatures best left forgotten. This ill-advised worship twists their minds in odd ways. Overlords find these warped men useful due to the unnatural powers they can channel. The dark priests most favored by their strange gods have powerful protections, and defeating one of them is sure to bring down a terrible curse upon the victor." - http://www.descentinthedark.com/_d_/dark_priests.php
  • 42. Threshold Tuning (It gets better)
  • 43. Threshold Tuning (It gets better) • You get to configure your own threshold
  • 44. Threshold Tuning (It gets better) • You get to configure your own threshold • Freedom!
  • 45. Threshold Tuning (It gets better) • You get to configure your own threshold • Freedom! • Also, you have to configure your own thresholds
  • 46. Threshold Tuning (Are we there yet?)
  • 47. Threshold Tuning (Are we there yet?) • Play with historical data
  • 48. Threshold Tuning (Are we there yet?) • Play with historical data • Huge difference
  • 49. Threshold Tuning (Are we there yet?) • Play with historical data • Huge difference • Still falls short
  • 50. Threshold Tuning (Yeah, that’s the ticket)
  • 51-52. Threshold Tuning (Yeah, that’s the ticket) • Computers can be good at this
  • 53. Threshold Tuning (Yeah, that’s the ticket)
  • 54. Threshold Tuning (Yeah, that’s the ticket) • Computers can be good at this
  • 55. Threshold Tuning (Yeah, that’s the ticket)
  • 56. Threshold Tuning (Yeah, that’s the ticket) • Computers can be good at this
  • 57. If Time Allows ...
  • 58. Events vs Metrics
  • 59. Events vs Metrics • Irregular Interval
  • 60. Events vs Metrics • Irregular Interval • Point in time
  • 61. Events vs Metrics • Irregular Interval • Point in time • Lack magnitude
  • 62. Why Build It?
  • 63. Why Build It? • Change management • Vs Change control
  • 64. Why Build It? • Change management • Vs Change control • What Changed?
  • 65. Why Build It? • Change management • Vs Change control • What Changed? • Better Alerting
  • 67. Chronos • Rapidly Prototyped
  • 68. Chronos • Rapidly Prototyped • Adapters and reporters
  • 69. Chronos • Rapidly Prototyped • Adapters and reporters • Easy querying
  • 70. Chronos • Rapidly Prototyped • Adapters and reporters • Easy querying • Alarming (Something happened)
  • 71. Chronos • Rapidly Prototyped • Adapters and reporters • Easy querying • Alarming (Something happened • ... X times in Y minutes)
  • 72. Chronos • Rapidly Prototyped • Adapters and reporters • Easy querying • Alarming (Something happened • ... X times in Y minutes • Something didn’t happen)
  • 73. Chronos • Rapidly Prototyped • Adapters and reporters • Easy querying • Alarming • Medium volume
  • 74. Chronos • Rapidly Prototyped • Adapters and reporters • Easy querying • Alarming • Medium volume • Recursive • Recursive
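The Chronos slides list three alarm conditions on events: something happened, it happened X times in Y minutes, and something didn't happen. The deck does not show how Chronos evaluates them, so the following is only a minimal sketch of the idea over a list of event timestamps. The function names, their parameters, and the use of plain seconds for timestamps are assumptions.

```python
def events_in_window(timestamps, now, window_seconds):
    """Return the events that fall in the last `window_seconds` up to `now`."""
    return [t for t in timestamps if now - window_seconds <= t <= now]


def happened_at_least(timestamps, now, count, window_seconds):
    """'X times in Y minutes': at least `count` events inside the window."""
    return len(events_in_window(timestamps, now, window_seconds)) >= count


def did_not_happen(timestamps, now, window_seconds):
    """'Something didn't happen': no events at all inside the window."""
    return not events_in_window(timestamps, now, window_seconds)


# Hypothetical deployment events, as seconds since some start time.
deploys = [0, 60, 120, 900]

print(happened_at_least(deploys, now=180, count=3, window_seconds=300))
print(did_not_happen(deploys, now=800, window_seconds=300))
print(did_not_happen(deploys, now=1000, window_seconds=300))
```

These checks need only timestamps, which fits the earlier point that events are irregular points in time without a magnitude: the signal is in how many events occurred and when, not in a measured value.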
  • 76. End Result • Massive decrease in change control tickets
  • 77. End Result • Massive decrease in change control tickets • Not talking about SOX or PCI
  • 78. End Result • Massive decrease in change control tickets • Not talking about SOX or PCI • Better visibility into changes
  • 79. End Result • Massive decrease in change control tickets • Not talking about SOX or PCI • Better visibility into changes • Decreased TTR
  • 80. End Result • Massive decrease in change control tickets • Not talking about SOX or PCI • Better visibility into changes • Decreased TTR • Especially for bad code deployments
  • 81. End Result • Massive decrease in change control tickets • Not talking about SOX or PCI • Better visibility into changes • Decreased TTR • Especially for bad code deployments • You should do this
  • 82. I Didn’t Mention • End-to-end testing and alerting • External availability and performance • Open Connect • Jobs