Rainmakers
How Netflix Operates Clouds for Maximum Freedom and Agility



          Jeremy Edberg
     Reliability Architect, Netflix
Do you have...

                          •     A release engineer?
                          •     A QA department?
                          •     Chef or Puppet to
                                manage your systems?



Tweet @jedberg with feedback!
Do you have...


                      •    Upwards of 100 releases a
                           day?




With more than 30 million streaming members in the United States, Canada, Latin America, the United Kingdom, Ireland and the Nordics, Netflix is the world's leading internet subscription service for enjoying movies and TV programs streamed over the internet to PCs, Macs and TV.
Source: http://ir.netflix.com



The Netflix Way
                 •     Everything is “built for three”
                 •     Fully automated build tools to test
                       and make packages
                 •     Fully automated machine image
                       bakery
                 •     Fully automated image deployment
                 •     Independent teams responsible for
                       both Dev and Ops
Philosophy




Automate all the things!

                    •     Application startup
                    •     Configuration
                    •     Code deployment
                    •     System deployment



Automation

                     •     Standard base image
                     •     Tools to manage all the
                           systems
                     •     Automated code deployment



Shared state should be stored in a shared service.

Data on an instance should be replicated to other instances.
“Build for Three”
   We hold a boot camp for new engineers to teach them
     how to build for a highly distributed environment.




Netflix on AWS
[Figure: Netflix on AWS; recoverable labels: 2012, IPv6, Open Connect]
Highly aligned, loosely coupled
                 •     Services are built by different
                       teams who work together to figure
                       out what each service will provide.
                 •     The service owner publishes an
                       API that anyone can use.



Advantages to a Service Oriented Architecture
             •     Easier auto-scaling
             •     Easier capacity planning
             •     Identify problematic code-paths more easily
             •     Narrow down the effects of a change
             •     More efficient local caching
Freedom and Responsibility

                  •    Developers deploy when they
                       want
                  •    They also manage their own
                       capacity and autoscaling
                  •    And fix anything that breaks at
                       4am!


All system choices assume some part will fail at some point.
The Monkey Theory
                          • Simulate
                                things that go
                                wrong
                          • Find things
                                that are
                                different
Execution




                                      Photo from I, Robot, copyright 20th Century Fox
Netflix built a global PaaS

               • Service Oriented Architecture
               • HTTP/REST interfaces between services
Netflix PaaS features
           •     Supports all regions and zones
           •     Multiple accounts
           •     Cross region/account replication
           •     Internationalized, localized and GeoIP
                 routed
           •     Advanced key management
           •     Autoscaling with 1000s of instances
           •     Monitoring and alerting on millions of
                 metrics
What AWS Provides
      •    Instances
      •    Machine Images
      •    Elastic IPs
      •    Load Balancers
      •    Security groups / Autoscaling groups
      •    Availability zones and regions

Linux Base AMI (CentOS or Ubuntu)
     •     Optional Apache
     •     Monitoring
     •     Log rotation to S3
     •     AppDynamics Machine Agent
     •     Java (JDK 6 or 7)
          •   AppDynamics App Agent monitoring
          •   GC and thread dump logging
          •   Tomcat
               •   Application war file, base servlet, platform,
                   interface jars for dependent services
               •   Healthcheck, status servlets, JMX interface,
                   Servo autoscale
The Netflix Platform
     •     Discovery (Eureka)
     •     Entrypoints (Edda)
     •     Configuration (Archaius)
     •     Zookeeper (Exhibitor)
     •     Logging (Blitz4j & Honu)
     •     NIWS
     •     Geo
     •     Base
     •     Circuit Breakers (Hystrix)
     •     Cassandra (Priam & Astyanax & CassJMeter)
     •     Cryptex
     •     AKMS
     •     EVCache
     •     Proxies
     •     i18n
     •     L10n
[Legend: Open Source]
Open Source at Netflix
     •     Governator
     •     Blitz4j
     •     Edda
     •     Hystrix
Finding things
     •     Discovery (Eureka)
          •   Application to instance mapping
          •   Heartbeat to keep track of health
     •     Entrypoints (Edda)
          •   Local database of AWS resources
     •     NIWS (Netflix Internal Web Service)
          •   On instance software load balancer
          •   Handles retry logic
     •     Geo (Geolocation library)
          •   Provides IP to Lat/Lon mapping for any service that
              needs it.
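
The discovery bullets above (application to instance mapping, heartbeats to track health) can be sketched with a small in-memory registry. This is not Eureka's API: the class name, method names, and the 90-second lease are assumptions made purely for illustration.

```python
import time


class Registry:
    """Toy application-to-instance registry with heartbeat expiry (hypothetical)."""

    def __init__(self, timeout_s=90.0, clock=time.monotonic):
        self.timeout_s = timeout_s   # assumed lease length, not a Eureka setting
        self.clock = clock           # injectable clock keeps the sketch testable
        self._last_seen = {}         # (app, instance_id) -> time of last heartbeat

    def heartbeat(self, app, instance_id):
        """Register the instance, or renew its lease if already known."""
        self._last_seen[(app, instance_id)] = self.clock()

    def instances(self, app):
        """Return the instance IDs of `app` whose heartbeat is still fresh."""
        now = self.clock()
        return sorted(
            iid for (name, iid), seen in self._last_seen.items()
            if name == app and now - seen <= self.timeout_s
        )
```

An instance that stops sending heartbeats simply drops out of `instances()` once its lease runs out, so callers never need to be told about a failure explicitly.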

Entrypoints (Edda)
               •    REST API
                   •     GET /REST/v2/instance/$id
               •    Keeps track of all resources
                   •     Autoscaling groups, EIPs,
                         Instances, Applications, Clusters,
                         History


Entrypoints Exploration
     •     Find all active instances
              GET /REST/v2/view/instances
     •     Find all instances in a cluster
              GET /REST/v2/group/clusters
     •     Show only ASG name, instance ID and health
              /v2/aws/autoScalingGroups/edda-v123;_pp:(autoScalingGroupName,instances:(instanceId,lifecycleState))
     •     Which ASG contains a particular instance?
              /v2/aws/autoScalingGroups;instances.instanceId=i-96f3ca3a
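
The last two queries on this slide follow a regular pattern: a collection path, optional `;key=value` filters, and an optional `:(...)` field selector. A minimal sketch of assembling such paths is below. The helper names `field_selector` and `edda_path` are hypothetical; the code only builds strings of the shape shown on the slide and makes no HTTP request.

```python
def field_selector(fields):
    """Render ['a', ('b', ['c', 'd'])] as ':(a,b:(c,d))'."""
    parts = []
    for field in fields:
        if isinstance(field, tuple):
            name, sub_fields = field
            parts.append(name + field_selector(sub_fields))
        else:
            parts.append(field)
    return ":(" + ",".join(parts) + ")"


def edda_path(collection, name=None, filters=None, fields=None, pretty=False):
    """Build a query path shaped like the examples on the slide (hypothetical helper)."""
    path = "/v2/" + collection
    if name:
        path += "/" + name
    for key, value in (filters or {}).items():
        path += ";{}={}".format(key, value)
    if pretty:
        path += ";_pp"
    if fields:
        path += field_selector(fields)
    return path


# The two /v2 queries from the slide:
asg_summary = edda_path(
    "aws/autoScalingGroups",
    name="edda-v123",
    fields=["autoScalingGroupName", ("instances", ["instanceId", "lifecycleState"])],
    pretty=True,
)
asg_for_instance = edda_path(
    "aws/autoScalingGroups",
    filters={"instances.instanceId": "i-96f3ca3a"},
)
```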

Keeping it all Straight
        •     Configuration (Archaius)
             •  Global variables (Fast properties)
        •     Base
             •  Base system. Prod vs. Test, etc
        •     Zookeeper (Curator)
             •  Locks, other similar coordination
        •     Logging (Blitz4j and Honu)
             •  Keep track of what happened and store it
                for post analysis.
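
The "fast properties" idea (global variables that can change at runtime) can be sketched as a small property store that notifies listeners when a value changes. This is an illustration only and not the Archaius API; `FastProperties`, `on_change`, and the `timeout_ms` key are made-up names.

```python
class FastProperties:
    """Toy dynamic configuration store (hypothetical, not the Archaius API)."""

    def __init__(self, defaults=None):
        self._values = dict(defaults or {})
        self._listeners = {}          # key -> list of callbacks

    def get(self, key, default=None):
        """Read the current value; callers re-read instead of caching it."""
        return self._values.get(key, default)

    def on_change(self, key, callback):
        """Register callback(old, new) to run when `key` changes."""
        self._listeners.setdefault(key, []).append(callback)

    def set(self, key, value):
        """Update a value at runtime and notify listeners if it changed."""
        old = self._values.get(key)
        self._values[key] = value
        if old != value:
            for callback in self._listeners.get(key, []):
                callback(old, value)
```

The point of the design is that a setting can be flipped across a running fleet without baking and deploying a new image.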
Keeping it Secure
      •    Cryptex
          •     Service for key management
          •     High, medium and low value keys
      •    AKMS (Amazon Key Management System)
          •     Hands out keys to instances (and dev boxes)
                so they don’t have to store the key on the
                instance

For more info, see SEC201: Security Panel
Storing it
•    Cassandra (Priam, Astyanax)
    •     Configure and access Cassandra
    •     Provide OO abstractions; handle
          connection pooling and discovery of
          hosts
•    EVCache (Ephemeral Volatile Cache)
    •     Wrapper for memcached to handle
          zone awareness and replication
•    Proxies
    •     Get data out of the datacenter and
          into the cloud.
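
Zone awareness and replication in a cache wrapper can be sketched as: write every entry to a store in each zone, read from the local zone first, and fall back to other zones on a miss. The sketch below uses plain dicts in place of memcached; the class and method names are assumptions, not the EVCache client API.

```python
class ZoneAwareCache:
    """Toy zone-aware cache wrapper; dicts stand in for memcached nodes."""

    def __init__(self, zones, local_zone):
        if local_zone not in zones:
            raise ValueError("local_zone must be one of zones")
        self._stores = {zone: {} for zone in zones}
        self.local_zone = local_zone

    def set(self, key, value):
        """Replicate the write to every zone."""
        for store in self._stores.values():
            store[key] = value

    def get(self, key):
        """Prefer the local zone, then try the remaining zones."""
        order = [self.local_zone] + [z for z in self._stores if z != self.local_zone]
        for zone in order:
            if key in self._stores[zone]:
                return self._stores[zone][key]
        return None

    def drop_zone(self, zone):
        """Simulate losing all cache nodes in one zone."""
        self._stores[zone].clear()
```

Losing one zone's cache then costs extra latency on reads rather than a cold cache.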
Data
                          What do we do with it all?




We store it!

                   • Cache (memcached)
                   • Cassandra
                   • RDS (MySQL)
Cassandra




Why Cassandra?
               • Availability over
                    consistency
               • Writes over reads
               • We know Java
               • Open source + support
Using Cassandra at Netflix
         •    Priam
             •   Zero touch auto-config
             •   State management
             •   Token assignment
             •   Node replacement
             •   Backup/restore to/from S3
         •    Astyanax
             •   OO abstraction to
                 Cassandra
             •   Multi-region support
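
As a rough illustration of what "token assignment" means, the sketch below spaces tokens evenly around the ring of Cassandra's RandomPartitioner, whose token range runs from 0 to 2^127. This is only the basic idea and is not Priam's actual algorithm, which the slides do not describe; the function name is hypothetical.

```python
RING_SIZE = 2 ** 127  # token range of Cassandra's RandomPartitioner


def evenly_spaced_tokens(node_count):
    """Return one token per node, spread evenly around the ring."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [i * RING_SIZE // node_count for i in range(node_count)]
```

Automating this step matters because a replacement node has to take over exactly the token of the node it replaces.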

Cassandra Architecture




For more info, see DAT202: Optimizing your Cassandra Database on AWS
Tools
             •     Asgard
             •     AWS usage
             •     Atlas
             •     Chronos
             •     Build system
             •     Explorers (Cassandra and SimpleDB)

[Diagram labels: Elastic Load Balancer, Auto Scaling Group, Security Group, Instances, Launch Configuration, Amazon Machine Image]
[Diagram: api-frontend with server groups api-usprod-v007 and api-usprod-v008]
Netflix has moved the granularity from the instance to the cluster
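
Treating the cluster as the unit of deployment means a new release becomes a new versioned group, as the labels api-usprod-v007 and api-usprod-v008 suggest. A minimal sketch of deriving the next group name follows. The three-digit `-vNNN` suffix is inferred from those labels; the starting value and wrap-around behavior are assumptions.

```python
import re

# Matches names such as "api-usprod-v007" (pattern inferred from the slide labels).
_VERSIONED = re.compile(r"^(?P<cluster>.+)-v(?P<num>\d{3})$")


def next_group_name(current):
    """Return the name of the next versioned group in the same cluster."""
    match = _VERSIONED.match(current)
    if match is None:
        # Assumption: an unversioned group is followed by version 000.
        return current + "-v000"
    number = (int(match.group("num")) + 1) % 1000  # assumed wrap after v999
    return "{}-v{:03d}".format(match.group("cluster"), number)
```

The old group can stay up while the new one takes traffic, which makes rolling back a matter of switching traffic to the previous group.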

Why Bake?
  Traditional: Generic AMI -> Instance
  •     launch OS
  •     install packages
  •     install app

  Netflix: App AMI -> Instance
  •     launch OS+app
Getting Baked
[Diagram: Jenkins build pipeline. Steps: sync source, resolve, compile, test, build, publish, report. Labels: Perforce / Git; Ant targets; Groovy all over; Ivy; libraries; Artifactory; snapshot / release libraries / apps; app bundles]
Base Image Baking
[Diagram labels: Linux: CentOS, Fedora, Ubuntu; Yum / Apt; RPMs: Apache, Java...; S3 / EBS; foundation AMI; mount; install; Bakery; ec2 slave instances; snapshot; base AMI; AWS; Ready for app bake]
App Image Baking
[Diagram labels: Linux, Apache, Java, Tomcat; Jenkins / Yum / Artifactory; app bundle; S3 / EBS; base AMI; mount; install; Bakery; ec2 slave instances; snapshot; app AMI; AWS; Ready to launch!]
Linux Base AMI (CentOS or Ubuntu)
     •     Optional Apache
     •     Monitoring
     •     Log rotation to S3
     •     AppDynamics Machine Agent
     •     Java (JDK 6 or 7)
          •   AppDynamics App Agent monitoring
          •   GC and thread dump logging
          •   JBoss
               •   Application war file, base servlet, platform,
                   interface jars for dependent services
               •   Healthcheck, status servlets, JMX interface,
                   Servo autoscale
Linux Base AMI (CentOS or Ubuntu)
     •     Optional Apache
     •     Monitoring
     •     Log rotation to S3
     •     AppDynamics Machine Agent
     •     Python
          •   monitoring
          •   logging
          •   Django
               •   Application file, base server, platform,
                   interface libs for dependent services
The simian army

                •     Chaos -- Kills random instances

                •     Chaos Gorilla -- Kills zones

                •     Chaos Kong -- Kills regions

                •     Latency -- Degrades network and injects faults

                •     Conformity -- Looks for outliers

                •     Circus -- Kills and launches instances to maintain zone
                      balance

                •     Doctor -- Fixes unhealthy resources

                •     Janitor -- Cleans up unused resources

                •     Howler -- Yells about bad things like Amazon limit
                      violations

                •     Security -- Finds security issues and expiring certificates
For more info, see ARC301: Intro to Chaos Monkey & the Simian Army
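
The "kills random instances" idea can be sketched as a function that picks at most one victim per group. This is not the Simian Army code: the function name, the `probability` knob, and the shape of the `groups` mapping are assumptions, and the sketch only selects instances without terminating anything.

```python
import random


def pick_victims(groups, rng=None, probability=1.0):
    """Pick at most one random instance per group.

    groups: mapping of group name -> list of instance IDs (hypothetical shape).
    probability: chance that any given group loses an instance on this run.
    """
    rng = rng or random.Random()
    victims = {}
    for name, instances in groups.items():
        if instances and rng.random() < probability:
            victims[name] = rng.choice(sorted(instances))
    return victims
```

Running something like this continuously in production is what forces every service to tolerate the loss of any single instance.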
What’s going on?!




Atlas




Example Alert Config

{
    "clusters": [
       "epic_aggregator",
       "epic_aggregator-dev"
    ],
    "alerts": [
       // you can use javascript style comments in the config
       {
          "metricName": "EpicPlugin_NumDropped",
          "applyTo": "cluster",
          "condition": {
             "type": "StaticThreshold",
             "max": 0.0
          },
          "severity": "major",
          "description": "plugin is dropping metrics"
       },
       {
          "metricName": "EpicPlugin_NumDropped_Instance",
          "applyTo": "instance",
          "condition": {
             "type": "NumOccurrences",
             "num": 4,
             "condition": {
                "type": "StaticThreshold",
                "max": 0.0
             }
          },
          "overrides": {
             "service_key_override": "12345",
             "require_instance_status_not_in": ["DOWN", "OUT_OF_SERVICE"],
             "email_override": "devnull@netflix.com"
          },
          "severity": "minor"
       },
       {
          "metricName": "EpicPlugin_MetricCount",
          "applyTo": "instance",
          "description": "${instanceId} is reporting too many metrics",
          "condition": {
             "type": "NumOccurrences",
             "num": 4,
             "condition": {
                "type": "StaticThreshold",
                "max": 0.0
             }
          },
          "additionalDetails": {
             "statusUrl": "http://${publicDnsName}:7001/Status",
             "nacClusterUrl": "nac${env}/${region}/cluster/show/${cluster}"
          },
          "overrides": {
             "subject": "${instanceId} is reporting too many metrics",
             "incident_key": "${metricName}:${instanceId}",
             "service_key_override": "12345",
             "email_override": "devnull@netflix.com"
          },
          "severity": "minor"
       }
    ]
}
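
The nested "condition" objects in this config can be read as a small expression tree. The sketch below evaluates such a tree against a list of metric samples. The semantics are assumptions, since the slide does not define them: here StaticThreshold fires when the latest sample exceeds "max", and NumOccurrences fires when the inner condition held at "num" or more points in the sample window.

```python
def breaches(condition, samples):
    """Return True if `samples` (oldest first) violate `condition`."""
    kind = condition["type"]
    if kind == "StaticThreshold":
        # Assumed meaning: the most recent sample is above "max".
        return bool(samples) and samples[-1] > condition["max"]
    if kind == "NumOccurrences":
        # Assumed meaning: the inner condition held at "num" or more points.
        inner = condition["condition"]
        hits = sum(breaches(inner, samples[: i + 1]) for i in range(len(samples)))
        return hits >= condition["num"]
    raise ValueError("unknown condition type: " + kind)


# The instance-level condition from the config above.
instance_condition = {
    "type": "NumOccurrences",
    "num": 4,
    "condition": {"type": "StaticThreshold", "max": 0.0},
}
```

Wrapping a threshold in an occurrence count is one way to keep a single noisy data point from paging someone.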

Alert Tuning




Alert Systems
[Diagram labels: Atlas alerting; AppDynamics; api instances, each with a CORE Agent; Other Team's Agent; CORE Event Gateway; Paging Service; Amazon SES]
Chronos




Data Collection Pipeline

Data Processing Pipeline

For more info, see BDT303: Data Science with Elastic MapReduce
Chukwa/Honu messages / min
                                63 billion messages a day
Best Practices




Incident Reviews
                         Ask the key questions:
                •     What went wrong?
                •     How could we have detected it
                      sooner?
                •     How could we have prevented it?
                •     How can we prevent this class of
                      problem in the future?
                •     How can we improve our behavior
                      for next time?
Best Practices for Data
        •     Have multiple copies of all data
        •     Keep those copies in multiple AZs
        •     Avoid keeping state on a single instance
        •     Take frequent snapshots of EBS disks
        •     No secret keys on the instance

Netflix autoscaling
[Chart: Netflix autoscaling; annotations: 1, 2, Deployment, Traffic Peak]
AWS Usage
                          Dollar amounts have been carefully removed




Going multi-zone




Benefits of Amazon’s Zones

           •     Loosely connected
           •     Low latency between zones
           •     99.95% uptime guarantee per region




Going Multi-region




Leveraging Multi-region

         •     100% uptime is theoretically possible.
         •     You have to replicate your data
         •     This will cost money
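
A back-of-the-envelope way to see why a second region helps: if each region is available 99.95% of the time (the per-region figure quoted on the zones slide) and regions fail independently, the chance that every region is down at once shrinks geometrically. Independence is a strong assumption made only for this sketch, and the function name is hypothetical.

```python
def combined_availability(per_region, regions):
    """Probability that at least one of `regions` independent regions is up."""
    if not 0.0 <= per_region <= 1.0:
        raise ValueError("per_region must be between 0 and 1")
    if regions < 1:
        raise ValueError("regions must be at least 1")
    return 1.0 - (1.0 - per_region) ** regions


one_region = combined_availability(0.9995, 1)    # 0.9995
two_regions = combined_availability(0.9995, 2)   # about 0.99999975
```

This gain only materializes if traffic can actually move to the surviving region, which is why the data has to be replicated and why it costs money.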




Circuit Breakers (Hystrix)
     Be liberal in what you accept, strict in what you send
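
The circuit breaker pattern itself can be sketched in a few lines: after a run of failures the breaker "opens" and returns a fallback immediately instead of calling the failing dependency, then allows a trial call once a cool-down has passed. This is a generic illustration of the pattern, not the Hystrix API; the class name, thresholds, and method names are assumptions.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker (hypothetical sketch, not Hystrix)."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock            # injectable clock keeps the sketch testable
        self._failures = 0
        self._opened_at = None        # None means the circuit is closed

    def call(self, func, fallback):
        """Run func(); on failure, or while the circuit is open, return fallback()."""
        if self._opened_at is not None:
            if self.clock() - self._opened_at < self.reset_after_s:
                return fallback()     # open: fail fast without calling func
            self._opened_at = None    # cool-down over: allow one trial call
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self.clock()
            return fallback()
        self._failures = 0            # success closes the circuit fully
        return result
```

Failing fast keeps a slow dependency from tying up the caller's threads and spreading the outage upstream.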




Just a quick reminder...


                   •     (Some of) Netflix is open source:
                          •     https://github.com/netflix




We are sincerely eager to hear your feedback on this presentation and on re:Invent.

Please fill out an evaluation form when you have a chance.
Questions?




Getting in touch
          Email: jedberg@{gmail,netflix}.com
          Twitter: @jedberg
          Web: www.jedberg.net
          Facebook: facebook.com/jedberg
          LinkedIn: www.linkedin.com/in/jedberg

AWS Summit 2013 | Singapore - Extending your Datacenter with Amazon VPC
 
AWS 101 Lunch & Learn March 2013
AWS 101 Lunch & Learn March 2013AWS 101 Lunch & Learn March 2013
AWS 101 Lunch & Learn March 2013
 

Ähnlich wie RMG202 Rainmakers: How Netflix Operates Clouds for Maximum Freedom and Agility - AWS re: Invent 2012

VoltDB and Erlang - Tech planet 2012
VoltDB and Erlang - Tech planet 2012VoltDB and Erlang - Tech planet 2012
VoltDB and Erlang - Tech planet 2012
Eonblast
 
Migrating Hundreds of Legacy Applications to Kubernetes - The Good, the Bad, ...
Migrating Hundreds of Legacy Applications to Kubernetes - The Good, the Bad, ...Migrating Hundreds of Legacy Applications to Kubernetes - The Good, the Bad, ...
Migrating Hundreds of Legacy Applications to Kubernetes - The Good, the Bad, ...
QAware GmbH
 
Netflix oss season 2 episode 1 - meetup Lightning talks
Netflix oss   season 2 episode 1 - meetup Lightning talksNetflix oss   season 2 episode 1 - meetup Lightning talks
Netflix oss season 2 episode 1 - meetup Lightning talks
Ruslan Meshenberg
 
Micro Service Architecture
Micro Service ArchitectureMicro Service Architecture
Micro Service Architecture
Eduards Sizovs
 
From legacy, to batch, to near real-time
From legacy, to batch, to near real-timeFrom legacy, to batch, to near real-time
From legacy, to batch, to near real-time
Marc Sturlese
 

Ähnlich wie RMG202 Rainmakers: How Netflix Operates Clouds for Maximum Freedom and Agility - AWS re: Invent 2012 (20)

Devops at Netflix (re:Invent)
Devops at Netflix (re:Invent)Devops at Netflix (re:Invent)
Devops at Netflix (re:Invent)
 
VoltDB and Erlang - Tech planet 2012
VoltDB and Erlang - Tech planet 2012VoltDB and Erlang - Tech planet 2012
VoltDB and Erlang - Tech planet 2012
 
20140708 - Jeremy Edberg: How Netflix Delivers Software
20140708 - Jeremy Edberg: How Netflix Delivers Software20140708 - Jeremy Edberg: How Netflix Delivers Software
20140708 - Jeremy Edberg: How Netflix Delivers Software
 
Cloud Native Camel Riding
Cloud Native Camel RidingCloud Native Camel Riding
Cloud Native Camel Riding
 
Coding Secure Infrastructure in the Cloud using the PIE framework
Coding Secure Infrastructure in the Cloud using the PIE frameworkCoding Secure Infrastructure in the Cloud using the PIE framework
Coding Secure Infrastructure in the Cloud using the PIE framework
 
The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...
 The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ... The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...
The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...
 
Migrating Hundreds of Legacy Applications to Kubernetes - The Good, the Bad, ...
Migrating Hundreds of Legacy Applications to Kubernetes - The Good, the Bad, ...Migrating Hundreds of Legacy Applications to Kubernetes - The Good, the Bad, ...
Migrating Hundreds of Legacy Applications to Kubernetes - The Good, the Bad, ...
 
Accelerating Analytics for the Future of Genomics
Accelerating Analytics for the Future of GenomicsAccelerating Analytics for the Future of Genomics
Accelerating Analytics for the Future of Genomics
 
Inside Wordnik's Architecture
Inside Wordnik's ArchitectureInside Wordnik's Architecture
Inside Wordnik's Architecture
 
A View on eScience
A View on eScienceA View on eScience
A View on eScience
 
Amazon Deep Learning
Amazon Deep LearningAmazon Deep Learning
Amazon Deep Learning
 
Netflix oss season 2 episode 1 - meetup Lightning talks
Netflix oss   season 2 episode 1 - meetup Lightning talksNetflix oss   season 2 episode 1 - meetup Lightning talks
Netflix oss season 2 episode 1 - meetup Lightning talks
 
Google Developer Days Brazil 2009 - Java Appengine
Google Developer Days Brazil 2009 -  Java AppengineGoogle Developer Days Brazil 2009 -  Java Appengine
Google Developer Days Brazil 2009 - Java Appengine
 
From legacy, to batch, to near real-time
From legacy, to batch, to near real-timeFrom legacy, to batch, to near real-time
From legacy, to batch, to near real-time
 
Scaling with swagger
Scaling with swaggerScaling with swagger
Scaling with swagger
 
Micro Service Architecture
Micro Service ArchitectureMicro Service Architecture
Micro Service Architecture
 
Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013
Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013
Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013
 
From legacy, to batch, to near real-time
From legacy, to batch, to near real-timeFrom legacy, to batch, to near real-time
From legacy, to batch, to near real-time
 
Fuse integration-services
Fuse integration-servicesFuse integration-services
Fuse integration-services
 
DevOpsCon 2015 - DevOps in Mobile Games
DevOpsCon 2015 - DevOps in Mobile GamesDevOpsCon 2015 - DevOps in Mobile Games
DevOpsCon 2015 - DevOps in Mobile Games
 

Mehr von Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

Mehr von Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

RMG202 Rainmakers: How Netflix Operates Clouds for Maximum Freedom and Agility - AWS re: Invent 2012

  • 1. Rainmakers: How Netflix Operates Clouds for Maximum Freedom and Agility
      • Jeremy Edberg, Reliability Architect, Netflix
  • 2. Do you have...
      • A release engineer?
      • A QA department?
      • Chef or Puppet to manage your systems?
  • 3. Do you have...
      • Upwards of 100 releases a day?
  • 5. With more than 30 million streaming members in the United States, Canada, Latin America, the United Kingdom, Ireland and the Nordics, Netflix is the world's leading internet subscription service for enjoying movies and TV programs streamed over the internet to PCs, Macs and TV.
      • Source: http://ir.netflix.com
  • 6. The Netflix Way
      • Everything is “built for three”
      • Fully automated build tools to test and make packages
      • Fully automated machine image bakery
      • Fully automated image deployment
      • Independent teams responsible for both Dev and Ops
  • 8. Automate all the things!
  • 9. Automate all the things!
      • Application startup
      • Configuration
      • Code deployment
      • System deployment
  • 10. Automation
      • Standard base image
      • Tools to manage all the systems
      • Automated code deployment
  • 11. Shared state should be stored in a shared service. Data on an instance should be replicated to other instances.
  • 12. “Build for Three”: We hold a boot camp for new engineers to teach them how to build for a highly distributed environment.
  • 14. Netflix on AWS (timeline figure; recoverable labels: 2012, IPv6, Open Connect)
  • 15. Highly aligned, loosely coupled
      • Services are built by different teams who work together to figure out what each service will provide.
      • The service owner publishes an API that anyone can use.
  • 16. Advantages to a Service Oriented Architecture
      • Easier auto-scaling
      • Easier capacity planning
      • Identify problematic code paths more easily
      • Narrow down the effects of a change
      • More efficient local caching
  • 17. Freedom and Responsibility
      • Developers deploy when they want
      • They also manage their own capacity and autoscaling
      • And fix anything that breaks at 4am!
  • 18. All systems choices assume some part will fail at some point.
  • 19. The Monkey Theory
      • Simulate things that go wrong
      • Find things that are different
  • 20. Execution (photo from I, Robot, copyright 20th Century Fox)
  • 21. Netflix built a global PaaS
      • Service Oriented Architecture
      • HTTP/REST interfaces between services
  • 22. Netflix PaaS features
      • Supports all regions and zones
      • Multiple accounts
      • Cross region/account replication
      • Internationalized, localized and GeoIP routed
      • Advanced key management
      • Autoscaling with 1000s of instances
      • Monitoring and alerting on millions of metrics
  • 23. What AWS Provides
      • Instances
      • Machine Images
      • Elastic IPs
      • Load Balancers
      • Security groups / Autoscaling groups
      • Availability zones and regions
  • 24. Linux Base AMI (CentOS or Ubuntu) (layer diagram; recoverable labels follow)
      • Optional Apache
      • Java (JDK 6 or 7)
      • AppDynamics App Agent monitoring
      • Tomcat
      • Application war file, base servlet, platform, interface jars for dependent services
      • Healthcheck, status servlets, JMX interface, Servo autoscale
      • GC and thread dump logging
      • Monitoring: log rotation to S3, AppDynamics Machine Agent
  • 25. The Netflix Platform (component diagram, with the open source parts marked; recoverable labels follow)
      • Discovery (Eureka)
      • Circuit Breakers (Hystrix)
      • Entrypoints (Edda)
      • Configuration (Archaius)
      • Cassandra (Priam & Astyanax & CassJMeter)
      • Zookeeper (Exhibitor)
      • Cryptex
      • Logging (Blitz4j & Honu)
      • AKMS
      • EVCache
      • NIWS
      • Proxies
      • i18n
      • Geo
      • L10n
      • Base
  • 27. Open Source at Netflix: Governator, Blitz4j, Edda, Hystrix
  • 28. Finding things
      • Discovery (Eureka)
          • Application to instance mapping
          • Heartbeat to keep track of health
      • Entrypoints (Edda)
          • Local database of AWS resources
      • NIWS (Netflix Internal Web Service)
          • On instance software load balancer
          • Handles retry logic
      • Geo (Geolocation library)
          • Provides IP to Lat/Lon mapping for any service that needs it.
  • 29. Entrypoints (Edda)
      • REST API
          • GET /REST/v2/instance/$id
      • Keeps track of all resources
          • Autoscaling groups, EIPs, Instances, Applications, Clusters, History
  • 30. Entrypoints Exploration
      • Find all active instances: GET /REST/v2/view/instances
      • Find all instances in a cluster: GET /REST/v2/group/clusters
      • Show only ASG name, instance ID and health: /v2/aws/autoScalingGroups/edda-v123;_pp:(autoScalingGroupName,instances:(instanceId,lifecycleState))
      • Which ASG contains a particular instance? /v2/aws/autoScalingGroups;instances.instanceId=i-96f3ca3a
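The last two queries on slide 30 combine a resource path with `;key=value` filters, a `;_pp` flag and a `:(...)` field list. A minimal Python sketch of assembling such paths is shown here; the helper names `edda_path` and `field_selector` are hypothetical, the reading of `_pp` as a flag followed by a field list is an assumption drawn only from the slide text, and nothing in the block contacts a real Edda server.

```python
def field_selector(fields):
    """Render a nested field list such as :(a,b:(c,d)).

    Each entry is either a field name or a (name, sub_fields) tuple.
    """
    parts = []
    for field in fields:
        if isinstance(field, tuple):
            name, sub_fields = field
            parts.append(f"{name}{field_selector(sub_fields)}")
        else:
            parts.append(field)
    return ":(" + ",".join(parts) + ")"


def edda_path(collection, name=None, matrix=None, pretty=False, fields=None):
    """Build a query path in the style shown on the slide (hypothetical helper)."""
    path = f"/v2/{collection}"
    if name:
        path += f"/{name}"
    # Filters are appended as ;key=value pairs, as in the slide's last query.
    for key, value in (matrix or {}).items():
        path += f";{key}={value}"
    if pretty:
        path += ";_pp"
    if fields:
        path += field_selector(fields)
    return path


asg_summary = edda_path(
    "aws/autoScalingGroups",
    name="edda-v123",
    pretty=True,
    fields=["autoScalingGroupName", ("instances", ["instanceId", "lifecycleState"])],
)
asg_for_instance = edda_path(
    "aws/autoScalingGroups",
    matrix={"instances.instanceId": "i-96f3ca3a"},
)
print(asg_summary)
print(asg_for_instance)
```

The two printed paths reproduce the ASG summary and instance lookup queries from the slide, so the sketch can be checked against them directly.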
  • 31. Keeping it all Straight
      • Configuration (Archaius)
          • Global variables (Fast properties)
      • Base
          • Base system. Prod vs. Test, etc.
      • Zookeeper (Curator)
          • Locks, other similar coordination
      • Logging (Blitz4j and Honu)
          • Keep track of what happened and store it for post analysis.
  • 32. Keeping it Secure
      • Cryptex
          • Service for key management
          • High, medium and low value keys
      • AKMS (Amazon Key Management System)
          • Hands out keys to instances (and dev boxes) so they don’t have to store the key on the instance
      • For more info, see SEC201: Security Panel
  • 33. Storing it
      • Cassandra (Priam, Astyanax)
          • Configure and access Cassandra
          • Provide OO abstractions; handle connection pooling and discovery of hosts
      • EVCache (Eccentric Volatile Cache)
          • Wrapper for memcached to handle zone awareness and replication
      • Proxies
          • Get data out of the datacenter and into the cloud.
  • 34. Data: What do we do with it all?
  • 35. We store it!
      • Cache (memcached)
      • Cassandra
      • RDS (MySQL)
  • 37. Why Cassandra?
      • Availability over consistency
      • Writes over reads
      • We know Java
      • Open source + support
  • 38. Using Cassandra at Netflix
      • Priam
          • Zero touch auto-config
          • State management
          • Token assignment
          • Node replacement
          • Backup/restore to/from S3
      • Astyanax
          • OO abstraction to Cassandra
          • Multi-region support
  • 42. Cassandra Architecture (figure). For more info, see DAT202: Optimizing your Cassandra Database on AWS
  • 43. Tools
      • Asgard
      • AWS usage
      • Atlas
      • Chronos
      • Build system
      • Explorers (Cassandra and SimpleDB)
  • 45. (Diagram; labels: Elastic Load Balancer, Auto Scaling Group, Instances, Security Group, Launch Configuration, Amazon Machine Image)
  • 46. (Diagram; labels: api-frontend, api-usprod-v007, api-usprod-v008)
  • 47. (Diagram; labels: api-frontend, api-usprod-v007, api-usprod-v008)
  • 51. Netflix has moved the granularity from the instance to the cluster
  • 52. Why Bake?
      • Traditional (generic AMI to instance): launch OS, install packages, install app
      • Netflix (app AMI to instance): launch OS+app
  • 53. Getting Baked (build pipeline diagram; recoverable labels follow)
      • Source: Perforce / Git
      • Jenkins steps: sync, resolve, compile, build, test, report, publish
      • Ivy: snapshot / release libraries
      • Artifactory: libraries / apps, app bundles
      • Ant targets
      • Groovy all over
  • 54. Base Image Baking (diagram; recoverable labels follow)
      • S3 / EBS
      • Foundation AMI (Linux: CentOS, Fedora, Ubuntu)
      • Bakery: EC2 slave instances on AWS
      • Steps: mount, Yum / Apt install of RPMs (Apache, Java...), bake, snapshot
      • Base AMI, ready for app install
  • 55. App Image Baking (diagram; recoverable labels follow)
      • S3 / EBS
      • Base AMI (Linux, Apache, Java, Tomcat)
      • Bakery: EC2 slave instances on AWS
      • Steps: mount, Yum install of the app bundle from Jenkins / Artifactory, snapshot
      • App AMI, ready to launch!
  • 56. Linux Base AMI (CentOS or Ubuntu) (layer diagram; recoverable labels follow)
      • Optional Apache
      • Java (JDK 6 or 7)
      • AppDynamics App Agent monitoring
      • Tomcat
      • Application war file, base servlet, platform, interface jars for dependent services
      • Healthcheck, status servlets, JMX interface, Servo autoscale
      • GC and thread dump logging
      • Monitoring: log rotation to S3, AppDynamics Machine Agent
  • 57. Linux Base AMI (CentOS or Ubuntu) (layer diagram; recoverable labels follow)
      • Optional Apache
      • Java (JDK 6 or 7)
      • AppDynamics App Agent monitoring
      • JBoss
      • Application war file, base servlet, platform, interface jars for dependent services
      • Healthcheck, status servlets, JMX interface, Servo autoscale
      • GC and thread dump logging
      • Monitoring: log rotation to S3, AppDynamics Machine Agent
  • 58. Linux Base AMI (CentOS or Ubuntu) (layer diagram; recoverable labels follow)
      • Optional Apache
      • Python
      • Django
      • Application file, base server, platform, interface libs for dependent services
      • monitoring
      • logging
      • Monitoring: log rotation to S3, AppDynamics Machine Agent
  • 59. The Monkey Theory
      • Simulate things that go wrong
      • Find things that are different
  • 60. The Simian Army
      • Chaos -- Kills random instances
      • Chaos Gorilla -- Kills zones
      • Chaos Kong -- Kills regions
      • Latency -- Degrades network and injects faults
      • Conformity -- Looks for outliers
      • Circus -- Kills and launches instances to maintain zone balance
      • Doctor -- Fixes unhealthy resources
      • Janitor -- Cleans up unused resources
      • Howler -- Yells about bad things like Amazon limit violations
      • Security -- Finds security issues and expiring certificates
      • For more info, see ARC301: Intro to Chaos Monkey & the Simian Army
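As an illustration of the "kills random instances" idea on slide 60, here is a small in-memory Python simulation. It is a sketch under stated assumptions, not Netflix's Chaos Monkey: `pick_victims` and `terminate` are hypothetical helpers, the instance IDs are made up, and no AWS API is called.

```python
import random


def pick_victims(groups, rng, probability=1.0):
    """Choose at most one random instance per group (hypothetical helper).

    groups maps a group name to a list of instance IDs; rng is a
    random.Random so that a run can be reproduced from a seed.
    """
    victims = {}
    for name in sorted(groups):
        instances = groups[name]
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


def terminate(groups, victims):
    """Return a copy of groups with each chosen victim removed."""
    return {
        name: [i for i in instances if i != victims.get(name)]
        for name, instances in groups.items()
    }


# Group names reuse the labels from slide 46; instance IDs are invented.
groups = {
    "api-usprod-v007": ["i-0001", "i-0002", "i-0003"],
    "api-usprod-v008": ["i-0004", "i-0005"],
}
victims = pick_victims(groups, random.Random(2012))
survivors = terminate(groups, victims)
for name in sorted(groups):
    print(name, "lost", victims[name], "and keeps", survivors[name])
```

In a real system the interesting part is what happens after the kill: the autoscaling group should replace the instance and callers should not notice.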
  • 61. What’s going on?!
  • 63. Example Alert Config
      {
        "clusters": [ "epic_aggregator", "epic_aggregator-dev" ],
        "alerts": [
          // you can use javascript style comments in the config
          {
            "metricName": "EpicPlugin_NumDropped",
            "applyTo": "cluster",
            "condition": {
              "type": "StaticThreshold",
              "max": 0.0
            },
            "severity": "major",
            "description": "plugin is dropping metrics"
          },
          {
            "metricName": "EpicPlugin_NumDropped_Instance",
            "applyTo": "instance",
            "condition": {
              "type": "NumOccurrences",
              "num": 4,
              "condition": {
                "type": "StaticThreshold",
                "max": 0.0
              }
            },
            "overrides": {
              "service_key_override": "12345",
              "require_instance_status_not_in": ["DOWN", "OUT_OF_SERVICE"],
              "email_override": "devnull@netflix.com"
            },
            "severity": "minor"
          },
          {
            "metricName": "EpicPlugin_MetricCount",
            "applyTo": "instance",
            "description": "${instanceId} is reporting too many metrics",
            "condition": {
              "type": "NumOccurrences",
              "num": 4,
              "condition": {
                "type": "StaticThreshold",
                "max": 0.0
              }
            },
            "additionalDetails": {
              "statusUrl": "http://${publicDnsName}:7001/Status",
              "nacClusterUrl": "nac${env}/${region}/cluster/show/${cluster}"
            },
            "overrides": {
              "subject": "${instanceId} is reporting too many metrics",
              "incident_key": "${metricName}:${instanceId}",
              "service_key_override": "12345",
              "email_override": "devnull@netflix.com"
            },
            "severity": "minor"
          }
        ]
      }
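The nested `condition` objects in the alert config can be read as a small expression tree. The Python sketch below evaluates such a tree against a list of metric samples. The semantics are assumptions, since the slide does not define them: `StaticThreshold` is taken to fire when the latest sample exceeds `max`, and `NumOccurrences` is taken to fire when its inner condition held for at least `num` of the supplied samples. The function name `fires` is hypothetical.

```python
def fires(condition, samples):
    """Evaluate an alert condition against metric samples, oldest first.

    Assumed semantics (not taken from the slide):
      StaticThreshold: the latest sample is greater than "max".
      NumOccurrences:  the inner condition held at "num" or more sample points.
    """
    kind = condition["type"]
    if kind == "StaticThreshold":
        return bool(samples) and samples[-1] > condition["max"]
    if kind == "NumOccurrences":
        inner = condition["condition"]
        # Re-evaluate the inner condition as of each sample point.
        hits = sum(
            1 for end in range(1, len(samples) + 1) if fires(inner, samples[:end])
        )
        return hits >= condition["num"]
    raise ValueError(f"unknown condition type: {kind}")


# The condition used by two of the alerts in the example config.
dropped_per_instance = {
    "type": "NumOccurrences",
    "num": 4,
    "condition": {"type": "StaticThreshold", "max": 0.0},
}

noisy = [0, 1, 2, 0, 3, 1]   # four samples above the threshold
quiet = [0, 1, 0, 0, 2]      # only two samples above the threshold
print(fires(dropped_per_instance, noisy))
print(fires(dropped_per_instance, quiet))
```

Wrapping a threshold in an occurrence count is one way to avoid paging on a single spike, which fits the alert tuning theme of the following slide.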
  • 64. Alert Tuning
  • 65. Alert Systems (diagram; labels: CORE Event Gateway, Paging Service, Atlas alerting, AppDynamics, Amazon SES, CORE Agent, api, Other Team’s Agent)
  • 68. Data Collection Pipeline, Data Processing Pipeline (diagram). For more info, see BDT303: Data Science with Elastic MapReduce
  • 69. Chukwa/Honu: messages / min (chart); 63 billion messages a day
  • 71. Incident Reviews. Ask the key questions:
      • What went wrong?
      • How could we have detected it sooner?
      • How could we have prevented it?
      • How can we prevent this class of problem in the future?
      • How can we improve our behavior for next time?
  • 72. Best Practices for Data
      • Have multiple copies of all data
      • Keep those copies in multiple AZs
      • Avoid keeping state on a single instance
      • Take frequent snapshots of EBS disks
      • No secret keys on the instance
  • 73. Netflix autoscaling (chart; annotations: 1. Traffic Peak, 2. Deployment)
  • 74. AWS Usage (dollar amounts have been carefully removed)
  • 76. Benefits of Amazon’s Zones
      • Loosely connected
      • Low latency between zones
      • 99.95% uptime guarantee per region
  • 78. Leveraging Multi-region
      • 100% uptime is theoretically possible
      • You have to replicate your data
      • This will cost money
  • 79. Circuit Breakers (Hystrix): Be liberal in what you accept, strict in what you send
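For readers unfamiliar with the circuit breaker pattern named on slide 79, a minimal Python sketch follows. It illustrates the general pattern only and is not the Hystrix API; the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and use a fallback."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"      # allow one trial call through
        return "open"

    def call(self, fn, fallback):
        if self.state == "open":
            return fallback()       # short-circuit without touching fn
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # Trip on reaching the threshold, or re-trip after a failed trial.
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = self.clock()
            return fallback()
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky():
    raise RuntimeError("dependency is down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
for _ in range(3):
    print(breaker.call(flaky, lambda: "cached response"), breaker.state)
```

After the second failure the breaker opens, so the third call returns the fallback without invoking `flaky` at all; that is the property that keeps one slow dependency from tying up every caller.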
  • 80. Just a quick reminder...
      • (Some of) Netflix is open source: https://github.com/netflix
  • 81. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
  • 83. Getting in touch
      • Email: jedberg@{gmail,netflix}.com
      • Twitter: @jedberg
      • Web: www.jedberg.net
      • Facebook: facebook.com/jedberg
      • LinkedIn: www.linkedin.com/in/jedberg