SlideShare ist ein Scribd-Unternehmen logo
1 von 72
Keeping Movies Running
         Amid Thunderstorms
         Fault-tolerant Systems @ Netflix

         Sid Anand (@r39132)
         QCon SF 2011




                                           1

Thursday, November 17, 2011
Backgrounder
         Netflix Then and Now




                               2

Thursday, November 17, 2011
Netflix Then and Now
        Netflix prior to circa 2009                  Netflix post circa 2009

        Users watched DVDs at home                  Users watch streaming at home

             Peak days : Friday, Saturday, Sunday      Peak days : Friday, Saturday, Sunday

        Users returned DVDs & Updated their Qs         Off-Peak days see many orders of
                                                       magnitude more traffic than prior to
             Peak days : Sunday, Monday                2009

        We shipped the next DVDs                       User expectation is that streaming is
                                                       always available
             Peak days : Monday, Tuesday


                                                    No Scheduled Site Downtimes
        Scheduled Site Downtimes on alternate
        Wednesdays                                  Fault Tolerance is a top design concern

                                                                                               3

Thursday, November 17, 2011
Netflix DC Architecture
         A Simple System




                                  4

Thursday, November 17, 2011
Netflix’s DC Architecture
         Components                                           H/W Load Balancer

         1 Netscaler H/W Load Balancer
                                           Apache + Tomcat    Apache + Tomcat     Apache + Tomcat
         ~20 “WWW” Apache+Tomcat servers

         3 Oracle DBs & 1 MySQL DB

         Cache Servers
                                           Cinematch System   MySQL     Oracle     Cache Servers
         Cinematch Recommendation System




                                                                                                    5

Thursday, November 17, 2011
Netflix’s DC Architecture
         Types of Production Issues                                  H/W Load Balancer

         Java Garbage Collection problems,
         which would would result in slower       Apache + Tomcat    Apache + Tomcat     Apache + Tomcat
         WWW pages

         Deadlocks in our multi-threaded Java
         application would cause web page
         loading to timeout
                                                  Cinematch System   MySQL     Oracle     Cache Servers
         Transaction locking in the DB would
         result in the similar web page loading
         timeouts

         Under-optimized SQL or DB would
         cause slower web pages (e.g. DB
         optimizer picks a sub-optimal the
         execution plan)


                                                                                                           6

Thursday, November 17, 2011
Netflix’s DC Architecture
                                                                        H/W Load Balancer
         Architecture Pros

         As serious as these sound, they were
                                                     Apache + Tomcat    Apache + Tomcat     Apache + Tomcat
         typically single-system failure scenarios

         Single-system failures are relatively
         easy to resolve

         Architecture Cons                           Cinematch System   MySQL     Oracle     Cache Servers

         Not horizontally scalable

               Weʼre constrained by what can fit on
               a single box

         Not conducive to high-velocity
         development and deployment


                                                                                                              7

Thursday, November 17, 2011
Netflix’s Cloud Architecture
         A Less Simple System




                                       8

Thursday, November 17, 2011
Netflix’s Cloud Architecture
                                                          ELB                                                     ELB



                                                   NES           NES                                       NES           NES
      Components

      Many (~100) applications, organized in                                     Discovery

      clusters
                                                   NMTS          NMTS                                      NMTS          NMTS

      Clusters can be at different levels in the
      call stack
                                                                               NMTS          NMTS

      Clusters can call each other


                                                          NBES                                                    NBES




                                                                        IAAS          IAAS          IAAS




                                                                                                                                9

Thursday, November 17, 2011
Netflix’s Cloud Architecture
                                                       ELB                                                     ELB

      Levels
                                                NES           NES                                       NES           NES

      NES : Netflix Edge Services
                                                                              Discovery
      NMTS : Netflix Mid-tier Services
                                                NMTS          NMTS                                      NMTS          NMTS
      NBES : Netflix Back-end Services

      IAAS : AWS IAAS Services                                              NMTS          NMTS


      Discovery : Help services discover NMTS
      and NBES services
                                                       NBES                                                    NBES




                                                                     IAAS          IAAS          IAAS




                                                                                                                             10

Thursday, November 17, 2011
Netflix’s Cloud Architecture
                                                              ELB                                                     ELB
       Components (NES)
                                                       NES           NES                                       NES           NES
       Overview

             Any service that browsers and streaming                                 Discovery

             devices connect to over the internet
                                                       NMTS          NMTS                                      NMTS          NMTS

             They sit behind AWS Elastic Load
             Balancers (a.k.a. ELB)
                                                                                   NMTS          NMTS
             They call clusters at lower levels


                                                              NBES                                                    NBES




                                                                            IAAS          IAAS          IAAS




                                                                                                                                    11

Thursday, November 17, 2011
Netflix’s Cloud Architecture
        Components (NES)                                               ELB                                                     ELB



        Examples                                                NES           NES                                       NES           NES


             API Servers
                                                                                              Discovery

                   Support the video browsing experience
                                                                NMTS          NMTS                                      NMTS          NMTS

                   Also allows users to modify their Q

             Streaming Control Servers                                                      NMTS          NMTS


                   Support streaming video playback

                   Authenticate your Wii, PS3, etc...
                                                                       NBES                                                    NBES

                   Download DRM to the Wii, PS3, etc...

                   Return a list of CDN urls to the Wii, PS3,
                   etc...
                                                                                     IAAS          IAAS          IAAS




                                                                                                                                             12

Thursday, November 17, 2011
Netflix’s Cloud Architecture
                                                             ELB                                                     ELB



        Components (NMTS)                             NES           NES                                       NES           NES



        Overview
                                                                                    Discovery

             Can call services at the same or lower   NMTS          NMTS                                      NMTS          NMTS
             levels

                   Other NMTS
                                                                                  NMTS          NMTS

                   NBES, IAAS

                   Not NES
                                                             NBES                                                    NBES

             Exposed through our Discovery service




                                                                           IAAS          IAAS          IAAS




                                                                                                                                   13

Thursday, November 17, 2011
Netflix’s Cloud Architecture
                                                                   ELB                                                     ELB

        Components (NMTS)
                                                            NES           NES                                       NES           NES

        Examples
                                                                                          Discovery
             Netflix Queue Servers
                                                            NMTS          NMTS                                      NMTS          NMTS
                   Modify items in the usersʼ movie queue

             Viewing History Servers
                                                                                        NMTS          NMTS

                   Record and track all streaming movie
                   watching

             SIMS Servers                                          NBES                                                    NBES


                   Compute and serve user-to-user and
                   movie-to-movie similarities


                                                                                 IAAS          IAAS          IAAS




                                                                                                                                         14

Thursday, November 17, 2011
Netflix’s Cloud Architecture
                                                                  ELB                                                     ELB


        Components (NBES)
                                                           NES           NES                                       NES           NES

        Overview
                                                                                         Discovery
             A back-end, usually 3rd party, open-source
             service                                       NMTS          NMTS                                      NMTS          NMTS


             Leaf in the call tree. Cannot call anything
             else
                                                                                       NMTS          NMTS




                                                                  NBES                                                    NBES




                                                                                IAAS          IAAS          IAAS




                                                                                                                                        15

Thursday, November 17, 2011
Netflix’s Cloud Architecture
                                                                        ELB                                                     ELB



        Components (NBES)                                        NES           NES                                       NES           NES


        Examples
                                                                                               Discovery

             Cassandra Clusters
                                                                 NMTS          NMTS                                      NMTS          NMTS

                   Our new cloud database is Cassandra and
                   stores all sorts of data to support
                   application needs                                                         NMTS          NMTS


             Zookeeper Clusters

                   Our distributed lock service and sequence
                                                                        NBES                                                    NBES
                   generator

             Memcached Clusters

                   Typically caches things that we store in S3
                   but need to access quickly or often                                IAAS          IAAS          IAAS




                                                                                                                                              16

Thursday, November 17, 2011
Netflix’s Cloud Architecture
                                                                          ELB                                                     ELB
        Components (IAAS)
                                                                   NES           NES                                       NES           NES
        Examples

             AWS S3                                                                              Discovery


                   Large-sized data (e.g. video encodes,           NMTS          NMTS                                      NMTS          NMTS
                   application logs, etc...) is stored here, not
                   Cassandra
                                                                                               NMTS          NMTS
             AWS SQS

                   Amazonʼs message queue to send events
                   (e.g. Facebook network updates are
                   processed asynchronously over SQS)                     NBES                                                    NBES




                                                                                        IAAS          IAAS          IAAS




                                                                                                                                                17

Thursday, November 17, 2011
Netflix’s Cloud Architecture
                                                       ELB                                                     ELB

      Types of Production Issues
                                                NES           NES                                       NES           NES

      A user-issued call will pass through
      multiple levels during normal operation                                 Discovery


      We are now exposed to multi-system        NMTS          NMTS                                      NMTS          NMTS
      coincident failures, a.k.a. coordinated
      failures
                                                                            NMTS          NMTS




                                                       NBES                                                    NBES




                                                                     IAAS          IAAS          IAAS




                                                                                                                             18

Thursday, November 17, 2011
Netflix’s Cloud Architecture
                                                             ELB                                                     ELB

      Architecture Pros
                                                      NES           NES                                       NES           NES
      Horizontally scalable at every level

            Should give us maximum availability                                     Discovery


                                                      NMTS          NMTS                                      NMTS          NMTS
      Supports high-velocity development and
      deployment

      Architecture Cons                                                           NMTS          NMTS



      A user-issued call will pass through multiple
      levels (a.k.a. hops) during normal operation
                                                             NBES                                                    NBES

            Latency can be a concern

      We are now exposed to multi-system
      coincident failures, a.k.a. coordinated
                                                                           IAAS          IAAS          IAAS
      failures

      A lot of moving parts
                                                                                                                                   19

Thursday, November 17, 2011
Issue 1
         Capacity Planning




                              20

Thursday, November 17, 2011
Issue 1

                                                             X       X           Y       Y

• Service X and Service Y, each made up of 2 instances,
      call Service A, also made up of 2 instance

• If either of these services expect a large increase in                 A   A
      traffic, they need to let the owner of Service A know

• Service A can then scale up ahead of the traffic            X       X           Y       Y
      increase




                              Disaster Avoided ??                A       A   A       A       A   A




                                                                                                     21

Thursday, November 17, 2011
Issue 1
• A given application owner may need to contact 20 other
      application owners each time he expects to get a large
      increase in traffic
                                                                X       X           Y       Y

            • Too much human coordination

• A few options
                                                                            A   A
            • Some service owners vastly over-provision for
                 their application
                                                                X       X           Y       Y
                  • Not cost effective

            • Auto-scaling

                  • We want to generalize the model first            A       A   A       A       A   A

                       proved by our Streaming Control Server
                       (a.k.a. NCCP) team




                                                                                                        22

Thursday, November 17, 2011
ELB AutoScaling Interlude
      How to use an ELB

      An elastic-load balancer (ELB) routes
      traffic to your EC2 instances

            e.g. of an ELB : nccp-wii-11111111.us-
            east-1.elb.amazonaws.com

      Netflix maps a CNAME to this ELB

            e.g. : nccp.wii.netflix.com

      Netflix then registers the API Service’s
      EC2 instances with this ELB

      The ELB periodically polls attached EC2
      instances to ensure the instances are
      healthy


                                                     23

Thursday, November 17, 2011
ELB AutoScaling Interlude

      Taking this a bit further

      The NCCP servers can publish metrics to
      AWS CloudWatch

      We can set up an alarm in Cloud Watch
      on a metric (e.g. CPU)

      We can associate an auto scale policy with
      that alarm (e.g. if CPU > 60%, add 3
      more instances)

      When a metric goes above a limit, an
      alarm is triggered, causing auto-scaling,
      which grows our pool



                                                   24

Thursday, November 17, 2011
ELB AutoScaling Interlude
                                                       Cloud
            NCCP              EC2 instances publish    Watch
                                CPU data to CW        (Alarms)



                                                           CloudWatch alarms
                                                           trigger ASG policies



                                                        Auto
                  EC2 Instances                        Scaling
                                                       Service
                 Added/Removed                        (Policies)



                                                                              25

Thursday, November 17, 2011
ELB AutoScaling Interlude
                               Scale Out Event    Average CPU > 60% for 5 minutes



                               Scale In Event     Average CPU < 30% FOR 5 minutes



                              Cool Down Period              10 minutes



                              Auto-Scale Alerts         DLAutoScaleEvents



                                                                                    26

Thursday, November 17, 2011
27
      @r39132                 23
Thursday, November 17, 2011
Issue 1

                                                          X       X           Y       Y




      Summary
                                                                      A   A

      We would like to have auto-scaling at all levels.

                                                          X       X           Y       Y




                                                              A       A   A       A       A   A




                                                                                                  28

Thursday, November 17, 2011
Issue 2
         Thundering herds to NMTS




                                    29

Thursday, November 17, 2011
Issue 2
                                                          X            X                             Y           Y




                                                                               A           A
                                                                                                                     Step 1


       Step 1
                                                          X            X                             Y           Y

       Service X and Service Y, each made up of 2
       instances, call Service A, also made up of 2
       instance
                                                                               A           A
       Step 2a                                                                                                       Step 2a



       Service Y overwhelms Service A                     X            X                             Y           Y




                                                                       s
                                                                    ut




                                                                                                 ts
                                                                   o
       Step 3




                                                                                               ou
                                                                           e
                                                                                 m


                                                                                           e
                                                                               Ti

                                                                                        m
       Services X & Y experience read and




                                                                                      Ti
                                                                               A           A
       connection timeouts against an overwhelmed                                                                    Step 3
       Service A

       Step 4                                         X            X                             Y           Y
                                                                  s
                                                               ut




                                                                                             ts
                                                              o




                                                                                           ou
                                                                       e
       Service Aʼs tier get 2 more machines
                                                                             m


                                                                                       e
                                                                           Ti

                                                                                       m
                                                                                     Ti
                                                      JUST A
                                                           ADDED           A          A              A
                                                                                                JUST ADDED
                                                                                                                     Step 4
                                                                                                                               30

Thursday, November 17, 2011
Issue 2
                                                           X        X                    Y       Y




                 Step 5

                  • New requests + Retries cause               A          A      A           A
                                                                                                        Step 5
                       request storms (a.k.a. thundering
                       herds)

                  • If Service A can be grown to exceed
                       retry storm steady-state traffic                                               VICIOUS CYCLE
                       volume, we can exit this vicious
                       cycle

                 Step 6
                                                           X        X                    Y       Y




                                                                     ts
                  • Else, more timeouts, and VC


                                                                   ou




                                                                                        ts
                                                                                      ou
                       continues



                                                                        e
                                                                            m


                                                                                  e
                                                                          Ti

                                                                                  m
                                                                                Ti
                                                               A          A      A           A
                                                                                                        Step 6




                                                                                                                 31

Thursday, November 17, 2011
Issue 2
                                                           X            X                             Y           Y




                                                                                A           A
                                                                                                                       Step 1


      Step 1
                                                           X            X                             Y           Y

      Service X and Service Y, each made up of 2
      instances, call Service A, also made up of 2
      instance
                                                                                A           A
      Step 2b                                                                       SLOW                               Step 2b



      Service A experiences slowness                       X            X                             Y           Y




                                                                        s
                                                                     ut




                                                                                                  ts
                                                                    o
      Step 3




                                                                                                ou
                                                                            e
                                                                                  m


                                                                                            e
                                                                                Ti

                                                                                         m
      Services X & Y experience read and




                                                                                       Ti
                                                                                A           A
      connection timeouts against a slower Service A                                                                   Step 3


      Step 4
                                                       X         ts X                             Y           Y
                                                               ou




                                                                                              ts
      If the slowness can be fixed by adding more




                                                                                            ou
                                                                        e
      machines to Service Aʼs tier, then do so
                                                                              m


                                                                                        e
                                                                            Ti

                                                                                        m
                                                                                                                      OPTIONAL
                                                                                      Ti
                                                       JUST A
                                                            ADDED           A          A              A
                                                                                                 JUST ADDED
                                                                                                                       Step 4
                                                                                                                                 32

Thursday, November 17, 2011
Issue 2
                                                           X        X                    Y       Y




                 Step 5

                  • New requests + Retries cause               A          A      A           A
                                                                                                        Step 5
                       request storms (a.k.a. thundering
                       herds)

                  • If Service A can be grown to exceed
                       retry storm steady-state traffic                                               VICIOUS CYCLE
                       volume, we can exit this vicious
                       cycle

                 Step 6
                                                           X        X                    Y       Y




                                                                     ts
                  • Else, more timeouts, and VC


                                                                   ou




                                                                                        ts
                                                                                      ou
                       continues



                                                                        e
                                                                            m


                                                                                  e
                                                                          Ti

                                                                                  m
                                                                                Ti
                                                               A          A      A           A
                                                                                                        Step 6




                                                                                                                 33

Thursday, November 17, 2011
Issue 2
                                                                                                                    S1

     Potential Causes of Thundering Herd




                                                                                                                         Thundering Herd (Downstream)
      • Service Y sends more traffic to Service A, without checking if Service A has




                                                                                             Time Outs (Upstream)
           available capacity

      • Service A slows down                                                                                        S2



      • Service Yʻs time outs against Service A are set too low

      • Service Yʻs retries against Service A are too aggressive

      • Natural organic growth in traffic hit a tipping point in the system -- in Service A                          S2

           in this case




                                                                                                                                                        34

Thursday, November 17, 2011
Solutions to Issue 2
         Thundering herds to NMTS




                                    35

Thursday, November 17, 2011
Solutions to Issue 2
                                                                                       X
       The Platform Solution
                                                                              NIWS Throttle Layer
        • Every service at Netflix sits on the platform.jar
                                                                                     NIWS
        • The platform.jar offers 2 components of interest here:

             • NIWS library : the client-side of Netflix Inter-Web
                  Service calls. Handles retry, failover, thundering-herd
                  prevention, & fast failure

             • BaseServer library : a set of Tomcat servlet filters          BaseServer Throttle Layer
                  that protect the underlying application servlet stack.
                  In this context, it throttles traffic                       BaseServer Filter Chain



                                                                                       A




                                                                                                        36

Thursday, November 17, 2011
Solutions to Issue 2
                                                                                       X
               The Platform Solution
                                                                              NIWS Throttle Layer
          • NIWS library
                                                                                     NIWS
               • Fair Retry Logic : e.g. exponential bounded backoff

               • Takes 2 configuration params per client:
                              •   Max_Num_of_Requests (a.k.a. MNR)

                              •   Sample_Interval_in_seconds (a.k.a. SI)
                                                                            BaseServer Throttle Layer

               • Ensures that a client does not send more than MNR/          BaseServer Filter Chain
                     SI requests/s, else throttles requests at the client

                                                                                       A




                                                                                                        37

Thursday, November 17, 2011
Solutions to Issue 2
                                                                                     X
       The Platform Solution
                                                                            NIWS Throttle Layer
        • BaseServer
                                                                                   NIWS
             • As an additional fail-safe, the server can set throttles
                  that are not client specific (i.e. the limits apply to
                  total inbound traffic, regardless of client)

             • Takes 1 configuration parameter:
                        •     Max_Num_of_Concurrent_Requests (a.k.a.
                                                                          BaseServer Throttle Layer
                              MNCR)
                                                                           BaseServer Filter Chain
             • Ensures that a server does not handle more than
                  MNCR requests at any instant
                                                                                     A

             • If the traffic exceeds the limits, reject excess calls at
                  the server (i.e. 503s)




                                                                                                      38

Thursday, November 17, 2011
Solutions to Issue 2
       The Platform Solution                                                         X



        • Graceful Degradation                                              NIWS Throttle Layer


             • Any client that is throttled at either the NIWS                     NIWS
                  Throttle Layer or the BaseServer Throttle Layer
                  need to implement graceful degradation

             • Netflixʼs Web Scale Traffic falls in 2 categories:

                        • Users get a personalized set of movies to
                              pick from (i.e. via API Edge Server path)   BaseServer Throttle Layer


                              • GD : Show popular movies, not              BaseServer Filter Chain

                                  personalized movies
                                                                                     A
                        • Users can start watching a movie (i.e. via
                              NCCP Edge Server path)

                              • GD : tougher problem to solve

                                  • When device leases expire, we
                                      honor them if we are unable to
                                      generate a new one for them
                                                                                                      39

Thursday, November 17, 2011
Solutions to Issue 2
       This all sounds great!

 • But, what if developers do not use these
       built-in features of the platform or neglect to
       set their configuration appropriately?

        • (i.e. the default RPS limit in the NIWS
             client is Integer.MAX_VALUE)




                                                         40

Thursday, November 17, 2011
Solutions to Issue 2


       We have a little help




                                41

Thursday, November 17, 2011
Simian Army
         Prevention is the best medicine




                                           42

Thursday, November 17, 2011
Simian Army
 • Chaos Monkey
        • Simulates hard failures in AWS by killing a few instances per ASG (e.g. Auto Scale Group)
             • Similar to how EC2 instances can be killed by AWS with little warning
        • Tests clientsʼ ability to gracefully deal with broken connections, interrupted calls, etc...
        • Verifies that all services are running within the protection of AWS Auto Scale Groups, which
             reincarnates killed instances

             • If not, the Chaos monkey will win!




                                                                                                         43

Thursday, November 17, 2011
Simian Army
 • Latency Monkey
        • Simulates soft failures -- i.e. a service gets slower
        • Injects random delays in NIWS (client-side) or BaseServer (server-side) of a client-server
             interaction in production

        • Tests the ability of applications to detect and recover (i.e. Graceful Degradation) from the harder
             problem of delays, that leads to thundering herd and timeouts




                                                                                                                44

Thursday, November 17, 2011
Simian Army


                              Does this solve all of our issues?




                                                                   45

Thursday, November 17, 2011
Simian Army

       The infinite cloud is infinite when your needs are
       moderate!


       To ensure fairness among tenants, AWS meters or limits every resource

       Hence, we hit limits quite often. Our “velocity” is limited by how long it takes for AWS to
       turn around and raise the limit -- a few hours!




                                                                                                     46

Thursday, November 17, 2011
Simian Army
 • Limits Monkey
        • Checks once a day whether we are approaching one of our limits and triggers alerts for us to
             proactively reach out to AWS!




 • Conformity & Janitor Monkeys
        • Finds and clean up orphaned resources (e.g. EC2 instances that are not in an ASG,
             unreferenced security groups, ELBs, ASGs, etc...) to increase head-room

             • Buys us more time before we run out of resources and also saves us $$$$




                                                                                                         47

Thursday, November 17, 2011
Simian Army


           The Simian Army fills the gap created by an absence
           of process and a need to ensure fault-tolerance and
                    efficient operation of our systems




                                                                 48

Thursday, November 17, 2011
Fast Rollback
         Fault-tolerant deployment




                                     49

Thursday, November 17, 2011
Fast Rollback

             What is the point of having Fault-Tolerant layers if
               deployments of a bug can take them down?




                                                                    50

Thursday, November 17, 2011
Fast Rollback




                              51

Thursday, November 17, 2011
Fast Rollback
              Optimism causes outages




                                        51

Thursday, November 17, 2011
Fast Rollback
              Optimism causes outages
              Production traffic is unique




                                            51

Thursday, November 17, 2011
Fast Rollback
              Optimism causes outages
              Production traffic is unique
              Keep old version running




                                            51

Thursday, November 17, 2011
Fast Rollback
              Optimism causes outages
              Production traffic is unique
              Keep old version running
              Switch traffic to new version




                                             51

Thursday, November 17, 2011
Fast Rollback
              Optimism causes outages
              Production traffic is unique
              Keep old version running
              Switch traffic to new version
              Monitor results




                                             51

Thursday, November 17, 2011
Fast Rollback
              Optimism causes outages
              Production traffic is unique
              Keep old version running
              Switch traffic to new version
              Monitor results
              Revert traffic quickly



                                             51

Thursday, November 17, 2011
Fast Rollback




                              52

Thursday, November 17, 2011
Fast Rollback


                                                api-frontend




                              api-usprod-v007




                                                               52

Thursday, November 17, 2011
Fast Rollback


                                                api-frontend




                              api-usprod-v007                  api-usprod-v008




                                                                                 52

Thursday, November 17, 2011
Fast Rollback


                                                api-frontend




                              api-usprod-v007                  api-usprod-v008




                                                                                 52

Thursday, November 17, 2011
Fast Rollback


                                                api-frontend




                              api-usprod-v007                  api-usprod-v008




                                                                                 52

Thursday, November 17, 2011
Fast Rollback


                                                api-frontend




                              api-usprod-v007                  api-usprod-v008




                                                                                 52

Thursday, November 17, 2011
Fast Rollback


                              api-frontend




                                             api-usprod-v008




                                                               52

Thursday, November 17, 2011
Fast Rollback




                              53

Thursday, November 17, 2011
Fast Rollback


                                                api-frontend




                              api-usprod-v007




                                                               53

Thursday, November 17, 2011
Fast Rollback


                                                api-frontend




                              api-usprod-v007                  api-usprod-v008




                                                                                 53

Thursday, November 17, 2011
Fast Rollback


                                                api-frontend




                              api-usprod-v007                  api-usprod-v008




                                                                                 53

Thursday, November 17, 2011
Fast Rollback


                                                api-frontend




                              api-usprod-v007                  api-usprod-v008




                                                                                 53

Thursday, November 17, 2011
Fast Rollback


                                                api-frontend




                              api-usprod-v007




                                                               53

Thursday, November 17, 2011
Acknowledgements
         Platform Engineering
         • Sudhir Tonse
         • Pradeep Kamath
         Engineering Tools
         • Joe Sondow
         Streaming Server
         • Ranjit Mavinkurve
                                54

Thursday, November 17, 2011
Questions?
         Sid Anand


                 @r39132




                              55

Thursday, November 17, 2011

Weitere ähnliche Inhalte

Was ist angesagt?

Netflix cloud architecture...continued
Netflix cloud architecture...continuedNetflix cloud architecture...continued
Netflix cloud architecture...continuedCloud Genius
 
Netflix Cloud Platform Building Blocks
Netflix Cloud Platform Building BlocksNetflix Cloud Platform Building Blocks
Netflix Cloud Platform Building BlocksSudhir Tonse
 
Netflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconNetflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconAdrian Cockcroft
 
Netflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowNetflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowAdrian Cockcroft
 
Millicomputing Ignite Talk
Millicomputing Ignite TalkMillicomputing Ignite Talk
Millicomputing Ignite TalkAdrian Cockcroft
 
Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
Gluecon 2013 - NetflixOSS Cloud Native Tutorial IntroductionGluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
Gluecon 2013 - NetflixOSS Cloud Native Tutorial IntroductionAdrian Cockcroft
 
The Netflix Open Source Platform
The Netflix Open Source PlatformThe Netflix Open Source Platform
The Netflix Open Source PlatformRuslan Meshenberg
 
Architectures for High Availability - QConSF
Architectures for High Availability - QConSFArchitectures for High Availability - QConSF
Architectures for High Availability - QConSFAdrian Cockcroft
 
Millicomputing Usenix 2008
Millicomputing Usenix 2008Millicomputing Usenix 2008
Millicomputing Usenix 2008Adrian Cockcroft
 
RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013
RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013
RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013CA API Management
 
Cloud Computing for Developers and Architects - QCon 2008 Tutorial
Cloud Computing for Developers and Architects - QCon 2008 TutorialCloud Computing for Developers and Architects - QCon 2008 Tutorial
Cloud Computing for Developers and Architects - QCon 2008 TutorialStuart Charlton
 
AWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAdrian Cockcroft
 
AWS for Start-ups - Architectural Best Practices & Automating Your Infrastruc...
AWS for Start-ups - Architectural Best Practices & Automating Your Infrastruc...AWS for Start-ups - Architectural Best Practices & Automating Your Infrastruc...
AWS for Start-ups - Architectural Best Practices & Automating Your Infrastruc...Amazon Web Services
 
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)Adrian Cockcroft
 
Amazon Web Services Building Blocks for Drupal Applications and Hosting
Amazon Web Services Building Blocks for Drupal Applications and HostingAmazon Web Services Building Blocks for Drupal Applications and Hosting
Amazon Web Services Building Blocks for Drupal Applications and HostingAcquia
 
AWS 201 - A Walk through the AWS Cloud: App Hosting on AWS - Games, Apps and ...
AWS 201 - A Walk through the AWS Cloud: App Hosting on AWS - Games, Apps and ...AWS 201 - A Walk through the AWS Cloud: App Hosting on AWS - Games, Apps and ...
AWS 201 - A Walk through the AWS Cloud: App Hosting on AWS - Games, Apps and ...Amazon Web Services
 
[AWS LA Media & Entertainment Event 2015]: Shoot the Bird: Linear Broadcast o...
[AWS LA Media & Entertainment Event 2015]: Shoot the Bird: Linear Broadcast o...[AWS LA Media & Entertainment Event 2015]: Shoot the Bird: Linear Broadcast o...
[AWS LA Media & Entertainment Event 2015]: Shoot the Bird: Linear Broadcast o...Amazon Web Services
 
Architecture Best Practices on Windows Azure
Architecture Best Practices on Windows AzureArchitecture Best Practices on Windows Azure
Architecture Best Practices on Windows AzureNuno Godinho
 

Was ist angesagt? (20)

Netflix cloud architecture...continued
Netflix cloud architecture...continuedNetflix cloud architecture...continued
Netflix cloud architecture...continued
 
Netflix Cloud Platform Building Blocks
Netflix Cloud Platform Building BlocksNetflix Cloud Platform Building Blocks
Netflix Cloud Platform Building Blocks
 
Netflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconNetflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at Gluecon
 
Netflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowNetflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search Roadshow
 
Millicomputing Ignite Talk
Millicomputing Ignite TalkMillicomputing Ignite Talk
Millicomputing Ignite Talk
 
Netflix in the Cloud
Netflix in the CloudNetflix in the Cloud
Netflix in the Cloud
 
Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
Gluecon 2013 - NetflixOSS Cloud Native Tutorial IntroductionGluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
 
The Netflix Open Source Platform
The Netflix Open Source PlatformThe Netflix Open Source Platform
The Netflix Open Source Platform
 
Architectures for High Availability - QConSF
Architectures for High Availability - QConSFArchitectures for High Availability - QConSF
Architectures for High Availability - QConSF
 
Millicomputing Usenix 2008
Millicomputing Usenix 2008Millicomputing Usenix 2008
Millicomputing Usenix 2008
 
CMS on AWS Deep Dive
CMS on AWS Deep DiveCMS on AWS Deep Dive
CMS on AWS Deep Dive
 
RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013
RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013
RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013
 
Cloud Computing for Developers and Architects - QCon 2008 Tutorial
Cloud Computing for Developers and Architects - QCon 2008 TutorialCloud Computing for Developers and Architects - QCon 2008 Tutorial
Cloud Computing for Developers and Architects - QCon 2008 Tutorial
 
AWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at Netflix
 
AWS for Start-ups - Architectural Best Practices & Automating Your Infrastruc...
AWS for Start-ups - Architectural Best Practices & Automating Your Infrastruc...AWS for Start-ups - Architectural Best Practices & Automating Your Infrastruc...
AWS for Start-ups - Architectural Best Practices & Automating Your Infrastruc...
 
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
 
Amazon Web Services Building Blocks for Drupal Applications and Hosting
Amazon Web Services Building Blocks for Drupal Applications and HostingAmazon Web Services Building Blocks for Drupal Applications and Hosting
Amazon Web Services Building Blocks for Drupal Applications and Hosting
 
AWS 201 - A Walk through the AWS Cloud: App Hosting on AWS - Games, Apps and ...
AWS 201 - A Walk through the AWS Cloud: App Hosting on AWS - Games, Apps and ...AWS 201 - A Walk through the AWS Cloud: App Hosting on AWS - Games, Apps and ...
AWS 201 - A Walk through the AWS Cloud: App Hosting on AWS - Games, Apps and ...
 
[AWS LA Media & Entertainment Event 2015]: Shoot the Bird: Linear Broadcast o...
[AWS LA Media & Entertainment Event 2015]: Shoot the Bird: Linear Broadcast o...[AWS LA Media & Entertainment Event 2015]: Shoot the Bird: Linear Broadcast o...
[AWS LA Media & Entertainment Event 2015]: Shoot the Bird: Linear Broadcast o...
 
Architecture Best Practices on Windows Azure
Architecture Best Practices on Windows AzureArchitecture Best Practices on Windows Azure
Architecture Best Practices on Windows Azure
 

Andere mochten auch

Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesYow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesAdrian Cockcroft
 
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...Adrian Cockcroft
 
10 tips to get started with html5 games
10 tips to get started with html5 games10 tips to get started with html5 games
10 tips to get started with html5 gamesGregory Kukolj
 
DSP開発におけるSpark MLlibの活用
DSP開発におけるSpark MLlibの活用DSP開発におけるSpark MLlibの活用
DSP開発におけるSpark MLlibの活用Kotaro Tanahashi
 
Operational Insight: Concepts and Examples
Operational Insight: Concepts and ExamplesOperational Insight: Concepts and Examples
Operational Insight: Concepts and Examplesroyrapoport
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksData Con LA
 
Technology stack of social networks [MTS]
Technology stack of social networks [MTS]Technology stack of social networks [MTS]
Technology stack of social networks [MTS]philmaweb
 
SV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source PlatformSV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source PlatformAdrian Cockcroft
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in KafkaJoel Koshy
 
SSL Certificate Expiration and Howler Monkey's Inception
SSL Certificate Expiration and Howler Monkey's InceptionSSL Certificate Expiration and Howler Monkey's Inception
SSL Certificate Expiration and Howler Monkey's Inceptionroyrapoport
 
Canary Analyze All the Things
Canary Analyze All the ThingsCanary Analyze All the Things
Canary Analyze All the Thingsroyrapoport
 
Cloud Tech III: Actionable Metrics
Cloud Tech III: Actionable MetricsCloud Tech III: Actionable Metrics
Cloud Tech III: Actionable Metricsroyrapoport
 
Lean Engineering: How to make Engineering a full Lean UX partner
Lean Engineering: How to make Engineering a full Lean UX partnerLean Engineering: How to make Engineering a full Lean UX partner
Lean Engineering: How to make Engineering a full Lean UX partnerBill Scott
 
Python Through the Back Door: Netflix Presentation at CodeMash 2014
Python Through the Back Door: Netflix Presentation at CodeMash 2014Python Through the Back Door: Netflix Presentation at CodeMash 2014
Python Through the Back Door: Netflix Presentation at CodeMash 2014royrapoport
 
Operational Insight: Concepts and Examples (w/o Presenter Notes)
Operational Insight: Concepts and Examples (w/o Presenter Notes)Operational Insight: Concepts and Examples (w/o Presenter Notes)
Operational Insight: Concepts and Examples (w/o Presenter Notes)royrapoport
 
Netflix – A Game Changer in Internet streaming media
Netflix – A Game Changer in Internet streaming mediaNetflix – A Game Changer in Internet streaming media
Netflix – A Game Changer in Internet streaming mediaAshish Arora
 
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...Adrian Cockcroft
 
Cassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSCassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSAdrian Cockcroft
 

Andere mochten auch (20)

Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesYow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
 
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
 
Culture
CultureCulture
Culture
 
10 tips to get started with html5 games
10 tips to get started with html5 games10 tips to get started with html5 games
10 tips to get started with html5 games
 
DSP開発におけるSpark MLlibの活用
DSP開発におけるSpark MLlibの活用DSP開発におけるSpark MLlibの活用
DSP開発におけるSpark MLlibの活用
 
Operational Insight: Concepts and Examples
Operational Insight: Concepts and ExamplesOperational Insight: Concepts and Examples
Operational Insight: Concepts and Examples
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of Databricks
 
Gluecon keynote
Gluecon keynoteGluecon keynote
Gluecon keynote
 
Technology stack of social networks [MTS]
Technology stack of social networks [MTS]Technology stack of social networks [MTS]
Technology stack of social networks [MTS]
 
SV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source PlatformSV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source Platform
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in Kafka
 
SSL Certificate Expiration and Howler Monkey's Inception
SSL Certificate Expiration and Howler Monkey's InceptionSSL Certificate Expiration and Howler Monkey's Inception
SSL Certificate Expiration and Howler Monkey's Inception
 
Canary Analyze All the Things
Canary Analyze All the ThingsCanary Analyze All the Things
Canary Analyze All the Things
 
Cloud Tech III: Actionable Metrics
Cloud Tech III: Actionable MetricsCloud Tech III: Actionable Metrics
Cloud Tech III: Actionable Metrics
 
Lean Engineering: How to make Engineering a full Lean UX partner
Lean Engineering: How to make Engineering a full Lean UX partnerLean Engineering: How to make Engineering a full Lean UX partner
Lean Engineering: How to make Engineering a full Lean UX partner
 
Python Through the Back Door: Netflix Presentation at CodeMash 2014
Python Through the Back Door: Netflix Presentation at CodeMash 2014Python Through the Back Door: Netflix Presentation at CodeMash 2014
Python Through the Back Door: Netflix Presentation at CodeMash 2014
 
Operational Insight: Concepts and Examples (w/o Presenter Notes)
Operational Insight: Concepts and Examples (w/o Presenter Notes)Operational Insight: Concepts and Examples (w/o Presenter Notes)
Operational Insight: Concepts and Examples (w/o Presenter Notes)
 
Netflix – A Game Changer in Internet streaming media
Netflix – A Game Changer in Internet streaming mediaNetflix – A Game Changer in Internet streaming media
Netflix – A Game Changer in Internet streaming media
 
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
 
Cassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSCassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWS
 

Ähnlich wie Keeping Movies Running Amid Thunderstorms!

Practical Cloud Security
Practical Cloud SecurityPractical Cloud Security
Practical Cloud SecurityJason Chan
 
eZ publish - Instant Publishing and Greater Traffic
eZ publish - Instant Publishing and Greater TrafficeZ publish - Instant Publishing and Greater Traffic
eZ publish - Instant Publishing and Greater TrafficGauthier Garnier
 
Windsor: Domain 0 Disaggregation for XenServer and XCP
	Windsor: Domain 0 Disaggregation for XenServer and XCP	Windsor: Domain 0 Disaggregation for XenServer and XCP
Windsor: Domain 0 Disaggregation for XenServer and XCPThe Linux Foundation
 
Froscon2011: How i learned to use sql and then learned not to use it
Froscon2011:  How i learned to use sql and then learned not to use itFroscon2011:  How i learned to use sql and then learned not to use it
Froscon2011: How i learned to use sql and then learned not to use itHenrik Ingo
 
You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...
You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...
You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...rhatr
 
Dc tco in_a_nutshell
Dc tco in_a_nutshellDc tco in_a_nutshell
Dc tco in_a_nutshellerjosito
 
Pm 01 bradley stone_openstorage_openstack
Pm 01 bradley stone_openstorage_openstackPm 01 bradley stone_openstorage_openstack
Pm 01 bradley stone_openstorage_openstackOpenCity Community
 
Openstorage with OpenStack, by Bradley
Openstorage with OpenStack, by BradleyOpenstorage with OpenStack, by Bradley
Openstorage with OpenStack, by BradleyHui Cheng
 
Learn OpenStack from trystack.cn ——Folsom in practice
Learn OpenStack from trystack.cn  ——Folsom in practiceLearn OpenStack from trystack.cn  ——Folsom in practice
Learn OpenStack from trystack.cn ——Folsom in practiceOpenCity Community
 
Deploying large payloads at scale
Deploying large payloads at scaleDeploying large payloads at scale
Deploying large payloads at scaleramonvanalteren
 
CMPE 297 Lecture: Building Infrastructure Clouds with OpenStack
CMPE 297 Lecture: Building Infrastructure Clouds with OpenStackCMPE 297 Lecture: Building Infrastructure Clouds with OpenStack
CMPE 297 Lecture: Building Infrastructure Clouds with OpenStackJoe Arnold
 
Vm13 vnx mixed workloads
Vm13 vnx mixed workloadsVm13 vnx mixed workloads
Vm13 vnx mixed workloadspittmantony
 
Membase Meetup - San Diego
Membase Meetup - San DiegoMembase Meetup - San Diego
Membase Meetup - San DiegoMembase
 
Stampede.io CoreOS + Digital Ocean Meetup
Stampede.io CoreOS + Digital Ocean MeetupStampede.io CoreOS + Digital Ocean Meetup
Stampede.io CoreOS + Digital Ocean MeetupDarren Shepherd
 
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a Richard Elling
 

Ähnlich wie Keeping Movies Running Amid Thunderstorms! (20)

Practical Cloud Security
Practical Cloud SecurityPractical Cloud Security
Practical Cloud Security
 
eZ Accelerator v1
eZ Accelerator v1eZ Accelerator v1
eZ Accelerator v1
 
eZ publish - Instant Publishing and Greater Traffic
eZ publish - Instant Publishing and Greater TrafficeZ publish - Instant Publishing and Greater Traffic
eZ publish - Instant Publishing and Greater Traffic
 
Windsor: Domain 0 Disaggregation for XenServer and XCP
	Windsor: Domain 0 Disaggregation for XenServer and XCP	Windsor: Domain 0 Disaggregation for XenServer and XCP
Windsor: Domain 0 Disaggregation for XenServer and XCP
 
Froscon2011: How i learned to use sql and then learned not to use it
Froscon2011:  How i learned to use sql and then learned not to use itFroscon2011:  How i learned to use sql and then learned not to use it
Froscon2011: How i learned to use sql and then learned not to use it
 
You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...
You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...
You Call that Micro, Mr. Docker? How OSv and Unikernels Help Micro-services S...
 
Dc tco in_a_nutshell
Dc tco in_a_nutshellDc tco in_a_nutshell
Dc tco in_a_nutshell
 
Openstorage Openstack
Openstorage OpenstackOpenstorage Openstack
Openstorage Openstack
 
Pm 01 bradley stone_openstorage_openstack
Pm 01 bradley stone_openstorage_openstackPm 01 bradley stone_openstorage_openstack
Pm 01 bradley stone_openstorage_openstack
 
Openstorage with OpenStack, by Bradley
Openstorage with OpenStack, by BradleyOpenstorage with OpenStack, by Bradley
Openstorage with OpenStack, by Bradley
 
Learn OpenStack from trystack.cn ——Folsom in practice
Learn OpenStack from trystack.cn  ——Folsom in practiceLearn OpenStack from trystack.cn  ——Folsom in practice
Learn OpenStack from trystack.cn ——Folsom in practice
 
Deploying large payloads at scale
Deploying large payloads at scaleDeploying large payloads at scale
Deploying large payloads at scale
 
eZ Publish nextgen
eZ Publish nextgeneZ Publish nextgen
eZ Publish nextgen
 
CMPE 297 Lecture: Building Infrastructure Clouds with OpenStack
CMPE 297 Lecture: Building Infrastructure Clouds with OpenStackCMPE 297 Lecture: Building Infrastructure Clouds with OpenStack
CMPE 297 Lecture: Building Infrastructure Clouds with OpenStack
 
Vm13 vnx mixed workloads
Vm13 vnx mixed workloadsVm13 vnx mixed workloads
Vm13 vnx mixed workloads
 
Arrays in database systems, the next frontier?
Arrays in database systems, the next frontier?Arrays in database systems, the next frontier?
Arrays in database systems, the next frontier?
 
Membase Meetup - San Diego
Membase Meetup - San DiegoMembase Meetup - San Diego
Membase Meetup - San Diego
 
Stampede.io CoreOS + Digital Ocean Meetup
Stampede.io CoreOS + Digital Ocean MeetupStampede.io CoreOS + Digital Ocean Meetup
Stampede.io CoreOS + Digital Ocean Meetup
 
Kubernetes
KubernetesKubernetes
Kubernetes
 
USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a USENIX LISA11 Tutorial: ZFS a
USENIX LISA11 Tutorial: ZFS a
 

Mehr von Sid Anand

Building High Fidelity Data Streams (QCon London 2023)
Building High Fidelity Data Streams (QCon London 2023)Building High Fidelity Data Streams (QCon London 2023)
Building High Fidelity Data Streams (QCon London 2023)Sid Anand
 
Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Building & Operating High-Fidelity Data Streams - QCon Plus 2021Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Building & Operating High-Fidelity Data Streams - QCon Plus 2021Sid Anand
 
Low Latency Fraud Detection & Prevention
Low Latency Fraud Detection & PreventionLow Latency Fraud Detection & Prevention
Low Latency Fraud Detection & PreventionSid Anand
 
YOW! Data Keynote (2021)
YOW! Data Keynote (2021)YOW! Data Keynote (2021)
YOW! Data Keynote (2021)Sid Anand
 
Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)Sid Anand
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowSid Anand
 
Cloud Native Predictive Data Pipelines (micro talk)
Cloud Native Predictive Data Pipelines (micro talk)Cloud Native Predictive Data Pipelines (micro talk)
Cloud Native Predictive Data Pipelines (micro talk)Sid Anand
 
Cloud Native Data Pipelines (GoTo Chicago 2017)
Cloud Native Data Pipelines (GoTo Chicago 2017)Cloud Native Data Pipelines (GoTo Chicago 2017)
Cloud Native Data Pipelines (GoTo Chicago 2017)Sid Anand
 
Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Sid Anand
 
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese)  - QCon TokyoCloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese) - QCon TokyoSid Anand
 
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)Sid Anand
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Sid Anand
 
Airflow @ Agari
Airflow @ Agari Airflow @ Agari
Airflow @ Agari Sid Anand
 
Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)Sid Anand
 
Resilient Predictive Data Pipelines (QCon London 2016)
Resilient Predictive Data Pipelines (QCon London 2016)Resilient Predictive Data Pipelines (QCon London 2016)
Resilient Predictive Data Pipelines (QCon London 2016)Sid Anand
 
Software Developer and Architecture @ LinkedIn (QCon SF 2014)
Software Developer and Architecture @ LinkedIn (QCon SF 2014)Software Developer and Architecture @ LinkedIn (QCon SF 2014)
Software Developer and Architecture @ LinkedIn (QCon SF 2014)Sid Anand
 
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)Sid Anand
 
Building a Modern Website for Scale (QCon NY 2013)
Building a Modern Website for Scale (QCon NY 2013)Building a Modern Website for Scale (QCon NY 2013)
Building a Modern Website for Scale (QCon NY 2013)Sid Anand
 
Hands On with Maven
Hands On with MavenHands On with Maven
Hands On with MavenSid Anand
 
Learning git
Learning gitLearning git
Learning gitSid Anand
 

Mehr von Sid Anand (20)

Building High Fidelity Data Streams (QCon London 2023)
Building High Fidelity Data Streams (QCon London 2023)Building High Fidelity Data Streams (QCon London 2023)
Building High Fidelity Data Streams (QCon London 2023)
 
Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Building & Operating High-Fidelity Data Streams - QCon Plus 2021Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Building & Operating High-Fidelity Data Streams - QCon Plus 2021
 
Low Latency Fraud Detection & Prevention
Low Latency Fraud Detection & PreventionLow Latency Fraud Detection & Prevention
Low Latency Fraud Detection & Prevention
 
YOW! Data Keynote (2021)
YOW! Data Keynote (2021)YOW! Data Keynote (2021)
YOW! Data Keynote (2021)
 
Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
Cloud Native Predictive Data Pipelines (micro talk)
Cloud Native Predictive Data Pipelines (micro talk)Cloud Native Predictive Data Pipelines (micro talk)
Cloud Native Predictive Data Pipelines (micro talk)
 
Cloud Native Data Pipelines (GoTo Chicago 2017)
Cloud Native Data Pipelines (GoTo Chicago 2017)Cloud Native Data Pipelines (GoTo Chicago 2017)
Cloud Native Data Pipelines (GoTo Chicago 2017)
 
Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)
 
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese)  - QCon TokyoCloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
 
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016
 
Airflow @ Agari
Airflow @ Agari Airflow @ Agari
Airflow @ Agari
 
Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)
 
Resilient Predictive Data Pipelines (QCon London 2016)
Resilient Predictive Data Pipelines (QCon London 2016)Resilient Predictive Data Pipelines (QCon London 2016)
Resilient Predictive Data Pipelines (QCon London 2016)
 
Software Developer and Architecture @ LinkedIn (QCon SF 2014)
Software Developer and Architecture @ LinkedIn (QCon SF 2014)Software Developer and Architecture @ LinkedIn (QCon SF 2014)
Software Developer and Architecture @ LinkedIn (QCon SF 2014)
 
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
 
Building a Modern Website for Scale (QCon NY 2013)
Building a Modern Website for Scale (QCon NY 2013)Building a Modern Website for Scale (QCon NY 2013)
Building a Modern Website for Scale (QCon NY 2013)
 
Hands On with Maven
Hands On with MavenHands On with Maven
Hands On with Maven
 
Learning git
Learning gitLearning git
Learning git
 

Kürzlich hochgeladen

Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 

Kürzlich hochgeladen (20)

Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

Keeping Movies Running Amid Thunderstorms!

  • 1. Keeping Movies Running Amid Thunderstorms Fault-tolerant Systems @ Netflix Sid Anand (@r39132) QCon SF 2011 1 Thursday, November 17, 2011
  • 2. Backgrounder Netflix Then and Now 2 Thursday, November 17, 2011
  • 3. Netflix Then and Now Netflix prior to circa 2009 Netflix post circa 2009 Users watched DVDs at home Users watch streaming at home Peak days : Friday, Saturday, Sunday Peak days : Friday, Saturday, Sunday Users returned DVDs & Updated their Qs Off-Peak days see many orders of magnitude more traffic than prior to Peak days : Sunday, Monday 2009 We shipped the next DVDs User expectation is that streaming is always available Peak days : Monday, Tuesday No Scheduled Site Downtimes Scheduled Site Downtimes on alternate Wednesdays Fault Tolerance is a top design concern 3 Thursday, November 17, 2011
  • 4. Netflix DC Architecture A Simple System 4 Thursday, November 17, 2011
  • 5. Netflix’s DC Architecture Components H/W Load Balancer 1 Netscaler H/W Load Balancer Apache + Tomcat Apache + Tomcat Apache + Tomcat ~20 “WWW” Apache+Tomcat servers 3 Oracle DBs & 1 MySQL DB Cache Servers Cinematch System MySQL Oracle Cache Servers Cinematch Recommendation System 5 Thursday, November 17, 2011
  • 6. Netflix’s DC Architecture Types of Production Issues H/W Load Balancer Java Garbage Collection problems, which would would result in slower Apache + Tomcat Apache + Tomcat Apache + Tomcat WWW pages Deadlocks in our multi-threaded Java application would cause web page loading to timeout Cinematch System MySQL Oracle Cache Servers Transaction locking in the DB would result in the similar web page loading timeouts Under-optimized SQL or DB would cause slower web pages (e.g. DB optimizer picks a sub-optimal the execution plan) 6 Thursday, November 17, 2011
  • 7. Netflix’s DC Architecture H/W Load Balancer Architecture Pros As serious as these sound, they were Apache + Tomcat Apache + Tomcat Apache + Tomcat typically single-system failure scenarios Single-system failures are relatively easy to resolve Architecture Cons Cinematch System MySQL Oracle Cache Servers Not horizontally scalable Weʼre constrained by what can fit on a single box Not conducive to high-velocity development and deployment 7 Thursday, November 17, 2011
  • 8. Netflix’s Cloud Architecture A Less Simple System 8 Thursday, November 17, 2011
  • 9. Netflix’s Cloud Architecture ELB ELB NES NES NES NES Components Many (~100) applications, organized in Discovery clusters NMTS NMTS NMTS NMTS Clusters can be at different levels in the call stack NMTS NMTS Clusters can call each other NBES NBES IAAS IAAS IAAS 9 Thursday, November 17, 2011
  • 10. Netflix’s Cloud Architecture ELB ELB Levels NES NES NES NES NES : Netflix Edge Services Discovery NMTS : Netflix Mid-tier Services NMTS NMTS NMTS NMTS NBES : Netflix Back-end Services IAAS : AWS IAAS Services NMTS NMTS Discovery : Help services discover NMTS and NBES services NBES NBES IAAS IAAS IAAS 10 Thursday, November 17, 2011
  • 11. Netflix’s Cloud Architecture ELB ELB Components (NES) NES NES NES NES Overview Any service that browsers and streaming Discovery devices connect to over the internet NMTS NMTS NMTS NMTS They sit behind AWS Elastic Load Balancers (a.k.a. ELB) NMTS NMTS They call clusters at lower levels NBES NBES IAAS IAAS IAAS 11 Thursday, November 17, 2011
  • 12. Netflix’s Cloud Architecture Components (NES) ELB ELB Examples NES NES NES NES API Servers Discovery Support the video browsing experience NMTS NMTS NMTS NMTS Also allows users to modify their Q Streaming Control Servers NMTS NMTS Support streaming video playback Authenticate your Wii, PS3, etc... NBES NBES Download DRM to the Wii, PS3, etc... Return a list of CDN urls to the Wii, PS3, etc... IAAS IAAS IAAS 12 Thursday, November 17, 2011
  • 13. Netflix’s Cloud Architecture ELB ELB Components (NMTS) NES NES NES NES Overview Discovery Can call services at the same or lower NMTS NMTS NMTS NMTS levels Other NMTS NMTS NMTS NBES, IAAS Not NES NBES NBES Exposed through our Discovery service IAAS IAAS IAAS 13 Thursday, November 17, 2011
  • 14. Netflix’s Cloud Architecture ELB ELB Components (NMTS) NES NES NES NES Examples Discovery Netflix Queue Servers NMTS NMTS NMTS NMTS Modify items in the usersʼ movie queue Viewing History Servers NMTS NMTS Record and track all streaming movie watching SIMS Servers NBES NBES Compute and serve user-to-user and movie-to-movie similarities IAAS IAAS IAAS 14 Thursday, November 17, 2011
  • 15. Netflix’s Cloud Architecture ELB ELB Components (NBES) NES NES NES NES Overview Discovery A back-end, usually 3rd party, open-source service NMTS NMTS NMTS NMTS Leaf in the call tree. Cannot call anything else NMTS NMTS NBES NBES IAAS IAAS IAAS 15 Thursday, November 17, 2011
  • 16. Netflix’s Cloud Architecture ELB ELB Components (NBES) NES NES NES NES Examples Discovery Cassandra Clusters NMTS NMTS NMTS NMTS Our new cloud database is Cassandra and stores all sorts of data to support application needs NMTS NMTS Zookeeper Clusters Our distributed lock service and sequence NBES NBES generator Memcached Clusters Typically caches things that we store in S3 but need to access quickly or often IAAS IAAS IAAS 16 Thursday, November 17, 2011
  • 17. Netflix’s Cloud Architecture ELB ELB Components (IAAS) NES NES NES NES Examples AWS S3 Discovery Large-sized data (e.g. video encodes, NMTS NMTS NMTS NMTS application logs, etc...) is stored here, not Cassandra NMTS NMTS AWS SQS Amazonʼs message queue to send events (e.g. Facebook network updates are processed asynchronously over SQS) NBES NBES IAAS IAAS IAAS 17 Thursday, November 17, 2011
  • 18. Netflix’s Cloud Architecture ELB ELB Types of Production Issues NES NES NES NES A user-issued call will pass through multiple levels during normal operation Discovery We are now exposed to multi-system NMTS NMTS NMTS NMTS coincident failures, a.k.a. coordinated failures NMTS NMTS NBES NBES IAAS IAAS IAAS 18 Thursday, November 17, 2011
  • 19. Netflix’s Cloud Architecture ELB ELB Architecture Pros NES NES NES NES Horizontally scalable at every level Should give us maximum availability Discovery NMTS NMTS NMTS NMTS Supports high-velocity development and deployment Architecture Cons NMTS NMTS A user-issued call will pass through multiple levels (a.k.a. hops) during normal operation NBES NBES Latency can be a concern We are now exposed to multi-system coincident failures, a.k.a. coordinated IAAS IAAS IAAS failures A lot of moving parts 19 Thursday, November 17, 2011
  • 20. Issue 1 Capacity Planning 20 Thursday, November 17, 2011
  • 21. Issue 1 X X Y Y • Service X and Service Y, each made up of 2 instances, call Service A, also made up of 2 instance • If either of these services expect a large increase in A A traffic, they need to let the owner of Service A know • Service A can then scale up ahead of the traffic X X Y Y increase Disaster Avoided ?? A A A A A A 21 Thursday, November 17, 2011
  • 22. Issue 1 • A given application owner may need to contact 20 other application owners each time he expects to get a large increase in traffic X X Y Y • Too much human coordination • A few options A A • Some service owners vastly over-provision for their application X X Y Y • Not cost effective • Auto-scaling • We want to generalize the model first A A A A A A proved by our Streaming Control Server (a.k.a. NCCP) team 22 Thursday, November 17, 2011
  • 23. ELB AutoScaling Interlude How to use an ELB An elastic-load balancer (ELB) routes traffic to your EC2 instances e.g. of an ELB : nccp-wii-11111111.us- east-1.elb.amazonaws.com Netflix maps a CNAME to this ELB e.g. : nccp.wii.netflix.com Netflix then registers the API Service’s EC2 instances with this ELB The ELB periodically polls attached EC2 instances to ensure the instances are healthy 23 Thursday, November 17, 2011
  • 24. ELB AutoScaling Interlude Taking this a bit further The NCCP servers can publish metrics to AWS CloudWatch We can set up an alarm in Cloud Watch on a metric (e.g. CPU) We can associate an auto scale policy with that alarm (e.g. if CPU > 60%, add 3 more instances) When a metric goes above a limit, an alarm is triggered, causing auto-scaling, which grows our pool 24 Thursday, November 17, 2011
  • 25. ELB AutoScaling Interlude Cloud NCCP EC2 instances publish Watch CPU data to CW (Alarms) CloudWatch alarms trigger ASG policies Auto EC2 Instances Scaling Service Added/Removed (Policies) 25 Thursday, November 17, 2011
  • 26. ELB AutoScaling Interlude Scale Out Event Average CPU > 60% for 5 minutes Scale In Event Average CPU < 30% FOR 5 minutes Cool Down Period 10 minutes Auto-Scale Alerts DLAutoScaleEvents 26 Thursday, November 17, 2011
  • 27. 27 @r39132 23 Thursday, November 17, 2011
  • 28. Issue 1 X X Y Y Summary A A We would like to have auto-scaling at all levels. X X Y Y A A A A A A 28 Thursday, November 17, 2011
  • 29. Issue 2 Thundering herds to NMTS 29 Thursday, November 17, 2011
  • 30. Issue 2 X X Y Y A A Step 1 Step 1 X X Y Y Service X and Service Y, each made up of 2 instances, call Service A, also made up of 2 instance A A Step 2a Step 2a Service Y overwhelms Service A X X Y Y s ut ts o Step 3 ou e m e Ti m Services X & Y experience read and Ti A A connection timeouts against an overwhelmed Step 3 Service A Step 4 X X Y Y s ut ts o ou e Service Aʼs tier get 2 more machines m e Ti m Ti JUST A ADDED A A A JUST ADDED Step 4 30 Thursday, November 17, 2011
  • 31. Issue 2 X X Y Y Step 5 • New requests + Retries cause A A A A Step 5 request storms (a.k.a. thundering herds) • If Service A can be grown to exceed retry storm steady-state traffic VICIOUS CYCLE volume, we can exit this vicious cycle Step 6 X X Y Y ts • Else, more timeouts, and VC ou ts ou continues e m e Ti m Ti A A A A Step 6 31 Thursday, November 17, 2011
  • 32. Issue 2 X X Y Y A A Step 1 Step 1 X X Y Y Service X and Service Y, each made up of 2 instances, call Service A, also made up of 2 instance A A Step 2b SLOW Step 2b Service A experiences slowness X X Y Y s ut ts o Step 3 ou e m e Ti m Services X & Y experience read and Ti A A connection timeouts against a slower Service A Step 3 Step 4 X ts X Y Y ou ts If the slowness can be fixed by adding more ou e machines to Service Aʼs tier, then do so m e Ti m OPTIONAL Ti JUST A ADDED A A A JUST ADDED Step 4 32 Thursday, November 17, 2011
  • 33. Issue 2 X X Y Y Step 5 • New requests + Retries cause A A A A Step 5 request storms (a.k.a. thundering herds) • If Service A can be grown to exceed retry storm steady-state traffic VICIOUS CYCLE volume, we can exit this vicious cycle Step 6 X X Y Y ts • Else, more timeouts, and VC ou ts ou continues e m e Ti m Ti A A A A Step 6 33 Thursday, November 17, 2011
  • 34. Issue 2 S1 Potential Causes of Thundering Herd Thundering Herd (Downstream) • Service Y sends more traffic to Service A, without checking if Service A has Time Outs (Upstream) available capacity • Service A slows down S2 • Service Yʻs time outs against Service A are set too low • Service Yʻs retries against Service A are too aggressive • Natural organic growth in traffic hit a tipping point in the system -- in Service A S2 in this case 34 Thursday, November 17, 2011
  • 35. Solutions to Issue 2 Thundering herds to NMTS 35 Thursday, November 17, 2011
  • 36. Solutions to Issue 2 X The Platform Solution NIWS Throttle Layer • Every service at Netflix sits on the platform.jar NIWS • The platform.jar offers 2 components of interest here: • NIWS library : the client-side of Netflix Inter-Web Service calls. Handles retry, failover, thundering-herd prevention, & fast failure • BaseServer library : a set of Tomcat servlet filters BaseServer Throttle Layer that protect the underlying application servlet stack. In this context, it throttles traffic BaseServer Filter Chain A 36 Thursday, November 17, 2011
  • 37. Solutions to Issue 2 X The Platform Solution NIWS Throttle Layer • NIWS library NIWS • Fair Retry Logic : e.g. exponential bounded backoff • Takes 2 configuration params per client: • Max_Num_of_Requests (a.k.a. MNR) • Sample_Interval_in_seconds (a.k.a. SI) BaseServer Throttle Layer • Ensures that a client does not send more than MNR/ BaseServer Filter Chain SI requests/s, else throttles requests at the client A 37 Thursday, November 17, 2011
  • 38. Solutions to Issue 2 X The Platform Solution NIWS Throttle Layer • BaseServer NIWS • As an additional fail-safe, the server can set throttles that are not client specific (i.e. the limits apply to total inbound traffic, regardless of client) • Takes 1 configuration parameter: • Max_Num_of_Concurrent_Requests (a.k.a. BaseServer Throttle Layer MNCR) BaseServer Filter Chain • Ensures that a server does not handle more than MNCR requests at any instant A • If the traffic exceeds the limits, reject excess calls at the server (i.e. 503s) 38 Thursday, November 17, 2011
  • 39. Solutions to Issue 2 The Platform Solution X • Graceful Degradation NIWS Throttle Layer • Any client that is throttled at either the NIWS NIWS Throttle Layer or the BaseServer Throttle Layer need to implement graceful degradation • Netflixʼs Web Scale Traffic falls in 2 categories: • Users get a personalized set of movies to pick from (i.e. via API Edge Server path) BaseServer Throttle Layer • GD : Show popular movies, not BaseServer Filter Chain personalized movies A • Users can start watching a movie (i.e. via NCCP Edge Server path) • GD : tougher problem to solve • When device leases expire, we honor them if we are unable to generate a new one for them 39 Thursday, November 17, 2011
  • 40. Solutions to Issue 2 This all sounds great! • But, what if developers do not use these built-in features of the platform or neglect to set their configuration appropriately? • (i.e. the default RPS limit in the NIWS client is Integer.MAX_VALUE) 40 Thursday, November 17, 2011
  • 41. Solutions to Issue 2 We have a little help 41 Thursday, November 17, 2011
  • 42. Simian Army Prevention is the best medicine 42 Thursday, November 17, 2011
  • 43. Simian Army • Chaos Monkey • Simulates hard failures in AWS by killing a few instances per ASG (e.g. Auto Scale Group) • Similar to how EC2 instances can be killed by AWS with little warning • Tests clientsʼ ability to gracefully deal with broken connections, interrupted calls, etc... • Verifies that all services are running within the protection of AWS Auto Scale Groups, which reincarnates killed instances • If not, the Chaos monkey will win! 43 Thursday, November 17, 2011
  • 44. Simian Army • Latency Monkey • Simulates soft failures -- i.e. a service gets slower • Injects random delays in NIWS (client-side) or BaseServer (server-side) of a client-server interaction in production • Tests the ability of applications to detect and recover (i.e. Graceful Degradation) from the harder problem of delays, that leads to thundering herd and timeouts 44 Thursday, November 17, 2011
  • 45. Simian Army Does this solve all of our issues? 45 Thursday, November 17, 2011
  • 46. Simian Army The infinite cloud is infinite when your needs are moderate! To ensure fairness among tenants, AWS meters or limits every resource Hence, we hit limits quite often. Our “velocity” is limited by how long it takes for AWS to turn around and raise the limit -- a few hours! 46 Thursday, November 17, 2011
  • 47. Simian Army • Limits Monkey • Checks once a day whether we are approaching one of our limits and triggers alerts for us to proactively reach out to AWS! • Conformity & Janitor Monkeys • Finds and clean up orphaned resources (e.g. EC2 instances that are not in an ASG, unreferenced security groups, ELBs, ASGs, etc...) to increase head-room • Buys us more time before we run out of resources and also saves us $$$$ 47 Thursday, November 17, 2011
  • 48. Simian Army The Simian Army fills the gap created by an absence of process and a need to ensure fault-tolerance and efficient operation of our systems 48 Thursday, November 17, 2011
  • 49. Fast Rollback Fault-tolerant deployment 49 Thursday, November 17, 2011
  • 50. Fast Rollback What is the point of having Fault-Tolerant layers if deployments of a bug can take them down? 50 Thursday, November 17, 2011
  • 51. Fast Rollback 51 Thursday, November 17, 2011
  • 52. Fast Rollback Optimism causes outages 51 Thursday, November 17, 2011
  • 53. Fast Rollback Optimism causes outages Production traffic is unique 51 Thursday, November 17, 2011
  • 54. Fast Rollback Optimism causes outages Production traffic is unique Keep old version running 51 Thursday, November 17, 2011
  • 55. Fast Rollback Optimism causes outages Production traffic is unique Keep old version running Switch traffic to new version 51 Thursday, November 17, 2011
  • 56. Fast Rollback Optimism causes outages Production traffic is unique Keep old version running Switch traffic to new version Monitor results 51 Thursday, November 17, 2011
  • 57. Fast Rollback Optimism causes outages Production traffic is unique Keep old version running Switch traffic to new version Monitor results Revert traffic quickly 51 Thursday, November 17, 2011
  • 58. Fast Rollback 52 Thursday, November 17, 2011
  • 59. Fast Rollback api-frontend api-usprod-v007 52 Thursday, November 17, 2011
  • 60. Fast Rollback api-frontend api-usprod-v007 api-usprod-v008 52 Thursday, November 17, 2011
  • 61. Fast Rollback api-frontend api-usprod-v007 api-usprod-v008 52 Thursday, November 17, 2011
  • 62. Fast Rollback api-frontend api-usprod-v007 api-usprod-v008 52 Thursday, November 17, 2011
  • 63. Fast Rollback api-frontend api-usprod-v007 api-usprod-v008 52 Thursday, November 17, 2011
  • 64. Fast Rollback api-frontend api-usprod-v008 52 Thursday, November 17, 2011
  • 65. Fast Rollback 53 Thursday, November 17, 2011
  • 66. Fast Rollback api-frontend api-usprod-v007 53 Thursday, November 17, 2011
  • 67. Fast Rollback api-frontend api-usprod-v007 api-usprod-v008 53 Thursday, November 17, 2011
  • 68. Fast Rollback api-frontend api-usprod-v007 api-usprod-v008 53 Thursday, November 17, 2011
  • 69. Fast Rollback api-frontend api-usprod-v007 api-usprod-v008 53 Thursday, November 17, 2011
  • 70. Fast Rollback api-frontend api-usprod-v007 53 Thursday, November 17, 2011
  • 71. Acknowledgements Platform Engineering • Sudhir Tonse • Pradeep Kamath Engineering Tools • Joe Sondow Streaming Server • Ranjit Mavinkurve 54 Thursday, November 17, 2011
  • 72. Questions? Sid Anand @r39132 55 Thursday, November 17, 2011