Dynamic Reconfiguration of ZooKeeper

             Alex Shraer
    (presented by Benjamin Reed)
Why ZooKeeper?




• Lots of servers
• Lots of processes
• High volumes of data
• Highly complex software systems
• … mere mortal developers
What ZooKeeper gives you
●   Simple programming model
●   Coordination of distributed processes
●   Fast notification of changes
●   Elasticity
●   Easy setup
●   High availability
ZooKeeper Configuration

• Membership
• Role of each server
  – E.g., follower or observer
• Quorum System spec
  – Zookeeper: majority or hierarchical
• Network addresses & ports
• Timeouts, directory paths, etc.
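All of these settings traditionally live in a static configuration file (zoo.cfg) on each server; changing the membership meant editing this file on every node and restarting. A minimal illustrative example (hostnames and paths are placeholders, not from the talk):

```ini
# zoo.cfg - illustrative static ZooKeeper configuration
tickTime=2000                 # basic time unit (ms) for heartbeats/timeouts
initLimit=10                  # ticks a follower may take to connect and sync
syncLimit=5                   # ticks a follower may lag behind the leader
dataDir=/var/lib/zookeeper    # directory for the in-memory snapshot + txn log
clientPort=2181               # port clients connect to

# Membership and per-server roles: server.N=host:quorumPort:electionPort[:role]
server.1=zoo1.example.com:2888:3888
server.2=zoo2.example.com:2888:3888
server.3=zoo3.example.com:2888:3888:observer
```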
Zookeeper - distributed and replicated
                                 ZooKeeper Service
                                    Leader

             Server     Server        Server            Server        Server




    Client   Client   Client     Client        Client        Client     Client   Client


• All servers store a copy of the data (in memory)
• A leader is elected at startup
• Reads served by followers, all updates go through leader
• Update acked when a quorum of servers have persisted the
  change (on disk)
• Zookeeper uses ZAB - its own atomic broadcast protocol
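The quorum-ack rule above can be illustrated in a few lines of Python (a sketch of the majority-quorum condition, not ZooKeeper's actual code):

```python
# An update is committed only once a quorum (here, a majority) of the
# ensemble has persisted the change to disk and acknowledged it.
def is_committed(acks, ensemble):
    """acks: set of server ids that have persisted the change."""
    return len(acks & ensemble) > len(ensemble) // 2

ensemble = {"A", "B", "C", "D", "E"}
print(is_committed({"A", "B", "C"}, ensemble))  # 3 of 5 acked -> True
print(is_committed({"A", "B"}, ensemble))       # only 2 of 5 -> False
```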
Dynamic Membership Changes
• Necessary in every long-lived system!
• Examples:
   – Cloud computing: adapt to changing load, don’t pre-allocate!
   – Failures: replacing failed nodes with healthy ones
   – Upgrades: replacing out-of-date nodes with up-to-date ones
   – Free up storage space: decreasing the number of replicas
   – Moving nodes: within the network or the data center
   – Increase resilience by changing the set of servers
  Example: asynchronous replication works as long as more than #servers/2 are operating.
Hazards of Manual Reconfiguration
                                     E
       A

                       C


        {A, B, C}

        B              {A, B, C}     D




           {A, B, C}


       • Goal: add servers E and D
Hazards of Manual Reconfiguration
                                           E
        A

                           C


       {A, B, C, D, E}                      {A, B, C, D, E}

         B               {A, B, C, D, E}   D




       {A, B, C, D, E}
                                           {A, B, C, D, E}


        •    Goal: add servers E and D
        •    Change configuration
        •    Restart servers
        •    Lost previously acknowledged updates!
          Just use a coordination service!
     • Zookeeper is the coordination service
        – Don’t want to deploy another system to coordinate it!


     • Who will reconfigure that system?
        – GFS has 3 levels of coordination services


     • More system components -> more management overhead


     • Use Zookeeper to reconfigure itself!
        – Other systems store configuration information in Zookeeper
        – Can we do the same??
        – Only if there are no failures
Recovery in Zookeeper

                  C               E

                           B


setData(/x, 5)

                                  D
                      A
This doesn’t work for reconfigurations!
                                                                              E
                                   C


                                                            B
                                   {A, B, C, D, E}                            {A, B, C, D, E}


    setData(/zookeeper/config, {A, B, F})
                                                            {A, B, C, D, E}   D
          remove C, D, E add F



                  F
                                                                              {A, B, C, D, E}
                                            A



      {A, B, F}
                                                {A, B, F}

•   Must persist the decision to reconfigure in the old
    config before activating the new config!
•   Once such decision is reached, must not allow further
    ops to be committed in old config
Our Solution
•   Correct
•   Fully automatic
•   No external services or additional components
•   Minimal changes to Zookeeper
•   Usually unnoticeable to clients
    – Pause operations only in rare circumstances
    – Clients work with a single configuration
• Rebalances clients across servers in new configuration

• Reconfigures immediately

• Speculative Reconfiguration
    – Reconfiguration (and commands that follow it) speculatively sent out by the
      primary, similarly to all other updates
Principles
●   Commit reconfig in a quorum of the old ensemble
    –   Submit reconfig op just like any other update
●   Make sure new ensemble has latest state before
    becoming active
    –   Get quorum of synced followers from new config
    –   Get acks from both old and new ensembles before committing
        updates proposed between reconfig op and activation
    –   Activate new configuration when reconfig commits
●   Once the new ensemble is active, the old ensemble cannot commit
    or propose new updates
●   Gossip activation through leader election and syncing
●   Verify configuration id of leader and follower
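The dual-quorum commit rule from the principles above can be sketched as follows (a toy model with assumed helper names, not the actual ZooKeeper implementation): while a reconfiguration is pending, an update commits only when quorums of both the old and the new ensembles have acknowledged it.

```python
def majority(acks, ensemble):
    # True if a majority quorum of `ensemble` is contained in `acks`.
    return len(acks & ensemble) > len(ensemble) // 2

def can_commit(acks, old_ensemble, new_ensemble, reconfig_pending):
    # Between the reconfig proposal and its activation, updates need
    # acks from quorums of BOTH configurations.
    if reconfig_pending:
        return majority(acks, old_ensemble) and majority(acks, new_ensemble)
    return majority(acks, old_ensemble)

old = {"A", "B", "C"}
new = {"A", "B", "C", "D", "E"}
# {A, B} is a quorum of the old config but not of the new one:
print(can_commit({"A", "B"}, old, new, reconfig_pending=True))       # False
print(can_commit({"A", "B", "D", "E"}, old, new, True))              # True
```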
Failure free flow
Reconfiguration scenario 1
                                 E
   A

                   C


    {A, B, C}                        {A, B, C}

    B              {A, B, C}     D




       {A, B, C}
                                      {A, B, C}


   • Goal: add servers E and D
Reconfiguration scenario 1
                                       E
    A

                       C


   {A, B, C, D, E}                     {A, B, C, D, E}

     B               {A, B, C, D, E}   D




   {A, B, C, D, E}
                                        {A, B, C, D, E}


    • Goal: add servers E and D
    • The reconfig op doesn't commit until
      quorums of both ensembles ack
    • E and D gossip the new configuration
      to C
Example - reconfig using CLI

reconfig -add 1=host1.com:1234:1235:observer;1239
         -add 2=host2.com:1236:1237:follower;1231 -remove 5
●   Change follower 1 to an observer and change its ports
●   Add follower 2 to the ensemble
●   Remove follower 5 from the ensemble

reconfig -file myNewConfig.txt -v 234547
●   Change the current config to the one in myNewConfig.txt
●   But only if the current config version is 234547

getConfig -w -c
●   Set a watch on /zookeeper/config
●   -c means we only want the new connection string for clients
When it will not work
●   A quorum of the new ensemble is not in sync
●   Another reconfig is in progress
●   The version condition check fails
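The version condition check behaves like a compare-and-swap on the configuration: the reconfig applies only if the caller's expected version matches the current one. A minimal sketch (hypothetical class, not the real ZooKeeper API):

```python
class ConfigStore:
    """Toy model of a versioned configuration znode."""
    def __init__(self, servers, version):
        self.servers, self.version = servers, version

    def reconfig(self, new_servers, expected_version=None):
        # Conditioned reconfig: reject the change if the caller's view
        # of the config version is stale (like the CLI's -v flag).
        if expected_version is not None and expected_version != self.version:
            raise ValueError("version condition check failed")
        self.servers = new_servers
        self.version += 1
        return self.version

store = ConfigStore({"A", "B", "C"}, version=234547)
store.reconfig({"A", "B", "C", "D"}, expected_version=234547)  # succeeds
try:
    store.reconfig({"A", "B"}, expected_version=234547)  # stale version
except ValueError as e:
    print(e)  # version condition check failed
```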
How do you know you are done
●   Write something somewhere
The “client side” of reconfiguration
• When system changes, clients need to stay connected
   – The usual solution: directory service (e.g., DNS)
• Re-balancing load during reconfiguration is also important!
• Goal: uniform #clients per server with minimal client migration
   – Migration should be proportional to change in membership




                  X 10   X 10   X 10
Our approach - Probabilistic Load Balancing
• Example 1: grow from 3 servers to 5

                X 10   X 10   X 10

   becomes

                X6     X6     X6     X6     X6

   –   Each client moves to a random new server with probability 0.4
   –   1 – 3/5 = 0.4
   –   Expected: 40% of clients will move off of each server
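The arithmetic behind Example 1 can be checked directly (plain Python, illustrative only):

```python
# Growing from 3 to 5 servers with 10 clients each: a client stays with
# probability old/new = 3/5, so it moves with probability 1 - 3/5 = 0.4.
old_servers, new_servers, clients_per_server = 3, 5, 10
p_move = 1 - old_servers / new_servers                 # 0.4
movers = old_servers * clients_per_server * p_move     # ~12 clients expected to move
per_new_server = movers / (new_servers - old_servers)  # ~6 land on each new server
print(round(p_move, 3), round(per_new_server, 3))  # 0.4 6.0
```

The result matches the slide: every server, old and new, ends up with an expected load of 30 / 5 = 6 clients.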
Our approach - Probabilistic Load Balancing
●   Example 2:
                                                   4/18        4/18    10/18




                         X6        X6       X6            X6      X6

    –   Connected clients don't move
    –   Disconnected clients move to old servers with probability 4/18 each and
        to the new one with probability 10/18
    –   Expected: 8 clients will move from A, B, C to D, E and 10 to F
Probabilistic Load Balancing
 When moving from configuration S to S':

 E(load(i, S')) = load(i, S) + Σ_{j∈S, j≠i} load(j, S) · Pr(j→i) − load(i, S) · Σ_{j∈S', j≠i} Pr(i→j)

 i.e., the expected #clients connected to i in S' (10 in the last example) is the
 #clients currently connected to i in S, plus the expected #clients moving to i
 from other servers in S, minus the expected #clients moving from i to other
 servers in S'.

 Solving for Pr we get case-specific probabilities.
 Input: each client answers locally
  Question 1: Are there more servers now or fewer?
  Question 2: Is my server being removed?
 Output: 1) disconnect or stay connected to my server
           if disconnecting: 2) Pr(connect to one of the old servers)
                  and Pr(connect to a newly added server)
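As a sanity check, the expectation formula can be evaluated directly. This Python sketch (illustrative, not ZooKeeper code) plugs in Example 1's numbers: each of the 3 old servers holds 10 clients, and each client moves to each of the 2 new servers with probability 0.2 (0.4 total).

```python
def expected_load(i, load, pr):
    """E[load(i,S')] = load(i,S) + sum_j load(j,S)*Pr(j->i)
                                 - load(i,S) * sum_j Pr(i->j).
    load: current #clients per server; pr[(j, i)]: move probability j -> i."""
    stay = load.get(i, 0)
    inflow = sum(load[j] * pr.get((j, i), 0.0) for j in load if j != i)
    outflow = stay * sum(p for (src, dst), p in pr.items()
                         if src == i and dst != i)
    return stay + inflow - outflow

load = {"A": 10, "B": 10, "C": 10, "D": 0, "E": 0}
pr = {(s, d): 0.2 for s in "ABC" for d in "DE"}  # 0.4 split over 2 new servers
print([round(expected_load(i, load, pr), 1) for i in "ABCDE"])
# -> [6.0, 6.0, 6.0, 6.0, 6.0]
```

Every server's expected load comes out to 6, matching the uniform target of Example 1.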
Implementation
• Implemented in ZooKeeper (Java & C), integration ongoing
   – 3 new ZooKeeper API calls: reconfig, getConfig, updateServerList
   – Feature requested since 2008, expected in the 3.5.0 release (July 2012)
• Dynamic changes to:
   –   Membership
   –   Quorum System
   –   Server roles
   –   Addresses & ports
• Reconfiguration modes:
   – Incremental (add servers E and D, remove server B)
   – Non-incremental (new config = {A, C, D, E})
   – Blind or conditioned (reconfig only if current config is #5)
• Subscriptions to config changes
   – Client can invoke client-side re-balancing upon change
                                      Summary
     • Design and implementation of reconfiguration for Apache ZooKeeper
        – Being contributed into the ZooKeeper codebase

     • Much simpler than the state of the art, using properties already provided by ZooKeeper

     • Many nice features:
        – Doesn't limit concurrency
        – Reconfigures immediately
        – Preserves primary order
        – Doesn't stop client ops
          (ZooKeeper is used by online systems, so any delay must be avoided)
        – Clients work with a single configuration at a time
        – No external services
        – Includes client-side rebalancing

Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Kürzlich hochgeladen (20)

CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 

Dynamic Reconfiguration of Apache ZooKeeper

Hazards of Manual Reconfiguration

• Goal: add servers E and D
• Change the configuration on every server: {A, B, C} → {A, B, C, D, E}
• Restart servers
• Hazard: a quorum of the new configuration (e.g., C, D, E) can form before syncing, so updates acknowledged under {A, B, C} are lost!
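The hazard above is pure quorum arithmetic, and can be sketched in a few lines (a toy model, not ZooKeeper code; server names are illustrative):

```python
# Why a manual membership jump can lose an acknowledged update.

def is_quorum(acks, config):
    """Majority quorum test: acks must cover more than half of config."""
    return len(set(acks) & set(config)) > len(config) // 2

old = {"A", "B", "C"}
new = {"A", "B", "C", "D", "E"}

# An update acked by {A, B} is committed under the old config...
assert is_quorum({"A", "B"}, old)

# ...but after everyone restarts with the new config, {C, D, E} is a
# valid quorum that never saw the update: it can elect a leader, and
# the acknowledged update is silently lost.
survivors = {"C", "D", "E"}
assert is_quorum(survivors, new)
assert not (survivors & {"A", "B"})  # no overlap with the ackers
```

The two quorums {A, B} and {C, D, E} do not intersect, which is exactly the safety violation the talk is about.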
Just use a coordination service!

• Zookeeper is the coordination service
  – Don't want to deploy another system to coordinate it!
• Who will reconfigure that system?
  – GFS has 3 levels of coordination services
• More system components -> more management overhead
• Use Zookeeper to reconfigure itself!
  – Other systems store configuration information in Zookeeper
  – Can we do the same?
  – Only if there are no failures
Recovery in Zookeeper

• setData(/x, 5) — (animation over servers A–E: the update survives failures once a quorum has persisted it)
This doesn't work for reconfigurations!

• setData(/zookeeper/config, {A, B, F}): remove C, D, E; add F
• A and the new server F may switch to {A, B, F} while C, D, E still run with {A, B, C, D, E}
• Must persist the decision to reconfigure in the old config before activating the new config!
• Once such a decision is reached, must not allow further ops to be committed in the old config
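The two rules on this slide can be sketched as a tiny state machine (illustrative only, not the real ZooKeeper implementation): the reconfig decision must itself reach a quorum of the old configuration, and after that decision the old configuration is frozen.

```python
# Toy model of the "persist, then freeze" rule for reconfiguration.

class Ensemble:
    def __init__(self, members):
        self.config = set(members)
        self.reconfig_decided = False

    def quorum(self, acks):
        return len(set(acks) & self.config) > len(self.config) // 2

    def commit(self, acks):
        # Ordinary update: allowed only while no reconfig is decided.
        if self.reconfig_decided:
            raise RuntimeError("old config is frozen after reconfig decision")
        return self.quorum(acks)

    def decide_reconfig(self, acks, new_members):
        # The decision itself must reach a quorum of the OLD config.
        if not self.quorum(acks):
            return None
        self.reconfig_decided = True   # old config stops committing
        return Ensemble(new_members)   # new config may now activate

old = Ensemble(["A", "B", "C"])
assert old.commit(["A", "B"])  # normal update commits in the old config
new = old.decide_reconfig(["A", "B"], ["A", "B", "C", "D", "E"])
# `old` is now frozen; further commits must go through `new`
```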
Our Solution

• Correct
• Fully automatic
• No external services or additional components
• Minimal changes to Zookeeper
• Usually unnoticeable to clients
  – Pause operations only in rare circumstances
  – Clients work with a single configuration
• Rebalances clients across servers in the new configuration
• Reconfigures immediately
• Speculative Reconfiguration
  – Reconfiguration (and commands that follow it) speculatively sent out by the primary, similarly to all other updates
Principles

● Commit reconfig in a quorum of the old ensemble
  – Submit the reconfig op just like any other update
● Make sure the new ensemble has the latest state before becoming active
  – Get a quorum of synced followers from the new config
  – Get acks from both old and new ensembles before committing updates proposed between the reconfig op and activation
  – Activate the new configuration when the reconfig commits
● Once the new ensemble is active, the old ensemble cannot commit or propose new updates
● Gossip activation through leader election and syncing
● Verify the configuration id of leader and follower
Reconfiguration scenario 1

• Goal: add servers E and D
• The reconfig doesn't commit until quorums of both ensembles ack
• E and D gossip the new configuration to C
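The dual-ack rule in this scenario can be checked with simple set arithmetic (a sketch, not ZooKeeper code): while a reconfiguration is in flight, an update commits only when acked by a quorum of the old ensemble AND a quorum of the new one.

```python
# Dual-quorum commit rule during a reconfiguration (toy model).

def is_quorum(acks, config):
    return len(set(acks) & set(config)) > len(config) // 2

def can_commit(acks, old_config, new_config):
    return is_quorum(acks, old_config) and is_quorum(acks, new_config)

old = {"A", "B", "C"}
new = {"A", "B", "C", "D", "E"}

# {A, B} is a majority of the old ensemble but not of the new one:
assert not can_commit({"A", "B"}, old, new)

# {A, B, D, E} satisfies both majorities, so the update may commit:
assert can_commit({"A", "B", "D", "E"}, old, new)
```

Requiring both quorums guarantees the update survives no matter which configuration ends up active.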
Example - reconfig using CLI

reconfig -add 1=host1.com:1234:1235:observer;1239
         -add 2=host2.com:1236:1237:follower;1231
         -remove 5
● Change follower 1 to an observer and change its ports
● Add follower 2 to the ensemble
● Remove follower 5 from the ensemble

reconfig -file myNewConfig.txt -v 234547
● Change the current config to the one in myNewConfig.txt
● But only if the current config version is 234547

getConfig -w -c
● Set a watch on /zookeeper/config
● -c means we only want the new connection string for clients
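The member strings above pack several fields into one token. A small parser makes the layout explicit (an illustrative sketch based on the slide's examples; the assumed layout is id=host:quorumPort:electionPort:role;clientPort, with the role optional):

```python
# Parse a reconfig member spec such as "1=host1.com:1234:1235:observer;1239".

def parse_member(spec):
    server_id, rest = spec.split("=", 1)
    server_part, client_port = rest.split(";", 1)
    # Role is optional; pad with None when only host and two ports appear.
    host, quorum_port, election_port, role = (server_part.split(":") + [None])[:4]
    return {
        "id": int(server_id),
        "host": host,
        "quorum_port": int(quorum_port),
        "election_port": int(election_port),
        "role": role,
        "client_port": int(client_port),
    }

m = parse_member("1=host1.com:1234:1235:observer;1239")
assert m["role"] == "observer" and m["client_port"] == 1239
```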
When it will not work

● A quorum of the new ensemble must be in sync
● Another reconfig is in progress
● The version condition check fails
How do you know you are done

● Write something somewhere
The "client side" of reconfiguration

• When the system changes, clients need to stay connected
  – The usual solution: directory service (e.g., DNS)
• Re-balancing load during reconfiguration is also important!
• Goal: uniform #clients per server with minimal client migration
  – Migration should be proportional to the change in membership
• (Diagram: 3 servers with 10 clients each)
Our approach - Probabilistic Load Balancing

• Example 1: grow from 3 servers (10 clients each) to 5 servers
  – Each client moves to a random new server with probability 1 – 3/5 = 0.4
  – Expected: 40% of the clients will move off of each server, leaving 6 per server
• Example 2: servers A, B, C are removed and F is added (each of the 5 old servers had 6 clients)
  – Connected clients don't move
  – Disconnected clients move to each of the old servers D, E with prob 4/18 and to the new server F with prob 10/18
  – Expected: 8 clients will move from A, B, C to D, E and 10 to F
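The arithmetic behind Example 1 can be verified directly (an illustrative check, not ZooKeeper code): when the ensemble grows from |S| to |S'| servers and the load was uniform, moving each client with probability 1 − |S|/|S'| keeps the expected load uniform.

```python
# Expected client movement when growing a uniformly loaded ensemble.

def move_probability(old_size, new_size):
    """Probability each client migrates to a random new server."""
    return 1 - old_size / new_size

p = move_probability(3, 5)
assert abs(p - 0.4) < 1e-9  # the 0.4 quoted on the slide

# 10 clients per old server: expect 4 to leave and 6 to stay; the
# 12 expected movers spread over the 2 new servers, 6 each, so the
# load is uniform again at 6 clients per server.
clients_per_old = 10
stay = clients_per_old * (1 - p)
movers = 3 * clients_per_old * p
assert abs(stay - 6.0) < 1e-9
assert abs(movers / 2 - 6.0) < 1e-9
```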
Probabilistic Load Balancing

When moving from config S to S':

  E(load(i, S')) = load(i, S) + Σ_{j ∈ S, j ≠ i} load(j, S) · Pr(j → i) − load(i, S) · Σ_{j ∈ S', j ≠ i} Pr(i → j)

• E(load(i, S')): expected #clients connected to i in S'
• load(i, S): #clients connected to i in S (10 in the last example)
• First sum: #clients moving to i from other servers in S
• Second term: #clients moving from i to other servers in S'

Solving for Pr we get case-specific probabilities.

Input: each client answers locally
• Question 1: Are there more servers now or less?
• Question 2: Is my server being removed?

Output:
1) Disconnect or stay connected to my server
2) If disconnecting: Pr(connect to one of the old servers) and Pr(connect to a newly added server)
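The formula can be checked numerically against the numbers of Example 2 (an illustrative sketch; the configuration S = {A, B, C, D, E} with 6 clients each and S' = {D, E, F} is inferred from the slides):

```python
# Expected-load formula, evaluated on Example 2's migration probabilities.

S = ["A", "B", "C", "D", "E"]
S_new = ["D", "E", "F"]
load = {s: 6 for s in S}
load["F"] = 0  # F is newly added and starts empty

def pr(j, i):
    """Pr(j -> i): probability a client of j migrates to i."""
    if j in ("A", "B", "C"):  # j is removed, so its clients must move
        return {"D": 4 / 18, "E": 4 / 18, "F": 10 / 18}.get(i, 0.0)
    return 0.0                # clients of surviving servers stay put

def expected_load(i):
    incoming = sum(load[j] * pr(j, i) for j in S if j != i)
    outgoing = load.get(i, 0) * sum(pr(i, j) for j in S_new if j != i)
    return load.get(i, 0) + incoming - outgoing

# The 30 clients end up uniformly spread: 10 per surviving server.
for srv in S_new:
    assert abs(expected_load(srv) - 10.0) < 1e-9
```

This matches the slide's expectation: 4 + 4 = 8 clients move to D and E, and 10 move to F, restoring a uniform load.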
Implementation

• Implemented in Zookeeper (Java & C), integration ongoing
  – 3 new Zookeeper API calls: reconfig, getConfig, updateServerList
  – Feature requested since 2008, expected in the 3.5.0 release (July 2012)
• Dynamic changes to:
  – Membership
  – Quorum System
  – Server roles
  – Addresses & ports
• Reconfiguration modes:
  – Incremental (add servers E and D, remove server B)
  – Non-incremental (new config = {A, C, D, E})
  – Blind or conditioned (reconfig only if current config is #5)
• Subscriptions to config changes
  – Client can invoke client-side re-balancing upon change
Summary

• Design and implementation of reconfiguration for Apache Zookeeper
  – Being contributed into the Zookeeper codebase
• Much simpler than the state of the art, using properties already provided by Zookeeper
• Many nice features:
  – Doesn't limit concurrency
  – Reconfigures immediately
  – Preserves primary order
  – Doesn't stop client ops (Zookeeper is used by online systems, so any delay must be avoided)
  – Clients work with a single configuration at a time
  – No external services
  – Includes client-side rebalancing