SlideShare ist ein Scribd-Unternehmen logo
1 von 88
Downloaden Sie, um offline zu lesen
From Object Replication to
  Database Replication

       Fernando Pedone
      University of Lugano
          Switzerland
Outline

     •Motivation
     •Replication model
     •From objects to databases
     •Deferred update replication
     •Final remarks



© F. Pedone                         2
Motivation

     •From object replication...
         ‣ Replication in the distributed systems
           community
         ‣ Processes, non-transactional objects
     •...to database replication
         ‣ Replication in the database community
         ‣ Transactions, databases


© F. Pedone                                         3
Motivation

     •Object replication is important
         ‣ Fault tolerance
                              File server
                               Replica #1



                   Client 1
                              File server
                              File server
                               Replica #2
                   Client 2


                              File server
                               Replica #3
© F. Pedone                                 4
Motivation

     • Database replication is important
         ‣ Fault tolerance
                                                Client 3
         ‣ Performance        Database server
                                 Replica #1
                                                Client 4

                   Client 1
                              Database server
                                 Replica #2
                   Client 2

                                                Client 5
                              Database server
                                 Replica #3
© F. Pedone
                                                Client 6   5
Motivation

     •Object vs. database replication (recap)
         ‣ Different goals
              -   Fault tolerance vs. Performance & Fault tolerance

         ‣ Different models
              -   Objects, non-transactional vs. transactions

         ‣ Different algorithms
              -   to a certain extent...



© F. Pedone                                                           6
Motivation

     •Group communication
         ‣ Messages addressed to a group of nodes
         ‣ Initially used for object replication
         ‣ Later used for database replication

  node 1
  node 2
  node 3
               point-to-point          group
© F. Pedone   communication         communication   7
Outline

     •Motivation
     •Replication model
     •From objects to databases
     •Deferred update replication
     •Final remarks



© F. Pedone                         8
Replication model

     •Object model
         ‣ Clients: c1, c2,...
         ‣ Servers: s1, ..., sn
         ‣ Operations: read and write
              -   read(x) returns the value of object x
              -   write(x,v) updates the value of x with v;
                  returns acknowledgement



© F. Pedone                                                   9
Replication model

     •Object consistency
         ‣ Consistency criteria
              -   Defines the behavior of object operations
              -   Simple for single-client-single-server case
              -   But more complex in the general case
                  • Multiple clients
                  • Multiple servers
                  • Failures

© F. Pedone                                                     10
Replication model

         ‣ Consistency criteria
              -   Simple case: single client, single server, no failures

                   write(x,1)   ack(x)        read(x)    return(1)

 client




server
                     x=0        x=1
© F. Pedone                                                                11
Replication model

         ‣ Consistency criteria
              -   Multiple clients, multiple servers

          write(x,1)           ack(x)

        c1
                                read(x)   return(0)     read(x)   return(1)

        c2

        s1
                  x=0    x=1

        s2
© F. Pedone       x=0                                 x=1                     12
Replication model

         ‣ Consistency criteria
              -
              Ideally, defines system behavior regardless of
              implementation details and the operation semantics
           - Is the ack(x)
          write(x,1) following execution intuitive to clients?

        c1
                       read(x)   return(0)   read(x)   return(1)

        c2

        s1

        s2
© F. Pedone                                                        13
Replication model

         ‣ Consistency criteria
              -   Two consistency criteria for objects
                  (among several others)
                  • Linearizability
                  • Sequential consistency




© F. Pedone                                              14
Replication model

     •Linearizability
         ‣ A concurrent execution is linearizable if
           there is a sequential way to reorder the client
           operations such that:
              (1) it respects the semantics of the objects, as
                determined in their sequential specs
              (2) it respects the order of non-overlapping operations
                among all clients




© F. Pedone                                                         15
Replication model

     •Linearizability
         write(x,1)   ack(x)             read(x)   return(1)

      c1
                           read(x)   return(0)

      c2


sequential
execution

                      Problem: Does not respect
© F. Pedone           the semantics of the object              16
Replication model

     •Linearizability
         write(x,1)   ack(x)             read(x)   return(1)

      c1
                           read(x)   return(0)

      c2


sequential
execution

                 Problem: Does not respect the order
© F. Pedone        or non-overlapping operations               17
Replication model

     •Linearizability
         write(x,1)             ack(x)              read(x)          return(1)

      c1
                      read(x)                   return(0)

      c2
                        overlapping
      read(x)              return(0)                            read(x)          return(1)
sequential
execution
                                   write(x,1)               ack(x)

                         Linearizable execution!
© F. Pedone                                                                                  18
Replication model

     •Sequential consistency
         ‣ A concurrent execution is sequentially
           consistent if there is a sequential way to
           reorder the client operations such that:
              (1) it respects the semantics of the objects, as
                determined in their sequential specs
              (2) it respects the order of operations at the client
                that issued the operations




© F. Pedone                                                           19
Replication model

     •Sequential consistency
         write(x,1)   ack(x)             read(x)   return(1)

      c1
                           read(x)   return(0)

      c2


sequential
execution


                      Sequentially consistent!
© F. Pedone                                                    20
Replication model

     •Sequential consistency
         write(x,1)   ack(x)             read(x)   return(0)

      c1
                           read(x)   return(0)

      c2


sequential
execution

                 Problem: Does not respect the order
© F. Pedone        or non-overlapping operations               21
Replication model

     •Sequential consistency
         write(x,1)   ack(x)             read(x)   return(0)

      c1
                           read(x)   return(0)

      c2


sequential
execution

                 Problem: Does not respect the order
© F. Pedone             or operations at c1                    22
Replication model

     •Database model
         ‣ Clients: c1, c2,...
         ‣ Servers: s1, ..., sn
         ‣ Operations: read, write, commit and abort
         ‣ Transaction: group of read and/or write
           operations followed by commit or abort
              -   ACID properties (see next)



© F. Pedone                                            23
Replication model

     •Transaction properties
         ‣ Atomicity: Either all transaction operations
           are executed or none are executed
         ‣ Consistency: A transaction is a correct
           transformation of the state
         ‣ Isolation: Serializability (next slide)
         ‣ Durability: Once a transaction commits, its
           changes to the state survive failures

© F. Pedone                                               24
Replication model

     •Serializability (1-copy SR)
         ‣ A concurrent execution is serializable if it is
           equivalent to a serial execution with the same
           transactions
         ‣ Two executions are (view) equivalent if:
              -   Their transactions preserve the same “reads-from”
                  relationships (t reads x from t’ in both executions)
              -   They have the same final writes



© F. Pedone                                                              25
Replication model

     •Serializability
         write(x,1)   ack(x)             read(x)   return(1)

      c1
                           read(x)   return(0)

      c2


sequential
execution


                               Serializable!
© F. Pedone                                                    26
Replication model

     •Serializability
         write(x,1)   ack(x) read(x)   return(1)

      c1
                                             read(x)   return(0)

      c2


sequential
execution


                              Serializable!
© F. Pedone                                                        27
Replication model

     •Serializability
         write(x,1)   ack(x) read(x)   return(1) read(x)   return(0)

      c1

      c2


sequential
execution


                             Serializable!!!
© F. Pedone                                                            28
Replication model

     •Serializability
         write(x,1)        ack(x)          read(y)    return(0)

      c1
              write(y,1)       ack(y)            read(x)    return(0)

      c2


sequential
execution


                                Non serializable!
© F. Pedone                                                             29
Replication model

     •Object and database consistency criteria
         ‣ Assume single-operation transactions

                    Objects          Database

                                      Strong
                 Linearizability
                                   Serializability
                                                        equivalent
                   Sequential         Session        if transactions
                                                      have only one
                  Consistency      Serializability       operation


                       ?           Serializability      equivalent
                                                     if transactions
                                                      have only one
                                                         operation
© F. Pedone                                                            30
Replication model

     •Serializable, not session-serializable

              write(x,1)   ack(x)   read(x)   return(0)

      c1
                Transaction 1        Transaction 2
      c2


sequential
execution

© F. Pedone                                               31
Replication model

     •Session-serializable, not strongly
        serializable
              write(x,1)   ack(x)                    read(x)   return(1)

      c1
                Transaction 1       read(x)   return(0) Transaction   2
      c2
                                     Transaction 3


sequential
execution

© F. Pedone                                                                32
Replication model

     •Strongly serializable

              write(x,1)   ack(x)                    read(x)   return(1)

      c1
                Transaction 1       read(x)   return(1) Transaction   2
      c2
                                     Transaction 3


sequential
execution

© F. Pedone                                                                33
Outline

     •Motivation
     •Replication model
     •From objects to databases
     •Deferred update replication
     •Final remarks



© F. Pedone                         34
From objects to databases

     •Fundamentals
         ‣ Model considerations
         ‣ Generic functional model
     •Object replication
         ‣ Passive replication (primary-backup)
         ‣ Active replication (state-machine replication)
         ‣ Multi-primary passive replication


© F. Pedone                                                 35
From objects to databases

     •Model considerations
         ‣ Client and server processes
         ‣ Communication by message passing
         ‣ Crash failures only
              -   No Byzantine failures




© F. Pedone                                   36
Replication model

     •Generic functional model
              Phase 1:   Phase 2:       Phase 3:    Phase 4:       Phase 5:
              Client     Server         Execution   Agreement      Client
              Request    Coordination               Coordination   Response


    Client

Server 1

Server 2

Server 3
© F. Pedone                                                                   37
From objects to databases

     •Passive replication
         ‣ aka Primary-backup replication
         ‣ Fail-stop failure model
              -   a process follows its spec until it crashes
              -   a crash is detected by every correct process
              -   no process is suspected of having crashed until after
                  it actually crashes

         ‣ Algorithm ensures linearizability

© F. Pedone                                                               38
From objects to databases

         ‣ Passive replication algorithm

                   Phase 1:   Phase 2:       Phase 3:    Phase 4:       Phase 5:
                   Client     ServerOnly     Execution   Agreement      Client
                   Request    Coordination               Coordination   Response
                                 primary
                                 executes
    Client                       requests
              Primary
Server 1
                                                                         Backups
              Backup 1                                                  apply new
Server 2                                                                  state
              Backup 2
Server 3
© F. Pedone                                                                         39
From objects to databases

     •Passive replication (Linearizability)
         ‣ Perfect failure detection ensures that at most
           one primary exists at all times
         ‣ A failed primary is detected by the backups
         ‣ Eventually one backup replaces failed primary




© F. Pedone                                                 40
From objects to databases

     •Active replication
         ‣ aka State-machine replication
         ‣ Crash failure model
              -   a process follows its spec until it crashes
              -   a crash is detected by every correct process
              -   correct processes may be erroneously suspected

         ‣ Algorithm ensures linearizability


© F. Pedone                                                        41
From objects to databases

         ‣ Atomic broadcast
              -   Group communication abstraction
              -   Primitives: broadcast(m) and deliver(m)
              -   Properties
                  • Agreement: Either all servers deliver m or no
                    server delivers m
                  • Total order: Any two servers deliver messages m
                    and m’ in the same order



© F. Pedone                                                           42
From objects to databases

         ‣ Atomic broadcast      Application

                m             broadcast(m) deliver(m)
                                  Group
 Client 1
                               Communication
                       m’       send(m) receive(m)
Client 2
                                    Network
Server 1
                                Architectural
Server 2                            View

Server 3

© F. Pedone                                          43
From objects to databases

         ‣ Active replication algorithm

                Phase 1:   Phase 2:       Phase 3:     Phase 4:       Phase 5:
                Client     Server         Execution    Agreement      Client
                Request    Coordination                Coordination   Response
                                                  Deterministic
                                                   execution
    Client

Server 1                                              All servers
                                                      execute all
Server 2                                               requests


Server 3
© F. Pedone                                                                      44
From objects to databases

         ‣ Why it works
           write(x,1)    ack(x)               read(x)      return(1)

       c1
                               read(x)   return(0)         return(1)
       broadcast
       c 2
       operation
      “write(x,1)”

        s1
              x=0       x=1

        s2
              x=0        x=1

        s3
© F. Pedone   x=0                                    x=1               45
From objects to databases

     •Active and passive replication
         ‣ Target fault tolerance only
         ‣ Not good for performance
              -   Active replication:
                  All servers execute all requests
              -   Passive replication:
                  Only one server executes requests




© F. Pedone                                           46
From objects to databases

     •Multi-primary passive replication
         ‣ Targets both fault tolerance and performance
         ‣ Does not require perfect failure detection
         ‣ Same resilience as active replication, but...
         ‣ Better performance
              -   Distinguishes read from write operations
              -   Only one server executes each request


© F. Pedone                                                  47
From objects to databases

     •Multi-primary passive replication
         ‣ Read requests
              -   Broadcast to all servers
              -   Executed by only one server
              -   Response is sent to the client




© F. Pedone                                        48
From objects to databases

     •Multi-primary passive replication
         ‣ Write requests
              -   Executed by one server
              -   State changes (“diff”) are broadcast to all servers
              -   If the changes are “compatible” with the previous
                  installed states, then a new state is installed, and
                  the result of the request is returned to the client
              -   Otherwise the request is re-executed



© F. Pedone                                                              49
From objects to databases

     •Multi-primary passive replication
                write(x,1)      ack(x)

      c1
                                         read(x)   return(1)

      c2         broadcast
                state change
                    “x=1”
       s1
              x=0               x=1

       s2
              x=0                x=1

        s3
© F. Pedone                                                    50
              x=0              x=1
From objects to databases

     •Multi-primary passive replication
               write(x,1)                                      ack(x)

      c1
                               write(x,2)                  ack(x)
                                             state
      c2                                    change
                                             “x=2”

                       state
       s1             change
              x=0      “x=1”                           x=2 x=1

       s2
              x=0                                      x=2      x=1

        s3
© F. Pedone                                                             51
              x=0                                    x=2        x=1
From objects to databases

     •Multi-primary passive replication
                inc(x,1)                                     ack(x)

      c1
                               inc(x,2)                  ack(x)
                                           state
      c2                                  change
                                           “x=2”

                       state
       s1             change
              x=0      “x=1”                         x=2 x=1

       s2
              x=0                                    x=2      x=1

        s3
© F. Pedone                                                           52
              x=0                                  x=2        x=1
From objects to databases

     •Multi-primary passive replication
         ‣ Write requests
              -   Executed by one server
              -   State changes (“diff”) are broadcast to all servers
              -   If the changes are “compatible” with the previous
                  installed states, then a new state is installed, and
                  the result of the request is returned to the client
              -   Otherwise the request is re-executed


© F. Pedone                                                              53
From objects to databases

     •Multi-primary passive replication
              -   Compatible states                           Si
                                                                   α     Si’
                  • Let Si be the state at server N before it executes
                    operation α, and Si’ the state after α
                  • Let h=S0°...°Sj be the sequence of installed states at
                    a server N’ when it tries to install the new state Si’
                  • Si’ is compatible with h if
                    - (a) Sj = Si or
                    - (b) α does not read any variables modified in
                      Si+1, Si+2, ..., Sj
© F. Pedone                                                                54
From objects to databases
              -    Compatible states

              x=1    write(x,8)   x=8
 N            y=2
              z=3
                                  y=2
                                  z=3

              Si                  Si’


                                        x=1   x=8
                          (a)     N’    y=2   y=2
                                        z=3   z=3

                                        Si    Si+1

© F. Pedone                                          55
From objects to databases
              -    Compatible states

              x=1    write(x,8)     x=8
 N            y=2
              z=3
                                    y=2
                                    z=3

              Si                    Si’


                         x=1      write(y,3)   x=1    write(z,5)   x=1   x=8
     (b)       N’        y=2                   y=3                 y=3   y=3
                         z=3                   z=3                 z=5   z=5

                          Si                   Si+1                Sj    Sj+1

© F. Pedone                                                                     56
From objects to databases

     •Multi-primary passive replication
                inc(x,1)                                           ack(x)

      c1
                               inc(x,2)                  ack(x)
                                           state
      c2                                  change
                                           “x=2”

                       state
       s1             change
              x=0      “x=1”                         x=2          x=3

       s2
              x=0                                    x=2          x=3

        s3
© F. Pedone                                                                 57
              x=0                                  x=2            x=3
From objects to databases

     •Multi-primary passive replication
         ‣ Optimistic approach
         ‣ May or not rely on deterministic execution
              -   It depends on how re-executions are dealt with

         ‣ Installing a new state is cheaper than
           executing the request (that creates the state)




© F. Pedone                                                        58
From objects to databases

     •Multi-primary passive replication
         ‣ Provides both fault tolerance and
           performance
         ‣ Suitable for database replication!




© F. Pedone                                     59
Outline

     •Motivation
     •Replication model
     •From objects to databases
     •Deferred update replication
     •Final remarks



© F. Pedone                         60
Deferred update replication

     •General idea (I)
              Phase 1:   Phase 2:       Phase 3:    Phase 4:       Phase 5:
              Client     Server         Execution   Agreement      Client
              Request    Coordination               Coordination   Response


    Client

Server 1

Server 2

Server 3
© F. Pedone                                                                   61
Deferred update replication

     •General idea (II)

              Phase 1:   Phase 3:    Phase 5:         Phase 1:   Phase 3:    Phase 4:       Phase 5:
              Client     Execution   Client           Client     Execution   Agreement      Client
              Request                Response         Request                Coordination   Response
                                                                     Certification
                                                ...                   & Commit/
    Client                                                              Abort

Server 1

Server 2
                                                       Communication
Server 3                                                  Protocol

© F. Pedone                                                                                            62
Deferred update replication

     •The protocol in detail
         ‣ Transactions (read-only and update) are
           submitted and executed by one database
           server                              allowed by
                                                          serializability
         ‣ At commit time:
              -   Read-only transactions are committed immediately
              -   Update transactions are propagated to the other
                  replicas for certification, and possibly commit



© F. Pedone                                                                 63
Deferred update replication

     •Read-only transactions
              read(X) return(1)   read(Y) return(2)   commit   ok
Client 1

Server 1

Server 2

Server 3

 Client 2
              read(Y) return(2)   read(X) return(1)   commit   ok

© F. Pedone                                                         64
Deferred update replication

     •Update transactions
                                                                    ok /
                                                                   abort
              read(X) return(1) write(Y,3)   ack(Y) commit
Client 1

Server 1

Server 2

Server 3
                                                      If commit,
 Client 2                                           ...
                                                      writes are
              read(Y) return(2) write(X,3) ack(X)
                                                        applied
© F. Pedone                                                                65
Deferred update replication

     •Termination based on Atomic Broadcast
         ‣ Similar to multi-primary passive replication
         ‣ Deterministic certification test
         ‣ Lower abort rate than atomic commit based
           technique




© F. Pedone                                               66
Deferred update replication

         ‣ Transaction states
              -   Executing(t,s): transitory state
              -   Committing(t,s): transitory state
              -   Committed(t,s): final state
              -   Aborted(t,s): final state

                                                     Committed(t,s)

                  Executing(t,s)   Committing(t,s)

                                                      Aborted(t,s)

© F. Pedone                                                           77
Deferred update replication

         ‣ Transaction states
                                                    Certify
                                                  transaction
              write(y,3) ack(y) commit(t)                                ok
Client 1                               Atomic
                                      Broadcast
                                                          Commit(t,s1)
Server 1
                    Executing(t,s1)         Committing(t,s1)


Server 2

Server 3
                                            Transaction termination
 Client 2

© F. Pedone                                                                   68
Deferred update replication

         ‣ Transaction states

                commit(t1)      ok
Client 1

Server 1

Server 2

Server 3

 Client 2                            ok
                  commit(t2)

© F. Pedone                               69
therefore, all servers reach the same outcome, committed or aborted, without

  Deferred update replication
     further communication [1, 8].
         When a server s delivers t’s readset, writeset, and updates, Committing(t, s)
     holds. Server s certifies transactions in the order they are delivered. To certify
     t, server s checks whether each transaction it knows about in the committed
     state either precedes t or does not have a writeset that intersects t’s readset.
    •If Atomic conditions holds, s locally commits t. We define the condition
        one of these broadcast-based certification
     for the transition from the committing state to the committed state more
     formally as follows:


        ∀t ∀s : Committing(t, s) ; Committed(t, s) ≡
                                                                     
                                          ∀t s.t. Committed(t , s) :
                                                                      (1.3)
                                          t → t ∨ W S(t ) ∩ RS(t) = ∅

          If transaction t passes to the committed state at s, its updates should be
                              t’→t, t’ precedes t:
       applied to the database following the order imposed by the t’ does not
                                                                     atomic broadcast
                              changes made by t’                  update any item
       primitive, that is, if t is delivered before t and both pass the reads
                                                                    that t
                                                                           certification
                               could be seen by t
       test, then t’s updates should be applied to the database before those of t .
                                (i.e,. read by t)
          Intuitively, condition 1.3 checks whether transactions can be serialized
       following their delivery order. Let t and t be two transactions such that t
© F. Pedone                                                                           70
       is delivered and committed before t. There are two cases in which t will pass
       certification: (i) t committed before t started its execution, in which case any
Deferred update replication

     •Atomic broadcast-based certification
         ‣ Example
              -   Transaction t: read(x); write(y,-)
              -   Transaction t’: read(y); write(x,-)
              -   t and t’ are concurrent (i.e., neither t→t’ nor t’→t)
              -   All servers validate t and t’ in the same order
              -   Assume servers validate t and then t’
              -   What transaction(s) commit/abort?

© F. Pedone                                                               71
Deferred update replication

     •Reordering-based certification
         ‣ Let t and t’ be two concurrent transactions
           - t executes read(x); write(y,-) and
           - t’ executes read(y); write(z,-)
         ‣ (a) t is delivered and certified before t’
           - t passes certification, but
           - t’ does not: RS(t’) ∩ WS(t) = { y } ≠ ∅
         ‣ (b) t’ is delivered and certified before t
           - t’ passes certification and
           - t passes certification(!): RS(t) ∩ WS(t’) = ∅
© F. Pedone                                                 72
Deferred update replication

         ‣ Serialization order of transactions
              -   Without reordering         -   With reordering
                  • Order given by abcast        • Either t1 followed by t2
                  • Ex.: t1 followed by t2         or t2 followed by t1

                                    t1                  t2
Server 1

Server 2

 Server 3
© F. Pedone                                                                   73
Deferred update replication

     •Reordering-based certification
                                          serialization
                                              order

                Reorder List
               (committed transactions)
                                                          new delivered
                                                           transaction


          t1    t2       t3       t4                           t

    Condition for committing t (simplified):
    RS(t) ∩ (WS(t1) ∪ WS(t2) ∪ WS(t3) ∪ WS(t4)) = ∅

© F. Pedone                                                               74
Deferred update replication

     •Reordering-based certification
                                          serialization
                                              order

                Reorder List
               (committed transactions)
                                                          new delivered
                                                           transaction


          t1    t2                t3        t4                 t

    Condition for committing t (simplified):
    RS(t) ∩ (WS(t1) ∪ WS(t2)) = ∅ and
    WS(t) ∩ (RS(t3) ∪ RS(t4)) = ∅
© F. Pedone                                                               75
mitted transactions that have not been seen by transactions in execution since
       their relative order may change. The number of transactions in the Reorder
       List is limited by a predetermined threshold, the R                    . Whenever
  Deferred update replication
       the Reorder Factor is reached, one transaction in the Reorder List is removed
       and its updates are applied to the database.
          Let RLs = t0 ; t1 ; . . . ; tcount−1 be the Reorder List at server s containing
       count transactions and pos(x) be a function that returns the position of
       transaction x in RLs . We represent the condition for the state transition of
    •Reordering-based certification
       transaction t from the committing state to the committed state more formally
       as follows:


         ∀t∀s : Committing(t, s) ; Committed(t, s) ≡
                                                                       
                          ∃i, 0 ≤ i < count, s.t. ∀t ∈ RLs :
                                                                       
                                                                       
                         pos(t ) < i ⇒ t → t ∨ W S(t ) ∩ RS(t) = ∅ ∧ 
                                                                       
                                                                       
                                                ∧                     (1.4)
                                                                        
                                         (t → t ∨ W S(t ) ∩ RS(t) = ∅) 
                                                                       
                         pos(t ) ≥ i ⇒                ∧              
                                               W S(t) ∩ RS(t ) = ∅

             If t passes the certification test, it is included in RLs at position i, which
         means that transactions in positions in [i .. count−1] are rightshifted. If more
         than one position satisfies the condition in equation 1.4, then to ensure that
© F. Pedone servers apply transactions to the database in the same order, a deter- 76
         all
         ministic procedure has to be used to choose the insertion position (e.g., the
Deferred update replication

     •Generalized reordering
                   Reorder List               Reorder List

              t1   t2   t3   t4    t   t1          t t2   t3   t4

                   Reorder List               Reorder List

              t1   t2   t3    t t4          t t1    t2    t3 t4

                   Reorder List               Reorder List

              t1   t2    t t3     t4   t4     Abort t?
                                              t t2 t1 t3

© F. Pedone                                                         77
Deferred update replication

     •Termination based on Generic Broadcast
         ‣ Generic broadcast
              -   Conflict relation ∼ between messages
              -   Properties
                  • Agreement: Either all servers deliver m or no
                    server delivers m
                  • Total order: If messages m and m’ conflict, then
                    any two servers deliver them in the same order


© F. Pedone                                                           78
Deferred update replication

         ‣ Generic broadcast
              -   Motivation
                  • Ordering messages is more expensive than not
                    ordering messages
                  • Messages should only be ordered when needed
                    (as defined by the application)
              -   Atomic broadcast is a special case of generic
                  broadcast where all messages must be ordered




© F. Pedone                                                        79
Deferred update replication

         ‣ Conflict relation between transactions
              -   Assume transactions t1 and t2 don’t conflict


                            t1                      t2
 Server 1
                                  t1           t2
 Server 2
                                              t1
 Server 3
                             t2

© F. Pedone                                                     80
We show now that the ordering imposed by atomic broadcast is not always needed.
         Deferred update replication
Consider the termination of two transactions t and t with non-intersecting readsets
and writesets. In such a case, both t and t will pass the certification test and be
committed, regardless of the order in which they are delivered. This shows that
atomic broadcast is sometimes too strong, and can be favorably replaced by generic
broadcast (see Chapter 3). Generic broadcast is similar to atomic broadcast, with the
              ‣ Conflict relation between transactions
exception that applications can define order constraints on the delivery of messages:
                 - m:t means message m where transaction t
two messages are ordered only if they conflict,relays the conflict relation is defined
by the application semantics.
                 - if t conflicts with t’ (i.e., m:t ∼ m’:t’), then they are
   Based on the example above, one could definethe same order
                   delivered and certified in the following conflict relation ∼
among messages, where m : t means that message m relays transaction t:
                                                             
                                       RS(t) ∩W S(t ) = 0  /
                                                ∨            
                                                             
                                      W S(t) ∩ RS(t ) = 0 
                       m :t ∼ m :t ≡                      /                 (11.5)
                                                ∨            
                                       W S(t) ∩W S(t ) = 0  /

Notice that the conflict relation ∼ should account for write-write conflicts to make
sure thatF.transactions that update common data items are ordered, preventing the
       © Pedone                                                                     81
case in which two servers end up in different states after applying such transactions
Deferred update replication

         ‣ Read-only transactions
              -   Local execution without certification is not permitted
              -   Different states of the database could be observed

                                                            x=1
                         write(x,1)   t1                 t3 y=0         t2
                   x=0                                                       x=1
 Server 1          y=0                                                       y=2
                                               t1              t2
                   x=0                                                       x=1
 Server 2          y=0                                                       y=2
                   x=0
                         write(y,2)                                t1        x=1
 Server 3          y=0                                                       y=2
                                      t2                t4 x=0
                                                             y=2
                                           read(x);read(y)
© F. Pedone                                                                        82
Deferred update replication

         ‣ Read-only transactions
              -   Optimistic solution
                  • Broadcast and certify read-only transactions
              -   Pessimistic solution
                  • Pre-declare items to be read (readset)
                  • Broadcast transaction before execution
                  • Executed by one server only
                  • Never aborted


© F. Pedone                                                        83
Outline

     •Motivation
     •Replication model
     •From objects to databases
     •Deferred update replication
     •Final remarks



© F. Pedone                         84
Final remarks

     •Object and database replication
         ‣ Different goals
              -   Fault tolerance   fault tolerance & performance

         ‣ Different consistency models
              -   Linearizability & sequential consistency
                  serializability
         ‣ Different algorithms
              -   Primary-backup & active replication
                  deferred update replication

© F. Pedone                                                         85
Final remarks

     •Consistency models
         ‣ Relationship between L/SC and SR
     •Replication algorithms
         ‣ Unifying framework for object and database
           replication
         ‣ Multi-primary passive replication




© F. Pedone                                             86
Final remarks

     •Replication and group communication
         ‣ A happy union
         ‣ Group communication -- atomic broadcast --
           leads to modular and efficient protocols
           (e.g., fewer aborts than atomic commit)
         ‣ Replication has motivated more powerful and
           efficient group communication protocols
           (e.g., generic and optimistic primitives)


© F. Pedone                                              87
References

     •Chapter 1:
        Consistency Models for
        Replicated Data
        A. Fekete and K. Ramamritham
     •Chapter 2:
        Replication Techniques for
        Availability
        R. van Renesse and R. Guerraoui
     •Chapter 11:
        From Object Replication to
        Database Replication
        F. Pedone and A. Schiper
© F. Pedone                               88

Weitere ähnliche Inhalte

Ähnlich wie 20100522 from object_to_database_replication_pedone_lecture01-02

Thoughts on consistency models
Thoughts on consistency modelsThoughts on consistency models
Thoughts on consistency modelsrogerbodamer
 
Solve it Differently with Reactive Programming
Solve it Differently with Reactive ProgrammingSolve it Differently with Reactive Programming
Solve it Differently with Reactive ProgrammingSupun Dissanayake
 
Beyond Disaster Recovery: Restoring Production Workloads with PlateSpin Forge
Beyond Disaster Recovery: Restoring Production Workloads with PlateSpin ForgeBeyond Disaster Recovery: Restoring Production Workloads with PlateSpin Forge
Beyond Disaster Recovery: Restoring Production Workloads with PlateSpin ForgeNovell
 
Ch-7-Part-2-Distributed-System.pptx
Ch-7-Part-2-Distributed-System.pptxCh-7-Part-2-Distributed-System.pptx
Ch-7-Part-2-Distributed-System.pptxKabindra Koirala
 
SLES 11 SP2 PerformanceEvaluation for Linux on System z
SLES 11 SP2 PerformanceEvaluation for Linux on System zSLES 11 SP2 PerformanceEvaluation for Linux on System z
SLES 11 SP2 PerformanceEvaluation for Linux on System zIBM India Smarter Computing
 
Use Case: Apollo Group at Oracle Open World
Use Case: Apollo Group at Oracle Open WorldUse Case: Apollo Group at Oracle Open World
Use Case: Apollo Group at Oracle Open WorldMongoDB
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking VN
 
Cluster Computing with Dryad
Cluster Computing with DryadCluster Computing with Dryad
Cluster Computing with Dryadbutest
 
Refactoring Ruby on Rails Applications
Refactoring Ruby on Rails ApplicationsRefactoring Ruby on Rails Applications
Refactoring Ruby on Rails ApplicationsJonathan Weiss
 
CG simple openGL point & line-course 2
CG simple openGL point & line-course 2CG simple openGL point & line-course 2
CG simple openGL point & line-course 2fungfung Chen
 
Session 1 introduction concurrent programming
Session 1 introduction  concurrent programmingSession 1 introduction  concurrent programming
Session 1 introduction concurrent programmingEric Verhulst
 
Detecting Occurrences of Refactoring with Heuristic Search
Detecting Occurrences of Refactoring with Heuristic SearchDetecting Occurrences of Refactoring with Heuristic Search
Detecting Occurrences of Refactoring with Heuristic SearchShinpei Hayashi
 
WWX14 speech : Justin Donaldson "Promhx : Cross-platform Promises and Reactiv...
WWX14 speech : Justin Donaldson "Promhx : Cross-platform Promises and Reactiv...WWX14 speech : Justin Donaldson "Promhx : Cross-platform Promises and Reactiv...
WWX14 speech : Justin Donaldson "Promhx : Cross-platform Promises and Reactiv...antopensource
 
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...SQLExpert.pl
 
Basics of Node.js
Basics of Node.jsBasics of Node.js
Basics of Node.jsAlper Unal
 
Cacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svccCacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svccsrisatish ambati
 
Large-scale projects development (scaling LAMP)
Large-scale projects development (scaling LAMP)Large-scale projects development (scaling LAMP)
Large-scale projects development (scaling LAMP)Alexey Rybak
 
Pycvf
PycvfPycvf
Pycvftranx
 

Ähnlich wie 20100522 from object_to_database_replication_pedone_lecture01-02 (20)

XS Oracle 2009 Just Run It
XS Oracle 2009 Just Run ItXS Oracle 2009 Just Run It
XS Oracle 2009 Just Run It
 
Thoughts on consistency models
Thoughts on consistency modelsThoughts on consistency models
Thoughts on consistency models
 
Solve it Differently with Reactive Programming
Solve it Differently with Reactive ProgrammingSolve it Differently with Reactive Programming
Solve it Differently with Reactive Programming
 
Beyond Disaster Recovery: Restoring Production Workloads with PlateSpin Forge
Beyond Disaster Recovery: Restoring Production Workloads with PlateSpin ForgeBeyond Disaster Recovery: Restoring Production Workloads with PlateSpin Forge
Beyond Disaster Recovery: Restoring Production Workloads with PlateSpin Forge
 
Ch-7-Part-2-Distributed-System.pptx
Ch-7-Part-2-Distributed-System.pptxCh-7-Part-2-Distributed-System.pptx
Ch-7-Part-2-Distributed-System.pptx
 
SLES 11 SP2 PerformanceEvaluation for Linux on System z
SLES 11 SP2 PerformanceEvaluation for Linux on System zSLES 11 SP2 PerformanceEvaluation for Linux on System z
SLES 11 SP2 PerformanceEvaluation for Linux on System z
 
Use Case: Apollo Group at Oracle Open World
Use Case: Apollo Group at Oracle Open WorldUse Case: Apollo Group at Oracle Open World
Use Case: Apollo Group at Oracle Open World
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
 
NoSQL and ACID
NoSQL and ACIDNoSQL and ACID
NoSQL and ACID
 
Cluster Computing with Dryad
Cluster Computing with DryadCluster Computing with Dryad
Cluster Computing with Dryad
 
Refactoring Ruby on Rails Applications
Refactoring Ruby on Rails ApplicationsRefactoring Ruby on Rails Applications
Refactoring Ruby on Rails Applications
 
CG simple openGL point & line-course 2
CG simple openGL point & line-course 2CG simple openGL point & line-course 2
CG simple openGL point & line-course 2
 
Session 1 introduction concurrent programming
Session 1 introduction  concurrent programmingSession 1 introduction  concurrent programming
Session 1 introduction concurrent programming
 
Detecting Occurrences of Refactoring with Heuristic Search
Detecting Occurrences of Refactoring with Heuristic SearchDetecting Occurrences of Refactoring with Heuristic Search
Detecting Occurrences of Refactoring with Heuristic Search
 
WWX14 speech : Justin Donaldson "Promhx : Cross-platform Promises and Reactiv...
WWX14 speech : Justin Donaldson "Promhx : Cross-platform Promises and Reactiv...WWX14 speech : Justin Donaldson "Promhx : Cross-platform Promises and Reactiv...
WWX14 speech : Justin Donaldson "Promhx : Cross-platform Promises and Reactiv...
 
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
 
Basics of Node.js
Basics of Node.jsBasics of Node.js
Basics of Node.js
 
Cacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svccCacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svcc
 
Large-scale projects development (scaling LAMP)
Large-scale projects development (scaling LAMP)Large-scale projects development (scaling LAMP)
Large-scale projects development (scaling LAMP)
 
Pycvf
PycvfPycvf
Pycvf
 

Mehr von Computer Science Club

20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugs20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugsComputer Science Club
 
20140531 serebryany lecture02_find_scary_cpp_bugs
20140531 serebryany lecture02_find_scary_cpp_bugs20140531 serebryany lecture02_find_scary_cpp_bugs
20140531 serebryany lecture02_find_scary_cpp_bugsComputer Science Club
 
20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugs20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugsComputer Science Club
 
20140511 parallel programming_kalishenko_lecture12
20140511 parallel programming_kalishenko_lecture1220140511 parallel programming_kalishenko_lecture12
20140511 parallel programming_kalishenko_lecture12Computer Science Club
 
20140427 parallel programming_zlobin_lecture11
20140427 parallel programming_zlobin_lecture1120140427 parallel programming_zlobin_lecture11
20140427 parallel programming_zlobin_lecture11Computer Science Club
 
20140420 parallel programming_kalishenko_lecture10
20140420 parallel programming_kalishenko_lecture1020140420 parallel programming_kalishenko_lecture10
20140420 parallel programming_kalishenko_lecture10Computer Science Club
 
20140413 parallel programming_kalishenko_lecture09
20140413 parallel programming_kalishenko_lecture0920140413 parallel programming_kalishenko_lecture09
20140413 parallel programming_kalishenko_lecture09Computer Science Club
 
20140329 graph drawing_dainiak_lecture02
20140329 graph drawing_dainiak_lecture0220140329 graph drawing_dainiak_lecture02
20140329 graph drawing_dainiak_lecture02Computer Science Club
 
20140329 graph drawing_dainiak_lecture01
20140329 graph drawing_dainiak_lecture0120140329 graph drawing_dainiak_lecture01
20140329 graph drawing_dainiak_lecture01Computer Science Club
 
20140310 parallel programming_kalishenko_lecture03-04
20140310 parallel programming_kalishenko_lecture03-0420140310 parallel programming_kalishenko_lecture03-04
20140310 parallel programming_kalishenko_lecture03-04Computer Science Club
 
20140216 parallel programming_kalishenko_lecture01
20140216 parallel programming_kalishenko_lecture0120140216 parallel programming_kalishenko_lecture01
20140216 parallel programming_kalishenko_lecture01Computer Science Club
 

Mehr von Computer Science Club (20)

20141223 kuznetsov distributed
20141223 kuznetsov distributed20141223 kuznetsov distributed
20141223 kuznetsov distributed
 
Computer Vision
Computer VisionComputer Vision
Computer Vision
 
20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugs20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugs
 
20140531 serebryany lecture02_find_scary_cpp_bugs
20140531 serebryany lecture02_find_scary_cpp_bugs20140531 serebryany lecture02_find_scary_cpp_bugs
20140531 serebryany lecture02_find_scary_cpp_bugs
 
20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugs20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugs
 
20140511 parallel programming_kalishenko_lecture12
20140511 parallel programming_kalishenko_lecture1220140511 parallel programming_kalishenko_lecture12
20140511 parallel programming_kalishenko_lecture12
 
20140427 parallel programming_zlobin_lecture11
20140427 parallel programming_zlobin_lecture1120140427 parallel programming_zlobin_lecture11
20140427 parallel programming_zlobin_lecture11
 
20140420 parallel programming_kalishenko_lecture10
20140420 parallel programming_kalishenko_lecture1020140420 parallel programming_kalishenko_lecture10
20140420 parallel programming_kalishenko_lecture10
 
20140413 parallel programming_kalishenko_lecture09
20140413 parallel programming_kalishenko_lecture0920140413 parallel programming_kalishenko_lecture09
20140413 parallel programming_kalishenko_lecture09
 
20140329 graph drawing_dainiak_lecture02
20140329 graph drawing_dainiak_lecture0220140329 graph drawing_dainiak_lecture02
20140329 graph drawing_dainiak_lecture02
 
20140329 graph drawing_dainiak_lecture01
20140329 graph drawing_dainiak_lecture0120140329 graph drawing_dainiak_lecture01
20140329 graph drawing_dainiak_lecture01
 
20140310 parallel programming_kalishenko_lecture03-04
20140310 parallel programming_kalishenko_lecture03-0420140310 parallel programming_kalishenko_lecture03-04
20140310 parallel programming_kalishenko_lecture03-04
 
20140223-SuffixTrees-lecture01-03
20140223-SuffixTrees-lecture01-0320140223-SuffixTrees-lecture01-03
20140223-SuffixTrees-lecture01-03
 
20140216 parallel programming_kalishenko_lecture01
20140216 parallel programming_kalishenko_lecture0120140216 parallel programming_kalishenko_lecture01
20140216 parallel programming_kalishenko_lecture01
 
20131106 h10 lecture6_matiyasevich
20131106 h10 lecture6_matiyasevich20131106 h10 lecture6_matiyasevich
20131106 h10 lecture6_matiyasevich
 
20131027 h10 lecture5_matiyasevich
20131027 h10 lecture5_matiyasevich20131027 h10 lecture5_matiyasevich
20131027 h10 lecture5_matiyasevich
 
20131027 h10 lecture5_matiyasevich
20131027 h10 lecture5_matiyasevich20131027 h10 lecture5_matiyasevich
20131027 h10 lecture5_matiyasevich
 
20131013 h10 lecture4_matiyasevich
20131013 h10 lecture4_matiyasevich20131013 h10 lecture4_matiyasevich
20131013 h10 lecture4_matiyasevich
 
20131006 h10 lecture3_matiyasevich
20131006 h10 lecture3_matiyasevich20131006 h10 lecture3_matiyasevich
20131006 h10 lecture3_matiyasevich
 
20131006 h10 lecture3_matiyasevich
20131006 h10 lecture3_matiyasevich20131006 h10 lecture3_matiyasevich
20131006 h10 lecture3_matiyasevich
 

20100522 from object_to_database_replication_pedone_lecture01-02

  • 1. From Object Replication to Database Replication Fernando Pedone University of Lugano Switzerland
  • 2. Outline •Motivation •Replication model •From objects to databases •Deferred update replication •Final remarks © F. Pedone 2
  • 3. Motivation •From object replication... ‣ Replication in the distributed systems community ‣ Processes, non-transactional objects •...to database replication ‣ Replication in the database community ‣ Transactions, databases © F. Pedone 3
  • 4. Motivation •Object replication is important ‣ Fault tolerance File server Replica #1 Client 1 File server File server Replica #2 Client 2 File server Replica #3 © F. Pedone 4
  • 5. Motivation • Database replication is important ‣ Fault tolerance Client 3 ‣ Performance Database server Replica #1 Client 4 Client 1 Database server Replica #2 Client 2 Client 5 Database server Replica #3 © F. Pedone Client 6 5
  • 6. Motivation •Object vs. database replication (recap) ‣ Different goals - Fault tolerance vs. Performance & Fault tolerance ‣ Different models - Objects, non-transactional vs. transactions ‣ Different algorithms - to a certain extent... © F. Pedone 6
  • 7. Motivation •Group communication ‣ Messages addressed to a group of nodes ‣ Initially used for object replication ‣ Later used for database replication node 1 node 2 node 3 point-to-point group © F. Pedone communication communication 7
  • 8. Outline •Motivation •Replication model •From objects to databases •Deferred update replication •Final remarks © F. Pedone 8
  • 9. Replication model •Object model ‣ Clients: c1, c2,... ‣ Servers: s1, ..., sn ‣ Operations: read and write - read(x) returns the value of object x - write(x,v) updates the value of x with v; returns acknowledgement © F. Pedone 9
  • 10. Replication model •Object consistency ‣ Consistency criteria - Defines the behavior of object operations - Simple for single-client-single-server case - But more complex in the general case • Multiple clients • Multiple servers • Failures © F. Pedone 10
  • 11. Replication model ‣ Consistency criteria - Simple case: single client, single server, no failures write(x,1) ack(x) read(x) return(1) client server x=0 x=1 © F. Pedone 11
  • 12. Replication model ‣ Consistency criteria - Multiple clients, multiple servers write(x,1) ack(x) c1 read(x) return(0) read(x) return(1) c2 s1 x=0 x=1 s2 © F. Pedone x=0 x=1 12
  • 13. Replication model ‣ Consistency criteria - Ideally, defines system behavior regardless of implementation details and the operation semantics - Is the ack(x) write(x,1) following execution intuitive to clients? c1 read(x) return(0) read(x) return(1) c2 s1 s2 © F. Pedone 13
  • 14. Replication model ‣ Consistency criteria - Two consistency criteria for objects (among several others) • Linearizability • Sequential consistency © F. Pedone 14
  • 15. Replication model •Linearizability ‣ A concurrent execution is linearizable if there is a sequential way to reorder the client operations such that: (1) it respects the semantics of the objects, as determined in their sequential specs (2) it respects the order of non-overlapping operations among all clients © F. Pedone 15
  • 16. Replication model •Linearizability write(x,1) ack(x) read(x) return(1) c1 read(x) return(0) c2 sequential execution Problem: Does not respect © F. Pedone the semantics of the object 16
  • 17. Replication model •Linearizability write(x,1) ack(x) read(x) return(1) c1 read(x) return(0) c2 sequential execution Problem: Does not respect the order © F. Pedone or non-overlapping operations 17
  • 18. Replication model •Linearizability write(x,1) ack(x) read(x) return(1) c1 read(x) return(0) c2 overlapping read(x) return(0) read(x) return(1) sequential execution write(x,1) ack(x) Linearizable execution! © F. Pedone 18
  • 19. Replication model •Sequential consistency ‣ A concurrent execution is sequentially consistent if there is a sequential way to reorder the client operations such that: (1) it respects the semantics of the objects, as determined in their sequential specs (2) it respects the order of operations at the client that issued the operations © F. Pedone 19
  • 20. Replication model •Sequential consistency write(x,1) ack(x) read(x) return(1) c1 read(x) return(0) c2 sequential execution Sequentially consistent! © F. Pedone 20
  • 21. Replication model •Sequential consistency write(x,1) ack(x) read(x) return(0) c1 read(x) return(0) c2 sequential execution Problem: Does not respect the order © F. Pedone or non-overlapping operations 21
  • 22. Replication model •Sequential consistency write(x,1) ack(x) read(x) return(0) c1 read(x) return(0) c2 sequential execution Problem: Does not respect the order © F. Pedone or operations at c1 22
  • 23. Replication model •Database model ‣ Clients: c1, c2,... ‣ Servers: s1, ..., sn ‣ Operations: read, write, commit and abort ‣ Transaction: group of read and/or write operations followed by commit or abort - ACID properties (see next) © F. Pedone 23
  • 24. Replication model •Transaction properties ‣ Atomicity: Either all transaction operations are executed or none are executed ‣ Consistency: A transaction is a correct transformation of the state ‣ Isolation: Serializability (next slide) ‣ Durability: Once a transaction commits, its changes to the state survive failures © F. Pedone 24
  • 25. Replication model •Serializability (1-copy SR) ‣ A concurrent execution is serializable if it is equivalent to a serial execution with the same transactions ‣ Two executions are (view) equivalent if: - Their transactions preserve the same “reads-from” relationships (t reads x from t’ in both executions) - They have the same final writes © F. Pedone 25
  • 26. Replication model •Serializability write(x,1) ack(x) read(x) return(1) c1 read(x) return(0) c2 sequential execution Serializable! © F. Pedone 26
  • 27. Replication model •Serializability write(x,1) ack(x) read(x) return(1) c1 read(x) return(0) c2 sequential execution Serializable! © F. Pedone 27
  • 28. Replication model •Serializability write(x,1) ack(x) read(x) return(1) read(x) return(0) c1 c2 sequential execution Serializable!!! © F. Pedone 28
  • 29. Replication model •Serializability write(x,1) ack(x) read(y) return(0) c1 write(y,1) ack(y) read(x) return(0) c2 sequential execution Non serializable! © F. Pedone 29
  • 30. Replication model •Object and database consistency criteria ‣ Assume single-operation transactions Objects Database Strong Linearizability Serializability equivalent Sequential Session if transactions have only one Consistency Serializability operation ? Serializability equivalent if transactions have only one operation © F. Pedone 30
  • 31. Replication model •Serializable, not session-serializable write(x,1) ack(x) read(x) return(0) c1 Transaction 1 Transaction 2 c2 sequential execution © F. Pedone 31
  • 32. Replication model •Session-serializable, not strongly serializable write(x,1) ack(x) read(x) return(1) c1 Transaction 1 read(x) return(0) Transaction 2 c2 Transaction 3 sequential execution © F. Pedone 32
  • 33. Replication model •Strongly serializable write(x,1) ack(x) read(x) return(1) c1 Transaction 1 read(x) return(1) Transaction 2 c2 Transaction 3 sequential execution © F. Pedone 33
  • 34. Outline •Motivation •Replication model •From objects to databases •Deferred update replication •Final remarks © F. Pedone 34
  • 35. From objects to databases •Fundamentals ‣ Model considerations ‣ Generic functional model •Object replication ‣ Passive replication (primary-backup) ‣ Active replication (state-machine replication) ‣ Multi-primary passive replication © F. Pedone 35
  • 36. From objects to databases •Model considerations ‣ Client and server processes ‣ Communication by message passing ‣ Crash failures only - No Byzantine failures © F. Pedone 36
  • 37. Replication model •Generic functional model Phase 1: Phase 2: Phase 3: Phase 4: Phase 5: Client Server Execution Agreement Client Request Coordination Coordination Response Client Server 1 Server 2 Server 3 © F. Pedone 37
  • 38. From objects to databases •Passive replication ‣ aka Primary-backup replication ‣ Fail-stop failure model - a process follows its spec until it crashes - a crash is detected by every correct process - no process is suspected of having crashed until after it actually crashes ‣ Algorithm ensures linearizability © F. Pedone 38
  • 39. From objects to databases ‣ Passive replication algorithm Phase 1: Phase 2: Phase 3: Phase 4: Phase 5: Client ServerOnly Execution Agreement Client Request Coordination Coordination Response primary executes Client requests Primary Server 1 Backups Backup 1 apply new Server 2 state Backup 2 Server 3 © F. Pedone 39
  • 40. From objects to databases •Passive replication (Linearizability) ‣ Perfect failure detection ensures that at most one primary exists at all times ‣ A failed primary is detected by the backups ‣ Eventually one backup replaces failed primary © F. Pedone 40
  • 41. From objects to databases •Active replication ‣ aka State-machine replication ‣ Crash failure model - a process follows its spec until it crashes - a crash is detected by every correct process - correct processes may be erroneously suspected ‣ Algorithm ensures linearizability © F. Pedone 41
  • 42. From objects to databases ‣ Atomic broadcast - Group communication abstraction - Primitives: broadcast(m) and deliver(m) - Properties • Agreement: Either all servers deliver m or no server delivers m • Total order: Any two servers deliver messages m and m’ in the same order © F. Pedone 42
  • 43. From objects to databases ‣ Atomic broadcast Application m broadcast(m) deliver(m) Group Client 1 Communication m’ send(m) receive(m) Client 2 Network Server 1 Architectural Server 2 View Server 3 © F. Pedone 43
  • 44. From objects to databases ‣ Active replication algorithm Phase 1: Phase 2: Phase 3: Phase 4: Phase 5: Client Server Execution Agreement Client Request Coordination Coordination Response Deterministic execution Client Server 1 All servers execute all Server 2 requests Server 3 © F. Pedone 44
  • 45. From objects to databases ‣ Why it works write(x,1) ack(x) read(x) return(1) c1 read(x) return(0) return(1) broadcast c 2 operation “write(x,1)” s1 x=0 x=1 s2 x=0 x=1 s3 © F. Pedone x=0 x=1 45
  • 46. From objects to databases •Active and passive replication ‣ Target fault tolerance only ‣ Not good for performance - Active replication: All servers execute all requests - Passive replication: Only one server executes requests © F. Pedone 46
  • 47. From objects to databases •Multi-primary passive replication ‣ Targets both fault tolerance and performance ‣ Does not require perfect failure detection ‣ Same resilience as active replication, but... ‣ Better performance - Distinguishes read from write operations - Only one server executes each request © F. Pedone 47
  • 48. From objects to databases •Multi-primary passive replication ‣ Read requests - Broadcast to all servers - Executed by only one server - Response is sent to the client © F. Pedone 48
  • 49. From objects to databases •Multi-primary passive replication ‣ Write requests - Executed by one server - State changes (“diff”) are broadcast to all servers - If the changes are “compatible” with the previous installed states, then a new state is installed, and the result of the request is returned to the client - Otherwise the request is re-executed © F. Pedone 49
  • 50. From objects to databases •Multi-primary passive replication write(x,1) ack(x) c1 read(x) return(1) c2 broadcast state change “x=1” s1 x=0 x=1 s2 x=0 x=1 s3 © F. Pedone 50 x=0 x=1
  • 51. From objects to databases •Multi-primary passive replication write(x,1) ack(x) c1 write(x,2) ack(x) state c2 change “x=2” state s1 change x=0 “x=1” x=2 x=1 s2 x=0 x=2 x=1 s3 © F. Pedone 51 x=0 x=2 x=1
  • 52. From objects to databases •Multi-primary passive replication inc(x,1) ack(x) c1 inc(x,2) ack(x) state c2 change “x=2” state s1 change x=0 “x=1” x=2 x=1 s2 x=0 x=2 x=1 s3 © F. Pedone 52 x=0 x=2 x=1
  • 53. From objects to databases •Multi-primary passive replication ‣ Write requests - Executed by one server - State changes (“diff”) are broadcast to all servers - If the changes are “compatible” with the previous installed states, then a new state is installed, and the result of the request is returned to the client - Otherwise the request is re-executed © F. Pedone 53
  • 54. From objects to databases •Multi-primary passive replication - Compatible states Si α Si’ • Let Si be the state at server N before it executes operation α, and Si’ the state after α • Let h=S0°...°Sj be the sequence of installed states at a server N’ when it tries to install the new state Si’ • Si’ is compatible with h if - (a) Sj = Si or - (b) α does not read any variables modified in Si+1, Si+2, ..., Sj © F. Pedone 54
  • 55. From objects to databases - Compatible states x=1 write(x,8) x=8 N y=2 z=3 y=2 z=3 Si Si’ x=1 x=8 (a) N’ y=2 y=2 z=3 z=3 Si Si+1 © F. Pedone 55
  • 56. From objects to databases - Compatible states x=1 write(x,8) x=8 N y=2 z=3 y=2 z=3 Si Si’ x=1 write(y,3) x=1 write(z,5) x=1 x=8 (b) N’ y=2 y=3 y=3 y=3 z=3 z=3 z=5 z=5 Si Si+1 Sj Sj+1 © F. Pedone 56
  • 57. From objects to databases •Multi-primary passive replication inc(x,1) ack(x) c1 inc(x,2) ack(x) state c2 change “x=2” state s1 change x=0 “x=1” x=2 x=3 s2 x=0 x=2 x=3 s3 © F. Pedone 57 x=0 x=2 x=3
  • 58. From objects to databases •Multi-primary passive replication ‣ Optimistic approach ‣ May or not rely on deterministic execution - It depends on how re-executions are dealt with ‣ Installing a new state is cheaper than executing the request (that creates the state) © F. Pedone 58
  • 59. From objects to databases •Multi-primary passive replication ‣ Provides both fault tolerance and performance ‣ Suitable for database replication! © F. Pedone 59
  • 60. Outline •Motivation •Replication model •From objects to databases •Deferred update replication •Final remarks © F. Pedone 60
  • 61. Deferred update replication •General idea (I) Phase 1: Phase 2: Phase 3: Phase 4: Phase 5: Client Server Execution Agreement Client Request Coordination Coordination Response Client Server 1 Server 2 Server 3 © F. Pedone 61
  • 62. Deferred update replication •General idea (II) Phase 1: Phase 3: Phase 5: Phase 1: Phase 3: Phase 4: Phase 5: Client Execution Client Client Execution Agreement Client Request Response Request Coordination Response Certification ... & Commit/ Client Abort Server 1 Server 2 Communication Server 3 Protocol © F. Pedone 62
  • 63. Deferred update replication •The protocol in detail ‣ Transactions (read-only and update) are submitted and executed by one database server allowed by serializability ‣ At commit time: - Read-only transactions are committed immediately - Update transactions are propagated to the other replicas for certification, and possibly commit © F. Pedone 63
  • 64. Deferred update replication •Read-only transactions read(X) return(1) read(Y) return(2) commit ok Client 1 Server 1 Server 2 Server 3 Client 2 read(Y) return(2) read(X) return(1) commit ok © F. Pedone 64
  • 65. Deferred update replication •Update transactions ok / abort read(X) return(1) write(Y,3) ack(Y) commit Client 1 Server 1 Server 2 Server 3 If commit, Client 2 ... writes are read(Y) return(2) write(X,3) ack(X) applied © F. Pedone 65
  • 66. Deferred update replication •Termination based on Atomic Broadcast ‣ Similar to multi-primary passive replication ‣ Deterministic certification test ‣ Lower abort rate than atomic commit based technique © F. Pedone 66
  • 67. Deferred update replication ‣ Transaction states - Executing(t,s): transitory state - Committing(t,s): transitory state - Committed(t,s): final state - Aborted(t,s): final state Committed(t,s) Executing(t,s) Committing(t,s) Aborted(t,s) © F. Pedone 77
  • 68. Deferred update replication ‣ Transaction states Certify transaction write(y,3) ack(y) commit(t) ok Client 1 Atomic Broadcast Commit(t,s1) Server 1 Executing(t,s1) Committing(t,s1) Server 2 Server 3 Transaction termination Client 2 © F. Pedone 68
  • 69. Deferred update replication ‣ Transaction states commit(t1) ok Client 1 Server 1 Server 2 Server 3 Client 2 ok commit(t2) © F. Pedone 69
  • 70. therefore, all servers reach the same outcome, committed or aborted, without Deferred update replication further communication [1, 8]. When a server s delivers t’s readset, writeset, and updates, Committing(t, s) holds. Server s certifies transactions in the order they are delivered. To certify t, server s checks whether each transaction it knows about in the committed state either precedes t or does not have a writeset that intersects t’s readset. •If Atomic conditions holds, s locally commits t. We define the condition one of these broadcast-based certification for the transition from the committing state to the committed state more formally as follows: ∀t ∀s : Committing(t, s) ; Committed(t, s) ≡   ∀t s.t. Committed(t , s) :   (1.3) t → t ∨ W S(t ) ∩ RS(t) = ∅ If transaction t passes to the committed state at s, its updates should be t’→t, t’ precedes t: applied to the database following the order imposed by the t’ does not atomic broadcast changes made by t’ update any item primitive, that is, if t is delivered before t and both pass the reads that t certification could be seen by t test, then t’s updates should be applied to the database before those of t . (i.e,. read by t) Intuitively, condition 1.3 checks whether transactions can be serialized following their delivery order. Let t and t be two transactions such that t © F. Pedone 70 is delivered and committed before t. There are two cases in which t will pass certification: (i) t committed before t started its execution, in which case any
  • 71. Deferred update replication •Atomic broadcast-based certification ‣ Example - Transaction t: read(x); write(y,-) - Transaction t’: read(y); write(x,-) - t and t’ are concurrent (i.e., neither t→t’ nor t’→t) - All servers validate t and t’ in the same order - Assume servers validate t and then t’ - What transaction(s) commit/abort? © F. Pedone 71
  • 72. Deferred update replication •Reordering-based certification ‣ Let t and t’ be two concurrent transactions - t executes read(x); write(y,-) and - t’ executes read(y); write(z,-) ‣ (a) t is delivered and certified before t’ - t passes certification, but - t’ does not: RS(t’) ∩ WS(t) = { y } ≠ ∅ ‣ (b) t’ is delivered and certified before t - t’ passes certification and - t passes certification(!): RS(t) ∩ WS(t’) = ∅ © F. Pedone 72
  • 73. Deferred update replication ‣ Serialization order of transactions - Without reordering - With reordering • Order given by abcast • Either t1 followed by t2 • Ex.: t1 followed by t2 or t2 followed by t1 t1 t2 Server 1 Server 2 Server 3 © F. Pedone 73
  • 74. Deferred update replication •Reordering-based certification serialization order Reorder List (committed transactions) new delivered transaction t1 t2 t3 t4 t Condition for committing t (simplified): RS(t) ∩ (WS(t1) ∪ WS(t2) ∪ WS(t3) ∪ WS(t4)) = ∅ © F. Pedone 74
  • 75. Deferred update replication •Reordering-based certification serialization order Reorder List (committed transactions) new delivered transaction t1 t2 t3 t4 t Condition for committing t (simplified): RS(t) ∩ (WS(t1) ∪ WS(t2)) = ∅ and WS(t) ∩ (RS(t3) ∪ RS(t4)) = ∅ © F. Pedone 75
  • 76. mitted transactions that have not been seen by transactions in execution since their relative order may change. The number of transactions in the Reorder List is limited by a predetermined threshold, the R . Whenever Deferred update replication the Reorder Factor is reached, one transaction in the Reorder List is removed and its updates are applied to the database. Let RLs = t0 ; t1 ; . . . ; tcount−1 be the Reorder List at server s containing count transactions and pos(x) be a function that returns the position of transaction x in RLs . We represent the condition for the state transition of •Reordering-based certification transaction t from the committing state to the committed state more formally as follows: ∀t∀s : Committing(t, s) ; Committed(t, s) ≡   ∃i, 0 ≤ i < count, s.t. ∀t ∈ RLs :      pos(t ) < i ⇒ t → t ∨ W S(t ) ∩ RS(t) = ∅ ∧        ∧  (1.4)   (t → t ∨ W S(t ) ∩ RS(t) = ∅)     pos(t ) ≥ i ⇒  ∧  W S(t) ∩ RS(t ) = ∅ If t passes the certification test, it is included in RLs at position i, which means that transactions in positions in [i .. count−1] are rightshifted. If more than one position satisfies the condition in equation 1.4, then to ensure that © F. Pedone servers apply transactions to the database in the same order, a deter- 76 all ministic procedure has to be used to choose the insertion position (e.g., the
  • 77. Deferred update replication •Generalized reordering Reorder List Reorder List t1 t2 t3 t4 t t1 t t2 t3 t4 Reorder List Reorder List t1 t2 t3 t t4 t t1 t2 t3 t4 Reorder List Reorder List t1 t2 t t3 t4 t4 Abort t? t t2 t1 t3 © F. Pedone 77
  • 78. Deferred update replication •Termination based on Generic Broadcast ‣ Generic broadcast - Conflict relation ∼ between messages - Properties • Agreement: Either all servers deliver m or no server delivers m • Total order: If messages m and m’ conflict, then any two servers deliver them in the same order © F. Pedone 78
  • 79. Deferred update replication ‣ Generic broadcast - Motivation • Ordering messages is more expensive than not ordering messages • Messages should only be ordered when needed (as defined by the application) - Atomic broadcast is a special case of generic broadcast where all messages must be ordered © F. Pedone 79
  • 80. Deferred update replication ‣ Conflict relation between transactions - Assume transactions t1 and t2 don’t conflict t1 t2 Server 1 t1 t2 Server 2 t1 Server 3 t2 © F. Pedone 80
  • 81. We show now that the ordering imposed by atomic broadcast is not always needed. Deferred update replication Consider the termination of two transactions t and t with non-intersecting readsets and writesets. In such a case, both t and t will pass the certification test and be committed, regardless of the order in which they are delivered. This shows that atomic broadcast is sometimes too strong, and can be favorably replaced by generic broadcast (see Chapter 3). Generic broadcast is similar to atomic broadcast, with the ‣ Conflict relation between transactions exception that applications can define order constraints on the delivery of messages: - m:t means message m where transaction t two messages are ordered only if they conflict,relays the conflict relation is defined by the application semantics. - if t conflicts with t’ (i.e., m:t ∼ m’:t’), then they are Based on the example above, one could definethe same order delivered and certified in the following conflict relation ∼ among messages, where m : t means that message m relays transaction t:   RS(t) ∩W S(t ) = 0 /  ∨     W S(t) ∩ RS(t ) = 0  m :t ∼ m :t ≡  /  (11.5)  ∨  W S(t) ∩W S(t ) = 0 / Notice that the conflict relation ∼ should account for write-write conflicts to make sure thatF.transactions that update common data items are ordered, preventing the © Pedone 81 case in which two servers end up in different states after applying such transactions
  • 82. Deferred update replication ‣ Read-only transactions - Local execution without certification is not permitted - Different states of the database could be observed x=1 write(x,1) t1 t3 y=0 t2 x=0 x=1 Server 1 y=0 y=2 t1 t2 x=0 x=1 Server 2 y=0 y=2 x=0 write(y,2) t1 x=1 Server 3 y=0 y=2 t2 t4 x=0 y=2 read(x);read(y) © F. Pedone 82
  • 83. Deferred update replication ‣ Read-only transactions - Optimistic solution • Broadcast and certify read-only transactions - Pessimistic solution • Pre-declare items to be read (readset) • Broadcast transaction before execution • Executed by one server only • Never aborted © F. Pedone 83
  • 84. Outline •Motivation •Replication model •From objects to databases •Deferred update replication •Final remarks © F. Pedone 84
  • 85. Final remarks •Object and database replication ‣ Different goals - Fault tolerance fault tolerance & performance ‣ Different consistency models - Linearizability & sequential consistency serializability ‣ Different algorithms - Primary-backup & active replication deferred update replication © F. Pedone 85
  • 86. Final remarks •Consistency models ‣ Relationship between L/SC and SR •Replication algorithms ‣ Unifying framework for object and database replication ‣ Multi-primary passive replication © F. Pedone 86
  • 87. Final remarks •Replication and group communication ‣ A happy union ‣ Group communication -- atomic broadcast -- leads to modular and efficient protocols (e.g., fewer aborts than atomic commit) ‣ Replication has motivated more powerful and efficient group communication protocols (e.g., generic and optimistic primitives) © F. Pedone 87
  • 88. References •Chapter 1: Consistency Models for Replicated Data A. Fekete and K. Ramamritham •Chapter 2: Replication Techniques for Availability R. van Renesse and R. Guerraoui •Chapter 11: From Object Replication to Database Replication F. Pedone and A. Schiper © F. Pedone 88