Distributed Middleware Reliability & Fault Tolerance Support in System S

Rohit Wagle, Henrique Andrade, Kristen Hildrum, Chitra Venkatramani and Michael Spicer
• Laksri Wijerathna
• Himali Erangika
• Erica Jayasundara
• Harini Sirisena
• Distributed middleware reliability and fault-tolerance support in System S.
• A fault-tolerance technique for implementing operations in a large-scale distributed system that ensures all components eventually have a consistent view of the system, even in the presence of component failures.
• How do we develop a reliable large-scale distributed system?
• How do we ensure that, in a large-scale distributed system, all components have a consistent view of the system even when a component fails?
• A large-scale distributed system employs many components.
• A failure in any single component can have system-wide effects.
• A single request can trigger a chain of activities across several tiers of distributed components.

Example: an online purchase can involve
  - a web front-end component
  - a database system component
  - a credit card clearinghouse component
• A failure in one or more components requires that all state changes related to the current operation be rolled back across the components.
• This approach is cumbersome, and may be impossible when components do not have the ability to roll back.
• Break a distributed operation into a series of smaller, linked local operations, each confined to a single component.
• The effect of a component failure and restart in the middle of a multi-component operation is then limited to that component and its immediate neighbors.
• Never roll back once the first local operation has completed.
• If a local operation fails, only that operation is retried, until it completes.
• Ensure that communication between components is tolerant to failure and that the communication protocol implements a retry policy (see the sketch below).
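A minimal sketch (not taken from the deck) of the retry behaviour described above, assuming failures are transient and that cancellation or shutdown surfaces as a distinct exception:

import time

class OperationCanceled(Exception):
    """Raised when the user cancels the operation or the system shuts down."""

def retry_until_complete(operation, args, max_delay=30.0):
    # Keep retrying the same component-local operation; never roll back.
    delay = 0.5
    while True:
        try:
            return operation(*args)          # e.g. an RPC to the neighboring component
        except OperationCanceled:
            raise                            # cancellation / shutdown / logical errors stop the retries
        except Exception:
            time.sleep(delay)                # failure assumed transient: wait and retry
            delay = min(delay * 2, max_delay)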
• Ensure that each component persists enough data that, when restarted after a failure, it continues pending requests where its predecessor left off.
• If the state of the system has changed, the operation is adjusted as appropriate.
• Remote Procedure Calls (RPCs) between the component-local operations are stored as work items in a queue, and the queue itself is saved as part of the local action.
• System S comprises a middleware runtime system and an application development framework.
• The System S middleware runtime architecture separates the logical system view from the physical system view.
• The runtime contains two kinds of components:
  1. Centralized management components
  2. Distributed management components
Streams Application Manager (SAM)

• Centralized gatekeeper for logical system information related to the applications running on System S.
• The system entry point for job management tasks.
Streams Resource Manager (SRM)

• Centralized gatekeeper for physical system information related to the software and hardware components that make up a System S instance.
• The middleware bootstrapper: it performs system initialization upon administrator request.
Scheduler (SCH)

• Responsible for computing placement decisions for applications to be deployed on the runtime system.
Name Service (NS)

• Centralized component responsible for storing service references, which enable inter-component communication.
Authentication and Authorization Service (AAS)

• Centralized component that provides user authentication as well as inter-component cross-authentication.
Host Controller (HC)

• Component running on every application host, responsible for carrying out all local job management tasks (starting, stopping, and monitoring processing elements) on behalf of requests made by SAM.
Processing Element Container (PEC)

• Hosts the application user code embedded in a processing element.
• How is system-wide reliability achieved in System S?

Two fundamental building blocks are required:

1. The underlying inter-component communication infrastructure must be reliable.
   How is this achieved? By ensuring that:
   ◦ remote procedure calls are correctly carried out, and
   ◦ failures are conveyed back to the caller.
   • These requirements are largely satisfied by existing technologies and protocols available today; System S uses CORBA as its basic RPC mechanism.

2. The data storage mechanism must be reliable.
   • System S uses IBM DB2 as its durable data store.
(Diagram: a distributed operation is converted into a set of component-local transactions, connected by the communication protocol and retried until they succeed.)
• Failures can happen due to:
  ◦ component failure
  ◦ communication failure
• Operations are always retried in the case of failures.
• Retries continue until:
  1. the user cancels the operation,
  2. the system shuts down, or
  3. a logical error occurs.
• Remote operations are always eventually executed.
• Failures are seen as transient in nature (i.e., a failed component is restarted quickly and primed with the state it held before the failure).
• Clients are able to transparently retry or back out of pending remote operations.
1. Devising the reliability architecture so that it is deployable as part of the component design, rather than baking it into a particular framework such as CORBA.
• This is a challenging task, because:
  ◦ distributed systems grow organically,
  ◦ different components may choose to present their remote interfaces through several communication mechanisms,
  ◦ component writers can pick different reliability levels for different components, and
  ◦ components may be built on different infrastructure.
2. Management of a component's internal state. A component's state consists of:
  • information (the component's static state), and
  • asynchronous work items (to carry out requests to external components).
• This is the information that must be maintained by the component for its operation; it is persisted, and restored in the case of a failure, so the component can recover.
• For every component that maintains internal state to be restored after a failure, the following information must be stored in the durable data store:
  1. The component's in-core management data structures
  2. The serialized asynchronous processing requests (the work items in the component's work queue)
  3. The repository of completed remote operations and their associated results
• Persisting a component's in-core data structures needs to be engineered so that it is not tied to a particular durable storage solution.
• System S uses a paradigm made popular by Hibernate, with two layers (a sketch follows below):
  ◦ Top layer: presents an object/relational interface wrapping traditional data structures such as associative maps and red-black trees.
  ◦ Lower layer: hooks up the data storage, converting map entries into database records.
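A rough sketch of the two-layer idea, with SQLite standing in for the actual durable store (the deck names IBM DB2); the upper layer behaves like an ordinary associative map while the lower layer turns each entry into a database record:

import sqlite3

class PersistentMap:
    """Upper layer: looks like an associative map. Lower layer: each entry is a DB row."""

    def __init__(self, db_path, table):
        self.conn = sqlite3.connect(db_path)
        self.table = table
        self.conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} (key TEXT PRIMARY KEY, value TEXT)")
        self.conn.commit()

    def __setitem__(self, key, value):
        # Every update to the in-memory view becomes a database record.
        self.conn.execute(
            f"INSERT OR REPLACE INTO {self.table} (key, value) VALUES (?, ?)", (key, value))
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            f"SELECT value FROM {self.table} WHERE key = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

# usage (illustrative names): jobs = PersistentMap("sam_state.db", "job_table"); jobs["job-42"] = "RUNNING"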
• Persisting asynchronous work items is achieved by:
  ◦ serializing the work items while maintaining their order of submission;
  ◦ thus, when they are retrieved from the data store after a crash, the work items are scheduled in the same sequence (a sketch follows the diagram below).
(Diagram: work items submitted to the queue before a crash are restored from the data store and processed in the same order after the crash.)
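A similar sketch for the work queue: items are serialized in submission order and replayed in that same order after a restart (again SQLite stands in for the durable store, and the table name is ours):

import json, sqlite3

class PersistentWorkQueue:
    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS work_items ("
                          "seq INTEGER PRIMARY KEY AUTOINCREMENT, item TEXT)")
        self.conn.commit()

    def submit(self, item):
        # Serialize the work item; the auto-incremented seq preserves submission order.
        self.conn.execute("INSERT INTO work_items (item) VALUES (?)", (json.dumps(item),))
        self.conn.commit()

    def restore(self):
        # After a crash, pending items come back in exactly the order they were submitted.
        rows = self.conn.execute("SELECT seq, item FROM work_items ORDER BY seq").fetchall()
        return [(seq, json.loads(item)) for seq, item in rows]

    def complete(self, seq):
        self.conn.execute("DELETE FROM work_items WHERE seq = ?", (seq,))
        self.conn.commit()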
• System S requires some remote operations to execute at most once.
• The same request, however, may be made multiple times.
• Reliable middleware should handle such repeats, ensuring either that they are harmless or that the re-issue is flagged and correctly dealt with.
To handle this type of situation, each external operation is classified as either:

• Idempotent:
  ◦ multiple invocations do not change the remote component's internal state,
  ◦ but they might return different results.
  (e.g., an operation querying the internal state of a component)

• Non-idempotent:
  ◦ an invocation will yield an internal state change in the remote component.
• Idempotent operations are safe to retry, since they cause no state change.
• The concern is therefore mostly with non-idempotent operations.
• For each non-idempotent operation:
  ◦ an Operation Transaction Identifier (OTID) field is attached to the arguments of the interface;
  ◦ this ensures a repeated invocation can be detected instead of being executed twice.
(Diagram: flow of a submitJob call. The client invokes submitJob through the reliability wrapper, passing the oTid, session, and job description. SAM checks whether that oTid is already marked complete in its repository of completed operations: if yes, it retrieves the saved results; if no, it processes the request and saves the results. The job ID is then returned to the client as the output parameter of the RPC.)
• Consider the non-idempotent part of an operation:
• it changes the internal state of the component,
• but it does not:
  ◦ initiate requests to external components, or
  ◦ carry out asynchronous processing to complete the request.
• The non-idempotent code is implemented wrapped within a database transaction.
  ◦ First, consider this simple handling of non-idempotent code:
1. Begin Network Service(oTid)
2.        Non-idempotent code
3.        Log service request result(oTid,results)
4. End Network Service
1. Begin Network Service(oTid)
2. DB Transaction Begin
3.      Non-idempotent code
4.      Log service request result(oTid, results)
5. DB Transaction End (Commit)

6. End Network Service
1. Begin Network Service(oTid)
2. DB Transaction Begin
3.      Non-idempotent code
4.      Log service request result(oTid, results)
5. DB Transaction End (Commit)        <- Case 1: crash before this line
6. End Network Service                <- Case 2: crash after line 5, before replying

• Case 1: if the system crashes before line 5:
  ◦ the state changes are not committed to durable storage,
  ◦ so the component keeps a consistent state;
  ◦ the client requesting the remote operation will continue retrying the request until it completes.
• Case 2: if the system crashes after line 5, but before the result is sent to the client:
  ◦ the framework has already committed the log of the service request,
  ◦ which contains the service oTid and the response that needs to be sent back to the client,
  ◦ so the reliable protocol layer simply looks up the log and replies with the original result.
  (A sketch combining both cases follows below.)
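The following is a minimal sketch, not the System S implementation, of how a server-side reliability wrapper can realize both cases; SQLite stands in for the durable store, and process_request stands in for the component's actual non-idempotent code (both names are ours):

import sqlite3

conn = sqlite3.connect("component_state.db")
conn.execute("CREATE TABLE IF NOT EXISTS completed_ops (otid TEXT PRIMARY KEY, result TEXT)")
conn.commit()

def handle_request(otid, payload, process_request):
    # If this oTid is already logged, the crash happened after the commit (case 2):
    # reply with the logged result instead of re-running the non-idempotent code.
    row = conn.execute("SELECT result FROM completed_ops WHERE otid = ?", (otid,)).fetchone()
    if row is not None:
        return row[0]

    # Otherwise run the non-idempotent code and log its result inside one transaction.
    # If we crash before the commit (case 1), nothing is persisted and the client retries.
    with conn:                                    # commits on success, rolls back on error
        result = process_request(payload)         # should persist its own changes via the same connection
        conn.execute("INSERT INTO completed_ops (otid, result) VALUES (?, ?)", (otid, result))
    return result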
• While servicing a request, the middleware may need to perform additional operations that use other components.
• E.g., launching PEs:
  ◦ validation of preconditions and security checks are performed synchronously;
  ◦ dispatching the PEs can be carried out asynchronously.
• The System S approach: an asynchronous task is processed only after the database transaction under which the task was created has been committed to the durable repository.
• System S handles the resulting problems as follows:
  ◦ The execution of a new unit of work on each thread would have to follow the same reliability approach,
  ◦ but that is quite complicated to implement.
  ◦ The complexity can be reduced with one assumption:
    - a work unit is scheduled only after the commit of the original request;
    - this guarantees work units are executed once (see the sketch below).
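A tiny sketch of the "schedule only after commit" rule described above; the function and table names are ours, with SQLite again standing in for the durable store:

import queue, sqlite3

conn = sqlite3.connect("component_state.db")
conn.execute("CREATE TABLE IF NOT EXISTS work_items ("
             "seq INTEGER PRIMARY KEY AUTOINCREMENT, item TEXT)")
conn.commit()
work_queue = queue.Queue()          # in-memory dispatch queue for worker threads

def run_operation(non_idempotent_step, new_work_items):
    # 1. Do the state change and record the new work items inside one DB transaction.
    with conn:
        non_idempotent_step(conn)
        for item in new_work_items:
            conn.execute("INSERT INTO work_items (item) VALUES (?)", (item,))
    # 2. Only after the commit, hand the work items to the worker threads.
    #    If we crash before this point, restart re-reads them from the table instead.
    for item in new_work_items:
        work_queue.put(item)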
• Components interact with each other, and the framework has to handle these interactions.
• Interactions can be:
  1. User initiated
  2. System initiated
(Diagram: interactions among component x, component y, and component z.)
• The System S job submission process consists of 6 steps:
1. Accept the job description from the user.
2. Check permissions: query AAS (no change in AAS's local state).
3. Determine PE placement: query the Scheduler (SCH), which checks node availability with SRM (no state change).
4. Update the local state: insert the job into SAM's local tables.
5. Register the job with AAS (the registerJob operation). This changes AAS's local state.
6. Deploy the PEs. This changes the state of the system.
  ◦ The HCs do not keep this state persistently; on restart an HC recovers it, so this is not a problem.
• Consider the registerJob operation (SAM → AAS).
• What happens if AAS crashes?
  ◦ The call appears to have failed, but there are two possibilities:
  ◦ 1. AAS did complete the registration:
    - a naive retry would fail with an error, because the job is already in the system.
  ◦ 2. AAS did not complete the registration:
    - SAM must still register the job,
    - so it can simply retry.
• What happens if SAM crashes?
  ◦ It may leave the distributed system in an inconsistent state:
    - the job may not exist in SAM's state,
    - while the AAS registration might have succeeded.
  ◦ On restart, SAM retries the submit operation (the client keeps retrying the submission while SAM is down).
  ◦ The problem: the retry may re-register the job with AAS a second time.
1. PREPARATION PHASE
  1. Accept the job description from the user.
  2. Check permissions.
  3. Determine PE placement.
  4. Update the local state: insert the job into SAM's local tables.
  5. Generate an oTid for the AAS registerJob call and queue a registration work item with that id.
  • Commit the current state (SAM's internal tables and work queue) to the database.
  (The old steps 5 and 6, registering the job with AAS and deploying the PEs, move to the register-and-launch phase.)
2. REGISTER AND LAUNCH PHASE
  1. Register the job with AAS using the already generated oTid.
  2. Start a local database transaction.
  3. Deploy the PEs.
  4. Commit the current state to the database.
• With this approach:
  ◦ The preparation phase contains no calls that change the internal state of other components, so retrying it is harmless.
  ◦ The register-and-launch phase can be repeated many times:
    - there is no problem if SAM fails,
    - since register and launch is retried from the beginning,
    - and since the same oTid is used for the same call, there is no danger of registering the job twice.
  (A condensed sketch of the two phases follows below.)
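A condensed, illustrative sketch of the two phases; every function here (check_permissions, place_pes, aas_register_job, deploy_pes, commit) is a stand-in for the corresponding System S step, not a real API:

import uuid

# Placeholder stand-ins for the real System S services; all names here are hypothetical.
def check_permissions(job): pass            # AAS query (no remote state change)
def place_pes(job): return ["host1"]        # SCH/SRM query (no remote state change)
def aas_register_job(otid, job): pass       # non-idempotent, keyed by oTid
def deploy_pes(placement): pass             # requests to the HCs

sam_state = {"jobs": [], "work_queue": [], "committed": None}

def commit(state):
    # Stand-in for committing SAM's tables and work queue to the durable store.
    state["committed"] = (list(state["jobs"]), list(state["work_queue"]))

def preparation_phase(job):
    check_permissions(job)
    placement = place_pes(job)
    sam_state["jobs"].append(job)
    otid = str(uuid.uuid4())                              # generated once, reused on every retry
    sam_state["work_queue"].append(("registerJob", otid, job, placement))
    commit(sam_state)                                     # nothing outside SAM has changed yet

def register_and_launch_phase():
    # Safe to repeat after a SAM crash: same oTid, so AAS never registers the job twice.
    for (_, otid, job, placement) in sam_state["work_queue"]:
        aas_register_job(otid, job)
        deploy_pes(placement)
    commit(sam_state)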
• 1. Registering PEs: the same scheme applies to registering failed PEs.
• 2. Generalizing: the approach of the preceding sections can be generalized to other multi-component operations.
• Retry Policy
• Retry Controller (a sketch follows below):
  I. Bounded retries
  II. Unbounded retries
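A minimal sketch of the bounded/unbounded distinction; RetryController here is our own illustrative class, not a System S API:

class RetryController:
    def __init__(self, max_attempts=None):
        # max_attempts=None  -> unbounded retries
        # max_attempts=N     -> bounded retries: give up after N attempts
        self.max_attempts = max_attempts
        self.attempts = 0

    def should_retry(self):
        self.attempts += 1
        return self.max_attempts is None or self.attempts <= self.max_attempts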
• During normal operation of the System S middleware, once failures are detected, the recovery process is automatically kick-started.
• In System S, failure detection is the responsibility of the SRM component.

Failures are detected in two different ways (illustrated below):
• Central components are periodically contacted by SRM to ensure their liveness. This is done using an application-level ping operation that is built into all the components as part of the framework.
• Moreover, all distributed components communicate their liveness to SRM via a scalable heartbeat mechanism.
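A toy illustration, not the SRM implementation, of the two detection paths (an application-level ping of the central components and heartbeats from the distributed ones); the timeout and function names are invented:

import time

last_heartbeat = {}                 # component name -> time of last heartbeat received

def record_heartbeat(component):
    last_heartbeat[component] = time.time()

def detect_failures(central_components, ping, heartbeat_timeout=10.0):
    failed = []
    for comp in central_components:            # SRM actively pings the central components
        if not ping(comp):
            failed.append(comp)
    now = time.time()
    for comp, seen in last_heartbeat.items():  # distributed components push heartbeats
        if now - seen > heartbeat_timeout:
            failed.append(comp)
    return failed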
• The recovery process is simple and involves only the restart of the failed component or components.
• Once a failed component is restarted, its state is rebuilt from the information in durable storage before it starts processing any new or pending operations.
• First, the component's in-core structures are read from storage.
• Next, the list of completed operations is retrieved, followed by re-populating the work queue with any pending asynchronous operations.

• Once all the state is populated, the component starts accepting new external requests, and the pending requests start being processed.
• Any components trying to contact the restarted component will be able to receive responses, and the system resumes normal operation.
  (A sketch of this restart sequence follows below.)
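A sketch of the restart sequence, reusing the hypothetical PersistentMap and PersistentWorkQueue classes sketched earlier; the real recovery code is part of the framework, so this is only illustrative:

def recover(component_db):
    # 1. Read the in-core data structures back from durable storage.
    state = PersistentMap(component_db, "state")
    # 2. Retrieve the repository of completed operations (so repeated oTids can be answered).
    completed = PersistentMap(component_db, "completed_ops")
    # 3. Re-populate the work queue with pending asynchronous operations, in submission order.
    pending = PersistentWorkQueue(component_db).restore()
    # 4. Only now start accepting new external requests and draining the pending items.
    return state, completed, pending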
• The scheme is able to handle multiple component failures at the same time, without any additional work or coordination.
• Failed components can be restarted in any order and will begin processing requests as and when they are restarted.
• NB: completion of a pending distributed operation depends on the availability of all components needed to service that operation.
• The failure of a component after it has completed its part of a distributed operation does not affect the completion of the operation.
• Metric: operation completion time.
• The effect of failures was measured in three different mocked-up component-graph configurations.
• All experiments were conducted with System S running on up to five Linux hosts, each with 2 Intel Xeon 3.4 GHz CPUs and 16 GB of RAM.
• An IBM DB2 database was used as the durable storage, running on a separate dedicated host.
• Source-Relay-Sink (SRS)
• Market Data Processing (MDP)
Inspiration

• Berkeley's Recovery Oriented Computing paradigm:
  ◦ Bug-free software is impossible.
  ◦ Lower MTTR (Mean Time To Recover) rather than increasing MTTF (Mean Time To Failure).
• Fault Tolerance in Three-Tier Applications – Vaysburd, 1999.
Inspiration (cont.)

• Fault Tolerance in Three-Tier Applications – Vaysburd, 1999:
  ◦ The client tier should tag requests.
  ◦ The server tier should offload state to a database.
  ◦ The database tier alone should be concerned with reliability.
1) Replica and consistency management
• How to physically set up replicas?
• How to switch to a different one?
• How to maintain consistency?

Disadvantages:
• Overhead of having replicas.
• Difficulty of ensuring consistency in the presence of non-idempotent operations.
Replica and consistency management (cont.)

a) FT CORBA – OMG, 1998.
• First standardization effort on fault-tolerant middleware support.
• Handles distributed non-idempotent requests through service replication and consistency.
Replica and consistency management (cont.)

b) An Architecture for Object Replication in Distributed Systems – Beedubail et al., 1997.
• Hot replicas (multiple copies of a service exist in standby).
• A fault-tolerance layer in the middleware relays state changes from the primary replica to the secondary ones to maintain consistency.
Replica and consistency management (cont.)

c) Exactly Once End-to-End Semantics in CORBA Invocations across Heterogeneous Fault-Tolerant ORBs – Vaysburd & Yajnik, 1999.
• Similar to the TID approach; however, the assumption is that in case of failure a replica will pick up the request, and a multicast mechanism is used to notify all replicas of state changes.
Replica and consistency management (cont.)

d) DOORS – Bell Labs, 2000.
• Uses interception to capture inter-component interactions.
• Fault tolerance is mainly supported through replication.
Replica and consistency management (cont.)

e) Chubby (a lock service for loosely coupled distributed systems, 2006) and ZooKeeper (wait-free coordination for Internet-scale systems, 2010).
• Useful for group services (where a set of nodes vote to elect a master).
• Replicate servers and databases to provide high availability.
2) Flexible consistency models

• Failure is dealt with by relaxing ACID and allowing a temporarily inconsistent state.
• It has been shown that many applications can actually work under such relaxed assumptions.
Flexible consistency models (cont.)

a) Cluster-Based Scalable Network Services – Fox et al., 1997.
• The BASE (Basically Available, Soft State, Eventual Consistency) model.
• Does not handle situations where non-idempotent requests are carried out.
Flexible consistency models (cont.)

b) Neptune – Shen et al., 2003.
• Middleware for clustering support and replication management of network services.
• Flexible replication consistency support.
3) Distributed transaction support

• Allows a distributed transaction to roll back in case of failures.
• Done at the expense of central coordination and a global rollback mechanism.
• The paper gives a mechanism for achieving reliability and fault tolerance in a large-scale distributed system.
• It is used in real-world middleware: IBM InfoSphere Streams.
• The approach avoids complex rollbacks and the overhead of maintaining active replicas of components.
• It can be implemented as an extension to existing low-level distributed computing technologies (CORBA, DCOM).
• Supports both stateful and stateless components, allowing the system to grow organically while providing different levels of reliability for components (global state consistency).
• Low MTTR.
• Can incorporate other low-cost alternatives for ensuring durability (e.g., journaling file systems).
• Can tolerate and recover from one or more concurrent failures.
• Future plans: experiment with alternative durable storage mechanisms and use this mechanism in other distributed middleware.
• A good mechanism for implementing fault tolerance in a distributed system at the middleware level.
• Unlike traditional FT mechanisms, this approach focuses on converting a distributed operation into component-local operations and implementing FT in the communication protocol (reliable RPC).
• Test results demonstrate reliable fault tolerance.
• The mechanism is used in IBM's InfoSphere Streams enterprise platform, which supports large-scale distribution and can handle petabytes of data.

Editor's Notes

  1. Three-tier: client/server/database. The focus of this work is analyzing FT when applications make use of commercial DBs.
  2. Failed request taken over by a replica.
  3. OMG – Object Management Group. OMG has been an international, open-membership, not-for-profit computer industry consortium since 1989.
  4. ORBs – Object Request Brokers.
  5. DOORS?
  6. ACID – Atomicity, Consistency, Isolation, Durability.
  7. ACID – Atomicity, Consistency, Isolation, Durability (e.g., Internet-based content providers).
  8. ACID – Atomicity, Consistency, Isolation, Durability (e.g., Internet-based content providers).
  9. DCOM – Microsoft. InfoSphere Streams – a real-time Big Data analysis platform for the enterprise.
  10. DCOM – Microsoft. A journaling file system keeps a log of the changes that will be made before committing them to the main file system; in the event of a system crash or power failure, it is quicker to bring back online and less likely to become corrupted.
  11. DCOM – Microsoft.