Rolando da Silva Martins


On the Integration of Real-Time
  and Fault-Tolerance in P2P
          Middleware




    Departamento de Ciência de Computadores
    Faculdade de Ciências da Universidade do Porto
                       2012
Rolando da Silva Martins


On the Integration of Real-Time
  and Fault-Tolerance in P2P
          Middleware




         Thesis submitted to the Faculdade de Ciências da
      Universidade do Porto to obtain the degree of Doctor
                    in Computer Science




    Advisors: Prof. Fernando Silva and Prof. Luís Lopes




        Departamento de Ciência de Computadores
      Faculdade de Ciências da Universidade do Porto
                       May 2012
To my wife Liliana, for her endless love, support, and encouragement.




Acknowledgments

–Imagination is everything. It is the preview of
life's coming attractions.
                            Albert Einstein



To my soul-mate Liliana, for her endless support in the best and worst of times. Her
unconditional love and support helped me overcome the most daunting adversities
and challenges.

I would like to thank EFACEC, in particular Cipriano Lomba, Pedro Silva, and Paulo
Paixão, for the vision and support that allowed me to pursue this Ph.D.

I would like to thank EFACEC, Sistemas de Engenharia, S.A. and FCT - Fundação
para a Ciência e a Tecnologia for their financial support, through Ph.D. grant
SFRH/BDE/15644/2006.

I would especially like to thank my advisors, Professors Luís Lopes and Fernando Silva,
for their endless effort and teaching over the past four years. Luís, thank you for steering
me when my mind entered a code frenzy, and for teaching me how to put my thoughts
into words. Fernando, your keen eye for the "big picture" was vital to detect and prevent
the pitfalls of building large and complex middleware systems. To both, I thank you for
opening the door of CRACS to me. I had an incredible time working with you.

A huge thank you to Professor Priya Narasimhan, for acting as an unofficial advisor.
She opened the door of CMU to me and helped to shape my work at crucial stages.
Priya, I had a fantastic time mind-storming with you, each time I managed to learn
something new and exciting. Thank you for sharing with me your insights on MEAD’s
architecture, and your knowledge on fault-tolerance and real-time.

Luís, Fernando, and Priya, I hope someday to be able to repay your generosity and
friendship. It is inspirational to see your passion for your work, and your continuous
effort in helping others.

I would like to thank Jiaqi Tan for taking the time to explain to me the architecture and
functionalities of MapReduce, and Professor Alysson Bessani, for his thoughts on my
work and for his insights on Byzantine failures and consensus protocols.

I would also like to thank the CRACS members, Professors Ricardo Rocha, Eduardo
Correia, Vítor Costa, and Inês Dutra, for listening and sharing their thoughts on my work.

A big thank you to Hugo Ribeiro, for his crucial help with the experimental setup.




Abstract

–All is worthwhile if the soul is not small.
                            Fernando Pessoa




The development and management of large-scale information systems, such as high-
speed transportation networks, are pushing the limits of the current state-of-the-art
in middleware frameworks. These systems are not only subject to hardware failures,
but also impose stringent constraints on the software used for management and there-
fore on the underlying middleware framework. In particular, fulfilling the Quality-
of-Service (QoS) demands of services in such systems requires simultaneous run-time
support for Fault-Tolerance (FT) and Real-Time (RT) computing, a marriage that
remains a challenge for current middleware frameworks. Fault-tolerance support is
usually introduced in the form of expensive high-level services arranged in a client-server
architecture. This approach is inadequate if one wishes to support real-time tasks due
to the expensive cross-layer communication and resource consumption involved.

In this thesis we design and implement Stheno, a general-purpose P2P middleware
architecture. Stheno innovates by integrating both FT and soft-RT in the architecture
by: (a) implementing FT support at a much lower level in the middleware, on top of a
suitable network abstraction; (b) using the peer-to-peer mesh services to support FT;
(c) supporting real-time services through a QoS daemon that manages the underlying
kernel-level resource reservation infrastructure (CPU time); while simultaneously
(d) providing support for multi-core computing and traffic demultiplexing. Stheno is
able to minimize resource consumption and latencies from FT mechanisms and allows
RT services to perform within QoS limits.

Stheno has a service-oriented architecture that does not limit the type of service that can
be deployed in the middleware, whereas current middleware systems do not provide
a flexible service framework, as their architecture is normally designed to support a
specific application domain, for example, the Remote Procedure Call (RPC) service.
Stheno is able to transparently deploy a new service within the infrastructure without
user assistance. Using the P2P infrastructure, Stheno searches for and selects a suitable
node on which to deploy the service with the specified QoS limits.

We thoroughly evaluate Stheno, namely the major overlay mechanisms, such as
membership, discovery, and service deployment, and the impact of FT over RT, with
and without resource reservation, and compare it with other closely related middleware
frameworks. Results show that Stheno is able to sustain RT performance while
simultaneously providing FT support. The performance of the resource reservation
infrastructure enables Stheno to maintain this behavior even under heavy load.


Acronyms


API Application Programming Interface

BFT Byzantine Fault-Tolerance

CCM CORBA Component Model

CID Cell Identifier

CORBA Common Object Request Broker Architecture

COTS Commercial Off-The-Shelf

DBMS Database Management Systems

DDS Data Distribution Service

DHT Distributed Hash Table

DOC Distributed Object Computing

DRE Distributed Real-Time and Embedded

DSMS Data Stream Management Systems

EDF Earliest Deadline First

EM/EC Execution Model/Execution Context

FT Fault-Tolerance

IDL Interface Definition Language

IID Instance Identifier

IPC Inter-Process Communication

IaaS Infrastructure as a Service

J2SE Java 2 Standard Edition

JMS Java Messaging Service

JRTS Java Real-Time System

JVM Java Virtual Machine

JeOS Just Enough Operating System

KVM Kernel-based Virtual Machine

LFU Least Frequently Used

LRU Least Recently Used

LwCCM Lightweight CORBA Component Model

MOM Message-Oriented Middleware

NSIS Next Steps in Signaling

OID Object Identifier

OMA Object Management Architecture

OS Operating System

PID Peer Identifier

POSIX Portable Operating System Interface

PoL Place of Launch

QoS Quality-of-Service

RGID Replication Group Identifier

RMI Remote Method Invocation

RPC Remote Procedure Call

RSVP Resource Reservation Protocol

RTSJ Real-Time Specification for Java

RT Real-Time

SAP Service Access Point

SID Service Identifier

SLA Service Level Agreement

SSD Solid State Disk

TDMA Time Division Multiple Access

TSS Thread-Specific Storage

UUID Universally Unique Identifier

VM Virtual Machine

VoD Video on Demand




Contents

Acknowledgments                                                                           5

Abstract                                                                                  7

Acronyms                                                                                  9

List of Tables                                                                           17

List of Figures                                                                          19

List of Algorithms                                                                       23

List of Listings                                                                         25

1 Introduction                                                                           27
   1.1   Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
   1.2   Challenges and Opportunities . . . . . . . . . . . . . . . . . . . . . . . . 28
   1.3   Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
   1.4   Assumptions and Non-Goals . . . . . . . . . . . . . . . . . . . . . . . . . 31
   1.5   Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
   1.6   Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2 Overview of Related Work                                                               35
   2.1   Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
   2.2   RT+FT Middleware Systems . . . . . . . . . . . . . . . . . . . . . . . . 37
         2.2.1   Special Purpose RT+FT Systems . . . . . . . . . . . . . . . . . . 37
         2.2.2   CORBA-based Real-Time Fault-Tolerant Systems . . . . . . . . . 39
   2.3   P2P+RT Middleware Systems . . . . . . . . . . . . . . . . . . . . . . . . 44
         2.3.1   Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

            2.3.2   QoS-Aware P2P . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
      2.4   P2P+FT Middleware Systems . . . . . . . . . . . . . . . . . . . . . . . . 46
           2.4.1   Publish-subscribe . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
           2.4.2   Resource Computing . . . . . . . . . . . . . . . . . . . . . . . . . 47
           2.4.3   Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
      2.5   P2P+RT+FT Middleware Systems . . . . . . . . . . . . . . . . . . . . . 49
     2.6   A Closer Look at TAO, MEAD and ICE . . . . . . . . . . . . . . . . . . 49
           2.6.1   TAO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
           2.6.2   MEAD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
           2.6.3   ICE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
     2.7   Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3 Architecture                                                                              59
     3.1   Stheno’s System Architecture . . . . . . . . . . . . . . . . . . . . . . . . 61
           3.1.1   Application and Services . . . . . . . . . . . . . . . . . . . . . . . 62
           3.1.2   Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
            3.1.3   P2P Overlay and FT Configuration . . . . . . . . . . . . . . . . . 66
           3.1.4   Support Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 70
           3.1.5   Operating System Interface . . . . . . . . . . . . . . . . . . . . . 76
     3.2   Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
           3.2.1   Runtime Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
           3.2.2   Overlay Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
           3.2.3   Core Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
     3.3   Fundamental Runtime Operations . . . . . . . . . . . . . . . . . . . . . . 81
           3.3.1   Runtime Creation and Bootstrapping . . . . . . . . . . . . . . . . 81
           3.3.2   Service Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . 82
           3.3.3   Client Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . 88
     3.4   Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

4 Implementation                                                                            91
     4.1   Overlay Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
           4.1.1   Overlay Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . 93
           4.1.2   Mesh Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
           4.1.3   Discovery Service . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

         4.1.4   Fault-Tolerance Service . . . . . . . . . . . . . . . . . . . . . . . . 111
  4.2   Implementation of Services . . . . . . . . . . . . . . . . . . . . . . . . . . 122
        4.2.1   Remote Procedure Call . . . . . . . . . . . . . . . . . . . . . . . . 123
        4.2.2   Actuator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
        4.2.3   Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
  4.3   Support for Multi-Core Computing . . . . . . . . . . . . . . . . . . . . . 142
        4.3.1   Object-Based Interactions . . . . . . . . . . . . . . . . . . . . . . 142
        4.3.2   CPU Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
        4.3.3   Threading Strategies . . . . . . . . . . . . . . . . . . . . . . . . . 145
        4.3.4   An Execution Model for Multi-Core Computing . . . . . . . . . . 148
  4.4   Runtime Bootstrap Parameters . . . . . . . . . . . . . . . . . . . . . . . 155
  4.5   Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

5 Evaluation                                                                            157
  5.1   Evaluation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
        5.1.1   Physical Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . 157
        5.1.2   Overlay Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
  5.2   Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
        5.2.1   Overlay Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 159
        5.2.2   Services Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 161
        5.2.3   Load Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
  5.3   Overlay Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
        5.3.1   Membership Performance . . . . . . . . . . . . . . . . . . . . . . 163
        5.3.2   Query Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 164
        5.3.3   Service Deployment Performance . . . . . . . . . . . . . . . . . . 165
  5.4   Services Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
         5.4.1   Impact of Fault-Tolerance Mechanisms in Service Latency . . . . . 167
        5.4.2   Real-Time and Resource Reservation Evaluation . . . . . . . . . . 169
  5.5   Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

6 Conclusions and Future Work                                                           177
  6.1   Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
  6.2   Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
  6.3   Personal Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

References   182




List of Tables

4.1   Runtime and overlay parameters. . . . . . . . . . . . . . . . . . . . . . . 155




List of Figures

1.1    Oporto’s light-train network. . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.1    Middleware system classes. . .    .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   36
2.2    TAO’s architectural layout. .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   51
2.3    FLARe’s architectural layout.     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   53
2.4    MEAD’s architectural layout.      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   54

3.1    Stheno overview. . . . . . . . . . . . . . . . . . . . . . . . . . .                                          .   .   .   .   .   61
3.2    Application Layer. . . . . . . . . . . . . . . . . . . . . . . . . .                                          .   .   .   .   .   63
3.3    Stheno’s organization overview. . . . . . . . . . . . . . . . . . .                                           .   .   .   .   .   63
3.4    Core Layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                         .   .   .   .   .   65
3.5    QoS Infrastructure. . . . . . . . . . . . . . . . . . . . . . . . . .                                         .   .   .   .   .   66
3.6    Overlay Layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                        .   .   .   .   .   67
3.7    Examples of mesh topologies. . . . . . . . . . . . . . . . . . . .                                            .   .   .   .   .   68
3.8    Querying in different topologies. . . . . . . . . . . . . . . . . . .                                          .   .   .   .   .   69
3.9    Support framework layer. . . . . . . . . . . . . . . . . . . . . . .                                          .   .   .   .   .   71
3.10   QoS daemon resource distribution layout. . . . . . . . . . . . . .                                            .   .   .   .   .   73
3.11   End-to-end network reservation. . . . . . . . . . . . . . . . . . .                                           .   .   .   .   .   75
3.12   Operating system interface. . . . . . . . . . . . . . . . . . . . .                                           .   .   .   .   .   77
3.13   Interactions between layers. . . . . . . . . . . . . . . . . . . . .                                          .   .   .   .   .   78
3.14   Multiple processes runtime usage. . . . . . . . . . . . . . . . . .                                           .   .   .   .   .   79
3.15   Creating and bootstrapping of a runtime. . . . . . . . . . . . . .                                            .   .   .   .   .   81
3.16   Local service creation. . . . . . . . . . . . . . . . . . . . . . . .                                         .   .   .   .   .   83
3.17   Finding a suitable deployment site. . . . . . . . . . . . . . . . .                                           .   .   .   .   .   84
3.18   Remote service creation without fault-tolerance. . . . . . . . . .                                            .   .   .   .   .   85
3.19   Remote service creation with fault-tolerance: primary-node side.                                              .   .   .   .   .   86
3.20   Remote service creation with fault-tolerance: replica creation. .                                             .   .   .   .   .   87
3.21   Client creation and bootstrap sequence. . . . . . . . . . . . . . .                                           .   .   .   .   .   88

4.1    The peer-to-peer overlay architecture. . . . . . . . . . . . . . . . . . . . . 91
4.2    The overlay bootstrap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

4.3    The cell overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    94
     4.4    The initial binding process for a new peer. . . . . . . . . . . . . . . . . .       95
     4.5    The final join process for a new peer. . . . . . . . . . . . . . . . . . . . .       96
     4.6    Overview of the cell group communications. . . . . . . . . . . . . . . . .          99
     4.7    Cell discovery and management entities. . . . . . . . . . . . . . . . . . .        103
     4.8    Failure handling for non-coordinator (left) and coordinator (right) peers.         105
     4.9    Cell failure (left) and subsequent mesh tree rebinding (right). . . . . . . .      106
     4.10   Discovery service implementation. . . . . . . . . . . . . . . . . . . . . . .      109
     4.11   Fault-Tolerance service overview. . . . . . . . . . . . . . . . . . . . . . .      112
     4.12   Creation of a replication group. . . . . . . . . . . . . . . . . . . . . . . .     113
     4.13   Replication group binding overview. . . . . . . . . . . . . . . . . . . . . .      114
     4.14   The addition of a new replica to the replication group. . . . . . . . . . .        115
     4.15   The control and data communication groups. . . . . . . . . . . . . . . . .         118
     4.16   Semi-active replication protocol layout. . . . . . . . . . . . . . . . . . . .     120
     4.17   Recovery process within a replication group. . . . . . . . . . . . . . . . .       122
     4.18   RPC service layout. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    124
     4.19   RPC invocation types. . . . . . . . . . . . . . . . . . . . . . . . . . . . .      125
     4.20   RPC service architecture without (left) and with (right) semi-active FT.           130
     4.21   RPC service with passive replication. . . . . . . . . . . . . . . . . . . . .      132
     4.22   Actuator service layout. . . . . . . . . . . . . . . . . . . . . . . . . . . .     134
     4.23   Actuator service overview. . . . . . . . . . . . . . . . . . . . . . . . . . .     135
     4.24   Actuator fault-tolerance support. . . . . . . . . . . . . . . . . . . . . . .      137
     4.25   Streaming service layout. . . . . . . . . . . . . . . . . . . . . . . . . . . .    138
     4.26   Streaming service architecture. . . . . . . . . . . . . . . . . . . . . . . . .    139
     4.27   Streaming service with fault-tolerance support. . . . . . . . . . . . . . . .      141
     4.28   Object-to-Object interactions. . . . . . . . . . . . . . . . . . . . . . . . .     143
     4.29   Examples of CPU Partitioning. . . . . . . . . . . . . . . . . . . . . . . .        144
     4.30   Object-to-Object interactions with different partitions. . . . . . . . . . .        145
     4.31   Threading strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    146
     4.32   End-to-End QoS propagation. . . . . . . . . . . . . . . . . . . . . . . . .        148
     4.33   RPC service using CPU partitioning on a quad-core processor. . . . . . .           148
     4.34   Invocation across two distinct partitions. . . . . . . . . . . . . . . . . . .     149
     4.35   Execution Model Pattern. . . . . . . . . . . . . . . . . . . . . . . . . . .       150
     4.36   RPC implementation using the EM/EC pattern. . . . . . . . . . . . . . .            153

     5.1    Overlay evaluation setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
     5.2    Physical evaluation setup. . . . . . . . . . . . . . . . . . . . . . . . . . . 159
     5.3    Overview of the overlay benchmarks. . . . . . . . . . . . . . . . . . . . . 160

5.4    Network organization for the service benchmarks. . . . . . . . . . . . .          .   161
5.5    Overlay bind (left) and rebind (right) performance. . . . . . . . . . . .         .   164
5.6    Overlay query performance. . . . . . . . . . . . . . . . . . . . . . . . .        .   165
5.7    Overlay service deployment performance. . . . . . . . . . . . . . . . . .         .   166
5.8    Service rebind time (left) and latency (right). . . . . . . . . . . . . . .       .   168
5.9    Rebind time and latency results with resource reservation. . . . . . . .          .   170
5.10   Missed deadlines without (left) and with (right) resource reservation. .          .   172
5.11   Invocation latency without (left) and with (right) resource reservation.          .   174
5.12   RPC invocation latency compared with reference middlewares (without
       fault-tolerance). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175




List of Algorithms


4.1    Overlay bootstrap algorithm . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .    93
4.2    Mesh startup . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .    97
4.3    Cell initialization . . . . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .    97
4.4    Cell group communications: receiving-end . . . . . . . . .          .   .   .   .   .   .   .   .   100
4.5    Cell group communications: sending-end . . . . . . . . . .          .   .   .   .   .   .   .   .   102
4.6    Cell Discovery . . . . . . . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   103
4.7    Cell fault handling. . . . . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   107
4.8    Cell fault handling (continuation). . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   108
4.9    Discovery service. . . . . . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   110
4.10   Creation and joining within a replication group . . . . . .         .   .   .   .   .   .   .   .   116
4.11   Primary bootstrap within a replication group . . . . . . .          .   .   .   .   .   .   .   .   117
4.12   Fault-Tolerance resource discovery mechanism. . . . . . . .         .   .   .   .   .   .   .   .   118
4.13   Replica startup. . . . . . . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   119
4.14   Replica request handling . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   119
4.15   Support for semi-active replication. . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   121
4.16   Fault detection and recovery . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   123
4.17   An RPC object implementation. . . . . . . . . . . . . . . . . . . . . . . . 126
4.18   RPC service bootstrap. . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   127
4.19   RPC service implementation. . . . . . . . . . . . . . . . .         .   .   .   .   .   .   .   .   128
4.20   RPC client implementation. . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   129
4.21   Semi-active replication implementation. . . . . . . . . . . .       .   .   .   .   .   .   .   .   130
4.22   Service’s replication callback. . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   131
4.23   Passive Fault-Tolerance implementation. . . . . . . . . . .         .   .   .   .   .   .   .   .   133
4.24   Actuator service bootstrap. . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   135
4.25   Actuator service implementation. . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   136
4.26   Actuator client implementation. . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   136
4.27   Stream service bootstrap. . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   139
4.28   Stream service implementation. . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   140
4.29   Stream client implementation. . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   141
4.30   Joining an Execution Model. . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   151
4.31   Execution Context stack management. . . . . . . . . . . .           .   .   .   .   .   .   .   .   152
4.32   Implementation of the EM/EC pattern in the RPC service.             .   .   .   .   .   .   .   .   154




List of Listings

3.1   Overlay plugin and runtime bootstrap. . . . . . . . . . . .        .   .   .   .   .   .   .   .    82
3.2   Transparent service creation. . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .    83
3.3   Service creation with explicit and transparent deployments.        .   .   .   .   .   .   .   .    85
3.4   Service creation with Fault-Tolerance support. . . . . . . .       .   .   .   .   .   .   .   .    87
3.5   Service client creation. . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .    89
4.1   An RPC IDL example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125




1 Introduction

–Most of the important things in the world have
been accomplished by people who have kept
trying when there seemed to be no hope at all.
                              Dale Carnegie


1.1       Motivation

The development and management of large-scale information systems is pushing the
limits of the current state-of-the-art in middleware frameworks. At EFACEC1, we have
to handle a multitude of application domains, including: information systems used
to manage public, high-speed transportation networks; automated power management
systems to handle smart grids; and power supply systems to monitor power supply units
through embedded sensors. Such systems typically transfer large amounts of streaming
data; have erratic periods of extreme network activity; are subject to relatively common
hardware failures, sometimes for comparatively long periods; and require low jitter and
fast response times for safety reasons, for example, in vehicle coordination.

Target Systems

The main motivation for this PhD thesis was the need to address the requirements of the
public transportation solutions at EFACEC, more specifically, the light-train systems.
One such system is deployed in Oporto's light-train network and is composed of 5 lines,
70 stations, and approximately 200 sensors (partially illustrated in Figure 1.1). Each
station is managed by a computational node, which we designate a peer, responsible
for managing all the local audio, video, and display panels, and low-level sensors such
as track sensors for detecting inbound and outbound trains.

   1
     EFACEC, the largest Portuguese group in the field of electricity, with a strong presence in systems
engineering, namely in public transportation and energy systems, employs around 3000 people and has
a turnover of almost 1000 million euros; it is established in more than 50 countries and exports almost
half of its production (cf. http://www.efacec.com).



The system supports three types of traffic: normal - for regular operations over the
system, such as playing an audio message in a station through an audio codec; critical
- medium priority traffic comprised of urgent events, such as an equipment malfunction
notification, and; alarms - high priority traffic that signals critical events, such as low-level
sensor events. Independently of the traffic type (e.g., event, RPC operation), the system
requires that any operation completes within 2 seconds.
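This strict prioritization can be sketched as a priority dispatch queue in which alarms overtake critical and normal traffic, and every message carries the 2-second deadline. The sketch below is illustrative Python; the class and function names are ours, not part of the actual system:

```python
import heapq
import itertools
import time

# Priority values: lower dispatches first (alarm > critical > normal).
PRIORITY = {"alarm": 0, "critical": 1, "normal": 2}
DEADLINE_S = 2.0  # every operation must complete within 2 seconds


class TrafficQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # preserves FIFO order within a class

    def publish(self, traffic_class, payload):
        deadline = time.monotonic() + DEADLINE_S
        heapq.heappush(self._heap,
                       (PRIORITY[traffic_class], next(self._seq),
                        deadline, payload))

    def dispatch(self):
        """Pop the highest-priority message, skipping expired ones."""
        while self._heap:
            _prio, _seq, deadline, payload = heapq.heappop(self._heap)
            if time.monotonic() <= deadline:
                return payload
            # deadline exceeded: a timing fault, logged in a real system
        return None
```

Publishing a station audio message, a malfunction notification and a track-sensor alarm in that order, dispatch returns the alarm first, then the malfunction, then the audio message.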

From the point of view of distributed architectures, the current deployments are best
matched by P2P infra-structures that are resilient, that allow resources (e.g., a sensor
connected through a serial link to a peer) to be seamlessly mapped onto the logical
topology, the mesh, and that provide support for real-time (RT) and fault-tolerant (FT)
services. Support for both RT and FT is fundamental to meet system requirements.
Moreover, the next generation of light-train solutions requires deployments across cities
and regions that can be overwhelmingly large. This introduces the need for a scalable
hierarchical abstraction, the cell, which is composed of several peers that cooperate to
maintain a portion of the mesh.




                       Figure 1.1: Oporto’s light-train network.




1.2     Challenges and Opportunities

The requirements of our target systems pose a significant number of challenges. The
presence of FT mechanisms, especially those using space redundancy [1], introduces the
need for multiple copies of the same resource (replicas), and these, in turn, ultimately
lead to greater resource consumption.

FT also introduces overhead in the form of latency, another important constraint
when dealing with RT systems. When an operation is performed, irrespective of
whether it is real-time or not, any state change that it causes must be propagated
among the replicas through a replication algorithm, which introduces an additional
source of latency. Furthermore, the recovery time, that is, the time the system needs
to recover from a fault, is an additional source of latency for real-time operations.
There are well-known replication styles that offer different trade-offs between state
consistency and latency.
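To make these trade-offs concrete, the toy primary-backup pair below contrasts a synchronous style, in which the primary only replies after the replica has applied an update (higher latency, no lag), with an asynchronous style, in which the replica lags behind and must replay a log on fail-over (lower latency, longer recovery). This is an illustrative Python sketch with names of our own choosing:

```python
class Replica:
    def __init__(self):
        self.state = {}
        self.log = []          # updates not yet applied (asynchronous style)

    def apply(self, key, value):
        self.state[key] = value


class Primary:
    def __init__(self, replica, synchronous):
        self.state = {}
        self.replica = replica
        self.synchronous = synchronous

    def write(self, key, value):
        self.state[key] = value
        if self.synchronous:
            # higher latency: reply only after the replica applied the update
            self.replica.apply(key, value)
        else:
            # lower latency, but recovery must first drain the lagged log
            self.replica.log.append((key, value))


def recover(replica):
    """Fail-over: replay the lagged log before serving requests."""
    for key, value in replica.log:
        replica.apply(key, value)
    replica.log.clear()
    return replica.state
```

With the synchronous style the replica is always up to date; with the asynchronous style it is empty until `recover` replays the log, which is exactly the longer recovery time mentioned above.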

Our target systems have different traffic types with distinct deadline requirements that
must be supported while using Commercial Off-The-Shelf (COTS) hardware (e.g., Ethernet
networking) and software (e.g., Linux). This requires that the RT mechanisms leverage
the available resources, through resource reservation, while providing different threading
strategies that allow different trade-offs between latency and throughput.

To overcome the overhead introduced by the FT mechanisms, it must be possible
to employ a replication algorithm that does not compromise the RT requirements.
Replication algorithms that offer a higher degree of consistency introduce higher
latency [1, 2] that may be prohibitive for certain traffic types. On the other
hand, certain replication algorithms exhibit lower resource consumption and latency
at the expense of a longer recovery time, which may also be prohibitive.

Considering current state-of-the-art research, we see many opportunities to address
the previous challenges. One is the use of a COTS operating system, which allows for
faster implementation, and thus a smaller development cost, while offering the necessary
infrastructure to build a new middleware system.

P2P networks can be used to provide a resilient infra-structure that mirrors the physical
deployments of our target systems; furthermore, different P2P topologies offer different
trade-offs between self-healing, resource consumption and latency in end-to-end
operations. Moreover, by directly implementing FT on the P2P infra-structure we hope
to lower resource usage and latency enough to allow the integration of RT. By using proven
replication algorithms [1, 2] that offer well-known trade-offs regarding consistency,
resource consumption and latency, we can focus on the actual problem of integrating
real-time and fault-tolerance within a P2P infrastructure.

On the other hand, RT support can be achieved through the implementation of different
threading strategies, resource reservation (through Linux's Control Groups) and the
avoidance of traffic multiplexing through the use of different access points to handle different
traffic priorities. Whilst the use of Earliest Deadline First (EDF) scheduling would
provide greater RT guarantees, this goal will not be pursued due to the lack of maturity
of the current EDF implementations in Linux (our reference COTS operating system).
Because we are limited to priority-based scheduling and resource reservation, we can



only partially support our goal of providing end-to-end guarantees; more specifically,
we enhance our RT guarantees through the use of RT scheduling policies with
over-provisioning to ensure that deadlines are met.




1.3     Problem Definition

The work presented in this thesis focuses on the integration of Real-Time (RT) and
Fault-Tolerance (FT) in a scalable general purpose middleware system. This goal
can only be achieved if the following premises are valid: (a) the FT infrastructure cannot
interfere with RT behavior, independently of the replication policy; (b) the network model
must be able to scale, and; (c) ultimately, the FT mechanisms need to be efficient and aware
of the underlying infrastructure, i.e., network model, operating system and physical
environment.

Our problem definition is a direct consequence of the requirements of our target
systems, and it can be summarized with the following question: "Can we opportunistically
leverage and integrate these proven strategies to simultaneously support soft-RT and FT
to meet the needs of our target systems, even under faulty conditions?"

In this thesis we argue that a lightweight implementation of fault-tolerance mechanisms
in a middleware is fundamental for its successful integration with soft real-time support.
Our approach is novel in that it explores peer-to-peer networking as a means to
implement generic, transparent, lightweight fault-tolerance support. We do this by directly
embedding fault-tolerance mechanisms into peer-to-peer overlays, taking advantage of
their scalable, decentralized and resilient nature. For example, peer-to-peer networks
readily provide the functionality required to maintain and locate redundant copies of
resources. Given their dynamic and adaptive nature, they are promising infra-structures
for developing lightweight fault-tolerant and soft real-time middleware.

Despite these a priori advantages, mainstream generic peer-to-peer middleware systems
for QoS computing are, to our knowledge, unavailable. Motivated by this state of
affairs, by the limitations of the current infra-structure for the information system we
are managing at EFACEC (based on CORBA technology) and, last but not least, by the
comparative advantages of flexible peer-to-peer network architectures, we have designed
and implemented a prototype service-oriented peer-to-peer middleware framework.

The networking layer relies on a modular infra-structure that can handle multiple peer-
to-peer overlays. The support for fault-tolerance and soft real-time features is provided
at this level through the implementation of efficient and resilient services for, e.g.,
resource discovery, messaging and routing. The kernel of the middleware system (the
runtime) is implemented on top of these overlays and uses the above mentioned peer-to-
peer functionalities to provide developers with APIs for the customization of QoS policies
for services (e.g., bandwidth reservation, CPU/core reservation, scheduling strategy,
number of replicas). This approach was inspired by that of TAO [3], which allows
distinct strategies for the execution of tasks by threads to be defined.



1.4      Assumptions and Non-Goals

The distributed model used in this thesis is based on a partially asynchronous
computing model, as defined in [2], extended with fault detectors.

The services and P2P plugin implemented in this thesis only support crash failures. We
consider a crash failure [1] to be characterized by the complete shutdown of a computing
instance in the event of a failure, ceasing to interact any further with the remaining
entities of the distributed system.

Timing faults are handled differently by services and the P2P plugin. In our service
implementations a timing fault is logged (for analysis) with no further action being
performed, whereas in the P2P layer we treat a timing fault as a crash failure, i.e.,
if the remote creation of a service exceeds its deadline, the peer is considered crashed.
This method is also known as process controlled crash, or crash control, as defined in
[4]. In this thesis, we adopted a more relaxed version: if a peer is wrongly suspected of
having crashed, it does not get killed or commit suicide; instead it gets shunned, that
is, the peer is expelled from the overlay and forced to rejoin it, more precisely, it must
rebind using the membership service in the P2P layer.
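A minimal sketch of this shunning policy is given below (illustrative Python; the names are ours, and the real membership service is described in Chapter 4). A peer whose heartbeat exceeds the deadline is expelled from the view rather than killed, and may later rebind:

```python
import time


class Membership:
    def __init__(self, deadline_s=2.0, clock=time.monotonic):
        self.deadline_s = deadline_s
        self.clock = clock          # injectable clock, eases testing
        self.last_seen = {}         # peer id -> last heartbeat time

    def heartbeat(self, peer):
        self.last_seen[peer] = self.clock()

    def shun_expired(self):
        """Expel (shun) peers whose heartbeat exceeded the deadline."""
        now = self.clock()
        shunned = [p for p, t in self.last_seen.items()
                   if now - t > self.deadline_s]
        for p in shunned:
            del self.last_seen[p]   # expelled from the overlay...
        return shunned

    def rebind(self, peer):
        """...but a shunned peer may rejoin through the membership service."""
        self.heartbeat(peer)
```

Note that a shunned peer loses its membership entry but nothing else: a single `rebind` call readmits it, which is the relaxed alternative to process controlled crash described above.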

The fault model used was motivated by the author's experience on several field
deployments of light-train transportation systems, such as the Oporto, Dublin and Tenerife
Light Rail solutions [5]. Due to the use of highly redundant hardware, such as
redundant power supplies and redundant 10-Gbit network ring links, network failures
tend to be short. The most common cause of downtime is related to software bugs,
which mostly result in a crashed computing node. While simultaneous failures can
happen, they are considered rare events.

We also assume that the resource-reservation mechanisms are always available.

In this thesis we do not address value faults or byzantine faults, as they are not a
requirement for our target systems. Furthermore, we do not provide a formal
specification and verification of the system. While this would be beneficial to assess system
correctness, we had to limit the scope of this thesis. Nevertheless, we provide an
empirical evaluation of the system.

We also do not address hard real-time because of the lack of mature support for EDF
scheduling in the Linux kernel. Furthermore, we do not provide a fully optimized
implementation, but only a proof-of-concept to validate our approach. Testing the
system in a production environment is left for future work.


1.5      Contributions

Before undertaking the task of building an entirely new middleware system from scratch,
we explored current solutions, presented in Chapter 2, to see if any of them could
support the requirements of our target system. As we did not find any suitable
solution, we then assessed if it was possible to extend an available solution to meet
those requirements. In our previous work, DAEM [6], we explored the use of JGroups [7]
within a hierarchical P2P mesh, and concluded that the simultaneous support for real-time,
fault-tolerance and P2P requires fine-grained control of resources that is not possible
with the use of "black-box" solutions; for example, it is impossible to have out-of-the-box
support for resource reservation in JGroups.

Given these assessments, we designed and implemented Stheno, which is, to the best
of our knowledge, the first middleware system to seamlessly integrate fault-tolerance
and real-time in a peer-to-peer infrastructure. Our approach was motivated by the
lack of support in current solutions for the timing, reliability and physical deployment
characteristics of our target systems.

To this end, a complete architectural design is proposed that addresses all levels of the
software stack, including kernel space, network, runtime and services, to achieve a
seamless integration. The list of contributions includes: (a) a full specification of a user
Application Programming Interface (API); (b) a pluggable P2P network infrastructure
that aims to better adjust to the target application; (c) support for configurable FT on
the P2P layer with the goal of providing lightweight FT mechanisms that fully enable
RT behavior, and; (d) the integration of resource reservation at all levels of the runtime,
enabling (partial) end-to-end Quality-of-Service (QoS) guarantees.

Previous work [8, 9, 10] on resource reservation focused solely on CPU provisioning
for real-time systems. In this thesis we present Euryale, a network-oriented QoS
framework that features resource reservation with support for a broader range of
subsystems, including CPU, memory, I/O and network bandwidth, on a general purpose
operating system such as Linux. At the heart of this infrastructure resides Medusa, a QoS
daemon that handles the admission and management of QoS requests.
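The admission step of such a QoS daemon can be sketched as a capacity check across subsystems. The Python below is a hypothetical illustration (Medusa's actual interface is described in Chapter 4): a reservation is admitted only if CPU, memory and bandwidth all remain within capacity:

```python
class AdmissionController:
    """Toy admission test across CPU, memory and network bandwidth."""

    def __init__(self, capacity):
        self.capacity = dict(capacity)          # e.g. {"cpu": 1.0, ...}
        self.reserved = {k: 0.0 for k in capacity}

    def admit(self, request):
        # Admit only if every requested subsystem stays within capacity.
        for res, amount in request.items():
            if self.reserved[res] + amount > self.capacity[res]:
                return False
        for res, amount in request.items():
            self.reserved[res] += amount
        return True

    def release(self, request):
        # Give the reserved amounts back, e.g. when a service terminates.
        for res, amount in request.items():
            self.reserved[res] -= amount
```

A request that would overcommit any single subsystem is rejected as a whole, which is the all-or-nothing semantics an admission-controlled reservation scheme needs.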

Well-known threading strategies, such as Leader-Followers [11], Thread-per-Connection [12]
and Thread-per-Request [13], offer different trade-offs between latency and resource
usage [3, 14]. However, they do not support resource reservation, namely, CPU
partitioning. To overcome this limitation, this thesis provides an additional contribution
with the introduction of a novel design pattern (Chapter 4) that is able to integrate
multi-core computing with resource reservation within a configurable framework that
supports these threading strategies. For example, when a client connects to a service it
can specify, through the QoS real-time parameters, a particular threading strategy that
best meets its requirements.
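The idea of selecting a threading strategy through QoS parameters can be illustrated as follows. This is a simplified Python sketch, not the actual Execution Model/Context implementation, and it omits the CPU-partitioning side, which is platform specific: the same handler runs either under thread-per-request, favouring latency, or under a fixed worker pool, bounding resource usage:

```python
import queue
import threading


def thread_per_request(requests, handler):
    """One thread per request: low queuing latency, unbounded thread count."""
    threads = [threading.Thread(target=handler, args=(r,)) for r in requests]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def worker_pool(requests, handler, workers=2):
    """Fixed pool: bounded resource usage, requests may queue up."""
    q = queue.Queue()
    for r in requests:
        q.put(r)

    def worker():
        while True:
            try:
                r = q.get_nowait()
            except queue.Empty:
                return
            handler(r)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def execute(requests, handler, qos):
    """Dispatch according to the client's QoS threading preference."""
    if qos.get("strategy") == "thread-per-request":
        thread_per_request(requests, handler)
    else:
        worker_pool(requests, handler, qos.get("workers", 2))
```

In the real pattern the choice also pins the worker threads to a reserved CPU partition; here only the strategy-selection mechanism is shown.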

We present a full implementation that covers all of the aforementioned architectural
features, including a complete overlay implementation, inspired by the P3 [15] topology,
that seamlessly integrates RT and FT.

To evaluate our implementation and justify our claims, we present a complete
evaluation of both mechanisms. The impact of the resource reservation mechanism is also
evaluated, as well as a comparative evaluation of RT performance against state-of-the-art
middleware systems. The experimental results show that Stheno meets and exceeds the
target system requirements for end-to-end latency and fail-over latency.



1.6     Thesis Outline

The focus of this thesis is the design, implementation and evaluation of a scalable
general purpose middleware that provides the seamless integration of RT and FT. The
remainder of this thesis is organized as follows.

Chapter 2: Overview of Related Work.
This chapter presents an overview of related middleware systems that exhibit support
for RT, FT and P2P, the mandatory requirements of our target system. We started
by searching for an available off-the-shelf solution that could support all of these
requirements or, in its absence, by identifying a current solution that could be extended,
in order to avoid creating a new middleware solution from scratch.

Chapter 3: Architecture.



Chapter 3 describes the runtime architecture of the proposed middleware. We start
by providing a detailed insight into the architecture, covering all layers present in the
runtime. Special attention is given to the presentation of the QoS and resource
reservation infrastructure. This is followed by an overview of the programming model that
describes the most important interfaces present in the runtime, as well as the interactions
that occur between them. The chapter ends with the description of the fundamental
runtime operations, namely: the creation of services with and without FT support,
the deployment strategy, and client creation.

Chapter 4: Implementation.
Chapter 4 describes the implementation of a prototype based on the aforementioned
architecture, and is divided into four parts. In the first part, we present a complete
implementation of a P2P overlay inspired by the P3 [15] topology, while providing
some insight into the limitations of the current prototype. The second part of this chapter
focuses on the implementation of three types of user services, namely, Remote Procedure
Call (RPC), Actuator, and Streaming. These services are thoroughly evaluated in
Chapter 5. In the third part, we describe our support for multi-core computing, through
the presentation of a novel design pattern, the Execution Model/Context. This design
pattern is able to integrate resource reservation, especially CPU partitioning, with
different well-known (and configurable) threading strategies. The fourth and final part
of this chapter describes the most relevant parameters used in the bootstrap of the
runtime.

Chapter 5: Evaluation.
The experimental results are presented in this chapter. It starts by providing details
of the physical setup used throughout the evaluation. It then describes the parameters
used in the testbed suite, which is composed of the three services previously described in
Chapter 4. We then focus on presenting the results of the benchmarks, including an
assessment of the impact of FT on RT, and of the impact of the resource reservation
infra-structure on the overall performance. The chapter ends with a comparative evaluation
against well-known middleware systems.

Chapter 6: Conclusion and Future Work.
This last chapter presents the concluding remarks. It highlights the contributions of
the proposed and implemented middleware, and provides directions for future work.




–By failing to prepare, you are preparing to fail.
                            Benjamin Franklin




                              2. Overview of Related Work


2.1       Overview


This chapter presents an overview of the state-of-the-art in related middleware systems.
As illustrated in Figure 2.1, we are mostly interested in systems that exhibit support
for real-time (RT), fault-tolerance (FT) and peer-to-peer (P2P), the mandatory
requirements of our target system. We started by searching for an available off-the-shelf
solution that could support all of these requirements or, in its absence, identify a current
solution that could be extended, thus avoiding the creation of a new middleware
solution from the ground up. For that reason, we have focused on the intersecting
domains, namely, RT+FT, RT+P2P and FT+P2P, since the systems contained in these
domains are closest to meeting the requirements of our target system.

From a historic perspective, the origins of modern middleware systems can be traced
back to the 1980s, with the introduction of the concept of ubiquitous computing, in
which computational resources are accessible and seen as ordinary commodities, such
as electricity or tap water [2]. Furthermore, the interaction between these resources
and the users was governed by the client-server model [16] and a supporting protocol
called RPC [17]. The client-server model is still the most prevalent paradigm in current
distributed systems.

An important architecture for client-server systems was introduced with the Common
Object Request Broker Architecture (CORBA) standard [18] in the 1990s, but it did
not address real-time or fault-tolerance. Only recently were both the real-time and
fault-tolerance specifications finalized, but they remain mutually exclusive. This means
that a system supporting the real-time specification is not able to support the
fault-tolerance specification, and vice-versa. Nevertheless, seminal work has already
addressed these limitations and offered systems supporting both features, namely,
TAO [3] and MEAD [14]. At the same time, Remote Method Invocation (RMI) [19]
appeared as a Java alternative capable of providing a more flexible and easy-to-use
environment.

[Venn diagram of the RT, FT and P2P domains and their intersections, with example
systems such as DDS, video streaming, CORBA RT/FT, Pastry, distributed storage,
and Stheno at the RT+FT+P2P intersection.]

                         Figure 2.1: Middleware system classes.
In recent years, CORBA entered a steady decline [20] in favor of web-oriented
platforms, such as J2EE [21], .NET [22] and SOAP [23], and of P2P systems. The
web-oriented platforms, such as the JBoss [24] application server, aim to integrate
availability with scalability, but they remain unable to support real-time. Moreover,
while partitioning offers a clean approach to improving scalability, it fails to support
large-scale distributed systems [2]. Alternatively, P2P systems focus on providing
logical organizations, i.e., meshes, that abstract the underlying physical deployment
while providing a decentralized architecture for increased resiliency. These systems
focused initially on resilient distributed storage solutions, such as Dynamo [25], but
progressively evolved to support soft real-time systems, such as video streaming [26].

More recently, Message-Oriented Middleware (MOM) systems [27] offer a distributed
message-passing infrastructure based on an asynchronous interaction model that is
able to overcome the scaling issues present in RPC. A considerable number of
implementations exist, including Tibco [28], Websphere MQ [29] and the Java Messaging
Service (JMS) [30]. MOM systems are sometimes integrated as subsystems in application
server infrastructures, such as JMS in J2EE and Websphere MQ in the Websphere
Application Server.

A substantial body of research has focused on the integration of real-time within
CORBA-based middleware, such as TAO [3] (which later addressed the integration of
fault-tolerance). More recently, QoS-enabled publish-subscribe middleware systems
based on the JAIN SLEE specification [31], such as Mobicents [32], and on the Data
Distribution Service (DDS) specification, such as OpenDDS [33], Connext DDS [34]
and OpenSplice [35], appeared as a way to overcome the current lack of support for
real-time applications in SOA-based middleware systems.

The introduction of fault-tolerance in middleware systems also remains an active topic
of research. CORBA-based middleware systems were a fertile ground for testing
fault-tolerance techniques in a general purpose platform, resulting in the creation of the
CORBA-FT specification [36]. Nowadays, some of this focus has been redirected to
SOA-based platforms, such as J2EE. One of the most popular deployments, JBoss, supports
scalability and availability through partitioning. Each partition is supported by a group
communication framework based on the virtual synchrony model, more specifically, the
JGroups [7] group communication framework.



2.2     RT+FT Middleware Systems

This section overviews systems that provide simultaneous support for real-time and
fault-tolerance. These systems are divided into special purpose solutions, designed for
specific application domains, and CORBA-based solutions, aimed at general purpose
computing.


2.2.1    Special Purpose RT+FT Systems

Special purpose real-time fault-tolerant systems introduced concepts and implementation
strategies that are still relevant in current state-of-the-art middleware systems.

Armada

Armada [37] focused on providing middleware services and a communication
infrastructure to support FT and RT semantics for distributed real-time systems. This was
pursued in two ways, which we now describe.

The first contribution was the introduction of a communication infrastructure able
to provide end-to-end QoS guarantees, for both unicast and multicast primitives.
This was supported by control signaling and QoS-sensitive data transfer (as in the
newer Resource Reservation Protocol (RSVP) and Next Steps in Signaling (NSIS)).



The network infrastructure used a reservation mechanism based on an EDF scheduling
policy, built on top of the Mach OS priority-based scheduling. The initial
implementation was done at user level but was subsequently migrated to kernel
level with the goal of reducing latency.

Many of the architectural decisions regarding RT support were based on the operating
systems available at the time, mainly Mach OS. Despite the advantages of a micro-
kernel approach, its application remains restricted by the underlying cost associated
with message passing and context switching. Instead, a large body of research has
focused on monolithic kernels, especially Linux, which are able to offer the advantages
of the micro-kernel approach, through the introduction of kernel modules, together with
the speed of monolithic kernels.

The second contribution came in the form of a group communication infrastructure
based on a ring topology that ensured the delivery of messages in a reliable and totally
ordered fashion within a bounded time. It also had support for membership management,
offering consistent views of the group through the detection of process and
communication failures. These group communication mechanisms enabled support for FT
through the use of a passive replication scheme that allowed for some inconsistencies
between the primary and the replicas, where the state of the replicas could lag behind
the state of the primary up to a bounded time window.

Mars

Mars [38] provided support for the analysis and deployment of synchronous hard
real-time systems through a static off-line scheduler for the CPU and a Time Division
Multiple Access (TDMA) bus. Mars is able to offer FT support through the use of active
redundancy on the TDMA bus, i.e., sending multiple copies of the same message, and
self-checking mechanisms. Deterministic communication is achieved through the use
of a time-triggered protocol.

The project focused on RT process control, where all the intervening entities are
known in advance. Thus, it does not offer any support for the dynamic admission of
new components, nor does it support on-the-fly fault-recovery.

ROAFTS

The ROAFTS [39, 40] system aims to provide transparent adaptive FT support for
distributed RT applications, consisting of a network of Time-triggered Message-triggered
Objects [41] (TMOs), whose execution is managed by a TMO support manager. The
FT infrastructure consists of a set of specialized TMOs, which include: (a) a generic
fault server, and; (b) a network surveillance [42] manager. Fault detection is assured
by the network surveillance TMO and used by the generic fault server to change the
FT policy with the goal of preserving RT semantics. The system assumes that RT
applications can live with weaker reliability assurances from the middleware under highly
dynamic environments.

Maruti

Maruti [43] aimed to provide a development framework and an infrastructure for the
deployment of hard real-time applications within a reactive environment, focusing on
real-time requirements on single-processor systems. The reactive model is able to
make runtime decisions on the admission of new processing requests without producing
adverse effects on the scheduling of existing requests. Fault-tolerance is achieved through
redundant computation. A configuration language allows the deployment of replicated
modules and services.

Delta-4

Delta-4 [44] provided an in-depth characterization of fault assumptions, for both hosts
and the network. It also demonstrated various techniques for handling them,
namely, passive and active replication for fail-silent hosts and byzantine agreement for
fail-uncontrolled hosts. This work was followed by the Delta-4 Extra Performance
Architecture (XPA) [45], which aimed to add real-time support to the Delta-4 framework
through the introduction of the Leader/Follower replication model (better known as
semi-active replication) for fail-silent hosts. This work also led to the extension of the
communication system to support additional communication primitives (the original
work on Delta-4 only supported the Atomic primitive), namely, Reliable, AtLeastN, and
AtLeastTo.



2.2.2     CORBA-based RT+FT Systems

The support for RT and FT in general purpose distributed platforms remains mostly
restricted to CORBA. While some work was carried out by Sun to introduce RT support
for Java, with the introduction of the Real-Time Specification for Java (RTSJ) [46,
47], it was aimed at the Java 2 Standard Edition (J2SE). The most relevant
implementations are Sun's Java Real-Time System (JRTS) [48] and IBM's Websphere Real-Time
VM [49, 50]. To the best of our knowledge, only WebLogic Real-Time [51] attempted
to provide support for RT in a J2EE environment. Nevertheless, this support seems to
be confined to the introduction of a deterministic garbage collector, through the use of



the RT JRockit JVM, as a way to prevent unpredictable pause times caused by garbage
collection [51].

Previous work on the integration of RT and FT in CORBA-based systems can be
categorized into three distinct approaches: (a) integration, where the base ORB is modified;
(b) services, where systems rely on high-level services to provide FT (and indirectly, RT),
and; (c) interception, where systems intercept client requests to provide transparent
FT and RT.

Integration Approach

Past work on the integration of fault-tolerance in CORBA-like systems was done in
Electra [52], Maestro [53] and AQuA [54]. Electra [52] was one of the predecessors of
the CORBA-FT standard [55, 36]; it focused on enhancing the Object Management
Architecture (OMA) to support transparent and non-transparent fault-tolerance
capabilities. Instead of using message queues or transaction monitors [56], it relied on
object communication groups [57, 58]. Maestro [53] is a distributed layer built on top of
the Ensemble [59] group communication toolkit, and was used by Electra [52] in the Quality
of Service for CORBA Objects (QuO) project [60]. Its main focus was to provide an
efficient, extensible and non-disruptive integration of the object layers with the low-level
QoS system properties. The AQuA [54] system uses both QuO and Maestro on
top of the Ensemble communication groups to provide a flexible and modular approach
that is able to adapt to faults and to changes in the application requirements. Within its
framework, a QuO runtime accepts availability requests from the application and relays
them to a dependability manager, which is responsible for leveraging the requests from
multiple QuO runtimes.

TAO+QuO

The work done in [61] focused on the integration of QoS mechanisms, for both CPU and
network resources, supporting both priority- and reservation-based QoS semantics,
with standard COTS Distributed Real-Time and Embedded (DRE) middleware, more
precisely, TAO [3]. The underlying QoS infrastructure was provided by QuO [60]. The
priority-based approach was built on top of the RT-CORBA specification, which defines
a set of standard features in order to provide end-to-end predictability for operations
within a fixed-priority context [62]. CPU priority-based resource management is
left to the scheduler of the underlying Operating System (OS), whereas network
priority-based management is achieved through the use of the DiffServ architecture [63],
by setting the DSCP codepoint on the IP header of GIOP requests. Based on
various factors, the QuO runtime can dynamically change this priority to adjust to

2.2. RT+FT MIDDLEWARE SYSTEMS


environment changes. Alternatively, the network reservation-based approach relies on
the RSVP [64] signaling protocol to guarantee the desired network bandwidth between
hosts. The QuO runtime monitors the RSVP connections and makes adjustments to
overcome abnormal conditions; for example, in a video service it can drop frames to
maintain stability. CPU reservation is made using the reservation mechanisms present
in the TimeSys Linux kernel. It is left to TAO and QuO to decide on the reservation
policies. This was done to preserve the end-to-end QoS semantics that are only available
at a higher level of the middleware.
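The DiffServ marking step described above can be made concrete with a small sketch (not taken from TAO or QuO; the function names are illustrative). A DSCP codepoint occupies the six high-order bits of the former IPv4 TOS byte, which a middleware can set per socket through a standard socket option:

```python
import socket

def dscp_to_tos(dscp: int) -> int:
    # The DSCP value occupies the six most-significant bits of the
    # former IPv4 TOS byte; the two low bits are reserved for ECN.
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit value")
    return dscp << 2

def mark_socket(sock: socket.socket, dscp: int) -> None:
    # Tag all traffic sent on `sock` with the given DSCP codepoint,
    # as a DiffServ-aware ORB would do for prioritized GIOP requests.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))

# Expedited Forwarding (EF, DSCP 46) is a common choice for
# latency-sensitive invocations.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
mark_socket(s, 46)
s.close()
```

Dynamically re-marking a connection, as the QuO runtime does, then amounts to calling `mark_socket` again with a different codepoint.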

CIAO+QuO

CIAO [65] is a QoS-aware CORBA Component Model (CCM) implementation built on
top of TAO [3] that aims to alleviate the complexity of integrating real-time features on
DRE systems using Distributed Object Computing (DOC) middleware. These DOC systems,
of which TAO is an example, offer configurable policies and mechanisms for QoS,
namely real-time, but lack a programming model capable of separating systemic
aspects from application logic. Furthermore, QoS provisioning must be done in an
end-to-end fashion, and thus has to be applied to several interacting components. It
is difficult, or nearly impossible, to properly configure a component without taking
into account the QoS semantics of the interacting entities. Developers using standard
DOC middleware systems are therefore prone to produce misconfigurations that cause
overall system misbehavior. CIAO overcomes these limitations by applying a wide
range of aspect-oriented development techniques that support the composition of real-
time semantics without intertwining configuration concerns. The support for CIAO's
CCM architecture was done in CORFU [66] and is described below.

Work on the integration of CIAO with Quality Objects (QuO) [60] was done in [67].
The integration of QuO's infrastructure into CIAO extended its limited static QoS provi-
sioning into a full provisioning middleware that is also able to accommodate dynamic
and adaptive QoS provisioning. For example, the setup of an RSVP [64] connection
would otherwise require explicit configuration by the developer, defeating the purpose of
CIAO. Nevertheless, while CIAO is able to compose QuO components, known as Qoskets [68],
it does not provide a solution for component cross-cutting.

DynamicTAO

DynamicTAO [69] focused on providing a reflective middleware model that extends
TAO to support on-the-fly dynamic reconfiguration of its component behavior and
resource management through meta-interfaces. It allows the application to inspect
the internal state/configuration and, if necessary, to reconfigure it in order to adapt

CHAPTER 2. OVERVIEW OF RELATED WORK


to environment changes. It is then possible to select networking protocols,
encodings and security policies to improve the overall system performance in the presence
of unexpected events.

Service-based Approach

An alternative, high-level service approach to CORBA fault-tolerance was taken by the
Distributed Object-Oriented Reliable Service (DOORS) [70], the Object Group Service
(OGS) [71], and the Newtop Object Group Service [72]. DOORS focused on providing
replica management, fault-detection and fault-recovery as a high-level CORBA service.
It did not employ group communication and focused mainly on passive replication, while allowing
the developer to select the desired level of reliability (number of replicas), the replication
policy, the fault-detection mechanism, e.g. an SNMP-enhanced fault detector, and the recovery
strategy. OGS improved over prior approaches by using a group communication protocol
that imposes consensus semantics. Instead of adopting an integrated approach, group
communication services are kept transparent to the ORB through request-level
bridging. Newtop followed a similar approach to OGS but added support
for network partitions, allowing the newly formed sub-groups to continue to operate.

TAO

TAO [3] is a CORBA middleware with support for both RT and FT, compliant
with the OMG's standards for CORBA-RT [73] and CORBA-FT [36]. The
support for RT includes priority propagation, explicit binding, and RT thread pools.
FT is supported through the use of a high-level service, the Replication Manager, that
sits on top of the CORBA stack. This service is the cornerstone of the FT infrastructure,
acting as a rendezvous for all the remaining components, more precisely: monitors that
watch the status of the replicas; replica factories that allow the creation of new replicas;
and fault notifiers that inform the manager of failed replicas. TAO's architecture is
further detailed in Section 2.6.

FLARe and CORFU

FLARe [74] focuses on proactively adapting the replication group to underlying changes
in resource availability. To minimize resource usage, it only supports passive replica-
tion [75]. Its implementation is based on TAO [3]. It adds the following components to
the existing architecture: (a) a Replication Manager, a high-level service that decides on the
strategy employed to address changes in resource availability and faults; (b)
a client interceptor that redirects invocations to the active primary; (c) a redirection
agent that receives updates from the Replication Manager and is used by the interceptor,
and (d) a resource monitor that watches the load on nodes and periodically notifies the
Replication Manager. In the presence of faulty conditions, such as the overload of a node,
the Replication Manager adapts the replication group to the changing conditions by
activating replicas on nodes with lower resource usage and, additionally, by moving
the primary to a more suitable node.
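The adaptation step above can be sketched in a few lines; this is a simplified illustration and not FLARe's actual algorithm (the node names, the load map, and the overload threshold are invented for the example):

```python
from dataclasses import dataclass

@dataclass
class Replica:
    node: str
    is_primary: bool = False

def adapt_group(group, load, overload=0.8):
    # Hypothetical FLARe-style adaptation: if the primary's node is
    # overloaded, promote the replica hosted on the least-loaded node.
    primary = next(r for r in group if r.is_primary)
    if load[primary.node] < overload:
        return group  # nothing to do
    best = min(group, key=lambda r: load[r.node])
    if best is not primary:
        primary.is_primary = False
        best.is_primary = True
    return group

group = [Replica("n1", True), Replica("n2"), Replica("n3")]
load = {"n1": 0.95, "n2": 0.30, "n3": 0.55}
adapt_group(group, load)
primary_node = next(r.node for r in group if r.is_primary)  # "n2"
```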

CORFU [66] extends FLARe to support real-time and fault-tolerance for the Lightweight
CORBA Component Model (LwCCM) [76] standard for DRE systems. It provides
fail-stop behavior, that is, when one component of a failover unit fails, all the
remaining components are stopped, allowing for a clean switch to a new unit. This is
achieved through a fault mapping facility that maps an object failure to the respective
plan(s), with the subsequent component shutdown.

DeCoRAM

The DeCoRAM system [77] aims to provide RT and FT properties through a resource-
aware configuration, executed by a deployment infrastructure. The class of supported
systems is confined to closed DRE systems, where the number of tasks and their respective
execution and resource requirements are known a priori and remain invariant throughout
the system's life-cycle. As the tasks and resources are static, it is possible to optimize the
allocation of replicas on the available nodes. The allocation algorithm is configurable,
allowing the user to choose the best approach for a particular application domain.
DeCoRAM provides a custom allocation algorithm named FERRARI (FailurE, Real-
Time, and Resource Awareness Reconciliation Intelligence) that addresses the opti-
mization problem while satisfying both RT and FT system constraints. Because of the
limited resources normally available in DRE systems, DeCoRAM only supports passive
replication [75], thus avoiding the high overhead associated with active replication [78].
The allocation algorithm calculates the component inter-dependencies and deploys the
execution plan using the underlying middleware infrastructure, which is provided by
FLARe [74].

Interception-based Approach

The work done in Eternal [79, 80] focused on providing transparent fault-tolerance for
CORBA, ensuring strong replica consistency through the use of a reliable totally-ordered
multicast protocol. This approach relieves the developer from having to deal with low-
level mechanisms for supporting fault-tolerance. In order to maintain compatibility with
the CORBA-FT standard, Eternal exposes the replication manager, fault detector, and
fault notifier to developers. However, the main infrastructure components are located
below the ORB, for both efficiency and transparency purposes. These components
include logging-recovery mechanisms, replication mechanisms, and interceptors. The
replication mechanisms provide support for warm and cold passive replication, as well as active
replication. The interceptors capture the CORBA IIOP requests and replies (based on
TCP/IP) and redirect them to the fault-tolerance infrastructure. The logging-recovery
mechanisms are responsible for managing the logging and checkpointing, and for performing
the recovery protocols.

MEAD

MEAD focuses on providing fault-tolerance support in a non-intrusive way by en-
hancing distributed RT systems with (a) a transparent, although tunable FT, that
is (b) proactively dependable through (c) resource awareness, and that has (d) scalable and
fast fault-detection and fault-recovery. It uses CORBA-RT, more specifically TAO,
as a proof-of-concept. An important contribution of this work is the balancing of fault-
tolerance resource consumption against RT behavior. MEAD is detailed further
in Section 2.6.



2.3     P2P+RT Middleware Systems

While most of the focus of P2P systems has been on the support of FT, there is a
growing interest in using these systems for RT applications, namely in streaming
and QoS support. This section provides an overview of P2P systems that support RT.


2.3.1     Streaming

Streaming, and especially Video on Demand (VoD), was a natural evolution of the first
file-sharing P2P systems [81, 82]. With the steady increase of network bandwidth on the
Internet, it is now possible to offer high-quality multimedia streaming solutions to the
end-user. These systems focus on providing near soft real-time performance by splitting
streams across distributed P2P storage and redundant network channels.

PPTV

The work done in [26] provides the background for the analysis, design and behavior
of VoD systems, focusing on the PPTV system [83]. An overview of the different
replication strategies and their respective trade-offs is presented, namely, Least Recently
Used (LRU) and Least Frequently Used (LFU). The latter uses a weighted estimation
based on the local cache completion and on the availability-to-demand ratio (ATD).



Each stream is divided into chunks. The size of these chunks has a direct influence on
the efficiency of the streaming: smaller pieces facilitate replication, and thus
overall system load-balancing, whereas bigger pieces decrease the resource overhead
associated with piece management and the bandwidth consumed by protocol
control. To allow for a more efficient piece selection, three algorithms are proposed:
sequential, rarest-first and anchor-based. To ensure real-time behavior, the system is
able to offer different levels of aggressiveness, including: issuing simultaneous requests of the
same type to neighboring peers; simultaneously sending different content requests to
multiple peers; and requesting from a single peer (making a more conservative use of
resources).
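The rarest-first policy mentioned above can be sketched as follows; this is a simplified illustration of the general technique, not PPTV's actual implementation, and the chunk numbering is invented for the example:

```python
from collections import Counter

def rarest_first(have, peers, total):
    # Pick the missing chunk held by the fewest neighboring peers,
    # so scarce chunks get replicated before they disappear.
    missing = set(range(total)) - have
    if not missing:
        return None
    availability = Counter()
    for pieces in peers:
        availability.update(pieces & missing)
    # Chunks nobody holds count as availability 0 and win outright;
    # ties are broken by the lowest chunk id for determinism.
    return min(missing, key=lambda c: (availability[c], c))

# Three neighbors advertise their chunk bitmaps; we already hold chunk 2.
peers = [{0, 1, 2}, {1, 2}, {2, 3}]
nxt = rarest_first(have={2}, peers=peers, total=5)  # chunk 4: held by no one
```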

Thicket

Efficient data dissemination over unstructured P2P networks was addressed by Thicket [84].
The work used multiple trees to ensure efficient usage of resources while providing
redundancy in the presence of node failures. In order to improve load-balancing across
the nodes, the protocol tries to minimize the number of nodes that act as interior
nodes in several trees, thus reducing the load produced by forwarding messages.
The protocol also defines a reconfiguration algorithm for balancing load across
neighboring nodes and a tree repair procedure to handle tree partitions. Results show
that the protocol is able to quickly recover from a large number of simultaneous node
failures and to balance the load across the existing nodes.
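The forwarding-load metric that Thicket's reconfiguration tries to keep low can be illustrated with a small sketch; the representation of each tree as a child-to-parent map, and the function name, are assumptions made for this example:

```python
def forwarding_load(trees):
    # Count, for each node, the number of trees in which it is an
    # interior (forwarding) node, i.e. appears as someone's parent.
    # Thicket's reconfiguration tries to keep this count near one.
    load = {}
    for parent_of in trees:          # each tree as a child -> parent map
        interior = set(parent_of.values())
        for node in interior:
            load[node] = load.get(node, 0) + 1
    return load

# Two dissemination trees over nodes a..d.
t1 = {"b": "a", "c": "a", "d": "c"}   # interior nodes: a, c
t2 = {"a": "b", "c": "b", "d": "c"}   # interior nodes: b, c
load = forwarding_load([t1, t2])      # c forwards in both trees
```

A node such as `c` above, interior in both trees, is exactly the kind of hotspot the reconfiguration algorithm would try to demote to a leaf in one of them.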



2.3.2     QoS-Aware P2P

Until recently, P2P systems have focused on providing resiliency and throughput,
thus failing to address the increasing need for QoS in latency-sensitive applications,
such as VoD.

QRON

QRON [85] aimed to provide a general unified framework, in contrast to application-
specific overlays. The overlay brokers (OBs), present at each autonomous system in
the Internet, support QoS routing for overlay applications through resource negotiation
and allocation, and topology discovery. The main goal of QRON is to find a path that
satisfies the QoS requirements, while balancing the overlay traffic across the OBs and
overlay links. For this it proposes two distinct algorithms, a "modified shortest distance
path" (MSDP) and a "proportional bandwidth shortest path" (PBSP).

                                                                                      45
CHAPTER 2. OVERVIEW OF RELATED WORK


GlueQoS

GlueQoS [86] focused on the dynamic and symmetric negotiation of QoS features
between two communicating processes. It provides a declarative language that
allows the specification of the QoS feature set (and possible conflicts), and a runtime
negotiation mechanism that finds a set of QoS features that is valid at both
ends of the interacting components. Contrary to aspect-oriented programming [65], which
only enforces QoS semantics at deployment time, GlueQoS offers a runtime solution that
remains valid throughout the duration of the session between a client and a server.
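A much-simplified sketch of such a negotiation is shown below. GlueQoS uses a richer declarative language and conflict model, so the set-based handshake, the feature names, and the conflict-resolution rule here are purely illustrative:

```python
def negotiate(client_feats, server_feats, conflicts):
    # Start from the features both peers support, then drop one member
    # of any pair declared conflicting. The victim choice is arbitrary
    # but deterministic; a real negotiator would consult preferences.
    agreed = set(client_feats) & set(server_feats)
    for a, b in conflicts:
        if a in agreed and b in agreed:
            agreed.discard(b)
    return agreed

agreed = negotiate(
    client_feats={"encryption", "compression", "low-latency"},
    server_feats={"encryption", "low-latency", "auditing"},
    conflicts=[("encryption", "low-latency")],
)  # only "encryption" survives
```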



2.4      P2P+FT Middleware Systems

Research on P2P systems has been largely dominated by the pursuit of fault-
tolerance, such as in distributed storage, mainly due to the resilient and decentralized
nature of P2P infrastructures.


2.4.1     Publish-subscribe

P2P publish-subscribe systems implement a messaging pattern in which the publishers
(senders) do not have a predefined set of subscribers (receivers) for their messages.
Instead, the subscribers must first register their interest with the target publisher
before starting to receive published messages. This decoupling between publishers
and subscribers allows for better scalability, and ultimately, performance.
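The decoupling described above can be illustrated with a minimal topic-based broker sketch (not tied to Scribe, Hermes, or any particular system):

```python
from collections import defaultdict

class Broker:
    # Minimal topic-based publish-subscribe core: publishers never see
    # the subscriber list, which is what enables the decoupling (and
    # hence the scalability) discussed above.
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subs[topic].append(callback)

    def publish(self, topic, message):
        for deliver in self._subs[topic]:
            deliver(message)

broker = Broker()
inbox = []
broker.subscribe("alarms", inbox.append)
broker.publish("alarms", "overvoltage on feeder 3")
broker.publish("metrics", "ignored")   # no subscribers, silently dropped
```

In a P2P setting the broker itself is distributed, e.g. as a multicast tree rooted at the topic's rendezvous node, but the subscribe/publish contract stays the same.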

Scribe

Scribe [87] aimed to provide a large-scale event notification infrastructure, built on
top of Pastry [88], for topic-based publish-subscribe applications. Pastry is used to
support topics and subscriptions and to build multicast trees. Fault-tolerance is provided
by the self-organizing capabilities of Pastry, through the adaptation to network failures
and subsequent multicast tree repair. Event dissemination is best-effort
and without any delivery order guarantees. Nevertheless, it is possible
to enhance Scribe to support consistent ordering through the implementation of
sequential time-stamping at the root of the topic. To ensure strong consistency and
tolerate topic root node failures, an implementation of a consensus algorithm, such as
Paxos [89], is needed across the set of replicas (of the topic root).
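The sequential time-stamping just mentioned can be sketched as follows; this is illustrative only, since Scribe itself does not ship such a component:

```python
import itertools

class TopicRoot:
    # Sketch of root-based sequencing: the topic's root assigns a
    # monotonically increasing sequence number to every event, so all
    # subscribers can apply the same total order regardless of the
    # order in which the multicast delivers the events.
    def __init__(self):
        self._seq = itertools.count(1)

    def stamp(self, event):
        return (next(self._seq), event)

root = TopicRoot()
a = root.stamp("join")     # (1, "join")
b = root.stamp("update")   # (2, "update")
```

Replicating this counter is precisely what requires the consensus protocol noted above: all root replicas must agree on the next sequence number.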



Hermes

Hermes [90] focused on providing a distributed event-based middleware with an underly-
ing P2P overlay for scalability and reliability. Inspired by work done on Distributed Hash
Table (DHT) overlay routing [88, 91], it also has some notion of rendezvous, similar
to [81]. It bridges the gap between programming-language type semantics and low-level
event primitives by introducing the concepts of event-type and event-attributes, which
have some common ground with an Interface Definition Language (IDL) in the RPC
context. In order to improve performance, it is possible, during the subscription process, to
attach a filter expression to the event attributes. Several algorithms are proposed for
improving availability, but they all provide weak consistency properties.



2.4.2     Resource Computing

There is a growing interest in harvesting and managing the spare computing power
of the increasing number of networked devices, both public and private, as reported
in [92, 93, 94, 95]. Some relevant examples are:

BOINC

BOINC (Berkeley Open Infrastructure for Network Computing) [96] aimed to facili-
tate the harvesting of public computing resources by the scientific research community.
BOINC implements a redundant computing mechanism to prevent malicious or erro-
neous computational results. Each project specifies the number of results that should be
created for each "workunit", i.e. the basic unit of computation to be performed. When
some number of results are available, an application-specific function is called to
evaluate them and possibly choose a canonical result. If no consensus is achieved,
or if the results simply fail, a new set of results is computed. This process repeats until
a successful consensus is achieved or an application-defined timeout occurs.
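The redundant-computing cycle described above can be sketched as follows; the quorum rule stands in for BOINC's application-specific validation function, and the names are illustrative:

```python
from collections import Counter

def validate(results, quorum):
    # Pick a canonical result once at least `quorum` replicated results
    # agree; returning None signals that a new batch of results must be
    # generated for this workunit.
    tally = Counter(results)
    value, votes = tally.most_common(1)[0]
    return value if votes >= quorum else None

canonical = validate(["3.14", "3.14", "2.71"], quorum=2)   # agreement
retry = validate(["3.14", "2.71", "1.41"], quorum=2)       # recompute
```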

P2P-MapReduce

Developed at Google, MapReduce [97] is a programming model that is able to parallelize
the processing of large data sets in a distributed environment. It follows a master-slave
model, where a master distributes the data set across a set of slaves, returning at the end
the computational results (from the map or reduce tasks). MapReduce provides fault-
tolerance for slave nodes by reassigning a failed job to an alternative active slave,
but lacks support for master failures. P2P-MapReduce [98] provides such fault-tolerance by
resorting to two distinct P2P overlays, one containing the currently available masters in
the system, and the other the active slaves. When a user submits a MapReduce
job, it queries the master overlay for a list of the available masters (ordered by their
workload). It then selects a master node and the number of replicas. After this, the
master node notifies its replicas that they will participate in the current job. A master
node is responsible for periodically synchronizing the state of the job over its replica set.
In case of failure, a distributed procedure is executed to elect a new master among
the active replicas. Finally, the master selects the set of slaves from the slave overlay,
using a performance metric based on workload and CPU performance, and starts the
computation.
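The job-submission step can be sketched as follows; this simplifies the protocol above, and the node names and workload metric are invented for the example:

```python
def pick_master_and_replicas(masters, workload, n_replicas):
    # Rank the masters returned by the master overlay by workload,
    # take the least-loaded one as coordinator, and the next
    # `n_replicas` as its backup set for state synchronization.
    ranked = sorted(masters, key=lambda m: workload[m])
    return ranked[0], ranked[1:1 + n_replicas]

masters = ["m1", "m2", "m3", "m4"]
workload = {"m1": 0.7, "m2": 0.2, "m3": 0.4, "m4": 0.9}
master, backups = pick_master_and_replicas(masters, workload, 2)
# master == "m2"; backups == ["m3", "m1"]
```

On a master failure, re-running the election among `backups` (which hold the synchronized job state) yields the replacement coordinator.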



2.4.3     Storage

Storage systems were one of the most prevalent applications of first-generation P2P sys-
tems. Evolving from early file-sharing systems, and with the help of DHT middleware,
they have now become a common choice for large-scale storage systems in both industry and
academia.

openDHT

Work done in [99] aimed to provide a lightweight framework for P2P storage using
DHTs (as in [88, 91]) in a public environment. The key challenge was to handle
mutually untrusting clients, while guaranteeing fairness in the access to, and allocation of,
storage. The work was able to provide fair access to the underlying storage capacity,
under the assumption that storage capacity is free. Because of its intrinsically fair
approach, the system is unable to provide any type of Service Level Agreement (SLA)
to its clients, thus reducing the domain of applications that can use it.

Dynamo

Recent research on data storage and distribution at Amazon [25] focuses on key-value
approaches using P2P overlays, more precisely DHTs, to overcome the well-explored
limitation of simultaneously providing high availability and strong consistency (through
synchronous replication) [100, 101]. The approach taken was to use an optimistic
replication scheme that relies on asynchronous replica synchronization (also known
as passive replication). Consistency conflicts between different replicas, caused
by network and server failures, are resolved at 'read time', as opposed to the
more traditional 'write time' strategy, in order to maximize the write
availability of the system. Such conflicts are resolved by the services themselves, allowing for a
more efficient resolution (although the system offers a default 'last value holds' strategy
to the services). Dynamo offers efficient key-value storage, while maximizing the availability
of write operations. Nevertheless, the ring-based overlay hampers the scalability of
the system and, depending on the partitioning strategy used, the membership process
does not seem efficient.
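Read-time reconciliation with the default 'last value holds' policy can be sketched as follows; note that Dynamo actually tracks causality with vector clocks, so the scalar timestamps used here are a simplifying assumption:

```python
def read_repair(replica_versions):
    # Each replica returns a (timestamp, value) pair; under the default
    # 'last value holds' policy the newest value wins, and the divergent
    # replicas are overwritten with it ("read repair"). A real service
    # could instead merge the divergent values itself at read time.
    ts, value = max(replica_versions)
    repaired = [(ts, value)] * len(replica_versions)
    return value, repaired

value, repaired = read_repair([(3, "cart:v3"), (5, "cart:v5"), (3, "cart:v3")])
# value == "cart:v5"; all three replicas now hold (5, "cart:v5")
```

Resolving at read time keeps writes always available: conflicting writes are accepted unconditionally and the cost of reconciliation is paid only when the key is next read.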




2.5      P2P+RT+FT Middleware Systems

These types of systems are a natural evolution of previous RT+FT middleware
systems. They aim to provide scalability and resilience through a P2P network infra-
structure that offers lightweight FT mechanisms, allowing them to support
soft RT semantics. We first proposed an architecture [102, 103] for a general-purpose
middleware that aimed to integrate FT into the P2P network layer, while being able
to provide RT support. The first implementation of this architecture, in Java, was done
in DAEM [6, 104]. This work used a hierarchical tree-based P2P overlay based on P3 [15]. FT
support was provided at all levels of the tree, resulting in a high availability rate, but
the use of JGroups [7] for maintaining strong consistency, both for mesh and service
data, resulted in high overhead. Due to its highly coupled tree architecture, faults had a
major impact on availability when they occurred near the root node, as they produced
cascading failures. Initial support for RT was provided, but the high overhead of the
replication infrastructure limited its applicability.




2.6      A Closer Look at TAO, MEAD and ICE

This section provides a closer look at the middleware systems that provided us with
several strategies and insights used in the design and implementation of Stheno, our
middleware solution that is able to support RT, FT and P2P.

All of the systems discussed, namely TAO, MEAD, and ICE, share a service-oriented
architecture with a client-server network model. In terms of RT, both TAO and MEAD
support the RT-CORBA standard, while ICE only supports best-effort invocations. As
for FT support, TAO and ICE use high-level services, whereas MEAD uses a hybrid
approach that combines both low- and high-level services.



2.6.1    TAO


TAO is a classical RPC middleware and therefore only supports the client-server network
model. Name resolution is provided by a high-level service, representing a clear point-
of-failure and a bottleneck.

RT Support. TAO supports the RT-CORBA 1.0 specification, with the most impor-
tant features being: (a) priority propagation; (b) explicit binding; and (c) RT thread
pools.

Priority propagation ensures that a request maintains its priority across a chain of
invocations. A client issues a request to an object A that, in turn, issues an invocation
on another object B. The priority of the request at object A is then used to make the invocation
on object B. There are two types of propagation: server-declared priorities and
client-propagated priorities. In the first, the server dictates the priority that will
be used when processing an incoming invocation. In the second, the priority of
the invocation is encoded within the request, so the server processes the request at the
priority specified by the client.
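The two propagation models can be sketched as follows (an illustration of the concept, not TAO's API; the object and priority values are invented):

```python
from dataclasses import dataclass

@dataclass
class Request:
    op: str
    priority: int          # client priority, encoded in the request

SERVER_DECLARED = {"ObjectA": 10}   # per-object declared priorities

def effective_priority(obj, request, model):
    # Under the server-declared model the target object's declaration
    # wins; under the client-propagated model the priority travels
    # with the request itself.
    if model == "server_declared":
        return SERVER_DECLARED[obj]
    if model == "client_propagated":
        return request.priority
    raise ValueError(model)

req = Request("update", priority=42)
p1 = effective_priority("ObjectA", req, "server_declared")    # 10
p2 = effective_priority("ObjectA", req, "client_propagated")  # 42
```

In a chained invocation, A would reuse `p2` when calling B, which is exactly what preserves the end-to-end priority.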

A source of unbounded priority inversion is the use of multiplexed communica-
tion channels. To overcome this, the RT-CORBA specification mandates that network
channels be pre-established, avoiding the latency caused by their creation. This
model allows two possible policies: (a) a private connection between the client and the
server, or (b) a priority-banded connection that can be shared but limits the priority of
the requests that can be made on it.

In CORBA, a thread pool uses a threading strategy, such as leader-followers [11], with
the support of a reactor (an object that handles network event de-multiplexing), and is
normally associated with an acceptor (an entity that handles incoming connections),
a connection cache, and a memory pool. In classic CORBA, a high-priority thread can
be delayed by a low-priority one, leading to priority inversion. In an effort to avoid
this unwanted side-effect, the RT-CORBA specification defines the concept of thread
pool lanes.

All the threads belonging to a thread pool lane have the same priority, and thus only
process invocations that have the same priority (or a band that contains that priority).
Because each lane has its own acceptor, memory pool and reactor, the risk of priority
inversion is greatly minimized, at the expense of a greater resource usage overhead.
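Lane selection can be sketched as follows; this is a minimal illustration of the banding idea, not RT-CORBA's actual interface, and the band boundaries are invented:

```python
def select_lane(lanes, priority):
    # Each lane serves a closed priority band; an invocation is handled
    # only by the lane whose band contains its priority, so threads of
    # different priorities never compete inside one lane.
    for low, high in lanes:
        if low <= priority <= high:
            return (low, high)
    return None   # no lane accepts this priority: reject the request

lanes = [(0, 9), (10, 19), (20, 31)]
lane = select_lane(lanes, 15)      # lands in the (10, 19) lane
```

Since each lane also owns its acceptor, reactor, and memory pool, this dispatch decision is the only point where requests of different priorities share a code path.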

FT Support. In an effort to combine RT and FT semantics, the proposed replication
style, semi-active, was heavily based on Delta4 [45]. This strategy avoids the
latency associated with both warm and cold passive replication [105] and the high
overhead and non-determinism of active replication, but represents an extension to the
FT specification.




              Figure 2.2: TAO’s architectural layout (adapted from [3]).



Figure 2.2 shows the architectural overview of TAO. The support for FT is achieved
through the use of a set of high-level services built on top of TAO. These services include
a Fault Notifier, a Fault Detector and a Replication Manager.

The Replication Manager is the central component of the FT infrastructure. It acts
as a central rendezvous for the remaining FT components, and it has the responsibility
of managing the replication groups' life-cycle (creation/destruction) and of performing group
maintenance, that is, electing a new primary, removing faulty replicas, and
updating group information.

It is composed of three sub-components: (a) a Group Manager, which manages the group
membership operations (adding and removing elements), allows changing the primary
of a given group (for passive replication only), and allows the manipulation and retrieval
of group member locations; (b) a Property Manager, which allows the manipulation of
replication properties, such as the replication style; and (c) a Generic Factory, the entry point
for creating and destroying objects.

The Fault Detector is the most basic component of the FT infrastructure. Its role is to
monitor components, processes and processing nodes, and to report any failures to the
Fault Notifier. In turn, the Fault Notifier aggregates these failure reports and forwards
them to the Replication Manager.

The FT bootstrapping sequence is as follows: (a) the Naming Service is started;
(b) the Replication Manager is started; (c) the Fault Notifier is started; (d) the Fault
Notifier finds the Replication Manager and registers itself with it; (e) in response, the
Replication Manager connects as a consumer to the Fault Notifier; (f) each node that
is going to participate starts a Fault Detector Factory and a Replica Factory, which
in turn register themselves with the Replication Manager; (g) a group creation request is
made to the Replication Manager (by a foreign entity referred to as the Object Group
Creator ), followed by a request for the list of available Fault Detector Factories and
Replica Factories; (h) this is followed by a request to create an object group in the
Generic Factory; (i) the Object Group Creator then bootstraps the desired number
of replicas using the Replica Factory at each target node; in turn, each Replica
Factory creates the actual replica and, at the same time, starts a Fault Detector
at each site using the Fault Detector Factory; each one of these detectors finds the
Replication Manager, retrieves the reference to the Fault Notifier, and connects to
it as a supplier; (j) each replica is added to the object group by the Object Group
Creator using the Group Manager at the Replication Manager; (k) at this point, a
client is started, retrieves the object reference from the Naming Service, and makes
an invocation on the group, which is then carried out by the primary of the replication
group.

Proactive FT Support. An alternative approach was proposed by FLARe [74],
which focuses on proactively adapting the replication group to the load present in the
system. The replication style is limited to semi-active replication using state-transfer,
commonly referred to simply as passive replication.

Figure 2.3 shows the architectural overview of FLARe. This new architecture adds
three new components to TAO's FT infrastructure: (a) a client interceptor, that redi-
rects the invocations to the proper server, as the initial reference may have been
changed by the proactive strategy in response to a load change; (b) a redirection agent
that receives the updates with these changes from the Replication Manager; and (c)
a resource monitor that monitors the load on a processing node and sends periodic
updates to the Replication Manager.

Figure 2.3: FLARe's architectural layout (adapted from [74]).

In the presence of abnormal load fluctuations, the Replication Manager adapts the
replication group to the new conditions by creating replicas on nodes with lower usage
and, if required, by moving the primary to a more suitable replica.

TAO's fault-tolerance support relies on a centralized infrastructure, with its main
component, the Replication Manager, representing a major obstacle to the system's
scalability and resiliency. No mechanisms are provided to replicate this entity.



2.6.2    MEAD

MEAD focused on providing fault-tolerance support in a non-intrusive way, enhancing
distributed RT systems with a transparent, although tunable FT, that is
proactively dependable through resource awareness, with scalable and fast fault-
detection and fault-recovery. It uses CORBA-RT, more specifically TAO, as a proof-of-
concept.

Transparent Proactive FT Support. MEAD's architecture contains three major
components, namely, the Proactive FT Manager, the MEAD Recovery Manager and the
MEAD Interceptor. The underlying communication is provided by Spread, a group
communication framework that offers reliable totally-ordered multicast, guaranteeing
consistency for both component and node membership.

The MEAD Interceptor provides the usual interception of system calls between the
application and the underlying operating system. This approach allows a transparent and
non-intrusive way to enhance the middleware with fault-tolerance.




            Figure 2.4: MEAD’s architectural layout (adapted from [14]).


Figure 2.4 shows the architectural overview of MEAD. The main component of the
MEAD system is the Proactive FT Manager, which is embedded within the interceptors
on both server and client. It has the responsibility of monitoring the resource usage at
each server and initiating a proactive recovery scheme based on a two-step threshold.
When resource usage rises above the first threshold, the proactive manager
sends a request to the MEAD Recovery Manager to launch a new replica. If usage
rises above the second threshold, the proactive manager starts migrating the
replica's clients to the next non-faulty replica server.

The MEAD Recovery Manager has some similarities with the Replication Manager of
CORBA-FT, as it must also launch new replicas in the presence of failures (node
or server). In MEAD, however, the recovery manager does not follow a
centralized architecture, as in TAO or FLARe, where all the components of the
FT infrastructure are connected to the replication manager. Instead, they are
connected by a reliable totally-ordered group communication framework that
establishes an implicit agreement at each communication round. These frameworks
also provide a notion of view, i.e. an instantaneous


PhD Thesis

  • 1. Rolando da Silva Martins On the Integration of Real-Time and Fault-Tolerance in P2P Middleware Departamento de Ciˆncia de Computadores e Faculdade de Ciˆncias da Universidade do Porto e 2012
  • 2.
  • 3. Rolando da Silva Martins On the Integration of Real-Time and Fault-Tolerance in P2P Middleware Tese submetida ` Faculdade de Ciˆncias da a e Universidade do Porto para obten¸˜o do grau de Doutor ca em Ciˆncia de Computadores e Advisors: Prof. Fernando Silva and Prof. Lu´ Lopes ıs Departamento de Ciˆncia de Computadores e Faculdade de Ciˆncias da Universidade do Porto e Maio de 2012
  • 4.
  • 5. To my wife Liliana, for her endless love, support, and encouragement. 3
  • 6.
  • 7. –Imagination is everything. It is the preview of Acknowledgments life’s coming attractions. Albert Einstein To my soul-mate Liliana, for her endless support on the best and worst of times. Her unconditional love and support helped me to overcome the most daunting adversities and challenges. I would like to thank EFACEC, in particular to Cipriano Lomba, Pedro Silva and Paulo Paix˜o, for their vision and support that allowed me to pursuit this Ph.D. a I would like to thank the financial support from EFACEC, Sistemas de Engenharia, S.A. and FCT - Funda¸˜o para a Ciˆncia e Tecnologia, with Ph.D. grant SFRH/B- ca e DE/15644/2006. I would especially like to thank my advisors, Professors Lu´ Lopes and Fernando Silva, ıs for their endless effort and teaching over the past four years. Lu´ thank you for steering ıs, me when my mind entered a code frenzy, and for teaching me how to put my thoughts to words. Fernando, your keen eye is always able to understand the “big picture”, this was vital to detected and prevent the pitfalls of building large and complex middleware systems. To both, I thank you for opening the door of CRACS to me. I had an incredible time working with you. A huge thank you to Professor Priya Narasimhan, for acting as an unofficial advisor. She opened the door of CMU to me and helped to shape my work at crucial stages. Priya, I had a fantastic time mind-storming with you, each time I managed to learn something new and exciting. Thank you for sharing with me your insights on MEAD’s architecture, and your knowledge on fault-tolerance and real-time. Lu´ Fernando and Priya, I hope someday to be able to repay your generosity and ıs, friendship. It is inspirational to see your passion for your work, and your continuous effort on helping others. 
I would like to thank Jiaqi Tan for taking the time to explain me the architecture and functionalities of MapReduce, and Professor Alysson Bessani, for his thoughts on my work and for his insights on byzantine failures and consensus protocols. I also would like to thank CRACS members, Professors Ricardo Rocha, Eduardo Cor- reia, V´ Costa, and Inˆs Dutra, for listening and sharing their thoughts on my work. ıtor e A big thank you to Hugo Ribeiro, for his crucial help with the experimental setup. 5
  • 8.
  • 9. –All is worthwhile if the soul is not small. Abstract Fernando Pessoa The development and management of large-scale information systems, such as high- speed transportation networks, are pushing the limits of the current state-of-the-art in middleware frameworks. These systems are not only subject to hardware failures, but also impose stringent constraints on the software used for management and there- fore on the underlying middleware framework. In particular, fulfilling the Quality- of-Service (QoS) demands of services in such systems requires simultaneous run-time support for Fault-Tolerance (FT) and Real-Time (RT) computing, a marriage that remains a challenge for current middleware frameworks. Fault-tolerance support is usually introduced in the form of expensive high-level services arranged in a client-server architecture. This approach is inadequate if one wishes to support real-time tasks due to the expensive cross-layer communication and resource consumption involved. In this thesis we design and implement Stheno, a general purpose P2 P middleware architecture. Stheno innovates by integrating both FT and soft-RT in the architecture, by: (a) implementing FT support at a much lower level in the middleware on top of a suitable network abstraction; (b) using the peer-to-peer mesh services to support FT, and; (c) supporting real-time services through a QoS daemon that manages the under- lying kernel-level resource reservation infrastructure (CPU time), while simultaneously (d) providing support for multi-core computing and traffic demultiplexing. Stheno is able to minimize resource consumption and latencies from FT mechanisms and allows RT services to perform withing QoS limits. Stheno has a service oriented architecture that does not limit the type of service that can be deployed in the middleware. 
Whereas current middleware systems do not provide a flexible service framework, as their architecture is normally designed to support a specific application domain, for example, the Remote Procedure Call (RPC) service. Stheno is able to transparently deploy a new service within the infrastructure without the user assistance. Using the P2 P infrastructure, Stheno searches and selects a suitable node to deploy the service with the specified level of QoS limits. We thoroughly evaluate Stheno, namely evaluate the major overlay mechanisms, such as membership, discovery and service deployment, the impact of FT over RT, with and without resource reservation, and compare with other closely related middleware frameworks. Results showed that Stheno is able to sustain RT performance while simultaneously providing FT support. The performance of the resource reservation infrastructure enabled Stheno to maintain this behavior even under heavy load. 7
  • 10.
  • 11. Acronyms API Application Programming Interface BFT Byzantine Fault-Tolerance CCM CORBA Component Model CID Cell Identifier CORBA Common Object Request Broker Architecture COTS Common Of The Shelf DBMS Database Management Systems DDS Data Distribution Service DHT Distributed Hash Table DOC Distributed Object Computing DRE Distributed Real-Time and Embedded DSMS Data Stream Management Systems EDF Earliest Deadline First EM/EC Execution Model/Execution Context FT Fault-Tolerance IDL Interface Description Language IID Instance Identifier IPC Inter-Process Communication IaaS Infrastructure as a Service J2SE Java 2 Standard Edition JMS Java Messaging Service JRTS Java Real-Time System JVM Java Virtual Machine 9
  • 12. JeOS Just Enough Operating System KVM Kernel Virtual-Machine LFU Least Frequently Used LRU Least Recently Used LwCCM Lightweight CORBA Component Model MOM Message-Oriented Middleware NSIS Next Steps in Signaling OID Object Identifier OMA Object Management Architecture OS Operating Systems PID Peer Identifier POSIX Portable Operating System Interface PoL Place of Launch QoS Quality-of-Service RGID Replication Group Identifier RMI Remote Method Invocation RPC Remote Procedure Call RSVP Resource Reservation Protocol RTSJ Real-Time Specification for Java RT Real-Time SAP Service Access Point SID Service Identifier SLA Service Level of Agreement SSD Solid State Disk 10
  • 13. TDMA Time Division Multiple Access TSS Thread-Specific Storage UUID Universal Unique Identifier VM Virtual Machine VoD Video on Demand 11
  • 14.
  • 15. Contents Acknowledgments 5 Abstract 7 Acronyms 9 List of Tables 17 List of Figures 19 List of Algorithms 23 List of Listings 25 1 Introduction 27 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 1.2 Challenges and Opportunities . . . . . . . . . . . . . . . . . . . . . . . . 28 1.3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 1.4 Assumptions and Non-Goals . . . . . . . . . . . . . . . . . . . . . . . . . 31 1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 1.6 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2 Overview of Related Work 35 2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.2 RT+FT Middleware Systems . . . . . . . . . . . . . . . . . . . . . . . . 37 2.2.1 Special Purpose RT+FT Systems . . . . . . . . . . . . . . . . . . 37 2.2.2 CORBA-based Real-Time Fault-Tolerant Systems . . . . . . . . . 39 2.3 P2 P+RT Middleware Systems . . . . . . . . . . . . . . . . . . . . . . . . 44 2.3.1 Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 13
  • 16. 2.3.2 QoS-Aware P2 P . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 2.4 P2 P+FT Middleware Systems . . . . . . . . . . . . . . . . . . . . . . . . 46 2.4.1 Publish-subscribe . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.4.2 Resource Computing . . . . . . . . . . . . . . . . . . . . . . . . . 47 2.4.3 Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 2.5 P2 P+RT+FT Middleware Systems . . . . . . . . . . . . . . . . . . . . . 49 2.6 A Closer Look at TAO, MEAD and ICE . . . . . . . . . . . . . . . . . . 49 2.6.1 TAO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 2.6.2 MEAD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 2.6.3 ICE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3 Architecture 59 3.1 Stheno’s System Architecture . . . . . . . . . . . . . . . . . . . . . . . . 61 3.1.1 Application and Services . . . . . . . . . . . . . . . . . . . . . . . 62 3.1.2 Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.1.3 P2 P Overlay and FT Configuration . . . . . . . . . . . . . . . . . 66 3.1.4 Support Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.1.5 Operating System Interface . . . . . . . . . . . . . . . . . . . . . 76 3.2 Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 3.2.1 Runtime Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 3.2.2 Overlay Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 3.2.3 Core Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 3.3 Fundamental Runtime Operations . . . . . . . . . . . . . . . . . . . . . . 81 3.3.1 Runtime Creation and Bootstrapping . . . . . . . . . . . . . . . . 81 3.3.2 Service Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . 82 3.3.3 Client Mechanisms . . . . . . . . . 
. . . . . . . . . . . . . . . . . 88 3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4 Implementation 91 4.1 Overlay Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.1.1 Overlay Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . 93 4.1.2 Mesh Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.1.3 Discovery Service . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 14
  • 17. 4.1.4 Fault-Tolerance Service . . . . . . . . . . . . . . . . . . . . . . . . 111 4.2 Implementation of Services . . . . . . . . . . . . . . . . . . . . . . . . . . 122 4.2.1 Remote Procedure Call . . . . . . . . . . . . . . . . . . . . . . . . 123 4.2.2 Actuator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 4.2.3 Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 4.3 Support for Multi-Core Computing . . . . . . . . . . . . . . . . . . . . . 142 4.3.1 Object-Based Interactions . . . . . . . . . . . . . . . . . . . . . . 142 4.3.2 CPU Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 4.3.3 Threading Strategies . . . . . . . . . . . . . . . . . . . . . . . . . 145 4.3.4 An Execution Model for Multi-Core Computing . . . . . . . . . . 148 4.4 Runtime Bootstrap Parameters . . . . . . . . . . . . . . . . . . . . . . . 155 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 5 Evaluation 157 5.1 Evaluation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 5.1.1 Physical Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . 157 5.1.2 Overlay Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 5.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 5.2.1 Overlay Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 159 5.2.2 Services Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 161 5.2.3 Load Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 5.3 Overlay Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 5.3.1 Membership Performance . . . . . . . . . . . . . . . . . . . . . . 163 5.3.2 Query Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 164 5.3.3 Service Deployment Performance . . . . . . . . . . . . . . . . . . 165 5.4 Services Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
166 5.4.1 Impact of Faul-Tolerance Mechanisms in Service Latency . . . . . 167 5.4.2 Real-Time and Resource Reservation Evaluation . . . . . . . . . . 169 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 6 Conclusions and Future Work 177 6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 6.3 Personal Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 15
  • 18. References 182 16
  • 19. List of Tables 4.1 Runtime and overlay parameters. . . . . . . . . . . . . . . . . . . . . . . 155 17
  • 20.
List of Figures

1.1 Oporto's light-train network ... 28
2.1 Middleware system classes ... 36
2.2 TAO's architectural layout ... 51
2.3 FLARe's architectural layout ... 53
2.4 MEAD's architectural layout ... 54
3.1 Stheno overview ... 61
3.2 Application Layer ... 63
3.3 Stheno's organization overview ... 63
3.4 Core Layer ... 65
3.5 QoS Infrastructure ... 66
3.6 Overlay Layer ... 67
3.7 Examples of mesh topologies ... 68
3.8 Querying in different topologies ... 69
3.9 Support framework layer ... 71
3.10 QoS daemon resource distribution layout ... 73
3.11 End-to-end network reservation ... 75
3.12 Operating system interface ... 77
3.13 Interactions between layers ... 78
3.14 Multiple processes runtime usage ... 79
3.15 Creating and bootstrapping of a runtime ... 81
3.16 Local service creation ... 83
3.17 Finding a suitable deployment site ... 84
3.18 Remote service creation without fault-tolerance ... 85
3.19 Remote service creation with fault-tolerance: primary-node side ... 86
3.20 Remote service creation with fault-tolerance: replica creation ... 87
3.21 Client creation and bootstrap sequence ... 88
4.1 The peer-to-peer overlay architecture ... 91
4.2 The overlay bootstrap ... 93
4.3 The cell overview ... 94
4.4 The initial binding process for a new peer ... 95
4.5 The final join process for a new peer ... 96
4.6 Overview of the cell group communications ... 99
4.7 Cell discovery and management entities ... 103
4.8 Failure handling for non-coordinator (left) and coordinator (right) peers ... 105
4.9 Cell failure (left) and subsequent mesh tree rebinding (right) ... 106
4.10 Discovery service implementation ... 109
4.11 Fault-Tolerance service overview ... 112
4.12 Creation of a replication group ... 113
4.13 Replication group binding overview ... 114
4.14 The addition of a new replica to the replication group ... 115
4.15 The control and data communication groups ... 118
4.16 Semi-active replication protocol layout ... 120
4.17 Recovery process within a replication group ... 122
4.18 RPC service layout ... 124
4.19 RPC invocation types ... 125
4.20 RPC service architecture without (left) and with (right) semi-active FT ... 130
4.21 RPC service with passive replication ... 132
4.22 Actuator service layout ... 134
4.23 Actuator service overview ... 135
4.24 Actuator fault-tolerance support ... 137
4.25 Streaming service layout ... 138
4.26 Streaming service architecture ... 139
4.27 Streaming service with fault-tolerance support ... 141
4.28 Object-to-Object interactions ... 143
4.29 Examples of CPU Partitioning ... 144
4.30 Object-to-Object interactions with different partitions ... 145
4.31 Threading strategies ... 146
4.32 End-to-End QoS propagation ... 148
4.33 RPC service using CPU partitioning on a quad-core processor ... 148
4.34 Invocation across two distinct partitions ... 149
4.35 Execution Model Pattern ... 150
4.36 RPC implementation using the EM/EC pattern ... 153
5.1 Overlay evaluation setup ... 158
5.2 Physical evaluation setup ... 159
5.3 Overview of the overlay benchmarks ... 160
5.4 Network organization for the service benchmarks ... 161
5.5 Overlay bind (left) and rebind (right) performance ... 164
5.6 Overlay query performance ... 165
5.7 Overlay service deployment performance ... 166
5.8 Service rebind time (left) and latency (right) ... 168
5.9 Rebind time and latency results with resource reservation ... 170
5.10 Missed deadlines without (left) and with (right) resource reservation ... 172
5.11 Invocation latency without (left) and with (right) resource reservation ... 174
5.12 RPC invocation latency comparing with reference middlewares (without fault-tolerance) ... 175
List of Algorithms

4.1 Overlay bootstrap algorithm ... 93
4.2 Mesh startup ... 97
4.3 Cell initialization ... 97
4.4 Cell group communications: receiving-end ... 100
4.5 Cell group communications: sending-end ... 102
4.6 Cell Discovery ... 103
4.7 Cell fault handling ... 107
4.8 Cell fault handling (continuation) ... 108
4.9 Discovery service ... 110
4.10 Creation and joining within a replication group ... 116
4.11 Primary bootstrap within a replication group ... 117
4.12 Fault-Tolerance resource discovery mechanism ... 118
4.13 Replica startup ... 119
4.14 Replica request handling ... 119
4.15 Support for semi-active replication ... 121
4.16 Fault detection and recovery ... 123
4.17 A RPC object implementation ... 126
4.18 RPC service bootstrap ... 127
4.19 RPC service implementation ... 128
4.20 RPC client implementation ... 129
4.21 Semi-active replication implementation ... 130
4.22 Service's replication callback ... 131
4.23 Passive Fault-Tolerance implementation ... 133
4.24 Actuator service bootstrap ... 135
4.25 Actuator service implementation ... 136
4.26 Actuator client implementation ... 136
4.27 Stream service bootstrap ... 139
4.28 Stream service implementation ... 140
4.29 Stream client implementation ... 141
4.30 Joining an Execution Model ... 151
4.31 Execution Context stack management ... 152
4.32 Implementation of the EM/EC pattern in the RPC service ... 154
List of Listings

3.1 Overlay plugin and runtime bootstrap ... 82
3.2 Transparent service creation ... 83
3.3 Service creation with explicit and transparent deployments ... 85
3.4 Service creation with Fault-Tolerance support ... 87
3.5 Service client creation ... 89
4.1 A RPC IDL example ... 125
–Most of the important things in the world have been accomplished by people who have kept trying when there seemed to be no hope at all.
Dale Carnegie

1 Introduction

1.1 Motivation

The development and management of large-scale information systems is pushing the limits of the current state-of-the-art in middleware frameworks. At EFACEC¹, we have to handle a multitude of application domains, including: information systems used to manage public, high-speed transportation networks; automated power management systems to handle smart grids; and power supply systems that monitor power supply units through embedded sensors. Such systems typically transfer large amounts of streaming data; have erratic periods of extreme network activity; are subject to relatively common hardware failures that can last for comparatively long periods; and require low jitter and fast response times for safety reasons, for example, in vehicle coordination.

Target Systems

The main motivation for this PhD thesis was the need to address the requirements of the public transportation solutions at EFACEC, more specifically, the light-train systems. One such system is deployed in Oporto's light-train network and is composed of 5 lines, 70 stations and approximately 200 sensors (partially illustrated in Figure 1.1). Each station is managed by a computational node, which we designate as a peer, responsible for managing all the local audio, video and display panels, as well as low-level sensors such as the track sensors that detect inbound and outbound trains.

¹EFACEC, the largest Portuguese group in the field of electricity, with a strong presence in systems engineering, namely in public transportation and energy systems, employs around 3000 people and has a turnover of almost 1000 million euro; it is established in more than 50 countries and exports almost half of its production (cf. http://www.efacec.com).
The system supports three types of traffic: normal, for regular operations over the system, such as playing an audio message in a station through an audio codec; critical, medium-priority traffic comprised of urgent events, such as an equipment malfunction notification; and alarms, high-priority traffic that signals critical events, such as low-level sensor events. Independently of the traffic type (e.g., event, RPC operation), the system requires that any operation completes within 2 seconds.

From the point of view of distributed architectures, the current deployments are best matched by P2P infrastructures that are resilient, that allow resources (e.g., a sensor connected through a serial link to a peer) to be seamlessly mapped onto the logical topology, the mesh, and that also provide support for real-time (RT) and fault-tolerant (FT) services. Support for both RT and FT is fundamental to meet system requirements. Moreover, the next generation of light-train solutions requires deployments across cities and regions that can be overwhelmingly large. This introduces the need for a scalable hierarchical abstraction, the cell, which is composed of several peers that cooperate to maintain a portion of the mesh.

Figure 1.1: Oporto's light-train network.

1.2 Challenges and Opportunities

The requirements of our target systems pose a significant number of challenges. The presence of FT mechanisms, especially those using space redundancy [1], introduces the need for multiple copies of the same resource (replicas), and these, in turn, ultimately lead to greater resource consumption.

FT also introduces overhead in the form of latency, and this is another important constraint when dealing with RT systems. When an operation is performed, irrespective of whether it is real-time or not, any state change that it causes
must be propagated among the replicas through a replication algorithm that introduces an additional source of latency. Furthermore, the recovery time, i.e., the time that the system needs to recover from a fault, is an additional source of latency for real-time operations. There are well-known replication styles that offer different trade-offs between state consistency and latency.

Our target systems have different traffic types with distinct deadline requirements that must be supported while using Commercial Off-The-Shelf (COTS) hardware (e.g., Ethernet networking) and software (e.g., Linux). This requires that the RT mechanisms leverage the available resources, through resource reservation, while providing different threading strategies that allow different trade-offs between latency and throughput.

To overcome the overhead introduced by the FT mechanisms, it must be possible to employ a replication algorithm that does not compromise the RT requirements. Replication algorithms that offer a higher degree of consistency introduce a higher level of latency [1, 2] that may be prohibitive for certain traffic types. On the other hand, certain replication algorithms exhibit lower resource consumption and latency at the expense of a longer recovery time, which may also be prohibitive.

Considering current state-of-the-art research, we see many opportunities to address these challenges. One is the use of a COTS operating system, which allows for a faster implementation time, and thus a smaller development cost, while offering the necessary infrastructure to build a new middleware system. P2P networks can be used to provide a resilient infrastructure that mirrors the physical deployments of our target systems; furthermore, different P2P topologies offer different trade-offs between self-healing, resource consumption and latency in end-to-end operations.
Moreover, by directly implementing FT on the P2P infrastructure we hope to lower resource usage and latency enough to allow the integration of RT. By using proven replication algorithms [1, 2] that offer well-known trade-offs regarding consistency, resource consumption and latency, we can focus on the actual problem of integrating real-time and fault-tolerance within a P2P infrastructure. On the other hand, RT support can be achieved through the implementation of different threading strategies, resource reservation (through Linux's Control Groups), and by avoiding traffic multiplexing through the use of different access points to handle different traffic priorities. While the use of Earliest Deadline First (EDF) scheduling would provide greater RT guarantees, this goal will not be pursued due to the lack of maturity of the current EDF implementations in Linux (our reference COTS operating system). Because we are limited to priority-based scheduling and resource reservation, we can
only partially support our goal of providing end-to-end guarantees; more specifically, we enhance our RT guarantees through the use of RT scheduling policies with over-provisioning to ensure that deadlines are met.

1.3 Problem Definition

The work presented in this thesis focuses on the integration of Real-Time (RT) and Fault-Tolerance (FT) in a scalable general-purpose middleware system. This goal can only be achieved if the following premises hold: (a) the FT infrastructure cannot interfere with RT behavior, independently of the replication policy; (b) the network model must be able to scale; and (c) ultimately, FT mechanisms need to be efficient and aware of the underlying infrastructure, i.e., the network model, operating system and physical environment.

Our problem definition is a direct consequence of the requirements of our target systems, and it can be summarized with the following question: "Can we opportunistically leverage and integrate these proven strategies to simultaneously support soft-RT and FT to meet the needs of our target systems even under faulty conditions?"

In this thesis we argue that a lightweight implementation of fault-tolerance mechanisms in a middleware is fundamental for its successful integration with soft real-time support. Our approach is novel in that it explores peer-to-peer networking as a means to implement generic, transparent, lightweight fault-tolerance support. We do this by directly embedding fault-tolerance mechanisms into peer-to-peer overlays, taking advantage of their scalable, decentralized and resilient nature. For example, peer-to-peer networks readily provide the functionality required to maintain and locate redundant copies of resources. Given their dynamic and adaptive nature, they are promising infrastructures for developing lightweight fault-tolerant and soft real-time middleware.
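The combination of priority-based scheduling with over-provisioned resource reservation described above can be illustrated with a small sketch. This is our own illustration, not Stheno's actual API: the type names (`TrafficClass`, `CpuReservation`), the concrete priority values, and the helper functions are all hypothetical; only the three traffic classes and the Linux mechanisms (SCHED_FIFO priorities, Control Groups CPU bandwidth) come from the text.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical mapping of the three traffic classes (normal, critical,
// alarms) to SCHED_FIFO priorities. Gaps between classes leave headroom
// (over-provisioning) so alarms always preempt critical traffic, which
// in turn preempts normal traffic.
enum class TrafficClass { Normal, Critical, Alarm };

// SCHED_FIFO priorities on Linux range from 1 (lowest) to 99 (highest);
// the concrete values below are illustrative only.
inline int rt_priority(TrafficClass c) {
    switch (c) {
        case TrafficClass::Alarm:    return 80; // highest application class
        case TrafficClass::Critical: return 50;
        case TrafficClass::Normal:   return 20;
    }
    return 0; // unreachable
}

// CPU bandwidth reservation in the style of the Linux Control Groups
// "cpu" controller: a quota of each period, both in microseconds.
// E.g., 20% of a 100 ms period yields a quota of 20000 us.
struct CpuReservation {
    std::int64_t period_us;
    std::int64_t quota_us;
};

inline CpuReservation make_reservation(double cpu_fraction,
                                       std::int64_t period_us = 100000) {
    return CpuReservation{period_us,
                          static_cast<std::int64_t>(cpu_fraction * period_us)};
}
```

A real implementation would apply `rt_priority()` through `pthread_setschedparam()` and write the quota/period pair into the service's cgroup files (e.g., `cpu.cfs_quota_us` and `cpu.cfs_period_us`); both steps require elevated privileges, which is why they are omitted from this sketch.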
Despite these a priori advantages, mainstream generic peer-to-peer middleware systems for QoS computing are, to our knowledge, unavailable. Motivated by this state of affairs, by the limitations of the current infrastructure for the information system we are managing at EFACEC (based on CORBA technology) and, last but not least, by the comparative advantages of flexible peer-to-peer network architectures, we have designed and implemented a prototype service-oriented peer-to-peer middleware framework.

The networking layer relies on a modular infrastructure that can handle multiple peer-to-peer overlays. The support for fault-tolerance and soft real-time features is provided
at this level through the implementation of efficient and resilient services for, e.g., resource discovery, messaging and routing. The kernel of the middleware system (the runtime) is implemented on top of these overlays and uses the above-mentioned peer-to-peer functionalities to provide developers with APIs for the customization of QoS policies for services (e.g., bandwidth reservation, CPU/core reservation, scheduling strategy, number of replicas). This approach was inspired by that of TAO [3], which allows distinct strategies for the execution of tasks by threads to be defined.

1.4 Assumptions and Non-Goals

The distributed model used in this thesis is based on a partially asynchronous computing model, as defined in [2], extended with fault detectors. The services and the P2P plugin implemented in this thesis only support crash failures. We consider a crash failure [1] to be characterized by the complete shutdown of a computing instance in the event of a failure, ceasing to interact any further with the remaining entities of the distributed system.

Timing faults are handled differently by the services and by the P2P plugin. In our service implementations a timing fault is logged (for analysis) with no other action being performed, whereas in the P2P layer we treat a timing fault as a crash failure, i.e., if the remote creation of a service exceeds its deadline, the peer is considered crashed. This method is also called process-controlled crash, or crash control, as defined in [4]. In this thesis, we adopted a more relaxed version: a peer wrongly suspected of having crashed does not get killed, nor does it commit suicide; instead it gets shunned, that is, it is expelled from the overlay and forced to rejoin it. More precisely, it must rebind using the membership service in the P2P layer.
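The timing-fault policy just described reduces to a simple deadline check at the P2P layer. The sketch below is ours, not the actual Stheno implementation; the names (`PeerVerdict`, `judge_remote_creation`) are hypothetical, and the 2-second bound is the operation deadline stated for the target systems in Section 1.1.

```cpp
#include <cassert>
#include <chrono>

// Illustrative sketch of the P2P layer's timing-fault policy: a remote
// service creation that exceeds its deadline is interpreted as a crash
// failure, and the suspected peer is shunned (expelled from the overlay
// and forced to rejoin through the membership service), not killed.
enum class PeerVerdict { Ok, Shunned };

// Deadline for any operation, taken from the target-system requirement.
constexpr std::chrono::milliseconds kOperationDeadline{2000};

inline PeerVerdict judge_remote_creation(std::chrono::milliseconds elapsed) {
    return elapsed > kOperationDeadline ? PeerVerdict::Shunned
                                        : PeerVerdict::Ok;
}
```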
The fault model used was motivated by the author's experience with several field deployments of light-train transportation systems, such as the Oporto, Dublin and Tenerife light rail solutions [5]. Due to the use of highly redundant hardware, such as redundant power supplies and redundant 10-Gbit network ring links, network failures tend to be short. The most common cause of downtime is software bugs, which mostly result in a crashed computing node. While simultaneous failures can happen, they are considered rare events. We also assume that the resource-reservation mechanisms are always available.

In this thesis we do not address value faults and Byzantine faults, as they are not a
requirement for our target systems. Furthermore, we do not provide a formal specification and verification of the system. While this would be beneficial to assess system correctness, we had to limit the scope of this thesis. Nevertheless, we provide an empirical evaluation of the system. We also do not address hard real-time, because of the lack of mature support for EDF scheduling in the Linux kernel. Furthermore, we do not provide a fully optimized implementation, but only a proof-of-concept to validate our approach. Testing the system in a production environment is left for future work.

1.5 Contributions

Before undertaking the task of building an entire new middleware system from scratch, we explored current solutions, presented in Chapter 2, to see if any of them could support the requirements of our target system. As we did not find any suitable solution, we then assessed whether it was possible to extend an available solution to meet those requirements. In our previous work, DAEM [6], we explored the use of JGroups [7] within a hierarchical P2P mesh, and concluded that the simultaneous support for real-time, fault-tolerance and P2P requires fine-grained control of resources that is not possible with "black-box" solutions; for example, it is impossible to have out-of-the-box support for resource reservation in JGroups.

Given these assessments, we have designed and implemented Stheno, which, to the best of our knowledge, is the first middleware system to seamlessly integrate fault-tolerance and real-time in a peer-to-peer infrastructure. Our approach was motivated by the lack of support in current solutions for the timing, reliability and physical deployment characteristics of our target systems. To this end, a complete architectural design is proposed that addresses all levels of the software stack, including kernel space, network, runtime and services, to achieve a seamless integration.
The list of contributions includes: (a) a full specification of a user Application Programming Interface (API); (b) a pluggable P2P network infrastructure, aiming to better adjust to the target application; (c) support for configurable FT in the P2P layer with the goal of providing lightweight FT mechanisms that fully enable RT behavior; and (d) the integration of resource reservation at all levels of the runtime, enabling (partial) end-to-end Quality-of-Service (QoS) guarantees.

Previous work [8, 9, 10] on resource reservation focused solely on CPU provisioning for real-time systems. In this thesis we present Euryale, a QoS network-oriented
framework that features resource reservation with support for a broader range of subsystems, including CPU, memory, I/O and network bandwidth, on a general-purpose operating system such as Linux. At the heart of this infrastructure resides Medusa, a QoS daemon that handles the admission and management of QoS requests.

Current well-known threading strategies, such as Leader-Followers [11], Thread-per-Connection [12] and Thread-per-Request [13], offer well-known trade-offs between latency and resource usage [3, 14]. However, they do not support resource reservation, namely CPU partitioning. To overcome this limitation, this thesis provides an additional contribution with the introduction of a novel design pattern (Chapter 4) that is able to integrate multi-core computing with resource reservation within a configurable framework that supports these well-known threading strategies. For example, when a client connects to a service it can specify, through the QoS real-time parameters, a particular threading strategy that best meets its requirements.

We present a full implementation that covers all the aforementioned architectural features, including a complete overlay implementation, inspired by the P3 [15] topology, that seamlessly integrates RT and FT.

To evaluate our implementation and justify our claims, we present a complete evaluation of both mechanisms. The impact of the resource reservation mechanism is also evaluated, as well as a comparative evaluation of RT performance against state-of-the-art middleware systems. The experimental results show that Stheno meets and exceeds the target system requirements for end-to-end latency and fail-over latency.

1.6 Thesis Outline

The focus of this thesis is the design, implementation and evaluation of a scalable general-purpose middleware that provides the seamless integration of RT and FT. The remainder of this thesis is organized as follows.

Chapter 2: Overview of Related Work.
This chapter presents an overview of related middleware systems that exhibit support for RT, FT and P2P, the mandatory requirements of our target system. We started by searching for an available off-the-shelf solution that could support all of these requirements or, in its absence, by identifying a current solution that could be extended, in order to avoid creating a new middleware solution from scratch.

Chapter 3: Architecture.
Chapter 3 describes the runtime architecture of the proposed middleware. We start by providing a detailed insight into the architecture, covering all layers present in the runtime. Special attention is given to the presentation of the QoS and resource reservation infrastructure. This is followed by an overview of the programming model that describes the most important interfaces present in the runtime, as well as the interactions that occur between them. The chapter ends with a description of the fundamental runtime operations, namely: the creation of services with and without FT support, the deployment strategy, and client creation.

Chapter 4: Implementation.

Chapter 4 describes the implementation of a prototype based on the aforementioned architecture, and is divided into four parts. In the first part, we present a complete implementation of a P2P overlay that is inspired by the P3 [15] topology, while providing some insight into the limitations of the current prototype. The second part of this chapter focuses on the implementation of three types of user services, namely, Remote Procedure Call (RPC), Actuator, and Streaming. These services are thoroughly evaluated in Chapter 5. In the third part, we describe our support for multi-core computing, through the presentation of a novel design pattern, the Execution Model/Context. This design pattern is able to integrate resource reservation, especially CPU partitioning, with different well-known (and configurable) threading strategies. The fourth and final part of this chapter describes the most relevant parameters used in the bootstrap of the runtime.

Chapter 5: Evaluation.

The experimental results are presented in this chapter. It starts by providing details of the physical setup used throughout the evaluation. Then it describes the parameters used in the testbed suite, which is composed of the three services previously described in Chapter 4.
We then focus on presenting the results of the benchmarks, including an assessment of the impact of FT on RT, and of the impact of the resource reservation infrastructure on overall performance. The chapter ends with a comparative evaluation against well-known middleware systems.

Chapter 6: Conclusion and Future Work.

This last chapter presents the concluding remarks. It highlights the contributions of the proposed and implemented middleware, and provides
–By failing to prepare, you are preparing to fail.
Benjamin Franklin

2 Overview of Related Work

2.1 Overview

This chapter presents an overview of the state-of-the-art in related middleware systems. As illustrated in Figure 2.1, we are mostly interested in systems that exhibit support for real-time (RT), fault-tolerance (FT) and peer-to-peer (P2P), the mandatory requirements of our target system. We started by searching for an available off-the-shelf solution that could support all of these requirements or, in its absence, by identifying a current solution that could be extended, thus avoiding the creation of a new middleware solution from the ground up. For that reason, we have focused on the intersecting domains, namely RT+FT, RT+P2P and FT+P2P, since the systems contained in these domains are closest to meeting the requirements of our target system.

Figure 2.1: Middleware system classes.

From a historic perspective, the origins of modern middleware systems can be traced back to the 1980s, with the introduction of the concept of ubiquitous computing, in which computational resources are accessible and seen as ordinary commodities such as electricity or tap water [2]. Furthermore, the interaction between these resources and the users was governed by the client-server model [16] and a supporting protocol called RPC [17]. The client-server model is still the most prevalent paradigm in current distributed systems.

An important architecture for client-server systems was introduced with the Common Object Request Broker Architecture (CORBA) standard [18] in the 1990s, but it did not address real-time or fault-tolerance. Only recently were both the real-time and fault-tolerance specifications finalized, but they remained mutually exclusive. This means that a system supporting the real-time specification will not be able to support the fault-tolerance specification, and vice-versa. Nevertheless, seminal work has already addressed these limitations and offered systems supporting both features, namely TAO [3] and MEAD [14]. At the same time, Remote Method Invocation (RMI) [19] appeared as a Java alternative capable of providing a more flexible and easy-to-use environment.

In recent years, CORBA entered a steady decline [20] in favor of web-oriented platforms, such as J2EE [21], .NET [22] and SOAP [23], and of P2P systems. The web-oriented platforms, such as the JBoss [24] application server, aim to integrate availability with scalability, but they remain unable to support real-time. Moreover, while partitioning offers a clean approach to improve scalability, it fails to support large-scale distributed systems [2]. Alternatively, P2P systems focused on providing logical organizations, i.e., meshes, that abstract the underlying physical deployment while providing a decentralized architecture for increased resiliency. These systems focused initially on resilient distributed storage solutions, such as Dynamo [25], but progressively evolved to support soft real-time systems, such as video streaming [26].

More recently, Message-Oriented Middleware (MOM) systems [27] offer a distributed message-passing infrastructure based on an asynchronous interaction model that is able to overcome the scaling issues present in RPC. A considerable number of implementations exist, including Tibco [28], Websphere MQ [29] and the Java Messaging Service (JMS) [30]. MOM systems are sometimes integrated as subsystems in application server infrastructures, such as JMS in J2EE and Websphere MQ in the Websphere Application Server.

A substantial body of research has focused on the integration of real-time within
CORBA-based middleware, such as TAO [3] (which later addressed the integration of fault-tolerance). More recently, QoS-enabled publish-subscribe middleware systems based on the JAIN SLEE specification [31], such as Mobicents [32], and on the Data Distribution Service (DDS) specification, such as OpenDDS [33], Connext DDS [34] and OpenSplice [35], appeared as a way to overcome the current lack of support for real-time applications in SOA-based middleware systems.

The introduction of fault-tolerance in middleware systems also remains an active topic of research. CORBA-based middleware systems were a fertile ground for testing fault-tolerance techniques in a general-purpose platform, resulting in the creation of the CORBA-FT specification [36]. Nowadays, some of this focus has been redirected to SOA-based platforms, such as J2EE. One of the most popular deployments, JBoss, supports scalability and availability through partitioning. Each partition is supported by a group communication framework based on the virtual synchrony model, more specifically, the JGroups [7] group communication framework.

2.2 RT+FT Middleware Systems

This section overviews systems that provide simultaneous support for real-time and fault-tolerance. These systems are divided into special-purpose solutions, designed for specific application domains, and CORBA-based solutions, aimed at general-purpose computing.

2.2.1 Special Purpose RT+FT Systems

Special-purpose real-time fault-tolerant systems introduced concepts and implementation strategies that are still relevant in current state-of-the-art middleware systems.

Armada

Armada [37] focused on providing middleware services and a communication infrastructure to support FT and RT semantics for distributed real-time systems. This was pursued in two ways, which we now describe.
The first contribution was the introduction of a communication infrastructure able to provide end-to-end QoS guarantees for both unicast and multicast primitives. This was supported by control signaling and a QoS-sensitive data transfer (as in the newer Resource Reservation Protocol (RSVP) and Next Steps in Signaling (NSIS)).
The network infrastructure used a reservation mechanism based on an EDF scheduling policy that was built on top of the Mach OS priority-based scheduling. The initial implementation was done at user level but subsequently migrated to kernel level with the goal of reducing latency.

Many of the architectural decisions regarding RT support were based on the operating systems available at the time, mainly Mach OS. Despite the advantages of a micro-kernel approach, its application remains restricted by the underlying cost associated with message passing and context switching. Instead, a large body of research has been conducted on monolithic kernels, especially the Linux OS, which are able to offer the advantages of the micro-kernel approach, through the introduction of kernel modules, combined with the speed of monolithic kernels.

The second contribution came in the form of a group communication infrastructure based on a ring topology that ensured the delivery of messages in a reliable and totally ordered fashion within a bounded time. It also had support for membership management that offered consistent views of the group through the detection of process and communication failures. These group communication mechanisms enabled the support for FT through the use of a passive replication scheme that allowed for some inconsistencies between the primary and the replicas, where the states of the replicas could lag behind the state of the primary up to a bounded time window.

Mars

Mars [38] provided support for the analysis and deployment of synchronous hard real-time systems through a static off-line scheduler for the CPU and the Time Division Multiple Access (TDMA) bus. Mars is able to offer FT support through the use of active redundancy on the TDMA bus, i.e., sending multiple copies of the same message, and self-checking mechanisms. Deterministic communications are achieved through the use of a time-triggered protocol.
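Armada's reservation mechanism, described above, rests on EDF schedulability. A minimal sketch of the classic utilization-based admission test (assuming periodic message streams with deadlines equal to periods; the stream parameters are illustrative, not Armada's actual API):

```python
def edf_admit(streams, new_stream):
    """EDF admission test: a set of periodic streams, each a
    (cost, period) pair, is schedulable iff total utilization <= 1."""
    total = sum(c / t for c, t in streams + [new_stream])
    return total <= 1.0

# Admit a new stream only while the channel still has capacity.
reserved = [(2, 10), (3, 15)]        # utilization 0.2 + 0.2 = 0.4
print(edf_admit(reserved, (4, 10)))  # total 0.8 -> admitted
print(edf_admit(reserved, (7, 10)))  # total 1.1 -> rejected
```

A real reservation layer would also track per-stream deadlines at dispatch time; the test above only decides admission.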
The project focused on RT process control, where all the intervening entities are known in advance. It therefore does not offer any support for the dynamic admission of new components, nor does it support on-the-fly fault-recovery.

ROAFTS

The ROAFTS [39, 40] system aims to provide transparent adaptive FT support for distributed RT applications, consisting of a network of Time-triggered Message-triggered Objects [41] (TMOs), whose execution is managed by a TMO support manager. The FT infrastructure consists of a set of specialized TMOs, which include: (a) a generic
fault server, and (b) a network surveillance [42] manager. Fault-detection is assured by the network surveillance TMO and used by the generic fault server to change the FT policy with the goal of preserving RT semantics. The system assumes that, under highly dynamic environments, RT can live with weaker reliability assurances from the middleware.

Maruti

Maruti [43] aimed to provide a development framework and an infrastructure for the deployment of hard real-time applications within a reactive environment, focusing on real-time requirements on a single-processor system. The reactive model is able to make runtime decisions on the admission of new processing requests without producing adverse effects on the scheduling of existing requests. Fault-tolerance is achieved by redundant computation. A configuration language allows the deployment of replicated modules and services.

Delta-4

Delta-4 [44] provided an in-depth characterization of fault assumptions, for both the host and the network. It also demonstrated various techniques for handling them, namely, passive and active replication for fail-silent hosts and byzantine agreement for fail-uncontrolled hosts. This work was followed by the Delta-4 Extra Performance Architecture (XPA) [45], which aimed to add real-time support to the Delta-4 framework through the introduction of the Leader/Follower replication model (better known as semi-active replication) for fail-silent hosts. This work also led to the extension of the communication system to support additional communication primitives (the original work on Delta-4 only supported the Atomic primitive), namely, Reliable, AtLeastN, and AtLeastTo.

2.2.2 CORBA-based RT+FT Systems

The support for RT and FT in general-purpose distributed platforms remains mostly restricted to CORBA.
While some effort was made by Sun to introduce RT support for Java, with the Real-Time Specification for Java (RTSJ) [46, 47], it was aimed at the Java 2 Standard Edition (J2SE). The most relevant implementations are Sun's Java Real-Time System (JRTS) [48] and IBM's Websphere Real-Time VM [49, 50]. To the best of our knowledge, only WebLogic Real-Time [51] attempted to provide support for RT in a J2EE environment. Nevertheless, this support seems to be confined to the introduction of a deterministic garbage collector, through the use of
the RT JRockit JVM, as a way to prevent unpredictable pause times caused by garbage collection [51].

Previous work on the integration of RT and FT in CORBA-context systems can be categorized into three distinct approaches: (a) integration, where the base ORB is modified; (b) services, systems that rely on high-level services to provide FT (and, indirectly, RT); and (c) interception, systems that intercept client requests to provide transparent FT and RT.

Integration Approach

Past work on the integration of fault-tolerance in CORBA-like systems was done in Electra [52], Maestro [53] and AQuA [54]. Electra [52] was one of the predecessors of the CORBA-FT standard [55, 36], and it focused on enhancing the Object Management Architecture (OMA) to support transparent and non-transparent fault-tolerance capabilities. Instead of using message queues or transaction monitors [56], it relied on object-communication groups [57, 58]. Maestro [53] is a distributed layer built on top of the Ensemble [59] group communication toolkit, which was used together with Electra [52] in the Quality of Service for CORBA Objects (QuO) project [60]. Its main focus was to provide an efficient, extensible and non-disruptive integration of the object layers with the low-level QoS system properties. The AQuA [54] system uses both QuO and Maestro on top of the Ensemble communication groups to provide a flexible and modular approach that is able to adapt to faults and changes in the application requirements. Within its framework, a QuO runtime accepts availability requests from the application and relays them to a dependability manager, which is responsible for leveraging the requests from multiple QuO runtimes.
TAO+QuO

The work done in [61] focused on the integration of QoS mechanisms, for both CPU and network resources, supporting both priority- and reservation-based QoS semantics, with standard COTS Distributed Real-Time and Embedded (DRE) middleware, more precisely, TAO [3]. The underlying QoS infrastructure was provided by QuO [60]. The priority-based approach was built on top of the RT-CORBA specification, and it defined a set of standard features in order to provide end-to-end predictability for operations within a fixed-priority context [62]. The CPU priority-based resource management is left to the scheduler of the underlying Operating System (OS), whereas the network priority-based management is achieved through the use of the DiffServ architecture [63], by setting the DSCP codepoint in the IP header of the GIOP requests. Based on various factors, the QuO runtime can dynamically change this priority to adjust to
environment changes. Alternatively, the network reservation-based approach relies on the RSVP [64] signaling protocol to guarantee the desired network bandwidth between hosts. The QuO runtime monitors the RSVP connections and makes adjustments to overcome abnormal conditions. For example, in a video service it can drop frames to maintain stability. The CPU reservation is made using the reservation mechanisms present in the TimeSys Linux kernel. It is left to TAO and QuO to decide on the reservation policies. This was done to preserve the end-to-end QoS semantics that is only available at a higher level of the middleware.

CIAO+QuO

CIAO [65] is a QoS-aware CORBA Component Model (CCM) implementation built on top of TAO [3] that aims to alleviate the complexity of integrating real-time features in DRE systems using Distributed Object Computing (DOC) middleware. These DOC systems, of which TAO is an example, offer configurable policies and mechanisms for QoS, namely real-time, but lack a programming model capable of separating systemic aspects from application logic. Furthermore, QoS provisioning must be done in an end-to-end fashion, and thus has to be applied to several interacting components. It is difficult, or nearly impossible, to properly configure a component without taking into account the QoS semantics of the interacting entities. Developers using standard DOC middleware systems are prone to produce misconfigurations that cause overall system misbehavior. CIAO overcomes these limitations by applying a wide range of aspect-oriented development techniques that support the composition of real-time semantics without intertwining configuration concerns. The support for CIAO's CCM architecture was done in CORFU [66] and is described below. Work on the integration of CIAO with Quality Objects (QuO) [60] was done in [67].
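The DiffServ marking used by TAO+QuO, described above, amounts to setting the DSCP codepoint on the request socket; on POSIX systems this is a plain `setsockopt`. A sketch (the EF codepoint value is standard; the socket itself is illustrative, not TAO's internal API):

```python
import socket

# DSCP "Expedited Forwarding" (EF, value 46); the DSCP field occupies
# the upper six bits of the legacy TOS byte, hence the shift by 2.
DSCP_EF = 46
tos = DSCP_EF << 2

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)

# Every IP datagram sent on this socket now carries the EF marking,
# which DiffServ-aware routers can use to prioritize the traffic.
print(sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))  # 184
sock.close()
```

Changing the priority at runtime, as the QuO runtime does, is just another `setsockopt` call with a different codepoint.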
The integration of QuO's infrastructure into CIAO enhanced its limited static QoS provisioning into a full provisioning middleware that is also able to accommodate dynamic and adaptive QoS provisioning. For example, the setup of an RSVP [64] connection would otherwise require explicit configuration by the developer, defeating the purpose of CIAO. Nevertheless, while CIAO is able to compose QuO components, Qoskets [68], it does not provide a solution for component cross-cutting.

DynamicTAO

DynamicTAO [69] focused on providing a reflective middleware model that extends TAO to support on-the-fly dynamic reconfiguration of its component behavior and resource management through meta-interfaces. It allows the application to inspect the internal state/configuration and, if necessary, to reconfigure it in order to adapt
to environment changes. Subsequently, it is possible to select networking protocols, encoding and security policies to improve the overall system performance in the presence of unexpected events.

Service-based Approach

An alternative, high-level service approach to CORBA fault-tolerance was taken by the Distributed Object-Oriented Reliable Service (DOORS) [70], the Object Group Service (OGS) [71], and the Newtop Object Group Service [72]. DOORS focused on providing replica management, fault-detection and fault-recovery as a CORBA high-level service. It did not use group communication and focused mainly on passive replication, but allowed the developer to select the desired level of reliability (number of replicas), replication policy, fault-detection mechanism, e.g., SNMP-enhanced fault-detection, and recovery strategy. OGS improved over prior approaches by using a group communication protocol that imposes consensus semantics. Instead of adopting an integrated approach, group communication services are transparent to the ORB, provided through request-level bridging. Newtop followed a similar approach to OGS but augmented the support for network partitions, allowing newly formed sub-groups to continue to operate.

TAO

TAO [3] is a CORBA middleware with support for RT and FT that is compliant with the OMG's standards for CORBA-RT [73] and CORBA-FT [36]. The support for RT includes priority propagation, explicit binding, and RT thread pools. FT is supported through the use of a high-level service, the Replication Manager, that sits on top of the CORBA stack. This service is the cornerstone of the FT infrastructure, acting as a rendezvous for all the remaining components, more precisely: monitors that watch the status of the replicas, replica factories that allow the creation of new replicas, and fault notifiers that inform the manager of failed replicas. TAO's architecture is further detailed in Section 2.6.
FLARe and CORFU

FLARe [74] focuses on proactively adapting the replication group to underlying changes in resource availability. To minimize resource usage, it only supports passive replication [75]. Its implementation is based on TAO [3]. It adds four new components to the existing architecture: (a) a Replication Manager, a high-level service that decides on the strategy to be employed to address changes in resource availability and faults; (b) a client interceptor that redirects invocations to the active primary; (c) a redirection agent that receives updates from the Replication Manager and is used by the interceptor;
and (d) a resource monitor that watches the load on nodes and periodically notifies the Replication Manager. In the presence of faulty conditions, such as the overload of a node, the Replication Manager adapts the replication group to the changing conditions by activating replicas on nodes that have lower resource usage and, additionally, moving the primary to a more suitable placement.

CORFU [66] extends FLARe to support real-time and fault-tolerance for the Lightweight CORBA Component Model (LwCCM) [76] standard for DRE systems. It provides fail-stop behavior, that is, when one component of a failover unit fails, all the remaining components are stopped, allowing for a clean switch to a new unit. This is achieved through a fault-mapping facility that maps the object failure into the respective plan(s), with the subsequent component shutdown.

DeCoRAM

The DeCoRAM system [77] aims to provide RT and FT properties through a resource-aware configuration, executed using a deployment infrastructure. The class of supported systems is confined to closed DRE systems, where the number of tasks and their respective execution and resource requirements are known a priori and remain invariant throughout the system's life-cycle. As the tasks and resources are static, it is possible to optimize the allocation of the replicas on the available nodes. The allocation algorithm is configurable, allowing a user to choose the best approach for a particular application domain. DeCoRAM provides a custom allocation algorithm named FERRARI (FailurE, Real-Time, and Resource Awareness Reconciliation Intelligence) that addresses the optimization problem while satisfying both RT and FT system constraints. Because of the limited resources normally available on DRE systems, DeCoRAM only supports passive replication [75], thus avoiding the high overhead associated with active replication [78].
The allocation algorithm calculates the components' inter-dependencies and deploys the execution plan using the underlying middleware infrastructure, which is provided by FLARe [74].

Interception-based Approach

The work done in Eternal [79, 80] focused on providing transparent fault-tolerance for CORBA, ensuring strong replica consistency through the use of a reliable totally-ordered multicast protocol. This approach freed the developer from having to deal with low-level mechanisms for supporting fault-tolerance. In order to maintain compatibility with the CORBA-FT standard, Eternal exposes the replication manager, fault detector, and fault notifier to developers. However, the main infrastructure components are located below the ORB for both efficiency and transparency purposes. These components
include logging-recovery mechanisms, replication mechanisms, and interceptors. The replication mechanisms provide support for warm and cold passive replication as well as active replication. The interceptor captures the CORBA IIOP requests and replies (based on TCP/IP) and redirects them to the fault-tolerance infrastructure. The logging-recovery mechanisms are responsible for managing the logging, checkpointing, and performing the recovery protocols.

MEAD

MEAD focuses on providing fault-tolerance support in a non-intrusive way by enhancing distributed RT systems with (a) transparent, although tunable, FT that is (b) proactively dependable through (c) resource awareness, with (d) scalable and fast fault-detection and fault-recovery. It uses CORBA-RT, more specifically TAO, as a proof-of-concept. The work makes an important contribution by leveraging fault-tolerance resource consumption to provide RT behavior. MEAD is detailed further in Section 2.6.

2.3 P2P+RT Middleware Systems

While most of the focus of P2P systems has been on the support of FT, there is a growing interest in using these systems for RT applications, namely in streaming and QoS support. This section provides an overview of P2P systems that support RT.

2.3.1 Streaming

Streaming, and especially Video on Demand (VoD), was a natural evolution of the first file-sharing P2P systems [81, 82]. With the steady increase of network bandwidth on the Internet, it is now possible to offer high-quality multimedia streaming solutions to the end-user. These focus on providing near soft real-time performance by splitting streams across distributed P2P storage and redundant network channels.

PPTV

The work done in [26] provides the background for the analysis, design and behavior of VoD systems, focusing on the PPTV system [83].
An overview of the different replication strategies and their respective trade-offs is presented, namely, Least Recently Used (LRU) and Least Frequently Used (LFU). The latter uses a weighted estimation based on the local cache completion and on the availability-to-demand ratio (ATD).
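A hypothetical sketch of such a weighted eviction score may clarify the idea. The exact formula below is an assumption for illustration, not the one from [26]: content that is already well replicated relative to its demand, and of which the local cache holds little, is the cheapest to evict.

```python
def replication_weight(cache_completion, availability, demand):
    """Hypothetical LFU-style eviction weight: a high
    availability-to-demand ratio (ATD) means the content is already
    well served elsewhere, and low local cache completion means our
    copy contributes little -- both make eviction cheaper."""
    atd = availability / max(demand, 1)
    return atd * (1.0 - cache_completion)

movies = {
    "popular": replication_weight(0.9, availability=10, demand=50),
    "rare":    replication_weight(0.9, availability=40, demand=5),
}
evict = max(movies, key=movies.get)
print(evict)  # the well-replicated, low-demand movie goes first
```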
Each stream is divided into chunks. The size of these chunks has a direct influence on the efficiency of the streaming: smaller pieces facilitate replication and thus overall system load-balancing, whereas bigger pieces decrease the resource overhead associated with piece management and the bandwidth consumed by protocol control. To allow for a more efficient piece selection, three algorithms are proposed: sequential, rarest-first and anchor-based. To ensure real-time behavior, the system is able to offer different levels of aggressiveness, including: simultaneous requests of the same type to neighboring peers; simultaneously sending different content requests to multiple peers; and requesting from a single peer (making a more conservative use of resources).

Thicket

Efficient data dissemination over unstructured P2P networks was addressed by Thicket [84]. The work used multiple trees to ensure efficient usage of resources while providing redundancy in the presence of node failures. In order to improve load-balancing across the nodes, the protocol tries to minimize the number of nodes that act as interior nodes in several trees, thus reducing the load produced by forwarding messages. The protocol also defines a reconfiguration algorithm for balancing load across neighboring nodes and a tree repair procedure to handle tree partitions. Results show that the protocol is able to quickly recover from a large number of simultaneous node failures and balance the load across the remaining nodes.

2.3.2 QoS-Aware P2P

Until recently, P2P systems have focused on providing resiliency and throughput, thus failing to address the increasing need for QoS in latency-sensitive applications, such as VoD.

QRON

QRON [85] aimed to provide a general unified framework in contrast to application-specific overlays.
The overlay brokers (OBs), present in each autonomous system in the Internet, support QoS routing for overlay applications through resource negotiation and allocation, and topology discovery. The main goal of QRON is to find a path that satisfies the QoS requirements, while balancing the overlay traffic across the OBs and overlay links. For this it proposes two distinct algorithms, a “modified shortest distance path” (MSDP) and a “proportional bandwidth shortest path” (PBSP).
GlueQoS

GlueQoS [86] focused on the dynamic and symmetric negotiation of QoS features between two communicating processes. It provides a declarative language that allows the specification of the QoS feature set (and possible conflicts), and a runtime negotiation mechanism that finds a set of QoS features valid at both ends of the interacting components. Contrary to aspect-oriented programming [65], which only enforces QoS semantics at deployment time, GlueQoS offers a runtime solution that remains valid throughout the duration of the session between a client and a server.

2.4 P2P+FT Middleware Systems

Research on P2P systems has been largely dominated by the pursuit of fault-tolerance, such as in distributed storage, mainly due to the resilient and decentralized nature of P2P infrastructures.

2.4.1 Publish-subscribe

P2P publish-subscribe systems implement a message pattern where the publishers (senders) do not have a predefined set of subscribers (receivers) for their messages. Instead, the subscribers must first register their interest with the target publisher before starting to receive published messages. This decoupling between publishers and subscribers allows for better scalability, and ultimately, performance.

Scribe

Scribe [87] aimed to provide a large-scale event notification infrastructure, built on top of Pastry [88], for topic-based publish-subscribe applications. Pastry is used to support topics and subscriptions and to build multicast trees. Fault-tolerance is provided by the self-organizing capabilities of Pastry, through adaptation to network failures and subsequent multicast tree repair. Event dissemination is best-effort, without any delivery order guarantees.
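The GlueQoS-style negotiation described above can be sketched as a set intersection followed by conflict elimination. Feature names and the conflict encoding below are illustrative assumptions, not GlueQoS's declarative language:

```python
def negotiate(client_features, server_features, conflicts):
    """Sketch of a symmetric QoS negotiation: start from the features
    both ends support, then drop one side of any declared conflict."""
    agreed = set(client_features) & set(server_features)
    for a, b in conflicts:
        if a in agreed and b in agreed:
            agreed.discard(b)  # arbitrary tie-break: keep the first
    return agreed

client = {"encryption", "compression", "tracing"}
server = {"encryption", "compression", "batching"}
print(negotiate(client, server, conflicts=[("encryption", "compression")]))
```

Because the result depends only on the two advertised sets, both ends compute the same agreed set, which is what makes the negotiation symmetric.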
Nevertheless, it is possible to enhance Scribe to support consistent ordering through the implementation of sequential time stamping at the root of the topic. To ensure strong consistency and tolerate topic root node failures, an implementation of a consensus algorithm such as Paxos [89] is needed across the set of replicas (of the topic root).
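The sequential time stamping just mentioned can be sketched in a few lines: the topic root stamps every event with a monotonically increasing number before multicasting it, so all subscribers can apply a single consistent order. (The class below is an illustration, not Scribe's API.)

```python
class TopicRoot:
    """Sketch of sequential time stamping at the topic root."""
    def __init__(self):
        self._seq = 0

    def publish(self, event):
        # Stamp the event before it enters the multicast tree;
        # subscribers deliver events in increasing stamp order.
        self._seq += 1
        return (self._seq, event)

root = TopicRoot()
print(root.publish("sensor-up"))    # (1, 'sensor-up')
print(root.publish("sensor-down"))  # (2, 'sensor-down')
```

The counter is exactly the state that root replicas must agree on (e.g., via Paxos) to survive a root failure without breaking the order.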
Hermes

Hermes [90] focused on providing a distributed event-based middleware with an underlying P2P overlay for scalability and reliability. Inspired by work done in Distributed Hash Table (DHT) overlay routing [88, 91], it also has some notion of rendezvous similar to [81]. It bridges the gap between programming-language type semantics and low-level event primitives by introducing the concepts of event-type and event-attributes, which have some common ground with an Interface Description Language (IDL) in the RPC context. In order to improve performance, it is possible during the subscription process to attach a filter expression to the event attributes. Several algorithms are proposed for improving availability, but they all provide weak consistency properties.

2.4.2 Resource Computing

There is a growing interest in harvesting and managing the spare computing power of the increasing number of networked devices, both public and private, as reported in [92, 93, 94, 95]. Some relevant examples are:

BOINC

BOINC (Berkeley Open Infrastructure for Network Computing) [96] aimed to facilitate the harvesting of public computing resources by the scientific research community. BOINC implements a redundant computing mechanism to prevent malicious or erroneous computational results. Each project specifies the number of results that should be created for each “workunit”, i.e., the basic unit of computation to be performed. When some number of the results are available, an application-specific function is called to evaluate the results and possibly choose a canonical result. If no consensus is achieved, or if the results simply fail, a new set of results is computed. This process repeats until a successful consensus is achieved or an application-defined timeout occurs.

P2P-MapReduce

Developed at Google, MapReduce [97] is a programming model that is able to parallelize the processing of large data sets in a distributed environment.
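BOINC's redundant computing, described above, can be sketched as a quorum vote over the results of one workunit (real projects plug in an application-specific comparison function instead of plain equality):

```python
from collections import Counter

def canonical_result(results, quorum):
    """Sketch of BOINC-style result validation: the same workunit is
    computed by several hosts; a result becomes canonical once
    `quorum` replicas agree, otherwise more results are needed."""
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None

print(canonical_result([42, 42, 41], quorum=2))  # 42
print(canonical_result([42, 41, 40], quorum=2))  # None -> recompute
```

A `None` return models the "no consensus" branch: the project schedules a new set of results and retries until agreement or an application-defined timeout.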
It follows a master-slave model, where a master distributes the data set across a set of slaves, collecting at the end the computational results (from the map or reduce tasks). MapReduce provides fault-tolerance for slave nodes by reassigning the failed job to an alternative active slave, but lacks support for master failures. P2P-MapReduce [98] provides fault-tolerance by resorting to two distinct P2P overlays, one containing the currently available masters in
the system, and the other the active slaves. When a user submits a MapReduce job, it queries the master overlay for a list of the available masters (ordered by their workload). It then selects a master node and the number of replicas. After this, the master node notifies its replicas that they will participate in the current job. A master node is responsible for periodically synchronizing the state of the job over its replica set. In case of failure, a distributed procedure is executed to elect the new master from among the active replicas. Finally, the master selects the set of slaves from the slave overlay, using a performance metric based on workload and CPU performance, and starts the computation.

2.4.3 Storage

Storage systems were one of the most prevalent applications of first-generation P2P systems. Evolving from early file-sharing systems, and with the help of DHT middleware, they have now become the choice for large-scale storage systems in both industry and academia.

openDHT

Work done in [99] aimed to provide a lightweight framework for P2P storage using DHTs (as in [88, 91]) in a public environment. The key challenge was to handle mutually untrusting clients while guaranteeing fairness in the access to and allocation of storage. The work was able to provide fair access to the underlying storage capacity, under the assumption that storage capacity is free. Because of its intrinsically fair approach, the system is unable to provide any type of Service Level Agreement (SLA) to its clients, thus reducing the domain of applications that can use it.

Dynamo

Recent research on data storage [25] and distribution at Amazon focuses on key-value approaches using P2P overlays, more precisely DHTs, to overcome the well-explored limitation of simultaneously providing high availability and strong consistency (through synchronous replication) [100, 101].
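The master failover step in P2P-MapReduce, described above, boils down to the surviving replicas deterministically agreeing on a successor. A minimal sketch (the "lowest node id still alive" rule is an illustrative stand-in for the distributed election the real system runs over the master overlay):

```python
def elect_master(replicas, alive):
    """Pick a new master among the job's replicas: every survivor
    applies the same deterministic rule, so they agree without
    further messages (assuming they share the same `alive` view)."""
    candidates = [r for r in replicas if r in alive]
    return min(candidates) if candidates else None

replicas = ["node-3", "node-7", "node-1"]
# node-1 (the old master) has died; node-3 and node-7 survive.
print(elect_master(replicas, alive={"node-7", "node-3"}))  # node-3
```

In practice the replicas' views of `alive` may diverge, which is why the real protocol needs an explicit distributed election rather than a purely local rule.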
The approach taken was to use an optimistic replication scheme that relies on asynchronous replica synchronization (also known as passive replication). The consistency conflicts between different replicas, caused by network and server failures, are resolved at 'read time', as opposed to the more traditional 'write time' strategy, in order to maximize the write availability of the system. Such conflicts are resolved by the services, allowing for a more efficient resolution (although the system offers a default 'last value holds' strategy
to the services). Dynamo offers efficient key-value storage while maximizing the availability of write operations. Nevertheless, the ring-based overlay hampers the scalability of the system and, depending on the partitioning strategy used, the membership process does not seem efficient.

2.5 P2P+RT+FT Middleware Systems

These types of systems are a natural evolution of previous FT-RT middleware systems. They aim to provide scalability and resilience through a P2P network infrastructure that is able to provide lightweight FT mechanisms, allowing them to support soft RT semantics. We first proposed an architecture [102, 103] for a general-purpose middleware that aimed to integrate FT into the P2P network layer, while being able to provide RT support. The first implementation of the architecture, in Java, was done in DAEM [6, 104]. This work used a hierarchical tree-based P2P overlay based on P3 [15]. FT support was performed at all levels of the tree, resulting in a high availability rate, but the use of JGroups [7] for maintaining strong consistency, for both mesh and service data, resulted in high overhead. Due to its highly coupled tree architecture, faults had a major impact on availability when they occurred near the root node, as they produced a cascading failure. Initial support for RT was provided, but the high overhead of the replication infrastructure limited its applicability.

2.6 A Closer Look at TAO, MEAD and ICE

This section provides a closer look at the middleware systems that provided us with several strategies and insights used to design and implement Stheno, our middleware solution that is able to support RT, FT and P2P. All the referred systems, TAO, MEAD, and ICE, share a service-oriented architecture with a client-server network model. In terms of RT, both TAO and MEAD support the RT-CORBA standard, while ICE only supports best-effort invocations.
As for FT support, TAO and ICE use high-level services, whereas MEAD uses a hybrid approach that combines both low- and high-level services.
2.6.1 TAO

TAO is a classical RPC middleware and therefore only supports the client-server network model. Name resolution is provided by a high-level service, representing a clear point-of-failure and a bottleneck.

RT Support. TAO supports the RT-CORBA specification 1.0, with the most important features being: (a) priority propagation; (b) explicit binding; and (c) RT thread pools.

The priority propagation ensures that a request maintains its priority across a chain of invocations. A client issues a request to an Object A, which in turn issues an invocation on another Object B. The request priority at Object A is then used to make the invocation at Object B. There are two types of propagation: server-declared priorities and client-propagated priorities. In the first type, the server dictates the priority that will be used when processing an incoming invocation. In the second, the priority of the invocation is encoded within the request, so the server processes the request at the priority specified by the client.

A source of unbounded priority inversion is the use of multiplexed communication channels. To overcome this, the RT-CORBA specification defines that the network channels should be pre-established, avoiding the latency caused by their creation. This model allows two possible policies: (a) a private connection between the client and the server, or (b) a priority-banded connection that can be shared but limits the priority of the requests that can be made on it.

In CORBA, a thread pool uses a threading strategy, such as leader-followers [11], with the support of a reactor (an object that handles network event de-multiplexing), and is normally associated with an acceptor (an entity that handles incoming connections), a connection cache, and a memory pool. In classic CORBA a high-priority thread can be delayed by a low-priority one, leading to priority inversion.
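The lane mechanism that RT-CORBA uses against this inversion can be sketched as a dispatch table of priority bands, each backed by its own queue (a stand-in for the lane's private acceptor/reactor/threads; the band boundaries are illustrative):

```python
import queue

class ThreadPoolLanes:
    """Sketch of RT-CORBA thread pool lanes: each lane owns a queue
    served only by threads of one fixed priority band, so a
    low-priority request can never delay a high-priority one."""
    def __init__(self, bands):
        # bands: list of (low, high) priority ranges, one lane each.
        self.lanes = {band: queue.Queue() for band in bands}

    def dispatch(self, priority, request):
        for (low, high), lane in self.lanes.items():
            if low <= priority <= high:
                lane.put(request)
                return (low, high)
        raise ValueError("no lane accepts priority %d" % priority)

pool = ThreadPoolLanes([(0, 9), (10, 19)])
print(pool.dispatch(5, "log-write"))     # routed to the (0, 9) lane
print(pool.dispatch(15, "sensor-read"))  # routed to the (10, 19) lane
```

The isolation comes precisely from the per-lane resources: since queues (and, in the real ORB, acceptors, reactors and memory pools) are never shared across bands, there is no shared resource on which a low-priority request could block a high-priority one.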
To avoid this unwanted side-effect, the RT-CORBA specification defines the concept of thread pool lanes. All the threads belonging to a thread pool lane have the same priority, and thus only process invocations that have the same priority (or a band that contains that priority). Because each lane has its own acceptor, memory pool and reactor, the risk of priority inversion is greatly minimized at the expense of a greater resource usage overhead.

FT Support. In an effort to combine RT and FT semantics, the replication style
  • 53. 2.6. A CLOSER LOOK AT TAO, MEAD AND ICE proposed, semi-active, was heavily based on Delta4 [45]. This strategy avoids the latency associated with both warm and cold passive replication [105] and the high overhead and non-determinism of active replication, but represents an extension to the FT specification. Figure 2.2: TAO’s architectural layout (adapted from [3]). Figure 2.2 shows the architectural overview of TAO. The support for FT is achieved through the use of a set of high-level services built on top of TAO. These services include a Fault Notifier, a Fault Detector and a Replication Manager. The Replication Manager is the central component of the FT infrastructure. It acts as central rendezvous to the remaining FT components, and it has the responsibilities of managing the replication groups life-cycle (creation/destruction) and perform group maintenance, that is the election of a new primary, removal of faulty replicas, and updating group information. It is composed by three sub-components: (a) a Group Manager, that manages the group membership operations (adds and removes elements), allows the change of the primary of a given group (for passive replication only), and allows manipulation and retrieval of group member localization; (b) a Property Manager, that allows the manipulation of 51
replication properties, such as the replication style; and (c) a Generic Factory, the entry point for creating and destroying objects.

The Fault Detector is the most basic component of the FT infrastructure. Its role is to monitor components, processes and processing nodes, and to report eventual failures to the Fault Notifier. In turn, the Fault Notifier aggregates these failure reports and forwards them to the Replication Manager.

The FT bootstrapping sequence is as follows:
(a) the Naming Service is started;
(b) the Replication Manager is started;
(c) the Fault Notifier is started, and
(d) finds the Replication Manager and registers itself with it;
(e) in response, the Replication Manager connects to the Fault Notifier as a consumer;
(f) each participating node starts a Fault Detector Factory and a Replica Factory, which in turn register themselves with the Replication Manager;
(g) a group creation request is made to the Replication Manager (by a foreign entity, referred to as the Object Group Creator), followed by a request for the list of available Fault Detector Factories and Replica Factories;
(h) this is followed by a request to create an object group in the Generic Factory;
(i) the Object Group Creator then bootstraps the desired number of replicas using the Replica Factory at each target node; in turn, each Replica Factory creates the actual replica and, at the same time, starts a Fault Detector at each site using the Fault Detector Factory; each of these detectors finds the Replication Manager, retrieves the reference to the Fault Notifier, and connects to it as a supplier;
(j) each replica is added to the object group by the Object Group Creator, using the Group Manager at the Replication Manager;
(k) at this point, a client is started, retrieves the object reference from the Naming Service, and makes an invocation on that group.
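To make the component roles concrete, here is a deliberately simplified sketch (class and method names are ours, not TAO's): a fault report flows from a detector through the Fault Notifier to the Replication Manager, which removes the faulty member and, for passive replication, treats the head of the group as the elected primary.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Miniature model of TAO's FT infrastructure (illustrative only).
struct ReplicationManager {
    std::vector<std::string> group;  // group.front() plays the primary role

    void add_replica(const std::string& id) { group.push_back(id); }

    // Invoked by the Fault Notifier when a failure report arrives:
    // remove the faulty replica; the new head becomes the primary.
    void on_fault(const std::string& id) {
        group.erase(std::remove(group.begin(), group.end(), id), group.end());
    }

    std::string primary() const { return group.empty() ? "" : group.front(); }
};

struct FaultNotifier {
    ReplicationManager* consumer = nullptr;  // step (e): the manager subscribes
    void report(const std::string& id) { if (consumer) consumer->on_fault(id); }
};
```

A failure of the primary "A" in a group {A, B, C} would thus be reported by a detector via `report("A")`, after which `primary()` yields "B". The real infrastructure adds factories, naming, and supplier registration, which this sketch omits.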
The invocation is then carried out by the primary of the replication group.

Proactive FT Support. An alternative approach has been proposed by FLARe [74], which focuses on proactively adapting the replication group to the load present in the system. The replication style is limited to semi-active replication using state transfer, which is commonly referred to simply as passive replication.

Figure 2.3: FLARe's architectural layout (adapted from [74]).

Figure 2.3 shows the architectural overview of FLARe. This new architecture adds three components to TAO's FT infrastructure: (a) a client interceptor, which redirects invocations to the proper server, as the initial reference may have been changed by the proactive strategy in response to a load change; (b) a redirection agent, which receives the updates with these changes from the Replication Manager; and (c) a resource monitor, which monitors the load on a processing node and sends periodic updates to the Replication Manager. In the presence of abnormal load fluctuations, the Replication Manager changes the replication group to adapt to the new conditions, by creating replicas on lightly-loaded nodes and, if required, by switching the primary to a better-suited replica.

TAO's fault-tolerance support relies on a centralized infrastructure, and its main component, the Replication Manager, represents a major obstacle to the system's scalability and resiliency: no mechanisms are provided to replicate this entity.

2.6.2 MEAD

MEAD focuses on providing fault-tolerance support in a non-intrusive way, enhancing distributed RT systems with transparent, yet tunable, FT that is proactively dependable through resource awareness, and that offers scalable and fast fault-detection and fault-recovery. It uses RT CORBA, more specifically TAO, as a proof-of-concept.

Transparent Proactive FT Support. MEAD's architecture contains three major components, namely, the Proactive FT Manager, the MEAD Recovery Manager and the
MEAD Interceptor. The underlying communication is provided by Spread, a group communication framework that offers reliable, totally-ordered multicast, guaranteeing consistency for both component and node membership.

The MEAD Interceptor provides the usual interception of system calls between the application and the underlying operating system. This approach allows the middleware to be enhanced with fault-tolerance in a transparent and non-intrusive way.

Figure 2.4: MEAD's architectural layout (adapted from [14]).

Figure 2.4 shows the architectural overview of MEAD. The main component of the MEAD system is the Proactive FT Manager, which is embedded within the interceptors at both server and client. It is responsible for monitoring resource usage at each server and for initiating a proactive recovery scheme based on a two-step threshold. When resource usage rises above the first threshold, the proactive manager sends a request to the MEAD Recovery Manager to launch a new replica. If usage rises above the second threshold, the proactive manager starts migrating the replica's clients to the next non-faulty replica server.

The MEAD Recovery Manager has some similarities with the Replication Manager of FT-CORBA, as it must also launch new replicas in the presence of failures (of a node or server). In MEAD, however, the recovery manager does not follow a centralized architecture, as in TAO or FLARe, where all the components of the FT infrastructure are connected to the replication manager; instead, the components are connected by a reliable, totally-ordered group communication framework that establishes an implicit agreement at each communication round. These frameworks also provide a notion of view, i.e. an instantaneous