Comparative Analysis of Minimum-Process Coordinated Checkpointing Algorithms

International Journal of Computer Engineering (IJCET), ISSN 0976 – 6367(Print),
International Journal of Computer Engineering and Technology
ISSN 0976 – 6375(Online) Volume 1, Number 1, May - June (2010), © IAEME
and Technology (IJCET), ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online) Volume 1
IJCET
Number 1, May - June (2010), pp. 46-56 ©IAEME
© IAEME, http://www.iaeme.com/ijcet.html

A COMPARATIVE ANALYSIS OF MINIMUM-PROCESS
COORDINATED CHECKPOINTING ALGORITHMS FOR
MOBILE DISTRIBUTED SYSTEMS
Preeti gupta
Singhania University
Pacheri Bari, Rajasthan

Parveen kumar
Meerut Institute of Engineering & Technology
Meerut, Email:pk223475@yahoo.com

Anil kumar solanki
Meerut Institute of Engineering & Technology
Meerut, India
ABSTRACT
Fault Tolerance Techniques enable systems to perform tasks in the presence of
faults. A checkpoint is a local state of a process saved on stable storage. While dealing
with Mobile Distributed systems, we come across some issues like: mobility, low
bandwidth of wireless channels and lack of stable storage on mobile nodes,
disconnections, limited battery power and high failure rate of mobile nodes. These
issues make traditional check pointing techniques designed for Distributed systems
unsuitable for Mobile environments. To take a checkpoint, an MH has to transfer a large
amount of checkpoint data to its local MSS over the wireless network. Since the wireless
network has low bandwidth and MHs have low computation power, all-process check
pointing will waste the scarce resources of the mobile system on every checkpoint.
Minimum-process coordinated check pointing is a preferred approach for mobile
distributed systems. In this paper, we discuss various existing minimum-process check
pointing protocols for mobile distributed systems.
Keywords: Check pointing algorithms; parallel & distributed computing; rollback
recovery; fault-tolerant systems

46

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print),

1. INTRODUCTION
Parallel computing with clusters of workstations is being used extensively as they
are cost-effective and scalable, and are able to meet the demands of high performance
computing. Increase in the number of components in such systems increases the failure
probability. It is, thus, necessary to examine both hardware and software solutions to
ensure fault tolerance of such parallel computers. To provide fault tolerance, it is
essential to understand the nature of the faults that occur in these systems. There are
mainly two kinds of faults: permanent and transient. Permanent faults are caused by
permanent damage to one or more components and transient faults are caused by changes
in environmental conditions. Permanent faults can be rectified by repair or replacement
of components. Transient faults remain for a short duration of time and are difficult to
detect and deal with. Hence it is necessary to provide fault tolerance particularly for
transient failures in parallel computers. Fault-tolerant techniques enable a system to
perform tasks in the presence of faults. It is easier and more cost effective to provide
software fault tolerance solutions than hardware solutions to cope with transient failures
[1, 2].
Local checkpoint is the saved state of a process at a processor at a given instance.
Global checkpoint is a collection of local checkpoints, one from each process. A global
state is said to be “consistent” if it contains no orphan message; i.e., a message whose
receive event is recorded, but its send event is lost. A transit message is a message whose
send event has been recorded by the sending process but whose receive event has not
been recorded by the receiving process [1, 7].
The problem of taking a checkpoint in a message passing distributed system is
quite complex because any arbitrary set of checkpoints cannot be used for recovery [9].
This is due to the fact that the set of checkpoints used for recovery must form a consistent
global state.
Checkpointing is classified into following categories:
• Asynchronous/Uncoordinated Checkpointing
• Synchronous/Coordinated Checkpointing
• Quasi-Synchronous or Communication-induced Checkpointing

47


• Message Logging based Checkpointing
The problem of taking a checkpoint in a message passing distributed system is
quite complex because any arbitrary set of checkpoints cannot be used for recovery.
This is due to the fact that the set of checkpoints used for recovery must form a consistent
global state.
1.1 Asynchronous/ Checkpointing
Under the asynchronous approach, checkpoints at each process are taken
independently without any synchronization among the processes. Because of absence of
synchronization, there is no guarantee that a set of local checkpoints taken will be a
consistent set of checkpoints. Thus, a recovery algorithm has to search for the most recent
consistent set of checkpoints before it can initiate recovery [8], [9].
1.2 Synchronous / Coordinated Checkpointing
In coordinated or synchronous checkpointing, processes coordinate their local
checkpointing actions such that the set of all recent checkpoints in the system is
guaranteed to be consistent [add reference list……]. In case of a fault, every process
restarts from its most recent permanent/committed checkpoint. Hence, this approach
simplifies recovery and it does not suffer from domino-effect. Furthermore, coordinated
checkpointing requires each process to maintain only one permanent checkpoint on stable
storage, reducing storage overhead and eliminating the need for garbage collection. Its
main disadvantage is the large latency involved in output commits [15].
A straightforward approach to coordinate checkpointing is to block
communications while the checkpointing process executes. A coordinator takes a
checkpoint and broadcasts a request message to all processes, asking them to take a
checkpoint. When a process receives a message, it stops its execution, flushes all the
communication channels, takes a tentative checkpoint, and sends an acknowledgement
message back to the coordinator. After the coordinator receives acknowledgement from
all processes, it broadcasts a commit message that completes the two phase checkpointing
protocol. After receiving the commit message, each process receives the old permanent
checkpoint and makes the tentative checkpoint permanent. The process is then free to
resume execution and exchange messages with other processes. The coordinated

48


checkpointing algorithms can also be classified into following two categories: minimum-
process and all process algorithms. In all-process coordinated checkpointing algorithms,
every process is required to take its checkpoint in an initiation [7], [8]. In minimum-
process algorithms, minimum interacting processes are required to take their checkpoints
in an initiation.
The coordinated checkpointing protocols can be classified into two types:
blocking and non-blocking. In blocking algorithms, as mentioned above, some blocking
of processes takes place during checkpointing [4]. In non-blocking algorithms, no
blocking of processes is required for checkpointing [5].
In a centralized algorithm like Chandy-lamport [7], there is one node which
always initiates the checkpoints and coordinates the participating nodes. The
disadvantage of a centralized algorithm is that all nodes have to initiate checkpoints
whenever the centralized node decides to checkpoint. Nodes can be given autonomy in
initiating checkpoints by allowing any node in the system to initiate checkpoints.
1. 3 Mobile Distributed Computing System
A mobile computing system consists of a large number of MHs and relatively
fewer MSSs. The distributed computation we consider consists of n spatially separated
sequential processes denoted by P0, P1, ..., Pn-1, running on fail-stop MHs or on MSSs.
Each MH or MSS has one process running on it. The static network provides reliable,
sequenced delivery of messages between any two MSSs, with arbitrary message latency.
The wireless network within a cell ensures FIFO delivery of messages between an MSS
and a local MH, i.e., there exists a FIFO channel from an MH to its local MSS, and
another FIFO channel from the MSS to the MH. If an MH did not leave the cell, then
every message sent to it from the local MSS would be received in the sequence in which
they are sent [2].
To send a message from an MH h1 to another MH h2, h1 first sends the message to
its local MSS over the wireless network. This MSS then forwards the message to the
local MSS of h2 which forwards it to h2 over its local wireless network. Since the location
of an MH within the network is neither fixed nor universally known to the network,
because, an MH switches cells frequently. The local MSS of h1 needs to first determine

49


the MSS that currently serves h2. This is essentially the problem that has been tackled
through a variety of routing protocols at the network layer. Hence, the cost incurred for
routing and delivering a message to an MH, varies with the specific routing protocol
being used. Here, we have assumed that any message destined for an MH incurs a fixed
search cost [2].
1.4 Checkpointing Issues in Mobile Distributed Computing Systems
The existence of mobile nodes in a distributed system introduces new issues that
need proper handling while designing a checkpointing algorithm for such systems. These
issues are mobility, disconnections, finite power source, vulnerable to physical damage,
lack of stable storage etc. [2]. The location of an MH within the network, as represented
by its current local MSS, changes with time. Checkpointing schemes that send control
messages to MH’s, will need to first locate the MH within the network, and thereby incur
a search overhead [2]. Due to vulnerability of mobile computers to catastrophic failures,
disk storage of an MH is not acceptably stable for storing message logs or local
checkpoints. Checkpointing schemes must therefore, rely on an alternative stable
repository for an MH’s local checkpoint [2]. Disconnections of one or more MHs should
not prevent recording the global state of an application executing on MHs. It should be
noted that disconnection of an MH is a voluntary operation, and frequent disconnections
of MHs is an expected feature of the mobile computing environments [2]. The battery at
the MH has limited life. To save energy, the MH can power down individual components
during periods of low activity [2]. This strategy is referred to as the doze mode operation.
An MH in doze mode is awakened on receiving a message. Therefore, energy
conservation and low bandwidth constraints require the checkpointing algorithms to
minimize the number of synchronization messages and the number of checkpoints.
2 SOME MINIMUM-PROCESS COORDINATED CHECKPOINTING
PROTOCOLS FOR MOBILE DISTRIBUTED SYSTEMS
2.1 GUOHONG CAO AND MUKESH SINGHAL ALGORITHM [5]
Cao and Singhal achieved non-intrusiveness in the minimum-process algorithm
by introducing the concept of mutable checkpoints. In their algorithm, initiator, say Pin,
sends the checkpoint request to any process, say Pj, only if Pin receives m from Pj in the

50


current CI. Pj takes its tentative checkpoint if Pj has sent m to Pin in the current CI;
otherwise, Pj concludes that the checkpoint request is a useless one. Similarly, when Pj
takes its tentative checkpoint, it propagates the checkpoint request to other processes.
This process is continued till the checkpoint request reaches all the processes on which
the initiator transitively depends and a checkpointing tree is formed. During
checkpointing, if Pi receives m from Pj such that Pj has taken some checkpoint in the
current initiation before sending m, Pi may be forced to take a checkpoint, called mutable
checkpoint. If Pi is not in the minimum set, its mutable checkpoint is useless and is
discarded on commit. The huge data structure MR[] is also attached with the checkpoint
requests to reduce the number of useless checkpoint requests. The response from each
process is sent directly to initiator.
2.2 KUMAR AND KUMAR ALGORITHM [14]
They proposed an algorithm which is based on keeping track of direct
dependencies of processes. Initiator MSS collects the direct dependency vectors of all
processes, computes the tentative minimum set (minimum set or its subset), and sends the
checkpoint request along with the tentative minimum set to all MSSs. This step is taken
to reduce the time to collect the coordinated checkpoint. It will also reduce the number of
useless checkpoints and the blocking of the processes. Suppose, during the execution of
the checkpointing algorithm, Pi takes its checkpoint and sends m to Pj. Pj receives m
such that it has not taken its checkpoint for the current initiation and it does not know
whether it will get the checkpoint request. If Pj takes its checkpoint after processing m, m
will become orphan. In order to avoid such orphan messages, they propose the following
technique. If Pj has sent at least one message to a process, say Pk and Pk is in the
tentative minimum set, there is a good probability that Pj will get the checkpoint request.
Therefore, Pj takes its induced checkpoint before processing m. An induced checkpoint is
similar to the mutable checkpoint [18]. In this case, most probably, Pj will get the
checkpoint request and its induced checkpoint will be converted into permanent one.
There is a less probability that Pj will not get the checkpoint request and its induced
checkpoint will be discarded. Alternatively, if there is not a good probability that Pj will
get the checkpoint request, Pj buffers m till it takes its checkpoint or receives the commit
message. They have tried to minimise the number of useless checkpoints and blocking of

51


the process by using the probabilistic approach and buffering selective messages at the
receiver end. Exact dependencies among processes are maintained. It abolishes the
useless checkpoint requests and reduces the number of duplicate checkpoint requests.
2.3 SILVA AND SILVA ALGORITHM [15]
The authors proposed all process coordinated checkpointing protocol for
distributed systems. The non-intrusiveness during checkpointing is achieved by
piggybacking monotonically increasing checkpoint number along with computational
message. When a process receives a computational message with the high checkpoint
number, it takes its checkpoint before processing the message. When it actually gets the
checkpoint request from the initiator, it ignores the same. If each process of the
distributed program is allowed to initiate the checkpoint operation, the network may be
flooded with control messages and process might waste their time making unnecessary
checkpoints. In order to avoid this, Silva and Silva give the key to initiate checkpoint
algorithm to one process. The checkpoint event is triggered periodically by a local timer
mechanism. When this timer expires, the initiator process the checkpoint state of process
running in the machine and force all the others to take checkpoint by sending a broadcast
message. The interval between adjacent checkpoints is called checkpoint interval.
2.4 KOO-TOUEG’S ALGORITHM [10]
They proposed a minimum process blocking checkpointing algorithm for
distributed systems. The algorithm consists of two phases. During the first phase, the
checkpoint initiator identifies all processes with which it has communicated since the last
checkpoint and sends them a request. Upon receiving the request, each process in turn
identifies all process it has communicated with since the last checkpoint and sends them a
request, and so on, until no more processes can be identified. During the second phase, all
process identified in the first phase take a checkpoint. The result is a consistent
checkpoint that involves only the participating processes. In this protocol, after a process
takes a checkpoint, it cannot send any message until the second phase terminates
successfully, although receiving messages after the checkpoint is permissible.
2.5 COA AND SINGHAL BLOCKING ALGORITHM [4]
They presented a minimum process checkpointing algorithm in which, the
dependency information is recorded by a Boolean vector. This algorithm is a two-phase

52


protocol and saves two kinds of checkpoint on the stable storage. In the first phase, the
initiator sends a request to all processes to send their dependency vector. On receiving the
request, each process sends its dependency vector. Having received all the dependency
vectors, the initiator constructs an N*N dependency matrix with one row per process,
represented by the dependency vector of the process, Based on the dependency matrix,
the initiator can locally calculate all the process on which the initiator transitively depend.
After the initiator finds all the process that need to take their checkpoints, it adds them to
the set Sforced and asks them to take checkpoints. Any process receiving a checkpoint
request takes the checkpoint and sends a reply. The process has to be blocked after
receiving the dependency vectors request and resumes its computation after receiving a
checkpoint request.
2.6 PRAKASH-SINGHAL ALGORITHM [16]
It forces only a minimum number of processes to take checkpoints and does not
block the underlying computation during checkpointing. Prakash Singhal algorithm
forces only part of processes to take checkpoints, the csn of some processes may be out-
of-date, and may not be able to avoid inconsistencies [8]. Prakash-Singhal algorithm
attempts to solve this problem by having each process maintains an array to save the csn,
where csni[i] has been the expected csn of Pi. Note that Pi’s csnj[i] may be different from
Pj's csnj [i] if there is no communication between Pi and Pj for several checkpoint
intervals. By using csn and the initiator identification number, they claim that their non-
blocking algorithm can avoid inconsistencies and minimize the number of checkpoints
during checkpointing.
2.7. P. KUMAR’S HYBRID ALGORITHM [13]
In minimum-process coordinated checkpointing, some processes may not
checkpoint for several checkpoint initiations. In the case of a recovery after a fault, such
processes may rollback to far earlier checkpointed state and thus may cause greater loss
of computation. In all-process coordinated checkpointing, the recovery line is advanced
for all processes but the checkpointing overhead may be exceedingly high. To optimize
both matrices, the checkpointing overhead and the loss of computation on recovery,
P.Kumar proposed a hybrid checkpointing algorithm, wherein an all-process coordinated
checkpoint is taken after the execution of minimum-process coordinated checkpointing

53


algorithm for a fixed number of times. Thus, the Mobile nodes with low activity or in
doze mode operation may not be disturbed in the case of minimum-process checkpointing
and the recovery line is advanced for each process after an all-process checkpoint.
Additionally, he tried to minimize the information piggybacked onto each computation
message. For minimum-process checkpointing, he designed a blocking algorithm, where
no useless checkpoints are taken and an effort has been made to optimize the blocking of
processes. He proposed to delay selective messages at the receiver end. By doing so,
processes are allowed to perform their normal computation, send messages and partially
receive them during their blocking period. The proposed minimum-process blocking
algorithm forces zero useless checkpoints at the cost of very small blocking.
3. CONCLUSION
The existence of mobile nodes in a distributed system introduces new issues that
need proper handling while designing a checkpointing algorithm for such systems. These
issues are mobility, disconnections, finite power source, vulnerable to physical damage,
lack of stable storage, etc. The new issues make traditional checkpointing techniques
unsuitable to checkpoint mobile distributed systems. A good checkpointing protocol for
mobile distributed systems should have low memory overheads on MHs, low overheads
on wireless channels and should avoid awakening of an MH in doze mode operation. The
disconnection of an MH should not lead to infinite wait state. The algorithm should be
non-intrusive and should force minimum number of processes to take their local
checkpoints.
Minimum-process coordinated checkpointing is a suitable approach to introduce
fault tolerance in mobile distributed systems transparently. This approach is domino-free,
requires at most two checkpoints of a process on stable storage, and forces only a
minimum number of processes to checkpoint. It may require blocking of processes, extra
synchronization messages, piggybacking of some information along with computation
messages, or taking some useless checkpoints.

54


In this paper we have given introductory concepts related to checkpointing in
mobile distributed systems. We also gave an analysis of various minimum-process
checkpointing schemes especially designed for mobile distributed systems.
REFERENCES
[1] Singhal, M, N G Shivratri, “Advanced concepts in Operating Systems, Tata Mc Graw
Hill, 1994.
[2] Acharya A., “Structuring Distributed Algorithms and Services for networks with
Mobile Hosts”, Ph.D. Thesis, Rutgers University, 1995.
[3] Cao G. and Singhal M., “On coordinated checkpointing in Distributed Systems”,
IEEE Transactions on Parallel and Distributed Systems, vol. 9, no.12, pp. 1213-1225,
Dec 1998.
[4] Cao G. and Singhal M., “On the Impossibility of Min-process Non-blocking
Checkpointing and an Efficient Checkpointing Algorithm for Mobile Computing
Systems,” Proceedings of International Conference on Parallel Processing, pp. 37-
44, August 1998.
[5] Cao G. and Singhal M., “Mutable Checkpoints: A New Checkpointing Approach for
Mobile Computing systems,” IEEE Transaction On Parallel and Distributed Systems,
vol. 12, no. 2, pp. 157-172, February 2001.
[6] Cao G. and Singhal M., “Checkpointing with Mutable Checkpoints”, Theoretical
Computer Science, 290(2003), pp. 1127-1148.
[7] Chandy K. M. and Lamport L., “Distributed Snapshots: Determining Global State of
Distributed Systems,” ACM Transaction on Computing Systems, vol. 3, No. 1, pp. 63-
75, February 1985.
[8] Elnozahy E.N., Alvisi L., Wang Y.M. and Johnson D.B., “A Survey of Rollback-
Recovery Protocols in Message-Passing Systems,” ACM Computing Surveys, vol. 34,
no. 3, pp. 375-408, 2002.
[9] Kalaiselvi, S., Rajaraman, V., “A Survey of Checkpointing Algorithms for Parallel
and Distributed Systems”, Sadhna, Vol. 25, Part 5, October 2000, pp. 489-510.

55


[10] Koo R. and Toueg S., “Checkpointing and Roll-Back Recovery for Distributed
Systems,” IEEE Trans. on Software Engineering, vol. 13, no. 1, pp. 23-31, January
1987.
[11] Parveen Kumar, Lalit Kumar, R K Chauhan, V K Gupta “A Non-Intrusive Minimum
Process Synchronous Checkpointing Protocol for Mobile Distributed Systems”
Proceedings of IEEE ICPWC-2005, January 2005.
[12] Pushpendra Singh, Gilbert Cabillic, “A Checkpointing Algorithm for Mobile
Computing Environment”, LNCS, No. 2775, pp 65-74, 2003.
[13] Parveen Kumar, “A Low-Cost Hybrid Coordinated Checkpointing Protocol for
Mobile Distributed Systems”, Mobile Information Systems [An International Journal
from IOS Press, Netherlands] pp 13-32, Vol. 4, No. 1, 2007.
[14] Lalit Kumar, Parveen Kumar “A Synchronous Checkpointing Protocol for Mobile
Distributed Systems: A Probabilistic Approach”, International Journal of
Information and Computer Security [An International Journal from Inderscience
Publishers, USA], pp 298-314, Vol. 3 No. 1, 2007.
[15] Silva L, Silva J 1992 Global checkpointing for distributed programs. Proc. IEEE
11th Symp. On Reliable Distributed Syst. pp 155-162.
[16] R. Prakash and M. Singhal. “Low-Cost Checkpointing and Failure Recovery in
Mobile Computing Systems”. IEEE Trans. on Parallel and Distributed System,
pages 1035-1048,Oct. 1996.

56

Comparative Analysis of Minimum-Process Coordinated Checkpointing Algorithms

Recommended

Recommended

More Related Content

What's hot

What's hot (13)

Viewers also liked

Viewers also liked (8)

Similar to Comparative Analysis of Minimum-Process Coordinated Checkpointing Algorithms

Similar to Comparative Analysis of Minimum-Process Coordinated Checkpointing Algorithms (20)

More from iaemedu

More from iaemedu (20)

Comparative Analysis of Minimum-Process Coordinated Checkpointing Algorithms