Research Article Open Access

Recovery in Distributed Systems from Transient and Permanent Faults

M. Aliouat and Z. Aliouat
Journal of Computer Science
Volume 3 No. 8, 2007, 617-623

DOI: https://doi.org/10.3844/jcssp.2007.617.623

Submitted On: 22 May 2007 Published On: 31 August 2007

How to Cite: Aliouat, M. & Aliouat, Z. (2007). Recovery in Distributed Systems from Transient and Permanent Faults. Journal of Computer Science, 3(8), 617-623. https://doi.org/10.3844/jcssp.2007.617.623

Abstract

Recovery from transient faults in distributed systems has been studied intensively in the past, but to the best of our knowledge, none of these studies has addressed transient and permanent hard faults together. This study is devoted to recovery processes in a distributed environment in the presence of hard faults, whether transient or permanent. The recovery mechanism we present can be based on any one of six proposed strategies that combine checkpointing and message logging between the processes of a distributed application. The number of strategies, although exhaustive, is system-dependent. The strategies are examined with respect to how recovery propagates through processes, in order to prevent the well-known and troublesome domino effect. The framework considered is a distributed system composed of a set of autonomous nodes, each running its own local system, some of which are designated to replace failing nodes in case of a permanent fault. Our main contribution is to enable a distributed application to meet its requirement of completing its mission in spite of node crashes. Preliminary experimental results for a fault-tolerant mechanism based on one of the proposed strategies suggest that our proposals are conclusive.
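The domino effect that the strategies aim to prevent can be illustrated with a small sketch of rollback propagation under uncoordinated checkpointing. This is a generic textbook illustration, not one of the six strategies proposed in the paper; the event model, the `Message` record and the `recovery_line` function are hypothetical. A message is an orphan when the restored state keeps its receipt but not its sending, and each orphan forces its receiver back to an earlier checkpoint. The `replay_log=True` option assumes pessimistic receiver-side message logging on stable storage and piecewise-deterministic processes, so the failed process can replay its log up to its pre-crash state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: int      # index of the sending process
    send_event: int  # position of the send event in the sender's local history
    receiver: int    # index of the receiving process
    recv_event: int  # position of the receive event in the receiver's local history


def recovery_line(checkpoints, history_len, messages, failed, replay_log=False):
    """Return, per process, how many local events survive after rollback.

    checkpoints[p]: positions where process p saved its state; position c covers
                    events 0..c-1, and 0 (the initial state) must always be present.
    history_len[p]: number of events p had executed when `failed` crashed.
    replay_log:     hypothetical assumption of pessimistic receiver-side logging
                    on stable storage and piecewise-deterministic processes, so
                    the failed process replays its log up to its pre-crash state.
    """
    cut = list(history_len)  # survivors keep their current state
    if not replay_log:
        # Without a log, the failed process restarts from its latest checkpoint.
        cut[failed] = max(checkpoints[failed])

    changed = True
    while changed:  # propagate rollbacks until no orphan message remains
        changed = False
        for m in messages:
            send_kept = m.send_event < cut[m.sender]
            recv_kept = m.recv_event < cut[m.receiver]
            if recv_kept and not send_kept:
                # Orphan: received but never sent in the restored state, so the
                # receiver rolls back to its latest checkpoint taken before the receive.
                cut[m.receiver] = max(
                    c for c in checkpoints[m.receiver] if c <= m.recv_event
                )
                changed = True
    return cut


# Zigzag pattern: each rollback undoes the send of a message that the other
# process's checkpoint already records, so rollbacks cascade (domino effect).
ckpts = [[0, 2], [0, 1, 3]]
history = [4, 4]
msgs = [
    Message(0, 0, 1, 0),  # A: P0 -> P1
    Message(1, 1, 0, 1),  # B: P1 -> P0
    Message(0, 2, 1, 2),  # C: P0 -> P1
    Message(1, 3, 0, 3),  # D: P1 -> P0
]

print(recovery_line(ckpts, history, msgs, failed=0))                   # [0, 0]
print(recovery_line(ckpts, history, msgs, failed=0, replay_log=True))  # [4, 4]
```

Without a log, the crash of P0 drives both processes back to their initial states. With `replay_log=True`, the failed process is restored to its pre-crash state and no surviving process has to roll back, which illustrates in general terms why checkpointing is combined with message logging.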


Keywords

  • recovery
  • distributed systems
  • transient and permanent fault tolerance