Determining the Last Membership of a Process Group after a Total Failure (1997)

Author(s): Black D, Ezhilchelvan PD, Shrivastava SK

    Abstract: There is a large category of distributed systems that use component (e.g., process, object) replication for availability. A large part of the effort involved in crafting these systems lies in maintaining the cardinality of the set of replicas. For example, in primary-secondary replication, in the event that one component crashes, it is necessary to create a replacement on some operational machine and hence maintain the cardinality of the set of components at two or more. In systems where failed components are recreated on other machines, the internal composition of a component group (referred to as a unit) may be seen to 'walk' over a number of machines during normal system operation. We are interested in the problem of recovery after a total failure of a unit (a disaster); that is, recovery after all or a large number of unit members have failed or partitioned such that the unit can no longer function normally. Disaster recovery requires that once sufficient members belonging to the unit have restarted or reconnected, the unit should resume functioning without further delay. A particular requirement is that only the components belonging to the last unit configuration be part of the post-disaster unit configuration. This paper presents an algorithm which a component can execute to determine whether it belonged to the last unit configuration. The algorithm has been developed in the context of an asynchronous distributed system where message delays are unknown and therefore a slow component can appear as crashed or disconnected.

      • Date: 1997
      • Series Title: Department of Computing Science Technical Report Series
      • Pages: 19
      • Institution: Department of Computing Science, University of Newcastle upon Tyne
      • Publication type: Report
      • Bibliographic status: Published

      Keywords: distributed systems, failure detection, group membership, replication, total failure

      Staff

      Dr Paul Ezhilchelvan
      Reader in Distributed Computing

      Emeritus Professor Santosh Shrivastava
      Senior Research Investigator