Improving the Reliability of Cooperative Concurrent Systems with Exception Flow Analysis (2008)

Author(s): Castor Filho F, Romanovsky A, Rubira CMF

    Abstract: Developers of fault-tolerant distributed systems must guarantee that the fault tolerance mechanisms they build are themselves reliable. Otherwise, these mechanisms may reduce overall system dependability, defeating the purpose of introducing fault tolerance into the system. To achieve the desired levels of reliability, the development of mechanisms for detecting and handling errors should be rigorous or formal. We present an approach to modeling and verifying fault-tolerant distributed systems that use exception handling as the main fault tolerance mechanism. The proposed approach is based on a formal model for specifying the structure of a system in terms of cooperating participants that handle exceptions in a coordinated manner. We employ coordinated atomic actions as a representative of mechanisms for exception handling in concurrent systems. We have validated the proposed approach by means of two case studies: (i) a system responsible for managing a production cell; and (ii) a medical control system. For both systems, the proposed approach helped us to uncover design faults in the form of implicit assumptions and omissions in the original specifications.

      • Date: June 2008
      • Series Title: School of Computing Science Technical Report Series
      • Pages: 43
      • Institution: School of Computing Science, University of Newcastle upon Tyne
      • Publication type: Report
      • Bibliographic status: Published

      Keywords: fault tolerance, Alloy, Event-B

      Staff

      Professor Alexander Romanovsky
      Professor of Computing Science