CSC3622 : Reliability and Fault Tolerance

Semester 2 Credit Value: 10
ECTS Credits: 5.0


Overview of the concepts of reliability, and a systems approach to the design,
evaluation, and implementation of fault tolerance in computer systems,
exemplified by case studies of present-day systems.

The module aims to provide an overview of the concepts of reliability and a systems approach to the design, evaluation and implementation of fault tolerance in computer systems, exemplified by case studies of present-day systems. Topics covered in the syllabus include: the need for reliability; system dependability concepts and terminology; fault tolerance principles; error detection and recovery; software and hardware fault tolerance; and case studies from Mars and Delta-4.

Outline Of Syllabus

Need for reliability: Faults as the sources of unreliability; anticipated and unanticipated faults; fault prevention and fault tolerance approaches to achieving reliability.
System dependability concepts and terminology: failures, errors, design and component faults.
Fault tolerance: principles, error detection, damage assessment, error recovery, fault treatment; redundancy; TMR systems; programming with exceptions and exception handlers.
Error detection: Ideal measures for error detection; replication checks; timing checks; coding checks.
Error recovery: Forward and backward error recovery; their advantages and limitations; implementation issues in backward error recovery; co-operating processes and recovery lines.
Software fault tolerance: N-version programming, recovery blocks.
Hardware fault tolerance: fault classification and replication strategies; need for agreement among replicas; evaluation of redundancy requirements.
Case studies: Mars, Delta-4.
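
As an informal illustration of two of the syllabus topics above (TMR-style majority voting and recovery blocks), the following is a minimal sketch and is not taken from the module materials. The function names majority_vote and recovery_block, and the sorting example, are hypothetical and chosen only for illustration.

```python
import copy
from collections import Counter


def majority_vote(outputs):
    """Return the value agreed on by a strict majority of replicas.

    With three replicas this is the voter of a TMR arrangement.
    Assumes replica outputs are hashable (illustrative assumption).
    """
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    return value


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in turn until one passes the acceptance test.

    Each alternate runs on a copy of the state, so a failed attempt is
    discarded and the next one starts from the same recovery point
    (a simple form of backward error recovery).
    """
    for alternate in alternates:
        working_copy = copy.deepcopy(state)  # establish the recovery point
        try:
            result = alternate(working_copy)
        except Exception:
            continue  # an exception counts as a detected error
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


# Hypothetical alternates: a faulty primary and a correct secondary.
def faulty_primary(s):
    s["data"].reverse()  # design fault: reversing is not sorting
    return s


def secondary(s):
    s["data"] = sorted(s["data"])
    return s


def is_sorted(s):
    return s["data"] == sorted(s["data"])
```

For example, majority_vote([4, 4, 5]) masks the single disagreeing replica, and recovery_block({"data": [3, 1, 2]}, [faulty_primary, secondary], is_sorted) rejects the primary's result and falls back to the secondary, leaving the caller's original state untouched.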

Teaching Methods

Teaching Activities
Category | Activity | Number | Length | Student Hours | Comment
Scheduled Learning And Teaching Activities | Lecture | 22 | 1:00 | 22:00 | N/A
Scheduled Learning And Teaching Activities | Lecture | 24 | 0:30 | 12:00 | Revision for end of Semester exam & exam duration.
Guided Independent Study | Assessment preparation and completion | 22 | 1:00 | 22:00 | Lecture follow-up
Scheduled Learning And Teaching Activities | Practical | 11 | 1:00 | 11:00 | N/A
Guided Independent Study | Project work | 11 | 1:00 | 11:00 | Coursework
Guided Independent Study | Independent study | 22 | 1:00 | 22:00 | Background reading

Teaching Rationale And Relationship

Techniques and theory are presented in lectures. Supervised practicals in a PC cluster room provide experience of writing programs and using PCs, with help available. Further practical work takes place during the private study hours.

Assessment Methods

The format of resits will be determined by the Board of Examiners.

Description | Length | Semester | When Set | Percentage | Comment
Written Examination | 90 | 2 | A | 80 | N/A

Other Assessment
Description | Semester | When Set | Percentage | Comment
Report | 2 | M | 20 | Research report on a given topic. 1,000 words max.

Assessment Rationale And Relationship

- The mandatory question requires the students to demonstrate their understanding of the theories and approaches covered in the module (by solving specific problems), and it also assesses the students' ability to recognise patterns and relationships between various components of the module.
- The two questions in Section B tend to be more in-depth on specific techniques, and they cover recalling information, summarising facts, comparing approaches, and solving a particular problem.

The coursework requires students to carry out independent research on one of the two suggested topics relevant to reliability and fault tolerance. Independent research is one of the many skills that students will need in their careers.

N.B. This module has both “Exam Assessment” and “Other Assessment” (e.g. coursework). If the total mark for either assessment falls below 35%, the maximum mark returned for the module will normally be 35%.
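
As a worked illustration of the rule above, the following sketch combines the 80% exam and 20% report weightings listed under Assessment Methods and applies the 35% cap. It assumes that "either assessment" means the exam component or the coursework component, and that the cap always applies; the rule says "normally", so actual outcomes rest with the Board of Examiners. The function name module_mark is hypothetical.

```python
def module_mark(exam, coursework, exam_weight=80, coursework_weight=20,
                threshold=35):
    """Weighted module mark (in percent) with the 35% cap applied.

    Illustrative only: weights are given as whole percentages and the
    cap is assumed to apply whenever either component is below the
    threshold.
    """
    weighted = (exam_weight * exam + coursework_weight * coursework) / 100
    if exam < threshold or coursework < threshold:
        return min(weighted, threshold)
    return weighted
```

Under these assumptions, an exam mark of 70 and a report mark of 60 give 68, whereas an exam mark of 80 with a report mark of 30 would give a weighted 70 but is capped at 35.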

Reading Lists