CSC3633 : Reliability and Fault Tolerance (Inactive)

Semester: 1
Credit Value: 10
ECTS Credits: 5.0


Overview of the concepts of reliability, and a systems approach to the design, evaluation, and implementation of fault tolerance in computer systems, exemplified by case studies of present-day systems.
Topics covered in the syllabus include: the need for reliability; system dependability concepts and terminology; fault tolerance principles; error detection and recovery; software and hardware fault tolerance; and case studies of Mars and Delta-4.

Outline Of Syllabus

Need for reliability: Faults as the sources of unreliability; anticipated and unanticipated faults; fault prevention and fault tolerance approaches to achieving reliability.
System dependability concepts and terminology: failures, errors, design and component faults.
Fault tolerance: principles, error detection, damage assessment, error recovery, fault treatment; redundancy; TMR systems; programming with exceptions and exception handlers.
Error detection: Ideal measures for error detection; replication checks; timing checks; coding checks.
Error recovery: Forward and backward error recovery; their advantages and limitations; implementation issues in backward error recovery; co-operating processes and recovery lines.
Software fault tolerance: N-version programming, recovery blocks.
Hardware fault tolerance: fault classification and replication strategies; need for agreement among replicas; evaluation of redundancy requirements.
Case studies: Mars, Delta-4.

Teaching Methods

Teaching Activities
Category                                    Activity                               Number  Length  Student Hours  Comment
Guided Independent Study                    Assessment preparation and completion  22      0:30    11:00          Revision for final exam
Scheduled Learning And Teaching Activities  Lecture                                22      1:00    22:00          Lectures
Scheduled Learning And Teaching Activities  Practical                              11      1:00    11:00          Practicals
Guided Independent Study                    Project work                           11      1:00    11:00          Practical coursework
Guided Independent Study                    Independent study                      23      1:00    23:00          Background reading
Guided Independent Study                    Independent study                      22      1:00    22:00          Lecture follow-up
Teaching Rationale And Relationship

Techniques and theory are presented in lectures. Supervised practicals in a PC cluster room provide experience of
writing programs and using PCs, with help available. Further practical work takes place during private study.

Assessment Methods

The format of resits will be determined by the Board of Examiners.

Description     Length  Semester  When Set  Percentage  Comment
PC Examination  90      1         A         80          N/A
Other Assessment
Description           Semester  When Set  Percentage  Comment
Practical/lab report  1         M         20          Equivalent of 1000 words
Assessment Rationale And Relationship

- The mandatory question requires the students to demonstrate their
understanding of the theories and approaches covered in the module (by solving specific problems), and it also
assesses the students' ability to recognise patterns and relationships between various components of the module.
- The two questions in Section B tend to be more in-depth on
specific techniques and they cover recalling information, summarising facts, comparing approaches, and solving a
particular problem.
The coursework requires students to carry out independent research on one of the two suggested topics relevant
to reliability and fault tolerance. Independent research is one of the many skills that these students will need
in their careers.
N.B. This module has both “Exam Assessment” and “Other Assessment” (e.g. coursework). If the total mark for either
assessment falls below 35%, the maximum mark returned for the module will normally be 35%.
