CSC3622 : Reliability and Fault Tolerance
- Offered for Year: 2018/19
- Module Leader(s): Dr Neil Speirs
- Lecturer: Professor Tom Anderson, Professor Alexander Romanovsky
- Other Staff: Mr Matthew Collison
- Owning School: Computing
- Teaching Location: Newcastle City Campus
- Semester 2 Credit Value: 10
Overview of the concepts of reliability, and a systems approach to the design,
evaluation, and implementation of fault tolerance in computer systems,
exemplified by case studies of present-day systems.
The module aims to provide an overview of the concepts of reliability and a systems approach to the design, evaluation and implementation of fault tolerance in computer systems, exemplified by case studies of present-day systems. Topics covered in the syllabus include: the need for reliability; system dependability concepts and terminology; fault tolerance principles; error detection and recovery; software and hardware fault tolerance; and case studies from Mars and Delta-4.
Outline Of Syllabus
Need for reliability: Faults as the sources of unreliability; anticipated and unanticipated faults; fault prevention and fault tolerance approaches to achieving reliability.
System dependability concepts and terminology: failures, errors, design and component faults.
Fault tolerance: principles, error detection, damage assessment, error recovery, fault treatment; redundancy; TMR systems; programming with exceptions and exception handlers.
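As an illustration of the TMR (triple modular redundancy) topic above, the following is a minimal sketch of majority voting over three replicas. It is not taken from the module materials; the names `majority_vote`, `tmr`, `square` and `faulty_square` are hypothetical, and the sketch assumes replica outputs are hashable values that can be compared exactly.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def majority_vote(outputs: Sequence[T]) -> T:
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # No value was produced by more than half of the replicas.
        raise RuntimeError("no majority: replicas disagree")
    return value


def tmr(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def square(x: int) -> int:
    return x * x


def faulty_square(x: int) -> int:
    # Hypothetical faulty replica, used only to show the fault being masked.
    return x * x + 1


result = tmr([square, square, faulty_square], 7)
print(result)  # 49: the single faulty output is outvoted
```

With three replicas the voter masks one faulty output; if all three disagree, the sketch raises an error rather than guessing.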
Error detection: Ideal measures for error detection; replication checks; timing checks; coding checks.
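A coding check, one of the error detection measures listed above, can be sketched as follows. This is an illustrative example only, assuming a CRC-32 checksum (via Python's standard `zlib.crc32`) appended to each message; the helper names `encode` and `check` and the sample payload are hypothetical.

```python
import zlib


def encode(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 so the receiver can check the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(frame: bytes) -> bool:
    """Return True if the stored CRC matches the recomputed one."""
    payload, stored = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


frame = encode(b"sensor=42")
# Simulate a transmission fault by flipping one bit of the first byte.
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]

print(check(frame))      # True
print(check(corrupted))  # False
```

The check only detects the error; deciding what to do next (retransmit, roll back, or switch to a replica) belongs to error recovery.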
Error recovery: Forward and backward error recovery; their advantages and limitations; implementation issues in backward error recovery; co-operating processes and recovery lines.
Software fault tolerance: N-version programming, recovery blocks.
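The recovery block scheme named above combines backward error recovery with an acceptance test. The sketch below shows one possible shape of it, assuming the program state is a dictionary that can be deep-copied as a checkpoint; `recovery_block`, `buggy_sort`, `simple_sort` and `make_test` are hypothetical names introduced for illustration, not part of the module materials.

```python
import copy
from typing import Callable, Sequence


def recovery_block(
    state: dict,
    alternates: Sequence[Callable[[dict], None]],
    acceptance_test: Callable[[dict], bool],
) -> dict:
    """Try each alternate in turn until one passes the acceptance test."""
    checkpoint = copy.deepcopy(state)  # establish the recovery point
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):
                return state
        except Exception:
            pass  # an exception counts as a failed alternate
        # Backward error recovery: restore the checkpointed state.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("all alternates failed the acceptance test")


def buggy_sort(state: dict) -> None:
    # Hypothetical faulty primary: sorts but drops the largest element.
    state["data"] = sorted(state["data"])[:-1]


def simple_sort(state: dict) -> None:
    state["data"] = sorted(state["data"])


def make_test(expected_len: int) -> Callable[[dict], bool]:
    def ok(state: dict) -> bool:
        d = state["data"]
        return len(d) == expected_len and all(a <= b for a, b in zip(d, d[1:]))
    return ok


state = {"data": [3, 1, 2]}
recovery_block(state, [buggy_sort, simple_sort], make_test(3))
print(state["data"])  # [1, 2, 3]
```

Here the primary fails the acceptance test, the state is rolled back to the checkpoint, and the second alternate produces the accepted result.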
Hardware fault tolerance: fault classification and replication strategies; need for agreement among replicas; evaluation of redundancy requirements.
Case studies: Mars, Delta-4.
|Category||Activity||Number||Length||Student Hours||Comment|
|Scheduled Learning And Teaching Activities||Lecture||24||0:30||12:00||Revision for end of Semester exam & exam duration.|
|Guided Independent Study||Assessment preparation and completion||22||1:00||22:00||Lecture follow-up|
|Scheduled Learning And Teaching Activities||Lecture||22||1:00||22:00||N/A|
|Scheduled Learning And Teaching Activities||Practical||11||1:00||11:00||N/A|
|Guided Independent Study||Project work||11||1:00||11:00||Coursework|
|Guided Independent Study||Independent study||22||1:00||22:00||Background reading|
Teaching Rationale And Relationship
Techniques and theory are presented in lectures. Supervised practicals in a PC cluster room provide experience of writing programs and using PCs, with help available. Further practical work takes place during the private study hours.
The format of resits will be determined by the Board of Examiners.
|Description||Semester||When Set||Percentage||Comment|
|Report||2||M||20||Research report on a given topic. 1,000 words max.|
Assessment Rationale And Relationship
- The mandatory question requires students to demonstrate their understanding of the theories and approaches covered in the module (by solving specific problems); it also assesses their ability to recognise patterns and relationships between the various components of the module.
- The two questions in Section B tend to go into more depth on specific techniques; they cover recalling information, summarising facts, comparing approaches, and solving a particular problem.
The coursework requires students to carry out independent research on one of the two suggested topics relevant to reliability and fault tolerance. Independent research is one of the many skills these students will need in their careers.
N.B. This module has both “Exam Assessment” and “Other Assessment” (e.g. coursework). If the total mark for either assessment falls below 35%, the maximum mark returned for the module will normally be 35%.