School of Computing


Infrastructure for Responsive and Available Distributed Systems

Performance and reliability are two key attributes of any computer system. Designing a distributed system that continues to provide its specified services in the presence of node and network failures, without appreciable degradation in performance, is a challenging task. The services we are considering are required to be responsive: a response must be correct in both the value domain and the time domain. However, we are considering 'soft real-time' applications, where occasional failures in the time domain (late responses) are not a disaster. It is also expected that, for availability and performance reasons, services may need to be replicated, in which case the requirement of strict replica consistency can be waived under certain conditions, in an application-specific manner, to gain both performance and availability. In this project we are considering a very interesting system design problem: services are required to be highly available, but a degree of freedom in sacrificing the correctness and consistency of a response is permissible in order to achieve performance targets that would not otherwise be achievable.
As we see it, the main system design problem is that there is no systematic way of selecting the 'right box of tricks' for building a service. A bewildering choice of fault-tolerance and consistency approaches confronts a designer: synchronous vs asynchronous systems, optimistic vs pessimistic message logging, transactions vs virtual synchrony, replication vs caching, optimistic vs pessimistic concurrency control, and so on. Ideally, given a set of application requirements, we would like to be able to choose the set of techniques that is in some sense 'optimal'. We would also like to be able to maintain this optimality even if operating conditions change significantly from those originally envisaged (e.g., there is a sustained surge of update requests to what was once a read-mostly service).
This suggests a strong requirement for monitoring the operating conditions and for the ability to dynamically change one or more consistency or fault-tolerance mechanisms (in the previous example, changing from optimistic to pessimistic concurrency control). In this collaborative project with BT, we are embarking on a research programme that addresses these issues within the context of large-scale, widely distributed systems.
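As a purely illustrative sketch of the kind of monitoring and dynamic switching described above, and not the project's actual design, the following Python fragment tracks the fraction of update requests in a sliding window and selects a concurrency control mode accordingly. The class name, the window size and both thresholds are hypothetical values chosen for the example; the two thresholds differ so that the mode does not flip back and forth when the workload hovers near a single boundary.

```python
from collections import deque


class ConcurrencyModeMonitor:
    """Hypothetical monitor that picks a concurrency control mode
    from the recent mix of read and update requests."""

    def __init__(self, window=100, high=0.6, low=0.3):
        # window, high and low are illustrative assumptions, not project values.
        self.recent = deque(maxlen=window)  # True = update, False = read
        self.high = high  # switch to pessimistic above this update ratio
        self.low = low    # switch back to optimistic below this update ratio
        self.mode = "optimistic"

    def record(self, is_update):
        """Record one request and return the mode currently in force."""
        self.recent.append(bool(is_update))
        # Wait until the window is full before making any decision.
        if len(self.recent) < self.recent.maxlen:
            return self.mode
        ratio = sum(self.recent) / len(self.recent)
        if self.mode == "optimistic" and ratio > self.high:
            self.mode = "pessimistic"
        elif self.mode == "pessimistic" and ratio < self.low:
            self.mode = "optimistic"
        return self.mode


if __name__ == "__main__":
    monitor = ConcurrencyModeMonitor(window=10)
    # A read-mostly phase followed by a sustained surge of updates.
    for is_update in [False] * 10 + [True] * 8:
        mode = monitor.record(is_update)
    print(mode)
```

In a real service the switch itself would be the difficult part, since requests already in progress under one mechanism must be handled safely while the other takes over; the sketch covers only the monitoring decision.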