EPSRC Centre for Doctoral Training in Cloud Computing for Big Data


Chris Johnson

Distributed stream processing systems (DSPSs) are a class of applications that process large amounts of data in real time, with the processing distributed over a cluster of machines. Apache Storm, Flink, Heron and Spark Streaming are examples of DSPSs.

DSPSs are an important tool in data analytics, and their usage will only grow as the volume of data collected increases and the demand for faster processing rises.

Knowing a priori the exact number of machines needed for each stage of the processing pipeline can be a challenge. Current practice is either to react to bottlenecks as they occur or to over-provision, resulting in degraded service or wasted resources in the form of under-utilised machines. Even if a topology meets its throughput requirements, some use cases (such as credit card fraud detection) have strict latency requirements that must also be met.

My research aims to create a model that will predict the ideal cluster size for running a given topology, given throughput and latency requirements. This would allow data engineers to quickly tune their topologies to what is required and sufficient, bypassing a tedious trial-and-error process, or, when paired with a forecasting model, to pre-emptively scale a topology to meet future demand.

My research builds on earlier work carried out within the CDT that produced a proof-of-concept model for Apache Storm based on queuing theory. I aim to extend this by generalising the model to other streaming systems and by investigating how it may need to be modified under more complicated circumstances, such as co-location of executors with other applications, and operators such as windowing and joins.
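To give a flavour of the queuing-theory approach, the sketch below sizes a single operator under a strong simplifying assumption: each replica is modelled as an independent M/M/1 queue, so a replica with service rate μ receiving λ tuples per second has a mean sojourn time of 1/(μ − λ). The function name, parameters and the M/M/1 simplification are illustrative assumptions, not the model developed in the earlier CDT work.

```python
import math

def required_replicas(arrival_rate, service_rate, latency_target):
    """Smallest parallelism k for one operator such that, splitting
    arrival_rate evenly over k replicas each modelled as an M/M/1
    queue, the mean sojourn time 1 / (service_rate - arrival_rate/k)
    meets latency_target.

    Rates are in tuples per second; latency_target is in seconds.
    (Illustrative sketch only: real DSPS operators are not M/M/1.)
    """
    # Stability alone requires arrival_rate / k < service_rate,
    # so start the search at the smallest stable parallelism.
    k = max(1, math.ceil(arrival_rate / service_rate))
    while True:
        per_replica_rate = arrival_rate / k
        if per_replica_rate < service_rate:
            sojourn = 1.0 / (service_rate - per_replica_rate)
            if sojourn <= latency_target:
                return k
        k += 1

# e.g. 900 tuples/s arriving, each replica serving 100 tuples/s,
# target mean latency at this operator of 50 ms:
required_replicas(900, 100, 0.05)  # -> 12
```

Note that the latency target, not stability, drives the answer here: 10 replicas would keep the queues stable, but only 12 bring the mean sojourn time under 50 ms. A full model must compose such estimates across every stage of the topology.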


Paul Ezhilchelvan, Paul Watson, Isi Mitrani