# Possible MPhil/PhD Projects in Statistics

Suggested projects for postgraduate research in statistics.

## Areas of expertise

Below is an indicative list of research projects which our staff would be interested in supervising. The list is not exhaustive, and we encourage interested applicants to also check the webpages of individual academics. If you are interested in any of these projects, or would like to discuss a project with someone, please feel free to approach the prospective supervisor by email. Please also check the details of our research degrees. We welcome applications online.

For further information, please contact the PG tutor/selector for statistics: Dr Jere Koskela

Heterogeneity among patients commonly exists in clinical studies, presenting challenges in medical research. It is widely acknowledged that the population comprises diverse subtypes, each distinct from the others. Precision medicine research aims to identify these sub-types and thus tailor disease prevention and treatment. The primary research challenge is to identify the sub-population types and to determine the prediction variables for each sub-type. This PhD research project will explore how to integrate classification methods and variable selection techniques to effectively address these complexities.

Supervisor: Prof Hongsheng Dai

Phylogenetic or evolutionary trees are inferred from genetic sequence data and are a vital ingredient of many applications in molecular biology. During the inference process samples of trees are often obtained, but characterising and analysing such samples is challenging because the space of all trees with a fixed leaf-set is not a Euclidean vector space. While various notions of phylogenetic tree space exist, there is often a lack of probabilistic machinery in these spaces. The project supervisor has developed a family of distributions in the well-known Billera-Holmes-Vogtmann (BHV) tree space which are akin to Gaussian distributions, via kernels of Brownian motion, together with methods for fitting these distributions to samples of trees. The aim of this project is to further exploit this idea by developing a wider range of statistical models in BHV tree space with Gaussian kernels as the noise model, or by developing analogous methods in Wald space. Wald space is an alternative to BHV tree space with a more complex geometry and for which statistical models are in their infancy. This project will involve development of the theory of stochastic processes and computational methods for performing inference in non-Euclidean geometries.

Supervisor: Tom Nye

Data sets in which each data point is a different edge-weighted network arise in a variety of different contexts such as neuroscience, molecular biology and computer science. Analysis of data sets of networks or graphs requires the development of new statistical methods, because such graphs do not live in a linear space.

Information geometry provides theory and methods for doing geometry on spaces of probability distributions. The project supervisor has recently used these tools to develop a new space of evolutionary trees, by mapping trees to certain probability distributions, and studying the geometry on trees induced by this embedding. The idea behind this project is to extend these ideas from collections of edge-weighted evolutionary trees to collections of certain edge-weighted graphs. The project will involve stochastic processes on graphs, computational geometry and development of statistical methods in the novel setting of graph space.

Supervisor: Tom Nye

The standard way to choose the sample size of a clinical trial is via a power calculation: we choose the smallest sample size that gives us 80% or 90% power to detect a certain treatment effect, based on a hypothesis test of the treatment effect at the end of the trial. This requires various model parameters to be known a priori, particularly for more complex trial designs such as cluster randomised trials. If instead we place prior distributions on the model parameters, then we can integrate over our uncertainty about them and calculate the Bayesian power, known as assurance, for the sample size calculation. In this case it seems sensible to also conduct a Bayesian analysis following the trial. Such a fully Bayesian design and analysis has been proposed for two-arm cluster trials with a continuous outcome. This project would extend the approach to more complex trial designs such as crossover trials, stepped wedge trials and adaptive trials.
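
The idea of averaging power over prior uncertainty can be illustrated with a short Monte Carlo sketch. The priors, design values and two-sample t-test below are illustrative assumptions for a simple two-arm trial with a continuous outcome, not the specific designs studied in the project.

```python
# A minimal sketch of assurance (Bayesian power) for a two-arm trial
# with a continuous outcome. Priors and design values are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def assurance(n_per_arm, n_sims=5000, alpha=0.05):
    """Estimate assurance: the probability, averaged over the priors,
    that a two-sample t-test at level alpha rejects 'no effect'."""
    rejections = 0
    for _ in range(n_sims):
        # Priors on the unknown parameters (illustrative choices):
        delta = rng.normal(0.5, 0.2)   # prior on the treatment effect
        sigma = rng.gamma(4, 0.25)     # prior on the outcome SD
        control = rng.normal(0.0, sigma, n_per_arm)
        treated = rng.normal(delta, sigma, n_per_arm)
        _, p = stats.ttest_ind(treated, control)
        rejections += (p < alpha)
    return rejections / n_sims
```

A conventional power calculation would fix `delta` and `sigma` at point values; assurance instead averages the rejection probability over their prior distributions.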

Supervisor: Kevin Wilson

In a reliability demonstration test the producer of a hardware product demonstrates to a consumer that the product meets a certain level of reliability. As most hardware products have very high reliability, such tests can be prohibitively expensive, requiring large sample sizes and long testing periods. Accelerated testing can reduce the testing time, but introduces the additional complication of having to infer the relationship between failure times under accelerated stress levels and under normal operating conditions.

Previous attempts to plan and analyse reliability demonstration tests have utilised power calculations and hypothesis tests or risk criteria. More recently, Wilson & Farrow (2019) proposed the use of assurance to design reliability demonstration tests and suitable Bayesian analyses of the test data. Assurance provides the unconditional probability that the reliability demonstration test will be passed. Work to date has focussed on Binomial and Weibull observations. This project would extend the use of assurance to design reliability demonstration tests, considering a wider class of failure time distributions and implementing an augmented MCMC scheme to evaluate the assurance more efficiently.
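
In the Binomial case the unconditional pass probability has a closed form, which the following sketch illustrates. The pass/fail rule (at most `c` failures among `n` tested units) and the Beta prior on the failure probability are illustrative assumptions.

```python
# A minimal sketch of assurance for a binomial reliability
# demonstration test: n units are tested and the test is passed if at
# most c units fail. The Beta(a, b) prior is an illustrative choice.
from scipy import stats

def assurance_binomial(n, c, a=1.0, b=19.0):
    """Unconditional probability that the test is passed: the
    beta-binomial probability of at most c failures in n trials when
    the failure probability has a Beta(a, b) prior."""
    return stats.betabinom.cdf(c, n, a, b)
```

Averaging the Binomial pass probability over the Beta prior yields a beta-binomial distribution for the number of failures, so no simulation is needed in this simple setting; more flexible failure time distributions, as proposed in the project, would generally require numerical methods.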

Supervisor: Kevin Wilson

Approximately 70% of clinical decisions are influenced by the use of in vitro diagnostics (IVDs), and conception to adoption of new diagnostics takes approximately 10 years. The ability to speed up the development of new diagnostics, including IVDs, is critical for the long-term sustainability of the UK diagnostics sector and efficiency of the NHS. This project would aim to reduce the time to market of diagnostics by developing novel Bayesian methods for the design and analysis of diagnostic studies, with a particular focus on post-marketing utility studies. Through development of seamless and adaptive designs for diagnostic studies, we can make best use of data to save time and money in test development and enable seamless transitions between stages. Adaptive designs will allow efficacy or futility of diagnostics to be determined more quickly. The project will apply the developed methods to assess real diagnostic devices that have been recently approved.

Supervisor: Kevin Wilson

Samples of DNA sequences in phylogenetics are positively correlated due to common ancestry. It is mathematically convenient to model the correlation by building a mathematical model for the ancestral tree of the sample. These trees can usually not be observed directly, and a gold-standard route to inference is to use MCMC to average over possible ancestral trees. However, the space of ancestral trees is a very irregular and challenging state space. As a result, existing MCMC proposal kernels are known to result in very slow mixing, leading to infeasibly long runtimes.

Recent work has successfully characterised at least some of the irregular features of the MCMC state space for simple models of DNA sequence evolution. The starting point for this project is to study how those characterisations can be used in the design of bespoke (and faster mixing) MCMC proposal kernels, and whether any gains in computational efficiency carry over to more realistic models of DNA sequence evolution. The project will involve a mix of code prototyping as well as probabilistic analysis, and the emphasis between them can vary based on applicant interests.

Supervisor: Jere Koskela

Ancestral recombination graphs (ARGs) are data structures describing common ancestry of chromosomes sampled from a population. Embedded into an ARG is a sequence of correlated trees, reflecting the fact that common ancestry of a sample at a single site of DNA is a tree, but that the biological process of recombination causes the tree to change across the genome. Inferring the underlying ARG from a sample of DNA sequences, or averaging across ARGs which are consistent with a sample, are two prototypical tasks in statistical population genetics.

Many sophisticated methods for these tasks are available when the DNA sequences are short enough that their ancestry can be modelled as a single tree. Often, these methods exploit e.g. the geometry of the space of trees, as well as modern MCMC sampling methods. Methods for ARGs are not similarly well-developed, chiefly because ARGs are much larger objects than single trees, and no tractable geometry of “ARG-space” is available.

This project is about finding ways to leverage modern statistical computing approaches for ARGs. Possible directions include design of gradient-based MCMC proposals, construction of suitable state spaces to facilitate efficient exploration of ARGs, and the derivation of approximate models which capture the key features of ARGs while improving computational tractability. The project will involve a mix of code prototyping as well as probabilistic analysis, and the emphasis between them can vary based on applicant interests.

Supervisor: Jere Koskela

Classic models for count data can readily accommodate overdispersion relative to the Poisson model, but models for underdispersed counts, where the mean exceeds the variance, are less well established, and those that have been proposed are often hampered by, for instance, a lack of natural interpretation, a restricted parameter space, computational difficulties in implementation, or some combination of all three. At the individual level, one can often encounter both over- and underdispersion, or bidispersion, within the same dataset, and failure to allow for this bidispersion leads to inferences on parameters that are either conservative or anti-conservative. In this project, models to handle such bidispersed data, typically at an individual level, will be developed and applied to case studies drawn from sport and the social sciences.
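
The over/underdispersion distinction can be made concrete with the dispersion index (variance divided by mean), which equals 1 for the Poisson model. The simulated distributions below are illustrative stand-ins, not the bidispersion models the project would develop.

```python
# A minimal sketch contrasting over- and underdispersed counts via the
# dispersion index (variance / mean): above 1 indicates overdispersion,
# below 1 underdispersion. Distribution choices are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def dispersion_index(x):
    return np.var(x, ddof=1) / np.mean(x)

poisson = rng.poisson(5.0, 10_000)                # index approx. 1
overdisp = rng.negative_binomial(5, 0.5, 10_000)  # mean 5, variance 10
underdisp = rng.binomial(10, 0.5, 10_000)         # mean 5, variance 2.5
```

A single dataset exhibiting bidispersion would contain individuals of both kinds, which is exactly what standard Poisson or negative binomial models fail to accommodate simultaneously.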

Supervisors: Pete Philipson & Daniel Henderson

Probabilistic numerics casts numerical tasks, such as the numerical solution of differential equations, as inference problems to be solved. In this project, a convergent sequence of approximations to the quantity of interest constitutes a dataset from which the limiting quantity of interest can be extrapolated, in a probabilistic analogue of Richardson's deferred approach to the limit. This approach provides probabilistic uncertainty quantification whilst inheriting the features and performance of state-of-the-art numerical methods. This project aims to develop and extend such methods for challenging numerical tasks, such as solving nonlinear partial differential equations and eigenvalue problems.
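
The classical version of Richardson's deferred approach to the limit, which the probabilistic approach treats as data for extrapolation, can be sketched in a few lines. The trapezoidal-rule integrand and step sizes below are illustrative choices.

```python
# A minimal sketch of classical Richardson extrapolation: estimates of
# an integral at two step sizes are combined to cancel the leading
# O(h^2) error term of the composite trapezoidal rule.
import numpy as np

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n panels."""
    x = np.linspace(a, b, n + 1)
    y = f(x)
    h = (b - a) / n
    return h * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)

# Integral of sin over [0, pi] is exactly 2.
T1 = trapezoid(np.sin, 0.0, np.pi, 8)
T2 = trapezoid(np.sin, 0.0, np.pi, 16)  # half the step size
R = (4 * T2 - T1) / 3                   # one Richardson step
```

The probabilistic analogue described above would instead place a statistical model over the sequence of approximations `T1, T2, ...` and infer the limit with quantified uncertainty, rather than combining them deterministically.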

Supervisor: Chris Oates

Markov chain Monte Carlo is the engine of modern Bayesian statistics, used to approximate the posterior and derived quantities of interest.

Despite this, the issue of how the output from a Markov chain is post-processed and reported is often overlooked. Convergence diagnostics can be used to control bias via burn-in removal, but these do not account for (common) situations where a limited computational budget engenders a bias-variance trade-off. The aim of this project is to directly address the bias-variance trade-off, developing powerful post-processing techniques based on Stein discrepancy to improve Markov chain Monte Carlo output.
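
As a flavour of the kind of quantity involved, the sketch below computes a kernel Stein discrepancy for samples intended to target a standard normal; a large value flags output that poorly represents the target. The Gaussian base kernel with unit lengthscale and the one-dimensional target are illustrative simplifications.

```python
# A minimal sketch of a squared kernel Stein discrepancy (KSD) against
# a standard normal target, the kind of quantity used to assess and
# post-process MCMC output. Kernel and target are illustrative choices.
import numpy as np

def ksd_gaussian_target(x):
    """Squared KSD (V-statistic) of 1-d samples x against N(0, 1),
    using the Stein kernel built from k(x, y) = exp(-(x - y)^2 / 2)
    and the standard normal score s(x) = -x."""
    d = x[:, None] - x[None, :]
    k = np.exp(-0.5 * d**2)
    # Stein kernel: k0 = k * (1 - d^2 + d * (s(x) - s(y)) + s(x) * s(y)),
    # which with s(x) = -x simplifies to k * (1 - 2 d^2 + x y).
    k0 = k * (1 - 2 * d**2 + np.outer(x, x))
    return k0.mean()

rng = np.random.default_rng(0)
good = rng.normal(0.0, 1.0, 500)    # samples from the target
biased = rng.normal(2.0, 1.0, 500)  # e.g. a chain stuck away from the mode
```

Crucially, the discrepancy only requires the score function of the target, which is available for posteriors up to an intractable normalising constant, making it a natural tool for post-processing MCMC output.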

Supervisor: Chris Oates