School of Computing


Data Integration Approaches for Drug Repositioning

Project Description:

Funded by the EPSRC and in collaboration with GlaxoSmithKline (GSK), this project has five main objectives:

  1. To extend existing data integration platforms relevant to drug repositioning, particularly for rare diseases
  2. To research and implement appropriate strategies for semantic data integration for network construction, including Ondex, RDF and others
  3. To develop algorithms to search for topological and semantic network structures indicative of repositioning opportunities
  4. To evaluate the utility of motifs for mining semantically rich integrated networks
  5. To develop novel approaches for describing the properties of relevant semantic network structures

The drug development process is increasing in cost and becoming less productive. Drugs based on novel chemistry take years to reach the market. Costs often surpass $1bn, and a large proportion of these novel compounds fail in (or before) the clinic (Adams & Brantner, 2006). Drug repurposing approaches focus on finding new uses for existing drugs or drug candidates for which substantial safety information already exists (Ashburn & Thor, 2004). Repurposing has the potential to vastly reduce both the time and the cost taken for a drug to reach the market.

Many of today's repurposed drugs were discovered through serendipitous observation (e.g. sildenafil by Pfizer) or through rational observation (e.g. duloxetine) (Li et al., 2011). Industrial and academic focus is now turning to systems biology to investigate the repurposing approach (Zimmer & Young, 2009). This relies on vast datasets that must represent an accurate, integrated view of cellular and molecular processes. Successful production of such a dataset will be a pivotal part of the project.

The amount of openly available biological and pharmacological data is vast: more than 1,900 unique publicly available data sources exist (Brazas et al., 2012). These data are, however, deposited in many distributed, heterogeneous and voluminous sources. To exploit them fully, integration must produce data that are technically, semantically and syntactically homogeneous. Many data integration platforms and frameworks exist to achieve this. Ondex is a data integration platform that allows integrated systems biology datasets to be visualised in the form of a graph (Kohler et al., 2006). One such dataset has been developed at Newcastle University for the in silico discovery of new drug repositioning candidates (Cockell et al., 2010). Other frameworks include the Resource Description Framework (RDF) and Kbase, an open source software framework and application system. RDF, a standard model for data interchange on the web, represents data as 'statements', each comprising a subject, a predicate and an object. RDF may be queried using SPARQL (the SPARQL Protocol and RDF Query Language) to extract vital information held within a dataset, a process known as data mining.
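
To make the subject-predicate-object model concrete, the following is a minimal sketch in plain Python, not an RDF library or a real SPARQL engine. It stores statements as triples and joins several triple patterns in the way a SPARQL basic graph pattern does. All identifiers (`drug:D1`, `ex:bindsTo` and so on) and the function names `match` and `query` are hypothetical placeholders introduced for illustration only.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Hypothetical statements; these identifiers are placeholders, not real data.
TRIPLES: List[Triple] = [
    ("drug:D1", "ex:bindsTo", "protein:P1"),
    ("drug:D2", "ex:bindsTo", "protein:P2"),
    ("protein:P1", "ex:associatedWith", "disease:X"),
    ("protein:P2", "ex:associatedWith", "disease:Y"),
]


def match(pattern: Triple, triples: List[Triple],
          binding: Optional[Binding] = None) -> List[Binding]:
    """Match one triple pattern; terms starting with '?' are variables."""
    results: List[Binding] = []
    for triple in triples:
        candidate = dict(binding or {})
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it agrees with an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            results.append(candidate)
    return results


def query(patterns: List[Triple], triples: List[Triple]) -> List[Binding]:
    """Join several triple patterns, carrying variable bindings forward."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(pattern, triples, b)]
    return bindings


# Which drugs bind a protein that is associated with disease X?
hits = query(
    [("?drug", "ex:bindsTo", "?protein"),
     ("?protein", "ex:associatedWith", "disease:X")],
    TRIPLES,
)
print(hits)
```

In a real setting the same question would be posed as a SPARQL query against a triple store; the sketch only illustrates how shared variables link one statement to the next.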

Data mining is the automatic discovery of novel and understandable models from patterns within a target dataset. It is this process that will reveal potential drug repurposing leads, and it will be important to consider both syntactic (topological) and semantic factors. Manually exploring datasets for known examples of drug repositioning is simple in concept but unrealistic for mass exploration and identification of possible new targets of interest. Novel mining techniques, including automated algorithms that explore topological and semantic network structures indicative of repositioning opportunities, will therefore need to be explored.

Semantic Motifs (SMs), a term introduced by Cockell et al. (2010) at Newcastle University, have been proposed for this purpose. The idea of using network topology in combination with the metadata in a sub-graph for mining networks is novel. SMs are described as sub-graphs (or motifs) that match a particular metadata or semantic structure. It is proposed that taking the SM of a known example of repurposing (e.g. chlorpromazine) will allow for searching and pruning of the graph to uncover semantically (as well as topologically) similar network structures. Research into the definition and utility of SMs is currently underway as part.
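
As an illustration of the idea, the sketch below searches a small typed graph for paths whose node types and edge types follow a given motif. It is a simplified reading of the concept, assuming a motif is a linear chain, and is not the SM definition or algorithm of Cockell et al. The graph contents, the type names and the function `find_motif` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical typed graph: each node has a semantic type, each edge a relation type.
NODE_TYPES: Dict[str, str] = {
    "drug_A": "Compound",
    "drug_B": "Compound",
    "target_1": "Target",
    "target_2": "Target",
    "disease_X": "Disease",
    "disease_Y": "Disease",
    "paper_1": "Publication",
}
EDGES: List[Tuple[str, str, str]] = [
    ("drug_A", "binds_to", "target_1"),
    ("drug_A", "binds_to", "target_2"),
    ("drug_B", "binds_to", "target_2"),
    ("target_1", "involved_in", "disease_X"),
    ("target_2", "involved_in", "disease_Y"),
    ("drug_B", "cited_in", "paper_1"),
]

# Assumed motif encoding: alternating node types and edge types along a chain.
MOTIF: List[str] = ["Compound", "binds_to", "Target", "involved_in", "Disease"]


def find_motif(motif: List[str], node_types: Dict[str, str],
               edges: List[Tuple[str, str, str]]) -> List[List[str]]:
    """Return every node path whose types match the motif, step by step."""
    # Seed with all nodes of the first type in the motif.
    paths = [[node] for node, kind in node_types.items() if kind == motif[0]]
    for i in range(1, len(motif), 2):
        edge_type, next_type = motif[i], motif[i + 1]
        # Extend each partial path along edges of the right type to nodes
        # of the right type; paths that cannot be extended are pruned.
        paths = [
            path + [dst]
            for path in paths
            for src, rel, dst in edges
            if src == path[-1] and rel == edge_type
            and node_types[dst] == next_type
        ]
    return paths


matches = find_motif(MOTIF, NODE_TYPES, EDGES)
for path in matches:
    print(" -> ".join(path))
```

The `cited_in` edge is ignored because it does not fit the motif, which shows how semantic constraints prune structures that are topologically similar but carry a different meaning.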

Beyond the development and mining of a dataset, work will involve translating semantic network structures into a format easily understood by those with little knowledge of network analysis. Natural Language Generation (NLG) involves the automatic production of natural language by a computer system and, upon successful implementation, will allow network structures to be summarised in simple text.
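
One simple way to picture this step is template-based generation, sketched below under the assumption that a network structure is supplied as an alternating list of node names and relation types. The templates, the relation names and the function `describe_path` are hypothetical; the project may well adopt a different NLG technique.

```python
from typing import Dict, List

# Hypothetical sentence templates, one per relation type.
TEMPLATES: Dict[str, str] = {
    "binds_to": "{src} binds to {dst}",
    "involved_in": "{src} is involved in {dst}",
}
DEFAULT_TEMPLATE = "{src} is linked to {dst}"


def describe_path(path: List[str]) -> str:
    """Summarise a path given as [node, relation, node, relation, node, ...]."""
    clauses = []
    for i in range(1, len(path), 2):
        src, relation, dst = path[i - 1], path[i], path[i + 1]
        template = TEMPLATES.get(relation, DEFAULT_TEMPLATE)
        clauses.append(template.format(src=src, dst=dst))
    if not clauses:
        return ""
    if len(clauses) == 1:
        text = clauses[0]
    else:
        text = ", ".join(clauses[:-1]) + " and " + clauses[-1]
    # Capitalise the first letter and close the sentence.
    return text[0].upper() + text[1:] + "."


summary = describe_path(
    ["compound A", "binds_to", "target 1", "involved_in", "disease X"]
)
print(summary)
```

Templates keep the output predictable and easy to check, at the cost of fluency; unknown relation types fall back to a generic phrase rather than failing.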