Non-coding RNAs

Novel non-coding regulators of gene expression

Current research in the group investigates the role of non-coding RNAs in regulating gene expression. It builds on our previous work in Schizosaccharomyces pombe (fission yeast), in which we identified sets of cis-acting non-coding RNAs that are differentially expressed as fission yeast undergoes meiosis. We used strand-specific sequencing of total RNA first to reannotate the genome so that it included accurate representations of each gene’s untranslated regions (UTRs), and then to explore how patterns of RNA abundance changed over the course of meiosis. This identified a set of non-coding RNAs on the strand opposite protein-coding genes that are themselves critical regulators of sexual differentiation, leading us to speculate that the non-coding RNAs might act to control the activity of these key proteins. In collaboration with Cell Division, we were able to show that this was indeed the case: over-expression of an Antisense Regulatory Transcript (ART) integrated ectopically into the genome phenocopied a deletion of the corresponding protein-coding gene. We also showed that ART function was dependent on components of the RNA interference (RNAi) pathway, a feature of fission yeast that is conserved in H. sapiens, suggesting that similar mechanisms might occur in human cells. Many of these transcripts arise from the overlapping 3' UTRs of convergent adjacent gene pairs, that is, neighbouring genes transcribed towards one another, and we were able to confirm that genes in this configuration can indeed modulate the expression of their neighbours.
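As an illustration of the convergent configuration described above, the following is a minimal Python sketch that scans a list of annotated genes for adjacent pairs transcribed towards one another whose 3' ends overlap. It is not the group's pipeline: the `Gene` record, the function name and the example coordinates are all hypothetical, and overlap between whole transcripts (UTRs included) is used as a simple stand-in for overlap between 3' UTRs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; coordinates are 1-based, inclusive, UTRs included."""
    name: str
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def convergent_overlapping_pairs(genes: List[Gene]) -> List[Tuple[str, str, int]]:
    """Return (plus-strand gene, minus-strand gene, overlap length) for each
    pair of adjacent genes that are transcribed towards one another and
    whose 3' ends overlap."""
    ordered = sorted(genes, key=lambda g: (g.chrom, g.start, g.end))
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        same_chrom = left.chrom == right.chrom
        # Convergent: the upstream gene is on "+", the downstream gene on "-",
        # so the 3' end of each points at the other.
        convergent = left.strand == "+" and right.strand == "-"
        # Sorting guarantees right.start >= left.start on the same chromosome.
        overlap = min(left.end, right.end) - right.start + 1
        if same_chrom and convergent and overlap > 0:
            pairs.append((left.name, right.name, overlap))
    return pairs


# Invented example coordinates, for illustration only.
example_genes = [
    Gene("geneA", "chrI", 100, 500, "+"),
    Gene("geneB", "chrI", 450, 900, "-"),
    Gene("geneC", "chrI", 1000, 1500, "+"),
    Gene("geneD", "chrI", 1600, 2000, "-"),
]

print(convergent_overlapping_pairs(example_genes))
```

Here only geneA and geneB are reported: geneC and geneD are convergent but do not overlap. A stricter version would also record coding-sequence boundaries, so that the overlap could be confirmed to lie within the 3' UTRs rather than within coding exons.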

Nearly 40% of known human genes are non-coding, and whilst the majority have yet to be assigned a function, an increasing body of research is showing that they can interact with DNA, with proteins and with other RNAs to modulate and control their function. Although it is not clear what proportion of these newly discovered non-coding RNAs are functional, rather than simply corresponding to background ‘chatter’ in the genome, their prevalence and their ability to act through a diversity of mechanisms raise the possibility that they may have a profound impact on our understanding of genes and gene expression. We are using high-throughput genomics tools, including deep sequencing and quantitative protein mass spectrometry, to identify novel non-coding RNAs, to investigate the function of existing ones, and to search for those whose behaviour is altered in tumour cells. This work therefore integrates experimental and computational approaches and makes use of both in vitro and clinical datasets.

We are currently applying approaches similar to those we pioneered in fission yeast to the analysis of human cell lines and clinical datasets, allowing us to identify novel long intergenic non-coding RNAs (lincRNAs) of potential relevance in cancer. We are in the process of characterising these further at the bench.

Analysis of archival material

Archival Formalin Fixed Paraffin Embedded (FFPE) tissue is an immensely valuable source of information pertaining to cancer. Unfortunately, the preservation process damages RNA, making it hard to use these samples as a source for systematic analysis of gene expression profiles and difficult, therefore, to use this material in global genomic studies. The development of successful methods for measuring global RNA abundance in these samples, and for performing strict Quality Control, would be extremely beneficial. We have been collaborating with Translational Radiobiology to develop methods to support the analysis of RNA from FFPE samples. We have previously shown that it is possible to generate meaningful gene expression data from archival material; however, in some samples the RNA is of too poor a quality to be amenable to further study. Recently we showed that miRNAs, a type of short non-coding RNA, are less susceptible to the effects of preservation in FFPE than longer mRNA transcripts, and can be used to generate meaningful data from FFPE samples even when the mRNA has deteriorated beyond the point of utility (Hall JS, Taylor J et al. 2012 British Journal of Cancer 107(4):684-694). These, plus other advances in both biochemistry and bioinformatics, are raising the prospect of generating useful classifiers and biomarkers from archival material, further unlocking the potential of this enormously valuable resource.


In parallel, the group collaborates with many other groups in the Institute and the wider Manchester Cancer Research Centre (MCRC) to provide computational input into their research programmes (see for example work by the Leukaemia Biology group in Harris WJ et al. 2012 Cancer Cell 21: 473-487). This last year has seen a significant increase in demand for support as groups exploit the power of our microarray, deep sequencing and mass spectrometry platforms. These fields are all moving rapidly, and the complexity of the datasets they generate often demands novel analyses, resulting in many projects requiring substantial research-level computational biology. We have met this demand by developing a model in which Postdoctoral Fellow-level analysts build extended collaborations with other research groups, allowing them to become immersed in the research question and leading to a much deeper contribution than would be possible with a more traditional ‘analysis-as-service’ approach. This allows us not only to make use of the latest techniques emerging from the computational biology research community, but also to develop the novel algorithms and software tools we need to perform these analyses (Figure 1). We are active contributors to the Bioconductor project, an international collaboration to develop open source software packages for the analysis of biological data, and this is the primary route by which we make our software available to the wider community. Underpinning all of our research is a computational platform that includes a High Performance Computing (HPC) Linux cluster and associated Lustre file system. Programmers in the group have developed pipelines to process our deep sequencing and proteomics data, and Bioconductor packages that help support our downstream analysis.
The goal is to take novel methods developed as part of our research and turn them into tangible software tools where they will be used to support frequent analyses across multiple datasets. Since much of our research is directed at generating a better understanding of the less well-characterised regions of the genome (see below), the major computational focus of the group is on developing tools that allow fine-grained representations of gene structure to be integrated with statistical methods (Figure 2). These annotations can then be used to provide detailed context when interpreting high-throughput genomics datasets, and to help bring the data that emerge from different technologies together into a single integrated analysis. The group therefore sits at the interface of biology, mathematics and computing, and comprises a highly interdisciplinary team that incorporates both ‘wet’ and ‘dry’ science.


Figure 1. Annmap is a genome browser based on the Google Maps API. Each grey box represents a gene, with individual transcripts shown as horizontal tracks within. Coding exons are black; UTRs are grey. Regions encoding protein domains are identified in yellow. Green lines represent target locations for Affymetrix Exon 1.0ST microarrays. Associated with the genome browser is a Bioconductor package that provides programmatic access to the same annotation data, allowing it to be brought into the statistical context provided by Bioconductor and the R programming language.

Figure 2. Underpinning much of our work are genome annotation databases that link genes to the RNA and protein molecules they encode. The figure shows just a tiny subset of the millions of relationships we need to record. We use computer software to help interpret these networks.
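To make the idea of an annotation network concrete, the sketch below models a handful of gene, transcript and protein relationships as a small directed graph in Python and walks it to answer a simple query. It is an illustrative toy only: the `AnnotationGraph` class, its methods and all identifiers are hypothetical, and it is not the schema or API of the group's annotation databases or Bioconductor packages.

```python
from collections import defaultdict, deque
from typing import List, Tuple

# A node is a (kind, identifier) pair, e.g. ("gene", "g1").
Node = Tuple[str, str]


class AnnotationGraph:
    """Toy directed graph linking genes to the molecules they encode."""

    def __init__(self) -> None:
        self._edges = defaultdict(set)

    def link(self, parent: Node, child: Node) -> None:
        """Record that `parent` gives rise to `child`."""
        self._edges[parent].add(child)

    def descendants(self, node: Node, kind: str) -> List[str]:
        """Identifiers of every node of the given kind reachable from `node`,
        found by breadth-first traversal."""
        seen = set()
        found = set()
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for child in self._edges.get(current, ()):
                if child in seen:
                    continue
                seen.add(child)
                if child[0] == kind:
                    found.add(child[1])
                queue.append(child)
        return sorted(found)


# Invented identifiers: one gene with a coding and a non-coding transcript.
graph = AnnotationGraph()
graph.link(("gene", "g1"), ("transcript", "t1"))
graph.link(("gene", "g1"), ("transcript", "t2"))
graph.link(("transcript", "t1"), ("protein", "p1"))

print(graph.descendants(("gene", "g1"), "transcript"))
print(graph.descendants(("gene", "g1"), "protein"))
```

A real annotation resource records millions of such relationships and would normally sit in a database rather than in memory, but the same traversal logic applies when moving from a gene to its transcripts and on to the proteins they encode.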