Scientific Computing and Computational Biology Support

In 2014, the Institute invested further in deep sequencing with the purchase of a new Illumina HiSeq 2500 and a MiSeq. These platforms offer exciting opportunities, but come with an increased computational burden, both in the sheer volume of data (each run of the HiSeq 2500 can require terabytes of disk space to analyse) and in the computational power required to process it.

Scientific Computing
To address the computational burden associated with our new sequencing platforms, we have established a new Scientific Computing Team, led by Wei Xing, to provide the next generation of High Performance Computing (HPC) necessary to deal with future deep sequencing, proteomics and imaging data. Wei has already helped manage the installation of additional processing and storage capacity, and over the next year will be coordinating the development of a new Cancer Genomics Data Centre to further extend our computing capabilities.

Computational Biology Support
As computational biology becomes central to increasing numbers of groups within the Institute, our collaborations have grown significantly. The close of 2013 saw a restructuring of the ACBB group through the formation of a distinct Computational Biology Support Team, to complement a new and separate RNA Biology Research Group focused on regulatory RNAs (see below).

A major aspect of the support team’s work has been to develop pipelines that streamline the pre-processing and alignment of deep sequencing data. These are parallelised across our HPC system, and handle many of the routine tasks associated with Illumina data, freeing valuable time to conduct downstream analysis and data interpretation. A significant part of this work has been to evaluate different aligners and to develop an understanding of the influence different parameter settings have on the performance of these tools.

Figure 1: Integrating strand-specific RNA sequencing data with global quantitative protein mass spectrometry to identify gene expression changes in fission yeast cells. Rows in the heat map correspond to individual coding regions for which quantitative proteomics data were also available. Columns correspond to individual samples in a time course. A wild-type and a mutant strain were compared. Sense: coding sequence expression levels. Antisense: antisense expression levels corresponding to these coding regions. Protein: protein abundance. Blue: low abundance; red: high abundance.

Pipelines for DNA-, RNA- and ChIP-seq have all been built, and will continue to be refined in order to keep pace with a fast-moving field. The team has also developed proteomics tools to complement those provided for deep sequencing. These bring peptide-level tandem mass spectrometry datasets into a format that supports integrated analysis with RNA-sequencing data, in part by mapping them into a common genome-level coordinate system. This further extends approaches we developed to support the integration of genomics data (e.g. Bitton et al. Genetics 2011; Bitton et al. PLoS ONE 2010; and Bitton et al. BMC Bioinformatics 2008).
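As a rough illustration of what mapping peptides into a genome-level coordinate system can involve, the sketch below converts a peptide's residue range within a protein into genomic intervals, given the coding exons of the transcript. It is not the team's implementation: the function name `peptide_to_genome`, the 1-based inclusive coordinate convention and the example exon positions are all assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive genomic coordinates (assumed convention)


def peptide_to_genome(cds_exons: List[Interval], strand: str,
                      aa_start: int, aa_end: int) -> List[Interval]:
    """Map a peptide (1-based, inclusive residue range) onto the genome.

    cds_exons holds the coding part of each exon as (start, end) pairs.
    Returns one genomic interval per exon the peptide overlaps, sorted by
    genomic position. Hypothetical helper, for illustration only.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")

    # Residue range -> 0-based, half-open offsets within the spliced CDS.
    nt_from = (aa_start - 1) * 3
    nt_to = aa_end * 3

    # Visit exons in transcription order (descending on the minus strand).
    ordered = sorted(cds_exons, reverse=(strand == "-"))
    pieces = []
    consumed = 0  # CDS nucleotides covered by exons already visited
    for start, end in ordered:
        length = end - start + 1
        # Portion of the peptide that falls inside this exon, as exon offsets.
        lo = max(nt_from, consumed) - consumed
        hi = min(nt_to, consumed + length) - consumed
        if lo < hi:
            if strand == "+":
                pieces.append((start + lo, start + hi - 1))
            else:
                pieces.append((end - hi + 1, end - lo))
        consumed += length

    if nt_to > consumed:
        raise ValueError("peptide extends beyond the coding sequence")
    return sorted(pieces)


# Two 9-nucleotide coding exons encode a 6-residue protein.
exons = [(101, 109), (201, 209)]
# Residues 3-4 span the exon junction, so the peptide maps to two intervals.
print(peptide_to_genome(exons, "+", 3, 4))  # [(107, 109), (201, 203)]
# On the minus strand, translation starts at the highest coordinate.
print(peptide_to_genome(exons, "-", 1, 2))  # [(204, 209)]
```

Once peptides are expressed as genomic intervals, they can be overlapped with RNA-sequencing coverage using the same interval operations applied to any other genomic feature, which is the point of a shared coordinate system.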

Most of our downstream data analysis is performed using R and Bioconductor, so these proteomics tools bring the data into the R programming environment, where they can be analysed using the comprehensive statistical toolkits that accompany the language. This complements other work in the ACBB group that has focused on bringing genome annotation data into the same environment (see http://annmap.cruk.manchester.ac.uk/, for example).

Collaborative Analysis
Additional collaborative support is provided by computational analysts who work closely with other research groups. Where demand is sufficient, this support can be provided through full-time embedded posts; both the CEP and the DDU Groups embed analysts in this way. In addition, we have established an exciting collaboration with the CEP group to develop bioinformatics approaches for the analysis of whole genome and transcriptome data derived from single cell genomics experiments.

While much of our work is focused on genome-wide datasets, we are also interested in statistical methods, and were able to contribute to work with the Signalling Networks in Cancer Group by providing the mathematical modelling necessary to analyse data from genetic dependency screens (Fawdar et al. Proc Natl Acad Sci USA 2013).

We have been collaborating with the Translational Radiobiology Group (part of the University of Manchester Institute of Cancer Sciences) on the development and application of RNA signatures of tumour hypoxia (Eustace et al. Clinical Cancer Research 2013; Ramachandran et al. Eur J Cancer 2013; and Hall et al. Int J Radiat Oncol Biol Phys 2013). Many of these studies required the analysis of microarray data generated from formalin-fixed paraffin-embedded tissue (FFPET) samples. This presents a challenge because the RNA is degraded and chemically modified, so new bioinformatics strategies are needed to handle the data effectively.