[PhD] Ab-Initio Classification And Detection Of Non-Coding RNAs From Thermodynamics Principles

Biological context. Once overlooked by a protein-centric view of cellular mechanisms, noncoding RNAs (ncRNAs) have recently been found to play many unsuspected roles (regulation, self-maturation, genome defense, ...), either alone or in complex with a protein. The action of ncRNAs has also been associated with diseases such as cancer (Lu et al., 2005), autism (Nakatani et al., 2009), and Alzheimer’s disease (Faghihi et al., 2008). More generally, the conclusions of the ENCODE effort (The ENCODE Project Consortium, 2007), which analyzed a portion of the human genome, showed that a very large majority of DNA is transcribed at some stage of cell life or in some cellular context. This contrasts sharply with the small proportion (2-3%) of the genome that codes for proteins, suggesting that a large number of currently unknown ncRNAs might be involved in cellular mechanisms. This conclusion is further supported by the explosive growth in the number of functional families indexed in the reference Rfam database (Griffiths-Jones et al., 2003): 176 in 2005, 574 in 2007, 1500 in 2011, and 2497 in 2015. Understanding which of the remaining transcripts lead to functional ncRNAs is one of the key challenges of RNA computational biology.

Beyond Minimal Free Energy (MFE) models. The functional role of an ncRNA is mainly determined by its structure. The secondary structure of RNA, a computationally tractable relaxation of the 3D structure, is therefore a valuable tool for RNA bioinformaticians, e.g. for the characterization of families (consensus structures of the Rfam database (Griffiths-Jones et al., 2003)) and for the prediction of folding (Zuker and Stiegler, 1981). At the core of these methods, the Turner model assigns free energies to the components (or loops) of a secondary structure, and structure prediction can be performed from a single sequence by minimizing the free energy (Zuker and Stiegler, 1981). This approach was later extended, based on the assumption that the different secondary structures compatible with a sequence co-exist within a Boltzmann distribution, yielding slightly more sensitive and more specific predictions on ncRNAs (Ding et al., 2005). Finally, features of the Boltzmann distribution, e.g. the expectation and variance of the free energy, can be efficiently extracted for a given sequence (Ding et al., 2014). Miklos et al. (2005) showed that taking these features into account discriminates between mRNAs and random sequences better than the free energy alone. Extending this approach to other additive features of compatible structures in the Turner model may lead to an alternative characterization of ncRNA families based on thermodynamic signatures.
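For reference, the quantities mentioned above can be written down explicitly. The notation is an assumption made for this illustration and does not appear in the original text: $w$ is the RNA sequence, $\mathcal{S}(w)$ the set of secondary structures compatible with $w$, $E(S)$ the free energy of structure $S$ in the chosen energy model, $R$ the gas constant and $T$ the temperature.

```latex
\begin{align*}
Z &= \sum_{S \in \mathcal{S}(w)} e^{-E(S)/RT},
  & P(S) &= \frac{e^{-E(S)/RT}}{Z}, \\
\mathbb{E}[E] &= \sum_{S \in \mathcal{S}(w)} P(S)\, E(S),
  & \mathrm{Var}[E] &= \mathbb{E}[E^2] - \mathbb{E}[E]^2 .
\end{align*}
```

In this notation the MFE structure is simply the most probable one, $\arg\max_S P(S)$, whereas the expectation and variance summarize the whole ensemble.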

Toward thermodynamics-based models for ncRNA detection. The goal of this project is to contribute a unifying algorithmic framework for the computation of RNA thermodynamic signatures, and to test their discriminatory power in the classification and identification of ncRNA families. Such signatures will primarily include the moments of the distribution of additive features (free energy, number of hairpins, number of unpaired positions, ...) in the Boltzmann ensemble. By capturing the whole folding landscape of an ncRNA as a weighted ensemble, these signatures are expected to be less prone to inaccuracies, for instance in the case of multistable or pseudoknotted RNAs. Although the project may eventually integrate evolutionary data (conservation or covariation) to improve its predictions, its primary emphasis is on the extraction of sequence-only signals, since: 1) most in silico approaches for the classification/detection of ncRNAs rely on the MFE paradigm, which has been shown to be a limiting factor in many contexts, e.g. for the RNA folding problem; and 2) such signals are associated with natural biochemical interpretations, from which bottom-up, mechanistic biological hypotheses can be established and tested.
Towards this goal, the candidate will build on and extend a previous contribution by Ponty and Saule (2011) to compute arbitrary moments in generic dynamic programming schemes. He/she will combine grammar transformations with algebraic dynamic programming techniques (Sauthoff et al., 2013) to automatically generate code for each feature. These features will be systematically computed and integrated into a machine learning approach, in order to scan for occurrences of ncRNAs belonging to existing classes (ncRNA classification problem) and to uncover new signals for the detection of novel classes of ncRNAs (ncRNA detection problem).
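To make the idea of computing moments inside a dynamic programming scheme more concrete, here is one standard way to propagate them. It is a sketch under stated assumptions, not necessarily the formulation of Ponty and Saule (2011). The notation is hypothetical: for a set $\mathcal{A}$ of (sub)structures and an additive feature $F$, let $M_r(\mathcal{A})$ denote the unnormalized $r$-th moment.

```latex
\begin{align*}
M_r(\mathcal{A}) &= \sum_{S \in \mathcal{A}} e^{-E(S)/RT}\, F(S)^r
  && \text{(so } M_0 \text{ is the partition function)} \\
M_r(\mathcal{A} \uplus \mathcal{B}) &= M_r(\mathcal{A}) + M_r(\mathcal{B})
  && \text{(disjoint alternatives of a grammar rule)} \\
M_r(\mathcal{A} \times \mathcal{B}) &= \sum_{s=0}^{r} \binom{r}{s}\, M_s(\mathcal{A})\, M_{r-s}(\mathcal{B})
  && \text{(independent parts, } E \text{ and } F \text{ additive)} \\
\mathbb{E}[F^r] &= \frac{M_r(\mathcal{S}(w))}{M_0(\mathcal{S}(w))}
\end{align*}
```

The product rule follows from expanding $(F(A) + F(B))^r$ with the binomial theorem. Since sums and products are the only operations an unambiguous decomposition needs, the same recursion that computes the partition function can carry the vector $(M_0, \dots, M_k)$ instead of a single number.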

Main tasks. After a thorough literature search, especially critical in the context of an interdisciplinary project, the candidate will design and implement a compiler which, for any desired feature and statistical moment, will generate a suitable grammar for the Bellman’s GAP compiler (Sauthoff et al., 2013). Compiling this grammar will yield the C code needed to automatically extract features (or correlations of features) in the Boltzmann distribution. The main reason for going through such a compilation step is to build on an existing implementation of a related problem that uses the latest thermodynamic parameters (the Vienna package (Hofacker et al., 1994)), whose reimplementation would be a tedious and unrewarding task.
Secondly, a list of features of interest will be established, and a corresponding family of programs will be automatically produced using the tool developed in the first step. The moments of these features will be systematically computed on selected Rfam families (Griffiths-Jones et al., 2003), and the candidate will test their capacity to discriminate RNA sequences belonging to different functional families. To that end, a list of natural hypotheses will be tested, coupled with exploratory machine learning approaches based on the Weka toolbox (Bouckaert et al., 2010).
Finally, depending on the outcome of the previous phases, the extracted signals will be validated on a larger scale, and compared against or used jointly with evolutionary information (covariation) to detect novel ncRNA sequences, with a special emphasis on multistable RNAs (riboswitches).
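As a rough illustration of the first task (one evaluator per feature and moment order), the sketch below runs a single recursion over a toy secondary-structure decomposition and lets the caller choose the feature and the order. It is written in Python rather than in the Bellman's GAP/C toolchain the project targets, and it does not attempt to reproduce that tool's syntax. Everything named here is hypothetical: the functions `atom`, `plus`, `times`, `fold_moments` and `mean_and_variance`, the toy energy of -1 kcal/mol per base pair (not the Turner model), the minimum hairpin loop size of 3, and the value RT = 0.6163 kcal/mol.

```python
from functools import lru_cache
from math import comb, exp

RT = 0.6163                 # kcal/mol near 37 C (assumed value)
PAIR_ENERGY = -1.0          # toy energy per base pair, NOT the Turner model
PAIR_WEIGHT = exp(-PAIR_ENERGY / RT)
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def atom(weight, feature, order):
    """Moment vector (M_0..M_order) of one event with a Boltzmann weight
    and a feature increment: M_r = weight * feature**r."""
    return tuple(weight * feature ** r for r in range(order + 1))


def plus(a, b):
    """Moments of a disjoint union: componentwise sum."""
    return tuple(x + y for x, y in zip(a, b))


def times(a, b):
    """Moments of independent parts whose features add up:
    binomial convolution of the two moment vectors."""
    return tuple(
        sum(comb(r, s) * a[s] * b[r - s] for s in range(r + 1))
        for r in range(len(a))
    )


def fold_moments(seq, order, per_pair, per_unpaired, min_loop=3):
    """Unnormalized moments of an additive feature over all pseudoknot-free
    structures of `seq`. The feature adds `per_pair` for each base pair and
    `per_unpaired` for each unpaired position. Recursive, so intended for
    short sequences only."""
    one = atom(1.0, 0.0, order)                  # empty structure
    unpaired = atom(1.0, per_unpaired, order)    # one unpaired position
    paired = atom(PAIR_WEIGHT, per_pair, order)  # one base pair

    @lru_cache(maxsize=None)
    def region(i, j):
        # Structures on positions i..j-1, decomposed unambiguously:
        # position i is either unpaired, or paired with some k.
        if i >= j:
            return one
        total = times(unpaired, region(i + 1, j))
        for k in range(i + min_loop + 1, j):
            if seq[i] + seq[k] in CANONICAL:
                inside_and_rest = times(region(i + 1, k), region(k + 1, j))
                total = plus(total, times(paired, inside_and_rest))
        return total

    return region(0, len(seq))


def mean_and_variance(moments):
    """Normalize raw moments (needs order >= 2) into mean and variance."""
    mean = moments[1] / moments[0]
    return mean, moments[2] / moments[0] - mean ** 2


if __name__ == "__main__":
    seq = "GGGAAAUCCC"
    print("base pairs:", mean_and_variance(fold_moments(seq, 2, 1.0, 0.0)))
    print("unpaired  :", mean_and_variance(fold_moments(seq, 2, 0.0, 1.0)))
    print("energy    :", mean_and_variance(fold_moments(seq, 2, PAIR_ENERGY, 0.0)))
```

Changing the feature only changes the two increments passed to `fold_moments`, and changing the moment order only changes the length of the vectors. This separation between the decomposition and its evaluation is what an automated code generator would exploit, here reduced to a toy setting.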

Contact: yann.ponty@lix.polytechnique.fr


  • Jun Lu, Gad Getz, Eric A Miska, et al. MicroRNA expression profiles classify human cancers. Nature, 435(7043):834-838, Jun 2005. doi: 10.1038/nature03702.
  • Jin Nakatani, Kota Tamada, Fumiyuki Hatanaka, et al. Abnormal behavior in a chromosome engineered mouse model for human 15q11-13 duplication seen in autism. Cell, 137(7):1235-1246, Jun 2009. doi: 10.1016/j.cell.2009.04.024.
  • Mohammad Ali Faghihi, Farzaneh Modarresi, Ahmad M Khalil, et al. Expression of a noncoding RNA is elevated in Alzheimer’s disease and drives rapid feed-forward regulation of beta-secretase. Nat Med, 14(7):723-730, Jul 2008. doi: 10.1038/nm1784.
  • The ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447:799-816, 2007.
  • Sam Griffiths-Jones, Alex Bateman, Mhairi Marshall, Ajay Khanna, and Sean R Eddy. Rfam: an RNA family database. Nucleic Acids Res, 31(1):439-441, Jan 2003.
  • M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res., 9:133-148, 1981.
  • Y. Ding, C. Y. Chan, and C. E. Lawrence. RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA, 11:1157-1166, 2005.
  • Yang Ding, William A. Lorenz, Ivan Dotu, Evan Senter, and Peter Clote. Computing the probability of RNA hairpin and multiloop formation. J Comput Biol, 21(3):201-218, Mar 2014. doi: 10.1089/cmb.2013.0148.
  • Istvan Miklos, Irmtraud M Meyer, and Borbala Nagy. Moments of the Boltzmann distribution for RNA secondary structures. Bull Math Biol, 67(5):1031-1047, Sep 2005. doi: 10.1016/j.bulm.2004.12.003.
  • Yann Ponty and Cédric Saule. A Combinatorial Framework for Designing (Pseudoknotted) RNA Algorithms. In WABI – 11th Workshop on Algorithms in Bioinformatics – 2011, Saarbrücken, Germany, 2011.
  • Georg Sauthoff, Mathias Möhl, Stefan Janssen, and Robert Giegerich. Bellman’s GAP - a language and compiler for dynamic programming in sequence analysis. Bioinformatics, 29(5):551-560, Mar 2013. doi: 10.1093/bioinformatics/btt022.
  • I. L. Hofacker, W. Fontana, P. F. Stadler, et al. Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie / Chemical Monthly, 125(2):167-188, 1994.
  • Remco R. Bouckaert, Eibe Frank, Mark A. Hall, et al. WEKA – Experiences with a Java Open-Source Project. Journal of Machine Learning Research, 11:2533-2541, 2010.
