Coordinators: Pierre Peterlongo (general coordinator; GenScale, INRIA), Vincent Lacroix (Lyon; BAMBOO, INRIA-LBBE-UCBL), Eric Rivals (Montpellier; LIRMM)
The main goal of the Colib’read project was to design new algorithms dedicated to extracting biological knowledge from raw (non-assembled) data produced by High Throughput Sequencers (HTS), also called Next Generation Sequencers (NGS).
A few years ago, genomics underwent an unprecedented change with the advent of High Throughput Sequencing (HTS), also known as Next Generation Sequencing (NGS). These technologies generate a new type of data in huge volumes, and substantial computational developments are needed to take full advantage of them. Our project proposed an original way of extracting information from such data. Usually, a generic assembly (pretreatment) is applied to the data, and any information of interest is extracted in a second step. Our aim was to avoid this protocol, which leads to a significant loss of information or generates chimeric results because of the heuristics used in the assembly. Instead, we developed a set of innovative methods for extracting information of biological interest from HTS data that bypass any costly and often inaccurate assembly step. Importantly, the developed methods do not require a reference genome, which considerably broadened their spectrum of applications. In short, for each biological question, our general approach consisted of 1) defining a model for the elements being searched for; 2) detecting, in one or several HTS datasets, the elements that fit the model; 3) outputting those elements together with a score and their genomic neighbourhood. From a computational viewpoint, our proposal relied on a formal model based on the de Bruijn graph to develop algorithms able to handle huge amounts of data. Among other results, Colib’read delivered algorithms based on the de Bruijn graph and tools validated by biologists.
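To make the three-step approach concrete, here is a minimal Python sketch under stated assumptions. It is an illustration only and not the project's implementation: the choice of a single-nucleotide variant as the searched element, the function names (`build_de_bruijn`, `extend`, `find_bubbles`), the toy reads and the value of `k` are all hypothetical. The model is a "bubble" in the de Bruijn graph, that is, two simple paths that leave the same node and merge again after the same number of steps. Scoring and the wider genomic neighbourhood are left out to keep the example short.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer in a read is an edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend(graph, node, max_len):
    """Follow unambiguous successors from `node`, visiting at most `max_len` nodes."""
    path = [node]
    while len(path) < max_len:
        successors = graph.get(path[-1], set())
        if len(successors) != 1:
            break  # dead end or new branching: stop extending
        path.append(next(iter(successors)))
    return path


def find_bubbles(graph, k):
    """Yield pairs of sequences that fit the assumed model of a single-base variant.

    Model: a node with exactly two successors whose branches are simple paths
    of k nodes each and end on the same node.
    """
    for node, successors in graph.items():
        if len(successors) != 2:
            continue
        first, second = sorted(successors)
        path_a = extend(graph, first, k)
        path_b = extend(graph, second, k)
        if len(path_a) == k and len(path_b) == k and path_a[-1] == path_b[-1]:
            # Spell each path: the branching node plus the last letter of every node.
            seq_a = node + "".join(n[-1] for n in path_a)
            seq_b = node + "".join(n[-1] for n in path_b)
            yield seq_a, seq_b


# Toy, hypothetical reads: two alleles differing by one base (G vs T).
reads = ["TTACGGATCCAG", "TTACGTATCCAG"]
graph = build_de_bruijn(reads, k=4)
for seq_a, seq_b in find_bubbles(graph, k=4):
    print(seq_a, seq_b)  # prints: ACGGATC ACGTATC
```

No assembly and no reference genome are involved: the variant is read directly off the local graph structure built from the raw reads. A practical tool would additionally need compact indexing of the k-mers, handling of sequencing errors and reverse complements, and a scoring scheme, none of which this sketch attempts.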
This project was at the interface between (i) fundamental computational questions, (ii) algorithmic developments, including the design of ad hoc indexes and parallelisation, and (iii) biological applications for validation. Finally, (iv) it also included broad dissemination to the general public and in education.
More information about the project may be found here.