Period of activity : 1 st year of activity
Principal investigator (Inria): Rumen Andonov, GenScale/Inria
Principal investigator (partner): Hristo Djidjev, CCS-3/LANL, N M / USA
INRIA/Genscale : S. François (PhD), C. Marchet (Phd), P. Peterlongo (CR, INRIA), D.
Lavenier (DR, CNRS)
LANL/ CCS-3 : S. Eidenbenz (senior), G. Chennupati (Postdoc), Robyn Miller
Abstract of the scientific program
Genome sequencing and assembly, or the determination of the DNA sequences of a genome, is a key task and challenge in computational biology. During the last decade, the cost of sequencing has decreased dramatically and a huge amount of new genomes have been sequenced. Nevertheless, most of recent genome projects stay unfinished and nowadays the databases contain much more incompletely assembled genomes than whole stable reference genomes. The main reason is that producing a complete genome, or an as-complete-as-possible-genome, is an extremely difficult computational task (an NP-hard problem and, in spite of the efforts and the progress done by the bioinformatics community, no satisfactory solution is available today. New sequencing technologies (such as PacBio or Oxford Nanoport) are being developed that tend to produce longer DNA sequences and offer new opportunities, but also bring significant new challenges. The goal of this project is to develop a new methodology and tools based on novel optimization techniques and massive parallelism suited to these emerging technologies and able to tackle the complete assembly of large genomes.
Task 1 progress:
Task one aimed at exploiting the particular quasi-linear topology of the assembly graphs.The goal was to take advantage of this topology to propose an efficient partitioning method. The intuition was then to assemble so-obtained “graph slices” independently, thus improving assembly time and quality. In this context, we explored the idea of finding sequencing data (the reads) location in this graph. The intuition behind is that small errors and, in particular, genomic repeats generate local patterns in the graph. Those patterns are a major source of assembly limitation generating an exponential number of potential paths in this graph. Instead of cutting out the graph, the idea of finding sequencing data location in the graph enables one to avoid, or at least to dramatically limit, the number of potential paths. This results in assemblies of greater lengths and fewer errors. During the first year, we provided new algorithmic approaches for mapping reads on assembly graphs and designed first theoretical methods for exploiting these pieces of information. As a side effect, this enables a new way for efficiently correcting the sequencing data
Task 2 progress:
As mentioned in the project description, our strategy for tackling the scaffolding problem differs significantly from the approaches described in the literature. While the previous ones apply various heuristics for solving the different genome assembly stages separately, our methodology consists in developing a global optimization approach where the three steps (scaffolding, gap-filling and contig extension) are merged and simultaneously solved in the framework of a common objective function. Furthermore, our approach is based on integer programming models and uses Gurobi/CPLEX solver towards finding an exact solution. The approach was presented at several conferences and was accepted with great interest by the community of the domain [2,3,6].
During the first year we did a notable progress in tuning our methodology and in developing the associated tool called GAT for Genscale Assembly Tool. At this stage of development, we decided to focus on smaller genomes and, more specifically, on chloroplasts. The relevant features of the chloroplasts genomes are that they are circular and that they contain numerous repetitions, making the corresponding assembly problem combinatorically hard. These genomes are hence very suitable for our approach—while being small in length, they lead to optimization problems that are hard for traditional methods. Because of the small size of the genomes, our optimization problems have fewer binary variables and allow us to assemble in a short time many different chloroplast genomes and to tune our technique and tool. We created a benchmark of about 40 chloroplast genomes and performed computational comparisons with two famous tools (SPADE and SSPACE). The two figures below illustrate the number of contig, as well the genome fraction that stays non-assembled (non-predicted) after running the corresponding scaffolder. Our tool GAT2 (in blue) outperforms the other two on the considered benchmark.
Next year’s work program
During the next year we plan to work on the following directions of research:
Task 1: Exploiting the particular quasi-linear topology of the assembly graphs:
We will push forward efforts to exploit information of the reads injected in the assembly graphs. The first obvious application is the improvement of assembly techniques using single reads. ThusThus, in the next few months, we will implement and validate this approach. Latter on, we will focus on long- range information. As reads may come by pairs (close- distance with paired end, long- distance with mate pairs, any distances with HiC techniques), we will adapt our approaches to take into account this kind of information, this willthereby further improving the quality of the assemblies.
Integration of Third Generation Sequencing data (PacBio Rs methodology) in our approach:
This technology provides order of magnitude longer sequences than short reads and thus can decrease the ambiguities in the scaffolding. The approach we foresee starts with mapping the unitigs provided by the assembler on the long reads. Many unitigs are expected to map on a single long read. This way we can infer distances between pairs of unitigs and integrate them in our integer programming approach.
New method to determine, with sufficient accuracy, the number of occurrences of each unitigs:
Currently, the data is provided by Minia that applies a heuristic-based on k-mer counting. However, this technique is not reliable for small unitigs. We envisage to improve the quality of the approach by taking into account the similarity between unitigs: when two unitigs have very close sequences, their differences can be dropped by using the ‘N’ wildcard. The result would be a brand new unitig whose number of occurrences can be predicted more accurately because of its less specific nature..
Extending our methodology to a new benchmark containing bacteria:
These organisms are more complex than the chloroplasts (more than ten times bigger) while still having a circular genomes. This will require improving the performance of the approach. A very promising direction is to develop specific decomposition methods such as the following one. Given that the unitig graph G appears to have good partitioning properties and that the parameter that limits the performance the most is the number of mate pairs, we plan to use the following decomposition approach in order to improve the scalability of the algorithm. Divide the vertices of G into two sets V 1 and V 2 of roughly equal sizes by removing a small set of edges. Use only those mate pairs that have both endpoints in V 1 and solve the problem on the resulting reduced set of mate pairs (this ensures a maximum number of mate pairs satisfied in the first set as well as the existence of a path from the start to the end vertex). In the second step, solve the problem with the remaining mate pairs. After we implement and fine-tune the algorithm for two parts, we will try to generalize it to k parts, i.e., dividing the vertices of G into sets V 1 ,…,V k .
Record of activities
- Visit of R. Andonov at LANL from Mai 4 th to Mai 30 th
- Visit of S. Francois at LANL from Mai 4 th to Mai 30 th and from August 2 – August 23
- Visit of P. Peterlongo at LANL from Mai 13 th to Mai 21 th
- Visit of H. Djidjev at INRIA/Rennes from June 5 to July 4, 2017
- Participation of P. Peterlongo, S. François, R. Andonov and H. Djidjev (LANL)
at “Sequencing, Finishing and Analysis in the Future” (SFAF) – 12 th annual
meeting of experts in the genomics field, May 16-18, 2017, Santa Fe, NM.
- Participation of S. François and R. Andonov at the 8 th International Network
Optimization Conference (INOC 2017) February 26-28, Lisbon, Portugal
- Nouvelles approches pour l’exploitation des données de séquençage haut débit ,
Antoine Limasset, PhD thesis defended at Rennes, 12 th July 2017
- Global Optimization Methods for Genome Scaffolding, S François, R Andonov, D
Lavenier, H Djidjev, 8 th International Network Optimization Conference (INOC
2017) February 26-28, Lisbon, Portugal (to appear in the Special Issue of
Electronic Notes in Discrete Mathematics (ENDM), Number 64).
- Global Optimization for Scaffolding and Completing Genome Assemblies, S.
François, R Andonov, D Lavenier, H Djidjev, INRIA Research Report n° 9050 —
- Toward perfect reads, A. Limasset, J-F. Flot, P. Peterlongo, Recomb 2018
- Assembly of heterozygous genomes, P. Peterlongo, SFAF, 2017.
- Global optimization approach for scaffolding and assembly, R. Andonov, SFAF,