News – Dyliss

Internship offer 2025-2026: Deep Learning Prediction of Enzyme Functions Beyond EC Classification

Master 2 or 5th-year engineering student internship proposal 2025-2026:

Deep Learning Prediction of Enzyme Functions Beyond EC Classification

Keywords: Deep Learning, Bioinformatics, Large Language Models, Functional Annotation, Enzymes

Context

With the advent of high-throughput sequencing technologies, an ever-increasing number of genomes from diverse species are being sequenced. However, sequencing alone is not sufficient: understanding the functions of genes and proteins is crucial for deriving meaningful biological insights.

Current annotation methods primarily rely on homology-based approaches (e.g., BLAST, HMMER [1]) to transfer existing annotations to new sequences. While effective, these methods face significant limitations, particularly when no close homologs are available in existing databases.

Recent advances in deep learning and the development of Large Language Models (LLMs) trained on protein sequences have shown great promise for improving functional prediction. Within our team, we focus particularly on metabolic networks and, consequently, on the accurate characterisation of enzymatic functions. LLM-based enzyme prediction approaches, such as EnzBERT developed by our team [2] and others [3,4,5], already achieve performance comparable to traditional methods, while outperforming them in low-homology scenarios [6]. However, these evaluations have been based primarily on the Enzyme Commission (EC) nomenclature, which limits the full potential of machine learning, in particular of methods that leverage the hierarchical structure of enzyme classes, as the successor of EnzBERT does (see slides [7]).

The goal of this internship is to investigate and develop alternative hierarchical classification schemes that better capture enzymatic functions, enabling the training of next-generation annotation methods.

Objective

Design and evaluate a novel deep-learning-based method for annotating enzymatic functions that:

Moves beyond traditional EC (Enzyme Commission) class prediction.
Leverages hierarchical and multi-label classification schemes.
Integrates large language models (e.g., EnzBERT and its successors).
Complements or surpasses current homology-based methods.

Missions

Analyse existing enzyme classification systems (EC nomenclature, Gene Ontology, CAZy, CyanoLyase, Rhea, Reactome, KEGG, BioCyc).
Construct a benchmark dataset from high-quality public annotations for training and evaluating enzyme function predictors.
Develop and train machine learning models, including a next-generation EnzBERT capable of hierarchical and multi-label predictions.
Compare with state-of-the-art methods (homology-based, deep learning, LLM-based).

Expected Results

A curated and hierarchically structured enzyme annotation dataset suitable for training and evaluating machine learning models.
A functional prototype of an enzyme annotation tool capable of predicting enzymatic functions.
A scientific report and, if results permit, a publication co-authored by the student.

Required Skills

Background in bioinformatics, computational biology, or computer science.
Knowledge of deep learning or enzymology.
Proficiency in Python.
Strong analytical and problem-solving skills.

Practical Information

Supervision: François Coste, Inria Researcher https://people.rennes.inria.fr/Francois.Coste/
Location: Rennes, France (Dyliss Team, IRISA / Inria Research Centre at Rennes University)
Start date: January–March 2026 (flexible)
Duration: 5–6 months
Continuation: High-performing interns may have the opportunity to extend this work for two years as a research engineer funded by ECxit, a new Inria Exploratory Action.

Application

Interested candidates should send a CV and a brief motivation letter to francois.coste@inria.fr

References

[1] Richard Durbin et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
[2] Nicolas Buton, François Coste, and Yann Le Cunff. “Predicting enzymatic function of protein sequences with attention”. In: Bioinformatics (2023).
[3] Gi Bae Kim et al. “Functional annotation of enzyme-encoding genes using deep learning with transformer layers”. In: Nature Communications 14.1 (2023), p. 7370.
[4] Zhenkun Shi et al. “Enzyme commission number prediction and benchmarking with hierarchical dual-core multitask learning framework”. In: Research (2023).
[5] Tianhao Yu et al. “Enzyme function prediction using contrastive learning”. In: Science 379.6639 (2023), pp. 1358–1363.
[6] João Capela et al. “Comparative Assessment of Protein Large Language Models for Enzyme Commission Number Prediction”. In: BMC Bioinformatics (2025).
[7] François Coste. “Enzymatic annotation of protein sequences with a deep language model”. Talk at the “IA pour l’annotation des génomes” days, MERIT CNRS network, Paris (2024). Slides: https://people.rennes.inria.fr/Francois.Coste/pub/2024-merit-coste.pdf

Yann Le Cunff’s HDR defense: 22nd May 2025

Yann Le Cunff will defend his habilitation “From Data to Phenotype: Integrating Data Structure and Prior Knowledge to Model Biological Systems” on Thursday 22nd May 2025 at 13:00 in amphitheatre B, building 02, Campus de Beaulieu.

Committee:

Anaïs BAUDOT (reviewer) CNRS Marseille
Blaise HANCZAR (reviewer) IBISC Évry
Carl HERRMANN (reviewer) Heidelberg Univ.
Emmanuelle BECKER Univ. Rennes
Hugues BERRY Inria Lyon
Antoine CORNUEJOLS Univ. Paris-Saclay, AgroParisTech
Dominique DE VIENNE INRAe
Guillaume GRAVIER CNRS Rennes
Khashayar PAKDAMAN Univ. Paris Cité

Camille Juigné won the Open Science PhD award

Camille Juigné won the Open Science PhD award for her work on “Integration and analysis of heterogeneous biological data through multilayer graph exploitation to gain deeper insights into feed efficiency variations in growing pigs”, co-supervised by Florence Gondret (INRAe PEGASE) and Emmanuelle Becker (DYLISS).

Matthieu Bouguéon’s PhD defense: 21st December 2023 14:00

Matthieu Bouguéon will defend his PhD thesis “Kappa language modelling of hepatic star cell dynamics during fibrosis development and reversion” on Thursday 21st December 2023 at 14:00, room Métivier at IRISA.

Thesis committee:
– Anna NIARAKIS, Professor – Université de Toulouse (reviewer)
– Cédric LHOUSSAINE, Professor – Université de Lille (reviewer)
– Fabien CRAUSTE, Research director CNRS – UMR 8541 CNRS, Univ Paris Cité (reviewer)
– Sophie LOTERSZTAJN, Research director Inserm – UMR 1149 Inserm, Univ Paris Cité
– Nathalie THÉRET, Research director Inserm – UMR1085 Inserm, EHESP, Univ Rennes (thesis director)
– Anne SIEGEL, Research director CNRS – UMR 6074 CNRS, Inria, Univ Rennes (thesis director)
– Jérôme FERET, Researcher Inria – ENS, Paris (thesis supervisor)

Abstract:
Hepatic fibrosis is an excessive scarring response induced by chronic injury. It is characterised by an accumulation of extracellular matrix (ECM), mainly composed of collagen 1 (COL1), which increases tissue rigidity and leads to severe liver dysfunction. Activation of hepatic stellate cells (HSCs), induced by growth factor TGFB1, is the main process underlying liver fibrosis.
In order to study the dynamics of HSCs during the development and reversion of fibrosis, we have developed a multi-scale model integrating the different states of HSCs as well as their production of COL1, under the influence of TGFB1. This model is implemented using the Kappa language, which is a rewriting language for site graphs. As well as being the first multi-scale Kappa model, this model allowed us to capture the plasticity of stellate cells during the development and reversion of fibrosis. The model’s predictions show that the inactivation state of HSCs plays an essential role in the development of fibrosis. The model was validated by new experiments in mice and the predictions were validated with RNAseq data from fibrotic patients.

Camille Juigné’s PhD defense: 1st December 2023 09:00

Camille Juigné will defend her PhD thesis “Integration and analysis of heterogeneous biological data through multilayer graph exploitation to gain deeper insights into feed efficiency variations in growing pigs” on Friday 1st December 2023 at 09:00, amphi Matagrain, Agrocampus.

Thesis committee:
– Mathieu Emily, Professor at Institut Agro Rennes Angers (examiner, committee chair)
– Michel Dumontier, Professor at Maastricht University (examiner)
– Andrea Rau, Research director at INRAE (reviewer)
– Fabien Jourdan, Research director at INRAE (reviewer)
– Florence Gondret, Research director at INRAE (thesis director)
– Emmanuelle Becker, Associate Professor at Université de Rennes (thesis co-supervisor)

Abstract:
Technological advances in the study of living systems have led to an explosion of multimodal and multicentre data. This raises many questions about the storage, standardisation and analysis of such massive data. This thesis therefore focuses on the development of an integrative method for analysing biological data in order to extract knowledge from it. To take their strong interdependence into account, the approach integrates different types of biological entities (mRNAs, proteins, metabolites, observable traits) that are usually studied independently of one another. The software solution developed integrates these heterogeneous data into a multilayer graph, with one layer per entity type. Its originality lies in linking elements of the same layer or of different layers through properties extracted from public databases and knowledge bases using Semantic Web technologies. From this graph, the aim is to characterise the relationships among a group of molecules of interest using graph-theory metrics. The method was applied to experimental datasets (transcriptomics, metabolomics and animal phenotypes) to describe and understand the relationships between molecules and their importance in the variation of feed efficiency in pigs. Feed efficiency is a key but complex phenotype for sustainable livestock farming. This work has made available innovative analysis methods, at different scales of biological organisation, that foster a better understanding of biological processes.

Keywords: Feed efficiency, Multilayer graph, Data integration, Multi-omics, Semantic Web

Nicolas Buton’s PhD defense: 18th October 2023 10:00

Nicolas Buton will defend his PhD thesis “Transformer models for interpretable and multilevel prediction of protein functions from sequences” on Wednesday 18th October 2023 at 10:00, room Métivier at IRISA.

Thesis Committee

Nataliya SOKOLOVSKA (President and Reviewer), Professor, LCQB laboratory, Paris, France
Tatiana GALOCHKINA (Reviewer), Associate Professor, Université Paris Cité, BIGR laboratory, France
Blaise HANCZAR (Reviewer), Professor, Université Paris-Saclay/Evry, IBISC laboratory, France
Yann Le Cunff (Supervisor), Associate Professor, Université de Rennes, France
François COSTE (Supervisor), Inria Researcher, Rennes, France
Olivier Dameron (Thesis director), Professor, Université de Rennes, France

thesis poster

Olivier Dennler’s PhD defense: 19th December 2022 14:00

Olivier Dennler’s defense on “Characterization in functional modules of ADAMTS-TSL proteins, by phylogeny approaches” will take place on Monday 19th December 2022 at 14:00, room Métivier at IRISA.

Committee:

Lydie LANE: Co-director of the CALIPHO group, Université de Genève, SIB
Hugues RICHARD: Research director, Robert Koch Institute, Berlin
Vincent BERRY: Professor, Université Montpellier, LIRMM
Pierre TUFFERY: Research director, INSERM, Paris
Nathalie THÉRET: Research director, INSERM, Rennes
François COSTE: Inria Researcher, Rennes
Samuel BLANQUART: Inria Researcher, Rennes
Catherine BELLEANNÉE: Associate Professor, Université de Rennes 1

Abstract:

The human ADAMTS-TSL multidomain proteins are involved in numerous pathologies.
Encoded by 26 paralogous genes, their domain combination is not sufficient to characterize their functional differences.
We propose in this thesis a new approach to identify functional regions of the sequences.
For this purpose, we use sequences from 9 eukaryotic species to identify conserved sequence modules specific to certain subgroups of homologous sequences.
The evolutionary analysis of the identified modules is obtained by performing a joint phylogenetic reconstruction of genes, species and modules.
Furthermore, to validate the functional interest of the identified modules, we associate phenotypes (PPI) with this evolutionary history.
This has led to the identification of concomitant acquisitions of “modules/phenotypes”, predicting the functionality of these modules.
Applying this approach to human ADAMTS-TSL proteins has allowed us to identify new, finer, non-contiguous functional regions that can describe their specificities.

Emmanuelle Becker’s HDR defense: 14th December 2022

Emmanuelle Becker’s HDR defense on “From homogeneous data to heterogeneous data in systems biology” will take place on Wednesday 14th December 2022 at 14:00, room Petri-Turing at IRISA.

Committee:

Anaïs BAUDOT: Research director (MMG, Marseille), reviewer
Christine BRUN: Research director (TAGC, Marseille), examiner
Alessandra CARBONE: Prof. Sorbonne Univ. (LCQB, Paris), examiner
Olivier DAMERON: Prof. Univ. Rennes (IRISA, Rennes), examiner
Elisa FROMONT: Prof. Univ. Rennes (IRISA, Rennes), examiner
Alejandro MAASS: Prof. Univ. Chile (CMM, Chile), reviewer
Anne SIEGEL: Research director (IRISA, Rennes), examiner
Patricia THEBAULT: Associate Prof. Univ. Bordeaux (LABRI, Talence), reviewer

Abstract:

Biological systems involve a large number of different entities, each functioning in a coordinated manner with the others. Their understanding is crucial and can be approached at different scales, from the molecular to the systemic one.
The observation of all these entities in different contexts and at different scales generates a “tsunami of data”, posing complex and interesting computational problems.
My work focuses on the development of methods for knowledge generation and knowledge extraction from these massive data.
The manuscript is organized in three axes. The first axis deals with methods to identify interpretable, robust and replicable signatures in high dimensional unimodal data. The second axis proposes the development of a new approach to integrate multimodal data (miRNA + MRI), and to identify disease progression scores. Finally, the third axis also deals with the integration of heterogeneous data, but with a systemic approach, i.e. taking into account the known relationships between entities. The work presented illustrates the complexity of extracting information from existing databases, despite the constant efforts of the bioinformatics community to structure and unify the available information.

Arnaud Belcour’s PhD defense: 21st October 2022 14:00

Arnaud Belcour’s PhD defense on “Combining knowledge-based and sequence comparison approaches to elucidate metabolic functions, from pathways to communities” will take place on Friday 21st October 2022 at 14:00, room Métivier at IRISA.

Committee:

Delphine Ropers, senior researcher, Inria Grenoble
David Vallenet, senior researcher, CEA Genoscope Évry-Courcouronnes
Karoline Faust, associate professor, KU Leuven
Fabien Jourdan, senior researcher, INRAe Toulouse
Cédric Lhoussaine, professor, Univ. Lille
Samuel Blanquart, researcher, Inria IRISA Rennes
Olivier Dameron, professor, Univ. Rennes 1, IRISA Rennes
Anne Siegel, senior researcher, CNRS IRISA Rennes

Abstract:

Metabolism can be modelled and studied at many levels. The first level is the metabolic pathway, a set of chemical transformations leading to the production of compounds of interest. Alternative metabolic pathways were predicted in an alga using a formalisation of metabolic pathway drift implemented with constraint programming. The second level is the metabolism of an organism, which contains hundreds of metabolic pathways. A method has been developed to reconstruct homogeneous metabolic networks from heterogeneous public data. The third level is the metabolism of a group of organisms (or taxon), which can be useful to characterize an organism that has not been clearly identified. To achieve this, a method using knowledge engineering and sequence comparison has been created. Finally, the fourth level is the metabolism of a community and the metabolic interactions within it. A method has been developed to identify the key species in a community.

Short scientific film festival: “patatogene” won 3 awards!

During the short scientific film festival Sciences en Cour[t]s, Kerian Thuillier (Dyliss), Roland Faure (GenScale), Khodor Hannoush (GenScale), Sandra Romain (GenScale) and Baptiste Ruiz (Dyliss) won 3 prizes (public, scenario and outreach) for their movie “patatogene” (soon on YouTube).