Master 2 or 5th-year engineering student internship proposal, 2025-2026:
Deep Learning Prediction of Enzyme Functions Beyond EC Classification 
Keywords: Deep Learning, Bioinformatics, Large Language Models, Functional Annotation, Enzymes
Context
With the advent of high-throughput sequencing technologies, an ever-increasing number of genomes from diverse species are being sequenced. However, sequencing alone is not sufficient—understanding the functions of genes and proteins is crucial for deriving meaningful biological insights.
Current annotation methods primarily rely on homology-based approaches (e.g., BLAST, HMMER [1]) to transfer existing annotations to new sequences. While effective, these methods face significant limitations, particularly when no close homologs are available in existing databases.
Recent advances in deep learning and the development of Large Language Models (LLMs) trained on protein sequences have shown great promise for improving functional prediction. Within our team, we focus particularly on metabolic networks and, consequently, on the accurate characterisation of enzymatic functions. LLM-based enzyme prediction approaches, such as EnzBERT developed by our team [2] and others [3,4,5], already achieve performance comparable to traditional methods while outperforming them in low-homology scenarios [6]. However, these evaluations have relied primarily on the Enzyme Commission (EC) nomenclature, which limits the full potential of machine learning, even when the hierarchical structure of enzyme classes is leveraged, as in the successor of EnzBERT (see slides [7]).
The goal of this internship is to investigate and develop alternative hierarchical classification schemes that better capture enzymatic functions, enabling the training of next-generation annotation methods.
Objective
Design and evaluate a novel deep-learning-based annotation method of enzymatic functions that:
- Moves beyond traditional EC (Enzyme Commission) class prediction.
- Leverages hierarchical and multi-label classification schemes.
- Integrates large language models (e.g., EnzBERT and its successors).
- Complements or surpasses current homology-based methods.
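As an illustration of what "hierarchical and multi-label" means in practice, the sketch below expands EC numbers into one label per level of the hierarchy and builds a multi-hot target vector per protein. It is only a minimal example under stated assumptions: the protein identifiers, the function names (`ec_ancestors`, `hierarchical_targets`) and the choice of EC as the label space are illustrative, and this is not the encoding used by EnzBERT or its successor.

```python
def ec_ancestors(ec):
    """Return the labels from class level down to the most specific defined level.

    Partial EC numbers such as "3.4.-.-" stop at the first undefined level.
    """
    parts = ec.split(".")
    labels = []
    for depth in range(1, len(parts) + 1):
        if parts[depth - 1] == "-":
            break
        labels.append(".".join(parts[:depth]))
    return labels


def hierarchical_targets(protein_ecs):
    """Build a sorted label vocabulary and one multi-hot vector per protein."""
    label_set = sorted(
        {lab for ecs in protein_ecs.values() for ec in ecs for lab in ec_ancestors(ec)}
    )
    index = {lab: i for i, lab in enumerate(label_set)}
    targets = {}
    for protein_id, ecs in protein_ecs.items():
        vector = [0] * len(label_set)
        for ec in ecs:
            # Mark the EC number and all of its ancestors as positive labels.
            for lab in ec_ancestors(ec):
                vector[index[lab]] = 1
        targets[protein_id] = vector
    return label_set, targets


# Hypothetical annotations: protB is multi-label, protC is only partially annotated.
annotations = {
    "protA": ["1.1.1.1"],
    "protB": ["1.1.1.1", "2.7.11.1"],
    "protC": ["3.4.-.-"],
}
labels, targets = hierarchical_targets(annotations)
for protein_id, vector in targets.items():
    print(protein_id, [lab for lab, bit in zip(labels, vector) if bit])
```

Marking every ancestor as positive lets a model be trained and scored at each level of the hierarchy, and partially annotated proteins still contribute to the upper levels. The same encoding would apply to any other tree-shaped classification scheme studied during the internship.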
Missions
- Analyse existing enzyme classification systems (EC nomenclature, Gene Ontology, CAZy, CyanoLyase, Rhea, Reactome, KEGG, BioCyc).
- Construct a benchmark dataset from high-quality public annotations for training and evaluating enzyme function predictors.
- Develop and train machine learning models, including a next-generation EnzBERT capable of hierarchical and multi-label predictions.
- Compare with state-of-the-art methods (homology-based, deep learning, LLM-based).
Expected Results
- A curated and hierarchically structured enzyme annotation dataset suitable for training and evaluating machine learning models.
- A functional prototype of an enzyme annotation tool capable of predicting enzymatic functions.
- A scientific report and, if results permit, a publication co-authored by the student.
Required Skills
- Background in bioinformatics, computational biology, or computer science.
- Knowledge of deep learning or enzymology.
- Proficiency in Python.
- Strong analytical and problem-solving skills.
Practical Information
- Supervision: François Coste, Inria Researcher https://people.rennes.inria.fr/Francois.Coste/
- Location: Rennes, France (Dyliss Team, IRISA / Inria Research Centre at Rennes University)
- Start date: January–March 2026 (flexible)
- Duration: 5–6 months
- Continuation: High-performing interns may have the opportunity to extend this work for two years as a research engineer funded by ECxit, a new Inria Exploratory Action.
Application
Interested candidates should send a CV and a brief motivation letter to francois.coste@inria.fr
References
[1] Richard Durbin et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
[2] Nicolas Buton, François Coste, and Yann Le Cunff. “Predicting enzymatic function of protein sequences with attention”. In: Bioinformatics (2023).
[3] Gi Bae Kim et al. “Functional annotation of enzyme-encoding genes using deep learning with transformer layers”. In: Nature Communications 14.1 (2023), p. 7370.
[4] Zhenkun Shi et al. “Enzyme commission number prediction and benchmarking with hierarchical dual-core multitask learning framework”. In: Research (2023).
[5] Tianhao Yu et al. “Enzyme function prediction using contrastive learning”. In: Science 379.6639 (2023), pp. 1358–1363.
[6] João Capela et al. “Comparative Assessment of Protein Large Language Models for Enzyme Commission Number Prediction”. In: BMC Bioinformatics (2025).
[7] François Coste. “Enzymatic annotation of protein sequences with a deep language model”. Talk at the IA pour l’annotation des génomes days, MERIT CNRS network, Paris (2024). Slides: https://people.rennes.inria.fr/Francois.Coste/pub/2024-merit-coste.pdf