Research – GEOSTAT – Geometry & Statistics in Acquisition Data

Presentation

Overall objectives

Biomolecules and their function(s).

Computational Structural Biology (CSB) is the scientific domain concerned with the development of algorithms and software to understand and predict the structure and function of biological macromolecules. This research field is inherently multi-disciplinary. On the experimental side, biology and medicine provide the objects studied, while biophysics and bioinformatics supply experimental data, which are of two main kinds. On the one hand, genome sequencing projects supply protein sequences: ~200 million sequences have been archived in UniProtKB/TrEMBL, which collects the protein sequences yielded by these projects. On the other hand, structure determination experiments (notably X-ray crystallography, nuclear magnetic resonance, and cryo-electron microscopy) give access to geometric models of molecules – atomic coordinates. Alas, only ~150,000 structures have been solved and deposited in the Protein Data Bank (PDB), a number to be compared against the $\sim 10^{8}$ sequences found in UniProtKB/TrEMBL. With one structure for ~1000 sequences, we hardly know anything about biological functions at the atomic/structural level. Complementing experiments, physical chemistry/chemical physics supply the required models (energies, thermodynamics, etc). More specifically, let us recall that a protein with $n$ atoms has $d = 3 n$ Cartesian coordinates, and fixing these (up to rigid motions) defines a conformation. As conveyed by the iconic lock-and-key metaphor for interacting molecules, biology is based on the interactions stable conformations make with each other. Turning these intuitive notions into quantitative ones requires delving into statistical physics, as macroscopic properties are average properties computed over ensembles of conformations. Developing effective algorithms to perform accurate simulations is especially challenging for two main reasons. The first one is the high dimension of conformational spaces (see $d = 3 n$ above, typically several tens of thousands) and the nonlinearity of the energy functionals used. The second one is the multiscale nature of the phenomena studied: with biologically relevant time scales beyond the millisecond, and atomic vibration periods of the order of femtoseconds, simulating such phenomena typically requires $\gg 10^{12}$ conformations/frames, a (brute) tour de force rarely achieved 37.
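The two orders of magnitude quoted above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the system size (10,000 atoms), the simulated time (1 ms) and the integration step (1 fs) are assumed values chosen to match the text, and the function names are hypothetical.

```python
def n_cartesian_coords(n_atoms: int) -> int:
    """Dimension d = 3n of the Cartesian conformational space."""
    return 3 * n_atoms


def n_frames(simulated_time_s: float, time_step_s: float) -> float:
    """Number of integration steps needed to cover simulated_time_s."""
    return simulated_time_s / time_step_s


# Assumed illustrative values: 10,000 atoms, 1 ms simulated, 1 fs time step.
d = n_cartesian_coords(10_000)
frames = n_frames(1e-3, 1e-15)

print(d)                # 30000 coordinates
print(f"{frames:.0e}")  # about 1e+12 frames
```

Reaching time scales beyond the millisecond therefore pushes the frame count past $10^{12}$, which is the bottleneck described above.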

Computational Structural Biology: three main challenges.

The first challenge, sequence-to-structure prediction, aims to infer the possible structure(s) of a protein from its amino acid sequence. While progress has been made recently, in particular using deep learning techniques 36, the models obtained so far are static and coarse-grained.

The second one is protein function prediction. Given a protein with known structure, i.e., 3D coordinates, the goal is to predict the partners of this protein, in terms of stability and specificity. This understanding is fundamental to biology and medicine, as illustrated by the example of the SARS-CoV-2 virus responsible for the COVID-19 pandemic. To infect a host, the virus first fuses its envelope with the membrane of a target cell, and then injects its genetic material into that cell. Fusion is achieved by a so-called class I fusion protein, also found in other viruses (influenza, SARS-CoV-1, HIV, etc). The fusion process is highly dynamic and involves large amplitude conformational changes of the molecules. It is poorly understood, which hinders our ability to design therapeutics to block it.

Finally, the third one, large assembly reconstruction, aims at solving (coarse-grain) structures of molecular machines involving tens or even hundreds of subunits. This research vein was promoted about 15 years ago by the work on the nuclear pore complex 25. It is often referred to as reconstruction by data integration, as it requires combining coarse-grain models (notably from cryo-electron microscopy (cryo-EM) and native mass spectrometry) with atomic models of subunits obtained from X-ray crystallography. Fitting the latter into the former requires exploring the conformation space of subunits, whence the importance of protein dynamics.

As an illustration of these three challenges, consider the problem of designing proteins blocking the entry of SARS-CoV-2 into our cells (Fig. 1). The first challenge is illustrated by the problem of predicting the structure of a blocker protein from its sequence of amino acids – a tractable problem here since the mini proteins used comprise only about 50 amino acids (Fig. 1(A), 28). The second challenge is illustrated by the calculation of the binding modes and the binding affinity of the designed proteins for the receptor-binding domain (RBD) of SARS-CoV-2 (Fig. 1(B)). Finally, the last challenge is illustrated by the problem of solving structures of the virus interacting with a cell, to understand how many spikes are involved in the fusion mechanism leading to infection. In 28, the promising designs suggested by modeling have been assessed by an array of wet lab experiments (affinity measurements, circular dichroism for thermal stability assessment, structure resolution by cryo-EM). The hyperstable minibinders identified provide starting points for SARS-CoV-2 therapeutics 28. We note in passing that this is truly remarkable work; yet the designed proteins stem from a template (the bottom helix from ACE2) and are rather small.

Protein dynamics: core CS – maths challenges.

To present challenges in structural modeling, let us recall the following ingredients (Fig. 2). First, a molecular model with $n$ atoms is parameterized over a conformational space $\mathcal{X}$ of dimension $d = 3 n$ in Cartesian coordinates, or $d = 3 n - 6$ in internal coordinates, i.e., upon removing rigid motions; these coordinates are also called degrees of freedom (d.o.f.). Second, recall that the potential energy landscape (PEL) is the mapping $V (\cdot)$ from $\mathbb{R}^{d}$ to $\mathbb{R}$ providing a potential energy for each conformation 38, 35. Example potential energies (PE) are CHARMM, AMBER, MARTINI, etc. Such PE belong to the realm of molecular mechanics, and implement atomic or coarse-grain models. They may include a solvent model, either explicit or implicit. Their definition requires a significant number of parameters (up to $\sim 1{,}000$), fitted to reproduce physico-chemical properties of (bio-)molecules 39.
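To make the notion of a mapping from $d = 3n$ coordinates to an energy concrete, here is a minimal sketch using a Lennard-Jones pair potential. This is a textbook toy model, not CHARMM, AMBER or MARTINI; the function names and the values of `epsilon` and `sigma` are illustrative assumptions.

```python
import math
from itertools import combinations


def lennard_jones_energy(coords, epsilon=1.0, sigma=1.0):
    """Toy potential energy V: R^{3n} -> R for n atoms.

    coords is a flat list [x1, y1, z1, ..., xn, yn, zn]. epsilon and sigma
    are illustrative values, not parameters of any real force field.
    """
    if len(coords) % 3 != 0:
        raise ValueError("expected 3 coordinates per atom")
    atoms = [coords[i:i + 3] for i in range(0, len(coords), 3)]
    energy = 0.0
    for a, b in combinations(atoms, 2):
        r = math.dist(a, b)
        sr6 = (sigma / r) ** 6
        energy += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return energy


def degrees_of_freedom(n_atoms):
    """Return (3n, 3n - 6): Cartesian and internal dimensions (n >= 3, non-collinear)."""
    return 3 * n_atoms, 3 * n_atoms - 6


# Two atoms placed at the minimum of the pair potential, r = 2^(1/6) * sigma.
r_min = 2 ** (1 / 6)
dimer = [0.0, 0.0, 0.0, r_min, 0.0, 0.0]
shifted = [1.0, 2.0, 3.0, 1.0 + r_min, 2.0, 3.0]  # same conformation, translated

print(lennard_jones_energy(dimer))    # close to -epsilon
print(lennard_jones_energy(shifted))  # unchanged by the rigid translation
print(degrees_of_freedom(10))
```

The energy depends only on inter-atomic distances, which is why rigid motions can be factored out when counting degrees of freedom.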

These PE are usually considered good enough to study non-covalent interactions – our focus, even though they do not cover the modification of chemical bonds. In any case, we take such a function for granted 1.

The PEL codes all structural, thermodynamic, and kinetic properties, which can be obtained by averaging properties of conformations over so-called thermodynamic ensembles. The structure of a macromolecular system requires the characterization of active conformations and important intermediates in functional pathways involving significant basins. In assigning occupation probabilities to these conformations by integrating Boltzmann’s distribution, one treats thermodynamics. Finally, transitions between the states, modeled, say, by a master equation (a continuous-time Markov process), correspond to kinetics. Classical simulation methods based on molecular dynamics (MD) and Monte Carlo sampling (MC) are developed in the lineage of the seminal work by the 2013 recipients of the Nobel prize in chemistry (Karplus, Levitt, Warshel), which was awarded “for the development of multiscale models for complex chemical systems”. However, except for highly specialized cases where massive calculations have been used 37, neither MD nor MC give access to the aforementioned time scales. In fact, the main limitation of such methods is that they treat structural, thermodynamic and kinetic aspects at once 31. The absence of specific insights on these three complementary pieces of the puzzle makes it impossible to optimize simulation methods, and results in general in the inability to obtain converged simulations on biologically relevant time-scales.
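The thermodynamic and kinetic ingredients mentioned above can be illustrated on a two-state toy system. The sketch below computes Boltzmann occupation probabilities from state energies, then integrates a master equation with explicit Euler steps. The energies, rates, time step and function names are illustrative assumptions; the rates are chosen to satisfy detailed balance so that the kinetics relax to the Boltzmann populations.

```python
import math

K_B = 0.0019872041  # gas constant in kcal/(mol K), i.e. Boltzmann constant per mole


def boltzmann_probabilities(energies, temperature=300.0):
    """Occupation probabilities p_i proportional to exp(-E_i / (k_B T))."""
    beta = 1.0 / (K_B * temperature)
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


def evolve_master_equation(p0, rates, dt, n_steps):
    """Integrate dp_i/dt = sum_j (k_ji p_j - k_ij p_i) with explicit Euler steps.

    rates[i][j] is the transition rate from state i to state j.
    """
    p = list(p0)
    n = len(p)
    for _ in range(n_steps):
        flux = [0.0] * n
        for i in range(n):
            for j in range(n):
                if i != j:
                    flow = rates[i][j] * p[i]
                    flux[i] -= flow
                    flux[j] += flow
        p = [pi + dt * fi for pi, fi in zip(p, flux)]
    return p


# Two states separated by 1 kcal/mol (assumed values).
p_eq = boltzmann_probabilities([0.0, 1.0])
k10 = 1.0                        # rate from state 1 to state 0 (arbitrary units)
k01 = k10 * p_eq[1] / p_eq[0]    # detailed balance: p0 * k01 = p1 * k10
p_t = evolve_master_equation([1.0, 0.0], [[0.0, k01], [k10, 0.0]], dt=0.01, n_steps=5000)

print(p_eq)
print(p_t)  # relaxes towards p_eq
```

For realistic systems the number of states and the stiffness of the rates make such direct integration impractical, which is one reason dedicated methods are needed.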

The hardness of structural modeling owes to three intertwined reasons.

First, PELs of biomolecules usually exhibit a number of critical points exponential in the dimension 26; fortunately, they enjoy a multi-scale structure 29. Intuitively, the significant local minima/basins are those which are deep or isolated/wide, two notions which are mathematically qualified by the concepts of persistence and prominence. Mathematically, problems are plagued with the curse of dimensionality and measure concentration phenomena. Second, biomolecular processes are inherently multi-scale, with motions spanning $\sim$ 15 and $\sim$ 4 orders of magnitude in time and amplitude respectively 24. Developing methods able to exploit this multi-scale structure has remained elusive. Third, macroscopic properties of biomolecules, i.e., observables, are average properties computed over ensembles of conformations, which calls for a multi-scale statistical treatment both of thermodynamics and kinetics.
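The notion of persistence can be illustrated on a one-dimensional sampled energy profile. The following sketch sweeps the samples by increasing value and merges sublevel-set components with a union-find structure; when two basins meet, the one with the higher minimum dies, and its persistence is the height difference between the merge level and its minimum. This is a generic textbook construction under simplifying assumptions (1D, regular sampling, hypothetical function name), not the method used for high-dimensional landscapes.

```python
import math


def minima_persistence(values):
    """Map each local minimum (by index) of a 1D profile to its persistence."""
    n = len(values)
    parent = [None] * n   # None: sample not yet in the sublevel set
    lowest = [None] * n   # lowest[root]: index of the minimum of that component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    persistence = {}
    order = sorted(range(n), key=lambda k: values[k])
    for i in order:
        parent[i] = i
        lowest[i] = i
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # The component with the higher minimum dies at level values[i].
                if values[lowest[ri]] < values[lowest[rj]]:
                    old, young = ri, rj
                else:
                    old, young = rj, ri
                dying = lowest[young]
                if dying != i:  # i itself is not a minimum if it has a lower neighbor
                    persistence[dying] = values[i] - values[dying]
                parent[young] = old
    # The global minimum never dies.
    persistence[lowest[find(order[0])]] = math.inf
    return persistence


profile = [3.0, 1.0, 4.0, 0.0, 2.0, 1.5, 5.0]
print(minima_persistence(profile))
```

On this profile the shallow minimum at index 5 gets persistence 0.5, the deeper one at index 1 gets 3.0, and the global minimum at index 3 is assigned infinite persistence, so thresholding on persistence keeps only the significant basins.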

Validating models.

A natural and critical question concerns the validation of models proposed in structural bioinformatics. For all three types of questions of interest (structures, thermodynamics, kinetics), there exist experiments against which the models must be compared – when the experiments can be conducted.

For structures, the models proposed can readily be compared against experimental results stemming from X-ray crystallography, NMR, or cryo-electron microscopy. For thermodynamics, which we illustrate here with binding affinities, predictions can be compared against measurements provided by calorimetry or surface plasmon resonance. Lastly, kinetic predictions can also be assessed by various experiments such as binding affinity measurements (for the prediction of $K_{\mathrm{on}}$ and $K_{\mathrm{off}}$), or fluorescence based methods (for kinetics of folding).
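As a reminder of how kinetic and thermodynamic quantities connect, the sketch below uses the standard relations $K_d = K_{\mathrm{off}} / K_{\mathrm{on}}$ and $\Delta G = RT \ln (K_d / c_0)$ with a 1 M standard state. The rate values and function names are assumptions chosen for illustration, not measurements from the report.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol K)


def dissociation_constant(k_on, k_off):
    """K_d in mol/L from k_on in 1/(M s) and k_off in 1/s."""
    return k_off / k_on


def binding_free_energy(k_d, temperature=298.15):
    """Standard binding free energy in kcal/mol, with a 1 M standard state."""
    return R_KCAL * temperature * math.log(k_d)


# Assumed illustrative rates: k_on = 1e6 1/(M s), k_off = 1e-3 1/s.
k_d = dissociation_constant(1e6, 1e-3)
delta_g = binding_free_energy(k_d)

print(f"K_d = {k_d:.1e} M")                # nanomolar range
print(f"dG  = {delta_g:.2f} kcal/mol")     # negative: binding is favorable
```

A predicted pair of rates can thus be checked both against kinetic experiments and, through $K_d$, against affinity measurements.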

Last activity report : 2024

2024 : PDF – HTML
2023 : PDF – HTML
2022 : PDF – HTML
2021 : PDF – HTML
2020 : PDF – HTML
2019 : PDF – HTML
2018 : PDF – HTML
2017 : PDF – HTML
2016 : PDF – HTML
2015 : PDF – HTML

Results

New results

Modeling the dynamics of proteins

Keywords: Protein flexibility, protein conformations, collective coordinates, conformational sampling, loop closure, kinematics, dimensionality reduction.

Simpler protein domain identification using spectral clustering

The decomposition of a biomolecular complex into domains is an important step to investigate biological functions and ease structure determination. A successful approach to do so is the SPECTRUS algorithm, which provides a segmentation based on spectral clustering applied to a graph coding inter-atomic fluctuations derived from an elastic network model.

We present a method 20 which makes three straightforward and useful additions to SPECTRUS. For single structures, we show that high quality partitionings can be obtained from a graph Laplacian derived from pairwise interactions, without normal modes. For sets of homologous structures, we introduce a Multiple Sequence Alignment mode, exploiting both the sequence based information (MSA) and the geometric information embodied in experimental structures. Finally, we propose to analyze the clusters/domains delivered using the so-called D-Family matching algorithm, which establishes a correspondence between domains yielded by two decompositions, and can be used to handle fragmentation issues.

Our domains compare favorably to those of the original SPECTRUS, and those of the deep learning based method Chainsaw. Using two complex cases, we show in particular that our method is the only one handling complex conformational changes involving several sub-domains. Finally, a comparison of our method and Chainsaw, using the manually curated domain classification ECOD as a reference, shows that high quality domains are obtained without using any evolutionary related piece of information.

The method is provided in the Structural Bioinformatics Library, see SBL and Spectral domain explorer.
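The general principle of spectral partitioning from a graph Laplacian built on pairwise contacts can be sketched as follows. This is neither SPECTRUS nor the method of 20: it is a plain two-way cut on an unweighted contact graph, with the Fiedler vector approximated by power iteration. The distance cutoff, the toy coordinates and all function names are assumptions made for illustration.

```python
import math


def contact_laplacian(points, cutoff):
    """Unweighted graph Laplacian L = D - A, with an edge when distance <= cutoff."""
    n = len(points)
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= cutoff:
                lap[i][j] = lap[j][i] = -1.0
                lap[i][i] += 1.0
                lap[j][j] += 1.0
    return lap


def fiedler_vector(lap, n_iter=500):
    """Approximate the eigenvector of the second-smallest eigenvalue of L.

    Power iteration on (shift * I - L), projecting out the constant vector at
    each step. Requires n >= 2; the index-ramp start vector is an arbitrary
    choice and could be orthogonal to the answer on symmetric inputs.
    """
    n = len(lap)
    shift = 2.0 * max(lap[i][i] for i in range(n)) + 1.0  # exceeds the largest eigenvalue
    x = [float(i) for i in range(n)]
    for _ in range(n_iter):
        mean = sum(x) / n
        x = [v - mean for v in x]
        y = [shift * x[i] - sum(lap[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    mean = sum(x) / n
    return [v - mean for v in x]


def bipartition(points, cutoff):
    """Split points into two groups according to the sign of the Fiedler vector."""
    vec = fiedler_vector(contact_laplacian(points, cutoff))
    return [0 if v < 0 else 1 for v in vec]


# Two compact toy "domains" linked by a single contact.
points = [
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, 0.8, 0.0),
    (2.3, 0.0, 0.0), (3.3, 0.0, 0.0), (2.8, 0.8, 0.0),
]
labels = bipartition(points, cutoff=1.5)
print(labels)  # the first three and last three points fall in different groups
```

Practical methods use weighted graphs, several eigenvectors and a clustering step in the spectral embedding; the sketch only shows why a sparse cut between compact groups shows up in the low end of the Laplacian spectrum.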

Algorithmic foundations

Keywords: Computational geometry, computational topology, optimization, graph theory, data analysis, statistical physics.

A mini-review of clustering algorithms and their theoretical properties, with applications to molecular science

Clustering is a fundamental task, in particular to analyze potential and free energy landscapes in molecular science. In this survey 19, I review the key properties of three remarkable clustering algorithms (k-means++, persistence-based clustering, and spectral clustering) with a double perspective. The first one is the specification of the main mathematical and algorithmic properties of the algorithms; the second one is the relevance of these methods for structural, thermodynamic, and kinetic analysis. Doing so provides a unique opportunity to mention important connections between optimization, graph theory, geometry, and theoretical biophysics.

Improved seeding strategies for k-means and Gaussian mixture fitting with Expectation-Maximization

k-means clustering and Gaussian Mixture model fitting are fundamental tasks in data analysis and statistical modeling. Practically, both algorithms follow a general iterative pattern, relying on (randomized) seeding techniques.

We revisit the previous seeding methods and formalize their key ingredients (metric used for seed sampling, number of seed candidates, metric used for seed selection). This analysis results in casting most of the previous methods into a coherent framework and, most importantly, yields novel families of initialization methods. These novel methods exploit two ideas: a lookahead principle, which conditions the seed selection on an enhanced coherence with the final metric used to assess the algorithm, and a multipass strategy, which uses at least two selection passes to tame the effect of randomization.

Experiments show a consistent constant factor improvement over classical contenders in terms of the final metric (sum of squared errors (SSE) for k-means, log-likelihood for Expectation-Maximization applied to Gaussian mixture model fitting), at the same cost. Roughly speaking, our improvement with respect to the greedy smart seeding of k-means++ matches that yielded by this greedy smart seeding with respect to the classical randomized smart seeding.
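For reference, the two classical baselines mentioned above can be sketched in a few lines: randomized smart seeding (k-means++ sampling proportional to squared distance) and its greedy variant, which draws several candidates per round and keeps the one minimizing the SSE. This is not the novel family of methods from the tech report; the function names, the `n_candidates` parameter and the toy data are assumptions for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(points, centers):
    """Sum of squared errors: each point is charged to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def kmeanspp_seeding(points, k, n_candidates=1, rng=None):
    """D^2 seeding; n_candidates > 1 gives the greedy variant."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0.0:
            # Every point coincides with a center: fall back to uniform sampling.
            centers.append(rng.choice(points))
            continue
        candidates = rng.choices(points, weights=d2, k=n_candidates)
        # Keep the candidate that yields the lowest SSE once added.
        best = min(
            candidates,
            key=lambda c: sum(min(d, sq_dist(p, c)) for p, d in zip(points, d2)),
        )
        centers.append(best)
        d2 = [min(d, sq_dist(p, best)) for p, d in zip(points, d2)]
    return centers


# Three well-separated toy groups on the plane.
points = [(x + dx, y) for x, y in [(0.0, 0.0), (10.0, 0.0), (20.0, 5.0)]
          for dx in (0.0, 0.1, 0.2)]
plain = kmeanspp_seeding(points, 3, n_candidates=1, rng=random.Random(1))
greedy = kmeanspp_seeding(points, 3, n_candidates=4, rng=random.Random(1))

print(sse(points, plain), sse(points, greedy))
```

The seeds returned by either variant are then passed to the usual Lloyd iterations; the greedy variant spends more distance evaluations per seed in exchange for a better starting SSE on average.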

Remark. Due to the double blind review process of machine learning conferences, the tech report will be made public early 2025.

Subspace-Embedded Spherical Clusters: a novel cluster model for compact clusters of arbitrary dimension

In collaboration with L. Goldenberg (former Inria intern), and with S. Suren (IIT Delhi).

Subspace clustering aims at selecting a small number of original coordinates (features) so that clusters are clearly identified in those subspaces. Subspace techniques rely on parametric cluster models including affine, spherical, and Gaussian cluster models, to name a few. To go beyond fully dimensional spherical cluster models and affine clusters of arbitrary dimension, we introduce Subspace-embedded spherical clusters (SESC), a novel cluster model for compact clusters of arbitrary intrinsic dimension. The well-posed nature of such clusters is established via the study of an optimization problem relying on an arrangement of hyper-spheres. This arrangement is used to exhibit a piecewise smooth strictly convex function, amenable to nonsmooth optimization.

We illustrate the merits of the SESC model via comparisons against projection medians and the distance to the measure, and for clustering.

Remark. Due to the double blind review process of machine learning conferences, the tech report will be made public early 2025.

Applications in structural bioinformatics and beyond

Keywords: Docking, scoring, interfaces, protein complexes, phylogeny, evolution.

AlphaFold predictions on whole genomes at a glance

The 2024 Nobel prize in chemistry was awarded to David Baker (Univ. of Washington) for computational protein design, and to Demis Hassabis and John Jumper (Google DeepMind, London, UK), for protein structure prediction. The DeepMind software, called AlphaFold, plays a crucial role in helping biologists understand protein functions. We designed novel statistical analyses to assess predictions 21.

For model organisms, AlphaFold predictions show that 30% to 40% of amino acids have a (very) low pLDDT (predicted local distance difference test) confidence score. This observation, combined with the method’s high complexity, calls for investigating difficult cases, the link with IDPs (intrinsically disordered proteins) or IDRs (intrinsically disordered regions), and potential hallucinations. We do so via four contributions. First, we provide a multiscale characterization of stretches with coherent pLDDT values along the sequence, an important analysis for model quality assessment. Second, we leverage the 3D atomic packing properties of predictions to represent a structure as a distribution. This distribution is then mapped into the so-called 2D arity map, which simultaneously performs dimensionality reduction and clustering, effectively summarizing all structural elements across all predictions. Third, using the database of domains ECOD, we study potential biases in AlphaFold predictions at the sequence and structural levels, identifying a specific region of the arity map populated with low quality 3D domains. Finally, with a focus on proteins with intrinsically disordered regions (IDRs), using DisProt and AIUPred, we identify specific regions of the arity map characterized by false positives and false negatives in terms of IDRs.

Summarizing, the arity map sheds light on the accuracy of AlphaFold predictions, both in terms of 3D domains and IDRs.
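To give a feel for what stretches of coherent pLDDT values look like, here is a minimal sketch that bins per-residue scores into the four confidence bands commonly used for AlphaFold models (above 90, 70 to 90, 50 to 70, below 50) and run-length encodes them along the sequence. It is a single-scale illustration, not the multiscale characterization of 21; the handling of scores falling exactly on a band boundary, the function names and the toy scores are assumptions.

```python
from itertools import groupby


def plddt_band(score):
    """Confidence band of a per-residue pLDDT score (boundary handling assumed)."""
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"


def stretches(scores):
    """Run-length encode the sequence as (band, start index, length) triples."""
    out = []
    pos = 0
    for band, group in groupby(scores, key=plddt_band):
        length = sum(1 for _ in group)
        out.append((band, pos, length))
        pos += length
    return out


def low_confidence_fraction(scores):
    """Fraction of residues in the low or very low bands."""
    return sum(1 for s in scores if plddt_band(s) in ("low", "very low")) / len(scores)


# Toy per-residue scores (made up for illustration).
scores = [95, 92, 85, 60, 45, 40, 88, 91]
for band, start, length in stretches(scores):
    print(f"{start:3d} +{length}  {band}")
print(low_confidence_fraction(scores))
```

On real predictions, the same encoding applied to a whole proteome gives the per-band fractions quoted above and locates the long low-confidence stretches that deserve closer inspection.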

EncoMPASS: a database for the analysis of membrane protein structures, and symmetries

Membrane proteins (MPs) constitute about 30% of the proteome of each organism, but they represent only 2% of the entries in the Protein Data Bank (PDB), as their three-dimensional structure is difficult to determine experimentally. Membrane protein structures differ from the rest of the proteome in two respects: 1) despite the great variety of functions performed, their structures are very similar, thus making structural classification more challenging and 2) although symmetric regions are common throughout the whole proteome, in MPs they are often essential for their functional mechanism.

Among the databases collecting and organizing experimental structures of MPs, EncoMPASS is the only one relating the structure and internal symmetry of experimentally determined membrane protein complexes. In this new publication 18, the pipeline and founding criteria for building the database are described along with a complete analysis of the available data. The quality and consistency checks regularly performed on EncoMPASS make it a high quality resource for membrane protein structure algorithms.

Detecting orphan proteins in a nematode’s genome

Proteins classified in the same family are called homologs and are thought to share a common ancestor from which they have evolved. Proteins that cannot at present be classified in any known family are called orphan proteins, and their existence can be attributed either to the current limitations in protein classification (we then talk of distant homologs) or to genuinely novel proteins (de novo proteins). Determining whether a protein is an orphan – or, even more, a distant homolog or a de novo protein – is particularly challenging due to the uncertainties and intricacy of homolog detection. In the poster 23 presented at JOBIM2024 by E. Seçkin, we present a new pipeline for determining orphan proteins, and its application to the genomes of the Meloidogyne genus of nematodes. This work is a fundamental step in preparation for the first ever algorithm for characterizing the structure of orphan proteins.
