### Objective 1: Evolution/co-evolution

(i) Characterisation of the moment of speciation based on the genetic diversity: We will use an evolutionary individual-based model, where each individual in a general population of different species corresponds to a vertex in a non-directed graph, and an edge links two vertices if the individuals are possible sexual partners. A species could then be associated with a maximal subgraph in the total graph of the population. The objective here is to explore how a notion of graph motif and a transition in the distribution of such motifs might be associated with the moment of transition between two genetically different species (*i.e.* the moment of speciation).

(ii) Co-evolutionary analysis: The possible co-evolution of two sets of species (for instance of parasites and the hosts whose cells the former inhabits) is studied by means of a so-called co-phylogenetic method, which proceeds by optimally mapping one tree (in the example mentioned, of the parasites) to the other given a set of possible events (which include co-speciation and host-switch, among others) and their respective costs. Many combinatorial and algorithmic questions remain open, in particular, some related to the distance, such as Maximum Agreement Forest (MAF), between two trees which form an important part of any co-phylogenetic method. The second aspect was already part of an ongoing collaboration between the partners.

### Objective 2: Network analysis and inference

(i) Efficient and accurate comparison of regulatory networks: The main problem in comparing biological networks, besides the size of such networks that may be important (at least tens of thousands of nodes), is the presence of intrinsic noise. To address the latter, one member on the Brazilian side developed methods for parameter estimation and model selection, as well as tests to compare simultaneously several (sets of) graphs and to identify the correlation between vectors of graphs. However, these methods are computationally heavy (they are based on inefficient/naive optimisation algorithms), which makes their application to large networks very limited. The objective will be to exploit the expertise of members of the French side to design computationally more efficient algorithms, possibly dynamic ones, to allow the statistical methods developed in Brazil to become applicable to large networks.

(ii) Metabolic network analysis: Two main questions will be addressed here. The first is the identification of the parts of a metabolic network impacted by a condition change (which could be the introduction of a new species in the environment) using various types of omics data, and the second is the inference of an optimal set of intervention (mainly gene knock-outs) to reach a given metabolic goal. In both cases, we believe the capacity to reach efficient algorithms will require to better develop the theory from (di)graphs to (directed) hypergraphs, and thus to join the expertise of the two partners.

(iii) Inference of non-coding RNAs, of their targets intra and cross-species and of the overall network of post-transcriptional regulation: Although many methods for detecting such non-coding RNAs and inferring their function have been developed, these remain largely inefficient in providing a clear picture of the world of non-coding RNAs. Identifying the latter’s targets is even harder, notably as concerns some families such as miRNAs, for which this remains an “outstanding (open) problem in the miRNA field”. Both partners have worked in the area but independently, the French side more on the inference of the actors (some classes of non-coding RNAs mostly) and the Brazilian more on the network once these have been inferred. Our aim is then to join our efforts, notably by exploring an association of combinatorial and statistical / machine learning approaches. We first plan to start from intra-species, and then to progress towards the much more open problem of cross-species targeting.

(iv) Inference of variants, notably related to alternative splicing: Such variants are usually identified by modelling the input RNA-seq data as a de Bruijn digraph where the variants will correspond to internally vertex-disjoint (*s*,*t*)-paths with some characteristics. Listing and analysing all (*s*,*t*)-paths (also called bubbles in the bioinformatics literature) in a given graph is usually unfeasible in practice due to the exponential number of bubbles present in real data graphs. The French team has already proposed a notion of a bubble generator set, *i.e.*, a polynomial-sized subset of bubbles from which all the other bubbles can be obtained through a suitable application of a specific symmetric difference operator. Much work remains, however, to be done to improve the method both from the computational and from the application points of view. In the first case, for instance, our current generator is not minimal, and in the second case, it would be important to reduce the number of bubbles that correspond to repeats.

### Objective 3: Knowledge representation and model revision

This will represent the more novel aspect of the collaboration and an exploratory one. Given the enormous amount of noise, and more in general of open issues related to the inference of non-coding RNAs and their targets even intra-species, the French partner intends to use this as a starting point for discussions with the Brazilian expert, namely R. Wassermann. Machine learning techniques have been used in several related areas to approximate classification functions. However, the present algorithms usually do not produce any symbolic representation of such functions. On the other hand, the body of existing knowledge in the field, which can be obtained from experts, can be captured with symbolic representations in some formal language. It would be interesting to see whether such symbolic knowledge can be combined with data to provide more robust models for inferring new non-coding RNAs and especially their targets.