# Abstract

This project takes place within the context of the ANR project Colis. Its goal is to certify properties of transformations of data trees, such as the installation scripts for Linux distributions. In previous work, we developed algorithms that can automatically verify properties of top-down transducers on data trees. The main bottleneck of this approach is that such transducers cannot support move operations that are typical for many installation scripts. The goal of this PhD is to overcome this limitation. For this, we are going to propose a model that captures the flexibility of logical transformations. A distinguishing feature of our approach consists in incorporating ideas coming from functional programming.

**Keywords:** data tree; data transformations; data management; higher-order computation; linux packages

**Team:** Links (CRIStAL/INRIA Lille)

## Supervisors

# Requirements

The succesful candidate must have a master in computer science or an equivalent diploma. She/He should have:

- knowledge in formal methods and more particularly in finite states machines,
- rigor and ease with mathematical reasonning,
- competence in functional programming is a plus.

# Scientific Context

The ANR project Colis aims at making package managers of Linux distribution safer. Actions of installation scripts are viewed as transformations of data trees. The project proposes two different approaches to verify the properties of installation scripts: an approach based on proof theory and an approach based on verification methods. The first approach can cover a wide variety of scripts but requires human intervention. The second approach can be fully automated but may not be able to treat all the installation scripts.

Interestingly, the second approach can be completely phrased in terms of data transformations. It is then natural for the INRIA project team Links to develop this approach within the project Colis. Indeed, installation scripts are seen as programs that transform file systems and file systems are modeled as data trees. It is then natural to represent installation scripts by means of **data transformations**. These transformations play a central role in the context of **data management** and **data exchange** and find here a natural application.

The aim of the project consists in defining data transformations that are expressive enough to model a large number of installation scripts, but that are simple enough to be amenable to the verification of certain of their properties. A particular feature we expect from those transformations is that the successive applications of such transformations can be represented by a single transformation. This requirement comes from the way installation scripts will be modeled. Their basic operations will be represented by elementary transformations which are executed successively. Compiling such a sequence of transformations into a single one allows us to verify its properties. Many of the properties we wish to verify can be reduced to the problem of deciding whether two transformations are actually equivalent. The main technical tools that we are going to use are **logic** and **higher-order** programming. While logic has had a long tradition in data management systems, the use of higher-order features is original.

The applicative goals of the project are connected to the open source community. This ensures at the same time a quick dissemination of the techniques proposed by the project and in the longer run a good opportunity for industrial transfer.

# Research Program

The first objective of the PhD project is to overcome the limitations of the first model proposed by Links while preserving its good algorithmic properties. When this is done we need to define tailor made algorithms for this proposal that allow for the verification of the properties we are interested in. Finally, we shall implement and test the proposal on actual installation scripts.

**Defining the class of transducers** will constitute the first part of the project. Actually a good candidate is that of logical transformations. This class verifies the properties we have emphasized: it is closed under composition; the equivalence of particular transductions is decidable. We first need to assess whether it can cover a large enough set of operations that are preformed during the installations of packages. This will form the first milestone of the project.

In the case we estimate that this class of transformations is not rich enough for modeling scripts, we shall have a more precise understanding of the kinds of operations that are missing. Representing these operations within the framework of finite states transducers can have an interesting impact on the community interested by these techniques. In the case it happens that these transformations are sufficiently expressive for our goals, or after having enhanced sufficiently logical transductions to make them up to our need, we shall try to propose machines that can implement exactly these transformations.

**Machines that are equivalent to logical transducers** are already known [1,5,6], but they do not lend themselves to directly represent the composition property and the situation is similar for equivalence algorithms. Thus a second step of the project will consist in defining finite state machines that capture exactly the class of transformations we should have defined in the first part. In particular, we want to obtain a clear way of constructing the composition of such machines. Techniques borrowed to functional programming where functions can be parametrized by functions may play an interesting role in this construction as it has been already showed in other [8]. We shall then turn to the algorithmic properties of the machines we have proposed and try to construct a dedicated algorithm that is able to decide the equivalence of several machines.

**Extending the model** to treat more general classes of structures and more particularly the trees that represent files systems and also to treat data values for modeling file modifications is important so as to obtain a model that is mature enough to handle real-world installation scripts. As we mentioned earlier, work on automata has already been done with the team Links on these topics. Concerning trees, a general approach is given by [3]. This approach has already been extended to take into account data values [2]. Adapting those results to the machines we will have defined can prove technical. But the difficulties should not be overwhelming. It is nevertheless hard to assess how long coping with them can take as the precise models with which we are going to work are still to be defined.

**Implementing and testing** the proposed solution on installation scripts will constitute the third task of the project. This part will require a heavy investment and will probably meet some technical difficulties. However, the risk will be distilled by previous efforts made within the team in relation with the project concerning finite state methods. The first goal is to construct a prototype around the formal proposals that will have been previously made. Implementing an abstract machine for executing the transducers and a compilation algorithm that computes the compositions of transducers will constitute the first step of the implementation effort. Then we shall try to bridge the gap between actual scripts and the transducer model. This can be done by using the specialized language that the partner of the ANR project Colis are building and translate its basic operations into transducers. We shall then implement the equivalence algorithm. At this stage we will be in position of testing our proposal.

# Bibliography

[1] Rajeev Alur, Loris D’Antoni: Streaming Tree Transducers. ICALP (2) 2012: 42-53

[2] Adrien Boiret, Vincent Hugot, Joachim Niehren, and Ralf Treinen. Logics for Unordered Trees with Data Constraints on Siblings. In LATA : 9th Interna- tional Conference on Language and Automata Theory and Applications, pages 175–187, Nice, France, March 2015.

[3] Adrien Boiret, Vincent Hugot, Joachim Niehren, and Ralf Treinen. Automata for Unordered Trees. Information and Computation, July 2016.

[4] Bruno Courcelle: Monadic Second-Order Definable Graph Transductions: A Survey. Theor. Comput. Sci. 126(1): 53-75 (1994)

[5] Joost Engelfriet, Sebastian Maneth: The Equivalence Problem for Deterministic MSO Tree Transducers Is Decidable. FSTTCS 2005: 495-504

[6] Joost Engelfriet, Sebastian Maneth: Characterizing and Deciding MSO-Definability of Macro Tree Transductions. STACS 2000: 542-554

[7] Joost EngelFriet and Sebastian Maneth. Macro tree transduction, attribute grammars and mso definable tree translations. Information and Computation, 154:34–91, 1999.

[8] Sylvain Salvati. Lambda-Calculus and Formal Language Theory. Habilitation à diriger les recherches, Université de Bordeaux, 2015.