PhD Position: Distributed Management of Scientific Workflows for High-Throughput Plant Phenotyping

Directors: Esther Pacitti (Zenith team, University of Montpellier and Inria, LIRMM), François Tardieu (UMR LEPSE, INRA) and Christophe Pradal (CIRAD)

Contact: christophe.pradal@cirad.fr

Funding: #DigitAg-Inria (Inria PhD contract, net salary per month: 1600€)

Keywords: Scientific Workflow, Distributed Computing, Cloud & Grid Computing, Phenotyping, Computer Vision, Reproducibility

Skills: We are looking for capable candidates who are strongly motivated by challenging research topics in a multi-disciplinary environment. Applicants should have a good background in computer science, including distributed computing, databases and computer vision. Basic knowledge of scientific workflows would be a plus. For software development, C, Python or Java are preferred.

Context

This work is part of a new project on Scientific Workflows for Plant Phenotyping using cloud and grid computing, in the context of the Digital Agriculture Convergence Lab (#DigitAg) and in collaboration with the PIA Phenome project. The PhD will be co-directed by computer scientists (E. Pacitti, C. Pradal) and by a biologist (F. Tardieu), who will provide both the data and the use cases relevant to plant phenotyping.

In the context of climate change and crop performance improvement, plant scientists study traits of interest in order to discover their natural genetic variations and identify their genetic controls. One important category is morphological traits, which determine the 3D plant architecture [8]. This geometric information is used to compute in-silico light interception and radiation-use efficiency (RUE), which are essential components for understanding the genetic controls of biomass production and yield [9].

During the last decade, high-throughput phenotyping platforms have been designed to acquire quantitative data that will help understand plant responses to environmental conditions and the genetic control of these responses. Plant phenotyping consists of observing the physical and biochemical traits of plant genotypes in response to environmental conditions. Recently, projects such as Phenome have started to use high-throughput platforms to observe the dynamic growth of a large number of plants under different conditions, both in the field and on platforms. These heavily instrumented platforms produce huge datasets (images of thousands of plants, data collected by various sensors…) that keep growing with complex in-silico experiments. For example, the seven facilities of Phenome produce 150 to 200 terabytes of data per year. These data are heterogeneous (images, time courses), multiscale (from the organ to the field) and come from different sites. Farmers and breeders who use precision agriculture sensors are now also able to capture huge amounts of diverse data (e.g. images). Thus, the major problem becomes the automatic analysis of these massive datasets and the reproducibility of the in-silico experiments.

We define a scientific workflow as a pipeline to analyze experiments in an efficient and reproducible way, allowing scientists to express multi-step computational tasks (e.g. upload input files, preprocess the data, run various analyses and simulations, aggregate the results, …). OpenAlea [6] is a scientific workflow system that provides methods and software for plant modeling at different scales. It has been in constant use since 2004 by the plant community: the system has been downloaded 670,000 times and the web site has 10,000 unique visitors a month according to the OpenAlea web repository (https://openalea.gforge.inria.fr).
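To make the definition above concrete, the following minimal Python sketch represents a workflow as a set of named steps plus their data dependencies, and runs each step once its inputs are available. It does not use the OpenAlea API: the function run_workflow, the step names and the toy data are all hypothetical and serve only as an illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(steps, dependencies, inputs):
    """Run every step after the steps (or inputs) it depends on."""
    results = dict(inputs)
    for name in TopologicalSorter(dependencies).static_order():
        if name in results:  # raw input supplied by the user, nothing to run
            continue
        args = [results[dep] for dep in dependencies[name]]
        results[name] = steps[name](*args)
    return results


# Hypothetical three-step pipeline: preprocess, analyze, aggregate.
steps = {
    "preprocess": lambda raw: [x - min(raw) for x in raw],
    "analyze": lambda data: sum(data) / len(data),
    "report": lambda data, mean: {"n": len(data), "mean": mean},
}
# Each step lists the steps (or inputs) whose results it consumes.
dependencies = {
    "preprocess": ["raw"],
    "analyze": ["preprocess"],
    "report": ["preprocess", "analyze"],
}

results = run_workflow(steps, dependencies, {"raw": [12, 10, 14]})
print(results["report"])
```

Because the dependencies are explicit, steps that do not depend on each other could be run in parallel or on different sites, which is what makes this representation relevant to distributed execution.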

Within the Phenome project, we are developing Phenomenal, a software package in OpenAlea that is dedicated to the analysis of phenotyping data in connection with ecophysiological models [9,10]. Phenomenal provides fully automatic workflows dedicated to the 3D reconstruction, segmentation and tracking of plant organs. OpenAlea radiative models are used to estimate the light use efficiency and the in-silico crop performance in a large range of contexts. To illustrate, Figure 1 shows the Phenomenal workflow that automatically reconstructs the 3D shoot architecture of plants from multi-view images acquired with the Phenoarch platform. This workflow has been tested on various annual and perennial plants such as maize, cotton, sorghum and young apple trees.

Executing such complex scientific workflows on huge datasets may take a lot of time. Thus, we have started to design an infrastructure, called InfraPhenoGrid, to distribute the computation of workflows using the EGI/France Grilles computing facilities [1]. EGI provides access to a grid with multiple sites around the world, each with one or more clusters. This environment is now well suited for data-intensive science, with different research groups collaborating at different sites. In this context, the goal is to address two critical issues in the management of plant phenotyping experiments: (i) scheduling distributed computation and (ii) allowing reuse and reproducibility of experiments [1,2].

Thesis subject

The proposed PhD thesis consists of scheduling the Phenomenal workflow on distributed resources and providing proofs of concept.

Scheduling distributed computation

We shall adopt an algebraic approach, which is better suited for the optimization and parallelization of data-intensive scientific workflows [3]. The scheduling problem resembles scientific workflow execution in a multisite cloud [4,5]. The objective of the thesis is to go further and propose workflow parallelization and dynamic task allocation and data placement techniques to work with heterogeneous sites, as in EGI. To exchange and share intermediate data, we plan to use iRODS, an open-source data management software that federates distributed and heterogeneous data resources into a single logical file system [7]. In this context, the challenge is to deal with both task allocation and data placement among the different sites, while taking into account their heterogeneity, for instance, different transfer capabilities and cost models.
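The interplay between task allocation and data placement can be illustrated with a small greedy heuristic in Python. This is only a sketch under simplifying assumptions and not the algebraic approach of [3] or the schedulers of [4,5]: each site is assumed to have a single compute speed and a single incoming bandwidth, and the site names, task names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    speed: float             # work units processed per second (assumed)
    bandwidth: float         # MB per second for incoming transfers (assumed)
    busy_until: float = 0.0  # time at which the site becomes free


@dataclass
class Task:
    name: str
    work: float      # amount of computation, in work units
    input_mb: float  # size of the input data
    data_site: str   # site that currently stores the input data


def estimated_finish(task, site):
    """Cost model: waiting time + data transfer time + compute time."""
    local = task.data_site == site.name
    transfer = 0.0 if local else task.input_mb / site.bandwidth
    return site.busy_until + transfer + task.work / site.speed


def allocate(tasks, sites):
    """Greedy allocation: largest tasks first, each to its earliest-finish site."""
    plan = {}
    for task in sorted(tasks, key=lambda t: t.work, reverse=True):
        best = min(sites, key=lambda s: estimated_finish(task, s))
        best.busy_until = estimated_finish(task, best)  # reserve the site
        plan[task.name] = best.name
    return plan


sites = [Site("A", speed=10, bandwidth=100), Site("B", speed=20, bandwidth=10)]
tasks = [
    Task("reconstruct", work=200, input_mb=1000, data_site="A"),
    Task("segment", work=100, input_mb=10, data_site="A"),
]
plan = allocate(tasks, sites)
print(plan)
```

In this toy setting the task with a large input stays on the slower site that already holds its data, while the task with a small input is moved to the faster site. A realistic scheduler would also have to account for intermediate data produced during execution, monetary cost and changing site load.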

Allowing reuse and reproducibility of experiments

Modern scientific workflow systems are now equipped with modules that assist with reuse and reproducibility. This is notably the case of provenance modules, which trace the parameter settings chosen at runtime and the data sets used as input of (or produced by) each workflow task. However, workflow reproducibility and reuse depend on providing users with the means to interact with provenance information. The originality of the thesis lies in considering tools that are popular among data scientists, namely interactive notebooks (such as RStudio or Jupyter), as a means for users to interact with provenance information directly extracted from workflow runs. Challenges are numerous and include providing users with provenance information that is simplified (sequential) yet correct (in terms of the data dependencies involved), hiding the complexity of highly parallel executions.
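One way to read "simplified (sequential), yet correct" is as a linearization of the provenance graph: tasks are listed one after another, and each task appears after the tasks that produced its inputs. The Python sketch below illustrates this idea under the assumption that each provenance record lists a task with its input and output files. The record format, the function sequential_view and the file names are hypothetical, not those of an existing provenance module.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def sequential_view(records):
    """Order tasks so that each one comes after the producers of its inputs."""
    # Map every output file to the task that produced it.
    producer = {out: r["task"] for r in records for out in r["outputs"]}
    # A task depends on the producers of its inputs; raw inputs have none.
    graph = {
        r["task"]: {producer[i] for i in r["inputs"] if i in producer}
        for r in records
    }
    return list(TopologicalSorter(graph).static_order())


# Hypothetical records, in the arbitrary order a parallel run logged them.
records = [
    {"task": "segment", "inputs": ["volume.npy"], "outputs": ["organs.json"]},
    {"task": "binarize", "inputs": ["img_000.png"], "outputs": ["mask_000.png"]},
    {"task": "reconstruct", "inputs": ["mask_000.png"], "outputs": ["volume.npy"]},
]
order = sequential_view(records)
print(order)
```

The resulting order could be used to emit one notebook cell per task. When several valid orders exist, which is common for highly parallel runs, choosing the one that is easiest for a user to read remains an open question.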

The approaches proposed in this PhD will be implemented in OpenAlea. Image data from controlled and field phenotyping experiments will be provided by the Phenome project. The grid and cloud infrastructure used for the experiments will be France Grilles (European Grid Infrastructure).

References

[1] C. Pradal, S. Artzet, J. Chopard, D. Dupuis, C. Fournier, M. Mielewczik, V. Nègre, P. Neveu, D. Parigot, P. Valduriez, S. Cohen-Boulakia: InfraPhenoGrid: A scientific workflow infrastructure for plant phenomics on the Grid. Future Generation Comp. Syst. 67: 341-353 (2017).

[2] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y. Le Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal, C. Blanchet: Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities. Future Generation Computer Systems. doi: 10.1016/j.future.2017.01.012 (2017).

[3] E. Ogasawara, J. F. Dias, D. de Oliveira, F. Porto, P. Valduriez, M. Mattoso: An Algebraic Approach for Data-centric Scientific Workflows. Proceedings of the VLDB Endowment (PVLDB), 4(12): 1328-1339 (2011).

[4] J. Liu, E. Pacitti, P. Valduriez, M. Mattoso: A Survey of Data-Intensive Scientific Workflow Management. J. Grid Comput. 13(4): 457-493 (2015).

[5] J. Liu, E. Pacitti, P. Valduriez, D. de Oliveira, M. Mattoso: Multi-objective scheduling of Scientific Workflows in multisite clouds. Future Generation Computer Systems, 63: 76-95 (2016).

[6] C. Pradal, C. Fournier, P. Valduriez, S. Cohen-Boulakia: OpenAlea: scientific workflows combining data analysis and simulation. SSDBM: 11:1-11:6 (2015).

[7] A. Rajasekar, R. Moore, C. Y. Hou, C. A. Lee, R. Marciano, A. de Torcy et al.: iRODS Primer: integrated rule-oriented data system. Synthesis Lectures on Information Concepts, Retrieval, and Services, 2(1): 1-143 (2010).

[8] M. Balduzzi, B. M. Binder, A. Bucksch, C. Chang, L. Hong, A. Iyer-Pascuzzi, C. Pradal, E. Sparks: Reshaping plant biology: Qualitative and quantitative descriptors for plant morphology. Frontiers in Plant Science 8:117 (2017).

[9] L. Cabrera‐Bosquet, C. Fournier, N. Brichet, C. Welcker, B. Suard, F. Tardieu: High‐throughput estimation of incident light, light interception and radiation‐use efficiency of thousands of plants in a phenotyping platform. New Phytologist, 212(1): 269-281 (2016).

[10] S. Artzet, N. Brichet, L. Cabrera, T. W. Chen, J. Chopard, M. Mielewczik, C. Fournier, C. Pradal: Image workflows for high throughput phenotyping platforms. BMVA technical meeting: Plants in Computer Vision, London, United Kingdom (2016).

Permanent link to this article: https://team.inria.fr/zenith/phd-position-distributed-management-of-scientific-workflows-for-high-throughput-plant-phenotyping/