Jun 21

PhD Position: Distributed Management of Scientific Workflows for High-Throughput Plant Phenotyping

Directors: Esther Pacitti (Zenith team, University of Montpellier and Inria, LIRMM), François Tardieu (UMR LEPSE, INRA) and Christophe Pradal (CIRAD)

Contact: christophe.pradal@cirad.fr

Funding: #DigitAg-Inria (Inria PhD contract, net salary: 1600€/month)

Keywords: Scientific Workflow, Distributed Computing, Cloud & Grid Computing, Phenotyping, Computer Vision, Reproducibility

Skills: We are looking for candidates who are strongly motivated by challenging research topics in a multi-disciplinary environment. Applicants should have a good background in computer science, including distributed computing, databases and computer vision. Basic knowledge of scientific workflows would be a plus. For software development, the C, Python or Java languages are preferred.


This work is part of a new project on scientific workflows for plant phenotyping using cloud and grid computing, in the context of the Digital Agriculture Convergence Lab (#DigitAg) and in collaboration with the PIA Phenome project. This PhD will be directed by computer scientists (E. Pacitti, C. Pradal) and by a biologist (F. Tardieu), who will provide both the data and the use cases relevant to plant phenotyping.

In the context of climate change and the need to improve crop performance, plant scientists study traits of interest in order to discover their natural genetic variations and identify their genetic controls. One important category is morphological traits, which determine the 3D plant architecture [8]. This geometric information is used to compute in-silico light interception and radiation-use efficiency (RUE), which are essential components for understanding the genetic controls of biomass production and yield [9].

During the last decade, high-throughput phenotyping platforms have been designed to acquire quantitative data that help in understanding plant responses to environmental conditions and the genetic control of these responses. Plant phenotyping consists in observing the physical and biochemical traits of plant genotypes in response to environmental conditions. Recently, projects such as the Phenome project have started to use high-throughput platforms to observe the dynamic growth of a large number of plants under different conditions, both in the field and on platforms. These widely instrumented platforms produce huge datasets (images of thousands of plants, data collected by various sensors, etc.) that keep increasing with complex in-silico experiments. For example, the seven facilities of Phenome produce 150 to 200 terabytes of data per year. These data are heterogeneous (images, time courses), multiscale (from the organ to the field) and come from different sites. Farmers and breeders who use sensors from precision agriculture are now also able to capture huge amounts of diverse data (e.g. images). The major problem thus becomes the automatic analysis of these massive datasets and the reproducibility of the in-silico experiments.

We define a scientific workflow as a pipeline to analyze experiments in an efficient and reproducible way, allowing scientists to express multi-step computational tasks (e.g. upload input files, preprocess the data, run various analyses and simulations, aggregate the results). OpenAlea [6] is a scientific workflow system that provides methods and software for plant modeling at different scales. It has been in constant use by the plant community since 2004: the system has been downloaded 670,000 times and the web site receives 10,000 unique visitors a month, according to the OpenAlea web repository (https://openalea.gforge.inria.fr).
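The dataflow idea behind such workflows can be sketched in a few lines of Python. This is a deliberately minimal illustration, not OpenAlea's actual API: all function names below are invented, and the toy tasks merely stand in for real steps such as uploading, preprocessing and analysis.

```python
# A minimal sketch of the dataflow idea: tasks are plain functions and a
# pipeline chains them, each step consuming the previous step's output.
# OpenAlea's real system is far richer (visual programming, typed ports,
# lazy evaluation); names here are illustrative only.

def run_pipeline(steps, data):
    """Apply each task in order, passing results downstream."""
    for step in steps:
        data = step(data)
    return data

# Toy tasks standing in for "upload input files, preprocess, analyze".
def load(paths):
    return [p.upper() for p in paths]         # pretend to read files

def preprocess(images):
    return [img.strip() for img in images]    # pretend to clean the data

def analyze(images):
    return {img: len(img) for img in images}  # pretend to extract traits

result = run_pipeline([load, preprocess, analyze], [" img1 ", " img2 "])
```

In a real workflow system each task is a node in a dataflow graph rather than an item in a list, so independent branches can be scheduled in parallel.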

Within the framework of Phenome, we are developing Phenomenal, a software package in OpenAlea dedicated to the analysis of phenotyping data in connection with ecophysiological models [9,10]. Phenomenal provides fully automatic workflows for the 3D reconstruction, segmentation and tracking of plant organs. OpenAlea radiative models are used to estimate light-use efficiency and in-silico crop performance in a large range of contexts. To illustrate, Figure 1 shows the Phenomenal workflow that automatically reconstructs the 3D shoot architecture of plants from multi-view images acquired with the PhenoArch platform. This workflow has been tested on various annual and perennial plants, such as maize, cotton, sorghum and young apple trees.

Executing such complex scientific workflows on huge datasets may take a lot of time. Thus, we have started to design an infrastructure, called InfraPhenoGrid, to distribute the computation of workflows using the EGI/France Grilles computing facilities [1]. EGI provides access to a grid with multiple sites around the world, each with one or more clusters. This environment is now well suited for data-intensive science, with different research groups collaborating at different sites. In this context, the goal is to address two critical issues in the management of plant phenotyping experiments: (i) scheduling distributed computation and (ii) allowing reuse and reproducibility of experiments [1,2].

Thesis subject

The proposed PhD thesis consists in scheduling the Phenomenal workflow on distributed resources and providing proofs of concept.

Scheduling distributed computation

We shall adopt an algebraic approach, which is better suited for the optimization and parallelization of data-intensive scientific workflows [3]. The scheduling problem resembles that of scientific workflow execution in a multisite cloud [4,5]. The objective of the thesis is to go further and propose workflow parallelization, dynamic task allocation and data placement techniques that work with heterogeneous sites, as in EGI. To exchange and share intermediate data, we plan to use iRODS, an open-source data management software that federates distributed and heterogeneous data resources into a single logical file system [7]. In this context, the challenge is to deal with both task allocation and data placement among the different sites, while taking into account their heterogeneity, for instance different transfer capabilities and cost models.
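To make the task allocation and data placement trade-off concrete, here is a hypothetical greedy heuristic, not the algorithm to be developed in the thesis: each task is assigned to the site minimizing its estimated finish time, combining the site's current load, its compute speed, and the cost of moving the task's input data between heterogeneous sites. All names and cost models are illustrative.

```python
# Hypothetical greedy scheduler (illustrative only): finish time estimate =
# site availability + input data transfer time + compute time.

def schedule(tasks, sites, bw):
    """tasks: (name, work, input_size, input_site) tuples;
    sites: {site: compute_speed}; bw: {(src, dst): bandwidth}."""
    ready = {s: 0.0 for s in sites}  # time at which each site becomes free
    plan = {}
    for name, work, size, data_site in tasks:
        best, best_t = None, float("inf")
        for site, speed in sites.items():
            xfer = 0.0 if site == data_site else size / bw[(data_site, site)]
            t = ready[site] + xfer + work / speed
            if t < best_t:
                best, best_t = site, t
        plan[name] = best
        ready[best] = best_t
    return plan

# Two heterogeneous sites; all input data initially resides on siteA.
sites = {"siteA": 2.0, "siteB": 1.0}
bw = {("siteA", "siteB"): 10.0, ("siteB", "siteA"): 10.0}
tasks = [("t1", 4.0, 1.0, "siteA"),
         ("t2", 4.0, 1.0, "siteA"),
         ("t3", 4.0, 1.0, "siteA")]
plan = schedule(tasks, sites, bw)  # t3 spills to siteB once siteA is loaded
```

Even this toy version shows the tension the thesis addresses: the slower site only becomes attractive once the faster one is busy enough to outweigh the data transfer cost.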

Allowing reuse and reproducibility of experiments

Modern scientific workflow systems are now equipped with modules that assist with reuse and reproducibility. This is notably the case of provenance modules, which are able to trace the parameter settings chosen at runtime and the datasets used as input of (or produced by) each workflow task. However, workflow reproducibility and reuse depend on providing users with the means to interact with provenance information. The originality of the thesis lies in considering tools popular among data scientists, namely interactive notebooks (such as RStudio or Jupyter), as a means for users to interact with provenance information directly extracted from workflow runs. The challenges are numerous and include providing users with simplified (sequential), yet correct (in terms of the data dependencies involved) provenance information, hiding the complexity of highly parallel executions.
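As a hypothetical sketch of runtime provenance capture, a decorator can record each task invocation together with its name, inputs and output, producing the kind of linear trace a notebook could later display. Real provenance modules record much more (parameters, timestamps, stable data identifiers), and none of the names below come from OpenAlea.

```python
# Hypothetical provenance capture: one record per task run, in execution
# order, giving a simplified sequential view of the workflow.
import functools

provenance = []  # the trace a notebook could render

def traced(task):
    @functools.wraps(task)
    def wrapper(*args, **kwargs):
        out = task(*args, **kwargs)
        provenance.append({"task": task.__name__,
                           "inputs": args,
                           "output": out})
        return out
    return wrapper

@traced
def binarize(pixels):           # toy stand-in for an image-analysis task
    return [1 if p > 128 else 0 for p in pixels]

@traced
def count_plant_pixels(mask):   # toy stand-in for a trait-extraction task
    return sum(mask)

total = count_plant_pixels(binarize([10, 200, 250, 3]))
```

The hard part the thesis targets is exactly what this sketch ignores: when tasks run in parallel across sites, the trace is no longer a simple list, yet users still need a sequential, dependency-correct view of it.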

The approaches proposed in this PhD will be implemented in OpenAlea. Image data from controlled and field phenotyping experiments will be provided by the Phenome project. The grid and cloud infrastructure for the experiments will be France Grilles (European Grid Infrastructure, EGI).


[1] C. Pradal, S. Artzet, J. Chopard, D. Dupuis, C. Fournier, M. Mielewczik, V. Nègre, P. Neveu, D. Parigot, P. Valduriez, S. Cohen-Boulakia: InfraPhenoGrid: A scientific workflow infrastructure for plant phenomics on the Grid. Future Generation Comp. Syst. 67: 341-353 (2017).

[2] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y. Le Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal, C. Blanchet: Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities. Future Generation Computer Systems. doi: 10.1016/j.future.2017.01.012 (2017).

[3] E. Ogasawara, J. F. Dias, D. de Oliveira, F. Porto, P. Valduriez, M. Mattoso. An Algebraic Approach for Data-centric Scientific Workflows. In Proceedings of the VLDB Endowment (PVLDB), 4(12): 1328-1339 (2011).

[4] J. Liu, E. Pacitti, P. Valduriez, M. Mattoso: A Survey of Data-Intensive Scientific Workflow Management. J. Grid Comput. 13(4): 457-493 (2015).

[5] J. Liu, E. Pacitti, P. Valduriez, D. de Oliveira, M. Mattoso: Multi-objective scheduling of Scientific Workflows in multisite clouds. Future Generation Computer Systems, 63: 76-95 (2016).

[6] C. Pradal, C. Fournier, P. Valduriez, S. Cohen-Boulakia: OpenAlea: scientific workflows combining data analysis and simulation. SSDBM: 11:1-11:6 (2015).

[7] A. Rajasekar, R. Moore, C. Y. Hou, C. A. Lee, R. Marciano, A. de Torcy et al. iRODS Primer: integrated rule-oriented data system. Synthesis Lectures on Information Concepts, Retrieval, and Services, 2(1), 1-143. (2010).

[8] M. Balduzzi, B. M. Binder, A. Bucksch, C. Chang, L. Hong, A. Lyer-Pascuzzi, C. Pradal, E. Sparks. Reshaping plant biology: Qualitative and quantitative descriptors for plant morphology. Frontiers in Plant Science 8:117 (2017).

[9] L. Cabrera‐Bosquet, C. Fournier, N. Brichet, C. Welcker, B. Suard, F. Tardieu. High‐throughput estimation of incident light, light interception and radiation‐use efficiency of thousands of plants in a phenotyping platform. New Phytologist, 212(1), 269-281 (2016).

[10] S. Artzet, N. Brichet, L. Cabrera, T. W. Chen, J. Chopard, M. Mielewczik, C. Fournier, C. Pradal. Image workflows for high throughput phenotyping platforms. BMVA technical meeting: Plants in Computer Vision, London, United Kingdom (2016).

Permanent link to this article: https://team.inria.fr/zenith/phd-position-distributed-management-of-scientific-workflows-for-high-throughput-plant-phenotyping/

May 15

IBC seminar: Dennis Shasha “Reducing Errors by Refusing to Guess (Occasionally)” 1 June 2017

IBC Seminar
Thursday 1 June 2017, 15h, room 1/124, bat. 5 Campus Saint Priest, Montpellier

Reducing Errors by Refusing to Guess (Occasionally)

Dennis Shasha

New York University
We propose a meta-algorithm to reduce the error rate of state-of-the-art machine learning algorithms by refusing to make predictions in certain cases, even when the underlying algorithms suggest predictions. Intuitively, our new Conjugate Prediction approach estimates the likelihood that a prediction will be in error and, when that likelihood is high, refuses to go along with that prediction. Unlike other approaches, we can probabilistically guarantee an error rate on the predictions we do make (denoted the "decisive predictions"). Empirically, on seven diverse data sets from genomics, ecology, image recognition, and gaming, our method can probabilistically guarantee to reduce the error rate to 1/4 of that of the state-of-the-art machine learning algorithm, at a cost of between 11% and 58% refusals. Competing state-of-the-art methods refuse at roughly twice our rate (sometimes refusing all suggested predictions).
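The abstract's core idea, refusing to predict when the estimated error likelihood is high, can be illustrated with a simple confidence-threshold scheme. This is not Shasha's Conjugate Prediction method itself, just a generic selective-prediction sketch: a threshold is chosen on a held-out calibration set so that the error rate among decisive (non-refused) predictions stays below a target. Note that the guarantee shown here holds only on the calibration data, whereas the talk's method gives a probabilistic guarantee on future predictions.

```python
# Generic selective prediction (NOT the Conjugate Prediction algorithm):
# pick the lowest confidence threshold whose decisive error rate on the
# calibration set meets the target, then refuse anything below it.

def calibrate_threshold(confidences, correct, target_error):
    for thr in sorted(set(confidences)):
        decisive = [ok for c, ok in zip(confidences, correct) if c >= thr]
        if decisive and 1 - sum(decisive) / len(decisive) <= target_error:
            return thr
    return float("inf")  # no threshold works: refuse everything

def predict_or_refuse(confidence, threshold):
    return "predict" if confidence >= threshold else "refuse"

# Calibration set: only the 0.5-confidence prediction was wrong.
thr = calibrate_threshold([0.9, 0.8, 0.6, 0.5],
                          [True, True, True, False],
                          target_error=0.10)
```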

Permanent link to this article: https://team.inria.fr/zenith/ibc-seminar-dennis-shasha-reducing-errors-by-refusing-to-guess-occasionally-1-june-2017/

Apr 26

IBC seminar: Fabio Porto “Simulation Data Management” 1 June 2017

IBC Seminar
Thursday 1 June 2017, 14h, room 1/124, bat. 5 Campus Saint Priest, Montpellier

Simulation Data Management

Fabio Porto

LNCC, Rio de Janeiro, Brazil

Numerical simulation has attracted interest from areas ranging from engineering and biology to astronomy. By using simulations, scientists can analyse the behaviour of hard-to-observe phenomena, and practitioners can test techniques before engaging in dangerous actions such as brain surgery. Simulations are CPU-intensive applications, normally running on supercomputers such as the Santos Dumont machine at LNCC. They also produce a huge amount of data, distributed in hundreds of files and including different structures: a domain discretisation mesh, field values and domain topology information. These data must be integrated and structured in a way that scientists can easily and efficiently interpret the simulation outcome, using analytical queries or scientific visualization tools. In this talk, we will present the main challenges involved in managing simulation data and highlight recent results in this area.

Permanent link to this article: https://team.inria.fr/zenith/ibc-seminar-fabio-porto-simulation-data-management-1-june-2017/

Feb 15

IBC seminar: Tamer Özsu “Approaches to RDF Data Management and SPARQL Query Processing” 9 March 2017

IBC Seminar
Thursday 9 March 2017, 14h, room 3/124, bat. 5 Campus Saint Priest, Montpellier

Approaches to RDF Data Management and SPARQL Query Processing
M. Tamer Özsu
University of Waterloo, Canada
Resource Description Framework (RDF) was originally proposed by the World Wide Web Consortium (W3C) for modeling Web objects as part of developing the “semantic web”. It has found uses in other areas such as the management of biological data (e.g., UniProt) and web data integration (through Linked Open Data). W3C has also proposed SPARQL as the query language for accessing RDF data repositories. Given the growing size of RDF data sets, and the existence of a declarative query language, the topic is ripe for the application of state-of-the-art data management techniques. In this talk, I will discuss the various approaches to RDF data management and SPARQL query processing, including relational techniques, graph approaches, distributed RDF data management, and Linked Open Data querying, using examples from a number of domains including biological data sets, and popular movie databases.
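The core of SPARQL evaluation is matching triple patterns containing variables against a set of (subject, predicate, object) triples. The toy matcher below illustrates this for a single pattern; real engines join many patterns, use indexes and optimize, and the example triples are invented (loosely styled after UniProt identifiers mentioned in the abstract).

```python
# Toy triple-pattern matcher illustrating the heart of SPARQL evaluation.
# Pattern components starting with '?' are variables; others are constants.

triples = {
    ("uniprot:P12345", "rdf:type", "up:Protein"),
    ("uniprot:P12345", "up:organism", "taxon:9606"),
    ("uniprot:Q67890", "rdf:type", "up:Protein"),
}

def match(pattern, triples):
    """Return one variable-binding dict per matching triple."""
    results = []
    for t in triples:
        binding = {}
        for p, v in zip(pattern, t):
            if p.startswith("?"):
                binding[p] = v
            elif p != v:
                break        # constant mismatch: not a match
        else:
            results.append(binding)
    return results

# SELECT ?s WHERE { ?s rdf:type up:Protein }
rows = match(("?s", "rdf:type", "up:Protein"), triples)
```

The approaches the talk surveys differ mainly in how they store the triples (relational tables, graphs, distributed partitions) and how they evaluate joins of many such patterns at scale.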

M. Tamer Özsu is Professor of Computer Science at the David R. Cheriton School of Computer Science of the University of Waterloo. His research is in data management focusing on large-scale data distribution and management of non-traditional data. He is a Fellow of the Royal Society of Canada, of the Association for Computing Machinery (ACM), and of the Institute of Electrical and Electronics Engineers (IEEE). He is an elected member of the Science Academy of Turkey, and member of Sigma Xi and American Association for the Advancement of Science (AAAS).

Permanent link to this article: https://team.inria.fr/zenith/ibc-seminar-tamer-ozsu-approaches-to-rdf-data-management-and-sparql-query-processing-9-march-2017/

Jan 23

Zenith seminar: Fabio Porto “Database System Support of Simulation Data” 27 jan. 2017

Zenith seminar: Friday 27 January 2017, 10h30-12h, room 2/124, bat. 5, Campus Saint Priest

Database System Support of Simulation Data
Fabio Porto
LNCC, Rio de Janeiro, Brazil

Supported by increasingly efficient HPC infrastructure, numerical simulations are rapidly expanding to fields such as oil and gas, medicine and meteorology. As simulations become more precise and cover longer periods of time, they may produce files with terabytes of data that need to be efficiently analyzed. In this work, we investigate techniques for managing such data using an array DBMS. We take advantage of multidimensional arrays, which nicely model the dimensions and variables used in numerical simulations. However, a naive approach to mapping simulation data files may lead to sparse arrays, impacting query response time, in particular when the simulation uses irregular meshes to model its physical domain. We propose efficient techniques to map coordinate values in numerical simulations to evenly distributed cells in array chunks, using equi-depth histograms and space-filling curves. We implemented our techniques in SciDB and, through experiments on real-world data, compared them with two other approaches: a row-store and a column-store DBMS. The results indicate that multidimensional arrays and column-stores are much faster than a traditional row-store system for queries over a large amount of simulation data. They also help identify the scenarios where array DBMSs are most efficient, and those where they are outperformed by column-stores.
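The equi-depth idea mentioned in the abstract can be illustrated in a few lines: bin boundaries are chosen so that each bin receives the same number of coordinate values, which keeps array cells evenly filled even when the mesh is irregular. This is an illustrative sketch, not the SciDB implementation, and all names are invented.

```python
# Illustrative equi-depth binning: each bin gets ~len(values)/n_bins points,
# so a dense region of the mesh is split into many narrow bins and a sparse
# region into a few wide ones.

def equi_depth_bins(values, n_bins):
    """Cut points giving n_bins bins with roughly equal point counts."""
    ordered = sorted(values)
    step = len(ordered) / n_bins
    return [ordered[int(i * step)] for i in range(1, n_bins)]

def bin_index(value, boundaries):
    """Index of the bin that `value` falls into."""
    for i, b in enumerate(boundaries):
        if value < b:
            return i
    return len(boundaries)

# Irregular coordinates: dense near zero, sparse beyond.
coords = [0.01, 0.02, 0.03, 0.05, 0.07, 0.1, 0.5, 2.0]
bounds = equi_depth_bins(coords, 4)  # 3 cut points, 4 equally full bins
```

A uniform (equi-width) split of the same range would dump almost all points into the first bin; the equi-depth boundaries instead adapt to the data distribution, which is what keeps array chunks dense.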

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-fabio-porto-database-system-support-of-simulation-data-27-jan-2017/

Dec 03

Zenith seminar: Pierre Bourhis “A Formal Study of Collaborative Access Control in Distributed Datalog” 2 dec. 2016

Zenith seminar: Friday 2 Dec. 2016, 10h30, room 3/124, bat. 5

A Formal Study of Collaborative Access Control in Distributed Datalog
Pierre Bourhis
CNRS et Inria Lille

We formalize and study a declaratively specified collaborative access control mechanism for data dissemination in a distributed environment. Data dissemination is specified using distributed datalog. Access control is also defined by datalog-style rules, at the relation level for extensional relations, and at the tuple level for intensional ones, based on the derivation of tuples. The model also includes a mechanism for “declassifying” data, that allows circumventing overly restrictive access control. We consider the complexity of determining whether a peer is allowed to access a given fact, and address the problem of achieving the goal of disseminating certain information under some access control policy. We also investigate the problem of information leakage, which occurs when a peer is able to infer facts to which the peer is not allowed access by the policy. Finally, we consider access control extended to facts equipped with provenance information, motivated by the many applications where such information is required. We provide semantics for access control with provenance, and establish the complexity of determining whether a peer may access a given fact together with its provenance. This work is motivated by the access control of the Webdamlog system, whose core features it formalizes.
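The tuple-level rule described above, where access to a derived (intensional) fact depends on the facts used in its derivation, can be rendered as a toy check, with declassification omitted. Peers, relations and facts here are all invented for illustration.

```python
# Toy rendering of derivation-based access control: a peer may access a
# derived fact only if it may access every fact in its derivation.

# Relation-level policy for extensional relations.
relation_acl = {"alice": {"R"}, "bob": {"R", "S"}}

# Derivation record: derived (intensional) fact -> facts it came from.
derived_from = {("T", 1): [("R", 1), ("S", 1)]}

def may_access(peer, fact):
    if fact in derived_from:  # intensional: check the whole derivation
        return all(may_access(peer, f) for f in derived_from[fact])
    return fact[0] in relation_acl.get(peer, set())
```

Declassification, in the paper's model, is precisely a rule that grants access to a derived fact even when this recursive check would deny it.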

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-pierre-bourhis-a-formal-study-of-collaborative-access-control-in-distributed-catalog/

Nov 06

Zenith seminar: Teresa Branch-Smith “An Introduction to Philosophy and a closer look at Philosophy of Science” 7 nov. 2016

Zenith seminar
Monday 7 November 2016, 15h30
Room 2/124, Bat. 5, Campus Saint Priest

An Introduction to Philosophy and a closer look at Philosophy of Science
Teresa Branch-Smith
University of Waterloo, Canada

Philosophy of science is a major sub-discipline of philosophy that considers the methods, laws, and implications of science. While it is now an entirely separate field of the academy, with its own journals, conferences and highly specialized degrees, historically scientists were themselves also philosophers (known as natural philosophers). To make philosophy as a discipline more accessible for our discussions, this talk will first go over some basic themes in philosophy and how the philosophy of science is situated within the discipline. The discussion will then narrow to major themes in the philosophy of science. The aim of this talk is to introduce the team to topics in philosophy that are relevant to their work, and to prime us for subsequent discussions about the implications of big data analytics specifically. Finally, I will outline a general plan for what the philosophy of big data project might look like in the coming months and how Zenith’s expertise will be sourced, highlighted and improved.

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-teresa-branch-smith-an-introduction-to-philosophy-and-a-closer-look-at-philosophy-of-science-7-nov-2016/

Oct 18

Floris’Tic Day, Agropolis, Montpellier, 8 nov. 2016

On 8 November, from 9:00 to 16:00, the first Floris’Tic day will take place at Agropolis International, Montpellier! Come meet us and discover the project!


The Floris’Tic project

Floris’TIC is a project that provides an innovative technological platform for promoting the plant sciences, including mobile field tools, collaborative tools, educational tools and serious games, training sessions, participatory workshops, and a multidisciplinary team to support you in your own projects!

Floris’TIC brings together the numerous and complementary skills of Agropolis Fondation, Tela Botanica, CIRAD, INRIA, IRD, INRA, CNRS and the University of Montpellier, with the support of the “Programme Investissements d’Avenir”.

> Find out more on the floristic.org web site

Floris’Tic holds its first open day

Join us on 8 November at Agropolis International (Montpellier).

This day will allow you to:
-  discover the Floris’Tic products (Pl@ntNet, Smart’Flore, The Plant Game, MOOC Botanique) and test them in a walk-through forum;
-  understand how to use these tools to carry out projects, through testimonials from practitioners;
-  start new collaborations through discussions and networking workshops.

We hope that the detailed program of the event and the project brochure will make you want to come and talk with us on this occasion!

Participation is free, but registration is required.

Permanent link to this article: https://team.inria.fr/zenith/journee-floristic/

Sep 27

Zenith engages in the new H2020 project CloudDbAppliance, 1 dec. 2016

The CloudDbAppliance project is a European H2020 project (2016-2019, 3 years) led by Bull/Atos, with Inria Zenith, U. Madrid, INESC and the companies LeanXcale, QuartetFS, Nordea, BTO, H3G, IKEA, CloudBiz and Singular Logic. The objective is to build the European Cloud In-Memory Database Appliance with Predictable Performance for Critical Applications.


Permanent link to this article: https://team.inria.fr/zenith/zenith-engages-in-a-new-h2020-project-clouddbappliance/

Sep 20

MUSIC seminar: Esther Pacitti “Experience on Data Management Techniques for Scientific Applications” 8 dec. 2016

Experience on Data Management Techniques for Scientific Applications

Esther Pacitti
Inria and LIRMM, University of Montpellier

8 December 2016, 14h
Fundação Getulio Vargas, Rio de Janeiro


Modern sciences such as agronomy, bioinformatics and environmental science must deal with overwhelming amounts of experimental data. Such data must be processed (cleaned, transformed, analyzed) in all kinds of ways in order to draw new conclusions, prove scientific theories and produce knowledge. In this talk, I will present some experience with specific data management techniques (recommendation, crowdsourcing, data privacy, etc.) used in the context of citizen science and Internet of Things applications.

Permanent link to this article: https://team.inria.fr/zenith/music-seminar-esther-pacitti-experience-on-data-management-techniques-for-scientific-applications/