PhD position: “A Data-Centric Execution Model for Scientific Workflows”


Advisor: Didier Parigot

The Zenith team deals with the management of scientific applications that are computation-intensive and manipulate large amounts of data. These applications are often represented by workflows, which describe sequences of tasks (computations) and the data dependencies between these tasks. Several scientific workflow environments have already been proposed [taylor07]. However, they provide little support for efficiently managing large data sets. The Zenith team develops an original approach that deals with such large data sets in a way that allows efficient placement of both tasks and data on large-scale (distributed and parallel) infrastructures, for more efficient execution. To this end, we propose an original solution that combines the advantages of cloud computing and P2P technologies. This work is part of the IBC project (Institut de Biologie Computationnelle), in collaboration with biologists, in particular from CIRAD and IRD, and with cloud providers.

The concept of cloud computing combines several technology advances such as Service-Oriented Architectures, resource virtualization, and novel data management systems referred to as NoSQL. These technologies enable flexible and extensible usage of resources, which is referred to as elasticity. In addition, the cloud allows users to simply outsource data storage and application execution. For the manipulation of big data, NoSQL database systems such as Google Bigtable, Hadoop HBase, Amazon Dynamo, Apache Cassandra, and 10gen MongoDB have recently been proposed.

Existing scientific workflow environments [taylor07] have been developed primarily to simplify the design and execution of a set of tasks on a particular infrastructure. For example, in the field of biology, the Galaxy environment allows users to introduce catalogs of functions/tasks and compose them with existing functions in order to build a workflow.
These environments propose a design approach that we can classify as "process-oriented", where information about data dependencies (data flow) is purely syntactic. In addition, the targeted execution infrastructures are mostly computation-oriented, like clusters and grids. Finally, the data produced by scientific workflows are often stored in loosely structured files for further analysis. Thus, data management is fairly basic, with data either stored on a centralized disk or directly transferred between tasks. This approach is not suitable for data-intensive applications, because data management becomes the major bottleneck in terms of data transfers.

As part of a new project that develops a middleware for scientific workflows (SciFloware), the objective of this thesis is to design a declarative, data-centric language for expressing scientific workflows, together with its associated execution model. A declarative language is important to enable automatic optimization and parallelization [ogasawara11]. The execution model for this language will be decentralized, in order to yield flexible execution in distributed and parallel environments, and it will capitalize on execution models developed in the context of distributed and parallel database systems [valduriez11]. To validate this work, a prototype will be implemented using the SON middleware [parigot12] and a distributed file system such as HDFS.

As for application fields, this work will be carried out in close relationship with the Virtual Plants team, which develops computational models of plant development to understand the physical and biological principles that drive the development of plant branching systems and organs. In particular, OpenAlea [pradal08] is a software platform for plant analysis and modelling at different scales. It provides a scientific workflow environment to integrate different tasks for plant reconstruction, analysis, simulation and visualisation at the tissue level [lucas13] and at the plant level [boudon12]. One challenging application, in both biology and computer science, is to process and analyse data collected on high-throughput phenotyping platforms. The SciFloware middleware, combined with OpenAlea, will improve the capability of the plant science community to analyse, at high throughput, variables that are hardly accessible in the field, such as architecture, the response of organ growth to environmental conditions, or radiation use efficiency. This will improve the ability of this community to model the genetic variability of plant responses to environmental cues associated with climate change.
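To give a feel for the "data-centric" idea, the sketch below shows what a workflow expressed as an algebra over data sets might look like, in the spirit of [ogasawara11]. The API and class names are hypothetical, purely for illustration: because the workflow is built as a declarative operator plan rather than executed step by step, an engine is free to reorder, parallelize, or distribute the operators before running them.

```python
# Illustrative sketch (hypothetical API): a workflow as a lazy operator plan
# over a data set, rather than an imperative sequence of task invocations.

class Dataset:
    """A logical collection of data items; operators only extend a plan."""
    def __init__(self, items, plan=None):
        self.items = items
        self.plan = plan or []

    def map(self, fn):            # declare a computation task on every item
        return Dataset(self.items, self.plan + [("map", fn)])

    def filter(self, pred):       # declare a selection of items
        return Dataset(self.items, self.plan + [("filter", pred)])

    def execute(self):
        # A trivial sequential evaluator; a real engine would instead
        # optimize the plan and place operators on distributed resources.
        data = list(self.items)
        for op, fn in self.plan:
            if op == "map":
                data = [fn(x) for x in data]
            elif op == "filter":
                data = [x for x in data if fn(x)]
        return data

# Declarative pipeline: nothing runs until execute() is called.
sequences = Dataset(range(10))
result = (sequences
          .filter(lambda s: s % 2 == 0)   # select a subset of the data
          .map(lambda s: s * s)           # apply an analysis task
          .execute())
print(result)  # [0, 4, 16, 36, 64]
```

The point of the sketch is the separation between the workflow description (the operator plan) and its evaluation, which is what makes automatic optimization and decentralized execution possible.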


References:

[ogasawara11] E. Ogasawara, J. F. Dias, D. de Oliveira, F. Porto, P. Valduriez, M. Mattoso. An Algebraic Approach for Data-centric Scientific Workflows. Proceedings of the VLDB Endowment (PVLDB), 4(12): 1328-1339, 2011.

[valduriez11] M. T. Özsu, P. Valduriez. Principles of Distributed Database Systems. Third Edition, Springer, 2011.

[taylor07] I. J. Taylor, E. Deelman, D. B. Gannon, M. Shields. Workflows for e-Science: Scientific Workflows for Grids. First Edition, Springer, 2007.

[parigot12] Ait-Lahcen, D. Parigot. A Lightweight Middleware for Developing P2P Applications with Component and Service-Based Principles. 15th IEEE International Conference on Computational Science and Engineering, 2012.

[pradal08] C. Pradal, S. Dufour-Kowalski, F. Boudon, C. Fournier, C. Godin. OpenAlea: A Visual Programming and Component-Based Software Platform for Plant Modeling. Functional Plant Biology, 2008.

[lucas13] M. Lucas et al. Lateral Root Morphogenesis Is Dependent on the Mechanical Properties of the Overlaying Tissues. Proceedings of the National Academy of Sciences, 110(13): 5229-5234, 2013.

[boudon12] F. Boudon, C. Pradal, T. Cokelaer, P. Prusinkiewicz, C. Godin. L-py: An L-System Simulation Framework for Modeling Plant Architecture Development Based on a Dynamic Language. Frontiers in Plant Science, 3, 2012.

Contact: Didier Parigot. Apply online.
