PhD offer: Multisite Management of Data-intensive Scientific Workflows in the Cloud

Directors: Esther Pacitti (University Montpellier 2), Marta Mattoso (UFRJ) and Patrick Valduriez (Inria)
Funding: The joint Microsoft-Inria Research Center
Gross salary: 1957 euros/month (36 months)

This work is part of a new project on advanced data storage and processing for cloud workflows (2013-2017) funded by Microsoft Research, in collaboration with the Kerdata Inria team. It will be conducted within the Institut de Biologie Computationnelle in Montpellier.

Scientific workflows allow scientists to easily express multi-step computational tasks, for instance loading input data files, preprocessing the data, running various analyses, and aggregating the results. A scientific workflow describes the dependencies between tasks, typically as a Directed Acyclic Graph (DAG) where the nodes are tasks (which can call programs) and the edges express the task dependencies. As scientific workflows increasingly have to deal with big data, it becomes critical to process them in high-performance computing environments such as clusters or clouds. Some scientific workflow systems, such as Pegasus and Swift, provide parallel support, but through an imperative language, which forces optimization and parallelization to be hardcoded.
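As a minimal illustration of the DAG view described above, the sketch below encodes a small workflow and groups its tasks into "waves" that could run in parallel. The task names and the `execution_waves` helper are hypothetical, chosen only for this example; they are not taken from Pegasus, Swift or Chiron. The sketch relies on the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "load": set(),
    "preprocess": {"load"},
    "analysis_a": {"preprocess"},
    "analysis_b": {"preprocess"},
    "aggregate": {"analysis_a", "analysis_b"},
}

def execution_waves(dag):
    """Group tasks into waves; tasks in the same wave do not depend on each other."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = execution_waves(workflow)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

Here the two analysis tasks land in the same wave, which is the kind of parallelism a workflow engine can exploit without the user hardcoding it.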

To be amenable to automatic optimization and parallel processing, the specification of a workflow should be high-level. We recently proposed an algebraic approach for the optimization and parallelization of data-intensive scientific workflows [1]. This approach is based on a workflow algebra with powerful operators such as Filter, Map and Reduce, a set of algebraic transformation rules as a basis for optimization, and a parallel execution model. It has been implemented in Chiron [2] in a cluster environment.
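To give a feel for why an algebraic specification helps optimization, here is a simplified sketch in which relations are lists of dictionaries and the three operators are plain Python functions. These definitions are assumptions made for illustration; they are not the operator semantics of [1] nor the Chiron API, and the activities `normalize`, `keep_large` and `total_per_site` are hypothetical. The example applies one plausible transformation rule: a Filter whose predicate only reads attributes that the Map activity leaves unchanged can be moved before the Map, so the (possibly expensive) activity runs on fewer tuples.

```python
def Map(activity, relation):
    """Apply an activity to every tuple: one output tuple per input tuple."""
    return [activity(t) for t in relation]

def Filter(predicate, relation):
    """Keep only the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]

def Reduce(key, activity, relation):
    """Group tuples by a key attribute and fold each group into one tuple."""
    groups = {}
    for t in relation:
        groups.setdefault(t[key], []).append(t)
    return [activity(k, group) for k, group in groups.items()]

calls = {"normalize": 0}  # counts invocations of the costly activity

def normalize(t):
    calls["normalize"] += 1
    return {**t, "norm": t["value"] / 100}  # "value" is passed through unchanged

def keep_large(t):
    return t["value"] >= 20  # reads only "value"

def total_per_site(site, group):
    return {"site": site, "total": sum(t["norm"] for t in group)}

samples = [
    {"site": "A", "value": 10},
    {"site": "A", "value": 30},
    {"site": "B", "value": 50},
    {"site": "B", "value": 5},
]

# Plan as written by the user: Map, then Filter, then Reduce.
naive = Reduce("site", total_per_site, Filter(keep_large, Map(normalize, samples)))
naive_calls = calls["normalize"]

# Rewritten plan: the Filter is pushed before the Map.
calls["normalize"] = 0
optimized = Reduce("site", total_per_site, Map(normalize, Filter(keep_large, samples)))
optimized_calls = calls["normalize"]

print(naive == optimized, naive_calls, optimized_calls)
```

Both plans produce the same relation, but the rewritten one invokes `normalize` only on the tuples that survive the filter. With an imperative specification, this reordering would have to be done by hand.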

In this thesis, we consider the problem of managing algebraic workflows so that they run efficiently in a multisite cloud environment, where each site has its own cluster, data and programs. Such an environment is well suited to scientific communities, with groups and labs located at geographically dispersed sites. The problem resembles multisite query processing in distributed and parallel database systems [3,4], and we plan to develop similar techniques for workflow decomposition, optimization and parallelization, dynamic task allocation, and efficient management of the intermediate data to be exchanged between sites. These techniques will be validated by a prototype implemented using the BlobSeer distributed storage system [5] on Microsoft Azure.
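To make the task allocation problem concrete, the sketch below shows one naive baseline: place each task at the site that already holds the largest share of its input data, then count the bytes that still have to cross sites. This is only an illustrative heuristic under simplifying assumptions (static placement, every task has at least one input, no load balancing), not the technique the thesis is expected to develop. All names, sites and sizes are hypothetical.

```python
def allocate(tasks, data_location, data_size):
    """Assign each task to the site holding the most bytes of its inputs.

    Assumes every task has at least one input; ties go to the
    alphabetically first site so the result is deterministic.
    """
    placement = {}
    for task, inputs in tasks.items():
        bytes_per_site = {}
        for name in inputs:
            site = data_location[name]
            bytes_per_site[site] = bytes_per_site.get(site, 0) + data_size[name]
        placement[task] = max(sorted(bytes_per_site), key=bytes_per_site.get)
    return placement

def transferred_bytes(tasks, placement, data_location, data_size):
    """Total size of the inputs that must be shipped to another site."""
    return sum(
        data_size[name]
        for task, inputs in tasks.items()
        for name in inputs
        if data_location[name] != placement[task]
    )

# Hypothetical two-site setup.
tasks = {
    "align": ["reads_1", "reference"],
    "summarize": ["reads_2", "reference"],
}
data_location = {"reads_1": "site_a", "reads_2": "site_b", "reference": "site_b"}
data_size = {"reads_1": 500, "reads_2": 200, "reference": 50}

placement = allocate(tasks, data_location, data_size)
moved = transferred_bytes(tasks, placement, data_location, data_size)
print(placement, moved)
```

In this toy setup only the small reference file has to move. A realistic allocator would also have to account for site load, intermediate data produced at run time, and network cost between sites, which is where the dynamic allocation mentioned above comes in.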

Note: a second PhD position related to the joint project is available in the Kerdata team.


[1] E. Ogasawara, J. F. Dias, D. de Oliveira, F. Porto, P. Valduriez, M. Mattoso. An Algebraic Approach for Data-centric Scientific Workflows. In Proceedings of the VLDB Endowment (PVLDB), 4(12): 1328-1339, 2011.

[2] E. Ogasawara, J. Dias, V. Silva, F. Chirigati, D. de Oliveira, F. Porto, P. Valduriez, M. Mattoso. Chiron: A Parallel Engine for Algebraic Scientific Workflows. Concurrency and Computation: Practice and Experience, 2013.

[3] M. T. Özsu, P. Valduriez. Principles of Distributed Database Systems. Third Edition, Springer, ISBN 978-1-4419-8833-1, 2011.

[4] E. Pacitti, R. Akbarinia, M. El Dick. P2P Techniques for Decentralized Applications. Synthesis Lectures on Data Management, Morgan & Claypool Publishers, 2012.

[5] B. Nicolae, G. Antoniu, L. Bougé, D. Moise, A. Carpen-Amarie. BlobSeer: Next Generation Data Management for Large Scale Infrastructures. Journal of Parallel and Distributed Computing, 71(2): 168-184, 2011.


Required skills:
  • Distributed programming, distributed and parallel data management, and programming languages such as C++ and Java.
  • Fluent English (internship stays at MSR Redmond, USA, are planned).
