A Decentralized Management Approach for Data-intensive Scientific Workflows

PhD thesis

Director: Patrick Valduriez, senior researcher at INRIA

Supervisor: Hinde Bouziane, associate professor at University Montpellier 2

The ZENITH team is interested in the management of scientific applications that are computation intensive and that manipulate large amounts of data. These applications are often represented by workflows, which describe sequences of tasks (computations) and data dependencies between these tasks. Several scientific workflow environments have already been proposed [3]. However, they offer no support for efficiently managing large data sets. Our team aims to develop an original approach that deals with such large data sets and enables the placement of both tasks and data on large-scale infrastructures for more efficient execution. To this end, we propose an original solution that combines the advantages of cloud computing and P2P technologies. This work will be part of the IBC project (Institut de Biologie Computationnelle – http://www.ibc-montpellier.fr) and will be done in collaboration with biologists, especially from CIRAD, and cloud providers, in particular Microsoft.
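To make the workflow model concrete, here is a minimal sketch of a workflow as a directed acyclic graph of tasks with data dependencies; the task and file names are hypothetical and only serve to illustrate the model.

```python
# Minimal sketch of a workflow as a DAG of tasks with data dependencies.
# Task names and data set names are hypothetical, for illustration only.

workflow = {
    # task: (input data sets, output data sets)
    "align":    (["reads.fastq", "genome.fa"],   ["alignments.bam"]),
    "filter":   (["alignments.bam"],             ["filtered.bam"]),
    "annotate": (["filtered.bam", "genes.gff"],  ["annotations.txt"]),
}

def ready_tasks(workflow, available):
    """Return the tasks whose input data sets are all available."""
    return [t for t, (inputs, _) in workflow.items()
            if all(d in available for d in inputs)]

# With only the raw inputs available, only "align" can run.
print(ready_tasks(workflow, {"reads.fastq", "genome.fa", "genes.gff"}))
```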

The recent concept of cloud computing mainly relies on the following technology advances: Service-Oriented Architectures, resource virtualization (computational and storage), and modern data management systems, referred to as NoSQL. These technologies enable a flexible and extensible (elastic) usage of resources that is inherent to the cloud. In addition, the cloud allows users to simply outsource data storage and application execution. For the manipulation of big data, NoSQL database systems such as Google Bigtable, Hadoop HBase, Amazon Dynamo, Apache Cassandra, and 10gen MongoDB have recently been proposed.
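As an illustration, the following sketch shows the hash-partitioned key-value access pattern that many of these NoSQL systems share; the in-memory dictionaries stand in for a distributed store and do not reflect the API of any particular system.

```python
# Sketch of the key-value access pattern common to many NoSQL systems.
# In-memory dicts stand in for distributed storage nodes; partitioning
# data by key hash is the shared idea, not any specific system's API.

import hashlib

NUM_PARTITIONS = 4
partitions = [dict() for _ in range(NUM_PARTITIONS)]

def partition_of(key: str) -> int:
    """Map a key to a partition, as a hash-partitioned store would."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def put(key: str, value: bytes) -> None:
    partitions[partition_of(key)][key] = value

def get(key: str) -> bytes:
    return partitions[partition_of(key)][key]

put("dataset:chunk-0", b"...")
print(partition_of("dataset:chunk-0"), get("dataset:chunk-0"))
```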

Existing scientific workflow environments [3] have been developed primarily to simplify the design and execution of a set of tasks on parallel and distributed infrastructures. For example, in the field of biology, the Galaxy environment allows users to introduce catalogs of functions/tasks and to compose these functions with other existing ones in order to build a workflow. These environments thus propose a design approach that we can classify as “process-oriented”, where the information about data dependencies (the data flow) is purely syntactic. In addition, the targeted execution infrastructures are mostly computation-oriented, like clusters and grids. Finally, the data analyzed and produced by a scientific workflow are often stored in loosely structured files, and only simple, classical mechanisms are used to manage them: they are either stored on a centralized disk or directly transferred between tasks. This approach is not suitable for data-centric applications (because of bottlenecks, costly data transfers, etc.).

The main goal of this thesis is to propose an original approach for partitioning and mapping data and tasks to distributed resources. This approach will be based on a declarative (data flow oriented) specification of a workflow, on recent NoSQL approaches for data distribution and management, and on a Service-Oriented Architecture for flexibility. Moreover, a dynamic approach is necessary to take into account the elasticity of the cloud. It will rely on a P2P architecture, in particular SON [4], currently under development in ZENITH.

A declarative specification has the advantage of expressing a composition of algebraic operators (map, reduce, split, filter, etc.) on the manipulated data, from which it is possible to automatically parallelize tasks and to optimize the placement of data and tasks. A first algebraic specification for distributed scientific workflows has been proposed by ZENITH [1]. This specification is close to the one used in distributed and parallel database management systems [2], which have been widely studied. We aim to rely on such systems to influence task placement (and scheduling) depending on the data placement managed by NoSQL systems. A minimal sketch of such a composition is given below.
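The following sketch shows what a declarative composition of such operators might look like in Python; the operator names follow the paragraph above, but the concrete algebra of [1] may differ.

```python
# Minimal sketch of a declarative, algebraic workflow specification.
# Operators follow the ones named above (map, split, filter, reduce);
# the concrete algebra of [1] may differ. Data sets are lists of records.

from functools import reduce as fold

def op_map(f):        return lambda data: [f(r) for r in data]
def op_filter(pred):  return lambda data: [r for r in data if pred(r)]
def op_split(n):      return lambda data: [data[i::n] for i in range(n)]
def op_reduce(f, z):  return lambda data: fold(f, data, z)

def compose(*ops):
    """Chain operators into a workflow: each output feeds the next operator."""
    def run(data):
        for op in ops:
            data = op(data)
        return data
    return run

# A tiny workflow: keep even values, square them, sum the result.
wf = compose(op_filter(lambda x: x % 2 == 0),
             op_map(lambda x: x * x),
             op_reduce(lambda a, b: a + b, 0))
print(wf(range(10)))  # 0 + 4 + 16 + 36 + 64 = 120
```

Because each operator declares how it consumes and produces data, an engine can rewrite the composition, e.g., running the map in parallel over the partitions produced by split.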

P2P architectures, for their part, have the benefit of being autonomous, dynamic, and scalable. The interest of this thesis is thus to propose placement algorithms that are decentralized and able to take decisions guided by observed changes within the execution environment, by data placement, and by performance requirements (e.g., a maximum execution time constraint). In this context, we can also rely on existing work on adaptation, e.g. [7].
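As an illustration, here is a sketch of the kind of locality-aware placement decision a node could take autonomously; the cost model, node names, and data sizes are hypothetical.

```python
# Sketch of a decentralized, locality-aware placement decision. A node
# scores candidate hosts for a task by how much of the task's input data
# they already hold, subject to a maximum execution time constraint.
# Node names, sizes, and the cost model are hypothetical.

def placement_score(task_inputs, host_data, host_load):
    """Prefer hosts holding more input bytes locally and less loaded."""
    local_bytes = sum(size for d, size in task_inputs.items()
                      if d in host_data)
    return local_bytes / (1.0 + host_load)

def choose_host(task_inputs, hosts, max_time):
    """Pick the best-scoring host whose estimated execution time meets the bound."""
    feasible = [(h, placement_score(task_inputs, info["data"], info["load"]))
                for h, info in hosts.items()
                if info["est_time"] <= max_time]
    return max(feasible, key=lambda hs: hs[1])[0] if feasible else None

task_inputs = {"chunk-0": 500, "chunk-1": 300}   # data set -> size (MB)
hosts = {
    "node-a": {"data": {"chunk-0"},            "load": 0.2, "est_time": 40},
    "node-b": {"data": {"chunk-0", "chunk-1"}, "load": 0.8, "est_time": 55},
}
print(choose_host(task_inputs, hosts, max_time=60))  # -> "node-b"
```

Each node can re-evaluate such a decision whenever it observes a change in data placement or load, which is what makes the approach dynamic.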

To validate this work, a prototype will be implemented using the SON middleware [4] and a distributed file system like HDFS (Hadoop Distributed File System). It will then be integrated into a workflow environment like Galaxy (used by CIRAD scientists and researchers, with whom we have collaborations). Experiments will be performed on the Grid'5000 experimental platform and on a cloud environment, in particular Microsoft Azure.
