1. Multisite cloud architecture
In [Liu 2015], we define a multisite cloud architecture composed of traditional clouds, e.g., pay-per-use cloud services such as Amazon EC2 and Microsoft Azure, private data centers, e.g., the cloud of a scientific organization like Inria, COPPE or LNCC, and client desktop machines with authorized access to the data centers. We model this architecture as a distributed system on the Internet, each site having its own computer cluster, data and programs. An important requirement is to provide distribution transparency for advanced services (e.g., workflow management, data analysis), to ease their scalability and elasticity. Current solutions for multisite clouds typically rely on application-specific overlays that map the output of one task at a site to the input of another in a pipeline fashion. Instead, we define fully distributed services for data storage, intersite data movement and task scheduling.
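The distribution-transparency requirement can be illustrated with a toy Python sketch (all class and method names here are hypothetical illustrations, not the actual services of [Liu 2015]): a caller names a program and a dataset, and the service locates the hosting sites and moves data between them.

```python
from dataclasses import dataclass, field

@dataclass
class Site:
    """One cloud site: its own cluster, data and programs."""
    name: str
    data: dict = field(default_factory=dict)      # dataset name -> payload
    programs: dict = field(default_factory=dict)  # program name -> callable

class MultisiteCloud:
    """Distribution-transparent facade: callers name a program and a
    dataset without knowing which site hosts either one."""
    def __init__(self, sites):
        self.sites = {s.name: s for s in sites}

    def run(self, program, dataset):
        src = next(s for s in self.sites.values() if dataset in s.data)
        dst = next(s for s in self.sites.values() if program in s.programs)
        if src is not dst:                         # intersite data movement
            dst.data[dataset] = src.data[dataset]
        return dst.programs[program](dst.data[dataset])

azure = Site("Azure", data={"catalog": [1, 2, 3]})
inria = Site("Inria", programs={"sum": sum})
cloud = MultisiteCloud([azure, inria])
result = cloud.run("sum", "catalog")  # data moves from Azure to Inria
```

The point of the sketch is only that the client never names a site; the real services additionally handle distributed storage and task scheduling.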
2. Big data integration
In astronomy surveys, collected data are published through catalogues of stars. A common problem is how to integrate the distributed catalogues, taking into account the ambiguity in object identification across catalogues. We have developed NACluster [Freire 2014], a scalable algorithm to match objects across catalogues. The main focus is scaling up NACluster to run in parallel, by adopting a MapReduce execution strategy. The work involves a data partitioning strategy that considers the different density regions of the sky, as well as the sky objects lying close to spatial frontiers, since partitioning could otherwise split clusters that straddle a frontier. Within each partition, a multidimensional spatial index supports cluster centroid searching. We developed a parallel version of NACluster, implemented on the Spark framework and integrated into a workflow that partitions the data, using the France algorithm, and loads the partitions into an HDFS cluster, thus enabling the parallel execution of NACluster over the partitioned data.
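The partition-then-match idea can be sketched in plain Python (a toy simplification, not the actual NACluster/Spark implementation; the greedy threshold matching and the frontier-replication margin below are illustrative assumptions):

```python
from math import hypot

def partition_sky(objects, boundaries, margin=0.1):
    """Split (ra, dec) objects into bands along the first coordinate;
    objects within `margin` of a band boundary are replicated into the
    next partition so clusters straddling a frontier are not lost."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    edges = list(boundaries) + [float("inf")]
    for obj in objects:
        for i, b in enumerate(edges):
            if obj[0] < b:
                parts[i].append(obj)
                if b - obj[0] <= margin and i + 1 < len(parts):
                    parts[i + 1].append(obj)   # frontier replication
                break
    return parts

def match(partition, radius=0.2):
    """Greedy nearest-centroid matching: each object joins the closest
    centroid within `radius`, otherwise it starts a new cluster."""
    centroids = []                             # (x, y, member count)
    for x, y in partition:
        dists = [hypot(x - cx, y - cy) for cx, cy, _ in centroids]
        if dists and min(dists) <= radius:
            i = dists.index(min(dists))
            cx, cy, n = centroids[i]
            centroids[i] = ((cx * n + x) / (n + 1),
                            (cy * n + y) / (n + 1), n + 1)
        else:
            centroids.append((x, y, 1))
    return centroids
```

In the parallel version each partition would be matched independently (e.g., one Spark task per partition), with the replicated frontier objects allowing cross-frontier clusters to be reconciled afterwards.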
3. Scientific workflow algebra
Dynamic workflows are scientific workflows that support computational science simulations, typically using dynamic processes based on runtime scientific data analyses. They require the ability to adapt the workflow at runtime, based on user input and dynamic steering. Supporting data-centric iteration is an important step towards dynamic workflows because user interaction with workflows is iterative. However, current support for iteration in scientific workflows is static and does not allow changing data at runtime.
In [Dias 2014], we propose a solution based on algebraic operators and a dynamic execution model to enable workflow adaptation based on user input and dynamic steering. We introduce the concept of iteration lineage, which keeps provenance data management consistent with dynamic iterative workflow changes. Lineage enables scientists to interact with workflow data and configuration at runtime through an API that triggers steering. We evaluated our approach using a novel, real large-scale workflow for uncertainty quantification on a 640-core cluster. The results show execution time savings ranging from 2.5 to 24 days compared to non-iterative workflow execution, while the overhead introduced by our iterative model is less than 5% of execution time. Our steering algorithms are also very efficient, running in less than 1 millisecond in the worst case.
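The combination of iteration, lineage and steering can be sketched as follows (a minimal illustration of the concepts, not the algebra or API of [Dias 2014]; the function names and the numeric convergence test are assumptions):

```python
def run_iterative(task, state, steer=None, max_iter=100, tol=1e-6):
    """Run `task` iteratively, recording one lineage record per
    iteration (input, output, configuration) so every result stays
    traceable even when the user steers the run. `steer` may change
    the configuration between iterations, mimicking runtime steering."""
    lineage, config = [], {}
    for it in range(max_iter):
        new_state = task(state, config)
        lineage.append({"iteration": it, "input": state,
                        "output": new_state, "config": dict(config)})
        converged = abs(new_state - state) < tol
        state = new_state
        if converged:
            break
        if steer is not None:
            config = steer(it, state, config)
    return state, lineage

# A stand-in iterative computation: Newton's iteration for sqrt(2).
root, lineage = run_iterative(lambda x, cfg: (x + 2.0 / x) / 2.0, 1.0)
```

Because each lineage record carries the configuration that produced it, a steering change made mid-run does not break provenance: queries can still reconstruct which inputs and settings yielded which outputs.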
4. Processing scientific workflows in a multisite cloud
A scientific workflow often needs to be partitioned and executed in a multisite cloud environment. In [Liu 2014a, Liu 2014b], we propose a non-intrusive approach to execute scientific workflows in a multisite cloud, with three workflow partitioning techniques. We describe an experimental validation using an adaptation of the Chiron SWfMS for Microsoft Azure. The experimental results show that our partitioning techniques are efficient and indicate which technique performs best in each environment.
Based on the multisite cloud architecture defined in [Liu 2015], we investigate dynamic scheduling methods. The scheduling problem is to decide at which cloud site to execute each activity while achieving one or more objectives, such as reducing execution time or financial cost, or maximizing reliability. In [Liu 2016a, Liu 2016b], we address the scheduling problem with two objectives, execution time and monetary cost, and propose a novel activity scheduling approach for scientific workflow execution in a multisite cloud. We developed a system model for multisite workflows and validated our techniques using the SciEvol bioinformatics workflow in experiments on the Azure cloud with three sites.
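One common way to combine the two objectives is a weighted sum over normalized estimates; the sketch below illustrates that idea only (the site names, cost figures and the specific normalization are illustrative assumptions, not the scheduling algorithm of [Liu 2016a, Liu 2016b]):

```python
def best_site(est, w_time=0.5):
    """`est` maps site -> (exec_time, monetary_cost). Both objectives
    are normalized to [0, 1] over the candidate sites, then combined as
    w_time * time + (1 - w_time) * cost; the site with the smallest
    combined value wins. `w_time` expresses the user's trade-off."""
    def norm(values):
        lo, hi = min(values.values()), max(values.values())
        span = (hi - lo) or 1.0
        return {s: (v - lo) / span for s, v in values.items()}
    t = norm({s: e[0] for s, e in est.items()})
    c = norm({s: e[1] for s, e in est.items()})
    return min(est, key=lambda s: w_time * t[s] + (1 - w_time) * c[s])

# Hypothetical estimates: (execution time in min, cost in $).
est = {"west-eu": (120, 0.9), "east-us": (90, 1.4), "brazil": (150, 0.5)}
```

With `w_time=1.0` the fastest site wins, with `w_time=0.0` the cheapest, and intermediate weights can pick a balanced site that is neither.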
We developed a multisite version of Chiron, i.e., multisite Chiron, and proposed a multisite task scheduling algorithm that takes into account the time to generate provenance data [Liu 2016c]. We performed an extensive experimental evaluation of our algorithm on the Microsoft Azure multisite cloud with two real-life scientific workflows (Buzz and Montage). The results show that our scheduling algorithm is up to 49.6% better than baseline algorithms in terms of execution time.
5. Simulation data management
As numerical simulations become more precise and cover longer periods of time, they produce files with terabytes of data that need to be efficiently analyzed. In [Lustosa 2016], we investigate techniques for managing such data using an array DBMS. We propose efficient techniques to map coordinate values in numerical simulations to evenly distributed cells in array chunks with the use of equi-depth histograms and space-filling curves. We implemented our techniques in SciDB and, through experiments over real-world data, compared them with a row-store and a column-store DBMS. The results show that multidimensional arrays and column stores are much faster than a row store for queries over large amounts of simulation data. They also help identify the scenarios where array DBMSs are most efficient, and those where they are outperformed by column stores.
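The two building blocks can be sketched as follows (an illustration of the general techniques, not the SciDB mapping of [Lustosa 2016]): equi-depth boundaries put the same number of coordinate values in every cell even when the distribution is skewed, and a Z-order (Morton) key, one simple space-filling curve, keeps nearby cells in nearby chunks.

```python
from bisect import bisect_right

def equi_depth_bounds(values, n_bins):
    """Bin boundaries that put (approximately) the same number of
    values in every bin, however skewed the coordinates are."""
    srt = sorted(values)
    return [srt[len(srt) * i // n_bins] for i in range(1, n_bins)]

def to_cell(value, bounds):
    """Map a coordinate value to its cell index via the boundaries."""
    return bisect_right(bounds, value)

def morton(ix, iy, bits=16):
    """Interleave two cell indices into a Z-order key so that cells
    close in space get keys close on the curve."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b) | ((iy >> b) & 1) << (2 * b + 1)
    return key
```

On a skewed coordinate such as `i**2`, equal-width bins would crowd most values into the first cell, while equi-depth bins stay balanced.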
Most of the raw data files produced by numerical simulations follow a de facto standard format established by the application domain, e.g., FITS for astronomy. DBMSs are ill suited to such files because they require loading and structuring the raw data, which becomes very costly at large scale. Systems like NoDB, RAW and FastBit have been proposed to index and query raw data files without the overhead of a DBMS. However, they focus on analyzing a single large file rather than several related files. When related files are produced and required for analysis, the relationships among elements of the file contents must be managed manually, with specific programs to access the raw data, which is time-consuming and error-prone. When computer simulations are managed by a SWfMS, they can take advantage of provenance data to relate and analyze the raw data files produced during workflow execution. In [Silva 2016a-c], we propose a dataflow approach for analyzing element data from several related raw data files using a SWfMS. We validated our approach with the Montage workflow from astronomy and a workflow from the oil and gas domain as I/O-intensive case studies.
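The gain over manual cross-file bookkeeping can be illustrated with a toy in-memory index (class, method and file names below are hypothetical; this is not the dataflow model of [Silva 2016a-c]): each workflow activity registers which raw files it consumed and produced, and extracted element attributes are then queryable along the dataflow instead of by hand-written file-scanning programs.

```python
class DataflowIndex:
    """Toy stand-in for a provenance-backed dataflow index: relate
    elements of several raw files through the activities that link
    the files, so analyses can follow the dataflow."""
    def __init__(self):
        self.provenance = []   # (activity, input_file, output_file)
        self.elements = {}     # file -> list of extracted element dicts

    def register(self, activity, infile, outfile, extracted):
        self.provenance.append((activity, infile, outfile))
        self.elements[outfile] = extracted

    def downstream(self, infile):
        """All elements in files directly derived from `infile`."""
        return [e for a, i, o in self.provenance if i == infile
                for e in self.elements.get(o, [])]

idx = DataflowIndex()
# CRVAL1 is a standard FITS header keyword; file names are made up.
idx.register("mProject", "raw1.fits", "proj1.fits", [{"CRVAL1": 210.8}])
idx.register("mProject", "raw2.fits", "proj2.fits", [{"CRVAL1": 210.9}])
```

A real SWfMS would populate such an index automatically from provenance captured at execution time, and would index byte offsets into the raw files rather than fully loading them.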
6. Parallel Data Mining
A myriad of applications from different scientific domains collect time series data for further analysis. In many of them, such as seismic datasets, the observed data is also associated with a spatial dimension, yielding, in fact, spatial-time series. The analysis of these datasets is difficult due to both the continuous nature of the observed data and the relationship between the spatial and time dimensions. Meanwhile, sequential pattern mining techniques have been successfully used over large volumes of transactional data to obtain insights. In [Campisano 2016], we explore the discovery of frequent sequential patterns in seismic datasets. To do so, we discretize the continuous values into symbols and adapt a well-known sequential pattern mining algorithm to spatial-time datasets. To assess the quality of the identified patterns, we visualize them over the original seismic trace images. Our preliminary results indicate that this approach to sequence mining in seismic datasets is promising.
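The discretize-then-mine pipeline can be sketched in a few lines (a deliberately simplified illustration: equal-width symbolization and contiguous-subsequence counting stand in for the actual discretization and the adapted mining algorithm of [Campisano 2016], and the spatial dimension is ignored here):

```python
from collections import Counter

def discretize(series, symbols="abcd"):
    """Equal-width discretization of a continuous series into symbols
    (a simplification of SAX-style symbolization)."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / len(symbols) or 1.0
    return "".join(symbols[min(int((v - lo) / width), len(symbols) - 1)]
                   for v in series)

def frequent_subsequences(s, k, min_support):
    """Count every contiguous subsequence of length k in the symbol
    string and keep those occurring at least `min_support` times."""
    counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    return {seq: n for seq, n in counts.items() if n >= min_support}

symbols = discretize([0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0])
patterns = frequent_subsequences(symbols, k=2, min_support=2)
```

The repeated rising ramp in the input surfaces as the frequent symbol pairs "ab", "bc" and "cd"; in the seismic setting the mined patterns are then projected back onto the original trace images for visual inspection.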