In this project, we studied architectures (post-processing, in situ and in transit) and methods to combine simulation and scientific data analysis using Data-Intensive Scalable Computing (DISC). We addressed the following main steps of the data-intensive science process: (1) data preparation, including raw data ingestion and data cleaning, transformation and integration; (2) data processing and simulation execution; (3) exploratory data analysis and visualization; (4) data mining, knowledge discovery and recommendation. A general presentation of the SciDISC project and results was given at the LADaS workshop, held in conjunction with the VLDB 2018 conference in Rio de Janeiro [Valduriez 2018].
The main scientific results are:
1. SciDISC architecture. We defined a generic SciDISC architecture as a basis for developing distributed and parallel techniques to deal with scientific data. The architecture combines a high-performance computer with shared disk (e.g. to perform data processing and simulation) and a shared-nothing cluster to perform data analysis. The high-performance computer can be a supercomputer (e.g. the Bull supercomputer at LNCC) or a large cluster of compute nodes (e.g. Grid5000), which yields different cost-performance trade-offs to study. This architecture allows us to design generic techniques for data transfer, partitioning and replication, as a basis for parallel data analysis and fault tolerance in DISC [Silva 2017, Souza 2017a, 2017b].
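The partitioning and replication techniques on which this architecture relies can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the record layout, node count and replication factor below are all assumptions for the example.

```python
# Minimal sketch: hash-partition simulation output records across the
# nodes of a shared-nothing analysis cluster, then replicate each
# partition on neighboring nodes for fault tolerance. The record layout
# and node count are illustrative only.

def partition_records(records, num_nodes, key=lambda r: r[0]):
    """Assign each record to a node by hashing its partitioning key."""
    partitions = [[] for _ in range(num_nodes)]
    for rec in records:
        node = hash(key(rec)) % num_nodes
        partitions[node].append(rec)
    return partitions

def replicate(partitions, factor=2):
    """Copy each partition onto `factor` consecutive nodes."""
    n = len(partitions)
    replicas = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        for r in range(factor):
            replicas[(i + r) % n].extend(part)
    return replicas

# Example: records are (cell_id, value) pairs from one simulation step.
records = [(cell, cell * 0.5) for cell in range(1000)]
parts = partition_records(records, num_nodes=4)
reps = replicate(parts, factor=2)
```

With replication factor 2, losing any single analysis node leaves a full copy of its partition on a neighbor, which is the basis for fault tolerance in this setting.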
2. In situ / in transit data analysis. First, we proposed a general solution that integrates dataflow analysis and data visualization to perform in situ and in transit analysis of simulation data [Silva 2017, 2018a]. We implemented this solution in DfAnalyzer [Silva 2018b,c], a library of components that can be plugged directly into the simulation code of highly optimized parallel applications with negligible overhead, as validated in several real large-scale oil & gas applications. With the support of sophisticated online data analysis, scientists get a detailed view of the execution, providing insights to determine when and how to tune parameters [Camata 2018, Silva 2018c,d] or reduce data that does not need to be processed [Souza 2018]. These results led to improved online data analysis through the management of user steering actions [Souza 2019a,b]. Second, we developed SAVIME (Simulation and Visualization in-memory), a multidimensional array data model system for in transit analysis of simulation data in a shared-nothing cluster [Lustosa 2017, 2019]. SAVIME can interface with simulation code running on a supercomputer, enabling transparent and asynchronous data transfer to SAVIME. Its query capability includes complex matrix-based operations, such as aggregations, joins and filtering. Our experimental validation, compared with the NetCDF file format and the SciDB DBMS, shows major improvements in data ingestion time and data transfer. Finally, we integrated SAVIME with ParaView to support visualization of simulation output.
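The in situ monitoring idea, capturing provenance from inside the simulation loop so it can be queried while the run is still going, can be sketched as follows. This is a toy stand-in, not DfAnalyzer's actual API: the class and method names (`ProvenanceCapture`, `record_task`) are hypothetical.

```python
# Sketch of in situ dataflow monitoring in the style of DfAnalyzer:
# the simulation loop emits a provenance record for each iteration so
# the records can be queried online, e.g. for user steering. All names
# here are hypothetical, not DfAnalyzer's real interface.

class ProvenanceCapture:
    def __init__(self):
        self.tasks = []  # in-memory stand-in for the provenance store

    def record_task(self, name, inputs, outputs):
        self.tasks.append({"task": name, "in": inputs, "out": outputs})

    def query(self, predicate):
        """Online analysis: filter recorded tasks while the run continues."""
        return [t for t in self.tasks if predicate(t)]

prov = ProvenanceCapture()
residual = 1.0
for step in range(5):                     # stand-in for the solver loop
    residual *= 0.5                       # pretend numerical kernel
    prov.record_task("solver_step",
                     inputs={"step": step},
                     outputs={"residual": residual})

# A steering-style query: which steps already dropped below 0.2?
converged = prov.query(lambda t: t["out"]["residual"] < 0.2)
```

The point of the design is that queries like the last line run against data captured during execution, so a scientist can decide mid-run whether to tune a parameter or cut unneeded processing.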
3. Post-processing data analysis. We explored the use of two popular DISC systems, Spark and OpenAlea, for the efficient post-processing of big scientific data. In the context of seismic applications, the data correspond to points that represent a 3D soil cube. However, errors in signal processing and modeling create uncertainty, and thus a lack of accuracy in identifying geological or seismic phenomena. To analyze this uncertainty, the main solution is to compute a Probability Density Function (PDF) for each point in the spatial cube, which can be very time-consuming. In [Liu 2018a], we take advantage of Spark to efficiently compute PDFs in parallel, using two methods: data grouping and Machine Learning (ML) prediction. In the context of plant phenotyping, we addressed the problem of efficient workflow execution. Since it is common for users to reuse workflows or data generated by other workflows, a promising approach is to cache intermediate data and exploit it to avoid task re-execution. In [Heidsieck 2019, 2019a], we propose an adaptive caching solution for data-intensive workflows in the cloud. Our experimental validation using OpenAlea and the Phenome workflow shows major performance gains, up to 120.16% with 6 workflow re-executions. Finally, we worked on the analysis of geometrical patterns, i.e. sets of points with all pairwise distances specified, with applications to seismic, astronomy and transportation data. Finding geometric patterns is challenging because the number of candidate point sets that could compose a shape is exponential in the size of the dataset and the pattern. In [Porto 2018a, 2018b], we proposed efficient algorithms to find such patterns in large datasets using constellation queries.
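The data-grouping method for PDF computation can be illustrated with a small sketch: samples belonging to the same spatial point are grouped by key, and an empirical PDF (a normalized histogram) is computed per group. In [Liu 2018a] this runs in parallel with Spark; the sequential, pure-Python version below only illustrates the grouping step, and the bin count is an arbitrary choice for the example.

```python
# Sketch of the data-grouping idea for per-point PDF computation:
# group all samples of the same (x, y, z) point, then build an
# empirical PDF as a normalized histogram of each group.

from collections import defaultdict

def empirical_pdf(samples, bins=4):
    """Normalized histogram of a list of sample values."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins or 1.0       # guard against all-equal samples
    counts = [0] * bins
    for s in samples:
        idx = min(int((s - lo) / width), bins - 1)
        counts[idx] += 1
    total = len(samples)
    return [c / total for c in counts]

def pdfs_by_point(records):
    """records: iterable of ((x, y, z), sample_value) pairs."""
    groups = defaultdict(list)
    for point, value in records:          # the grouping ("groupByKey") step
        groups[point].append(value)
    return {p: empirical_pdf(v) for p, v in groups.items()}

records = [((0, 0, 0), v) for v in [1.0, 1.1, 1.9, 2.0]] + \
          [((0, 0, 1), v) for v in [3.0, 3.5, 4.0, 4.5]]
pdfs = pdfs_by_point(records)
```

In a Spark implementation each group's histogram is computed independently, which is exactly what makes the computation embarrassingly parallel across spatial points.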
4. Analysis of spatiotemporal data. We addressed an important new challenge for this domain: finding frequent sequences that are constrained in space and time and may not be frequent in the entire dataset. This leads to identifying the maximal time interval and space range where such sequences are frequent. Discovering these patterns, along with their constraints, can yield valuable knowledge that remains hidden with traditional methods. We proposed a new Spatio-Temporal Sequence Miner (STSM) algorithm to discover tight space-time sequences [Campisano 2018] and evaluated it on a proof-of-concept seismic use case. Compared with general spatial-time sequence mining algorithms (GSTSM), STSM enables new insights by detecting the maximal space-time areas where each pattern is frequent. In the area of trajectory analysis, we proposed an algorithm that computes space-time aggregations to turn trajectory data into analyses of permanent objects, identifying systemic patterns and anomalies over different regions. In the context of urban mobility data, we were able to identify systemic and specific anomalies in the urban transit of Rio de Janeiro [Cruz 2018]. We also developed a novel approach, named CSA, to discover and rank motifs in spatial-time series [Borges 2019]. CSA partitions the spatial-time series into blocks; inside each block, subsequences are combined so that a hash-based motif discovery algorithm can be applied. Motifs are validated against both temporal and spatial constraints, and then ranked by their entropy, number of occurrences, and the proximity of their occurrences. The approach was evaluated using both synthetic and seismic datasets. We also continued the investigation of spatial point patterns in large datasets [Porto 2018].
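The hash-based motif discovery step at the core of CSA can be sketched as follows. This is a heavily simplified illustration, assuming a fixed three-symbol discretization and ignoring the spatial validation and entropy-based ranking described above; the thresholds and subsequence length are arbitrary choices for the example.

```python
# Simplified sketch of hash-based motif discovery: subsequences inside
# a block are discretized into symbolic words and hashed, so identical
# words collapse into the same bucket; frequent buckets become motif
# candidates. Discretization and ranking are greatly simplified here.

from collections import Counter

def discretize(subseq, thresholds=(-0.5, 0.5)):
    """Map each value to a symbol: 'a' low, 'b' mid, 'c' high."""
    word = ""
    for v in subseq:
        if v < thresholds[0]:
            word += "a"
        elif v < thresholds[1]:
            word += "b"
        else:
            word += "c"
    return word

def motif_candidates(series, length=3, min_count=2):
    """Hash every discretized subsequence; keep the frequent buckets."""
    buckets = Counter()
    for i in range(len(series) - length + 1):
        buckets[discretize(series[i:i + length])] += 1
    return {w: c for w, c in buckets.items() if c >= min_count}

series = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
cands = motif_candidates(series)
```

In the full approach, candidates surviving this step are then checked against the spatial constraints of the block and ranked by entropy, occurrence count, and occurrence proximity.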
We developed algorithms that explore the combinatorial nature of candidate patterns, with solutions designed to be implemented in Apache Spark. The proposed algorithms involve three types of matches: pure, isotropic and non-isotropic, depending on the degree of flexibility allowed in matching a pattern. This work has been applied to finding gravitational lensing effects, a phenomenon in astronomy.
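The difference between the match types can be made concrete with a small sketch: a candidate point set matches a pattern purely when all pairwise distances agree within a tolerance, and isotropically when they agree up to a single uniform scale factor. The tolerance and the comparison by sorted distance lists are simplifications for this example, not the papers' exact definitions.

```python
# Sketch of pure vs. isotropic matching for geometric patterns:
# compare the pairwise-distance multisets of candidate and pattern,
# either exactly (pure) or up to one uniform scale (isotropic).

from itertools import combinations
from math import dist

def pairwise(points):
    return [dist(p, q) for p, q in combinations(points, 2)]

def pure_match(candidate, pattern, tol=1e-6):
    c, p = sorted(pairwise(candidate)), sorted(pairwise(pattern))
    return all(abs(a - b) <= tol for a, b in zip(c, p))

def isotropic_match(candidate, pattern, tol=1e-6):
    c, p = sorted(pairwise(candidate)), sorted(pairwise(pattern))
    scale = c[0] / p[0]                   # one scale for all distances
    return all(abs(a - b * scale) <= tol for a, b in zip(c, p))

pattern   = [(0, 0), (1, 0), (0, 1)]
candidate = [(5, 5), (7, 5), (5, 7)]      # same triangle, scaled by 2
```

Here the candidate fails a pure match (its distances are twice the pattern's) but passes an isotropic one; a non-isotropic match would further relax the single-scale constraint.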
5. Machine learning and recommendation. Scientists commonly explore several input data files and parameter values across different executions of scientific workflows, and it is critical that these choices do not produce undesired results. Our approach is to use provenance data captured during previous workflow executions to recommend data files and parameter values, and to use ML to predict which data files and parameters are most suitable for a specific workflow execution. We proposed FReeP [Silva Jr. 2018a], a parameter recommendation algorithm that suggests a value for a parameter that agrees with the user's preferences. FReeP relies on preference learning, provenance data, and voting to recommend a value for a given workflow parameter or input dataset. Our experimental evaluation, using provenance traces from real executions of the SciPhy (bioinformatics) and Montage (astronomy) workflows, shows the effectiveness of FReeP in recommending parameter values for scientific workflows.
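The voting idea behind this kind of recommendation can be sketched in a few lines. This is a plain majority vote over provenance records that agree with the user's stated preferences, a simplification of FReeP, which uses preference learning; the parameter names in the example (`aligner`, `bootstrap`) are hypothetical stand-ins for workflow parameters, not taken from SciPhy.

```python
# Simplified sketch of provenance-based parameter recommendation:
# past runs whose parameters agree with the user's preferences vote
# on a value for the target parameter, and the majority wins.

from collections import Counter

def recommend(provenance, preferences, target):
    """provenance: list of dicts mapping parameter -> value per past run."""
    votes = Counter()
    for run in provenance:
        if all(run.get(k) == v for k, v in preferences.items()):
            if target in run:
                votes[run[target]] += 1   # this run casts its vote
    return votes.most_common(1)[0][0] if votes else None

provenance = [
    {"aligner": "mafft",   "bootstrap": 100},
    {"aligner": "mafft",   "bootstrap": 100},
    {"aligner": "mafft",   "bootstrap": 500},
    {"aligner": "clustal", "bootstrap": 100},
]
rec = recommend(provenance, {"aligner": "mafft"}, "bootstrap")
```

Restricting the vote to preference-compatible runs is what makes the suggestion agree with the user's choices rather than with the global majority.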
We validated our techniques by building new software (DfAnalyzer, SAVIME), improving our existing software (Chiron, OpenAlea), and integrating techniques as software libraries in popular DISC systems such as Apache Spark. We validated these techniques on real-world scientific data obtained from our application partners (e.g. INRA, Petrobras) in agronomy, astronomy, and computational engineering.