NB: a general presentation of the SciDISC project and results has been given at the LADaS workshop, held in conjunction with the VLDB 2018 conference [Valduriez 2018]
In situ and in transit data analysis
In situ analysis and visualization have been used successfully in large-scale computational simulations to visualize scientific data of interest while the data is still in memory. Such data are obtained from intermediate (or final) simulation results and, once analyzed, are typically stored in raw data files. However, existing in situ data analysis and visualization solutions (e.g. ParaView/Catalyst, VisIt) have limited online query processing and no support for dataflow analysis. The latter is a challenge for exploratory raw data analysis. We are working on a solution that integrates dataflow analysis with ParaView Catalyst to perform in situ and in-transit data analysis and to monitor the dataflow of simulation runs [Camata 2018]. Based on this in situ data extraction and analysis, we plan to improve our dataflow monitoring and debugging, and to extend our support for runtime adaptation, such as parameter fine-tuning and data reduction.
In [Silva 2018a], we propose a solution (architecture and algorithms), called Armful, to combine the advantages of a dataflow-aware SWMS and raw data file analysis techniques to allow for queries on raw data file elements that are related but reside in separate files. Armful is available as open source software (https://hpcdb.github.io/armful/) and its main components are a raw data extractor, a provenance gatherer and a query processing interface, which are all dataflow-aware.
An instantiation of Armful is DfAnalyzer [Silva 2018b], a library of components to support online in situ and in-transit data analysis. DfAnalyzer components are plugged directly into the simulation code of highly optimized parallel applications with negligible overhead. With support for sophisticated online data analysis, scientists get a detailed view of the execution, providing insights to determine when and how to tune parameters [Camata 2018, Silva 2018a] or to reduce data that does not need to be processed [Souza 2018]. The source code of the DfAnalyzer implementation for Spark is available on github (github.com/hpcdb/RFA-Spark).
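To make the in situ provenance idea concrete, the following is a minimal sketch of instrumenting a simulation loop with dataflow-aware provenance capture, in the spirit of DfAnalyzer. The `Dataflow`/`record_task` API below is entirely hypothetical, for illustration only; the actual DfAnalyzer library defines its own interface.

```python
# Minimal sketch of dataflow-aware in situ provenance capture.
# The Dataflow API below is hypothetical, not DfAnalyzer's real interface.

class Dataflow:
    def __init__(self, name):
        self.name = name
        self.tasks = []          # provenance log: one record per task execution

    def record_task(self, transformation, inputs, outputs):
        # Capture a provenance record while the simulation runs (in situ),
        # instead of post-processing raw files after the run ends.
        self.tasks.append({"transformation": transformation,
                           "inputs": inputs, "outputs": outputs})

    def query(self, transformation):
        # Online query over captured provenance, e.g. to monitor convergence.
        return [t["outputs"] for t in self.tasks
                if t["transformation"] == transformation]

# Simulated solver loop instrumented with provenance calls.
df = Dataflow("solver_run")
residual = 1.0
for step in range(5):
    residual *= 0.5                      # stand-in for one solver iteration
    df.record_task("iterate", {"step": step}, {"residual": residual})

residuals = [o["residual"] for o in df.query("iterate")]
```

The key point, as in DfAnalyzer, is that the capture calls sit inside the simulation loop, so queries like the residual history above can be answered during the run rather than after it.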
We also continued working on SAVIME (Simulation and Visualization in-memory), a multi-dimensional array data management system for in-transit analysis of simulation data. During this year, we spent time on the implementation and evaluation of SAVIME. The framework has been extended to interface with simulation code running on the supercomputer, enabling transparent and asynchronous data transfer to SAVIME, which runs on a shared-nothing cluster. The query capability of SAVIME has been extended to include matrix-based complex operations, such as aggregations, joins and filtering. Initial experiments investigated data ingestion performance, comparing SAVIME against file formats such as NetCDF, and against SciDB. The results show that SAVIME is competitive with NetCDF and orders of magnitude more efficient than SciDB. We also evaluated its impact on the simulation code: the results demonstrate that as mesh sizes increase, the data transfer overhead of SAVIME becomes negligible. Finally, we have integrated SAVIME with ParaView to support visualization of simulation output. The code of SAVIME is available at github.com/hllustosa/Savime.
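As an illustration of the kind of matrix-based operation such a query capability covers, the sketch below aggregates a dense 2D array stored in row-major order, as a simulation output array might be ingested. The flat layout and the `aggregate` function are assumptions for illustration, not SAVIME's actual storage format or API.

```python
# Illustrative aggregation over a dense multi-dimensional array stored flat
# in row-major order. Layout and function names are assumptions, not SAVIME's.

def aggregate(flat, shape, axis):
    """Sum a row-major 2D array stored flat, along the given axis."""
    rows, cols = shape
    if axis == 0:   # collapse rows -> one value per column
        return [sum(flat[r * cols + c] for r in range(rows)) for c in range(cols)]
    else:           # collapse columns -> one value per row
        return [sum(flat[r * cols + c] for c in range(cols)) for r in range(rows)]

# A 2x3 "simulation output" array ingested without transformation.
tar = [1, 2, 3,
       4, 5, 6]
col_sums = aggregate(tar, (2, 3), axis=0)   # per-column aggregation
row_sums = aggregate(tar, (2, 3), axis=1)   # per-row aggregation
```

Keeping the array flat and computing directly over it is what lets ingestion avoid any transformation of the simulation output.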
Post-processing of simulation data
We explored the use of Spark, one of the most popular DISC systems, as a platform for the efficient post-processing of big simulation data in the context of seismic applications. Such data correspond to points in a 3D soil cube. However, errors in signal processing and modeling create uncertainty, and thus a lack of accuracy in identifying geological or seismic phenomena. To analyze this uncertainty, the main solution is to compute a Probability Density Function (PDF) for each point in the spatial cube, which can be very time consuming. In [Liu 2018a], we take advantage of Spark to efficiently compute PDFs in parallel. The basic processing consists of data loading, from NFS into Spark RDDs, followed by PDF computation in Spark. The data loading process takes the data corresponding to a slice and pre-processes it in parallel, i.e., it computes statistical parameters of the observation values of each point. The PDF computation then groups the data and calculates the PDFs and errors of all points in a slice in parallel. To compute PDFs efficiently, we combine two memory management techniques, data caching and window size adjustment, with two computation methods, data grouping and ML-based prediction.
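The per-point computation described above can be sketched in plain Python rather than Spark: observations are grouped by point (as Spark's `groupByKey` would do in parallel), then a discrete PDF is estimated per point with a histogram. The bin count and the toy data are illustrative, not the configuration used in [Liu 2018a].

```python
# Sketch of the "data grouping" PDF computation, sequentially in plain
# Python. In the actual solution, grouping and the per-point histogram
# run in parallel over Spark RDD partitions.
from collections import defaultdict

def pdfs_by_point(observations, bins=4):
    groups = defaultdict(list)
    for point_id, value in observations:       # data grouping step
        groups[point_id].append(value)
    result = {}
    for point_id, values in groups.items():
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0        # guard against a flat group
        counts = [0] * bins
        for v in values:
            i = min(int((v - lo) / width), bins - 1)
            counts[i] += 1
        n = len(values)
        result[point_id] = [c / n for c in counts]   # normalized PDF
    return result

obs = [(0, 0.1), (0, 0.2), (0, 0.9), (1, 0.5), (1, 0.5)]
pdfs = pdfs_by_point(obs)
```

Each group is independent, which is exactly what makes the computation embarrassingly parallel once the data is partitioned by point.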
Analysis of scientific data
We worked on the analysis of geometrical patterns and on data mining of spatial-time series. A geometrical pattern is a set of points with all pairwise distances (or, more generally, relative distances) specified. Finding matches to such patterns has applications to spatial data in seismic, astronomical, and transportation contexts. It is a challenging problem, as the potential number of sets of elements that compose shapes is exponentially large in the size of the dataset and the pattern. In [Porto 2018a], we propose algorithms to find such patterns in large data applications using constellation queries. Our methods combine quadtrees, matrix multiplication, and bucket join processing. Our distributed experiments show that the choice of the composition algorithm (matrix multiplication or nested loops) depends on the freedom introduced in the query geometry through the additive distance factor: three clearly identified blocks of threshold values guide the choice of the best composition algorithm. Answering complex constellation queries, i.e. isotropic and non-isotropic queries, is challenging because scale and stretch factors may take any of an infinite number of values. In [Porto 2018b], we propose practically efficient sequential and distributed algorithms for pure, isotropic, and non-isotropic constellation queries. As far as we know, this is the first work to address isotropic and non-isotropic queries.
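A pure constellation query with an additive distance factor can be illustrated by the naive nested-loop composition below: it reports sets of points whose pairwise distances match the pattern's within a tolerance eps. This is only the brute-force baseline; the quadtree and matrix-multiplication methods of [Porto 2018a] prune the same search space far more efficiently.

```python
# Naive nested-loop composition for a pure constellation query:
# find point tuples whose pairwise distances match the pattern's within eps.
from itertools import permutations
from math import dist

def constellation_matches(points, pattern, eps):
    k = len(pattern)
    matches = []
    for cand in permutations(range(len(points)), k):
        ok = all(abs(dist(points[cand[i]], points[cand[j]]) -
                     dist(pattern[i], pattern[j])) <= eps
                 for i in range(k) for j in range(i + 1, k))
        if ok:
            matches.append(cand)
    return matches

pattern = [(0, 0), (1, 0), (0, 1)]           # right isosceles triangle
points = [(5, 5), (6, 5), (5, 6), (9, 9)]    # contains a translated copy
found = constellation_matches(points, pattern, eps=0.01)
```

Raising eps loosens the geometry and admits more candidate compositions, which is precisely why the best composition algorithm shifts across the threshold blocks mentioned above.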
The problem of discovering spatiotemporal sequential patterns arises in a broad range of scientific applications. Many initiatives find sequences constrained by space and time. However, we have addressed an appealing new challenge in this domain: finding tight space-time sequences, i.e., frequent sequences constrained in space and time that may not be frequent in the entire dataset. This leads to the identification of the maximal time interval and space range where these sequences are frequent. Discovering such patterns along with their constraints may reveal valuable knowledge that remains hidden with traditional methods, since their support is extremely low over the entire dataset. We have introduced a new Spatio-Temporal Sequence Miner (STSM) algorithm to discover tight space-time sequences [Campisano 2018]. We evaluated STSM using a proof-of-concept seismic use case. Compared with general spatial-time sequence mining algorithms (GSTSM), STSM enabled new insights by detecting the maximal space-time areas where each pattern is frequent. To the best of our knowledge, this is the first solution to tackle the problem of identifying tight space-time sequences.
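The "tight" idea can be illustrated with a toy example: a sequence can exceed the support threshold inside a space-time block while staying infrequent over the whole dataset. The data layout (one symbol sequence per spatial position) and the support function below are illustrative simplifications, not the STSM algorithm itself.

```python
# Toy illustration of tight space-time support: a sequence frequent in a
# spatial sub-range but not over the entire dataset.

def support(data, seq, positions, t0, t1):
    """Fraction of the given positions whose [t0:t1] window contains seq."""
    n = len(seq)
    hits = 0
    for p in positions:
        window = data[p][t0:t1]
        if any(window[i:i + n] == seq for i in range(len(window) - n + 1)):
            hits += 1
    return hits / len(positions)

data = {                     # spatial position -> symbol sequence over time
    0: "xxabxx", 1: "xabxxx", 2: "xxxabx",   # 'ab' occurs here...
    3: "xxxxxx", 4: "xxxxxx", 5: "xxxxxx",   # ...but never here
}
seq = "ab"
global_sup = support(data, seq, list(data), 0, 6)   # over the whole dataset
tight_sup = support(data, seq, [0, 1, 2], 0, 6)     # over the tight sub-range
```

With a support threshold of, say, 0.8, "ab" is rejected globally yet frequent in positions 0-2; finding the maximal such space-time block is what STSM automates.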
In the field of motif discovery, many time series data mining techniques have been developed, but for spatial-time series the literature reveals an open gap. We propose a new approach to discover motifs in spatial-time series. It first applies a spatial-time series transformation; a hash-based algorithm is then applied to the transformed series to find spatial-time motifs; finally, motifs are aggregated according to spatial-time constraints. We have published an R package (STMotif) [Bazaz 2018] and we are finishing a paper applying this technique to a seismic dataset.
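The hash-based step can be sketched as follows: discretized subsequences are bucketed by a hash of their content, and buckets holding two or more positions become motif candidates. Using the raw window as its own hash key is a simplification of the discretization used in practice, and the window length is illustrative.

```python
# Sketch of hash-based motif candidate generation: windows with identical
# content hash to the same bucket; buckets of size >= 2 are candidates.
from collections import defaultdict

def motif_candidates(series, w):
    buckets = defaultdict(list)
    for i in range(len(series) - w + 1):
        word = tuple(series[i:i + w])     # exact-content "hash" of the window
        buckets[word].append(i)
    # a motif candidate is a window shape occurring at 2+ positions
    return {word: idx for word, idx in buckets.items() if len(idx) >= 2}

series = "abcXYabcZZabc"
motifs = motif_candidates(list(series), w=3)
```

In the spatial-time setting, the positions in each bucket would then be filtered by the spatial-time constraints before a candidate is reported as a motif.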
Finally, in the area of trajectory analysis, existing methods either identify systemic patterns or detect anomalies over trajectory data. We are working both on support for large-scale mining processes [Ferreira 2018] and on an algorithm that computes space-time aggregations to convert trajectory data into analyses over permanent objects, enabling the identification of systemic patterns and anomalies over different regions. We evaluated this approach in the social sciences with urban mobility data and were able to identify systemic and specific anomalies in the urban transit of Rio de Janeiro [Cruz 2018].
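A minimal sketch of the space-time aggregation step, under simplified assumptions: trajectory points are bucketed into (grid cell, hour) bins, and bins whose counts deviate strongly from the mean are flagged as anomalous. The cell size, z-score rule and toy data are illustrative, not the algorithm evaluated in [Cruz 2018].

```python
# Sketch of space-time aggregation of trajectory points, followed by a
# simple z-score anomaly flag over the aggregated bins.
from collections import Counter
from statistics import mean, pstdev

def aggregate_bins(points, cell=1.0):
    # each point: (x, y, hour) -> bin key (cell_x, cell_y, hour)
    return Counter((int(x // cell), int(y // cell), h) for x, y, h in points)

def anomalies(bins, z=2.0):
    counts = list(bins.values())
    mu, sigma = mean(counts), pstdev(counts)
    return [b for b, c in bins.items() if sigma and abs(c - mu) > z * sigma]

# mostly uniform traffic, with a burst in cell (0, 0) at hour 8
pts = [(0.5, 0.5, 8)] * 30 + [(x + 0.5, 0.5, 9) for x in range(10)]
odd = anomalies(aggregate_bins(pts))
```

Aggregating first is what turns raw trajectories into "permanent object" counts per region and time slot, over which systemic patterns and outliers can both be read.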
Machine learning and recommendation
Scientists commonly explore several input data files and parameter values in different executions of scientific workflows. These workflows can execute for days in DISC environments and are costly in both execution time and financial terms [Liu 2018b]. It is fundamental that the input data files and parameter values chosen for a specific workflow execution do not produce undesired results. Our proposal is to use provenance data captured during previous workflow executions to recommend data files and parameter values for future executions. We use Machine Learning (ML) algorithms to predict which data files and parameters are more suitable for a specific workflow execution. To this end, we developed FReeP [Silva 2018c], a parameter recommendation algorithm that suggests a value for a parameter that agrees with the user's preferences. FReeP relies on the preference learning technique (i.e., it induces a predictive function that, given a set of items already established as preferred, predicts preferences for a new set of items), provenance data, and voting systems to recommend a value for a specified workflow parameter. A preliminary experimental evaluation over provenance traces from the SciPhy (bioinformatics) and Montage (astronomy) workflows (workflows for which we have access to specialists who can tell us how to measure the quality of results) showed the feasibility of FReeP for recommending parameter values for scientific workflows.
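The recommendation idea can be sketched as follows: past provenance records that agree with the user's stated preferences vote on a value for the target parameter. The record format, the parameter names (loosely inspired by a SciPhy-like workflow) and the plurality-vote rule are simplifications for illustration, not the actual FReeP algorithm.

```python
# Toy sketch of provenance-based parameter recommendation by voting.
# Record fields and the voting rule are illustrative, not FReeP's.
from collections import Counter

def recommend(provenance, preferences, target):
    # keep only past runs compatible with the user's preferences
    votes = [run[target] for run in provenance
             if all(run.get(k) == v for k, v in preferences.items())]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]   # plurality vote

provenance = [                     # hypothetical past workflow executions
    {"aligner": "mafft", "bootstrap": 100, "model": "GTR"},
    {"aligner": "mafft", "bootstrap": 100, "model": "GTR"},
    {"aligner": "mafft", "bootstrap": 500, "model": "HKY"},
    {"aligner": "muscle", "bootstrap": 100, "model": "GTR"},
]
best = recommend(provenance, {"aligner": "mafft"}, "bootstrap")
```

Filtering by the user's preferences before voting is what makes the suggestion agree with them, rather than simply returning the globally most common value.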
The first year of the project was devoted to the definition of a SciDISC architecture that now serves as a basis for developing new distributed and parallel techniques to deal with scientific data. We consider a generic architecture that features a high-performance computer with shared disk (e.g. to perform data processing and simulation) and a shared-nothing cluster to perform data analysis. The high-performance computer can be a supercomputer (e.g. the Bull supercomputer at LNCC) or a large cluster of compute nodes (e.g. Grid5000), which yields different cost-performance trade-offs to be studied. This architecture allows us to design generic techniques for data transfer, partitioning and replication, as a basis for parallel data analysis and fault tolerance in DISC [Silva 2017, Souza 2017a, 2017b]. Additionally, to achieve near real-time data transfer between the HPC system and the analytics platform, an orchestrated and tuned set of components must be devised. Security concerns, for instance, may restrict the exposure of simulation results to a single HPC entry node, which rapidly turns into a bottleneck on the HPC side.
From simulation to interactive analysis and visualization
In complex simulations, users must track quantities of interest (residuals, error estimates, etc.) to control the execution as much as possible. However, this tracking is typically done only after the simulation ends. We are designing techniques to extract, index and relate strategic simulation data for online queries while the simulation is running.
We consider coupling these techniques with widely adopted libraries such as libMesh (for numerical solvers) and ParaView (for visualization), so that queries on quantities of interest are enhanced with visualization and provenance data. Interactive data analysis is planned both post simulation and at runtime, in situ and in transit, taking advantage of in-memory access while the simulation runs.
In [Silva 2017], we propose a solution (architecture and algorithms) to combine the advantages of a dataflow-aware SWMS and raw data file analysis techniques, allowing queries on raw data file elements that are related but reside in separate files. Armful is the name of the architecture; its main components are a raw data extractor, a provenance gatherer and a query processing interface, which are all dataflow-aware. In [Silva 2017] we show Armful instantiated with the Chiron SWMS. In [Souza 2017a] we instantiate Armful without the SWMS, plugging the components directly into the simulation code of highly optimized parallel applications. With support for sophisticated online data analysis, scientists get a detailed view of the execution, providing insights to determine when and how to tune parameters. In [Souza 2017b] we evaluate a parameter sweep workflow, also in the Oil and Gas domain, this time using Spark to understand its scalability when executing legacy black-box code with a DISC system. The source code of the dataflow implementation for Spark is available on github (github.com/hpcdb/RFA-Spark).
We started investigating the combination of in-transit analysis and visualization with the development of SAVIME (Scientific Analysis and Visualization In-Memory). The system adopts a multi-dimensional data model, TARS (Typed Array Schema) [Hermano 2017], that enables the representation of simulation output data, the mesh topology and simulation metadata. Data produced by the simulation is ingested into the system without any transformation, as a Typed Array (TAR). We intend SAVIME to implement an algebra on TARs that enables simulation output analysis and the direct production of visualization output.
Data mining of scientific data
In [Campisano 2017], we tackle the problem of finding, within the same process: i) frequent sequences constrained in space and time that may not be frequent in the entire dataset; and ii) the time interval and space range where these sequences are frequent. Discovering such patterns along with their constraints may reveal valuable knowledge that remains hidden with traditional methods, since their support is extremely low over the entire dataset. We introduce a new Spatiotemporal Sequence Miner (STSM) algorithm to discover sequences that are frequent in a constrained space and time. We evaluate STSM using a seismic use case and illustrate its ability to detect frequent sequences constrained in space and time. Compared with traditional algorithms such as GSP, STSM not only discovers a larger number of patterns (200 times more), but also offers new knowledge in the form of the maximal block area where each pattern is frequent. Additionally, in [Cruz 2017], we started studying sensor data sources using spatial-temporal aggregations over the trajectories of the buses of Rio de Janeiro. As preliminary work on this subject, we established a baseline for anomaly identification in urban mobility, which may be useful for developing new approaches that help better understand urban mobility systems.
Machine learning and recommendation
Scientists commonly explore several data files and parameter values in different executions of workflows. These workflows can execute for days in DISC environments and are costly in both execution time and financial terms. It is fundamental that the data files and parameter values chosen for a workflow do not produce undesired results. Today, scientists spend much time choosing appropriate data files and parameter values based on their previous experience, which is a tedious and error-prone task.
Our approach is to use provenance captured in previous executions of scientific workflows to recommend data files and parameters for scientists in new executions. Our goal is to use Machine Learning (ML) algorithms to predict which data files and parameter values are more suitable for a new workflow execution. In this first year, we developed a series of predictive models [Silva Jr 2017] to identify which combinations of data files and parameter values produce higher-quality results in less time. As input datasets, we use provenance traces from the SciPhy (bioinformatics) and Montage (astronomy) workflows (workflows for which we have access to specialists who can tell us how to measure the quality of results). This way, we are able to suggest "ideal" parameter values and data files that will produce results of higher quality and/or in less time. These predictive models are based on traditional machine learning algorithms such as classification trees, Support Vector Machines (SVM), one-class SVM and Inductive Logic Programming. Each predictive model has different precision and accuracy, so the best model may have to be chosen before recommending parameter values and data files. We thus have a 2-level recommendation scenario: first, we recommend which predictive model to use, and then we run this model on new data to recommend the parameter values and data files for workflow executions. This combination of machine learning and feedback is novel compared with existing approaches.