The first year of the project has been devoted to the definition of a SciDISC architecture that will serve as a basis for developing new distributed and parallel techniques to deal with scientific data. We consider a generic architecture that features a shared-disk high-performance computer (e.g. to perform data processing and simulation) and a shared-nothing cluster to perform data analysis. The high-performance computer can be a supercomputer (e.g. the Bull supercomputer at LNCC) or a large cluster of compute nodes (e.g. Grid'5000), which yields different cost-performance trade-offs to study. This architecture allows us to design generic techniques for data transfer, partitioning and replication as a basis for parallel data analysis and fault tolerance in DISC [Silva 2017, Souza 2017a, 2017b]. Additionally, near-real-time data transfer between the HPC system and the analytics platform requires an orchestrated and tuned set of components. Security concerns, for instance, may restrict the exposure of simulation results to a single HPC entry node, which can rapidly become a bottleneck on the HPC side.
From simulation to interactive analysis and visualization
In complex simulations, users must track quantities of interest (residuals, error estimates, etc.) to keep the execution under control. However, this tracking is typically done only after the simulation ends. We are designing techniques to extract, index and relate strategic simulation data for online queries while the simulation is running.
We consider coupling these techniques with widely adopted libraries such as libMesh (for numerical solvers) and ParaView (for visualization), so that queries on quantities of interest are enhanced with visualization and provenance data. Interactive data analysis is planned both post-simulation and at runtime, in situ and in transit, taking advantage of in-memory access while the simulation runs.
In [Silva 2017], we propose a solution (architecture and algorithms) that combines the advantages of a dataflow-aware SWMS with raw data file analysis techniques to allow queries on raw data file elements that are related but reside in separate files. The architecture, named ARMFUL, has three main components, all dataflow-aware: a raw data extractor, a provenance gatherer and a query processing interface. In [Silva 2017] we show ARMFUL instantiated with the Chiron SWMS. In [Souza 2017a] we instantiate ARMFUL without the SWMS, plugging the components directly into the simulation code of highly optimized parallel applications. With sophisticated online data analysis support, scientists get a detailed view of the execution, providing insights to determine when and how to tune parameters. In [Souza 2017b] we evaluate a parameter sweep workflow, also in the Oil and Gas domain, this time using Spark to understand its scalability when executing legacy black-box code with a DISC system. The source code of the dataflow implementation for Spark is available on GitHub (github.com/hpcdb/RFA-Spark).
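The idea of plugging the components directly into the simulation code can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual ARMFUL implementation: a raw data extractor pulls a quantity of interest at each solver step, a provenance gatherer relates it to the task and input file that produced it, and the store answers queries while the simulation is still running.

```python
import sqlite3

class ProvenanceStore:
    """Toy queryable store standing in for the provenance gatherer
    and query processing interface (hypothetical, for illustration)."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE qoi (task INTEGER, step INTEGER, "
            "name TEXT, value REAL, input_file TEXT)")

    def record(self, task, step, name, value, input_file):
        # provenance gathering: relate each extracted value to the
        # task, step and input file that produced it
        self.db.execute("INSERT INTO qoi VALUES (?,?,?,?,?)",
                        (task, step, name, value, input_file))

    def query(self, sql, args=()):
        return self.db.execute(sql, args).fetchall()

def simulate(store, task, input_file, steps=5):
    residual = 1.0
    for step in range(steps):
        residual *= 0.5  # stand-in for one solver iteration
        # raw data extraction: capture the quantity of interest online,
        # inside the simulation loop
        store.record(task, step, "residual", residual, input_file)

store = ProvenanceStore()
simulate(store, task=1, input_file="mesh_a.h5")
# online query, usable while other tasks are still running:
rows = store.query(
    "SELECT step, value FROM qoi WHERE name='residual' AND value < 0.2")
print(rows)
```

Here the query interface is plain SQL over the gathered provenance; the real system exposes dataflow-aware queries spanning elements in separate raw data files.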
We started investigating the combination of in-transit analysis and visualization with the development of SAVIME (Scientific Analysis and Visualization In-Memory). The system adopts TARS (Typed Array Schema), a multi-dimensional data model [Hermano 2017] that represents the simulation output data, the mesh topology and the simulation metadata. Data produced by the simulation is ingested into the system without any transformation as a Typed Array (TAR). We intend SAVIME to implement an algebra on TARs that enables simulation output analysis and direct production of visualization output.
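To make the data model concrete, the following is a minimal sketch, with hypothetical structures that are not the SAVIME API: a typed array keeps the simulation output in its native, untransformed flat layout, together with dimension and simulation metadata, and a single algebra operator restricts one dimension (e.g. a range of time steps).

```python
from dataclasses import dataclass, field

@dataclass
class TypedArray:
    name: str
    dims: dict       # dimension name -> extent, in row-major order
    data: list       # flat buffer, ingested exactly as produced
    metadata: dict = field(default_factory=dict)

    def slice_first(self, lo, hi):
        """One toy algebra operator: restrict the first (slowest-varying)
        dimension of the row-major buffer without copying per element."""
        first, *rest = self.dims
        row = 1
        for d in rest:
            row *= self.dims[d]
        new_dims = dict(self.dims)
        new_dims[first] = hi - lo
        return TypedArray(self.name, new_dims,
                          self.data[lo * row:hi * row], self.metadata)

# pressure on 2 mesh cells over 3 time steps, plus simulation metadata
tar = TypedArray("pressure", {"time": 3, "cell": 2},
                 [1.0, 1.1, 2.0, 2.1, 3.0, 3.1],
                 metadata={"mesh": "tetra", "solver": "libMesh"})
last_two = tar.slice_first(1, 3)  # keep time steps 1 and 2
print(last_two.dims, last_two.data)
```

A full algebra would add operators such as filters, aggregations and map-style transformations over TARs, feeding analysis and visualization directly.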
Data mining of scientific data
In [Campisano 2017], we tackle the problem of finding, within the same process: i) frequent sequences constrained in space and time, which may not be frequent in the entire dataset, and ii) the time interval and space range where these sequences are frequent. Discovering such patterns along with their constraints may yield valuable knowledge that remains hidden from traditional methods, since the support of these sequences is extremely low over the entire dataset. We introduce a new Spatio-Temporal Sequence Miner (STSM) algorithm to discover sequences that are frequent in a constrained space and time. We evaluate STSM using a seismic use case and illustrate its ability to detect frequent sequences constrained in space and time. Compared with traditional algorithms such as GSP, STSM not only discovers many more patterns (200 times more), but also provides new knowledge: the maximal block area where each pattern is frequent. Additionally, in [Cruz 2017], we started studying sensor data sources using spatio-temporal aggregations of trajectories of the buses of Rio de Janeiro. As preliminary work on this subject, we established a baseline for anomaly identification in urban mobility, which may be useful for developing new approaches that help better understand urban mobility systems.
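The core intuition behind STSM can be shown with a tiny example. This is not the published algorithm, only a sketch with made-up data: a sequence whose support over the whole dataset falls below any reasonable threshold can still be frequent inside a constrained block of spatial positions.

```python
def support(rows, seq):
    """Fraction of rows (spatial positions) whose time series contains
    seq as a contiguous subsequence."""
    n = len(seq)
    hits = sum(
        any(row[i:i + n] == seq for i in range(len(row) - n + 1))
        for row in rows)
    return hits / len(rows)

# one string per spatial position; characters are symbols over time
data = [
    "abcabc",   # positions 0-1: the pattern 'abc' occurs here
    "abcxyz",
    "xyzxyz",   # positions 2-5: the pattern is absent
    "xxyyzz",
    "zzyyxx",
    "yzyzyz",
]

global_sup = support(data, "abc")      # low over the entire dataset
block_sup = support(data[0:2], "abc")  # frequent inside the space block
print(global_sup, block_sup)
```

STSM additionally reports the constraint itself, i.e. the maximal block (here, positions 0-1 and the early time range) where each discovered sequence is frequent, which a global miner like GSP would simply prune away.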
Machine learning and recommendation
Scientists commonly explore several data files and parameter values across different executions of workflows. These workflows can execute for days in DISC environments and are costly in both execution time and money. It is fundamental that the data files and parameter values chosen for a workflow do not produce undesired results. Today, scientists spend much time choosing appropriate data files and parameter values based on their previous experience, but this is a tedious and error-prone task.
Our approach is to use provenance captured in previous executions of scientific workflows to recommend data files and parameters to scientists for new executions. Our goal is to use Machine Learning (ML) algorithms to predict which data files and parameter values are more suitable for a new workflow execution. In this first year, we have developed a series of predictive models [Silva Jr 2017] to identify which combinations of data files and parameter values produce higher-quality results in less time. As input datasets we use provenance traces from the SciPhy (bioinformatics) and Montage (astronomy) workflows, for which we have access to specialists who can tell us how to measure result quality. This way, we are able to suggest “ideal” parameter values and data files that will produce results with higher quality and/or in less time. These predictive models are based on traditional machine learning algorithms such as Classification Trees, Support Vector Machines (SVM), One-Class SVM and Inductive Logic Programming. Each predictive model has different precision and accuracy, so the best one may need to be chosen before recommending parameter values and data files. This gives us a 2-level recommendation scenario: first we recommend which predictive model to use, then we run this model on new data to recommend the parameter values and data files for the workflow execution. This combination of machine learning and feedback is novel compared with existing approaches.
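The 2-level scenario can be sketched in a few lines. The data and the two toy models below are hypothetical, standing in for the provenance traces and the trained classifiers of [Silva Jr 2017]: level 1 picks the predictive model with the best accuracy on held-out provenance, and level 2 uses the chosen model to recommend parameter values predicted to give good-quality results.

```python
# provenance traces: (parameter_value, result_quality_ok)
history = [(0.1, False), (0.2, False), (0.5, True),
           (0.6, True), (0.7, True), (0.15, False)]
train, held_out = history[:4], history[4:]

def nearest_neighbor(train):
    """Toy 1-NN model: predict the quality label of the closest
    previously seen parameter value."""
    def predict(x):
        return min(train, key=lambda t: abs(t[0] - x))[1]
    return predict

def majority(train):
    """Toy baseline: always predict the majority quality label."""
    vote = sum(ok for _, ok in train) >= len(train) / 2
    return lambda x: vote

models = {"1-NN": nearest_neighbor(train), "majority": majority(train)}

# level 1: recommend a predictive model by held-out accuracy
def accuracy(model):
    return sum(model(x) == ok for x, ok in held_out) / len(held_out)

best_name = max(models, key=lambda n: accuracy(models[n]))
best = models[best_name]

# level 2: recommend the candidate parameter values the chosen model
# predicts will yield a good-quality result
candidates = [0.15, 0.55]
recommended = [x for x in candidates if best(x)]
print(best_name, recommended)
```

In the actual work the level-2 models are Classification Trees, SVMs, One-Class SVM and Inductive Logic Programming trained on real SciPhy and Montage provenance, but the selection-then-recommendation structure is the same.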