The research challenge is to develop new architectures and methods that combine simulation and data analysis. We can distinguish three main approaches, depending on where the analysis is done [Oldfield 2014]: postprocessing, in-situ and in-transit. Postprocessing performs the analysis after the simulation has completed, e.g. by loosely coupling a supercomputer and a SciDISC cluster (possibly in the cloud). This approach is the simplest but is restricted to batch analysis. In-situ analysis runs on the same compute resources as the simulation, e.g. a supercomputer, thus making it easy to perform interactive analysis. In-transit analysis offloads the analysis to a separate partition of compute resources, e.g. using a single cluster with both compute nodes and data nodes that communicate through a high-speed network. Although less intrusive than in-situ analysis, this approach requires careful synchronization of simulation and analysis.
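To make the distinction between the three approaches concrete, the following is a minimal single-process sketch in Python, not an actual SciDISC architecture. All names (`simulate`, `analyze`, `postprocessing`, `in_situ`, `in_transit`, `buffer_size`) are hypothetical and introduced only for illustration. An in-memory list stands in for the storage shared with a separate cluster, and a worker thread with a bounded queue stands in for a separate partition of data nodes reached over a high-speed network.

```python
import queue
import threading
from statistics import mean


def simulate(n_steps):
    """Toy simulation (assumption): yields (step, field) pairs."""
    state = [0.0, 1.0, 2.0, 3.0]
    for step in range(n_steps):
        state = [x + 1.0 for x in state]
        yield step, list(state)


def analyze(field):
    """Toy analysis (assumption): mean of the field at one time step."""
    return mean(field)


def postprocessing(n_steps):
    # The simulation runs to completion and its output is stored first;
    # the list stands in for files handed over to an analysis cluster.
    stored = list(simulate(n_steps))
    # Batch analysis starts only after the simulation has finished.
    return [analyze(field) for _, field in stored]


def in_situ(n_steps):
    # Analysis runs inside the simulation loop, on the same resources:
    # results are available at every step, but each call to analyze()
    # delays the next simulation step.
    results = []
    for _, field in simulate(n_steps):
        results.append(analyze(field))
    return results


def in_transit(n_steps, buffer_size=2):
    # A bounded queue plays the role of the staging area between
    # compute nodes and data nodes.
    staging = queue.Queue(maxsize=buffer_size)
    results = []

    def analysis_worker():
        while True:
            item = staging.get()
            if item is None:  # sentinel: the simulation is done
                break
            _, field = item
            results.append(analyze(field))

    worker = threading.Thread(target=analysis_worker)
    worker.start()
    for item in simulate(n_steps):
        # put() blocks when the buffer is full: this is the point where
        # simulation and analysis have to be synchronized.
        staging.put(item)
    staging.put(None)
    worker.join()
    return results
```

All three functions return the same per-step values; what differs is when the analysis runs relative to the simulation and which resources it competes for. The blocking `put()` in `in_transit` gives a small-scale picture of the synchronization issue mentioned above: if the analysis side is slower than the simulation, the buffer fills up and the simulation stalls.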
In this project, we study different architectures for SciDISC and their trade-offs. We address the following main steps of the data-intensive science process: (1) data preparation, including raw data ingestion (e.g. from sensors) and data cleaning, transformation and integration; (2) data processing and simulation execution; (3) exploratory data analysis and visualization; (4) data mining, knowledge discovery and recommendation. Note that these steps are not necessarily sequential; for instance, steps 2 and 3 need to be interleaved to perform real-time analysis.
The expected results of the project are: new data analysis methods for SciDISC systems; the integration of these methods as software libraries in popular DISC systems, such as Apache Spark; and extensive validation on real scientific applications, by working with our scientific partners such as INRA and IRD in France and Petrobras and the National Research Institute (INCT) on e-medicine (MACC) in Brazil.