SciDISC (2017-2019) with LNCC, UFRJ, UFF, CEFET (Brazil)

SciDISC objectives

The research challenge is to develop new architectures and methods that combine simulation and data analysis. Three main approaches can be distinguished, depending on where the analysis is done [Oldfield 2014]: postprocessing, in-situ and in-transit. Postprocessing runs the analysis after the simulation has finished, e.g. by loosely coupling a supercomputer and a SciDISC cluster (possibly in the cloud). This approach is the simplest but is restricted to batch analysis. In-situ analysis runs on the same compute resources as the simulation, e.g. a supercomputer, which makes interactive analysis easy to perform. In-transit analysis offloads the analysis to a separate partition of compute resources, e.g. using a single cluster with both compute nodes and data nodes that communicate through a high-speed network. Although less intrusive than in-situ analysis, this approach requires careful synchronization of simulation and analysis.
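The difference between the three approaches can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the functions `simulate`, `analyze`, `postprocessing`, `in_situ` and `in_transit` are hypothetical names, the "simulation" is a trivial deterministic loop, and a thread with a bounded queue merely stands in for a separate partition of data nodes reached over a network. It is not the project's architecture, only a way to see where the analysis sits in each case.

```python
import queue
import threading


def simulate(n_steps):
    """Toy stand-in for a simulation: yields one value per time step."""
    value = 0.0
    for step in range(n_steps):
        value += step * 0.5  # deterministic update, so results are reproducible
        yield value


def analyze(values):
    """Toy batch analysis: the maximum value produced by the simulation."""
    return max(values)


def postprocessing(n_steps):
    # The simulation runs to completion and its output is stored;
    # the analysis is a separate batch job that runs afterwards.
    stored = list(simulate(n_steps))
    return analyze(stored)


def in_situ(n_steps):
    # The analysis runs inside the simulation loop, on the same resources,
    # so an intermediate result is available after every step.
    running_max = float("-inf")
    for value in simulate(n_steps):
        running_max = max(running_max, value)
    return running_max


def in_transit(n_steps):
    # The simulation hands each value to a separate worker. The bounded
    # queue (an assumption standing in for the network) is the
    # synchronization point: put() blocks if the analysis falls behind.
    channel = queue.Queue(maxsize=4)
    result = {}

    def worker():
        running_max = float("-inf")
        while True:
            item = channel.get()
            if item is None:  # sentinel: the simulation has finished
                break
            running_max = max(running_max, item)
        result["max"] = running_max

    thread = threading.Thread(target=worker)
    thread.start()
    for value in simulate(n_steps):
        channel.put(value)
    channel.put(None)
    thread.join()
    return result["max"]
```

All three functions compute the same result; what changes is when the result becomes available and which resources carry the analysis load. In the in-transit variant, the queue size controls how far the simulation may run ahead of the analysis, which is the kind of synchronization choice mentioned above.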

In this project, we study different architectures for SciDISC and their trade-offs. We address the following main steps of the data-intensive science process: (1) data preparation, including raw data ingestion (e.g. from sensors) and data cleaning, transformation and integration; (2) data processing and simulation execution; (3) exploratory data analysis and visualization; (4) data mining, knowledge discovery and recommendation. Note that these steps are not necessarily sequential. For instance, steps 2 and 3 need to be interleaved in order to perform real-time analysis.

The expected results of the project are: new data analysis methods for SciDISC systems; the integration of these methods as software libraries in popular DISC systems, such as Apache Spark; and extensive validation on real scientific applications, carried out with our scientific partners such as INRA and IRD in France, and Petrobras and the National Research Institute (INCT) on e-medicine (MACC) in Brazil.
