SciDISC (2017-2019) with LNCC, UFRJ, UFF, CEFET (Brazil)

SciDISC (Scientific data analysis using Data-Intensive Scalable Computing) is an associated team (“équipe associée”) between Zenith and four teams in the state of Rio de Janeiro (LNCC, COPPE/UFRJ, UFF and CEFET), in place since January 2017. SciDISC is headed by Marta Mattoso (COPPE/UFRJ) and Patrick Valduriez (Zenith).

Data-intensive science requires the integration of two fairly different paradigms: high-performance computing (HPC) and data-intensive scalable computing (DISC). HPC is compute-centric and focuses on the high performance of simulation applications, typically using powerful yet expensive supercomputers. DISC, on the other hand, is data-centric and focuses on the fault tolerance and scalability of web and cloud applications, using cost-effective clusters of commodity hardware. Examples of DISC systems include big data processing frameworks such as Hadoop and Apache Spark, as well as NoSQL systems. To harness parallel processing, HPC uses a low-level programming model (such as MPI or OpenMP), while DISC relies on powerful data processing operators (Map, Reduce, Filter, …). Data storage is also quite different: supercomputers typically rely on a shared-disk infrastructure, and data must be loaded onto compute nodes before processing, while DISC systems rely on a shared-nothing cluster (of disk-based nodes) and data partitioning.
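As a rough illustration of the operator style mentioned above, the following Python sketch applies Filter, Map and Reduce to data that has been split into partitions, each standing in for one node of a shared-nothing cluster. It is not code from the SciDISC project: the function names (`partition`, `local_sum_of_squares`, `sum_of_squares`), the round-robin partitioning and the sum-of-squares task are all assumptions chosen for illustration, and everything runs sequentially in a single process.

```python
from functools import reduce


def partition(records, n_parts):
    """Split records round-robin into n_parts lists (hypothetical stand-ins for nodes)."""
    parts = [[] for _ in range(n_parts)]
    for i, rec in enumerate(records):
        parts[i % n_parts].append(rec)
    return parts


def local_sum_of_squares(part, threshold):
    """Work done on one partition: Filter, then Map, then a local Reduce."""
    kept = filter(lambda x: x >= threshold, part)   # Filter: drop values below threshold
    squared = map(lambda x: x * x, kept)            # Map: square each remaining value
    return reduce(lambda a, b: a + b, squared, 0)   # Reduce: sum within the partition


def sum_of_squares(records, n_parts=3, threshold=0):
    """Combine per-partition results with a final Reduce."""
    partials = [local_sum_of_squares(p, threshold)
                for p in partition(records, n_parts)]
    return reduce(lambda a, b: a + b, partials, 0)


if __name__ == "__main__":
    print(sum_of_squares([1, -2, 3, 4], n_parts=2))
```

In a real DISC framework the per-partition work would run in parallel on separate nodes, and the framework would handle partitioning and recovery from node failures. The sketch only shows how the computation is expressed with high-level operators rather than explicit message passing.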

Spurred by the growing need to analyze big scientific data, the convergence between HPC and DISC has recently become a topic of interest. However, simply porting the Hadoop stack to a supercomputer is not cost-effective and does not solve the scalability and fault-tolerance issues addressed by DISC. On the other hand, DISC systems have not been designed for scientific applications, which have different requirements in terms of data analysis and visualization. This project will address the grand challenge of scientific data analysis using DISC (SciDISC) by developing architectures and methods to combine simulation and data analysis.

Permanent link to this article: https://team.inria.fr/zenith/scidisc/

SciDISC achievements

In this project, we studied architectures (post-processing, in situ and in transit) and methods to combine simulation and scientific data analysis using Data-Intensive Scalable Computing (DISC). We addressed the following main steps of the data-intensive science process: (1) data preparation, including raw data ingestion and data cleaning, transformation and integration; (2) data processing and simulation execution; (3) exploratory data analysis and visualization; (4) data mining, knowledge …

SciDISC meetings

2019
24 April 2019: 3rd SciDISC workshop, LNCC, Rio de Janeiro.
12 August 2019: SciDISC meeting at CEFET-RJ, Rio de Janeiro with Inaugural lecture by Esther Pacitti.
13 August 2019: 4th SciDISC workshop, COPPE/UFRJ, Rio de Janeiro.
20 November 2019: Final SciDISC workshop, Inria, Montpellier.

2018
31 Jan 2018: Zenith seminar, Montpellier: Vitor Silva (UFRJ) “A …

SciDISC objectives

The research challenge is to develop new architectures and methods to combine simulation and data analysis. We can distinguish between three main approaches depending on where analysis is done [Oldfield 2014]: postprocessing, in-situ and in-transit. Postprocessing analysis performs analysis after simulation, e.g. by loosely coupling a supercomputer and a SciDISC cluster (possibly in the cloud). This …

SciDISC participants

LNCC, Petrópolis, RJ
Fabio Porto (senior researcher)
Kary Ocaña (researcher)
Daniel Gaspar (PhD student)
Hermano Lustosa (PhD student)
Noel Lemus (postdoc)
Rafael Pereira (Master student)
João N. Rittmeyer (Master student)

COPPE/UFRJ, Rio de Janeiro, RJ
Alvaro Coutinho (professor)
Marta Mattoso (professor)
José Camata (postdoc)
Vitor Silva (PhD student), until June 2018
Renan Souza (PhD student) …

SciDISC publications

2019
[Borges 2019] H. Borges, M. Dutra, A. Bazaz, R. Coutinho, F. Perosi, F. Porto, F. Masseglia, E. Pacitti, E. Ogasawara. Spatial-Time Motifs Discovery. Intelligent Data Analysis, accepted for publication, 2019.
[Heidsieck 2019] G. Heidsieck, D. de Oliveira, E. Pacitti, C. Pradal, F. Tardieu, P. Valduriez. Adaptive Caching for Data-Intensive Scientific Workflows in the Cloud. …