SciDISC (2017-2019) with LNCC, UFRJ, UFF, CEFET (Brazil)

SciDISC (Scientific data analysis using Data-Intensive Scalable Computing) is an associated team ("équipe associée") between Zenith and four teams in the state of Rio de Janeiro (LNCC, COPPE/UFRJ, UFF and CEFET) since January 2017. SciDISC is headed by Marta Mattoso (COPPE/UFRJ) and Patrick Valduriez (Zenith).

Data-intensive science requires the integration of two fairly different paradigms: high-performance computing (HPC) and data-intensive scalable computing (DISC). HPC is compute-centric and focuses on the high performance of simulation applications, typically using powerful, yet expensive supercomputers. DISC, on the other hand, is data-centric and focuses on the fault tolerance and scalability of web and cloud applications using cost-effective clusters of commodity hardware. Examples of DISC systems include big data processing frameworks such as Hadoop and Apache Spark, or NoSQL systems. To harness parallel processing, HPC uses a low-level programming model (such as MPI or OpenMP) while DISC relies on powerful data processing operators (Map, Reduce, Filter, …). Data storage is also quite different: supercomputers typically rely on a shared-disk infrastructure and data must be loaded into compute nodes before processing, while DISC systems rely on a shared-nothing cluster (of disk-based nodes) and data partitioning.
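To make the contrast concrete, here is a minimal sketch of the DISC programming style, assuming a local Apache Spark installation with PySpark; the data and the threshold are made up for illustration. The Filter, Map and Reduce operators hide data partitioning and fault tolerance, which an equivalent MPI program would have to manage explicitly.

from pyspark import SparkContext

# Start a local Spark context (on a DISC cluster this would point to the cluster master).
sc = SparkContext("local[*]", "scidisc-operator-sketch")

# Toy simulation outputs: (cell id, temperature) pairs.
readings = sc.parallelize([(1, 280.0), (2, 312.5), (3, 295.3), (4, 330.1)])

# Filter, Map and Reduce expressed as data operators rather than message passing.
hot_cells = readings.filter(lambda r: r[1] > 300.0)   # keep only hot cells
temps = hot_cells.map(lambda r: r[1])                 # project the temperature
total = temps.reduce(lambda a, b: a + b)              # aggregate across partitions

print("sum of hot-cell temperatures:", total)
sc.stop()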

Spurred by the growing need to analyze big scientific data, the convergence between HPC and DISC has been a recent topic of interest. However, simply porting the Hadoop stack onto a supercomputer is not cost-effective, and does not solve the scalability and fault-tolerance issues addressed by DISC. On the other hand, DISC systems have not been designed for scientific applications, which have different requirements in terms of data analysis and visualization. This project addresses the grand challenge of scientific data analysis using DISC (SciDISC) by developing architectures and methods to combine simulation and data analysis.

SciDISC achievements

SciDISC Architecture
The first year of the project has been devoted to the definition of a SciDISC architecture that will serve as a basis for developing new distributed and parallel techniques to deal with scientific data. We consider a generic architecture that features a high-performance computer (e.g. to perform data processing and simulation) with shared-disk …

SciDISC meetings

27 January 2017: SciDISC seminar by Fabio Porto (LNCC), "Database System Support of Simulation Data", Zenith, Montpellier.
30 January – 2 February 2017: meetings during the HPC4E H2020 project review in Sophia Antipolis with A. Coutinho (COPPE/UFRJ), M. Mattoso (COPPE/UFRJ), H. Lustosa (LNCC), F. Porto (LNCC), J. Liu (Zenith) and P. Valduriez.
1 June 2017: …

SciDISC objectives

The research challenge is to develop new architectures and methods to combine simulation and data analysis. We can distinguish three main approaches depending on where the analysis is done [Oldfield 2014]: postprocessing, in-situ and in-transit. Postprocessing performs the analysis after the simulation, e.g. by loosely coupling a supercomputer and a SciDISC cluster (possibly in the cloud), as illustrated by the sketch below. This …
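As a rough illustration of where analysis runs relative to the simulation (the simulate_step and analyze functions below are placeholders, not SciDISC code): in postprocessing, the simulation writes its raw output and a separate analysis reads it back later, whereas in-situ analysis is invoked inside the simulation loop on the same resources, avoiding the cost of storing the full raw output.

import json

def simulate_step(step):
    # Placeholder "simulation": produce some per-step values.
    return {"step": step, "values": [step * 0.1 * i for i in range(4)]}

def analyze(state):
    # Placeholder analysis: a simple summary statistic.
    return max(state["values"])

# Postprocessing: the simulation dumps raw output; analysis happens afterwards,
# typically on a separate DISC cluster reading the stored files.
with open("raw_output.jsonl", "w") as out:
    for step in range(10):
        out.write(json.dumps(simulate_step(step)) + "\n")

with open("raw_output.jsonl") as raw:
    post_results = [analyze(json.loads(line)) for line in raw]

# In-situ: analysis runs inside the simulation loop, step by step.
insitu_results = [analyze(simulate_step(step)) for step in range(10)]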

SciDISC participants

LNCC, Petrópolis, RJ: Fabio Porto (senior researcher), Kary Ocaña (researcher), Daniel Gaspar (PhD student), Hermano Lustosa (PhD student), Noel Lemus (PhD student)
COPPE/UFRJ, Rio de Janeiro, RJ: Alvaro Coutinho (professor), Marta Mattoso (professor), José Camata (postdoc), Vitor Silva (PhD student), Renan Souza (PhD student)
UFF, Niterói, RJ: Daniel Oliveira (professor), Fabricio da Silva (PhD student)
CEFET, …

SciDISC publications

[Khatibi 2017] A. Khatibi, F. Porto, J. Rittmeyer, E. Ogasawara, P. Valduriez, D. Shasha. Pre-processing and Indexing Techniques for Constellation Queries in Big Data. Int. Conf. on Big Data Analytics and Knowledge Discovery (DaWaK), 164-172, 2017.
[Liu 2017] J. Liu, E. Pacitti, P. Valduriez, M. Mattoso. Scientific Workflow Scheduling with Provenance Data in a Multisite …