HPDaSc (High Performance Data Science)

HPDaSc (High Performance Data Science) is an associated team (“équipe associée”) between Zenith and four teams in the state of Rio de Janeiro (LNCC, COPPE/UFRJ, UFF and CEFET), active since January 2020. HPDaSc is headed by Patrick Valduriez (Zenith) and Fabio Porto (LNCC).

Data-intensive science requires the integration of two fairly different paradigms: high-performance computing (HPC) and data science. HPC is compute-centric and focuses on the high performance of simulation applications, typically using powerful yet expensive supercomputers, whereas data science is data-centric and focuses on the scalability and fault tolerance of web and cloud applications, using cost-effective clusters of commodity hardware.

In the context of the SciDISC project (associated team 2016-2019) and the Inria Project Lab (IPL) HPC-BigData (2018-2022), we studied various architectures for integrating HPC and big data (post-processing, in-situ, in-transit) for applications in astronomy, life science and agronomy, and geoscience (oil & gas). We learned major lessons, which are the basis for this new project:

  • Importance of real-time analytics to make critical high-consequence decisions, e.g. preventing useless drilling based on a driller’s real-time data and real-time visualization of simulated data;
  • Effectiveness of machine learning (ML) to deal with scientific data, e.g. computing Probability Density Functions (PDFs) over simulated seismic data using Spark;
  • Effectiveness of the Human-In-the-Loop (HIL) paradigm in combination with provenance data in scientific workflows, e.g. to avoid useless, long-duration computations on a supercomputer;
  • Significance of working closely with domain experts in order to interpret scientific data.

This project addresses the grand challenge of High Performance Data Science (HPDaSc) by developing architectures and methods to combine simulation, ML and data analytics.

Permanent link to this article: https://team.inria.fr/zenith/hpdasc/

Highlights

RISC2 European H2020 project (2021-2023) between Europe and Latin America in HPC, final review

The final review of the RISC2 project was held virtually on 31 October 2023 and was outstanding. All the reviewers congratulated the RISC2 participants for their excellent work and sustained collaboration, despite the COVID pandemic. As mentioned by the project officer, the …

HPDaSc meetings and seminars

  • 16 August 2024: 9th Workshop of the HPDaSc project, CEFET-RJ, Rio de Janeiro, Brazil.
  • 8 August 2024: Seminar by Patrick Valduriez on “Data Science and Innovation”, CEFET-RJ, Rio de Janeiro, Brazil.
  • 31 May 2024: 8th Workshop of the HPDaSc project, Montpellier.
  • 9 May 2024: Seminar by Patrick Valduriez on “Ciência de Dados e Inovação”, Instituto …

HPDaSc objectives

Based on lessons learned from previous projects (SciDISC, HPCBD), we address the following requirements for high-performance data science (HPDaSc):

  • Support real-time analytics and visualization (in either in situ or in transit architectures) to help make high-impact online decisions;
  • Combine ML with analytics and simulation, which implies dealing with uncertainty in the data and models, leading …

HPDaSc participants

LNCC, Petrópolis, RJ
Fabio Porto (senior researcher), Kary Ocaña (researcher), Luiz Manoel Gadelha (researcher)
Rafael Pereira (research engineer), Eduardo Pena (postdoc)
PhD students: Anderson Chaves, Gustavo Decarlo, Victor Ribeiro Dornellas
MSc students: Rafael de Souza Terra, Rafael Silva Pereira

COPPE/UFRJ, Rio de Janeiro, RJ
Alvaro Coutinho (professor), Marta Mattoso (professor), Fernando Rochinha (professor)
Renan Souza (research engineer)
PhD students: Debora Pina, Liliane …

HPDaSc Publications

2024

[Lima 2024] Janio Lima, Lucas Tavares, Esther Pacitti, Ferreira, João Ferreira, Ismael Santos, Isabela Siqueira, Diego Carvalho, Fabio Porto, Rafaelli Coutinho, Eduardo Ogasawara. Online Event Detection in Streaming Time Series: Novel Metrics and Practical Insights. International Joint Conference on Neural Networks (IJCNN), 2024.

[Ogasawara 2024] Eduardo Ogasawara, Rebecca Salles, Fabio Porto, Esther Pacitti. Event …

Scientific results

Data analytics

SoftED metrics [Salles 2023a] are a new set of metrics designed for the soft evaluation of event detection methods. They enable the evaluation of both detection accuracy and the degree to which detections represent events. They improve event detection evaluation by associating events and their representative detections, incorporating temporal tolerance in over 36% of experiments …