HPDaSc (High Performance Data Science)

HPDaSc (High Performance Data Science) is an associated team (“équipe associée”) between Zenith and four teams in the state of Rio de Janeiro (LNCC, COPPE/UFRJ, UFF and CEFET), running since January 2020. HPDaSc is headed by Patrick Valduriez (Zenith) and Fabio Porto (LNCC).

Data-intensive science requires the integration of two fairly different paradigms: high-performance computing (HPC) and data science. HPC is compute-centric and focuses on the performance of simulation applications, typically running on powerful yet expensive supercomputers, whereas data science is data-centric and focuses on the scalability and fault tolerance of web and cloud applications running on cost-effective clusters of commodity hardware.

In the context of the SciDISC project (associated team 2016-2019) and the Inria Project Lab (IPL) HPC-BigData (2018-2022), we studied various architectures for integrating HPC and big data (post-processing, in-situ, in-transit) for applications in astronomy, life science and agronomy, and geoscience (oil & gas). We learned major lessons, which are the basis for this new project:

  • Importance of realtime analytics to make critical high-consequence decisions, e.g. preventing useless drilling based on a driller’s realtime data and realtime visualization of simulated data;
  • Effectiveness of machine learning (ML) to deal with scientific data, e.g. computing Probability Density Functions (PDFs) over simulated seismic data using Spark;
  • Effectiveness of the Human-In-the-Loop (HIL) paradigm in combination with provenance data in scientific workflows, e.g. to avoid useless, long-duration computations in a supercomputer;
  • Significance of working closely with domain experts in order to interpret scientific data.

This project addresses the grand challenge of High Performance Data Science (HPDaSc) by developing architectures and methods to combine simulation, ML and data analytics.

Permanent link to this article: https://team.inria.fr/zenith/hpdasc/

Highlights

SBBD 2024 Best Paper award: the paper “Cutoff Frequency Adjustment for FFT-Based Anomaly Detectors” by Ellen Silva, Helga Balbi, Esther Pacitti, Fabio Porto, Joel Santos, and Eduardo Ogasawara received the best paper award at SBBD 2024 (Brazilian Symposium on Databases), Florianopolis, Brazil.

SBBD 2024 keynote: Patrick Valduriez gave the keynote talk on …

HPDaSc meetings and seminars

  • 16 December 2024: First Workshop of the LNCC AI Institute, LNCC, Petropolis, Brazil.
  • 10-11 September 2024: Inria-Brasil (hybrid) workshop on Digital Science and Agronomy, Inria Montpellier.
  • 16 August 2024: 9th Workshop of the HPDaSc project, CEFET-RJ, Rio de Janeiro, Brazil.
  • 8 August 2024: Seminar by Patrick Valduriez on “Data Science and Innovation”, CEFET-RJ, Rio de …

HPDaSc objectives

Based on lessons learned from previous projects (SciDISC, HPCBD), we address the following requirements for high-performance data science (HPDaSc):

  • Support realtime analytics and visualization (in either in situ or in transit architectures) to help make high-impact online decisions;
  • Combine ML with analytics and simulation, which implies dealing with uncertainty in the data and models, leading …

HPDaSc participants

LNCC, Petrópolis, RJ

Fabio Porto (senior researcher), Kary Ocaña (researcher), Luiz Manoel Gadelha (researcher), Eduardo Pena (postdoc), Vinicius Kreischer (engineer), Douglas E. de Oliveira (engineer), Rafael Pereira (engineer), Rocio Zorilla (postdoc)

PhD students: Anderson Chaves, Victor Ribeiro Dornellas, Henrique Matheus Ferreira, Gabriela Moraes, Rafael de Souza Terra

MSc students: Mauro Sergio Moura

COPPE/UFRJ, Rio de Janeiro, RJ

Alvaro …

HPDaSc publications

2024

[Lima 2024] Janio Lima, Lucas Tavares, Esther Pacitti, Ferreira, João Ferreira, Ismael Santos, Isabela Siqueira, Diego Carvalho, Fabio Porto, Rafaelli Coutinho, Eduardo Ogasawara. Online Event Detection in Streaming Time Series: Novel Metrics and Practical Insights. International Joint Conference on Neural Networks (IJCNN), 1-8, 2024.

[Ogasawara 2024] Eduardo Ogasawara, Rebecca Salles, Fabio Porto, Esther Pacitti. …

Scientific results

Data analytics

Novel metrics for evaluating event detection methods: SoftED metrics [Salles 2024, Salles 2023a], which assess both detection accuracy and the degree to which detections represent events; new metrics (detection probability and detection lag) [Lima 2024] for online event detection in streaming time series, exploring the impact of configurable batches on detection …