HPDaSc (High Performance Data Science)

HPDaSc (High Performance Data Science) is an associated team (“équipe associée”) between Zenith and four teams in the state of Rio de Janeiro (LNCC, COPPE/UFRJ, UFF and CEFET), running since January 2020. HPDaSc is headed by Patrick Valduriez (Zenith) and Fabio Porto (LNCC).

Data-intensive science requires the integration of two fairly different paradigms: high-performance computing (HPC) and data science. HPC is compute-centric and focuses on the performance of simulation applications, typically using powerful yet expensive supercomputers, whereas data science is data-centric and focuses on the scalability and fault tolerance of web and cloud applications, using cost-effective clusters of commodity hardware.

In the context of the SciDISC project (associated team 2016-2019) and the Inria Project Lab (IPL) HPC-BigData (2018-2022), we studied various architectures for integrating HPC and big data (post-processing, in-situ, in-transit) for applications in astronomy, life science and agronomy, and geoscience (oil & gas). We learned major lessons, which are the basis for this new project:

  • Importance of realtime analytics to make critical high-consequence decisions, e.g. preventing useless drilling based on a driller’s realtime data and realtime visualization of simulated data;
  • Effectiveness of machine learning (ML) to deal with scientific data, e.g. computing Probability Density Functions (PDFs) over simulated seismic data using Spark;
  • Effectiveness of the Human-In-the-Loop (HIL) paradigm in combination with provenance data in scientific workflows, e.g. to avoid useless, long-duration computations in a supercomputer;
  • Significance of working closely with domain experts in order to interpret scientific data.

This project addresses the grand challenge of High Performance Data Science (HPDaSc) by developing architectures and methods to combine simulation, ML and data analytics.









Realtime data analytics

A novel spatial-time motif discovery method (CSA) that can find patterns that frequently occur in a constrained space and time. CSA includes a multidimensional criterion that enables ranking motifs according to their relevance [Borges 2020a]. The method is available as an R package named ST-Motif [Borges 2020b]. A Generalized Spatial-Time Sequence Miner (G-STSM), …


DEXA 2020 best paper award

The paper “Distributed Caching of Scientific Workflows in Multisite Cloud” by Gaëtan Heidsieck, Daniel de Oliveira, Esther Pacitti, Christophe Pradal, François Tardieu, and Patrick Valduriez received the best paper award at the 31st International Conference on Database and Expert Systems Applications (DEXA), Springer, Sep 2020. The work has been done in …

HPDaSc meetings

2020

  • 22 September 2020: First (Virtual) Workshop of the HPDaSc project
  • 29 September 2020: SBBD 2020 meeting
  • 23 November 2020: Second (Virtual) Workshop of the HPDaSc project
  • 10 December 2020: Third (Virtual) Workshop of the HPDaSc project

2021

  • 12 May 2021: Fourth (Virtual) Workshop of the HPDaSc project

HPDaSc objectives

Based on lessons learned with previous projects (SciDISC, HPCBD), we address the following requirements for high-performance data science (HPDaSc):

  • Support realtime analytics and visualization (in either in situ or in transit architectures) to help make high-impact online decisions;
  • Combine ML with analytics and simulation, which implies dealing with uncertainty in the data and models, leading …

HPDaSc participants

LNCC, Petrópolis, RJ

  • Fabio Porto (senior researcher)
  • Kary Ocaña (researcher)
  • Luiz Manoel Gadelha (researcher)
  • Yania Molina Souto (postdoc)
  • PhD students: Anderson Chaves, Maria Luiza Modelli
  • Master students: Rafael Pereira, Henrique Matheus Ferreira

COPPE/UFRJ, Rio de Janeiro, RJ

  • Alvaro Coutinho (professor)
  • Marta Mattoso (professor)
  • Fernando Rochinha (postdoc)
  • Renan Souza (postdoc)
  • PhD students: Debora Pina, Liliane Neves, Gabriel Barros, …