HPDaSc (High Performance Data Science) is an associated team (“équipe associée”) that has linked Zenith with four teams in the state of Rio de Janeiro (LNCC, COPPE/UFRJ, UFF and CEFET) since January 2020. HPDaSc is headed by Patrick Valduriez (Zenith) and Fabio Porto (LNCC).
Data-intensive science requires the integration of two fairly different paradigms: high-performance computing (HPC) and data science. HPC is compute-centric and focuses on the performance of simulation applications, typically using powerful yet expensive supercomputers, whereas data science is data-centric and focuses on the scalability and fault tolerance of web and cloud applications, using cost-effective clusters of commodity hardware.
In the context of the SciDISC project (associated team 2016-2019) and the Inria Project Lab (IPL) HPC-BigData (2018-2022), we studied various architectures for integrating HPC and big data (post-processing, in-situ, in-transit) for applications in astronomy, life science and agronomy, and geoscience (oil & gas). We learned major lessons, which are the basis for this new project:
- Importance of real-time analytics for making critical, high-consequence decisions, e.g. preventing useless drilling based on a driller’s real-time data and real-time visualization of simulated data;
- Effectiveness of machine learning (ML) in dealing with scientific data, e.g. computing Probability Density Functions (PDFs) over simulated seismic data using Spark;
- Effectiveness of the Human-In-the-Loop (HIL) paradigm in combination with provenance data in scientific workflows, e.g. to avoid useless, long-duration computations on a supercomputer;
- Importance of working closely with domain experts to interpret scientific data.
This project addresses the grand challenge of High Performance Data Science (HPDaSc) by developing architectures and methods that combine simulation, ML and data analytics.