
HPDaSc objectives

Based on lessons learned with previous projects (SciDISC, HPCBD), we address the following requirements for high-performance data science (HPDaSc):

  • Support realtime analytics and visualization (in either in situ or in transit architectures) to help make high-impact online decisions;
  • Combine ML with analytics and simulation, which implies dealing with uncertainty in the data and models, leading to a physics-aware ML approach;
  • Support scientific workflows that combine analytics, modeling and simulation, and exploit provenance and human-in-the-loop (HIL) interaction in realtime for efficient workflow execution.
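The third requirement can be illustrated with a minimal sketch of realtime provenance capture during workflow execution. The names below (`ProvenanceLog`, `run_task`) are illustrative only and do not reflect the DfAnalyzer API; the point is that each task's inputs, outputs and timing become queryable while the workflow is still running, which is what enables HIL steering.

```python
# Minimal sketch of realtime provenance capture for a two-task workflow.
# ProvenanceLog and run_task are hypothetical names, not the DfAnalyzer API.
import time
from dataclasses import dataclass, field

@dataclass
class ProvenanceLog:
    records: list = field(default_factory=list)

    def record(self, task, inputs, outputs, start, end):
        # One provenance record per executed task, appended as it finishes.
        self.records.append({
            "task": task, "inputs": inputs, "outputs": outputs,
            "start": start, "end": end,
        })

def run_task(log, name, fn, *inputs):
    # Wrap a task so its provenance is captured at execution time.
    start = time.time()
    outputs = fn(*inputs)
    log.record(name, list(inputs), outputs, start, time.time())
    return outputs

log = ProvenanceLog()
x = run_task(log, "simulate", lambda n: [i * 0.5 for i in range(n)], 4)
y = run_task(log, "analyze", lambda xs: sum(xs) / len(xs), x)

# The log can be queried mid-workflow, e.g. by a scientist steering the run.
print([r["task"] for r in log.records])  # → ['simulate', 'analyze']
```

In a real system the log would be a database updated asynchronously so that queries do not slow down the workflow; the sketch only shows the capture points.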

To address these requirements, we will exploit new distributed and parallel architectures and design new techniques for ML, realtime analytics and scientific workflow management. The architectures will be in the context of multisite cloud, with heterogeneous data centers comprising data nodes, compute nodes and GPUs. We will validate our techniques with major software systems on real applications with real data. The main systems will be OpenAlea and Pl@ntnet from Zenith, and DfAnalyzer and SAVIME from the Brazilian side. The main applications will be in agronomy and plant phenotyping (with plant biologists from CIRAD and INRA), biodiversity informatics (with biodiversity scientists from LNCC and botanists from CIRAD), and oil & gas (with geoscientists from UFRJ and Petrobras).

Our approach is to capitalize on the principles of distributed data management [Özsu & Valduriez 2019], scientific workflow management [Oliveira, Liu & Pacitti 2019] and machine learning, in particular deep learning. We will continue exploring declarative languages to manipulate data and workflows and to perform optimization, and environments such as clusters and clouds for scalability and performance. We will also exploit machine learning, probability and statistics for high-dimensional data processing, data analytics and data search.
