
Achievements

Data analytics

  • A novel method for detecting events in nonstationary time series [Lima 2022]. The method, called Forward and Backward Inertial Anomaly Detector (FBIAD), detects observations that are inconsistent with the temporal inertia of their surroundings, both forward and backward.

  • A comprehensive review of the state of the art in learning-based analytics for the Edge-to-Cloud Continuum [Rosendo 2022a]. The review also surveys the main simulation, emulation, and deployment systems and testbeds currently available for experimental research on the Edge-to-Cloud Continuum, with special attention to experiment reproducibility.
  • A novel implementation of TSPred for time series prediction in association with data preprocessing [Salles 2022a, Salles 2022b]. TSPred establishes a prediction process that seamlessly integrates nonstationary time series transformations with state-of-the-art statistical and machine learning methods. It is available as an R package, which provides functions for defining and conducting time series prediction, including data pre- and postprocessing, decomposition, modeling, prediction, and accuracy assessment.

  • An analysis of recent studies on the detection of anomalies in time series [Borges 2021]. The goal is to provide an introduction to anomaly detection and a survey of recent research and challenges.
  • A novel spatial-time motif discovery method (CSA) that finds patterns occurring frequently in a constrained space and time. CSA includes a multidimensional criterion for ranking motifs according to their relevance [Borges 2020a]. The method is available as an R package named ST-Motif [Borges 2020b].
  • A Generalized Spatial-Time Sequence Miner (G-STSM), which extends the previously proposed STSM approach. STSM was limited to one-dimensional space and time; G-STSM generalizes the problem to three-dimensional space and time. It also outperforms other methods in speed, being up to three times faster [Castro 2020, Castro 2021].
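
The idea behind forward and backward inertia can be illustrated with a minimal sketch. This is not the published FBIAD algorithm: it assumes that inertia is approximated by a moving average over a fixed window on each side of an observation, and that an observation counts as an event when it deviates strongly from both. The function name, the `window` and `threshold` parameters, and the median-based scale are all hypothetical choices made for illustration.

```python
from statistics import mean, median


def inertial_anomalies(series, window=5, threshold=3.0):
    """Return indices whose values deviate from both the backward and the
    forward moving average (illustrative only, not the FBIAD method)."""
    x = [float(v) for v in series]
    idx = range(window, len(x) - window)

    # Absolute residual of each point against the mean of the preceding
    # window (backward inertia) and of the following window (forward inertia).
    res_b = [abs(x[i] - mean(x[i - window:i])) for i in idx]
    res_f = [abs(x[i] - mean(x[i + 1:i + 1 + window])) for i in idx]
    if not res_b:
        return []  # series too short for the chosen window

    # Assumed robust scale: the median residual in each direction.
    limit_b = threshold * (median(res_b) + 1e-12)
    limit_f = threshold * (median(res_f) + 1e-12)

    # Requiring both directions to disagree avoids flagging the neighbours
    # of a spike, whose windows are contaminated on one side only.
    return [i for i, b, f in zip(idx, res_b, res_f)
            if b > limit_b and f > limit_f]


# A series with a linear trend (nonstationary mean) and one spike at index 50.
trend = [0.1 * i for i in range(100)]
trend[50] += 5.0
print(inertial_anomalies(trend))  # prints [50]
```

Because the residuals are computed against local windows rather than a global mean, the trend itself is not reported as anomalous in this example.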

Machine learning and simulation

  • An extension of the DJEnsemble approach to the selection of time series analysis algorithms, such as ARIMA [Zorrilla 2022]. Moreover, we implemented the DJEnsemble query technique in the SAVIME system [Chaves da Silva 2022].

  • A data-driven approach for selecting the pre-trained temporal model to apply at each query point [Zorrilla 2022]. A model is applied to a point according to the similarity between the model's training time series and the input time series. The approach is used in Gypscie to avoid training a different model for each domain point, thus saving model training time.

  • The deployment of the first version of Gypscie [Porto 2022], which supports the entire ML lifecycle and enables AI model reuse and import from other frameworks. It is integrated with SAVIME for querying tensor data.
  • The use of workflow provenance techniques to build a holistic view that supports the lifecycle of scientific machine learning [Souza 2020b, Souza 2022]. The experiments show that the design decisions enable queries that integrate domain semantics with ML models while keeping low overhead (<1%), high scalability, and high performance.
  • A provenance-based approach for collecting and analyzing configuration data in deep neural networks, including physics-informed neural networks. An experimental validation with Keras and a real application provides evidence of the flexibility and efficiency of the approach [Pina 2020, Pina 2021, Kunstmann 2021, Silva 2021c].
  • SUQ2: a new method based on the Generalized Lambda Distribution (GLD) function to perform uncertainty quantification queries in parallel over large spatio-temporal simulation results [Liu 2020, Lemus 2020].
  • The extension of the SAVIME database system to support machine learning models. We implemented operators that enable the registration and invocation of ML models as part of a SAVIME query expression [Lustosa 2020a].
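
The similarity-based selection of pre-trained models described above can be sketched as follows. This is a simplified illustration, not the approach of [Zorrilla 2022] nor the Gypscie API: it assumes that each candidate model is represented by one training series of the same length as the input, and that similarity is measured by Euclidean distance between z-normalized series. The names `z_normalize`, `select_model`, `seasonal_model`, and `trend_model` are hypothetical.

```python
import math
from statistics import mean, pstdev


def z_normalize(series):
    """Rescale a series to zero mean and unit variance."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)  # constant series carries no shape
    return [(v - mu) / sigma for v in series]


def select_model(query_series, candidates):
    """Pick the candidate whose training series is closest to the query.

    `candidates` maps a model name to the series the model was trained on.
    All series are assumed to have the same length as the query.
    """
    q = z_normalize(query_series)

    def distance(training_series):
        return math.dist(q, z_normalize(training_series))

    name, _ = min(candidates.items(), key=lambda kv: distance(kv[1]))
    return name


# Two hypothetical pre-trained models and an input series at a query point.
t = [2 * math.pi * k / 50 for k in range(50)]
candidates = {
    "seasonal_model": [2 * math.sin(v) + 1 for v in t],
    "trend_model": list(t),
}
query = [math.sin(v) for v in t]
print(select_model(query, candidates))  # prints seasonal_model
```

Normalizing before comparing makes the choice depend on the shape of the series rather than on its offset or amplitude, which is why the scaled and shifted sine is matched here. Reusing the closest existing model in this way is what avoids training a new model for each domain point.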

Scientific workflow management

  • A methodology to support the optimization of complex workflows on the Edge-to-Cloud Continuum [Rosendo 2021a, Rosendo 2021b, Rosendo 2022b]. Our approach relies on a rigorous analysis of possible configurations in a controlled testbed environment to understand their behaviour and the related performance trade-offs. We illustrate our methodology with our Pl@ntNet application.
  • SchalaDB [Souza 2021], an architecture with a set of design principles and techniques based on distributed in-memory data management for efficient workflow execution control and user steering. We propose a distributed data design for scalable workflow task scheduling and high availability driven by a parallel and distributed in-memory DBMS. We developed d-Chiron (https://github.com/hpcdb/d-Chiron), a Workflow Management System designed according to SchalaDB’s principles.
  • FReeP (Feature Recommender from Preferences), a parameter value recommendation method designed to suggest values for workflow parameters, taking into account past user preferences [Silva 2021a]. FReeP is based on machine learning techniques, in particular preference learning.
  • SaFER (workflow Scheduling with conFidEntiality pRoblem), a scheduling approach that considers data confidentiality constraints [Silva 2021b].
  • A distributed solution for the efficient execution of scientific workflows in a multisite cloud through adaptive caching, with extensive validation using the Phenomenal workflow deployed in the OpenAlea workflow system [Heidsieck 2020a, Heidsieck 2020b, Heidsieck 2021].
  • An adaptive workflow monitoring approach that combines provenance data monitoring and computational steering to help users steer a running workflow and reduce subsets from datasets online. An experimental validation in oil and gas on a 936-core cluster shows major reductions in execution time and in the data processed [Souza 2020a].
  • DfAnalyzer: a tool for monitoring, debugging, and analyzing dataflows generated by Computational Science and Engineering (CSE) applications. The performance evaluation of CSE executions for a complex multiphysics application shows that DfAnalyzer adds negligible overhead to the total elapsed time [Silva 2020].

Permanent link to this article: https://team.inria.fr/zenith/hpdasc/achievements/