Achievements

Data analytics

  • SoftED metrics [Salles 2023a], a new set of metrics designed for the soft evaluation of event detection methods, which enable the evaluation of both detection accuracy and the degree to which detections represent events. They associate events with their representative detections and incorporate temporal tolerance, improving the evaluation of event detection in over 36% of experiments compared to the usual classification metrics (a tolerance-scoring sketch appears after this list).

  • STMotif Explorer [Borges 2023], a spatial-time motif analysis system that interactively discovers and visualizes spatial-time motifs in different domains, offering insights to users. STMotif Explorer enables users to implement and deploy novel spatiotemporal motif detection techniques and run them across various domains.

  • The GSTSM R package [Castro 2023], the first tool for mining spatial time-stamped sequences in constrained space and time. It allows users to search for spatio-temporal patterns that are not frequent in the entire database but are dense in restricted time-space intervals, making it possible to find non-trivial patterns that common data mining tools would miss (see the mining sketch after this list).

  • A novel implementation of TSPred for time series prediction in association with data preprocessing [Salles 2022a, 2022b, 2023b]. TSPred establishes a prediction process that seamlessly integrates nonstationary time series transformations with state-of-the-art statistical and machine learning methods. It is made available as an R package providing functions for defining and conducting time series prediction, including data pre- and post-processing, decomposition, modeling, prediction, and accuracy assessment (the pipeline is sketched after this list).

  • A comprehensive review of the state of the art in learning-based analytics for the Edge-to-Cloud Continuum [Rosendo 2022a]. The main simulation, emulation, and deployment systems and testbeds available today for experimental research on the Edge-to-Cloud Continuum are also surveyed, with special attention to experiment reproducibility.
  • An analysis of recent studies on the detection of anomalies in time series [Borges 2021a, Borges 2021b]. The goal is to provide an introduction to anomaly detection and a survey of recent research and challenges (a baseline detector is sketched after this list).
  • A Generalized Spatial-Time Sequence Miner (G-STSM), which extends the previously proposed STSM approach. STSM was limited to one-dimensional space and time; G-STSM generalizes the problem to three-dimensional space and time. It also outperforms other methods in speed, being up to three times faster [Castro 2021].
  • A novel spatial-time motif discovery method (CSA) that can find patterns that frequently occur in a constrained space and time. CSA includes a multidimensional criterion that enables ranking motifs according to their relevance [Borges 2020a]. The method is made available as an R package named ST-Motif [Borges 2020b].
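
The following minimal Python sketch illustrates the idea behind SoftED's temporal tolerance (see the SoftED item above); the linear decay, function names, and parameter values are our illustrative assumptions, not SoftED's published formulation.

```python
# Illustrative sketch (not the SoftED implementation): soft evaluation of
# event detection with temporal tolerance. A detection is softly matched to
# an event if it falls within +/- `tolerance` time steps of it.

def soft_scores(events, detections, tolerance=3):
    """Return soft true-positive credit for each detection.

    events, detections: sorted lists of integer time indices.
    A detection earns a score in [0, 1] that decays linearly with its
    distance to the nearest event, reaching 0 beyond the tolerance.
    """
    scores = []
    for d in detections:
        nearest = min((abs(d - e) for e in events), default=None)
        if nearest is None:
            scores.append(0.0)
        else:
            scores.append(max(0.0, 1.0 - nearest / tolerance))
    return scores

def soft_f1(events, detections, tolerance=3):
    scores = soft_scores(events, detections, tolerance)
    tp = sum(scores)                      # soft true positives
    precision = tp / len(detections) if detections else 0.0
    recall = tp / len(events) if events else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A detection at t=11 for an event at t=10 gets partial credit instead of
# being counted as a hard false positive.
print(soft_f1(events=[10, 50], detections=[11, 49, 80], tolerance=3))
```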
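
The next sketch illustrates the intuition behind GSTSM-style constrained mining: a pattern that is rare globally can still be dense inside a small time window. The window scan and thresholds below are a deliberate simplification of ours, not the GSTSM algorithm itself.

```python
from collections import defaultdict

# Illustrative sketch: a label may be infrequent in the whole series yet
# dense inside one restricted time interval. Scan fixed-size windows and
# report labels whose local support reaches a threshold.

def dense_patterns(events, window=10, min_support=3):
    """events: list of (time, label). Returns {(window_start, label): count}
    for labels whose count within a window reaches min_support."""
    hits = {}
    times = [t for t, _ in events]
    for start in range(0, max(times) + 1, window):
        counts = defaultdict(int)
        for t, label in events:
            if start <= t < start + window:
                counts[label] += 1
        for label, c in counts.items():
            if c >= min_support:
                hits[(start, label)] = c
    return hits

events = [(1, "A"), (2, "A"), (3, "A"), (25, "A"), (70, "B")]
print(dense_patterns(events))   # "A" is dense in window [0, 10) only
```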
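
The prediction process TSPred establishes can be summarized as transform, model, predict, invert. The sketch below mirrors that workflow in Python with a differencing transform and a toy AR(1) model; TSPred itself is an R package, and none of these names come from its API.

```python
import numpy as np

# Illustrative workflow: nonstationary transform -> model fit/predict ->
# inverse transform, mirroring the prediction process TSPred formalizes.

def difference(x):
    return np.diff(x), x[-1]               # transformed series + state to invert

def inv_difference(preds, last_value):
    return last_value + np.cumsum(preds)   # undo differencing step by step

def fit_ar1(x):
    # Least-squares AR(1) on the transformed series: x[t] ~ phi * x[t-1]
    return np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])

def predict_ar1(phi, last, horizon):
    preds, cur = [], last
    for _ in range(horizon):
        cur = phi * cur
        preds.append(cur)
    return np.array(preds)

# End-to-end on a trending (nonstationary) series
t = np.arange(200, dtype=float)
series = 0.5 * t + np.sin(t / 5.0)

diffed, last_value = difference(series)            # preprocessing
phi = fit_ar1(diffed)                              # modeling
pred_diffs = predict_ar1(phi, diffed[-1], 10)      # prediction
forecast = inv_difference(pred_diffs, last_value)  # postprocessing
print(forecast)
```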
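
Finally, a classic baseline covered in anomaly-detection surveys such as [Borges 2021a]: flag points that deviate from a sliding-window mean by more than k standard deviations. Function names and defaults are ours, not taken from the surveyed systems.

```python
import numpy as np

# Illustrative baseline: rolling z-score anomaly detection.

def rolling_zscore_anomalies(series, window=20, k=3.0):
    series = np.asarray(series, dtype=float)
    flags = np.zeros(len(series), dtype=bool)
    for t in range(window, len(series)):
        hist = series[t - window:t]
        mu, sigma = hist.mean(), hist.std()
        if sigma > 0 and abs(series[t] - mu) > k * sigma:
            flags[t] = True
    return np.flatnonzero(flags)

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 300)
x[150] += 8.0                        # inject a point anomaly
print(rolling_zscore_anomalies(x))   # should include index 150
```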

Machine learning and simulation

  • A framework to reduce computational costs while maintaining adequate accuracy of ML models [Ribeiro 2023]. It identifies “subdomains” within the input space and trains local models that produce better predictions for samples from their specific subdomain, instead of training a single global model on the full dataset. Our experimental validation on two real-world datasets shows that subset modeling improves predictive performance compared to a single global model and allows data-efficient training [Zorrilla 2022] (a routing sketch appears after this list).

  • We extended the DJEnsemble approach to the selection of time-series analysis algorithms, such as ARIMA [Zorrilla 2022]. Moreover, we implemented the DJEnsemble query technique in the SAVIME system [Chaves da Silva 2022].
  • A data-driven approach for selecting the pre-trained temporal model to be applied at each query point [Zorrilla 2022], which assigns a model to a point according to the similarity between the training and input time series (sketched after this list). The approach is used in Gypscie to avoid training a different model for each domain point, thus saving model training time.
  • The deployment of the first version of Gypscie [Porto 2022] which supports the entire ML lifecycle and enables AI model reuse and import from other frameworks. It is integrated with SAVIME for querying tensor data.
  • The use of workflow provenance techniques to build a holistic view supporting the lifecycle of scientific machine learning [Souza 2020b, Souza 2022]. The experiments show that the design decisions enable queries that integrate domain semantics with ML models while keeping overhead low (<1%) and preserving high scalability and performance (a capture sketch appears after this list).
  • A provenance data-based approach for the collection and analysis of configuration data in deep neural networks, with an experimental validation using Keras and a real application that provides evidence of the flexibility and efficiency of the approach, including for physics-informed neural networks [Pina 2020, Pina 2021, Kunstmann 2021, Silva 2021c].
  • The extension of the SAVIME database system in support of machine learning models. We implemented operators that enable the registration and invocation of ML models as part of a SAVIME query expression [Lustosa 2021].
  • SUQ2: a new method based on the Generalized Lambda Distribution (GLD) function to perform uncertainty quantification queries in parallel over large spatio-temporal simulation results [Liu 2020, Lemus 2020].
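
The subset-modeling idea from [Ribeiro 2023] can be sketched as: partition the input space, train one local model per subdomain, and route each query to its subdomain's model. The class below is an illustrative assumption (scikit-learn KMeans plus linear models), not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Illustrative subset modeling: one local model per input-space subdomain.

class SubdomainModel:
    def __init__(self, n_subdomains=4):
        self.clusterer = KMeans(n_clusters=n_subdomains, n_init=10, random_state=0)
        self.models = {}

    def fit(self, X, y):
        labels = self.clusterer.fit_predict(X)
        for c in np.unique(labels):
            mask = labels == c
            self.models[c] = LinearRegression().fit(X[mask], y[mask])
        return self

    def predict(self, X):
        labels = self.clusterer.predict(X)   # route each sample to its subdomain
        out = np.empty(len(X))
        for c, model in self.models.items():
            mask = labels == c
            if mask.any():
                out[mask] = model.predict(X[mask])
        return out

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.where(X[:, 0] > 0, 3 * X[:, 1], -3 * X[:, 1])  # regime-dependent target
print(SubdomainModel().fit(X, y).predict(X[:5]))
```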
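
The similarity-driven selection of pre-trained temporal models [Zorrilla 2022] can be illustrated as follows; the z-normalized Euclidean distance and the model-zoo structure are our simplifications, not Gypscie's actual mechanism.

```python
import numpy as np

# Illustrative model selection: apply to a query series the pre-trained
# model whose training data it most resembles.

def znorm(x):
    x = np.asarray(x, dtype=float)
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def select_model(query, model_zoo):
    """model_zoo: list of (model, training_signature) pairs.
    Picks the model with the smallest z-normalized Euclidean distance
    between the query window and the training signature."""
    q = znorm(query)
    dists = [np.linalg.norm(q - znorm(sig[: len(q)])) for _, sig in model_zoo]
    return model_zoo[int(np.argmin(dists))][0]

# Two toy "models" trained on different regimes
zoo = [
    (lambda w: w[-1],             np.sin(np.linspace(0, 20, 100))),  # periodic
    (lambda w: 2 * w[-1] - w[-2], np.linspace(0, 10, 100)),          # trending
]
query = np.sin(np.linspace(0, 6, 30))
model = select_model(query, zoo)
print(model(query))
```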
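
The provenance-capture pattern underlying the ML lifecycle work [Souza 2022] can be sketched as a decorator that logs each task's inputs, outputs, and timing; the record layout is hypothetical and much simpler than ProvLake's or DfAnalyzer's data models.

```python
import functools, json, time

# Illustrative provenance capture for ML workflow steps.

PROV_LOG = []

def provenance(task_name):
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.time()
            result = fn(*args, **kwargs)
            PROV_LOG.append({
                "task": task_name,
                "args": repr(args),
                "kwargs": repr(kwargs),
                "result": repr(result),
                "elapsed_s": round(time.time() - start, 6),
            })
            return result
        return inner
    return wrap

@provenance("train")
def train(lr, epochs):
    return {"loss": 0.1 / (lr * epochs)}   # stand-in for real training

train(lr=0.01, epochs=10)
print(json.dumps(PROV_LOG, indent=2))      # queryable record of the run
```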

Scientific workflow management

  • An open service-based architecture, Life Science Workflow Services (LifeSWS) [Akbarinia 2023], which provides data analysis workflow services for life sciences. We illustrate our motivations and rationale for the architecture with real use cases from life science. LifeSWS capitalizes on our collaboration in developing major systems for scientific applications, such as time series prediction with TSPred [Salles 2022a, Salles 2022b], workflows with OpenAlea [Heidsieck 2021], model management with Gypscie [Porto 2022], and querying data across distributed services with DfAnalyzer [Silva 2020] and ProvLake [Souza 2022].
  • KheOps [Rosendo 2023a], a collaborative environment specifically designed to enable cost-effective reproducibility and replicability of Edge-to-Cloud experiments. We illustrate KheOps with a real-life Edge-to-Cloud application. The experimental results show how KheOps helps authors systematically perform repeatable and reproducible experiments on the Grid'5000 and FIT IoT-LAB testbeds.

  • ProvLight [Rosendo 2023b, 2023c], a tool that enables efficient provenance capture on the IoT/Edge, leveraging simplified data models, data compression and grouping, and lightweight transmission protocols to reduce overheads (a batching sketch appears after this list). Our validation at large scale with synthetic workloads on 64 real-life IoT/Edge devices shows that ProvLight outperforms state-of-the-art systems such as ProvLake [Souza 2022] and DfAnalyzer [Silva 2020] on resource-constrained devices.

  • A methodology to support the optimization of complex workflows on the Edge-to-Cloud Continuum [Rosendo 2021a, Rosendo 2021b, Rosendo 2022b]. Our approach relies on a rigorous analysis of possible configurations in a controlled testbed environment to understand their behavior and related performance trade-offs. We illustrate our methodology with the Pl@ntNet application.
  • SchalaDB [Souza 2021], an architecture with a set of design principles and techniques based on distributed in-memory data management for efficient workflow execution control and user steering. We propose a distributed data design for scalable workflow task scheduling and high availability driven by a parallel and distributed in-memory DBMS. We developed d-Chiron (https://github.com/hpcdb/d-Chiron), a Workflow Management System designed according to SchalaDB’s principles.
  • A distributed solution for the efficient execution of scientific workflows in a multisite cloud through adaptive caching [Heidsieck 2021]. We implemented our solution in the OpenAlea workflow system, together with cache-aware distributed scheduling algorithms (a caching sketch appears after this list). Our experimental evaluation in a three-site cloud with the Phenomenal workflow shows that our solution yields major performance gains, reducing total time by up to 42% when 60% of the input data is the same for each new execution.
  • FReeP (Feature Recommender from Preferences), a parameter value recommendation method designed to suggest values for workflow parameters while taking past user preferences into account [Silva 2021a]. FReeP is based on machine learning techniques, in particular preference learning (a recommender sketch appears after this list).
  • SaFER (workflow Scheduling with conFidEntity pRoblem), a scheduling approach that considers data confidentiality constraints [Silva 2021b].
  • A distributed solution for the efficient execution of scientific workflows in a multisite cloud through adaptive caching, with extensive validation using the Phenomenal workflow deployed in the OpenAlea workflow system [Heidsieck 2020a, 2020b].
  • An adaptive workflow monitoring approach that combines provenance data monitoring and computational steering to support users in steering a running workflow and reducing subsets of datasets online, with an experimental validation in oil and gas on a 936-core cluster showing major reductions in execution time and in the data processed [Souza 2020a].
  • DfAnalyzer: a tool for monitoring, debugging, and analyzing dataflows generated by Computational Science and Engineering (CSE) applications. The performance evaluation of CSE executions for a complex multiphysics application shows that DfAnalyzer adds negligible overhead to total elapsed time [Silva 2020].
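
The adaptive-caching idea of [Heidsieck 2020a, Heidsieck 2021] can be sketched as fingerprinting each task invocation and reusing stored results on re-execution; the fingerprinting scheme below is an illustrative assumption, not OpenAlea's implementation.

```python
import hashlib, pickle

# Illustrative cache-aware execution: intermediate task results are cached
# under a fingerprint of the task name and its inputs, so re-executions
# with partially identical inputs skip recomputation.

CACHE = {}

def fingerprint(task_name, *inputs):
    payload = pickle.dumps((task_name, inputs))
    return hashlib.sha256(payload).hexdigest()

def run_cached(task_name, fn, *inputs):
    key = fingerprint(task_name, *inputs)
    if key in CACHE:
        print(f"cache hit: {task_name}")
        return CACHE[key]
    result = fn(*inputs)
    CACHE[key] = result
    return result

def segment(image):            # stand-in for an expensive Phenomenal-style step
    return [p for p in image if p > 0]

print(run_cached("segment", segment, (0, 3, 0, 7)))
print(run_cached("segment", segment, (0, 3, 0, 7)))  # second call hits cache
```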
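
FReeP's preference-based recommendation [Silva 2021a] can be illustrated with a simple vote among past runs compatible with the user's current preferences; the scoring below is a stand-in, not the published preference-learning algorithm, and the parameter names are hypothetical.

```python
from collections import Counter

# Illustrative preference-based parameter recommendation: score past runs
# by how many current preferences they satisfy, then vote among the top-k.

PAST_RUNS = [
    {"aligner": "bwa", "threads": 8, "quality": 30},
    {"aligner": "bwa", "threads": 8, "quality": 20},
    {"aligner": "bowtie", "threads": 4, "quality": 30},
    {"aligner": "bwa", "threads": 16, "quality": 30},
]

def recommend(target_param, preferences, past_runs, k=3):
    """Return the most frequent value of `target_param` among the k past
    runs most compatible with the user's current preferences."""
    scored = sorted(
        past_runs,
        key=lambda run: sum(run.get(p) == v for p, v in preferences.items()),
        reverse=True,
    )
    votes = Counter(run[target_param] for run in scored[:k] if target_param in run)
    return votes.most_common(1)[0][0]

print(recommend("quality", {"aligner": "bwa"}, PAST_RUNS))  # -> 30
```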
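
ProvLight's overhead-reduction levers [Rosendo 2023b], grouping and compressing provenance records before transmission, can be sketched as follows; the record layout and compression choice are ours, not ProvLight's data model or protocol.

```python
import json, zlib

# Illustrative edge-side provenance batching: group records, serialize
# compactly, and compress the payload before transmission.

def make_batch(records):
    payload = json.dumps(records, separators=(",", ":")).encode()
    return zlib.compress(payload, level=9)

def read_batch(blob):
    return json.loads(zlib.decompress(blob).decode())

records = [
    {"task": f"sense-{i}", "t": 1700000000 + i, "out": i * 0.5}
    for i in range(100)
]
blob = make_batch(records)
raw_size = len(json.dumps(records).encode())
print(f"raw {raw_size} bytes -> compressed {len(blob)} bytes")
assert read_batch(blob) == records   # lossless round trip
```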
