Seminar by Dennis Shasha, NYU “BugDoc: Algorithms to Debug Computational Processes”, 25 May 2022

Journée Zenith, Golfe de Coulondres, 25 May 2022

BugDoc: Algorithms to Debug Computational Processes

Dennis Shasha, New York University, USA (joint work with Raoni Lourenco and Juliana Freire)

Data analysis for scientific experiments and enterprises, large-scale simulations, and machine learning tasks all entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous outputs, the pipeline may fail to execute or produce incorrect results. Inferring the root cause(s) of such failures is challenging, typically requiring significant time and human effort while remaining error-prone. We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures, including in data inputs. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our experimental data and processing software are available for use, reproducibility, and enhancement.

Permanent link to this article: https://team.inria.fr/zenith/seminar-by-dennis-shasha-nyu-bugdoc-algorithms-to-debug-computational-processes-25-may-2022/

Seminar by Fabio Porto, LNCC “ML Model Management in Gypscie”, 25 May 2022

Journée Zenith, Golfe de Coulondres, 25 May 2022

ML Model Management in Gypscie

Fabio Porto, LNCC, Petropolis, Brazil

To realize the full potential of data science, ML models (or models for short) must be built, combined and ensembled, which can be very complex as there can be many models to select from. Furthermore, they should be shared and reused, in particular, in different execution environments such as HPC or Spark clusters. To address this problem, we propose Gypscie, a new framework that supports the entire ML lifecycle and enables model reuse and import from other frameworks. The approach behind Gypscie is to combine several rich capabilities for model and data management, and model execution, which are typically provided by different tools, in a single framework. Finally, Gypscie interfaces with multiple execution environments to run ML tasks, e.g., an HPC system such as the Santos Dumont supercomputer at LNCC or a Spark cluster.

Permanent link to this article: https://team.inria.fr/zenith/gypscie-25-may-2022/

Sixth Workshop of the HPDaSc project, 15 August 2022, LNCC, Petropolis, Brazil

See the program here.

Permanent link to this article: https://team.inria.fr/zenith/sixth-workshop-of-the-hpdasc-project-15-august-2022-lncc-petropolis-brazil/

Seminar at CEFET, Rio de Janeiro, by Patrick Valduriez, “Innovation: startup strategies”, 5 August 2022.

See the announcement here.

Permanent link to this article: https://team.inria.fr/zenith/seminar-at-cefet-rio-de-janeiro-by-patrick-valduriez-innovation-startup-strategies-5-august-2022/

ICML 2022: paper by Camille Garcin et al.

The paper “Stochastic smoothing of the top-K calibrated hinge loss for deep imbalanced classification” by Camille Garcin, Maximilien Servajean, Alexis Joly and Joseph Salmon has been accepted for presentation at ICML 2022 (acceptance rate: 21% of the 5630 submissions).

Abstract: In modern classification tasks, the number of labels is getting larger and larger, as is the size of the datasets encountered in practice. As the number of classes increases, class ambiguity and class imbalance make it increasingly difficult to achieve high top-1 accuracy. Meanwhile, top-K metrics (metrics allowing K guesses) have become popular, especially for performance reporting. Yet, proposing top-K losses tailored for deep learning remains a challenge, both theoretically and practically. In this paper, we introduce a stochastic top-K hinge loss inspired by recent developments on top-K calibrated losses. Our proposal is based on the smoothing of the top-K operator building on the flexible “perturbed optimizer” framework. We show that our loss function performs very well in the case of balanced datasets, while benefiting from a significantly lower computational time than the state-of-the-art top-K loss function. In addition, we propose a simple variant of our loss for the imbalanced case. Experiments on a heavy-tailed dataset show that our loss function significantly outperforms other baseline loss functions.
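To make the top-K metric mentioned in the abstract concrete, here is a minimal sketch of top-K accuracy (a prediction counts as correct if the true label appears among the K highest-scoring classes). This illustrates only the standard evaluation metric, not the paper's smoothed top-K hinge loss; the function name and toy data are our own.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    # Indices of the k largest scores per row (order within the top k is irrelevant).
    topk = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))

# Three samples, four classes; true labels are 2, 0 and 1.
scores = np.array([[0.1, 0.5, 0.3, 0.1],
                   [0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.1, 0.3, 0.4]])
labels = np.array([2, 0, 1])
print(top_k_accuracy(scores, labels, k=2))  # 2 of the 3 true labels fall in the top 2
```

With K = 1 this reduces to ordinary accuracy; increasing K tolerates the class ambiguity the abstract describes.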

Permanent link to this article: https://team.inria.fr/zenith/icml-2022-paper-by-camille-garcin-et-al/

Habilitation (HDR) defense of Antoine Liutkus, 11 Feb. 2022, at 14h.

Antoine Liutkus will defend his habilitation (HDR) on February 11th at 2PM (UTC+1). The defense will take place at LIRMM, room 02.022.


The defense will be about the following topic:
Probabilistic and deep models for the processing of mixtures of waveforms
In this presentation, I will start by presenting a summary of the research I have done over the past 15 years. I will first present my work on probabilistic audio modeling, including the separation of Gaussian and α-stable stochastic processes. Second, I will mention my work on deep learning applied to audio, which rapidly turned into a large effort for community service.
As a conclusion, I will present my research programme, which involves a theoretical part revolving around probabilistic machine learning, and an applied part concerning the processing of time series arising in both audio and the life sciences.
Committee:
Christian Jutten, Emeritus Professor, Grenoble Univ.
Rémi Gribonval, Research Director, Inria Lyon
Cédric Févotte, Research Director, CNRS IRIT Toulouse
Laurent Daudet, Chief Scientific Officer, lighton.ai, and Professor, Univ. Paris Diderot
Tuomas Virtanen, Professor, Tampere University
Alexey Ozerov, Research Scientist, Ava, Rennes

Permanent link to this article: https://team.inria.fr/zenith/habilitation-hdr-defense-of-antoine-liutkus-11-feb-2022-at-14h/

Pl@ntNet ranked number 2 among the Inria software best known to companies, 10 June 2021.

According to the Inria Academy survey on the software distributed by Inria and its partners and on the needs of companies, Scikit-learn, Pl@ntNet, Coq, OpenVibe and Pharo rank in the top 5 of the software best known to companies!

And Pl@ntNet comes second, behind Scikit-learn.

Permanent link to this article: https://team.inria.fr/zenith/plntnet-numero-2-des-logiciels-inria-les-plus-connus-des-entreprises-10-juin-2021/

“Making the Right Move to Senior Researcher”, by P. Valduriez, May 2021.

Check out this short article, “Making the Right Move to Senior Researcher”, to appear in the May 2021 issue of ACM SIGMOD Record, in a new series managed by Professor Tamer Özsu which seeks to provide advice to mid-career researchers.

Permanent link to this article: https://team.inria.fr/zenith/check-out-this-short-article-in-acm-sigmod-record-2021/

SIGKDD 2021: paper by Reza Akbarinia et al. accepted (research track).

The paper proposes PBA (Parallel Boundary Aggregator), a novel algorithm that computes incremental aggregations in parallel over massive data streams. The work was done in collaboration with Univ. Clermont-Auvergne (postdoc Chao Zhang and Professor Farouk Toumani).

Chao Zhang, Reza Akbarinia, Farouk Toumani. Efficient Incremental Computation of Aggregations over Sliding Windows. ACM Int. Conference on Knowledge Discovery and Data Mining (SIGKDD), 2021.

Nowadays, we are witnessing the production of large volumes of continuous or real-time data in many application domains like traffic monitoring, medical monitoring, social networks, weather forecasting, network monitoring, etc. For example, every day around one trillion messages are processed through Uber's data analytics infrastructure, and more than 500 million tweets are posted on Twitter. Efficient streaming algorithms are needed for analyzing data streams in such applications. In particular, aggregations, having the inherent property of summarizing information from data, constitute a fundamental operator to compute real-time statistics in this context. In the streaming setting, aggregations are typically computed over finite subsets of a stream, called windows. In particular, sliding-window aggregation (SWAG) continuously computes a summary of the most recent data items in a given range r (aka window size) and using a given slide s.

One of the challenges faced by SWAG algorithms is to incrementally compute aggregations over moving data, i.e., without recomputing the aggregation from scratch after inserting new data items into, or evicting old data items from, the window. High throughput and low latency are essential requirements, as stream processing systems are typically designed for real-time applications.

In this paper, we propose PBA (Parallel Boundary Aggregator), a novel algorithm that computes incremental aggregations in parallel. PBA groups continuous slices into chunks, and maintains two buffers for each chunk containing, respectively, the cumulative slice aggregations (denoted as csa) and the left cumulative slice aggregations (denoted as lcs) of the chunk’s slices. Using PBA, SWAGs can be computed in constant time, both amortized and worst-case. We also propose an approach to optimize the chunk size, which guarantees the minimum latency for PBA. We conducted extensive empirical experiments using both synthetic and real-world datasets. Our experiments show that PBA behaves very well for medium and large sliding windows (e.g., with sizes greater than 1024 values) compared to the state-of-the-art algorithms. For small windows, the results show the superiority of the non-parallel version of PBA (denoted as SBA), which outperforms other algorithms in terms of throughput.
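The two cumulative buffers described above generalize a well-known two-buffer trick for O(1) amortized incremental sliding-window aggregation. The sketch below illustrates that sequential trick for a sum aggregation (one buffer of raw items plus one buffer of running aggregates), as an assumed simplification; it is not the parallel, chunked PBA algorithm itself, and the class and variable names are our own.

```python
class SlidingWindowAgg:
    """Two-buffer incremental sliding-window aggregation (illustrated with sum).

    insert/evict are O(1) amortized: eviction occasionally "flips" the back
    buffer into a front buffer of precomputed running aggregates, in the
    spirit of PBA's cumulative slice aggregations.
    """

    def __init__(self, op=lambda a, b: a + b, identity=0):
        self.op, self.identity = op, identity
        self.back = []                  # raw items, newest last
        self.back_agg = identity        # aggregate of all items in back
        self.front = []                 # (item, aggregate of this item and all newer ones in front)

    def insert(self, x):
        self.back.append(x)
        self.back_agg = self.op(self.back_agg, x)

    def evict(self):
        if not self.front:
            # Flip: move back items to front, precomputing suffix aggregates.
            agg = self.identity
            for x in reversed(self.back):
                agg = self.op(x, agg)
                self.front.append((x, agg))     # oldest item ends up on top
            self.back, self.back_agg = [], self.identity
        self.front.pop()                        # drop the oldest item

    def query(self):
        front_agg = self.front[-1][1] if self.front else self.identity
        return self.op(front_agg, self.back_agg)
```

For example, after inserting 1, 2, 3 the query returns 6; one eviction leaves the window {2, 3} with aggregate 5, all without rescanning the window.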

Permanent link to this article: https://team.inria.fr/zenith/sigkdd-2021-paper-by-reza-akbarinia-et-al-accepted-research-track/

ICML 2021: paper by Antoine Liutkus et al. accepted (as long presentation).

The paper “Relative positional encoding for transformers with linear complexity” by Liutkus et al. has been accepted for presentation at ICML 2021 as a long paper (3% of the total number of submissions). In 2021, of the 5513 articles submitted, only 1184 were accepted as short presentations (21.5%) and 166 as long presentations (3%).
Title: Relative positional encoding for transformers with linear complexity
Authors: A. Liutkus, O. Cifka, S. Wu, U. Simsekli, Y. Yang and G. Richard
Abstract: Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers and consists in exploiting lags instead of absolute positions for inference. Still, RPE is not available for the recent linear variants of the Transformer, because it requires the explicit computation of the attention matrix, which is precisely what is avoided by such methods. In this paper, we bridge this gap and present “Stochastic Positional Encoding” as a way to generate PE that can be used as a replacement for the classical additive (sinusoidal) PE and provably behaves like RPE. The main theoretical contribution is to make a connection between positional encoding and cross-covariance structures of correlated Gaussian processes. We illustrate the performance of our approach on the Long-Range Arena benchmark and on music generation.
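For context, the classical additive sinusoidal positional encoding that the abstract says Stochastic Positional Encoding replaces can be sketched as below (the standard Transformer construction; this is the baseline PE, not the paper's stochastic proposal, and the function name is our own). It assumes an even model dimension.

```python
import numpy as np

def sinusoidal_pe(length, d_model):
    """Classical additive sinusoidal positional encoding (assumes even d_model).

    Row t is added to the embedding of the token at absolute position t.
    """
    pos = np.arange(length)[:, None]              # (length, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model // 2)
    angles = pos / (10000.0 ** (2 * i / d_model)) # one frequency per dimension pair
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe

pe = sinusoidal_pe(128, 64)
```

Because each row depends only on the absolute position, relative lags are not directly encoded, which is the gap RPE, and hence the paper's stochastic construction for linear Transformers, addresses.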

Permanent link to this article: https://team.inria.fr/zenith/icml-2021-paper-by-antoine-liutkus-et-al-accepted-as-long-presentation/