The scientific ambition of RobotLearn is to train robots to look, listen, learn, move and speak in a socially acceptable manner. This will be achieved through a tight coupling between scientific findings, the development of practical algorithms and associated software packages, and thorough experimental validation. We plan to endow robotic platforms with the ability to perform physically unconstrained, open-domain, multi-person interaction and communication. The roadmap of RobotLearn is twofold: (i) to build on the recent achievements of the Perception team, in particular machine-learning techniques for the temporal and spatial alignment of audio and visual data, variational Bayesian methods for unimodal and multimodal tracking of humans, and deep learning architectures for audio and audio-visual speech enhancement, and (ii) to explore novel research opportunities at the crossroads of discriminative and generative deep learning architectures, Bayesian learning and inference, computer vision, audio/speech signal processing, spoken dialog systems, and robotics. The paramount application domain of RobotLearn is the development of multimodal and multi-party interactive methodologies and technologies for social (companion) robots. Please check our publications list below.
- A Benchmark of Dynamical Variational Autoencoders applied to Speech Spectrogram Modeling
Abstract. The Variational Autoencoder (VAE) is a powerful deep generative model that is now extensively used to represent high-dimensional complex data via a low-dimensional latent space ...
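To make the VAE's low-dimensional latent representation concrete, here is a minimal numpy sketch of the reparameterization trick, with a toy linear "encoder" standing in for the deep networks used in the benchmark (all dimensions, weights and function names are illustrative, not the paper's code):

```python
import numpy as np

def encode(x, W_mu, W_logvar):
    """Toy linear 'encoder': map an observation x to the parameters
    (mean, log-variance) of a Gaussian over the latent space."""
    return W_mu @ x, W_logvar @ x

def reparameterize(mu, logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    which keeps the sampling step differentiable w.r.t. (mu, logvar)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Hypothetical dimensions: 16-dim observation, 2-dim latent space.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)
W_mu = rng.standard_normal((2, 16)) * 0.1
W_logvar = rng.standard_normal((2, 16)) * 0.1
mu, logvar = encode(x, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)  # low-dimensional code for x
```

Dynamical VAEs extend this picture with a temporal model over the sequence of latent codes, which is what the benchmark compares across architectures.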
- PI-Net: Pose Interacting Network for Multi-Person Monocular 3D Pose Estimation
by Wen Guo, Enric Corona, Francesc Moreno-Noguer and Xavier Alameda-Pineda
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2021)
Abstract. Recent literature has addressed the monocular 3D pose estimation task very satisfactorily. In these studies, different persons are usually treated as independent ...
- Robust Face Frontalization For Visual Speech Recognition
by Zhiqi Kang, Radu Horaud and Mostafa Sadeghi
ICCV’21 Workshop on Traditional Computer Vision in the Age of Deep Learning (TradiCV’21)
Abstract. Face frontalization consists of synthesizing a frontally-viewed face from an arbitrarily-viewed one. The main contribution is a robust method that preserves non-rigid facial deformations, i.e. expressions. The ...
- TransCenter: Transformers with Dense Queries for Multiple-Object Tracking
by Yihong Xu, Yutong Ban, Guillaume Delorme, Chuang Gan, Daniela Rus and Xavier Alameda-Pineda
Abstract: Transformer networks have proven extremely powerful for a wide variety of tasks since they were introduced. Computer vision is not an exception, as the use of transformers has become very popular in the vision community in recent ...
- Performance Analysis of 3D Face Alignment with a Statistically Robust Confidence Test
Abstract: We address the problem of analyzing the performance of 3D face alignment (3DFA) algorithms. Traditionally, performance analysis relies on carefully annotated datasets. Here, these annotations correspond to the 3D coordinates of a ...
- Multi-Person Extreme Motion Prediction with Cross-Interaction Attention
by Wen Guo, Xiaoyu Bie, Xavier Alameda-Pineda and Francesc Moreno-Noguer
Abstract. We present the Extreme Pose Interaction (ExPI) dataset, a new person-interaction dataset of Lindy Hop aerial steps. Our dataset contains two couples of dancers performing 16 different aerials (dancing actions), yielding 115 sequences with 30k frames for each viewpoint ...
- Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement
by Mostafa Sadeghi, Xavier Alameda-Pineda
IEEE TSP, 2021
Abstract. In this paper, we are interested in unsupervised (unknown noise) speech enhancement, where the probability distribution of the clean speech spectrogram is modeled via a latent-variable generative model, also called the decoder. Recently, variational autoencoders (VAEs) have gained much popularity as probabilistic generative models. ...
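A classical building block behind such generative enhancement methods is a Wiener-like spectral gain that combines a speech variance model with a noise variance model. Below is a minimal numpy sketch, with given per-bin variances standing in for the VAE decoder's output (`wiener_gain` and `enhance` are illustrative names, not the paper's code):

```python
import numpy as np

def wiener_gain(speech_var, noise_var):
    """Per time-frequency-bin Wiener gain: ratio of speech variance to
    total variance. In VAE-based enhancement the speech variance would
    come from the trained decoder; here it is simply given."""
    return speech_var / (speech_var + noise_var)

def enhance(noisy_stft, speech_var, noise_var):
    """Scale each noisy STFT coefficient by its Wiener gain."""
    return wiener_gain(speech_var, noise_var) * noisy_stft

# Toy example for a single time-frequency bin with made-up variances.
g = wiener_gain(3.0, 1.0)  # gain = 3 / (3 + 1) = 0.75
```

The gain lies between 0 and 1: bins where the speech model dominates are kept, bins where the noise model dominates are attenuated.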
- Variational Inference and Learning of Piecewise-linear Dynamical Systems
by Xavier Alameda-Pineda, Vincent Drouard, Radu Horaud
IEEE TNNLS 2021
Abstract. Modeling the temporal behavior of data is of paramount importance in many scientific and engineering fields. Baseline methods assume that both the dynamic and observation equations follow linear-Gaussian models. However, there are many real-world processes that cannot be characterized by a single linear behavior. ...
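As a toy illustration of a piecewise-linear dynamical system (not the paper's inference method, which learns the regimes and their switching), the sketch below simulates a state that follows one of two linear-Gaussian dynamics depending on a hand-crafted switching rule:

```python
import numpy as np

def simulate_plds(A_list, switch, x0, n_steps, noise_std, rng):
    """Simulate a piecewise-linear dynamical system: at each step the
    state follows x_{t+1} = A_k x_t + w_t, where regime k is chosen by
    a switching rule and w_t is isotropic Gaussian noise."""
    x = np.asarray(x0, dtype=float)
    traj = [x]
    for _ in range(n_steps):
        k = switch(x)  # pick the active linear regime for this step
        x = A_list[k] @ x + noise_std * rng.standard_normal(x.shape)
        traj.append(x)
    return np.stack(traj)

# Two illustrative regimes: rotate counter-clockwise in the right
# half-plane, clockwise in the left half-plane (a made-up rule).
theta = 0.3
rot = lambda t: np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
A_list = [rot(theta), rot(-theta)]
switch = lambda x: 0 if x[0] >= 0 else 1
traj = simulate_plds(A_list, switch, [1.0, 0.0], 50, 0.01, np.random.default_rng(0))
```

Each regime alone is linear-Gaussian; it is the switching between regimes that lets the overall model capture behaviors a single linear system cannot.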
- ODANet: Online Deep Appearance Network for Identity-Consistent Multi-Person Tracking
by Guillaume Delorme, Yutong Ban, Guillaume Sarrazin and Xavier Alameda-Pineda
ICPR’20 Workshop on Multimodal pattern recognition for social signal processing in human computer interaction
Abstract. The analysis of affective states through time in multi-person scenarios is very challenging, because it requires consistently tracking all persons over time. This requires a robust visual ...
- Probabilistic Graph Attention Network with Conditional Kernels for Pixel-Wise Prediction
by Dan Xu, Xavier Alameda-Pineda, Wanli Ouyang, Elisa Ricci, Xiaogang Wang and Nicu Sebe
IEEE TPAMI, 2020
Abstract. Multi-scale representations deeply learned via convolutional neural networks have shown tremendous importance for various pixel-level prediction problems. In this paper we present a novel approach that advances the state of the art on pixel-level ...
- Face Frontalization Based on Robustly Fitting a Deformable Shape Model to 3D Landmarks
by Zhiqi Kang, Mostafa Sadeghi, and Radu Horaud
(Submitted to IEEE Transactions on Multimedia)
Abstract: Face frontalization consists of synthesizing a frontally-viewed face from an arbitrarily-viewed one. The main contribution of this paper is a robust face alignment method that enables pixel-to-pixel warping. The method simultaneously estimates the rigid transformation (scale, rotation, and ...
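The rigid part of such an alignment, namely the scale, rotation and translation between two 3D landmark sets, has a classical closed-form least-squares solution. The numpy sketch below uses the SVD-based (Umeyama-style) construction purely as an illustration; it is a plain least-squares fit, not the robust estimator contributed by the paper:

```python
import numpy as np

def similarity_align(src, dst):
    """Least-squares similarity transform (scale s, rotation R,
    translation t) mapping 3D point set src onto dst, so that
    dst ~= s * R @ src + t, via the SVD of the cross-covariance."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(dst_c.T @ src_c)
    d = np.sign(np.linalg.det(U @ Vt))       # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    s = (S * np.diag(D)).sum() / (src_c ** 2).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t
```

A robust variant would down-weight landmarks that deviate from the rigid model (e.g. those moved by expressions), which is precisely where this simple fit falls short.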
- Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement
by Mostafa Sadeghi and Xavier Alameda-Pineda
Presented at IEEE ICASSP 2021
Abstract: Recently, audio-visual speech enhancement has been tackled in an unsupervised setting based on variational auto-encoders (VAEs), where during training only clean data is used to train a generative model for speech, which at test time is combined with a noise model, e.g. ...
- Deep Variational Generative Models for Audio-visual Speech Separation
by Viet-Nhat Nguyen, Mostafa Sadeghi, Elisa Ricci, and Xavier Alameda-Pineda
Presented at IEEE MLSP 2021
Abstract: In this paper, we are interested in audio-visual speech separation given a single-channel audio recording as well as visual information (lips movements) associated with each speaker. We propose an unsupervised technique based on audio-visual generative modeling of clean ...
- Online Monaural Speech Enhancement using Delayed Subband LSTM
by Xiaofei Li and Radu Horaud
Presented at INTERSPEECH 2020
Abstract. This paper proposes a delayed subband LSTM network for online monaural (single-channel) speech enhancement. The proposed method is developed in the short time Fourier transform (STFT) domain. Online processing requires frame-by-frame signal reception and processing. A paramount feature ...
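The frame-by-frame constraint of online processing can be illustrated with a minimal numpy STFT generator that yields one windowed FFT frame at a time, as an online enhancer would receive them (a simplified sketch; the frame length, hop and window are illustrative choices, not the paper's configuration):

```python
import numpy as np

def stft_frames(signal, frame_len=512, hop=256):
    """Online-style STFT: yield one windowed FFT frame at a time
    instead of computing the full spectrogram in a single batch."""
    window = np.hanning(frame_len)
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        yield np.fft.rfft(frame)  # frame_len // 2 + 1 frequency bins

# Each yielded frame could be processed (e.g. by a subband network)
# before the next frame arrives, matching the online constraint.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
frames = list(stft_frames(sig))
```

Allowing a small look-ahead of future frames, as the paper's "delayed" variant does, trades a few frames of latency for better enhancement.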
- CANU-ReID: A Conditional Adversarial Network for Unsupervised person Re-IDentification
by Guillaume Delorme, Stéphane Lathuilière, Radu Horaud and Xavier Alameda-Pineda
Presented at ICPR, 2021
Abstract: Unsupervised person re-ID is the task of identifying people on a target dataset for which the ID labels are unavailable during training. In this paper, we propose to unify two trends ...
- Learning Visual Voice Activity Detection with an Automatically Annotated Dataset
by Sylvain Guy, Stéphane Lathuilière, Pablo Mesejo and Radu Horaud
Presented at ICPR 2021
Abstract. Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient, either because the acoustic signal is difficult to analyze or because it ...
- How To Train Your Deep Multi-Object Tracker
by Yihong Xu, Aljoša Ošep, Yutong Ban, Radu Horaud, Laura Leal-Taixé and Xavier Alameda-Pineda
Presented at IEEE CVPR 2020
Abstract: The recent trend in vision-based multi-object tracking (MOT) is heading towards leveraging the representational power of deep learning to jointly learn to detect and track objects. However, existing methods train ...