Research

The scientific ambition of RobotLearn is to train robots to acquire the capacity to look, listen, learn, move and speak in a socially acceptable manner. This will be achieved through a tight interplay between scientific findings, the development of practical algorithms and associated software packages, and thorough experimental validation. We plan to endow robotic platforms with the ability to perform physically unconstrained, open-domain multi-person interaction and communication. The roadmap of RobotLearn is twofold: (i) to build on the recent achievements of the Perception team, in particular machine learning techniques for the temporal and spatial alignment of audio and visual data, variational Bayesian methods for unimodal and multimodal tracking of humans, and deep learning architectures for audio and audio-visual speech enhancement; and (ii) to explore novel scientific research opportunities at the crossroads of discriminative and generative deep learning architectures, Bayesian learning and inference, computer vision, audio/speech signal processing, spoken dialog systems, and robotics. The paramount application domain of RobotLearn is the development of multimodal and multi-party interactive methodologies and technologies for social (companion) robots. Please check our publications list.

RobotLearn is a research team at Inria Grenoble Rhône-Alpes and Université Grenoble Alpes, and is associated with the Laboratoire Jean Kuntzmann.

Recent contributions

  • A Benchmark of Dynamical Variational Autoencoders Applied to Speech Spectrogram Modeling

    by Xiaoyu Bie, Laurent Girin, Simon Leglaive, Thomas Hueber and Xavier Alameda-Pineda
    Interspeech’21, Brno, Czech Republic

    Abstract. The Variational Autoencoder (VAE) is a powerful deep generative model that is now extensively used to represent high-dimensional complex data via a low-dimensional latent space ...
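The core mechanics of a VAE — an encoder producing an approximate posterior q(z|x), the reparameterization trick, and a loss combining reconstruction error with a KL term — can be sketched in a few lines. The linear encoder/decoder and all dimensions below are illustrative stand-ins for the neural networks used in the paper, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): a 20-D observation and a 2-D latent space.
x_dim, z_dim = 20, 2

# Linear maps stand in for the encoder/decoder neural networks.
W_enc = rng.normal(size=(z_dim, x_dim)) * 0.1
W_dec = rng.normal(size=(x_dim, z_dim)) * 0.1

def encode(x):
    """Map x to the mean and log-variance of q(z|x)."""
    mu = W_enc @ x
    log_var = np.zeros(z_dim)  # fixed unit variance, for simplicity
    return mu, log_var

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I): keeps sampling differentiable."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z):
    """Mean of p(x|z) -- the generative model, also called the decoder."""
    return W_dec @ z

x = rng.normal(size=x_dim)
mu, log_var = encode(x)
z = reparameterize(mu, log_var)
x_hat = decode(z)

# Negative ELBO = reconstruction error + KL(q(z|x) || N(0, I)).
recon = 0.5 * np.sum((x - x_hat) ** 2)
kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
```

Dynamical VAEs, the subject of the benchmark, extend this static model by adding temporal dependencies between successive latent variables z_t.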

  • PI-Net: Pose Interacting Network for Multi-Person Monocular 3D Pose Estimation

    by Wen Guo, Enric Corona, Francesc Moreno-Noguer and Xavier Alameda-Pineda
    In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2021)

    Abstract. Recent literature addressed the monocular 3D pose estimation task very satisfactorily. In these studies, different persons are usually treated as independent ...

  • Robust Face Frontalization For Visual Speech Recognition

    by Zhiqi Kang, Radu Horaud and Mostafa Sadeghi
    ICCV’21 Workshop on Traditional Computer Vision in the Age of Deep Learning (TradiCV’21)

    Abstract. Face frontalization consists of synthesizing a frontally-viewed face from an arbitrarily-viewed one. The main contribution is a robust method that preserves non-rigid facial deformations, i.e. expressions. The ...

  • TransCenter: Transformers with Dense Queries for Multiple-Object Tracking

    by Yihong Xu, Yutong Ban, Guillaume Delorme, Chuang Gan, Daniela Rus and Xavier Alameda-Pineda

    Abstract: Transformer networks have proven extremely powerful for a wide variety of tasks since they were introduced. Computer vision is not an exception, as the use of transformers has become very popular in the vision community in recent ...

  • Performance Analysis of 3D Face Alignment with a Statistically Robust Confidence Test

    by Mostafa Sadeghi, Xavier Alameda-Pineda and Radu Horaud
    (Submitted to IEEE Transactions on Image Processing)

    Abstract: We address the problem of analyzing the performance of 3D face alignment (3DFA) algorithms. Traditionally, performance analysis relies on carefully annotated datasets. Here, these annotations correspond to the 3D coordinates of a ...

  • Multi-Person Extreme Motion Prediction with Cross-Interaction Attention

    by Wen Guo, Xiaoyu Bie, Xavier Alameda-Pineda and Francesc Moreno-Noguer

    Abstract. We present the Extreme Pose Interaction (ExPI) dataset, a new person-interaction dataset of Lindy Hop aerial steps. Our dataset contains 2 couples of dancers performing 16 different aerials (dancing actions), yielding 115 sequences with 30k frames for each viewpoint ...

  • Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement

    by Mostafa Sadeghi and Xavier Alameda-Pineda
    IEEE TSP, 2021

    Abstract. In this paper, we are interested in unsupervised (unknown noise) speech enhancement, where the probability distribution of clean speech spectrogram is simulated via a latent variable generative model, also called the decoder. Recently, variational autoencoders (VAEs) have gained much popularity as probabilistic generative models. ...

  • Variational Inference and Learning of Piecewise-linear Dynamical Systems

    by Xavier Alameda-Pineda, Vincent Drouard and Radu Horaud
    IEEE TNNLS 2021

    Abstract. Modeling the temporal behavior of data is of primordial importance in many scientific and engineering fields. Baseline methods assume that both the dynamic and observation equations follow linear-Gaussian models. However, there are many real-world processes that cannot be characterized by a single linear behavior. ...
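As a toy illustration of why a single linear-Gaussian model can fall short, the sketch below simulates a state-space model z_t = A_{s_t} z_{t-1} + w_t, x_t = C z_t + v_t whose dynamics switch between two linear regimes. The matrices, noise levels and the fixed switching time are made-up values for illustration; the paper's method infers the latent switching behavior rather than assuming it known:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical linear regimes: a slow rotation and a contraction.
theta = 0.3
A = [np.array([[np.cos(theta), -np.sin(theta)],
               [np.sin(theta),  np.cos(theta)]]),  # mode 0: rotation
     np.array([[0.8, 0.0],
               [0.0, 0.8]])]                       # mode 1: contraction
C = np.eye(2)              # observation matrix
q_std, r_std = 0.01, 0.05  # process / observation noise std-dev

def simulate(T, switch_at):
    """Simulate z_t = A_{s_t} z_{t-1} + w_t, x_t = C z_t + v_t,
    switching linear regimes at a fixed (known) time for illustration."""
    z = np.array([1.0, 0.0])
    xs = []
    for t in range(T):
        mode = 0 if t < switch_at else 1
        z = A[mode] @ z + q_std * rng.normal(size=2)
        xs.append(C @ z + r_std * rng.normal(size=2))
    return np.array(xs)

obs = simulate(T=50, switch_at=25)
```

A single Kalman filter fitted to `obs` would have to average the two regimes; a piecewise-linear model can assign each segment its own dynamics.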

  • ODANet: Online Deep Appearance Network for Identity-Consistent Multi-Person Tracking

    by Guillaume Delorme, Yutong Ban, Guillaume Sarrazin and Xavier Alameda-Pineda
    ICPR’20 Workshop on Multimodal Pattern Recognition for Social Signal Processing in Human Computer Interaction

    Abstract. The analysis of affective states through time in multi-person scenarios is very challenging, because it requires consistently tracking all persons over time. This requires a robust visual ...

  • Probabilistic Graph Attention Network with Conditional Kernels for Pixel-Wise Prediction

    by Dan Xu, Xavier Alameda-Pineda, Wanli Ouyang, Elisa Ricci, Xiaogang Wang and Nicu Sebe
    IEEE TPAMI, 2020

    Abstract. Multi-scale representations deeply learned via convolutional neural networks have shown tremendous importance for various pixel-level prediction problems. In this paper we present a novel approach that advances the state of the art on pixel-level ...

  • Face Frontalization Based on Robustly Fitting a Deformable Shape Model to 3D Landmarks

    by Zhiqi Kang, Mostafa Sadeghi, and Radu Horaud
    (Submitted to IEEE Transactions on Multimedia)

    Abstract: Face frontalization consists of synthesizing a frontally-viewed face from an arbitrarily-viewed one. The main contribution of this paper is a robust face alignment method that enables pixel-to-pixel warping. The method simultaneously estimates the rigid transformation (scale, rotation, and ...

  • Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement

    by Mostafa Sadeghi and Xavier Alameda-Pineda
    Presented at IEEE ICASSP 2021

    Abstract: Recently, audio-visual speech enhancement has been tackled in the unsupervised settings based on variational auto-encoders (VAEs), where during training only clean data is used to train a generative model for speech, which at test time is combined with a noise model, e.g. ...

  • Deep Variational Generative Models for Audio-visual Speech Separation

    by Viet-Nhat Nguyen, Mostafa Sadeghi, Elisa Ricci, and Xavier Alameda-Pineda
    Presented at IEEE MLSP 2021

    Abstract: In this paper, we are interested in audio-visual speech separation given a single-channel audio recording as well as visual information (lips movements) associated with each speaker. We propose an unsupervised technique based on audio-visual generative modeling of clean ...

  • Online Monaural Speech Enhancement using Delayed Subband LSTM

    by Xiaofei Li and Radu Horaud
    Presented at INTERSPEECH 2020

    Abstract. This paper proposes a delayed subband LSTM network for online monaural (single-channel) speech enhancement. The proposed method is developed in the short time Fourier transform (STFT) domain. Online processing requires frame-by-frame signal reception and processing. A paramount feature ...
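The frame-by-frame STFT-domain pipeline that online processing implies can be sketched as follows. The frame length, hop size and the constant spectral gain are assumed values for illustration; in the paper, the per-frame gain would be produced by the delayed subband LSTM rather than a fixed constant:

```python
import numpy as np

# Hypothetical analysis parameters: 512-sample frames with 50% overlap.
frame_len, hop = 512, 256
window = np.hanning(frame_len)

def online_stft_frames(signal):
    """Yield one windowed STFT frame at a time, as an online system
    would receive and process the incoming signal."""
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        yield np.fft.rfft(frame)

def enhance_frame(spec, gain=0.8):
    """Placeholder per-frame spectral mask; a learned network would
    estimate this gain from the current (and slightly delayed) frames."""
    return gain * spec

rng = np.random.default_rng(0)
noisy = rng.normal(size=4096)  # stand-in for a noisy speech signal

# Process strictly frame by frame, then return to the time domain.
enhanced = [np.fft.irfft(enhance_frame(f)) for f in online_stft_frames(noisy)]
```

A full system would overlap-add the enhanced frames back into a continuous waveform; the point here is only that each frame is processed as it arrives.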

  • CANU-ReID: A Conditional Adversarial Network for Unsupervised Person Re-IDentification

    by Guillaume Delorme, Stéphane Lathuilière, Radu Horaud and Xavier Alameda-Pineda
    Presented at ICPR, 2021

    Abstract: Unsupervised person re-ID is the task of identifying people on a target dataset for which the ID labels are unavailable during training. In this paper, we propose to unify two trends ...

  • Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

    by Sylvain Guy, Stéphane Lathuilière, Pablo Mesejo and Radu Horaud
    Presented at ICPR 2021

    Abstract. Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient, either because the acoustic signal is difficult to analyze or because it ...

  • How To Train Your Deep Multi-Object Tracker

    by Yihong Xu, Aljoša Ošep, Yutong Ban, Radu Horaud, Laura Leal-Taixé and Xavier Alameda-Pineda
    Presented at IEEE CVPR 2020

    Abstract: The recent trend in vision-based multi-object tracking (MOT) is heading towards leveraging the representational power of deep learning to jointly learn to detect and track objects. However, existing methods train ...
