Research

You are welcome to browse through our recent and current research results (alphabetical order). A broad and non-exhaustive list of the team’s research topics may be found on our  homepage. Some of this research is directly linked to recently submitted or accepted publications that can be found here. Please also refer to our complete list of publications

Accounting for Room Acoustics in Audio-Visual Multi-Speaker Tracking

Yutong Ban, Xiaofei Li, Xavier Alameda-Pineda, Laurent Girin and Radu Horaud IEEE International Conference on Acoustics, Speech, and Signal Processing, May 2018, Calgary, Canada PDF | Abstract | | Results | Acknowledgements   Abstract. Multiple speaker tracking is a crucial task for many applications. In real-world scenario, the complementarity between auditory and visual data opens the …

Acoustic Space Learning on Binaural Manifolds

Acoustic Space Learning for Sound-Source Separation and Localization on Binaural Manifolds 2016 IJNS Award for Outstanding Contributions to Neural Systems Antoine Deleforge, Florence Forbes, and Radu Horaud International Journal of Neural Systems, 25 (1), 2015 PDF on arXiv | BibTeX | HAL | Additional papers | Matlab Code | Dataset | Videos and more  Abstract In this paper we …

Audio Source Tracking with the von Mises Distribution

Tracking Multiple Audio Sources with the von Mises Distribution and Variational Expectation Maximization Yutong Ban, Xavier Alameda-Pineda, Christine Evers (Imperial College) and Radu Horaud IEEE Signal Processing Letters  26(6), 798 – 802, 2019 | pdf  | pdf of supplemental material | code | video Abstract. In this work we address the problem of simultaneously tracking several audio …

Audio-Visual Gaze Control

Neural Network Based Reinforcement Learning for Audio-Visual Gaze Control in Human-Robot Interaction Stéphane Lathuilière, Benoit Massé, Pablo Mesejo, and Radu Horaud Pattern Recognition Letters, 2018 | arXiv |  HAL | Video (scroll down) This paper introduces a novel neural network-based reinforcement learning approach for robot gaze control. Our approach enables a robot to learn and to adapt …

Audio-Visual Multi-Speaker Tracking via Variational Inference

Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers Yutong Ban,  Xavier Alameda-Pineda, Laurent Girin, and Radu Horaud Paper submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence arXiv | Results | Acknowledgements   Abstract. In this paper we address the problem of tracking multiple speakers via the fusion of visual and auditory information. …

Audio-Visual Speaker Detection, Localization and Interaction with NAO

Publications | Videos | The NAO Robot   Abstract. In this research we address the problem of audio-visual speaker detection. We introduce an online system working on the humanoid robot NAO. The scene is perceived with two cameras and two microphones. A multimodal Gaussian Mixture Model fuses the information extracted from the auditory and visual sensors. The system …

Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion

IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(5), pages 1086-1099, 2018 special issue on Learning with Shared Information for Computer Vision and Multimedia Analysis Israel D. Gebru    Sileye Ba    Xiaofei Li    Radu P. Horaud [PDF on arXiv] [IEEE Xplore] [HAL] [DATASET] [BibTeX][Videos]   Abstract. Speaker diarization consists of assigning speech signals to …

Audio-Visual Speaker Localization Via Weighted Clustering

Abstract. In this paper we address the problem of detecting and locating speakers using audiovisual data. We address this problem in the framework of clustering. We propose a novel weighted clustering method based on a finite mixture model which explores the idea of non-uniform weighting of observations. Weighted-data clustering techniques have already been proposed, but …

Audio-visual Speech-Turn Detection and Tracking

Israel D. Gebru, Sileye Ba, Georgios Evangelidis and Radu Horaud International Conference on Latent Variable Analysis and Independent Component Analysis, 2015 (pdf) Speaker diarization is an important component of multi-party dialog systems in order to assign speech-signal segments among participants. Diarization may well be viewed as the problem of detecting and tracking speech turns. It …

Audio-Visual Tracking by Density Approximation

Audio-Visual Tracking by Density Approximation in a Sequential Bayesian  Filtering Framework Israel D. Gebru   Christine Evers* Patrick A. Naylor*    Radu P. Horaud IEEE Workshop on Hands-free Speech Communication and Microphone Arrays Best Paper Award *Imperial College London [PDF] [Slides] [BibTeX] [Code] [Video] Abstract This paper proposes a novel audio-visual tracking approach that exploits constructively …

Audio-visual VAE for Speech Enhancement

  Audio-visual Speech Enhancement Using Conditional Variational Auto-Encoder Mostafa Sadeghi, Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud Paper submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing | arXiv Abstract. Variational auto-encoders (VAEs) are deep generative latent variable models that can be used for learning the distribution of complex data. VAEs have been successfully …

Comprehensive Analysis of Deep Regression

Stéphane Lathuilère, Pablo Mesejo, Xavier Alameda-Pineda, and Radu Horaud Accepted with minor revisions to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. Paper available on arXiv. Abstract: Deep learning revolutionized data science, and recently its popularity has grown exponentially, as did the amount of papers employing deep networks. Vision tasks, such as human …

Continuous Action Recognition

Continuous Action Recognition Based on Sequence Alignment Kaustubh Kulkarni, Georgios Evangelidis, Jan Cech and Radu Horaud International Journal of Computer Vision (online) vol. 112, issue 1, March 2015, pp. 90-114 PDF on arXiv | BibTeX:  | PDF from HAL | Matlab code | Additional Papers | Videos Abstract: Continuous action recognition is more challenging than isolated recognition because classification and segmentation must be simultaneously carried …

Deep Mixture of Linear Inverse Regressions

Deep Mixture of Linear Inverse Regressions Applied to Head-Pose Estimation Stéphane Lathuilière, Rémi Juge, Pablo Mesejo, Rafael Munoz-Salinas, and Radu Horaud IEEE Conference on Computer Vision and Pattern Recognition, July 2017 [ pdf ] [ code ] [ BibTeX ]   Abstract: Convolutional Neural Networks (ConvNets) have become the state-of-the-art for many classification and regression …

Depth (TOF) and Stereo Fusion

Fusion of Range and Stereo Data for High-resolution Scene-modeling IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.37, No.11, 2015, pp. 2178-2192 (IEEE Xplore) G. Evangelidis, M. Hansard, and R. Horaud Abstract This paper addresses the problem of range-stereo fusion, for the construction of high-resolution depth maps. In particular, we combine low-resolution depth data with …

Direct-Path Relative Transfer Function for Speaker Localization

Xiaofei Li, Yutong Ban, Laurent Girin, Xavier Alameda-Pineda and Radu Horaud. Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments.  IEEE Journal of Selected Topics in Signal Processing, 13 (1), pp. 88 – 103, 2019. [research page]   Xiaofei Li, Laurent Girin, Radu Horaud, Sharon Gannot. Multiple-Speaker Localization Based on Direct-Path Features and Likelihood Maximization …

EM Algorithms for Joint Source Separation and Diarisation of Speech

IEEE International Conference on Acoustics, Speech and Signal Processing, 2017 IEEE Workshop on Applications of Signal Processing to Audio Acoustics, 2017 D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, R. Horaud In this page you can find two EM algorithms for simultaneous separation and diarision of multichannel convolutive audio mixtures. The algorithm on Icassp 2017 …

EM Algorithms for Weigthed-Data Clustering with Application to Audio-Visual Scene Analysis

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) Volume 38, number 12, pages 2402 – 2415, December 2016 Israel D. Gebru    Xavier Alameda-Pineda    Florence Forbes    Radu P. Horaud [ PDF on arXiv ]   [ PDF on IEEE Xplore ]   [ BibTex ]   [ CODE & DATASET ]   [ …

Extended Gaze Following

Extended Gaze Following: Detecting Objects in Videos Beyond the Camera Field of View Benoit Massé, Stéphane Lathuilière, Pablo Mesejo, Radu Horaud IEEE International Conference on Automatic Face and Gesture Recognition, May 2019, Lille, France Pdf on HAL | supplementary material |  Abstract. In this paper, we address the problems of detecting objects of interest in …

Eye Gaze and Visual Focus

Tracking Gaze and Visual Focus of Attention of People Involved in Social Interaction Benoit Massé, Silèye Ba and Radu Horaud IEEE Transactions on Pattern Analysis and Machine Intelligence,  40(11), 2711-2724, 2018 | IEEEXplore | arXiv | HAL | Code | Demo | Datasets The visual focus of attention (VFOA) has been recognized as a prominent conversational …

Finding Audio-Visual Events in Informal Social Gatherings

by Xavier Alameda-Pineda, Vasil Khalidov, Florence Forbes and Radu Horaud IEEE/ACM International Conference on Multimodal Interaction, 2011 Outstanding Paper Award Abstract In this paper we address the problem of detecting and localizing objects that can be both seen and heard, e.g., people. This may be solved within the framework of data clustering. We propose a new …

Geometric Sound Source Localization

A Geometric Approach to Sound Source Localization from Time-Delay Estimates Xavier Alameda-Pineda and Radu Horaud IEEE/ACM Transactions on Audio, Speech and Language Processing, 22(6), pages 1082-1095, June 2014 PDF on arXiv | BibTeX | HAL | Matlab toolbox | Additional Papers | Online multimedia Abstract: We address the problem of sound-source localization from time-delay estimates using arbitrarily-shaped non-coplanar microphone arrays. A novel …

Head Pose Estimation

Head Pose Estimation via Probabilistic High-Dimensional Regression Best Student Paper Award (2nd place) V. Drouard, S. Ba, G. Evangelidis, A. Deleforge, and R. Horaud IEEE International Conference on Image Processing (ICIP’15) Extended version published in IEEE Transactions on Image Processing, available on HAL Also, please visit our High-dimensional regression webpage IEEE Publication | HAL Publication …

Head-Pose Tracking

Switching Linear Inverse-Regression Model for Tracking Head Pose V. Drouard, S. Ba, and R. Horaud IEEE Winter Conference on Application of Computer Vision (WACV’17) IEEE Publication | HAL Publication | Abstract | BibTex | Results | Matlab code | Acknowledgement Abstract We propose to estimate the head-pose angles (pitch, yaw, and roll) by simultaneously predicting the …

High-Dimensional Regression

High-Dimensional Regression with Gaussian Mixtures and Partially-Latent Response Variables Statistics and Computing, Springer, 2015, vol. 25, number 5, pages 893-911 Antoine Deleforge, Florence Forbes and Radu Horaud  Abstract | arXiv | HAL| Springer | Supplementary materials | Slides | Citation and Bibtex  Associated software packages: Matlab toolbox | R code | Python/Keras package Abstract: The problem of approximating high-dimensional data with a …

Inverse Gamma Source Variance Prior Model for Audio Source Separation

IEEE International Conference on Acoustics, Speech and Signal Processing, 2017 D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, R. Horaud Abstract In this paper we present a new statistical model for the power spectral density (PSD) of an audio signal and its application to multichannel audio source separation (MASS). The source signal is modeled with …

Joint Registration of Multiple Point Sets

A Generative Model for the Joint Registration of Multiple Point Sets Evangelidis, D. Kounades-Bastian, R. Horaud, E. Psarakis ECCV 2014 Joint Alignment of Multiple Point Sets with Batch and Incremental Expectation-Maximization G. Evangelidis, R. Horaud PAMI 2018 (to appear) Abstract (PAMI version) This paper addresses the problem of registering multiple point sets. Solutions to this …

Learning and Inference of Piecewise-linear Dynamical Systems

Text to come…

Multichannel Speech Enhancement using Long Short-Term Memory

Audio Examples: CHiME3 dataset, using four microphones. clean noisy BeamformIt CNN LSTM BLSTM BUS CAF PED STR

Multichannel speech enhancement with VAE and NMF

Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization Simon Leglaive, Laurent Girin, Radu Horaud IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Brighton, UK, 2019 Article | Bibtex  | Slides | Code |   Audio examples | Acknowledgement   Abstract In this paper we address speaker-independent multichannel speech enhancement in unknown noisy environments. Our work is …

Noise Power Spectral Density Estimation

Non-stationary Noise Power Spectral Density Estimation Based on Regional Statistics Xiaofei Li, Laurent Girin, Sharon Gannot and Radu Horaud The 41th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016. [HAL] [ pdf ] [ Matlab code ] Abstract Estimating the noise power spectral density (PSD) is essential for single channel speech enhancement …

ODA-Track: Online Deep Appearance for Robotic Multiple Person Tracking

Online Localization and Tracking of Multiple Moving Speakers

Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments Xiaofei Li*, Yutong Ban*, Laurent Girin, Xavier Alameda-Pineda and Radu Horaud  IEEE Journal on Selected Topics in Signal Processing, 2019. | pdf from arXiv | code | dataset Abstract. We address the problem of online localization and tracking of multiple moving speakers in reverberant …

Online Variational Bayesian Tracking

Variational Bayesian Framework for Multi-Person Tracking Sileye Ba, Yutong Ban, Xavier Alameda-PIneda, Alessio Xompero, and Radu Horaud Papers | Matlab code | Results Object tracking is an ubiquitous problem in computer vision with many applications in human-machine and human-robot interaction, augmented reality, driving assistance, surveillance, etc. Although thoroughly investigated, tracking multiple persons remains a challenging …

Plane-extraction from depth-data

Plane-extraction from depth-data using a Gaussian mixture regression model Richard Marriott, Alexander Pashevich, and Radu Horaud Pattern Recognition Letters, volume 110, pages 44-50, July 2018 | arXiv | HAL | BibTeX |Matlab code   Abstract: We propose a novel algorithm for unsupervised extraction of piecewise planar models from depth data. Among other applications, such models are …

Point Registration with Expectation-Maximization

Rigid and Articulated Point Registration with Expectation Conditional Maximization Radu Horaud, Florence Forbes, Manuel Yguel, Guillaume Dewaele, and Jian Zhang IEEE Transactions on Pattern Analysis and Machine Intelligence, 33 (3), 587-602, March 2011 Abstract  | code | pdf from HAL | IEEEXplore | Bibtex | Video of a toy example Abstract. This paper addresses the …

Recognition of Group Activities in Videos

Recognition of Group Activities in Videos Based on Single- and Two-Person Descriptors Stéphane Lathuilière, Georgios Evangelidis, Radu Horaud IEEE Winter Conference on Application of Computer Vision (WACV’17) IEEE Publication | HAL Publication | Abstract | BibTex | Results | Acknowledgement Abstract Group activity recognition from videos is a very challenging problem that has barely been addressed. …

Scene Flow Estimation

Scene Flow Estimation by Growing Correspondence Seeds Jan Cech, Jordi Sanchez-Rieira, and Radu Horaud IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3129-3136, 2011 Abstract  | Code | HAL | IEEEXplore | Bibtex | Video | Papers Software package as a Matlab toolbox (source code of binaries) available from Jan Cech’s website or here. Abstract. A simple seed growing algorithm for estimating …

Skeletal Quads

Human Action and Gesture Recognition Using Joint Quadruples Description | Publications | Code G. Evangelidis, G. Singh, R. Horaud Description Recent advances on human motion analysis have made the extraction of human skeleton structure feasible, even from single depth images. This structure has been proven quite informative for discriminating actions in a recognition scenario. In …

Speech Dereverberation

Multichannel Online Dereverberation based on Spectral Magnitude Inverse Filtering Xiaofei Li, Laurent Girin, Sharon Gannot, Radu Horaud IEEE/ACM Transactions on Audio, Speech and Language Processing, 27 (9), pp. 1365 – 1377, 2019. [pdf] [matlab code] Abstract. This paper addresses the problem of multichannel online dereverberation. The proposed method is performed in the short-time Fourier transform …

Speech enhancement with VAE and alpha-stable distributions

Speech enhancement with variational autoencoders and alpha-stable distributions Simon Leglaive, Umut Şimşekli, Antoine Liutkus, Laurent Girin, Radu Horaud IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Brighton, UK, 2019 Article | Bibtex  | Poster | Code |  Audio examples | Acknowledgement   Abstract This paper focuses on single-channel semi-supervised speech enhancement. We learn a speaker-independent deep generative speech model using the …

Supervised Sound-Source Localization

Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression Antoine Deleforge, Radu Horaud, Yoav Y. Schechner and Laurent Girin. IEEE/ACM Transactions on Audio, Speech and Language Processing, 23(4), 718-731, April 2015 Abstract | Videos | Dataset | Matlab code | pdf from HAL | IEEE Xplore | Bibtex   Setup: Two microphones plugged into the …

Three-Dimensional Sensors

Depth Cameras and Associated Computer Vision Methods Radu Horaud (INRIA), Miles Hansard (QMUL), and Georgios Evangelidis (DAQRI)   The emergence of three-dimensional sensors, e.g., Microsoft Kinect v1 and v2, Asus Xtion Pro Live (structered-light sensors), Mesa Imaging SR4000, or Velodyne HDL-64 laser range finder, to cite just a few, have introduced a revolution in the …

Tracking and Visual Servoing

Tracking a Varying Number of People with a Visually-Controlled Robotic Head Y.Ban, X. Alameda-Pineda, F. Badeig, S. Ba, and R. Horaud IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’17) Novel Technology Paper Award Finalist   PDF | Abstract | | Slides | Results | Acknowledgements Abstract Multi-person tracking (MOT) using a robot platform is of …

Tracking the Active Speaker Based on Joint Audio-Visual Observation

IEEE International Conference on Computer Vision Workshops, Dec 2015 Israel D. Gebru    Sileye Ba    Georgios Evangelidis    Radu P. Horaud [ PDF ]     [ BibTex ]   [ VIDEO ]   [ DATASET ] Abstract Any multi-party conversation system benefits from speaker diarization, that is, the assignment of speech signals among …

Variance Modeling Based on VAEs for Speech Enhancement

  A variance modeling framework based on variational autoencoders for speech enhancement Simon Leglaive, Laurent Girin, Radu Horaud IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Aalborg, Denmark, September 2018 Article | Supporting document | Bibtex |  Slides |  Code |  Audio examples | Acknowledgement   Abstract In this paper we address the problem of enhancing speech signals in noisy mixtures using a source …

Variational EM for the Separation of Time-Varying Convolutive Mixtures

IEEE/ACM Transactions on Audio, Speech and Language Processing D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, R. Horaud Abstract This paper addresses the problem of separating audio sources from time-varying convolutive mixtures. We propose a probabilistic framework based on the local complex-Gaussian model combined with non-negative matrix factorization. The time-varying mixing filters are modeled by …

Video Grounding

From Video Matching to Video Grounding G. Evangelidis, F. Diego, R. Horaud Abstract This paper addresses the background estimation problem for videos captured by moving cameras, referred to as video grounding. It essentially aims at reconstructing a video, as if it would be without foreground objects, e.g. cars or people. What differentiates video grounding from …