Research

You are welcome to browse through our recent and current research results (alphabetical order). A broad and non-exhaustive list of the team’s research topics may be found on our homepage. Some of this research is directly linked to recently submitted or accepted publications that can be found here. Please also refer to our complete list of publications.

Acoustic Space Learning on Binaural Manifolds

Acoustic Space Learning for Sound-Source Separation and Localization on Binaural Manifolds 2016 IJNS Award for Outstanding Contributions to Neural Systems Antoine Deleforge, Florence Forbes, and Radu Horaud International Journal of Neural Systems, 25 (1), 2015 PDF on arXiv | BibTeX | HAL | Additional papers | Matlab Code | Dataset | Videos and more  Abstract In this paper we …

Audio Source Separation: Yet Another NMF-Based Formulation

An Inverse-Gamma Source Variance Prior with Factorized Parameterization for Audio Source Separation IEEE International Conference on Acoustics, Speech and Signal Processing, 2016 D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, R. Horaud Abstract In this paper we present a new statistical model for the power spectral density (PSD) of an audio signal and its application …

Audio-Visual Multi-Speaker Tracking

Exploiting the Complementarity of Audio and Visual Data in Multi-Speaker Tracking Yutong Ban, Laurent Girin, Xavier Alameda-Pineda and Radu Horaud IEEE ICCV Workshop on Computer Vision for Audio-Visual Media, October 2017 PDF | Abstract | Results | Acknowledgements Abstract Multi-speaker tracking is a central problem in human-robot interaction. In this context, exploiting auditory and …

Audio-Visual Speaker Detection, Localization and Interaction with NAO

Publications | Videos | The NAO Robot   Abstract. In this research we address the problem of audio-visual speaker detection. We introduce an online system working on the humanoid robot NAO. The scene is perceived with two cameras and two microphones. A multimodal Gaussian Mixture Model fuses the information extracted from the auditory and visual sensors. The system …

Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion

IEEE Transactions on Pattern Analysis and Machine Intelligence special issue on Learning with Shared Information for Computer Vision and Multimedia Analysis Israel D. Gebru    Sileye Ba    Xiaofei Li    Radu P. Horaud [PDF on arXiv] [IEEE Xplore] [HAL] [DATASET] [BibTeX] Abstract Speaker diarization consists of assigning speech signals to speakers engaged in dialog. …

Audio-visual speaker localization via weighted clustering

Abstract In this paper we address the problem of detecting and locating speakers using audiovisual data. We address this problem in the framework of clustering. We propose a novel weighted clustering method based on a finite mixture model which explores the idea of non-uniform weighting of observations. Weighted-data clustering techniques have already been proposed, but …
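The idea of non-uniform weighting of observations in a finite mixture model can be illustrated with a minimal sketch. It assumes a 1-D Gaussian mixture in which each weight acts as a fractional count in the M-step; this is one simple weighting scheme chosen for illustration and is not necessarily the formulation used in the paper. The function name `weighted_em_1d` and its parameters are hypothetical.

```python
import math


def weighted_em_1d(xs, ws, means, n_iter=50, var_floor=1e-6):
    """Fit a 1-D Gaussian mixture in which observation xs[i] counts ws[i] times.

    Hypothetical helper: weights act as fractional counts in the M-step.
    Returns (priors, means, variances).
    """
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    priors = [1.0 / k] * k
    total_weight = sum(ws)

    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in xs:
            dens = [
                priors[j]
                * math.exp(-0.5 * (x - means[j]) ** 2 / variances[j])
                / math.sqrt(2.0 * math.pi * variances[j])
                for j in range(k)
            ]
            total = sum(dens)
            if total > 0.0:
                resp.append([d / total for d in dens])
            else:
                resp.append([1.0 / k] * k)

        # M-step: every sufficient statistic is scaled by the observation weight.
        for j in range(k):
            mass = sum(w * r[j] for w, r in zip(ws, resp))
            if mass <= 0.0:
                continue  # empty component: keep its previous parameters
            means[j] = sum(w * r[j] * x for w, r, x in zip(ws, resp, xs)) / mass
            var = sum(
                w * r[j] * (x - means[j]) ** 2 for w, r, x in zip(ws, resp, xs)
            ) / mass
            variances[j] = max(var, var_floor)
            priors[j] = mass / total_weight

    return priors, means, variances
```

With a single component, the estimated mean is simply the weighted average of the data, so an observation with a larger weight pulls the cluster center toward itself.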

Audio-visual Speech-Turn Detection and Tracking

Abstract Speaker diarization is an important component of multi-party dialog systems in order to assign speech-signal segments among participants. Diarization may well be viewed as the problem of detecting and tracking speech turns. It is proposed to address this problem by modeling the spatial coincidence of visual and auditory observations and by combining this coincidence …

Audio-Visual Tracking by Density Approximation

Audio-Visual Tracking by Density Approximation in a Sequential Bayesian  Filtering Framework Israel D. Gebru   Christine Evers* Patrick A. Naylor*    Radu P. Horaud IEEE Workshop on Hands-free Speech Communication and Microphone Arrays Best Paper Award *Imperial College London [PDF] [Slides] [BibTeX] [Code] [Video] Abstract This paper proposes a novel audio-visual tracking approach that exploits constructively …

Continuous Action Recognition

Continuous Action Recognition Based on Sequence Alignment Kaustubh Kulkarni, Georgios Evangelidis, Jan Cech and Radu Horaud International Journal of Computer Vision (online) vol. 112, issue 1, March 2015, pp. 90-114 PDF on arXiv | BibTeX | PDF from HAL | Matlab code | Additional Papers | Videos Abstract: Continuous action recognition is more challenging than isolated recognition because classification and segmentation must be simultaneously carried …

Data Challenge

This page describes the data challenge that has been organized for Grenoble master's students. More practical information (dates, rules…) can be found at https://msiam.imag.fr/collab:data_challenge. The goal is to develop an audio-visual diarization model. The data are based on the AVDIAR dataset. We provide the following observations that must be used as input for your model: …

Deep Mixture of Linear Inverse Regressions

Deep Mixture of Linear Inverse Regressions Applied to Head-Pose Estimation Stéphane Lathuilière, Rémi Juge, Pablo Mesejo, Rafael Munoz-Salinas, and Radu Horaud IEEE Conference on Computer Vision and Pattern Recognition, July 2017 [ pdf ] [ code ] [ BibTeX ]   Abstract: Convolutional Neural Networks (ConvNets) have become the state-of-the-art for many classification and regression …

Depth (TOF) and Stereo Fusion

Fusion of Range and Stereo Data for High-resolution Scene-modeling IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.37, No.11, 2015, pp. 2178-2192 (IEEE Xplore) G. Evangelidis, M. Hansard, and R. Horaud Abstract This paper addresses the problem of range-stereo fusion, for the construction of high-resolution depth maps. In particular, we combine low-resolution depth data with …

Direct-Path Relative Transfer Function for Audio Source Localization

Sound-Source Localization in Reverberant Rooms Based on the Direct-Path Relative Transfer Function Xiaofei Li, Laurent Girin, Radu Horaud, Sharon Gannot. Estimation of the Direct-Path Relative Transfer Function for Supervised Sound-Source Localization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 24, number 11, 2016.  [pdf] [bibtex] Xiaofei Li, Laurent Girin, Fabien Badeig, Radu Horaud. Reverberant …

EM Algorithms for Weighted-Data Clustering with Application to Audio-Visual Scene Analysis

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) Volume 38, number 12, pages 2402 – 2415, December 2016 Israel D. Gebru    Xavier Alameda-Pineda    Florence Forbes    Radu P. Horaud [ PDF on arXiv ]   [ PDF on IEEE Xplore ]   [ BibTex ]   [ CODE & DATASET ]   [ …

Eye Gaze and Visual Focus

We address the problem of estimating the visual focus of attention (VFOA), i.e., who is looking at whom. This is of particular interest in human-robot interactive scenarios, e.g. when the task requires identifying targets of interest and tracking them over time. We make the following contributions. We propose a Bayesian temporal model that …

Finding Audio-Visual Events in Informal Social Gatherings

by Xavier Alameda-Pineda, Vasil Khalidov, Florence Forbes and Radu Horaud IEEE/ACM International Conference on Multimodal Interaction, 2011 Outstanding Paper Award Abstract In this paper we address the problem of detecting and localizing objects that can be both seen and heard, e.g., people. This may be solved within the framework of data clustering. We propose a new …

Geometric Sound Source Localization

A Geometric Approach to Sound Source Localization from Time-Delay Estimates Xavier Alameda-Pineda and Radu Horaud IEEE/ACM Transactions on Audio, Speech and Language Processing, 22(6), pages 1082-1095, June 2014 PDF on arXiv | BibTeX | HAL | Matlab toolbox | Additional Papers | Online multimedia Abstract: We address the problem of sound-source localization from time-delay estimates using arbitrarily-shaped non-coplanar microphone arrays. A novel …
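As a rough illustration of what a time-delay estimate encodes geometrically, the sketch below computes the delay predicted between two microphones for a candidate source position, and the squared mismatch against a set of measured delays. It is a generic forward model under a free-field assumption, not the method of the paper; the names `time_delay` and `delay_residual`, and the speed-of-sound constant, are assumptions made for this example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for dry air at roughly 20 °C


def time_delay(source, mic_a, mic_b, c=SPEED_OF_SOUND):
    """Arrival time at mic_a minus arrival time at mic_b, in seconds."""
    return (math.dist(source, mic_a) - math.dist(source, mic_b)) / c


def delay_residual(source, mics, measured, c=SPEED_OF_SOUND):
    """Sum of squared errors between predicted and measured delays.

    `measured` maps a microphone index pair (i, j) to a delay estimate in seconds.
    """
    return sum(
        (time_delay(source, mics[i], mics[j], c) - tau) ** 2
        for (i, j), tau in measured.items()
    )
```

A candidate position that explains all measured delays gives a residual of zero, so localization can be viewed as searching for the position that minimizes this residual.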

Head Pose Estimation

Head Pose Estimation via Probabilistic High-Dimensional Regression Best Student Paper Award (2nd place) V. Drouard, S. Ba, G. Evangelidis, A. Deleforge, and R. Horaud IEEE International Conference on Image Processing (ICIP’15) Extended version published in IEEE Transactions on Image Processing, available on HAL Also, please visit our High-dimensional regression webpage IEEE Publication | HAL Publication …

Head-Pose Tracking

Switching Linear Inverse-Regression Model for Tracking Head Pose V. Drouard, S. Ba, and R. Horaud IEEE Winter Conference on Application of Computer Vision (WACV’17) IEEE Publication | HAL Publication | Abstract | BibTex | Results | Matlab code | Acknowledgement Abstract We propose to estimate the head-pose angles (pitch, yaw, and roll) by simultaneously predicting the …

High-Dimensional Regression

High-Dimensional Regression with Gaussian Mixtures and Partially-Latent Response Variables Statistics and Computing, Springer, 2015, vol. 25, number 5, pages 893-911 Antoine Deleforge, Florence Forbes and Radu Horaud  Abstract | arXiv | HAL| Springer | Supplementary materials | Matlab toolbox | Slides | Citation and Bibtex  Abstract: The problem of approximating high-dimensional data with a low-dimensional representation is addressed. The article makes the …

Joint Audio Source Separation and Diarisation

An EM Algorithm for Joint Source Separation and Diarisation of Multichannel Convolutive Speech Mixtures IEEE International Conference on Acoustics, Speech and Signal Processing, 2017 D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, R. Horaud Abstract We present a probabilistic model for joint source separation and diarisation of multichannel convolutive speech mixtures. We build upon the framework …

Joint Registration of Multiple Point Sets

A Generative Model for the Joint Registration of Multiple Point Sets European Conference on Computer Vision (Computer Vision – ECCV 2014) An extended version submitted to IEEE TPAMI is available on arXiv: https://arxiv.org/abs/1609.01466 Lecture Notes in Computer Science Volume 8695, 2014, pp 109-122 G. Evangelidis, D. Kounades-Bastian, R. Horaud, E. Psarakis   Abstract This paper describes …

NAOLab

A Distributed Architecture for Interacting with NAO NAOLab is a middleware library for developing robotic applications in C, C++, Python and Matlab, using the humanoid robot NAO Software Download | Publications | People | Support | Acknowledgements NAOLab is a middleware for the development of robotic applications in C, C++, Python and Matlab, using the humanoid robot NAO …

Neural Network Reinforcement Learning for Audio-Visual Gaze Control in Human-Robot Interaction

Stéphane Lathuilière, Benoit Massé, Pablo Mesejo, and Radu Horaud This project introduces a novel neural network-based reinforcement learning approach for robot gaze control. Our approach enables a robot to learn and adapt its gaze control strategy for human-robot interaction without the use of external sensors or human supervision. The robot learns to focus its attention …

Noise Power Spectral Density Estimation

Non-stationary Noise Power Spectral Density Estimation Based on Regional Statistics Xiaofei Li, Laurent Girin, Sharon Gannot and Radu Horaud The 41st IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016. [HAL] [ pdf ] [ Matlab code ] Abstract Estimating the noise power spectral density (PSD) is essential for single channel speech enhancement …

Online Variational Bayesian Tracking

Variational Bayesian Framework for Multi-Person Tracking Sileye Ba, Yutong Ban, Xavi Alameda-Pineda, Alessio Xompero, and Radu Horaud Papers | Matlab code | Results Object tracking is a ubiquitous problem in computer vision with many applications in human-machine and human-robot interaction, augmented reality, driving assistance, surveillance, etc. Although thoroughly investigated, tracking multiple persons remains a challenging …

Point Registration with Expectation-Maximization

Rigid and Articulated Point Registration with Expectation Conditional Maximization Radu Horaud, Florence Forbes, Manuel Yguel, Guillaume Dewaele, and Jian Zhang IEEE Transactions on Pattern Analysis and Machine Intelligence, 33 (3), 587-602, March 2011 Abstract  | code | pdf from HAL | IEEEXplore | Bibtex | Video of a toy example Abstract. This paper addresses the …

Recognition of Group Activities in Videos

Recognition of Group Activities in Videos Based on Single- and Two-Person Descriptors Stéphane Lathuilière, Georgios Evangelidis, Radu Horaud IEEE Winter Conference on Application of Computer Vision (WACV’17) IEEE Publication | HAL Publication | Abstract | BibTex | Results | Acknowledgement Abstract Group activity recognition from videos is a very challenging problem that has barely been addressed. …

Scene Flow Estimation

Scene Flow Estimation by Growing Correspondence Seeds Jan Cech, Jordi Sanchez-Riera, and Radu Horaud IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3129-3136, 2011 Abstract | Code | HAL | IEEEXplore | Bibtex | Video | Papers Software package as a Matlab toolbox (source code of binaries) available from Jan Cech’s website or here. Abstract. A simple seed growing algorithm for estimating …

Separation of Time-Varying Audio Mixtures

A Variational EM Algorithm for the Separation of Time-Varying Convolutive Audio Mixtures IEEE/ACM Transactions on Audio, Speech and Language Processing D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, R. Horaud Abstract This paper addresses the problem of separating audio sources from time-varying convolutive mixtures. We propose a probabilistic framework based on the local complex-Gaussian model …

Skeletal Quads

Human Action and Gesture Recognition Using Joint Quadruples Description | Publications | Code G. Evangelidis, G. Singh, R. Horaud Description Recent advances on human motion analysis have made the extraction of human skeleton structure feasible, even from single depth images. This structure has been proven quite informative for discriminating actions in a recognition scenario. In …

Speech Dereverberation Based on Convolutive Transfer Function

Blind Multi-Channel Identification and Equalization for Dereverberation and Noise Reduction based on Convolutive Transfer Function Xiaofei Li, Radu Horaud, and Sharon Gannot Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing (arXiv)   Abstract. This paper addresses the problems of blind channel identification and multichannel equalization for speech dereverberation and noise reduction. The time-domain …

Supervised Sound-Source Localization

Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression Antoine Deleforge, Radu Horaud, Yoav Y. Schechner and Laurent Girin. IEEE/ACM Transactions on Audio, Speech and Language Processing, 23(4), 718-731, April 2015 Abstract | Videos | Dataset | Matlab code | pdf from HAL | IEEE Xplore | Bibtex   Setup: Two microphones plugged into the …

Three-Dimensional Sensors

Depth Cameras and Associated Computer Vision Methods Radu Horaud (INRIA), Miles Hansard (QMUL), and Georgios Evangelidis (DAQRI) The emergence of three-dimensional sensors, e.g., Microsoft Kinect v1 and v2, Asus Xtion Pro Live (structured-light sensors), Mesa Imaging SR4000, or Velodyne HDL-64 laser range finder, to cite just a few, has introduced a revolution in the …

Tracking and Visual Servoing

Tracking a Varying Number of People with a Visually-Controlled Robotic Head Y. Ban, X. Alameda-Pineda, F. Badeig, S. Ba, and R. Horaud IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’17) Novel Technology Paper Award Finalist PDF | Abstract | Slides | Results | Acknowledgements Abstract Multi-person tracking (MOT) using a robot platform is of …

Tracking the Active Speaker Based on Joint Audio-Visual Observation

IEEE International Conference on Computer Vision Workshops, Dec 2015 Israel D. Gebru    Sileye Ba    Georgios Evangelidis    Radu P. Horaud [ PDF ]     [ BibTex ]   [ VIDEO ]   [ DATASET ] Abstract Any multi-party conversation system benefits from speaker diarization, that is, the assignment of speech signals among …

Video Grounding

From Video Matching to Video Grounding G. Evangelidis, F. Diego, R. Horaud Abstract This paper addresses the background estimation problem for videos captured by moving cameras, referred to as video grounding. It essentially aims at reconstructing a video as if it contained no foreground objects, e.g. cars or people. What differentiates video grounding from …