
Direct-Path Relative Transfer Function for Speaker Localization

Xiaofei Li, Yutong Ban, Laurent Girin, Xavier Alameda-Pineda and Radu Horaud. Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments. IEEE Journal of Selected Topics in Signal Processing, 13 (1), pp. 88 – 103, 2019.

 

Xiaofei Li, Laurent Girin, Radu Horaud, Sharon Gannot. Multiple-Speaker Localization Based on Direct-Path Features and Likelihood Maximization with Spatial Sparsity Regularization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25 (10), pp. 1997 – 2012, 2017.

 

Xiaofei Li, Laurent Girin, Radu Horaud, Sharon Gannot. Estimation of the Direct-Path Relative Transfer Function for Supervised Sound-Source Localization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 24, number 11, 2016.

[pdf] [arXiv] [HAL] [IEEEXplore] [bibtex] [matlab code]


We address the problem of localizing single and multiple speech sources in reverberant and noisy rooms. For a pair of microphones, the interchannel response corresponding to the direct-path propagation of an audio source is a function of the source direction. In practice, this response is contaminated by noise and reverberation. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer functions of the two channels. We propose several methods to estimate the DP-RTF from the noisy and reverberant microphone signals in the short-time Fourier transform (STFT) domain. First, the convolutive transfer function approximation is adopted to accurately represent the impulse response of the sensors in the STFT domain. Second, the DP-RTF is estimated from the auto- and cross-power spectral densities at each frequency, computed over multiple frames. In the presence of stationary noise, an inter-frame spectral subtraction algorithm is proposed, which makes it possible to estimate noise-free auto- and cross-power spectral densities. Third, a consistency test is proposed to check whether a set of consecutive frames is associated with the same source. Finally, a complex-valued Gaussian mixture model (CGMM), whose components correspond to all the candidate source locations, is adopted to assign the DP-RTF observations to the speaker locations. After optimizing the CGMM-based objective function, both the number of sources and their locations are estimated by selecting the CGMM components with the largest weights. In addition, an entropy-based penalty term is added to the likelihood to impose sparsity over the set of CGMM component weights. This favors a small number of detected speakers relative to the large number of initial candidate source locations.
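To make the link between the interchannel response and the source direction concrete, here is a minimal Python sketch under strongly simplified assumptions. It is not the method of the papers above: it assumes a far-field source, two microphones in free field (no reverberation, so a single-tap transfer function instead of the convolutive one), and a small amount of additive noise. The function names, the 0.15 m microphone spacing, the frequency range and the 5-degree candidate grid are all hypothetical choices made for illustration. Under these assumptions the DP-RTF reduces to a pure phase term, it can be estimated as the ratio of the cross- to auto-power spectral density averaged over frames, and the direction can be found by comparing the estimate with the candidate directions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def free_field_dp_rtf(azimuth_rad, freqs_hz, mic_distance_m=0.15):
    """Far-field, free-field DP-RTF of microphone 2 relative to microphone 1.

    Simplifying assumption: the two direct-path transfer functions differ
    only by a delay, so their ratio is exp(-j * 2 * pi * f * tdoa).
    Accepts a scalar azimuth or an array of candidate azimuths.
    """
    tdoa = mic_distance_m * np.cos(azimuth_rad) / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * np.asarray(tdoa)[..., None] * freqs_hz)


def estimate_rtf_from_psd(X1, X2):
    """Single-tap RTF estimate per frequency: cross-PSD / auto-PSD.

    X1, X2: complex STFT coefficients of shape (n_freqs, n_frames).
    The PSDs are approximated by averaging over frames.
    """
    cross_psd = np.mean(X2 * np.conj(X1), axis=1)
    auto_psd = np.mean(np.abs(X1) ** 2, axis=1)
    return cross_psd / auto_psd


# Simulate STFT coefficients for a source at 60 degrees (toy data).
rng = np.random.default_rng(0)
freqs = np.linspace(200.0, 4000.0, 64)
true_rtf = free_field_dp_rtf(np.deg2rad(60.0), freqs)

shape = (freqs.size, 200)  # (n_freqs, n_frames)
X1 = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 0.01 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
X2 = true_rtf[:, None] * X1 + noise

# Estimate the RTF, then pick the closest candidate direction.
est_rtf = estimate_rtf_from_psd(X1, X2)
candidates_deg = np.arange(0, 181, 5)
candidate_rtfs = free_field_dp_rtf(np.deg2rad(candidates_deg), freqs)
distances = np.linalg.norm(candidate_rtfs - est_rtf[None, :], axis=1)
estimated_azimuth_deg = int(candidates_deg[np.argmin(distances)])
print(estimated_azimuth_deg)
```

In a reverberant room the single-tap ratio used here mixes the direct path with reflections, which is the motivation for the convolutive transfer function model and the noise-compensation steps described above.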

Video: Sound-source localization with the direct-path relative transfer function

 

An example of online multiple-speaker localization. Top: the CGMM weights over time. Bottom: the black circles represent the speakers detected by selecting the peaks of the CGMM weights; the gray curves represent the ground-truth trajectories of the active speakers.
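The peak-selection step mentioned in the caption can be illustrated with a short Python sketch. This is only an assumed, simplified version: the function name `select_weight_peaks`, the `min_weight` threshold of 0.1 and the toy weight vector are hypothetical and are not taken from the papers. Given the CGMM weights over the candidate directions at one time step, it keeps the local maxima whose weight exceeds the threshold.

```python
import numpy as np


def select_weight_peaks(weights, min_weight=0.1):
    """Return indices of local maxima of `weights` that are >= min_weight.

    weights: 1-D array of mixture weights, one per candidate direction,
    ordered by direction. The threshold is a hypothetical tuning parameter.
    """
    w = np.asarray(weights, dtype=float)
    # Pad with -inf so that the first and last candidates can also be peaks.
    padded = np.concatenate(([-np.inf], w, [-np.inf]))
    is_peak = (w > padded[:-2]) & (w >= padded[2:]) & (w >= min_weight)
    return np.flatnonzero(is_peak)


# Toy weights over nine candidate directions, with two dominant components.
weights = [0.02, 0.05, 0.40, 0.06, 0.02, 0.03, 0.35, 0.04, 0.03]
detected = select_weight_peaks(weights)
print(detected.tolist())  # [2, 6]
```

With sparse weights, as encouraged by the entropy-based penalty, most candidates fall well below the threshold, so the number of detected peaks gives the estimated number of speakers.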


Additional papers

Xiaofei Li, Laurent Girin, Fabien Badeig, Radu Horaud. Reverberant Sound Localization with a Robot Head Based on Direct-Path Relative Transfer Function. International Conference on Intelligent Robots and Systems (IROS) 2016. [pdf] [slides] [bibtex] [matlab code]

Xiaofei Li, Laurent Girin, Radu Horaud, Sharon Gannot. Estimation of Relative Transfer Function in the Presence of Stationary Noise Based on Segmental Power Spectral Density Matrix Subtraction. IEEE ICASSP 2015. [pdf] [poster] [dataset] [bibtex] [matlab code]

Xiaofei Li, Radu Horaud, Laurent Girin, Sharon Gannot. Local Relative Transfer Function for Sound Source Localization. EUSIPCO 2015. [pdf] [slides] [dataset] [bibtex]