Xiaofei Li, Laurent Girin, Radu Horaud, Sharon Gannot. Estimation of the Direct-Path Relative Transfer Function for Supervised Sound-Source Localization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 24, number 11, 2016.
We address the problem of localizing single and multiple speech sources in reverberant and noisy rooms. The interchannel response (between two microphones) corresponding to the direct-path propagation of an audio source is a function of the source direction. In practice, this response is contaminated by noise and reverberation. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer functions of the two channels. We propose several methods to estimate the DP-RTF from the noisy and reverberant microphone signals in the short-time Fourier transform (STFT) domain. First, the convolutive transfer function approximation is adopted to accurately represent the impulse response of the sensors in the STFT domain. Second, the DP-RTF is estimated using the auto- and cross-power spectral densities at each frequency and over multiple frames. In the presence of stationary noise, an inter-frame spectral subtraction algorithm is proposed, which yields estimates of the noise-free auto- and cross-power spectral densities. Third, a consistency test is proposed to check whether a set of consecutive frames is associated with the same source. Finally, a complex-valued Gaussian mixture model (CGMM), whose components correspond to all the candidate source locations, is adopted to assign the DP-RTF observations to the speaker locations. After optimizing the CGMM-based objective function, both the number of sources and their locations are estimated by selecting the CGMM components with the largest weights. In addition, an entropy-based penalty term is added to the likelihood to impose sparsity over the set of CGMM component weights. This favors a small number of detected speakers relative to the large number of initial candidate source locations.
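To illustrate why the direct-path response is a function of the source direction, the following minimal sketch evaluates the DP-RTF definition above under a simplifying free-field assumption (pure delay plus 1/distance attenuation, no noise, no reverberation). It is not the estimation method of the paper. The microphone positions, the speed of sound, the 2 m source radius, and the helper names `direct_path_atf`, `dp_rtf`, and `source_at` are all hypothetical choices made for this example.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def direct_path_atf(source, mic, freq_hz):
    """Free-field transfer function from a point source to one microphone.

    Assumption: pure propagation delay plus 1/distance attenuation,
    with no reflections and no noise.
    """
    dist = math.dist(source, mic)
    delay = dist / SPEED_OF_SOUND
    return cmath.exp(-2j * math.pi * freq_hz * delay) / dist


def dp_rtf(source, mic_ref, mic_other, freq_hz):
    """Ratio between the direct-path transfer functions of the two channels."""
    numerator = direct_path_atf(source, mic_other, freq_hz)
    denominator = direct_path_atf(source, mic_ref, freq_hz)
    return numerator / denominator


def source_at(azimuth_deg, radius=2.0):
    """Hypothetical helper: a source on a circle around the array centre."""
    angle = math.radians(azimuth_deg)
    return (radius * math.cos(angle), radius * math.sin(angle))


# Hypothetical two-microphone array, 20 cm apart on the x axis.
MIC_LEFT = (-0.1, 0.0)
MIC_RIGHT = (0.1, 0.0)

for azimuth in (30, 90, 150):
    value = dp_rtf(source_at(azimuth), MIC_LEFT, MIC_RIGHT, freq_hz=500.0)
    # Magnitude encodes the level difference, phase the time difference.
    print(azimuth, round(abs(value), 3), round(cmath.phase(value), 3))
```

In this idealized setting, a source straight ahead (90 degrees) is equidistant from both microphones, so the ratio is close to 1, while sources to either side give reciprocal values. In a real room the measured interchannel response deviates from this ideal because of noise and reverberation, which is what motivates the estimation methods described above.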
Video: Sound-source localization with the direct-path relative transfer function
An example of online multiple-speaker localization. Top: the CGMM weights over time. Bottom: the black circles represent the speakers detected by selecting the peaks of the CGMM weights; the gray curves represent the ground-truth trajectories of the active speakers.
Xiaofei Li, Laurent Girin, Fabien Badeig, Radu Horaud. Reverberant Sound Localization with a Robot Head Based on Direct-Path Relative Transfer Function. International Conference on Intelligent Robots and Systems (IROS) 2016. [pdf] [slides] [bibtex] [Matlab code]
Xiaofei Li, Laurent Girin, Radu Horaud, Sharon Gannot. Estimation of Relative Transfer Function in the Presence of Stationary Noise Based on Segmental Power Spectral Density Matrix Subtraction. IEEE ICASSP 2015. [pdf] [poster] [dataset] [bibtex] [Matlab code]