Abstract. In this paper we address the problem of detecting and locating speakers using audiovisual data. We cast this problem in a clustering framework and propose a novel weighted clustering method based on a finite mixture model that exploits the idea of non-uniform weighting of observations. Weighted-data clustering techniques have been proposed before, but not in the generative setting presented here. We introduce a weighted-data mixture model and formally derive the associated EM procedure. The clustering algorithm is applied to the problem of detecting and localizing a speaker over time using both visual and auditory observations gathered with a single camera and two microphones. Audiovisual fusion is enforced by introducing a cross-modal weighting scheme. We test the robustness of the method with experiments in two challenging scenarios: disambiguating between an active and a non-active speaker, and associating a speech signal with a person.
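To make the idea of non-uniform observation weighting concrete, here is a minimal NumPy sketch of a weighted EM for a Gaussian mixture, in which each observation's weight simply rescales its contribution to the M-step sufficient statistics. This is an illustrative simplification, not the generative weighted-data model derived in the paper (there the weights enter the observation model itself and, in the audiovisual setting, are obtained cross-modally); the function name `weighted_gmm_em` and its arguments are ours.

```python
import numpy as np

def weighted_gmm_em(X, w, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture where each observation n carries a
    non-negative weight w[n] that rescales its contribution to the
    M-step statistics (illustration only; not the paper's generative
    weighted-data model, in which the weights enter the likelihood)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]              # random means
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: log responsibilities log r[n, k].
        log_r = np.empty((N, K))
        for k in range(K):
            diff = X - mu[k]
            inv = np.linalg.inv(Sigma[k])
            maha = np.einsum('ni,ij,nj->n', diff, inv, diff)
            _, logdet = np.linalg.slogdet(Sigma[k])
            log_r[:, k] = np.log(pi[k]) - 0.5 * (maha + logdet + D * np.log(2 * np.pi))
        log_r -= log_r.max(axis=1, keepdims=True)        # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: every sufficient statistic is weighted by w[n].
        wr = r * w[:, None]
        Nk = wr.sum(axis=0) + 1e-12
        pi = Nk / Nk.sum()
        mu = (wr.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (wr[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, r
```

In the spirit of the paper's cross-modal weighting, `X` could hold localization observations from one modality and `w` confidence scores derived from the other.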
Paper
@inproceedings{gebru2014audio,
  title        = {Audio-visual speaker localization via weighted clustering},
  author       = {Gebru, Israel D. and Alameda-Pineda, Xavier and Horaud, Radu and Forbes, Florence},
  booktitle    = {Machine Learning for Signal Processing (MLSP), 2014 IEEE International Workshop on},
  pages        = {1--6},
  year         = {2014},
  organization = {IEEE}
}
Code
Matlab code to reproduce our results is available upon request.
Related publications
[1] Gebru, I. D., Ba, S., Evangelidis, G., & Horaud, R. Audio-Visual Speech-Turn Detection and Tracking. In Latent Variable Analysis and Signal Separation (LVA/ICA), 2015. Research Page
[2] Gebru, I. D., Ba, S., Evangelidis, G., & Horaud, R. Tracking the Active Speaker Based on a Joint Audio-Visual Observation Model. In ICCV 2015 Workshop on 3D Reconstruction and Understanding with Video and Sound, 2015. Research Page
[3] Gebru, I. D., Alameda-Pineda, X., Forbes, F., & Horaud, R. EM Algorithms for Weighted-Data Clustering with Application to Audio-Visual Scene Analysis. arXiv preprint arXiv:1509.01509. Research Page
Acknowledgements
This research has received funding from the EU-FP7 STREP project EARS (#609465) and ERC Advanced Grant VHIA (#340113).