
Audio-Visual Speaker Localization Via Weighted Clustering

Abstract. In this paper we address the problem of detecting and locating speakers using audiovisual data. We address this problem in the framework of clustering. We propose a novel weighted clustering method based on a finite mixture model, which explores the idea of non-uniform weighting of observations. Weighted-data clustering techniques have already been proposed, but not in a generative setting as presented here. We introduce a weighted-data mixture model and formally derive the associated EM procedure. The clustering algorithm is applied to the problem of detecting and localizing a speaker over time using both visual and auditory observations gathered with a single camera and two microphones. Audiovisual fusion is enforced by introducing a cross-modal weighting scheme. We test the robustness of the method with experiments in two challenging scenarios: disambiguating between an active and a non-active speaker, and associating a speech signal with a person.
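To illustrate the weighted-data clustering idea, here is a minimal sketch of one common variant: an EM procedure for an isotropic Gaussian mixture in which each observation carries a fixed positive weight that scales its contribution to the M-step statistics. This is a simplification for illustration only — the function name `weighted_gmm_em` and all details below are hypothetical, and the paper's model is richer (there, the weights are themselves latent random variables estimated within EM; see [3]).

```python
import numpy as np

def weighted_gmm_em(X, w, K, n_iter=100):
    # Hypothetical sketch: fixed per-observation weights w[n] scale each
    # point's contribution to the M-step sufficient statistics.
    # X: (N, d) observations, w: (N,) positive weights, K: number of clusters.
    N, d = X.shape
    # Farthest-point initialization of the K means (deterministic).
    mu = [X[0]]
    for _ in range(K - 1):
        d2init = np.min(((X[:, None, :] - np.array(mu)[None]) ** 2).sum(-1), axis=1)
        mu.append(X[np.argmax(d2init)])
    mu = np.array(mu)
    var = np.full(K, X.var() + 1e-6)   # isotropic variances
    pi = np.full(K, 1.0 / K)           # mixing proportions
    for _ in range(n_iter):
        # E-step: responsibilities under isotropic Gaussians (log domain).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)          # (N, K)
        log_r = np.log(pi) - 0.5 * (d2 / var + d * np.log(2 * np.pi * var))
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: each responsibility is scaled by the observation weight,
        # so low-weight (unreliable) points barely move the parameters.
        wr = w[:, None] * r                                            # (N, K)
        Nk = wr.sum(axis=0) + 1e-12
        mu = (wr.T @ X) / Nk[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = (wr * d2).sum(axis=0) / (d * Nk) + 1e-6
        pi = Nk / Nk.sum()
    return mu, var, pi, r
```

In the audio-visual setting of the paper, such weights come from a cross-modal scheme: visual observations are weighted by auditory evidence and vice versa, so that observations supported by both modalities dominate the clustering.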


MLSP 2014 Paper


@inproceedings{gebru2014audiovisual,
  title={Audio-visual speaker localization via weighted clustering},
  author={Gebru, Israel D and Alameda-Pineda, Xavier and Horaud, Radu and Forbes, Florence},
  booktitle={Machine Learning for Signal Processing (MLSP), 2014 IEEE International Workshop on},
  year={2014}
}


Matlab code to reproduce our results is available upon request.

Result Videos

Below are results on audio-visual recordings from the AVTrack-1 dataset.

Related publications

[1] Gebru, I. D., Ba, S., Evangelidis, G., & Horaud, R. Audio-Visual Speech-Turn Detection and Tracking. In Latent Variable Analysis and Signal Separation (LVA/ICA), 2015. Research Page

[2] Gebru, I. D., Ba, S., Evangelidis, G., & Horaud, R. Tracking the Active Speaker Based on a Joint Audio-Visual Observation Model. In ICCV 2015 Workshop on 3D Reconstruction and Understanding with Video and Sound, 2015. Research Page

[3] Gebru, I. D., Alameda-Pineda, X., Forbes, F., & Horaud, R. EM Algorithms for Weighted-Data Clustering with Application to Audio-Visual Scene Analysis. arXiv preprint arXiv:1509.01509, 2015. Research Page



This research has received funding from the EU-FP7 STREP project EARS (#609465) and ERC Advanced Grant VHIA (#340113).