Return to Research

Tracking the Active Speaker Based on Joint Audio-Visual Observation

IEEE International Conference on Computer Vision Workshops, Dec 2015


Any multi-party conversation system benefits from speaker diarization, that is, the assignment of speech signals among the participants. We here cast the diarization problem into a tracking formulation whereby the active speaker is detected and tracked over time. A probabilistic tracker exploits the on-image (spatial) coincidence of visual and auditory observations and infers a single latent variable which represents the identity of the active speaker. Both visual and auditory observations are explained by a recently proposed weighted-data mixture model, while several options for the speaking turns dynamics are fulfilled by a multi-case transition model. The modules that translate raw audio and visual data into on-image observations are also described in detail. The performance of the proposed tracker is tested on challenging data-sets that are available from recent contributions which are used as baselines for comparison.


ICCV 2015 Workshop Paper


  title={Tracking the Active Speaker Based on a Joint Audio-Visual Observation Model},
  author={Gebru, Israel Dejene and Ba, Sileye and Evangelidis, Georgios and Horaud, Radu},
  booktitle={ICCV 3D Reconstruction and Understanding with Video and Sound workshop},


The AVTrack-1 dataset related to this work can be downloaded from here. Matlab code to reproduce our result is available upon request.

Result Videos

Below are some of our results on AVTrack-1 dataset.

Some of Our Related Works

[1] Gebru, I. D., Ba, S., Evangelidis, G., & Horaud, R., Audio-Visual Speech-Turn Detection and Tracking. In Latent Variable Analysis and Signal Separation, LVA/ICA 2015.Research Page

[2] Gebru, I. D., Alameda-Pineda, X., Horaud, R., & Forbes, F., Audio-visual speaker localization via weighted clustering. In IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2014. Research Page

[3] Gebru, I. D., Alameda-Pineda, X., Forbes, F., & Horaud, R., EM algorithms for weighted-data clustering with application to audio-visual scene analysis. arXiv preprint arXiv:1509.01509. Research Page



This research has received funding from the EU-FP7 STREP project EARS (#609465) and ERC Advanced Grant VHIA (#340113).