MSc. Project on Coupled Audio-visual Multi-speaker Tracking

Short description: Multi-speaker tracking has been widely investigated, and the Perception Team has contributed a consistent methodological framework based on variational Bayesian techniques [1-4]. Audio-visual tracking methods often first map all auditory and visual information into a common space, and only then run a tracking algorithm. In most cases, however, the auditory and visual observations are of very different natures and would call for distinct, yet interdependent, latent-space models. In this Master's thesis we would like to investigate whether multi-speaker tracking can be performed with coupled, natural latent spaces, rather than with a single artificial latent space.
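To make the idea concrete, the sketch below simulates a toy coupled latent-space model. This is purely illustrative and not the model of [1-4]: the linking map G, the coupling strength, the dynamics, and the noise levels are all assumptions chosen for the example. A visual latent state (an image-plane position) and an audio latent state (a scalar, e.g. a direction-of-arrival cue) live in different spaces, and the audio state is softly pulled toward a linear projection of the visual state, so the two latent chains stay coupled while each modality keeps its own observation model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy coupled linear-Gaussian model (illustrative assumption, not the papers' model).
T = 100
G = np.array([[0.5, 0.5]])   # assumed linear link: audio latent ~ G @ visual latent
lam = 0.3                    # assumed coupling strength (0 = independent chains)

a = np.zeros((T, 1))         # audio latent trajectory (1-D)
v = np.zeros((T, 2))         # visual latent trajectory (2-D)
v[0] = [1.0, -1.0]
a[0] = G @ v[0]

for t in range(1, T):
    # Visual latent: slow drift plus process noise.
    v[t] = v[t - 1] + 0.01 + 0.05 * rng.standard_normal(2)
    # Audio latent: own dynamics, softly pulled toward the projection of the
    # visual latent -- this term is what couples the two latent spaces.
    a[t] = a[t - 1] + lam * (G @ v[t] - a[t - 1]) + 0.05 * rng.standard_normal(1)

# Each modality observes only its own latent state, with its own noise level.
y_audio = a + 0.1 * rng.standard_normal((T, 1))
y_video = v + 0.2 * rng.standard_normal((T, 2))

# The coupling keeps the audio latent close to the projected visual latent.
coupling_error = float(np.mean(np.abs(a - v @ G.T)))
```

Inference in such a model (e.g. with variational Bayes, as in the project's framework) would then alternate between the two latent chains, each informed by its own observations and by the coupling term, instead of forcing both modalities into one shared space.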

Environment: This project will be carried out in the Perception Team at Inria Grenoble Rhône-Alpes. Research progress will be closely supervised by Dr. Xavier Alameda-Pineda and Dr. Radu Horaud, head of the Perception Team. The team has the computational resources (GPU and CPU) necessary to carry out the proposed research.

References:
[1] Y. Ban, X. Alameda-Pineda, L. Girin, and R. Horaud, “Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers,” 2018.
[2] X. Li, Y. Ban, L. Girin, X. Alameda-Pineda, and R. Horaud, “Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environment,” 2018.
[3] Y. Ban, X. Li, X. Alameda-Pineda, L. Girin, and R. Horaud, “Accounting for Room Acoustics in Audio-Visual Multi-Speaker Tracking,” in IEEE ICASSP, 2018.
[4] Y. Ban, L. Girin, X. Alameda-Pineda, and R. Horaud, “Exploiting the Complementarity of Audio-Visual Data for Probabilistic Multi-Speaker Tracking,” in IEEE ICCVW, 2017.