Context: Over the past years, variational autoencoders (VAEs) have proven effective for generative modeling of complex signals such as speech and audio [1]. Recently, they have been successfully applied to audio-visual speech separation (AVSS) [2], where the goal is to separate a target speech signal from a mixture of several speech signals, using the visual information about the target speaker provided by lip movements. This is done in an unsupervised way, without training on noise or specific mixtures: an audio-visual probabilistic generative model is learned for clean speech and then combined with a noise model at test time. In VAE-based AVSS, one can learn either a universal generative model, trained on many different speakers, or a speaker-dependent model. While a speaker-dependent model yields better performance, it is less realistic: the identity of the target speaker is rarely known in advance, and the only available side information is the visual stream associated with that speaker.
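For illustration, the sketch below is a minimal PyTorch model of an audio-visual VAE for clean speech, in the spirit of [1, 2]. It is not the architecture of [2]: the layer sizes, the Tanh activations, the zero-mean complex Gaussian speech model parameterized by a predicted log-variance, and the way the visual embedding conditions both the encoder and the decoder are illustrative assumptions only.

# Minimal sketch of an audio-visual VAE for clean-speech modeling
# (illustrative assumptions, not the exact model of [2]).
import torch
import torch.nn as nn

class AudioVisualVAE(nn.Module):
    def __init__(self, n_freq=513, n_visual=128, n_latent=32, n_hidden=128):
        super().__init__()
        # Encoder: maps a speech power-spectrum frame plus a visual
        # embedding to the parameters of the approximate posterior q(z | s, v).
        self.encoder = nn.Sequential(
            nn.Linear(n_freq + n_visual, n_hidden), nn.Tanh())
        self.enc_mean = nn.Linear(n_hidden, n_latent)
        self.enc_logvar = nn.Linear(n_hidden, n_latent)
        # Decoder: maps (z, v) to the log-variance of a zero-mean complex
        # Gaussian model of the clean-speech STFT frame.
        self.decoder = nn.Sequential(
            nn.Linear(n_latent + n_visual, n_hidden), nn.Tanh(),
            nn.Linear(n_hidden, n_freq))

    def forward(self, s_pow, v):
        h = self.encoder(torch.cat([s_pow, v], dim=-1))
        mean, logvar = self.enc_mean(h), self.enc_logvar(h)
        # Reparameterization trick: sample z from q(z | s, v).
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
        log_var_s = self.decoder(torch.cat([z, v], dim=-1))
        return log_var_s, mean, logvar

At test time, such a pre-trained clean-speech model would be combined with an unsupervised noise model (e.g. of NMF type, as is common in this line of work) to estimate the target speech from the mixture.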
Objective: To alleviate the above-mentioned issue, this internship aims to investigate the use of a so-called switching VAE model, similar to [3], in which several VAE architectures each model a particular speaker. Through a switching latent variable, the overall model can select the appropriate generative model for each time frame in an unsupervised way, as sketched below.
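The following minimal PyTorch sketch illustrates the switching idea only: a bank of speaker-specific decoders and a categorical switching variable that weights them per time frame. Computing the switch probabilities from the visual embedding with a small network is a hypothetical choice made here for illustration; in a model like [3], the switching variable would be inferred probabilistically rather than predicted this way.

# Illustrative sketch of a switching mechanism over speaker-specific
# decoders (not the model of [3]).
import torch
import torch.nn as nn

class SwitchingDecoder(nn.Module):
    def __init__(self, n_speakers=4, n_latent=32, n_visual=128,
                 n_freq=513, n_hidden=128):
        super().__init__()
        # One speaker-dependent decoder per speaker.
        self.decoders = nn.ModuleList([
            nn.Sequential(nn.Linear(n_latent + n_visual, n_hidden), nn.Tanh(),
                          nn.Linear(n_hidden, n_freq))
            for _ in range(n_speakers)])
        # Per-frame switch probabilities from the visual embedding
        # (an assumption; other parameterizations are possible).
        self.switch = nn.Sequential(nn.Linear(n_visual, n_speakers),
                                    nn.Softmax(dim=-1))

    def forward(self, z, v):
        zv = torch.cat([z, v], dim=-1)
        # (batch, n_speakers, n_freq): log-variance predicted by each decoder.
        per_speaker = torch.stack([dec(zv) for dec in self.decoders], dim=1)
        probs = self.switch(v)  # (batch, n_speakers)
        # Expected clean-speech variance under the switching variable.
        var_s = (probs.unsqueeze(-1) * per_speaker.exp()).sum(dim=1)
        return var_s, probs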
Required skills: Theoretical and practical experience with deep learning (PyTorch). Knowledge of audio-visual speech processing, generative models, and probabilistic inference.
Environment: This project will be carried out in the Multispeech Team at Inria Nancy – Grand Est, in collaboration with the Perception Team at Inria Grenoble Rhône-Alpes. The research progress will be closely supervised by Dr. Mostafa Sadeghi, Dr. Xavier Alameda-Pineda, Prof. Laurent Girin, and Prof. Emmanuel Vincent. Our teams have the necessary computational resources (GPU & CPU) to carry out the proposed research.
Contact: mostafa.sadeghi@inria.fr; xavier.alameda-pineda@inria.fr
References:
[1] D. P. Kingma and M. Welling, “An Introduction to Variational Autoencoders,” arXiv preprint arXiv:1906.02691, 2019. URL http://arxiv.org/abs/1906.02691.
[2] V. Nguyen et al., “Deep Variational Generative Models for Audio-visual Speech Separation,” arXiv preprint arXiv:2008.07191, 2020.
[3] M. Sadeghi and X. Alameda-Pineda, “Robust Unsupervised Audio-visual Speech Enhancement Using a Mixture of Variational Autoencoders,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020.