Topic: In this Master's thesis, we address the problem of speech separation given a single-channel mixture of speech signals and video frames of the involved speakers. Although several audio-only speech separation methods exist [1], here we also aim to exploit visual information, namely video frames of the speakers' lips. This helps to distinguish the different speakers and thus improves the separation quality. Our approach consists of two main steps. In the first step, an audio-visual speech generative model is trained on a dataset of clean speech and the associated video frames. For this, we will use an audio-visual variational autoencoder (AV-VAE) [2]. A VAE is a deep latent variable model capable of modeling complex signals such as images and speech [3]. Equipped with a trained AV-VAE model for each speaker, the second step performs speech separation in a probabilistic way: the probabilistic model of the observed mixed speech is combined with the probabilistic model of each speaker's speech, i.e., the trained AV-VAEs, and the unknown latent variables are estimated. By the end of this project, the motivated candidate will have acquired solid theoretical and practical knowledge of audio-visual speech separation, deep generative models (VAEs), and probabilistic inference.
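To make the first step concrete, the following is a minimal numerical sketch of a VAE training objective (the evidence lower bound, ELBO) with the reparameterization trick. It is an illustration only: the encoder and decoder are linear maps with hypothetical names and dimensions, whereas the actual AV-VAE uses deep networks and additionally conditions on visual features of the speaker's lips.

```python
# Minimal VAE sketch (illustrative; linear encoder/decoder, hypothetical shapes).
# Shows the reparameterization trick and the ELBO maximized during training.
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W_mu, W_logvar):
    """Map an observation x to the parameters of q(z | x) (linear for brevity)."""
    return W_mu @ x, W_logvar @ x

def decoder(z, W_dec):
    """Map a latent sample z back to the data space (mean of p(x | z))."""
    return W_dec @ z

def elbo(x, W_mu, W_logvar, W_dec):
    mu, logvar = encoder(x, W_mu, W_logvar)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps            # reparameterization trick
    x_hat = decoder(z, W_dec)
    recon = -0.5 * np.sum((x - x_hat) ** 2)        # Gaussian log-likelihood (up to a constant)
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))  # KL(q(z|x) || N(0, I))
    return recon - kl                              # lower bound on log p(x)

# Toy data and parameters (hypothetical dimensions).
x_dim, z_dim = 8, 2
x = rng.standard_normal(x_dim)
W_mu = 0.1 * rng.standard_normal((z_dim, x_dim))
W_logvar = 0.1 * rng.standard_normal((z_dim, x_dim))
W_dec = 0.1 * rng.standard_normal((x_dim, z_dim))
print(elbo(x, W_mu, W_logvar, W_dec))  # a finite scalar: the ELBO for this sample
```

In the second step, decoders trained in this way serve as speaker-specific priors on clean speech, and separation amounts to inferring the latent variables of each speaker that best explain the observed mixture.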
This project will be carried out in the Perception Team at Inria Grenoble Rhône-Alpes. The research progress will be closely supervised by Dr. Mostafa Sadeghi and Dr. Xavier Alameda-Pineda. The Perception Team has the necessary computational resources (GPUs and CPUs) to carry out the proposed research.
[1] E. Vincent, T. Virtanen, and S. Gannot, Audio Source Separation and Speech Enhancement. Wiley, Aug. 2018.
[2] M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, "Audio-visual speech enhancement using conditional variational auto-encoder," August 2019.
[3] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in ICLR, 2014.