Unsupervised Audio-visual Speech Enhancement based on Variational Autoencoders

Speaker: Mostafa Sadeghi

Date and place: March 20, 2020 at 10:30, by videoconference


Speech enhancement, i.e., separating target speech from noise, has long been an important problem in signal processing and machine learning. Visual information associated with the target speaker, i.e., lip movements, is known to improve speech enhancement, especially when the recorded speech is highly noisy. With the advances in deep learning in recent years, there has been a trend towards speech enhancement based on deep neural networks. Most of these works present supervised methods, which require different types of noise for training. In this presentation, I will discuss my recent work on unsupervised audio-visual speech enhancement based on deep generative modeling of clean speech, using both audio and visual information. To this end, a variational autoencoder (VAE) is used to efficiently train a latent-variable generative model. The trained model is then combined with a noise-variance model at the enhancement (test) phase to estimate the clean speech. I will also discuss how to deal with noisy visual information, i.e., when the lip region is occluded or non-frontal in some video frames, as well as a method for robustly initializing the latent variables at test time.
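As a rough illustration of how a clean-speech variance model and a noise-variance model can be combined at test time, the sketch below applies a Wiener-like gain to a noisy spectrogram. This is only a minimal example under stated assumptions, not the method presented in the talk: the function name `wiener_estimate`, the array shapes, and the constant variances are all hypothetical. In practice the speech variance would come from the trained VAE decoder and the noise variance from the noise model.

```python
import numpy as np

def wiener_estimate(noisy_stft, speech_var, noise_var, eps=1e-12):
    """Estimate clean speech from a noisy STFT with a Wiener-like gain.

    noisy_stft : complex array (frequency bins x frames)
    speech_var : non-negative array, same shape (assumed given, e.g. by a decoder)
    noise_var  : non-negative array, same shape (assumed given by a noise model)
    """
    # Per-bin gain in [0, 1): close to 1 where speech dominates, close to 0 where noise dominates.
    gain = speech_var / (speech_var + noise_var + eps)
    return gain * noisy_stft

# Toy data: 4 frequency bins, 3 frames (hypothetical values).
rng = np.random.default_rng(0)
noisy = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
speech_var = np.full((4, 3), 3.0)
noise_var = np.full((4, 3), 1.0)

clean_est = wiener_estimate(noisy, speech_var, noise_var)
```

With these toy values, every time-frequency bin is scaled by roughly 3 / (3 + 1) = 0.75. Bins where the assumed speech variance is zero are suppressed entirely.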