Mixture of Inference Networks for Audio-visual Speech Enhancement

Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement

Mostafa Sadeghi and Xavier Alameda-Pineda

Abstract

In this paper, we are interested in unsupervised (unknown noise) speech enhancement using latent variable generative models. We propose to learn a generative model for clean speech spectrograms based on a variational autoencoder (VAE), where a mixture of audio and visual networks is used to infer the posterior of the latent variables. This is motivated by the fact that visual data, i.e., lip images of the speaker, provide helpful and complementary information about speech. As such, they can help train a richer inference network in which the audio and visual information are fused. Moreover, during speech enhancement, visual data are used to initialize the latent variables, which provides a more robust initialization than the noisy speech spectrogram. A variational inference approach is derived to train the proposed VAE. Thanks to the novel inference procedure and the robust initialization, the proposed audio-visual VAE achieves better speech enhancement performance than the standard audio-only counterpart.
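To make the idea of a mixture of inference networks more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a two-component mixture with a fixed mixing weight, where each component is a diagonal Gaussian whose mean and log-variance would come from an audio or a visual encoder. The names `gaussian_sample`, `sample_mixture_posterior`, `audio_weight`, and the numeric parameter values are all hypothetical and chosen only for illustration.

```python
import math
import random


def gaussian_sample(mean, log_var, rng):
    """Reparameterized draw z = mean + std * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def sample_mixture_posterior(audio_params, visual_params, audio_weight, rng):
    """Draw z from audio_weight * q_audio + (1 - audio_weight) * q_visual.

    Each *_params argument is a (mean, log_var) pair standing in for the
    output of a hypothetical audio or visual encoder network.
    """
    # Pick a mixture component first, then sample from that component.
    use_audio = rng.random() < audio_weight
    mean, log_var = audio_params if use_audio else visual_params
    source = "audio" if use_audio else "visual"
    return gaussian_sample(mean, log_var, rng), source


rng = random.Random(0)

# Made-up encoder outputs for a 2-dimensional latent vector (assumption).
audio_params = ([0.5, -1.0], [-2.0, -2.0])
visual_params = ([0.0, 0.0], [0.0, 0.0])

z, source = sample_mixture_posterior(audio_params, visual_params, 0.5, rng)

# At enhancement time the abstract initializes the latent variables from
# visual data; here that is approximated by the visual component's mean.
z_init = list(visual_params[0])
```

In this sketch, `z_init` depends only on the visual input, so it is unaffected by acoustic noise, which is the intuition behind the robust initialization described above.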

Audio examples

The proposed speech enhancement method is compared with the A-VAE method of [1], as well as the V-VAE and AV-VAE methods of [2]. Audio examples from the NTCD-TIMIT dataset, at different noise levels, are provided below.

[1] S. Leglaive, L. Girin, and R. Horaud, “A variance modeling framework based on variational autoencoders for speech enhancement”, in Proc. of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2018.

[2] M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, “Audio-visual Speech Enhancement Using Conditional Variational Auto-Encoder”, August 2019.