Return to Research

Variance Modeling Based on VAEs for Speech Enhancement


A variance modeling framework based on variational autoencoders for speech enhancement

Simon Leglaive, Laurent Girin, Radu Horaud

IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Aalborg, Denmark, September 2018

Article Supporting document | Bibtex |  Slides |  Code |  Audio examples | Acknowledgement



In this paper we address the problem of enhancing speech signals in noisy mixtures using a source separation approach. We explore the use of neural networks as an alternative to a popular speech variance model based on supervised non-negative matrix factorization (NMF). More precisely, we use a variational autoencoder as a speaker-independent supervised generative speech model, highlighting the conceptual similarities that this approach shares with its NMF-based counterpart. In order to be free of generalization issues regarding the noisy recording environments, we follow the approach of having a supervised model only for the target speech signal, the noise model being based on unsupervised NMF. We develop a Monte Carlo expectation-maximization algorithm for inferring the latent variables in the variational autoencoder and estimating the unsupervised model parameters. Experiments show that the proposed method outperforms a semi-supervised NMF baseline and a state-of-the-art fully supervised deep learning approach.  


Audio examples

The proposed method is compared with the two following reference methods:

  • A semi-supervised baseline using non-negative matrix factorization (denoted by “NMF baseline”);
  • The fully supervised deep learning approach proposed in [1] (denoted by “Fully supervised deep learning approach”). We used the code provided by the authors at

The rank of the speech model for the NMF baseline and the latent-space dimension for the proposed method are both equal to 64.

Noisy speech signals were created at a 0 dB signal-to-noise ratio.

[1] Y. Xu et al., “A regression approach to speech enhancement based on deep neural networks”, IEEE Transactions on Audio, Speech and Language Processing, 23(1), 7-19. 2015.

Environment Noisy speech Clean speech Enhanced speech
Semi-supervised NMF baseline Fully supervised deep learning approach Proposed method
Living room
University restaurant
Terrace of a cafe
Subway station
City park
Office cafeteria
Town square
Traffic intersection


Bonus experiment on music

In this experiment, we test the ability of all the previously compared methods to separate singing voice from the accompaniment in a piece of music. The accompaniment is considered as noise. Note that all models have been trained on speech and not singing voice.

The fully supervised deep learning approach [1] performs the worst because music is not really one of the noise types that the neural network has seen during the training phase. This example justifies the interest of semi-supervised methods and shows interesting generalization properties of the proposed method.




Original Semi-supervised NMF baseline Fully supervised deep learning approach Proposed method
Singing voice
Accompaniment Not provided by the algorithm



This work was supported by the ERC Advanced Grant VHIA #340113.