Variance Modeling Based on VAEs for Speech Enhancement

A variance modeling framework based on variational autoencoders for speech enhancement

Simon Leglaive, Laurent Girin, Radu Horaud

IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Aalborg, Denmark, September 2018

Abstract

In this paper we address the problem of enhancing speech signals in noisy mixtures using a source separation approach. We explore the use of neural networks as an alternative to a popular speech variance model based on supervised non-negative matrix factorization (NMF). More precisely, we use a variational autoencoder as a speaker-independent supervised generative speech model, highlighting the conceptual similarities that this approach shares with its NMF-based counterpart. In order to be free of generalization issues regarding the noisy recording environments, we follow the approach of having a supervised model only for the target speech signal, the noise model being based on unsupervised NMF. We develop a Monte Carlo expectation-maximization algorithm for inferring the latent variables in the variational autoencoder and estimating the unsupervised model parameters. Experiments show that the proposed method outperforms a semi-supervised NMF baseline and a state-of-the-art fully supervised deep learning approach.

Audio examples

The proposed method is compared with the two following reference methods:

A semi-supervised baseline using non-negative matrix factorization (denoted by “NMF baseline”);
The fully supervised deep learning approach proposed in [1] (denoted by “Fully supervised deep learning approach”). We used the code provided by the authors at https://github.com/yongxuUSTC/sednn

The rank of the speech model for the NMF baseline and the latent-space dimension for the proposed method are both equal to 64.

Noisy speech signals were created at a 0 dB signal-to-noise ratio.

[1] Y. Xu et al., “A regression approach to speech enhancement based on deep neural networks”, IEEE Transactions on Audio, Speech and Language Processing, 23(1), 7-19. 2015.

Environment	Noisy speech	Clean speech	Enhanced speech
			Semi-supervised NMF baseline	Fully supervised deep learning approach	Proposed method
Living room
University restaurant
Kitchen
Terrace of a cafe
Meeting
Subway
Subway station
City park
River
Office cafeteria
Town square
Traffic intersection

Bonus experiment on music

In this experiment, we test the ability of all the previously compared methods to separate singing voice from the accompaniment in a piece of music. The accompaniment is considered as noise. Note that all models have been trained on speech and not singing voice.

The fully supervised deep learning approach [1] performs the worst because music is not really one of the noise types that the neural network has seen during the training phase. This example justifies the interest of semi-supervised methods and shows interesting generalization properties of the proposed method.

Mixture:

	Original	Semi-supervised NMF baseline	Fully supervised deep learning approach	Proposed method
Singing voice
Accompaniment			Not provided by the algorithm

Acknowledgement

This work was supported by the ERC Advanced Grant VHIA #340113.