Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement
Mostafa Sadeghi and Xavier Alameda-Pineda
The proposed speech enhancement method is compared with the A-VAE method of [1] and with the V-VAE and AV-VAE methods of [2]. Audio examples from the NTCD-TIMIT dataset are provided below for six noise types (car, living room, white, street, babble, and cafe), with three examples per noise type.
[1] S. Leglaive, L. Girin, and R. Horaud, “A variance modeling framework based on variational autoencoders for speech enhancement”, in Proc. of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2018.
[2] M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, “Audio-visual Speech Enhancement Using Conditional Variational Auto-Encoder”, August 2019.
Car | Example 1 | Example 2 | Example 3 |
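Noisy speech | [audio] | [audio] | [audio] |
Clean speech | [audio] | [audio] | [audio] |
A-VAE [1] | [audio] | [audio] | [audio] |
V-VAE [2] | [audio] | [audio] | [audio] |
AV-VAE [2] | [audio] | [audio] | [audio] |
Proposed method | [audio] | [audio] | [audio] |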
Living room | Example 1 | Example 2 | Example 3 |
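Noisy speech | [audio] | [audio] | [audio] |
Clean speech | [audio] | [audio] | [audio] |
A-VAE [1] | [audio] | [audio] | [audio] |
V-VAE [2] | [audio] | [audio] | [audio] |
AV-VAE [2] | [audio] | [audio] | [audio] |
Proposed method | [audio] | [audio] | [audio] |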
White | Example 1 | Example 2 | Example 3 |
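Noisy speech | [audio] | [audio] | [audio] |
Clean speech | [audio] | [audio] | [audio] |
A-VAE [1] | [audio] | [audio] | [audio] |
V-VAE [2] | [audio] | [audio] | [audio] |
AV-VAE [2] | [audio] | [audio] | [audio] |
Proposed method | [audio] | [audio] | [audio] |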
Street | Example 1 | Example 2 | Example 3 |
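Noisy speech | [audio] | [audio] | [audio] |
Clean speech | [audio] | [audio] | [audio] |
A-VAE [1] | [audio] | [audio] | [audio] |
V-VAE [2] | [audio] | [audio] | [audio] |
AV-VAE [2] | [audio] | [audio] | [audio] |
Proposed method | [audio] | [audio] | [audio] |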
Babble | Example 1 | Example 2 | Example 3 |
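Noisy speech | [audio] | [audio] | [audio] |
Clean speech | [audio] | [audio] | [audio] |
A-VAE [1] | [audio] | [audio] | [audio] |
V-VAE [2] | [audio] | [audio] | [audio] |
AV-VAE [2] | [audio] | [audio] | [audio] |
Proposed method | [audio] | [audio] | [audio] |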
Cafe | Example 1 | Example 2 | Example 3 |
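Noisy speech | [audio] | [audio] | [audio] |
Clean speech | [audio] | [audio] | [audio] |
A-VAE [1] | [audio] | [audio] | [audio] |
V-VAE [2] | [audio] | [audio] | [audio] |
AV-VAE [2] | [audio] | [audio] | [audio] |
Proposed method | [audio] | [audio] | [audio] |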
Xavier Alameda-Pineda acknowledges ANR and IDEX for funding the ML3RI project.