Implicit and explicit phase modeling in deep learning-based source separation

Speaker: Manu Pariente

Date and place: December 3, 2020 at 10:30, video conference

Abstract:

Speech enhancement and separation have recently seen great progress thanks to deep learning-based discriminative methods. In particular, time-domain methods relying on learned filterbanks achieve state-of-the-art performance by implicitly modeling phase and amplitude. However, despite ongoing efforts to address their limitations, these methods produce highly specialized models that generalize poorly and are often uninterpretable in terms of classical digital signal processing. In contrast, generative models are often more interpretable and reusable, but current generative approaches to speech enhancement and separation lack the modeling power of deep learning-based discriminative approaches.
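To illustrate the contrast between implicit and explicit phase modeling, the following Python sketch compares a learned analysis filterbank, as used in TasNet-style time-domain models, with an STFT front end. All names and sizes here are illustrative, not taken from the talk or any specific paper.

# Learned filterbank: a free 1-D convolution whose real-valued outputs
# jointly encode amplitude and phase, with no explicit separation of the two.
import torch
import torch.nn as nn

n_filters, kernel_size, stride = 512, 16, 8
learned_fb = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)

waveform = torch.randn(1, 1, 16000)           # 1 s of audio at 16 kHz
learned_rep = learned_fb(waveform)            # (1, n_filters, n_frames)

# STFT front end: complex-valued, with an explicit amplitude/phase split.
spec = torch.stft(waveform.squeeze(1), n_fft=512, hop_length=128,
                  window=torch.hann_window(512), return_complex=True)
amplitude, phase = spec.abs(), spec.angle()

In the learned-filterbank case, phase information is only modeled implicitly, entangled in the real-valued representation; the STFT makes it explicit but fixed.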

Recent approaches combine variational autoencoders with classical expectation-maximization-based source separation algorithms, using sampling, gradient descent, or heuristics at inference time. While this is a step in the right direction, the underlying probabilistic models are too simplistic, as they discard dependencies between time-frequency bins.
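To make this limitation concrete, the conditional model commonly assumed in such VAE-based approaches is a zero-mean complex Gaussian per time-frequency bin (a generic sketch of the usual formulation, not necessarily the exact model discussed in the talk):

\[
  p_\theta(\mathbf{s} \mid \mathbf{z}) \;=\; \prod_{f,t} \mathcal{N}_c\!\big(s_{ft};\, 0,\; \sigma^2_{ft}(\mathbf{z})\big),
\]

where s_{ft} is the STFT coefficient of the source in bin (f, t) and sigma^2_{ft}(z) is the variance produced by the decoder. Given the latent code z, the likelihood factorizes over bins, so any dependency between bins, including phase structure, must be carried by z alone.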

In this presentation, we extend the aforementioned discriminative methods to use the STFT and analyze their phase-modeling abilities. We then present a statistically principled algorithm for speech separation, extending the VAE-based algorithm to reuse the probabilistic encoder as a posterior approximator, which speeds up inference. Finally, we explore the introduction of explicit phase modeling in the VAE-based generative model for speech.
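As a rough illustration of the encoder-reuse idea, here is a minimal Python sketch of amortizing the posterior at separation time: instead of estimating q(z | x) per utterance by sampling or gradient descent, a pretrained encoder provides it in one forward pass. All names (SpeechVAE, encode, decode) and architectural details are hypothetical and stand in for the general idea, not the speaker's actual method.

import torch
import torch.nn as nn

class SpeechVAE(nn.Module):
    def __init__(self, n_freq=257, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_freq, 128), nn.Tanh(),
                                 nn.Linear(128, 2 * latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.Tanh(),
                                 nn.Linear(128, n_freq), nn.Softplus())

    def encode(self, power_spec):
        # Gaussian posterior parameters for q(z | x).
        mean, log_var = self.enc(power_spec).chunk(2, dim=-1)
        return mean, log_var

    def decode(self, z):
        # Per-bin speech variances sigma^2_{ft}(z) of the local Gaussian model.
        return self.dec(z)

vae = SpeechVAE()
mix_power = torch.rand(100, 257)   # |STFT|^2 of the mixture, shape (frames, freqs)

# Amortized posterior: one encoder pass on the mixture replaces a costly
# per-utterance optimization, which is where the speed gain comes from.
with torch.no_grad():
    z_mean, z_log_var = vae.encode(mix_power)
    z = z_mean + torch.randn_like(z_mean) * (0.5 * z_log_var).exp()
    speech_var = vae.decode(z)     # used to initialize the EM iterations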