A Hybrid Approach for Speech Enhancement Using GMM and Deep Neural Network Phoneme Classifier

Tuesday, October 18, 2016, 4:00 pm to 5:00 pm, room F108, INRIA Montbonnot

Seminar by Sharon Gannot, Bar Ilan University

Abstract: In this work, we propose a hybrid approach for single-microphone speech enhancement that merges the generative Mixture of Gaussians (MoG) model with a discriminative deep neural network (DNN). The proposed algorithm runs in two phases: a training phase, which is carried out only once, and a test phase. In the training phase, the noise-free speech log power spectral density (PSD) is first modeled as a MoG, representing the phoneme-based diversity in the speech signal. A DNN is then trained for phoneme classification on a phoneme-labeled database of clean speech signals, with mel-frequency cepstral coefficients (MFCC) as the input features. In the test phase, a noisy utterance that was not seen during training is processed. Given the phoneme classification results for the noisy utterance, a speech presence probability (SPP) is obtained using a combination of the generative and discriminative models. SPP-controlled attenuation is then applied to the noisy speech while the noise statistics are updated simultaneously. The discriminative DNN maintains the continuity of the speech, and the generative phoneme-based MoG preserves the speech spectral structure. An extensive experimental study using real speech and noise signals is provided, accompanied by audio demonstrations. We show that the proposed method significantly outperforms state-of-the-art competing methods.
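To make the last processing step more concrete, here is a minimal sketch of what SPP-controlled attenuation with a simultaneous noise update could look like for a single frame. The abstract does not give the actual rule, so everything here is an assumption made for illustration: the function name `spp_attenuate`, the attenuation floor `max_atten_db`, the smoothing constant `alpha`, and the specific soft-attenuation and recursive-smoothing formulas are hypothetical and are not taken from the talk.

```python
import numpy as np

def spp_attenuate(noisy_log_psd, spp, noise_log_psd,
                  max_atten_db=20.0, alpha=0.9):
    """Hypothetical single-frame sketch (not the speaker's exact method).

    noisy_log_psd : natural-log power spectrum of the noisy frame, shape (K,)
    spp           : speech presence probability per frequency bin, in [0, 1]
    noise_log_psd : current noise estimate (natural-log power), shape (K,)
    """
    noisy_log_psd = np.asarray(noisy_log_psd, dtype=float)
    spp = np.asarray(spp, dtype=float)
    noise_log_psd = np.asarray(noise_log_psd, dtype=float)

    # Convert the assumed attenuation floor from dB to natural-log power units.
    beta = max_atten_db * np.log(10.0) / 10.0

    # Assumed soft attenuation: bins with spp near 1 are left untouched,
    # bins with spp near 0 are lowered by up to beta.
    enhanced = noisy_log_psd - (1.0 - spp) * beta

    # Assumed noise update: recursive smoothing whose effective constant
    # approaches 1 (estimate frozen) where speech is likely present.
    a = alpha + (1.0 - alpha) * spp
    new_noise = a * noise_log_psd + (1.0 - a) * noisy_log_psd
    return enhanced, new_noise
```

In this sketch, a bin with SPP equal to 1 passes through unchanged and leaves the noise estimate as it was, while a bin with SPP equal to 0 is attenuated by the full floor and pulls the noise estimate toward the current observation.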

If time permits, we will also explore another speech enhancement framework consisting of multiple DNNs. This framework comprises a set of phoneme-specific DNNs (pDNNs), one for each phoneme, together with an additional phoneme-classification DNN (cDNN). The cDNN determines the posterior probability that a specific phoneme was uttered. Concurrently, each pDNN estimates a phoneme-specific speech presence probability (pSPP). The overall speech presence probability (SPP) is then calculated as a weighted average of the pSPPs, with the weights given by the posterior phoneme probabilities.
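The weighted average described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `combine_pspp` and the array shapes (one posterior per phoneme, one pSPP vector per phoneme across frequency bins) are hypothetical, and the networks that would produce these inputs are not modeled.

```python
import numpy as np

def combine_pspp(posteriors, pspp):
    """Combine phoneme-specific SPPs into a single SPP for one frame.

    posteriors : cDNN output, posterior probability of each phoneme, shape (M,)
    pspp       : pDNN outputs, one SPP vector per phoneme, shape (M, K)
    returns    : overall SPP per frequency bin, shape (K,)
    """
    posteriors = np.asarray(posteriors, dtype=float)
    pspp = np.asarray(pspp, dtype=float)

    # Normalise defensively so the weights sum to one.
    weights = posteriors / posteriors.sum()

    # Weighted average over the phoneme axis.
    return weights @ pspp


# Toy example with three phonemes and two frequency bins.
posteriors = [0.7, 0.2, 0.1]
pspp = [[1.0, 0.0],
        [0.0, 1.0],
        [0.5, 0.5]]
spp = combine_pspp(posteriors, pspp)  # approximately [0.75, 0.25]
```

Because the weights are non-negative and sum to one, the combined SPP stays within the range spanned by the individual pSPPs in each bin.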