
Narrow-band Deep Filtering for Multichannel Speech Enhancement

Xiaofei Li and Radu Horaud

Submitted to IEEE Transactions on Audio, Speech and Language Processing.

Short version presented at IEEE WASPAA, October 2019, New Paltz, NY, USA.

[Submitted pdf] [WASPAA pdf] [WASPAA slides] [code] [audio examples]

Demo: noisy input, filtered output, ground-truth output.

Abstract. In this paper we address the problem of multichannel speech enhancement in the short-time Fourier transform (STFT) domain and in the framework of sequence-to-sequence deep learning. A long short-term memory (LSTM) network takes as input a sequence of STFT coefficients associated with a frequency bin of multichannel noisy-speech signals. The network's output is a sequence of single-channel cleaned speech at the same frequency bin. We propose several clean-speech network targets, namely the magnitude ratio mask, the complex ideal ratio mask, the STFT coefficients and spatial filtering. A prominent feature of the proposed model is that the same LSTM architecture, with identical parameters, is trained across frequency bins. The proposed method is referred to as narrow-band deep filtering. This stands in contrast with traditional wide-band speech enhancement methods. The proposed deep filter is able to discriminate between speech and noise by exploiting their different temporal and spatial characteristics: speech is non-stationary and spatially coherent, while noise is relatively stationary and weakly correlated across channels. This is similar in spirit to unsupervised techniques, such as spectral subtraction and beamforming. We describe extensive experiments with both mixed signals (noise is added to clean speech) and real signals (live recordings). We empirically evaluate the proposed architecture variants using speech enhancement and speech recognition metrics, and we compare our results with those obtained with several state-of-the-art methods. In the light of these experiments we conclude that narrow-band deep filtering yields very good performance and excellent generalization capabilities in terms of speaker variability and noise type.
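The narrow-band idea can be summarized as follows: one network, with shared parameters, is applied independently to every frequency bin, taking the multichannel STFT sequence of that bin as input and emitting a single-channel target sequence for the same bin. Below is a minimal, illustrative PyTorch sketch of this scheme, not the authors' released code; the class name, tensor layout, hidden size and the choice of a magnitude-ratio-mask output are assumptions made for the example.

```python
# Minimal sketch of narrow-band deep filtering (illustrative, not the paper's code).
# Assumed layout: the multichannel noisy STFT has shape (batch, channels, freqs, frames).
import torch
import torch.nn as nn


class NarrowBandLSTM(nn.Module):
    """One BLSTM shared by all frequency bins.

    Input : per-bin sequence of multichannel STFT coefficients
            (real and imaginary parts stacked along the feature axis).
    Output: per-bin sequence of a single-channel target; here a
            magnitude ratio mask in [0, 1], one of the targets named
            in the abstract.
    """

    def __init__(self, num_channels: int, hidden_size: int = 256):
        super().__init__()
        in_features = 2 * num_channels  # real + imaginary parts per channel
        self.blstm = nn.LSTM(in_features, hidden_size,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freqs, frames), complex-valued STFT
        b, c, f, t = x.shape
        # Treat every frequency bin as an independent sequence:
        # the same parameters filter all bins.
        x = torch.view_as_real(x)                      # (b, c, f, t, 2)
        x = x.permute(0, 2, 3, 1, 4).reshape(b * f, t, 2 * c)
        h, _ = self.blstm(x)                           # (b*f, t, 2*hidden)
        mask = torch.sigmoid(self.out(h))              # (b*f, t, 1)
        return mask.reshape(b, f, t)                   # one mask per TF bin


# Usage sketch: estimate the mask and apply it to a reference channel.
if __name__ == "__main__":
    stft = torch.randn(1, 4, 257, 100, dtype=torch.complex64)  # 4-mic noisy STFT
    model = NarrowBandLSTM(num_channels=4)
    mask = model(stft)                     # (1, 257, 100)
    enhanced = mask * stft[:, 0]           # masked reference channel
```

The other targets mentioned in the abstract (the complex ideal ratio mask, the clean STFT coefficients, or spatial filter weights) would mainly change the size and activation of the final output layer, while the per-bin sequence processing stays the same.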

Audio Examples: CHiME-3/CHiME-4 dataset, using four microphones (unless otherwise specified), with an SNR of 0 dB.

SIMU Eval data

Methods (one audio clip per environment: BUS, CAF, PED, STR):
clean
noisy (unproc.)
BeamformIt [1]
NN-GEV [2]
CRNN [3]
BLSTM-MRM
BLSTM-cIRM
BLSTM-CC
BLSTM-SF-2CH
LSTM-SF
BLSTM-SF


REAL Eval data

Methods (one audio clip per environment: BUS, CAF, PED, STR):
noisy (unproc.)
BeamformIt [1]
NN-GEV [2]
CRNN [3]
BLSTM-MRM
BLSTM-cIRM
BLSTM-CC
BLSTM-SF-2CH
LSTM-SF
BLSTM-SF

[1] (BeamformIt) X. Anguera, C. Wooters, and J. Hernando, “Acoustic beamforming for speaker diarization of meetings,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2011–2022, 2007.
[2] (NN-GEV) J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 196–200.
[3] (CRNN) S. Chakrabarty and E. A. Habets, “Time-frequency masking based online multi-channel speech enhancement with convolutional recurrent neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 787–799, 2019.