Multichannel Online Dereverberation based on Spectral Magnitude Inverse Filtering
Xiaofei Li, Laurent Girin, Sharon Gannot, Radu Horaud
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(9), pp. 1365-1377, 2019.
[pdf] [matlab code]
Abstract. This paper addresses the problem of multichannel online dereverberation. The proposed method is carried out in the short-time Fourier transform (STFT) domain, independently for each frequency band. In the STFT domain, the time-domain room impulse response is approximately represented by the convolutive transfer function (CTF). The multichannel CTFs are adaptively identified based on the cross-relation method, using the recursive least-squares criterion. Instead of the complex-valued CTF convolution model, we use a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude, which is only a coarse approximation of the former model but is shown to be more robust against CTF perturbations. Based on this nonnegative model, we propose an online STFT magnitude inverse filtering method. The inverse filters of the CTF magnitude are formulated based on the multiple-input/output inverse theorem (MINT) and adaptively estimated based on the gradient descent criterion. Finally, the inverse filtering is applied to the STFT magnitude of the microphone signals, yielding an estimate of the STFT magnitude of the source signal. Experiments on both speech enhancement and automatic speech recognition demonstrate that the proposed method can effectively suppress reverberation, even for a moving speaker.
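The magnitude inverse filtering step described in the abstract can be illustrated with a small sketch for a single frequency band. This is not the authors' implementation: it assumes the CTF magnitudes are already known and fixed, uses plain batch gradient descent rather than the adaptive frame-wise update of the paper, takes an impulse at lag 0 as the MINT target, and clips negative outputs to zero. The function names (conv_matrix, mint_inverse_filters, inverse_filter) and all toy values are hypothetical.

```python
import numpy as np

def conv_matrix(a, n_taps):
    """Matrix C such that C @ h equals np.convolve(a, h) for len(h) == n_taps."""
    C = np.zeros((len(a) + n_taps - 1, n_taps))
    for k in range(n_taps):
        C[k:k + len(a), k] = a
    return C

def mint_inverse_filters(ctf_mags, n_taps, n_iter=2000):
    """Estimate per-channel inverse filters by gradient descent on ||A h - d||^2."""
    # Stack the channel convolution matrices side by side (MINT formulation).
    A = np.hstack([conv_matrix(a, n_taps) for a in ctf_mags])
    d = np.zeros(A.shape[0])
    d[0] = 1.0                                # assumed target: impulse at lag 0
    step = 1.0 / np.linalg.norm(A, 2) ** 2    # 1 / (largest eigenvalue of A^T A)
    h = np.zeros(A.shape[1])
    for _ in range(n_iter):
        h -= step * (A.T @ (A @ h - d))       # gradient of 0.5 * ||A h - d||^2
    residual = np.linalg.norm(A @ h - d)
    return h.reshape(len(ctf_mags), n_taps), residual

def inverse_filter(mic_mags, filters):
    """Sum the filtered microphone magnitudes; clip negatives (an assumption)."""
    n_frames = mic_mags.shape[1]
    est = sum(np.convolve(x, h)[:n_frames] for x, h in zip(mic_mags, filters))
    return np.maximum(est, 0.0)

# Toy data for one frequency band and two channels (all values are made up).
rng = np.random.default_rng(0)
ctf_mags = np.array([[1.0, 0.6, 0.3, 0.1],
                     [0.8, 0.7, 0.2, 0.15]])
source_mag = np.abs(rng.standard_normal(50))
mic_mags = np.array([np.convolve(source_mag, a)[:50] for a in ctf_mags])

filters, residual = mint_inverse_filters(ctf_mags, n_taps=4)
source_est = inverse_filter(mic_mags, filters)
```

In this toy setting the stacked matrix has more unknowns (2 x 4 filter taps) than equations (7 output lags), which is the usual MINT condition for an exact multichannel inverse to exist when the channels share no common zeros.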
REVERB Challenge Dataset RT60=0.7 s
Columns: Sim near, Sim far, Real near, Real far.
Rows: Clean, Reverb., BWPE 2ch, AWPE 2ch, prop. 2ch, BWPE 8ch, AWPE 8ch, prop. MC 8ch, prop. PW 8ch.
REVERB SimData far with various SNRs
Columns: Noisy, AWPE 2ch, prop. 2ch.
Rows (SNR in dB): 20, 15, 10, 5, 0.
Dynamic Dataset RT60=0.75 s: the speakers are static at the beginning and start walking at 11 s (female speaker) and 9 s (male speaker).
Columns: Female speaker, Male speaker.
Rows: Close-talk, Reverb., AWPE 2ch, AWPE 8ch, prop. 2ch, prop. MC 8ch, prop. PW 8ch, prop. Batch.
Multichannel Identification and Nonnegative Equalization for Dereverberation and Noise Reduction based on Convolutive Transfer Function
Xiaofei Li, Radu Horaud, Laurent Girin and Sharon Gannot
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10), pp. 1755-1768, 2018 (arXiv)
Abstract. This paper addresses the problems of blind multichannel identification and equalization for joint speech dereverberation and noise reduction. The time-domain cross-relation method is hardly applicable to blind room impulse response identification because of the near-common zeros of long impulse responses. We extend the cross-relation method to the short-time Fourier transform (STFT) domain, in which the time-domain impulse response is approximately represented by the convolutive transfer function (CTF) with far fewer coefficients. For the oversampled STFT, CTFs suffer from common zeros caused by the non-flat frequency response of the STFT window. To overcome this, we propose to identify CTFs using the STFT framework with oversampled signals and critically sampled CTFs, which is a good trade-off between the frequency aliasing of the signals and the common-zeros problem of the CTFs. The identified complex-valued CTFs are not accurate enough for multichannel equalization because of the frequency aliasing of the CTFs. Hence, we use only the CTF magnitudes, which leads to a nonnegative multichannel equalization method based on a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude. Compared with the complex-valued convolution model, this nonnegative convolution model is shown to be more robust against CTF perturbations. To recover the STFT magnitude of the source signal and to reduce the additive noise, the l2-norm fitting error between the STFT magnitude of the microphone signals and the nonnegative convolution is constrained to be less than a noise-power-related tolerance. Meanwhile, the l1-norm of the STFT magnitude of the source signal is minimized to impose sparsity.
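The nonnegative equalization idea can be illustrated with a small sketch for one frequency band. Note the substitution: the abstract describes a constrained problem (minimum l1-norm subject to an l2-norm fitting tolerance), whereas this sketch solves the related penalized form, 0.5 * ||x - A s||^2 + lam * ||s||_1 with s >= 0, by proximal gradient descent. The CTF magnitudes are assumed known, and the function names, the weight lam, and all toy values are hypothetical.

```python
import numpy as np

def truncated_conv_matrix(a, n_frames):
    """Square matrix C with C @ s == np.convolve(a, s)[:n_frames]."""
    C = np.zeros((n_frames, n_frames))
    for q, coeff in enumerate(a):
        C += coeff * np.eye(n_frames, k=-q)   # q-th subdiagonal carries a[q]
    return C

def nonneg_sparse_equalize(mic_mags, ctf_mags, lam=0.05, n_iter=500):
    """Proximal gradient descent on 0.5*||x - A s||^2 + lam*||s||_1, s >= 0."""
    n_frames = mic_mags.shape[1]
    # Stack channels: x = [x_1; x_2; ...], A = [C_1; C_2; ...].
    A = np.vstack([truncated_conv_matrix(a, n_frames) for a in ctf_mags])
    x = mic_mags.reshape(-1)
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    s = np.zeros(n_frames)
    costs = []
    for _ in range(n_iter):
        grad = A.T @ (A @ s - x)
        # Proximal step for lam*||s||_1 under s >= 0: shift, then clip at zero.
        s = np.maximum(s - step * (grad + lam), 0.0)
        costs.append(0.5 * np.sum((A @ s - x) ** 2) + lam * np.sum(s))
    return s, costs

# Toy data: a sparse source magnitude observed on two channels with small noise.
rng = np.random.default_rng(1)
n_frames = 40
source_mag = np.zeros(n_frames)
source_mag[[3, 10, 11, 25]] = [1.0, 0.8, 0.5, 1.2]
ctf_mags = np.array([[1.0, 0.5, 0.25],
                     [0.9, 0.6, 0.2]])
mic_mags = np.array([np.convolve(source_mag, a)[:n_frames] for a in ctf_mags])
mic_mags = mic_mags + 0.01 * np.abs(rng.standard_normal(mic_mags.shape))

source_est, costs = nonneg_sparse_equalize(mic_mags, ctf_mags)
```

The weight lam plays the role that the noise-related tolerance plays in the constrained formulation: a larger value favors a sparser estimate at the price of a larger fitting error.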
Binaural Simulation Data: audio files correspond to the spectrogram examples in the paper. RT60=0.79 s.
Columns: Source Sig., Early Rev., Noise-free Micro. Sig., Noisy Micro. Sig.
Columns: Theor. CTF, Theor. CTF Mag., Ident. CTF, Ident. CTF Mag. Row: Noise free.
Columns: Prop., NIM, NIMNME, WPE, CDR. Rows: Noise free, Noisy 5 dB.
Multichannel impulse response dataset. RT60=0.61s.
Columns: Source Sig., Early Rev., Micro. Sig., NIMNME 2ch, CDR 2ch, WPE 2ch, WPE 4ch, Prop. 2ch, Prop. 4ch.
Rows: Female 20 dB, Female 5 dB, Male 20 dB, Male 5 dB.
REVERB challenge dataset. RT60 = 0.7s.
Columns: Micro. Sig., WPE 2ch, WPE 8ch, Prop. 2ch, Prop. 8ch.
Rows: near, far.