Mar 08

PhD position on deep learning based noise reduction for ad-hoc microphone arrays

PI: Emmanuel Vincent
Co-PI: Romain Serizel
Start: between October 2017 and January 2018
To apply: apply online or send the following documents to romain.serizel@loria.fr and emmanuel.vincent@inria.fr before May 26, 2017:

  • CV
  • motivation letter
  • degree certificates and transcripts for BSc and MSc
  • MSc thesis if already completed, or a description of the work in progress otherwise
  • a copy of your publications, if any
  • a recommendation letter from the supervisor of your MSc thesis, and up to two other recommendation letters, to be sent directly to us by the letter author.

Speech is one of the most intuitive means of communication between humans. Since the early 2010s, with the emergence of reliable end-user voice applications, speech has also become one of the preferred ways of interacting with mobile devices, and soon with your home. However, most applications based on speech communication assume that a “clean” version of the speech is available. In real-life scenarios this is rarely true: speech is usually corrupted by noise, which can severely degrade communication. One solution to this noise problem is to apply so-called speech enhancement techniques, which aim to extract the speech component from a noisy speech mixture. In particular, multichannel approaches have attracted a lot of attention over the years, mainly because they outperform single-channel approaches in many respects. Yet traditional microphone arrays have limitations, in particular due to space constraints, and ad-hoc microphone arrays composed of a set of wireless microphone nodes have recently proven to be a viable alternative.

The goal of this thesis is to generalize the recent improvements in speech enhancement obtained with deep learning techniques [1] to the case of ad-hoc microphone arrays. Current techniques are mostly limited to a single channel [2, 3], or rely at some point on standard beamforming techniques [4, 5] or averaging [6] to produce a single-channel input to the deep network. These approaches therefore depend on centralized processing at some stage and on assumptions about the microphone array topology. As a result, their extension to ad-hoc arrays, where the array topology is unconstrained and can vary over time and where distributed processing is usually preferred, is not obvious. Reformulating multichannel speech enhancement as a deep learning problem that takes multichannel audio as input, and proposing distributed and online learning methods, should extend the applicability of deep learning based speech enhancement to ad-hoc arrays and improve performance compared to state-of-the-art approaches [7].

Ideal profile:
MSc in computer science, machine learning, or signal processing
Experience with the Python programming language
Experience with deep learning toolkits is a plus

[1] Deng, L., & Yu, D. (2014). Deep Learning: Methods and Applications. NOW Publishers.

[2] Wang, Y., Narayanan, A., & Wang, D. (2014). On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12), 1849-1858.

[3] Chen, J., Wang, Y., & Wang, D. (2015). Noise perturbation improves supervised speech separation. In International Conference on Latent Variable Analysis and Signal Separation (pp. 83-90).

[4] Weninger, F., Erdogan, H., Watanabe, S., Vincent, E., Le Roux, J., Hershey, J. R., & Schuller, B. (2015). Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In International Conference on Latent Variable Analysis and Signal Separation (pp. 91-99).

[5] Pfeifenberger, L., Schrank, T., Zöhrer, M., Hagmüller, M., & Pernkopf, F. (2015). Multi-channel speech processing architectures for noise robust speech recognition: 3rd CHiME challenge results. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (pp. 452-459).

[6] Nugraha, A. A., Liutkus, A., & Vincent, E. (2016). Multichannel audio source separation with deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(10), 1652-1664.

[7] Markovich-Golan, S., Bertrand, A., Moonen, M., & Gannot, S. (2015). Optimal distributed minimum-variance beamforming approaches for speech enhancement in wireless acoustic sensor networks. Signal Processing, 107, 4-20.