PI: Emmanuel Vincent
Co-PI: Romain Serizel
Start: October 2017 to January 2018
To apply: submit an online application at http://bit.ly/2kGCojd before April 30, 2017
Speech is one of the most intuitive means of communication between humans. Since the early 2010s, with the emergence of reliable end-user voice applications, speech has even become one of the preferred ways of interacting with mobile devices, and soon with our homes. However, most speech-based applications assume that a “clean” version of the speech signal is available. In real-life scenarios this is rarely true: speech is generally corrupted by noise, which can severely degrade communication. One solution to this problem is to apply so-called speech enhancement techniques, which aim to extract the speech component from a noisy mixture. In particular, multichannel approaches have attracted a lot of attention over the years, mainly because they can exploit spatial information that is unavailable to single-channel approaches and therefore usually outperform them. Yet traditional microphone arrays have limitations, in particular due to space constraints, and ad-hoc microphone arrays, composed of a set of wireless microphone nodes, have recently proven to be a viable alternative.
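To make "extracting the speech component from a noisy mixture" concrete, here is a minimal sketch of time-frequency masking, one common family of enhancement techniques (chosen for illustration only, not stated as the method of this thesis). It assumes additive noise, a synthetic 440 Hz tone standing in for speech, and a Wiener-like oracle mask |S|²/(|S|²+|N|²) computed from the separately known speech and noise. The helpers `stft`, `oracle_ratio_mask` and `snr_db` are hypothetical names written for this example.

```python
import numpy as np


def stft(x, n_fft=256, hop=128):
    """Short-time Fourier transform with a Hann window.

    Frames that would run past the end of x are dropped (no padding).
    Returns an array of shape (n_frames, n_fft // 2 + 1).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def oracle_ratio_mask(speech_spec, noise_spec, eps=1e-12):
    """Wiener-like oracle mask |S|^2 / (|S|^2 + |N|^2), between 0 and 1 per bin."""
    s_pow = np.abs(speech_spec) ** 2
    n_pow = np.abs(noise_spec) ** 2
    return s_pow / (s_pow + n_pow + eps)


def snr_db(ref, est):
    """SNR of an estimate with respect to a reference, in dB (works on complex STFTs)."""
    return 10 * np.log10(np.sum(np.abs(ref) ** 2) / np.sum(np.abs(ref - est) ** 2))


rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 440 * t)     # hypothetical stand-in for clean speech
noise = 0.5 * rng.standard_normal(fs)    # additive white noise
mixture = speech + noise                 # additive noise model: x = s + n

S, N, X = stft(speech), stft(noise), stft(mixture)   # the STFT is linear, so X = S + N
S_hat = oracle_ratio_mask(S, N) * X                   # masked estimate of S

print(f"mixture SNR: {snr_db(S, X):.1f} dB")
print(f"masked SNR:  {snr_db(S, S_hat):.1f} dB")
```

In a deep learning system, a network would have to estimate such a mask from the noisy mixture alone, since clean speech and noise are not observed at test time; the oracle version only shows what the mask does once it is known.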
The goal of this thesis is to extend the recent improvements in speech enhancement obtained with deep learning techniques to ad-hoc microphone arrays. Current techniques are mostly limited to the single-channel case [2, 3], or they rely at some stage on standard beamforming techniques [4, 5] or on channel averaging to produce a single-channel input to the deep network. These approaches therefore depend on centralized processing at some stage and on assumptions about the microphone array topology. Their extension to ad-hoc arrays, where the topology is unconstrained and can vary over time and where distributed processing is usually preferred, is thus not straightforward. Reformulating multichannel speech enhancement as a deep learning problem that takes multichannel audio as input, and proposing distributed and online learning methods, should make deep learning based speech enhancement applicable to ad-hoc arrays and improve performance compared to state-of-the-art approaches.
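As a minimal sketch of why a beamforming front end depends on the array topology, the example below simulates four hypothetical wireless nodes that receive the same source with different integer sample delays plus independent white noise, then applies a delay-and-sum beamformer. With the true delays, aligning and averaging the channels reduces the noise; with mismatched delays (here all set to zero, as if the beamformer wrongly assumed co-located microphones) the channels add out of phase and the source is distorted. Both variants need every channel gathered in one place, i.e. centralized processing. Integer delays, a windowed tone as the source, and the names `delay_and_sum` and `snr_db` are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 16000                      # sampling rate (Hz)
n = fs // 2                     # 0.5 s of signal
t = np.arange(n) / fs
# Hypothetical source: a Hann-windowed 300 Hz tone (near zero at both ends,
# so the circular shifts used below do not wrap signal energy around).
source = np.sin(2 * np.pi * 300 * t) * np.hanning(n)

# Hypothetical ad-hoc array: 4 nodes, each modeled only by an integer
# arrival delay (in samples) and independent white sensor noise.
true_delays = np.array([0, 7, 19, 31])
noise_std = 0.3
mics = np.stack([np.roll(source, d) + noise_std * rng.standard_normal(n)
                 for d in true_delays])


def delay_and_sum(x, delays):
    """Undo each channel's assumed delay, then average across channels.

    Needs every channel in one place (centralized) and the per-node delays,
    which depend on the array topology and the source position.
    """
    aligned = np.stack([np.roll(ch, -d) for ch, d in zip(x, delays)])
    return aligned.mean(axis=0)


def snr_db(ref, est):
    """Signal-to-noise ratio of an estimate with respect to a reference, in dB."""
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))


single_snr = snr_db(source, mics[0])                 # node 0 has zero delay
das_snr = snr_db(source, delay_and_sum(mics, true_delays))
mismatched_snr = snr_db(source, delay_and_sum(mics, np.zeros(4, dtype=int)))

print(f"single node:                 {single_snr:5.1f} dB")
print(f"delay-and-sum, true delays:  {das_snr:5.1f} dB")
print(f"delay-and-sum, wrong delays: {mismatched_snr:5.1f} dB")
```

When nodes move or join and leave, the delays change, which is one reason a fixed, centralized front end of this kind is hard to carry over to ad-hoc arrays.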
Candidate profile:
MSc in computer science, machine learning, or signal processing
Experience with the Python programming language
Experience with deep learning toolkits is a plus
References:
[1] Deng, L., & Yu, D. (2014). Deep Learning: Methods and Applications. NOW Publishers.
[2] Wang, Y., Narayanan, A., & Wang, D. (2014). On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12), 1849-1858.
[3] Chen, J., Wang, Y., & Wang, D. (2015). Noise perturbation improves supervised speech separation. In International Conference on Latent Variable Analysis and Signal Separation (pp. 83-90).
[4] Weninger, F., Erdogan, H., Watanabe, S., Vincent, E., Le Roux, J., Hershey, J. R., & Schuller, B. (2015). Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In International Conference on Latent Variable Analysis and Signal Separation (pp. 91-99).
[5] Pfeifenberger, L., Schrank, T., Zöhrer, M., Hagmüller, M., & Pernkopf, F. (2015). Multi-channel speech processing architectures for noise robust speech recognition: 3rd CHiME challenge results. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (pp. 452-459).
[6] Nugraha, A. A., Liutkus, A., & Vincent, E. (2016). Multichannel audio source separation with deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(10), 1652-1664.
[7] Markovich-Golan, S., Bertrand, A., Moonen, M., & Gannot, S. (2015). Optimal distributed minimum-variance beamforming approaches for speech enhancement in wireless acoustic sensor networks. Signal Processing, 107, 4-20.