Mar 12

PhD position on deep learning for sound scene analysis in real environments

PI: Emmanuel Vincent
Co-PI: Romain Serizel
Start: January 2018
To apply: apply online or send the following documents to romain.serizel@loria.fr and emmanuel.vincent@inria.fr before May 26, 2017:

  • CV
  • motivation letter
  • degree certificates and transcripts for BSc and MSc
  • MSc thesis if already completed, or a description of the work in progress otherwise
  • a copy of your publications, if any
  • a recommendation letter from the supervisor of your MSc thesis, and up to two other recommendation letters, to be sent directly to us by the letter author.

We are constantly surrounded by a complex audio stream carrying information about our environment. Hearing is a privileged way to detect and identify events that may require quick action (an ambulance siren, a baby crying…). Indeed, audition offers several advantages over vision: it allows for omnidirectional detection, up to a few tens of meters, independently of the lighting conditions. For these reasons, automatic audio analysis has become increasingly popular over the past five years [1]. Yet, most work has focused on controlled scenarios, and the deployment of automatic audio analysis systems in the real world still raises several issues: the variability of the sounds associated with each event, signal degradation due to acoustic propagation in far-field conditions or to overlapping events, and constraints on the location and quality of the microphones. Current approaches do not fully address these problems and therefore quickly become unusable in real conditions.

The goal of this PhD is to design an automatic sound scene analysis system based on deep learning [2] that is robust to the variabilities and degradations induced by real conditions. A first research axis consists in simulating degradations of the training data, starting from an initial system trained, for example, on Audio Set [3], in order to increase both the variability and the amount of data. We recently proposed an algorithm to automatically optimize this process that could be applied to sound scene analysis [4]. A second research axis is to exploit multiple microphones distributed over the environment, forming a wireless ad-hoc sensor network. Such networks have been widely studied from a signal processing perspective [5]. We propose to exploit them within a deep learning framework in order to perform multi-view learning [6]. The goal is then to design an algorithm that allows each node of the network to refine its perception of the sound scene and to track moving sources based on the information exchanged with neighboring nodes. The resulting system will be evaluated on real urban sound scenes.
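To give a flavor of the first research axis, the sketch below shows one of the simplest degradations one might simulate: mixing a clean event recording with background noise at a randomly drawn signal-to-noise ratio (SNR) to generate degraded training copies. This is only a minimal illustration, not the announced method — the function names and SNR range are illustrative, and the actual work would also cover reverberation, far-field propagation, and the automatic weighting of augmented data described in [4].

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix a clean recording with noise so the mixture has the target SNR (dB)."""
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that clean_power / (gain^2 * noise_power) = 10^(snr_db/10).
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise

def augment(clean, noise, snr_range=(0.0, 20.0), n_copies=5, seed=0):
    """Generate degraded copies of `clean` at random SNRs drawn from `snr_range`."""
    rng = np.random.default_rng(seed)
    snrs = rng.uniform(snr_range[0], snr_range[1], size=n_copies)
    return [mix_at_snr(clean, noise, snr) for snr in snrs]
```

In practice, each degraded copy keeps the label of the original clean event, so the training set grows by a factor of `n_copies` while exposing the model to realistic noise conditions.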

Ideal profile:
MSc in computer science, machine learning, or signal processing
Experience with the Python programming language
Experience with deep learning toolkits is a plus

[1] DCASE 2016 challenge, http://www.cs.tut.fi/sgn/arg/dcase2016/index

[2] Deng, L., & Yu, D. (2014). Deep Learning: Methods and Applications. NOW Publishers.

[3] Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., … & Ritter, M. (2017). Audio Set: An ontology and human-labeled dataset for audio events. In Proc. ICASSP.

[4] Sivasankaran, S., Vincent, E., & Illina, I. (2017). Discriminative importance weighting of augmented training data for acoustic model training. In Proc. ICASSP.

[5] Bertrand, A. (2011). Applications and trends in wireless acoustic sensor networks: a signal processing perspective. In Proc. SCVT.

[6] Wang, W., Arora, R., Livescu, K., & Bilmes, J. A. (2015). On deep multi-view representation learning. In Proc. ICML.