(closed) Master Project: Sound-source localization based on audio-visual mapping learning

Deadline for sending applications: 30 November 2013. Project proposed by Radu Horaud and Antoine Deleforge.

Figures: an audio-visual robotic head; training the visuo-acoustic mapping; localization of Antoine saying “five”.

We propose to address the interesting and challenging problem of sound-source localization. A sound source is generally localized using the time difference of arrival (TDOA) between pairs of microphones. In general, the microphones are arranged in a linear or circular array, and localization can only be done in one dimension (1D), the azimuthal angle. This is not sufficient when the task consists of localizing a sound source more precisely, as is needed in many applications. Recently, we proposed a method that uses a microphone pair plugged into the ears of a dummy head with a human-like head-related transfer function (HRTF). The HRTF introduces non-linear filtering effects that can in fact be advantageously used to locate sounds in two dimensions (2D), azimuth and elevation. The method that we proposed starts by learning a mapping between annotated 2D source positions and binaural spectral observations. The location of a sound source at an unknown position can then be estimated via Bayesian inversion [1], [2].
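As a rough illustration of this training-then-inversion idea, the sketch below learns a mapping from binaural interaural-level-difference spectra to (azimuth, elevation) angles and then queries it for a new binaural frame. It uses k-nearest-neighbour regression as a simplified stand-in for the probabilistic mapping and Bayesian inversion of [1], [2]; the feature extraction, file names and array shapes are illustrative assumptions.

```python
# Minimal sketch: supervised 2D sound-source localization from binaural spectra.
# k-NN regression is a simplified stand-in for the probabilistic mapping and
# Bayesian inversion of [1], [2]; all shapes and file names are illustrative.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def binaural_features(left, right, n_fft=512):
    """Interaural level-difference (ILD) spectrum, in dB, for one binaural frame."""
    L = np.abs(np.fft.rfft(left, n_fft)) + 1e-12
    R = np.abs(np.fft.rfft(right, n_fft)) + 1e-12
    return 20.0 * np.log10(L / R)

# Training data: spectra recorded from a source at known (azimuth, elevation) angles.
X_train = np.load("ild_spectra_train.npy")        # (n_examples, n_freq) ILD spectra
y_train = np.load("azimuth_elevation_train.npy")  # (n_examples, 2) angles in degrees

model = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# Test: localize a new binaural frame whose source direction is unknown.
left, right = np.load("left_frame.npy"), np.load("right_frame.npy")
azimuth, elevation = model.predict(binaural_features(left, right)[None, :])[0]
print(f"estimated direction: azimuth={azimuth:.1f} deg, elevation={elevation:.1f} deg")
```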

This method was successfully applied to localize sounds in images. To do so, we used both a microphone pair and a camera, and we annotated the data such that the azimuth and elevation of a sound source (a loud-speaker) correspond to a pixel location in the camera image. Hence, we were able to localize the face of a person speaking in front of the camera.
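The only change needed for image-based localization is the training target: each spectrum is annotated with the pixel at which the loud-speaker appears in the camera image. A minimal sketch, assuming the same features as above and a hypothetical file of annotated pixels:

```python
# Minimal sketch: same binaural features as above, but the training targets are
# the annotated pixel locations (u, v) of the loud-speaker in the camera image,
# so the regressor directly predicts where the sound source appears in the
# image. File names and the annotation format are illustrative assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_train = np.load("ild_spectra_train.npy")         # (n_examples, n_freq) spectra
pixels  = np.load("loudspeaker_pixels_train.npy")  # (n_examples, 2) annotated (u, v)

av_model = KNeighborsRegressor(n_neighbors=5).fit(X_train, pixels)

# Localize a new binaural frame directly in image coordinates; the predicted
# pixel can then be compared with detected faces to decide who is speaking.
u, v = av_model.predict(np.load("ild_spectrum_test.npy")[None, :])[0]
print(f"sound source at pixel ({u:.0f}, {v:.0f})")
```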

In this project we propose to extend this method so as to estimate the pose of the face in addition to its location. In more detail, we want to learn a mapping between binaural spectral observations and the location and orientation of a sound source (four degrees of freedom). The training could be done using a standard computer vision algorithm that estimates the 3D pose of an object from a single image. Estimating the location and orientation of a speaker’s face has a lot of potential in human-robot interaction; for example, the robot may estimate whether a person is speaking to the robot or to another person in the room. The method can also be combined with face detection methods in order to disambiguate between silent and speaking people. It can also be used to improve sound-source separation when several people speak simultaneously.
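Regarding the visual training signal, the sketch below shows how the 3D pose of a face could be obtained from a single image with the standard perspective-n-point (PnP) algorithm, here via OpenCV's solvePnP, given a generic 3D face model and detected 2D landmarks. The landmark detector, model points and camera parameters are illustrative assumptions, not part of the project description.

```python
# Minimal sketch: 3D head-pose estimation from a single image with solvePnP.
# A generic 3D face model and a 2D facial-landmark detector are assumed; a
# calibrated camera should replace the crude focal-length guess used below.
import numpy as np
import cv2

# Generic 3D face model (nose tip, chin, eye corners, mouth corners), in mm.
MODEL_POINTS = np.array([
    [0.0,      0.0,    0.0],   # nose tip
    [0.0,   -330.0,  -65.0],   # chin
    [-225.0, 170.0, -135.0],   # left eye, outer corner
    [225.0,  170.0, -135.0],   # right eye, outer corner
    [-150.0, -150.0, -125.0],  # left mouth corner
    [150.0,  -150.0, -125.0],  # right mouth corner
], dtype=np.float64)

def face_pose(landmarks_2d, image_size):
    """Rotation (Rodrigues vector) and translation of the face w.r.t. the camera."""
    h, w = image_size
    focal = w  # rough focal-length guess in pixels
    camera_matrix = np.array([[focal, 0.0, w / 2.0],
                              [0.0, focal, h / 2.0],
                              [0.0, 0.0, 1.0]], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS, landmarks_2d, camera_matrix,
                                  np.zeros((4, 1)), flags=cv2.SOLVEPNP_ITERATIVE)
    return rvec, tvec

# landmarks_2d would come from any facial-landmark detector, as a (6, 2) float array.
```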

The candidate should have a strong background in signal processing, acoustic signal analysis, and machine learning, as well as very good programming skills in C and Matlab.

References

[1] A. Deleforge and R. Horaud. 2D Sound-Source Localization on the Binaural Manifold. IEEE Workshop on Machine Learning for Signal Processing, September 2012.
[2] A. Deleforge, F. Forbes, and R. Horaud. Mapping Learning with Partially Latent Output. arXiv:1308.2302, 2013.
