(closed) Master Project: Audiovisual Diarization for Human-Robot Interaction

Deadline for sending applications: 30 November 2013. Project proposed by Laurent Girin and Radu Horaud.

The PERCEPTION team at INRIA investigates audiovisual scene analysis for a humanoid robot. In particular, the team is interested in implementing advanced “social skills” on the consumer robot NAO (http://www.aldebaran-robotics.com/fr/). The challenge is to provide NAO with both perceptual and motor capabilities that allow natural interaction with humans. A typical scenario that we want to address is the interaction between two or three persons and the robot, which stands approximately 1 to 3 m in front of them. People talk to each other, and from time to time they interact with the robot as well. For example, they may be involved in a collaborative task that requires NAO’s assistance, e.g., providing some information. People-robot communication is expected to be as natural as possible, based on speech, expressive gestures, eye gaze, speakers turning towards the listener, etc. Moreover, all the sensors, i.e., cameras and microphones, are embedded in the robot head, and the setup does not rely on specially equipped rooms, e.g., rooms fitted with sensor networks. Implementing such skills in a robot requires us to develop and interconnect a large spectrum of signal processing, computer vision, machine learning and robotics techniques, ranging from low-level vision and audition algorithms to high-level “understanding” of the scene and behavior modeling.

We have developed several low-level signal processing techniques for audio and video scene analysis by a consumer robot, e.g., audio-visual human detection and tracking [1] and audio source localization [2]. Audio-visual information fusion is also used to improve the robustness of the analysis [3]. The goal is now to progressively advance towards higher levels of information processing, involving “social signal processing” and behavior modeling. In this master project we propose to develop a multispeaker-robot diarization method [4, 6]. The robot’s audio and visual sensors observe the scene, and the objective is to answer the following questions: How many people are there? Where are they? Who is talking to whom? One desired feature of the system is the ability to adapt continuously to the scene’s temporal evolution.

The expected output of the system is a spatiotemporal segmentation of the audiovisual scene in terms of participant locations and activities, e.g., speaking or not, looking at each other, static or making gestures, etc. Such an output could be used for further low/mid-level processing, such as sound-source separation when several persons are speaking at the same time or speaker-adapted automatic speech recognition (ASR), and for high-level processing, such as multimodal person-robot dialogue. For example, NAO may spontaneously and appropriately catch people’s attention by synthesizing speech and gestures. The diarization method will essentially rely on probabilistic graphical models for pattern recognition, such as Hidden Markov Models (HMMs) and their variants [5, 7]. These models will be fed with audio and video features extracted from the signals recorded by the robot. Different levels of pre-processing could be involved, from basic feature extraction, e.g., FFT-based sound spectrograms, to advanced analysis, e.g., face detection and tracking. Algorithms will be developed to i) estimate the optimal parameters of the graphical models from training data, and ii) estimate the optimal sequence of states given the observation of a new audio-visual scene. In addition, several open research issues could be addressed in this project:

  • Advanced feature extraction and integration. Video analysis may focus on lip movements, since their correlation with the audio signal can be modeled and exploited for speaker diarization. The speaker’s gaze can also be considered in order to characterize the targeted listener(s).
  • A humanoid robot is a wonderful platform to assess sensorimotor contingencies, i.e., modeling how action can be controlled to optimize perception. For instance, NAO’s audition can be coupled with head and body movements to turn towards a speaker of interest, optimizing the signal-to-noise ratio while providing a basic starting point for the “social signal processing” paradigm. As for the video, the field of view of NAO’s cameras is limited, so some participants can be out of view. NAO’s movements can compensate for this, which is another nice illustration of audio-visual complementarity (audition can help to detect a non-visible speaker and estimate their location).
  • Advanced participant state modeling. Beyond their activities, participants can be characterized in terms of cognitive state and intentions (attentive, passive, waiting for speech turn-taking, etc.). Modeling the cognitive state of the participants is a further step in social signal processing and behavior modeling. This can provide a baseline for modeling NAO’s appropriate behavior, specifically its engagement in dialogue.
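
As an illustration of step ii) above (estimating the optimal state sequence of an HMM given new observations), here is a minimal sketch of Viterbi decoding in pure Python. It is not the project's method: the two states ("silent", "speaking"), the binary voice-activity observations, and all probability values are hypothetical and chosen only to keep the example small. The function names `viterbi` and `decode_activity` are likewise made up for this sketch.

```python
import math


def viterbi(log_start, log_trans, log_emit):
    """Return the most likely HMM state sequence (log-domain Viterbi).

    log_start[i]    : log P(state_0 = i)
    log_trans[i][j] : log P(state_t = j | state_{t-1} = i)
    log_emit[t][j]  : log P(observation_t | state_t = j)
    Expects at least one frame in log_emit.
    """
    n_states = len(log_start)
    # score[j]: best log-probability of any path ending in state j
    score = [log_start[j] + log_emit[0][j] for j in range(n_states)]
    backptr = []  # backptr[t-1][j]: best predecessor of state j at frame t
    for t in range(1, len(log_emit)):
        prev, score, ptr = score, [], []
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: prev[i] + log_trans[i][j])
            ptr.append(best_i)
            score.append(prev[best_i] + log_trans[best_i][j] + log_emit[t][j])
        backptr.append(ptr)
    # Backtrack from the best final state to recover the full path.
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical toy model: state 0 = "silent", state 1 = "speaking".
START = [0.5, 0.5]
TRANS = [[0.9, 0.1],   # states tend to persist from frame to frame
         [0.1, 0.9]]
# EMIT[s][o] = P(detector output o | state s), with o in {0, 1}
EMIT = [[0.9, 0.1],
        [0.2, 0.8]]


def decode_activity(observations):
    """Decode a sequence of binary voice-activity detections (0 or 1)."""
    log_start = [math.log(p) for p in START]
    log_trans = [[math.log(p) for p in row] for row in TRANS]
    log_emit = [[math.log(EMIT[s][o]) for s in range(2)]
                for o in observations]
    return viterbi(log_start, log_trans, log_emit)


if __name__ == "__main__":
    print(decode_activity([0, 0, 0, 1, 1, 1]))
    print(decode_activity([0, 0, 1, 0, 0]))
```

With these assumed parameters, the persistent transition probabilities make the decoder smooth over an isolated detection, which is the behavior one typically wants from a diarization back end. Step i) (parameter estimation from training data) is not shown; in practice it would replace the hand-set tables above.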


[1] J. Cech, R. K. Mittal, A. Deleforge, J. Sanchez-Riera, X. Alameda-Pineda, R. P. Horaud. Active-Speaker Detection and Localization with Microphones and Cameras Embedded into a Robotic Head. In International Conference on Humanoid Robotics, Atlanta, GA, 2013.

[2] A. Deleforge and R. Horaud. 2D sound-source localization on the binaural manifold. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2012.

[3] X. Alameda-Pineda, V. Khalidov, R. P. Horaud, and F. Forbes. Finding audio-visual events in informal social gatherings. In ACM International Conference on Multimodal Interaction, Alicante, Spain, 2011.

[4] A. Noulas, G. Englebienne, and B. Krose. Multimodal Speaker Diarization. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 1, pages 79–93, 2012.

[5] C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, New York, NY, 2006.

[6] S. H. Shum, N. Dehak, R. Dehak, and J. R. Glass. Unsupervised Methods for Speaker Diarization: An Integrated Approach. IEEE Transactions on Audio, Speech, and Language Processing. 2013 (to appear).

[7] Z. Wang et al. Probabilistic movement modeling for intention inference in human–robot interaction. International Journal of Robotics Research, 32(7):841–858, 2013.