Apr 09

Audio-Visual Analysis in the Framework of Humans Interacting with Robots

PhD defense by Israel D. Gebru

Friday 13 April 2018, 9:30 – 10:30, Grand Amphithéâtre

INRIA Grenoble Rhône-Alpes, Montbonnot Saint-Martin

In recent years, there has been growing interest in human-robot interaction (HRI), with the aim of enabling robots to interact and communicate naturally with humans. Natural interaction implies that robots need not only to understand speech and non-verbal communication cues such as body gestures, gaze, or facial expressions, but also to understand the dynamics of the social interplay, e.g. find people in the environment, distinguish between different people, track them through the physical space, parse their actions and activities, estimate their engagement, identify who is speaking, who speaks to whom, etc. All these tasks require robots to have multimodal perception skills to meaningfully detect and integrate information from their multiple sensory channels. In this thesis, we focus on the robot’s audio-visual sensory inputs, consisting of microphones and video cameras. Among the different addressable perception tasks, we explore three: (1) multiple-speaker localization, (2) multiple-person location tracking, and (3) speaker diarization. The majority of existing works in signal processing and computer vision address these problems by utilizing either audio signals or visual information alone. In this thesis, however, we address them via fusion of the audio and visual information gathered by two microphones and one video camera. Our goal is to exploit the complementary nature of the audio and visual modalities in the hope of attaining significant improvements in robustness and performance over systems that use a single modality. Moreover, the three problems are addressed in challenging HRI scenarios, such as a robot engaged in a multi-party interaction with a varying number of participants, who may speak at the same time, move around the scene, and turn their heads/faces towards the other participants rather than facing the robot.