A Bayesian Framework for Head Pose Estimation and Tracking

PhD defense by Vincent Drouard

Monday 18 December 2017, 11:00 – 12:00, Grand Amphithéatre

INRIA Montbonnot Saint-Martin

In this thesis, we address the well-known problem of head-pose estimation in the context of human-robot interaction (HRI). We accomplish this task with a two-step approach. First, we focus on estimating the head pose from visual features. We design features that can represent the face under different orientations and at various image resolutions. The result is a high-dimensional representation of a face extracted from an RGB image. Inspired by Deleforge et al. 2015, we propose to solve the head-pose estimation problem by building a link between the head-pose parameters and the high-dimensional features perceived by a camera. This link is learned using a high-to-low probabilistic regression built from a mixture of affine transformations. Compared with classic head-pose estimation methods, we extend the head-pose parameters with additional variables that account for variability in the observations (e.g. a misaligned face bounding box), which yields a method that is robust under realistic conditions. Evaluation shows that our approach achieves better results than classic regression methods and results comparable to state-of-the-art head-pose methods that use additional cues (e.g. depth information). Second, we propose a temporal model that exploits a tracker's ability to combine information from both the present and the past. Our aim is to produce a smoother estimate and to correct oscillations between consecutive independent observations. The proposed approach embeds the previous regression into a temporal filtering framework. This extension belongs to the family of switching linear dynamic models and keeps all the advantages of the mixture of affine regressions. Overall, the proposed tracker gives a more accurate and smoother estimate of the head pose over a video sequence.
In addition, the proposed switching linear dynamic model gives better results than standard tracking models such as the Kalman filter. Although it is applied here to head-pose estimation, the methodology presented in this thesis is general and can be used to solve various regression and tracking problems; for example, we also applied it to tracking a sound source.
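
As a rough illustration of what prediction with a mixture of affine transformations can look like, the sketch below computes a pose estimate as a responsibility-weighted average of K local affine maps. It is not the implementation from the thesis: the function name predict_pose, the Gaussian gating in feature space, and all parameter names (weights, means, covs, A, b) are assumptions made for this example, and the parameters are taken as already learned.

```python
import numpy as np


def predict_pose(y, weights, means, covs, A, b):
    """Posterior-mean prediction under a mixture of K affine maps (illustrative).

    y       : (D,)      high-dimensional feature vector
    weights : (K,)      mixing weights (hypothetical, assumed learned)
    means   : (K, D)    Gaussian gating means in feature space
    covs    : (K, D, D) Gaussian gating covariances
    A, b    : (K, L, D), (K, L) local affine maps from features to pose
    Returns the (L,) pose estimate and the (K,) responsibilities.
    """
    K = len(weights)
    log_resp = np.empty(K)
    for k in range(K):
        diff = y - means[k]
        _, logdet = np.linalg.slogdet(covs[k])
        maha = diff @ np.linalg.solve(covs[k], diff)
        # Unnormalised log-probability that component k generated y.
        # The (D/2) * log(2*pi) term is shared by all k and cancels out.
        log_resp[k] = np.log(weights[k]) - 0.5 * (logdet + maha)

    # Normalise in the log domain for numerical stability.
    log_resp -= log_resp.max()
    resp = np.exp(log_resp)
    resp /= resp.sum()

    # Each component proposes its own affine prediction A_k y + b_k.
    local = np.einsum("kld,d->kl", A, y) + b
    return resp @ local, resp
```

Calling predict_pose on a feature vector that lies close to one gating mean returns essentially that component's affine prediction, while a vector halfway between two components returns a blend of both. A temporal model such as the one described above would additionally propagate the estimate from frame to frame, which this sketch does not attempt.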