[Closed] Master of science internship: Dynamic face modeling for audio-visual speech processing

The analysis of human faces has been thoroughly investigated over the past decades, leading to high-performing 2D and 3D face representations as well as face recognition models and systems. In comparison, the analysis of face movements has received much less attention. Face movements play a crucial role in human-to-human, human-to-computer and human-to-robot interactions. In particular, we are interested in the role that face movements play in speech communication. Generally speaking, face movements combine rigid head movements with non-rigid facial deformations, e.g. facial expressions. Lip and jaw movements, in particular, are correlated with speech production and hence play a paramount role in visual and audio-visual speech processing. Recently, we developed a frame-wise method that estimates head movements and removes them, such that the faces are frontally viewed [Kang et al 2021] (please see the figure below).

An example of the effect of face frontalization on the lip regions cropped from the images of a head-moving speaker. Top: input images. Middle: lip regions cropped from the input images (with head motions). Bottom: lip regions cropped from the frontalized images (head motions removed).

In this project we propose to investigate dynamic face frontalization, thus exploiting the temporal information available within a sequence, instead of performing a frame-by-frame analysis. For that purpose we will start by investigating a solution based on linear dynamical systems (LDSs), which is a natural extension of our current frame-wise model and method. We will then investigate a non-linear extension based on the recently proposed dynamical variational autoencoders (DVAEs) [Girin et al 2021]. The model and its implementation will then be used for audio-visual speech enhancement within a recently proposed VAE framework [Sadeghi & Alameda Pineda 2021].
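To give a flavour of the LDS idea, the sketch below smooths noisy frame-wise head-pose estimates over a sequence with a Kalman filter, the standard inference algorithm for a linear dynamical system. This is only an illustrative toy, not the project's actual model: the one-dimensional pose state (a single angle plus its velocity), the constant-velocity dynamics, and all parameter values are assumptions chosen for clarity.

```python
import numpy as np

def kalman_smooth_pose(observations, q=1e-3, r=1e-1):
    """Filter a sequence of noisy 1-D pose angles (e.g. head yaw).

    Illustrative LDS: state x = [angle, angular velocity], with a
    constant-velocity transition; each frame observes the angle plus noise.
    """
    A = np.array([[1.0, 1.0],   # constant-velocity state transition
                  [0.0, 1.0]])
    H = np.array([[1.0, 0.0]])  # only the angle is observed
    Q = q * np.eye(2)           # process noise covariance (assumed)
    R = np.array([[r]])         # observation noise covariance (assumed)

    x = np.array([observations[0], 0.0])  # initial state estimate
    P = np.eye(2)                         # initial state covariance
    smoothed = []
    for z in observations:
        # Predict step: propagate the state through the dynamics.
        x = A @ x
        P = A @ P @ A.T + Q
        # Update step: correct the prediction with the new observation.
        S = H @ P @ H.T + R              # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
        x = x + K @ (np.array([z]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        smoothed.append(x[0])
    return np.array(smoothed)
```

For instance, filtering a slowly varying angle corrupted by noise yields a trajectory closer to the clean signal than the raw per-frame estimates, which is precisely the benefit a sequence model brings over frame-by-frame processing.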

References

[Girin et al 2021] Laurent Girin, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber, Xavier Alameda-Pineda. Dynamical Variational Autoencoders: A Comprehensive Review. Foundations and Trends in Machine Learning, 2021.

[Kang et al 2021] Zhiqi Kang, Mostafa Sadeghi, Radu Horaud. Robust Face Frontalization for Visual Speech Recognition. IEEE International Conference on Computer Vision Workshops, 2021.

[Sadeghi & Alameda Pineda 2021] Mostafa Sadeghi and Xavier Alameda-Pineda. Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement. IEEE Transactions on Signal Processing, 2021.
