[PhD position] Visually-assisted Speech Enhancement

Context: This fully funded PhD position is part of the ANR project REAVISE (Robust and Efficient Deep Learning-based Audiovisual Speech Enhancement), which aims to develop a unified, robust, and generalizable audiovisual speech enhancement framework. The PhD candidate will work in the MULTISPEECH team at Inria Nancy – Grand Est, France, under the co-supervision of Mostafa Sadeghi (researcher, Inria) and Romain Serizel (associate professor, University of Lorraine).

Background: Audio-visual speech enhancement (AVSE) refers to the task of improving the intelligibility and quality of noisy speech by exploiting the complementary information carried by the visual modality (the speaker's lip movements) [1]. The visual modality can help separate the target speech from background sounds, especially in highly noisy environments. Owing to the success of deep neural network (DNN) architectures, AVSE has recently been extensively revisited. Existing DNN-based AVSE methods fall into two categories: supervised and unsupervised. In the supervised category, a DNN is trained to map a noisy speech signal and the associated video frames of the speaker to a clean estimate of the target speech. The more recent unsupervised methods [2] follow a statistical model-based approach combined with the expressive power of DNNs, without training on noisy data. Specifically, the prior distribution of clean speech signals is learned with a deep generative model, e.g. a variational autoencoder (VAE), and this prior is then combined with an observation model to estimate the clean speech signal in a probabilistic way.
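
To make the last step concrete, here is a minimal sketch of how a speech prior and an observation model can be combined, assuming a zero-mean Gaussian model in each time-frequency bin and an additive noise model x = s + b. The function name and all variance values are illustrative assumptions; in a VAE-based method the speech variances would typically come from the generative model's decoder, which is replaced here by fixed numbers.

```python
def wiener_estimate(noisy, speech_var, noise_var):
    """Posterior mean of clean speech under x = s + b.

    Assumes s and b are independent, zero-mean Gaussian in each bin,
    with variances speech_var and noise_var (hypothetical inputs).
    """
    estimates = []
    for x, vs, vb in zip(noisy, speech_var, noise_var):
        gain = vs / (vs + vb)  # Wiener gain, between 0 and 1
        estimates.append(gain * x)
    return estimates


# Toy complex STFT coefficients for three bins (made-up values).
noisy_bins = [1.0 + 0.5j, -0.2 + 0.1j, 0.8 - 0.3j]
speech_var = [4.0, 0.1, 1.0]  # stand-in for the learned speech prior
noise_var = [1.0, 0.9, 1.0]   # stand-in for the noise model

enhanced = wiener_estimate(noisy_bins, speech_var, noise_var)
```

Bins where the prior speech variance dominates are passed through almost unchanged, while bins dominated by noise are attenuated.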

Supervised methods require deep networks with millions of parameters, as well as a large audiovisual dataset with sufficiently diverse noise instances, to be robust against acoustic noise. They also offer no systematic way to achieve robustness to visual noise, e.g. head movements, face occlusions, or changing illumination conditions. Unsupervised methods, on the other hand, generalize better and can achieve robustness to visual noise thanks to their probabilistic nature [3]. Despite these potential advantages, they remain significantly less explored.

Project description: In this PhD project, we aim to bridge the gap between supervised and unsupervised AVSE approaches and to benefit from the best of both worlds. The central task is to design and implement a unified AVSE framework with the following features: (1) robustness to visual noise, (2) good generalization to unseen noise environments, and (3) computational efficiency at test time. To achieve the first objective, various techniques will be investigated, including probabilistic switching (gating) mechanisms [3], face frontalization [4], and data augmentation [5]. The main idea is to adaptively lower-bound the performance by that of audio-only speech enhancement whenever the visual modality is not reliable. To accomplish the second objective, we will explore efficient noise modeling frameworks inspired by unsupervised AVSE, e.g. by adaptively switching between different noise models during speech enhancement. Finally, for the third objective, lightweight inference methods and efficient generative models, e.g. based on Transformers [6], will be developed. We will work with the AVSpeech [7] and TCD-TIMIT [8] audiovisual speech corpora.
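
The lower-bounding idea behind the first objective can be illustrated with a small sketch. It assumes a hypothetical per-frame reliability score in [0, 1] for the visual stream and two already-computed estimates (audio-only and audiovisual). None of these names come from the project, and a probabilistic switching mechanism would infer such a gate rather than take it as given.

```python
def gated_estimate(audio_only, audio_visual, visual_reliability):
    """Blend two speech estimates frame by frame.

    visual_reliability is a hypothetical score in [0, 1]; at 0 the
    output falls back entirely to the audio-only estimate.
    """
    fused = []
    for s_a, s_av, r in zip(audio_only, audio_visual, visual_reliability):
        r = min(max(r, 0.0), 1.0)  # clamp the gate to [0, 1]
        fused.append(r * s_av + (1.0 - r) * s_a)
    return fused


# Toy per-frame estimates (made-up values).
audio_only = [0.2, 0.4, 0.6]
audio_visual = [0.3, 0.9, 0.5]
reliability = [1.0, 0.0, 0.5]  # second frame: e.g. an occluded face

fused = gated_estimate(audio_only, audio_visual, reliability)
```

With the gate at 0, the second frame reduces to the audio-only estimate, so an unreliable visual stream cannot influence the result for that frame.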

Requirements & skills:

  • Master’s degree, or equivalent, in the field of speech/audio processing, computer vision, machine learning, or in a related field,
  • Ability to work independently as well as in a team,
  • Solid programming skills (Python, PyTorch),
  • Good level of written and spoken English.

How to apply: Interested candidates are encouraged to contact Mostafa Sadeghi (mostafa.sadeghi@inria.fr) and Romain Serizel (romain.serizel@loria.fr) with their CV, motivation letter, and transcripts. They should also apply via the Inria job platform.

[1] D. Michelsanti, Z. H. Tan, S. X. Zhang, Y. Xu, M. Yu, D. Yu, and J. Jensen, “An overview of deep learning-based audio-visual speech enhancement and separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, 2021.
[2] M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, “Audio-visual speech enhancement using conditional variational auto-encoders,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1788–1800, 2020.
[3] M. Sadeghi and X. Alameda-Pineda, “Switching variational autoencoders for noise-agnostic audio-visual speech enhancement,” in ICASSP, 2021.
[4] Z. Kang, M. Sadeghi, and R. Horaud, “Face Frontalization Based on Robustly Fitting a Deformable Shape Model to 3D Landmarks,” arXiv:2010.13676, 2020.
[5] S. Cheng, P. Ma, G. Tzimiropoulos, S. Petridis, A. Bulat, J. Shen, and M. Pantic, “Towards Pose-invariant Lip-Reading,” in ICASSP, 2020.
[6] J. Jiang, G. G. Xia, D. B. Carlton, C. N. Anderson, and R. H. Miyakawa, “Transformer VAE: A hierarchical model for structure-aware and interpretable music representation learning,” in ICASSP, 2020, pp. 516–520.
[7] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation,” in SIGGRAPH, 2018.
[8] N. Harte and E. Gillen, “TCD-TIMIT: An Audio-Visual Corpus of Continuous Speech,” IEEE Transactions on Multimedia, vol. 17, no. 5, pp. 603–615, May 2015.