Context: The Multispeech team, at Inria Nancy, France, seeks a qualified candidate to work on signal processing and machine learning techniques for robust audiovisual speech enhancement. The candidate will be working under the co-supervision of Mostafa Sadeghi (researcher, Multispeech team), Xavier Alameda-Pineda (researcher and team leader of RobotLearn team), and Radu Horaud (senior researcher, RobotLearn team).
Starting date & duration: October 2022 (flexible), for a duration of one year (renewable depending on funding availability and performance).
Background: Audio-visual speech enhancement (AVSE) refers to the task of improving the intelligibility and quality of a noisy speech signal utilizing the complementary information of visual modality (lip movements of the speaker) , which could be very helpful in highly noisy environments. Recently, and due to the great success and progress of deep neural network (DNN) architectures, AVSE has been extensively revisited . Existing DNN-based AVSE methods are categorized into supervised and unsupervised approaches. In the former category, a DNN is trained on a large audiovisual corpus, e.g., AVSpeech , with diverse enough noise instances, to directly map the noisy speech signal and the associated video frames of the speaker into a clean estimate of the target speech signal. The trained models are usually very complex and contain millions of parameters. The unsupervised methods  follow a statistical modeling-based approach combined with the expressive power of DNNs, which involves learning the prior distribution of clean speech using deep generative models, e.g., variational autoencoders (VAEs) , on clean corpora such as TCD-TIMIT , and estimating clean speech signal in a probabilistic way. As there is no training on noise, the models are much lighter than those of supervised methods. Furthermore, the unsupervised methods have potentially better generalization performance and robustness to visual noise thanks to their probabilistic nature [6-8]. Nevertheless, these methods are very recent and significantly less explored compared to the supervised approaches.
Project description: In this project, we plan to devise a robust and efficient AVSE framework by thoroughly investigating the coupling between the recently proposed deep learning architectures for speech enhancement, both supervised and unsupervised, benefiting from the best of both worlds, along with the state-of-the-art generative modeling approaches. This will include, e.g., the use of dynamical VAEs , temporal convolutional networks (TCNs) , and attention-based strategies [11,12]. The main objectives of this project are summarized as follows:
- Developing a neural architecture that identifies reliable (either frontal or non-frontal) and unreliable (occluded, extreme poses, missing) lip images by providing a normalized score at the output;
- Developing deep generative models that efficiently exploit the sequential nature of data;
- Integrating the developed visual reliability analysis network within the deep generative model that accordingly decides whether to utilize the visual data or not. This will provide a flexible and robust audiovisual fusion and enhancement framework.
Requirements & skills: The preferred profile is described below.
- M.Sc. or Ph.D. degree in speech/audio processing, computer vision, machine learning, or in a related field,
- Ability to work independently as well as in a team,
- Solid programming skills (Python, PyTorch), and deep learning knowledge,
- Good level of written and spoken English.
How to apply: Interested candidates are encouraged to contact Mostafa Sadeghi (email@example.com), Xavier Alameda-Pineda (firstname.lastname@example.org), and Radu Horaud (email@example.com), with the required documents (CV, transcripts, motivation letter, and recommendation letters).