[Closed] Master Internship on Disentanglement of Latent Codes in Dynamical Variational Autoencoders

Context: Deep latent variable models (DLVMs) provide an effective way to model the hidden generative process underlying natural signals and images [1]. They allow us to approximate the probability density function of the data, which in turn can be used either to generate new examples resembling the training data or to perform probabilistic inference and estimation. Variational autoencoders (VAEs) offer an efficient methodology for training a DLVM: the intractable posterior distribution of the latent variables, which is essential for probabilistic inference (maximum likelihood estimation), is approximated with an inference network, called the encoder [1]. To exploit the sequential nature of data such as speech signals, dynamical versions of the VAE, called DVAEs, have recently been introduced [2]. Disentangling the latent code is an important and much sought-after property of a DLVM. A disentangled latent code contains interpretable information that separates out the various factors of variation present in the data [3].
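
For reference, the approximation described above is usually trained by maximizing the evidence lower bound (ELBO) on the data log-likelihood. The notation below is an assumption following the standard presentation in [1], not something specified in this posting: x is an observation, z its latent variable, p_θ(x|z) the decoder, q_φ(z|x) the encoder, and p(z) the prior.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
\;-\; D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

The first term rewards accurate reconstruction, while the KL term keeps the approximate posterior close to the prior. DVAEs extend this bound to sequences of observations and latent variables [2].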

Objective: In DVAEs applied to speech processing, disentangled latent representations have received little attention. Disentanglement is especially important for separating out different characteristics of speech, such as speaker identity, gender, and pitch, which can be useful in several applications, e.g., voice conversion, conditional speech enhancement and separation. Inspired by recent advances in disentangled representation learning [3], this project aims to explore disentanglement in DVAEs for speech processing in a supervised, unsupervised, or semi-supervised setting.

Required skills: Theoretical and practical experience with deep learning (PyTorch). Knowledge of audio-visual speech processing, generative models, and probabilistic inference.

Environment: This project will be carried out in the Multispeech Team, at Inria Nancy – Grand Est, in collaboration with the Perception Team, at Inria Grenoble Rhône-Alpes. The research will be closely supervised by Dr. Mostafa Sadeghi, Dr. Xavier Alameda-Pineda, Prof. Laurent Girin, and Dr. Romain Serizel. Our teams have the necessary computational resources (GPU & CPU) to carry out the proposed research.

Contact: mostafa.sadeghi@inria.fr; xavier.alameda-pineda@inria.fr

References:

[1] D. P. Kingma and M. Welling, “An introduction to variational autoencoders,” CoRR, abs/1906.02691, 2019. URL http://arxiv.org/abs/1906.02691.

[2] L. Girin et al., “Dynamical variational autoencoders: A comprehensive review,” arXiv preprint arXiv:2008.12595, 2020.

[3] F. Locatello et al., “A Sober Look at the Unsupervised Learning of Disentangled Representations and their Evaluation,” Journal of Machine Learning Research 21 (2020) 1-62.