General information
Duration: 5 to 6 months, starting date in March or April 2023 (flexible)
Location: Inria Nancy – Grand Est, team MULTISPEECH
Supervisors: Paul Magron (paul (dot) magron (at) inria (dot) fr), Mostafa Sadeghi (mostafa (dot) sadeghi (at) inria (dot) fr), Inria Researchers
Please apply by sending your CV and a short motivation letter directly to Paul Magron and Mostafa Sadeghi.
Motivation and context
Speech separation consists in isolating the signal of each speaker from an acoustic mixture in which several people may be speaking simultaneously. This task is an important preprocessing step in many applications, such as hearing aids or voice assistants based on automatic speech recognition.
State-of-the-art separation systems rely on supervised deep learning, where a network is trained to predict the isolated speakers’ signals from their mixture [1,2]. However, these approaches are costly in terms of training data and have a limited capacity to generalize to unseen speakers.
Objectives
The goal of this internship is to design a fully unsupervised system for speech separation, which is more data-efficient than supervised approaches and applicable to any mixture of speakers. To that end, we propose to combine variational autoencoders (VAEs) with dictionary models (DMs). A DM decomposes a given input matrix (usually an audio spectrogram) as the product of two interpretable factors: a dictionary of spectra and a temporal activation matrix. This family of methods was extensively researched before the era of deep learning [3], but it is limited since real-world audio spectrograms cannot be accurately decomposed with such simple models.
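As an illustration only, the sketch below shows the simplest such dictionary model, a nonnegative matrix factorization fitted with multiplicative updates; all matrix sizes, the iteration count, and the random "spectrogram" are assumptions made for this example, not specifications of the internship subject.

```python
# Minimal sketch of a dictionary model: plain NMF with Euclidean
# multiplicative updates. Shapes and hyperparameters are illustrative.
import numpy as np

def nmf(V, n_components=16, n_iter=200, eps=1e-8):
    """Factorize a nonnegative spectrogram V (freq x time) as W @ H."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, n_components)) + eps   # dictionary of spectra
    H = rng.random((n_components, T)) + eps   # temporal activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy usage: factorize a random nonnegative "spectrogram"
V = np.abs(np.random.randn(257, 100))
W, H = nmf(V)
print(W.shape, H.shape)  # (257, 16), (16, 100)
```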
Therefore, we propose to leverage VAEs as a tool to learn a latent representation of the data which is regularized using DMs. Such a system can be cast as an instance of transform learning [4]: the key idea is to apply a (learned) transform to the data so that it better complies with a desirable property, here a decomposition on a dictionary. A first attempt was recently proposed and has shown promising results in terms of speech modeling [5], although it used a fixed dictionary. This internship aims to extend this work by considering a system where the VAE and the dictionary are learned jointly, and to apply it to the task of speech separation.
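To make the general idea concrete, here is a toy PyTorch sketch of a VAE whose latent code is encouraged to lie close to a sparse combination of learned dictionary atoms. This is neither the model of [5] nor the system to be developed during the internship; all layer sizes, loss weights, and variable names are assumptions made for illustration.

```python
# Toy sketch of a VAE whose latent representation is regularized by a
# learned dictionary model; purely illustrative, not the internship's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DictVAE(nn.Module):
    def __init__(self, n_freq=257, n_latent=16, n_atoms=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_freq, 128), nn.Tanh())
        self.mu = nn.Linear(128, n_latent)
        self.logvar = nn.Linear(128, n_latent)
        self.dec = nn.Sequential(nn.Linear(n_latent, 128), nn.Tanh(),
                                 nn.Linear(128, n_freq), nn.Softplus())
        # Dictionary living in the latent space (n_latent x n_atoms)
        self.dictionary = nn.Parameter(torch.randn(n_latent, n_atoms))

    def loss(self, x, activations):
        # x: (batch, n_freq) spectrogram frames; activations: (batch, n_atoms),
        # assumed to be estimated elsewhere (e.g. by an optimization loop).
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        x_hat = self.dec(z)
        recon = F.mse_loss(x_hat, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        dm_fit = F.mse_loss(mu, activations @ self.dictionary.T, reduction="sum")
        sparsity = activations.abs().sum()
        return recon + kl + dm_fit + 0.1 * sparsity
```

In the joint setting targeted by the internship, the VAE weights and the dictionary would be optimized together, with the activations re-estimated during training.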
Once trained, the resulting system operates in three stages (a code sketch is given after the list):
- the (mixture) audio spectrogram is projected through the encoder into some latent space;
- this latent representation is factorized efficiently using a DM learning algorithm, which provides a latent feature for each speaker;
- these latent features are passed through the decoder to retrieve a spectrogram for each speaker.
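These three stages could look roughly like the following sketch, which assumes a trained encoder/decoder (such as the toy DictVAE above) and a latent-domain dictionary learning routine; the function factorize_latent and the two-speaker default are hypothetical placeholders.

```python
# Hedged sketch of the three-stage separation pipeline described above.
import torch

@torch.no_grad()
def separate(model, mixture_spec, factorize_latent, n_speakers=2):
    # 1) project the mixture spectrogram frames into the latent space
    h = model.enc(mixture_spec)
    z_mix = model.mu(h)                                   # (time, n_latent)

    # 2) factorize the latent representation with a dictionary model,
    #    yielding one latent trajectory per speaker (hypothetical routine)
    z_per_speaker = factorize_latent(z_mix, n_speakers)   # list of (time, n_latent)

    # 3) decode each speaker's latent trajectory into a spectrogram estimate
    return [model.dec(z_s) for z_s in z_per_speaker]
```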
Such a system is promising: it is fully unsupervised (it can be trained without knowledge of specific mixtures), it yields an interpretable decomposition of the latent representation, and it can serve as a basis for other applications, including speaker diarization, speech enhancement, and voice conversion.
Required Skills
Good command of Python and basic knowledge of deep learning, both theoretical and practical (e.g., using PyTorch), are required. Some notions of audio/speech signal processing and machine learning are a plus.
Work Environment and Conditions
The trainee will be supervised by Paul Magron (Chargé de Recherche Inria) and Mostafa Sadeghi (Researcher, Inria Starting Faculty Position), and will benefit from the research environment and the expertise in audio signal processing of the MULTISPEECH team. This team includes many PhD students, post-docs, trainees, and permanent staff working in this field, and offers all the necessary computational resources (GPU and CPU, speech datasets) to conduct the proposed research.
The trainee will receive the minimum internship stipend of approximately 550€ per month.
Bibliography
[1] D. Wang and J. Chen, “Supervised Speech Separation Based on Deep Learning: An Overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702-1726, 2018.
[2] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256-1266, 2019.
[3] T. Virtanen, “Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1066-1074, 2007.
[4] D. Fagot, H. Wendt and C. Févotte, “Nonnegative Matrix Factorization with Transform Learning,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[5] M. Sadeghi and P. Magron, “A Sparsity-promoting Dictionary Model for Variational Autoencoders,” Interspeech, 2022.