[FILLED][Master Internship 2023] Disentanglement in speech data for privacy needs

General information

Duration: 6 months, starting date in February, March or April 2023 (flexible)
Location: Inria Nancy, team MULTISPEECH or Inria Lille, team MAGNET
Supervisors: Emmanuel Vincent (emmanuel.vincent@inria.fr) and Marc Tommasi (marc.tommasi@inria.fr)

Please apply by sending your CV and a short motivation letter directly to Emmanuel Vincent and Marc Tommasi.

Motivation and context

Large-scale collection, storage, and processing of speech data pose severe privacy threats [2]. Indeed, speech encapsulates a wealth of personal data (e.g., age and gender, ethnic origin, personality traits, health, and socio-economic status) which can be linked to the speaker’s identity via metadata or via automatic speaker recognition. Speech data may also be exploited for voice spoofing via voice cloning software. With firm backing from privacy legislation such as the European General Data Protection Regulation (GDPR), several initiatives are emerging to develop and evaluate privacy preservation solutions for speech technology. These include voice anonymization methods [4], which aim to conceal the speaker’s voice identity without degrading the utility for downstream tasks, and speaker re-identification attacks [5], which aim to assess the resulting privacy guarantees, e.g., in the scope of the VoicePrivacy challenge series [6].


The internship will tackle the objective of speech anonymization. Previous work has shown that simple adversarial approaches which aim to remove speaker identity from speech signals do not provide sufficient privacy guarantees [3]. One interpretation of this failure is that the adversaries were not strong enough. Moreover, there is no clear evidence that a representation from which speaker identity has been removed remains informative enough to allow the reconstruction of intelligible speech signals. These observations raise the classical trade-off between privacy and utility that is central to many privacy preservation scenarios.

Instead of trying to remove speaker information, another option is to replace it with different speaker information. To do so, a sub-objective is to disentangle speech signals, that is, to isolate the speech features that contribute to the success of speaker identification. Disentanglement is understood in this project as the process of embedding voice data in a new representation where different types of information (speaker identity, linguistic content, or even traits such as age, gender, or ethnicity) are separated and associated with disjoint sets of features. Variational autoencoders are believed to naturally support disentanglement [1]. Additionally, variational approaches can be used to make attackers stronger by introducing more diversity. These two ways of improving adversarial approaches for learning a private representation of speech will be investigated.
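To make the setting concrete, here is a minimal PyTorch sketch of the kind of adversarial disentanglement described above: a toy VAE whose latent vector is split into a "content" part and a "speaker" part, with a gradient-reversal adversary that pushes speaker information out of the content part. All layer sizes, the feature dimension, and the speaker count are illustrative placeholders, not a proposed architecture for the internship.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass.
    The encoder is thus trained *against* the speaker adversary."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DisentanglingVAE(nn.Module):
    """Toy VAE with a latent split into content and speaker blocks.
    Sizes are illustrative assumptions, not tuned for real speech."""
    def __init__(self, feat_dim=40, z_content=16, z_speaker=8, n_speakers=10):
        super().__init__()
        z_dim = z_content + z_speaker
        self.z_content = z_content
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, z_dim)
        self.logvar = nn.Linear(64, z_dim)
        self.decoder = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(),
                                     nn.Linear(64, feat_dim))
        # The adversary tries to identify the speaker from the content block.
        self.adversary = nn.Linear(z_content, n_speakers)

    def forward(self, x, lam=1.0):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        x_hat = self.decoder(z)
        spk_logits = self.adversary(GradReverse.apply(z[:, :self.z_content], lam))
        return x_hat, mu, logvar, spk_logits

# One illustrative training step on random "feature frames".
model = DisentanglingVAE()
x = torch.randn(32, 40)            # batch of 32 frames, 40-dim features
spk = torch.randint(0, 10, (32,))  # speaker labels
x_hat, mu, logvar, spk_logits = model(x)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = (nn.functional.mse_loss(x_hat, x) + kl
        + nn.functional.cross_entropy(spk_logits, spk))
loss.backward()
```

Because of the gradient reversal, minimizing this single loss trains the adversary to recognize the speaker while training the encoder to make that recognition fail, which is one way to realize the "stronger attacker" idea mentioned above.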

Required Skills

Good command of Python and basic knowledge of deep learning, both theoretical and practical (e.g., using PyTorch), are required. Some notions of audio/speech signal processing and machine learning are a plus.

Work Environment

The trainee will be supervised by Emmanuel Vincent and Marc Tommasi, and will benefit from the research environment and the expertise in audio signal processing of the MULTISPEECH team. This team includes many PhD students, post-docs, trainees, and permanent staff working in this field, and offers all the necessary computational resources (GPU and CPU, speech datasets) to conduct the proposed research.


[1] L. Girin, S. Leglaive, X. Bie, J. Diard, T. Hueber, and X. Alameda-Pineda. Dynamical Variational Autoencoders: A Comprehensive Review, volume 15 of Foundations and Trends in Machine Learning. Now Publishers, 2021.
[2] A. Nautsch, A. Jimenez, A. Treiber, J. Kolberg, C. Jasserand, E. Kindt, H. Delgado, M. Todisco, M. A. Hmani, A. Mtibaa, M. A. Abdelraheem, A. Abad, F. Teixeira, M. Gomez-Barrero, D. Petrovska, G. Chollet, N. Evans, T. Schneider, J.-F. Bonastre, B. Raj, I. Trancoso, and C. Busch. Preserving privacy in speaker and speech characterisation. Computer Speech and Language, 58:441–480, 2019.
[3] B. M. L. Srivastava, A. Bellet, M. Tommasi, and E. Vincent. Privacy-preserving adversarial representation learning in ASR: Reality or illusion? In Interspeech, pages 3700–3704, 2019.
[4] B. M. L. Srivastava, M. Maouche, M. Sahidullah, E. Vincent, A. Bellet, M. Tommasi, N. Tomashenko, X. Wang, and J. Yamagishi. Privacy and utility of x-vector based speaker anonymization. IEEE/ACM Transactions on Audio, Speech and Language Processing, 30:2383–2395, 2022.
[5] B. M. L. Srivastava, N. Vauquier, M. Sahidullah, A. Bellet, M. Tommasi, and E. Vincent. Evaluating voice conversion-based privacy protection against informed attackers. In 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 2802–2806, 2020.
[6] N. Tomashenko, X. Wang, E. Vincent, J. Patino, B. M. L. Srivastava, P.-G. Noé, A. Nautsch, N. Evans, J. Yamagishi, B. O’Brien, A. Chanclu, J.-F. Bonastre, M. Todisco, and M. Maouche. The VoicePrivacy 2020 Challenge: Results and findings. Computer Speech and Language, 74:101362, 2022.