Research themes

Beyond black-box supervised learning

This research axis focuses on fundamental, domain-agnostic challenges in deep learning, such as the integration of domain knowledge, data efficiency, and privacy preservation. The results of this axis naturally apply to the domains studied in the two other research axes.

Integrating domain knowledge

State-of-the-art methods in speech and audio are based on neural networks trained for the targeted task. This paradigm faces major limitations: lack of interpretability and guarantees, large data requirements, and inability to generalize to unseen classes or tasks. We investigate deep generative models as a way to learn task-agnostic probabilistic models of audio signals, and we design inference methods to combine and reuse them across a variety of tasks. We pursue our investigation of hybrid methods that combine the representational power of deep learning with statistical signal processing expertise, leveraging recent optimization techniques for non-convex, non-linear inverse problems. We also explore the integration of deep learning and symbolic reasoning to increase the generalization ability of deep models and to empower researchers and engineers to improve them.
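As one illustration of this kind of hybrid inference, the sketch below performs MAP estimation of a clean signal under a pretrained deep generative prior combined with a simple additive Gaussian noise model. The `Decoder` class, `map_estimate` function, and all dimensions are hypothetical placeholders, not our actual models.

```python
# Minimal sketch: reusing a deep generative prior p(x | z) inside a
# classical inverse problem (here, denoising), assuming y = x + noise.
import torch

class Decoder(torch.nn.Module):          # placeholder generative model
    def __init__(self, latent_dim=16, signal_dim=256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(latent_dim, 128), torch.nn.Tanh(),
            torch.nn.Linear(128, signal_dim))
    def forward(self, z):
        return self.net(z)

def map_estimate(y, decoder, noise_var=0.1, latent_dim=16, steps=500, lr=1e-2):
    """MAP estimate of x = decoder(z) given noisy y, with a
    standard-normal prior on the latent code z."""
    z = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        x = decoder(z)
        # negative log-posterior: Gaussian data fit + latent prior
        loss = ((y - x) ** 2).sum() / (2 * noise_var) + 0.5 * (z ** 2).sum()
        loss.backward()
        opt.step()
    return decoder(z).detach()

y = torch.randn(256)                     # stand-in for a noisy observation
x_hat = map_estimate(y, Decoder())
```

Because the prior is trained independently of the noise model, the same decoder could in principle be reused for other inverse problems by swapping the data-fit term.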

Learning from little/no labeled data

While fully labeled data are costly, unlabeled data are cheap but provide intrinsically less information. Weakly supervised learning based on less expensive incomplete and/or noisy labels is a promising middle ground. This entails modeling the label noise and leveraging that model for unbiased training. The noise model may depend on the labeler, on the spoken context (voice commands), or on the temporal structure (ambient sound analysis). We also study transfer learning to adapt an expressive (audiovisual) speech synthesizer trained on a given speaker to another speaker for whom only neutral voice data has been collected.
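A minimal sketch of unbiased training under label noise, assuming the simplest possible noise model: a known, instance-independent transition matrix folded into the loss ("forward correction"). Real noise models, as noted above, may depend on the labeler or context; the matrix values and batch here are toy illustrations.

```python
# T[i, j] = P(noisy label j | true label i); folding T into the loss
# makes minimization on noisy labels a proxy for clean training.
import torch
import torch.nn.functional as F

def noise_corrected_loss(logits, noisy_labels, T):
    clean_probs = F.softmax(logits, dim=-1)      # model's P(true class | x)
    noisy_probs = clean_probs @ T                # implied P(noisy label | x)
    return F.nll_loss(torch.log(noisy_probs + 1e-8), noisy_labels)

num_classes = 4
# toy noise: 90% correct labels, errors spread uniformly over other classes
T = torch.full((num_classes, num_classes), 0.1 / (num_classes - 1))
T.fill_diagonal_(0.9)

logits = torch.randn(8, num_classes, requires_grad=True)
noisy_labels = torch.randint(num_classes, (8,))
loss = noise_corrected_loss(logits, noisy_labels, T)
loss.backward()
```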

Preserving privacy

Some voice technology companies process users' voices in the cloud and store them for training purposes, which raises privacy concerns. We aim to conceal the speaker's identity and (some of) the speaker's states and traits in the speech signal, and to evaluate the resulting automatic speech/speaker recognition accuracy as well as the subjective quality, intelligibility, and identifiability, possibly after removing private words from the training data. We also explore semi-decentralized learning methods for model personalization, and seek to obtain statistical guarantees.
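As a sketch of the decentralized direction, the code below implements plain federated averaging: each client trains on its private data locally, and only model weights are shared and aggregated. The linear model, client data, and single round are illustrative assumptions, not our actual protocol, which further targets personalization and statistical guarantees.

```python
# Federated averaging sketch: raw (audio) data never leaves the clients.
import copy
import torch

def local_update(model, data, targets, lr=0.1, epochs=1):
    model = copy.deepcopy(model)             # train a private copy on-device
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(data), targets)
        loss.backward()
        opt.step()
    return model.state_dict()

def federated_average(states):
    # element-wise mean of the client weights
    avg = copy.deepcopy(states[0])
    for key in avg:
        avg[key] = torch.stack([s[key] for s in states]).mean(dim=0)
    return avg

global_model = torch.nn.Linear(10, 1)
clients = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(3)]
states = [local_update(global_model, x, y) for x, y in clients]
global_model.load_state_dict(federated_average(states))
```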

Speech production and perception

This research axis covers the production of speech, through articulatory modeling and multimodal expressive speech synthesis, and the perception of speech, through the categorization of sounds and prosody in native and non-native speech.

Articulatory modeling

Articulatory speech synthesis relies on 2D and 3D modeling of the dynamics of the vocal tract from real-time MRI data. The prediction of glottis opening is also considered so as to produce better-quality acoustic events for consonants. The coarticulation model developed to animate the visible articulators will be extended to control the face and the tongue. This helps characterize the links between the vocal tract and the face, and illustrate inner-mouth articulation to learners. The suspension of articulatory movements in stuttering speech is also studied.

Multimodal expressive speech

Improving the dynamic realism of the talking head's animation, which has a direct impact on audiovisual intelligibility, remains our goal. We consider both the animation of the lower part of the face, related to speech, and of the upper part, related to facial expression, and development continues towards a multilingual talking head. We further investigate the modeling of expressivity for both audio-only and audiovisual speech synthesis. We also evaluate the benefit of the talking head in various use cases, including for children with language and learning disabilities and for deaf people.

Categorization of sounds and prosody

Reading and speaking are basic skills that need to be mastered. Further analysis of schooling experience will allow a better understanding of reading acquisition, especially for children with language impairments. With respect to L1/L2 language interference, a special focus is set on the impact of L2 prosody on segmental realizations. Prosody is also considered for its role in structuring speech communication, including through discourse particles. Moreover, we experiment with the use of speech technologies for computer-assisted language learning in middle and high schools and, hopefully, for helping children learn to read.

Speech in its environment

This research axis covers acoustic environment analysis, speech enhancement and noise robustness, and linguistic and semantic processing.

Acoustic environment analysis

Audio scene analysis is key to characterizing the environment in which spoken communication may take place. We investigate audio event detection methods that exploit both strongly/weakly labeled and unlabeled data, operate in real-world conditions, can discover novel events, and provide a semantic interpretation. We continue working on source localization in the presence of nearby acoustic reflectors. We also pursue our efforts at the interface with room acoustics, blindly estimating room properties and developing acoustics-aware signal processing methods. Beyond spoken communication, this work has many applications in surveillance, robot audition, building acoustics, and augmented reality.
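As a concrete example of one localization building block, here is a minimal GCC-PHAT sketch estimating the time difference of arrival between two microphones. The signals and delay are synthetic; real rooms, with the reflectors mentioned above, call for more robust variants.

```python
# GCC-PHAT: cross-correlation with a phase transform that whitens the
# spectrum, making the delay peak robust to the source's own spectrum.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay (in seconds) of `sig` relative to `ref`."""
    n = len(sig) + len(ref)
    S = np.fft.rfft(sig, n=n)
    R = np.fft.rfft(ref, n=n)
    cross = S * np.conj(R)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n=n)
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

fs = 16000
x = np.random.randn(fs)                        # 1 s of "source" signal
delay = 12                                     # samples (0.75 ms)
mic1, mic2 = x, np.concatenate((np.zeros(delay), x[:-delay]))
tau = gcc_phat(mic2, mic1, fs)
print(f"estimated delay: {tau * 1e3:.2f} ms")  # ~0.75 ms
```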

Speech enhancement and noise robustness

We pursue speech enhancement methods targeting several distortions (echo, reverberation, noise, overlapping speech) for both speech and speaker recognition applications, and extend them to ad-hoc arrays made of the microphones available in our daily lives using multi-view learning. We also continue to explore statistical signal models beyond the usual zero-mean complex Gaussian model in the time-frequency domain, e.g., deep generative models of the signal phase. Robust acoustic modeling will be achieved, on the one hand, by learning domain-invariant representations or performing unsupervised domain adaptation and, on the other hand, by extending our uncertainty-aware approach to more advanced (e.g., non-Gaussian) uncertainty models and accounting for the additional uncertainty due to short utterances, with application to speaker and language recognition “in the wild”.
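For reference, the zero-mean complex Gaussian model mentioned above leads to the classical Wiener filter in the short-time Fourier transform domain. The sketch below applies the corresponding per-bin gain with toy, oracle power spectra; in practice these spectra must themselves be estimated.

```python
# MMSE estimate of the clean STFT coefficient X from noisy Y = X + N,
# with X ~ CN(0, phi_x) and N ~ CN(0, phi_n): X_hat = phi_x/(phi_x+phi_n) * Y
import numpy as np

def wiener(noisy_stft, speech_psd, noise_psd):
    gain = speech_psd / (speech_psd + noise_psd)      # per time-frequency bin
    return gain * noisy_stft

rng = np.random.default_rng(0)
bins, frames = 257, 100                               # e.g., 512-point STFT
speech_psd = rng.gamma(2.0, 1.0, size=(bins, frames)) # toy oracle spectra
noise_psd = np.full((bins, frames), 0.5)
noisy = rng.normal(size=(bins, frames)) + 1j * rng.normal(size=(bins, frames))
clean_hat = wiener(noisy, speech_psd, noise_psd)
```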

Linguistic and semantic processing

We seek to address robust speech recognition by exploiting word/sentence embeddings carrying semantic information and combining them with acoustic uncertainty to rescore the recognizer's outputs. We also combine semantic content analysis with text obfuscation models (similar to the label noise models to be investigated for weakly supervised training of speech recognition) to detect hate speech in social media and classify it (as hateful, aggressive, insulting, ironic, neutral, etc.).
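A minimal sketch of the rescoring idea under strong simplifying assumptions: `embed` stands in for any sentence-embedding model (here a random, hash-seeded stub), and the linear combination of acoustic and semantic scores is purely illustrative.

```python
# N-best rescoring: combine each hypothesis's acoustic score with its
# semantic similarity to a context sentence.
import numpy as np

def embed(sentence):                     # hypothetical embedding model
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def rescore(nbest, context, alpha=0.7):
    """nbest: list of (hypothesis, acoustic_log_score); higher is better."""
    c = embed(context)
    scored = [(hyp, alpha * ac + (1 - alpha) * float(embed(hyp) @ c))
              for hyp, ac in nbest]
    return max(scored, key=lambda t: t[1])[0]

nbest = [("turn on the lights", -1.2), ("turn on the fights", -1.1)]
best = rescore(nbest, "smart-home voice command")
```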