A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony in Talking Head Generation

by Louis Airale, Dominique Vaufreydaz, and Xavier Alameda-Pineda [paper][code] Abstract: Animating still face images with deep generative models using a speech input signal is an active research topic and has seen important recent progress. However, much of the effort has been put into lip syncing and rendering quality while the…

Continue reading
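The title's central idea, enforcing audio-visual synchrony at several temporal scales rather than at the lip level only, can be illustrated with a toy loss. Below is a minimal PyTorch sketch, not the authors' architecture: dummy audio and motion embeddings are average-pooled over a few window lengths, and time-aligned windows are pulled together with a cosine objective. The scales, pooling, and objective are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multiscale_sync_loss(audio_emb, motion_emb, scales=(1, 4, 16)):
    """Toy multi-scale synchrony loss (hypothetical, not the paper's exact
    formulation). Both inputs have shape (batch, frames, dim)."""
    loss = 0.0
    for s in scales:
        # average-pool over non-overlapping windows of length s
        a = F.avg_pool1d(audio_emb.transpose(1, 2), kernel_size=s).transpose(1, 2)
        m = F.avg_pool1d(motion_emb.transpose(1, 2), kernel_size=s).transpose(1, 2)
        # pull time-aligned audio/motion windows together
        sim = F.cosine_similarity(a, m, dim=-1)   # (batch, frames // s)
        loss = loss + (1.0 - sim).mean()
    return loss / len(scales)

audio = torch.randn(2, 32, 64)    # dummy audio embeddings
motion = torch.randn(2, 32, 64)   # dummy facial-motion embeddings
print(multiscale_sync_loss(audio, motion))
```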

Unsupervised speech enhancement with deep dynamical generative speech and noise models

by Xiaoyu Lin, Simon Leglaive, Laurent Girin, and Xavier Alameda-Pineda Interspeech 2023 [paper][code] Abstract: This work builds on previous work on unsupervised speech enhancement using a dynamical variational autoencoder (DVAE) as the clean speech model and non-negative matrix factorization (NMF) as the noise model. We propose to replace the NMF…

Continue reading
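For context on the NMF noise model that this work replaces with a deep dynamical generative model, here is a minimal sketch of classic multiplicative-update NMF (Lee-Seung, Euclidean cost) factorizing a noise power spectrogram as V ≈ WH. The rank, iteration count, and cost function are illustrative choices, not the paper's configuration.

```python
import numpy as np

def nmf(V, rank=8, n_iter=200, eps=1e-9):
    """Minimal Lee-Seung multiplicative-update NMF (Euclidean cost).
    V: nonnegative (freq, time) noise power spectrogram, V ~= W @ H."""
    n_freq, n_frames = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update spectral templates
    return W, H

V = np.abs(np.random.randn(257, 100)) ** 2     # dummy noise power spectrogram
W, H = nmf(V)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative error
```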

Speech Modeling with a Hierarchical Transformer Dynamical VAE

by Xiaoyu Lin, Xiaoyu Bie, Simon Leglaive, Laurent Girin, and Xavier Alameda-Pineda IEEE International Conference on Acoustics, Speech and Signal Processing 2023 [paper][code] Abstract: The dynamical variational autoencoders (DVAEs) are a family of latent-variable deep generative models that extends the VAE to model a sequence of observed data and a…

Continue reading
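The generative side of a transformer-based DVAE can be pictured as an autoregressive decoder in which each frame x_t depends on the latents z_{1:t} through a causal attention mask. The PyTorch sketch below shows only that dependency structure; it is not the paper's HiT-DVAE, and the dimensions and layer counts are arbitrary.

```python
import torch
import torch.nn as nn

class TinyDVAEDecoder(nn.Module):
    """Schematic autoregressive DVAE-style decoder: the mean of
    p(x_t | z_{1:t}) is produced by a causally masked Transformer."""
    def __init__(self, z_dim=16, x_dim=64, d_model=64):
        super().__init__()
        self.inp = nn.Linear(z_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.tf = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, x_dim)

    def forward(self, z):                        # z: (batch, T, z_dim)
        T = z.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.tf(self.inp(z), mask=mask)      # causal: step t sees z_{1:t}
        return self.out(h)

z = torch.randn(2, 50, 16)                       # a sampled latent sequence
print(TinyDVAEDecoder()(z).shape)                # torch.Size([2, 50, 64])
```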

Learning and controlling the source-filter representation of speech with a variational autoencoder

by Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, and Renaud Séguier SpeechCom, 2023 [arXiv] [HAL] [code] [examples] Abstract: Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, taking inspiration from the anatomical mechanisms…

Continue reading
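Once a latent direction is known to encode a source-filter factor such as f0, control amounts to editing the latent code along that direction while leaving its orthogonal complement untouched. A small NumPy sketch of that manipulation, with a random stand-in for the learned direction:

```python
import numpy as np

def set_factor(z, direction, target):
    """Set the coordinate of latent code z along a learned factor
    direction (e.g. an f0 axis) to `target`, leaving the orthogonal
    complement, i.e. the other speech factors, untouched. Illustrative
    only: `direction` stands in for a direction learned as in the paper."""
    d = direction / np.linalg.norm(direction)
    return z - (z @ d) * d + target * d

rng = np.random.default_rng(0)
z = rng.standard_normal(16)          # latent code of one speech frame
f0_dir = rng.standard_normal(16)     # hypothetical learned f0 direction
z_edit = set_factor(z, f0_dir, target=2.0)
print(round(float(z_edit @ (f0_dir / np.linalg.norm(f0_dir))), 3))  # 2.0
```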

Expression-preserving face frontalization improves visually assisted speech processing

by Zhiqi Kang, Mostafa Sadeghi, Radu Horaud and Xavier Alameda-Pineda International Journal of Computer Vision, 2023, 131(5), pp. 1122–1140 [arXiv] [HAL] [webpage] Abstract. Face frontalization consists of synthesizing a frontally-viewed face from an arbitrarily-viewed one. The main contribution of this paper is a frontalization methodology that preserves non-rigid facial deformations in order to boost…

Continue reading
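Expression-preserving frontalization rests on separating the rigid head pose from the non-rigid facial deformation. A standard way to estimate and remove the rigid part from 3-D landmarks is the Kabsch/Procrustes method, sketched below; this illustrates the rigid/non-rigid decomposition generically and is not the paper's method.

```python
import numpy as np

def rigid_align(P, Q):
    """Kabsch/Procrustes: rigid transform (R, t) mapping landmarks P onto
    a frontal template Q. Applying it removes head pose while leaving
    non-rigid (expression) deformations in place."""
    muP, muQ = P.mean(0), Q.mean(0)
    H = (P - muP).T @ (Q - muQ)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T      # proper rotation
    t = muQ - R @ muP
    return R, t

template = np.random.randn(68, 3)                # frontal 3-D landmark template
a = np.deg2rad(30)                               # simulate a 30-degree head turn
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a),  np.cos(a), 0.0],
               [0.0, 0.0, 1.0]])
observed = template @ Rz.T + np.array([0.1, 0.0, 0.2])
R, t = rigid_align(observed, template)
print(np.abs(observed @ R.T + t - template).max())  # ~0: pose removed
```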

The impact of removing head movements on audio-visual speech enhancement

by Zhiqi Kang, Mostafa Sadeghi, Radu Horaud, Xavier Alameda-Pineda, Jacob Donley, Anurag Kumar ICASSP’22, Singapore [paper][examples][code][slides] Abstract. This paper investigates the impact of head movements on audio-visual speech enhancement (AVSE). Although they are a common conversational feature, head movements have been ignored by past and recent studies: they challenge today’s learning-based…

Continue reading

Dynamical Variational AutoEncoders

by Laurent Girin, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber, and Xavier Alameda-Pineda Foundations and Trends in Machine Learning, 2021, Vol. 15, No. 1-2, pp. 1–175. [Review paper] [Code] [Tutorial @ICASSP 2021] Abstract. Variational autoencoders (VAEs) are powerful deep generative models widely used to represent high-dimensional complex data through a low-dimensional…

Continue reading
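In the review's formalism, a plain VAE models each data point independently through p(x, z) = p(x|z) p(z), whereas a DVAE couples an observed sequence x_{1:T} with a latent sequence z_{1:T}. The most general causal factorization considered in the review is:

```latex
p_\theta(x_{1:T}, z_{1:T}) \;=\; \prod_{t=1}^{T}
  p_\theta\!\left(x_t \mid x_{1:t-1},\, z_{1:t}\right)
  p_\theta\!\left(z_t \mid x_{1:t-1},\, z_{1:t-1}\right)
```

Specific members of the family (e.g. DKF, STORN, VRNN, SRNN) are obtained by dropping some of these conditional dependencies.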

A Benchmark of Dynamical Variational Autoencoders applied to Speech Spectrogram Modeling

by Xiaoyu Bie, Laurent Girin, Simon Leglaive, Thomas Hueber and Xavier Alameda-Pineda Interspeech’21, Brno, Czech Republic [paper][slides][code][bibtex] Abstract. The Variational Autoencoder (VAE) is a powerful deep generative model that is now extensively used to represent high-dimensional complex data via a low-dimensional latent space learned in an unsupervised manner. In the…

Continue reading
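The benchmark's analysis-resynthesis protocol can be sketched end to end: compute a speech spectrogram, pass it through the model, resynthesize with the original phase, and score the reconstruction. Below is a runnable toy loop using SciPy and an identity stand-in for a trained DVAE; the SNR-style score is only a stand-in for the paper's evaluation measures.

```python
import numpy as np
from scipy.signal import stft, istft

def analysis_resynthesis_score(wav, model, fs=16000, nperseg=512):
    """Schematic analysis-resynthesis loop: model the power spectrogram,
    resynthesize with the original phase, return an SNR-like score in dB."""
    _, _, Z = stft(wav, fs=fs, nperseg=nperseg)
    power_hat = model(np.abs(Z) ** 2)                 # (freq, time) in and out
    Z_hat = np.sqrt(power_hat) * np.exp(1j * np.angle(Z))
    _, wav_hat = istft(Z_hat, fs=fs, nperseg=nperseg)
    n = min(len(wav), len(wav_hat))
    err = wav[:n] - wav_hat[:n]
    return 10 * np.log10(np.sum(wav[:n] ** 2) / (np.sum(err ** 2) + 1e-12))

identity_model = lambda s: s        # stand-in for a trained DVAE
wav = np.random.randn(16000)        # one second of dummy audio
print(analysis_resynthesis_score(wav, identity_model))
```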

FullSubNet: a full-band and sub-band fusion model for real-time single-channel speech enhancement

by Xiang Hao*,#, Xiangdong Su#, Radu Horaud and Xiaofei Li* (*Westlake University, #Inner Mongolia University, China) ICASSP 2021 [arXiv][github][youtube] Abstract. This paper proposes a full-band and sub-band fusion model, named FullSubNet, for single-channel real-time speech enhancement. Full-band and sub-band refer to models that take as input full-band and sub-band noisy…

Continue reading
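The sub-band stream hinges on giving each frequency its neighboring frequencies as input, so that a model shared across frequencies can learn local spectral patterns along time. The unfolding step below is a schematic PyTorch sketch of that input construction under assumed tensor shapes, not the released implementation:

```python
import torch
import torch.nn.functional as F

def subband_inputs(mag, n=2):
    """Build FullSubNet-style sub-band inputs (schematic): for every
    frequency f, stack the magnitudes of f and its n neighbors on each
    side, giving each sub-band unit a (2n+1)-wide spectral context that
    a recurrent model can then process along time. mag: (B, F, T)."""
    padded = F.pad(mag.unsqueeze(1), (0, 0, n, n), mode="reflect").squeeze(1)
    # unfold the frequency axis into overlapping windows of width 2n+1
    return padded.unfold(dimension=1, size=2 * n + 1, step=1)  # (B, F, T, 2n+1)

mag = torch.randn(4, 257, 100).abs()    # dummy noisy magnitude spectrogram
print(subband_inputs(mag).shape)        # torch.Size([4, 257, 100, 5])
```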

Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement

by Mostafa Sadeghi, Xavier Alameda-Pineda IEEE TSP, 2021 [paper] [arXiv] Abstract. In this paper, we are interested in unsupervised (unknown noise) speech enhancement, where the probability distribution of the clean speech spectrogram is modeled via a latent-variable generative model, also called the decoder. Recently, variational autoencoders (VAEs) have gained much popularity…

Continue reading
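The mixture-of-inference-networks idea can be sketched as follows: an audio encoder and a visual encoder each output Gaussian posterior parameters, and the approximate posterior is their mixture. Below is a toy PyTorch sampler under assumed shapes; in the actual model the mixture weight and both encoders are learned jointly.

```python
import torch

def mixture_posterior_sample(mu_a, logvar_a, mu_v, logvar_v, pi):
    """Sample from a two-component mixture posterior (schematic): pick the
    audio or visual component per example, then reparameterize."""
    comp = torch.bernoulli(pi)                    # 1 -> audio, 0 -> visual
    mu = comp * mu_a + (1 - comp) * mu_v
    logvar = comp * logvar_a + (1 - comp) * logvar_v
    eps = torch.randn_like(mu)
    return mu + eps * torch.exp(0.5 * logvar)     # reparameterization trick

B, D = 8, 32
pi = torch.full((B, 1), 0.5)                      # assumed mixture weight
z = mixture_posterior_sample(torch.randn(B, D), torch.zeros(B, D),
                             torch.randn(B, D), torch.zeros(B, D), pi)
print(z.shape)                                    # torch.Size([8, 32])
```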