A Proposal-based Paradigm for Self-supervised Sound Source Localization in Videos

by Hanyu Xuan, Zhiliang Wu, Jian Yang, Yan Yan, Xavier Alameda-Pineda IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, New Orleans, USA [HAL] Abstract. Humans can easily recognize where and how a sound is produced by watching a scene and listening to the corresponding audio cues. To achieve such cross-modal perception on machines, existing methods…

Continue reading

Self-Supervised Models are Continual Learners

by Enrico Fini, Victor G. Turrisi da Costa, Xavier Alameda-Pineda, Elisa Ricci, Karteek Alahari, Julien Mairal IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, New Orleans, USA [arXiv][Code][HAL] Abstract. Self-supervised models have been shown to produce comparable or better visual representations than their supervised counterparts when trained offline on unlabeled data at scale. However,…

Continue reading

The impact of removing head movements on audio-visual speech enhancement

by Zhiqi Kang, Mostafa Sadeghi, Radu Horaud, Xavier Alameda-Pineda, Jacob Donley, Anurag Kumar ICASSP’22, Singapore [paper][examples][code][slides] Abstract. This paper investigates the impact of head movements on audio-visual speech enhancement (AVSE). Although they are a common conversational feature, head movements have been ignored by past and recent studies: they challenge today’s learning-based…

Continue reading

Dynamical Variational AutoEncoders

by Laurent Girin, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber, and Xavier Alameda-Pineda Foundations and Trends in Machine Learning, 2021, Vol. 15, No. 1-2, pp. 1–175. [Review paper] [Code] [Tutorial @ICASSP 2021] Abstract. Variational autoencoders (VAEs) are powerful deep generative models widely used to represent high-dimensional complex data through a low-dimensional…

Continue reading
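To make the VAE building block behind the review above concrete, here is a minimal, generic VAE sketch in PyTorch. It shows only the standard encoder/decoder pair, the reparameterisation trick, and the negative ELBO that the surveyed DVAE models extend over time; the class name, layer sizes, and loss choice are illustrative assumptions and are not taken from the review or its accompanying code release.

```python
# Minimal, generic VAE sketch (assumed architecture, for illustration only).
# The encoder maps a high-dimensional observation x to the parameters of a
# low-dimensional Gaussian posterior q(z|x); the decoder maps a latent sample
# z back to a reconstruction of x.
import torch
import torch.nn as nn


class VAE(nn.Module):
    def __init__(self, x_dim=513, z_dim=16, h_dim=128):
        super().__init__()
        # Encoder (inference network): x -> (mu, log sigma^2) of q(z|x)
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # Decoder (generative network): z -> reconstruction of x
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar


def neg_elbo(x, x_rec, mu, logvar):
    # Negative ELBO = reconstruction error + KL(q(z|x) || N(0, I))
    rec = nn.functional.mse_loss(x_rec, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```

A dynamical VAE, as covered in the review, replaces this static model with one whose latent and/or observed variables form temporal sequences with explicit dependencies across time.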

SocialInteractionGAN: Multi-person Interaction Sequence Generation

by Louis Airale, Dominique Vaufreydaz and Xavier Alameda-Pineda [paper] Abstract. Prediction of human actions in social interactions has important applications in the design of social robots or artificial avatars. In this paper, we model human interaction generation as a discrete multi-sequence generation problem and present SocialInteractionGAN, a novel adversarial architecture for conditional interaction…

Continue reading

PI-Net: Pose Interacting Network for Multi-Person Monocular 3D Pose Estimation

by Wen Guo, Enric Corona, Francesc Moreno-Noguer, Xavier Alameda-Pineda IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2021) [paper][code] Abstract. Recent literature has addressed the monocular 3D pose estimation task very satisfactorily. In these studies, different persons are usually treated as independent pose instances to be estimated. However, in many everyday situations,…

Continue reading

Robust Face Frontalization For Visual Speech Recognition

by Zhiqi Kang, Radu Horaud and Mostafa Sadeghi ICCV’21 Workshop on Traditional Computer Vision in the Age of Deep Learning (TradiCV’21) [paper (extended version)][code][bibtex] Abstract. Face frontalization consists of synthesizing a frontally-viewed face from an arbitrarily-viewed one. The main contribution is a robust method that preserves non-rigid facial deformations, i.e….

Continue reading

Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement

by Mostafa Sadeghi, Xavier Alameda-Pineda IEEE Transactions on Signal Processing (TSP), 2021 [paper] [arXiv] Abstract. In this paper, we are interested in unsupervised (unknown noise) speech enhancement, where the probability distribution of the clean speech spectrogram is modeled via a latent variable generative model, also called the decoder. Recently, variational autoencoders (VAEs) have gained much popularity…

Continue reading
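As a companion to the abstract above, here is a hedged sketch of what such a latent-variable generative model of a speech spectrogram can look like: a decoder network mapping a per-frame latent vector to a nonnegative variance spectrum, together with the negative log-likelihood of the observed power spectrogram under a zero-mean complex Gaussian with that variance. The network shape, dimensions, and function names are assumptions made for illustration, not the paper's architecture or its mixture of inference networks.

```python
# Hedged sketch of a VAE-style speech decoder (assumed layout, not the
# paper's model). Each time frame has a latent vector z_n; the decoder
# outputs a variance sigma^2_f(z_n) for every frequency bin f.
import torch
import torch.nn as nn


class SpeechDecoder(nn.Module):
    """Maps a per-frame latent vector z to a nonnegative variance spectrum."""

    def __init__(self, z_dim=16, h_dim=128, n_freq=513):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.Tanh(),
            nn.Linear(h_dim, n_freq),
        )

    def forward(self, z):
        # exp(.) keeps the predicted variances strictly positive
        return torch.exp(self.net(z))


def spectrogram_nll(power_spec, variance):
    # Negative log-likelihood of |s_fn|^2 under a zero-mean complex Gaussian
    # with variance sigma^2_f(z_n), up to additive constants
    return torch.sum(power_spec / variance + torch.log(variance))
```

In unsupervised speech enhancement pipelines of this kind, such a decoder is typically pretrained on clean speech only and later combined with a separate noise model at test time, so that no paired noisy/clean data is needed.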