Audio-visual tracking, speaker diarization and speech recognition

This video summarizes some of the work carried out by the Perception team in 2018. It shows multiple-person tracking, audio-source localization, audio-visual alignment, and speaker diarization, as well as a complete pipeline that assigns speech segments to persons and performs speech recognition.

Acknowledgments: Work funded by the European Union under the ERC Advanced Grant VHIA and the ERC Proof of Concept VHIALab.