Audio-visual diarization dataset now available for download

We have just made our new AVDIAR dataset publicly available. AVDIAR stands for “audio-visual diarization”. The dataset contains recordings of social gatherings made with two cameras and six microphones. Both the audio and the visual data were carefully annotated, so the dataset can be used to evaluate algorithms for tasks such as person tracking, speech-source localization, and speaker diarization. The AVDIAR dataset is used in the paper “Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion”.