We release the AVTRACK-1 dataset: the audio-visual recordings used in our paper [1].
This dataset may be used for scientific purposes only.
The dataset is fully annotated with the image locations of the active speakers and of the other people present in the video. The annotated locations correspond to bounding boxes. Each person is given a unique anonymous identity in the form of a digit (1, 2, 3, …), and this identity is consistent throughout the entire video. The annotation is done in a semi-automatic fashion: a human annotated the bounding boxes on a small video segment, and a tracker was used to interpolate the bounding boxes in the remaining frames. Tracking drifts and failures were manually corrected. The active speaker is manually annotated by selecting which person is speaking at each video frame. The dataset also contains the sound source locations and upper-body detections used in our paper [1].
No personal or private data or metadata (first and last names, age, nationality, profession, gender, ethnicity, etc.) are included in this release.
This release contains four audio-visual sequences, described below:
| Sequence Name | Download Link | Comments |
| --- | --- | --- |
| CHAT | [CHAT (46Mb)] | Two, then three, people engaged in an informal dialogue. They wander around and turn their heads towards the active speaker; occasionally two people speak simultaneously, and the speakers do not always face the camera. |
| MS | [MS (17Mb)] | Two people who move around while always facing the camera. They take speech turns, with one short overlap. |
| SS | [SS (16Mb)] | Two people who are static while always facing the camera. Only one person speaks. |
| SS2 | [SS2 (13Mb)] | Two people who are static and very close to the camera. They always face the camera and take speech turns with no overlap. |
Download all sequences: [AVTRACK-1 (90Mb)]
If you use the dataset in a publication, please cite our ICCVW AVS paper [1].
The recordings were made with a dummy head equipped with four microphones and two cameras. The video from one of the cameras is available in this dataset release. The video is recorded at 25 FPS, and the audio is sampled at 44.1 kHz.
Great effort was put into synchronizing the audio and video: you can assume that the audio starts on the first video frame.
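Given the stated rates (25 FPS video, 44.1 kHz audio, audio starting on the first video frame), a short sketch can map a video frame index to its audio sample range. The function name and 0-based indexing are illustrative assumptions, not part of the release:

```python
FPS = 25     # video frame rate, as stated above
SR = 44100   # audio sample rate (44.1 kHz)

def frame_to_samples(frame_idx):
    """Return the (start, end) audio sample indices covered by a
    0-based video frame, assuming audio starts on frame 0."""
    samples_per_frame = SR // FPS   # 1764 samples per 40 ms frame
    start = frame_idx * samples_per_frame
    return start, start + samples_per_frame

print(frame_to_samples(0))   # (0, 1764)
print(frame_to_samples(25))  # frame at t = 1 s -> (44100, 45864)
```

Since 44100 is an exact multiple of 25, the per-frame sample count is an integer and no rounding is needed.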
The annotation file format is a CSV text file containing one object instance per line. Each line contains 10 values:
Four of the values represent the bounding box region (rectangle) on the video frame.
Three of the values can be ignored; they are reserved for future use: we will provide the 3D position of the object in a future release of the dataset.
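As a reading aid, here is a minimal parsing sketch. The column order below (frame index, person identity, then the bounding box) is an assumption for illustration only; consult the README.txt shipped with the dataset for the authoritative layout:

```python
import csv

def parse_annotations(lines):
    """Yield one dict per annotated object instance (one per CSV line).
    The field order here is an assumed, illustrative layout."""
    for row in csv.reader(lines):
        vals = [float(v) for v in row]
        yield {
            "frame": int(vals[0]),     # assumed: video frame index
            "id": int(vals[1]),        # assumed: anonymous person identity
            "bbox": tuple(vals[2:6]),  # assumed: (x, y, width, height)
            # remaining values: reserved for the future 3D position
        }
```

Usage: `list(parse_annotations(open("GtBbox.txt")))` would return one dict per annotated bounding box under these assumptions.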
Dataset File Structure
The folder structure for each recording is as follows:
README.txt: README file.
preview_video.mp4: a video with synchronized audio, for a quick preview.
audio.wav: the 4-channel audio file.
video.avi: the video file.
SpeakerBbox.txt: the bounding box(es) of the active speaker(s).
GtBbox.txt: the annotated bounding boxes of people's upper-body regions.
vad.txt: the per-frame voice activity detection result: 1 if voice is detected, 0 otherwise.
ssl_xx.txt: the sound source localization results. The xx in the file name refers to the window length, in video frames, used to perform localization. For example, if xx = 10, the window length is 10 * (1/25) s = 400 ms.
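As a small example of consuming the per-frame files, the sketch below summarizes vad.txt. It assumes one 0/1 flag per line (one line per video frame, as described above); the function name is illustrative:

```python
def speech_fraction(lines):
    """Return the fraction of video frames in which voice activity
    was detected, given vad.txt-style lines (one 0/1 flag per line)."""
    flags = [int(line.strip()) for line in lines if line.strip()]
    return sum(flags) / len(flags) if flags else 0.0
```

Usage: `with open("vad.txt") as f: print(speech_fraction(f))` would print, e.g., 0.75 if voice is detected in three out of four frames.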
Code and Script
A demo MATLAB script for visualizing the annotations is provided with the dataset.
If you use the dataset and would like your work to be cited here, just drop me a note.
[1] Israel D. Gebru, Silèye Ba, Georgios Evangelidis, and Radu Horaud. Tracking the Active Speaker Based on a Joint Audio-Visual Observation Model. In ICCV 2015 Workshop on 3D Reconstruction and Understanding with Video and Sound, 2015. Research Page
[2] Israel D. Gebru, Silèye Ba, Georgios Evangelidis, and Radu Horaud. Audio-Visual Speech-Turn Detection and Tracking. In Latent Variable Analysis and Signal Separation (LVA/ICA), 2015. Research Page
[3] Israel D. Gebru, Xavier Alameda-Pineda, Radu Horaud, and Florence Forbes. Audio-Visual Speaker Localization via Weighted Clustering. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2014. Research Page
[4] Israel D. Gebru, Xavier Alameda-Pineda, Florence Forbes, and Radu Horaud. EM Algorithms for Weighted-Data Clustering with Application to Audio-Visual Scene Analysis. arXiv preprint arXiv:1509.01509. Research Page