Speaker: Théo Mariotte
Date and time: Feb 9, 2023, at 10:30
Abstract: Speaker diarization answers the question "Who spoke and when?" in an audio stream. Most diarization systems consist of two main steps: segmentation and clustering. The former relates to speaker activity and detects temporal boundaries in the signal; the latter groups segments that carry similar speaker information. This work focuses on speech segmentation, which can be addressed through three sub-tasks (illustrated at the frame level after the list):
- Voice Activity Detection (VAD), which detects speech segments while discarding non-speech ones (silence, noise, etc.),
- Overlapped Speech Detection (OSD), which detects segments where at least two speakers are simultaneously active,
- Speaker Change Detection (SCD), which detects the time instants at which the active speaker changes.
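To make the three sub-tasks concrete, the following minimal sketch derives frame-level VAD, OSD, and SCD targets from a binary speaker-activity matrix. The function name, shapes, and NumPy formulation are illustrative assumptions, not code from the talk.

```python
import numpy as np

def frame_labels(activity: np.ndarray):
    """Derive frame-level targets from a binary speaker-activity matrix.

    activity: (num_frames, num_speakers) array, 1 where a speaker is active.
    Returns VAD, OSD, and SCD targets, one value per frame.
    """
    n_active = activity.sum(axis=1)            # number of active speakers per frame
    vad = (n_active >= 1).astype(int)          # speech vs. non-speech
    osd = (n_active >= 2).astype(int)          # overlapped speech
    # SCD: frames where the set of active speakers differs from the previous frame
    scd = np.zeros_like(vad)
    scd[1:] = (activity[1:] != activity[:-1]).any(axis=1).astype(int)
    return vad, osd, scd
```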
These tasks have been extensively studied in the literature. However, most approaches focus on close-talk, clean speech data (e.g. broadcast news). Few works have addressed distant speech, as encountered in the meeting scenario. Under distant recording conditions, it is common practice to use devices composed of multiple microphones (microphone arrays). The resulting signal then consists of multiple channels, one per microphone. The spatial sampling performed by the microphones makes it possible to capture spatial information about the sound field.
This work explores the use of the Self Attention Channel Combinator (SACC), previously proposed in the literature, as a front-end for VAD and OSD. We show that this approach improves the overall performance in the distant speech scenario compared to standard MFCC features. Furthermore, the algorithm is extended to learn complex-valued weights in order to improve its interpretability.
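As a rough illustration of the idea behind a self-attention channel combinator, the sketch below computes attention weights across microphone channels from multi-channel STFT magnitudes and combines the channels into a single feature map. The class name, projection sizes, and pooling choices are assumptions for illustration; they do not reproduce the exact SACC architecture discussed in the talk.

```python
import torch
import torch.nn as nn

class ChannelCombinator(nn.Module):
    """Hedged sketch of a self-attention channel combinator.

    Input: multi-channel STFT magnitudes of shape (batch, channels, frames, freq).
    Per-channel query/key projections yield attention scores across channels;
    the softmax-normalized weights combine the channels into a single-channel
    representation that a VAD/OSD back-end could consume.
    """
    def __init__(self, n_freq: int, d_att: int = 128):
        super().__init__()
        self.query = nn.Linear(n_freq, d_att)
        self.key = nn.Linear(n_freq, d_att)

    def forward(self, mag: torch.Tensor) -> torch.Tensor:
        q = self.query(mag)                                   # (B, C, T, D)
        k = self.key(mag)                                     # (B, C, T, D)
        scores = (q * k).sum(dim=-1) / k.shape[-1] ** 0.5     # (B, C, T)
        weights = torch.softmax(scores, dim=1)                # normalize across channels
        combined = (weights.unsqueeze(-1) * mag).sum(dim=1)   # (B, T, F)
        return combined

# Usage: combine a 4-channel magnitude spectrogram into one feature map
x = torch.randn(2, 4, 200, 257).abs()
features = ChannelCombinator(n_freq=257)(x)   # shape (2, 200, 257)
```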
Although multi-channel front-ends improve VAD and OSD under distant conditions, the performance depends strongly on the number of channels available in the training data.
Training on a fixed array geometry may lead to severe performance degradation when the evaluation data feature a different configuration.
We therefore propose a training procedure that makes the model invariant to the number of channels. This approach shows better generalization properties.
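One simple way to expose a model to varying channel counts during training is to randomly keep a subset of the microphone channels at each step, as in the sketch below. This is an illustrative assumption about how channel-number invariance could be encouraged, not the exact procedure proposed in the talk; the function name and parameters are hypothetical.

```python
import torch

def sample_channel_subset(batch: torch.Tensor, min_ch: int = 2) -> torch.Tensor:
    """Randomly keep a subset of microphone channels for one training step.

    batch: (batch, channels, samples) multi-channel waveforms.
    Varying the number (and order) of channels seen during training can make
    the model less sensitive to the array configuration met at test time.
    """
    n_ch = batch.shape[1]
    k = torch.randint(min_ch, n_ch + 1, (1,)).item()   # how many channels to keep
    idx = torch.randperm(n_ch)[:k]                      # which channels, in random order
    return batch[:, idx, :]
```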