Zenith seminar: 13/05/2019, 10:30 am
Campus Saint Priest, BAT5-02.124
Neural speaker diarization
Hervé Bredin (CNRS, LIMSI)
Speaker diarization is the task of determining “who speaks when” in an audio stream. It is an enabling technology for multiple downstream applications such as meeting transcription or indexing of ever-growing audio-visual archives.
Speaker diarization workflows usually consist of four consecutive tasks: speech activity detection, speaker change detection, speech turn clustering, and re-segmentation.
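To make the four-stage workflow concrete, here is a minimal toy sketch in NumPy. It is not the neural pipeline discussed in the talk: the function names, thresholds, and the simple distance-based heuristics are hypothetical placeholders standing in for trained models, and the final re-segmentation stage is omitted for brevity.

```python
import numpy as np

def speech_activity_detection(speech_scores, threshold=0.5):
    # Stage 1: frame-level speech / non-speech decision
    # (a neural model would produce speech_scores; here they are given).
    return speech_scores > threshold

def speaker_change_detection(embeddings, is_speech, threshold=1.0):
    # Stage 2: flag frames where the local embedding jumps,
    # suggesting a change of speaker within contiguous speech.
    changes = np.zeros(len(embeddings), dtype=bool)
    for t in range(1, len(embeddings)):
        if is_speech[t] and is_speech[t - 1]:
            jump = np.linalg.norm(embeddings[t] - embeddings[t - 1])
            changes[t] = jump > threshold
    return changes

def segment_turns(is_speech, changes):
    # Cut the stream into speech turns at non-speech gaps and change points.
    turns, start = [], None
    for t in range(len(is_speech)):
        if is_speech[t] and start is None:
            start = t
        elif is_speech[t] and changes[t]:
            turns.append((start, t))
            start = t
        elif not is_speech[t] and start is not None:
            turns.append((start, t))
            start = None
    if start is not None:
        turns.append((start, len(is_speech)))
    return turns

def cluster_turns(embeddings, turns, threshold=1.0):
    # Stage 3: greedy clustering of turn-level (mean) embeddings;
    # turns whose centroid distance is below the threshold share a label.
    labels, centroids = [], []
    for (s, e) in turns:
        emb = embeddings[s:e].mean(axis=0)
        dists = [np.linalg.norm(emb - c) for c in centroids]
        if dists and min(dists) < threshold:
            labels.append(int(np.argmin(dists)))
        else:
            centroids.append(emb)
            labels.append(len(centroids) - 1)
    return labels
```

Chaining the stages on a toy stream where one speaker talks, pauses, a second speaker talks, and the first returns yields three turns labelled speaker 0, speaker 1, speaker 0.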
Recent advances in deep learning have led to major improvements in multiple domains such as computer vision and natural language processing, and speaker diarization is no exception. In this talk, I will discuss our recent progress towards end-to-end neural speaker diarization, including speech and overlap detection with recurrent neural networks, and triplet loss for speaker embedding.
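The core idea of the triplet loss mentioned above can be sketched in a few lines of NumPy: given an anchor embedding, a positive from the same speaker, and a negative from a different speaker, the loss penalizes triplets where the negative is not at least a margin farther than the positive. This is a generic illustration, not the exact formulation, distance, or triplet-sampling strategy of the TristouNet paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Pull same-speaker embeddings together and push different-speaker
    # embeddings at least `margin` farther away (Euclidean distance).
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

# A well-separated negative incurs no loss:
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
n = np.array([1.0, 0.0])
loss = triplet_loss(a, p, n)  # max(0, 0.1 - 1.0 + 0.2) = 0.0
```

Minimizing this loss over many triplets shapes the embedding space so that distance between embeddings reflects speaker identity, which is what the clustering stage relies on.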
“Tristounet: Triplet Loss for Speaker Turn Embedding.” Bredin 2017. ICASSP.
“Speaker Change Detection in Broadcast TV Using Bidirectional Long Short-Term Memory Networks.” Yin 2017. Interspeech.
“Neural Speech Turn Segmentation and Affinity Propagation for Speaker Diarization.” Yin 2018. Interspeech.
pyannote.audio: neural building blocks for speaker diarization (speech activity detection, speaker change detection, speaker embedding). github.com/pyannote/pyannote-audio