[FILLED][Master Internship 2023] Multimodal Stuttering Detection Using Self-supervised Learning

General information

Duration: 6 months, starting date in February, March or April 2023 (flexible)
Location: Inria Nancy – Grand Est, team MULTISPEECH
Supervisors: Shakeel Ahmad Sheikh (shakeel-ahmad.sheikh@loria.fr) and Slim Ouni (slim.ouni@loria.fr)

Please apply by sending your CV and a short motivation letter directly to Shakeel Ahmad Sheikh and Slim Ouni.


Stuttering is a neurodevelopmental speech disorder that begins to appear when the neural connections supporting language, speech, and emotion are changing quickly [2]. In standard stuttering therapy sessions, speech pathologists or speech therapists manually examine and analyze the speech of persons who stutter (PWS), either live or from recordings. To treat the stuttering, the therapists carefully observe and monitor patterns in the speech utterances of PWS. However, this manual approach to stuttering detection is very time-consuming and strenuous, and it is biased by the subjective judgment of the speech-language therapist. It is therefore important to build interactive stuttering detection tools that provide an impartial, objective assessment and that can be used to tune and improve ASR-based virtual assistants for stuttered speech.

Deep learning has been used extensively in domains such as speech recognition [5] and emotion detection [1]; in the stuttering domain, however, its application remains limited. The acoustic cues embedded in the speech of PWS can be exploited by deep learning methods to detect stuttering. Most existing stuttering detection techniques use spectral features such as spectrograms and MFCCs as the input representation of stuttered speech [12, 11, 3]. The most common problem in the stuttering domain is the lack of data. There are only a few stuttering datasets, such as UCLASS, FluencyBank, and SEP28K [3], and they are small, containing only a few dozen speakers. While deep learning methods have shown substantial gains in domains such as ASR, speaker verification, and emotion detection, the improvement in stuttering detection has been very limited, most likely because of the small size of the datasets.
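To make the spectral input representation mentioned above more concrete, the following is a minimal sketch of how a magnitude spectrogram can be computed from a waveform: the signal is cut into overlapping frames, each frame is windowed, and a discrete Fourier transform is taken. The function names, the frame length, the hop size, and the 16 kHz sampling rate are illustrative assumptions, not settings from the cited work, and a real pipeline would use an FFT library and typically add mel filtering (for MFCCs).

```python
import cmath
import math


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrogram(signal, frame_len=64, hop=32):
    """Split the signal into overlapping frames and return DFT magnitudes.

    Returns one list per frame; each holds frame_len // 2 + 1 magnitudes
    (the non-negative frequency bins).
    """
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT for clarity; an FFT would be used in practice.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames


# Toy input (not speech): a pure tone placed exactly on bin 8 of a 64-point DFT.
SAMPLE_RATE = 16000  # assumed sampling rate, chosen for illustration only
tone_hz = 8 * SAMPLE_RATE / 64
signal = [math.sin(2 * math.pi * tone_hz * t / SAMPLE_RATE) for t in range(512)]

spec = magnitude_spectrogram(signal)
# Index of the strongest frequency bin in each frame.
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in spec]
```

Each row of `spec` is one time frame and each column one frequency bin; a stuttering classifier that consumes spectral features would take such a time-frequency matrix (usually on a log or mel scale) as input.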

The common strategy for training on small datasets is transfer learning, in which a model pretrained on an auxiliary task with a large dataset is used to improve performance on the target task, for which data is scarce. The pretrained model can be fine-tuned by re-training or replacing some of its last layers, or it can be used as a feature extractor for the target task. Transfer learning has been explored in fields such as ASR and emotion detection [8]. Recently, self-supervised learning has shown significant improvement in stuttering detection [11, 16, 17, 18].
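The feature-extractor variant of transfer learning described above can be sketched as follows. This is a toy illustration only: the "pretrained encoder" is a hypothetical stand-in (a fixed random projection, not an actual pretrained speech model), the data and labels are synthetic, and the names `encode` and `train_head` are invented for this example. The point it shows is that the encoder weights stay frozen while only a small classification head is trained on the extracted features.

```python
import math
import random

random.seed(0)

# Hypothetical stand-in for a pretrained encoder: a fixed projection whose
# weights are never updated. In practice this would be a pretrained model.
IN_DIM, FEAT_DIM = 4, 3
ENCODER_W = [[random.uniform(-1, 1) for _ in range(IN_DIM)] for _ in range(FEAT_DIM)]


def encode(x):
    """Frozen feature extractor: tanh of a fixed linear projection."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in ENCODER_W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head on frozen features by gradient descent."""
    w = [0.0] * FEAT_DIM
    b = 0.0
    losses = []
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * FEAT_DIM
        grad_b = 0.0
        loss = 0.0
        for f, y in zip(features, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            # Binary cross-entropy; the small constant guards against log(0).
            loss -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
            for j in range(FEAT_DIM):
                grad_w[j] += (p - y) * f[j]
            grad_b += p - y
        # Only the head parameters (w, b) are updated; ENCODER_W is untouched.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
        losses.append(loss / n)
    return w, b, losses


# Synthetic toy data (not stuttering data): 40 random inputs with binary labels.
inputs = [[random.uniform(-1, 1) for _ in range(IN_DIM)] for _ in range(40)]
labels = [1 if x[0] + x[1] > 0 else 0 for x in inputs]

encoder_before = [row[:] for row in ENCODER_W]
features = [encode(x) for x in inputs]
head_w, head_b, losses = train_head(features, labels)
```

Fine-tuning differs from this sketch in that some or all of the encoder weights would also receive gradient updates, which generally needs more care when the target dataset is small.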

Multimodal Stuttering Detection

Stuttering can be characterized as an audio-visual problem. Cues are present both in the visual modality (e.g., head nodding, lip tremors, quick eye blinks, and unusual lip shapes) and in the audio modality [4]. A multimodal learning paradigm could help in learning robust stutter-specific hidden representations across modalities, and could also help in building robust automatic stuttering detection systems. Self-supervised learning can also be exploited to capture acoustic stutter-specific representations guided by video frames. The framework proposed by Shukla et al. [14] could be helpful in learning stutter-specific features from audio signals guided by visual frames, or vice versa. Altinkaya and Smeulders [15] recently presented the first audio-visual stuttered speech dataset, which consists of 25 speakers (14 male, 11 female). They trained a ResNet-based RNN (gated recurrent unit) on the audio-visual modality to detect the block type of stuttering. The main idea of this internship is to further explore the impact of self-supervised learning on stuttering detection in an audio-visual setup. The goal of the proposed study is to develop and evaluate audio-visual, self-supervised stuttering detection classifiers that are able to distinguish among several stutter classes.

  1. Objective 1: Literature survey of the existing work in stuttering detection.
  2. Objective 2: Developing a pre-trained stuttering classifier based on self-supervised learning. Some initial experiments would be carried out. We would explore self-supervised models such as wav2vec 2.0, a modified version of wav2vec [9], and variants such as UniSpeech and HuBERT. We would use wav2vec 2.0 either as a feature extractor or fine-tune it by replacing the last few layers and adapting it for stuttering detection. The experiments would be carried out on the newly developed French stuttering dataset.
  3. Objective 3: Carrying out the actual experiments and analyzing the impact of fine-tuning and pre-trained features on the raw audio-visual samples of stuttered speech.


  • [1]  Akçay, Mehmet Berkehan, and Kaya Oğuz. “Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers.” Speech Communication, 116 (2020), pp. 56-76.
  • [2]  Smith, Anne, and Christine Weber. “How stuttering develops: The multifactorial dynamic pathways theory.” Journal of Speech, Language, and Hearing Research, 60 (2017), pp. 2483–2505.
  • [3]  Shakeel A. Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni. “Machine learning for stuttering identification: Review, challenges and future directions.” Neurocomputing, 514 (2022), pp. 385-402.
  • [4]  Guitar, Barry. Stuttering: An integrated approach to its nature and treatment. Lippincott Williams & Wilkins, 2013.
  • [5]  A. B. Nassif, I. Shahin, I. Attili, M. Azzeh and K. Shaalan, “Speech Recognition Using Deep neural networks: A systematic review,” IEEE Access, vol. 7, pp. 19143-19165, 2019.
  • [6]  Latif, Siddique, Rajib Rana, Sara Khalifa, Raja Jurdak, Junaid Qadir, and Björn W. Schuller. “Deep representation learning in speech processing: Challenges, recent advances, and future trends.” arXiv preprint arXiv:2001.00378 (2020).
  • [7]  Ning, Y., He, S., Wu, Z., Xing, C. and Zhang, L.J., 2019. A review of deep learning based speech synthesis. Applied Sciences, 9(19), p.4050.
  • [8]  Wang, Yingzhi, Abdelmoumene Boumadane, and Abdelwahab Heba. “A fine-tuned wav2vec 2.0/hubert benchmark for speech emotion recognition, speaker verification and spoken language understanding.” arXiv preprint arXiv:2111.02735 (2021).
  • [9]  Baevski, Alexei, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. “wav2vec 2.0: A framework for self-supervised learning of speech representations.” Advances in Neural Information Processing Systems, 33 (2020): 12449-12460.
  • [10]  Lea, Colin, Vikramjit Mitra, Aparna Joshi, Sachin Kajarekar, and Jeffrey P. Bigham. “Sep-28k: A dataset for stuttering event detection from podcasts with people who stutter.” In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6798-6802. IEEE, 2021.
  • [11]  Sheikh, Shakeel A., Md Sahidullah, Slim Ouni, and Fabrice Hirsch. “End-to-End and Self-supervised learning for ComParE 2022 stuttering sub-challenge.” In Proceedings of the 30th ACM International Conference on Multimedia, pp. 7104-7108. 2022.
  • [12]  Sheikh, Shakeel A., Md Sahidullah, Fabrice Hirsch, and Slim Ouni. “Robust stuttering detection via multi-task and adversarial learning.” In 2022 30th European Signal Processing Conference (EUSIPCO), pp. 190-194. IEEE, 2022.
  • [13]  Ngiam, Jiquan, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. “Multimodal deep learning.” In ICML. 2011.
  • [14]  Shukla, Abhinav, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, and Maja Pantic. “Visually guided self supervised learning of speech representations.” In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6299-6303. IEEE, 2020.
  • [15]  Altinkaya, Mehmet, and Arnold WM Smeulders. “A dynamic, self supervised, large scale audiovisual dataset for stuttered speech.” In Proceedings of the 1st International Workshop on Multimodal Conversational AI, pp. 9-13. 2020.
  • [16]  Mohapatra, Payal, Akash Pandey, Bashima Islam, and Qi Zhu. “Speech disfluency detection with contextual representation and data distillation.” In Proceedings of the 1st ACM International Workshop on Intelligent Acoustic Systems and Applications, pp. 19-24. 2022.
  • [17]  Grósz, Tamás, Dejan Porjazovski, Yaroslav Getman, Sudarsana Kadiri, and Mikko Kurimo. “Wav2vec2-based Paralinguistic Systems to Recognise Vocalised Emotions and Stuttering.” In Proceedings of the 30th ACM International Conference on Multimedia, pp. 7026-7029. 2022.
  • [18]  Bayerl, Sebastian P., Dominik Wagner, Elmar Nöth, and Korbinian Riedhammer. “Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0.” arXiv preprint arXiv:2204.03417 (2022).