Return to Software

COMPRISE Weakly Supervised Speech-to-Text

An open source tool for training STT models.


COMPRISE Weakly Supervised Speech-to-Text (STT) provides components for semi-supervised and weakly supervised training of STT Acoustic Models (AMs) and Language Models (LMs), thus reducing the need for time-consuming and expensive manual transcription of speech data. It consists of three components.

  • Error detection driven training component: processes untranscribed speech data to obtain automatic transcripts along with tags representing reliable and unreliable regions, and finally performs semi-supervised training of STT models.
  • Dialogue state based training component: exploits weak supervision from utterance level dialogue state labels to obtain better automatic transcriptions on untranscribed speech data, and hence better STT models.
  • Confusion network based language model training component: trains statistical n-gram LMs and Recurrent Neural Network (RNN) LMs using STT confusion networks obtained from untranscribed speech data.

Example of “Error detection driven semi-supervised AM training. Dashed arrows indicate ‘use for training’

Readers interested in the high-level design and experimental evaluation of these components can refer to COMPRISE deliverables D4.2 and D4.4.


  • Semi-supervised and weakly supervised training of STT models
  • Readily supports the Kaldi speech recognition toolkit
  • Examples for training with state-of-the-art TDNN Chain AMs and LF-MMI
  • Methods that generalise to other types of STT AMs and LMs
  • Support to train RNN/LSTM/GRU LMs in Pytorch and use with Kaldi
  • Tools to train STT error detection models
  • Learn n-gram and RNN LMs from STT confusion networks
  • Interpolated expected KN smoothing for training n-gram LMs from fractional counts


  • Kaldi toolkit
  • Python 3.X
  • Numpy 1.19
  • PyTorch 1.5.1

Recommended usage in combination with the COMPRISE Cloud Platform.

Documentation & Download

          Installation & Usage guide (video)

         Functionality (video)

         Code and documentation (Gitlab)

For scientific details and experimental results, refer to the following paper: I. Sheikh, E. Vincent, I. Illina, “On semi-supervised LF-MMI training of acoustic models with limited data“, in Interspeech, 2020.