Towards the prediction of the vocal tract shape from the sequence of phonemes to be articulated

Speaker: Vinicius Ribeiro

Date and place: July 6, 2021 at 10:30, by videoconference


Continuous speech is a dynamic, non-stationary process that requires the interaction of several articulators. Speech is produced essentially through rapid transitions between vocal tract configurations, so the articulation of each phoneme is highly context-dependent.
In this work, we address the prediction of the shape and position of the speech articulators over time from the sequence of phonemes to be articulated. We start from a set of real-time MRI sequences of utterances by a native French speaker. The contours of the vocal tract articulators were tracked automatically in every frame of the MRI videos. We then explore the capacity of a recurrent neural network to predict each articulator's shape and position given the sequence of phonemes and their durations.
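As an illustration of how a phoneme sequence with durations could be aligned with the MRI frames before being fed to a recurrent network, here is a minimal sketch. It is not the method used in this work: the function name `phonemes_to_frames`, the 50 frames-per-second rate, and the example phonemes and durations are all assumptions made for the example.

```python
def phonemes_to_frames(phonemes, durations, frame_rate=50.0):
    """Expand (phoneme, duration) pairs into one phoneme label per video frame.

    phonemes   : list of phoneme symbols, in articulation order
    durations  : list of durations in seconds, one per phoneme
    frame_rate : frames per second of the video (hypothetical value)
    """
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")

    frames = []
    elapsed = 0.0
    for phoneme, duration in zip(phonemes, durations):
        elapsed += duration
        # Round the cumulative end time, not each duration separately,
        # so that rounding errors do not accumulate along the utterance.
        end_frame = round(elapsed * frame_rate)
        frames.extend([phoneme] * (end_frame - len(frames)))
    return frames


# Hypothetical two-phoneme example: 0.1 s of "s" followed by 0.2 s of "a".
labels = phonemes_to_frames(["s", "a"], [0.1, 0.2], frame_rate=50.0)
print(len(labels), labels[:6])
```

Each frame-level label would then typically be mapped to an integer index or an embedding, giving the network one input per frame for which a contour has to be predicted.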
The presentation will be divided into two parts. First, I will present the results of our paper accepted at Interspeech 2021. In that work, we were limited to a few articulators in the upper part of the vocal tract and to a small dataset. In the second part, I will present our most recent results, obtained with a larger dataset and covering the prediction of the complete vocal tract shape.