Multimodal coarticulation modeling

Speaker: Théo Biasutto-Lervat

Date and place: September 30, 2021, at 10:30 – Videoconference

Abstract: This thesis deals with neural network-based coarticulation modeling, and aims to synchronize the facial animation of a 3D talking head with speech. Predicting articulatory movements is not a trivial task: it is well known that the production of a phoneme is greatly affected by its phonetic context, a phenomenon called coarticulation. In this work we propose a coarticulation model, i.e. a model able to predict the spatial trajectories of the articulators from speech. We rely on a sequential model, recurrent neural networks, and more specifically Gated Recurrent Units (GRUs), which can treat the dynamics of articulation as a central component of the modeling. Unfortunately, the amount of data typically available in articulatory and audiovisual databases is quite small for a deep learning approach. To overcome this difficulty, we propose to integrate articulatory knowledge into the networks during their initialization. The robustness of RNNs allows us to apply our coarticulation model to predict both face and tongue movements, in French and German for the face, and in English and German for the tongue. Evaluation was conducted through objective measures on the predicted trajectories, and through experiments verifying that critical articulatory targets are fully reached. We also conducted a subjective evaluation to assess the perceptual quality of the predicted articulation once applied to our facial animation system. Finally, we analyzed the trained model to explore the phonetic knowledge it has learned.
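
To give a concrete picture of the kind of model the abstract describes, the sketch below shows a minimal GRU-based sequence model mapping a phonetic frame sequence to articulator trajectories. It is an illustration only, not the architecture from the thesis: the framework (PyTorch), the class name CoarticulationGRU, and all dimensions (vocabulary size, embedding and hidden sizes, number of articulator coordinates) are hypothetical choices made for this example.

```python
import torch
import torch.nn as nn


class CoarticulationGRU(nn.Module):
    """Minimal sketch: map a sequence of phoneme labels to articulator trajectories.
    All sizes are illustrative, not those used in the thesis."""

    def __init__(self, n_phonemes=50, emb_dim=64, hidden_dim=128, n_coords=12):
        super().__init__()
        # Embed discrete phoneme identities into a continuous space.
        self.embed = nn.Embedding(n_phonemes, emb_dim)
        # A bidirectional GRU lets each frame see both left and right phonetic
        # context, which is the essence of coarticulation.
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Linear readout to spatial coordinates of the articulators (lips, jaw, tongue, ...).
        self.out = nn.Linear(2 * hidden_dim, n_coords)

    def forward(self, phoneme_ids):
        # phoneme_ids: (batch, time) integer phoneme label per frame
        x = self.embed(phoneme_ids)   # (batch, time, emb_dim)
        h, _ = self.gru(x)            # (batch, time, 2 * hidden_dim)
        return self.out(h)            # (batch, time, n_coords)


if __name__ == "__main__":
    model = CoarticulationGRU()
    frames = torch.randint(0, 50, (2, 100))   # two dummy utterances of 100 frames
    trajectories = model(frames)               # predicted articulator positions
    print(trajectories.shape)                  # torch.Size([2, 100, 12])
```

In such a setup, the recurrent weights could in principle be initialized from articulatory knowledge rather than at random, in the spirit of the knowledge-based initialization mentioned in the abstract; the details of how that is done in the thesis are not reproduced here.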