Speaker: Rémi Blandin from TU Dresden
Date and place: February 18, 2021 at 10:30, by videoconference
Abstract:
Speech sounds are produced by multiple complex physical phenomena such as fluid-structure interaction or turbulent flow. Greatly simplified descriptions of these phenomena are used to simulate speech production. As an example, the vocal tract (the air volume from the throat to the mouth) is often simply represented as a concatenation of straight tubes with varying diameter. These simplifications allow one to synthesize speech sounds with relatively short computation times. Such physics-based synthesis can rely on articulatory models, which describe how the vocal tract shape varies in time with the articulatory movements (e.g. when we move our tongue, jaw or lips). This is called articulatory synthesis.
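To make the tube concatenation idea concrete, here is a minimal sketch, not the speaker's implementation. It assumes lossless plane-wave propagation, a closed glottis, and an ideal open end at the lips (zero acoustic pressure). The function names, the air constants and the example geometry are all illustrative assumptions. Each tube section is described by a 2x2 chain matrix relating pressure and volume velocity at its two ends, and resonances are located where the volume-velocity transfer function peaks.

```python
import numpy as np

RHO = 1.2   # air density in kg/m^3 (assumed value)
C = 350.0   # speed of sound in m/s (assumed value)


def tube_chain_matrix(length, area, freq):
    """Chain matrix of one lossless cylindrical tube (plane waves only)."""
    k = 2.0 * np.pi * freq / C   # wavenumber
    zc = RHO * C / area          # characteristic impedance of the section
    return np.array([
        [np.cos(k * length), 1j * zc * np.sin(k * length)],
        [1j * np.sin(k * length) / zc, np.cos(k * length)],
    ])


def inverse_gain(lengths, areas, freqs):
    """Return |u_glottis / u_lips| for tubes ordered from glottis to lips.

    With zero pressure at the lips, this ratio is the lower-right entry
    of the product of the chain matrices.
    """
    d = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        m = np.eye(2, dtype=complex)
        for length, area in zip(lengths, areas):
            m = m @ tube_chain_matrix(length, area, f)
        d[i] = abs(m[1, 1])
    return d


def resonance_frequencies(lengths, areas, freqs):
    """Resonances are the local minima of the inverse gain."""
    d = inverse_gain(lengths, areas, freqs)
    idx = np.where((d[1:-1] < d[:-2]) & (d[1:-1] <= d[2:]))[0] + 1
    return freqs[idx]


# Hypothetical example: a uniform 17.5 cm tract split into four sections.
freqs = np.arange(100.0, 3000.0, 1.0)
uniform = resonance_frequencies([0.175 / 4] * 4, [3e-4] * 4, freqs)

# Same total length, narrower at the glottis end and wider at the lips.
shaped = resonance_frequencies([0.175 / 4] * 4, [1e-4, 1e-4, 6e-4, 6e-4], freqs)
```

For the uniform tube, the resonances fall at the quarter-wavelength frequencies, odd multiples of c/(4L), which is 500 Hz with the values assumed here. Changing the section areas shifts them, which is how this kind of model distinguishes vowels.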
However, over-simplifying is not always good. As an example, the tube concatenation representation of the vocal tract does not work well above 4-5 kHz, and gives inaccurate resonance frequencies below that range. This can reduce the naturalness and the intelligibility of the synthesis. In addition, studying the perception of high frequencies in speech requires acoustic simulations that are valid at high frequencies. More complex acoustic models are necessary to cover the full audible range and to improve precision at low frequencies, but they require much more computation time. This is a strong limitation for applications such as speech synthesis or medical research.
One way to reduce the computation time is to take advantage of the elongated shape of the vocal tract, which allows one to express the solution of the wave equation in the basis of the local propagation modes. This is called the multimodal method. Integrated in an articulatory synthesizer, it has the potential to generate speech sounds with an acoustic model that is valid over the whole audible frequency range. This is the aim of my postdoctoral project.
In a long-term research project, I intend to further develop solutions to reduce the computational cost of the physical modelling of speech. This would involve exploring other methods, such as combining machine learning with physical simulation, as well as making implementations of physical speech models more flexible and accessible to non-specialists.