Seminar by Thomas Hueber, CNRS, GIPSA-lab, Grenoble
Thursday 8 February, 10:00 – 11:00, room F107
INRIA Montbonnot Saint-Martin
Abstract. Propelled by progress in machine learning, speech technologies such as automatic speech recognition and text-to-speech synthesis have become advanced enough to be deployed in several consumer products and used in our daily lives. However, using our voice to interact with a machine is sometimes difficult, for instance when communicating in noisy environments, or unsuitable for confidentiality and privacy reasons. Moreover, some of these technologies often cannot be easily used by people with speech disorders. To tackle some of these issues, an increasing number of studies have proposed to develop speech technologies that do not exploit the acoustic signal of our voice but rather other physiological activities related to its production, such as articulatory gestures (lips, tongue, jaw, velum), the electrical activity of the face and neck muscles, or the electrical activity of the central or peripheral nervous system. This research field, which can be referred to as “biosignal-based speech processing”, lies at the intersection of various disciplines, ranging from engineering, speech sciences and machine learning to medicine, neuroscience and physiology. It generally pursues two main goals: 1) converting one of these physiological activities into an intelligible audio speech signal, which would make it possible to build prosthetic devices supplementing a defective speech production system or allowing one to “speak silently”; and 2) providing valuable biofeedback to speakers about their own voice production, in order to increase articulatory awareness in speech therapy or language learning. After introducing this research field, I will focus on two of my projects in line with these two goals. In the first project, on “silent speech interfaces”, articulatory movements are captured non-invasively using ultrasound and video imaging and are converted either into text (i.e. visual speech recognition) or directly, in real time, into an audible speech signal.
In the second project, on “visual biofeedback”, articulatory movements are estimated in real time from the audio speech signal of any user (after a short calibration step). They are displayed using a 3D talking head showing both the external (face, jaw, lips) and internal (tongue, velum) structures of the vocal apparatus. For both projects, I will describe the machine learning techniques we developed to model the statistical relationships between articulatory gestures and the audio speech signal, as well as experimental evaluations conducted in laboratory or clinical environments.
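To give a concrete flavour of the articulatory-to-acoustic mapping problem described above, here is a minimal, hypothetical sketch. The abstract does not specify the models used in these projects, so this example simply illustrates one classic baseline: a frame-wise ridge regression from articulatory feature vectors to acoustic feature vectors, trained on synthetic parallel data. All dimensions and the regression approach are assumptions for illustration; real systems use richer models (e.g., Gaussian mixture models or neural networks) and exploit temporal context.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions for illustration only: number of time frames,
# articulatory feature dimension (e.g., tongue/lip coordinates),
# and acoustic feature dimension (e.g., mel-cepstral coefficients).
n_frames, n_art, n_ac = 500, 12, 25

# Synthetic parallel corpus standing in for recorded articulatory-acoustic data
X = rng.standard_normal((n_frames, n_art))                     # articulatory frames
W_true = rng.standard_normal((n_art, n_ac))
Y = X @ W_true + 0.1 * rng.standard_normal((n_frames, n_ac))   # acoustic frames

# Ridge regression in closed form: W = (X^T X + lambda * I)^-1 X^T Y
lam = 1e-2
W = np.linalg.solve(X.T @ X + lam * np.eye(n_art), X.T @ Y)

# At synthesis time, each incoming articulatory frame is mapped to an
# acoustic frame, which a vocoder would then convert into audible speech.
Y_hat = X @ W
rmse = np.sqrt(np.mean((Y - Y_hat) ** 2))
print(f"frame-wise RMSE: {rmse:.3f}")
```

The closed-form solve keeps the sketch self-contained; in practice such a mapping would be evaluated on held-out speakers and followed by a vocoder stage to produce the audible signal.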