Ensimag, Education section, 2022

Multimodal speech synthesis - 5MMPLTVS

  • Number of hours
    • Lectures: 18.0
  • ECTS: 1.75


This course introduces speech technologies (speech coding, synthesis and recognition) that process the audible (acoustic signal) and visible (lip movements, etc.) consequences of the underlying articulatory movements (produced by the jaw, the tongue, the larynx, the velum, etc.). We first introduce the basic knowledge of physiology, phonetics, phonology and linguistics needed to understand the mechanisms underlying speech production, perception and comprehension. Next, we present the fundamentals of signal processing, representation and modeling. We then review current systems that enable spoken interaction embodied by anthropoids or virtual conversational agents.

Contact: Pascal PERRIER


• Multimodal speech production and perception
• Phonological structures of the world’s languages, with French as an example
• Phonetic representations and speech processing
• Text-to-speech systems and facial animation
• Audiovisual speech recognition
• Systems for situated verbal interaction




Written exam


Dutoit, T. (1997) An Introduction to Text-to-Speech Synthesis. Dordrecht/Boston/London: Kluwer Academic.
Parke, F.I. and K. Waters (1996) Computer Facial Animation. Wellesley, MA, USA: A.K. Peters.
O'Shaughnessy, D. (2000) Speech Communication: Human and Machine, 2nd edition. New York: IEEE Press.