Audio-Visual TED Corpus

Enhancing the TED-LIUM Corpus with Facial Information, Contextual Text and Object Recognition

Guan-Lin Chao*, Chih Chi Hu*, Bing Liu, John Paul Shen, Ian Lane
(*: equal contribution)
Electrical and Computer Engineering, Carnegie Mellon University


We present a set of new visual features extending the TED-LIUM corpus. We re-aligned the original TED talk audio transcriptions with the official videos. Using state-of-the-art models for face and facial landmark detection, optical character recognition, and object detection and classification, we extract four new visual features that can be used in Large-Vocabulary Continuous Speech Recognition (LVCSR) systems: facial images, facial landmarks, on-screen text, and objects in the scene. The facial images and landmarks can be combined with audio for audio-visual acoustic modeling, where the visual modality provides robust features in adverse acoustic environments. The contextual information, i.e., the extracted text and detected objects in the scene, can be used as prior knowledge to build contextual language models. Experimental results show the efficacy of using visual features on top of acoustic features for speech recognition in overlapping-speech scenarios.
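To make the structure of such an audio-visual corpus concrete, the sketch below shows one possible per-segment record combining a re-aligned transcript span with the four visual feature types the abstract names. The class and field names (`AlignedSegment`, `VisualFeatures`, `face_crop_path`, etc.) are illustrative assumptions, not the corpus's actual schema or file layout.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical record layout for one aligned talk segment; field names
# are illustrative, not the released corpus's actual schema.

@dataclass
class VisualFeatures:
    face_crop_path: str                   # path to the speaker's facial image crop
    landmarks: List[Tuple[float, float]]  # (x, y) facial landmark coordinates
    screen_text: List[str]                # OCR output, e.g. slide or caption text
    objects: List[str]                    # detected object labels in the scene

@dataclass
class AlignedSegment:
    talk_id: str
    start: float          # segment start time in seconds
    end: float            # segment end time in seconds
    transcript: str       # re-aligned transcription for this time span
    visual: VisualFeatures

# Example segment: the landmarks feed audio-visual acoustic modeling,
# while screen_text and objects supply context for language modeling.
seg = AlignedSegment(
    talk_id="TED0001",
    start=12.4,
    end=15.9,
    transcript="machine learning changes everything",
    visual=VisualFeatures(
        face_crop_path="TED0001/faces/000124.png",
        landmarks=[(0.31, 0.42), (0.58, 0.41)],
        screen_text=["Machine Learning"],
        objects=["person", "screen"],
    ),
)
print(seg.end - seg.start)  # segment duration in seconds
```

In a record like this, the acoustic model would consume the facial crops and landmarks alongside the audio, while a contextual language model would condition on `screen_text` and `objects`.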