Audio-Visual TED Corpus: Enhancing the TED-LIUM Corpus with Facial Information, Contextual Text and Object Recognition

Guan-Lin Chao, Chih Chi Hu, Bing Liu, John Paul Shen, Ian Lane
In UbiComp 2019 Workshop on Continual and Multimodal Learning for Internet of Things
[bib] [pdf]

@inproceedings{chao2019audio,
  title     = {Audio-Visual {TED} Corpus: Enhancing the {TED-LIUM} Corpus with Facial Information, Contextual Text and Object Recognition},
  author    = {Chao, Guan-Lin and Hu, Chih Chi and Liu, Bing and Shen, John Paul and Lane, Ian},
  booktitle = {UbiComp Workshop on Continual and Multimodal Learning for Internet of Things},
  year      = {2019}
}

Abstract
We present a variety of new visual features as an extension to the TED-LIUM corpus. We re-aligned the original TED talk audio transcriptions with the official TED.com videos. Using state-of-the-art models for face and facial landmark detection, optical character recognition, and object detection and classification, we extract four new visual features that can be used in Large-Vocabulary Continuous Speech Recognition (LVCSR) systems: facial images, facial landmarks, text, and objects in the scene. The facial images and landmarks can be combined with audio for audio-visual acoustic modeling, where the visual modality provides robust features in adverse acoustic environments. The contextual information, i.e., the text and objects detected in the scene, can be used as prior knowledge to build contextual language models. Experimental results show the efficacy of adding visual features on top of acoustic features for speech recognition in overlapping-speech scenarios.
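
To illustrate the kind of per-frame extraction described in the abstract, the sketch below samples frames from a talk video and pulls out two of the four feature types: face bounding boxes and on-screen text. It is not the authors' pipeline; OpenCV's Haar cascade detector and Tesseract OCR stand in for the state-of-the-art models used in the paper, and the video path and frame stride are placeholder assumptions.

# Minimal sketch (not the released extraction code): sample frames from a video
# and extract face regions plus on-screen text as examples of the corpus features.
import cv2
import pytesseract

def extract_visual_features(video_path, frame_stride=30):
    """Yield (frame_index, face_boxes, ocr_text) for sampled frames of a video."""
    # Stand-in face detector; the paper uses stronger, model-based detectors.
    face_detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    frame_index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_index % frame_stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Face bounding boxes (x, y, w, h); crops of these regions correspond
            # to the "facial images" feature for audio-visual acoustic modeling.
            faces = face_detector.detectMultiScale(
                gray, scaleFactor=1.1, minNeighbors=5)
            # On-screen text (e.g. slide content), usable as contextual prior
            # knowledge for a contextual language model.
            text = pytesseract.image_to_string(gray).strip()
            yield frame_index, faces, text
        frame_index += 1
    cap.release()

if __name__ == "__main__":
    # "ted_talk.mp4" is a hypothetical local path, not part of the corpus release.
    for idx, faces, text in extract_visual_features("ted_talk.mp4"):
        print(idx, len(faces), text[:40])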