Learning Question-Guided Video Representation for Multi-Turn Video Question Answering

Guan-Lin Chao, Abhinav Rastogi, Semih Yavuz, Dilek Hakkani-Tür, Jindong Chen, Ian Lane
In SIGDIAL 2019 and SIGIR 2019 Workshop on Conversational Interaction Systems
[bib] [pdf] [slides] [poster]

@inproceedings{chao2019learning,
title={Learning Question-Guided Video Representation for Multi-Turn Video Question Answering},
author={Chao, Guan-Lin and Rastogi, Abhinav and Yavuz, Semih and Hakkani-T{\"u}r, Dilek and Chen, Jindong and Lane, Ian},
booktitle={SIGDIAL},
year={2019}
}

Abstract
Understanding and conversing about dynamic scenes is one of the key capabilities of AI agents that navigate the environment and convey useful information to humans. Video question answering is a specific scenario of such AI-human interaction where an agent generates a natural language response to a question regarding the video of a dynamic scene. Incorporating features from multiple modalities, which often provide supplementary information, is one of the challenging aspects of video question answering. Furthermore, a question often concerns only a small segment of the video, hence encoding the entire video sequence using a recurrent neural network is not computationally efficient. Our proposed question-guided video representation module efficiently generates the token-level video summary guided by each word in the question. The learned representations are then fused with the question to generate the answer. Through empirical evaluation on the Audio Visual Scene-aware Dialog (AVSD) dataset, our proposed models in single-turn and multi-turn question answering achieve state-of-the-art performance on several automatic natural language generation evaluation metrics.

BERT-DST: Scalable End-to-End Dialogue State Tracking with Bidirectional Encoder Representations from Transformer

Guan-Lin Chao, Ian Lane
In INTERSPEECH 2019
[bib] [pdf] [code] [slides]

@inproceedings{chao2019bert,
title={{BERT-DST}: Scalable End-to-End Dialogue State Tracking with Bidirectional Encoder Representations from Transformer},
author={Chao, Guan-Lin and Lane, Ian},
booktitle={INTERSPEECH},
year={2019}
}

Abstract
An important yet rarely tackled problem in dialogue state tracking (DST) is scalability for dynamic ontology (e.g., movie, restaurant) and unseen slot values. We focus on a specific condition, where the ontology is unknown to the state tracker, but the target slot value (except for none and dontcare), possibly unseen during training, can be found as a word segment in the dialogue context. Prior approaches often rely on candidate generation from n-gram enumeration or slot tagger outputs, which can be inefficient or suffer from error propagation. We propose BERT-DST, an end-to-end dialogue state tracker which directly extracts slot values from the dialogue context. We use BERT as the dialogue context encoder, whose contextualized language representations are suitable for scalable DST to identify slot values from their semantic context. Furthermore, we employ encoder parameter sharing across all slots with two advantages: (1) the number of parameters does not grow linearly with the ontology; (2) language representation knowledge can be transferred among slots. Empirical evaluation shows BERT-DST with cross-slot parameter sharing outperforms prior work on the benchmark scalable DST datasets Sim-M and Sim-R, and achieves competitive performance on the standard DSTC2 and WOZ 2.0 datasets.

Audio-Visual TED Corpus: Enhancing the TED-LIUM Corpus with Facial Information, Contextual Text and Object Recognition

Guan-Lin Chao, Chih Chi Hu, Bing Liu, John Paul Shen, Ian Lane
In UbiComp 2019 Workshop on Continual and Multimodal Learning for Internet of Things
[bib] [pdf]

@inproceedings{chao2019audio,
title={Audio-Visual {TED} Corpus: Enhancing the {TED-LIUM} Corpus with Facial Information, Contextual Text and Object Recognition},
author={Chao, Guan-Lin and Hu, Chih Chi and Liu, Bing and Shen, John Paul and Lane, Ian},
booktitle={UbiComp Workshop on Continual and Multimodal Learning for Internet of Things},
year={2019}
}

Abstract
We present a variety of new visual features in extension to the TED-LIUM corpus. We re-aligned the original TED talk audio transcriptions with official TED.com videos. By utilizing state-of-the-art models for face and facial landmark detection, optical character recognition, object detection and classification, we extract four new visual features that can be used for Large-Vocabulary Continuous Speech Recognition (LVCSR) systems, including facial images, landmarks, text, and objects in the scenes. The facial images and landmarks can be used in combination with audio for audio-visual acoustic modeling where the visual modality provides robust features in adverse acoustic environments. The contextual information, i.e., extracted text and detected objects in the scene, can be used as prior knowledge to create contextual language models. Experimental results showed the efficacy of using visual features on top of acoustic features for speech recognition in overlapping speech scenarios.

Deep Speaker Embedding for Speaker-Targeted Automatic Speech Recognition

Guan-Lin Chao, John Paul Shen, Ian Lane
In International Conference on Natural Language Processing and Information Retrieval (NLPIR) 2019 (Best Paper)
[bib] [pdf]

@inproceedings{chao2019deep,
title={Deep Speaker Embedding for Speaker-Targeted Automatic Speech Recognition},
author={Chao, Guan-Lin and Shen, John Paul and Lane, Ian},
booktitle={International Conference on Natural Language Processing and Information Retrieval (NLPIR)},
year={2019}
}

Abstract
In this work, we investigate three types of deep speaker embedding as text-independent features for speaker-targeted speech recognition in cocktail party environments. The text-independent speaker embedding is extracted from the target speaker’s existing speech segment (i-vector and x-vector) or face image (f-vector), which is concatenated with acoustic features of any new speech utterances as input features. Since the proposed model extracts the speaker embedding of the target speaker once and for all, it is computationally more efficient than many prior approaches which estimate the target speaker’s characteristics on the fly. Empirical evaluation shows that using speaker embedding along with acoustic features improves Word Error Rate over the audio-only model, from 65.7% to 29.5%. Among the three types of speaker embedding, x-vector and f-vector show robustness against environment variations while i-vector tends to overfit to the specific speaker and environment condition.

DEEPCOPY: Grounded Response Generation with Hierarchical Pointer Networks

Semih Yavuz, Abhinav Rastogi, Guan-Lin Chao, Dilek Hakkani-Tür
In SIGDIAL 2019 and NeurIPS 2018 Conversational AI Workshop (Best Paper)
[bib] [pdf]

@inproceedings{yavuz2019deepcopy,
title={{DEEPCOPY}: Grounded Response Generation with Hierarchical Pointer Networks},
author={Yavuz, Semih and Rastogi, Abhinav and Chao, Guan-Lin and Hakkani-T{\"u}r, Dilek},
booktitle={SIGDIAL},
year={2019}
}

Abstract
Recent advances in neural sequence-to-sequence models have led to promising results for several language generation-based tasks, including dialogue response generation, summarization, and machine translation. However, these models are known to have several problems, especially in the context of chit-chat based dialogue systems: they tend to generate short and dull responses that are often too generic. Furthermore, these models do not ground conversational responses on knowledge and facts, resulting in turns that are not accurate, informative and engaging for the users. These indeed are the essential features that dialogue response generation models should be equipped with to serve in more realistic and useful conversational applications. Recently, several dialogue datasets accompanied with relevant external knowledge have been released to facilitate research into remedying such issues encountered by traditional models by resorting to this additional information. In this paper, we propose and experiment with a series of response generation models that aim to serve in the general scenario where, in addition to the dialogue context, relevant unstructured external knowledge in the form of text is also assumed to be available for models to harness. Our approach extends pointer-generator networks by allowing the decoder to hierarchically attend and copy from external knowledge in addition to the dialogue context. We empirically show the effectiveness of the proposed model compared to several baselines, including on the CONVAI2 challenge.

Speaker-Targeted Audio-Visual Models for Speech Recognition in Cocktail-Party Environments

Guan-Lin Chao, William Chan, Ian Lane
In INTERSPEECH 2016
[bib] [pdf] [slides]

@inproceedings{chao2016speaker,
title={Speaker-Targeted Audio-Visual Models for Speech Recognition in Cocktail-Party Environments},
author={Chao, Guan-Lin and Chan, William and Lane, Ian},
booktitle={INTERSPEECH},
year={2016}
}

Abstract
Speech recognition in cocktail-party environments remains a significant challenge for state-of-the-art speech recognition systems, as it is extremely difficult to extract an acoustic signal of an individual speaker from a background of overlapping speech with similar frequency and temporal characteristics. We propose the use of speaker-targeted acoustic and audio-visual models for this task. We complement the acoustic features in a hybrid DNN-HMM model with information about the target speaker’s identity as well as visual features from the mouth region of the target speaker. Experimentation was performed using simulated cocktail-party data generated from the GRID audio-visual corpus by overlapping two speakers’ speech on a single acoustic channel. Our audio-only baseline achieved a WER of 26.3%. The audio-visual model improved the WER to 4.4%. Introducing speaker identity information had an even more pronounced effect, improving the WER to 3.6%. Combining both approaches, however, did not significantly improve performance further. Our work demonstrates that speaker-targeted models can significantly improve speech recognition in cocktail-party environments.

City-Identification of Flickr Videos Using Semantic Acoustic Features

Benjamin Elizalde, Guan-Lin Chao, Ming Zeng, Ian Lane
In IEEE International Conference on Multimedia Big Data (BigMM) 2016
[bib] [pdf]

@inproceedings{elizalde2016city,
title={City-Identification of {Flickr} Videos Using Semantic Acoustic Features},
author={Elizalde, Benjamin and Chao, Guan-Lin and Zeng, Ming and Lane, Ian},
booktitle={International Conference on Multimedia Big Data (BigMM)},
year={2016}
}

Abstract
City-identification of videos aims to determine the likelihood of a video belonging to a set of cities. In this paper, we present an approach using only audio; we do not use any additional modality such as images, user-tags or geo-tags. In this manner, we show to what extent the city-location of videos correlates with their acoustic information. Success in this task suggests improvements can be made to complement the other modalities. In particular, we present a method to compute and use semantic acoustic features to perform city-identification, and the features show semantic evidence of the identification. The semantic evidence is given by a taxonomy of urban sounds and expresses the potential presence of these sounds in the city-soundtracks. We used the MediaEval Placing Task set, which contains Flickr videos labeled by city. In addition, we used the UrbanSound8K set containing audio clips labeled by sound-type. Our method improved the state-of-the-art performance and provides a novel semantic approach to this task.