Learning Question-Guided Video Representation for Multi-Turn Video Question Answering

Guan-Lin Chao, Abhinav Rastogi, Semih Yavuz, Dilek Hakkani-Tür, Jindong Chen, Ian Lane
In SIGDIAL 2019 and the SIGIR 2019 Workshop on Conversational Interaction Systems
[bib] [pdf] [slides] [poster]

@inproceedings{chao2019learning,
title={Learning Question-Guided Video Representation for Multi-Turn Video Question Answering},
author={Chao, Guan-Lin and Rastogi, Abhinav and Yavuz, Semih and Hakkani-T{\"u}r, Dilek and Chen, Jindong and Lane, Ian},
booktitle={SIGDIAL},
year={2019}
}

Abstract
Understanding and conversing about dynamic scenes is one of the key capabilities of AI agents that navigate the environment and convey useful information to humans. Video question answering is a specific scenario of such AI-human interaction where an agent generates a natural language response to a question regarding the video of a dynamic scene. Incorporating features from multiple modalities, which often provide supplementary information, is one of the challenging aspects of video question answering. Furthermore, a question often concerns only a small segment of the video, hence encoding the entire video sequence using a recurrent neural network is not computationally efficient. Our proposed question-guided video representation module efficiently generates the token-level video summary guided by each word in the question. The learned representations are then fused with the question to generate the answer. Through empirical evaluation on the Audio Visual Scene-aware Dialog (AVSD) dataset, our proposed models in single-turn and multi-turn question answering achieve state-of-the-art performance on several automatic natural language generation evaluation metrics.
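
As a rough illustration of the idea in the abstract, the PyTorch sketch below (layer names and dimensions are my own assumptions, not the authors' exact architecture) lets each question token attend over per-frame video features to produce a token-level video summary, sidestepping a recurrent pass over the full frame sequence, and then fuses each summary back into its question token:

import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedVideoSummary(nn.Module):
    # Hypothetical sketch: each question token attends over per-frame video
    # features, yielding a token-level video summary without running an RNN
    # over the whole frame sequence. Names and sizes are illustrative only.
    def __init__(self, q_dim, v_dim, att_dim):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, att_dim)      # project question tokens
        self.v_proj = nn.Linear(v_dim, att_dim)      # project video frames
        self.fuse = nn.Linear(q_dim + v_dim, q_dim)  # fuse summary with token

    def forward(self, q_tokens, v_frames):
        # q_tokens: (B, T, q_dim) question token encodings
        # v_frames: (B, N, v_dim) per-frame video features
        scores = torch.bmm(self.q_proj(q_tokens),
                           self.v_proj(v_frames).transpose(1, 2))  # (B, T, N)
        attn = F.softmax(scores, dim=-1)     # each token's weights over frames
        summary = torch.bmm(attn, v_frames)  # (B, T, v_dim) token-level summary
        return torch.tanh(self.fuse(torch.cat([q_tokens, summary], dim=-1)))

In this reading, the fused token representations would feed the answer decoder in place of raw question encodings; the paper's actual fusion and decoding details may differ.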

DEEPCOPY: Grounded Response Generation with Hierarchical Pointer Networks

Semih Yavuz, Abhinav Rastogi, Guan-Lin Chao, Dilek Hakkani-Tür
In SIGDIAL 2019 and the NeurIPS 2018 Conversational AI Workshop (Best Paper)
[bib] [pdf]

@inproceedings{yavuz2019deepcopy,
title={{DEEPCOPY}: Grounded Response Generation with Hierarchical Pointer Networks},
author={Yavuz, Semih and Rastogi, Abhinav and Chao, Guan-Lin and Hakkani-T{\"u}r, Dilek},
booktitle={SIGDIAL},
year={2019}
}

Abstract
Recent advances in neural sequence-to-sequence models have led to promising results for several language generation-based tasks, including dialogue response generation, summarization, and machine translation. However, these models are known to have several problems, especially in the context of chit-chat based dialogue systems: they tend to generate short and dull responses that are often too generic. Furthermore, these models do not ground conversational responses on knowledge and facts, resulting in turns that are not accurate, informative, or engaging for the users. These are precisely the features that dialogue response generation models should be equipped with to serve in more realistic and useful conversational applications. Recently, several dialogue datasets accompanied by relevant external knowledge have been released to facilitate research into remedying the issues traditional models encounter by exploiting this additional information. In this paper, we propose and experiment with a series of response generation models for the general scenario where, in addition to the dialogue context, relevant unstructured external knowledge in the form of text is also assumed to be available for models to harness. Our approach extends pointer-generator networks by allowing the decoder to hierarchically attend and copy from external knowledge in addition to the dialogue context. We empirically show the effectiveness of the proposed model compared to several baselines, including on the CONVAI2 challenge.
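
To make the hierarchical attend-and-copy mechanism concrete, here is a minimal PyTorch sketch of one decoding step (function and weight names are my assumptions; the paper's exact formulation may differ): sentence-level attention over knowledge sentences rescales word-level attention within each sentence, and the combined mass is scattered onto vocabulary ids to form a copy distribution, which a pointer-generator-style gate would then mix with the generation softmax:

import torch
import torch.nn.functional as F

def hierarchical_copy_distribution(dec_state, sent_reprs, word_reprs, word_ids,
                                   vocab_size, W_sent, W_word):
    # Illustrative sketch, not the authors' exact equations.
    # dec_state:  (B, H)          decoder hidden state at this step
    # sent_reprs: (B, S, H)       one vector per knowledge sentence
    # word_reprs: (B, S, W, H)    word vectors within each sentence
    # word_ids:   (B, S, W) long  vocabulary ids of those words
    # W_sent, W_word: (H, H)      bilinear attention weights
    B, S, W, H = word_reprs.shape
    # Sentence-level attention over knowledge sentences.
    sent_scores = torch.einsum('bh,bsh->bs', dec_state @ W_sent, sent_reprs)
    sent_attn = F.softmax(sent_scores, dim=-1)                  # (B, S)
    # Word-level attention within each sentence.
    word_scores = torch.einsum('bh,bswh->bsw', dec_state @ W_word, word_reprs)
    word_attn = F.softmax(word_scores, dim=-1)                  # (B, S, W)
    # Hierarchical combination: joint probability of (sentence, word).
    copy_attn = (sent_attn.unsqueeze(-1) * word_attn).reshape(B, S * W)
    # Scatter attention mass onto vocabulary ids -> copy distribution.
    copy_dist = torch.zeros(B, vocab_size, device=dec_state.device)
    copy_dist.scatter_add_(1, word_ids.reshape(B, S * W), copy_attn)
    return copy_dist

A second copy distribution over the dialogue context could be built the same way, with a learned gate interpolating between generating from the vocabulary and copying from either source.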