DEEPCOPY: Grounded Response Generation with Hierarchical Pointer Networks

Semih Yavuz, Abhinav Rastogi, Guan-Lin Chao, Dilek Hakkani-Tür
In SIGDIAL 2019 and NeurIPS 2018 Conversational AI Workshop (Best Paper)

Recent advances in neural sequence-to-sequence models have led to promising results for several language generation tasks, including dialogue response generation, summarization, and machine translation. However, these models are known to have several problems, especially in chit-chat dialogue systems: they tend to generate short, dull responses that are often too generic. Furthermore, these models do not ground conversational responses in knowledge and facts, resulting in turns that are not accurate, informative, or engaging for users. These are essential qualities that dialogue response generation models need in order to serve in more realistic and useful conversational applications. Recently, several dialogue datasets accompanied by relevant external knowledge have been released to facilitate research into remedying these issues with the help of this additional information. In this paper, we propose and experiment with a series of response generation models for the general scenario in which, in addition to the dialogue context, relevant unstructured external knowledge in the form of text is assumed to be available for models to harness. Our approach extends pointer-generator networks by allowing the decoder to hierarchically attend to and copy from external knowledge in addition to the dialogue context. We empirically show the effectiveness of the proposed model compared to several baselines, including on the CONVAI2 challenge.
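
As a rough illustration of the hierarchical copy idea described in the abstract, the sketch below mixes a vocabulary distribution with a two-level copy distribution: one attention weight per source (the dialogue context or a knowledge snippet) and one weight per token within each source. This is a minimal sketch under stated assumptions, not the model from the paper: the function name, the argument names, the mixing formula, and the toy numbers are all hypothetical, and in a trained model these quantities would come from the decoder state rather than being given by hand.

```python
from collections import defaultdict


def hierarchical_copy_distribution(p_gen, vocab_dist, sources, source_attn, token_attn):
    """Mix a vocabulary distribution with a two-level copy distribution.

    Assumed form (illustrative only, not taken from the paper):
        P(w) = p_gen * P_vocab(w)
             + (1 - p_gen) * sum_s source_attn[s] * sum_{i: sources[s][i] == w} token_attn[s][i]

    p_gen:       probability of generating from the vocabulary, in [0, 1]
    vocab_dist:  dict mapping token -> probability (sums to 1)
    sources:     list of token lists (dialogue context plus knowledge snippets)
    source_attn: one weight per source (sums to 1)
    token_attn:  for each source, one weight per token (each list sums to 1)
    """
    final = defaultdict(float)
    # Generation path: scale the vocabulary distribution by p_gen.
    for token, prob in vocab_dist.items():
        final[token] += p_gen * prob
    # Copy path: weight each token by its source weight and its token weight.
    for tokens, beta, alphas in zip(sources, source_attn, token_attn):
        for token, alpha in zip(tokens, alphas):
            final[token] += (1.0 - p_gen) * beta * alpha
    return dict(final)


# Toy inputs (hypothetical values chosen by hand).
vocab_dist = {"i": 0.5, "like": 0.3, "music": 0.2}
sources = [
    ["do", "you", "like", "music"],   # dialogue context
    ["i", "play", "jazz", "guitar"],  # external knowledge snippet
]
source_attn = [0.25, 0.75]
token_attn = [[0.1, 0.1, 0.3, 0.5], [0.1, 0.2, 0.5, 0.2]]

dist = hierarchical_copy_distribution(0.6, vocab_dist, sources, source_attn, token_attn)
```

In this toy setup, "jazz" is absent from the vocabulary distribution and receives probability only through the copy path, which is the behavior that lets a copy mechanism surface words from the external knowledge.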

Speaker-Targeted Audio-Visual Models for Speech Recognition in Cocktail-Party Environments

Guan-Lin Chao, William Chan, Ian Lane

Speech recognition in cocktail-party environments remains a significant challenge for state-of-the-art speech recognition systems, as it is extremely difficult to extract the acoustic signal of an individual speaker from a background of overlapping speech with similar frequency and temporal characteristics. We propose the use of speaker-targeted acoustic and audio-visual models for this task. We complement the acoustic features in a hybrid DNN-HMM model with information about the target speaker's identity as well as visual features from the mouth region of the target speaker. Experiments were performed using simulated cocktail-party data generated from the GRID audio-visual corpus by overlapping two speakers' speech on a single acoustic channel. Our audio-only baseline achieved a WER of 26.3%. The audio-visual model reduced the WER to 4.4%. Introducing speaker identity information had an even more pronounced effect, reducing the WER to 3.6%. Combining both approaches, however, did not significantly improve performance further. Our work demonstrates that speaker-targeted models can significantly improve speech recognition in cocktail-party environments.
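
For readers unfamiliar with the metric, the sketch below computes word error rate in its standard form (word-level edit distance divided by the number of reference words) and the relative reduction implied by the figures quoted in the abstract. The function names and the example sentence pair are illustrative assumptions; this is not the evaluation code used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """Standard WER: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - improved) / baseline


# Illustrative sentence pair (made up for this example): one substituted word out of six.
wer = word_error_rate("bin blue at f two now", "bin blue at s two now")

# WER figures quoted in the abstract, in percent.
audio_visual_gain = relative_reduction(26.3, 4.4)
speaker_identity_gain = relative_reduction(26.3, 3.6)
```

With the quoted figures, the relative WER reductions over the audio-only baseline work out to roughly 83% for the audio-visual model and roughly 86% for the speaker-identity model.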

City-Identification of Flickr Videos Using Semantic Acoustic Features

Benjamin Elizalde, Guan-Lin Chao, Ming Zeng, Ian Lane
In IEEE International Conference on Multimedia Big Data (BigMM) 2016

City-identification of videos aims to determine the likelihood of a video belonging to each of a set of cities. In this paper, we present an approach using only audio, without any additional modality such as images, user tags, or geo-tags. In this manner, we show to what extent the city location of videos correlates with their acoustic information. Success in this task suggests that audio can complement the other modalities. In particular, we present a method to compute and use semantic acoustic features to perform city-identification, and the features provide semantic evidence for the identification. The semantic evidence is given by a taxonomy of urban sounds and expresses the potential presence of these sounds in the city soundtracks. We used the MediaEval Placing Task set, which contains Flickr videos labeled by city. In addition, we used the UrbanSound8K set, which contains audio clips labeled by sound type. Our method improves on the state-of-the-art performance and provides a novel semantic approach to this task.