Detailed Information

Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos

Cited 0 times in Web of Science; cited 19 times in Scopus
Authors

Yun, Heeseung; Yu, Youngjae; Yang, Wonsuk; Lee, Kangil; Kim, Gun Hee

Issue Date
2021-01
Publisher
IEEE
Citation
Proceedings of the IEEE International Conference on Computer Vision, pp. 2011-2021
Abstract
© 2021 IEEE. 360° videos convey holistic views of the surroundings of a scene. They provide audio-visual cues beyond a predetermined normal field of view and display distinctive spatial relations on a sphere. However, previous benchmark tasks for panoramic videos are still limited in evaluating the semantic understanding of audio-visual relationships or the spherical spatial properties of the surroundings. We propose a novel benchmark named Pano-AVQA, a large-scale grounded audio-visual question answering dataset on panoramic videos. Using 5.4K 360° video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding: spherical spatial relation QAs and audio-visual relation QAs. We train several transformer-based models on Pano-AVQA, and the results suggest that our proposed spherical spatial embeddings and multimodal training objectives fairly contribute to a better semantic understanding of the panoramic surroundings on the dataset.
ISSN
1550-5499
URI
https://hdl.handle.net/10371/183774
DOI
https://doi.org/10.1109/ICCV48922.2021.00204
Files in This Item:
There are no files associated with this item.

Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.
