comparemela.com

Latest Breaking News On - Question answering information retrieval - Page 1 : comparemela.com

Question-Aware Global-Local Video Understanding Network for Audio-Visu by Zailong Chen, Lei Wang et al.

As a newly emerging task, audio-visual question answering (AVQA) has attracted research attention. Compared with traditional single-modality (e.g., audio or visual) QA tasks, it poses new challenges because multimodal inputs make feature extraction and fusion more complex. First, AVQA requires a more comprehensive understanding of the scene, which involves both audio and visual information. Second, in the presence of more information, feature extraction has to be better connected with the given question. Third, features from different modalities need to be sufficiently correlated and fused. To address these challenges, this work proposes a novel framework for the multimodal question answering task. It characterises an audiovisual scene at both global and local levels, and within each level the features from different modalities are well fused. Furthermore, the given question is utilised to guide not only the feature extraction at the local level but also the final fusion of

© 2025 Vimarsana
