Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization

University of Surrey
NeurIPS 2023 Workshop on Machine Learning for Audio

We use a Feature Pyramid Network (FPN) to encode audio-visual action features at different temporal resolutions. We then gate the fusion of the audio features according to how useful they are for action classification and boundary regression. For example, the action `take' gains little from audio, so its audio features are gated out. In contrast, the action `chop' can be better localised by combining high temporal-resolution audio features with visual features. Our method learns both the temporal resolution and the gating values end-to-end.
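
As a rough illustration of the gating idea, the sketch below (PyTorch; the module name, dimensions, and the residual-style fusion are our assumptions rather than the released implementation) shows how a visual-conditioned sigmoid gate can suppress or pass the audio contribution at one temporal level:

import torch
import torch.nn as nn

class GatedAudioFusion(nn.Module):
    """Minimal sketch of visual-conditioned audio gating at a single temporal level."""

    def __init__(self, d_model: int):
        super().__init__()
        # Gate conditioned on the visual feature: values near 0 suppress audio
        # (e.g. `take'), values near 1 let it through (e.g. `chop').
        self.gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual, audio: (batch, time, d_model) at the same temporal resolution
        g = self.gate(visual)        # per-channel, per-timestep gate values in (0, 1)
        return visual + g * audio    # audio contributes only where the gate is open

Because the gate is a differentiable function of the visual features, its values are learned end-to-end together with the rest of the network, as described above.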

Abstract

Temporal Action Localization (TAL) aims to identify the start time, end time, and class label of actions in untrimmed videos. While recent advances using transformer networks and Feature Pyramid Networks (FPN) have improved visual feature representations for TAL, less progress has been made on integrating audio features into such frameworks. This paper introduces Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF), a method for merging audio-visual data across different temporal resolutions. Central to our approach is a hierarchical gated cross-attention mechanism, which selectively weighs the importance of audio information at different temporal scales. This not only refines regression boundaries but also improves classification confidence. Importantly, MRAV-FF is versatile: it is compatible with existing FPN TAL architectures and offers a significant performance gain when audio data is available.

A high-level representation of our multi-resolution audio-fusion method. (a) Audio and visual features are projected to a shared dimension via a 1D convolution. (b) Max-pooling is applied to downsample the features across the temporal pyramid. (c) After downsampling, multi-headed cross-attention is applied at each temporal level between the audio and visual features. (d) The visual features are then used as context to scale the attended audio and visual embeddings. (e) The concatenated embedding is used for both regression and classification.
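
The figure's steps (a)-(e) can be summarised in code. The following is an illustrative PyTorch sketch under assumed feature dimensions and hyper-parameters; it is not the authors' released MRAV-FF implementation:

import torch
import torch.nn as nn

class MultiResolutionAVFusion(nn.Module):
    """Illustrative sketch of steps (a)-(e): project, pool, cross-attend, gate, concatenate."""

    def __init__(self, vis_dim=2048, aud_dim=128, d_model=512, levels=4, heads=8):
        super().__init__()
        # (a) project audio and visual features to a shared dimension with 1D convolutions
        self.vis_proj = nn.Conv1d(vis_dim, d_model, kernel_size=1)
        self.aud_proj = nn.Conv1d(aud_dim, d_model, kernel_size=1)
        # (b) max-pooling builds the temporal pyramid by halving the resolution per level
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)
        # (c) one multi-headed cross-attention block per pyramid level
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, heads, batch_first=True) for _ in range(levels)
        )
        # (d) visual-conditioned gates that scale the attended embeddings
        self.gates = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid()) for _ in range(levels)
        )
        self.levels = levels

    def forward(self, vis: torch.Tensor, aud: torch.Tensor):
        # vis: (batch, vis_dim, T), aud: (batch, aud_dim, T)
        v, a = self.vis_proj(vis), self.aud_proj(aud)        # (batch, d_model, T)
        fused = []
        for lvl in range(self.levels):
            v_t, a_t = v.transpose(1, 2), a.transpose(1, 2)  # (batch, T, d_model)
            # (c) visual queries attend over audio keys/values at this resolution
            attended, _ = self.cross_attn[lvl](query=v_t, key=a_t, value=a_t)
            # (d) gate the attended audio embedding with the visual context
            g = self.gates[lvl](v_t)
            # (e) concatenate visual and gated audio features for the shared heads
            fused.append(torch.cat([v_t, g * attended], dim=-1))
            # (b) downsample both streams before the next, coarser level
            v, a = self.pool(v), self.pool(a)
        return fused  # one (batch, T_lvl, 2 * d_model) tensor per pyramid level

Each per-level fused tensor would then feed the FPN's classification and boundary-regression heads, mirroring step (e).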

Poster

BibTeX

@inproceedings{Fish:NeurIPSWS:2023,
        AUTHOR = "Fish, Edward and Weinbren, Jon and Gilbert, Andrew",
        TITLE = "Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization",
        BOOKTITLE = "NeurIPS 2023 Workshop on Machine Learning for Audio",
        YEAR = "2023",
}