DEL: Dense Event Localization for Multi-modal Audio-Visual Understanding

Mona Ahmadian, University of Surrey, m.ahmadian@surrey.ac.uk
Amir Shirian, JPMorgan Chase, amirdonte15@gmail.com
Frank Guerin, University of Surrey, f.guerin@surrey.ac.uk
Andrew Gilbert, University of Surrey, a.gilbert@surrey.ac.uk

Real-world videos often feature overlapping events of different lengths, making localization difficult. The figure compares ground-truth (GT) annotations with predictions from DEL, an audio-only model (A), and a visual-only model (V). While A and V each struggle with certain categories, DEL accurately detects both short and long events, even when they overlap.

Abstract

Real-world videos often exhibit overlapping events and intricate temporal dependencies, posing significant challenges for effective multimodal interaction modeling. We introduce DEL, a framework for dense semantic action localization that aims to accurately detect and classify multiple actions at fine-grained temporal resolutions in long, untrimmed videos. DEL consists of two key modules: an audio-visual alignment module that leverages masked self-attention to enhance intra-modal consistency, and a multimodal interaction refinement module that models cross-modal dependencies across multiple scales, capturing both high-level semantics and fine-grained details. We report results on multiple real-world Temporal Action Localization (TAL) datasets: UnAV-100, THUMOS14, ActivityNet 1.3, and EPIC-Kitchens-100. The source code will be made publicly available. These advances enable more accurate analysis of complex, real-world scenes, from surveillance to accessible media understanding.
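To make the alignment module concrete, below is a minimal PyTorch-style sketch of masked self-attention applied within a single modality. The module name, feature dimensions, residual and normalization choices, and the padding-based mask are illustrative assumptions, not the released DEL implementation.

# Minimal PyTorch-style sketch of masked self-attention for intra-modal
# feature alignment. Names, dimensions, and the padding-based mask are
# illustrative assumptions, not the released DEL implementation.
import torch
import torch.nn as nn

class MaskedSelfAttentionAlign(nn.Module):
    """Refines features from one modality while ignoring padded time steps."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, pad_mask):
        # x: (batch, time, dim) audio or visual features
        # pad_mask: (batch, time) True where a time step is padding
        attn_out, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        return self.norm(x + attn_out)  # residual connection preserves input cues

# Example: apply the same block to the audio and visual streams of a clip.
b, t, d = 2, 100, 512
audio, visual = torch.randn(b, t, d), torch.randn(b, t, d)
pad = torch.zeros(b, t, dtype=torch.bool)
pad[:, 80:] = True  # last 20 time steps are padding
align = MaskedSelfAttentionAlign(d)
audio_aligned, visual_aligned = align(audio, pad), align(visual, pad)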

Overview of our proposed DEL framework

Our model integrates (1) an adaptive attention mechanism for aligning audio and visual features, (2) inter- and intra-sample contrastive learning to enhance event discrimination, and (3) a multi-scale path aggregation network for feature fusion. DEL efficiently localizes fine-grained and overlapping events in untrimmed videos, leveraging cross-modal dependencies for improved accuracy.
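To illustrate the contrastive component, the sketch below shows a symmetric, InfoNCE-style audio-visual loss in which matching clips act as positives and other samples in the batch serve as inter-sample negatives; it covers only the inter-sample part, and the temperature value and clip-level pooled embeddings are assumptions made for the example, not DEL's exact objective.

# Illustrative sketch of a symmetric, InfoNCE-style audio-visual contrastive
# loss. Temperature and pooled clip-level embeddings are assumptions for this
# example, not DEL's exact formulation.
import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    # audio_emb, visual_emb: (batch, dim) embeddings from corresponding clips
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature              # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Matching audio-visual pairs (the diagonal) are positives; all other
    # pairs in the batch act as inter-sample negatives.
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2v + loss_v2a)

loss = av_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))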

Qualitative results


We present qualitative results demonstrating the effectiveness of our DEL framework compared to unimodal baselines. In the first example, our model, leveraging both audio and visual streams, accurately localizes events such as wind noise, cars passing by, driving a motorcycle, and skidding, even when occurrences overlap or are short in duration. In contrast, the audio-only model struggles with visually driven events like skidding, while the vision-only model fails to detect sound-based events like wind noise. Moreover, relying solely on audio leads to incorrect predictions, such as misclassifying the scene as engine knocking due to the absence of visual context. Similarly, the vision-only model, lacking critical audio cues, misinterprets the scene as auto racing from start to finish based purely on visual perception. This shows the impact of audio in disambiguating visually similar activities.

Distinguishing between "man speaking" and another potential sound source in the third scenario is only possible with audio input, as visual information alone is insufficient. Likewise, in the last example the scene is crowded, making it challenging to infer "people cheering" from visual cues alone: the vision-only model struggles to recognize this category, while the audio modality provides the crucial information for its correct identification. Detecting "people slapping", by contrast, relies primarily on visual cues. These results highlight how integrating audio and visual streams leads to more accurate and robust event localization, particularly in complex multimodal scenarios.

BibTeX

@inproceedings{Ahmadian:DEL:ICCVWS:2025,
  AUTHOR = "Mona Ahmadian and Amir Shirian and Frank Guerin and Andrew Gilbert",
  TITLE = "DEL: Dense Event Localization for Multi-modal Audio-Visual Understanding",
  BOOKTITLE = "International Conference on Computer Vision (ICCV'25) - The 4th Workshop on What is Next in Multimodal Foundation Models? (MMFM4)",
  YEAR = "2025"
}