Real-world videos often contain overlapping events and intricate temporal dependencies, which make effective multimodal interaction modeling challenging. We introduce DEL, a framework for dense semantic action localization that accurately detects and classifies multiple actions at fine-grained temporal resolutions in long, untrimmed videos. DEL consists of two key modules: an audio-visual feature alignment module that leverages masked self-attention to enhance intra-modal consistency, and a multimodal interaction refinement module that models cross-modal dependencies across multiple temporal scales, capturing both high-level semantics and fine-grained details. We report results on multiple real-world Temporal Action Localization (TAL) datasets: UnAV-100, THUMOS14, ActivityNet 1.3, and EPIC-Kitchens-100. The source code will be made publicly available. These advances enable more accurate analysis of complex, real-world scenes, from surveillance to accessible media understanding.
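To make the two-module design concrete, below is a minimal PyTorch sketch of the general idea: masked self-attention applied within each modality for intra-modal consistency, followed by bidirectional cross-modal attention applied at several temporal scales. Since the source code is not yet released, every class name, dimension, the shared-weight alignment module, and the strided subsampling used to mimic multiple scales are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; not the released DEL code.
import torch
import torch.nn as nn


class MaskedSelfAttentionAlign(nn.Module):
    """Intra-modal alignment: masked self-attention over one modality's features."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # pad_mask is True at padded time steps so attention ignores them.
        out, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        return self.norm(x + out)


class CrossModalRefine(nn.Module):
    """Cross-modal interaction at one temporal scale: audio attends to video and vice versa."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        a, _ = self.a2v(audio, video, video)   # audio queries video context
        v, _ = self.v2a(video, audio, audio)   # video queries audio context
        return self.norm_a(audio + a), self.norm_v(video + v)


if __name__ == "__main__":
    B, T, D = 2, 128, 256
    audio = torch.randn(B, T, D)
    video = torch.randn(B, T, D)
    pad = torch.zeros(B, T, dtype=torch.bool)   # no padding in this toy example

    align = MaskedSelfAttentionAlign(D)
    refine = CrossModalRefine(D)

    audio, video = align(audio, pad), align(video, pad)
    # Multi-scale refinement, approximated here by strided subsampling of the
    # time axis; coarser strides capture high-level semantics, finer strides
    # keep fine-grained temporal detail.
    for stride in (1, 2, 4):
        a_s, v_s = refine(audio[:, ::stride], video[:, ::stride])
    print(a_s.shape, v_s.shape)   # fused features at the coarsest scale
```

Note that this toy example reuses one alignment module for both modalities; separate per-modality parameters and learned multi-scale feature pyramids would be more faithful to a full dense-localization pipeline.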
@inproceedings{Ahmadian:DEL:ICCVWS:2025,
AUTHOR = "Mona Ahmadian and Amir Shirian and Frank Guerin and Andrew Gilbert",
TITLE = "DEL: Dense Event Localization for Multi-modal Audio-Visual Understanding",
BOOKTITLE = "International Conference on Computer Vision (ICCV'25) - The 4th Workshop on What is Next in Multimodal Foundation Models? (MMFM4)",
YEAR = "2025"
}