DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description

University of Surrey
arXiv preprint arXiv:X.X, 2025

Our DANTE-AD method extracts frame- and scene-level visual information and fuses it via a sequential cross-attention module, producing audio description (AD) that is aware of both frame and scene context over extended video sequences.

Abstract

Audio Description is a narrated commentary designed to aid vision-impaired audiences in perceiving key visual elements in a video. While short-form video understanding has advanced rapidly, maintaining coherent long-term visual storytelling remains an open problem. Existing methods rely solely on frame-level embeddings, effectively describing object-based content but lacking contextual information across scenes. We introduce DANTE-AD, an enhanced video description model that leverages a dual-vision Transformer-based architecture to address this gap. DANTE-AD sequentially fuses frame- and scene-level embeddings to improve long-term contextual understanding. We propose a novel sequential cross-attention method that achieves contextual grounding for fine-grained audio description generation. Evaluated on a broad range of key scenes from well-known movie clips, DANTE-AD outperforms existing methods on both traditional NLP metrics and LLM-based evaluations.

Overview of our audio description generation pipeline. The system features two primary branches: a frame-level visual branch (blue) and a scene-level visual branch (red). Ground-truth references are embedded and processed auto-regressively using a causal attention mask. Sequential fusion integrates the visual embeddings within the Dual-Vision Attention Network (purple). The fused representation is fed to our LLaMA language model and decoded into a natural language AD prediction.
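For readers unfamiliar with the causal masking step, the minimal PyTorch sketch below shows the standard upper-triangular mask used for autoregressive processing of the embedded ground-truth references. This illustrates the general technique only; the function name and masking convention are assumptions, not the paper's released code.

import torch

def causal_attention_mask(seq_len: int) -> torch.Tensor:
    # Upper-triangular boolean mask: True entries are blocked, so each
    # token can attend only to itself and to earlier tokens. This follows
    # the attn_mask convention of torch.nn.MultiheadAttention.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)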

We propose a sequential fusion method within the Dual-Vision Attention Network to integrate frame- and scene-level embeddings. Ground-truth word embeddings are processed using a causal self-attention mask.
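The exact layer configuration is not given on this page. As a rough sketch, one plausible reading of sequential fusion is two chained cross-attention blocks: the text embeddings first attend to frame-level features, and that result then attends to scene-level features. All names, dimensions, and the residual/normalization placement below are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class SequentialFusionSketch(nn.Module):
    # Illustrative two-stage cross-attention; hyperparameters are assumed.
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.scene_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text, frame_feats, scene_feats):
        # Stage 1: ground the text queries in frame-level (object) detail.
        attn_out, _ = self.frame_attn(text, frame_feats, frame_feats)
        x = self.norm1(text + attn_out)
        # Stage 2: refine with scene-level (long-term) context.
        attn_out, _ = self.scene_attn(x, scene_feats, scene_feats)
        return self.norm2(x + attn_out)

Under this reading, the fused output would then serve as the visual conditioning passed to the language model for decoding.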

Comparison of AD performance on the CMD-AD dataset. LLM-AD-Eval is evaluated with LLaMA2-7B-chat (left) and GPT-3.5-turbo (right). We report results for DANTE-AD using sequential fusion of our visual embeddings.

Qualitative results of our DANTE-AD method on CMD-AD-Eval. Our method uses sequential cross-attention fusion between frame- and scene-level visual embeddings.

BibTeX

@inproceedings{Deganutti:DANTE-AD:ArXiv:2025,
        AUTHOR = "Deganutti, Adrienne and Hadfield, Simon and Gilbert, Andrew",
        TITLE = "DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description",
        BOOKTITLE = "ArXiv abs/X.X",
        YEAR = "2025",
        }