"> Interpretable Long-term Action Quality Assessment


University of Surrey [1]   University of Wollongong [2]
The 35th British Machine Vision Conference 2024 (Oral)

The visualization of the clip-level weight-score regression method illustrates that our network can adhere to the same evaluative logic as human judges in real-world scenarios. The green curve (weight) indicates the significance of the respective action clip, the orange curve (score) quantifies the execution quality of the action, and the blue curve shows the overall score. All scores are normalized to a range of 0 to 1 for easier comparison.
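For concreteness, the relationship between the three curves can be written as a weighted sum of per-clip scores. The short sketch below is illustrative only; the softmax and sigmoid normalizations are our assumptions rather than the exact ones used by the network.

```python
import torch

# Illustrative per-clip outputs of a weight-score regression head
# (4 clips shown; real videos contain many more).
weight_logits = torch.tensor([0.2, 1.5, 0.7, -0.3])
score_logits = torch.tensor([1.0, 0.4, 2.1, -0.5])

weights = torch.softmax(weight_logits, dim=0)   # green curve: clip importance, sums to 1
scores = torch.sigmoid(score_logits)            # orange curve: clip quality, in [0, 1]
overall = (weights * scores).sum()              # blue curve: overall score, also in [0, 1]

print(weights, scores, overall)
```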

Abstract

Long-term Action Quality Assessment (AQA) evaluates the execution of activities in videos. However, the length of such videos presents challenges for fine-grained interpretability: current AQA methods typically produce a single score by averaging clip features, lacking detailed semantic meaning for individual clips. Long-term videos pose additional difficulty due to the complexity and diversity of actions, exacerbating interpretability challenges. While query-based transformer networks offer promising long-term modelling capabilities, their interpretability in AQA remains unsatisfactory due to a phenomenon we term Temporal Skipping, where the model skips self-attention layers to prevent output degradation. To address this, we propose an attention loss function and a query initialization method to enhance performance and interpretability. Additionally, we introduce a weight-score regression module designed to approximate the scoring patterns observed in human judgments and replace conventional single-score regression, improving the rationality of interpretability. Our approach achieves state-of-the-art results on three real-world, long-term AQA benchmarks.
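As a rough sketch of the attention loss mentioned above, the function below measures the similarity between the decoder's self-attention and cross-attention maps so that it can be minimized during training. The cosine-similarity formulation and the assumption that both maps share the shape [batch, clips, clips] are ours, not necessarily the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def attention_loss(self_attn: torch.Tensor, cross_attn: torch.Tensor) -> torch.Tensor:
    """Similarity between self- and cross-attention maps, to be minimized.

    Both maps are assumed to have the same shape, e.g. [batch, clips, clips]
    when one query is used per clip. Cosine similarity over the flattened
    maps is an illustrative choice of similarity measure.
    """
    self_flat = F.normalize(self_attn.flatten(1), dim=-1)
    cross_flat = F.normalize(cross_attn.flatten(1), dim=-1)
    return (self_flat * cross_flat).sum(dim=-1).mean()
```

In training, a term like this would typically be added to the main score-regression loss with a small weighting coefficient.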

Video Presentation

The overview architecture of our network. The input video is divided into clips and fed into a backbone network. A temporal decoder models the clip-level features into temporal representations via learnable, positionally encoded queries. The interpretable weight-score regression head regresses the final score by multiplying the weight and score of each clip. By minimizing the similarity between the self-attention map and the cross-attention map, together with the query initialization, the Temporal Skipping problem common in longer video sequences is eliminated and human interpretability improves.
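To make the pipeline above more concrete, here is a minimal PyTorch-style sketch of a temporal decoder with learnable queries followed by a clip-level weight-score regression head. Module names, dimensions, and the decoder configuration are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class WeightScoreHead(nn.Module):
    """Illustrative weight-score regression head.

    Given decoded per-clip representations [B, T, D], predicts a weight
    (clip importance) and a score (clip quality) for each clip, then
    combines them into a single video-level score in [0, 1].
    """
    def __init__(self, dim: int = 256):
        super().__init__()
        self.weight_branch = nn.Linear(dim, 1)
        self.score_branch = nn.Linear(dim, 1)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.weight_branch(clip_feats).squeeze(-1), dim=-1)  # [B, T]
        scores = torch.sigmoid(self.score_branch(clip_feats).squeeze(-1))            # [B, T]
        return (weights * scores).sum(dim=-1)                                        # [B]

# Example usage with a standard transformer decoder over backbone clip features.
decoder_layer = nn.TransformerDecoderLayer(d_model=256, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
head = WeightScoreHead(dim=256)

clip_feats = torch.randn(2, 16, 256)             # backbone features: [B, T, D]
queries = nn.Parameter(torch.randn(16, 256))     # learnable queries (positional encoding omitted here)
decoded = decoder(queries.unsqueeze(0).expand(2, -1, -1), clip_feats)
video_score = head(decoded)                      # [B], each value in [0, 1]
```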

This figure shows the self-attention maps in (a) and (c) (ours), and visualizations of the segmented score of each clip in (b) and (d) (ours). (a) and (b) correspond to the same action sequence, as do (c) and (d). In (a), the self-attention map severely suffers from the Temporal Skipping problem, whereas (c) shows high correlations between queries.

Visualization of our clip-level weight-score regression method on Rhythmic Gymnastics dataset.

Poster

BibTeX

@inproceedings{Dong:BMVC:2024,
      AUTHOR = "Dong, Xu and Liu, Xinran and Li, Wanqing and Adeyemi-Ejeye, Anthony and Gilbert, Andrew",
      TITLE = "Interpretable Long-term Action Quality Assessment",
      BOOKTITLE = "British Machine Vision Conference (BMVC'24)",
      YEAR = "2024",
}