"> Interpretable Long-term Action Quality Assessment


University of Surrey [1]   University of Wollongong [2]
The 35th British Machine Vision Conference 2024 (Oral)

The visualization of the clip-level weight-score regression method illustrates that our network can adhere to the same evaluative logic as human judges in real-world scenarios. The green curve (weight) indicates the significance of the respective action clip, the orange curve (score) quantifies the execution quality of the action, and the blue curve shows the overall score. All scores are normalized to a range of 0 to 1 for easier comparison.
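For concreteness, the relationship between the three curves can be written as a weighted sum of per-clip scores. The short sketch below is illustrative only; the softmax and sigmoid normalizations are our assumptions rather than the exact ones used by the network.

```python
import torch

# Illustrative per-clip outputs of a weight-score regression head
# (4 clips shown; real videos contain many more).
weight_logits = torch.tensor([0.2, 1.5, 0.7, -0.3])
score_logits = torch.tensor([1.0, 0.4, 2.1, -0.5])

weights = torch.softmax(weight_logits, dim=0)   # green curve: clip importance, sums to 1
scores = torch.sigmoid(score_logits)            # orange curve: clip quality, in [0, 1]
overall = (weights * scores).sum()              # blue curve: overall score, also in [0, 1]

print(weights, scores, overall)
```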

Abstract

Long-term Action Quality Assessment (AQA) evaluates the execution of activities in videos. However, the length of such videos presents challenges for fine-grained interpretability: current AQA methods typically produce a single score by averaging clip features, lacking detailed semantic meaning for individual clips. Long-term videos pose additional difficulty due to the complexity and diversity of actions, exacerbating interpretability challenges. While query-based transformer networks offer promising long-term modelling capabilities, their interpretability in AQA remains unsatisfactory due to a phenomenon we term Temporal Skipping, where the model skips self-attention layers to prevent output degradation. To address this, we propose an attention loss function and a query initialization method to enhance performance and interpretability. Additionally, we introduce a weight-score regression module designed to approximate the scoring patterns observed in human judgments and replace conventional single-score regression, improving the rationality of interpretability. Our approach achieves state-of-the-art results on three real-world, long-term AQA benchmarks.
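As a rough sketch of the attention loss mentioned above, the function below measures the similarity between the decoder's self-attention and cross-attention maps so that it can be minimized during training. The cosine-similarity formulation and the assumption that both maps share the shape [batch, clips, clips] are ours, not necessarily the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def attention_loss(self_attn: torch.Tensor, cross_attn: torch.Tensor) -> torch.Tensor:
    """Similarity between self- and cross-attention maps, to be minimized.

    Both maps are assumed to have the same shape, e.g. [batch, clips, clips]
    when one query is used per clip. Cosine similarity over the flattened
    maps is an illustrative choice of similarity measure.
    """
    self_flat = F.normalize(self_attn.flatten(1), dim=-1)
    cross_flat = F.normalize(cross_attn.flatten(1), dim=-1)
    return (self_flat * cross_flat).sum(dim=-1).mean()
```

In training, a term like this would typically be added to the main score-regression loss with a small weighting coefficient.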

Video Presentation

The overview architecture of our network. The input video is divided into clips and fed into a backbone network. A temporal decoder models the clip-level features into temporal representations via learnable, positionally encoded queries. The interpretable weight-score regression head regresses the final score by multiplying the weight and score of each clip. By minimizing the similarity between the self-attention map and the cross-attention map, together with the query initialization, the Temporal Skipping problem common in longer video sequences is eliminated and human interpretability improves.
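To make the pipeline above more concrete, here is a minimal PyTorch-style sketch of a temporal decoder with learnable queries followed by a clip-level weight-score regression head. Module names, dimensions, and the decoder configuration are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class WeightScoreHead(nn.Module):
    """Illustrative weight-score regression head.

    Given decoded per-clip representations [B, T, D], predicts a weight
    (clip importance) and a score (clip quality) for each clip, then
    combines them into a single video-level score in [0, 1].
    """
    def __init__(self, dim: int = 256):
        super().__init__()
        self.weight_branch = nn.Linear(dim, 1)
        self.score_branch = nn.Linear(dim, 1)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.weight_branch(clip_feats).squeeze(-1), dim=-1)  # [B, T]
        scores = torch.sigmoid(self.score_branch(clip_feats).squeeze(-1))            # [B, T]
        return (weights * scores).sum(dim=-1)                                        # [B]

# Example usage with a standard transformer decoder over backbone clip features.
decoder_layer = nn.TransformerDecoderLayer(d_model=256, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
head = WeightScoreHead(dim=256)

clip_feats = torch.randn(2, 16, 256)             # backbone features: [B, T, D]
queries = nn.Parameter(torch.randn(16, 256))     # learnable queries (positional encoding omitted here)
decoded = decoder(queries.unsqueeze(0).expand(2, -1, -1), clip_feats)
video_score = head(decoded)                      # [B], each value in [0, 1]
```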

This figure shows the self-attention maps in (a) and (c) (ours), and visualizations of the segmented score of each clip in (b) and (d) (ours). (a) and (b) correspond to the same action sequence, as do (c) and (d). In (a), the self-attention map severely suffers from the Temporal Skipping problem, whereas (c) shows high correlations between queries.

Visualization of our clip-level weight-score regression method on Rhythmic Gymnastics dataset.

Poster

BibTeX

@inproceedings{Dong:BMVC:2024,
      AUTHOR = "Dong, Xu and Liu, Xinran and Li, Wanqing and Adeyemi-Ejeye, Anthony and Gilbert, Andrew",
      TITLE = "Interpretable Long-term Action Quality Assessment",
      BOOKTITLE = "British Machine Vision Conference (BMVC'24)",
      YEAR = "2024",
}