UIL-AQA: Uncertainty-aware Clip-level Interpretable Action Quality Assessment

[1] University of Surrey [2] University of Wollongong
International Journal of Computer Vision (IJCV), 2025
UIL-AQA Teaser figure

A visualisation of our clip-level difficulty-quality regression method on the Rhythmic Gymnastics (RG) dataset.

Abstract

Action Quality Assessment (AQA) aims to score how well an action is performed in a video, but long-term sequences, subjective judging, and a single global score make this problem challenging. Existing methods often focus on short clips, lack clip-level interpretability, and ignore uncertainty arising from human bias. In this work we propose UIL-AQA, a query-based Transformer framework designed to be both clip-level interpretable and uncertainty-aware for long-term AQA. The model introduces an Attention Loss and a Query Initialization module to mitigate a phenomenon we term Temporal Skipping, where self-attention layers are effectively bypassed, weakening temporal modelling. We further add a Gaussian Noise Injection module that simulates variability in human scoring, improving robustness to subjective and noisy labels. Finally, a Difficulty-Quality Regression Head decomposes each clip's contribution into separate difficulty and quality components, enabling fine-grained, human-aligned analysis of performances. Experiments on three long-term AQA benchmarks – Rhythmic Gymnastics (RG), Figure Skating Video (Fis-V), and LOng-form GrOup (LOGO) – show that UIL-AQA achieves state-of-the-art performance while providing more interpretable clip-wise scores.

Motivation

UIL-AQA System figure

Comparison with previous AQA methods, which perform single-score regression on short-term video datasets, lacking interpretability and failing to account for uncertainty. Our proposed network extends to long-term video datasets, addressing subjectivity and scoring bias among different judges to ensure more robust and reliable predictions. Furthermore, leveraging clip-level features and a dual difficulty-quality head enhances interpretability and improves regression performance.

Overview of the UIL-AQA Architecture

UIL-AQA System figure

The input video is divided into clips and passed through a frozen feature extractor (I3D or Video Swin Transformer). A DETR-style Transformer decoder with learnable queries models long-range temporal structure. Attention Loss and Query Initialization mitigate Temporal Skipping by encouraging strong inter-query correlations. The Difficulty-Quality Regression Head outputs clip-level difficulty weights and quality scores, while the Gaussian Noise Injection module models uncertainty in human judgement and improves robustness.

Method Overview

UIL-AQA is designed to provide interpretable and uncertainty-aware scoring for long-term Action Quality Assessment (AQA). Unlike prior methods that produce a single global score, UIL-AQA predicts per-clip difficulty and quality, while also modelling the uncertainty in human judgement. The framework consists of four key components:

1. Feature Extraction

The input video is divided into fixed-length clips. Each clip is processed using a pretrained backbone (I3D or Video Swin Transformer) to extract spatiotemporal features. These features are frozen during training to ensure stability and efficiency.
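Below is a minimal sketch of this step, assuming a PyTorch backbone handle (the paper uses pretrained I3D or Video Swin Transformer weights; clip_len and the tensor layout are illustrative assumptions):

import torch

def extract_clip_features(video, backbone, clip_len=32):
    # video: (T, C, H, W) frame tensor; split into fixed-length clips
    num_frames = video.shape[0]
    clips = [video[i:i + clip_len]
             for i in range(0, num_frames - clip_len + 1, clip_len)]
    feats = []
    with torch.no_grad():  # the backbone stays frozen during training
        for clip in clips:
            # (1, C, T_clip, H, W) layout expected by 3D CNN backbones
            x = clip.permute(1, 0, 2, 3).unsqueeze(0)
            feats.append(backbone(x).flatten(1))
    return torch.cat(feats, dim=0)  # (num_clips, feat_dim)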

2. Temporal Decoder with Learnable Queries

UIL-AQA uses a DETR-style Transformer decoder equipped with learnable queries. Each query corresponds to a video clip and interacts with clip features through self-attention and cross-attention layers to capture long-range temporal structure. To prevent loss of temporal information, we introduce two innovations (sketched in code after the list):

  • Attention Loss – Encourages alignment between self-attention and cross-attention maps, mitigating a failure mode we call Temporal Skipping.
  • Query Initialization – High-variance Gaussian initialisation promotes diverse temporal representations, strengthening inter-query relationships.
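A minimal sketch of the decoder and the two innovations, using PyTorch's built-in Transformer modules; the layer sizes, query_std, and the mean-squared form of the attention loss are illustrative assumptions rather than the paper's exact settings:

import torch
import torch.nn as nn

class ClipQueryDecoder(nn.Module):
    def __init__(self, num_clips, d_model=512, nhead=8, num_layers=2, query_std=3.0):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # One learnable query per clip; high-variance Gaussian initialisation
        # (query_std well above the usual unit scale) promotes diverse queries.
        self.queries = nn.Parameter(torch.randn(num_clips, d_model) * query_std)

    def forward(self, clip_feats):
        # clip_feats: (B, num_clips, d_model); queries self-attend to each other
        # and cross-attend to clip features inside each decoder layer.
        q = self.queries.unsqueeze(0).expand(clip_feats.size(0), -1, -1)
        return self.decoder(q, clip_feats)  # (B, num_clips, d_model)

def attention_loss(self_attn_map, cross_attn_map):
    # Encourage self-attention to align with cross-attention so the self-attention
    # layers are not bypassed (Temporal Skipping). With one query per clip, both
    # maps share the shape (B, num_clips, num_clips); the MSE form is an assumption.
    return ((self_attn_map - cross_attn_map) ** 2).mean()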

3. Difficulty–Quality Regression Head

Instead of predicting only a single score, UIL-AQA regresses:

  • Difficulty – A softmax-normalised weight representing the relative contribution of each clip.
  • Quality – A clip-level quality score predicted independently for each segment.

The final video score is obtained via a weighted sum of clip quality scores. This mirrors human judging processes in sports such as rhythmic gymnastics and figure skating, enabling fine-grained interpretability.
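A minimal sketch of the head, operating on the decoded per-clip embeddings; the two linear projections and their sizes are illustrative:

import torch
import torch.nn as nn

class DifficultyQualityHead(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.difficulty = nn.Linear(d_model, 1)  # per-clip difficulty logit
        self.quality = nn.Linear(d_model, 1)     # per-clip quality score

    def forward(self, clip_embeddings):
        # clip_embeddings: (B, num_clips, d_model)
        w = torch.softmax(self.difficulty(clip_embeddings).squeeze(-1), dim=-1)
        q = self.quality(clip_embeddings).squeeze(-1)
        score = (w * q).sum(dim=-1)  # weighted sum gives the final video score
        return score, w, q  # w and q expose the clip-level interpretation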

4. Gaussian Noise Injection for Uncertainty Modelling

Human scoring often includes subjective variation. UIL-AQA introduces a Gaussian Noise Injection module during training to simulate this variability by perturbing predicted scores. This encourages the model to learn uncertainty-aware representations, improving robustness against noisy labels and inconsistent judge behaviour.
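A minimal sketch of the injection step, with sigma as a hypothetical noise-scale hyperparameter:

import torch

def inject_score_noise(pred_score, sigma=0.1, training=True):
    # Perturb predicted scores with zero-mean Gaussian noise during training only,
    # simulating judge-to-judge scoring variability; sigma is an assumed scale.
    if training:
        return pred_score + sigma * torch.randn_like(pred_score)
    return pred_score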

Summary

Together, these components allow UIL-AQA to achieve state-of-the-art performance on three challenging long-form AQA benchmarks (RG, Fis-V, LOGO), while providing the rare ability to both explain its scoring and model uncertainty — two critical requirements for real-world deployment.

Results

UIL-AQA achieves state-of-the-art performance across all three major long-term AQA datasets. Improvements are most pronounced on the Rhythmic Gymnastics (RG) and LOGO benchmarks, demonstrating the model's strength in handling long temporal dependencies, uncertainty, and clip-level interpretability.

Performance Overview

Dataset | Metric | Previous SOTA | UIL-AQA (Ours) | Improvement
Rhythmic Gymnastics (RG) | Avg. SRCC | 0.842 (Inter-AQA) | 0.858 | +1.6%
Rhythmic Gymnastics (RG) | Avg. MSE | 7.61 (Inter-AQA) | 5.65 | -25.7%
Figure Skating (Fis-V) | Avg. SRCC | 0.780 (Inter-AQA) | 0.796 | +1.6%
Figure Skating (Fis-V) | Avg. MSE | 1.745 (Inter-AQA) | 1.72 | Comparable
LOGO (Artistic Swimming) | Avg. SRCC | 0.780 (Inter-AQA) | 0.796 | +2.1%
LOGO (Artistic Swimming) | Avg. R-ℓ2 | 1.745 (Inter-AQA) | 3.084 | Better ranking

Data extracted from the IJCV 2025 manuscript, including Tables 2, 3, and 4 (RG, Fis-V, LOGO results).
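For reference, the SRCC and MSE figures above can be computed from predicted and ground-truth scores with standard tooling; the sketch below uses scipy and numpy, and the relative-ℓ2 form shown is the definition commonly used in AQA work, assumed here rather than taken from the manuscript:

import numpy as np
from scipy.stats import spearmanr

def aqa_metrics(pred, gt):
    # pred, gt: 1-D arrays of predicted and judge-assigned scores
    srcc, _ = spearmanr(pred, gt)            # rank correlation (higher is better)
    mse = float(np.mean((pred - gt) ** 2))   # mean squared error (lower is better)
    # relative L2 (assumed form): squared error normalised by the score range
    rl2 = float(100 * np.mean(((pred - gt) / (gt.max() - gt.min())) ** 2))
    return srcc, mse, rl2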

UIL-AQA Qualitative Results figure

UIL-AQA is a long-term Action Quality Assessment framework that regresses clip-level difficulty and quality while explicitly modelling judge uncertainty. A query-based Transformer decoder, an attention loss to prevent temporal skipping, and a Gaussian noise injection module together yield interpretable clip-wise scores and robust predictions across rhythmic gymnastics, figure skating, and group performance datasets.

BibTeX

@article{Dong2025UILAQA,
  author    = {Xu Dong and Xinran Liu and Wanqing Li and Anthony Adeyemi-Ejeye and Andrew Gilbert},
  title     = {UIL-AQA: Uncertainty-aware Clip-level Interpretable Action Quality Assessment},
  journal   = {International Journal of Computer Vision},
  year      = {2025},
}