UIL-AQA: Uncertainty-aware Clip-level Interpretable Action Quality Assessment

[1] University of Surrey [2] University of Wollongong
International Journal of Computer Vision (IJCV), 2025
UIL-AQA Teaser figure

A visualisation of our clip-level difficulty-quality regression method on the Rhythmic Gymnastics (RG) dataset.

Abstract

Action Quality Assessment (AQA) aims to score how well an action is performed in a video, but long video sequences, subjective judging, and a single global score make the task challenging. Existing methods often focus on short clips, lack clip-level interpretability, and ignore the uncertainty that arises from human bias. In this work, we propose UIL-AQA, a query-based Transformer framework designed to be both clip-level interpretable and uncertainty-aware for long-term AQA. The model introduces an Attention Loss and a Query Initialization module to mitigate a phenomenon we term Temporal Skipping, in which self-attention layers are effectively bypassed and temporal modelling is weakened. We further add a Gaussian Noise Injection module that simulates the variability of human scoring, improving robustness to subjective and noisy labels. Finally, a Difficulty-Quality Regression Head decomposes each clip's contribution into separate difficulty and quality components, enabling fine-grained, human-aligned analysis of performances. Experiments on three long-term AQA benchmarks, Rhythmic Gymnastics (RG), Figure Skating Video (Fis-V), and LOng-form GrOup (LOGO), show that UIL-AQA achieves state-of-the-art performance while providing more interpretable clip-wise scores.
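To make the difficulty-quality decomposition concrete, one plausible form of the final score (the exact aggregation rule is our assumption, not an equation quoted from the paper) for a video divided into T clips is

    \hat{S} = \sum_{t=1}^{T} d_t \, q_t, \qquad \sum_{t=1}^{T} d_t = 1,

where d_t is the normalised difficulty weight and q_t the quality score predicted for clip t, so each term d_t q_t is the interpretable clip-level contribution to the overall score.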

Overview of the UIL-AQA Architecture

UIL-AQA System figure

The input video is divided into clips and passed through a frozen feature extractor (I3D or Video Swin Transformer). A DETR-style Transformer decoder with learnable queries models long-range temporal structure. Attention Loss and Query Initialization mitigate Temporal Skipping by encouraging strong inter-query correlations. The Difficulty-Quality Regression Head outputs clip-level difficulty weights and quality scores, while the Gaussian Noise Injection module models uncertainty in human judgement and improves robustness.
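The paragraph above maps onto a short pipeline, sketched below in PyTorch. This is a minimal illustrative reading, not the released implementation: the class name UILAQASketch, the dimensions, the softmax normalisation of difficulty weights, and the placement of the noise injection are all our assumptions.

import torch
import torch.nn as nn

class UILAQASketch(nn.Module):
    """Minimal sketch of the UIL-AQA decoding pipeline (illustrative only).

    Assumes clip features were already extracted by a frozen backbone
    (e.g. I3D or Video Swin) and projected to `d_model`.
    """

    def __init__(self, d_model=512, num_queries=16, num_layers=2, nhead=8):
        super().__init__()
        # Learnable queries; the paper's Query Initialization module would
        # initialise these from the clip features rather than from scratch.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # Difficulty-Quality Regression Head: one difficulty weight and one
        # quality score per query (a plausible reading of the paper).
        self.difficulty_head = nn.Linear(d_model, 1)
        self.quality_head = nn.Linear(d_model, 1)

    def forward(self, clip_feats, noise_std=0.1):
        # clip_feats: (B, T, d_model), one row per clip.
        B = clip_feats.size(0)
        # Gaussian Noise Injection: perturb features during training to mimic
        # the variability of human judges (placement here is an assumption).
        if self.training and noise_std > 0:
            clip_feats = clip_feats + noise_std * torch.randn_like(clip_feats)
        queries = self.queries.unsqueeze(0).expand(B, -1, -1)
        hidden = self.decoder(tgt=queries, memory=clip_feats)   # (B, Q, d_model)
        difficulty = torch.softmax(self.difficulty_head(hidden).squeeze(-1), dim=-1)  # (B, Q)
        quality = self.quality_head(hidden).squeeze(-1)         # (B, Q)
        # Final score: difficulty-weighted sum of clip-level quality.
        score = (difficulty * quality).sum(dim=-1)              # (B,)
        return score, difficulty, quality

A dummy call such as model = UILAQASketch(); model.eval(); score, d, q = model(torch.randn(2, 68, 512)) returns one scalar score per video together with per-query difficulty and quality values, which is what enables the clip-wise analysis described above.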

Results

UIL-AQA achieves state-of-the-art performance across all three major long-term AQA datasets. Improvements are most pronounced on the Rhythmic Gymnastics (RG) and LOGO benchmarks, demonstrating the model's strength in handling long temporal dependencies, uncertainty, and clip-level interpretability.

Performance Overview

Dataset                     Metric      Previous SOTA        UIL-AQA (Ours)   Improvement
Rhythmic Gymnastics (RG)    Avg. SRCC   0.842 (Inter-AQA)    0.858            +1.6%
Rhythmic Gymnastics (RG)    Avg. MSE    7.61 (Inter-AQA)     5.65             -25.7%
Figure Skating (Fis-V)      Avg. SRCC   0.780 (Inter-AQA)    0.796            +1.6%
Figure Skating (Fis-V)      Avg. MSE    1.745 (Inter-AQA)    1.72             Comparable
LOGO (Artistic Swimming)    Avg. SRCC   0.780 (Inter-AQA)    0.796            +2.1%
LOGO (Artistic Swimming)    Avg. R-ℓ2   1.745 (Inter-AQA)    3.084            Better ranking

Data extracted from the IJCV 2025 manuscript, including Tables 2, 3, and 4 (RG, Fis-V, LOGO results).
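For reference, the metrics in the table can be computed as follows. This is a generic evaluation sketch, not the authors' evaluation code: spearmanr is standard, while the relative-L2 normalisation follows the common AQA convention and is an assumption about how the LOGO numbers were obtained.

import numpy as np
from scipy.stats import spearmanr

def srcc(pred, gt):
    # Spearman rank correlation coefficient between predicted and judge scores.
    return spearmanr(pred, gt).correlation

def mse(pred, gt):
    # Mean squared error of the raw scores.
    return float(np.mean((np.asarray(pred) - np.asarray(gt)) ** 2))

def r_l2(pred, gt, score_min, score_max):
    # Relative L2 distance (x100), normalised by the score range, as commonly
    # reported on AQA benchmarks; the exact convention here is assumed.
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    return float(np.mean(((pred - gt) / (score_max - score_min)) ** 2) * 100)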

UIL-AQA Qualitative Results figure

UIL-AQA is a long-term Action Quality Assessment framework that regresses clip-level difficulty and quality while explicitly modelling judge uncertainty. A query-based Transformer decoder, an attention loss to prevent temporal skipping, and a Gaussian noise injection module together yield interpretable clip-wise scores and robust predictions across rhythmic gymnastics, figure skating, and group performance datasets.

BibTeX

@article{Dong2025UILAQA,
  author    = {Xu Dong and Xinran Liu and Wanqing Li and Anthony Adeyemi-Ejeye and Andrew Gilbert},
  title     = {UIL-AQA: Uncertainty-aware Clip-level Interpretable Action Quality Assessment},
  journal   = {International Journal of Computer Vision},
  year      = {2025},
}