Action Quality Assessment (AQA) aims to score how well an action is performed in a video, but long-term sequences, subjective judging, and a single global score make the problem challenging. Existing methods often focus on short clips, lack clip-level interpretability, and ignore the uncertainty arising from human bias. In this work we propose UIL-AQA, a query-based Transformer framework designed to be both clip-level interpretable and uncertainty-aware for long-term AQA. The model introduces an Attention Loss and a Query Initialization module to mitigate a phenomenon we term Temporal Skipping, in which self-attention layers are effectively bypassed and temporal modelling is weakened. We further add a Gaussian Noise Injection module that simulates the variability of human scoring, improving robustness to subjective and noisy labels. Finally, a Difficulty-Quality Regression Head decomposes each clip's contribution into separate difficulty and quality components, enabling fine-grained, human-aligned analysis of performances. Experiments on three long-term AQA benchmarks, Rhythmic Gymnastics (RG), Figure Skating Video (Fis-V), and LOng-form GrOup (LOGO), show that UIL-AQA achieves state-of-the-art performance while providing more interpretable clip-wise scores.
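To make the idea concrete, below is a minimal PyTorch sketch of how a query-based decoder can produce per-clip difficulty and quality terms that sum to a global score, together with Gaussian noise injected into the ground-truth labels to mimic judge variability. All names (`ClipQueryScorer`, `noisy_labels`), dimensions, and the multiplicative difficulty × quality aggregation are illustrative assumptions, not the paper's reference implementation.

```python
# Illustrative sketch only: module names, dimensions, and the aggregation rule
# are assumptions, not the authors' released code.
import torch
import torch.nn as nn


class ClipQueryScorer(nn.Module):
    """Decodes a set of learnable queries (one per clip segment) against the
    clip features, regresses a difficulty and a quality term per query, and
    sums their products to obtain the global score."""

    def __init__(self, feat_dim=1024, d_model=256, num_queries=32, num_layers=2):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.difficulty_head = nn.Linear(d_model, 1)  # per-clip difficulty
        self.quality_head = nn.Linear(d_model, 1)     # per-clip execution quality

    def forward(self, clip_feats):
        # clip_feats: (batch, num_clips, feat_dim) backbone features
        memory = self.input_proj(clip_feats)
        queries = self.queries.unsqueeze(0).expand(clip_feats.size(0), -1, -1)
        decoded = self.decoder(queries, memory)        # (batch, num_queries, d_model)
        difficulty = self.difficulty_head(decoded).squeeze(-1)
        quality = torch.sigmoid(self.quality_head(decoded)).squeeze(-1)
        clip_scores = difficulty * quality             # interpretable per-clip terms
        return clip_scores.sum(dim=-1), clip_scores    # global score, clip breakdown


def noisy_labels(scores, sigma=0.5):
    """Gaussian noise injection on ground-truth scores to mimic the spread of
    human judging (sigma is a hypothetical hyper-parameter)."""
    return scores + torch.randn_like(scores) * sigma


if __name__ == "__main__":
    feats = torch.randn(2, 68, 1024)                   # 2 videos, 68 clips each
    global_score, per_clip = ClipQueryScorer()(feats)
    print(global_score.shape, per_clip.shape)          # torch.Size([2]) torch.Size([2, 32])
```

The per-clip terms are what make the prediction inspectable: each query's difficulty × quality product can be attributed to a segment of the routine rather than folded into a single opaque number.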
UIL-AQA achieves state-of-the-art performance on all three major long-term AQA benchmarks. Gains are most pronounced on the Rhythmic Gymnastics (RG) and LOGO benchmarks, demonstrating the model's ability to handle long temporal dependencies and scoring uncertainty while providing clip-level interpretability.
| Dataset | Metric | Previous SOTA | UIL-AQA (Ours) | Improvement |
|---|---|---|---|---|
| Rhythmic Gymnastics (RG) | Avg. SRCC | 0.842 (Inter-AQA) | 0.858 | +1.6% |
| Rhythmic Gymnastics (RG) | Avg. MSE | 7.61 (Inter-AQA) | 5.65 | -25.7% |
| Figure Skating (Fis-V) | Avg. SRCC | 0.780 (Inter-AQA) | 0.796 | +1.6% |
| Figure Skating (Fis-V) | Avg. MSE | 1.745 (Inter-AQA) | 1.72 | Comparable |
| LOGO (Artistic Swimming) | Avg. SRCC | 0.780 (Inter-AQA) | 0.796 | +2.1% |
| LOGO (Artistic Swimming) | Avg. R-l2 | 1.745 (Inter-AQA) | 3.084 | Better ranking |
Results are taken from Tables 2, 3, and 4 of the IJCV 2025 manuscript (RG, Fis-V, and LOGO, respectively).
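For reference, the reported metrics can be computed roughly as in the SciPy/NumPy sketch below. The per-action-category averaging used by each benchmark is omitted, and the R-l2 normalisation follows the common AQA convention, so treat this as an assumption rather than the exact evaluation protocol.

```python
# Sketch of the evaluation metrics above; the per-category averaging of each
# benchmark is omitted and the R-l2 definition is the common AQA convention.
import numpy as np
from scipy.stats import spearmanr


def srcc(pred, gt):
    """Spearman rank correlation between predicted and ground-truth scores."""
    rho, _ = spearmanr(pred, gt)
    return float(rho)


def mse(pred, gt):
    """Mean squared error of the predicted scores."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    return float(np.mean((pred - gt) ** 2))


def relative_l2(pred, gt):
    """Relative L2 distance (R-l2): squared error scaled by the score range."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    score_range = gt.max() - gt.min()
    return float(np.mean(((pred - gt) / score_range) ** 2) * 100)
```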
@article{Dong2025UILAQA,
  author  = {Xu Dong and Xinran Liu and Wanqing Li and Anthony Adeyemi-Ejeye and Andrew Gilbert},
  title   = {UIL-AQA: Uncertainty-aware Clip-level Interpretable Action Quality Assessment},
  journal = {International Journal of Computer Vision},
  year    = {2025},
}