Generative Data Augmentation for Skeleton Action Recognition

TL;DR: A label-conditioned diffusion model generates diverse, high-fidelity skeleton motion to augment training data, boosting skeleton action recognition, especially in low-data regimes.

  • Conditional diffusion generates label-consistent skeleton sequences.
  • Transformer encoder–decoder models motion semantics and denoising.
  • A Generative Refinement Module (GRM) and sampling-time dropout balance fidelity and diversity during synthesis.
Conditional diffusion for skeleton augmentation (overview)

Overview of our approach. From only a small set of labelled skeleton sequences, the model generates diverse, high-fidelity samples. Combined with a reduced amount of real training data, these synthetic samples enable our skeleton action recognisers to reach performance close to the state of the art on HumanAct12 and Refined NTU-RGBD (NTU-VIBE).

Abstract

Skeleton-based human action recognition is powerful, but collecting large-scale, diverse, well-annotated 3D skeleton datasets is expensive. We propose a conditional generative pipeline for data augmentation in skeleton action recognition. The method learns the distribution of real skeleton sequences under action-label constraints, enabling the synthesis of diverse, high-fidelity data. A Transformer-based encoder–decoder, combined with a Generative Refinement Module (GRM) and sampling-time dropout, balances fidelity and diversity. Experiments on HumanAct12 and Refined NTU-RGBD (NTU-VIBE) show consistent improvements across multiple backbones in both few-shot and full-data settings.

Method Overview

The model uses a conditional Transformer encoder, conditioned on the action label and diffusion timestep, together with a Transformer decoder that denoises a noise-corrupted skeleton sequence. An auxiliary classifier encourages label-consistent generation. Sampling-time dropout increases diversity, while the GRM filters low-fidelity samples.

Conditional skeleton diffusion architecture (see Fig. 2 in the paper)

We propose a generative data augmentation pipeline for skeleton action recognition that synthesises label-consistent 3D skeleton sequences and uses them to augment training data. The core generator is a label-conditioned diffusion model implemented with a Transformer encoder–decoder architecture.

1. Conditional diffusion for skeleton sequences

Given an action label, we learn to denoise a noise-corrupted skeleton motion sequence through iterative diffusion steps. The model conditions on the action class (and diffusion timestep) to generate diverse motions that remain consistent with the intended action.
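
To make this concrete, below is a minimal sketch of one conditional denoising training step, assuming a standard DDPM-style epsilon-prediction objective; the names (denoiser, alphas_cumprod) and tensor shapes are illustrative, not the paper's exact implementation.

import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, x0, labels, alphas_cumprod):
    """One conditional denoising step; x0 holds clean skeletons, (B, T, J, 3)."""
    B = x0.shape[0]
    # Sample a random diffusion timestep per sequence.
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=x0.device)
    noise = torch.randn_like(x0)

    # Forward (noising) process: corrupt x0 according to timestep t.
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

    # The denoiser conditions on the action label and timestep, so the
    # predicted noise (and hence the generated motion) stays label-consistent.
    pred_noise = denoiser(x_t, t, labels)
    return F.mse_loss(pred_noise, noise)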

2. Transformer encoder–decoder denoiser

A Transformer encoder injects conditioning information (label + timestep) and captures global motion context, while a Transformer decoder predicts the denoised skeleton sequence. This design supports long-range temporal dependencies and produces realistic motion trajectories.
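
As a rough illustration, the sketch below injects label and timestep embeddings as encoder tokens that the decoder attends to via cross-attention while denoising the motion tokens; the dimensions and the two-token conditioning scheme are our assumptions for readability, not the exact architecture of Fig. 2.

import torch
import torch.nn as nn

class SkeletonDenoiser(nn.Module):
    def __init__(self, num_classes, num_joints=25, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.joint_embed = nn.Linear(num_joints * 3, d_model)  # per-frame motion token
        self.label_embed = nn.Embedding(num_classes, d_model)  # action condition
        self.time_embed = nn.Embedding(1000, d_model)          # diffusion timestep
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, num_joints * 3)

    def forward(self, x_t, t, labels):
        B, T, J, C = x_t.shape
        tokens = self.joint_embed(x_t.reshape(B, T, J * C))   # (B, T, d)
        cond = torch.stack([self.label_embed(labels),
                            self.time_embed(t)], dim=1)       # (B, 2, d) condition tokens
        h = self.transformer(src=cond, tgt=tokens)            # decoder denoises the motion
        return self.out(h).reshape(B, T, J, C)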

3. Balancing fidelity and diversity

To avoid mode collapse and low-quality samples, we combine (i) sampling-time dropout to increase diversity during generation and (ii) a Generative Refinement Module (GRM) to filter/refine low-fidelity sequences, yielding synthetic data that is both varied and usable for training.
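
The snippet below sketches how the two mechanisms might interact at generation time: keeping dropout active during sampling injects diversity, while the GRM is approximated here by a classifier-confidence filter, an assumption about its behaviour rather than the paper's exact module (sampler stands for the full reverse-diffusion loop).

import torch

@torch.no_grad()
def generate_filtered(denoiser, sampler, classifier, labels, conf_thresh=0.8):
    denoiser.train()   # leave dropout ON at sampling time -> more diverse motions
    classifier.eval()

    samples = sampler(denoiser, labels)   # reverse diffusion, (B, T, J, 3)

    # GRM-style fidelity check: keep only samples the auxiliary classifier
    # confidently assigns to the conditioning label.
    probs = classifier(samples).softmax(dim=-1)
    conf = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    keep = conf >= conf_thresh
    return samples[keep], labels[keep]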

4. Training with real + synthetic data

Finally, we train standard skeleton action recognisers on real-only data versus real + synthetic data generated by our model, as sketched below. Across HumanAct12 and NTU-VIBE, augmentation consistently improves accuracy, especially in low-data regimes.
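
A minimal sketch of the "Real + Ours" recipe, assuming the synthetic sequences have already been generated and GRM-filtered; the loader settings, epoch count, and plain cross-entropy loss are illustrative defaults, not the paper's protocol.

import torch.nn.functional as F
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def train_augmented(recogniser, optimiser, real_x, real_y, synth_x, synth_y,
                    epochs=50, batch_size=64):
    # Concatenate the reduced real split with the synthetic set and train
    # any off-the-shelf skeleton backbone (STGCN++, MSG3D, ...) on the mix.
    loader = DataLoader(
        ConcatDataset([TensorDataset(real_x, real_y),
                       TensorDataset(synth_x, synth_y)]),
        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            loss = F.cross_entropy(recogniser(x), y)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()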

Main Results

We evaluate generative augmentation by training skeleton action recognisers on real-only data versus real + synthetic (ours). Results are reported as mean ± std over 5 independent runs.

Source: Table II (HumanAct12) and Table III (Refined NTU-RGBD / NTU-VIBE) in the paper. Each table cell shows accuracy (%) as Real → Real + Ours (Δ); the percentage in the column header is the fraction of real training data used.

HumanAct12

| Backbone | 100% | 95% | 90% | 75% |
|----------|------|-----|-----|-----|
| STGCN++ | 78.47±2.09 → 83.19±2.73 (+4.72) | 77.78±2.55 → 81.63±2.05 (+3.85) | 75.83±1.24 → 81.50±1.47 (+5.66) | 73.89±0.38 → 81.11±0.80 (+7.22) |
| MSG3D | 80.42±1.99 → 83.11±3.46 (+2.69) | 77.64±1.50 → 83.24±1.23 (+5.60) | 76.94±2.43 → 81.77±1.18 (+4.83) | 74.86±1.80 → 80.50±0.68 (+5.64) |
| CTRGCN | 77.78±1.97 → 79.42±2.02 (+1.64) | 76.94±2.10 → 79.59±1.83 (+2.65) | 75.56±1.42 → 80.16±2.20 (+4.60) | 73.61±2.41 → 78.25±1.72 (+4.64) |
| BlockGCN | 77.78±1.30 → 78.91±0.41 (+1.13) | 75.67±1.30 → 78.67±1.63 (+3.00) | 75.56±0.76 → 78.19±0.38 (+2.63) | 75.56±0.90 → 77.17±0.72 (+1.61) |

Refined NTU-RGBD (NTU-VIBE)

| Backbone | 25% | 20% | 15% | 10% |
|----------|-----|-----|-----|-----|
| STGCN++ | 91.55±0.62 → 92.36±0.33 (+0.81) | 90.95±1.04 → 92.14±0.87 (+1.18) | 89.94±1.06 → 92.07±0.76 (+2.13) | 83.01±2.15 → 85.38±1.13 (+2.37) |
| MSG3D | 90.97±1.08 → 92.30±0.39 (+1.33) | 89.74±2.33 → 90.36±0.68 (+0.62) | 87.41±1.30 → 89.90±1.59 (+2.49) | 79.48±1.87 → 83.17±1.13 (+3.69) |
| CTRGCN | 90.81±1.07 → 91.13±1.34 (+0.32) | 90.78±0.20 → 90.97±0.49 (+0.19) | 87.57±2.69 → 89.45±0.35 (+1.88) | 79.28±1.46 → 83.17±1.34 (+3.89) |
| BlockGCN | 90.03±0.72 → 90.91±0.54 (+0.88) | 88.51±1.11 → 89.13±1.16 (+0.62) | 86.70±1.46 → 86.05±1.42 (-0.65) | 75.05±1.43 → 84.43±0.72 (+9.38) |

Note: “Real + Ours” corresponds to training on real samples augmented with synthetic sequences generated by our method (see paper for protocol details). Δ indicates the improvement in accuracy from augmentation.

HumanAct12 Visualisation

Visualisation of the HumanAct12 dataset.

Refined NTU-RGBD Visualisation

Visualisation of the Refined NTU-RGBD dataset.


BibTeX

@inproceedings{Dong2026SkelAug,
  author    = {Xu Dong and Wanqing Li and Anthony Adeyemi-Ejeye and Andrew Gilbert},
  title     = {Generative Data Augmentation for Skeleton Action Recognition},
  booktitle = {20th IEEE International Conference on Automatic Face and Gesture Recognition (FG'26)},
  year      = {2026}
}