We propose a generative data augmentation pipeline for skeleton action recognition that synthesises label-consistent 3D skeleton sequences and mixes them into the training data of downstream recognisers. The core generator is a label-conditioned diffusion model implemented with a Transformer encoder–decoder architecture.
1. Conditional diffusion for skeleton sequences
Given an action label, the model learns to reverse a fixed forward noising process, denoising a noise-corrupted skeleton motion sequence over iterative diffusion steps. Conditioning on the action class (and the diffusion timestep) lets it generate diverse motions that remain consistent with the intended action.
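As a concrete illustration, the snippet below sketches one training step under a DDPM-style noise-prediction objective. The `denoiser` interface, the tensor shape (batch, frames, joints, 3), and the linear schedule with T = 1000 steps are our assumptions; the summary does not fix these details.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule (assumed)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal-retention terms

def diffusion_loss(denoiser, x0, labels):
    """One training step. x0: clean skeletons (B, frames, joints, 3); labels: (B,) action ids."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                        # random timestep per sample
    a_bar = alpha_bars[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)                           # Gaussian corruption
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # forward noising q(x_t | x_0)
    eps_hat = denoiser(x_t, t, labels)                   # label- and timestep-conditioned prediction
    return torch.nn.functional.mse_loss(eps_hat, eps)    # noise-prediction objective
```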
2. Transformer encoder–decoder denoiser
A Transformer encoder injects conditioning information (label + timestep) and captures global motion context, while a
Transformer decoder predicts the denoised skeleton sequence. This design supports long-range temporal dependencies and
produces realistic motion trajectories.
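A minimal denoiser along these lines might look as follows. The layer counts, embedding sizes, joint count, and the choice to prepend label and timestep tokens to the encoder input are illustrative rather than the paper's exact design, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class SkeletonDenoiser(nn.Module):
    """Sketch of a Transformer encoder-decoder denoiser (hyperparameters assumed)."""
    def __init__(self, n_classes, joints=25, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.in_proj = nn.Linear(joints * 3, d_model)      # per-frame joint coords -> token
        self.label_emb = nn.Embedding(n_classes, d_model)  # action-class conditioning
        self.time_emb = nn.Embedding(1000, d_model)        # diffusion-timestep conditioning
        enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, n_layers)
        self.decoder = nn.TransformerDecoder(dec, n_layers)
        self.out_proj = nn.Linear(d_model, joints * 3)

    def forward(self, x_t, t, labels):
        b, f, j, _ = x_t.shape
        tokens = self.in_proj(x_t.view(b, f, j * 3))
        # Prepend label + timestep tokens so the encoder injects the conditioning
        # while attending over the full (global) motion context.
        cond = torch.stack([self.label_emb(labels), self.time_emb(t)], dim=1)
        memory = self.encoder(torch.cat([cond, tokens], dim=1))
        out = self.decoder(tokens, memory)                 # per-frame noise estimate
        return self.out_proj(out).view(b, f, j, 3)
```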
3. Balancing fidelity and diversity
To avoid mode collapse and low-quality samples, we combine (i) sampling-time dropout, which increases diversity during generation, with (ii) a Generative Refinement Module (GRM), which filters or refines low-fidelity sequences, yielding synthetic data that is both varied and usable for training.
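The sketch below illustrates the idea under stated assumptions: dropout is simply left active during ancestral sampling, and a hypothetical pretrained `scorer` (a recogniser returning class logits) stands in for the GRM, whose internals the summary does not specify. It reuses `T`, `betas`, and `alpha_bars` from the first sketch.

```python
import torch

@torch.no_grad()
def sample_and_filter(denoiser, scorer, labels, n_frames=60, joints=25, keep_thresh=0.8):
    """Generate diverse samples via sampling-time dropout, then keep high-fidelity ones."""
    denoiser.train()                         # keep dropout layers active at sampling time
    b = labels.shape[0]
    x = torch.randn(b, n_frames, joints, 3)  # start from pure noise
    for t in reversed(range(T)):             # DDPM ancestral sampling loop
        tt = torch.full((b,), t, dtype=torch.long)
        eps_hat = denoiser(x, tt, labels)
        beta, a_bar = betas[t], alpha_bars[t]
        x = (x - beta / (1 - a_bar).sqrt() * eps_hat) / (1 - beta).sqrt()
        if t > 0:
            x = x + beta.sqrt() * torch.randn_like(x)
    # Confidence-based filtering as a stand-in for the GRM: keep samples the
    # scorer assigns high probability under their intended label.
    conf = scorer(x).softmax(-1).gather(1, labels[:, None]).squeeze(1)
    keep = conf >= keep_thresh
    return x[keep], labels[keep]
```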
4. Training with real + synthetic data
Finally, we train standard skeleton action recognisers on real data alone versus real data augmented with sequences generated by our model. Across HumanAct12 and NTU-VIBE, augmentation consistently improves accuracy, especially in low-data regimes.
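As a usage sketch, mixing the filtered synthetic sequences into the real training set can be as simple as concatenating the two datasets; the tensor names here are hypothetical.

```python
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def make_augmented_loader(real_x, real_y, synth_x, synth_y, batch_size=64):
    """Combine real and filtered synthetic sequences into one shuffled training loader."""
    mixed = ConcatDataset([TensorDataset(real_x, real_y),
                           TensorDataset(synth_x, synth_y)])
    return DataLoader(mixed, batch_size=batch_size, shuffle=True)
```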