We propose a generative data augmentation pipeline for skeleton action recognition that synthesises label-consistent 3D skeleton sequences and mixes them into the training data of downstream recognisers. The core generator is a label-conditioned diffusion model implemented with a Transformer encoder–decoder architecture.
1. Conditional diffusion for skeleton sequences
Given an action label, the model learns to reverse a fixed forward noising process, denoising a noise-corrupted skeleton motion sequence over iterative diffusion steps. Conditioning on the action class (and the diffusion timestep) lets it generate diverse motions that remain consistent with the intended action.
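As a concrete illustration, the snippet below sketches one training step under a DDPM-style noise-prediction objective. The `denoiser` interface, the tensor shape (batch, frames, joints, 3), and the linear schedule with T = 1000 steps are our assumptions; the summary does not fix these details.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule (assumed)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal-retention terms

def diffusion_loss(denoiser, x0, labels):
    """One training step. x0: clean skeletons (B, frames, joints, 3); labels: (B,) action ids."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                        # random timestep per sample
    a_bar = alpha_bars[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)                           # Gaussian corruption
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # forward noising q(x_t | x_0)
    eps_hat = denoiser(x_t, t, labels)                   # label- and timestep-conditioned prediction
    return torch.nn.functional.mse_loss(eps_hat, eps)    # noise-prediction objective
```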
2. Transformer encoder–decoder denoiser
A Transformer encoder injects conditioning information (label + timestep) and captures global motion context, while a
Transformer decoder predicts the denoised skeleton sequence. This design supports long-range temporal dependencies and
produces realistic motion trajectories.
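A minimal denoiser along these lines might look as follows. The layer counts, embedding sizes, joint count, and the choice to prepend label and timestep tokens to the encoder input are illustrative rather than the paper's exact design, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class SkeletonDenoiser(nn.Module):
    """Sketch of a Transformer encoder-decoder denoiser (hyperparameters assumed)."""
    def __init__(self, n_classes, joints=25, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.in_proj = nn.Linear(joints * 3, d_model)      # per-frame joint coords -> token
        self.label_emb = nn.Embedding(n_classes, d_model)  # action-class conditioning
        self.time_emb = nn.Embedding(1000, d_model)        # diffusion-timestep conditioning
        enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, n_layers)
        self.decoder = nn.TransformerDecoder(dec, n_layers)
        self.out_proj = nn.Linear(d_model, joints * 3)

    def forward(self, x_t, t, labels):
        b, f, j, _ = x_t.shape
        tokens = self.in_proj(x_t.view(b, f, j * 3))
        # Prepend label + timestep tokens so the encoder injects the conditioning
        # while attending over the full (global) motion context.
        cond = torch.stack([self.label_emb(labels), self.time_emb(t)], dim=1)
        memory = self.encoder(torch.cat([cond, tokens], dim=1))
        out = self.decoder(tokens, memory)                 # per-frame noise estimate
        return self.out_proj(out).view(b, f, j, 3)
```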
3. Balancing fidelity and diversity
To avoid mode collapse and low-quality samples, we combine (i) sampling-time dropout, which increases diversity during generation, with (ii) a Generative Refinement Module (GRM), which filters or refines low-fidelity sequences, yielding synthetic data that is both varied and usable for training.
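The sketch below illustrates the idea under stated assumptions: dropout is simply left active during ancestral sampling, and a hypothetical pretrained `scorer` (a recogniser returning class logits) stands in for the GRM, whose internals the summary does not specify. It reuses `T`, `betas`, and `alpha_bars` from the first sketch.

```python
import torch

@torch.no_grad()
def sample_and_filter(denoiser, scorer, labels, n_frames=60, joints=25, keep_thresh=0.8):
    """Generate diverse samples via sampling-time dropout, then keep high-fidelity ones."""
    denoiser.train()                         # keep dropout layers active at sampling time
    b = labels.shape[0]
    x = torch.randn(b, n_frames, joints, 3)  # start from pure noise
    for t in reversed(range(T)):             # DDPM ancestral sampling loop
        tt = torch.full((b,), t, dtype=torch.long)
        eps_hat = denoiser(x, tt, labels)
        beta, a_bar = betas[t], alpha_bars[t]
        x = (x - beta / (1 - a_bar).sqrt() * eps_hat) / (1 - beta).sqrt()
        if t > 0:
            x = x + beta.sqrt() * torch.randn_like(x)
    # Confidence-based filtering as a stand-in for the GRM: keep samples the
    # scorer assigns high probability under their intended label.
    conf = scorer(x).softmax(-1).gather(1, labels[:, None]).squeeze(1)
    keep = conf >= keep_thresh
    return x[keep], labels[keep]
```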
4. Training with real + synthetic data
Finally, we train standard skeleton action recognisers on real data alone versus real data augmented with sequences generated by our model. Across HumanAct12 and NTU-VIBE, augmentation consistently improves accuracy, especially in low-data regimes.
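As a usage sketch, mixing the filtered synthetic sequences into the real training set can be as simple as concatenating the two datasets; the tensor names here are hypothetical.

```python
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def make_augmented_loader(real_x, real_y, synth_x, synth_y, batch_size=64):
    """Combine real and filtered synthetic sequences into one shuffled training loader."""
    mixed = ConcatDataset([TensorDataset(real_x, real_y),
                           TensorDataset(synth_x, synth_y)])
    return DataLoader(mixed, batch_size=batch_size, shuffle=True)
```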