PLOT-TAL: Prompt Learning with Optimal Transport for Few-Shot Temporal Action Localization

Ed Fish, Andrew Gilbert

University of Surrey
International Conference on Computer Vision (ICCV’25) – 6th Workshop on Closing the Loop between Vision and Language, 2025


Abstract

Few-shot temporal action localization (TAL) methods that adapt large models via single-prompt tuning often fail to produce precise temporal boundaries. This is because a single prompt tends to learn a non-discriminative mean representation from sparse data, limiting generalization. We propose PLOT-TAL, a multi-prompt ensemble framework that encourages each prompt to specialize on compositional sub-events of an action. To enforce this specialization, we leverage Optimal Transport (OT) to find globally optimal alignments between the prompt ensemble and a video’s temporal features. Our approach eliminates the need for complex meta-learning while achieving state-of-the-art results on THUMOS’14 and EPIC-Kitchens. The significant improvements, particularly at higher IoU thresholds, validate that learning distributed, compositional representations leads to more precise temporal localization in few-shot settings.

TL;DR: PLOT-TAL replaces single prompts with Optimal Transport–regulated prompt ensembles, learning compositional sub-events for actions. This yields state-of-the-art few-shot TAL, especially at high IoU thresholds.

Learning for Few-Shot Generalization. A single prompt trained on a few examples of “diving” in a specific context (top right) tends to overfit to environmental cues such as the cliffs and sea, and this holistic representation fails to generalize to a novel environment. In contrast, our method (bottom right) learns an ensemble of prompts that specialize on the compositional, environment-agnostic sub-events of the action: (1) the preparation/stance, (2) the mid-air rotation, and (3) the water-entry splash. Optimal Transport enforces this specialization, keeping the prompts diverse and discriminative. By identifying these core components from only a few samples, our framework can localize the “diving” action precisely even in a completely different environment, such as an indoor swimming pool (left panel).

Overview of PLOT-TAL. (A) Extract video clips. (B) Generate an ensemble of learnable prompts per class. (C) Encode video features with a frozen 3D CNN and text prompts with a frozen CLIP encoder. (D) Build a temporal feature pyramid. (E) Use Optimal Transport to align prompts with video features at multiple resolutions. (F) Pass aligned features to lightweight heads for classification and boundary regression. Only the fire-marked modules are trained; all others are frozen.
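Step (E) can be illustrated with a small, self-contained sketch of entropy-regularized Optimal Transport between a prompt ensemble and a sequence of temporal features. This is not the authors' implementation. The Sinkhorn solver, the cosine cost, the uniform marginals, the regularization strength `eps`, and the toy 2-D vectors are all assumptions made for illustration; the page above only states that Optimal Transport is used to align prompts with video features.

```python
import math

def cosine_cost(prompts, frames):
    """Cost matrix C[i][j] = 1 - cos(prompt_i, frame_j). Assumes non-zero vectors."""
    def unit(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]
    p_unit = [unit(p) for p in prompts]
    f_unit = [unit(f) for f in frames]
    return [[1.0 - sum(a * b for a, b in zip(p, f)) for f in f_unit] for p in p_unit]

def sinkhorn(cost, eps=0.1, n_iters=100):
    """Approximate OT plan with uniform marginals via Sinkhorn iterations.

    eps and n_iters are illustrative defaults, not values from the paper.
    """
    n_prompts, n_frames = len(cost), len(cost[0])
    mu = 1.0 / n_prompts  # mass each prompt sends (assumed uniform)
    nu = 1.0 / n_frames   # mass each frame receives (assumed uniform)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n_prompts
    v = [1.0] * n_frames
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [mu / sum(kernel[i][j] * v[j] for j in range(n_frames))
             for i in range(n_prompts)]
        v = [nu / sum(kernel[i][j] * u[i] for i in range(n_prompts))
             for j in range(n_frames)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(n_frames)]
            for i in range(n_prompts)]

# Toy data (hypothetical): 2 prompt embeddings and 4 frame features.
prompts = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.1], [0.9, 0.2], [0.2, 0.9], [0.1, 1.0]]
plan = sinkhorn(cosine_cost(prompts, frames))
```

In this toy setting the plan sends most of the mass of the first two frames to the first prompt and most of the mass of the last two frames to the second prompt. Because every prompt must carry an equal share of the total mass, no single prompt can absorb all frames, which is one way a transport constraint can encourage prompts to stay diverse. A practical version would operate on batched tensors at each level of the feature pyramid rather than on Python lists.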

PLOT-TAL outperforms state-of-the-art few-shot TAL methods on THUMOS’14. The graph shows the mean Average Precision (mAP) at different IoU thresholds, demonstrating significant improvements, especially at higher IoU levels. This indicates that our method provides more precise temporal localization, confirming the effectiveness of Optimal Transport in learning discriminative prompt ensembles.
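To make the role of the IoU threshold concrete, the sketch below computes temporal IoU between a predicted and a ground-truth segment and checks whether the prediction would count as correct at several thresholds. The function names, the example segments, and the threshold list 0.3 to 0.7 are assumptions for illustration; the thresholds used in the figure are not listed on this page. Full mAP additionally ranks predictions by confidence and averages over classes, which is not shown here.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def matches_at(pred, gt, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Map each IoU threshold to whether the prediction passes it."""
    iou = temporal_iou(pred, gt)
    return {t: iou >= t for t in thresholds}

# Hypothetical segments for a single action instance.
ground_truth = (10.0, 20.0)
loose = (8.0, 24.0)    # overlap 10 s, union 16 s, IoU = 0.625
tight = (10.5, 20.5)   # overlap 9.5 s, union 10.5 s, IoU is about 0.905

loose_result = matches_at(loose, ground_truth)  # passes 0.3-0.6, fails 0.7
tight_result = matches_at(tight, ground_truth)  # passes every threshold
```

The loose prediction covers the whole action but is still rejected at the 0.7 threshold, so gains at high IoU can only come from tighter start and end boundaries.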

Normalized transport cost per prompt for “Cricket Shot” after training. Each cost is normalized by the maximum cost across all prompts so that prompts can be compared directly; a lower transport cost indicates closer alignment under the optimal transport plan. Prompt 1 aligns with global information, while the other prompts capture complementary sub-events. The spread of costs across prompts shows that each prompt specializes in a different, discriminative aspect of the action, which leads to improved temporal localization.
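The normalization described in this caption can be sketched as follows. The definition of a prompt's transport cost as the sum of plan-weighted costs over all frames is an assumption, as are the function names and the toy matrices; the caption only states that costs are normalized to the maximum across prompts.

```python
def per_prompt_cost(plan, cost):
    """Transport cost carried by each prompt: sum_j plan[i][j] * cost[i][j].

    This per-row definition is an assumption, not taken from the paper.
    """
    return [sum(t * c for t, c in zip(plan_row, cost_row))
            for plan_row, cost_row in zip(plan, cost)]

def normalize_by_max(values):
    """Scale values so the largest equals 1.0 (all zeros stay zero)."""
    peak = max(values)
    return [v / peak for v in values] if peak > 0 else [0.0 for _ in values]

# Hypothetical transport plan and cost matrix: 2 prompts x 2 frames.
plan = [[0.4, 0.1], [0.1, 0.4]]
cost = [[0.1, 0.9], [0.6, 0.3]]
normalized = normalize_by_max(per_prompt_cost(plan, cost))
```

With these toy numbers the second prompt carries the largest cost and is scaled to 1.0, while the first prompt receives a value below 1.0, indicating closer alignment.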

BibTeX

@inproceedings{Fish2025PLOTTAL,
  author    = {Edward Fish and Andrew Gilbert},
  title     = {PLOT-TAL: Prompt Learning with Optimal Transport for Few-Shot Temporal Action Localization},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  year      = {2025},
  series    = {Closing the Loop between Vision and Language, 6th Workshop},
  publisher = {IEEE},
  url       = {https://andrewjohngilbert.github.io/PLOT-TAL/}
}