PLOT-TAL--Prompt Learning with Optimal Transport for Few-Shot Temporal Action Localization

University of Surrey
arXiv preprint arXiv:2403.18915, 2024

Existing methods learn a single prompt to identify the location and class of a given action, but multiple complementary views can help with both class generalization and temporal discrimination within the video. Green frames indicate the foreground features.

An overview of the approach. (A) We sample T overlapping segments from each video V. (B) For each class label k, we randomly initialize N learnable vectors, each concatenated with the class label. (C) Video features are extracted with a pre-trained 3D CNN encoder (I3D), and the N prompts for each class k are encoded with the pre-trained CLIP text encoder. (D) We temporally downsample the video features using max-pooling. (E) At each temporal level, we search for the optimal transport plan between the N prompt features and the video segments; following this stage, we sum the N vectors for each class k. (F) At each temporal level L, we compute the cosine similarity between each prompt vector P_k and each video segment x_i, then apply a threshold to retrieve action candidates. These candidates are passed to the regression head, which minimizes the distance between the action start and end points and each embedding. Only components marked with the fire symbol are trained; all others are frozen.
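Steps D to F can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the use of Sinkhorn iterations with uniform marginals, the cost `1 - cosine similarity`, the plan-weighted scoring rule, and every function name, shape, and hyperparameter (`epsilon`, `n_iters`, `threshold`, `stride`) are assumptions made for illustration. In particular, the caption does not say how the transport plan is combined with the summed prompt vectors, so the weighted average below is only one plausible reading.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def temporal_max_pool(features, stride=2):
    """Downsample (T, D) features along time with non-overlapping max-pooling."""
    t, d = features.shape
    t_trim = (t // stride) * stride  # drop a trailing remainder, if any
    return features[:t_trim].reshape(t_trim // stride, stride, d).max(axis=1)


def sinkhorn_plan(cost, epsilon=0.1, n_iters=100):
    """Entropy-regularised transport plan between uniform marginals (assumed solver)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # mass on each prompt
    b = np.full(m, 1.0 / m)  # mass on each video segment
    kernel = np.exp(-cost / epsilon)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    return u[:, None] * kernel * v[None, :]


def ot_segment_scores(prompts, segments, epsilon=0.1, n_iters=100):
    """Score each segment for one class.

    prompts:  (N, D) prompt features for a single class.
    segments: (T, D) video segment features at one temporal level.
    Returns a (T,) array of plan-weighted cosine similarities.
    """
    sim = l2_normalize(prompts) @ l2_normalize(segments).T  # (N, T)
    plan = sinkhorn_plan(1.0 - sim, epsilon, n_iters)       # (N, T)
    # Weighted average of the N similarities per segment (illustrative choice).
    return (plan * sim).sum(axis=0) / plan.sum(axis=0)


# Toy data: N = 4 prompts, T = 10 segments, D = 16 feature dimensions.
rng = np.random.default_rng(0)
prompts = rng.normal(size=(4, 16))
segments = rng.normal(size=(10, 16))

pooled = temporal_max_pool(segments, stride=2)  # next, coarser temporal level
plan = sinkhorn_plan(1.0 - l2_normalize(prompts) @ l2_normalize(segments).T)
scores = ot_segment_scores(prompts, segments)

threshold = 0.0  # hypothetical value
candidates = np.flatnonzero(scores > threshold)  # indices of candidate segments
```

In this sketch the same scoring function would be called once per class and per temporal level, for example on `segments` and then on `pooled`, and the surviving `candidates` would be handed to a regression head that is not shown here.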

Abstract

This paper introduces a novel approach to temporal action localization (TAL) in few-shot learning. Our work addresses the inherent limitations of conventional single-prompt learning methods that often lead to overfitting due to the inability to generalize across varying contexts in real-world videos. Recognizing the diversity of camera views, backgrounds, and objects in videos, we propose a multi-prompt learning framework enhanced with optimal transport. This design allows the model to learn a set of diverse prompts for each action, capturing general characteristics more effectively and distributing the representation to mitigate the risk of overfitting. Furthermore, by employing optimal transport theory, we efficiently align these prompts with action features, optimizing for a comprehensive representation that adapts to the multifaceted nature of video data. Our experiments demonstrate significant improvements in action localization accuracy and robustness in few-shot settings on the standard challenging datasets of THUMOS-14 and EpicKitchens100, highlighting the efficacy of our multi-prompt optimal transport approach in overcoming the challenges of conventional few-shot TAL methods.


BibTeX

@inproceedings{Fish:arxiv:2024,
        AUTHOR = "Fish, Ed and Weinbren, Jon and Gilbert, Andrew",
        TITLE = "PLOT-TAL--Prompt Learning with Optimal Transport for Few-Shot Temporal Action Localization",
        BOOKTITLE = "ArXiv abs/2403.18915",
        YEAR = "2024",
}