FILS: Self-Supervised Video Feature Prediction In Semantic Language Space

University of Surrey
British Machine Vision Conference (BMVC'24), 2024.

We introduce FILS, a framework for deeper video understanding that learns a unified embedding space integrating video and natural language for multimodal learning. The image above presents an overview of our method, which has two objectives: 1) Feature Prediction, where the masked input is encoded by a randomly initialised student and a predictor predicts the representations of the masked regions, while a teacher encodes the full, unmasked input to produce the targets for this learning task. To prevent collapse, the teacher tracks the student: its weights are an exponential moving average of the student's parameters. 2) ActCLIP, an auxiliary CLIP-based self-supervised objective that performs contrastive learning between motion/action-area patches and the relevant text, aligning the video and language spaces to learn semantic context.
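For concreteness, the sketch below illustrates the student/teacher feature-prediction branch described above. It is a minimal sketch under stated assumptions: the module names (student, teacher, predictor), the masking convention, and the smooth-L1 loss are illustrative choices, not the released FILS code.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights track the student as an exponential moving average,
    # which prevents the prediction targets from collapsing.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def feature_prediction_loss(student, teacher, predictor, video, mask):
    # video: (B, T, C, H, W) clip; mask: boolean (B, N) over patch tokens,
    # True where a patch is hidden from the student.
    student_tokens = student(video, mask=mask)        # encode visible patches only
    predicted = predictor(student_tokens, mask=mask)  # predict features of masked patches

    with torch.no_grad():
        targets = teacher(video)                        # encode the full, unmasked clip
        targets = targets[mask].view(*predicted.shape)  # keep masked positions as targets
        # (assumes every sample masks the same number of patches)

    return F.smooth_l1_loss(predicted, targets)

# Typical setup: the teacher starts as a frozen deep copy of the student and is
# refreshed with ema_update(teacher, student) after every optimiser step.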

Abstract

This paper demonstrates a self-supervised approach for learning semantic video representations. Recent vision studies show that combining a masking strategy with natural language supervision contributes to developing transferable visual pretraining. Our goal is to achieve a more semantic video representation by leveraging text related to the video content during pretraining, in a fully self-supervised manner. To this end, we present FILS, a novel approach to Self-Supervised Video Feature Prediction In Semantic Language Space. The vision model can capture valuable structured information by correctly predicting masked feature semantics in language space. It is learned using a patch-wise video-text contrastive strategy, in which text representations act as prototypes that transform vision features into a language space; these are then used as targets for semantically meaningful feature prediction with our masked encoder-decoder structure. FILS demonstrates remarkable transferability on downstream action recognition tasks, achieving state-of-the-art results on challenging egocentric datasets such as Epic-Kitchens, Something-SomethingV2, Charades-Ego, and EGTEA, using ViT-Base. Our efficient method requires less computation and smaller batch sizes than previous works.
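As a rough illustration of the "text representations as prototypes" idea in the abstract, the snippet below maps patch features onto a set of text embeddings to obtain language-space codes. The function name, temperature, and softmax coding are assumptions made for this sketch, not the exact FILS formulation.

import torch.nn.functional as F

def to_language_space(patch_feats, text_protos, temperature=0.07):
    # patch_feats: (B, N, D) patch embeddings from the video encoder.
    # text_protos: (K, D) embeddings of K text prompts/captions acting as prototypes.
    # Returns (B, N, K): each patch expressed as a similarity profile over the
    # text prototypes, i.e. a language-space code usable as a prediction target.
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_protos = F.normalize(text_protos, dim=-1)
    logits = patch_feats @ text_protos.t() / temperature  # scaled cosine similarities
    return logits.softmax(dim=-1)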

To gain deeper insight into the representations learned by FILS, we employ Grad-CAMs to visualise the prominent areas that contribute most to the action recognition task. This visualisation helps us better understand the spatiotemporal cues acquired during the self-supervised learning step. The image above shows attention heatmaps for the first, middle, and last frames of a few sample videos from the Epic-Kitchens-100 dataset, produced by models trained with three strategies: our proposed FILS, our first objective alone (Feature Prediction, FP), and an MSE reconstruction loss in the pixel domain.
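The heatmaps follow a standard Grad-CAM procedure; a simplified version adapted to a ViT-style encoder is sketched below. The hook-based capture, the 14x14 patch grid, and the single-frame assumption are illustrative choices, not the exact visualisation code used for the figure.

import torch

def gradcam_heatmap(model, target_layer, clip, class_idx, grid=(14, 14)):
    # Gradient-weighted activation map over the patch tokens of target_layer.
    # clip: (1, T, C, H, W) video; grid: assumed patch layout of one frame.
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(out=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(out=go[0]))

    logits = model(clip)                  # (1, num_classes)
    logits[0, class_idx].backward()       # gradients w.r.t. the chosen action class
    h1.remove(); h2.remove()

    tokens = feats["out"]                 # (1, N, D) token activations
    token_grads = grads["out"]            # (1, N, D) their gradients
    weights = token_grads.mean(dim=1, keepdim=True)           # per-channel importance
    cam = (weights * tokens).sum(dim=-1)[0, 1:].detach()      # weighted sum, drop CLS token
    cam = cam.relu().reshape(grid)                            # back onto the patch grid
    return cam / (cam.max() + 1e-8)                           # normalise to [0, 1] for overlay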


BibTeX

@inproceedings{Ahmadian:ArXiv_2406.03447:2024,
        AUTHOR = "Ahmadian, Mona and Guerin, Frank and Gilbert, Andrew",
        TITLE = "Self-Supervised Video Feature Prediction In Semantic Language Space (FILS)",
        BOOKTITLE = "ArXiv abs/2406.03447",
        YEAR = "2024",
      }