Advancing Efficiency and Accessibility in Multimodal Video Understanding with Deep Learning

PhD in Multimodal Video Understanding, University of Surrey, UK
2024

TL;DR: This PhD develops resource-efficient deep learning methods that make multimodal video understanding more accessible:

  • Weakly supervised multimodal semantic clustering for fine-grained video retrieval and recommendation.
  • A two-stream long-form video transformer using pre-trained encoders and multi-resolution spatiotemporal features.
  • Multi-resolution audio-visual fusion with gated cross-attention for temporal action localisation.
  • PLOT-TAL: prompt learning with optimal transport for few-shot temporal action localisation using pre-trained vision–language features.

Retrieval results from our fine-grained semantic clustering model. Compared with traditional genre labels, the learned multimodal embeddings group films by their underlying themes, style and actions, enabling more meaningful video recommendation and retrieval.

Abstract

In the rapidly expanding digital landscape, the ability to extract meaningful insights from vast quantities of video content is transformative. However, many organisations face a critical challenge: they lack the substantial computational resources and the time-intensive annotation processes required to leverage advanced video analysis technologies fully. This thesis addresses this gap by introducing several resource-efficient deep learning strategies tailored for multimodal video understanding applications. The presented methodologies focus on leveraging pre-trained foundational neural networks for multimodal feature extraction, fusion, and spatiotemporal understanding.

First, we present a method for fusing multimodal features from video to enable style and semantic clustering of weakly labelled video data. By tapping into pre-trained foundational models, the method captures intricate contextual cues within multimodal video data, improving semantic video recommendation and retrieval.
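The fuse-then-cluster idea can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the real system learns the fused embedding space, whereas here each modality's (hypothetical) pre-extracted embeddings are simply L2-normalised, concatenated, and grouped with a plain k-means loop. All names, dimensions, and the clustering choice are illustrative assumptions.

```python
import numpy as np

def fuse_and_cluster(video_emb, audio_emb, text_emb, k=3, n_iters=20, seed=0):
    """L2-normalise each modality, concatenate into one multimodal
    embedding per clip, then run a simple k-means to form semantic groups."""
    def l2(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    X = np.concatenate([l2(video_emb), l2(audio_emb), l2(text_emb)], axis=1)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # random init
    for _ in range(n_iters):
        # assign each clip to its nearest centroid
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        # move each centroid to the mean of its assigned clips
        for c in range(k):
            if (labels == c).any():
                centroids[c] = X[labels == c].mean(axis=0)
    return labels

rng = np.random.default_rng(1)
labels = fuse_and_cluster(rng.standard_normal((30, 16)),   # video embeddings
                          rng.standard_normal((30, 8)),    # audio embeddings
                          rng.standard_normal((30, 8)))    # text embeddings
print(labels.shape)  # one cluster id per clip: (30,)
```

Normalising per modality before concatenation stops any single modality with a larger embedding norm from dominating the distance metric.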

Turning to the challenge of long-form video understanding, we present an architecture that uses pre-trained encoders to extract spatiotemporal features at multiple resolutions. This approach achieves state-of-the-art performance on tasks requiring fine-grained temporal analysis, such as speaker recognition and character identification, while maintaining computational efficiency.
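One simple way to obtain coarse-to-fine temporal views of the same clip is to pool frozen-encoder frame features at several strides. This is a hedged sketch of that general idea only; the actual two-stream transformer is more involved, and the strides and shapes below are illustrative assumptions.

```python
import numpy as np

def multi_resolution_features(frames, strides=(1, 2, 4)):
    """Average-pool a (T, d) feature sequence at several temporal strides,
    producing a pyramid of coarse-to-fine views of the same clip."""
    pyramid = []
    for s in strides:
        T = frames.shape[0] - frames.shape[0] % s  # trim to a multiple of s
        pooled = frames[:T].reshape(T // s, s, -1).mean(axis=1)
        pyramid.append(pooled)
    return pyramid

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 8))   # e.g. 16 frame features from a frozen encoder
levels = multi_resolution_features(feats)
print([lvl.shape for lvl in levels])   # [(16, 8), (8, 8), (4, 8)]
```

Finer levels preserve short events; coarser levels summarise longer context cheaply, which is what makes long-form inputs tractable.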

We then introduce a gated cross-attention mechanism for audio-visual fusion in temporal action localisation, which integrates audio and visual features for activity recognition and localisation. The result is a low-parameter solution that makes fuller use of the available data while improving performance over unimodal approaches.
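The general shape of such a gated cross-attention fusion can be sketched as follows. This is a minimal numpy illustration of the generic pattern, assuming random weights in place of learned ones; it is not the thesis architecture, and all dimensions and weight names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cross_attention(visual, audio, rng):
    """Attend from visual queries to audio keys/values, then let a
    learned gate decide how much audio context each time step admits."""
    d = visual.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Wg = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)

    q, k, v = visual @ Wq, audio @ Wk, audio @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))          # (T_visual, T_audio)
    context = attn @ v                            # audio context per visual step
    gate = sigmoid(np.concatenate([visual, context], axis=-1) @ Wg)
    return visual + gate * context                # gated residual fusion

rng = np.random.default_rng(0)
T_v, T_a, d = 8, 12, 16
fused = gated_cross_attention(rng.standard_normal((T_v, d)),
                              rng.standard_normal((T_a, d)), rng)
print(fused.shape)  # (8, 16)
```

The gate is what keeps the fusion low-risk: when the audio stream is uninformative for a given time step, the sigmoid can close and the visual features pass through almost unchanged.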

The final contribution of this thesis is a technique for aligning text prompts with visual features using prompt learning and optimal transport. By leveraging pre-trained vision–language features and optimising only a small number of learnable parameters, this strategy significantly reduces training overhead and improves generalisation, enabling precise action localisation and discrimination between foreground and background using only a few labelled samples per class.
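The optimal-transport half of this idea can be illustrated with a standard Sinkhorn iteration: given a cost matrix between a few prompt embeddings and a set of visual features, entropic-regularised OT produces a soft alignment (transport plan) between them. This is a generic sketch under uniform-marginal assumptions, not the PLOT-TAL objective; the embeddings and the cosine cost are illustrative.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic-regularised optimal transport between uniform marginals.
    Returns a transport plan whose rows/columns sum to the marginals."""
    K_mat = np.exp(-cost / eps)
    m, n = cost.shape
    a, b = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    u = np.ones(m)
    for _ in range(n_iters):       # alternate row/column scaling
        v = b / (K_mat.T @ u)
        u = a / (K_mat @ v)
    return u[:, None] * K_mat * v[None, :]

rng = np.random.default_rng(0)
prompts  = rng.standard_normal((4, 32))   # K learnable prompt embeddings (hypothetical)
features = rng.standard_normal((10, 32))  # N frame-level visual features (hypothetical)

# cosine cost: aligning similar prompt/feature pairs is cheap
np_norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
cost = 1.0 - np_norm(prompts) @ np_norm(features).T
plan = sinkhorn(cost)
print(plan.sum())  # a joint distribution over prompt/feature pairs, sums to ~1.0
```

Because the plan must spread mass across all prompts, each prompt is encouraged to explain a different part of the feature set, rather than all prompts collapsing onto the single most discriminative frame.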

Collectively, these contributions make advanced video analysis tools accessible to a broader audience, including those constrained by computational or financial limitations. This work advances the technical boundaries of video analysis and democratises its applications, fostering innovation across fields.

Downloads

Please cite the thesis if you use these ideas or models in your own work. A BibTeX entry is provided below.

Publications Linked to This Thesis

The core chapters of this thesis are closely related to the following peer-reviewed publications:

  1. Rethinking Genre Classification with Fine-Grained Semantic Clustering
    Ed Fish, Jon Weinbren, Andrew Gilbert.
    IEEE International Conference on Image Processing (ICIP), 2021.
    Project page
  2. Two-Stream Transformer Architecture for Long Form Video Understanding
    Ed Fish, Jon Weinbren, Andrew Gilbert.
    British Machine Vision Conference (BMVC), 2022.
    Paper (PDF)
  3. Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization
    Ed Fish, Jon Weinbren, Andrew Gilbert.
    NeurIPS 2023 Workshop on Machine Learning for Audio, 2023.
    Project page
  4. PLOT-TAL – Prompt Learning with Optimal Transport for Few-Shot Temporal Action Localization
    Ed Fish, Andrew Gilbert.
    International Conference on Computer Vision (ICCV’25) – 6th Workshop on Closing the Loop between Vision and Language, 2025.
    Project page

These works are integrated and extended within the thesis, which provides a unified framework for efficient, multimodal video understanding under realistic resource constraints.

BibTeX

@phdthesis{Fish2024Thesis,
  author    = {Ed Fish},
  title     = {Advancing Efficiency and Accessibility in Multimodal Video Understanding with Deep Learning},
  school    = {University of Surrey},
  year      = {2024},
  address   = {Guildford, United Kingdom},
  note      = {PhD thesis on resource-efficient multimodal video understanding, including semantic clustering, long-form video transformers, multi-resolution audio-visual fusion, and prompt-based few-shot temporal action localisation}
}