In the rapidly expanding digital landscape, the ability to extract meaningful insights from vast quantities of video content is transformative. However, many organisations face a critical challenge: they lack the substantial computational resources, and the capacity for time-intensive annotation, required to fully leverage advanced video analysis technologies. This thesis addresses this gap by introducing several resource-efficient deep learning strategies tailored to multimodal video understanding. The presented methodologies leverage pre-trained foundational neural networks for multimodal feature extraction, fusion, and spatiotemporal understanding.
First, we present a method for fusing multimodal video features to enable style and semantic clustering of weakly labelled video data. By tapping into the capabilities of pre-trained foundational models, this method captures intricate contextual cues within multimodal video data, improving semantic video recommendation and retrieval.
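To make the idea concrete, the following is a minimal sketch of late fusion of frozen pre-trained features followed by clustering. The encoders, dimensions, and the choice of concatenation plus k-means are illustrative assumptions, not the exact pipeline developed in the thesis.

```python
# Minimal sketch: fuse features from two frozen pre-trained encoders,
# then cluster the fused representations. Shapes are illustrative.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def fuse_and_cluster(visual_feats, audio_feats, n_clusters=10):
    """visual_feats: (N, Dv), audio_feats: (N, Da), one row per video."""
    v = F.normalize(visual_feats, dim=-1)  # unit-normalise each modality
    a = F.normalize(audio_feats, dim=-1)   # so neither dominates the fusion
    fused = torch.cat([v, a], dim=-1)      # simple late fusion by concatenation
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(fused.numpy())

# Example with random stand-ins for per-video embeddings:
labels = fuse_and_cluster(torch.randn(256, 512), torch.randn(256, 128))
```

Normalising each modality before concatenation is one common way to keep the clustering from being driven by whichever encoder produces larger-magnitude features.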
Turning to the challenge of long-video understanding, we present an architecture that utilises pre-trained encoders to extract spatiotemporal features at multiple temporal resolutions. This approach achieves state-of-the-art performance on tasks requiring fine-grained temporal analysis, such as speaker recognition and character identification, while maintaining computational efficiency.
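One way to picture multi-resolution extraction with a frozen encoder is the sketch below: the same frame sequence is sampled at several temporal strides and pooled into one feature per resolution. The strides, the mean pooling, and the stand-in encoder are assumptions for illustration only.

```python
# Minimal sketch of multi-resolution temporal feature extraction
# from a frozen pre-trained encoder.
import torch

def multiresolution_features(frames, encoder, strides=(1, 4, 16)):
    """frames: (T, C, H, W). Returns one pooled feature per temporal stride."""
    feats = []
    with torch.no_grad():                # the encoder stays frozen
        for s in strides:
            clip = frames[::s]           # coarser temporal sampling
            f = encoder(clip)            # (T // s, D) per-frame features
            feats.append(f.mean(dim=0))  # pool to one clip-level vector
    return torch.stack(feats)            # (len(strides), D)

# Example with a toy stand-in encoder mapping each frame to a 256-d vector:
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.LazyLinear(256))
feats = multiresolution_features(torch.randn(32, 3, 64, 64), encoder)
```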
Next, we propose a novel approach to audio-visual fusion for temporal action localisation, introducing a gated cross-attention mechanism that effectively integrates audio and visual features for activity recognition and localisation. The result is a low-parameter solution that makes fuller use of the available data while outperforming unimodal approaches.
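A minimal sketch of a gated cross-attention block is shown below. The sigmoid-gated residual formulation is a common pattern assumed here for illustration; it is not necessarily the exact mechanism proposed in the thesis.

```python
# Minimal sketch: visual tokens query audio tokens via cross-attention,
# and a learned gate controls how much audio information flows in.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, visual, audio):
        """visual: (B, Tv, D) queries; audio: (B, Ta, D) keys and values."""
        attended, _ = self.attn(visual, audio, audio)          # audio attended onto the visual timeline
        g = self.gate(torch.cat([visual, attended], dim=-1))   # per-feature gate in [0, 1]
        return visual + g * attended                            # gated residual fusion

fused = GatedCrossAttention(dim=256)(torch.randn(2, 100, 256), torch.randn(2, 50, 256))
```

Because only the attention and gate layers are trained, the fusion adds few parameters on top of the frozen unimodal encoders.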
The final contribution of this thesis is a technique for aligning text prompts with visual features using prompt learning and optimal transport. By leveraging pre-trained vision-language features and optimising only a small set of learnable parameters, this strategy significantly reduces training overhead and improves generalisation, enabling precise action localisation and discrimination between foreground and background using only a few labelled samples per class.
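As a rough illustration of the alignment step, the sketch below matches learnable prompt embeddings to frozen visual features via entropic optimal transport computed with Sinkhorn iterations. The cosine cost, uniform marginals, and hyper-parameters are assumptions made for the example.

```python
# Minimal sketch: entropic optimal transport (Sinkhorn) between learnable
# prompt embeddings and frozen visual features.
import torch
import torch.nn.functional as F

def sinkhorn(cost, eps=0.1, iters=50):
    """cost: (N, M) pairwise cost matrix. Returns a soft transport plan (N, M)."""
    K = torch.exp(-cost / eps)                    # Gibbs kernel
    u = torch.ones(cost.size(0)) / cost.size(0)   # uniform source marginal
    v = torch.ones(cost.size(1)) / cost.size(1)   # uniform target marginal
    a, b = u.clone(), v.clone()
    for _ in range(iters):                        # alternating marginal scaling
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return a[:, None] * K * b[None, :]

prompts = torch.nn.Parameter(torch.randn(16, 512))  # few learnable prompt tokens
visual = torch.randn(32, 512)                       # frozen visual features
p = F.normalize(prompts, dim=-1)
x = F.normalize(visual, dim=-1)
cost = 1 - p @ x.t()                                # cosine distance cost
plan = sinkhorn(cost.detach())                      # plan treated as fixed weights
loss = (plan * cost).sum()                          # alignment loss; gradients update only the prompts
```

Since the visual backbone stays frozen and only the prompt tokens receive gradients, the training footprint remains small, which is consistent with the few-shot setting described above.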
Collectively, these contributions make advanced video analysis tools accessible to a broader audience, including those constrained by computational and financial limitations. This work advances the technical boundaries of video analysis and democratises its applications, fostering innovation across a wide range of fields.