Motion, Language, and Multimodal Integration for Video Understanding

PhD in Video Understanding, University of Surrey, UK
January 2026

TL;DR: This PhD in Video Understanding develops a unified framework for understanding complex actions and events in untrimmed, real-world videos by:

MOFO: motion-focused self-supervision that explicitly models motion dynamics and motion-sensitive regions in videos.
FILS: language-grounded self-supervision that predicts video features in a semantic language space.
DEL: supervised multimodal dense event localisation for overlapping and asynchronous audio–visual events.
Together, these contributions advance interpretable, motion-aware, and semantically grounded video understanding in the context of digital media.

Publications Linked to This Thesis

The core chapters of this thesis correspond to the following peer-reviewed publications:

Chapter 3 – MOFO
M. Ahmadian, F. Guerin, and A. Gilbert.
“MOFO: MOtion FOcused Self-Supervision for Video Understanding,”
In Proceedings of the Self-Supervised Learning Workshop – Theory and Practice (NeurIPS), 2023.
Project page | Paper (PDF)
Chapter 4 – FILS
M. Ahmadian, F. Guerin, and A. Gilbert.
“FILS: Self-Supervised Video Feature Prediction in Semantic Language Space,”
In Proceedings of the 35th British Machine Vision Conference (BMVC), 2024.
Project page | Paper (PDF)
Chapter 5 – DEL
M. Ahmadian, A. Shirian, F. Guerin, and A. Gilbert.
“DEL: Dense Event Localisation for Multi-modal Audio-Visual Understanding,”
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshop on What is Next in Multimodal Foundation Models (MMFM4), and under review at Multimedia, 2026.
Project page | Paper (PDF)

These works are integrated and extended within the thesis, which situates them in a broader framework for motion, language, and multimodal integration in video understanding.

BibTeX

@phdthesis{Ahmadian2026Thesis, author = {Mona Ahmadian}, title = {Motion, Language, and Multimodal Integration for Video Understanding}, school = {University of Surrey}, year = {2026}, month = {January}, address = {Guildford, United Kingdom}, note = {PhD in Video Understanding. Chapters based on MOFO (NeurIPS SSL Workshop 2023), FILS (BMVC 2024), and DEL (ICCV MMFM4 Workshop / under review at Multimedia 2026)} }

Motion, Language, and Multimodal Integration for Video Understanding

Downloads

Publications Linked to This Thesis

BibTeX