Motion, Language, and Multimodal Integration for Video Understanding

PhD in Video Understanding, University of Surrey, UK
January 2026

TL;DR: This thesis develops a unified framework for understanding complex actions and events in untrimmed, real-world videos through three contributions:

  • MOFO: motion-focused self-supervision that explicitly models motion dynamics and motion-sensitive regions in videos.
  • FILS: language-grounded self-supervision that predicts video features in a semantic language space.
  • DEL: supervised multimodal dense event localisation for overlapping and asynchronous audio–visual events.

Together, these contributions advance interpretable, motion-aware, and semantically grounded video understanding in the context of digital media.

Downloads

Project pages with additional figures, videos, and implementation details are available for MOFO, FILS, and DEL.

Publications Linked to This Thesis

The core chapters of this thesis correspond to the following peer-reviewed publications:

  1. Chapter 3 – MOFO
    M. Ahmadian, F. Guerin, and A. Gilbert.
    “MOFO: MOtion FOcused Self-Supervision for Video Understanding,”
    In Proceedings of the NeurIPS Workshop on Self-Supervised Learning – Theory and Practice, 2023.
    Project page | Paper (PDF)
  2. Chapter 4 – FILS
    M. Ahmadian, F. Guerin, and A. Gilbert.
    “FILS: Self-Supervised Video Feature Prediction in Semantic Language Space,”
    In Proceedings of the 35th British Machine Vision Conference (BMVC), 2024.
    Project page | Paper (PDF)
  3. Chapter 5 – DEL
    M. Ahmadian, A. Shirian, F. Guerin, and A. Gilbert.
    “DEL: Dense Event Localisation for Multi-modal Audio-Visual Understanding,”
    In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshop on What is Next in Multimodal Foundation Models (MMFM4); also under review at Multimedia, 2026.
    Project page | Paper (PDF)

These works are integrated and extended within the thesis, which situates them in a broader framework for motion, language, and multimodal integration in video understanding.

BibTeX

@phdthesis{Ahmadian2026Thesis,
  author    = {Mona Ahmadian},
  title     = {Motion, Language, and Multimodal Integration for Video Understanding},
  school    = {University of Surrey},
  year      = {2026},
  month     = {January},
  address   = {Guildford, United Kingdom},
  note      = {PhD in Video Understanding. Chapters based on MOFO (NeurIPS SSL Workshop 2023), FILS (BMVC 2024), and DEL (ICCV MMFM4 Workshop / under review at Multimedia 2026)}
}