TL;DR: This PhD thesis develops a unified framework for understanding complex actions and events in untrimmed, real-world videos through three contributions:
- MOFO: motion-focused self-supervision that explicitly models motion dynamics and motion-sensitive regions in videos.
- FILS: language-grounded self-supervision that predicts video features in a semantic language space.
- DEL: supervised multimodal dense event localisation for overlapping and asynchronous audio–visual events.
- Together, these contributions advance interpretable, motion-aware, and semantically grounded video understanding in the context of digital media.
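To make the motion-focused idea behind MOFO concrete, here is a toy sketch (not the thesis method itself) of one way motion-sensitive regions can be identified for masking: rank spatial patches by frame-difference motion energy and mask the most motion-heavy ones. The function name, patch size, and mask ratio are all illustrative assumptions.

```python
import numpy as np

def motion_sensitive_mask(frames, patch=4, mask_ratio=0.5):
    """Toy illustration: rank non-overlapping spatial patches by
    frame-difference motion energy and mask the top fraction.
    frames: (T, H, W) grayscale clip; H, W divisible by `patch`."""
    # Per-pixel motion energy from absolute consecutive-frame differences.
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))  # (T-1, H, W)
    motion = diffs.mean(axis=0)                                 # (H, W)
    H, W = motion.shape
    # Average motion energy within each patch.
    patches = motion.reshape(H // patch, patch, W // patch, patch).mean(axis=(1, 3))
    flat = patches.ravel()
    k = int(mask_ratio * flat.size)
    masked = np.zeros(flat.size, dtype=bool)
    # Mask the k patches with the highest motion energy.
    masked[np.argsort(flat)[::-1][:k]] = True
    return masked.reshape(patches.shape)

# Usage: an 8-frame 16x16 clip with a bright square sliding rightward.
T, H, W = 8, 16, 16
frames = np.zeros((T, H, W))
for t in range(T):
    frames[t, 4:8, t:t + 4] = 1.0
mask = motion_sensitive_mask(frames, patch=4, mask_ratio=0.25)
```

With a 4x4 grid of patches and a 0.25 mask ratio, 4 patches are masked, and they fall along the row the square moves through, which is the intuition MOFO exploits: forcing the model to reconstruct exactly the regions where motion happens.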