The field of generative AI has progressed rapidly and can now produce high-quality images and videos from text prompts. This evolution has also raised user demand for precise control over the outcome, posing new challenges in effectively directing the generation process. Standard conditioning techniques, mainly text and image inputs, have proven useful but remain limited for more complex requirements, such as specifying a particular human pose, camera orientation, or fine-grained visual appearance.
This PhD research enhances conditioning techniques by introducing a parametric approach that emphasises multimodal conditioning for image and video generation models. It develops methods for more comprehensive user control, incorporating modalities such as pose and spatial inputs to better align model behaviour with user intent across different aspects of generation. By refining the conditioning mechanisms, this research aims to bridge the gap between user specifications and model outputs, ensuring greater flexibility, precision, and coherence in generated content.
This thesis presents several key contributions. Traditional methods for human pose conditioning rely on skeleton images, which contain substantial redundancy and are computationally inefficient for modern architectures. To address this, we proposed the pose token: raw pose parameters are compressed into tokens that serve as conditioning elements via attention mechanisms, a common building block of advanced architectures. We validated this token-based approach with both 2D body keypoints and 3D body parameters, demonstrating its effectiveness across multiple architectures, from transformers to diffusion models. Additionally, our parametric approach introduces techniques for human and camera pose interpolation within image generation.
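The pose-token idea can be illustrated with a minimal sketch. The projection matrix, token count, and dimensions below are hypothetical placeholders, not the thesis's actual architecture: raw pose parameters (here, 17 flattened 2D keypoints) are compressed into a handful of tokens, which image tokens then attend to via cross-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def make_pose_tokens(pose_params, proj, n_tokens):
    """Compress a raw pose vector into n_tokens conditioning tokens.
    proj stands in for a learned projection (random here)."""
    d = proj.shape[1] // n_tokens
    return (pose_params @ proj).reshape(n_tokens, d)

def cross_attention(image_tokens, cond_tokens):
    """Image tokens query the pose tokens (keys/values), the standard
    way token conditions are injected into attention-based models."""
    d = image_tokens.shape[-1]
    attn = softmax(image_tokens @ cond_tokens.T / np.sqrt(d))
    return attn @ cond_tokens

rng = np.random.default_rng(0)
P, n_tokens, d, n_img = 34, 4, 8, 16          # 17 keypoints x 2 coords
pose = rng.normal(size=P)                     # hypothetical raw pose
proj = rng.normal(size=(P, n_tokens * d)) * 0.1
tokens = make_pose_tokens(pose, proj, n_tokens)
img = rng.normal(size=(n_img, d))             # stand-in image tokens
out = cross_attention(img, tokens)
print(tokens.shape, out.shape)                # (4, 8) (16, 8)
```

A few tokens replace an entire rendered skeleton image, which is where the efficiency gain over image-based pose conditioning comes from.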
A common approach for conditioning diffusion models is to incorporate adapters: lightweight modules that deliver control signals to a pre-trained image model. However, our research revealed that this method often introduces a critical issue of mode conflict. This problem, worsened by cascading multiple adapters, results from an imbalance in control signals: the model can become dominated by one adapter, limiting the generative power of both the base model and the other adapters. Despite its prevalence, this issue remains largely unaddressed in existing research. To solve it, we devised a unified adapter architecture that integrates both structural and visual conditioning within a single, harmonised control pathway. This unified approach delivers balanced multimodal conditioning, avoiding the pitfalls of adapter cascades and enabling greater model flexibility. As a result, the high controllability of our approach enables versatile human image generation and editing tasks.
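The contrast between cascaded and unified adapters can be sketched as follows. All weights, shapes, and function names here are illustrative assumptions, not the thesis's architecture: both condition types are projected into one shared token space and emitted through a single pathway, rather than each adapter injecting its own competing signal.

```python
import numpy as np

def unified_adapter(struct_feats, visual_feats, W_s, W_v, W_out):
    """Sketch of a unified adapter: structural features (e.g. a pose map)
    and visual features (e.g. a reference image) are mapped into one
    shared space and fused into a single control signal, instead of
    cascading two independent adapters whose signals can conflict."""
    s = struct_feats @ W_s                   # project to shared width
    v = visual_feats @ W_v
    fused = np.concatenate([s, v], axis=0)   # one harmonised token stream
    return fused @ W_out                     # single control pathway

rng = np.random.default_rng(1)
d = 8
struct = rng.normal(size=(6, 10))            # 6 structural tokens
visual = rng.normal(size=(4, 12))            # 4 visual tokens
W_s = rng.normal(size=(10, d)) * 0.1         # stand-ins for learned maps
W_v = rng.normal(size=(12, d)) * 0.1
W_out = rng.normal(size=(d, d)) * 0.1
control = unified_adapter(struct, visual, W_s, W_v, W_out)
print(control.shape)                         # (10, 8)
```

Because both modalities pass through the same learned output map, neither can saturate the control signal on its own, which is the intuition behind avoiding mode conflict.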
Our research in 2D image generation was extended to video generation. We show that architectural differences in transformer-based diffusion models render existing camera control methods, designed for U-Net-based diffusion models, ineffective. Through extensive experimentation, we identified optimal architectures and camera representations; combined with our novel camera motion guidance, these restore camera control for video diffusion transformers and boost camera motion by over 400%. Our work on pose conditioning for images likewise extends to video generation. Unlike existing methods that require a detailed camera pose for every frame, our approach achieves smooth video motion with minimal input: by specifying only the initial and final camera poses, our system interpolates between them to produce continuous camera movement, enabling consistent, controlled video generation with reduced data requirements. This sparse video conditioning significantly simplifies user interaction while ensuring fluid transitions and stable pose dynamics across frames.
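Interpolating a full camera trajectory from only two poses can be sketched with standard tools, assuming a camera pose is a unit quaternion plus a translation; this is an illustrative reconstruction (quaternion slerp for rotation, linear interpolation for translation), not necessarily the exact scheme used in the thesis.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions."""
    dot = np.dot(q0, q1)
    if dot < 0:                      # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:                 # nearly parallel: lerp + renormalise
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def interpolate_camera(pose0, pose1, n_frames):
    """Given only the first and last camera poses (quaternion, translation),
    fill in a smooth pose for every frame."""
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)
        q = slerp(pose0[0], pose1[0], t)
        tr = (1 - t) * pose0[1] + t * pose1[1]   # straight-line translation
        frames.append((q, tr))
    return frames

q0 = np.array([1.0, 0.0, 0.0, 0.0])                       # identity
q1 = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])  # 90° yaw
t0, t1 = np.zeros(3), np.array([0.0, 0.0, 2.0])           # dolly forward
frames = interpolate_camera((q0, t0), (q1, t1), n_frames=8)
print(len(frames))                                        # 8 camera poses
```

Two user-specified poses thus expand into a dense per-frame trajectory, which is the "reduced data requirement" the paragraph above refers to.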
Many of the challenges we aimed to address were novel and often lacked established evaluation methods, so we proposed new metrics to assess them rigorously. One of these, the People Count Error (PCE), captures a class of error specific to AI-generated human images, such as rendering an incorrect number of people or body parts. This metric has already gained traction in the research community and is being adopted in image generation benchmarks, helping to set new standards for evaluating AI-generated human image quality.
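A much-simplified version of a count-based error metric could look like the sketch below. The exact definition of PCE is not given here, so this formulation (fraction of images where the detected person count mismatches the requested count) is an assumption for illustration only.

```python
def people_count_error(expected_counts, detected_counts):
    """Illustrative count-error metric: the fraction of generated images
    whose detected person count differs from the count the prompt asked
    for. The thesis's actual PCE definition may differ in detail."""
    assert len(expected_counts) == len(detected_counts)
    mismatches = [e != d for e, d in zip(expected_counts, detected_counts)]
    return sum(mismatches) / len(mismatches)

# Prompts requested [1, 2, 3, 1] people; a person detector found
# [1, 2, 4, 2] in the generated images: two of four images are wrong.
print(people_count_error([1, 2, 3, 1], [1, 2, 4, 2]))  # 0.5
```

In practice the detected counts would come from an off-the-shelf person detector run on the generated images.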