Multitwine: Multi-Object Compositing with Text and Layout Control

Multitwine teaser showcasing multi-object compositing

Multitwine is a generative model for simultaneous multi-object compositing, guided by text and layout. Built on a Stable Diffusion backbone, it combines text and image embeddings into a multimodal representation, enabling cohesive multi-object compositing with background harmonization.

Abstract

Multitwine introduces a generative model for simultaneous multi-object compositing, guided by text and layout. It supports adding multiple objects to a scene, capturing interactions from simple positional relations (e.g., "next to") to complex actions (e.g., "hugging"). The model autonomously generates supporting props when needed (e.g., for "taking a selfie"). By jointly training for compositing and customization, Multitwine achieves state-of-the-art performance in both tasks, offering a versatile solution for text-driven object compositing.

Model Architecture

Our model builds on a Stable Diffusion backbone (U-Net, autoencoder, and text encoder), extended with an image encoder and an adaptor. It combines text and image embeddings into a multimodal representation, enabling seamless multi-object compositing with layout control and background harmonization.
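The fusion step can be illustrated with a minimal numpy sketch. All dimensions, the linear adaptor, and the token counts below are illustrative assumptions, not values from the paper; the idea shown is only that image-encoder features are projected by an adaptor into the text-embedding space and concatenated with text tokens to form one multimodal conditioning sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): feature sizes for the
# text encoder, image encoder, and the shared conditioning space.
d_text, d_img, d_model = 768, 1024, 768
text_tokens = rng.standard_normal((4, d_text))   # 4 text-prompt tokens
image_tokens = rng.standard_normal((2, d_img))   # 2 object-image tokens

# Adaptor: a learned linear projection mapping image features into the
# text-embedding space (weights are random here, for illustration only).
W_adapt = rng.standard_normal((d_img, d_model)) * 0.02
projected_image = image_tokens @ W_adapt

# Multimodal conditioning: text and projected image tokens concatenated
# along the token axis into a single sequence the diffusion model can
# attend to via cross-attention.
multimodal = rng_free = np.concatenate([text_tokens, projected_image], axis=0)
print(multimodal.shape)  # (6, 768)
```

The design choice this sketches is standard in customization-style models: rather than conditioning on text alone, object appearance features enter the same cross-attention stream as the prompt, letting a single sequence carry both "what the objects look like" and "what the text says they do".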

Comparison to Generative Object Compositing Models

Multitwine outperforms recent generative object compositing models by enabling simultaneous multi-object compositing. This approach ensures cohesive harmonization, captures complex interactions (e.g., reposing), and autonomously adds elements for realism.

Comparison to Customization Models

Our primary task is Object Compositing, but we also train for Subject-Driven Generation as an auxiliary task to achieve a better balance between text and image alignment in the compositing task. As a side effect, our model can also perform Multi-Entity Subject-Driven Generation, achieving performance comparable to state-of-the-art customization models.

Applications

Multitwine demonstrates emerging capabilities, including:

- Multi-Object Generation: compositing multiple objects simultaneously with cohesive interactions.
- Subject-Driven Inpainting: seamlessly completing scenes by generating and integrating additional objects guided by text and layout.

Poster

BibTeX

@inproceedings{Tarrés:Multitwine:CVPR:2025,
        AUTHOR =  "Tarrés, Gemma C and Lin, Zhe and Zhang, Zhifei and Zhang, He and Gilbert, Andrew and Collomosse, John and Kim, Soo Ye",
        TITLE = "Multitwine: Multi-Object Compositing with Text and Layout Control",
        BOOKTITLE = "The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'25)",
        YEAR = "2025",
        }