Multitwine is a generative model for simultaneous multi-object compositing, guided by text and layout. It supports adding multiple objects to a scene, capturing interactions ranging from simple positional relations (e.g., "next to") to complex actions (e.g., "hugging"). The model autonomously generates supporting props when needed (e.g., for "taking a selfie"). By jointly training for compositing and customization, Multitwine achieves state-of-the-art performance on both tasks, offering a versatile solution for text-driven object compositing.
@inproceedings{Tarrés:Multitwine:CVPR:2025,
  AUTHOR    = "Tarrés, Gemma C and Lin, Zhe and Zhang, Zhifei and Zhang, He and Gilbert, Andrew and Collomosse, John and Kim, Soo Ye",
  TITLE     = "Multitwine: Multi-Object Compositing with Text and Layout Control",
  BOOKTITLE = "The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'25)",
  YEAR      = "2025",
}