Thinking Outside the BBox: Unconstrained Generative Object Compositing

University of Surrey[1], Adobe Research[2]
The European Conference on Computer Vision (ECCV) 2024

Our unconstrained object compositing model has several advantages. When using a bbox (bottom), our model achieves better background preservation (see the bird in the background) and more natural shadows and reflections than SotA models by allowing generation beyond the bbox. Without any bbox input (top), our model can automatically place and composite objects in diverse ways.

Video Presentation

Abstract

Compositing an object into an image involves multiple non-trivial sub-tasks such as object placement and scaling, color/lighting harmonization, viewpoint/geometry adjustment, and shadow/reflection generation. Recent generative image compositing methods leverage diffusion models to handle multiple sub-tasks at once. However, existing models face limitations due to their reliance on masking the original object during training, which constrains their generation to the input mask. Furthermore, obtaining an accurate input mask specifying the location and scale of the object in a new image can be highly challenging. To overcome such limitations, we define a novel problem of unconstrained generative object compositing, i.e., the generation is not bounded by the mask, and train a diffusion-based model on a synthesized paired dataset. Our first-of-its-kind model is able to generate object effects such as shadows and reflections that go beyond the mask, enhancing image realism. Additionally, if an empty mask is provided, our model automatically places the object in diverse natural locations and scales, accelerating the compositing workflow. Our model outperforms existing object placement and compositing models in various quality metrics and user studies.
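To make the two input modes concrete, the short sketch below shows one plausible way to build the mask that conditions generation: a binary bbox mask when the user specifies a region, or an "empty" mask of all -1 values when the model should choose the location and scale itself. The helper name and conventions are assumptions for illustration, not the paper's released code.

import torch

def make_placement_mask(h, w, bbox=None):
    """Build the placement mask I_p that conditions generation.

    bbox: optional (x0, y0, x1, y1) in pixel coordinates. If None, an
    "empty" mask of all -1 is returned, signalling the model to pick the
    object's location and scale on its own (unconstrained compositing).
    """
    if bbox is None:
        return torch.full((1, h, w), -1.0)   # empty mask: unconstrained placement
    x0, y0, x1, y1 = bbox
    mask = torch.zeros((1, h, w))
    mask[:, y0:y1, x0:x1] = 1.0              # region suggested by the user
    return mask

For example, make_placement_mask(512, 512) would request fully unconstrained placement, while make_placement_mask(512, 512, (100, 200, 260, 360)) would suggest a region for the object.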

Model architecture. Our model consists of: (i) an object encoder E and a content adaptor A that encode the object at different scales; (ii) a Stable Diffusion backbone comprising an autoencoder (G, D) and a U-Net. The multiscale embeddings from (i) are averaged to condition the U-Net via cross-attention. The background image I_BG and a mask I_p are concatenated to the input of (ii); I_p can be empty by setting all of its values to -1. The U-Net is adapted to return the predicted mask I'_m as an additional output.
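As a rough illustration of this conditioning path, the PyTorch-style sketch below treats the object encoder E, content adaptor A, autoencoder, and modified U-Net as supplied black boxes; all names, interfaces, and tensor shapes are assumptions made for illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class UnconstrainedCompositor(nn.Module):
    """Sketch of the conditioning path described in the caption above."""

    def __init__(self, object_encoder, content_adaptor, vae, unet):
        super().__init__()
        self.object_encoder = object_encoder    # E: encodes the object at several scales
        self.content_adaptor = content_adaptor  # A: adapts object encodings for cross-attention
        self.vae = vae                          # autoencoder (G, D); encode() assumed to return latents
        self.unet = unet                        # SD U-Net with extra input channels and one extra output channel

    def forward(self, noisy_latent, t, object_img, background_img, placement_mask):
        # placement_mask: (B, 1, H, W); all -1 when no bbox is given
        # (i) multiscale object embeddings (assumed to share one shape), averaged
        multiscale = self.content_adaptor(self.object_encoder(object_img))
        cond = torch.stack(multiscale, dim=0).mean(dim=0)

        # (ii) concatenate the background latent and downsampled mask I_p to the U-Net input
        bg_latent = self.vae.encode(background_img)
        mask_small = F.interpolate(placement_mask, size=bg_latent.shape[-2:])
        unet_in = torch.cat([noisy_latent, bg_latent, mask_small], dim=1)

        # cross-attention conditioning on the averaged object embedding (interface assumed)
        out = self.unet(unet_in, t, context=cond)
        noise_pred, mask_pred = out[:, :-1], out[:, -1:]  # last channel is the predicted mask I'_m
        return noise_pred, mask_pred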

Visual comparison to generative image compositing models, given the same input object, background, and bbox. Our model generates realistic results with natural shadows and reflections while preserving the original background.

Visualization of the location and scale prediction (marked in yellow) of our model and prior object placement prediction works. For visualization purposes, we display our generated image for our model and copy-paste the object into the predicted bounding box for the compared models.


BibTeX

@inproceedings{Tarres:ECCV:2024,
        AUTHOR = "Gemma C Tarrés, Zhe Lin, Zhifei Zhang, Jianming Zhang, Yizhi Song, Dan Ruta, Andrew Gilbert, John Collomosse, Soo Ye Kim",
        TITLE = "Thinking Outside the BBox: Unconstrained Generative Object Compositing",
        BOOKTITLE = "European Conference on Computer Vision (ECCV'24)",
        YEAR = "2024",
}