DIFF-NST: Diffusion Interleaving for Deformable Neural Style Transfer

Deformable style transfer using DIFF-NST, compared to baselines NNST [13], CAST [41], NeAT [25], and PARASOL [31]. Our DIFF-NST method performs style transfer with much stronger style-based form alteration, matching the shapes and structures of the style image rather than just its colors and textures.

Abstract

Neural Style Transfer (NST) is the field of study applying neural techniques to modify the artistic appearance of a content image to match the style of a reference style image. Traditionally, NST methods have focused on texture-based image edits, affecting mostly low-level information and keeping most image structures the same. However, style-based deformation of the content is desirable for some styles, especially in cases where the style is abstract or the primary concept of the style lies in its deformed rendition of some content. With the recent introduction of diffusion models, such as Stable Diffusion, we can access far more powerful image generation techniques, enabling new possibilities. In our work, we propose using this new class of models to perform style transfer while enabling deformable style transfer, a capability that has eluded previous models. We show how leveraging the priors of these models can expose new artistic controls at inference time, and we document our findings in exploring this new direction for the field of style transfer.

High-level visualization of our diffusion-based neural style transfer process. (Left) Trainable MLPs in the self-attention blocks of the LDM UNet. (Right) Attention values and ALADIN style codes are extracted from the style image. The content image is re-colored using the style image, after which the LDM extracts content noises from it. These are interleaved into the reverse diffusion process at multiple timesteps to generate a stylized version for the loss objective. Green modules are trainable; blue modules are frozen.
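To make the interleaving step concrete, here is a minimal PyTorch sketch of the idea, not the actual DIFF-NST implementation: TinyUNet, extract_content_noises, and the toy update step are hypothetical stand-ins for the frozen LDM UNet, its inversion pass, and a real diffusion scheduler.

    # Hypothetical sketch of noise interleaving; names are illustrative stand-ins.
    import torch
    import torch.nn as nn

    class TinyUNet(nn.Module):
        """Stand-in for the frozen LDM UNet; predicts noise from a latent.
        The timestep t is ignored here purely to keep the stub small."""
        def __init__(self, dim=4):
            super().__init__()
            self.net = nn.Conv2d(dim, dim, 3, padding=1)

        def forward(self, z, t):
            return self.net(z)

    @torch.no_grad()
    def extract_content_noises(unet, z_content, timesteps):
        """Inversion stub: record a per-timestep noise prediction for the
        (re-colored) content latent, keyed by timestep."""
        return {int(t): unet(z_content, t) for t in timesteps}

    def reverse_diffusion_interleaved(unet, z_T, content_noises, timesteps, alpha=0.5):
        """Run reverse diffusion; at each timestep that has a stored content
        noise, blend it with the model's own prediction to preserve structure."""
        z = z_T
        for t in timesteps:
            eps = unet(z, t)
            if int(t) in content_noises:
                # Interleave pre-extracted content noise into the step.
                eps = alpha * content_noises[int(t)] + (1 - alpha) * eps
            z = z - 0.1 * eps  # toy update; a real scheduler step belongs here
        return z

    unet = TinyUNet().eval()
    timesteps = torch.arange(50, 0, -5)
    z_content = torch.randn(1, 4, 8, 8)
    noises = extract_content_noises(unet, z_content, timesteps)
    stylized = reverse_diffusion_interleaved(unet, torch.randn(1, 4, 8, 8), noises, timesteps)
    print(stylized.shape)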

Visualization of style code ablation. The more disentangled ALADIN-NST embedding carries over less semantic information from the style images.

Controlling the style-based content deformation of the stylized image at inference time by varying the starting timestep at which pre-extracted content noises from the content image inversion are applied.
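A hedged sketch of this inference-time control, reusing the hypothetical helpers from the sketch above: gating which timesteps receive content noises before calling reverse_diffusion_interleaved. Since reverse diffusion runs from high t to low t, a smaller start_t leaves more early steps purely style-driven, allowing stronger deformation.

    # Illustrative only; builds on the hypothetical sketch above.
    def stylize_with_deformation_control(unet, z_T, content_noises, timesteps, start_t):
        # Apply content noises only from start_t onwards in the reverse pass,
        # so a smaller start_t yields more style-based deformation.
        gated = {t: n for t, n in content_noises.items() if t <= start_t}
        return reverse_diffusion_interleaved(unet, z_T, gated, timesteps)

    # Sweep the start timestep to trade content fidelity against deformation.
    for start_t in (45, 30, 15):
        out = stylize_with_deformation_control(
            unet, torch.randn(1, 4, 8, 8), noises, timesteps, start_t)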

Deformable style transfer, compared with NNST, CAST, NeAT, and PARASOL. All our figures are generated using images from the ALADIN-NST test set, which were not seen during training. More examples are in the supplementary materials.

BibTeX

@inproceedings{Ruta:diffnst:ECCVWS:2024,
        AUTHOR = "Ruta, Dan and Canet Tarrés, Gemma and Gilbert, Andrew and Shechtman, Eli and Kolkin, Nick and Collomosse, John",
        TITLE = "DIFF-NST: Diffusion Interleaving for Deformable Neural Style Transfer",
        BOOKTITLE = "European Conference on Computer Vision 2024, Vision for Art (VISART VII) Workshop",
        YEAR = "2024",
}