Controlled generation¶

Controlling the outputs of diffusion models has long been pursued by the community and is now an active research topic. In many popular diffusion models, subtle changes in the inputs, whether images or text prompts, can drastically change the outputs. Ideally, we want to control which semantics are preserved and which are changed.

Most examples of preserving semantics reduce to accurately mapping a change in the input to a change in the output. For example, adding an adjective to a subject in a prompt should preserve the rest of the image and modify only that subject. Similarly, an image variation of a particular subject should preserve the subject's pose.

Additionally, there are qualities of generated images that we would like to influence beyond semantic preservation. For example, we would generally like our outputs to be of good quality, adhere to a particular style, or be realistic.

This page documents some of the techniques diffusers supports for controlling the generation of diffusion models. Much of this is cutting-edge research and can be quite nuanced.

We provide a high-level explanation of how each technique controls generation, along with a brief look at the technical details. For more in-depth explanations, the original papers linked from the pipelines are always the best resources.

Depending on the use case, one should choose a technique accordingly. In many cases, these techniques can be combined.

Unless otherwise mentioned, these are techniques that work with existing models and don't require their own weights.

InstructPix2Pix
Depth2Image
DreamBooth
Textual Inversion
ControlNet
DiffEdit
T2I-Adapter

For convenience, we provide a table to denote which methods are inference-only and which require fine-tuning/training.

Method	Inference only	Requires training / fine-tuning	Comments
InstructPix2Pix	✅	❌	Can additionally be fine-tuned for better performance on specific edit instructions.
Depth2Image	✅	❌
DreamBooth	❌	✅
Textual Inversion	❌	✅
ControlNet	✅	❌	A ControlNet can be trained/fine-tuned on a custom conditioning.
DiffEdit	✅	❌
T2I-Adapter	✅	❌

InstructPix2Pix¶

Paper

InstructPix2Pix is fine-tuned from Stable Diffusion to support editing input images. It takes as inputs an image and a prompt describing an edit, and it outputs the edited image. InstructPix2Pix has been explicitly trained to work well with InstructGPT-like prompts.
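
As a rough illustration, the sketch below applies an edit instruction to an image with the `StableDiffusionInstructPix2PixPipeline` class from diffusers. The checkpoint name `timbrooks/instruct-pix2pix`, the step count, the guidance value, and the helper names `load_instruct_pix2pix` and `edit_image` are assumptions chosen for this example.

```python
def load_instruct_pix2pix(model_id="timbrooks/instruct-pix2pix", device="cuda"):
    """Load the editing pipeline. model_id and device are assumed defaults."""
    # Imported lazily so that edit_image can be exercised without torch installed.
    import torch
    from diffusers import StableDiffusionInstructPix2PixPipeline

    # float16 assumes a GPU; drop torch_dtype to run in full precision.
    pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    )
    return pipe.to(device)


def edit_image(pipe, image, instruction, steps=20, image_guidance_scale=1.5):
    """Apply a natural-language edit instruction to a PIL image."""
    result = pipe(
        prompt=instruction,
        image=image,
        num_inference_steps=steps,
        # Higher values keep the output closer to the input image.
        image_guidance_scale=image_guidance_scale,
    )
    return result.images[0]
```

A call might look like `edit_image(load_instruct_pix2pix(), photo, "turn the sky into a sunset")`, where `photo` is a PIL image supplied by the caller.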

Depth2Image¶

Project

Depth2Image is fine-tuned from Stable Diffusion to better preserve semantics for text-guided image variation.

It conditions on a monocular depth estimate of the original image.
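
A minimal sketch of depth-conditioned image variation with the `StableDiffusionDepth2ImgPipeline` class from diffusers follows. The checkpoint name `stabilityai/stable-diffusion-2-depth`, the `strength` value, and the helper names are assumptions made for illustration.

```python
def load_depth2img(model_id="stabilityai/stable-diffusion-2-depth", device="cuda"):
    """Load the depth-conditioned pipeline. model_id and device are assumed defaults."""
    # Imported lazily so that vary_image can be exercised without torch installed.
    import torch
    from diffusers import StableDiffusionDepth2ImgPipeline

    pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    )
    return pipe.to(device)


def vary_image(pipe, image, prompt, strength=0.7, negative_prompt=None):
    """Generate a text-guided variation of a PIL image."""
    result = pipe(
        prompt=prompt,
        image=image,
        negative_prompt=negative_prompt,
        # No depth map is passed here, so the pipeline estimates one from `image`.
        strength=strength,
    )
    return result.images[0]
```

Lower `strength` values stay closer to the original image, while higher values give the prompt more influence.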

DreamBooth¶

Project

DreamBooth fine-tunes a model to teach it about a new subject. For example, a few pictures of a person can be used to generate images of that person in different styles.
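
Once a model has been fine-tuned this way, inference looks like ordinary text-to-image generation with the subject referenced by an identifier in the prompt. The sketch below assumes a checkpoint directory already produced by a DreamBooth training run, and assumes the common convention of pairing a rare identifier token such as `sks` with a class noun. The helper names and the prompt wording are hypothetical.

```python
def load_finetuned(checkpoint_dir, device="cuda"):
    """Load a fine-tuned checkpoint; checkpoint_dir is supplied by the caller."""
    # Imported lazily so that the helpers below work without torch installed.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        checkpoint_dir, torch_dtype=torch.float16
    )
    return pipe.to(device)


def subject_prompt(identifier, class_noun, context=""):
    """Build a prompt that refers to the learned subject (assumed wording)."""
    return f"a photo of {identifier} {class_noun} {context}".strip()


def generate_subject(pipe, identifier, class_noun, context="", steps=50):
    """Generate an image of the learned subject in a new context."""
    prompt = subject_prompt(identifier, class_noun, context)
    return pipe(prompt=prompt, num_inference_steps=steps).images[0]
```

The identifier and class noun should match the instance prompt used during training, otherwise the model has no reason to associate the prompt with the new subject.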

Textual Inversion¶

Paper

Textual Inversion teaches a model a new concept by learning a new token embedding for it, while the rest of the model's weights stay frozen. For example, a few pictures of a style of artwork can be used to generate images in that style.
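
At inference time, a learned concept is attached to an existing pipeline and then referenced by its placeholder token in the prompt. The sketch below uses the `load_textual_inversion` method that diffusers Stable Diffusion pipelines provide. The helper names, the prompt template, and the choice of base checkpoint are assumptions, and the base checkpoint must be the one the concept was trained against.

```python
def load_base(base_model_id, device="cuda"):
    """Load the base pipeline; base_model_id is supplied by the caller."""
    # Imported lazily so that the helpers below work without torch installed.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        base_model_id, torch_dtype=torch.float16
    )
    return pipe.to(device)


def add_concept(pipe, concept_repo):
    """Register a learned embedding with the pipeline. Call once per concept."""
    pipe.load_textual_inversion(concept_repo)
    return pipe


def generate_with_concept(pipe, token, template="a photo of {token} on a beach", steps=50):
    """Generate an image whose prompt mentions the concept's placeholder token."""
    prompt = template.format(token=token)
    return pipe(prompt=prompt, num_inference_steps=steps).images[0]
```

For instance, assuming the concept published as `sd-concepts-library/cat-toy` with the token `<cat-toy>`, one would call `add_concept(pipe, "sd-concepts-library/cat-toy")` and then `generate_with_concept(pipe, "<cat-toy>")`.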

ControlNet¶

Paper

ControlNet is an auxiliary network that adds an extra condition, supplied as an image, to a pretrained diffusion model. There are 8 canonical pre-trained ControlNets, each trained on a different conditioning such as edge detection, scribbles, depth maps, or semantic segmentation.
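
The sketch below shows one way to pair a ControlNet with a base model using the `ControlNetModel` and `StableDiffusionControlNetPipeline` classes from diffusers. The default ControlNet checkpoint `lllyasviel/sd-controlnet-canny`, the helper names, and the parameter values are assumptions. The base checkpoint is left to the caller and must match the model family the ControlNet was trained for.

```python
def load_controlnet_pipeline(base_model_id,
                             controlnet_id="lllyasviel/sd-controlnet-canny",
                             device="cuda"):
    """Attach a ControlNet to a base Stable Diffusion checkpoint."""
    # Imported lazily so that the helper below works without torch installed.
    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    controlnet = ControlNetModel.from_pretrained(controlnet_id, torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        base_model_id, controlnet=controlnet, torch_dtype=torch.float16
    )
    return pipe.to(device)


def generate_with_condition(pipe, prompt, condition_image, steps=20, scale=1.0):
    """Generate an image that follows the condition (e.g. an edge map)."""
    result = pipe(
        prompt=prompt,
        image=condition_image,  # the conditioning image, not an image to edit
        num_inference_steps=steps,
        controlnet_conditioning_scale=scale,  # weight of the extra condition
    )
    return result.images[0]
```

The conditioning image has to be prepared by the caller in the form the chosen ControlNet expects, for example a Canny edge map for an edge-detection ControlNet.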

DiffEdit¶

Paper

DiffEdit enables semantic editing of an input image guided by text prompts, while preserving the original image as much as possible.
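
In diffusers, the `StableDiffusionDiffEditPipeline` exposes this as three calls: generate a mask from a source and a target prompt, invert the image into latents, then denoise with the target prompt inside the mask. The sketch below follows that sequence. The helper names are hypothetical, and the base checkpoint is left to the caller.

```python
def load_diffedit(model_id, device="cuda"):
    """Load the DiffEdit pipeline with DDIM schedulers for editing and inversion."""
    # Imported lazily so that the helper below works without torch installed.
    import torch
    from diffusers import (
        DDIMInverseScheduler,
        DDIMScheduler,
        StableDiffusionDiffEditPipeline,
    )

    pipe = StableDiffusionDiffEditPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    )
    pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
    pipe.inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)
    return pipe.to(device)


def diffedit(pipe, image, source_prompt, target_prompt):
    """Edit `image` so that what source_prompt describes becomes target_prompt."""
    # 1. Estimate which region differs between the two prompts.
    mask = pipe.generate_mask(
        image=image, source_prompt=source_prompt, target_prompt=target_prompt
    )
    # 2. Invert the image into latents, conditioned on the source prompt.
    latents = pipe.invert(prompt=source_prompt, image=image).latents
    # 3. Denoise toward the target prompt, changing only the masked region.
    result = pipe(
        prompt=target_prompt,
        mask_image=mask,
        image_latents=latents,
        negative_prompt=source_prompt,
    )
    return result.images[0]
```

Because only the masked region is regenerated, the unmasked parts of the image are reconstructed from the inverted latents, which is how the original is preserved.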

T2I-Adapter¶

Paper

T2I-Adapter is an auxiliary network that adds an extra condition, supplied as an image, to a pretrained diffusion model. There are 8 canonical pre-trained adapters, each trained on a different conditioning such as edge detection, sketch, depth maps, or semantic segmentation.
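
Usage mirrors other conditioned pipelines: load an adapter, attach it to a base model, and pass the conditioning image at call time. The sketch below uses the `T2IAdapter` and `StableDiffusionAdapterPipeline` classes from diffusers. The helper names and parameter values are assumptions, and both checkpoint IDs are left to the caller because the adapter must match the base model it was trained for.

```python
def load_adapter_pipeline(adapter_id, base_model_id, device="cuda"):
    """Attach a T2I-Adapter to a base Stable Diffusion checkpoint."""
    # Imported lazily so that the helper below works without torch installed.
    import torch
    from diffusers import StableDiffusionAdapterPipeline, T2IAdapter

    adapter = T2IAdapter.from_pretrained(adapter_id, torch_dtype=torch.float16)
    pipe = StableDiffusionAdapterPipeline.from_pretrained(
        base_model_id, adapter=adapter, torch_dtype=torch.float16
    )
    return pipe.to(device)


def generate_with_adapter(pipe, prompt, condition_image, steps=30, scale=1.0):
    """Generate an image that follows the adapter's condition (e.g. a sketch)."""
    result = pipe(
        prompt=prompt,
        image=condition_image,  # the conditioning image, not an image to edit
        num_inference_steps=steps,
        adapter_conditioning_scale=scale,  # weight of the extra condition
    )
    return result.images[0]
```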