
Controlled generation

Controlling the outputs generated by diffusion models has long been pursued by the community and is now an active research topic. In many popular diffusion models, subtle changes in the inputs, both images and text prompts, can drastically change the outputs. Ideally, we want to be able to control which semantics are preserved and which are changed.

Most examples of preserving semantics reduce to accurately mapping a change in the input to a change in the output. For example, adding an adjective to a subject in a prompt should preserve the rest of the image and modify only that subject. Similarly, an image variation of a particular subject should preserve the subject's pose.

Additionally, there are qualities of generated images that we would like to influence beyond semantic preservation. For example, we would generally like our outputs to be of good quality, adhere to a particular style, or be realistic.

We will document some of the techniques diffusers supports for controlling the generation of diffusion models. Much of this is cutting-edge research and can be quite nuanced.

We provide a high-level explanation of how generation can be controlled, as well as a short summary of the technical details. For more in-depth explanations, the original papers, which are linked from the pipelines, are always the best resources.

Depending on the use case, one should choose a technique accordingly. In many cases, these techniques can be combined.

Unless otherwise mentioned, these are techniques that work with existing models and don't require their own weights.

  1. InstructPix2Pix
  2. Depth2Image
  3. DreamBooth
  4. Textual Inversion
  5. ControlNet
  6. DiffEdit
  7. T2I-Adapter

For convenience, we provide a table to denote which methods are inference-only and which require fine-tuning/training.

| Method | Inference only | Requires training / fine-tuning | Comments |
|---|---|---|---|
| InstructPix2Pix | ✅ | ❌ | Can additionally be fine-tuned for better performance on specific edit instructions. |
| Depth2Image | ✅ | ❌ | |
| DreamBooth | ❌ | ✅ | |
| Textual Inversion | ❌ | ✅ | |
| ControlNet | ✅ | ❌ | A ControlNet can be trained/fine-tuned on a custom conditioning. |
| DiffEdit | ✅ | ❌ | |
| T2I-Adapter | ✅ | ❌ | |

InstructPix2Pix

Paper

InstructPix2Pix is fine-tuned from Stable Diffusion to support editing input images. It takes as inputs an image and a prompt describing an edit, and it outputs the edited image. InstructPix2Pix has been explicitly trained to work well with InstructGPT-like prompts.
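As a rough illustration of the image-plus-instruction interface described above, the following sketch assumes the `StableDiffusionInstructPix2PixPipeline` class from diffusers and the `timbrooks/instruct-pix2pix` checkpoint, neither of which is named in this section. The helpers `fit_size` and `edit_image`, and all parameter values, are hypothetical choices made for this example.

```python
def fit_size(width, height, max_side=512, multiple=8):
    """Shrink (width, height) so the longer side is at most max_side,
    then round each side down to a multiple of `multiple`.

    Assumption: Stable Diffusion pipelines expect dimensions divisible by 8.
    """
    scale = min(1.0, max_side / max(width, height))
    new_w = int(width * scale) // multiple * multiple
    new_h = int(height * scale) // multiple * multiple
    # Never return a side smaller than one block.
    return max(new_w, multiple), max(new_h, multiple)


def edit_image(image, instruction, steps=20, image_guidance_scale=1.5):
    """Apply a text edit instruction to a PIL image.

    Downloads the (assumed) checkpoint on first use.
    """
    # Imported lazily so that defining this function needs no GPU stack.
    import torch
    from diffusers import StableDiffusionInstructPix2PixPipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
        "timbrooks/instruct-pix2pix",  # assumed checkpoint id
        torch_dtype=dtype,
    ).to(device)

    # PIL reports size as (width, height), which matches fit_size's arguments.
    image = image.convert("RGB").resize(fit_size(*image.size))

    result = pipe(
        instruction,                                # e.g. "make it snowy"
        image=image,                                # the image to edit
        num_inference_steps=steps,
        image_guidance_scale=image_guidance_scale,  # higher stays closer to the input
    )
    return result.images[0]
```

A call such as `edit_image(Image.open("photo.png"), "turn the sky orange")` would return the edited PIL image. Raising `image_guidance_scale` tends to keep the result closer to the input image, while lowering it gives the instruction more freedom.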

Depth2Image

Project

Depth2Image is fine-tuned from Stable Diffusion to better preserve semantics for text-guided image variation.

It conditions on a monocular depth estimate of the original image.
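A minimal sketch of depth-conditioned image variation might look like the following. It assumes the `StableDiffusionDepth2ImgPipeline` class from diffusers and the `stabilityai/stable-diffusion-2-depth` checkpoint, which are not named in this section and may need to be replaced. The function names, default values, and the `denoising_steps` helper are hypothetical. The helper reflects how image-to-image pipelines in diffusers typically turn `strength` into a number of denoising steps.

```python
def denoising_steps(num_inference_steps, strength):
    """Approximate number of denoising steps an image-to-image run performs.

    Assumption: the pipeline starts from a partially noised input, so only
    a `strength` fraction of the schedule is actually run.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


def depth_guided_variation(image, prompt, negative_prompt=None,
                           strength=0.7, steps=50,
                           model_id="stabilityai/stable-diffusion-2-depth"):
    """Generate a variation of a PIL image guided by its estimated depth.

    No depth map is passed here, so the pipeline is left to estimate one
    from `image` itself.
    """
    # Imported lazily so that defining this function needs no GPU stack.
    import torch
    from diffusers import StableDiffusionDepth2ImgPipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
        model_id, torch_dtype=dtype
    ).to(device)

    result = pipe(
        prompt=prompt,
        image=image,
        negative_prompt=negative_prompt,
        strength=strength,          # 0 keeps the input, 1 ignores its pixels
        num_inference_steps=steps,
    )
    return result.images[0]
```

Because the depth estimate constrains the scene layout, a relatively high `strength` can change appearance substantially while the overall structure of the original image is retained.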

DreamBooth

Project

DreamBooth fine-tunes a model to teach it about a new subject. For example, a few pictures of a person can be used to generate images of that person in different styles.

Textual Inversion

Paper

Textual Inversion fine-tunes a model to teach it about a new concept. For example, a few pictures of a style of artwork can be used to generate images in that style.

ControlNet

Paper

ControlNet is an auxiliary network which adds an extra condition. There are 8 canonical pre-trained ControlNets trained on different conditionings such as edge detection, scribbles, depth maps, and semantic segmentations.
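To make the "auxiliary network plus base model" arrangement concrete, here is a hedged sketch using the `ControlNetModel` and `StableDiffusionControlNetPipeline` classes from diffusers. The checkpoint ids in `CONTROLNET_CHECKPOINTS` are assumptions about where some of the pre-trained ControlNets are hosted. The function `generate_with_control` is hypothetical, and `base_model` must be supplied by the caller as the id of a compatible Stable Diffusion checkpoint.

```python
# Assumed Hub ids for a few of the pre-trained ControlNets (not from this page).
CONTROLNET_CHECKPOINTS = {
    "canny": "lllyasviel/sd-controlnet-canny",        # edge detection
    "scribble": "lllyasviel/sd-controlnet-scribble",  # scribbles
    "depth": "lllyasviel/sd-controlnet-depth",        # depth maps
    "seg": "lllyasviel/sd-controlnet-seg",            # semantic segmentation
}


def generate_with_control(prompt, control_image, base_model, kind="canny", steps=20):
    """Generate an image from `prompt`, conditioned on `control_image`.

    `control_image` is a PIL image already in the form the chosen ControlNet
    expects (for example, an edge map for kind="canny").
    """
    # Validate before any heavy import or download happens.
    if kind not in CONTROLNET_CHECKPOINTS:
        raise ValueError(f"unknown conditioning kind: {kind!r}")

    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    # The auxiliary network is loaded separately from the base model...
    controlnet = ControlNetModel.from_pretrained(
        CONTROLNET_CHECKPOINTS[kind], torch_dtype=dtype
    )
    # ...and attached to an otherwise unmodified Stable Diffusion pipeline.
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        base_model, controlnet=controlnet, torch_dtype=dtype
    ).to(device)

    result = pipe(prompt, image=control_image, num_inference_steps=steps)
    return result.images[0]
```

Keeping the ControlNet separate from the base checkpoint is what allows the same base model to be paired with different conditionings by swapping only the auxiliary weights.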

DiffEdit

Paper

DiffEdit allows for semantic editing of input images along with input prompts while preserving the original input images as much as possible.
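The sketch below shows one way this could look with the `StableDiffusionDiffEditPipeline` class from diffusers, which is assumed here rather than taken from this section. The three-stage flow (estimate a mask, invert the image to latents, then regenerate only the masked region) follows that pipeline's documented usage as best understood. The `diffedit` function, its arguments, and the up-front prompt check are hypothetical, and `base_model` must be supplied by the caller.

```python
def diffedit(image, source_prompt, target_prompt, base_model):
    """Edit a PIL image so that content matching `source_prompt` becomes
    `target_prompt`, leaving the rest of the image as untouched as possible.
    """
    # With identical prompts there is nothing to edit, so fail early.
    if source_prompt.strip() == target_prompt.strip():
        raise ValueError("source_prompt and target_prompt must differ")

    # Imported lazily so that defining this function needs no GPU stack.
    import torch
    from diffusers import (
        DDIMInverseScheduler,
        DDIMScheduler,
        StableDiffusionDiffEditPipeline,
    )

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    pipe = StableDiffusionDiffEditPipeline.from_pretrained(
        base_model, torch_dtype=dtype, safety_checker=None
    ).to(device)
    # DDIM is used in both directions: denoising and inversion.
    pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
    pipe.inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)

    # 1. Estimate which region differs between the two prompts.
    mask = pipe.generate_mask(
        image=image, source_prompt=source_prompt, target_prompt=target_prompt
    )
    # 2. Invert the original image into latents, guided by the source prompt.
    inverted_latents = pipe.invert(prompt=source_prompt, image=image).latents
    # 3. Denoise toward the target prompt, changing only the masked region.
    result = pipe(
        prompt=target_prompt,
        mask_image=mask,
        image_latents=inverted_latents,
        negative_prompt=source_prompt,
    )
    return result.images[0]
```

For instance, with `source_prompt="a bowl of fruits"` and `target_prompt="a bowl of pears"`, only the region the mask attributes to the fruit would be expected to change.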

T2I-Adapter

Paper

T2I-Adapter is an auxiliary network which adds an extra condition. There are 8 canonical pre-trained adapters trained on different conditionings such as edge detection, sketch, depth maps, and semantic segmentations.