Controlled generation¶
Controlling outputs generated by diffusion models has long been pursued by the community and is now an active research topic. In many popular diffusion models, subtle changes to the inputs, both images and text prompts, can drastically change the outputs. Ideally, we want to be able to control which semantics are preserved and which are changed.
Most examples of preserving semantics reduce to being able to accurately map a change in input to a change in output. For example, adding an adjective to a subject in a prompt should preserve the rest of the image, only modifying that subject. Similarly, image variation of a particular subject should preserve the subject's pose.
Additionally, there are qualities of generated images that we would like to influence beyond semantic preservation. For example, in general, we would like our outputs to be of good quality, adhere to a particular style, or be realistic.
Here we document some of the techniques diffusers supports to control the generation of diffusion models. Much of this is cutting-edge research and can be quite nuanced.
We provide a high-level explanation of how generation can be controlled, along with a snippet of the technical details. For more in-depth explanations of the technical details, the original papers linked from the pipelines are always the best resources.
Choose a technique according to your use case; in many cases, these techniques can be combined.
Unless otherwise mentioned, these are techniques that work with existing models and don't require their own weights.
- InstructPix2Pix
- Semantic Guidance
- Self-attention Guidance
- Depth2Image
- MultiDiffusion Panorama
- DreamBooth
- Textual Inversion
- ControlNet
- DiffEdit
- T2I-Adapter
For convenience, we provide a table to denote which methods are inference-only and which require fine-tuning/training.
Method | Inference only | Requires training / fine-tuning | Comments |
---|---|---|---|
InstructPix2Pix | ✅ | ❌ | Can additionally be fine-tuned for better performance on specific edit instructions. |
Semantic Guidance | ✅ | ❌ | |
Self-attention Guidance | ✅ | ❌ | |
Depth2Image | ✅ | ❌ | |
MultiDiffusion Panorama | ✅ | ❌ | |
DreamBooth | ❌ | ✅ | |
Textual Inversion | ❌ | ✅ | |
ControlNet | ✅ | ❌ | A ControlNet can be trained/fine-tuned on a custom conditioning. |
DiffEdit | ✅ | ❌ | |
T2I-Adapter | ✅ | ❌ | |
InstructPix2Pix¶
InstructPix2Pix is fine-tuned from Stable Diffusion to support editing input images. It takes as inputs an image and a prompt describing an edit, and it outputs the edited image. InstructPix2Pix has been explicitly trained to work well with InstructGPT-like prompts.
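A minimal sketch of how this looks with diffusers; the edit instruction, input image URL, and guidance values below are illustrative, not prescriptive.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = load_image("https://example.com/input.png")  # any RGB image (placeholder URL)
edited = pipe(
    "turn the sky into a sunset",   # the edit instruction
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,       # how strongly to stay close to the input image
).images[0]
edited.save("edited.png")
```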
Semantic Guidance (SEGA)¶
SEGA allows applying or removing one or more concepts from an image. The strength of the concept can also be controlled. For example, the smile concept can be used to incrementally increase or decrease the smile of a portrait.
Similar to how classifier-free guidance provides guidance via empty prompt inputs, SEGA provides guidance on conceptual prompts. Multiple of these conceptual prompts can be applied simultaneously. Each conceptual prompt can either add or remove its concept depending on whether the guidance is applied positively or negatively.
Unlike Pix2Pix Zero or Attend and Excite, SEGA directly interacts with the diffusion process instead of performing any explicit gradient-based optimization.
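A hedged sketch of applying a single concept with the SemanticStableDiffusionPipeline; the prompts and editing parameters are illustrative starting points rather than tuned values.

```python
import torch
from diffusers import SemanticStableDiffusionPipeline

pipe = SemanticStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

out = pipe(
    prompt="a photo of the face of a woman",
    editing_prompt=["smiling, smile"],    # concept to guide towards
    reverse_editing_direction=[False],    # False adds the concept, True removes it
    edit_guidance_scale=[5.0],            # strength of the concept guidance
    edit_warmup_steps=[10],               # steps before the concept guidance kicks in
)
out.images[0].save("sega.png")
```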
Self-attention Guidance (SAG)¶
Self-attention Guidance improves the general quality of images.
SAG provides guidance from predictions not conditioned on high-frequency details towards fully conditioned predictions. The high-frequency details are extracted from the UNet's self-attention maps.
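A minimal sketch with the StableDiffusionSAGPipeline; the sag_scale value below is an illustrative default rather than a tuned setting.

```python
import torch
from diffusers import StableDiffusionSAGPipeline

pipe = StableDiffusionSAGPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a photo of an astronaut riding a horse",
    sag_scale=0.75,        # strength of self-attention guidance
    guidance_scale=7.5,    # standard classifier-free guidance is applied as well
).images[0]
image.save("sag.png")
```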
Depth2Image¶
Depth2Image is fine-tuned from Stable Diffusion to better preserve semantics for text-guided image variation.
It conditions on a monocular depth estimate of the original image.
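A brief sketch of depth-conditioned variation with the StableDiffusionDepth2ImgPipeline; the input image URL and strength value are placeholders. If no depth map is passed explicitly, the pipeline estimates one from the input image.

```python
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("https://example.com/room.png")  # placeholder input image
image = pipe(
    prompt="a cozy living room, watercolor style",
    image=init_image,
    negative_prompt="blurry, low quality",
    strength=0.7,   # how much the output may deviate from the input
).images[0]
image.save("depth2img.png")
```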
MultiDiffusion Panorama¶
MultiDiffusion Panorama defines a new generation process over a pre-trained diffusion model. This process fuses multiple diffusion generation processes that can be readily applied to generate high-quality and diverse images. Results adhere to user-provided controls, such as desired aspect ratio (e.g., panorama) and spatial guiding signals, ranging from tight segmentation masks to bounding boxes. MultiDiffusion Panorama allows generating high-quality images at arbitrary aspect ratios (e.g., panoramas).
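A short sketch of panorama generation with the StableDiffusionPanoramaPipeline; the base checkpoint and output width are illustrative.

```python
import torch
from diffusers import DDIMScheduler, StableDiffusionPanoramaPipeline

model_id = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# MultiDiffusion fuses overlapping diffusion windows, so the output width can
# far exceed the resolution the base model was trained on.
image = pipe("a photo of the dolomites", height=512, width=2048).images[0]
image.save("panorama.png")
```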
DreamBooth¶
DreamBooth fine-tunes a model to teach it about a new subject. For example, a few pictures of a person can be used to generate images of that person in different styles.
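DreamBooth itself is a training procedure (see the diffusers training examples). The sketch below only shows how a hypothetically fine-tuned checkpoint might be loaded and prompted; the "./dreambooth-output" directory and the "sks" identifier token are assumptions for illustration.

```python
import torch
from diffusers import DiffusionPipeline

# Hypothetical local directory produced by a DreamBooth fine-tuning run.
pipe = DiffusionPipeline.from_pretrained(
    "./dreambooth-output", torch_dtype=torch.float16
).to("cuda")

# The rare identifier token used during fine-tuning names the learned subject.
image = pipe("a watercolor painting of sks person").images[0]
image.save("dreambooth.png")
```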
Textual Inversion¶
Textual Inversion fine-tunes a model to teach it about a new concept. For example, a few pictures of a style of artwork can be used to generate images in that style.
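A minimal sketch of loading a learned embedding with load_textual_inversion; the sd-concepts-library/cat-toy concept and its "&lt;cat-toy&gt;" placeholder token are just one publicly shared example.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load a textual-inversion embedding; its placeholder token now refers to the concept.
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

image = pipe("a <cat-toy> sitting on a beach").images[0]
image.save("textual_inversion.png")
```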
ControlNet¶
ControlNet is an auxiliary network which adds an extra condition. There are 8 canonical pre-trained ControlNets trained on different conditionings such as edge detection, scribbles, depth maps, and semantic segmentations.
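A brief sketch of conditioning on Canny edges with a pre-trained ControlNet; the edge map URL is a placeholder and assumes the edges have already been extracted from a source image.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

canny_image = load_image("https://example.com/canny_edges.png")  # precomputed edge map (placeholder URL)
image = pipe("a futuristic city at night", image=canny_image).images[0]
image.save("controlnet.png")
```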
DiffEdit¶
DiffEdit allows semantic editing of an input image, guided by a source and a target prompt, while preserving the original image as much as possible.
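A hedged sketch of the DiffEdit workflow: generate an edit mask from the source and target prompts, invert the input image, then denoise with the target prompt. The checkpoint, scheduler setup, prompts, and image URL are illustrative.

```python
import torch
from diffusers import DDIMInverseScheduler, DDIMScheduler, StableDiffusionDiffEditPipeline
from diffusers.utils import load_image

pipe = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)

raw_image = load_image("https://example.com/fruit_bowl.png")  # placeholder input image
source_prompt = "a bowl of apples"
target_prompt = "a bowl of oranges"

# 1) Contrast source/target predictions to get a mask of the region to edit.
mask = pipe.generate_mask(image=raw_image, source_prompt=source_prompt, target_prompt=target_prompt)
# 2) Invert the input image into latents conditioned on the source prompt.
inv_latents = pipe.invert(prompt=source_prompt, image=raw_image).latents
# 3) Denoise with the target prompt, editing only inside the mask.
image = pipe(prompt=target_prompt, mask_image=mask, image_latents=inv_latents).images[0]
image.save("diffedit.png")
```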
T2I-Adapter¶
T2I-Adapter is an auxiliary network which adds an extra condition. There are 8 canonical pre-trained adapters trained on different conditionings such as edge detection, sketch, depth maps, and semantic segmentations.
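A brief sketch of conditioning on a Canny edge map with a T2I-Adapter; the adapter and base checkpoints are examples of the publicly released ones, and the edge map URL is a placeholder.

```python
import torch
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

adapter = T2IAdapter.from_pretrained(
    "TencentARC/t2iadapter_canny_sd15v2", torch_dtype=torch.float16
)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

canny_image = load_image("https://example.com/canny_edges.png")  # precomputed edge map (placeholder URL)
image = pipe("a colorful illustration of a castle", image=canny_image).images[0]
image.save("t2i_adapter.png")
```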