AsymmetricAutoencoderKL
An improved, larger variational autoencoder (VAE) model with KL loss for the inpainting task, introduced in Designing a Better Asymmetric VQGAN for StableDiffusion by Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, Gang Hua.
The abstract from the paper is:
StableDiffusion is a revolutionary text-to-image generator that is causing a stir in the world of image generation and editing. Unlike traditional methods that learn a diffusion model in pixel space, StableDiffusion learns a diffusion model in the latent space via a VQGAN, ensuring both efficiency and quality. It not only supports image generation tasks, but also enables image editing for real images, such as image inpainting and local editing. However, we have observed that the vanilla VQGAN used in StableDiffusion leads to significant information loss, causing distortion artifacts even in non-edited image regions. To this end, we propose a new asymmetric VQGAN with two simple designs. Firstly, in addition to the input from the encoder, the decoder contains a conditional branch that incorporates information from task-specific priors, such as the unmasked image region in inpainting. Secondly, the decoder is much heavier than the encoder, allowing for more detailed recovery while only slightly increasing the total inference cost. The training cost of our asymmetric VQGAN is cheap, and we only need to retrain a new asymmetric decoder while keeping the vanilla VQGAN encoder and StableDiffusion unchanged. Our asymmetric VQGAN can be widely used in StableDiffusion-based inpainting and local editing methods. Extensive experiments demonstrate that it can significantly improve the inpainting and editing performance, while maintaining the original text-to-image capability. The code is available at https://github.com/buxiangzhiren/Asymmetric_VQGAN
Evaluation results can be found in section 4.1 of the original paper.
Example Usage
from mindone.diffusers import AsymmetricAutoencoderKL, StableDiffusionInpaintPipeline
from mindone.diffusers.utils import load_image, make_image_grid

prompt = "a photo of a person with beard"
img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png"
mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png"

# Load the source image and the inpainting mask, both resized to 512x512.
original_image = load_image(img_url).resize((512, 512))
mask_image = load_image(mask_url).resize((512, 512))

# Build the inpainting pipeline, then swap its VAE for the asymmetric one.
pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
pipe.vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5")

# Inpaint, then show the input, the mask, and the result side by side.
image = pipe(prompt=prompt, image=original_image, mask_image=mask_image).images[0]
make_image_grid([original_image, mask_image, image], rows=1, cols=3)
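The abstract's claim about non-edited regions can be checked numerically. The sketch below is not part of mindone: `unmasked_mae` is a hypothetical helper, and it assumes grayscale images given as nested lists of pixel values plus a mask in which nonzero values mark the region to repaint. It reports the mean absolute difference over the pixels that should have been left alone.

```python
from typing import Sequence

Image2D = Sequence[Sequence[float]]


def unmasked_mae(original: Image2D, result: Image2D, mask: Image2D) -> float:
    """Mean absolute error over pixels where mask == 0 (the unedited region)."""
    total, count = 0.0, 0
    for row_o, row_r, row_m in zip(original, result, mask):
        for o, r, m in zip(row_o, row_r, row_m):
            if m == 0:  # pixel lies outside the repainted area
                total += abs(o - r)
                count += 1
    if count == 0:
        raise ValueError("mask covers the whole image; nothing to compare")
    return total / count


# Toy 2x2 example: only the top-right pixel is repainted.
original = [[10, 20], [30, 40]]
result = [[12, 99], [30, 41]]
mask = [[0, 1], [0, 0]]
error = unmasked_mae(original, result, mask)
```

A lower value for the same prompt, image, and mask would suggest that a VAE preserves the untouched region better. Extracting pixel values from the images in the example above is omitted here.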
mindone.diffusers.models.autoencoders.autoencoder_asym_kl.AsymmetricAutoencoderKL
Bases: ModelMixin, ConfigMixin
Designing a Better Asymmetric VQGAN for StableDiffusion (https://arxiv.org/abs/2306.04632). A VAE model with KL loss for encoding images into latents and decoding latent representations into images.
This model inherits from ModelMixin. Check the superclass documentation for its generic methods implemented for all models (such as downloading or saving).
PARAMETER | DESCRIPTION |
---|---|
in_channels | Number of channels in the input image. |
out_channels | Number of channels in the output. |
down_block_types | Tuple of downsample block types. |
down_block_out_channels | Tuple of down block output channels. |
layers_per_down_block | Number of layers per down block. |
up_block_types | Tuple of upsample block types. |
up_block_out_channels | Tuple of up block output channels. |
layers_per_up_block | Number of layers per up block. |
act_fn | The activation function to use. |
latent_channels | Number of channels in the latent space. |
sample_size | Sample input size. |
norm_num_groups | Number of groups to use for the first normalization layer in ResNet blocks. |
scaling_factor | The component-wise standard deviation of the trained latent space computed using the first batch of the training set. This is used to scale the latent space to have unit variance when training the diffusion model. The latents are scaled with the formula `z = z * scaling_factor` before being passed to the diffusion model. When decoding, the latents are scaled back to the original scale with the formula `z = 1 / scaling_factor * z`. |
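The `scaling_factor` description says the factor brings the latent space to unit variance. Below is a minimal sketch of that idea, assuming the factor is the reciprocal of the latents' measured standard deviation, that latents are multiplied by it before diffusion, and that they are divided by it again before decoding. The helper names are hypothetical and plain Python lists stand in for tensors.

```python
import random
import statistics
from typing import List


def estimate_scaling_factor(latents: List[float]) -> float:
    """Reciprocal of the latents' standard deviation (assumed definition)."""
    return 1.0 / statistics.pstdev(latents)


def scale_latents(latents: List[float], scaling_factor: float) -> List[float]:
    """Applied before the latents are handed to the diffusion model."""
    return [z * scaling_factor for z in latents]


def unscale_latents(latents: List[float], scaling_factor: float) -> List[float]:
    """Applied before the latents are handed back to the decoder."""
    return [z / scaling_factor for z in latents]


# Synthetic "encoder output" with a standard deviation far from 1.
rng = random.Random(0)
raw = [rng.gauss(0.0, 5.5) for _ in range(2000)]

factor = estimate_scaling_factor(raw)
scaled = scale_latents(raw, factor)
restored = unscale_latents(scaled, factor)
```

In practice the factor is a fixed value stored in the model config rather than something recomputed per batch; the estimate above only illustrates where such a value could come from.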
Source code in mindone/diffusers/models/autoencoders/autoencoder_asym_kl.py
mindone.diffusers.models.autoencoders.autoencoder_asym_kl.AsymmetricAutoencoderKL.construct(sample, mask=None, sample_posterior=False, return_dict=False, generator=None)
PARAMETER | DESCRIPTION |
---|---|
sample | Input sample. |
mask | Optional inpainting mask. |
sample_posterior | Whether to sample from the posterior. |
return_dict | Whether or not to return a DecoderOutput instead of a plain tuple. |
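The effect of `sample_posterior` can be illustrated with a small sketch. It assumes the encoder yields a diagonal Gaussian described by a mean and a log-variance, as is usual for KL autoencoders: with the flag off, the latent is the distribution's mode (its mean); with it on, a latent is drawn with the reparameterization `z = mean + std * eps`. The function `posterior_latent` and its `rng` argument are hypothetical stand-ins, the latter playing the role that `generator` has in the signature above.

```python
import math
import random
from typing import List, Optional


def posterior_latent(
    mean: List[float],
    logvar: List[float],
    sample_posterior: bool = False,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Pick a latent from a diagonal Gaussian given its mean and log-variance."""
    if not sample_posterior:
        # The mode of a Gaussian is its mean, so this path is deterministic.
        return list(mean)
    rng = rng or random.Random()
    # Reparameterization: z = mean + std * eps, with eps drawn from N(0, 1).
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, logvar)
    ]


mean = [0.5, -1.0, 2.0]
logvar = [0.0, -2.0, 1.0]

z_mode = posterior_latent(mean, logvar)
z_a = posterior_latent(mean, logvar, sample_posterior=True, rng=random.Random(0))
z_b = posterior_latent(mean, logvar, sample_posterior=True, rng=random.Random(0))
```

Seeding the random source the same way twice gives identical draws, which is the usual reason to pass an explicit generator when reproducibility matters.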
Source code in mindone/diffusers/models/autoencoders/autoencoder_asym_kl.py
mindone.diffusers.models.autoencoders.autoencoder_kl.AutoencoderKLOutput
dataclass
Bases: BaseOutput
Output of AutoencoderKL encoding method.
PARAMETER | DESCRIPTION |
---|---|
latent | Encoded outputs of the encoder. |
Source code in mindone/diffusers/models/modeling_outputs.py
mindone.diffusers.models.autoencoders.vae.DecoderOutput
dataclass
Bases: BaseOutput
Output of decoding method.
PARAMETER | DESCRIPTION |
---|---|
sample | The decoded output sample from the last layer of the model. |
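The `construct` signature above defaults to `return_dict=False`, which suggests callers receive a plain tuple unless they ask for an output object. The sketch below illustrates that convention only; `SketchDecoderOutput` and `wrap_output` are hypothetical stand-ins, not the library's classes, and a list of floats stands in for the decoded tensor.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class SketchDecoderOutput:
    """Hypothetical stand-in for DecoderOutput: a single field named `sample`."""

    sample: List[float]


def wrap_output(
    decoded: List[float], return_dict: bool = False
) -> Union[SketchDecoderOutput, Tuple[List[float]]]:
    """Return the decoded sample as an output object or as a 1-tuple."""
    if return_dict:
        return SketchDecoderOutput(sample=decoded)
    return (decoded,)


decoded = [0.1, 0.2]
as_tuple = wrap_output(decoded)                      # access with as_tuple[0]
as_output = wrap_output(decoded, return_dict=True)   # access with as_output.sample
```

Code that must work with either form can index position 0 of the tuple or read the `sample` attribute, depending on which flag it passed.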
Source code in mindone/diffusers/models/autoencoders/vae.py