AsymmetricAutoencoderKL

AsymmetricAutoencoderKL is an improved, larger variational autoencoder (VAE) model with KL loss for the inpainting task, introduced in Designing a Better Asymmetric VQGAN for StableDiffusion by Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, Gang Hua.

The abstract from the paper is:

StableDiffusion is a revolutionary text-to-image generator that is causing a stir in the world of image generation and editing. Unlike traditional methods that learn a diffusion model in pixel space, StableDiffusion learns a diffusion model in the latent space via a VQGAN, ensuring both efficiency and quality. It not only supports image generation tasks, but also enables image editing for real images, such as image inpainting and local editing. However, we have observed that the vanilla VQGAN used in StableDiffusion leads to significant information loss, causing distortion artifacts even in non-edited image regions. To this end, we propose a new asymmetric VQGAN with two simple designs. Firstly, in addition to the input from the encoder, the decoder contains a conditional branch that incorporates information from task-specific priors, such as the unmasked image region in inpainting. Secondly, the decoder is much heavier than the encoder, allowing for more detailed recovery while only slightly increasing the total inference cost. The training cost of our asymmetric VQGAN is cheap, and we only need to retrain a new asymmetric decoder while keeping the vanilla VQGAN encoder and StableDiffusion unchanged. Our asymmetric VQGAN can be widely used in StableDiffusion-based inpainting and local editing methods. Extensive experiments demonstrate that it can significantly improve the inpainting and editing performance, while maintaining the original text-to-image capability. The code is available at https://github.com/buxiangzhiren/Asymmetric_VQGAN

Evaluation results can be found in section 4.1 of the original paper.
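
The "unmasked image region" prior mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the library's decoder. It only shows one way to isolate the unmasked pixels, under the assumption that a mask value of 1 marks pixels to be repainted and 0 marks pixels to keep. The helper name `unmasked_region` and the toy array shapes are hypothetical.

```python
import numpy as np


def unmasked_region(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep pixels where mask == 0 and zero out pixels where mask == 1."""
    return (1.0 - mask) * image


# Toy 4x4 single-channel image in (batch, channels, height, width) layout.
image = np.arange(16, dtype=np.float32).reshape(1, 1, 4, 4)

# Hypothetical inpainting mask: repaint the central 2x2 square.
mask = np.zeros((1, 1, 4, 4), dtype=np.float32)
mask[..., 1:3, 1:3] = 1.0

# The prior keeps the original pixels outside the square and zeros inside it.
prior = unmasked_region(image, mask)
```

A conditional decoder branch could consume `prior` together with `mask`, so that regions outside the mask can be recovered from the original pixels rather than from the lossy latent alone.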

Example Usage

from mindone.diffusers import AsymmetricAutoencoderKL, StableDiffusionInpaintPipeline
from mindone.diffusers.utils import load_image, make_image_grid


prompt = "a photo of a person with beard"
img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png"
mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png"

original_image = load_image(img_url).resize((512, 512))
mask_image = load_image(mask_url).resize((512, 512))

pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
pipe.vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5")

image = pipe(prompt=prompt, image=original_image, mask_image=mask_image).images[0]
make_image_grid([original_image, mask_image, image], rows=1, cols=3)

mindone.diffusers.models.autoencoders.autoencoder_asym_kl.AsymmetricAutoencoderKL

Bases: ModelMixin, ConfigMixin

A VAE model with KL loss for encoding images into latents and decoding latent representations into images, as described in Designing a Better Asymmetric VQGAN for StableDiffusion (https://arxiv.org/abs/2306.04632).

This model inherits from [ModelMixin]. Check the superclass documentation for its generic methods implemented for all models (such as downloading or saving).

PARAMETER DESCRIPTION
in_channels

Number of channels in the input image.

TYPE: int, *optional*, defaults to 3 DEFAULT: 3

out_channels

Number of channels in the output.

TYPE: int, *optional*, defaults to 3 DEFAULT: 3

down_block_types

Tuple of downsample block types.

TYPE: `Tuple[str]`, *optional*, defaults to `("DownEncoderBlock2D",)` DEFAULT: ('DownEncoderBlock2D',)

down_block_out_channels

Tuple of down block output channels.

TYPE: `Tuple[int]`, *optional*, defaults to `(64,)` DEFAULT: (64,)

layers_per_down_block

Number of layers per down block.

TYPE: `int`, *optional*, defaults to `1` DEFAULT: 1

up_block_types

Tuple of upsample block types.

TYPE: `Tuple[str]`, *optional*, defaults to `("UpDecoderBlock2D",)` DEFAULT: ('UpDecoderBlock2D',)

up_block_out_channels

Tuple of up block output channels.

TYPE: `Tuple[int]`, *optional*, defaults to `(64,)` DEFAULT: (64,)

layers_per_up_block

Number of layers per up block.

TYPE: `int`, *optional*, defaults to `1` DEFAULT: 1

act_fn

The activation function to use.

TYPE: `str`, *optional*, defaults to `"silu"` DEFAULT: 'silu'

latent_channels

Number of channels in the latent space.

TYPE: `int`, *optional*, defaults to 4 DEFAULT: 4

sample_size

Sample input size.

TYPE: `int`, *optional*, defaults to `32` DEFAULT: 32

norm_num_groups

Number of groups to use for the first normalization layer in ResNet blocks.

TYPE: `int`, *optional*, defaults to `32` DEFAULT: 32

scaling_factor

The component-wise standard deviation of the trained latent space computed using the first batch of the training set. This is used to scale the latent space to have unit variance when training the diffusion model. The latents are scaled with the formula z = z * scaling_factor before being passed to the diffusion model. When decoding, the latents are scaled back to the original scale with the formula: z = 1 / scaling_factor * z. For more details, refer to sections 4.3.2 and D.1 of the High-Resolution Image Synthesis with Latent Diffusion Models paper.

TYPE: `float`, *optional*, defaults to 0.18215 DEFAULT: 0.18215
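
The two scaling formulas above can be sketched with NumPy arrays standing in for `ms.Tensor` latents. The helper names `scale_latents` and `unscale_latents`, the array shape, and the synthetic data are illustrative assumptions rather than part of the library.

```python
import numpy as np

SCALING_FACTOR = 0.18215  # default value documented above


def scale_latents(z: np.ndarray, scaling_factor: float = SCALING_FACTOR) -> np.ndarray:
    """Apply z = z * scaling_factor before the diffusion model sees the latents."""
    return z * scaling_factor


def unscale_latents(z: np.ndarray, scaling_factor: float = SCALING_FACTOR) -> np.ndarray:
    """Apply z = 1 / scaling_factor * z before the latents go to the decoder."""
    return 1 / scaling_factor * z


rng = np.random.default_rng(0)

# Hypothetical raw latents (batch 1, 4 channels, 64x64) whose standard
# deviation is about 1 / SCALING_FACTOR, so scaling brings it close to 1.
raw = rng.standard_normal((1, 4, 64, 64)) / SCALING_FACTOR

scaled = scale_latents(raw)         # what a diffusion model would consume
restored = unscale_latents(scaled)  # what the decoder would consume
```

Here `restored` matches `raw` up to floating-point error, and `scaled` has roughly unit standard deviation because the synthetic data was constructed that way.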

Source code in mindone/diffusers/models/autoencoders/autoencoder_asym_kl.py
class AsymmetricAutoencoderKL(ModelMixin, ConfigMixin):
    r"""
    Designing a Better Asymmetric VQGAN for StableDiffusion https://arxiv.org/abs/2306.04632 . A VAE model with KL loss
    for encoding images into latents and decoding latent representations into images.

    This model inherits from [`ModelMixin`]. Check the superclass documentation for its generic methods implemented
    for all models (such as downloading or saving).

    Parameters:
        in_channels (int, *optional*, defaults to 3): Number of channels in the input image.
        out_channels (int,  *optional*, defaults to 3): Number of channels in the output.
        down_block_types (`Tuple[str]`, *optional*, defaults to `("DownEncoderBlock2D",)`):
            Tuple of downsample block types.
        down_block_out_channels (`Tuple[int]`, *optional*, defaults to `(64,)`):
            Tuple of down block output channels.
        layers_per_down_block (`int`, *optional*, defaults to `1`):
            Number of layers per down block.
        up_block_types (`Tuple[str]`, *optional*, defaults to `("UpDecoderBlock2D",)`):
            Tuple of upsample block types.
        up_block_out_channels (`Tuple[int]`, *optional*, defaults to `(64,)`):
            Tuple of up block output channels.
        layers_per_up_block (`int`, *optional*, defaults to `1`):
            Number of layers per up block.
        act_fn (`str`, *optional*, defaults to `"silu"`): The activation function to use.
        latent_channels (`int`, *optional*, defaults to 4): Number of channels in the latent space.
        sample_size (`int`, *optional*, defaults to `32`): Sample input size.
        norm_num_groups (`int`, *optional*, defaults to `32`):
            Number of groups to use for the first normalization layer in ResNet blocks.
        scaling_factor (`float`, *optional*, defaults to 0.18215):
            The component-wise standard deviation of the trained latent space computed using the first batch of the
            training set. This is used to scale the latent space to have unit variance when training the diffusion
            model. The latents are scaled with the formula `z = z * scaling_factor` before being passed to the
            diffusion model. When decoding, the latents are scaled back to the original scale with the formula: `z = 1
            / scaling_factor * z`. For more details, refer to sections 4.3.2 and D.1 of the [High-Resolution Image
            Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) paper.
    """

    @register_to_config
    def __init__(
        self,
        in_channels: int = 3,
        out_channels: int = 3,
        down_block_types: Tuple[str, ...] = ("DownEncoderBlock2D",),
        down_block_out_channels: Tuple[int, ...] = (64,),
        layers_per_down_block: int = 1,
        up_block_types: Tuple[str, ...] = ("UpDecoderBlock2D",),
        up_block_out_channels: Tuple[int, ...] = (64,),
        layers_per_up_block: int = 1,
        act_fn: str = "silu",
        latent_channels: int = 4,
        norm_num_groups: int = 32,
        sample_size: int = 32,
        scaling_factor: float = 0.18215,
    ) -> None:
        super().__init__()

        # pass init params to Encoder
        self.encoder = Encoder(
            in_channels=in_channels,
            out_channels=latent_channels,
            down_block_types=down_block_types,
            block_out_channels=down_block_out_channels,
            layers_per_block=layers_per_down_block,
            act_fn=act_fn,
            norm_num_groups=norm_num_groups,
            double_z=True,
        )

        # pass init params to Decoder
        self.decoder = MaskConditionDecoder(
            in_channels=latent_channels,
            out_channels=out_channels,
            up_block_types=up_block_types,
            block_out_channels=up_block_out_channels,
            layers_per_block=layers_per_up_block,
            act_fn=act_fn,
            norm_num_groups=norm_num_groups,
        )
        self.diag_gauss_dist = DiagonalGaussianDistribution()

        self.quant_conv = nn.Conv2d(2 * latent_channels, 2 * latent_channels, 1, has_bias=True)
        self.post_quant_conv = nn.Conv2d(latent_channels, latent_channels, 1, has_bias=True)

        self.use_slicing = False
        self.use_tiling = False

        self.register_to_config(block_out_channels=up_block_out_channels)
        self.register_to_config(force_upcast=False)

    def encode(self, x: ms.Tensor, return_dict: bool = False) -> Union[AutoencoderKLOutput, Tuple[ms.Tensor]]:
        h = self.encoder(x)
        moments = self.quant_conv(h)

        if not return_dict:
            return (moments,)

        return AutoencoderKLOutput(latent=moments)

    def _decode(
        self,
        z: ms.Tensor,
        image: Optional[ms.Tensor] = None,
        mask: Optional[ms.Tensor] = None,
        return_dict: bool = False,
    ) -> Union[DecoderOutput, Tuple[ms.Tensor]]:
        z = self.post_quant_conv(z)
        dec = self.decoder(z, image, mask)

        if not return_dict:
            return (dec,)

        return DecoderOutput(sample=dec)

    def decode(
        self,
        z: ms.Tensor,
        generator: Optional[np.random.Generator] = None,
        image: Optional[ms.Tensor] = None,
        mask: Optional[ms.Tensor] = None,
        return_dict: bool = False,
    ) -> Union[DecoderOutput, Tuple[ms.Tensor]]:
        decoded = self._decode(z, image, mask)[0]

        if not return_dict:
            return (decoded,)

        return DecoderOutput(sample=decoded)

    def construct(
        self,
        sample: ms.Tensor,
        mask: Optional[ms.Tensor] = None,
        sample_posterior: bool = False,
        return_dict: bool = False,
        generator: Optional[np.random.Generator] = None,
    ) -> Union[DecoderOutput, Tuple[ms.Tensor]]:
        r"""
        Args:
            sample (`ms.Tensor`): Input sample.
            mask (`ms.Tensor`, *optional*, defaults to `None`): Optional inpainting mask.
            sample_posterior (`bool`, *optional*, defaults to `False`):
                Whether to sample from the posterior.
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
        """
        x = sample
        latent = self.encode(x)[0]
        if sample_posterior:
            z = self.diag_gauss_dist.sample(latent)
        else:
            z = self.diag_gauss_dist.mode(latent)

        dec = self.decode(z, generator, sample, mask)[0]

        if not return_dict:
            return (dec,)

        return DecoderOutput(sample=dec)

mindone.diffusers.models.autoencoders.autoencoder_asym_kl.AsymmetricAutoencoderKL.construct(sample, mask=None, sample_posterior=False, return_dict=False, generator=None)

PARAMETER DESCRIPTION
sample

Input sample.

TYPE: `ms.Tensor`

mask

Optional inpainting mask.

TYPE: `ms.Tensor`, *optional*, defaults to `None` DEFAULT: None

sample_posterior

Whether to sample from the posterior.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

return_dict

Whether or not to return a [DecoderOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False
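
The `return_dict` convention can be sketched in plain Python. `ToyOutput` and `toy_construct` are hypothetical stand-ins for `DecoderOutput` and `construct`. They mimic only the return shape, not any encoding or decoding.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class ToyOutput:
    """Hypothetical stand-in for DecoderOutput with a single field."""

    sample: List[int]


def toy_construct(sample: List[int], return_dict: bool = False) -> Union[ToyOutput, Tuple[List[int]]]:
    """Mimic the return convention only; the doubling is a placeholder computation."""
    dec = [2 * v for v in sample]
    if not return_dict:
        # Default path: a one-element tuple, so callers index with [0].
        return (dec,)
    return ToyOutput(sample=dec)


as_tuple = toy_construct([1, 2, 3])[0]
as_object = toy_construct([1, 2, 3], return_dict=True).sample
```

Both access patterns yield the same value. The tuple form is the default here because `return_dict` defaults to `False`.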

Source code in mindone/diffusers/models/autoencoders/autoencoder_asym_kl.py
def construct(
    self,
    sample: ms.Tensor,
    mask: Optional[ms.Tensor] = None,
    sample_posterior: bool = False,
    return_dict: bool = False,
    generator: Optional[np.random.Generator] = None,
) -> Union[DecoderOutput, Tuple[ms.Tensor]]:
    r"""
    Args:
        sample (`ms.Tensor`): Input sample.
        mask (`ms.Tensor`, *optional*, defaults to `None`): Optional inpainting mask.
        sample_posterior (`bool`, *optional*, defaults to `False`):
            Whether to sample from the posterior.
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
    """
    x = sample
    latent = self.encode(x)[0]
    if sample_posterior:
        z = self.diag_gauss_dist.sample(latent)
    else:
        z = self.diag_gauss_dist.mode(latent)

    dec = self.decode(z, generator, sample, mask)[0]

    if not return_dict:
        return (dec,)

    return DecoderOutput(sample=dec)

mindone.diffusers.models.autoencoders.autoencoder_kl.AutoencoderKLOutput dataclass

Bases: BaseOutput

Output of AutoencoderKL encoding method.

PARAMETER DESCRIPTION
latent

Encoded outputs of Encoder represented as the mean and logvar of DiagonalGaussianDistribution. DiagonalGaussianDistribution allows for sampling latents from the distribution.

TYPE: `ms.Tensor`
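
As a rough illustration of how mean and logvar moments can be turned into a latent, the NumPy sketch below splits a moments array in half along the channel axis, then either takes the mode or draws a sample. The channel-axis layout, the clamping bounds, and the function names are assumptions made for this example. The actual `DiagonalGaussianDistribution` implementation may differ.

```python
import numpy as np


def split_moments(moments: np.ndarray):
    """Split (B, 2*C, H, W) moments into mean and log-variance halves."""
    mean, logvar = np.split(moments, 2, axis=1)
    # Clamping keeps exp() numerically stable; these bounds are an assumption.
    logvar = np.clip(logvar, -30.0, 20.0)
    return mean, logvar


def mode(moments: np.ndarray) -> np.ndarray:
    """Most likely latent: the mean of the diagonal Gaussian."""
    mean, _ = split_moments(moments)
    return mean


def sample(moments: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Reparameterised draw: mean + std * eps, with eps drawn from N(0, I)."""
    mean, logvar = split_moments(moments)
    std = np.exp(0.5 * logvar)
    return mean + std * rng.standard_normal(mean.shape)


rng = np.random.default_rng(0)
latent_channels = 4

# Hypothetical encoder output with 2 * latent_channels channels.
moments = rng.standard_normal((1, 2 * latent_channels, 8, 8))

z_mode = mode(moments)           # deterministic latent
z_sample = sample(moments, rng)  # stochastic latent
```

This mirrors the branch in `construct`: `sample_posterior=True` corresponds to the stochastic draw, and the default corresponds to the mode.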

Source code in mindone/diffusers/models/modeling_outputs.py
@dataclass
class AutoencoderKLOutput(BaseOutput):
    """
    Output of AutoencoderKL encoding method.

    Args:
        latent (`ms.Tensor`):
            Encoded outputs of `Encoder` represented as the mean and logvar of `DiagonalGaussianDistribution`.
            `DiagonalGaussianDistribution` allows for sampling latents from the distribution.
    """

    latent: ms.Tensor

mindone.diffusers.models.autoencoders.vae.DecoderOutput dataclass

Bases: BaseOutput

Output of decoding method.

PARAMETER DESCRIPTION
sample

The decoded output sample from the last layer of the model.

TYPE: `ms.Tensor` of shape `(batch_size, num_channels, height, width)`

Source code in mindone/diffusers/models/autoencoders/vae.py
@dataclass
class DecoderOutput(BaseOutput):
    r"""
    Output of decoding method.

    Args:
        sample (`ms.Tensor` of shape `(batch_size, num_channels, height, width)`):
            The decoded output sample from the last layer of the model.
    """

    sample: ms.Tensor
    commit_loss: Optional[ms.Tensor] = None