VQModel

The VQ-VAE model was introduced in Neural Discrete Representation Learning by Aaron van den Oord, Oriol Vinyals and Koray Kavukcuoglu. The model is used in 🤗 Diffusers to decode latent representations into images. Unlike AutoencoderKL, the VQModel works in a quantized latent space.

The abstract from the paper is:

Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of "posterior collapse" -- where the latents are ignored when they are paired with a powerful autoregressive decoder -- typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
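
The vector quantisation step mentioned in the abstract can be illustrated with a minimal, pure-Python sketch: each continuous encoder output is replaced by its nearest codebook entry. The function name `nearest_codebook_entry` and the toy codebook are assumptions made for illustration; this is not the `VectorQuantizer` used by `VQModel`, which operates on batched tensors.

```python
from typing import Sequence, Tuple

Vector = Tuple[float, ...]


def nearest_codebook_entry(vector: Vector, codebook: Sequence[Vector]) -> Tuple[int, Vector]:
    """Return the index and value of the codebook entry closest to `vector`.

    Closeness is measured by squared Euclidean distance.
    """

    def squared_distance(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]


# Toy codebook with three 2-dimensional entries (illustrative values only).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]

# The continuous vector is snapped to a discrete code (the index) and its entry.
index, entry = nearest_codebook_entry((0.9, 1.2), codebook)
```

Here the discrete code is the integer `index`; the decoder only ever sees entries drawn from the codebook.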

mindone.diffusers.VQModel

Bases: ModelMixin, ConfigMixin

A VQ-VAE model for decoding latent representations.

This model inherits from [ModelMixin]. Check the superclass documentation for its generic methods implemented for all models (such as downloading or saving).

PARAMETER DESCRIPTION
in_channels

Number of channels in the input image.

TYPE: int, *optional*, defaults to 3 DEFAULT: 3

out_channels

Number of channels in the output.

TYPE: int, *optional*, defaults to 3 DEFAULT: 3

down_block_types

Tuple of downsample block types.

TYPE: `Tuple[str]`, *optional*, defaults to `("DownEncoderBlock2D",)` DEFAULT: ('DownEncoderBlock2D',)

up_block_types

Tuple of upsample block types.

TYPE: `Tuple[str]`, *optional*, defaults to `("UpDecoderBlock2D",)` DEFAULT: ('UpDecoderBlock2D',)

block_out_channels

Tuple of block output channels.

TYPE: `Tuple[int]`, *optional*, defaults to `(64,)` DEFAULT: (64,)

layers_per_block

Number of layers per block.

TYPE: `int`, *optional*, defaults to `1` DEFAULT: 1

act_fn

The activation function to use.

TYPE: `str`, *optional*, defaults to `"silu"` DEFAULT: 'silu'

latent_channels

Number of channels in the latent space.

TYPE: `int`, *optional*, defaults to `3` DEFAULT: 3

sample_size

Sample input size.

TYPE: `int`, *optional*, defaults to `32` DEFAULT: 32

num_vq_embeddings

Number of codebook vectors in the VQ-VAE.

TYPE: `int`, *optional*, defaults to `256` DEFAULT: 256

norm_num_groups

Number of groups for normalization layers.

TYPE: `int`, *optional*, defaults to `32` DEFAULT: 32

vq_embed_dim

Hidden dim of codebook vectors in the VQ-VAE.

TYPE: `int`, *optional* DEFAULT: None

scaling_factor

The component-wise standard deviation of the trained latent space computed using the first batch of the training set. This is used to scale the latent space to have unit variance when training the diffusion model. The latents are scaled with the formula z = z * scaling_factor before being passed to the diffusion model. When decoding, the latents are scaled back to the original scale with the formula: z = 1 / scaling_factor * z. For more details, refer to sections 4.3.2 and D.1 of the High-Resolution Image Synthesis with Latent Diffusion Models paper.

TYPE: `float`, *optional*, defaults to `0.18215` DEFAULT: 0.18215

norm_type

Type of normalization layer to use. Can be one of "group" or "spatial".

TYPE: `str`, *optional*, defaults to `"group"` DEFAULT: 'group'
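
The two `scaling_factor` formulas in the parameter list can be checked with a small round trip. This is a sketch on plain Python floats, assuming the default value `0.18215`; the helper names `scale_latents` and `unscale_latents` are hypothetical and not part of the library.

```python
from typing import List

SCALING_FACTOR = 0.18215  # default value documented for `scaling_factor`


def scale_latents(z: List[float], scaling_factor: float = SCALING_FACTOR) -> List[float]:
    """Apply z = z * scaling_factor, as done before the diffusion model."""
    return [value * scaling_factor for value in z]


def unscale_latents(z: List[float], scaling_factor: float = SCALING_FACTOR) -> List[float]:
    """Apply z = 1 / scaling_factor * z, as done before decoding."""
    return [1 / scaling_factor * value for value in z]


latents = [2.0, -5.5, 0.25]
scaled = scale_latents(latents)
restored = unscale_latents(scaled)  # equal to `latents` up to floating-point error
```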

Source code in mindone/diffusers/models/autoencoders/vq_model.py
class VQModel(ModelMixin, ConfigMixin):
    r"""
    A VQ-VAE model for decoding latent representations.

    This model inherits from [`ModelMixin`]. Check the superclass documentation for its generic methods implemented
    for all models (such as downloading or saving).

    Parameters:
        in_channels (int, *optional*, defaults to 3): Number of channels in the input image.
        out_channels (int,  *optional*, defaults to 3): Number of channels in the output.
        down_block_types (`Tuple[str]`, *optional*, defaults to `("DownEncoderBlock2D",)`):
            Tuple of downsample block types.
        up_block_types (`Tuple[str]`, *optional*, defaults to `("UpDecoderBlock2D",)`):
            Tuple of upsample block types.
        block_out_channels (`Tuple[int]`, *optional*, defaults to `(64,)`):
            Tuple of block output channels.
        layers_per_block (`int`, *optional*, defaults to `1`): Number of layers per block.
        act_fn (`str`, *optional*, defaults to `"silu"`): The activation function to use.
        latent_channels (`int`, *optional*, defaults to `3`): Number of channels in the latent space.
        sample_size (`int`, *optional*, defaults to `32`): Sample input size.
        num_vq_embeddings (`int`, *optional*, defaults to `256`): Number of codebook vectors in the VQ-VAE.
        norm_num_groups (`int`, *optional*, defaults to `32`): Number of groups for normalization layers.
        vq_embed_dim (`int`, *optional*): Hidden dim of codebook vectors in the VQ-VAE.
        scaling_factor (`float`, *optional*, defaults to `0.18215`):
            The component-wise standard deviation of the trained latent space computed using the first batch of the
            training set. This is used to scale the latent space to have unit variance when training the diffusion
            model. The latents are scaled with the formula `z = z * scaling_factor` before being passed to the
            diffusion model. When decoding, the latents are scaled back to the original scale with the formula: `z = 1
            / scaling_factor * z`. For more details, refer to sections 4.3.2 and D.1 of the [High-Resolution Image
            Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) paper.
        norm_type (`str`, *optional*, defaults to `"group"`):
            Type of normalization layer to use. Can be one of `"group"` or `"spatial"`.
    """

    @register_to_config
    def __init__(
        self,
        in_channels: int = 3,
        out_channels: int = 3,
        down_block_types: Tuple[str, ...] = ("DownEncoderBlock2D",),
        up_block_types: Tuple[str, ...] = ("UpDecoderBlock2D",),
        block_out_channels: Tuple[int, ...] = (64,),
        layers_per_block: int = 1,
        act_fn: str = "silu",
        latent_channels: int = 3,
        sample_size: int = 32,
        num_vq_embeddings: int = 256,
        norm_num_groups: int = 32,
        vq_embed_dim: Optional[int] = None,
        scaling_factor: float = 0.18215,
        norm_type: str = "group",  # group, spatial
        mid_block_add_attention=True,
        lookup_from_codebook=False,
        force_upcast=False,
    ):
        super().__init__()

        # pass init params to Encoder
        self.encoder = Encoder(
            in_channels=in_channels,
            out_channels=latent_channels,
            down_block_types=down_block_types,
            block_out_channels=block_out_channels,
            layers_per_block=layers_per_block,
            act_fn=act_fn,
            norm_num_groups=norm_num_groups,
            double_z=False,
            mid_block_add_attention=mid_block_add_attention,
        )

        vq_embed_dim = vq_embed_dim if vq_embed_dim is not None else latent_channels

        self.quant_conv = nn.Conv2d(latent_channels, vq_embed_dim, 1, has_bias=True)
        self.quantize = VectorQuantizer(num_vq_embeddings, vq_embed_dim, beta=0.25, remap=None, sane_index_shape=False)
        self.post_quant_conv = nn.Conv2d(vq_embed_dim, latent_channels, 1, has_bias=True)

        # pass init params to Decoder
        self.decoder = Decoder(
            in_channels=latent_channels,
            out_channels=out_channels,
            up_block_types=up_block_types,
            block_out_channels=block_out_channels,
            layers_per_block=layers_per_block,
            act_fn=act_fn,
            norm_num_groups=norm_num_groups,
            norm_type=norm_type,
            mid_block_add_attention=mid_block_add_attention,
        )

    def encode(self, x: ms.Tensor, return_dict: bool = False):
        h = self.encoder(x)
        h = self.quant_conv(h)

        if not return_dict:
            return (h,)

        return VQEncoderOutput(latents=h)

    def decode(self, h: ms.Tensor, force_not_quantize: bool = False, return_dict: bool = False, shape=None):
        # also go through quantization layer
        if not force_not_quantize:
            quant, commit_loss, _ = self.quantize(h)
        elif self.config["lookup_from_codebook"]:
            quant = self.quantize.get_codebook_entry(h, shape)
            commit_loss = ops.zeros((h.shape[0],), dtype=h.dtype)
        else:
            quant = h
            commit_loss = ops.zeros((h.shape[0],), dtype=h.dtype)
        quant2 = self.post_quant_conv(quant)
        dec = self.decoder(quant2, quant if self.config["norm_type"] == "spatial" else None)

        if not return_dict:
            return dec, commit_loss

        return DecoderOutput(sample=dec, commit_loss=commit_loss)

    def construct(self, sample: ms.Tensor, return_dict: bool = False) -> Union[DecoderOutput, Tuple[ms.Tensor, ...]]:
        r"""
        The [`VQModel`] forward method.

        Args:
            sample (`ms.Tensor`): Input sample.
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`DecoderOutput`] instead of a plain tuple.

        Returns:
            [`DecoderOutput`] or `tuple`:
                If return_dict is True, a [`DecoderOutput`] is returned, otherwise a plain `tuple` of
                `(sample, commit_loss)` is returned.
        """

        h = self.encode(sample)[0]
        dec = self.decode(h)

        if not return_dict:
            return dec  # (dec.sample, dec.commit_loss)

        return DecoderOutput(sample=dec[0], commit_loss=dec[1])
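
The three branches at the start of `decode` decide what is handed to `post_quant_conv`. The sketch below mirrors only that branching on plain Python lists; `select_decoder_input`, `toy_quantize`, and the toy codebook are hypothetical stand-ins, and the real method works on `ms.Tensor` inputs and also returns a commitment loss.

```python
from typing import Callable, List, Sequence


def select_decoder_input(
    h: list,
    quantize_fn: Callable[[list], list],
    codebook: Sequence[tuple],
    force_not_quantize: bool = False,
    lookup_from_codebook: bool = False,
) -> list:
    """Pick the decoder input the way the `decode` branches do."""
    if not force_not_quantize:
        # Default path: run the latents through the quantization layer.
        return quantize_fn(h)
    if lookup_from_codebook:
        # `h` is treated as codebook indices and replaced by the entries.
        return [codebook[i] for i in h]
    # Otherwise the latents are passed through unchanged.
    return h


def toy_quantize(h: List[float]) -> List[float]:
    """Stand-in quantizer: snap each value to the nearest integer."""
    return [float(round(v)) for v in h]


codebook = [(0.0, 0.0), (1.0, 1.0)]

quantized = select_decoder_input([0.2, 0.9], toy_quantize, codebook)
looked_up = select_decoder_input(
    [1, 0], toy_quantize, codebook, force_not_quantize=True, lookup_from_codebook=True
)
passthrough = select_decoder_input([0.2, 0.9], toy_quantize, codebook, force_not_quantize=True)
```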

mindone.diffusers.VQModel.construct(sample, return_dict=False)

The [VQModel] forward method.

PARAMETER DESCRIPTION
sample

Input sample.

TYPE: `ms.Tensor`

return_dict

Whether or not to return a [DecoderOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

RETURNS DESCRIPTION
Union[DecoderOutput, Tuple[Tensor, ...]]

[DecoderOutput] or tuple: If return_dict is True, a [DecoderOutput] is returned, otherwise a plain tuple of (sample, commit_loss) is returned.

Source code in mindone/diffusers/models/autoencoders/vq_model.py
def construct(self, sample: ms.Tensor, return_dict: bool = False) -> Union[DecoderOutput, Tuple[ms.Tensor, ...]]:
    r"""
    The [`VQModel`] forward method.

    Args:
        sample (`ms.Tensor`): Input sample.
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`DecoderOutput`] instead of a plain tuple.

    Returns:
        [`DecoderOutput`] or `tuple`:
            If return_dict is True, a [`DecoderOutput`] is returned, otherwise a plain `tuple` of
            `(sample, commit_loss)` is returned.
    """

    h = self.encode(sample)[0]
    dec = self.decode(h)

    if not return_dict:
        return dec  # (dec.sample, dec.commit_loss)

    return DecoderOutput(sample=dec[0], commit_loss=dec[1])
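
The signature of `construct` sets `return_dict=False`, so a call without arguments yields a plain tuple. The following sketch reproduces only that return convention with a hypothetical stand-in class (`FakeDecoderOutput`) and a placeholder computation; it does not import or run the real model.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class FakeDecoderOutput:
    """Hypothetical stand-in for `DecoderOutput` with the same two fields."""

    sample: List[float]
    commit_loss: float


def forward(
    sample: List[float], return_dict: bool = False
) -> Union[FakeDecoderOutput, Tuple[List[float], float]]:
    """Mimic the tuple-or-object return convention of `construct`."""
    decoded = list(sample)  # placeholder for encode -> quantize -> decode
    commit_loss = 0.0  # placeholder for the commitment loss
    if not return_dict:
        return decoded, commit_loss
    return FakeDecoderOutput(sample=decoded, commit_loss=commit_loss)


# Default call: unpack the tuple positionally.
decoded, commit_loss = forward([0.1, 0.2])

# Opt-in call: access the same values by attribute name.
output = forward([0.1, 0.2], return_dict=True)
```

Attribute access avoids depending on tuple order, while the tuple form is the default in this port.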

mindone.diffusers.models.vq_model.VQEncoderOutput

Bases: VQEncoderOutput

Source code in mindone/diffusers/models/vq_model.py
class VQEncoderOutput(VQEncoderOutput):
    def __init__(self, *args, **kwargs):
        deprecation_message = "Importing `VQEncoderOutput` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from mindone.diffusers.models.autoencoders.vq_model import VQEncoderOutput`, instead."  # noqa: E501
        deprecate("VQEncoderOutput", "0.31", deprecation_message)
        super().__init__(*args, **kwargs)