VQModel
The VQ-VAE model was introduced in Neural Discrete Representation Learning by Aaron van den Oord, Oriol Vinyals and Koray Kavukcuoglu. The model is used in 🤗 Diffusers to decode latent representations into images. Unlike AutoencoderKL, the VQModel works in a quantized latent space.
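A minimal usage sketch (not taken from the upstream docs) of loading a pretrained VQModel and decoding latents into an image tensor; the checkpoint name and subfolder are illustrative assumptions:

```python
import mindspore as ms
from mindone.diffusers import VQModel

# Illustrative checkpoint: any pipeline repo that ships a VQ-VAE in a "vqvae" subfolder.
vqvae = VQModel.from_pretrained("microsoft/vq-diffusion-ithq", subfolder="vqvae")

# Random stand-in latents; in practice these come from the encoder or a diffusion process.
latents = ms.ops.randn(1, vqvae.config.latent_channels, 32, 32, dtype=ms.float32)

# Dequantize and decode back to image space; return_dict=False yields a plain tuple.
image = vqvae.decode(latents, return_dict=False)[0]
```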
The abstract from the paper is:
Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of "posterior collapse" -- where the latents are ignored when they are paired with a powerful autoregressive decoder -- typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
mindone.diffusers.VQModel
Bases: ModelMixin, ConfigMixin
A VQ-VAE model for decoding latent representations.
This model inherits from ModelMixin. Check the superclass documentation for its generic methods implemented for all models (such as downloading or saving).
| PARAMETER | DESCRIPTION |
|---|---|
| `in_channels` | Number of channels in the input image. TYPE: `int` |
| `out_channels` | Number of channels in the output. TYPE: `int` |
| `down_block_types` | Tuple of downsample block types. TYPE: `Tuple[str]` |
| `up_block_types` | Tuple of upsample block types. TYPE: `Tuple[str]` |
| `block_out_channels` | Tuple of block output channels. TYPE: `Tuple[int]` |
| `layers_per_block` | Number of layers per block. TYPE: `int` |
| `act_fn` | The activation function to use. TYPE: `str` |
| `latent_channels` | Number of channels in the latent space. TYPE: `int` |
| `sample_size` | Sample input size. TYPE: `int` |
| `num_vq_embeddings` | Number of codebook vectors in the VQ-VAE. TYPE: `int` |
| `norm_num_groups` | Number of groups for normalization layers. TYPE: `int` |
| `vq_embed_dim` | Hidden dim of codebook vectors in the VQ-VAE. TYPE: `int` |
| `scaling_factor` | The component-wise standard deviation of the trained latent space, computed using the first batch of the training set. This is used to scale the latent space to have unit variance when training the diffusion model. The latents are scaled with the formula `z = z * scaling_factor` before being passed to the diffusion model and scaled back with `z = 1 / scaling_factor * z` when decoding. TYPE: `float` |
| `norm_type` | Type of normalization layer to use. Can be one of `"group"` or `"spatial"`. TYPE: `str` |
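As a quick illustration of these arguments, here is a hedged sketch that instantiates a small VQModel; the values mirror the upstream diffusers defaults and are assumptions, not taken from this page:

```python
from mindone.diffusers import VQModel

# A small VQ-VAE: one encoder/decoder block, 256 codebook vectors.
model = VQModel(
    in_channels=3,
    out_channels=3,
    down_block_types=("DownEncoderBlock2D",),
    up_block_types=("UpDecoderBlock2D",),
    block_out_channels=(64,),
    layers_per_block=1,
    act_fn="silu",
    latent_channels=3,
    sample_size=32,
    num_vq_embeddings=256,
    norm_num_groups=32,
    norm_type="group",
)
```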
Source code in mindone/diffusers/models/autoencoders/vq_model.py
mindone.diffusers.VQModel.construct(sample, return_dict=False)
The VQModel forward method.
| PARAMETER | DESCRIPTION |
|---|---|
| `sample` | Input sample. TYPE: `Tensor` |
| `return_dict` | Whether or not to return a `DecoderOutput` instead of a plain tuple. TYPE: `bool`, defaults to `False` |

| RETURNS | DESCRIPTION |
|---|---|
| `Union[DecoderOutput, Tuple[Tensor, ...]]` | If `return_dict` is `True`, a `DecoderOutput` is returned, otherwise a plain `tuple` is returned. |
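A short sketch of calling the forward method (assuming a `model` like the one constructed above); with the default `return_dict=False` the reconstruction is the first element of the returned tuple:

```python
import mindspore as ms

sample = ms.ops.randn(1, 3, 32, 32, dtype=ms.float32)  # (batch, channels, height, width)
reconstruction = model(sample)[0]  # encode -> quantize -> decode
```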
Source code in mindone/diffusers/models/autoencoders/vq_model.py
mindone.diffusers.models.vq_model.VQEncoderOutput
Bases: VQEncoderOutput
Source code in mindone/diffusers/models/vq_model.py
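In the upstream diffusers API, `VQEncoderOutput` wraps the unquantized latents returned by `VQModel.encode`; a hedged sketch assuming the same interface here:

```python
# With return_dict=True an output object with a `.latents` field is returned;
# with return_dict=False a plain tuple is returned instead.
latents = model.encode(sample, return_dict=False)[0]
```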