Würstchen

Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models is by Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher Pal, and Marc Aubreville.

The abstract from the paper is:

We introduce Würstchen, a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for large-scale text-to-image diffusion models. A key contribution of our work is to develop a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation used to guide the diffusion process. This highly compressed representation of an image provides much more detailed guidance compared to latent representations of language and this significantly reduces the computational requirements to achieve state-of-the-art results. Our approach also improves the quality of text-conditioned image generation based on our user preference study. The training requirements of our approach consists of 24,602 A100-GPU hours - compared to Stable Diffusion 2.1's 200,000 GPU hours. Our approach also requires less training data to achieve these results. Furthermore, our compact latent representations allows us to perform inference over twice as fast, slashing the usual costs and carbon footprint of a state-of-the-art (SOTA) diffusion model significantly, without compromising the end performance. In a broader comparison against SOTA models our approach is substantially more efficient and compares favorably in terms of image quality. We believe that this work motivates more emphasis on the prioritization of both performance and computational accessibility.

Würstchen Overview

Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by orders of magnitude. Training on 1024x1024 images is far more expensive than training on 32x32. Usually, other works make use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial compression. This was unseen before, because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a two-stage compression, which we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a diffusion autoencoder (more details can be found in the paper). A third model, Stage C, is learned in that highly compressed latent space. This training requires a fraction of the compute used for current top-performing models, while also allowing cheaper and faster inference.
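
To get a sense of the scale, here is a minimal sketch (plain Python, no model required) of the latent sizes involved, assuming the f4 VQGAN the paper describes for Stage A and the overall spatial compression factor of roughly 42.67 used by the released pipelines:

from math import ceil

height, width = 1024, 1024

# Stage A (an f4 VQGAN) compresses 4x spatially: 1024x1024 -> 256x256
stage_a_latents = (height // 4, width // 4)

# Stage C operates at ~42.67x total spatial compression: 1024x1024 -> 24x24
resolution_multiple = 42.67  # default of WuerstchenPriorPipeline
stage_c_latents = (ceil(height / resolution_multiple), ceil(width / resolution_multiple))

print(stage_a_latents)  # (256, 256)
print(stage_c_latents)  # (24, 24)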

Würstchen v2 comes to Diffusers

After the initial paper release, we have improved numerous things in the architecture, training and sampling, making Würstchen competitive with current state-of-the-art models in many ways. We are excited to release this new version together with Diffusers. Here is a list of the improvements.

  • Higher resolution (1024x1024 up to 2048x2048)
  • Faster inference
  • Multi Aspect Resolution Sampling
  • Better quality

We are releasing 3 checkpoints for the text-conditional image generation model (Stage C). Those are:

  • v2-base
  • v2-aesthetic
  • (default) v2-interpolated (50% interpolation between v2-base and v2-aesthetic)

We recommend using v2-interpolated, as it has a nice touch of both photorealism and aesthetics. Use v2-base for fine-tuning, as it does not have a style bias, and use v2-aesthetic for very artistic generations.
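
For illustration, the 50% interpolation simply averages the weights of the two checkpoints. A minimal sketch, assuming base and aesthetic are the two Stage C state dicts already loaded into memory (both variable names are hypothetical):

# hypothetical state dicts of v2-base and v2-aesthetic (identical keys and shapes)
interpolated = {name: 0.5 * base[name] + 0.5 * aesthetic[name] for name in base}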

Text-to-Image Generation

For the sake of usability, Würstchen can be used with a single pipeline. This pipeline can be used as follows:

import mindspore as ms
from mindone.diffusers import WuerstchenCombinedPipeline
from mindone.diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS

pipe = WuerstchenCombinedPipeline.from_pretrained(
    "warp-ai/wuerstchen",
    mindspore_dtype=ms.float16
)
caption = "Anthropomorphic cat dressed as a fire fighter"
images = pipe(
    caption,
    width=1024,
    height=1536,
    prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,
    prior_guidance_scale=4.0,
    num_images_per_prompt=2,
)[0]
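
Since output_type defaults to "pil", images is a list of PIL.Image.Image objects that can be saved directly:

for i, image in enumerate(images):
    image.save(f"wuerstchen_{i}.png")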

For explanation purposes, we can also initialize the two main pipelines of Würstchen individually. Würstchen consists of 3 stages: Stage C, Stage B, and Stage A. Each has a different job, and they only work together. When generating text-conditional images, Stage C first generates the latents in a very compressed latent space. This is what happens in the prior_pipeline. Afterwards, the generated latents are passed to Stage B, which decompresses them into the larger latent space of a VQGAN. These latents can then be decoded by Stage A, which is a VQGAN, into pixel space. Stage B and Stage A are both encapsulated in the decoder_pipeline. For more details, take a look at the paper.

import mindspore as ms
from mindone.diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline
from mindone.diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS

dtype = ms.float16
num_images_per_prompt = 2
prior_pipeline = WuerstchenPriorPipeline.from_pretrained(
    "warp-ai/wuerstchen-prior",
    mindspore_dtype=dtype
)
decoder_pipeline = WuerstchenDecoderPipeline.from_pretrained(
    "warp-ai/wuerstchen",
    mindspore_dtype=dtype
)

caption = "Anthropomorphic cat dressed as a fire fighter"
negative_prompt = ""

prior_output = prior_pipeline(
    prompt=caption,
    height=1024,
    width=1536,
    timesteps=DEFAULT_STAGE_C_TIMESTEPS,
    negative_prompt=negative_prompt,
    guidance_scale=4.0,
    num_images_per_prompt=num_images_per_prompt,
)

decoder_output = decoder_pipeline(
    image_embeddings=prior_output[0],
    prompt=caption,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
)[0]

decoder_output
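
Because the two pipelines are decoupled, the output of the (comparatively expensive) prior can be reused: the same image embeddings can, for instance, be decoded a second time with classifier-free guidance enabled on the decoder. The guidance value below is purely illustrative:

more_images = decoder_pipeline(
    image_embeddings=prior_output[0],
    prompt=caption,
    negative_prompt=negative_prompt,
    guidance_scale=1.1,  # values > 1 enable classifier-free guidance in the decoder
    output_type="pil",
)[0]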

Limitations

  • Due to the high compression employed by Würstchen, generations can lack a good amount of detail. To the human eye, this is especially noticeable in faces, hands, etc.
  • Images can only be generated in 128-pixel steps, e.g. the next higher resolution after 1024x1024 is 1152x1152 (see the sketch after this list for snapping an arbitrary size to this grid)
  • The model lacks the ability to render correct text in images
  • The model often does not achieve photorealism
  • Difficult compositional prompts are hard for the model
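
A minimal helper, for illustration, that snaps an arbitrary target size to the nearest valid resolution (a multiple of 128):

def snap_to_128(size: int) -> int:
    # round to the nearest multiple of 128, with 128 as the minimum
    return max(128, round(size / 128) * 128)

print(snap_to_128(1100))  # 1152
print(snap_to_128(1000))  # 1024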

The original codebase, as well as experimental ideas, can be found at dome272/Wuerstchen.

mindone.diffusers.WuerstchenCombinedPipeline

Bases: DiffusionPipeline

Combined Pipeline for text-to-image generation using Wuerstchen

This model inherits from [DiffusionPipeline]. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

PARAMETER DESCRIPTION
tokenizer

The decoder tokenizer to be used for text inputs.

TYPE: `CLIPTokenizer`

text_encoder

The decoder text encoder to be used for text inputs.

TYPE: `CLIPTextModel`

decoder

The decoder model to be used for decoder image generation pipeline.

TYPE: `WuerstchenDiffNeXt`

scheduler

The scheduler to be used for decoder image generation pipeline.

TYPE: `DDPMWuerstchenScheduler`

vqgan

The VQGAN model to be used for decoder image generation pipeline.

TYPE: `PaellaVQModel`

prior_tokenizer

The prior tokenizer to be used for text inputs.

TYPE: `CLIPTokenizer`

prior_text_encoder

The prior text encoder to be used for text inputs.

TYPE: `CLIPTextModel`

prior_prior

The prior model to be used for prior pipeline.

TYPE: `WuerstchenPrior`

prior_scheduler

The scheduler to be used for prior pipeline.

TYPE: `DDPMWuerstchenScheduler`

Source code in mindone/diffusers/pipelines/wuerstchen/pipeline_wuerstchen_combined.py
class WuerstchenCombinedPipeline(DiffusionPipeline):
    """
    Combined Pipeline for text-to-image generation using Wuerstchen

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
    library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

    Args:
        tokenizer (`CLIPTokenizer`):
            The decoder tokenizer to be used for text inputs.
        text_encoder (`CLIPTextModel`):
            The decoder text encoder to be used for text inputs.
        decoder (`WuerstchenDiffNeXt`):
            The decoder model to be used for decoder image generation pipeline.
        scheduler (`DDPMWuerstchenScheduler`):
            The scheduler to be used for decoder image generation pipeline.
        vqgan (`PaellaVQModel`):
            The VQGAN model to be used for decoder image generation pipeline.
        prior_tokenizer (`CLIPTokenizer`):
            The prior tokenizer to be used for text inputs.
        prior_text_encoder (`CLIPTextModel`):
            The prior text encoder to be used for text inputs.
        prior_prior (`WuerstchenPrior`):
            The prior model to be used for prior pipeline.
        prior_scheduler (`DDPMWuerstchenScheduler`):
            The scheduler to be used for prior pipeline.
    """

    _load_connected_pipes = True

    def __init__(
        self,
        tokenizer: CLIPTokenizer,
        text_encoder: CLIPTextModel,
        decoder: WuerstchenDiffNeXt,
        scheduler: DDPMWuerstchenScheduler,
        vqgan: PaellaVQModel,
        prior_tokenizer: CLIPTokenizer,
        prior_text_encoder: CLIPTextModel,
        prior_prior: WuerstchenPrior,
        prior_scheduler: DDPMWuerstchenScheduler,
    ):
        super().__init__()

        self.register_modules(
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            decoder=decoder,
            scheduler=scheduler,
            vqgan=vqgan,
            prior_prior=prior_prior,
            prior_text_encoder=prior_text_encoder,
            prior_tokenizer=prior_tokenizer,
            prior_scheduler=prior_scheduler,
        )
        self.prior_pipe = WuerstchenPriorPipeline(
            prior=prior_prior,
            text_encoder=prior_text_encoder,
            tokenizer=prior_tokenizer,
            scheduler=prior_scheduler,
        )
        self.decoder_pipe = WuerstchenDecoderPipeline(
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            decoder=decoder,
            scheduler=scheduler,
            vqgan=vqgan,
        )

    def enable_xformers_memory_efficient_attention(self, attention_op: Optional[Callable] = None):
        self.decoder_pipe.enable_xformers_memory_efficient_attention(attention_op)

    def progress_bar(self, iterable=None, total=None):
        self.prior_pipe.progress_bar(iterable=iterable, total=total)
        self.decoder_pipe.progress_bar(iterable=iterable, total=total)

    def set_progress_bar_config(self, **kwargs):
        self.prior_pipe.set_progress_bar_config(**kwargs)
        self.decoder_pipe.set_progress_bar_config(**kwargs)

    def __call__(
        self,
        prompt: Optional[Union[str, List[str]]] = None,
        height: int = 512,
        width: int = 512,
        prior_num_inference_steps: int = 60,
        prior_timesteps: Optional[List[float]] = None,
        prior_guidance_scale: float = 4.0,
        num_inference_steps: int = 12,
        decoder_timesteps: Optional[List[float]] = None,
        decoder_guidance_scale: float = 0.0,
        negative_prompt: Optional[Union[str, List[str]]] = None,
        prompt_embeds: Optional[ms.Tensor] = None,
        negative_prompt_embeds: Optional[ms.Tensor] = None,
        num_images_per_prompt: int = 1,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        latents: Optional[ms.Tensor] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
        prior_callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
        prior_callback_on_step_end_tensor_inputs: List[str] = ["latents"],
        callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
        **kwargs,
    ):
        """
        Function invoked when calling the pipeline for generation.

        Args:
            prompt (`str` or `List[str]`):
                The prompt or prompts to guide the image generation for the prior and decoder.
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
                if `guidance_scale` is less than `1`).
            prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated text embeddings for the prior. Can be used to easily tweak text inputs, *e.g.* prompt
                weighting. If not provided, text embeddings will be generated from `prompt` input argument.
            negative_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated negative text embeddings for the prior. Can be used to easily tweak text inputs, *e.g.*
                prompt weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt`
                input argument.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            height (`int`, *optional*, defaults to 512):
                The height in pixels of the generated image.
            width (`int`, *optional*, defaults to 512):
                The width in pixels of the generated image.
            prior_guidance_scale (`float`, *optional*, defaults to 4.0):
                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
                `prior_guidance_scale` is defined as `w` of equation 2. of [Imagen
                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting
                `prior_guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are
                closely linked to the text `prompt`, usually at the expense of lower image quality.
            prior_num_inference_steps (`Union[int, Dict[float, int]]`, *optional*, defaults to 60):
                The number of prior denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference. For more specific timestep spacing, you can pass customized
                `prior_timesteps`
            num_inference_steps (`int`, *optional*, defaults to 12):
                The number of decoder denoising steps. More denoising steps usually lead to a higher quality image at
                the expense of slower inference. For more specific timestep spacing, you can pass customized
                `timesteps`
            prior_timesteps (`List[float]`, *optional*):
                Custom timesteps to use for the denoising process for the prior. If not defined, equal spaced
                `prior_num_inference_steps` timesteps are used. Must be in descending order.
            decoder_timesteps (`List[float]`, *optional*):
                Custom timesteps to use for the denoising process for the decoder. If not defined, equal spaced
                `num_inference_steps` timesteps are used. Must be in descending order.
            decoder_guidance_scale (`float`, *optional*, defaults to 0.0):
                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
                `guidance_scale` is defined as `w` of equation 2. of [Imagen
                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
                1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`,
                usually at the expense of lower image quality.
            generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
                One or a list of [np.random.Generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
                to make generation deterministic.
            latents (`ms.Tensor`, *optional*):
                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                tensor will be generated by sampling using the supplied random `generator`.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
                (`np.array`) or `"ms"` (`ms.Tensor`).
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
            prior_callback_on_step_end (`Callable`, *optional*):
                A function that is called at the end of each denoising step during inference, with the following
                arguments: `prior_callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
                callback_kwargs: Dict)`.
            prior_callback_on_step_end_tensor_inputs (`List`, *optional*):
                The list of tensor inputs for the `prior_callback_on_step_end` function. The tensors specified in the
                list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in
                the `._callback_tensor_inputs` attribute of your pipeline class.
            callback_on_step_end (`Callable`, *optional*):
                A function that is called at the end of each denoising step during inference, with the following
                arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
                `callback_on_step_end_tensor_inputs`.
            callback_on_step_end_tensor_inputs (`List`, *optional*):
                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
                `._callback_tensor_inputs` attribute of your pipeline class.

        Examples:

        Returns:
            [`~pipelines.ImagePipelineOutput`] if `return_dict` is True, otherwise a `tuple`. When returning a
            tuple, the first element is a list with the generated images.
        """
        prior_kwargs = {}
        if kwargs.get("prior_callback", None) is not None:
            prior_kwargs["callback"] = kwargs.pop("prior_callback")
            deprecate(
                "prior_callback",
                "1.0.0",
                "Passing `prior_callback` as an input argument to `__call__` is deprecated, consider use `prior_callback_on_step_end`",
            )
        if kwargs.get("prior_callback_steps", None) is not None:
            deprecate(
                "prior_callback_steps",
                "1.0.0",
                "Passing `prior_callback_steps` as an input argument to `__call__` is deprecated, consider use `prior_callback_on_step_end`",
            )
            prior_kwargs["callback_steps"] = kwargs.pop("prior_callback_steps")

        prior_outputs = self.prior_pipe(
            prompt=prompt if prompt_embeds is None else None,
            height=height,
            width=width,
            num_inference_steps=prior_num_inference_steps,
            timesteps=prior_timesteps,
            guidance_scale=prior_guidance_scale,
            negative_prompt=negative_prompt if negative_prompt_embeds is None else None,
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
            num_images_per_prompt=num_images_per_prompt,
            generator=generator,
            latents=latents,
            output_type="ms",
            return_dict=False,
            callback_on_step_end=prior_callback_on_step_end,
            callback_on_step_end_tensor_inputs=prior_callback_on_step_end_tensor_inputs,
            **prior_kwargs,
        )
        image_embeddings = prior_outputs[0]

        outputs = self.decoder_pipe(
            image_embeddings=image_embeddings,
            prompt=prompt if prompt is not None else "",
            num_inference_steps=num_inference_steps,
            timesteps=decoder_timesteps,
            guidance_scale=decoder_guidance_scale,
            negative_prompt=negative_prompt,
            generator=generator,
            output_type=output_type,
            return_dict=return_dict,
            callback_on_step_end=callback_on_step_end,
            callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
            **kwargs,
        )

        return outputs

mindone.diffusers.WuerstchenCombinedPipeline.__call__(prompt=None, height=512, width=512, prior_num_inference_steps=60, prior_timesteps=None, prior_guidance_scale=4.0, num_inference_steps=12, decoder_timesteps=None, decoder_guidance_scale=0.0, negative_prompt=None, prompt_embeds=None, negative_prompt_embeds=None, num_images_per_prompt=1, generator=None, latents=None, output_type='pil', return_dict=False, prior_callback_on_step_end=None, prior_callback_on_step_end_tensor_inputs=['latents'], callback_on_step_end=None, callback_on_step_end_tensor_inputs=['latents'], **kwargs)

Function invoked when calling the pipeline for generation.

PARAMETER DESCRIPTION
prompt

The prompt or prompts to guide the image generation for the prior and decoder.

TYPE: `str` or `List[str]` DEFAULT: None

negative_prompt

The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

prompt_embeds

Pre-generated text embeddings for the prior. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

negative_prompt_embeds

Pre-generated negative text embeddings for the prior. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

num_images_per_prompt

The number of images to generate per prompt.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

height

The height in pixels of the generated image.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

width

The width in pixels of the generated image.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

prior_guidance_scale

Guidance scale as defined in Classifier-Free Diffusion Guidance. prior_guidance_scale is defined as w in equation 2 of the Imagen paper. Guidance scale is enabled by setting prior_guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.

TYPE: `float`, *optional*, defaults to 4.0 DEFAULT: 4.0

prior_num_inference_steps

The number of prior denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. For more specific timestep spacing, you can pass customized prior_timesteps.

TYPE: `Union[int, Dict[float, int]]`, *optional*, defaults to 60 DEFAULT: 60

num_inference_steps

The number of decoder denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. For more specific timestep spacing, you can pass customized timesteps.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

prior_timesteps

Custom timesteps to use for the denoising process for the prior. If not defined, equal spaced prior_num_inference_steps timesteps are used. Must be in descending order.

TYPE: `List[float]`, *optional* DEFAULT: None

decoder_timesteps

Custom timesteps to use for the denoising process for the decoder. If not defined, equal spaced num_inference_steps timesteps are used. Must be in descending order.

TYPE: `List[float]`, *optional* DEFAULT: None

decoder_guidance_scale

Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

generator

One or a list of np.random.Generator(s) to make generation deterministic.

TYPE: `np.random.Generator` or `List[np.random.Generator]`, *optional* DEFAULT: None

latents

Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

output_type

The output format of the generated image. Choose between: "pil" (PIL.Image.Image), "np" (np.array) or "ms" (ms.Tensor).

TYPE: `str`, *optional*, defaults to `"pil"` DEFAULT: 'pil'

return_dict

Whether or not to return a [~pipelines.ImagePipelineOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

prior_callback_on_step_end

A function that is called at the end of each denoising step during inference, with the following arguments: prior_callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict).

TYPE: `Callable`, *optional* DEFAULT: None

prior_callback_on_step_end_tensor_inputs

The list of tensor inputs for the prior_callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

TYPE: `List`, *optional* DEFAULT: ['latents']

callback_on_step_end

A function that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.

TYPE: `Callable`, *optional* DEFAULT: None

callback_on_step_end_tensor_inputs

The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

TYPE: `List`, *optional* DEFAULT: ['latents']

RETURNS DESCRIPTION

[~pipelines.ImagePipelineOutput] if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.
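
As an illustration of the step-end callbacks, here is a minimal sketch (the function name is arbitrary) that logs the prior's progress, reusing pipe and caption from the example above. Note that the callback is expected to return the (possibly modified) callback_kwargs:

def log_prior_step(pipe, step, timestep, callback_kwargs):
    # "latents" is available because it is listed in
    # prior_callback_on_step_end_tensor_inputs by default
    latents = callback_kwargs["latents"]
    print(f"prior step {step}, t={timestep}, latents shape: {latents.shape}")
    return callback_kwargs

images = pipe(
    caption,
    prior_callback_on_step_end=log_prior_step,
)[0]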

Source code in mindone/diffusers/pipelines/wuerstchen/pipeline_wuerstchen_combined.py
def __call__(
    self,
    prompt: Optional[Union[str, List[str]]] = None,
    height: int = 512,
    width: int = 512,
    prior_num_inference_steps: int = 60,
    prior_timesteps: Optional[List[float]] = None,
    prior_guidance_scale: float = 4.0,
    num_inference_steps: int = 12,
    decoder_timesteps: Optional[List[float]] = None,
    decoder_guidance_scale: float = 0.0,
    negative_prompt: Optional[Union[str, List[str]]] = None,
    prompt_embeds: Optional[ms.Tensor] = None,
    negative_prompt_embeds: Optional[ms.Tensor] = None,
    num_images_per_prompt: int = 1,
    generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
    latents: Optional[ms.Tensor] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
    prior_callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
    prior_callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    **kwargs,
):
    """
    Function invoked when calling the pipeline for generation.

    Args:
        prompt (`str` or `List[str]`):
            The prompt or prompts to guide the image generation for the prior and decoder.
        negative_prompt (`str` or `List[str]`, *optional*):
            The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
            if `guidance_scale` is less than `1`).
        prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated text embeddings for the prior. Can be used to easily tweak text inputs, *e.g.* prompt
            weighting. If not provided, text embeddings will be generated from `prompt` input argument.
        negative_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated negative text embeddings for the prior. Can be used to easily tweak text inputs, *e.g.*
            prompt weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt`
            input argument.
        num_images_per_prompt (`int`, *optional*, defaults to 1):
            The number of images to generate per prompt.
        height (`int`, *optional*, defaults to 512):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to 512):
            The width in pixels of the generated image.
        prior_guidance_scale (`float`, *optional*, defaults to 4.0):
            Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
            `prior_guidance_scale` is defined as `w` of equation 2. of [Imagen
            Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting
            `prior_guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are
            closely linked to the text `prompt`, usually at the expense of lower image quality.
        prior_num_inference_steps (`Union[int, Dict[float, int]]`, *optional*, defaults to 60):
            The number of prior denoising steps. More denoising steps usually lead to a higher quality image at the
            expense of slower inference. For more specific timestep spacing, you can pass customized
            `prior_timesteps`
        num_inference_steps (`int`, *optional*, defaults to 12):
            The number of decoder denoising steps. More denoising steps usually lead to a higher quality image at
            the expense of slower inference. For more specific timestep spacing, you can pass customized
            `timesteps`
        prior_timesteps (`List[float]`, *optional*):
            Custom timesteps to use for the denoising process for the prior. If not defined, equal spaced
            `prior_num_inference_steps` timesteps are used. Must be in descending order.
        decoder_timesteps (`List[float]`, *optional*):
            Custom timesteps to use for the denoising process for the decoder. If not defined, equal spaced
            `num_inference_steps` timesteps are used. Must be in descending order.
        decoder_guidance_scale (`float`, *optional*, defaults to 0.0):
            Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
            `guidance_scale` is defined as `w` of equation 2. of [Imagen
            Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
            1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`,
            usually at the expense of lower image quality.
        generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
            One or a list of [np.random.Generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
            to make generation deterministic.
        latents (`ms.Tensor`, *optional*):
            Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
            generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
            tensor will be generated by sampling using the supplied random `generator`.
        output_type (`str`, *optional*, defaults to `"pil"`):
            The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
            (`np.array`) or `"ms"` (`ms.Tensor`).
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
        prior_callback_on_step_end (`Callable`, *optional*):
            A function that is called at the end of each denoising step during inference, with the following
            arguments: `prior_callback_on_step_end(self: DiffusionPipeline, step: int, timestep:
            int, callback_kwargs: Dict)`.
        prior_callback_on_step_end_tensor_inputs (`List`, *optional*):
            The list of tensor inputs for the `prior_callback_on_step_end` function. The tensors specified in the
            list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in
            the `._callback_tensor_inputs` attribute of your pipeline class.
        callback_on_step_end (`Callable`, *optional*):
            A function that is called at the end of each denoising step during inference, with the following
            arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
            callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
            `callback_on_step_end_tensor_inputs`.
        callback_on_step_end_tensor_inputs (`List`, *optional*):
            The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
            will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
            `._callback_tensor_inputs` attribute of your pipeline class.

    Examples:

    Returns:
        [`~pipelines.ImagePipelineOutput`] if `return_dict` is True, otherwise a `tuple`. When returning a tuple,
        the first element is a list with the generated images.
    """
    prior_kwargs = {}
    if kwargs.get("prior_callback", None) is not None:
        prior_kwargs["callback"] = kwargs.pop("prior_callback")
        deprecate(
            "prior_callback",
            "1.0.0",
            "Passing `prior_callback` as an input argument to `__call__` is deprecated, consider use `prior_callback_on_step_end`",
        )
    if kwargs.get("prior_callback_steps", None) is not None:
        deprecate(
            "prior_callback_steps",
            "1.0.0",
            "Passing `prior_callback_steps` as an input argument to `__call__` is deprecated, consider use `prior_callback_on_step_end`",
        )
        prior_kwargs["callback_steps"] = kwargs.pop("prior_callback_steps")

    prior_outputs = self.prior_pipe(
        prompt=prompt if prompt_embeds is None else None,
        height=height,
        width=width,
        num_inference_steps=prior_num_inference_steps,
        timesteps=prior_timesteps,
        guidance_scale=prior_guidance_scale,
        negative_prompt=negative_prompt if negative_prompt_embeds is None else None,
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        num_images_per_prompt=num_images_per_prompt,
        generator=generator,
        latents=latents,
        output_type="ms",
        return_dict=False,
        callback_on_step_end=prior_callback_on_step_end,
        callback_on_step_end_tensor_inputs=prior_callback_on_step_end_tensor_inputs,
        **prior_kwargs,
    )
    image_embeddings = prior_outputs[0]

    outputs = self.decoder_pipe(
        image_embeddings=image_embeddings,
        prompt=prompt if prompt is not None else "",
        num_inference_steps=num_inference_steps,
        timesteps=decoder_timesteps,
        guidance_scale=decoder_guidance_scale,
        negative_prompt=negative_prompt,
        generator=generator,
        output_type=output_type,
        return_dict=return_dict,
        callback_on_step_end=callback_on_step_end,
        callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
        **kwargs,
    )

    return outputs

mindone.diffusers.WuerstchenPriorPipeline

Bases: DiffusionPipeline, LoraLoaderMixin

Pipeline for generating image prior for Wuerstchen.

This model inherits from [DiffusionPipeline]. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

The pipeline also inherits the following loading methods (a usage sketch follows the list):
  • [~loaders.LoraLoaderMixin.load_lora_weights] for loading LoRA weights
  • [~loaders.LoraLoaderMixin.save_lora_weights] for saving LoRA weights
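
For illustration, a minimal sketch of loading LoRA weights into the prior pipeline; the LoRA repository id below is hypothetical:

import mindspore as ms
from mindone.diffusers import WuerstchenPriorPipeline

prior_pipeline = WuerstchenPriorPipeline.from_pretrained(
    "warp-ai/wuerstchen-prior",
    mindspore_dtype=ms.float16
)
# hypothetical LoRA checkpoint; any local path or Hub repo containing
# LoRA weights in the expected format can be passed here
prior_pipeline.load_lora_weights("your-username/wuerstchen-prior-lora")
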
PARAMETER DESCRIPTION
prior

The canonical unCLIP prior to approximate the image embedding from the text embedding.

TYPE: [`Prior`]

text_encoder

Frozen text-encoder.

TYPE: [`CLIPTextModelWithProjection`]

tokenizer

Tokenizer of class CLIPTokenizer.

TYPE: `CLIPTokenizer`

scheduler

A scheduler to be used in combination with prior to generate image embedding.

TYPE: [`DDPMWuerstchenScheduler`]

latent_mean

Mean value for latent diffusers.

TYPE: 'float', *optional*, defaults to 42.0 DEFAULT: 42.0

latent_std

Standard value for latent diffusers.

TYPE: 'float', *optional*, defaults to 1.0 DEFAULT: 1.0

resolution_multiple

Factor between the pixel resolution and the Stage C latent resolution; the latent height and width are computed as ceil(size / resolution_multiple).

TYPE: 'float', *optional*, defaults to 42.67 DEFAULT: 42.67

Source code in mindone/diffusers/pipelines/wuerstchen/pipeline_wuerstchen_prior.py
class WuerstchenPriorPipeline(DiffusionPipeline, LoraLoaderMixin):
    """
    Pipeline for generating image prior for Wuerstchen.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
    library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

    The pipeline also inherits the following loading methods:
        - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights
        - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights

    Args:
        prior ([`Prior`]):
            The canonical unCLIP prior to approximate the image embedding from the text embedding.
        text_encoder ([`CLIPTextModelWithProjection`]):
            Frozen text-encoder.
        tokenizer (`CLIPTokenizer`):
            Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        scheduler ([`DDPMWuerstchenScheduler`]):
            A scheduler to be used in combination with `prior` to generate image embedding.
        latent_mean ('float', *optional*, defaults to 42.0):
            Mean value for latent diffusers.
        latent_std ('float', *optional*, defaults to 1.0):
            Standard value for latent diffusers.
        resolution_multiple ('float', *optional*, defaults to 42.67):
            Default resolution for multiple images generated.
    """

    unet_name = "prior"
    text_encoder_name = "text_encoder"
    model_cpu_offload_seq = "text_encoder->prior"
    _callback_tensor_inputs = ["latents", "text_encoder_hidden_states", "negative_prompt_embeds"]

    def __init__(
        self,
        tokenizer: CLIPTokenizer,
        text_encoder: CLIPTextModel,
        prior: WuerstchenPrior,
        scheduler: DDPMWuerstchenScheduler,
        latent_mean: float = 42.0,
        latent_std: float = 1.0,
        resolution_multiple: float = 42.67,
    ) -> None:
        super().__init__()
        self.register_modules(
            tokenizer=tokenizer,
            text_encoder=text_encoder,
            prior=prior,
            scheduler=scheduler,
        )
        self.register_to_config(latent_mean=latent_mean, latent_std=latent_std, resolution_multiple=resolution_multiple)

    # Copied from diffusers.pipelines.unclip.pipeline_unclip.UnCLIPPipeline.prepare_latents
    def prepare_latents(self, shape, dtype, generator, latents, scheduler):
        if latents is None:
            latents = randn_tensor(shape, generator=generator, dtype=dtype)
        else:
            if latents.shape != shape:
                raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")

        latents = (latents * scheduler.init_noise_sigma).to(dtype)
        return latents

    def encode_prompt(
        self,
        num_images_per_prompt,
        do_classifier_free_guidance,
        prompt=None,
        negative_prompt=None,
        prompt_embeds: Optional[ms.Tensor] = None,
        negative_prompt_embeds: Optional[ms.Tensor] = None,
    ):
        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            batch_size = prompt_embeds.shape[0]

        if prompt_embeds is None:
            # get prompt text embeddings
            text_inputs = self.tokenizer(
                prompt,
                padding="max_length",
                max_length=self.tokenizer.model_max_length,
                truncation=True,
                return_tensors="np",
            )
            text_input_ids = text_inputs.input_ids
            attention_mask = ms.Tensor(text_inputs.attention_mask)

            untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="np").input_ids

            if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not np.array_equal(
                text_input_ids, untruncated_ids
            ):
                removed_text = self.tokenizer.batch_decode(untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1])
                logger.warning(
                    "The following part of your input was truncated because CLIP can only handle sequences up to"
                    f" {self.tokenizer.model_max_length} tokens: {removed_text}"
                )
                text_input_ids = text_input_ids[:, : self.tokenizer.model_max_length]
                attention_mask = attention_mask[:, : self.tokenizer.model_max_length]

            text_encoder_output = self.text_encoder(ms.Tensor(text_input_ids), attention_mask=attention_mask)
            prompt_embeds = text_encoder_output[0]

        prompt_embeds = prompt_embeds.to(dtype=self.text_encoder.dtype)
        prompt_embeds = prompt_embeds.repeat_interleave(num_images_per_prompt, dim=0)

        if negative_prompt_embeds is None and do_classifier_free_guidance:
            uncond_tokens: List[str]
            if negative_prompt is None:
                uncond_tokens = [""] * batch_size
            elif type(prompt) is not type(negative_prompt):
                raise TypeError(
                    f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
                    f" {type(prompt)}."
                )
            elif isinstance(negative_prompt, str):
                uncond_tokens = [negative_prompt]
            elif batch_size != len(negative_prompt):
                raise ValueError(
                    f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
                    f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
                    " the batch size of `prompt`."
                )
            else:
                uncond_tokens = negative_prompt

            uncond_input = self.tokenizer(
                uncond_tokens,
                padding="max_length",
                max_length=self.tokenizer.model_max_length,
                truncation=True,
                return_tensors="np",
            )
            negative_prompt_embeds_text_encoder_output = self.text_encoder(
                ms.Tensor(uncond_input.input_ids), attention_mask=ms.Tensor(uncond_input.attention_mask)
            )

            negative_prompt_embeds = negative_prompt_embeds_text_encoder_output[0]

        if do_classifier_free_guidance:
            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
            seq_len = negative_prompt_embeds.shape[1]
            negative_prompt_embeds = negative_prompt_embeds.to(dtype=self.text_encoder.dtype)
            negative_prompt_embeds = negative_prompt_embeds.tile((1, num_images_per_prompt, 1))
            negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
            # done duplicates

        return prompt_embeds, negative_prompt_embeds

    def check_inputs(
        self,
        prompt,
        negative_prompt,
        num_inference_steps,
        do_classifier_free_guidance,
        prompt_embeds=None,
        negative_prompt_embeds=None,
    ):
        if prompt is not None and prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
                " only forward one of the two."
            )
        elif prompt is None and prompt_embeds is None:
            raise ValueError(
                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
            )
        elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
            raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")

        if negative_prompt is not None and negative_prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
            )

        if prompt_embeds is not None and negative_prompt_embeds is not None:
            if prompt_embeds.shape != negative_prompt_embeds.shape:
                raise ValueError(
                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
                    f" {negative_prompt_embeds.shape}."
                )

        if not isinstance(num_inference_steps, int):
            raise TypeError(
                f"'num_inference_steps' must be of type 'int', but got {type(num_inference_steps)}\
                           In Case you want to provide explicit timesteps, please use the 'timesteps' argument."
            )

    @property
    def guidance_scale(self):
        return self._guidance_scale

    @property
    def do_classifier_free_guidance(self):
        return self._guidance_scale > 1

    @property
    def num_timesteps(self):
        return self._num_timesteps

    def __call__(
        self,
        prompt: Optional[Union[str, List[str]]] = None,
        height: int = 1024,
        width: int = 1024,
        num_inference_steps: int = 60,
        timesteps: List[float] = None,
        guidance_scale: float = 8.0,
        negative_prompt: Optional[Union[str, List[str]]] = None,
        prompt_embeds: Optional[ms.Tensor] = None,
        negative_prompt_embeds: Optional[ms.Tensor] = None,
        num_images_per_prompt: Optional[int] = 1,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        latents: Optional[ms.Tensor] = None,
        output_type: Optional[str] = "ms",
        return_dict: bool = False,
        callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
        **kwargs,
    ):
        """
        Function invoked when calling the pipeline for generation.

        Args:
            prompt (`str` or `List[str]`):
                The prompt or prompts to guide the image generation.
            height (`int`, *optional*, defaults to 1024):
                The height in pixels of the generated image.
            width (`int`, *optional*, defaults to 1024):
                The width in pixels of the generated image.
            num_inference_steps (`int`, *optional*, defaults to 60):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            timesteps (`List[float]`, *optional*):
                Custom timesteps to use for the denoising process. If not defined, equal spaced `num_inference_steps`
                timesteps are used. Must be in descending order.
            guidance_scale (`float`, *optional*, defaults to 8.0):
                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
                `guidance_scale` is defined as `w` of equation 2. of [Imagen
                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting
                `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are
                closely linked to the text `prompt`, usually at the expense of lower image quality.
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
                if `guidance_scale` is less than `1`).
            prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
                provided, text embeddings will be generated from `prompt` input argument.
            negative_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
                argument.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
                One or a list of [np.random.Generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
                to make generation deterministic.
            latents (`ms.Tensor`, *optional*):
                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                tensor will be generated by sampling using the supplied random `generator`.
            output_type (`str`, *optional*, defaults to `"ms"`):
                The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
                (`np.array`) or `"ms"` (`ms.Tensor`).
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
            callback_on_step_end (`Callable`, *optional*):
                A function that is called at the end of each denoising step during inference, with the following
                arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
                `callback_on_step_end_tensor_inputs`.
            callback_on_step_end_tensor_inputs (`List`, *optional*):
                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
                `._callback_tensor_inputs` attribute of your pipeline class.

        Examples:

        Returns:
            [`~pipelines.WuerstchenPriorPipelineOutput`] if `return_dict` is True, otherwise a `tuple`. When
            returning a tuple, the first element is a list with the generated image embeddings.
        """

        callback = kwargs.pop("callback", None)
        callback_steps = kwargs.pop("callback_steps", None)

        if callback is not None:
            deprecate(
                "callback",
                "1.0.0",
                "Passing `callback` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
            )
        if callback_steps is not None:
            deprecate(
                "callback_steps",
                "1.0.0",
                "Passing `callback_steps` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
            )

        if callback_on_step_end_tensor_inputs is not None and not all(
            k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
        ):
            raise ValueError(
                f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, "
                f"but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
            )

        # 0. Define commonly used variables
        self._guidance_scale = guidance_scale
        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            batch_size = prompt_embeds.shape[0]

        # 1. Check inputs. Raise error if not correct
        if prompt is not None and not isinstance(prompt, list):
            if isinstance(prompt, str):
                prompt = [prompt]
            else:
                raise TypeError(f"'prompt' must be of type 'list' or 'str', but got {type(prompt)}.")

        if self.do_classifier_free_guidance:
            if negative_prompt is not None and not isinstance(negative_prompt, list):
                if isinstance(negative_prompt, str):
                    negative_prompt = [negative_prompt]
                else:
                    raise TypeError(
                        f"'negative_prompt' must be of type 'list' or 'str', but got {type(negative_prompt)}."
                    )

        self.check_inputs(
            prompt,
            negative_prompt,
            num_inference_steps,
            self.do_classifier_free_guidance,
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
        )

        # 2. Encode caption
        prompt_embeds, negative_prompt_embeds = self.encode_prompt(
            prompt=prompt,
            num_images_per_prompt=num_images_per_prompt,
            do_classifier_free_guidance=self.do_classifier_free_guidance,
            negative_prompt=negative_prompt,
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
        )

        # For classifier free guidance, we need to do two forward passes.
        # Here we concatenate the unconditional and text embeddings into a single batch
        # to avoid doing two forward passes
        text_encoder_hidden_states = (
            ops.cat([prompt_embeds, negative_prompt_embeds]) if negative_prompt_embeds is not None else prompt_embeds
        )

        # 3. Determine latent shape of image embeddings
        dtype = text_encoder_hidden_states.dtype
        latent_height = ceil(height / self.config.resolution_multiple)
        latent_width = ceil(width / self.config.resolution_multiple)
        num_channels = self.prior.config.c_in
        effnet_features_shape = (num_images_per_prompt * batch_size, num_channels, latent_height, latent_width)

        # 4. Prepare and set timesteps
        if timesteps is not None:
            self.scheduler.set_timesteps(timesteps=timesteps)
            timesteps = self.scheduler.timesteps
            num_inference_steps = len(timesteps)
        else:
            self.scheduler.set_timesteps(num_inference_steps)
            timesteps = self.scheduler.timesteps

        # 5. Prepare latents
        latents = self.prepare_latents(effnet_features_shape, dtype, generator, latents, self.scheduler)

        # 6. Run denoising loop
        self._num_timesteps = len(timesteps[:-1])
        for i, t in enumerate(self.progress_bar(timesteps[:-1])):
            ratio = t.broadcast_to((latents.shape[0],)).to(dtype)

            # 7. Denoise image embeddings
            predicted_image_embedding = self.prior(
                ops.cat([latents] * 2) if self.do_classifier_free_guidance else latents,
                r=ops.cat([ratio] * 2) if self.do_classifier_free_guidance else ratio,
                c=text_encoder_hidden_states,
            )

            # 8. Check for classifier free guidance and apply it
            if self.do_classifier_free_guidance:
                predicted_image_embedding_text, predicted_image_embedding_uncond = predicted_image_embedding.chunk(2)
                predicted_image_embedding = ops.lerp(
                    predicted_image_embedding_uncond,
                    predicted_image_embedding_text,
                    ms.tensor(self.guidance_scale, dtype=predicted_image_embedding_text.dtype),
                )

            # 9. Renoise latents to next timestep
            latents = self.scheduler.step(
                model_output=predicted_image_embedding,
                timestep=ratio,
                sample=latents,
                generator=generator,
            )[0]

            if callback_on_step_end is not None:
                callback_kwargs = {}
                for k in callback_on_step_end_tensor_inputs:
                    callback_kwargs[k] = locals()[k]
                callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                latents = callback_outputs.pop("latents", latents)
                text_encoder_hidden_states = callback_outputs.pop(
                    "text_encoder_hidden_states", text_encoder_hidden_states
                )
                negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)

            if callback is not None and i % callback_steps == 0:
                step_idx = i // getattr(self.scheduler, "order", 1)
                callback(step_idx, t, latents)

        # 10. Denormalize the latents
        latents = latents * self.config.latent_mean - self.config.latent_std

        if output_type == "np":
            latents = latents.float().numpy()

        if not return_dict:
            return (latents,)

        return WuerstchenPriorPipelineOutput(latents)
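
The classifier-free guidance update in step 8 above is expressed with `ops.lerp`. Since lerp(a, b, w) = a + w * (b - a), a weight w > 1 extrapolates past the text-conditional prediction, which is exactly the familiar CFG formula uncond + w * (text - uncond). A minimal NumPy sketch of that equivalence (illustrative only; not part of the pipeline):

import numpy as np

rng = np.random.default_rng(0)
uncond = rng.standard_normal((1, 16))  # stand-in for the unconditional prediction
text = rng.standard_normal((1, 16))    # stand-in for the text-conditional prediction
w = 8.0                                # guidance_scale

cfg = uncond + w * (text - uncond)     # classic classifier-free guidance
lerp = (1 - w) * uncond + w * text     # what lerp(uncond, text, w) computes
assert np.allclose(cfg, lerp)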

mindone.diffusers.WuerstchenPriorPipeline.__call__(prompt=None, height=1024, width=1024, num_inference_steps=60, timesteps=None, guidance_scale=8.0, negative_prompt=None, prompt_embeds=None, negative_prompt_embeds=None, num_images_per_prompt=1, generator=None, latents=None, output_type='ms', return_dict=False, callback_on_step_end=None, callback_on_step_end_tensor_inputs=['latents'], **kwargs)

Function invoked when calling the pipeline for generation.

PARAMETER DESCRIPTION
prompt

The prompt or prompts to guide the image generation.

TYPE: `str` or `List[str]` DEFAULT: None

height

The height in pixels of the generated image.

TYPE: `int`, *optional*, defaults to 1024 DEFAULT: 1024

width

The width in pixels of the generated image.

TYPE: `int`, *optional*, defaults to 1024 DEFAULT: 1024

num_inference_steps

The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 60 DEFAULT: 60

timesteps

Custom timesteps to use for the denoising process. If not defined, equally spaced num_inference_steps timesteps are used. Must be in descending order.

TYPE: `List[int]`, *optional* DEFAULT: None

guidance_scale

Guidance scale as defined in Classifier-Free Diffusion Guidance. decoder_guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance scale is enabled by setting decoder_guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.

TYPE: `float`, *optional*, defaults to 8.0 DEFAULT: 8.0

negative_prompt

The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if decoder_guidance_scale is less than 1).

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

prompt_embeds

Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

negative_prompt_embeds

Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

num_images_per_prompt

The number of images to generate per prompt.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

generator

One or a list of np.random.Generator(s) to make generation deterministic.

TYPE: `np.random.Generator` or `List[np.random.Generator]`, *optional* DEFAULT: None

latents

Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

output_type

The output format of the generated image. Choose between: "pil" (PIL.Image.Image), "np" (np.array) or "ms" (ms.Tensor).

TYPE: `str`, *optional*, defaults to `"ms"` DEFAULT: 'ms'

return_dict

Whether or not to return a [~pipelines.WuerstchenPriorPipelineOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

callback_on_step_end

A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.

TYPE: `Callable`, *optional* DEFAULT: None

callback_on_step_end_tensor_inputs

The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

TYPE: `List`, *optional* DEFAULT: ['latents']
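
As an illustration of the two callback arguments above, here is a minimal sketch of a step-end callback that logs latent statistics, assuming prior_pipe is a loaded WuerstchenPriorPipeline. Only the (pipe, step, timestep, callback_kwargs) signature and the returned dict follow the contract documented here; the function name and print format are hypothetical:

def log_latents(pipe, step, timestep, callback_kwargs):
    # "latents" is available because it is listed in callback_on_step_end_tensor_inputs.
    latents = callback_kwargs["latents"]
    print(f"step {step}: latent mean={latents.mean()}, std={latents.std()}")
    # Any entries returned here overwrite the pipeline's tensors for the next step.
    return {"latents": latents}

prior_output = prior_pipe(
    prompt="an astronaut riding a horse",
    callback_on_step_end=log_latents,
    callback_on_step_end_tensor_inputs=["latents"],
)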

RETURNS DESCRIPTION

[~pipelines.WuerstchenPriorPipelineOutput] or tuple: [~pipelines.WuerstchenPriorPipelineOutput] if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated image embeddings.
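
Because the docstring's Examples section is empty, here is a minimal usage sketch for the prior stage. The checkpoint id warp-ai/wuerstchen-prior and the from_pretrained loading pattern are assumptions carried over from the upstream diffusers integration; treat this as a sketch rather than a verified snippet:

import numpy as np
from mindone.diffusers import WuerstchenPriorPipeline

# Stage C: turn a text prompt into compact image embeddings (assumed checkpoint id).
prior_pipe = WuerstchenPriorPipeline.from_pretrained("warp-ai/wuerstchen-prior")

generator = np.random.default_rng(42)  # an np.random.Generator, for deterministic sampling
prior_output = prior_pipe(
    prompt="an image of a shiba inu, donning a spacesuit and helmet",
    height=1024,
    width=1024,
    num_inference_steps=60,
    guidance_scale=8.0,
    generator=generator,
)
# return_dict defaults to False, so a tuple comes back; element 0 holds the embeddings.
image_embeddings = prior_output[0]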


mindone.diffusers.pipelines.wuerstchen.pipeline_wuerstchen_prior.WuerstchenPriorPipelineOutput dataclass

Bases: BaseOutput

Output class for WuerstchenPriorPipeline.

Source code in mindone/diffusers/pipelines/wuerstchen/pipeline_wuerstchen_prior.py
@dataclass
class WuerstchenPriorPipelineOutput(BaseOutput):
    """
    Output class for WuerstchenPriorPipeline.

    Args:
        image_embeddings (`ms.Tensor` or `np.ndarray`):
            Prior image embeddings for the text prompt.

    """

    image_embeddings: Union[ms.Tensor, np.ndarray]

mindone.diffusers.WuerstchenDecoderPipeline

Bases: DiffusionPipeline

Pipeline for generating images from the Wuerstchen model.

This model inherits from [DiffusionPipeline]. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

PARAMETER DESCRIPTION
tokenizer

The CLIP tokenizer.

TYPE: `CLIPTokenizer`

text_encoder

The CLIP text encoder.

TYPE: `CLIPTextModel`

decoder

The WuerstchenDiffNeXt unet decoder.

TYPE: [`WuerstchenDiffNeXt`]

vqgan

The VQGAN model.

TYPE: [`PaellaVQModel`]

scheduler

A scheduler to be used in combination with the decoder to generate the image latents.

TYPE: [`DDPMWuerstchenScheduler`]

latent_dim_scale

Multiplier to determine the VQ latent space size from the image embeddings. If the image embeddings are height=24 and width=24, the VQ latent shape needs to be height=int(24*10.67)=256 and width=int(24*10.67)=256 in order to match the training conditions.

TYPE: float, `optional`, defaults to 10.67 DEFAULT: 10.67
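
As a quick check of the arithmetic above (illustrative only), this reproduces the latent-shape computation that __call__ performs in step 3:

latent_dim_scale = 10.67
embed_h, embed_w = 24, 24  # spatial size of the image embeddings
latent_h = int(embed_h * latent_dim_scale)  # int(256.08) == 256
latent_w = int(embed_w * latent_dim_scale)
assert (latent_h, latent_w) == (256, 256)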

Source code in mindone/diffusers/pipelines/wuerstchen/pipeline_wuerstchen.py
class WuerstchenDecoderPipeline(DiffusionPipeline):
    """
    Pipeline for generating images from the Wuerstchen model.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
    library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

    Args:
        tokenizer (`CLIPTokenizer`):
            The CLIP tokenizer.
        text_encoder (`CLIPTextModel`):
            The CLIP text encoder.
        decoder ([`WuerstchenDiffNeXt`]):
            The WuerstchenDiffNeXt unet decoder.
        vqgan ([`PaellaVQModel`]):
            The VQGAN model.
        scheduler ([`DDPMWuerstchenScheduler`]):
            A scheduler to be used in combination with `decoder` to generate the image latents.
        latent_dim_scale (float, `optional`, defaults to 10.67):
            Multiplier to determine the VQ latent space size from the image embeddings. If the image embeddings are
            height=24 and width=24, the VQ latent shape needs to be height=int(24*10.67)=256 and
            width=int(24*10.67)=256 in order to match the training conditions.
    """

    model_cpu_offload_seq = "text_encoder->decoder->vqgan"
    _callback_tensor_inputs = [
        "latents",
        "text_encoder_hidden_states",
        "negative_prompt_embeds",
        "image_embeddings",
    ]

    def __init__(
        self,
        tokenizer: CLIPTokenizer,
        text_encoder: CLIPTextModel,
        decoder: WuerstchenDiffNeXt,
        scheduler: DDPMWuerstchenScheduler,
        vqgan: PaellaVQModel,
        latent_dim_scale: float = 10.67,
    ) -> None:
        super().__init__()
        self.register_modules(
            tokenizer=tokenizer,
            text_encoder=text_encoder,
            decoder=decoder,
            scheduler=scheduler,
            vqgan=vqgan,
        )
        self.register_to_config(latent_dim_scale=latent_dim_scale)

    # Copied from diffusers.pipelines.unclip.pipeline_unclip.UnCLIPPipeline.prepare_latents
    def prepare_latents(self, shape, dtype, generator, latents, scheduler):
        if latents is None:
            latents = randn_tensor(shape, generator=generator, dtype=dtype)
        else:
            if latents.shape != shape:
                raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")
            latents = latents

        latents = (latents * scheduler.init_noise_sigma).to(dtype)
        return latents

    def encode_prompt(
        self,
        prompt,
        num_images_per_prompt,
        do_classifier_free_guidance,
        negative_prompt=None,
    ):
        batch_size = len(prompt) if isinstance(prompt, list) else 1
        # get prompt text embeddings
        text_inputs = self.tokenizer(
            prompt,
            padding="max_length",
            max_length=self.tokenizer.model_max_length,
            truncation=True,
            return_tensors="np",
        )
        text_input_ids = text_inputs.input_ids
        attention_mask = ms.Tensor(text_inputs.attention_mask)

        untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="np").input_ids

        if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not np.array_equal(
            text_input_ids, untruncated_ids
        ):
            removed_text = self.tokenizer.batch_decode(untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1])
            logger.warning(
                "The following part of your input was truncated because CLIP can only handle sequences up to"
                f" {self.tokenizer.model_max_length} tokens: {removed_text}"
            )
            text_input_ids = text_input_ids[:, : self.tokenizer.model_max_length]
            attention_mask = attention_mask[:, : self.tokenizer.model_max_length]

        text_encoder_output = self.text_encoder(ms.Tensor(text_input_ids), attention_mask=attention_mask)
        text_encoder_hidden_states = text_encoder_output[0]
        text_encoder_hidden_states = text_encoder_hidden_states.repeat_interleave(num_images_per_prompt, dim=0)

        uncond_text_encoder_hidden_states = None
        if do_classifier_free_guidance:
            uncond_tokens: List[str]
            if negative_prompt is None:
                uncond_tokens = [""] * batch_size
            elif type(prompt) is not type(negative_prompt):
                raise TypeError(
                    f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
                    f" {type(prompt)}."
                )
            elif isinstance(negative_prompt, str):
                uncond_tokens = [negative_prompt]
            elif batch_size != len(negative_prompt):
                raise ValueError(
                    f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
                    f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
                    " the batch size of `prompt`."
                )
            else:
                uncond_tokens = negative_prompt

            uncond_input = self.tokenizer(
                uncond_tokens,
                padding="max_length",
                max_length=self.tokenizer.model_max_length,
                truncation=True,
                return_tensors="np",
            )
            negative_prompt_embeds_text_encoder_output = self.text_encoder(
                ms.Tensor(uncond_input.input_ids), attention_mask=ms.Tensor(uncond_input.attention_mask)
            )

            uncond_text_encoder_hidden_states = negative_prompt_embeds_text_encoder_output[0]

            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
            seq_len = uncond_text_encoder_hidden_states.shape[1]
            uncond_text_encoder_hidden_states = uncond_text_encoder_hidden_states.tile((1, num_images_per_prompt, 1))
            uncond_text_encoder_hidden_states = uncond_text_encoder_hidden_states.view(
                batch_size * num_images_per_prompt, seq_len, -1
            )
            # done duplicates

            # For classifier free guidance, we need to do two forward passes.
            # Here we concatenate the unconditional and text embeddings into a single batch
            # to avoid doing two forward passes
        return text_encoder_hidden_states, uncond_text_encoder_hidden_states

    @property
    def guidance_scale(self):
        return self._guidance_scale

    @property
    def do_classifier_free_guidance(self):
        return self._guidance_scale > 1

    @property
    def num_timesteps(self):
        return self._num_timesteps

    def __call__(
        self,
        image_embeddings: Union[ms.Tensor, List[ms.Tensor]],
        prompt: Union[str, List[str]] = None,
        num_inference_steps: int = 12,
        timesteps: Optional[List[float]] = None,
        guidance_scale: float = 0.0,
        negative_prompt: Optional[Union[str, List[str]]] = None,
        num_images_per_prompt: int = 1,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        latents: Optional[ms.Tensor] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
        callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
        **kwargs,
    ):
        """
        Function invoked when calling the pipeline for generation.

        Args:
            image_embeddings (`ms.Tensor` or `List[ms.Tensor]`):
                Image embeddings, either extracted from an image or generated by a prior model.
            prompt (`str` or `List[str]`):
                The prompt or prompts to guide the image generation.
            num_inference_steps (`int`, *optional*, defaults to 12):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            timesteps (`List[int]`, *optional*):
                Custom timesteps to use for the denoising process. If not defined, equal spaced `num_inference_steps`
                timesteps are used. Must be in descending order.
            guidance_scale (`float`, *optional*, defaults to 0.0):
                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
                `decoder_guidance_scale` is defined as `w` of equation 2 of the [Imagen
                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting
                `decoder_guidance_scale > 1`. A higher guidance scale encourages the model to generate images
                that are closely linked to the text `prompt`, usually at the expense of lower image quality.
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
                if `decoder_guidance_scale` is less than `1`).
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
                One or a list of [np.random.Generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
                to make generation deterministic.
            latents (`ms.Tensor`, *optional*):
                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                tensor will be generated by sampling using the supplied random `generator`.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
                (`np.array`) or `"ms"` (`ms.Tensor`).
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
            callback_on_step_end (`Callable`, *optional*):
                A function that is called at the end of each denoising step during inference. The function is called
                with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
                `callback_on_step_end_tensor_inputs`.
            callback_on_step_end_tensor_inputs (`List`, *optional*):
                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
                `._callback_tensor_inputs` attribute of your pipeline class.

        Examples:

        Returns:
            [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~pipelines.ImagePipelineOutput`] if `return_dict` is True,
            otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.
        """

        callback = kwargs.pop("callback", None)
        callback_steps = kwargs.pop("callback_steps", None)

        if callback is not None:
            deprecate(
                "callback",
                "1.0.0",
                "Passing `callback` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
            )
        if callback_steps is not None:
            deprecate(
                "callback_steps",
                "1.0.0",
                "Passing `callback_steps` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
            )

        if callback_on_step_end_tensor_inputs is not None and not all(
            k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
        ):
            raise ValueError(
                f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, "
                f"but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
            )

        # 0. Define commonly used variables
        dtype = next(self.decoder.get_parameters()).dtype
        self._guidance_scale = guidance_scale

        # 1. Check inputs. Raise error if not correct
        if not isinstance(prompt, list):
            if isinstance(prompt, str):
                prompt = [prompt]
            else:
                raise TypeError(f"'prompt' must be of type 'list' or 'str', but got {type(prompt)}.")

        if self.do_classifier_free_guidance:
            if negative_prompt is not None and not isinstance(negative_prompt, list):
                if isinstance(negative_prompt, str):
                    negative_prompt = [negative_prompt]
                else:
                    raise TypeError(
                        f"'negative_prompt' must be of type 'list' or 'str', but got {type(negative_prompt)}."
                    )

        if isinstance(image_embeddings, list):
            image_embeddings = ops.cat(image_embeddings, axis=0)
        if isinstance(image_embeddings, np.ndarray):
            image_embeddings = ms.Tensor(image_embeddings).to(dtype=dtype)
        if not isinstance(image_embeddings, ms.Tensor):
            raise TypeError(
                f"'image_embeddings' must be of type 'ms.Tensor' or 'np.array', but got {type(image_embeddings)}."
            )

        if not isinstance(num_inference_steps, int):
            raise TypeError(
                f"'num_inference_steps' must be of type 'int', but got {type(num_inference_steps)}\
                           In case you want to provide explicit timesteps, please use the 'timesteps' argument."
            )

        # 2. Encode caption
        prompt_embeds, negative_prompt_embeds = self.encode_prompt(
            prompt,
            image_embeddings.shape[0] * num_images_per_prompt,
            self.do_classifier_free_guidance,
            negative_prompt,
        )
        text_encoder_hidden_states = (
            ops.cat([prompt_embeds, negative_prompt_embeds]) if negative_prompt_embeds is not None else prompt_embeds
        )
        effnet = (
            ops.cat([image_embeddings, ops.zeros_like(image_embeddings)])
            if self.do_classifier_free_guidance
            else image_embeddings
        )

        # 3. Determine latent shape of latents
        latent_height = int(image_embeddings.shape[2] * self.config.latent_dim_scale)
        latent_width = int(image_embeddings.shape[3] * self.config.latent_dim_scale)
        latent_features_shape = (image_embeddings.shape[0] * num_images_per_prompt, 4, latent_height, latent_width)

        # 4. Prepare and set timesteps
        if timesteps is not None:
            self.scheduler.set_timesteps(timesteps=timesteps)
            timesteps = self.scheduler.timesteps
            num_inference_steps = len(timesteps)
        else:
            self.scheduler.set_timesteps(num_inference_steps)
            timesteps = self.scheduler.timesteps

        # 5. Prepare latents
        latents = self.prepare_latents(latent_features_shape, dtype, generator, latents, self.scheduler)

        # 6. Run denoising loop
        self._num_timesteps = len(timesteps[:-1])
        for i, t in enumerate(self.progress_bar(timesteps[:-1])):
            ratio = t.broadcast_to((latents.shape[0],)).to(dtype)
            # 7. Denoise latents
            predicted_latents = self.decoder(
                ops.cat([latents] * 2) if self.do_classifier_free_guidance else latents,
                r=ops.cat([ratio] * 2) if self.do_classifier_free_guidance else ratio,
                effnet=effnet,
                clip=text_encoder_hidden_states,
            )

            # 8. Check for classifier free guidance and apply it
            if self.do_classifier_free_guidance:
                predicted_latents_text, predicted_latents_uncond = predicted_latents.chunk(2)
                predicted_latents = ops.lerp(
                    predicted_latents_uncond,
                    predicted_latents_text,
                    ms.tensor(self.guidance_scale, dtype=predicted_latents_text.dtype),
                )

            # 9. Renoise latents to next timestep
            latents = self.scheduler.step(
                model_output=predicted_latents,
                timestep=ratio,
                sample=latents,
                generator=generator,
            )[0]

            if callback_on_step_end is not None:
                callback_kwargs = {}
                for k in callback_on_step_end_tensor_inputs:
                    callback_kwargs[k] = locals()[k]
                callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                latents = callback_outputs.pop("latents", latents)
                image_embeddings = callback_outputs.pop("image_embeddings", image_embeddings)
                text_encoder_hidden_states = callback_outputs.pop(
                    "text_encoder_hidden_states", text_encoder_hidden_states
                )

            if callback is not None and i % callback_steps == 0:
                step_idx = i // getattr(self.scheduler, "order", 1)
                callback(step_idx, t, latents)

        if output_type not in ["ms", "np", "pil", "latent"]:
            raise ValueError(
                f"Only the output types `ms`, `np`, `pil` and `latent` are supported not output_type={output_type}"
            )

        if not output_type == "latent":
            # 10. Scale and decode the image latents with vq-vae
            latents = (self.vqgan.config.scale_factor * latents).to(latents.dtype)
            images = self.vqgan.decode(latents)[0].clamp(0, 1)
            if output_type == "np":
                images = images.permute((0, 2, 3, 1)).float().numpy()
            elif output_type == "pil":
                images = images.permute((0, 2, 3, 1)).float().numpy()
                images = self.numpy_to_pil(images)
        else:
            images = latents

        if not return_dict:
            return images
        return ImagePipelineOutput(images)

mindone.diffusers.WuerstchenDecoderPipeline.__call__(image_embeddings, prompt=None, num_inference_steps=12, timesteps=None, guidance_scale=0.0, negative_prompt=None, num_images_per_prompt=1, generator=None, latents=None, output_type='pil', return_dict=False, callback_on_step_end=None, callback_on_step_end_tensor_inputs=['latents'], **kwargs)

Function invoked when calling the pipeline for generation.

PARAMETER DESCRIPTION
image_embeddings

Image embeddings, either extracted from an image or generated by a prior model.

TYPE: `ms.Tensor` or `List[ms.Tensor]`

prompt

The prompt or prompts to guide the image generation.

TYPE: `str` or `List[str]` DEFAULT: None

num_inference_steps

The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

timesteps

Custom timesteps to use for the denoising process. If not defined, equally spaced num_inference_steps timesteps are used. Must be in descending order.

TYPE: `List[int]`, *optional* DEFAULT: None

guidance_scale

Guidance scale as defined in Classifier-Free Diffusion Guidance. decoder_guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance scale is enabled by setting decoder_guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

negative_prompt

The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if decoder_guidance_scale is less than 1).

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

num_images_per_prompt

The number of images to generate per prompt.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

generator

One or a list of np.random.Generator(s) to make generation deterministic.

TYPE: `np.random.Generator` or `List[np.random.Generator]`, *optional* DEFAULT: None

latents

Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

output_type

The output format of the generated image. Choose between: "pil" (PIL.Image.Image), "np" (np.array) or "ms" (ms.Tensor).

TYPE: `str`, *optional*, defaults to `"pil"` DEFAULT: 'pil'

return_dict

Whether or not to return a [~pipelines.ImagePipelineOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

callback_on_step_end

A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.

TYPE: `Callable`, *optional* DEFAULT: None

callback_on_step_end_tensor_inputs

The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

TYPE: `List`, *optional* DEFAULT: ['latents']

RETURNS DESCRIPTION

[~pipelines.ImagePipelineOutput] or tuple: [~pipelines.ImagePipelineOutput] if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.
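
To tie the two stages together, a minimal end-to-end sketch. The checkpoint ids warp-ai/wuerstchen-prior and warp-ai/wuerstchen are assumptions from the upstream release; the call pattern follows the signatures documented on this page:

from mindone.diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline

# Stage C: text prompt -> compact image embeddings (assumed checkpoint id).
prior_pipe = WuerstchenPriorPipeline.from_pretrained("warp-ai/wuerstchen-prior")
# Stages B + A: image embeddings -> decoded image (assumed checkpoint id).
decoder_pipe = WuerstchenDecoderPipeline.from_pretrained("warp-ai/wuerstchen")

prompt = "an image of a shiba inu, donning a spacesuit and helmet"
image_embeddings = prior_pipe(prompt=prompt, guidance_scale=8.0)[0]

# With return_dict=False (the default) the decoder returns the images directly.
images = decoder_pipe(
    image_embeddings=image_embeddings,
    prompt=prompt,
    guidance_scale=0.0,  # decoder guidance is typically disabled
    output_type="pil",
)
images[0].save("shiba_inu.png")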

Source code in mindone/diffusers/pipelines/wuerstchen/pipeline_wuerstchen.py
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
def __call__(
    self,
    image_embeddings: Union[ms.Tensor, List[ms.Tensor]],
    prompt: Union[str, List[str]] = None,
    num_inference_steps: int = 12,
    timesteps: Optional[List[float]] = None,
    guidance_scale: float = 0.0,
    negative_prompt: Optional[Union[str, List[str]]] = None,
    num_images_per_prompt: int = 1,
    generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
    latents: Optional[ms.Tensor] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
    callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    **kwargs,
):
    """
    Function invoked when calling the pipeline for generation.

    Args:
        image_embedding (`ms.Tensor` or `List[ms.Tensor]`):
            Image Embeddings either extracted from an image or generated by a Prior Model.
        prompt (`str` or `List[str]`):
            The prompt or prompts to guide the image generation.
        num_inference_steps (`int`, *optional*, defaults to 12):
            The number of denoising steps. More denoising steps usually lead to a higher quality image at the
            expense of slower inference.
        timesteps (`List[int]`, *optional*):
            Custom timesteps to use for the denoising process. If not defined, equal spaced `num_inference_steps`
            timesteps are used. Must be in descending order.
        guidance_scale (`float`, *optional*, defaults to 0.0):
            Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
            `decoder_guidance_scale` is defined as `w` of equation 2. of [Imagen
            Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting
            `decoder_guidance_scale > 1`. Higher guidance scale encourages to generate images that are closely
            linked to the text `prompt`, usually at the expense of lower image quality.
        negative_prompt (`str` or `List[str]`, *optional*):
            The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
            if `decoder_guidance_scale` is less than `1`).
        num_images_per_prompt (`int`, *optional*, defaults to 1):
            The number of images to generate per prompt.
        generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
            One or a list of [np.random.Generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
            to make generation deterministic.
        latents (`ms.Tensor`, *optional*):
            Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
            generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
            tensor will ge generated by sampling using the supplied random `generator`.
        output_type (`str`, *optional*, defaults to `"pil"`):
            The output format of the generate image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
            (`np.array`) or `"ms"` (`ms.Tensor`).
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
        callback_on_step_end (`Callable`, *optional*):
            A function that calls at the end of each denoising steps during the inference. The function is called
            with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
            callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
            `callback_on_step_end_tensor_inputs`.
        callback_on_step_end_tensor_inputs (`List`, *optional*):
            The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
            will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
            `._callback_tensor_inputs` attribute of your pipeline class.

    Examples:

    Returns:
        [`~pipelines.ImagePipelineOutput`] or `tuple` [`~pipelines.ImagePipelineOutput`] if `return_dict` is True,
        otherwise a `tuple`. When returning a tuple, the first element is a list with the generated image
        embeddings.
    """

    callback = kwargs.pop("callback", None)
    callback_steps = kwargs.pop("callback_steps", None)

    if callback is not None:
        deprecate(
            "callback",
            "1.0.0",
            "Passing `callback` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
        )
    if callback_steps is not None:
        deprecate(
            "callback_steps",
            "1.0.0",
            "Passing `callback_steps` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
        )

    if callback_on_step_end_tensor_inputs is not None and not all(
        k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
    ):
        raise ValueError(
            f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, "
            f"but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
        )

    # 0. Define commonly used variables
    dtype = next(self.decoder.get_parameters()).dtype
    self._guidance_scale = guidance_scale

    # 1. Check inputs. Raise error if not correct
    if not isinstance(prompt, list):
        if isinstance(prompt, str):
            prompt = [prompt]
        else:
            raise TypeError(f"'prompt' must be of type 'list' or 'str', but got {type(prompt)}.")

    if self.do_classifier_free_guidance:
        if negative_prompt is not None and not isinstance(negative_prompt, list):
            if isinstance(negative_prompt, str):
                negative_prompt = [negative_prompt]
            else:
                raise TypeError(
                    f"'negative_prompt' must be of type 'list' or 'str', but got {type(negative_prompt)}."
                )

    if isinstance(image_embeddings, list):
        image_embeddings = ops.cat(image_embeddings, axis=0)
    if isinstance(image_embeddings, np.ndarray):
        image_embeddings = ms.Tensor(image_embeddings).to(dtype=dtype)
    if not isinstance(image_embeddings, ms.Tensor):
        raise TypeError(
            f"'image_embeddings' must be of type 'ms.Tensor' or 'np.array', but got {type(image_embeddings)}."
        )

    if not isinstance(num_inference_steps, int):
        raise TypeError(
            f"'num_inference_steps' must be of type 'int', but got {type(num_inference_steps)}\
                       In Case you want to provide explicit timesteps, please use the 'timesteps' argument."
        )

    # 2. Encode caption
    prompt_embeds, negative_prompt_embeds = self.encode_prompt(
        prompt,
        image_embeddings.shape[0] * num_images_per_prompt,
        self.do_classifier_free_guidance,
        negative_prompt,
    )
    text_encoder_hidden_states = (
        ops.cat([prompt_embeds, negative_prompt_embeds]) if negative_prompt_embeds is not None else prompt_embeds
    )
    effnet = (
        ops.cat([image_embeddings, ops.zeros_like(image_embeddings)])
        if self.do_classifier_free_guidance
        else image_embeddings
    )

    # 3. Determine the shape of the latents
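    # Each spatial side scales by `config.latent_dim_scale` relative to the Stage C
    # image-embedding grid; the batch is repeated `num_images_per_prompt` times.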
    latent_height = int(image_embeddings.shape[2] * self.config.latent_dim_scale)
    latent_width = int(image_embeddings.shape[3] * self.config.latent_dim_scale)
    latent_features_shape = (image_embeddings.shape[0] * num_images_per_prompt, 4, latent_height, latent_width)

    # 4. Prepare and set timesteps
    if timesteps is not None:
        self.scheduler.set_timesteps(timesteps=timesteps)
        timesteps = self.scheduler.timesteps
        num_inference_steps = len(timesteps)
    else:
        self.scheduler.set_timesteps(num_inference_steps)
        timesteps = self.scheduler.timesteps

    # 5. Prepare latents
    latents = self.prepare_latents(latent_features_shape, dtype, generator, latents, self.scheduler)

    # 6. Run denoising loop
    self._num_timesteps = len(timesteps[:-1])
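    # `timesteps[:-1]` skips the schedule's terminal entry; each `scheduler.step`
    # advances `latents` one timestep further along the schedule.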
    for i, t in enumerate(self.progress_bar(timesteps[:-1])):
        ratio = t.broadcast_to((latents.shape[0],)).to(dtype)
        # 7. Denoise latents
        predicted_latents = self.decoder(
            ops.cat([latents] * 2) if self.do_classifier_free_guidance else latents,
            r=ops.cat([ratio] * 2) if self.do_classifier_free_guidance else ratio,
            effnet=effnet,
            clip=text_encoder_hidden_states,
        )

        # 8. Check for classifier free guidance and apply it
        if self.do_classifier_free_guidance:
            predicted_latents_text, predicted_latents_uncond = predicted_latents.chunk(2)
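            # ops.lerp(a, b, w) = a + w * (b - a), so this computes
            # uncond + guidance_scale * (text - uncond): classifier-free guidance.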
            predicted_latents = ops.lerp(
                predicted_latents_uncond,
                predicted_latents_text,
                ms.tensor(self.guidance_scale, dtype=predicted_latents_text.dtype),
            )

        # 9. Renoise latents to next timestep
        latents = self.scheduler.step(
            model_output=predicted_latents,
            timestep=ratio,
            sample=latents,
            generator=generator,
        )[0]

        if callback_on_step_end is not None:
            callback_kwargs = {}
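            # `locals()` looks up the loop-local tensors (e.g. `latents`) requested
            # by name in `callback_on_step_end_tensor_inputs`.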
            for k in callback_on_step_end_tensor_inputs:
                callback_kwargs[k] = locals()[k]
            callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

            latents = callback_outputs.pop("latents", latents)
            image_embeddings = callback_outputs.pop("image_embeddings", image_embeddings)
            text_encoder_hidden_states = callback_outputs.pop(
                "text_encoder_hidden_states", text_encoder_hidden_states
            )

        if callback is not None and i % callback_steps == 0:
            step_idx = i // getattr(self.scheduler, "order", 1)
            callback(step_idx, t, latents)

    if output_type not in ["ms", "np", "pil", "latent"]:
        raise ValueError(
            f"Only the output types `ms`, `np`, `pil` and `latent` are supported not output_type={output_type}"
        )

    if output_type != "latent":
        # 10. Scale and decode the image latents with vq-vae
        latents = (self.vqgan.config.scale_factor * latents).to(latents.dtype)
        images = self.vqgan.decode(latents)[0].clamp(0, 1)
        if output_type == "np":
            images = images.permute((0, 2, 3, 1)).float().numpy()
        elif output_type == "pil":
            images = images.permute((0, 2, 3, 1)).float().numpy()
            images = self.numpy_to_pil(images)
    else:
        images = latents

    if not return_dict:
        return images
    return ImagePipelineOutput(images)
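
For reference, here is a minimal end-to-end sketch of driving this decoder (Stage B) pipeline from the prior (Stage C) pipeline. It assumes the mindone.diffusers port with the warp-ai/wuerstchen-prior and warp-ai/wuerstchen checkpoints and tuple-style outputs (return_dict=False); exact argument names and defaults may differ in your installation.

import mindspore as ms
from mindone.diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline

# Stage C (prior): text prompt -> highly compressed image embeddings.
prior_pipeline = WuerstchenPriorPipeline.from_pretrained(
    "warp-ai/wuerstchen-prior", mindspore_dtype=ms.float16
)
# Stage B (decoder): image embeddings -> VQGAN latents -> images.
decoder_pipeline = WuerstchenDecoderPipeline.from_pretrained(
    "warp-ai/wuerstchen", mindspore_dtype=ms.float16
)

prompt = "Anthropomorphic cat dressed as a firefighter"
# The first element of the prior's output is the image embeddings.
image_embeddings = prior_pipeline(prompt=prompt, height=1024, width=1024)[0]
# With return_dict=False the decoder returns the image list directly (see above).
images = decoder_pipeline(
    image_embeddings=image_embeddings,
    prompt=prompt,
    output_type="pil",
    return_dict=False,
)
images[0].save("cat_firefighter.png")

And a small sketch of the callback_on_step_end hook documented above; inspect_latents is a hypothetical name, and "latents" must appear in the pipeline's _callback_tensor_inputs:

def inspect_latents(pipe, step, timestep, callback_kwargs):
    # Called once per denoising step with the tensors requested below.
    latents = callback_kwargs["latents"]
    print(f"step {step}: latents shape {tuple(latents.shape)}")
    return callback_kwargs

images = decoder_pipeline(
    image_embeddings=image_embeddings,
    prompt=prompt,
    callback_on_step_end=inspect_latents,
    callback_on_step_end_tensor_inputs=["latents"],
    return_dict=False,
)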

Citation

@misc{pernias2023wuerstchen,
    title={Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models},
    author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher J. Pal and Marc Aubreville},
    year={2023},
    eprint={2306.00637},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}