Würstchen

Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models is by Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher Pal, and Marc Aubreville.

The abstract from the paper is:

We introduce Würstchen, a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for large-scale text-to-image diffusion models. A key contribution of our work is to develop a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation used to guide the diffusion process. This highly compressed representation of an image provides much more detailed guidance compared to latent representations of language and this significantly reduces the computational requirements to achieve state-of-the-art results. Our approach also improves the quality of text-conditioned image generation based on our user preference study. The training requirements of our approach consists of 24,602 A100-GPU hours - compared to Stable Diffusion 2.1's 200,000 GPU hours. Our approach also requires less training data to achieve these results. Furthermore, our compact latent representations allows us to perform inference over twice as fast, slashing the usual costs and carbon footprint of a state-of-the-art (SOTA) diffusion model significantly, without compromising the end performance. In a broader comparison against SOTA models our approach is substantially more efficient and compares favorably in terms of image quality. We believe that this work motivates more emphasis on the prioritization of both performance and computational accessibility.

Würstchen Overview

Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by orders of magnitude. Training on 1024x1024 images is far more expensive than training on 32x32. Usually, other works make use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial compression. This was unseen before, because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a two-stage compression, which we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a diffusion autoencoder (more details can be found in the paper). A third model, Stage C, is learned in that highly compressed latent space. This training requires a fraction of the compute used for current top-performing models, while also allowing cheaper and faster inference.
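
To get a sense of the scale, here is a minimal sketch (plain Python, no model required) of the latent sizes involved, assuming the f4 VQGAN the paper describes for Stage A and the overall spatial compression factor of roughly 42.67 used by the released pipelines:

from math import ceil

height, width = 1024, 1024

# Stage A (an f4 VQGAN) compresses 4x spatially: 1024x1024 -> 256x256
stage_a_latents = (height // 4, width // 4)

# Stage C operates at ~42.67x total spatial compression: 1024x1024 -> 24x24
resolution_multiple = 42.67  # default of WuerstchenPriorPipeline
stage_c_latents = (ceil(height / resolution_multiple), ceil(width / resolution_multiple))

print(stage_a_latents)  # (256, 256)
print(stage_c_latents)  # (24, 24)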

Würstchen v2 comes to Diffusers

After the initial paper release, we have improved numerous things in the architecture, training and sampling, making Würstchen competitive with current state-of-the-art models in many ways. We are excited to release this new version together with Diffusers. Here is a list of the improvements.

  • Higher resolution (1024x1024 up to 2048x2048)
  • Faster inference
  • Multi Aspect Resolution Sampling
  • Better quality

We are releasing 3 checkpoints for the text-conditional image generation model (Stage C). Those are:

  • v2-base
  • v2-aesthetic
  • (default) v2-interpolated (50% interpolation between v2-base and v2-aesthetic)

We recommend using v2-interpolated, as it has a nice touch of both photorealism and aesthetics. Use v2-base for fine-tuning, as it does not have a style bias, and use v2-aesthetic for very artistic generations.
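
For illustration, the 50% interpolation simply averages the weights of the two checkpoints. A minimal sketch, assuming base and aesthetic are the two Stage C state dicts already loaded into memory (both variable names are hypothetical):

# hypothetical state dicts of v2-base and v2-aesthetic (identical keys and shapes)
interpolated = {name: 0.5 * base[name] + 0.5 * aesthetic[name] for name in base}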

Text-to-Image Generation

For the sake of usability, Würstchen can be used with a single pipeline. This pipeline can be used as follows:

import mindspore as ms
from mindone.diffusers import WuerstchenCombinedPipeline
from mindone.diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS

pipe = WuerstchenCombinedPipeline.from_pretrained(
    "warp-ai/wuerstchen",
    mindspore_dtype=ms.float16
)
caption = "Anthropomorphic cat dressed as a fire fighter"
images = pipe(
    caption,
    width=1024,
    height=1536,
    prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,
    prior_guidance_scale=4.0,
    num_images_per_prompt=2,
)[0]
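
Since output_type defaults to "pil", images is a list of PIL.Image.Image objects that can be saved directly:

for i, image in enumerate(images):
    image.save(f"wuerstchen_{i}.png")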

For explanation purposes, we can also initialize the two main pipelines of Würstchen individually. Würstchen consists of 3 stages: Stage C, Stage B, and Stage A. Each has a different job, and they only work together. When generating text-conditional images, Stage C first generates the latents in a very compressed latent space. This is what happens in the prior_pipeline. Afterwards, the generated latents are passed to Stage B, which decompresses them into the larger latent space of a VQGAN. These latents can then be decoded by Stage A, which is a VQGAN, into pixel space. Stage B and Stage A are both encapsulated in the decoder_pipeline. For more details, take a look at the paper.

import mindspore as ms
from mindone.diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline
from mindone.diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS

dtype = ms.float16
num_images_per_prompt = 2
prior_pipeline = WuerstchenPriorPipeline.from_pretrained(
    "warp-ai/wuerstchen-prior",
    mindspore_dtype=dtype
)
decoder_pipeline = WuerstchenDecoderPipeline.from_pretrained(
    "warp-ai/wuerstchen",
    mindspore_dtype=dtype
)

caption = "Anthropomorphic cat dressed as a fire fighter"
negative_prompt = ""

prior_output = prior_pipeline(
    prompt=caption,
    height=1024,
    width=1536,
    timesteps=DEFAULT_STAGE_C_TIMESTEPS,
    negative_prompt=negative_prompt,
    guidance_scale=4.0,
    num_images_per_prompt=num_images_per_prompt,
)

decoder_output = decoder_pipeline(
    image_embeddings=prior_output[0],
    prompt=caption,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
)[0]

decoder_output
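
Because the two pipelines are decoupled, the output of the (comparatively expensive) prior can be reused: the same image embeddings can, for instance, be decoded a second time with classifier-free guidance enabled on the decoder. The guidance value below is purely illustrative:

more_images = decoder_pipeline(
    image_embeddings=prior_output[0],
    prompt=caption,
    negative_prompt=negative_prompt,
    guidance_scale=1.1,  # values > 1 enable classifier-free guidance in the decoder
    output_type="pil",
)[0]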

Limitations

  • Due to the high compression employed by Würstchen, generations can lack a good amount of detail. To the human eye, this is especially noticeable in faces, hands, etc.
  • Images can only be generated in 128-pixel steps, e.g. the next higher resolution after 1024x1024 is 1152x1152 (see the sketch after this list for snapping an arbitrary size to this grid)
  • The model lacks the ability to render correct text in images
  • The model often does not achieve photorealism
  • Difficult compositional prompts are hard for the model
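
A minimal helper, for illustration, that snaps an arbitrary target size to the nearest valid resolution (a multiple of 128):

def snap_to_128(size: int) -> int:
    # round to the nearest multiple of 128, with 128 as the minimum
    return max(128, round(size / 128) * 128)

print(snap_to_128(1100))  # 1152
print(snap_to_128(1000))  # 1024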

The original codebase, as well as experimental ideas, can be found at dome272/Wuerstchen.

mindone.diffusers.WuerstchenCombinedPipeline

Bases: DiffusionPipeline

Combined Pipeline for text-to-image generation using Wuerstchen

This model inherits from [DiffusionPipeline]. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

PARAMETER DESCRIPTION
tokenizer

The decoder tokenizer to be used for text inputs.

TYPE: `CLIPTokenizer`

text_encoder

The decoder text encoder to be used for text inputs.

TYPE: `CLIPTextModel`

decoder

The decoder model to be used for decoder image generation pipeline.

TYPE: `WuerstchenDiffNeXt`

scheduler

The scheduler to be used for decoder image generation pipeline.

TYPE: `DDPMWuerstchenScheduler`

vqgan

The VQGAN model to be used for decoder image generation pipeline.

TYPE: `PaellaVQModel`

prior_tokenizer

The prior tokenizer to be used for text inputs.

TYPE: `CLIPTokenizer`

prior_text_encoder

The prior text encoder to be used for text inputs.

TYPE: `CLIPTextModel`

prior_prior

The prior model to be used for prior pipeline.

TYPE: `WuerstchenPrior`

prior_scheduler

The scheduler to be used for prior pipeline.

TYPE: `DDPMWuerstchenScheduler`

Source code in mindone/diffusers/pipelines/wuerstchen/pipeline_wuerstchen_combined.py
class WuerstchenCombinedPipeline(DiffusionPipeline):
    """
    Combined Pipeline for text-to-image generation using Wuerstchen

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
    library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

    Args:
        tokenizer (`CLIPTokenizer`):
            The decoder tokenizer to be used for text inputs.
        text_encoder (`CLIPTextModel`):
            The decoder text encoder to be used for text inputs.
        decoder (`WuerstchenDiffNeXt`):
            The decoder model to be used for decoder image generation pipeline.
        scheduler (`DDPMWuerstchenScheduler`):
            The scheduler to be used for decoder image generation pipeline.
        vqgan (`PaellaVQModel`):
            The VQGAN model to be used for decoder image generation pipeline.
        prior_tokenizer (`CLIPTokenizer`):
            The prior tokenizer to be used for text inputs.
        prior_text_encoder (`CLIPTextModel`):
            The prior text encoder to be used for text inputs.
        prior_prior (`WuerstchenPrior`):
            The prior model to be used for prior pipeline.
        prior_scheduler (`DDPMWuerstchenScheduler`):
            The scheduler to be used for prior pipeline.
    """

    _load_connected_pipes = True

    def __init__(
        self,
        tokenizer: CLIPTokenizer,
        text_encoder: CLIPTextModel,
        decoder: WuerstchenDiffNeXt,
        scheduler: DDPMWuerstchenScheduler,
        vqgan: PaellaVQModel,
        prior_tokenizer: CLIPTokenizer,
        prior_text_encoder: CLIPTextModel,
        prior_prior: WuerstchenPrior,
        prior_scheduler: DDPMWuerstchenScheduler,
    ):
        super().__init__()

        self.register_modules(
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            decoder=decoder,
            scheduler=scheduler,
            vqgan=vqgan,
            prior_prior=prior_prior,
            prior_text_encoder=prior_text_encoder,
            prior_tokenizer=prior_tokenizer,
            prior_scheduler=prior_scheduler,
        )
        self.prior_pipe = WuerstchenPriorPipeline(
            prior=prior_prior,
            text_encoder=prior_text_encoder,
            tokenizer=prior_tokenizer,
            scheduler=prior_scheduler,
        )
        self.decoder_pipe = WuerstchenDecoderPipeline(
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            decoder=decoder,
            scheduler=scheduler,
            vqgan=vqgan,
        )

    def enable_xformers_memory_efficient_attention(self, attention_op: Optional[Callable] = None):
        self.decoder_pipe.enable_xformers_memory_efficient_attention(attention_op)

    def progress_bar(self, iterable=None, total=None):
        self.prior_pipe.progress_bar(iterable=iterable, total=total)
        self.decoder_pipe.progress_bar(iterable=iterable, total=total)

    def set_progress_bar_config(self, **kwargs):
        self.prior_pipe.set_progress_bar_config(**kwargs)
        self.decoder_pipe.set_progress_bar_config(**kwargs)

    def __call__(
        self,
        prompt: Optional[Union[str, List[str]]] = None,
        height: int = 512,
        width: int = 512,
        prior_num_inference_steps: int = 60,
        prior_timesteps: Optional[List[float]] = None,
        prior_guidance_scale: float = 4.0,
        num_inference_steps: int = 12,
        decoder_timesteps: Optional[List[float]] = None,
        decoder_guidance_scale: float = 0.0,
        negative_prompt: Optional[Union[str, List[str]]] = None,
        prompt_embeds: Optional[ms.Tensor] = None,
        negative_prompt_embeds: Optional[ms.Tensor] = None,
        num_images_per_prompt: int = 1,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        latents: Optional[ms.Tensor] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
        prior_callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
        prior_callback_on_step_end_tensor_inputs: List[str] = ["latents"],
        callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
        **kwargs,
    ):
        """
        Function invoked when calling the pipeline for generation.

        Args:
            prompt (`str` or `List[str]`):
                The prompt or prompts to guide the image generation for the prior and decoder.
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
                if `guidance_scale` is less than `1`).
            prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated text embeddings for the prior. Can be used to easily tweak text inputs, *e.g.* prompt
                weighting. If not provided, text embeddings will be generated from `prompt` input argument.
            negative_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated negative text embeddings for the prior. Can be used to easily tweak text inputs, *e.g.*
                prompt weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt`
                input argument.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            height (`int`, *optional*, defaults to 512):
                The height in pixels of the generated image.
            width (`int`, *optional*, defaults to 512):
                The width in pixels of the generated image.
            prior_guidance_scale (`float`, *optional*, defaults to 4.0):
                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
                `prior_guidance_scale` is defined as `w` of equation 2. of [Imagen
                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting
                `prior_guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are
                closely linked to the text `prompt`, usually at the expense of lower image quality.
            prior_num_inference_steps (`Union[int, Dict[float, int]]`, *optional*, defaults to 60):
                The number of prior denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference. For more specific timestep spacing, you can pass customized
                `prior_timesteps`
            num_inference_steps (`int`, *optional*, defaults to 12):
                The number of decoder denoising steps. More denoising steps usually lead to a higher quality image at
                the expense of slower inference. For more specific timestep spacing, you can pass customized
                `timesteps`
            prior_timesteps (`List[float]`, *optional*):
                Custom timesteps to use for the denoising process for the prior. If not defined, equal spaced
                `prior_num_inference_steps` timesteps are used. Must be in descending order.
            decoder_timesteps (`List[float]`, *optional*):
                Custom timesteps to use for the denoising process for the decoder. If not defined, equal spaced
                `num_inference_steps` timesteps are used. Must be in descending order.
            decoder_guidance_scale (`float`, *optional*, defaults to 0.0):
                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
                `guidance_scale` is defined as `w` of equation 2. of [Imagen
                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
                1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`,
                usually at the expense of lower image quality.
            generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
                One or a list of [np.random.Generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
                to make generation deterministic.
            latents (`ms.Tensor`, *optional*):
                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                tensor will be generated by sampling using the supplied random `generator`.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
                (`np.array`) or `"ms"` (`ms.Tensor`).
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
            prior_callback_on_step_end (`Callable`, *optional*):
                A function that is called at the end of each denoising step during inference, with the following
                arguments: `prior_callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
                callback_kwargs: Dict)`.
            prior_callback_on_step_end_tensor_inputs (`List`, *optional*):
                The list of tensor inputs for the `prior_callback_on_step_end` function. The tensors specified in the
                list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in
                the `._callback_tensor_inputs` attribute of your pipeline class.
            callback_on_step_end (`Callable`, *optional*):
                A function that is called at the end of each denoising step during inference, with the following
                arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
                `callback_on_step_end_tensor_inputs`.
            callback_on_step_end_tensor_inputs (`List`, *optional*):
                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
                `._callback_tensor_inputs` attribute of your pipeline class.

        Examples:

        Returns:
            [`~pipelines.ImagePipelineOutput`] if `return_dict` is True, otherwise a `tuple`. When returning a
            tuple, the first element is a list with the generated images.
        """
        prior_kwargs = {}
        if kwargs.get("prior_callback", None) is not None:
            prior_kwargs["callback"] = kwargs.pop("prior_callback")
            deprecate(
                "prior_callback",
                "1.0.0",
                "Passing `prior_callback` as an input argument to `__call__` is deprecated, consider use `prior_callback_on_step_end`",
            )
        if kwargs.get("prior_callback_steps", None) is not None:
            deprecate(
                "prior_callback_steps",
                "1.0.0",
                "Passing `prior_callback_steps` as an input argument to `__call__` is deprecated, consider use `prior_callback_on_step_end`",
            )
            prior_kwargs["callback_steps"] = kwargs.pop("prior_callback_steps")

        prior_outputs = self.prior_pipe(
            prompt=prompt if prompt_embeds is None else None,
            height=height,
            width=width,
            num_inference_steps=prior_num_inference_steps,
            timesteps=prior_timesteps,
            guidance_scale=prior_guidance_scale,
            negative_prompt=negative_prompt if negative_prompt_embeds is None else None,
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
            num_images_per_prompt=num_images_per_prompt,
            generator=generator,
            latents=latents,
            output_type="ms",
            return_dict=False,
            callback_on_step_end=prior_callback_on_step_end,
            callback_on_step_end_tensor_inputs=prior_callback_on_step_end_tensor_inputs,
            **prior_kwargs,
        )
        image_embeddings = prior_outputs[0]

        outputs = self.decoder_pipe(
            image_embeddings=image_embeddings,
            prompt=prompt if prompt is not None else "",
            num_inference_steps=num_inference_steps,
            timesteps=decoder_timesteps,
            guidance_scale=decoder_guidance_scale,
            negative_prompt=negative_prompt,
            generator=generator,
            output_type=output_type,
            return_dict=return_dict,
            callback_on_step_end=callback_on_step_end,
            callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
            **kwargs,
        )

        return outputs

mindone.diffusers.WuerstchenCombinedPipeline.__call__(prompt=None, height=512, width=512, prior_num_inference_steps=60, prior_timesteps=None, prior_guidance_scale=4.0, num_inference_steps=12, decoder_timesteps=None, decoder_guidance_scale=0.0, negative_prompt=None, prompt_embeds=None, negative_prompt_embeds=None, num_images_per_prompt=1, generator=None, latents=None, output_type='pil', return_dict=False, prior_callback_on_step_end=None, prior_callback_on_step_end_tensor_inputs=['latents'], callback_on_step_end=None, callback_on_step_end_tensor_inputs=['latents'], **kwargs)

Function invoked when calling the pipeline for generation.

PARAMETER DESCRIPTION
prompt

The prompt or prompts to guide the image generation for the prior and decoder.

TYPE: `str` or `List[str]` DEFAULT: None

negative_prompt

The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

prompt_embeds

Pre-generated text embeddings for the prior. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

negative_prompt_embeds

Pre-generated negative text embeddings for the prior. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

num_images_per_prompt

The number of images to generate per prompt.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

height

The height in pixels of the generated image.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

width

The width in pixels of the generated image.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

prior_guidance_scale

Guidance scale as defined in Classifier-Free Diffusion Guidance. prior_guidance_scale is defined as w in equation 2 of the Imagen paper. Guidance scale is enabled by setting prior_guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.

TYPE: `float`, *optional*, defaults to 4.0 DEFAULT: 4.0

prior_num_inference_steps

The number of prior denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. For more specific timestep spacing, you can pass customized prior_timesteps.

TYPE: `Union[int, Dict[float, int]]`, *optional*, defaults to 60 DEFAULT: 60

num_inference_steps

The number of decoder denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. For more specific timestep spacing, you can pass customized timesteps.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

prior_timesteps

Custom timesteps to use for the denoising process for the prior. If not defined, equal spaced prior_num_inference_steps timesteps are used. Must be in descending order.

TYPE: `List[float]`, *optional* DEFAULT: None

decoder_timesteps

Custom timesteps to use for the denoising process for the decoder. If not defined, equal spaced num_inference_steps timesteps are used. Must be in descending order.

TYPE: `List[float]`, *optional* DEFAULT: None

decoder_guidance_scale

Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

generator

One or a list of np.random.Generator(s) to make generation deterministic.

TYPE: `np.random.Generator` or `List[np.random.Generator]`, *optional* DEFAULT: None

latents

Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

output_type

The output format of the generated image. Choose between: "pil" (PIL.Image.Image), "np" (np.array) or "ms" (ms.Tensor).

TYPE: `str`, *optional*, defaults to `"pil"` DEFAULT: 'pil'

return_dict

Whether or not to return a [~pipelines.ImagePipelineOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

prior_callback_on_step_end

A function that is called at the end of each denoising step during inference, with the following arguments: prior_callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict).

TYPE: `Callable`, *optional* DEFAULT: None

prior_callback_on_step_end_tensor_inputs

The list of tensor inputs for the prior_callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

TYPE: `List`, *optional* DEFAULT: ['latents']

callback_on_step_end

A function that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.

TYPE: `Callable`, *optional* DEFAULT: None

callback_on_step_end_tensor_inputs

The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

TYPE: `List`, *optional* DEFAULT: ['latents']

RETURNS DESCRIPTION

[~pipelines.ImagePipelineOutput] if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.
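
As an illustration of the step-end callbacks, here is a minimal sketch (the function name is arbitrary) that logs the prior's progress, reusing pipe and caption from the example above. Note that the callback is expected to return the (possibly modified) callback_kwargs:

def log_prior_step(pipe, step, timestep, callback_kwargs):
    # "latents" is available because it is listed in
    # prior_callback_on_step_end_tensor_inputs by default
    latents = callback_kwargs["latents"]
    print(f"prior step {step}, t={timestep}, latents shape: {latents.shape}")
    return callback_kwargs

images = pipe(
    caption,
    prior_callback_on_step_end=log_prior_step,
)[0]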

Source code in mindone/diffusers/pipelines/wuerstchen/pipeline_wuerstchen_combined.py
def __call__(
    self,
    prompt: Optional[Union[str, List[str]]] = None,
    height: int = 512,
    width: int = 512,
    prior_num_inference_steps: int = 60,
    prior_timesteps: Optional[List[float]] = None,
    prior_guidance_scale: float = 4.0,
    num_inference_steps: int = 12,
    decoder_timesteps: Optional[List[float]] = None,
    decoder_guidance_scale: float = 0.0,
    negative_prompt: Optional[Union[str, List[str]]] = None,
    prompt_embeds: Optional[ms.Tensor] = None,
    negative_prompt_embeds: Optional[ms.Tensor] = None,
    num_images_per_prompt: int = 1,
    generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
    latents: Optional[ms.Tensor] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
    prior_callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
    prior_callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    **kwargs,
):
    """
    Function invoked when calling the pipeline for generation.

    Args:
        prompt (`str` or `List[str]`):
            The prompt or prompts to guide the image generation for the prior and decoder.
        negative_prompt (`str` or `List[str]`, *optional*):
            The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
            if `guidance_scale` is less than `1`).
        prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated text embeddings for the prior. Can be used to easily tweak text inputs, *e.g.* prompt
            weighting. If not provided, text embeddings will be generated from `prompt` input argument.
        negative_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated negative text embeddings for the prior. Can be used to easily tweak text inputs, *e.g.*
            prompt weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt`
            input argument.
        num_images_per_prompt (`int`, *optional*, defaults to 1):
            The number of images to generate per prompt.
        height (`int`, *optional*, defaults to 512):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to 512):
            The width in pixels of the generated image.
        prior_guidance_scale (`float`, *optional*, defaults to 4.0):
            Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
            `prior_guidance_scale` is defined as `w` of equation 2. of [Imagen
            Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting
            `prior_guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are
            closely linked to the text `prompt`, usually at the expense of lower image quality.
        prior_num_inference_steps (`Union[int, Dict[float, int]]`, *optional*, defaults to 60):
            The number of prior denoising steps. More denoising steps usually lead to a higher quality image at the
            expense of slower inference. For more specific timestep spacing, you can pass customized
            `prior_timesteps`
        num_inference_steps (`int`, *optional*, defaults to 12):
            The number of decoder denoising steps. More denoising steps usually lead to a higher quality image at
            the expense of slower inference. For more specific timestep spacing, you can pass customized
            `timesteps`
        prior_timesteps (`List[float]`, *optional*):
            Custom timesteps to use for the denoising process for the prior. If not defined, equal spaced
            `prior_num_inference_steps` timesteps are used. Must be in descending order.
        decoder_timesteps (`List[float]`, *optional*):
            Custom timesteps to use for the denoising process for the decoder. If not defined, equal spaced
            `num_inference_steps` timesteps are used. Must be in descending order.
        decoder_guidance_scale (`float`, *optional*, defaults to 0.0):
            Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
            `guidance_scale` is defined as `w` of equation 2. of [Imagen
            Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
            1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`,
            usually at the expense of lower image quality.
        generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
            One or a list of [np.random.Generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
            to make generation deterministic.
        latents (`ms.Tensor`, *optional*):
            Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
            generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
            tensor will be generated by sampling using the supplied random `generator`.
        output_type (`str`, *optional*, defaults to `"pil"`):
            The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
            (`np.array`) or `"ms"` (`ms.Tensor`).
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
        prior_callback_on_step_end (`Callable`, *optional*):
            A function that is called at the end of each denoising step during inference, with the following
            arguments: `prior_callback_on_step_end(self: DiffusionPipeline, step: int, timestep:
            int, callback_kwargs: Dict)`.
        prior_callback_on_step_end_tensor_inputs (`List`, *optional*):
            The list of tensor inputs for the `prior_callback_on_step_end` function. The tensors specified in the
            list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in
            the `._callback_tensor_inputs` attribute of your pipeline class.
        callback_on_step_end (`Callable`, *optional*):
            A function that is called at the end of each denoising step during inference, with the following
            arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
            callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
            `callback_on_step_end_tensor_inputs`.
        callback_on_step_end_tensor_inputs (`List`, *optional*):
            The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
            will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
            `._callback_tensor_inputs` attribute of your pipeline class.

    Examples:

    Returns:
        [`~pipelines.ImagePipelineOutput`] if `return_dict` is True, otherwise a `tuple`. When returning a tuple,
        the first element is a list with the generated images.
    """
    prior_kwargs = {}
    if kwargs.get("prior_callback", None) is not None:
        prior_kwargs["callback"] = kwargs.pop("prior_callback")
        deprecate(
            "prior_callback",
            "1.0.0",
            "Passing `prior_callback` as an input argument to `__call__` is deprecated, consider use `prior_callback_on_step_end`",
        )
    if kwargs.get("prior_callback_steps", None) is not None:
        deprecate(
            "prior_callback_steps",
            "1.0.0",
            "Passing `prior_callback_steps` as an input argument to `__call__` is deprecated, consider use `prior_callback_on_step_end`",
        )
        prior_kwargs["callback_steps"] = kwargs.pop("prior_callback_steps")

    prior_outputs = self.prior_pipe(
        prompt=prompt if prompt_embeds is None else None,
        height=height,
        width=width,
        num_inference_steps=prior_num_inference_steps,
        timesteps=prior_timesteps,
        guidance_scale=prior_guidance_scale,
        negative_prompt=negative_prompt if negative_prompt_embeds is None else None,
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        num_images_per_prompt=num_images_per_prompt,
        generator=generator,
        latents=latents,
        output_type="ms",
        return_dict=False,
        callback_on_step_end=prior_callback_on_step_end,
        callback_on_step_end_tensor_inputs=prior_callback_on_step_end_tensor_inputs,
        **prior_kwargs,
    )
    image_embeddings = prior_outputs[0]

    outputs = self.decoder_pipe(
        image_embeddings=image_embeddings,
        prompt=prompt if prompt is not None else "",
        num_inference_steps=num_inference_steps,
        timesteps=decoder_timesteps,
        guidance_scale=decoder_guidance_scale,
        negative_prompt=negative_prompt,
        generator=generator,
        output_type=output_type,
        return_dict=return_dict,
        callback_on_step_end=callback_on_step_end,
        callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
        **kwargs,
    )

    return outputs

mindone.diffusers.WuerstchenPriorPipeline

Bases: DiffusionPipeline, LoraLoaderMixin

Pipeline for generating image prior for Wuerstchen.

This model inherits from [DiffusionPipeline]. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

The pipeline also inherits the following loading methods (a usage sketch follows the list):
  • [~loaders.LoraLoaderMixin.load_lora_weights] for loading LoRA weights
  • [~loaders.LoraLoaderMixin.save_lora_weights] for saving LoRA weights
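
For illustration, a minimal sketch of loading LoRA weights into the prior pipeline; the LoRA repository id below is hypothetical:

import mindspore as ms
from mindone.diffusers import WuerstchenPriorPipeline

prior_pipeline = WuerstchenPriorPipeline.from_pretrained(
    "warp-ai/wuerstchen-prior",
    mindspore_dtype=ms.float16
)
# hypothetical LoRA checkpoint; any local path or Hub repo containing
# LoRA weights in the expected format can be passed here
prior_pipeline.load_lora_weights("your-username/wuerstchen-prior-lora")
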
PARAMETER DESCRIPTION
prior

The canonical unCLIP prior to approximate the image embedding from the text embedding.

TYPE: [`Prior`]

text_encoder

Frozen text-encoder.

TYPE: [`CLIPTextModelWithProjection`]

tokenizer

Tokenizer of class CLIPTokenizer.

TYPE: `CLIPTokenizer`

scheduler

A scheduler to be used in combination with prior to generate image embedding.

TYPE: [`DDPMWuerstchenScheduler`]

latent_mean

Mean value for latent diffusers.

TYPE: 'float', *optional*, defaults to 42.0 DEFAULT: 42.0

latent_std

Standard value for latent diffusers.

TYPE: 'float', *optional*, defaults to 1.0 DEFAULT: 1.0

resolution_multiple

Factor between the pixel resolution and the Stage C latent resolution; the latent height and width are computed as ceil(size / resolution_multiple).

TYPE: 'float', *optional*, defaults to 42.67 DEFAULT: 42.67

Source code in mindone/diffusers/pipelines/wuerstchen/pipeline_wuerstchen_prior.py
class WuerstchenPriorPipeline(DiffusionPipeline, LoraLoaderMixin):
    """
    Pipeline for generating image prior for Wuerstchen.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
    library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

    The pipeline also inherits the following loading methods:
        - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights
        - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights

    Args:
        prior ([`Prior`]):
            The canonical unCLIP prior to approximate the image embedding from the text embedding.
        text_encoder ([`CLIPTextModelWithProjection`]):
            Frozen text-encoder.
        tokenizer (`CLIPTokenizer`):
            Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        scheduler ([`DDPMWuerstchenScheduler`]):
            A scheduler to be used in combination with `prior` to generate image embedding.
        latent_mean ('float', *optional*, defaults to 42.0):
            Mean value for latent diffusers.
        latent_std ('float', *optional*, defaults to 1.0):
            Standard value for latent diffusers.
        resolution_multiple ('float', *optional*, defaults to 42.67):
            Default resolution for multiple images generated.
    """

    unet_name = "prior"
    text_encoder_name = "text_encoder"
    model_cpu_offload_seq = "text_encoder->prior"
    _callback_tensor_inputs = ["latents", "text_encoder_hidden_states", "negative_prompt_embeds"]

    def __init__(
        self,
        tokenizer: CLIPTokenizer,
        text_encoder: CLIPTextModel,
        prior: WuerstchenPrior,
        scheduler: DDPMWuerstchenScheduler,
        latent_mean: float = 42.0,
        latent_std: float = 1.0,
        resolution_multiple: float = 42.67,
    ) -> None:
        super().__init__()
        self.register_modules(
            tokenizer=tokenizer,
            text_encoder=text_encoder,
            prior=prior,
            scheduler=scheduler,
        )
        self.register_to_config(latent_mean=latent_mean, latent_std=latent_std, resolution_multiple=resolution_multiple)

    # Copied from diffusers.pipelines.unclip.pipeline_unclip.UnCLIPPipeline.prepare_latents
    def prepare_latents(self, shape, dtype, generator, latents, scheduler):
        if latents is None:
            latents = randn_tensor(shape, generator=generator, dtype=dtype)
        else:
            if latents.shape != shape:
                raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")

        latents = (latents * scheduler.init_noise_sigma).to(dtype)
        return latents

    def encode_prompt(
        self,
        num_images_per_prompt,
        do_classifier_free_guidance,
        prompt=None,
        negative_prompt=None,
        prompt_embeds: Optional[ms.Tensor] = None,
        negative_prompt_embeds: Optional[ms.Tensor] = None,
    ):
        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            batch_size = prompt_embeds.shape[0]

        if prompt_embeds is None:
            # get prompt text embeddings
            text_inputs = self.tokenizer(
                prompt,
                padding="max_length",
                max_length=self.tokenizer.model_max_length,
                truncation=True,
                return_tensors="np",
            )
            text_input_ids = text_inputs.input_ids
            attention_mask = ms.Tensor(text_inputs.attention_mask)

            untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="np").input_ids

            if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not np.array_equal(
                text_input_ids, untruncated_ids
            ):
                removed_text = self.tokenizer.batch_decode(untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1])
                logger.warning(
                    "The following part of your input was truncated because CLIP can only handle sequences up to"
                    f" {self.tokenizer.model_max_length} tokens: {removed_text}"
                )
                text_input_ids = text_input_ids[:, : self.tokenizer.model_max_length]
                attention_mask = attention_mask[:, : self.tokenizer.model_max_length]

            text_encoder_output = self.text_encoder(ms.Tensor(text_input_ids), attention_mask=attention_mask)
            prompt_embeds = text_encoder_output[0]

        prompt_embeds = prompt_embeds.to(dtype=self.text_encoder.dtype)
        prompt_embeds = prompt_embeds.repeat_interleave(num_images_per_prompt, dim=0)

        if negative_prompt_embeds is None and do_classifier_free_guidance:
            uncond_tokens: List[str]
            if negative_prompt is None:
                uncond_tokens = [""] * batch_size
            elif type(prompt) is not type(negative_prompt):
                raise TypeError(
                    f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
                    f" {type(prompt)}."
                )
            elif isinstance(negative_prompt, str):
                uncond_tokens = [negative_prompt]
            elif batch_size != len(negative_prompt):
                raise ValueError(
                    f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
                    f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
                    " the batch size of `prompt`."
                )
            else:
                uncond_tokens = negative_prompt

            uncond_input = self.tokenizer(
                uncond_tokens,
                padding="max_length",
                max_length=self.tokenizer.model_max_length,
                truncation=True,
                return_tensors="np",
            )
            negative_prompt_embeds_text_encoder_output = self.text_encoder(
                ms.Tensor(uncond_input.input_ids), attention_mask=ms.Tensor(uncond_input.attention_mask)
            )

            negative_prompt_embeds = negative_prompt_embeds_text_encoder_output[0]

        if do_classifier_free_guidance:
            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
            seq_len = negative_prompt_embeds.shape[1]
            negative_prompt_embeds = negative_prompt_embeds.to(dtype=self.text_encoder.dtype)
            negative_prompt_embeds = negative_prompt_embeds.tile((1, num_images_per_prompt, 1))
            negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
            # done duplicates

        return prompt_embeds, negative_prompt_embeds

    def check_inputs(
        self,
        prompt,
        negative_prompt,
        num_inference_steps,
        do_classifier_free_guidance,
        prompt_embeds=None,
        negative_prompt_embeds=None,
    ):
        if prompt is not None and prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
                " only forward one of the two."
            )
        elif prompt is None and prompt_embeds is None:
            raise ValueError(
                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
            )
        elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
            raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")

        if negative_prompt is not None and negative_prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
            )

        if prompt_embeds is not None and negative_prompt_embeds is not None:
            if prompt_embeds.shape != negative_prompt_embeds.shape:
                raise ValueError(
                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
                    f" {negative_prompt_embeds.shape}."
                )

        if not isinstance(num_inference_steps, int):
            raise TypeError(
                f"'num_inference_steps' must be of type 'int', but got {type(num_inference_steps)}\
                           In Case you want to provide explicit timesteps, please use the 'timesteps' argument."
            )

    @property
    def guidance_scale(self):
        return self._guidance_scale

    @property
    def do_classifier_free_guidance(self):
        return self._guidance_scale > 1

    @property
    def num_timesteps(self):
        return self._num_timesteps

    def __call__(
        self,
        prompt: Optional[Union[str, List[str]]] = None,
        height: int = 1024,
        width: int = 1024,
        num_inference_steps: int = 60,
        timesteps: List[float] = None,
        guidance_scale: float = 8.0,
        negative_prompt: Optional[Union[str, List[str]]] = None,
        prompt_embeds: Optional[ms.Tensor] = None,
        negative_prompt_embeds: Optional[ms.Tensor] = None,
        num_images_per_prompt: Optional[int] = 1,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        latents: Optional[ms.Tensor] = None,
        output_type: Optional[str] = "ms",
        return_dict: bool = False,
        callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
        **kwargs,
    ):
        """
        Function invoked when calling the pipeline for generation.

        Args:
            prompt (`str` or `List[str]`):
                The prompt or prompts to guide the image generation.
            height (`int`, *optional*, defaults to 1024):
                The height in pixels of the generated image.
            width (`int`, *optional*, defaults to 1024):
                The width in pixels of the generated image.
            num_inference_steps (`int`, *optional*, defaults to 60):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            timesteps (`List[float]`, *optional*):
                Custom timesteps to use for the denoising process. If not defined, equal spaced `num_inference_steps`
                timesteps are used. Must be in descending order.
            guidance_scale (`float`, *optional*, defaults to 8.0):
                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
                `guidance_scale` is defined as `w` of equation 2. of [Imagen
                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting
                `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are
                closely linked to the text `prompt`, usually at the expense of lower image quality.
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
                if `guidance_scale` is less than `1`).
            prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
                provided, text embeddings will be generated from `prompt` input argument.
            negative_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
                argument.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
                One or a list of [np.random.Generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
                to make generation deterministic.
            latents (`ms.Tensor`, *optional*):
                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                tensor will be generated by sampling using the supplied random `generator`.
            output_type (`str`, *optional*, defaults to `"ms"`):
                The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
                (`np.array`) or `"ms"` (`ms.Tensor`).
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
            callback_on_step_end (`Callable`, *optional*):
                A function that is called at the end of each denoising step during inference, with the following
                arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
                `callback_on_step_end_tensor_inputs`.
            callback_on_step_end_tensor_inputs (`List`, *optional*):
                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
                `._callback_tensor_inputs` attribute of your pipeline class.

        Examples:

        Returns:
            [`~pipelines.WuerstchenPriorPipelineOutput`] if `return_dict` is True, otherwise a `tuple`. When
            returning a tuple, the first element is a list with the generated image embeddings.
        """

        callback = kwargs.pop("callback", None)
        callback_steps = kwargs.pop("callback_steps", None)

        if callback is not None:
            deprecate(
                "callback",
                "1.0.0",
                "Passing `callback` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
            )
        if callback_steps is not None:
            deprecate(
                "callback_steps",
                "1.0.0",
                "Passing `callback_steps` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
            )

        if callback_on_step_end_tensor_inputs is not None and not all(
            k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
        ):
            raise ValueError(
                f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, "
                f"but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
            )

        # 0. Define commonly used variables
        self._guidance_scale = guidance_scale
        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            batch_size = prompt_embeds.shape[0]

        # 1. Check inputs. Raise error if not correct
        if prompt is not None and not isinstance(prompt, list):
            if isinstance(prompt, str):
                prompt = [prompt]
            else:
                raise TypeError(f"'prompt' must be of type 'list' or 'str', but got {type(prompt)}.")

        if self.do_classifier_free_guidance:
            if negative_prompt is not None and not isinstance(negative_prompt, list):
                if isinstance(negative_prompt, str):
                    negative_prompt = [negative_prompt]
                else:
                    raise TypeError(
                        f"'negative_prompt' must be of type 'list' or 'str', but got {type(negative_prompt)}."
                    )

        self.check_inputs(
            prompt,
            negative_prompt,
            num_inference_steps,
            self.do_classifier_free_guidance,
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
        )

        # 2. Encode caption
        prompt_embeds, negative_prompt_embeds = self.encode_prompt(
            prompt=prompt,
            num_images_per_prompt=num_images_per_prompt,
            do_classifier_free_guidance=self.do_classifier_free_guidance,
            negative_prompt=negative_prompt,
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
        )

        # For classifier free guidance, we need to do two forward passes.
        # Here we concatenate the unconditional and text embeddings into a single batch
        # to avoid doing two forward passes
        text_encoder_hidden_states = (
            ops.cat([prompt_embeds, negative_prompt_embeds]) if negative_prompt_embeds is not None else prompt_embeds
        )

        # 3. Determine latent shape of image embeddings
        dtype = text_encoder_hidden_states.dtype
        latent_height = ceil(height / self.config.resolution_multiple)
        latent_width = ceil(width / self.config.resolution_multiple)
        num_channels = self.prior.config.c_in
        effnet_features_shape = (num_images_per_prompt * batch_size, num_channels, latent_height, latent_width)

        # 4. Prepare and set timesteps
        if timesteps is not None:
            self.scheduler.set_timesteps(timesteps=timesteps)
            timesteps = self.scheduler.timesteps
            num_inference_steps = len(timesteps)
        else:
            self.scheduler.set_timesteps(num_inference_steps)
            timesteps = self.scheduler.timesteps

        # 5. Prepare latents
        latents = self.prepare_latents(effnet_features_shape, dtype, generator, latents, self.scheduler)

        # 6. Run denoising loop
        self._num_timesteps = len(timesteps[:-1])
        for i, t in enumerate(self.progress_bar(timesteps[:-1])):
            ratio = t.broadcast_to((latents.shape[0],)).to(dtype)

            # 7. Denoise image embeddings
            predicted_image_embedding = self.prior(
                ops.cat([latents] * 2) if self.do_classifier_free_guidance else latents,
                r=ops.cat([ratio] * 2) if self.do_classifier_free_guidance else ratio,
                c=text_encoder_hidden_states,
            )

            # 8. Check for classifier free guidance and apply it
            if self.do_classifier_free_guidance:
                predicted_image_embedding_text, predicted_image_embedding_uncond = predicted_image_embedding.chunk(2)
                predicted_image_embedding = ops.lerp(
                    predicted_image_embedding_uncond,
                    predicted_image_embedding_text,
                    ms.tensor(self.guidance_scale, dtype=predicted_image_embedding_text.dtype),
                )

            # 9. Renoise latents to next timestep
            latents = self.scheduler.step(
                model_output=predicted_image_embedding,
                timestep=ratio,
                sample=latents,
                generator=generator,
            )[0]

            if callback_on_step_end is not None:
                callback_kwargs = {}
                for k in callback_on_step_end_tensor_inputs:
                    callback_kwargs[k] = locals()[k]
                callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                latents = callback_outputs.pop("latents", latents)
                text_encoder_hidden_states = callback_outputs.pop(
                    "text_encoder_hidden_states", text_encoder_hidden_states
                )
                negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)

            if callback is not None and i % callback_steps == 0:
                step_idx = i // getattr(self.scheduler, "order", 1)
                callback(step_idx, t, latents)

        # 10. Denormalize the latents
        latents = latents * self.config.latent_mean - self.config.latent_std

        if output_type == "np":
            latents = latents.float().numpy()

        if not return_dict:
            return (latents,)

        return WuerstchenPriorPipelineOutput(latents)
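
The classifier-free guidance update in step 8 above is expressed with `ops.lerp`. Since lerp(a, b, w) = a + w * (b - a), a weight w > 1 extrapolates past the text-conditional prediction, which is exactly the familiar CFG formula uncond + w * (text - uncond). A minimal NumPy sketch of that equivalence (illustrative only; not part of the pipeline):

import numpy as np

rng = np.random.default_rng(0)
uncond = rng.standard_normal((1, 16))  # stand-in for the unconditional prediction
text = rng.standard_normal((1, 16))    # stand-in for the text-conditional prediction
w = 8.0                                # guidance_scale

cfg = uncond + w * (text - uncond)     # classic classifier-free guidance
lerp = (1 - w) * uncond + w * text     # what lerp(uncond, text, w) computes
assert np.allclose(cfg, lerp)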

mindone.diffusers.WuerstchenPriorPipeline.__call__(prompt=None, height=1024, width=1024, num_inference_steps=60, timesteps=None, guidance_scale=8.0, negative_prompt=None, prompt_embeds=None, negative_prompt_embeds=None, num_images_per_prompt=1, generator=None, latents=None, output_type='ms', return_dict=False, callback_on_step_end=None, callback_on_step_end_tensor_inputs=['latents'], **kwargs)

Function invoked when calling the pipeline for generation.

PARAMETER DESCRIPTION
prompt

The prompt or prompts to guide the image generation.

TYPE: `str` or `List[str]` DEFAULT: None

height

The height in pixels of the generated image.

TYPE: `int`, *optional*, defaults to 1024 DEFAULT: 1024

width

The width in pixels of the generated image.

TYPE: `int`, *optional*, defaults to 1024 DEFAULT: 1024

num_inference_steps

The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 60 DEFAULT: 60

timesteps

Custom timesteps to use for the denoising process. If not defined, equally spaced num_inference_steps timesteps are used. Must be in descending order.

TYPE: `List[int]`, *optional* DEFAULT: None

guidance_scale

Guidance scale as defined in Classifier-Free Diffusion Guidance. decoder_guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance scale is enabled by setting decoder_guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.

TYPE: `float`, *optional*, defaults to 8.0 DEFAULT: 8.0

negative_prompt

The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if decoder_guidance_scale is less than 1).

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

prompt_embeds

Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

negative_prompt_embeds

Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

num_images_per_prompt

The number of images to generate per prompt.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

generator

One or a list of np.random.Generator(s) to make generation deterministic.

TYPE: `np.random.Generator` or `List[np.random.Generator]`, *optional* DEFAULT: None

latents

Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

output_type

The output format of the generated image. Choose between: "pil" (PIL.Image.Image), "np" (np.array) or "ms" (ms.Tensor).

TYPE: `str`, *optional*, defaults to `"ms"` DEFAULT: 'ms'

return_dict

Whether or not to return a [~pipelines.WuerstchenPriorPipelineOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

callback_on_step_end

A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.

TYPE: `Callable`, *optional* DEFAULT: None

callback_on_step_end_tensor_inputs

The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

TYPE: `List`, *optional* DEFAULT: ['latents']
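
As an illustration of the two callback arguments above, here is a minimal sketch of a step-end callback that logs latent statistics, assuming prior_pipe is a loaded WuerstchenPriorPipeline. Only the (pipe, step, timestep, callback_kwargs) signature and the returned dict follow the contract documented here; the function name and print format are hypothetical:

def log_latents(pipe, step, timestep, callback_kwargs):
    # "latents" is available because it is listed in callback_on_step_end_tensor_inputs.
    latents = callback_kwargs["latents"]
    print(f"step {step}: latent mean={latents.mean()}, std={latents.std()}")
    # Any entries returned here overwrite the pipeline's tensors for the next step.
    return {"latents": latents}

prior_output = prior_pipe(
    prompt="an astronaut riding a horse",
    callback_on_step_end=log_latents,
    callback_on_step_end_tensor_inputs=["latents"],
)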

RETURNS DESCRIPTION

[~pipelines.WuerstchenPriorPipelineOutput] or tuple: [~pipelines.WuerstchenPriorPipelineOutput] if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated image embeddings.
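
Because the docstring's Examples section is empty, here is a minimal usage sketch for the prior stage. The checkpoint id warp-ai/wuerstchen-prior and the from_pretrained loading pattern are assumptions carried over from the upstream diffusers integration; treat this as a sketch rather than a verified snippet:

import numpy as np
from mindone.diffusers import WuerstchenPriorPipeline

# Stage C: turn a text prompt into compact image embeddings (assumed checkpoint id).
prior_pipe = WuerstchenPriorPipeline.from_pretrained("warp-ai/wuerstchen-prior")

generator = np.random.default_rng(42)  # an np.random.Generator, for deterministic sampling
prior_output = prior_pipe(
    prompt="an image of a shiba inu, donning a spacesuit and helmet",
    height=1024,
    width=1024,
    num_inference_steps=60,
    guidance_scale=8.0,
    generator=generator,
)
# return_dict defaults to False, so a tuple comes back; element 0 holds the embeddings.
image_embeddings = prior_output[0]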


mindone.diffusers.pipelines.wuerstchen.pipeline_wuerstchen_prior.WuerstchenPriorPipelineOutput dataclass

Bases: BaseOutput

Output class for WuerstchenPriorPipeline.

Source code in mindone/diffusers/pipelines/wuerstchen/pipeline_wuerstchen_prior.py
@dataclass
class WuerstchenPriorPipelineOutput(BaseOutput):
    """
    Output class for WuerstchenPriorPipeline.

    Args:
        image_embeddings (`ms.Tensor` or `np.ndarray`):
            Prior image embeddings for the text prompt.

    """

    image_embeddings: Union[ms.Tensor, np.ndarray]

mindone.diffusers.WuerstchenDecoderPipeline

Bases: DiffusionPipeline

Pipeline for generating images from the Wuerstchen model.

This model inherits from [DiffusionPipeline]. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

PARAMETER DESCRIPTION
tokenizer

The CLIP tokenizer.

TYPE: `CLIPTokenizer`

text_encoder

The CLIP text encoder.

TYPE: `CLIPTextModel`

decoder

The WuerstchenDiffNeXt unet decoder.

TYPE: [`WuerstchenDiffNeXt`]

vqgan

The VQGAN model.

TYPE: [`PaellaVQModel`]

scheduler

A scheduler to be used in combination with the decoder to generate the image latents.

TYPE: [`DDPMWuerstchenScheduler`]

latent_dim_scale

Multiplier to determine the VQ latent space size from the image embeddings. If the image embeddings are height=24 and width=24, the VQ latent shape needs to be height=int(24*10.67)=256 and width=int(24*10.67)=256 in order to match the training conditions.

TYPE: float, `optional`, defaults to 10.67 DEFAULT: 10.67
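
As a quick check of the arithmetic above (illustrative only), this reproduces the latent-shape computation that __call__ performs in step 3:

latent_dim_scale = 10.67
embed_h, embed_w = 24, 24  # spatial size of the image embeddings
latent_h = int(embed_h * latent_dim_scale)  # int(256.08) == 256
latent_w = int(embed_w * latent_dim_scale)
assert (latent_h, latent_w) == (256, 256)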

Source code in mindone/diffusers/pipelines/wuerstchen/pipeline_wuerstchen.py
class WuerstchenDecoderPipeline(DiffusionPipeline):
    """
    Pipeline for generating images from the Wuerstchen model.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
    library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

    Args:
        tokenizer (`CLIPTokenizer`):
            The CLIP tokenizer.
        text_encoder (`CLIPTextModel`):
            The CLIP text encoder.
        decoder ([`WuerstchenDiffNeXt`]):
            The WuerstchenDiffNeXt unet decoder.
        vqgan ([`PaellaVQModel`]):
            The VQGAN model.
        scheduler ([`DDPMWuerstchenScheduler`]):
            A scheduler to be used in combination with `decoder` to generate the image latents.
        latent_dim_scale (float, `optional`, defaults to 10.67):
            Multiplier to determine the VQ latent space size from the image embeddings. If the image embeddings are
            height=24 and width=24, the VQ latent shape needs to be height=int(24*10.67)=256 and
            width=int(24*10.67)=256 in order to match the training conditions.
    """

    model_cpu_offload_seq = "text_encoder->decoder->vqgan"
    _callback_tensor_inputs = [
        "latents",
        "text_encoder_hidden_states",
        "negative_prompt_embeds",
        "image_embeddings",
    ]

    def __init__(
        self,
        tokenizer: CLIPTokenizer,
        text_encoder: CLIPTextModel,
        decoder: WuerstchenDiffNeXt,
        scheduler: DDPMWuerstchenScheduler,
        vqgan: PaellaVQModel,
        latent_dim_scale: float = 10.67,
    ) -> None:
        super().__init__()
        self.register_modules(
            tokenizer=tokenizer,
            text_encoder=text_encoder,
            decoder=decoder,
            scheduler=scheduler,
            vqgan=vqgan,
        )
        self.register_to_config(latent_dim_scale=latent_dim_scale)

    # Copied from diffusers.pipelines.unclip.pipeline_unclip.UnCLIPPipeline.prepare_latents
    def prepare_latents(self, shape, dtype, generator, latents, scheduler):
        if latents is None:
            latents = randn_tensor(shape, generator=generator, dtype=dtype)
        else:
            if latents.shape != shape:
                raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")
            latents = latents

        latents = (latents * scheduler.init_noise_sigma).to(dtype)
        return latents

    def encode_prompt(
        self,
        prompt,
        num_images_per_prompt,
        do_classifier_free_guidance,
        negative_prompt=None,
    ):
        batch_size = len(prompt) if isinstance(prompt, list) else 1
        # get prompt text embeddings
        text_inputs = self.tokenizer(
            prompt,
            padding="max_length",
            max_length=self.tokenizer.model_max_length,
            truncation=True,
            return_tensors="np",
        )
        text_input_ids = text_inputs.input_ids
        attention_mask = ms.Tensor(text_inputs.attention_mask)

        untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="np").input_ids

        if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not np.array_equal(
            text_input_ids, untruncated_ids
        ):
            removed_text = self.tokenizer.batch_decode(untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1])
            logger.warning(
                "The following part of your input was truncated because CLIP can only handle sequences up to"
                f" {self.tokenizer.model_max_length} tokens: {removed_text}"
            )
            text_input_ids = text_input_ids[:, : self.tokenizer.model_max_length]
            attention_mask = attention_mask[:, : self.tokenizer.model_max_length]

        text_encoder_output = self.text_encoder(ms.Tensor(text_input_ids), attention_mask=attention_mask)
        text_encoder_hidden_states = text_encoder_output[0]
        text_encoder_hidden_states = text_encoder_hidden_states.repeat_interleave(num_images_per_prompt, dim=0)

        uncond_text_encoder_hidden_states = None
        if do_classifier_free_guidance:
            uncond_tokens: List[str]
            if negative_prompt is None:
                uncond_tokens = [""] * batch_size
            elif type(prompt) is not type(negative_prompt):
                raise TypeError(
                    f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
                    f" {type(prompt)}."
                )
            elif isinstance(negative_prompt, str):
                uncond_tokens = [negative_prompt]
            elif batch_size != len(negative_prompt):
                raise ValueError(
                    f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
                    f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
                    " the batch size of `prompt`."
                )
            else:
                uncond_tokens = negative_prompt

            uncond_input = self.tokenizer(
                uncond_tokens,
                padding="max_length",
                max_length=self.tokenizer.model_max_length,
                truncation=True,
                return_tensors="np",
            )
            negative_prompt_embeds_text_encoder_output = self.text_encoder(
                ms.Tensor(uncond_input.input_ids), attention_mask=ms.Tensor(uncond_input.attention_mask)
            )

            uncond_text_encoder_hidden_states = negative_prompt_embeds_text_encoder_output[0]

            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
            seq_len = uncond_text_encoder_hidden_states.shape[1]
            uncond_text_encoder_hidden_states = uncond_text_encoder_hidden_states.tile((1, num_images_per_prompt, 1))
            uncond_text_encoder_hidden_states = uncond_text_encoder_hidden_states.view(
                batch_size * num_images_per_prompt, seq_len, -1
            )
            # done duplicates

            # For classifier free guidance, we need to do two forward passes.
            # Here we concatenate the unconditional and text embeddings into a single batch
            # to avoid doing two forward passes
        return text_encoder_hidden_states, uncond_text_encoder_hidden_states

    @property
    def guidance_scale(self):
        return self._guidance_scale

    @property
    def do_classifier_free_guidance(self):
        return self._guidance_scale > 1

    @property
    def num_timesteps(self):
        return self._num_timesteps

    def __call__(
        self,
        image_embeddings: Union[ms.Tensor, List[ms.Tensor]],
        prompt: Union[str, List[str]] = None,
        num_inference_steps: int = 12,
        timesteps: Optional[List[float]] = None,
        guidance_scale: float = 0.0,
        negative_prompt: Optional[Union[str, List[str]]] = None,
        num_images_per_prompt: int = 1,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        latents: Optional[ms.Tensor] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
        callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
        **kwargs,
    ):
        """
        Function invoked when calling the pipeline for generation.

        Args:
            image_embeddings (`ms.Tensor` or `List[ms.Tensor]`):
                Image embeddings, either extracted from an image or generated by a prior model.
            prompt (`str` or `List[str]`):
                The prompt or prompts to guide the image generation.
            num_inference_steps (`int`, *optional*, defaults to 12):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            timesteps (`List[int]`, *optional*):
                Custom timesteps to use for the denoising process. If not defined, equal spaced `num_inference_steps`
                timesteps are used. Must be in descending order.
            guidance_scale (`float`, *optional*, defaults to 0.0):
                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
                `decoder_guidance_scale` is defined as `w` of equation 2 of the [Imagen
                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting
                `decoder_guidance_scale > 1`. A higher guidance scale encourages the model to generate images
                that are closely linked to the text `prompt`, usually at the expense of lower image quality.
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
                if `decoder_guidance_scale` is less than `1`).
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
                One or a list of [np.random.Generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
                to make generation deterministic.
            latents (`ms.Tensor`, *optional*):
                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                tensor will be generated by sampling using the supplied random `generator`.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
                (`np.array`) or `"ms"` (`ms.Tensor`).
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
            callback_on_step_end (`Callable`, *optional*):
                A function that is called at the end of each denoising step during inference. The function is called
                with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
                `callback_on_step_end_tensor_inputs`.
            callback_on_step_end_tensor_inputs (`List`, *optional*):
                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
                `._callback_tensor_inputs` attribute of your pipeline class.

        Examples:

        Returns:
            [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~pipelines.ImagePipelineOutput`] if `return_dict` is True,
            otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.
        """

        callback = kwargs.pop("callback", None)
        callback_steps = kwargs.pop("callback_steps", None)

        if callback is not None:
            deprecate(
                "callback",
                "1.0.0",
                "Passing `callback` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
            )
        if callback_steps is not None:
            deprecate(
                "callback_steps",
                "1.0.0",
                "Passing `callback_steps` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
            )

        if callback_on_step_end_tensor_inputs is not None and not all(
            k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
        ):
            raise ValueError(
                f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, "
                f"but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
            )

        # 0. Define commonly used variables
        dtype = next(self.decoder.get_parameters()).dtype
        self._guidance_scale = guidance_scale

        # 1. Check inputs. Raise error if not correct
        if not isinstance(prompt, list):
            if isinstance(prompt, str):
                prompt = [prompt]
            else:
                raise TypeError(f"'prompt' must be of type 'list' or 'str', but got {type(prompt)}.")

        if self.do_classifier_free_guidance:
            if negative_prompt is not None and not isinstance(negative_prompt, list):
                if isinstance(negative_prompt, str):
                    negative_prompt = [negative_prompt]
                else:
                    raise TypeError(
                        f"'negative_prompt' must be of type 'list' or 'str', but got {type(negative_prompt)}."
                    )

        if isinstance(image_embeddings, list):
            image_embeddings = ops.cat(image_embeddings, axis=0)
        if isinstance(image_embeddings, np.ndarray):
            image_embeddings = ms.Tensor(image_embeddings).to(dtype=dtype)
        if not isinstance(image_embeddings, ms.Tensor):
            raise TypeError(
                f"'image_embeddings' must be of type 'ms.Tensor' or 'np.array', but got {type(image_embeddings)}."
            )

        if not isinstance(num_inference_steps, int):
            raise TypeError(
                f"'num_inference_steps' must be of type 'int', but got {type(num_inference_steps)}\
                           In case you want to provide explicit timesteps, please use the 'timesteps' argument."
            )

        # 2. Encode caption
        prompt_embeds, negative_prompt_embeds = self.encode_prompt(
            prompt,
            image_embeddings.shape[0] * num_images_per_prompt,
            self.do_classifier_free_guidance,
            negative_prompt,
        )
        text_encoder_hidden_states = (
            ops.cat([prompt_embeds, negative_prompt_embeds]) if negative_prompt_embeds is not None else prompt_embeds
        )
        effnet = (
            ops.cat([image_embeddings, ops.zeros_like(image_embeddings)])
            if self.do_classifier_free_guidance
            else image_embeddings
        )

        # 3. Determine latent shape of latents
        latent_height = int(image_embeddings.shape[2] * self.config.latent_dim_scale)
        latent_width = int(image_embeddings.shape[3] * self.config.latent_dim_scale)
        latent_features_shape = (image_embeddings.shape[0] * num_images_per_prompt, 4, latent_height, latent_width)

        # 4. Prepare and set timesteps
        if timesteps is not None:
            self.scheduler.set_timesteps(timesteps=timesteps)
            timesteps = self.scheduler.timesteps
            num_inference_steps = len(timesteps)
        else:
            self.scheduler.set_timesteps(num_inference_steps)
            timesteps = self.scheduler.timesteps

        # 5. Prepare latents
        latents = self.prepare_latents(latent_features_shape, dtype, generator, latents, self.scheduler)

        # 6. Run denoising loop
        self._num_timesteps = len(timesteps[:-1])
        for i, t in enumerate(self.progress_bar(timesteps[:-1])):
            ratio = t.broadcast_to((latents.shape[0],)).to(dtype)
            # 7. Denoise latents
            predicted_latents = self.decoder(
                ops.cat([latents] * 2) if self.do_classifier_free_guidance else latents,
                r=ops.cat([ratio] * 2) if self.do_classifier_free_guidance else ratio,
                effnet=effnet,
                clip=text_encoder_hidden_states,
            )

            # 8. Check for classifier free guidance and apply it
            if self.do_classifier_free_guidance:
                predicted_latents_text, predicted_latents_uncond = predicted_latents.chunk(2)
                predicted_latents = ops.lerp(
                    predicted_latents_uncond,
                    predicted_latents_text,
                    ms.tensor(self.guidance_scale, dtype=predicted_latents_text.dtype),
                )

            # 9. Renoise latents to next timestep
            latents = self.scheduler.step(
                model_output=predicted_latents,
                timestep=ratio,
                sample=latents,
                generator=generator,
            )[0]

            if callback_on_step_end is not None:
                callback_kwargs = {}
                for k in callback_on_step_end_tensor_inputs:
                    callback_kwargs[k] = locals()[k]
                callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                latents = callback_outputs.pop("latents", latents)
                image_embeddings = callback_outputs.pop("image_embeddings", image_embeddings)
                text_encoder_hidden_states = callback_outputs.pop(
                    "text_encoder_hidden_states", text_encoder_hidden_states
                )

            if callback is not None and i % callback_steps == 0:
                step_idx = i // getattr(self.scheduler, "order", 1)
                callback(step_idx, t, latents)

        if output_type not in ["ms", "np", "pil", "latent"]:
            raise ValueError(
                f"Only the output types `ms`, `np`, `pil` and `latent` are supported not output_type={output_type}"
            )

        if not output_type == "latent":
            # 10. Scale and decode the image latents with vq-vae
            latents = (self.vqgan.config.scale_factor * latents).to(latents.dtype)
            images = self.vqgan.decode(latents)[0].clamp(0, 1)
            if output_type == "np":
                images = images.permute((0, 2, 3, 1)).float().numpy()
            elif output_type == "pil":
                images = images.permute((0, 2, 3, 1)).float().numpy()
                images = self.numpy_to_pil(images)
        else:
            images = latents

        if not return_dict:
            return images
        return ImagePipelineOutput(images)

mindone.diffusers.WuerstchenDecoderPipeline.__call__(image_embeddings, prompt=None, num_inference_steps=12, timesteps=None, guidance_scale=0.0, negative_prompt=None, num_images_per_prompt=1, generator=None, latents=None, output_type='pil', return_dict=False, callback_on_step_end=None, callback_on_step_end_tensor_inputs=['latents'], **kwargs)

Function invoked when calling the pipeline for generation.

PARAMETER DESCRIPTION
image_embeddings

Image embeddings, either extracted from an image or generated by a prior model.

TYPE: `ms.Tensor` or `List[ms.Tensor]`

prompt

The prompt or prompts to guide the image generation.

TYPE: `str` or `List[str]` DEFAULT: None

num_inference_steps

The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

timesteps

Custom timesteps to use for the denoising process. If not defined, equally spaced num_inference_steps timesteps are used. Must be in descending order.

TYPE: `List[int]`, *optional* DEFAULT: None

guidance_scale

Guidance scale as defined in Classifier-Free Diffusion Guidance. decoder_guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance scale is enabled by setting decoder_guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

negative_prompt

The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if decoder_guidance_scale is less than 1).

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

num_images_per_prompt

The number of images to generate per prompt.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

generator

One or a list of np.random.Generator(s) to make generation deterministic.

TYPE: `np.random.Generator` or `List[np.random.Generator]`, *optional* DEFAULT: None

latents

Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

output_type

The output format of the generated image. Choose between: "pil" (PIL.Image.Image), "np" (np.array) or "ms" (ms.Tensor).

TYPE: `str`, *optional*, defaults to `"pil"` DEFAULT: 'pil'

return_dict

Whether or not to return a [~pipelines.ImagePipelineOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

callback_on_step_end

A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.

TYPE: `Callable`, *optional* DEFAULT: None

callback_on_step_end_tensor_inputs

The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

TYPE: `List`, *optional* DEFAULT: ['latents']

RETURNS DESCRIPTION

[~pipelines.ImagePipelineOutput] or tuple: [~pipelines.ImagePipelineOutput] if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.
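
To tie the two stages together, a minimal end-to-end sketch. The checkpoint ids warp-ai/wuerstchen-prior and warp-ai/wuerstchen are assumptions from the upstream release; the call pattern follows the signatures documented on this page:

from mindone.diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline

# Stage C: text prompt -> compact image embeddings (assumed checkpoint id).
prior_pipe = WuerstchenPriorPipeline.from_pretrained("warp-ai/wuerstchen-prior")
# Stages B + A: image embeddings -> decoded image (assumed checkpoint id).
decoder_pipe = WuerstchenDecoderPipeline.from_pretrained("warp-ai/wuerstchen")

prompt = "an image of a shiba inu, donning a spacesuit and helmet"
image_embeddings = prior_pipe(prompt=prompt, guidance_scale=8.0)[0]

# With return_dict=False (the default) the decoder returns the images directly.
images = decoder_pipe(
    image_embeddings=image_embeddings,
    prompt=prompt,
    guidance_scale=0.0,  # decoder guidance is typically disabled
    output_type="pil",
)
images[0].save("shiba_inu.png")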

Source code in mindone/diffusers/pipelines/wuerstchen/pipeline_wuerstchen.py
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
def __call__(
    self,
    image_embeddings: Union[ms.Tensor, List[ms.Tensor]],
    prompt: Union[str, List[str]] = None,
    num_inference_steps: int = 12,
    timesteps: Optional[List[float]] = None,
    guidance_scale: float = 0.0,
    negative_prompt: Optional[Union[str, List[str]]] = None,
    num_images_per_prompt: int = 1,
    generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
    latents: Optional[ms.Tensor] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
    callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    **kwargs,
):
    """
    Function invoked when calling the pipeline for generation.

    Args:
        image_embedding (`ms.Tensor` or `List[ms.Tensor]`):
            Image Embeddings either extracted from an image or generated by a Prior Model.
        prompt (`str` or `List[str]`):
            The prompt or prompts to guide the image generation.
        num_inference_steps (`int`, *optional*, defaults to 12):
            The number of denoising steps. More denoising steps usually lead to a higher quality image at the
            expense of slower inference.
        timesteps (`List[int]`, *optional*):
            Custom timesteps to use for the denoising process. If not defined, equal spaced `num_inference_steps`
            timesteps are used. Must be in descending order.
        guidance_scale (`float`, *optional*, defaults to 0.0):
            Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
            `decoder_guidance_scale` is defined as `w` of equation 2. of [Imagen
            Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting
            `decoder_guidance_scale > 1`. Higher guidance scale encourages to generate images that are closely
            linked to the text `prompt`, usually at the expense of lower image quality.
        negative_prompt (`str` or `List[str]`, *optional*):
            The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
            if `decoder_guidance_scale` is less than `1`).
        num_images_per_prompt (`int`, *optional*, defaults to 1):
            The number of images to generate per prompt.
        generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
            One or a list of [np.random.Generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
            to make generation deterministic.
        latents (`ms.Tensor`, *optional*):
            Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
            generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
            tensor will ge generated by sampling using the supplied random `generator`.
        output_type (`str`, *optional*, defaults to `"pil"`):
            The output format of the generate image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
            (`np.array`) or `"ms"` (`ms.Tensor`).
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
        callback_on_step_end (`Callable`, *optional*):
            A function that calls at the end of each denoising steps during the inference. The function is called
            with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
            callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
            `callback_on_step_end_tensor_inputs`.
        callback_on_step_end_tensor_inputs (`List`, *optional*):
            The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
            will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
            `._callback_tensor_inputs` attribute of your pipeline class.

    Examples:

    Returns:
        [`~pipelines.ImagePipelineOutput`] or `tuple` [`~pipelines.ImagePipelineOutput`] if `return_dict` is True,
        otherwise a `tuple`. When returning a tuple, the first element is a list with the generated image
        embeddings.
    """

    callback = kwargs.pop("callback", None)
    callback_steps = kwargs.pop("callback_steps", None)

    if callback is not None:
        deprecate(
            "callback",
            "1.0.0",
            "Passing `callback` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
        )
    if callback_steps is not None:
        deprecate(
            "callback_steps",
            "1.0.0",
            "Passing `callback_steps` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
        )

    if callback_on_step_end_tensor_inputs is not None and not all(
        k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
    ):
        raise ValueError(
            f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, "
            f"but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
        )

    # 0. Define commonly used variables
    dtype = next(self.decoder.get_parameters()).dtype
    self._guidance_scale = guidance_scale

    # 1. Check inputs. Raise error if not correct
    if not isinstance(prompt, list):
        if isinstance(prompt, str):
            prompt = [prompt]
        else:
            raise TypeError(f"'prompt' must be of type 'list' or 'str', but got {type(prompt)}.")

    if self.do_classifier_free_guidance:
        if negative_prompt is not None and not isinstance(negative_prompt, list):
            if isinstance(negative_prompt, str):
                negative_prompt = [negative_prompt]
            else:
                raise TypeError(
                    f"'negative_prompt' must be of type 'list' or 'str', but got {type(negative_prompt)}."
                )

    if isinstance(image_embeddings, list):
        image_embeddings = ops.cat(image_embeddings, axis=0)
    if isinstance(image_embeddings, np.ndarray):
        image_embeddings = ms.Tensor(image_embeddings).to(dtype=dtype)
    if not isinstance(image_embeddings, ms.Tensor):
        raise TypeError(
            f"'image_embeddings' must be of type 'ms.Tensor' or 'np.array', but got {type(image_embeddings)}."
        )

    if not isinstance(num_inference_steps, int):
        raise TypeError(
            f"'num_inference_steps' must be of type 'int', but got {type(num_inference_steps)}\
                       In Case you want to provide explicit timesteps, please use the 'timesteps' argument."
        )

    # 2. Encode caption
    prompt_embeds, negative_prompt_embeds = self.encode_prompt(
        prompt,
        image_embeddings.shape[0] * num_images_per_prompt,
        self.do_classifier_free_guidance,
        negative_prompt,
    )
    text_encoder_hidden_states = (
        ops.cat([prompt_embeds, negative_prompt_embeds]) if negative_prompt_embeds is not None else prompt_embeds
    )
    effnet = (
        ops.cat([image_embeddings, ops.zeros_like(image_embeddings)])
        if self.do_classifier_free_guidance
        else image_embeddings
    )

    # 3. Determine the shape of the latents
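    # Each spatial side scales by `config.latent_dim_scale` relative to the Stage C
    # image-embedding grid; the batch is repeated `num_images_per_prompt` times.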
    latent_height = int(image_embeddings.shape[2] * self.config.latent_dim_scale)
    latent_width = int(image_embeddings.shape[3] * self.config.latent_dim_scale)
    latent_features_shape = (image_embeddings.shape[0] * num_images_per_prompt, 4, latent_height, latent_width)

    # 4. Prepare and set timesteps
    if timesteps is not None:
        self.scheduler.set_timesteps(timesteps=timesteps)
        timesteps = self.scheduler.timesteps
        num_inference_steps = len(timesteps)
    else:
        self.scheduler.set_timesteps(num_inference_steps)
        timesteps = self.scheduler.timesteps

    # 5. Prepare latents
    latents = self.prepare_latents(latent_features_shape, dtype, generator, latents, self.scheduler)

    # 6. Run denoising loop
    self._num_timesteps = len(timesteps[:-1])
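    # `timesteps[:-1]` skips the schedule's terminal entry; each `scheduler.step`
    # advances `latents` one timestep further along the schedule.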
    for i, t in enumerate(self.progress_bar(timesteps[:-1])):
        ratio = t.broadcast_to((latents.shape[0],)).to(dtype)
        # 7. Denoise latents
        predicted_latents = self.decoder(
            ops.cat([latents] * 2) if self.do_classifier_free_guidance else latents,
            r=ops.cat([ratio] * 2) if self.do_classifier_free_guidance else ratio,
            effnet=effnet,
            clip=text_encoder_hidden_states,
        )

        # 8. Check for classifier free guidance and apply it
        if self.do_classifier_free_guidance:
            predicted_latents_text, predicted_latents_uncond = predicted_latents.chunk(2)
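            # ops.lerp(a, b, w) = a + w * (b - a), so this computes
            # uncond + guidance_scale * (text - uncond): classifier-free guidance.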
            predicted_latents = ops.lerp(
                predicted_latents_uncond,
                predicted_latents_text,
                ms.tensor(self.guidance_scale, dtype=predicted_latents_text.dtype),
            )

        # 9. Renoise latents to next timestep
        latents = self.scheduler.step(
            model_output=predicted_latents,
            timestep=ratio,
            sample=latents,
            generator=generator,
        )[0]

        if callback_on_step_end is not None:
            callback_kwargs = {}
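            # `locals()` looks up the loop-local tensors (e.g. `latents`) requested
            # by name in `callback_on_step_end_tensor_inputs`.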
            for k in callback_on_step_end_tensor_inputs:
                callback_kwargs[k] = locals()[k]
            callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

            latents = callback_outputs.pop("latents", latents)
            image_embeddings = callback_outputs.pop("image_embeddings", image_embeddings)
            text_encoder_hidden_states = callback_outputs.pop(
                "text_encoder_hidden_states", text_encoder_hidden_states
            )

        if callback is not None and i % callback_steps == 0:
            step_idx = i // getattr(self.scheduler, "order", 1)
            callback(step_idx, t, latents)

    if output_type not in ["ms", "np", "pil", "latent"]:
        raise ValueError(
            f"Only the output types `ms`, `np`, `pil` and `latent` are supported not output_type={output_type}"
        )

    if output_type != "latent":
        # 10. Scale and decode the image latents with vq-vae
        latents = (self.vqgan.config.scale_factor * latents).to(latents.dtype)
        images = self.vqgan.decode(latents)[0].clamp(0, 1)
        if output_type == "np":
            images = images.permute((0, 2, 3, 1)).float().numpy()
        elif output_type == "pil":
            images = images.permute((0, 2, 3, 1)).float().numpy()
            images = self.numpy_to_pil(images)
    else:
        images = latents

    if not return_dict:
        return images
    return ImagePipelineOutput(images)
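
For reference, here is a minimal end-to-end sketch of driving this decoder (Stage B) pipeline from the prior (Stage C) pipeline. It assumes the mindone.diffusers port with the warp-ai/wuerstchen-prior and warp-ai/wuerstchen checkpoints and tuple-style outputs (return_dict=False); exact argument names and defaults may differ in your installation.

import mindspore as ms
from mindone.diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline

# Stage C (prior): text prompt -> highly compressed image embeddings.
prior_pipeline = WuerstchenPriorPipeline.from_pretrained(
    "warp-ai/wuerstchen-prior", mindspore_dtype=ms.float16
)
# Stage B (decoder): image embeddings -> VQGAN latents -> images.
decoder_pipeline = WuerstchenDecoderPipeline.from_pretrained(
    "warp-ai/wuerstchen", mindspore_dtype=ms.float16
)

prompt = "Anthropomorphic cat dressed as a firefighter"
# The first element of the prior's output is the image embeddings.
image_embeddings = prior_pipeline(prompt=prompt, height=1024, width=1024)[0]
# With return_dict=False the decoder returns the image list directly (see above).
images = decoder_pipeline(
    image_embeddings=image_embeddings,
    prompt=prompt,
    output_type="pil",
    return_dict=False,
)
images[0].save("cat_firefighter.png")

And a small sketch of the callback_on_step_end hook documented above; inspect_latents is a hypothetical name, and "latents" must appear in the pipeline's _callback_tensor_inputs:

def inspect_latents(pipe, step, timestep, callback_kwargs):
    # Called once per denoising step with the tensors requested below.
    latents = callback_kwargs["latents"]
    print(f"step {step}: latents shape {tuple(latents.shape)}")
    return callback_kwargs

images = decoder_pipeline(
    image_embeddings=image_embeddings,
    prompt=prompt,
    callback_on_step_end=inspect_latents,
    callback_on_step_end_tensor_inputs=["latents"],
    return_dict=False,
)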

Citation

@misc{pernias2023wuerstchen,
    title={Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models},
    author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher J. Pal and Marc Aubreville},
    year={2023},
    eprint={2306.00637},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}