
unCLIP

Hierarchical Text-Conditional Image Generation with CLIP Latents is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. The unCLIP model in 🤗 Diffusers comes from kakaobrain's karlo.

The abstract from the paper is:

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.

You can find lucidrains' DALL-E 2 recreation at lucidrains/DALLE2-pytorch.
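
A minimal text-to-image sketch is shown below. It assumes the karlo checkpoint `kakaobrain/karlo-v1-alpha` and that loading in `mindone.diffusers` mirrors 🤗 Diffusers; adjust the checkpoint name or path to your setup.

```python
from mindone.diffusers import UnCLIPPipeline

# Checkpoint name is an assumption (the standard karlo weights); swap in a local path if needed.
pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha")

prompt = "a high-quality photo of a corgi wearing sunglasses"

# `return_dict` defaults to False in this pipeline, so the call returns a tuple whose
# first element is the list of generated PIL images.
images = pipe(prompt, prior_guidance_scale=4.0, decoder_guidance_scale=8.0)[0]
images[0].save("unclip_sample.png")
```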

Tip

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

mindone.diffusers.UnCLIPPipeline

Bases: DiffusionPipeline

Pipeline for text-to-image generation using unCLIP.

This model inherits from [DiffusionPipeline]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

PARAMETER DESCRIPTION
text_encoder

Frozen text-encoder.

TYPE: [`~transformers.CLIPTextModelWithProjection`]

tokenizer

A CLIPTokenizer to tokenize text.

TYPE: [`~transformers.CLIPTokenizer`]

prior

The canonical unCLIP prior to approximate the image embedding from the text embedding.

TYPE: [`PriorTransformer`]

text_proj

Utility class to prepare and combine the embeddings before they are passed to the decoder.

TYPE: [`UnCLIPTextProjModel`]

decoder

The decoder to invert the image embedding into an image.

TYPE: [`UNet2DConditionModel`]

super_res_first

Super resolution UNet. Used in all but the last step of the super resolution diffusion process.

TYPE: [`UNet2DModel`]

super_res_last

Super resolution UNet. Used in the last step of the super resolution diffusion process.

TYPE: [`UNet2DModel`]

prior_scheduler

Scheduler used in the prior denoising process (a modified [DDPMScheduler]).

TYPE: [`UnCLIPScheduler`]

decoder_scheduler

Scheduler used in the decoder denoising process (a modified [DDPMScheduler]).

TYPE: [`UnCLIPScheduler`]

super_res_scheduler

Scheduler used in the super resolution denoising process (a modified [DDPMScheduler]).

TYPE: [`UnCLIPScheduler`]
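
The components listed above are registered on the pipeline and exposed as attributes once it is loaded. The sketch below (checkpoint name assumed, loading API assumed to mirror 🤗 Diffusers) shows how to inspect them or reuse them elsewhere.

```python
from mindone.diffusers import UnCLIPPipeline

pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha")  # checkpoint name assumed

# Each constructor argument documented above is available as an attribute of the loaded pipeline.
print(type(pipe.prior).__name__)            # PriorTransformer
print(type(pipe.decoder).__name__)          # UNet2DConditionModel
print(type(pipe.prior_scheduler).__name__)  # UnCLIPScheduler

# If `components` mirrors 🤗 Diffusers, the same modules can seed another pipeline instance:
# other_pipe = UnCLIPPipeline(**pipe.components)
```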

Source code in mindone/diffusers/pipelines/unclip/pipeline_unclip.py
class UnCLIPPipeline(DiffusionPipeline):
    """
    Pipeline for text-to-image generation using unCLIP.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        text_encoder ([`~transformers.CLIPTextModelWithProjection`]):
            Frozen text-encoder.
        tokenizer ([`~transformers.CLIPTokenizer`]):
            A `CLIPTokenizer` to tokenize text.
        prior ([`PriorTransformer`]):
            The canonical unCLIP prior to approximate the image embedding from the text embedding.
        text_proj ([`UnCLIPTextProjModel`]):
            Utility class to prepare and combine the embeddings before they are passed to the decoder.
        decoder ([`UNet2DConditionModel`]):
            The decoder to invert the image embedding into an image.
        super_res_first ([`UNet2DModel`]):
            Super resolution UNet. Used in all but the last step of the super resolution diffusion process.
        super_res_last ([`UNet2DModel`]):
            Super resolution UNet. Used in the last step of the super resolution diffusion process.
        prior_scheduler ([`UnCLIPScheduler`]):
            Scheduler used in the prior denoising process (a modified [`DDPMScheduler`]).
        decoder_scheduler ([`UnCLIPScheduler`]):
            Scheduler used in the decoder denoising process (a modified [`DDPMScheduler`]).
        super_res_scheduler ([`UnCLIPScheduler`]):
            Scheduler used in the super resolution denoising process (a modified [`DDPMScheduler`]).

    """

    _exclude_from_cpu_offload = ["prior"]

    prior: PriorTransformer
    decoder: UNet2DConditionModel
    text_proj: UnCLIPTextProjModel
    text_encoder: CLIPTextModelWithProjection
    tokenizer: CLIPTokenizer
    super_res_first: UNet2DModel
    super_res_last: UNet2DModel

    prior_scheduler: UnCLIPScheduler
    decoder_scheduler: UnCLIPScheduler
    super_res_scheduler: UnCLIPScheduler

    model_cpu_offload_seq = "text_encoder->text_proj->decoder->super_res_first->super_res_last"

    def __init__(
        self,
        prior: PriorTransformer,
        decoder: UNet2DConditionModel,
        text_encoder: CLIPTextModelWithProjection,
        tokenizer: CLIPTokenizer,
        text_proj: UnCLIPTextProjModel,
        super_res_first: UNet2DModel,
        super_res_last: UNet2DModel,
        prior_scheduler: UnCLIPScheduler,
        decoder_scheduler: UnCLIPScheduler,
        super_res_scheduler: UnCLIPScheduler,
    ):
        super().__init__()

        self.register_modules(
            prior=prior,
            decoder=decoder,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            text_proj=text_proj,
            super_res_first=super_res_first,
            super_res_last=super_res_last,
            prior_scheduler=prior_scheduler,
            decoder_scheduler=decoder_scheduler,
            super_res_scheduler=super_res_scheduler,
        )

    def prepare_latents(self, shape, dtype, generator, latents, scheduler):
        if latents is None:
            latents = randn_tensor(shape, generator=generator, dtype=dtype)
        else:
            if latents.shape != shape:
                raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")

        latents = (latents * scheduler.init_noise_sigma).to(dtype)
        return latents

    def _encode_prompt(
        self,
        prompt,
        num_images_per_prompt,
        do_classifier_free_guidance,
        text_model_output: Optional[Union[CLIPTextModelOutput, Tuple]] = None,
        text_attention_mask: Optional[ms.Tensor] = None,
    ):
        if text_model_output is None:
            batch_size = len(prompt) if isinstance(prompt, list) else 1
            # get prompt text embeddings
            text_inputs = self.tokenizer(
                prompt,
                padding="max_length",
                max_length=self.tokenizer.model_max_length,
                truncation=True,
                return_tensors="np",
            )
            text_input_ids = text_inputs.input_ids
            text_mask = ms.Tensor.from_numpy(text_inputs.attention_mask)  # MindSpore mask does not require bool()

            untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="np").input_ids

            if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not np.array_equal(
                text_input_ids, untruncated_ids
            ):
                removed_text = self.tokenizer.batch_decode(untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1])
                logger.warning(
                    "The following part of your input was truncated because CLIP can only handle sequences up to"
                    f" {self.tokenizer.model_max_length} tokens: {removed_text}"
                )
                text_input_ids = text_input_ids[:, : self.tokenizer.model_max_length]

            text_encoder_output = self.text_encoder(ms.Tensor(text_input_ids))

            prompt_embeds = text_encoder_output[0]
            text_enc_hid_states = text_encoder_output[1]

        else:
            batch_size = text_model_output[0].shape[0]
            prompt_embeds, text_enc_hid_states = text_model_output[0], text_model_output[1]
            text_mask = text_attention_mask

        prompt_embeds = prompt_embeds.repeat_interleave(num_images_per_prompt, dim=0)
        text_enc_hid_states = text_enc_hid_states.repeat_interleave(num_images_per_prompt, dim=0)
        text_mask = text_mask.repeat_interleave(num_images_per_prompt, dim=0)

        if do_classifier_free_guidance:
            uncond_tokens = [""] * batch_size

            uncond_input = self.tokenizer(
                uncond_tokens,
                padding="max_length",
                max_length=self.tokenizer.model_max_length,
                truncation=True,
                return_tensors="np",
            )
            uncond_text_mask = ms.Tensor.from_numpy(
                uncond_input.attention_mask
            )  # MindSpore mask does not require bool()
            negative_prompt_embeds_text_encoder_output = self.text_encoder(ms.Tensor.from_numpy(uncond_input.input_ids))

            negative_prompt_embeds = negative_prompt_embeds_text_encoder_output[0]
            uncond_text_enc_hid_states = negative_prompt_embeds_text_encoder_output[1]

            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method

            seq_len = negative_prompt_embeds.shape[1]
            negative_prompt_embeds = negative_prompt_embeds.tile((1, num_images_per_prompt))
            negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len)

            seq_len = uncond_text_enc_hid_states.shape[1]
            uncond_text_enc_hid_states = uncond_text_enc_hid_states.tile((1, num_images_per_prompt, 1))
            uncond_text_enc_hid_states = uncond_text_enc_hid_states.view(
                batch_size * num_images_per_prompt, seq_len, -1
            )
            uncond_text_mask = uncond_text_mask.repeat_interleave(num_images_per_prompt, dim=0)

            # done duplicates

            # For classifier free guidance, we need to do two forward passes.
            # Here we concatenate the unconditional and text embeddings into a single batch
            # to avoid doing two forward passes
            prompt_embeds = ops.cat([negative_prompt_embeds, prompt_embeds])
            text_enc_hid_states = ops.cat([uncond_text_enc_hid_states, text_enc_hid_states])

            text_mask = ops.cat([uncond_text_mask, text_mask])

        return prompt_embeds, text_enc_hid_states, text_mask

    def __call__(
        self,
        prompt: Optional[Union[str, List[str]]] = None,
        num_images_per_prompt: int = 1,
        prior_num_inference_steps: int = 25,
        decoder_num_inference_steps: int = 25,
        super_res_num_inference_steps: int = 7,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        prior_latents: Optional[ms.Tensor] = None,
        decoder_latents: Optional[ms.Tensor] = None,
        super_res_latents: Optional[ms.Tensor] = None,
        text_model_output: Optional[Union[CLIPTextModelOutput, Tuple]] = None,
        text_attention_mask: Optional[ms.Tensor] = None,
        prior_guidance_scale: float = 4.0,
        decoder_guidance_scale: float = 8.0,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
    ):
        """
        The call function to the pipeline for generation.

        Args:
            prompt (`str` or `List[str]`):
                The prompt or prompts to guide image generation. This can only be left undefined if `text_model_output`
                and `text_attention_mask` are passed.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            prior_num_inference_steps (`int`, *optional*, defaults to 25):
                The number of denoising steps for the prior. More denoising steps usually lead to a higher quality
                image at the expense of slower inference.
            decoder_num_inference_steps (`int`, *optional*, defaults to 25):
                The number of denoising steps for the decoder. More denoising steps usually lead to a higher quality
                image at the expense of slower inference.
            super_res_num_inference_steps (`int`, *optional*, defaults to 7):
                The number of denoising steps for super resolution. More denoising steps usually lead to a higher
                quality image at the expense of slower inference.
            generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
                A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make
                generation deterministic.
            prior_latents (`ms.Tensor` of shape (batch size, embeddings dimension), *optional*):
                Pre-generated noisy latents to be used as inputs for the prior.
            decoder_latents (`ms.Tensor` of shape (batch size, channels, height, width), *optional*):
                Pre-generated noisy latents to be used as inputs for the decoder.
            super_res_latents (`ms.Tensor` of shape (batch size, channels, super res height, super res width), *optional*):
                Pre-generated noisy latents to be used as inputs for the super resolution.
            prior_guidance_scale (`float`, *optional*, defaults to 4.0):
                A higher guidance scale value encourages the model to generate images closely linked to the text
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            decoder_guidance_scale (`float`, *optional*, defaults to 8.0):
                A higher guidance scale value encourages the model to generate images closely linked to the text
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            text_model_output (`CLIPTextModelOutput`, *optional*):
                Pre-defined [`CLIPTextModel`] outputs that can be derived from the text encoder. Pre-defined text
                outputs can be passed for tasks like text embedding interpolations. Make sure to also pass
                `text_attention_mask` in this case. `prompt` can then be left `None`.
            text_attention_mask (`ms.Tensor`, *optional*):
                Pre-defined CLIP text attention mask that can be derived from the tokenizer. Pre-defined text attention
                masks are necessary when passing `text_model_output`.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between `PIL.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.

        Returns:
            [`~pipelines.ImagePipelineOutput`] or `tuple`:
                If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
                returned where the first element is a list with the generated images.
        """
        if prompt is not None:
            if isinstance(prompt, str):
                batch_size = 1
            elif isinstance(prompt, list):
                batch_size = len(prompt)
            else:
                raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
        else:
            batch_size = text_model_output[0].shape[0]

        batch_size = batch_size * num_images_per_prompt

        do_classifier_free_guidance = prior_guidance_scale > 1.0 or decoder_guidance_scale > 1.0

        prompt_embeds, text_enc_hid_states, text_mask = self._encode_prompt(
            prompt, num_images_per_prompt, do_classifier_free_guidance, text_model_output, text_attention_mask
        )

        # prior

        self.prior_scheduler.set_timesteps(prior_num_inference_steps)
        prior_timesteps_tensor = self.prior_scheduler.timesteps

        embedding_dim = self.prior.config.embedding_dim

        prior_latents = self.prepare_latents(
            (batch_size, embedding_dim),
            prompt_embeds.dtype,
            generator,
            prior_latents,
            self.prior_scheduler,
        )

        for i, t in enumerate(self.progress_bar(prior_timesteps_tensor)):
            # expand the latents if we are doing classifier free guidance
            latent_model_input = ops.cat([prior_latents] * 2) if do_classifier_free_guidance else prior_latents

            predicted_image_embedding = self.prior(
                latent_model_input,
                timestep=t,
                proj_embedding=prompt_embeds,
                encoder_hidden_states=text_enc_hid_states,
                attention_mask=text_mask,
            )[0]

            if do_classifier_free_guidance:
                predicted_image_embedding_uncond, predicted_image_embedding_text = predicted_image_embedding.chunk(2)
                predicted_image_embedding = predicted_image_embedding_uncond + prior_guidance_scale * (
                    predicted_image_embedding_text - predicted_image_embedding_uncond
                )

            if i + 1 == prior_timesteps_tensor.shape[0]:
                prev_timestep = None
            else:
                prev_timestep = prior_timesteps_tensor[i + 1]

            prior_latents = self.prior_scheduler.step(
                predicted_image_embedding,
                timestep=t,
                sample=prior_latents,
                generator=generator,
                prev_timestep=prev_timestep,
            )[0]

        prior_latents = self.prior.post_process_latents(prior_latents)

        image_embeddings = prior_latents

        # done prior

        # decoder

        text_enc_hid_states, additive_clip_time_embeddings = self.text_proj(
            image_embeddings=image_embeddings,
            prompt_embeds=prompt_embeds,
            text_encoder_hidden_states=text_enc_hid_states,
            do_classifier_free_guidance=do_classifier_free_guidance,
        )

        decoder_text_mask = ops.pad(text_mask, (self.text_proj.clip_extra_context_tokens, 0), value=1.0)

        self.decoder_scheduler.set_timesteps(decoder_num_inference_steps)
        decoder_timesteps_tensor = self.decoder_scheduler.timesteps

        num_channels_latents = self.decoder.config.in_channels
        height = self.decoder.config.sample_size
        width = self.decoder.config.sample_size

        decoder_latents = self.prepare_latents(
            (batch_size, num_channels_latents, height, width),
            text_enc_hid_states.dtype,
            generator,
            decoder_latents,
            self.decoder_scheduler,
        )

        for i, t in enumerate(self.progress_bar(decoder_timesteps_tensor)):
            # expand the latents if we are doing classifier free guidance
            latent_model_input = ops.cat([decoder_latents] * 2) if do_classifier_free_guidance else decoder_latents

            noise_pred = self.decoder(
                sample=latent_model_input,
                timestep=t,
                encoder_hidden_states=text_enc_hid_states,
                class_labels=additive_clip_time_embeddings,
                attention_mask=decoder_text_mask,
            )[0]

            if do_classifier_free_guidance:
                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                noise_pred_uncond, _ = noise_pred_uncond.split(latent_model_input.shape[1], axis=1)
                noise_pred_text, predicted_variance = noise_pred_text.split(latent_model_input.shape[1], axis=1)
                noise_pred = noise_pred_uncond + decoder_guidance_scale * (noise_pred_text - noise_pred_uncond)
                noise_pred = ops.cat([noise_pred, predicted_variance], axis=1)

            if i + 1 == decoder_timesteps_tensor.shape[0]:
                prev_timestep = None
            else:
                prev_timestep = decoder_timesteps_tensor[i + 1]

            # compute the previous noisy sample x_t -> x_t-1
            decoder_latents = self.decoder_scheduler.step(
                noise_pred, t, decoder_latents, prev_timestep=prev_timestep, generator=generator
            )[0]

        decoder_latents = decoder_latents.clamp(-1, 1)

        image_small = decoder_latents

        # done decoder

        # super res

        self.super_res_scheduler.set_timesteps(super_res_num_inference_steps)
        super_res_timesteps_tensor = self.super_res_scheduler.timesteps

        channels = self.super_res_first.config.in_channels // 2
        height = self.super_res_first.config.sample_size
        width = self.super_res_first.config.sample_size

        super_res_latents = self.prepare_latents(
            (batch_size, channels, height, width),
            image_small.dtype,
            generator,
            super_res_latents,
            self.super_res_scheduler,
        )

        interpolate_antialias = {}
        if "antialias" in inspect.signature(ops.interpolate).parameters:
            interpolate_antialias["antialias"] = True

        image_upscaled = ops.interpolate(
            image_small, size=[height, width], mode="bicubic", align_corners=False, **interpolate_antialias
        )

        for i, t in enumerate(self.progress_bar(super_res_timesteps_tensor)):
            # no classifier free guidance

            if i == super_res_timesteps_tensor.shape[0] - 1:
                unet = self.super_res_last
            else:
                unet = self.super_res_first

            latent_model_input = ops.cat([super_res_latents, image_upscaled], axis=1)

            noise_pred = unet(
                sample=latent_model_input,
                timestep=t,
            )[0]

            if i + 1 == super_res_timesteps_tensor.shape[0]:
                prev_timestep = None
            else:
                prev_timestep = super_res_timesteps_tensor[i + 1]

            # compute the previous noisy sample x_t -> x_t-1
            super_res_latents = self.super_res_scheduler.step(
                noise_pred, t, super_res_latents, prev_timestep=prev_timestep, generator=generator
            )[0]

        image = super_res_latents
        # done super res

        # post processing
        image = image * 0.5 + 0.5
        image = image.clamp(0, 1)
        image = image.permute(0, 2, 3, 1).float().numpy()

        if output_type == "pil":
            image = self.numpy_to_pil(image)

        if not return_dict:
            return (image,)

        return ImagePipelineOutput(images=image)

mindone.diffusers.UnCLIPPipeline.__call__(prompt=None, num_images_per_prompt=1, prior_num_inference_steps=25, decoder_num_inference_steps=25, super_res_num_inference_steps=7, generator=None, prior_latents=None, decoder_latents=None, super_res_latents=None, text_model_output=None, text_attention_mask=None, prior_guidance_scale=4.0, decoder_guidance_scale=8.0, output_type='pil', return_dict=False)

The call function to the pipeline for generation.

PARAMETER DESCRIPTION
prompt

The prompt or prompts to guide image generation. This can only be left undefined if text_model_output and text_attention_mask are passed.

TYPE: `str` or `List[str]` DEFAULT: None

num_images_per_prompt

The number of images to generate per prompt.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

prior_num_inference_steps

The number of denoising steps for the prior. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 25 DEFAULT: 25

decoder_num_inference_steps

The number of denoising steps for the decoder. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 25 DEFAULT: 25

super_res_num_inference_steps

The number of denoising steps for super resolution. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 7 DEFAULT: 7

generator

A np.random.Generator to make generation deterministic.

TYPE: `np.random.Generator` or `List[np.random.Generator]`, *optional* DEFAULT: None

prior_latents

Pre-generated noisy latents to be used as inputs for the prior.

TYPE: `ms.Tensor` of shape (batch size, embeddings dimension), *optional* DEFAULT: None

decoder_latents

Pre-generated noisy latents to be used as inputs for the decoder.

TYPE: `ms.Tensor` of shape (batch size, channels, height, width), *optional* DEFAULT: None

super_res_latents

Pre-generated noisy latents to be used as inputs for the super resolution.

TYPE: `ms.Tensor` of shape (batch size, channels, super res height, super res width), *optional* DEFAULT: None

prior_guidance_scale

A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

TYPE: `float`, *optional*, defaults to 4.0 DEFAULT: 4.0

decoder_guidance_scale

A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

TYPE: `float`, *optional*, defaults to 8.0 DEFAULT: 8.0

text_model_output

Pre-defined [CLIPTextModel] outputs that can be derived from the text encoder. Pre-defined text outputs can be passed for tasks like text embedding interpolations. Make sure to also pass text_attention_mask in this case. prompt can then be left None.

TYPE: `CLIPTextModelOutput`, *optional* DEFAULT: None

text_attention_mask

Pre-defined CLIP text attention mask that can be derived from the tokenizer. Pre-defined text attention masks are necessary when passing text_model_output.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

output_type

The output format of the generated image. Choose between PIL.Image or np.array.

TYPE: `str`, *optional*, defaults to `"pil"` DEFAULT: 'pil'

return_dict

Whether or not to return a [~pipelines.ImagePipelineOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

RETURNS DESCRIPTION

[~pipelines.ImagePipelineOutput] or tuple: If return_dict is True, [~pipelines.ImagePipelineOutput] is returned, otherwise a tuple is returned where the first element is a list with the generated images.
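
As an illustrative sketch of these arguments (checkpoint name assumed, loading API assumed to mirror 🤗 Diffusers), the call below sets the step counts and guidance scales explicitly and requests an `ImagePipelineOutput`:

```python
import numpy as np
from mindone.diffusers import UnCLIPPipeline

pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha")  # checkpoint name assumed

# The pipeline takes a NumPy generator (not a torch.Generator) for reproducibility.
generator = np.random.default_rng(0)

result = pipe(
    prompt="an oil painting of a lighthouse at dusk",
    num_images_per_prompt=2,
    prior_num_inference_steps=25,
    decoder_num_inference_steps=25,
    super_res_num_inference_steps=7,
    prior_guidance_scale=4.0,
    decoder_guidance_scale=8.0,
    generator=generator,
    return_dict=True,  # return an ImagePipelineOutput instead of a plain tuple
)
result.images[0].save("lighthouse.png")
```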

Source code in mindone/diffusers/pipelines/unclip/pipeline_unclip.py
def __call__(
    self,
    prompt: Optional[Union[str, List[str]]] = None,
    num_images_per_prompt: int = 1,
    prior_num_inference_steps: int = 25,
    decoder_num_inference_steps: int = 25,
    super_res_num_inference_steps: int = 7,
    generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
    prior_latents: Optional[ms.Tensor] = None,
    decoder_latents: Optional[ms.Tensor] = None,
    super_res_latents: Optional[ms.Tensor] = None,
    text_model_output: Optional[Union[CLIPTextModelOutput, Tuple]] = None,
    text_attention_mask: Optional[ms.Tensor] = None,
    prior_guidance_scale: float = 4.0,
    decoder_guidance_scale: float = 8.0,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
):
    """
    The call function to the pipeline for generation.

    Args:
        prompt (`str` or `List[str]`):
            The prompt or prompts to guide image generation. This can only be left undefined if `text_model_output`
            and `text_attention_mask` are passed.
        num_images_per_prompt (`int`, *optional*, defaults to 1):
            The number of images to generate per prompt.
        prior_num_inference_steps (`int`, *optional*, defaults to 25):
            The number of denoising steps for the prior. More denoising steps usually lead to a higher quality
            image at the expense of slower inference.
        decoder_num_inference_steps (`int`, *optional*, defaults to 25):
            The number of denoising steps for the decoder. More denoising steps usually lead to a higher quality
            image at the expense of slower inference.
        super_res_num_inference_steps (`int`, *optional*, defaults to 7):
            The number of denoising steps for super resolution. More denoising steps usually lead to a higher
            quality image at the expense of slower inference.
        generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
            A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make
            generation deterministic.
        prior_latents (`ms.Tensor` of shape (batch size, embeddings dimension), *optional*):
            Pre-generated noisy latents to be used as inputs for the prior.
        decoder_latents (`ms.Tensor` of shape (batch size, channels, height, width), *optional*):
            Pre-generated noisy latents to be used as inputs for the decoder.
        super_res_latents (`ms.Tensor` of shape (batch size, channels, super res height, super res width), *optional*):
            Pre-generated noisy latents to be used as inputs for the super resolution.
        prior_guidance_scale (`float`, *optional*, defaults to 4.0):
            A higher guidance scale value encourages the model to generate images closely linked to the text
            `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
        decoder_guidance_scale (`float`, *optional*, defaults to 8.0):
            A higher guidance scale value encourages the model to generate images closely linked to the text
            `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
        text_model_output (`CLIPTextModelOutput`, *optional*):
            Pre-defined [`CLIPTextModel`] outputs that can be derived from the text encoder. Pre-defined text
            outputs can be passed for tasks like text embedding interpolations. Make sure to also pass
            `text_attention_mask` in this case. `prompt` can then be left `None`.
        text_attention_mask (`ms.Tensor`, *optional*):
            Pre-defined CLIP text attention mask that can be derived from the tokenizer. Pre-defined text attention
            masks are necessary when passing `text_model_output`.
        output_type (`str`, *optional*, defaults to `"pil"`):
            The output format of the generated image. Choose between `PIL.Image` or `np.array`.
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.

    Returns:
        [`~pipelines.ImagePipelineOutput`] or `tuple`:
            If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
            returned where the first element is a list with the generated images.
    """
    if prompt is not None:
        if isinstance(prompt, str):
            batch_size = 1
        elif isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
    else:
        batch_size = text_model_output[0].shape[0]

    batch_size = batch_size * num_images_per_prompt

    do_classifier_free_guidance = prior_guidance_scale > 1.0 or decoder_guidance_scale > 1.0

    prompt_embeds, text_enc_hid_states, text_mask = self._encode_prompt(
        prompt, num_images_per_prompt, do_classifier_free_guidance, text_model_output, text_attention_mask
    )

    # prior

    self.prior_scheduler.set_timesteps(prior_num_inference_steps)
    prior_timesteps_tensor = self.prior_scheduler.timesteps

    embedding_dim = self.prior.config.embedding_dim

    prior_latents = self.prepare_latents(
        (batch_size, embedding_dim),
        prompt_embeds.dtype,
        generator,
        prior_latents,
        self.prior_scheduler,
    )

    for i, t in enumerate(self.progress_bar(prior_timesteps_tensor)):
        # expand the latents if we are doing classifier free guidance
        latent_model_input = ops.cat([prior_latents] * 2) if do_classifier_free_guidance else prior_latents

        predicted_image_embedding = self.prior(
            latent_model_input,
            timestep=t,
            proj_embedding=prompt_embeds,
            encoder_hidden_states=text_enc_hid_states,
            attention_mask=text_mask,
        )[0]

        if do_classifier_free_guidance:
            predicted_image_embedding_uncond, predicted_image_embedding_text = predicted_image_embedding.chunk(2)
            predicted_image_embedding = predicted_image_embedding_uncond + prior_guidance_scale * (
                predicted_image_embedding_text - predicted_image_embedding_uncond
            )

        if i + 1 == prior_timesteps_tensor.shape[0]:
            prev_timestep = None
        else:
            prev_timestep = prior_timesteps_tensor[i + 1]

        prior_latents = self.prior_scheduler.step(
            predicted_image_embedding,
            timestep=t,
            sample=prior_latents,
            generator=generator,
            prev_timestep=prev_timestep,
        )[0]

    prior_latents = self.prior.post_process_latents(prior_latents)

    image_embeddings = prior_latents

    # done prior

    # decoder

    text_enc_hid_states, additive_clip_time_embeddings = self.text_proj(
        image_embeddings=image_embeddings,
        prompt_embeds=prompt_embeds,
        text_encoder_hidden_states=text_enc_hid_states,
        do_classifier_free_guidance=do_classifier_free_guidance,
    )

    decoder_text_mask = ops.pad(text_mask, (self.text_proj.clip_extra_context_tokens, 0), value=1.0)

    self.decoder_scheduler.set_timesteps(decoder_num_inference_steps)
    decoder_timesteps_tensor = self.decoder_scheduler.timesteps

    num_channels_latents = self.decoder.config.in_channels
    height = self.decoder.config.sample_size
    width = self.decoder.config.sample_size

    decoder_latents = self.prepare_latents(
        (batch_size, num_channels_latents, height, width),
        text_enc_hid_states.dtype,
        generator,
        decoder_latents,
        self.decoder_scheduler,
    )

    for i, t in enumerate(self.progress_bar(decoder_timesteps_tensor)):
        # expand the latents if we are doing classifier free guidance
        latent_model_input = ops.cat([decoder_latents] * 2) if do_classifier_free_guidance else decoder_latents

        noise_pred = self.decoder(
            sample=latent_model_input,
            timestep=t,
            encoder_hidden_states=text_enc_hid_states,
            class_labels=additive_clip_time_embeddings,
            attention_mask=decoder_text_mask,
        )[0]

        if do_classifier_free_guidance:
            noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
            noise_pred_uncond, _ = noise_pred_uncond.split(latent_model_input.shape[1], axis=1)
            noise_pred_text, predicted_variance = noise_pred_text.split(latent_model_input.shape[1], axis=1)
            noise_pred = noise_pred_uncond + decoder_guidance_scale * (noise_pred_text - noise_pred_uncond)
            noise_pred = ops.cat([noise_pred, predicted_variance], axis=1)

        if i + 1 == decoder_timesteps_tensor.shape[0]:
            prev_timestep = None
        else:
            prev_timestep = decoder_timesteps_tensor[i + 1]

        # compute the previous noisy sample x_t -> x_t-1
        decoder_latents = self.decoder_scheduler.step(
            noise_pred, t, decoder_latents, prev_timestep=prev_timestep, generator=generator
        )[0]

    decoder_latents = decoder_latents.clamp(-1, 1)

    image_small = decoder_latents

    # done decoder

    # super res

    self.super_res_scheduler.set_timesteps(super_res_num_inference_steps)
    super_res_timesteps_tensor = self.super_res_scheduler.timesteps

    channels = self.super_res_first.config.in_channels // 2
    height = self.super_res_first.config.sample_size
    width = self.super_res_first.config.sample_size

    super_res_latents = self.prepare_latents(
        (batch_size, channels, height, width),
        image_small.dtype,
        generator,
        super_res_latents,
        self.super_res_scheduler,
    )

    interpolate_antialias = {}
    if "antialias" in inspect.signature(ops.interpolate).parameters:
        interpolate_antialias["antialias"] = True

    image_upscaled = ops.interpolate(
        image_small, size=[height, width], mode="bicubic", align_corners=False, **interpolate_antialias
    )

    for i, t in enumerate(self.progress_bar(super_res_timesteps_tensor)):
        # no classifier free guidance

        if i == super_res_timesteps_tensor.shape[0] - 1:
            unet = self.super_res_last
        else:
            unet = self.super_res_first

        latent_model_input = ops.cat([super_res_latents, image_upscaled], axis=1)

        noise_pred = unet(
            sample=latent_model_input,
            timestep=t,
        )[0]

        if i + 1 == super_res_timesteps_tensor.shape[0]:
            prev_timestep = None
        else:
            prev_timestep = super_res_timesteps_tensor[i + 1]

        # compute the previous noisy sample x_t -> x_t-1
        super_res_latents = self.super_res_scheduler.step(
            noise_pred, t, super_res_latents, prev_timestep=prev_timestep, generator=generator
        )[0]

    image = super_res_latents
    # done super res

    # post processing
    image = image * 0.5 + 0.5
    image = image.clamp(0, 1)
    image = image.permute(0, 2, 3, 1).float().numpy()

    if output_type == "pil":
        image = self.numpy_to_pil(image)

    if not return_dict:
        return (image,)

    return ImagePipelineOutput(images=image)

mindone.diffusers.UnCLIPImageVariationPipeline

Bases: DiffusionPipeline

Pipeline to generate image variations from an input image using UnCLIP.

This model inherits from [DiffusionPipeline]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

PARAMETER DESCRIPTION
text_encoder

Frozen text-encoder.

TYPE: [`~transformers.CLIPTextModelWithProjection`]

tokenizer

A CLIPTokenizer to tokenize text.

TYPE: [`~transformers.CLIPTokenizer`]

feature_extractor

Model that extracts features from generated images to be used as inputs for the image_encoder.

TYPE: [`~transformers.CLIPImageProcessor`]

image_encoder

Frozen CLIP image-encoder (clip-vit-large-patch14).

TYPE: [`~transformers.CLIPVisionModelWithProjection`]

text_proj

Utility class to prepare and combine the embeddings before they are passed to the decoder.

TYPE: [`UnCLIPTextProjModel`]

decoder

The decoder to invert the image embedding into an image.

TYPE: [`UNet2DConditionModel`]

super_res_first

Super resolution UNet. Used in all but the last step of the super resolution diffusion process.

TYPE: [`UNet2DModel`]

super_res_last

Super resolution UNet. Used in the last step of the super resolution diffusion process.

TYPE: [`UNet2DModel`]

decoder_scheduler

Scheduler used in the decoder denoising process (a modified [DDPMScheduler]).

TYPE: [`UnCLIPScheduler`]

super_res_scheduler

Scheduler used in the super resolution denoising process (a modified [DDPMScheduler]).

TYPE: [`UnCLIPScheduler`]
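
A minimal image-variation sketch follows. The repo id is taken from the `feature_extractor` config referenced in the docstring and is an assumption; loading is assumed to mirror 🤗 Diffusers.

```python
from PIL import Image
from mindone.diffusers import UnCLIPImageVariationPipeline

# Repo id assumed from the feature_extractor config referenced in the docstring below.
pipe = UnCLIPImageVariationPipeline.from_pretrained("fusing/karlo-image-variations-diffusers")

init_image = Image.open("input.png").convert("RGB")

# `return_dict` defaults to False, so index into the returned tuple for the image list.
variations = pipe(image=init_image, num_images_per_prompt=2, decoder_guidance_scale=8.0)[0]
variations[0].save("variation_0.png")
```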

Source code in mindone/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py
class UnCLIPImageVariationPipeline(DiffusionPipeline):
    """
    Pipeline to generate image variations from an input image using UnCLIP.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        text_encoder ([`~transformers.CLIPTextModelWithProjection`]):
            Frozen text-encoder.
        tokenizer ([`~transformers.CLIPTokenizer`]):
            A `CLIPTokenizer` to tokenize text.
        feature_extractor ([`~transformers.CLIPImageProcessor`]):
            Model that extracts features from generated images to be used as inputs for the `image_encoder`.
        image_encoder ([`~transformers.CLIPVisionModelWithProjection`]):
            Frozen CLIP image-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
        text_proj ([`UnCLIPTextProjModel`]):
            Utility class to prepare and combine the embeddings before they are passed to the decoder.
        decoder ([`UNet2DConditionModel`]):
            The decoder to invert the image embedding into an image.
        super_res_first ([`UNet2DModel`]):
            Super resolution UNet. Used in all but the last step of the super resolution diffusion process.
        super_res_last ([`UNet2DModel`]):
            Super resolution UNet. Used in the last step of the super resolution diffusion process.
        decoder_scheduler ([`UnCLIPScheduler`]):
            Scheduler used in the decoder denoising process (a modified [`DDPMScheduler`]).
        super_res_scheduler ([`UnCLIPScheduler`]):
            Scheduler used in the super resolution denoising process (a modified [`DDPMScheduler`]).
    """

    decoder: UNet2DConditionModel
    text_proj: UnCLIPTextProjModel
    text_encoder: CLIPTextModelWithProjection
    tokenizer: CLIPTokenizer
    feature_extractor: CLIPImageProcessor
    image_encoder: CLIPVisionModelWithProjection
    super_res_first: UNet2DModel
    super_res_last: UNet2DModel

    decoder_scheduler: UnCLIPScheduler
    super_res_scheduler: UnCLIPScheduler
    model_cpu_offload_seq = "text_encoder->image_encoder->text_proj->decoder->super_res_first->super_res_last"

    def __init__(
        self,
        decoder: UNet2DConditionModel,
        text_encoder: CLIPTextModelWithProjection,
        tokenizer: CLIPTokenizer,
        text_proj: UnCLIPTextProjModel,
        feature_extractor: CLIPImageProcessor,
        image_encoder: CLIPVisionModelWithProjection,
        super_res_first: UNet2DModel,
        super_res_last: UNet2DModel,
        decoder_scheduler: UnCLIPScheduler,
        super_res_scheduler: UnCLIPScheduler,
    ):
        super().__init__()

        self.register_modules(
            decoder=decoder,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            text_proj=text_proj,
            feature_extractor=feature_extractor,
            image_encoder=image_encoder,
            super_res_first=super_res_first,
            super_res_last=super_res_last,
            decoder_scheduler=decoder_scheduler,
            super_res_scheduler=super_res_scheduler,
        )

    # Copied from diffusers.pipelines.unclip.pipeline_unclip.UnCLIPPipeline.prepare_latents
    def prepare_latents(self, shape, dtype, generator, latents, scheduler):
        if latents is None:
            latents = randn_tensor(shape, generator=generator, dtype=dtype)
        else:
            if latents.shape != shape:
                raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")

        latents = (latents * scheduler.init_noise_sigma).to(dtype)
        return latents

    def _encode_prompt(self, prompt, num_images_per_prompt, do_classifier_free_guidance):
        batch_size = len(prompt) if isinstance(prompt, list) else 1

        # get prompt text embeddings
        text_inputs = self.tokenizer(
            prompt,
            padding="max_length",
            max_length=self.tokenizer.model_max_length,
            return_tensors="np",
        )
        text_input_ids = ms.Tensor.from_numpy(text_inputs.input_ids)
        text_mask = ms.Tensor.from_numpy(text_inputs.attention_mask)  # MindSpore mask does not require bool()
        text_encoder_output = self.text_encoder(text_input_ids)

        prompt_embeds = text_encoder_output[0]
        text_encoder_hidden_states = text_encoder_output[1]

        prompt_embeds = prompt_embeds.repeat_interleave(num_images_per_prompt, dim=0)
        text_encoder_hidden_states = text_encoder_hidden_states.repeat_interleave(num_images_per_prompt, dim=0)
        text_mask = text_mask.repeat_interleave(num_images_per_prompt, dim=0)

        if do_classifier_free_guidance:
            uncond_tokens = [""] * batch_size

            max_length = text_input_ids.shape[-1]
            uncond_input = self.tokenizer(
                uncond_tokens,
                padding="max_length",
                max_length=max_length,
                truncation=True,
                return_tensors="np",
            )
            uncond_text_mask = ms.Tensor.from_numpy(
                uncond_input.attention_mask
            )  # MindSpore mask does not require bool()
            negative_prompt_embeds_text_encoder_output = self.text_encoder(ms.Tensor.from_numpy(uncond_input.input_ids))

            negative_prompt_embeds = negative_prompt_embeds_text_encoder_output[0]
            uncond_text_encoder_hidden_states = negative_prompt_embeds_text_encoder_output[1]

            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method

            seq_len = negative_prompt_embeds.shape[1]
            negative_prompt_embeds = negative_prompt_embeds.tile((1, num_images_per_prompt))
            negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len)

            seq_len = uncond_text_encoder_hidden_states.shape[1]
            uncond_text_encoder_hidden_states = uncond_text_encoder_hidden_states.tile((1, num_images_per_prompt, 1))
            uncond_text_encoder_hidden_states = uncond_text_encoder_hidden_states.view(
                batch_size * num_images_per_prompt, seq_len, -1
            )
            uncond_text_mask = uncond_text_mask.repeat_interleave(num_images_per_prompt, dim=0)

            # done duplicates

            # For classifier free guidance, we need to do two forward passes.
            # Here we concatenate the unconditional and text embeddings into a single batch
            # to avoid doing two forward passes
            prompt_embeds = ops.cat([negative_prompt_embeds, prompt_embeds])
            text_encoder_hidden_states = ops.cat([uncond_text_encoder_hidden_states, text_encoder_hidden_states])

            text_mask = ops.cat([uncond_text_mask, text_mask])

        return prompt_embeds, text_encoder_hidden_states, text_mask

    def _encode_image(self, image, num_images_per_prompt, image_embeddings: Optional[ms.Tensor] = None):
        dtype = next(self.image_encoder.get_parameters()).dtype

        if image_embeddings is None:
            if not isinstance(image, ms.Tensor):
                image = self.feature_extractor(images=image, return_tensors="np").pixel_values
                image = ms.Tensor(image)

            image = image.to(dtype=dtype)
            image_embeddings = self.image_encoder(image)[0]

        image_embeddings = image_embeddings.repeat_interleave(num_images_per_prompt, dim=0)

        return image_embeddings

    def __call__(
        self,
        image: Optional[Union[PIL.Image.Image, List[PIL.Image.Image], ms.Tensor]] = None,
        num_images_per_prompt: int = 1,
        decoder_num_inference_steps: int = 25,
        super_res_num_inference_steps: int = 7,
        generator: Optional[np.random.Generator] = None,
        decoder_latents: Optional[ms.Tensor] = None,
        super_res_latents: Optional[ms.Tensor] = None,
        image_embeddings: Optional[ms.Tensor] = None,
        decoder_guidance_scale: float = 8.0,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
    ):
        """
        The call function to the pipeline for generation.

        Args:
            image (`PIL.Image.Image` or `List[PIL.Image.Image]` or `ms.Tensor`):
                `Image` or tensor representing an image batch to be used as the starting point. If you provide a
                tensor, it needs to be compatible with the [`CLIPImageProcessor`]
                [configuration](https://huggingface.co/fusing/karlo-image-variations-diffusers/blob/main/feature_extractor/preprocessor_config.json).
                Can be left as `None` only when `image_embeddings` are passed.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            decoder_num_inference_steps (`int`, *optional*, defaults to 25):
                The number of denoising steps for the decoder. More denoising steps usually lead to a higher quality
                image at the expense of slower inference.
            super_res_num_inference_steps (`int`, *optional*, defaults to 7):
                The number of denoising steps for super resolution. More denoising steps usually lead to a higher
                quality image at the expense of slower inference.
            generator (`np.random.Generator`, *optional*):
                A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make
                generation deterministic.
            decoder_latents (`ms.Tensor` of shape (batch size, channels, height, width), *optional*):
                Pre-generated noisy latents to be used as inputs for the decoder.
            super_res_latents (`ms.Tensor` of shape (batch size, channels, super res height, super res width), *optional*):
                Pre-generated noisy latents to be used as inputs for the super resolution.
            decoder_guidance_scale (`float`, *optional*, defaults to 8.0):
                A higher guidance scale value encourages the model to generate images closely linked to the text
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            image_embeddings (`ms.Tensor`, *optional*):
                Pre-defined image embeddings that can be derived from the image encoder. Pre-defined image embeddings
                can be passed for tasks like image interpolations. `image` can be left as `None`.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between `PIL.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.

        Returns:
            [`~pipelines.ImagePipelineOutput`] or `tuple`:
                If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
                returned where the first element is a list with the generated images.
        """
        if image is not None:
            if isinstance(image, PIL.Image.Image):
                batch_size = 1
            elif isinstance(image, list):
                batch_size = len(image)
            else:
                batch_size = image.shape[0]
        else:
            batch_size = image_embeddings.shape[0]

        prompt = [""] * batch_size

        batch_size = batch_size * num_images_per_prompt

        do_classifier_free_guidance = decoder_guidance_scale > 1.0

        prompt_embeds, text_encoder_hidden_states, text_mask = self._encode_prompt(
            prompt, num_images_per_prompt, do_classifier_free_guidance
        )

        image_embeddings = self._encode_image(image, num_images_per_prompt, image_embeddings)

        # decoder
        text_encoder_hidden_states, additive_clip_time_embeddings = self.text_proj(
            image_embeddings=image_embeddings,
            prompt_embeds=prompt_embeds,
            text_encoder_hidden_states=text_encoder_hidden_states,
            do_classifier_free_guidance=do_classifier_free_guidance,
        )

        decoder_text_mask = ops.pad(text_mask, (self.text_proj.clip_extra_context_tokens, 0), value=True)

        self.decoder_scheduler.set_timesteps(decoder_num_inference_steps)
        decoder_timesteps_tensor = self.decoder_scheduler.timesteps

        num_channels_latents = self.decoder.config.in_channels
        height = self.decoder.config.sample_size
        width = self.decoder.config.sample_size

        if decoder_latents is None:
            decoder_latents = self.prepare_latents(
                (batch_size, num_channels_latents, height, width),
                text_encoder_hidden_states.dtype,
                generator,
                decoder_latents,
                self.decoder_scheduler,
            )

        for i, t in enumerate(self.progress_bar(decoder_timesteps_tensor)):
            # expand the latents if we are doing classifier free guidance
            latent_model_input = ops.cat([decoder_latents] * 2) if do_classifier_free_guidance else decoder_latents

            noise_pred = self.decoder(
                sample=latent_model_input,
                timestep=t,
                encoder_hidden_states=text_encoder_hidden_states,
                class_labels=additive_clip_time_embeddings,
                attention_mask=decoder_text_mask,
            )[0]

            if do_classifier_free_guidance:
                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                noise_pred_uncond, _ = noise_pred_uncond.split(latent_model_input.shape[1], axis=1)
                noise_pred_text, predicted_variance = noise_pred_text.split(latent_model_input.shape[1], axis=1)
                noise_pred = noise_pred_uncond + decoder_guidance_scale * (noise_pred_text - noise_pred_uncond)
                noise_pred = ops.cat([noise_pred, predicted_variance], axis=1)

            if i + 1 == decoder_timesteps_tensor.shape[0]:
                prev_timestep = None
            else:
                prev_timestep = decoder_timesteps_tensor[i + 1]

            # compute the previous noisy sample x_t -> x_t-1
            decoder_latents = self.decoder_scheduler.step(
                noise_pred, t, decoder_latents, prev_timestep=prev_timestep, generator=generator
            )[0]

        decoder_latents = decoder_latents.clamp(-1, 1)

        image_small = decoder_latents

        # done decoder

        # super res

        self.super_res_scheduler.set_timesteps(super_res_num_inference_steps)
        super_res_timesteps_tensor = self.super_res_scheduler.timesteps

        channels = self.super_res_first.config.in_channels // 2
        height = self.super_res_first.config.sample_size
        width = self.super_res_first.config.sample_size

        if super_res_latents is None:
            super_res_latents = self.prepare_latents(
                (batch_size, channels, height, width),
                image_small.dtype,
                generator,
                super_res_latents,
                self.super_res_scheduler,
            )

        interpolate_antialias = {}
        if "antialias" in inspect.signature(ops.interpolate).parameters:
            interpolate_antialias["antialias"] = True

        image_upscaled = ops.interpolate(
            image_small, size=[height, width], mode="bicubic", align_corners=False, **interpolate_antialias
        )

        for i, t in enumerate(self.progress_bar(super_res_timesteps_tensor)):
            # no classifier free guidance

            if i == super_res_timesteps_tensor.shape[0] - 1:
                unet = self.super_res_last
            else:
                unet = self.super_res_first

            latent_model_input = ops.cat([super_res_latents, image_upscaled], axis=1)

            noise_pred = unet(
                sample=latent_model_input,
                timestep=t,
            )[0]

            if i + 1 == super_res_timesteps_tensor.shape[0]:
                prev_timestep = None
            else:
                prev_timestep = super_res_timesteps_tensor[i + 1]

            # compute the previous noisy sample x_t -> x_t-1
            super_res_latents = self.super_res_scheduler.step(
                noise_pred, t, super_res_latents, prev_timestep=prev_timestep, generator=generator
            )[0]

        image = super_res_latents

        # post processing

        image = image * 0.5 + 0.5
        image = image.clamp(0, 1)
        image = image.permute(0, 2, 3, 1).float().numpy()

        if output_type == "pil":
            image = self.numpy_to_pil(image)

        if not return_dict:
            return (image,)

        return ImagePipelineOutput(images=image)
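
A minimal usage sketch for the image-variation pipeline, assuming the fusing/karlo-image-variations-diffusers checkpoint linked in the parameter documentation below and a local input.png; parameters follow the __call__ signature documented next.

from PIL import Image

from mindone.diffusers import UnCLIPImageVariationPipeline

# Checkpoint name is taken from the feature-extractor config linked on this page; substitute your own if needed.
pipe = UnCLIPImageVariationPipeline.from_pretrained("fusing/karlo-image-variations-diffusers")

init_image = Image.open("input.png").convert("RGB")

# return_dict defaults to False, so the call returns a tuple whose first element is the list of images.
images = pipe(image=init_image, decoder_guidance_scale=8.0)[0]
images[0].save("variation.png")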

mindone.diffusers.UnCLIPImageVariationPipeline.__call__(image=None, num_images_per_prompt=1, decoder_num_inference_steps=25, super_res_num_inference_steps=7, generator=None, decoder_latents=None, super_res_latents=None, image_embeddings=None, decoder_guidance_scale=8.0, output_type='pil', return_dict=False)

The call function to the pipeline for generation.

PARAMETER DESCRIPTION
image

Image or tensor representing an image batch to be used as the starting point. If you provide a tensor, it needs to be compatible with the [`CLIPImageProcessor`] [configuration](https://huggingface.co/fusing/karlo-image-variations-diffusers/blob/main/feature_extractor/preprocessor_config.json). Can be left as None only when image_embeddings are passed.

TYPE: `PIL.Image.Image` or `List[PIL.Image.Image]` or `ms.Tensor` DEFAULT: None

num_images_per_prompt

The number of images to generate per prompt.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

decoder_num_inference_steps

The number of denoising steps for the decoder. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 25 DEFAULT: 25

super_res_num_inference_steps

The number of denoising steps for super resolution. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 7 DEFAULT: 7

generator

A np.random.Generator to make generation deterministic.

TYPE: `np.random.Generator`, *optional* DEFAULT: None

decoder_latents

Pre-generated noisy latents to be used as inputs for the decoder.

TYPE: `ms.Tensor` of shape (batch size, channels, height, width), *optional* DEFAULT: None

super_res_latents

Pre-generated noisy latents to be used as inputs for the super resolution.

TYPE: `ms.Tensor` of shape (batch size, channels, super res height, super res width), *optional* DEFAULT: None

decoder_guidance_scale

A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

TYPE: `float`, *optional*, defaults to 8.0 DEFAULT: 8.0

image_embeddings

Pre-defined image embeddings that can be derived from the image encoder. Pre-defined image embeddings can be passed for tasks like image interpolations. image can be left as None.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

output_type

The output format of the generated image. Choose between PIL.Image and np.array.

TYPE: `str`, *optional*, defaults to `"pil"` DEFAULT: 'pil'

return_dict

Whether or not to return a [~pipelines.ImagePipelineOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

RETURNS DESCRIPTION

[~pipelines.ImagePipelineOutput] or tuple: If return_dict is True, [~pipelines.ImagePipelineOutput] is returned, otherwise a tuple is returned where the first element is a list with the generated images.
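
Because image_embeddings can replace image, the pipeline can decode arbitrary points in CLIP embedding space, for example a blend of two images. The sketch below reuses the pipe object from the earlier example and assumes a hypothetical helper get_clip_image_embedding that returns a (1, embed_dim) ms.Tensor; only the blending and the image_embeddings= call come from this reference.

# Hypothetical helper: encodes an image file into a (1, embed_dim) CLIP image embedding
# with the same image encoder the checkpoint was trained against.
emb_a = get_clip_image_embedding("cat.png")
emb_b = get_clip_image_embedding("dog.png")

alpha = 0.5  # 0.0 -> pure emb_a, 1.0 -> pure emb_b
emb_mid = (1 - alpha) * emb_a + alpha * emb_b

# `image` stays None because pre-computed embeddings are passed instead.
images = pipe(image_embeddings=emb_mid)[0]
images[0].save("interpolation_midpoint.png")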

Source code in mindone/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py
def __call__(
    self,
    image: Optional[Union[PIL.Image.Image, List[PIL.Image.Image], ms.Tensor]] = None,
    num_images_per_prompt: int = 1,
    decoder_num_inference_steps: int = 25,
    super_res_num_inference_steps: int = 7,
    generator: Optional[np.random.Generator] = None,
    decoder_latents: Optional[ms.Tensor] = None,
    super_res_latents: Optional[ms.Tensor] = None,
    image_embeddings: Optional[ms.Tensor] = None,
    decoder_guidance_scale: float = 8.0,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
):
    """
    The call function to the pipeline for generation.

    Args:
        image (`PIL.Image.Image` or `List[PIL.Image.Image]` or `ms.Tensor`):
            `Image` or tensor representing an image batch to be used as the starting point. If you provide a
            tensor, it needs to be compatible with the [`CLIPImageProcessor`]
            [configuration](https://huggingface.co/fusing/karlo-image-variations-diffusers/blob/main/feature_extractor/preprocessor_config.json).
            Can be left as `None` only when `image_embeddings` are passed.
        num_images_per_prompt (`int`, *optional*, defaults to 1):
            The number of images to generate per prompt.
        decoder_num_inference_steps (`int`, *optional*, defaults to 25):
            The number of denoising steps for the decoder. More denoising steps usually lead to a higher quality
            image at the expense of slower inference.
        super_res_num_inference_steps (`int`, *optional*, defaults to 7):
            The number of denoising steps for super resolution. More denoising steps usually lead to a higher
            quality image at the expense of slower inference.
        generator (`np.random.Generator`, *optional*):
            A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make
            generation deterministic.
        decoder_latents (`ms.Tensor` of shape (batch size, channels, height, width), *optional*):
            Pre-generated noisy latents to be used as inputs for the decoder.
        super_res_latents (`ms.Tensor` of shape (batch size, channels, super res height, super res width), *optional*):
            Pre-generated noisy latents to be used as inputs for the super resolution.
        decoder_guidance_scale (`float`, *optional*, defaults to 8.0):
            A higher guidance scale value encourages the model to generate images closely linked to the text
            `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
        image_embeddings (`ms.Tensor`, *optional*):
            Pre-defined image embeddings that can be derived from the image encoder. Pre-defined image embeddings
            can be passed for tasks like image interpolations. `image` can be left as `None`.
        output_type (`str`, *optional*, defaults to `"pil"`):
            The output format of the generated image. Choose between `PIL.Image` and `np.array`.
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.

    Returns:
        [`~pipelines.ImagePipelineOutput`] or `tuple`:
            If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
            returned where the first element is a list with the generated images.
    """
    if image is not None:
        if isinstance(image, PIL.Image.Image):
            batch_size = 1
        elif isinstance(image, list):
            batch_size = len(image)
        else:
            batch_size = image.shape[0]
    else:
        batch_size = image_embeddings.shape[0]

    prompt = [""] * batch_size

    batch_size = batch_size * num_images_per_prompt

    do_classifier_free_guidance = decoder_guidance_scale > 1.0

    prompt_embeds, text_encoder_hidden_states, text_mask = self._encode_prompt(
        prompt, num_images_per_prompt, do_classifier_free_guidance
    )

    image_embeddings = self._encode_image(image, num_images_per_prompt, image_embeddings)

    # decoder
    text_encoder_hidden_states, additive_clip_time_embeddings = self.text_proj(
        image_embeddings=image_embeddings,
        prompt_embeds=prompt_embeds,
        text_encoder_hidden_states=text_encoder_hidden_states,
        do_classifier_free_guidance=do_classifier_free_guidance,
    )

    decoder_text_mask = ops.pad(text_mask, (self.text_proj.clip_extra_context_tokens, 0), value=True)

    self.decoder_scheduler.set_timesteps(decoder_num_inference_steps)
    decoder_timesteps_tensor = self.decoder_scheduler.timesteps

    num_channels_latents = self.decoder.config.in_channels
    height = self.decoder.config.sample_size
    width = self.decoder.config.sample_size

    if decoder_latents is None:
        decoder_latents = self.prepare_latents(
            (batch_size, num_channels_latents, height, width),
            text_encoder_hidden_states.dtype,
            generator,
            decoder_latents,
            self.decoder_scheduler,
        )

    for i, t in enumerate(self.progress_bar(decoder_timesteps_tensor)):
        # expand the latents if we are doing classifier free guidance
        latent_model_input = ops.cat([decoder_latents] * 2) if do_classifier_free_guidance else decoder_latents

        noise_pred = self.decoder(
            sample=latent_model_input,
            timestep=t,
            encoder_hidden_states=text_encoder_hidden_states,
            class_labels=additive_clip_time_embeddings,
            attention_mask=decoder_text_mask,
        )[0]

        if do_classifier_free_guidance:
            noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
            noise_pred_uncond, _ = noise_pred_uncond.split(latent_model_input.shape[1], axis=1)
            noise_pred_text, predicted_variance = noise_pred_text.split(latent_model_input.shape[1], axis=1)
            noise_pred = noise_pred_uncond + decoder_guidance_scale * (noise_pred_text - noise_pred_uncond)
            noise_pred = ops.cat([noise_pred, predicted_variance], axis=1)

        if i + 1 == decoder_timesteps_tensor.shape[0]:
            prev_timestep = None
        else:
            prev_timestep = decoder_timesteps_tensor[i + 1]

        # compute the previous noisy sample x_t -> x_t-1
        decoder_latents = self.decoder_scheduler.step(
            noise_pred, t, decoder_latents, prev_timestep=prev_timestep, generator=generator
        )[0]

    decoder_latents = decoder_latents.clamp(-1, 1)

    image_small = decoder_latents

    # done decoder

    # super res

    self.super_res_scheduler.set_timesteps(super_res_num_inference_steps)
    super_res_timesteps_tensor = self.super_res_scheduler.timesteps

    channels = self.super_res_first.config.in_channels // 2
    height = self.super_res_first.config.sample_size
    width = self.super_res_first.config.sample_size

    if super_res_latents is None:
        super_res_latents = self.prepare_latents(
            (batch_size, channels, height, width),
            image_small.dtype,
            generator,
            super_res_latents,
            self.super_res_scheduler,
        )

    interpolate_antialias = {}
    if "antialias" in inspect.signature(ops.interpolate).parameters:
        interpolate_antialias["antialias"] = True

    image_upscaled = ops.interpolate(
        image_small, size=[height, width], mode="bicubic", align_corners=False, **interpolate_antialias
    )

    for i, t in enumerate(self.progress_bar(super_res_timesteps_tensor)):
        # no classifier free guidance

        if i == super_res_timesteps_tensor.shape[0] - 1:
            unet = self.super_res_last
        else:
            unet = self.super_res_first

        latent_model_input = ops.cat([super_res_latents, image_upscaled], axis=1)

        noise_pred = unet(
            sample=latent_model_input,
            timestep=t,
        )[0]

        if i + 1 == super_res_timesteps_tensor.shape[0]:
            prev_timestep = None
        else:
            prev_timestep = super_res_timesteps_tensor[i + 1]

        # compute the previous noisy sample x_t -> x_t-1
        super_res_latents = self.super_res_scheduler.step(
            noise_pred, t, super_res_latents, prev_timestep=prev_timestep, generator=generator
        )[0]

    image = super_res_latents

    # post processing

    image = image * 0.5 + 0.5
    image = image.clamp(0, 1)
    image = image.permute(0, 2, 3, 1).float().numpy()

    if output_type == "pil":
        image = self.numpy_to_pil(image)

    if not return_dict:
        return (image,)

    return ImagePipelineOutput(images=image)
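
As documented above, the random source is a np.random.Generator, so seeding it makes the sampled decoder_latents and super_res_latents reproducible. A small sketch, reusing pipe and init_image from the first example:

import numpy as np

generator = np.random.default_rng(0)  # np.random.Generator with a fixed seed
images = pipe(
    image=init_image,
    decoder_num_inference_steps=25,
    super_res_num_inference_steps=7,
    generator=generator,
)[0]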

mindone.diffusers.pipelines.ImagePipelineOutput dataclass

Bases: BaseOutput

Output class for image pipelines.

Source code in mindone/diffusers/pipelines/pipeline_utils.py
@dataclass
class ImagePipelineOutput(BaseOutput):
    """
    Output class for image pipelines.

    Args:
        images (`List[PIL.Image.Image]` or `np.ndarray`):
            List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width,
            num_channels)`.
    """

    images: Union[List[PIL.Image.Image], np.ndarray]
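
Since return_dict defaults to False in the pipelines above, the output is normally a plain tuple; passing return_dict=True wraps the same images in this dataclass. A short sketch, reusing pipe and init_image from the earlier examples:

# Default call: a plain tuple whose first element is the list of generated PIL images.
images = pipe(image=init_image)[0]

# With return_dict=True: an ImagePipelineOutput exposing the same images under `.images`.
output = pipe(image=init_image, return_dict=True)
images = output.images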