
unCLIP

Hierarchical Text-Conditional Image Generation with CLIP Latents is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. The unCLIP model in 🤗 Diffusers comes from kakaobrain's karlo.

The abstract from the paper is:

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.

You can find lucidrains' DALL-E 2 recreation at lucidrains/DALLE2-pytorch.
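
A minimal text-to-image sketch is shown below. It assumes the karlo checkpoint `kakaobrain/karlo-v1-alpha` and that loading in `mindone.diffusers` mirrors 🤗 Diffusers; adjust the checkpoint name or path to your setup.

```python
from mindone.diffusers import UnCLIPPipeline

# Checkpoint name is an assumption (the standard karlo weights); swap in a local path if needed.
pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha")

prompt = "a high-quality photo of a corgi wearing sunglasses"

# `return_dict` defaults to False in this pipeline, so the call returns a tuple whose
# first element is the list of generated PIL images.
images = pipe(prompt, prior_guidance_scale=4.0, decoder_guidance_scale=8.0)[0]
images[0].save("unclip_sample.png")
```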

Tip

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

mindone.diffusers.UnCLIPPipeline

Bases: DiffusionPipeline

Pipeline for text-to-image generation using unCLIP.

This model inherits from [DiffusionPipeline]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

PARAMETER DESCRIPTION
text_encoder

Frozen text-encoder.

TYPE: [`~transformers.CLIPTextModelWithProjection`]

tokenizer

A CLIPTokenizer to tokenize text.

TYPE: [`~transformers.CLIPTokenizer`]

prior

The canonical unCLIP prior to approximate the image embedding from the text embedding.

TYPE: [`PriorTransformer`]

text_proj

Utility class to prepare and combine the embeddings before they are passed to the decoder.

TYPE: [`UnCLIPTextProjModel`]

decoder

The decoder to invert the image embedding into an image.

TYPE: [`UNet2DConditionModel`]

super_res_first

Super resolution UNet. Used in all but the last step of the super resolution diffusion process.

TYPE: [`UNet2DModel`]

super_res_last

Super resolution UNet. Used in the last step of the super resolution diffusion process.

TYPE: [`UNet2DModel`]

prior_scheduler

Scheduler used in the prior denoising process (a modified [DDPMScheduler]).

TYPE: [`UnCLIPScheduler`]

decoder_scheduler

Scheduler used in the decoder denoising process (a modified [DDPMScheduler]).

TYPE: [`UnCLIPScheduler`]

super_res_scheduler

Scheduler used in the super resolution denoising process (a modified [DDPMScheduler]).

TYPE: [`UnCLIPScheduler`]
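
The components listed above are registered on the pipeline and exposed as attributes once it is loaded. The sketch below (checkpoint name assumed, loading API assumed to mirror 🤗 Diffusers) shows how to inspect them or reuse them elsewhere.

```python
from mindone.diffusers import UnCLIPPipeline

pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha")  # checkpoint name assumed

# Each constructor argument documented above is available as an attribute of the loaded pipeline.
print(type(pipe.prior).__name__)            # PriorTransformer
print(type(pipe.decoder).__name__)          # UNet2DConditionModel
print(type(pipe.prior_scheduler).__name__)  # UnCLIPScheduler

# If `components` mirrors 🤗 Diffusers, the same modules can seed another pipeline instance:
# other_pipe = UnCLIPPipeline(**pipe.components)
```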

Source code in mindone/diffusers/pipelines/unclip/pipeline_unclip.py
class UnCLIPPipeline(DiffusionPipeline):
    """
    Pipeline for text-to-image generation using unCLIP.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        text_encoder ([`~transformers.CLIPTextModelWithProjection`]):
            Frozen text-encoder.
        tokenizer ([`~transformers.CLIPTokenizer`]):
            A `CLIPTokenizer` to tokenize text.
        prior ([`PriorTransformer`]):
            The canonical unCLIP prior to approximate the image embedding from the text embedding.
        text_proj ([`UnCLIPTextProjModel`]):
            Utility class to prepare and combine the embeddings before they are passed to the decoder.
        decoder ([`UNet2DConditionModel`]):
            The decoder to invert the image embedding into an image.
        super_res_first ([`UNet2DModel`]):
            Super resolution UNet. Used in all but the last step of the super resolution diffusion process.
        super_res_last ([`UNet2DModel`]):
            Super resolution UNet. Used in the last step of the super resolution diffusion process.
        prior_scheduler ([`UnCLIPScheduler`]):
            Scheduler used in the prior denoising process (a modified [`DDPMScheduler`]).
        decoder_scheduler ([`UnCLIPScheduler`]):
            Scheduler used in the decoder denoising process (a modified [`DDPMScheduler`]).
        super_res_scheduler ([`UnCLIPScheduler`]):
            Scheduler used in the super resolution denoising process (a modified [`DDPMScheduler`]).

    """

    _exclude_from_cpu_offload = ["prior"]

    prior: PriorTransformer
    decoder: UNet2DConditionModel
    text_proj: UnCLIPTextProjModel
    text_encoder: CLIPTextModelWithProjection
    tokenizer: CLIPTokenizer
    super_res_first: UNet2DModel
    super_res_last: UNet2DModel

    prior_scheduler: UnCLIPScheduler
    decoder_scheduler: UnCLIPScheduler
    super_res_scheduler: UnCLIPScheduler

    model_cpu_offload_seq = "text_encoder->text_proj->decoder->super_res_first->super_res_last"

    def __init__(
        self,
        prior: PriorTransformer,
        decoder: UNet2DConditionModel,
        text_encoder: CLIPTextModelWithProjection,
        tokenizer: CLIPTokenizer,
        text_proj: UnCLIPTextProjModel,
        super_res_first: UNet2DModel,
        super_res_last: UNet2DModel,
        prior_scheduler: UnCLIPScheduler,
        decoder_scheduler: UnCLIPScheduler,
        super_res_scheduler: UnCLIPScheduler,
    ):
        super().__init__()

        self.register_modules(
            prior=prior,
            decoder=decoder,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            text_proj=text_proj,
            super_res_first=super_res_first,
            super_res_last=super_res_last,
            prior_scheduler=prior_scheduler,
            decoder_scheduler=decoder_scheduler,
            super_res_scheduler=super_res_scheduler,
        )

    def prepare_latents(self, shape, dtype, generator, latents, scheduler):
        if latents is None:
            latents = randn_tensor(shape, generator=generator, dtype=dtype)
        else:
            if latents.shape != shape:
                raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")

        latents = (latents * scheduler.init_noise_sigma).to(dtype)
        return latents

    def _encode_prompt(
        self,
        prompt,
        num_images_per_prompt,
        do_classifier_free_guidance,
        text_model_output: Optional[Union[CLIPTextModelOutput, Tuple]] = None,
        text_attention_mask: Optional[ms.Tensor] = None,
    ):
        if text_model_output is None:
            batch_size = len(prompt) if isinstance(prompt, list) else 1
            # get prompt text embeddings
            text_inputs = self.tokenizer(
                prompt,
                padding="max_length",
                max_length=self.tokenizer.model_max_length,
                truncation=True,
                return_tensors="np",
            )
            text_input_ids = text_inputs.input_ids
            text_mask = ms.Tensor.from_numpy(text_inputs.attention_mask)  # MindSpore mask does not require bool()

            untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="np").input_ids

            if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not np.array_equal(
                text_input_ids, untruncated_ids
            ):
                removed_text = self.tokenizer.batch_decode(untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1])
                logger.warning(
                    "The following part of your input was truncated because CLIP can only handle sequences up to"
                    f" {self.tokenizer.model_max_length} tokens: {removed_text}"
                )
                text_input_ids = text_input_ids[:, : self.tokenizer.model_max_length]

            text_encoder_output = self.text_encoder(ms.Tensor(text_input_ids))

            prompt_embeds = text_encoder_output[0]
            text_enc_hid_states = text_encoder_output[1]

        else:
            batch_size = text_model_output[0].shape[0]
            prompt_embeds, text_enc_hid_states = text_model_output[0], text_model_output[1]
            text_mask = text_attention_mask

        prompt_embeds = prompt_embeds.repeat_interleave(num_images_per_prompt, dim=0)
        text_enc_hid_states = text_enc_hid_states.repeat_interleave(num_images_per_prompt, dim=0)
        text_mask = text_mask.repeat_interleave(num_images_per_prompt, dim=0)

        if do_classifier_free_guidance:
            uncond_tokens = [""] * batch_size

            uncond_input = self.tokenizer(
                uncond_tokens,
                padding="max_length",
                max_length=self.tokenizer.model_max_length,
                truncation=True,
                return_tensors="np",
            )
            uncond_text_mask = ms.Tensor.from_numpy(
                uncond_input.attention_mask
            )  # MindSpore mask does not require bool()
            negative_prompt_embeds_text_encoder_output = self.text_encoder(ms.Tensor.from_numpy(uncond_input.input_ids))

            negative_prompt_embeds = negative_prompt_embeds_text_encoder_output[0]
            uncond_text_enc_hid_states = negative_prompt_embeds_text_encoder_output[1]

            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method

            seq_len = negative_prompt_embeds.shape[1]
            negative_prompt_embeds = negative_prompt_embeds.tile((1, num_images_per_prompt))
            negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len)

            seq_len = uncond_text_enc_hid_states.shape[1]
            uncond_text_enc_hid_states = uncond_text_enc_hid_states.tile((1, num_images_per_prompt, 1))
            uncond_text_enc_hid_states = uncond_text_enc_hid_states.view(
                batch_size * num_images_per_prompt, seq_len, -1
            )
            uncond_text_mask = uncond_text_mask.repeat_interleave(num_images_per_prompt, dim=0)

            # done duplicates

            # For classifier free guidance, we need to do two forward passes.
            # Here we concatenate the unconditional and text embeddings into a single batch
            # to avoid doing two forward passes
            prompt_embeds = ops.cat([negative_prompt_embeds, prompt_embeds])
            text_enc_hid_states = ops.cat([uncond_text_enc_hid_states, text_enc_hid_states])

            text_mask = ops.cat([uncond_text_mask, text_mask])

        return prompt_embeds, text_enc_hid_states, text_mask

    def __call__(
        self,
        prompt: Optional[Union[str, List[str]]] = None,
        num_images_per_prompt: int = 1,
        prior_num_inference_steps: int = 25,
        decoder_num_inference_steps: int = 25,
        super_res_num_inference_steps: int = 7,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        prior_latents: Optional[ms.Tensor] = None,
        decoder_latents: Optional[ms.Tensor] = None,
        super_res_latents: Optional[ms.Tensor] = None,
        text_model_output: Optional[Union[CLIPTextModelOutput, Tuple]] = None,
        text_attention_mask: Optional[ms.Tensor] = None,
        prior_guidance_scale: float = 4.0,
        decoder_guidance_scale: float = 8.0,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
    ):
        """
        The call function to the pipeline for generation.

        Args:
            prompt (`str` or `List[str]`):
                The prompt or prompts to guide image generation. This can only be left undefined if `text_model_output`
                and `text_attention_mask` are passed.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            prior_num_inference_steps (`int`, *optional*, defaults to 25):
                The number of denoising steps for the prior. More denoising steps usually lead to a higher quality
                image at the expense of slower inference.
            decoder_num_inference_steps (`int`, *optional*, defaults to 25):
                The number of denoising steps for the decoder. More denoising steps usually lead to a higher quality
                image at the expense of slower inference.
            super_res_num_inference_steps (`int`, *optional*, defaults to 7):
                The number of denoising steps for super resolution. More denoising steps usually lead to a higher
                quality image at the expense of slower inference.
            generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
                A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make
                generation deterministic.
            prior_latents (`ms.Tensor` of shape (batch size, embeddings dimension), *optional*):
                Pre-generated noisy latents to be used as inputs for the prior.
            decoder_latents (`ms.Tensor` of shape (batch size, channels, height, width), *optional*):
                Pre-generated noisy latents to be used as inputs for the decoder.
            super_res_latents (`ms.Tensor` of shape (batch size, channels, super res height, super res width), *optional*):
                Pre-generated noisy latents to be used as inputs for the super resolution.
            prior_guidance_scale (`float`, *optional*, defaults to 4.0):
                A higher guidance scale value encourages the model to generate images closely linked to the text
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            decoder_guidance_scale (`float`, *optional*, defaults to 8.0):
                A higher guidance scale value encourages the model to generate images closely linked to the text
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            text_model_output (`CLIPTextModelOutput`, *optional*):
                Pre-defined [`CLIPTextModel`] outputs that can be derived from the text encoder. Pre-defined text
                outputs can be passed for tasks like text embedding interpolations. Make sure to also pass
                `text_attention_mask` in this case. `prompt` can then be left `None`.
            text_attention_mask (`ms.Tensor`, *optional*):
                Pre-defined CLIP text attention mask that can be derived from the tokenizer. Pre-defined text attention
                masks are necessary when passing `text_model_output`.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between `PIL.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.

        Returns:
            [`~pipelines.ImagePipelineOutput`] or `tuple`:
                If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
                returned where the first element is a list with the generated images.
        """
        if prompt is not None:
            if isinstance(prompt, str):
                batch_size = 1
            elif isinstance(prompt, list):
                batch_size = len(prompt)
            else:
                raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
        else:
            batch_size = text_model_output[0].shape[0]

        batch_size = batch_size * num_images_per_prompt

        do_classifier_free_guidance = prior_guidance_scale > 1.0 or decoder_guidance_scale > 1.0

        prompt_embeds, text_enc_hid_states, text_mask = self._encode_prompt(
            prompt, num_images_per_prompt, do_classifier_free_guidance, text_model_output, text_attention_mask
        )

        # prior

        self.prior_scheduler.set_timesteps(prior_num_inference_steps)
        prior_timesteps_tensor = self.prior_scheduler.timesteps

        embedding_dim = self.prior.config.embedding_dim

        prior_latents = self.prepare_latents(
            (batch_size, embedding_dim),
            prompt_embeds.dtype,
            generator,
            prior_latents,
            self.prior_scheduler,
        )

        for i, t in enumerate(self.progress_bar(prior_timesteps_tensor)):
            # expand the latents if we are doing classifier free guidance
            latent_model_input = ops.cat([prior_latents] * 2) if do_classifier_free_guidance else prior_latents

            predicted_image_embedding = self.prior(
                latent_model_input,
                timestep=t,
                proj_embedding=prompt_embeds,
                encoder_hidden_states=text_enc_hid_states,
                attention_mask=text_mask,
            )[0]

            if do_classifier_free_guidance:
                predicted_image_embedding_uncond, predicted_image_embedding_text = predicted_image_embedding.chunk(2)
                predicted_image_embedding = predicted_image_embedding_uncond + prior_guidance_scale * (
                    predicted_image_embedding_text - predicted_image_embedding_uncond
                )

            if i + 1 == prior_timesteps_tensor.shape[0]:
                prev_timestep = None
            else:
                prev_timestep = prior_timesteps_tensor[i + 1]

            prior_latents = self.prior_scheduler.step(
                predicted_image_embedding,
                timestep=t,
                sample=prior_latents,
                generator=generator,
                prev_timestep=prev_timestep,
            )[0]

        prior_latents = self.prior.post_process_latents(prior_latents)

        image_embeddings = prior_latents

        # done prior

        # decoder

        text_enc_hid_states, additive_clip_time_embeddings = self.text_proj(
            image_embeddings=image_embeddings,
            prompt_embeds=prompt_embeds,
            text_encoder_hidden_states=text_enc_hid_states,
            do_classifier_free_guidance=do_classifier_free_guidance,
        )

        decoder_text_mask = ops.pad(text_mask, (self.text_proj.clip_extra_context_tokens, 0), value=1.0)

        self.decoder_scheduler.set_timesteps(decoder_num_inference_steps)
        decoder_timesteps_tensor = self.decoder_scheduler.timesteps

        num_channels_latents = self.decoder.config.in_channels
        height = self.decoder.config.sample_size
        width = self.decoder.config.sample_size

        decoder_latents = self.prepare_latents(
            (batch_size, num_channels_latents, height, width),
            text_enc_hid_states.dtype,
            generator,
            decoder_latents,
            self.decoder_scheduler,
        )

        for i, t in enumerate(self.progress_bar(decoder_timesteps_tensor)):
            # expand the latents if we are doing classifier free guidance
            latent_model_input = ops.cat([decoder_latents] * 2) if do_classifier_free_guidance else decoder_latents

            noise_pred = self.decoder(
                sample=latent_model_input,
                timestep=t,
                encoder_hidden_states=text_enc_hid_states,
                class_labels=additive_clip_time_embeddings,
                attention_mask=decoder_text_mask,
            )[0]

            if do_classifier_free_guidance:
                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                noise_pred_uncond, _ = noise_pred_uncond.split(latent_model_input.shape[1], axis=1)
                noise_pred_text, predicted_variance = noise_pred_text.split(latent_model_input.shape[1], axis=1)
                noise_pred = noise_pred_uncond + decoder_guidance_scale * (noise_pred_text - noise_pred_uncond)
                noise_pred = ops.cat([noise_pred, predicted_variance], axis=1)

            if i + 1 == decoder_timesteps_tensor.shape[0]:
                prev_timestep = None
            else:
                prev_timestep = decoder_timesteps_tensor[i + 1]

            # compute the previous noisy sample x_t -> x_t-1
            decoder_latents = self.decoder_scheduler.step(
                noise_pred, t, decoder_latents, prev_timestep=prev_timestep, generator=generator
            )[0]

        decoder_latents = decoder_latents.clamp(-1, 1)

        image_small = decoder_latents

        # done decoder

        # super res

        self.super_res_scheduler.set_timesteps(super_res_num_inference_steps)
        super_res_timesteps_tensor = self.super_res_scheduler.timesteps

        channels = self.super_res_first.config.in_channels // 2
        height = self.super_res_first.config.sample_size
        width = self.super_res_first.config.sample_size

        super_res_latents = self.prepare_latents(
            (batch_size, channels, height, width),
            image_small.dtype,
            generator,
            super_res_latents,
            self.super_res_scheduler,
        )

        interpolate_antialias = {}
        if "antialias" in inspect.signature(ops.interpolate).parameters:
            interpolate_antialias["antialias"] = True

        image_upscaled = ops.interpolate(
            image_small, size=[height, width], mode="bicubic", align_corners=False, **interpolate_antialias
        )

        for i, t in enumerate(self.progress_bar(super_res_timesteps_tensor)):
            # no classifier free guidance

            if i == super_res_timesteps_tensor.shape[0] - 1:
                unet = self.super_res_last
            else:
                unet = self.super_res_first

            latent_model_input = ops.cat([super_res_latents, image_upscaled], axis=1)

            noise_pred = unet(
                sample=latent_model_input,
                timestep=t,
            )[0]

            if i + 1 == super_res_timesteps_tensor.shape[0]:
                prev_timestep = None
            else:
                prev_timestep = super_res_timesteps_tensor[i + 1]

            # compute the previous noisy sample x_t -> x_t-1
            super_res_latents = self.super_res_scheduler.step(
                noise_pred, t, super_res_latents, prev_timestep=prev_timestep, generator=generator
            )[0]

        image = super_res_latents
        # done super res

        # post processing
        image = image * 0.5 + 0.5
        image = image.clamp(0, 1)
        image = image.permute(0, 2, 3, 1).float().numpy()

        if output_type == "pil":
            image = self.numpy_to_pil(image)

        if not return_dict:
            return (image,)

        return ImagePipelineOutput(images=image)

mindone.diffusers.UnCLIPPipeline.__call__(prompt=None, num_images_per_prompt=1, prior_num_inference_steps=25, decoder_num_inference_steps=25, super_res_num_inference_steps=7, generator=None, prior_latents=None, decoder_latents=None, super_res_latents=None, text_model_output=None, text_attention_mask=None, prior_guidance_scale=4.0, decoder_guidance_scale=8.0, output_type='pil', return_dict=False)

The call function to the pipeline for generation.

PARAMETER DESCRIPTION
prompt

The prompt or prompts to guide image generation. This can only be left undefined if text_model_output and text_attention_mask are passed.

TYPE: `str` or `List[str]` DEFAULT: None

num_images_per_prompt

The number of images to generate per prompt.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

prior_num_inference_steps

The number of denoising steps for the prior. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 25 DEFAULT: 25

decoder_num_inference_steps

The number of denoising steps for the decoder. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 25 DEFAULT: 25

super_res_num_inference_steps

The number of denoising steps for super resolution. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 7 DEFAULT: 7

generator

A np.random.Generator to make generation deterministic.

TYPE: `np.random.Generator` or `List[np.random.Generator]`, *optional* DEFAULT: None

prior_latents

Pre-generated noisy latents to be used as inputs for the prior.

TYPE: `ms.Tensor` of shape (batch size, embeddings dimension), *optional* DEFAULT: None

decoder_latents

Pre-generated noisy latents to be used as inputs for the decoder.

TYPE: `ms.Tensor` of shape (batch size, channels, height, width), *optional* DEFAULT: None

super_res_latents

Pre-generated noisy latents to be used as inputs for the super resolution.

TYPE: `ms.Tensor` of shape (batch size, channels, super res height, super res width), *optional* DEFAULT: None

prior_guidance_scale

A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

TYPE: `float`, *optional*, defaults to 4.0 DEFAULT: 4.0

decoder_guidance_scale

A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

TYPE: `float`, *optional*, defaults to 8.0 DEFAULT: 8.0

text_model_output

Pre-defined [CLIPTextModel] outputs that can be derived from the text encoder. Pre-defined text outputs can be passed for tasks like text embedding interpolations. Make sure to also pass text_attention_mask in this case. prompt can then be left None.

TYPE: `CLIPTextModelOutput`, *optional* DEFAULT: None

text_attention_mask

Pre-defined CLIP text attention mask that can be derived from the tokenizer. Pre-defined text attention masks are necessary when passing text_model_output.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

output_type

The output format of the generated image. Choose between PIL.Image or np.array.

TYPE: `str`, *optional*, defaults to `"pil"` DEFAULT: 'pil'

return_dict

Whether or not to return a [~pipelines.ImagePipelineOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

RETURNS DESCRIPTION

[~pipelines.ImagePipelineOutput] or tuple: If return_dict is True, [~pipelines.ImagePipelineOutput] is returned, otherwise a tuple is returned where the first element is a list with the generated images.
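
As an illustrative sketch of these arguments (checkpoint name assumed, loading API assumed to mirror 🤗 Diffusers), the call below sets the step counts and guidance scales explicitly and requests an `ImagePipelineOutput`:

```python
import numpy as np
from mindone.diffusers import UnCLIPPipeline

pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha")  # checkpoint name assumed

# The pipeline takes a NumPy generator (not a torch.Generator) for reproducibility.
generator = np.random.default_rng(0)

result = pipe(
    prompt="an oil painting of a lighthouse at dusk",
    num_images_per_prompt=2,
    prior_num_inference_steps=25,
    decoder_num_inference_steps=25,
    super_res_num_inference_steps=7,
    prior_guidance_scale=4.0,
    decoder_guidance_scale=8.0,
    generator=generator,
    return_dict=True,  # return an ImagePipelineOutput instead of a plain tuple
)
result.images[0].save("lighthouse.png")
```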

Source code in mindone/diffusers/pipelines/unclip/pipeline_unclip.py
def __call__(
    self,
    prompt: Optional[Union[str, List[str]]] = None,
    num_images_per_prompt: int = 1,
    prior_num_inference_steps: int = 25,
    decoder_num_inference_steps: int = 25,
    super_res_num_inference_steps: int = 7,
    generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
    prior_latents: Optional[ms.Tensor] = None,
    decoder_latents: Optional[ms.Tensor] = None,
    super_res_latents: Optional[ms.Tensor] = None,
    text_model_output: Optional[Union[CLIPTextModelOutput, Tuple]] = None,
    text_attention_mask: Optional[ms.Tensor] = None,
    prior_guidance_scale: float = 4.0,
    decoder_guidance_scale: float = 8.0,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
):
    """
    The call function to the pipeline for generation.

    Args:
        prompt (`str` or `List[str]`):
            The prompt or prompts to guide image generation. This can only be left undefined if `text_model_output`
            and `text_attention_mask` are passed.
        num_images_per_prompt (`int`, *optional*, defaults to 1):
            The number of images to generate per prompt.
        prior_num_inference_steps (`int`, *optional*, defaults to 25):
            The number of denoising steps for the prior. More denoising steps usually lead to a higher quality
            image at the expense of slower inference.
        decoder_num_inference_steps (`int`, *optional*, defaults to 25):
            The number of denoising steps for the decoder. More denoising steps usually lead to a higher quality
            image at the expense of slower inference.
        super_res_num_inference_steps (`int`, *optional*, defaults to 7):
            The number of denoising steps for super resolution. More denoising steps usually lead to a higher
            quality image at the expense of slower inference.
        generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
            A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make
            generation deterministic.
        prior_latents (`ms.Tensor` of shape (batch size, embeddings dimension), *optional*):
            Pre-generated noisy latents to be used as inputs for the prior.
        decoder_latents (`ms.Tensor` of shape (batch size, channels, height, width), *optional*):
            Pre-generated noisy latents to be used as inputs for the decoder.
        super_res_latents (`ms.Tensor` of shape (batch size, channels, super res height, super res width), *optional*):
            Pre-generated noisy latents to be used as inputs for the super resolution.
        prior_guidance_scale (`float`, *optional*, defaults to 4.0):
            A higher guidance scale value encourages the model to generate images closely linked to the text
            `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
        decoder_guidance_scale (`float`, *optional*, defaults to 8.0):
            A higher guidance scale value encourages the model to generate images closely linked to the text
            `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
        text_model_output (`CLIPTextModelOutput`, *optional*):
            Pre-defined [`CLIPTextModel`] outputs that can be derived from the text encoder. Pre-defined text
            outputs can be passed for tasks like text embedding interpolations. Make sure to also pass
            `text_attention_mask` in this case. `prompt` can then be left `None`.
        text_attention_mask (`ms.Tensor`, *optional*):
            Pre-defined CLIP text attention mask that can be derived from the tokenizer. Pre-defined text attention
            masks are necessary when passing `text_model_output`.
        output_type (`str`, *optional*, defaults to `"pil"`):
            The output format of the generated image. Choose between `PIL.Image` or `np.array`.
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.

    Returns:
        [`~pipelines.ImagePipelineOutput`] or `tuple`:
            If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
            returned where the first element is a list with the generated images.
    """
    if prompt is not None:
        if isinstance(prompt, str):
            batch_size = 1
        elif isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
    else:
        batch_size = text_model_output[0].shape[0]

    batch_size = batch_size * num_images_per_prompt

    do_classifier_free_guidance = prior_guidance_scale > 1.0 or decoder_guidance_scale > 1.0

    prompt_embeds, text_enc_hid_states, text_mask = self._encode_prompt(
        prompt, num_images_per_prompt, do_classifier_free_guidance, text_model_output, text_attention_mask
    )

    # prior

    self.prior_scheduler.set_timesteps(prior_num_inference_steps)
    prior_timesteps_tensor = self.prior_scheduler.timesteps

    embedding_dim = self.prior.config.embedding_dim

    prior_latents = self.prepare_latents(
        (batch_size, embedding_dim),
        prompt_embeds.dtype,
        generator,
        prior_latents,
        self.prior_scheduler,
    )

    for i, t in enumerate(self.progress_bar(prior_timesteps_tensor)):
        # expand the latents if we are doing classifier free guidance
        latent_model_input = ops.cat([prior_latents] * 2) if do_classifier_free_guidance else prior_latents

        predicted_image_embedding = self.prior(
            latent_model_input,
            timestep=t,
            proj_embedding=prompt_embeds,
            encoder_hidden_states=text_enc_hid_states,
            attention_mask=text_mask,
        )[0]

        if do_classifier_free_guidance:
            predicted_image_embedding_uncond, predicted_image_embedding_text = predicted_image_embedding.chunk(2)
            predicted_image_embedding = predicted_image_embedding_uncond + prior_guidance_scale * (
                predicted_image_embedding_text - predicted_image_embedding_uncond
            )

        if i + 1 == prior_timesteps_tensor.shape[0]:
            prev_timestep = None
        else:
            prev_timestep = prior_timesteps_tensor[i + 1]

        prior_latents = self.prior_scheduler.step(
            predicted_image_embedding,
            timestep=t,
            sample=prior_latents,
            generator=generator,
            prev_timestep=prev_timestep,
        )[0]

    prior_latents = self.prior.post_process_latents(prior_latents)

    image_embeddings = prior_latents

    # done prior

    # decoder

    text_enc_hid_states, additive_clip_time_embeddings = self.text_proj(
        image_embeddings=image_embeddings,
        prompt_embeds=prompt_embeds,
        text_encoder_hidden_states=text_enc_hid_states,
        do_classifier_free_guidance=do_classifier_free_guidance,
    )

    decoder_text_mask = ops.pad(text_mask, (self.text_proj.clip_extra_context_tokens, 0), value=1.0)

    self.decoder_scheduler.set_timesteps(decoder_num_inference_steps)
    decoder_timesteps_tensor = self.decoder_scheduler.timesteps

    num_channels_latents = self.decoder.config.in_channels
    height = self.decoder.config.sample_size
    width = self.decoder.config.sample_size

    decoder_latents = self.prepare_latents(
        (batch_size, num_channels_latents, height, width),
        text_enc_hid_states.dtype,
        generator,
        decoder_latents,
        self.decoder_scheduler,
    )

    for i, t in enumerate(self.progress_bar(decoder_timesteps_tensor)):
        # expand the latents if we are doing classifier free guidance
        latent_model_input = ops.cat([decoder_latents] * 2) if do_classifier_free_guidance else decoder_latents

        noise_pred = self.decoder(
            sample=latent_model_input,
            timestep=t,
            encoder_hidden_states=text_enc_hid_states,
            class_labels=additive_clip_time_embeddings,
            attention_mask=decoder_text_mask,
        )[0]

        if do_classifier_free_guidance:
            noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
            noise_pred_uncond, _ = noise_pred_uncond.split(latent_model_input.shape[1], axis=1)
            noise_pred_text, predicted_variance = noise_pred_text.split(latent_model_input.shape[1], axis=1)
            noise_pred = noise_pred_uncond + decoder_guidance_scale * (noise_pred_text - noise_pred_uncond)
            noise_pred = ops.cat([noise_pred, predicted_variance], axis=1)

        if i + 1 == decoder_timesteps_tensor.shape[0]:
            prev_timestep = None
        else:
            prev_timestep = decoder_timesteps_tensor[i + 1]

        # compute the previous noisy sample x_t -> x_t-1
        decoder_latents = self.decoder_scheduler.step(
            noise_pred, t, decoder_latents, prev_timestep=prev_timestep, generator=generator
        )[0]

    decoder_latents = decoder_latents.clamp(-1, 1)

    image_small = decoder_latents

    # done decoder

    # super res

    self.super_res_scheduler.set_timesteps(super_res_num_inference_steps)
    super_res_timesteps_tensor = self.super_res_scheduler.timesteps

    channels = self.super_res_first.config.in_channels // 2
    height = self.super_res_first.config.sample_size
    width = self.super_res_first.config.sample_size

    super_res_latents = self.prepare_latents(
        (batch_size, channels, height, width),
        image_small.dtype,
        generator,
        super_res_latents,
        self.super_res_scheduler,
    )

    interpolate_antialias = {}
    if "antialias" in inspect.signature(ops.interpolate).parameters:
        interpolate_antialias["antialias"] = True

    image_upscaled = ops.interpolate(
        image_small, size=[height, width], mode="bicubic", align_corners=False, **interpolate_antialias
    )

    for i, t in enumerate(self.progress_bar(super_res_timesteps_tensor)):
        # no classifier free guidance

        if i == super_res_timesteps_tensor.shape[0] - 1:
            unet = self.super_res_last
        else:
            unet = self.super_res_first

        latent_model_input = ops.cat([super_res_latents, image_upscaled], axis=1)

        noise_pred = unet(
            sample=latent_model_input,
            timestep=t,
        )[0]

        if i + 1 == super_res_timesteps_tensor.shape[0]:
            prev_timestep = None
        else:
            prev_timestep = super_res_timesteps_tensor[i + 1]

        # compute the previous noisy sample x_t -> x_t-1
        super_res_latents = self.super_res_scheduler.step(
            noise_pred, t, super_res_latents, prev_timestep=prev_timestep, generator=generator
        )[0]

    image = super_res_latents
    # done super res

    # post processing
    image = image * 0.5 + 0.5
    image = image.clamp(0, 1)
    image = image.permute(0, 2, 3, 1).float().numpy()

    if output_type == "pil":
        image = self.numpy_to_pil(image)

    if not return_dict:
        return (image,)

    return ImagePipelineOutput(images=image)

mindone.diffusers.UnCLIPImageVariationPipeline

Bases: DiffusionPipeline

Pipeline to generate image variations from an input image using UnCLIP.

This model inherits from [DiffusionPipeline]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

PARAMETER DESCRIPTION
text_encoder

Frozen text-encoder.

TYPE: [`~transformers.CLIPTextModelWithProjection`]

tokenizer

A CLIPTokenizer to tokenize text.

TYPE: [`~transformers.CLIPTokenizer`]

feature_extractor

Model that extracts features from generated images to be used as inputs for the image_encoder.

TYPE: [`~transformers.CLIPImageProcessor`]

image_encoder

Frozen CLIP image-encoder (clip-vit-large-patch14).

TYPE: [`~transformers.CLIPVisionModelWithProjection`]

text_proj

Utility class to prepare and combine the embeddings before they are passed to the decoder.

TYPE: [`UnCLIPTextProjModel`]

decoder

The decoder to invert the image embedding into an image.

TYPE: [`UNet2DConditionModel`]

super_res_first

Super resolution UNet. Used in all but the last step of the super resolution diffusion process.

TYPE: [`UNet2DModel`]

super_res_last

Super resolution UNet. Used in the last step of the super resolution diffusion process.

TYPE: [`UNet2DModel`]

decoder_scheduler

Scheduler used in the decoder denoising process (a modified [DDPMScheduler]).

TYPE: [`UnCLIPScheduler`]

super_res_scheduler

Scheduler used in the super resolution denoising process (a modified [DDPMScheduler]).

TYPE: [`UnCLIPScheduler`]
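
A minimal image-variation sketch follows. The repo id is taken from the `feature_extractor` config referenced in the docstring and is an assumption; loading is assumed to mirror 🤗 Diffusers.

```python
from PIL import Image
from mindone.diffusers import UnCLIPImageVariationPipeline

# Repo id assumed from the feature_extractor config referenced in the docstring below.
pipe = UnCLIPImageVariationPipeline.from_pretrained("fusing/karlo-image-variations-diffusers")

init_image = Image.open("input.png").convert("RGB")

# `return_dict` defaults to False, so index into the returned tuple for the image list.
variations = pipe(image=init_image, num_images_per_prompt=2, decoder_guidance_scale=8.0)[0]
variations[0].save("variation_0.png")
```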

Source code in mindone/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py
class UnCLIPImageVariationPipeline(DiffusionPipeline):
    """
    Pipeline to generate image variations from an input image using UnCLIP.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        text_encoder ([`~transformers.CLIPTextModelWithProjection`]):
            Frozen text-encoder.
        tokenizer ([`~transformers.CLIPTokenizer`]):
            A `CLIPTokenizer` to tokenize text.
        feature_extractor ([`~transformers.CLIPImageProcessor`]):
            Model that extracts features from generated images to be used as inputs for the `image_encoder`.
        image_encoder ([`~transformers.CLIPVisionModelWithProjection`]):
            Frozen CLIP image-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
        text_proj ([`UnCLIPTextProjModel`]):
            Utility class to prepare and combine the embeddings before they are passed to the decoder.
        decoder ([`UNet2DConditionModel`]):
            The decoder to invert the image embedding into an image.
        super_res_first ([`UNet2DModel`]):
            Super resolution UNet. Used in all but the last step of the super resolution diffusion process.
        super_res_last ([`UNet2DModel`]):
            Super resolution UNet. Used in the last step of the super resolution diffusion process.
        decoder_scheduler ([`UnCLIPScheduler`]):
            Scheduler used in the decoder denoising process (a modified [`DDPMScheduler`]).
        super_res_scheduler ([`UnCLIPScheduler`]):
            Scheduler used in the super resolution denoising process (a modified [`DDPMScheduler`]).
    """

    decoder: UNet2DConditionModel
    text_proj: UnCLIPTextProjModel
    text_encoder: CLIPTextModelWithProjection
    tokenizer: CLIPTokenizer
    feature_extractor: CLIPImageProcessor
    image_encoder: CLIPVisionModelWithProjection
    super_res_first: UNet2DModel
    super_res_last: UNet2DModel

    decoder_scheduler: UnCLIPScheduler
    super_res_scheduler: UnCLIPScheduler
    model_cpu_offload_seq = "text_encoder->image_encoder->text_proj->decoder->super_res_first->super_res_last"

    def __init__(
        self,
        decoder: UNet2DConditionModel,
        text_encoder: CLIPTextModelWithProjection,
        tokenizer: CLIPTokenizer,
        text_proj: UnCLIPTextProjModel,
        feature_extractor: CLIPImageProcessor,
        image_encoder: CLIPVisionModelWithProjection,
        super_res_first: UNet2DModel,
        super_res_last: UNet2DModel,
        decoder_scheduler: UnCLIPScheduler,
        super_res_scheduler: UnCLIPScheduler,
    ):
        super().__init__()

        self.register_modules(
            decoder=decoder,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            text_proj=text_proj,
            feature_extractor=feature_extractor,
            image_encoder=image_encoder,
            super_res_first=super_res_first,
            super_res_last=super_res_last,
            decoder_scheduler=decoder_scheduler,
            super_res_scheduler=super_res_scheduler,
        )

    # Copied from diffusers.pipelines.unclip.pipeline_unclip.UnCLIPPipeline.prepare_latents
    def prepare_latents(self, shape, dtype, generator, latents, scheduler):
        if latents is None:
            latents = randn_tensor(shape, generator=generator, dtype=dtype)
        else:
            if latents.shape != shape:
                raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")

        latents = (latents * scheduler.init_noise_sigma).to(dtype)
        return latents

    def _encode_prompt(self, prompt, num_images_per_prompt, do_classifier_free_guidance):
        batch_size = len(prompt) if isinstance(prompt, list) else 1

        # get prompt text embeddings
        text_inputs = self.tokenizer(
            prompt,
            padding="max_length",
            max_length=self.tokenizer.model_max_length,
            return_tensors="np",
        )
        text_input_ids = ms.Tensor.from_numpy(text_inputs.input_ids)
        text_mask = ms.Tensor.from_numpy(text_inputs.attention_mask)  # MindSpore mask does not require bool()
        text_encoder_output = self.text_encoder(text_input_ids)

        prompt_embeds = text_encoder_output[0]
        text_encoder_hidden_states = text_encoder_output[1]

        prompt_embeds = prompt_embeds.repeat_interleave(num_images_per_prompt, dim=0)
        text_encoder_hidden_states = text_encoder_hidden_states.repeat_interleave(num_images_per_prompt, dim=0)
        text_mask = text_mask.repeat_interleave(num_images_per_prompt, dim=0)

        if do_classifier_free_guidance:
            uncond_tokens = [""] * batch_size

            max_length = text_input_ids.shape[-1]
            uncond_input = self.tokenizer(
                uncond_tokens,
                padding="max_length",
                max_length=max_length,
                truncation=True,
                return_tensors="np",
            )
            uncond_text_mask = ms.Tensor.from_numpy(
                uncond_input.attention_mask
            )  # MindSpore mask does not require bool()
            negative_prompt_embeds_text_encoder_output = self.text_encoder(ms.Tensor.from_numpy(uncond_input.input_ids))

            negative_prompt_embeds = negative_prompt_embeds_text_encoder_output[0]
            uncond_text_encoder_hidden_states = negative_prompt_embeds_text_encoder_output[1]

            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method

            seq_len = negative_prompt_embeds.shape[1]
            negative_prompt_embeds = negative_prompt_embeds.tile((1, num_images_per_prompt))
            negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len)

            seq_len = uncond_text_encoder_hidden_states.shape[1]
            uncond_text_encoder_hidden_states = uncond_text_encoder_hidden_states.tile((1, num_images_per_prompt, 1))
            uncond_text_encoder_hidden_states = uncond_text_encoder_hidden_states.view(
                batch_size * num_images_per_prompt, seq_len, -1
            )
            uncond_text_mask = uncond_text_mask.repeat_interleave(num_images_per_prompt, dim=0)

            # done duplicates

            # For classifier free guidance, we need to do two forward passes.
            # Here we concatenate the unconditional and text embeddings into a single batch
            # to avoid doing two forward passes
            prompt_embeds = ops.cat([negative_prompt_embeds, prompt_embeds])
            text_encoder_hidden_states = ops.cat([uncond_text_encoder_hidden_states, text_encoder_hidden_states])

            text_mask = ops.cat([uncond_text_mask, text_mask])

        return prompt_embeds, text_encoder_hidden_states, text_mask

    def _encode_image(self, image, num_images_per_prompt, image_embeddings: Optional[ms.Tensor] = None):
        dtype = next(self.image_encoder.get_parameters()).dtype

        if image_embeddings is None:
            if not isinstance(image, ms.Tensor):
                image = self.feature_extractor(images=image, return_tensors="np").pixel_values
                image = ms.Tensor(image)

            image = image.to(dtype=dtype)
            image_embeddings = self.image_encoder(image)[0]

        image_embeddings = image_embeddings.repeat_interleave(num_images_per_prompt, dim=0)

        return image_embeddings

    def __call__(
        self,
        image: Optional[Union[PIL.Image.Image, List[PIL.Image.Image], ms.Tensor]] = None,
        num_images_per_prompt: int = 1,
        decoder_num_inference_steps: int = 25,
        super_res_num_inference_steps: int = 7,
        generator: Optional[np.random.Generator] = None,
        decoder_latents: Optional[ms.Tensor] = None,
        super_res_latents: Optional[ms.Tensor] = None,
        image_embeddings: Optional[ms.Tensor] = None,
        decoder_guidance_scale: float = 8.0,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
    ):
        """
        The call function to the pipeline for generation.

        Args:
            image (`PIL.Image.Image` or `List[PIL.Image.Image]` or `ms.Tensor`):
                `Image` or tensor representing an image batch to be used as the starting point. If you provide a
                tensor, it needs to be compatible with the [`CLIPImageProcessor`]
                [configuration](https://huggingface.co/fusing/karlo-image-variations-diffusers/blob/main/feature_extractor/preprocessor_config.json).
                Can be left as `None` only when `image_embeddings` are passed.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            decoder_num_inference_steps (`int`, *optional*, defaults to 25):
                The number of denoising steps for the decoder. More denoising steps usually lead to a higher quality
                image at the expense of slower inference.
            super_res_num_inference_steps (`int`, *optional*, defaults to 7):
                The number of denoising steps for super resolution. More denoising steps usually lead to a higher
                quality image at the expense of slower inference.
            generator (`np.random.Generator`, *optional*):
                A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make
                generation deterministic.
            decoder_latents (`ms.Tensor` of shape (batch size, channels, height, width), *optional*):
                Pre-generated noisy latents to be used as inputs for the decoder.
            super_res_latents (`ms.Tensor` of shape (batch size, channels, super res height, super res width), *optional*):
                Pre-generated noisy latents to be used as inputs for the super resolution.
            decoder_guidance_scale (`float`, *optional*, defaults to 8.0):
                A higher guidance scale value encourages the model to generate images closely linked to the text
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            image_embeddings (`ms.Tensor`, *optional*):
                Pre-defined image embeddings that can be derived from the image encoder. Pre-defined image embeddings
                can be passed for tasks like image interpolations. `image` can be left as `None`.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between `PIL.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.

        Returns:
            [`~pipelines.ImagePipelineOutput`] or `tuple`:
                If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
                returned where the first element is a list with the generated images.
        """
        if image is not None:
            if isinstance(image, PIL.Image.Image):
                batch_size = 1
            elif isinstance(image, list):
                batch_size = len(image)
            else:
                batch_size = image.shape[0]
        else:
            batch_size = image_embeddings.shape[0]

        prompt = [""] * batch_size

        batch_size = batch_size * num_images_per_prompt

        do_classifier_free_guidance = decoder_guidance_scale > 1.0

        prompt_embeds, text_encoder_hidden_states, text_mask = self._encode_prompt(
            prompt, num_images_per_prompt, do_classifier_free_guidance
        )

        image_embeddings = self._encode_image(image, num_images_per_prompt, image_embeddings)

        # decoder
        text_encoder_hidden_states, additive_clip_time_embeddings = self.text_proj(
            image_embeddings=image_embeddings,
            prompt_embeds=prompt_embeds,
            text_encoder_hidden_states=text_encoder_hidden_states,
            do_classifier_free_guidance=do_classifier_free_guidance,
        )

        decoder_text_mask = ops.pad(text_mask, (self.text_proj.clip_extra_context_tokens, 0), value=True)

        self.decoder_scheduler.set_timesteps(decoder_num_inference_steps)
        decoder_timesteps_tensor = self.decoder_scheduler.timesteps

        num_channels_latents = self.decoder.config.in_channels
        height = self.decoder.config.sample_size
        width = self.decoder.config.sample_size

        if decoder_latents is None:
            decoder_latents = self.prepare_latents(
                (batch_size, num_channels_latents, height, width),
                text_encoder_hidden_states.dtype,
                generator,
                decoder_latents,
                self.decoder_scheduler,
            )

        for i, t in enumerate(self.progress_bar(decoder_timesteps_tensor)):
            # expand the latents if we are doing classifier free guidance
            latent_model_input = ops.cat([decoder_latents] * 2) if do_classifier_free_guidance else decoder_latents

            noise_pred = self.decoder(
                sample=latent_model_input,
                timestep=t,
                encoder_hidden_states=text_encoder_hidden_states,
                class_labels=additive_clip_time_embeddings,
                attention_mask=decoder_text_mask,
            )[0]

            if do_classifier_free_guidance:
                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                noise_pred_uncond, _ = noise_pred_uncond.split(latent_model_input.shape[1], axis=1)
                noise_pred_text, predicted_variance = noise_pred_text.split(latent_model_input.shape[1], axis=1)
                noise_pred = noise_pred_uncond + decoder_guidance_scale * (noise_pred_text - noise_pred_uncond)
                noise_pred = ops.cat([noise_pred, predicted_variance], axis=1)

            if i + 1 == decoder_timesteps_tensor.shape[0]:
                prev_timestep = None
            else:
                prev_timestep = decoder_timesteps_tensor[i + 1]

            # compute the previous noisy sample x_t -> x_t-1
            decoder_latents = self.decoder_scheduler.step(
                noise_pred, t, decoder_latents, prev_timestep=prev_timestep, generator=generator
            )[0]

        decoder_latents = decoder_latents.clamp(-1, 1)

        image_small = decoder_latents

        # done decoder

        # super res

        self.super_res_scheduler.set_timesteps(super_res_num_inference_steps)
        super_res_timesteps_tensor = self.super_res_scheduler.timesteps

        channels = self.super_res_first.config.in_channels // 2
        height = self.super_res_first.config.sample_size
        width = self.super_res_first.config.sample_size

        if super_res_latents is None:
            super_res_latents = self.prepare_latents(
                (batch_size, channels, height, width),
                image_small.dtype,
                generator,
                super_res_latents,
                self.super_res_scheduler,
            )

        interpolate_antialias = {}
        if "antialias" in inspect.signature(ops.interpolate).parameters:
            interpolate_antialias["antialias"] = True

        image_upscaled = ops.interpolate(
            image_small, size=[height, width], mode="bicubic", align_corners=False, **interpolate_antialias
        )

        for i, t in enumerate(self.progress_bar(super_res_timesteps_tensor)):
            # no classifier free guidance

            if i == super_res_timesteps_tensor.shape[0] - 1:
                unet = self.super_res_last
            else:
                unet = self.super_res_first

            latent_model_input = ops.cat([super_res_latents, image_upscaled], axis=1)

            noise_pred = unet(
                sample=latent_model_input,
                timestep=t,
            )[0]

            if i + 1 == super_res_timesteps_tensor.shape[0]:
                prev_timestep = None
            else:
                prev_timestep = super_res_timesteps_tensor[i + 1]

            # compute the previous noisy sample x_t -> x_t-1
            super_res_latents = self.super_res_scheduler.step(
                noise_pred, t, super_res_latents, prev_timestep=prev_timestep, generator=generator
            )[0]

        image = super_res_latents

        # post processing

        image = image * 0.5 + 0.5
        image = image.clamp(0, 1)
        image = image.permute(0, 2, 3, 1).float().numpy()

        if output_type == "pil":
            image = self.numpy_to_pil(image)

        if not return_dict:
            return (image,)

        return ImagePipelineOutput(images=image)
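
A minimal usage sketch for the image-variation pipeline, assuming the fusing/karlo-image-variations-diffusers checkpoint linked in the parameter documentation below and a local input.png; parameters follow the __call__ signature documented next.

from PIL import Image

from mindone.diffusers import UnCLIPImageVariationPipeline

# Checkpoint name is taken from the feature-extractor config linked on this page; substitute your own if needed.
pipe = UnCLIPImageVariationPipeline.from_pretrained("fusing/karlo-image-variations-diffusers")

init_image = Image.open("input.png").convert("RGB")

# return_dict defaults to False, so the call returns a tuple whose first element is the list of images.
images = pipe(image=init_image, decoder_guidance_scale=8.0)[0]
images[0].save("variation.png")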

mindone.diffusers.UnCLIPImageVariationPipeline.__call__(image=None, num_images_per_prompt=1, decoder_num_inference_steps=25, super_res_num_inference_steps=7, generator=None, decoder_latents=None, super_res_latents=None, image_embeddings=None, decoder_guidance_scale=8.0, output_type='pil', return_dict=False)

The call function to the pipeline for generation.

PARAMETER DESCRIPTION
image

Image or tensor representing an image batch to be used as the starting point. If you provide a tensor, it needs to be compatible with the [`CLIPImageProcessor`] [configuration](https://huggingface.co/fusing/karlo-image-variations-diffusers/blob/main/feature_extractor/preprocessor_config.json). Can be left as None only when image_embeddings are passed.

TYPE: `PIL.Image.Image` or `List[PIL.Image.Image]` or `ms.Tensor` DEFAULT: None

num_images_per_prompt

The number of images to generate per prompt.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

decoder_num_inference_steps

The number of denoising steps for the decoder. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 25 DEFAULT: 25

super_res_num_inference_steps

The number of denoising steps for super resolution. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 7 DEFAULT: 7

generator

A np.random.Generator to make generation deterministic.

TYPE: `np.random.Generator`, *optional* DEFAULT: None

decoder_latents

Pre-generated noisy latents to be used as inputs for the decoder.

TYPE: `ms.Tensor` of shape (batch size, channels, height, width), *optional* DEFAULT: None

super_res_latents

Pre-generated noisy latents to be used as inputs for the super resolution.

TYPE: `ms.Tensor` of shape (batch size, channels, super res height, super res width), *optional* DEFAULT: None

decoder_guidance_scale

A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

TYPE: `float`, *optional*, defaults to 8.0 DEFAULT: 8.0

image_embeddings

Pre-defined image embeddings that can be derived from the image encoder. Pre-defined image embeddings can be passed for tasks like image interpolations. image can be left as None.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

output_type

The output format of the generated image. Choose between PIL.Image and np.array.

TYPE: `str`, *optional*, defaults to `"pil"` DEFAULT: 'pil'

return_dict

Whether or not to return a [~pipelines.ImagePipelineOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

RETURNS DESCRIPTION

[~pipelines.ImagePipelineOutput] or tuple: If return_dict is True, [~pipelines.ImagePipelineOutput] is returned, otherwise a tuple is returned where the first element is a list with the generated images.
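
Because image_embeddings can replace image, the pipeline can decode arbitrary points in CLIP embedding space, for example a blend of two images. The sketch below reuses the pipe object from the earlier example and assumes a hypothetical helper get_clip_image_embedding that returns a (1, embed_dim) ms.Tensor; only the blending and the image_embeddings= call come from this reference.

# Hypothetical helper: encodes an image file into a (1, embed_dim) CLIP image embedding
# with the same image encoder the checkpoint was trained against.
emb_a = get_clip_image_embedding("cat.png")
emb_b = get_clip_image_embedding("dog.png")

alpha = 0.5  # 0.0 -> pure emb_a, 1.0 -> pure emb_b
emb_mid = (1 - alpha) * emb_a + alpha * emb_b

# `image` stays None because pre-computed embeddings are passed instead.
images = pipe(image_embeddings=emb_mid)[0]
images[0].save("interpolation_midpoint.png")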

Source code in mindone/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py
def __call__(
    self,
    image: Optional[Union[PIL.Image.Image, List[PIL.Image.Image], ms.Tensor]] = None,
    num_images_per_prompt: int = 1,
    decoder_num_inference_steps: int = 25,
    super_res_num_inference_steps: int = 7,
    generator: Optional[np.random.Generator] = None,
    decoder_latents: Optional[ms.Tensor] = None,
    super_res_latents: Optional[ms.Tensor] = None,
    image_embeddings: Optional[ms.Tensor] = None,
    decoder_guidance_scale: float = 8.0,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
):
    """
    The call function to the pipeline for generation.

    Args:
        image (`PIL.Image.Image` or `List[PIL.Image.Image]` or `ms.Tensor`):
            `Image` or tensor representing an image batch to be used as the starting point. If you provide a
            tensor, it needs to be compatible with the [`CLIPImageProcessor`]
            [configuration](https://huggingface.co/fusing/karlo-image-variations-diffusers/blob/main/feature_extractor/preprocessor_config.json).
            Can be left as `None` only when `image_embeddings` are passed.
        num_images_per_prompt (`int`, *optional*, defaults to 1):
            The number of images to generate per prompt.
        decoder_num_inference_steps (`int`, *optional*, defaults to 25):
            The number of denoising steps for the decoder. More denoising steps usually lead to a higher quality
            image at the expense of slower inference.
        super_res_num_inference_steps (`int`, *optional*, defaults to 7):
            The number of denoising steps for super resolution. More denoising steps usually lead to a higher
            quality image at the expense of slower inference.
        generator (`np.random.Generator`, *optional*):
            A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make
            generation deterministic.
        decoder_latents (`ms.Tensor` of shape (batch size, channels, height, width), *optional*):
            Pre-generated noisy latents to be used as inputs for the decoder.
        super_res_latents (`ms.Tensor` of shape (batch size, channels, super res height, super res width), *optional*):
            Pre-generated noisy latents to be used as inputs for the super resolution.
        decoder_guidance_scale (`float`, *optional*, defaults to 8.0):
            A higher guidance scale value encourages the model to generate images closely linked to the text
            `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
        image_embeddings (`ms.Tensor`, *optional*):
            Pre-defined image embeddings that can be derived from the image encoder. Pre-defined image embeddings
            can be passed for tasks like image interpolations. `image` can be left as `None`.
        output_type (`str`, *optional*, defaults to `"pil"`):
            The output format of the generated image. Choose between `PIL.Image` and `np.array`.
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.

    Returns:
        [`~pipelines.ImagePipelineOutput`] or `tuple`:
            If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
            returned where the first element is a list with the generated images.
    """
    if image is not None:
        if isinstance(image, PIL.Image.Image):
            batch_size = 1
        elif isinstance(image, list):
            batch_size = len(image)
        else:
            batch_size = image.shape[0]
    else:
        batch_size = image_embeddings.shape[0]

    prompt = [""] * batch_size

    batch_size = batch_size * num_images_per_prompt

    do_classifier_free_guidance = decoder_guidance_scale > 1.0

    prompt_embeds, text_encoder_hidden_states, text_mask = self._encode_prompt(
        prompt, num_images_per_prompt, do_classifier_free_guidance
    )

    image_embeddings = self._encode_image(image, num_images_per_prompt, image_embeddings)

    # decoder
    text_encoder_hidden_states, additive_clip_time_embeddings = self.text_proj(
        image_embeddings=image_embeddings,
        prompt_embeds=prompt_embeds,
        text_encoder_hidden_states=text_encoder_hidden_states,
        do_classifier_free_guidance=do_classifier_free_guidance,
    )

    decoder_text_mask = ops.pad(text_mask, (self.text_proj.clip_extra_context_tokens, 0), value=True)

    self.decoder_scheduler.set_timesteps(decoder_num_inference_steps)
    decoder_timesteps_tensor = self.decoder_scheduler.timesteps

    num_channels_latents = self.decoder.config.in_channels
    height = self.decoder.config.sample_size
    width = self.decoder.config.sample_size

    if decoder_latents is None:
        decoder_latents = self.prepare_latents(
            (batch_size, num_channels_latents, height, width),
            text_encoder_hidden_states.dtype,
            generator,
            decoder_latents,
            self.decoder_scheduler,
        )

    for i, t in enumerate(self.progress_bar(decoder_timesteps_tensor)):
        # expand the latents if we are doing classifier free guidance
        latent_model_input = ops.cat([decoder_latents] * 2) if do_classifier_free_guidance else decoder_latents

        noise_pred = self.decoder(
            sample=latent_model_input,
            timestep=t,
            encoder_hidden_states=text_encoder_hidden_states,
            class_labels=additive_clip_time_embeddings,
            attention_mask=decoder_text_mask,
        )[0]

        if do_classifier_free_guidance:
            noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
            noise_pred_uncond, _ = noise_pred_uncond.split(latent_model_input.shape[1], axis=1)
            noise_pred_text, predicted_variance = noise_pred_text.split(latent_model_input.shape[1], axis=1)
            noise_pred = noise_pred_uncond + decoder_guidance_scale * (noise_pred_text - noise_pred_uncond)
            noise_pred = ops.cat([noise_pred, predicted_variance], axis=1)

        if i + 1 == decoder_timesteps_tensor.shape[0]:
            prev_timestep = None
        else:
            prev_timestep = decoder_timesteps_tensor[i + 1]

        # compute the previous noisy sample x_t -> x_t-1
        decoder_latents = self.decoder_scheduler.step(
            noise_pred, t, decoder_latents, prev_timestep=prev_timestep, generator=generator
        )[0]

    decoder_latents = decoder_latents.clamp(-1, 1)

    image_small = decoder_latents

    # done decoder

    # super res

    self.super_res_scheduler.set_timesteps(super_res_num_inference_steps)
    super_res_timesteps_tensor = self.super_res_scheduler.timesteps

    channels = self.super_res_first.config.in_channels // 2
    height = self.super_res_first.config.sample_size
    width = self.super_res_first.config.sample_size

    if super_res_latents is None:
        super_res_latents = self.prepare_latents(
            (batch_size, channels, height, width),
            image_small.dtype,
            generator,
            super_res_latents,
            self.super_res_scheduler,
        )

    interpolate_antialias = {}
    if "antialias" in inspect.signature(ops.interpolate).parameters:
        interpolate_antialias["antialias"] = True

    image_upscaled = ops.interpolate(
        image_small, size=[height, width], mode="bicubic", align_corners=False, **interpolate_antialias
    )

    for i, t in enumerate(self.progress_bar(super_res_timesteps_tensor)):
        # no classifier free guidance

        if i == super_res_timesteps_tensor.shape[0] - 1:
            unet = self.super_res_last
        else:
            unet = self.super_res_first

        latent_model_input = ops.cat([super_res_latents, image_upscaled], axis=1)

        noise_pred = unet(
            sample=latent_model_input,
            timestep=t,
        )[0]

        if i + 1 == super_res_timesteps_tensor.shape[0]:
            prev_timestep = None
        else:
            prev_timestep = super_res_timesteps_tensor[i + 1]

        # compute the previous noisy sample x_t -> x_t-1
        super_res_latents = self.super_res_scheduler.step(
            noise_pred, t, super_res_latents, prev_timestep=prev_timestep, generator=generator
        )[0]

    image = super_res_latents

    # post processing

    image = image * 0.5 + 0.5
    image = image.clamp(0, 1)
    image = image.permute(0, 2, 3, 1).float().numpy()

    if output_type == "pil":
        image = self.numpy_to_pil(image)

    if not return_dict:
        return (image,)

    return ImagePipelineOutput(images=image)
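
As documented above, the random source is a np.random.Generator, so seeding it makes the sampled decoder_latents and super_res_latents reproducible. A small sketch, reusing pipe and init_image from the first example:

import numpy as np

generator = np.random.default_rng(0)  # np.random.Generator with a fixed seed
images = pipe(
    image=init_image,
    decoder_num_inference_steps=25,
    super_res_num_inference_steps=7,
    generator=generator,
)[0]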

mindone.diffusers.pipelines.ImagePipelineOutput dataclass

Bases: BaseOutput

Output class for image pipelines.

Source code in mindone/diffusers/pipelines/pipeline_utils.py
@dataclass
class ImagePipelineOutput(BaseOutput):
    """
    Output class for image pipelines.

    Args:
        images (`List[PIL.Image.Image]` or `np.ndarray`):
            List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width,
            num_channels)`.
    """

    images: Union[List[PIL.Image.Image], np.ndarray]
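
Since return_dict defaults to False in the pipelines above, the output is normally a plain tuple; passing return_dict=True wraps the same images in this dataclass. A short sketch, reusing pipe and init_image from the earlier examples:

# Default call: a plain tuple whose first element is the list of generated PIL images.
images = pipe(image=init_image)[0]

# With return_dict=True: an ImagePipelineOutput exposing the same images under `.images`.
output = pipe(image=init_image, return_dict=True)
images = output.images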