BLIP-Diffusion

BLIP-Diffusion was proposed in the paper "BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing". It enables zero-shot subject-driven generation and control-guided zero-shot generation.

The abstract from the paper is:

Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Project page at this https URL.

The original codebase can be found at salesforce/LAVIS. You can find the official BLIP-Diffusion checkpoints under the hf.co/SalesForce organization.

BlipDiffusionPipeline and BlipDiffusionControlNetPipeline were contributed by ayushtues.

Tip

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

mindone.diffusers.pipelines.BlipDiffusionPipeline

Bases: DiffusionPipeline

Pipeline for Zero-Shot Subject Driven Generation using Blip Diffusion.

This model inherits from [DiffusionPipeline]. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

PARAMETER DESCRIPTION
tokenizer

Tokenizer for the text encoder

TYPE: [`CLIPTokenizer`]

text_encoder

Text encoder to encode the text prompt

TYPE: [`ContextCLIPTextModel`]

vae

VAE model to map the latents to the image

TYPE: [`AutoencoderKL`]

unet

Conditional U-Net architecture to denoise the image embedding.

TYPE: [`UNet2DConditionModel`]

scheduler

A scheduler to be used in combination with unet to generate image latents.

TYPE: [`PNDMScheduler`]

qformer

QFormer model to get multi-modal embeddings from the text and image.

TYPE: [`Blip2QFormerModel`]

image_processor

Image Processor to preprocess and postprocess the image.

TYPE: [`BlipImageProcessor`]

ctx_begin_pos

Position of the context token in the text encoder.

TYPE: int, `optional`, defaults to 2 DEFAULT: 2
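
To make the role of `ctx_begin_pos` more concrete, here is a minimal pure-Python sketch. The token budget follows the arithmetic in `encode_prompt` (maximum position embeddings minus the number of Q-Former query tokens). The splicing step is an assumption about how the text encoder uses `ctx_begin_pos`; it is not shown on this page. The helper names, the placeholder tokens, and the sizes 77 and 16 are illustrative, not values read from a checkpoint.

```python
from typing import List


def prompt_token_budget(max_position_embeddings: int, num_query_tokens: int) -> int:
    """Tokens left for the text prompt once room is reserved for the subject queries."""
    return max_position_embeddings - num_query_tokens


def splice_context(prompt_tokens: List[str], ctx_tokens: List[str], ctx_begin_pos: int = 2) -> List[str]:
    """Insert the context (subject) tokens into the prompt at ctx_begin_pos.

    Assumption: the real text encoder does the equivalent with embedding
    vectors; plain strings stand in for embeddings here.
    """
    return prompt_tokens[:ctx_begin_pos] + ctx_tokens + prompt_tokens[ctx_begin_pos:]


# Illustrative sizes (assumed, not read from a real config).
budget = prompt_token_budget(max_position_embeddings=77, num_query_tokens=16)

prompt = ["<bos>", "a", "dog", "swimming", "<eos>"]  # hypothetical tokens
queries = ["q0", "q1", "q2"]  # stand-ins for Q-Former query embeddings
spliced = splice_context(prompt, queries, ctx_begin_pos=2)

print(budget)   # 61
print(spliced)  # ['<bos>', 'a', 'q0', 'q1', 'q2', 'dog', 'swimming', '<eos>']
```

Under this assumption, the default `ctx_begin_pos=2` places the subject representation right after the leading "a" of the "a {target subject} {prompt}" template that `_build_prompt` produces.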

Source code in mindone/diffusers/pipelines/blip_diffusion/pipeline_blip_diffusion.py
class BlipDiffusionPipeline(DiffusionPipeline):
    """
    Pipeline for Zero-Shot Subject Driven Generation using Blip Diffusion.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
    library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

    Args:
        tokenizer ([`CLIPTokenizer`]):
            Tokenizer for the text encoder
        text_encoder ([`ContextCLIPTextModel`]):
            Text encoder to encode the text prompt
        vae ([`AutoencoderKL`]):
            VAE model to map the latents to the image
        unet ([`UNet2DConditionModel`]):
            Conditional U-Net architecture to denoise the image embedding.
        scheduler ([`PNDMScheduler`]):
             A scheduler to be used in combination with `unet` to generate image latents.
        qformer ([`Blip2QFormerModel`]):
            QFormer model to get multi-modal embeddings from the text and image.
        image_processor ([`BlipImageProcessor`]):
            Image Processor to preprocess and postprocess the image.
        ctx_begin_pos (int, `optional`, defaults to 2):
            Position of the context token in the text encoder.
    """

    def __init__(
        self,
        tokenizer: CLIPTokenizer,
        text_encoder: ContextCLIPTextModel,
        vae: AutoencoderKL,
        unet: UNet2DConditionModel,
        scheduler: PNDMScheduler,
        qformer: Blip2QFormerModel,
        image_processor: BlipImageProcessor,
        ctx_begin_pos: int = 2,
        mean: List[float] = None,
        std: List[float] = None,
    ):
        super().__init__()

        self.register_modules(
            tokenizer=tokenizer,
            text_encoder=text_encoder,
            vae=vae,
            unet=unet,
            scheduler=scheduler,
            qformer=qformer,
            image_processor=image_processor,
        )
        self.register_to_config(ctx_begin_pos=ctx_begin_pos, mean=mean, std=std)

    def get_query_embeddings(self, input_image, src_subject):
        text = self.qformer.tokenizer(src_subject, return_tensors="np", padding=True)
        input_ids = ms.Tensor(text.input_ids)
        attention_mask = ms.Tensor(text.attention_mask)
        return self.qformer(
            image_input=input_image, text_input_ids=input_ids, text_attention_mask=attention_mask, return_dict=False
        )

    # from the original Blip Diffusion code, specifies the target subject and augments the prompt by repeating it
    def _build_prompt(self, prompts, tgt_subjects, prompt_strength=1.0, prompt_reps=20):
        rv = []
        for prompt, tgt_subject in zip(prompts, tgt_subjects):
            prompt = f"a {tgt_subject} {prompt.strip()}"
            # a trick to amplify the prompt
            rv.append(", ".join([prompt] * int(prompt_strength * prompt_reps)))

        return rv

    # Copied from diffusers.pipelines.consistency_models.pipeline_consistency_models.ConsistencyModelPipeline.prepare_latents
    def prepare_latents(self, batch_size, num_channels, height, width, dtype, generator, latents=None):
        shape = (batch_size, num_channels, height, width)
        if isinstance(generator, list) and len(generator) != batch_size:
            raise ValueError(
                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
            )

        if latents is None:
            latents = randn_tensor(shape, generator=generator, dtype=dtype)
        else:
            latents = latents.to(dtype=dtype)

        # scale the initial noise by the standard deviation required by the scheduler
        latents = (latents * self.scheduler.init_noise_sigma).to(dtype)
        return latents

    def encode_prompt(self, query_embeds, prompt):
        # embeddings for prompt, with query_embeds as context
        max_len = self.text_encoder.text_model.config.max_position_embeddings
        max_len -= self.qformer.config.num_query_tokens

        tokenized_prompt = self.tokenizer(
            prompt,
            padding="max_length",
            truncation=True,
            max_length=max_len,
            return_tensors="np",
        )

        batch_size = query_embeds.shape[0]
        ctx_begin_pos = [self.config.ctx_begin_pos] * batch_size

        text_embeddings = self.text_encoder(
            input_ids=ms.Tensor(tokenized_prompt.input_ids),
            ctx_embeddings=query_embeds,
            ctx_begin_pos=ctx_begin_pos,
        )[0]

        return text_embeddings

    def __call__(
        self,
        prompt: List[str],
        reference_image: PIL.Image.Image,
        source_subject_category: List[str],
        target_subject_category: List[str],
        latents: Optional[ms.Tensor] = None,
        guidance_scale: float = 7.5,
        height: int = 512,
        width: int = 512,
        num_inference_steps: int = 50,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        neg_prompt: Optional[str] = "",
        prompt_strength: float = 1.0,
        prompt_reps: int = 20,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
    ):
        """
        Function invoked when calling the pipeline for generation.

        Args:
            prompt (`List[str]`):
                The prompt or prompts to guide the image generation.
            reference_image (`PIL.Image.Image`):
                The reference image to condition the generation on.
            source_subject_category (`List[str]`):
                The source subject category.
            target_subject_category (`List[str]`):
                The target subject category.
            latents (`ms.Tensor`, *optional*):
                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                tensor will be generated by random sampling.
            guidance_scale (`float`, *optional*, defaults to 7.5):
                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
                `guidance_scale` is defined as `w` of equation 2 of the [Imagen
                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance is enabled by setting `guidance_scale >
                1`. A higher guidance scale encourages the model to generate images closely linked to the text
                `prompt`, usually at the expense of lower image quality.
            height (`int`, *optional*, defaults to 512):
                The height of the generated image.
            width (`int`, *optional*, defaults to 512):
                The width of the generated image.
            num_inference_steps (`int`, *optional*, defaults to 50):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
                One or a list of [np.random.Generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
                to make generation deterministic.
            neg_prompt (`str`, *optional*, defaults to ""):
                The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
                if `guidance_scale` is not greater than `1`).
            prompt_strength (`float`, *optional*, defaults to 1.0):
                The strength of the prompt. The prompt is repeated `int(prompt_strength * prompt_reps)` times to
                amplify it.
            prompt_reps (`int`, *optional*, defaults to 20):
                The base number of prompt repetitions, scaled by `prompt_strength` to amplify the prompt.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
                (`np.array`) or `"pt"` (`mindspore.Tensor`).
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
        Examples:

        Returns:
            [`~pipelines.ImagePipelineOutput`] or `tuple`
        """
        reference_image = self.image_processor.preprocess(
            reference_image, image_mean=self.config.mean, image_std=self.config.std, return_tensors="np"
        )["pixel_values"]
        reference_image = ms.Tensor(reference_image)

        if isinstance(prompt, str):
            prompt = [prompt]
        if isinstance(source_subject_category, str):
            source_subject_category = [source_subject_category]
        if isinstance(target_subject_category, str):
            target_subject_category = [target_subject_category]

        batch_size = len(prompt)

        prompt = self._build_prompt(
            prompts=prompt,
            tgt_subjects=target_subject_category,
            prompt_strength=prompt_strength,
            prompt_reps=prompt_reps,
        )
        query_embeds = self.get_query_embeddings(reference_image, source_subject_category)
        text_embeddings = self.encode_prompt(query_embeds, prompt)
        do_classifier_free_guidance = guidance_scale > 1.0
        if do_classifier_free_guidance:
            max_length = self.text_encoder.text_model.config.max_position_embeddings

            uncond_input = self.tokenizer(
                [neg_prompt] * batch_size,
                padding="max_length",
                max_length=max_length,
                return_tensors="np",
            )
            uncond_embeddings = self.text_encoder(
                input_ids=ms.Tensor(uncond_input.input_ids),
                ctx_embeddings=None,
            )[0]
            # For classifier free guidance, we need to do two forward passes.
            # Here we concatenate the unconditional and text embeddings into a single batch
            # to avoid doing two forward passes
            text_embeddings = ops.cat([uncond_embeddings, text_embeddings])

        scale_down_factor = 2 ** (len(self.unet.config.block_out_channels) - 1)
        latents = self.prepare_latents(
            batch_size=batch_size,
            num_channels=self.unet.config.in_channels,
            height=height // scale_down_factor,
            width=width // scale_down_factor,
            generator=generator,
            latents=latents,
            dtype=self.unet.dtype,
        )
        # set timesteps
        extra_set_kwargs = {}
        self.scheduler.set_timesteps(num_inference_steps, **extra_set_kwargs)

        for i, t in enumerate(self.progress_bar(self.scheduler.timesteps)):
            # expand the latents if we are doing classifier free guidance
            do_classifier_free_guidance = guidance_scale > 1.0

            latent_model_input = ops.cat([latents] * 2) if do_classifier_free_guidance else latents

            noise_pred = self.unet(
                latent_model_input,
                timestep=t,
                encoder_hidden_states=text_embeddings,
                down_block_additional_residuals=None,
                mid_block_additional_residual=None,
            )[0]

            # perform guidance
            if do_classifier_free_guidance:
                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

            latents = self.scheduler.step(
                noise_pred,
                t,
                latents,
            )[0]

        image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
        image = self.image_processor.postprocess(image, output_type=output_type)

        if not return_dict:
            return (image,)

        return ImagePipelineOutput(images=image)
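
The latent resolution used in `__call__` follows from simple arithmetic on the U-Net config. The sketch below reproduces that calculation in plain Python. `latent_shape` is a hypothetical helper, and the config values (`in_channels=4`, four entries in `block_out_channels`) are assumed Stable-Diffusion-style numbers chosen for illustration, not values read from a BLIP-Diffusion checkpoint.

```python
from typing import Sequence, Tuple


def latent_shape(
    batch_size: int,
    in_channels: int,
    height: int,
    width: int,
    block_out_channels: Sequence[int],
) -> Tuple[int, int, int, int]:
    """Shape of the initial latents, following the arithmetic in `__call__`."""
    # Same formula as the pipeline: 2 ** (number of U-Net blocks - 1).
    scale_down_factor = 2 ** (len(block_out_channels) - 1)
    # Floor division, as in the pipeline.
    return (batch_size, in_channels, height // scale_down_factor, width // scale_down_factor)


# Assumed config values, for illustration only.
shape = latent_shape(
    batch_size=1,
    in_channels=4,
    height=512,
    width=512,
    block_out_channels=(320, 640, 1280, 1280),
)
print(shape)  # (1, 4, 64, 64)
```

Because the pipeline uses floor division, a `height` or `width` that is not a multiple of the scale-down factor is silently rounded down at the latent level, so sizes that divide evenly are the safer choice.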

mindone.diffusers.pipelines.BlipDiffusionPipeline.__call__(prompt, reference_image, source_subject_category, target_subject_category, latents=None, guidance_scale=7.5, height=512, width=512, num_inference_steps=50, generator=None, neg_prompt='', prompt_strength=1.0, prompt_reps=20, output_type='pil', return_dict=False)

Function invoked when calling the pipeline for generation.

PARAMETER DESCRIPTION
prompt

The prompt or prompts to guide the image generation.

TYPE: `List[str]`

reference_image

The reference image to condition the generation on.

TYPE: `PIL.Image.Image`

source_subject_category

The source subject category.

TYPE: `List[str]`

target_subject_category

The target subject category.

TYPE: `List[str]`

latents

Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by random sampling.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

guidance_scale

Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.

TYPE: `float`, *optional*, defaults to 7.5 DEFAULT: 7.5

height

The height of the generated image.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

width

The width of the generated image.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

num_inference_steps

The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 50 DEFAULT: 50

generator

One or a list of np.random.Generator(s) to make generation deterministic.

TYPE: `np.random.Generator` or `List[np.random.Generator]`, *optional* DEFAULT: None

neg_prompt

The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is not greater than 1).

TYPE: `str`, *optional*, defaults to "" DEFAULT: ''

prompt_strength

The strength of the prompt. The prompt is repeated int(prompt_strength * prompt_reps) times to amplify it.

TYPE: `float`, *optional*, defaults to 1.0 DEFAULT: 1.0

prompt_reps

The base number of prompt repetitions, scaled by prompt_strength to amplify the prompt.

TYPE: `int`, *optional*, defaults to 20 DEFAULT: 20

output_type

The output format of the generated image. Choose between: "pil" (PIL.Image.Image), "np" (np.array) or "pt" (mindspore.Tensor).

TYPE: `str`, *optional*, defaults to `"pil"` DEFAULT: 'pil'

return_dict

Whether or not to return a [~pipelines.ImagePipelineOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

RETURNS DESCRIPTION

[~pipelines.ImagePipelineOutput] or tuple
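
The interaction between `prompt_strength` and `prompt_reps` is easiest to see in isolation. The following is a standalone copy of the logic in the pipeline's `_build_prompt` method, lifted out of the class so it runs without any model; `build_prompt` and the sample inputs are illustrative names, not part of the library API.

```python
from typing import List


def build_prompt(
    prompts: List[str],
    tgt_subjects: List[str],
    prompt_strength: float = 1.0,
    prompt_reps: int = 20,
) -> List[str]:
    """Standalone copy of the prompt-amplification logic in `_build_prompt`."""
    amplified = []
    for prompt, tgt_subject in zip(prompts, tgt_subjects):
        # Prefix the prompt with the target subject category.
        prompt = f"a {tgt_subject} {prompt.strip()}"
        # Repeat it int(prompt_strength * prompt_reps) times, comma-separated.
        amplified.append(", ".join([prompt] * int(prompt_strength * prompt_reps)))
    return amplified


result = build_prompt(["swimming underwater"], ["dog"], prompt_strength=1.0, prompt_reps=2)
print(result)  # ['a dog swimming underwater, a dog swimming underwater']
```

With the defaults (`prompt_strength=1.0`, `prompt_reps=20`) the prompt appears 20 times. Note that a product below 1 truncates to zero repetitions and yields an empty string, so very small `prompt_strength` values effectively drop the text prompt.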

Source code in mindone/diffusers/pipelines/blip_diffusion/pipeline_blip_diffusion.py
def __call__(
    self,
    prompt: List[str],
    reference_image: PIL.Image.Image,
    source_subject_category: List[str],
    target_subject_category: List[str],
    latents: Optional[ms.Tensor] = None,
    guidance_scale: float = 7.5,
    height: int = 512,
    width: int = 512,
    num_inference_steps: int = 50,
    generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
    neg_prompt: Optional[str] = "",
    prompt_strength: float = 1.0,
    prompt_reps: int = 20,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
):
    """
    Function invoked when calling the pipeline for generation.

    Args:
        prompt (`List[str]`):
            The prompt or prompts to guide the image generation.
        reference_image (`PIL.Image.Image`):
            The reference image to condition the generation on.
        source_subject_category (`List[str]`):
            The source subject category.
        target_subject_category (`List[str]`):
            The target subject category.
        latents (`ms.Tensor`, *optional*):
            Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
            generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
            tensor will be generated by random sampling.
        guidance_scale (`float`, *optional*, defaults to 7.5):
            Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
            `guidance_scale` is defined as `w` of equation 2 of the [Imagen
            Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance is enabled by setting `guidance_scale >
            1`. A higher guidance scale encourages the model to generate images closely linked to the text
            `prompt`, usually at the expense of lower image quality.
        height (`int`, *optional*, defaults to 512):
            The height of the generated image.
        width (`int`, *optional*, defaults to 512):
            The width of the generated image.
        num_inference_steps (`int`, *optional*, defaults to 50):
            The number of denoising steps. More denoising steps usually lead to a higher quality image at the
            expense of slower inference.
        generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
            One or a list of [np.random.Generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
            to make generation deterministic.
        neg_prompt (`str`, *optional*, defaults to ""):
            The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
            if `guidance_scale` is not greater than `1`).
        prompt_strength (`float`, *optional*, defaults to 1.0):
            The strength of the prompt. The prompt is repeated `int(prompt_strength * prompt_reps)` times to
            amplify it.
        prompt_reps (`int`, *optional*, defaults to 20):
            The base number of prompt repetitions, scaled by `prompt_strength` to amplify the prompt.
        output_type (`str`, *optional*, defaults to `"pil"`):
            The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
            (`np.array`) or `"pt"` (`mindspore.Tensor`).
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
    Examples:

    Returns:
        [`~pipelines.ImagePipelineOutput`] or `tuple`
    """
    reference_image = self.image_processor.preprocess(
        reference_image, image_mean=self.config.mean, image_std=self.config.std, return_tensors="np"
    )["pixel_values"]
    reference_image = ms.Tensor(reference_image)

    if isinstance(prompt, str):
        prompt = [prompt]
    if isinstance(source_subject_category, str):
        source_subject_category = [source_subject_category]
    if isinstance(target_subject_category, str):
        target_subject_category = [target_subject_category]

    batch_size = len(prompt)

    prompt = self._build_prompt(
        prompts=prompt,
        tgt_subjects=target_subject_category,
        prompt_strength=prompt_strength,
        prompt_reps=prompt_reps,
    )
    query_embeds = self.get_query_embeddings(reference_image, source_subject_category)
    text_embeddings = self.encode_prompt(query_embeds, prompt)
    do_classifier_free_guidance = guidance_scale > 1.0
    if do_classifier_free_guidance:
        max_length = self.text_encoder.text_model.config.max_position_embeddings

        uncond_input = self.tokenizer(
            [neg_prompt] * batch_size,
            padding="max_length",
            max_length=max_length,
            return_tensors="np",
        )
        uncond_embeddings = self.text_encoder(
            input_ids=ms.Tensor(uncond_input.input_ids),
            ctx_embeddings=None,
        )[0]
        # For classifier free guidance, we need to do two forward passes.
        # Here we concatenate the unconditional and text embeddings into a single batch
        # to avoid doing two forward passes
        text_embeddings = ops.cat([uncond_embeddings, text_embeddings])

    scale_down_factor = 2 ** (len(self.unet.config.block_out_channels) - 1)
    latents = self.prepare_latents(
        batch_size=batch_size,
        num_channels=self.unet.config.in_channels,
        height=height // scale_down_factor,
        width=width // scale_down_factor,
        generator=generator,
        latents=latents,
        dtype=self.unet.dtype,
    )
    # set timesteps
    extra_set_kwargs = {}
    self.scheduler.set_timesteps(num_inference_steps, **extra_set_kwargs)

    for i, t in enumerate(self.progress_bar(self.scheduler.timesteps)):
        # expand the latents if we are doing classifier free guidance
        do_classifier_free_guidance = guidance_scale > 1.0

        latent_model_input = ops.cat([latents] * 2) if do_classifier_free_guidance else latents

        noise_pred = self.unet(
            latent_model_input,
            timestep=t,
            encoder_hidden_states=text_embeddings,
            down_block_additional_residuals=None,
            mid_block_additional_residual=None,
        )[0]

        # perform guidance
        if do_classifier_free_guidance:
            noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
            noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

        latents = self.scheduler.step(
            noise_pred,
            t,
            latents,
        )[0]

    image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
    image = self.image_processor.postprocess(image, output_type=output_type)

    if not return_dict:
        return (image,)

    return ImagePipelineOutput(images=image)
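
The guidance step inside the denoising loop is a single linear combination of the two noise predictions. The sketch below applies the same formula to plain Python lists instead of tensors; `apply_guidance` and the sample values are illustrative and not part of the library.

```python
from typing import List


def apply_guidance(noise_uncond: List[float], noise_text: List[float], guidance_scale: float) -> List[float]:
    """Combine unconditional and text-conditioned noise predictions element-wise."""
    # Same formula as the pipeline: uncond + scale * (text - uncond).
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0]  # stand-in for the unconditional noise prediction
text = [1.0, 1.0]    # stand-in for the text-conditioned noise prediction

guided = apply_guidance(uncond, text, guidance_scale=7.5)
print(guided)  # [7.5, 1.0]
```

At `guidance_scale=1.0` the formula returns the text-conditioned prediction unchanged, which is consistent with the pipeline skipping the unconditional branch unless `guidance_scale > 1.0`.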

mindone.diffusers.pipelines.BlipDiffusionControlNetPipeline

Bases: DiffusionPipeline

Pipeline for Canny Edge based Controlled subject-driven generation using Blip Diffusion.

This model inherits from [DiffusionPipeline]. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

PARAMETER DESCRIPTION
tokenizer

Tokenizer for the text encoder

TYPE: [`CLIPTokenizer`]

text_encoder

Text encoder to encode the text prompt

TYPE: [`ContextCLIPTextModel`]

vae

VAE model to map the latents to the image

TYPE: [`AutoencoderKL`]

unet

Conditional U-Net architecture to denoise the image embedding.

TYPE: [`UNet2DConditionModel`]

scheduler

A scheduler to be used in combination with unet to generate image latents.

TYPE: [`PNDMScheduler`]

qformer

QFormer model to get multi-modal embeddings from the text and image.

TYPE: [`Blip2QFormerModel`]

controlnet

ControlNet model to get the conditioning image embedding.

TYPE: [`ControlNetModel`]

image_processor

Image Processor to preprocess and postprocess the image.

TYPE: [`BlipImageProcessor`]

ctx_begin_pos

Position of the context token in the text encoder.

TYPE: int, `optional`, defaults to 2 DEFAULT: 2

Source code in mindone/diffusers/pipelines/controlnet/pipeline_controlnet_blip_diffusion.py
class BlipDiffusionControlNetPipeline(DiffusionPipeline):
    """
    Pipeline for Canny Edge based Controlled subject-driven generation using Blip Diffusion.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
    library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

    Args:
        tokenizer ([`CLIPTokenizer`]):
            Tokenizer for the text encoder
        text_encoder ([`ContextCLIPTextModel`]):
            Text encoder to encode the text prompt
        vae ([`AutoencoderKL`]):
            VAE model to map the latents to the image
        unet ([`UNet2DConditionModel`]):
            Conditional U-Net architecture to denoise the image embedding.
        scheduler ([`PNDMScheduler`]):
             A scheduler to be used in combination with `unet` to generate image latents.
        qformer ([`Blip2QFormerModel`]):
            QFormer model to get multi-modal embeddings from the text and image.
        controlnet ([`ControlNetModel`]):
            ControlNet model to get the conditioning image embedding.
        image_processor ([`BlipImageProcessor`]):
            Image Processor to preprocess and postprocess the image.
        ctx_begin_pos (int, `optional`, defaults to 2):
            Position of the context token in the text encoder.
    """

    model_cpu_offload_seq = "qformer->text_encoder->unet->vae"

    def __init__(
        self,
        tokenizer: CLIPTokenizer,
        text_encoder: ContextCLIPTextModel,
        vae: AutoencoderKL,
        unet: UNet2DConditionModel,
        scheduler: PNDMScheduler,
        qformer: Blip2QFormerModel,
        controlnet: ControlNetModel,
        image_processor: BlipImageProcessor,
        ctx_begin_pos: int = 2,
        mean: List[float] = None,
        std: List[float] = None,
    ):
        super().__init__()

        self.register_modules(
            tokenizer=tokenizer,
            text_encoder=text_encoder,
            vae=vae,
            unet=unet,
            scheduler=scheduler,
            qformer=qformer,
            controlnet=controlnet,
            image_processor=image_processor,
        )
        self.register_to_config(ctx_begin_pos=ctx_begin_pos, mean=mean, std=std)

    def get_query_embeddings(self, input_image, src_subject):
        text = self.qformer.tokenizer(src_subject, return_tensors="np", padding=True)
        input_ids = ms.Tensor.from_numpy(text.input_ids)
        attention_mask = ms.Tensor.from_numpy(text.attention_mask)
        return self.qformer(
            image_input=input_image, text_input_ids=input_ids, text_attention_mask=attention_mask, return_dict=False
        )

    # From the original BLIP-Diffusion code: specifies the target subject and augments the prompt by repeating it
    def _build_prompt(self, prompts, tgt_subjects, prompt_strength=1.0, prompt_reps=20):
        rv = []
        for prompt, tgt_subject in zip(prompts, tgt_subjects):
            prompt = f"a {tgt_subject} {prompt.strip()}"
            # a trick to amplify the prompt
            rv.append(", ".join([prompt] * int(prompt_strength * prompt_reps)))

        return rv

    # Copied from diffusers.pipelines.consistency_models.pipeline_consistency_models.ConsistencyModelPipeline.prepare_latents
    def prepare_latents(self, batch_size, num_channels, height, width, dtype, generator, latents=None):
        shape = (batch_size, num_channels, height, width)
        if isinstance(generator, list) and len(generator) != batch_size:
            raise ValueError(
                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
            )

        if latents is None:
            latents = randn_tensor(shape, generator=generator, dtype=dtype)
        else:
            latents = latents.to(dtype=dtype)

        # scale the initial noise by the standard deviation required by the scheduler
        latents = (latents * self.scheduler.init_noise_sigma).to(dtype)
        return latents

    def encode_prompt(self, query_embeds, prompt):
        # embeddings for prompt, with query_embeds as context
        max_len = self.text_encoder.text_model.config.max_position_embeddings
        max_len -= self.qformer.config.num_query_tokens

        tokenized_prompt = self.tokenizer(
            prompt,
            padding="max_length",
            truncation=True,
            max_length=max_len,
            return_tensors="np",
        )

        batch_size = query_embeds.shape[0]
        ctx_begin_pos = [self.config.ctx_begin_pos] * batch_size

        text_embeddings = self.text_encoder(
            input_ids=ms.Tensor.from_numpy(tokenized_prompt.input_ids),
            ctx_embeddings=query_embeds,
            ctx_begin_pos=ctx_begin_pos,
        )[0]

        return text_embeddings

    # Adapted from diffusers.pipelines.controlnet.pipeline_controlnet.StableDiffusionControlNetPipeline.prepare_image
    def prepare_control_image(
        self,
        image,
        width,
        height,
        batch_size,
        num_images_per_prompt,
        dtype,
        do_classifier_free_guidance=False,
    ):
        image = self.image_processor.preprocess(
            image,
            size={"width": width, "height": height},
            do_rescale=True,
            do_center_crop=False,
            do_normalize=False,
            return_tensors="np",
        )["pixel_values"]
        image = ms.Tensor.from_numpy(image)
        image_batch_size = image.shape[0]

        if image_batch_size == 1:
            repeat_by = batch_size
        else:
            # image batch size is the same as prompt batch size
            repeat_by = num_images_per_prompt

        image = image.repeat_interleave(repeat_by, dim=0)

        image = image.to(dtype=dtype)

        if do_classifier_free_guidance:
            image = ops.cat([image] * 2)

        return image

    def __call__(
        self,
        prompt: List[str],
        reference_image: PIL.Image.Image,
        condtioning_image: PIL.Image.Image,
        source_subject_category: List[str],
        target_subject_category: List[str],
        latents: Optional[ms.Tensor] = None,
        guidance_scale: float = 7.5,
        height: int = 512,
        width: int = 512,
        num_inference_steps: int = 50,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        neg_prompt: Optional[str] = "",
        prompt_strength: float = 1.0,
        prompt_reps: int = 20,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
    ):
        """
        Function invoked when calling the pipeline for generation.

        Args:
            prompt (`List[str]`):
                The prompt or prompts to guide the image generation.
            reference_image (`PIL.Image.Image`):
                The reference image to condition the generation on.
            condtioning_image (`PIL.Image.Image`):
                The conditioning canny edge image to condition the generation on.
            source_subject_category (`List[str]`):
                The source subject category.
            target_subject_category (`List[str]`):
                The target subject category.
            latents (`ms.Tensor`, *optional*):
                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                tensor will be generated by random sampling.
            guidance_scale (`float`, *optional*, defaults to 7.5):
                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
                `guidance_scale` is defined as `w` in equation 2 of the [Imagen
                paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance is enabled by setting `guidance_scale > 1`.
                A higher guidance scale encourages the model to generate images that are closely linked to the
                text `prompt`, usually at the expense of lower image quality.
            height (`int`, *optional*, defaults to 512):
                The height of the generated image.
            width (`int`, *optional*, defaults to 512):
                The width of the generated image.
            num_inference_steps (`int`, *optional*, defaults to 50):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
                One or a list of [NumPy random generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
                to make generation deterministic.
            neg_prompt (`str`, *optional*, defaults to ""):
                The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
                if `guidance_scale` is not greater than `1`).
            prompt_strength (`float`, *optional*, defaults to 1.0):
                The strength of the prompt. Together with `prompt_reps`, it determines how many times the prompt is
                repeated to amplify it: `int(prompt_strength * prompt_reps)`.
            prompt_reps (`int`, *optional*, defaults to 20):
                The base number of prompt repetitions, scaled by `prompt_strength` to amplify the prompt.
        Examples:

        Returns:
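            [`~pipelines.ImagePipelineOutput`] or `tuple`:
                An [`~pipelines.ImagePipelineOutput`] if `return_dict` is `True`; otherwise a `tuple` whose first
                element holds the generated images.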
        """

        reference_image = self.image_processor.preprocess(
            reference_image, image_mean=self.config.mean, image_std=self.config.std, return_tensors="np"
        )["pixel_values"]
        reference_image = ms.Tensor.from_numpy(reference_image)

        if isinstance(prompt, str):
            prompt = [prompt]
        if isinstance(source_subject_category, str):
            source_subject_category = [source_subject_category]
        if isinstance(target_subject_category, str):
            target_subject_category = [target_subject_category]

        batch_size = len(prompt)

        prompt = self._build_prompt(
            prompts=prompt,
            tgt_subjects=target_subject_category,
            prompt_strength=prompt_strength,
            prompt_reps=prompt_reps,
        )
        query_embeds = self.get_query_embeddings(reference_image, source_subject_category)
        text_embeddings = self.encode_prompt(query_embeds, prompt)
        # 3. unconditional embedding
        do_classifier_free_guidance = guidance_scale > 1.0
        if do_classifier_free_guidance:
            max_length = self.text_encoder.text_model.config.max_position_embeddings

            uncond_input = self.tokenizer(
                [neg_prompt] * batch_size,
                padding="max_length",
                max_length=max_length,
                return_tensors="np",
            )
            uncond_embeddings = self.text_encoder(
                input_ids=ms.Tensor.from_numpy(uncond_input.input_ids),
                ctx_embeddings=None,
            )[0]
            # For classifier free guidance, we need to do two forward passes.
            # Here we concatenate the unconditional and text embeddings into a single batch
            # to avoid doing two forward passes
            text_embeddings = ops.cat([uncond_embeddings, text_embeddings])
        scale_down_factor = 2 ** (len(self.unet.config.block_out_channels) - 1)
        latents = self.prepare_latents(
            batch_size=batch_size,
            num_channels=self.unet.config.in_channels,
            height=height // scale_down_factor,
            width=width // scale_down_factor,
            generator=generator,
            latents=latents,
            dtype=self.unet.dtype,
        )
        # set timesteps
        extra_set_kwargs = {}
        self.scheduler.set_timesteps(num_inference_steps, **extra_set_kwargs)

        cond_image = self.prepare_control_image(
            image=condtioning_image,
            width=width,
            height=height,
            batch_size=batch_size,
            num_images_per_prompt=1,
            dtype=self.controlnet.dtype,
            do_classifier_free_guidance=do_classifier_free_guidance,
        )

        for i, t in enumerate(self.progress_bar(self.scheduler.timesteps)):
            # expand the latents if we are doing classifier free guidance
            do_classifier_free_guidance = guidance_scale > 1.0

            latent_model_input = ops.cat([latents] * 2) if do_classifier_free_guidance else latents
            down_block_res_samples, mid_block_res_sample = self.controlnet(
                latent_model_input,
                t,
                encoder_hidden_states=text_embeddings,
                controlnet_cond=cond_image,
                return_dict=False,
            )

            noise_pred = self.unet(
                latent_model_input,
                timestep=t,
                encoder_hidden_states=text_embeddings,
                down_block_additional_residuals=ms.mutable(down_block_res_samples),
                mid_block_additional_residual=mid_block_res_sample,
            )[0]

            # perform guidance
            if do_classifier_free_guidance:
                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

            latents_dtype = latents.dtype
            latents = self.scheduler.step(
                noise_pred,
                t,
                latents,
            )[0]
            latents = latents.to(latents_dtype)
        image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
        image = self.image_processor.postprocess(image, output_type=output_type)

        if not return_dict:
            return (image,)

        return ImagePipelineOutput(images=image)
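
The prompt amplification done by `_build_prompt` in the listing above can be tried in isolation. The following is a minimal standalone sketch that restates the same logic as a free function; the name `build_prompt` and the sample inputs are illustrative assumptions, not part of the library.

```python
from typing import List


def build_prompt(
    prompts: List[str],
    tgt_subjects: List[str],
    prompt_strength: float = 1.0,
    prompt_reps: int = 20,
) -> List[str]:
    """Hypothetical standalone copy of the `_build_prompt` logic."""
    amplified = []
    for prompt, tgt_subject in zip(prompts, tgt_subjects):
        # Prefix the (stripped) prompt with the target subject category.
        text = f"a {tgt_subject} {prompt.strip()}"
        # Repeat the text int(prompt_strength * prompt_reps) times, comma separated.
        amplified.append(", ".join([text] * int(prompt_strength * prompt_reps)))
    return amplified


# Small repetition count so the result stays readable.
result = build_prompt(["on a beach "], ["dog"], prompt_strength=1.0, prompt_reps=2)
print(result)
```

With the defaults (`prompt_strength=1.0`, `prompt_reps=20`) the text is repeated 20 times, so most of the repetitions are likely cut off by the tokenizer, which `encode_prompt` calls with `truncation=True`.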

mindone.diffusers.pipelines.BlipDiffusionControlNetPipeline.__call__(prompt, reference_image, condtioning_image, source_subject_category, target_subject_category, latents=None, guidance_scale=7.5, height=512, width=512, num_inference_steps=50, generator=None, neg_prompt='', prompt_strength=1.0, prompt_reps=20, output_type='pil', return_dict=False)

Function invoked when calling the pipeline for generation.

PARAMETER DESCRIPTION
prompt

The prompt or prompts to guide the image generation.

TYPE: `List[str]`

reference_image

The reference image to condition the generation on.

TYPE: `PIL.Image.Image`

condtioning_image

The conditioning canny edge image to condition the generation on.

TYPE: `PIL.Image.Image`

source_subject_category

The source subject category.

TYPE: `List[str]`

target_subject_category

The target subject category.

TYPE: `List[str]`

latents

Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by random sampling.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

guidance_scale

Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.

TYPE: `float`, *optional* DEFAULT: 7.5

height

The height of the generated image.

TYPE: `int`, *optional* DEFAULT: 512

width

The width of the generated image.

TYPE: `int`, *optional* DEFAULT: 512

num_inference_steps

The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional* DEFAULT: 50

generator

One or a list of NumPy random generator(s) to make generation deterministic.

TYPE: `np.random.Generator` or `List[np.random.Generator]`, *optional* DEFAULT: None

neg_prompt

The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is not greater than 1).

TYPE: `str`, *optional* DEFAULT: ''

prompt_strength

The strength of the prompt. Together with prompt_reps, it determines how many times the prompt is repeated to amplify it: int(prompt_strength * prompt_reps).

TYPE: `float`, *optional* DEFAULT: 1.0

prompt_reps

The base number of prompt repetitions, scaled by prompt_strength to amplify the prompt.

TYPE: `int`, *optional* DEFAULT: 20

RETURNS DESCRIPTION

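[~pipelines.ImagePipelineOutput] or tuple: an ImagePipelineOutput if return_dict is True; otherwise a tuple whose first element holds the generated images.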
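
To make the effect of `guidance_scale` concrete, here is a minimal sketch of the classifier-free guidance combination the pipeline applies at each denoising step, `noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)`. It uses plain Python lists instead of tensors; the helper name `apply_guidance` and the toy values are assumptions made for illustration only.

```python
from typing import List


def apply_guidance(
    noise_uncond: List[float],
    noise_text: List[float],
    guidance_scale: float,
) -> List[float]:
    """Hypothetical helper: combine unconditional and text-conditioned noise predictions."""
    # Move from the unconditional prediction toward (and past) the
    # text-conditioned one, scaled by guidance_scale.
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0]  # toy stand-in for the unconditional noise prediction
text = [1.0, 3.0]    # toy stand-in for the text-conditioned noise prediction

guided = apply_guidance(uncond, text, guidance_scale=7.5)
neutral = apply_guidance(uncond, text, guidance_scale=1.0)
print(guided, neutral)
```

At `guidance_scale=1.0` the formula returns the text-conditioned prediction unchanged, which is consistent with the pipeline skipping the unconditional branch (and ignoring `neg_prompt`) unless `guidance_scale > 1`.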

Source code in mindone/diffusers/pipelines/controlnet/pipeline_controlnet_blip_diffusion.py
def __call__(
    self,
    prompt: List[str],
    reference_image: PIL.Image.Image,
    condtioning_image: PIL.Image.Image,
    source_subject_category: List[str],
    target_subject_category: List[str],
    latents: Optional[ms.Tensor] = None,
    guidance_scale: float = 7.5,
    height: int = 512,
    width: int = 512,
    num_inference_steps: int = 50,
    generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
    neg_prompt: Optional[str] = "",
    prompt_strength: float = 1.0,
    prompt_reps: int = 20,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
):
    """
    Function invoked when calling the pipeline for generation.

    Args:
        prompt (`List[str]`):
            The prompt or prompts to guide the image generation.
        reference_image (`PIL.Image.Image`):
            The reference image to condition the generation on.
        condtioning_image (`PIL.Image.Image`):
            The conditioning canny edge image to condition the generation on.
        source_subject_category (`List[str]`):
            The source subject category.
        target_subject_category (`List[str]`):
            The target subject category.
        latents (`ms.Tensor`, *optional*):
            Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
            generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
            tensor will be generated by random sampling.
        guidance_scale (`float`, *optional*, defaults to 7.5):
            Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
            `guidance_scale` is defined as `w` in equation 2 of the [Imagen
            paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance is enabled by setting `guidance_scale > 1`.
            A higher guidance scale encourages the model to generate images that are closely linked to the
            text `prompt`, usually at the expense of lower image quality.
        height (`int`, *optional*, defaults to 512):
            The height of the generated image.
        width (`int`, *optional*, defaults to 512):
            The width of the generated image.
        num_inference_steps (`int`, *optional*, defaults to 50):
            The number of denoising steps. More denoising steps usually lead to a higher quality image at the
            expense of slower inference.
        generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
            One or a list of [NumPy random generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
            to make generation deterministic.
        neg_prompt (`str`, *optional*, defaults to ""):
            The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
            if `guidance_scale` is not greater than `1`).
        prompt_strength (`float`, *optional*, defaults to 1.0):
            The strength of the prompt. Together with `prompt_reps`, it determines how many times the prompt is
            repeated to amplify it: `int(prompt_strength * prompt_reps)`.
        prompt_reps (`int`, *optional*, defaults to 20):
            The base number of prompt repetitions, scaled by `prompt_strength` to amplify the prompt.
    Examples:

    Returns:
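        [`~pipelines.ImagePipelineOutput`] or `tuple`:
            An [`~pipelines.ImagePipelineOutput`] if `return_dict` is `True`; otherwise a `tuple` whose first
            element holds the generated images.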
    """

    reference_image = self.image_processor.preprocess(
        reference_image, image_mean=self.config.mean, image_std=self.config.std, return_tensors="np"
    )["pixel_values"]
    reference_image = ms.Tensor.from_numpy(reference_image)

    if isinstance(prompt, str):
        prompt = [prompt]
    if isinstance(source_subject_category, str):
        source_subject_category = [source_subject_category]
    if isinstance(target_subject_category, str):
        target_subject_category = [target_subject_category]

    batch_size = len(prompt)

    prompt = self._build_prompt(
        prompts=prompt,
        tgt_subjects=target_subject_category,
        prompt_strength=prompt_strength,
        prompt_reps=prompt_reps,
    )
    query_embeds = self.get_query_embeddings(reference_image, source_subject_category)
    text_embeddings = self.encode_prompt(query_embeds, prompt)
    # 3. unconditional embedding
    do_classifier_free_guidance = guidance_scale > 1.0
    if do_classifier_free_guidance:
        max_length = self.text_encoder.text_model.config.max_position_embeddings

        uncond_input = self.tokenizer(
            [neg_prompt] * batch_size,
            padding="max_length",
            max_length=max_length,
            return_tensors="np",
        )
        uncond_embeddings = self.text_encoder(
            input_ids=ms.Tensor.from_numpy(uncond_input.input_ids),
            ctx_embeddings=None,
        )[0]
        # For classifier free guidance, we need to do two forward passes.
        # Here we concatenate the unconditional and text embeddings into a single batch
        # to avoid doing two forward passes
        text_embeddings = ops.cat([uncond_embeddings, text_embeddings])
    scale_down_factor = 2 ** (len(self.unet.config.block_out_channels) - 1)
    latents = self.prepare_latents(
        batch_size=batch_size,
        num_channels=self.unet.config.in_channels,
        height=height // scale_down_factor,
        width=width // scale_down_factor,
        generator=generator,
        latents=latents,
        dtype=self.unet.dtype,
    )
    # set timesteps
    extra_set_kwargs = {}
    self.scheduler.set_timesteps(num_inference_steps, **extra_set_kwargs)

    cond_image = self.prepare_control_image(
        image=condtioning_image,
        width=width,
        height=height,
        batch_size=batch_size,
        num_images_per_prompt=1,
        dtype=self.controlnet.dtype,
        do_classifier_free_guidance=do_classifier_free_guidance,
    )

    for i, t in enumerate(self.progress_bar(self.scheduler.timesteps)):
        # expand the latents if we are doing classifier free guidance
        do_classifier_free_guidance = guidance_scale > 1.0

        latent_model_input = ops.cat([latents] * 2) if do_classifier_free_guidance else latents
        down_block_res_samples, mid_block_res_sample = self.controlnet(
            latent_model_input,
            t,
            encoder_hidden_states=text_embeddings,
            controlnet_cond=cond_image,
            return_dict=False,
        )

        noise_pred = self.unet(
            latent_model_input,
            timestep=t,
            encoder_hidden_states=text_embeddings,
            down_block_additional_residuals=ms.mutable(down_block_res_samples),
            mid_block_additional_residual=mid_block_res_sample,
        )[0]

        # perform guidance
        if do_classifier_free_guidance:
            noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
            noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

        latents_dtype = latents.dtype
        latents = self.scheduler.step(
            noise_pred,
            t,
            latents,
        )[0]
        latents = latents.to(latents_dtype)
    image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
    image = self.image_processor.postprocess(image, output_type=output_type)

    if not return_dict:
        return (image,)

    return ImagePipelineOutput(images=image)
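
The shape bookkeeping in `__call__` can be checked without loading any model. The sketch below reproduces the `scale_down_factor` formula and the batch doubling used for classifier-free guidance. The helper names are hypothetical, and the UNet configuration values (`in_channels=4`, four entries in `block_out_channels`) are assumed, typical Stable Diffusion style values rather than values read from a BLIP-Diffusion checkpoint.

```python
from typing import Sequence, Tuple


def latent_shape(
    batch_size: int,
    in_channels: int,
    height: int,
    width: int,
    block_out_channels: Sequence[int],
) -> Tuple[int, int, int, int]:
    """Hypothetical helper mirroring the arguments passed to `prepare_latents`."""
    # Same formula as in `__call__`: 2 ** (number of UNet blocks - 1).
    scale_down_factor = 2 ** (len(block_out_channels) - 1)
    return (batch_size, in_channels, height // scale_down_factor, width // scale_down_factor)


def unet_batch_size(batch_size: int, guidance_scale: float) -> int:
    """Hypothetical helper: batch size seen by the ControlNet and UNet per step."""
    # With guidance enabled, latents are concatenated with themselves so the
    # unconditional and text-conditioned branches run in a single forward pass.
    return batch_size * 2 if guidance_scale > 1.0 else batch_size


# Assumed configuration values, for illustration only.
shape = latent_shape(1, 4, 512, 512, block_out_channels=(320, 640, 1280, 1280))
print(shape, unet_batch_size(1, 7.5), unet_batch_size(1, 1.0))
```

Under these assumptions a 512x512 request yields 64x64 latents, and enabling guidance doubles the batch that the ControlNet and UNet process at each step.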