FluxControlInpaint

FluxControlInpaintPipeline implements inpainting for the Flux.1 Depth/Canny models. It takes an image and a mask as input, along with a control image such as a depth or Canny map, and returns the inpainted image.
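
To make the mask input concrete: in diffusers-style inpainting pipelines, white mask pixels (255) mark the region to repaint and black pixels (0) mark the region to keep. The sketch below builds such a rectangular mask with plain Python lists. `make_rect_mask` is a hypothetical helper written for illustration, not part of mindone; in practice NumPy slicing followed by `PIL.Image.fromarray` is the more common route.

```python
def make_rect_mask(height, width, top, bottom, left, right):
    """Return a height x width grid: 255 inside the rectangle, 0 elsewhere.

    Hypothetical helper for illustration only. The rectangle covers rows
    top..bottom-1 and columns left..right-1, mirroring slice semantics.
    """
    if not (0 <= top <= bottom <= height and 0 <= left <= right <= width):
        raise ValueError("rectangle must lie inside the image")
    return [
        [255 if top <= y < bottom and left <= x < right else 0 for x in range(width)]
        for y in range(height)
    ]


# A tiny 6 x 8 mask with a 3 x 3 region marked for repainting.
mask = make_rect_mask(height=6, width=8, top=1, bottom=4, left=2, right=5)

# Fraction of the image that the pipeline would be asked to repaint.
repaint_fraction = sum(v == 255 for row in mask for v in row) / (6 * 8)
```

Checking the repaint fraction before running the pipeline is a cheap way to catch a mask that is accidentally empty or covers the whole image.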

FLUX.1 Depth [dev] and FLUX.1 Canny [dev] are 12-billion-parameter rectified flow transformers that generate an image from a text description while following the structure of a given input image. They are not ControlNet models.

Control type	Developer	Link
Depth	Black Forest Labs	Link
Canny	Black Forest Labs	Link

Tip

Flux can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out this section for more details. Additionally, Flux can benefit from quantization for memory efficiency with a trade-off in inference latency. Refer to this blog post to learn more. For an exhaustive list of resources, check out this gist.

```python
import mindspore as ms
from mindone.diffusers import FluxControlInpaintPipeline
from mindone.diffusers.models.transformers import FluxTransformer2DModel
from mindone.transformers import T5EncoderModel
from mindone.diffusers.utils import load_image, make_image_grid
from image_gen_aux import DepthPreprocessor  # https://github.com/huggingface/image_gen_aux
from PIL import Image
import numpy as np

pipe = FluxControlInpaintPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Depth-dev",
    mindspore_dtype=ms.bfloat16,
)
# Use the following lines only if you have NPU constraints;
# otherwise remove them and keep the components loaded above.
# ---------------------------------------------------------------
transformer = FluxTransformer2DModel.from_pretrained(
    "sayakpaul/FLUX.1-Depth-dev-nf4", subfolder="transformer", mindspore_dtype=ms.bfloat16
)
text_encoder_2 = T5EncoderModel.from_pretrained(
    "sayakpaul/FLUX.1-Depth-dev-nf4", subfolder="text_encoder_2", mindspore_dtype=ms.bfloat16
)
pipe.transformer = transformer
pipe.text_encoder_2 = text_encoder_2
# ---------------------------------------------------------------

prompt = "a blue robot singing opera with human-like expressions"
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")

# Build the mask: white (255) marks the head region to repaint, black (0) is kept.
head_mask = np.zeros_like(image)
head_mask[65:580, 300:642] = 255
mask_image = Image.fromarray(head_mask)

# Estimate a depth map from the input image and use it as the control image.
processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf")
control_image = processor(image)[0].convert("RGB")

# The first output field holds the generated images; take the first image.
output = pipe(
    prompt=prompt,
    image=image,
    control_image=control_image,
    mask_image=mask_image,
    num_inference_steps=30,
    strength=0.9,
    guidance_scale=10.0,
    generator=np.random.default_rng(42),
)[0][0]

# Save the input, control image, mask, and result side by side.
make_image_grid([image, control_image, mask_image, output.resize(image.size)], rows=1, cols=4).save("output.png")
```
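
The example above combines `num_inference_steps=30` with `strength=0.9`. As a rough sketch of how those two values interact, assuming this pipeline follows the usual diffusers image-to-image convention (skip the earliest part of the schedule in proportion to `1 - strength`) and a first-order scheduler, the number of denoising steps that actually run can be estimated as follows. `effective_denoising_steps` is a hypothetical helper for illustration, not a mindone function.

```python
def effective_denoising_steps(num_inference_steps, strength):
    """Estimate how many denoising steps run for a given strength.

    Hypothetical helper; assumes the common diffusers image-to-image
    convention and a scheduler of order 1.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    # Portion of the schedule that is actually denoised.
    init_timestep = min(num_inference_steps * strength, num_inference_steps)
    # Index of the first step that is kept; earlier steps are skipped.
    t_start = int(max(num_inference_steps - init_timestep, 0))
    return num_inference_steps - t_start


# Settings taken from the example above.
steps_run = effective_denoising_steps(num_inference_steps=30, strength=0.9)
```

Under these assumptions the example settings run 27 of the 30 scheduled steps. A lower `strength` keeps more of the original image and runs fewer steps, while `strength=1.0` runs the full schedule.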

`mindone.diffusers.FluxControlInpaintPipeline`

Bases: DiffusionPipeline, FluxLoraLoaderMixin, FromSingleFileMixin, TextualInversionLoaderMixin

The Flux pipeline for image inpainting using Flux-dev-Depth/Canny.

Reference: https://blackforestlabs.ai/announcing-black-forest-labs/

PARAMETER	DESCRIPTION	TYPE
`transformer`	Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.	`FluxTransformer2DModel`
`scheduler`	A scheduler to be used in combination with `transformer` to denoise the encoded image latents.	`FlowMatchEulerDiscreteScheduler`
`vae`	Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.	`AutoencoderKL`
`text_encoder`	CLIP, specifically the clip-vit-large-patch14 variant.	`CLIPTextModel`
`text_encoder_2`	T5, specifically the google/t5-v1_1-xxl variant.	`T5EncoderModel`
`tokenizer`	Tokenizer of class CLIPTokenizer.	`CLIPTokenizer`
`tokenizer_2`	Second tokenizer of class T5TokenizerFast.	`T5TokenizerFast`