DiT

Scalable Diffusion Models with Transformers (DiT) is by William Peebles and Saining Xie.

The abstract from the paper is:

We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.

The original codebase can be found at facebookresearch/dit.

Tip

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

`mindone.diffusers.DiTPipeline`

Bases: DiffusionPipeline

Pipeline for image generation based on a Transformer backbone instead of a UNet.

This model inherits from [DiffusionPipeline]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

PARAMETER	DESCRIPTION
`transformer`	A class conditioned `DiTTransformer2DModel` to denoise the encoded image latents. Initially published as `Transformer2DModel` in the config, but the mismatch can be ignored. TYPE: [`DiTTransformer2DModel`]
`vae`	Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. TYPE: [`AutoencoderKL`]
`scheduler`	A scheduler to be used in combination with `transformer` to denoise the encoded image latents. TYPE: [`DDIMScheduler`]

Source code in mindone/diffusers/pipelines/dit/pipeline_dit.py

class DiTPipeline(DiffusionPipeline):
    r"""
    Pipeline for image generation based on a Transformer backbone instead of a UNet.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Parameters:
        transformer ([`DiTTransformer2DModel`]):
            A class conditioned `DiTTransformer2DModel` to denoise the encoded image latents. Initially published as
            [`Transformer2DModel`](https://huggingface.co/facebook/DiT-XL-2-256/blob/main/transformer/config.json#L2)
            in the config, but the mismatch can be ignored.
        vae ([`AutoencoderKL`]):
            Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
        scheduler ([`DDIMScheduler`]):
            A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
    """

    model_cpu_offload_seq = "transformer->vae"

    def __init__(
        self,
        transformer: DiTTransformer2DModel,
        vae: AutoencoderKL,
        scheduler: KarrasDiffusionSchedulers,
        id2label: Optional[Dict[int, str]] = None,
    ):
        super().__init__()
        self.register_modules(transformer=transformer, vae=vae, scheduler=scheduler)

        # create an ImageNet label -> id dictionary for easier use
        self.labels = {}
        if id2label is not None:
            for key, value in id2label.items():
                for label in value.split(","):
                    self.labels[label.strip()] = int(key)
            self.labels = dict(sorted(self.labels.items()))

    def get_label_ids(self, label: Union[str, List[str]]) -> List[int]:
        r"""

        Map label strings from ImageNet to corresponding class ids.

        Parameters:
            label (`str` or `list` of `str`):
                Label strings to be mapped to class ids.

        Returns:
            `list` of `int`:
                Class ids to be processed by pipeline.
        """

        # wrap a single label so it is not split into characters
        if not isinstance(label, list):
            label = [label]

        for i in label:
            if i not in self.labels:
                raise ValueError(
                    f"{i} does not exist. Please make sure to select one of the following labels: \n {self.labels}."
                )

        return [self.labels[i] for i in label]

    def __call__(
        self,
        class_labels: List[int],
        guidance_scale: float = 4.0,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        num_inference_steps: int = 50,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
    ) -> Union[ImagePipelineOutput, Tuple]:
        r"""
        The call function to the pipeline for generation.

        Args:
            class_labels (List[int]):
                List of ImageNet class labels for the images to be generated.
            guidance_scale (`float`, *optional*, defaults to 4.0):
                A higher guidance scale value encourages the model to generate images closely linked to the given
                `class_labels` at the expense of lower image quality. Guidance scale is enabled when
                `guidance_scale > 1`.
            generator (`np.random.Generator`, *optional*):
                A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make
                generation deterministic.
            num_inference_steps (`int`, *optional*, defaults to 50):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between `PIL.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`ImagePipelineOutput`] instead of a plain tuple.

        Examples:

        ```py
        >>> from mindone.diffusers import DiTPipeline, DPMSolverMultistepScheduler
        >>> import mindspore as ms

        >>> import numpy as np

        >>> pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", mindspore_dtype=ms.float16)
        >>> pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

        >>> # pick words from Imagenet class labels
        >>> pipe.labels  # to print all available words

        >>> # pick words that exist in ImageNet
        >>> words = ["white shark", "umbrella"]

        >>> class_ids = pipe.get_label_ids(words)

        >>> generator = np.random.default_rng(33)
        >>> output = pipe(class_labels=class_ids, num_inference_steps=25, generator=generator)

        >>> image = output[0][0]  # label 'white shark'
        ```

        Returns:
            [`~pipelines.ImagePipelineOutput`] or `tuple`:
                If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
                returned where the first element is a list with the generated images
        """

        batch_size = len(class_labels)
        latent_size = self.transformer.config.sample_size
        latent_channels = self.transformer.config.in_channels

        latents = randn_tensor(
            shape=(batch_size, latent_channels, latent_size, latent_size),
            generator=generator,
            dtype=self.transformer.dtype,
        )
        latent_model_input = mint.cat([latents] * 2) if guidance_scale > 1 else latents

        class_labels = mint.reshape(ms.tensor(class_labels), (-1,))
        class_null = ms.tensor([1000] * batch_size)
        class_labels_input = mint.cat([class_labels, class_null], 0) if guidance_scale > 1 else class_labels

        # set step values
        self.scheduler.set_timesteps(num_inference_steps)
        for t in self.progress_bar(self.scheduler.timesteps):
            if guidance_scale > 1:
                half = latent_model_input[: len(latent_model_input) // 2]
                latent_model_input = mint.cat([half, half], dim=0)
            latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

            timesteps = t
            # todo: unavailable mint interface
            if not ops.is_tensor(timesteps):
                if isinstance(timesteps, float):
                    dtype = ms.float32
                else:
                    dtype = ms.int32
                timesteps = ms.tensor([timesteps], dtype=dtype)
            elif len(timesteps.shape) == 0:
                timesteps = timesteps[None]
            # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
            timesteps = timesteps.broadcast_to((latent_model_input.shape[0],))
            # predict noise model_output
            noise_pred = self.transformer(latent_model_input, timestep=timesteps, class_labels=class_labels_input)[0]

            # perform guidance
            if guidance_scale > 1:
                eps, rest = noise_pred[:, :latent_channels], noise_pred[:, latent_channels:]
                cond_eps, uncond_eps = mint.split(eps, len(eps) // 2, dim=0)

                half_eps = uncond_eps + guidance_scale * (cond_eps - uncond_eps)
                eps = mint.cat([half_eps, half_eps], dim=0)

                noise_pred = mint.cat([eps, rest], dim=1)

            # learned sigma
            if self.transformer.config.out_channels // 2 == latent_channels:
                model_output, _ = mint.split(noise_pred, latent_channels, dim=1)
            else:
                model_output = noise_pred

            # compute previous image: x_t -> x_t-1
            latent_model_input = self.scheduler.step(model_output, t, latent_model_input)[0]

        if guidance_scale > 1:
            latents, _ = latent_model_input.chunk(2, dim=0)
        else:
            latents = latent_model_input

        latents = 1 / self.vae.config.scaling_factor * latents
        samples = self.vae.decode(latents)[0]

        samples = (samples / 2 + 0.5).clamp(0, 1)

        # we always cast to float32 as this does not cause significant overhead and is compatible with bfloat16
        samples = samples.permute(0, 2, 3, 1).float().asnumpy()

        if output_type == "pil":
            samples = self.numpy_to_pil(samples)

        if not return_dict:
            return (samples,)

        return ImagePipelineOutput(images=samples)
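
The guidance step in the loop above can be illustrated with plain Python lists. This is a minimal sketch rather than the pipeline's own code: the helper name `apply_guidance` and the scalar "noise predictions" are hypothetical stand-ins for the tensors the pipeline works with. It assumes the batch layout the pipeline builds, with conditional rows first and null-class rows second.

```python
from typing import List


def apply_guidance(eps: List[float], guidance_scale: float) -> List[float]:
    """Combine conditional and unconditional predictions (hypothetical helper).

    `eps` holds the conditional predictions in its first half and the
    unconditional (null-class) predictions in its second half.
    """
    half = len(eps) // 2
    cond, uncond = eps[:half], eps[half:]
    # Move from the unconditional prediction towards the conditional one.
    guided = [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]
    # Both halves receive the same guided value so they stay in sync.
    return guided + guided


# Two samples: conditional predictions 1.0 and 2.0, unconditional 0.5 and 1.0.
result = apply_guidance([1.0, 2.0, 0.5, 1.0], guidance_scale=4.0)
print(result)  # [2.5, 5.0, 2.5, 5.0]
```

With `guidance_scale=1.0` the formula reduces to the conditional prediction, which is consistent with the pipeline skipping the guidance branch unless the scale exceeds 1.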

`mindone.diffusers.DiTPipeline.__call__(class_labels, guidance_scale=4.0, generator=None, num_inference_steps=50, output_type='pil', return_dict=False)`

The call function to the pipeline for generation.

PARAMETER	DESCRIPTION
`class_labels`	List of ImageNet class labels for the images to be generated. TYPE: `List[int]`
`guidance_scale`	A higher guidance scale value encourages the model to generate images closely linked to the given `class_labels` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. TYPE: `float`, optional DEFAULT: `4.0`
`generator`	A `np.random.Generator` to make generation deterministic. TYPE: `np.random.Generator`, optional DEFAULT: `None`
`num_inference_steps`	The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. TYPE: `int`, optional DEFAULT: `50`
`output_type`	The output format of the generated image. Choose between `PIL.Image` or `np.array`. TYPE: `str`, optional DEFAULT: `'pil'`
`return_dict`	Whether or not to return a [`ImagePipelineOutput`] instead of a plain tuple. TYPE: `bool`, optional DEFAULT: `False`

>>> from mindone.diffusers import DiTPipeline, DPMSolverMultistepScheduler
>>> import mindspore as ms

>>> import numpy as np

>>> pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", mindspore_dtype=ms.float16)
>>> pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

>>> # pick words from Imagenet class labels
>>> pipe.labels  # to print all available words

>>> # pick words that exist in ImageNet
>>> words = ["white shark", "umbrella"]

>>> class_ids = pipe.get_label_ids(words)

>>> generator = np.random.default_rng(33)
>>> output = pipe(class_labels=class_ids, num_inference_steps=25, generator=generator)

>>> image = output[0][0]  # label 'white shark'

RETURNS	DESCRIPTION
`Union[ImagePipelineOutput, Tuple]`	[`~pipelines.ImagePipelineOutput`] or `tuple`: If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is returned where the first element is a list with the generated images
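
Before returning, the pipeline rescales the decoded samples from `[-1, 1]` to `[0, 1]` and clamps them; with `output_type="pil"` the result is then converted to images. The sketch below walks through that arithmetic on a flat list of floats. The helper `to_uint8` is hypothetical, and the final rounding to 8-bit values is an assumption about how the PIL conversion quantizes pixels, not a quote of the library code.

```python
from typing import List


def to_uint8(values: List[float]) -> List[int]:
    """Map decoder outputs in [-1, 1] to 8-bit pixel values (hypothetical helper)."""
    pixels = []
    for v in values:
        # Rescale to [0, 1] and clamp out-of-range values.
        unit = min(max(v / 2 + 0.5, 0.0), 1.0)
        # Quantize to 0..255 (assumed to mirror the usual image conversion).
        pixels.append(int(round(unit * 255)))
    return pixels


pixels = to_uint8([-1.0, -0.5, 1.0, 1.7])
print(pixels)  # [0, 64, 255, 255]
```

The last input shows why the clamp matters: a decoder output above 1 would otherwise map past the valid pixel range.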

Source code in mindone/diffusers/pipelines/dit/pipeline_dit.py

def __call__(
    self,
    class_labels: List[int],
    guidance_scale: float = 4.0,
    generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
    num_inference_steps: int = 50,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
) -> Union[ImagePipelineOutput, Tuple]:
    r"""
    The call function to the pipeline for generation.

    Args:
        class_labels (List[int]):
            List of ImageNet class labels for the images to be generated.
        guidance_scale (`float`, *optional*, defaults to 4.0):
            A higher guidance scale value encourages the model to generate images closely linked to the given
            `class_labels` at the expense of lower image quality. Guidance scale is enabled when
            `guidance_scale > 1`.
        generator (`np.random.Generator`, *optional*):
            A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make
            generation deterministic.
        num_inference_steps (`int`, *optional*, defaults to 50):
            The number of denoising steps. More denoising steps usually lead to a higher quality image at the
            expense of slower inference.
        output_type (`str`, *optional*, defaults to `"pil"`):
            The output format of the generated image. Choose between `PIL.Image` or `np.array`.
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`ImagePipelineOutput`] instead of a plain tuple.

    Examples:

    ```py
    >>> from mindone.diffusers import DiTPipeline, DPMSolverMultistepScheduler
    >>> import mindspore as ms

    >>> import numpy as np

    >>> pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", mindspore_dtype=ms.float16)
    >>> pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

    >>> # pick words from Imagenet class labels
    >>> pipe.labels  # to print all available words

    >>> # pick words that exist in ImageNet
    >>> words = ["white shark", "umbrella"]

    >>> class_ids = pipe.get_label_ids(words)

    >>> generator = np.random.default_rng(33)
    >>> output = pipe(class_labels=class_ids, num_inference_steps=25, generator=generator)

    >>> image = output[0][0]  # label 'white shark'
    ```

    Returns:
        [`~pipelines.ImagePipelineOutput`] or `tuple`:
            If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
            returned where the first element is a list with the generated images
    """

    batch_size = len(class_labels)
    latent_size = self.transformer.config.sample_size
    latent_channels = self.transformer.config.in_channels

    latents = randn_tensor(
        shape=(batch_size, latent_channels, latent_size, latent_size),
        generator=generator,
        dtype=self.transformer.dtype,
    )
    latent_model_input = mint.cat([latents] * 2) if guidance_scale > 1 else latents

    class_labels = mint.reshape(ms.tensor(class_labels), (-1,))
    class_null = ms.tensor([1000] * batch_size)
    class_labels_input = mint.cat([class_labels, class_null], 0) if guidance_scale > 1 else class_labels

    # set step values
    self.scheduler.set_timesteps(num_inference_steps)
    for t in self.progress_bar(self.scheduler.timesteps):
        if guidance_scale > 1:
            half = latent_model_input[: len(latent_model_input) // 2]
            latent_model_input = mint.cat([half, half], dim=0)
        latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

        timesteps = t
        # todo: unavailable mint interface
        if not ops.is_tensor(timesteps):
            if isinstance(timesteps, float):
                dtype = ms.float32
            else:
                dtype = ms.int32
            timesteps = ms.tensor([timesteps], dtype=dtype)
        elif len(timesteps.shape) == 0:
            timesteps = timesteps[None]
        # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
        timesteps = timesteps.broadcast_to((latent_model_input.shape[0],))
        # predict noise model_output
        noise_pred = self.transformer(latent_model_input, timestep=timesteps, class_labels=class_labels_input)[0]

        # perform guidance
        if guidance_scale > 1:
            eps, rest = noise_pred[:, :latent_channels], noise_pred[:, latent_channels:]
            cond_eps, uncond_eps = mint.split(eps, len(eps) // 2, dim=0)

            half_eps = uncond_eps + guidance_scale * (cond_eps - uncond_eps)
            eps = mint.cat([half_eps, half_eps], dim=0)

            noise_pred = mint.cat([eps, rest], dim=1)

        # learned sigma
        if self.transformer.config.out_channels // 2 == latent_channels:
            model_output, _ = mint.split(noise_pred, latent_channels, dim=1)
        else:
            model_output = noise_pred

        # compute previous image: x_t -> x_t-1
        latent_model_input = self.scheduler.step(model_output, t, latent_model_input)[0]

    if guidance_scale > 1:
        latents, _ = latent_model_input.chunk(2, dim=0)
    else:
        latents = latent_model_input

    latents = 1 / self.vae.config.scaling_factor * latents
    samples = self.vae.decode(latents)[0]

    samples = (samples / 2 + 0.5).clamp(0, 1)

    # we always cast to float32 as this does not cause significant overhead and is compatible with bfloat16
    samples = samples.permute(0, 2, 3, 1).float().asnumpy()

    if output_type == "pil":
        samples = self.numpy_to_pil(samples)

    if not return_dict:
        return (samples,)

    return ImagePipelineOutput(images=samples)
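
The "learned sigma" branch above keeps only the noise channels when the transformer predicts twice as many output channels as it receives. A minimal sketch of that selection follows; the helper `keep_noise_channels` and the string channel names are hypothetical and stand in for slices of a tensor along the channel dimension.

```python
from typing import List, Sequence


def keep_noise_channels(
    channels: Sequence[str], latent_channels: int, out_channels: int
) -> List[str]:
    """Drop learned-variance channels if the model predicts them (hypothetical helper)."""
    if out_channels // 2 == latent_channels:
        # The first `latent_channels` entries are the noise prediction;
        # the remaining entries are the learned variance and are discarded.
        return list(channels[:latent_channels])
    # The model predicts noise only, so everything is kept.
    return list(channels)


prediction = ["e0", "e1", "e2", "e3", "s0", "s1", "s2", "s3"]
kept = keep_noise_channels(prediction, latent_channels=4, out_channels=8)
print(kept)  # ['e0', 'e1', 'e2', 'e3']
```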

`mindone.diffusers.DiTPipeline.get_label_ids(label)`

Map label strings from ImageNet to corresponding class ids.

PARAMETER	DESCRIPTION
`label`	Label strings to be mapped to class ids. TYPE: `str` or `list` of `str`

RETURNS	DESCRIPTION
`List[int]`	`list` of `int`: Class ids to be processed by pipeline.
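
The label lookup can be reproduced outside the pipeline to see how comma-separated synonyms become individual keys. This is a sketch under stated assumptions: `build_label_map` and `lookup` are hypothetical helpers, the two-entry `id2label` excerpt is illustrative rather than the full ImageNet table, and a bare string is wrapped in a one-element list here.

```python
from typing import Dict, List, Union


def build_label_map(id2label: Dict[int, str]) -> Dict[str, int]:
    """Flatten comma-separated synonyms into a sorted label -> id map."""
    labels: Dict[str, int] = {}
    for key, value in id2label.items():
        for label in value.split(","):
            labels[label.strip()] = int(key)
    return dict(sorted(labels.items()))


def lookup(labels: Dict[str, int], label: Union[str, List[str]]) -> List[int]:
    """Map one label or a list of labels to class ids (hypothetical helper)."""
    if not isinstance(label, list):
        label = [label]  # treat a bare string as a single label
    for name in label:
        if name not in labels:
            raise ValueError(f"{name} does not exist.")
    return [labels[name] for name in label]


# Illustrative excerpt; the real mapping is loaded with the pipeline.
id2label = {2: "great white shark, white shark, man-eater", 879: "umbrella"}
labels = build_label_map(id2label)
class_ids = lookup(labels, ["white shark", "umbrella"])
print(class_ids)  # [2, 879]
```

Because every synonym maps to the same id, `"great white shark"` and `"white shark"` select the same class.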

Source code in mindone/diffusers/pipelines/dit/pipeline_dit.py

def get_label_ids(self, label: Union[str, List[str]]) -> List[int]:
    r"""

    Map label strings from ImageNet to corresponding class ids.

    Parameters:
        label (`str` or `list` of `str`):
            Label strings to be mapped to class ids.

    Returns:
        `list` of `int`:
            Class ids to be processed by pipeline.
    """

    # wrap a single label so it is not split into characters
    if not isinstance(label, list):
        label = [label]

    for i in label:
        if i not in self.labels:
            raise ValueError(
                f"{i} does not exist. Please make sure to select one of the following labels: \n {self.labels}."
            )

    return [self.labels[i] for i in label]

`mindone.diffusers.pipelines.ImagePipelineOutput` `dataclass`

Bases: BaseOutput

Output class for image pipelines.

Source code in mindone/diffusers/pipelines/pipeline_utils.py

@dataclass
class ImagePipelineOutput(BaseOutput):
    """
    Output class for image pipelines.

    Args:
        images (`List[PIL.Image.Image]` or `np.ndarray`)
            List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width,
            num_channels)`.
    """

    images: Union[List[PIL.Image.Image], np.ndarray]
