
DiT

Scalable Diffusion Models with Transformers (DiT) is a paper by William Peebles and Saining Xie.

The abstract from the paper is:

We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.

The original codebase can be found at facebookresearch/dit.

Tip

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

mindone.diffusers.DiTPipeline

Bases: DiffusionPipeline

Pipeline for image generation based on a Transformer backbone instead of a UNet.

This model inherits from [DiffusionPipeline]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

PARAMETER DESCRIPTION
transformer

A class conditioned DiTTransformer2DModel to denoise the encoded image latents.

TYPE: [`DiTTransformer2DModel`]

vae

Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.

TYPE: [`AutoencoderKL`]

scheduler

A scheduler to be used in combination with transformer to denoise the encoded image latents.

TYPE: [`DDIMScheduler`]

Source code in mindone/diffusers/pipelines/dit/pipeline_dit.py
class DiTPipeline(DiffusionPipeline):
    r"""
    Pipeline for image generation based on a Transformer backbone instead of a UNet.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Parameters:
        transformer ([`DiTTransformer2DModel`]):
            A class conditioned `DiTTransformer2DModel` to denoise the encoded image latents.
        vae ([`AutoencoderKL`]):
            Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
        scheduler ([`DDIMScheduler`]):
            A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
    """

    model_cpu_offload_seq = "transformer->vae"

    def __init__(
        self,
        transformer: DiTTransformer2DModel,
        vae: AutoencoderKL,
        scheduler: KarrasDiffusionSchedulers,
        id2label: Optional[Dict[int, str]] = None,
    ):
        super().__init__()
        self.register_modules(transformer=transformer, vae=vae, scheduler=scheduler)

        # build a label -> class id dictionary so ImageNet classes can be looked up by name
        self.labels = {}
        if id2label is not None:
            for key, value in id2label.items():
                for label in value.split(","):
                    self.labels[label.strip()] = int(key)
            self.labels = dict(sorted(self.labels.items()))

    def get_label_ids(self, label: Union[str, List[str]]) -> List[int]:
        r"""

        Map label strings from ImageNet to corresponding class ids.

        Parameters:
            label (`str` or `list` of `str`):
                Label strings to be mapped to class ids.

        Returns:
            `list` of `int`:
                Class ids to be processed by pipeline.
        """

        # wrap a single label in a list; list(label) would split the string into characters
        if isinstance(label, str):
            label = [label]

        for i in label:
            if i not in self.labels:
                raise ValueError(
                    f"{i} does not exist. Please make sure to select one of the following labels: \n {self.labels}."
                )

        return [self.labels[i] for i in label]

    def __call__(
        self,
        class_labels: List[int],
        guidance_scale: float = 4.0,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        num_inference_steps: int = 50,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
    ) -> Union[ImagePipelineOutput, Tuple]:
        r"""
        The call function to the pipeline for generation.

        Args:
            class_labels (List[int]):
                List of ImageNet class labels for the images to be generated.
            guidance_scale (`float`, *optional*, defaults to 4.0):
                A higher guidance scale value encourages the model to generate images closely linked to the
                requested `class_labels` at the expense of lower image quality. Guidance scale is enabled when
                `guidance_scale > 1`.
            generator (`np.random.Generator`, *optional*):
                A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make
                generation deterministic.
            num_inference_steps (`int`, *optional*, defaults to 50):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Use `"pil"` for a list of `PIL.Image` objects; any other
                value returns an `np.ndarray`.
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`ImagePipelineOutput`] instead of a plain tuple.

        Examples:

        ```py
        >>> from mindone.diffusers import DiTPipeline, DPMSolverMultistepScheduler
        >>> import mindspore as ms

        >>> import numpy as np

        >>> pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", mindspore_dtype=ms.float16)
        >>> pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

        >>> # pick words from Imagenet class labels
        >>> pipe.labels  # to print all available words

        >>> # pick words that exist in ImageNet
        >>> words = ["white shark", "umbrella"]

        >>> class_ids = pipe.get_label_ids(words)

        >>> generator = np.random.default_rng(33)
        >>> output = pipe(class_labels=class_ids, num_inference_steps=25, generator=generator)

        >>> image = output[0][0]  # label 'white shark'
        ```

        Returns:
            [`~pipelines.ImagePipelineOutput`] or `tuple`:
                If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
                returned where the first element is a list with the generated images
        """

        batch_size = len(class_labels)
        latent_size = self.transformer.config.sample_size
        latent_channels = self.transformer.config.in_channels

        latents = randn_tensor(
            shape=(batch_size, latent_channels, latent_size, latent_size),
            generator=generator,
            dtype=self.transformer.dtype,
        )
        latent_model_input = ops.cat([latents] * 2) if guidance_scale > 1 else latents

        class_labels = ms.Tensor(class_labels).reshape(-1)
        class_null = ms.Tensor([1000] * batch_size)
        class_labels_input = ops.cat([class_labels, class_null], 0) if guidance_scale > 1 else class_labels

        # set step values
        self.scheduler.set_timesteps(num_inference_steps)
        for t in self.progress_bar(self.scheduler.timesteps):
            if guidance_scale > 1:
                half = latent_model_input[: len(latent_model_input) // 2]
                latent_model_input = ops.cat([half, half], axis=0)
            latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

            timesteps = t
            if not ops.is_tensor(timesteps):
                # TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
                # This would be a good case for the `match` statement (Python 3.10+)
                is_mps = False
                if isinstance(timesteps, float):
                    dtype = ms.float32 if is_mps else ms.float64
                else:
                    dtype = ms.int32 if is_mps else ms.int64
                timesteps = ms.Tensor([timesteps], dtype=dtype)
            elif len(timesteps.shape) == 0:
                timesteps = timesteps[None]
            # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
            timesteps = timesteps.broadcast_to((latent_model_input.shape[0],))
            # predict noise model_output
            noise_pred = self.transformer(latent_model_input, timestep=timesteps, class_labels=class_labels_input)[0]

            # perform guidance
            if guidance_scale > 1:
                eps, rest = noise_pred[:, :latent_channels], noise_pred[:, latent_channels:]
                cond_eps, uncond_eps = ops.split(eps, len(eps) // 2, axis=0)

                half_eps = uncond_eps + guidance_scale * (cond_eps - uncond_eps)
                eps = ops.cat([half_eps, half_eps], axis=0)

                noise_pred = ops.cat([eps, rest], axis=1)

            # learned sigma
            if self.transformer.config.out_channels // 2 == latent_channels:
                model_output, _ = ops.split(noise_pred, latent_channels, axis=1)
            else:
                model_output = noise_pred

            # compute previous image: x_t -> x_t-1
            latent_model_input = self.scheduler.step(model_output, t, latent_model_input)[0]

        if guidance_scale > 1:
            latents, _ = latent_model_input.chunk(2, axis=0)
        else:
            latents = latent_model_input

        latents = 1 / self.vae.config.scaling_factor * latents
        samples = self.vae.decode(latents)[0]

        samples = (samples / 2 + 0.5).clamp(0, 1)

        # we always cast to float32 as this does not cause significant overhead and is compatible with bfloat16
        samples = samples.permute(0, 2, 3, 1).float().asnumpy()

        if output_type == "pil":
            samples = self.numpy_to_pil(samples)

        if not return_dict:
            return (samples,)

        return ImagePipelineOutput(images=samples)
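
The "learned sigma" branch in the source above can be read as follows: when the transformer emits twice as many channels as the latents have, only the first `latent_channels` channels are treated as the noise prediction and handed to the scheduler. Below is a minimal sketch of that selection, assuming plain Python lists in place of MindSpore tensors; the helper name `select_noise_channels` and the sample values are hypothetical.

```python
from typing import List

Channel = List[float]


def select_noise_channels(model_out: List[Channel], latent_channels: int, out_channels: int) -> List[Channel]:
    """Keep only the noise-prediction channels when the model also predicts a variance."""
    if out_channels // 2 == latent_channels:
        return model_out[:latent_channels]  # drop the learned-sigma half
    return model_out


# one sample with 4 latent channels and 8 output channels (hypothetical values)
sample = [[float(c)] for c in range(8)]
print(len(select_noise_channels(sample, latent_channels=4, out_channels=8)))  # 4
print(len(select_noise_channels(sample[:4], latent_channels=4, out_channels=4)))  # 4
```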

mindone.diffusers.DiTPipeline.__call__(class_labels, guidance_scale=4.0, generator=None, num_inference_steps=50, output_type='pil', return_dict=False)

The call function to the pipeline for generation.

PARAMETER DESCRIPTION
class_labels

List of ImageNet class labels for the images to be generated.

TYPE: List[int]
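
As a rough sketch of how the label batch is prepared for classifier-free guidance in the pipeline source: when `guidance_scale > 1`, a second half filled with the null class id `1000` is appended to the requested labels. The helper name `build_class_label_input` is hypothetical, and plain Python lists stand in for MindSpore tensors.

```python
from typing import List

NULL_CLASS_ID = 1000  # id used for the unconditional ("null") class in the pipeline source


def build_class_label_input(class_labels: List[int], guidance_scale: float) -> List[int]:
    """Hypothetical helper: return the labels, extended with null ids when guidance is on."""
    if guidance_scale > 1:
        return list(class_labels) + [NULL_CLASS_ID] * len(class_labels)
    return list(class_labels)


print(build_class_label_input([2, 879], guidance_scale=4.0))  # [2, 879, 1000, 1000]
print(build_class_label_input([2, 879], guidance_scale=1.0))  # [2, 879]
```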

guidance_scale

A higher guidance scale value encourages the model to generate images closely linked to the requested class_labels at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

TYPE: `float`, *optional* DEFAULT: 4.0
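
The guidance step in the pipeline source combines the conditional and unconditional noise predictions as `uncond_eps + guidance_scale * (cond_eps - uncond_eps)`. A minimal sketch of that arithmetic on plain Python lists follows; the function name `apply_guidance` and the input values are illustrative assumptions, not part of the library.

```python
from typing import List


def apply_guidance(cond_eps: List[float], uncond_eps: List[float], guidance_scale: float) -> List[float]:
    """Combine conditional and unconditional noise predictions element-wise."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond_eps, uncond_eps)]


cond = [0.5, -1.0, 2.0]
uncond = [0.0, -1.0, 1.0]

# a scale of 1.0 reproduces the conditional prediction; larger values extrapolate past it
print(apply_guidance(cond, uncond, 1.0))  # [0.5, -1.0, 2.0]
print(apply_guidance(cond, uncond, 4.0))  # [2.0, -1.0, 5.0]
```

Because a scale of 1 leaves the conditional prediction unchanged, the pipeline skips the doubled batch entirely unless `guidance_scale > 1`.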

generator

A np.random.Generator to make generation deterministic.

TYPE: `np.random.Generator`, *optional* DEFAULT: None

num_inference_steps

The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional* DEFAULT: 50

output_type

The output format of the generated image. Use "pil" for a list of PIL.Image objects; any other value returns an np.ndarray.

TYPE: `str`, *optional* DEFAULT: 'pil'
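
Before any conversion to the requested output type, the pipeline source maps the decoded samples from the range [-1, 1] to [0, 1] with `(samples / 2 + 0.5).clamp(0, 1)`. The sketch below reproduces that step on a flat list of floats. The final scaling to 8-bit values is an assumption about what the PIL conversion does, and `to_uint8_pixels` is a hypothetical name.

```python
from typing import List


def to_uint8_pixels(decoded: List[float]) -> List[int]:
    """Map decoder outputs from [-1, 1] to 8-bit pixel values."""
    pixels = []
    for value in decoded:
        unit = min(max(value / 2 + 0.5, 0.0), 1.0)  # same as (x / 2 + 0.5).clamp(0, 1)
        pixels.append(int(round(unit * 255)))  # assumed 8-bit conversion for PIL output
    return pixels


print(to_uint8_pixels([-1.0, 0.0, 1.0, 1.7, -3.0]))  # [0, 128, 255, 255, 0]
```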

return_dict

Whether or not to return a [ImagePipelineOutput] instead of a plain tuple.

TYPE: `bool`, *optional* DEFAULT: False

>>> from mindone.diffusers import DiTPipeline, DPMSolverMultistepScheduler
>>> import mindspore as ms

>>> import numpy as np

>>> pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", mindspore_dtype=ms.float16)
>>> pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

>>> # pick words from Imagenet class labels
>>> pipe.labels  # to print all available words

>>> # pick words that exist in ImageNet
>>> words = ["white shark", "umbrella"]

>>> class_ids = pipe.get_label_ids(words)

>>> generator = np.random.default_rng(33)
>>> output = pipe(class_labels=class_ids, num_inference_steps=25, generator=generator)

>>> image = output[0][0]  # label 'white shark'
RETURNS DESCRIPTION
Union[ImagePipelineOutput, Tuple]

[~pipelines.ImagePipelineOutput] or tuple: If return_dict is True, [~pipelines.ImagePipelineOutput] is returned, otherwise a tuple is returned where the first element is a list with the generated images

Source code in mindone/diffusers/pipelines/dit/pipeline_dit.py
def __call__(
    self,
    class_labels: List[int],
    guidance_scale: float = 4.0,
    generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
    num_inference_steps: int = 50,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
) -> Union[ImagePipelineOutput, Tuple]:
    r"""
    The call function to the pipeline for generation.

    Args:
        class_labels (List[int]):
            List of ImageNet class labels for the images to be generated.
        guidance_scale (`float`, *optional*, defaults to 4.0):
            A higher guidance scale value encourages the model to generate images closely linked to the
            requested `class_labels` at the expense of lower image quality. Guidance scale is enabled when
            `guidance_scale > 1`.
        generator (`np.random.Generator`, *optional*):
            A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make
            generation deterministic.
        num_inference_steps (`int`, *optional*, defaults to 50):
            The number of denoising steps. More denoising steps usually lead to a higher quality image at the
            expense of slower inference.
        output_type (`str`, *optional*, defaults to `"pil"`):
            The output format of the generated image. Use `"pil"` for a list of `PIL.Image` objects; any other
            value returns an `np.ndarray`.
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`ImagePipelineOutput`] instead of a plain tuple.

    Examples:

    ```py
    >>> from mindone.diffusers import DiTPipeline, DPMSolverMultistepScheduler
    >>> import mindspore as ms

    >>> import numpy as np

    >>> pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", mindspore_dtype=ms.float16)
    >>> pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

    >>> # pick words from Imagenet class labels
    >>> pipe.labels  # to print all available words

    >>> # pick words that exist in ImageNet
    >>> words = ["white shark", "umbrella"]

    >>> class_ids = pipe.get_label_ids(words)

    >>> generator = np.random.default_rng(33)
    >>> output = pipe(class_labels=class_ids, num_inference_steps=25, generator=generator)

    >>> image = output[0][0]  # label 'white shark'
    ```

    Returns:
        [`~pipelines.ImagePipelineOutput`] or `tuple`:
            If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
            returned where the first element is a list with the generated images
    """

    batch_size = len(class_labels)
    latent_size = self.transformer.config.sample_size
    latent_channels = self.transformer.config.in_channels

    latents = randn_tensor(
        shape=(batch_size, latent_channels, latent_size, latent_size),
        generator=generator,
        dtype=self.transformer.dtype,
    )
    latent_model_input = ops.cat([latents] * 2) if guidance_scale > 1 else latents

    class_labels = ms.Tensor(class_labels).reshape(-1)
    class_null = ms.Tensor([1000] * batch_size)
    class_labels_input = ops.cat([class_labels, class_null], 0) if guidance_scale > 1 else class_labels

    # set step values
    self.scheduler.set_timesteps(num_inference_steps)
    for t in self.progress_bar(self.scheduler.timesteps):
        if guidance_scale > 1:
            half = latent_model_input[: len(latent_model_input) // 2]
            latent_model_input = ops.cat([half, half], axis=0)
        latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

        timesteps = t
        if not ops.is_tensor(timesteps):
            # TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
            # This would be a good case for the `match` statement (Python 3.10+)
            is_mps = False
            if isinstance(timesteps, float):
                dtype = ms.float32 if is_mps else ms.float64
            else:
                dtype = ms.int32 if is_mps else ms.int64
            timesteps = ms.Tensor([timesteps], dtype=dtype)
        elif len(timesteps.shape) == 0:
            timesteps = timesteps[None]
        # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
        timesteps = timesteps.broadcast_to((latent_model_input.shape[0],))
        # predict noise model_output
        noise_pred = self.transformer(latent_model_input, timestep=timesteps, class_labels=class_labels_input)[0]

        # perform guidance
        if guidance_scale > 1:
            eps, rest = noise_pred[:, :latent_channels], noise_pred[:, latent_channels:]
            cond_eps, uncond_eps = ops.split(eps, len(eps) // 2, axis=0)

            half_eps = uncond_eps + guidance_scale * (cond_eps - uncond_eps)
            eps = ops.cat([half_eps, half_eps], axis=0)

            noise_pred = ops.cat([eps, rest], axis=1)

        # learned sigma
        if self.transformer.config.out_channels // 2 == latent_channels:
            model_output, _ = ops.split(noise_pred, latent_channels, axis=1)
        else:
            model_output = noise_pred

        # compute previous image: x_t -> x_t-1
        latent_model_input = self.scheduler.step(model_output, t, latent_model_input)[0]

    if guidance_scale > 1:
        latents, _ = latent_model_input.chunk(2, axis=0)
    else:
        latents = latent_model_input

    latents = 1 / self.vae.config.scaling_factor * latents
    samples = self.vae.decode(latents)[0]

    samples = (samples / 2 + 0.5).clamp(0, 1)

    # we always cast to float32 as this does not cause significant overhead and is compatible with bfloat16
    samples = samples.permute(0, 2, 3, 1).float().asnumpy()

    if output_type == "pil":
        samples = self.numpy_to_pil(samples)

    if not return_dict:
        return (samples,)

    return ImagePipelineOutput(images=samples)

mindone.diffusers.DiTPipeline.get_label_ids(label)

Map label strings from ImageNet to corresponding class ids.

PARAMETER DESCRIPTION
label

Label strings to be mapped to class ids.

TYPE: `str` or `list` of `str`

RETURNS DESCRIPTION
List[int]

list of int: Class ids to be processed by pipeline.
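
The label lookup can be sketched in plain Python as follows: each `id2label` entry holds comma-separated synonyms, every synonym is stripped and mapped to its class id, and requested names are then resolved against that table. The function names, the wrapping of a single string in a list, and the two-entry `id2label` excerpt are assumptions made for illustration.

```python
from typing import Dict, List, Union


def build_label_map(id2label: Dict[int, str]) -> Dict[str, int]:
    """Split comma-separated synonyms and map each one to its class id."""
    labels = {}
    for class_id, names in id2label.items():
        for name in names.split(","):
            labels[name.strip()] = int(class_id)
    return dict(sorted(labels.items()))


def lookup_label_ids(labels: Dict[str, int], label: Union[str, List[str]]) -> List[int]:
    """Hypothetical lookup that wraps a single string in a list before mapping."""
    names = [label] if isinstance(label, str) else list(label)
    missing = [n for n in names if n not in labels]
    if missing:
        raise ValueError(f"Unknown labels: {missing}")
    return [labels[n] for n in names]


# tiny illustrative excerpt in the style of an ImageNet id2label table
label_map = build_label_map({2: "great white shark, white shark", 879: "umbrella"})
print(lookup_label_ids(label_map, ["white shark", "umbrella"]))  # [2, 879]
print(lookup_label_ids(label_map, "umbrella"))  # [879]
```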

Source code in mindone/diffusers/pipelines/dit/pipeline_dit.py
def get_label_ids(self, label: Union[str, List[str]]) -> List[int]:
    r"""

    Map label strings from ImageNet to corresponding class ids.

    Parameters:
        label (`str` or `list` of `str`):
            Label strings to be mapped to class ids.

    Returns:
        `list` of `int`:
            Class ids to be processed by pipeline.
    """

    # wrap a single label in a list; list(label) would split the string into characters
    if isinstance(label, str):
        label = [label]

    for i in label:
        if i not in self.labels:
            raise ValueError(
                f"{i} does not exist. Please make sure to select one of the following labels: \n {self.labels}."
            )

    return [self.labels[i] for i in label]

mindone.diffusers.pipelines.ImagePipelineOutput dataclass

Bases: BaseOutput

Output class for image pipelines.

Source code in mindone/diffusers/pipelines/pipeline_utils.py
@dataclass
class ImagePipelineOutput(BaseOutput):
    """
    Output class for image pipelines.

    Args:
        images (`List[PIL.Image.Image]` or `np.ndarray`):
            List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width,
            num_channels)`.
    """

    images: Union[List[PIL.Image.Image], np.ndarray]
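
To illustrate how the two return shapes of `__call__` are consumed, here is a small sketch that mirrors the last lines of the pipeline: a one-element tuple when `return_dict` is false, otherwise an output object with an `images` field. `ImageOutputSketch` and `finish` are hypothetical stand-ins, and strings take the place of PIL images.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class ImageOutputSketch:
    """Hypothetical stand-in for ImagePipelineOutput; holds the generated images."""
    images: List[str]


def finish(samples: List[str], return_dict: bool = False) -> Union[ImageOutputSketch, Tuple[List[str]]]:
    """Mirror the end of __call__: a one-element tuple or an output object."""
    if not return_dict:
        return (samples,)
    return ImageOutputSketch(images=samples)


samples = ["image_0", "image_1"]  # placeholders for PIL images
print(finish(samples)[0][0])  # image_0
print(finish(samples, return_dict=True).images[0])  # image_0
```

With the default `return_dict=False`, the first image is therefore reached as `output[0][0]`, which matches the usage example in the `__call__` docstring.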