PixArtTransformer2DModel¶

A Transformer model for image-like data from PixArt-Alpha and PixArt-Sigma.

`mindone.diffusers.PixArtTransformer2DModel` ¶

Bases: ModelMixin, ConfigMixin

A 2D Transformer model as introduced in PixArt family of models (https://huggingface.co/papers/2310.00426, https://huggingface.co/papers/2403.04692).

PARAMETER	DESCRIPTION
`num_attention_heads`	The number of heads to use for multi-head attention. TYPE: `int, optional, defaults to 16` DEFAULT: `16`
`attention_head_dim`	The number of channels in each head. TYPE: `int, optional, defaults to 72` DEFAULT: `72`
`in_channels`	The number of channels in the input. TYPE: `int, defaults to 4` DEFAULT: `4`
`out_channels`	The number of channels in the output. Specify this parameter if the output channel number differs from the input. TYPE: `int` DEFAULT: `8`
`num_layers`	The number of layers of Transformer blocks to use. TYPE: `int, optional, defaults to 28` DEFAULT: `28`
`dropout`	The dropout probability to use within the Transformer blocks. TYPE: `float, optional, defaults to 0.0` DEFAULT: `0.0`
`norm_num_groups`	Number of groups for group normalization within Transformer blocks. TYPE: `int, optional, defaults to 32` DEFAULT: `32`
`cross_attention_dim`	The dimensionality for cross-attention layers, typically matching the encoder's hidden dimension. TYPE: `int` DEFAULT: `1152`
`attention_bias`	Configure if the Transformer blocks' attention should contain a bias parameter. TYPE: `bool, optional, defaults to True` DEFAULT: `True`
`sample_size`	The width of the latent images. This parameter is fixed during training. TYPE: `int, defaults to 128` DEFAULT: `128`
`patch_size`	Size of the patches the model processes, relevant for architectures working on non-sequential data. TYPE: `int, defaults to 2` DEFAULT: `2`
`activation_fn`	Activation function to use in feed-forward networks within Transformer blocks. TYPE: `str, optional, defaults to "gelu-approximate"` DEFAULT: `'gelu-approximate'`
`num_embeds_ada_norm`	Number of embeddings for AdaLayerNorm, fixed during training and affects the maximum denoising steps during inference. TYPE: `int, optional, defaults to 1000` DEFAULT: `1000`
`upcast_attention`	If true, upcasts the attention mechanism dimensions for potentially improved performance. TYPE: `bool, optional, defaults to False` DEFAULT: `False`
`norm_type`	Specifies the type of normalization used, can be 'ada_norm_zero'. TYPE: `str, optional, defaults to "ada_norm_zero"` DEFAULT: `'ada_norm_single'`
`norm_elementwise_affine`	If true, enables element-wise affine parameters in the normalization layers. TYPE: `bool, optional, defaults to False` DEFAULT: `False`
`norm_eps`	A small constant added to the denominator in normalization layers to prevent division by zero. TYPE: `float, optional, defaults to 1e-6` DEFAULT: `1e-06`
`interpolation_scale`	Scale factor to use during interpolating the position embeddings. TYPE: `int` DEFAULT: `None`
`use_additional_conditions`	If we're using additional conditions as inputs. TYPE: `bool` DEFAULT: `None`
`attention_type`	Kind of attention mechanism to be used. TYPE: `str, optional, defaults to "default"` DEFAULT: `'default'`
`caption_channels`	Number of channels to use for projecting the caption embeddings. TYPE: `int, optional, defaults to None` DEFAULT: `None`
`use_linear_projection`	Deprecated argument. Will be removed in a future version. TYPE: `bool, optional, defaults to False`
`num_vector_embeds`	Deprecated argument. Will be removed in a future version. TYPE: `bool, optional, defaults to False`

Source code in mindone/diffusers/models/transformers/pixart_transformer_2d.py

class PixArtTransformer2DModel(ModelMixin, ConfigMixin):
    r"""
    A 2D Transformer model as introduced in PixArt family of models (https://huggingface.co/papers/2310.00426,
    https://huggingface.co/papers/2403.04692).

    Parameters:
        num_attention_heads (int, optional, defaults to 16): The number of heads to use for multi-head attention.
        attention_head_dim (int, optional, defaults to 72): The number of channels in each head.
        in_channels (int, defaults to 4): The number of channels in the input.
        out_channels (int, optional):
            The number of channels in the output. Specify this parameter if the output channel number differs from the
            input.
        num_layers (int, optional, defaults to 28): The number of layers of Transformer blocks to use.
        dropout (float, optional, defaults to 0.0): The dropout probability to use within the Transformer blocks.
        norm_num_groups (int, optional, defaults to 32):
            Number of groups for group normalization within Transformer blocks.
        cross_attention_dim (int, optional):
            The dimensionality for cross-attention layers, typically matching the encoder's hidden dimension.
        attention_bias (bool, optional, defaults to True):
            Configure if the Transformer blocks' attention should contain a bias parameter.
        sample_size (int, defaults to 128):
            The width of the latent images. This parameter is fixed during training.
        patch_size (int, defaults to 2):
            Size of the patches the model processes, relevant for architectures working on non-sequential data.
        activation_fn (str, optional, defaults to "gelu-approximate"):
            Activation function to use in feed-forward networks within Transformer blocks.
        num_embeds_ada_norm (int, optional, defaults to 1000):
            Number of embeddings for AdaLayerNorm, fixed during training and affects the maximum denoising steps during
            inference.
        upcast_attention (bool, optional, defaults to False):
            If true, upcasts the attention mechanism dimensions for potentially improved performance.
        norm_type (str, optional, defaults to "ada_norm_zero"):
            Specifies the type of normalization used, can be 'ada_norm_zero'.
        norm_elementwise_affine (bool, optional, defaults to False):
            If true, enables element-wise affine parameters in the normalization layers.
        norm_eps (float, optional, defaults to 1e-6):
            A small constant added to the denominator in normalization layers to prevent division by zero.
        interpolation_scale (int, optional): Scale factor to use during interpolating the position embeddings.
        use_additional_conditions (bool, optional): If we're using additional conditions as inputs.
        attention_type (str, optional, defaults to "default"): Kind of attention mechanism to be used.
        caption_channels (int, optional, defaults to None):
            Number of channels to use for projecting the caption embeddings.
        use_linear_projection (bool, optional, defaults to False):
            Deprecated argument. Will be removed in a future version.
        num_vector_embeds (bool, optional, defaults to False):
            Deprecated argument. Will be removed in a future version.
    """

    _supports_gradient_checkpointing = True
    _no_split_modules = ["BasicTransformerBlock", "PatchEmbed"]
    _skip_layerwise_casting_patterns = ["pos_embed", "norm", "adaln_single"]

    @register_to_config
    def __init__(
        self,
        num_attention_heads: int = 16,
        attention_head_dim: int = 72,
        in_channels: int = 4,
        out_channels: Optional[int] = 8,
        num_layers: int = 28,
        dropout: float = 0.0,
        norm_num_groups: int = 32,
        cross_attention_dim: Optional[int] = 1152,
        attention_bias: bool = True,
        sample_size: int = 128,
        patch_size: int = 2,
        activation_fn: str = "gelu-approximate",
        num_embeds_ada_norm: Optional[int] = 1000,
        upcast_attention: bool = False,
        norm_type: str = "ada_norm_single",
        norm_elementwise_affine: bool = False,
        norm_eps: float = 1e-6,
        interpolation_scale: Optional[int] = None,
        use_additional_conditions: Optional[bool] = None,
        caption_channels: Optional[int] = None,
        attention_type: Optional[str] = "default",
    ):
        super().__init__()

        # Validate inputs.
        if norm_type != "ada_norm_single":
            raise NotImplementedError(
                f"Forward pass is not implemented when `patch_size` is not None and `norm_type` is '{norm_type}'."
            )
        elif norm_type == "ada_norm_single" and num_embeds_ada_norm is None:
            raise ValueError(
                f"When using a `patch_size` and this `norm_type` ({norm_type}), `num_embeds_ada_norm` cannot be None."
            )

        # Set some common variables used across the board.
        self.attention_head_dim = attention_head_dim
        self.inner_dim = self.config.num_attention_heads * self.config.attention_head_dim
        self.out_channels = in_channels if out_channels is None else out_channels
        if use_additional_conditions is None:
            if sample_size == 128:
                use_additional_conditions = True
            else:
                use_additional_conditions = False
        self.use_additional_conditions = use_additional_conditions

        self.gradient_checkpointing = False

        # 2. Initialize the position embedding and transformer blocks.
        self.height = self.config.sample_size
        self.width = self.config.sample_size

        interpolation_scale = (
            self.config.interpolation_scale
            if self.config.interpolation_scale is not None
            else max(self.config.sample_size // 64, 1)
        )
        self.pos_embed = PatchEmbed(
            height=self.config.sample_size,
            width=self.config.sample_size,
            patch_size=self.config.patch_size,
            in_channels=self.config.in_channels,
            embed_dim=self.inner_dim,
            interpolation_scale=interpolation_scale,
        )

        self.transformer_blocks = nn.CellList(
            [
                BasicTransformerBlock(
                    self.inner_dim,
                    self.config.num_attention_heads,
                    self.config.attention_head_dim,
                    dropout=self.config.dropout,
                    cross_attention_dim=self.config.cross_attention_dim,
                    activation_fn=self.config.activation_fn,
                    num_embeds_ada_norm=self.config.num_embeds_ada_norm,
                    attention_bias=self.config.attention_bias,
                    upcast_attention=self.config.upcast_attention,
                    norm_type=norm_type,
                    norm_elementwise_affine=self.config.norm_elementwise_affine,
                    norm_eps=self.config.norm_eps,
                    attention_type=self.config.attention_type,
                )
                for _ in range(self.config.num_layers)
            ]
        )

        # 3. Output blocks.
        self.norm_out = LayerNorm(self.inner_dim, elementwise_affine=False, eps=1e-6)
        self.scale_shift_table = ms.Parameter(
            mint.randn(2, self.inner_dim) / self.inner_dim**0.5, name="scale_shift_table"
        )
        self.proj_out = mint.nn.Linear(
            self.inner_dim, self.config.patch_size * self.config.patch_size * self.out_channels
        )

        self.adaln_single = AdaLayerNormSingle(self.inner_dim, use_additional_conditions=self.use_additional_conditions)
        self.caption_projection = None
        if self.config.caption_channels is not None:
            self.caption_projection = PixArtAlphaTextProjection(
                in_features=self.config.caption_channels, hidden_size=self.inner_dim
            )

    @property
    def attn_processors(self) -> Dict[str, AttentionProcessor]:
        r"""
        Returns:
            `dict` of attention processors: A dictionary containing all attention processors used in the model with
            indexed by its weight name.
        """
        # set recursively
        processors = {}

        def fn_recursive_add_processors(name: str, module: nn.Cell, processors: Dict[str, AttentionProcessor]):
            if hasattr(module, "get_processor"):
                processors[f"{name}.processor"] = module.get_processor()

            for sub_name, child in module.name_cells().items():
                fn_recursive_add_processors(f"{name}.{sub_name}", child, processors)

            return processors

        for name, module in self.name_cells().items():
            fn_recursive_add_processors(name, module, processors)

        return processors

    def set_attn_processor(self, processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]]):
        r"""
        Sets the attention processor to use to compute attention.

        Parameters:
            processor (`dict` of `AttentionProcessor` or only `AttentionProcessor`):
                The instantiated processor class or a dictionary of processor classes that will be set as the processor
                for **all** `Attention` layers.

                If `processor` is a dict, the key needs to define the path to the corresponding cross attention
                processor. This is strongly recommended when setting trainable attention processors.

        """
        count = len(self.attn_processors.keys())

        if isinstance(processor, dict) and len(processor) != count:
            raise ValueError(
                f"A dict of processors was passed, but the number of processors {len(processor)} does not match the"
                f" number of attention layers: {count}. Please make sure to pass {count} processor classes."
            )

        def fn_recursive_attn_processor(name: str, module: nn.Cell, processor):
            if hasattr(module, "set_processor"):
                if not isinstance(processor, dict):
                    module.set_processor(processor)
                else:
                    module.set_processor(processor.pop(f"{name}.processor"))

            for sub_name, child in module.name_cells().items():
                fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)

        for name, module in self.name_cells().items():
            fn_recursive_attn_processor(name, module, processor)

    def set_default_attn_processor(self):
        """
        Disables custom attention processors and sets the default attention implementation.

        Safe to just use `AttnProcessor()` as PixArt doesn't have any exotic attention processors in default model.
        """
        self.set_attn_processor(AttnProcessor())

    def construct(
        self,
        hidden_states: ms.Tensor,
        encoder_hidden_states: Optional[ms.Tensor] = None,
        timestep: Optional[ms.Tensor] = None,
        added_cond_kwargs: Dict[str, ms.Tensor] = None,
        cross_attention_kwargs: Dict[str, Any] = None,
        attention_mask: Optional[ms.Tensor] = None,
        encoder_attention_mask: Optional[ms.Tensor] = None,
        return_dict: bool = False,
    ):
        """
        The [`PixArtTransformer2DModel`] forward method.

        Args:
            hidden_states (`ms.Tensor` of shape `(batch size, channel, height, width)`):
                Input `hidden_states`.
            encoder_hidden_states (`ms.Tensor` of shape `(batch size, sequence len, embed dims)`, *optional*):
                Conditional embeddings for cross attention layer. If not given, cross-attention defaults to
                self-attention.
            timestep (`ms.Tensor`, *optional*):
                Used to indicate denoising step. Optional timestep to be applied as an embedding in `AdaLayerNorm`.
            added_cond_kwargs: (`Dict[str, Any]`, *optional*): Additional conditions to be used as inputs.
            cross_attention_kwargs ( `Dict[str, Any]`, *optional*):
                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
                `self.processor` in
                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
            attention_mask ( `ms.Tensor`, *optional*):
                An attention mask of shape `(batch, key_tokens)` is applied to `encoder_hidden_states`. If `1` the mask
                is kept, otherwise if `0` it is discarded. Mask will be converted into a bias, which adds large
                negative values to the attention scores corresponding to "discard" tokens.
            encoder_attention_mask ( `ms.Tensor`, *optional*):
                Cross-attention mask applied to `encoder_hidden_states`. Two formats supported:

                    * Mask `(batch, sequence_length)` True = keep, False = discard.
                    * Bias `(batch, 1, sequence_length)` 0 = keep, -10000 = discard.

                If `ndim == 2`: will be interpreted as a mask, then converted into a bias consistent with the format
                above. This bias will be added to the cross-attention scores.
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~models.unets.unet_2d_condition.UNet2DConditionOutput`] instead of a plain
                tuple.

        Returns:
            If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
            `tuple` where the first element is the sample tensor.
        """
        if self.use_additional_conditions and added_cond_kwargs is None:
            raise ValueError("`added_cond_kwargs` cannot be None when using additional conditions for `adaln_single`.")

        # ensure attention_mask is a bias, and give it a singleton query_tokens dimension.
        #   we may have done this conversion already, e.g. if we came here via UNet2DConditionModel#forward.
        #   we can tell by counting dims; if ndim == 2: it's a mask rather than a bias.
        # expects mask of shape:
        #   [batch, key_tokens]
        # adds singleton query_tokens dimension:
        #   [batch,                    1, key_tokens]
        # this helps to broadcast it as a bias over attention scores, which will be in one of the following shapes:
        #   [batch,  heads, query_tokens, key_tokens] (e.g. torch sdp attn)
        #   [batch * heads, query_tokens, key_tokens] (e.g. xformers or classic attn)
        if attention_mask is not None and attention_mask.ndim == 2:
            # assume that mask is expressed as:
            #   (1 = keep,      0 = discard)
            # convert mask into a bias that can be added to attention scores:
            #       (keep = +0,     discard = -10000.0)
            attention_mask = (1 - attention_mask.to(hidden_states.dtype)) * -10000.0
            attention_mask = attention_mask.unsqueeze(1)

        # convert encoder_attention_mask to a bias the same way we do for attention_mask
        if encoder_attention_mask is not None and encoder_attention_mask.ndim == 2:
            encoder_attention_mask = (1 - encoder_attention_mask.to(hidden_states.dtype)) * -10000.0
            encoder_attention_mask = encoder_attention_mask.unsqueeze(1)

        # 1. Input
        batch_size = hidden_states.shape[0]
        height, width = (
            hidden_states.shape[-2] // self.config["patch_size"],
            hidden_states.shape[-1] // self.config["patch_size"],
        )
        hidden_states = self.pos_embed(hidden_states)

        timestep, embedded_timestep = self.adaln_single(
            timestep, added_cond_kwargs, batch_size=batch_size, hidden_dtype=hidden_states.dtype
        )

        if self.caption_projection is not None:
            encoder_hidden_states = self.caption_projection(encoder_hidden_states)
            encoder_hidden_states = encoder_hidden_states.view(batch_size, -1, hidden_states.shape[-1])

        # 2. Blocks
        for block in self.transformer_blocks:
            hidden_states = block(
                hidden_states,
                attention_mask=attention_mask,
                encoder_hidden_states=encoder_hidden_states,
                encoder_attention_mask=encoder_attention_mask,
                timestep=timestep,
                cross_attention_kwargs=cross_attention_kwargs,
                class_labels=None,
            )

        # 3. Output
        shift, scale = (self.scale_shift_table[None] + embedded_timestep[:, None]).chunk(2, dim=1)
        hidden_states = self.norm_out(hidden_states)
        # Modulation
        hidden_states = hidden_states * (1 + scale) + shift
        hidden_states = self.proj_out(hidden_states)
        if hidden_states.shape[1] == 1:
            hidden_states = hidden_states.squeeze(1)

        # unpatchify
        hidden_states = hidden_states.reshape(
            -1, height, width, self.config["patch_size"], self.config["patch_size"], self.out_channels
        )
        hidden_states = mint.einsum("nhwpqc->nchpwq", hidden_states)
        output = hidden_states.reshape(
            -1, self.out_channels, height * self.config["patch_size"], width * self.config["patch_size"]
        )

        if not return_dict:
            return (output,)

        return Transformer2DModelOutput(sample=output)

`mindone.diffusers.PixArtTransformer2DModel.attn_processors` `property` ¶

RETURNS	DESCRIPTION
`Dict[str, AttentionProcessor]`	`dict` of attention processors: A dictionary containing all attention processors used in the model with
`Dict[str, AttentionProcessor]`	indexed by its weight name.

`mindone.diffusers.PixArtTransformer2DModel.construct(hidden_states, encoder_hidden_states=None, timestep=None, added_cond_kwargs=None, cross_attention_kwargs=None, attention_mask=None, encoder_attention_mask=None, return_dict=False)` ¶

The [PixArtTransformer2DModel] forward method.

PARAMETER	DESCRIPTION
`hidden_states`	Input `hidden_states`. TYPE: `ms.Tensor` of shape `(batch size, channel, height, width)`
`encoder_hidden_states`	Conditional embeddings for cross attention layer. If not given, cross-attention defaults to self-attention. TYPE: `ms.Tensor` of shape `(batch size, sequence len, embed dims)`, optional DEFAULT: `None`
`timestep`	Used to indicate denoising step. Optional timestep to be applied as an embedding in `AdaLayerNorm`. TYPE: `ms.Tensor`, optional DEFAULT: `None`
`added_cond_kwargs`	(`Dict[str, Any]`, optional): Additional conditions to be used as inputs. TYPE: `Dict[str, Tensor]` DEFAULT: `None`
`cross_attention_kwargs`	A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in diffusers.models.attention_processor. TYPE: `Dict[str, Any]`, optional DEFAULT: `None`
`attention_mask`	An attention mask of shape `(batch, key_tokens)` is applied to `encoder_hidden_states`. If `1` the mask is kept, otherwise if `0` it is discarded. Mask will be converted into a bias, which adds large negative values to the attention scores corresponding to "discard" tokens. TYPE: `ms.Tensor`, optional DEFAULT: `None`
`encoder_attention_mask`	Cross-attention mask applied to `encoder_hidden_states`. Two formats supported: * Mask `(batch, sequence_length)` True = keep, False = discard. * Bias `(batch, 1, sequence_length)` 0 = keep, -10000 = discard. If `ndim == 2`: will be interpreted as a mask, then converted into a bias consistent with the format above. This bias will be added to the cross-attention scores. TYPE: `ms.Tensor`, optional DEFAULT: `None`
`return_dict`	Whether or not to return a [`~models.unets.unet_2d_condition.UNet2DConditionOutput`] instead of a plain tuple. TYPE: `bool`, optional, defaults to `False` DEFAULT: `False`

RETURNS	DESCRIPTION
	If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
	`tuple` where the first element is the sample tensor.

Source code in mindone/diffusers/models/transformers/pixart_transformer_2d.py

def construct(
    self,
    hidden_states: ms.Tensor,
    encoder_hidden_states: Optional[ms.Tensor] = None,
    timestep: Optional[ms.Tensor] = None,
    added_cond_kwargs: Dict[str, ms.Tensor] = None,
    cross_attention_kwargs: Dict[str, Any] = None,
    attention_mask: Optional[ms.Tensor] = None,
    encoder_attention_mask: Optional[ms.Tensor] = None,
    return_dict: bool = False,
):
    """
    The [`PixArtTransformer2DModel`] forward method.

    Args:
        hidden_states (`ms.Tensor` of shape `(batch size, channel, height, width)`):
            Input `hidden_states`.
        encoder_hidden_states (`ms.Tensor` of shape `(batch size, sequence len, embed dims)`, *optional*):
            Conditional embeddings for cross attention layer. If not given, cross-attention defaults to
            self-attention.
        timestep (`ms.Tensor`, *optional*):
            Used to indicate denoising step. Optional timestep to be applied as an embedding in `AdaLayerNorm`.
        added_cond_kwargs: (`Dict[str, Any]`, *optional*): Additional conditions to be used as inputs.
        cross_attention_kwargs ( `Dict[str, Any]`, *optional*):
            A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
            `self.processor` in
            [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
        attention_mask ( `ms.Tensor`, *optional*):
            An attention mask of shape `(batch, key_tokens)` is applied to `encoder_hidden_states`. If `1` the mask
            is kept, otherwise if `0` it is discarded. Mask will be converted into a bias, which adds large
            negative values to the attention scores corresponding to "discard" tokens.
        encoder_attention_mask ( `ms.Tensor`, *optional*):
            Cross-attention mask applied to `encoder_hidden_states`. Two formats supported:

                * Mask `(batch, sequence_length)` True = keep, False = discard.
                * Bias `(batch, 1, sequence_length)` 0 = keep, -10000 = discard.

            If `ndim == 2`: will be interpreted as a mask, then converted into a bias consistent with the format
            above. This bias will be added to the cross-attention scores.
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`~models.unets.unet_2d_condition.UNet2DConditionOutput`] instead of a plain
            tuple.

    Returns:
        If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
        `tuple` where the first element is the sample tensor.
    """
    if self.use_additional_conditions and added_cond_kwargs is None:
        raise ValueError("`added_cond_kwargs` cannot be None when using additional conditions for `adaln_single`.")

    # ensure attention_mask is a bias, and give it a singleton query_tokens dimension.
    #   we may have done this conversion already, e.g. if we came here via UNet2DConditionModel#forward.
    #   we can tell by counting dims; if ndim == 2: it's a mask rather than a bias.
    # expects mask of shape:
    #   [batch, key_tokens]
    # adds singleton query_tokens dimension:
    #   [batch,                    1, key_tokens]
    # this helps to broadcast it as a bias over attention scores, which will be in one of the following shapes:
    #   [batch,  heads, query_tokens, key_tokens] (e.g. torch sdp attn)
    #   [batch * heads, query_tokens, key_tokens] (e.g. xformers or classic attn)
    if attention_mask is not None and attention_mask.ndim == 2:
        # assume that mask is expressed as:
        #   (1 = keep,      0 = discard)
        # convert mask into a bias that can be added to attention scores:
        #       (keep = +0,     discard = -10000.0)
        attention_mask = (1 - attention_mask.to(hidden_states.dtype)) * -10000.0
        attention_mask = attention_mask.unsqueeze(1)

    # convert encoder_attention_mask to a bias the same way we do for attention_mask
    if encoder_attention_mask is not None and encoder_attention_mask.ndim == 2:
        encoder_attention_mask = (1 - encoder_attention_mask.to(hidden_states.dtype)) * -10000.0
        encoder_attention_mask = encoder_attention_mask.unsqueeze(1)

    # 1. Input
    batch_size = hidden_states.shape[0]
    height, width = (
        hidden_states.shape[-2] // self.config["patch_size"],
        hidden_states.shape[-1] // self.config["patch_size"],
    )
    hidden_states = self.pos_embed(hidden_states)

    timestep, embedded_timestep = self.adaln_single(
        timestep, added_cond_kwargs, batch_size=batch_size, hidden_dtype=hidden_states.dtype
    )

    if self.caption_projection is not None:
        encoder_hidden_states = self.caption_projection(encoder_hidden_states)
        encoder_hidden_states = encoder_hidden_states.view(batch_size, -1, hidden_states.shape[-1])

    # 2. Blocks
    for block in self.transformer_blocks:
        hidden_states = block(
            hidden_states,
            attention_mask=attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            timestep=timestep,
            cross_attention_kwargs=cross_attention_kwargs,
            class_labels=None,
        )

    # 3. Output
    shift, scale = (self.scale_shift_table[None] + embedded_timestep[:, None]).chunk(2, dim=1)
    hidden_states = self.norm_out(hidden_states)
    # Modulation
    hidden_states = hidden_states * (1 + scale) + shift
    hidden_states = self.proj_out(hidden_states)
    if hidden_states.shape[1] == 1:
        hidden_states = hidden_states.squeeze(1)

    # unpatchify
    hidden_states = hidden_states.reshape(
        -1, height, width, self.config["patch_size"], self.config["patch_size"], self.out_channels
    )
    hidden_states = mint.einsum("nhwpqc->nchpwq", hidden_states)
    output = hidden_states.reshape(
        -1, self.out_channels, height * self.config["patch_size"], width * self.config["patch_size"]
    )

    if not return_dict:
        return (output,)

    return Transformer2DModelOutput(sample=output)

`mindone.diffusers.PixArtTransformer2DModel.set_attn_processor(processor)` ¶

Sets the attention processor to use to compute attention.

PARAMETER	DESCRIPTION
`processor`	The instantiated processor class or a dictionary of processor classes that will be set as the processor for all `Attention` layers. If `processor` is a dict, the key needs to define the path to the corresponding cross attention processor. This is strongly recommended when setting trainable attention processors. TYPE: `dict` of `AttentionProcessor` or only `AttentionProcessor`

Source code in mindone/diffusers/models/transformers/pixart_transformer_2d.py

def set_attn_processor(self, processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]]):
    r"""
    Sets the attention processor to use to compute attention.

    Parameters:
        processor (`dict` of `AttentionProcessor` or only `AttentionProcessor`):
            The instantiated processor class or a dictionary of processor classes that will be set as the processor
            for **all** `Attention` layers.

            If `processor` is a dict, the key needs to define the path to the corresponding cross attention
            processor. This is strongly recommended when setting trainable attention processors.

    """
    count = len(self.attn_processors.keys())

    if isinstance(processor, dict) and len(processor) != count:
        raise ValueError(
            f"A dict of processors was passed, but the number of processors {len(processor)} does not match the"
            f" number of attention layers: {count}. Please make sure to pass {count} processor classes."
        )

    def fn_recursive_attn_processor(name: str, module: nn.Cell, processor):
        if hasattr(module, "set_processor"):
            if not isinstance(processor, dict):
                module.set_processor(processor)
            else:
                module.set_processor(processor.pop(f"{name}.processor"))

        for sub_name, child in module.name_cells().items():
            fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)

    for name, module in self.name_cells().items():
        fn_recursive_attn_processor(name, module, processor)

`mindone.diffusers.PixArtTransformer2DModel.set_default_attn_processor()` ¶

Disables custom attention processors and sets the default attention implementation.

Safe to just use AttnProcessor() as PixArt doesn't have any exotic attention processors in default model.

Source code in mindone/diffusers/models/transformers/pixart_transformer_2d.py

def set_default_attn_processor(self):
    """
    Disables custom attention processors and sets the default attention implementation.

    Safe to just use `AttnProcessor()` as PixArt doesn't have any exotic attention processors in default model.
    """
    self.set_attn_processor(AttnProcessor())

PixArtTransformer2DModel¶

mindone.diffusers.PixArtTransformer2DModel ¶

mindone.diffusers.PixArtTransformer2DModel.attn_processors property ¶

mindone.diffusers.PixArtTransformer2DModel.construct(hidden_states, encoder_hidden_states=None, timestep=None, added_cond_kwargs=None, cross_attention_kwargs=None, attention_mask=None, encoder_attention_mask=None, return_dict=False) ¶

mindone.diffusers.PixArtTransformer2DModel.set_attn_processor(processor) ¶

mindone.diffusers.PixArtTransformer2DModel.set_default_attn_processor() ¶

`mindone.diffusers.PixArtTransformer2DModel` ¶

`mindone.diffusers.PixArtTransformer2DModel.attn_processors` `property` ¶

`mindone.diffusers.PixArtTransformer2DModel.construct(hidden_states, encoder_hidden_states=None, timestep=None, added_cond_kwargs=None, cross_attention_kwargs=None, attention_mask=None, encoder_attention_mask=None, return_dict=False)` ¶

`mindone.diffusers.PixArtTransformer2DModel.set_attn_processor(processor)` ¶

`mindone.diffusers.PixArtTransformer2DModel.set_default_attn_processor()` ¶