CogVideoXTransformer3DModel¶

A Diffusion Transformer model for 3D data from CogVideoX was introduced in CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer by Tsinghua University & ZhipuAI.

The model can be loaded with the following code snippet.

from mindone.diffusers import CogVideoXTransformer3DModel

vae = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", mindspore_dtype=mindspore.float16)

`mindone.diffusers.models.transformers.cogvideox_transformer_3d.CogVideoXTransformer3DModel` ¶

Bases: ModelMixin, ConfigMixin, PeftAdapterMixin

A Transformer model for video-like data in CogVideoX.

PARAMETER	DESCRIPTION
`num_attention_heads`	The number of heads to use for multi-head attention. TYPE: `int`, defaults to `30` DEFAULT: `30`
`attention_head_dim`	The number of channels in each head. TYPE: `int`, defaults to `64` DEFAULT: `64`
`in_channels`	The number of channels in the input. TYPE: `int`, defaults to `16` DEFAULT: `16`
`out_channels`	The number of channels in the output. TYPE: `int`, optional, defaults to `16` DEFAULT: `16`
`flip_sin_to_cos`	Whether to flip the sin to cos in the time embedding. TYPE: `bool`, defaults to `True` DEFAULT: `True`
`time_embed_dim`	Output dimension of timestep embeddings. TYPE: `int`, defaults to `512` DEFAULT: `512`
`ofs_embed_dim`	Output dimension of "ofs" embeddings used in CogVideoX-5b-I2B in version 1.5 TYPE: `int`, defaults to `512` DEFAULT: `None`
`text_embed_dim`	Input dimension of text embeddings from the text encoder. TYPE: `int`, defaults to `4096` DEFAULT: `4096`
`num_layers`	The number of layers of Transformer blocks to use. TYPE: `int`, defaults to `30` DEFAULT: `30`
`dropout`	The dropout probability to use. TYPE: `float`, defaults to `0.0` DEFAULT: `0.0`
`attention_bias`	Whether to use bias in the attention projection layers. TYPE: `bool`, defaults to `True` DEFAULT: `True`
`sample_width`	The width of the input latents. TYPE: `int`, defaults to `90` DEFAULT: `90`
`sample_height`	The height of the input latents. TYPE: `int`, defaults to `60` DEFAULT: `60`
`sample_frames`	The number of frames in the input latents. Note that this parameter was incorrectly initialized to 49 instead of 13 because CogVideoX processed 13 latent frames at once in its default and recommended settings, but cannot be changed to the correct value to ensure backwards compatibility. To create a transformer with K latent frames, the correct value to pass here would be: ((K - 1) * temporal_compression_ratio + 1). TYPE: `int`, defaults to `49` DEFAULT: `49`
`patch_size`	The size of the patches to use in the patch embedding layer. TYPE: `int`, defaults to `2` DEFAULT: `2`
`temporal_compression_ratio`	The compression ratio across the temporal dimension. See documentation for `sample_frames`. TYPE: `int`, defaults to `4` DEFAULT: `4`
`max_text_seq_length`	The maximum sequence length of the input text embeddings. TYPE: `int`, defaults to `226` DEFAULT: `226`
`activation_fn`	Activation function to use in feed-forward. TYPE: `str`, defaults to `"gelu-approximate"` DEFAULT: `'gelu-approximate'`
`timestep_activation_fn`	Activation function to use when generating the timestep embeddings. TYPE: `str`, defaults to `"silu"` DEFAULT: `'silu'`
`norm_elementwise_affine`	Whether to use elementwise affine in normalization layers. TYPE: `bool`, defaults to `True` DEFAULT: `True`
`norm_eps`	The epsilon value to use in normalization layers. TYPE: `float`, defaults to `1e-5` DEFAULT: `1e-05`
`spatial_interpolation_scale`	Scaling factor to apply in 3D positional embeddings across spatial dimensions. TYPE: `float`, defaults to `1.875` DEFAULT: `1.875`
`temporal_interpolation_scale`	Scaling factor to apply in 3D positional embeddings across temporal dimensions. TYPE: `float`, defaults to `1.0` DEFAULT: `1.0`

Source code in mindone/diffusers/models/transformers/cogvideox_transformer_3d.py

class CogVideoXTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin):
    """
    A Transformer model for video-like data in [CogVideoX](https://github.com/THUDM/CogVideo).

    Parameters:
        num_attention_heads (`int`, defaults to `30`):
            The number of heads to use for multi-head attention.
        attention_head_dim (`int`, defaults to `64`):
            The number of channels in each head.
        in_channels (`int`, defaults to `16`):
            The number of channels in the input.
        out_channels (`int`, *optional*, defaults to `16`):
            The number of channels in the output.
        flip_sin_to_cos (`bool`, defaults to `True`):
            Whether to flip the sin to cos in the time embedding.
        time_embed_dim (`int`, defaults to `512`):
            Output dimension of timestep embeddings.
        ofs_embed_dim (`int`, defaults to `512`):
            Output dimension of "ofs" embeddings used in CogVideoX-5b-I2B in version 1.5
        text_embed_dim (`int`, defaults to `4096`):
            Input dimension of text embeddings from the text encoder.
        num_layers (`int`, defaults to `30`):
            The number of layers of Transformer blocks to use.
        dropout (`float`, defaults to `0.0`):
            The dropout probability to use.
        attention_bias (`bool`, defaults to `True`):
            Whether to use bias in the attention projection layers.
        sample_width (`int`, defaults to `90`):
            The width of the input latents.
        sample_height (`int`, defaults to `60`):
            The height of the input latents.
        sample_frames (`int`, defaults to `49`):
            The number of frames in the input latents. Note that this parameter was incorrectly initialized to 49
            instead of 13 because CogVideoX processed 13 latent frames at once in its default and recommended settings,
            but cannot be changed to the correct value to ensure backwards compatibility. To create a transformer with
            K latent frames, the correct value to pass here would be: ((K - 1) * temporal_compression_ratio + 1).
        patch_size (`int`, defaults to `2`):
            The size of the patches to use in the patch embedding layer.
        temporal_compression_ratio (`int`, defaults to `4`):
            The compression ratio across the temporal dimension. See documentation for `sample_frames`.
        max_text_seq_length (`int`, defaults to `226`):
            The maximum sequence length of the input text embeddings.
        activation_fn (`str`, defaults to `"gelu-approximate"`):
            Activation function to use in feed-forward.
        timestep_activation_fn (`str`, defaults to `"silu"`):
            Activation function to use when generating the timestep embeddings.
        norm_elementwise_affine (`bool`, defaults to `True`):
            Whether to use elementwise affine in normalization layers.
        norm_eps (`float`, defaults to `1e-5`):
            The epsilon value to use in normalization layers.
        spatial_interpolation_scale (`float`, defaults to `1.875`):
            Scaling factor to apply in 3D positional embeddings across spatial dimensions.
        temporal_interpolation_scale (`float`, defaults to `1.0`):
            Scaling factor to apply in 3D positional embeddings across temporal dimensions.
    """

    _supports_gradient_checkpointing = True

    @register_to_config
    def __init__(
        self,
        num_attention_heads: int = 30,
        attention_head_dim: int = 64,
        in_channels: int = 16,
        out_channels: Optional[int] = 16,
        flip_sin_to_cos: bool = True,
        freq_shift: int = 0,
        time_embed_dim: int = 512,
        ofs_embed_dim: Optional[int] = None,
        text_embed_dim: int = 4096,
        num_layers: int = 30,
        dropout: float = 0.0,
        attention_bias: bool = True,
        sample_width: int = 90,
        sample_height: int = 60,
        sample_frames: int = 49,
        patch_size: int = 2,
        patch_size_t: Optional[int] = None,
        temporal_compression_ratio: int = 4,
        max_text_seq_length: int = 226,
        activation_fn: str = "gelu-approximate",
        timestep_activation_fn: str = "silu",
        norm_elementwise_affine: bool = True,
        norm_eps: float = 1e-5,
        spatial_interpolation_scale: float = 1.875,
        temporal_interpolation_scale: float = 1.0,
        use_rotary_positional_embeddings: bool = False,
        use_learned_positional_embeddings: bool = False,
        patch_bias: bool = True,
    ):
        super().__init__()
        inner_dim = num_attention_heads * attention_head_dim

        if not use_rotary_positional_embeddings and use_learned_positional_embeddings:
            raise ValueError(
                "There are no CogVideoX checkpoints available with disable rotary embeddings and learned positional "
                "embeddings. If you're using a custom model and/or believe this should be supported, please open an "
                "issue at https://github.com/huggingface/diffusers/issues."
            )

        # 1. Patch embedding
        self.patch_embed = CogVideoXPatchEmbed(
            patch_size=patch_size,
            patch_size_t=patch_size_t,
            in_channels=in_channels,
            embed_dim=inner_dim,
            text_embed_dim=text_embed_dim,
            bias=patch_bias,
            sample_width=sample_width,
            sample_height=sample_height,
            sample_frames=sample_frames,
            temporal_compression_ratio=temporal_compression_ratio,
            max_text_seq_length=max_text_seq_length,
            spatial_interpolation_scale=spatial_interpolation_scale,
            temporal_interpolation_scale=temporal_interpolation_scale,
            use_positional_embeddings=not use_rotary_positional_embeddings,
            use_learned_positional_embeddings=use_learned_positional_embeddings,
        )
        self.embedding_dropout = nn.Dropout(p=dropout)

        # 2. Time embeddings and ofs embedding(Only CogVideoX1.5-5B I2V have)
        self.time_proj = Timesteps(inner_dim, flip_sin_to_cos, freq_shift)
        self.time_embedding = TimestepEmbedding(inner_dim, time_embed_dim, timestep_activation_fn)

        self.ofs_proj = None
        self.ofs_embedding = None
        if ofs_embed_dim:
            self.ofs_proj = Timesteps(ofs_embed_dim, flip_sin_to_cos, freq_shift)
            self.ofs_embedding = TimestepEmbedding(
                ofs_embed_dim, ofs_embed_dim, timestep_activation_fn
            )  # same as time embeddings, for ofs

        # 3. Define spatio-temporal transformers blocks
        self.transformer_blocks = nn.CellList(
            [
                CogVideoXBlock(
                    dim=inner_dim,
                    num_attention_heads=num_attention_heads,
                    attention_head_dim=attention_head_dim,
                    time_embed_dim=time_embed_dim,
                    dropout=dropout,
                    activation_fn=activation_fn,
                    attention_bias=attention_bias,
                    norm_elementwise_affine=norm_elementwise_affine,
                    norm_eps=norm_eps,
                )
                for _ in range(num_layers)
            ]
        )
        self.norm_final = LayerNorm(inner_dim, norm_eps, norm_elementwise_affine)

        # 4. Output blocks
        self.norm_out = AdaLayerNorm(
            embedding_dim=time_embed_dim,
            output_dim=2 * inner_dim,
            norm_elementwise_affine=norm_elementwise_affine,
            norm_eps=norm_eps,
            chunk_dim=1,
        )

        if patch_size_t is None:
            # For CogVideox 1.0
            output_dim = patch_size * patch_size * out_channels
        else:
            # For CogVideoX 1.5
            output_dim = patch_size * patch_size * patch_size_t * out_channels

        self.proj_out = nn.Dense(inner_dim, output_dim)

        self._gradient_checkpointing = False

    @property
    def gradient_checkpointing(self):
        return self._gradient_checkpointing

    @gradient_checkpointing.setter
    def gradient_checkpointing(self, value):
        if self._gradient_checkpointing != value:
            self._gradient_checkpointing = value
            for block in self.transformer_blocks:
                block.recompute()

    def _set_gradient_checkpointing(self, module, value=False):
        if hasattr(module, "gradient_checkpointing"):
            module.gradient_checkpointing = value

    @property
    # Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.attn_processors
    def attn_processors(self) -> Dict[str, AttentionProcessor]:
        r"""
        Returns:
            `dict` of attention processors: A dictionary containing all attention processors used in the model with
            indexed by its weight name.
        """
        # set recursively
        processors = {}

        def fn_recursive_add_processors(name: str, module: nn.Cell, processors: Dict[str, AttentionProcessor]):
            if hasattr(module, "get_processor"):
                processors[f"{name}.processor"] = module.get_processor()

            for sub_name, child in module.name_cells().items():
                fn_recursive_add_processors(f"{name}.{sub_name}", child, processors)

            return processors

        for name, module in self.name_cells().items():
            fn_recursive_add_processors(name, module, processors)

        return processors

    # Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.set_attn_processor
    def set_attn_processor(self, processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]]):
        r"""
        Sets the attention processor to use to compute attention.

        Parameters:
            processor (`dict` of `AttentionProcessor` or only `AttentionProcessor`):
                The instantiated processor class or a dictionary of processor classes that will be set as the processor
                for **all** `Attention` layers.

                If `processor` is a dict, the key needs to define the path to the corresponding cross attention
                processor. This is strongly recommended when setting trainable attention processors.

        """
        count = len(self.attn_processors.keys())

        if isinstance(processor, dict) and len(processor) != count:
            raise ValueError(
                f"A dict of processors was passed, but the number of processors {len(processor)} does not match the"
                f" number of attention layers: {count}. Please make sure to pass {count} processor classes."
            )

        def fn_recursive_attn_processor(name: str, module: nn.Cell, processor):
            if hasattr(module, "set_processor"):
                if not isinstance(processor, dict):
                    module.set_processor(processor)
                else:
                    module.set_processor(processor.pop(f"{name}.processor"))

            for sub_name, child in module.name_cells().items():
                fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)

        for name, module in self.name_cells().items():
            fn_recursive_attn_processor(name, module, processor)

    # Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.fuse_qkv_projections with FusedAttnProcessor2_0->FusedCogVideoXAttnProcessor2_0
    def fuse_qkv_projections(self):
        """
        Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query, key, value)
        are fused. For cross-attention modules, key and value projection matrices are fused.

        <Tip warning={true}>

        This API is 🧪 experimental.

        </Tip>
        """
        self.original_attn_processors = None

        for _, attn_processor in self.attn_processors.items():
            if "Added" in str(attn_processor.__class__.__name__):
                raise ValueError("`fuse_qkv_projections()` is not supported for models having added KV projections.")

        self.original_attn_processors = self.attn_processors

        for _, module in self.cells_and_names():
            if isinstance(module, Attention):
                module.fuse_projections(fuse=True)

        self.set_attn_processor(FusedCogVideoXAttnProcessor2_0())

    # Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.unfuse_qkv_projections
    def unfuse_qkv_projections(self):
        """Disables the fused QKV projection if enabled.

        <Tip warning={true}>

        This API is 🧪 experimental.

        </Tip>

        """
        if self.original_attn_processors is not None:
            self.set_attn_processor(self.original_attn_processors)

    def construct(
        self,
        hidden_states: ms.Tensor,
        encoder_hidden_states: ms.Tensor,
        timestep: Union[int, float, ms.Tensor],
        timestep_cond: Optional[ms.Tensor] = None,
        ofs: Optional[Union[int, float, ms.Tensor]] = None,
        image_rotary_emb: Optional[Tuple[ms.Tensor, ms.Tensor]] = None,
        attention_kwargs: Optional[Dict[str, Any]] = None,
        return_dict: bool = False,
    ):
        if attention_kwargs is not None and "scale" in attention_kwargs:
            # weight the lora layers by setting `lora_scale` for each PEFT layer here
            # and remove `lora_scale` from each PEFT layer at the end.
            # scale_lora_layers & unscale_lora_layers maybe contains some operation forbidden in graph mode
            raise RuntimeError(
                f"You are trying to set scaling of lora layer by passing {attention_kwargs['scale']=}. "
                f"However it's not allowed in on-the-fly model forwarding. "
                f"Please manually call `scale_lora_layers(model, lora_scale)` before model forwarding and "
                f"`unscale_lora_layers(model, lora_scale)` after model forwarding. "
                f"For example, it can be done in a pipeline call like `StableDiffusionPipeline.__call__`."
            )

        batch_size, num_frames, channels, height, width = hidden_states.shape

        # 1. Time embedding
        timesteps = timestep
        t_emb = self.time_proj(timesteps)

        # timesteps does not contain any weights and will always return f32 tensors
        # but time_embedding might actually be running in fp16. so we need to cast here.
        # there might be better ways to encapsulate this.
        t_emb = t_emb.to(dtype=hidden_states.dtype)
        emb = self.time_embedding(t_emb, timestep_cond)

        if self.ofs_embedding is not None:
            ofs_emb = self.ofs_proj(ofs)
            ofs_emb = ofs_emb.to(dtype=hidden_states.dtype)
            ofs_emb = self.ofs_embedding(ofs_emb)
            emb = emb + ofs_emb

        # 2. Patch embedding
        hidden_states = self.patch_embed(encoder_hidden_states, hidden_states)
        hidden_states = self.embedding_dropout(hidden_states)

        text_seq_length = encoder_hidden_states.shape[1]
        encoder_hidden_states = hidden_states[:, :text_seq_length]
        hidden_states = hidden_states[:, text_seq_length:]

        # 3. Transformer blocks
        for i, block in enumerate(self.transformer_blocks):
            hidden_states, encoder_hidden_states = block(
                hidden_states=hidden_states,
                encoder_hidden_states=encoder_hidden_states,
                temb=emb,
                image_rotary_emb=image_rotary_emb,
            )

        if not self.config["use_rotary_positional_embeddings"]:
            # CogVideoX-2B
            hidden_states = self.norm_final(hidden_states)
        else:
            # CogVideoX-5B
            hidden_states = ops.cat([encoder_hidden_states, hidden_states], axis=1)
            hidden_states = self.norm_final(hidden_states)
            hidden_states = hidden_states[:, text_seq_length:]

        # 4. Final block
        hidden_states = self.norm_out(hidden_states, temb=emb)
        hidden_states = self.proj_out(hidden_states)

        # 5. Unpatchify
        p = self.config["patch_size"]
        p_t = self.config["patch_size_t"]

        if p_t is None:
            output = hidden_states.reshape(batch_size, num_frames, height // p, width // p, -1, p, p)
            output = output.permute(0, 1, 4, 2, 5, 3, 6).flatten(start_dim=5, end_dim=6).flatten(start_dim=3, end_dim=4)
        else:
            output = hidden_states.reshape(
                batch_size, (num_frames + p_t - 1) // p_t, height // p, width // p, -1, p_t, p, p
            )
            output = (
                output.permute(0, 1, 5, 4, 2, 6, 3, 7)
                .flatten(start_dim=6, end_dim=7)
                .flatten(start_dim=4, end_dim=5)
                .flatten(start_dim=1, end_dim=2)
            )

        if not return_dict:
            return (output,)
        return Transformer2DModelOutput(sample=output)

`mindone.diffusers.models.transformers.cogvideox_transformer_3d.CogVideoXTransformer3DModel.attn_processors: Dict[str, AttentionProcessor]` `property` ¶

RETURNS	DESCRIPTION
`Dict[str, AttentionProcessor]`	`dict` of attention processors: A dictionary containing all attention processors used in the model with
`Dict[str, AttentionProcessor]`	indexed by its weight name.

`mindone.diffusers.models.transformers.cogvideox_transformer_3d.CogVideoXTransformer3DModel.fuse_qkv_projections()` ¶

Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query, key, value) are fused. For cross-attention modules, key and value projection matrices are fused.

This API is 🧪 experimental.

Source code in mindone/diffusers/models/transformers/cogvideox_transformer_3d.py

def fuse_qkv_projections(self):
    """
    Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query, key, value)
    are fused. For cross-attention modules, key and value projection matrices are fused.

    <Tip warning={true}>

    This API is 🧪 experimental.

    </Tip>
    """
    self.original_attn_processors = None

    for _, attn_processor in self.attn_processors.items():
        if "Added" in str(attn_processor.__class__.__name__):
            raise ValueError("`fuse_qkv_projections()` is not supported for models having added KV projections.")

    self.original_attn_processors = self.attn_processors

    for _, module in self.cells_and_names():
        if isinstance(module, Attention):
            module.fuse_projections(fuse=True)

    self.set_attn_processor(FusedCogVideoXAttnProcessor2_0())

`mindone.diffusers.models.transformers.cogvideox_transformer_3d.CogVideoXTransformer3DModel.set_attn_processor(processor)` ¶

Sets the attention processor to use to compute attention.

PARAMETER	DESCRIPTION
`processor`	The instantiated processor class or a dictionary of processor classes that will be set as the processor for all `Attention` layers. If `processor` is a dict, the key needs to define the path to the corresponding cross attention processor. This is strongly recommended when setting trainable attention processors. TYPE: `dict` of `AttentionProcessor` or only `AttentionProcessor`

Source code in mindone/diffusers/models/transformers/cogvideox_transformer_3d.py

def set_attn_processor(self, processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]]):
    r"""
    Sets the attention processor to use to compute attention.

    Parameters:
        processor (`dict` of `AttentionProcessor` or only `AttentionProcessor`):
            The instantiated processor class or a dictionary of processor classes that will be set as the processor
            for **all** `Attention` layers.

            If `processor` is a dict, the key needs to define the path to the corresponding cross attention
            processor. This is strongly recommended when setting trainable attention processors.

    """
    count = len(self.attn_processors.keys())

    if isinstance(processor, dict) and len(processor) != count:
        raise ValueError(
            f"A dict of processors was passed, but the number of processors {len(processor)} does not match the"
            f" number of attention layers: {count}. Please make sure to pass {count} processor classes."
        )

    def fn_recursive_attn_processor(name: str, module: nn.Cell, processor):
        if hasattr(module, "set_processor"):
            if not isinstance(processor, dict):
                module.set_processor(processor)
            else:
                module.set_processor(processor.pop(f"{name}.processor"))

        for sub_name, child in module.name_cells().items():
            fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)

    for name, module in self.name_cells().items():
        fn_recursive_attn_processor(name, module, processor)

`mindone.diffusers.models.transformers.cogvideox_transformer_3d.CogVideoXTransformer3DModel.unfuse_qkv_projections()` ¶

Disables the fused QKV projection if enabled.

This API is 🧪 experimental.

Source code in mindone/diffusers/models/transformers/cogvideox_transformer_3d.py

def unfuse_qkv_projections(self):
    """Disables the fused QKV projection if enabled.

    <Tip warning={true}>

    This API is 🧪 experimental.

    </Tip>

    """
    if self.original_attn_processors is not None:
        self.set_attn_processor(self.original_attn_processors)

`mindone.diffusers.models.modeling_outputs.Transformer2DModelOutput` `dataclass` ¶

Bases: BaseOutput

The output of [Transformer2DModel].

PARAMETER	DESCRIPTION
`(batch	The hidden states output conditioned on the `encoder_hidden_states` input. If discrete, returns probability distributions for the unnoised latent pixels. TYPE: size, num_vector_embeds - 1, num_latent_pixels)` if [`Transformer2DModel`] is discrete

Source code in mindone/diffusers/models/modeling_outputs.py

@dataclass
class Transformer2DModelOutput(BaseOutput):
    """
    The output of [`Transformer2DModel`].

    Args:
        sample (`torch.Tensor` of shape `(batch_size, num_channels, height, width)` or
        `(batch size, num_vector_embeds - 1, num_latent_pixels)` if [`Transformer2DModel`] is discrete):
            The hidden states output conditioned on the `encoder_hidden_states` input. If discrete, returns probability
            distributions for the unnoised latent pixels.
    """

    sample: "ms.Tensor"  # noqa: F821

CogVideoXTransformer3DModel¶

mindone.diffusers.models.transformers.cogvideox_transformer_3d.CogVideoXTransformer3DModel ¶

mindone.diffusers.models.transformers.cogvideox_transformer_3d.CogVideoXTransformer3DModel.attn_processors: Dict[str, AttentionProcessor] property ¶

mindone.diffusers.models.transformers.cogvideox_transformer_3d.CogVideoXTransformer3DModel.fuse_qkv_projections() ¶

mindone.diffusers.models.transformers.cogvideox_transformer_3d.CogVideoXTransformer3DModel.set_attn_processor(processor) ¶

mindone.diffusers.models.transformers.cogvideox_transformer_3d.CogVideoXTransformer3DModel.unfuse_qkv_projections() ¶

mindone.diffusers.models.modeling_outputs.Transformer2DModelOutput dataclass ¶

`mindone.diffusers.models.transformers.cogvideox_transformer_3d.CogVideoXTransformer3DModel` ¶

`mindone.diffusers.models.transformers.cogvideox_transformer_3d.CogVideoXTransformer3DModel.attn_processors: Dict[str, AttentionProcessor]` `property` ¶

`mindone.diffusers.models.transformers.cogvideox_transformer_3d.CogVideoXTransformer3DModel.fuse_qkv_projections()` ¶

`mindone.diffusers.models.transformers.cogvideox_transformer_3d.CogVideoXTransformer3DModel.set_attn_processor(processor)` ¶

`mindone.diffusers.models.transformers.cogvideox_transformer_3d.CogVideoXTransformer3DModel.unfuse_qkv_projections()` ¶

`mindone.diffusers.models.modeling_outputs.Transformer2DModelOutput` `dataclass` ¶