UNetMotionModel

The UNet model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 2D UNet model.
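
To make the "same size as the input" property concrete, here is a minimal sketch of how spatial resolution moves through a four-block UNet. It assumes that each downsampler halves and each upsampler doubles the height and width (the usual stride-2 behaviour, which this page does not state explicitly), and that every block except the last one in each path resamples, as the constructor in the source listing does. `trace_spatial_sizes` is a hypothetical helper for illustration, not part of mindone.

```python
from typing import List


def trace_spatial_sizes(size: int, num_blocks: int = 4) -> List[int]:
    """Return the spatial size after each down block and each up block."""
    sizes = [size]

    # Down path: every block except the final one downsamples (assumed: halves).
    for i in range(num_blocks):
        is_final_block = i == num_blocks - 1
        if not is_final_block:
            size //= 2
        sizes.append(size)

    # Up path: every block except the final one upsamples (assumed: doubles).
    for i in range(num_blocks):
        is_final_block = i == num_blocks - 1
        if not is_final_block:
            size *= 2
        sizes.append(size)

    return sizes


print(trace_spatial_sizes(64))
```

Under these assumptions the output size equals the input size only when the input side length is divisible by `2 ** (num_blocks - 1)`.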

The abstract from the paper is:

There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.

`mindone.diffusers.UNetMotionModel`

Bases: ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin, PeftAdapterMixin

A modified conditional 2D UNet model that takes a noisy sample, conditional state, and a timestep and returns a sample-shaped output.
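
One visible part of the "modified" design is how `from_unet2d` (in the source listing) renames block types: any type whose name contains `CrossAttn` becomes the cross-attention motion block, and anything else becomes the plain motion block. The sketch below reproduces only that renaming in plain Python. `to_motion_block_types` is a hypothetical helper for illustration, and the input names are assumed to be the usual `UNet2DConditionModel` block types.

```python
from typing import List, Sequence


def to_motion_block_types(block_types: Sequence[str], direction: str) -> List[str]:
    """Map 2D block type names to their motion counterparts.

    `direction` is "Down" or "Up"; the rule mirrors the substring check
    used in `from_unet2d`.
    """
    motion_types = []
    for block_type in block_types:
        if "CrossAttn" in block_type:
            motion_types.append(f"CrossAttn{direction}BlockMotion")
        else:
            motion_types.append(f"{direction}BlockMotion")
    return motion_types


down = to_motion_block_types(
    ("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D"),
    "Down",
)
up = to_motion_block_types(
    ("UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D"),
    "Up",
)
print(down)
print(up)
```

With these inputs the result matches the default `down_block_types` and `up_block_types` of the constructor.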

This model inherits from [ModelMixin]. Check the superclass documentation for its generic methods implemented for all models (such as downloading or saving).
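
The constructor in the source listing derives every block's channel counts from `block_out_channels`. As a rough sketch of that bookkeeping only (no layers are built), the following plain-Python function replays the two loops and records the channel arguments each block would receive. `plan_channels` is a hypothetical name used for illustration.

```python
from typing import List, Sequence, Tuple


def plan_channels(
    block_out_channels: Sequence[int] = (320, 640, 1280, 1280),
) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int, int]]]:
    """Return (down, up) channel plans.

    down entries are (in_channels, out_channels);
    up entries are (in_channels, prev_output_channel, out_channels).
    """
    num_blocks = len(block_out_channels)

    # Down path: each block consumes the previous block's output.
    down = []
    output_channel = block_out_channels[0]
    for i in range(num_blocks):
        input_channel = output_channel
        output_channel = block_out_channels[i]
        down.append((input_channel, output_channel))

    # Up path: walk the channel list in reverse; in_channels is the width
    # of the next (shallower) level, clamped at the last index.
    up = []
    reversed_channels = list(reversed(block_out_channels))
    output_channel = reversed_channels[0]
    for i in range(num_blocks):
        prev_output_channel = output_channel
        output_channel = reversed_channels[i]
        input_channel = reversed_channels[min(i + 1, num_blocks - 1)]
        up.append((input_channel, prev_output_channel, output_channel))

    return down, up


down_plan, up_plan = plan_channels()
print(down_plan)
print(up_plan)
```

The first down block maps `block_out_channels[0]` to itself because `conv_in` has already projected the input to that width, and the last up block ends at the same width, which is what `conv_out` expects.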

Source code in mindone/diffusers/models/unets/unet_motion_model.py

class UNetMotionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin, PeftAdapterMixin):
    r"""
    A modified conditional 2D UNet model that takes a noisy sample, conditional state, and a timestep and returns a
    sample-shaped output.

    This model inherits from [`ModelMixin`]. Check the superclass documentation for its generic methods implemented
    for all models (such as downloading or saving).
    """

    _supports_gradient_checkpointing = True

    @register_to_config
    def __init__(
        self,
        sample_size: Optional[int] = None,
        in_channels: int = 4,
        out_channels: int = 4,
        down_block_types: Tuple[str, ...] = (
            "CrossAttnDownBlockMotion",
            "CrossAttnDownBlockMotion",
            "CrossAttnDownBlockMotion",
            "DownBlockMotion",
        ),
        up_block_types: Tuple[str, ...] = (
            "UpBlockMotion",
            "CrossAttnUpBlockMotion",
            "CrossAttnUpBlockMotion",
            "CrossAttnUpBlockMotion",
        ),
        block_out_channels: Tuple[int, ...] = (320, 640, 1280, 1280),
        layers_per_block: Union[int, Tuple[int]] = 2,
        downsample_padding: int = 1,
        mid_block_scale_factor: float = 1,
        act_fn: str = "silu",
        norm_num_groups: int = 32,
        norm_eps: float = 1e-5,
        cross_attention_dim: int = 1280,
        transformer_layers_per_block: Union[int, Tuple[int], Tuple[Tuple]] = 1,
        reverse_transformer_layers_per_block: Optional[Union[int, Tuple[int], Tuple[Tuple]]] = None,
        temporal_transformer_layers_per_block: Union[int, Tuple[int], Tuple[Tuple]] = 1,
        reverse_temporal_transformer_layers_per_block: Optional[Union[int, Tuple[int], Tuple[Tuple]]] = None,
        transformer_layers_per_mid_block: Optional[Union[int, Tuple[int]]] = None,
        temporal_transformer_layers_per_mid_block: Optional[Union[int, Tuple[int]]] = 1,
        use_linear_projection: bool = False,
        num_attention_heads: Union[int, Tuple[int, ...]] = 8,
        motion_max_seq_length: int = 32,
        motion_num_attention_heads: Union[int, Tuple[int, ...]] = 8,
        reverse_motion_num_attention_heads: Optional[Union[int, Tuple[int, ...], Tuple[Tuple[int, ...], ...]]] = None,
        use_motion_mid_block: bool = True,
        mid_block_layers: int = 1,
        encoder_hid_dim: Optional[int] = None,
        encoder_hid_dim_type: Optional[str] = None,
        addition_embed_type: Optional[str] = None,
        addition_time_embed_dim: Optional[int] = None,
        projection_class_embeddings_input_dim: Optional[int] = None,
        time_cond_proj_dim: Optional[int] = None,
    ):
        super().__init__()

        self.sample_size = sample_size

        # Check inputs
        if len(down_block_types) != len(up_block_types):
            raise ValueError(
                f"Must provide the same number of `down_block_types` as `up_block_types`. "
                f"`down_block_types`: {down_block_types}. `up_block_types`: {up_block_types}."
            )

        if len(block_out_channels) != len(down_block_types):
            raise ValueError(
                f"Must provide the same number of `block_out_channels` as `down_block_types`. "
                f"`block_out_channels`: {block_out_channels}. `down_block_types`: {down_block_types}."
            )

        if not isinstance(num_attention_heads, int) and len(num_attention_heads) != len(down_block_types):
            raise ValueError(
                f"Must provide the same number of `num_attention_heads` as `down_block_types`. "
                f"`num_attention_heads`: {num_attention_heads}. `down_block_types`: {down_block_types}."
            )

        if isinstance(cross_attention_dim, list) and len(cross_attention_dim) != len(down_block_types):
            raise ValueError(
                f"Must provide the same number of `cross_attention_dim` as `down_block_types`. `cross_attention_dim`: {cross_attention_dim}. `down_block_types`: {down_block_types}."  # noqa: E501
            )

        if not isinstance(layers_per_block, int) and len(layers_per_block) != len(down_block_types):
            raise ValueError(
                f"Must provide the same number of `layers_per_block` as `down_block_types`. `layers_per_block`: {layers_per_block}. `down_block_types`: {down_block_types}."  # noqa: E501
            )

        if isinstance(transformer_layers_per_block, list) and reverse_transformer_layers_per_block is None:
            for layer_number_per_block in transformer_layers_per_block:
                if isinstance(layer_number_per_block, list):
                    raise ValueError("Must provide `reverse_transformer_layers_per_block` if using asymmetrical UNet.")

        if (
            isinstance(temporal_transformer_layers_per_block, list)
            and reverse_temporal_transformer_layers_per_block is None
        ):
            for layer_number_per_block in temporal_transformer_layers_per_block:
                if isinstance(layer_number_per_block, list):
                    raise ValueError(
                        "Must provide `reverse_temporal_transformer_layers_per_block` if using asymmetrical motion module in UNet."
                    )

        # input
        conv_in_kernel = 3
        conv_out_kernel = 3
        conv_in_padding = (conv_in_kernel - 1) // 2
        self.conv_in = nn.Conv2d(
            in_channels,
            block_out_channels[0],
            kernel_size=conv_in_kernel,
            pad_mode="pad",
            padding=conv_in_padding,
            has_bias=True,
        )

        # time
        time_embed_dim = block_out_channels[0] * 4
        self.time_proj = Timesteps(block_out_channels[0], True, 0)
        timestep_input_dim = block_out_channels[0]

        self.time_embedding = TimestepEmbedding(
            timestep_input_dim, time_embed_dim, act_fn=act_fn, cond_proj_dim=time_cond_proj_dim
        )

        if encoder_hid_dim_type is None:
            self.encoder_hid_proj = None

        if addition_embed_type == "text_time":
            self.add_time_proj = Timesteps(addition_time_embed_dim, True, 0)
            self.add_embedding = TimestepEmbedding(projection_class_embeddings_input_dim, time_embed_dim)

        # collect blocks and expand scalar arguments to one entry per block
        down_blocks = []
        up_blocks = []

        if isinstance(num_attention_heads, int):
            num_attention_heads = (num_attention_heads,) * len(down_block_types)

        if isinstance(cross_attention_dim, int):
            cross_attention_dim = (cross_attention_dim,) * len(down_block_types)

        if isinstance(layers_per_block, int):
            layers_per_block = [layers_per_block] * len(down_block_types)

        if isinstance(transformer_layers_per_block, int):
            transformer_layers_per_block = [transformer_layers_per_block] * len(down_block_types)

        if isinstance(reverse_transformer_layers_per_block, int):
            reverse_transformer_layers_per_block = [reverse_transformer_layers_per_block] * len(down_block_types)

        if isinstance(temporal_transformer_layers_per_block, int):
            temporal_transformer_layers_per_block = [temporal_transformer_layers_per_block] * len(down_block_types)

        if isinstance(reverse_temporal_transformer_layers_per_block, int):
            reverse_temporal_transformer_layers_per_block = [reverse_temporal_transformer_layers_per_block] * len(
                down_block_types
            )

        if isinstance(motion_num_attention_heads, int):
            motion_num_attention_heads = (motion_num_attention_heads,) * len(down_block_types)

        # down
        output_channel = block_out_channels[0]
        for i, down_block_type in enumerate(down_block_types):
            input_channel = output_channel
            output_channel = block_out_channels[i]
            is_final_block = i == len(block_out_channels) - 1

            if down_block_type == "CrossAttnDownBlockMotion":
                down_block = CrossAttnDownBlockMotion(
                    in_channels=input_channel,
                    out_channels=output_channel,
                    temb_channels=time_embed_dim,
                    num_layers=layers_per_block[i],
                    transformer_layers_per_block=transformer_layers_per_block[i],
                    resnet_eps=norm_eps,
                    resnet_act_fn=act_fn,
                    resnet_groups=norm_num_groups,
                    num_attention_heads=num_attention_heads[i],
                    cross_attention_dim=cross_attention_dim[i],
                    downsample_padding=downsample_padding,
                    add_downsample=not is_final_block,
                    use_linear_projection=use_linear_projection,
                    temporal_num_attention_heads=motion_num_attention_heads[i],
                    temporal_max_seq_length=motion_max_seq_length,
                    temporal_transformer_layers_per_block=temporal_transformer_layers_per_block[i],
                )
            elif down_block_type == "DownBlockMotion":
                down_block = DownBlockMotion(
                    in_channels=input_channel,
                    out_channels=output_channel,
                    temb_channels=time_embed_dim,
                    num_layers=layers_per_block[i],
                    resnet_eps=norm_eps,
                    resnet_act_fn=act_fn,
                    resnet_groups=norm_num_groups,
                    add_downsample=not is_final_block,
                    downsample_padding=downsample_padding,
                    temporal_num_attention_heads=motion_num_attention_heads[i],
                    temporal_max_seq_length=motion_max_seq_length,
                    temporal_transformer_layers_per_block=temporal_transformer_layers_per_block[i],
                )
            else:
                raise ValueError(
                    "Invalid `down_block_type` encountered. Must be one of `CrossAttnDownBlockMotion` or `DownBlockMotion`"
                )

            down_blocks.append(down_block)
        self.down_blocks = nn.CellList(down_blocks)

        # mid: only definition, binding attribute to UNetMotionModel later to maintain the order of sub-modules within
        # UNetMotionModel as self.down_blocks -> self.up_blocks -> self.mid_block, ensuring the correct sequence of
        # sub-modules is loaded when the IP-Adapter is loaded.
        if transformer_layers_per_mid_block is None:
            transformer_layers_per_mid_block = (
                transformer_layers_per_block[-1] if isinstance(transformer_layers_per_block[-1], int) else 1
            )

        if use_motion_mid_block:
            mid_block = UNetMidBlockCrossAttnMotion(
                in_channels=block_out_channels[-1],
                temb_channels=time_embed_dim,
                resnet_eps=norm_eps,
                resnet_act_fn=act_fn,
                output_scale_factor=mid_block_scale_factor,
                cross_attention_dim=cross_attention_dim[-1],
                num_attention_heads=num_attention_heads[-1],
                resnet_groups=norm_num_groups,
                dual_cross_attention=False,
                use_linear_projection=use_linear_projection,
                num_layers=mid_block_layers,
                temporal_num_attention_heads=motion_num_attention_heads[-1],
                temporal_max_seq_length=motion_max_seq_length,
                transformer_layers_per_block=transformer_layers_per_mid_block,
                temporal_transformer_layers_per_block=temporal_transformer_layers_per_mid_block,
            )

        else:
            mid_block = UNetMidBlock2DCrossAttn(
                in_channels=block_out_channels[-1],
                temb_channels=time_embed_dim,
                resnet_eps=norm_eps,
                resnet_act_fn=act_fn,
                output_scale_factor=mid_block_scale_factor,
                cross_attention_dim=cross_attention_dim[-1],
                num_attention_heads=num_attention_heads[-1],
                resnet_groups=norm_num_groups,
                dual_cross_attention=False,
                use_linear_projection=use_linear_projection,
                num_layers=mid_block_layers,
                transformer_layers_per_block=transformer_layers_per_mid_block,
            )

        # count how many layers upsample the images
        self.num_upsamplers = 0

        # up
        layers_per_resnet_in_up_blocks = []
        reversed_block_out_channels = list(reversed(block_out_channels))
        reversed_num_attention_heads = list(reversed(num_attention_heads))
        reversed_layers_per_block = list(reversed(layers_per_block))
        reversed_cross_attention_dim = list(reversed(cross_attention_dim))
        reversed_motion_num_attention_heads = list(reversed(motion_num_attention_heads))

        if reverse_transformer_layers_per_block is None:
            reverse_transformer_layers_per_block = list(reversed(transformer_layers_per_block))

        if reverse_temporal_transformer_layers_per_block is None:
            reverse_temporal_transformer_layers_per_block = list(reversed(temporal_transformer_layers_per_block))

        output_channel = reversed_block_out_channels[0]
        for i, up_block_type in enumerate(up_block_types):
            is_final_block = i == len(block_out_channels) - 1

            prev_output_channel = output_channel
            output_channel = reversed_block_out_channels[i]
            input_channel = reversed_block_out_channels[min(i + 1, len(block_out_channels) - 1)]

            # add upsample block for all BUT final layer
            if not is_final_block:
                add_upsample = True
                self.num_upsamplers += 1
            else:
                add_upsample = False

            if up_block_type == "CrossAttnUpBlockMotion":
                up_block = CrossAttnUpBlockMotion(
                    in_channels=input_channel,
                    out_channels=output_channel,
                    prev_output_channel=prev_output_channel,
                    temb_channels=time_embed_dim,
                    resolution_idx=i,
                    num_layers=reversed_layers_per_block[i] + 1,
                    transformer_layers_per_block=reverse_transformer_layers_per_block[i],
                    resnet_eps=norm_eps,
                    resnet_act_fn=act_fn,
                    resnet_groups=norm_num_groups,
                    num_attention_heads=reversed_num_attention_heads[i],
                    cross_attention_dim=reversed_cross_attention_dim[i],
                    add_upsample=add_upsample,
                    use_linear_projection=use_linear_projection,
                    temporal_num_attention_heads=reversed_motion_num_attention_heads[i],
                    temporal_max_seq_length=motion_max_seq_length,
                    temporal_transformer_layers_per_block=reverse_temporal_transformer_layers_per_block[i],
                )
            elif up_block_type == "UpBlockMotion":
                up_block = UpBlockMotion(
                    in_channels=input_channel,
                    prev_output_channel=prev_output_channel,
                    out_channels=output_channel,
                    temb_channels=time_embed_dim,
                    resolution_idx=i,
                    num_layers=reversed_layers_per_block[i] + 1,
                    resnet_eps=norm_eps,
                    resnet_act_fn=act_fn,
                    resnet_groups=norm_num_groups,
                    add_upsample=add_upsample,
                    temporal_num_attention_heads=reversed_motion_num_attention_heads[i],
                    temporal_max_seq_length=motion_max_seq_length,
                    temporal_transformer_layers_per_block=reverse_temporal_transformer_layers_per_block[i],
                )
            else:
                raise ValueError(
                    "Invalid `up_block_type` encountered. Must be one of `CrossAttnUpBlockMotion` or `UpBlockMotion`"
                )

            up_blocks.append(up_block)
            prev_output_channel = output_channel
            layers_per_resnet_in_up_blocks.append(len(up_block.resnets))
        self.up_blocks = nn.CellList(up_blocks)
        self.layers_per_resnet_in_up_blocks = layers_per_resnet_in_up_blocks

        # bind mid_block to self here
        self.mid_block = mid_block

        # out
        if norm_num_groups is not None:
            self.conv_norm_out = GroupNorm(num_channels=block_out_channels[0], num_groups=norm_num_groups, eps=norm_eps)
            self.conv_act = nn.SiLU()
        else:
            self.conv_norm_out = None
            self.conv_act = None

        conv_out_padding = (conv_out_kernel - 1) // 2
        self.conv_out = nn.Conv2d(
            block_out_channels[0],
            out_channels,
            kernel_size=conv_out_kernel,
            pad_mode="pad",
            padding=conv_out_padding,
            has_bias=True,
        )

    @classmethod
    def from_unet2d(
        cls,
        unet: UNet2DConditionModel,
        motion_adapter: Optional[MotionAdapter] = None,
        load_weights: bool = True,
    ):
        has_motion_adapter = motion_adapter is not None

        if has_motion_adapter:
            # check compatibility of number of blocks
            if len(unet.config["down_block_types"]) != len(motion_adapter.config["block_out_channels"]):
                raise ValueError("Incompatible Motion Adapter, got different number of blocks")

            # check layers compatibility for each block
            if isinstance(unet.config["layers_per_block"], int):
                expanded_layers_per_block = [unet.config["layers_per_block"]] * len(unet.config["down_block_types"])
            else:
                expanded_layers_per_block = list(unet.config["layers_per_block"])
            if isinstance(motion_adapter.config["motion_layers_per_block"], int):
                expanded_adapter_layers_per_block = [motion_adapter.config["motion_layers_per_block"]] * len(
                    motion_adapter.config["block_out_channels"]
                )
            else:
                expanded_adapter_layers_per_block = list(motion_adapter.config["motion_layers_per_block"])
            if expanded_layers_per_block != expanded_adapter_layers_per_block:
                raise ValueError("Incompatible Motion Adapter, got different number of layers per block")

        # based on https://github.com/guoyww/AnimateDiff/blob/895f3220c06318ea0760131ec70408b466c49333/animatediff/models/unet.py#L459
        config = dict(unet.config)
        config["_class_name"] = cls.__name__

        down_blocks = []
        for down_blocks_type in config["down_block_types"]:
            if "CrossAttn" in down_blocks_type:
                down_blocks.append("CrossAttnDownBlockMotion")
            else:
                down_blocks.append("DownBlockMotion")
        config["down_block_types"] = down_blocks

        up_blocks = []
        for up_blocks_type in config["up_block_types"]:
            if "CrossAttn" in up_blocks_type:
                up_blocks.append("CrossAttnUpBlockMotion")
            else:
                up_blocks.append("UpBlockMotion")
        config["up_block_types"] = up_blocks

        if has_motion_adapter:
            config["motion_num_attention_heads"] = motion_adapter.config["motion_num_attention_heads"]
            config["motion_max_seq_length"] = motion_adapter.config["motion_max_seq_length"]
            config["use_motion_mid_block"] = motion_adapter.config["use_motion_mid_block"]
            config["layers_per_block"] = motion_adapter.config["motion_layers_per_block"]
            config["temporal_transformer_layers_per_mid_block"] = motion_adapter.config[
                "motion_transformer_layers_per_mid_block"
            ]
            config["temporal_transformer_layers_per_block"] = motion_adapter.config[
                "motion_transformer_layers_per_block"
            ]

            # For PIA UNets we need to set the number of input channels to 9
            if motion_adapter.config["conv_in_channels"]:
                config["in_channels"] = motion_adapter.config["conv_in_channels"]

        # Need this for backwards compatibility with UNet2DConditionModel checkpoints
        if not config.get("num_attention_heads"):
            config["num_attention_heads"] = config["attention_head_dim"]

        expected_kwargs, optional_kwargs = cls._get_signature_keys(cls)
        config = FrozenDict({k: config.get(k) for k in config if k in expected_kwargs or k in optional_kwargs})
        config["_class_name"] = cls.__name__
        model = cls.from_config(config)

        # Move dtype conversion code here to avoid dtype mismatch issues when loading weights
        # ensure that the Motion UNet is the same dtype as the UNet2DConditionModel
        model.to(unet.dtype)

        if not load_weights:
            return model

        # Logic for loading PIA UNets which allow the first 4 channels to be any UNet2DConditionModel conv_in weight
        # while the last 5 channels must be PIA conv_in weights.
        if has_motion_adapter and motion_adapter.config["conv_in_channels"]:
            model.conv_in = motion_adapter.conv_in
            updated_conv_in_weight = ops.cat([unet.conv_in.weight, motion_adapter.conv_in.weight[:, 4:, :, :]], axis=1)
            ms.load_param_into_net(model.conv_in, {"weight": updated_conv_in_weight, "bias": unet.conv_in.bias})
        else:
            ms.load_param_into_net(model.conv_in, unet.conv_in.parameters_dict())

        ms.load_param_into_net(model.time_proj, unet.time_proj.parameters_dict())
        ms.load_param_into_net(model.time_embedding, unet.time_embedding.parameters_dict())

        if any(isinstance(proc, IPAdapterAttnProcessor) for proc in unet.attn_processors.values()):
            attn_procs = {}
            for name, processor in unet.attn_processors.items():
                if name.endswith("attn1.processor"):
                    attn_processor_class = AttnProcessor
                    attn_procs[name] = attn_processor_class()
                else:
                    attn_processor_class = IPAdapterAttnProcessor
                    attn_procs[name] = attn_processor_class(
                        hidden_size=processor.hidden_size,
                        cross_attention_dim=processor.cross_attention_dim,
                        scale=processor.scale,
                        num_tokens=processor.num_tokens,
                    )
            for name, processor in model.attn_processors.items():
                if name not in attn_procs:
                    attn_procs[name] = processor.__class__()
            model.set_attn_processor(attn_procs)
            model.config.encoder_hid_dim_type = "ip_image_proj"
            model.encoder_hid_proj = unet.encoder_hid_proj

        for i, down_block in enumerate(unet.down_blocks):
            ms.load_param_into_net(model.down_blocks[i].resnets, down_block.resnets.parameters_dict())
            if hasattr(model.down_blocks[i], "attentions"):
                ms.load_param_into_net(model.down_blocks[i].attentions, down_block.attentions.parameters_dict())
            if model.down_blocks[i].downsamplers:
                ms.load_param_into_net(model.down_blocks[i].downsamplers, down_block.downsamplers.parameters_dict())

        for i, up_block in enumerate(unet.up_blocks):
            ms.load_param_into_net(model.up_blocks[i].resnets, up_block.resnets.parameters_dict())
            if hasattr(model.up_blocks[i], "attentions"):
                ms.load_param_into_net(model.up_blocks[i].attentions, up_block.attentions.parameters_dict())
            if model.up_blocks[i].upsamplers:
                ms.load_param_into_net(model.up_blocks[i].upsamplers, up_block.upsamplers.parameters_dict())

        ms.load_param_into_net(model.mid_block.resnets, unet.mid_block.resnets.parameters_dict())
        ms.load_param_into_net(model.mid_block.attentions, unet.mid_block.attentions.parameters_dict())

        if unet.conv_norm_out is not None:
            ms.load_param_into_net(model.conv_norm_out, unet.conv_norm_out.parameters_dict())
        if unet.conv_act is not None:
            ms.load_param_into_net(model.conv_act, unet.conv_act.parameters_dict())
        ms.load_param_into_net(model.conv_out, unet.conv_out.parameters_dict())

        if has_motion_adapter:
            model.load_motion_modules(motion_adapter)

        return model

    def freeze_unet2d_params(self) -> None:
        """Freeze the weights of just the UNet2DConditionModel, and leave the motion modules
        unfrozen for fine tuning.
        """
        # Freeze everything
        for param in self.get_parameters():
            param.requires_grad = False

        # Unfreeze Motion Modules
        for down_block in self.down_blocks:
            motion_modules = down_block.motion_modules
            for param in motion_modules.get_parameters():
                param.requires_grad = True

        for up_block in self.up_blocks:
            motion_modules = up_block.motion_modules
            for param in motion_modules.get_parameters():
                param.requires_grad = True

        if hasattr(self.mid_block, "motion_modules"):
            motion_modules = self.mid_block.motion_modules
            for param in motion_modules.get_parameters():
                param.requires_grad = True

    def load_motion_modules(self, motion_adapter: Optional[MotionAdapter]) -> None:
        for i, down_block in enumerate(motion_adapter.down_blocks):
            ms.load_param_into_net(self.down_blocks[i].motion_modules, down_block.motion_modules.parameters_dict())
        for i, up_block in enumerate(motion_adapter.up_blocks):
            ms.load_param_into_net(self.up_blocks[i].motion_modules, up_block.motion_modules.parameters_dict())

        # to support older motion modules that don't have a mid_block
        if hasattr(self.mid_block, "motion_modules"):
            ms.load_param_into_net(
                self.mid_block.motion_modules, motion_adapter.mid_block.motion_modules.parameters_dict()
            )

    def save_motion_modules(
        self,
        save_directory: str,
        is_main_process: bool = True,
        safe_serialization: bool = True,
        variant: Optional[str] = None,
        push_to_hub: bool = False,
        **kwargs,
    ) -> None:
        state_dict = self.parameters_dict()

        # Extract all motion modules
        motion_state_dict = {}
        for k, v in state_dict.items():
            if "motion_modules" in k:
                motion_state_dict[k] = v

        adapter = MotionAdapter(
            block_out_channels=self.config["block_out_channels"],
            motion_layers_per_block=self.config["layers_per_block"],
            motion_norm_num_groups=self.config["norm_num_groups"],
            motion_num_attention_heads=self.config["motion_num_attention_heads"],
            motion_max_seq_length=self.config["motion_max_seq_length"],
            use_motion_mid_block=self.config["use_motion_mid_block"],
        )
        ms.load_param_into_net(adapter, motion_state_dict)
        adapter.save_pretrained(
            save_directory=save_directory,
            is_main_process=is_main_process,
            safe_serialization=safe_serialization,
            variant=variant,
            push_to_hub=push_to_hub,
            **kwargs,
        )

    @property
    # Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.attn_processors
    def attn_processors(self) -> Dict[str, AttentionProcessor]:  # type: ignore
        r"""
        Returns:
            `dict` of attention processors: A dictionary containing all attention processors used in the model,
            indexed by weight name.
        """
        # set recursively
        processors = {}

        def fn_recursive_add_processors(name: str, module: nn.Cell, processors: Dict[str, AttentionProcessor]):  # type: ignore
            if hasattr(module, "get_processor"):
                processors[f"{name}.processor"] = module.get_processor()

            for sub_name, child in module.name_cells().items():
                fn_recursive_add_processors(f"{name}.{sub_name}", child, processors)

            return processors

        for name, module in self.name_cells().items():
            fn_recursive_add_processors(name, module, processors)

        return processors

    # Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.set_attn_processor
    def set_attn_processor(self, processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]]):  # type: ignore
        r"""
        Sets the attention processor to use to compute attention.
        Parameters:
            processor (`dict` of `AttentionProcessor` or only `AttentionProcessor`):
                The instantiated processor class or a dictionary of processor classes that will be set as the processor
                for **all** `Attention` layers.
                If `processor` is a dict, the key needs to define the path to the corresponding cross attention
                processor. This is strongly recommended when setting trainable attention processors.
        """
        count = len(self.attn_processors.keys())

        if isinstance(processor, dict) and len(processor) != count:
            raise ValueError(
                f"A dict of processors was passed, but the number of processors {len(processor)} does not match the"
                f" number of attention layers: {count}. Please make sure to pass {count} processor classes."
            )

        def fn_recursive_attn_processor(name: str, module: nn.Cell, processor):
            if hasattr(module, "set_processor"):
                if not isinstance(processor, dict):
                    module.set_processor(processor)
                else:
                    module.set_processor(processor.pop(f"{name}.processor"))

            for sub_name, child in module.name_cells().items():
                fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)

        for name, module in self.name_cells().items():
            fn_recursive_attn_processor(name, module, processor)

    def enable_forward_chunking(self, chunk_size: Optional[int] = None, dim: int = 0) -> None:
        """
        Sets the attention processor to use [feed forward
        chunking](https://huggingface.co/blog/reformer#2-chunked-feed-forward-layers).

        Parameters:
            chunk_size (`int`, *optional*):
                The chunk size of the feed-forward layers. If not specified, will run feed-forward layer individually
                over each tensor of dim=`dim`.
            dim (`int`, *optional*, defaults to `0`):
                The dimension over which the feed-forward computation should be chunked. Choose between dim=0 (batch)
                or dim=1 (sequence length).
        """
        if dim not in [0, 1]:
            raise ValueError(f"Make sure to set `dim` to either 0 or 1, not {dim}")

        # By default chunk size is 1
        chunk_size = chunk_size or 1

        def fn_recursive_feed_forward(module: nn.Cell, chunk_size: int, dim: int):
            if hasattr(module, "set_chunk_feed_forward"):
                module.set_chunk_feed_forward(chunk_size=chunk_size, dim=dim)

            for child in module.name_cells().values():
                fn_recursive_feed_forward(child, chunk_size, dim)

        for module in self.name_cells().values():
            fn_recursive_feed_forward(module, chunk_size, dim)

    def disable_forward_chunking(self):
        def fn_recursive_feed_forward(module: nn.Cell, chunk_size: int, dim: int):
            if hasattr(module, "set_chunk_feed_forward"):
                module.set_chunk_feed_forward(chunk_size=chunk_size, dim=dim)

            for child in module.name_cells().values():
                fn_recursive_feed_forward(child, chunk_size, dim)

        for module in self.name_cells().values():
            fn_recursive_feed_forward(module, None, 0)

    # Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.set_default_attn_processor
    def set_default_attn_processor(self):
        """
        Disables custom attention processors and sets the default attention implementation.
        """
        if all(proc.__class__ in CROSS_ATTENTION_PROCESSORS for proc in self.attn_processors.values()):
            processor = AttnProcessor()
        else:
            raise ValueError(
                f"Cannot call `set_default_attn_processor` when attention processors are of type {next(iter(self.attn_processors.values()))}"
            )

        self.set_attn_processor(processor)

    def _set_gradient_checkpointing(self, module, value: bool = False) -> None:
        if isinstance(module, (CrossAttnDownBlockMotion, DownBlockMotion, CrossAttnUpBlockMotion, UpBlockMotion)):
            module.gradient_checkpointing = value

    def construct(
        self,
        sample: ms.Tensor,
        timestep: Union[ms.Tensor, float, int],
        encoder_hidden_states: ms.Tensor,
        timestep_cond: Optional[ms.Tensor] = None,
        attention_mask: Optional[ms.Tensor] = None,
        cross_attention_kwargs: Optional[Dict[str, Any]] = None,
        added_cond_kwargs: Optional[Dict[str, ms.Tensor]] = None,
        down_block_additional_residuals: Optional[Tuple[ms.Tensor]] = None,
        mid_block_additional_residual: Optional[ms.Tensor] = None,
        return_dict: bool = False,
    ) -> Union[UNetMotionOutput, Tuple[ms.Tensor]]:
        r"""
        The [`UNetMotionModel`] forward method.

        Args:
            sample (`ms.Tensor`):
                The noisy input tensor with the following shape `(batch, channel, num_frames, height, width)`.
            timestep (`ms.Tensor` or `float` or `int`): The number of timesteps to denoise an input.
            encoder_hidden_states (`ms.Tensor`):
                The encoder hidden states with shape `(batch, sequence_length, feature_dim)`.
            timestep_cond (`ms.Tensor`, *optional*, defaults to `None`):
                Conditional embeddings for timestep. If provided, the embeddings will be summed with the samples passed
                through the `self.time_embedding` layer to obtain the timestep embeddings.
            attention_mask (`ms.Tensor`, *optional*, defaults to `None`):
                An attention mask of shape `(batch, key_tokens)` is applied to `encoder_hidden_states`. If `1` the mask
                is kept, otherwise if `0` it is discarded. Mask will be converted into a bias, which adds large
                negative values to the attention scores corresponding to "discard" tokens.
            cross_attention_kwargs (`dict`, *optional*):
                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
                `self.processor` in
                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
            down_block_additional_residuals (`tuple` of `ms.Tensor`, *optional*):
                A tuple of tensors that if specified are added to the residuals of down unet blocks.
            mid_block_additional_residual (`ms.Tensor`, *optional*):
                A tensor that if specified is added to the residual of the middle unet block.
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~models.unets.unet_motion_model.UNetMotionOutput`] instead of a plain
                tuple.

        Returns:
            [`~models.unets.unet_motion_model.UNetMotionOutput`] or `tuple`:
                If `return_dict` is True, an [`~models.unets.unet_motion_model.UNetMotionOutput`] is returned,
                otherwise a `tuple` is returned where the first element is the sample tensor.
        """
        # By default samples have to be at least a multiple of the overall upsampling factor.
        # The overall upsampling factor is equal to 2 ** (# num of upsampling layers).
        # However, the upsampling interpolation output size can be forced to fit any upsampling size
        # on the fly if necessary.
        default_overall_up_factor = 2**self.num_upsamplers

        # upsample size should be forwarded when sample is not a multiple of `default_overall_up_factor`
        forward_upsample_size = False
        upsample_size = None

        if sample.shape[-2] % default_overall_up_factor != 0 or sample.shape[-1] % default_overall_up_factor != 0:
            forward_upsample_size = True

        # prepare attention_mask
        if attention_mask is not None:
            attention_mask = (1 - attention_mask.to(sample.dtype)) * -10000.0
            attention_mask = attention_mask.unsqueeze(1)

        # 1. time
        timesteps = timestep
        if not ops.is_tensor(timesteps):
            # TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
            if isinstance(timestep, float):
                dtype = ms.float64
            else:
                dtype = ms.int64
            timesteps = ms.Tensor([timesteps], dtype=dtype)
        elif len(timesteps.shape) == 0:
            timesteps = timesteps[None]

        # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
        num_frames = sample.shape[2]
        timesteps = timesteps.broadcast_to((sample.shape[0],))

        t_emb = self.time_proj(timesteps)

        # timesteps does not contain any weights and will always return f32 tensors
        # but time_embedding might actually be running in fp16. so we need to cast here.
        # there might be better ways to encapsulate this.
        t_emb = t_emb.to(dtype=self.dtype)

        emb = self.time_embedding(t_emb, timestep_cond)
        aug_emb = None

        if self.config["addition_embed_type"] == "text_time":
            if "text_embeds" not in added_cond_kwargs:
                raise ValueError(
                    f"{self.__class__} has the config param `addition_embed_type` set to 'text_time' which requires the keyword argument `text_embeds` to be passed in `added_cond_kwargs`"  # noqa: E501
                )

            text_embeds = added_cond_kwargs.get("text_embeds")
            if "time_ids" not in added_cond_kwargs:
                raise ValueError(
                    f"{self.__class__} has the config param `addition_embed_type` set to 'text_time' which requires the keyword argument `time_ids` to be passed in `added_cond_kwargs`"  # noqa: E501
                )
            time_ids = added_cond_kwargs.get("time_ids")
            time_embeds = self.add_time_proj(time_ids.flatten()).to(text_embeds.dtype)
            time_embeds = time_embeds.reshape((text_embeds.shape[0], -1))

            add_embeds = ops.concat([text_embeds, time_embeds], axis=-1)
            add_embeds = add_embeds.to(emb.dtype)
            aug_emb = self.add_embedding(add_embeds)

        emb = emb if aug_emb is None else emb + aug_emb
        emb = emb.repeat_interleave(repeats=num_frames, dim=0)

        if self.encoder_hid_proj is not None and self.config["encoder_hid_dim_type"] == "ip_image_proj":
            if "image_embeds" not in added_cond_kwargs:
                raise ValueError(
                    f"{self.__class__} has the config param `encoder_hid_dim_type` set to "
                    f"'ip_image_proj' which requires the keyword argument `image_embeds` to be passed in `added_cond_kwargs`"
                )
            image_embeds = added_cond_kwargs.get("image_embeds")
            image_embeds = self.encoder_hid_proj(image_embeds)
            image_embeds = [image_embed.repeat_interleave(repeats=num_frames, dim=0) for image_embed in image_embeds]
            encoder_hidden_states = (encoder_hidden_states, image_embeds)

        # 2. pre-process
        sample = sample.permute(0, 2, 1, 3, 4).reshape((sample.shape[0] * num_frames, -1) + sample.shape[3:])
        sample = self.conv_in(sample)

        # 3. down
        down_block_res_samples = (sample,)
        for downsample_block in self.down_blocks:
            if downsample_block.has_cross_attention:
                sample, res_samples = downsample_block(
                    hidden_states=sample,
                    temb=emb,
                    encoder_hidden_states=encoder_hidden_states,
                    attention_mask=attention_mask,
                    num_frames=num_frames,
                    cross_attention_kwargs=cross_attention_kwargs,
                )
            else:
                sample, res_samples = downsample_block(hidden_states=sample, temb=emb, num_frames=num_frames)

            down_block_res_samples += res_samples

        if down_block_additional_residuals is not None:
            new_down_block_res_samples = ()

            for down_block_res_sample, down_block_additional_residual in zip(
                down_block_res_samples, down_block_additional_residuals
            ):
                down_block_res_sample = down_block_res_sample + down_block_additional_residual
                new_down_block_res_samples += (down_block_res_sample,)

            down_block_res_samples = new_down_block_res_samples

        # 4. mid
        if self.mid_block is not None:
            # To support older versions of motion modules that don't have a mid_block
            if self.mid_block.has_motion_modules:
                sample = self.mid_block(
                    sample,
                    emb,
                    encoder_hidden_states=encoder_hidden_states,
                    attention_mask=attention_mask,
                    num_frames=num_frames,
                    cross_attention_kwargs=cross_attention_kwargs,
                )
            else:
                sample = self.mid_block(
                    sample,
                    emb,
                    encoder_hidden_states=encoder_hidden_states,
                    attention_mask=attention_mask,
                    cross_attention_kwargs=cross_attention_kwargs,
                )

        if mid_block_additional_residual is not None:
            sample = sample + mid_block_additional_residual

        # 5. up
        for i, upsample_block in enumerate(self.up_blocks):
            is_final_block = i == len(self.up_blocks) - 1

            res_samples = down_block_res_samples[-self.layers_per_resnet_in_up_blocks[i] :]
            down_block_res_samples = down_block_res_samples[: -self.layers_per_resnet_in_up_blocks[i]]

            # if we have not reached the final block and need to forward the
            # upsample size, we do it here
            if not is_final_block and forward_upsample_size:
                upsample_size = down_block_res_samples[-1].shape[2:]

            if upsample_block.has_cross_attention:
                sample = upsample_block(
                    hidden_states=sample,
                    temb=emb,
                    res_hidden_states_tuple=res_samples,
                    encoder_hidden_states=encoder_hidden_states,
                    upsample_size=upsample_size,
                    attention_mask=attention_mask,
                    num_frames=num_frames,
                    cross_attention_kwargs=cross_attention_kwargs,
                )
            else:
                sample = upsample_block(
                    hidden_states=sample,
                    temb=emb,
                    res_hidden_states_tuple=res_samples,
                    upsample_size=upsample_size,
                    num_frames=num_frames,
                )

        # 6. post-process
        if self.conv_norm_out:
            sample = self.conv_norm_out(sample)
            sample = self.conv_act(sample)

        sample = self.conv_out(sample)

        # reshape to (batch, channel, num_frames, height, width)
        sample = sample[None, :].reshape((-1, num_frames) + sample.shape[1:]).permute(0, 2, 1, 3, 4)

        if not return_dict:
            return (sample,)

        return UNetMotionOutput(sample=sample)

`mindone.diffusers.UNetMotionModel.attn_processors: Dict[str, AttentionProcessor]` `property` ¶

RETURNS	DESCRIPTION
`Dict[str, AttentionProcessor]`	`dict` of attention processors: A dictionary containing all attention processors used in the model, indexed by weight name.
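
The keys of this dictionary are dotted module paths ending in `.processor`. A minimal sketch of the traversal follows. It uses a hypothetical `ToyCell` stand-in rather than a real `nn.Cell`, and mimics only `name_cells()` and `get_processor()` as they appear in the source listing:

```python
from typing import Dict


class ToyCell:
    """Hypothetical stand-in for nn.Cell: named children and, optionally, a processor."""

    def __init__(self, children=None, processor=None):
        self._children = dict(children or {})
        if processor is not None:
            # Only attention-like cells expose get_processor().
            self.get_processor = lambda: processor

    def name_cells(self) -> Dict[str, "ToyCell"]:
        return self._children


def collect_processors(root: ToyCell) -> Dict[str, str]:
    """Walk the tree depth-first and key each processor by its dotted path."""
    found: Dict[str, str] = {}

    def visit(name: str, cell: ToyCell) -> None:
        if hasattr(cell, "get_processor"):
            found[f"{name}.processor"] = cell.get_processor()
        for sub_name, child in cell.name_cells().items():
            visit(f"{name}.{sub_name}", child)

    for name, cell in root.name_cells().items():
        visit(name, cell)
    return found


# Toy module tree; processors are plain strings for illustration only.
toy_unet = ToyCell({
    "down_blocks": ToyCell({"0": ToyCell({"attn1": ToyCell(processor="AttnProcessor")})}),
    "mid_block": ToyCell({"attn1": ToyCell(processor="AttnProcessor")}),
})
processor_map = collect_processors(toy_unet)
print(sorted(processor_map))
```

On the real model the same key format should apply, so the keys of `attn_processors` can be reused when building a per-layer dictionary for `set_attn_processor`.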

`mindone.diffusers.UNetMotionModel.construct(sample, timestep, encoder_hidden_states, timestep_cond=None, attention_mask=None, cross_attention_kwargs=None, added_cond_kwargs=None, down_block_additional_residuals=None, mid_block_additional_residual=None, return_dict=False)` ¶

The [UNetMotionModel] forward method.

PARAMETER	DESCRIPTION
`sample`	The noisy input tensor with the following shape `(batch, channel, num_frames, height, width)`. TYPE: `ms.Tensor`
`timestep`	The number of timesteps to denoise an input. TYPE: `ms.Tensor` or `float` or `int`
`encoder_hidden_states`	The encoder hidden states with shape `(batch, sequence_length, feature_dim)`. TYPE: `ms.Tensor`
`timestep_cond`	Conditional embeddings for timestep. If provided, the embeddings will be summed with the samples passed through the `self.time_embedding` layer to obtain the timestep embeddings. TYPE: `Optional[Tensor]` DEFAULT: `None`
`attention_mask`	An attention mask of shape `(batch, key_tokens)` is applied to `encoder_hidden_states`. If `1` the mask is kept, otherwise if `0` it is discarded. Mask will be converted into a bias, which adds large negative values to the attention scores corresponding to "discard" tokens. TYPE: `ms.Tensor`, optional DEFAULT: `None`
`cross_attention_kwargs`	A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in diffusers.models.attention_processor. TYPE: `dict`, optional DEFAULT: `None`
`down_block_additional_residuals`	A tuple of tensors that if specified are added to the residuals of down unet blocks. TYPE: `Optional[Tuple[Tensor]]` DEFAULT: `None`
`mid_block_additional_residual`	A tensor that if specified is added to the residual of the middle unet block. TYPE: `Optional[Tensor]` DEFAULT: `None`
`return_dict`	Whether or not to return a [`~models.unets.unet_motion_model.UNetMotionOutput`] instead of a plain tuple. TYPE: `bool`, optional DEFAULT: `False`
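
The mask-to-bias conversion described for `attention_mask` can be sketched in plain Python. The helper name `mask_to_bias` is hypothetical. The arithmetic `(1 - mask) * -10000.0` follows the source listing, which then adds a singleton axis with `unsqueeze(1)`:

```python
from typing import List


def mask_to_bias(mask: List[List[int]]) -> List[List[List[float]]]:
    """Turn a (batch, key_tokens) 0/1 mask into a (batch, 1, key_tokens) additive bias."""
    bias = []
    for row in mask:
        # keep (1) -> zero bias, discard (0) -> -10000.0
        bias_row = [(1 - float(m)) * -10000.0 for m in row]
        bias.append([bias_row])  # extra axis mirrors unsqueeze(1)
    return bias


bias = mask_to_bias([[1, 1, 0], [1, 0, 0]])
print(bias)
```

Because the bias is added to the attention scores before the softmax, "discard" tokens end up with a weight close to zero rather than being removed outright.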

RETURNS	DESCRIPTION
`Union[UNetMotionOutput, Tuple[Tensor]]`	[`~models.unets.unet_motion_model.UNetMotionOutput`] or `tuple`: If `return_dict` is True, an [`~models.unets.unet_motion_model.UNetMotionOutput`] is returned, otherwise a `tuple` is returned where the first element is the sample tensor.
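
As a sketch of the shape bookkeeping in `construct` (a pure-Python illustration with hypothetical helper names, not the library API): the source listing reads `num_frames` from `sample.shape[2]`, folds the frames into the batch axis before the 2D layers, and unfolds them again at the end. The upsample size is only forwarded when height or width is not a multiple of `2 ** num_upsamplers`.

```python
from typing import Tuple

Shape5D = Tuple[int, int, int, int, int]
Shape4D = Tuple[int, int, int, int]


def fold_frames(shape: Shape5D) -> Shape4D:
    """(batch, channel, frames, h, w) -> (batch * frames, channel, h, w)."""
    batch, channel, frames, height, width = shape
    return (batch * frames, channel, height, width)


def unfold_frames(shape: Shape4D, num_frames: int) -> Shape5D:
    """(batch * frames, channel, h, w) -> (batch, channel, frames, h, w)."""
    folded, channel, height, width = shape
    return (folded // num_frames, channel, num_frames, height, width)


def needs_upsample_size(shape: Shape5D, num_upsamplers: int) -> bool:
    """True when h or w is not divisible by the overall upsampling factor."""
    factor = 2 ** num_upsamplers
    return shape[-2] % factor != 0 or shape[-1] % factor != 0


latents = (2, 4, 16, 64, 64)  # assumed example sizes, not taken from the source
folded = fold_frames(latents)
print(folded, unfold_frames(folded, num_frames=16))
print(needs_upsample_size(latents, 3), needs_upsample_size((2, 4, 16, 60, 64), 3))
```

Folding frames into the batch lets the spatial layers treat every frame as an independent image, while the motion modules receive `num_frames` so they can regroup the frames along the temporal axis.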

Source code in mindone/diffusers/models/unets/unet_motion_model.py

def construct(
    self,
    sample: ms.Tensor,
    timestep: Union[ms.Tensor, float, int],
    encoder_hidden_states: ms.Tensor,
    timestep_cond: Optional[ms.Tensor] = None,
    attention_mask: Optional[ms.Tensor] = None,
    cross_attention_kwargs: Optional[Dict[str, Any]] = None,
    added_cond_kwargs: Optional[Dict[str, ms.Tensor]] = None,
    down_block_additional_residuals: Optional[Tuple[ms.Tensor]] = None,
    mid_block_additional_residual: Optional[ms.Tensor] = None,
    return_dict: bool = False,
) -> Union[UNetMotionOutput, Tuple[ms.Tensor]]:
    r"""
    The [`UNetMotionModel`] forward method.

    Args:
        sample (`ms.Tensor`):
            The noisy input tensor with the following shape `(batch, channel, num_frames, height, width)`.
        timestep (`ms.Tensor` or `float` or `int`): The number of timesteps to denoise an input.
        encoder_hidden_states (`ms.Tensor`):
            The encoder hidden states with shape `(batch, sequence_length, feature_dim)`.
        timestep_cond (`ms.Tensor`, *optional*, defaults to `None`):
            Conditional embeddings for timestep. If provided, the embeddings will be summed with the samples passed
            through the `self.time_embedding` layer to obtain the timestep embeddings.
        attention_mask (`ms.Tensor`, *optional*, defaults to `None`):
            An attention mask of shape `(batch, key_tokens)` is applied to `encoder_hidden_states`. If `1` the mask
            is kept, otherwise if `0` it is discarded. Mask will be converted into a bias, which adds large
            negative values to the attention scores corresponding to "discard" tokens.
        cross_attention_kwargs (`dict`, *optional*):
            A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
            `self.processor` in
            [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
        down_block_additional_residuals (`tuple` of `ms.Tensor`, *optional*):
            A tuple of tensors that if specified are added to the residuals of down unet blocks.
        mid_block_additional_residual (`ms.Tensor`, *optional*):
            A tensor that if specified is added to the residual of the middle unet block.
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`~models.unets.unet_motion_model.UNetMotionOutput`] instead of a plain
            tuple.

    Returns:
        [`~models.unets.unet_motion_model.UNetMotionOutput`] or `tuple`:
            If `return_dict` is True, an [`~models.unets.unet_motion_model.UNetMotionOutput`] is returned,
            otherwise a `tuple` is returned where the first element is the sample tensor.
    """
    # By default samples have to be at least a multiple of the overall upsampling factor.
    # The overall upsampling factor is equal to 2 ** (# num of upsampling layers).
    # However, the upsampling interpolation output size can be forced to fit any upsampling size
    # on the fly if necessary.
    default_overall_up_factor = 2**self.num_upsamplers

    # upsample size should be forwarded when sample is not a multiple of `default_overall_up_factor`
    forward_upsample_size = False
    upsample_size = None

    if sample.shape[-2] % default_overall_up_factor != 0 or sample.shape[-1] % default_overall_up_factor != 0:
        forward_upsample_size = True

    # prepare attention_mask
    if attention_mask is not None:
        attention_mask = (1 - attention_mask.to(sample.dtype)) * -10000.0
        attention_mask = attention_mask.unsqueeze(1)

    # 1. time
    timesteps = timestep
    if not ops.is_tensor(timesteps):
        # TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
        if isinstance(timestep, float):
            dtype = ms.float64
        else:
            dtype = ms.int64
        timesteps = ms.Tensor([timesteps], dtype=dtype)
    elif len(timesteps.shape) == 0:
        timesteps = timesteps[None]

    # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
    num_frames = sample.shape[2]
    timesteps = timesteps.broadcast_to((sample.shape[0],))

    t_emb = self.time_proj(timesteps)

    # timesteps does not contain any weights and will always return f32 tensors
    # but time_embedding might actually be running in fp16. so we need to cast here.
    # there might be better ways to encapsulate this.
    t_emb = t_emb.to(dtype=self.dtype)

    emb = self.time_embedding(t_emb, timestep_cond)
    aug_emb = None

    if self.config["addition_embed_type"] == "text_time":
        if "text_embeds" not in added_cond_kwargs:
            raise ValueError(
                f"{self.__class__} has the config param `addition_embed_type` set to 'text_time' which requires the keyword argument `text_embeds` to be passed in `added_cond_kwargs`"  # noqa: E501
            )

        text_embeds = added_cond_kwargs.get("text_embeds")
        if "time_ids" not in added_cond_kwargs:
            raise ValueError(
                f"{self.__class__} has the config param `addition_embed_type` set to 'text_time' which requires the keyword argument `time_ids` to be passed in `added_cond_kwargs`"  # noqa: E501
            )
        time_ids = added_cond_kwargs.get("time_ids")
        time_embeds = self.add_time_proj(time_ids.flatten()).to(text_embeds.dtype)
        time_embeds = time_embeds.reshape((text_embeds.shape[0], -1))

        add_embeds = ops.concat([text_embeds, time_embeds], axis=-1)
        add_embeds = add_embeds.to(emb.dtype)
        aug_emb = self.add_embedding(add_embeds)

    emb = emb if aug_emb is None else emb + aug_emb
    emb = emb.repeat_interleave(repeats=num_frames, dim=0)

    if self.encoder_hid_proj is not None and self.config["encoder_hid_dim_type"] == "ip_image_proj":
        if "image_embeds" not in added_cond_kwargs:
            raise ValueError(
                f"{self.__class__} has the config param `encoder_hid_dim_type` set to "
                f"'ip_image_proj' which requires the keyword argument `image_embeds` to be passed in `added_cond_kwargs`"
            )
        image_embeds = added_cond_kwargs.get("image_embeds")
        image_embeds = self.encoder_hid_proj(image_embeds)
        image_embeds = [image_embed.repeat_interleave(repeats=num_frames, dim=0) for image_embed in image_embeds]
        encoder_hidden_states = (encoder_hidden_states, image_embeds)

    # 2. pre-process
    sample = sample.permute(0, 2, 1, 3, 4).reshape((sample.shape[0] * num_frames, -1) + sample.shape[3:])
    sample = self.conv_in(sample)

    # 3. down
    down_block_res_samples = (sample,)
    for downsample_block in self.down_blocks:
        if downsample_block.has_cross_attention:
            sample, res_samples = downsample_block(
                hidden_states=sample,
                temb=emb,
                encoder_hidden_states=encoder_hidden_states,
                attention_mask=attention_mask,
                num_frames=num_frames,
                cross_attention_kwargs=cross_attention_kwargs,
            )
        else:
            sample, res_samples = downsample_block(hidden_states=sample, temb=emb, num_frames=num_frames)

        down_block_res_samples += res_samples

    if down_block_additional_residuals is not None:
        new_down_block_res_samples = ()

        for down_block_res_sample, down_block_additional_residual in zip(
            down_block_res_samples, down_block_additional_residuals
        ):
            down_block_res_sample = down_block_res_sample + down_block_additional_residual
            new_down_block_res_samples += (down_block_res_sample,)

        down_block_res_samples = new_down_block_res_samples

    # 4. mid
    if self.mid_block is not None:
        # To support older versions of motion modules that don't have a mid_block
        if self.mid_block.has_motion_modules:
            sample = self.mid_block(
                sample,
                emb,
                encoder_hidden_states=encoder_hidden_states,
                attention_mask=attention_mask,
                num_frames=num_frames,
                cross_attention_kwargs=cross_attention_kwargs,
            )
        else:
            sample = self.mid_block(
                sample,
                emb,
                encoder_hidden_states=encoder_hidden_states,
                attention_mask=attention_mask,
                cross_attention_kwargs=cross_attention_kwargs,
            )

    if mid_block_additional_residual is not None:
        sample = sample + mid_block_additional_residual

    # 5. up
    for i, upsample_block in enumerate(self.up_blocks):
        is_final_block = i == len(self.up_blocks) - 1

        res_samples = down_block_res_samples[-self.layers_per_resnet_in_up_blocks[i] :]
        down_block_res_samples = down_block_res_samples[: -self.layers_per_resnet_in_up_blocks[i]]

        # if we have not reached the final block and need to forward the
        # upsample size, we do it here
        if not is_final_block and forward_upsample_size:
            upsample_size = down_block_res_samples[-1].shape[2:]

        if upsample_block.has_cross_attention:
            sample = upsample_block(
                hidden_states=sample,
                temb=emb,
                res_hidden_states_tuple=res_samples,
                encoder_hidden_states=encoder_hidden_states,
                upsample_size=upsample_size,
                attention_mask=attention_mask,
                num_frames=num_frames,
                cross_attention_kwargs=cross_attention_kwargs,
            )
        else:
            sample = upsample_block(
                hidden_states=sample,
                temb=emb,
                res_hidden_states_tuple=res_samples,
                upsample_size=upsample_size,
                num_frames=num_frames,
            )

    # 6. post-process
    if self.conv_norm_out:
        sample = self.conv_norm_out(sample)
        sample = self.conv_act(sample)

    sample = self.conv_out(sample)

    # reshape to (batch, channel, num_frames, height, width)
    sample = sample[None, :].reshape((-1, num_frames) + sample.shape[1:]).permute(0, 2, 1, 3, 4)

    if not return_dict:
        return (sample,)

    return UNetMotionOutput(sample=sample)

`mindone.diffusers.UNetMotionModel.enable_forward_chunking(chunk_size=None, dim=0)` ¶

Sets the attention processor to use feed forward chunking.

PARAMETER	DESCRIPTION
`chunk_size`	The chunk size of the feed-forward layers. If not specified, will run feed-forward layer individually over each tensor of dim=`dim`. TYPE: `int`, optional DEFAULT: `None`
`dim`	The dimension over which the feed-forward computation should be chunked. Choose between dim=0 (batch) or dim=1 (sequence length). TYPE: `int`, optional, defaults to `0` DEFAULT: `0`
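
To illustrate what chunking means here, the following sketch applies a toy position-wise layer to slices of the input and concatenates the results. It is a plain-Python illustration with hypothetical names (`feed_forward`, `chunked_feed_forward`), not the library implementation, and the divisibility check is an assumption of the sketch:

```python
from typing import Callable, List

Row = List[float]


def feed_forward(rows: List[Row]) -> List[Row]:
    """Toy position-wise layer: each row is transformed independently."""
    return [[2.0 * x + 1.0 for x in row] for row in rows]


def chunked_feed_forward(
    ff: Callable[[List[Row]], List[Row]], rows: List[Row], chunk_size: int
) -> List[Row]:
    """Apply `ff` to slices of `chunk_size` rows and concatenate the results."""
    if len(rows) % chunk_size != 0:
        raise ValueError("number of rows must be divisible by chunk_size")
    out: List[Row] = []
    for start in range(0, len(rows), chunk_size):
        out.extend(ff(rows[start : start + chunk_size]))
    return out


tokens = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
full = feed_forward(tokens)
chunked = chunked_feed_forward(feed_forward, tokens, chunk_size=2)
print(chunked == full)
```

Since the layer acts on each position independently, the chunked result matches the unchunked one. The trade-off is lower peak memory in exchange for more sequential calls.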

Source code in mindone/diffusers/models/unets/unet_motion_model.py

def enable_forward_chunking(self, chunk_size: Optional[int] = None, dim: int = 0) -> None:
    """
    Sets the attention processor to use [feed forward
    chunking](https://huggingface.co/blog/reformer#2-chunked-feed-forward-layers).

    Parameters:
        chunk_size (`int`, *optional*):
            The chunk size of the feed-forward layers. If not specified, will run feed-forward layer individually
            over each tensor of dim=`dim`.
        dim (`int`, *optional*, defaults to `0`):
            The dimension over which the feed-forward computation should be chunked. Choose between dim=0 (batch)
            or dim=1 (sequence length).
    """
    if dim not in [0, 1]:
        raise ValueError(f"Make sure to set `dim` to either 0 or 1, not {dim}")

    # By default chunk size is 1
    chunk_size = chunk_size or 1

    def fn_recursive_feed_forward(module: nn.Cell, chunk_size: int, dim: int):
        if hasattr(module, "set_chunk_feed_forward"):
            module.set_chunk_feed_forward(chunk_size=chunk_size, dim=dim)

        for child in module.name_cells().values():
            fn_recursive_feed_forward(child, chunk_size, dim)

    for module in self.name_cells().values():
        fn_recursive_feed_forward(module, chunk_size, dim)

`mindone.diffusers.UNetMotionModel.freeze_unet2d_params()` ¶

Freeze the weights of just the UNet2DConditionModel, and leave the motion modules unfrozen for fine tuning.
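
The freeze-then-unfreeze pattern can be sketched with toy objects. `ToyParam` and `ToyBlock` are hypothetical stand-ins for parameters and UNet blocks. Only the two-step logic mirrors the source listing:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyParam:
    """Hypothetical stand-in for a parameter with a requires_grad flag."""
    name: str
    requires_grad: bool = True


@dataclass
class ToyBlock:
    """Hypothetical block holding spatial (2D) and motion parameters separately."""
    spatial: List[ToyParam] = field(default_factory=list)
    motion_modules: List[ToyParam] = field(default_factory=list)


def freeze_all_but_motion(blocks: List[ToyBlock]) -> None:
    # Step 1: freeze everything.
    for block in blocks:
        for param in block.spatial + block.motion_modules:
            param.requires_grad = False
    # Step 2: re-enable gradients for the motion modules only.
    for block in blocks:
        for param in block.motion_modules:
            param.requires_grad = True


blocks = [
    ToyBlock([ToyParam("down.resnet.weight")], [ToyParam("down.motion.weight")]),
    ToyBlock([ToyParam("up.resnet.weight")], [ToyParam("up.motion.weight")]),
]
freeze_all_but_motion(blocks)
trainable = [p.name for b in blocks for p in b.spatial + b.motion_modules if p.requires_grad]
print(trainable)
```

Freezing everything first and then re-enabling a subset avoids having to enumerate every non-motion parameter explicitly.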

Source code in mindone/diffusers/models/unets/unet_motion_model.py

def freeze_unet2d_params(self) -> None:
    """Freeze the weights of just the UNet2DConditionModel, and leave the motion modules
    unfrozen for fine tuning.
    """
    # Freeze everything
    for param in self.get_parameters():
        param.requires_grad = False

    # Unfreeze Motion Modules
    for down_block in self.down_blocks:
        motion_modules = down_block.motion_modules
        for param in motion_modules.get_parameters():
            param.requires_grad = True

    for up_block in self.up_blocks:
        motion_modules = up_block.motion_modules
        for param in motion_modules.get_parameters():
            param.requires_grad = True

    if hasattr(self.mid_block, "motion_modules"):
        motion_modules = self.mid_block.motion_modules
        for param in motion_modules.get_parameters():
            param.requires_grad = True

`mindone.diffusers.UNetMotionModel.set_attn_processor(processor)` ¶

Sets the attention processor to use to compute attention.

PARAMETER	DESCRIPTION
`processor`	The instantiated processor class or a dictionary of processor classes that will be set as the processor for all `Attention` layers. If `processor` is a dict, the key needs to define the path to the corresponding cross attention processor. This is strongly recommended when setting trainable attention processors. TYPE: `dict` of `AttentionProcessor` or only `AttentionProcessor`
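
The difference between passing a single processor and a dictionary can be sketched as follows. `assign_processors` is a hypothetical helper that only models which processor each layer would receive. Processors are represented by plain strings for illustration:

```python
from typing import Dict, List, Union


def assign_processors(
    layer_paths: List[str], processor: Union[str, Dict[str, str]]
) -> Dict[str, str]:
    """Return the processor each attention layer would receive.

    `layer_paths` plays the role of the keys of `attn_processors`.
    """
    if isinstance(processor, dict) and len(processor) != len(layer_paths):
        raise ValueError(
            f"got {len(processor)} processors for {len(layer_paths)} attention layers"
        )
    if not isinstance(processor, dict):
        # One instance is shared by every layer.
        return {path: processor for path in layer_paths}
    # Per-layer assignment: each key must match a layer path exactly.
    return {path: processor[path] for path in layer_paths}


paths = ["down_blocks.0.attn1.processor", "mid_block.attn1.processor"]
shared = assign_processors(paths, "AttnProcessor")
per_layer = assign_processors(paths, {paths[0]: "ProcA", paths[1]: "ProcB"})
print(shared)
print(per_layer)
```

Note that the source listing uses `processor.pop(...)` when a dictionary is passed, so the dictionary is consumed by the call. Pass a copy if it is needed afterwards.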

Source code in mindone/diffusers/models/unets/unet_motion_model.py

def set_attn_processor(self, processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]]):  # type: ignore
    r"""
    Sets the attention processor to use to compute attention.

    Parameters:
        processor (`dict` of `AttentionProcessor` or only `AttentionProcessor`):
            The instantiated processor class or a dictionary of processor classes that will be set as the processor
            for **all** `Attention` layers.
            If `processor` is a dict, the key needs to define the path to the corresponding cross attention
            processor. This is strongly recommended when setting trainable attention processors.
    """
    count = len(self.attn_processors.keys())

    if isinstance(processor, dict) and len(processor) != count:
        raise ValueError(
            f"A dict of processors was passed, but the number of processors {len(processor)} does not match the"
            f" number of attention layers: {count}. Please make sure to pass {count} processor classes."
        )

    def fn_recursive_attn_processor(name: str, module: nn.Cell, processor):
        if hasattr(module, "set_processor"):
            if not isinstance(processor, dict):
                module.set_processor(processor)
            else:
                module.set_processor(processor.pop(f"{name}.processor"))

        for sub_name, child in module.name_cells().items():
            fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)

    for name, module in self.name_cells().items():
        fn_recursive_attn_processor(name, module, processor)
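To illustrate how the dictionary keys are resolved, here is a minimal, library-free sketch of the same recursive traversal. `ToyCell`, `ToyAttention`, and `ToyProcessor` are hypothetical stand-ins for `nn.Cell`, an attention layer, and an attention processor. Unlike the listing above, the sketch copies the dictionary before popping from it, which is a deliberate deviation to keep the caller's mapping intact.

```python
from typing import Dict, Union


class ToyProcessor:
    """Hypothetical stand-in for an attention processor."""

    def __init__(self, tag: str):
        self.tag = tag


class ToyCell:
    """Hypothetical stand-in for a cell that holds named child cells."""

    def __init__(self, **children):
        self._children = dict(children)

    def name_cells(self) -> Dict[str, "ToyCell"]:
        return self._children


class ToyAttention(ToyCell):
    """Leaf cell exposing `set_processor`, which the traversal looks for."""

    def __init__(self):
        super().__init__()
        self.processor = None

    def set_processor(self, processor: ToyProcessor) -> None:
        self.processor = processor


def set_attn_processor(root: ToyCell, processor: Union[ToyProcessor, Dict[str, ToyProcessor]]) -> None:
    # Deviation from the listing: copy so the caller's dict is not consumed.
    if isinstance(processor, dict):
        processor = dict(processor)

    def recurse(name: str, module: ToyCell) -> None:
        if hasattr(module, "set_processor"):
            if isinstance(processor, dict):
                # The key is the dotted path to the layer plus ".processor".
                module.set_processor(processor.pop(f"{name}.processor"))
            else:
                module.set_processor(processor)
        for sub_name, child in module.name_cells().items():
            recurse(f"{name}.{sub_name}", child)

    for name, module in root.name_cells().items():
        recurse(name, module)


model = ToyCell(
    down=ToyCell(attn=ToyAttention()),
    mid=ToyCell(attn=ToyAttention()),
)

# One processor per attention layer, keyed by "<path to layer>.processor".
set_attn_processor(model, {
    "down.attn.processor": ToyProcessor("a"),
    "mid.attn.processor": ToyProcessor("b"),
})
down_attn = model.name_cells()["down"].name_cells()["attn"]
mid_attn = model.name_cells()["mid"].name_cells()["attn"]
print(down_attn.processor.tag, mid_attn.processor.tag)  # a b
```

Passing a single processor instead of a dict assigns that same instance to every attention layer.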

`mindone.diffusers.UNetMotionModel.set_default_attn_processor()` ¶

Disables custom attention processors and sets the default attention implementation.

Source code in mindone/diffusers/models/unets/unet_motion_model.py

def set_default_attn_processor(self):
    """
    Disables custom attention processors and sets the default attention implementation.
    """
    if all(proc.__class__ in CROSS_ATTENTION_PROCESSORS for proc in self.attn_processors.values()):
        processor = AttnProcessor()
    else:
        raise ValueError(
            f"Cannot call `set_default_attn_processor` when attention processors are of type {next(iter(self.attn_processors.values()))}"
        )

    self.set_attn_processor(processor)
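The guard in the listing above only resets to the default when every current processor belongs to a known set; otherwise it raises. A small sketch of that check follows, with hypothetical toy classes and a hypothetical allow-list `TOY_ALLOWED` standing in for `CROSS_ATTENTION_PROCESSORS`.

```python
class ToyDefaultProcessor:
    """Hypothetical default attention implementation."""


class ToyFusedProcessor:
    """Hypothetical alternative that may be swapped back to the default."""


class ToyIncompatibleProcessor:
    """Hypothetical processor that is not in the allow-list."""


# Hypothetical allow-list playing the role of CROSS_ATTENTION_PROCESSORS.
TOY_ALLOWED = (ToyDefaultProcessor, ToyFusedProcessor)


def pick_default(attn_processors):
    # Reset only if every installed processor comes from the allow-list.
    if all(proc.__class__ in TOY_ALLOWED for proc in attn_processors.values()):
        return ToyDefaultProcessor()
    first = next(iter(attn_processors.values()))
    raise ValueError(f"Cannot reset processors of type {type(first).__name__}")


ok = pick_default({"down.attn.processor": ToyFusedProcessor()})
print(type(ok).__name__)  # ToyDefaultProcessor

try:
    pick_default({"down.attn.processor": ToyIncompatibleProcessor()})
except ValueError as err:
    print(err)  # Cannot reset processors of type ToyIncompatibleProcessor
```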

`mindone.diffusers.models.unets.unet_3d_condition.UNet3DConditionOutput` `dataclass` ¶

Bases: BaseOutput

The output of [UNet3DConditionModel].

PARAMETER	DESCRIPTION
`sample`	The hidden states output conditioned on `encoder_hidden_states` input. Output of last layer of model. TYPE: `ms.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`

Source code in mindone/diffusers/models/unets/unet_3d_condition.py

@dataclass
class UNet3DConditionOutput(BaseOutput):
    """
    The output of [`UNet3DConditionModel`].

    Args:
        sample (`ms.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
            The hidden states output conditioned on `encoder_hidden_states` input. Output of last layer of model.
    """

    sample: ms.Tensor
