
Limitations

Due to framework differences, some APIs and models will not be identical to huggingface/diffusers for the foreseeable future.

APIs

xxx.from_pretrained

  • torch_dtype is renamed to mindspore_dtype
  • device_map, max_memory, offload_folder, offload_state_dict, low_cpu_mem_usage will not be supported.
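For example, a minimal sketch of loading a pipeline with the renamed argument (the pipeline class and checkpoint name are only illustrative):

```python
import mindspore as ms
from mindone.diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    mindspore_dtype=ms.float16,  # was torch_dtype=torch.float16 in πŸ€— diffusers
)
```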

BaseOutput

  • The default value of return_dict is changed to False, because GRAPH_MODE does not allow constructing an instance of BaseOutput.
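As a consequence, calls return plain tuples by default, and results are accessed by index. A minimal sketch, assuming the pipe loaded above (the prompt is illustrative):

```python
# With return_dict=False (the default here), the call returns a tuple
# instead of a BaseOutput instance.
image = pipe("a photo of an astronaut riding a horse")[0][0]
# πŸ€— diffusers equivalent: pipe(prompt).images[0]
```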

Output of AutoencoderKL.encode

Unlike πŸ€— diffusers, where encode returns posterior = DiagonalGaussianDistribution(latent) and sampling is done with posterior.sample(), we can only output the latent and then do sampling through AutoencoderKL.diag_gauss_dist.sample(latent).
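A minimal sketch of the difference, assuming vae is an AutoencoderKL instance and image is a 4-D image tensor:

```python
# πŸ€— diffusers:
#   posterior = vae.encode(image).latent_dist
#   z = posterior.sample()

# mindone.diffusers (return_dict defaults to False, so encode returns a tuple):
latent = vae.encode(image)[0]
z = vae.diag_gauss_dist.sample(latent)
```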

self.config in construct()

For many models, parameters used in initialization are registered in self.config. They are often accessed during construct, e.g. via if self.config.xxx == xxx, to determine execution paths in the original πŸ€— diffusers. However, attribute access like this is not supported by MindSpore's static graph syntax. Two feasible replacements are:

  • Set new attributes on self during initialization, e.g. self.xxx = self.config.xxx, and use self.xxx in construct instead.
  • Use self.config["xxx"], since self.config is an OrderedDict and item access like this is supported in static graph mode.

When self.config.xxx changes, update both self.xxx and self.config["xxx"].
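A minimal sketch of both options, assuming config is the dict-like config registered on the model (the cell, config keys, and branch logic are illustrative):

```python
import mindspore.nn as nn

class ExampleBlock(nn.Cell):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Option 1: copy the config entry onto self during initialization.
        self.use_residual = self.config["use_residual"]

    def construct(self, x):
        # Option 1: plain attribute access, safe in static graph mode.
        if self.use_residual:
            x = x + 1
        # Option 2: item access on the config dict, also supported in static graph mode.
        if self.config["scale_output"]:
            x = x * 2
        return x
```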

Models

The table below summarizes the current support in mindone/diffusers for each of these modules: whether they work in PyNative FP16, PyNative FP32, Graph FP16, and Graph FP32 mode.
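The execution modes in the table are selected through MindSpore's context API; a minimal sketch:

```python
import mindspore as ms

ms.set_context(mode=ms.GRAPH_MODE)       # Graph mode columns
# ms.set_context(mode=ms.PYNATIVE_MODE)  # PyNative mode columns
```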

| Names | Pynative FP16 | Pynative FP32 | Graph FP16 | Graph FP32 | Description |
|---|---|---|---|---|---|
| StableCascadeUNet | ❌ | βœ… | ❌ | βœ… | huggingface/diffusers outputs NaN when using float16 |
| nn.Conv3d | βœ… | ❌ | βœ… | ❌ | FP32 is not supported on Ascend |
| TemporalConvLayer | βœ… | ❌ | βœ… | ❌ | contains nn.Conv3d |
| TemporalResnetBlock | βœ… | ❌ | βœ… | ❌ | contains nn.Conv3d |
| SpatioTemporalResBlock | βœ… | ❌ | βœ… | ❌ | contains TemporalResnetBlock |
| UNetMidBlock3DCrossAttn | βœ… | ❌ | βœ… | ❌ | contains TemporalConvLayer |
| CrossAttnDownBlock3D | βœ… | ❌ | βœ… | ❌ | contains TemporalConvLayer |
| DownBlock3D | βœ… | ❌ | βœ… | ❌ | contains TemporalConvLayer |
| CrossAttnUpBlock3D | βœ… | ❌ | βœ… | ❌ | contains TemporalConvLayer |
| UpBlock3D | βœ… | ❌ | βœ… | ❌ | contains TemporalConvLayer |
| MidBlockTemporalDecoder | βœ… | ❌ | βœ… | ❌ | contains SpatioTemporalResBlock |
| UpBlockTemporalDecoder | βœ… | ❌ | βœ… | ❌ | contains SpatioTemporalResBlock |
| UNetMidBlockSpatioTemporal | βœ… | ❌ | βœ… | ❌ | contains SpatioTemporalResBlock |
| DownBlockSpatioTemporal | βœ… | ❌ | βœ… | ❌ | contains SpatioTemporalResBlock |
| CrossAttnDownBlockSpatioTemporal | βœ… | ❌ | βœ… | ❌ | contains SpatioTemporalResBlock |
| UpBlockSpatioTemporal | βœ… | ❌ | βœ… | ❌ | contains SpatioTemporalResBlock |
| CrossAttnUpBlockSpatioTemporal | βœ… | ❌ | βœ… | ❌ | contains SpatioTemporalResBlock |
| TemporalDecoder | βœ… | ❌ | βœ… | ❌ | contains nn.Conv3d, MidBlockTemporalDecoder, etc. |
| UNet3DConditionModel | βœ… | ❌ | βœ… | ❌ | contains UNetMidBlock3DCrossAttn, etc. |
| I2VGenXLUNet | βœ… | ❌ | βœ… | ❌ | contains UNetMidBlock3DCrossAttn, etc. |
| AutoencoderKLTemporalDecoder | βœ… | ❌ | βœ… | ❌ | contains MidBlockTemporalDecoder, etc. |
| UNetSpatioTemporalConditionModel | βœ… | ❌ | βœ… | ❌ | contains UNetMidBlockSpatioTemporal, etc. |
| FirUpsample2D | ❌ | βœ… | βœ… | βœ… | ops.Conv2D has poor precision in FP16 and PyNative mode |
| FirDownsample2D | ❌ | βœ… | βœ… | βœ… | ops.Conv2D has poor precision in FP16 and PyNative mode |
| AttnSkipUpBlock2D | ❌ | βœ… | βœ… | βœ… | contains FirUpsample2D |
| SkipUpBlock2D | ❌ | βœ… | βœ… | βœ… | contains FirUpsample2D |
| AttnSkipDownBlock2D | ❌ | βœ… | βœ… | βœ… | contains FirDownsample2D |
| SkipDownBlock2D | ❌ | βœ… | βœ… | βœ… | contains FirDownsample2D |
| ResnetBlock2D (kernel='fir') | ❌ | βœ… | βœ… | βœ… | ops.Conv2D has poor precision in FP16 and PyNative mode |
| CogVideoXTransformer3DModel | βœ… | ❌ | βœ… | ❌ | uses FlashAttention, which only supports FP16/BF16 |
| AutoencoderKLCogVideoX | βœ… | ❌ | ❌ | ❌ | FP16/BF16 only: contains nn.Conv3d sub-cells; PyNative only: static graph syntax does not support the cache operations inside |
| CogVideoXEncoder3D | βœ… | ❌ | ❌ | ❌ | FP16/BF16 only: contains nn.Conv3d sub-cells; PyNative only: static graph syntax does not support the cache operations inside |
| CogVideoXDecoder3D | βœ… | ❌ | ❌ | ❌ | FP16/BF16 only: contains nn.Conv3d sub-cells; PyNative only: static graph syntax does not support the cache operations inside |

Pipelines

The table below summarizes the current support in mindone/diffusers for each of these pipelines on MindSpore 2.3.1: whether they work in PyNative FP16, PyNative FP32, Graph FP16, and Graph FP32 mode.

Due to precision issues in some pipelines, the experiments in the table below default to upcasting GroupNorm to FP32 to avoid this issue.

| Pipelines | Pynative FP16 | Pynative FP32 | Graph FP16 | Graph FP32 | Description |
|---|---|---|---|---|---|
| AnimateDiffPipeline | βœ… | βœ… | βœ… | βœ… | |
| AnimateDiffVideoToVideoPipeline | βœ… | ❌ | βœ… | βœ… | In FP32 and PyNative mode, this pipeline will run out of memory |
| BlipDiffusionPipeline | βœ… | βœ… | βœ… | βœ… | |
| ConsistencyModelPipeline | βœ… | βœ… | βœ… | βœ… | |
| CogVideoXPipeline | βœ… | ❌ | ❌ | ❌ | FlashAttention in the transformer and nn.Conv3d in the VAE only support FP16/BF16; cache operations in the VAE are not supported in MindSpore graph mode |
| CogVideoXImageToVideoPipeline | βœ… | ❌ | ❌ | ❌ | FlashAttention in the transformer and nn.Conv3d in the VAE only support FP16/BF16; cache operations in the VAE are not supported in MindSpore graph mode |
| CogVideoXVideoToVideoPipeline | βœ… | ❌ | ❌ | ❌ | FlashAttention in the transformer and nn.Conv3d in the VAE only support FP16/BF16; cache operations in the VAE are not supported in MindSpore graph mode |
| DDIMPipeline | βœ… | βœ… | βœ… | βœ… | |
| DDPMPipeline | βœ… | βœ… | βœ… | βœ… | |
| DiTPipeline | βœ… | βœ… | βœ… | βœ… | |
| I2VGenXLPipeline | βœ… | ❌ | βœ… | ❌ | ops.bmm and ops.softmax have precision issues under FP16, so we need to upcast them to FP32 to get a good result; FP32 is not supported since I2VGenXLPipeline contains nn.Conv3d |
| IFImg2ImgPipeline | βœ… | βœ… | βœ… | βœ… | |
| IFImg2ImgSuperResolutionPipeline | βœ… | βœ… | βœ… | βœ… | |
| IFInpaintingPipeline | βœ… | βœ… | βœ… | βœ… | |
| IFInpaintingSuperResolutionPipeline | βœ… | βœ… | βœ… | βœ… | |
| IFPipeline | βœ… | βœ… | βœ… | βœ… | |
| IFSuperResolutionPipeline | βœ… | βœ… | βœ… | βœ… | |
| Kandinsky3Img2ImgPipeline | ❌ | ❌ | ❌ | ❌ | Kandinsky3 only provides FP16 weights; additionally, T5 has precision issues, so to achieve the desired results you need to directly input prompt_embeds and attention_mask |
| Kandinsky3Pipeline | ❌ | ❌ | ❌ | ❌ | Kandinsky3 only provides FP16 weights; additionally, T5 has precision issues, so to achieve the desired results you need to directly input prompt_embeds and attention_mask |
| KandinskyImg2ImgPipeline | βœ… | βœ… | βœ… | βœ… | |
| KandinskyInpaintPipeline | βœ… | βœ… | βœ… | βœ… | |
| KandinskyPipeline | βœ… | βœ… | βœ… | βœ… | |
| KandinskyV22ControlnetImg2ImgPipeline | βœ… | βœ… | βœ… | βœ… | |
| KandinskyV22ControlnetPipeline | βœ… | βœ… | βœ… | βœ… | |
| KandinskyV22Img2ImgPipeline | βœ… | βœ… | βœ… | βœ… | |
| KandinskyV22InpaintPipeline | βœ… | βœ… | βœ… | βœ… | |
| KandinskyV22Pipeline | βœ… | βœ… | βœ… | βœ… | |
| LatentConsistencyModelImg2ImgPipeline | βœ… | βœ… | βœ… | βœ… | |
| LatentConsistencyModelPipeline | βœ… | βœ… | βœ… | βœ… | |
| LDMSuperResolutionPipeline | βœ… | βœ… | βœ… | βœ… | |
| LDMTextToImagePipeline | βœ… | βœ… | βœ… | βœ… | |
| PixArtAlphaPipeline | βœ… | βœ… | βœ… | βœ… | |
| ShapEImg2ImgPipeline | βœ… | βœ… | ❌ | ❌ | The syntax in Render only supports PyNative mode |
| ShapEPipeline | βœ… | βœ… | ❌ | ❌ | The syntax in Render only supports PyNative mode |
| StableCascadePipeline | ❌ | βœ… | ❌ | βœ… | This pipeline does not support FP16 due to precision issues |
| StableDiffusion3Pipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionAdapterPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionControlNetImg2ImgPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionControlNetInpaintPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionControlNetPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionDepth2ImgPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionDiffEditPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionGLIGENPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionGLIGENTextImagePipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionImageVariationPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionImg2ImgPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionInpaintPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionInstructPix2PixPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionLatentUpscalePipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionUpscalePipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionXLAdapterPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionXLControlNetImg2ImgPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionXLControlNetInpaintPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionXLControlNetPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionXLImg2ImgPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionXLInpaintPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionXLInstructPix2PixPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableDiffusionXLPipeline | βœ… | βœ… | βœ… | βœ… | |
| StableVideoDiffusionPipeline | βœ… | ❌ | βœ… | ❌ | This pipeline will run out of memory under FP32; ops.bmm and ops.softmax have precision issues under FP16, so we need to upcast them to FP32 to get a good result |
| UnCLIPImageVariationPipeline | βœ… | βœ… | βœ… | βœ… | |
| UnCLIPPipeline | βœ… | βœ… | βœ… | βœ… | |
| WuerstchenPipeline | βœ… | βœ… | βœ… | βœ… | GlobalResponseNorm has precision issues under FP16, so we need to upcast it to FP32 to get a good result |