# Limitations
Due to differences between the frameworks, some APIs and models will not be identical to huggingface/diffusers in the foreseeable future.
## APIs

### `xxx.from_pretrained`

- `torch_dtype` is renamed to `mindspore_dtype`.
- `device_map`, `max_memory`, `offload_folder`, `offload_state_dict`, and `low_cpu_mem_usage` will not be supported.
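For illustration, a minimal sketch of loading a pipeline with the renamed argument (the checkpoint id is just an example):

```python
import mindspore as ms
from mindone.diffusers import StableDiffusionPipeline

# torch_dtype=... in huggingface/diffusers becomes mindspore_dtype=... here;
# device_map, max_memory, offload_folder, offload_state_dict and
# low_cpu_mem_usage must not be passed.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative checkpoint id
    mindspore_dtype=ms.float16,
)
```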
### `BaseOutput`

- The default value of `return_dict` is changed to `False`, because `GRAPH_MODE` does not allow constructing an instance of it.
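Because tuples are returned by default, outputs are indexed positionally; a brief sketch, reusing the hypothetical `pipe` from the previous snippet:

```python
# return_dict defaults to False, so the call returns a plain tuple.
images = pipe("an astronaut riding a horse")[0]

# Outside GRAPH_MODE, return_dict=True can still be passed to get the
# BaseOutput-style object with named fields.
images = pipe("an astronaut riding a horse", return_dict=True).images
```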
### Output of `AutoencoderKL.encode`

Unlike the original output `posterior = DiagonalGaussianDistribution(latent)`, which can sample via `posterior.sample()`, we can only output the `latent` and then sample through `AutoencoderKL.diag_gauss_dist.sample(latent)`.
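A short sketch of the difference, assuming `vae` is an `AutoencoderKL` from mindone.diffusers and `image` is an already-preprocessed tensor:

```python
# huggingface/diffusers:
#   posterior = vae.encode(image).latent_dist
#   z = posterior.sample()

# mindone/diffusers: encode returns the latent parameters (in a tuple, since
# return_dict defaults to False); sampling goes through the class-level helper.
latent = vae.encode(image)[0]
z = vae.diag_gauss_dist.sample(latent)
```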
### `self.config` in `construct()`

For many models, parameters used in initialization are registered in `self.config`. In the original 🤗 diffusers they are often accessed during `construct`, e.g. using `if self.config.xxx == xxx` to determine execution paths. However, attribute access like this is not supported by MindSpore's static graph syntax. Two feasible replacement options (illustrated in the sketch at the end of this subsection) are:

- Set a new attribute on `self` in initialization, like `self.xxx = self.config.xxx`, then use `self.xxx` in `construct` instead.
- Use `self.config["xxx"]`, since `self.config` is an `OrderedDict` and item access like this is supported in static graph mode.

When `self.config.xxx` changes, we change both `self.xxx` and `self.config["xxx"]`.
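A minimal sketch of both options, assuming the mindone port keeps the diffusers-style `ConfigMixin`/`ModelMixin` layout (`MyBlock` and its `scale` argument are hypothetical):

```python
from mindone.diffusers.configuration_utils import ConfigMixin, register_to_config
from mindone.diffusers.models.modeling_utils import ModelMixin


class MyBlock(ModelMixin, ConfigMixin):
    @register_to_config
    def __init__(self, scale: float = 2.0):
        super().__init__()
        # Option 1: copy the config entry onto `self` during initialization.
        self.scale = self.config.scale

    def construct(self, x):
        # `if self.config.scale > 1:` would break under static graph syntax.
        if self.scale > 1:  # option 1: use the copied attribute
            x = x * self.scale
        # Option 2: item access on the config dict works in graph mode.
        return x * self.config["scale"]
```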
## Models
The table below shows, for each of these modules, whether it is currently supported in mindone/diffusers in PyNative FP16, PyNative FP32, Graph FP16, and Graph FP32 mode.
Names | Pynative FP16 | Pynative FP32 | Graph FP16 | Graph FP32 | Description |
---|---|---|---|---|---|
StableCascadeUNet | ❌ | ✅ | ❌ | ✅ | huggingface/diffusers outputs NaN when using float16. |
nn.Conv3d | ✅ | ❌ | ✅ | ❌ | FP32 is not supported on Ascend |
TemporalConvLayer | ✅ | ❌ | ✅ | ❌ | contains nn.Conv3d |
TemporalResnetBlock | ✅ | ❌ | ✅ | ❌ | contains nn.Conv3d |
SpatioTemporalResBlock | ✅ | ❌ | ✅ | ❌ | contains TemporalResnetBlock |
UNetMidBlock3DCrossAttn | ✅ | ❌ | ✅ | ❌ | contains TemporalConvLayer |
CrossAttnDownBlock3D | ✅ | ❌ | ✅ | ❌ | contains TemporalConvLayer |
DownBlock3D | ✅ | ❌ | ✅ | ❌ | contains TemporalConvLayer |
CrossAttnUpBlock3D | ✅ | ❌ | ✅ | ❌ | contains TemporalConvLayer |
UpBlock3D | ✅ | ❌ | ✅ | ❌ | contains TemporalConvLayer |
MidBlockTemporalDecoder | ✅ | ❌ | ✅ | ❌ | contains SpatioTemporalResBlock |
UpBlockTemporalDecoder | ✅ | ❌ | ✅ | ❌ | contains SpatioTemporalResBlock |
UNetMidBlockSpatioTemporal | ✅ | ❌ | ✅ | ❌ | contains SpatioTemporalResBlock |
DownBlockSpatioTemporal | ✅ | ❌ | ✅ | ❌ | contains SpatioTemporalResBlock |
CrossAttnDownBlockSpatioTemporal | ✅ | ❌ | ✅ | ❌ | contains SpatioTemporalResBlock |
UpBlockSpatioTemporal | ✅ | ❌ | ✅ | ❌ | contains SpatioTemporalResBlock |
CrossAttnUpBlockSpatioTemporal | ✅ | ❌ | ✅ | ❌ | contains SpatioTemporalResBlock |
TemporalDecoder | ✅ | ❌ | ✅ | ❌ | contains nn.Conv3d, MidBlockTemporalDecoder etc. |
UNet3DConditionModel | ✅ | ❌ | ✅ | ❌ | contains UNetMidBlock3DCrossAttn etc. |
I2VGenXLUNet | ✅ | ❌ | ✅ | ❌ | contains UNetMidBlock3DCrossAttn etc. |
AutoencoderKLTemporalDecoder | ✅ | ❌ | ✅ | ❌ | contains MidBlockTemporalDecoder etc. |
UNetSpatioTemporalConditionModel | ✅ | ❌ | ✅ | ❌ | contains UNetMidBlockSpatioTemporal etc. |
FirUpsample2D | ❌ | ✅ | ✅ | ✅ | ops.Conv2D has poor precision in FP16 and PyNative mode |
FirDownsample2D | ❌ | ✅ | ✅ | ✅ | ops.Conv2D has poor precision in FP16 and PyNative mode |
AttnSkipUpBlock2D | ❌ | ✅ | ✅ | ✅ | contains FirUpsample2D |
SkipUpBlock2D | ❌ | ✅ | ✅ | ✅ | contains FirUpsample2D |
AttnSkipDownBlock2D | ❌ | ✅ | ✅ | ✅ | contains FirDownsample2D |
SkipDownBlock2D | ❌ | ✅ | ✅ | ✅ | contains FirDownsample2D |
ResnetBlock2D (kernel='fir') | ❌ | ✅ | ✅ | ✅ | ops.Conv2D has poor precision in FP16 and PyNative mode |
CogVideoXTransformer3DModel | ✅ | ❌ | ✅ | ❌ | Uses FlashAttention, which only supports FP16/BF16 |
AutoencoderKLCogVideoX | ✅ | ❌ | ❌ | ❌ | Only FP16/BF16: contains nn.Conv3d as sub-cells; only PyNative: static graph syntax doesn't support the cache operations inside |
CogVideoXEncoder3D | ✅ | ❌ | ❌ | ❌ | Only FP16/BF16: contains nn.Conv3d as sub-cells; only PyNative: static graph syntax doesn't support the cache operations inside |
CogVideoXDecoder3D | ✅ | ❌ | ❌ | ❌ | Only FP16/BF16: contains nn.Conv3d as sub-cells; only PyNative: static graph syntax doesn't support the cache operations inside |
## Pipelines
The table below shows, for each of these pipelines, whether it is currently supported in mindone/diffusers on MindSpore 2.3.1 in PyNative FP16, PyNative FP32, Graph FP16, and Graph FP32 mode.
Because some pipelines have precision issues, the experiments in the table below default to upcasting GroupNorm to FP32 to avoid them.
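The upcast itself can be done generically; a minimal sketch (the helper name is ours, not a mindone/diffusers API):

```python
import mindspore as ms
import mindspore.nn as nn

def upcast_group_norm_to_fp32(net: nn.Cell) -> nn.Cell:
    """Leave the rest of the network as configured, but run GroupNorm in FP32."""
    for _, cell in net.cells_and_names():
        if isinstance(cell, nn.GroupNorm):
            cell.to_float(ms.float32)  # cast this cell's inputs/compute to FP32
    return net
```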
Pipelines | Pynative FP16 | Pynative FP32 | Graph FP16 | Graph FP32 | Description |
---|---|---|---|---|---|
AnimateDiffPipeline | ✅ | ✅ | ✅ | ✅ | |
AnimateDiffVideoToVideoPipeline | ✅ | ❌ | ✅ | ✅ | In FP32 and PyNative mode, this pipeline will run out of memory |
BlipDiffusionPipeline | ✅ | ✅ | ✅ | ✅ | |
ConsistencyModelPipeline | ✅ | ✅ | ✅ | ✅ | |
CogVideoXPipeline | ✅ | ❌ | ❌ | ❌ | Flash Attention used in the transformer and nn.Conv3d used in the VAE only support FP16/BF16; cache operations in the VAE are not supported by MindSpore graph mode |
CogVideoXImageToVideoPipeline | ✅ | ❌ | ❌ | ❌ | Flash Attention used in the transformer and nn.Conv3d used in the VAE only support FP16/BF16; cache operations in the VAE are not supported by MindSpore graph mode |
CogVideoXVideoToVideoPipeline | ✅ | ❌ | ❌ | ❌ | Flash Attention used in the transformer and nn.Conv3d used in the VAE only support FP16/BF16; cache operations in the VAE are not supported by MindSpore graph mode |
DDIMPipeline | ✅ | ✅ | ✅ | ✅ | |
DDPMPipeline | ✅ | ✅ | ✅ | ✅ | |
DiTPipeline | ✅ | ✅ | ✅ | ✅ | |
I2VGenXLPipeline | ✅ | ❌ | ✅ | ❌ | ops.bmm and ops.softmax have precision issues under FP16, so we need to upcast them to FP32 to get a good result; FP32 is not supported since I2VGenXLPipeline contains nn.Conv3d |
IFImg2ImgPipeline | ✅ | ✅ | ✅ | ✅ | |
IFImg2ImgSuperResolutionPipeline | ✅ | ✅ | ✅ | ✅ | |
IFInpaintingPipeline | ✅ | ✅ | ✅ | ✅ | |
IFInpaintingSuperResolutionPipeline | ✅ | ✅ | ✅ | ✅ | |
IFPipeline | ✅ | ✅ | ✅ | ✅ | |
IFSuperResolutionPipeline | ✅ | ✅ | ✅ | ✅ | |
Kandinsky3Img2ImgPipeline | ✅ | ❌ | ✅ | ❌ | Kandinsky3 only provides FP16 weights; additionally, T5 has precision issues, so to achieve the desired results, you need to directly input prompt_embeds and attention_mask. |
Kandinsky3Pipeline | ✅ | ❌ | ✅ | ❌ | Kandinsky3 only provides FP16 weights; additionally, T5 has precision issues, so to achieve the desired results, you need to directly input prompt_embeds and attention_mask. |
KandinskyImg2ImgPipeline | ✅ | ✅ | ✅ | ✅ | |
KandinskyInpaintPipeline | ✅ | ✅ | ✅ | ✅ | |
KandinskyPipeline | ✅ | ✅ | ✅ | ✅ | |
KandinskyV22ControlnetImg2ImgPipeline | ✅ | ✅ | ✅ | ✅ | |
KandinskyV22ControlnetPipeline | ✅ | ✅ | ✅ | ✅ | |
KandinskyV22Img2ImgPipeline | ✅ | ✅ | ✅ | ✅ | |
KandinskyV22InpaintPipeline | ✅ | ✅ | ✅ | ✅ | |
KandinskyV22Pipeline | ✅ | ✅ | ✅ | ✅ | |
LatentConsistencyModelImg2ImgPipeline | ✅ | ✅ | ✅ | ✅ | |
LatentConsistencyModelPipeline | ✅ | ✅ | ✅ | ✅ | |
LDMSuperResolutionPipeline | ✅ | ✅ | ✅ | ✅ | |
LDMTextToImagePipeline | ✅ | ✅ | ✅ | ✅ | |
PixArtAlphaPipeline | ✅ | ✅ | ✅ | ✅ | |
ShapEImg2ImgPipeline | ✅ | ✅ | ❌ | ❌ | The syntax in Render only supports PyNative mode |
ShapEPipeline | ✅ | ✅ | ❌ | ❌ | The syntax in Render only supports PyNative mode |
StableCascadePipeline | ❌ | ✅ | ❌ | ✅ | This pipeline does not support FP16 due to precision issues |
StableDiffusion3Pipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionAdapterPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionControlNetImg2ImgPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionControlNetInpaintPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionControlNetPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionDepth2ImgPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionDiffEditPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionGLIGENPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionGLIGENTextImagePipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionImageVariationPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionImg2ImgPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionInpaintPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionInstructPix2PixPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionLatentUpscalePipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionUpscalePipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionXLAdapterPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionXLControlNetImg2ImgPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionXLControlNetInpaintPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionXLControlNetPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionXLImg2ImgPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionXLInpaintPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionXLInstructPix2PixPipeline | ✅ | ✅ | ✅ | ✅ | |
StableDiffusionXLPipeline | ✅ | ✅ | ✅ | ✅ | |
StableVideoDiffusionPipeline | ✅ | ❌ | ✅ | ❌ | This pipeline will run out of memory under FP32; ops.bmm and ops.softmax have precision issues under FP16, so we need to upcast them to FP32 to get a good result |
UnCLIPImageVariationPipeline | ✅ | ✅ | ✅ | ✅ | |
UnCLIPPipeline | ✅ | ✅ | ✅ | ✅ | |
WuerstchenPipeline | ✅ | ✅ | ✅ | ✅ | GlobalResponseNorm has precision issues under FP16, so we need to upcast it to FP32 to get a good result |