LEDITS++

LEDITS++ was proposed in LEDITS++: Limitless Image Editing using Text-to-Image Models by Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, Apolinário Passos.

The abstract from the paper is:

Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from solely text inputs. Subsequent research efforts aim to exploit and apply their capabilities to real image editing. However, existing image-to-image methods are often inefficient, imprecise, and of limited versatility. They either require time-consuming fine-tuning, deviate unnecessarily strongly from the input image, and/or lack support for multiple, simultaneous edits. To address these issues, we introduce LEDITS++, an efficient yet versatile and precise textual image manipulation technique. LEDITS++'s novel inversion approach requires no tuning nor optimization and produces high-fidelity results with a few diffusion steps. Second, our methodology supports multiple simultaneous edits and is architecture-agnostic. Third, we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods. The project page is available at https://leditsplusplus-project.static.hf.space .

Tip

You can find additional information about LEDITS++ on the project page and try it out in a demo.

Tip

Due to some backward compatibility issues with the current diffusers implementation of [~schedulers.DPMSolverMultistepScheduler], this implementation of LEDITS++ can no longer guarantee perfect inversion. This issue is unlikely to have any noticeable effect on applied use cases. However, we provide an alternative implementation that guarantees perfect inversion in a dedicated GitHub repo.

We provide two distinct pipelines based on different pre-trained models.
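As a quick orientation, the snippet below sketches the typical invert-then-edit workflow with `LEditsPPPipelineStableDiffusion`. This is a minimal, untested sketch: the checkpoint name, the image URL, and the exact shape of the pipeline output are assumptions carried over from the upstream diffusers LEDITS++ example and may need adjusting for your setup (model weights are downloaded on first use).

```python
import mindspore as ms
from mindone.diffusers import LEditsPPPipelineStableDiffusion
from mindone.diffusers.utils import load_image

# Load a Stable Diffusion checkpoint into the LEDITS++ pipeline.
pipe = LEditsPPPipelineStableDiffusion.from_pretrained(
    "runwayml/stable-diffusion-v1-5", mindspore_dtype=ms.float16
)

# Step 1: invert the real image into the latent space. `skip` trades
# reconstruction fidelity against editability.
image = load_image("https://example.com/input.png").convert("RGB")  # hypothetical URL
_ = pipe.invert(image=image, num_inversion_steps=50, skip=0.1)

# Step 2: apply one or more simultaneous edits guided by text prompts.
# `edit_threshold` controls the implicit mask that limits changes to
# relevant image regions.
images = pipe(
    editing_prompt=["cherry blossom"],
    edit_guidance_scale=10.0,
    edit_threshold=0.75,
)[0]  # mindone pipelines typically return a plain tuple
edited_image = images[0]
```

Multiple edits can be combined by passing several entries in `editing_prompt` together with per-edit values for `edit_guidance_scale` and `reverse_editing_direction`.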

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusion

Bases: DiffusionPipeline, TextualInversionLoaderMixin, StableDiffusionLoraLoaderMixin, IPAdapterMixin, FromSingleFileMixin

Pipeline for textual image editing using LEDits++ with Stable Diffusion.

This model inherits from [DiffusionPipeline] and builds on the [StableDiffusionPipeline]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

PARAMETER DESCRIPTION
vae

Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.

TYPE: [`AutoencoderKL`]

text_encoder

Frozen text-encoder. Stable Diffusion uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.

TYPE: [`~transformers.CLIPTextModel`]

tokenizer

Tokenizer of class CLIPTokenizer.

TYPE: [`~transformers.CLIPTokenizer`]

unet

Conditional U-Net architecture to denoise the encoded image latents.

TYPE: [`UNet2DConditionModel`]

scheduler

A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of [DPMSolverMultistepScheduler] or [DDIMScheduler]. If any other scheduler is passed, it will automatically be set to [DPMSolverMultistepScheduler].

TYPE: [`DPMSolverMultistepScheduler`] or [`DDIMScheduler`]

safety_checker

Classification module that estimates whether generated images could be considered offensive or harmful. Please, refer to the model card for details.

TYPE: [`StableDiffusionSafetyChecker`]

feature_extractor

Model that extracts features from generated images to be used as inputs for the safety_checker.

TYPE: [`~transformers.CLIPImageProcessor`]
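The scheduler rule in the parameter list above can be illustrated with a small, self-contained sketch. The classes here are plain Python stand-ins, not the real schedulers: the point is only the coercion logic the constructor applies, where any unsupported scheduler is silently replaced by a DPM-Solver++ multistep scheduler.

```python
# Hypothetical stand-in classes to illustrate the scheduler coercion.
class DDIMScheduler:
    pass


class DPMSolverMultistepScheduler:
    pass


class EulerDiscreteScheduler:  # an example of an unsupported scheduler
    pass


def coerce_scheduler(scheduler):
    """Return the scheduler unchanged if supported, else swap in DPM-Solver++."""
    if isinstance(scheduler, (DDIMScheduler, DPMSolverMultistepScheduler)):
        return scheduler
    # Mirrors the pipeline behavior: unsupported schedulers are replaced
    # (the real pipeline also logs a warning and copies the config over).
    return DPMSolverMultistepScheduler()


# Supported schedulers pass through; anything else is swapped.
kept = coerce_scheduler(DDIMScheduler())
swapped = coerce_scheduler(EulerDiscreteScheduler())
```

In the real pipeline the replacement is built via `DPMSolverMultistepScheduler.from_config(scheduler.config, algorithm_type="sde-dpmsolver++", solver_order=2)`, so the original scheduler's configuration is preserved.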

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion.py
class LEditsPPPipelineStableDiffusion(
    DiffusionPipeline, TextualInversionLoaderMixin, StableDiffusionLoraLoaderMixin, IPAdapterMixin, FromSingleFileMixin
):
    """
    Pipeline for textual image editing using LEDits++ with Stable Diffusion.

    This model inherits from [`DiffusionPipeline`] and builds on the [`StableDiffusionPipeline`]. Check the superclass
    documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular
    device, etc.).

    Args:
        vae ([`AutoencoderKL`]):
            Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
        text_encoder ([`~transformers.CLIPTextModel`]):
            Frozen text-encoder. Stable Diffusion uses the text portion of
            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically
            the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant.
        tokenizer ([`~transformers.CLIPTokenizer`]):
            Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
        scheduler ([`DPMSolverMultistepScheduler`] or [`DDIMScheduler`]):
            A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
            [`DPMSolverMultistepScheduler`] or [`DDIMScheduler`]. If any other scheduler is passed it will
            automatically be set to [`DPMSolverMultistepScheduler`].
        safety_checker ([`StableDiffusionSafetyChecker`]):
            Classification module that estimates whether generated images could be considered offensive or harmful.
            Please, refer to the [model card](https://huggingface.co/CompVis/stable-diffusion-v1-4) for details.
        feature_extractor ([`~transformers.CLIPImageProcessor`]):
            Model that extracts features from generated images to be used as inputs for the `safety_checker`.
    """

    model_cpu_offload_seq = "text_encoder->unet->vae"
    _exclude_from_cpu_offload = ["safety_checker"]
    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds"]
    _optional_components = ["safety_checker", "feature_extractor", "image_encoder"]

    def __init__(
        self,
        vae: AutoencoderKL,
        text_encoder: CLIPTextModel,
        tokenizer: CLIPTokenizer,
        unet: UNet2DConditionModel,
        scheduler: Union[DDIMScheduler, DPMSolverMultistepScheduler],
        safety_checker: StableDiffusionSafetyChecker,
        feature_extractor: CLIPImageProcessor,
        requires_safety_checker: bool = True,
    ):
        super().__init__()

        if not isinstance(scheduler, DDIMScheduler) and not isinstance(scheduler, DPMSolverMultistepScheduler):
            scheduler = DPMSolverMultistepScheduler.from_config(
                scheduler.config, algorithm_type="sde-dpmsolver++", solver_order=2
            )
            logger.warning(
                "This pipeline only supports DDIMScheduler and DPMSolverMultistepScheduler. "
                "The scheduler has been changed to DPMSolverMultistepScheduler."
            )

        if scheduler is not None and getattr(scheduler.config, "steps_offset", 1) != 1:
            deprecation_message = (
                f"The configuration file of this scheduler: {scheduler} is outdated. `steps_offset`"
                f" should be set to 1 instead of {scheduler.config.steps_offset}. Please make sure "
                "to update the config accordingly as leaving `steps_offset` might lead to incorrect results"
                " in future versions. If you have downloaded this checkpoint from the Hugging Face Hub,"
                " it would be very nice if you could open a Pull request for the `scheduler/scheduler_config.json`"
                " file"
            )
            deprecate("steps_offset!=1", "1.0.0", deprecation_message, standard_warn=False)
            new_config = dict(scheduler.config)
            new_config["steps_offset"] = 1
            scheduler._internal_dict = FrozenDict(new_config)

        if scheduler is not None and getattr(scheduler.config, "clip_sample", False) is True:
            deprecation_message = (
                f"The configuration file of this scheduler: {scheduler} has not set the configuration `clip_sample`."
                " `clip_sample` should be set to False in the configuration file. Please make sure to update the"
                " config accordingly as not setting `clip_sample` in the config might lead to incorrect results in"
                " future versions. If you have downloaded this checkpoint from the Hugging Face Hub, it would be very"
                " nice if you could open a Pull request for the `scheduler/scheduler_config.json` file"
            )
            deprecate("clip_sample not set", "1.0.0", deprecation_message, standard_warn=False)
            new_config = dict(scheduler.config)
            new_config["clip_sample"] = False
            scheduler._internal_dict = FrozenDict(new_config)

        if safety_checker is None and requires_safety_checker:
            logger.warning(
                f"You have disabled the safety checker for {self.__class__} by passing `safety_checker=None`. Ensure"
                " that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered"
                " results in services or applications open to the public. Both the diffusers team and Hugging Face"
                " strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling"
                " it only for use-cases that involve analyzing network behavior or auditing its results. For more"
                " information, please have a look at https://github.com/huggingface/diffusers/pull/254 ."
            )

        if safety_checker is not None and feature_extractor is None:
            raise ValueError(
                f"Make sure to define a feature extractor when loading {self.__class__} if you want to use the safety"
                " checker. If you do not want to use the safety checker, you can pass `'safety_checker=None'` instead."
            )

        is_unet_version_less_0_9_0 = (
            unet is not None
            and hasattr(unet.config, "_diffusers_version")
            and version.parse(version.parse(unet.config._diffusers_version).base_version) < version.parse("0.9.0.dev0")
        )
        is_unet_sample_size_less_64 = (
            unet is not None and hasattr(unet.config, "sample_size") and unet.config.sample_size < 64
        )
        if is_unet_version_less_0_9_0 and is_unet_sample_size_less_64:
            deprecation_message = (
                "The configuration file of the unet has set the default `sample_size` to smaller than"
                " 64 which seems highly unlikely. If your checkpoint is a fine-tuned version of any of the"
                " following: \n- CompVis/stable-diffusion-v1-4 \n- CompVis/stable-diffusion-v1-3 \n-"
                " CompVis/stable-diffusion-v1-2 \n- CompVis/stable-diffusion-v1-1 \n- runwayml/stable-diffusion-v1-5"
                " \n- runwayml/stable-diffusion-inpainting \n you should change 'sample_size' to 64 in the"
                " configuration file. Please make sure to update the config accordingly as leaving `sample_size=32`"
                " in the config might lead to incorrect results in future versions. If you have downloaded this"
                " checkpoint from the Hugging Face Hub, it would be very nice if you could open a Pull request for"
                " the `unet/config.json` file"
            )
            deprecate("sample_size<64", "1.0.0", deprecation_message, standard_warn=False)
            new_config = dict(unet.config)
            new_config["sample_size"] = 64
            unet._internal_dict = FrozenDict(new_config)

        self.register_modules(
            vae=vae,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            unet=unet,
            scheduler=scheduler,
            safety_checker=safety_checker,
            feature_extractor=feature_extractor,
        )
        self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) if getattr(self, "vae", None) else 8
        self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)
        self.register_to_config(requires_safety_checker=requires_safety_checker)

        self.inversion_steps = None

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.run_safety_checker
    def run_safety_checker(self, image, dtype):
        if self.safety_checker is None:
            has_nsfw_concept = None
        else:
            if ops.is_tensor(image):
                feature_extractor_input = self.image_processor.postprocess(image, output_type="pil")
            else:
                feature_extractor_input = self.image_processor.numpy_to_pil(image)
            safety_checker_input = self.feature_extractor(feature_extractor_input, return_tensors="np")
            image, has_nsfw_concept = self.safety_checker(
                images=image, clip_input=ms.tensor(safety_checker_input.pixel_values).to(dtype)
            )
        return image, has_nsfw_concept

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.decode_latents
    def decode_latents(self, latents):
        deprecation_message = "The decode_latents method is deprecated and will be removed in 1.0.0. Please use VaeImageProcessor.postprocess(...) instead"
        deprecate("decode_latents", "1.0.0", deprecation_message, standard_warn=False)

        latents = 1 / self.vae.config.scaling_factor * latents
        image = self.vae.decode(latents, return_dict=False)[0]
        image = (image / 2 + 0.5).clamp(0, 1)
        # we always cast to float32 as this does not cause significant overhead and is compatible with bfloat16
        image = image.permute(0, 2, 3, 1).float().numpy()
        return image

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
    def prepare_extra_step_kwargs(self, eta, generator=None):
        # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
        # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
        # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
        # and should be between [0, 1]

        accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
        extra_step_kwargs = {}
        if accepts_eta:
            extra_step_kwargs["eta"] = eta

        # check if the scheduler accepts generator
        accepts_generator = "generator" in set(inspect.signature(self.scheduler.step).parameters.keys())
        if accepts_generator:
            extra_step_kwargs["generator"] = generator
        return extra_step_kwargs

    # Modified from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.check_inputs
    def check_inputs(
        self,
        negative_prompt=None,
        editing_prompt_embeddings=None,
        negative_prompt_embeds=None,
        callback_on_step_end_tensor_inputs=None,
    ):
        if callback_on_step_end_tensor_inputs is not None and not all(
            k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
        ):
            raise ValueError(
                f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"  # noqa: E501
            )
        if negative_prompt is not None and negative_prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
            )

        if editing_prompt_embeddings is not None and negative_prompt_embeds is not None:
            if editing_prompt_embeddings.shape != negative_prompt_embeds.shape:
                raise ValueError(
                    "`editing_prompt_embeddings` and `negative_prompt_embeds` must have the same shape when passed directly, but"
                    f" got: `editing_prompt_embeddings` {editing_prompt_embeddings.shape} != `negative_prompt_embeds`"
                    f" {negative_prompt_embeds.shape}."
                )

    # Modified from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_latents
    def prepare_latents(self, batch_size, num_channels_latents, height, width, dtype, latents):
        # shape = (batch_size, num_channels_latents, height // self.vae_scale_factor, width // self.vae_scale_factor)

        # if latents.shape != shape:
        #    raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")

        latents = latents

        # scale the initial noise by the standard deviation required by the scheduler
        latents = latents * self.scheduler.init_noise_sigma
        return latents

    def prepare_unet(self, attention_store, PnP: bool = False):
        attn_procs = {}
        for name in self.unet.attn_processors.keys():
            if name.startswith("mid_block"):
                place_in_unet = "mid"
            elif name.startswith("up_blocks"):
                place_in_unet = "up"
            elif name.startswith("down_blocks"):
                place_in_unet = "down"
            else:
                continue

            if "attn2" in name and place_in_unet != "mid":
                attn_procs[name] = LEDITSCrossAttnProcessor(
                    attention_store=attention_store,
                    place_in_unet=place_in_unet,
                    pnp=PnP,
                    editing_prompts=self.enabled_editing_prompts,
                )
            else:
                attn_procs[name] = AttnProcessor()

        self.unet.set_attn_processor(attn_procs)

    def encode_prompt(
        self,
        num_images_per_prompt,
        enable_edit_guidance,
        negative_prompt=None,
        editing_prompt=None,
        negative_prompt_embeds: Optional[ms.Tensor] = None,
        editing_prompt_embeds: Optional[ms.Tensor] = None,
        lora_scale: Optional[float] = None,
        clip_skip: Optional[int] = None,
    ):
        r"""
        Encodes the prompt into text encoder hidden states.

        Args:
            num_images_per_prompt (`int`):
                number of images that should be generated per prompt
            enable_edit_guidance (`bool`):
                whether to perform any editing or reconstruct the input image instead
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation. If not defined, one has to pass
                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
                less than `1`).
            editing_prompt (`str` or `List[str]`, *optional*):
                Editing prompt(s) to be encoded. If not defined, one has to pass `editing_prompt_embeds` instead.
            editing_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
                provided, text embeddings will be generated from `prompt` input argument.
            negative_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
                argument.
            lora_scale (`float`, *optional*):
                A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
            clip_skip (`int`, *optional*):
                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
                the output of the pre-final layer will be used for computing the prompt embeddings.
        """
        # set lora scale so that monkey patched LoRA
        # function of text encoder can correctly access it
        if lora_scale is not None and isinstance(self, StableDiffusionLoraLoaderMixin):
            self._lora_scale = lora_scale

            # dynamically adjust the LoRA scale
            scale_lora_layers(self.text_encoder, lora_scale)

        batch_size = self.batch_size
        num_edit_tokens = None

        if negative_prompt_embeds is None:
            uncond_tokens: List[str]
            if negative_prompt is None:
                uncond_tokens = [""] * batch_size
            elif isinstance(negative_prompt, str):
                uncond_tokens = [negative_prompt]
            elif batch_size != len(negative_prompt):
                raise ValueError(
                    f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but expected"
                    f" {batch_size} based on the input images. Please make sure that passed `negative_prompt` matches"
                    " the batch size of `prompt`."
                )
            else:
                uncond_tokens = negative_prompt

            # textual inversion: process multi-vector tokens if necessary
            if isinstance(self, TextualInversionLoaderMixin):
                uncond_tokens = self.maybe_convert_prompt(uncond_tokens, self.tokenizer)

            uncond_input = self.tokenizer(
                uncond_tokens,
                padding="max_length",
                max_length=self.tokenizer.model_max_length,
                truncation=True,
                return_tensors="np",
            )

            if hasattr(self.text_encoder.config, "use_attention_mask") and self.text_encoder.config.use_attention_mask:
                attention_mask = ms.tensor(uncond_input.attention_mask)
            else:
                attention_mask = None

            negative_prompt_embeds = self.text_encoder(
                ms.tensor(uncond_input.input_ids),
                attention_mask=attention_mask,
            )
            negative_prompt_embeds = negative_prompt_embeds[0]

        if self.text_encoder is not None:
            prompt_embeds_dtype = self.text_encoder.dtype
        elif self.unet is not None:
            prompt_embeds_dtype = self.unet.dtype
        else:
            prompt_embeds_dtype = negative_prompt_embeds.dtype

        negative_prompt_embeds = negative_prompt_embeds.to(dtype=prompt_embeds_dtype)

        if enable_edit_guidance:
            if editing_prompt_embeds is None:
                # textual inversion: process multi-vector tokens if necessary
                # if isinstance(self, TextualInversionLoaderMixin):
                #    prompt = self.maybe_convert_prompt(prompt, self.tokenizer)
                if isinstance(editing_prompt, str):
                    editing_prompt = [editing_prompt]

                max_length = negative_prompt_embeds.shape[1]
                text_inputs = self.tokenizer(
                    [x for item in editing_prompt for x in repeat(item, batch_size)],
                    padding="max_length",
                    max_length=max_length,
                    truncation=True,
                    return_tensors="np",
                    return_length=True,
                )

                num_edit_tokens = text_inputs.length - 2  # not counting startoftext and endoftext
                text_input_ids = text_inputs.input_ids
                untruncated_ids = self.tokenizer(
                    [x for item in editing_prompt for x in repeat(item, batch_size)],
                    padding="longest",
                    return_tensors="np",
                ).input_ids

                if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not np.array_equal(
                    text_input_ids, untruncated_ids
                ):
                    removed_text = self.tokenizer.batch_decode(
                        untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1]
                    )
                    logger.warning(
                        "The following part of your input was truncated because CLIP can only handle sequences up to"
                        f" {self.tokenizer.model_max_length} tokens: {removed_text}"
                    )

                if (
                    hasattr(self.text_encoder.config, "use_attention_mask")
                    and self.text_encoder.config.use_attention_mask
                ):
                    attention_mask = ms.tensor(text_inputs.attention_mask)
                else:
                    attention_mask = None

                text_input_ids = ms.tensor(text_input_ids)
                if clip_skip is None:
                    editing_prompt_embeds = self.text_encoder(text_input_ids, attention_mask=attention_mask)
                    editing_prompt_embeds = editing_prompt_embeds[0]
                else:
                    editing_prompt_embeds = self.text_encoder(
                        text_input_ids, attention_mask=attention_mask, output_hidden_states=True
                    )
                    # Access the `hidden_states` first, that contains a tuple of
                    # all the hidden states from the encoder layers. Then index into
                    # the tuple to access the hidden states from the desired layer.
                    editing_prompt_embeds = editing_prompt_embeds[-1][-(clip_skip + 1)]
                    # We also need to apply the final LayerNorm here to not mess with the
                    # representations. The `last_hidden_states` that we typically use for
                    # obtaining the final prompt representations passes through the LayerNorm
                    # layer.
                    editing_prompt_embeds = self.text_encoder.text_model.final_layer_norm(editing_prompt_embeds)

            editing_prompt_embeds = editing_prompt_embeds.to(dtype=negative_prompt_embeds.dtype)

            bs_embed_edit, seq_len, _ = editing_prompt_embeds.shape
            editing_prompt_embeds = editing_prompt_embeds.tile((1, num_images_per_prompt, 1))
            editing_prompt_embeds = editing_prompt_embeds.view(bs_embed_edit * num_images_per_prompt, seq_len, -1)

        # get unconditional embeddings for classifier free guidance

        # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
        seq_len = negative_prompt_embeds.shape[1]

        negative_prompt_embeds = negative_prompt_embeds.to(dtype=prompt_embeds_dtype)

        negative_prompt_embeds = negative_prompt_embeds.tile((1, num_images_per_prompt, 1))
        negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)

        if isinstance(self, StableDiffusionLoraLoaderMixin):
            # Retrieve the original scale by scaling back the LoRA layers
            unscale_lora_layers(self.text_encoder, lora_scale)

        return editing_prompt_embeds, negative_prompt_embeds, num_edit_tokens

    @property
    def guidance_rescale(self):
        return self._guidance_rescale

    @property
    def clip_skip(self):
        return self._clip_skip

    @property
    def cross_attention_kwargs(self):
        return self._cross_attention_kwargs

    def enable_vae_slicing(self):
        r"""
        Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
        compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
        """
        self.vae.enable_slicing()

    def disable_vae_slicing(self):
        r"""
        Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
        computing decoding in one step.
        """
        self.vae.disable_slicing()

    def enable_vae_tiling(self):
        r"""
        Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
        compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
        processing larger images.
        """
        self.vae.enable_tiling()

    def disable_vae_tiling(self):
        r"""
        Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
        computing decoding in one step.
        """
        self.vae.disable_tiling()

    def __call__(
        self,
        negative_prompt: Optional[Union[str, List[str]]] = None,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
        editing_prompt: Optional[Union[str, List[str]]] = None,
        editing_prompt_embeds: Optional[ms.Tensor] = None,
        negative_prompt_embeds: Optional[ms.Tensor] = None,
        reverse_editing_direction: Optional[Union[bool, List[bool]]] = False,
        edit_guidance_scale: Optional[Union[float, List[float]]] = 5,
        edit_warmup_steps: Optional[Union[int, List[int]]] = 0,
        edit_cooldown_steps: Optional[Union[int, List[int]]] = None,
        edit_threshold: Optional[Union[float, List[float]]] = 0.9,
        user_mask: Optional[ms.Tensor] = None,
        sem_guidance: Optional[List[ms.Tensor]] = None,
        use_cross_attn_mask: bool = False,
        use_intersect_mask: bool = True,
        attn_store_steps: Optional[List[int]] = [],
        store_averaged_over_steps: bool = True,
        cross_attention_kwargs: Optional[Dict[str, Any]] = None,
        guidance_rescale: float = 0.0,
        clip_skip: Optional[int] = None,
        callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
        **kwargs,
    ):
        r"""
        The call function to the pipeline for editing. The
        [`~pipelines.ledits_pp.LEditsPPPipelineStableDiffusion.invert`] method has to be called beforehand. Edits will
        always be performed for the last inverted image(s).

        Args:
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
                if `guidance_scale` is less than `1`).
            generator (`np.random.Generator`, *optional*):
                One or a list of [np.random.Generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
                to make generation deterministic.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between
                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.ledits_pp.LEditsPPDiffusionPipelineOutput`] instead of a plain
                tuple.
            editing_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts to guide the image generation. The image is reconstructed by setting
                `editing_prompt = None`. Guidance direction of prompt should be specified via
                `reverse_editing_direction`.
            editing_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-computed embeddings to use for guiding the image generation. Guidance direction of embedding should
                be specified via `reverse_editing_direction`.
            negative_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
                not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
            reverse_editing_direction (`bool` or `List[bool]`, *optional*, defaults to `False`):
                Whether the corresponding prompt in `editing_prompt` should be increased or decreased.
            edit_guidance_scale (`float` or `List[float]`, *optional*, defaults to 5):
                Guidance scale for guiding the image generation. If provided as a list, values should correspond to
                `editing_prompt`. `edit_guidance_scale` is defined as `s_e` of equation 12 of [LEDITS++
                Paper](https://arxiv.org/abs/2311.16711).
            edit_warmup_steps (`int` or `List[int]`, *optional*, defaults to 0):
                Number of diffusion steps (for each prompt) for which guidance will not be applied.
            edit_cooldown_steps (`int` or `List[int]`, *optional*, defaults to `None`):
                Number of diffusion steps (for each prompt) after which guidance will no longer be applied.
            edit_threshold (`float` or `List[float]`, *optional*, defaults to 0.9):
                Masking threshold of guidance. Threshold should be proportional to the image region that is modified.
                `edit_threshold` is defined as `λ` of equation 12 of [LEDITS++
                Paper](https://arxiv.org/abs/2311.16711).
            user_mask (`ms.Tensor`, *optional*):
                User-provided mask for even better control over the editing process. This is helpful when LEDITS++'s
                implicit masks do not meet user preferences.
            sem_guidance (`List[ms.Tensor]`, *optional*):
                List of pre-generated guidance vectors to be applied at generation. Length of the list has to
                correspond to `num_inference_steps`.
            use_cross_attn_mask (`bool`, defaults to `False`):
                Whether cross-attention masks are used. Cross-attention masks are always used when `use_intersect_mask`
                is set to `True`. Cross-attention masks are defined as `M^1` of equation 12 of [LEDITS++
                paper](https://arxiv.org/pdf/2311.16711.pdf).
            use_intersect_mask (`bool`, defaults to `True`):
                Whether the masking term is calculated as intersection of cross-attention masks and masks derived from
                the noise estimate. Cross-attention masks are defined as `M^1` and masks derived from the noise
                estimate are defined as `M^2` of equation 12 of [LEDITS++ paper](https://arxiv.org/pdf/2311.16711.pdf).
            attn_store_steps (`List[int]`, *optional*):
                Steps for which the attention maps are stored in the AttentionStore. Just for visualization purposes.
            store_averaged_over_steps (`bool`, defaults to `True`):
                Whether the attention maps for the `attn_store_steps` are stored averaged over the diffusion steps. If
                `False`, attention maps for each step are stored separately. Just for visualization purposes.
            cross_attention_kwargs (`dict`, *optional*):
                A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
                [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
            guidance_rescale (`float`, *optional*, defaults to 0.0):
                Guidance rescale factor from [Common Diffusion Noise Schedules and Sample Steps are
                Flawed](https://arxiv.org/pdf/2305.08891.pdf). Guidance rescale factor should fix overexposure when
                using zero terminal SNR.
            clip_skip (`int`, *optional*):
                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
                the output of the pre-final layer will be used for computing the prompt embeddings.
            callback_on_step_end (`Callable`, *optional*):
                A function that is called at the end of each denoising step during inference. The function is called
                with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
                `callback_on_step_end_tensor_inputs`.
            callback_on_step_end_tensor_inputs (`List`, *optional*):
                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
                `._callback_tensor_inputs` attribute of your pipeline class.

        Examples:
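
            A minimal, illustrative usage sketch. The `mindone.diffusers` import path, the `mindspore_dtype` keyword,
            and the checkpoint and image URL are assumptions and may differ in your setup:

            ```py
            >>> import mindspore as ms
            >>> from mindone.diffusers import LEditsPPPipelineStableDiffusion
            >>> from mindone.diffusers.utils import load_image

            >>> pipe = LEditsPPPipelineStableDiffusion.from_pretrained(
            ...     "runwayml/stable-diffusion-v1-5", mindspore_dtype=ms.float16
            ... )

            >>> img_url = "https://www.aiml.informatik.tu-darmstadt.de/people/mbrack/cherry_blossom.png"
            >>> image = load_image(img_url).convert("RGB")

            >>> # Invert the input image first; edits are always applied to the last inverted image(s).
            >>> _ = pipe.invert(image=image, num_inversion_steps=50, skip=0.1)

            >>> # `return_dict` defaults to False here, so the call returns a tuple whose first element
            >>> # is the list of generated images.
            >>> edited_image = pipe(
            ...     editing_prompt=["cherry blossom"], edit_guidance_scale=10.0, edit_threshold=0.75
            ... )[0][0]
            ```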

        Returns:
            [`~pipelines.ledits_pp.LEditsPPDiffusionPipelineOutput`] or `tuple`:
            [`~pipelines.ledits_pp.LEditsPPDiffusionPipelineOutput`] if `return_dict` is `True`, otherwise a `tuple`.
            When returning a tuple, the first element is a list with the generated images, and the second element is a
            list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work"
            (nsfw) content, according to the `safety_checker`.
        """

        if self.inversion_steps is None:
            raise ValueError(
                "You need to invert an input image first before calling the pipeline. The `invert` method has to be called beforehand. Edits will always be performed for the last inverted image(s)."  # noqa: E501
            )

        eta = self.eta
        num_images_per_prompt = 1
        latents = self.init_latents

        zs = self.zs
        self.scheduler.set_timesteps(len(self.scheduler.timesteps))

        if use_intersect_mask:
            use_cross_attn_mask = True

        if use_cross_attn_mask:
            self.smoothing = LeditsGaussianSmoothing()

        org_prompt = ""

        # 1. Check inputs. Raise error if not correct
        self.check_inputs(
            negative_prompt,
            editing_prompt_embeds,
            negative_prompt_embeds,
            callback_on_step_end_tensor_inputs,
        )

        self._guidance_rescale = guidance_rescale
        self._clip_skip = clip_skip
        self._cross_attention_kwargs = cross_attention_kwargs

        # 2. Define call parameters
        batch_size = self.batch_size

        if editing_prompt:
            enable_edit_guidance = True
            if isinstance(editing_prompt, str):
                editing_prompt = [editing_prompt]
            self.enabled_editing_prompts = len(editing_prompt)
        elif editing_prompt_embeds is not None:
            enable_edit_guidance = True
            self.enabled_editing_prompts = editing_prompt_embeds.shape[0]
        else:
            self.enabled_editing_prompts = 0
            enable_edit_guidance = False

        # 3. Encode input prompt
        lora_scale = self.cross_attention_kwargs.get("scale", None) if self.cross_attention_kwargs is not None else None

        edit_concepts, uncond_embeddings, num_edit_tokens = self.encode_prompt(
            editing_prompt=editing_prompt,
            num_images_per_prompt=num_images_per_prompt,
            enable_edit_guidance=enable_edit_guidance,
            negative_prompt=negative_prompt,
            editing_prompt_embeds=editing_prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
            lora_scale=lora_scale,
            clip_skip=self.clip_skip,
        )

        # For classifier free guidance, we need to do two forward passes.
        # Here we concatenate the unconditional and text embeddings into a single batch
        # to avoid doing two forward passes
        if enable_edit_guidance:
            text_embeddings = mint.cat([uncond_embeddings, edit_concepts])
            self.text_cross_attention_maps = [editing_prompt] if isinstance(editing_prompt, str) else editing_prompt
        else:
            text_embeddings = mint.cat([uncond_embeddings])

        # 4. Prepare timesteps
        # self.scheduler.set_timesteps(num_inference_steps)
        timesteps = self.inversion_steps
        t_to_idx = {int(v): k for k, v in enumerate(timesteps[-zs.shape[0] :])}

        if use_cross_attn_mask:
            self.attention_store = LeditsAttentionStore(
                average=store_averaged_over_steps,
                batch_size=batch_size,
                max_size=(latents.shape[-2] / 4.0) * (latents.shape[-1] / 4.0),
                max_resolution=None,
            )
            self.prepare_unet(self.attention_store, PnP=False)
            resolution = latents.shape[-2:]
            att_res = (int(resolution[0] / 4), int(resolution[1] / 4))

        # 5. Prepare latent variables
        num_channels_latents = self.unet.config.in_channels
        latents = self.prepare_latents(
            batch_size * num_images_per_prompt,
            num_channels_latents,
            None,
            None,
            text_embeddings.dtype,
            latents,
        )

        # 6. Prepare extra step kwargs.
        extra_step_kwargs = self.prepare_extra_step_kwargs(eta)

        self.sem_guidance = None
        self.activation_mask = None

        # 7. Denoising loop
        num_warmup_steps = 0
        with pynative_context():
            with self.progress_bar(total=len(timesteps)) as progress_bar:
                for i, t in enumerate(timesteps):
                    # expand the latents if we are doing classifier free guidance

                    if enable_edit_guidance:
                        latent_model_input = mint.cat([latents] * (1 + self.enabled_editing_prompts))
                    else:
                        latent_model_input = latents

                    latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

                    text_embed_input = text_embeddings

                    # predict the noise residual
                    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embed_input)[0]

                    noise_pred_out = noise_pred.chunk(1 + self.enabled_editing_prompts)  # [b,4, 64, 64]
                    noise_pred_uncond = noise_pred_out[0]
                    noise_pred_edit_concepts = noise_pred_out[1:]

                    noise_guidance_edit = mint.zeros(
                        noise_pred_uncond.shape,
                        dtype=noise_pred_uncond.dtype,
                    )

                    if sem_guidance is not None and len(sem_guidance) > i:
                        noise_guidance_edit += sem_guidance[i]

                    elif enable_edit_guidance:
                        if self.activation_mask is None:
                            self.activation_mask = mint.zeros(
                                (len(timesteps), len(noise_pred_edit_concepts), *noise_pred_edit_concepts[0].shape)
                            )

                        if self.sem_guidance is None:
                            self.sem_guidance = mint.zeros((len(timesteps), *noise_pred_uncond.shape))

                        for c, noise_pred_edit_concept in enumerate(noise_pred_edit_concepts):
                            if isinstance(edit_warmup_steps, list):
                                edit_warmup_steps_c = edit_warmup_steps[c]
                            else:
                                edit_warmup_steps_c = edit_warmup_steps
                            if i < edit_warmup_steps_c:
                                continue

                            if isinstance(edit_guidance_scale, list):
                                edit_guidance_scale_c = edit_guidance_scale[c]
                            else:
                                edit_guidance_scale_c = edit_guidance_scale

                            if isinstance(edit_threshold, list):
                                edit_threshold_c = edit_threshold[c]
                            else:
                                edit_threshold_c = edit_threshold
                            if isinstance(reverse_editing_direction, list):
                                reverse_editing_direction_c = reverse_editing_direction[c]
                            else:
                                reverse_editing_direction_c = reverse_editing_direction

                            if isinstance(edit_cooldown_steps, list):
                                edit_cooldown_steps_c = edit_cooldown_steps[c]
                            elif edit_cooldown_steps is None:
                                edit_cooldown_steps_c = i + 1
                            else:
                                edit_cooldown_steps_c = edit_cooldown_steps

                            if i >= edit_cooldown_steps_c:
                                continue

                            noise_guidance_edit_tmp = noise_pred_edit_concept - noise_pred_uncond

                            if reverse_editing_direction_c:
                                noise_guidance_edit_tmp = noise_guidance_edit_tmp * -1

                            noise_guidance_edit_tmp = noise_guidance_edit_tmp * edit_guidance_scale_c

                            if user_mask is not None:
                                noise_guidance_edit_tmp = noise_guidance_edit_tmp * user_mask

                            if use_cross_attn_mask:
                                out = self.attention_store.aggregate_attention(
                                    attention_maps=self.attention_store.step_store,
                                    prompts=self.text_cross_attention_maps,
                                    res=att_res,
                                    from_where=["up", "down"],
                                    is_cross=True,
                                    select=self.text_cross_attention_maps.index(editing_prompt[c]),
                                )
                                attn_map = out[:, :, :, 1 : 1 + num_edit_tokens[c].item()]  # 0 -> startoftext

                                # average over all tokens
                                if attn_map.shape[3] != num_edit_tokens[c]:
                                    raise ValueError(
                                        f"Incorrect shape of attention_map. Expected size {num_edit_tokens[c]}, but found {attn_map.shape[3]}!"
                                    )

                                attn_map = mint.sum(attn_map, dim=3)

                                # gaussian_smoothing
                                attn_map = mint.nn.functional.pad(attn_map.unsqueeze(1), (1, 1, 1, 1), mode="reflect")
                                attn_map = self.smoothing(attn_map).squeeze(1)

                                # ops.quantile function expects float32
                                # TODO: ops.quantile is not supported
                                if attn_map.dtype == ms.float32:
                                    tmp = ms.tensor(
                                        np.quantile(attn_map.flatten(start_dim=1).numpy(), edit_threshold_c, axis=1)
                                    )
                                else:
                                    tmp = ms.tensor(
                                        np.quantile(
                                            attn_map.flatten(start_dim=1).to(ms.float32).numpy(),
                                            edit_threshold_c,
                                            axis=1,
                                        )
                                    ).to(attn_map.dtype)
                                attn_mask = mint.where(
                                    attn_map >= tmp.unsqueeze(1).unsqueeze(1).tile((1, *att_res)), 1.0, 0.0
                                )

                                # resolution must match latent space dimension
                                attn_mask = mint.nn.functional.interpolate(
                                    attn_mask.unsqueeze(1),
                                    noise_guidance_edit_tmp.shape[-2:],  # 64,64
                                ).tile((1, 4, 1, 1))
                                self.activation_mask[i, c] = ops.stop_gradient(attn_mask)
                                if not use_intersect_mask:
                                    noise_guidance_edit_tmp = noise_guidance_edit_tmp * attn_mask

                            if use_intersect_mask:
                                if t <= 800:
                                    noise_guidance_edit_tmp_quantile = mint.abs(noise_guidance_edit_tmp)
                                    noise_guidance_edit_tmp_quantile = mint.sum(
                                        noise_guidance_edit_tmp_quantile, dim=1, keepdim=True
                                    )
                                    noise_guidance_edit_tmp_quantile = noise_guidance_edit_tmp_quantile.tile(
                                        (1, self.unet.config.in_channels, 1, 1)
                                    )

                                    # ops.quantile function expects float32
                                    # TODO: ops.quantile is not supported
                                    if noise_guidance_edit_tmp_quantile.dtype == ms.float32:
                                        tmp = ms.tensor(
                                            np.quantile(
                                                noise_guidance_edit_tmp_quantile.flatten(start_dim=2).numpy(),
                                                edit_threshold_c,
                                                axis=2,
                                                keepdims=False,
                                            )
                                        )
                                    else:
                                        tmp = ms.tensor(
                                            np.quantile(
                                                noise_guidance_edit_tmp_quantile.flatten(start_dim=2)
                                                .to(ms.float32)
                                                .numpy(),
                                                edit_threshold_c,
                                                axis=2,
                                                keepdims=False,
                                            )
                                        ).to(noise_guidance_edit_tmp_quantile.dtype)

                                    intersect_mask = (
                                        mint.where(
                                            noise_guidance_edit_tmp_quantile >= tmp[:, :, None, None],
                                            mint.ones_like(noise_guidance_edit_tmp),
                                            mint.zeros_like(noise_guidance_edit_tmp),
                                        )
                                        * attn_mask
                                    )

                                    self.activation_mask[i, c] = ops.stop_gradient(intersect_mask)

                                    noise_guidance_edit_tmp = noise_guidance_edit_tmp * intersect_mask

                                else:
                                    # print(f"only attention mask for step {i}")
                                    noise_guidance_edit_tmp = noise_guidance_edit_tmp * attn_mask

                            elif not use_cross_attn_mask:
                                # calculate quantile
                                noise_guidance_edit_tmp_quantile = mint.abs(noise_guidance_edit_tmp)
                                noise_guidance_edit_tmp_quantile = mint.sum(
                                    noise_guidance_edit_tmp_quantile, dim=1, keepdim=True
                                )
                                noise_guidance_edit_tmp_quantile = noise_guidance_edit_tmp_quantile.tile((1, 4, 1, 1))

                                # ops.quantile function expects float32
                                # TODO: ops.quantile is not supported
                                if noise_guidance_edit_tmp_quantile.dtype == ms.float32:
                                    tmp = ms.tensor(
                                        np.quantile(
                                            noise_guidance_edit_tmp_quantile.flatten(start_dim=2).numpy(),
                                            edit_threshold_c,
                                            axis=2,
                                            keepdims=False,
                                        )
                                    )
                                else:
                                    tmp = ms.tensor(
                                        np.quantile(
                                            noise_guidance_edit_tmp_quantile.flatten(start_dim=2)
                                            .to(ms.float32)
                                            .numpy(),
                                            edit_threshold_c,
                                            axis=2,
                                            keepdims=False,
                                        )
                                    ).to(noise_guidance_edit_tmp_quantile.dtype)

                                self.activation_mask[i, c] = ops.stop_gradient(
                                    mint.where(
                                        noise_guidance_edit_tmp_quantile >= tmp[:, :, None, None],
                                        mint.ones_like(noise_guidance_edit_tmp),
                                        mint.zeros_like(noise_guidance_edit_tmp),
                                    )
                                )

                                noise_guidance_edit_tmp = mint.where(
                                    noise_guidance_edit_tmp_quantile >= tmp[:, :, None, None],
                                    noise_guidance_edit_tmp,
                                    mint.zeros_like(noise_guidance_edit_tmp),
                                )

                            noise_guidance_edit += noise_guidance_edit_tmp

                        self.sem_guidance[i] = ops.stop_gradient(noise_guidance_edit)

                    noise_pred = noise_pred_uncond + noise_guidance_edit

                    if enable_edit_guidance and self.guidance_rescale > 0.0:
                        # Based on 3.4. in https://arxiv.org/pdf/2305.08891.pdf
                        # `noise_pred_edit_concepts` is a tuple of tensors, so stack it before averaging
                        noise_pred = rescale_noise_cfg(
                            noise_pred,
                            mint.stack(noise_pred_edit_concepts).mean(dim=0),
                            guidance_rescale=self.guidance_rescale,
                        )

                    idx = t_to_idx[int(t)]
                    latents = self.scheduler.step(noise_pred, t, latents, variance_noise=zs[idx], **extra_step_kwargs)[
                        0
                    ]

                    # step callback
                    if use_cross_attn_mask:
                        store_step = i in attn_store_steps
                        self.attention_store.between_steps(store_step)

                    if callback_on_step_end is not None:
                        callback_kwargs = {}
                        for k in callback_on_step_end_tensor_inputs:
                            callback_kwargs[k] = locals()[k]
                        callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                        latents = callback_outputs.pop("latents", latents)
                        # prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
                        negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)

                    # call the callback, if provided
                    if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
                        progress_bar.update()

        # 8. Post-processing
        if not output_type == "latent":
            image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[0]
            image, has_nsfw_concept = self.run_safety_checker(image, text_embeddings.dtype)
        else:
            image = latents
            has_nsfw_concept = None

        if has_nsfw_concept is None:
            do_denormalize = [True] * image.shape[0]
        else:
            do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept]

        image = self.image_processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)

        if not return_dict:
            return (image, has_nsfw_concept)

        return LEditsPPDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept)

    def invert(
        self,
        image: PipelineImageInput,
        source_prompt: str = "",
        source_guidance_scale: float = 3.5,
        num_inversion_steps: int = 30,
        skip: float = 0.15,
        generator: Optional[np.random.Generator] = None,
        cross_attention_kwargs: Optional[Dict[str, Any]] = None,
        clip_skip: Optional[int] = None,
        height: Optional[int] = None,
        width: Optional[int] = None,
        resize_mode: Optional[str] = "default",
        crops_coords: Optional[Tuple[int, int, int, int]] = None,
    ):
        r"""
        The function to the pipeline for image inversion as described by the [LEDITS++
        Paper](https://arxiv.org/abs/2301.12247). If the scheduler is set to [`~schedulers.DDIMScheduler`] the
        inversion proposed by [edit-friendly DPDM](https://arxiv.org/abs/2304.06140) will be performed instead.

        Args:
            image (`PipelineImageInput`):
                Input for the image(s) that are to be edited. Multiple input images have to share the same aspect
                ratio.
            source_prompt (`str`, defaults to `""`):
                Prompt describing the input image that will be used for guidance during inversion. Guidance is disabled
                if the `source_prompt` is `""`.
            source_guidance_scale (`float`, defaults to `3.5`):
                Strength of guidance during inversion.
            num_inversion_steps (`int`, defaults to `30`):
                Number of total performed inversion steps after discarding the initial `skip` steps.
            skip (`float`, defaults to `0.15`):
                Portion of initial steps that will be ignored for inversion and subsequent generation. Lower values
                will lead to stronger changes to the input image. `skip` has to be between `0` and `1`.
            generator (`np.random.Generator`, *optional*):
                A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make inversion
                deterministic.
            cross_attention_kwargs (`dict`, *optional*):
                A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
                [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
            clip_skip (`int`, *optional*):
                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
                the output of the pre-final layer will be used for computing the prompt embeddings.
            height (`int`, *optional*, defaults to `None`):
                The height of the preprocessed image. If `None`, `get_default_height_width()` is used to get the
                default height.
            width (`int`, *optional*, defaults to `None`):
                The width of the preprocessed image. If `None`, `get_default_height_width()` is used to get the
                default width.
            resize_mode (`str`, *optional*, defaults to `default`):
                The resize mode, can be one of `default` or `fill`. If `default`, will resize the image to fit within
                the specified width and height, and it may not maintain the original aspect ratio. If `fill`, will
                resize the image to fit within the specified width and height, maintaining the aspect ratio, and then
                center the image within the dimensions, filling empty areas with data from the image. If `crop`, will
                resize the image to fit within the specified width and height, maintaining the aspect ratio, and then
                center the image within the dimensions, cropping the excess. Note that resize_mode `fill` and `crop`
                are only supported for PIL image input.
            crops_coords (`List[Tuple[int, int, int, int]]`, *optional*, defaults to `None`):
                The crop coordinates for each image in the batch. If `None`, will not crop the image.

        Returns:
            [`~pipelines.ledits_pp.LEditsPPInversionPipelineOutput`]: Output will contain the resized input image(s)
            and respective VAE reconstruction(s).
        """
        if (height is not None and height % 32 != 0) or (width is not None and width % 32 != 0):
            raise ValueError("height and width must be a multiple of 32.")
        # Reset attn processor, we do not want to store attn maps during inversion
        self.unet.set_attn_processor(AttnProcessor())

        self.eta = 1.0

        self.scheduler.config.timestep_spacing = "leading"
        self.scheduler.set_timesteps(int(num_inversion_steps * (1 + skip)))
        self.inversion_steps = self.scheduler.timesteps[-num_inversion_steps:]
        timesteps = self.inversion_steps

        # 1. encode image
        x0, resized = self.encode_image(
            image,
            dtype=self.text_encoder.dtype,
            height=height,
            width=width,
            resize_mode=resize_mode,
            crops_coords=crops_coords,
        )
        self.batch_size = x0.shape[0]

        # autoencoder reconstruction
        image_rec = self.vae.decode(x0 / self.vae.config.scaling_factor, return_dict=False, generator=generator)[0]
        image_rec = self.image_processor.postprocess(image_rec, output_type="pil")

        # 2. get embeddings
        do_classifier_free_guidance = source_guidance_scale > 1.0

        lora_scale = cross_attention_kwargs.get("scale", None) if cross_attention_kwargs is not None else None

        uncond_embedding, text_embeddings, _ = self.encode_prompt(
            num_images_per_prompt=1,
            negative_prompt=None,
            enable_edit_guidance=do_classifier_free_guidance,
            editing_prompt=source_prompt,
            lora_scale=lora_scale,
            clip_skip=clip_skip,
        )

        # 3. find zs and xts
        variance_noise_shape = (num_inversion_steps, *x0.shape)

        # intermediate latents
        t_to_idx = {int(v): k for k, v in enumerate(timesteps)}
        xts = mint.zeros(size=variance_noise_shape, dtype=uncond_embedding.dtype)

        for t in reversed(timesteps):
            idx = num_inversion_steps - t_to_idx[int(t)] - 1
            noise = randn_tensor(shape=x0.shape, generator=generator, dtype=x0.dtype)
            xts[idx] = self.scheduler.add_noise(x0, noise, ms.tensor([t]))
        xts = mint.cat([x0.unsqueeze(0), xts], dim=0)

        self.scheduler.set_timesteps(len(self.scheduler.timesteps))
        # noise maps
        zs = mint.zeros(size=variance_noise_shape, dtype=uncond_embedding.dtype)

        with self.progress_bar(total=len(timesteps)) as progress_bar:
            for t in timesteps:
                idx = num_inversion_steps - t_to_idx[int(t)] - 1
                # 1. predict noise residual
                xt = xts[idx + 1]

                noise_pred = self.unet(xt, timestep=t, encoder_hidden_states=uncond_embedding)[0]

                if source_prompt != "":
                    noise_pred_cond = self.unet(xt, timestep=t, encoder_hidden_states=text_embeddings)[0]
                    noise_pred = noise_pred + source_guidance_scale * (noise_pred_cond - noise_pred)

                xtm1 = xts[idx]
                z, xtm1_corrected = compute_noise(self.scheduler, xtm1, xt, t, noise_pred, self.eta)
                zs[idx] = z

                # correction to avoid error accumulation
                xts[idx] = xtm1_corrected

                progress_bar.update()

        self.init_latents = xts[-1].broadcast_to((self.batch_size, -1, -1, -1))
        zs = zs.flip((0,))
        self.zs = zs

        return LEditsPPInversionPipelineOutput(images=resized, vae_reconstruction_images=image_rec)

    def encode_image(self, image, dtype=None, height=None, width=None, resize_mode="default", crops_coords=None):
        image = self.image_processor.preprocess(
            image=image, height=height, width=width, resize_mode=resize_mode, crops_coords=crops_coords
        )
        height, width = image.shape[-2:]
        if height % 32 != 0 or width % 32 != 0:
            raise ValueError(
                "Image height and width must be a multiple of 32. "
                "Consider down-sampling the input using the `height` and `width` parameters"
            )
        resized = self.image_processor.postprocess(image=image, output_type="pil")

        if max(image.shape[-2:]) > self.vae.config["sample_size"] * 1.5:
            logger.warning(
                "Your input images far exceed the default resolution of the underlying diffusion model. "
                "The output images may contain severe artifacts! "
                "Consider down-sampling the input using the `height` and `width` parameters"
            )
        image = image.to(dtype)

        x0 = self.vae.diag_gauss_dist.mode(self.vae.encode(image)[0])
        x0 = x0.to(dtype)
        x0 = self.vae.config.scaling_factor * x0
        return x0, resized
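The correction step in `invert` records, for each timestep, the noise map `z` that makes the sampler land exactly on the latent observed during inversion. This is the core of the edit-friendly DDPM inversion. A minimal NumPy sketch of that idea (the names `extract_noise_map`, `mu`, and `sigma` and the simplified one-step update are illustrative, not the library's API):

```python
import numpy as np

def extract_noise_map(x_tm1, mu_t, sigma_t):
    """Solve x_{t-1} = mu_t + sigma_t * z for z, so that replaying the
    stochastic sampling step with this z reproduces x_{t-1} exactly
    (perfect inversion)."""
    return (x_tm1 - mu_t) / sigma_t

rng = np.random.default_rng(0)
mu = rng.standard_normal((4, 8, 8))     # predicted posterior mean at step t
x_tm1 = rng.standard_normal((4, 8, 8))  # latent recorded during inversion
sigma = 0.7                             # step noise scale

z = extract_noise_map(x_tm1, mu, sigma)
x_replayed = mu + sigma * z             # the generation step, replayed
assert np.allclose(x_replayed, x_tm1)   # inversion is exact by construction
```

During editing, these stored `z` maps replace freshly sampled noise, which is why an unedited prompt reconstructs the input image.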

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusion.__call__(negative_prompt=None, generator=None, output_type='pil', return_dict=False, editing_prompt=None, editing_prompt_embeds=None, negative_prompt_embeds=None, reverse_editing_direction=False, edit_guidance_scale=5, edit_warmup_steps=0, edit_cooldown_steps=None, edit_threshold=0.9, user_mask=None, sem_guidance=None, use_cross_attn_mask=False, use_intersect_mask=True, attn_store_steps=[], store_averaged_over_steps=True, cross_attention_kwargs=None, guidance_rescale=0.0, clip_skip=None, callback_on_step_end=None, callback_on_step_end_tensor_inputs=['latents'], **kwargs)

The call function to the pipeline for editing. The [~pipelines.ledits_pp.LEditsPPPipelineStableDiffusion.invert] method has to be called beforehand. Edits will always be performed for the last inverted image(s).
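Conceptually, each denoising step combines one unconditional noise estimate with one estimate per editing concept. A minimal NumPy sketch of that aggregation, ignoring masking, warmup, and cooldown (function and variable names are illustrative, not the pipeline's API):

```python
import numpy as np

def combine_guidance(noise_uncond, noise_concepts, scales, reverse_flags):
    """Sum per-concept guidance terms onto the unconditional estimate,
    flipping the sign for concepts that should be removed from the image."""
    guidance = np.zeros_like(noise_uncond)
    for eps_c, s_e, reverse in zip(noise_concepts, scales, reverse_flags):
        direction = -1.0 if reverse else 1.0
        guidance += direction * s_e * (eps_c - noise_uncond)
    return noise_uncond + guidance

uncond = np.zeros((4, 8, 8))
concept = np.ones((4, 8, 8))  # noise estimate conditioned on one edit prompt
out = combine_guidance(uncond, [concept], scales=[5.0], reverse_flags=[False])
assert np.allclose(out, 5.0)  # concept pushed into the prediction
```

Setting `reverse_editing_direction=True` for a concept corresponds to flipping `direction` above, i.e. steering the prediction away from that concept.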

PARAMETER DESCRIPTION
negative_prompt

The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

generator

One or a list of np.random.Generator(s) to make generation deterministic.

TYPE: `np.random.Generator`, *optional* DEFAULT: None

output_type

The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.

TYPE: `str`, *optional*, defaults to `"pil"` DEFAULT: 'pil'

return_dict

Whether or not to return a [~pipelines.ledits_pp.LEditsPPDiffusionPipelineOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

editing_prompt

The prompt or prompts to guide the image generation. The image is reconstructed by setting editing_prompt = None. Guidance direction of prompt should be specified via reverse_editing_direction.

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

editing_prompt_embeds

Pre-computed embeddings to use for guiding the image generation. Guidance direction of embedding should be specified via reverse_editing_direction.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

negative_prompt_embeds

Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

reverse_editing_direction

Whether the corresponding prompt in editing_prompt should be increased or decreased.

TYPE: `bool` or `List[bool]`, *optional*, defaults to `False` DEFAULT: False

edit_guidance_scale

Guidance scale for guiding the image generation. If provided as a list, values should correspond to editing_prompt. edit_guidance_scale is defined as s_e of equation 12 of the LEDITS++ Paper.

TYPE: `float` or `List[float]`, *optional*, defaults to 5 DEFAULT: 5

edit_warmup_steps

Number of diffusion steps (for each prompt) for which guidance will not be applied.

TYPE: `int` or `List[int]`, *optional*, defaults to `0` DEFAULT: 0

edit_cooldown_steps

Number of diffusion steps (for each prompt) after which guidance will no longer be applied.

TYPE: `int` or `List[int]`, *optional*, defaults to `None` DEFAULT: None

edit_threshold

Masking threshold of guidance. Threshold should be proportional to the image region that is modified. 'edit_threshold' is defined as 'λ' of equation 12 of LEDITS++ Paper.

TYPE: `float` or `List[float]`, *optional*, defaults to 0.9 DEFAULT: 0.9

user_mask

User-provided mask for even better control over the editing process. This is helpful when LEDITS++'s implicit masks do not meet user preferences.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

sem_guidance

List of pre-generated guidance vectors to be applied at generation. Length of the list has to correspond to num_inference_steps.

TYPE: `List[ms.Tensor]`, *optional* DEFAULT: None

use_cross_attn_mask

Whether cross-attention masks are used. Cross-attention masks are always used when use_intersect_mask is set to True. Cross-attention masks are defined as 'M^1' of equation 12 of the LEDITS++ paper.

TYPE: `bool`, defaults to `False` DEFAULT: False

use_intersect_mask

Whether the masking term is calculated as the intersection of cross-attention masks and masks derived from the noise estimate. Cross-attention masks are defined as 'M^1' and masks derived from the noise estimate are defined as 'M^2' of equation 12 of the LEDITS++ paper.

TYPE: `bool`, defaults to `True` DEFAULT: True

attn_store_steps

Steps for which the attention maps are stored in the AttentionStore. Just for visualization purposes.

TYPE: `List[int]`, *optional* DEFAULT: []

store_averaged_over_steps

Whether the attention maps for the 'attn_store_steps' are stored averaged over the diffusion steps. If False, attention maps for each step are stored separately. Just for visualization purposes.

TYPE: `bool`, defaults to `True` DEFAULT: True

cross_attention_kwargs

A kwargs dictionary that if specified is passed along to the [AttentionProcessor] as defined in self.processor.

TYPE: `dict`, *optional* DEFAULT: None

guidance_rescale

Guidance rescale factor from Common Diffusion Noise Schedules and Sample Steps are Flawed. Guidance rescale factor should fix overexposure when using zero terminal SNR.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

clip_skip

Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

TYPE: `int`, *optional* DEFAULT: None

callback_on_step_end

A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.

TYPE: `Callable`, *optional* DEFAULT: None

callback_on_step_end_tensor_inputs

The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

TYPE: `List`, *optional* DEFAULT: ['latents']

RETURNS DESCRIPTION

[~pipelines.ledits_pp.LEditsPPDiffusionPipelineOutput] or tuple:

[~pipelines.ledits_pp.LEditsPPDiffusionPipelineOutput] if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) content, according to the safety_checker.
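The `edit_threshold` parameter acts as a quantile cutoff on the per-pixel magnitude of the guidance term: only the strongest responses pass, which restricts edits to the relevant image regions. A simplified NumPy sketch of such an implicit mask (illustrative names; the actual implementation in equation 12 operates per concept and per channel):

```python
import numpy as np

def implicit_mask(noise_guidance, edit_threshold=0.9):
    """Keep only entries whose |guidance| lies at or above the given
    quantile, zeroing out weakly affected regions."""
    magnitude = np.abs(noise_guidance)
    cutoff = np.quantile(magnitude, edit_threshold)
    return (magnitude >= cutoff).astype(noise_guidance.dtype)

rng = np.random.default_rng(0)
g = rng.standard_normal((4, 64, 64))        # per-concept guidance term
mask = implicit_mask(g, edit_threshold=0.9)
masked_guidance = g * mask                  # edits confined to strong regions
```

A higher `edit_threshold` therefore shrinks the edited region; a lower one lets the edit spread across more of the image.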

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion.py
def __call__(
    self,
    negative_prompt: Optional[Union[str, List[str]]] = None,
    generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
    editing_prompt: Optional[Union[str, List[str]]] = None,
    editing_prompt_embeds: Optional[ms.Tensor] = None,
    negative_prompt_embeds: Optional[ms.Tensor] = None,
    reverse_editing_direction: Optional[Union[bool, List[bool]]] = False,
    edit_guidance_scale: Optional[Union[float, List[float]]] = 5,
    edit_warmup_steps: Optional[Union[int, List[int]]] = 0,
    edit_cooldown_steps: Optional[Union[int, List[int]]] = None,
    edit_threshold: Optional[Union[float, List[float]]] = 0.9,
    user_mask: Optional[ms.Tensor] = None,
    sem_guidance: Optional[List[ms.Tensor]] = None,
    use_cross_attn_mask: bool = False,
    use_intersect_mask: bool = True,
    attn_store_steps: Optional[List[int]] = [],
    store_averaged_over_steps: bool = True,
    cross_attention_kwargs: Optional[Dict[str, Any]] = None,
    guidance_rescale: float = 0.0,
    clip_skip: Optional[int] = None,
    callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    **kwargs,
):
    r"""
    The call function to the pipeline for editing. The
    [`~pipelines.ledits_pp.LEditsPPPipelineStableDiffusion.invert`] method has to be called beforehand. Edits will
    always be performed for the last inverted image(s).

    Args:
        negative_prompt (`str` or `List[str]`, *optional*):
            The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
            if `guidance_scale` is less than `1`).
        generator (`np.random.Generator`, *optional*):
            One or a list of [np.random.Generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
            to make generation deterministic.
        output_type (`str`, *optional*, defaults to `"pil"`):
            The output format of the generated image. Choose between
            [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`~pipelines.ledits_pp.LEditsPPDiffusionPipelineOutput`] instead of a plain
            tuple.
        editing_prompt (`str` or `List[str]`, *optional*):
            The prompt or prompts to guide the image generation. The image is reconstructed by setting
            `editing_prompt = None`. Guidance direction of prompt should be specified via
            `reverse_editing_direction`.
        editing_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-computed embeddings to use for guiding the image generation. Guidance direction of embedding should
            be specified via `reverse_editing_direction`.
        negative_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
            not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
        reverse_editing_direction (`bool` or `List[bool]`, *optional*, defaults to `False`):
            Whether the corresponding prompt in `editing_prompt` should be increased or decreased.
        edit_guidance_scale (`float` or `List[float]`, *optional*, defaults to 5):
            Guidance scale for guiding the image generation. If provided as a list, values should correspond to
            `editing_prompt`. `edit_guidance_scale` is defined as `s_e` of equation 12 of the [LEDITS++
            Paper](https://arxiv.org/abs/2311.16711).
        edit_warmup_steps (`int` or `List[int]`, *optional*, defaults to `0`):
            Number of diffusion steps (for each prompt) for which guidance will not be applied.
        edit_cooldown_steps (`int` or `List[int]`, *optional*, defaults to `None`):
            Number of diffusion steps (for each prompt) after which guidance will no longer be applied.
        edit_threshold (`float` or `List[float]`, *optional*, defaults to 0.9):
            Masking threshold of guidance. Threshold should be proportional to the image region that is modified.
            `edit_threshold` is defined as `λ` of equation 12 of the [LEDITS++
            Paper](https://arxiv.org/abs/2311.16711).
        user_mask (`ms.Tensor`, *optional*):
            User-provided mask for even better control over the editing process. This is helpful when LEDITS++'s
            implicit masks do not meet user preferences.
        sem_guidance (`List[ms.Tensor]`, *optional*):
            List of pre-generated guidance vectors to be applied at generation. Length of the list has to
            correspond to `num_inference_steps`.
        use_cross_attn_mask (`bool`, defaults to `False`):
            Whether cross-attention masks are used. Cross-attention masks are always used when `use_intersect_mask`
            is set to `True`. Cross-attention masks are defined as 'M^1' of equation 12 of the [LEDITS++
            paper](https://arxiv.org/pdf/2311.16711.pdf).
        use_intersect_mask (`bool`, defaults to `True`):
            Whether the masking term is calculated as the intersection of cross-attention masks and masks derived
            from the noise estimate. Cross-attention masks are defined as 'M^1' and masks derived from the noise
            estimate are defined as 'M^2' of equation 12 of the [LEDITS++ paper](https://arxiv.org/pdf/2311.16711.pdf).
        attn_store_steps (`List[int]`, *optional*):
            Steps for which the attention maps are stored in the AttentionStore. Just for visualization purposes.
        store_averaged_over_steps (`bool`, defaults to `True`):
            Whether the attention maps for the `attn_store_steps` are stored averaged over the diffusion steps. If
            False, attention maps for each step are stored separately. Just for visualization purposes.
        cross_attention_kwargs (`dict`, *optional*):
            A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
            [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
        guidance_rescale (`float`, *optional*, defaults to 0.0):
            Guidance rescale factor from [Common Diffusion Noise Schedules and Sample Steps are
            Flawed](https://arxiv.org/pdf/2305.08891.pdf). Guidance rescale factor should fix overexposure when
            using zero terminal SNR.
        clip_skip (`int`, *optional*):
            Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
            the output of the pre-final layer will be used for computing the prompt embeddings.
        callback_on_step_end (`Callable`, *optional*):
            A function that is called at the end of each denoising step during inference. The function is called
            with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
            callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
            `callback_on_step_end_tensor_inputs`.
        callback_on_step_end_tensor_inputs (`List`, *optional*):
            The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
            will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
            `._callback_tensor_inputs` attribute of your pipeline class.

    Examples:

    Returns:
        [`~pipelines.ledits_pp.LEditsPPDiffusionPipelineOutput`] or `tuple`:
        [`~pipelines.ledits_pp.LEditsPPDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple`. When
        returning a tuple, the first element is a list with the generated images, and the second element is a list
        of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw)
        content, according to the `safety_checker`.
    """

    if self.inversion_steps is None:
        raise ValueError(
            "You need to invert an input image first before calling the pipeline. The `invert` method has to be called beforehand. Edits will always be performed for the last inverted image(s)."  # noqa: E501
        )

    eta = self.eta
    num_images_per_prompt = 1
    latents = self.init_latents

    zs = self.zs
    self.scheduler.set_timesteps(len(self.scheduler.timesteps))

    if use_intersect_mask:
        use_cross_attn_mask = True

    if use_cross_attn_mask:
        self.smoothing = LeditsGaussianSmoothing()

    org_prompt = ""

    # 1. Check inputs. Raise error if not correct
    self.check_inputs(
        negative_prompt,
        editing_prompt_embeds,
        negative_prompt_embeds,
        callback_on_step_end_tensor_inputs,
    )

    self._guidance_rescale = guidance_rescale
    self._clip_skip = clip_skip
    self._cross_attention_kwargs = cross_attention_kwargs

    # 2. Define call parameters
    batch_size = self.batch_size

    if editing_prompt:
        enable_edit_guidance = True
        if isinstance(editing_prompt, str):
            editing_prompt = [editing_prompt]
        self.enabled_editing_prompts = len(editing_prompt)
    elif editing_prompt_embeds is not None:
        enable_edit_guidance = True
        self.enabled_editing_prompts = editing_prompt_embeds.shape[0]
    else:
        self.enabled_editing_prompts = 0
        enable_edit_guidance = False

    # 3. Encode input prompt
    lora_scale = self.cross_attention_kwargs.get("scale", None) if self.cross_attention_kwargs is not None else None

    edit_concepts, uncond_embeddings, num_edit_tokens = self.encode_prompt(
        editing_prompt=editing_prompt,
        num_images_per_prompt=num_images_per_prompt,
        enable_edit_guidance=enable_edit_guidance,
        negative_prompt=negative_prompt,
        editing_prompt_embeds=editing_prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        lora_scale=lora_scale,
        clip_skip=self.clip_skip,
    )

    # For classifier free guidance, we need to do two forward passes.
    # Here we concatenate the unconditional and text embeddings into a single batch
    # to avoid doing two forward passes
    if enable_edit_guidance:
        text_embeddings = mint.cat([uncond_embeddings, edit_concepts])
        self.text_cross_attention_maps = [editing_prompt] if isinstance(editing_prompt, str) else editing_prompt
    else:
        text_embeddings = mint.cat([uncond_embeddings])

    # 4. Prepare timesteps
    # self.scheduler.set_timesteps(num_inference_steps)
    timesteps = self.inversion_steps
    t_to_idx = {int(v): k for k, v in enumerate(timesteps[-zs.shape[0] :])}

    if use_cross_attn_mask:
        self.attention_store = LeditsAttentionStore(
            average=store_averaged_over_steps,
            batch_size=batch_size,
            max_size=(latents.shape[-2] / 4.0) * (latents.shape[-1] / 4.0),
            max_resolution=None,
        )
        self.prepare_unet(self.attention_store, PnP=False)
        resolution = latents.shape[-2:]
        att_res = (int(resolution[0] / 4), int(resolution[1] / 4))

    # 5. Prepare latent variables
    num_channels_latents = self.unet.config.in_channels
    latents = self.prepare_latents(
        batch_size * num_images_per_prompt,
        num_channels_latents,
        None,
        None,
        text_embeddings.dtype,
        latents,
    )

    # 6. Prepare extra step kwargs.
    extra_step_kwargs = self.prepare_extra_step_kwargs(eta)

    self.sem_guidance = None
    self.activation_mask = None

    # 7. Denoising loop
    num_warmup_steps = 0
    with pynative_context():
        with self.progress_bar(total=len(timesteps)) as progress_bar:
            for i, t in enumerate(timesteps):
                # expand the latents if we are doing classifier free guidance

                if enable_edit_guidance:
                    latent_model_input = mint.cat([latents] * (1 + self.enabled_editing_prompts))
                else:
                    latent_model_input = latents

                latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

                text_embed_input = text_embeddings

                # predict the noise residual
                noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embed_input)[0]

                noise_pred_out = noise_pred.chunk(1 + self.enabled_editing_prompts)  # [b,4, 64, 64]
                noise_pred_uncond = noise_pred_out[0]
                noise_pred_edit_concepts = noise_pred_out[1:]

                noise_guidance_edit = mint.zeros(
                    noise_pred_uncond.shape,
                    dtype=noise_pred_uncond.dtype,
                )

                if sem_guidance is not None and len(sem_guidance) > i:
                    noise_guidance_edit += sem_guidance[i]

                elif enable_edit_guidance:
                    if self.activation_mask is None:
                        self.activation_mask = mint.zeros(
                            (len(timesteps), len(noise_pred_edit_concepts), *noise_pred_edit_concepts[0].shape)
                        )

                    if self.sem_guidance is None:
                        self.sem_guidance = mint.zeros((len(timesteps), *noise_pred_uncond.shape))

                    for c, noise_pred_edit_concept in enumerate(noise_pred_edit_concepts):
                        if isinstance(edit_warmup_steps, list):
                            edit_warmup_steps_c = edit_warmup_steps[c]
                        else:
                            edit_warmup_steps_c = edit_warmup_steps
                        if i < edit_warmup_steps_c:
                            continue

                        if isinstance(edit_guidance_scale, list):
                            edit_guidance_scale_c = edit_guidance_scale[c]
                        else:
                            edit_guidance_scale_c = edit_guidance_scale

                        if isinstance(edit_threshold, list):
                            edit_threshold_c = edit_threshold[c]
                        else:
                            edit_threshold_c = edit_threshold
                        if isinstance(reverse_editing_direction, list):
                            reverse_editing_direction_c = reverse_editing_direction[c]
                        else:
                            reverse_editing_direction_c = reverse_editing_direction

                        if isinstance(edit_cooldown_steps, list):
                            edit_cooldown_steps_c = edit_cooldown_steps[c]
                        elif edit_cooldown_steps is None:
                            edit_cooldown_steps_c = i + 1
                        else:
                            edit_cooldown_steps_c = edit_cooldown_steps

                        if i >= edit_cooldown_steps_c:
                            continue

                        noise_guidance_edit_tmp = noise_pred_edit_concept - noise_pred_uncond

                        if reverse_editing_direction_c:
                            noise_guidance_edit_tmp = noise_guidance_edit_tmp * -1

                        noise_guidance_edit_tmp = noise_guidance_edit_tmp * edit_guidance_scale_c

                        if user_mask is not None:
                            noise_guidance_edit_tmp = noise_guidance_edit_tmp * user_mask

                        if use_cross_attn_mask:
                            out = self.attention_store.aggregate_attention(
                                attention_maps=self.attention_store.step_store,
                                prompts=self.text_cross_attention_maps,
                                res=att_res,
                                from_where=["up", "down"],
                                is_cross=True,
                                select=self.text_cross_attention_maps.index(editing_prompt[c]),
                            )
                            attn_map = out[:, :, :, 1 : 1 + num_edit_tokens[c].item()]  # 0 -> startoftext

                            # average over all tokens
                            if attn_map.shape[3] != num_edit_tokens[c]:
                                raise ValueError(
                                    f"Incorrect shape of attention_map. Expected size {num_edit_tokens[c]}, but found {attn_map.shape[3]}!"
                                )

                            attn_map = mint.sum(attn_map, dim=3)

                            # gaussian_smoothing
                            attn_map = mint.nn.functional.pad(attn_map.unsqueeze(1), (1, 1, 1, 1), mode="reflect")
                            attn_map = self.smoothing(attn_map).squeeze(1)

                            # ops.quantile function expects float32
                            # TODO: ops.quantile is not supported
                            if attn_map.dtype == ms.float32:
                                tmp = ms.tensor(
                                    np.quantile(attn_map.flatten(start_dim=1).numpy(), edit_threshold_c, axis=1)
                                )
                            else:
                                tmp = ms.tensor(
                                    np.quantile(
                                        attn_map.flatten(start_dim=1).to(ms.float32).numpy(),
                                        edit_threshold_c,
                                        axis=1,
                                    )
                                ).to(attn_map.dtype)
                            attn_mask = mint.where(
                                attn_map >= tmp.unsqueeze(1).unsqueeze(1).tile((1, *att_res)), 1.0, 0.0
                            )

                            # resolution must match latent space dimension
                            attn_mask = mint.nn.functional.interpolate(
                                attn_mask.unsqueeze(1),
                                noise_guidance_edit_tmp.shape[-2:],  # 64,64
                            ).tile((1, 4, 1, 1))
                            self.activation_mask[i, c] = ops.stop_gradient(attn_mask)
                            if not use_intersect_mask:
                                noise_guidance_edit_tmp = noise_guidance_edit_tmp * attn_mask

                        if use_intersect_mask:
                            if t <= 800:
                                noise_guidance_edit_tmp_quantile = mint.abs(noise_guidance_edit_tmp)
                                noise_guidance_edit_tmp_quantile = mint.sum(
                                    noise_guidance_edit_tmp_quantile, dim=1, keepdim=True
                                )
                                noise_guidance_edit_tmp_quantile = noise_guidance_edit_tmp_quantile.tile(
                                    (1, self.unet.config.in_channels, 1, 1)
                                )

                                # ops.quantile function expects float32
                                # TODO: ops.quantile is not supported
                                if noise_guidance_edit_tmp_quantile.dtype == ms.float32:
                                    tmp = ms.tensor(
                                        np.quantile(
                                            noise_guidance_edit_tmp_quantile.flatten(start_dim=2).numpy(),
                                            edit_threshold_c,
                                            axis=2,
                                            keepdims=False,
                                        )
                                    )
                                else:
                                    tmp = ms.tensor(
                                        np.quantile(
                                            noise_guidance_edit_tmp_quantile.flatten(start_dim=2)
                                            .to(ms.float32)
                                            .numpy(),
                                            edit_threshold_c,
                                            axis=2,
                                            keepdims=False,
                                        )
                                    ).to(noise_guidance_edit_tmp_quantile.dtype)

                                intersect_mask = (
                                    mint.where(
                                        noise_guidance_edit_tmp_quantile >= tmp[:, :, None, None],
                                        mint.ones_like(noise_guidance_edit_tmp),
                                        mint.zeros_like(noise_guidance_edit_tmp),
                                    )
                                    * attn_mask
                                )

                                self.activation_mask[i, c] = ops.stop_gradient(intersect_mask)

                                noise_guidance_edit_tmp = noise_guidance_edit_tmp * intersect_mask

                            else:
                                # print(f"only attention mask for step {i}")
                                noise_guidance_edit_tmp = noise_guidance_edit_tmp * attn_mask

                        elif not use_cross_attn_mask:
                            # calculate quantile
                            noise_guidance_edit_tmp_quantile = mint.abs(noise_guidance_edit_tmp)
                            noise_guidance_edit_tmp_quantile = mint.sum(
                                noise_guidance_edit_tmp_quantile, dim=1, keepdim=True
                            )
                            noise_guidance_edit_tmp_quantile = noise_guidance_edit_tmp_quantile.tile((1, 4, 1, 1))

                            # ops.quantile function expects float32
                            # TODO: ops.quantile is not supported
                            if noise_guidance_edit_tmp_quantile.dtype == ms.float32:
                                tmp = ms.tensor(
                                    np.quantile(
                                        noise_guidance_edit_tmp_quantile.flatten(start_dim=2).numpy(),
                                        edit_threshold_c,
                                        axis=2,
                                        keepdims=False,
                                    )
                                )
                            else:
                                tmp = ms.tensor(
                                    np.quantile(
                                        noise_guidance_edit_tmp_quantile.flatten(start_dim=2)
                                        .to(ms.float32)
                                        .numpy(),
                                        edit_threshold_c,
                                        axis=2,
                                        keepdims=False,
                                    )
                                ).to(noise_guidance_edit_tmp_quantile.dtype)

                            self.activation_mask[i, c] = ops.stop_gradient(
                                mint.where(
                                    noise_guidance_edit_tmp_quantile >= tmp[:, :, None, None],
                                    mint.ones_like(noise_guidance_edit_tmp),
                                    mint.zeros_like(noise_guidance_edit_tmp),
                                )
                            )

                            noise_guidance_edit_tmp = mint.where(
                                noise_guidance_edit_tmp_quantile >= tmp[:, :, None, None],
                                noise_guidance_edit_tmp,
                                mint.zeros_like(noise_guidance_edit_tmp),
                            )

                        noise_guidance_edit += noise_guidance_edit_tmp

                    self.sem_guidance[i] = ops.stop_gradient(noise_guidance_edit)

                noise_pred = noise_pred_uncond + noise_guidance_edit

                if enable_edit_guidance and self.guidance_rescale > 0.0:
                    # Based on 3.4. in https://arxiv.org/pdf/2305.08891.pdf
                    noise_pred = rescale_noise_cfg(
                        noise_pred,
                        noise_pred_edit_concepts.mean(dim=0, keepdim=False),
                        guidance_rescale=self.guidance_rescale,
                    )

                idx = t_to_idx[int(t)]
                latents = self.scheduler.step(noise_pred, t, latents, variance_noise=zs[idx], **extra_step_kwargs)[
                    0
                ]

                # step callback
                if use_cross_attn_mask:
                    store_step = i in attn_store_steps
                    self.attention_store.between_steps(store_step)

                if callback_on_step_end is not None:
                    callback_kwargs = {}
                    for k in callback_on_step_end_tensor_inputs:
                        callback_kwargs[k] = locals()[k]
                    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                    latents = callback_outputs.pop("latents", latents)
                    # prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
                    negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)

                # call the callback, if provided
                if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
                    progress_bar.update()

    # 8. Post-processing
    if not output_type == "latent":
        image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[0]
        image, has_nsfw_concept = self.run_safety_checker(image, text_embeddings.dtype)
    else:
        image = latents
        has_nsfw_concept = None

    if has_nsfw_concept is None:
        do_denormalize = [True] * image.shape[0]
    else:
        do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept]

    image = self.image_processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)

    if not return_dict:
        return (image, has_nsfw_concept)

    return LEditsPPDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept)
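For intuition, the implicit masking step in the denoising loop above can be sketched in plain NumPy: the absolute guidance is summed over channels, a per-sample quantile is computed (via the `np.quantile` fallback, since `ops.quantile` is unsupported), and everything below that quantile is zeroed out. The helper name and tensor shapes below are illustrative only, not part of the pipeline API.

```python
import numpy as np

def implicit_mask(noise_guidance: np.ndarray, edit_threshold: float) -> np.ndarray:
    """Binary mask keeping only the strongest guidance regions.

    Mirrors the quantile thresholding in the denoising loop: absolute
    guidance is summed over channels, a per-sample quantile is computed,
    and positions below that quantile are masked out.
    """
    # sum of absolute guidance over the channel axis, then broadcast back
    magnitude = np.abs(noise_guidance).sum(axis=1, keepdims=True)  # (b, 1, h, w)
    magnitude = np.tile(magnitude, (1, noise_guidance.shape[1], 1, 1))  # (b, c, h, w)
    # per-sample, per-channel threshold over all spatial positions
    flat = magnitude.reshape(magnitude.shape[0], magnitude.shape[1], -1)
    tmp = np.quantile(flat, edit_threshold, axis=2)  # (b, c)
    return np.where(magnitude >= tmp[:, :, None, None], 1.0, 0.0)

guidance = np.random.randn(1, 4, 8, 8).astype(np.float32)
mask = implicit_mask(guidance, edit_threshold=0.9)
```

With `edit_threshold=0.9`, roughly the strongest 10% of spatial positions survive, which is how LEDITS++ limits changes to relevant image regions.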

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusion.disable_vae_slicing()

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method will go back to computing decoding in one step.

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion.py
def disable_vae_slicing(self):
    r"""
    Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
    computing decoding in one step.
    """
    self.vae.disable_slicing()

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusion.disable_vae_tiling()

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method will go back to computing decoding in one step.

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion.py
def disable_vae_tiling(self):
    r"""
    Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
    computing decoding in one step.
    """
    self.vae.disable_tiling()

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusion.enable_vae_slicing()

Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion.py
def enable_vae_slicing(self):
    r"""
    Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
    compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
    """
    self.vae.enable_slicing()

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusion.enable_vae_tiling()

Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow processing larger images.

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion.py
def enable_vae_tiling(self):
    r"""
    Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
    compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
    processing larger images.
    """
    self.vae.enable_tiling()
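Sliced decoding trades one large batch decode for several per-sample decodes, bounding peak memory at the cost of some throughput. The toy sketch below illustrates the idea with a dummy decode function; it is not the actual VAE API, which the methods above simply toggle via `self.vae`.

```python
import numpy as np

def decode_batch(latents: np.ndarray) -> np.ndarray:
    # stand-in for an expensive VAE decode; upsamples each latent 8x spatially
    return latents.repeat(8, axis=-2).repeat(8, axis=-1)

def decode_sliced(latents: np.ndarray) -> np.ndarray:
    # decode one sample at a time and concatenate, analogous to enable_vae_slicing
    slices = [decode_batch(latents[i : i + 1]) for i in range(latents.shape[0])]
    return np.concatenate(slices, axis=0)

latents = np.random.randn(4, 4, 8, 8).astype(np.float32)
# both paths produce the same result; slicing just lowers peak memory
assert np.allclose(decode_batch(latents), decode_sliced(latents))
```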

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusion.encode_prompt(num_images_per_prompt, enable_edit_guidance, negative_prompt=None, editing_prompt=None, negative_prompt_embeds=None, editing_prompt_embeds=None, lora_scale=None, clip_skip=None)

Encodes the prompt into text encoder hidden states.

PARAMETER DESCRIPTION
num_images_per_prompt

number of images that should be generated per prompt

TYPE: `int`

enable_edit_guidance

whether to perform any editing or reconstruct the input image instead

TYPE: `bool`

negative_prompt

The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

editing_prompt

Editing prompt(s) to be encoded. If not defined, one has to pass editing_prompt_embeds instead.

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

editing_prompt_embeds

Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

negative_prompt_embeds

Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

lora_scale

A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.

TYPE: `float`, *optional* DEFAULT: None

clip_skip

Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

TYPE: `int`, *optional* DEFAULT: None

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion.py
def encode_prompt(
    self,
    num_images_per_prompt,
    enable_edit_guidance,
    negative_prompt=None,
    editing_prompt=None,
    negative_prompt_embeds: Optional[ms.Tensor] = None,
    editing_prompt_embeds: Optional[ms.Tensor] = None,
    lora_scale: Optional[float] = None,
    clip_skip: Optional[int] = None,
):
    r"""
    Encodes the prompt into text encoder hidden states.

    Args:
        num_images_per_prompt (`int`):
            number of images that should be generated per prompt
        enable_edit_guidance (`bool`):
            whether to perform any editing or reconstruct the input image instead
        negative_prompt (`str` or `List[str]`, *optional*):
            The prompt or prompts not to guide the image generation. If not defined, one has to pass
            `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
            less than `1`).
        editing_prompt (`str` or `List[str]`, *optional*):
            Editing prompt(s) to be encoded. If not defined, one has to pass `editing_prompt_embeds` instead.
        editing_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
            provided, text embeddings will be generated from `prompt` input argument.
        negative_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
            weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
            argument.
        lora_scale (`float`, *optional*):
            A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
        clip_skip (`int`, *optional*):
            Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
            the output of the pre-final layer will be used for computing the prompt embeddings.
    """
    # set lora scale so that monkey patched LoRA
    # function of text encoder can correctly access it
    if lora_scale is not None and isinstance(self, StableDiffusionLoraLoaderMixin):
        self._lora_scale = lora_scale

        # dynamically adjust the LoRA scale
        scale_lora_layers(self.text_encoder, lora_scale)

    batch_size = self.batch_size
    num_edit_tokens = None

    if negative_prompt_embeds is None:
        uncond_tokens: List[str]
        if negative_prompt is None:
            uncond_tokens = [""] * batch_size
        elif isinstance(negative_prompt, str):
            uncond_tokens = [negative_prompt]
        elif batch_size != len(negative_prompt):
            raise ValueError(
                f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but expected"
                f"{batch_size} based on the input images. Please make sure that passed `negative_prompt` matches"
                " the batch size of `prompt`."
            )
        else:
            uncond_tokens = negative_prompt

        # textual inversion: process multi-vector tokens if necessary
        if isinstance(self, TextualInversionLoaderMixin):
            uncond_tokens = self.maybe_convert_prompt(uncond_tokens, self.tokenizer)

        uncond_input = self.tokenizer(
            uncond_tokens,
            padding="max_length",
            max_length=self.tokenizer.model_max_length,
            truncation=True,
            return_tensors="np",
        )

        if hasattr(self.text_encoder.config, "use_attention_mask") and self.text_encoder.config.use_attention_mask:
            attention_mask = ms.tensor(uncond_input.attention_mask)
        else:
            attention_mask = None

        negative_prompt_embeds = self.text_encoder(
            ms.tensor(uncond_input.input_ids),
            attention_mask=attention_mask,
        )
        negative_prompt_embeds = negative_prompt_embeds[0]

    if self.text_encoder is not None:
        prompt_embeds_dtype = self.text_encoder.dtype
    elif self.unet is not None:
        prompt_embeds_dtype = self.unet.dtype
    else:
        prompt_embeds_dtype = negative_prompt_embeds.dtype

    negative_prompt_embeds = negative_prompt_embeds.to(dtype=prompt_embeds_dtype)

    if enable_edit_guidance:
        if editing_prompt_embeds is None:
            # textual inversion: process multi-vector tokens if necessary
            # if isinstance(self, TextualInversionLoaderMixin):
            #    prompt = self.maybe_convert_prompt(prompt, self.tokenizer)
            if isinstance(editing_prompt, str):
                editing_prompt = [editing_prompt]

            max_length = negative_prompt_embeds.shape[1]
            text_inputs = self.tokenizer(
                [x for item in editing_prompt for x in repeat(item, batch_size)],
                padding="max_length",
                max_length=max_length,
                truncation=True,
                return_tensors="np",
                return_length=True,
            )

            num_edit_tokens = text_inputs.length - 2  # not counting startoftext and endoftext
            text_input_ids = text_inputs.input_ids
            untruncated_ids = self.tokenizer(
                [x for item in editing_prompt for x in repeat(item, batch_size)],
                padding="longest",
                return_tensors="np",
            ).input_ids

            if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not np.array_equal(
                text_input_ids, untruncated_ids
            ):
                removed_text = self.tokenizer.batch_decode(
                    untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1]
                )
                logger.warning(
                    "The following part of your input was truncated because CLIP can only handle sequences up to"
                    f" {self.tokenizer.model_max_length} tokens: {removed_text}"
                )

            if (
                hasattr(self.text_encoder.config, "use_attention_mask")
                and self.text_encoder.config.use_attention_mask
            ):
                attention_mask = ms.tensor(text_inputs.attention_mask)
            else:
                attention_mask = None

            text_input_ids = ms.tensor(text_input_ids)
            if clip_skip is None:
                editing_prompt_embeds = self.text_encoder(text_input_ids, attention_mask=attention_mask)
                editing_prompt_embeds = editing_prompt_embeds[0]
            else:
                editing_prompt_embeds = self.text_encoder(
                    text_input_ids, attention_mask=attention_mask, output_hidden_states=True
                )
                # Access the `hidden_states` first, that contains a tuple of
                # all the hidden states from the encoder layers. Then index into
                # the tuple to access the hidden states from the desired layer.
                editing_prompt_embeds = editing_prompt_embeds[-1][-(clip_skip + 1)]
                # We also need to apply the final LayerNorm here to not mess with the
                # representations. The `last_hidden_states` that we typically use for
                # obtaining the final prompt representations passes through the LayerNorm
                # layer.
                editing_prompt_embeds = self.text_encoder.text_model.final_layer_norm(editing_prompt_embeds)

        editing_prompt_embeds = editing_prompt_embeds.to(dtype=negative_prompt_embeds.dtype)

        bs_embed_edit, seq_len, _ = editing_prompt_embeds.shape
        editing_prompt_embeds = editing_prompt_embeds.tile((1, num_images_per_prompt, 1))
        editing_prompt_embeds = editing_prompt_embeds.view(bs_embed_edit * num_images_per_prompt, seq_len, -1)

    # get unconditional embeddings for classifier free guidance

    # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
    seq_len = negative_prompt_embeds.shape[1]

    negative_prompt_embeds = negative_prompt_embeds.to(dtype=prompt_embeds_dtype)

    negative_prompt_embeds = negative_prompt_embeds.tile((1, num_images_per_prompt, 1))
    negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)

    if isinstance(self, StableDiffusionLoraLoaderMixin):
        # Retrieve the original scale by scaling back the LoRA layers
        unscale_lora_layers(self.text_encoder, lora_scale)

    return editing_prompt_embeds, negative_prompt_embeds, num_edit_tokens
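The tokenizer call in `encode_prompt` expands the editing prompts with a nested comprehension over `itertools.repeat`, duplicating each prompt once per batch element (grouped by prompt). A standalone example of that expansion, with illustrative prompt strings:

```python
from itertools import repeat

editing_prompt = ["sunglasses", "smiling"]
batch_size = 2

# each editing prompt is duplicated batch_size times, grouped by prompt,
# so every image in the batch receives every edit concept
expanded = [x for item in editing_prompt for x in repeat(item, batch_size)]
# -> ["sunglasses", "sunglasses", "smiling", "smiling"]
```

The per-prompt token count is then `text_inputs.length - 2`, discounting the `startoftext` and `endoftext` specials, and is later used to slice the cross-attention maps.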

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusion.invert(image, source_prompt='', source_guidance_scale=3.5, num_inversion_steps=30, skip=0.15, generator=None, cross_attention_kwargs=None, clip_skip=None, height=None, width=None, resize_mode='default', crops_coords=None)

Runs the pipeline's image inversion as described in the LEDITS++ Paper. If the scheduler is set to [~schedulers.DDIMScheduler], the inversion proposed by edit-friendly DDPM will be performed instead.

PARAMETER DESCRIPTION
image

Input for the image(s) that are to be edited. Multiple input images must share the same aspect ratio.

TYPE: `PipelineImageInput`

source_prompt

Prompt describing the input image that will be used for guidance during inversion. Guidance is disabled if the source_prompt is "".

TYPE: `str`, defaults to `""` DEFAULT: ''

source_guidance_scale

Strength of guidance during inversion.

TYPE: `float`, defaults to `3.5` DEFAULT: 3.5

num_inversion_steps

Number of total performed inversion steps after discarding the initial skip steps.

TYPE: `int`, defaults to `30` DEFAULT: 30

skip

Portion of initial steps that will be ignored for inversion and subsequent generation. Lower values will lead to stronger changes to the input image. skip has to be between 0 and 1.

TYPE: `float`, defaults to `0.15` DEFAULT: 0.15

generator

A np.random.Generator to make inversion deterministic.

TYPE: `np.random.Generator`, *optional* DEFAULT: None

cross_attention_kwargs

A kwargs dictionary that if specified is passed along to the [AttentionProcessor] as defined in self.processor.

TYPE: `dict`, *optional* DEFAULT: None

clip_skip

Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

TYPE: `int`, *optional* DEFAULT: None

height

The height of the preprocessed image. If None, get_default_height_width() is used to determine the default height.

TYPE: `int`, *optional*, defaults to `None` DEFAULT: None

width

The width of the preprocessed image. If None, get_default_height_width() is used to determine the default width.

TYPE: `int`, *optional*, defaults to `None` DEFAULT: None

resize_mode

The resize mode, can be one of default, fill, or crop. If default, the image is resized to fit within the specified width and height, which may not maintain the original aspect ratio. If fill, the image is resized to fit within the specified width and height while maintaining the aspect ratio, then centered within the dimensions, with empty areas filled with data from the image. If crop, the image is resized to fit within the specified width and height while maintaining the aspect ratio, then centered within the dimensions, with the excess cropped. Note that resize_mode fill and crop are only supported for PIL image input.

TYPE: `str`, *optional*, defaults to `default` DEFAULT: 'default'

crops_coords

The crop coordinates for each image in the batch. If None, will not crop the image.

TYPE: `List[Tuple[int, int, int, int]]`, *optional*, defaults to `None` DEFAULT: None

RETURNS DESCRIPTION

[~pipelines.ledits_pp.LEditsPPInversionPipelineOutput]: Output will contain the resized input image(s)

and respective VAE reconstruction(s).

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion.py
def invert(
    self,
    image: PipelineImageInput,
    source_prompt: str = "",
    source_guidance_scale: float = 3.5,
    num_inversion_steps: int = 30,
    skip: float = 0.15,
    generator: Optional[np.random.Generator] = None,
    cross_attention_kwargs: Optional[Dict[str, Any]] = None,
    clip_skip: Optional[int] = None,
    height: Optional[int] = None,
    width: Optional[int] = None,
    resize_mode: Optional[str] = "default",
    crops_coords: Optional[Tuple[int, int, int, int]] = None,
):
    r"""
    Runs the pipeline's image inversion as described in the [LEDITS++
    Paper](https://arxiv.org/abs/2301.12247). If the scheduler is set to [`~schedulers.DDIMScheduler`], the
    inversion proposed by [edit-friendly DDPM](https://arxiv.org/abs/2304.06140) will be performed instead.

    Args:
        image (`PipelineImageInput`):
            Input for the image(s) that are to be edited. Multiple input images must share the same aspect
            ratio.
        source_prompt (`str`, defaults to `""`):
            Prompt describing the input image that will be used for guidance during inversion. Guidance is disabled
            if the `source_prompt` is `""`.
        source_guidance_scale (`float`, defaults to `3.5`):
            Strength of guidance during inversion.
        num_inversion_steps (`int`, defaults to `30`):
            Number of total performed inversion steps after discarding the initial `skip` steps.
        skip (`float`, defaults to `0.15`):
            Portion of initial steps that will be ignored for inversion and subsequent generation. Lower values
            will lead to stronger changes to the input image. `skip` has to be between `0` and `1`.
        generator (`np.random.Generator`, *optional*):
            A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make inversion
            deterministic.
        cross_attention_kwargs (`dict`, *optional*):
            A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
            [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
        clip_skip (`int`, *optional*):
            Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
            the output of the pre-final layer will be used for computing the prompt embeddings.
        height (`int`, *optional*, defaults to `None`):
            The height of the preprocessed image. If `None`, `get_default_height_width()` is used to get the
            default height.
        width (`int`, *optional*, defaults to `None`):
            The width of the preprocessed image. If `None`, `get_default_height_width()` is used to get the
            default width.
        resize_mode (`str`, *optional*, defaults to `default`):
            The resize mode, can be one of `default`, `fill`, or `crop`. If `default`, will resize the image to fit
            within the specified width and height, and it may not maintain the original aspect ratio. If `fill`,
            will resize the image to fit within the specified width and height, maintaining the aspect ratio, and
            then center the image within the dimensions, filling empty areas with data from the image. If `crop`,
            will resize the image to fit within the specified width and height, maintaining the aspect ratio, and
            then center the image within the dimensions, cropping the excess. Note that resize_mode `fill` and
            `crop` are only supported for PIL image input.
        crops_coords (`Tuple[int, int, int, int]`, *optional*, defaults to `None`):
            The crop coordinates applied to each image in the batch. If `None`, the images will not be cropped.

    Returns:
        [`~pipelines.ledits_pp.LEditsPPInversionPipelineOutput`]: Output will contain the resized input image(s)
        and respective VAE reconstruction(s).
    """
    if (height is not None and height % 32 != 0) or (width is not None and width % 32 != 0):
        raise ValueError("height and width must be multiples of 32.")
    # Reset attn processor, we do not want to store attn maps during inversion
    self.unet.set_attn_processor(AttnProcessor())

    self.eta = 1.0

    self.scheduler.config.timestep_spacing = "leading"
    self.scheduler.set_timesteps(int(num_inversion_steps * (1 + skip)))
    self.inversion_steps = self.scheduler.timesteps[-num_inversion_steps:]
    timesteps = self.inversion_steps

    # 1. encode image
    x0, resized = self.encode_image(
        image,
        dtype=self.text_encoder.dtype,
        height=height,
        width=width,
        resize_mode=resize_mode,
        crops_coords=crops_coords,
    )
    self.batch_size = x0.shape[0]

    # autoencoder reconstruction
    image_rec = self.vae.decode(x0 / self.vae.config.scaling_factor, return_dict=False, generator=generator)[0]
    image_rec = self.image_processor.postprocess(image_rec, output_type="pil")

    # 2. get embeddings
    do_classifier_free_guidance = source_guidance_scale > 1.0

    lora_scale = cross_attention_kwargs.get("scale", None) if cross_attention_kwargs is not None else None

    uncond_embedding, text_embeddings, _ = self.encode_prompt(
        num_images_per_prompt=1,
        negative_prompt=None,
        enable_edit_guidance=do_classifier_free_guidance,
        editing_prompt=source_prompt,
        lora_scale=lora_scale,
        clip_skip=clip_skip,
    )

    # 3. find zs and xts
    variance_noise_shape = (num_inversion_steps, *x0.shape)

    # intermediate latents
    t_to_idx = {int(v): k for k, v in enumerate(timesteps)}
    xts = mint.zeros(size=variance_noise_shape, dtype=uncond_embedding.dtype)

    for t in reversed(timesteps):
        idx = num_inversion_steps - t_to_idx[int(t)] - 1
        noise = randn_tensor(shape=x0.shape, generator=generator, dtype=x0.dtype)
        xts[idx] = self.scheduler.add_noise(x0, noise, ms.tensor([t]))
    xts = mint.cat([x0.unsqueeze(0), xts], dim=0)

    self.scheduler.set_timesteps(len(self.scheduler.timesteps))
    # noise maps
    zs = mint.zeros(size=variance_noise_shape, dtype=uncond_embedding.dtype)

    with self.progress_bar(total=len(timesteps)) as progress_bar:
        for t in timesteps:
            idx = num_inversion_steps - t_to_idx[int(t)] - 1
            # 1. predict noise residual
            xt = xts[idx + 1]

            noise_pred = self.unet(xt, timestep=t, encoder_hidden_states=uncond_embedding)[0]

            if source_prompt != "":
                noise_pred_cond = self.unet(xt, timestep=t, encoder_hidden_states=text_embeddings)[0]
                noise_pred = noise_pred + source_guidance_scale * (noise_pred_cond - noise_pred)

            xtm1 = xts[idx]
            z, xtm1_corrected = compute_noise(self.scheduler, xtm1, xt, t, noise_pred, self.eta)
            zs[idx] = z

            # correction to avoid error accumulation
            xts[idx] = xtm1_corrected

            progress_bar.update()

    self.init_latents = xts[-1].broadcast_to((self.batch_size, -1, -1, -1))
    zs = zs.flip((0,))
    self.zs = zs

    return LEditsPPInversionPipelineOutput(images=resized, vae_reconstruction_images=image_rec)
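The interplay of `num_inversion_steps` and `skip` above can be sketched in plain Python. This is a simplified, standalone illustration — the helper name `inversion_timesteps` and the approximated `"leading"` spacing arithmetic are assumptions for illustration, not pipeline API:

```python
def inversion_timesteps(num_inversion_steps: int, skip: float, num_train_timesteps: int = 1000):
    # Schedule more steps than will be inverted, mirroring
    # `set_timesteps(int(num_inversion_steps * (1 + skip)))` above.
    total_steps = int(num_inversion_steps * (1 + skip))
    # Approximate "leading" spacing: evenly spaced timesteps, largest first.
    step_ratio = num_train_timesteps // total_steps
    all_timesteps = [i * step_ratio for i in range(total_steps - 1, -1, -1)]
    # Only the last `num_inversion_steps` (the low-noise end) are inverted;
    # the initial `skip` portion of the trajectory is discarded.
    return all_timesteps[-num_inversion_steps:]

ts = inversion_timesteps(30, 0.15)  # 30 timesteps, high noise first, ending at 0
```

A larger `skip` therefore pushes the retained window toward lower noise levels, which is why lower values of `skip` lead to stronger changes to the input image.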

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusionXL

Bases: DiffusionPipeline, FromSingleFileMixin, StableDiffusionXLLoraLoaderMixin, TextualInversionLoaderMixin, IPAdapterMixin

Pipeline for textual image editing using LEDits++ with Stable Diffusion XL.

This model inherits from [DiffusionPipeline] and builds on the [StableDiffusionXLPipeline]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

In addition, the pipeline inherits the following loading methods:
  • LoRA: [LEditsPPPipelineStableDiffusionXL.load_lora_weights]
  • Ckpt: [loaders.FromSingleFileMixin.from_single_file]
as well as the following saving methods:
  • LoRA: [loaders.StableDiffusionXLPipeline.save_lora_weights]
PARAMETER DESCRIPTION
vae

Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.

TYPE: [`AutoencoderKL`]

text_encoder

Frozen text-encoder. Stable Diffusion XL uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.

TYPE: [`~transformers.CLIPTextModel`]

text_encoder_2

Second frozen text-encoder. Stable Diffusion XL uses the text and pool portion of CLIP, specifically the laion/CLIP-ViT-bigG-14-laion2B-39B-b160k variant.

TYPE: [`~transformers.CLIPTextModelWithProjection`]

tokenizer

Tokenizer of class CLIPTokenizer.

TYPE: [`~transformers.CLIPTokenizer`]

tokenizer_2

Second Tokenizer of class CLIPTokenizer.

TYPE: [`~transformers.CLIPTokenizer`]

unet

Conditional U-Net architecture to denoise the encoded image latents.

TYPE: [`UNet2DConditionModel`]

scheduler

A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of [DPMSolverMultistepScheduler] or [DDIMScheduler]. If any other scheduler is passed it will automatically be set to [DPMSolverMultistepScheduler].

TYPE: [`DPMSolverMultistepScheduler`] or [`DDIMScheduler`]

force_zeros_for_empty_prompt

Whether the negative prompt embeddings shall be forced to always be set to 0. Also see the config of stabilityai/stable-diffusion-xl-base-1.0.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

add_watermarker

Whether to use the invisible_watermark library to watermark output images. If not defined, it will default to True if the package is installed, otherwise no watermarker will be used.

TYPE: `bool`, *optional* DEFAULT: None

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion_xl.py
class LEditsPPPipelineStableDiffusionXL(
    DiffusionPipeline,
    FromSingleFileMixin,
    StableDiffusionXLLoraLoaderMixin,
    TextualInversionLoaderMixin,
    IPAdapterMixin,
):
    """
    Pipeline for textual image editing using LEDits++ with Stable Diffusion XL.

    This model inherits from [`DiffusionPipeline`] and builds on the [`StableDiffusionXLPipeline`]. Check the
    superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a
    particular device, etc.).

    In addition the pipeline inherits the following loading methods:
        - *LoRA*: [`LEditsPPPipelineStableDiffusionXL.load_lora_weights`]
        - *Ckpt*: [`loaders.FromSingleFileMixin.from_single_file`]

    as well as the following saving methods:
        - *LoRA*: [`loaders.StableDiffusionXLPipeline.save_lora_weights`]

    Args:
        vae ([`AutoencoderKL`]):
            Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
        text_encoder ([`~transformers.CLIPTextModel`]):
            Frozen text-encoder. Stable Diffusion XL uses the text portion of
            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically
            the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant.
        text_encoder_2 ([`~transformers.CLIPTextModelWithProjection`]):
            Second frozen text-encoder. Stable Diffusion XL uses the text and pool portion of
            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
            specifically the
            [laion/CLIP-ViT-bigG-14-laion2B-39B-b160k](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)
            variant.
        tokenizer ([`~transformers.CLIPTokenizer`]):
            Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        tokenizer_2 ([`~transformers.CLIPTokenizer`]):
            Second Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
        scheduler ([`DPMSolverMultistepScheduler`] or [`DDIMScheduler`]):
            A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
            [`DPMSolverMultistepScheduler`] or [`DDIMScheduler`]. If any other scheduler is passed it will
            automatically be set to [`DPMSolverMultistepScheduler`].
        force_zeros_for_empty_prompt (`bool`, *optional*, defaults to `True`):
            Whether the negative prompt embeddings shall be forced to always be set to 0. Also see the config of
            `stabilityai/stable-diffusion-xl-base-1.0`.
        add_watermarker (`bool`, *optional*):
            Whether to use the [invisible_watermark library](https://github.com/ShieldMnt/invisible-watermark/) to
            watermark output images. If not defined, it will default to True if the package is installed, otherwise no
            watermarker will be used.
    """

    model_cpu_offload_seq = "text_encoder->text_encoder_2->unet->vae"
    _optional_components = [
        "tokenizer",
        "tokenizer_2",
        "text_encoder",
        "text_encoder_2",
        "image_encoder",
        "feature_extractor",
    ]
    _callback_tensor_inputs = [
        "latents",
        "prompt_embeds",
        "negative_prompt_embeds",
        "add_text_embeds",
        "add_time_ids",
        "negative_pooled_prompt_embeds",
        "negative_add_time_ids",
    ]

    def __init__(
        self,
        vae: AutoencoderKL,
        text_encoder: CLIPTextModel,
        text_encoder_2: CLIPTextModelWithProjection,
        tokenizer: CLIPTokenizer,
        tokenizer_2: CLIPTokenizer,
        unet: UNet2DConditionModel,
        scheduler: Union[DPMSolverMultistepScheduler, DDIMScheduler],
        image_encoder: CLIPVisionModelWithProjection = None,
        feature_extractor: CLIPImageProcessor = None,
        force_zeros_for_empty_prompt: bool = True,
        add_watermarker: Optional[bool] = None,
    ):
        super().__init__()

        self.register_modules(
            vae=vae,
            text_encoder=text_encoder,
            text_encoder_2=text_encoder_2,
            tokenizer=tokenizer,
            tokenizer_2=tokenizer_2,
            unet=unet,
            scheduler=scheduler,
            image_encoder=image_encoder,
            feature_extractor=feature_extractor,
        )
        self.register_to_config(force_zeros_for_empty_prompt=force_zeros_for_empty_prompt)
        self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) if getattr(self, "vae", None) else 8
        self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)

        if not isinstance(scheduler, DDIMScheduler) and not isinstance(scheduler, DPMSolverMultistepScheduler):
            self.scheduler = DPMSolverMultistepScheduler.from_config(
                scheduler.config, algorithm_type="sde-dpmsolver++", solver_order=2
            )
            logger.warning(
                "This pipeline only supports DDIMScheduler and DPMSolverMultistepScheduler. "
                "The scheduler has been changed to DPMSolverMultistepScheduler."
            )

        self.default_sample_size = (
            self.unet.config.sample_size
            if hasattr(self, "unet") and self.unet is not None and hasattr(self.unet.config, "sample_size")
            else 128
        )

        add_watermarker = add_watermarker if add_watermarker is not None else is_invisible_watermark_available()

        if add_watermarker:
            self.watermark = StableDiffusionXLWatermarker()
        else:
            self.watermark = None
        self.inversion_steps = None

    def encode_prompt(
        self,
        num_images_per_prompt: int = 1,
        negative_prompt: Optional[str] = None,
        negative_prompt_2: Optional[str] = None,
        negative_prompt_embeds: Optional[ms.Tensor] = None,
        negative_pooled_prompt_embeds: Optional[ms.Tensor] = None,
        lora_scale: Optional[float] = None,
        clip_skip: Optional[int] = None,
        enable_edit_guidance: bool = True,
        editing_prompt: Optional[str] = None,
        editing_prompt_embeds: Optional[ms.Tensor] = None,
        editing_pooled_prompt_embeds: Optional[ms.Tensor] = None,
    ) -> object:
        r"""
        Encodes the prompt into text encoder hidden states.

        Args:
            num_images_per_prompt (`int`):
                number of images that should be generated per prompt
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation. If not defined, one has to pass
                `negative_prompt_embeds` instead.
            negative_prompt_2 (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
                `text_encoder_2`. If not defined, `negative_prompt` is used in both text-encoders
            negative_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
                argument.
            negative_pooled_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
                weighting. If not provided, pooled negative_prompt_embeds will be generated from `negative_prompt`
                input argument.
            lora_scale (`float`, *optional*):
                A lora scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
            clip_skip (`int`, *optional*):
                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
                the output of the pre-final layer will be used for computing the prompt embeddings.
            enable_edit_guidance (`bool`):
                Whether to guide towards an editing prompt or not.
            editing_prompt (`str` or `List[str]`, *optional*):
                Editing prompt(s) to be encoded. If not defined and 'enable_edit_guidance' is True, one has to pass
                `editing_prompt_embeds` instead.
            editing_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated edit text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
                If not provided and 'enable_edit_guidance' is True, editing_prompt_embeds will be generated from
                `editing_prompt` input argument.
            editing_pooled_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated edit pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
                weighting. If not provided, pooled editing_pooled_prompt_embeds will be generated from `editing_prompt`
                input argument.
        """
        # set lora scale so that monkey patched LoRA
        # function of text encoder can correctly access it
        if lora_scale is not None and isinstance(self, StableDiffusionXLLoraLoaderMixin):
            self._lora_scale = lora_scale

            # dynamically adjust the LoRA scale
            if self.text_encoder is not None:
                scale_lora_layers(self.text_encoder, lora_scale)

            if self.text_encoder_2 is not None:
                scale_lora_layers(self.text_encoder_2, lora_scale)

        batch_size = self.batch_size

        # Define tokenizers and text encoders
        tokenizers = [self.tokenizer, self.tokenizer_2] if self.tokenizer is not None else [self.tokenizer_2]
        text_encoders = (
            [self.text_encoder, self.text_encoder_2] if self.text_encoder is not None else [self.text_encoder_2]
        )
        num_edit_tokens = 0

        # get unconditional embeddings for classifier free guidance
        zero_out_negative_prompt = negative_prompt is None and self.config.force_zeros_for_empty_prompt

        if negative_prompt_embeds is None:
            negative_prompt = negative_prompt or ""
            negative_prompt_2 = negative_prompt_2 or negative_prompt

            # normalize str to list
            negative_prompt = batch_size * [negative_prompt] if isinstance(negative_prompt, str) else negative_prompt
            negative_prompt_2 = (
                batch_size * [negative_prompt_2] if isinstance(negative_prompt_2, str) else negative_prompt_2
            )

            uncond_tokens: List[str]

            if batch_size != len(negative_prompt):
                raise ValueError(
                    f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but image inversion "
                    f" has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
                    " the batch size of the input images."
                )
            else:
                uncond_tokens = [negative_prompt, negative_prompt_2]

            negative_prompt_embeds_list = []
            for negative_prompt, tokenizer, text_encoder in zip(uncond_tokens, tokenizers, text_encoders):
                if isinstance(self, TextualInversionLoaderMixin):
                    negative_prompt = self.maybe_convert_prompt(negative_prompt, tokenizer)

                uncond_input = tokenizer(
                    negative_prompt,
                    padding="max_length",
                    max_length=tokenizer.model_max_length,
                    truncation=True,
                    return_tensors="np",
                )

                negative_prompt_embeds = text_encoder(
                    ms.tensor(uncond_input.input_ids),
                    output_hidden_states=True,
                )
                # We are always only interested in the pooled output of the final text encoder
                negative_pooled_prompt_embeds = negative_prompt_embeds[0]
                negative_prompt_embeds = negative_prompt_embeds[2][-2]

                negative_prompt_embeds_list.append(negative_prompt_embeds)

            negative_prompt_embeds = mint.concat(negative_prompt_embeds_list, dim=-1)

            if zero_out_negative_prompt:
                negative_prompt_embeds = mint.zeros_like(negative_prompt_embeds)
                negative_pooled_prompt_embeds = mint.zeros_like(negative_pooled_prompt_embeds)

        if enable_edit_guidance and editing_prompt_embeds is None:
            editing_prompt_2 = editing_prompt

            editing_prompts = [editing_prompt, editing_prompt_2]
            edit_prompt_embeds_list = []

            for editing_prompt, tokenizer, text_encoder in zip(editing_prompts, tokenizers, text_encoders):
                if isinstance(self, TextualInversionLoaderMixin):
                    editing_prompt = self.maybe_convert_prompt(editing_prompt, tokenizer)

                max_length = negative_prompt_embeds.shape[1]
                edit_concepts_input = tokenizer(
                    # [x for item in editing_prompt for x in repeat(item, batch_size)],
                    editing_prompt,
                    padding="max_length",
                    max_length=max_length,
                    truncation=True,
                    return_tensors="np",
                    return_length=True,
                )
                num_edit_tokens = edit_concepts_input.length - 2

                edit_concepts_embeds = text_encoder(
                    ms.tensor(edit_concepts_input.input_ids),
                    output_hidden_states=True,
                )
                # We are always only interested in the pooled output of the final text encoder
                editing_pooled_prompt_embeds = edit_concepts_embeds[0]
                if clip_skip is None:
                    edit_concepts_embeds = edit_concepts_embeds[2][-2]
                else:
                    # "2" because SDXL always indexes from the penultimate layer.
                    edit_concepts_embeds = edit_concepts_embeds[2][-(clip_skip + 2)]

                edit_prompt_embeds_list.append(edit_concepts_embeds)

            edit_concepts_embeds = mint.concat(edit_prompt_embeds_list, dim=-1)
        elif not enable_edit_guidance:
            edit_concepts_embeds = None
            editing_pooled_prompt_embeds = None

        negative_prompt_embeds = negative_prompt_embeds.to(dtype=self.text_encoder_2.dtype)
        bs_embed, seq_len, _ = negative_prompt_embeds.shape
        # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
        negative_prompt_embeds = negative_prompt_embeds.tile((1, num_images_per_prompt, 1))
        negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)

        if enable_edit_guidance:
            bs_embed_edit, seq_len, _ = edit_concepts_embeds.shape
            edit_concepts_embeds = edit_concepts_embeds.to(dtype=self.text_encoder_2.dtype)
            edit_concepts_embeds = edit_concepts_embeds.tile((1, num_images_per_prompt, 1))
            edit_concepts_embeds = edit_concepts_embeds.view(bs_embed_edit * num_images_per_prompt, seq_len, -1)

        negative_pooled_prompt_embeds = negative_pooled_prompt_embeds.tile((1, num_images_per_prompt)).view(
            bs_embed * num_images_per_prompt, -1
        )

        if enable_edit_guidance:
            editing_pooled_prompt_embeds = editing_pooled_prompt_embeds.tile((1, num_images_per_prompt)).view(
                bs_embed_edit * num_images_per_prompt, -1
            )

        if self.text_encoder is not None:
            if isinstance(self, StableDiffusionXLLoraLoaderMixin):
                # Retrieve the original scale by scaling back the LoRA layers
                unscale_lora_layers(self.text_encoder, lora_scale)

        if self.text_encoder_2 is not None:
            if isinstance(self, StableDiffusionXLLoraLoaderMixin):
                # Retrieve the original scale by scaling back the LoRA layers
                unscale_lora_layers(self.text_encoder_2, lora_scale)

        return (
            negative_prompt_embeds,
            edit_concepts_embeds,
            negative_pooled_prompt_embeds,
            editing_pooled_prompt_embeds,
            num_edit_tokens,
        )
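The `clip_skip` indexing used in `encode_prompt` can be illustrated with a plain list standing in for the encoder's per-layer hidden states. This is a hypothetical standalone sketch — `select_hidden_state` is not pipeline API:

```python
def select_hidden_state(hidden_states, clip_skip=None):
    # SDXL always reads the penultimate layer, so `clip_skip` counts
    # additional layers back from there: clip_skip=1 skips one more layer.
    if clip_skip is None:
        return hidden_states[-2]
    return hidden_states[-(clip_skip + 2)]

# Stand-in for the tuple returned with output_hidden_states=True:
layers = [f"layer_{i}" for i in range(12)]
penultimate = select_hidden_state(layers)          # layer_10
skipped_one = select_hidden_state(layers, clip_skip=1)  # layer_9
```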

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
    def prepare_extra_step_kwargs(self, eta, generator=None):
        # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
        # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
        # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
        # and should be between [0, 1]

        accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
        extra_step_kwargs = {}
        if accepts_eta:
            extra_step_kwargs["eta"] = eta

        # check if the scheduler accepts generator
        accepts_generator = "generator" in set(inspect.signature(self.scheduler.step).parameters.keys())
        if accepts_generator:
            extra_step_kwargs["generator"] = generator
        return extra_step_kwargs

    def check_inputs(
        self,
        negative_prompt=None,
        negative_prompt_2=None,
        negative_prompt_embeds=None,
        negative_pooled_prompt_embeds=None,
    ):
        if negative_prompt is not None and negative_prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
            )
        elif negative_prompt_2 is not None and negative_prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `negative_prompt_2`: {negative_prompt_2} and `negative_prompt_embeds`:"
                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
            )

        if negative_prompt_embeds is not None and negative_pooled_prompt_embeds is None:
            raise ValueError(
                "If `negative_prompt_embeds` are provided, `negative_pooled_prompt_embeds` also have to be passed. Make sure to generate `negative_pooled_prompt_embeds` from the same text encoder that was used to generate `negative_prompt_embeds`."  # noqa: E501
            )

    # Modified from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_latents
    def prepare_latents(self, latents):
        dtype = latents.dtype

        # scale the initial noise by the standard deviation required by the scheduler
        latents = (latents * self.scheduler.init_noise_sigma).to(dtype)
        return latents

    def _get_add_time_ids(
        self, original_size, crops_coords_top_left, target_size, dtype, text_encoder_projection_dim=None
    ):
        add_time_ids = list(original_size + crops_coords_top_left + target_size)

        passed_add_embed_dim = (
            self.unet.config.addition_time_embed_dim * len(add_time_ids) + text_encoder_projection_dim
        )
        expected_add_embed_dim = self.unet.add_embedding.linear_1.in_features

        if expected_add_embed_dim != passed_add_embed_dim:
            raise ValueError(
                f"Model expects an added time embedding vector of length {expected_add_embed_dim}, but a vector of {passed_add_embed_dim} was created. The model has an incorrect config. Please check `unet.config.time_embedding_type` and `text_encoder_2.config.projection_dim`."  # noqa: E501
            )

        add_time_ids = ms.tensor([add_time_ids], dtype=dtype)
        return add_time_ids

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_upscale.StableDiffusionUpscalePipeline.upcast_vae
    def upcast_vae(self):
        self.vae.to(dtype=ms.float32)

    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
    def get_guidance_scale_embedding(
        self, w: ms.Tensor, embedding_dim: int = 512, dtype: ms.Type = ms.float32
    ) -> ms.Tensor:
        """
        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

        Args:
            w (`ms.Tensor`):
                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
            embedding_dim (`int`, *optional*, defaults to 512):
                Dimension of the embeddings to generate.
            dtype (`ms.Type`, *optional*, defaults to `ms.float32`):
                Data type of the generated embeddings.

        Returns:
            `ms.Tensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
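
        Examples:

        A standalone NumPy sketch of the same sinusoidal embedding (the helper name is illustrative, not part of
        this pipeline's API):

        ```python
        import numpy as np

        def guidance_scale_embedding(w, embedding_dim=512):
            # scale the conditioning signal as in the VDM reference implementation
            w = np.asarray(w, dtype=np.float32) * 1000.0
            half_dim = embedding_dim // 2
            # geometric frequency ladder, identical to standard timestep embeddings
            freqs = np.exp(np.arange(half_dim, dtype=np.float32) * -(np.log(10000.0) / (half_dim - 1)))
            emb = w[:, None] * freqs[None, :]
            emb = np.concatenate([np.sin(emb), np.cos(emb)], axis=1)
            if embedding_dim % 2 == 1:  # zero pad odd dimensions
                emb = np.pad(emb, ((0, 0), (0, 1)))
            return emb

        print(guidance_scale_embedding([7.5], 8).shape)  # (1, 8)
        ```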
        """
        assert len(w.shape) == 1
        w = w * 1000.0

        half_dim = embedding_dim // 2
        emb = mint.log(ms.tensor(10000.0)) / (half_dim - 1)
        emb = mint.exp(mint.arange(half_dim, dtype=dtype) * -emb)
        emb = w.to(dtype)[:, None] * emb[None, :]
        emb = mint.cat([mint.sin(emb), mint.cos(emb)], dim=1)
        if embedding_dim % 2 == 1:  # zero pad
            emb = mint.nn.functional.pad(emb, (0, 1))
        assert emb.shape == (w.shape[0], embedding_dim)
        return emb

    @property
    def guidance_scale(self):
        return self._guidance_scale

    @property
    def guidance_rescale(self):
        return self._guidance_rescale

    @property
    def clip_skip(self):
        return self._clip_skip

    # here `guidance_scale` is defined analogously to the guidance weight `w` of equation (2)
    # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
    # corresponds to doing no classifier free guidance.
    @property
    def do_classifier_free_guidance(self):
        return self._guidance_scale > 1 and self.unet.config.time_cond_proj_dim is None

    @property
    def cross_attention_kwargs(self):
        return self._cross_attention_kwargs

    @property
    def denoising_end(self):
        return self._denoising_end

    @property
    def num_timesteps(self):
        return self._num_timesteps

    def enable_vae_slicing(self):
        r"""
        Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
        compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
        """
        self.vae.enable_slicing()

    def disable_vae_slicing(self):
        r"""
        Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
        computing decoding in one step.
        """
        self.vae.disable_slicing()

    def enable_vae_tiling(self):
        r"""
        Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
        compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
        processing larger images.
        """
        self.vae.enable_tiling()

    def disable_vae_tiling(self):
        r"""
        Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
        computing decoding in one step.
        """
        self.vae.disable_tiling()

    # Copied from diffusers.pipelines.ledits_pp.pipeline_leditspp_stable_diffusion.LEditsPPPipelineStableDiffusion.prepare_unet
    def prepare_unet(self, attention_store, PnP: bool = False):
        attn_procs = {}
        for name in self.unet.attn_processors.keys():
            if name.startswith("mid_block"):
                place_in_unet = "mid"
            elif name.startswith("up_blocks"):
                place_in_unet = "up"
            elif name.startswith("down_blocks"):
                place_in_unet = "down"
            else:
                continue

            if "attn2" in name and place_in_unet != "mid":
                attn_procs[name] = LEDITSCrossAttnProcessor(
                    attention_store=attention_store,
                    place_in_unet=place_in_unet,
                    pnp=PnP,
                    editing_prompts=self.enabled_editing_prompts,
                )
            else:
                attn_procs[name] = AttnProcessor()

        self.unet.set_attn_processor(attn_procs)

    def __call__(
        self,
        denoising_end: Optional[float] = None,
        negative_prompt: Optional[Union[str, List[str]]] = None,
        negative_prompt_2: Optional[Union[str, List[str]]] = None,
        negative_prompt_embeds: Optional[ms.Tensor] = None,
        negative_pooled_prompt_embeds: Optional[ms.Tensor] = None,
        ip_adapter_image: Optional[PipelineImageInput] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
        cross_attention_kwargs: Optional[Dict[str, Any]] = None,
        guidance_rescale: float = 0.0,
        crops_coords_top_left: Tuple[int, int] = (0, 0),
        target_size: Optional[Tuple[int, int]] = None,
        editing_prompt: Optional[Union[str, List[str]]] = None,
        editing_prompt_embeddings: Optional[ms.Tensor] = None,
        editing_pooled_prompt_embeds: Optional[ms.Tensor] = None,
        reverse_editing_direction: Optional[Union[bool, List[bool]]] = False,
        edit_guidance_scale: Optional[Union[float, List[float]]] = 5,
        edit_warmup_steps: Optional[Union[int, List[int]]] = 0,
        edit_cooldown_steps: Optional[Union[int, List[int]]] = None,
        edit_threshold: Optional[Union[float, List[float]]] = 0.9,
        sem_guidance: Optional[List[ms.Tensor]] = None,
        use_cross_attn_mask: bool = False,
        use_intersect_mask: bool = False,
        user_mask: Optional[ms.Tensor] = None,
        attn_store_steps: Optional[List[int]] = [],
        store_averaged_over_steps: bool = True,
        clip_skip: Optional[int] = None,
        callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
        **kwargs,
    ):
        r"""
        The call function to the pipeline for editing. The
        [`~pipelines.ledits_pp.LEditsPPPipelineStableDiffusionXL.invert`] method has to be called beforehand. Edits
        will always be performed for the last inverted image(s).

        Args:
            denoising_end (`float`, *optional*):
                When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be
                completed before it is intentionally prematurely terminated. As a result, the returned sample will
                still retain a substantial amount of noise as determined by the discrete timesteps selected by the
                scheduler. The `denoising_end` parameter should ideally be utilized when this pipeline forms a part of
                a "Mixture of Denoisers" multi-pipeline setup, as elaborated in [**Refining the Image
                Output**](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl#refining-the-image-output).
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation. If not defined, one has to pass
                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
                less than `1`).
            negative_prompt_2 (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
                `text_encoder_2`. If not defined, `negative_prompt` is used in both text encoders.
            negative_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
                argument.
            negative_pooled_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
                weighting. If not provided, pooled negative_prompt_embeds will be generated from `negative_prompt`
                input argument.
            ip_adapter_image (`PipelineImageInput`, *optional*):
                Optional image input to work with IP Adapters.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between
                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] instead
                of a plain tuple.
            cross_attention_kwargs (`dict`, *optional*):
                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
                `self.processor` in
                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
            guidance_rescale (`float`, *optional*, defaults to 0.0):
                Guidance rescale factor proposed by [Common Diffusion Noise Schedules and Sample Steps are
                Flawed](https://arxiv.org/pdf/2305.08891.pdf). `guidance_rescale` is defined as `φ` in equation 16 of
                the paper. Guidance rescale factor should fix overexposure when using zero terminal SNR.
            crops_coords_top_left (`Tuple[int]`, *optional*, defaults to (0, 0)):
                `crops_coords_top_left` can be used to generate an image that appears to be "cropped" from the position
                `crops_coords_top_left` downwards. Favorable, well-centered images are usually achieved by setting
                `crops_coords_top_left` to (0, 0). Part of SDXL's micro-conditioning as explained in section 2.2 of
                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
            target_size (`Tuple[int]`, *optional*, defaults to (1024, 1024)):
                For most cases, `target_size` should be set to the desired height and width of the generated image. If
                not specified it will default to `(width, height)`. Part of SDXL's micro-conditioning as explained in
                section 2.2 of [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
            editing_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts to guide the image generation. The image is reconstructed by setting
                `editing_prompt = None`. Guidance direction of prompt should be specified via
                `reverse_editing_direction`.
            editing_prompt_embeddings (`ms.Tensor`, *optional*):
                Pre-generated edit text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
                If not provided, editing_prompt_embeddings will be generated from `editing_prompt` input argument.
            editing_pooled_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated pooled edit text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
                weighting. If not provided, `editing_pooled_prompt_embeds` will be generated from `editing_prompt`
                input argument.
            reverse_editing_direction (`bool` or `List[bool]`, *optional*, defaults to `False`):
                Whether the corresponding prompt in `editing_prompt` should be increased or decreased.
            edit_guidance_scale (`float` or `List[float]`, *optional*, defaults to 5):
                Guidance scale for guiding the image generation. If provided as a list, values should correspond to
                `editing_prompt`. `edit_guidance_scale` is defined as `s_e` of equation 12 of [LEDITS++
                Paper](https://arxiv.org/abs/2311.16711).
            edit_warmup_steps (`int` or `List[int]`, *optional*, defaults to 0):
                Number of diffusion steps (for each prompt) for which guidance is not applied.
            edit_cooldown_steps (`int` or `List[int]`, *optional*, defaults to `None`):
                Number of diffusion steps (for each prompt) after which guidance is no longer applied.
            edit_threshold (`float` or `List[float]`, *optional*, defaults to 0.9):
                Masking threshold of guidance. Threshold should be proportional to the image region that is modified.
                'edit_threshold' is defined as 'λ' of equation 12 of [LEDITS++
                Paper](https://arxiv.org/abs/2311.16711).
            sem_guidance (`List[ms.Tensor]`, *optional*):
                List of pre-generated guidance vectors to be applied at generation. Length of the list has to
                correspond to `num_inference_steps`.
            use_cross_attn_mask (`bool`, *optional*, defaults to `False`):
                Whether cross-attention masks are used. Cross-attention masks are always used when `use_intersect_mask`
                is set to `True`. Cross-attention masks are defined as 'M^1' of equation 12 of [LEDITS++
                paper](https://arxiv.org/pdf/2311.16711.pdf).
            use_intersect_mask (`bool`, *optional*, defaults to `False`):
                Whether the masking term is calculated as intersection of cross-attention masks and masks derived from
                the noise estimate. Cross-attention masks are defined as 'M^1' and masks derived from the noise
                estimate are defined as 'M^2' of equation 12 of [LEDITS++ paper](https://arxiv.org/pdf/2311.16711.pdf).
            user_mask (`ms.Tensor`, *optional*):
                User-provided mask for even better control over the editing process. This is helpful when LEDITS++'s
                implicit masks do not meet user preferences.
            attn_store_steps (`List[int]`, *optional*, defaults to `[]`):
                Steps for which the attention maps are stored in the AttentionStore. Just for visualization purposes.
            store_averaged_over_steps (`bool`, *optional*, defaults to `True`):
                Whether the attention maps for the 'attn_store_steps' are stored averaged over the diffusion steps. If
                `False`, attention maps for each step are stored separately. Just for visualization purposes.
            clip_skip (`int`, *optional*):
                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
                the output of the pre-final layer will be used for computing the prompt embeddings.
            callback_on_step_end (`Callable`, *optional*):
                A function that is called at the end of each denoising step during inference. The function is called
                with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
                `callback_on_step_end_tensor_inputs`.
            callback_on_step_end_tensor_inputs (`List`, *optional*):
                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
                `._callback_tensor_inputs` attribute of your pipeline class.

        Examples:
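
        A minimal usage sketch (checkpoint id, image URL, and edit prompts are illustrative; `invert` must be
        called before editing):

        ```py
        >>> import mindspore as ms
        >>> from mindone.diffusers import LEditsPPPipelineStableDiffusionXL
        >>> from mindone.diffusers.utils import load_image

        >>> pipe = LEditsPPPipelineStableDiffusionXL.from_pretrained(
        ...     "stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16
        ... )

        >>> image = load_image("https://example.com/tennis.jpg").resize((1024, 1024))
        >>> _ = pipe.invert(image=image, num_inversion_steps=50, skip=0.2)

        >>> edited_image = pipe(
        ...     editing_prompt=["tennis ball", "tomato"],
        ...     reverse_editing_direction=[True, False],
        ...     edit_guidance_scale=[5.0, 10.0],
        ...     edit_threshold=[0.9, 0.85],
        ... )[0][0]
        ```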

        Returns:
            [`~pipelines.ledits_pp.LEditsPPDiffusionPipelineOutput`] or `tuple`:
            [`~pipelines.ledits_pp.LEditsPPDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple`.
            When returning a tuple, the first element is a list with the generated images.
        """
        if self.inversion_steps is None:
            raise ValueError(
                "You need to invert an input image first before calling the pipeline. The `invert` method has to be called beforehand. Edits will always be performed for the last inverted image(s)."  # noqa: E501
            )

        eta = self.eta
        num_images_per_prompt = 1
        latents = self.init_latents

        zs = self.zs
        self.scheduler.set_timesteps(len(self.scheduler.timesteps))

        if use_intersect_mask:
            use_cross_attn_mask = True

        if use_cross_attn_mask:
            self.smoothing = LeditsGaussianSmoothing()

        # TODO: Check inputs
        # 1. Check inputs. Raise error if not correct
        # self.check_inputs(
        #    callback_steps,
        #    negative_prompt,
        #    negative_prompt_2,
        #    prompt_embeds,
        #    negative_prompt_embeds,
        #    pooled_prompt_embeds,
        #    negative_pooled_prompt_embeds,
        # )
        self._guidance_rescale = guidance_rescale
        self._clip_skip = clip_skip
        self._cross_attention_kwargs = cross_attention_kwargs
        self._denoising_end = denoising_end

        # 2. Define call parameters
        batch_size = self.batch_size

        if editing_prompt:
            enable_edit_guidance = True
            if isinstance(editing_prompt, str):
                editing_prompt = [editing_prompt]
            self.enabled_editing_prompts = len(editing_prompt)
        elif editing_prompt_embeddings is not None:
            enable_edit_guidance = True
            self.enabled_editing_prompts = editing_prompt_embeddings.shape[0]
        else:
            self.enabled_editing_prompts = 0
            enable_edit_guidance = False

        # 3. Encode input prompt
        text_encoder_lora_scale = (
            cross_attention_kwargs.get("scale", None) if cross_attention_kwargs is not None else None
        )
        (
            prompt_embeds,
            edit_prompt_embeds,
            negative_pooled_prompt_embeds,
            pooled_edit_embeds,
            num_edit_tokens,
        ) = self.encode_prompt(
            num_images_per_prompt=num_images_per_prompt,
            negative_prompt=negative_prompt,
            negative_prompt_2=negative_prompt_2,
            negative_prompt_embeds=negative_prompt_embeds,
            negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
            lora_scale=text_encoder_lora_scale,
            clip_skip=self.clip_skip,
            enable_edit_guidance=enable_edit_guidance,
            editing_prompt=editing_prompt,
            editing_prompt_embeds=editing_prompt_embeddings,
            editing_pooled_prompt_embeds=editing_pooled_prompt_embeds,
        )

        # 4. Prepare timesteps
        # self.scheduler.set_timesteps(num_inference_steps)

        timesteps = self.inversion_steps
        t_to_idx = {int(v): k for k, v in enumerate(timesteps)}

        if use_cross_attn_mask:
            self.attention_store = LeditsAttentionStore(
                average=store_averaged_over_steps,
                batch_size=batch_size,
                max_size=(latents.shape[-2] / 4.0) * (latents.shape[-1] / 4.0),
                max_resolution=None,
            )
            self.prepare_unet(self.attention_store)
            resolution = latents.shape[-2:]
            att_res = (int(resolution[0] / 4), int(resolution[1] / 4))

        # 5. Prepare latent variables
        latents = self.prepare_latents(latents=latents)

        # 6. Prepare extra step kwargs.
        extra_step_kwargs = self.prepare_extra_step_kwargs(eta)

        if self.text_encoder_2 is None:
            text_encoder_projection_dim = int(negative_pooled_prompt_embeds.shape[-1])
        else:
            text_encoder_projection_dim = self.text_encoder_2.config.projection_dim

        # 7. Prepare added time ids & embeddings
        add_text_embeds = negative_pooled_prompt_embeds
        add_time_ids = self._get_add_time_ids(
            self.size,
            crops_coords_top_left,
            self.size,
            dtype=negative_pooled_prompt_embeds.dtype,
            text_encoder_projection_dim=text_encoder_projection_dim,
        )

        if enable_edit_guidance:
            prompt_embeds = mint.cat([prompt_embeds, edit_prompt_embeds], dim=0)
            add_text_embeds = mint.cat([add_text_embeds, pooled_edit_embeds], dim=0)
            edit_concepts_time_ids = add_time_ids.tile((edit_prompt_embeds.shape[0], 1))
            add_time_ids = mint.cat([add_time_ids, edit_concepts_time_ids], dim=0)
            self.text_cross_attention_maps = [editing_prompt] if isinstance(editing_prompt, str) else editing_prompt

        add_time_ids = add_time_ids.tile((batch_size * num_images_per_prompt, 1))

        if ip_adapter_image is not None:
            # TODO: fix image encoding
            image_embeds, negative_image_embeds = self.encode_image(ip_adapter_image, num_images_per_prompt)
            if self.do_classifier_free_guidance:
                image_embeds = mint.cat([negative_image_embeds, image_embeds])

        # 8. Denoising loop
        self.sem_guidance = None
        self.activation_mask = None

        if (
            self.denoising_end is not None
            and isinstance(self.denoising_end, float)
            and self.denoising_end > 0
            and self.denoising_end < 1
        ):
            discrete_timestep_cutoff = int(
                round(
                    self.scheduler.config.num_train_timesteps
                    - (self.denoising_end * self.scheduler.config.num_train_timesteps)
                )
            )
            num_inference_steps = len(list(filter(lambda ts: ts >= discrete_timestep_cutoff, timesteps)))
            timesteps = timesteps[:num_inference_steps]

        # 9. Optionally get Guidance Scale Embedding
        timestep_cond = None
        if self.unet.config.time_cond_proj_dim is not None:
            guidance_scale_tensor = ms.tensor(self.guidance_scale - 1).tile((batch_size * num_images_per_prompt))
            timestep_cond = self.get_guidance_scale_embedding(
                guidance_scale_tensor, embedding_dim=self.unet.config.time_cond_proj_dim
            ).to(dtype=latents.dtype)

        self._num_timesteps = len(timesteps)
        with self.progress_bar(total=self._num_timesteps) as progress_bar:
            for i, t in enumerate(timesteps):
                # expand the latents if we are doing classifier free guidance
                latent_model_input = mint.cat([latents] * (1 + self.enabled_editing_prompts))
                latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
                # predict the noise residual
                added_cond_kwargs = {"text_embeds": add_text_embeds, "time_ids": add_time_ids}
                if ip_adapter_image is not None:
                    added_cond_kwargs["image_embeds"] = image_embeds
                noise_pred = self.unet(
                    latent_model_input,
                    t,
                    encoder_hidden_states=prompt_embeds,
                    cross_attention_kwargs=cross_attention_kwargs,
                    added_cond_kwargs=added_cond_kwargs,
                    return_dict=False,
                )[0]

                noise_pred_out = noise_pred.chunk(1 + self.enabled_editing_prompts)  # [b,4, 64, 64]
                noise_pred_uncond = noise_pred_out[0]
                noise_pred_edit_concepts = noise_pred_out[1:]

                noise_guidance_edit = mint.zeros(
                    noise_pred_uncond.shape,
                    dtype=noise_pred_uncond.dtype,
                )

                if sem_guidance is not None and len(sem_guidance) > i:
                    noise_guidance_edit += sem_guidance[i]

                elif enable_edit_guidance:
                    if self.activation_mask is None:
                        self.activation_mask = mint.zeros(
                            (len(timesteps), self.enabled_editing_prompts, *noise_pred_edit_concepts[0].shape)
                        )
                    if self.sem_guidance is None:
                        self.sem_guidance = mint.zeros((len(timesteps), *noise_pred_uncond.shape))

                    # noise_guidance_edit = mint.zeros_like(noise_guidance)
                    for c, noise_pred_edit_concept in enumerate(noise_pred_edit_concepts):
                        if isinstance(edit_warmup_steps, list):
                            edit_warmup_steps_c = edit_warmup_steps[c]
                        else:
                            edit_warmup_steps_c = edit_warmup_steps
                        if i < edit_warmup_steps_c:
                            continue

                        if isinstance(edit_guidance_scale, list):
                            edit_guidance_scale_c = edit_guidance_scale[c]
                        else:
                            edit_guidance_scale_c = edit_guidance_scale

                        if isinstance(edit_threshold, list):
                            edit_threshold_c = edit_threshold[c]
                        else:
                            edit_threshold_c = edit_threshold
                        if isinstance(reverse_editing_direction, list):
                            reverse_editing_direction_c = reverse_editing_direction[c]
                        else:
                            reverse_editing_direction_c = reverse_editing_direction

                        if isinstance(edit_cooldown_steps, list):
                            edit_cooldown_steps_c = edit_cooldown_steps[c]
                        elif edit_cooldown_steps is None:
                            edit_cooldown_steps_c = i + 1
                        else:
                            edit_cooldown_steps_c = edit_cooldown_steps

                        if i >= edit_cooldown_steps_c:
                            continue

                        noise_guidance_edit_tmp = noise_pred_edit_concept - noise_pred_uncond

                        if reverse_editing_direction_c:
                            noise_guidance_edit_tmp = noise_guidance_edit_tmp * -1

                        noise_guidance_edit_tmp = noise_guidance_edit_tmp * edit_guidance_scale_c

                        if user_mask is not None:
                            noise_guidance_edit_tmp = noise_guidance_edit_tmp * user_mask

                        if use_cross_attn_mask:
                            out = self.attention_store.aggregate_attention(
                                attention_maps=self.attention_store.step_store,
                                prompts=self.text_cross_attention_maps,
                                res=att_res,
                                from_where=["up", "down"],
                                is_cross=True,
                                select=self.text_cross_attention_maps.index(editing_prompt[c]),
                            )
                            attn_map = out[:, :, :, 1 : 1 + num_edit_tokens[c]]  # 0 -> startoftext

                            # average over all tokens
                            if attn_map.shape[3] != num_edit_tokens[c]:
                                raise ValueError(
                                    f"Incorrect shape of attention_map. Expected size {num_edit_tokens[c]}, but found {attn_map.shape[3]}!"
                                )
                            attn_map = mint.sum(attn_map, dim=3)

                            # gaussian_smoothing
                            attn_map = mint.nn.functional.pad(attn_map.unsqueeze(1), (1, 1, 1, 1), mode="reflect")
                            attn_map = self.smoothing(attn_map).squeeze(1)

                            # ops.quantile function expects float32
                            # TODO: ops.quantile is not supported
                            if attn_map.dtype == ms.float32:
                                tmp = ms.tensor(
                                    np.quantile(attn_map.flatten(start_dim=1).numpy(), edit_threshold_c, axis=1)
                                )
                            else:
                                tmp = ms.tensor(
                                    np.quantile(
                                        attn_map.flatten(start_dim=1).to(ms.float32).numpy(), edit_threshold_c, axis=1
                                    )
                                ).to(attn_map.dtype)
                            attn_mask = mint.where(
                                attn_map >= tmp.unsqueeze(1).unsqueeze(1).tile((1, *att_res)), 1.0, 0.0
                            )

                            # resolution must match latent space dimension
                            attn_mask = mint.nn.functional.interpolate(
                                attn_mask.unsqueeze(1),
                                noise_guidance_edit_tmp.shape[-2:],  # 64,64
                            ).tile((1, 4, 1, 1))
                            self.activation_mask[i, c] = ops.stop_gradient(attn_mask)
                            if not use_intersect_mask:
                                noise_guidance_edit_tmp = noise_guidance_edit_tmp * attn_mask

                        if use_intersect_mask:
                            noise_guidance_edit_tmp_quantile = mint.abs(noise_guidance_edit_tmp)
                            noise_guidance_edit_tmp_quantile = mint.sum(
                                noise_guidance_edit_tmp_quantile, dim=1, keepdim=True
                            )
                            noise_guidance_edit_tmp_quantile = noise_guidance_edit_tmp_quantile.tile(
                                (1, self.unet.config.in_channels, 1, 1)
                            )

                            # ops.quantile function expects float32
                            # TODO: ops.quantile is not supported
                            if noise_guidance_edit_tmp_quantile.dtype == ms.float32:
                                tmp = ms.tensor(
                                    np.quantile(
                                        noise_guidance_edit_tmp_quantile.flatten(start_dim=2).numpy(),
                                        edit_threshold_c,
                                        axis=2,
                                        keepdims=False,
                                    )
                                )
                            else:
                                tmp = ms.tensor(
                                    np.quantile(
                                        noise_guidance_edit_tmp_quantile.flatten(start_dim=2).to(ms.float32).numpy(),
                                        edit_threshold_c,
                                        axis=2,
                                        keepdims=False,
                                    )
                                ).to(noise_guidance_edit_tmp_quantile.dtype)

                            intersect_mask = (
                                mint.where(
                                    noise_guidance_edit_tmp_quantile >= tmp[:, :, None, None],
                                    mint.ones_like(noise_guidance_edit_tmp),
                                    mint.zeros_like(noise_guidance_edit_tmp),
                                )
                                * attn_mask
                            )

                            self.activation_mask[i, c] = ops.stop_gradient(intersect_mask)

                            noise_guidance_edit_tmp = noise_guidance_edit_tmp * intersect_mask

                        elif not use_cross_attn_mask:
                            # calculate quantile
                            noise_guidance_edit_tmp_quantile = mint.abs(noise_guidance_edit_tmp)
                            noise_guidance_edit_tmp_quantile = mint.sum(
                                noise_guidance_edit_tmp_quantile, dim=1, keepdim=True
                            )
                            noise_guidance_edit_tmp_quantile = noise_guidance_edit_tmp_quantile.tile((1, 4, 1, 1))

                            # ops.quantile function expects float32
                            # TODO: ops.quantile is not supported
                            if noise_guidance_edit_tmp_quantile.dtype == ms.float32:
                                tmp = ms.tensor(
                                    np.quantile(
                                        noise_guidance_edit_tmp_quantile.flatten(start_dim=2).numpy(),
                                        edit_threshold_c,
                                        axis=2,
                                        keepdims=False,
                                    )
                                )
                            else:
                                tmp = ms.tensor(
                                    np.quantile(
                                        noise_guidance_edit_tmp_quantile.flatten(start_dim=2).to(ms.float32).numpy(),
                                        edit_threshold_c,
                                        axis=2,
                                        keepdims=False,
                                    )
                                ).to(noise_guidance_edit_tmp_quantile.dtype)

                            self.activation_mask[i, c] = ops.stop_gradient(
                                mint.where(
                                    noise_guidance_edit_tmp_quantile >= tmp[:, :, None, None],
                                    mint.ones_like(noise_guidance_edit_tmp),
                                    mint.zeros_like(noise_guidance_edit_tmp),
                                )
                            )

                            noise_guidance_edit_tmp = mint.where(
                                noise_guidance_edit_tmp_quantile >= tmp[:, :, None, None],
                                noise_guidance_edit_tmp,
                                mint.zeros_like(noise_guidance_edit_tmp),
                            )

                        noise_guidance_edit += noise_guidance_edit_tmp

                    self.sem_guidance[i] = ops.stop_gradient(noise_guidance_edit)

                noise_pred = noise_pred_uncond + noise_guidance_edit

                # compute the previous noisy sample x_t -> x_t-1
                if enable_edit_guidance and self.guidance_rescale > 0.0:
                    # Based on 3.4. in https://arxiv.org/pdf/2305.08891.pdf
                    noise_pred = rescale_noise_cfg(
                        noise_pred,
                        noise_pred_edit_concepts.mean(dim=0, keepdim=False),
                        guidance_rescale=self.guidance_rescale,
                    )

                idx = t_to_idx[int(t)]
                latents = self.scheduler.step(
                    noise_pred, t, latents, variance_noise=zs[idx], **extra_step_kwargs, return_dict=False
                )[0]

                # step callback
                if use_cross_attn_mask:
                    store_step = i in attn_store_steps
                    self.attention_store.between_steps(store_step)

                if callback_on_step_end is not None:
                    callback_kwargs = {}
                    for k in callback_on_step_end_tensor_inputs:
                        callback_kwargs[k] = locals()[k]
                    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                    latents = callback_outputs.pop("latents", latents)
                    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
                    negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)
                    add_text_embeds = callback_outputs.pop("add_text_embeds", add_text_embeds)
                    negative_pooled_prompt_embeds = callback_outputs.pop(
                        "negative_pooled_prompt_embeds", negative_pooled_prompt_embeds
                    )
                    add_time_ids = callback_outputs.pop("add_time_ids", add_time_ids)
                    # negative_add_time_ids = callback_outputs.pop("negative_add_time_ids", negative_add_time_ids)

                # call the callback, if provided
                if i == len(timesteps) - 1 or ((i + 1) > 0 and (i + 1) % self.scheduler.order == 0):
                    progress_bar.update()

        if not output_type == "latent":
            # make sure the VAE is in float32 mode, as it overflows in float16
            needs_upcasting = self.vae.dtype == ms.float16 and self.vae.config.force_upcast

            if needs_upcasting:
                self.upcast_vae()
                latents = latents.to(next(iter(self.vae.post_quant_conv.get_parameters())).dtype)

            image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]

            # cast back to fp16 if needed
            if needs_upcasting:
                self.vae.to(dtype=ms.float16)
        else:
            image = latents

        if not output_type == "latent":
            # apply watermark if available
            if self.watermark is not None:
                image = self.watermark.apply_watermark(image)

            image = self.image_processor.postprocess(image, output_type=output_type)

        if not return_dict:
            return (image,)

        return LEditsPPDiffusionPipelineOutput(images=image, nsfw_content_detected=None)

    # Modified from diffusers.pipelines.ledits_pp.pipeline_leditspp_stable_diffusion.LEditsPPPipelineStableDiffusion.encode_image
    def encode_image(self, image, dtype=None, height=None, width=None, resize_mode="default", crops_coords=None):
        image = self.image_processor.preprocess(
            image=image, height=height, width=width, resize_mode=resize_mode, crops_coords=crops_coords
        )
        height, width = image.shape[-2:]
        if height % 32 != 0 or width % 32 != 0:
            raise ValueError(
                "Image height and width must be a factor of 32. "
                "Consider down-sampling the input using the `height` and `width` parameters"
            )
        resized = self.image_processor.postprocess(image=image, output_type="pil")

        if max(image.shape[-2:]) > self.vae.config["sample_size"] * 1.5:
            logger.warning(
                "Your input images far exceed the default resolution of the underlying diffusion model. "
                "The output images may contain severe artifacts! "
                "Consider down-sampling the input using the `height` and `width` parameters"
            )
        image = image.to(dtype=dtype)
        needs_upcasting = self.vae.dtype == ms.float16 and self.vae.config.force_upcast

        if needs_upcasting:
            image = image.float()
            self.upcast_vae()

        x0 = self.vae.diag_gauss_dist.mode(self.vae.encode(image)[0])
        x0 = x0.to(dtype)
        # cast back to fp16 if needed
        if needs_upcasting:
            self.vae.to(dtype=ms.float16)

        x0 = self.vae.config.scaling_factor * x0
        return x0, resized

    def invert(
        self,
        image: PipelineImageInput,
        source_prompt: str = "",
        source_guidance_scale=3.5,
        negative_prompt: str = None,
        negative_prompt_2: str = None,
        num_inversion_steps: int = 50,
        skip: float = 0.15,
        generator: Optional[np.random.Generator] = None,
        crops_coords_top_left: Tuple[int, int] = (0, 0),
        num_zero_noise_steps: int = 3,
        cross_attention_kwargs: Optional[Dict[str, Any]] = None,
        height: Optional[int] = None,
        width: Optional[int] = None,
        resize_mode: Optional[str] = "default",
        crops_coords: Optional[Tuple[int, int, int, int]] = None,
    ):
        r"""
        Pipeline function for image inversion as described by the [LEDITS++
        Paper](https://arxiv.org/abs/2301.12247). If the scheduler is set to [`~schedulers.DDIMScheduler`], the
        inversion proposed by [edit-friendly DDPM](https://arxiv.org/abs/2304.06140) will be performed instead.

        Args:
            image (`PipelineImageInput`):
                Input for the image(s) that are to be edited. Multiple input images have to share the same aspect
                ratio.
            source_prompt (`str`, defaults to `""`):
                Prompt describing the input image that will be used for guidance during inversion. Guidance is disabled
                if the `source_prompt` is `""`.
            source_guidance_scale (`float`, defaults to `3.5`):
                Strength of guidance during inversion.
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation. If not defined, one has to pass
                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
                less than `1`).
            negative_prompt_2 (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
                `text_encoder_2`. If not defined, `negative_prompt` is used in both text-encoders
            num_inversion_steps (`int`, defaults to `50`):
                Number of total performed inversion steps after discarding the initial `skip` steps.
            skip (`float`, defaults to `0.15`):
                Portion of initial steps that will be ignored for inversion and subsequent generation. Lower values
                will lead to stronger changes to the input image. `skip` has to be between `0` and `1`.
            generator (`np.random.Generator`, *optional*):
                A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make inversion
                deterministic.
            crops_coords_top_left (`Tuple[int]`, *optional*, defaults to (0, 0)):
                `crops_coords_top_left` can be used to generate an image that appears to be "cropped" from the position
                `crops_coords_top_left` downwards. Favorable, well-centered images are usually achieved by setting
                `crops_coords_top_left` to (0, 0). Part of SDXL's micro-conditioning as explained in section 2.2 of
                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
            num_zero_noise_steps (`int`, defaults to `3`):
                Number of final diffusion steps that will not renoise the current image. If no steps are set to zero,
                SD-XL in combination with [`DPMSolverMultistepScheduler`] will produce noise artifacts.
            cross_attention_kwargs (`dict`, *optional*):
                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
                `self.processor` in
                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).

        Returns:
            [`~pipelines.ledits_pp.LEditsPPInversionPipelineOutput`]: Output will contain the resized input image(s)
            and respective VAE reconstruction(s).
        """
        if height is not None and height % 32 != 0 or width is not None and width % 32 != 0:
            raise ValueError("height and width must be a factor of 32.")

        # Reset attn processor, we do not want to store attn maps during inversion
        self.unet.set_attn_processor(AttnProcessor())

        self.eta = 1.0

        self.scheduler.config.timestep_spacing = "leading"
        self.scheduler.set_timesteps(int(num_inversion_steps * (1 + skip)))
        self.inversion_steps = self.scheduler.timesteps[-num_inversion_steps:]
        timesteps = self.inversion_steps

        num_images_per_prompt = 1

        # 0. Ensure that only uncond embedding is used if prompt = ""
        if source_prompt == "":
            # noise pred should only be noise_pred_uncond
            source_guidance_scale = 0.0
            do_classifier_free_guidance = False
        else:
            do_classifier_free_guidance = source_guidance_scale > 1.0

        # 1. prepare image
        x0, resized = self.encode_image(
            image,
            dtype=self.text_encoder_2.dtype,
            height=height,
            width=width,
            resize_mode=resize_mode,
            crops_coords=crops_coords,
        )
        width = x0.shape[2] * self.vae_scale_factor
        height = x0.shape[3] * self.vae_scale_factor
        self.size = (height, width)

        self.batch_size = x0.shape[0]

        # 2. get embeddings
        text_encoder_lora_scale = (
            cross_attention_kwargs.get("scale", None) if cross_attention_kwargs is not None else None
        )

        if isinstance(source_prompt, str):
            source_prompt = [source_prompt] * self.batch_size

        (
            negative_prompt_embeds,
            prompt_embeds,
            negative_pooled_prompt_embeds,
            edit_pooled_prompt_embeds,
            _,
        ) = self.encode_prompt(
            num_images_per_prompt=num_images_per_prompt,
            negative_prompt=negative_prompt,
            negative_prompt_2=negative_prompt_2,
            editing_prompt=source_prompt,
            lora_scale=text_encoder_lora_scale,
            enable_edit_guidance=do_classifier_free_guidance,
        )
        if self.text_encoder_2 is None:
            text_encoder_projection_dim = int(negative_pooled_prompt_embeds.shape[-1])
        else:
            text_encoder_projection_dim = self.text_encoder_2.config.projection_dim

        # 3. Prepare added time ids & embeddings
        add_text_embeds = negative_pooled_prompt_embeds
        add_time_ids = self._get_add_time_ids(
            self.size,
            crops_coords_top_left,
            self.size,
            dtype=negative_prompt_embeds.dtype,
            text_encoder_projection_dim=text_encoder_projection_dim,
        )

        if do_classifier_free_guidance:
            negative_prompt_embeds = mint.cat([negative_prompt_embeds, prompt_embeds], dim=0)
            add_text_embeds = mint.cat([add_text_embeds, edit_pooled_prompt_embeds], dim=0)
            add_time_ids = mint.cat([add_time_ids, add_time_ids], dim=0)

        add_time_ids = add_time_ids.tile((self.batch_size * num_images_per_prompt, 1))

        # autoencoder reconstruction
        if self.vae.dtype == ms.float16 and self.vae.config.force_upcast:
            self.upcast_vae()
            x0_tmp = x0.to(next(iter(self.vae.post_quant_conv.get_parameters())).dtype)
            image_rec = self.vae.decode(
                x0_tmp / self.vae.config.scaling_factor, return_dict=False, generator=generator
            )[0]
            # cast back to fp16 if needed
            # TODO: upcast_vae in MS
            self.vae.to(dtype=ms.float16)
        elif self.vae.config.force_upcast:
            x0_tmp = x0.to(next(iter(self.vae.post_quant_conv.get_parameters())).dtype)
            image_rec = self.vae.decode(
                x0_tmp / self.vae.config.scaling_factor, return_dict=False, generator=generator
            )[0]
        else:
            image_rec = self.vae.decode(x0 / self.vae.config.scaling_factor, return_dict=False, generator=generator)[0]

        image_rec = self.image_processor.postprocess(image_rec, output_type="pil")

        # 5. find zs and xts
        variance_noise_shape = (num_inversion_steps, *x0.shape)

        # intermediate latents
        t_to_idx = {int(v): k for k, v in enumerate(timesteps)}
        xts = mint.zeros(size=variance_noise_shape, dtype=negative_prompt_embeds.dtype)

        for t in reversed(timesteps):
            idx = num_inversion_steps - t_to_idx[int(t)] - 1
            noise = randn_tensor(shape=x0.shape, generator=generator, dtype=x0.dtype)
            xts[idx] = self.scheduler.add_noise(x0, noise, t.unsqueeze(0))
        xts = mint.cat([x0.unsqueeze(0), xts], dim=0)

        # noise maps
        zs = mint.zeros(size=variance_noise_shape, dtype=negative_prompt_embeds.dtype)

        self.scheduler.set_timesteps(len(self.scheduler.timesteps))

        for t in self.progress_bar(timesteps):
            idx = num_inversion_steps - t_to_idx[int(t)] - 1
            # 1. predict noise residual
            xt = xts[idx + 1]

            latent_model_input = mint.cat([xt] * 2) if do_classifier_free_guidance else xt
            latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

            added_cond_kwargs = {"text_embeds": add_text_embeds, "time_ids": add_time_ids}

            noise_pred = self.unet(
                latent_model_input,
                t,
                encoder_hidden_states=negative_prompt_embeds,
                cross_attention_kwargs=cross_attention_kwargs,
                added_cond_kwargs=added_cond_kwargs,
                return_dict=False,
            )[0]

            # 2. perform guidance
            if do_classifier_free_guidance:
                noise_pred_out = noise_pred.chunk(2)
                noise_pred_uncond, noise_pred_text = noise_pred_out[0], noise_pred_out[1]
                noise_pred = noise_pred_uncond + source_guidance_scale * (noise_pred_text - noise_pred_uncond)

            xtm1 = xts[idx]
            z, xtm1_corrected = compute_noise(self.scheduler, xtm1, xt, t, noise_pred, self.eta)
            zs[idx] = z

            # correction to avoid error accumulation
            xts[idx] = xtm1_corrected

        self.init_latents = xts[-1]
        zs = zs.flip((0,))

        if num_zero_noise_steps > 0:
            zs[-num_zero_noise_steps:] = mint.zeros_like(zs[-num_zero_noise_steps:])
        self.zs = zs
        return LEditsPPInversionPipelineOutput(images=resized, vae_reconstruction_images=image_rec)

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusionXL.__call__(denoising_end=None, negative_prompt=None, negative_prompt_2=None, negative_prompt_embeds=None, negative_pooled_prompt_embeds=None, ip_adapter_image=None, output_type='pil', return_dict=False, cross_attention_kwargs=None, guidance_rescale=0.0, crops_coords_top_left=(0, 0), target_size=None, editing_prompt=None, editing_prompt_embeddings=None, editing_pooled_prompt_embeds=None, reverse_editing_direction=False, edit_guidance_scale=5, edit_warmup_steps=0, edit_cooldown_steps=None, edit_threshold=0.9, sem_guidance=None, use_cross_attn_mask=False, use_intersect_mask=False, user_mask=None, attn_store_steps=[], store_averaged_over_steps=True, clip_skip=None, callback_on_step_end=None, callback_on_step_end_tensor_inputs=['latents'], **kwargs)

The call function to the pipeline for editing. The [~pipelines.ledits_pp.LEditsPPPipelineStableDiffusionXL.invert] method has to be called beforehand. Edits will always be performed for the last inverted image(s).
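Stripped of warmup, cooldown, and masking, the per-concept guidance that the call function accumulates can be sketched in plain NumPy. Shapes and variable names here are illustrative stand-ins for the pipeline's internal tensors, not part of its public API:

```python
import numpy as np

# Toy shapes: one latent with 4 channels at 8x8 resolution; values are random
# stand-ins for actual U-Net noise predictions.
rng = np.random.default_rng(0)
noise_pred_uncond = rng.standard_normal((1, 4, 8, 8))
noise_pred_edit_concepts = rng.standard_normal((2, 1, 4, 8, 8))  # two edit concepts

edit_guidance_scale = [5.0, 5.0]
reverse_editing_direction = [False, True]  # add concept 0, remove concept 1

noise_guidance_edit = np.zeros_like(noise_pred_uncond)
for c, concept_pred in enumerate(noise_pred_edit_concepts):
    # Guidance direction for concept c, relative to the unconditional prediction.
    tmp = concept_pred - noise_pred_uncond
    if reverse_editing_direction[c]:
        tmp = -tmp
    noise_guidance_edit += edit_guidance_scale[c] * tmp

# Final prediction combines the unconditional estimate with all edit terms.
noise_pred = noise_pred_uncond + noise_guidance_edit
```

In the actual pipeline each concept's term is additionally gated by warmup/cooldown steps and multiplied by the implicit (and optional user) masks before being summed.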

PARAMETER DESCRIPTION
denoising_end

When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally prematurely terminated. As a result, the returned sample will still retain a substantial amount of noise as determined by the discrete timesteps selected by the scheduler. The denoising_end parameter should ideally be utilized when this pipeline forms a part of a "Mixture of Denoisers" multi-pipeline setup, as elaborated in the "Refining the Image Output" section of the SDXL pipeline documentation.

TYPE: `float`, *optional* DEFAULT: None

negative_prompt

The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

negative_prompt_2

The prompt or prompts not to guide the image generation to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text-encoders

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

negative_prompt_embeds

Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

negative_pooled_prompt_embeds

Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from negative_prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

ip_adapter_image

Optional image input to work with IP Adapters.

TYPE: Optional[PipelineImageInput] DEFAULT: None

output_type

The output format of the generated image. Choose between PIL.Image.Image or np.array.

TYPE: `str`, *optional*, defaults to `"pil"` DEFAULT: 'pil'

return_dict

Whether or not to return a [~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

callback

A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: ms.Tensor).

TYPE: `Callable`, *optional*

callback_steps

The frequency at which the callback function will be called. If not specified, the callback will be called at every step.

TYPE: `int`, *optional*, defaults to 1

cross_attention_kwargs

A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.

TYPE: `dict`, *optional* DEFAULT: None

guidance_rescale

Guidance rescale factor proposed by Common Diffusion Noise Schedules and Sample Steps are Flawed, where guidance_rescale is defined as φ in equation 16. The rescale factor should fix overexposure when using zero terminal SNR.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0
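The rescaling applied when guidance_rescale > 0 matches each sample's standard deviation to that of the text-conditioned prediction, then blends back toward the original. A NumPy sketch of this formula (function name mirrors the helper used in the source above):

```python
import numpy as np

def rescale_noise_cfg(noise_cfg, noise_pred_text, guidance_rescale=0.0):
    """Rescale the guided prediction so its per-sample std matches the text prediction."""
    dims = tuple(range(1, noise_cfg.ndim))
    std_text = noise_pred_text.std(axis=dims, keepdims=True)
    std_cfg = noise_cfg.std(axis=dims, keepdims=True)
    # Rescale to fix overexposure from classifier-free guidance.
    rescaled = noise_cfg * (std_text / std_cfg)
    # Blend with the original prediction to avoid overly plain-looking images.
    return guidance_rescale * rescaled + (1.0 - guidance_rescale) * noise_cfg
```

With guidance_rescale=0.0 this is the identity; with 1.0 the output's per-sample std equals that of noise_pred_text.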

crops_coords_top_left

crops_coords_top_left can be used to generate an image that appears to be "cropped" from the position crops_coords_top_left downwards. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952.

TYPE: `Tuple[int]`, *optional*, defaults to (0, 0) DEFAULT: (0, 0)

target_size

For most cases, target_size should be set to the desired height and width of the generated image. If not specified it will default to (width, height). Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952.

TYPE: `Tuple[int]`, *optional*, defaults to (1024, 1024) DEFAULT: None

editing_prompt

The prompt or prompts to guide the image generation. The image is reconstructed by setting editing_prompt = None. Guidance direction of prompt should be specified via reverse_editing_direction.

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

editing_prompt_embeddings

Pre-generated edit text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, editing_prompt_embeddings will be generated from editing_prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

editing_pooled_prompt_embeds

Pre-generated pooled edit text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled embeddings will be generated from the editing_prompt input argument.

TYPE: `ms.Tensor`, *optional*

reverse_editing_direction

Whether the corresponding prompt in editing_prompt should be increased or decreased.

TYPE: `bool` or `List[bool]`, *optional*, defaults to `False` DEFAULT: False

edit_guidance_scale

Guidance scale for guiding the image generation. If provided as list values should correspond to editing_prompt. edit_guidance_scale is defined as s_e of equation 12 of LEDITS++ Paper.

TYPE: `float` or `List[float]`, *optional*, defaults to 5 DEFAULT: 5

edit_warmup_steps

Number of diffusion steps (for each prompt) for which guidance is not applied.

TYPE: `float` or `List[float]`, *optional*, defaults to 0 DEFAULT: 0

edit_cooldown_steps

Number of diffusion steps (for each prompt) after which guidance is no longer applied.

TYPE: `float` or `List[float]`, *optional*, defaults to `None` DEFAULT: None

edit_threshold

Masking threshold of guidance. Threshold should be proportional to the image region that is modified. 'edit_threshold' is defined as 'λ' of equation 12 of LEDITS++ Paper.

TYPE: `float` or `List[float]`, *optional*, defaults to 0.9 DEFAULT: 0.9
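The quantile-based implicit mask derived from the noise estimate ('M^2' in the paper) keeps only the largest (1 - edit_threshold) fraction of per-pixel guidance magnitudes. A self-contained NumPy sketch mirroring the pipeline's np.quantile call (the helper name is hypothetical):

```python
import numpy as np

def implicit_mask(noise_guidance_edit, edit_threshold=0.9):
    """Binary mask keeping only the top (1 - edit_threshold) of guidance magnitudes."""
    # Per-pixel magnitude: sum of absolute values over the channel axis.
    magnitude = np.abs(noise_guidance_edit).sum(axis=1, keepdims=True)
    flat = magnitude.reshape(magnitude.shape[0], 1, -1)
    # Per-sample quantile threshold, computed over the flattened spatial dims.
    tmp = np.quantile(flat, edit_threshold, axis=2)
    return (magnitude >= tmp[:, :, None, None]).astype(noise_guidance_edit.dtype)

# With 100 distinct magnitudes and edit_threshold=0.9, roughly the top 10% survive.
guidance = np.arange(100.0).reshape(1, 1, 10, 10)
mask = implicit_mask(guidance, edit_threshold=0.9)
```

Higher thresholds restrict the edit to fewer, more strongly affected pixels; in the pipeline the resulting mask is broadcast across the four latent channels.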

sem_guidance

List of pre-generated guidance vectors to be applied at generation. Length of the list has to correspond to num_inference_steps.

TYPE: `List[ms.Tensor]`, *optional* DEFAULT: None

use_cross_attn_mask

Whether cross-attention masks are used. Cross-attention masks are always used when use_intersect_mask is set to true. Cross-attention masks are defined as 'M^1' of equation 12 of LEDITS++ paper.

TYPE: bool DEFAULT: False

use_intersect_mask

Whether the masking term is calculated as intersection of cross-attention masks and masks derived from the noise estimate. Cross-attention mask are defined as 'M^1' and masks derived from the noise estimate are defined as 'M^2' of equation 12 of LEDITS++ paper.

TYPE: bool DEFAULT: False
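When use_intersect_mask is enabled, the final mask is the elementwise product of the cross-attention mask and the noise-derived mask, which for binary masks is a logical AND. A toy NumPy sketch with randomly generated stand-in masks:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical binary masks over a (1, 4, 8, 8) latent grid.
cross_attn_mask = (rng.random((1, 4, 8, 8)) > 0.5).astype(np.float32)  # M^1
noise_mask = (rng.random((1, 4, 8, 8)) > 0.5).astype(np.float32)       # M^2
# For {0, 1} masks, elementwise multiplication is a logical AND.
intersect_mask = cross_attn_mask * noise_mask
```

A pixel is edited only if both the attention map and the noise-estimate magnitude flag it as relevant, which tightens localization compared with either mask alone.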

user_mask

User-provided mask for even better control over the editing process. This is helpful when LEDITS++'s implicit masks do not meet user preferences.

TYPE: Optional[Tensor] DEFAULT: None

attn_store_steps

Steps for which the attention maps are stored in the AttentionStore. Just for visualization purposes.

TYPE: Optional[List[int]] DEFAULT: []

store_averaged_over_steps

Whether the attention maps for the 'attn_store_steps' are stored averaged over the diffusion steps. If False, attention maps for each step are stored separately. Just for visualization purposes.

TYPE: bool DEFAULT: True

clip_skip

Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

TYPE: `int`, *optional* DEFAULT: None

callback_on_step_end

A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.

TYPE: `Callable`, *optional* DEFAULT: None

callback_on_step_end_tensor_inputs

The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

TYPE: `List`, *optional* DEFAULT: ['latents']
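A callback conforming to the documented signature might look like this minimal sketch (the function name is hypothetical, and `pipe`/`timestep` are unused here):

```python
# Only tensor names listed in callback_on_step_end_tensor_inputs appear in
# callback_kwargs; by default that is just "latents".
def log_latents_callback(pipe, step, timestep, callback_kwargs):
    latents = callback_kwargs["latents"]
    # Inspect or modify the tensor here.
    print(f"step {step}: latents of shape {getattr(latents, 'shape', None)}")
    # Returned entries overwrite the pipeline's local variables of the same name.
    return {"latents": latents}
```

Passing `callback_on_step_end=log_latents_callback` would then run this hook once per denoising step.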

RETURNS DESCRIPTION

[~pipelines.ledits_pp.LEditsPPDiffusionPipelineOutput] or tuple:

[~pipelines.ledits_pp.LEditsPPDiffusionPipelineOutput] if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion_xl.py
def __call__(
    self,
    denoising_end: Optional[float] = None,
    negative_prompt: Optional[Union[str, List[str]]] = None,
    negative_prompt_2: Optional[Union[str, List[str]]] = None,
    negative_prompt_embeds: Optional[ms.Tensor] = None,
    negative_pooled_prompt_embeds: Optional[ms.Tensor] = None,
    ip_adapter_image: Optional[PipelineImageInput] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
    cross_attention_kwargs: Optional[Dict[str, Any]] = None,
    guidance_rescale: float = 0.0,
    crops_coords_top_left: Tuple[int, int] = (0, 0),
    target_size: Optional[Tuple[int, int]] = None,
    editing_prompt: Optional[Union[str, List[str]]] = None,
    editing_prompt_embeddings: Optional[ms.Tensor] = None,
    editing_pooled_prompt_embeds: Optional[ms.Tensor] = None,
    reverse_editing_direction: Optional[Union[bool, List[bool]]] = False,
    edit_guidance_scale: Optional[Union[float, List[float]]] = 5,
    edit_warmup_steps: Optional[Union[int, List[int]]] = 0,
    edit_cooldown_steps: Optional[Union[int, List[int]]] = None,
    edit_threshold: Optional[Union[float, List[float]]] = 0.9,
    sem_guidance: Optional[List[ms.Tensor]] = None,
    use_cross_attn_mask: bool = False,
    use_intersect_mask: bool = False,
    user_mask: Optional[ms.Tensor] = None,
    attn_store_steps: Optional[List[int]] = [],
    store_averaged_over_steps: bool = True,
    clip_skip: Optional[int] = None,
    callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    **kwargs,
):
    r"""
    The call function to the pipeline for editing. The
    [`~pipelines.ledits_pp.LEditsPPPipelineStableDiffusionXL.invert`] method has to be called beforehand. Edits
    will always be performed for the last inverted image(s).

    Args:
        denoising_end (`float`, *optional*):
            When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be
            completed before it is intentionally prematurely terminated. As a result, the returned sample will
            still retain a substantial amount of noise as determined by the discrete timesteps selected by the
            scheduler. The denoising_end parameter should ideally be utilized when this pipeline forms a part of a
            "Mixture of Denoisers" multi-pipeline setup, as elaborated in [**Refining the Image
            Output**](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl#refining-the-image-output).
        negative_prompt (`str` or `List[str]`, *optional*):
            The prompt or prompts not to guide the image generation. If not defined, one has to pass
            `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
            less than `1`).
        negative_prompt_2 (`str` or `List[str]`, *optional*):
            The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
            `text_encoder_2`. If not defined, `negative_prompt` is used in both text-encoders
        negative_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
            weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
            argument.
        negative_pooled_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
            weighting. If not provided, pooled negative_prompt_embeds will be generated from `negative_prompt`
            input argument.
        ip_adapter_image (`PipelineImageInput`, *optional*):
            Optional image input to work with IP Adapters.
        output_type (`str`, *optional*, defaults to `"pil"`):
            The output format of the generate image. Choose between
            [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] instead
            of a plain tuple.
        callback (`Callable`, *optional*):
            A function that will be called every `callback_steps` steps during inference. The function will be
            called with the following arguments: `callback(step: int, timestep: int, latents: ms.Tensor)`.
        callback_steps (`int`, *optional*, defaults to 1):
            The frequency at which the `callback` function will be called. If not specified, the callback will be
            called at every step.
        cross_attention_kwargs (`dict`, *optional*):
            A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
            `self.processor` in
            [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
        guidance_rescale (`float`, *optional*, defaults to 0.0):
            Guidance rescale factor proposed by [Common Diffusion Noise Schedules and Sample Steps are
            Flawed](https://arxiv.org/pdf/2305.08891.pdf). `guidance_rescale` is defined as `φ` in equation 16 of
            the paper. Guidance rescale factor should fix overexposure when using zero terminal SNR.
        crops_coords_top_left (`Tuple[int]`, *optional*, defaults to (0, 0)):
            `crops_coords_top_left` can be used to generate an image that appears to be "cropped" from the position
            `crops_coords_top_left` downwards. Favorable, well-centered images are usually achieved by setting
            `crops_coords_top_left` to (0, 0). Part of SDXL's micro-conditioning as explained in section 2.2 of
            [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
        target_size (`Tuple[int]`, *optional*, defaults to (1024, 1024)):
            For most cases, `target_size` should be set to the desired height and width of the generated image. If
            not specified it will default to `(width, height)`. Part of SDXL's micro-conditioning as explained in
            section 2.2 of [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
        editing_prompt (`str` or `List[str]`, *optional*):
            The prompt or prompts to guide the image generation. The image is reconstructed by setting
            `editing_prompt = None`. Guidance direction of prompt should be specified via
            `reverse_editing_direction`.
        editing_prompt_embeddings (`ms.Tensor`, *optional*):
            Pre-generated edit text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
            If not provided, editing_prompt_embeddings will be generated from `editing_prompt` input argument.
        editing_pooled_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated pooled edit text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
            weighting. If not provided, editing_prompt_embeddings will be generated from `editing_prompt` input
            argument.
        reverse_editing_direction (`bool` or `List[bool]`, *optional*, defaults to `False`):
            Whether the corresponding prompt in `editing_prompt` should be increased or decreased.
        edit_guidance_scale (`float` or `List[float]`, *optional*, defaults to 5):
            Guidance scale for guiding the image generation. If provided as list values should correspond to
            `editing_prompt`. `edit_guidance_scale` is defined as `s_e` of equation 12 of [LEDITS++
            Paper](https://arxiv.org/abs/2311.16711).
        edit_warmup_steps (`int` or `List[int]`, *optional*, defaults to 0):
            Number of diffusion steps (for each prompt) for which guidance is not applied.
        edit_cooldown_steps (`int` or `List[int]`, *optional*, defaults to `None`):
            Number of diffusion steps (for each prompt) after which guidance is no longer applied.
        edit_threshold (`float` or `List[float]`, *optional*, defaults to 0.9):
            Masking threshold of guidance. Threshold should be proportional to the image region that is modified.
            'edit_threshold' is defined as 'λ' of equation 12 of [LEDITS++
            Paper](https://arxiv.org/abs/2311.16711).
        sem_guidance (`List[ms.Tensor]`, *optional*):
            List of pre-generated guidance vectors to be applied at generation. Length of the list has to
            correspond to `num_inference_steps`.
        use_cross_attn_mask (`bool`, defaults to `False`):
            Whether cross-attention masks are used. Cross-attention masks are always used when `use_intersect_mask`
            is set to `True`. Cross-attention masks are defined as 'M^1' of equation 12 of [LEDITS++
            paper](https://arxiv.org/pdf/2311.16711.pdf).
        use_intersect_mask (`bool`, defaults to `False`):
            Whether the masking term is calculated as intersection of cross-attention masks and masks derived from
            the noise estimate. Cross-attention masks are defined as 'M^1' and masks derived from the noise estimate
            are defined as 'M^2' of equation 12 of [LEDITS++ paper](https://arxiv.org/pdf/2311.16711.pdf).
        user_mask (`ms.Tensor`, *optional*):
            User-provided mask for even better control over the editing process. This is helpful when LEDITS++'s
            implicit masks do not meet user preferences.
        attn_store_steps (`List[int]`, *optional*):
            Steps for which the attention maps are stored in the AttentionStore. Just for visualization purposes.
        store_averaged_over_steps (`bool`, defaults to `True`):
            Whether the attention maps for the `attn_store_steps` are stored averaged over the diffusion steps. If
            `False`, attention maps for each step are stored separately. Just for visualization purposes.
        clip_skip (`int`, *optional*):
            Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
            the output of the pre-final layer will be used for computing the prompt embeddings.
        callback_on_step_end (`Callable`, *optional*):
            A function that is called at the end of each denoising step during inference. The function is called
            with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
            callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
            `callback_on_step_end_tensor_inputs`.
        callback_on_step_end_tensor_inputs (`List`, *optional*):
            The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
            will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
            `._callback_tensor_inputs` attribute of your pipeline class.

    Examples:

    Returns:
        [`~pipelines.ledits_pp.LEditsPPDiffusionPipelineOutput`] or `tuple`:
        [`~pipelines.ledits_pp.LEditsPPDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple`. When
        returning a tuple, the first element is a list with the generated images.
    """
    if self.inversion_steps is None:
        raise ValueError(
            "You need to invert an input image first before calling the pipeline. The `invert` method has to be called beforehand. Edits will always be performed for the last inverted image(s)."  # noqa: E501
        )

    eta = self.eta
    num_images_per_prompt = 1
    latents = self.init_latents

    zs = self.zs
    self.scheduler.set_timesteps(len(self.scheduler.timesteps))

    if use_intersect_mask:
        use_cross_attn_mask = True

    if use_cross_attn_mask:
        self.smoothing = LeditsGaussianSmoothing()

    # TODO: Check inputs
    # 1. Check inputs. Raise error if not correct
    # self.check_inputs(
    #    callback_steps,
    #    negative_prompt,
    #    negative_prompt_2,
    #    prompt_embeds,
    #    negative_prompt_embeds,
    #    pooled_prompt_embeds,
    #    negative_pooled_prompt_embeds,
    # )
    self._guidance_rescale = guidance_rescale
    self._clip_skip = clip_skip
    self._cross_attention_kwargs = cross_attention_kwargs
    self._denoising_end = denoising_end

    # 2. Define call parameters
    batch_size = self.batch_size

    if editing_prompt:
        enable_edit_guidance = True
        if isinstance(editing_prompt, str):
            editing_prompt = [editing_prompt]
        self.enabled_editing_prompts = len(editing_prompt)
    elif editing_prompt_embeddings is not None:
        enable_edit_guidance = True
        self.enabled_editing_prompts = editing_prompt_embeddings.shape[0]
    else:
        self.enabled_editing_prompts = 0
        enable_edit_guidance = False

    # 3. Encode input prompt
    text_encoder_lora_scale = (
        cross_attention_kwargs.get("scale", None) if cross_attention_kwargs is not None else None
    )
    (
        prompt_embeds,
        edit_prompt_embeds,
        negative_pooled_prompt_embeds,
        pooled_edit_embeds,
        num_edit_tokens,
    ) = self.encode_prompt(
        num_images_per_prompt=num_images_per_prompt,
        negative_prompt=negative_prompt,
        negative_prompt_2=negative_prompt_2,
        negative_prompt_embeds=negative_prompt_embeds,
        negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
        lora_scale=text_encoder_lora_scale,
        clip_skip=self.clip_skip,
        enable_edit_guidance=enable_edit_guidance,
        editing_prompt=editing_prompt,
        editing_prompt_embeds=editing_prompt_embeddings,
        editing_pooled_prompt_embeds=editing_pooled_prompt_embeds,
    )

    # 4. Prepare timesteps
    # self.scheduler.set_timesteps(num_inference_steps)

    timesteps = self.inversion_steps
    t_to_idx = {int(v): k for k, v in enumerate(timesteps)}

    if use_cross_attn_mask:
        self.attention_store = LeditsAttentionStore(
            average=store_averaged_over_steps,
            batch_size=batch_size,
            max_size=(latents.shape[-2] / 4.0) * (latents.shape[-1] / 4.0),
            max_resolution=None,
        )
        self.prepare_unet(self.attention_store)
        resolution = latents.shape[-2:]
        att_res = (int(resolution[0] / 4), int(resolution[1] / 4))

    # 5. Prepare latent variables
    latents = self.prepare_latents(latents=latents)

    # 6. Prepare extra step kwargs.
    extra_step_kwargs = self.prepare_extra_step_kwargs(eta)

    if self.text_encoder_2 is None:
        text_encoder_projection_dim = int(negative_pooled_prompt_embeds.shape[-1])
    else:
        text_encoder_projection_dim = self.text_encoder_2.config.projection_dim

    # 7. Prepare added time ids & embeddings
    add_text_embeds = negative_pooled_prompt_embeds
    add_time_ids = self._get_add_time_ids(
        self.size,
        crops_coords_top_left,
        self.size,
        dtype=negative_pooled_prompt_embeds.dtype,
        text_encoder_projection_dim=text_encoder_projection_dim,
    )

    if enable_edit_guidance:
        prompt_embeds = mint.cat([prompt_embeds, edit_prompt_embeds], dim=0)
        add_text_embeds = mint.cat([add_text_embeds, pooled_edit_embeds], dim=0)
        edit_concepts_time_ids = add_time_ids.tile((edit_prompt_embeds.shape[0], 1))
        add_time_ids = mint.cat([add_time_ids, edit_concepts_time_ids], dim=0)
        self.text_cross_attention_maps = [editing_prompt] if isinstance(editing_prompt, str) else editing_prompt

    prompt_embeds = prompt_embeds
    add_text_embeds = add_text_embeds
    add_time_ids = add_time_ids.tile((batch_size * num_images_per_prompt, 1))

    if ip_adapter_image is not None:
        # TODO: fix image encoding
        image_embeds, negative_image_embeds = self.encode_image(ip_adapter_image, num_images_per_prompt)
        if self.do_classifier_free_guidance:
            image_embeds = mint.cat([negative_image_embeds, image_embeds])
            image_embeds = image_embeds

    # 8. Denoising loop
    self.sem_guidance = None
    self.activation_mask = None

    if (
        self.denoising_end is not None
        and isinstance(self.denoising_end, float)
        and self.denoising_end > 0
        and self.denoising_end < 1
    ):
        discrete_timestep_cutoff = int(
            round(
                self.scheduler.config.num_train_timesteps
                - (self.denoising_end * self.scheduler.config.num_train_timesteps)
            )
        )
        num_inference_steps = len(list(filter(lambda ts: ts >= discrete_timestep_cutoff, timesteps)))
        timesteps = timesteps[:num_inference_steps]

    # 9. Optionally get Guidance Scale Embedding
    timestep_cond = None
    if self.unet.config.time_cond_proj_dim is not None:
        guidance_scale_tensor = ms.tensor(self.guidance_scale - 1).tile((batch_size * num_images_per_prompt))
        timestep_cond = self.get_guidance_scale_embedding(
            guidance_scale_tensor, embedding_dim=self.unet.config.time_cond_proj_dim
        ).to(dtype=latents.dtype)

    self._num_timesteps = len(timesteps)
    with self.progress_bar(total=self._num_timesteps) as progress_bar:
        for i, t in enumerate(timesteps):
            # expand the latents if we are doing classifier free guidance
            latent_model_input = mint.cat([latents] * (1 + self.enabled_editing_prompts))
            latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
            # predict the noise residual
            added_cond_kwargs = {"text_embeds": add_text_embeds, "time_ids": add_time_ids}
            if ip_adapter_image is not None:
                added_cond_kwargs["image_embeds"] = image_embeds
            noise_pred = self.unet(
                latent_model_input,
                t,
                encoder_hidden_states=prompt_embeds,
                cross_attention_kwargs=cross_attention_kwargs,
                added_cond_kwargs=added_cond_kwargs,
                return_dict=False,
            )[0]

            noise_pred_out = noise_pred.chunk(1 + self.enabled_editing_prompts)  # [b,4, 64, 64]
            noise_pred_uncond = noise_pred_out[0]
            noise_pred_edit_concepts = noise_pred_out[1:]

            noise_guidance_edit = mint.zeros(
                noise_pred_uncond.shape,
                dtype=noise_pred_uncond.dtype,
            )

            if sem_guidance is not None and len(sem_guidance) > i:
                noise_guidance_edit += sem_guidance[i]

            elif enable_edit_guidance:
                if self.activation_mask is None:
                    self.activation_mask = mint.zeros(
                        (len(timesteps), self.enabled_editing_prompts, *noise_pred_edit_concepts[0].shape)
                    )
                if self.sem_guidance is None:
                    self.sem_guidance = mint.zeros((len(timesteps), *noise_pred_uncond.shape))

                # noise_guidance_edit = mint.zeros_like(noise_guidance)
                for c, noise_pred_edit_concept in enumerate(noise_pred_edit_concepts):
                    if isinstance(edit_warmup_steps, list):
                        edit_warmup_steps_c = edit_warmup_steps[c]
                    else:
                        edit_warmup_steps_c = edit_warmup_steps
                    if i < edit_warmup_steps_c:
                        continue

                    if isinstance(edit_guidance_scale, list):
                        edit_guidance_scale_c = edit_guidance_scale[c]
                    else:
                        edit_guidance_scale_c = edit_guidance_scale

                    if isinstance(edit_threshold, list):
                        edit_threshold_c = edit_threshold[c]
                    else:
                        edit_threshold_c = edit_threshold
                    if isinstance(reverse_editing_direction, list):
                        reverse_editing_direction_c = reverse_editing_direction[c]
                    else:
                        reverse_editing_direction_c = reverse_editing_direction

                    if isinstance(edit_cooldown_steps, list):
                        edit_cooldown_steps_c = edit_cooldown_steps[c]
                    elif edit_cooldown_steps is None:
                        edit_cooldown_steps_c = i + 1
                    else:
                        edit_cooldown_steps_c = edit_cooldown_steps

                    if i >= edit_cooldown_steps_c:
                        continue

                    noise_guidance_edit_tmp = noise_pred_edit_concept - noise_pred_uncond

                    if reverse_editing_direction_c:
                        noise_guidance_edit_tmp = noise_guidance_edit_tmp * -1

                    noise_guidance_edit_tmp = noise_guidance_edit_tmp * edit_guidance_scale_c

                    if user_mask is not None:
                        noise_guidance_edit_tmp = noise_guidance_edit_tmp * user_mask

                    if use_cross_attn_mask:
                        out = self.attention_store.aggregate_attention(
                            attention_maps=self.attention_store.step_store,
                            prompts=self.text_cross_attention_maps,
                            res=att_res,
                            from_where=["up", "down"],
                            is_cross=True,
                            select=self.text_cross_attention_maps.index(editing_prompt[c]),
                        )
                        attn_map = out[:, :, :, 1 : 1 + num_edit_tokens[c]]  # 0 -> startoftext

                        # average over all tokens
                        if attn_map.shape[3] != num_edit_tokens[c]:
                            raise ValueError(
                                f"Incorrect shape of attention_map. Expected size {num_edit_tokens[c]}, but found {attn_map.shape[3]}!"
                            )
                        attn_map = mint.sum(attn_map, dim=3)

                        # gaussian_smoothing
                        attn_map = mint.nn.functional.pad(attn_map.unsqueeze(1), (1, 1, 1, 1), mode="reflect")
                        attn_map = self.smoothing(attn_map).squeeze(1)

                        # ops.quantile function expects float32
                        # TODO: ops.quantile is not supported
                        if attn_map.dtype == ms.float32:
                            tmp = ms.tensor(
                                np.quantile(attn_map.flatten(start_dim=1).numpy(), edit_threshold_c, axis=1)
                            )
                        else:
                            tmp = ms.tensor(
                                np.quantile(
                                    attn_map.flatten(start_dim=1).to(ms.float32).numpy(), edit_threshold_c, axis=1
                                )
                            ).to(attn_map.dtype)
                        attn_mask = mint.where(
                            attn_map >= tmp.unsqueeze(1).unsqueeze(1).tile((1, *att_res)), 1.0, 0.0
                        )

                        # resolution must match latent space dimension
                        attn_mask = mint.nn.functional.interpolate(
                            attn_mask.unsqueeze(1),
                            noise_guidance_edit_tmp.shape[-2:],  # 64,64
                        ).tile((1, 4, 1, 1))
                        self.activation_mask[i, c] = ops.stop_gradient(attn_mask)
                        if not use_intersect_mask:
                            noise_guidance_edit_tmp = noise_guidance_edit_tmp * attn_mask

                    if use_intersect_mask:
                        noise_guidance_edit_tmp_quantile = mint.abs(noise_guidance_edit_tmp)
                        noise_guidance_edit_tmp_quantile = mint.sum(
                            noise_guidance_edit_tmp_quantile, dim=1, keepdim=True
                        )
                        noise_guidance_edit_tmp_quantile = noise_guidance_edit_tmp_quantile.tile(
                            (1, self.unet.config.in_channels, 1, 1)
                        )

                        # ops.quantile function expects float32
                        # TODO: ops.quantile is not supported
                        if noise_guidance_edit_tmp_quantile.dtype == ms.float32:
                            tmp = ms.tensor(
                                np.quantile(
                                    noise_guidance_edit_tmp_quantile.flatten(start_dim=2).numpy(),
                                    edit_threshold_c,
                                    axis=2,
                                    keepdims=False,
                                )
                            )
                        else:
                            tmp = ms.tensor(
                                np.quantile(
                                    noise_guidance_edit_tmp_quantile.flatten(start_dim=2).to(ms.float32).numpy(),
                                    edit_threshold_c,
                                    axis=2,
                                    keepdims=False,
                                )
                            ).to(noise_guidance_edit_tmp_quantile.dtype)

                        intersect_mask = (
                            mint.where(
                                noise_guidance_edit_tmp_quantile >= tmp[:, :, None, None],
                                mint.ones_like(noise_guidance_edit_tmp),
                                mint.zeros_like(noise_guidance_edit_tmp),
                            )
                            * attn_mask
                        )

                        self.activation_mask[i, c] = ops.stop_gradient(intersect_mask)

                        noise_guidance_edit_tmp = noise_guidance_edit_tmp * intersect_mask

                    elif not use_cross_attn_mask:
                        # calculate quantile
                        noise_guidance_edit_tmp_quantile = mint.abs(noise_guidance_edit_tmp)
                        noise_guidance_edit_tmp_quantile = mint.sum(
                            noise_guidance_edit_tmp_quantile, dim=1, keepdim=True
                        )
                        noise_guidance_edit_tmp_quantile = noise_guidance_edit_tmp_quantile.tile((1, 4, 1, 1))

                        # ops.quantile function expects float32
                        # TODO: ops.quantile is not supported
                        if noise_guidance_edit_tmp_quantile.dtype == ms.float32:
                            tmp = ms.tensor(
                                np.quantile(
                                    noise_guidance_edit_tmp_quantile.flatten(start_dim=2).numpy(),
                                    edit_threshold_c,
                                    axis=2,
                                    keepdims=False,
                                )
                            )
                        else:
                            tmp = ms.tensor(
                                np.quantile(
                                    noise_guidance_edit_tmp_quantile.flatten(start_dim=2).to(ms.float32).numpy(),
                                    edit_threshold_c,
                                    axis=2,
                                    keepdims=False,
                                )
                            ).to(noise_guidance_edit_tmp_quantile.dtype)

                        self.activation_mask[i, c] = ops.stop_gradient(
                            mint.where(
                                noise_guidance_edit_tmp_quantile >= tmp[:, :, None, None],
                                mint.ones_like(noise_guidance_edit_tmp),
                                mint.zeros_like(noise_guidance_edit_tmp),
                            )
                        )

                        noise_guidance_edit_tmp = mint.where(
                            noise_guidance_edit_tmp_quantile >= tmp[:, :, None, None],
                            noise_guidance_edit_tmp,
                            mint.zeros_like(noise_guidance_edit_tmp),
                        )

                    noise_guidance_edit += noise_guidance_edit_tmp

                self.sem_guidance[i] = ops.stop_gradient(noise_guidance_edit)

            noise_pred = noise_pred_uncond + noise_guidance_edit

            # compute the previous noisy sample x_t -> x_t-1
            if enable_edit_guidance and self.guidance_rescale > 0.0:
                # Based on 3.4. in https://arxiv.org/pdf/2305.08891.pdf
                noise_pred = rescale_noise_cfg(
                    noise_pred,
                    noise_pred_edit_concepts.mean(dim=0, keepdim=False),
                    guidance_rescale=self.guidance_rescale,
                )

            idx = t_to_idx[int(t)]
            latents = self.scheduler.step(
                noise_pred, t, latents, variance_noise=zs[idx], **extra_step_kwargs, return_dict=False
            )[0]

            # step callback
            if use_cross_attn_mask:
                store_step = i in attn_store_steps
                self.attention_store.between_steps(store_step)

            if callback_on_step_end is not None:
                callback_kwargs = {}
                for k in callback_on_step_end_tensor_inputs:
                    callback_kwargs[k] = locals()[k]
                callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                latents = callback_outputs.pop("latents", latents)
                prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
                negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)
                add_text_embeds = callback_outputs.pop("add_text_embeds", add_text_embeds)
                negative_pooled_prompt_embeds = callback_outputs.pop(
                    "negative_pooled_prompt_embeds", negative_pooled_prompt_embeds
                )
                add_time_ids = callback_outputs.pop("add_time_ids", add_time_ids)
                # negative_add_time_ids = callback_outputs.pop("negative_add_time_ids", negative_add_time_ids)

            # call the callback, if provided
            if i == len(timesteps) - 1 or ((i + 1) > 0 and (i + 1) % self.scheduler.order == 0):
                progress_bar.update()

    if not output_type == "latent":
        # make sure the VAE is in float32 mode, as it overflows in float16
        needs_upcasting = self.vae.dtype == ms.float16 and self.vae.config.force_upcast

        if needs_upcasting:
            self.upcast_vae()
            latents = latents.to(next(iter(self.vae.post_quant_conv.get_parameters())).dtype)

        image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]

        # cast back to fp16 if needed
        if needs_upcasting:
            self.vae.to(dtype=ms.float16)
    else:
        image = latents

    if not output_type == "latent":
        # apply watermark if available
        if self.watermark is not None:
            image = self.watermark.apply_watermark(image)

        image = self.image_processor.postprocess(image, output_type=output_type)

    if not return_dict:
        return (image,)

    return LEditsPPDiffusionPipelineOutput(images=image, nsfw_content_detected=None)
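The `guidance_rescale` branch in the denoising loop above applies the noise rescaling from Section 3.4 of arXiv:2305.08891 to counter over-exposure from classifier-free guidance. A self-contained NumPy sketch of that formula (a hypothetical stand-alone helper mirroring `rescale_noise_cfg`, not the pipeline's own import):

```python
import numpy as np

def rescale_noise_cfg(noise_cfg, noise_pred_text, guidance_rescale=0.7):
    """Rescale the guided noise prediction so its std matches the
    text-conditional prediction, then blend with the original.

    Based on Sec. 3.4 of "Common Diffusion Noise Schedules and Sample
    Steps are Flawed" (arXiv:2305.08891).
    """
    # per-sample std over all non-batch dimensions
    axes = tuple(range(1, noise_cfg.ndim))
    std_text = noise_pred_text.std(axis=axes, keepdims=True)
    std_cfg = noise_cfg.std(axis=axes, keepdims=True)
    rescaled = noise_cfg * (std_text / std_cfg)
    # guidance_rescale interpolates between rescaled and raw predictions
    return guidance_rescale * rescaled + (1 - guidance_rescale) * noise_cfg
```

With `guidance_rescale=0.0` the prediction passes through unchanged, which is why the pipeline only enters this branch when the value is positive.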

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusionXL.disable_vae_slicing()

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method will go back to computing decoding in one step.

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion_xl.py
def disable_vae_slicing(self):
    r"""
    Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
    computing decoding in one step.
    """
    self.vae.disable_slicing()

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusionXL.disable_vae_tiling()

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method will go back to computing decoding in one step.

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion_xl.py
def disable_vae_tiling(self):
    r"""
    Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
    computing decoding in one step.
    """
    self.vae.disable_tiling()

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusionXL.enable_vae_slicing()

Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion_xl.py
def enable_vae_slicing(self):
    r"""
    Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
    compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
    """
    self.vae.enable_slicing()
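Conceptually, sliced decoding bounds peak memory by pushing the batch through the decoder one sample at a time instead of all at once. A schematic NumPy illustration (the real logic lives inside the VAE; `decode_sliced` and `fake_decode` are hypothetical stand-ins):

```python
import numpy as np

def decode_sliced(latents, decode_fn):
    """Decode a latent batch slice-by-slice to bound peak memory.

    `decode_fn` maps a (1, C, H, W) latent to a decoded image batch of
    size 1; only one slice is resident in the decoder at any time.
    """
    slices = [decode_fn(latents[i : i + 1]) for i in range(latents.shape[0])]
    return np.concatenate(slices, axis=0)

def fake_decode(z):
    # stand-in decoder: nearest-neighbour 8x upsample of the first 3 channels
    return np.repeat(np.repeat(z[:, :3], 8, axis=2), 8, axis=3)
```

The result is identical to decoding the whole batch at once; the trade-off is purely throughput versus memory, which is why `disable_vae_slicing` restores one-step decoding without changing outputs.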

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusionXL.enable_vae_tiling()

Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow processing larger images.

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion_xl.py
def enable_vae_tiling(self):
    r"""
    Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
    compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
    processing larger images.
    """
    self.vae.enable_tiling()

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusionXL.encode_prompt(num_images_per_prompt=1, negative_prompt=None, negative_prompt_2=None, negative_prompt_embeds=None, negative_pooled_prompt_embeds=None, lora_scale=None, clip_skip=None, enable_edit_guidance=True, editing_prompt=None, editing_prompt_embeds=None, editing_pooled_prompt_embeds=None)

Encodes the prompt into text encoder hidden states.

PARAMETER DESCRIPTION
num_images_per_prompt

number of images that should be generated per prompt

TYPE: `int` DEFAULT: 1

negative_prompt

The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead.

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

negative_prompt_2

The prompt or prompts not to guide the image generation to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text-encoders

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

negative_prompt_embeds

Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

negative_pooled_prompt_embeds

Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from negative_prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

lora_scale

A lora scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.

TYPE: `float`, *optional* DEFAULT: None

clip_skip

Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

TYPE: `int`, *optional* DEFAULT: None

enable_edit_guidance

Whether to guide towards an editing prompt or not.

TYPE: `bool` DEFAULT: True

editing_prompt

Editing prompt(s) to be encoded. If not defined and 'enable_edit_guidance' is True, one has to pass editing_prompt_embeds instead.

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

editing_prompt_embeds

Pre-generated edit text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided and 'enable_edit_guidance' is True, editing_prompt_embeds will be generated from editing_prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

editing_pooled_prompt_embeds

Pre-generated edit pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled editing_pooled_prompt_embeds will be generated from editing_prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion_xl.py
def encode_prompt(
    self,
    num_images_per_prompt: int = 1,
    negative_prompt: Optional[str] = None,
    negative_prompt_2: Optional[str] = None,
    negative_prompt_embeds: Optional[ms.Tensor] = None,
    negative_pooled_prompt_embeds: Optional[ms.Tensor] = None,
    lora_scale: Optional[float] = None,
    clip_skip: Optional[int] = None,
    enable_edit_guidance: bool = True,
    editing_prompt: Optional[str] = None,
    editing_prompt_embeds: Optional[ms.Tensor] = None,
    editing_pooled_prompt_embeds: Optional[ms.Tensor] = None,
) -> object:
    r"""
    Encodes the prompt into text encoder hidden states.

    Args:
        num_images_per_prompt (`int`):
            number of images that should be generated per prompt
        negative_prompt (`str` or `List[str]`, *optional*):
            The prompt or prompts not to guide the image generation. If not defined, one has to pass
            `negative_prompt_embeds` instead.
        negative_prompt_2 (`str` or `List[str]`, *optional*):
            The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
            `text_encoder_2`. If not defined, `negative_prompt` is used in both text-encoders
        negative_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
            weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
            argument.
        negative_pooled_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
            weighting. If not provided, pooled negative_prompt_embeds will be generated from `negative_prompt`
            input argument.
        lora_scale (`float`, *optional*):
            A lora scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
        clip_skip (`int`, *optional*):
            Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
            the output of the pre-final layer will be used for computing the prompt embeddings.
        enable_edit_guidance (`bool`):
            Whether to guide towards an editing prompt or not.
        editing_prompt (`str` or `List[str]`, *optional*):
            Editing prompt(s) to be encoded. If not defined and 'enable_edit_guidance' is True, one has to pass
            `editing_prompt_embeds` instead.
        editing_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated edit text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
            If not provided and 'enable_edit_guidance' is True, editing_prompt_embeds will be generated from
            `editing_prompt` input argument.
        editing_pooled_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated edit pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
            weighting. If not provided, pooled editing_pooled_prompt_embeds will be generated from `editing_prompt`
            input argument.
    """
    # set lora scale so that monkey patched LoRA
    # function of text encoder can correctly access it
    if lora_scale is not None and isinstance(self, StableDiffusionXLLoraLoaderMixin):
        self._lora_scale = lora_scale

        # dynamically adjust the LoRA scale
        if self.text_encoder is not None:
            scale_lora_layers(self.text_encoder, lora_scale)

        if self.text_encoder_2 is not None:
            scale_lora_layers(self.text_encoder_2, lora_scale)

    batch_size = self.batch_size

    # Define tokenizers and text encoders
    tokenizers = [self.tokenizer, self.tokenizer_2] if self.tokenizer is not None else [self.tokenizer_2]
    text_encoders = (
        [self.text_encoder, self.text_encoder_2] if self.text_encoder is not None else [self.text_encoder_2]
    )
    num_edit_tokens = 0

    # get unconditional embeddings for classifier free guidance
    zero_out_negative_prompt = negative_prompt is None and self.config.force_zeros_for_empty_prompt

    if negative_prompt_embeds is None:
        negative_prompt = negative_prompt or ""
        negative_prompt_2 = negative_prompt_2 or negative_prompt

        # normalize str to list
        negative_prompt = batch_size * [negative_prompt] if isinstance(negative_prompt, str) else negative_prompt
        negative_prompt_2 = (
            batch_size * [negative_prompt_2] if isinstance(negative_prompt_2, str) else negative_prompt_2
        )

        uncond_tokens: List[str]

        if batch_size != len(negative_prompt):
            raise ValueError(
                f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but image inversion "
                f" has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
                " the batch size of the input images."
            )
        else:
            uncond_tokens = [negative_prompt, negative_prompt_2]

        negative_prompt_embeds_list = []
        for negative_prompt, tokenizer, text_encoder in zip(uncond_tokens, tokenizers, text_encoders):
            if isinstance(self, TextualInversionLoaderMixin):
                negative_prompt = self.maybe_convert_prompt(negative_prompt, tokenizer)

            uncond_input = tokenizer(
                negative_prompt,
                padding="max_length",
                max_length=tokenizer.model_max_length,
                truncation=True,
                return_tensors="np",
            )

            negative_prompt_embeds = text_encoder(
                ms.tensor(uncond_input.input_ids),
                output_hidden_states=True,
            )
            # We are only ALWAYS interested in the pooled output of the final text encoder
            negative_pooled_prompt_embeds = negative_prompt_embeds[0]
            negative_prompt_embeds = negative_prompt_embeds[2][-2]

            negative_prompt_embeds_list.append(negative_prompt_embeds)

        negative_prompt_embeds = mint.concat(negative_prompt_embeds_list, dim=-1)

        if zero_out_negative_prompt:
            negative_prompt_embeds = mint.zeros_like(negative_prompt_embeds)
            negative_pooled_prompt_embeds = mint.zeros_like(negative_pooled_prompt_embeds)

    if enable_edit_guidance and editing_prompt_embeds is None:
        editing_prompt_2 = editing_prompt

        editing_prompts = [editing_prompt, editing_prompt_2]
        edit_prompt_embeds_list = []

        for editing_prompt, tokenizer, text_encoder in zip(editing_prompts, tokenizers, text_encoders):
            if isinstance(self, TextualInversionLoaderMixin):
                editing_prompt = self.maybe_convert_prompt(editing_prompt, tokenizer)

            max_length = negative_prompt_embeds.shape[1]
            edit_concepts_input = tokenizer(
                # [x for item in editing_prompt for x in repeat(item, batch_size)],
                editing_prompt,
                padding="max_length",
                max_length=max_length,
                truncation=True,
                return_tensors="np",
                return_length=True,
            )
            num_edit_tokens = edit_concepts_input.length - 2

            edit_concepts_embeds = text_encoder(
                ms.tensor(edit_concepts_input.input_ids),
                output_hidden_states=True,
            )
            # We are only ALWAYS interested in the pooled output of the final text encoder
            editing_pooled_prompt_embeds = edit_concepts_embeds[0]
            if clip_skip is None:
                edit_concepts_embeds = edit_concepts_embeds[2][-2]
            else:
                # "2" because SDXL always indexes from the penultimate layer.
                edit_concepts_embeds = edit_concepts_embeds[2][-(clip_skip + 2)]

            edit_prompt_embeds_list.append(edit_concepts_embeds)

        edit_concepts_embeds = mint.concat(edit_prompt_embeds_list, dim=-1)
    elif not enable_edit_guidance:
        edit_concepts_embeds = None
        editing_pooled_prompt_embeds = None

    negative_prompt_embeds = negative_prompt_embeds.to(dtype=self.text_encoder_2.dtype)
    bs_embed, seq_len, _ = negative_prompt_embeds.shape
    # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
    seq_len = negative_prompt_embeds.shape[1]
    negative_prompt_embeds = negative_prompt_embeds.to(dtype=self.text_encoder_2.dtype)
    negative_prompt_embeds = negative_prompt_embeds.tile((1, num_images_per_prompt, 1))
    negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)

    if enable_edit_guidance:
        bs_embed_edit, seq_len, _ = edit_concepts_embeds.shape
        edit_concepts_embeds = edit_concepts_embeds.to(dtype=self.text_encoder_2.dtype)
        edit_concepts_embeds = edit_concepts_embeds.tile((1, num_images_per_prompt, 1))
        edit_concepts_embeds = edit_concepts_embeds.view(bs_embed_edit * num_images_per_prompt, seq_len, -1)

    negative_pooled_prompt_embeds = negative_pooled_prompt_embeds.tile((1, num_images_per_prompt)).view(
        bs_embed * num_images_per_prompt, -1
    )

    if enable_edit_guidance:
        editing_pooled_prompt_embeds = editing_pooled_prompt_embeds.tile((1, num_images_per_prompt)).view(
            bs_embed_edit * num_images_per_prompt, -1
        )

    if self.text_encoder is not None:
        if isinstance(self, StableDiffusionXLLoraLoaderMixin):
            # Retrieve the original scale by scaling back the LoRA layers
            unscale_lora_layers(self.text_encoder, lora_scale)

    if self.text_encoder_2 is not None:
        if isinstance(self, StableDiffusionXLLoraLoaderMixin):
            # Retrieve the original scale by scaling back the LoRA layers
            unscale_lora_layers(self.text_encoder_2, lora_scale)

    return (
        negative_prompt_embeds,
        edit_concepts_embeds,
        negative_pooled_prompt_embeds,
        editing_pooled_prompt_embeds,
        num_edit_tokens,
    )
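The tail of `encode_prompt` duplicates each embedding `num_images_per_prompt` times along the batch axis via the `tile(...).view(...)` pattern. A NumPy sketch of just that reshaping step (an illustrative stand-alone function, not part of the pipeline):

```python
import numpy as np

def repeat_per_prompt(embeds, num_images_per_prompt):
    """Repeat each prompt embedding along the batch axis.

    Mirrors `tensor.tile((1, n, 1)).view(b * n, seq_len, -1)`: sample i
    of the input occupies rows [i*n, (i+1)*n) of the output, so all
    copies of one prompt stay adjacent.
    """
    bs, seq_len, dim = embeds.shape
    tiled = np.tile(embeds, (1, num_images_per_prompt, 1))
    return tiled.reshape(bs * num_images_per_prompt, seq_len, dim)
```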

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusionXL.get_guidance_scale_embedding(w, embedding_dim=512, dtype=ms.float32)

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

PARAMETER DESCRIPTION
w

Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.

TYPE: `ms.Tensor`

embedding_dim

Dimension of the embeddings to generate.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

dtype

Data type of the generated embeddings.

TYPE: `ms.Type`, *optional*, defaults to `ms.float32` DEFAULT: float32

RETURNS DESCRIPTION
Tensor

ms.Tensor: Embedding vectors with shape (len(w), embedding_dim).

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion_xl.py
def get_guidance_scale_embedding(
    self, w: ms.Tensor, embedding_dim: int = 512, dtype: ms.Type = ms.float32
) -> ms.Tensor:
    """
    See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

    Args:
        w (`ms.Tensor`):
            Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
        embedding_dim (`int`, *optional*, defaults to 512):
            Dimension of the embeddings to generate.
        dtype (`ms.Type`, *optional*, defaults to `ms.float32`):
            Data type of the generated embeddings.

    Returns:
        `ms.Tensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
    """
    assert len(w.shape) == 1
    w = w * 1000.0

    half_dim = embedding_dim // 2
    emb = mint.log(ms.tensor(10000.0)) / (half_dim - 1)
    emb = mint.exp(mint.arange(half_dim, dtype=dtype) * -emb)
    emb = w.to(dtype)[:, None] * emb[None, :]
    emb = mint.cat([mint.sin(emb), mint.cos(emb)], dim=1)
    if embedding_dim % 2 == 1:  # zero pad
        emb = mint.nn.functional.pad(emb, (0, 1))
    assert emb.shape == (w.shape[0], embedding_dim)
    return emb
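For intuition, the same sinusoidal embedding can be reproduced outside the pipeline. The NumPy sketch below mirrors the method above (an illustration, not the pipeline's implementation):

```python
import numpy as np

def guidance_scale_embedding(w, embedding_dim=512):
    """Sinusoidal embedding of guidance scales, as in the VDM codebase.

    Returns an array of shape (len(w), embedding_dim).
    """
    w = np.asarray(w, dtype=np.float64) * 1000.0
    half_dim = embedding_dim // 2
    # geometric frequency ladder from 1 down to 1/10000
    freqs = np.exp(-np.log(10000.0) * np.arange(half_dim) / (half_dim - 1))
    args = w[:, None] * freqs[None, :]
    emb = np.concatenate([np.sin(args), np.cos(args)], axis=1)
    if embedding_dim % 2 == 1:  # zero-pad odd dimensions
        emb = np.pad(emb, ((0, 0), (0, 1)))
    return emb
```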

mindone.diffusers.pipelines.ledits_pp.LEditsPPPipelineStableDiffusionXL.invert(image, source_prompt='', source_guidance_scale=3.5, negative_prompt=None, negative_prompt_2=None, num_inversion_steps=50, skip=0.15, generator=None, crops_coords_top_left=(0, 0), num_zero_noise_steps=3, cross_attention_kwargs=None, height=None, width=None, resize_mode='default', crops_coords=None)

The function of the pipeline for image inversion as described by the LEDITS++ paper. If the scheduler is set to [~schedulers.DDIMScheduler], the inversion proposed by edit-friendly DDPM will be performed instead.

PARAMETER DESCRIPTION
image

Input for the image(s) that are to be edited. Multiple input images have to share the same aspect ratio.

TYPE: `PipelineImageInput`

source_prompt

Prompt describing the input image that will be used for guidance during inversion. Guidance is disabled if the source_prompt is "".

TYPE: `str`, defaults to `""` DEFAULT: ''

source_guidance_scale

Strength of guidance during inversion.

TYPE: `float`, defaults to `3.5` DEFAULT: 3.5

negative_prompt

The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

negative_prompt_2

The prompt or prompts not to guide the image generation to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text-encoders

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

num_inversion_steps

Number of total performed inversion steps after discarding the initial skip steps.

TYPE: `int`, defaults to `50` DEFAULT: 50

skip

Portion of initial steps that will be ignored for inversion and subsequent generation. Lower values will lead to stronger changes to the input image. skip has to be between 0 and 1.

TYPE: `float`, defaults to `0.15` DEFAULT: 0.15

generator

A np.random.Generator to make inversion deterministic.

TYPE: `np.random.Generator`, *optional* DEFAULT: None

crops_coords_top_left

crops_coords_top_left can be used to generate an image that appears to be "cropped" from the position crops_coords_top_left downwards. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952.

TYPE: `Tuple[int]`, *optional*, defaults to `(0, 0)` DEFAULT: (0, 0)

num_zero_noise_steps

Number of final diffusion steps that will not renoise the current image. If no steps are set to zero SD-XL in combination with [DPMSolverMultistepScheduler] will produce noise artifacts.

TYPE: `int`, defaults to `3` DEFAULT: 3

cross_attention_kwargs

A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.

TYPE: `dict`, *optional* DEFAULT: None

RETURNS DESCRIPTION

[~pipelines.ledits_pp.LEditsPPInversionPipelineOutput]: Output will contain the resized input image(s) and respective VAE reconstruction(s).
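The relationship between `skip` and the number of scheduler steps is plain arithmetic: the method schedules `int(num_inversion_steps * (1 + skip))` timesteps and keeps only the last `num_inversion_steps`, so the initial portion is discarded. With the default values:

```python
# How `skip` interacts with `num_inversion_steps` (default values shown).
num_inversion_steps = 50
skip = 0.15

total_scheduled = int(num_inversion_steps * (1 + skip))  # timesteps given to the scheduler
kept = num_inversion_steps                               # timesteps actually inverted
discarded = total_scheduled - kept                       # initial steps ignored
```

Lowering `skip` discards fewer initial steps, which in turn allows stronger changes to the input image during editing.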

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_leditspp_stable_diffusion_xl.py
def invert(
    self,
    image: PipelineImageInput,
    source_prompt: str = "",
    source_guidance_scale=3.5,
    negative_prompt: str = None,
    negative_prompt_2: str = None,
    num_inversion_steps: int = 50,
    skip: float = 0.15,
    generator: Optional[np.random.Generator] = None,
    crops_coords_top_left: Tuple[int, int] = (0, 0),
    num_zero_noise_steps: int = 3,
    cross_attention_kwargs: Optional[Dict[str, Any]] = None,
    height: Optional[int] = None,
    width: Optional[int] = None,
    resize_mode: Optional[str] = "default",
    crops_coords: Optional[Tuple[int, int, int, int]] = None,
):
    r"""
    The function to the pipeline for image inversion as described by the [LEDITS++
    Paper](https://arxiv.org/abs/2301.12247). If the scheduler is set to [`~schedulers.DDIMScheduler`] the
    inversion proposed by [edit-friendly DDPM](https://arxiv.org/abs/2304.06140) will be performed instead.

    Args:
        image (`PipelineImageInput`):
            Input for the image(s) that are to be edited. Multiple input images have to default to the same aspect
            ratio.
        source_prompt (`str`, defaults to `""`):
            Prompt describing the input image that will be used for guidance during inversion. Guidance is disabled
            if the `source_prompt` is `""`.
        source_guidance_scale (`float`, defaults to `3.5`):
            Strength of guidance during inversion.
        negative_prompt (`str` or `List[str]`, *optional*):
            The prompt or prompts not to guide the image generation. If not defined, one has to pass
            `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
            less than `1`).
        negative_prompt_2 (`str` or `List[str]`, *optional*):
            The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
            `text_encoder_2`. If not defined, `negative_prompt` is used in both text-encoders
        num_inversion_steps (`int`, defaults to `50`):
            Number of total performed inversion steps after discarding the initial `skip` steps.
        skip (`float`, defaults to `0.15`):
            Portion of initial steps that will be ignored for inversion and subsequent generation. Lower values
            will lead to stronger changes to the input image. `skip` has to be between `0` and `1`.
        generator (`np.random.Generator`, *optional*):
            A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make inversion
            deterministic.
        crops_coords_top_left (`Tuple[int]`, *optional*, defaults to (0, 0)):
            `crops_coords_top_left` can be used to generate an image that appears to be "cropped" from the position
            `crops_coords_top_left` downwards. Favorable, well-centered images are usually achieved by setting
            `crops_coords_top_left` to (0, 0). Part of SDXL's micro-conditioning as explained in section 2.2 of
            [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
        num_zero_noise_steps (`int`, defaults to `3`):
            Number of final diffusion steps that will not renoise the current image. If no steps are set to zero
            SD-XL in combination with [`DPMSolverMultistepScheduler`] will produce noise artifacts.
        cross_attention_kwargs (`dict`, *optional*):
            A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
            `self.processor` in
            [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).

    Returns:
        [`~pipelines.ledits_pp.LEditsPPInversionPipelineOutput`]: Output will contain the resized input image(s)
        and respective VAE reconstruction(s).
    """
    if (height is not None and height % 32 != 0) or (width is not None and width % 32 != 0):
        raise ValueError("height and width must be multiples of 32.")

    # Reset attn processor, we do not want to store attn maps during inversion
    self.unet.set_attn_processor(AttnProcessor())

    self.eta = 1.0

    self.scheduler.config.timestep_spacing = "leading"
    self.scheduler.set_timesteps(int(num_inversion_steps * (1 + skip)))
    self.inversion_steps = self.scheduler.timesteps[-num_inversion_steps:]
    timesteps = self.inversion_steps

    num_images_per_prompt = 1

    # 0. Ensure that only uncond embedding is used if prompt = ""
    if source_prompt == "":
        # noise pred should only be noise_pred_uncond
        source_guidance_scale = 0.0
        do_classifier_free_guidance = False
    else:
        do_classifier_free_guidance = source_guidance_scale > 1.0

    # 1. prepare image
    x0, resized = self.encode_image(
        image,
        dtype=self.text_encoder_2.dtype,
        height=height,
        width=width,
        resize_mode=resize_mode,
        crops_coords=crops_coords,
    )
    width = x0.shape[2] * self.vae_scale_factor
    height = x0.shape[3] * self.vae_scale_factor
    self.size = (height, width)

    self.batch_size = x0.shape[0]

    # 2. get embeddings
    text_encoder_lora_scale = (
        cross_attention_kwargs.get("scale", None) if cross_attention_kwargs is not None else None
    )

    if isinstance(source_prompt, str):
        source_prompt = [source_prompt] * self.batch_size

    (
        negative_prompt_embeds,
        prompt_embeds,
        negative_pooled_prompt_embeds,
        edit_pooled_prompt_embeds,
        _,
    ) = self.encode_prompt(
        num_images_per_prompt=num_images_per_prompt,
        negative_prompt=negative_prompt,
        negative_prompt_2=negative_prompt_2,
        editing_prompt=source_prompt,
        lora_scale=text_encoder_lora_scale,
        enable_edit_guidance=do_classifier_free_guidance,
    )
    if self.text_encoder_2 is None:
        text_encoder_projection_dim = int(negative_pooled_prompt_embeds.shape[-1])
    else:
        text_encoder_projection_dim = self.text_encoder_2.config.projection_dim

    # 3. Prepare added time ids & embeddings
    add_text_embeds = negative_pooled_prompt_embeds
    add_time_ids = self._get_add_time_ids(
        self.size,
        crops_coords_top_left,
        self.size,
        dtype=negative_prompt_embeds.dtype,
        text_encoder_projection_dim=text_encoder_projection_dim,
    )

    if do_classifier_free_guidance:
        negative_prompt_embeds = mint.cat([negative_prompt_embeds, prompt_embeds], dim=0)
        add_text_embeds = mint.cat([add_text_embeds, edit_pooled_prompt_embeds], dim=0)
        add_time_ids = mint.cat([add_time_ids, add_time_ids], dim=0)

    negative_prompt_embeds = negative_prompt_embeds

    add_text_embeds = add_text_embeds
    add_time_ids = add_time_ids.tile((self.batch_size * num_images_per_prompt, 1))

    # autoencoder reconstruction
    if self.vae.dtype == ms.float16 and self.vae.config.force_upcast:
        self.upcast_vae()
        x0_tmp = x0.to(next(iter(self.vae.post_quant_conv.get_parameters())).dtype)
        image_rec = self.vae.decode(
            x0_tmp / self.vae.config.scaling_factor, return_dict=False, generator=generator
        )[0]
        # cast back to fp16 if needed
        # TODO: upcast_vae in MS
        self.vae.to(dtype=ms.float16)
    elif self.vae.config.force_upcast:
        x0_tmp = x0.to(next(iter(self.vae.post_quant_conv.get_parameters())).dtype)
        image_rec = self.vae.decode(
            x0_tmp / self.vae.config.scaling_factor, return_dict=False, generator=generator
        )[0]
    else:
        image_rec = self.vae.decode(x0 / self.vae.config.scaling_factor, return_dict=False, generator=generator)[0]

    image_rec = self.image_processor.postprocess(image_rec, output_type="pil")

    # 5. find zs and xts
    variance_noise_shape = (num_inversion_steps, *x0.shape)

    # intermediate latents
    t_to_idx = {int(v): k for k, v in enumerate(timesteps)}
    xts = mint.zeros(size=variance_noise_shape, dtype=negative_prompt_embeds.dtype)

    for t in reversed(timesteps):
        idx = num_inversion_steps - t_to_idx[int(t)] - 1
        noise = randn_tensor(shape=x0.shape, generator=generator, dtype=x0.dtype)
        xts[idx] = self.scheduler.add_noise(x0, noise, t.unsqueeze(0))
    xts = mint.cat([x0.unsqueeze(0), xts], dim=0)

    # noise maps
    zs = mint.zeros(size=variance_noise_shape, dtype=negative_prompt_embeds.dtype)

    self.scheduler.set_timesteps(len(self.scheduler.timesteps))

    for t in self.progress_bar(timesteps):
        idx = num_inversion_steps - t_to_idx[int(t)] - 1
        # 1. predict noise residual
        xt = xts[idx + 1]

        latent_model_input = mint.cat([xt] * 2) if do_classifier_free_guidance else xt
        latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

        added_cond_kwargs = {"text_embeds": add_text_embeds, "time_ids": add_time_ids}

        noise_pred = self.unet(
            latent_model_input,
            t,
            encoder_hidden_states=negative_prompt_embeds,
            cross_attention_kwargs=cross_attention_kwargs,
            added_cond_kwargs=added_cond_kwargs,
            return_dict=False,
        )[0]

        # 2. perform guidance
        if do_classifier_free_guidance:
            noise_pred_out = noise_pred.chunk(2)
            noise_pred_uncond, noise_pred_text = noise_pred_out[0], noise_pred_out[1]
            noise_pred = noise_pred_uncond + source_guidance_scale * (noise_pred_text - noise_pred_uncond)

        xtm1 = xts[idx]
        z, xtm1_corrected = compute_noise(self.scheduler, xtm1, xt, t, noise_pred, self.eta)
        zs[idx] = z

        # correction to avoid error accumulation
        xts[idx] = xtm1_corrected

    self.init_latents = xts[-1]
    zs = zs.flip((0,))

    if num_zero_noise_steps > 0:
        zs[-num_zero_noise_steps:] = mint.zeros_like(zs[-num_zero_noise_steps:])
    self.zs = zs
    return LEditsPPInversionPipelineOutput(images=resized, vae_reconstruction_images=image_rec)
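The core of the loop above is `compute_noise`: given consecutive latents `x_t` and `x_{t-1}` from the forward pass, it recovers the noise map `z` such that replaying the stochastic reverse step with `z` lands exactly on `x_{t-1}` (this is also why `xts[idx]` is overwritten with the corrected value, to avoid error accumulation). The sketch below illustrates that idea in isolation; `mu` is a stand-in for the scheduler's posterior mean and all names are hypothetical, not the mindone implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_step(xt, z, mu_fn, sigma):
    # one stochastic reverse step: x_{t-1} = mu(x_t) + sigma * z
    return mu_fn(xt) + sigma * z

def compute_noise(xtm1, xt, mu_fn, sigma):
    # invert the reverse step: recover the z that maps x_t onto x_{t-1}
    z = (xtm1 - mu_fn(xt)) / sigma
    # replaying the step with z reproduces x_{t-1} exactly (the "correction")
    xtm1_corrected = reverse_step(xt, z, mu_fn, sigma)
    return z, xtm1_corrected

mu = lambda x: 0.8 * x          # toy stand-in for the scheduler posterior mean
sigma = 0.1
xt = rng.standard_normal(4)     # latent at step t
xtm1 = rng.standard_normal(4)   # target latent at step t-1 from the forward pass

z, xtm1_corr = compute_noise(xtm1, xt, mu, sigma)
assert np.allclose(xtm1_corr, xtm1)
```

Because the recovered `z` makes the reverse step deterministic given the stored latents, the subsequent editing pass can reconstruct the input faithfully while edits only perturb `z` where needed.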

mindone.diffusers.pipelines.ledits_pp.pipeline_output.LEditsPPDiffusionPipelineOutput dataclass

Bases: BaseOutput

Output class for LEdits++ Diffusion pipelines.

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_output.py
@dataclass
class LEditsPPDiffusionPipelineOutput(BaseOutput):
    """
    Output class for LEdits++ Diffusion pipelines.

    Args:
        images (`List[PIL.Image.Image]` or `np.ndarray`):
            List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width,
            num_channels)`.
        nsfw_content_detected (`List[bool]`):
            List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content or
            `None` if safety checking could not be performed.
    """

    images: Union[List[PIL.Image.Image], np.ndarray]
    nsfw_content_detected: Optional[List[bool]]
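Like other diffusers-style outputs, `BaseOutput` subclasses can be read both as attributes and dict-style by key. The minimal stand-in below (hypothetical, not the actual `BaseOutput` implementation) sketches that access pattern.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class MiniOutput:
    """Minimal stand-in for a diffusers-style BaseOutput (hypothetical)."""
    images: List[str]
    nsfw_content_detected: Optional[List[bool]] = None

    def __getitem__(self, key):
        # support both string-key and positional access, as BaseOutput does
        if isinstance(key, str):
            return getattr(self, key)
        return tuple(getattr(self, f.name) for f in fields(self))[key]

out = MiniOutput(images=["img0.png"], nsfw_content_detected=[False])
assert out.images == out["images"] == ["img0.png"]
assert out[1] == [False]
```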

mindone.diffusers.pipelines.ledits_pp.pipeline_output.LEditsPPInversionPipelineOutput dataclass

Bases: BaseOutput

Output class for the LEdits++ inversion pipeline.

Source code in mindone/diffusers/pipelines/ledits_pp/pipeline_output.py
@dataclass
class LEditsPPInversionPipelineOutput(BaseOutput):
    """
    Output class for the LEdits++ inversion pipeline.

    Args:
        images (`List[PIL.Image.Image]` or `np.ndarray`):
            List of the cropped and resized input images as PIL images of length `batch_size` or NumPy array of
            shape `(batch_size, height, width, num_channels)`.
        vae_reconstruction_images (`List[PIL.Image.Image]` or `np.ndarray`):
            List of VAE reconstructions of all input images as PIL images of length `batch_size` or NumPy array of
            shape `(batch_size, height, width, num_channels)`.
    """

    images: Union[List[PIL.Image.Image], np.ndarray]
    vae_reconstruction_images: Union[List[PIL.Image.Image], np.ndarray]
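Since the inversion output carries both the resized inputs and their VAE reconstructions, comparing the two gives a quick sense of how much detail the autoencoder alone loses before any editing happens. The sketch below fakes both lists with random arrays (in practice you would convert the returned PIL images with `np.asarray`); the field names mirror the dataclass above, everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-ins for output.images and output.vae_reconstruction_images
inputs = [rng.random((8, 8, 3)) for _ in range(2)]
recons = [x + rng.normal(0.0, 0.01, x.shape) for x in inputs]

# mean absolute reconstruction error per image
errors = [float(np.abs(a - b).mean()) for a, b in zip(inputs, recons)]
assert all(e < 0.05 for e in errors)
```

A large per-image error here would indicate that artifacts in the edited result stem from the VAE itself rather than from the inversion or the edit guidance.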