EasyAnimate

EasyAnimate by Alibaba PAI.

The description from its GitHub page: EasyAnimate is a pipeline based on the transformer architecture, designed for generating AI images and videos, and for training baseline models and Lora models for Diffusion Transformer. We support direct prediction from pre-trained EasyAnimate models, allowing for the generation of videos with various resolutions, approximately 6 seconds in length, at 8fps (EasyAnimateV5.1, 1 to 49 frames). Additionally, users can train their own baseline and Lora models for specific style transformations.

This pipeline was contributed by bubbliiiing. The original codebase can be found here. The original weights can be found under hf.co/alibaba-pai.

There are two official EasyAnimate checkpoints for text-to-video and video-to-video.

checkpoints	recommended inference dtype
`alibaba-pai/EasyAnimateV5.1-12b-zh`	mindspore.float16
`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`	mindspore.float16

There is one official EasyAnimate checkpoint available for image-to-video and video-to-video.

checkpoints	recommended inference dtype
`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`	mindspore.float16

There are two official EasyAnimate checkpoints available for control-to-video.

checkpoints	recommended inference dtype
`alibaba-pai/EasyAnimateV5.1-12b-zh-Control`	mindspore.float16
`alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera`	mindspore.float16

For the EasyAnimateV5.1 series:

- Text-to-video (T2V) and image-to-video (I2V) work at multiple resolutions. The width and height can vary from 256 to 1024.
- Both T2V and I2V models support generation with 1 to 49 frames and work best within this range. Exporting videos at 8 FPS is recommended.
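The limits above can be checked before a request reaches the pipeline. The following is a minimal sketch and not part of the library: `snap_to_multiple`, `check_request` and `clip_seconds` are hypothetical helper names, the 256 to 1024 and 1 to 49 bounds are taken from the note above, and the rounding down to a multiple of 16 mirrors what the pipeline's `__call__` does with `height` and `width`.

```python
from typing import Tuple


def snap_to_multiple(value: int, multiple: int = 16) -> int:
    """Round `value` down to the nearest multiple of `multiple`."""
    return (value // multiple) * multiple


def check_request(height: int, width: int, num_frames: int) -> Tuple[int, int]:
    """Validate a T2V/I2V request against the documented V5.1 limits.

    Returns the (height, width) pair after snapping each side to a multiple of 16.
    """
    for name, side in (("height", height), ("width", width)):
        if not 256 <= side <= 1024:
            raise ValueError(f"{name}={side} is outside the documented 256..1024 range")
    if not 1 <= num_frames <= 49:
        raise ValueError(f"num_frames={num_frames} is outside the documented 1..49 range")
    return snap_to_multiple(height), snap_to_multiple(width)


def clip_seconds(num_frames: int, fps: int = 8) -> float:
    """Playback length of a clip exported at `fps` frames per second."""
    return num_frames / fps


print(check_request(500, 1000, 49))  # (496, 992)
print(clip_seconds(49))              # 6.125
```

At the recommended 8 FPS, the maximum of 49 frames gives a clip of roughly 6 seconds, which matches the description above.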

`mindone.diffusers.EasyAnimatePipeline`

Bases: DiffusionPipeline

Pipeline for text-to-video generation using EasyAnimate.

This model inherits from `DiffusionPipeline`. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.).

In V5.1, EasyAnimate uses a single text encoder, Qwen2-VL.
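Because the text encoder is a chat-style model, `encode_prompt` wraps every prompt in a single-turn user message before tokenization. The sketch below reproduces only that wrapping step in plain Python; `build_messages` is a hypothetical name used for illustration and does not exist in the library.

```python
from typing import Dict, List, Union


def build_messages(prompt: Union[str, List[str]]) -> List[Dict]:
    """Wrap each prompt in a single-turn, text-only user message."""
    prompts = [prompt] if isinstance(prompt, str) else list(prompt)
    return [{"role": "user", "content": [{"type": "text", "text": p}]} for p in prompts]


messages = build_messages(["a cat playing piano", "a dog surfing"])
print(len(messages))                      # 2
print(messages[0]["content"][0]["text"])  # a cat playing piano
```

In the pipeline source, each message is then rendered with `tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=True)` and tokenized with right-side padding up to `max_sequence_length` (256 by default).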

PARAMETER	DESCRIPTION
`vae`	Variational Auto-Encoder (VAE) Model to encode and decode video to and from latent representations. TYPE: [`AutoencoderKLMagvit`]
`text_encoder`	The text encoder. EasyAnimate uses Qwen2-VL in V5.1. TYPE: Union[`~transformers.Qwen2VLForConditionalGeneration`, `~transformers.BertModel`]
`tokenizer`	A `Qwen2Tokenizer` or `BertTokenizer` to tokenize text. TYPE: Union[`~transformers.Qwen2Tokenizer`, `~transformers.BertTokenizer`]
`transformer`	The EasyAnimate model designed by EasyAnimate Team. TYPE: [`EasyAnimateTransformer3DModel`]
`scheduler`	A scheduler to be used in combination with EasyAnimate to denoise the encoded image latents. TYPE: [`FlowMatchEulerDiscreteScheduler`]
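A typical text-to-video call might look like the sketch below. It is illustrative rather than an official recipe. The following are assumptions: that the repository id from the checkpoint table resolves to weights in a layout `from_pretrained` can load, that `from_pretrained` accepts `mindspore_dtype`, that `export_to_video` is importable from `mindone.diffusers.utils`, and that the first tuple element is a batch of frame lists. `generate_clip` is a hypothetical wrapper name, and the imports are local so the function can be defined without MindSpore installed.

```python
def generate_clip(prompt: str, output_path: str = "output.mp4") -> str:
    """Generate a short clip from `prompt` and write it to `output_path` (sketch only)."""
    # Local imports: defining this function does not require MindSpore.
    import mindspore as ms
    from mindone.diffusers import EasyAnimatePipeline
    from mindone.diffusers.utils import export_to_video

    # Assumption: this repository id can be loaded directly by `from_pretrained`.
    pipe = EasyAnimatePipeline.from_pretrained(
        "alibaba-pai/EasyAnimateV5.1-12b-zh",
        mindspore_dtype=ms.float16,  # recommended inference dtype from the checkpoint table
    )

    # `return_dict` defaults to False, so the call returns a tuple whose
    # first element holds the generated videos.
    videos = pipe(
        prompt=prompt,
        num_frames=49,
        height=512,
        width=512,
        num_inference_steps=50,
        guidance_scale=5.0,
    )[0]

    # Assumption: `videos` is indexed by batch, so [0] selects the frames
    # generated for the single prompt.
    export_to_video(videos[0], output_path, fps=8)
    return output_path
```

The keyword values shown are the defaults of `__call__`; they are spelled out only to make the request explicit.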

Source code in mindone/diffusers/pipelines/easyanimate/pipeline_easyanimate.py

class EasyAnimatePipeline(DiffusionPipeline):
    r"""
    Pipeline for text-to-video generation using EasyAnimate.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
    library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

    EasyAnimate uses one text encoder [qwen2 vl](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) in V5.1.

    Args:
        vae ([`AutoencoderKLMagvit`]):
            Variational Auto-Encoder (VAE) Model to encode and decode video to and from latent representations.
        text_encoder (Union[`~transformers.Qwen2VLForConditionalGeneration`, `~transformers.BertModel`]):
            EasyAnimate uses [qwen2 vl](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) in V5.1.
        tokenizer (Union[`~transformers.Qwen2Tokenizer`, `~transformers.BertTokenizer`]):
            A `Qwen2Tokenizer` or `BertTokenizer` to tokenize text.
        transformer ([`EasyAnimateTransformer3DModel`]):
            The EasyAnimate model designed by EasyAnimate Team.
        scheduler ([`FlowMatchEulerDiscreteScheduler`]):
            A scheduler to be used in combination with EasyAnimate to denoise the encoded image latents.
    """

    model_cpu_offload_seq = "text_encoder->transformer->vae"
    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds"]

    def __init__(
        self,
        vae: AutoencoderKLMagvit,
        text_encoder: Union[Qwen2VLForConditionalGeneration, BertModel],
        tokenizer: Union[Qwen2Tokenizer, BertTokenizer],
        transformer: EasyAnimateTransformer3DModel,
        scheduler: FlowMatchEulerDiscreteScheduler,
    ):
        super().__init__()

        self.register_modules(
            vae=vae,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            transformer=transformer,
            scheduler=scheduler,
        )
        self.enable_text_attention_mask = (
            self.transformer.config.enable_text_attention_mask
            if getattr(self, "transformer", None) is not None
            else True
        )
        self.vae_spatial_compression_ratio = (
            self.vae.spatial_compression_ratio if getattr(self, "vae", None) is not None else 8
        )
        self.vae_temporal_compression_ratio = (
            self.vae.temporal_compression_ratio if getattr(self, "vae", None) is not None else 4
        )
        self.video_processor = VideoProcessor(vae_scale_factor=self.vae_spatial_compression_ratio)

    def encode_prompt(
        self,
        prompt: Union[str, List[str]],
        num_images_per_prompt: int = 1,
        do_classifier_free_guidance: bool = True,
        negative_prompt: Optional[Union[str, List[str]]] = None,
        prompt_embeds: Optional[ms.Tensor] = None,
        negative_prompt_embeds: Optional[ms.Tensor] = None,
        prompt_attention_mask: Optional[ms.Tensor] = None,
        negative_prompt_attention_mask: Optional[ms.Tensor] = None,
        dtype: Optional[ms.Type] = None,
        max_sequence_length: int = 256,
    ):
        r"""
        Encodes the prompt into text encoder hidden states.

        Args:
            prompt (`str` or `List[str]`, *optional*):
                prompt to be encoded
            dtype (`ms.Type`):
                mindspore dtype
            num_images_per_prompt (`int`):
                number of images that should be generated per prompt
            do_classifier_free_guidance (`bool`):
                whether to use classifier free guidance or not
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation. If not defined, one has to pass
                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
                not greater than `1`).
            prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
                provided, text embeddings will be generated from `prompt` input argument.
            negative_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
                argument.
            prompt_attention_mask (`ms.Tensor`, *optional*):
                Attention mask for the prompt. Required when `prompt_embeds` is passed directly.
            negative_prompt_attention_mask (`ms.Tensor`, *optional*):
                Attention mask for the negative prompt. Required when `negative_prompt_embeds` is passed directly.
            max_sequence_length (`int`, *optional*, defaults to 256):
                Maximum sequence length to use for the prompt.
        """
        dtype = dtype or self.text_encoder.dtype

        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            batch_size = prompt_embeds.shape[0]

        if prompt_embeds is None:
            if isinstance(prompt, str):
                messages = [
                    {
                        "role": "user",
                        "content": [{"type": "text", "text": prompt}],
                    }
                ]
            else:
                messages = [
                    {
                        "role": "user",
                        "content": [{"type": "text", "text": _prompt}],
                    }
                    for _prompt in prompt
                ]
            text = [
                self.tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=True) for m in messages
            ]

            text_inputs = self.tokenizer(
                text=text,
                padding="max_length",
                max_length=max_sequence_length,
                truncation=True,
                return_attention_mask=True,
                padding_side="right",
                return_tensors="np",
            )

            text_input_ids = ms.tensor(text_inputs.input_ids)
            prompt_attention_mask = ms.tensor(text_inputs.attention_mask)
            if self.enable_text_attention_mask:
                # Inference: Generation of the output
                # the text encoder only supports PyNative mode
                with pynative_context():
                    prompt_embeds = self.text_encoder(
                        input_ids=text_input_ids, attention_mask=prompt_attention_mask, output_hidden_states=True
                    )[2][-2]
            else:
                raise ValueError("LLM needs attention_mask")
            prompt_attention_mask = prompt_attention_mask.tile((num_images_per_prompt, 1))

        prompt_embeds = prompt_embeds.to(dtype=dtype)

        bs_embed, seq_len, _ = prompt_embeds.shape
        # duplicate text embeddings for each generation per prompt, using mps friendly method
        prompt_embeds = prompt_embeds.tile((1, num_images_per_prompt, 1))
        prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1)

        # get unconditional embeddings for classifier free guidance
        if do_classifier_free_guidance and negative_prompt_embeds is None:
            if negative_prompt is not None and isinstance(negative_prompt, str):
                messages = [
                    {
                        "role": "user",
                        "content": [{"type": "text", "text": negative_prompt}],
                    }
                ]
            else:
                messages = [
                    {
                        "role": "user",
                        "content": [{"type": "text", "text": _negative_prompt}],
                    }
                    for _negative_prompt in negative_prompt
                ]
            text = [
                self.tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=True) for m in messages
            ]

            text_inputs = self.tokenizer(
                text=text,
                padding="max_length",
                max_length=max_sequence_length,
                truncation=True,
                return_attention_mask=True,
                padding_side="right",
                return_tensors="np",
            )

            text_input_ids = ms.tensor(text_inputs.input_ids)
            negative_prompt_attention_mask = ms.tensor(text_inputs.attention_mask)
            if self.enable_text_attention_mask:
                # Inference: Generation of the output
                # the text encoder only supports PyNative mode
                with pynative_context():
                    negative_prompt_embeds = self.text_encoder(
                        input_ids=text_input_ids,
                        attention_mask=negative_prompt_attention_mask,
                        output_hidden_states=True,
                    )[2][-2]
            else:
                raise ValueError("LLM needs attention_mask")
            negative_prompt_attention_mask = negative_prompt_attention_mask.tile((num_images_per_prompt, 1))

        if do_classifier_free_guidance:
            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
            seq_len = negative_prompt_embeds.shape[1]

            negative_prompt_embeds = negative_prompt_embeds.to(dtype=dtype)

            negative_prompt_embeds = negative_prompt_embeds.tile((1, num_images_per_prompt, 1))
            negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)

        return prompt_embeds, negative_prompt_embeds, prompt_attention_mask, negative_prompt_attention_mask

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
    def prepare_extra_step_kwargs(self, generator, eta):
        # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
        # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
        # eta corresponds to η in DDIM paper: https://huggingface.co/papers/2010.02502
        # and should be between [0, 1]

        accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
        extra_step_kwargs = {}
        if accepts_eta:
            extra_step_kwargs["eta"] = eta

        # check if the scheduler accepts generator
        accepts_generator = "generator" in set(inspect.signature(self.scheduler.step).parameters.keys())
        if accepts_generator:
            extra_step_kwargs["generator"] = generator
        return extra_step_kwargs

    def check_inputs(
        self,
        prompt,
        height,
        width,
        negative_prompt=None,
        prompt_embeds=None,
        negative_prompt_embeds=None,
        prompt_attention_mask=None,
        negative_prompt_attention_mask=None,
        callback_on_step_end_tensor_inputs=None,
    ):
        if height % 16 != 0 or width % 16 != 0:
            raise ValueError(f"`height` and `width` have to be divisible by 16 but are {height} and {width}.")

        if callback_on_step_end_tensor_inputs is not None and not all(
            k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
        ):
            raise ValueError(
                f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"  # noqa
            )

        if prompt is not None and prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
                " only forward one of the two."
            )
        elif prompt is None and prompt_embeds is None:
            raise ValueError(
                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
            )
        elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
            raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")

        if prompt_embeds is not None and prompt_attention_mask is None:
            raise ValueError("Must provide `prompt_attention_mask` when specifying `prompt_embeds`.")

        if negative_prompt is not None and negative_prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
            )

        if negative_prompt_embeds is not None and negative_prompt_attention_mask is None:
            raise ValueError("Must provide `negative_prompt_attention_mask` when specifying `negative_prompt_embeds`.")

        if prompt_embeds is not None and negative_prompt_embeds is not None:
            if prompt_embeds.shape != negative_prompt_embeds.shape:
                raise ValueError(
                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
                    f" {negative_prompt_embeds.shape}."
                )

    def prepare_latents(
        self, batch_size, num_channels_latents, num_frames, height, width, dtype, generator, latents=None
    ):
        if latents is not None:
            return latents.to(dtype=dtype)

        shape = (
            batch_size,
            num_channels_latents,
            (num_frames - 1) // self.vae_temporal_compression_ratio + 1,
            height // self.vae_spatial_compression_ratio,
            width // self.vae_spatial_compression_ratio,
        )

        if isinstance(generator, list) and len(generator) != batch_size:
            raise ValueError(
                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
            )

        latents = randn_tensor(shape, generator=generator, dtype=dtype)
        # scale the initial noise by the standard deviation required by the scheduler
        if hasattr(self.scheduler, "init_noise_sigma"):
            latents = (latents * self.scheduler.init_noise_sigma).to(dtype)
        return latents

    @property
    def guidance_scale(self):
        return self._guidance_scale

    @property
    def guidance_rescale(self):
        return self._guidance_rescale

    # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
    # of the Imagen paper: https://huggingface.co/papers/2205.11487 . `guidance_scale = 1`
    # corresponds to doing no classifier free guidance.
    @property
    def do_classifier_free_guidance(self):
        return self._guidance_scale > 1

    @property
    def num_timesteps(self):
        return self._num_timesteps

    @property
    def interrupt(self):
        return self._interrupt

    def __call__(
        self,
        prompt: Union[str, List[str]] = None,
        num_frames: Optional[int] = 49,
        height: Optional[int] = 512,
        width: Optional[int] = 512,
        num_inference_steps: Optional[int] = 50,
        guidance_scale: Optional[float] = 5.0,
        negative_prompt: Optional[Union[str, List[str]]] = None,
        num_images_per_prompt: Optional[int] = 1,
        eta: Optional[float] = 0.0,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        latents: Optional[ms.Tensor] = None,
        prompt_embeds: Optional[ms.Tensor] = None,
        timesteps: Optional[List[int]] = None,
        negative_prompt_embeds: Optional[ms.Tensor] = None,
        prompt_attention_mask: Optional[ms.Tensor] = None,
        negative_prompt_attention_mask: Optional[ms.Tensor] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
        callback_on_step_end: Optional[
            Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks]
        ] = None,
        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
        guidance_rescale: float = 0.0,
    ):
        r"""
        Generates images or video using the EasyAnimate pipeline based on the provided prompts.

        Args:
            prompt (`str` or `List[str]`, *optional*):
                Text prompts to guide the image or video generation. If not provided, use `prompt_embeds` instead.
            num_frames (`int`, *optional*):
                Length of the generated video (in frames).
            height (`int`, *optional*):
                Height of the generated image in pixels.
            width (`int`, *optional*):
                Width of the generated image in pixels.
            num_inference_steps (`int`, *optional*, defaults to 50):
                Number of denoising steps during generation. More steps generally yield higher quality images but slow
                down inference.
            guidance_scale (`float`, *optional*, defaults to 5.0):
                Encourages the model to align outputs with prompts. A higher value may decrease image quality.
            negative_prompt (`str` or `List[str]`, *optional*):
                Prompts indicating what to exclude in generation. If not specified, use `negative_prompt_embeds`.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                Number of images to generate for each prompt.
            eta (`float`, *optional*, defaults to 0.0):
                Applies to DDIM scheduling. Controlled by the eta parameter from the related literature.
            generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
                A generator to ensure reproducibility in image generation.
            latents (`ms.Tensor`, *optional*):
                Predefined latent tensors to condition generation.
            prompt_embeds (`ms.Tensor`, *optional*):
                Text embeddings for the prompts. Overrides prompt string inputs for more flexibility.
            negative_prompt_embeds (`ms.Tensor`, *optional*):
                Embeddings for negative prompts. Overrides string inputs if defined.
            prompt_attention_mask (`ms.Tensor`, *optional*):
                Attention mask for the primary prompt embeddings.
            negative_prompt_attention_mask (`ms.Tensor`, *optional*):
                Attention mask for negative prompt embeddings.
            output_type (`str`, *optional*, defaults to `"pil"`):
                Format of the generated output, for example PIL images or a NumPy array. Pass `"latent"` to skip VAE
                decoding and return the latents.
            return_dict (`bool`, *optional*, defaults to `False`):
                If `True`, returns a structured output. Otherwise returns a simple tuple.
            callback_on_step_end (`Callable`, *optional*):
                Functions called at the end of each denoising step.
            callback_on_step_end_tensor_inputs (`List[str]`, *optional*):
                Tensor names to be included in callback function calls.
            guidance_rescale (`float`, *optional*, defaults to 0.0):
                Adjusts noise levels based on guidance scale.

        Returns:
            [`EasyAnimatePipelineOutput`] or `tuple`:
                If `return_dict` is `True`, an [`EasyAnimatePipelineOutput`] is returned, otherwise a `tuple` is
                returned whose only element is the generated video.
        """

        if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
            callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs

        # 0. default height and width
        height = int((height // 16) * 16)
        width = int((width // 16) * 16)

        # 1. Check inputs. Raise error if not correct
        self.check_inputs(
            prompt,
            height,
            width,
            negative_prompt,
            prompt_embeds,
            negative_prompt_embeds,
            prompt_attention_mask,
            negative_prompt_attention_mask,
            callback_on_step_end_tensor_inputs,
        )
        self._guidance_scale = guidance_scale
        self._guidance_rescale = guidance_rescale
        self._interrupt = False

        # 2. Define call parameters
        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            batch_size = prompt_embeds.shape[0]

        if self.text_encoder is not None:
            dtype = self.text_encoder.dtype
        else:
            dtype = self.transformer.dtype

        # 3. Encode input prompt
        (
            prompt_embeds,
            negative_prompt_embeds,
            prompt_attention_mask,
            negative_prompt_attention_mask,
        ) = self.encode_prompt(
            prompt=prompt,
            dtype=dtype,
            num_images_per_prompt=num_images_per_prompt,
            do_classifier_free_guidance=self.do_classifier_free_guidance,
            negative_prompt=negative_prompt,
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
            prompt_attention_mask=prompt_attention_mask,
            negative_prompt_attention_mask=negative_prompt_attention_mask,
        )

        # 4. Prepare timesteps
        if isinstance(self.scheduler, FlowMatchEulerDiscreteScheduler):
            timesteps, num_inference_steps = retrieve_timesteps(self.scheduler, num_inference_steps, timesteps, mu=1)
        else:
            timesteps, num_inference_steps = retrieve_timesteps(self.scheduler, num_inference_steps, timesteps)

        # 5. Prepare latent variables
        num_channels_latents = self.transformer.config.in_channels
        latents = self.prepare_latents(
            batch_size * num_images_per_prompt,
            num_channels_latents,
            num_frames,
            height,
            width,
            dtype,
            generator,
            latents,
        )

        # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
        extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)

        if self.do_classifier_free_guidance:
            prompt_embeds = mint.cat([negative_prompt_embeds, prompt_embeds])
            prompt_attention_mask = mint.cat([negative_prompt_attention_mask, prompt_attention_mask])

        # 7. Denoising loop
        num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
        self._num_timesteps = len(timesteps)
        with self.progress_bar(total=num_inference_steps) as progress_bar:
            for i, t in enumerate(timesteps):
                if self.interrupt:
                    continue

                # expand the latents if we are doing classifier free guidance
                latent_model_input = mint.cat([latents] * 2) if self.do_classifier_free_guidance else latents
                if hasattr(self.scheduler, "scale_model_input"):
                    latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

                # expand scalar t to 1-D tensor to match the 1st dim of latent_model_input
                t_expand = ms.tensor([t] * latent_model_input.shape[0]).to(dtype=latent_model_input.dtype)

                # predict the noise residual
                noise_pred = self.transformer(
                    latent_model_input,
                    t_expand,
                    encoder_hidden_states=prompt_embeds,
                    return_dict=False,
                )[0]

                if noise_pred.shape[1] != self.vae.config.latent_channels:
                    noise_pred, _ = noise_pred.chunk(2, dim=1)

                # perform guidance
                if self.do_classifier_free_guidance:
                    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

                if self.do_classifier_free_guidance and guidance_rescale > 0.0:
                    # Based on 3.4. in https://huggingface.co/papers/2305.08891
                    noise_pred = rescale_noise_cfg(noise_pred, noise_pred_text, guidance_rescale=guidance_rescale)

                # compute the previous noisy sample x_t -> x_t-1
                latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]

                if callback_on_step_end is not None:
                    callback_kwargs = {}
                    for k in callback_on_step_end_tensor_inputs:
                        callback_kwargs[k] = locals()[k]
                    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                    latents = callback_outputs.pop("latents", latents)
                    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
                    negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)

                if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
                    progress_bar.update()

        if not output_type == "latent":
            latents = 1 / self.vae.config.scaling_factor * latents
            # VAE decoding only supports PyNative mode
            with pynative_context():
                video = self.vae.decode(latents, return_dict=False)[0]
            video = self.video_processor.postprocess_video(video=video, output_type=output_type)
        else:
            video = latents

        if not return_dict:
            return (video,)

        return EasyAnimatePipelineOutput(frames=video)
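Two pieces of arithmetic in the source above are easy to check by hand: the latent shape built in `prepare_latents` and the classifier-free guidance combination in the denoising loop. A minimal pure-Python sketch follows. `latent_shape` and `apply_cfg` are hypothetical names, the compression ratios 4 (temporal) and 8 (spatial) are the fallback values from `__init__` and may differ for a given VAE, and the 16 latent channels in the usage line are an assumption for illustration only.

```python
from typing import List, Tuple


def latent_shape(
    batch_size: int,
    channels: int,
    num_frames: int,
    height: int,
    width: int,
    temporal_ratio: int = 4,  # fallback value used in __init__ when no VAE is set
    spatial_ratio: int = 8,   # fallback value used in __init__ when no VAE is set
) -> Tuple[int, int, int, int, int]:
    """Shape of the initial noise tensor, following `prepare_latents`."""
    return (
        batch_size,
        channels,
        (num_frames - 1) // temporal_ratio + 1,
        height // spatial_ratio,
        width // spatial_ratio,
    )


def apply_cfg(uncond: List[float], text: List[float], guidance_scale: float) -> List[float]:
    """Element-wise classifier-free guidance: uncond + scale * (text - uncond)."""
    return [u + guidance_scale * (t - u) for u, t in zip(uncond, text)]


print(latent_shape(1, 16, 49, 512, 512))        # (1, 16, 13, 64, 64)
print(apply_cfg([0.0, 1.0], [1.0, 3.0], 5.0))   # [5.0, 11.0]
```

With `guidance_scale = 1` the combination reduces to the text-conditioned prediction, which is why the pipeline only enables guidance when the scale is greater than 1.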

`mindone.diffusers.EasyAnimatePipeline.__call__(prompt=None, num_frames=49, height=512, width=512, num_inference_steps=50, guidance_scale=5.0, negative_prompt=None, num_images_per_prompt=1, eta=0.0, generator=None, latents=None, prompt_embeds=None, timesteps=None, negative_prompt_embeds=None, prompt_attention_mask=None, negative_prompt_attention_mask=None, output_type='pil', return_dict=False, callback_on_step_end=None, callback_on_step_end_tensor_inputs=['latents'], guidance_rescale=0.0)`

Generates images or video using the EasyAnimate pipeline based on the provided prompts.

PARAMETER	DESCRIPTION
`prompt`	Text prompts to guide the image or video generation. If not provided, use `prompt_embeds` instead. TYPE: `str` or `List[str]`, optional DEFAULT: `None`
`num_frames`	Length of the generated video (in frames). TYPE: `int`, optional DEFAULT: `49`
`height`	Height of the generated video in pixels. Rounded down to a multiple of 16 before use. TYPE: `int`, optional DEFAULT: `512`
`width`	Width of the generated video in pixels. Rounded down to a multiple of 16 before use. TYPE: `int`, optional DEFAULT: `512`
`num_inference_steps`	Number of denoising steps during generation. More steps generally yield higher quality output but slow down inference. TYPE: `int`, optional DEFAULT: `50`
`guidance_scale`	Encourages the model to align outputs with prompts. A higher value may decrease image quality. TYPE: `float`, optional DEFAULT: `5.0`
`negative_prompt`	Prompts indicating what to exclude in generation. If not specified, use `negative_prompt_embeds`. TYPE: `str` or `List[str]`, optional DEFAULT: `None`
`num_images_per_prompt`	Number of images to generate for each prompt. TYPE: `int`, optional DEFAULT: `1`
`eta`	The eta parameter from the DDIM literature. Applies to DDIM scheduling. TYPE: `float`, optional DEFAULT: `0.0`
`generator`	A generator to ensure reproducibility in generation. TYPE: `np.random.Generator` or `List[np.random.Generator]`, optional DEFAULT: `None`
`latents`	Predefined latent tensors to condition generation. TYPE: `ms.Tensor`, optional DEFAULT: `None`
`prompt_embeds`	Text embeddings for the prompts. Overrides prompt string inputs for more flexibility. TYPE: `ms.Tensor`, optional DEFAULT: `None`
`timesteps`	Custom timesteps for the denoising process, passed to the scheduler through `retrieve_timesteps`. If not provided, the schedule is derived from `num_inference_steps`. TYPE: `List[int]`, optional DEFAULT: `None`
`negative_prompt_embeds`	Embeddings for negative prompts. Overrides string inputs if defined. TYPE: `ms.Tensor`, optional DEFAULT: `None`
`prompt_attention_mask`	Attention mask for the primary prompt embeddings. TYPE: `ms.Tensor`, optional DEFAULT: `None`
`negative_prompt_attention_mask`	Attention mask for negative prompt embeddings. TYPE: `ms.Tensor`, optional DEFAULT: `None`
`output_type`	Format of the generated output, either as a PIL image or as a NumPy array. With `"latent"`, the undecoded latents are returned. TYPE: `str`, optional DEFAULT: `"pil"`
`return_dict`	If `True`, returns a structured output. Otherwise returns a simple tuple. TYPE: `bool`, optional DEFAULT: `False`
`callback_on_step_end`	Function called at the end of each denoising step. TYPE: `Callable`, optional DEFAULT: `None`
`callback_on_step_end_tensor_inputs`	Tensor names to be included in callback function calls. TYPE: `List[str]`, optional DEFAULT: `["latents"]`
`guidance_rescale`	Rescale factor applied to the guided noise prediction, based on Section 3.4 of https://huggingface.co/papers/2305.08891. Only applied when classifier-free guidance is active and the value is greater than 0. TYPE: `float`, optional DEFAULT: `0.0`
`original_size`	Original dimensions of the output. Listed in the docstring but not accepted by the `__call__` signature. TYPE: `Tuple[int, int]`, optional DEFAULT: `(1024, 1024)`
`target_size`	Desired output dimensions for calculations. Listed in the docstring but not accepted by the `__call__` signature. TYPE: `Tuple[int, int]`, optional
`crops_coords_top_left`	Coordinates for cropping. Listed in the docstring but not accepted by the `__call__` signature. TYPE: `Tuple[int, int]`, optional DEFAULT: `(0, 0)`
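
As the source listing further down shows, `__call__` floors `height` and `width` to a multiple of 16 before validating inputs, so a requested size may be reduced silently. A minimal sketch of that rounding in plain Python follows; `floor_to_multiple` is a hypothetical helper name used for illustration and is not part of the library.

```python
def floor_to_multiple(value: int, multiple: int = 16) -> int:
    """Round `value` down to the nearest multiple of `multiple`."""
    return int((value // multiple) * multiple)


# Requested sizes next to the sizes that would actually be used.
for requested in (512, 500, 1024, 270):
    print(requested, "->", floor_to_multiple(requested))
```

Passing sizes that are already multiples of 16 avoids any mismatch between the requested and the generated resolution.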

RETURNS DESCRIPTION

[~pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput] or tuple: If return_dict is True, an [~pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput] is returned whose frames field holds the generated video; otherwise a one-element tuple containing the video is returned. When output_type is "latent", the undecoded latents are returned in place of the decoded video.
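
The `callback_on_step_end` hook is invoked after every scheduler step as `callback(pipeline, step_index, timestep, callback_kwargs)` and is expected to return a dict; the pipeline then reads `latents`, `prompt_embeds`, and `negative_prompt_embeds` back from it when present. The sketch below mimics that contract with a stand-in loop rather than a real pipeline. `run_fake_denoising` and `halve_latents_late` are hypothetical names, and plain Python lists stand in for tensors.

```python
from typing import Any, Callable, Dict, List


def halve_latents_late(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Example callback: scale the latents by 0.5 from step index 2 onward."""
    if step >= 2:
        callback_kwargs["latents"] = [x * 0.5 for x in callback_kwargs["latents"]]
    return callback_kwargs


def run_fake_denoising(
    timesteps: List[int],
    latents: List[float],
    callback: Callable[[Any, int, int, Dict[str, Any]], Dict[str, Any]],
    tensor_inputs: List[str],
) -> List[float]:
    """Stand-in for the denoising loop; only the callback plumbing is modelled."""
    for i, t in enumerate(timesteps):
        # A real pipeline would run the transformer and the scheduler step here.
        local_state = {"latents": latents}
        callback_kwargs = {k: local_state[k] for k in tensor_inputs}
        outputs = callback(None, i, t, callback_kwargs)
        # Mirror the pipeline: take the updated latents back if the callback returned them.
        latents = outputs.pop("latents", latents)
    return latents


result = run_fake_denoising([900, 600, 300, 0], [8.0, 4.0], halve_latents_late, ["latents"])
print(result)
```

Only names listed in `callback_on_step_end_tensor_inputs` are handed to the callback, so a callback that wants to modify `prompt_embeds` would need that name in the list as well.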

Source code in mindone/diffusers/pipelines/easyanimate/pipeline_easyanimate.py

def __call__(
    self,
    prompt: Union[str, List[str]] = None,
    num_frames: Optional[int] = 49,
    height: Optional[int] = 512,
    width: Optional[int] = 512,
    num_inference_steps: Optional[int] = 50,
    guidance_scale: Optional[float] = 5.0,
    negative_prompt: Optional[Union[str, List[str]]] = None,
    num_images_per_prompt: Optional[int] = 1,
    eta: Optional[float] = 0.0,
    generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
    latents: Optional[ms.Tensor] = None,
    prompt_embeds: Optional[ms.Tensor] = None,
    timesteps: Optional[List[int]] = None,
    negative_prompt_embeds: Optional[ms.Tensor] = None,
    prompt_attention_mask: Optional[ms.Tensor] = None,
    negative_prompt_attention_mask: Optional[ms.Tensor] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
    callback_on_step_end: Optional[
        Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks]
    ] = None,
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    guidance_rescale: float = 0.0,
):
    r"""
    Generates images or video using the EasyAnimate pipeline based on the provided prompts.

    Args:
        prompt (`str` or `List[str]`, *optional*):
            Text prompts to guide the image or video generation. If not provided, use `prompt_embeds` instead.
        num_frames (`int`, *optional*):
            Length of the generated video (in frames).
        height (`int`, *optional*):
            Height of the generated image in pixels.
        width (`int`, *optional*):
            Width of the generated image in pixels.
        num_inference_steps (`int`, *optional*, defaults to 50):
            Number of denoising steps during generation. More steps generally yield higher quality images but slow
            down inference.
        guidance_scale (`float`, *optional*, defaults to 5.0):
            Encourages the model to align outputs with prompts. A higher value may decrease image quality.
        negative_prompt (`str` or `List[str]`, *optional*):
            Prompts indicating what to exclude in generation. If not specified, use `negative_prompt_embeds`.
        num_images_per_prompt (`int`, *optional*, defaults to 1):
            Number of images to generate for each prompt.
        eta (`float`, *optional*, defaults to 0.0):
            Applies to DDIM scheduling. Controlled by the eta parameter from the related literature.
        generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
            A generator to ensure reproducibility in image generation.
        latents (`ms.Tensor`, *optional*):
            Predefined latent tensors to condition generation.
        prompt_embeds (`ms.Tensor`, *optional*):
            Text embeddings for the prompts. Overrides prompt string inputs for more flexibility.
        negative_prompt_embeds (`ms.Tensor`, *optional*):
            Embeddings for negative prompts. Overrides string inputs if defined.
        prompt_attention_mask (`ms.Tensor`, *optional*):
            Attention mask for the primary prompt embeddings.
        negative_prompt_attention_mask (`ms.Tensor`, *optional*):
            Attention mask for negative prompt embeddings.
        output_type (`str`, *optional*, defaults to `"pil"`):
            Format of the generated output, either as a PIL image or as a NumPy array. With `"latent"`, the
            undecoded latents are returned.
        return_dict (`bool`, *optional*, defaults to `False`):
            If `True`, returns a structured output. Otherwise returns a simple tuple.
        callback_on_step_end (`Callable`, *optional*):
            Functions called at the end of each denoising step.
        callback_on_step_end_tensor_inputs (`List[str]`, *optional*):
            Tensor names to be included in callback function calls.
        guidance_rescale (`float`, *optional*, defaults to 0.0):
            Adjusts noise levels based on guidance scale.
        original_size (`Tuple[int, int]`, *optional*, defaults to `(1024, 1024)`):
            Original dimensions of the output.
        target_size (`Tuple[int, int]`, *optional*):
            Desired output dimensions for calculations.
        crops_coords_top_left (`Tuple[int, int]`, *optional*, defaults to `(0, 0)`):
            Coordinates for cropping.

    Returns:
        [`~pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput`] or `tuple`:
            If `return_dict` is `True`, an [`~pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput`]
            is returned whose `frames` field holds the generated video, otherwise a one-element `tuple`
            containing the video is returned. When `output_type` is `"latent"`, the undecoded latents are
            returned in place of the decoded video.
    """

    if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
        callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs

    # 0. default height and width
    height = int((height // 16) * 16)
    width = int((width // 16) * 16)

    # 1. Check inputs. Raise error if not correct
    self.check_inputs(
        prompt,
        height,
        width,
        negative_prompt,
        prompt_embeds,
        negative_prompt_embeds,
        prompt_attention_mask,
        negative_prompt_attention_mask,
        callback_on_step_end_tensor_inputs,
    )
    self._guidance_scale = guidance_scale
    self._guidance_rescale = guidance_rescale
    self._interrupt = False

    # 2. Define call parameters
    if prompt is not None and isinstance(prompt, str):
        batch_size = 1
    elif prompt is not None and isinstance(prompt, list):
        batch_size = len(prompt)
    else:
        batch_size = prompt_embeds.shape[0]

    if self.text_encoder is not None:
        dtype = self.text_encoder.dtype
    else:
        dtype = self.transformer.dtype

    # 3. Encode input prompt
    (
        prompt_embeds,
        negative_prompt_embeds,
        prompt_attention_mask,
        negative_prompt_attention_mask,
    ) = self.encode_prompt(
        prompt=prompt,
        dtype=dtype,
        num_images_per_prompt=num_images_per_prompt,
        do_classifier_free_guidance=self.do_classifier_free_guidance,
        negative_prompt=negative_prompt,
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        prompt_attention_mask=prompt_attention_mask,
        negative_prompt_attention_mask=negative_prompt_attention_mask,
    )

    # 4. Prepare timesteps
    if isinstance(self.scheduler, FlowMatchEulerDiscreteScheduler):
        timesteps, num_inference_steps = retrieve_timesteps(self.scheduler, num_inference_steps, timesteps, mu=1)
    else:
        timesteps, num_inference_steps = retrieve_timesteps(self.scheduler, num_inference_steps, timesteps)

    # 5. Prepare latent variables
    num_channels_latents = self.transformer.config.in_channels
    latents = self.prepare_latents(
        batch_size * num_images_per_prompt,
        num_channels_latents,
        num_frames,
        height,
        width,
        dtype,
        generator,
        latents,
    )

    # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
    extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)

    if self.do_classifier_free_guidance:
        prompt_embeds = mint.cat([negative_prompt_embeds, prompt_embeds])
        prompt_attention_mask = mint.cat([negative_prompt_attention_mask, prompt_attention_mask])

    # 7. Denoising loop
    num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
    self._num_timesteps = len(timesteps)
    with self.progress_bar(total=num_inference_steps) as progress_bar:
        for i, t in enumerate(timesteps):
            if self.interrupt:
                continue

            # expand the latents if we are doing classifier free guidance
            latent_model_input = mint.cat([latents] * 2) if self.do_classifier_free_guidance else latents
            if hasattr(self.scheduler, "scale_model_input"):
                latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

            # expand scalar t to 1-D tensor to match the 1st dim of latent_model_input
            t_expand = ms.tensor([t] * latent_model_input.shape[0]).to(dtype=latent_model_input.dtype)

            # predict the noise residual
            noise_pred = self.transformer(
                latent_model_input,
                t_expand,
                encoder_hidden_states=prompt_embeds,
                return_dict=False,
            )[0]

            if noise_pred.shape[1] != self.vae.config.latent_channels:
                noise_pred, _ = noise_pred.chunk(2, dim=1)

            # perform guidance
            if self.do_classifier_free_guidance:
                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

            if self.do_classifier_free_guidance and guidance_rescale > 0.0:
                # Based on 3.4. in https://huggingface.co/papers/2305.08891
                noise_pred = rescale_noise_cfg(noise_pred, noise_pred_text, guidance_rescale=guidance_rescale)

            # compute the previous noisy sample x_t -> x_t-1
            latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]

            if callback_on_step_end is not None:
                callback_kwargs = {}
                for k in callback_on_step_end_tensor_inputs:
                    callback_kwargs[k] = locals()[k]
                callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                latents = callback_outputs.pop("latents", latents)
                prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
                negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)

            if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
                progress_bar.update()

    if not output_type == "latent":
        latents = 1 / self.vae.config.scaling_factor * latents
        # VAE decoding only supports PyNative mode
        with pynative_context():
            video = self.vae.decode(latents, return_dict=False)[0]
        video = self.video_processor.postprocess_video(video=video, output_type=output_type)
    else:
        video = latents

    if not return_dict:
        return (video,)

    return EasyAnimatePipelineOutput(frames=video)
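
The guidance step in the loop above runs the unconditional and the text-conditioned inputs as one batch (unconditional half first), splits the prediction in two, and extrapolates from the unconditional half toward the conditioned half. A minimal sketch with plain Python lists standing in for tensors follows; `apply_cfg` is a hypothetical helper and the numbers are made up.

```python
from typing import List


def apply_cfg(noise_pred: List[List[float]], guidance_scale: float) -> List[List[float]]:
    """Combine a [uncond..., text...] batch into guided predictions."""
    half = len(noise_pred) // 2
    uncond, text = noise_pred[:half], noise_pred[half:]  # equivalent of chunk(2)
    return [
        [u + guidance_scale * (c - u) for u, c in zip(u_row, c_row)]
        for u_row, c_row in zip(uncond, text)
    ]


# One sample: the first row is the unconditional prediction, the second the conditioned one.
batch = [[0.0, 1.0], [1.0, 3.0]]
guided = apply_cfg(batch, guidance_scale=5.0)
print(guided)
```

With `guidance_scale=1.0` the formula reduces to the text-conditioned prediction, which is why larger values push the output further toward the prompt.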

`mindone.diffusers.EasyAnimatePipeline.encode_prompt(prompt, num_images_per_prompt=1, do_classifier_free_guidance=True, negative_prompt=None, prompt_embeds=None, negative_prompt_embeds=None, prompt_attention_mask=None, negative_prompt_attention_mask=None, dtype=None, max_sequence_length=256)` ¶

Encodes the prompt into text encoder hidden states.

PARAMETER	DESCRIPTION
`prompt`	prompt to be encoded TYPE: `str` or `List[str]`, optional
`dtype`	MindSpore dtype the embeddings are cast to. Falls back to the text encoder's dtype when not given. TYPE: `ms.Type` DEFAULT: `None`
`num_images_per_prompt`	number of images that should be generated per prompt TYPE: `int` DEFAULT: `1`
`do_classifier_free_guidance`	whether to use classifier free guidance or not TYPE: `bool` DEFAULT: `True`
`negative_prompt`	The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`). TYPE: `str` or `List[str]`, optional DEFAULT: `None`
`prompt_embeds`	Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument. TYPE: `ms.Tensor`, optional DEFAULT: `None`
`negative_prompt_embeds`	Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input argument. TYPE: `ms.Tensor`, optional DEFAULT: `None`
`prompt_attention_mask`	Attention mask for the prompt. Required when `prompt_embeds` is passed directly. TYPE: `ms.Tensor`, optional DEFAULT: `None`
`negative_prompt_attention_mask`	Attention mask for the negative prompt. Required when `negative_prompt_embeds` is passed directly. TYPE: `ms.Tensor`, optional DEFAULT: `None`
`max_sequence_length`	maximum sequence length to use for the prompt. TYPE: `int`, optional DEFAULT: `256`
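
Before tokenization, `encode_prompt` wraps each prompt in a single-turn chat message so that the tokenizer's chat template can be applied to it. The sketch below reproduces only that wrapping step in plain Python; `build_messages` is a hypothetical helper and no tokenizer is involved.

```python
from typing import Dict, List, Union


def build_messages(prompt: Union[str, List[str]]) -> List[Dict]:
    """Wrap one prompt, or each prompt in a list, as a single-turn user message."""
    prompts = [prompt] if isinstance(prompt, str) else list(prompt)
    return [{"role": "user", "content": [{"type": "text", "text": p}]} for p in prompts]


messages = build_messages(["a cat surfing", "a dog skiing"])
for m in messages:
    print(m)
```

Each message is then templated on its own, so a list of prompts yields one templated string per prompt.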

Source code in mindone/diffusers/pipelines/easyanimate/pipeline_easyanimate.py

def encode_prompt(
    self,
    prompt: Union[str, List[str]],
    num_images_per_prompt: int = 1,
    do_classifier_free_guidance: bool = True,
    negative_prompt: Optional[Union[str, List[str]]] = None,
    prompt_embeds: Optional[ms.Tensor] = None,
    negative_prompt_embeds: Optional[ms.Tensor] = None,
    prompt_attention_mask: Optional[ms.Tensor] = None,
    negative_prompt_attention_mask: Optional[ms.Tensor] = None,
    dtype: Optional[ms.Type] = None,
    max_sequence_length: int = 256,
):
    r"""
    Encodes the prompt into text encoder hidden states.

    Args:
        prompt (`str` or `List[str]`, *optional*):
            prompt to be encoded
        dtype (`ms.Type`):
            mindspore dtype
        num_images_per_prompt (`int`):
            number of images that should be generated per prompt
        do_classifier_free_guidance (`bool`):
            whether to use classifier free guidance or not
        negative_prompt (`str` or `List[str]`, *optional*):
            The prompt or prompts not to guide the image generation. If not defined, one has to pass
            `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
            less than `1`).
        prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
            provided, text embeddings will be generated from `prompt` input argument.
        negative_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
            weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
            argument.
        prompt_attention_mask (`ms.Tensor`, *optional*):
            Attention mask for the prompt. Required when `prompt_embeds` is passed directly.
        negative_prompt_attention_mask (`ms.Tensor`, *optional*):
            Attention mask for the negative prompt. Required when `negative_prompt_embeds` is passed directly.
        max_sequence_length (`int`, *optional*): maximum sequence length to use for the prompt.
    """
    dtype = dtype or self.text_encoder.dtype

    if prompt is not None and isinstance(prompt, str):
        batch_size = 1
    elif prompt is not None and isinstance(prompt, list):
        batch_size = len(prompt)
    else:
        batch_size = prompt_embeds.shape[0]

    if prompt_embeds is None:
        if isinstance(prompt, str):
            messages = [
                {
                    "role": "user",
                    "content": [{"type": "text", "text": prompt}],
                }
            ]
        else:
            messages = [
                {
                    "role": "user",
                    "content": [{"type": "text", "text": _prompt}],
                }
                for _prompt in prompt
            ]
        text = [
            self.tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=True) for m in messages
        ]

        text_inputs = self.tokenizer(
            text=text,
            padding="max_length",
            max_length=max_sequence_length,
            truncation=True,
            return_attention_mask=True,
            padding_side="right",
            return_tensors="np",
        )

        text_input_ids = ms.tensor(text_inputs.input_ids)
        prompt_attention_mask = ms.tensor(text_inputs.attention_mask)
        if self.enable_text_attention_mask:
            # Encode the tokens; the second-to-last hidden state is used as the prompt embedding.
            # The text encoder only supports PyNative mode.
            with pynative_context():
                prompt_embeds = self.text_encoder(
                    input_ids=text_input_ids, attention_mask=prompt_attention_mask, output_hidden_states=True
                )[2][-2]
        else:
            raise ValueError("LLM needs attention_mask")
        prompt_attention_mask = prompt_attention_mask.tile((num_images_per_prompt, 1))

    prompt_embeds = prompt_embeds.to(dtype=dtype)

    bs_embed, seq_len, _ = prompt_embeds.shape
    # duplicate text embeddings for each generation per prompt, using mps friendly method
    prompt_embeds = prompt_embeds.tile((1, num_images_per_prompt, 1))
    prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1)

    # get unconditional embeddings for classifier free guidance
    if do_classifier_free_guidance and negative_prompt_embeds is None:
        if negative_prompt is not None and isinstance(negative_prompt, str):
            messages = [
                {
                    "role": "user",
                    "content": [{"type": "text", "text": negative_prompt}],
                }
            ]
        else:
            messages = [
                {
                    "role": "user",
                    "content": [{"type": "text", "text": _negative_prompt}],
                }
                for _negative_prompt in negative_prompt
            ]
        text = [
            self.tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=True) for m in messages
        ]

        text_inputs = self.tokenizer(
            text=text,
            padding="max_length",
            max_length=max_sequence_length,
            truncation=True,
            return_attention_mask=True,
            padding_side="right",
            return_tensors="np",
        )

        text_input_ids = ms.tensor(text_inputs.input_ids)
        negative_prompt_attention_mask = ms.tensor(text_inputs.attention_mask)
        if self.enable_text_attention_mask:
            # Encode the tokens; the second-to-last hidden state is used as the negative prompt embedding.
            # The text encoder only supports PyNative mode.
            with pynative_context():
                negative_prompt_embeds = self.text_encoder(
                    input_ids=text_input_ids,
                    attention_mask=negative_prompt_attention_mask,
                    output_hidden_states=True,
                )[2][-2]
        else:
            raise ValueError("LLM needs attention_mask")
        negative_prompt_attention_mask = negative_prompt_attention_mask.tile((num_images_per_prompt, 1))

    if do_classifier_free_guidance:
        # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
        seq_len = negative_prompt_embeds.shape[1]

        negative_prompt_embeds = negative_prompt_embeds.to(dtype=dtype)

        negative_prompt_embeds = negative_prompt_embeds.tile((1, num_images_per_prompt, 1))
        negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)

    return prompt_embeds, negative_prompt_embeds, prompt_attention_mask, negative_prompt_attention_mask
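
The embedding duplication near the end of `encode_prompt` tiles each prompt's embeddings along the sequence axis and then reshapes, which places the copies belonging to one prompt next to each other in the batch. The sketch below emulates that with nested Python lists instead of `ms.Tensor`; `duplicate_per_prompt` is a hypothetical helper for illustration only.

```python
from typing import List

Matrix = List[List[float]]  # one prompt embedding: seq_len rows of hidden_dim values


def duplicate_per_prompt(embeds: List[Matrix], num_images_per_prompt: int) -> List[Matrix]:
    """Emulate `tile((1, n, 1))` followed by `view(bs * n, seq_len, -1)`."""
    seq_len = len(embeds[0])
    out: List[Matrix] = []
    for sample in embeds:
        tiled = sample * num_images_per_prompt  # repeat along the sequence axis
        # Reshaping splits the tiled sequence back into n blocks of seq_len rows.
        for k in range(num_images_per_prompt):
            out.append(tiled[k * seq_len:(k + 1) * seq_len])
    return out


# Two prompts, seq_len=2, hidden_dim=1; the leading digit identifies the prompt.
embeds = [[[0.0], [0.1]], [[1.0], [1.1]]]
dup = duplicate_per_prompt(embeds, num_images_per_prompt=2)
print([m[0][0] for m in dup])
```

The printed order shows both copies of the first prompt followed by both copies of the second, which is the batch layout the latents are expected to match.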

`mindone.diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput` `dataclass` ¶

Bases: BaseOutput

Output class for EasyAnimate pipelines.

PARAMETER	DESCRIPTION
`frames`	List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or MindSpore tensor of shape `(batch_size, num_frames, channels, height, width)`. TYPE: `ms.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]

Source code in mindone/diffusers/pipelines/easyanimate/pipeline_output.py

@dataclass
class EasyAnimatePipelineOutput(BaseOutput):
    r"""
    Output class for EasyAnimate pipelines.

    Args:
        frames (`ms.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]):
            List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing
            denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or MindSpore tensor
            of shape `(batch_size, num_frames, channels, height, width)`.
    """

    frames: ms.Tensor
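
Depending on `return_dict`, callers of the pipeline receive either this dataclass or a one-element tuple. The sketch below uses a stand-in dataclass (`FakeOutput`, hypothetical, with no dependency on `mindone`) to show one way of handling both shapes uniformly; `extract_frames` is likewise an illustrative helper.

```python
from dataclasses import dataclass
from typing import Any, Tuple, Union


@dataclass
class FakeOutput:
    """Stand-in for EasyAnimatePipelineOutput; only the `frames` field is modelled."""

    frames: Any


def extract_frames(result: Union[FakeOutput, Tuple[Any]]) -> Any:
    """Return the generated frames from either return style."""
    if isinstance(result, tuple):
        return result[0]  # return_dict=False: (video,)
    return result.frames  # return_dict=True: structured output


video = [["frame0", "frame1"]]  # placeholder for a batch holding one two-frame video
print(extract_frames((video,)) is video)
print(extract_frames(FakeOutput(frames=video)) is video)
```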
