Skip to content

Text or image-to-video

Driven by the success of text-to-image diffusion models, generative video models are able to generate short clips of video from a text prompt or an initial image. These models extend a pretrained diffusion model to generate videos by adding some type of temporal and/or spatial convolution layer to the architecture. A mixed dataset of images and videos are used to train the model which learns to output a series of video frames based on the text or image conditioning.

This guide will show you how to generate videos, how to configure video model parameters, and how to control video generation.

Tip

Discover other cool and trending video generation models on the Hub here!

Stable Video Diffusions (SVD), I2VGen-XL and AnimateDiff are popular models used for video diffusion. Each model is distinct. For example, AnimateDiff inserts a motion modeling module into a frozen text-to-image model to generate personalized animated images, whereas SVD is entirely pretrained from scratch with a three-stage training process to generate short high-quality videos.

Stable Video Diffusion

SVD is based on the Stable Diffusion 2.1 model and it is trained on images, then low-resolution videos, and finally a smaller dataset of high-resolution videos. This model generates a short 2-4 second video from an initial image. You can learn more details about model, like micro-conditioning, in the Stable Video Diffusion guide.

Begin by loading the StableVideoDiffusionPipeline and passing an initial image to generate a video from.

import mindspore as ms
from mindone.diffusers import StableVideoDiffusionPipeline
from mindone.diffusers.utils import load_image, export_to_video
import numpy as np

pipeline = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", mindspore_dtype=ms.float16, variant="fp16"
)

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = np.random.Generator(np.random.PCG64(42))
frames = pipeline(image, decode_chunk_size=8, generator=generator, num_frames=5)[0]
export_to_video(frames, "generated.mp4", fps=7)
initial image
generated video

I2VGen-XL

I2VGen-XL is a diffusion model that can generate higher resolution videos than SVD and it is also capable of accepting text prompts in addition to images. The model is trained with two hierarchical encoders (detail and global encoder) to better capture low and high-level details in images. These learned details are used to train a video diffusion model which refines the video resolution and details in the generated video.

You can use I2VGen-XL by loading the I2VGenXLPipeline, and passing a text and image prompt to generate a video.

import mindspore as ms
from mindone.diffusers import I2VGenXLPipeline
from mindone.diffusers.utils import export_to_gif, load_image
import numpy as np

pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", mindspore_dtype=ms.float16, variant="fp16")

image_url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0009.png"
image = load_image(image_url).convert("RGB")
image = image.resize((image.width // 2, image.height // 2))

prompt = "Papers were floating in the air on a table in the library"
negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms"
generator = np.random.Generator(np.random.PCG64(8888))

frames = pipeline(
    prompt=prompt,
    image=image,
    height=image.height,
    width=image.width,
    num_inference_steps=50,
    negative_prompt=negative_prompt,
    guidance_scale=9.0,
    generator=generator
)[0][0]
export_to_gif(frames, "i2v.gif")
initial image
generated video

AnimateDiff

AnimateDiff is an adapter model that inserts a motion module into a pretrained diffusion model to animate an image. The adapter is trained on video clips to learn motion which is used to condition the generation process to create a video. It is faster and easier to only train the adapter and it can be loaded into most diffusion models, effectively turning them into "video models".

Start by loading a [MotionAdapter].

import mindspore as ms
from mindone.diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from mindone.diffusers.utils import export_to_gif
import numpy as np

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", mindspore_dtype=ms.float16)

Then load a finetuned Stable Diffusion model with the AnimateDiffPipeline.

pipeline = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter, mindspore_dtype=ms.float16)
scheduler = DDIMScheduler.from_pretrained(
    "emilianJR/epiCRealism",
    subfolder="scheduler",
    clip_sample=False,
    timestep_spacing="linspace",
    beta_schedule="linear",
    steps_offset=1,
)
pipeline.scheduler = scheduler

Create a prompt and generate the video.

output = pipeline(
    prompt="A space rocket with trails of smoke behind it launching into space from the desert, 4k, high resolution",
    negative_prompt="bad quality, worse quality, low resolution",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=50,
    generator=np.random.Generator(np.random.PCG64(49)),
)
frames = output[0][0]
export_to_gif(frames, "animation.gif")

Configure model parameters

There are a few important parameters you can configure in the pipeline that'll affect the video generation process and quality. Let's take a closer look at what these parameters do and how changing them affects the output.

Number of frames

The num_frames parameter determines how many video frames are generated per second. A frame is an image that is played in a sequence of other frames to create motion or a video. This affects video length because the pipeline generates a certain number of frames per second (check a pipeline's API reference for the default value). To increase the video duration, you'll need to increase the num_frames parameter.

import mindspore as ms
from mindone.diffusers import I2VGenXLPipeline
from mindone.diffusers.utils import export_to_gif, load_image
import numpy as np

pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", mindspore_dtype=ms.float16, variant="fp16")

image_url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0009.png"
image = load_image(image_url).convert("RGB")
image = image.resize((image.width // 2, image.height // 2))

prompt = "Papers were floating in the air on a table in the library"
negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms"
generator = np.random.Generator(np.random.PCG64(8888))

frames = pipeline(
    prompt=prompt,
    image=image,
    height=image.height,
    width=image.width,
    num_inference_steps=50,
    negative_prompt=negative_prompt,
    guidance_scale=9.0,
    generator=generator,
    num_frames=25,
)[0][0]
export_to_gif(frames, "i2v.gif")
num_frames=14
num_frames=25

Guidance scale

The guidance_scale parameter controls how closely aligned the generated video and text prompt or initial image is. A higher guidance_scale value means your generated video is more aligned with the text prompt or initial image, while a lower guidance_scale value means your generated video is less aligned which could give the model more "creativity" to interpret the conditioning input.

Tip

SVD uses the min_guidance_scale and max_guidance_scale parameters for applying guidance to the first and last frames respectively.

import mindspore as ms
from mindone.diffusers import I2VGenXLPipeline
from mindone.diffusers.utils import export_to_gif, load_image
import numpy as np

pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", mindspore_dtype=ms.float16, variant="fp16")

image_url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0009.png"
image = load_image(image_url).convert("RGB")
image = image.resize((image.width // 2, image.height // 2))

prompt = "Papers were floating in the air on a table in the library"
negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms"
generator = np.random.Generator(np.random.PCG64(0))

frames = pipeline(
    prompt=prompt,
    image=image,
    height=image.height,
    width=image.width,
    num_inference_steps=50,
    negative_prompt=negative_prompt,
    guidance_scale=1.0,
    generator=generator
)[0][0]
export_to_gif(frames, "i2v.gif")
guidance_scale=9.0
guidance_scale=1.0

Negative prompt

A negative prompt deters the model from generating things you don’t want it to. This parameter is commonly used to improve overall generation quality by removing poor or bad features such as “low resolution” or “bad details”.

import mindspore as ms
from mindone.diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from mindone.diffusers.utils import export_to_gif
import numpy as np

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", mindspore_dtype=ms.float16)

pipeline = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter, mindspore_dtype=ms.float16)
scheduler = DDIMScheduler.from_pretrained(
    "emilianJR/epiCRealism",
    subfolder="scheduler",
    clip_sample=False,
    timestep_spacing="linspace",
    beta_schedule="linear",
    steps_offset=1,
)
pipeline.scheduler = scheduler

output = pipeline(
    prompt="360 camera shot of a sushi roll in a restaurant",
    negative_prompt="Distorted, discontinuous, ugly, blurry, low resolution, motionless, static",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=50,
    generator=np.random.Generator(np.random.PCG64(0)),
)
frames = output[0][0]
export_to_gif(frames, "animation.gif")
no negative prompt
negative prompt applied

Model-specific parameters

There are some pipeline parameters that are unique to each model such as adjusting the motion in a video or adding noise to the initial image.

Stable Video Diffusion provides additional micro-conditioning for the frame rate with the fps parameter and for motion with the motion_bucket_id parameter. Together, these parameters allow for adjusting the amount of motion in the generated video.

There is also a noise_aug_strength parameter that increases the amount of noise added to the initial image. Varying this parameter affects how similar the generated video and initial image are. A higher noise_aug_strength also increases the amount of motion. To learn more, read the Micro-conditioning guide.