Mochi 1 Preview¶

Tip

Only a research preview of the model weights is available at the moment.

Mochi 1 is a video generation model by Genmo with a strong focus on prompt adherence and motion quality. The model features a 10B parameter Asymmetric Diffusion Transformer (AsymmDiT) architecture, and uses non-square QKV and output projection layers to reduce inference memory requirements. A single T5-XXL model is used to encode prompts.
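
To make the memory argument for non-square projections concrete, the following sketch counts the weights in one attention block's Q, K, V and output projections for a square layout and for a non-square one. The widths (3072 and 1536) and the helper `projection_params` are illustrative assumptions, not values or code taken from this page or from the model implementation, and biases are ignored.

```python
def projection_params(in_dim: int, attn_dim: int) -> int:
    """Count weights in Q, K, V (in_dim -> attn_dim each) plus the
    output projection (attn_dim -> in_dim). Biases are ignored."""
    qkv = 3 * in_dim * attn_dim
    out = attn_dim * in_dim
    return qkv + out


# Hypothetical widths, chosen only for illustration.
ATTN_DIM = 3072    # width at which attention is computed
NARROW_DIM = 1536  # hidden size of a narrower stream

# Square layout: the stream is as wide as the attention space.
square = projection_params(ATTN_DIM, ATTN_DIM)

# Non-square layout: a narrow stream is projected up for attention
# and projected back down afterwards.
non_square = projection_params(NARROW_DIM, ATTN_DIM)

print(f"square:     {square:,} weights")
print(f"non-square: {non_square:,} weights")
print(f"ratio:      {non_square / square:.2f}")
```

Under these assumed widths the non-square layout needs half as many projection weights, which is the kind of saving the architecture description refers to.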

Mochi 1 preview is an open state-of-the-art video generation model with high-fidelity motion and strong prompt adherence in preliminary evaluation. This model dramatically closes the gap between closed and open video generation systems. The model is released under a permissive Apache 2.0 license.

Tip

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

Generating videos with Mochi-1 Preview¶

The following example downloads the full-precision mochi-1-preview weights. It produces the highest quality results but requires at least 42GB of VRAM to run.

import mindspore as ms
from mindone.diffusers import MochiPipeline
from mindone.diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", mindspore_dtype=ms.float16)

# Enable memory savings
pipe.enable_vae_tiling()

prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."

frames = pipe(prompt, num_inference_steps=28, guidance_scale=3.5)[0][0]

export_to_video(frames, "mochi.mp4", fps=30)

Using a lower precision variant to save memory¶

The following example uses the bfloat16 variant of the model and requires 22GB of VRAM to run. The generated video is slightly lower in quality as a result.

import mindspore as ms
from mindone.diffusers import MochiPipeline
from mindone.diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", variant="bf16", mindspore_dtype=ms.bfloat16)

# Enable memory savings
pipe.enable_vae_tiling()

prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
frames = pipe(prompt, num_frames=85)[0][0]

export_to_video(frames, "mochi.mp4", fps=30)
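
The gap between the 42GB and 22GB figures is roughly what halving the bytes per weight would predict. The sketch below is a back-of-the-envelope estimate only: it assumes the 10B parameter count from the introduction, and the helper `weight_memory_gb` is hypothetical. It ignores the T5 text encoder, the VAE, activations, and framework overhead, so real usage will be higher.

```python
# Bytes used to store a single parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 10e9  # assumption: the 10B figure quoted for the transformer

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.0f} GB of weights")
```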

Using single file loading with the Mochi Transformer¶

You can use from_single_file to load the Mochi transformer in its original format.

Tip

Diffusers currently doesn't support using the FP8 scaled versions of the Mochi single file checkpoints.

import mindspore as ms
import numpy as np
from mindone.diffusers import MochiPipeline, MochiTransformer3DModel
from mindone.diffusers.utils import export_to_video

model_id = "genmo/mochi-1-preview"

ckpt_path = "https://huggingface.co/Comfy-Org/mochi_preview_repackaged/blob/main/split_files/diffusion_models/mochi_preview_bf16.safetensors"

# Load the transformer from the original single-file checkpoint
transformer = MochiTransformer3DModel.from_single_file(ckpt_path, mindspore_dtype=ms.bfloat16)

pipe = MochiPipeline.from_pretrained(model_id, transformer=transformer)
pipe.enable_vae_tiling()

frames = pipe(
    prompt="Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k.",
    negative_prompt="",
    height=480,
    width=848,
    num_frames=85,
    num_inference_steps=50,
    guidance_scale=4.5,
    num_videos_per_prompt=1,
    # Seeded NumPy generator for reproducible sampling
    generator=np.random.Generator(np.random.PCG64(0)),
    max_sequence_length=256,
    output_type="pil",
)[0][0]

export_to_video(frames, "output.mp4", fps=30)
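
The length of the exported clip follows directly from `num_frames` and the `fps` passed to `export_to_video`. The helper below, `clip_duration_seconds`, is a hypothetical convenience function and not part of the library; it only makes that arithmetic explicit for the values used above.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Return the playback length of a clip in seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# Values taken from the examples above: 85 frames exported at 30 fps.
duration = clip_duration_seconds(num_frames=85, fps=30)
print(f"{duration:.2f} seconds")
```

With these settings the clip runs for a little under three seconds; raising `num_frames` lengthens it at the cost of more memory and compute.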

`mindone.diffusers.MochiPipeline` ¶

Bases: DiffusionPipeline, Mochi1LoraLoaderMixin

The mochi pipeline for text-to-video generation.

Reference: https://github.com/genmoai/models

PARAMETER	DESCRIPTION
`transformer`	Conditional Transformer architecture to denoise the encoded video latents. TYPE: [`MochiTransformer3DModel`]
`scheduler`	A scheduler to be used in combination with `transformer` to denoise the encoded video latents. TYPE: [`FlowMatchEulerDiscreteScheduler`]
`vae`	Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations. TYPE: [`AutoencoderKLMochi`]
`text_encoder`	T5, specifically the google/t5-v1_1-xxl variant. TYPE: [`T5EncoderModel`]
`tokenizer`	Tokenizer of class T5TokenizerFast. TYPE: `T5TokenizerFast`