Shap-E

The Shap-E model was proposed in Shap-E: Generating Conditional 3D Implicit Functions by Alex Nichol and Heewoo Jun from OpenAI.

The abstract from the paper is:

We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space.

The original codebase can be found at openai/shap-e.

Tip

See the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

mindone.diffusers.ShapEPipeline

Bases: DiffusionPipeline

Pipeline for generating a latent representation of a 3D asset and rendering it with the NeRF method.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

PARAMETER DESCRIPTION
prior

The canonical unCLIP prior to approximate the image embedding from the text embedding.

TYPE: [`PriorTransformer`]

text_encoder

Frozen text-encoder.

TYPE: [`~transformers.CLIPTextModelWithProjection`]

tokenizer

A CLIPTokenizer to tokenize text.

TYPE: [`~transformers.CLIPTokenizer`]

scheduler

A scheduler to be used in combination with the prior model to generate image embedding.

TYPE: [`HeunDiscreteScheduler`]

shap_e_renderer

Shap-E renderer projects the generated latents into parameters of an MLP to create 3D objects with the NeRF rendering method.

TYPE: [`ShapERenderer`]
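
As a rough illustration of what the "generated latents" look like in terms of shapes, the sketch below traces the two reshaping steps the pipeline applies between `prior` and `shap_e_renderer`: a flat latent of length `num_embeddings * embedding_dim` is viewed as `num_embeddings` vectors, and only the first `embedding_dim` channels of each prior prediction are kept. Nested Python lists stand in for tensors, the sizes are made up, and the helper names `reshape_latents` and `drop_variance` are hypothetical (they are not part of the library).

```python
def reshape_latents(flat, num_embeddings, embedding_dim):
    """View each flat latent row as num_embeddings vectors of embedding_dim values."""
    return [
        [row[i * embedding_dim:(i + 1) * embedding_dim] for i in range(num_embeddings)]
        for row in flat
    ]


def drop_variance(prediction, embedding_dim):
    """Keep the first embedding_dim channels of every vector and discard the rest."""
    return [[vec[:embedding_dim] for vec in item] for item in prediction]


# Hypothetical tiny sizes; in the pipeline these come from prior.config.
num_embeddings, embedding_dim = 3, 2

flat = [[0.0, 1.0, 2.0, 3.0, 4.0, 5.0]]  # shape (1, 6)
latents = reshape_latents(flat, num_embeddings, embedding_dim)  # shape (1, 3, 2)

# Stand-in for a prior output with twice as many channels: shape (1, 3, 4).
prediction = [[vec + [9.0, 9.0] for vec in item] for item in latents]
noise_pred = drop_variance(prediction, embedding_dim)  # back to shape (1, 3, 2)

print(latents)
print(noise_pred)
```

Each of the resulting per-item latents is what the renderer receives, one item at a time, when decoding to images or a mesh.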

Source code in mindone/diffusers/pipelines/shap_e/pipeline_shap_e.py
class ShapEPipeline(DiffusionPipeline):
    """
    Pipeline for generating a latent representation of a 3D asset and rendering it with the NeRF method.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        prior ([`PriorTransformer`]):
            The canonical unCLIP prior to approximate the image embedding from the text embedding.
        text_encoder ([`~transformers.CLIPTextModelWithProjection`]):
            Frozen text-encoder.
        tokenizer ([`~transformers.CLIPTokenizer`]):
             A `CLIPTokenizer` to tokenize text.
        scheduler ([`HeunDiscreteScheduler`]):
            A scheduler to be used in combination with the `prior` model to generate image embedding.
        shap_e_renderer ([`ShapERenderer`]):
            Shap-E renderer projects the generated latents into parameters of an MLP to create 3D objects with the NeRF
            rendering method.
    """

    model_cpu_offload_seq = "text_encoder->prior"
    _exclude_from_cpu_offload = ["shap_e_renderer"]

    def __init__(
        self,
        prior: PriorTransformer,
        text_encoder: CLIPTextModelWithProjection,
        tokenizer: CLIPTokenizer,
        scheduler: HeunDiscreteScheduler,
        shap_e_renderer: ShapERenderer,
    ):
        super().__init__()

        self.register_modules(
            prior=prior,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            scheduler=scheduler,
            shap_e_renderer=shap_e_renderer,
        )

    # Copied from diffusers.pipelines.unclip.pipeline_unclip.UnCLIPPipeline.prepare_latents
    def prepare_latents(self, shape, dtype, generator, latents, scheduler):
        if latents is None:
            latents = randn_tensor(shape, generator=generator, dtype=dtype)
        else:
            if latents.shape != shape:
                raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")

        latents = latents * scheduler.init_noise_sigma.to(dtype)
        return latents

    def _encode_prompt(
        self,
        prompt,
        num_images_per_prompt,
        do_classifier_free_guidance,
    ):
        len(prompt) if isinstance(prompt, list) else 1

        # YiYi Notes: set pad_token_id to be 0, not sure why I can't set in the config file
        self.tokenizer.pad_token_id = 0
        # get prompt text embeddings
        text_inputs = self.tokenizer(
            prompt,
            padding="max_length",
            max_length=self.tokenizer.model_max_length,
            truncation=True,
            return_tensors="np",
        )
        text_input_ids = text_inputs.input_ids
        untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="np").input_ids

        if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not np.array_equal(
            text_input_ids, untruncated_ids
        ):
            removed_text = self.tokenizer.batch_decode(untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1])
            logger.warning(
                "The following part of your input was truncated because CLIP can only handle sequences up to"
                f" {self.tokenizer.model_max_length} tokens: {removed_text}"
            )

        text_encoder_output = self.text_encoder(ms.tensor(text_input_ids))
        prompt_embeds = text_encoder_output[0]

        prompt_embeds = prompt_embeds.repeat_interleave(num_images_per_prompt, dim=0)
        # Shap-E normalizes the prompt_embeds here and rescales them later
        prompt_embeds = prompt_embeds / ops.norm(prompt_embeds, dim=-1, keepdim=True)

        if do_classifier_free_guidance:
            negative_prompt_embeds = ops.zeros_like(prompt_embeds)

            # For classifier free guidance, we need to do two forward passes.
            # Here we concatenate the unconditional and text embeddings into a single batch
            # to avoid doing two forward passes
            prompt_embeds = ops.cat([negative_prompt_embeds, prompt_embeds])

        # Rescale the features to have unit variance
        prompt_embeds = float(math.sqrt(prompt_embeds.shape[1])) * prompt_embeds

        return prompt_embeds

    def __call__(
        self,
        prompt: Union[str, List[str]],
        num_images_per_prompt: int = 1,
        num_inference_steps: int = 25,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        latents: Optional[ms.Tensor] = None,
        guidance_scale: float = 4.0,
        frame_size: int = 64,
        output_type: Optional[str] = "pil",  # pil, np, latent, mesh
        return_dict: bool = False,
    ):
        """
        The call function to the pipeline for generation.

        Args:
            prompt (`str` or `List[str]`):
                The prompt or prompts to guide the image generation.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            num_inference_steps (`int`, *optional*, defaults to 25):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
                A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make
                generation deterministic.
            latents (`ms.Tensor`, *optional*):
                Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                tensor is generated by sampling using the supplied random `generator`.
            guidance_scale (`float`, *optional*, defaults to 4.0):
                A higher guidance scale value encourages the model to generate images closely linked to the text
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            frame_size (`int`, *optional*, defaults to 64):
                The width and height of each image frame of the generated 3D output.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between `"pil"` (`PIL.Image.Image`), `"np"`
                (`np.array`), `"latent"` (`ms.Tensor`), or mesh ([`MeshDecoderOutput`]).
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] instead of a plain
                tuple.

        Examples:

        Returns:
            [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] or `tuple`:
                If `return_dict` is `True`, [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] is returned,
                otherwise a `tuple` is returned where the first element is a list with the generated images.
        """

        if isinstance(prompt, str):
            batch_size = 1
        elif isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")

        batch_size = batch_size * num_images_per_prompt

        do_classifier_free_guidance = guidance_scale > 1.0
        prompt_embeds = self._encode_prompt(prompt, num_images_per_prompt, do_classifier_free_guidance)

        # prior

        self.scheduler.set_timesteps(num_inference_steps)
        timesteps = self.scheduler.timesteps

        num_embeddings = self.prior.config.num_embeddings
        embedding_dim = self.prior.config.embedding_dim

        latents = self.prepare_latents(
            (batch_size, num_embeddings * embedding_dim),
            prompt_embeds.dtype,
            generator,
            latents,
            self.scheduler,
        )

        # YiYi notes: for testing only to match ldm, we can directly create a latents with desired shape: batch_size, num_embeddings, embedding_dim
        latents = latents.reshape(latents.shape[0], num_embeddings, embedding_dim)

        for i, t in enumerate(self.progress_bar(timesteps)):
            # expand the latents if we are doing classifier free guidance
            latent_model_input = ops.cat([latents] * 2) if do_classifier_free_guidance else latents
            # TODO: method of scheduler should not change the dtype of input.
            #  Remove the casting after cuiyushi confirm that.
            tmp_dtype = latent_model_input.dtype
            scaled_model_input = self.scheduler.scale_model_input(latent_model_input, t)
            scaled_model_input = scaled_model_input.to(tmp_dtype)

            noise_pred = self.prior(
                scaled_model_input,
                timestep=t,
                proj_embedding=prompt_embeds,
            )[0]

            # remove the variance
            noise_pred, _ = noise_pred.split(
                scaled_model_input.shape[2], axis=2
            )  # batch_size, num_embeddings, embedding_dim

            if do_classifier_free_guidance:
                noise_pred_uncond, noise_pred = noise_pred.chunk(2)
                noise_pred = noise_pred_uncond + guidance_scale * (noise_pred - noise_pred_uncond)

            # TODO: method of scheduler should not change the dtype of input.
            #  Remove the casting after cuiyushi confirm that.
            tmp_dtype = latents.dtype
            latents = self.scheduler.step(
                noise_pred,
                timestep=t,
                sample=latents,
            )[0]
            latents = latents.to(tmp_dtype)

        if output_type not in ["np", "pil", "latent", "mesh"]:
            raise ValueError(
                f"Only the output types `pil`, `np`, `latent` and `mesh` are supported not output_type={output_type}"
            )

        if output_type == "latent":
            return ShapEPipelineOutput(images=latents)

        images = []
        if output_type == "mesh":
            for i, latent in enumerate(latents):
                mesh = self.shap_e_renderer.decode_to_mesh(
                    latent[None, :],
                )
                images.append(mesh)

        else:
            # np, pil
            for i, latent in enumerate(latents):
                image = self.shap_e_renderer.decode_to_image(
                    latent[None, :],
                    size=frame_size,
                )
                images.append(image)

            images = ops.stack(images)

            images = images.numpy()

            if output_type == "pil":
                images = [self.numpy_to_pil(image) for image in images]

        if not return_dict:
            return (images,)

        return ShapEPipelineOutput(images=images)
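
The embedding handling in `_encode_prompt` can be easier to follow in isolation. The sketch below is a minimal, plain-Python approximation under stated assumptions: nested lists replace tensors, the 4-dimensional embedding is invented, and the function name `normalize_and_rescale` is hypothetical. It follows the same order as the method: L2-normalize each embedding, optionally prepend all-zero rows as the unconditional half of the batch, then multiply by the square root of the embedding dimension.

```python
import math


def normalize_and_rescale(embeds, do_classifier_free_guidance):
    """Normalize rows to unit length, optionally add zero rows, then scale by sqrt(dim)."""
    # Step 1: L2-normalize every embedding row.
    normalized = []
    for row in embeds:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row])

    # Step 2: prepend all-zero rows that act as the unconditional embeddings.
    if do_classifier_free_guidance:
        zeros = [[0.0] * len(row) for row in normalized]
        normalized = zeros + normalized

    # Step 3: rescale so that the mean squared value of each non-zero row is 1.
    scale = math.sqrt(len(embeds[0]))
    return [[scale * x for x in row] for row in normalized]


prompt_embeds = [[3.0, 4.0, 0.0, 0.0]]  # hypothetical 4-dimensional embedding
batch = normalize_and_rescale(prompt_embeds, do_classifier_free_guidance=True)
mean_square = sum(x * x for x in batch[1]) / len(batch[1])

print(batch)
print(mean_square)
```

Because the zero rows stay zero under scaling, concatenating before the final rescale gives the same result as concatenating after it.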

mindone.diffusers.ShapEPipeline.__call__(prompt, num_images_per_prompt=1, num_inference_steps=25, generator=None, latents=None, guidance_scale=4.0, frame_size=64, output_type='pil', return_dict=False)

The call function to the pipeline for generation.

PARAMETER DESCRIPTION
prompt

The prompt or prompts to guide the image generation.

TYPE: `str` or `List[str]`

num_images_per_prompt

The number of images to generate per prompt.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

num_inference_steps

The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 25 DEFAULT: 25

generator

A np.random.Generator to make generation deterministic.

TYPE: `np.random.Generator` or `List[np.random.Generator]`, *optional* DEFAULT: None

latents

Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

guidance_scale

A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

TYPE: `float`, *optional*, defaults to 4.0 DEFAULT: 4.0

frame_size

The width and height of each image frame of the generated 3D output.

TYPE: `int`, *optional*, defaults to 64 DEFAULT: 64

output_type

The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array), "latent" (ms.Tensor), or "mesh" (MeshDecoderOutput).

TYPE: `str`, *optional*, defaults to `"pil"` DEFAULT: 'pil'

return_dict

Whether or not to return a ShapEPipelineOutput instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

RETURNS DESCRIPTION

ShapEPipelineOutput or tuple: If return_dict is True, a ShapEPipelineOutput is returned; otherwise a tuple is returned whose first element is a list with the generated images. With output_type="latent", a ShapEPipelineOutput holding the latents is returned regardless of return_dict.
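
To make the effect of `guidance_scale` concrete, here is a small sketch of the classifier-free guidance step, assuming a stacked batch whose first half holds the unconditional predictions and whose second half holds the conditional ones. Plain lists replace tensors, the numbers are invented, and `split_halves` and `apply_guidance` are hypothetical helper names rather than library functions.

```python
def split_halves(stacked):
    """Split a stacked batch into its unconditional and conditional halves."""
    half = len(stacked) // 2
    return stacked[:half], stacked[half:]


def apply_guidance(uncond, cond, guidance_scale):
    """Move from the unconditional prediction toward the conditional one."""
    return [
        [u + guidance_scale * (c - u) for u, c in zip(u_row, c_row)]
        for u_row, c_row in zip(uncond, cond)
    ]


# One item: row 0 is the unconditional prediction, row 1 the conditional one.
stacked = [[0.0, 1.0], [1.0, 3.0]]
uncond, cond = split_halves(stacked)

guided = apply_guidance(uncond, cond, guidance_scale=4.0)
unchanged = apply_guidance(uncond, cond, guidance_scale=1.0)

print(guided)     # extrapolated past the conditional prediction
print(unchanged)  # identical to the conditional prediction
```

At a scale of exactly 1 the formula returns the conditional prediction unchanged, which is consistent with guidance only being enabled when `guidance_scale > 1`.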


mindone.diffusers.ShapEImg2ImgPipeline

Bases: DiffusionPipeline

Pipeline for generating a latent representation of a 3D asset from an image and rendering it with the NeRF method.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

PARAMETER DESCRIPTION
prior

The canonical unCLIP prior to approximate the image embedding from the text embedding.

TYPE: [`PriorTransformer`]

image_encoder

Frozen image-encoder.

TYPE: [`~transformers.CLIPVisionModel`]

image_processor

A CLIPImageProcessor to process images.

TYPE: [`~transformers.CLIPImageProcessor`]

scheduler

A scheduler to be used in combination with the prior model to generate image embedding.

TYPE: [`HeunDiscreteScheduler`]

shap_e_renderer

Shap-E renderer projects the generated latents into parameters of an MLP to create 3D objects with the NeRF rendering method.

TYPE: [`ShapERenderer`]
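
The way this pipeline prepares the `image_encoder` output for the `prior` can be sketched without any tensors. The example below is an approximation under assumptions: it treats the first token of each sequence as the class token to be dropped, uses nested lists with invented values, and the helper names are hypothetical. It shows the three steps in order: drop the first token, repeat each item `num_images_per_prompt` times, and prepend all-zero embeddings for classifier-free guidance.

```python
def drop_first_token(hidden_states):
    """Remove token 0 from every sequence, keeping only the remaining tokens."""
    return [tokens[1:] for tokens in hidden_states]


def repeat_interleave(batch, repeats):
    """Repeat each item consecutively: [a, b] with repeats=2 becomes [a, a, b, b]."""
    return [item for item in batch for _ in range(repeats)]


def prepend_zeros(batch):
    """Put an all-zero copy of the batch in front as the unconditional half."""
    zeros = [[[0.0] * len(vec) for vec in tokens] for tokens in batch]
    return zeros + batch


# One image, three tokens of dimension 2; the first token plays the class token.
hidden_states = [[[9.0, 9.0], [1.0, 2.0], [3.0, 4.0]]]

patch_embeds = drop_first_token(hidden_states)        # shape (1, 2, 2)
repeated = repeat_interleave(patch_embeds, repeats=2)  # shape (2, 2, 2)
image_embeds = prepend_zeros(repeated)                 # shape (4, 2, 2)

print(len(image_embeds))
print(image_embeds[0], image_embeds[2])
```

The zero half comes first, which is why the guided prediction is later split into an unconditional chunk followed by a conditional chunk.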

Source code in mindone/diffusers/pipelines/shap_e/pipeline_shap_e_img2img.py
class ShapEImg2ImgPipeline(DiffusionPipeline):
    """
    Pipeline for generating a latent representation of a 3D asset from an image and rendering it with the NeRF method.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        prior ([`PriorTransformer`]):
            The canonical unCLIP prior to approximate the image embedding from the text embedding.
        image_encoder ([`~transformers.CLIPVisionModel`]):
            Frozen image-encoder.
        image_processor ([`~transformers.CLIPImageProcessor`]):
             A `CLIPImageProcessor` to process images.
        scheduler ([`HeunDiscreteScheduler`]):
            A scheduler to be used in combination with the `prior` model to generate image embedding.
        shap_e_renderer ([`ShapERenderer`]):
            Shap-E renderer projects the generated latents into parameters of an MLP to create 3D objects with the NeRF
            rendering method.
    """

    model_cpu_offload_seq = "image_encoder->prior"
    _exclude_from_cpu_offload = ["shap_e_renderer"]

    def __init__(
        self,
        prior: PriorTransformer,
        image_encoder: CLIPVisionModel,
        image_processor: CLIPImageProcessor,
        scheduler: HeunDiscreteScheduler,
        shap_e_renderer: ShapERenderer,
    ):
        super().__init__()

        self.register_modules(
            prior=prior,
            image_encoder=image_encoder,
            image_processor=image_processor,
            scheduler=scheduler,
            shap_e_renderer=shap_e_renderer,
        )

    # Copied from diffusers.pipelines.unclip.pipeline_unclip.UnCLIPPipeline.prepare_latents
    def prepare_latents(self, shape, dtype, generator, latents, scheduler):
        if latents is None:
            latents = randn_tensor(shape, generator=generator, dtype=dtype)
        else:
            if latents.shape != shape:
                raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")

        latents = latents * scheduler.init_noise_sigma.to(dtype)
        return latents

    def _encode_image(
        self,
        image,
        num_images_per_prompt,
        do_classifier_free_guidance,
    ):
        if isinstance(image, List) and isinstance(image[0], ms.Tensor):
            image = ops.cat(image, axis=0) if image[0].ndim == 4 else ops.stack(image, axis=0)

        if not isinstance(image, ms.Tensor):
            image = self.image_processor(image, return_tensors="np").pixel_values[0]
            image = ms.Tensor.from_numpy(image).unsqueeze(0)

        image = image.to(dtype=self.image_encoder.dtype)

        image_embeds = self.image_encoder(image)[0]
        image_embeds = image_embeds[:, 1:, :]  # batch_size, 256, dim

        image_embeds = image_embeds.repeat_interleave(num_images_per_prompt, dim=0)

        if do_classifier_free_guidance:
            negative_image_embeds = ops.zeros_like(image_embeds)

            # Classifier-free guidance needs both an unconditional and a conditional prediction.
            # Concatenating the unconditional (all-zero) and image embeddings into a single batch
            # lets one forward pass produce both.
            image_embeds = ops.cat([negative_image_embeds, image_embeds])

        return image_embeds

    def __call__(
        self,
        image: Union[PIL.Image.Image, List[PIL.Image.Image]],
        num_images_per_prompt: int = 1,
        num_inference_steps: int = 25,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        latents: Optional[ms.Tensor] = None,
        guidance_scale: float = 4.0,
        frame_size: int = 64,
        output_type: Optional[str] = "pil",  # pil, np, latent, mesh
        return_dict: bool = False,
    ):
        """
        The call function to the pipeline for generation.

        Args:
            image (`ms.Tensor`, `PIL.Image.Image`, `np.ndarray`, `List[ms.Tensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
                `Image` or tensor representing an image batch to be used as the starting point. Can also accept image
                latents as image, but if passing latents directly it is not encoded again.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            num_inference_steps (`int`, *optional*, defaults to 25):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
                A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make
                generation deterministic.
            latents (`ms.Tensor`, *optional*):
                Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                tensor is generated by sampling using the supplied random `generator`.
            guidance_scale (`float`, *optional*, defaults to 4.0):
                A higher guidance scale value encourages the model to generate outputs closely linked to the input
                `image` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            frame_size (`int`, *optional*, defaults to 64):
                The width and height of each image frame of the generated 3D output.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between `"pil"` (`PIL.Image.Image`), `"np"`
                (`np.array`), `"latent"` (`ms.Tensor`), or mesh ([`MeshDecoderOutput`]).
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] instead of a plain
                tuple.

        Examples:

        Returns:
            [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] or `tuple`:
                If `return_dict` is `True`, [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] is returned,
                otherwise a `tuple` is returned where the first element is a list with the generated images.
        """

        if isinstance(image, PIL.Image.Image):
            batch_size = 1
        elif isinstance(image, ms.Tensor):
            batch_size = image.shape[0]
        elif isinstance(image, list) and isinstance(image[0], (ms.Tensor, PIL.Image.Image)):
            batch_size = len(image)
        else:
            raise ValueError(
                f"`image` has to be of type `PIL.Image.Image`, `ms.Tensor`, `List[PIL.Image.Image]` or `List[ms.Tensor]` but is {type(image)}"
            )

        batch_size = batch_size * num_images_per_prompt

        do_classifier_free_guidance = guidance_scale > 1.0
        image_embeds = self._encode_image(image, num_images_per_prompt, do_classifier_free_guidance)

        # prior

        self.scheduler.set_timesteps(num_inference_steps)
        timesteps = self.scheduler.timesteps

        num_embeddings = self.prior.config.num_embeddings
        embedding_dim = self.prior.config.embedding_dim
        if latents is None:
            latents = self.prepare_latents(
                (batch_size, num_embeddings * embedding_dim),
                image_embeds.dtype,
                generator,
                latents,
                self.scheduler,
            )

        # YiYi notes: for testing only to match ldm, we can directly create a latents with desired shape: batch_size, num_embeddings, embedding_dim
        latents = latents.reshape(latents.shape[0], num_embeddings, embedding_dim)

        for i, t in enumerate(self.progress_bar(timesteps)):
            # expand the latents if we are doing classifier free guidance
            latent_model_input = ops.cat([latents] * 2) if do_classifier_free_guidance else latents
            # TODO: method of scheduler should not change the dtype of input.
            #  Remove the casting after cuiyushi confirm that.
            tmp_dtype = latent_model_input.dtype
            scaled_model_input = self.scheduler.scale_model_input(latent_model_input, t)
            scaled_model_input = scaled_model_input.to(tmp_dtype)

            noise_pred = self.prior(
                scaled_model_input,
                timestep=t,
                proj_embedding=image_embeds,
            )[0]

            # remove the variance
            noise_pred, _ = noise_pred.split(
                scaled_model_input.shape[2], axis=2
            )  # batch_size, num_embeddings, embedding_dim

            if do_classifier_free_guidance:
                noise_pred_uncond, noise_pred = noise_pred.chunk(2)
                noise_pred = noise_pred_uncond + guidance_scale * (noise_pred - noise_pred_uncond)

            # TODO: method of scheduler should not change the dtype of input.
            #  Remove the casting after cuiyushi confirm that.
            tmp_dtype = latents.dtype
            latents = self.scheduler.step(
                noise_pred,
                timestep=t,
                sample=latents,
            )[0]
            latents = latents.to(tmp_dtype)

        if output_type not in ["np", "pil", "latent", "mesh"]:
            raise ValueError(
                f"Only the output types `pil`, `np`, `latent` and `mesh` are supported not output_type={output_type}"
            )

        if output_type == "latent":
            return ShapEPipelineOutput(images=latents)

        images = []
        if output_type == "mesh":
            for i, latent in enumerate(latents):
                mesh = self.shap_e_renderer.decode_to_mesh(
                    latent[None, :],
                )
                images.append(mesh)

        else:
            # np, pil
            for i, latent in enumerate(latents):
                image = self.shap_e_renderer.decode_to_image(
                    latent[None, :],
                    size=frame_size,
                )
                images.append(image)

            images = ops.stack(images)

            images = images.numpy()

            if output_type == "pil":
                images = [self.numpy_to_pil(image) for image in images]

        if not return_dict:
            return (images,)

        return ShapEPipelineOutput(images=images)
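
The tensor shapes flowing through the denoising loop in the listing above can be traced without running the model. The following is a minimal sketch in plain Python: the helper name `trace_latent_shapes` and the example sizes are hypothetical, and the doubled last axis of the prior output is inferred from the two-way `split` in the listing rather than stated in it.

```python
def trace_latent_shapes(
    input_batch,
    num_images_per_prompt,
    num_embeddings,
    embedding_dim,
    guidance_scale,
):
    """Return the shapes used at each stage of the denoising loop (illustrative only)."""
    batch_size = input_batch * num_images_per_prompt
    use_guidance = guidance_scale > 1.0

    # Latents are sampled flat, then reshaped to one row per embedding.
    sampled = (batch_size, num_embeddings * embedding_dim)
    latents = (batch_size, num_embeddings, embedding_dim)

    # Classifier-free guidance concatenates two copies along the batch axis.
    model_batch = batch_size * 2 if use_guidance else batch_size
    model_input = (model_batch, num_embeddings, embedding_dim)

    # Assumption: the prior emits noise and variance side by side, which is why
    # the listing splits the last axis in two and discards the second part.
    prior_output = (model_batch, num_embeddings, 2 * embedding_dim)

    return {
        "sampled": sampled,
        "latents": latents,
        "model_input": model_input,
        "prior_output": prior_output,
        # After guidance the two halves are merged back to the latent shape.
        "guided_noise_pred": latents,
    }


# Hypothetical small sizes chosen only to keep the numbers readable.
shapes = trace_latent_shapes(
    input_batch=1,
    num_images_per_prompt=2,
    num_embeddings=4,
    embedding_dim=8,
    guidance_scale=4.0,
)
```

In a real run, `num_embeddings` and `embedding_dim` come from `self.prior.config`, so the actual sizes depend on the loaded checkpoint.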

mindone.diffusers.ShapEImg2ImgPipeline.__call__(image, num_images_per_prompt=1, num_inference_steps=25, generator=None, latents=None, guidance_scale=4.0, frame_size=64, output_type='pil', return_dict=False)

The call function to the pipeline for generation.

PARAMETER DESCRIPTION
image

Image or tensor representing an image batch to be used as the starting point. Image latents can also be passed as image; latents passed directly are not encoded again.

TYPE: `ms.Tensor`, `PIL.Image.Image`, `List[ms.Tensor]`, or `List[PIL.Image.Image]`

num_images_per_prompt

The number of images to generate per prompt.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

num_inference_steps

The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 25 DEFAULT: 25

generator

A np.random.Generator to make generation deterministic.

TYPE: `np.random.Generator` or `List[np.random.Generator]`, *optional* DEFAULT: None

latents

Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

guidance_scale

A higher guidance scale value encourages the model to generate images closely linked to the input image at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

TYPE: `float`, *optional*, defaults to 4.0 DEFAULT: 4.0

frame_size

The width and height of each image frame of the generated 3D output.

TYPE: `int`, *optional*, defaults to 64 DEFAULT: 64

output_type

The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array), "latent" (ms.Tensor), or "mesh" ([MeshDecoderOutput]).

TYPE: `str`, *optional*, defaults to `"pil"` DEFAULT: 'pil'

return_dict

Whether or not to return a [~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

RETURNS DESCRIPTION

[~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput] or tuple: If return_dict is True, [~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput] is returned, otherwise a tuple is returned where the first element is a list with the generated images.
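
The effect of `guidance_scale` can be illustrated with the update the pipeline source applies when guidance is enabled: the unconditional prediction is pushed toward the conditional one by the scale factor. This is a minimal sketch on plain Python lists. `apply_guidance` is a hypothetical helper and not part of `mindone.diffusers`, which performs the same arithmetic on tensors.

```python
def apply_guidance(noise_pred_uncond, noise_pred_cond, guidance_scale):
    """Combine unconditional and conditional predictions element-wise."""
    if len(noise_pred_uncond) != len(noise_pred_cond):
        raise ValueError("Both predictions must have the same length.")
    return [
        uncond + guidance_scale * (cond - uncond)
        for uncond, cond in zip(noise_pred_uncond, noise_pred_cond)
    ]


# Toy values standing in for the two halves of the batched prior output.
uncond = [0.0, 1.0]
cond = [1.0, 3.0]

guided = apply_guidance(uncond, cond, guidance_scale=4.0)
unchanged = apply_guidance(uncond, cond, guidance_scale=1.0)
```

With a scale of 1 the formula returns the conditional prediction unchanged. The pipeline skips the unconditional pass entirely in that case, since guidance is only enabled when `guidance_scale > 1`.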

Source code in mindone/diffusers/pipelines/shap_e/pipeline_shap_e_img2img.py
def __call__(
    self,
    image: Union[PIL.Image.Image, List[PIL.Image.Image]],
    num_images_per_prompt: int = 1,
    num_inference_steps: int = 25,
    generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
    latents: Optional[ms.Tensor] = None,
    guidance_scale: float = 4.0,
    frame_size: int = 64,
    output_type: Optional[str] = "pil",  # pil, np, latent, mesh
    return_dict: bool = False,
):
    """
    The call function to the pipeline for generation.

    Args:
        image (`ms.Tensor`, `PIL.Image.Image`, `List[ms.Tensor]`, or `List[PIL.Image.Image]`):
            `Image` or tensor representing an image batch to be used as the starting point. Image latents can
            also be passed as `image`; latents passed directly are not encoded again.
        num_images_per_prompt (`int`, *optional*, defaults to 1):
            The number of images to generate per prompt.
        num_inference_steps (`int`, *optional*, defaults to 25):
            The number of denoising steps. More denoising steps usually lead to a higher quality image at the
            expense of slower inference.
        generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
            A [`np.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) to make
            generation deterministic.
        latents (`ms.Tensor`, *optional*):
            Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
            generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
            tensor is generated by sampling using the supplied random `generator`.
        guidance_scale (`float`, *optional*, defaults to 4.0):
            A higher guidance scale value encourages the model to generate images closely linked to the input
            `image` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
        frame_size (`int`, *optional*, defaults to 64):
            The width and height of each image frame of the generated 3D output.
        output_type (`str`, *optional*, defaults to `"pil"`):
            The output format of the generated image. Choose between `"pil"` (`PIL.Image.Image`), `"np"`
            (`np.array`), `"latent"` (`ms.Tensor`), or `"mesh"` ([`MeshDecoderOutput`]).
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] instead of a plain
            tuple.

    Examples:

    Returns:
        [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] or `tuple`:
            If `return_dict` is `True`, [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] is returned,
            otherwise a `tuple` is returned where the first element is a list with the generated images.
    """

    if isinstance(image, PIL.Image.Image):
        batch_size = 1
    elif isinstance(image, ms.Tensor):
        batch_size = image.shape[0]
    elif isinstance(image, list) and isinstance(image[0], (ms.Tensor, PIL.Image.Image)):
        batch_size = len(image)
    else:
        raise ValueError(
            f"`image` has to be of type `PIL.Image.Image`, `ms.Tensor`, `List[PIL.Image.Image]` or `List[ms.Tensor]` but is {type(image)}"
        )

    batch_size = batch_size * num_images_per_prompt

    do_classifier_free_guidance = guidance_scale > 1.0
    image_embeds = self._encode_image(image, num_images_per_prompt, do_classifier_free_guidance)

    # prior

    self.scheduler.set_timesteps(num_inference_steps)
    timesteps = self.scheduler.timesteps

    num_embeddings = self.prior.config.num_embeddings
    embedding_dim = self.prior.config.embedding_dim
    if latents is None:
        latents = self.prepare_latents(
            (batch_size, num_embeddings * embedding_dim),
            image_embeds.dtype,
            generator,
            latents,
            self.scheduler,
        )

    # YiYi notes: for testing only to match ldm, we can directly create a latents with desired shape: batch_size, num_embeddings, embedding_dim
    latents = latents.reshape(latents.shape[0], num_embeddings, embedding_dim)

    for i, t in enumerate(self.progress_bar(timesteps)):
        # expand the latents if we are doing classifier free guidance
        latent_model_input = ops.cat([latents] * 2) if do_classifier_free_guidance else latents
        # TODO: method of scheduler should not change the dtype of input.
        #  Remove the casting after cuiyushi confirm that.
        tmp_dtype = latent_model_input.dtype
        scaled_model_input = self.scheduler.scale_model_input(latent_model_input, t)
        scaled_model_input = scaled_model_input.to(tmp_dtype)

        noise_pred = self.prior(
            scaled_model_input,
            timestep=t,
            proj_embedding=image_embeds,
        )[0]

        # remove the variance
        noise_pred, _ = noise_pred.split(
            scaled_model_input.shape[2], axis=2
        )  # batch_size, num_embeddings, embedding_dim

        if do_classifier_free_guidance:
            noise_pred_uncond, noise_pred = noise_pred.chunk(2)
            noise_pred = noise_pred_uncond + guidance_scale * (noise_pred - noise_pred_uncond)

        # TODO: method of scheduler should not change the dtype of input.
        #  Remove the casting after cuiyushi confirm that.
        tmp_dtype = latents.dtype
        latents = self.scheduler.step(
            noise_pred,
            timestep=t,
            sample=latents,
        )[0]
        latents = latents.to(tmp_dtype)

    if output_type not in ["np", "pil", "latent", "mesh"]:
        raise ValueError(
            f"Only the output types `pil`, `np`, `latent` and `mesh` are supported not output_type={output_type}"
        )

    if output_type == "latent":
        return ShapEPipelineOutput(images=latents)

    images = []
    if output_type == "mesh":
        for i, latent in enumerate(latents):
            mesh = self.shap_e_renderer.decode_to_mesh(
                latent[None, :],
            )
            images.append(mesh)

    else:
        # np, pil
        for i, latent in enumerate(latents):
            image = self.shap_e_renderer.decode_to_image(
                latent[None, :],
                size=frame_size,
            )
            images.append(image)

        images = ops.stack(images)

        images = images.numpy()

        if output_type == "pil":
            images = [self.numpy_to_pil(image) for image in images]

    if not return_dict:
        return (images,)

    return ShapEPipelineOutput(images=images)
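
Because `return_dict` defaults to `False` here, and the `"latent"` branch in the listing returns a `ShapEPipelineOutput` regardless of that flag, calling code may receive either a tuple or an output object. A small normalizing helper is one way to handle both. This is a sketch only: `unwrap_images` and the stand-in `FakeOutput` class are hypothetical and merely mimic the `.images` attribute.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeOutput:
    """Stand-in for ShapEPipelineOutput, used only for this illustration."""

    images: Any


def unwrap_images(result):
    """Return the generated images from either return form."""
    if isinstance(result, tuple):
        # return_dict=False: the images are the first tuple element.
        return result[0]
    # Otherwise assume an output object exposing `.images`.
    return result.images


# Strings stand in for rendered frames.
from_tuple = unwrap_images((["frame_a", "frame_b"],))
from_object = unwrap_images(FakeOutput(images=["frame_a", "frame_b"]))
```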

mindone.diffusers.pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput dataclass

Bases: BaseOutput

Output class for [ShapEPipeline] and [ShapEImg2ImgPipeline].

Source code in mindone/diffusers/pipelines/shap_e/pipeline_shap_e.py
@dataclass
class ShapEPipelineOutput(BaseOutput):
    """
    Output class for [`ShapEPipeline`] and [`ShapEImg2ImgPipeline`].

    Args:
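        images (`List[List[PIL.Image.Image]]` or `List[List[np.ndarray]]`):
            A list of images for 3D rendering. With `output_type="latent"` the pipelines store the latent
            `ms.Tensor` in this field instead.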
    """

    images: Union[List[List[PIL.Image.Image]], List[List[np.ndarray]]]
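
The type annotation above describes a nested list. Assuming the outer list has one entry per generated asset and each inner list holds that asset's rendered frames (an interpretation of the annotation and of the per-latent decoding loop, not something the docstring states), the frames can be walked as follows. The strings stand in for `PIL.Image.Image` objects, and `summarize_frames` is a hypothetical helper.

```python
def summarize_frames(images):
    """Return (asset_index, frame_count) pairs for a nested list of frames."""
    return [(index, len(frames)) for index, frames in enumerate(images)]


# Two hypothetical assets with three rendered frames each.
images = [["a0", "a1", "a2"], ["b0", "b1", "b2"]]

summary = summarize_frames(images)
first_asset_frames = images[0]
```

Indexing the outer list first, as in `images[0]`, yields the frame sequence for a single asset, which is the unit one would typically pass to an animation or export routine.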