YouDream: Generating Anatomically Controllable Consistent Text-to-3D Animals

Paper | Code

Abstract

3D generation guided by text-to-image diffusion models enables the creation of visually compelling assets. However, previous methods explore generation based only on images or text, so the boundaries of creativity are limited by what can be expressed through words or by the images that can be sourced. We present YouDream, a method to generate high-quality, anatomically controllable 3D animals. YouDream is guided by a text-to-image diffusion model controlled by 2D views of a 3D pose prior. Our method generates 3D animals that are not possible to create using previous text-to-3D generative methods. Additionally, our method is capable of preserving anatomical consistency in the generated animals, an area where prior text-to-3D approaches often struggle. Moreover, we design a fully automated pipeline for generating commonly found animals. To circumvent the need for human intervention in creating a 3D pose, we propose a multi-agent LLM that adapts poses from a limited library of animal 3D poses to represent the desired animal. A user study conducted on the outputs of YouDream demonstrates that users prefer the animal models generated by our method over those of prior approaches.


Automatic pipeline for 3D animal generation.
Given the name of an animal and a textual pose description, we A) utilize a multi-agent LLM to generate a 3D pose (φ), supported by a small library of animal names paired with 3D poses. B) With the obtained 3D pose, we train a NeRF to generate the 3D animal, guided by a diffusion model controlled by 2D views (φ_proj) of φ.
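To make the pose conditioning concrete, the following is a minimal sketch of how a 3D pose φ (a set of keypoints) can be projected into a 2D view φ_proj for a sampled camera, which is the control signal fed to the diffusion model. The camera model and function names here are illustrative assumptions, not the YouDream implementation.

```python
import numpy as np

def look_at(cam_pos, target=np.zeros(3), up=np.array([0.0, 1.0, 0.0])):
    """Build a world-to-camera rotation and translation from a camera position."""
    fwd = target - cam_pos
    fwd = fwd / np.linalg.norm(fwd)
    right = np.cross(fwd, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, fwd)
    R = np.stack([right, true_up, -fwd])  # rows are the camera axes
    t = -R @ cam_pos
    return R, t

def project_pose(phi, cam_pos, focal=800.0, img_size=512):
    """Pinhole-project Nx3 keypoints phi to Nx2 pixel coordinates (phi_proj)."""
    R, t = look_at(cam_pos)
    pts_cam = phi @ R.T + t               # world frame -> camera frame
    depth = -pts_cam[:, 2]                # positive depth along the view axis
    xy = pts_cam[:, :2] * (focal / depth[:, None])
    return xy + img_size / 2              # shift origin to the image center

# Toy example: a 4-joint "spine" seen from a frontal camera.
phi = np.array([[0.0, 0.4, 0.0], [0.0, 0.2, 0.1],
                [0.0, 0.0, 0.2], [0.0, -0.2, 0.3]])
phi_proj = project_pose(phi, cam_pos=np.array([0.0, 0.0, 2.5]))
print(phi_proj)  # 2D keypoints to rasterize into the control image
```

During NeRF training, one such projection is computed per sampled camera, so the control image always agrees with the rendered viewpoint.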


Multi-agent LLM for pose editing

Given an input 3D pose from an animal pose library of 16 entries, our multi-agent LLM setup generates poses for novel animals.
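One plausible shape for such a setup is sketched below: one agent picks the closest source animal in the library, one edits the keypoints, and a critic checks anatomical plausibility. The agent roles and the `call_llm` helper are assumptions made for exposition; the actual YouDream agent design may differ.

```python
import json

def call_llm(system: str, user: str) -> str:
    """Hypothetical wrapper around any chat-completion API."""
    raise NotImplementedError("plug in an LLM client here")

def edit_pose(target_animal: str, pose_description: str, library: dict) -> dict:
    # Agent 1: select the anatomically closest source animal from the library.
    source = call_llm(
        "You select the closest source animal from a pose library.",
        f"Target: {target_animal}. Library keys: {list(library)}.",
    ).strip()

    # Agent 2: edit the source 3D keypoints to fit the target animal and pose.
    edited = call_llm(
        "You edit 3D animal keypoints (JSON: joint name -> [x, y, z]).",
        f"Adapt this pose for '{target_animal}', posed as '{pose_description}':\n"
        + json.dumps(library[source]),
    )

    # Agent 3: a critic flags anatomical problems and requests a fix.
    verdict = call_llm(
        "You are a critic. Reply 'OK' or list anatomical problems.", edited)
    if verdict.strip() != "OK":
        edited = call_llm(
            "Fix the listed problems in the keypoint JSON.",
            f"Problems: {verdict}\nPose: {edited}")
    return json.loads(edited)
```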


Generation using 3D pose prior

Given a 3D pose prior, YouDream generates multi-view consistent 3D assets; a sketch of the guidance step follows the examples below.


a scary venom horse, black colored body, sharp teeth


a zoomed out photo of a llama with octopus tentacles body

a realistic mythical bird with two pairs of wings and two long thin lion-like tails

an elephant kicking a soccer ball

a toucan flying with wings spread out
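The guidance step referenced above can be sketched as a score-distillation (SDS) style update, the standard mechanism in this line of work: the NeRF render for a sampled camera is noised, and a pose-conditioned diffusion model scores it. Here `eps_pred` is a placeholder for a ControlNet-style noise predictor that takes φ_proj as its control input; it is an assumption, not the released YouDream guidance model.

```python
import torch

def sds_step(rendered, phi_proj, text_emb, eps_pred, alphas_cumprod, w=1.0):
    """One guidance update; `rendered` must come from a differentiable NeRF render."""
    t = torch.randint(20, 980, (1,), device=rendered.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(rendered)
    noisy = a.sqrt() * rendered + (1 - a).sqrt() * noise  # forward diffusion

    with torch.no_grad():  # the diffusion model itself stays frozen
        eps = eps_pred(noisy, t, text_emb, control=phi_proj)

    # SDS gradient (predicted noise minus true noise), backpropagated
    # through the render into the NeRF parameters only.
    grad = w * (eps - noise)
    rendered.backward(gradient=grad)
```

Because the same 3D pose φ is projected into every sampled view, the 2D control signals stay mutually consistent, which is what keeps the generated animal anatomically coherent across views.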


Comparison with prior art

We compare with SOTA text-to-3D generation methods, listing a) whether the guidance diffusion model was trained with 3D data, b) the resolution of the NeRF, and c) the time to generate a 3D object on a single NVIDIA A100 80 GB GPU. Since the baselines have no pose guidance, we append ", full body" to their prompts. YouDream generates a 3D asset in ~40 minutes when used without the init stage and ~50 minutes with it. We outperform MVDream even without utilizing any 3D objects to train the diffusion model.

Method            Trained on 3D objects   Resolution            Time
ProlificDreamer   No                      512 × 512             ~10 hrs (70k iters)
HiFA              No                      512 × 512             ~6 hrs (10k iters)
MVDream           Yes                     64 × 64 → 256 × 256   ~40 mins (10k iters)
YouDream (ours)   No                      128 × 128             ~40/50 mins (10k/20k iters)

a pangolin standing on a flat concrete surface

a zebra

a dragon with three heads separating from the neck

a zoomed out photo of a llama with octopus tentacles body

More Results

Pose-Controlled 3D Generation

Generated assets for the prompt "a tiger" with various 3D poses.