CLIP-Head: Text-Guided Generation of Textured Neural Parametric 3D Head Models

SIGGRAPH Asia 2023 Technical Communications

CVIT, International Institute of Information Technology, Hyderabad

CLIP-Head generates textured 3D neural head meshes, conditioned on simple text prompts.

Method

We propose CLIP-Head, a novel approach to text-driven generation of neural parametric 3D head models. Our method takes simple natural-language text prompts describing appearance and facial expression, and generates 3D neural head avatars with accurate geometry and high-quality texture maps. Unlike existing approaches, which use conventional parametric head models with limited control and expressiveness, we leverage Neural Parametric Head Models (NPHM), which offer disjoint latent codes for the disentangled encoding of identity and expression.

To facilitate text-driven generation, we propose two weakly-supervised mapping networks that map CLIP's encoding of the input text prompt to NPHM's disjoint identity and expression vectors. The predicted latent codes are then fed to a pre-trained NPHM network to generate the 3D head geometry. Since NPHM meshes do not support textures, we propose a novel aligned parametrization technique, followed by text-driven generation of texture maps using a recently proposed controllable diffusion model for text-to-image synthesis. Our method is capable of generating 3D head meshes with arbitrary appearances and a variety of facial expressions, along with high-quality texture details.
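The two mapping networks can be pictured as small MLPs that take a CLIP text embedding and output the identity and expression latent codes separately, preserving the disentanglement. A minimal NumPy sketch, with hypothetical layer sizes and random (untrained) weights standing in for the learned mapping networks:

```python
import numpy as np

# Assumed dimensions: CLIP text embedding is 512-d; the NPHM identity and
# expression latent sizes below are illustrative placeholders.
CLIP_DIM, ID_DIM, EXPR_DIM, HIDDEN = 512, 256, 100, 1024

rng = np.random.default_rng(0)

def make_mlp(d_in, d_hidden, d_out, rng):
    """Two-layer MLP parameters (stand-in for a trained mapping network)."""
    return {
        "W1": rng.standard_normal((d_in, d_hidden)) * 0.02,
        "b1": np.zeros(d_hidden),
        "W2": rng.standard_normal((d_hidden, d_out)) * 0.02,
        "b2": np.zeros(d_out),
    }

def mlp_forward(params, x):
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)  # ReLU hidden layer
    return h @ params["W2"] + params["b2"]

# Two separate networks keep identity and expression codes disentangled.
id_net = make_mlp(CLIP_DIM, HIDDEN, ID_DIM, rng)
expr_net = make_mlp(CLIP_DIM, HIDDEN, EXPR_DIM, rng)

clip_embedding = rng.standard_normal(CLIP_DIM)  # stand-in for CLIP text encoder output
z_id = mlp_forward(id_net, clip_embedding)      # fed to NPHM's identity branch
z_expr = mlp_forward(expr_net, clip_embedding)  # fed to NPHM's expression branch
```

In the actual pipeline, `z_id` and `z_expr` would be passed to the pre-trained NPHM network to decode the head geometry.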

Features

Aligned UV maps

The UV alignment allows us to train a ControlNet architecture, conditioned on UV normals of the mesh. This strategy allows us to generate diverse UV texture maps in a single inference pass, unlike other iterative optimization-based mesh texturing methods.
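The conditioning signal for the ControlNet is an image of mesh normals rendered in the aligned UV parametrization. As an illustration (not the paper's code), here is how per-texel unit normals could be packed into the standard RGB normal-map encoding used as a conditioning image; resolution and data are placeholders:

```python
import numpy as np

# Hypothetical UV-space resolution; real texture maps would be larger.
H = W = 64
rng = np.random.default_rng(1)

# Stand-in for per-texel surface normals sampled from the mesh in UV space.
normals = rng.standard_normal((H, W, 3))
normals /= np.linalg.norm(normals, axis=-1, keepdims=True)  # normalize to unit length

# Standard normal-map encoding: map each component from [-1, 1] to [0, 255].
uv_normal_image = np.clip((normals * 0.5 + 0.5) * 255.0, 0, 255).astype(np.uint8)
```

This RGB image would then be supplied as the ControlNet condition alongside the text prompt, so a single diffusion inference pass yields a texture map already aligned to the UV layout, with no per-mesh iterative optimization.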


Various identities

The "Texture Synthesis" module leverages Stable Diffusion, enabling CLIP-Head to create 3D head representations with diverse identities and appearances from textual prompts.