The advent of deep generative AI models has significantly accelerated the development of AI with remarkable capabilities in natural language generation, 3D generation, image generation, and speech synthesis. 3D generative models have transformed numerous industries and applications, revolutionizing the current 3D production landscape. However, many current deep generative models encounter a common roadblock: generated meshes with complex wiring and baked-in lighting textures are often incompatible with traditional rendering pipelines like PBR (Physically Based Rendering). Diffusion-based models that generate 3D assets without lighting textures possess remarkable capabilities for diverse 3D asset generation, thereby augmenting existing 3D frameworks across industries such as filmmaking, gaming, and augmented/virtual reality.
In this article, we will discuss Paint3D, a novel coarse-to-fine framework capable of producing diverse, high-resolution 2K UV texture maps for untextured 3D meshes, conditioned on either visual or textual inputs. The key challenge that Paint3D addresses is generating high-quality textures without embedding illumination information, allowing users to re-edit or re-light within modern graphics pipelines. To tackle this issue, the Paint3D framework employs a pre-trained 2D diffusion model to perform multi-view texture fusion and generate view-conditional images, initially producing a coarse texture map. However, since 2D models cannot fully disable lighting effects or completely represent 3D shapes, the texture map may exhibit illumination artifacts and incomplete areas.
We will explore the Paint3D framework in depth, examining how it works and how it is built, and comparing it against state-of-the-art deep generative frameworks. So, let’s get started.
Deep Generative AI models have demonstrated exceptional capabilities in natural language generation, 3D generation, and image synthesis, and have been implemented in real-life applications, revolutionizing the 3D generation industry. However, despite their remarkable capabilities, modern deep generative AI frameworks often produce meshes with complex wiring and chaotic lighting textures that are incompatible with conventional rendering pipelines, including Physically Based Rendering (PBR). Similarly, texture synthesis has advanced rapidly, especially with the use of 2D diffusion models. These models effectively utilize pre-trained depth-to-image diffusion models and text conditions to generate high-quality textures. However, a significant challenge remains: pre-illuminated textures can adversely affect the final 3D environment renderings, introducing lighting errors when the lights are adjusted within common workflows, as demonstrated in the following image.
As observed, texture maps without pre-illumination work seamlessly with traditional rendering pipelines, delivering accurate results. In contrast, texture maps with pre-illumination include inappropriate shadows when relighting is applied. Texture generation frameworks trained on 3D data offer an alternative approach, generating textures by understanding a specific 3D object’s entire geometry. While these frameworks might deliver better results, they lack the generalization capabilities needed to apply the model to 3D objects outside their training data.
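The relighting problem described above can be illustrated with a small numerical sketch. It assumes a simple Lambertian shading model and two hand-picked texels; the function name `lambert` and all values are illustrative assumptions, not part of Paint3D.

```python
import numpy as np

def lambert(albedo, normals, light_dir):
    """Shade per-texel colors with one directional light (Lambertian model)."""
    light_dir = light_dir / np.linalg.norm(light_dir)
    n_dot_l = np.clip(normals @ light_dir, 0.0, None)
    return albedo * n_dot_l[:, None]

# Two texels with the same grey albedo, facing different directions.
albedo = np.array([[0.8, 0.8, 0.8], [0.8, 0.8, 0.8]])
normals = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])

light_a = np.array([0.0, 0.0, 1.0])  # lighting baked into the texture
light_b = np.array([1.0, 0.0, 0.0])  # new light chosen in the renderer

# A pre-illuminated texture already contains the shading from light_a.
baked_texture = lambert(albedo, normals, light_a)

relit_clean = lambert(albedo, normals, light_b)         # lighting-free texture
relit_baked = lambert(baked_texture, normals, light_b)  # pre-illuminated texture

print(relit_clean[1])  # second texel is lit by the new light
print(relit_baked[1])  # second texel stays black: the old shadow is baked in
```

In this toy setup the second texel faces the new light directly, yet the pre-illuminated texture keeps it dark because the shadow from the original light is stored in the texture itself.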
Current texture generation models face two critical challenges: achieving broad generalization across different objects using image guidance or diverse prompts, and eliminating the coupled illumination inherited from pre-training. Pre-illuminated textures can interfere with the final outcomes of textured objects within rendering engines. Additionally, since pre-trained 2D diffusion models only provide 2D results in the view domain, they lack a comprehensive understanding of shapes, which makes it difficult to keep textures consistent across views of a 3D object.
To address these challenges, the Paint3D framework develops a dual-stage texture diffusion model for 3D objects that generalizes across different pre-trained generative models and preserves view consistency while generating lighting-free textures.
Paint3D is a dual-stage, coarse-to-fine texture generation model that leverages the strong prompt guidance and image generation capabilities of pre-trained generative AI models to texture 3D objects. In the first stage, Paint3D progressively samples multi-view images from a pre-trained depth-aware 2D image diffusion model, enabling the generation of high-quality, rich texture results from diverse prompts. The model then generates an initial texture map by back-projecting these images onto the 3D mesh surface. In the second stage, the model focuses on generating lighting-free textures, using diffusion models specialized in removing lighting influences and in refining incomplete regions in a shape-aware manner. Throughout the process, the Paint3D framework generates semantically consistent, high-quality 2K textures while eliminating intrinsic illumination effects.
In summary, Paint3D is a novel, coarse-to-fine generative AI model designed to produce diverse, lighting-free, high-resolution 2K UV texture maps for untextured 3D meshes. It aims to achieve state-of-the-art performance in texturing 3D objects with different conditional inputs, including text and images, offering significant advantages for synthesis and graphics editing tasks.
Methodology and Architecture
The Paint3D framework generates and refines texture maps progressively to produce diverse and high-quality textures for 3D models using conditional inputs such as images and prompts, as demonstrated in the following image.
Stage 1: Progressive Coarse Texture Generation
In the initial coarse texture generation stage, Paint3D employs pre-trained 2D image diffusion models to sample multi-view images, which are then back-projected onto the mesh surface to create the initial texture maps. This stage begins by rendering depth maps from various camera views. The model uses these depth conditions to sample images from the diffusion model, and the sampled images are back-projected onto the 3D mesh surface. Alternating between rendering, sampling, and back-projection improves the consistency of the texture across views and lets the texture map be built up progressively.
The process starts with the visible regions of the 3D mesh, focusing on generating texture from the first camera view by rendering the 3D mesh to a depth map. A texture image is then sampled based on appearance and depth conditions and back-projected onto the mesh. This method is repeated for subsequent viewpoints, incorporating previous textures to render not only a depth image but also a partially colored RGB image with uncolored masks. The model uses a depth-aware image inpainting encoder to fill uncolored areas, generating a complete coarse texture map by back-projecting inpainted images onto the 3D mesh.
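The render, sample, and back-project loop described above can be sketched on a toy problem. Everything here is a simplifying assumption made for illustration: the "mesh" is a ring of texels, each view sees a fixed window of that ring, and `sample_view` is a hypothetical stand-in for the depth-aware diffusion sampling and inpainting step. It is not the Paint3D implementation.

```python
import numpy as np

def visible_texels(view_idx, n_texels, n_views, width):
    """Indices of the texels seen from one view of a toy ring-shaped mesh."""
    center = view_idx * n_texels // n_views
    half = width // 2
    return (center + np.arange(-half, half)) % n_texels

def sample_view(partial_rgb, colored_mask, view_idx):
    """Hypothetical stand-in for depth-aware sampling and inpainting.

    Texels that are already colored are kept; uncolored texels are filled
    with a value that identifies the view that painted them.
    """
    out = partial_rgb.copy()
    out[~colored_mask] = float(view_idx + 1)
    return out

def progressive_texture(n_texels=16, n_views=4, width=6):
    texture = np.zeros(n_texels)
    colored = np.zeros(n_texels, dtype=bool)
    for view in range(n_views):
        idx = visible_texels(view, n_texels, n_views, width)
        # "Render": gather the partially colored view and its colored mask.
        view_rgb, view_mask = texture[idx], colored[idx]
        # Sample: fill only the uncolored part, conditioned on the rest.
        new_rgb = sample_view(view_rgb, view_mask, view)
        # Back-project the result onto the texture.
        texture[idx] = new_rgb
        colored[idx] = True
    return texture, colored

texture, colored = progressive_texture()
print(texture)
print(colored.all())
```

Because each later view only fills texels that earlier views left uncolored, overlapping regions keep the appearance chosen by the first view that saw them, which is the intuition behind the progressive scheme.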
For more complex scenes or objects, the model uses multiple views. Initially, it captures two depth maps from symmetric viewpoints and combines them into a depth grid, which replaces a single depth image for multi-view depth-aware texture sampling.
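A minimal sketch of the depth grid idea follows. The article does not specify the grid layout, so the side-by-side arrangement and the helper names `make_depth_grid` and `split_grid` are assumptions made for illustration.

```python
import numpy as np

def make_depth_grid(depth_a, depth_b):
    """Place two same-sized depth maps side by side in a single image."""
    if depth_a.shape != depth_b.shape:
        raise ValueError("depth maps must have the same shape")
    return np.concatenate([depth_a, depth_b], axis=1)

def split_grid(grid):
    """Split a side-by-side grid back into its two views."""
    half = grid.shape[1] // 2
    return grid[:, :half], grid[:, half:]

# Two toy depth maps from symmetric viewpoints (values are arbitrary).
front = np.full((4, 4), 1.0)
back = np.full((4, 4), 2.0)

grid = make_depth_grid(front, back)
view_a, view_b = split_grid(grid)
print(grid.shape)
```

Sampling a single image conditioned on the whole grid means both views are generated in one pass, and the result can be split again before back-projection.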
Stage 2: Texture Refinement in UV Space
Although the first stage produces plausible coarse texture maps, two problems remain: texture holes caused by the rendering process, and lighting shadows introduced by the 2D image diffusion models. To address these, Paint3D performs a diffusion process in UV space based on the coarse texture map, which fills the holes, removes the illumination artifacts, and improves overall visual quality.
However, refining the texture map in UV space can introduce discontinuities due to the fragmentation of continuous textures into individual fragments. To mitigate this, Paint3D refines the texture map by using the adjacency information of texture fragments. In UV space, the position map represents the 3D adjacency information of texture fragments, treating each non-background element as a 3D point coordinate. The model uses an additional position map encoder, similar to ControlNet, to integrate this adjacency information during the diffusion process.
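To make the position map concrete, the sketch below rasterizes a single triangle into UV space and stores the interpolated 3D coordinate in every covered texel, leaving background texels at zero. The function `position_map` and the single-triangle setup are illustrative assumptions; a real mesh would repeat this for every face.

```python
import numpy as np

def position_map(uv, xyz, size):
    """Rasterize one triangle into a (size, size, 3) UV-space position map.

    uv:  (3, 2) UV coordinates of the triangle's vertices, in [0, 1].
    xyz: (3, 3) 3D positions of the same vertices.
    """
    # Texel centers expressed in UV coordinates.
    coords = (np.arange(size) + 0.5) / size
    u, v = np.meshgrid(coords, coords)  # u varies by column, v by row

    (u0, v0), (u1, v1), (u2, v2) = uv
    det = (v1 - v2) * (u0 - u2) + (u2 - u1) * (v0 - v2)
    # Barycentric weights of every texel center with respect to the triangle.
    w0 = ((v1 - v2) * (u - u2) + (u2 - u1) * (v - v2)) / det
    w1 = ((v2 - v0) * (u - u2) + (u0 - u2) * (v - v2)) / det
    w2 = 1.0 - w0 - w1

    inside = (w0 >= 0) & (w1 >= 0) & (w2 >= 0)
    bary = np.stack([w0, w1, w2], axis=-1)

    pos = np.zeros((size, size, 3))
    pos[inside] = bary[inside] @ xyz  # interpolate the 3D vertex positions
    return pos, inside

uv = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
xyz = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
pos, inside = position_map(uv, xyz, size=4)
print(pos[0, 1])  # 3D point stored in the texel at row 0, column 1
```

Two texels that are far apart in the UV layout but hold nearby 3D coordinates belong to adjacent parts of the surface, which is the adjacency signal the position encoder can exploit.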
The model uses the position conditional encoder together with other conditional encoders to perform refinement tasks in UV space, offering two capabilities: UVHD (UV High Definition) and UV inpainting. UVHD enhances the visual appeal and aesthetics of the texture, using an image enhancement encoder and the position encoder with the diffusion model. UV inpainting fills texture holes, avoiding the self-occlusion issues that arise from rendering. The refinement stage starts with UV inpainting, followed by UVHD, to produce the final refined texture map.
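The ordering of the two refinement steps can be sketched as a small pipeline. Both steps below are deliberately simple stand-ins chosen for illustration: `uv_inpaint` fills holes by averaging valid neighbours and `uv_hd` upsamples by nearest neighbour, whereas Paint3D uses diffusion models for both. The function names are hypothetical.

```python
import numpy as np

def uv_inpaint(texture, valid):
    """Stand-in for UV inpainting: fill holes from valid 4-neighbours."""
    tex, valid = texture.copy(), valid.copy()
    while not valid.all():
        t = np.pad(np.where(valid, tex, 0.0), 1)
        m = np.pad(valid.astype(float), 1)
        # Sum and count of valid neighbours (up, down, left, right).
        nsum = t[:-2, 1:-1] + t[2:, 1:-1] + t[1:-1, :-2] + t[1:-1, 2:]
        ncnt = m[:-2, 1:-1] + m[2:, 1:-1] + m[1:-1, :-2] + m[1:-1, 2:]
        fill = ~valid & (ncnt > 0)
        if not fill.any():
            break  # nothing valid to propagate from
        tex[fill] = nsum[fill] / ncnt[fill]
        valid |= fill
    return tex

def uv_hd(texture, scale=2):
    """Stand-in for UVHD: nearest-neighbour upsampling."""
    return np.repeat(np.repeat(texture, scale, axis=0), scale, axis=1)

def refine(coarse, valid, scale=2):
    """Inpaint first so that the enhancement step sees a complete texture."""
    return uv_hd(uv_inpaint(coarse, valid), scale)

coarse = np.ones((4, 4))
valid = np.ones((4, 4), dtype=bool)
coarse[1, 2], valid[1, 2] = 0.0, False  # one texture hole

refined = refine(coarse, valid)
print(refined.shape)
```

Running inpainting before enhancement means the second step never has to upscale a hole, which is one plausible reading of why the article lists the steps in this order.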
By integrating these refinement methods, the Paint3D framework generates complete, diverse, high-resolution, and lighting-free UV texture maps, making it a robust solution for texturing 3D objects.
Paint3D: Experiments and Results
The Paint3D model utilizes the Stable Diffusion text2image model to assist with texture generation tasks, while the image encoder component manages image conditions. To enhance its control over conditional tasks like image inpainting, depth handling, and high-definition imagery, the Paint3D framework employs ControlNet domain encoders. The model is implemented on the PyTorch framework, with rendering and texture projections executed on Kaolin.
Text to Textures Comparison
To evaluate Paint3D’s performance, we begin by analyzing its texture generation when conditioned on textual prompts, comparing it against state-of-the-art frameworks such as Text2Tex, TEXTure, and Latent-Paint. As shown in the following image, the Paint3D framework not only excels at generating high-quality texture details but also effectively synthesizes an illumination-free texture map.
By leveraging the robust capabilities of Stable Diffusion and ControlNet encoders, Paint3D provides superior texture quality and versatility. The comparison highlights Paint3D’s ability to produce detailed, high-resolution textures without embedded illumination, making it a leading solution for 3D texturing tasks.
In comparison, the Latent-Paint framework is prone to generating blurry textures that result in suboptimal visual effects. The TEXTure framework generates clear textures, but they lack smoothness and exhibit noticeable splicing and seams. Finally, the Text2Tex framework generates smooth textures remarkably well, but it struggles to produce fine textures with intricate detailing. The following image compares the Paint3D framework with state-of-the-art frameworks quantitatively.
As can be observed, the Paint3D framework outperforms all the existing models by a significant margin, with nearly 30% improvement in FID and approximately 40% improvement in KID over the baselines. These improvements in the FID and KID scores demonstrate Paint3D’s ability to generate high-quality textures across diverse objects and categories.
Image to Texture Comparison
To evaluate Paint3D’s generative capabilities using visual prompts, we use the TEXTure model as the baseline. As mentioned earlier, the Paint3D model employs an image encoder sourced from the text2image model from Stable Diffusion. As can be seen in the following image, the Paint3D framework synthesizes exquisite textures remarkably well while maintaining high fidelity with respect to the image condition.
On the other hand, the TEXTure framework is able to generate a texture similar to Paint3D’s, but it falls short of accurately representing the texture details in the image condition. Furthermore, as demonstrated in the following image, the Paint3D framework delivers better FID and KID scores than the TEXTure framework, with FID decreasing from 40.83 to 26.86 and KID dropping from 9.76 to 4.94 (lower is better for both metrics).
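The size of these gains is easier to read as a relative reduction. The short calculation below uses only the scores quoted above; the helper name `relative_reduction` is an illustrative choice.

```python
def relative_reduction(baseline, ours):
    """Fraction by which a lower-is-better score drops relative to a baseline."""
    return (baseline - ours) / baseline

fid_gain = relative_reduction(40.83, 26.86)  # TEXTure vs. Paint3D, FID
kid_gain = relative_reduction(9.76, 4.94)    # TEXTure vs. Paint3D, KID

print(f"FID reduced by {fid_gain:.1%}, KID reduced by {kid_gain:.1%}")
# prints: FID reduced by 34.2%, KID reduced by 49.4%
```

In other words, for image-conditioned texturing the reported FID falls by roughly a third and the reported KID by roughly half relative to TEXTure.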
Final Thoughts
In this article, we have talked about Paint3D, a novel coarse-to-fine framework capable of producing lighting-free, diverse, and high-resolution 2K UV texture maps for untextured 3D meshes, conditioned on either visual or textual inputs. The main highlight of the Paint3D framework is that it generates lighting-free, high-resolution 2K UV textures that remain semantically consistent with the image or text inputs they are conditioned on. Owing to its coarse-to-fine approach, the Paint3D framework produces lighting-free, diverse, and high-resolution texture maps, and delivers better performance than current state-of-the-art frameworks.