Simplify 3D Object Editing with Vox-E: An Artificial Intelligence (AI) Framework For Text-guided Voxel Editing of 3D Objects
Three-dimensional (3D) models are widely used in fields such as animation, gaming, virtual reality, and product design. Creating them, however, is a complex and time-consuming task that requires extensive knowledge and specialized software skills. Although pre-designed models are readily available in online databases, customizing them to fit a specific artistic vision is as demanding as creating a model from scratch and likewise requires expertise in 3D editing software. Recently, research has demonstrated the expressive power of neural field-based representations such as NeRF for capturing fine details and enabling effective optimization schemes through differentiable rendering. As a result, their applicability has expanded to various editing tasks.
However, most research in this area has focused on appearance-only manipulations, which alter the object’s texture and style, or on geometric editing through correspondences with an explicit mesh representation. Unfortunately, these methods still require users to place control points on the mesh representation, and they do not allow for adding new structures or significantly modifying the object’s geometry.
Therefore, a novel voxel-editing approach, termed Vox-E, has been developed to address the issues described above. An overview of the architecture is illustrated in the figure below.
This framework focuses on enabling more localized and flexible object edits guided solely by textual prompts, encompassing both appearance and geometric modifications. To achieve this, the authors exploit pre-trained 2D diffusion models, which can modify images to match specific textual descriptions. The score distillation sampling (SDS) loss, originally introduced for text-driven 3D generation, is adapted here and combined with regularization techniques: the optimization in 3D space is regularized by coupling two volumetric fields, one representing the original object and one representing the edited object. This gives the system the flexibility to comply with the text guidance while preserving the input's geometry and appearance.
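To make the SDS idea concrete, here is a minimal numpy sketch of a score-distillation-style gradient on a rendered image. The `stub_denoiser` is a hypothetical stand-in for the pre-trained 2D diffusion model (a real system would query e.g. Stable Diffusion), and the noise schedule and weighting are toy choices, not Vox-E's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def stub_denoiser(noisy_img, t, prompt_embedding):
    # Hypothetical stand-in for a pre-trained diffusion model's noise
    # prediction conditioned on a text prompt embedding.
    return noisy_img * 0.1 + prompt_embedding.mean()

def sds_gradient(rendered_img, prompt_embedding, t=0.5):
    """Score-distillation-style gradient on a rendered image.

    In the full method, w(t) * (eps_pred - eps) is backpropagated
    through the differentiable renderer into the 3D representation;
    here we only compute the image-space gradient itself.
    """
    eps = rng.normal(size=rendered_img.shape)           # injected noise
    alpha = 1.0 - t                                     # toy noise schedule
    noisy = np.sqrt(alpha) * rendered_img + np.sqrt(1.0 - alpha) * eps
    eps_pred = stub_denoiser(noisy, t, prompt_embedding)
    w = 1.0 - alpha                                     # weighting term w(t)
    return w * (eps_pred - eps)

img = rng.random((8, 8, 3))    # a rendered view of the voxel grid
text_emb = rng.random(16)      # toy text embedding
grad = sds_gradient(img, text_emb)
print(grad.shape)  # (8, 8, 3)
```

The key design point is that no gradients flow through the diffusion model itself; it acts purely as a frozen critic that scores how well a rendered view matches the prompt.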
Instead of utilizing neural fields, Vox-E relies on ReLU Fields, which are lighter than NeRF-based approaches and do not rely on neural networks. ReLU Fields represent the scene as a voxel grid in which each voxel stores learned features. This explicit grid structure enables faster reconstruction and rendering times, as well as tight coupling between the volumetric fields representing the 3D object before and after the desired edit. Vox-E achieves this coupling through a novel volumetric correlation loss over the density features.
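The coupling idea can be sketched as a correlation-style loss between the density grids of the original and edited objects. The exact form of Vox-E's loss differs; the Pearson-correlation variant below is only an illustration of the principle that the edited grid is penalized for drifting away from the original density pattern.

```python
import numpy as np

def density_coupling_loss(sigma_orig, sigma_edit, eps=1e-8):
    """Correlation-style coupling between two voxel density grids.

    Returns 1 - Pearson correlation of the flattened densities, so the
    loss is ~0 when the edited grid's density pattern matches the
    original and grows as the geometry drifts away from it.
    """
    a = sigma_orig.ravel() - sigma_orig.mean()
    b = sigma_edit.ravel() - sigma_edit.mean()
    corr = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return 1.0 - corr

rng = np.random.default_rng(1)
grid_a = rng.random((32, 32, 32))                        # original densities
grid_b = grid_a + 0.01 * rng.normal(size=grid_a.shape)   # lightly edited grid
print(density_coupling_loss(grid_a, grid_a))  # ~0.0 for identical grids
```

Because the loss compares global density patterns rather than per-voxel colors, it constrains geometry while still leaving the appearance features free to follow the text guidance.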
To further refine the spatial extent of the edits, the authors exploit 2D cross-attention maps, which capture the image regions associated with the target edit, and lift them into volumetric grids. The premise behind this approach is that, while the internal 2D features of generative models may individually be noisy, unifying them into a single 3D representation allows for better distillation of semantic knowledge. These 3D cross-attention grids then serve as input to a binary volumetric segmentation algorithm that splits the reconstructed volume into edited and non-edited regions. This allows the framework to merge the features of the two volumetric grids and to better preserve areas that should not be affected by the textual edit.
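The lifting step can be sketched as splatting per-view 2D attention values into a shared 3D grid and then extracting a binary mask. Note the simplifications: the pixel-to-voxel mapping below is a stand-in for ray marching, and plain thresholding replaces Vox-E's actual volumetric segmentation procedure; all function names are hypothetical.

```python
import numpy as np

def splat_attention_to_grid(att_maps, voxel_coords_per_view, grid_shape):
    """Accumulate per-view 2D attention values into a shared 3D grid.

    att_maps: list of (H, W) attention maps, one per rendered view.
    voxel_coords_per_view: for each view, an (H, W, 3) integer array
    giving the voxel hit by each pixel's ray (stand-in for ray marching).
    """
    grid = np.zeros(grid_shape)
    counts = np.zeros(grid_shape)
    for att, coords in zip(att_maps, voxel_coords_per_view):
        idx = (coords[..., 0], coords[..., 1], coords[..., 2])
        np.add.at(grid, idx, att)     # unbuffered add handles repeated voxels
        np.add.at(counts, idx, 1.0)
    return grid / np.maximum(counts, 1.0)   # average attention over views

def binary_edit_mask(att_grid, threshold=0.5):
    # Simple threshold standing in for the volumetric segmentation step.
    return att_grid > threshold

rng = np.random.default_rng(0)
att = [rng.random((16, 16)) for _ in range(2)]                 # two views
coords = [rng.integers(0, 8, size=(16, 16, 3)) for _ in range(2)]
grid = splat_attention_to_grid(att, coords, (8, 8, 8))
mask = binary_edit_mask(grid)
print(mask.shape)  # (8, 8, 8)
```

Averaging over many views is what denoises the individually unreliable 2D maps: a voxel only receives a high score if it is consistently attended to across viewpoints.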
The results of this approach are compared with other state-of-the-art techniques. Some samples taken from the mentioned work are depicted below.
This was the summary of Vox-E, an AI framework for text-guided voxel editing of 3D objects.
If you are interested or want to learn more about this work, you can find a link to the paper and the project page.
Check out the Paper, Code, and Project Page.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.