KAIST Researchers Introduce FaceCLIPNeRF: A Text-Driven Manipulation Pipeline of a 3D Face Using Deformable NeRF

The ability to manipulate 3D face representations easily is a crucial part of advancing 3D digital human content. Although Neural Radiance Fields (NeRF) have made significant progress in reconstructing 3D scenes, most NeRF manipulation techniques focus on rigid geometry or color edits, which fall short for tasks requiring fine-grained control over facial expressions. A recent study presented a regionally controlled face editing approach, but it requires a laborious procedure: collecting user-annotated masks of different parts of the face from selected training frames, followed by manual attribute control to achieve the desired edit.

Face-specific implicit representation techniques encode observed facial expressions with high fidelity by using the parameters of morphable face models as priors. Manipulating them by hand, however, requires large training sets of around 6000 frames that span a range of facial expressions. This makes both data collection and manipulation arduous. Instead, researchers from KAIST and Scatter Lab develop a method that trains on a dynamic portrait video of around 300 frames containing a few types of facial deformation, and then enables text-driven manipulation, as shown in Figure 1.

Figure 1

Before manipulating a face deformation, their approach uses HyperNeRF to learn the observed deformations and disentangle them from a canonical space. In particular, a shared implicit scene network conditioned on latent codes, together with per-frame deformation latent codes, is trained across the training frames. Their key insight is to represent scene deformations for manipulation tasks with multiple spatially varying latent codes. This insight arises from the shortcomings of naively applying the HyperNeRF formulation to manipulation, namely searching for a single latent code that encodes the desired facial deformation.
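
To make the idea of a shared scene network with per-frame latent codes concrete, here is a minimal NumPy sketch. It is a toy stand-in, not the HyperNeRF or FaceCLIPNeRF architecture: the layer sizes, the names `frame_codes` and `scene_network`, and the random (untrained) weights are all assumptions made for illustration. It only shows that every 3D point queried for a given frame is conditioned on the same single code.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_FRAMES = 300   # roughly the number of training frames mentioned in the article
CODE_DIM = 8       # hypothetical latent code size
HIDDEN = 32        # hypothetical hidden width

# One deformation code per training frame (random here; learned in practice).
frame_codes = rng.normal(size=(NUM_FRAMES, CODE_DIM))

# Weights of a toy two-layer MLP standing in for the shared scene network.
W1 = rng.normal(size=(3 + CODE_DIM, HIDDEN)) * 0.1
b1 = np.zeros(HIDDEN)
W2 = rng.normal(size=(HIDDEN, 4)) * 0.1
b2 = np.zeros(4)

def scene_network(points, code):
    """Map N 3D points plus ONE latent code to per-point (rgb, density)."""
    # The same code is attached to every point: this is the "single latent code" setup.
    code_tiled = np.broadcast_to(code, (points.shape[0], CODE_DIM))
    hidden = np.maximum(np.concatenate([points, code_tiled], axis=1) @ W1 + b1, 0.0)
    out = hidden @ W2 + b2
    rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))   # sigmoid keeps colors in [0, 1]
    density = np.maximum(out[:, 3], 0.0)      # ReLU keeps density non-negative
    return rgb, density

# Query five sample points using the code of one (arbitrarily chosen) frame.
points = rng.uniform(-1.0, 1.0, size=(5, 3))
rgb, density = scene_network(points, frame_codes[42])
```

Because one code governs the whole scene here, changing it changes every region of the face at once, which is the limitation the next paragraph describes.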

For example, a single latent code cannot express a facial expression that requires a combination of local deformations observed in different training instances. The authors call this the “linked local attribute problem” and address it by representing the manipulated scene with spatially varying latent codes. To do this, they first summarize all observed deformations into a set of anchor codes, and then train an MLP to compose them into multiple position-conditional latent codes. The latent codes are then made to reflect the visual attributes of a target text by optimizing the images rendered from them to be close to that text in CLIP embedding space. In summary, their work contributes the following:

• Design of a manipulation network that learns to represent a scene with spatially varying latent codes

• Proposal of a text-driven manipulation pipeline for a face reconstructed with NeRF

• To the best of their knowledge, the first work to perform text-driven manipulation of a face reconstructed with NeRF.
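
The anchor-code composition and the CLIP-guided objective described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the softmax blending, the layer sizes, and the names `position_conditional_codes` and `cosine_distance` are hypothetical, the MLP weights are random rather than trained, and plain vectors stand in for the embeddings that CLIP's image and text encoders would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_ANCHORS = 4   # hypothetical number of anchor codes
CODE_DIM = 8      # hypothetical latent code size
HIDDEN = 16       # hypothetical hidden width

# Anchor codes summarizing the observed deformations (random placeholders here).
anchor_codes = rng.normal(size=(NUM_ANCHORS, CODE_DIM))

# Toy MLP that maps a 3D position to one blending logit per anchor.
W1 = rng.normal(size=(3, HIDDEN)) * 0.5
b1 = np.zeros(HIDDEN)
W2 = rng.normal(size=(HIDDEN, NUM_ANCHORS)) * 0.5
b2 = np.zeros(NUM_ANCHORS)

def softmax(x):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def position_conditional_codes(points):
    """Blend anchor codes into one latent code per 3D point.

    Softmax blending is an assumption of this sketch.
    """
    hidden = np.maximum(points @ W1 + b1, 0.0)
    weights = softmax(hidden @ W2 + b2)      # shape (N, NUM_ANCHORS), rows sum to 1
    return weights @ anchor_codes, weights   # codes have shape (N, CODE_DIM)

def cosine_distance(image_embedding, text_embedding):
    """1 - cosine similarity; a stand-in for a CLIP-space text-image loss."""
    a = image_embedding / np.linalg.norm(image_embedding)
    b = text_embedding / np.linalg.norm(text_embedding)
    return 1.0 - float(a @ b)

points = rng.uniform(-1.0, 1.0, size=(6, 3))
codes, weights = position_conditional_codes(points)

# Placeholder vectors; in a real pipeline these would come from CLIP encoders.
fake_image_embedding = rng.normal(size=32)
fake_text_embedding = rng.normal(size=32)
loss = cosine_distance(fake_image_embedding, fake_text_embedding)
```

In a full pipeline the loss would be backpropagated through a differentiable renderer into the MLP weights, so that different face regions can draw on different anchor deformations. That training loop is omitted here.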


Check out the Paper. All credit for this research goes to the researchers on this project.


Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.

