Computer science has recently entered a new era in which Artificial Intelligence (AI) can be used to create detailed and lifelike images. Multimedia generation (for instance, text-to-text, text-to-image, image-to-image, and image-to-text generation) has improved dramatically. Thanks to the release of recent generative models such as Stable Diffusion and OpenAI's DALL-E (text-to-image) and ChatGPT (text-to-text), these technologies are rapidly improving and capturing people's interest. Beyond the tasks mentioned above, generative models have been developed for many other goals. One important application is so-called talking head generation.
Talking head generation is the task of synthesizing a talking face from one or more images of a person.
Virtual reality, face-to-face live chat, and virtual avatars in games and media are just a few places where talking heads have found significant use. Recent advances in neural rendering have surpassed results that previously required expensive driving sensors and sophisticated 3D human modeling. Despite the increasing realism and higher rendering resolution these works achieve, identity preservation remains difficult because the human visual system is highly sensitive to even the slightest change in a person's face shape. The work presented in this article attempts to create a talking face that looks genuine and moves according to the driver's motion using only a single source picture (one-shot).
The idea is to develop an ID-preserving talking head generation framework, which advances previous methods in two aspects. First, as opposed to interpolating from sparse flow, the authors argue that dense landmarks are crucial to achieving accurate geometry-aware flow fields. Second, inspired by face-swapping methods, the source identity is adaptively fused during synthesis so that the network better preserves the key characteristics of the image portrait.
The figure below shows the overall framework architecture.
The model takes two inputs: a single source image of a person and a sequence of driving video frames that guide the video generation. The model is asked to generate an output video that follows the motion of the driving video while maintaining the identity of the source image.
The first step is landmark detection. The authors claim that dense landmark prediction is the key to geometry-aware warping field estimation, which is used in later stages to capture and guide the head movement. For this purpose, a prediction model trained on synthetic faces eases the landmark acquisition process. A simple way to process these landmarks would be to concatenate them channel-wise; however, this operation is computationally demanding given the many channels involved. Hence, the paper presents a different strategy: the landmark points are connected into line segments and drawn onto the image plane, with colors differentiating the facial regions, as sketched below.
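To make the landmark-rasterization idea concrete, here is a minimal sketch (not the authors' code) of how dense landmarks could be drawn as colored polylines with OpenCV. The landmark count, region groupings, and colors are illustrative assumptions, not the paper's actual topology.

```python
# Minimal sketch: rasterize dense landmarks as colored line segments
# instead of stacking one channel per landmark point.
import numpy as np
import cv2

def draw_landmark_image(landmarks, groups, size=256):
    """landmarks: (N, 2) array of pixel coordinates.
    groups: dict mapping region name -> (list of point indices, BGR color)."""
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    for _, (indices, color) in groups.items():
        pts = landmarks[indices].astype(np.int32).reshape(-1, 1, 2)
        # Connect consecutive points of the region with a polyline.
        cv2.polylines(canvas, [pts], isClosed=False, color=color, thickness=1)
    return canvas

# Hypothetical usage with three regions of a dense landmark set.
landmarks = np.random.rand(478, 2) * 256            # placeholder predictions
groups = {
    "jaw":      (list(range(0, 17)),  (255, 0, 0)),
    "left_eye": (list(range(36, 42)), (0, 255, 0)),
    "mouth":    (list(range(48, 68)), (0, 0, 255)),
}
landmark_img = draw_landmark_image(landmarks, groups)  # (256, 256, 3) image
```

The result is a single compact three-channel image per face, which is far cheaper to feed into the network than hundreds of per-landmark channels.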
The second step is the warping field generation. For this task, the landmarks of the source and driving images are concatenated with the source image. Furthermore, the warping field prediction is conditioned on a latent vector produced from the concatenated images.
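The following is a minimal PyTorch sketch of this step under simplifying assumptions: a small encoder-decoder (not the paper's architecture) predicts a dense 2D flow from the source image concatenated with the two landmark images, and the flow is applied with grid_sample. All layer sizes are placeholders, and the latent conditioning is folded into the single encoder-decoder.

```python
# Minimal sketch of geometry-aware flow prediction and warping.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowPredictor(nn.Module):
    def __init__(self, in_ch=9, latent_dim=256):
        super().__init__()
        # Encoder compresses the concatenated inputs (3+3+3 channels).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, latent_dim, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder maps the conditioned features back to a 2-channel flow field.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 2, 4, stride=2, padding=1),
        )

    def forward(self, source_img, src_lmk_img, drv_lmk_img):
        x = torch.cat([source_img, src_lmk_img, drv_lmk_img], dim=1)  # (B, 9, H, W)
        return self.decoder(self.encoder(x))                          # (B, 2, H, W)

def warp(source_img, flow):
    """Warp the source image with flow offsets given in [-1, 1] coordinates."""
    b, _, h, w = source_img.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=source_img.device),
        torch.linspace(-1, 1, w, device=source_img.device), indexing="ij")
    base_grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    grid = base_grid + flow.permute(0, 2, 3, 1)        # add predicted offsets
    return F.grid_sample(source_img, grid, align_corners=True)

# Hypothetical usage with 256x256 inputs.
model = FlowPredictor()
src = torch.randn(1, 3, 256, 256)
flow = model(src, torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
warped = warp(src, flow)     # coarse, geometry-aware warping of the source
```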
The third step involves identity-preserving refinement. If the source image were warped directly with the predicted flow field, artifacts would inevitably arise, and the identity would likely not be preserved. For this reason, the authors introduce an identity-preserving refinement network that takes the warping field prediction, the source image, and an identity embedding of the image (extracted via a pre-trained face recognition model) and generates the semantically-preserved driven frame.
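Below is a hedged sketch of the refinement idea, not the authors' network: the warped and source images are encoded, and an identity embedding (assumed here to come from a frozen face recognition model such as ArcFace) modulates the features in an AdaIN-like fashion before decoding. The exact fusion mechanism in the paper may differ.

```python
# Minimal sketch of identity-conditioned refinement.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IDRefiner(nn.Module):
    def __init__(self, id_dim=512, feat_ch=128):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(6, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Map the identity embedding to per-channel scale and shift.
        self.to_scale = nn.Linear(id_dim, feat_ch)
        self.to_shift = nn.Linear(id_dim, feat_ch)
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(feat_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, warped_img, source_img, id_embedding):
        feat = self.encode(torch.cat([warped_img, source_img], dim=1))
        # Normalize features, then re-scale/shift them with identity statistics.
        feat = F.instance_norm(feat)
        scale = self.to_scale(id_embedding)[:, :, None, None]
        shift = self.to_shift(id_embedding)[:, :, None, None]
        return self.decode(feat * (1 + scale) + shift)

# Hypothetical usage: the identity vector would come from a frozen
# face recognition model applied to the source image.
refiner = IDRefiner()
frame = refiner(torch.randn(1, 3, 256, 256),
                torch.randn(1, 3, 256, 256),
                torch.randn(1, 512))
```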
The last step involves upsampling the frames. Doing this naively, without considering the temporal consistency between frames, would produce artifacts in the output video. Therefore, the proposed solution includes a temporal super-resolution network to account for temporal relationships across adjacent frames. Specifically, it leverages a pretrained StyleGAN model and 3D convolutions (in the spatio-temporal domain) implemented in a U-Net module. After super-resolution, the output video has a resolution of 512×512.
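As a rough illustration of temporally-aware upsampling, the sketch below uses a tiny 3D-convolutional encoder-decoder that mixes information across adjacent frames while doubling the spatial resolution. It omits the pretrained StyleGAN component and the U-Net skip connections, so it is only a stand-in for the paper's module.

```python
# Minimal sketch of temporal super-resolution with 3D convolutions.
import torch
import torch.nn as nn

class TemporalSR(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        # 3D convolutions operate jointly over (time, height, width).
        self.encode = nn.Sequential(
            nn.Conv3d(3, ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, ch, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Upsample only the spatial dimensions, keeping the frame count fixed.
        self.upsample = nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear",
                                    align_corners=False)
        self.decode = nn.Conv3d(ch, 3, kernel_size=3, padding=1)

    def forward(self, frames):
        # frames: (B, 3, T, H, W), e.g. a short window of generated frames.
        return self.decode(self.upsample(self.encode(frames)))

# Hypothetical usage: 5 adjacent 256x256 frames upscaled to 512x512.
sr = TemporalSR()
low_res = torch.randn(1, 3, 5, 256, 256)
high_res = sr(low_res)      # (1, 3, 5, 512, 512)
```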
The image below compares the proposed approach with state-of-the-art methods.
This was a summary of MetaPortrait, a novel framework addressing the talking head generation problem. If you are interested, you can find more information in the links below.
Check out the Paper, Github, and Project. All credit for this research goes to the researchers on this project.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.