UC Berkeley And Adobe AI Researchers Propose BlobGAN, A New Unsupervised And Mid-Level Representation For Insane Scene Manipulation

Since the advent of computer vision, a fundamental question for the research community has been how to represent the incredible richness of the visual world. One concept that emerged early on is the importance of the scene as context for understanding objects. Suppose we want a classifier that distinguishes between a couch and a bed. In that case, the scene context provides information about the surroundings (i.e., whether the room is a living room or a bedroom) that could be helpful for the classification.

However, after years of research, images of scenes are still mainly represented in two ways: 1) in a top-down fashion, where scene classes are represented with a label in the same way as object classes, or 2) in a bottom-up fashion, with semantic labeling of individual pixels. The principal limitation of these two approaches is that they do not represent the different parts of a scene as entities. In the first case, the various components are merged into a single label; in the second case, the basic elements are individual pixels, not entities.

From the official video presentation | Source: https://arxiv.org/pdf/2205.02837.pdf

To fill this gap, researchers from UC Berkeley and Adobe Research proposed BlobGAN, a new unsupervised mid-level representation for generative models of scenes. Mid-level means that the representation is neither per-pixel nor per-image: entities in scenes are modeled with spatial, depth-ordered Gaussian blobs. Given random noise, the layout network, an 8-layer MLP, maps it to a collection of blob parameters, which are then splatted onto a spatial grid and passed to a StyleGAN2-like decoder. The model is trained in an adversarial framework with an unmodified StyleGAN2 discriminator.
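To make the pipeline concrete, here is a minimal PyTorch sketch of a layout network of the kind described above: an 8-layer MLP that maps a noise vector to a flat row of parameters per blob. All sizes and names (NOISE_DIM, K_BLOBS, FEAT_DIM, make_layout_network) are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the layout network, under assumed sizes (not the paper's exact values).
import torch
import torch.nn as nn

NOISE_DIM = 512      # dimensionality of the input noise vector (assumed)
K_BLOBS = 10         # number of blobs k
FEAT_DIM = 256       # size of each structure/style feature vector (assumed)
PARAMS_PER_BLOB = 5 + 2 * FEAT_DIM  # x, y, scale, aspect ratio, angle + two feature vectors

def make_layout_network(hidden: int = 1024) -> nn.Sequential:
    """8-layer MLP mapping a noise vector to parameters for k blobs."""
    layers = []
    dim_in = NOISE_DIM
    for _ in range(7):
        layers += [nn.Linear(dim_in, hidden), nn.LeakyReLU(0.2)]
        dim_in = hidden
    layers.append(nn.Linear(dim_in, K_BLOBS * PARAMS_PER_BLOB))
    return nn.Sequential(*layers)

layout_net = make_layout_network()
z = torch.randn(4, NOISE_DIM)                      # a batch of random noise vectors
blob_params = layout_net(z).view(4, K_BLOBS, -1)   # one parameter row per blob
```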

Source: https://arxiv.org/pdf/2205.02837.pdf

More specifically, blobs are represented as ellipses with center coordinates x, scale s, aspect ratio a, and rotation angle θ. In addition, each blob is associated with two feature vectors, one for structure and one for style. 
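Continuing the same sketch, each blob's flat parameter row can be split into the quantities listed above; the ordering within the row is an assumption made here purely for illustration.

```python
# Splitting each blob's parameter row into named quantities (ordering assumed).
xy             = blob_params[..., 0:2]                  # blob center (x, y)
scale          = blob_params[..., 2:3]                  # size s (a very low value suppresses the blob)
aspect         = blob_params[..., 3:4]                  # aspect ratio a
angle          = blob_params[..., 4:5]                  # rotation angle theta
feat_structure = blob_params[..., 5:5 + FEAT_DIM]       # structure feature vector
feat_style     = blob_params[..., 5 + FEAT_DIM:]        # style feature vector
```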

Source: https://arxiv.org/pdf/2205.02837.pdf | From the official video presentation

So, the layout network maps random noise to a fixed number k of blobs (the network can also suppress a blob by predicting a very low scale parameter), each represented by four parameters (in practice five, since the center is defined by x and y coordinates) and two feature vectors. The ellipses defined by these parameters are then splatted onto a grid that also carries a depth dimension, alpha-composited in 2D (to handle occlusion and inter-blob relationships), and populated with the information contained in the feature vectors. The resulting feature grid is then passed to the generator. In the original StyleGAN2, the generator takes as input a single array with all the extracted information, while in this work the first layers are modified to take layout and appearance separately. This design, together with the uniform noise the authors add to the blob parameters before feeding them to the generator, enforces a disentangled representation.
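The splatting and compositing step can be sketched roughly as follows: each blob is drawn as a soft elliptical opacity map on a coarse grid, the maps are composited back to front so that "closer" blobs occlude earlier ones, and the resulting feature grid is what would be fed to the StyleGAN2-like decoder. The Gaussian falloff, sigmoid squashing, grid size, and jitter magnitude are all assumptions of this sketch, not the paper's exact formulation.

```python
# Rough sketch of blob splatting and depth-ordered alpha compositing (assumptions noted above).
def splat_blobs(xy, scale, aspect, angle, feats, grid=16):
    """xy: (B, K, 2) in [0, 1]; scale, aspect, angle: (B, K, 1); feats: (B, K, D)."""
    B, K, D = feats.shape
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, grid), torch.linspace(0, 1, grid), indexing="ij"
    )
    coords = torch.stack([xs, ys], dim=-1)                     # (grid, grid, 2)
    diff = coords[None, None] - xy[:, :, None, None, :]        # (B, K, grid, grid, 2)

    # Rotate offsets into each blob's frame; stretch the two axes by s and s * a.
    cos = torch.cos(angle[..., 0])[..., None, None]            # (B, K, 1, 1)
    sin = torch.sin(angle[..., 0])[..., None, None]
    dx = cos * diff[..., 0] + sin * diff[..., 1]
    dy = -sin * diff[..., 0] + cos * diff[..., 1]
    s = torch.sigmoid(scale[..., 0])[..., None, None] + 1e-3   # keep sizes positive
    a = torch.sigmoid(aspect[..., 0])[..., None, None] + 1e-3
    opacity = torch.exp(-0.5 * ((dx / s) ** 2 + (dy / (s * a)) ** 2))  # (B, K, g, g)

    # Depth-ordered alpha compositing: blob index serves as depth, composited back to front.
    feature_grid = torch.zeros(B, D, grid, grid)
    for k in range(K):
        alpha = opacity[:, k:k + 1]                            # (B, 1, g, g)
        feature_grid = alpha * feats[:, k, :, None, None] + (1 - alpha) * feature_grid
    return feature_grid                                        # fed to the decoder

# Small uniform jitter on blob parameters (magnitude assumed), which the paper credits
# with encouraging disentanglement.
xy_jittered = xy + (torch.rand_like(xy) - 0.5) * 0.02
feature_grid = splat_blobs(xy_jittered, scale, aspect, angle, feat_structure)
```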

The network defined above was trained on the LSUN scenes dataset in an unsupervised way.

Despite not being supervised, the network was able, thanks to the spatial uniformity of blobs and the locality of convolutions, to associate different blobs with different components of the scene. This can be seen from the presented results, computed with k=10 blobs. For an extensive visualization of the results, here's the project page with animations. The results are awe-inspiring, as the image below suggests: manipulating blobs allows substantial and precise modifications of the generated image. It is, for example, possible to empty a room (even though the training data contained no images of empty rooms), to add, shrink, and move entities, and to restyle the different objects.
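The kinds of edits described above reduce to simple changes of the blob parameters followed by re-splatting and decoding. The sketch below illustrates two of them under the same assumptions as the earlier snippets; the blob indices and edit magnitudes are arbitrary choices for the example.

```python
# Illustrative blob edits: move one blob, remove another by pushing its size very low.
edited_xy = xy.detach().clone()
edited_scale = scale.detach().clone()

edited_xy[:, 3, 0] += 0.2       # move blob 3 to the right by 20% of the image width
edited_scale[:, 7, 0] -= 4.0    # drive blob 7's size parameter very low, effectively removing it

edited_grid = splat_blobs(edited_xy, edited_scale, aspect, angle, feat_structure)
# decoder(edited_grid, feat_style)  # regenerate the scene with the modified layout
```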

Source: https://arxiv.org/pdf/2205.02837.pdf

In conclusion, even though diffusion models have recently overshadowed GANs, this paper presents a new and disruptive technique that controls the scene with unprecedented precision. In addition, the training is entirely unsupervised, so no time needs to be spent labeling images.

This article is written as a summary article by Marktechpost Staff based on the paper 'BlobGAN: Spatially Disentangled Scene Representations'. All credit for this research goes to the researchers on this project. Check out the paper, GitHub, and project page.

