Uni3D: Exploring Unified 3D Representation at Scale

Scaling up representations of text and visuals has been a major focus of research in recent years. Work conducted in the recent past has driven breakthroughs in both language learning and vision. However, despite the popularity of scaling text and visual representations, the scaling of representations for 3D scenes and objects has received far less attention.

Today, we will discuss Uni3D, a 3D foundation model that aims to explore unified 3D representations. The Uni3D framework employs a 2D-initialized ViT framework, pretrained end-to-end, to align image-text features with their corresponding 3D point cloud features.

The Uni3D framework uses pretext tasks and a simple architecture to leverage the abundance of pretrained 2D models and image-text-aligned models as initializations and targets, respectively. This approach unleashes the full potential of 2D models and strategies to scale them to the 3D world.

In this article, we will delve deeper into 3D computer vision and the Uni3D framework, exploring the essential concepts and the architecture of the model. So, let’s begin.

In the past few years, computer vision has emerged as one of the most heavily invested domains in the AI industry. Following significant advancements in 2D computer vision frameworks, developers have shifted their focus to 3D computer vision. This field, particularly 3D representation learning, merges aspects of computer graphics, machine learning, computer vision, and mathematics to automate the processing and understanding of 3D geometry. The rapid development of 3D sensors like LiDAR, along with their widespread applications in the AR/VR industry, has resulted in 3D representation learning gaining increased attention. Its potential applications continue to grow daily.

Although existing frameworks have shown remarkable progress in 3D model architecture, task-oriented modeling, and learning objectives, most explore 3D architecture on a relatively small scale with limited data, parameters, and task scenarios. The challenge of learning scalable 3D representations, which can then be applied to real-time applications in diverse environments, remains largely unexplored.

Moving along, in the past few years, scaling pre-trained large language models has helped revolutionize the natural language processing domain, and recent works have indicated a translation of that progress from language to 2D vision using data and model scaling. This paves the way for developers to reattempt this success in learning a 3D representation that can be scaled up and transferred to applications in the real world.

Uni3D is a scalable and unified 3D pretraining framework developed with the aim of learning large-scale 3D representations, testing its limits at the scale of over a billion parameters, over 10 million images paired with over 70 million texts, and over a million 3D shapes. The figure below compares zero-shot accuracy against parameter count in the Uni3D framework. The Uni3D framework successfully scales 3D representations from 6 million to over a billion parameters.

The Uni3D framework consists of a 2D ViT, or Vision Transformer, as the 3D encoder, which is then pre-trained end-to-end to align image-text-aligned features with 3D point cloud features. The Uni3D framework makes use of pretext tasks and a simple architecture to leverage the abundance of pretrained 2D models and image-text-aligned models as initializations and targets, respectively, thus unleashing the full potential of 2D models and of strategies to scale them to the 3D world. The flexibility and scalability of the Uni3D framework are measured in terms of:

  1. Scaling the model from 6M to over a billion parameters. 
  2. Swapping the 2D initialization, from visual self-supervised learning to text-supervised learning. 
  3. Text-image target model scaling from 150 million to over a billion parameters. 

Under the flexible and unified framework offered by Uni3D, developers observe a coherent boost in performance when scaling each component. Large-scale 3D representation learning also benefits immensely from the shareable 2D and scale-up strategies. 

As can be seen in the figure below, the Uni3D framework displays a boost in performance compared to prior art in few-shot and zero-shot settings. It is worth noting that the Uni3D framework returns a zero-shot classification accuracy score of over 88% on ModelNet, which is on par with the performance of several state-of-the-art supervised methods. 

Furthermore, the Uni3D framework also delivers top-notch accuracy and performance on other representative 3D tasks such as part segmentation and open-world understanding. The Uni3D framework aims to bridge the gap between 2D vision and 3D vision by scaling 3D foundation models with a unified yet simple pre-training approach, learning more robust 3D representations across a wide array of tasks that might ultimately help the convergence of 2D and 3D vision across a wide array of modalities.

Uni3D: Related Work

The Uni3D framework draws inspiration from, and learns from, the developments made by previous work on 3D representation learning and foundation models, especially across different modalities. 

3D Representation Learning

3D representation learning uses point clouds for 3D understanding of objects, a field that developers have explored extensively in the recent past. It has been observed that these point clouds can be pre-trained under self-supervision using specific 3D pretext tasks, including masked point modeling, self-reconstruction, and contrastive learning. 
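To make the pretext tasks concrete, here is a minimal NumPy sketch of the masking step behind masked point modeling. The patch count, mask ratio, and the placeholder "prediction" are illustrative assumptions, not Uni3D's actual implementation; a real model would predict the hidden patches from the visible ones with a decoder and minimize a Chamfer-style reconstruction loss like the one below:

```python
import numpy as np

rng = np.random.default_rng(0)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets of shape (N, 3) and (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def mask_patches(n_patches, ratio=0.6):
    """Randomly hide a fraction of point patches, as in masked point modeling."""
    n_mask = int(n_patches * ratio)
    idx = rng.permutation(n_patches)
    return idx[n_mask:], idx[:n_mask]  # visible indices, masked indices

# toy cloud: 8 local patches of 32 points each
patches = rng.normal(size=(8, 32, 3))
visible, masked = mask_patches(len(patches))

# a hypothetical decoder would reconstruct the masked patches from the visible
# ones; here a random stand-in "prediction" just shows how the loss is scored
pred = rng.normal(size=(32, 3))
loss = chamfer_distance(pred, patches[masked[0]])
```

The self-supervised objective is simply to drive this reconstruction loss toward zero on the hidden patches, which forces the encoder to capture local 3D geometry.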

It is worth noting that these methods work with limited data, and they often do not investigate multimodal representations from 2D or NLP to 3D. However, following the recent success of the CLIP framework, which learns visual concepts from raw text with high efficiency using contrastive learning, subsequent works have sought to learn 3D representations by aligning image, text, and point cloud features using the same contrastive learning method. 

Foundation Models

Developers have been working exhaustively on designing foundation models to scale up and unify multimodal representations. In the NLP domain, for example, developers have been working on frameworks that scale up pre-trained language models, which is slowly revolutionizing the NLP industry. Similar advancements can be observed in the 2D vision domain, where developers use data and model scaling techniques to carry the progress made in language over to 2D models. Such frameworks are difficult to replicate for 3D models, however, because of the limited availability of 3D data and the challenges encountered when unifying and scaling up 3D frameworks. 

By learning from the above two work domains, developers have created the Uni3D framework, the first 3D foundation model with over a billion parameters. It makes use of a unified ViT, or Vision Transformer, architecture that allows developers to scale the Uni3D model using unified 2D or NLP strategies for scaling up models. Developers hope that this method will allow the Uni3D framework to bridge the gap that currently separates 2D and 3D vision, along with facilitating multimodal convergence.

Uni3D: Method and Architecture

The above image demonstrates a generic overview of the Uni3D framework: a scalable and unified pre-training 3D framework for large-scale 3D representation learning. Developers make use of over 70 million texts and 10 million images paired with over a million 3D shapes to scale the Uni3D framework to over a billion parameters. The Uni3D framework uses a 2D ViT, or Vision Transformer, as the 3D encoder, which is then trained end-to-end to align the image-text features with the 3D point cloud features, allowing the Uni3D framework to deliver the desired efficiency and accuracy across a wide array of benchmarks. Let us now have a detailed look at the working of the Uni3D framework. 
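The end-to-end alignment described above can be sketched as a CLIP-style symmetric contrastive (InfoNCE) loss between point cloud embeddings and their paired image and text embeddings. This is a minimal NumPy illustration under assumed batch and embedding sizes, not the actual Uni3D training code:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Numerically stable log-softmax cross-entropy over rows of `logits`."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def alignment_loss(pc_emb, img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss pulling each point cloud embedding toward its
    paired image and text embeddings; row i of each batch is the positive pair."""
    pc, img, txt = map(l2_normalize, (pc_emb, img_emb, txt_emb))
    labels = np.arange(len(pc))
    total = 0.0
    for target in (img, txt):
        logits = pc @ target.T / temperature  # (B, B) similarity matrix
        total += 0.5 * (cross_entropy(logits, labels) +
                        cross_entropy(logits.T, labels))
    return total / 2

# toy batch: embeddings that match their targets give a much lower loss
emb = rng.normal(size=(4, 16))
matched = alignment_loss(emb, emb, emb)
mismatched = alignment_loss(emb, rng.normal(size=(4, 16)), rng.normal(size=(4, 16)))
```

In the full framework the image and text embeddings come from a frozen image-text model and only the 3D encoder producing `pc_emb` is trained, so minimizing this loss pulls point cloud features into the shared image-text embedding space.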

Scaling the Uni3D Framework

Prior studies on point cloud representation learning have traditionally focused on designing particular model architectures that deliver better performance across a wide range of applications, and they work on a limited amount of data owing to small-scale datasets. Recent studies have tried to explore the possibility of scalable pre-training in 3D, but there have been no major outcomes due to the limited availability of 3D data. To solve the scalability problem of 3D frameworks, the Uni3D framework leverages the power of a vanilla transformer structure that almost mirrors a Vision Transformer, and it solves the scaling problem by using unified 2D or NLP scaling-up strategies to scale the model size. 
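As a rough illustration of how a vanilla ViT-style transformer can consume a point cloud at all, the NumPy sketch below groups points into local patches around well-spread centers and linearly projects each patch to a token. The patch counts, sizes, and random projection are hypothetical stand-ins; a real pipeline would learn the projection and feed the resulting token sequence into the transformer blocks:

```python
import numpy as np

rng = np.random.default_rng(0)

def farthest_point_sample(pts, k):
    """Greedily pick k well-spread center indices (a simple FPS variant)."""
    idx = [0]
    dist = np.linalg.norm(pts - pts[0], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())                 # farthest point from chosen set
        idx.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(pts - pts[nxt], axis=1))
    return np.array(idx)

def tokenize_point_cloud(pts, n_patches=16, patch_size=32, dim=64):
    """Group a cloud into local patches and project each to one ViT-style token."""
    centers = pts[farthest_point_sample(pts, n_patches)]
    # the nearest `patch_size` points to each center form one patch
    d = np.linalg.norm(pts[None, :, :] - centers[:, None, :], axis=-1)  # (P, N)
    neighbors = np.argsort(d, axis=1)[:, :patch_size]                   # (P, S)
    patches = pts[neighbors] - centers[:, None, :]   # center-relative coordinates
    w = rng.normal(scale=0.02, size=(patch_size * 3, dim))  # hypothetical projection
    return patches.reshape(n_patches, -1) @ w        # (P, dim) token sequence

pts = rng.normal(size=(1024, 3))
tokens = tokenize_point_cloud(pts)
```

Once the cloud is reduced to a fixed-length token sequence like this, the rest of the encoder can be a standard 2D ViT, which is exactly what lets Uni3D reuse 2D initializations and 2D/NLP scaling recipes unchanged.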
