Apple Researchers Propose an End-to-End Network Producing Detailed 3D Reconstructions from Posed Images

Have you ever played GTA-5? One can't help but admire the 3D graphics in the game. Unlike 2D graphics on a flat plane, 3D graphics simulate depth and perspective, allowing for more realistic and immersive visuals. They are widely used in video games, film production, architectural visualization, medical imaging, virtual reality, and more.

The traditional way to build a 3D model is to estimate a depth map for each input image and then fuse those depth maps into a single 3D model. A team of researchers from Apple and the University of California, Santa Barbara instead developed an end-to-end network that directly infers scene-level 3D geometry from posed images using deep neural networks, without the traditional test-time optimization step.
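To make that baseline concrete, here is a minimal NumPy sketch of the classic depth-fusion pipeline (TSDF fusion). The function name, grid setup, and parameters are illustrative assumptions, not the paper's code:

```python
import numpy as np

def fuse_depth_maps(depths, K, poses, grid=64, voxel=0.04, trunc=0.12):
    """Classic TSDF fusion: integrate per-view depth maps into one volume.

    depths: list of (H, W) metric depth maps.
    K:      (3, 3) camera intrinsics (assumed shared across views).
    poses:  list of (4, 4) world-to-camera extrinsics.
    """
    tsdf = np.ones(grid ** 3, dtype=np.float32)     # running SDF average
    weight = np.zeros(grid ** 3, dtype=np.float32)  # per-voxel observation count

    # World coordinates of every voxel centre, shape (N, 3).
    ijk = np.indices((grid,) * 3).reshape(3, -1).T
    pts = (ijk + 0.5) * voxel

    for depth, pose in zip(depths, poses):
        h, w = depth.shape
        cam = (pose[:3, :3] @ pts.T + pose[:3, 3:4]).T   # world -> camera
        z = cam[:, 2]
        safe_z = np.maximum(z, 1e-6)
        u = np.round(K[0, 0] * cam[:, 0] / safe_z + K[0, 2]).astype(int)
        v = np.round(K[1, 1] * cam[:, 1] / safe_z + K[1, 2]).astype(int)
        ok = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        d = np.zeros_like(z)
        d[ok] = depth[v[ok], u[ok]]
        sdf = np.clip((d - z) / trunc, -1.0, 1.0)        # signed distance along the ray
        ok &= (d > 0) & (sdf > -1.0)                     # skip voxels far behind the surface
        tsdf[ok] = (tsdf[ok] * weight[ok] + sdf[ok]) / (weight[ok] + 1.0)
        weight[ok] += 1.0
    return tsdf.reshape(grid, grid, grid)
```

Wherever the per-view depth maps disagree, this averaging either cancels out or leaves holes, which is exactly the failure mode described next.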

The traditional pipeline leaves missing geometry or artifacts wherever the depth maps disagree, such as on transparent or low-textured surfaces. The researchers' approach instead back-projects image features onto a voxel grid and directly predicts the scene's truncated signed distance function (TSDF) with a 3D convolutional neural network.
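The back-projection step can be sketched as follows. This is a simplified PyTorch illustration under assumed shapes and a shared intrinsics matrix `K`, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def backproject_features(feats, K, poses, grid=48, voxel=0.08):
    """Lift 2D image features onto a voxel grid by projecting each voxel
    centre into every view and averaging the sampled features.

    feats: (V, C, H, W) 2D CNN features for V posed images.
    K:     (3, 3) shared intrinsics; poses: (V, 4, 4) world-to-camera.
    """
    V, C, H, W = feats.shape
    ijk = torch.stack(torch.meshgrid(
        *(torch.arange(grid),) * 3, indexing="ij"), dim=-1)
    pts = (ijk.reshape(-1, 3).float() + 0.5) * voxel          # (N, 3) voxel centres

    volume = torch.zeros(C, pts.shape[0])
    count = torch.zeros(1, pts.shape[0])
    for i in range(V):
        cam = pts @ poses[i, :3, :3].T + poses[i, :3, 3]      # world -> camera
        z = cam[:, 2].clamp(min=1e-4)
        # Pixel coordinates, normalised to [-1, 1] for grid_sample.
        px = (K[0, 0] * cam[:, 0] / z + K[0, 2]) / (W - 1) * 2 - 1
        py = (K[1, 1] * cam[:, 1] / z + K[1, 2]) / (H - 1) * 2 - 1
        uv = torch.stack([px, py], dim=-1).view(1, 1, -1, 2)
        sampled = F.grid_sample(feats[i:i + 1], uv,
                                align_corners=True).view(C, -1)
        visible = (uv.abs() <= 1).all(-1).view(1, -1) & (cam[:, 2] > 0).view(1, -1)
        volume += sampled * visible
        count += visible
    volume /= count.clamp(min=1)                              # average across views
    return volume.view(C, grid, grid, grid)
```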

A convolutional neural network (CNN) is a specialized artificial neural network designed for processing and analyzing visual data, particularly images and videos. The advantage of this formulation is that the CNN can learn to produce smooth, consistent surfaces that fill the gaps in low-textured or transparent regions.
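On top of a back-projected feature volume, the TSDF prediction head could look roughly like this minimal PyTorch sketch; the channel count and layer depth are assumptions, not the paper's architecture:

```python
import torch.nn as nn

class TSDFHead(nn.Module):
    """Minimal sketch of a 3D CNN that refines a back-projected feature
    volume and regresses a truncated signed distance per voxel."""
    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv3d(channels, 1, 1),
        )

    def forward(self, volume):           # volume: (B, C, D, H, W)
        return self.net(volume).tanh()   # TSDF values in [-1, 1]
```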

During training, prior methods resample the ground-truth TSDF with tri-linear interpolation so that it aligns with the model's voxel grid. This resampling corrupts the fine details in the supervision signal. To overcome this, the researchers supervise predictions only at the exact points where the ground-truth TSDF is known, and this change improved their results by 10%.
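The idea can be illustrated with a short PyTorch sketch: rather than resampling the ground truth onto the grid, the prediction is sampled at the known ground-truth points. The names, shapes, and the `scene_size` normalization are assumptions:

```python
import torch.nn.functional as F

def tsdf_loss_at_gt_points(pred_volume, gt_points, gt_tsdf, scene_size):
    """Supervise the prediction at exact ground-truth points instead of
    trilinearly resampling the ground-truth TSDF onto the voxel grid.

    pred_volume: (1, 1, D, H, W) predicted TSDF.
    gt_points:   (N, 3) world coordinates with known ground-truth TSDF,
                 assumed (x, y, z)-ordered in a [0, scene_size] cube.
    gt_tsdf:     (N,) ground-truth values at those points.
    """
    coords = (gt_points / scene_size) * 2 - 1      # normalise to [-1, 1]
    coords = coords.view(1, 1, 1, -1, 3)
    pred = F.grid_sample(pred_volume, coords, align_corners=True).view(-1)
    return F.l1_loss(pred, gt_tsdf)
```

Note that interpolation still happens here, but on the smooth predicted field rather than on the sharp ground truth, so no detail is destroyed before the loss is computed.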

A voxel (short for "volume pixel") represents a point in 3D space within a grid, much as a pixel represents a point in a 2D image. Existing methods use voxels of 4 cm or larger, which is too coarse to resolve the geometric detail visible in natural images, and simply increasing the voxel resolution is prohibitively expensive. The researchers sidestep this by augmenting the CNN's grid features with image features projected directly at each query point.
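A sketch of this per-point feature lookup for a single view might look like the following; the function and its arguments are hypothetical:

```python
import torch
import torch.nn.functional as F

def sample_image_features_at_points(feats, K, pose, points, H, W):
    """Project arbitrary 3D query points into one view and bilinearly
    sample its 2D feature map, so fine image detail reaches the TSDF
    prediction without requiring a finer voxel grid.

    feats:  (1, C, Hf, Wf) 2D feature map; K: (3, 3); pose: (4, 4).
    points: (N, 3) query points in world coordinates.
    """
    cam = points @ pose[:3, :3].T + pose[:3, 3]         # world -> camera
    z = cam[:, 2].clamp(min=1e-4)
    px = (K[0, 0] * cam[:, 0] / z + K[0, 2]) / (W - 1) * 2 - 1
    py = (K[1, 1] * cam[:, 1] / z + K[1, 2]) / (H - 1) * 2 - 1
    uv = torch.stack([px, py], dim=-1).view(1, 1, -1, 2)  # [-1, 1] coords
    return F.grid_sample(feats, uv, align_corners=True).view(feats.shape[1], -1)
```

Because the query points can be placed anywhere, the output surface can be evaluated at a resolution much finer than the voxel grid itself.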

Back-projection requires sampling a feature from every input image for every voxel, and averaging these samples across views blurs the back-projected volume. The researchers address this with an initial multi-view stereo depth estimate, which they use to enhance the feature volume.
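One simplified way to picture this depth guidance is as an extra feature channel encoding each voxel's signed offset from the estimated surface along its camera ray. The sketch below is an assumption about the mechanism, not the paper's exact design:

```python
import torch

def add_depth_guidance(volume, voxel_depth, mvs_depth, trunc=0.12):
    """Append a channel telling the 3D CNN how far each voxel lies from
    the MVS-estimated surface, concentrating evidence near that surface.

    volume:      (C, D, H, W) back-projected feature volume.
    voxel_depth: (D, H, W) each voxel's depth in the camera frame.
    mvs_depth:   (D, H, W) MVS depth sampled at each voxel's projection.
    """
    diff = torch.clamp((mvs_depth - voxel_depth) / trunc, -1.0, 1.0)
    return torch.cat([volume, diff.unsqueeze(0)], dim=0)
```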

The researchers claim that this design is key to letting the network learn fine detail while allowing the output resolution to be chosen freely, without additional training or extra 3D convolution layers.


Check out the Paper and GitHub link. All credit for this research goes to the researchers on this project.



Arshad is an intern at MarktechPost. He is currently pursuing an Integrated MSc in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn drive technological progress, and he is passionate about understanding nature fundamentally with the help of tools like mathematical models, ML models, and AI.

