
NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction

ICLR 2026

TU Munich    Munich Center for Machine Learning    University of Oxford    NTU

TL;DR: Given unposed multi-view images, NOVA3R recovers complete, non-overlapping 3D geometry, reconstructing visible and occluded regions with physical plausibility.

Abstract

We present NOVA3R, an effective approach for non-pixel-aligned 3D reconstruction from a set of unposed images in a feed-forward manner. Unlike pixel-aligned methods that tie geometry to per-ray predictions, our formulation learns a global, view-agnostic scene representation that decouples reconstruction from pixel alignment. This addresses two key limitations of pixel-aligned 3D: (1) it recovers both visible and invisible points with a complete scene representation, and (2) it produces physically plausible geometry with fewer duplicated structures in overlapping regions. To achieve this, we introduce a scene-token mechanism that aggregates information across unposed images and a diffusion-based 3D decoder that reconstructs complete, non-pixel-aligned point clouds. Extensive experiments on both scene-level and object-level datasets demonstrate that NOVA3R outperforms state-of-the-art methods in terms of reconstruction accuracy and completeness.

Examples

Compared with pixel-aligned reconstruction models such as VGGT, NOVA3R reconstructs both visible and occluded regions.


Qualitative comparison: input image vs. reconstructions from VGGT, LaRI, and NOVA3R.

Method

NOVA3R Pipeline: Stage 1: Learn latent scene tokens from complete point clouds using a flow-matching autoencoder. Stage 2: Map multi-view images into the same latent space via an image encoder with learnable initial tokens, and train the encoder with frozen decoder weights from stage 1 using flow-matching loss.
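Both stages above train with a flow-matching objective. As a minimal illustration (not the paper's implementation), the conditional flow-matching loss regresses a velocity field along a linear path from noise to data; the `model` callable and toy latent shapes below are hypothetical stand-ins for the scene-token decoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, x1, x0, t):
    """Conditional flow-matching loss on a batch of latent scene tokens.

    x1: target latents, shape (B, D)  -- the "data" sample
    x0: Gaussian noise sample, shape (B, D)
    t:  interpolation times in [0, 1], shape (B, 1)
    """
    # Linear probability path between noise and data.
    x_t = (1.0 - t) * x0 + t * x1
    # The target velocity for the linear path is constant: x1 - x0.
    v_target = x1 - x0
    # Regress the model's predicted velocity onto the target.
    v_pred = model(x_t, t)
    return np.mean((v_pred - v_target) ** 2)
```

In stage 2, the image encoder's output would condition such a model while the stage-1 decoder weights stay frozen; only the loss form is shown here.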


Complete Reconstruction

Given uncalibrated multi-view inputs, NOVA3R learns a global scene representation that integrates cross-view information to recover visible and occluded geometry. The resulting complete 3D reconstruction captures full scene structure and supports downstream tasks including robotics and scene understanding.


Non-overlapping Reconstruction with Even Density

By using a non-pixel-aligned formulation, NOVA3R learns a globally consistent point distribution, yielding non-overlapping reconstructions with uniform density. Pixel-aligned approaches, in contrast, tend to introduce duplicated and misaligned points in overlapping areas.


TRELLIS Integration for Scene Reconstruction

We integrate NOVA3R into the TRELLIS pipeline by replacing Stage-1 active voxel prediction with NOVA3R's non-pixel-aligned point cloud output. This simple substitution enables a pretrained object-centric TRELLIS model to generalize to scene-level reconstruction, while reusing the unchanged Stage-2 decoder to generate meshes, 3DGS, and radiance fields.
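The substitution amounts to quantizing NOVA3R's point cloud into the active-voxel format that TRELLIS's stage-2 decoder consumes. A minimal sketch, assuming points normalized to the unit cube and a hypothetical `points_to_active_voxels` helper (grid resolution and normalization are assumptions, not the paper's exact settings):

```python
import numpy as np

def points_to_active_voxels(points, resolution=64):
    """Quantize a normalized point cloud into active voxel indices.

    points: (N, 3) array with coordinates in [-0.5, 0.5]^3.
    Returns the unique occupied (i, j, k) cells on a resolution^3 grid,
    standing in for TRELLIS's stage-1 active-voxel prediction.
    """
    # Map [-0.5, 0.5] to [0, resolution) and clamp onto the grid.
    idx = np.floor((points + 0.5) * resolution).astype(np.int64)
    idx = np.clip(idx, 0, resolution - 1)
    # Deduplicate: a voxel is either active or not, regardless of
    # how many points fall inside it.
    return np.unique(idx, axis=0)
```

The resulting index set could then be handed to the unchanged stage-2 decoder to produce meshes, 3DGS, or radiance fields.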

Qualitative comparison: Input · TRELLIS · NOVA3R + TRELLIS (stage 2).

Related Work

We build on prior advances in multi-view 3D reconstruction, global scene representations, and diffusion-based generative models. NOVA3R complements pixel-aligned methods by learning a view-agnostic scene embedding that supports complete geometry recovery and reduces overlap artifacts.

BibTeX

@inproceedings{chen2026nova3r,
  title={NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction},
  author={Chen, Weirong and Zheng, Chuanxia and Zhang, Ganlin and Vedaldi, Andrea and Cremers, Daniel},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}