DynaTok: Token-Based 4D Reconstruction
from Partial Point Clouds

ICML 2026

¹TU Munich  ·  ²Google  ·  ³Munich Center for Machine Learning

DynaTok teaser: partial depth observations across time are aggregated into a complete, temporally coherent 4D scene.

TL;DR: Given partial, unordered point-cloud sequences, DynaTok aggregates observations over time into compact tokens and reconstructs complete, temporally coherent 4D geometry.

Abstract

We address 4D reconstruction from partial point cloud sequences, where depth-sensor observations are incomplete, unordered, and lack explicit temporal correspondences. This geometry-only setting is challenging due to missing observations and ambiguous dynamics. While recent progress has largely relied on image-based methods, existing point-based approaches typically focus on single objects, assume relatively complete inputs, or require explicit correspondences.

To address these limitations, we propose DynaTok, an image-free, point-based framework for correspondence-free 4D reconstruction from partial point-cloud sequences. DynaTok encodes each frame into compact latent tokens, aggregates incomplete observations over time with a Transformer-based spatiotemporal encoder, and decouples geometry from motion through residual tokens in a unified model. A flow-matching decoder then reconstructs complete, temporally consistent 4D point-cloud sequences conditioned on the latent tokens.
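
The page does not state which probability path or network the flow-matching decoder uses. As a minimal sketch of the general technique, assuming a straight-line path between Gaussian noise and the target points, the training target and an Euler sampling loop could look as follows. The names `velocity_fn`, `cond`, and `oracle_velocity` are hypothetical; they stand in for the learned decoder and the latent tokens and are not taken from the paper.

```python
import numpy as np

def linear_path(x0, x1, t):
    """Point at time t in [0, 1] on the straight line from noise x0 to data x1."""
    return (1.0 - t) * x0 + t * x1

def flow_matching_loss(velocity_fn, x0, x1, t, cond):
    """Mean squared error between predicted and target velocity.

    For the straight-line path (an assumption here), the target velocity
    is constant along the path and equals x1 - x0.
    """
    x_t = linear_path(x0, x1, t)
    pred = velocity_fn(x_t, t, cond)
    return float(np.mean((pred - (x1 - x0)) ** 2))

def euler_sample(velocity_fn, x0, cond, num_steps=8):
    """Integrate dx/dt = v(x, t, cond) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt, cond)
    return x

def oracle_velocity(x_t, t, cond):
    """Toy stand-in for a trained network: `cond` is the target itself."""
    return (cond - x_t) / (1.0 - t)

rng = np.random.default_rng(0)
x1 = rng.normal(size=(128, 3))  # toy "complete" point cloud
x0 = rng.normal(size=(128, 3))  # Gaussian noise sample
loss = flow_matching_loss(oracle_velocity, x0, x1, t=0.3, cond=x1)
sample = euler_sample(oracle_velocity, x0, cond=x1)
```

With the oracle velocity the loss vanishes and the sampler lands on the target; a trained decoder would instead predict the velocity from the noisy points, the time, and the latent tokens.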

Experiments on object- and scene-level benchmarks show that DynaTok improves reconstruction quality and temporal coherence over the compared baselines when given only partial point-cloud observations.

Method

DynaTok pipeline overview.
Overview of the DynaTok pipeline. Each input point cloud is tokenized into per-frame latent tokens and aggregated by a Transformer-based spatiotemporal encoder. A residual decomposition separates the geometry tokens at the reference frame (s = 1) from per-frame motion tokens (s > 1), and a conditional flow-matching decoder reconstructs the complete scene at every timestep. The model is trained end-to-end with point-cloud supervision only.
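
The caption does not give the exact form of the residual decomposition. One way to picture it, assuming a purely additive residual (an assumption of this sketch, not a statement of the paper's formulation), is to treat the reference-frame tokens as geometry and every later frame as an offset from them. The function names `decompose` and `recompose` and the token shape `(S, K, D)` are hypothetical.

```python
import numpy as np

def decompose(tokens):
    """Split per-frame tokens of shape (S, K, D) into geometry and motion.

    Index 0 corresponds to the reference frame (s = 1 in the caption).
    Motion tokens are assumed to be additive residuals for frames s > 1.
    """
    geometry = tokens[0]            # (K, D)
    motion = tokens[1:] - geometry  # (S - 1, K, D), broadcast over frames
    return geometry, motion

def recompose(geometry, motion):
    """Rebuild per-frame tokens from geometry plus per-frame residuals."""
    later_frames = geometry[None] + motion
    return np.concatenate([geometry[None], later_frames], axis=0)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 16, 32))  # toy values: S = 4 frames, K = 16 tokens, D = 32
geometry, motion = decompose(tokens)
rebuilt = recompose(geometry, motion)
```

Under this assumption a static scene would yield motion tokens that are all zero, which is one intuition for why separating the two can help.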

Results

Qualitative results on DeformingThings4D-Animals.
Object-level 4D reconstruction on DeformingThings4D-Animals. Given partial point clouds back-projected from depth, DynaTok produces clearer articulated geometry than Motion2VecSets and frame-wise 3D latent baselines, and recovers regions that are unobserved in individual frames.
Qualitative results on Kubric scene-level 4D reconstruction.
Scene-level 4D reconstruction on Kubric. On multi-object dynamic scenes with a moving camera and a static background, DynaTok aggregates partial observations into a coherent canonical space, yielding dense and accurate reconstructions of both dynamic objects and background structure.
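
The partial inputs above are described as point clouds back-projected from depth. For readers unfamiliar with that step, here is a minimal sketch of standard pinhole back-projection. It assumes depth is measured along the camera z-axis and that invalid pixels are marked with zero or a non-finite value. The intrinsics `fx`, `fy`, `cx`, `cy` and the function name are illustrative and do not come from the paper.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Convert an (H, W) depth map into an (N, 3) camera-space point cloud.

    Pixels with zero, negative, or non-finite depth are dropped, which is
    why the resulting point cloud is partial.
    """
    h, w = depth.shape
    # u indexes columns and v indexes rows; both arrays have shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = np.isfinite(depth) & (depth > 0)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

# Toy 2x2 depth map with two invalid pixels.
depth = np.array([[1.0, 0.0],
                  [2.0, np.nan]])
points = backproject_depth(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Only the two valid pixels survive, so the output has two points.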

Canonical Space Analysis

Evolution of the canonical space as more frames are observed.
Canonical-space evolution across timesteps. As more frames are progressively incorporated (S = 1, 4, 8, 12, 16), the canonical representation anchored at the reference frame is consistently expanded with newly observed geometry. The reconstruction grows denser and more complete, confirming that DynaTok fuses multi-frame observations into a unified scene rather than reconstructing each frame independently.
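
The page does not say how completeness is measured. As an illustration only, one simple proxy is the fraction of reference points that have a reconstructed point within a small radius. The sketch below uses that hypothetical metric on toy data where two partial views each cover half of a static shape. It concatenates raw points, whereas DynaTok fuses observations in latent token space.

```python
import numpy as np

def completeness(reference, reconstruction, radius):
    """Fraction of reference points with a reconstructed point within `radius`."""
    diffs = reference[:, None, :] - reconstruction[None, :, :]
    nearest = np.linalg.norm(diffs, axis=-1).min(axis=1)
    return float(np.mean(nearest <= radius))

# Toy static shape: four points spaced one unit apart along the x-axis.
reference = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [2.0, 0.0, 0.0],
                      [3.0, 0.0, 0.0]])
frame_a = reference[:2]  # first partial view
frame_b = reference[2:]  # second partial view
single = completeness(reference, frame_a, radius=0.01)
fused = completeness(reference, np.concatenate([frame_a, frame_b]), radius=0.01)
```

In this toy case the score rises from 0.5 with one view to 1.0 once both views are combined, mirroring the qualitative trend described in the caption.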

BibTeX

@inproceedings{chen2026dynatok,
  title     = {DynaTok: Token-Based 4D Reconstruction from Partial Point Clouds},
  author    = {Chen, Weirong and Tateno, Keisuke and Matsuki, Hidenobu and
               Niemeyer, Michael and Cremers, Daniel and Tombari, Federico},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026}
}