World model based planning has significantly improved decision-making in complex environments by enabling agents to simulate future states and make informed choices. However, simulating those futures with large visual world models is computationally expensive, and this burden is particularly restrictive in robotics, where onboard resources are severely constrained.
We propose Sparse Imagination, which improves planning efficiency by using only a subset of visual tokens during latent rollouts. Our method trains a transformer world model with randomized grouped attention so that it predicts robustly under dynamic token sparsity. Across simulation and real-world tasks, Sparse Imagination achieves substantial speedups while preserving control performance.
The world model predicts future DINO patch tokens conditioned on past observations and actions. During training, randomized grouped attention partitions tokens into random groups and applies structured attention masks, so the model learns to predict from arbitrary token subsets. During planning with the world model, we apply random token dropout with drop ratio p, which reduces attention cost and provides a direct trade-off between speed and accuracy.
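The two mechanisms above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the group count, tensor shapes, and function names are assumptions, and the mask here is a plain boolean matrix rather than a fused attention kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_group_mask(num_tokens, num_groups, rng):
    """Randomly assign each token to a group; attention is allowed only
    within a group (True = attend). Sketch of randomized grouped attention."""
    groups = rng.integers(0, num_groups, size=num_tokens)
    return groups[:, None] == groups[None, :]

def drop_tokens(tokens, p, rng):
    """Keep a uniform random subset of (1 - p) * N visual tokens
    for the latent rollout."""
    n = tokens.shape[1]
    keep = max(1, round((1 - p) * n))
    idx = rng.choice(n, size=keep, replace=False)
    return tokens[:, np.sort(idx)]

tokens = rng.standard_normal((2, 196, 384))   # (batch, DINO patch tokens, dim)
sparse = drop_tokens(tokens, p=0.5, rng=rng)  # 98 of 196 tokens survive
mask = random_group_mask(sparse.shape[1], num_groups=4, rng=rng)
```

Because self-attention cost is quadratic in the number of tokens, halving the kept tokens roughly quarters the attention FLOPs per rollout step.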
We evaluate on six simulation environments: PointMaze, Wall, PushT, and Block Pushing with MPC-CEM, and Granular and Rope with CEM. Moderate drop ratios (10-50%) preserve task performance while substantially reducing planning time.
| Method | Avg. Success (%) | Avg. Planning Time (s/iter) | Avg. Planning-Time Change (%) |
|---|---|---|---|
| Full | 69.8 | 183.3 | - |
| CLS | 50.7 | 71.0 | -61.3% |
| Drop 10% | 74.1 | 166.3 | -9.3% |
| Drop 20% | 69.6 | 150.0 | -18.1% |
| Drop 30% | 71.0 | 136.3 | -25.6% |
| Drop 40% | 68.6 | 119.8 | -34.7% |
| Drop 50% | 69.7 | 109.0 | -40.5% |
| Drop 60% | 63.6 | 99.8 | -45.6% |
| Drop 70% | 53.3 | 89.5 | -51.2% |
| Drop 80% | 54.3 | 81.5 | -55.5% |
| Drop 90% | 46.7 | 73.8 | -59.8% |
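The planners in the table follow the standard cross-entropy method (CEM) loop, where the world-model rollout inside the cost function dominates runtime; token dropout makes each of those rollouts cheaper. Below is a toy CEM sketch, not the paper's planner: the population size, elite count, and the quadratic stand-in cost are assumptions.

```python
import numpy as np

def cem_plan(rollout_cost, horizon=5, action_dim=2, iters=4,
             pop=64, elites=8, rng=None):
    """Cross-entropy method over action sequences. `rollout_cost`
    scores a (horizon, action_dim) plan; with token dropout, the world
    model inside it processes fewer tokens per step, so every CEM
    iteration gets cheaper."""
    rng = rng or np.random.default_rng(0)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate plans, keep the lowest-cost elites,
        # and refit the sampling distribution to them.
        plans = mu + sigma * rng.standard_normal((pop, horizon, action_dim))
        costs = np.array([rollout_cost(p) for p in plans])
        elite = plans[np.argsort(costs)[:elites]]
        mu, sigma = elite.mean(0), elite.std(0) + 1e-6
    return mu

# Toy cost: cumulative action displacement should reach the target (1, 1).
target = np.array([1.0, 1.0])
best = cem_plan(lambda p: float(np.sum((p.sum(0) - target) ** 2)))
```

In MPC, only the first action of `best` would be executed before replanning from the new observation.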
For policy-guided planning in long-horizon tasks, including real-world tasks (Pick-and-Place and Close-Drawer) and LIBERO-10, Sparse Imagination with a 50% token drop consistently matches full-patch performance with much lower planning overhead.
Figure: performance vs. inference time in real-world tasks and LIBERO-10, comparing Drop 50% with Full Patch.
Across token reduction baselines, random sampling remains highly competitive and often best on average. We attribute this to unbiased spatial coverage and reduced blind-spot risk in dynamic planning.
| Method Family | Representative | Avg. Success (%) |
|---|---|---|
| Random Sampling | Random | 66.7 |
| | Fixed | 64.3 |
| | LHS | 65.7 |
| Learning-Based Pruning | LTRP | 59.5 |
| Attention-Based Pruning | Attention-Encoder | 63.0 |
| | STAR | 61.4 |
| | Attention-WM | 64.0 |
| Cluster and Merging | ATC | 41.7 |
We further verify that sparse token subsets retain useful state information: nHSIC remains high even under substantial dropout, and attentive probing shows that random token subsets retain a strong predictive signal. Notably, even a single random token can match the CLS token in probing performance.
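The nHSIC analysis can be reproduced with a standard kernel estimator. This is a minimal sketch under assumptions: the paper does not specify its kernel, so the Gaussian kernel with a median-heuristic bandwidth and the synthetic data below are illustrative choices, not the paper's setup.

```python
import numpy as np

def gaussian_gram(x):
    """Gaussian-kernel Gram matrix with median-heuristic bandwidth."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / np.median(d2[d2 > 0]))

def nhsic(x, y):
    """Normalized HSIC between two sample matrices (rows = samples):
    HSIC(X, Y) / sqrt(HSIC(X, X) * HSIC(Y, Y)), in [0, 1]."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    kx = h @ gaussian_gram(x) @ h
    ky = h @ gaussian_gram(y) @ h
    num = np.trace(kx @ ky)
    return num / np.sqrt(np.trace(kx @ kx) * np.trace(ky @ ky))

rng = np.random.default_rng(0)
states = rng.standard_normal((64, 4))            # environment states
tokens = states @ rng.standard_normal((4, 16))   # tokens carrying state info
noise = rng.standard_normal((64, 16))            # state-independent tokens
```

Here `nhsic(tokens, states)` comes out well above `nhsic(noise, states)`, mirroring the check that surviving token subsets stay dependent on the true environment state.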
Figure: nHSIC between visual tokens and environment states.
Figure: attentive probing validation loss under token dropout.
Sparse imagination reduces world-model planning cost by dropping visual tokens at inference while preserving performance across simulation and real-world tasks. The method is simple, robust, and broadly compatible with transformer-based visual planners.
@inproceedings{chun2026sparseimagination,
  title     = {Sparse Imagination for Efficient Visual World Model Planning},
  author    = {Junha Chun and Youngjoon Jeong and Taesup Kim},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}