World model based planning has significantly improved decision-making in complex environments by enabling agents to simulate future states and make informed choices. However, simulating those futures with large visual world models is computationally expensive, and this burden is particularly restrictive in robotics, where onboard resources are severely constrained.
We propose Sparse Imagination, which improves planning efficiency by using only a subset of visual tokens during latent rollouts. Our method trains a transformer world model with randomized grouped attention so that it predicts robustly under dynamic token sparsity. Across simulation and real-world tasks, Sparse Imagination achieves substantial speedups while preserving control performance.
The world model predicts future DINO patch tokens conditioned on past observations and actions. During training, randomized grouped attention partitions tokens into random groups and applies structured attention masks, so the model learns to predict from arbitrary token subsets. During planning with the world model, we apply random token dropout with drop ratio p, which reduces attention cost and provides a direct trade-off between speed and accuracy.
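The two mechanisms above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the group count, tensor shapes, and function names are assumptions, and the mask here is a plain boolean matrix rather than a fused attention kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_group_mask(num_tokens, num_groups, rng):
    """Randomly assign each token to a group; attention is allowed only
    within a group (True = attend). Sketch of randomized grouped attention."""
    groups = rng.integers(0, num_groups, size=num_tokens)
    return groups[:, None] == groups[None, :]

def drop_tokens(tokens, p, rng):
    """Keep a uniform random subset of (1 - p) * N visual tokens
    for the latent rollout."""
    n = tokens.shape[1]
    keep = max(1, round((1 - p) * n))
    idx = rng.choice(n, size=keep, replace=False)
    return tokens[:, np.sort(idx)]

tokens = rng.standard_normal((2, 196, 384))   # (batch, DINO patch tokens, dim)
sparse = drop_tokens(tokens, p=0.5, rng=rng)  # 98 of 196 tokens survive
mask = random_group_mask(sparse.shape[1], num_groups=4, rng=rng)
```

Because self-attention cost is quadratic in the number of tokens, halving the kept tokens roughly quarters the attention FLOPs per rollout step.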
We evaluate on six simulation environments: PointMaze, Wall, PushT, and Block Pushing with MPC-CEM, and Granular and Rope with CEM. Moderate drop ratios (10-50%) preserve task performance while substantially reducing planning time.
| Method | Avg. Success (%) | Avg. Planning Time (s/iter) | Avg. Planning-Time Change (%) |
|---|---|---|---|
| Full | 69.8 | 183.3 | - |
| CLS | 50.7 | 71.0 | -61.3% |
| Drop 10% | 74.1 | 166.3 | -9.3% |
| Drop 20% | 69.6 | 150.0 | -18.1% |
| Drop 30% | 71.0 | 136.3 | -25.6% |
| Drop 40% | 68.6 | 119.8 | -34.7% |
| Drop 50% | 69.7 | 109.0 | -40.5% |
| Drop 60% | 63.6 | 99.8 | -45.6% |
| Drop 70% | 53.3 | 89.5 | -51.2% |
| Drop 80% | 54.3 | 81.5 | -55.5% |
| Drop 90% | 46.7 | 73.8 | -59.8% |
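The planners in the table follow the standard cross-entropy method (CEM) loop, where the world-model rollout inside the cost function dominates runtime; token dropout makes each of those rollouts cheaper. Below is a toy CEM sketch, not the paper's planner: the population size, elite count, and the quadratic stand-in cost are assumptions.

```python
import numpy as np

def cem_plan(rollout_cost, horizon=5, action_dim=2, iters=4,
             pop=64, elites=8, rng=None):
    """Cross-entropy method over action sequences. `rollout_cost`
    scores a (horizon, action_dim) plan; with token dropout, the world
    model inside it processes fewer tokens per step, so every CEM
    iteration gets cheaper."""
    rng = rng or np.random.default_rng(0)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate plans, keep the lowest-cost elites,
        # and refit the sampling distribution to them.
        plans = mu + sigma * rng.standard_normal((pop, horizon, action_dim))
        costs = np.array([rollout_cost(p) for p in plans])
        elite = plans[np.argsort(costs)[:elites]]
        mu, sigma = elite.mean(0), elite.std(0) + 1e-6
    return mu

# Toy cost: cumulative action displacement should reach the target (1, 1).
target = np.array([1.0, 1.0])
best = cem_plan(lambda p: float(np.sum((p.sum(0) - target) ** 2)))
```

In MPC, only the first action of `best` would be executed before replanning from the new observation.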
For policy-guided planning in long-horizon tasks, including real-world tasks (Pick-and-Place and Close-Drawer) and LIBERO-10, Sparse Imagination with a 50% token drop consistently matches full-patch performance with much lower planning overhead.
Figure: performance vs. inference time in real-world tasks and LIBERO-10, comparing Drop 50% with Full Patch.
Across token reduction baselines, random sampling remains highly competitive and often best on average. We attribute this to unbiased spatial coverage and reduced blind-spot risk in dynamic planning.
| Method Family | Representative | Avg. Success (%) |
|---|---|---|
| Random Sampling | Random | 66.7 |
| | Fixed | 64.3 |
| | LHS | 65.7 |
| Learning-Based Pruning | LTRP | 59.5 |
| Attention-Based Pruning | Attention-Encoder | 63.0 |
| | STAR | 61.4 |
| | Attention-WM | 64.0 |
| Cluster and Merging | ATC | 41.7 |
We further verify that sparse token subsets retain useful state information: nHSIC remains high even under substantial dropout, and attentive probing shows that random token subsets retain a strong predictive signal. Notably, even a single random token can match the CLS token in probing performance.
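The nHSIC analysis can be reproduced with a standard kernel estimator. This is a minimal sketch under assumptions: the paper does not specify its kernel, so the Gaussian kernel with a median-heuristic bandwidth and the synthetic data below are illustrative choices, not the paper's setup.

```python
import numpy as np

def gaussian_gram(x):
    """Gaussian-kernel Gram matrix with median-heuristic bandwidth."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / np.median(d2[d2 > 0]))

def nhsic(x, y):
    """Normalized HSIC between two sample matrices (rows = samples):
    HSIC(X, Y) / sqrt(HSIC(X, X) * HSIC(Y, Y)), in [0, 1]."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    kx = h @ gaussian_gram(x) @ h
    ky = h @ gaussian_gram(y) @ h
    num = np.trace(kx @ ky)
    return num / np.sqrt(np.trace(kx @ kx) * np.trace(ky @ ky))

rng = np.random.default_rng(0)
states = rng.standard_normal((64, 4))            # environment states
tokens = states @ rng.standard_normal((4, 16))   # tokens carrying state info
noise = rng.standard_normal((64, 16))            # state-independent tokens
```

Here `nhsic(tokens, states)` comes out well above `nhsic(noise, states)`, mirroring the check that surviving token subsets stay dependent on the true environment state.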
Figure: nHSIC between visual tokens and environment states.
Figure: attentive probing validation loss under token dropout.
Sparse imagination reduces world-model planning cost by dropping visual tokens at inference while preserving performance across simulation and real-world tasks. The method is simple, robust, and broadly compatible with transformer-based visual planners.
@inproceedings{chun2026sparseimagination,
  title     = {Sparse Imagination for Efficient Visual World Model Planning},
  author    = {Junha Chun and Youngjoon Jeong and Taesup Kim},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}