Real robot comparison frame showing latent future inference.

Robot learning project

LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

LaWAM gives robot policies physical foresight by predicting compact latent visual subgoals, not future pixels.

Jialei Chen, Kai Wang, Kang Chen, Shuaihang Chen, Feng Gao, Wenhao Tang, Zhiyuan Li, Weilin Liu, Zhuyu Yao, Boxun Li, Yuanbo Xu, Chao Yu

98.6% LIBERO average SR
91.22% RoboTwin average SR
187 ms per action chunk
Up to 24x lower latency than pixel WAMs

Why Latent Futures

Video imagination slows robot control.

Pixel-space world-action models can provide foresight, but they spend latency and model capacity on reconstructing visual detail. LaWAM moves future prediction into a frozen visual feature space, where a single latent subgoal captures the scene change needed for the next action chunk.

Pixel WAM: Iterative video generation

Dense future frames add redundant appearance modeling before the robot can act.

LaWAM: One latent subgoal

A compact future feature directly conditions action generation in one forward pass.

Method

Predict a latent subgoal, then act toward it.

LaWAM repurposes the decoder of a latent action model as a Latent World Model (LaWM), then feeds the predicted future feature into the action expert of a vision-language-action (VLA) policy.

Two-stage LaWAM pipeline overview.
Stage 1 learns LaWM from visual transitions. Stage 2 trains the policy to infer latent actions, decode latent visual subgoals, and generate subgoal-conditioned action chunks.

01

Learn LaWM

Encode current and horizon observations with a frozen visual encoder, infer latent actions, and train a decoder to predict future features.

02

Distill subgoals

Train the policy prior to predict latent actions that drive LaWM toward teacher latent subgoals derived from robot trajectories.

03

Generate actions

At test time, one latent world-model pass produces the subgoal used by the action expert for chunk-level control.
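
The three steps above can be sketched in a few lines of NumPy. This is only an illustrative toy under stated assumptions, not the authors' implementation: every function name, weight matrix, and dimension below is hypothetical, the networks are replaced by fixed random linear maps, both losses are assumed to be mean squared errors in feature space, and the action expert is a single linear map rather than a denoising model.

```python
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_FEAT, D_LATENT, D_ACT, CHUNK = 32, 16, 4, 7, 8  # illustrative sizes only

# Hypothetical stand-ins for the real networks: fixed random linear maps.
W_enc = rng.normal(size=(D_OBS, D_FEAT)) / np.sqrt(D_OBS)   # frozen visual encoder
W_idm = rng.normal(size=(2 * D_FEAT, D_LATENT)) * 0.1       # latent action inference
W_dec = rng.normal(size=(D_FEAT + D_LATENT, D_FEAT)) * 0.1  # LaWM decoder
W_prior = rng.normal(size=(D_FEAT, D_LATENT)) * 0.1         # policy prior
W_act = rng.normal(size=(2 * D_FEAT, CHUNK * D_ACT)) * 0.1  # action expert


def encode(obs):
    """Frozen visual encoder: observation -> feature."""
    return np.tanh(obs @ W_enc)


def infer_latent_action(feat_now, feat_future):
    """Infer a latent action from a (current, horizon) feature pair."""
    return np.concatenate([feat_now, feat_future]) @ W_idm


def lawm_decode(feat_now, latent_action):
    """LaWM: predict the future feature (latent subgoal)."""
    return feat_now + np.concatenate([feat_now, latent_action]) @ W_dec


def stage1_loss(obs_now, obs_future):
    """Step 01 (assumed MSE): decoded feature should match the real future feature."""
    f_now, f_future = encode(obs_now), encode(obs_future)
    z = infer_latent_action(f_now, f_future)
    return float(np.mean((lawm_decode(f_now, z) - f_future) ** 2))


def distill_loss(obs_now, obs_future):
    """Step 02 (assumed MSE): the prior's latent action should reach the teacher subgoal."""
    f_now, f_future = encode(obs_now), encode(obs_future)
    teacher = lawm_decode(f_now, infer_latent_action(f_now, f_future))
    student = lawm_decode(f_now, f_now @ W_prior)  # prior sees only the present
    return float(np.mean((student - teacher) ** 2))


def act(obs_now):
    """Step 03: one LaWM pass gives the subgoal that conditions the action chunk."""
    f_now = encode(obs_now)
    subgoal = lawm_decode(f_now, f_now @ W_prior)
    chunk = np.concatenate([f_now, subgoal]) @ W_act
    return chunk.reshape(CHUNK, D_ACT), subgoal


obs_t, obs_tH = rng.normal(size=D_OBS), rng.normal(size=D_OBS)
actions, subgoal = act(obs_t)
print(actions.shape, subgoal.shape)  # (8, 7) (16,)
```

The point of the sketch is the data flow: the future observation is needed only for the two training losses, while `act` uses the current observation alone and calls the world model exactly once per chunk.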

Three-Minute Overview

Watch the LaWAM project video.

Results

High success with low-latency latent prediction.

Across simulated and physical manipulation tasks, LaWAM keeps the predictive benefits of world-action modeling while avoiding the cost of pixel-space rollouts.

LIBERO latency and success trade-off chart.
LIBERO latency-success trade-off for 10 denoising steps.

LIBERO Benchmark

Method          Model size   Latency    Avg. SR (%)
pi0.5           3.5B         220 ms     96.9
Cosmos-Policy   2.1B         1413 ms    98.5
LingBot-VA      5.5B         4482 ms    98.5
LaWAM           2.3B         187 ms     98.6
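
As a quick check on the latency figures, the ratios can be recomputed from the table. The snippet below uses only the latencies listed above; the dictionary and variable names are illustrative. The 24x headline figure appears to correspond to the LingBot-VA row, which is an inference from the numbers rather than something the page states.

```python
# Latencies in milliseconds, copied from the LIBERO table.
latency_ms = {"pi0.5": 220, "Cosmos-Policy": 1413, "LingBot-VA": 4482, "LaWAM": 187}

# Ratio of each baseline's latency to LaWAM's latency.
speedup = {
    name: ms / latency_ms["LaWAM"]
    for name, ms in latency_ms.items()
    if name != "LaWAM"
}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.1f}x")
# pi0.5: 1.2x
# Cosmos-Policy: 7.6x
# LingBot-VA: 24.0x
```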

RoboTwin

Strong bimanual generalization over 50 manipulation tasks with 100 trials per task.

Clean: 92.64%
Randomized: 89.80%
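
The 91.22% RoboTwin average quoted at the top of the page is consistent with an unweighted mean of the two settings. The check below assumes equal weighting, which the page does not state explicitly; the names are illustrative.

```python
# Success rates (%) for the two RoboTwin settings reported above.
robotwin_sr = {"clean": 92.64, "randomized": 89.80}

# Unweighted mean over the two settings (equal weighting is an assumption).
average_sr = sum(robotwin_sr.values()) / len(robotwin_sr)
print(f"{average_sr:.2f}")  # 91.22
```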

Real-World Transfer

Ranks first across pick-and-place, drawer opening, and towel folding, with 30 physical trials per task.

Pick-place: 93.3%
Drawer: 86.7%
Towel: 90.0%
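
With 30 trials per task, each reported percentage maps to a whole number of successes. The counts below are inferred from the percentages and are not stated on the page; the variable names are illustrative.

```python
TRIALS = 30  # physical trials per task, as stated above

# Reported real-world success rates (%).
reported = {"pick-place": 93.3, "drawer": 86.7, "towel": 90.0}

# Nearest whole number of successes implied by each percentage.
successes = {task: round(sr / 100 * TRIALS) for task, sr in reported.items()}

for task, count in successes.items():
    # Confirm the inferred count reproduces the reported rate to one decimal.
    assert round(100 * count / TRIALS, 1) == reported[task]
    print(f"{task}: {count}/{TRIALS}")
# pick-place: 28/30
# drawer: 26/30
# towel: 27/30
```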
LIBERO chunk execution with latent subgoal heatmaps.
Subgoal-guided chunk execution on LIBERO.
Representative real-world robot rollouts.
Representative real-world rollouts across two robot platforms.

Dynamics Analysis

Shared latent transitions are grounded across embodiments.

Applying the same latent action trajectory to different initial observations produces context-specific latent rollouts, suggesting that LaWM grounds abstract transitions in the current embodiment and scene.
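
The open-loop experiment described above can be illustrated with a toy world model. This is a sketch under assumptions, not the LaWM architecture: `lawm_step`, the random weights, and all dimensions are hypothetical stand-ins. It shows only the mechanism, namely that a transition function conditioned on both the current feature and the latent action turns one shared action sequence into different feature-space rollouts for different starting observations.

```python
import numpy as np

rng = np.random.default_rng(0)
D_FEAT, D_LATENT, STEPS = 16, 4, 5  # illustrative sizes only

# Toy stand-in for LaWM weights (hypothetical, random).
W_dec = rng.normal(size=(D_FEAT + D_LATENT, D_FEAT)) * 0.1


def lawm_step(feat, latent_action):
    """One latent transition: the update depends on the feature and the action."""
    return feat + np.tanh(np.concatenate([feat, latent_action]) @ W_dec)


def open_loop_rollout(feat0, latent_actions):
    """Apply a fixed latent action sequence without observing new frames."""
    feats = [feat0]
    for z in latent_actions:
        feats.append(lawm_step(feats[-1], z))
    return np.stack(feats)


shared_actions = rng.normal(size=(STEPS, D_LATENT))  # same actions for both contexts
feat_a = rng.normal(size=D_FEAT)  # e.g. initial feature from embodiment A
feat_b = rng.normal(size=D_FEAT)  # e.g. initial feature from embodiment B

rollout_a = open_loop_rollout(feat_a, shared_actions)
rollout_b = open_loop_rollout(feat_b, shared_actions)
print(rollout_a.shape)  # (6, 16)

# Net feature change produced by the same actions in each context.
delta_a = rollout_a[-1] - feat_a
delta_b = rollout_b[-1] - feat_b
same_change = np.allclose(delta_a, delta_b)
```

In this toy, `same_change` comes out false because each update depends on the current feature, so the same actions move the two contexts differently. That is the qualitative behavior the analysis attributes to LaWM.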

Cross-embodiment open-loop LaWM rollouts.
Open-loop LaWM rollouts from shared latent actions across embodiments.
Real-world inference video from the project slides.

Citation

Paper and citation.

@misc{chen2026lawam,
  title  = {LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies},
  author = {Chen, Jialei and Wang, Kai and Chen, Kang and Chen, Shuaihang and Gao, Feng and Tang, Wenhao and Li, Zhiyuan and Liu, Weilin and Yao, Zhuyu and Li, Boxun and Xu, Yuanbo and Yu, Chao},
  year   = {2026},
  note   = {Manuscript in preparation}
}