iMaC

Abstract

Embodied world models promise to serve as real-world simulators for robot policy evaluation and closed-loop rollout, but their reliability depends on how precisely they condition future video prediction on robot actions. Existing action-conditioned video models often encode future actions as compact vectors, leaving the model to infer fine-grained spatial consequences indirectly. iMaC addresses this by translating future actions into dense image-like controls. It uses robot URDF and forward kinematics to render future robot observations as motion images, and uses RGB-D geometry to construct two-stream contact images that describe robot-scene spatial relations. These controls guide future-video prediction while preserving a scalable image-to-video modeling interface.
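To make the action-to-image idea concrete, here is a minimal sketch of how a joint-action sequence could become per-frame pixel keypoints. It is a toy under stated assumptions, not the iMaC renderer: a planar serial arm stands in for a URDF-described robot, the arm is placed at a fixed depth in front of a pinhole camera, and the names `planar_fk`, `project_pinhole`, and `motion_keypoints` are hypothetical. A real pipeline would parse the URDF, chain full 3D transforms, and rasterize the robot meshes rather than keypoints.

```python
import math


def planar_fk(joint_angles, link_lengths):
    """Forward kinematics of a planar serial arm (toy stand-in for URDF + FK).

    Returns the (x, y) position of the base, each joint, and the end effector.
    """
    x, y, theta = 0.0, 0.0, 0.0
    points = [(x, y)]
    for angle, length in zip(joint_angles, link_lengths):
        theta += angle                    # joint angles accumulate along the chain
        x += length * math.cos(theta)
        y += length * math.sin(theta)
        points.append((x, y))
    return points


def project_pinhole(point_xyz, fx, fy, cx, cy):
    """Project a 3D camera-frame point to pixel coordinates (pinhole model)."""
    x, y, z = point_xyz
    return (fx * x / z + cx, fy * y / z + cy)


def motion_keypoints(action_sequence, link_lengths, depth, fx, fy, cx, cy):
    """Map each future joint action to the arm's projected pixel keypoints.

    Assumption: the arm lies in a plane at constant `depth` from the camera.
    """
    frames = []
    for joint_angles in action_sequence:
        arm = planar_fk(joint_angles, link_lengths)
        frames.append(
            [project_pinhole((x, y, depth), fx, fy, cx, cy) for x, y in arm]
        )
    return frames
```

Each entry of the returned list corresponds to one future frame, so drawing the keypoints (or, in a full system, the rendered robot) frame by frame gives an image sequence aligned with the action sequence.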

Pipeline

Overview of iMaC translating future robot actions into motion images and contact images for action-conditioned future-video prediction.

Given a reference RGB observation and a future action sequence, iMaC converts joint actions into future robot configurations using the robot's URDF and forward kinematics. Rendering the robot in these configurations yields the motion-image controls. Auxiliary depth prediction and point-cloud geometry yield the contact-image controls, which encode two distances: from the current scene to the future gripper, and from the future robot to the current scene. Both kinds of control are injected into the future-video world model as image and video conditions.
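The two contact streams can be illustrated with a small nearest-distance sketch. This is one plausible reading, not the authors' implementation: the scene is lifted from a depth map with pinhole intrinsics, distances are brute-force Euclidean nearest-neighbour distances, and the function names and the per-pixel layout of the scene-to-gripper map are assumptions made for illustration.

```python
import math


def backproject(depth, fx, fy, cx, cy):
    """Lift an H x W depth map to per-pixel 3D points in the camera frame."""
    return [
        [((u - cx) * z / fx, (v - cy) * z / fy, z) for u, z in enumerate(row)]
        for v, row in enumerate(depth)
    ]


def nearest_distance(point, cloud):
    """Euclidean distance from `point` to its nearest neighbour in `cloud`."""
    return min(math.dist(point, q) for q in cloud)


def scene_to_gripper_image(depth, gripper_points, intrinsics):
    """Per-pixel distance from the current scene to the future gripper.

    Assumption: the result is laid out as an H x W distance map.
    """
    fx, fy, cx, cy = intrinsics
    scene = backproject(depth, fx, fy, cx, cy)
    return [[nearest_distance(p, gripper_points) for p in row] for row in scene]


def robot_to_scene_distances(robot_points, depth, intrinsics):
    """Distance from each future robot point to the current scene point cloud."""
    fx, fy, cx, cy = intrinsics
    scene = [p for row in backproject(depth, fx, fy, cx, cy) for p in row]
    return [nearest_distance(p, scene) for p in robot_points]
```

The brute-force search is quadratic in the number of points; a spatial index such as a k-d tree would be the usual replacement at real image resolutions. The robot-to-scene distances would still need to be painted onto the rendered robot pixels to form an image, a step this sketch omits.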

Generated Results and Conditions

[Video grid. Columns: Head View, Left Wrist View, Right Wrist View. Rows: Generated Results (RGB Video); Condition Video (Motion Image); Condition Video (Contact Image: Scene-to-Gripper); Condition Video (Contact Image: Robot-to-Scene).]

[Video grid: videos/task_05_grid.mp4, same layout.]

[Video grid: videos/task_07_grid.mp4, same layout.]

Each task shows the generated RGB rollout together with its action-derived controls: URDF/FK motion images, scene-to-gripper contact images, and robot-to-scene contact images across the three camera views.

Policy Evaluation

Correlation between normalized real-world success rate and normalized iMaC world-model success rate across eight tasks.

iMaC is designed for evaluating robot policies through closed-loop rollout in a learned real-world simulator. The correlation measures whether world-model scores track real-world policy performance, both across different model families and across checkpoints of the same model, which supports model comparison as well as iterative checkpoint selection.
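A correlation of this kind can be computed as sketched below. The page does not state which normalization or correlation coefficient is used, so min-max normalization and the Pearson coefficient are assumptions here, and the function names are hypothetical.

```python
import math


def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Given per-policy success rates measured on the real robot and in the world model, `pearson(min_max_normalize(real), min_max_normalize(simulated))` returns a value in [-1, 1]. Pearson correlation is unchanged by min-max rescaling, so the normalization affects how the points are plotted rather than the coefficient itself; a rank correlation would be the alternative if only the ordering of policies matters.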

iMaC: Translating Actions into Motion and Contact Images for Embodied World Models

iMaC translates future robot actions into motion images and contact images, enabling embodied world models to predict contact-sensitive future videos for closed-loop policy evaluation.
