ICLR: In-Context Imitation Learning with
Visual Reasoning

Toan Nguyen1*, Weiduo Yuan1, Songlin Wei1, Hui Li2, Daniel Seita1†, Yue Wang1†

1 University of Southern California    2 Autodesk Research

† Co-advising    * Corresponding author

Paper (arXiv, coming soon) · Code (coming soon)
TL;DR — We introduce an in-context imitation learning method that learns not only to act but also to reason.

Abstract

In-context imitation learning enables robots to adapt to new tasks from a small number of demonstrations without additional parameter updates, but existing approaches typically condition only on state–action trajectories and lack explicit representations of task intent. This limitation hinders performance in complex and ambiguous task settings where the same actions may be consistent with different task intents. We present In-Context Imitation Learning with Visual Reasoning (ICLR), a framework that augments demonstration prompts with structured visual reasoning traces representing anticipated future robot trajectories in image space. Our method jointly learns to generate reasoning traces and low-level actions within a unified autoregressive transformer, enabling the model to mimic not only action prediction but also the reasoning process that leads to those actions. We extensively evaluate ICLR in both simulation and real-world manipulation tasks and demonstrate consistent improvements in success rates and generalization to unseen tasks and novel object configurations compared to other in-context imitation learning methods. These results suggest that incorporating embodied visual reasoning represents a promising direction for enhancing the robustness and generalization of robotic in-context learning systems.

Method Overview

Figure: Method overview.
(A) Reasoning trace generation: To generate the visual reasoning trace at a given time step, we uniformly sample five third-person-view images from that time step to the end of the trajectory and use Molmo to predict the gripper’s pixel location in each image.
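The sampling step in (A) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `point_fn` is a hypothetical stand-in for the VLM gripper-pointing query, and function names and shapes are assumptions.

```python
import numpy as np

def sample_reasoning_frames(t, traj_len, num_frames=5):
    """Uniformly sample frame indices from time step t to the end of the trajectory."""
    return np.linspace(t, traj_len - 1, num_frames).round().astype(int)

def build_reasoning_trace(frames, t, point_fn, num_frames=5):
    """Build a reasoning trace: the gripper's predicted pixel location (u, v)
    in each sampled future frame.

    point_fn(image) -> (u, v) stands in for the VLM pointing query; any
    callable with that signature works for this sketch.
    """
    idxs = sample_reasoning_frames(t, len(frames), num_frames)
    return [point_fn(frames[i]) for i in idxs]
```

For example, from time step 0 of a 100-frame trajectory, the five sampled indices span the remaining trajectory uniformly, and the trace is the list of five predicted pixel locations.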
(B) Encoders: Multi-view camera images and proprioceptive states are encoded to state tokens $f_s$ by a state encoder. Visual reasoning traces are embedded by a reasoning encoder to produce reasoning tokens $f_r$, and actions are embedded by an action encoder to produce action tokens $f_a$.
(C) Transformer: Modality-specific tokens are interleaved and fed into a causal transformer, which autoregressively predicts the next reasoning trace followed by the corresponding action. During training, teacher forcing is applied over the reasoning and action tokens. At inference time, the model first generates a reasoning trace and then produces the corresponding action in a closed-loop manner.

Quantitative Results

Simulation Results

Real-World Results

Qualitative Comparisons

All rollouts are shown at 1× speed.

Tomato to Grey Bowl

ICRT
Ours

Dumpling to Red Box

ICRT
Ours

Zebra to Blue Bowl

ICRT
Ours

Lion to Blue Bowl

ICRT
Ours

Potato to Grey Bowl

ICRT
Ours

Hedgehog to Red Box

ICRT
Ours

Poke Monkey

ICRT
Ours

Poke Hippo

ICRT
Ours

More Results