Stanford University
Robotic manipulation tasks exhibit varying memory requirements, ranging from Markovian tasks that require no memory to non-Markovian tasks that depend on historical information spanning single or multiple interaction trials. Surprisingly, simply extending the observation history of a visuomotor policy often leads to a significant performance drop due to distribution shift and overfitting.
Tool Hang [1]. The robot inserts a hook into a base and hangs a wrench on it. Precision-heavy, but every action is determined by the current frame, so history adds no signal.
Continuous Place Back. After placing a cup on a saucer, the robot returns it to within 5 cm of where it started. The start position is seen only once, so the policy must hold it in working memory across the motion.
Iterative Casting. The robot casts an object so that it stops between two green lines, given three attempts. Friction is hidden but fixed within an episode, so the policy infers it from earlier outcomes and adjusts across trials via reference memory.
To address these issues, we propose Gated Memory Policy (GMP), a visuomotor policy that learns both when to recall memory and what to recall. To learn when to recall memory, GMP employs a learned memory gate mechanism that selectively activates history context only when necessary, improving robustness and reactivity. To learn what to recall efficiently, GMP introduces a lightweight cross-attention module that constructs effective latent memory representations. To further enhance robustness, GMP injects diffusion noise into historical actions, mitigating sensitivity to noisy or inaccurate histories during both training and inference.
On our proposed non-Markovian benchmark MemMimic, GMP achieves a 30.1% average improvement in success rate over long-history baselines (including Diffusion Policy with long history and the LSTM-based BC-RNN), while maintaining competitive performance on Markovian tasks in RoboMimic [1]. All code and data will be publicly available.
In cross-trial tasks, a physical parameter (such as surface friction or object mass) is sampled per episode and hidden from the policy. The robot is given multiple trials and must leverage outcomes from earlier attempts to adapt. GMP's memory gate selectively activates history context only when it is informative, enabling robust cross-trial adaptation without overfitting to irrelevant past observations.
The robot needs to cast an object with an unknown friction coefficient so that it stops sliding between two green lines. The arm stops at the same position in every trial and adaptively adjusts only its casting speed. In each episode, the robot is allowed 3 casting attempts; the episode is considered successful if the last 2 trials succeed.
In each episode, the friction coefficient of the cube is randomly sampled from 0.005 to 0.015 and is unknown during testing. The policy is given 6 trials to push the object to the target region (red box). An episode is considered successful if the cube stops within the target region in all of the last 3 trials.
The robot must fling the cloth so that its far edge lands in the target (black) area. In each episode, the cloth's mass is randomly sampled between 0.1 kg and 2.0 kg, and the robot is allowed five flings. Flinging too slowly prevents the cloth from fully extending, while flinging too fast causes it to fold back on itself; both result in failure. An episode is considered successful if the last 3 trials are all successful (the cloth's farthest edge lands within the black area).
In-trial tasks require the robot to recall a cue observed earlier within a single episode (such as an object's starting position or a bin's color). The relevant context appears only once; the policy must encode and retain it until a later decision point. GMP's lightweight cross-attention module constructs an effective latent memory on demand, gating out irrelevant history to stay focused.
A cup and a saucer are randomly placed on the table. The robot must first place the cup on the saucer and then return it to within 5 cm of its original position. This task tests the policy's spatial memory and robustness to real-world noise.
The robot picks up a cube from one of 4 bins with randomly assigned colors. Once the cube is picked up, the bin colors are shuffled, and the robot needs to put the cube back in the bin with the original color. This task tests the policy's visual memory.
A cube is randomly placed in one of four bins, and the robot picks it up, holds it in the air for 2 seconds before returning it to the original bin. This task tests the policy's spatial memory.
To stress-test robustness in the wild, we deploy GMP outdoors on Continuous Place Back, one of our in-trial memory tasks. The robot places the cup on the saucer and returns it to within 5 cm of where it started. Because the start position is visible only once, at the beginning of the episode, and leaves the camera view the moment the cup is lifted, the policy must retain the start location in memory for the entire pick-and-place. We show two variants side by side: a clean version on the left, and a perturbed version on the right in which a human tips the cup over by ~90 degrees mid-task, forcing the policy to right the cup and still find its way back. Trials span backgrounds, lighting, surfaces, and cup placements never seen in training, collected and deployed via iPhUMI. All clips play at 1× real-time speed.
GMP decomposes the memory problem into two coupled decisions: when to read history, and what to read. A binary memory gate μt, conditioned on the current observation, controls the former; when the gate is active, a cross-attention block reads from a cache of per-frame summary tokens to produce the latter. A third component augments cached action traces with diffusion noise one step cleaner than the prediction target, stabilizing predictions under imperfect history at deployment.
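To make the decomposition concrete, here is a minimal PyTorch sketch of the gate-then-read control flow. All module and tensor names are illustrative; the diffusion action head and the noise-augmentation component are omitted, so this is a sketch of the structure rather than the released implementation.

```python
import torch
import torch.nn as nn

class GatedMemoryRead(nn.Module):
    """Illustrative sketch: a binary gate decides *when* to read history;
    cross-attention decides *what* to read from the cached tokens."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Gate mu_t is conditioned on the current observation only.
        self.gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
        # Current-step query attends to per-frame summary tokens in the cache.
        self.read = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, obs_tok: torch.Tensor, memory_cache: torch.Tensor):
        # obs_tok:      (B, 1, d) summary token of the current observation
        # memory_cache: (B, T, d) cached per-frame visual + action tokens
        mu = (self.gate(obs_tok.squeeze(1)) > 0.5).float()   # (B, 1) hard gate
        if not bool(mu.any()):
            return obs_tok          # gate closed: skip history, constant cost
        mem, _ = self.read(obs_tok, memory_cache, memory_cache)
        return obs_tok + mu.unsqueeze(-1) * mem              # gated memory read
```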
A Diffusion Transformer backbone with a cross-attention module for history conditioning. Each past frame is summarized as one aggregated visual token (via attention pooling over a pretrained SigLIP2-B/16 ViT) paired with its action tokens; these are cached across denoising steps, so inference cost stays linear in history length. When μt=0, history attention is skipped entirely and cost is constant.
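As a sketch of how such a cache might be built: the snippet below pools each frame's patch tokens into a single summary token via learned attention pooling over a frozen encoder, and stores it next to that frame's action tokens. Here `vision_encoder` is a stand-in for the frozen SigLIP2-B/16 ViT, and the pooling head and cache layout are assumptions for illustration, not the released code.

```python
import torch
import torch.nn as nn

class HistoryCache(nn.Module):
    """Sketch: one pooled visual token per past frame, paired with action tokens."""

    def __init__(self, vision_encoder: nn.Module, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.encoder = vision_encoder.eval()          # frozen pretrained ViT
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        # Learned attention pooling: one query summarizes all patch tokens.
        self.pool_query = nn.Parameter(torch.randn(1, 1, d_model))
        self.pool = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.tokens: list[torch.Tensor] = []          # one entry per past frame

    def append(self, frame: torch.Tensor, action_tok: torch.Tensor) -> None:
        with torch.no_grad():                         # encoder stays frozen
            patches = self.encoder(frame)             # (B, P, d) patch tokens
        q = self.pool_query.expand(frame.shape[0], -1, -1)
        pooled, _ = self.pool(q, patches, patches)    # (B, 1, d) summary token
        self.tokens.append(torch.cat([pooled, action_tok], dim=1))

    def read(self) -> torch.Tensor:
        # Reused across all denoising steps: cost stays linear in history length.
        return torch.cat(self.tokens, dim=1)          # (B, T*(1+A), d)
```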
Training the gate jointly with the policy tends to collapse into one of two failure modes: the gate stays always on and the policy overfits on Markovian tasks, or the gate stays always off and the policy fails on non-Markovian ones. To avoid both, we calibrate the gate offline on a held-out validation set.
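The exact calibration criterion is not spelled out on this page, so the following is only a plausible sketch: sweep a gate threshold on held-out validation episodes and keep the one that minimizes action-prediction error. `policy.gate_score` and `policy.predict` are hypothetical accessors, not part of the released API.

```python
import numpy as np

def calibrate_gate(policy, val_episodes, thresholds=np.linspace(0.1, 0.9, 9)):
    """Pick a gate threshold offline by sweeping it on held-out validation data."""
    best_thr, best_err = None, float("inf")
    for thr in thresholds:
        errs = []
        for episode in val_episodes:
            for obs, history, target_action in episode:
                use_memory = policy.gate_score(obs) > thr   # hypothetical accessor
                pred = policy.predict(obs, history if use_memory else None)
                errs.append(float(((pred - target_action) ** 2).mean()))
        mean_err = float(np.mean(errs))
        if mean_err < best_err:
            best_thr, best_err = thr, mean_err
    return best_thr
```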
We visualize causal cross-attention weights to understand which past timesteps the policy attends to when computing the next action. Across tasks, attention concentrates on frames that carry task-relevant information (the moment a cue was first observed, or the outcome of a prior trial) rather than spreading uniformly across the available history.
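A minimal sketch of such a visualization, assuming the per-step weights have already been averaged over heads (as nn.MultiheadAttention returns them by default):

```python
import matplotlib.pyplot as plt

def plot_history_attention(attn_weights, out_path="history_attention.png"):
    """Bar plot of the current step's attention over cached history frames.

    attn_weights: 1-D tensor/array of length T, the attention mass the
    current-step query assigns to each past frame.
    """
    w = attn_weights.detach().cpu().numpy() if hasattr(attn_weights, "detach") else attn_weights
    plt.figure(figsize=(8, 2))
    plt.bar(range(len(w)), w)
    plt.xlabel("history timestep")
    plt.ylabel("attention weight")
    plt.tight_layout()
    plt.savefig(out_path)
```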
We evaluate on 3 tasks from the RoboMimic [1] benchmark: Tool Hang, Square, and Transport. The suffixes (ph, mh) indicate whether the data were collected from proficient-human (ph) or multi-human (mh) demonstrators. While most long-history policies suffer performance drops on these Markovian tasks, GMP maintains competitive performance by leveraging its gating mechanism.
The memory gate helps when the policy significantly overfits to history observations. This typically happens in tasks that require high-precision manipulation and where there is not enough data to cover the expanded input space that memory introduces; a concrete example is Tool Hang in RoboMimic. The gate also helps when reactivity matters, since skipping history attention reduces runtime and makes the model more efficient. That said, for most of the memory-intensive tasks in our benchmark, adding the gate does not significantly improve performance. A practical approach is to first train the model without a gate (a simplified version of GMP), run evaluation, and then add the gate if the model fails for the reasons above.
We include only a single aggregated token per image from the ViT (e.g., the class token in CLIP or the MAP token in SigLIP2), and all history tokens are cached. The cross-attention module also scales linearly in history length. As a result, total inference time is similar to that of a policy without memory. We release an in-the-wild checkpoint so anyone can try it in the real world.
Please read through the README.md in each folder of the codebase carefully. The experiment details in the paper's appendix will also be helpful. If your problem is still not solved, feel free to submit a GitHub issue with a detailed description.
Thanks to the strong scalability of cross-attention, history length is bottlenecked only by GPU memory during training. By caching visual features and freezing the pretrained image encoder, our model can attend to 6,000 image frames (with paired actions), achieving memory recall of up to 10 minutes at 10 fps. Please refer to Finding 7 for more details.
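A quick back-of-envelope check of that horizon, with illustrative (not reported) token sizes:

```python
# 6,000 cached frames at 10 fps -> 10 minutes of recall.
fps, history_frames = 10, 6_000
print(history_frames / fps / 60)      # 10.0 minutes

# Cache footprint, assuming one 512-d fp16 visual token per frame plus
# (illustratively) 8 action tokens of the same width.
d_model, bytes_per_val, tokens_per_frame = 512, 2, 1 + 8
cache_mib = history_frames * tokens_per_frame * d_model * bytes_per_val / 2**20
print(f"{cache_mib:.1f} MiB")         # ~52.7 MiB: easily kept resident on GPU
```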
The robot may appear frozen for long stretches, which is expected: we are stress-testing the policy's memory duration. At the end, once the bin colors shuffle, the robot places the cube back in its original bin. Both clips play at real-time speed (10 fps); drag the progress bar to skim through the full episode.
We thank Austin Patel for developing the iPhUMI app, which enabled in-the-wild data collection and deployment; Chiling Han for assistance with the baseline experiments; and Mengda Xu and Huy Ha for guidance on MuJoCo simulation. We also thank members of REALab for feedback on the manuscript and presentation, and Yuejiang Liu, Moo Jin Kim, John Yao, Shang Yang, Genghan Zhang, and Dongchen Han for discussions during this work.
We thank TRI for providing the UR5 robot hardware and ARX for the X5 robot hardware. We thank Apple for providing the iPhone 15 Pro, and Stanford Marlowe for computational resources.
This work was supported in part by NSF Awards #2143601, #2037101, and #2132519. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.