Stanford University
Robot policies that blindly extend observation histories overfit and regress. GMP learns when and what to recall, achieving 30.1% higher success rates on non-Markovian manipulation while staying competitive on Markovian tasks.
Robotic manipulation tasks exhibit varying memory requirements, ranging from Markovian tasks that require no memory to non-Markovian tasks that depend on historical information spanning single or multiple interaction trials. Surprisingly, simply extending observation histories of a visuomotor policy often leads to a significant performance drop due to distribution shift and overfitting.
To address these issues, we propose Gated Memory Policy (GMP), a visuomotor policy that learns both when to recall memory and what to recall. To learn when to recall memory, GMP employs a learned memory gate mechanism that selectively activates history context only when necessary, improving robustness and reactivity. To learn what to recall efficiently, GMP introduces a lightweight cross-attention module that constructs effective latent memory representations. To further enhance robustness, GMP injects diffusion noise into historical actions, mitigating sensitivity to noisy or inaccurate histories during both training and inference.
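As a concrete illustration, here is a minimal NumPy sketch of one gated policy step: when the gate is off, the current observation feature passes through untouched; when it is on, a single-head cross-attention recalls a latent memory from cached history tokens. The function names, dimensions, and additive fusion are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_memory_step(obs_feat, history_tokens, gate):
    """One policy step (illustrative): apply history cross-attention only when gate == 1.

    obs_feat:       (d,)   current observation embedding (query)
    history_tokens: (T, d) cached embeddings of past observations (keys/values)
    gate:           the binary memory gate mu_t (0 or 1)
    """
    if gate == 0 or len(history_tokens) == 0:
        return obs_feat                               # gate off: Markovian step
    d = obs_feat.shape[0]
    scores = history_tokens @ obs_feat / np.sqrt(d)   # (T,) attention logits
    weights = softmax(scores)                         # attend over history
    recalled = weights @ history_tokens               # (d,) latent memory
    return obs_feat + recalled                        # fuse recalled memory
```

With `gate = 0` the history is never touched at all, which is what keeps the policy competitive on Markovian tasks.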
On our proposed non-Markovian benchmark MemMimic, GMP achieves a 30.1% average success rate improvement over long-history baselines, while maintaining competitive performance on Markovian tasks in RoboMimic. All code and data will be publicly available.
In cross-trial tasks, a physical parameter (such as surface friction or object mass) is randomly sampled per episode and hidden from the policy. The robot is given multiple trials and must leverage outcomes from earlier attempts to adapt. GMP's memory gate selectively activates history context only when it is informative, enabling robust cross-trial adaptation without overfitting to irrelevant past observations.
The robot must cast an object with an unknown friction coefficient so that it comes to rest between two green lines. The robot stops at the same position in each trial and adaptively adjusts its casting speed. In each episode, the robot is allowed 3 casting attempts; the episode is considered successful if the last 2 trials succeed.
In each episode, the friction coefficient of the cube is randomly sampled from 0.005 to 0.015 and is unknown during testing. The policy is given 6 trials to push the object to the target region (red box). An episode is considered successful if the cube stops within the target region in all of the last 3 trials.
The robot must fling the cloth so that its far edge lands in the target (black) area. In each episode, the cloth's mass is randomly sampled between 0.1 kg and 2.0 kg, and the robot is allowed five flings. Flinging too slowly prevents the cloth from fully extending, while flinging too fast causes it to fold back on itself—both resulting in failure. An episode is considered successful if the last 3 trials are all successful.
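All three cross-trial tasks share the same episodic success rule: the final k trials must all succeed (k = 2 for Casting, k = 3 for Pushing and Cloth Flinging). A tiny Python helper makes the criterion explicit; the function name is ours.

```python
def episode_success(trial_results, k):
    """Cross-trial success criterion: the last k trials must all succeed.

    trial_results: one bool per trial, in order (e.g. 3 casts, 6 pushes, 5 flings)
    k:             number of final trials required to succeed
    """
    return len(trial_results) >= k and all(trial_results[-k:])
```

For Casting, `episode_success([False, True, True], k=2)` is `True`: the first exploratory cast may fail as long as the policy adapts and the last two succeed.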
In-trial tasks require the robot to recall a cue observed earlier within a single episode (such as an object's starting position or a bin's color). The relevant context appears only once; the policy must encode and retain it to succeed at a later decision point. GMP's lightweight cross-attention module constructs effective latent memory on demand, gating out irrelevant history to stay focused.
A cup and a saucer are randomly placed on the table. The robot must first place the cup on the saucer and then return it to within 5 cm of its original position. This task tests the policy's spatial memory and robustness to real-world noise.
The robot picks up a cube from one of 4 bins with randomly assigned colors. Once the cube is picked up, the bin colors are shuffled, and the robot needs to put the cube back in the bin with the original color. This task tests the policy's visual memory.
A cube is randomly placed in one of four bins, and the robot picks it up, holds it in the air for 2 seconds before returning it to the original bin. This task tests the policy's spatial memory.
We run 40 outdoor Continuous Place Back trials end-to-end with the Gated Memory Policy, across backgrounds, lighting, surfaces, and cup placements never seen in training. Some trials include human perturbation: we flip the cup over by hand mid-task.
The gated attention module features three key designs: (1) a binary memory gate μ_t that determines whether the history cross-attention is skipped or applied; (2) a noised history-action condition that improves robustness and reduces overfitting; (3) history tokens cached at inference time to reduce computational cost.
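The caching and noising designs can be sketched together: a cache that encodes each observation once at inference time, plus a helper that corrupts the history-action condition. This is a minimal sketch under our own assumptions; the class and function names, and simple Gaussian corruption standing in for the diffusion noise schedule, are illustrative.

```python
import numpy as np

class HistoryTokenCache:
    """Cache history tokens: encode each observation once at inference time
    and reuse its token later, instead of re-encoding the full history."""

    def __init__(self, encoder):
        self.encoder = encoder        # maps a raw observation to a (d,) token
        self.tokens = []              # one cached token per past timestep

    def step(self, obs):
        self.tokens.append(self.encoder(obs))   # encode only the new obs
        return np.stack(self.tokens)            # (t, d) history for attention

def noised_action_condition(past_actions, sigma, rng):
    """Corrupt historical actions so the policy does not over-trust noisy
    or inaccurate histories (Gaussian noise as an illustrative stand-in)."""
    return past_actions + sigma * rng.normal(size=past_actions.shape)
```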
We split the dataset into D_train and D_val. (1) Train two policies on D_train: π with the memory gate always off, and π_mem with it always on. (2) Evaluate both policies on D_val for N rounds and compute the error of the predicted actions at each timestep t. (3) If the no-memory error δ_t is significantly larger than the memory error δ_t^mem at timestep t, label that timestep memory-required (μ_t = 1); otherwise μ_t = 0. We refer to this stage as calibration of the memory gate. (4) Freeze the memory gate weights and retrain the policy on the full dataset to obtain the final Gated Memory Policy π_gated.
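The labeling rule in step (3) reduces to a per-timestep comparison of validation errors. A minimal sketch, assuming the errors have already been averaged over the N validation rounds; the simple margin threshold standing in for the significance test is our assumption.

```python
import numpy as np

def calibrate_gate(err_no_mem, err_mem, margin=0.0):
    """Label mu_t = 1 where the no-memory policy's validation error exceeds
    the memory policy's by more than `margin`, else mu_t = 0.

    err_no_mem: (T,) per-timestep error of pi (gate always off)
    err_mem:    (T,) per-timestep error of pi_mem (gate always on)
    """
    gap = np.asarray(err_no_mem, float) - np.asarray(err_mem, float)
    return (gap > margin).astype(int)
```

For example, `calibrate_gate([0.5, 0.1], [0.1, 0.1], margin=0.2)` yields `[1, 0]`: memory clearly helps at the first timestep but not at the second.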
We visualize causal cross-attention weights to understand which past timesteps the policy attends to when computing the next action. The gated memory mechanism actively selects the most informative historical context, confirming that the policy learns structured, interpretable recall rather than attending uniformly across history.
We evaluate 3 tasks from the RoboMimic benchmark: Tool Hang, Square, and Transport. The suffixes (ph, mh) indicate whether the data were collected from proficient-human (ph) or multi-human (mh) demonstrators. While most long-history policies experience performance drops on these Markovian tasks, GMP maintains competitive performance by leveraging the gating mechanism.
Evaluation results on MIKASA-Robo. We evaluate 5 tasks from the MIKASA-Robo benchmark, outperforming the prior work MemoryVLA by 26.6% on average. Baseline performance numbers are taken from the MIKASA-Robo and MemoryVLA papers.
The gate is calibrated rather than hand-designed. We train two policies on D_train, one with history cross-attention always off (π) and one always on (π_mem), then roll both out on a held-out validation set D_val for N rounds. At each timestep t we compare their action-prediction errors: if the no-memory error δ_t is significantly larger than the memory error δ_t^mem, memory evidently helps there, so that timestep is labeled memory-required (μ_t = 1); otherwise μ_t = 0. These labels supervise the gate, whose weights are then frozen while the policy is retrained on the full dataset.
Extending the observation history shifts the input distribution the policy must fit: long histories contain many task-irrelevant details, and the policy readily latches onto spurious correlations in them, which hurts generalization even on Markovian tasks that need no memory at all. GMP avoids both failure modes. The binary gate keeps the policy Markovian at timesteps where history is uninformative, so irrelevant context never enters the computation, and the diffusion noise injected into historical actions prevents the policy from over-trusting noisy or inaccurate histories during both training and inference.
GMP is designed so that memory is cheap when unused. History tokens are cached at inference time, so each past observation is encoded once rather than re-processed at every step, and when the binary gate is off the history cross-attention is skipped entirely, leaving essentially the cost of a standard short-horizon visuomotor policy. The cross-attention module itself is lightweight, so even gate-on timesteps add little overhead.
The gated cross-attention module is a lightweight add-on: it attends from the policy's current latent features over cached history tokens, so in principle it can be attached to an existing visuomotor policy without changing its backbone. Training then requires the calibration stage described above: fit gate-off and gate-on variants of the base policy, label memory-required timesteps on held-out validation data, and retrain on the full dataset with the gate frozen, injecting diffusion noise into the historical action conditions along the way.
MemMimic is built to span the memory spectrum. Cross-trial tasks (Casting, Pushing, Cloth Flinging) hide a physical parameter such as friction or mass that must be inferred from earlier attempts, while in-trial tasks (Cup Place Back, Color Bins, Continuous Place Back) require recalling a cue observed earlier in the same episode. Each task fixes an explicit episodic protocol, including the number of trials allowed and the requirement that the final 2 or 3 trials succeed. Demonstrations, environment code, and evaluation scripts will be released so that other methods can be evaluated under the same protocol.
Yes. All code and data will be publicly available, and links to each artifact will be posted on this page upon release.
The authors would like to thank the many individuals and organizations whose contributions made this work possible. We are especially grateful to Austin Patel for developing the excellent iPhUMI app, which enabled in-the-wild data collection and deployment, to Chiling Han for assistance with the baseline experiments, and to Mengda Xu and Huy Ha for their guidance on MuJoCo simulation. We also thank all members of REALab for their thoughtful feedback on manuscript writing and presentation, along with Yuejiang Liu, Moo Jin Kim, John Yao, Shang Yang, Genghan Zhang, and Dongchen Han for insightful discussions throughout the course of this research.
We would like to thank TRI for providing the UR5 robot hardware and ARX for the X5 robot hardware. We thank Apple for providing the iPhone 15 Pro. We also thank Stanford Marlowe for providing computational resources.
This work was supported in part by NSF Awards #2143601, #2037101, and #2132519. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.