Gated Memory Policy

Yihuai Gao, Jinyun Liu, Shuang Li, Shuran Song†

Stanford University

Robot policies that blindly extend observation histories overfit and regress. GMP learns when and what to recall, achieving 30.1% higher success rates on non-Markovian manipulation while staying competitive on Markovian tasks.

Paper arXiv Code

Model

Dataset

Robotic manipulation tasks exhibit varying memory requirements, ranging from Markovian tasks that require no memory to non-Markovian tasks that depend on historical information spanning single or multiple interaction trials. Surprisingly, simply extending observation histories of a visuomotor policy often leads to a significant performance drop due to distribution shift and overfitting.

To address these issues, we propose Gated Memory Policy (GMP), a visuomotor policy that learns both when to recall memory and what to recall. To learn when to recall memory, GMP employs a learned memory gate mechanism that selectively activates history context only when necessary, improving robustness and reactivity. To learn what to recall efficiently, GMP introduces a lightweight cross-attention module that constructs effective latent memory representations. To further enhance robustness, GMP injects diffusion noise into historical actions, mitigating sensitivity to noisy or inaccurate histories during both training and inference.

On our proposed non-Markovian benchmark MemMimic, GMP achieves a 30.1% average success rate improvement over long-history baselines, while maintaining competitive performance on Markovian tasks in RoboMimic. All code and data will be publicly available.

Cross-Trial Memory Tasks

In cross-trial tasks, a physical parameter (such as surface friction or object mass) is randomly sampled per episode and hidden from the policy. The robot is given multiple trials and must leverage outcomes from earlier attempts to adapt. GMP's memory gate selectively activates history context only when it is informative, enabling robust cross-trial adaptation without overfitting to irrelevant past observations.

Iterative Casting (Real)

The robot needs to cast the object with an unknown friction coefficient so that the object stops sliding between two green lines. The robot stops at the same position in each trial and adaptively adjusts its casting speed. In each episode, the robot is allowed 3 casting attempts. The episode is considered successful if the last 2 trials are successful.

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Iterative Pushing (Sim)

In each episode, the friction coefficient of the cube is randomly sampled from 0.005 to 0.015 and is unknown during testing. The policy is given 6 trials to push the object to the target region (red box). An episode is considered successful if the cube stops within the target region in all of the last 3 trials.

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Iterative Flinging (Sim)

The robot must fling the cloth so that its far edge lands in the target (black) area. In each episode, the cloth's mass is randomly sampled between 0.1 kg and 2.0 kg, and the robot is allowed five flings. Flinging too slowly prevents the cloth from fully extending, while flinging too fast causes it to fold back on itself—both resulting in failure. An episode is considered successful if the last 3 trials are all successful.

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

In-Trial Memory Tasks

In-trial tasks require the robot to recall a cue observed earlier within a single episode (such as an object's starting position or a bin's color). The relevant context appears only once; the policy must encode and retain it to succeed at a later decision point. GMP's lightweight cross-attention module constructs effective latent memory on demand, gating out irrelevant history to stay focused.

Continuous Place Back (Real)

A cup and a saucer are randomly placed on the table. The robot must first place the cup on the saucer and then return it to within 5 cm of its original position. This task tests the policy's spatial memory and robustness to real-world noise.

Success rate bar chart — Place Back (Real)

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Show Initial Cup Position Distribution

Initial cup position distribution across all locations

Match Color (Sim)

The robot picks up a cube from one of 4 bins with randomly assigned colors. Once the cube is picked up, the bin colors are shuffled, and the robot needs to put the cube back in the bin with the original color. This task tests the policy's visual memory.

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Discrete Place Back (Sim)

A cube is randomly placed in one of four bins, and the robot picks it up, holds it in the air for 2 seconds before returning it to the original bin. This task tests the policy's spatial memory.

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Ours (Gated Memory)

Long-History DP

No-History DP

Method

Model Architecture

The gated attention module features three key designs: (1) Binary memory gate μ_t that determines whether history cross-attention is skipped or applied. (2) Noised history action condition to improve robustness and reduce overfitting. (3) Cached history tokens during inference time to reduce computational cost.

Gating Mechanism

We split the dataset into D_train and D_val. 1) Train two policies on D_train with memory gate always off π or always on π_mem. 2) Evaluate the two policies on D_val for N rounds and calculate the error of the predicted actions at each timestep t. 3) At timestep t, if the error of the no-memory policy δ_t is significantly larger than the error of the memory policy δ^mem_t, we label memory-required (μ_t=1) for this timestep; otherwise, μ_t=0. This stage is referred to as Calibration of the memory gate. 4) We freeze the memory gate weights and retrain the policy on the full dataset to obtain the final Gated Memory Policy π_gated.

What Is the Policy Attending To?

We visualize causal cross-attention weights to understand which past timesteps the policy attends to when computing the next action. The gated memory mechanism actively selects the most informative historical context, confirming that the policy learns structured, interpretable recall rather than attending uniformly across history.

Visualization of gated cross-attention weights for each 8-step action trajectory throughout one episode. Blue indicates the memory gate is on (μ_t=1) and gray indicates the memory gate is off (μ_t=0). During the pushing stage in Trial 4, attention focuses primarily on the outcomes from Trial 3 (undershoot) and Trial 2 (overshoot). The policy uses these past outcomes to adjust its action and achieve a successful push.

We visualize the gated cross-attention weights for each 8-step action chunk. At t=80, when the robot is placing the cube back, the attention focuses primarily on t=48, when the robot first observed the bin colors. Blue indicates the memory gate is on (μ_t=1); gray indicates the gate is off (μ_t=0).

Benchmark Performance

RoboMimic Results

RoboMimic benchmark success rates across tasks

We evaluate 3 tasks from the RoboMimic benchmarks: Tool Hang, Square, and Transport. (ph, mh) indicate that the data were collected from proficient-human (ph) or multi-human (mh). While most long-history policies experience performance drops on these Markovian tasks, GMP maintains competitive performance by leveraging the gating mechanism.

Mikasa-Robo Results

Mikasa-Robo benchmark success rates across 5 memory-dependent tasks

Evaluation results on MIKASA-Robo. We evaluate 5 tasks on the MIKASA-Robo benchmark, outperforming prior work MemoryVLA by 26.6% on average. The baseline performance statistics are reported in MIKASA-Robo and MemoryVLA.

Questions & Answers

How does GMP decide when to use memory?

[Your answer here. Explain the gating signal — what triggers μ_t=1 vs μ_t=0, and how the calibration phase produces these labels from a held-out validation set.]

Why does extending observation history hurt performance?

[Your answer here. Discuss distribution shift, overfitting to spurious correlations in long histories, and why GMP's gated approach avoids these failure modes.]

What is the inference-time overhead of GMP vs. a no-history policy?

[Your answer here. Mention the cached history tokens, the binary gate's near-zero cost when off, and any wall-clock numbers you want to share.]

Can GMP be applied to existing visuomotor policies (e.g., Diffusion Policy)?

[Your answer here. Describe what's needed to retrofit the gated cross-attention module onto an existing policy, and what training changes would be required.]

How was the MemMimic benchmark constructed and how can others use it?

[Your answer here. Outline the task selection criteria, dataset format, evaluation protocol, and how researchers can run their own methods on the benchmark.]

Will the code, datasets, and pretrained weights be released?

[Your answer here. Confirm the release plan, license, and where to find each artifact.]

Acknowledgments

Thanks

The authors would like to thank the many individuals and organizations whose contributions made this work possible. We are especially grateful to Austin Patel for developing the excellent iPhUMI app, which enabled in-the-wild data collection and deployment, to Chiling Han for assistance with the baseline experiments, and to Mengda Xu and Huy Ha for their guidance on MuJoCo simulation. We also thank all members of REALab for their thoughtful feedback on manuscript writing and presentation, along with Yuejiang Liu, Moo Jin Kim, John Yao, Shang Yang, Genghan Zhang, and Dongchen Han for insightful discussions throughout the course of this research.

We would like to thank TRI for providing the UR5 robot hardware and ARX for the X5 robot hardware. We thank Apple for providing the iPhone 15 Pro. We also thank Stanford Marlowe for providing computational resources.

This work was supported in part by NSF Awards #2143601, #2037101, and #2132519. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.

Gated Memory Policy

Cross-Trial Memory Tasks

Iterative Casting (Real)

Iterative Pushing (Sim)

Iterative Flinging (Sim)

In-Trial Memory Tasks

Continuous Place Back (Real)

Match Color (Sim)

Discrete Place Back (Sim)

In-the-Wild Results

Method

Model Architecture

Gating Mechanism

What Is the Policy Attending To?

Benchmark Performance

RoboMimic Results

Mikasa-Robo Results

Questions & Answers

Team

Thanks