Stanford University
Robotic manipulation tasks exhibit varying memory requirements, ranging from Markovian tasks that require no memory to non-Markovian tasks that depend on historical information spanning single or multiple interaction trials. Surprisingly, simply extending the observation history of a visuomotor policy often leads to a significant performance drop due to distribution shift and overfitting.
Tool Hang [1]. The robot inserts a hook into a base and hangs a wrench on it. Precision-heavy, but every action is determined by the current frame, so history adds no signal.
Continuous Place Back. After placing a cup on a saucer, the robot returns it to within 5 cm of where it started. The start position is seen only once, so the policy must hold it in working memory across the motion.
Iterative Casting. The robot casts an object so that it stops between two green lines, given three attempts. Friction is hidden but fixed within an episode, so the policy infers it from earlier outcomes and adjusts across trials via reference memory.
To address these issues, we propose Gated Memory Policy (GMP), a visuomotor policy that learns both when to recall memory and what to recall. To learn when to recall memory, GMP employs a learned memory gate mechanism that selectively activates history context only when necessary, improving robustness and reactivity. To learn what to recall efficiently, GMP introduces a lightweight cross-attention module that constructs effective latent memory representations. To further enhance robustness, GMP injects diffusion noise into historical actions, mitigating sensitivity to noisy or inaccurate histories during both training and inference.
On our proposed non-Markovian benchmark MemMimic, GMP achieves a 30.1% average improvement in success rate over long-history baselines (including Diffusion Policy with long history and the LSTM-based BC-RNN), while maintaining competitive performance on Markovian tasks in RoboMimic [1]. All code and data will be publicly available.
In cross-trial tasks, a physical parameter (such as surface friction or object mass) is sampled per episode and hidden from the policy. The robot is given multiple trials and must leverage outcomes from earlier attempts to adapt. GMP's memory gate selectively activates history context only when it is informative, enabling robust cross-trial adaptation without overfitting to irrelevant past observations.
The robot needs to cast an object with an unknown friction coefficient so that it stops sliding between two green lines. The arm stops at the same position in every trial and adaptively adjusts only its casting speed. In each episode, the robot is allowed 3 casting attempts; the episode is considered successful if the last 2 trials succeed.
In each episode, the friction coefficient of the cube is randomly sampled from 0.005 to 0.015 and is unknown during testing. The policy is given 6 trials to push the object to the target region (red box). An episode is considered successful if the cube stops within the target region in all of the last 3 trials.
The robot must fling the cloth so that its far edge lands in the target (black) area. In each episode, the cloth's mass is randomly sampled between 0.1 kg and 2.0 kg, and the robot is allowed five flings. Flinging too slowly prevents the cloth from fully extending, while flinging too fast causes it to fold back on itself; both result in failure. An episode is considered successful if the last 3 trials are all successful (the cloth's farthest edge lands within the black area).
In-trial tasks require the robot to recall a cue observed earlier within a single episode (such as an object's starting position or a bin's color). The relevant context appears only once; the policy must encode and retain it until a later decision point. GMP's lightweight cross-attention module constructs an effective latent memory on demand, gating out irrelevant history to stay focused.
A cup and a saucer are randomly placed on the table. The robot must first place the cup on the saucer and then return it to within 5 cm of its original position. This task tests the policy's spatial memory and robustness to real-world noise.
The robot picks up a cube from one of 4 bins with randomly assigned colors. Once the cube is picked up, the bin colors are shuffled, and the robot needs to put the cube back in the bin with the original color. This task tests the policy's visual memory.
A cube is randomly placed in one of four bins, and the robot picks it up, holds it in the air for 2 seconds before returning it to the original bin. This task tests the policy's spatial memory.
To stress-test robustness in the wild, we deploy GMP outdoors on Continuous Place Back, one of our in-trial memory tasks. The robot places the cup on the saucer and returns it to within 5 cm of where it started. Because the start position is visible only once, at the beginning of the episode, and leaves the camera view the moment the cup is lifted, the policy must retain the start location in memory for the entire pick-and-place. We show two variants side by side: a clean version on the left, and a perturbed version on the right in which a human tips the cup over by ~90 degrees mid-task, forcing the policy to right the cup and still find its way back. Trials span backgrounds, lighting, surfaces, and cup placements never seen in training, collected and deployed via iPhUMI. All clips play at 1× real-time speed.
GMP decomposes the memory problem into two coupled decisions: when to read history, and what to read. A binary memory gate μt, conditioned on the current observation, controls the former; when the gate is active, a cross-attention block reads from a cache of per-frame summary tokens to produce the latter. A third component augments cached action traces with diffusion noise one step cleaner than the prediction target, stabilizing predictions under imperfect history at deployment.
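To make the decomposition concrete, here is a minimal PyTorch sketch of the gate-then-read control flow. All module and tensor names are illustrative; the diffusion action head and the noise-augmentation component are omitted, so this is a sketch of the structure rather than the released implementation.

```python
import torch
import torch.nn as nn

class GatedMemoryRead(nn.Module):
    """Illustrative sketch: a binary gate decides *when* to read history;
    cross-attention decides *what* to read from the cached tokens."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Gate mu_t is conditioned on the current observation only.
        self.gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
        # Current-step query attends to per-frame summary tokens in the cache.
        self.read = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, obs_tok: torch.Tensor, memory_cache: torch.Tensor):
        # obs_tok:      (B, 1, d) summary token of the current observation
        # memory_cache: (B, T, d) cached per-frame visual + action tokens
        mu = (self.gate(obs_tok.squeeze(1)) > 0.5).float()   # (B, 1) hard gate
        if not bool(mu.any()):
            return obs_tok          # gate closed: skip history, constant cost
        mem, _ = self.read(obs_tok, memory_cache, memory_cache)
        return obs_tok + mu.unsqueeze(-1) * mem              # gated memory read
```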
A Diffusion Transformer backbone with a cross-attention module for history conditioning. Each past frame is summarized as one aggregated visual token (via attention pooling over a pretrained SigLIP2-B/16 ViT) paired with its action tokens; these are cached across denoising steps, so inference cost stays linear in history length. When μt=0, history attention is skipped entirely and cost is constant.
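As a sketch of how such a cache might be built: the snippet below pools each frame's patch tokens into a single summary token via learned attention pooling over a frozen encoder, and stores it next to that frame's action tokens. Here `vision_encoder` is a stand-in for the frozen SigLIP2-B/16 ViT, and the pooling head and cache layout are assumptions for illustration, not the released code.

```python
import torch
import torch.nn as nn

class HistoryCache(nn.Module):
    """Sketch: one pooled visual token per past frame, paired with action tokens."""

    def __init__(self, vision_encoder: nn.Module, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.encoder = vision_encoder.eval()          # frozen pretrained ViT
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        # Learned attention pooling: one query summarizes all patch tokens.
        self.pool_query = nn.Parameter(torch.randn(1, 1, d_model))
        self.pool = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.tokens: list[torch.Tensor] = []          # one entry per past frame

    def append(self, frame: torch.Tensor, action_tok: torch.Tensor) -> None:
        with torch.no_grad():                         # encoder stays frozen
            patches = self.encoder(frame)             # (B, P, d) patch tokens
        q = self.pool_query.expand(frame.shape[0], -1, -1)
        pooled, _ = self.pool(q, patches, patches)    # (B, 1, d) summary token
        self.tokens.append(torch.cat([pooled, action_tok], dim=1))

    def read(self) -> torch.Tensor:
        # Reused across all denoising steps: cost stays linear in history length.
        return torch.cat(self.tokens, dim=1)          # (B, T*(1+A), d)
```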
Training the gate jointly with the policy tends to collapse into one of two failure modes: the gate stays always on and the policy overfits on Markovian tasks, or the gate stays always off and the policy fails on non-Markovian ones. To avoid both, we calibrate the gate offline on a held-out validation set.
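The exact calibration criterion is not spelled out on this page, so the following is only a plausible sketch: sweep a gate threshold on held-out validation episodes and keep the one that minimizes action-prediction error. `policy.gate_score` and `policy.predict` are hypothetical accessors, not part of the released API.

```python
import numpy as np

def calibrate_gate(policy, val_episodes, thresholds=np.linspace(0.1, 0.9, 9)):
    """Pick a gate threshold offline by sweeping it on held-out validation data."""
    best_thr, best_err = None, float("inf")
    for thr in thresholds:
        errs = []
        for episode in val_episodes:
            for obs, history, target_action in episode:
                use_memory = policy.gate_score(obs) > thr   # hypothetical accessor
                pred = policy.predict(obs, history if use_memory else None)
                errs.append(float(((pred - target_action) ** 2).mean()))
        mean_err = float(np.mean(errs))
        if mean_err < best_err:
            best_thr, best_err = thr, mean_err
    return best_thr
```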
We visualize causal cross-attention weights to understand which past timesteps the policy attends to when computing the next action. Across tasks, attention concentrates on frames that carry task-relevant information (the moment a cue was first observed, or the outcome of a prior trial) rather than spreading uniformly across the available history.
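A minimal sketch of such a visualization, assuming the per-step weights have already been averaged over heads (as nn.MultiheadAttention returns them by default):

```python
import matplotlib.pyplot as plt

def plot_history_attention(attn_weights, out_path="history_attention.png"):
    """Bar plot of the current step's attention over cached history frames.

    attn_weights: 1-D tensor/array of length T, the attention mass the
    current-step query assigns to each past frame.
    """
    w = attn_weights.detach().cpu().numpy() if hasattr(attn_weights, "detach") else attn_weights
    plt.figure(figsize=(8, 2))
    plt.bar(range(len(w)), w)
    plt.xlabel("history timestep")
    plt.ylabel("attention weight")
    plt.tight_layout()
    plt.savefig(out_path)
```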
We evaluate on 3 tasks from the RoboMimic [1] benchmark: Tool Hang, Square, and Transport. The suffixes (ph, mh) indicate whether the data were collected from proficient-human (ph) or multi-human (mh) demonstrators. While most long-history policies suffer performance drops on these Markovian tasks, GMP maintains competitive performance by leveraging its gating mechanism.
The memory gate helps when the policy significantly overfits to history observations. This typically happens in tasks that require high-precision manipulation and where there is not enough data to cover the expanded input space that memory introduces; a concrete example is Tool Hang in RoboMimic. The gate also helps when reactivity matters, since skipping history attention reduces runtime and makes the model more efficient. That said, for most of the memory-intensive tasks in our benchmark, adding the gate does not significantly improve performance. A practical approach is to first train the model without a gate (a simplified version of GMP), run evaluation, and then add the gate if the model fails for the reasons above.
We include only a single aggregated token per image from the ViT (e.g., the class token in CLIP or the MAP token in SigLIP2), and all history tokens are cached. The cross-attention module also scales linearly in history length. As a result, total inference time is similar to that of a policy without memory. We release an in-the-wild checkpoint so anyone can try it in the real world.
Please read through the README.md in each folder of the codebase carefully. The experiment details in the paper's appendix will also be helpful. If your problem is still not solved, feel free to submit a GitHub issue with a detailed description.
Thanks to the strong scalability of cross-attention, history length is bottlenecked only by GPU memory during training. By caching visual features and freezing the pretrained image encoder, our model can attend to 6,000 image frames (with paired actions), achieving memory recall of up to 10 minutes at 10 fps. Please refer to Finding 7 for more details.
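A quick back-of-envelope check of that horizon, with illustrative (not reported) token sizes:

```python
# 6,000 cached frames at 10 fps -> 10 minutes of recall.
fps, history_frames = 10, 6_000
print(history_frames / fps / 60)      # 10.0 minutes

# Cache footprint, assuming one 512-d fp16 visual token per frame plus
# (illustratively) 8 action tokens of the same width.
d_model, bytes_per_val, tokens_per_frame = 512, 2, 1 + 8
cache_mib = history_frames * tokens_per_frame * d_model * bytes_per_val / 2**20
print(f"{cache_mib:.1f} MiB")         # ~52.7 MiB: easily kept resident on GPU
```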
The robot may appear frozen for long stretches, which is expected: we are stress-testing the policy's memory duration. At the end, once the bin colors shuffle, the robot places the cube back in its original bin. Both clips play at real-time speed (10 fps); drag the progress bar to skim through the full episode.
We thank Austin Patel for developing the iPhUMI app, which enabled in-the-wild data collection and deployment; Chiling Han for assistance with the baseline experiments; and Mengda Xu and Huy Ha for guidance on MuJoCo simulation. We also thank members of REALab for feedback on the manuscript and presentation, and Yuejiang Liu, Moo Jin Kim, John Yao, Shang Yang, Genghan Zhang, and Dongchen Han for discussions during this work.
We thank TRI for providing the UR5 robot hardware and ARX for the X5 robot hardware. We thank Apple for providing the iPhone 15 Pro, and Stanford Marlowe for computational resources.
This work was supported in part by NSF Awards #2143601, #2037101, and #2132519. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.