- Microsoft Research teaches reasoning models to summarise their own thinking mid-generation — ~2.5x lower peak KV cache, ~2x throughput, and a surprising 'hidden channel' in the KV states that alone is worth 15 accuracy points on AIME24.
TL;DR
Microsoft Research released MEMENTO — a training recipe that teaches reasoning LLMs to compress their own chain-of-thought mid-generation. The model segments its thinking into blocks, emits a dense memento summary after each block, then masks and evicts the original block from its KV cache. Result: ~2.5x peak KV-cache reduction, ~1.75–2x throughput on vLLM, and a fascinating ablation showing the KV states themselves carry hidden residual information worth 15 accuracy points on AIME24. Paper, 228K-trace dataset, and vLLM fork are all open under MIT.
What's new
Most long-context tricks bolt onto frozen models: head-level eviction, quantization, restart-based recomputation. MEMENTO goes in the other direction — it trains the model to manage its own context from the inside.
- Reasoning traces get split into semantically coherent blocks using new special tokens (`<think>`, `<|block_start|>`, `<|summary_start|>`).
- When a block closes, the model writes a memento: a terse, information-dense summary of what it just figured out.
- The original block is masked from future attention and its KV entries are flushed.
- From then on, the model sees only past mementos plus the block it's currently working through.
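The masking step can be sketched with a toy dense boolean mask. This is illustrative only: the actual overlay operates on vLLM's paged KV blocks, and the segment layout and function names here are made up for clarity.

```python
import numpy as np

def block_mask(layout, active_block):
    """Toy MEMENTO-style attention mask.

    layout: ordered list of ("block", n_tokens) / ("memento", n_tokens) segments.
    active_block: index into layout of the block still being generated.
    Returns a boolean [T, T] mask where True means "may attend to".
    """
    spans, pos = [], 0
    for kind, n in layout:
        spans.append((kind, pos, pos + n))
        pos += n
    T = pos
    mask = np.tril(np.ones((T, T), dtype=bool))  # causal baseline
    for i, (kind, lo, hi) in enumerate(spans):
        if kind == "block" and i != active_block:
            # Closed block: evicted, so it is invisible to every later position.
            mask[hi:, lo:hi] = False
    return mask

# Two closed blocks with their mementos, then the block currently in flight.
layout = [("block", 4), ("memento", 1), ("block", 4), ("memento", 1), ("block", 3)]
m = block_mask(layout, active_block=4)
# The newest token sees only the two mementos plus its own 3-token block.
print(int(m[-1].sum()))  # → 5
```

In the real system the masked entries are not merely hidden but flushed from the cache, which is where the memory saving comes from.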
Peak KV memory stops climbing monotonically and instead traces a sawtooth: each memento write drops the curve sharply before the next block slowly builds it back up.
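With toy numbers (400-token blocks, 40-token mementos, three blocks, i.e. 10x per-block compression, inside the paper's reported 5–20x range), the sawtooth arithmetic lands near the reported ~2.5x peak reduction. The constants below are illustrative, not from the paper.

```python
BLOCK, MEMENTO, N_BLOCKS = 400, 40, 3  # toy sizes, not from the paper

def peak_kv(evict: bool) -> int:
    """Track peak KV-cache occupancy (in tokens) over a toy reasoning trace."""
    cache, peak = 0, 0
    for _ in range(N_BLOCKS):
        cache += BLOCK + MEMENTO   # block builds up, then its memento is written
        peak = max(peak, cache)    # the peak is hit just before any eviction
        if evict:
            cache -= BLOCK         # sawtooth drop: the closed block is flushed
    return peak

baseline, with_mementos = peak_kv(False), peak_kv(True)
print(baseline, with_mementos, round(baseline / with_mementos, 1))  # → 1320 520 2.5
```

Only the mementos accumulate across blocks, so peak memory grows by ~40 tokens per block instead of ~440.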
Why it matters
Reasoning models are eating more and more tokens per answer. A 32K-token chain-of-thought is now routine; agentic loops stack those chains turn after turn. Every one of those tokens lives in the KV cache, and KV cache — not FLOPs — is usually what caps how many users you can fit on a GPU.
MEMENTO attacks exactly that bottleneck. Nearly doubled throughput on the same hardware is not a paper-only claim: on a B200 GPU, vLLM with the MEMENTO overlay served 240 concurrent requests in 693 s vs 1,096 s baseline, at 4,290 tok/s vs 2,447 tok/s.
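A quick sanity check on the reported benchmark numbers shows why the wall-clock and token-rate speedups differ slightly:

```python
# Reported B200 figures (240 concurrent requests, vLLM with the MEMENTO overlay).
base_secs, memento_secs = 1096, 693   # end-to-end serving time, seconds
base_tps, memento_tps = 2447, 4290    # tokens per second

print(round(base_secs / memento_secs, 2))   # → 1.58  (wall-clock speedup)
print(round(memento_tps / base_tps, 2))     # → 1.75  (throughput speedup)
```

The gap between the two ratios is expected: with mementos the model also emits different token counts per request, so time and tokens/s do not scale identically.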
Technical facts
| Property | Value |
|---|---|
| Per-block compression | 5–20x |
| Trace-level compression | ~6x (11K tokens → under 2K) |
| Peak KV cache reduction | ~2.5x (2–3x across configs) |
| Throughput (B200, vLLM) | 4,290 tok/s vs 2,447 tok/s (~1.75x) |
| Base-model overlap | 96.4% problem-solving capability retained |
| OpenMementos dataset | 228K traces (54% math / 19% code / 27% science) |
| Training data needed | ~30K examples suffice for SFT |
| Models tested | Qwen2.5-7B, Qwen3-8B/32B, Phi-4 Reasoning 14B, OLMo3-7B-Think |
| vLLM overlay | Fork of vLLM 0.13.0 with KV block masking |
| License | MIT |
The hidden channel: KV states as secondary memory
This is the most interesting finding in the paper. When the model writes a memento, it is still attending to the full block — so the KV entries produced at that moment encode more than the visible memento text alone.
To test it, the authors ran a restart-mode inference that throws away the memento's KV states and recomputes them from the memento text alone. Same text, same tokens — but accuracy on AIME24 collapses from 66.1% to 50.8%, a 15.3-point drop.
The memento carries information in two channels: the explicit text, and the implicit KV representations produced while writing it. Remove the hidden channel and you lose a large fraction of the reasoning quality — even though the visible output looks identical.
This is an intriguing result for interpretability too. It suggests that parts of what a reasoning model "knows" at a given step live not in tokens you can read, but in tensors you can only keep or discard.
Comparison with prior KV-cache tricks
- Head-level eviction / quantization: works on any checkpoint, but treats KV as a static pool to shrink. MEMENTO instead teaches the model to retire whole reasoning segments once it has digested them.
- Restart-based recomputation: cheaper memory but loses the hidden KV channel → –15 pts on AIME24. MEMENTO's in-place masking is the first approach to keep that channel alive.
- Other CoT-compression methods (e.g. compressed CoT via dense representations): mostly offline or pre-/post-process the full trace. MEMENTO compresses during generation, which is what yields the throughput gain.
Use cases
- Serving reasoning models at higher concurrency on the same hardware — direct opex win for anyone running Qwen/Phi/OLMo-class models.
- Longer effective reasoning inside fixed context windows: useful when a single request needs to chew through tens of thousands of tokens of intermediate state.
- Agentic workflows — the authors explicitly flag multi-turn terminal / CLI agents as the next target, where every tool call piles on history.
- Research on reasoning-time memory and the interpretability of implicit channels in attention.
Limitations & pricing
- Small initial accuracy gaps on AIME and GPQA-Diamond at smaller scales. They shrink with model size and close entirely with additional RL or majority voting at k=3.
- Not a drop-in inference trick — the model must be fine-tuned with the two-stage SFT recipe and the new special tokens.
- Inference requires the vLLM 0.13.0 overlay fork for block-level KV masking.
- Pricing: free. Code, dataset, and vLLM fork are all MIT-licensed.
What's next
The authors signal two directions: (1) scaling RL on top of the two-stage SFT to close the last accuracy gaps on hard math/science benchmarks, and (2) applying MEMENTO to agentic settings — long-horizon terminal and CLI agents, where context bloat is the dominant cost. If the technique generalises, "self-compressing reasoning" could become a standard capability shipped with the next generation of open-source reasoning models rather than a research curiosity.
Sources: Microsoft Research, arXiv 2604.09852, github.com/microsoft/memento, OpenMementos dataset.

