TL;DR

Microsoft Research released MEMENTO — a training recipe that teaches reasoning LLMs to compress their own chain-of-thought mid-generation. The model segments its thinking into blocks, emits a dense memento summary after each block, then masks and evicts the original block from its KV cache. Result: ~2.5x lower peak KV cache, ~1.75–2x higher throughput on vLLM, and a fascinating ablation showing that the KV states themselves carry hidden residual information worth 15 accuracy points on AIME24. The paper, 228K-trace dataset, and vLLM fork are all public, with code and data under MIT.

What's new

Most long-context tricks bolt onto frozen models: head-level eviction, quantization, restart-based recomputation. MEMENTO goes in the other direction — it trains the model to manage its own context from the inside.

  • Reasoning traces get split into semantically coherent blocks using new special tokens (<think>, <|block_start|>, <|summary_start|>).
  • When a block closes, the model writes a memento: a terse, information-dense summary of what it just figured out.
  • The original block is masked from future attention and its KV entries are flushed.
  • From then on, the model sees only past mementos plus the block it's currently working through.

Peak KV memory stops climbing monotonically and instead traces a sawtooth: each memento write drops the curve sharply before the next block slowly builds it back up.
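
A minimal sketch of that control flow, in plain Python with stubs standing in for the trained model. The function names, token counts, and block sizes here are illustrative assumptions, not the paper's implementation:

```python
# Illustrative simulation of MEMENTO-style context management (not the paper's code).
# The "model" is stubbed out; what matters is what stays resident in the KV cache.

def generate_block(context: list[str]) -> list[str]:
    # Stub: the real model decodes until it emits a block-closing special token.
    return [f"block{len(context)}_tok{i}" for i in range(40)]

def summarize_block(block: list[str]) -> list[str]:
    # Stub: the real model writes a dense memento, 5-20x shorter than the block.
    return [f"memento({block[0]}..{block[-1]})"]

mementos: list[str] = []      # compressed history that stays attendable
kv_trace: list[int] = []      # tokens resident in the KV cache over time
full_trace_len = 0

for _ in range(5):            # five reasoning blocks
    context = list(mementos)  # the model sees only past mementos...
    block = generate_block(context)
    full_trace_len += len(block)
    for t in range(1, len(block) + 1):   # ...plus the block it is currently writing
        kv_trace.append(len(context) + t)
    mementos += summarize_block(block)   # keep the summary,
    kv_trace.append(len(mementos))       # ...and evict the block's KV entries

print(f"peak KV tokens: {max(kv_trace)} vs {full_trace_len} for the uncompressed trace")
```

Running it prints a peak context length well below the uncompressed trace length; kv_trace is the sawtooth in miniature.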

Why it matters

Reasoning models are eating more and more tokens per answer. A 32K-token chain-of-thought is now routine; agentic loops stack those chains turn after turn. Every one of those tokens lives in the KV cache, and KV cache — not FLOPs — is usually what caps how many users you can fit on a GPU.
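
For a sense of scale, a back-of-the-envelope footprint calculation. The configuration below (28 layers, 4 grouped-query KV heads, head dimension 128, fp16 cache) is an assumption roughly matching a Qwen2.5-7B-class model, not a number from the paper:

```python
# Rough KV-cache footprint per request, assuming a Qwen2.5-7B-like GQA config.
layers, kv_heads, head_dim = 28, 4, 128   # assumed architecture
bytes_per_elem = 2                        # fp16/bf16 cache
tokens = 32_000                           # one long chain-of-thought

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
total_gb = per_token * tokens / 1e9
print(f"{per_token/1024:.0f} KiB per token, ~{total_gb:.1f} GB per 32K-token request")
# ~56 KiB/token and ~1.8 GB/request: a handful of concurrent long traces
# already dominates the memory left over after the weights.
```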

MEMENTO attacks exactly that bottleneck. Nearly doubled throughput on the same hardware is not a paper-only claim: on a B200 GPU, vLLM with the MEMENTO overlay served 240 concurrent requests in 693 s vs 1,096 s baseline, at 4,290 tok/s vs 2,447 tok/s.

Technical facts

  • Per-block compression: 5–20x
  • Trace-level compression: ~6x (11K tokens → under 2K)
  • Peak KV cache reduction: ~2.5x (2–3x across configs)
  • Throughput (B200, vLLM): 4,290 tok/s vs 2,447 tok/s (~1.75x)
  • Base-model overlap: 96.4% problem-solving capability retained
  • OpenMementos dataset: 228K traces (54% math / 19% code / 27% science)
  • Training data needed: ~30K examples suffice for SFT
  • Models tested: Qwen2.5-7B, Qwen3-8B/32B, Phi-4 Reasoning 14B, OLMo3-7B-Think
  • vLLM overlay: fork of vLLM 0.13.0 with KV block masking
  • License: MIT

The hidden channel: KV states as secondary memory

This is the most interesting finding in the paper. When the model writes a memento, it is still attending to the full block — so the KV entries produced at that moment encode more than the visible memento text alone.

To test it, the authors ran a restart-mode inference that throws away the memento's KV states and recomputes them from the memento text alone. Same text, same tokens — but accuracy on AIME24 collapses from 66.1% to 50.8%, a 15.3-point drop.

The memento carries information in two channels: the explicit text, and the implicit KV representations produced while writing it. Remove the hidden channel and you lose a large fraction of the reasoning quality — even though the visible output looks identical.
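
A schematic way to see why the two modes diverge, as a toy single-head attention in plain Python. The embeddings, token strings, and helper functions are all invented for illustration; the real effect lives inside the trained model's attention layers:

```python
# Toy illustration: the memento's hidden states differ depending on whether the
# full block was visible when they were computed, even though the memento *text*
# is identical in both modes. Not the paper's code.
import math, random

DIM = 8

def embed(tok: str) -> list[float]:
    # Deterministic pseudo-embedding per token (stand-in for real model states).
    rng = random.Random(tok)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def attend(query: list[float], context: list[list[float]]) -> list[float]:
    # Single-head dot-product attention over whatever context is visible.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM) for key in context]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, context)) / z for i in range(DIM)]

block = ["let", "x", "=", "7", ";", "check", "parity"]
memento = ["x", "is", "7", ",", "odd"]

# In-place mode: memento states are computed while the block is still attendable.
state_inplace = attend(embed(memento[-1]), [embed(t) for t in block + memento])

# Restart mode: the block is gone; states are recomputed from the memento text alone.
state_restart = attend(embed(memento[-1]), [embed(t) for t in memento])

drift = math.dist(state_inplace, state_restart)
print(f"hidden-state drift between modes: {drift:.3f}")  # nonzero: information was lost
```

The printed drift is nonzero because the in-place states were mixed with the full block's representations; restart mode can only reconstruct what the memento text spells out.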

This is an intriguing result for interpretability too. It suggests that parts of what a reasoning model "knows" at a given step live not in tokens you can read, but in tensors you can only keep or discard.

Comparison with prior KV-cache tricks

  • Head-level eviction / quantization: works on any checkpoint, but treats KV as a static pool to shrink. MEMENTO instead teaches the model to retire whole reasoning segments once it has digested them.
  • Restart-based recomputation: saves memory, but loses the hidden KV channel → –15 pts on AIME24. MEMENTO's in-place masking is the first approach to keep that channel alive.
  • Other CoT-compression methods (e.g. compressed CoT via dense representations): mostly applied offline, or they pre-/post-process the full trace. MEMENTO compresses during generation, which is what yields the throughput gain.

Use cases

  • Serving reasoning models at higher concurrency on the same hardware — direct opex win for anyone running Qwen/Phi/OLMo-class models.
  • Longer effective reasoning inside fixed context windows: useful when a single request needs to chew through tens of thousands of tokens of intermediate state.
  • Agentic workflows — the authors explicitly flag multi-turn terminal / CLI agents as the next target, where every tool call piles on history.
  • Research on reasoning-time memory and the interpretability of implicit channels in attention.

Limitations & pricing

  • Small initial accuracy gaps on AIME and GPQA-Diamond at smaller scales. They shrink with model size and close entirely with additional RL or majority voting at k=3.
  • Not a drop-in inference trick — the model must be fine-tuned with the two-stage SFT recipe and the new special tokens (a minimal token-registration sketch follows this list).
  • Inference requires the vLLM 0.13.0 overlay fork for block-level KV masking.
  • Pricing: free. Code, dataset, and vLLM fork are all MIT-licensed.
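
A rough sketch of the token-registration step before SFT, using the Hugging Face transformers API. The checkpoint name is an assumption, and the paper's full control-token inventory may go beyond the three tokens shown:

```python
# Sketch: registering MEMENTO-style control tokens before fine-tuning.
# Assumes a Qwen2.5-7B base; the exact token set is defined by the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")

control_tokens = ["<think>", "<|block_start|>", "<|summary_start|>"]
added = tokenizer.add_special_tokens({"additional_special_tokens": control_tokens})

if added:
    # New embedding rows for the freshly added tokens, learned during SFT.
    model.resize_token_embeddings(len(tokenizer))
```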

What's next

The authors signal two directions: (1) scaling RL on top of the two-stage SFT to close the last accuracy gaps on hard math/science benchmarks, and (2) applying MEMENTO to agentic settings — long-horizon terminal and CLI agents, where context bloat is the dominant cost. If the technique generalises, "self-compressing reasoning" could become a standard capability shipped with the next generation of open-source reasoning models rather than a research curiosity.

Sources: Microsoft Research, arXiv 2604.09852, github.com/microsoft/memento, OpenMementos dataset.