TL;DR

Microsoft Research released MEMENTO — a training recipe that teaches reasoning LLMs to compress their own chain-of-thought mid-generation. The model segments its thinking into blocks, emits a dense memento summary after each block, then masks and evicts the original block from its KV cache. Result: ~2.5x lower peak KV cache, ~1.75–2x higher throughput on vLLM, and a fascinating ablation showing that the KV states themselves carry hidden residual information worth 15 accuracy points on AIME24. The paper, 228K-trace dataset, and vLLM fork are all public, with code and data under MIT.

What's new

Most long-context tricks bolt onto frozen models: head-level eviction, quantization, restart-based recomputation. MEMENTO goes in the other direction — it trains the model to manage its own context from the inside.

  • Reasoning traces get split into semantically coherent blocks using new special tokens (<think>, <|block_start|>, <|summary_start|>).
  • When a block closes, the model writes a memento: a terse, information-dense summary of what it just figured out.
  • The original block is masked from future attention and its KV entries are flushed.
  • From then on, the model sees only past mementos plus the block it's currently working through.

Peak KV memory stops climbing monotonically and instead traces a sawtooth: each memento write drops the curve sharply before the next block slowly builds it back up.
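
A minimal sketch of that control flow, in plain Python with stubs standing in for the trained model. The function names, token counts, and block sizes here are illustrative assumptions, not the paper's implementation:

```python
# Illustrative simulation of MEMENTO-style context management (not the paper's code).
# The "model" is stubbed out; what matters is what stays resident in the KV cache.

def generate_block(context: list[str]) -> list[str]:
    # Stub: the real model decodes until it emits a block-closing special token.
    return [f"block{len(context)}_tok{i}" for i in range(40)]

def summarize_block(block: list[str]) -> list[str]:
    # Stub: the real model writes a dense memento, 5-20x shorter than the block.
    return [f"memento({block[0]}..{block[-1]})"]

mementos: list[str] = []      # compressed history that stays attendable
kv_trace: list[int] = []      # tokens resident in the KV cache over time
full_trace_len = 0

for _ in range(5):            # five reasoning blocks
    context = list(mementos)  # the model sees only past mementos...
    block = generate_block(context)
    full_trace_len += len(block)
    for t in range(1, len(block) + 1):   # ...plus the block it is currently writing
        kv_trace.append(len(context) + t)
    mementos += summarize_block(block)   # keep the summary,
    kv_trace.append(len(mementos))       # ...and evict the block's KV entries

print(f"peak KV tokens: {max(kv_trace)} vs {full_trace_len} for the uncompressed trace")
```

Running it prints a peak context length well below the uncompressed trace length; kv_trace is the sawtooth in miniature.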

Why it matters

Reasoning models are eating more and more tokens per answer. A 32K-token chain-of-thought is now routine; agentic loops stack those chains turn after turn. Every one of those tokens lives in the KV cache, and KV cache — not FLOPs — is usually what caps how many users you can fit on a GPU.
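
For a sense of scale, a back-of-the-envelope footprint calculation. The configuration below (28 layers, 4 grouped-query KV heads, head dimension 128, fp16 cache) is an assumption roughly matching a Qwen2.5-7B-class model, not a number from the paper:

```python
# Rough KV-cache footprint per request, assuming a Qwen2.5-7B-like GQA config.
layers, kv_heads, head_dim = 28, 4, 128   # assumed architecture
bytes_per_elem = 2                        # fp16/bf16 cache
tokens = 32_000                           # one long chain-of-thought

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
total_gb = per_token * tokens / 1e9
print(f"{per_token/1024:.0f} KiB per token, ~{total_gb:.1f} GB per 32K-token request")
# ~56 KiB/token and ~1.8 GB/request: a handful of concurrent long traces
# already dominates the memory left over after the weights.
```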

MEMENTO attacks exactly that bottleneck. Nearly doubled throughput on the same hardware is not a paper-only claim: on a B200 GPU, vLLM with the MEMENTO overlay served 240 concurrent requests in 693 s vs 1,096 s baseline, at 4,290 tok/s vs 2,447 tok/s.

Technical facts

  • Per-block compression: 5–20x
  • Trace-level compression: ~6x (11K tokens → under 2K)
  • Peak KV cache reduction: ~2.5x (2–3x across configs)
  • Throughput (B200, vLLM): 4,290 tok/s vs 2,447 tok/s (~1.75x)
  • Base-model overlap: 96.4% problem-solving capability retained
  • OpenMementos dataset: 228K traces (54% math / 19% code / 27% science)
  • Training data needed: ~30K examples suffice for SFT
  • Models tested: Qwen2.5-7B, Qwen3-8B/32B, Phi-4 Reasoning 14B, OLMo3-7B-Think
  • vLLM overlay: fork of vLLM 0.13.0 with KV block masking
  • License: MIT

The hidden channel: KV states as secondary memory

This is the most interesting finding in the paper. When the model writes a memento, it is still attending to the full block — so the KV entries produced at that moment encode more than the visible memento text alone.

To test it, the authors ran a restart-mode inference that throws away the memento's KV states and recomputes them from the memento text alone. Same text, same tokens — but accuracy on AIME24 collapses from 66.1% to 50.8%, a 15.3-point drop.

The memento carries information in two channels: the explicit text, and the implicit KV representations produced while writing it. Remove the hidden channel and you lose a large fraction of the reasoning quality — even though the visible output looks identical.
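
A schematic way to see why the two modes diverge, as a toy single-head attention in plain Python. The embeddings, token strings, and helper functions are all invented for illustration; the real effect lives inside the trained model's attention layers:

```python
# Toy illustration: the memento's hidden states differ depending on whether the
# full block was visible when they were computed, even though the memento *text*
# is identical in both modes. Not the paper's code.
import math, random

DIM = 8

def embed(tok: str) -> list[float]:
    # Deterministic pseudo-embedding per token (stand-in for real model states).
    rng = random.Random(tok)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def attend(query: list[float], context: list[list[float]]) -> list[float]:
    # Single-head dot-product attention over whatever context is visible.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM) for key in context]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, context)) / z for i in range(DIM)]

block = ["let", "x", "=", "7", ";", "check", "parity"]
memento = ["x", "is", "7", ",", "odd"]

# In-place mode: memento states are computed while the block is still attendable.
state_inplace = attend(embed(memento[-1]), [embed(t) for t in block + memento])

# Restart mode: the block is gone; states are recomputed from the memento text alone.
state_restart = attend(embed(memento[-1]), [embed(t) for t in memento])

drift = math.dist(state_inplace, state_restart)
print(f"hidden-state drift between modes: {drift:.3f}")  # nonzero: information was lost
```

The printed drift is nonzero because the in-place states were mixed with the full block's representations; restart mode can only reconstruct what the memento text spells out.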

This is an intriguing result for interpretability too. It suggests that parts of what a reasoning model "knows" at a given step live not in tokens you can read, but in tensors you can only keep or discard.

Comparison with prior KV-cache tricks

  • Head-level eviction / quantization: works on any checkpoint, but treats KV as a static pool to shrink. MEMENTO instead teaches the model to retire whole reasoning segments once it has digested them.
  • Restart-based recomputation: saves memory, but loses the hidden KV channel → –15 pts on AIME24. MEMENTO's in-place masking is the first approach to keep that channel alive.
  • Other CoT-compression methods (e.g. compressed CoT via dense representations): mostly applied offline, or they pre-/post-process the full trace. MEMENTO compresses during generation, which is what yields the throughput gain.

Use cases

  • Serving reasoning models at higher concurrency on the same hardware — direct opex win for anyone running Qwen/Phi/OLMo-class models.
  • Longer effective reasoning inside fixed context windows: useful when a single request needs to chew through tens of thousands of tokens of intermediate state.
  • Agentic workflows — the authors explicitly flag multi-turn terminal / CLI agents as the next target, where every tool call piles on history.
  • Research on reasoning-time memory and the interpretability of implicit channels in attention.

Limitations & pricing

  • Small initial accuracy gaps on AIME and GPQA-Diamond at smaller scales. They shrink with model size and close entirely with additional RL or majority voting at k=3.
  • Not a drop-in inference trick — the model must be fine-tuned with the two-stage SFT recipe and the new special tokens (a minimal token-registration sketch follows this list).
  • Inference requires the vLLM 0.13.0 overlay fork for block-level KV masking.
  • Pricing: free. Code, dataset, and vLLM fork are all MIT-licensed.
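
A rough sketch of the token-registration step before SFT, using the Hugging Face transformers API. The checkpoint name is an assumption, and the paper's full control-token inventory may go beyond the three tokens shown:

```python
# Sketch: registering MEMENTO-style control tokens before fine-tuning.
# Assumes a Qwen2.5-7B base; the exact token set is defined by the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")

control_tokens = ["<think>", "<|block_start|>", "<|summary_start|>"]
added = tokenizer.add_special_tokens({"additional_special_tokens": control_tokens})

if added:
    # New embedding rows for the freshly added tokens, learned during SFT.
    model.resize_token_embeddings(len(tokenizer))
```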

What's next

The authors signal two directions: (1) scaling RL on top of the two-stage SFT to close the last accuracy gaps on hard math/science benchmarks, and (2) applying MEMENTO to agentic settings — long-horizon terminal and CLI agents, where context bloat is the dominant cost. If the technique generalises, "self-compressing reasoning" could become a standard capability shipped with the next generation of open-source reasoning models rather than a research curiosity.

Sources: Microsoft Research, arXiv 2604.09852, github.com/microsoft/memento, OpenMementos dataset.