- Meta Superintelligence Labs just shipped REFRAG — a decoding framework that compresses RAG context into chunk embeddings, hitting 30.85× faster time-to-first-token, 16× longer context, and zero perplexity loss.
- No LLM retraining required.
TL;DR
Meta Superintelligence Labs, NUS, and Rice University released REFRAG (REpresentation For RAG) — a decoding framework that compresses retrieved passages into dense chunk embeddings before feeding them to the LLM. Result: 30.85× faster time-to-first-token, 6.78× higher throughput, 16× longer context, and ~9.3% better perplexity than the previous SOTA (CEPE). No architecture change. Paper on arXiv (2509.01092); code planned at facebookresearch/refrag.
What's new
RAG has a dirty secret: when you retrieve 80 passages, only 5–10 actually matter — but the decoder still pays full quadratic attention cost on every token. REFRAG stops that waste.
Instead of feeding raw tokens from retrieved passages into the decoder, REFRAG uses a lightweight encoder to split each passage into 16-token chunks, compress each chunk into a single dense embedding, and project those embeddings into the decoder's token space. The decoder then generates as normal — but sees 16× fewer input positions.
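A minimal PyTorch sketch of that flow (not the authors' implementation; the module names, dimensions, and the mean-pooling stand-in for the lightweight encoder are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ChunkCompressor(nn.Module):
    """Illustrative sketch: compress 16-token chunks of a retrieved passage
    into single embeddings projected into the decoder's token space."""

    def __init__(self, enc_dim=768, dec_dim=4096, chunk_size=16):
        super().__init__()
        self.chunk_size = chunk_size
        # Stand-in for the lightweight encoder; a mean-pool over token
        # embeddings is used here purely for illustration.
        self.token_emb = nn.Embedding(32000, enc_dim)
        # Projection from encoder space into the decoder's input-embedding space.
        self.proj = nn.Linear(enc_dim, dec_dim)

    def forward(self, passage_ids: torch.Tensor) -> torch.Tensor:
        # passage_ids: (num_tokens,) token ids of one retrieved passage
        n = passage_ids.numel() // self.chunk_size * self.chunk_size
        chunks = passage_ids[:n].view(-1, self.chunk_size)   # (num_chunks, 16)
        chunk_vecs = self.token_emb(chunks).mean(dim=1)      # (num_chunks, enc_dim)
        return self.proj(chunk_vecs)                         # (num_chunks, dec_dim)

compressor = ChunkCompressor()
passage = torch.randint(0, 32000, (256,))   # a 256-token retrieved passage
chunk_embeddings = compressor(passage)
print(chunk_embeddings.shape)               # torch.Size([16, 4096]): 16x fewer positions
```

The decoder receives these 16 projected vectors in place of 256 raw token embeddings, which is where the position reduction comes from.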
The insight: RAG contexts naturally exhibit block-diagonal attention. Retrieved passages rarely interact with each other (they're deduplicated and re-ranked for diversity). So most cross-passage attention computation is unnecessary — and can be eliminated with minimal quality impact.
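To make the block-diagonal claim concrete, here is a small sketch of the attention structure (an illustration of the pattern, not the paper's kernel; REFRAG's gains come from compressing the passages rather than masking them, but the structure is the same):

```python
import torch

def block_diagonal_mask(passage_lengths, query_length):
    """Boolean attention mask: True = attention allowed.
    Each retrieved passage attends only within itself; the query attends to everything."""
    total = sum(passage_lengths) + query_length
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for length in passage_lengths:
        mask[start:start + length, start:start + length] = True  # within-passage block
        start += length
    mask[start:, :] = True  # query tokens attend to all passages and themselves
    return mask

mask = block_diagonal_mask(passage_lengths=[4, 4], query_length=2)
print(mask.int())
# The zero entries are cross-passage attention: the computation REFRAG avoids paying for.
```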
Why it matters
Production RAG teams know the pain: more context = richer answers, but also quadratic latency and a ballooning KV cache. A 16K-token prompt can blow past 100 seconds to first token on a single GPU. REFRAG collapses that bottleneck without forcing you to re-train the base model or redesign the pipeline.
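A rough back-of-envelope shows the scale of the problem (the model dimensions below are assumptions for a generic 7B-class decoder, not figures from the paper):

```python
# Why a 16K-token RAG prompt hurts.
# Assumed 7B-class decoder: 32 layers, 32 heads, head_dim 128, fp16 KV cache.
layers, heads, head_dim, bytes_per_val = 32, 32, 128, 2

def kv_cache_bytes(prompt_tokens):
    return 2 * layers * heads * head_dim * bytes_per_val * prompt_tokens  # 2x: keys + values

full_prompt = 16_000            # raw retrieved tokens
compressed  = 16_000 // 16      # one embedding per 16-token chunk

print(f"KV cache at 16K tokens:          {kv_cache_bytes(full_prompt) / 1e9:.1f} GB")
print(f"KV cache after 16x compression:  {kv_cache_bytes(compressed) / 1e9:.2f} GB")
# Prefill attention cost also scales roughly with the square of the number of
# positions, so 16x fewer positions means on the order of 256x less attention
# work before the first token is produced.
```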
The setup is plug-and-play: a lightweight encoder such as RoBERTa compresses the chunks, and the projected embeddings feed decoder-only foundations like LLaMA or OPT without any architecture change. Chunk embeddings can be precomputed at indexing time, cached, and reused across inferences, which means cold-start latency drops even further in real workloads.
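Because the embeddings depend only on the passage text, caching is straightforward. A minimal sketch (the `ChunkCompressor` name reuses the illustrative module above; the cache layout and the `tokenizer` callable are assumptions, not the paper's design):

```python
import hashlib
import torch

embedding_cache: dict[str, torch.Tensor] = {}   # in production: a vector store or KV store

def passage_key(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def get_chunk_embeddings(text: str, tokenizer, compressor) -> torch.Tensor:
    """Return cached chunk embeddings, computing them once per unique passage."""
    key = passage_key(text)
    if key not in embedding_cache:
        ids = torch.tensor(tokenizer(text))          # hypothetical tokenizer returning ids
        embedding_cache[key] = compressor(ids).detach()
    return embedding_cache[key]

# At query time the decoder prompt becomes:
# [projected chunk embeddings for each retrieved passage] + [query token embeddings]
```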
For teams running customer support bots, internal search, or agent pipelines, this changes the unit economics. A query that previously burned GPU cycles processing 16K tokens can now land in the compute budget of a single retrieved passage. Engineering leads who were force-capping retrieval at 3–5 passages can finally take 8, 16, or 32 without blowing out their SLA.
Technical facts
| Metric | REFRAG | CEPE (prior SOTA) | LLaMA-32K |
|---|---|---|---|
| TTFT acceleration | 30.85× (k=32) | 2–8× | 1× |
| Throughput vs LLaMA | 6.78× | — | 1× |
| Context extension | 16× | 8× | 8× |
| Perplexity vs CEPE | ~9.3% lower (better) | baseline | — |
- Chunk size: 16 tokens → 1 embedding (see the back-of-envelope sketch after this list)
- At k=16: 16.53× TTFT acceleration; at k=32: 30.85×
- Pretrained on 20B tokens from SlimPajama (Books + arXiv)
- Evaluated on Book, Arxiv, PG19, ProofPile, plus RAG + multi-turn + summarization benchmarks
- REFRAG16 and REFRAG32 beat fine-tuned LLaMA while feeding 2–4× fewer tokens to the decoder
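The context-extension figure falls directly out of the chunk size: if every 16 raw tokens become one decoder position, the same decoder window covers roughly 16× more retrieved text. A back-of-envelope sketch (the 4096-position window and 256-token query are assumed examples, not paper figures):

```python
def effective_context(decoder_positions: int, chunk_size: int, query_tokens: int = 256) -> int:
    """Raw retrieved tokens that fit when each chunk_size-token chunk
    costs a single decoder position (query tokens stay uncompressed)."""
    return (decoder_positions - query_tokens) * chunk_size

for k in (8, 16, 32):
    print(f"chunk size {k:>2}: a 4096-position decoder covers "
          f"~{effective_context(4096, k):,} retrieved tokens")
# chunk size 16: ~61,440 retrieved tokens through a 4K-position decoder,
# i.e. roughly the 16x context extension reported in the table above.
```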
Comparison
Prior work on long-context efficiency — CEPE, landmark attention, sliding-window approaches — plateaued at 2–8× speedups and often traded accuracy for speed. REFRAG reports 3.75× better TTFT than CEPE while improving perplexity, not degrading it.
The key architectural bet that pays off: REFRAG doesn't try to be a general long-context recipe. It specifically exploits the block-diagonal structure unique to RAG workloads, which is why it can afford such aggressive compression without semantic drift.
Use cases
- Enterprise RAG pipelines: scale to thousands of retrieved passages at single-passage latency
- Multi-turn agents: retain full conversation history without truncation tricks
- Long-document QA: legal contracts, financial filings, research papers with full context awareness
- Web-scale search assistants: real-time answering over millions of retrieved documents
- Weak-retriever scenarios: REFRAG widens the lead when retrievers are noisy — it can absorb more irrelevant passages under the same latency budget and still extract the useful ones
Limitations & availability
REFRAG was benchmarked on text corpora. Open questions remain on how the RL chunk selector handles heterogeneous domains like code, multimodal inputs, or highly structured legal text. The team also notes an unexplored theoretical ceiling on compression before semantic drift kicks in.
The paper is public on arXiv (2509.01092, v2 revised Oct 12, 2025). Meta announced the source code will be released at facebookresearch/refrag on GitHub. No hosted service, no pricing — it's a research release meant to be integrated into your own stack.
What's next
Expected next steps from the research roadmap: co-training retrievers with REFRAG encoders for domain-specific RAG, hybridizing with streaming attention and token pruning, and extending the RL chunk selector to multimodal contexts. For builders, the playbook is simpler: once code lands, swap REFRAG's encoder in front of your existing decoder and measure TTFT / throughput on your own traffic.
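For that measurement, a simple first-token timer against your current stack gives the baseline to beat; `generate_first_token` below is a hypothetical stand-in for whatever hook your serving layer exposes:

```python
import time
import statistics

def measure_ttft(generate_first_token, prompts, warmup=2, runs=10):
    """Median and worst-case time-to-first-token over a set of prompts.
    `generate_first_token(prompt)` should return as soon as the first
    output token is available."""
    for p in prompts[:warmup]:
        generate_first_token(p)                     # warm caches before timing
    samples = []
    for p in prompts[:runs]:
        start = time.perf_counter()
        generate_first_token(p)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), max(samples)

# Run once with your current RAG prompt format and once with compressed
# prompts, on the same retrieved passages, and compare the medians.
```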
The bigger takeaway is economic: RAG costs were trending the wrong way as teams added more passages and longer windows. REFRAG flips that curve — more context at lower latency — without asking you to migrate models.
Sources: arXiv paper, MarkTechPost, TechTalks, Data Science Dojo.




