- DeepSeek, Peking University and the University of Washington win ACL 2025 Best Paper with NSA — a sparse attention mechanism trained from scratch.
- 27B model beats dense baseline, runs 9× faster forward, 11.6× faster decoding at 64k on A100.
TL;DR
Attention is the transformer's cost center: every token compares against every other token, so compute grows with the square of sequence length. Native Sparse Attention (NSA), the ACL 2025 Best Paper from DeepSeek, Peking University and the University of Washington, trains that sparsity from scratch instead of bolting it on at inference. The result: a 27B-parameter model that matches or beats a dense baseline on general, long-context and reasoning benchmarks — with 9× forward, 6× backward, 11.6× decoding speedups at 64k context on A100.
What's new
Most sparse-attention papers take a transformer trained with full attention and prune comparisons at inference time. That pushes sparsity onto a model that never learned to work under it. Yuan et al. do the opposite: gradients flow through the block-selection mechanism during pretraining, so the model learns which blocks to attend to while it learns everything else.
The architecture uses three attention pathways per query, gated together:
- Compressed coarse-grained tokens — block-level summaries for global context.
- Fine-grained token selection — top-k blocks ranked by learned importance scores, so selection is trained rather than hand-designed.
- Sliding window — local context preserved as-is.
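Schematically, the three branches combine like this. The following is a toy single-query NumPy sketch, not the paper's kernel: mean-pooled block summaries and fixed gate values stand in for the learned compression MLP and gating network, and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    # standard scaled dot-product attention for a single query vector
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def nsa_output(q, K, V, block=4, topk=2, window=4, gates=(1/3, 1/3, 1/3)):
    """Toy NSA forward for one query.

    `gates` would come from a learned gating network in the real model;
    fixed placeholder values are used here.
    """
    T, d = K.shape
    n_blocks = T // block
    Kb = K[: n_blocks * block].reshape(n_blocks, block, d)
    Vb = V[: n_blocks * block].reshape(n_blocks, block, d)

    # 1) compressed branch: attend over coarse block summaries (mean pool as a stand-in)
    o_cmp = attend(q, Kb.mean(axis=1), Vb.mean(axis=1))

    # 2) selection branch: rank blocks by summary score, attend to top-k at token level
    scores = Kb.mean(axis=1) @ q
    sel = np.argsort(scores)[-topk:]
    o_sel = attend(q, Kb[sel].reshape(-1, d), Vb[sel].reshape(-1, d))

    # 3) sliding-window branch: the most recent `window` tokens, kept dense
    o_win = attend(q, K[-window:], V[-window:])

    g_cmp, g_sel, g_win = gates
    return g_cmp * o_cmp + g_sel * o_sel + g_win * o_win
```

In the full model each branch and the gates receive gradients during pretraining, which is the point of the "native" in the name.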
Why it matters
Sparse attention has been a graveyard of ideas that worked in papers but not in production. NSA is the first design that is hardware-aligned, end-to-end trainable, and actually faster in wall-clock — not just in FLOP counts. That combination is why the ACL committee picked it as Best Paper, and why it is likely to shape the next wave of long-context models.
Technical facts
| Property | Value |
|---|---|
| Model size | 27B parameters |
| Pretraining tokens | 260B |
| Context length tested | 64k |
| Hardware | NVIDIA A100 |
| Forward pass speedup | 9× |
| Backward pass speedup | 6× |
| Decode speedup | 11.6× |
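The decode speedup is mostly a memory story: each step reads only a sliver of the KV cache instead of all of it. A back-of-envelope count, using illustrative hyperparameters that are assumptions rather than the paper's exact settings:

```python
# Rough count of KV positions one decode step must read at 64k context.
# block / topk / window values are illustrative stand-ins, not paper config.
T = 65536          # context length
block = 64         # selection block size
topk = 16          # number of selected blocks
window = 512       # sliding-window size

dense_reads = T                                  # full attention touches every position
nsa_reads = T // block + topk * block + window   # block summaries + selected tokens + window

print(f"dense: {dense_reads}, NSA: {nsa_reads}, "
      f"ratio: {dense_reads / nsa_reads:.1f}x")
```

The raw read ratio overshoots the measured 11.6× because real decoding also pays for compute, gating and kernel overhead, but it shows why the saving grows with context length.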
Comparison — why prior sparse methods failed
The paper is refreshingly specific about what has gone wrong before:
- Clustering-based selection (routing tokens to cluster centroids) causes load imbalance in MoE systems — some clusters drown, others starve.
- Per-head selection conflicts with the shared key-value cache in grouped-query attention (GQA), where multiple query heads deliberately share one set of keys/values for memory efficiency.
- Token-level (non-block) selection breaks the contiguous memory access FlashAttention relies on for throughput. You can save FLOPs on paper and still run slower on the GPU.
NSA sidesteps all three by picking contiguous blocks, sharing selection across a GQA group, and co-designing kernels for arithmetic-intensity balance.
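The GQA fix can be sketched in a few lines: instead of each query head ranking blocks independently, NSA-style selection aggregates scores across the group, so every head in the group fetches the same contiguous KV blocks. Function name and shapes here are assumptions for illustration.

```python
import numpy as np

def shared_block_selection(scores: np.ndarray, topk: int) -> np.ndarray:
    """Pick one block set for an entire GQA group.

    scores: (heads_in_group, n_blocks) per-head block-importance scores.
    Per-head top-k would make each head demand different KV blocks,
    defeating the shared KV cache; summing first keeps one fetch per group.
    """
    group_scores = scores.sum(axis=0)       # (n_blocks,)
    top = np.argsort(group_scores)[-topk:]  # indices of the k highest-scoring blocks
    return np.sort(top)                     # sorted for contiguous access
```

For example, with per-head scores `[[5, 0, 4, 0], [0, 5, 0, 4]]` and `topk=2`, the group agrees on blocks `[0, 1]`, even though the two heads individually prefer disjoint block sets.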
Use cases
- Reasoning models — chain-of-thought traces are long, and NSA lets a model "think longer" at lower cost. The appendix has a telling example: on the same competition-math problem, the NSA model reaches the correct answer in 2,275 thinking tokens while the dense baseline burns 9,392 tokens and still answers wrong.
- Long-document and codebase LLMs — anywhere inputs push past 32k–64k.
- Inference providers — an 11.6× decode speedup at 64k translates almost directly into lower serving cost.
- Pretraining teams — sparsity-aware from step zero, no retrofitting, no distillation loss.
Limitations & availability
NSA is a research paper, not a product. To get the full benefit you have to train from scratch — you cannot swap sparse attention into an existing dense checkpoint. Kernels are tuned for A100-class GPUs; H100 and Blackwell ports are not covered. The paper announces no official weight release, though community re-implementations (e.g. the FSA kernel, arXiv 2508.18224) have already appeared.
What's next
Expect three things over the next six to twelve months: Blackwell-tuned NSA kernels, NSA-style sparsity integrated into DeepSeek's production models, and a wider shift in how frontier labs think about long-context training. Section 6.1 of the paper shows two abandoned designs with their loss curves — a rare honest look at ablations — and the clear message is that sparse-from-scratch is no longer an exotic choice. It is starting to look like the default for anything past 64k.
Sources: arXiv 2502.11089, ACL Anthology, ACL 2025 Awards, 36kr.