- DeepSeek, Peking University and the University of Washington win ACL 2025 Best Paper with NSA — a sparse attention mechanism trained from scratch.
- 27B model beats dense baseline, runs 9× faster forward, 11.6× faster decoding at 64k on A100.
TL;DR
Attention is the transformer's cost center: every token compares against every other token, so compute grows with the square of sequence length. Native Sparse Attention (NSA), the ACL 2025 Best Paper from DeepSeek, Peking University and the University of Washington, trains that sparsity from scratch instead of bolting it on at inference. The result: a 27B-parameter model that matches or beats a dense baseline on general, long-context and reasoning benchmarks — with 9× forward, 6× backward, 11.6× decoding speedups at 64k context on A100.
What's new
Most sparse-attention papers take a transformer trained with full attention and prune comparisons at inference time. That pushes sparsity onto a model that never learned to work under it. Yuan et al. do the opposite: gradients flow through the block-selection mechanism during pretraining, so the model learns which blocks to attend to while it learns everything else.
The architecture uses three attention pathways per query, gated together:
- Compressed coarse-grained tokens — block-level summaries for global context.
- Fine-grained token selection — top-k blocks ranked by learned importance scores, so selection is trained rather than hand-designed.
- Sliding window — local context preserved as-is.
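Schematically, the three branches combine like this. The following is a toy single-query NumPy sketch, not the paper's kernel: mean-pooled block summaries and fixed gate values stand in for the learned compression MLP and gating network, and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    # standard scaled dot-product attention for a single query vector
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def nsa_output(q, K, V, block=4, topk=2, window=4, gates=(1/3, 1/3, 1/3)):
    """Toy NSA forward for one query.

    `gates` would come from a learned gating network in the real model;
    fixed placeholder values are used here.
    """
    T, d = K.shape
    n_blocks = T // block
    Kb = K[: n_blocks * block].reshape(n_blocks, block, d)
    Vb = V[: n_blocks * block].reshape(n_blocks, block, d)

    # 1) compressed branch: attend over coarse block summaries (mean pool as a stand-in)
    o_cmp = attend(q, Kb.mean(axis=1), Vb.mean(axis=1))

    # 2) selection branch: rank blocks by summary score, attend to top-k at token level
    scores = Kb.mean(axis=1) @ q
    sel = np.argsort(scores)[-topk:]
    o_sel = attend(q, Kb[sel].reshape(-1, d), Vb[sel].reshape(-1, d))

    # 3) sliding-window branch: the most recent `window` tokens, kept dense
    o_win = attend(q, K[-window:], V[-window:])

    g_cmp, g_sel, g_win = gates
    return g_cmp * o_cmp + g_sel * o_sel + g_win * o_win
```

In the full model each branch and the gates receive gradients during pretraining, which is the point of the "native" in the name.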
Why it matters
Sparse attention has been a graveyard of ideas that worked in papers but not in production. NSA is the first design that is hardware-aligned, end-to-end trainable, and actually faster in wall-clock — not just in FLOP counts. That combination is why the ACL committee picked it as Best Paper, and why it is likely to shape the next wave of long-context models.
Technical facts
| Property | Value |
|---|---|
| Model size | 27B parameters |
| Pretraining tokens | 260B |
| Context length tested | 64k |
| Hardware | NVIDIA A100 |
| Forward pass speedup | 9× |
| Backward pass speedup | 6× |
| Decode speedup | 11.6× |
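The decode speedup is mostly a memory story: each step reads only a sliver of the KV cache instead of all of it. A back-of-envelope count, using illustrative hyperparameters that are assumptions rather than the paper's exact settings:

```python
# Rough count of KV positions one decode step must read at 64k context.
# block / topk / window values are illustrative stand-ins, not paper config.
T = 65536          # context length
block = 64         # selection block size
topk = 16          # number of selected blocks
window = 512       # sliding-window size

dense_reads = T                                  # full attention touches every position
nsa_reads = T // block + topk * block + window   # block summaries + selected tokens + window

print(f"dense: {dense_reads}, NSA: {nsa_reads}, "
      f"ratio: {dense_reads / nsa_reads:.1f}x")
```

The raw read ratio overshoots the measured 11.6× because real decoding also pays for compute, gating and kernel overhead, but it shows why the saving grows with context length.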
Comparison — why prior sparse methods failed
The paper is refreshingly specific about what has gone wrong before:
- Clustering-based selection (routing tokens to cluster centroids) causes load imbalance in MoE systems — some clusters drown, others starve.
- Per-head selection conflicts with the shared key-value cache in grouped-query attention (GQA), where multiple query heads deliberately share one set of keys/values for memory efficiency.
- Token-level (non-block) selection breaks the contiguous memory access FlashAttention relies on for throughput. You can save FLOPs on paper and still run slower on the GPU.
NSA sidesteps all three by picking contiguous blocks, sharing selection across a GQA group, and co-designing kernels for arithmetic-intensity balance.
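The GQA fix can be sketched in a few lines: instead of each query head ranking blocks independently, NSA-style selection aggregates scores across the group, so every head in the group fetches the same contiguous KV blocks. Function name and shapes here are assumptions for illustration.

```python
import numpy as np

def shared_block_selection(scores: np.ndarray, topk: int) -> np.ndarray:
    """Pick one block set for an entire GQA group.

    scores: (heads_in_group, n_blocks) per-head block-importance scores.
    Per-head top-k would make each head demand different KV blocks,
    defeating the shared KV cache; summing first keeps one fetch per group.
    """
    group_scores = scores.sum(axis=0)       # (n_blocks,)
    top = np.argsort(group_scores)[-topk:]  # indices of the k highest-scoring blocks
    return np.sort(top)                     # sorted for contiguous access
```

For example, with per-head scores `[[5, 0, 4, 0], [0, 5, 0, 4]]` and `topk=2`, the group agrees on blocks `[0, 1]`, even though the two heads individually prefer disjoint block sets.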
Use cases
- Reasoning models — chain-of-thought traces are long, and NSA lets a model "think longer" at lower cost. The appendix has a telling example: on the same competition-math problem, the NSA model reaches the correct answer in 2,275 thinking tokens while the dense baseline burns 9,392 tokens and still answers wrong.
- Long-document and codebase LLMs — anywhere inputs push past 32k–64k.
- Inference providers — an 11.6× decode speedup at 64k translates almost directly into lower serving cost.
- Pretraining teams — sparsity-aware from step zero, no retrofitting, no distillation loss.
Limitations & availability
NSA is a research paper, not a product. To get the full benefit you have to train from scratch — you cannot swap sparse attention into an existing dense checkpoint. Kernels are tuned for A100-class GPUs; H100 and Blackwell ports are not covered. The paper announces no official weight release, though community re-implementations (e.g. the FSA kernel, arXiv 2508.18224) have already appeared.
What's next
Expect three things over the next six to twelve months: Blackwell-tuned NSA kernels, NSA-style sparsity integrated into DeepSeek's production models, and a wider shift in how frontier labs think about long-context training. Section 6.1 of the paper shows two abandoned designs with their loss curves — a rare honest look at ablations — and the clear message is that sparse-from-scratch is no longer an exotic choice. It is starting to look like the default for anything past 64k.
Sources: arXiv 2502.11089, ACL Anthology, ACL 2025 Awards, 36kr.