TL;DR

DeepSeek released the V4 Preview series on April 24, 2026 — two open-weight MoE models under MIT license: V4-Pro (1.6T total / 49B active) and V4-Flash (284B / 13B). Both ship with a native 1M-token context, three reasoning effort modes, and an API compatible with both OpenAI and Anthropic protocols.

V4-Pro is now the largest open-weights model ever released, the #1 open model on Artificial Analysis' GDPval-AA agentic benchmark (Elo 1554), and yet costs $3.48 per 1M output tokens — roughly 7x less than Claude Opus 4.6 and 8.6x less than GPT-5.5. The efficiency story is the headline: at 1M context, V4-Pro uses only 27% of V3.2's FLOPs and 10% of its KV cache, despite more than doubling total parameters.

What's new

V4 is DeepSeek's first new architecture family since V3 — every model in between (V3.1, V3.2, R1, R1-0528) shared the V3 base of 671B total / 37B active. V4 breaks that mold with two fresh MoE designs:

  • DeepSeek-V4-Pro — 1.6T total, 49B active, pre-trained on 33T tokens. Exposed as Expert Mode on chat.deepseek.com.
  • DeepSeek-V4-Flash — 284B total, 13B active, pre-trained on 32T tokens. Exposed as Instant Mode.

Both are text-in / text-out, with native 1M context, 384K max output, three reasoning effort levels (Non-Think, Think High, Think Max), and MIT-licensed weights on Hugging Face. The API is drop-in compatible with the OpenAI Chat Completions and Anthropic formats: switch base_url and the model ID, and most Claude Code / OpenCode / OpenClaw setups keep working.
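
For reference, here is a minimal sketch of that swap using the OpenAI Python SDK. The base URL follows DeepSeek's existing API host and the model IDs come from the pricing table below; how the reasoning effort levels are selected per request isn't shown here, so check the official V4 docs for that parameter.

```python
# Minimal sketch: pointing the OpenAI SDK at DeepSeek's endpoint instead of OpenAI's.
# Base URL follows DeepSeek's existing API host; model IDs are from the pricing table.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com",   # the only change vs. a stock OpenAI setup
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",               # or "deepseek-v4-flash"
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Explain the tradeoffs of MoE routing in two paragraphs."},
    ],
)

print(response.choices[0].message.content)
```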

V4-Pro at 865GB on Hugging Face is now the largest open-weights model in existence, surpassing Kimi K2.6 (1T / 32B, ~500GB INT4) and GLM-5.1 (754B BF16) in both total and active parameter counts.

Why it matters

Artificial Analysis published GDPval-AA results on day one: V4-Pro leads every open-weights peer, and V4-Flash hands DeepSeek a ~210 Elo uplift over V3.2 at a fraction of the size.

Model (Reasoning, Max)       | GDPval-AA Elo
-----------------------------|--------------
GPT-5.4 xHigh                | 1674
Claude Opus 4.6 Max          | 1619
DeepSeek V4-Pro              | 1554
GLM-5.1 Thinking             | 1535
MiniMax-M2.7                 | 1514
Kimi K2.6 Thinking           | 1484
DeepSeek V4-Flash            | 1388
DeepSeek V3.2 (Reasoning)    | 1203

V4-Pro still trails the absolute frontier (GPT-5.4, Opus 4.6) by 65-120 Elo on GDPval-AA — DeepSeek itself frames this as a 3-6 month gap — but the combination of open weights, 1M context and aggressive pricing is the real product.

Technical facts

The architecture changes are the reason the efficiency numbers look the way they do.

  • Hybrid Attention (CSA + HCA). Compressed Sparse Attention compresses the KV cache once every m tokens, adds a top-k sparse selector, and keeps a sliding window for local detail. Heavily Compressed Attention folds many more tokens into a single KV entry but keeps attention dense. Interleaving the two gives different layers both precise lookup and a broad global summary.
  • Manifold-Constrained Hyper-Connections (mHC). Replaces standard residual connections with multiple parallel streams. The mixing matrices are constrained to the Birkhoff polytope via Sinkhorn-Knopp normalization, dropping signal amplification from 3,000x to 1.6x, which is what made stable training at 1.6T parameters feasible (see the sketch after this list). The seed paper, co-authored by CEO Liang Wenfeng, was published on December 31, 2025.
  • Muon optimizer instead of AdamW for faster convergence, plus Anticipatory Routing and SwiGLU clamping to tame loss spikes at scale.
  • FP4 + FP8 mixed precision. Routed MoE expert weights and indexer Q/K paths in FP4; most other params in FP8. Pre-training used FP4 quantization-aware training.
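
To make the mHC constraint concrete, below is a toy sketch of Sinkhorn-Knopp normalization, which alternately rescales rows and columns until a positive matrix becomes (approximately) doubly stochastic, i.e. lands in the Birkhoff polytope. The stream count and iteration count are illustrative only; this is not DeepSeek's implementation.

```python
# Toy Sinkhorn-Knopp projection: maps an unconstrained square matrix to an
# (approximately) doubly stochastic one, the constraint mHC places on its
# residual-stream mixing matrices. Shapes and iteration counts are illustrative.
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    m = torch.exp(logits)                   # strictly positive entries
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)  # rows sum to 1
        m = m / m.sum(dim=0, keepdim=True)  # columns sum to 1
    return m

n_streams = 4                               # parallel residual streams (illustrative)
mixing = sinkhorn_knopp(torch.randn(n_streams, n_streams))

# Rows and columns both sum to ~1, so repeated mixing across layers can neither
# amplify nor collapse the residual signal -- the point of the 3,000x -> 1.6x drop.
print(mixing.sum(dim=0), mixing.sum(dim=1))
```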

The efficiency payoff, measured in equivalent FP8 FLOPs on the same hardware:

Model (at 1M context)  | Total | Active | FLOPs vs V3.2 | KV cache vs V3.2
-----------------------|-------|--------|---------------|-----------------
V3.2 (reference)       | 671B  | 37B    | 100%          | 100%
V4-Pro                 | 1.6T  | 49B    | 27%           | 10%
V4-Flash               | 284B  | 13B    | 10%           | 7%

V4-Pro has more total and active parameters than V3.2, yet costs roughly 3.7x less per token to serve at 1M context. That saving is entirely attention and KV-cache engineering, not brute-force compute.
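
The per-token claim is simple arithmetic over the table above; a quick sanity check, using only the FLOPs ratios reported there:

```python
# Back-of-envelope check of the serving-cost claim: if V4-Pro needs 27% of
# V3.2's FLOPs at 1M context, per-token compute drops by about 1 / 0.27 = 3.7x.
flops_vs_v32 = {"V4-Pro": 0.27, "V4-Flash": 0.10}
for model, ratio in flops_vs_v32.items():
    print(f"{model}: ~{1 / ratio:.1f}x cheaper per token than V3.2 (compute only)")
```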

Comparison

Where V4-Pro Max wins against closed frontier and other open models:

  • LiveCodeBench: 93.5 — new open-model high, ahead of Gemini-3.1-Pro (91.7) and Opus 4.6 (88.8).
  • Codeforces rating: 3206 — beats GPT-5.4 xHigh (3168), ranks roughly 23rd among human contestants per DeepSeek.
  • Terminal-Bench 2.0: 67.9 — beats Opus 4.6 (65.4) on autonomous terminal agent tasks.
  • Chinese-SimpleQA: 84.4 — beats every closed model except Gemini-3.1-Pro (85.9).
  • Apex Shortlist: 90.2. Putnam-2025: 120/120.

Where V4-Pro trails:

  • MMLU-Pro 87.5 vs Gemini 91.0, GPQA Diamond 90.1 vs 94.3 — raw knowledge recall.
  • SimpleQA-Verified 57.9 vs Gemini 75.6 — English factual knowledge gap is real.
  • MRCR 1M 83.5 vs Opus 4.6 92.9 — V4 makes 1M affordable, but Opus still holds the absolute long-context retrieval crown.
  • SWE-Pro 55.4 vs Kimi K2.6 58.6 — long-horizon codebase resolution is K2.6's edge.

Use cases

Pick V4-Pro if you run high-volume agentic coding, competitive-programming-style code generation, long-context RAG over codebases or legal/financial corpora, or Chinese-first products. A pipeline processing 50M output tokens/month costs $174 on V4-Pro vs $1,250 on Opus 4.6. MIT license means you can self-host for data sovereignty — provided you have multi-node H100/H200 infra for a 1.6T MoE.
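
That monthly figure is straight multiplication of the per-1M output prices; a quick sketch of the comparison, where the Opus 4.6 price is back-calculated from the article's ~7x ratio rather than taken from Anthropic's price list:

```python
# Reproducing the 50M-token/month comparison. V4 prices are from DeepSeek's
# pricing table below; the Opus 4.6 output price (~$25/1M) is inferred from the
# ~7x / ~89x ratios quoted in this article, not from an official price list.
monthly_output_tokens = 50_000_000
output_price_per_1m = {
    "deepseek-v4-pro": 3.48,
    "deepseek-v4-flash": 0.28,
    "claude-opus-4.6 (inferred)": 25.00,
}

for model, price in output_price_per_1m.items():
    cost = monthly_output_tokens / 1_000_000 * price
    print(f"{model}: ${cost:,.0f}/month")   # ~$174, ~$14, ~$1,250
```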

Pick V4-Flash if you need tiered routing (Flash for drafts and simple queries, Pro for complex reasoning), high-frequency lower-complexity coding, or previously uneconomical long-document workflows. At $0.28 per 1M output tokens, Flash is ~89x cheaper than Opus 4.6 while scoring 79.0 on SWE-bench Verified (V4-Pro: 80.6).
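
A tiered setup can be as simple as a thin router in front of the two model IDs. The sketch below is hypothetical: the length cutoff and keyword heuristic are placeholders rather than a recommended policy, and the endpoint details are the same assumptions as in the earlier API example.

```python
# Hypothetical tiered routing: cheap/simple requests go to V4-Flash, long or
# hard ones escalate to V4-Pro. The heuristic and thresholds are placeholders.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

def pick_model(prompt: str) -> str:
    long_context = len(prompt) > 200_000            # crude character-count proxy
    hard = any(k in prompt.lower() for k in ("prove", "refactor", "multi-step", "debug"))
    return "deepseek-v4-pro" if (long_context or hard) else "deepseek-v4-flash"

def complete(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=pick_model(prompt),
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```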

Stay on Claude or Gemini if your workload demands absolute factual accuracy (SimpleQA), expert cross-domain reasoning (HLE), high-precision long-context retrieval (MRCR 1M), or you're in regulated industries where China-hosted API routing is a compliance blocker.

Limitations & pricing

Official DeepSeek first-party API pricing:

Model              | Input (cache miss) | Input (cache hit) | Output
-------------------|--------------------|-------------------|------------
deepseek-v4-flash  | $0.14 / 1M         | $0.028 / 1M       | $0.28 / 1M
deepseek-v4-pro    | $1.74 / 1M         | $0.145 / 1M       | $3.48 / 1M
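
Because cache hits are priced well below cache misses, the effective input cost depends heavily on the prompt-cache hit rate; a small illustration, with the 60% hit rate chosen arbitrarily:

```python
# Blended input price as a function of prompt-cache hit rate. Prices are from the
# table above (USD per 1M input tokens); the 60% hit rate is purely illustrative.
def blended_input_price(miss: float, hit: float, hit_rate: float) -> float:
    return hit_rate * hit + (1 - hit_rate) * miss

print(blended_input_price(miss=1.74, hit=0.145, hit_rate=0.6))   # V4-Pro:   ~$0.78 / 1M
print(blended_input_price(miss=0.14, hit=0.028, hit_rate=0.6))   # V4-Flash: ~$0.073 / 1M
```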

Known caveats:

  • Text-only I/O — no multimodal, same as V3.2.
  • Preview release; DeepSeek has flagged further post-training refinements before a final V4 branding.
  • No Jinja chat template — use the Python encoding scripts (encoding_dsv4.py) shipped in the Hugging Face repo.
  • Not yet on AWS Bedrock or Azure — only DeepSeek's API or self-host.
  • Server infrastructure is China-based; consider self-hosting for regulated data.

What's next

DeepSeek has already set the deprecation clock: the legacy deepseek-chat and deepseek-reasoner endpoints (currently routing to V4-Flash) retire on July 24, 2026, 15:59 UTC. Community GGUF quantizations are expected within days; third-party aggregators (OpenRouter, ofox) are rolling out V4 support imminently. A finalized non-Preview V4 is the likely next milestone.

Sources: DeepSeek API Docs, DeepSeek-V4-Pro on Hugging Face, Simon Willison, Artificial Analysis, Digital Applied, Build Fast With AI.