TL;DR

On 23 April 2026, engineer Mitko Vasilev (@iotcoi) posted a benchmark of the brand-new Qwen3.6-27B-FP8 running with the DFlash + DDTree speculative-decoding stack on a single NVIDIA GB10 (DGX Spark). The numbers: ~200 tok/s peak, 136 tok/s average, 256k context, 10 concurrent agents, 49 W. Put differently — a flagship-level 27B dense coder, at laptop-charger power, handling repo-scale agent workflows locally.

What's new

Three things collided within 24 hours:

  • Qwen3.6-27B shipped on 22 April 2026. The Qwen team positions it as flagship-level agentic coding in a 27B dense model — beating the prior open-source king Qwen3.5-397B-A17B (807 GB on disk) across coding benchmarks, while weighing just 55.6 GB. Native 262,144-token context, extensible toward 1M. The FP8 variant landed same-day.
  • DFlash + DDTree, z-lab's block-diffusion speculative-decoding stack, already supports FP8 Qwen weights. DFlash drafts whole token blocks in parallel; DDTree turns that into a verified draft tree.
  • NVIDIA GB10 (DGX Spark) — Grace Blackwell superchip, 128 GB unified memory, desktop form factor — finally has software that fully saturates its FP8 tensor cores on a sub-30B dense model.

The result is the first credible demonstration that you can run a frontier-grade open coder at genuine interactive speed on a box that fits on a desk and runs off a wall socket.

Why it matters

Throughput is only half the story. The tok/s-per-watt number is where this gets strange: 136 tok/s at 49 W ≈ 2.8 tok/s per watt. A tuned RTX 3090 on the same 27B family peaks around 207 tok/s but pulls 300–350 W — roughly 0.6 tok/s per watt. That's a ~5× efficiency gap in favour of Blackwell FP8 + speculative decoding.
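
A quick check of the per-watt arithmetic, taking the 3090 at the top of its 300–350 W range:

```python
# Per-watt comparison using the figures cited above (3090 taken at 350 W).
gb10_eff = 136 / 49    # ~2.8 tok/s per watt: GB10 average throughput over reported GPU power
rtx_eff = 207 / 350    # ~0.6 tok/s per watt: RTX 3090 peak throughput over peak draw
print(f"GB10 {gb10_eff:.2f} tok/s/W vs 3090 {rtx_eff:.2f} tok/s/W -> ~{gb10_eff / rtx_eff:.0f}x")
```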

For solo builders and small teams, this is the first setup where local 256k-context, 10-agent workflows are actually practical — without an API bill, without a rate limit, without a data-egress conversation with legal.

Technical facts

| Config | Hardware | Avg tok/s | Peak tok/s | Power | Notes |
|---|---|---|---|---|---|
| Qwen3.6-27B-FP8 + DFlash + DDTree | Single GB10 | 136 | ~200 | 49 W | 256k ctx, 10 agents |
| Qwen3.5-27B-AWQ + DFlash + DDTree (prior run) | GB10, CUDA 13.1 | 91 | 115 (drafted) | – | Queue of 107 requests |
| Qwen3.5-27B | RTX 3090 | – | 207 | 300–350 W | Single-user |
| Qwen 2.5 72B / Llama 3.2 90B | GB10 | 4.6 | – | – | No spec decoding |

A few details worth calling out:

  • DFlash reports up to 6× lossless speedup on Qwen3-8B versus autoregressive, and roughly 2.5× faster than EAGLE-3 — because it injects target-model features into every draft layer's KV cache, not just the first.
  • DDTree builds a best-first tree of drafts from the block-diffusion logits and verifies the whole tree in one target forward pass using an ancestor-only attention mask (a minimal sketch of that mask follows this list). On code it adds another ~10–15% over DFlash.
  • Going from Qwen3.5-27B-AWQ to Qwen3.6-27B-FP8 on the same GB10 bumped average throughput from 91 → 136 tok/s (+50%). Blackwell's FP8 path likes the new weights.
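
To make the DDTree bullet concrete, here is a minimal sketch of the ancestor-only masking idea, not z-lab's implementation; the function name and the toy tree are illustrative. Flatten the draft tree into one sequence, then let each node attend only to the prompt and to its own ancestors, so the target model can score every branch in a single forward pass.

```python
# Minimal sketch (not z-lab's code) of ancestor-only attention for tree verification.
import torch

def ancestor_only_mask(parents: list[int], prompt_len: int) -> torch.Tensor:
    """parents[i] is the parent index of draft node i in the flattened tree (-1 for roots).
    Returns a bool mask of shape (num_nodes, prompt_len + num_nodes); True = attention allowed."""
    n = len(parents)
    mask = torch.zeros(n, prompt_len + n, dtype=torch.bool)
    mask[:, :prompt_len] = True              # every draft token sees the full prompt
    for i, p in enumerate(parents):
        mask[i, prompt_len + i] = True       # ...and itself
        while p != -1:                       # ...and each ancestor, but never a sibling branch
            mask[i, prompt_len + p] = True
            p = parents[p]
    return mask

# Toy tree: two root candidates (0, 1); nodes 2 and 3 continue node 0; node 4 continues node 1.
print(ancestor_only_mask(parents=[-1, -1, 0, 0, 1], prompt_len=3).int())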

Comparison

Two framings make the jump concrete:

  1. Size-for-size vs. last generation. Qwen3.5-397B-A17B required multi-GPU inference and 807 GB of weights. Qwen3.6-27B-FP8 fits in unified memory on a single GB10, wins on coding benchmarks, and serves 10 agents concurrently at 256k context.
  2. Watt-for-watt vs. discrete GPUs. A 3090-class card can match peak tok/s, but at 6–7× the power draw and without the 128 GB unified memory that lets you actually load a 256k-context workload plus KV cache plus 10 parallel streams (a rough KV-cache estimate follows this list).
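
To see why unified memory matters, here is a back-of-envelope KV-cache estimate under hypothetical hyperparameters (the real Qwen3.6-27B config may differ): 60 layers, 8 GQA key/value heads of dimension 128, and a 1-byte FP8 cache.

```python
# Hypothetical KV-cache sizing; layer count, KV heads, and head_dim are assumptions,
# not published Qwen3.6-27B numbers.
layers, kv_heads, head_dim, bytes_per_elem = 60, 8, 128, 1
ctx = 262_144                                                  # full 256k window

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V, all layers
per_stream_gb = per_token * ctx / 1e9
print(f"{per_token / 1024:.0f} KiB per token, ~{per_stream_gb:.0f} GB per full 256k stream")
```

Under those assumptions a single full-context stream costs roughly 32 GB of KV cache on top of the model weights, already more than any 24 GB discrete card can hold; 128 GB of unified memory is what leaves headroom for several long streams and prefix sharing across agents.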

Use cases

  • Local agentic coding: 10 agents × 256k context = whole-repo reasoning loops (plan → edit → test → iterate) running offline, at interactive speed; a minimal fan-out sketch follows this list.
  • Long-context workflows: feed an entire codebase or a multi-hour transcript into one window without chunking tricks.
  • Regulated / air-gapped environments: a 49 W desktop box that matches cloud-API throughput is a compliance dream for finance, healthcare, and gov workloads.
  • Speculative-decoding R&D: DFlash + DDTree on Blackwell FP8 is now a clean baseline for squeezing more from dense models before jumping to MoE.
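
For a sense of what the 10-agent workload looks like from the client side, here is a hypothetical sketch: ten independent plan-edit-test loops fanned out against a local OpenAI-compatible endpoint. The URL, model name, and prompts are placeholders, not details from the benchmark.

```python
# Hypothetical fan-out of 10 agent loops against a local OpenAI-compatible server.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder endpoint

async def agent_loop(agent_id: int, task: str, rounds: int = 3) -> str:
    messages = [{"role": "system", "content": "You are a repo-scale coding agent."},
                {"role": "user", "content": task}]
    reply = ""
    for _ in range(rounds):                                    # plan -> edit -> test -> iterate
        resp = await client.chat.completions.create(
            model="qwen3.6-27b-fp8", messages=messages, max_tokens=1024)
        reply = resp.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": "Run the tests and fix any failures."})
    return f"agent {agent_id}: {reply[:80]}"

async def main() -> None:
    tasks = [agent_loop(i, f"Refactor module_{i}.py and add tests.") for i in range(10)]
    for result in await asyncio.gather(*tasks):                # 10 concurrent streams
        print(result)

asyncio.run(main())
```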

Limitations & pricing

  • Self-reported. The 200 / 136 / 49 W numbers come from a single run by one engineer on X. Independent reproduction — and full-suite benchmarks — are pending.
  • 49 W is GPU-only. Full DGX Spark wall draw is higher once CPU, memory, and PSU overhead are counted. Still a fraction of a discrete-GPU server.
  • Speculative decoding is lossless but uneven. Effective speedup scales with draft-acceptance rate; code and structured outputs accept well, freeform prose less so (see the back-of-envelope formula after this list).
  • Availability. Qwen3.6-27B-FP8 is open-weights on Hugging Face. DGX Spark Founders Edition (4 TB) lists around US$3,999.
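
A back-of-envelope way to see why acceptance rate dominates: using the sequential speculative-decoding expectation from Leviathan et al. (2023) rather than the tree variant, a draft of length γ with per-token acceptance rate α yields on average (1 - α^(γ+1)) / (1 - α) tokens per target forward pass.

```python
# Expected tokens per target pass for sequential speculative decoding
# (Leviathan et al., 2023); tree drafting does better, but the dependence
# on acceptance rate has the same shape.
def expected_accepted(alpha: float, gamma: int) -> float:
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):      # illustrative: prose-like vs. code-like acceptance
    print(f"alpha={alpha}: ~{expected_accepted(alpha, gamma=8):.1f} tokens per target pass")
```

At α ≈ 0.9 a draft of 8 clears about six tokens per verification; at α ≈ 0.6 barely two and a half, which is why the same stack feels much faster on code than on freeform prose.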

What's next

Expect three fast-follow trends: (1) Ollama and LM Studio drops of Qwen3.6-27B quants within days, bringing this stack to laptops with 64 GB of VRAM-equivalent. (2) DDTree node-budget tuning specifically for FP8 Blackwell to push average closer to the 200 tok/s peak. (3) Broader third-party benchmarks from the DGX Spark community — including the 35B-A3B MoE sibling — which should tell us whether dense-27B-FP8 or MoE is the right shape for this hardware class.

For now, a single takeaway: the bar for “local flagship-grade coding” moved this week, and it moved on a 49-watt desktop.

Sources: @iotcoi on X, Qwen blog, z-lab DFlash, NVIDIA DGX Spark.