TL;DR

On 23 April 2026, engineer Mitko Vasilev (@iotcoi) posted a benchmark of the brand-new Qwen3.6-27B-FP8 running with the DFlash + DDTree speculative-decoding stack on a single NVIDIA GB10 (DGX Spark). The numbers: ~200 tok/s peak, 136 tok/s average, 256k context, 10 concurrent agents, 49 W. Put differently — a flagship-level 27B dense coder, at laptop-charger power, handling repo-scale agent workflows locally.

What's new

Three things collided within 24 hours:

  • Qwen3.6-27B shipped on 22 April 2026. The Qwen team positions it as flagship-level agentic coding in a 27B dense model — beating the prior open-source king Qwen3.5-397B-A17B (807 GB on disk) across coding benchmarks, while weighing just 55.6 GB. Native 262,144-token context, extensible toward 1M. The FP8 variant landed same-day.
  • DFlash + DDTree, z-lab's block-diffusion speculative-decoding stack, already supports FP8 Qwen weights. DFlash drafts whole token blocks in parallel; DDTree turns that into a verified draft tree.
  • NVIDIA GB10 (DGX Spark) — Grace Blackwell superchip, 128 GB unified memory, desktop form factor — finally has software that fully saturates its FP8 tensor cores on a sub-30B dense model.

The result is the first credible demonstration that you can run a frontier-grade open coder at genuine interactive speed on a box that fits on a desk and runs off a wall socket.

Why it matters

Throughput is only half the story. The tok/s-per-watt number is where this gets strange: 136 tok/s at 49 W ≈ 2.8 tok/s per watt. A tuned RTX 3090 on the same 27B family peaks around 207 tok/s but pulls 300–350 W — roughly 0.6 tok/s per watt. That's a ~5× efficiency gap in favour of Blackwell FP8 + speculative decoding.
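
A quick check of the per-watt arithmetic, taking the 3090 at the top of its 300–350 W range:

```python
# Per-watt comparison using the figures cited above (3090 taken at 350 W).
gb10_eff = 136 / 49    # ~2.8 tok/s per watt: GB10 average throughput over reported GPU power
rtx_eff = 207 / 350    # ~0.6 tok/s per watt: RTX 3090 peak throughput over peak draw
print(f"GB10 {gb10_eff:.2f} tok/s/W vs 3090 {rtx_eff:.2f} tok/s/W -> ~{gb10_eff / rtx_eff:.0f}x")
```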

For solo builders and small teams, this is the first setup where local 256k-context, 10-agent workflows are actually practical — without an API bill, without a rate limit, without a data-egress conversation with legal.

Technical facts

| Config | Hardware | Avg tok/s | Peak tok/s | Power | Notes |
|---|---|---|---|---|---|
| Qwen3.6-27B-FP8 + DFlash + DDTree | Single GB10 | 136 | ~200 | 49 W | 256k ctx, 10 agents |
| Qwen3.5-27B-AWQ + DFlash + DDTree (prior run) | GB10, CUDA 13.1 | 91 | 115 (drafted) | – | Queue of 107 requests |
| Qwen3.5-27B | RTX 3090 | – | 207 | 300–350 W | Single-user |
| Qwen 2.5 72B / Llama 3.2 90B | GB10 | 4.6 | – | – | No spec decoding |

A few details worth calling out:

  • DFlash reports up to 6× lossless speedup on Qwen3-8B versus autoregressive, and roughly 2.5× faster than EAGLE-3 — because it injects target-model features into every draft layer's KV cache, not just the first.
  • DDTree builds a best-first tree of drafts from the block-diffusion logits and verifies the whole tree in one target forward pass using an ancestor-only attention mask (a minimal sketch of that mask follows this list). On code it adds another ~10–15% over DFlash.
  • Going from Qwen3.5-27B-AWQ to Qwen3.6-27B-FP8 on the same GB10 bumped average throughput from 91 → 136 tok/s (+50%). Blackwell's FP8 path likes the new weights.
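
To make the DDTree bullet concrete, here is a minimal sketch of the ancestor-only masking idea, not z-lab's implementation; the function name and the toy tree are illustrative. Flatten the draft tree into one sequence, then let each node attend only to the prompt and to its own ancestors, so the target model can score every branch in a single forward pass.

```python
# Minimal sketch (not z-lab's code) of ancestor-only attention for tree verification.
import torch

def ancestor_only_mask(parents: list[int], prompt_len: int) -> torch.Tensor:
    """parents[i] is the parent index of draft node i in the flattened tree (-1 for roots).
    Returns a bool mask of shape (num_nodes, prompt_len + num_nodes); True = attention allowed."""
    n = len(parents)
    mask = torch.zeros(n, prompt_len + n, dtype=torch.bool)
    mask[:, :prompt_len] = True              # every draft token sees the full prompt
    for i, p in enumerate(parents):
        mask[i, prompt_len + i] = True       # ...and itself
        while p != -1:                       # ...and each ancestor, but never a sibling branch
            mask[i, prompt_len + p] = True
            p = parents[p]
    return mask

# Toy tree: two root candidates (0, 1); nodes 2 and 3 continue node 0; node 4 continues node 1.
print(ancestor_only_mask(parents=[-1, -1, 0, 0, 1], prompt_len=3).int())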

Comparison

Two framings make the jump concrete:

  1. Size-for-size vs. last generation. Qwen3.5-397B-A17B required multi-GPU inference and 807 GB of weights. Qwen3.6-27B-FP8 fits in unified memory on a single GB10, wins on coding benchmarks, and serves 10 agents concurrently at 256k context.
  2. Watt-for-watt vs. discrete GPUs. A 3090-class card can match peak tok/s, but at 6–7× the power draw and without the 128 GB unified memory that lets you actually load a 256k-context workload plus KV cache plus 10 parallel streams (a rough KV-cache estimate follows this list).
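
To see why unified memory matters, here is a back-of-envelope KV-cache estimate under hypothetical hyperparameters (the real Qwen3.6-27B config may differ): 60 layers, 8 GQA key/value heads of dimension 128, and a 1-byte FP8 cache.

```python
# Hypothetical KV-cache sizing; layer count, KV heads, and head_dim are assumptions,
# not published Qwen3.6-27B numbers.
layers, kv_heads, head_dim, bytes_per_elem = 60, 8, 128, 1
ctx = 262_144                                                  # full 256k window

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V, all layers
per_stream_gb = per_token * ctx / 1e9
print(f"{per_token / 1024:.0f} KiB per token, ~{per_stream_gb:.0f} GB per full 256k stream")
```

Under those assumptions a single full-context stream costs roughly 32 GB of KV cache on top of the model weights, already more than any 24 GB discrete card can hold; 128 GB of unified memory is what leaves headroom for several long streams and prefix sharing across agents.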

Use cases

  • Local agentic coding: 10 agents × 256k context = whole-repo reasoning loops (plan → edit → test → iterate) running offline, at interactive speed; a minimal fan-out sketch follows this list.
  • Long-context workflows: feed an entire codebase or a multi-hour transcript into one window without chunking tricks.
  • Regulated / air-gapped environments: a 49 W desktop box that matches cloud-API throughput is a compliance dream for finance, healthcare, and gov workloads.
  • Speculative-decoding R&D: DFlash + DDTree on Blackwell FP8 is now a clean baseline for squeezing more from dense models before jumping to MoE.
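
For a sense of what the 10-agent workload looks like from the client side, here is a hypothetical sketch: ten independent plan-edit-test loops fanned out against a local OpenAI-compatible endpoint. The URL, model name, and prompts are placeholders, not details from the benchmark.

```python
# Hypothetical fan-out of 10 agent loops against a local OpenAI-compatible server.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder endpoint

async def agent_loop(agent_id: int, task: str, rounds: int = 3) -> str:
    messages = [{"role": "system", "content": "You are a repo-scale coding agent."},
                {"role": "user", "content": task}]
    reply = ""
    for _ in range(rounds):                                    # plan -> edit -> test -> iterate
        resp = await client.chat.completions.create(
            model="qwen3.6-27b-fp8", messages=messages, max_tokens=1024)
        reply = resp.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": "Run the tests and fix any failures."})
    return f"agent {agent_id}: {reply[:80]}"

async def main() -> None:
    tasks = [agent_loop(i, f"Refactor module_{i}.py and add tests.") for i in range(10)]
    for result in await asyncio.gather(*tasks):                # 10 concurrent streams
        print(result)

asyncio.run(main())
```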

Limitations & pricing

  • Self-reported. The 200 / 136 / 49 W numbers come from a single run by one engineer on X. Independent reproduction — and full-suite benchmarks — are pending.
  • 49 W is GPU-only. Full DGX Spark wall draw is higher once CPU, memory, and PSU overhead are counted. Still a fraction of a discrete-GPU server.
  • Speculative decoding is lossless but uneven. Effective speedup scales with draft-acceptance rate; code and structured outputs accept well, freeform prose less so (see the back-of-envelope formula after this list).
  • Availability. Qwen3.6-27B-FP8 is open-weights on Hugging Face. DGX Spark Founders Edition (4 TB) lists around US$3,999.
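
A back-of-envelope way to see why acceptance rate dominates: using the sequential speculative-decoding expectation from Leviathan et al. (2023) rather than the tree variant, a draft of length γ with per-token acceptance rate α yields on average (1 - α^(γ+1)) / (1 - α) tokens per target forward pass.

```python
# Expected tokens per target pass for sequential speculative decoding
# (Leviathan et al., 2023); tree drafting does better, but the dependence
# on acceptance rate has the same shape.
def expected_accepted(alpha: float, gamma: int) -> float:
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):      # illustrative: prose-like vs. code-like acceptance
    print(f"alpha={alpha}: ~{expected_accepted(alpha, gamma=8):.1f} tokens per target pass")
```

At α ≈ 0.9 a draft of 8 clears about six tokens per verification; at α ≈ 0.6 barely two and a half, which is why the same stack feels much faster on code than on freeform prose.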

What's next

Expect three fast-follow trends: (1) Ollama and LM Studio drops of Qwen3.6-27B quants within days, bringing this stack to laptops with 64 GB of VRAM-equivalent. (2) DDTree node-budget tuning specifically for FP8 Blackwell to push average closer to the 200 tok/s peak. (3) Broader third-party benchmarks from the DGX Spark community — including the 35B-A3B MoE sibling — which should tell us whether dense-27B-FP8 or MoE is the right shape for this hardware class.

For now, a single takeaway: the bar for “local flagship-grade coding” moved this week, and it moved on a 49-watt desktop.

Sources: @iotcoi on X, Qwen blog, z-lab DFlash, NVIDIA DGX Spark.