DFlash for Qwen3.6-35B-A3B reaches GA: speculative decoding up to 2.9× faster with a drafter of only 0.5B parameters

TL;DR

DFlash for Qwen3.6-35B-A3B has moved from day-1 preview to GA: training is complete, validation has passed, and the weights are finalized. The drafter has only 0.5B parameters (BF16) and pairs with the target model Qwen3.6-35B-A3B for speculative decoding. On an NVIDIA B200 with SGLang, DFlash reaches a 2.9× speedup at concurrency 1 on Math500 (234 → 682 tok/s) and 6,520 tok/s at concurrency 32, all lossless. The code is on GitHub at z-lab/dflash and the weights are on HuggingFace.

What's new

On 21 April 2026, Zhijian Liu (Z Lab) announced the official weights for the DFlash drafter for Qwen3.6-35B-A3B. Notably, the community had been running the preview build since day 1, before training had even finished, because Z Lab pushed a preview drafter at the same time the Qwen team released Qwen3.6-35B-A3B. Training is now done, validation has passed, and the weights are finalized.

The release ships with three things:

0.5B-parameter BF16 drafter weights on HF (z-lab/Qwen3.6-35B-A3B-DFlash)
Integration for SGLang (production-ready) and vLLM (nightly build)
Sliding-window attention for the drafter via the --speculative-dflash-draft-window-size flag, which bounds KV growth for agentic and long-context workloads
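
To make "bounding KV growth" concrete, here is a minimal Python sketch. It is only a toy: the function name kv_entries_over_time and the window of 2048 are assumptions chosen for illustration, and it does not reproduce how SGLang or DFlash implement the window behind --speculative-dflash-draft-window-size.

```python
from collections import deque
from typing import List, Optional


def kv_entries_over_time(num_tokens: int, window_size: Optional[int] = None) -> List[int]:
    """Return the number of cached entries after each token is processed.

    window_size=None models a full-attention cache that grows without bound;
    an integer models a sliding window that keeps only the most recent entries.
    """
    cache = deque(maxlen=window_size)  # maxlen=None means unbounded
    sizes = []
    for position in range(num_tokens):
        cache.append(position)  # stand-in for one token's (key, value) pair
        sizes.append(len(cache))
    return sizes


full = kv_entries_over_time(10_000)                       # grows linearly with context
windowed = kv_entries_over_time(10_000, window_size=2048)  # plateaus at the window size
print(full[-1], windowed[-1])
```

With a window, the drafter's cache stops growing once the context exceeds the window size, which is the property that matters for long agentic sessions.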

Why it matters

Speculative decoding is bottlenecked by the drafter. EAGLE-3, the current SoTA, drafts tokens autoregressively, one token per step, so it is forced to use an extremely shallow architecture (1 transformer layer) to keep latency low. As a result, real-world speedup is capped at around 2-3×.

DFlash flips the problem: the drafter is a block diffusion model that generates a whole block of tokens in a single forward pass. Drafting cost is flat in the number of drafted tokens, so the drafter can afford a deeper, more expressive architecture without paying for it in latency. The key insight: instead of making the drafter reason from scratch, DFlash extracts hidden features from several layers of the target model, fuses them through a projection, and injects them directly into the K/V cache of every draft layer, rather than feeding them only into the first input layer as EAGLE-3 does. The signal is not diluted, and acceptance length scales with depth.
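
The draft-then-verify loop that block drafting plugs into can be sketched in a few lines of Python. This is a toy under stated assumptions, not DFlash's implementation: target_next and draft_block are hypothetical stand-ins (the "drafter" merely imitates the target and is made wrong on purpose at some positions), decoding is greedy, and verification is written as a sequential loop for clarity, whereas a real engine scores all draft positions in one parallel target pass.

```python
from typing import List, Tuple

VOCAB = 50


def target_next(ctx: List[int]) -> int:
    """Hypothetical target model: deterministic greedy next token."""
    return (sum(ctx) * 31 + len(ctx)) % VOCAB


def draft_block(ctx: List[int], block_size: int) -> List[int]:
    """Hypothetical block drafter: proposes block_size tokens in one call.

    It imitates the target but is deliberately wrong whenever the running
    context length is a multiple of 5, to exercise the rejection path.
    """
    proposal, tmp = [], list(ctx)
    for _ in range(block_size):
        tok = target_next(tmp)
        if len(tmp) % 5 == 0:
            tok = (tok + 1) % VOCAB  # injected drafting error
        proposal.append(tok)
        tmp.append(tok)
    return proposal


def speculative_generate(prompt: List[int], n_new: int, block_size: int) -> Tuple[List[int], List[int]]:
    """Greedy speculative decoding; returns (new tokens, tokens emitted per step)."""
    ctx, per_step = list(prompt), []
    while len(ctx) - len(prompt) < n_new:
        n_ok = 0
        for tok in draft_block(ctx, block_size):
            if target_next(ctx) != tok:
                break  # first mismatch: discard the rest of the block
            ctx.append(tok)
            n_ok += 1
        # The target's own token at the mismatch (or after a fully accepted block).
        ctx.append(target_next(ctx))
        per_step.append(n_ok + 1)
    return ctx[len(prompt):len(prompt) + n_new], per_step


def autoregressive_generate(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    ctx = list(prompt)
    for _ in range(n_new):
        ctx.append(target_next(ctx))
    return ctx[len(prompt):]


prompt = [3, 1, 4]
spec_out, steps = speculative_generate(prompt, n_new=40, block_size=8)
ar_out = autoregressive_generate(prompt, n_new=40)
print(spec_out == ar_out, len(steps))
```

Because every emitted token is either confirmed or produced by the target, the output matches plain autoregressive decoding exactly; the gain comes from needing fewer verification steps than tokens. The longer the accepted prefix per block, the fewer steps are needed, which is why acceptance length is the number to watch.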

Technical facts

Drafter size: 0.5B params, BF16, compared with 7B in DiffuSpec/SpecDiff-2
Reuse: the target's embedding and LM head are reused; only a few intermediate layers are trained
Drafting: 1 denoising step, block size 8 or 16
Benchmark rig: 1× NVIDIA B200, SGLang, thinking enabled, max output 4096

Throughput Qwen3.6-35B-A3B + DFlash (block size 16):

Task	Concurrency	AR baseline (tok/s)	DFlash (tok/s)	Speedup
Math500	1	234	682	2.9×
Math500	32	2,755	6,520	2.4×
HumanEval	1	238	603	2.5×
GSM8K	1	235	556	2.4×
MBPP	1	235	559	2.4×
MT-Bench	1	233	442	1.9×

Acceptance length (tok/block, B=16): Math500 7.35, GSM8K 6.73, HumanEval 6.44. Reasoning mode (temp=1 + thinking) still retains ~4.5× acceleration.
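
As a sanity check on the throughput table, the speedup column can be recomputed from the raw tok/s figures. The Python sketch below copies the numbers from the table and the acceptance lengths from the line above; the last column, speedup divided by acceptance length, is an illustrative quantity and not a metric the source reports. It gives a rough sense of how much of the per-block gain survives drafting and verification overhead.

```python
# (task, concurrency, AR tok/s, DFlash tok/s, reported speedup), copied from the table
ROWS = [
    ("Math500", 1, 234, 682, 2.9),
    ("Math500", 32, 2755, 6520, 2.4),
    ("HumanEval", 1, 238, 603, 2.5),
    ("GSM8K", 1, 235, 556, 2.4),
    ("MBPP", 1, 235, 559, 2.4),
    ("MT-Bench", 1, 233, 442, 1.9),
]

# Acceptance length in tokens per block at B=16, as quoted in the text
ACCEPT_LEN = {"Math500": 7.35, "GSM8K": 6.73, "HumanEval": 6.44}


def speedup(ar_tps: float, dflash_tps: float) -> float:
    """Throughput ratio of DFlash over the autoregressive baseline."""
    return dflash_tps / ar_tps


for task, conc, ar, df, reported in ROWS:
    s = speedup(ar, df)
    line = f"{task:10s} c={conc:<2d} computed {s:.2f}x (reported {reported}x)"
    if conc == 1 and task in ACCEPT_LEN:
        # Illustrative only: fraction of the acceptance length realised as speedup
        line += f", speedup / acceptance length = {s / ACCEPT_LEN[task]:.2f}"
    print(line)
```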

Comparison

DFlash vs EAGLE-3 (Qwen3-8B, greedy decoding): DFlash delivers roughly 2.1× to 2.8× the speedup of EAGLE-3 across the benchmarks below, and more than 2.5× on MATH-500 and AIME24.

Benchmark	EAGLE-3	DFlash
MATH-500	2.18×	6.17×
GSM8K	2.13×	5.20×
HumanEval	2.48×	5.20×
AIME24	2.25×	5.91×
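
The relative advantage over EAGLE-3 follows directly from the table above by dividing the two speedup columns. The short Python sketch below does that arithmetic; the figures are copied from the table and nothing else is assumed.

```python
# (benchmark, EAGLE-3 speedup, DFlash speedup), copied from the table
EAGLE3_VS_DFLASH = [
    ("MATH-500", 2.18, 6.17),
    ("GSM8K", 2.13, 5.20),
    ("HumanEval", 2.48, 5.20),
    ("AIME24", 2.25, 5.91),
]

# How many times larger DFlash's speedup is than EAGLE-3's on each benchmark
ratios = {name: dflash / eagle for name, eagle, dflash in EAGLE3_VS_DFLASH}

for name, ratio in ratios.items():
    print(f"{name:10s} DFlash / EAGLE-3 = {ratio:.2f}")

above_2_5 = [name for name, ratio in ratios.items() if ratio > 2.5]
print("Benchmarks above 2.5x:", above_2_5)
```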

DFlash vs MTP (Qwen3.5-35B-A3B, c=1, B=16): on Math500, DFlash reaches 681 tok/s (2.8×) vs 420 tok/s (1.7×) for MTP; on HumanEval, DFlash reaches 662 tok/s (2.8×) vs 404 tok/s (1.7×) for MTP. Acceptance length is also higher (7.20 vs 6.93 on Math500).

Vs DiffuSpec/SpecDiff-2: these earlier drafters need 7B params, which is impractical for serving. DFlash reaches comparable quality with 0.5B by fusing target features.

Use cases

Serving platforms: native in vLLM and SGLang, scaling to 6,500+ tok/s at concurrency 32 without doubling the GPU count. The 0.5B drafter adds little VRAM overhead.
Agentic workloads: the sliding-window drafter bounds KV growth for long-context agents. It pairs well with Qwen3.6 thinking preservation to reduce token redundancy across multiple tool-call turns.
Coding / dev tools: Qwen3.6-35B-A3B has been upgraded for agentic coding and repo-level reasoning. DFlash adds speed on top: acceptance rate reaches up to 93.3% on targeted code (for example, quicksort) with thinking off. It integrates with Qwen Code CLI and Qwen-Agent.
Apple Silicon: the MLX backend has been tested on an M5 Pro, running locally with Qwen3/Qwen3.5.

Limitations & pricing

License: MIT for DFlash, Apache-2.0 for the Qwen3.6 target. Completely free and self-hosted.
Hybrid target penalty: speedup is limited with hybrid targets (Qwen3.5, Jamba) because recurrent state cannot be decomposed by token position, so a rejected draft requires rerunning the target forward pass.
vLLM: integration is in progress and requires a nightly build. The production-stable path is SGLang.
llama.cpp: PR #22105 is still a draft. The DFlash decoder rebuilds the graph every iteration (no draft-side KV reuse yet), and speculators-format drafts (reduced-vocab) hit a tensor-mapping bug. The Z-lab-format Qwen3.6-35B-A3B-DFlash converts cleanly.
Hardware tested: NVIDIA B200, L40S 48GB, 2× RTX 3090 (Ampere, CUDA 12.9), Apple M5 Pro.
Status: Qwen3.6-35B-A3B-DFlash is still marked Preview in the supported-models table even though training has been finalized.
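
The hybrid target penalty listed above is easier to see with a toy contrast between the two kinds of state. In the Python sketch below, an attention-style cache holds one entry per token, so rejected positions can simply be sliced off, while a recurrent layer holds a single folded state that has to be recomputed from a checkpoint taken before the draft. The update rule in recurrent_step and all function names are hypothetical and chosen only for illustration; real hybrid layers are far more involved.

```python
from typing import List

MOD = 1_000_003


def recurrent_step(state: int, token: int) -> int:
    """Hypothetical recurrent update: folds one token into a single state value."""
    return (state * 31 + token) % MOD


def run_recurrent(state: int, tokens: List[int]) -> int:
    """Advance the recurrent state over a sequence of tokens."""
    for token in tokens:
        state = recurrent_step(state, token)
    return state


def rollback_attention(kv_cache: List[int], n_accepted: int) -> List[int]:
    """Per-token cache: rejected positions are dropped by truncation, no recompute."""
    return kv_cache[:n_accepted]


def rollback_recurrent(checkpoint: int, draft: List[int], n_accepted: int) -> int:
    """Single folded state: rerun the accepted prefix from the pre-draft checkpoint."""
    return run_recurrent(checkpoint, draft[:n_accepted])


draft = [5, 7, 11, 13, 17, 19, 23, 29]  # 8 drafted tokens
n_accepted = 3                           # verification accepts only the first 3

kv_after_draft = list(draft)             # one cache entry per drafted token
state_after_draft = run_recurrent(0, draft)

kv_fixed = rollback_attention(kv_after_draft, n_accepted)
state_fixed = rollback_recurrent(0, draft, n_accepted)
print(kv_fixed, state_after_draft != state_fixed)
```

The state reached after the full draft carries no recoverable record of where the accepted prefix ended, which is why the extra forward work shows up as lost speedup on hybrid targets.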

What's next

Z Lab's public roadmap:

Open-source training recipe, allowing a DFlash drafter to be trained for any LLM
vLLM native support graduating from nightly
New drafters marked Coming soon: Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, GLM-5.1
llama.cpp: merge PR #22105, add draft-side KV cache and CUDA graph support; target-side deferred commit to remove the hybrid-target penalty

Paper: arXiv 2602.06036 (Feb 5, 2026). Sources: z-lab/dflash, HuggingFace model card, Z Lab blog, Zhijian Liu's X announcement.
