- An independent benchmark ranked 80 GGUF quantizations of Google's new Gemma 4 26B-A4B across 6 uploaders.
- Unsloth's Dynamic 2.0 GGUFs placed #1 in every single one of the 22 tested quant sizes on mean KL divergence — the cleanest sweep we've seen in open-model quantization.
TL;DR
Unsloth's Gemma 4 26B-A4B GGUFs rank #1 in 22 of 22 tested quant sizes on mean KL divergence against the BF16 reference, according to an independent benchmark of 80 GGUF variants from 6 uploaders. That's a clean sweep — every size from UD-IQ2_XXS at 9.88 GB up through UD-Q4_K_XL at 17.1 GB. The result lands 10 days after Google shipped Gemma 4, and it's the strongest validation yet of Unsloth's Dynamic 2.0 quantization method.

What's new
The benchmark, published by localbench, measured ~250,000 tokens across coding, chat, tool-calling, science, non-Latin scripts and long documents. It compared GGUFs from six popular uploaders:
- unsloth — 21 quants
- bartowski — 26 quants
- mradermacher — 21 quants
- mudler — 7 quants
- lmstudio-community — 3 quants
- ggml-org — 2 quants
When matched by quant size, Unsloth's Dynamic 2.0 variants posted the lowest mean KL divergence in every bucket. Unsloth announced the result on X the same day, framing the lineup as SOTA for this model.
Why it matters
KL divergence isn't the usual marketing metric — it's the one that predicts whether a quantized model behaves like the original. The paper "Accuracy Is Not All You Need" argues that KL divergence correlates with answer "flips" (right-to-wrong and wrong-to-right) far better than a single MMLU score: a quant can match the original on MMLU and still, if it diverges heavily from BF16, give different answers on many individual prompts.
So when you pick a GGUF to run locally, you're really asking: which quant stays closest to the real model at this disk size? On Gemma 4 26B-A4B, the answer in every size bracket is now Unsloth.
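For intuition, here's a minimal sketch of how a mean-KL comparison like this is typically computed: run identical inputs through the BF16 reference and the quantized model, take KL(P_ref || P_quant) over the vocabulary at each token position, and average. This is the textbook recipe, not localbench's actual harness; the function names are illustrative.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax over the vocabulary dimension
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def mean_kl(ref_logits, quant_logits):
    """Mean per-token KL(P_ref || P_quant).

    Both arrays have shape (n_tokens, vocab_size), produced by running the
    BF16 reference and the quantized model over identical inputs.
    """
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    kl_per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return kl_per_token.mean()  # lower = closer to the original model

# Sanity check: identical logits give a divergence of exactly zero
logits = np.random.randn(4, 32)
assert np.isclose(mean_kl(logits, logits), 0.0)
```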
Technical facts
Unsloth's Dynamic 2.0 differs from standard imatrix quants in two ways:
- Per-layer adaptive quant type. Instead of applying one recipe (say, Q4_K_M) uniformly, Dynamic 2.0 picks a different quant scheme layer-by-layer based on sensitivity. Recipes are re-derived per model. A toy selection loop is sketched after this list.
- Chat-template-aware calibration. Unsloth uses Calibration_v3/Calibration_v5 — 1.5M hand-curated tokens that respect instruct chat templates, rather than wikitext-only data that tends to overfit to non-instruct distributions.
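The per-layer idea fits in a few lines. To be clear, the sketch below is hypothetical: Unsloth has not published Dynamic 2.0's exact selection logic, and real sensitivity scores would come from calibration (imatrix-style activation statistics), not hand-typed numbers.

```python
# Hypothetical per-layer quant assignment. Quant-type names follow llama.cpp
# conventions; the greedy rule itself is invented for illustration.
def assign_quants(sensitivities: dict[str, float],
                  base: str = "IQ2_XXS",
                  upgraded: str = "Q4_K",
                  bump_fraction: float = 0.25) -> dict[str, str]:
    """Upgrade the most sensitive fraction of layers to a pricier quant type.

    sensitivities: per-layer score from calibration, where higher means
    quantization error in that layer distorts outputs more.
    """
    ranked = sorted(sensitivities, key=sensitivities.get, reverse=True)
    keep_precise = set(ranked[:int(len(ranked) * bump_fraction)])
    return {name: (upgraded if name in keep_precise else base)
            for name in sensitivities}

recipe = assign_quants({"blk.0.attn_q": 0.9, "blk.0.ffn_up": 0.2,
                        "blk.1.attn_q": 0.7, "blk.1.ffn_up": 0.1},
                       bump_fraction=0.5)
# Both attention projections get Q4_K; both FFN tensors stay at IQ2_XXS.
```

The real method is finer-grained, choosing among many quant types and re-deriving the recipe per model, but the trade-off is the same: spend bits where quantization error hurts most.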
Model specs for context:
| Property | Gemma 4 26B-A4B |
|---|---|
| Total params | 25.2B |
| Active params (MoE) | 3.8B |
| Experts | 8 active / 128 total + 1 shared |
| Layers | 30 |
| Context | 256K tokens |
| Sliding window | 1024 tokens (hybrid attention) |
| Vision encoder | ~550M params |
| License | Apache 2.0 |
Comparison
Unsloth's GGUF lineup spans an unusually wide range, with smart-compressed 2-bit variants at the low end:
| Quant | Size | Typical use |
|---|---|---|
| UD-IQ2_XXS | 9.88 GB | Fits on 12 GB VRAM |
| UD-Q2_K_XL | 10.5 GB | Best 2-bit quality |
| UD-IQ3_S | 11.2 GB | Balanced 3-bit |
| UD-IQ4_XS | 13.4 GB | Sweet spot for 16 GB cards |
| MXFP4_MOE | 16.6 GB | MoE-native 4-bit |
| UD-Q4_K_XL | 17.1 GB | Max 4-bit quality |
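If you're choosing from that table, a tiny helper makes the size math explicit. The 0.85 headroom factor is a rough assumption to leave room for KV cache and runtime buffers, not a llama.cpp rule:

```python
# (quant name, file size in GB) pairs from the table above
QUANTS = [("UD-IQ2_XXS", 9.88), ("UD-Q2_K_XL", 10.5), ("UD-IQ3_S", 11.2),
          ("UD-IQ4_XS", 13.4), ("MXFP4_MOE", 16.6), ("UD-Q4_K_XL", 17.1)]

def pick_quant(vram_gb: float, headroom: float = 0.85):
    """Largest quant whose file fits in headroom * VRAM (rule of thumb)."""
    fitting = [(name, gb) for name, gb in QUANTS if gb <= vram_gb * headroom]
    return max(fitting, key=lambda q: q[1], default=None)

print(pick_quant(12))  # ('UD-IQ2_XXS', 9.88) on a 12 GB card
print(pick_quant(24))  # ('UD-Q4_K_XL', 17.1) with room to spare
```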
The track record from the prior generation points the same way: Unsloth's Gemma 3 27B dynamic 4-bit beat Google's own QAT release at 71.47% vs 70.64% MMLU while being 2 GB smaller, and Gemma 3 12B Q4_0 landed at 67.07% MMLU vs 67.15% for full bfloat16 — essentially lossless.
Use cases
With 25.2B total params activating only 3.8B per token, Gemma 4 26B-A4B is built for latency-sensitive local workloads:
- Agent workflows — native function calling, structured JSON output, 82.6% on MMLU-Pro, 77.1% on LiveCodeBench v6.
- Long-document reasoning — 256K context with hybrid sliding/global attention.
- Local coding copilot — Codeforces Elo 1718, offline-capable at 13–17 GB on disk.
- Consumer-GPU deployments — UD-IQ2_XXS fits on a 12 GB card; UD-Q4_K_XL fits on 24 GB with headroom. A download-and-run sketch follows this list.
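To try one of these locally, a minimal download-and-run sketch with huggingface_hub and llama.cpp's llama-cli looks like the following. The repo id and filename are placeholders, so check Unsloth's Hugging Face page for the exact names:

```python
import subprocess
from huggingface_hub import hf_hub_download

# Hypothetical repo id and filename; substitute the real ones from
# Unsloth's Hugging Face page for this model.
path = hf_hub_download(
    repo_id="unsloth/Gemma-4-26B-A4B-GGUF",
    filename="Gemma-4-26B-A4B-UD-IQ4_XS.gguf",
)

# llama.cpp's llama-cli: -ngl offloads layers to the GPU, -c sets context.
subprocess.run([
    "llama-cli", "-m", path,
    "-ngl", "99",
    "-c", "8192",
    "-p", "Summarize KL divergence in one paragraph.",
], check=True)
```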
Limitations & pricing
A few caveats worth flagging:
- The full per-quant KL numbers from localbench sit behind a paid subscription. The methodology is public; the ranking table is not. Unsloth's "22/22" claim traces back to that benchmark.
- KL divergence is a proxy, not a task score. A 2-bit quant can be "best in class" and still lose real quality vs 4-bit.
- MoE peak RAM ≈ all experts loaded — even though only 3.8B activate per token, you still need to fit the full quant in memory (see the back-of-envelope sketch after this list).
- Gemma 4 is Apache 2.0; Unsloth GGUFs are free on Hugging Face. No pricing gate.
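A quick back-of-envelope makes the residency caveat concrete, using the spec-table numbers and treating reported file sizes as decimal gigabytes (an assumption):

```python
# MoE sparsity saves compute per token, not resident memory.
total_params = 25.2e9    # all experts
active_params = 3.8e9    # activated per token
file_gb = 17.1           # UD-Q4_K_XL on disk

bits_per_weight = file_gb * 8e9 / total_params
print(f"~{bits_per_weight:.1f} bits/weight effective")        # ~5.4
print(f"compute scales with {active_params / 1e9:.1f}B params,")
print(f"but RAM/VRAM must hold the full {file_gb} GB, plus KV cache")
```

Speed tracks the active parameters; whether the model loads at all tracks the full file.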
What's next
Expect the localbench comparison to expand to Gemma 4's 31B dense variant and the E2B/E4B edge models. Unsloth has said Dynamic 2.0 is a moving target — recipes update as llama.cpp evolves, so the current SOTA lineup on Hugging Face will keep shifting as new quant types land.
Sources: UnslothAI, localbench, Unsloth docs, Google.


