- An independent benchmark ranked 80 GGUF quantizations of Google's new Gemma 4 26B-A4B across 6 uploaders.
- Unsloth's Dynamic 2.0 GGUFs placed #1 in every single one of the 22 tested quant sizes on mean KL divergence — the cleanest sweep we've seen in open-model quantization.
TL;DR
Unsloth's Gemma 4 26B-A4B GGUFs rank #1 in 22 of 22 tested quant sizes on mean KL divergence against the BF16 reference, according to an independent benchmark of 80 GGUF variants from 6 uploaders. That's a clean sweep — every size from UD-IQ2_XXS at 9.88 GB up through UD-Q4_K_XL at 17.1 GB. The result lands 10 days after Google shipped Gemma 4, and it's the strongest validation yet of Unsloth's Dynamic 2.0 quantization method.

What's new
The benchmark, published by localbench, measured ~250,000 tokens across coding, chat, tool-calling, science, non-Latin scripts and long documents. It compared GGUFs from six popular uploaders:
- unsloth — 21 quants
- bartowski — 26 quants
- mradermacher — 21 quants
- mudler — 7 quants
- lmstudio-community — 3 quants
- ggml-org — 2 quants
When matched by quant size, Unsloth's Dynamic 2.0 variants posted the lowest mean KL divergence in every bucket. Unsloth announced the result on X the same day, framing the lineup as SOTA for this model.
Why it matters
KL divergence isn't the usual marketing metric — it's the one that predicts whether a quantized model behaves like the original. The paper "Accuracy Is Not All You Need" argues that KL divergence correlates with answer "flips" (right-to-wrong and wrong-to-right) far better than a single MMLU score: a quant can match the original on MMLU and still, if it diverges heavily from BF16, give different answers on many individual prompts.
So when you pick a GGUF to run locally, you're really asking: which quant stays closest to the real model at this disk size? On Gemma 4 26B-A4B, the answer in every size bracket is now Unsloth.
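For intuition, here's a minimal sketch of how a mean-KL comparison like this is typically computed: run identical inputs through the BF16 reference and the quantized model, take KL(P_ref || P_quant) over the vocabulary at each token position, and average. This is the textbook recipe, not localbench's actual harness; the function names are illustrative.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax over the vocabulary dimension
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def mean_kl(ref_logits, quant_logits):
    """Mean per-token KL(P_ref || P_quant).

    Both arrays have shape (n_tokens, vocab_size), produced by running the
    BF16 reference and the quantized model over identical inputs.
    """
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    kl_per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return kl_per_token.mean()  # lower = closer to the original model

# Sanity check: identical logits give a divergence of exactly zero
logits = np.random.randn(4, 32)
assert np.isclose(mean_kl(logits, logits), 0.0)
```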
Technical facts
Unsloth's Dynamic 2.0 differs from standard imatrix quants in two ways:
- Per-layer adaptive quant type. Instead of applying one recipe (say, Q4_K_M) uniformly, Dynamic 2.0 picks a different quant scheme layer-by-layer based on sensitivity. Recipes are re-derived per model. A toy selection loop is sketched after this list.
- Chat-template-aware calibration. Unsloth uses Calibration_v3/Calibration_v5 — 1.5M hand-curated tokens that respect instruct chat templates, rather than wikitext-only data that tends to overfit to non-instruct distributions.
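The per-layer idea fits in a few lines. To be clear, the sketch below is hypothetical: Unsloth has not published Dynamic 2.0's exact selection logic, and real sensitivity scores would come from calibration (imatrix-style activation statistics), not hand-typed numbers.

```python
# Hypothetical per-layer quant assignment. Quant-type names follow llama.cpp
# conventions; the greedy rule itself is invented for illustration.
def assign_quants(sensitivities: dict[str, float],
                  base: str = "IQ2_XXS",
                  upgraded: str = "Q4_K",
                  bump_fraction: float = 0.25) -> dict[str, str]:
    """Upgrade the most sensitive fraction of layers to a pricier quant type.

    sensitivities: per-layer score from calibration, where higher means
    quantization error in that layer distorts outputs more.
    """
    ranked = sorted(sensitivities, key=sensitivities.get, reverse=True)
    keep_precise = set(ranked[:int(len(ranked) * bump_fraction)])
    return {name: (upgraded if name in keep_precise else base)
            for name in sensitivities}

recipe = assign_quants({"blk.0.attn_q": 0.9, "blk.0.ffn_up": 0.2,
                        "blk.1.attn_q": 0.7, "blk.1.ffn_up": 0.1},
                       bump_fraction=0.5)
# Both attention projections get Q4_K; both FFN tensors stay at IQ2_XXS.
```

The real method is finer-grained, choosing among many quant types and re-deriving the recipe per model, but the trade-off is the same: spend bits where quantization error hurts most.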
Model specs for context:
| Property | Gemma 4 26B-A4B |
|---|---|
| Total params | 25.2B |
| Active params (MoE) | 3.8B |
| Experts | 8 active / 128 total + 1 shared |
| Layers | 30 |
| Context | 256K tokens |
| Sliding window | 1024 tokens (hybrid attention) |
| Vision encoder | ~550M params |
| License | Apache 2.0 |
Comparison
Unsloth's GGUF lineup spans an unusually wide range, with smart-compressed 2-bit variants at the low end:
| Quant | Size | Typical use |
|---|---|---|
| UD-IQ2_XXS | 9.88 GB | Fits on 12 GB VRAM |
| UD-Q2_K_XL | 10.5 GB | Best 2-bit quality |
| UD-IQ3_S | 11.2 GB | Balanced 3-bit |
| UD-IQ4_XS | 13.4 GB | Sweet spot for 16 GB cards |
| MXFP4_MOE | 16.6 GB | MoE-native 4-bit |
| UD-Q4_K_XL | 17.1 GB | Max 4-bit quality |
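If you're choosing from that table, a tiny helper makes the size math explicit. The 0.85 headroom factor is a rough assumption to leave room for KV cache and runtime buffers, not a llama.cpp rule:

```python
# (quant name, file size in GB) pairs from the table above
QUANTS = [("UD-IQ2_XXS", 9.88), ("UD-Q2_K_XL", 10.5), ("UD-IQ3_S", 11.2),
          ("UD-IQ4_XS", 13.4), ("MXFP4_MOE", 16.6), ("UD-Q4_K_XL", 17.1)]

def pick_quant(vram_gb: float, headroom: float = 0.85):
    """Largest quant whose file fits in headroom * VRAM (rule of thumb)."""
    fitting = [(name, gb) for name, gb in QUANTS if gb <= vram_gb * headroom]
    return max(fitting, key=lambda q: q[1], default=None)

print(pick_quant(12))  # ('UD-IQ2_XXS', 9.88) on a 12 GB card
print(pick_quant(24))  # ('UD-Q4_K_XL', 17.1) with room to spare
```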
The track record from the prior generation points the same way: Unsloth's Gemma 3 27B dynamic 4-bit beat Google's own QAT release at 71.47% vs 70.64% MMLU while being 2 GB smaller, and Gemma 3 12B Q4_0 landed at 67.07% MMLU vs 67.15% for full bfloat16 — essentially lossless.
Use cases
With 25.2B total params activating only 3.8B per token, Gemma 4 26B-A4B is built for latency-sensitive local workloads:
- Agent workflows — native function calling, structured JSON output, 82.6% on MMLU-Pro, 77.1% on LiveCodeBench v6.
- Long-document reasoning — 256K context with hybrid sliding/global attention.
- Local coding copilot — Codeforces Elo 1718, offline-capable at 13–17 GB on disk.
- Consumer-GPU deployments — UD-IQ2_XXS fits on a 12 GB card; UD-Q4_K_XL fits on 24 GB with headroom. A download-and-run sketch follows this list.
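To try one of these locally, a minimal download-and-run sketch with huggingface_hub and llama.cpp's llama-cli looks like the following. The repo id and filename are placeholders, so check Unsloth's Hugging Face page for the exact names:

```python
import subprocess
from huggingface_hub import hf_hub_download

# Hypothetical repo id and filename; substitute the real ones from
# Unsloth's Hugging Face page for this model.
path = hf_hub_download(
    repo_id="unsloth/Gemma-4-26B-A4B-GGUF",
    filename="Gemma-4-26B-A4B-UD-IQ4_XS.gguf",
)

# llama.cpp's llama-cli: -ngl offloads layers to the GPU, -c sets context.
subprocess.run([
    "llama-cli", "-m", path,
    "-ngl", "99",
    "-c", "8192",
    "-p", "Summarize KL divergence in one paragraph.",
], check=True)
```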
Limitations & pricing
A few caveats worth flagging:
- The full per-quant KL numbers from localbench sit behind a paid subscription. The methodology is public; the ranking table is not. Unsloth's "22/22" claim traces back to that benchmark.
- KL divergence is a proxy, not a task score. A 2-bit quant can be "best in class" and still lose real quality vs 4-bit.
- MoE peak RAM ≈ all experts loaded — even though only 3.8B activate per token, you still need to fit the full quant in memory (see the back-of-envelope sketch after this list).
- Gemma 4 is Apache 2.0; Unsloth GGUFs are free on Hugging Face. No pricing gate.
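A quick back-of-envelope makes the residency caveat concrete, using the spec-table numbers and treating reported file sizes as decimal gigabytes (an assumption):

```python
# MoE sparsity saves compute per token, not resident memory.
total_params = 25.2e9    # all experts
active_params = 3.8e9    # activated per token
file_gb = 17.1           # UD-Q4_K_XL on disk

bits_per_weight = file_gb * 8e9 / total_params
print(f"~{bits_per_weight:.1f} bits/weight effective")        # ~5.4
print(f"compute scales with {active_params / 1e9:.1f}B params,")
print(f"but RAM/VRAM must hold the full {file_gb} GB, plus KV cache")
```

Speed tracks the active parameters; whether the model loads at all tracks the full file.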
What's next
Expect the localbench comparison to expand to Gemma 4's 31B dense variant and the E2B/E4B edge models. Unsloth has said Dynamic 2.0 is a moving target — recipes update as llama.cpp evolves, so the current SOTA lineup on Hugging Face will keep shifting as new quant types land.
Sources: UnslothAI, localbench, Unsloth docs, Google.


