TL;DR

Unsloth's Gemma 4 26B-A4B GGUFs rank #1 in all 22 tested quant sizes on mean KL divergence against the BF16 reference, according to an independent benchmark of 80 GGUF variants from 6 uploaders. That's a clean sweep: every size from the 9.88 GB UD-IQ2_XXS up through UD-Q4_K_XL at 17.1 GB. The result lands 10 days after Google shipped Gemma 4, and it's the strongest validation yet of Unsloth's Dynamic 2.0 quantization method.

[Image: Google Gemma 4 launch banner]

What's new

The benchmark, published by localbench, measured KL divergence over ~250,000 tokens spanning coding, chat, tool-calling, science, non-Latin scripts, and long documents. It compared GGUFs from six popular uploaders:

  • unsloth — 21 quants
  • bartowski — 26 quants
  • mradermacher — 21 quants
  • mudler — 7 quants
  • lmstudio-community — 3 quants
  • ggml-org — 2 quants

When matched by quant size, Unsloth's Dynamic 2.0 variants posted the lowest mean KL divergence in every bucket. Unsloth announced the result on X the same day, framing the lineup as SOTA for this model.

Why it matters

KL divergence isn't the usual marketing metric; it's the one that predicts whether a quantized model behaves like the original. The paper Accuracy Is Not All You Need argues that KL divergence correlates with answer "flips" (right-to-wrong and wrong-to-right) far better than a single MMLU score: a quant can match the original's MMLU number yet diverge heavily from BF16 and flip answers on individual prompts.
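
localbench's exact harness isn't public, but the metric itself is simple. Below is a minimal sketch, assuming both models are run over the same token stream and expose full next-token logits; the function name and shapes are illustrative, not localbench's code.

```python
import torch
import torch.nn.functional as F

def mean_kl_divergence(ref_logits: torch.Tensor, quant_logits: torch.Tensor) -> float:
    """Mean per-token KL(P_ref || P_quant) over a shared token stream.

    ref_logits, quant_logits: [num_tokens, vocab_size] logits produced by
    the BF16 reference and the quantized model on identical inputs.
    """
    ref_logprobs = F.log_softmax(ref_logits.float(), dim=-1)
    quant_logprobs = F.log_softmax(quant_logits.float(), dim=-1)
    # KL(P || Q) = sum_v P(v) * (log P(v) - log Q(v)), averaged over positions
    kl_per_token = (ref_logprobs.exp() * (ref_logprobs - quant_logprobs)).sum(-1)
    return kl_per_token.mean().item()
```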

So when you pick a GGUF to run locally, you're really asking: which quant stays closest to the real model at this disk size? On Gemma 4 26B-A4B, the answer in every size bracket is now Unsloth.

Technical facts

Unsloth's Dynamic 2.0 differs from standard imatrix quants in two ways:

  1. Per-layer adaptive quant type. Instead of applying one recipe (say, Q4_K_M) uniformly, Dynamic 2.0 picks a quant scheme layer-by-layer based on sensitivity, and recipes are re-derived per model; a rough sketch follows this list.
  2. Chat-template-aware calibration. Unsloth uses Calibration_v3 / Calibration_v5 — 1.5M hand-curated tokens that respect instruct chat templates, rather than wikitext-only data that tends to overfit on non-instruct distributions.
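
Unsloth hasn't published the selection algorithm itself, so the following is only a shape-of-the-idea sketch: the sensitivity scores, thresholds, layer names, and quant ladder are all illustrative assumptions, not Unsloth's recipe.

```python
# Hypothetical sketch of per-layer quant-type selection (NOT Unsloth's actual recipe).
# Assumes a precomputed sensitivity score per layer, e.g. from calibration-set
# activation error, and a ladder of ggml quant types ordered coarse -> fine.

QUANT_LADDER = ["Q2_K", "Q3_K", "Q4_K", "Q5_K", "Q6_K"]  # coarse -> fine

def pick_layer_quants(sensitivity: dict[str, float],
                      thresholds: list[float]) -> dict[str, str]:
    """Map each layer to a quant type: the more thresholds its sensitivity
    exceeds, the finer (more precise) the quant it receives."""
    plan = {}
    for layer, score in sensitivity.items():
        idx = sum(score > t for t in thresholds)  # count thresholds exceeded
        plan[layer] = QUANT_LADDER[min(idx, len(QUANT_LADDER) - 1)]
    return plan

# Example: embedding/output layers are usually most sensitive.
plan = pick_layer_quants(
    {"token_embd": 0.9, "blk.0.attn_q": 0.4, "blk.15.ffn_down": 0.1},
    thresholds=[0.2, 0.35, 0.6, 0.8],
)
print(plan)  # {'token_embd': 'Q6_K', 'blk.0.attn_q': 'Q4_K', 'blk.15.ffn_down': 'Q2_K'}
```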

Model specs for context:

Property              Gemma 4 26B-A4B
--------------------  -------------------------------
Total params          25.2B
Active params (MoE)   3.8B
Experts               8 active / 128 total + 1 shared
Layers                30
Context               256K tokens
Sliding window        1024 tokens (hybrid attention)
Vision encoder        ~550M params
License               Apache 2.0

Comparison

Unsloth's GGUF lineup spans an unusually wide range, with smart-compressed 2-bit variants at the low end:

Quant        Size     Typical use
-----------  -------  ---------------------------
UD-IQ2_XXS   9.88 GB  Fits on 12 GB VRAM
UD-Q2_K_XL   10.5 GB  Best 2-bit quality
UD-IQ3_S     11.2 GB  Balanced 3-bit
UD-IQ4_XS    13.4 GB  Sweet spot for 16 GB cards
MXFP4_MOE    16.6 GB  MoE-native 4-bit
UD-Q4_K_XL   17.1 GB  Max 4-bit quality
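
To try one locally, the usual route is pulling a single quant file from the Hugging Face repo. The repo id and filename below are illustrative guesses, not confirmed names; check Unsloth's actual listing before downloading.

```python
from huggingface_hub import hf_hub_download

# Repo id and filename are assumptions for illustration; verify the exact
# names on Unsloth's Hugging Face page before downloading.
path = hf_hub_download(
    repo_id="unsloth/gemma-4-26b-a4b-GGUF",
    filename="gemma-4-26b-a4b-UD-IQ4_XS.gguf",
)
print(path)  # local cache path to the ~13.4 GB quant
```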

Unsloth's track record on the prior generation is consistent with this result: its Gemma 3 27B dynamic 4-bit beat Google's own QAT release at 71.47% vs 70.64% MMLU while being 2 GB smaller, and its Gemma 3 12B Q4_0 landed at 67.07% MMLU vs 67.15% for full bfloat16, essentially lossless.

Use cases

With only 3.8B of its 25.2B total params active per token, Gemma 4 26B-A4B is built for latency-sensitive local workloads:

  • Agent workflows — native function calling, structured JSON output, 82.6% MMLU Pro, 77.1% on LiveCodeBench v6.
  • Long-document reasoning — 256K context with hybrid sliding/global attention.
  • Local coding copilot — Codeforces ELO 1718, offline-capable at 13–17 GB disk.
  • Consumer-GPU deployments — UD-IQ2_XXS fits on a 12 GB card; UD-Q4_K_XL fits on 24 GB with headroom.
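
As a concrete starting point for the coding-copilot and consumer-GPU cases, here is a minimal llama-cpp-python sketch. The model path assumes the hypothetical download above, and the context size and GPU-offload settings are assumptions you'd tune to your hardware.

```python
from llama_cpp import Llama

# Path from the earlier download sketch; n_gpu_layers=-1 offloads all layers
# to the GPU, which should fit for UD-IQ4_XS on a 16 GB card per the table above.
llm = Llama(
    model_path="gemma-4-26b-a4b-UD-IQ4_XS.gguf",
    n_ctx=8192,        # raise toward 256K only if you have RAM for the KV cache
    n_gpu_layers=-1,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that deduplicates a list."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```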

Limitations & pricing

A few caveats worth flagging:

  • The full per-quant KL numbers from localbench sit behind a paid subscription. The methodology is public; the ranking table is not. Unsloth's "22/22" claim comes from that benchmark.
  • KL divergence is a proxy, not a task score. A 2-bit quant can be "best in class" and still lose real quality vs 4-bit.
  • MoE peak RAM ≈ all experts loaded: even though only 3.8B params activate per token, the full quant must fit in memory (rough arithmetic after this list).
  • Gemma 4 is Apache 2.0; Unsloth GGUFs are free on Hugging Face. No pricing gate.
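
To make the memory caveat concrete, here is a back-of-envelope budget. Only the quant file size comes from the table above; the KV-cache and overhead figures are loose assumptions, not measured numbers.

```python
# Rough memory budget for running a MoE GGUF locally. The full quant file
# must be resident even though only ~3.8B params activate per token.
quant_file_gb = 13.4          # UD-IQ4_XS from the table above
kv_cache_gb_per_8k_ctx = 0.5  # illustrative assumption; depends on model dims
overhead_gb = 1.0             # runtime buffers, scratch, OS slack (assumption)

total_gb = quant_file_gb + kv_cache_gb_per_8k_ctx + overhead_gb
print(f"~{total_gb:.1f} GB needed")  # ~14.9 GB: tight on a 16 GB card
```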

What's next

Expect the localbench comparison to expand to Gemma 4's 31B dense variant and the E2B/E4B edge models. Unsloth has said Dynamic 2.0 is a moving target — recipes update as llama.cpp evolves, so the current SOTA lineup on Hugging Face will keep shifting as new quant types land.

Sources: UnslothAI, localbench, Unsloth docs, Google.