TL;DR

  • Carnice-V2-27b is a 27B open-source agent model fine-tuned on top of Qwen/Qwen3.6-27B, optimized for the Hermes-Agent harness.
  • It beats its base on IFEval prompt strict by +5 pp (85% → 90%) and cuts held-out assistant-token perplexity by 17.5% (1.835 → 1.513).
  • Fully merged BF16 weights (no LoRA) under Apache 2.0. GGUF quants range from 9.4 GB (IQ2_M) up to 53.8 GB (BF16); Q5_K_M at 19.2 GB is the sweet spot for an RTX 3090.
  • Successor to Carnice-27b (V1, on Qwen3.5-27B), trained on more and cleaner Hermes-style traces.

What's new

Released today by Kai Stephens (kai-os) with credits to NousResearch and Lambda, Carnice-V2-27b is a single-pass supervised fine-tune that ships as a full merged BF16 model rather than an adapter. That detail matters: you do not have to glue a LoRA onto a base checkpoint at inference time, and llama.cpp / vLLM / TGI can load it directly.
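
Because the weights are already merged, loading is the boring, happy path. Here is a minimal sketch with transformers, assuming a recent release that supports the hybrid qwen35 architecture; the repo id below is a guess, so check the actual Hugging Face page:

```python
# Minimal sketch: loading the merged BF16 checkpoint directly.
# "kai-os/Carnice-V2-27b" is an assumed repo id, not confirmed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kai-os/Carnice-V2-27b"  # hypothetical; verify on HF
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ships as BF16 safetensors
    device_map="auto",           # shard across available GPUs
)
```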

Compared to V1, V2 swaps the base from Qwen3.5-27B to Qwen3.6-27B and rebuilds the data mix around three families of agent traces: Carnice in-house, DJLougen Hermes, and Lambda GLM-5.1 Hermes. Author summary on X: “more and better data.”

Why it matters

Open-weight 27B models that are actually tuned for the Hermes-Agent style of tool use are rare. Most open releases are general chat fine-tunes — strong on MMLU, weak when you ask them to drive a terminal, edit a repo, or chain four browser actions without losing the plot. Carnice's whole reason for existing is that narrow agentic harness, which makes it interesting for indie devs who want a local, royalty-free agent backbone instead of paying per token to a closed API.

The author goes further and claims V2 can “beat models 10x the size” inside the Hermes-Agent harness. Treat that as an author claim until reproducible runs land — but the direction is plausible: harness-specific fine-tunes routinely punch above their weight on the exact tasks they were trained for.

Technical facts

| Property | Carnice-V2-27b |
| --- | --- |
| Base model | Qwen/Qwen3.6-27B |
| Parameters | 27B (BF16 safetensors) |
| Fine-tune type | Full merged SFT (no LoRA) |
| Train rows / windows | 3,473 rows → 6,554 windows (8,192 tok, 1,024 overlap; sketched below) |
| Eval examples | 110 |
| Data mix | 1,508 Carnice + 1,015 DJLougen Hermes + 950 Lambda GLM-5.1 Hermes |
| License | Apache 2.0 |
| Architecture quirk | qwen35 arch with hybrid attention/SSM; needs a recent llama.cpp build |
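
The "Train rows / windows" row is standard sliding-window packing: any row longer than the 8,192-token window is split into overlapping windows that advance by window minus overlap, i.e. 8,192 − 1,024 = 7,168 tokens. A minimal sketch of that arithmetic (the stride/ceiling logic is an assumed reconstruction; only the window and overlap sizes come from the model card):

```python
# Sliding-window packing sketch. WINDOW and OVERLAP come from the model
# card; the stride and ceiling logic are an assumed reconstruction.
WINDOW, OVERLAP = 8192, 1024
STRIDE = WINDOW - OVERLAP  # 7,168 tokens per step after the first window

def num_windows(n_tokens: int) -> int:
    """How many windows cover a row of n_tokens tokens."""
    if n_tokens <= WINDOW:
        return 1
    extra = -(-(n_tokens - WINDOW) // STRIDE)  # ceiling division
    return 1 + extra

print(num_windows(20_000))  # a 20k-token trace -> 3 windows
```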

Reported deltas vs base Qwen3.6-27B (IFEval limit=20 smoke test, plus held-out assistant-token eval):

| Metric | Base | Carnice-V2 | Delta |
| --- | --- | --- | --- |
| IFEval prompt strict | 85.0% | 90.0% | +5.0 pp |
| IFEval prompt loose | 85.0% | 90.0% | +5.0 pp |
| IFEval instruction strict | 90.0% | 93.3% | +3.3 pp |
| IFEval instruction loose | 90.0% | 93.3% | +3.3 pp |
| Held-out eval loss | 0.607 | 0.414 | −31.8% |
| Held-out perplexity | 1.835 | 1.513 | −17.5% |
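
The loss and perplexity rows are two views of the same number: perplexity is e raised to the loss on held-out assistant tokens, which you can sanity-check in a couple of lines:

```python
import math

# Held-out assistant-token loss -> perplexity via ppl = exp(loss)
for loss in (0.607, 0.414):
    print(f"loss={loss:.3f} -> ppl={math.exp(loss):.3f}")
# loss=0.607 -> ppl=1.835
# loss=0.414 -> ppl=1.513
```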

The author flags IFEval at limit=20 as a smoke test, not a leaderboard score. Read it as “moved in the right direction during training,” not as a final verdict.

V1 vs V2

| Aspect | Carnice-27b (V1) | Carnice-V2-27b |
| --- | --- | --- |
| Base | Qwen3.5-27B | Qwen3.6-27B |
| Pipeline | Trinity (3 stages: backbone → alignment → polish) | Single-pass SFT on a refined data mix |
| Reproducible benchmarks | Not attached at release | IFEval + assistant-token loss reported |
| Data | Carnice + DJLougen + Lambda | Same families, more rows + cleaner curation |

Use cases

Carnice is built for the Hermes-Agent harness, which means the obvious fits are:

  • Terminal-driving agents that need to chain shell commands without losing context.
  • Repo-aware coding agents that read, edit, and reason across many files.
  • Browser automations with multi-step tool use.
  • Local debugging copilots where a closed API call per turn is too slow or too expensive.

If your stack is a general chatbot, Q&A bot, or RAG pipeline, a generic Qwen3.6 instruct or a smaller Carnice-9b will probably serve you better.

Limitations & pricing

Free under Apache 2.0 — but “runs on a 3090” needs an asterisk. The full BF16 checkpoint is 53.8 GB, which does not fit on a 24 GB card. The consumer-GPU story relies entirely on the GGUF release:

| Quant | Size | Fits a 3090 (24 GB)? |
| --- | --- | --- |
| IQ2_M | 9.4 GB | Yes (also fits 16 GB cards) |
| Q2_K | 10 GB | Yes |
| Q4_K_M | 16.5 GB | Yes (max context headroom) |
| Q5_K_M | 19.2 GB | Yes (best quality on a 3090) |
| Q8_0 | 28.6 GB | No |
| BF16 | 53.8 GB | No (server-class only) |
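
The fits column is straightforward arithmetic: the quant file plus runtime overhead (KV cache, compute buffers) has to stay under 24 GB. A rough sketch, assuming a flat 2.5 GB overhead budget rather than an exact per-context formula, since the hybrid attention/SSM layout makes precise KV math model-specific:

```python
# Rough 3090 fit check. The 2.5 GB overhead is an assumed budget, not a
# measured number; real usage depends on context length and batch size.
VRAM_GB, OVERHEAD_GB = 24.0, 2.5

quants = {"IQ2_M": 9.4, "Q2_K": 10.0, "Q4_K_M": 16.5,
          "Q5_K_M": 19.2, "Q8_0": 28.6, "BF16": 53.8}

for name, size_gb in quants.items():
    verdict = "fits" if size_gb + OVERHEAD_GB <= VRAM_GB else "does not fit"
    print(f"{name:7s} {size_gb:5.1f} GB -> {verdict}")
```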

Other caveats: you need a recent llama.cpp build because of the hybrid attention/SSM layers; harness-level Hermes-Agent task scores have not been published yet; and there is no formal head-to-head V1-vs-V2 eval in the model card.

What's next

The author hints that reproducible benchmark runs on a dedicated box are still in progress for the 27B line, and Carnice-9b already exists for smaller setups. If V2 holds up under independent harness evals, it becomes one of the more interesting open-weight options for self-hosted agentic stacks alongside the larger Qwen and Llama tunes.

Quick start: pull the GGUF, point a recent llama-cli at it, and try it inside your own agent loop before committing.

llama-cli -m carnice-v2-27b-Q5_K_M.gguf -ngl 99 -c 8192
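
For an actual agent loop, llama-server (same GGUF, same flags) exposes an OpenAI-compatible /v1/chat/completions endpoint, so the driver side is a few lines. A minimal sketch; the port and the system prompt are placeholders, not anything the model card prescribes:

```python
# Minimal sketch of talking to a local llama-server. Assumes it was
# started first, e.g.:
#   llama-server -m carnice-v2-27b-Q5_K_M.gguf -ngl 99 -c 8192 --port 8080
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are a terminal-driving agent."},
            {"role": "user", "content": "Plan the commands to find the three largest files in /tmp."},
        ],
        "temperature": 0.2,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```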

Sources: Carnice-V2-27b on Hugging Face, GGUF release, Carnice-27b (V1), original announcement on X.