TL;DR
- Carnice-V2-27b is a 27B open-weight agent model fine-tuned on top of Qwen/Qwen3.6-27B, optimized for the Hermes-Agent harness.
- It beats its base on IFEval prompt strict by +5 pp (85% → 90%) and cuts held-out assistant-token perplexity by 17.5% (1.835 → 1.513).
- Fully merged BF16 weights (no LoRA) under Apache 2.0. GGUF quants range from 9.4 GB (IQ2_M) up to 53.8 GB (BF16); Q5_K_M at 19.2 GB is the sweet spot for an RTX 3090.
- Successor to Carnice-27b (V1, on Qwen3.5-27B), trained on more and cleaner Hermes-style traces.
What is new
Released today by Kai Stephens (kai-os) with credits to NousResearch and Lambda, Carnice-V2-27b is a single-pass supervised fine-tune that ships as a full merged BF16 model rather than an adapter. That detail matters: you do not have to glue a LoRA onto a base checkpoint at inference time, and llama.cpp / vLLM / TGI can load it directly.
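In practice, a merged checkpoint loads like any other Hugging Face causal LM. A minimal sketch, assuming the weights live at a repo id like `kai-os/Carnice-V2-27b` (the exact id is not confirmed here; check the model card):

```python
# Minimal load sketch for the merged BF16 checkpoint. The repo id is a guess
# based on the author's handle, not a confirmed path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kai-os/Carnice-V2-27b"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights ship as BF16 safetensors
    device_map="auto",           # shard across available GPUs
)

prompt = "List the files in the current directory, then summarize README.md."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

No adapter merging step, no PEFT dependency: that is the practical payoff of shipping merged weights.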
Compared to V1, V2 swaps the base from Qwen3.5-27B to Qwen3.6-27B and rebuilds the data mix around three families of agent traces: Carnice in-house, DJLougen Hermes, and Lambda GLM-5.1 Hermes. Author summary on X: “more and better data.”
Why it matters
Open-weight 27B models that are actually tuned for the Hermes-Agent style of tool use are rare. Most open releases are general chat fine-tunes — strong on MMLU, weak when you ask them to drive a terminal, edit a repo, or chain four browser actions without losing the plot. Carnice's whole reason for existing is that narrow agentic harness, which makes it interesting for indie devs who want a local, royalty-free agent backbone instead of paying per token to a closed API.
The author goes further and claims V2 can “beat models 10x the size” inside the Hermes-Agent harness. Treat that as an author claim until reproducible runs land — but the direction is plausible: harness-specific fine-tunes routinely punch above their weight on the exact tasks they were trained for.
Technical facts
| Property | Carnice-V2-27b |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Parameters | 27B (BF16 safetensors) |
| Fine-tune type | Full merged SFT (no LoRA) |
| Train rows / windows | 3,473 rows → 6,554 windows (8,192 tok, 1,024 overlap; see the sketch after this table) |
| Eval examples | 110 |
| Data mix | 1,508 Carnice + 1,015 DJLougen Hermes + 950 Lambda GLM-5.1 Hermes |
| License | Apache 2.0 |
| Architecture quirk | qwen35 arch with hybrid attention/SSM — needs a recent llama.cpp build |
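The rows-to-windows expansion is plain sliding-window packing: each window covers 8,192 tokens and advances by 7,168 tokens (window minus overlap). The author's packing code is not published; this is just a sketch of the arithmetic:

```python
# Window arithmetic implied by the card: 8,192-token windows with 1,024 tokens
# of overlap, so each new window advances by a stride of 8,192 - 1,024 = 7,168.
# A sketch of the mechanics, not the author's actual packing code.
def make_windows(tokens: list[int], window: int = 8192, overlap: int = 1024):
    stride = window - overlap
    windows, start = [], 0
    while True:
        windows.append(tokens[start : start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return windows

# A 20,000-token row yields 3 windows: [0:8192], [7168:15360], [14336:20000]
assert len(make_windows(list(range(20_000)))) == 3
```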
Reported deltas vs base Qwen3.6-27B (IFEval limit=20 smoke test, plus held-out assistant-token eval):
| Metric | Base | Carnice-V2 | Delta |
|---|---|---|---|
| IFEval prompt strict | 85.0% | 90.0% | +5.0 pp |
| IFEval prompt loose | 85.0% | 90.0% | +5.0 pp |
| IFEval instruction strict | 90.0% | 93.3% | +3.3 pp |
| IFEval instruction loose | 90.0% | 93.3% | +3.3 pp |
| Held-out eval loss | 0.607 | 0.414 | −31.8% |
| Held-out perplexity | 1.835 | 1.513 | −17.5% |
The author flags IFEval at limit=20 as a smoke test, not a leaderboard score. Read it as “moved in the right direction during training,” not as a final verdict.
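One check you can run on the table itself: the perplexity rows are exp() of the loss rows, so the two metrics are the same measurement in different units and the reported numbers are internally consistent:

```python
# Perplexity is exp(mean token loss), so the last two table rows must agree:
import math
print(math.exp(0.607))  # ~1.835 (base)
print(math.exp(0.414))  # ~1.513 (Carnice-V2)
```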
V1 vs V2
| Aspect | Carnice-27b (V1) | Carnice-V2-27b |
|---|---|---|
| Base | Qwen3.5-27B | Qwen3.6-27B |
| Pipeline | Trinity (3 stages: backbone → alignment → polish) | Single-pass SFT on a refined data mix |
| Reproducible benchmarks | Not attached at release | IFEval + assistant-token loss reported |
| Data | Carnice + DJLougen + Lambda GLM-5.1 | Same families, more rows + cleaner curation |
Use cases
Carnice is built for the Hermes-Agent harness, which means the obvious fits are:
- Terminal-driving agents that need to chain shell commands without losing context.
- Repo-aware coding agents that read, edit, and reason across many files.
- Browser automations with multi-step tool use.
- Local debugging copilots where a closed API call per turn is too slow or too expensive.
If your stack is a general chatbot, Q&A bot, or RAG pipeline, a generic Qwen3.6 instruct or a smaller Carnice-9b will probably serve you better.
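If you want to smoke-test the agentic side without wiring up the full Hermes-Agent harness, a generic tool-calling loop against a local llama.cpp server is enough to get a feel for it. A minimal sketch, assuming a recent `llama-server` started with tool-call support, for example `llama-server -m carnice-v2-27b-Q5_K_M.gguf -ngl 99 -c 8192 --jinja --port 8080`; the `run_shell` tool and the loop are illustrative, not part of the release:

```python
# NOT the Hermes-Agent harness: a generic OpenAI-style tool loop against a
# local llama-server endpoint, useful only as a quick feel-test.
import json
import subprocess
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",  # illustrative tool, not from the model card
        "description": "Run a shell command and return stdout.",
        "parameters": {
            "type": "object",
            "properties": {"cmd": {"type": "string"}},
            "required": ["cmd"],
        },
    },
}]

messages = [{"role": "user", "content": "Count the Python files in this repo."}]
for _ in range(4):  # hard cap on tool-use turns
    msg = client.chat.completions.create(
        model="carnice-v2-27b", messages=messages, tools=tools
    ).choices[0].message
    if not msg.tool_calls:
        print(msg.content)
        break
    messages.append(msg)
    for call in msg.tool_calls:
        cmd = json.loads(call.function.arguments)["cmd"]
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": result.stdout[:4000]}
        )
```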
Limitations & pricing
Free under Apache 2.0 — but “runs on a 3090” needs an asterisk. The full BF16 checkpoint is 53.8 GB, which does not fit on a 24 GB card. The consumer-GPU story relies entirely on the GGUF release:
| Quant | Size | Fits 3090 (24 GB)? |
|---|---|---|
| IQ2_M | 9.4 GB | Yes (also fits 16 GB cards) |
| Q2_K | 10 GB | Yes |
| Q4_K_M | 16.5 GB | Yes (max context headroom) |
| Q5_K_M | 19.2 GB | Yes (best quality on 3090) |
| Q8_0 | 28.6 GB | No |
| BF16 | 53.8 GB | No (server-class only) |
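Those file sizes line up with the usual llama.cpp bits-per-weight for each quant, so they are easy to sanity-check. The bpw values below are approximate community figures, not from the release:

```python
# Rough check: GGUF size ~= parameter count * bits-per-weight / 8.
# bpw values are approximate llama.cpp conventions, not from the model card.
PARAMS = 27e9
BPW = {"IQ2_M": 2.8, "Q2_K": 2.96, "Q4_K_M": 4.9, "Q5_K_M": 5.7, "Q8_0": 8.5, "BF16": 16.0}
for name, bpw in BPW.items():
    print(f"{name}: {PARAMS * bpw / 8 / 1e9:.1f} GB")
# Prints ~9.4, 10.0, 16.5, 19.2, 28.7, 54.0: close to the table within rounding.
```

Remember that weights are not the whole VRAM budget: KV cache and activations need room too, which is why the Q4_K_M row is flagged as the max-headroom option.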
Other caveats: you need a recent llama.cpp build because of the hybrid attention/SSM layers; harness-level Hermes-Agent task scores have not been published yet; and there is no formal head-to-head V1-vs-V2 eval in the model card.
What is next
The author hints that reproducible benchmark runs on a dedicated box are still in progress for the 27B line, and Carnice-9b already exists for smaller setups. If V2 holds up under independent harness evals, it becomes one of the more interesting open-weight options for self-hosted agentic stacks alongside the larger Qwen and Llama tunes.
Quick start: pull the GGUF, point a recent llama-cli at it, and try it inside your own agent loop before committing.
```
llama-cli -m carnice-v2-27b-Q5_K_M.gguf -ngl 99 -c 8192  # -ngl 99 offloads all layers
```

Sources: Carnice-V2-27b on Hugging Face, GGUF release, Carnice-27b (V1), original announcement on X.

