TL;DR
- Carnice-V2-27b is a 27B open-weight agent model fine-tuned on top of Qwen/Qwen3.6-27B, optimized for the Hermes-Agent harness.
- It beats its base on IFEval prompt strict by +5 pp (85% → 90%) and cuts held-out assistant-token perplexity by 17.5% (1.835 → 1.513).
- Fully merged BF16 weights (no LoRA) under Apache 2.0. GGUF quants range from 9.4 GB (IQ2_M) up to 53.8 GB (BF16); Q5_K_M at 19.2 GB is the sweet spot for an RTX 3090.
- Successor to Carnice-27b (V1, on Qwen3.5-27B), trained on more and cleaner Hermes-style traces.
What is new
Released today by Kai Stephens (kai-os) with credits to NousResearch and Lambda, Carnice-V2-27b is a single-pass supervised fine-tune that ships as a full merged BF16 model rather than an adapter. That detail matters: you do not have to glue a LoRA onto a base checkpoint at inference time, and llama.cpp / vLLM / TGI can load it directly.
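In practice, a merged checkpoint loads like any other Hugging Face causal LM. A minimal sketch, assuming the weights live at a repo id like `kai-os/Carnice-V2-27b` (the exact id is not confirmed here; check the model card):

```python
# Minimal load sketch for the merged BF16 checkpoint. The repo id is a guess
# based on the author's handle, not a confirmed path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kai-os/Carnice-V2-27b"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights ship as BF16 safetensors
    device_map="auto",           # shard across available GPUs
)

prompt = "List the files in the current directory, then summarize README.md."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

No adapter merging step, no PEFT dependency: that is the practical payoff of shipping merged weights.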
Compared to V1, V2 swaps the base from Qwen3.5-27B to Qwen3.6-27B and rebuilds the data mix around three families of agent traces: Carnice in-house, DJLougen Hermes, and Lambda GLM-5.1 Hermes. Author summary on X: “more and better data.”
Why it matters
Open-weight 27B models that are actually tuned for the Hermes-Agent style of tool use are rare. Most open releases are general chat fine-tunes — strong on MMLU, weak when you ask them to drive a terminal, edit a repo, or chain four browser actions without losing the plot. Carnice's whole reason for existing is that narrow agentic harness, which makes it interesting for indie devs who want a local, royalty-free agent backbone instead of paying per token to a closed API.
The author goes further and claims V2 can “beat models 10x the size” inside the Hermes-Agent harness. Treat that as an author claim until reproducible runs land — but the direction is plausible: harness-specific fine-tunes routinely punch above their weight on the exact tasks they were trained for.
Technical facts
| Property | Carnice-V2-27b |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Parameters | 27B (BF16 safetensors) |
| Fine-tune type | Full merged SFT (no LoRA) |
| Train rows / windows | 3,473 rows → 6,554 windows (8,192 tok, 1,024 overlap; see the sketch after this table) |
| Eval examples | 110 |
| Data mix | 1,508 Carnice + 1,015 DJLougen Hermes + 950 Lambda GLM-5.1 Hermes |
| License | Apache 2.0 |
| Architecture quirk | qwen35 arch with hybrid attention/SSM — needs a recent llama.cpp build |
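The rows-to-windows expansion is plain sliding-window packing: each window covers 8,192 tokens and advances by 7,168 tokens (window minus overlap). The author's packing code is not published; this is just a sketch of the arithmetic:

```python
# Window arithmetic implied by the card: 8,192-token windows with 1,024 tokens
# of overlap, so each new window advances by a stride of 8,192 - 1,024 = 7,168.
# A sketch of the mechanics, not the author's actual packing code.
def make_windows(tokens: list[int], window: int = 8192, overlap: int = 1024):
    stride = window - overlap
    windows, start = [], 0
    while True:
        windows.append(tokens[start : start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return windows

# A 20,000-token row yields 3 windows: [0:8192], [7168:15360], [14336:20000]
assert len(make_windows(list(range(20_000)))) == 3
```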
Reported deltas vs base Qwen3.6-27B (IFEval limit=20 smoke test, plus held-out assistant-token eval):
| Metric | Base | Carnice-V2 | Delta |
|---|---|---|---|
| IFEval prompt strict | 85.0% | 90.0% | +5.0 pp |
| IFEval prompt loose | 85.0% | 90.0% | +5.0 pp |
| IFEval instruction strict | 90.0% | 93.3% | +3.3 pp |
| IFEval instruction loose | 90.0% | 93.3% | +3.3 pp |
| Held-out eval loss | 0.607 | 0.414 | −31.8% |
| Held-out perplexity | 1.835 | 1.513 | −17.5% |
The author flags IFEval at limit=20 as a smoke test, not a leaderboard score. Read it as “moved in the right direction during training,” not as a final verdict.
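One check you can run on the table itself: the perplexity rows are exp() of the loss rows, so the two metrics are the same measurement in different units and the reported numbers are internally consistent:

```python
# Perplexity is exp(mean token loss), so the last two table rows must agree:
import math
print(math.exp(0.607))  # ~1.835 (base)
print(math.exp(0.414))  # ~1.513 (Carnice-V2)
```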
V1 vs V2
| Aspect | Carnice-27b (V1) | Carnice-V2-27b |
|---|---|---|
| Base | Qwen3.5-27B | Qwen3.6-27B |
| Pipeline | Trinity (3 stages: backbone → alignment → polish) | Single-pass SFT on a refined data mix |
| Reproducible benchmarks | Not attached at release | IFEval + assistant-token loss reported |
| Data | Carnice + DJLougen + Lambda GLM-5.1 | Same families, more rows + cleaner curation |
Use cases
Carnice is built for the Hermes-Agent harness, which means the obvious fits are:
- Terminal-driving agents that need to chain shell commands without losing context.
- Repo-aware coding agents that read, edit, and reason across many files.
- Browser automations with multi-step tool use.
- Local debugging copilots where a closed API call per turn is too slow or too expensive.
If your stack is a general chatbot, Q&A bot, or RAG pipeline, a generic Qwen3.6 instruct or a smaller Carnice-9b will probably serve you better.
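If you want to smoke-test the agentic side without wiring up the full Hermes-Agent harness, a generic tool-calling loop against a local llama.cpp server is enough to get a feel for it. A minimal sketch, assuming a recent `llama-server` started with tool-call support, for example `llama-server -m carnice-v2-27b-Q5_K_M.gguf -ngl 99 -c 8192 --jinja --port 8080`; the `run_shell` tool and the loop are illustrative, not part of the release:

```python
# NOT the Hermes-Agent harness: a generic OpenAI-style tool loop against a
# local llama-server endpoint, useful only as a quick feel-test.
import json
import subprocess
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",  # illustrative tool, not from the model card
        "description": "Run a shell command and return stdout.",
        "parameters": {
            "type": "object",
            "properties": {"cmd": {"type": "string"}},
            "required": ["cmd"],
        },
    },
}]

messages = [{"role": "user", "content": "Count the Python files in this repo."}]
for _ in range(4):  # hard cap on tool-use turns
    msg = client.chat.completions.create(
        model="carnice-v2-27b", messages=messages, tools=tools
    ).choices[0].message
    if not msg.tool_calls:
        print(msg.content)
        break
    messages.append(msg)
    for call in msg.tool_calls:
        cmd = json.loads(call.function.arguments)["cmd"]
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": result.stdout[:4000]}
        )
```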
Limitations & pricing
Free under Apache 2.0 — but “runs on a 3090” needs an asterisk. The full BF16 checkpoint is 53.8 GB, which does not fit on a 24 GB card. The consumer-GPU story relies entirely on the GGUF release:
| Quant | Size | Fits 3090 (24 GB)? |
|---|---|---|
| IQ2_M | 9.4 GB | Yes (also fits 16 GB cards) |
| Q2_K | 10 GB | Yes |
| Q4_K_M | 16.5 GB | Yes (max context headroom) |
| Q5_K_M | 19.2 GB | Yes (best quality on 3090) |
| Q8_0 | 28.6 GB | No |
| BF16 | 53.8 GB | No (server-class only) |
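Those file sizes line up with the usual llama.cpp bits-per-weight for each quant, so they are easy to sanity-check. The bpw values below are approximate community figures, not from the release:

```python
# Rough check: GGUF size ~= parameter count * bits-per-weight / 8.
# bpw values are approximate llama.cpp conventions, not from the model card.
PARAMS = 27e9
BPW = {"IQ2_M": 2.8, "Q2_K": 2.96, "Q4_K_M": 4.9, "Q5_K_M": 5.7, "Q8_0": 8.5, "BF16": 16.0}
for name, bpw in BPW.items():
    print(f"{name}: {PARAMS * bpw / 8 / 1e9:.1f} GB")
# Prints ~9.4, 10.0, 16.5, 19.2, 28.7, 54.0: close to the table within rounding.
```

Remember that weights are not the whole VRAM budget: KV cache and activations need room too, which is why the Q4_K_M row is flagged as the max-headroom option.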
Other caveats: you need a recent llama.cpp build because of the hybrid attention/SSM layers; harness-level Hermes-Agent task scores have not been published yet; and there is no formal head-to-head V1-vs-V2 eval in the model card.
What is next
The author hints that reproducible benchmark runs on a dedicated box are still in progress for the 27B line, and Carnice-9b already exists for smaller setups. If V2 holds up under independent harness evals, it becomes one of the more interesting open-weight options for self-hosted agentic stacks alongside the larger Qwen and Llama tunes.
Quick start: pull the GGUF, point a recent llama-cli at it, and try it inside your own agent loop before committing.
```
llama-cli -m carnice-v2-27b-Q5_K_M.gguf -ngl 99 -c 8192  # -ngl 99 offloads all layers
```

Sources: Carnice-V2-27b on Hugging Face, GGUF release, Carnice-27b (V1), original announcement on X.

