- TeichAI distilled 250 Claude Opus 4.5 high-reasoning traces into an 8B Qwen3 model for $52.30.
- The result: step-by-step Opus-style thinking that runs on consumer hardware via llama.cpp or Ollama.
TL;DR
TeichAI/Qwen3-8B-Claude-4.5-Opus-High-Reasoning-Distill is a new 8B open-weights model fine-tuned from Qwen/Qwen3-8B-Base by distilling reasoning traces from Claude Opus 4.5 at high reasoning effort. Training used just 250 curated samples (2.13M tokens) and cost $52.30. GGUF quants from 4.12 GB to 8.71 GB fit 6–16 GB GPUs, meaning Opus-style step-by-step reasoning now runs locally on a laptop.
What's new
Most reasoning-focused open models retrain on massive synthetic chain-of-thought corpora. TeichAI took a sharper knife: collect a small, high-quality set of Opus 4.5 traces generated with high reasoning effort, then SFT a Qwen3-8B base on them. The pitch is not raw benchmark points — it's behavior transfer. The model learns to decompose problems, plan sub-steps, and verify before answering, the way Opus does, without the Opus price tag.
TeichAI ships both Safetensors (BF16) weights and a full ladder of GGUF quantizations through a companion GGUF repo, so llama.cpp / Ollama / LM Studio users can plug it in today.
Why it matters
Claude Opus is excellent at multi-step reasoning, but it's a closed API with per-token cost and no local option. For devs building agents, offline tools, or privacy-sensitive apps, running something Opus-shaped locally on an 8 GB consumer GPU is a big unlock. It also demonstrates a surprising economic point: you do not need millions of samples to transfer a reasoning style. 250 well-chosen Opus traces and ~$50 of GPU time produced a usable artifact.
Technical facts
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-8B-Base |
| Parameters | 8B (all active, BF16) |
| Teacher | Claude Opus 4.5 (high reasoning effort) |
| Dataset | TeichAI/claude-4.5-opus-high-reasoning-250x |
| Training samples | 250 |
| Total tokens | 2.13M (input + output) |
| Training cost | $52.30 USD |
| Training framework | Unsloth (4-bit base) |
| Formats shipped | Safetensors BF16 + GGUF Q3/Q4/Q6/Q8 |
GGUF size & VRAM
| Quant | File size | Min VRAM | Recommended |
|---|---|---|---|
| Q3_K_M | 4.12 GB | 6 GB | 8 GB |
| Q4_K_M | 5.03 GB | 8 GB | 12 GB |
| Q6_K | 6.73 GB | 10 GB | 16 GB |
| Q8_0 | 8.71 GB | 12 GB | 16 GB+ |
Q4_K_M is the sweet spot for an RTX 3060/4060 or an M-series Mac with 16 GB unified memory.
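To fetch a single quant directly, `huggingface-cli download` works; the repo and file names below are assumptions (check the GGUF repo page for the exact spelling):

```bash
# Hypothetical repo and file names: verify on the GGUF repo before running
huggingface-cli download \
  TeichAI/Qwen3-8B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF \
  Qwen3-8B-Claude-4.5-Opus-High-Reasoning-Distill.Q4_K_M.gguf \
  --local-dir .
```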
Comparison
TeichAI's drop sits inside a fast-growing niche of Claude-distilled open models. Jackrong's Qwen3.5 collection distilled Claude Opus 4.6 traces into 4B / 9B / 27B / 35B variants using ~14,000 samples. Their 9B v2 reports ~20% fewer reasoning tokens while matching or beating the base model on HumanEval/HumanEval+ — strong evidence that Opus-style reasoning compresses well.
TeichAI's bet is the opposite end of the dataset axis: 250 very high-quality samples from a higher reasoning-effort setting. Smaller, cheaper, more targeted. The tradeoff is less coverage — no official benchmark has been published yet — but the model fits a specific slot: consumer-GPU agents that need structured thinking, not Swiss-army generalization.
Running it
Grab a GGUF and load it with the tool you already use; a runnable sketch follows below. For llama.cpp: `./main -m q4_k_m.gguf -n 512 -p "Your prompt"`. For Ollama, create a Modelfile pointing at the GGUF and run `ollama create qwen3-opus -f Modelfile`. LM Studio and text-generation-webui auto-detect the chat template. Because the model is trained to emit a structured thinking pass before answering, give it room: set `-n 1024` or higher and don't truncate reasoning tokens at generation time. On a 16 GB M2 MacBook Air, Q4_K_M averages roughly 25–35 tokens/sec, plenty for interactive agents.
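A minimal end-to-end sketch for both runtimes, assuming the hypothetical Q4_K_M file name from the download step above (newer llama.cpp builds ship the CLI as `llama-cli` rather than `main`):

```bash
# llama.cpp: load the quant and leave generation headroom for the thinking pass
./llama-cli -m Qwen3-8B-Claude-4.5-Opus-High-Reasoning-Distill.Q4_K_M.gguf \
  -n 1024 -p "Plan, step by step, how to deduplicate a 10GB CSV on a laptop."

# Ollama: wrap the same GGUF in a Modelfile, then create and run it
cat > Modelfile <<'EOF'
FROM ./Qwen3-8B-Claude-4.5-Opus-High-Reasoning-Distill.Q4_K_M.gguf
PARAMETER num_predict 1024
EOF
ollama create qwen3-opus -f Modelfile
ollama run qwen3-opus "Plan, step by step, how to deduplicate a 10GB CSV on a laptop."
```

The `num_predict 1024` parameter mirrors the `-n 1024` advice above: both keep the runtime from cutting the model off mid-reasoning.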
Use cases
- Local coding copilots on 8GB GPUs where sending code to a cloud API is off the table.
- Agentic workflows needing multi-step planning — research agents, browser automation, task decomposition.
- Education & tutoring — the structured "break it down, verify, answer" pattern is pedagogically useful.
- Edge deployment on laptops or mini-PCs, with latency and data-residency benefits over hosted APIs.
- Research into how far tiny curated distillation sets can go.
Limitations & pricing
- No published benchmarks vs base Qwen3-8B or peers — early adopters are doing their own evals.
- No inference providers have deployed it yet; run it yourself via llama.cpp, Ollama, LM Studio, or vLLM (see the sketch after this list).
- 250 samples is tiny. Expect strong in-domain behavior and possible brittleness on out-of-domain prompts.
- Licensing isn't spelled out clearly on the card — it inherits base Qwen3 terms plus any dataset constraints. Check before shipping commercially.
- Cost to use: free download; your only cost is local inference compute.
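For the vLLM route mentioned in the list above, serving the BF16 Safetensors checkpoint is one command; `--max-model-len 8192` is an illustrative value, not a recommendation from the model card:

```bash
# Serves an OpenAI-compatible API on localhost:8000 from the BF16 weights
vllm serve TeichAI/Qwen3-8B-Claude-4.5-Opus-High-Reasoning-Distill \
  --max-model-len 8192
```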
What's next
TeichAI already has companion models — a 4B Qwen3-Thinking variant and a Nemotron-Orchestrator-8B Opus distill — hinting at a multi-agent stack where a small thinker plans and a larger executor acts. Expect community benchmarks, DPO-refined successors, and more size points in the coming weeks. The broader pattern is clear: Claude Opus behavior is escaping into open weights one 8B distill at a time, and the barrier to entry is roughly the price of a dinner.
Sources: TeichAI model card, GGUF repo, Jackrong 9B v2.

