Unsloth Studio chạy 2-bit Qwen3.6-27B trên 12GB RAM — triage 15 GitHub issue, gọi 26 tool call local

TL;DR

Unsloth vừa trình diễn bản 2-bit GGUF của Qwen3.6-27B chạy local trong Unsloth Studio với chỉ 12GB RAM. Trong demo, model thực hiện 26 tool call, triage 15 GitHub issue kèm fix, và repro + test 3 issue mới nhất trong repo chính của Unsloth. Studio cũng vừa có diện mạo mới: Data Recipes theo kiểu visual-node, Model Arena so sánh song song, sandbox code execution theo phong cách Claude Artifacts, và Think toggle cho hybrid reasoning. Tất cả miễn phí, offline, Apache 2.0 / AGPL-3.0.

What's new

Điểm nóng không nằm ở chuyện Qwen3.6-27B tồn tại — nó nằm ở chỗ bản quantize 2-bit vẫn đủ chất lượng để làm việc agentic thật. File UD-Q2_K_XL chỉ 11.8GB, nhét gọn vào máy 16GB RAM phổ thông, và Unsloth đã chứng minh nó không chỉ trả lời được — nó làm được việc nhiều bước: gọi tool lặp đi lặp lại, đọc hiểu context GitHub issue, chỉ ra chỗ cần sửa, và viết code fix.

Song song, Unsloth Studio (đang Beta) được giới thiệu như một web UI no-code chạy 100% local: search + download + inference cho GGUF và safetensors, self-healing tool calling, web search, Bash + Python execution trong sandbox, auto-tune inference params, và export sang GGUF / 16-bit safetensors.

Why it matters

Trước đây, để làm agentic coding nghiêm túc local, bạn cần 32GB+ RAM hoặc card VRAM cao. Con số 12GB RAM + 2-bit đẩy ngưỡng đó xuống laptop thường. Và khác với demo show-off, đây là workload thật: triage bug tracker, đọc issue, map code, chạy test. Nếu một máy 12GB có thể làm basic project triage offline, thì cost và privacy calculus cho dev indie, team nhỏ và enterprise ngại SaaS đều thay đổi.

Technical facts

Base model: Qwen3.6-27B dense, 64 layers, hidden 5120, native context 262,144 token (extend tới ~1M với YaRN), có vision encoder.
2-bit GGUF (Unsloth Dynamic 2.0): UD-IQ2_XXS 9.39GB, UD-IQ2_M 10.8GB, UD-Q2_K_XL 11.8GB (khuyến nghị).
Hardware requirement (unified RAM + VRAM): 3-bit 15GB, 4-bit 18GB, 6-bit 24GB, 8-bit 30GB, BF16 55GB.
KL Divergence (27B): 8-bit 0.0028, 4-bit 0.0227, 3-bit 0.0734 — đặt Unsloth top-performing ở 21/22 size.
Tốc độ community-reported: RTX 5090 Q6_K @123k context → ~50 t/s; RX 7900 XTX Q5_K_M → 30 t/s generate + 625 t/s prefill; M5 Pro 128GB → ~25 t/s; M1 Max 32GB Q4 → ~9 t/s; Strix Halo 128GB Q8 → 20-25 t/s.
Demo agentic 2-bit: 26 tool call, triage 15 issue, repro 3 repo issue. Demo khác: 30+ tool call, duyệt 20 site, chạy Python. User khác: sinh website HTML 16k token, 1.7k dòng từ 1 prompt.

Comparison

So với Qwen3.5 (Feb 2026), 3.6 ưu tiên stability và real-world utility: frontend workflow mượt hơn, repo-level reasoning tốt hơn, và có option mới preserve_thinking giữ reasoning trace xuyên turn — lợi cho agent, cắt redundant token, tối ưu KV cache. Tool-call parsing cũng được cải thiện cho nested object.

So với các model cùng phân khúc:

Match-up	Kết quả community
Qwen3.6-27B vs Gemma 4	Qwen ít lazy tool-call hơn, website output "way better", Gemma 4 hay đi chệch hướng trên coding.
Qwen3.6-27B vs Claude Opus	Opus vẫn dẫn ở việc sửa code phức tạp có sẵn. Qwen tốt làm sub-agent cho frontier coordinator.
Qwen3.6-27B vs MiniMax-M2.7	Qwen thắng trên 3-prompt test (NumPy layer-norm backward, CUDA fused softmax+top-k, KV-cache autoregressive).
Qwen3.6-27B vs GLM-5 (lớn hơn nhiều)	Qwen thắng 2/3 implementation trong cùng test, với k2.6 làm judge.

Use cases

Indie dev: prompt "investigate https://github.com/org/repo/issues/XYZ" là đủ để model chỉ ra chỗ cần nhìn trong codebase. Automate terminal work qua Qwen Code.
Generative web dev: 2-bit one-shot ra website 1.7k dòng HTML.
Researcher / hobbyist: Upload PDF/CSV/JSON/DOCX vào Data Recipes, auto tạo dataset, fine-tune không cần viết script; Model Arena so sánh base vs LoRA side-by-side.
Privacy-first dev và enterprise: 100% offline, không telemetry, encrypted password + JWT; tránh rủi ro SaaS ban account, đổi token limit, hay deprecate model.

Studio refresh có: Data Recipes dạng graph-node visual workflow, Model Arena split-screen, Claude-style Artifacts trong chat, dashboard observability live (loss, gradient norm, GPU util) xem được từ điện thoại, Think toggle cho hybrid reasoning model, auto-tune inference params, edit chat template.

Limitations & pricing

Trade-off 2-bit: chất lượng bắt đầu rơi rõ từ 3-bit xuống, error compound trên long session. 4-bit đã được cộng đồng coi là "far from lossless" cho agentic long-context.
Ollama incompat: Qwen3.6 GGUFs hiện không chạy trong Ollama (do mmproj vision file tách) — dùng llama.cpp-compatible backend.
CUDA 13.2 cảnh báo: gây gibberish output, NVIDIA đang fix.
Giá: free hoàn toàn. Có free Colab notebook.
License: Unsloth dual-license (core Apache 2.0, Studio UI AGPL-3.0). Weights Qwen3.6-27B Apache 2.0.
Availability: Studio Beta. Windows / Linux / WSL đủ tính năng. macOS hiện chỉ chat + Data Recipes (MLX training sắp có). AMD chat OK, train qua Unsloth Core.

What's next

Roadmap của Unsloth: desktop app Studio dự kiến ra trong tháng này (đang test), OpenAI-compatible inference API có thể sớm trong tuần, Apple MLX training + AMD training trong Studio sắp có, multi-GPU major upgrade (phối hợp với NVIDIA), manual context-length slider, Intel Arc GPU inference (đang được cộng đồng request). MLX quant algorithm tiếp tục được tinh chỉnh cho Mac.

Với phần cứng phổ thông đã đủ chạy agentic demo nghiêm túc, câu hỏi không còn là "local có khả thi không?" mà là "khi nào bạn rời SaaS?"

Nguồn: Unsloth Docs — Qwen3.6, Introducing Unsloth Studio, Qwen3.6-27B-GGUF trên Hugging Face, Hacker News thread, GitHub unslothai/unsloth.

Unsloth Studio chạy 2-bit Qwen3.6-27B trên 12GB RAM — triage 15 GitHub issue, gọi 26 tool call local

TL;DR

What's new

Why it matters

Technical facts

Comparison

Use cases

Limitations & pricing

What's next

Tiếp tục lướt

Huihui4-8B-A4B: cắt 96 expert khỏi Gemma 4 mà perplexity vẫn đẹp hơn bản gốc

Carnice-V2-27b: a 27B open-source agent model built on Qwen3.6 lands on Hugging Face

Qwen3.6-27B chạy local trên MacBook Pro: model 27B đánh bại 397B trên benchmark coding

Free CLI Agent: Pi + Ollama + Gemma 4 + Parallel Search MCP — $0, No API Keys

SmallClaw: AI agent framework local-first cho small models, chạy ngon trên laptop 8GB RAM