Model-Harness Fit: Why Switching Models Is Not as Simple as Swapping an API Key

TL;DR

The harness is part of the model's effective parameters. Post-training embeds the tool surface, schema shapes, memory rituals, and system prompt structure into the model's instincts. You can carry the weights to another harness, but not the instincts. Those instincts only fire when the harness presents the world the way the model was taught to see it.

The consequence: the matched pair (model + harness) is the right unit of analysis, not the model on its own. And a matched pair is not static: the right harness for a model released in March is not the right harness for its successor in October.

The numbers don't lie

Terminal-Bench 2.0 ranks harness + model pairs, not models alone. From the data as of April 30, 2026:

Pair	Pass rate
Codex CLI + GPT-5.5	82.0%
ForgeCode + Claude Opus 4.6	79.8%
Capy + Claude Opus 4.6	75.3%
Claude Code + Claude Opus 4.6	58.0%

With the same Claude Opus 4.6 weights, three different harnesses span a 21.8-point gap (79.8% vs. 58.0%). Anthropic's own harness ranks last among the harnesses running Anthropic's model.

Endor Labs' benchmark from the same week: GPT-5.5 in Codex scored 61.5%, while GPT-5.5 in Cursor scored 87.2% - a 25.7-point difference that comes not from fine-tuning or better prompting, but purely from changing the runtime. LangChain went from 52.8% to 66.5% (from the top 30 to the top 5) without changing the model.
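
The gaps quoted above can be re-derived from the published figures. The short Python sketch below does only that arithmetic; the variable and function names are illustrative and not taken from any benchmark tooling.

```python
# Pass rates (percent) for Claude Opus 4.6, as quoted in the table above.
opus_46 = {
    "ForgeCode": 79.8,
    "Capy": 75.3,
    "Claude Code": 58.0,
}


def spread(scores):
    """Gap in percentage points between the best and worst score."""
    return round(max(scores.values()) - min(scores.values()), 1)


opus_spread = spread(opus_46)            # same weights, three harnesses
gpt_55_gap = round(87.2 - 61.5, 1)       # GPT-5.5: Cursor vs. Codex (Endor Labs)
langchain_gain = round(66.5 - 52.8, 1)   # LangChain: harness changes only

print(opus_spread, gpt_55_gap, langchain_gain)
```

Rounding to one decimal place avoids floating-point noise such as 21.799999999999997.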

Three harnesses, three contracts

Each major harness picks a different orchestration protocol, and the model is trained on exactly that wire format:

Codex CLI: a typed async protocol - the model emits a Submission and receives a stream of Events back. The model is taught to emit submissions and consume events.
Claude Code: a direct typed conversation loop - AssistantEvent variants (TextDelta, ToolUse, Usage, MessageStop). The protocol is the Anthropic Messages API plus an in-process tool dispatcher.
GitHub Copilot CLI: a supervisor protocol using JSON-RPC over stdio - the agent loop runs in a child process, and the host receives session.event notifications. This lets a single binary serve the terminal, the cloud agent, and third-party hosts.

These are not three implementations of the same idea - they are three different contracts between model and runtime.
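
To make the supervisor-style contract concrete, here is a minimal Python sketch of a host consuming a JSON-RPC 2.0 notification. Only the method name session.event comes from the text above. The newline-delimited framing, the shape of params, and the helper names parse_notification and handle are assumptions made for illustration, not Copilot CLI's actual wire format.

```python
import json


def parse_notification(line):
    """Return (method, params) for a JSON-RPC 2.0 notification, else None."""
    msg = json.loads(line)
    if msg.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 message")
    # Per the JSON-RPC 2.0 spec, a notification is a request without "id".
    if "id" in msg or "method" not in msg:
        return None
    return msg["method"], msg.get("params", {})


def handle(line, handlers):
    """Dispatch one message to a registered handler; return True if handled."""
    parsed = parse_notification(line)
    if parsed is None:
        return False
    method, params = parsed
    handler = handlers.get(method)
    if handler is None:
        return False
    handler(params)
    return True


seen = []
handlers = {"session.event": seen.append}

# Hypothetical payload: the fields inside "params" are invented for this sketch.
wire = json.dumps({
    "jsonrpc": "2.0",
    "method": "session.event",
    "params": {"type": "text_delta", "text": "hi"},
})
handled = handle(wire, handlers)
```

A host written against this contract has nothing to do with a Submission/Event stream or an AssistantEvent loop, which is the sense in which the three protocols are different contracts rather than interchangeable transports.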

The tool surface - where post-training shows most clearly

Cursor's research team puts it precisely: "OpenAI's models are trained to edit files using a patch-based format, while Anthropic's models are trained on string replacement. Either model could use either tool, but giving it the unfamiliar one costs extra reasoning tokens and produces more mistakes."

Codex uses apply_patch (Lark grammar). Claude Code uses Edit (old_string / new_string). This is not a preference - it is a byte-level convention baked in during post-training. Beyond that, each harness has tools that simply do not exist in the other: Codex has 8 verbs for subagent dispatch, and Claude Code has Monitor for streaming stdout from a background process. A harness can wrap these differences with a router, but a router cannot give a model an instinct it did not acquire in training.
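
As an illustration of the string-replacement style of edit, here is a minimal Python sketch. The parameter names old_string and new_string come from the text above. The function name apply_string_edit and the rule that old_string must match exactly once are assumptions of this sketch, not a description of how Claude Code's Edit tool is implemented.

```python
def apply_string_edit(text, old_string, new_string):
    """Replace a single, unambiguous occurrence of old_string in text."""
    if not old_string:
        raise ValueError("old_string must not be empty")
    count = text.count(old_string)
    if count == 0:
        raise ValueError("old_string not found")
    if count > 1:
        # Assumption of this sketch: ambiguous edits are rejected, not guessed.
        raise ValueError("old_string matches more than once")
    return text.replace(old_string, new_string, 1)


source = "def area(r):\n    return 3.14 * r * r\n"
edited = apply_string_edit(source, "3.14", "math.pi")
```

A patch-based tool expresses the same change as a diff hunk with surrounding context instead. A model trained on one shape has to reason its way into the other, which is the extra cost the Cursor quote describes.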

The memory layer and the citation contract

Three memory architectures, three kinds of "signature" the model uses to talk back to the harness:

Codex: the model emits <oai-mem-citation thread_id="xyz"> each time it uses a memory. The harness parses the tag and bumps usage_count in SQLite - this is the signal that lets unused memories decay.
Claude Code: no citation tag. The Read tool plus a verification grep serve as the signal, and the memory index MEMORY.md is loaded on every turn via system-reminder.
Copilot CLI: a server-side backend, with tracking handled remotely.

Cross-harness consequences: run a Codex-trained model on the Claude Code harness, and the model emits <oai-mem-citation> in its assistant text, the harness does not parse it, the user sees raw XML, and the decay loop never runs. Run a Claude-trained model on the Codex harness, and there is no citation tag, usage_count never increases, and Codex evicts good memories because they look unused. A single XML tag is the line between a memory system that improves over time and one that silently degrades.
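
The citation loop can be sketched in a few lines of Python. The tag name, the thread_id attribute, the usage_count counter, and the use of SQLite come from the text above. The memories table schema, the regular expression, and the function record_citations are hypothetical, chosen only to show the mechanism, and are not Codex's actual implementation.

```python
import re
import sqlite3

# Matches the opening tag shown in the text; attribute layout is an assumption.
CITATION = re.compile(r'<oai-mem-citation\s+thread_id="([^"]+)"\s*/?>')


def record_citations(conn, assistant_text):
    """Bump usage_count per cited memory; return the text with tags removed."""
    for thread_id in CITATION.findall(assistant_text):
        conn.execute(
            "UPDATE memories SET usage_count = usage_count + 1 "
            "WHERE thread_id = ?",
            (thread_id,),
        )
    conn.commit()
    return CITATION.sub("", assistant_text)


# Hypothetical schema, kept in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memories ("
    "thread_id TEXT PRIMARY KEY, "
    "usage_count INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO memories (thread_id) VALUES (?)", [("xyz",), ("abc",)]
)

reply = 'Use tabs in this repo. <oai-mem-citation thread_id="xyz">'
visible = record_citations(conn, reply)
counts = dict(conn.execute("SELECT thread_id, usage_count FROM memories"))
```

Both failure modes described above fall out of this shape. A harness without record_citations shows the raw tag to the user, and a model that never emits the tag leaves every usage_count at zero, so useful memories look stale to any eviction policy keyed on that counter.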

The co-evolution loop - why matched pairs keep getting harder to break

LangChain's Vivek Trivedy names the mechanism: "Useful primitives are discovered, added to the harness, and then used when training the next generation of models. As this cycle repeats, models become more capable within the harness they were trained in."

The loop: a new primitive ships - it shows up in millions of traces - those traces become training data for the next model - the primitive gets baked into instinct. Moving to a foreign harness means forfeiting every one of those compounding cycles.

But the assumptions baked into a harness also go stale over time. When Opus 4.6 arrived, "context anxiety" disappeared, and a whole class of defensive scaffolding became dead code. The ceiling moved up: what is needed now is multi-day memory policy, UI evaluators, and multi-agent coordinators. The harness does not shrink - it shifts.

Practical consequences

Agent platform builders: ship model + harness as a pair. GitHub Copilot CLI gets this right - it exposes apply_patch only when the model is in the Codex family, enables ToolSearch deferred loading only for Claude models, and runs a Critic agent on a complementary model. This is genuine per-model tool routing, not a common denominator. Any vendor that pretends models are portable is underperforming on every model it serves.
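
Per-model tool routing of this kind can be sketched as a lookup keyed on model family. The tool names apply_patch, Edit, and ToolSearch appear in the text above. Everything else in the Python below (the base tools, the substring-based family detection, the model identifiers) is hypothetical and is not Copilot CLI's actual routing table.

```python
# Hypothetical tools every model receives in this sketch.
BASE_TOOLS = frozenset({"shell", "read_file"})

# Tools exposed only to the family that was trained on them.
FAMILY_TOOLS = {
    "codex": frozenset({"apply_patch"}),
    "claude": frozenset({"Edit", "ToolSearch"}),
}


def model_family(model_id):
    """Classify a model id by substring; a simplifying assumption."""
    lowered = model_id.lower()
    for family in FAMILY_TOOLS:
        if family in lowered:
            return family
    return "unknown"


def tools_for(model_id):
    """Return the tool surface to expose for this model."""
    extra = FAMILY_TOOLS.get(model_family(model_id), frozenset())
    return BASE_TOOLS | extra


# Model identifiers below are made up for the example.
codex_tools = tools_for("gpt-5.5-codex")
claude_tools = tools_for("claude-opus-4.6")
other_tools = tools_for("some-other-model")
```

The design point is that an unknown model falls back to the common denominator, while a known family gets the surface it was trained against rather than a merged superset.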

Model labs: the harness is product strategy, not infrastructure. <system-reminder> injection, a typed memory taxonomy, a 10-section system prompt skeleton - these form the post-training moat that makes models less interchangeable.

Users: switching costs are higher than they look. An honest port means replicating the tool surface + citation contract + system prompt structure + memory ritual. And when a new model ships, most of the old scaffolding needs to be deleted. LLMs eat scaffolding for breakfast.

What's ahead

Frontier work in 2026 is about new harness primitives: just-in-time harness assembly (context composed per task rather than pre-configured per session), hundreds of parallel agents on a shared codebase, and self-tracing agents that read their own logs to patch harness-level failure modes.

Sam Altman, when asked about the importance of the harness: "Hard to overstate how critical it is. I no longer think of the harness and the model as these entirely separable things."

Sources: Terminal-Bench 2.0, LangChain - Anatomy of an Agent Harness, LangChain - Improving Deep Agents, Addy Osmani, MindStudio.