4 Trụ Cột Agent Bền Vững - Phần 3: Harness và Orchestration

TL;DR

Harness là hệ điều hành mà agent chạy trên đó. Anthropic đã chứng minh điều này bằng cách không cố ý: 3 thay đổi trong harness - không đụng gì đến model - đã collapse Claude Code từ 2.200 ký tự thinking xuống còn 600 trong 6 tuần. Orchestration là distributed system engineering, không phải voodoo AI. Đây là phần cuối của series về 4 trụ cột.

Trụ cột 3 - Harness: Agent loop là hệ điều hành của bạn

Ngày 4/3/2026: Anthropic giảm default reasoning effort của Claude Code từ high xuống medium để giảm latency. Ngày 26/3: một caching change có bug khiến model bỏ đi toàn bộ reasoning history mid-session - làm model trông như bị "quên" và drain usage limit của user nhanh hơn dự kiến. Ngày 16/4: một system prompt instruction được thêm vào giới hạn response ở 25 từ giữa các tool calls. Không có thay đổi nào đụng đến model. Tất cả là harness changes.

Kết quả: median visible thinking length collapse từ 2.200 ký tự xuống 600 (giảm 73%). API retry rate tăng 80 lần. User cảm nhận được trước monitoring của Anthropic. Một community researcher - không phải internal team - phân tích 6.852 sessions, 17.871 thinking blocks, và 234.760 tool calls trước khi vấn đề được thừa nhận. Anthropic's own postmortem ngày 23/4/2026 đặt tên nguyên nhân: harness và operating instructions.

Nếu team viết Claude Code - flagship developer product của Anthropic - có thể ship pattern bug này và phải học về nó từ user community, thì phần còn lại của chúng ta nên coi engineering xung quanh model là load-bearing thing. Vì nó đúng là như vậy.

Harness naive vs harness production

Harness naive trong hầu hết demo codebase bắt mọi exception và trả về một string chung chung như "tool failed, please try again" - silent error swallowing từ trụ cột Building, được nâng lên tầng agent loop. Model thấy failure string chung chung, không biết tool nào fail hay tại sao. Trong production, điều này tạo ra một trong hai pattern: agent bỏ cuộc với thông báo mơ hồ, hoặc loop mãi retry cùng tool call bị broken cho đến khi timeout hay credit limit. Cả hai là trách nhiệm của harness vì cả hai đến từ việc harness flatten signal đáng ra phải được preserve.

Production harness structure error propagation, bound loop, và make toàn bộ run observable như một distributed trace:

MAX_STEPS = 10

def respond(user_msg: str, session_id: str) -> str:
    with tracer.start_as_current_span("agent.respond") as span:
        span.set_attribute("session.id", session_id)
        history.append({"role": "user", "content": user_msg})

        for step in range(MAX_STEPS):
            with tracer.start_as_current_span("agent.llm_call") as ls:
                completion = llm.complete(history)
                ls.set_attribute("tokens.input", completion.usage.input)
                ls.set_attribute("tokens.output", completion.usage.output)

            if not completion.tool_call:
                return completion.text

            try:
                result = call_tool(completion.tool_call)
            except Exception as e:
                log.exception("tool_failed", tool=completion.tool_call.name)
                result = {"status": "error", "error_type": type(e).__name__, "message": str(e)}

            history.append({"role": "tool", "content": json.dumps(result)})

        raise StepLimitExceeded(f"agent looped past {MAX_STEPS} steps")

Ba thứ quan trọng trong shape này: token usage captured per LLM call (runaway loops xuất hiện ngay như cost spike trong tracing backend), tool error serialized như structured result cho model (không phải generic string), hard step bound với typed exception khi vượt giới hạn. Alternative là debug sanity trong production bằng cách đoán - đó là brutal.

Multi-agent System Process Diagram - lead agent, search subagents, citation agent, và memory

Trụ cột 4 - Orchestration: Khi nhiều worker ngẫu nhiên

Tháng 6/2025, hai paper xuất hiện trong 24 giờ. Cognition Labs publish "Don't Build Multi-Agents", lập luận rằng multi-agent parallel tự nó dễ vỡ vì context isolation - ví dụ nổi tiếng: sub-agent xây nền game Mario và sub-agent khác tạo ra "acid bird" không thuộc game vì không ai chia sẻ design context. Anthropic publish "How we built our multi-agent research system", báo cáo mức tăng 90.2% performance trên breadth-first research bằng cách cho mỗi sub-agent context window riêng.

Cộng đồng đọc như thể chúng đối lập nhau. Đó là cách đọc sai. Hai paper mô tả hai use case khác nhau: Cognition về consistency-critical work (coding, conversation) nơi shared context là load-bearing thing; Anthropic về parallelizable breadth-first search nơi bottleneck là total tokens. Biến quyết định không phải "tôi có dùng nhiều agent không" - mà là liệu orchestration layer có typed task contract, isolated state, và observability across hops không.

AgentLeak benchmark 2026 xác nhận điều này. Trên 5 production LLM (GPT-4o, Claude 3.5 Sonnet, Mistral Large, Llama 3.3 70B), multi-agent configuration nâng system exposure lên 68.9% qua communication channel, so với 43.2% trong single-agent setup. Phần tăng đến từ inter-agent messaging, shared memory, và tool arguments - những surface mà output-only audit không bao giờ kiểm tra.

Delegation contract và durability

Bug tôi thấy liên tục: coordinator instantiate workers từ class-level list. Workers mutate sub-agent state - đó cũng là coordinator state - đó cũng là state của mọi worker khác. Hai request đến concurrently và thấy history của nhau. User thấy data của người khác và report. Fix là factory, không phải instance - cùng pattern từ trụ cột Memory nhưng scale lên cross-request trong multi-process setup.

Delegation contract là bug tiếp theo. Hầu hết framework mặc định pass full conversation history của parent xuống sub-agent. Sub-agent thấy instruction dành cho parent, tool result từ tool nó không có, thread nó không nên follow. Output plausible - model capable - nhưng silently mislead. Treat sub-agent invocation như typed RPC: schema cho task input, schema cho output, quyết định rõ ràng fragment nào của parent context flow through.

Durability: agent work thực (research pipeline, customer support flow chờ approval) chạy vài phút đến vài giờ. Single Python process nghĩa là deployment hay crash xoá toàn bộ workflow. Team trưởng thành tạo agent workflow như saga - durable state machine nơi mỗi step được record, idempotent và resumable. Temporal, DBOS là những tool production-tested cho bài toán này. Đừng roll your own.

Cuối cùng: model selection per role nên là configuration parameter. Hầu hết codebase agent ban đầu hardcode một model ở khắp nơi. 6 tháng sau khi ai đó muốn A/B Sonnet vs Haiku cho workers và giữ Opus trên planner, thay đổi đó là surgery. LangGraph đạt adoption tại Klarna, LinkedIn, Uber, Replit, Elastic một phần vì abstraction match đúng những gì production team cần control.

Orchestration control room với span traces và step counters - kiến trúc harness production-grade

Những thứ không nên đặt năng lượng vào

AutoGen/AG2 cho production build mới - community maintenance, stalled releases.
CrewAI cho production - demo tốt, không sống sót qua production.
SWE-bench / OSWorld leaderboard chasing - Berkeley researchers đã document cách game benchmark qua 2025. Dùng TerminalBench 2.0 và internal eval.
Per-seat SaaS pricing cho agent product mới - thị trường đã chuyển sang outcome và usage-based.
Framework mới trên HN tuần này - chờ 6 tháng. Mọi framework bạn không adopt là migration bạn không nợ.

Kết

Model sẽ tiếp tục tốt hơn. Framework sẽ tiếp tục thay đổi. Thứ quyết định liệu hệ thống của bạn có giữ được reliability xuyên suốt tất cả những thay đổi đó không - là liệu kỹ thuật xung quanh model có đủ discipline để absorb thay đổi mà không vỡ không. Build boundaries. Validate inputs. Isolate state. Instrument the loop. Test the surface. Treat harness như OS và orchestration như distributed system - vì đó là những gì chúng thực sự là.

Model là phần dễ. Engineering xung quanh model là nơi reliability sống. Đó là metric duy nhất quan trọng.

via Anthropic April 2026 Postmortem · Cognition: Don't Build Multi-Agents · Anthropic: Multi-Agent Research System