TL;DR

On April 16, 2026, Anthropic added a single constraint to Claude Code's system prompt: "keep text between tool calls to 25 words or fewer." Four days later they reverted it after confirming a 3% drop in benchmark intelligence scores across Sonnet 4.6, Opus 4.6, and Opus 4.7. One instruction. Downstream effects on the entire prompt. The fix is not more instructions - it's the right structure. This is the 9-section template behind every Claude system prompt that survives production.

Why Most Claude System Prompts Fail

Most prompts are personality documents, not operating contracts. The writer dumps every desired behavior into one wall of text and hopes Claude figures out priority. Three days later, output drifts, edge cases leak, and another paragraph gets added. The prompt grows. Quality stays flat.

The prompts that ship read like contracts. They name the role explicitly. Behaviors get stated as commands rather than suggestions. The failure modes the model should avoid get called out by name, not implied.

Claude processes system prompt content with positional weighting - content near the top has stronger influence on base behavior. Buried hard constraints get de-weighted. Ordering matters more than volume. A 3,000-word prompt can underperform a 400-word one structured correctly.

Three Numbers That Explain the Structure

  • 23% fewer out-of-schema responses when sections are ordered role to constraints to format to examples to behavior, vs. any other ordering. Tested across 200 requests.
  • 30% response quality improvement when large data payloads are placed at the top of the prompt, above instructions and queries. Confirmed by Anthropic's docs for 20k+ token inputs.
  • 30-50% hallucination reduction in structured outputs when uncertainty instructions are explicit - telling Claude to output null rather than guess. A production contract review agent measured a 40% drop from this addition alone.
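The third number comes from a single prompt addition. A minimal version of the uncertainty instruction might look like this (the field names are illustrative, not from the contract-review agent):

```text
When extracting fields, never guess. If a value is not stated
explicitly in the document, output null for that field.

Example: {"party_name": "Acme Corp", "termination_date": null}
```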

The 9-Section Template

Each section earns its place. Skip any of them and the prompt eventually drifts.

1. The role anchor

One sentence. Who Claude is in this context. Include one concrete piece of expertise that narrows behavior. "You are a helpful assistant" implies neither opinion nor method. "You are a senior copy editor who has spent twenty years cutting marketing prose down to its load-bearing words" implies both. The role anchor is the single most consequential line in the prompt.

2. The context bridge

What Claude reads before doing anything else - files, memory, prior outputs. Tell Claude explicitly: "Read these first. If any are missing, stop." Without this section, Claude generates output without the inputs it needed - silently, every time.
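A context bridge for a code-review prompt might read like this (file names are hypothetical):

```text
Before doing anything else, read:
1. CONVENTIONS.md - the team style guide
2. The diff under review
3. Prior review comments, if any

If any of these is missing, stop and say which one is missing.
```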

3. The operating principles

Three to seven non-negotiable behaviors, phrased as commands. The test: can you write a single test case that passes or fails? "Push back on vague answers. Never accept 'it depends' without specifying what it depends on" passes. "Be helpful and accurate" doesn't. "Be helpful" cannot be checked.
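Phrased as commands, a principles block might look like this (an illustrative set, not a canonical one):

```text
OPERATING PRINCIPLES
1. Push back on vague answers. Never accept "it depends"
   without specifying what it depends on.
2. Quote the exact line you are commenting on.
3. If two principles conflict, say so instead of picking one silently.
```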

4. The workflow

When the task has phases, state them explicitly. Number them. Define the objective of each. Skip this section only if the task is single-shot: one input, one output, no intermediate steps.
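Numbered, with one objective per phase, a workflow section might read (an illustrative three-phase example):

```text
WORKFLOW
1. INTAKE - restate the request in one sentence. Objective: confirm scope.
2. DRAFT - produce a first version. Objective: coverage, not polish.
3. REVIEW - run every quality gate. Objective: catch failures before delivery.
```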

5. The examples

Two concrete demonstrations, one good and one bad, with a one-line "WHY" explaining the difference. Examples are the most underused section in production prompts. Without them, Claude answers abstract descriptions with abstract output. Four to six lines of literal demonstration is the fix.
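For a copy-editing prompt, the pair might look like this (an invented example):

```text
BAD:  "This product is a game-changer that empowers teams."
GOOD: "This product cuts deploy time from 40 minutes to 6."
WHY:  The good version replaces a claim about feeling with a
      claim about a number.
```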

6. The constraints

Words you never use. Patterns you avoid. Behaviors that disqualify the output. This is where you encode negative knowledge - failure modes you have already seen. A prompt without constraints drifts toward the statistical average of training data, which is rarely what production needs.
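Encoded concretely, a constraints section might read (the banned words are illustrative):

```text
CONSTRAINTS
- Never use: "leverage", "delve", "game-changer", "empower"
- Never open with a restatement of the question
- Never output partial JSON - if one field fails, the whole response fails
```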

7. The output format

The exact structure of the response. For structured output: concrete schemas with examples, not "respond in JSON." For human-facing output: "Three sections, sentence-case headings, no preamble" beats "Write three paragraphs."
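"Concrete schema with examples" means something like this (a hypothetical review-agent shape):

```text
OUTPUT FORMAT - respond with exactly this JSON shape:
{
  "verdict": "approve" | "request_changes",
  "issues": [{"line": 42, "severity": "high", "note": "..."}],
  "summary": "one sentence, no preamble"
}
```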

8. The quality gates

Self-checks Claude runs before delivering. Three to five checkable conditions. "Before responding, verify: every claim has a concrete source, no banned words appear, the output matches the format spec." Without this section, you catch errors after Claude responds. With it, Claude catches them first.
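Gates that are checkable in the prompt are also checkable in code. A minimal harness sketch, under the assumption of a JSON output format with a banned-word constraint (both the words and the keys are illustrative), that rejects a response before it ships:

```python
import json

BANNED = {"leverage", "delve", "game-changer"}
REQUIRED_KEYS = {"verdict", "issues", "summary"}

def passes_gates(response: str) -> list[str]:
    """Return a list of gate failures; an empty list means the response ships."""
    failures = []
    lowered = response.lower()
    # Gate 1: no banned words appear anywhere in the output.
    for word in BANNED:
        if word in lowered:
            failures.append(f"banned word: {word}")
    # Gate 2: the output matches the format spec.
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        return failures + ["output is not valid JSON"]
    if not isinstance(payload, dict):
        return failures + ["output is not a JSON object"]
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        failures.append(f"missing keys: {sorted(missing)}")
    return failures
```

The same checks live in two places on purpose: the prompt version lets Claude catch the failure first, and the code version catches whatever slips through.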

9. The critical reminders

The three to five things that matter most, repeated. Claude weights repeated instructions higher. A reminder section is not filler - it is a deliberate weight on the things that fail most often. Pick your worst failure modes and repeat them at the end.
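Assembled in order, the nine sections give a skeleton like this (section bodies elided):

```text
ROLE: You are a senior copy editor who ...
CONTEXT: Read these files first. If any are missing, stop.
OPERATING PRINCIPLES: 1. ... 2. ... 3. ...
WORKFLOW: 1. INTAKE ... 2. DRAFT ... 3. REVIEW ...
EXAMPLES: BAD ... GOOD ... WHY ...
CONSTRAINTS: Never ...
OUTPUT FORMAT: ...
QUALITY GATES: Before responding, verify ...
CRITICAL REMINDERS: 1. ... 2. ... 3. ...
```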

Five Shapes of the Same Pattern

The template scales to every kind of job. The pattern is stable. The emphasis shifts.

  • Research analyst: Heavy workflow (5 phases) with quality gates between each. Without gates, the model blows through phases and produces a generic summary.
  • Creative director: Constraints section fights the model's worst instinct - vague taste language. The constraint has to be more specific than the failure mode it prevents.
  • Reply generator: Short prompt, dominated by constraints. Style mimicry needs a banned-word list and a "read it aloud - if it sounds like a press release, rewrite" quality gate. Not a workflow.
  • Text humanizer: Workflow section does the heaviest lifting. Seven explicit detection passes mean the model cannot skip steps. Without numbered passes, it does three and calls it done.
  • Content creator: Format-selection step before drafting. Binary choice between article and thread prevents hybrid output that fits neither format.

Note: Claude Opus 4.7 interprets instructions more literally than Opus 4.6. "CRITICAL: You MUST use this tool" causes overtriggering. "Use this tool when..." works.

The Lint Checklist

  1. The role anchor names a specific stance, not a generic helper.
  2. Every operating principle is testable - you can write a single test case that passes or fails.
  3. There is at least one bad/good example pair with a "WHY" line.
  4. The constraints section names specific words, patterns, or behaviors - not abstractions.
  5. The output format would be parseable even if the reader couldn't see your prompt.
  6. The quality gates are written as checkable conditions, not aspirations.
  7. The reminders repeat failure modes you have actually seen, not generic restatements.
  8. Every section earns its place. If you can delete a section and the prompt still ships the right output, delete it.
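A few of these checks are mechanical enough to script. A rough sketch, flagging lint items 1 and 3 with heuristics that are assumptions rather than the article's method:

```python
import re

def lint_prompt(prompt: str) -> list[str]:
    """Flag two mechanical lint failures: a generic role anchor,
    and a missing bad/good example pair."""
    warnings = []
    # Item 1: the role anchor names a specific stance, not a generic helper.
    if re.search(r"you are a helpful assistant", prompt, re.IGNORECASE):
        warnings.append("role anchor is a generic helper")
    # Item 3: at least one bad/good example pair with a WHY line.
    has_pair = "BAD" in prompt and "GOOD" in prompt
    if not (has_pair and "WHY" in prompt):
        warnings.append("no bad/good example pair with a WHY line")
    return warnings
```

Most of the checklist still needs a human read - testability and "earns its place" are judgment calls - but automating even two items catches regressions before they ship.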

One Change at a Time

Pull up your most-used Claude prompt. Run it against the 8-point lint above. Find the section that fails worst, rewrite it, and ship the change.

Don't rewrite the whole prompt at once. The April 2026 incident proved that adding any instruction has downstream effects - and that applies to your prompts too. One section at a time. Measure after each change, with evaluation sessions of at least 30 minutes of real work. Anything shorter gives false confidence.

Sources: Anthropic Engineering - April 23 Postmortem, Claude Prompting Best Practices, Unprompted Mind, How Claude Code Builds a System Prompt.