Harness Engineering: The Skill That Separates AI-Native Devs

Sun, 12 Apr 2026 00:00:00 +0000

Harness engineering is the discipline of building everything around an LLM to make it a reliable production system. Not prompt engineering. Not picking the right model. The real skill is designing the tools, memory, guardrails, and orchestration that turn raw model intelligence into consistent, useful output. If you’ve been wondering why some developers ship 10x with AI while others struggle with the same models, this is the answer.

The equation is simple: Agent = Model + Harness. The model provides intelligence. The harness provides direction. And in 2026, the harness is where all the engineering value lives.

I’ve been building this way for months without having a name for it. My Obsidian vault, my CLAUDE.md files, my custom CLI tools, my skills workflows. All of it maps directly to what Martin Fowler, OpenAI, and Anthropic are now formalizing as harness engineering. Here’s what the discipline actually looks like in practice.

What Is a Harness, Exactly?

A harness is every piece of code, configuration, and infrastructure that is not the model itself. It’s the interface between the LLM and the real world. It manages:

How context is loaded (what the model sees)
Which tools are available (what the model can do)
How failures are handled (what happens when things break)
How state persists (what the model remembers across sessions)

Think of it like this: the model is a powerful engine. The harness is the chassis, steering, transmission, and brakes that make it a usable vehicle. Without the harness, you just have raw power with no direction.

The Six Core Principles

Across research from Martin Fowler’s Thoughtworks team, OpenAI’s Codex group, Anthropic’s engineering blog, and practitioners like Dex Horthy, six principles keep showing up in every source.

1. Build Context into the Environment

Stop cramming documents into chat windows. Instead, build a structured environment (a vault, a docs directory, a well-organized repo) that the AI can search and read as needed.

This is progressive disclosure. Give the model a map, not a 1,000-page manual: a short CLAUDE.md or AGENTS.md that acts as a table of contents pointing to deeper documentation. The model pulls what it needs, when it needs it.

Because everything here is new, there’s no perfect recipe. You test, evaluate against your own use case, and iterate. Harness engineering is exactly that: shaping the environment and refining it until it’s resilient enough for agentic code.

For example, you can use hooks to preload context or make the LLM aware of its environment before it acts.

brain/
├── CLAUDE.md           ← entry point, the "map"
├── docs/
│   ├── architecture.md
│   ├── conventions.md
│   └── decisions/
├── .claude/
│   └── skills/         ← automated workflows
└── content/
    └── ...
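
As a rough illustration of the "map, not manual" idea, here is a minimal Python sketch that builds a context preamble from the entry-point file plus a list of deeper docs by path only. The function name `build_context_map` and the `docs/` location are assumptions made for this example; how you wire it into a hook depends on your agent runtime and is not shown.

```python
from pathlib import Path


def build_context_map(root: Path, entry_point: str = "CLAUDE.md") -> str:
    """Return the entry-point text plus an index of deeper docs (paths only)."""
    entry = root / entry_point
    preamble = entry.read_text(encoding="utf-8") if entry.exists() else ""

    # List deeper documents by relative path. Their contents are NOT loaded;
    # the agent reads a file only when it decides it needs it.
    docs_dir = root / "docs"
    doc_paths = []
    if docs_dir.is_dir():
        doc_paths = sorted(
            p.relative_to(root).as_posix() for p in docs_dir.rglob("*.md")
        )

    index = "\n".join(f"- {path}" for path in doc_paths)
    return f"{preamble}\n\nAvailable docs (read on demand):\n{index}"
```

The point of the design is that the preamble stays short no matter how large `docs/` grows: only file names are added, never file bodies.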

2. The Filesystem Is King

High-performance harnesses use plain markdown files and git history as the primary state mechanism. Not vector databases. Not complex RAG pipelines. Markdown files that are human-readable, version-controlled, and cheap to maintain.

This sounds almost too simple, but it works. Git gives you version history. Markdown gives you portability. The filesystem gives you a search interface the model already understands.
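
To make "markdown as state" concrete, here is a small sketch using only the Python standard library: one function appends a dated note to a markdown log, the other does a grep-style search across the tree. The names `append_note` and `search_notes` are hypothetical, and real harnesses will differ in layout.

```python
from datetime import date
from pathlib import Path
from typing import List, Tuple


def append_note(notes_file: Path, text: str) -> None:
    """Append a dated bullet to a markdown log, creating the file if needed."""
    notes_file.parent.mkdir(parents=True, exist_ok=True)
    with notes_file.open("a", encoding="utf-8") as fh:
        fh.write(f"- {date.today().isoformat()}: {text}\n")


def search_notes(root: Path, keyword: str) -> List[Tuple[str, int, str]]:
    """Return (relative path, line number, line) for each case-insensitive match."""
    hits = []
    needle = keyword.lower()
    for path in sorted(root.rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if needle in line.lower():
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits
```

Committing the notes directory with git after each session is what adds the version history; that step is left out here to keep the sketch free of external tools.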

3. Verification Multiplies Quality

Giving a model a way to verify its own work (linters, tests, a dedicated evaluator agent) can improve output quality by 2-3x. This is one of the most underappreciated principles.

Anthropic’s approach uses a Generator-Evaluator architecture inspired by GANs. One agent produces the work. A separate, skeptical agent grades it against concrete criteria. The key insight: models are inherently poor at evaluating their own output. They’ll praise mediocre work if you ask them to self-review.
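
The generator-evaluator split can be sketched in a few lines. This is not Anthropic's implementation; it is a minimal loop under the assumption that both agents are plain callables, with `stub_generate` and `stub_evaluate` standing in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    passed: bool
    feedback: str


Generator = Callable[[str, List[str]], str]  # (task, prior feedback) -> draft
Evaluator = Callable[[str, str], Verdict]    # (task, draft) -> verdict


def generate_with_review(
    task: str, generate: Generator, evaluate: Evaluator, max_rounds: int = 3
) -> Tuple[str, bool]:
    """Loop until the separate evaluator accepts the draft or rounds run out."""
    feedback: List[str] = []
    draft = ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        verdict = evaluate(task, draft)  # graded by a different agent, not self-review
        if verdict.passed:
            return draft, True
        feedback.append(verdict.feedback)
    return draft, False


# Stand-ins for real model calls, for illustration only.
def stub_generate(task: str, feedback: List[str]) -> str:
    if feedback:
        return 'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b\n'
    return "def add(a, b):\n    return a + b\n"


def stub_evaluate(task: str, draft: str) -> Verdict:
    if '"""' in draft:
        return Verdict(True, "ok")
    return Verdict(False, "Add a docstring to every function.")
```

The evaluator checks one concrete criterion here (a docstring is present). In practice the criteria would be whatever your project can state explicitly and check.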

4. Feedforward and Feedback Controls

Martin Fowler’s team frames harness components as two types of controls:

Guides (feedforward): Steer behavior before the model acts. Your CLAUDE.md, your coding standards, your project conventions. These are the guardrails.
Sensors (feedback): Observe results after the model acts. Linters, test suites, type checkers. These let the model self-correct before a human reviews.

The combination is powerful. Guides prevent errors. Sensors catch what slips through.
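
Here is one way the two control types might look in code, assuming the agent produces Python source. The guide text, the sensor names, and `run_sensors` are all illustrative; only `compile` and the `ast` module are real standard-library pieces.

```python
import ast
from typing import Callable, List

Sensor = Callable[[str], List[str]]  # source code in, list of findings out


def apply_guides(task: str, guides: List[str]) -> str:
    """Feedforward: prepend conventions so they steer the model before it acts."""
    rules = "\n".join(f"- {guide}" for guide in guides)
    return f"Project conventions:\n{rules}\n\nTask: {task}"


def syntax_sensor(source: str) -> List[str]:
    """Feedback: report a syntax error in the generated code, if any."""
    try:
        compile(source, "<generated>", "exec")
    except SyntaxError as exc:
        return [f"line {exc.lineno}: {exc.msg}"]
    return []


def bare_except_sensor(source: str) -> List[str]:
    """Feedback: flag 'except:' clauses that name no exception type."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return []  # syntax_sensor reports this case
    return [
        f"line {node.lineno}: bare 'except:' found; catch a specific exception"
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]


def run_sensors(source: str, sensors: List[Sensor]) -> List[str]:
    """Collect findings from every sensor so the agent can self-correct."""
    findings: List[str] = []
    for sensor in sensors:
        findings.extend(sensor(source))
    return findings
```

An empty findings list means nothing slipped past the guides. A non-empty list goes back to the model as its next input, before any human looks at the change.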

5. Agent Legibility Over Human Aesthetics

OpenAI is pushing hard on this idea: optimize your codebase for agent reasoning, not just human stylistic preferences. This means favoring predictable structures, “boring” technologies, and explicit boundaries that the AI can easily navigate.

Practically, this looks like:

Consistent file naming conventions
Clear module boundaries with documented interfaces
Architectural decisions recorded in markdown, not in someone’s head
Error messages that include remediation instructions (so the model can fix what it breaks)
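
The last item is easy to show in code. Below is a hedged sketch of an exception type that always carries a fix alongside the problem; `RemediableError` and `load_port` are hypothetical names invented for this example.

```python
class RemediableError(Exception):
    """An error whose message states the problem and how to fix it."""

    def __init__(self, problem: str, remediation: str) -> None:
        super().__init__(f"{problem}\nHow to fix: {remediation}")
        self.problem = problem
        self.remediation = remediation


def load_port(config: dict) -> int:
    """Read a TCP port from a config dict, failing with actionable messages."""
    if "port" not in config:
        raise RemediableError(
            "config is missing the 'port' key",
            "add an integer 'port' entry, for example {'port': 8080}",
        )
    port = config["port"]
    if not isinstance(port, int) or not 1 <= port <= 65535:
        raise RemediableError(
            f"'port' must be an integer between 1 and 65535, got {port!r}",
            "set 'port' to a value such as 8080",
        )
    return port
```

An agent that hits this error sees both what went wrong and the next action to take, in the same tool output.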

6. ReAct Loops as the Execution Model

Harnesses use a Reasoning and Acting (ReAct) loop: observe state, reason about the next step, take an action via a tool, observe the result. This is the fundamental execution pattern behind tools like Claude Code, Cursor, and every serious coding agent.

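The loop is non-deterministic. Unlike traditional orchestration with rigid DAGs, an agent loop chooses each next step from the model’s reasoning, so the same task can follow a different path on each run. That is a real shift from traditional software engineering, where control flow is fixed in advance.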
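
A stripped-down version of the loop looks something like this. It is a sketch, not how Claude Code or Cursor are built: the model is assumed to be a callable that returns either an action or a final answer, and `scripted_model` is a deterministic stand-in so the example runs without an LLM.

```python
from typing import Callable, Dict, List, Tuple

# A step is ("act", tool_name, tool_input) or ("finish", answer, "").
Step = Tuple[str, str, str]
Model = Callable[[List[str]], Step]
Tool = Callable[[str], str]


def react_loop(model: Model, tools: Dict[str, Tool], task: str, max_steps: int = 5) -> str:
    """Reason, act through a tool, observe the result, and repeat."""
    transcript = [f"task: {task}"]
    for _ in range(max_steps):
        kind, name, arg = model(transcript)  # reason about the next step
        if kind == "finish":
            return name
        tool = tools.get(name)
        if tool is None:
            observation = f"error: unknown tool {name!r}; available: {sorted(tools)}"
        else:
            observation = tool(arg)  # act
        transcript.append(f"action: {name}({arg!r})")
        transcript.append(f"observation: {observation}")  # observe
    return "stopped: step limit reached"


# Deterministic stand-in for an LLM: call one tool, then report what it saw.
def scripted_model(transcript: List[str]) -> Step:
    last = transcript[-1]
    if last.startswith("observation: "):
        return ("finish", last[len("observation: "):], "")
    return ("act", "word_count", "the harness is the new codebase")


TOOLS: Dict[str, Tool] = {"word_count": lambda text: str(len(text.split()))}
```

The `max_steps` cap is the harness doing its job: because the path is chosen by the model, the loop needs an outer bound that the model cannot talk its way past.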

The Frameworks: RPI and QRSPI

Two methodologies have emerged for structuring how agents work within a harness.

RPI (Research, Plan, Implement)

RPI keeps context windows small and focused by splitting work into three phases:

Research: Open a fresh context window. Scan the codebase objectively to understand the system. No preconceptions.
Plan: Outline exact steps with file names, line snippets, and testing procedures. Build a vertical plan (mock API, then UI, then real database) instead of a horizontal one.
Implement: Execute the plan in a clean context window to avoid “context anxiety,” the degradation that happens when a model’s context fills past 40-60% capacity.

The key insight is the separation between phases. Each runs in fresh context so the model stays in what practitioners call the “Smart Zone,” the performance sweet spot below 40% context usage.
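
The phase separation can be sketched as three independent model calls, each handed only the previous phase's artifact. Everything here is an assumption made for illustration: the model is a stateless callable, the window size defaults to 200,000 tokens, and usage is estimated at roughly 4 characters per token, which is a rule of thumb rather than a real tokenizer.

```python
from typing import Callable

Model = Callable[[str], str]  # prompt in, text out, no memory between calls


def context_usage(prompt: str, window_tokens: int) -> float:
    """Estimate the fraction of the window used (about 4 characters per token)."""
    return (len(prompt) / 4) / window_tokens


def run_phase(
    model: Model,
    instructions: str,
    artifact: str,
    window_tokens: int = 200_000,
    limit: float = 0.40,
) -> str:
    """Run one phase in a fresh context, refusing inputs that exceed the limit."""
    prompt = f"{instructions}\n\n{artifact}"
    usage = context_usage(prompt, window_tokens)
    if usage > limit:
        raise ValueError(
            f"phase input uses {usage:.0%} of the window; "
            f"compact the artifact to stay under {limit:.0%}"
        )
    return model(prompt)


def rpi(model: Model, task: str, window_tokens: int = 200_000) -> str:
    """Research, plan, implement: each phase sees only the prior artifact."""
    research = run_phase(
        model,
        "Research: describe the relevant code. Do not propose changes.",
        f"Task: {task}",
        window_tokens,
    )
    plan = run_phase(
        model,
        "Plan: list exact steps, files, and tests.",
        f"Task: {task}\n\nResearch notes:\n{research}",
        window_tokens,
    )
    return run_phase(
        model,
        "Implement: follow the plan exactly.",
        f"Plan:\n{plan}",
        window_tokens,
    )
```

Note that the implement phase never sees the raw research notes, only the plan. That is the handoff discipline: whatever the next phase needs has to be written into the artifact.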

QRSPI (“Crispy”)

QRSPI is an evolution of RPI that adds more structure for complex features:

Questions, Research, Design, Structure, Plan, Worktree, Implement, PR

The critical addition is the pair of Design and Structure phases. Before the agent writes thousands of lines of code, you align on a ~200-line markdown artifact. This is the human checkpoint. You review the design, not the implementation.

QRSPI also enforces an instruction budget: each phase stays under 40 instructions. This comes from the finding that models reliably follow about 150-200 instructions total. Monolithic prompts with 85+ instructions lead to skipped steps and inconsistent output.
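
An instruction budget is easy to check mechanically if you accept a crude definition of "instruction". The sketch below counts bullet and numbered lines in each phase prompt and reports any phase at or over the 40-instruction limit from the text. The counting rule and the function names are assumptions; a real prompt may pack several instructions into one line or none into a bullet.

```python
import re
from typing import Dict

PHASE_LIMIT = 40  # each phase should stay under this many instructions

# A line counts as one instruction if it starts with "-", "*", "1.", or "1)".
_INSTRUCTION = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+\S")


def count_instructions(prompt: str) -> int:
    """Count bullet and numbered lines as a rough proxy for instructions."""
    return sum(1 for line in prompt.splitlines() if _INSTRUCTION.match(line))


def over_budget(phases: Dict[str, str], limit: int = PHASE_LIMIT) -> Dict[str, int]:
    """Return {phase name: count} for every phase at or above the limit."""
    counts = {name: count_instructions(text) for name, text in phases.items()}
    return {name: n for name, n in counts.items() if n >= limit}
```

Run against a directory of phase prompts, a check like this can sit in CI next to the linters, so prompt growth gets caught the same way code regressions do.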

Where the Industry Disagrees

The most interesting part of this research was finding where leading teams disagree. These aren’t settled questions.

Topic	Position A	Position B
Should you read the code?	OpenAI: Steer, don’t code. Corrections are cheap, human review is expensive.	Dex Horthy: Tried not reading code for 6 months. “Did not end well.”
Context resets vs compaction	Anthropic: Full resets with clean handoffs produce better results.	Others: Intentional compaction to markdown preserves valuable context.
Instruction budgets	Horthy: Keep prompts small (~150-200 instructions max).	OpenAI/Anthropic: Use mechanical enforcement (linters, evaluators) instead.
Merge gates	OpenAI: Minimal blocking, agent-to-agent reviews.	Horthy: Humans must own and review the code.

My take: you need to read the code, especially at this stage. The models are good, not perfect. Skipping review is a shortcut that compounds into tech debt you won’t understand because you didn’t write it.

What This Means for Your Career in 2026

The role of the software engineer is moving one level up. Your value isn’t in writing implementation code. It’s in:

Designing feedback loops that catch errors before they ship
Building context systems that make agents more effective over time
Specifying intent clearly enough that agents can execute reliably
Reviewing and owning the output, because your name is still on it

OpenAI reported a small team shipping 1 million lines of code with 0 manually-written lines at 3.5 PRs per engineer per day. That’s not a future prediction. That’s happening now.

The developers who learn harness engineering will ride that wave. The ones who keep pasting prompts into chat windows will wonder why their output stays flat.

Where This Goes Next: Harness as Code (HaC)

Here’s where I think this goes. If the harness is the new codebase, then the next step is codifying harnesses themselves. Developers won’t just write application code; they’ll write harness templates that spin up agent-ready environments: context maps, skills, hooks, evaluators, and guardrails all defined as code.

Call it HaC (Harness as Code). The same way Terraform let teams scale infrastructure, HaC will let teams scale agent environments. That’s where I think the real leverage shows up.
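
Since HaC is still an idea rather than a tool, the following is purely speculative: a declarative spec plus a function that materializes it on disk and skips anything that already exists. `HarnessSpec`, `planned_files`, `materialize`, and the `.claude/skills/<name>/SKILL.md` layout are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class HarnessSpec:
    """Declarative description of an agent-ready environment (hypothetical)."""
    entry_point: str = "AGENTS.md"
    docs: Dict[str, str] = field(default_factory=dict)    # file name -> initial content
    skills: Dict[str, str] = field(default_factory=dict)  # skill name -> instructions


def planned_files(spec: HarnessSpec) -> Dict[str, str]:
    """Compute every file the spec implies, keyed by relative path."""
    doc_index = "\n".join(f"- docs/{name}" for name in sorted(spec.docs))
    files = {spec.entry_point: f"# Project map\n\n{doc_index}\n"}
    for name, body in spec.docs.items():
        files[f"docs/{name}"] = body
    for name, body in spec.skills.items():
        files[f".claude/skills/{name}/SKILL.md"] = body
    return files


def materialize(spec: HarnessSpec, root: Path) -> List[str]:
    """Create missing files only, so re-running never overwrites local edits."""
    created = []
    for rel, content in sorted(planned_files(spec).items()):
        path = root / rel
        if path.exists():
            continue
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")
        created.append(rel)
    return created
```

Running `materialize` twice with the same spec creates nothing the second time. That repeatable, declare-then-apply behavior is the part borrowed from infrastructure tooling.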

How to Start Today

You don’t need to overhaul your workflow overnight. Start with these five steps:

Create a CLAUDE.md or AGENTS.md in your project root. Define your agent’s role, coding standards, and project context. This is the first layer of the harness.
Structure your documentation in markdown files within the repo. Architectural decisions, conventions, and design patterns. If the agent can’t find it, it doesn’t exist.
Set up an agentic runtime. Claude Code, Cursor, or similar. The specific tool matters less than having one.
Apply RPI to your next feature. Research in one context, plan in another, implement in a third. Notice the quality difference.
Automate one repetitive workflow. Turn a recurring process into a slash-command skill. This is where the compound interest starts.

The harness is the new codebase. Start building yours.

FAQ

What’s the difference between harness engineering and prompt engineering?

Prompt engineering focuses on crafting individual messages to get better responses. Harness engineering is the broader discipline of building the entire infrastructure around the model: tools, memory, verification loops, context management, and orchestration. Prompts are one small piece of the harness.

Do I need to know AI/ML to do harness engineering?

No. Harness engineering is software engineering applied to AI systems. You need to understand context management, tool orchestration, and system design. The model handles the ML part. You handle everything else.

Which framework should I start with, RPI or QRSPI?

Start with RPI. It’s simpler and teaches the core principle of separating research, planning, and implementation into distinct phases. Move to QRSPI when you’re building features complex enough to need the design/structure alignment step.

What is Harness as Code (HaC)?

HaC is the idea that harnesses themselves will be codified and shared, the same way infrastructure became Infrastructure as Code. Instead of hand-rolling a harness per project, developers will write harness templates that spin up agent-ready environments: context maps, skills, hooks, evaluators, and guardrails defined as code. It’s an early concept, but I think it’s where the real scale comes from.

Is harness engineering only for coding agents?

No. The principles apply to any AI system: writing assistants, research tools, customer support agents, automation pipelines. Anywhere you have Agent = Model + Harness, the discipline applies.

Claude-Code on Isac Builds
