Code Is the New Agent Harness
... PLUS: Trace, Debug, and Fix Your AI Agents
In today’s newsletter:
Trace, Debug, and Fix Your AI Agents
Code Is the New Agent Harness
Reading time: 5 minutes.
When an AI agent breaks in production, the usual workflow is: check the observability platform to see what went wrong, switch to your IDE to fix it, re-run, and discover the fix broke something else.
Ollie, a coding agent built into Opik, closes that entire loop in one place. It reads your traces, analyzes your codebase, writes the fix, and reruns the agent to verify it worked.
What It Does
Full trace analysis: reads inputs, outputs, latencies, and token counts across the entire span tree to understand what the agent actually did
Code access: connects to source files via opik connect, proposes edits for review before applying anything
Live verification: reruns the agent using inputs from the failing trace to confirm the fix works
Regression tests: converts traces into test cases with assertions so the same bug can't reappear silently
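To make the regression-test idea concrete, here is a minimal sketch in plain Python. It is not Opik's or Ollie's actual API: the `Trace` record, its two fields, and the `trace_to_test` helper are hypothetical names chosen for illustration, and the agent is assumed to be a plain function from input string to output string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Trace:
    """Hypothetical, simplified trace record (not Opik's real schema)."""
    input: str
    expected_output: str


def trace_to_test(trace: Trace, agent: Callable[[str], str]) -> Callable[[], None]:
    """Freeze a trace into a test that replays its input through the agent."""
    def test() -> None:
        actual = agent(trace.input)
        # The assertion is what keeps the bug from reappearing silently.
        assert actual == trace.expected_output, (
            f"regression on {trace.input!r}: "
            f"got {actual!r}, expected {trace.expected_output!r}"
        )
    return test


def stand_in_agent(question: str) -> str:
    """Placeholder for the real agent under test."""
    return "4"


if __name__ == "__main__":
    check = trace_to_test(Trace("What is 2 + 2?", "4"), stand_in_agent)
    check()  # raises AssertionError if the agent regresses
```

In practice the returned function could be collected by a test runner, so every fixed trace becomes a permanent check.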
A Real Example
An evaluation agent was testing a RAG pipeline by comparing generated answers against ground truth. The problem: tests kept passing when answers were clearly wrong.
The test asked "How many engineers are on the platform team?" Ground truth was "23 engineers." The RAG returned "45 engineers." The eval agent marked it correct.
The trace showed why. The agent was using semantic similarity to score answers. It gave a 0.92 similarity score because both sentences mentioned engineers with a similar structure. Factually wrong, semantically close.
Once Ollie was connected to the codebase with opik connect, it read the trace, identified the root cause, and proposed a fix: replace semantic similarity with value extraction and direct comparison.
Fix approved. Test rerun. This time the eval agent correctly failed the answer, because 45 does not equal 23.
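The newsletter does not show the code Ollie wrote, so the following is only a sketch of what "value extraction and direct comparison" could look like. The function names and the regular expression are assumptions, and the sketch only handles plain non-negative numbers (no thousands separators, units, or spelled-out values).

```python
import re

# Matches non-negative integers and decimals such as "23" or "4.5".
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(text: str) -> list[float]:
    """Pull every numeric value out of a sentence, in order of appearance."""
    return [float(match) for match in _NUMBER.findall(text)]


def answers_match(generated: str, ground_truth: str) -> bool:
    """Compare answers by their extracted values rather than by similarity."""
    expected = extract_numbers(ground_truth)
    if not expected:
        # No numbers to compare: fall back to a normalized exact match.
        return generated.strip().lower() == ground_truth.strip().lower()
    return extract_numbers(generated) == expected


if __name__ == "__main__":
    print(answers_match("45 engineers", "23 engineers"))             # False
    print(answers_match("There are 23 engineers.", "23 engineers"))  # True
```

Unlike a similarity score, this check has no threshold to tune: the two sentences can be worded almost identically and the comparison still fails when the values differ.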
Code Is the New Agent Harness
Most research focuses on what agents can do and overlooks what agents should be built on.
Code as Agent Harness is a 100+ page survey that makes one central argument. Code is no longer just what agents produce. It's the infrastructure they run on.
The Shift
For the past few years, code was the output. You prompted an agent, it wrote a script, you ran it yourself.
That's changing. In agentic systems today, code is the harness. It's how agents plan, how they interact with tools, how they model their environment, and how they verify their own work.
The paper calls this "code as agent harness" and treats it as a unified lens for understanding how agent infrastructure actually works.
The Three Layers
The survey organizes everything around three layers of the harness.
The interface layer. How code connects the agent to the outside world. It covers reasoning, action execution, and environment modeling. This is the edge where the agent touches real systems.
The mechanism layer. How code supports the agent's inner workings. Planning across long tasks, memory that persists between steps, tool use, and feedback loops that let the agent detect and correct its own mistakes.
The coordination layer. How code scales from one agent to many. When multiple agents work together, shared code artifacts become the common ground. One agent writes a function, another reviews it, a third runs tests. Code is what makes that coordination legible.
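The feedback loop described under the mechanism layer can be sketched in a few lines. This is an illustration under stated assumptions, not code from the survey: `propose` stands in for the agent (it receives the errors seen so far and returns a new candidate), and `verify` stands in for whatever check the harness runs, returning `None` on success or an error message on failure.

```python
from typing import Callable, Optional


def run_with_feedback(
    propose: Callable[[list[str]], str],
    verify: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Propose, verify, and retry with accumulated feedback.

    Returns the first candidate that passes verification, or None if
    every attempt fails.
    """
    feedback: list[str] = []
    for _ in range(max_attempts):
        candidate = propose(feedback)
        error = verify(candidate)
        if error is None:
            return candidate
        # Feed the failure back so the next proposal can correct it.
        feedback.append(error)
    return None


if __name__ == "__main__":
    def propose(feedback: list[str]) -> str:
        # Toy agent: wrong on the first try, corrected once it sees an error.
        return "23" if feedback else "45"

    def verify(candidate: str) -> Optional[str]:
        return None if candidate == "23" else f"expected 23, got {candidate}"

    print(run_with_feedback(propose, verify))
```

The attempt limit matters: without it, an agent that never converges would loop forever.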
Why Code Specifically
The paper makes a case for code over other approaches (plain language instructions, YAML configs, tool definitions) by pointing to four properties that code has and others don't.
Executable. You can run code and get a verifiable result. A prompt instruction can be followed loosely. Code either works or it doesn't.
Inspectable. You can read code and understand what an agent did and why. Black-box behavior is harder to audit and harder to fix.
Stateful. Code can hold and update state across steps. An agent that writes to a variable can pick up exactly where it left off.
Governed. Code can be wrapped in access controls, permissions, and constraints. You can define what an agent is allowed to touch and enforce it programmatically.
The paper argues that future agent systems need all four, and that code is currently the best substrate for providing them together.
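As a rough illustration of how these properties can coexist in one place, here is a small sketch of a tool harness. The `ToolHarness` class and the tool names are hypothetical and not taken from the paper: tool calls are executable, the call log is inspectable, the `state` dictionary persists across steps, and the allowlist is enforced in code.

```python
from typing import Any, Callable


class ToolHarness:
    """Hypothetical harness: an agent may only call allowlisted tools."""

    def __init__(self, allowed: set[str]) -> None:
        self._allowed = frozenset(allowed)
        self._tools: dict[str, Callable[..., Any]] = {}
        self.state: dict[str, Any] = {}  # stateful: survives across steps
        self.log: list[str] = []         # inspectable: what was attempted

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, *args: Any) -> Any:
        # Governed: the permission check runs before any tool code does.
        if name not in self._allowed:
            self.log.append(f"DENIED {name}")
            raise PermissionError(f"tool {name!r} is not allowed")
        result = self._tools[name](*args)  # executable: a real result
        self.log.append(f"OK {name}")
        return result


if __name__ == "__main__":
    harness = ToolHarness(allowed={"read_file"})
    harness.register("read_file", lambda path: f"contents of {path}")
    harness.register("delete_file", lambda path: None)

    harness.state["last_read"] = harness.call("read_file", "notes.txt")
    try:
        harness.call("delete_file", "notes.txt")
    except PermissionError as exc:
        print(exc)
    print(harness.log)
```

A prompt can ask an agent not to delete files. Here the restriction holds regardless of what the agent decides to do.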
The Open Problems
The survey is honest about what remains unsolved: evaluating agents beyond whether they completed the final task, verifying work when feedback is incomplete or delayed, updating a harness without breaking what already works, keeping shared state consistent when multiple agents are writing to it, and maintaining human oversight for actions that can't be undone.
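To see why shared state is hard, consider two agents that read the same value and both try to write back. The sketch below uses optimistic concurrency (a version check on write), a standard technique chosen here purely for illustration. The survey is not claimed to propose it, and `VersionedStore` is a hypothetical name.

```python
import threading
from typing import Any


class VersionedStore:
    """Shared state that rejects writes based on a stale read."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._value: Any = None
        self._version = 0

    def read(self) -> tuple[Any, int]:
        """Return the current value and the version it was read at."""
        with self._lock:
            return self._value, self._version

    def write(self, value: Any, expected_version: int) -> bool:
        """Write only if nobody else has written since `expected_version`."""
        with self._lock:
            if expected_version != self._version:
                return False  # stale: the caller must re-read and retry
            self._value = value
            self._version += 1
            return True


if __name__ == "__main__":
    store = VersionedStore()
    _, seen_by_a = store.read()
    _, seen_by_b = store.read()
    print(store.write("plan from agent A", seen_by_a))  # True
    print(store.write("plan from agent B", seen_by_b))  # False
```

Agent B's write is rejected instead of silently overwriting agent A's, which turns an invisible conflict into one the harness can handle.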
The paper covers applications across coding assistants, GUI and OS automation, scientific discovery, DevOps, and enterprise workflows.
That’s all for today. Thank you for reading today’s edition. See you in the next issue with more AI Engineering insights.

