Harness Engineering Clearly Explained

... PLUS: Agentic Search Model from Glean

In today’s newsletter:

  • Glean Waldo: The Agentic Search Model

  • Harness Engineering Clearly Explained

Reading time: 5 minutes.

Most agentic tasks start the same way. The system searches internal docs, reads results, refines the query, and repeats until enough context is gathered to answer.

Today, frontier models handle this entire loop. They decide what to search, when to stop, and then generate the final answer from the retrieved context.

Search and reasoning are executed inside the same model. But these are two different jobs:

  • Search planning is deciding what to query and when enough information has been collected.

  • Synthesis is reasoning over retrieved context to produce an answer.

Waldo removes search planning from the LLM entirely. It runs before the LLM, handling query decisions and stopping criteria. The LLM only receives retrieved context and focuses on synthesis.

What Waldo Is

Waldo is a 30B MoE model built on Nvidia Nemotron 3 Nano. It handles only the search planning layer. It runs first, before the frontier model, decides which queries to run across Glean Search, Employee Search, and Web Search, determines when enough context has been gathered, then hands off to the frontier model with retrieved context already in place.
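The division of labour described above can be sketched as a two-stage pipeline. Everything below is illustrative: `ToyPlanner`, the backend names, and the data shapes are assumptions for the sketch, not Glean's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    done: bool                                    # planner signals "enough context gathered"
    searches: list = field(default_factory=list)  # (backend, query) pairs

class ToyPlanner:
    """Stand-in for a small search-planning model in Waldo's role."""
    def next_searches(self, query, context):
        if context:                  # toy stopping rule: stop once anything was retrieved
            return Plan(done=True)
        return Plan(done=False, searches=[("glean", query), ("web", query)])

def answer_query(query, planner, frontier, backends):
    context = []
    # Stage 1: the small planner decides what to search and when to stop.
    while True:
        plan = planner.next_searches(query, context)
        if plan.done:
            break
        for backend, q in plan.searches:
            context.extend(backends[backend](q))
    # Stage 2: the frontier model only synthesises from retrieved context.
    return frontier(query, context)

# Toy backends and "frontier model", purely for illustration.
backends = {
    "glean": lambda q: [f"glean-doc: {q}"],
    "web":   lambda q: [f"web-page: {q}"],
}
frontier = lambda q, ctx: f"Answer to '{q}' from {len(ctx)} documents"

result = answer_query("vacation policy", ToyPlanner(), frontier, backends)
```

The point of the structure: the frontier model never sees a search decision, only retrieved context.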

How It Was Trained

Phase 1 (DPO): Waldo learned when to search, when to stop, and when to hand off from production tool-use patterns. Training data captured which tools were called, in what sequence, and whether the plan succeeded.

Phase 2 (RL): The model was trained against production queries and rewarded based on document recall: whether its searches surfaced the same documents that appeared in successful final answers. This pushed the model to find relevant documents in fewer search iterations.
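As a rough sketch of that reward signal: document recall can be computed as the fraction of the documents cited in the successful final answer that the planner's searches actually surfaced. Glean's exact reward is not published, so this formula is an assumption.

```python
def recall_reward(retrieved_ids, gold_ids):
    """Fraction of gold documents (those that appeared in the successful
    final answer) that the planner's searches surfaced."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(set(retrieved_ids) & gold) / len(gold)

# Two of the three gold documents were retrieved -> reward 2/3.
reward = recall_reward(["a", "b", "c"], ["b", "c", "d"])
```

A reward like this is indifferent to how many searches were run, so pairing it with a step or token cost is what would push the model toward fewer iterations.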

The Results

10x faster per call: 250ms versus 3s. Half of all queries now run on this fast path. Across Glean's production workload, that translates to roughly 50% lower latency and 25% fewer tokens compared to routing everything through a frontier model.

The Pattern Worth Paying Attention To

Waldo is a concrete example of a design principle that is becoming more common in production agent systems: specialized small models for focused, repetitive tasks. Frontier models for reasoning and synthesis.

Search planning is pattern matching at scale. It does not need GPT-5-level intelligence. It needs a model trained specifically on the structure of good search sequences, running fast and cheap, so the frontier model can do what it is actually good at.

The question for every multi-step agent you are building is the same: which parts of this workflow are pattern matching, and which parts actually need deep reasoning? The answer usually justifies a split.

Harness Engineering Clearly Explained

Developers building AI agents often run into the same problem. The demo works, but breaks a few steps in. The model loses context, tool calls fail silently, and changing models or prompts doesn’t fix it.

That's because the model isn't the problem. Everything around the model is. All of that surrounding infrastructure is what’s now called the agent harness.

From Prompts to Context to Harnesses

The way we work with AI has shifted through three distinct phases, each solving the bottleneck of its time.

Prompt engineering (2022-2024) was about crafting better instructions. The model's capability was the bottleneck, so teams studied how to unlock more from it through better prompting.

Context engineering (2025) became necessary when models got stronger but information management became the problem. Andrej Karpathy helped popularise the term. Teams focused on RAG, memory systems, and context window management.

Harness engineering (2026) came about when models became good enough to be useful but not reliable enough to trust. This involves not just the instructions or the context given to a model, but the environment, tools, constraints, and feedback loops around the model.

Each layer contains the one before it. You still need good prompts and well-managed context. But when agents started doing real work (writing code, making tool calls, and operating across tasks spanning hundreds of steps), the infrastructure around the model became the defining factor between demos that impress and systems that ship.

Understanding the Agent Harness

In simple terms: an agent = the model + everything that runs it (the harness).

This means the harness is every piece of code, configuration, and execution logic that isn't the LLM itself. The orchestration loop, tools, memory, context management, state persistence, error handling, and guardrails. All of it.

When someone says "I built an agent," what they actually built is a harness pointed at a model.

How a Harness Actually Runs

A harness is not just a set of components. It is a continuous loop that coordinates everything the model does.

At the core is the Think-Act-Observe (TAO) cycle, sometimes called the ReAct loop:

Think: The harness assembles the current context and sends it to the model. The model reasons about what to do next based on the goal, available tools, and previous observations.

Act: The model outputs an action: a tool call, a code edit, or a shell command. The harness parses this output, validates it against constraints, and executes it in a controlled environment.

Observe: The harness captures the result (success or failure, error messages, output data) and feeds it back to the model as an observation. This becomes part of the context for the next cycle.

The loop repeats until the task is complete or a stop condition is reached.
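A minimal sketch of that loop, assuming a `model` callable that returns either a tool call or a finish action. The message and action shapes here are illustrative, not any particular framework's API:

```python
def run_agent(model, tools, goal, max_steps=20):
    context = [{"role": "user", "content": goal}]
    for _ in range(max_steps):               # stop condition: bounded execution
        action = model(context)              # Think: model decides what to do next
        if action["type"] == "finish":
            return action["answer"]
        tool = tools[action["tool"]]         # Act: harness executes the tool call
        try:
            observation = tool(**action["args"])
        except Exception as e:               # Observe: failures become feedback
            observation = f"ERROR: {e}"
        context.append({"role": "tool", "content": str(observation)})
    return "stopped: step limit reached"

# Toy stand-ins to show one full Think-Act-Observe cycle.
tools = {"add": lambda a, b: a + b}

def toy_model(context):
    # Stand-in "model": call the tool once, then finish with its result.
    for m in context:
        if m["role"] == "tool":
            return {"type": "finish", "answer": m["content"]}
    return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}

result = run_agent(toy_model, tools, "add 2 and 3")
```

Note that the harness, not the model, owns the loop: parsing, execution, error capture, and the step limit all live outside the LLM.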

What makes this loop reliable:

Verification gates run between steps. Before accepting an action, the harness can run linters, type checkers, or tests. If they fail, the error is fed back to the model for correction instead of letting bad outputs propagate.

Bounded execution prevents runaway loops. The harness sets limits on how many steps an agent can take, how long it can run, or how many times it can retry a failed action.
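The first two ideas can be sketched together: a verification gate that feeds the error back for correction, under a hard retry cap. `propose` and `check` are hypothetical stand-ins for the model call and for a linter or test run:

```python
def gated_step(propose, check, max_retries=3):
    feedback = None
    for _ in range(max_retries):         # bounded execution: hard retry cap
        candidate = propose(feedback)    # model proposes an action or edit
        ok, error = check(candidate)     # verification gate: linter, types, tests
        if ok:
            return candidate
        feedback = error                 # failure becomes feedback, not output
    raise RuntimeError("verification failed after retries")

# Toy stand-ins: the "model" fixes its output once it sees the error.
def propose(feedback):
    return "clean code" if feedback else "broken code"

def check(candidate):
    return (candidate == "clean code", "lint error: broken code")

result = gated_step(propose, check)
```

The gate never lets a failing candidate past it; the worst case is an explicit exception, which is itself a recoverable signal for the layer above.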

Error recovery treats failures as information. When a tool call fails, the harness doesn't hide it. The error message becomes an observation, and the model adjusts its next action based on what went wrong.

Context pruning keeps the loop efficient. As the agent works through dozens or hundreds of steps, the context window fills up. The harness decides what to keep (reasoning traces, critical errors), what to summarise (long outputs), and what to discard.
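That keep/summarise/discard policy might look like the following sketch, where simple truncation stands in for real summarisation and the length threshold is an arbitrary assumption:

```python
def prune_context(messages, max_output_chars=500):
    pruned = []
    for m in messages:
        if m["kind"] in ("reasoning", "error"):
            pruned.append(m)                           # keep: traces and critical errors
        elif m["kind"] == "tool_output":
            if len(m["content"]) > max_output_chars:   # summarise long outputs
                summary = m["content"][:max_output_chars] + " …[truncated]"
                pruned.append({**m, "content": summary})
            else:
                pruned.append(m)
        else:
            continue                                   # discard everything else
    return pruned

messages = [
    {"kind": "reasoning", "content": "plan"},
    {"kind": "tool_output", "content": "x" * 1000},    # long output -> summarised
    {"kind": "chatter", "content": "hi"},              # discarded
]
pruned = prune_context(messages)
```

Real harnesses typically replace the truncation line with an LLM-generated summary, but the keep/summarise/discard decision structure is the same.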

This loop is the core of every agent system. If the loop is well-designed, the agent can recover from mistakes, self-correct through verification, and maintain coherence across long tasks. If any part is weak, the whole system degrades even if the model is strong.

Your Agent Is Only as Good as Its Harness

When building an agent, think in terms of the harness first, not the model. The key questions the harness answers are:

  • How does context flow through the system?

  • How are tool results validated and fed back?

  • How are failures handled during execution?

  • What constraints prevent the agent from repeating known mistakes?

  • What verification gates ensure output quality?

These decisions define how the agent behaves in practice. The model is the engine. The harness is everything that makes it usable in the real world.

Most enterprise agent failures trace back to harness defects: context drift, schema misalignment, and state degradation.

LangChain proved this with Terminal Bench 2.0. Their coding agent jumped from 52.8% to 66.5%, moving from rank 30 to rank 5, by changing only the harness. Same model. Different harness. Dramatically better results.

Optimizing the model without stabilizing the harness yields diminishing returns.

Build the harness.

That’s all for today. Thank you for reading today’s edition. See you in the next issue with more AI Engineering insights.

PS: We curate this AI Engineering content for free, and your support means everything. If you find value in what you read, consider sharing it with a friend or two.

Your feedback is valuable: If there’s a topic you’re stuck on or curious about, reply to this email. We’re building this for you, and your feedback helps shape what we send.

WORK WITH US

Looking to promote your company, product, or service to 200K+ AI developers? Get in touch today by replying to this email.