We optimised our agent harness for months. It still hid 90% overhead.

We measured 949 agent turns from our own harness and found that about 90% of every turn's input was overhead — system prompt, tool schemas, injected preload. The task itself, the thing the user and the assistant actually exchanged, was the other 10%. Of roughly 30 tool schemas loaded into each turn, about 2.6 were ever called.

Here is the honest version: this is a minor problem. We had already been optimising this harness across its layers for months — retrieval, caching, routing, the prompts themselves. The overhead is not a crisis, and it is not really the headline. It is the thing that slipped through everything else we were tuning, because nothing was watching it over time. Nobody chose it. It accumulated underneath the work.

The story is not the overhead. It is that a harness we had optimised hard still hid it — and only measurement surfaced it.

Build versus buy is a real fork

Most teams shipping agents in 2026 reach for an off-the-shelf harness — the editor-grade coding agents, the hosted frameworks. That is the right call for most of them. There is serious funding and serious engineering behind those tools, and you inherit it for free.

Building your own is the minority choice. It earns exactly one thing that matters: you can personalise every part of the loop. The retrieval step, the tool surface, the memory, the way the agent is told what it is for — all of it bends to the use-case instead of the use-case bending to the harness. For some products that fit is the whole point.

The part nobody mentions is what comes with it. Build your own and you also make the harness mistakes yourself. The biggest one is the quietest: the bought tools have objective measurement baked in, and the home-grown version usually ships without it. The flaws are not loud. They are invisible until you look.

The data exists. It just isn't watched.

The frustrating thing is that the data to catch these flaws is already there in principle. Tracing is connected. Tool results, system prompts, caching behaviour, token composition per turn — it is all flowing through somewhere. An inspector or a tracing project has it.

But "it is there" and "you can act on it" are different claims. The data is rarely retrievable in a shape you can analyse, and the interesting failures only surface when you watch specific users and specific use-cases over a long enough window. A single turn looks fine. A single user over ten days starts to show a pattern. The overhead in our harness was not a bug anyone introduced. It was the sum of reasonable decisions — load the tools the agent might need, inject the preload that keeps it grounded — none of which were ever weighed against each other on real traffic.

The overhead wasn't a decision anyone made. It accumulated. You don't see it until you watch the same use-case over time — which is exactly the discipline the off-the-shelf tools gave you for free.

What we pulled, and what it said

So we ran the exercise on ourselves. Per-turn context snapshots were pulled read-only from local and staging over a roughly ten-day window — aggregates only, tool names and token counts, no message content. The sample was 949 turns across 694 traces, touching 59 distinct tools. A single org. A current-state measurement of our own harness looking at itself, not a controlled benchmark.

The shape was stark. Each turn carried about 30 tool schemas and called about 2.6. Seventeen percent of turns — 157 of them — called zero tools and still shipped the full ~30 schemas. The mean per-turn budget broke down like this: task content 2,829 tokens, system prompt 6,520, tool schemas 6,496, injected preload 6,572 — 22,417 in total. The system prompt, the schemas, and the preload each weighed more than twice what the actual conversation did.
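The breakdown above is plain arithmetic, and it is easy to re-derive. The sketch below uses only the four mean token counts quoted in the text; the dictionary keys and the function name are illustrative labels, not names from the harness. Note that the ratio of these means comes out near 87%, a little under the roughly 90% quoted earlier. One possible reason (an assumption, the text does not say) is that the quoted ratio averages per-turn ratios rather than dividing the mean totals.

```python
# Mean per-turn token counts as reported in the text above.
# The key names are illustrative labels, not identifiers from the harness.
MEAN_TOKENS = {
    "task_content": 2829,
    "system_prompt": 6520,
    "tool_schemas": 6496,
    "injected_preload": 6572,
}


def budget_shares(tokens: dict[str, int]) -> dict[str, float]:
    """Return each segment's share of the total input, as a fraction of 1."""
    total = sum(tokens.values())
    return {name: count / total for name, count in tokens.items()}


total = sum(MEAN_TOKENS.values())
shares = budget_shares(MEAN_TOKENS)

# Everything that is not task content counts as overhead here.
overhead_share = 1.0 - shares["task_content"]

print(f"total input tokens per turn: {total}")
for name, share in shares.items():
    print(f"{name:>17}: {share:6.1%}")
print(f"overhead share (ratio of means): {overhead_share:.1%}")
```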

Then the ablation: strip the tool schemas a turn never invokes, keep every tool it does. Tool-schema tokens fell from 6,496 to 675 per turn. Total input dropped from 22,417 to 16,597 — about 5,800 tokens saved on every single turn. The safety gate held: no strategy that dropped a tool a turn had actually used was allowed to count, so no capability was removed.
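The ablation and its safety gate can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the harness's actual code: the `TurnSnapshot` record, its field names, the tool names, and the token counts are all hypothetical. It mirrors only what the text describes: per-turn aggregates of tool names and token counts, a strategy that drops schemas the turn never invoked, and a gate that rejects any result missing a tool the turn did invoke.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnSnapshot:
    """Hypothetical per-turn record: aggregates only, no message content."""
    loaded_schema_tokens: dict[str, int]  # tool name -> schema token count
    called_tools: frozenset[str]          # tools the turn actually invoked


def strip_unused(turn: TurnSnapshot) -> dict[str, int]:
    """Strategy: keep only the schemas for tools this turn invoked."""
    return {
        name: tokens
        for name, tokens in turn.loaded_schema_tokens.items()
        if name in turn.called_tools
    }


def passes_safety_gate(turn: TurnSnapshot, kept: dict[str, int]) -> bool:
    """A strategy only counts if every invoked tool is still present."""
    return turn.called_tools <= set(kept)


# Hypothetical turns: same three tools loaded, different tools called.
LOADED = {"search": 200, "read_file": 150, "send_email": 300}
turns = [
    TurnSnapshot(LOADED, frozenset({"search"})),
    TurnSnapshot(LOADED, frozenset()),  # zero-tool turn, full schemas loaded
]

before = [sum(t.loaded_schema_tokens.values()) for t in turns]
after = [sum(strip_unused(t).values()) for t in turns]
gate_ok = all(passes_safety_gate(t, strip_unused(t)) for t in turns)

print(before, after, gate_ok)
```

The strategy here uses hindsight: it knows which tools a recorded turn called. That makes it a measurement of how much schema weight went unused, not a live routing policy.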

token-composition.svg

Per-turn token composition, full harness versus stripped. Same x-scale; task, system prompt, and preload are untouched — the tool-schema segment collapses from 6,496 to 675 tokens, reclaiming about 5,800 per turn. Means across 949 turns; tiktoken cl100k_base.

The honest limits

A field report that only quotes the headline is a pitch, so here is where it gives ground. Stripping the unused schemas shifted the overhead ratio by only about 3 percentage points, because the system prompt and the injected memory dominate the budget regardless of how the tools are handled. The other obvious lever, sending the heavy preload once and referencing it after, left the ratio where it was. The absolute token saving is the real story; the ratio is not. Against everything already tuned on this harness, this was a small slip, which is exactly why it is worth noticing. The expensive failures are rarely the loud ones.
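A small worked example shows why the ratio and the absolute saving can tell different stories. It assumes that `overhead_bps` means the overhead share of input tokens in basis points (1 bps = 0.01%), which is consistent with the figures quoted but is not spelled out in the text. The turn below is hypothetical and deliberately round; it is not drawn from the measured data.

```python
def overhead_bps(task_tokens: int, overhead_tokens: int) -> int:
    """Overhead share of the input, in basis points (assumed definition)."""
    total = task_tokens + overhead_tokens
    if total <= 0:
        raise ValueError("turn has no input tokens")
    return round(10_000 * overhead_tokens / total)


# Hypothetical turn, not taken from the measured data.
task, overhead = 1_000, 9_000
saved = 2_000  # overhead tokens removed by some strategy

before = overhead_bps(task, overhead)
after = overhead_bps(task, overhead - saved)
input_cut = saved / (task + overhead)

print(f"ratio: {before} -> {after} bps ({(before - after) / 100:.1f} points)")
print(f"input tokens cut: {input_cut:.0%}")
```

In this toy case a fifth of the input disappears while the ratio moves by 2.5 points, because the remaining overhead still dwarfs the task content.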

overhead-by-strategy.svg

Overhead ratio by strategy (overhead_bps, in basis points; lower is better). The y-axis starts at 8,550 so the gap is visible at all: every strategy lands within about 3 percentage points of the baseline. Stripping unused schemas (8,704) barely beats the full harness (8,995), and sending the preload once sits exactly on the baseline. The ratio is flat; the absolute-token saving above is where the win is.

And the measurement is exactly that — a measurement. One org, ten days, our own harness. There is no model re-run proving answer quality held after the context got smaller; the safety gate only guarantees that every tool a turn invoked was still present, not that the agent behaves identically with less around it. The schema weights are reconstructed from recorded per-tool counts, so the absolutes are approximate even if the ranking between strategies is sound. None of this is shipped. It is a current-state photograph, not a benchmark and not a fix.

The data was always there. It just wasn't retrievable, and nobody was watching it over time.

The thing worth keeping is not the 5,800 tokens — that saving is small. It is the loop that found them. A human-owned question, a frozen evaluation, one agent-editable experiment file, a single metric, keep-or-discard on git — the same small research harness that surfaced this could run continuously, against new traffic, against the next use-case that drifts.
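The keep-or-discard step of such a loop can be sketched as a single pure function. Everything here is an assumption made for illustration: the function name, the record shape, and the choice of mean input tokens as the single metric. The text does not describe the real experiment file or how the git step is wired, so the sketch returns a decision and leaves committing or reverting to the caller.

```python
from statistics import mean
from typing import Callable, Sequence, TypeVar

Turn = TypeVar("Turn")


def keep_or_discard(
    frozen_eval: Sequence[Turn],
    metric: Callable[[Turn], float],
    gate: Callable[[Turn], bool],
    baseline: float,
) -> tuple[bool, float]:
    """Decide whether an experiment beats the baseline on a frozen eval set.

    Returns (keep, score). Lower scores are better. An experiment that
    fails the gate on any turn is discarded and reports the baseline.
    """
    if not frozen_eval:
        raise ValueError("frozen evaluation set is empty")
    if not all(gate(turn) for turn in frozen_eval):
        return False, baseline
    score = mean(metric(turn) for turn in frozen_eval)
    return score < baseline, score


# Hypothetical results of one experiment on a two-turn frozen eval set.
results = [
    {"input_tokens": 16_000, "gate_ok": True},
    {"input_tokens": 17_000, "gate_ok": True},
]

keep, score = keep_or_discard(
    results,
    metric=lambda r: r["input_tokens"],
    gate=lambda r: r["gate_ok"],
    baseline=22_000,
)
print(keep, score)  # the caller would commit on keep, revert otherwise
```

Keeping the decision separate from the version-control action makes the rule itself easy to test against new slices of traffic.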

self-improving-loop.svg

The loop that found the saving, not the saving itself: a human-owned question, a frozen evaluation, one agent-editable experiment, one metric, keep-or-discard on git — then run again on the next slice of traffic.

Build your own harness and the personalisation is the easy part to want. The harder, less glamorous part is the discipline the bought tools handed you without asking: pull your own data, run the analysis, feed it back. A harness that keeps measuring itself is the only kind that improves on its own.
