An 8B model matched frontier quality at 462× lower cost.

Picture the bill for a single reply to an email. A model with a million-token context window, reasoning dialled to its highest setting, the full frontier price per token — spun up to produce three polite sentences and a sign-off. The work it can do is enormous. The work the task needs is small. The gap between the two is the part you pay for and never use.

That gap has a name now, or at least a number. The frontier has overshot the everyday job.

The build question is no longer "which model is smartest." It is "which captured workflow can a small, cheap, specialised model now run at near-frontier quality", and the published evidence, from three tested domains, suggests the answer is most of them.

The overshoot

Frontier capability keeps climbing. Most white-collar tasks do not. Answering a support ticket, booking a trip, walking a customer through a claim — these are bounded procedures with a fixed shape. They do not get harder when the model gets smarter. They just get more expensive to run on the smart model.

Why pay for a million-token window and extra-high reasoning to write a reply to an email? You just need the email answered.

Tracking tokens and cost per agent invocation has stopped being an afterthought. When the model can do far more than the task requires, the interesting engineering moves from capability to cost per unit of work done. And once you start measuring that, a ladder appears: three rungs, each cheaper than the last, and they stack.

Three rungs

Rung one — route. Send the hard tasks to a big model and the easy ones to a small one. This is the most familiar move, and it works: a dynamic router classifies the incoming task and picks the cheapest model that can clear the bar. It trims the price per token. It is also the rung most teams stop at.
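A minimal sketch of rung one, under stated assumptions: the model names, prices, capability scores and the keyword-based `estimate_difficulty` function are all hypothetical stand-ins (a real router would use a trained classifier). It shows only the core rule: pick the cheapest model whose capability clears the bar for the task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    price_per_m_tokens: float  # illustrative blended USD per million tokens
    capability: int            # hypothetical 1-5 capability score


# Hypothetical catalogue; none of these numbers come from the article.
MODELS = [
    Model("small-8b", 0.10, 2),
    Model("mid-tier", 1.00, 3),
    Model("frontier", 9.00, 5),
]


def estimate_difficulty(task: str) -> int:
    """Toy stand-in for a learned classifier: returns a 1-5 difficulty score."""
    hard_markers = ("prove", "multi-step", "legal analysis")
    if any(marker in task.lower() for marker in hard_markers):
        return 5
    return 2 if len(task.split()) < 50 else 3


def route(task: str, models=MODELS) -> Model:
    """Return the cheapest model whose capability meets the task's difficulty."""
    need = estimate_difficulty(task)
    eligible = [m for m in models if m.capability >= need]
    return min(eligible, key=lambda m: m.price_per_m_tokens)


easy = route("Reply politely to this email and sign off.")
hard = route("Prove the lemma and check each case.")
print(easy.name, hard.name)
```

The design choice worth noting is that the router optimises for price subject to a quality floor, not for quality: the frontier model is the fallback, not the default.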

Rung two — trim the context. The cheapest token is the one you never send. Most of an agent's per-turn bill is overhead nobody chose on purpose: tool schemas for tools the agent never calls, a fixed preload re-injected on every message, instructions stacked turn over turn. Routing trims the price per token; context discipline trims the token count. They are different savings and they multiply.
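To see why the two savings multiply, here is a small sketch. The tool names, token counts and prices are invented for illustration; the only thing it demonstrates is that per-turn cost is price per token times token count, so a cut to each factor compounds.

```python
from __future__ import annotations


def turn_cost(tokens: int, price_per_m: float) -> float:
    """Cost of one turn in USD: token count times price per million tokens."""
    return tokens * price_per_m / 1_000_000


def trim_tools(tool_schemas: dict[str, int], needed: set[str]) -> dict[str, int]:
    """Keep only schemas for tools this task can call. Values are token counts."""
    return {name: n for name, n in tool_schemas.items() if name in needed}


# Hypothetical per-turn overhead: schema size in tokens for each registered tool.
tool_schemas = {"search": 400, "calendar": 600, "crm_lookup": 800, "send_email": 200}
message_tokens = 1_000  # hypothetical size of the actual conversation content

baseline_tokens = message_tokens + sum(tool_schemas.values())
trimmed_tokens = message_tokens + sum(trim_tools(tool_schemas, {"send_email"}).values())

big_price, small_price = 3.00, 0.30  # hypothetical USD per million tokens

price_ratio = big_price / small_price           # saving from routing alone
token_ratio = baseline_tokens / trimmed_tokens  # saving from trimming alone
combined = turn_cost(baseline_tokens, big_price) / turn_cost(trimmed_tokens, small_price)

print(f"routing {price_ratio:.1f}x, trimming {token_ratio:.1f}x, combined {combined:.1f}x")
```

With these made-up numbers, a 10× price cut and a 2.5× token cut combine to 25×, not 12.5×.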

Rung three — compile. For a procedure that is fixed, you can do something stranger. Bake the entire workflow into a small model's weights and throw the external orchestrator away at runtime. No router deciding what to inject this turn. No prompt scaffolding rebuilt on every message. The model has learned the procedure and runs it directly.

The first two rungs are operational discipline. The third is the one a paper just put hard numbers against.

Fig. 1. Three rungs on the cost ladder: route and trim are operational; compile removes the orchestrator at runtime.

What "compile" actually means

In May 2026, a team at the University of Melbourne (Dennis, Patil, Shabahang and Guo) published "Compiling Agentic Workflows into LLM Weights" (arXiv:2605.22502). The method is direct enough to describe in four steps.

1. Define the procedure as a flowchart: nodes are conversational turns, edges are the transitions between them.
2. Generate synthetic conversations by traversing every valid path through the graph.
3. Full-parameter fine-tune a small model on those conversations.
4. Deploy with no external orchestrator. The user talks straight to the model, which has learned to self-orchestrate.
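The first two steps can be sketched in a few lines. This is an illustration, not the paper's code: the toy flowchart `FLOW` and its node names are hypothetical, and the sketch assumes an acyclic graph (a procedure with loops would need a bound on repeats, which the summary above does not specify). Each enumerated path would then be handed to the teacher model to write out as a synthetic conversation, which is not shown here.

```python
# Hypothetical toy flowchart: each node maps to the nodes it can transition to.
FLOW = {
    "greet": ["collect_details"],
    "collect_details": ["offer_options", "escalate"],
    "offer_options": ["confirm", "escalate"],
    "confirm": ["end"],
    "escalate": ["end"],
    "end": [],
}


def enumerate_paths(graph, start, terminal):
    """Yield every path from start to terminal via iterative depth-first search.

    Assumes the graph is acyclic; a cycle would make this loop forever.
    """
    stack = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == terminal:
            yield path
            continue
        for nxt in graph[node]:
            stack.append(path + [nxt])


paths = list(enumerate_paths(FLOW, "greet", "end"))
for p in paths:
    print(" -> ".join(p))
```

Even this six-node toy has three distinct paths; the path count grows quickly with the number of decision points, which is why the tangled procedures described later run into the thousands.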

They call the result a subterranean agent: the procedure lives in the weights rather than being injected at runtime, turn after turn. The orchestrator does not disappear — it is used to generate the training data. It just isn't in the loop when a real conversation happens.

Fig. 2. Four steps: the orchestrator builds the training set; the deployed model runs alone.

Fig. 3. Surface orchestration vs the compiled agent. Insurance domain per-conversation cost from the paper's Table 6.

The student models are small — Qwen 2.5 3B for travel booking, Qwen3-8B for support and insurance. The teacher and frontier baseline is Claude Sonnet 4.5, roughly 70 times the parameters of the 3B. The domains span a clean procedure (travel booking: 14 nodes, 86 paths) up to a genuinely tangled one (insurance claims: 55 nodes, 6 decision hubs, 2,381 paths, conversations running 9 to 39 turns). Every condition was evaluated at 200 scenarios.

The numbers that move the question

Quality first, because cost is only interesting if quality holds. On a 1–5 scale measured against the in-context frontier baseline, the 8B model on the 55-node insurance procedure landed at 92–98% of in-context quality. On support, 92% on graceful handling and 97% on naturalness. The headline range across the work is 87–98% of frontier quality.

Then the cost, measured per conversation. Travel booking falls 128× ($0.133 to $0.0010). Support falls 296× ($0.103 to $0.0003). Insurance claims fall 462× ($0.327 to $0.0007).
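A quick arithmetic check on those figures, using only the numbers quoted above. Dividing the rounded dollar amounts gives roughly 133×, 343× and 467×, not exactly the reported multipliers. The sketch below assumes the gap is rounding in the quoted compiled-model costs, and checks that each reported multiplier implies a compiled cost that rounds to the quoted figure.

```python
# domain: (orchestrated cost USD, compiled cost USD, reported multiplier), as quoted above
REPORTED = {
    "travel": (0.133, 0.0010, 128),
    "support": (0.103, 0.0003, 296),
    "insurance": (0.327, 0.0007, 462),
}

implied = {}
for domain, (big, small, reported) in REPORTED.items():
    recomputed = big / small           # ratio of the rounded dollar figures
    implied[domain] = big / reported   # compiled cost implied by the reported multiplier
    print(
        f"{domain}: ratio of rounded costs {recomputed:.0f}x, "
        f"reported {reported}x, implied compiled cost ${implied[domain]:.5f}"
    )
```

On that reading the reported multipliers and the quoted costs are consistent; the compiled costs are simply small enough that four decimal places hide meaningful differences.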

That reduction decomposes into two effects that stack rather than one magic number.

Lever	Source of the saving	Multiplier
Self-host the small model	~$0.05/$0.23 per M tokens vs $3/$15 frontier API rates	~65×
Drop per-turn orchestration	No re-injected procedure each turn — fewer tokens sent	~7–22×
Combined, per conversation	The two stack across the three domains (Table 6)	128–462×

Two multipliers, stacked. Self-hosting cuts the price per token; dropping the orchestrator cuts the number of tokens.

Self-hosting an 8B model on an A100 runs at roughly $0.05 per million input tokens and $0.23 per million output, against the frontier's published $3 and $15: a 60× drop in price per input token and about 65× per output token. On top of that, dropping the per-turn orchestration overhead cuts the token volume by something like 7 to 22×. Combined, the two effects produce the measured 128–462× per-conversation range. "Two orders of magnitude" is the conservative framing.
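A sketch of how the two levers combine in a per-conversation bill. The per-token prices are the ones quoted above; the token counts are hypothetical, and the assumption that orchestration overhead inflates input tokens only (leaving output tokens unchanged) belongs to this sketch, not to the paper's accounting.

```python
import math

FRONTIER = {"input": 3.00, "output": 15.00}    # USD per million tokens, as quoted
SELF_HOSTED = {"input": 0.05, "output": 0.23}  # USD per million tokens, as quoted

# Per-token price ratio for each direction.
price_ratio = {kind: FRONTIER[kind] / SELF_HOSTED[kind] for kind in FRONTIER}


def conversation_cost(input_tokens: int, output_tokens: int, prices: dict) -> float:
    """Total USD cost of one conversation at the given per-million-token prices."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000


# Hypothetical conversation: orchestration re-injects context, so input is 7x larger.
orchestrated = conversation_cost(70_000, 2_000, FRONTIER)
compiled = conversation_cost(10_000, 2_000, SELF_HOSTED)

print(f"input price ratio {price_ratio['input']:.0f}x, output {price_ratio['output']:.1f}x")
print(f"orchestrated ${orchestrated:.4f}, compiled ${compiled:.5f}, "
      f"combined {orchestrated / compiled:.0f}x")
```

In this toy case the combined factor is smaller than a straight product of the price ratio and the 7× input reduction, because output tokens are billed at the higher rate and do not shrink. The realised multiplier depends on the input/output mix of the workflow.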

The setup cost is small and one-time: roughly $40 to generate the training data and $10–40 of fine-tuning compute, call it $50–80 per workflow. A recompile cycle runs 30–50 minutes on 8 H200s, or a few hours on a single A100.
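A back-of-envelope break-even, using only figures quoted in this piece: the $50–80 setup range and the insurance per-conversation costs. The helper name is made up for illustration, and the calculation ignores anything not stated above, such as idle GPU time or engineering effort.

```python
import math


def break_even_conversations(setup_cost: float, orchestrated: float, compiled: float) -> int:
    """Smallest whole number of conversations whose savings cover the setup cost."""
    saving_per_conversation = orchestrated - compiled
    return math.ceil(setup_cost / saving_per_conversation)


# Insurance domain: $0.327 orchestrated vs $0.0007 compiled per conversation.
low = break_even_conversations(50, 0.327, 0.0007)
high = break_even_conversations(80, 0.327, 0.0007)
print(f"insurance workflow pays back after {low} to {high} conversations")
```

On those numbers the insurance workflow repays its compile cost within a few hundred conversations, which is the sense in which the setup cost is small.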

None of this came from nowhere. NVIDIA argued the position in mid-2025: "Small Language Models are the Future of Agentic AI" (arXiv:2506.02153) made the case that sub-10B models are sufficient, more suitable, and more economical for the many repetitive specialised calls inside an agentic system, and that heterogeneous mixes of model sizes are the natural design. The compilation paper is the mechanism and the measured numbers underneath that position.

Where it loses

A field report that only quotes the headline is a pitch. The honest part is where compilation gives ground, and the paper is candid about it.

The remaining quality gap is not evenly spread. It concentrates in information accuracy — about 87% — and that is exactly the place a compiled model should struggle. Memorising a procedure does nothing for broad world knowledge. A small model that has learned an insurance workflow perfectly is still a small model when a question reaches outside the workflow into open-domain facts. The external orchestrator still leads on information accuracy across all three domains.

Failure rates are mixed rather than a clean sweep. The compiled model wins big on travel (5.5% failure versus 24% for the orchestrated baseline) and on insurance (9% versus 17%). On support it loses slightly: 11% versus 9%. Compilation does not win everywhere, and the margin does not track graph size alone: travel, the 14-node procedure, shows the widest gap.

And the structural cost: every change to the procedure means a recompile. Cheap, but not free, and not instant. This suits stable procedures — the ones that look the same this quarter as last. A workflow that changes weekly is the wrong candidate; you would spend the savings regenerating training data.

There is even an honesty note on the evaluation itself. The frontier model both generated the training data and judged the outputs, so the authors re-ran the scoring with a different judge — where the compiled models still hit 83–99% of in-context quality.

The cost curve has been bending one way for a while: the model layer gets cheaper, the workflow intelligence gets richer. Routing trims the price per token. Context discipline trims the token count. Compilation, for a fixed procedure, removes the orchestrator from the runtime entirely and lands at 87–98% of frontier quality for a fraction of the cost. Three rungs, and the third is the one that changes the question. The next thing worth building is not a smarter model. It is the captured shape of a white-collar task, fine-tuned into a model small enough that running it stops being something you have to budget for. The question was never which model is smartest. It is which workflow you captured well enough that a small one can run it.
