Cheaper per token. Dearer per task.

Claude Sonnet 5 shipped on 30 June at $2 per million input tokens and $10 per million output — cheaper per token than the Sonnet before it, and less than half the rate-card price of Opus 4.8. Every instinct that has governed model upgrades for the last couple of generations said the same thing: swap the key. Within a day, independent measurement inverted the win. On Artificial Analysis's Intelligence Index, Sonnet 5 came in at roughly $2.29 per completed task — about twice the prior Sonnet, and around 15% more than Opus 4.8, the model it undercuts on the rate card.

A per-token price cut is no longer a price cut. When a model is trained to act more, the only honest unit is cost-per-completed-job — and no leaderboard measures yours.

What "cheaper" looks like from the inside

We had already learned the lesson on our own platform, months before this launch.

On that platform there is no credit balance. "Credits" is a per-user monthly spend cap, computed by summing message cost. When a user crosses it, the system silently rewrites their chosen model to a cheaper fallback — no banner, no note, nothing in the UI. The fallback tier returns empty completions roughly a fifth of the time on longer, non-English prompts. To the user, the assistant just goes quiet.
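The mechanism above can be sketched in a few lines. This is a minimal illustration, not the platform's actual code: the names `route_with_cap` and `RoutingDecision`, the model labels, and the dollar figures are all hypothetical. The one deliberate difference from the behaviour described is that the downgrade is returned as explicit data, so a UI has something to show.

```python
from dataclasses import dataclass


@dataclass
class RoutingDecision:
    model: str        # model that will actually serve the request
    downgraded: bool  # True when the spend cap replaced the user's choice
    notice: str       # text a UI could surface; empty when nothing changed


def route_with_cap(chosen_model: str, message_costs_usd: list[float],
                   monthly_cap_usd: float, fallback_model: str) -> RoutingDecision:
    """Apply a per-user monthly spend cap computed by summing message cost."""
    spent = sum(message_costs_usd)
    if spent < monthly_cap_usd:
        return RoutingDecision(chosen_model, False, "")
    # Cap reached: fall back, but say so instead of rewriting silently.
    notice = (f"Monthly spend cap reached (${spent:.2f} of ${monthly_cap_usd:.2f}); "
              f"using {fallback_model} instead of {chosen_model}.")
    return RoutingDecision(fallback_model, True, notice)


# Hypothetical user who has spent $10.25 against a $10.00 cap.
decision = route_with_cap("primary-model", [4.10, 3.25, 2.90], 10.00, "fallback-model")
print(decision.model, decision.downgraded)
print(decision.notice)
```

With a `downgraded` flag in the response, an empty completion from the fallback tier arrives with its cause attached, rather than looking like a routing bug.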

The team burned four debugging turns chasing a "routing bug" before anyone realised the cap had fired. The cost control was the bug report. Internal staff dogfooding the deep agent on very large prompts exhausted a month's cap in under two weeks, which is how the failure surfaced at all.

A cheaper model isn't cheaper if it fails a fifth of the time and the failure looks like a product bug.

Nobody had switched models to save money. A budget rail had switched for them, on the rate card's logic (pick the lower price per token), and the job it produced was worse in a way no price sheet showed. That is the whole argument in miniature: "cheaper" measured per token counts for nothing when the task it was meant to complete quietly fails.
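A rough way to put a number on that: divide the cost of one attempt by the share of attempts that succeed. The sketch below assumes failed attempts are simply retried and that failures are independent, which is a simplification. The per-attempt prices are hypothetical; only the one-in-five failure rate comes from the story above.

```python
def expected_cost_per_completed_job(cost_per_attempt_usd: float,
                                    success_rate: float) -> float:
    """Expected spend per successful job when failures are retried until success."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt_usd / success_rate


# Hypothetical per-attempt prices: the fallback is 10% cheaper per attempt.
primary = expected_cost_per_completed_job(0.050, 1.00)
fallback = expected_cost_per_completed_job(0.045, 0.80)  # fails a fifth of the time

print(f"primary:  ${primary:.5f} per completed job")
print(f"fallback: ${fallback:.5f} per completed job")
```

Under these assumed numbers the fallback costs $0.05625 per completed job against $0.05 for the primary: cheaper per attempt, dearer per result. That is before counting the debugging time a silent failure consumes.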

Why intelligence now costs more turns

The Sonnet 5 inversion is the same shape at frontier scale, and it is structural rather than a fluke.

The model is reinforcement-learned for agentic work. On knowledge-work evals it spends around 40% more output tokens and roughly three times as many tool-calling turns per task as the Sonnet before it. "More intelligence" now means, quite literally, more turns and more thinking. The price of a token fell; the number of tokens a job consumes rose faster.

Turns are where it compounds. Prompt caching only caches the static system prefix, not the conversation, and the between-turn cache has a short TTL — so a long, tool-heavy exchange pays full freight on a growing context every single turn. A chattier agent bills you twice: once for the tokens it emits per turn, and again for the context it drags forward into the next one. We learned that the hard way when a bulk enrichment over a few hundred records blew a main agent's context window. The rule that fell out — pass a reference and a bounded summary back, never the full payload — exists precisely because more turns and un-cached context pull cost in the wrong direction together.
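The compounding is easy to see with a toy model. The sketch below assumes no caching at all and holds the per-turn sizes fixed so that only the number of turns changes. The token counts are hypothetical and are not measurements of Sonnet 5; the two rates are the rate-card prices quoted earlier.

```python
INPUT_USD_PER_M = 2.0    # rate-card input price quoted above
OUTPUT_USD_PER_M = 10.0  # rate-card output price quoted above


def task_cost_usd(turns: int, system_tokens: int,
                  growth_per_turn: int, output_per_turn: int) -> float:
    """Cost of one task when every turn re-sends the whole, growing context."""
    input_tokens = 0
    context = system_tokens
    for _ in range(turns):
        input_tokens += context      # full context billed as input this turn
        context += growth_per_turn   # model output plus tool result carried forward
    output_tokens = turns * output_per_turn
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Hypothetical sizes: 2,000-token system prompt, 1,500 tokens added per turn,
# 500 output tokens per turn.
few_turns = task_cost_usd(turns=5, system_tokens=2_000,
                          growth_per_turn=1_500, output_per_turn=500)
many_turns = task_cost_usd(turns=15, system_tokens=2_000,
                           growth_per_turn=1_500, output_per_turn=500)

print(f"5 turns:  ${few_turns:.3f}")
print(f"15 turns: ${many_turns:.3f}")
print(f"ratio:    {many_turns / few_turns:.1f}x")
```

In this toy setup, tripling the turns multiplies the cost by six ($0.075 to $0.45), because the input side grows roughly with the square of the turn count. Passing back a reference and a bounded summary shrinks `growth_per_turn`, which is the term doing the damage.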

Put the two ledgers side by side and the launch stops looking like a discount.

Fig. 1. Rate card versus receipt. Sonnet 5's per-token price fell ($2/$10 per million tokens, less than half Opus 4.8's), yet the cost to finish a job rose to about $2.29, roughly 2x the prior Sonnet and about 15% above Opus 4.8. The cause is structural: RL'd for agentic work, it spends around 40% more output tokens and about 3x the tool-calling turns per task. Source: Artificial Analysis.

The blind key-swap is dead

The old rule was simple, and for years it was correct. A new model lands at the same price, benchmarks clearly higher, so you swap the key and move on. Must-change. Free.

That rule assumed price per token and cost per task were the same number. They are not any more. On an agentic workload, a blind swap to a model that thinks and acts more per job can quietly multiply the bill instead of trimming it. The swap is no longer free; it is a budgeting decision you cannot make from a rate card.

This is not a reaction to a launch. We already route simple background-agent tasks to a cheaper tier, on the plain principle that not everything needs the top model. Matching model to task difficulty — cheap for a quick question, expensive for a multi-system job — roughly halves blended cost, and it is the single highest-impact lever we have on margin. A newer model that makes more tool calls is a routing problem, not a pricing win.
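As a sketch of what that routing amounts to arithmetically: the tier names, the difficulty labels, the task mix, and every per-task cost below are hypothetical placeholders, chosen only to show the calculation. The saving on a real workload depends entirely on its own mix and costs.

```python
# Hypothetical cost per task in USD, keyed by (tier, difficulty).
COST_PER_TASK_USD = {
    ("top", "simple"): 0.10,
    ("top", "complex"): 0.30,
    ("cheap", "simple"): 0.02,
}


def pick_tier(difficulty: str) -> str:
    """Route simple tasks to the cheap tier and everything else to the top model."""
    return "cheap" if difficulty == "simple" else "top"


def blended_cost(task_mix: dict[str, float], tier_for) -> float:
    """Average cost per task for a mix given as {difficulty: share of tasks}."""
    return sum(share * COST_PER_TASK_USD[(tier_for(difficulty), difficulty)]
               for difficulty, share in task_mix.items())


mix = {"simple": 0.8, "complex": 0.2}            # hypothetical task mix
all_top = blended_cost(mix, lambda _difficulty: "top")
routed = blended_cost(mix, pick_tier)

print(f"everything on the top model: ${all_top:.3f} per task")
print(f"routed by difficulty:        ${routed:.3f} per task")
```

With these assumed figures the blended cost drops from $0.14 to $0.076 per task. The hard part in practice is the classifier behind `pick_tier`, which this sketch takes as given.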

If you manage AI cost across an org and you are staring at a version bump: an intelligence increase does not make the tasks you already run any better, and it does not make a task you couldn't solve suddenly solvable. If the new model genuinely does something the old one could not, switch — that is the entire point of a better model. For the day-to-day work you are already completing, a blind org-wide bump just buys you the same output at a higher price.

Which benchmark do you even trust for your agents?

So you want to check before you switch. Against what?

Coding is well served. SWE-bench, SWE-bench Pro and Terminal-Bench are mature and widely trusted, and they genuinely track what a coding agent is good for. The everyday white-collar work that most teams actually point agents at (research-to-email chains, document lookups, back-office reasoning across a few systems) has nothing settled. GDPval and AA-Briefcase are nascent attempts, not answers.

The leaderboards that do exist disagree by domain, which is the tell. Anthropic positions Sonnet 5 as close to Opus 4.8, in some cases matching it. Independent benchmarks put Opus ahead on the hard problems and Sonnet 5 ahead on knowledge work, the split laid out below. "Better" is not one number. It is better at what, measured by whom, on which task.

Benchmark (higher is better)	Sonnet 5	Opus 4.8
SWE-bench Pro	63.2	69.2
USAMO	79.5	96.7
OSWorld-Verified	81.2	83.4
GDPval-AA v2 (knowledge work)	1618	1615

The intelligence is real. Sonnet 5 edges Opus 4.8 on knowledge work and trails only on the hardest tasks — which is exactly why a rate-card upgrade feels safe until you read the receipt. Sources: Artificial Analysis, MarkTechPost, llm-stats, the-decoder.

My own view, and I will flag it as opinion: a fair few of the benchmarks the industry leans on to crown a model use measurement methods that are outdated or easy to game. The whole field reaches for a handful of scores to say "this one is better," and rarely finishes the sentence.

A model release is only a fire drill if you hard-wired production to the old one. Stay model-agnostic and a Sonnet 5 launch becomes a routing decision instead of a scramble — you A/B it on your own task mix and cost curve, and re-key nothing on the strength of a leaderboard. The number that decides it is cost-per-completed-job at your real mix of work, and that number lives nowhere on the rate card. What the industry is missing is not a smarter model. It is more benchmarks, and more logs — per job, not per leaderboard — so a team can finally say what each model is actually best for.
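Getting that number out of per-job logs is not complicated. The sketch below assumes a log record with three hypothetical fields (`model`, `cost_usd`, `completed`); the sample records are invented for illustration and real logging schemas will differ. The key choice is that failed jobs count towards spend but not towards the denominator.

```python
from collections import defaultdict


def cost_per_completed_job_by_model(job_logs: list[dict]) -> dict[str, float]:
    """Total spend per model divided by the jobs that model actually completed."""
    spend = defaultdict(float)
    completed = defaultdict(int)
    for job in job_logs:
        spend[job["model"]] += job["cost_usd"]   # failed jobs still cost money
        if job["completed"]:
            completed[job["model"]] += 1
    return {
        model: (spend[model] / completed[model]) if completed[model] else float("inf")
        for model in spend
    }


# Invented sample: model-a is cheaper per job but one of its three jobs failed.
sample_logs = [
    {"model": "model-a", "cost_usd": 0.10, "completed": True},
    {"model": "model-a", "cost_usd": 0.12, "completed": True},
    {"model": "model-a", "cost_usd": 0.08, "completed": False},
    {"model": "model-b", "cost_usd": 0.20, "completed": True},
    {"model": "model-b", "cost_usd": 0.20, "completed": True},
]

for model, cost in sorted(cost_per_completed_job_by_model(sample_logs).items()):
    print(f"{model}: ${cost:.2f} per completed job")
```

Run over the same task mix on two models, this gives the A/B comparison directly. What counts as "completed" is the judgment call that no leaderboard can make for you.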

