The model on your laptop.

Gemma 4 booted on the workstation behind an OpenAI-shaped local endpoint. Claude Code was reconfigured to point at it. The same dev loop — read a file, propose an edit, call a tool, plan the next move — was running entirely on the laptop. Not in a notebook. Not in a benchmark harness. In the editor that runs the work.

It works. What that means in detail is the field-notes part of this post, and it isn't written yet. The framing it lives inside is.

The runtime question

Once a model runs on the laptop, and a model runs behind a hosted API, and the dev loop holds together against both, the natural framing shifts. It is no longer "which model is better". It is "which loops want which runtime".

That answer doesn't come from a benchmark. It comes from running the dev loop against each option for a real day of work, watching where the cost lands and where the seams show. The framing the answer lives inside, though, is older than open-weights inference. It is what makes the local-model conversation interesting in 2026 rather than the same conversation that has been happening since 2023.

Runtime is an axis, not a brand

A platform's service catalogue eventually grows the same shape every time. Some capabilities run on infra the team owns. Some ship into a customer's cluster, where the team never sees the runtime at all. Some are managed services with a vendor's logo on the invoice. The schema that tries to describe "where does this run" always grows a discriminator, and the resource columns always grow nullable fields where one runtime owns a property the others don't.

That shape is not new. It has shown up in every infra decision the team has had to make — database hosting, secrets storage, queue runtime, identity provider. The same three columns, every time. The same nullable fields. The same uncomfortable conversations when someone tries to flatten the columns into one.

Inference is the same conversation. It just hasn't been written down yet, because the brand on the API has been doing the work the architecture should have been doing.

There are at least three runtimes worth thinking about. They aren't interchangeable.

Discriminator	Local-self	Self-hosted	Vendor-hosted
Who runs the runtime	The engineer, on the workstation	The team, on infra they own	The vendor, behind an API
Who pays per token	Nobody — the cost is the hardware you own	The team — electricity, ops, headroom	The customer of every keystroke — metered
Who patches the model	You pull the next weights	You pull the next weights, on a schedule	The vendor — silently, sometimes overnight
What is nullable in schema	Latency budget, context ceiling	Per-call cost, vendor SLA	CPU and memory — you do not own the resource
Failure that leaks to UI	OOM at the context boundary, slow tokens	Capacity exhaustion, your own pager	Vendor outage, rate limit, model swap mid-session

The runtime axis. Each loop wants a different one.

The bill is part of the architecture

The reason this question is sharper now than it was a year ago is not that local models suddenly got useful. It is that the hosted bill has stopped being a free-tier hand-wave and started being a product constraint.

Production AI products have to do cost work. Token tracking that's accurate enough to budget against, not estimates. Per-user caps on the operations that get expensive — image generation, long-context calls. Per-dataset rate limits that throttle ingest before it eats the daily quota. None of that work is glamorous. All of it is the price of running hosted inference at production volume without having the bill quietly eat the unit economics.

When the hosted bill is already real enough to need rate caps, the receipts exist to argue that even a less-capable local model is worth the swap on certain loops. The economic case has been pre-paid by the work that was required to ship.

That is the link the discourse keeps missing. The local-vs-hosted argument is not really about model quality. It is about who has done the cost arithmetic, and who has been paying the bill on hand-wave intuitions. Teams that have already built the production controls have a different relationship to the open-weights option than teams who have only ever paid for it on a credit card with someone else's name on it. The receipts make the swap thinkable in the first place.

Some loops belong local. Some belong on the wire.

The temptation, when the local model holds up for some workload, is to declare victory and unplug. The opposite temptation, when it stumbles, is to declare the experiment a failure and switch back. Both temptations are the same mistake.

Multi-runtime systems tend to settle, eventually, into asymmetric shapes. Different sides of the system land in the runtime that fits them, even when the original diagram drew them as symmetric. The honest move is not to drag them back into symmetry. The honest move is to write the operational playbook around the asymmetry, and to stop pretending the diagram had been right in the first place.

The local-vs-hosted question has the same gravity. The most likely steady state is not "all local" or "all hosted". It is most loops local, on the workstation, where they are cheap and private and good enough — and a smaller number of loops escalating to the hosted frontier model on demand, when the prompt has earned it. Cheap inference before expensive inference. Local reasoning until the task asks for more. The deep agent on the production path stays on the hosted runtime, because the bar for production-grade routing, planning, and tool delegation is still high enough that the local option is not yet honest about clearing it. The dev loop is a different story.

What the experiment really tested

The experiment did not test Gemma 4. It tested whether the dev loop tolerates a runtime swap underneath it. It does. Which specific loops kept working and which stumbled — that is the field-notes section of this post, and it isn't written yet. The shape of the answer is the part the architecture argument needs. The specifics are the part the next pass of writing will carry.

A week from now it won't matter that the model on the laptop is Gemma 4 specifically. There will be another release, then another. What will matter is that the runtime question has been pulled out from under the brand and put back where it belongs — alongside every other infra decision the team already knows how to make.

The brand was doing the work the architecture should have been doing. That part of the conversation is over. The next part starts on the workstation.

The model on your laptop. Pointing Claude Code at Gemma 4, and what the swap actually reveals.