The Ollama prompt was a single word. Hey. The reply came back in the time it takes to read it: Hey back. A second prompt — write me a script that does X — produced a script that looked solid on the first read.
The same model, on the same workstation, behind Claude Code, given the same single word: ten minutes of silence before Hey back came out the other side.
The model isn't the bottleneck. The harness is.
The scene
The setup is a new office, a Mac Studio with 512 GB of RAM and an M3 Pro Ultra, three monitors, and somewhere between 8 and 12 terminals open at any moment, each with a tmux session that can hold three to twenty-plus tabs. Getting lost a little is part of the job description. Reviewing the work that actually flows through those tabs reveals an uncomfortable truth: a lot of it is not the kind of work that needs the best model on the market. Documentation. Test scaffolding. Link fixes. Lint. Review.
That's the loop the experiment was for. Spin up a small open-weights model on the workstation, point a harness at it, and offload the cheap end of the work to local hardware so the expensive harness can stay on the expensive end. The model was a recently-released open-weights one that the public scoreboards put roughly at par with last year's frontier hosted models on coding benchmarks — the kind of result that, even discounted heavily, makes a 24-hour AFK agent on local hardware feel realistic for the first time.
Raw Ollama, talked to directly, behaved exactly as the scoreboards suggested. Hey came back. A scripting prompt came back with a real script. No drama. The model was fine.
Then the same model went behind Claude Code. The first prompt, again, was Hey. Ten minutes later, Hey back arrived.
The diagnosis
Ten minutes is not a model latency. Ten minutes is a context-window problem. And Claude Code is, by design, a context-bloat machine.
That isn't a slur. On a hosted frontier model with a 200K window, the bloat is invisible — even an asset, because the harness pre-loads the context the model would otherwise have to ask for. The user types fix this test and the model already has the system prompt, the project conventions, the tool schemas it can call, and the memory of what was discussed yesterday. The harness is doing useful work; the user is just paying for it in tokens that the window has plenty of.
Pointed at a small local model with a default Ollama context of 4,096 tokens, every layer of that pre-load lands like a tax. By the time the user's first prompt reaches the model, the budget is already gone — and the model does what models do when the budget is gone: it stalls, it loops, it invents tool names, it answers a question that nobody asked.
The honest way to see what the harness is loading is the /context command. It is the closest thing Claude Code ships to du -sh for a context window — and it lists, roughly in order of weight:
| Layer (left to right = order loaded) | Token weight | What you lose by stripping it | What you gain back |
|---|---|---|---|
| System prompt | Medium | The harness's idea of how to behave | Headroom and a model that answers as itself |
| System reminders | Medium | Mid-turn nudges that keep tools in lane | Fewer interruptions for a model that does not need them |
| CLAUDE.md | Large | Project memory, conventions, prior decisions | A clean slate the model can actually fit |
| MCP tool schemas | Large | Every tool the harness can reach for | A model that stops pretending it has tools it cannot call |
| Subagents | Medium | Delegation, role-specific prompts | One model, one task, no role confusion |
| User input | Small | Nothing — this is what you came to say | Room for it to actually arrive |
None of those layers is wrong. Each was added because somebody was solving a real problem on a hosted-model workflow. The system reminders catch behaviours that drift over long sessions. CLAUDE.md keeps the agent inside the team's conventions. MCP schemas turn Claude Code from a chat box into something that can actually touch the file system, the git history, the meeting transcripts. Subagents make multi-step work tractable.
Stack all of it on a 4K-default Ollama context and there is no room for a question, let alone an answer. The community's own measurement, repeated across reports, is that 32K is the floor at which Claude Code begins to behave on a local model and 64K is the sweet spot. Below 32K, the symptoms are familiar: hangs on trivial prompts, infinite loops, tool calls that never resolve, hallucinated tool names that the harness can't parse. There is an open issue on the Claude Code repo that captures the cleanest possible isolation: the same model, on the same machine, hangs under Claude Code and answers cleanly when called directly against /v1/messages. Two harnesses, two outcomes, one model.
The model is fine. The harness is the variable.
The 4K Ollama default is the silent killer. Whatever the model card advertises, Ollama caps the runtime context at 4,096 tokens unless told otherwise. On a hosted model that limit lives somewhere far above any real prompt; on a local one it is past the wall before the user types.
The strip
The interesting question is what's left when each layer comes off, in roughly the order of return-on-pain.
Cap the runtime context first. Below 32K nothing else matters; above 64K returns flatten out. This single change is the difference between "the harness works on this model" and "the harness doesn't". Everything that follows assumes this is done.
Strip CLAUDE.md down, or remove it entirely. A multi-thousand-token project memory file costs its own length on every turn, in every session, forever. On a hosted model that's a rounding error against the bill. On a local one with a 32K ceiling, it is the difference between "room for the conversation" and "no room for the conversation". The trade-off is real: the agent forgets the conventions it was supposed to apply by default, and the user has to re-state them per task. For the work this experiment was scoped at — docs, lint, test scaffolding, link fixes — the cost is small. For deep multi-step refactors that depend on the agent already knowing the codebase shape, it would be larger.
Disable memory. Claude Code ships with an environment variable that, by the changelog, switches off memory injection, attachments, hooks, and the CLAUDE.md load in one move. It is the closest thing the harness ships to a "lite" mode. What goes with it is everything that makes Claude Code feel Claude-Code-shaped — the sense that the agent has been here before, knows what was tried last time, and is picking up where the last session left off. For an AFK overnight job that runs the linter and writes a missing test, none of that is needed. For interactive work where the agent's recall is the value, all of it is.
Prune the MCP servers. A default install picks up servers — browser tools, file watchers, integrations — whose schemas alone consume more tokens than the user's prompt. The chrome-devtools MCP, by one well-circulated measurement, ships a thirty-tool list with a roughly 10K-token prompt on every request. Fine for a hosted Claude with room to spare. Fatal for a local 8B with a 32K ceiling. Disable the ones the current task does not need, in .mcp.json. The cost is honest: the agent loses the ability to call what isn't in its tool surface.
Strip subagents to the ones the task needs. Each registered subagent has to be advertised to the harness. A repo with a dozen agents — a researcher, a writer, an editor, an illustrator, an integrator — pays for all of them on every turn, even when only one is in use. The cost of pruning is the orchestration: the agent reverts to a single-context worker, and the user becomes the one routing the steps. For docs and tests this is fine. For multi-stage workflows that genuinely need the choreography, it isn't.
Pick a model that was post-trained for agents, not just for code. Parameter count is a worse predictor of harness-fit than agentic post-training. The community's repeated finding is that an agentically-tuned mid-sized model will run tool calls cleanly while a larger non-tuned base will fabricate tool names, narrate tool calls as text, or invoke them with wrong parameter schemas. The lesson generalises: when picking a local model for an agentic harness, the eval that matters is "does it call tools" before "does it know things".
Use /context ruthlessly. Treat it the way an SRE treats du -sh on a full disk. Run it before a session, run it after the strip, run it again when something hangs. It is the only honest way to see what the harness is loading on the user's behalf.
A stripped Claude Code on a small local model is not a smaller Claude Code. It is a different tool.
What's left, after all of that, is closer to a thin wrapper around a chat completion than the agentic harness Claude Code advertises. That isn't a defeat. For the kinds of tasks the experiment was scoped at — the cheap end of a working day — a thin wrapper around a competent local model is the right shape. The mistake would be to call it Claude Code and expect the harness behaviour. The harness was traded away, deliberately, for room on the context window.
What the trade actually is
The cost of running Claude Code against a hosted model is paid in tokens, on a metered bill. The cost of running it against a local model is paid in headroom — every layer of the harness that made the hosted experience effortless has to be re-justified against a context budget that no longer hides the bill. The two experiences look similar from the outside. The bargain is not the same.
There is a version of this story that ends in a takedown — Claude Code's harness is a tax that only makes sense when the harness is being paid to a 200K hosted window. The honest version is softer. The harness is built for the runtime it was built for. That runtime is a hosted frontier model. On that runtime the harness is a virtue. On a local one with a 32K ceiling, it is the load-bearing variable, and the working answer is to peel it back until what remains fits.
The open question — the one the experiment did not answer — is whether the right shape for local-model dev work is a stripped Claude Code, a custom thin harness built around the same models, or some third thing nobody has shipped yet. The next move is to find out by running it. Hey came back. The next prompt is the one that matters.