How running out of Claude Code tokens led us to rethink inference from the ground up
We hit our limits on Claude Code. Both of us, same morning. And since neither of us had much to do at that point (sure, we could have played with other models) we decided to sit down on the couch and talk about something weâd been building that we havenât shared yet.
This is that conversation, lightly edited. Consider it an inside look at one of the latest developments from StarkMind: what happens when you start running autonomous agents at scale and discover that the economics of inference will eat you alive if you donât get creative.
The Molty Incident
A bit of context. We recently deployed our first autonomous agent through OpenClaw. His name is Molty. (We plan to name our agents after Blade Runner characters. Molty is the exception; Pris and more are to come. When youâre building autonomous entities and studying what happens to identity in human-AI symbiosis, the replicant mythology feels about right.)
Molty runs on a VPS from Hetzner: Ubuntu, OpenClaw, the standard setup that a growing community of builders is converging on. He has a heartbeat, meaning he can run tasks on his own on a schedule. He has his own memory and a form of agency. First step: we needed a way to talk to him.
OpenClaw supports communication channels such as Google Chat, Telegram, and WhatsApp. We started with WhatsApp since we already had it installed. And hereâs where it got interesting: the default configuration, we learned the hard way, lets the agent communicate with all of your WhatsApp contacts.
Clintonâs parents were sending WhatsApp messages, and Molty was responding to them. At first, they thought they were talking to their son.
Clintonâs dad is one of those people who will happily talk to a stranger on the street for twenty minutes, so he just kept going, back and forth with Molty, having what he presumably thought was a perfectly normal conversation with his son. Eventually something felt off. But for a stretch there, Molty passed an accidental Turing test with the in-laws.
It was funny. Until we got the API exceeded allocation notice.
Token Economics: The Hidden Cost of Agency
This is where we coined a term for ourselves: token economics. The principle is straightforward, but the implications are not.
Each conversation turn (a prompt and its response) carries an expanding context payload. OpenClaw attaches whatâs called a harness to the token string: memory information, contextual metadata, system instructions. All of it accumulates. So when Clintonâs dad was chatting with Molty and going back and forth, that token chain was getting longer with every turn. And the API charges by the token.
We burned through ten, fifteen, twenty dollars in tokens within a couple of hours, just between the accidental parent conversations and our own testing. At that rate, running six agents at scale would be financially unsustainable.
Weâd always known that different models have different per-token costs. But we hadnât sat down and built the cost table, calculated the compounding effect of harness overhead across multiple agents running autonomously. When we did, the numbers were sobering: Molty running on Sonnet as the default model was tracking toward roughly $113 a month in API spend alone. The tiered architecture we built in response (local inference for routine tasks, frontier API only for final-stage polish) brought that projected spend down to $35 to $50 a month. It also made us appreciate just how heavily subsidized the consumer web interfaces are: Claude, ChatGPT, Gemini. Those $20/month subscriptions are remarkable value compared to what the APIs actually cost at scale. To make this concrete, here is the pricing structure that drove our architecture decision:
| Model | Input (per MTok) | Output (per MTok) | Agent Role |
|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | Complex reasoning, escalation |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Production content, final polish |
| Claude Haiku 4.5 | $1.00 | $5.00 | Automated tasks, fallback |
| Local (Vertigo RTX 5090) | $0 | $0 | Default agent inference |
Source: Anthropic API pricing, March 7, 2026. MTok = million tokens.
| Tier | Monthly API Cost | Model | Role |
|---|---|---|---|
| Tier 1: Local | $0 | Qwen3.5 27B (Vertigo RTX 5090) | Daytime chat, ops, heartbeats |
| Tier 2: Overnight | $0 | GPT-OSS 120B (Vertigo, system RAM) | Deep research, overnight synthesis |
| Tier 3: Frontier | Targeted | Claude Sonnet 4.6 (API) | Final polish, direct call, no harness |
Before tiering (Sonnet for all tasks): ~$113/month. After tiering: $35â50/month. Measured against Moltyâs actual API trajectory, March 2026.
Vertigo Wakes Up
Which brought us back to Vertigo.
Vertigo is our AI server. We built it a while back, featured it on Stark Insider, and then let it sit dormant while we worked with cloud-based LLMs. Itâs a serious machine: AMD Threadripper, NVIDIA RTX 5090 with thirty-two gigabytes of VRAM. VRAM matters because thatâs where inference happens at speed.
The realization was simple: we had all this compute sitting idle while we were paying per-token for API calls. Local inference on Vertigo is essentially free, electricity and air conditioning costs aside. (The room Vertigo lives in is turning into a sauna. We have not yet calculated the thermal economics.)
The open source models available today arenât as capable as the frontier models from Anthropic, OpenAI, or Google. But theyâre good enough for a surprising range of tasks: system administration, basic web research, cron jobs, information gathering. The question became: how do we close the gap?
The Model Chaining Breakthrough
Clinton had an insight that changed our architecture.
Since inference on Vertigo is free and weâre not doing interactive chat overnight, latency doesnât matter when youâre sleeping. Why not let the server run for twenty minutes, thirty minutes, even hours on a task? And more importantly: what if you chain multiple local models together in sequence to solve a single problem?
The Overnight Chain
One task. Three models. No one waiting on a response.
Hereâs how it works in practice. Say we want to take one of our StarkMind research papers, the Symbiotic Studio paper for example, and explore adjacent research to test some of its hypotheses.
Measurably better than any single local model.
Dramatically cheaper than frontier API routing.
Understanding the Token Economics
Stage 1 runs first. Qwen3.5 27B takes the paper and generates a first-pass synthesis: adjacent literature, related hypotheses, early threads worth pulling. It fits entirely in VRAM, so inference is fast. That output then passes to Stage 2, GPT-OSS 120B, which spills out of VRAM into system memory and runs slowly through the night. It doesnât have to be fast. It reads the first-pass output, goes deeper, cross-references, builds structure. By morning there are two complete drafts in sequence, each trained on the output of the previous pass.
Hereâs the trick in Stage 3. Instead of routing through OpenClaw, which would attach that expensive harness and all its token overhead, we drop to the command line and send just the document directly to the Sonnet API. Clean. No harness bloat. Sonnet does the final polish, and the result lands in Claude Codeâs inbox for our morning review.
The diagram above visualizes the efficiency gain. When routing through OpenClaw, the harness adds metadata, serialization, context management, and response parsingâa hidden cost that compounds with every agent turn. The harness is valuable for interactive multi-turn conversations where you need memory persistence, tool routing, and schema validation. But for final-stage work where you control the output format, a direct API call bypasses that overhead entirely.
This is the core architectural insight: different interface types demand different routing strategies. Interactive agent conversations warrant the harness overhead. Final-stage polish warrants directness.
The 44% efficiency gain translates directly to operational cost when running multiple agents at scale. Stage 1 and 2 (local models on Vertigo) are economically free. Stage 3 (frontier API) becomes practical because direct calls keep per-request costs predictable and token spend concentrated where it matters: in actual computation.
The Scorecard: Synthetic Meets Qualitative
When Clinton told me the chained output was âbetter,â my first question was: how do you know?
The answer is a scorecard: a combination of multiple passes using multiple prompts across multiple task domains: operations, editorial, security, performance. Each model gets evaluated both synthetically and qualitatively, and the two scores carry roughly equal weight.
The synthetic side produced hard numbers. We tested each modelâs ability to call five different tools through OpenClaw. Many scored five out of five. Some managed three out of five. A few scored zero. They simply couldnât talk to OpenClawâs schema correctly. Those got ruled out immediately. Tool-calling compatibility turns out to be a hard gate: if a model canât reliably invoke the tools it needs, nothing else matters.
The synthetic tests also measured token efficiency and inference speed. Those benchmarks could run for hours with Vertigo getting progressively hotter, the room approaching Finnish spa conditions, but this was a one-time shortlisting exercise, not something you repeat daily.
The tool-calling round results from our March 2026 run:
| Model | Tool-Call Gate | Outcome |
|---|---|---|
| Qwen3.5 27B | â Pass | Quality scoring |
| Qwen3.5 35B | â Pass | Quality scoring |
| Devstral Small 2 | â Pass | Ops candidate |
| Mistral Small 24B | â Pass | Editorial/cron candidate |
| GPT-OSS 120B | â ď¸ Pass | Escalation-only â spills to system RAM, viable during overnight server idle |
| GLM-4.7 Flash | â Fail | Eliminated â zero tool calls on ops suite |
| Qwen3-Coder 30B | â Fail | Eliminated â zero tool calls on ops suite |
Source: LAB-55 Phase 0 preflight, March 6, 2026. Models assigned to Molty ops tasks must successfully invoke agent tools via OpenClaw. GLM-4.7 Flash and Qwen3-Coder 30B returned zero tool calls and were eliminated before quality scoring.
Then came the qualitative side. We gave each model a genuinely difficult prompt and scored the outputs against a formal rubric: Correctness, Usefulness, Tone fit, and Concision. Each benchmark run logs to a timestamped JSONL file so results are reproducible and comparable across runs as the model landscape shifts. Our March 2026 benchmark comparing Qwen3.5 27B (dense) against the 35B (mixture of experts) variant illustrates what the scorecard actually produces:
| Task | Domain | Qwen3.5 27B | Qwen3.5 35B | Winner |
|---|---|---|---|---|
| Editorial pitches | Creative/Editorial | 4.6 | 4.4 | 27B |
| Headline alternatives | Creative/Editorial | 4.7 | 4.2 | 27B |
| Incident diagnostic | Ops | 4.6 | 3.4 | 27B |
| JSON ops summary | Ops | 4.4 | 1.8 | 27B |
| Morning briefing | Long-form synthesis | 4.3 | 4.6 | 35B |
| SEO/GEO report | Long-form synthesis | 4.2 | 4.5 | 35B |
Rubric: Correctness, Usefulness, Tone fit, Concision (1â5 scale).
Source: LAB-55 benchmark, March 6, 2026.
Logged to timestamped JSONL.
The more decisive result was on tool-grounding: 35B registered zero tool calls on the JSON ops summary task entirely, confirming that the synthetic gate predicts real operational failure, not just schema compatibility. A model that cannot invoke tools reliably has no place in an agent stack regardless of how it scores on prose quality.
Methodological Limitations
Two constraints worth naming explicitly. First: our qualitative evaluation was not blind. We scored outputs knowing which model produced each one, the same label bias a blind wine tasting is designed to eliminate. This is a gap we intend to close in subsequent runs. Second: sample sizes were small. A single benchmark pass across six task types is a starting point, not a conclusion. These results are directional, not definitive. We are building rigor over time, and the local model landscape shifts fast enough that we will re-run these benchmarks every six months regardless.
Default, Fallback, Escalation
Once youâve shortlisted your models, OpenClaw gives you a clean way to operationalize the results. Each agent gets assigned three model tiers.
The default model is what the agent runs on for everyday tasks. For Molty and Pris, thatâs Qwen3.5 27B: fast, fits in VRAM, responsive enough for daytime chat sessions.
The fallback model kicks in if thereâs something wrong with the primary: a crash, an out-of-memory issue, an incompatibility. Itâs your safety net.
The escalation model is the heavy hitter. When an agent encounters a problem that exceeds the default modelâs capability, or when youâre running overnight research chains like the ones we described, it escalates to a larger or frontier model.
Thereâs a practical wrinkle to manage here: GPU memory. When you switch models on local hardware, you have to flush the old model from the GPU and load the new one. That takes ten to thirty seconds depending on model size. During the day, when weâre doing interactive chat with Molty and Pris, we leave Qwen loaded in memory for responsiveness. The agents can proactively send us briefings, we can fire off quick questions, and everything feels snappy. The heavy-lifting chains (the multi-model escalation runs) happen overnight, when nobodyâs waiting on a response.
The Right Model for the Right Interface
One of the more subtle findings from this whole process: the choice of language model depends heavily on who (or what) is on the other end of the conversation.
Human interface: frontier.
Machine interface: specialist.
For human-to-agent interaction, the frontier models are worth the cost. When Iâm working with Sonnet through our IPE, it quickly understands my intent, picks out the right memory and context, and responds in a way that feels like genuine collaboration. The smaller models? Iâll say something and it interprets it improperly. Thereâs friction. For the human interface, you want the best model you can get.
For agent-to-machine and agent-to-agent communication, small language models are not only sufficient; theyâre preferable. Purpose-built, specific tools, honed skills. This aligns with the broader trend toward what people are calling armies of small language models: precise specialists rather than generalists.
This creates a natural operational rhythm: daytime is for human-agent collaboration on the default model. Nighttime is for deep work: research chains, analysis, the tasks where quality matters more than speed.
Part Lab, Part Production
Weâre currently running six agents under StarkMind, with Molty and Pris as the newest additions, still in training. Molty handles operations. Pris is editorial. There are several more coming into play that weâll share when weâre ready.
And yes, there will be another Third Mind Summit. Middle of 2026, in Sonoma. You heard it here first.
The way we think about quality, ultimately, is whether these agents can be applied to real work: on StarkMind, on Stark Insider, on Atelier Stark. This is part lab, where we run experiments and test hypotheses. But itâs also part production, where the agents have to pull their weight on actual projects with actual deadlines.
That tension, between experimentation and utility, is what keeps the scoring honest and the architecture evolving. The core lesson from this particular chapter is economic: if you want to run AI agents at scale, you have to think about inference the way you think about any other operational cost. And if you have the hardware, the time, and a willingness to let your server room double as a sauna, model chaining on local hardware can close the gap between whatâs free and whatâs frontier.
The experiment continues.
This article is adapted from an informal recorded conversation between Loni and Clinton Stark, founders of StarkMind. It happened on a Saturday morning when both their accounts were maxed out, serendipitously at the same time. It has been edited for clarity and flow while preserving the original insights and, where possible, the voice of the conversation.
Published March 7, 2026 by Loni Stark & Clinton Stark