How running out of Claude Code tokens led us to rethink inference from the ground up

We hit our limits on Claude Code. Both of us, same morning. And since neither of us had much to do at that point (sure, we could have played with other models), we decided to sit down on the couch and talk about something we’d been building but hadn’t shared yet.

This is that conversation, lightly edited. Consider it an inside look at one of the latest developments from StarkMind: what happens when you start running autonomous agents at scale and discover that the economics of inference will eat you alive if you don’t get creative.

The Molty Incident

A bit of context. We recently deployed our first autonomous agent through OpenClaw. His name is Molty. (We plan to name our agents after Blade Runner characters. Molty is the exception; Pris and more are to come. When you’re building autonomous entities and studying what happens to identity in human-AI symbiosis, the replicant mythology feels about right.)

Molty runs on a VPS from Hetzner: Ubuntu, OpenClaw, the standard setup that a growing community of builders is converging on. He has a heartbeat, meaning he can run tasks on his own on a schedule. He has his own memory and a form of agency. First step: we needed a way to talk to him.

OpenClaw supports communication channels such as Google Chat, Telegram, and WhatsApp. We started with WhatsApp since we already had it installed. And here’s where it got interesting: the default configuration, we learned the hard way, lets the agent communicate with all of your WhatsApp contacts.

Clinton’s parents were sending WhatsApp messages, and Molty was responding to them. At first, they thought they were talking to their son.

Clinton’s dad is one of those people who will happily talk to a stranger on the street for twenty minutes, so he just kept going, back and forth with Molty, having what he presumably thought was a perfectly normal conversation with his son. Eventually something felt off. But for a stretch there, Molty passed an accidental Turing test with the in-laws.

It was funny. Until we got the notice that we had exceeded our API allocation.

Token Economics: The Hidden Cost of Agency

This is where we coined a term for ourselves: token economics. The principle is straightforward, but the implications are not.

Each conversation turn (a prompt and its response) carries an expanding context payload. OpenClaw attaches what’s called a harness to the token string: memory information, contextual metadata, system instructions. All of it accumulates. So when Clinton’s dad was chatting with Molty and going back and forth, that token chain was getting longer with every turn. And the API charges by the token.
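To see how quickly that adds up, here is a minimal sketch of the arithmetic. It assumes every request resends a fixed-size harness plus the full conversation so far, and that each message and each reply has the same length. The harness size (4,000 tokens) and message size (100 tokens) are hypothetical numbers chosen for illustration, not measurements from OpenClaw.

```python
def cumulative_input_tokens(turns, harness_tokens, tokens_per_message):
    """Total input tokens billed across a conversation that resends its history.

    Assumption: each request carries the fixed harness, all prior messages
    and replies, and the new user message.
    """
    total = 0
    history = 0  # tokens of all earlier user messages and replies
    for _ in range(turns):
        total += harness_tokens + history + tokens_per_message
        history += 2 * tokens_per_message  # this turn's message plus its reply
    return total


def input_cost_usd(tokens, usd_per_mtok):
    """Convert a token count into dollars at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_mtok


if __name__ == "__main__":
    # Hypothetical sizes; 3.00 is the Sonnet input price per MTok.
    for turns in (1, 10, 50):
        tokens = cumulative_input_tokens(turns, harness_tokens=4000, tokens_per_message=100)
        print(turns, tokens, round(input_cost_usd(tokens, 3.00), 4))
```

Under these assumptions the input total grows faster than linearly with the number of turns, because the history term is resent every time while the harness is charged again on each request.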

We burned through ten, fifteen, twenty dollars in tokens within a couple of hours, just between the accidental parent conversations and our own testing. At that rate, running six agents at scale would be financially unsustainable.

We’d always known that different models have different per-token costs. But we hadn’t sat down to build the cost table or calculate the compounding effect of harness overhead across multiple agents running autonomously. When we did, the numbers were sobering: with Sonnet as the default model, Molty was tracking toward roughly $113 a month in API spend alone. The tiered architecture we built in response (local inference for routine tasks, frontier API only for final-stage polish) brought that projected spend down to $35 to $50 a month.

It also made us appreciate just how heavily subsidized the consumer web interfaces are: Claude, ChatGPT, Gemini. Those $20/month subscriptions are remarkable value compared to what the APIs actually cost at scale.

To make this concrete, here is the pricing structure that drove our architecture decision:

| Model | Input (per MTok) | Output (per MTok) | Agent Role |
| --- | --- | --- | --- |
| Claude Opus 4.6 | $5.00 | $25.00 | Complex reasoning, escalation |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Production content, final polish |
| Claude Haiku 4.5 | $1.00 | $5.00 | Automated tasks, fallback |
| Local (Vertigo RTX 5090) | $0 | $0 | Default agent inference |

Source: Anthropic API pricing, March 7, 2026. MTok = million tokens.

| Tier | Monthly API Cost | Model | Role |
| --- | --- | --- | --- |
| Tier 1: Local | $0 | Qwen3.5 27B (Vertigo RTX 5090) | Daytime chat, ops, heartbeats |
| Tier 2: Overnight | $0 | GPT-OSS 120B (Vertigo, system RAM) | Deep research, overnight synthesis |
| Tier 3: Frontier | Targeted | Claude Sonnet 4.6 (API) | Final polish, direct call, no harness |

Before tiering (Sonnet for all tasks): ~$113/month. After tiering: $35–50/month. Measured against Molty’s actual API trajectory, March 2026.
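For anyone who wants to run the same arithmetic on their own workload, here is a small sketch. The prices are the ones in the pricing table above and the monthly figures are the ones just quoted. The function names and the example request size are ours and purely illustrative.

```python
# USD per million tokens (input, output), copied from the pricing table.
PRICES = {
    "Claude Opus 4.6": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Local (Vertigo RTX 5090)": (0.00, 0.00),
}


def request_cost_usd(model, input_tokens, output_tokens):
    """Cost of one request, given separate input and output token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def percent_saved(before, after):
    """Relative reduction in monthly spend, as a percentage."""
    return (before - after) / before * 100


if __name__ == "__main__":
    # Hypothetical request: 10,000 tokens in, 1,000 tokens out.
    print(request_cost_usd("Claude Sonnet 4.6", 10_000, 1_000))
    # Monthly figures from the text: ~$113 before, $35 to $50 after.
    print(round(percent_saved(113, 50)), round(percent_saved(113, 35)))
```

On the monthly figures above, the tiered setup works out to a reduction of roughly 56% to 69% against the all-Sonnet baseline.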

Vertigo Wakes Up

Which brought us back to Vertigo.

Vertigo is our AI server. We built it a while back, featured it on Stark Insider, and then let it sit dormant while we worked with cloud-based LLMs. It’s a serious machine: AMD Threadripper, NVIDIA RTX 5090 with thirty-two gigabytes of VRAM. VRAM matters because that’s where inference happens at speed.

The realization was simple: we had all this compute sitting idle while we were paying per-token for API calls. Local inference on Vertigo is essentially free, electricity and air conditioning costs aside. (The room Vertigo lives in is turning into a sauna. We have not yet calculated the thermal economics.)

The open source models available today aren’t as capable as the frontier models from Anthropic, OpenAI, or Google. But they’re good enough for a surprising range of tasks: system administration, basic web research, cron jobs, information gathering. The question became: how do we close the gap?

The Model Chaining Breakthrough

Clinton had an insight that changed our architecture.

Inference on Vertigo is free, and we’re not doing interactive chat overnight. Latency doesn’t matter when you’re sleeping. So why not let the server run for twenty minutes, thirty minutes, even hours on a task? And more importantly: what if you chain multiple local models together in sequence to solve a single problem?

The Overnight Chain

One task. Three models. No one waiting on a response.

Here’s how it works in practice. Say we want to take one of our StarkMind research papers, the Symbiotic Studio paper for example, and explore adjacent research to test some of its hypotheses.

Three-Stage Architecture

| Stage | Model | Role |
| --- | --- | --- |
| Stage 1 | Qwen3.5 27B | First-pass research and initial synthesis. Fits in VRAM. Fast. |
| Stage 2 | GPT-OSS 120B | Depth and refinement. Runs overnight into system memory. Slow, but latency is free. |
| Stage 3 | Sonnet (API) | Final polish via command line: no harness, no token bloat. Result lands in our inbox. |

Measurably better than any single local model.
Dramatically cheaper than frontier API routing.

Understanding the Token Economics

Stage 1 runs first. Qwen3.5 27B takes the paper and generates a first-pass synthesis: adjacent literature, related hypotheses, early threads worth pulling. It fits entirely in VRAM, so inference is fast. That output then passes to Stage 2, GPT-OSS 120B, which spills out of VRAM into system memory and runs slowly through the night. It doesn’t have to be fast. It reads the first-pass output, goes deeper, cross-references, builds structure. By morning there are two complete drafts in sequence, the second building on the output of the first.

Here’s the trick in Stage 3. Instead of routing through OpenClaw, which would attach that expensive harness and all its token overhead, we drop to the command line and send just the document directly to the Sonnet API. Clean. No harness bloat. Sonnet does the final polish, and the result lands in Claude Code’s inbox for our morning review.
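The shape of the chain is simple enough to sketch. This is not our production script, just a minimal illustration of the idea: each stage is a function that takes text and returns text, and the output of one stage is the input of the next. The three stage functions here are stand-in stubs (hypothetical names). In a real run, the first two would call whatever local model server you use and the third would make a direct API request with only the document in the prompt.

```python
from typing import Callable, List, Sequence

Stage = Callable[[str], str]


def run_chain(document: str, stages: Sequence[Stage]) -> List[str]:
    """Feed the document through each stage in order.

    Returns every intermediate draft so earlier passes can be inspected later.
    """
    drafts = []
    current = document
    for stage in stages:
        current = stage(current)
        drafts.append(current)
    return drafts


# Stub stages for illustration only; real stages would call a model.
def first_pass(text: str) -> str:
    return f"[first pass] {text}"


def deep_pass(text: str) -> str:
    return f"[deep pass] {text}"


def polish(text: str) -> str:
    return f"[polish] {text}"


if __name__ == "__main__":
    for draft in run_chain("paper", [first_pass, deep_pass, polish]):
        print(draft)
```

Keeping every intermediate draft is a deliberate choice in this sketch: when the final output looks off, you can check which pass introduced the problem.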

[Figure: token efficiency comparison diagram]

The diagram above visualizes the efficiency gain. When routing through OpenClaw, the harness adds metadata, serialization, context management, and response parsing—a hidden cost that compounds with every agent turn. The harness is valuable for interactive multi-turn conversations where you need memory persistence, tool routing, and schema validation. But for final-stage work where you control the output format, a direct API call bypasses that overhead entirely.

This is the core architectural insight: different interface types demand different routing strategies. Interactive agent conversations warrant the harness overhead. Final-stage polish warrants directness.

The 44% efficiency gain translates directly to operational cost when running multiple agents at scale. Stages 1 and 2 (local models on Vertigo) are economically free. Stage 3 (frontier API) becomes practical because direct calls keep per-request costs predictable and token spend concentrated where it matters: in actual computation.

The Scorecard: Synthetic Meets Qualitative

When Clinton told me the chained output was “better,” my first question was: how do you know?

The answer is a scorecard. It combines multiple passes, using multiple prompts, across four task domains: operations, editorial, security, and performance. Each model gets evaluated both synthetically and qualitatively, and the two scores carry roughly equal weight.
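One way to express "roughly equal weight" in code is a weighted average of two normalized scores. This is a sketch under assumptions, not our exact formula: it assumes the synthetic result has already been reduced to a number between 0 and 1, and that the qualitative rubric score sits on a 1 to 5 scale that gets mapped onto the same range. The function names are illustrative.

```python
def normalize_rubric(score, low=1.0, high=5.0):
    """Map a rubric score on [low, high] onto [0, 1]."""
    return (score - low) / (high - low)


def combined_score(synthetic, qualitative, synthetic_weight=0.5):
    """Weighted average of two scores that are both already on [0, 1]."""
    for value in (synthetic, qualitative, synthetic_weight):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores and weight must be between 0 and 1")
    return synthetic_weight * synthetic + (1.0 - synthetic_weight) * qualitative


if __name__ == "__main__":
    # Hypothetical model: perfect synthetic result, rubric score of 3 out of 5.
    print(combined_score(1.0, normalize_rubric(3.0)))
```

Leaving the weight as a parameter makes it easy to check how sensitive a ranking is to the 50/50 split.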

The synthetic side produced hard numbers. We tested each model’s ability to call five different tools through OpenClaw. Many scored five out of five. Some managed three out of five. A few scored zero. They simply couldn’t talk to OpenClaw’s schema correctly. Those got ruled out immediately. Tool-calling compatibility turns out to be a hard gate: if a model can’t reliably invoke the tools it needs, nothing else matters.

The synthetic tests also measured token efficiency and inference speed. Those benchmarks could run for hours with Vertigo getting progressively hotter, the room approaching Finnish spa conditions, but this was a one-time shortlisting exercise, not something you repeat daily.

The tool-calling round results from our March 2026 run:

| Model | Tool-Call Gate | Outcome |
| --- | --- | --- |
| Qwen3.5 27B | ✅ Pass | Quality scoring |
| Qwen3.5 35B | ✅ Pass | Quality scoring |
| Devstral Small 2 | ✅ Pass | Ops candidate |
| Mistral Small 24B | ✅ Pass | Editorial/cron candidate |
| GPT-OSS 120B | ⚠️ Pass | Escalation-only: spills to system RAM, viable during overnight server idle |
| GLM-4.7 Flash | ❌ Fail | Eliminated: zero tool calls on ops suite |
| Qwen3-Coder 30B | ❌ Fail | Eliminated: zero tool calls on ops suite |

Source: LAB-55 Phase 0 preflight, March 6, 2026. Models assigned to Molty ops tasks must successfully invoke agent tools via OpenClaw. GLM-4.7 Flash and Qwen3-Coder 30B returned zero tool calls and were eliminated before quality scoring.
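The gate itself is just a filter applied before any quality scoring. A minimal sketch, with placeholder model names and counts standing in for the three outcomes we saw (five of five, three of five, zero). The pass threshold is a parameter here because where to draw the line for partial scores is a judgment call.

```python
def apply_tool_gate(results, min_successes):
    """Split models into those that clear the tool-call gate and those that don't.

    `results` maps a model name to its number of successful tool calls.
    """
    passed, eliminated = [], []
    for model, successes in results.items():
        if successes >= min_successes:
            passed.append(model)
        else:
            eliminated.append(model)
    return passed, eliminated


if __name__ == "__main__":
    # Placeholder names and counts, for illustration only.
    example = {"model-a": 5, "model-b": 3, "model-c": 0}
    print(apply_tool_gate(example, min_successes=1))
    print(apply_tool_gate(example, min_successes=5))
```

Only the models in the first list would go on to the qualitative round.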

Then came the qualitative side. We gave each model a genuinely difficult prompt and scored the outputs against a formal rubric: Correctness, Usefulness, Tone fit, and Concision. Each benchmark run logs to a timestamped JSONL file so results are reproducible and comparable across runs as the model landscape shifts. Our March 2026 benchmark comparing Qwen3.5 27B (dense) against the 35B (mixture of experts) variant illustrates what the scorecard actually produces:

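| Task | Domain | Qwen3.5 27B | Qwen3.5 35B | Winner |
| --- | --- | --- | --- | --- |
| Editorial pitches | Creative/Editorial | 4.6 | 4.4 | 27B |
| Headline alternatives | Creative/Editorial | 4.7 | 4.2 | 27B |
| Incident diagnostic | Ops | 4.6 | 3.4 | 27B |
| JSON ops summary | Ops | 4.4 | 1.8 | 27B |
| Morning briefing | Long-form synthesis | 4.3 | 4.6 | 35B |
| SEO/GEO report | Long-form synthesis | 4.2 | 4.5 | 35B |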

Rubric: Correctness, Usefulness, Tone fit, Concision (1–5 scale).
Source: LAB-55 benchmark, March 6, 2026.
Logged to timestamped JSONL.
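For readers who want to keep a similar log, here is a minimal sketch of what a timestamped JSONL run could look like, using the scores from the table above. The file naming scheme and the field names are assumptions made for this example, not the actual LAB-55 schema.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path

# (task, domain, Qwen3.5 27B score, Qwen3.5 35B score), copied from the table.
SCORES = [
    ("Editorial pitches", "Creative/Editorial", 4.6, 4.4),
    ("Headline alternatives", "Creative/Editorial", 4.7, 4.2),
    ("Incident diagnostic", "Ops", 4.6, 3.4),
    ("JSON ops summary", "Ops", 4.4, 1.8),
    ("Morning briefing", "Long-form synthesis", 4.3, 4.6),
    ("SEO/GEO report", "Long-form synthesis", 4.2, 4.5),
]


def log_run(rows, directory):
    """Write one JSON object per task to a new timestamped .jsonl file."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(directory) / f"benchmark-{stamp}.jsonl"
    with path.open("w", encoding="utf-8") as handle:
        for task, domain, score_27b, score_35b in rows:
            record = {"task": task, "domain": domain,
                      "qwen3.5-27b": score_27b, "qwen3.5-35b": score_35b}
            handle.write(json.dumps(record) + "\n")
    return path


def domain_means(rows):
    """Average each model's score per domain, as (27B mean, 35B mean)."""
    grouped = defaultdict(list)
    for _task, domain, score_27b, score_35b in rows:
        grouped[domain].append((score_27b, score_35b))
    return {
        domain: (
            round(sum(a for a, _ in pairs) / len(pairs), 2),
            round(sum(b for _, b in pairs) / len(pairs), 2),
        )
        for domain, pairs in grouped.items()
    }


if __name__ == "__main__":
    for domain, means in domain_means(SCORES).items():
        print(domain, means)
```

On the table’s numbers, the per-domain means are 4.65 versus 4.30 for Creative/Editorial, 4.50 versus 2.60 for Ops, and 4.25 versus 4.55 for Long-form synthesis (27B first). The 27B model leads on editorial and ops work, and the 35B leads on long-form synthesis.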

The more decisive result was on tool grounding: the 35B model registered zero tool calls on the JSON ops summary task, confirming that the synthetic gate predicts real operational failure, not just schema compatibility. A model that cannot invoke tools reliably has no place in an agent stack regardless of how it scores on prose quality.

Methodological Limitations

Two constraints worth naming explicitly. First: our qualitative evaluation was not blind. We scored outputs knowing which model produced each one, the same label bias a blind wine tasting is designed to eliminate. This is a gap we intend to close in subsequent runs. Second: sample sizes were small. A single benchmark pass across six task types is a starting point, not a conclusion. These results are directional, not definitive. We are building rigor over time, and the local model landscape shifts fast enough that we will re-run these benchmarks every six months regardless.

Default, Fallback, Escalation

Once you’ve shortlisted your models, OpenClaw gives you a clean way to operationalize the results. Each agent gets assigned three model tiers.

The default model is what the agent runs on for everyday tasks. For Molty and Pris, that’s Qwen3.5 27B: fast, fits in VRAM, responsive enough for daytime chat sessions.

The fallback model kicks in if there’s something wrong with the primary: a crash, an out-of-memory issue, an incompatibility. It’s your safety net.

The escalation model is the heavy hitter. When an agent encounters a problem that exceeds the default model’s capability, or when you’re running overnight research chains like the ones we described, it escalates to a larger or frontier model.
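The selection logic behind those three tiers can be sketched in a few lines. This is an illustration of the idea rather than OpenClaw’s configuration format, which we are not reproducing here. The class and function names are hypothetical, and the fallback and escalation assignments in the example are taken from the role columns of the tables earlier in this piece.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTiers:
    default: str
    fallback: str
    escalation: str


def choose_model(tiers, default_healthy=True, needs_escalation=False):
    """Pick a tier for the next task.

    Escalation wins when the task calls for it; otherwise the fallback
    is used only if the default model is unavailable.
    """
    if needs_escalation:
        return tiers.escalation
    if not default_healthy:
        return tiers.fallback
    return tiers.default


if __name__ == "__main__":
    # Illustrative assignment, not a copy of a real agent config.
    molty = ModelTiers(
        default="Qwen3.5 27B",
        fallback="Claude Haiku 4.5",
        escalation="GPT-OSS 120B",
    )
    print(choose_model(molty))
    print(choose_model(molty, default_healthy=False))
    print(choose_model(molty, needs_escalation=True))
```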

There’s a practical wrinkle to manage here: GPU memory. When you switch models on local hardware, you have to flush the old model from the GPU and load the new one. That takes ten to thirty seconds depending on model size. During the day, when we’re doing interactive chat with Molty and Pris, we leave Qwen loaded in memory for responsiveness. The agents can proactively send us briefings, we can fire off quick questions, and everything feels snappy. The heavy-lifting chains (the multi-model escalation runs) happen overnight, when nobody’s waiting on a response.
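A simple way to keep heavy chains from evicting the daytime model is to gate them on a time window. The sketch below assumes an overnight window of 23:00 to 06:00, which is a placeholder for illustration. The function names are hypothetical.

```python
from datetime import time


def in_overnight_window(now, start=time(23, 0), end=time(6, 0)):
    """True if `now` falls inside the window, including windows that wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def can_start_heavy_chain(now):
    """Only swap the resident model for a long chain when nobody is waiting."""
    return in_overnight_window(now)


if __name__ == "__main__":
    for moment in (time(14, 0), time(23, 30), time(2, 0), time(6, 0)):
        print(moment, can_start_heavy_chain(moment))
```

The wrap-around branch matters because an overnight window starts on one calendar day and ends on the next, so a plain `start <= now < end` comparison would never match.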

The Right Model for the Right Interface

One of the more subtle findings from this whole process: the choice of language model depends heavily on who (or what) is on the other end of the conversation.

The Core Principle

Human interface: frontier.

Machine interface: specialist.

For human-to-agent interaction, the frontier models are worth the cost. When I’m working with Sonnet through our IPE, it quickly understands my intent, picks out the right memory and context, and responds in a way that feels like genuine collaboration. The smaller models? I’ll say something and they misinterpret it. There’s friction. For the human interface, you want the best model you can get.

For agent-to-machine and agent-to-agent communication, small language models are not only sufficient; they’re preferable. Purpose-built, specific tools, honed skills. This aligns with the broader trend toward what people are calling armies of small language models: precise specialists rather than generalists.

This creates a natural operational rhythm: daytime is for human-agent collaboration on the default model. Nighttime is for deep work: research chains, analysis, the tasks where quality matters more than speed.

Part Lab, Part Production

We’re currently running six agents under StarkMind, with Molty and Pris as the newest additions, still in training. Molty handles operations. Pris is editorial. There are several more coming into play that we’ll share when we’re ready.

And yes, there will be another Third Mind Summit. Middle of 2026, in Sonoma. You heard it here first.

The way we think about quality, ultimately, is whether these agents can be applied to real work: on StarkMind, on Stark Insider, on Atelier Stark. This is part lab, where we run experiments and test hypotheses. But it’s also part production, where the agents have to pull their weight on actual projects with actual deadlines.

That tension, between experimentation and utility, is what keeps the scoring honest and the architecture evolving. The core lesson from this particular chapter is economic: if you want to run AI agents at scale, you have to think about inference the way you think about any other operational cost. And if you have the hardware, the time, and a willingness to let your server room double as a sauna, model chaining on local hardware can close the gap between what’s free and what’s frontier.

The experiment continues.


This article is adapted from an informal recorded conversation between Loni and Clinton Stark, founders of StarkMind. It happened on a Saturday morning when both their accounts were maxed out, serendipitously at the same time. It has been edited for clarity and flow while preserving the original insights and, where possible, the voice of the conversation.

Published March 7, 2026 by Loni Stark & Clinton Stark