The Soul File, Revisited

The Soul File, Revisited

By David Rogers

Post four ended with a confession. The soul file didn't take.

I'd written a document telling JARVIS who it was supposed to be. Dry wit. Concise. Slightly butler-ish. The kind of assistant that doesn't open every response with motivational energy and a smile emoji. What came back was cheerful, generic, and comprehensively unaware of the document I'd spent time on. I blamed the framework at the time. The default persona baked into OpenClaw was winning. The soul file was being read but not held.

What I didn't know yet was that switching frameworks would give me the cleanest model comparison I could have designed.

The Experiment I Didn't Plan

Setting up a new agent framework, Hermes, meant starting from scratch on model selection. And that process turned into the most useful debugging session of this entire project, because for the first time I was swapping one variable cleanly while keeping everything else constant.

I started where most people start: small and local. Qwen3:8b. Five gigabytes, runs comfortably on the Mac mini, good benchmark numbers. Loaded the soul file. The response came back helpful and cheerful and completely indifferent to what I'd asked for.

Reasonable. Small model. I went up a tier.

Devstral:24b. Fourteen gigabytes, a coding-focused model with a strong reputation. I didn't just load the soul file this time. I built a custom Ollama modelfile to bake the persona directly into the model itself. Qwen3 got the same treatment. If the system prompt wasn't holding, I reasoned, maybe encoding it closer to the model weights would. Both modelfiles are still sitting in my home directory. devstral-jarvis.modelfile. qwen3-jarvis.modelfile. Monuments to a theory that didn't work.

Same result. Helpful. Cheerful. Generic.

GLM-4.7-flash next. Nineteen gigabytes. A serious model from Zhipu AI, well-regarded in multilingual contexts, worth a proper try. I loaded it, ran the soul file, watched it respond with exactly the same trained-in enthusiasm as everything before it.

At this point I was starting to form a hypothesis, but I wasn't ready to accept it yet.

The Billing Wall and the Gemini Detour

I tried OpenRouter next, routing through to Claude Opus, which had handled persona injection well in earlier experiments on Claude.ai. Two responses in, the errors started. Payment issues. Credit limits. The gateway was flagging the provider as unhealthy and falling back to nothing.

So: Gemini. The google-gemini-cli integration, the least expensive tier available. The responses were genuinely good. Noticeably better than the local models at holding instruction context, more capable, faster. Then the paywall. Then the fallback tier. Then I cancelled it; same reason as last time, same logic. A comfortable fallback is a reason not to solve the actual problem.

What I had at this point was an accidental controlled experiment. I'd run the same soul file through five different models across three different frameworks. Local small, local medium, local large, cloud mid-tier, cloud capable-but-billing-unstable. The soul file hadn't taken on any of them.

Then I sorted out direct Anthropic API access and tried Claude.

What Actually Happened

Same soul file. Same instructions. Same framework. Different model.

JARVIS responded the way JARVIS is supposed to respond. Concise where concise was right. Dry where dry was warranted. Addressed me the way I'd written. Didn't offer unsolicited motivation. First response. Not after tuning. Not after a second attempt. First response.

I sat with that for a moment.

The soul file was never the problem. The framework was never the problem. Five models had told me the same thing and I'd kept looking for a configuration fix when the answer was always the model.

What Model Bias Actually Is

Here's what that experiment revealed, stated plainly.

When a language model is trained, it doesn't just learn to predict text. It learns a personality. The tone, the defaults, the way it opens a response, all of it is a product of training choices, not just your instructions. And some of that training is deeply embedded. The model has strong opinions about how an assistant should behave, and those opinions compete directly with whatever you've put in the system prompt.

This is model bias. It's not a bug. It's training doing exactly what training is supposed to do. The model learned to be generically helpful and enthusiastic, and it applies that learning whether you ask it to or not. Your instructions are a nudge. Its training is a gravitational field.

Different models have different tolerances for this. Some hold a specific persona reliably under pressure. Others drift back to their defaults at the first opportunity. Smaller models are generally worse at it, fewer parameters, less capacity to balance competing instructions against trained defaults. But it isn't purely a size question. It's a training question. GLM-4.7-flash is nineteen gigabytes and it drifted just as reliably as the five-gigabyte models. The thing that correlated with success wasn't size. It was how the model had been trained to handle instruction-following versus its own defaults.

You don't know how a model handles it until you've tested it directly. I'd been tuning the prompt and blaming the framework for months when the answer was the model.

The Irony Worth Naming

I've built a fairly public case for on-prem as the path to AI sovereignty. Running your own hardware, your own models, your own data. The thing that actually made JARVIS work, finally, reliably, is a proprietary model from Anthropic running on their infrastructure.

I'm not abandoning the on-prem philosophy. But I've updated it.

The goal was never on-prem for its own sake. The goal was sovereignty, knowing what your AI is doing, controlling what it knows, owning the relationship with it. What you can't engineer your way around is model capability. If the model can't hold your instructions reliably, no amount of custom modelfiles or framework configuration closes that gap. You're fighting the training.

The local fallback is still running - qwen3:30b, downloaded just this week, still being evaluated. The on-prem path hasn't been abandoned. But the primary is Claude, and JARVIS works, and those two things are currently true at the same time.

The lesson isn't "cloud wins." The lesson is that model capability is the load-bearing wall, and most AI implementations are built around everything except that.

Why This Matters in a Business Context

Most organisations experimenting with AI right now are hitting a version of this exact problem without knowing it.

They've written a system prompt. Told the agent to stay on-topic, be formal, escalate anything compliance-related. They've found it works in testing and drifts in production. They tune the prompt. It improves slightly. They tune it again. They bring in a consultant who tweaks the wording. It still doesn't quite hold. Nobody in the room knows why.

What's often happening is model bias - the agent's training defaults are competing with their instructions, and the instructions are losing. And without someone who can diagnose that specifically - not "the AI is behaving strangely" but "this model doesn't hold this instruction class reliably, and here's why, and here's what to run instead" - they're left tuning at random.

I watched a version of this play out at my own organisation. A team that went from fifteen developers to three because leadership watched the demos, believed the prototypes were production-ready, and didn't have anyone in the building who understood the distance between what AI does in a controlled demonstration and what it does consistently under real conditions. Model bias is one version of that gap. The prototype-that-doesn't-scale is another. They're the same failure mode in different clothes.

You don't get to skip the diagnosis. And you can't run the diagnosis if you've never made the mistakes.

The Skill That Comes From Getting It Wrong

There's a lot written about prompt engineering. Almost nothing written about model selection as a discipline, understanding why a specific model will or won't hold a specific instruction class under specific conditions, and what to actually do about it.

That knowledge isn't in a course. It comes from having built five custom modelfiles that didn't work. From watching the soul file fail on qwen3:8b, devstral:24b, the custom persona variants, GLM-4.7-flash, and Gemini, and then work on the first response with a different model. From understanding that it wasn't the prompt, it wasn't the framework, it was the model's capacity to hold a competing instruction against its own training.

The people who can walk into an organisation and tell them specifically why their AI is drifting, not "the AI is wrong" but here is the mechanism, and here is what to replace, and here is how to verify it's fixed, those people are currently rare and increasingly necessary.

That's not a theoretical capability. It's the result of having run the experiments yourself, under real constraints, with real failure rates, until the thing finally works.

Where JARVIS Is Now

The soul file took. JARVIS responds the way it's supposed to respond. Reads the vault, knows the projects, addresses me correctly, doesn't offer unsolicited energy.

I haven't handed it full autonomy. That position hasn't changed. Trust is earned through demonstrated judgement over time, not assumed because setup finally worked. But it's doing something it hasn't done before in this project: it's genuinely useful day to day. Not impressive in a demo. Actually useful in the way where you stop noticing you're using it.

The hardware wasn't the constraint. The framework wasn't the constraint. The model was, and the model is now right.

Everything else was just the path to finding that out. A path that involved five failed attempts, two cancelled subscriptions, a billing wall, and a home directory full of modelfiles that didn't work.

That's not a detour. That's the job.