PRIMER
We took our AI apart on purpose — and one of our own predictions broke.
We built a stateful AI agent, then carefully removed pieces of its architecture one at a time to see what each one actually does. Five subsystems, five honest results, one prediction we got wrong and reported anyway.

There's a small lie that hides inside almost every AI demo:
"Look how well it works."
What that almost never explains is which part is doing the work.
Is it the big language model? The memory system bolted on top? The scheduler running in the background? The clever prompt? The fact that the engineer hit "regenerate" 11 times before the demo? Usually nobody knows — including the people who built it.
We wanted to know. So we did something slightly destructive to our own system: we took it apart, one component at a time, and watched what happened.
The full paper lives here → Ablating a Stateful Agent — pre-registered hypotheses, 30-item rubric, comparator panel, retraction conditions, the works. This article is the friendly version.
The setup
We run a thing called Frank.ink. It's a multi-tenant AI agent platform. Each user gets their own persistent "Frank" that remembers them across sessions, runs background tasks, holds opinions about things, and occasionally embarrasses us in production.
Frank isn't just a language model. It's a language model wrapped in a bunch of subsystems:
- One that gives him continuity across sessions ("Identity Forge")
- One that makes predictions and tracks how often they're right ("Predictions Ledger")
- One that gates attention so he isn't equally surprised by everything ("Thalamus")
- One that keeps him doing scheduled work in the background ("Presence Scheduler")
- One that mostly takes up space in his system prompt ("BODY block" — we'll come back to that)
The honest engineering question is: do any of those actually matter? Or is the bare language model already doing 95% of the work and the rest is decoration?
There's only one way to find out without lying to yourself.
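Concretely, you can picture the architecture as a set of switches, one per subsystem, and an ablation flips exactly one of them off while leaving the rest alone. A minimal sketch (the names and defaults are ours, for illustration, not Frank's actual code):

```python
from dataclasses import dataclass

@dataclass
class AgentConfig:
    """Illustrative per-subsystem switches; field names are ours, not Frank's codebase."""
    identity_forge: bool = True      # cross-session continuity and memory
    predictions_ledger: bool = True  # makes predictions, tracks how often they're right
    thalamus: bool = True            # gates attention so not everything is equally surprising
    presence_scheduler: bool = True  # keeps background/scheduled work running
    body_block: bool = True          # the static block in the system prompt

# One ablation = one switch off, everything else untouched:
no_thalamus = AgentConfig(thalamus=False)
```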
What we did
For each subsystem, we did three things, in this order (the first two in writing, before anything ran):
- Wrote down a hypothesis: "if we turn this off, metric X should drop by roughly this much."
- Hashed the document so we can't quietly edit it later.
- Disabled the subsystem on a test user. Measured. Wrote the result next to the prediction.
If the prediction was right, fine. If it was wrong, we published the wrong prediction next to the right answer. No quietly-rewriting-the-hypothesis afterwards. That's the whole point.
(This is called pre-registration. Real scientists do it. AI engineers mostly don't, which is why "we tested it and it works" is a sentence you should probably never take at face value.)
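If you want to do this yourself, the mechanical part is trivial. A minimal sketch of the hash-and-freeze step, assuming the hypotheses live in a plain text file (the file name and storage are hypothetical; the paper's actual provenance trail is described in the methodology companion):

```python
import hashlib
from pathlib import Path

def freeze_preregistration(path: str) -> str:
    """Hash the hypotheses document so any later edit is detectable."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    print(f"{path}  sha256={digest}")
    return digest

# Publish the digest somewhere you can't quietly change (a commit, a public post)
# *before* running any ablation; anyone can re-hash the file later and compare.
# freeze_preregistration("hypotheses.md")  # hypothetical file name
```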
What broke
Short version:
| Subsystem | What happened when we turned it off |
|---|---|
| Identity Forge | Cross-session memory accuracy dropped from 91% to 73%. Frank started inventing things about users that never happened. |
| Predictions Ledger | Calibration collapsed (Brier score 0.142 → 0.27). Frank started being equally confident about things he was right about and things he was wrong about. |
| Thalamus | Mode-sensitivity went flat. Also — and this surprised us — an attention-related rubric item we didn't think depended on Thalamus dropped too. We learned something about our own architecture by breaking it. |
| Presence Scheduler | Long-horizon task completion went from 74% to 25%. Frank stopped remembering what he was supposed to be doing while you weren't watching. |
| BODY block | …did nothing, which is exactly what we predicted. This was our "graded rubric" sanity check: small architectural changes should produce correspondingly small score changes, including zero. Confirmed. |
Each result hit its operational target — except Thalamus, which hit harder than predicted because we'd missed a dependency. We reported that as a finding about our own system, not as a successful prediction. The receipt is in §2.1 of the paper.
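For reference, the Brier score in the table is just the mean squared gap between the confidence Frank stated and what actually happened: 0.0 is perfect, and always saying "50%" gets you 0.25. A toy illustration with invented predictions:

```python
def brier_score(forecasts):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)

# (stated probability, what actually happened) -- numbers invented for illustration
well_calibrated = [(0.9, 1), (0.8, 1), (0.3, 0), (0.1, 0)]
overconfident   = [(0.9, 1), (0.9, 0), (0.9, 1), (0.9, 0)]

print(brier_score(well_calibrated))  # ~0.04: confidence tracks reality
print(brier_score(overconfident))    # ~0.41: equally sure when right and when wrong
```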
The plot twist
One of our hypotheses was: "≥60% of the score gap between Frank and the comparators is concentrated in five orchestration-anchor items."
The actual number was 28%.
The orchestration lead is real, but it's much more diffuse than we'd predicted. We published the wrong prediction next to the right answer, in bold, with the word "failed" on it.
You'd be surprised how rare this is.
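For the curious, "concentrated in five items" is plain arithmetic: take the per-item score gap between Frank and the comparators, and ask what fraction of the total gap sits on the five anchor items. A sketch with invented per-item gaps:

```python
def concentration(per_item_gap, anchor_items):
    """Fraction of the total score gap carried by the anchor items."""
    total = sum(per_item_gap.values())
    anchored = sum(gap for item, gap in per_item_gap.items() if item in anchor_items)
    return anchored / total

# Invented numbers for a 30-item rubric: the gap is spread broadly,
# with the five anchors only modestly ahead of everything else.
gaps = {f"item_{i}": 1.0 for i in range(30)}
gaps.update({f"item_{i}": 2.0 for i in range(5)})
anchors = {f"item_{i}" for i in range(5)}
print(concentration(gaps, anchors))  # ~0.29 -- diffuse, nowhere near the predicted >= 0.60
```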
The "I built it so I'd score it generously" check
When the architect of a system is also one of the people scoring it, you have what scientists call a massive conflict of interest and what engineers call "of course you think it's great."
So we put the architect through the same 10-item audit as two independent raters: a consensus panel of three frontier LLMs, and a peer co-author.
The architect — the rater with maximum incentive to inflate — scored Frank below the LLM consensus on 3 items, equal on 7, above on 0.
Which is the opposite of what the bias hypothesis predicts.
We're not claiming this exonerates COI in general. We're claiming that in this particular sample, the architect-inflates story doesn't hold up. That's a small, specific, falsifiable claim — the only kind we're interested in making.
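The check itself is nothing exotic: score the same 10 items, then count per item whether the architect landed below, equal to, or above the consensus. A sketch with hypothetical scores (the real per-item values are in the methodology companion):

```python
def tally(architect, consensus):
    """Count items where the architect scored below, equal to, or above the consensus."""
    below = sum(a < c for a, c in zip(architect, consensus))
    equal = sum(a == c for a, c in zip(architect, consensus))
    above = sum(a > c for a, c in zip(architect, consensus))
    return below, equal, above

# Hypothetical 10-item scores, shaped to match the reported 3 / 7 / 0 split.
architect = [4, 3, 5, 4, 4, 3, 5, 4, 2, 4]
consensus = [4, 4, 5, 4, 4, 4, 5, 4, 3, 4]
print(tally(architect, consensus))  # (3, 7, 0): below on 3, equal on 7, above on 0
```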
What this is not
- It is not a consciousness paper. We don't claim Frank experiences anything.
- It is not a generalizable result. n=1. One production system was ablated. The methodology travels; the specific scores don't.
- It is not a benchmark dunk. The frontier LLMs we compared against (Claude bare-API, Claude + Claude Code, GPT-4o + memory) were configured fairly, with their configs published so you can re-run them. The lead Frank shows is over those specific configurations, not over "LLMs."
- It is not peer-reviewed. It's a working paper. Six retraction conditions are pre-registered. We list the limitations in §9, in the order they actually matter, not buried in an appendix.
Where to go from here
If you want the proper, structured version with all the data, the comparator specs, the retraction conditions, the construct-validity arguments and where each one is weakest:
→ Ablating a Stateful Agent — the working paper itself, with an interactive ablation playground and the full hypothesis-status board live in the page
→ Operational Self-Model Density · Methodology Companion — the deep version, ~80 pages, with the operationalized 30-item rubric, full per-item evidence, rater notes, comparator reproduction recipes, devil's-advocate self-attack, and the pre-registration provenance trail
If you'd rather just take away one sentence, it is this:
The cheapest correct answer wins. We tried to make ours measurable.