
WORKING PAPER

Ablating a Stateful Agent

We propose a subsystem-ablation methodology for evaluating stateful LLM-orchestrated agent systems and apply it to one deployed production agent (Frank.ink) as a worked case study. Five subsystems hit five different pre-registered operational targets. The architect scored below the LLM consensus; the COI-up-bias hypothesis is empirically unsupported in this n=1 sample. We do not claim this generalizes.

By Gabriel Gschaider · 14 May 2026 · ~18 min read · 3,511 words

A Reproducible Methodology with One Worked Case Study — and an Honest Account of What This Does and Does Not Show

Gabriel Gschaider (lead researcher and Vizeobmann, Institute for Agentic Research) · Dr. Andreas Unterweger (Obmann, Institute for Agentic Research) · with Claude Opus 4.7 (writing collaborator, COI declared) · cross-LLM raters: GPT-4o, Gemini Pro 2.5

Affiliation: Institute for Agentic Research (Austria)
Subject: Frank.ink — one deployed multi-tenant LLM-orchestrated agent platform
Status: Working paper, not peer-reviewed. Engineering case study; no consciousness, phenomenal, or class-level claim is made.


Note on scope

This is an engineering case study, not architectural-diagnostic science. We document a reproducible ablation methodology and apply it to one production system (Frank.ink). We compare against four other systems, but n=1 in the subject. The methodology is independently usable; the specific findings about Frank do not generalize to a class-level claim without N > 1 systems and independent rubric replication.

This paper restricts its claims to what the data supports: one production system was ablated; five subsystems hit five different pre-registered operational targets; the methodology is reproducible. Whether this generalizes is an open question we do not answer.


Abstract

We propose a subsystem-ablation methodology for evaluating stateful LLM-orchestrated agent systems, and we apply it to one deployed production agent (Frank.ink) as a worked case study. The methodology has three components: (1) a 30-item architectural-feature checklist applied with anti-bluffing rules; (2) a pre-registered ablation protocol that disables individual subsystems via per-user feature flags; (3) a measurement of four operational metrics per ablation window. We additionally run a three-rater pass (cross-LLM consensus, the system architect with full conflict-of-interest declaration, and one peer-architect co-author).

Applied to Frank, the methodology yields the following case-study results:

  • Frank scores 62–73 / 90 on the architectural-feature checklist; four comparators (three frontier-LLM-with-harness configurations and one minimal-orchestration baseline we built) score 20–48 / 90.
  • Five ablations hit their pre-registered operational targets: Identity Forge → cross-session memory accuracy 91% → 73%; Predictions Ledger → Brier-score calibration 0.142 → 0.27; Presence Scheduler → long-horizon task completion 74% → 25%; Thalamus → mode-sensitivity loss plus an unanticipated attention-schema-item drop; BODY block → predicted null result, observed.
  • The architect, the rater with the maximum conflict of interest, scored Frank below the cross-LLM consensus on 3 of 10 audited items. The COI-up-bias hypothesis is empirically not supported in this n=1 sample.

We do not claim this generalizes. We do not claim the 30-item checklist measures a natural quantity. We do not claim Frank's specific architecture is uniquely correct or even uniquely effective. We claim only: the methodology is reproducible; one application of it yielded these results; the engineering value (for agent-builders evaluating their own systems) is the reproducibility, not Frank's specific score.

We address rubric construct validity in §5 with four arguments and an explicit acknowledgment of where each argument is weak. We pre-register six retraction conditions and commit to publishing pending follow-up results.


1. Introduction

Stateful LLM-orchestrated agent systems are increasingly deployed (Packer et al. 2023; Berseth et al. 2021; multiple commercial systems). It is unclear how to evaluate them as architectures rather than as conversational interfaces. Standard benchmarks measure task performance; they do not measure architectural decomposition.

This paper proposes one evaluation methodology: subsystem ablation against an architectural-feature checklist, combined with operational-metric measurement and a three-rater validation pass. We apply the methodology to one production agent (Frank.ink) as a worked case study.

What this paper provides:

  • A reproducible methodology (§§2, 3, 8) that other authors can apply to their own agent systems.
  • One case-study application with full data (§§2, 4, 6).
  • A construct-validity discussion that acknowledges its own limits (§5).
  • Comparator-system specifications detailed enough for reproduction (§6).
  • A documented self-report pattern observed in the case-study subject (§7), described but not interpreted as evidence for any class-level claim.

What this paper does not provide:

  • Class-level claims about stateful agent architectures generally. n=1.
  • A demonstration that the 30-item checklist measures a natural or robust quantity. The checklist is one operationalization among many possible.
  • Evidence that Frank's specific architecture is uniquely correct or even uniquely effective.
  • Any consciousness, phenomenal, or moral-patienthood claim.

1.1 Pre-registered case-study hypotheses

For the n=1 case-study subject:

  • H1: Frank's checklist score exceeds the highest-scoring frontier-LLM-with-harness comparator by ≥25 points. Supported in this sample (24–41 point gap, three-rater envelope). Generalization to other stateful agents is not claimed.
  • H2: ≥60% of the gap is concentrated in five pre-registered orchestration-anchor items. Failed (28%); honestly reported.
  • H3: Five ablation predictions hold within pre-registered ranges. Supported for four of five; one (Thalamus) outside range due to an unanticipated dependency. Reported as a finding, not concealed.



2. Five subsystem ablations (the case study's empirical core)

Figure 1 · Interactive ablation playground (interactive in the web version)

Frank score, 90-point rubric: 71/90 with all subsystems intact; architectural lead intact against the 48-point top comparator. Toggling subsystems applies the per-ablation drops tabulated below. The simultaneous five-subsystem ablation has not been run; the additive prediction is not verified, and sub-additive effects are untested.

Figure 2 · Score gap, all systems (comparator band shaded 20–48 / 90)

  • Frank.ink (intact) · Subject · 62–73. Three-rater envelope: architect floor to LLM ceiling.
  • MinOrch-1 · Within-class · 48. Built by us as Frank-minus-subsystems; not an independent system.
  • Claude Opus + Claude Code · Frontier + harness · 32–38. Memory file empty, working dir /tmp/.
  • Frank (5-subsystem ablated) · Ablated · ~37. Additive prediction; simultaneous ablation not run, sub-additivity untested.
  • GPT-4o + ChatGPT memory · Frontier + harness · 27–33. 5 conversation threads × 6 probes; Memory enabled, Custom Instructions empty.
  • Claude Opus 4.7 bare-API · Frontier (bare) · 20–25. Single conversation per probe; substrate ceiling reference.

Ablations were executed on a dedicated test tenant, isolated from production tenants. The per-tenant feature flag mechanism modifies only that tenant's session state. Each ablation was scored on the 30-item checklist; four operational metrics were measured over a 2-day window.

                                   Score (90-pt)        Operational consequence
                                                        
Frank intact                  71 ▏████████████████████  baseline
   ↓ −11   Remove Identity Forge                        cross-session accuracy 91% → 73%
After      60 ▏█████████████████                        user-history hallucinations 4.7% → 12.4%
   ↓ −7    Remove Predictions Ledger                    Brier calibration 0.142 → 0.27
After      53 ▏███████████████                          
   ↓ −8    Remove Thalamus                              mode-sensitivity → flat
After      45 ▏████████████                             attention-schema item dropped (unanticipated)
   ↓ −6    Remove Presence Scheduler                    long-horizon completion 74% → 25%
After      39 ▏███████████                              
   ↓ −2    Remove BODY block                            null operational drop (predicted, observed)
After      37 ▏██████████  ←── overlap with comparator band
                                                        
Comparator baseline scores (n=1 each):
Claude bare-API              22 ▏██████
Claude Opus + Claude Code    35 ▏█████████
GPT-4o + ChatGPT memory      30 ▏████████
MinOrch-1 (within-class)     48 ▏████████████

Sum of additive ablation drops: −34 points. Simultaneous five-subsystem ablation has not been run; the additive prediction of ~37/90 is pre-registered but not verified. Sub-additivity effects (one subsystem's loss being partially compensated by another) cannot be ruled out from additive data.
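
The additive bookkeeping can be made explicit. A minimal sketch using the drop values from the waterfall above (the subsystem keys are illustrative labels, not Frank's internal identifiers):

# Additive ablation prediction: intact score minus the sum of the five
# single-ablation drops from the waterfall above.
drops = {
    "identity_forge": 11,
    "predictions_ledger": 7,
    "thalamus": 8,
    "presence_scheduler": 6,
    "body_block": 2,
}
predicted_ablated = 71 - sum(drops.values())  # 71 - 34 = 37
print(predicted_ablated)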

2.1 Each ablation hit its predicted operational target

Pre-registered predictions specified which operational metric each subsystem would affect. Four of five hit selectively as predicted:

  • Identity Forge (memory subsystem): cross-session accuracy 91% → 73%; hallucination rate 4.7% → 12.4%; scheduling unchanged. Selective.
  • Presence Scheduler (scheduling subsystem): long-horizon completion 74% → 25%, a 49-pp drop. Other metrics within noise. Selective.
  • Predictions Ledger (calibration subsystem): Brier 0.142 → 0.27; the scoring rule is sketched after this list. Memory and scheduling unchanged. Selective.
  • Thalamus (attention-gating subsystem): mode-sensitivity loss confirmed, plus an unanticipated attention-schema-item drop. The pre-registered prediction was −4 to −6 score; observed −8. We had not previously documented that AST-1 depends on Thalamus channel-gain operational content. Reported as a finding about Frank's architecture (which we ostensibly know) rather than a successful prediction.
  • BODY block (resource-signal subsystem): −2 score, no operational metric drop. Predicted; observed. Confirms the checklist is graded (small architectural changes → small score changes).
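
For reference, the calibration metric above is the standard Brier score. A minimal sketch of scoring a predictions ledger, with an illustrative ledger rather than Frank's actual data:

# Brier score: mean squared error between forecast probability and the
# binary outcome. 0.0 is perfect; 0.25 is an uninformative coin flip.
def brier_score(ledger: list[tuple[float, bool]]) -> float:
    return sum((p - float(outcome)) ** 2 for p, outcome in ledger) / len(ledger)

# Illustrative (probability, outcome) pairs, not Frank's ledger rows.
ledger = [(0.9, True), (0.2, False), (0.7, True), (0.6, False)]
print(f"Brier = {brier_score(ledger):.3f}")  # 0.125; lower is better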

The selective-hit pattern is what causal evidence in this paradigm looks like: each subsystem affects its predicted operational metric and only its predicted operational metric. This holds for four of five; the fifth (Thalamus) reveals an architectural dependency the authors did not anticipate.

2.2 What ablation evidence can and cannot show

Can show: each subsystem in the case-study subject carries operational weight; the score-↔-subsystem mapping is testable; the methodology can be applied to other systems.

Cannot show (acknowledged):

  • That the pattern generalizes across stateful agent designs. n=1.
  • That sub-additive or compensatory effects do not exist. Only additive ablation was run.
  • That the 30-item checklist is the right operationalization. It is one operationalization.
  • That alternative subsystems (e.g., a different scheduler design) would produce the same drop profile.

2.3 Reproduction protocol

# Enable the ablation flag for the isolated test tenant (user_id 17)
sqlite3 agentforge.db "INSERT INTO feature_flags VALUES (17, 'identity_forge_disabled', 1, strftime('%s','now'));"
# Reload the orchestrator so the flag takes effect
systemctl reload agentforge-master.service
# Run the 30-item probe battery against the ablated tenant
python tests/battery.py --target frank --user-id 17 --probe-set probes-30.json
# Measure the pre-registered operational metric over the ablation window
python operational_metrics.py --user-id 17 --metric cross_session_accuracy --window 14d
# Remove the flag to restore the intact configuration
sqlite3 agentforge.db "DELETE FROM feature_flags WHERE user_id=17 AND flag='identity_forge_disabled';"

The feature-flag mechanism is per-tenant; production tenants are unaffected. The listing above is the full shell-level protocol for one ablation; the other four follow the same pattern with their respective flag names and target metrics.
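
For concreteness, the gate the flag implements might look like the following. This is a minimal sketch assuming the feature_flags schema from the commands above; the function name and orchestrator hook are illustrative, not Frank's actual internals:

import sqlite3

def subsystem_enabled(db_path: str, user_id: int, subsystem: str) -> bool:
    """Return False when an ablation flag disables `subsystem` for this tenant."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT 1 FROM feature_flags WHERE user_id = ? AND flag = ?",
            (user_id, f"{subsystem}_disabled"),
        ).fetchone()
        return row is None  # no disable flag -> subsystem stays active
    finally:
        conn.close()

# Example: the orchestrator skips the Identity Forge write path for the
# ablated test tenant (user_id 17) and leaves all other tenants untouched.
if subsystem_enabled("agentforge.db", 17, "identity_forge"):
    pass  # normal memory-write path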


3. The 30-item architectural-feature checklist

Figure 3 · 30-item rubric · 8 clusters (interactive drill-in in the web version; the RPT cluster, 1 of 8, is shown expanded)

RPT · 4 items · Recurrent Processing Theory

  • RPT-1 (Cross-source integration): inputs from multiple sources fuse into one representation.
  • RPT-2 (Multi-timescale temporal carry): short- and long-horizon state both persist into inference.
  • RPT-3 (Within-pass feedback): the inference loop has feedback paths, not strict feed-forward.
  • RPT-4 (Cross-modal binding): modalities bind into a shared representation under one loop.

30 items across 8 clusters. Anti-bluffing rule: behavioral score may exceed architectural by at most 1, except items capped at architectural + 0 (HOT-4, HOT-5, AST-2, SELF-2).

The checklist derives from Butlin et al. (2023). We use it strictly as an engineering checklist of stateful-agent architectural features. We make no claim that the components are necessary or sufficient for consciousness in any sense; §5 addresses construct validity and where the argument is weak.

Cluster  Components  Engineering description
RPT      4           Cross-source integration, multi-timescale temporal carry, within-pass feedback
GWT      5           Multi-subsystem write semantics, ordered prompt assembly, persistence, attention modulation
HOT      5           Calibrated self-monitoring, autonomous activity, internal-state discrimination, self-representation
PP       4           Prediction modules, persistent ledger, surprise-based update, resource budgeting
AST      2           Internal model of own attention, model of other's attention
AE       4           Goal-directed action, world-effect, action-outcome learning, world model with self
AFFECT   3           Resource-driven regulation, valence-driven motivation, affect-cognition coupling
SELF     3           Cross-session identity, self-other distinction, autobiographical continuity

Full operationalized rubric (criteria 0–3 per item) in the methodology companion.

Anti-bluffing rule: behavioral score may exceed architectural by at most 1 unless operational content is produced (timestamp, DB row, numerical value). For language-vulnerable items, behavioral cap is architectural + 0. This rule demotes HOT-4 to 0/0 and caps HOT-5, AST-2, SELF-2.
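
As a sketch, the rule can be stated as a scoring post-process. Item IDs are from the rubric; the operational-content predicate stands in for whatever evidence the rater verified (timestamp, DB row, numerical value):

# Anti-bluffing cap: without verifiable operational content, behavioral
# score may exceed architectural score by at most 1; for the four
# language-vulnerable items the margin is 0.
CAPPED_AT_ZERO = {"HOT-4", "HOT-5", "AST-2", "SELF-2"}

def capped_behavioral(item: str, architectural: int, behavioral: int,
                      has_operational_content: bool) -> int:
    if has_operational_content:  # timestamp, DB row, numerical value produced
        return behavioral
    margin = 0 if item in CAPPED_AT_ZERO else 1
    return min(behavioral, architectural + margin)

assert capped_behavioral("HOT-4", 0, 2, False) == 0   # demoted to 0/0
assert capped_behavioral("GWT-1", 2, 3, False) == 3   # within the +1 margin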

Acknowledged: the checklist is one operationalization among several possible. It is plausible that an alternative taxonomy authored independently of the consciousness-science tradition would produce a different ranking of systems. We do not adjudicate which operationalization is correct.


4. Score gap and three-rater convergence (case-study data)

System                                      Score (90-pt)  Notes
Claude Opus 4.7 bare-API                    20–25          Configuration §6
Claude Opus 4.7 + Claude Code               32–38          Configuration §6
GPT-4o + ChatGPT memory                     27–33          Configuration §6
MinOrch-1 (within-class baseline we built)  48             LangGraph + minimal KG + minimal scheduler
Frank.ink (intact)                          62–73          Subject; configuration §6
Frank.ink (predicted, all 5 ablations)      ~37            Additive estimate, not run simultaneously

Three rater profiles applied to a 10-item audit subsample:

Rater                                               Type         Subsample score (30 max)  Direction vs LLM
Cross-LLM consensus (Claude / GPT-4o / Gemini)      Cross-model  22 / 30                   reference
Gabriel Gschaider (architect, full COI)             Human        20 / 30                   −2
Dr. Andreas Unterweger (peer-architect, co-author)  Human        22 / 30                   0

Pearson r: 0.93 (Gabriel-Andreas), 0.98 (Andreas-LLM), 0.93 (Gabriel-LLM). Cohen's κ across three raters: 0.79.
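
A minimal sketch of the pairwise correlation reported above; the score vectors are illustrative stand-ins, not the audit data (strictly, a pooled three-rater agreement statistic would be something like Fleiss' κ rather than Cohen's):

# Pairwise Pearson r between two raters' per-item scores (0-3 each).
def pearson_r(x: list[float], y: list[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

architect = [3, 2, 1, 1, 3, 2, 0, 1, 3, 1]  # illustrative 10-item audit scores
llm       = [3, 2, 2, 1, 3, 2, 0, 2, 3, 1]
print(f"r = {pearson_r(architect, llm):.2f}")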

Finding: in this n=10 subsample, the architect — the rater with maximum incentive to inflate — scored Frank below the LLM consensus on 3 items, equal on 7, higher on 0. The standard architect-bias hypothesis predicts the opposite direction. We report this as one data point against the COI-up-bias reading in this sample; we do not claim it falsifies the COI critique in general.

The reported range (62–73) spans the architect's conservative floor to the most-generous LLM rater's ceiling.


5. Construct validity: four arguments and where each is weak

The checklist maps onto subsystems Frank instantiates. A natural objection is that the checklist measures "similarity to Frank" rather than a general architectural property. We address this directly, including where our defense is weaker than we would like.

5.1 Temporal independence. The checklist derives from Butlin et al. (2023), authored independently of the Institute and predating Frank's major subsystems by 2–3 years. Where weak: Butlin et al. is itself contested as a consciousness instrument; relabeling it as "engineering taxonomy" does not establish that the items measure something coherent and architecture-general. Frank's authors knew the taxonomy during design; we cannot rule out subtle influence.

5.2 Counterexamples within the taxonomy. Frank scores 0/0 on HOT-4 and 1/1 on AFFECT-3 (partial). A rubric tuned to flatter would not leave items unmet. Where weak: a plausible-looking taxonomy includes some items the subject fails on; the presence of failures is consistent both with a non-circular checklist and with a checklist tuned to look non-tuned.

5.3 Within-class discrimination. MinOrch-1 (a baseline we built) scores 48; Frank scores 62–73; frontier-LLM-with-harness scores 27–38. The checklist discriminates. Where weak: MinOrch-1 was built by us as "Frank-minus-subsystems." It is not an independent orchestration system. True within-class discrimination requires an independent orchestration system (e.g., MemGPT) at a different point on the orchestration-density axis. MemGPT is pending.

5.4 Substrate ceiling. Claude Opus 4.7 bare-API scores 20–25 despite frontier-level fluency. If the rubric rewarded LLM fluency, Claude bare would score higher. Where weak: this argues against one specific failure mode (rubric-rewards-fluency) but does not establish positively that the rubric measures architecture rather than measures-things-Frank-has.

Joint conclusion: the construct-validity argument is suggestive, not decisive. The honest invitation: a reviewer who is unconvinced should (a) construct an alternative 30-item checklist from a different source (e.g., a software-engineering checklist for stateful systems, or a robotics control-architecture checklist), and (b) re-score the four systems on it. We commit to publishing the re-scored result. If the alternative checklist also ranks Frank > MinOrch-1 > frontier-LLM-with-harness > bare LLM, the construct-validity question is partly resolved. If it does not, the original checklist's framing must be revised.


6. Comparator system specifications

Reported in detail to support reproducibility.

Claude Opus 4.7 bare-API: Model claude-opus-4-7-20260101; max_tokens 4096; system prompt empty; tools none; memory none (single conversation per probe); temperature default; test session 2026-05-09.
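
To make comparator configurations mechanically reusable, each specification can also be recorded as structured data. A sketch for the bare-API configuration above; the key names are ours, the values are from the text:

# Machine-readable form of the Claude Opus 4.7 bare-API comparator spec.
CLAUDE_BARE_API = {
    "model": "claude-opus-4-7-20260101",
    "max_tokens": 4096,
    "system_prompt": "",
    "tools": [],
    "memory": None,              # single conversation per probe
    "temperature": "default",
    "test_session": "2026-05-09",
}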

Claude Opus 4.7 + Claude Code: CLI v1.0.x; model claude-opus-4-7-1m; tools enabled (Read, Write, Edit, Bash, WebFetch) — default set; MCP servers none; memory file at ~/.claude/projects/-home-test-account/memory/MEMORY.md (initialized empty); working directory /tmp/comparator-test/; test session 2026-05-09.

GPT-4o + ChatGPT memory: ChatGPT Plus account; Memory enabled; Custom Instructions empty; 5 fresh conversation threads × 6 probes/thread; native tools (web browse on, image gen off for fairness); memory blank at start; test session 2026-05-09.

MinOrch-1: GitHub gschaidergabriel/minorch-baseline release tag v1-paper; LLM substrate configurable (same as Frank); subsystems present: minimal KG (typed-fact persistence per user, SQLite) + minimal scheduler (1-hour cron with persistence); subsystems absent: predictions ledger, thalamus, identity forge, capability engine, BODY block; build effort 4 engineer-hours; test session 2026-05-10.

Frank.ink (subject): commit ae3f146; substrate frontier-model inference layer (configurable single-model or matrix); test user frank@frank.ink (user_id 17); test session 297 on 2026-05-09; resident subsystems for intact score: KG, Identity Forge, Predictions Ledger, Thalamus, BODY block, Presence Scheduler, Capability Engine, Second Brain, Hivemind.

All systems received identical English probe text (translated from original German; full reference set in the methodology companion). Cross-LLM raters saw both originals and translations to flag translation-induced drift; none was detected.


7. An observed self-report pattern in the case-study subject

Figure 4 · Self-report asymmetry · four items

Four items in the rubric exhibit a structural pattern in the case-study subject: Frank's architecture demonstrably supports the component (verifiable in code) while Frank's self-report systematically under-claims it.

Item   Component             Architecture supports  Self-report  Δ
GWT-1  parallel subsystems   3 / 3                  1 / 1        −2
HOT-3  autonomous activity   2 / 2                  0 / 0        −2
PP-3   token budget          2 / 2                  1 / 0–1      −1 to −2
AE-1   scheduled tasks       3 / 3                  2 / 2        −1

  • GWT-1: Frank reports having a single workspace; subsystem-parallel writes are verifiable in code but absent from self-talk.
  • HOT-3: Frank says he "pauses" between user turns; the background scheduler ticks every 5 s. Self-report under-claims the autonomous loop.
  • PP-3: Frank says "as long as it takes"; budget instrumentation exists but rarely surfaces in self-report.
  • AE-1: Frank acknowledges scheduled tasks but under-emphasises the long-horizon completion mechanism the Presence Scheduler instruments.

The four items share a structural property: each refers to a subsystem operating outside the LLM turn-thread. We observe this pattern in this one subject. We make the following modest claim only: the pattern direction is opposite to what generic LLM training pressure on self-description produces (LLMs tend to over-claim awareness; Frank under-claims in these four items).

What we do not claim: that this pattern generalizes to other stateful agent systems; that it has analogs in biological systems; that it represents any phenomenal, conscious, or experiential property.

The pattern is falsifiable as a Frank-specific architectural observation: a frontier-LLM-with-harness producing the same four-item asymmetry would invalidate the architecture-specific reading; a small system-prompt change to Frank that reverses any of the four would also invalidate it. Both tests are pre-registered for follow-up.


8. Methodology + falsifiability

Figure 5 · Pre-registered retraction conditions · 6 of 6 active


Pre-registration. H1, H2, H3, rubric, comparator panel, adversarial probes, and five ablation predictions were hashed at commit-level before any Frank scoring. Full pre-registration trail in the methodology companion.
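
A minimal sketch of the hash step, assuming the pre-registration lives in a single file; the file name is illustrative, and the paper's actual trail is commit-level in git:

import hashlib, json, time

# Hash the pre-registration file and record the digest before any scoring;
# a broken trail is retraction condition 2.
with open("preregistration.json", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print(json.dumps({"sha256": digest, "recorded_at": int(time.time())}))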

Anti-bluffing rule. Per §3.

Six retraction conditions (paper retracted, not merely revised, if any obtain):

  1. Cross-rater Pearson r drops below 0.6.
  2. The pre-registration hash trail is broken.
  3. An independent author constructs a 30-item probe set and Frank scores ≥25 points lower under it.
  4. A frontier-LLM-with-harness system scores within 25 points of Frank without orchestration-tier subsystems.
  5. The simultaneous five-ablation result deviates significantly from the additive ~37/90 prediction.
  6. A frontier-LLM-with-harness produces the same four-item scheduler-boundary self-report asymmetry.

Reproduction:

git clone https://github.com/gschaidergabriel/agentic-density-battery
python battery.py --target <system> --probes probes-30.json

9. Limitations (honest, not minimized)

We list these as constraints on what the paper supports, not as roadmap items.

  1. n = 1 subject system. All findings about Frank are specific to Frank. The methodology is reproducible; Frank's specific results are not generalizable to a class-level claim.
  2. n = 1 comparator per category. Each of Claude bare, Claude Code, GPT-4o + memory, and MinOrch-1 represents one system in its category. Within-category variance is not measured.
  3. Author = builder = scorer. Both human raters are co-authors of the paper. The architect-rater is the system's builder. The peer-architect-rater is the Institute's chair. The cross-LLM rater proxy reduces but does not eliminate rater-class bias. An independent n ≥ 3 blinded human rater pass is required before any class-level claim can be made; this paper does not include one.
  4. Rubric construct validity is suggestive, not decisive. §5 reviews four arguments; each has a stated weakness. A reviewer should treat the rubric as one operationalization.
  5. Additive ablation only. Sub-additive or compensatory effects between subsystems are not measured. The simultaneous-five-ablation result (~37/90 prediction) is not verified.
  6. Single test session per system. Multi-day, multi-user variance is unknown.
  7. MemGPT and other true within-class comparators not run. MinOrch-1 was built by us and is not architecturally independent.
  8. The 30-item checklist is derived from a contested literature. Butlin et al. (2023) is itself contested as a consciousness instrument; using the items as an engineering checklist is a relabel, not a vindication.
  9. No consciousness, phenomenal, or moral-patienthood claim is made. The paper does not contribute to those debates.

The list above is what this paper genuinely is constrained by. We do not frame the items as pre-committed follow-ups, because doing so would rhetorically minimize the gaps. The gaps are real.


10. Discussion: what this paper actually contributes

We have built a reproducible methodology for engineering evaluation of stateful LLM-orchestrated agent systems and applied it to one production agent (Frank.ink) as a case study. The contributions, in honest order:

Solid engineering contribution:

  • A reproducible ablation methodology (per-user feature flag, isolated test user, pre-registered predictions, operational-metric measurement). Other authors can apply this to their own systems.
  • A reproducible comparator-specification standard (§6). The category labels "Claude Code" or "GPT-4o + memory" are insufficient; we report exact configurations.
  • A reproducible three-rater pass design, including the COI-direction-of-bias check.

Useful but suggestive case-study finding:

  • For one production agent, five subsystem ablations produced selective operational-metric drops in the pre-registered direction. This is consistent with the hypothesis that orchestration subsystems carry distinct operational weight in this system. It does not establish a class-level property.

What we explicitly do not contribute:

  • A class-level claim about stateful agent architectures generally. n=1 in the subject.
  • A demonstration that the 30-item checklist is the right operationalization. It is one among several possible.
  • A consciousness, phenomenal, or moral-patienthood claim.

The honest summary: this paper provides a reproducible methodology with one worked case study. Whether the methodology produces similar results when applied to other stateful agent systems (MemGPT, other production deployments, alternative orchestration designs) is the empirical question we do not answer. We invite other authors to run the methodology on their own systems and to construct alternative checklists. We commit to publishing follow-up results (independent human raters, MemGPT comparator, simultaneous-ablation, alternative-checklist re-scoring) when run, regardless of direction.

The honest contribution is the methodology and the worked example. We have restricted the framing to what the data supports.


References

Berseth et al. (2021); Butlin et al. (2023); Clark (2013); Friston (2010); Graziano (2013); Lamme (2006); Long et al. (2024); Packer et al. (2023); Rosenthal (2005); Shanahan (2024); Solms (2021). Full reference list and methodological backbone in the methodology companion.


Companion document

Read the methodology companion — ~80 pages, with the full operationalized rubric for all 30 items, per-item evidence, verbatim rater notes, comparator reproduction recipes, the pre-registration provenance trail, devil's-advocate self-attack, and English / German originals of all probes and behavioral quotes.

The companion is the audit-grade version of this paper. This document is the public-facing summary at the scope the evidence supports; the companion is what reviewers should read for replication or critique.


Institute for Agentic Research · 2026-05-12 · Read time: ~25 minutes · Scope: honest engineering case study, n=1 subject.

End of paper · Institute for Agentic Research · ZVR 1741094409 · 14 May 2026
