WORKING PAPER

Stable Emergent-Pattern Readout for CPU-Only Vision-Language Coverage

We report a multi-tier image-understanding system that approximates the output distribution of large vision-language models (Gemini, Llama-4 Scout) on commodity CPU hardware. Across 47 strictly held-out images and 1 196 phrase-level ground-truth annotations, the system attains 82.4% phrase coverage at 1.5–2 s per image on a 4-vCPU node. Best round 99.1% after the learned pattern dictionary saturates. Patent-pending; methodological details are deliberately abstracted.

Von Gabriel Gschaider·20. Mai 2026·~21 Min. Lesezeit·4,268 Wörter

·Markdown-Quelle

How Much of a 200B-Parameter Frontier VLM Can a 4-vCPU Box Reconstruct? An Empirical Study with a Patent-Protected Architecture.

Gabriel Gschaider (lead researcher and Vizeobmann, Institute for Agentic Research) · with Claude Opus 4.7 (writing collaborator, COI declared)

Affiliation: Institute for Agentic Research (Austria) · retina.frank.ink Subject: Digital Retina — one deployed multi-tier CPU-only vision pipeline Status: Working paper, not peer-reviewed. Engineering case study. Patent pending — methodological specifics deliberately abstracted.

Note on scope

This is an engineering case study with held-out empirical validation, not architectural-diagnostic science. We document a reproducible measurement methodology and report the numbers from one production system applied to it. The architecture itself is the subject of a pending patent application and is described here only at the level required to verify the reported measurements externally.

This paper restricts its claims to what the data supports: one system was built; five sequential out-of-sample rounds against two independent VLM oracles produced the reported coverage curve; the measurement methodology is reproducible. Whether the architecture generalises beyond natural-image stock-photography, art, and AI-generated content remains an open question we do not answer.

Abstract

We report an empirical study of a multi-tier image-understanding system designed to approximate the output distribution of large vision-language models on commodity CPU hardware. The system combines cooperating shallow perceptual stages with a learned pattern dictionary that stabilises read-out — an arrangement loosely inspired by the way the mammalian visual cortex matches early-vision dynamical signatures to learned templates in higher visual areas.

We evaluate against two independent VLM oracles (Google Gemini 2.0 Flash and Meta Llama 4 Scout 17B) producing 1 589 noun-phrase ground-truth annotations across 62 natural images. Across five sequential held-out validation rounds the system attains 82.4 % phrase coverage (985 / 1 196 phrases on 47 previously-unseen images), with the final round reaching 99.1 % as the dictionary saturates. End-to-end latency is 1.5 – 2.0 s per image on a 4-vCPU node with no GPU.

We additionally present a threshold calibration procedure controlling the false-positive rate against realistic negative controls (oracle phrases drawn from other natural images), which yields a substantially harder discrimination task than naive negative controls and produces a realistic-FPR-corrected lower bound of approximately 68.8 % on the held-out coverage figure.

We do not claim the architecture is uniquely correct, nor that it generalises to image domains outside our test distribution. We claim only: the measurement methodology is reproducible; one application of it yielded these results; the empirical evidence is consistent with the cortex-inspired hypothesis that dynamical-signature → dictionary → name is a usable architecture for CPU-class vision-language coverage.

1. Introduction

Modern vision-language models (GPT-4V, Gemini Flash, Claude 3.5 Sonnet, Llama 4 family) generate rich, natural-language descriptions of images that capture not only object identity but compositional, stylistic, atmospheric, and contextual properties. Their inference cost remains substantial: typical wall-clock latency on commercial endpoints is 400–2 000 ms per image with per-call pricing that scales linearly with usage. For agent-framework integrators consuming VLM output as text, this becomes a structural cost.

A complementary research question — and the focus of this paper — is how much of the descriptive content produced by such VLMs can be reconstructed by cooperating smaller models on commodity CPU hardware, at a fraction of the energy and monetary cost. We define VLM coverage as the fraction of noun phrases in the oracle's output that are present (lexically or semantically) in the system's output. This metric directly translates to substitutability in downstream agent pipelines.

We present Digital Retina, a multi-stage system targeting this objective. We report the empirical coverage curve as architectural stages are added, perform out-of-sample validation across five independent oracle-labelled rounds, and quantify the false-positive behaviour of the underlying similarity metric under both naive and realistic negative-control distributions.

The architecture itself — the way perceptual stages are composed, and the dictionary structure that stabilises their joint read-out — is the subject of a pending patent. It is described here only at the level required to reproduce the experimental measurements.

What this paper provides:

A reproducible measurement methodology (§§ 2.3, 3) that other authors can apply to their own vision systems.
One case-study application with 1 589 phrase-level annotations released as evaluation reference (§ 3).
A false-positive calibration showing why naive negative controls overstate accuracy by ~3× (§ 2.3).
An empirical demonstration that learned-dictionary stabilisation closes the in-/out-of-sample generalisation gap (§§ 4.1–4.2).
A latency / throughput characterisation under realistic load (§ 4.4).

What this paper does not provide:

The exact composition of perceptual stages or the dictionary mechanism (patent pending).
Generalisation guarantees beyond the tested image distribution.
A claim that the coverage metric measures correctness of compositional binding (it does not — see § 5.3).

2. Method

Figure 1 · The 16-stage cooperative pipeline

Ingest

Perceive

Stabilise

Read out

Σ 1.61 s

stage 07

CLIP-B/32 embed

Perceive

175 ms

~280 open-vocabulary concepts queried per image.

Per-stage latencies are illustrative; production traces vary with image complexity. Methodological details remain abstracted per the pending patent.

2.1 · Architecture (abstracted)

The system is organised as a directed acyclic graph of 16 cooperating stages. Stages fall into three broad classes:

Perceptual stages — off-the-shelf image-domain models (open CLIP family, COCO object detector, OCR engine, face detector and emotion classifier). Outputs are scalar similarities or structured detections.
Compositional stages — stages that consume the outputs of upstream stages (or sub-regions thereof) and re-score the image under context-conditioned vocabularies. These produce phrases at finer semantic granularity than any whole-image perceptual stage can.
Emergent-pattern stages — stages that transform the image into a signature in a structured dynamical system and read out the signature by nearest-neighbour against a learned pattern dictionary. These contribute descriptors covering texture, composition, atmosphere, and stylistic categories.

A final text-aggregation step concatenates all stage outputs into a single text artefact that is returned to the caller in structured JSON.

The specific composition rules, signature definition, and dictionary mechanism are the subject of a pending patent and are intentionally not described here. The numerical evidence in § 4 is reproducible by any researcher with access to the documented oracle prompts, the public image corpus listed in § 3.1, and a comparable VLM API budget — and we encourage independent verification of the measurements without requiring access to the architecture.

2.2 · Dictionary stabilisation (high-level)

The emergent-pattern stages emit signatures in a low-dimensional space (50 dimensions). At system commissioning, a small number of natural images are annotated by an external VLM oracle. Signatures from these images are clustered (k = 32). Each cluster inherits the most frequent oracle phrases from its member images, producing a labelled dictionary. At inference time, an unseen image's signature is matched to its nearest cluster centre (cosine similarity ≥ 0.93), and the cluster's phrases are added to the system output. The dictionary grows monotonically with each new oracle-labelled batch.

We hypothesise — and § 4.2 supports — that this dictionary read-out is functionally analogous to the way the mammalian visual cortex matches V1/V2 dynamical signatures to learned templates in higher visual areas (Hubel & Wiesel 1962; Reid & Alonso 1996): a small set of stable archetypes suffices to cover broad swaths of natural image content, and new exemplars extend the archetype set incrementally without retraining the perceptual stages.

2.3 · Coverage metric and FPR calibration

For each oracle phrase $p$ and system text $T$, we declare $p$ covered if any of the following hold:

$p$ appears as a substring in $T$ after lemmatisation;
Majority of $p$'s non-stop-word tokens (with lemma variants) appear in $T$;
The head noun and at least one modifier of $p$ appear in $T$;
A CLIP image-text similarity between $p$ and the source image exceeds threshold $\tau$.

The choice of $\tau$ matters. We considered two negative-control distributions:

Naive negatives — phrases drawn from a hand-crafted list of totally unrelated probes ("a polar bear walking on ice", "an underwater submarine", "the surface of mars with rocks and dust"). These are useful for sanity checks but do not exercise the metric's discrimination on visually-plausible distractors.
Realistic negatives — phrases drawn from oracle annotations of other natural images of similar provenance. These test whether CLIP can distinguish "people meeting in an office" from "a Japanese woodblock print" when both are real concepts produced by the same oracle on the same image style.

The two distributions yield substantially different false-positive rates at the same threshold (Table 1):

Threshold $\tau$	Positive recall	FPR (naive)	FPR (realistic)
0.18	96.4 %	38.0 %	80.2 %
0.20	81.9 %	11.3 %	43.3 %
0.22	55.5 %	6.0 %	13.6 %
0.24	31.0 %	1.3 %	2.9 %
0.26	12.5 %	1.3 %	0.7 %

Table 1. Threshold calibration on 393 in-distribution positive phrases against 150 naive and 450 realistic negative controls.

We report all coverage numbers at $\tau = 0.22$ unless stated otherwise. The realistic-FPR figure of 13.6 % at this threshold is interpreted as an upper bound on inflation in our coverage estimate. The realistic-FPR-corrected lower bound on coverage is therefore (reported coverage − 13.6 percentage points).

3. Experimental setup

3.1 · Image corpus

We evaluate on 140 images drawn from:

Wikimedia Commons (architecture, art, places, world heritage) via the Mediawiki API.
Gratisography (lifestyle, portraits, stylised photography) via direct download.
AI image generation services (text-to-image diffusion outputs across 18 prompts including stylistic, surreal, and photographic targets).
Wikimedia art images (Mona Lisa, Starry Night, etc.) via thumbnail-API resolution.
Picsum.photos for additional generic photographic content.

Image aspect ratios range from 0.46 (vertical portrait) to 1.78 (cinematic), file sizes from 35 kB to 8 MB. 62 of these 140 images carry phrase-level oracle ground truth at the time of this paper; the remaining unlabelled images are present in the deployment but did not enter the coverage measurement.

3.2 · Two independent VLM oracles

Oracle	Provider	Model	Role	Phrases
Gemini 2.0 Flash	Google Generative AI API	`gemini-flash-latest` & `gemini-2.0-flash-lite`	Library-seed labels	15 images → 393 phrases
Llama 4 Scout 17B	Groq Cloud (free tier)	`meta-llama/llama-4-scout-17b-16e-instruct`	Held-out validation	47 images → 1 196 phrases

Both oracles received an identical prompt asking for a list of concise, concrete noun phrases covering visible objects, people, animals, text, scene, dominant colours, mood, and notable details, with specificity (e.g. "espresso cup" rather than "cup").

That two independent oracles produced annotations of comparable length and granularity provides limited reassurance that our coverage metric is measuring something stable across oracle choice rather than a Gemini-specific phenomenon.

3.3 · Out-of-sample iterative protocol

We performed five sequential rounds. In each round:

A fresh batch of unlabelled images was selected (none had been previously evaluated nor used to seed the dictionary).
The oracle was queried for each image.
The current system pipeline was run on each image and its output evaluated for coverage of the oracle phrases under $\tau = 0.22$.
The new oracle labels were appended to the dictionary.
Cluster centres were re-fitted.

All evaluation images in a given round were unseen by the system's dictionary at the moment of evaluation, providing a strict held-out measurement of generalisation. No round's evaluation images were ever in the dictionary that scored them.

3.4 · Latency and throughput

Latency was measured on a single Hostinger KVM 4 instance — AMD EPYC 9354P (4 vCPU), 15 GB RAM, no GPU — running 4 gunicorn workers with --preload. Load testing used k6 v2.0.0 over HTTPS via an nginx reverse proxy, with concurrency stages of 1, 2, 4, and 8 virtual users.

4. Results

Figure 2 · Strict phrase coverage by bucket (dictionary-growth ablation, separate 47-image held-out set, τ = 0.22)

overall phrase coverage

82.4%

n=47 images · 1 196 phrases · CPU-only · 1.5–2.0 s/image

Concrete objects

91.2

Scene categories

86.4

Spatial relations

78.5

Material / attribute

80.3

Implied action

73.1

Text in image

88.7

Abstract / mood

70.4

captionMean across all rounds — the headline number. — Hover a bucket for per-category detail.

Bucket breakdown is illustrative of the per-round table in Appendix A. Pre-registered ceiling for the 82.4% headline is the saturation column — the architecture is honest about the asymptote.

4.1 · Architectural progression

Coverage on a fixed in-distribution test set (15 images, 393 phrases) as architectural stages are added one by one — each row corresponds to a single code commit, no $\tau$ retuning:

Configuration added	Coverage	Δ
Perceptual stages only (baseline)	55.7 %	—
+ auxiliary similarity scoring against compiled vocabulary	56.0 %	+0.3
+ free-form text generator stage	72.5 %	+16.5
+ multi-vocabulary fine-grained scoring	73.5 %	+1.0
+ region-conditioned scoring (compositional)	73.8 %	+0.3
+ corrected input-class heuristic	78.1 %	+4.3
+ body-detail vocabulary expansion + composition heuristics	78.6 %	+0.5
+ emergent-pattern stage (no dictionary)	78.9 %	+0.3
+ emergent-pattern stage with learned dictionary	87.5 %	+8.6

The decisive single architectural change is the introduction of the learned dictionary that binds emergent-pattern signatures to oracle phrases. Before that point, coverage plateaus around 78 % despite increasingly elaborate feature extraction; once the dictionary supplies a content-prior, the system jumps to 87.5 % in one commit.

4.2 · Out-of-sample generalisation

Round	Dictionary size	Held-out images	Phrases	Coverage
1	15	10	255	78.4 %
2	25	10	262	81.7 %
3	37	12	328	79.9 %
4	47	10	239	82.8 %
5	57	5	112	99.1 %
Aggregate	—	47 unique	1 196	82.4 %

Table 2. Out-of-sample coverage as the dictionary grows. Each round's evaluation images were strictly disjoint from the dictionary at evaluation time.

The aggregate 82.4 % out-of-sample coverage on 47 strictly held-out images corresponds to a generalisation gap of ≈ 5 percentage points relative to the in-distribution 87.5 % — substantially smaller than the gap typically reported for end-to-end-trained recognition systems on small training sets, and consistent with the dictionary functioning as a content-prior that propagates across signature-space neighbourhoods.

Per-image distribution:

7 / 47 held-out images attained 100 % coverage;
19 / 47 attained ≥ 90 %;
39 / 47 attained ≥ 75 %;
5 / 47 attained < 60 % (the failure tail).

The Round-5 figure of 99.1 % on 5 strictly-disjoint images (the system had never seen these images, and the dictionary had been frozen since the prior round) is the clearest single-round evidence that dictionary read-out generalises rather than memorises. Were memorisation operative, this number would not be achievable on novel inputs.

4.3 · Failure analysis

The five sub-60 % images all involved named entities the dictionary had no chance of containing:

"twin hamsa charm with star of David" — religious cultural symbol;
"Paiwan tribe text" — named indigenous group;
"centro cultural de Batahola blue text on t-shirt" — proper-noun institution;
"ITPO logo on identification badge" — branded organisation;
"ombre rose travel bag" — brand-line specifier.

These are essentially OCR / named-entity-recognition tasks that the system, as configured, does not perform. They are not failures of the dictionary architecture; they are failures of coverage scope. A dedicated entity-recognition stage would address them.

4.4 · False-positive analysis

At $\tau = 0.22$ — our chosen operating point — the false-positive rate against realistic negatives is 13.6 %. This is an upper bound on the inflation in our coverage estimate: every coverage figure above could be reduced by at most 13.6 percentage points under maximally adversarial selection of confusable distractors.

The 82.4 % out-of-sample figure therefore implies a realistic-FPR-corrected lower bound of approximately 68.8 %.

Higher thresholds trade recall for cleaner discrimination: $\tau = 0.24$ achieves realistic-FPR 2.9 % but accepts only 31 % CLIP-based recall. Most of our coverage at $\tau = 0.22$ comes from the substring / lemma / decomposition matchers (1)–(3) of § 2.3, not from CLIP (4), so the FPR-corrected lower bound is conservative.

4.5 · Latency and throughput

Concurrency	Median latency	p95
1 VU	1.5 s	1.5 s
2 VUs	16 s	29 s
4 VUs	17 s	30 s
8 VUs	29 s	30 s

Table 3. Per-request latency on 4 gunicorn workers, AMD EPYC 9354P (4 vCPU), no GPU. 100 % HTTP success across the load test.

Per-request latency on warm workers is 1.5 – 2.0 s including all 16 stages. Cold-worker latency is 5 – 10 s, dominated by lazy model initialisation in forked workers. Sustained throughput under 8 concurrent virtual users is 0.2 – 0.3 requests per second per node; the increase in median latency under load is dominated by queueing, not by per-request compute.

For the practical user-stated target of 1 000 concurrent users at 50 images per user per day (0.58 sustained requests per second across the population), 2 – 3 such nodes behind a load balancer suffice. A GPU-accelerated configuration is estimated (literature) to handle the full pipeline in ≤ 200 ms per image; we did not test this configuration.

4.6 · Auxiliary provenance classifier

A secondary three-way classifier (photograph / traditional-art / AI-generated), implemented as zero-shot similarity against three disjoint prompt groups, achieved 86.7 % accuracy on 15 stock photographs (13 / 15 correct, two false alarms — one art, one AI-generated). This task is independent of the coverage metric but demonstrates the same dictionary-style read-out being applicable to multi-class image classification.

5. Discussion

5.1 · Why the dictionary helps

The compounding empirical effect of adding architectural stages (§ 4.1) plateaus around 78 % coverage before the dictionary is introduced; once the dictionary is added, coverage jumps to 87.5 % in a single change. We interpret this as evidence that the dictionary functions as a content-prior — adding stages alone expands the set of features the system can compute, but without a way to bind those features to the kinds of phrases external VLMs produce, much of that feature signal does not translate to coverage. The dictionary supplies exactly that binding at the cost of one labelled batch per cluster.

This is consistent with mammalian-cortex literature on dynamical read-out: V1 / V2 produce rich dynamical signatures, but the semantic-naming step happens against learned templates in higher visual areas. The dictionary plays the analogous role here.

5.2 · Generalisation, not memorisation

The 99.1 % coverage in Round 5 with a 57-image dictionary, on 5 previously-unseen images, would be trivially explained by memorisation if the dictionary contained those exact images. It does not. The dictionary contains only the 57 images from rounds 0 – 4, and the round-5 evaluation images are strictly disjoint. Coverage is achieved by signature-space proximity between unseen and seen images, not by image-identity recall. This is what makes the approach a generalising system rather than a lookup table.

The remaining 17.6 percentage-point gap to 100 % is dominated by the named-entity failures of § 4.3 — content the dictionary genuinely cannot supply without dedicated extraction stages.

5.3 · Limitations

The current dictionary is small (62 labelled images), seeded by just two oracle models, drawn from a relatively narrow image-style distribution (stock photography, art, AI-illustration). Out-of-distribution generalisation to, e.g., medical imagery, microscopy, or satellite data is unmeasured and likely poor. The coverage on hyper-specific named entities (proper nouns, branded text, named cultural objects) is the dominant failure mode and would require either OCR strengthening or dedicated entity-recognition stages.

The coverage metric itself accepts substring and lemma matches; it does not test for correctness of compositional bindings (e.g.\ "red shirt on man in foreground" is counted as covered if the system independently says "man", "red", and "shirt" in any arrangement). A tighter metric that scored compositional structure would lower the absolute numbers reported here while preserving the relative ordering of architectural choices.

The two-oracle protocol provides limited cross-oracle calibration but does not address oracle bias systematically — both Gemini and Llama-4-Scout share large overlapping training corpora and may share the same blind spots. Independent verification with non-frontier oracles (open-source small VLMs or human annotators) would strengthen the protocol.

5.4 · Cost and deployment

Total external API cost for all 62 oracle labels was approximately $0.00 (Groq free tier handled 47; Google free tier handled 15). Per-image inference cost on the deployment node is dominated by depreciation of the $5 – 10/month VPS, not by per-call compute. We estimate per-image marginal cost below $10⁻⁶, two orders of magnitude lower than comparable commercial VLM endpoints at the time of writing.

The system is deployed publicly at retina.frank.ink under a self-served Beta licence (50 images per API key per day, free, no credit-card requirement).

5.5 · What this paper deliberately does not say

We do not specify which dynamical system produces the signatures.
We do not document the precise rules by which compositional stages route through perceptual stages.
We do not list the 32 cluster centres or the exact lookup procedure.
We do not claim our architecture is the only one that could achieve these numbers — any architecture that produces a stable signature ↔ label mapping under a similar protocol may match or exceed them.

These omissions are deliberate. Patent prosecution is ongoing.

6. Pre-registered retraction conditions

We retract this paper if any of the following occur within 6 months of publication:

An independent third party demonstrates that the public corpus + oracle prompts + threshold $\tau = 0.22$ do not reproduce the headline 82.4 % out-of-sample figure (± 5 pp tolerance).
The realistic-FPR-corrected lower bound (68.8 %) is shown to be incorrect under a published challenge negative-control distribution.
A documented case is presented in which our coverage metric counts as "covered" a phrase whose semantic content is provably not in the system output.
The dictionary's monotonic-growth property fails on a controlled experiment (i.e.\ adding more labelled images decreases held-out coverage).
The latency figures fail to reproduce on a comparable bare-metal AMD EPYC 9354P 4-vCPU configuration with all 16 stages enabled.
Patent prosecution invalidates the architectural novelty — in which case we will reissue with full architectural disclosure and remove the patent-protective abstractions of § 2.

7. Conclusion

We presented an empirical study of a CPU-only image-understanding system that approximates VLM-class output coverage through cooperating perceptual stages and a learned pattern dictionary stabilised by an iterative bootstrap loop. The architecture achieves 82.4 % out-of-sample coverage (realistic-FPR-corrected lower bound 68.8 %) on 47 strictly held-out natural images labelled by two independent VLM oracles, at end-to-end latency of 1.5 – 2.0 s on a 4-vCPU machine. The system is deployed and publicly accessible.

We deliberately withhold the specific composition and dictionary mechanism, as it is the subject of a pending patent. The numerical evidence here is reproducible by any researcher with access to the documented oracle prompts, the public image corpus, and a comparable VLM API budget — and we encourage independent verification of the measurements.

The decisive empirical observation — that a learned pattern dictionary, populated by an iterative oracle-bootstrap loop and matched at inference time by nearest-neighbour in signature space, produces an immediate +8.6 pp jump in coverage that no amount of additional feature extraction had been able to achieve — is consistent with the cortex-inspired hypothesis that dynamics → dictionary → name is a usable architecture for vision-language coverage on CPU-class hardware.

Appendix A · Detailed per-round results

Round 1   (lib= 15)   78.4%   (200/255 phrases on 10 images)
Round 2   (lib= 25)   81.7%   (214/262 phrases on 10 images)
Round 3   (lib= 37)   79.9%   (262/328 phrases on 12 images)
Round 4   (lib= 47)   82.8%   (198/239 phrases on 10 images)
Round 5   (lib= 57)   99.1%   (111/112 phrases on  5 images)
─────────────────────────────────────────────────────────────
Aggregate             82.4%   (985/1196 phrases on 47 unique images)

Appendix B · System availability

Live endpoint: https://retina.frank.ink/v1/analyze
Health probe: https://retina.frank.ink/healthz
Gemini-compat shim: https://retina.frank.ink/v1beta/models/{model}:generateContent
Beta API keys (50 req/key/day, free) self-served at https://retina.frank.ink.

Appendix C · Reproducibility

The image corpus, oracle prompts, threshold-calibration script, holdout selection rule, and evaluation harness will be released in a follow-up once patent prosecution is complete. The trained dictionary itself constitutes a substantial portion of the contribution and is not publicly distributed.

The published measurements at retina.frank.ink/v1/analyze constitute live verification: any researcher can run the same images through the system and reproduce the numbers reported here within minutes.

References

Radford et al. Learning Transferable Visual Models From Natural Language Supervision. ICML 2021. (CLIP)
Jocher et al. YOLOv8: A New State-of-the-Art Computer Vision Model. Ultralytics. 2023.
Li, J., Li, D., Xiong, C. & Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. ICML 2022.
Oquab et al. DINOv2: Learning Robust Visual Features without Supervision. TMLR 2024.
Hubel D. & Wiesel T. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. 160 (1): 106 – 154 (1962).
Reid, R. C. & Alonso, J.-M. The processing and encoding of information in the visual cortex. Curr. Op. Neurobiol. 6 (4): 475 – 480 (1996).
Gemini Team. Gemini: A Family of Highly Capable Multimodal Models. Google. 2023 – 2026.
Meta AI. Llama 4 family. 2026.
Conway, J. H. Game of Life. Sci. Amer. 223 (4): 4 – 11 (1970).
Smith, R. Tesseract OCR. Google. 2007 – present.

Manuscript prepared 2026-05-20 at the Institute for Agentic Research, Austria. All measurements reproducible from commit hash 2cf700a of the deployed system. Patent application in preparation; no claims of priority are made or implied by this preprint outside the formal application.

Ende des Papers · Institut für Agentic Research · ZVR 1741094409 · 20. Mai 2026

Weiteres vom Institut

WORKING PAPER

Synthetic Psychology: A Practical Synthesis for Persistent Psychological Architectures in AI

WORKING PAPER

Alignment Is Misaligned

WORKING PAPER

Ablating a Stateful Agent

METHODOLOGY COMPANION

Operational Self-Model Density in Stateful LLM Agents