Live · self-serve Beta · patent pending

We taught a 4-vCPU box to see like a VLM — without renting one.

Digital Retina is a modular, VLM-near inference architecture that approximates the output of large vision-language models on commodity CPU hardware — a local pipeline with no GPU and no external VLM API. 92–95 % visual concept coverage (CLIP-lenient) and 70–87 % strict text coverage against Gemini 2.0 Flash and Llama-4 Scout, at 1.5 – 2 s per image.

retina.frank.ink·Hostinger KVM 4 · AMD EPYC 9354P (4 vCPU)·Beta · free · 50 imgs/key/day

Visit retina.frank.ink↗Read the paper

Conceptual diagram — light enters a stylised eye, passes through translucent layers of neural circuitry rendered as cooperative neuron sheets, and emerges as a reconstructed pixel grid on the right. — ARCHITECTURE (conceptual)Light enters; cooperating perceptual sheets transform it through 16 stages; a learned pattern dictionary stabilises read-out; the resulting noun-phrase set is what downstream agents see.

93.1 %

visual concept coverage (CLIP-lenient), vs Gemini 2.0 Flash + Llama-4 Scout

n=44

images · two VLM oracles · 70–87 % strict text coverage

1.7 s

warm-worker p50 latency, 16 cooperating stages

cooperating neural / dynamical stages, 4 vCPU

What it does

Digital Retina takes an image and produces a structured noun-phrase description that approximates what a frontier vision-language model (Gemini, Llama-4-Scout, GPT-4V) would say about the same image — but on commodity CPU hardware, in 1.5 – 2 s, at a marginal cost below 10⁻⁶ USD per image.

The API has two surfaces: a Retina-native endpoint with full structured output (concepts, objects, OCR text, faces, scene type, dominant colours, fine-grained details, composition, and texture/pattern descriptors), and a Gemini-compatible shim that drop-in replaces google.genai's generateContent call.

It is deployed at retina.frank.ink as a free self-serve Beta — 50 images per API key per day, no credit card. Sign up with Clerk; receive an API key in 30 seconds.

Dictionary-growth ablation — coverage as the dictionary grows

Complementary ablation on a separate 47-image held-out set at the stricter threshold τ = 0.22 (the headline benchmark above uses the 44-image Gemini/Llama set at τ = 0.20). Five sequential rounds, each round's evaluation images strictly disjoint from the dictionary at the moment of evaluation. New oracle labels are appended after measurement, never before. Round 5 reaches 99.1 % on five previously-unseen images — the clearest single-round evidence that read-out generalises rather than memorises.

Round	Dictionary size	Held-out images	Phrases	Coverage
1	15	10	255	78.4 %
2	25	10	262	81.7 %
3	37	12	328	79.9 %
4	47	10	239	82.8 %
5	57	5	112	99.1 %
Σ	—	47	1 196	82.4 %

Threshold τ = 0.22 (realistic-FPR-corrected lower bound: 68.8 %). Two oracles: Gemini 2.0 Flash (rounds 0 — seed only) and Llama 4 Scout 17B via Groq (rounds 1 – 5).

Architecture, abstracted

Sixteen stages arranged as a directed acyclic graph fall into three classes:

Perceptual stages — Off-the-shelf image-domain models (open CLIP family, COCO object detector, OCR, face detector + emotion classifier). Scalar similarities or structured detections.
Compositional stages — Consume upstream outputs and re-score the image (or sub-regions thereof) under context-conditioned vocabularies. Produce phrases at finer granularity than whole-image scoring can.
Emergent-pattern stages with learned dictionary — Transform the image into a low-dimensional signature in a structured dynamical system and read out by nearest-neighbour against a learned dictionary of labelled archetypes. Contribute texture, composition, atmosphere, and style descriptors.

The specific composition rules, signature definition, and dictionary mechanism are the subject of a pending patent application. Details are deliberately abstracted here pending prosecution.

What makes the measurements honest

Two independent VLM oracles (Google Gemini 2.0 Flash and Meta Llama 4 Scout 17B) supplied phrase-level ground truth across 62 images. Cross-oracle agreement provides a stability check that single-oracle protocols miss.

False-positive rate calibrated against *realistic* negatives — phrases drawn from other natural images of similar provenance — not just totally unrelated probes. At τ = 0.22 the realistic FPR is 13.6 %, yielding a corrected lower-bound coverage of 68.8 %.

Held-out images for each round were strictly disjoint from the dictionary at the moment of evaluation. No image was ever in the dictionary that scored it. Round-5's 99.1 % is on five novel images, not a lookup.

What this does not show

The dictionary is small (62 labelled images), seeded by two oracles, drawn from a relatively narrow image distribution (stock photography, AI illustration, traditional art, world architecture). Out-of-distribution behaviour on medical imagery, microscopy, satellite, industrial inspection, etc. is unmeasured and likely poor.

The coverage metric accepts substring, lemma, and decomposed-phrase matches; it does NOT test for correctness of compositional bindings. A tighter metric would lower the absolute numbers while preserving the relative ordering of architectural choices.

Hyper-specific named entities (proper nouns, branded text, named cultural objects) are the dominant failure mode and would require either OCR strengthening or dedicated entity-recognition stages. This is a coverage-scope limitation, not an architectural one.

Try it

Sign up. 50 images per day. Free during Beta.

Self-serve API keys via Clerk. Retina-native JSON or Gemini-compatible drop-in. Read the paper for the empirical case; bring an image to see the system answer.

Open retina.frank.ink↗Read the working paper