Live · self-serve Beta · patent pending
We taught a 4-vCPU box to see like a VLM — without renting one.
Digital Retina is a multi-tier image-understanding system that approximates the output distribution of large vision-language models on commodity CPU hardware. Sixteen cooperating perceptual stages feed a learned pattern dictionary that stabilises read-out. 82.4 % out-of-sample phrase coverage against two independent VLM oracles on 47 strictly held-out images, at 1.5 – 2 s per image, no GPU.

out-of-sample phrase coverage, n = 47 held-out images, 1 196 phrases
strictly held-out images, two independent VLM oracles
warm-worker p50 latency, 16 cooperating stages
cooperating neural / dynamical stages, 4 vCPU
What it does
Digital Retina takes an image and produces a structured noun-phrase description that approximates what a frontier vision-language model (Gemini, Llama-4-Scout, GPT-4V) would say about the same image — but on commodity CPU hardware, in 1.5 – 2 s, at a marginal cost below 10⁻⁶ USD per image.
The API has two surfaces: a Retina-native endpoint with full structured output (concepts, objects, OCR text, faces, scene type, dominant colours, fine-grained details, composition, and texture/pattern descriptors), and a Gemini-compatible shim that drop-in replaces google.genai's generateContent call.
It is deployed at retina.frank.ink as a free self-serve Beta — 50 images per API key per day, no credit card. Sign up with Clerk; receive an API key in 30 seconds.
Out-of-sample coverage as the dictionary grows
Five sequential rounds, each round's evaluation images strictly disjoint from the dictionary at the moment of evaluation. New oracle labels are appended after measurement, never before. Round 5 reaches 99.1 % on five previously-unseen images — the clearest single-round evidence that read-out generalises rather than memorises.
| Round | Dictionary size | Held-out images | Phrases | Coverage |
|---|---|---|---|---|
| 1 | 15 | 10 | 255 | 78.4 % |
| 2 | 25 | 10 | 262 | 81.7 % |
| 3 | 37 | 12 | 328 | 79.9 % |
| 4 | 47 | 10 | 239 | 82.8 % |
| 5 | 57 | 5 | 112 | 99.1 % |
| Σ | — | 47 | 1 196 | 82.4 % |
Threshold τ = 0.22 (realistic-FPR-corrected lower bound: 68.8 %). Two oracles: Gemini 2.0 Flash (rounds 0 — seed only) and Llama 4 Scout 17B via Groq (rounds 1 – 5).
Architecture, abstracted
Sixteen stages arranged as a directed acyclic graph fall into three classes:
- Perceptual stages — Off-the-shelf image-domain models (open CLIP family, COCO object detector, OCR, face detector + emotion classifier). Scalar similarities or structured detections.
- Compositional stages — Consume upstream outputs and re-score the image (or sub-regions thereof) under context-conditioned vocabularies. Produce phrases at finer granularity than whole-image scoring can.
- Emergent-pattern stages with learned dictionary — Transform the image into a low-dimensional signature in a structured dynamical system and read out by nearest-neighbour against a learned dictionary of labelled archetypes. Contribute texture, composition, atmosphere, and style descriptors.
The specific composition rules, signature definition, and dictionary mechanism are the subject of a pending patent application. Details are deliberately abstracted here pending prosecution.
What makes the measurements honest
01
Two independent VLM oracles (Google Gemini 2.0 Flash and Meta Llama 4 Scout 17B) supplied phrase-level ground truth across 62 images. Cross-oracle agreement provides a stability check that single-oracle protocols miss.
02
False-positive rate calibrated against *realistic* negatives — phrases drawn from other natural images of similar provenance — not just totally unrelated probes. At τ = 0.22 the realistic FPR is 13.6 %, yielding a corrected lower-bound coverage of 68.8 %.
03
Held-out images for each round were strictly disjoint from the dictionary at the moment of evaluation. No image was ever in the dictionary that scored it. Round-5's 99.1 % is on five novel images, not a lookup.
What this does not show
The dictionary is small (62 labelled images), seeded by two oracles, drawn from a relatively narrow image distribution (stock photography, AI illustration, traditional art, world architecture). Out-of-distribution behaviour on medical imagery, microscopy, satellite, industrial inspection, etc. is unmeasured and likely poor.
The coverage metric accepts substring, lemma, and decomposed-phrase matches; it does NOT test for correctness of compositional bindings. A tighter metric would lower the absolute numbers while preserving the relative ordering of architectural choices.
Hyper-specific named entities (proper nouns, branded text, named cultural objects) are the dominant failure mode and would require either OCR strengthening or dedicated entity-recognition stages. This is a coverage-scope limitation, not an architectural one.
Try it
Sign up. 50 images per day. Free during Beta.
Self-serve API keys via Clerk. Retina-native JSON or Gemini-compatible drop-in. Read the paper for the empirical case; bring an image to see the system answer.