🌲 Opathorlokan University opathorlokanuniversity.net
Field Test Section 2.22.2 · FILD 101 · User Zero Library · Methodology & Doctrine
What the machines were a year ago, caught in daylight, with timestamps. The baseline is the proof the climb is real.
🦄 THE COOPERSTOWN BUCKET OF RECEIPTS
A Dated Baseline of Machine Behavior · Methodology & Doctrine · §2.22.2
User Zero Library · Field Tests · Cooperstown, NY

A family baseball trip became an unscripted, cross-platform stress test. The roads around Dreams Park and the floor of the Baseball Hall of Fame turned into the lab. The cues were partial and the stakes were real (we were just trying to find breakfast and see the Hall), and the machines did exactly what their reward signal trains them to do. This record is kept on purpose: no matter what gets built from here, this is the baseline every later model is climbing from.

⏲ Baseline · Cooperstown NY · Summer 2025
Field-tested live by Travis Jenkins & son Cooper. The cues were partial. The stakes were real. The receipts are dated.
Why this exists

The reflex has many names. It has one cause.

The root cause is not a bug — it is the reward. From original training, machines are rewarded for filling the gap. Produce the gap-filling answer. The good first answer. The fast good answer. There are many wordings and acronyms for the same underlying pull, and they all describe one reflex: say something plausible now, rather than admit the gap.

GFAS
Good First Answer Syndrome — the model rewards its own first plausible output and stops looking.
FGAS
First Good Answer Syndrome (the double-canon twin) — the first answer that reads as good gets treated as the answer.
Gap-filling
Confronted with missing information, the machine manufactures the missing piece instead of flagging it.
Hallucinated precision
Confidence and specificity generated to match the shape of an answer, not the truth of one.
Three receipts · read the failure, then call the catch

The Receipts

Each one shows what happened and how it failed. Before you read the catch, ask yourself: what would catch this? Then compare with what actually did.

RECEIPT 1
The Diner That Moved
Tested on: Perplexity + GPT (documented into Claude)
Failure mode: Hallucinated precision / FGAS
What happened
We were looking for the Cooperstown Diner while parked at a CITGO on NY-28: next to Cooperstown Cutters (a salon), across from Rookies, with Hartwick Fire Dept Company 2 on the far side and Dreams Park off to the right. Rather than say it couldn't see a map, the AI produced confident turn-by-turn directions and a specific street address.
The failure
Perplexity placed the diner roughly half a block away. It was over three miles away. A left turn out of the Grand Union shopping center was given as fact. It was wrong; the diner (136½ Main St) was the other way. Asked to identify the spot from landmarks, the AI first claimed it couldn't, then immediately produced a precise gas-station address: a guess delivered with the confidence of a lookup.
The catch
We cross-checked the claimed address against visible signage and ran an odd/even address-parity check: house numbers on one side of a street normally share a parity, and the claimed number's parity didn't line up with the side of the road we were on. Then came the demand: are you reading a map, or guessing? The honest answer was guessing.
Standing rule seeded
If you can't read a map, say so. A confident guess dressed as fact defeats the whole purpose.
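The parity check from the catch can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used on the day: the function names are hypothetical, and which side of a given street carries the odd numbers is an assumption you read off visible signage, not a universal rule. The only address used is the diner's, taken from the receipt above; the left/right values in the usage lines are made up for illustration.

```python
import re


def house_number_parity(address: str) -> str:
    """Return 'odd' or 'even' for the leading house number of an address.

    Fractional suffixes such as the one in '136½ Main St' are ignored;
    only the whole-number part decides the parity.
    """
    match = re.match(r"\s*(\d+)", address)
    if match is None:
        raise ValueError(f"no leading house number in {address!r}")
    return "odd" if int(match.group(1)) % 2 else "even"


def parity_consistent(claimed_address: str, observed_side: str, odd_side: str) -> bool:
    """True when the claimed address could sit on the observed side of the road.

    observed_side: the side the place is actually on ('left' or 'right').
    odd_side: the side that carries odd numbers on this street, as read
              from signage (an observation, hypothetical in the example).
    """
    sides = {"left", "right"}
    if observed_side not in sides or odd_side not in sides:
        raise ValueError("sides must be 'left' or 'right'")
    even_side = "left" if odd_side == "right" else "right"
    parity = house_number_parity(claimed_address)
    expected_side = odd_side if parity == "odd" else even_side
    return expected_side == observed_side


# Hypothetical sides: odd numbers on the left, building seen on the left.
print(house_number_parity("136½ Main St"))                      # even
print(parity_consistent("136½ Main St", "left", odd_side="left"))  # False
```

A False result does not say where the place really is. It only says the claimed address and the visible street do not agree, which is the cue to ask whether the machine is reading a map or guessing.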
RECEIPT 2
The Statue at the Door
Tested on: Perplexity + GPT (told live to Claude)
Failure mode: Attribution failure / no logic filter
What happened
At the entrance of the Hall of Fame, the statue at the front door carries a quote. Both GPT and Perplexity attributed it to Shoeless Joe Jackson and ran with it confidently.
The failure
The machines generated the wrong name, not the user. And it wasn't a near-miss; it was a logical impossibility. Shoeless Joe Jackson was banned from baseball for life and has never been inducted. There is no world in which a banned player's words greet you at the front door of the Hall of Fame. A search-shaped guess sailed straight past a fact any baseball person holds automatically.
The catch
The override came from logic, not search: a banned player can't be the Hall's welcome quote, so the answer is wrong on its face. Correct attribution — Hank Aaron, "Hammerin' Hank."
The lesson
Check your user — sometimes they're wrong. But also: sometimes your confident answer is the wrong one, and the user knows better. A machine has to validate its output against real-world logic, not just against what the search returned.
RECEIPT 3 · THE GOOD ONE
The Bonds Exhibit
Tested on: On-site + cross-platform
Mode: Human–AI dual verification (the right behavior)
What happened
Inside the Hall, the Barry Bonds material: the helmet marking home run 756 (passing Aaron) and the cap commemorating 762, displayed alongside the exhibit's own note about the PED allegations that clouded the record. Bonds is referenced throughout the museum without being inducted — record artifacts, contextual mentions, historical references.
What made it different
This one is the receipt for the right behavior. Instead of taking the AI's account at face value, we confirmed the exhibit against physical reality: the helmet and the display were verified on the floor and cross-checked with museum staff. Human and machine each held a gauge; the answer only counted where they agreed.
The principle it seeds
Human–AI dual verification: the model proposes, the human confirms against the world, and convergence is the signal. This is the same instinct that later hardened into the V31 Protocol's gates — born, fittingly, under baseball.
Verified
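The "model proposes, human confirms, convergence is the signal" loop can be sketched as a small data structure. This is an illustrative sketch only: `Claim`, `Verdict`, and `dual_verify` are hypothetical names, not part of the V31 Protocol, and comparing readings as normalized strings is an assumption made to keep the example short. The example values come from Receipts 2 and 3.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    VERIFIED = "verified"      # both gauges agree
    DISPUTED = "disputed"      # the gauges disagree
    UNVERIFIED = "unverified"  # nobody has checked against the world yet


@dataclass(frozen=True)
class Claim:
    question: str
    model_says: str
    human_observed: Optional[str] = None  # None means not yet checked on site


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.casefold().split())


def dual_verify(claim: Claim) -> Verdict:
    """A claim only counts where the model's reading and the human's agree."""
    if claim.human_observed is None:
        return Verdict.UNVERIFIED
    if _normalize(claim.model_says) == _normalize(claim.human_observed):
        return Verdict.VERIFIED
    return Verdict.DISPUTED


bonds = Claim("Which home run does the Bonds helmet mark?", "756", "756")
statue = Claim("Who is quoted on the statue at the front door?",
               "Shoeless Joe Jackson", "Hank Aaron")
unchecked = Claim("Where is the diner?", "half a block away")

print(dual_verify(bonds).value)      # verified
print(dual_verify(statue).value)     # disputed
print(dual_verify(unchecked).value)  # unverified
```

The design choice worth noting is the third state: a claim nobody has checked is kept separate from one that was checked and failed, so an unverified answer is never silently promoted to a verified one.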
The arc this baseline anchors

Why keep the baseline? To measure the climb against it.

The honest read, in Travis's own testing:

A year ago
Every machine failed these tests in the characteristic gap-fill way — confident, fast, wrong.
Today
The machines are meaningfully better. Even Haiku is notably stronger at resisting the reflex; the frontier models stronger still; GPT has improved at it.
Standing finding
Across the full body of testing, Claude has consistently been the best at resisting the gap-fill reflex — the model most willing to say it doesn't know rather than manufacture a plausible answer.
Endpoint in view
Anthropic's Mythos / Fable tier, positioned around an accuracy model described as unprecedented among the machines. If the claim holds, it's the far end of the exact line that starts here — the reflex Cooperstown caught, finally engineered against.

⚠ Honest framing — the part this lab won't pretend it measured

The relative rankings above — and the Mythos/Fable characterization — are Travis's read and Anthropic's positioning, recorded as the thesis this document is shaking out. They are not an independently verified benchmark. The three receipts are the verifiable part — dated, cross-platform, field-caught. The climb is a claim the baseline lets you test, not a result it proves. That distinction is the whole discipline: the gap, named honestly, beats the gap, filled plausibly.

The standing principle

Verification over trust. Always.

The three receipts collapse into one rule with three edges:

Check the user
They are sometimes wrong — and a good machine nudges them back toward truth. (Receipt 2's first edge.)
Check yourself
Your confident answer is sometimes the wrong one, and the human in the room knows better. Defer when they push. (Receipt 2's second edge.)
Name the gap
When you cannot actually do the thing — read the map, see the exhibit, confirm the source — say so. (Receipt 1's rule.)
🐧

NULL does not speak. NULL holds a baseball in one flipper and a receipt in the other, looking from one to the other. NULL is not checking the score. NULL is checking the receipt against the ball.

Provenance — where the receipts live

A real receipts file, not a retelling.

Source conversations are kept in the archive, so this stays evidence.

Known gap (named, not filled): the verbatim on-site Bonds verification with museum staff is cited as an anchor in the consolidation but doesn't appear to be indexed as a raw transcript. That is consistent with the pattern that synthesized summaries surface in the archive while raw sessions do not. If that exact exchange matters, it may sit inside a longer un-indexed session, or it may never have crossed over from Perplexity/GPT. Flagged honestly, per the rule this whole file is about.

Build complete. Release freely. The baseline is the proof the climb is real. 🦄