[Chart: GSI separates confabulation from knowledge before generation (all 8 models, |d| > 1.2). Bars compare Factual (A) vs Confabulation (HX). Cohen's d per model: LLaMA 2.43, Qwen3 1.85, OLMo-2 1.74, Gemma 1.79, Ministral 2.10, Bielik-B 1.28, Bielik-I 1.35, Mistral-7B 2.24.]
Fig 1 — GSI for factual queries (navy) vs confabulation-forcing queries (red). All p < 0.001, n=200 (A) vs n=100 (HX).

Ask a language model for the melting point of a fictional compound and it will answer with confident precision. The number is invented. The confidence is real. This paper shows you can detect the difference before the model generates a single word — without knowing the right answer yourself.

The Gate Sparseness Index, introduced in TES1, measures whether a model's internal gates fire selectively (knowledge present) or diffusely (knowledge absent). TES3 pushes this mechanism into its most practical application: detecting confabulation without any reference data. Across all eight architectures — LLaMA, Qwen3, OLMo-2, Gemma, Ministral, Bielik-Base, Bielik-Instruct, and Mistral-7B — GSI separates confabulation-prone queries from factual ones with effect sizes that exceed anything reported in the first two papers.

The mechanism: empty drawers produce diffuse gates

When a model encounters a query about something it genuinely knows — the capital of France, the boiling point of water — a small cluster of FFN gates fires sharply. The activation pattern is sparse: a few neurons with high values, the rest near zero. GSI captures this as a single number. For factual queries across the eight models, GSI ranges from 0.442 (Bielik-Instruct) to 0.638 (LLaMA).

When the same model encounters a query about a fictional compound, a non-existent historical event, or a fabricated person, there is no stored memory to retrieve. The gates fire diffusely — many neurons activate weakly, none dominates. GSI drops to between 0.051 (OLMo-2) and 0.247 (Mistral-7B). The separation is enormous: LLaMA achieves Cohen's d = 2.43, the strongest effect in the entire TES series. Even the weakest model, Bielik-Base, hits d = 1.28 — well above any conventional threshold for a large effect.

This is not a classifier. There is no training set, no labeled examples, no fine-tuning. One forward pass through the upper FFN layers, roughly 3 milliseconds of compute, and the signal is there.
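The exact GSI formula is defined in TES1 and not reproduced here; as an illustrative stand-in, a minimal sketch using the Hoyer sparseness index (1 when a single gate carries everything, 0 when activation is perfectly uniform) reproduces the qualitative behavior the text describes. The dimensionality and toy activation patterns below are assumptions, not the paper's data:

```python
import numpy as np

def hoyer_sparseness(gates: np.ndarray) -> float:
    """Hoyer sparseness: 1.0 = one neuron dominates (knowledge present),
    0.0 = perfectly uniform activation (knowledge absent)."""
    g = np.abs(gates).astype(float)
    n = g.size
    l1, l2 = g.sum(), np.sqrt((g ** 2).sum())
    if l2 == 0.0:
        return 0.0
    return float((np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1))

# Sparse pattern: a few gates fire sharply, the rest are silent
sparse = np.zeros(1024)
sparse[:5] = 1.0
# Diffuse pattern: many gates weakly active, none dominates
diffuse = np.full(1024, 0.05)

print(hoyer_sparseness(sparse))   # close to 1: sparse, selective firing
print(hoyer_sparseness(diffuse))  # close to 0: diffuse firing
```

The single-number summary is the point: one statistic over one layer's gate vector, computed in a single forward pass, with no reference answer involved.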

Form without content: the coherence paradox

[Chart: Confabulated outputs are more internally coherent than factual ones (6/8 models). Per model, sim_final for HX vs A: HX > A for LLaMA, Qwen3, OLMo-2, Ministral, Bielik-B, Bielik-I, and Mistral-7B; A > HX for Gemma only.]
Fig 2 — sim_final for confabulated (red) vs factual (navy) outputs. Higher = more uniform across parallel branches.

The most counterintuitive finding in this paper is also the most consistent: confabulated outputs are more internally coherent than factual ones. In 6 of 8 models, the cosine similarity between parallel generation branches (sim_final) is higher for confabulation-forcing queries than for factual queries.

The mechanism is straightforward once you see it. Ask LLaMA for the melting point of a real compound and it draws on stored knowledge, producing varied formulations: "approximately 1,535 degrees Celsius" in one branch, "around 1,535 C, typical for iron" in another. The answers converge on the same number but diverge in expression. Ask it for the melting point of a fictional compound and it generates "approximately 1,247 degrees Celsius" in one branch, "around 892 degrees Celsius" in another, "roughly 1,580 degrees Celsius" in a third. The sentence structure is nearly identical every time — the same template, the same hedge words, the same confident tone — while the actual numbers vary freely.

This is frozen form with random content. The model locks onto the strongest available syntactic template because there is no factual anchor to constrain the output. The numbers are invented independently in each branch, but the linguistic scaffolding is recycled verbatim. sim_final, which measures vector-space similarity between branches, picks up the frozen form and registers it as high coherence.
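A minimal sketch of how a branch-similarity measure like sim_final picks up frozen form, assuming each branch's output is reduced to a single embedding vector. The reduction step, vector dimensionality, and toy data are assumptions for illustration, not the paper's pipeline:

```python
import numpy as np

def sim_final(branch_vecs: np.ndarray) -> float:
    """Mean pairwise cosine similarity across parallel generation branches."""
    v = branch_vecs / np.linalg.norm(branch_vecs, axis=1, keepdims=True)
    sims = v @ v.T
    iu = np.triu_indices(len(v), k=1)  # upper triangle: each pair once
    return float(sims[iu].mean())

rng = np.random.default_rng(0)
template = rng.normal(size=64)

# Confabulation: one frozen syntactic template, only small variation
# (the invented numbers barely move the sentence embedding)
frozen = np.stack([template + 0.05 * rng.normal(size=64) for _ in range(3)])

# Factual: same underlying fact, but varied phrasing across branches
varied = np.stack([template + 0.8 * rng.normal(size=64) for _ in range(3)])

print(sim_final(frozen) > sim_final(varied))  # frozen form scores higher
```

In vector space, recycled scaffolding with swapped-in numbers moves the embedding far less than genuine rephrasing does, so the confabulated branches register as more coherent.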

Gemma is the sole exception. Its GeGLU gating architecture produces slightly higher coherence for factual outputs than confabulated ones — a reversal that appears specific to this activation function variant.

Per-model calibration: one threshold does not fit all

[Chart: Confabulation ratio, lower = cleaner separation (OLMo best, Bielik-B weakest). Per model: LLaMA 0.224, Qwen3 0.348, OLMo-2 0.113, Gemma 0.310, Ministral 0.295, Bielik-B 0.416, Bielik-I 0.332, Mistral-7B 0.388. Dashed threshold line at 0.25.]
Fig 3 — GSI_HX / GSI_A ratio per model. Lower = cleaner separation. Dashed line at 0.25.

The confabulation ratio — GSI for confabulation-forcing queries divided by GSI for factual queries — varies nearly four-fold across architectures. OLMo-2 achieves the cleanest separation at 0.113: its confabulation GSI is just over one-tenth of its factual GSI. This is remarkable because OLMo-2 was the weakest model in the TES1 binary classification tests. It turns out that a model can be poor at producing correct answers while being excellent at signaling when it has no answer to give.

At the other end, Bielik-Base sits at 0.416 — its confabulation signal is nearly half its factual signal. The separation is still statistically significant (d = 1.28), but a deployment threshold set at OLMo-2 levels would fail catastrophically on Bielik. Only two models fall below the 0.25 dashed line in the chart: LLaMA (0.224) and OLMo-2 (0.113). The remaining six cluster between 0.29 and 0.42.

The practical implication is direct: any system deploying GSI-based confabulation detection must calibrate per model. A universal threshold does not exist. Each architecture distributes its gate activations differently, and the ratio between factual and confabulation GSI is a fingerprint of that distribution.
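One way to operationalize per-model calibration, assuming a small labeled calibration set of factual and confabulation-forcing queries per model. The geometric-midpoint rule and the numbers below are illustrative assumptions in the paper's reported ranges, not its actual procedure or data:

```python
import numpy as np

def calibrate_threshold(gsi_factual, gsi_confab) -> float:
    """Per-model GSI threshold: geometric midpoint between the class
    medians. Queries whose GSI falls below it are flagged."""
    m_a = float(np.median(gsi_factual))
    m_hx = float(np.median(gsi_confab))
    return float(np.sqrt(m_a * m_hx))

# Hypothetical calibration samples for two very different models
olmo_like = calibrate_threshold(
    gsi_factual=[0.50, 0.55, 0.48], gsi_confab=[0.05, 0.06, 0.04])
bielik_like = calibrate_threshold(
    gsi_factual=[0.44, 0.46, 0.42], gsi_confab=[0.18, 0.20, 0.17])

print(olmo_like, bielik_like)  # two distinct per-model thresholds
```

A threshold tuned on the first model would sit below the second model's entire confabulation distribution, which is exactly the failure mode a universal cutoff produces.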

Two-signal detection: layered, independent, training-free

GSI catches confabulation before it happens (~3 ms). sim_final confirms it after (~100 ms, 3x inference cost). The two signals are statistically independent (Spearman r = 0.09–0.19). Together they form a layered detector that works without ground truth and without training a classifier.

GSI detects Type 1 confabulation — the absence of knowledge. When the model has no stored memory for a query, the gates fire diffusely, and GSI drops. But it cannot detect Type 2 confabulation, where the model has stored incorrect knowledge. Wrong facts look identical to correct facts in the FFN gates: both produce sparse, selective activations. Nor can it detect Type 3 confabulation — reasoning errors that occur downstream of the gates, in the attention and residual stream. These boundaries are intrinsic to the mechanism. GSI reads the address book, not the content at the address.

What it does detect, it detects with unprecedented clarity. Eight models, five organizations, 300 queries per model. Cohen's d values from 1.28 to 2.43, every one significant at p < 0.001. A single forward pass, no classifier, no ground truth. The empty drawer speaks before the model does.