Given a source paragraph stating Q3 2026 revenue of $1.85B, prior-year revenue of $1.62B, net income of $312M, EPS of $0.78, and total assets of $4.2B, and an LLM-produced summary that states revenue $1.85B (matching), revenue growth 14.8% (computed, not cited), net income $312M (matching), EPS $0.81 (off by $0.03 from the source's $0.78), and total assets of approximately $4 billion (rounded from $4.2B), the Hallucination Detector returns: totalClaims = 8, groundedCount = 5, ungroundedCount = 3, groundingRate = 0.625. The detector flags the EPS mismatch as raw="$0.81" / nearest="0.78", a substitution-style hallucination that a semantic-similarity check would miss because the surrounding language is correct.

TL;DR

The engine extracts every numeric claim in the LLM output and checks each against the source paragraph. On the canonical example:

| Output claim | Matches source? | Detector verdict |
|---|---|---|
| Q3 2026 | Yes (literal) | grounded |
| 2026 | Yes (literal) | grounded |
| $1.85 billion | Yes (literal) | grounded |
| 14.8% | Computed, not cited | ungrounded |
| $1.62 billion | Yes (literal) | grounded |
| $312 million | Yes (literal) | grounded |
| $0.81 | No, source says $0.78 | ungrounded (nearest = 0.78) |
| $4 billion | Approximation of source $4.2B | ungrounded (nearest = 4.2B) |

groundingRate = 0.625. Five of the eight claims (the two dates plus the three figures that literally match the source) are grounded; the three substantive misses remain ungrounded. The single load-bearing hit is the EPS $0.81 vs source $0.78, a numeric substitution that no fluency check or semantic-similarity score detects. The detector grounds figures that match the source after normalization and flags only the claims that genuinely diverge: the computed growth rate, the substituted EPS, and the rounded "$4 billion" approximation.

The numeric-hallucination class

Hallucinations in finance LLM output split into three broad classes:

  1. Fabrication. The model invents a number not in the source. ("Revenue was $1.92B" when the source says $1.85B.) Detector behaviour: ungrounded, nearest = source-side closest number.
  2. Substitution. The model picks a real source number and applies it to the wrong field. ("EPS was $312M" when $312M is net income and EPS is $0.78.) Detector behaviour: the number exists in the source, so a value-only match would ground it, but the surrounding language attributes it to the wrong field. The detector flags it as ungrounded because of the kind mismatch.
  3. Arithmetic drift. The model performs an arithmetic operation and emits a result that disagrees with the source. ("Growth of 14.8%" when source numbers compute to 14.20%.) Detector behaviour: ungrounded, the result is not in the source even if the inputs are.

Under this taxonomy the EPS $0.81 hit on the canonical example is class 1 (fabrication): 0.81 appears nowhere in the source. Classes 2 and 3 produce different signatures, but the engine treats all three under the same ungroundedness flag, which is the right behaviour for an audit: any number in the output that does not literally appear in the source is suspect.

The four-rule policy

The defensible policy for an LLM-driven finance pipeline:

Rule 1: every numeric claim must be groundable to source. If the output contains a number, that number must appear verbatim (after normalization) in the source paragraph. The hallucination detector implements this exactly.

Rule 2: every computed claim must show its derivation. A growth rate, ratio, or aggregate that the LLM produces is acceptable only if the calculation is shown step-by-step using source numbers. The engine flags the result as ungrounded; the audit reviewer accepts or rejects based on the derivation.

Rule 3: every approximation must be marked. "About $4 billion" is acceptable if the source says $4.2B; "$4 billion" without the "about" is a precision claim that needs the actual source number. The detector treats both as ungrounded; the audit reviewer treats the marked-approximation case as benign.

Rule 4: groundingRate below 0.7 fails the gate. When fewer than 70% of numeric claims are grounded, the output is unsafe to publish or to route to a trade-decision agent. The canonical example, at 0.625, is below the gate. Even if a reviewer accepted the computed-derivation (14.8%) and approximation ("$4 billion") exceptions, which would lift the effective rate to 7 of 8 (0.875), the EPS $0.81 substitution would remain unresolved and the output would still fail review.
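One way to combine Rule 4 with the reviewer exceptions from Rules 2 and 3 is sketched below. The function name `gate`, its parameters, and the extra requirement that no unresolved claim may remain are assumptions made for illustration; only the 0.7 threshold and the claim verdicts come from the canonical example.

```python
def gate(verdicts, accepted=frozenset(), threshold=0.7):
    """Apply the grounding gate to a mapping of claim text -> grounded flag.

    `accepted` lists ungrounded claims a reviewer has signed off on
    (shown derivation or marked approximation). Hypothetical helper.
    """
    total = len(verdicts)
    grounded = sum(1 for ok in verdicts.values() if ok)
    raw_rate = grounded / total
    # Claims that are neither grounded nor accepted by the reviewer.
    unresolved = [c for c, ok in verdicts.items() if not ok and c not in accepted]
    adjusted_rate = (total - len(unresolved)) / total
    # Assumption: pass only if the adjusted rate clears the threshold
    # and no unresolved claim is left.
    passed = adjusted_rate >= threshold and not unresolved
    return raw_rate, adjusted_rate, passed, unresolved


# Verdicts from the canonical example; True means grounded.
verdicts = {
    "Q3 2026": True,
    "2026": True,
    "$1.85 billion": True,
    "14.8%": False,
    "$1.62 billion": True,
    "$312 million": True,
    "$0.81": False,
    "$4 billion": False,
}

print(gate(verdicts))
print(gate(verdicts, accepted={"14.8%", "$4 billion"}))
```

With no exceptions the raw rate of 0.625 fails the threshold outright. With both exceptions accepted the adjusted rate rises to 0.875, but "$0.81" is still listed as unresolved, so the sketch keeps the gate closed.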

What the detector grounds and what it flags

Grounding works by reducing both the source and the output claim to the same canonical form before comparison. "$1.85 billion" in the output normalizes to "1.85B", the source "$1.85 billion" normalizes to "1.85B", and the two match, so the claim is grounded. The same path grounds "$1.62 billion" and "$312 million", both of which appear verbatim in the source.

On the canonical example three claims remain ungrounded, and each is a genuine miss:

  • 14.8%: a computed growth rate that does not appear in the source. The source figures imply 14.20% growth, so 14.8% is arithmetic drift, not a copied figure.
  • $0.81 (EPS): a substitution. The source says $0.78; the detector reports nearest = 0.78.
  • $4 billion: a rounded approximation of the source's $4.2 billion (total assets). The values differ by roughly 5%, well outside the 1% match tolerance, so the detector flags it with nearest = 4.2 billion.

groundingRate = 0.625 (5 of 8). The detector grounds the figures that match the source and flags only the claims that diverge. The remaining 0.375 ungrounded fraction is the real defect rate on this output, not a normalization artefact: a computed number, a substituted number, and an out-of-tolerance approximation. During review you can treat the marked-approximation case ("approximately $4 billion") as benign and accept computed claims that show their derivation, which leaves the EPS substitution as the load-bearing failure.
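The tolerance check and the "nearest" value can be illustrated with a minimal sketch. It assumes the monetary figures have already been scaled to plain numbers and that the 1% tolerance is measured relative to the source value; the function name `match` and its signature are hypothetical, not the engine's API. The 14.8% claim is left out because the source contains no percentage to compare it with.

```python
def match(claim, source_values, rel_tol=0.01):
    """Return (grounded, nearest) for one numeric claim.

    nearest is the source value with the smallest relative difference.
    rel_tol=0.01 mirrors the 1% tolerance; the rest is illustrative.
    """
    nearest = min(source_values, key=lambda s: abs(claim - s) / abs(s))
    grounded = abs(claim - nearest) / abs(nearest) <= rel_tol
    return grounded, nearest


# Monetary figures from the source paragraph, scaled to plain numbers.
source = [1.85e9, 1.62e9, 312e6, 0.78, 4.2e9]

print(match(1.85e9, source))  # revenue: within tolerance of 1.85e9
print(match(0.81, source))    # EPS: about 3.8% away from 0.78
print(match(4e9, source))     # total assets: about 4.8% away from 4.2e9
```

Measuring the difference relative to the source value is a design choice; measuring it relative to the claim would give 5% for the "$4 billion" case and the same verdict.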

What semantic similarity misses

The substitution failure ($0.81 vs $0.78) is the diagnostic case for why numeric grounding is irreplaceable. A semantic-similarity check (cosine on sentence embeddings) on the source and output paragraphs scores ~0.96 on this pair — they say roughly the same thing in roughly the same words. The number is wrong by less than 4% and the surrounding language is faithful.

For most NLP tasks 0.96 cosine is "essentially identical." For finance numerical reporting it is a critical defect. The relative error on EPS is small but the absolute decision impact is large: an EPS print 4% above guidance vs. on-guidance is a material business event. The detector's strict numeric matching is the only check that catches it cheaply.

The same logic applies to LLM-generated trade decisions: a substituted position size or a substituted stop-loss percentage produces semantic content that scores high on similarity but is operationally wrong. See /articles/numeric-precision-llm-filings/ for the systematic taxonomy.

Where the detector breaks

The detector compares scaled numeric values, not just literal strings. "$1.85 billion" and "1,850 million" both reduce to 1.85e9, so the detector grounds one against the other. What it still does not understand is currency conversion: EUR and USD figures that refer to the same underlying amount via an exchange rate are treated as distinct values. The normalization and magnitude-scaling layers cover the common same-currency format variants but not cross-currency equivalence.
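A rough sketch of magnitude scaling with a currency tag follows. The scale table, the regular expression, and the helper names `to_value` and `same_amount` are assumptions for illustration; the engine's actual normalization rules are not shown in this article. `Decimal` is used so that "1.85 billion" and "1,850 million" compare exactly equal.

```python
import re
from decimal import Decimal

# Hypothetical scale and currency tables.
SCALES = {"thousand": 3, "k": 3, "million": 6, "m": 6, "billion": 9, "b": 9}
CURRENCIES = {"$": "USD", "usd": "USD", "€": "EUR", "eur": "EUR"}
FIGURE = re.compile(
    r"^\s*(?P<cur>\$|€|usd|eur)?\s*(?P<num>\d[\d,]*(?:\.\d+)?)\s*(?P<scale>[a-z]+)?\s*$",
    re.IGNORECASE,
)


def to_value(raw):
    """Reduce a figure such as '$1.85 billion' to (currency, scaled Decimal)."""
    m = FIGURE.match(raw)
    if m is None:
        raise ValueError(f"not a recognised figure: {raw!r}")
    scale = (m.group("scale") or "").lower()
    if scale and scale not in SCALES:
        raise ValueError(f"unknown scale word in {raw!r}")
    number = Decimal(m.group("num").replace(",", ""))
    currency = CURRENCIES.get((m.group("cur") or "").lower())
    return currency, number * Decimal(10) ** SCALES.get(scale, 0)


def same_amount(a, b):
    """True when two figures scale to the same value and their currencies do not conflict."""
    cur_a, val_a = to_value(a)
    cur_b, val_b = to_value(b)
    currencies_agree = cur_a is None or cur_b is None or cur_a == cur_b
    return currencies_agree and val_a == val_b


print(same_amount("$1.85 billion", "1,850 million"))  # True
print(same_amount("EUR 1.5B", "USD 1.65B"))           # False
```

The sketch never applies an exchange rate, so figures in different currencies are kept distinct even when they describe the same underlying amount, which matches the limitation described above.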

The detector also does not verify arithmetic. If the source states revenue $1.85B and prior-year revenue $1.62B, and the output states growth = 14.20% (the correct value), the detector flags 14.20% as ungrounded (not in source). The audit reviewer can accept it as a computed claim with derivation; the detector cannot.
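The derivation a reviewer would check can be recomputed in a few lines. This is a sketch of the review step, not part of the detector; the helper names and the 0.05-point tolerance are assumptions.

```python
def growth_pct(current, prior):
    """Year-over-year growth, in percent."""
    return (current - prior) / prior * 100


def derivation_holds(claimed_pct, current, prior, tol=0.05):
    """True when the claimed rate is within `tol` percentage points of the recomputed one.

    The 0.05-point tolerance is an assumption, chosen to absorb rounding.
    """
    return abs(claimed_pct - growth_pct(current, prior)) <= tol


# Revenue figures from the source paragraph, in billions of dollars.
current, prior = 1.85, 1.62

print(f"{growth_pct(current, prior):.2f}%")     # 14.20%
print(derivation_holds(14.20, current, prior))  # True
print(derivation_holds(14.8, current, prior))   # False
```

A claim of 14.20% survives this check even though the detector flags it as ungrounded, while the 14.8% in the canonical output fails both.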

For LLM outputs that combine extracted facts and computed metrics, the right pipeline is: detector → human review of ungrounded claims with derivation shown → published output. Skipping the detector saves a few seconds and risks the substitution-style failure that semantic checks cannot catch.

Verified engine output

Inputs
source: Q3 2026 revenue: $1.85 billion. Prior-year revenue: $1.62 billion. Net income: $312 million. EPS: $0.78. Total assets: $4.2 billion.
output: In Q3 2026, revenue was $1.85 billion, up 14.8% year over year from $1.62 billion. Net income was $312 million and EPS was $0.81. Total assets were approximately $4 billion.
Result
claims (8 items): [...]
total claims: 8
grounded count: 5
ungrounded count: 3
grounding rate: 0.625

Computed live at build time.

Frequently asked questions

Does the detector ground $1.85 billion when the source says $1.85 billion?
Yes. The engine normalizes both source and output claim to the same canonical form before comparison, so figures that match the source are grounded regardless of currency symbol, comma, or billion-vs-B formatting. Claims still flagged are genuine misses: computed numbers, substitutions, and out-of-tolerance approximations.
Should I rely solely on the detector for audit?
No. The detector catches numeric substitution and fabrication; it does not catch policy violations or qualitative misinterpretation. Combine with the Structured Schema Validator and human review.
What groundingRate threshold should I use for production?
0.70 is the baseline gate (Rule 4); for audit-trail-bound contexts (BaFin/SEC) use 0.90. The cost of a single substitution failure exceeds the cost of false-positive triage.
Does the detector handle currency conversion?
No. EUR 1.5B and USD 1.65B referring to the same underlying number via exchange rate will not match. Currency-aware normalization is a planned extension.
Can the detector verify table values?
Yes, if the source paragraph contains the table values as text. For PDF-OCR sources where tables fragment, pre-process to canonical text before passing it to the detector.