aifinhub

Hallucination Detector — Finance

Paste a source document and an LLM extraction of it. Every numeric claim in the extraction is cross-checked against the source, and ungrounded claims are flagged. Runs in your browser. Free.

Inputs: Source document + LLM extraction
Runtime: Instant
Privacy: Client-side · no upload
API key: Not required

Education · Not investment advice. BaFin/EU framework. Past performance does not indicate future results.

Grounding score

57%

4 / 7 numeric claims found in the source. 3 ungrounded · 4 grounded.

Numeric claims only: currencies, percents, plain numbers ≥ 1000, and dates. Prose-level checks are not in scope.
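The score above is plain arithmetic: grounded claims divided by total numeric claims. A minimal sketch follows; the function name `groundingScore`, the rounding to a whole percent, and the `null` result for an output with no numeric claims are assumptions for illustration, not the tool's documented behavior.

```javascript
// Grounding score: share of numeric claims found in the source, as a whole percent.
// Rounding and the empty-input convention are assumed, not taken from the tool.
function groundingScore(grounded, total) {
  if (total === 0) return null; // no numeric claims, so no score (assumed convention)
  return Math.round((grounded / total) * 100);
}

// The example on this page: 4 grounded out of 7 claims.
const exampleScore = groundingScore(4, 7); // 4 / 7 = 0.5714..., rounds to 57
```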

1 · Source document

2 · LLM extraction / output

3 · Markup

Company: SYNTHETIC_A Corp. Period: fiscal year ended 2025-12-31. Revenue: $2,847 million (18.1% YoY growth). Net income: $412 million. Operating cash flow: $780 million. R&D intensity: 12.3% of revenue. Notable: primary customer concentration exceeds 35% of revenue.

Legend: green = grounded; rose = not grounded (likely hallucination)

Ungrounded claims

  • percent: 18.1%
  • currency: $780 million (nearest in source: 615000000)
  • percent: 35%

How grounding is checked

  • Numbers in the output are extracted (currencies, plain numbers ≥ 1000, percents, dates).
  • Each number is checked for direct substring presence in the source, then for within-1% numeric proximity.
  • Dates get a looser year-level fallback.
  • Prose-level fabrication is not detected. This pass catches the numeric class only, by far the most costly class in financial extractions.

See methodology for the full algorithm, limitations, and the planned embedding-based prose checker.
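The steps listed above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the tool's actual engine: the regular expression, the names `extractNumbers`, `isGrounded`, and `checkGrounding`, the scale words handled (thousand, million, billion), ISO-style dates only, and measuring the 1% tolerance relative to the source value are all choices made here for the example.

```javascript
// Hypothetical sketch of the numeric grounding pass. Names and regex are assumptions.
const SCALE = { thousand: 1e3, million: 1e6, billion: 1e9 };
const TOKEN =
  /(\d{4})-\d{2}-\d{2}|(\$\s?)?((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?)(?:\s?(thousand|million|billion)\b)?(\s?%)?/gi;

// Extract dates, currencies, percents, and plain numbers >= 1000 from free text.
function extractNumbers(text) {
  const claims = [];
  for (const m of text.matchAll(TOKEN)) {
    const raw = m[0].trim();
    if (m[1]) {
      // ISO date: keep the year so the looser fallback can use it
      claims.push({ kind: "date", raw, value: Number(m[1]) });
      continue;
    }
    const scale = m[4] ? SCALE[m[4].toLowerCase()] : 1;
    const value = Number(m[3].replace(/,/g, "")) * scale;
    if (m[5]) claims.push({ kind: "percent", raw, value });
    else if (m[2]) claims.push({ kind: "currency", raw, value });
    else if (value >= 1000) claims.push({ kind: "number", raw, value });
  }
  return claims;
}

function isGrounded(claim, source, sourceNumbers) {
  // 1. Direct substring presence in the source
  if (source.includes(claim.raw)) return true;
  // 3. Dates: looser year-level fallback
  if (claim.kind === "date") return source.includes(String(claim.value));
  // 2. Numeric proximity: within 1% of some source value (relative to that value)
  return sourceNumbers.some(
    (s) => s.kind !== "date" && Math.abs(s.value - claim.value) <= 0.01 * Math.abs(s.value)
  );
}

// Returns every numeric claim in the output with a grounded flag attached.
function checkGrounding(source, output) {
  const sourceNumbers = extractNumbers(source);
  return extractNumbers(output).map((claim) => ({
    ...claim,
    grounded: isGrounded(claim, source, sourceNumbers),
  }));
}
```

Under these assumptions, `extractNumbers` finds seven claims in the sample extraction shown in the Markup panel (one date, three currencies, three percents), which matches the denominator of the 4 / 7 score. Comparing normalized values rather than strings is what lets "$780 million" be reported against a nearest source value such as 615000000.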

How to use

Step-by-step

  1. Provide source context (the document or retrieval result the model was supposed to ground its answer in).

  2. Provide the model's output to be checked.

  3. Run the detector. It checks entity grounding, numerical grounding, and self-consistency across N samples.

  4. Read flagged spans. Each flag includes the failure type and severity. Review flags before accepting the output.

  5. For batch use, upload a list of (context, output) pairs and read aggregate flag rates. A flag rate above 10% suggests the prompt or model needs adjustment.
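The batch step can be sketched as a small aggregation. The definition of flag rate used here (share of pairs with at least one flag) is an assumption, since the page does not define it; `aggregateFlagRate` and the `detect(context, output)` callback, which is taken to return an array of flags, are hypothetical names for illustration.

```javascript
// Hypothetical batch aggregation. `detect` is any function that returns an array
// of flags for one (context, output) pair; its shape is assumed, not documented.
function aggregateFlagRate(pairs, detect, threshold = 0.10) {
  if (pairs.length === 0) {
    return { flagged: 0, total: 0, rate: 0, needsAttention: false };
  }
  let flagged = 0;
  for (const { context, output } of pairs) {
    // A pair counts as flagged if the detector raises at least one flag
    if (detect(context, output).length > 0) flagged += 1;
  }
  const rate = flagged / pairs.length;
  // "Above 10%" is read as strictly greater than the threshold
  return { flagged, total: pairs.length, rate, needsAttention: rate > threshold };
}
```

Passing the detector in as a callback keeps the aggregation independent of how individual flags are produced.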

For agents

Use in an agent

Same math, same result shape as the UI above — as a static ES module. No HTTP request, no auth, no rate limit.

import { compute } from "https://aifinhub.io/engines/hallucination-detector.js";

Contract: /contracts/hallucination-detector.json
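A minimal loading sketch, assuming a runtime that supports importing ES modules from a URL (browsers and Deno do). Only the module URL and the named export `compute` come from this page; `ENGINE_URL`, `pickCompute`, `loadEngine`, and the injectable `importer` parameter are hypothetical helpers. The arguments and return shape of `compute` are deliberately not shown, since they are defined by the contract file rather than by this page.

```javascript
const ENGINE_URL = "https://aifinhub.io/engines/hallucination-detector.js";

// Check that a loaded module exposes the `compute` export before using it.
function pickCompute(mod) {
  if (!mod || typeof mod.compute !== "function") {
    throw new TypeError("engine module does not export compute()");
  }
  return mod.compute;
}

// Load the engine lazily. `importer` is injectable so the loader can be
// exercised with a stub module instead of a network fetch.
async function loadEngine(importer = (url) => import(url)) {
  return pickCompute(await importer(ENGINE_URL));
}
```

Validating the export up front means a changed or unreachable module fails with a clear error at load time rather than at the first call.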

Questions people ask next

FAQ

How does the detector decide a model output is hallucinated?

Three checks documented on the methodology page: (1) entity grounding — does every named entity in the output appear in the source context? (2) numerical grounding — do numbers in the output match values in the source within tolerance? (3) self-consistency — does the same prompt produce stable outputs across N samples? Any failure flags the response.

What's the false-positive rate?

Around 8-15% on the benchmark suite documented on the methodology page. A 'false positive' here is the detector flagging a correct answer as hallucinated — usually because the entity matches a known synonym or the number was rounded differently. The methodology page lists the benchmark mix and per-category accuracy.

Does it work for non-English text?

Entity and number checks work for any language with named-entity tooling (English, Spanish, German, French, Mandarin all have decent NER). Self-consistency is language-agnostic. Languages with weaker NER tooling get a higher false-negative rate.

Can it detect 'plausible but wrong' hallucinations?

Partially. Entity checks catch 'invented company names'. Number checks catch 'made-up percentages'. The hardest case — a plausible but wrong claim that doesn't contradict anything in the source — usually slips through unless it triggers self-inconsistency across samples. Multi-sample voting helps but isn't bulletproof.

How many samples does self-consistency need?

Five samples is the sweet spot per the methodology page: more samples catch subtler inconsistencies but multiply cost. At 5 samples, 3/5 agreement is typically enough to ignore minor wording variation; below 3/5, the response is flagged as inconsistent (and probably less reliable).
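The 3/5 rule can be sketched as a majority vote over normalized samples. This is an assumption-laden illustration: `selfConsistency` and `normalize` are hypothetical names, and the normalization here (lowercasing and collapsing whitespace) is a crude stand-in for whatever comparison the methodology page actually specifies.

```javascript
// Crude normalization so trivial formatting differences do not split the vote.
// A real comparison would likely be looser; this is a placeholder.
function normalize(answer) {
  return answer.toLowerCase().replace(/\s+/g, " ").trim();
}

// Majority agreement across N samples; consistent when the top answer
// reaches the minimum share (3/5 by default, as in the FAQ answer above).
function selfConsistency(samples, minAgreement = 3 / 5) {
  const counts = new Map();
  for (const sample of samples) {
    const key = normalize(sample);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let majority = null;
  let top = 0;
  for (const [key, n] of counts) {
    if (n > top) {
      top = n;
      majority = key;
    }
  }
  const agreement = samples.length > 0 ? top / samples.length : 0;
  return { majority, agreement, consistent: agreement >= minAgreement };
}
```

With five samples, three matching answers give an agreement of 0.6 and pass; two matching answers give 0.4 and are flagged as inconsistent.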

Planning estimates only — not financial, tax, or investment advice.