Hallucination Detector — Finance

Author: AI Fin Hub Research

Paste a source document and an LLM extraction. Every numeric claim is cross-checked against the source, and ungrounded claims are flagged. Runs in your browser. Free.

Published: Apr 20, 2026

Inputs: Source document + LLM extraction
Runtime: Instant
Privacy: Client-side · no upload
API key: Not required
Methodology: Open →

Education · Not investment advice. BaFin/EU framework. Past performance does not indicate future results.

Grounding score

57%

4 / 7 numeric claims found in the source. 3 ungrounded · 4 grounded.

Numeric class only — currencies, percents, plain numbers ≥ 1000, and dates. Prose-level checks not in scope.
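
The score above is simple arithmetic: grounded claims divided by total numeric claims. A minimal sketch, assuming the percentage is rounded to the nearest integer and that an output with no numeric claims has no score (both are assumptions, and `groundingScore` is a hypothetical name):

```javascript
// Grounding score = grounded numeric claims / total numeric claims, as a percent.
// Rounding to the nearest integer is an assumption about how the UI displays it.
function groundingScore(grounded, total) {
  if (total === 0) return null; // assumption: no numeric claims means no score
  return Math.round((grounded / total) * 100);
}

console.log(groundingScore(4, 7)); // 57
```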

1 · Source document

2 · LLM extraction / output

3 · Markup

Company: SYNTHETIC_A Corp. Period: fiscal year ended 2025-12-31. Revenue: $2,847 million (18.1% YoY growth). Net income: $412 million. Operating cash flow: $780 million. R&D intensity: 12.3% of revenue. Notable: primary customer concentration exceeds 35% of revenue.

Legend: green = grounded · rose = not grounded (likely hallucination)

Ungrounded claims

· percent: 18.1%
· currency: $780 million (nearest in source: 615000000)
· percent: 35%

How grounding is checked

· Numbers in the output are extracted (currencies, plain numbers ≥ 1000, percents, dates).
· Each number is checked for direct substring presence in the source, then for within-1% numeric proximity.
· Dates get a looser year-level fallback.
· Prose-level fabrication is not detected. This pass catches the numeric class only — by far the most costly class in financial extractions.
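
The steps above can be sketched roughly as follows. This is an illustration, not the detector's actual code: the regular expressions, the function names (`extractClaims`, `toValue`, `isGrounded`, `checkGrounding`), the ISO-only date format, and the choice to compare percents only with percents are all assumptions.

```javascript
// Hypothetical patterns, tried in order; each match is masked so later patterns skip it.
const PATTERNS = [
  ["date", /\b\d{4}-\d{2}-\d{2}\b/g],
  ["currency", /[$€£]\s?\d[\d,]*(?:\.\d+)?(?:\s?(?:thousand|million|billion))?/gi],
  ["percent", /\d+(?:\.\d+)?\s?%/g],
  ["number", /\b\d{1,3}(?:,\d{3})+\b|\b\d{4,}\b/g], // plain numbers >= 1000
];
const SCALE = { thousand: 1e3, million: 1e6, billion: 1e9 };

// Pull numeric claims out of a text, tagged by kind.
function extractClaims(text) {
  const claims = [];
  let rest = text;
  for (const [kind, re] of PATTERNS) {
    rest = rest.replace(re, (match) => {
      claims.push({ kind, raw: match });
      return " ".repeat(match.length); // mask the span
    });
  }
  return claims;
}

// Convert "$2,847 million" to 2847000000, "18.1%" to 18.1, and so on.
function toValue(raw) {
  const digits = raw.replace(/,/g, "").match(/\d+(?:\.\d+)?/);
  if (!digits) return NaN;
  const word = Object.keys(SCALE).find((w) => raw.toLowerCase().includes(w));
  return parseFloat(digits[0]) * (word ? SCALE[word] : 1);
}

function isGrounded(claim, source, sourceClaims) {
  if (source.includes(claim.raw)) return true; // 1. direct substring presence
  if (claim.kind === "date") {
    return source.includes(claim.raw.slice(0, 4)); // looser year-level fallback
  }
  const value = toValue(claim.raw);
  // 2. within-1% numeric proximity (percents compared only with percents: assumption)
  return sourceClaims
    .filter((s) => s.kind !== "date" && (s.kind === "percent") === (claim.kind === "percent"))
    .some((s) => Math.abs(toValue(s.raw) - value) <= 0.01 * Math.abs(value));
}

function checkGrounding(source, output) {
  const sourceClaims = extractClaims(source);
  const claims = extractClaims(output);
  const ungrounded = claims.filter((c) => !isGrounded(c, source, sourceClaims));
  return { total: claims.length, ungrounded };
}
```

Under these assumptions, `$2,847 million` in an output would count as grounded against a source that states `2847000000`, because the two values match after scaling even though the strings differ.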

See the methodology page for the full algorithm, its limitations, and the planned embedding-based prose checker.

How to use

Step-by-step

1. Provide source context (the document or retrieval result the model was supposed to ground its answer in).
2. Provide the model's output to be checked.
3. Run the detector. It checks entity grounding, numerical grounding, and self-consistency across N samples.
4. Read flagged spans. Each flag includes the failure type and severity. Review flags before accepting the output.
5. For batch use, upload a list of (context, output) pairs and read aggregate flag rates. A flag rate above 10% suggests the prompt or model needs adjustment.
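
The batch rule in step 5 can be sketched as below. The result shape (`flags` as an array per pair) and the definition of flag rate as the share of pairs with at least one flag are assumptions; `flagRate` and `needsAdjustment` are hypothetical names.

```javascript
// results: one entry per (context, output) pair, each with a `flags` array (assumed shape).
// Flag rate here = share of pairs with at least one flag (assumed definition).
function flagRate(results) {
  if (results.length === 0) return 0;
  const flagged = results.filter((r) => r.flags.length > 0).length;
  return flagged / results.length;
}

// The 10% threshold comes from step 5 above.
function needsAdjustment(results, threshold = 0.1) {
  return flagRate(results) > threshold;
}
```

With 20 pairs of which 3 carry flags, the rate is 15% and `needsAdjustment` returns `true`.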

For agents

Use in an agent

Same math, same result shape as the UI above — as a static ES module. No HTTP request, no auth, no rate limit.

import { compute } from "https://aifinhub.io/engines/hallucination-detector.js";

Contract: /contracts/hallucination-detector.json
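
A hedged sketch of calling the module from an agent. The page does not show the signature of `compute`, so the single-argument call below is an assumption; the real input and result shapes are defined by the contract file. `runDetector` and its `load` parameter are hypothetical and exist only so the module loader can be swapped out, for example in runtimes that do not support URL imports.

```javascript
// ASSUMPTION: compute takes one input object and returns a result (sync or async).
// Check /contracts/hallucination-detector.json for the actual shapes.
const ENGINE_URL = "https://aifinhub.io/engines/hallucination-detector.js";

async function runDetector(input, load = () => import(ENGINE_URL)) {
  const { compute } = await load();
  return await compute(input); // await works for both sync and async results
}
```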

Questions people ask next

FAQ

How does the detector decide a model output is hallucinated?

Three checks documented on the methodology page: (1) entity grounding — does every named entity in the output appear in the source context? (2) numerical grounding — do numbers in the output match values in the source within tolerance? (3) self-consistency — does the same prompt produce stable outputs across N samples? Any failure flags the response.

What's the false-positive rate?

Around 8-15% on the benchmark suite documented on the methodology page. A 'false positive' here is the detector flagging a correct answer as hallucinated — usually because the entity matches a known synonym or the number was rounded differently. The methodology page lists the benchmark mix and per-category accuracy.

Does it work for non-English text?

Entity and number checks work for any language with named-entity tooling (English, Spanish, German, French, Mandarin all have decent NER). Self-consistency is language-agnostic. Languages with weaker NER tooling get a higher false-negative rate.

Can it detect 'plausible but wrong' hallucinations?

Partially. Entity checks catch 'invented company names'. Number checks catch 'made-up percentages'. The hardest case — a plausible but wrong claim that doesn't contradict anything in the source — usually slips through unless it triggers self-inconsistency across samples. Multi-sample voting helps but isn't bulletproof.

How many samples does self-consistency need?

Five samples is the sweet spot per the methodology page: more samples catch subtler inconsistencies but multiply cost. At 5 samples, 3/5 agreement is typically enough to ignore minor wording variation; below 3/5, the response is flagged as inconsistent (and probably less reliable).
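
The 3/5 rule can be sketched as a majority count over normalized samples. The normalization (lowercasing and collapsing whitespace) is an assumption standing in for whatever the tool treats as "minor wording variation", and `selfConsistency` is a hypothetical name.

```javascript
// Treat case and whitespace differences as the same answer (assumed normalization).
function normalize(text) {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Consistent when the most common answer covers at least 3/5 of the samples.
function selfConsistency(samples) {
  const counts = new Map();
  for (const s of samples) {
    const key = normalize(s);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  const top = Math.max(0, ...counts.values());
  const total = samples.length;
  // Integer comparison avoids floating-point issues with 3/5.
  return { agreement: top, total, consistent: total > 0 && top * 5 >= total * 3 };
}
```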

Used in

Decision workflows that use this tool

Goal-driven flows that bundle this tool with adjacent ones.

Audit Your Pipeline
Catch hallucinations, prompt injections, and regression drift before they ship.

Complementary tools

Agent Skill Tester for Markets

Paste a SKILL.md definition + sample input + your Anthropic API key. See structured extraction, token cost, and latency — all in your browser. No signup.

Price-Blind Research Auditor

Paste a research prompt or agent context bundle. The auditor flags price numbers, directional words, and outcome-leaking phrases that cause LLMs.

Prompt Injection Tester

Red-team a finance agent against 24 documented prompt-injection attacks — direct override, role confusion, indirect injection via retrieved content.
