Playground
Hallucination Detector — Finance
Paste source + LLM extraction. Every numeric claim is cross-checked against the source; ungrounded claims are flagged. Runs in your browser. Free.
- Inputs: Source document + LLM extraction
- Runtime: Instant
- Privacy: Client-side · no upload
- API key: Not required
- Methodology: Open →
Grounding score
57%
4 / 7 numeric claims found in the source. 3 ungrounded · 4 grounded.
Numeric class only: currencies, percents, plain numbers ≥ 1000, and dates. Prose-level checks are not in scope.
1 · Source document
2 · LLM extraction / output
3 · Markup
Legend: green = grounded · rose = not grounded (likely hallucination)
Ungrounded claims
- percent: 18.1%
- currency: $780 million (nearest in source: 615000000)
- percent: 35%
How grounding is checked
- Numbers in the output are extracted (currencies, plain numbers ≥ 1000, percents, dates).
- Each number is checked for direct substring presence in the source, then for within-1% numeric proximity.
- Dates get a looser year-level fallback.
- Prose-level fabrication is not detected. This pass catches the numeric class only, by far the most costly class in financial extractions.
See methodology for the full algorithm, limitations, and planned embedding-based prose checker.
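The steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual engine: the names `extractNumbers` and `checkGrounding`, the regular expression, the handling of scale words such as "million", and the choice to compare against every number in the source are all assumptions, and the date fallback is left out.

```javascript
// Hypothetical sketch of a numeric grounding pass (not the tool's real code).
// Dates and their year-level fallback are intentionally omitted.
const SCALE = { thousand: 1e3, million: 1e6, billion: 1e9, trillion: 1e12 };

// Extract currencies, percents, and plain numbers >= minPlain from a text.
function extractNumbers(text, minPlain = 1000) {
  const pattern =
    /(\$)?((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?)(?:\s*(thousand|million|billion|trillion))?(\s?%)?/gi;
  const found = [];
  for (const m of text.matchAll(pattern)) {
    const [raw, dollar, digits, scaleWord, percent] = m;
    const scale = scaleWord ? SCALE[scaleWord.toLowerCase()] : 1;
    const value = parseFloat(digits.replace(/,/g, "")) * scale;
    const kind = dollar ? "currency" : percent ? "percent" : "number";
    if (kind === "number" && value < minPlain) continue; // skip small plain numbers
    found.push({ raw, kind, value });
  }
  return found;
}

// A claim is grounded if its raw text appears in the source, or if its value
// is within `tolerance` (1% by default) of some number found in the source.
function checkGrounding(source, output, tolerance = 0.01) {
  const sourceValues = extractNumbers(source, 0).map((n) => n.value);
  const claims = extractNumbers(output).map((claim) => {
    const exact = source.includes(claim.raw);
    const near = sourceValues.some(
      (v) => v !== 0 && Math.abs(v - claim.value) / Math.abs(v) <= tolerance
    );
    return { ...claim, grounded: exact || near };
  });
  const grounded = claims.filter((c) => c.grounded).length;
  const score = claims.length ? Math.round((100 * grounded) / claims.length) : null;
  return { claims, grounded, total: claims.length, score };
}
```

With a source of "Revenue was $615,000,000 in fiscal 2024, up 12% year over year." and an output of "Revenue reached $615 million in 2024, up 18.1%.", this sketch grounds "$615 million" by numeric proximity and "2024" by substring match, and leaves "18.1%" ungrounded.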
How to use
Step-by-step
1. Provide source context (the document or retrieval result the model was supposed to ground its answer in).
2. Provide the model's output to be checked.
3. Run the detector. It checks entity grounding, numerical grounding, and self-consistency across N samples.
4. Read flagged spans. Each flag includes the failure type and severity. Review flags before accepting the output.
5. For batch use, upload a list of (context, output) pairs and read aggregate flag rates. A flag rate above 10% suggests the prompt or model needs adjustment.
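For the batch step, the aggregate flag rate might be computed along these lines. The helper name `flagRate`, the `{ context, output }` pair shape, and the `detect` callback returning `{ flags: [...] }` are hypothetical stand-ins, and "flag rate" is assumed here to mean the share of outputs with at least one flag.

```javascript
// Hypothetical helper: aggregate flag rate over (context, output) pairs.
// `detect` is any function returning { flags: [...] }; that shape is assumed.
function flagRate(pairs, detect, threshold = 0.10) {
  if (pairs.length === 0) {
    return { flagged: 0, total: 0, rate: 0, needsAttention: false };
  }
  const flagged = pairs.filter(
    ({ context, output }) => detect(context, output).flags.length > 0
  ).length;
  const rate = flagged / pairs.length;
  // A rate above the threshold suggests the prompt or model needs adjustment.
  return { flagged, total: pairs.length, rate, needsAttention: rate > threshold };
}
```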
For agents
Use in an agent
Same math, same result shape as the UI above — as a static ES module. No HTTP request, no auth, no rate limit.
import { compute } from "https://aifinhub.io/engines/hallucination-detector.js";
Contract: /contracts/hallucination-detector.json
Full agent guide →
Questions people ask next
FAQ
How does the detector decide a model output is hallucinated?
Three checks documented on the methodology page: (1) entity grounding — does every named entity in the output appear in the source context? (2) numerical grounding — do numbers in the output match values in the source within tolerance? (3) self-consistency — does the same prompt produce stable outputs across N samples? Any failure flags the response.
What's the false-positive rate?
Around 8-15% on the benchmark suite documented on the methodology page. A 'false positive' here is the detector flagging a correct answer as hallucinated, usually because the entity appears in the source under a synonym or the number was rounded differently. The methodology page lists the benchmark mix and per-category accuracy.
Does it work for non-English text?
Entity and number checks work for any language with named-entity tooling (English, Spanish, German, French, Mandarin all have decent NER). Self-consistency is language-agnostic. Languages with weaker NER tooling get a higher false-negative rate.
Can it detect 'plausible but wrong' hallucinations?
Partially. Entity checks catch 'invented company names'. Number checks catch 'made-up percentages'. The hardest case — a plausible but wrong claim that doesn't contradict anything in the source — usually slips through unless it triggers self-inconsistency across samples. Multi-sample voting helps but isn't bulletproof.
How many samples does self-consistency need?
5 samples is the sweet spot per the methodology page — more catches subtler inconsistencies but multiplies cost. At 5 samples, a 3/5 agreement is typically enough to ignore minor wording variation; below 3/5, the response is flagged as inconsistent (and probably less reliable).
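The 3-of-5 rule can be illustrated with a small vote counter. This is a sketch under stated assumptions: `normalize` and `selfConsistency` are hypothetical names, and treating two samples as "agreeing" when they match after case and punctuation normalization is a simplification chosen for the example, not the documented method.

```javascript
// Hypothetical sketch of a self-consistency vote across N samples.
// Agreement = exact match after light normalization (an assumption).
function normalize(answer) {
  return answer
    .toLowerCase()
    .replace(/\.(?!\d)/g, "")          // drop periods that are not decimal points
    .replace(/[^a-z0-9.%$]+/g, " ")    // collapse other punctuation and whitespace
    .trim();
}

function selfConsistency(samples, minAgreement = 3) {
  const counts = new Map();
  for (const sample of samples) {
    const key = normalize(sample);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let agreement = 0;
  for (const count of counts.values()) agreement = Math.max(agreement, count);
  // Below minAgreement (3 of 5 by default) the response is flagged as inconsistent.
  return { agreement, total: samples.length, consistent: agreement >= minAgreement };
}
```

Minor wording variation (case, spacing, trailing punctuation) does not break agreement in this sketch, while a different number does, since digits, "$", "%", and decimal points survive normalization.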
Related deep dive
Read further
Long-form context behind the tool output.
- Pillar · Guide · 7 min
DeepSeek V4 for Finance 2026: SEC Filing Extraction Cost
DeepSeek V4 for finance 2026: V4-Flash reads a full 10-K for about $0.018 at $0.14/$0.28 per Mtok with a 1M-token window. Legacy model IDs retire July 24.
- Methodology · Opinion · 8 min
The Price-Blind LLM Research Harness
Price-blind LLM research: most harnesses leak the current price and the model confabulates. The architectural fix and a 30-line Python scaffold.
- Tutorial · Runnable · 9 min
LLM Prompt Patterns for 10-K and 8-K Extraction
Three structured patterns for auditable 10-K extractions: field-by-field JSON, citation-required verbatim quotes, and contradiction-triangle cross-check.
Complementary tools
Users of this tool often explore
Agent Skill Tester for Markets
Paste a SKILL.md definition + sample input + your Anthropic API key. See structured extraction, token cost, and latency — all in your browser. No signup.
Price-Blind Research Auditor
Paste a research prompt or agent context bundle. The auditor flags price numbers, directional words, and outcome-leaking phrases that cause LLMs.
Prompt Injection Tester
Red-team a finance agent against 24 documented prompt-injection attacks — direct override, role confusion, indirect injection via retrieved content.