How to use Hallucination Detector
Paste a source document and an LLM's extraction. Every numeric claim in the output is matched against the source — mismatches and unsupported claims are flagged so you catch fabrication before the number reaches a trading rule.
What It Does
Use the detector with intent
Who It's For
Engineers piping LLM extractions into trading or research pipelines who need a deterministic check that the numbers in the output actually appear in the source.
Interpreting Results
Flagged claims are the work. Each flag falls into one of three buckets: number not found in source (hallucination), number found but mis-attributed (paraphrase error), number rounded outside tolerance (precision drift). All three deserve a manual review before action.
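The three buckets can be sketched as a triage function. This is a minimal sketch, assuming per-entity number lists from the source; the names `classify_claim`, `tol`, and `drift_band` are hypothetical, not the detector's actual API or thresholds:

```python
# Illustrative triage of the three flag buckets described above.
# Names and tolerance bands are assumptions, not the tool's API.

def classify_claim(value, entity, source_numbers, tol=0.005, drift_band=0.05):
    """source_numbers: entity -> list of numbers the source states for it.
    Returns None when grounded, else one of the three flag buckets."""
    def within(n, band):
        return abs(value - n) <= band * max(abs(n), 1.0)

    own = source_numbers.get(entity, [])
    others = [n for e, nums in source_numbers.items() if e != entity for n in nums]

    if any(within(n, tol) for n in own):
        return None                 # grounded: matches within rounding tolerance
    if any(within(n, drift_band) for n in own):
        return "precision_drift"    # right entity, rounded outside tolerance
    if any(within(n, tol) for n in others):
        return "misattribution"     # number exists, attached to the wrong entity
    return "hallucination"          # number appears nowhere in the source
```

The ordering matters: an exact match for the right entity short-circuits before the looser checks, so a correct figure is never flagged as drift.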
Input Steps
Field by field
1. Provide source context (the document or retrieval result the model was supposed to ground its answer in).
2. Provide the model's output to be checked.
3. Run the detector. It checks entity grounding, numerical grounding, and self-consistency across N samples.
4. Read flagged spans. Each flag includes the failure type and severity. Review flags before accepting the output.
5. For batch use, upload a list of (context, output) pairs and read aggregate flag rates. A flag rate above 10% suggests the prompt or model needs adjustment.
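The batch flow in step 5 can be sketched as follows. `extract_numbers`, `check_pair`, and `batch_flag_rate` are hypothetical helpers using exact matching for simplicity, not the tool's implementation:

```python
# Sketch of the batch flow: flag output numbers with no match in the context.
# Helper names are illustrative assumptions, not the tool's API.
import re

def extract_numbers(text):
    """Pull numeric tokens (ints, decimals, negatives) out of text."""
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))]

def check_pair(context, output, tol=1e-9):
    """Return output numbers with no match in the context (candidate flags)."""
    source = extract_numbers(context)
    return [v for v in extract_numbers(output)
            if not any(abs(v - s) <= tol * max(abs(s), 1.0) for s in source)]

def batch_flag_rate(pairs):
    """Fraction of (context, output) pairs producing at least one flag."""
    flagged = sum(1 for ctx, out in pairs if check_pair(ctx, out))
    return flagged / len(pairs) if pairs else 0.0
```

A real detector would also normalize units and scale words ("million", "%"), which this sketch deliberately omits.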
Common Scenarios
Use realistic starting points
10-K extraction sanity check
- Source: 10-K filing
- Extraction: LLM-generated financials table
- Every dollar figure in the table should match the filing line item within rounding tolerance; unmatched figures are the hallucinations.

Earnings transcript Q&A summary
- Source: Earnings call transcript
- Extraction: Bulleted summary with growth rates
- Growth-rate quotes need to match management's spoken numbers exactly; errors from reconstructing figures off the prior year are common.
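The "within rounding tolerance" check in the 10-K scenario can be made concrete. Filings often report figures in millions rounded to one decimal, so a half-unit of the last reported digit is a natural slack; `matches_within_rounding` is a hypothetical helper, not the tool's API:

```python
# One way to express "within rounding tolerance" for filing line items.
# Hypothetical helper: accepts half a unit of the last reported digit.

def matches_within_rounding(reported, source_value, decimals=1):
    """True if `reported` could be `source_value` rounded to `decimals` places."""
    half_ulp = 0.5 * 10 ** (-decimals)
    return abs(reported - source_value) <= half_ulp

# e.g. filing says 1234.56; a table showing 1234.6 passes, 1235.0 is flagged
```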
Try These Tools
Run the numbers next
Agent Skill Tester for Markets
Paste a SKILL.md definition + sample input + your Anthropic API key. See structured extraction, token cost, and latency — all in your browser. No signup.
Prompt Regression Tester
Run the same prompt against multiple models (Claude 4.5/4.6/4.7, GPT-5, Gemini 2.5) with your own keys. Diff outputs, score drift, catch regressions.
Structured Schema Validator for Finance
Paste LLM JSON output and validate against four pre-built finance schemas — research output, trade decision, risk snapshot, peer comparison — with sanity checks.
Related Content
Keep the topic connected
Hallucination Detection
Detecting LLM hallucinations in financial outputs: the verifiable-claim approach, citation grounding, and cross-model agreement signals that work.
Model Drift
Model drift: when an LLM's behavior changes between calls, versions, or weeks. The monitoring stack that catches it before production breaks.
Agent Skill Testing
Agent skill testing: the regression-test discipline for LLM-driven agents. What to test, how to score, and the difference between pass-rate and capability.