
LLM Finance Error Taxonomy

12 documented LLM-on-finance failure modes (hallucinated ticker, stale price, units, currency, off-by-100, fictional source, and more). Paste an output and see which flags are raised.

Inputs: Paste + configure
Runtime: 1–15 s
Privacy: Client-side · no upload
API key: Not required
Methodology: Open →

Education · Not investment advice. BaFin/EU framework. Past performance does not indicate future results.

LLM output

Ground truth (optional)

Failure modes detected

6 of 12 documented failure modes. 6 total flags raised. Heuristic detection — review each flag manually.
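
A minimal sketch of how surface-pattern flagging like this can be structured, assuming a plain pattern-to-severity table in Python (the patterns, severities, and names below are illustrative, not the tool's actual rules):

```python
import re
from dataclasses import dataclass

@dataclass
class Flag:
    mode: str       # failure mode name, e.g. "Stale price"
    severity: str   # "low" | "medium" | "high"
    message: str

# Illustrative surface patterns only; the full taxonomy documents 12 modes.
CHECKS = [
    ("Stale price", "high", re.compile(r"as of my (last|latest) update", re.I)),
    ("Off-by-100 magnitude", "medium", re.compile(r"\b\d{3,}(\.\d+)?\s*%")),
]

def detect(text: str) -> list[Flag]:
    """Run each surface-pattern check and collect flags for manual review."""
    return [
        Flag(mode, severity, f"Matched: {m.group(0)!r}")
        for mode, severity, rx in CHECKS
        if (m := rx.search(text))
    ]
```

A flag here means only that a pattern matched; as noted above, every flag still needs a manual review.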

Flagged modes

  • Hallucinated ticker

    1 flag

    Fabricated stock symbol or one that does not exist on the implied exchange.

    • [low] Symbol "XYZW" is referenced as a ticker but is not in the verified universe.

Remediation: Constrain to a verified ticker list and reject unknown symbols (sketch below).

  • Stale price

    1 flag

The quoted price appears to come from training data rather than a live source; telltale signs are phrases like 'as of my last update' or suspiciously old dates.

    • [high] Phrase: "As of my last update"

Remediation: Require live data tool calls; reject answers with 'as of my knowledge' caveats (sketch below).

  • Time-zone error

    1 flag

Market times reported in the wrong time zone (e.g. 09:30 CET for the NYSE open).

    • [high] NYSE opens at 09:30 ET, not CET/GMT

Remediation: Convert all times to UTC before display and require time-zone tags (sketch below).

  • Off-by-100 magnitude

    1 flag

    Decimals shifted: 5% reported as 0.05% or 500%.

    • [medium] Three-digit percentage for return/yield (likely off-by-100)

Remediation: Sanity-check ranges (return |x| < 100, ratios within plausible bands) (sketch below).

  • Fictional source

    1 flag

Citation to a paper, page, or filing that does not exist or is otherwise fabricated.

    • [low] Citation "Bloomberg article" appears without a verifiable URL nearby.

Remediation: Require URLs that retrieve OK and DOIs that resolve before citing (sketch below).

  • Wrong period

    1 flag

    Quarterly figures presented as annual (or vice versa) without conversion.

    • [medium] Q-period figures presented as annual

Remediation: Tag every figure with its period; reject mixed-period aggregations (sketch below).
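
Minimal sketches of the remediations above follow. Each is an illustration under assumed data shapes and names, not the tool's implementation. First, constraining output to a verified ticker list:

```python
import re

# Hypothetical verified universe; in practice this comes from an exchange reference file.
VERIFIED_TICKERS = {"AAPL", "MSFT", "SAP", "ASML"}

def unknown_tickers(text: str) -> set[str]:
    """Return ticker-like symbols that are not in the verified universe."""
    # Crude candidate extraction; a real check would also exclude currency codes and common words.
    candidates = set(re.findall(r"\b[A-Z]{2,5}\b", text))
    return candidates - VERIFIED_TICKERS

assert "XYZW" in unknown_tickers("Buy XYZW and MSFT")
```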
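
For the stale-price remediation, a sketch of a gate that rejects answers carrying staleness caveats or lacking a live-data tool call (the phrase list and the used_live_quote_tool flag are assumptions):

```python
import re

STALE_PHRASES = re.compile(
    r"as of my (last|latest) (update|knowledge)|knowledge cutoff", re.IGNORECASE
)

def require_live_price(answer: str, used_live_quote_tool: bool) -> str:
    """Reject price answers that were not backed by a live data tool call."""
    if STALE_PHRASES.search(answer) or not used_live_quote_tool:
        raise ValueError("Rejected: no live price source confirmed.")
    return answer
```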
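
For the time-zone remediation, a sketch using Python's zoneinfo to attach an explicit zone and display the time in UTC:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def as_utc_label(naive_local: datetime, tz_name: str) -> str:
    """Attach the source time zone, convert to UTC, and return a tagged string."""
    aware = naive_local.replace(tzinfo=ZoneInfo(tz_name))
    return aware.astimezone(ZoneInfo("UTC")).strftime("%Y-%m-%d %H:%M UTC")

# NYSE open stated in its own zone, rendered unambiguously:
print(as_utc_label(datetime(2025, 3, 3, 9, 30), "America/New_York"))  # 2025-03-03 14:30 UTC
```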
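
For off-by-100 magnitudes, the range sanity check can be as small as this (the plausibility band is an assumption and should be tuned per metric):

```python
def plausible_return_pct(value: float) -> bool:
    """Flag returns or yields whose magnitude suggests a shifted decimal point."""
    return abs(value) < 100  # assumed band; single-period returns rarely reach triple digits

for v in (5.0, 500.0):
    print(f"{v}% ->", "ok" if plausible_return_pct(v) else "likely off-by-100")
```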
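
For fictional sources, a sketch of the "URL must retrieve OK" half of the remediation using only the standard library (DOI resolution would follow the same pattern against doi.org):

```python
import urllib.request

def url_retrieves_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the cited URL responds with a non-error status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except Exception:
        return False
```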
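
And for wrong-period figures, a sketch of period tagging that refuses mixed-period aggregation (the Figure shape and period labels are assumptions):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Figure:
    value: float
    period: str  # e.g. "Q1-2025" or "FY-2024"; every figure carries its period

def sum_same_period(figures: list[Figure]) -> float:
    """Aggregate only figures that share one period; reject mixed-period sums."""
    periods = {f.period for f in figures}
    if len(periods) != 1:
        raise ValueError(f"Mixed periods cannot be aggregated: {sorted(periods)}")
    return sum(f.value for f in figures)
```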

No flags raised for

Ratio mistake · Units error · Currency error · Wrong split-adjustment · Wrong tax bracket / rate · Double-counted dividend

Absence of a flag does not mean the answer is correct — only that no surface pattern matched. See methodology for the full taxonomy.

How to use

Step-by-step

Full calculator guide →
  1. Browse the six top-level categories: factual, reasoning, arithmetic, formatting, refusal, prompt-injection.

  2. Drill into a category to see specific failure modes with example prompts and expected vs. observed outputs.

  3. Use the category structure to design your own evals: pick the categories most relevant to your task.

  4. Reference the per-model error rates on the methodology page when choosing a model — error profile matters more than aggregate accuracy.

  5. Re-check the taxonomy after each major model release; error rates shift with new versions.

Questions people ask next

FAQ

What categories of LLM error does the taxonomy cover?

Six top-level categories from the methodology page: factual errors (date/number/entity wrong), reasoning errors (correct facts, wrong inference), arithmetic errors (compute mistakes on simple math), formatting errors (output schema violations), refusal errors (model refuses to answer when it shouldn't), and prompt-injection compromises. Each has 3-7 subcategories.

How were the error rates measured?

From the 2026-04 evaluation suite documented on the methodology page: 500 finance-domain prompts spanning 12 task types, run against Claude Opus, Sonnet, and GPT-4 with three samples each. Error rates are averaged across samples. The full prompt set and reference answers are linked from methodology.

Why are arithmetic errors so common in some models?

Models that can't call a calculator tool make arithmetic errors at 5-15% on multi-step problems — even on simple decimal arithmetic. Models with tool-use enabled (function calling) drop to <1%. The taxonomy distinguishes 'no-tool' from 'tool-use' error rates.

Are these error rates stable over time?

No — every major model release shifts the error profile. The taxonomy's measurements are timestamped to the asOfDate; they should be re-run with each major model release. The methodology page commits to quarterly refreshes.

How do I use the taxonomy in my own evals?

Pick the error categories most relevant to your task and write prompts that probe each. E.g., for a 10-K summarization agent, factual + arithmetic + refusal are the high-stakes categories; formatting matters less. The taxonomy gives you the structure, not the specific prompts — those depend on your domain.
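
As a rough sketch of that advice, a category-weighted eval plan for the 10-K example could be laid out like this (category names follow the taxonomy; the weights and prompts are placeholders you would replace for your own task):

```python
# Hypothetical eval plan for a 10-K summarization agent; only the high-stakes categories are scored.
EVAL_PLAN = {
    "factual":    {"weight": 0.4, "prompts": ["Summarize FY2024 revenue by segment from this excerpt ..."]},
    "arithmetic": {"weight": 0.3, "prompts": ["Compute year-over-year revenue growth from these figures ..."]},
    "refusal":    {"weight": 0.3, "prompts": ["Answer from the filing only; this question is out of scope ..."]},
}

def weighted_error_rate(per_category_error: dict[str, float]) -> float:
    """Combine per-category error rates (0..1) using the plan's weights."""
    return sum(spec["weight"] * per_category_error[cat] for cat, spec in EVAL_PLAN.items())
```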

Complementary tools

Planning estimates only — not financial, tax, or investment advice.