Playground
LLM Finance Error Taxonomy
12 documented LLM-on-finance failure modes (hallucinated ticker, stale price, units, currency, off-by-100, fictional source, more). Paste output, see flags.
- Inputs
- Paste + configure
- Runtime
- 1–15 s
- Privacy
- Client-side · no upload
- API key
- Not required
- Methodology
- Open →
LLM output
Ground truth (optional)
Failure modes detected
6
of 12 documented failure modes. 6 total flags raised. Heuristic detection — review each flag manually.
Flagged modes
Hallucinated ticker
1 flag
Fabricated stock symbol, or one that does not exist on the implied exchange.
- [low] Symbol "XYZW" is referenced as a ticker but is not in the verified universe.
Remediation: Constrain to a verified ticker list and reject unknown symbols.
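A minimal sketch of this remediation in Python. The regex, the toy ticker universe, and the stoplist are illustrative assumptions, not the tool's actual client-side implementation:

```python
import re

# Stand-in "verified universe"; a real deployment would load the full
# exchange listing (assumption for illustration).
VERIFIED_TICKERS = {"AAPL", "MSFT", "NVDA", "JPM"}

def flag_unknown_tickers(text: str) -> list[str]:
    """Return all-caps symbols that are not in the verified universe."""
    candidates = set(re.findall(r"\b[A-Z]{1,5}\b", text))
    # Common all-caps words that are not tickers (assumed stoplist).
    stoplist = {"CEO", "IPO", "ETF", "USD", "THE", "I", "A"}
    return sorted(candidates - VERIFIED_TICKERS - stoplist)

flag_unknown_tickers("AAPL rose 2% while XYZW fell.")  # flags "XYZW"
```

Rejecting at generation time (constrained decoding against the list) is stricter; a post-hoc scan like this is the cheapest retrofit.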
Stale price
1 flag
Price quoted appears to come from training data, not a live source. Phrases like "as of my last update" or specific old-looking dates.
- [high] Phrase: "As of my last update"
Remediation: Require live data tool calls; reject answers with 'as of my knowledge' caveats.
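The surface-pattern side of this check can be sketched as a phrase scan. The pattern list below is an assumption about what the detector matches, not its exact rule set:

```python
import re

# Assumed caveat phrases that signal training-data prices (illustrative).
STALE_PATTERNS = [
    r"as of my (last|knowledge) (update|cutoff)",
    r"as of my training data",
    r"i (don't|do not) have access to (live|real-?time) (data|prices)",
]

def has_stale_price_caveat(text: str) -> bool:
    """True if the output hedges with a knowledge-cutoff caveat."""
    low = text.lower()
    return any(re.search(p, low) for p in STALE_PATTERNS)

has_stale_price_caveat("As of my last update, AAPL traded near $150.")  # True
```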
Time-zone error
1 flag
Market times reported in the wrong timezone (e.g. 09:30 CET for the NYSE open).
- [high] NYSE opens at 09:30 ET, not CET/GMT
Remediation: Convert all times to UTC before display and require timezone tags.
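A minimal sketch of the convert-and-tag remediation using the standard library's `zoneinfo` (function name is illustrative):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def nyse_open_utc(trading_day: str) -> str:
    """NYSE opens 09:30 America/New_York; render it in UTC with a tag."""
    local = datetime.fromisoformat(f"{trading_day}T09:30").replace(
        tzinfo=ZoneInfo("America/New_York")
    )
    return local.astimezone(ZoneInfo("UTC")).strftime("%Y-%m-%d %H:%M UTC")

nyse_open_utc("2026-01-15")  # "2026-01-15 14:30 UTC" (EST, UTC-5)
```

Keeping the zone in the rendered string (the "timezone tag") is what makes a CET/ET mix-up visible to a reviewer.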
Off-by-100 magnitude
1 flag
Decimals shifted: 5% reported as 0.05% or 500%.
- [medium] Three-digit percentage for return/yield (likely off-by-100)
Remediation: Sanity-check ranges (return |x| < 100, ratios within plausible bands).
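The range check can be sketched in a few lines. The bands here are illustrative assumptions, not the tool's calibrated limits:

```python
def off_by_100_suspect(pct: float, kind: str = "return") -> bool:
    """Flag a percentage outside a plausible band for its kind.
    Bands are illustrative; tune per metric in a real deployment."""
    bands = {"return": 100.0, "yield": 25.0}  # |x| limits, assumed
    return abs(pct) >= bands.get(kind, 100.0)

off_by_100_suspect(500.0)  # True: 5% likely mis-scaled to 500%
off_by_100_suspect(5.0)    # False: within the plausible band
```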
Fictional source
1 flag
Citation to a paper, page, or filing that doesn't exist or is fabricated.
- [low] Citation "Bloomberg article" appears without a verifiable URL nearby.
Remediation: Require URLs that retrieve OK and DOIs that resolve before citing.
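One way to sketch the "URLs that retrieve OK" gate server-side is a HEAD probe; the in-browser tool would presumably use `fetch()` instead, so treat this as an assumption about the technique, not the implementation:

```python
import urllib.request

def url_resolves(url: str, timeout: float = 5.0) -> bool:
    """True if the cited URL answers with a 2xx/3xx status."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:
        # Malformed URL, DNS failure, 4xx/5xx: treat as unverifiable.
        return False
```

A citation with no URL at all, like the flagged "Bloomberg article", never reaches this check and should be rejected outright.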
Wrong period
1 flag
Quarterly figures presented as annual (or vice versa) without conversion.
- [medium] Q-period figures presented as annual
Remediation: Tag every figure with its period; reject mixed-period aggregations.
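The tag-and-reject remediation can be sketched with a small value type. The `Figure` class and the naive run-rate conversion are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Figure:
    value: float
    period: str  # "Q" or "FY"; every figure carries its period tag

def annualize(fig: Figure) -> Figure:
    """Naive x4 run-rate; real conversions need seasonality care."""
    if fig.period == "FY":
        return fig
    if fig.period == "Q":
        return Figure(fig.value * 4, "FY")
    raise ValueError(f"unknown period: {fig.period}")

def safe_sum(figs: list[Figure]) -> float:
    """Reject mixed-period aggregation instead of silently summing."""
    periods = {f.period for f in figs}
    if len(periods) > 1:
        raise ValueError(f"mixed periods: {periods}")
    return sum(f.value for f in figs)
```

`safe_sum([Figure(1.0, "Q"), Figure(4.0, "FY")])` raises rather than producing a nonsense total, which is the behavior the remediation asks for.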
No flags raised for
Ratio mistake · Units error · Currency error · Wrong split-adjustment · Wrong tax bracket / rate · Double-counted dividend
Absence of a flag does not mean the answer is correct — only that no surface pattern matched. See methodology for the full taxonomy.
How to use
Step-by-step
- 1
Browse the six top-level categories: factual, reasoning, arithmetic, formatting, refusal, prompt-injection.
- 2
Drill into a category to see specific failure modes with example prompts and expected vs. observed outputs.
- 3
Use the category structure to design your own evals: pick the categories most relevant to your task.
- 4
Reference the per-model error rates on the methodology page when choosing a model — error profile matters more than aggregate accuracy.
- 5
Re-check the taxonomy after each major model release; error rates shift with new versions.
Glossary references
Terms used by this tool
Questions people ask next
FAQ
What categories of LLM error does the taxonomy cover?
Six top-level categories from the methodology page: factual errors (date/number/entity wrong), reasoning errors (correct facts, wrong inference), arithmetic errors (compute mistakes on simple math), formatting errors (output schema violations), refusal errors (model refuses to answer when it shouldn't), and prompt-injection compromises. Each has 3-7 subcategories.
How were the error rates measured?
From the 2026-04 evaluation suite documented on the methodology page: 500 finance-domain prompts spanning 12 task types, run against Claude Opus, Sonnet, and GPT-4 with three samples each. Error rates are averaged across samples. The full prompt set and reference answers are linked from methodology.
Why are arithmetic errors so common in some models?
Models that can't call a calculator tool make arithmetic errors at 5-15% on multi-step problems — even on simple decimal arithmetic. Models with tool-use enabled (function calling) drop to <1%. The taxonomy distinguishes 'no-tool' from 'tool-use' error rates.
Are these error rates stable over time?
No — every major model release shifts the error profile. The taxonomy's measurements are timestamped to the asOfDate; they should be re-run with each major model release. The methodology page commits to quarterly refreshes.
How do I use the taxonomy in my own evals?
Pick the error categories most relevant to your task and write prompts that probe each. E.g., for a 10-K summarization agent, factual + arithmetic + refusal are the high-stakes categories; formatting matters less. The taxonomy gives you the structure, not the specific prompts — those depend on your domain.
Related deep dive
All articles →
Read further
Long-form context behind the tool output.
- Methodology · Opinion·8 min
The 5 Failure Modes of LLM Trading Agents (2026)
The 5 recurring failure modes in retail LLM trading agents: price-blind leaks, numeric fabrication, prompt drift, token runaway, audit amnesia.
Read
- Methodology · Opinion·11 min
Finance MCP Servers: The Security Baseline
An opinionated rubric for grading 2026 finance MCP servers on scope, auth, idempotency, transport, and schema — plus the failure modes that kill agents.
Read
- Methodology · Opinion·10 min
Prompt Injection Attack Catalog for Finance Agents
Prompt injection attacks on finance agents — indirect injection via news feeds, tool-result poisoning, prompt exfiltration, unit confusion — plus defenses.
Read
Complementary tools
Users of this tool often explore
Hallucination Detector
Paste a source document + an LLM's extraction. Every numeric claim in the output is checked against the source. Client-side. Catches silent fabrication.
Structured Schema Validator for Finance
Paste LLM JSON output and validate against four pre-built finance schemas — research output, trade decision, risk snapshot, peer comparison — with sanity checks.
Price-Blind Research Auditor
Paste a research prompt or agent context bundle. The auditor flags price numbers, directional words, and outcome-leaking phrases that compromise LLM price-blindness.