
LLM Finance Error Taxonomy

12 documented LLM-on-finance failure modes (hallucinated ticker, stale price, units, currency, off-by-100, fictional source, and more). Paste an output and see which flags are raised.

Inputs: Paste + configure
Runtime: 1–15 s
Privacy: Client-side · no upload
API key: Not required
Methodology: Open →

Education · Not investment advice. BaFin/EU framework. Past performance does not indicate future results.

LLM output

Ground truth (optional)

Failure modes detected

6 of 12 documented failure modes. 6 total flags raised. Heuristic detection — review each flag manually.
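
A minimal sketch of how surface-pattern flagging like this can be structured, assuming a plain pattern-to-severity table in Python (the patterns, severities, and names below are illustrative, not the tool's actual rules):

```python
import re
from dataclasses import dataclass

@dataclass
class Flag:
    mode: str       # failure mode name, e.g. "Stale price"
    severity: str   # "low" | "medium" | "high"
    message: str

# Illustrative surface patterns only; the full taxonomy documents 12 modes.
CHECKS = [
    ("Stale price", "high", re.compile(r"as of my (last|latest) update", re.I)),
    ("Off-by-100 magnitude", "medium", re.compile(r"\b\d{3,}(\.\d+)?\s*%")),
]

def detect(text: str) -> list[Flag]:
    """Run each surface-pattern check and collect flags for manual review."""
    return [
        Flag(mode, severity, f"Matched: {m.group(0)!r}")
        for mode, severity, rx in CHECKS
        if (m := rx.search(text))
    ]
```

A flag here means only that a pattern matched; as noted above, every flag still needs a manual review.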

Flagged modes

  • Hallucinated ticker

    1 flag

    Fabricated stock symbol or one that does not exist on the implied exchange.

    • [low] Symbol "XYZW" is referenced as a ticker but is not in the verified universe.

Remediation: Constrain to a verified ticker list and reject unknown symbols (sketch below).

  • Stale price

    1 flag

The quoted price appears to come from training data rather than a live source; telltale signs are phrases like 'as of my last update' or suspiciously old dates.

    • [high] Phrase: "As of my last update"

Remediation: Require live data tool calls; reject answers with 'as of my knowledge' caveats (sketch below).

  • Time-zone error

    1 flag

Market times reported in the wrong time zone (e.g. 09:30 CET for the NYSE open).

    • [high] NYSE opens at 09:30 ET, not CET/GMT

Remediation: Convert all times to UTC before display and require time-zone tags (sketch below).

  • Off-by-100 magnitude

    1 flag

    Decimals shifted: 5% reported as 0.05% or 500%.

    • [medium] Three-digit percentage for return/yield (likely off-by-100)

Remediation: Sanity-check ranges (return |x| < 100, ratios within plausible bands) (sketch below).

  • Fictional source

    1 flag

Citation to a paper, page, or filing that does not exist or is otherwise fabricated.

    • [low] Citation "Bloomberg article" appears without a verifiable URL nearby.

Remediation: Require URLs that retrieve OK and DOIs that resolve before citing (sketch below).

  • Wrong period

    1 flag

    Quarterly figures presented as annual (or vice versa) without conversion.

    • [medium] Q-period figures presented as annual

Remediation: Tag every figure with its period; reject mixed-period aggregations (sketch below).
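
Minimal sketches of the remediations above follow. Each is an illustration under assumed data shapes and names, not the tool's implementation. First, constraining output to a verified ticker list:

```python
import re

# Hypothetical verified universe; in practice this comes from an exchange reference file.
VERIFIED_TICKERS = {"AAPL", "MSFT", "SAP", "ASML"}

def unknown_tickers(text: str) -> set[str]:
    """Return ticker-like symbols that are not in the verified universe."""
    # Crude candidate extraction; a real check would also exclude currency codes and common words.
    candidates = set(re.findall(r"\b[A-Z]{2,5}\b", text))
    return candidates - VERIFIED_TICKERS

assert "XYZW" in unknown_tickers("Buy XYZW and MSFT")
```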
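
For the stale-price remediation, a sketch of a gate that rejects answers carrying staleness caveats or lacking a live-data tool call (the phrase list and the used_live_quote_tool flag are assumptions):

```python
import re

STALE_PHRASES = re.compile(
    r"as of my (last|latest) (update|knowledge)|knowledge cutoff", re.IGNORECASE
)

def require_live_price(answer: str, used_live_quote_tool: bool) -> str:
    """Reject price answers that were not backed by a live data tool call."""
    if STALE_PHRASES.search(answer) or not used_live_quote_tool:
        raise ValueError("Rejected: no live price source confirmed.")
    return answer
```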
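
For the time-zone remediation, a sketch using Python's zoneinfo to attach an explicit zone and display the time in UTC:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def as_utc_label(naive_local: datetime, tz_name: str) -> str:
    """Attach the source time zone, convert to UTC, and return a tagged string."""
    aware = naive_local.replace(tzinfo=ZoneInfo(tz_name))
    return aware.astimezone(ZoneInfo("UTC")).strftime("%Y-%m-%d %H:%M UTC")

# NYSE open stated in its own zone, rendered unambiguously:
print(as_utc_label(datetime(2025, 3, 3, 9, 30), "America/New_York"))  # 2025-03-03 14:30 UTC
```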
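
For off-by-100 magnitudes, the range sanity check can be as small as this (the plausibility band is an assumption and should be tuned per metric):

```python
def plausible_return_pct(value: float) -> bool:
    """Flag returns or yields whose magnitude suggests a shifted decimal point."""
    return abs(value) < 100  # assumed band; single-period returns rarely reach triple digits

for v in (5.0, 500.0):
    print(f"{v}% ->", "ok" if plausible_return_pct(v) else "likely off-by-100")
```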
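
For fictional sources, a sketch of the "URL must retrieve OK" half of the remediation using only the standard library (DOI resolution would follow the same pattern against doi.org):

```python
import urllib.request

def url_retrieves_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the cited URL responds with a non-error status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except Exception:
        return False
```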
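
And for wrong-period figures, a sketch of period tagging that refuses mixed-period aggregation (the Figure shape and period labels are assumptions):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Figure:
    value: float
    period: str  # e.g. "Q1-2025" or "FY-2024"; every figure carries its period

def sum_same_period(figures: list[Figure]) -> float:
    """Aggregate only figures that share one period; reject mixed-period sums."""
    periods = {f.period for f in figures}
    if len(periods) != 1:
        raise ValueError(f"Mixed periods cannot be aggregated: {sorted(periods)}")
    return sum(f.value for f in figures)
```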

No flags raised for

Ratio mistake · Units error · Currency error · Wrong split-adjustment · Wrong tax bracket / rate · Double-counted dividend

Absence of a flag does not mean the answer is correct — only that no surface pattern matched. See methodology for the full taxonomy.

How to use

Step-by-step

Full calculator guide →
  1. Browse the six top-level categories: factual, reasoning, arithmetic, formatting, refusal, prompt-injection.

  2. Drill into a category to see specific failure modes with example prompts and expected vs. observed outputs.

  3. Use the category structure to design your own evals: pick the categories most relevant to your task.

  4. Reference the per-model error rates on the methodology page when choosing a model — error profile matters more than aggregate accuracy.

  5. Re-check the taxonomy after each major model release; error rates shift with new versions.

Questions people ask next

FAQ

What categories of LLM error does the taxonomy cover?

Six top-level categories from the methodology page: factual errors (date/number/entity wrong), reasoning errors (correct facts, wrong inference), arithmetic errors (compute mistakes on simple math), formatting errors (output schema violations), refusal errors (model refuses to answer when it shouldn't), and prompt-injection compromises. Each has 3-7 subcategories.

How were the error rates measured?

From the 2026-04 evaluation suite documented on the methodology page: 500 finance-domain prompts spanning 12 task types, run against Claude Opus, Sonnet, and GPT-4 with three samples each. Error rates are averaged across samples. The full prompt set and reference answers are linked from methodology.

Why are arithmetic errors so common in some models?

Models that can't call a calculator tool make arithmetic errors at 5-15% on multi-step problems — even on simple decimal arithmetic. Models with tool-use enabled (function calling) drop to <1%. The taxonomy distinguishes 'no-tool' from 'tool-use' error rates.

Are these error rates stable over time?

No — every major model release shifts the error profile. The taxonomy's measurements are timestamped to the asOfDate; they should be re-run with each major model release. The methodology page commits to quarterly refreshes.

How do I use the taxonomy in my own evals?

Pick the error categories most relevant to your task and write prompts that probe each. E.g., for a 10-K summarization agent, factual + arithmetic + refusal are the high-stakes categories; formatting matters less. The taxonomy gives you the structure, not the specific prompts — those depend on your domain.
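
As a rough sketch of that advice, a category-weighted eval plan for the 10-K example could be laid out like this (category names follow the taxonomy; the weights and prompts are placeholders you would replace for your own task):

```python
# Hypothetical eval plan for a 10-K summarization agent; only the high-stakes categories are scored.
EVAL_PLAN = {
    "factual":    {"weight": 0.4, "prompts": ["Summarize FY2024 revenue by segment from this excerpt ..."]},
    "arithmetic": {"weight": 0.3, "prompts": ["Compute year-over-year revenue growth from these figures ..."]},
    "refusal":    {"weight": 0.3, "prompts": ["Answer from the filing only; this question is out of scope ..."]},
}

def weighted_error_rate(per_category_error: dict[str, float]) -> float:
    """Combine per-category error rates (0..1) using the plan's weights."""
    return sum(spec["weight"] * per_category_error[cat] for cat, spec in EVAL_PLAN.items())
```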

Complementary tools

Planning estimates only — not financial, tax, or investment advice.