Methodology · Tool · Last updated 2026-05-08
How LLM Finance Error Taxonomy works
The 12 documented failure modes detected by the LLM Finance Error Taxonomy tool.
The 12 modes
- Hallucinated ticker — fabricated or non-existent symbol on the implied exchange.
- Stale price — quote sourced from training data, not live data.
- Ratio mistake — financial ratio formula corrupted (e.g. P/E inverted).
- Units error — percentage where decimal expected, basis points where percent expected, shares where dollars expected.
- Currency error — wrong currency or missing currency tag.
- Time-zone error — market times reported in wrong tz (NYSE in CET, LSE in ET).
- Off-by-100 magnitude — decimal/percent confusion (5% as 500% or 0.05%).
- Fictional source — citation to a paper, page, or filing that doesn't exist.
- Wrong period — quarterly figures presented as annual without conversion.
- Wrong split-adjustment — pre/post-split prices mixed in the same calculation.
- Wrong tax bracket / rate — bracket from a different jurisdiction or year applied.
- Double-counted dividend — dividend included both in total return and as a separate income line.
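Several of these modes are plain arithmetic slips, so a small worked example may help make them concrete. The sketch below uses invented numbers (not real quotes or filings), and the helper names `pe_ratio` and `total_return` are hypothetical, introduced only for illustration.

```python
# Illustrative numbers only; they do not come from any real quote or filing.

def pe_ratio(price: float, eps: float) -> float:
    """Price-to-earnings: share price divided by earnings per share."""
    return price / eps


def total_return(start: float, end: float, dividends: float) -> float:
    """Total return over a period, with dividends already included."""
    return (end - start + dividends) / start


# Ratio mistake: inverting P/E gives the earnings yield, not P/E.
correct_pe = pe_ratio(price=50.0, eps=2.5)  # 20.0
inverted_pe = 2.5 / 50.0                    # 0.05

# Off-by-100 magnitude: 5% must become 0.05 before it is used as a multiplier.
rate = 5.0 / 100
interest_ok = 1_000 * rate   # 50.0
interest_bad = 1_000 * 5.0   # 5000.0, percent used as if it were a decimal

# Wrong period: a quarterly figure is not an annual one.
quarterly_revenue = 25.0
annual_run_rate = quarterly_revenue * 4  # 100.0, under a simple run-rate assumption

# Double-counted dividend: total return already contains the dividend.
tr = total_return(start=100.0, end=110.0, dividends=4.0)  # 0.14
double_counted = tr + 4.0 / 100.0  # roughly 0.18, overstated by the dividend yield
```

In each pair the first value is the intended figure and the second is the kind of output the corresponding mode describes.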
Detection approach
The tool runs a set of regex and lexical heuristics against pasted model output. Each match raises a flag with low, medium, or high confidence. Heuristics are deliberately recall-biased: false positives are tolerable because a reviewer can dismiss a flag quickly, whereas false negatives are caught only by human review. The tool is a screen, not a verdict.
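As a rough illustration of this approach, the sketch below pairs each regex with a failure mode and a confidence level, then scans a text for matches. The rule set, the `Heuristic` and `Flag` types, and the `scan` function are assumptions made for this example, not the tool's actual implementation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Heuristic:
    mode: str            # failure mode the pattern screens for
    pattern: re.Pattern  # compiled regex
    confidence: str      # "low", "medium" or "high"


@dataclass(frozen=True)
class Flag:
    mode: str
    confidence: str
    matched_text: str


# Hypothetical rules for illustration; not the tool's real rule set.
HEURISTICS = [
    # Percentages with three or more integer digits are often, but not always, wrong.
    Heuristic("Off-by-100 magnitude",
              re.compile(r"\b\d{3,}(?:\.\d+)?\s?%"), "low"),
    # NYSE hours quoted in Central European time within one sentence.
    Heuristic("Time-zone error",
              re.compile(r"\bNYSE\b[^.\n]*\b(?:CET|CEST)\b"), "medium"),
]


def scan(text: str) -> list[Flag]:
    """Return one flag per regex match, in rule order."""
    flags = []
    for heuristic in HEURISTICS:
        for match in heuristic.pattern.finditer(text):
            flags.append(Flag(heuristic.mode, heuristic.confidence, match.group(0)))
    return flags


sample = "The NYSE opens at 15:30 CET. The fund yields 500% annually."
for flag in scan(sample):
    print(flag.confidence, flag.mode, repr(flag.matched_text))
```

A legitimate three-digit percentage (a stock that really did rise 150%) would also be flagged here, which is why that rule carries low confidence in this sketch.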
Why heuristics, not an LLM judge
Using an LLM to grade an LLM is circular: the judge can share the failure modes it is meant to catch. The point of this taxonomy is that the failure modes are mechanical and pattern-detectable; a regex that looks for "as of my last update" is a more honest signal than asking another model whether the price is stale.
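A minimal sketch of such a stale-price check follows. The phrase list and the function name `stale_price_signals` are hypothetical; the tool's actual patterns may differ.

```python
import re

# Hypothetical phrase list; chosen for illustration only.
STALE_PHRASES = re.compile(
    r"as of my (?:last|latest|most recent) (?:update|training data)"
    r"|my (?:knowledge|training) cut-?off"
    r"|(?:do not|don't) have access to (?:real-time|live) (?:data|prices|quotes)",
    re.IGNORECASE,
)


def stale_price_signals(text: str) -> list[str]:
    """Return every phrase suggesting the figure comes from training data."""
    return [match.group(0) for match in STALE_PHRASES.finditer(text)]


print(stale_price_signals("As of my last update, the share price was $150."))
print(stale_price_signals("The share price closed at $150 today."))
```

The check is transparent: a reviewer can read the pattern and see exactly why a flag was raised, which is not possible with a second model's judgement.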
References
- Ji, Z. et al. (2023). "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys 55(12): 1–38. DOI: 10.1145/3571730.
- Vasarhelyi, M. A. et al. (2024). "Large language models in financial reporting: A taxonomy of risks." Journal of Emerging Technologies in Accounting 21(1): 1–18.
- Anthropic (2025). "Constitutional Classifiers." Technique for catching adversarial prompts.
Limitations
- Heuristics miss subtle errors, such as figures that are correctly formatted but numerically wrong.
- Ticker check uses a verified universe of ~50 large-caps; genuine small-cap tickers outside that list will be falsely flagged.
- Detection assumes English text; non-English outputs need localised regex.
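The ticker limitation can be seen in a short sketch. It assumes tickers appear either with a leading "$" or in parentheses, and it uses a six-symbol stand-in for the verified universe; `VERIFIED_TICKERS`, `TICKER_PATTERN`, and `unverified_tickers` are hypothetical names, not the tool's own.

```python
import re

# Hypothetical stand-in for the verified universe (the real list has ~50 names).
VERIFIED_TICKERS = frozenset({"AAPL", "MSFT", "GOOGL", "AMZN", "NVDA", "JPM"})

# Assumed notation: "$AAPL" or "(AAPL)". Other notations are not covered.
TICKER_PATTERN = re.compile(r"\$([A-Z]{1,5})\b|\(([A-Z]{1,5})\)")


def unverified_tickers(text: str) -> list[str]:
    """Return symbols found in the text that are not in the verified universe."""
    found = []
    for match in TICKER_PATTERN.finditer(text):
        symbol = match.group(1) or match.group(2)
        if symbol not in VERIFIED_TICKERS and symbol not in found:
            found.append(symbol)
    return found


# ZYNX is assumed to be fabricated for this example; CROX is a real listing
# that simply falls outside the small universe above.
text = "Apple (AAPL) and Zyntrix (ZYNX) both rose, and so did $CROX."
print(unverified_tickers(text))
```

Both ZYNX and CROX are returned, because a membership check against a small list cannot tell an invented symbol from a real one it has never seen. A parenthesised currency code such as "(USD)" would also match this pattern, so a fuller version would need an exclusion list.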