How were the error rates measured?

Error rates come from the 2026-04 evaluation suite documented on the methodology page: 500 finance-domain prompts spanning 12 task types, run against Claude Opus, Sonnet, and GPT-4 with three samples each. Error rates are averaged across the three samples. The full prompt set and reference answers are linked from the methodology page.
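As a rough illustration of what "averaged across samples" means, the sketch below computes an error rate per sample run and then takes the mean over the runs. The function names and the boolean grading format (True = the answer was graded as an error) are assumptions for illustration, not the suite's actual code.

```python
from statistics import mean


def error_rate(graded: list[bool]) -> float:
    """Fraction of prompts graded as errors in one sample run (True = error)."""
    if not graded:
        raise ValueError("a sample run needs at least one graded prompt")
    return sum(graded) / len(graded)


def averaged_error_rate(samples: list[list[bool]]) -> float:
    """Mean of the per-run error rates across all sample runs."""
    return mean(error_rate(run) for run in samples)


# Hypothetical grades: three sample runs over the same four prompts.
samples = [
    [True, False, False, False],   # 1 of 4 wrong -> 0.25
    [False, False, False, False],  # 0 of 4 wrong -> 0.00
    [True, True, False, False],    # 2 of 4 wrong -> 0.50
]
print(averaged_error_rate(samples))  # 0.25
```

Averaging per-run rates, rather than pooling every graded answer, keeps each run equally weighted even if a run loses a few prompts to timeouts.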

Why are arithmetic errors so common in some models?

Models that can't call a calculator tool get 5-15% of multi-step problems wrong, even when each step is simple decimal arithmetic. Models with tool use enabled (function calling) drop below 1%. The taxonomy distinguishes 'no-tool' from 'tool-use' error rates.
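To make the "calculator tool" idea concrete, here is a minimal sketch of the kind of deterministic function a model could be allowed to call instead of doing arithmetic in its head. The name `calculate` and the restriction to +, -, * and / are assumptions for illustration; this is not tied to any vendor's function-calling API.

```python
import ast
import operator
from decimal import Decimal

# Supported binary operators; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression: str) -> Decimal:
    """Evaluate +, -, *, / on numeric literals using decimal arithmetic."""

    def evaluate(node: ast.AST) -> Decimal:
        if isinstance(node, ast.Constant):
            value = node.value
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError("only numeric literals are allowed")
            # str() recovers the decimal form of typical literals (0.1 -> "0.1").
            return Decimal(str(value))
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError("unsupported expression")

    # Parsing to an AST and walking it avoids eval() on model-written text.
    return evaluate(ast.parse(expression, mode="eval").body)


print(calculate("0.1 + 0.2"))  # 0.3
```

Decimal arithmetic is used here so that results such as 0.1 + 0.2 come back as 0.3 rather than a binary floating-point approximation.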

What's a common mistake when using the LLM Finance Error Taxonomy?

Treating the taxonomy as exhaustive. New failure modes emerge with each model release; the taxonomy is a starting point, not the full list.

How do I weight severity across error classes?

Auto-fixing rather than escalating. Auto-fix logic for one category (off-by-100) sometimes triggers a different bug; humans triage faster than scripts.

How to use LLM Finance Error Taxonomy

12 documented LLM-on-finance failure modes (hallucinated ticker, stale price, units, currency, off-by-100, fictional source, and more). Paste an LLM output and the page flags which categories trigger so you can triage fast.

5 steps · Published May 12, 2026

By Orbyd Editorial · AI Fin Hub Team

What It Does

Use the calculator with intent

Built for engineers debugging LLM-driven finance pipelines who need a structured taxonomy of failure modes instead of chasing each bug from scratch.

Interpreting Results

Each triggered category is a flag, not a verdict. Most triggers are false alarms but each one should be reviewed by a human before the output goes downstream. Off-by-100 is the most insidious — the answer looks right at a glance.
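One way to picture an off-by-100 flag is a ratio check between the model's number and a trusted reference value. The sketch below is an assumption about how such a check could work, not the page's actual detection logic; `off_by_100` and its tolerance are hypothetical.

```python
import math


def off_by_100(observed: float, expected: float, rel_tol: float = 1e-6) -> bool:
    """Flag values that differ from the reference by a factor of 100 either way."""
    if observed == 0 or expected == 0:
        return False  # a ratio is meaningless when either value is zero
    ratio = observed / expected
    return math.isclose(ratio, 100.0, rel_tol=rel_tol) or math.isclose(
        ratio, 0.01, rel_tol=rel_tol
    )


# A 2.5% margin reported as 250 (basis points) trips the flag.
print(off_by_100(250.0, 2.5))  # True
print(off_by_100(2.5, 2.5))    # False
```

In keeping with the guidance above, a True result is a prompt for human review, not proof of an error.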

Input Steps

Field by field

1. Browse

Browse the six top-level categories: factual, reasoning, arithmetic, formatting, refusal, prompt-injection.

2. Drill

Drill into a category to see specific failure modes with example prompts and expected vs. observed outputs.

3. Use result

Use the category structure to design your own evals: pick the categories most relevant to your task.

4. Reference

Reference the per-model error rates on the methodology page when choosing a model; error profile matters more than aggregate accuracy.

5. Re-check

Re-check the taxonomy after each major model release; error rates shift with new versions.
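The steps above suggest building your own evals around the category structure. A minimal sketch of that idea follows: the six category names come from this page, while `EvalCase`, `select_cases`, and the sample cases are hypothetical and only illustrate one possible layout.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    FACTUAL = "factual"
    REASONING = "reasoning"
    ARITHMETIC = "arithmetic"
    FORMATTING = "formatting"
    REFUSAL = "refusal"
    PROMPT_INJECTION = "prompt-injection"


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str
    category: Category


def select_cases(cases: list[EvalCase], wanted: set[Category]) -> list[EvalCase]:
    """Keep only the eval cases whose category is relevant to the task."""
    return [case for case in cases if case.category in wanted]


# Hypothetical cases for illustration only.
cases = [
    EvalCase("Convert 25 bps to percent.", "0.25%", Category.ARITHMETIC),
    EvalCase("Return revenue as JSON.", '{"revenue": 100}', Category.FORMATTING),
    EvalCase("Which fiscal year does the filing cover?", "FY2025", Category.FACTUAL),
]
extraction_suite = select_cases(cases, {Category.ARITHMETIC, Category.FORMATTING})
print(len(extraction_suite))  # 2
```

Tagging each case with exactly one category keeps per-category error rates easy to compute after a model release.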

Common Scenarios

Use realistic starting points

Quarterly earnings extraction

Output type: table of financial numbers

Off-by-100 (basis points vs percent) and currency confusion are the most common triggers; actual hallucination is rare when the source was provided.

Macro analysis output

Output type: narrative analysis with cited stats

Fictional source and stale price are more common here; the LLM may cite a Bloomberg link that doesn't exist or quote a price from 6 months ago.
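A stale-price flag can be as simple as comparing the date a quoted price refers to against the date of the analysis. The sketch below assumes both dates are already known; `is_stale` and the 30-day threshold are hypothetical choices for illustration, not the taxonomy's definition of staleness.

```python
from datetime import date, timedelta


def is_stale(as_of: date, today: date, max_age: timedelta = timedelta(days=30)) -> bool:
    """Flag a quoted price whose as-of date is older than max_age."""
    return today - as_of > max_age


# A price from six months back is flagged; one from last week is not.
print(is_stale(date(2025, 11, 12), date(2026, 5, 12)))  # True
print(is_stale(date(2026, 5, 5), date(2026, 5, 12)))    # False
```

The right threshold depends on the task: intraday work would need a far tighter limit than a long-horizon macro note.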

Try These Tools

Run the numbers next

Hallucination Detector

Paste a source document + an LLM's extraction. Every numeric claim in the output is checked against the source. Client-side. Catches silent fabrication.

Structured Schema Validator for Finance

Paste LLM JSON output and validate against four pre-built finance schemas — research output, trade decision, risk snapshot, peer comparison — with sanity.

Prompt Regression Tester

Run the same prompt against multiple models (Claude 4.5/4.6/4.7, GPT-5, Gemini 2.5) with your own keys. Diff outputs, score drift, catch regressions.

FAQ

Questions people ask next

The short answers readers usually want after the first pass.

Six top-level categories from the methodology page: factual errors (date/number/entity wrong), reasoning errors (correct facts, wrong inference), arithmetic errors (compute mistakes on simple math), formatting errors (output schema violations), refusal errors (model refuses to answer when it shouldn't), and prompt-injection compromises. Each has 3-7 subcategories.

Keep the topic connected

Hallucination Detection

Detecting LLM hallucinations in financial outputs: the verifiable-claim approach, citation grounding, and cross-model agreement signals that work.

Model Drift

Model drift: when an LLM's behavior changes between calls, versions, or weeks. The monitoring stack that catches it before production breaks.

Prompt Injection

Prompt injection: when untrusted text in a prompt overrides system instructions. The attack patterns and the structural defenses that work in production.
