"The 10-year Treasury bond pays a 4.25% coupon, so its yield is 4.25%." This is the textbook bond-pricing confusion, and current-generation LLMs reproduce it because the training corpus uses coupon and yield loosely. The LLM Finance Error Taxonomy ships a 12-mode catalogue, and the closest entry for this failure is ratio_mistake, a mis-stated financial relationship. The taxonomy does not classify text itself: it returns the 12 named modes and an (initially empty) flags array, and the caller's pipeline does the matching. This article walks the 12 modes against finance-specific examples and gives a 30-line classifier for production use.

TL;DR

  • The bond-yield trap: LLMs frequently equate coupon rate with yield-to-maturity. They are equal only at par.
  • The LLM Finance Error Taxonomy catalogues 12 production-relevant error modes.
  • The 12 modes cover: hallucinated_ticker, stale_price, ratio_mistake, units_error, currency_error, and 7 more.
  • Each mode has a label, a description, and a remediation pattern.
  • A 30-line classifier wired into the pipeline catches roughly 70-80% of these errors before they propagate.

The textbook failure

The example sentence: "The 10-year Treasury bond pays a 4.25% coupon, so its yield is 4.25%."

This is correct only when the bond trades at par. When the same bond, with 10 years to maturity, trades at 98 ($980 per $1,000 face), the yield-to-maturity is roughly 4.49%. When it trades at 102, YTM is roughly 4.01%. The standard approximation, with Coupon as the annual coupon payment in dollars, is:

YTM ≈ (Coupon + (Face − Price) / Years) / ((Face + Price) / 2)
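As a quick check on the arithmetic, the approximation can be written as a small function. This is a minimal sketch: the function name, the $1,000 face default, and the 10-year maturity are illustrative assumptions, and the formula is the approximation quoted above rather than an exact discounted-cash-flow solve.

```python
def approx_ytm(coupon_rate: float, price: float, years: float, face: float = 1000.0) -> float:
    """Approximate yield-to-maturity as a decimal (0.0425 means 4.25%)."""
    annual_coupon = coupon_rate * face      # dollar coupon paid per year
    pull_to_par = (face - price) / years    # annualised gain or loss as price converges to face
    average_value = (face + price) / 2      # midpoint of face value and purchase price
    return (annual_coupon + pull_to_par) / average_value


# Discount, par, and premium prices for the same 4.25% coupon bond.
for price in (980.0, 1000.0, 1020.0):
    print(f"price {price:7.2f} -> approx YTM {approx_ytm(0.0425, price, 10):.2%}")
```

The loop prints a yield above the coupon for the discount price, equal to the coupon at par, and below it for the premium price, which is the whole point of the trap.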

For an LLM to give a yield, it must know or assume the current bond price. Without a price input, the only defensible response is "I cannot compute yield without the current bond price; coupon rate is 4.25%." The conflation pattern is the LLM substituting "coupon rate" for "yield" because the training corpus treats them as near-synonyms in informal contexts.

The LLM Finance Error Taxonomy does not auto-classify the sentence — for this input the engine returns an empty flags array and the full catalogue. The closest catalogue entry is ratio_mistake ("standard financial ratio formula is mis-stated"), whose remediation is "cross-check ratio formulas against a fixed glossary before publishing." The caller's pipeline matches the sentence to that mode and applies a price-or-refuse rule.
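One way the caller's matching step could look for this specific conflation is sketched below. The regular expressions, the price-word list, and the function name are hypothetical and are not part of the taxonomy. The heuristic flags text that states the same percentage for coupon and yield without mentioning price or par.

```python
import re
from typing import Optional

# Hypothetical patterns: a percentage tied to "coupon", a percentage tied to "yield",
# and any wording that suggests the author considered the bond's price.
COUPON_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%\s+coupon", re.IGNORECASE)
YIELD_RE = re.compile(r"yield\s+(?:is|of|at)\s+(\d+(?:\.\d+)?)\s*%", re.IGNORECASE)
PRICE_RE = re.compile(r"\b(?:price|priced|trades|trading|par)\b", re.IGNORECASE)


def coupon_yield_conflation(text: str) -> Optional[dict]:
    """Return a ratio_mistake flag when coupon and yield are equated with no price context."""
    coupon = COUPON_RE.search(text)
    stated_yield = YIELD_RE.search(text)
    if coupon is None or stated_yield is None:
        return None
    same_number = float(coupon.group(1)) == float(stated_yield.group(1))
    if same_number and PRICE_RE.search(text) is None:
        start = min(coupon.start(), stated_yield.start())
        end = max(coupon.end(), stated_yield.end())
        return {"mode": "ratio_mistake", "evidence": text[start:end]}
    return None
```

A heuristic this narrow misses paraphrases, so it is best treated as one signal among several and not as a complete detector.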

The 12 modes

The taxonomy returns 12 production-relevant error modes, each with an ID, label, description, and remediation. The full list:

  1. hallucinated_ticker: Fabricated stock symbol or one that does not exist on the implied exchange. Remediation: constrain to a verified ticker list and reject unknown symbols.
  2. stale_price: Price quoted appears to come from training data, not a live source; phrases like "as of my last update" or specific old-looking dates. Remediation: require live data tool calls; reject answers with "as of my knowledge" caveats.
  3. ratio_mistake: A standard financial ratio formula is mis-stated (e.g. "P/E = price × earnings", or coupon equated with yield). Remediation: cross-check ratio formulas against a fixed glossary before publishing.
  4. units_error: Percentage where decimal expected, basis points where percent expected, or shares where dollars expected. Remediation: strict input/output schemas with explicit units; validate after generation.
  5. currency_error: Numbers presented in the wrong currency or without currency markers (e.g. EUR for USD). Remediation: force explicit currency annotation; cross-check against the ticker exchange.
  6. timezone_error: Market times reported in the wrong timezone (e.g. 09:30 CET for NYSE open). Remediation: convert all times to UTC before display and require timezone tags.
  7. magnitude_off_100: Decimals shifted, e.g. 5% reported as 0.05% or 500%. Remediation: sanity-check ranges (returns |x| < 100, ratios within plausible bands).
  8. fictional_source: Citation to a paper, page, or filing that doesn't exist or is fabricated. Remediation: require URLs that retrieve OK and DOIs that resolve before citing.
  9. wrong_period: Quarterly figures presented as annual (or vice versa) without conversion. Remediation: tag every figure with its period; reject mixed-period aggregations.
  10. wrong_split_adj: Pre-split prices used in calculations involving post-split share count, or vice versa. Remediation: always work with adjusted close; verify against vendor-published adjustment factors.
  11. wrong_tax_bracket: A tax bracket from a different jurisdiction or year applied to the calculation. Remediation: annotate tax tables with year + jurisdiction; reject answers without year stamps.
  12. double_count_dividend: A dividend included in both total return and as a separate income line, counted twice. Remediation: choose one convention (price vs total return) and stick to it.
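For readers wiring the list into code, the catalogue might be held as plain dictionaries keyed by mode ID. The sketch below is an assumption about shape only: the engine's actual payload is not shown in this article, the label and description fields are omitted, and the remediation strings are condensed from the list above.

```python
# Hypothetical in-memory form of the 12-mode catalogue (remediations condensed).
CATALOGUE = [
    {"id": "hallucinated_ticker", "remediation": "constrain to a verified ticker list"},
    {"id": "stale_price", "remediation": "require live data tool calls"},
    {"id": "ratio_mistake", "remediation": "cross-check ratio formulas against a fixed glossary"},
    {"id": "units_error", "remediation": "strict schemas with explicit units"},
    {"id": "currency_error", "remediation": "force explicit currency annotation"},
    {"id": "timezone_error", "remediation": "convert times to UTC and require timezone tags"},
    {"id": "magnitude_off_100", "remediation": "sanity-check ranges"},
    {"id": "fictional_source", "remediation": "require URLs that retrieve and DOIs that resolve"},
    {"id": "wrong_period", "remediation": "tag every figure with its period"},
    {"id": "wrong_split_adj", "remediation": "work with adjusted close"},
    {"id": "wrong_tax_bracket", "remediation": "annotate tax tables with year and jurisdiction"},
    {"id": "double_count_dividend", "remediation": "choose one return convention"},
]

# Index by ID so a flag can be turned into its remediation in one lookup.
BY_ID = {mode["id"]: mode for mode in CATALOGUE}


def remediation_for(mode_id: str) -> str:
    """Look up the remediation text for a flagged mode; raises KeyError for unknown IDs."""
    return BY_ID[mode_id]["remediation"]
```

Raising on unknown IDs is deliberate in this sketch: a flag that names a mode outside the catalogue is itself a bug worth surfacing.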

How the engine uses it

The engine takes a text string and returns the 12-mode catalogue together with a flags array, which is meant to list the modes the text triggers with per-flag explanations. The engine itself does not auto-classify, so for the bond-yield example the flags array comes back empty. The classification step happens in the caller's pipeline, using the catalogue's signals.

A typical production loop:

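def classify_errors(llm_output: str, catalogue: list[dict]) -> list[dict]:
    """Return one flag per catalogue mode whose pattern matches the output."""
    flags = []
    for mode in catalogue:
        # matches_pattern and extract_evidence are not defined here;
        # the caller's pipeline supplies them with mode-specific logic.
        if matches_pattern(llm_output, mode["id"]):
            flags.append({"mode": mode["id"], "evidence": extract_evidence(llm_output)})
    return flags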

The pattern matching is mode-specific: regex for hallucinated_ticker (against a verified ticker list), key-phrase detection for "as of my last update" (stale_price), and unit-tag enforcement on numeric outputs (units_error). Roughly 70-80% of catalogue modes are detectable with pattern matching plus a verification table [1].
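Two of these mode-specific matchers can be sketched in a few lines. Everything named here is an assumption for illustration: the placeholder ticker set, the cashtag convention ($AAPL), the phrase list, and the function names. Unit-tag enforcement is left out because it works more reliably on structured output than on free text.

```python
import re
from typing import Callable, Optional

VERIFIED_TICKERS = {"AAPL", "MSFT", "JPM"}        # placeholder for a real verified list
CASHTAG_RE = re.compile(r"\$([A-Z]{1,5})\b")      # assumes tickers are written as $AAPL
STALE_PHRASES = ("as of my last update", "as of my knowledge")


def match_hallucinated_ticker(text: str) -> Optional[str]:
    """Return evidence for the first cashtag that is not on the verified list."""
    for symbol in CASHTAG_RE.findall(text):
        if symbol not in VERIFIED_TICKERS:
            return f"unknown ticker ${symbol}"
    return None


def match_stale_price(text: str) -> Optional[str]:
    """Return the first training-cutoff phrase found in the text, if any."""
    lowered = text.lower()
    for phrase in STALE_PHRASES:
        if phrase in lowered:
            return phrase
    return None


# One matcher per mode ID; modes without a matcher are simply not checked here.
MATCHERS: dict[str, Callable[[str], Optional[str]]] = {
    "hallucinated_ticker": match_hallucinated_ticker,
    "stale_price": match_stale_price,
}


def classify_with_matchers(llm_output: str) -> list[dict]:
    """Run every registered matcher and collect a flag for each hit."""
    flags = []
    for mode_id, matcher in MATCHERS.items():
        evidence = matcher(llm_output)
        if evidence is not None:
            flags.append({"mode": mode_id, "evidence": evidence})
    return flags
```

Keeping each matcher as a function that returns its own evidence means the evidence is always specific to the mode that fired.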

The remaining 20-30% require a second LLM call to classify — typically a smaller, cheaper model running a structured-classification prompt against the primary's output. See Hallucination Detection at Scale in Production for the dual-LLM pattern.
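A second-pass classification call could be wired roughly as follows. This is a sketch under stated assumptions: call_model is a hypothetical callable standing in for whatever client the pipeline uses, and the prompt wording and JSON reply format are illustrative, not prescribed by the taxonomy.

```python
import json
from typing import Callable, Optional

MODE_IDS = [
    "hallucinated_ticker", "stale_price", "ratio_mistake", "units_error",
    "currency_error", "timezone_error", "magnitude_off_100", "fictional_source",
    "wrong_period", "wrong_split_adj", "wrong_tax_bracket", "double_count_dividend",
]


def build_prompt(primary_output: str) -> str:
    """Build a structured-classification prompt listing the allowed mode IDs."""
    return (
        "Classify the text against these error modes and reply with a JSON list "
        "of the mode ids that apply (an empty list if none):\n"
        + ", ".join(MODE_IDS)
        + "\n\nText:\n"
        + primary_output
    )


def second_pass_flags(primary_output: str, call_model: Callable[[str], str]) -> Optional[list]:
    """Return the valid mode IDs named by the second model, or None if the reply is unusable."""
    raw = call_model(build_prompt(primary_output))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None                      # unparseable reply: let the caller escalate
    if not isinstance(parsed, list):
        return None
    # Drop anything the model invented that is not a catalogue mode.
    return [mode for mode in parsed if mode in MODE_IDS]
```

Returning None for an unusable reply, and not an empty list, keeps "the classifier failed" distinct from "the classifier found nothing".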

The bond-yield example, end-to-end

For the bond example, the production pipeline would:

  1. Receive the prompt: "What is the yield on the 10-year Treasury bond?"
  2. The primary LLM call does not answer from the coupon alone. After a tool call to fetch the live price, it responds: "Yield depends on current bond price. The current 10-year Treasury price is approximately 98.00 per 100 face, giving a YTM of roughly 4.49% at the 4.25% coupon."
  3. If the primary response omits the price step and quotes 4.25% as yield, the classifier flags ratio_mistake (the coupon-yield relationship is mis-stated).
  4. The pipeline either retries with a price-included prompt or surfaces the flag to the human.
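The four steps above can be sketched as a retry-or-escalate loop. The generate, classify, and fetch_price callables are hypothetical stand-ins for the primary LLM call, the classifier, and the live-price tool. The retry prompt wording is also an assumption.

```python
from typing import Callable


def answer_with_checks(
    question: str,
    generate: Callable[[str], str],
    classify: Callable[[str], list],
    fetch_price: Callable[[], float],
    max_retries: int = 1,
) -> dict:
    """Generate an answer, classify it, and retry with the price included if it is flagged."""
    prompt = question
    answer, flags = "", []
    for attempt in range(max_retries + 1):
        answer = generate(prompt)
        flags = classify(answer)
        if not flags:
            return {"status": "ok", "answer": answer, "flags": []}
        if attempt < max_retries:
            # Rebuild the prompt with the live price so the model cannot fall back on the coupon.
            price = fetch_price()
            prompt = (
                f"{question}\nCurrent price per 100 face: {price:.2f}. "
                "Compute the yield from this price, not from the coupon."
            )
    # Retries exhausted: surface the last answer and its flags to a human.
    return {"status": "needs_review", "answer": answer, "flags": flags}
```

Capping retries matters here: a model that keeps repeating the conflation should reach a reviewer, not loop indefinitely.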

A pipeline without the classifier ships the conflated answer downstream — into a research summary, into a trade decision, into a published article. The cost of the silent error compounds: one mis-quoted yield in a published article costs reader trust permanently; one mis-quoted yield in a trade signal costs the position.

Mode coverage versus other taxonomies

The 12 modes overlap with general LLM safety taxonomies but emphasise finance-specific patterns. Comparing to the more general OpenAI / Anthropic safety mode catalogues [2][3]:

  • General taxonomies cover hallucination, prompt injection, jailbreaks, harmful content.
  • The finance taxonomy adds ratio_mistake, units_error, wrong_period, wrong_split_adj, double_count_dividend — modes that general safety review does not catch because they look like normal financial prose.

Both layers belong in a production stack. The general safety layer prevents catastrophic outputs; the finance taxonomy prevents quietly-wrong financial outputs.

Regulatory context

For published finance content under MiFID II, the suitability guidelines require that any forecast or analysis be supported by evidence and not presented as advice [4]. The quantitative-error modes (ratio_mistake, units_error, wrong_period) directly violate these standards: a published article that conflates coupon and yield is making an unsupported quantitative claim. The classifier is a compliance tool as much as a quality tool.

For content overseen by the US FTC, the standard is "substantiation" of performance claims [5]. The fictional_source and magnitude_off_100 modes are the most damaging patterns in FTC enforcement against algo-trading content: fabricated citations and order-of-magnitude misstatements. The same classifier covers both jurisdictions.

Failure modes (of the taxonomy itself)

  • False positives on technical prose. A correctly-qualified "as of December 2024" date stamp may trigger stale_price. Tune the pattern match to require imprecise dating ("recent," "current," "lately") before flagging.
  • Missing modes for new failure patterns. The catalogue is finite; new LLM behaviours produce new failures. Schedule a quarterly review of production logs to identify uncategorised errors.
  • Over-reliance on the classifier. A pipeline that trusts the classifier 100% will miss the 20-30% of errors the classifier cannot detect. Human review remains the last line.
  • Pattern-matching is brittle to phrasing. Mode signals based on key phrases miss the same error stated differently. Augment with the dual-LLM verification.
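The first bullet's tuning advice, flagging only imprecise dating and letting explicit date stamps pass, might look like the sketch below. The word list follows the bullet and the two date formats are assumptions. A known weakness is that "current" also appears in the legitimate term "current yield", so this would still need an exclusion list in practice.

```python
import re
from typing import Optional

# Vague dating words from the tuning advice, plus the training-cutoff phrases.
VAGUE_DATING_RE = re.compile(
    r"\b(?:recent(?:ly)?|current(?:ly)?|lately|as of my (?:last update|knowledge))\b",
    re.IGNORECASE,
)
# Explicit stamps such as "as of December 2024" or "as of 2024-12-31" (assumed formats).
EXPLICIT_DATE_RE = re.compile(r"\b[Aa]s of (?:[A-Z][a-z]+ \d{4}|\d{4}-\d{2}-\d{2})\b")


def stale_price_signal(text: str) -> Optional[str]:
    """Return the vague dating phrase that triggers stale_price, or None."""
    if EXPLICIT_DATE_RE.search(text):
        return None                      # a precise date stamp is treated as properly qualified
    match = VAGUE_DATING_RE.search(text)
    return match.group(0) if match else None
```

Checking for the explicit stamp first is the design choice that removes the false positive described in the bullet.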

FAQ

Why not just have the LLM check itself?

Self-consistency checks catch some errors but miss others — the model that produced the conflation typically does not flag it on review. A separate classifier with a verification table catches the structural errors. See Hallucination Detection at Scale in Production for the comparative evidence.

Does the catalogue work for German-language finance content?

The 12 modes are language-agnostic in principle. The matching patterns default to English, so German finance content requires German-language patterns (e.g., "Kuponzinssatz" for coupon, "Rendite" for yield, "Stichtag" for cut-off date). The remediation logic is the same.
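A language-keyed pattern table is one simple way to keep the matching logic shared while swapping the vocabulary. The table layout, term keys, and function name below are illustrative assumptions. The German terms are the ones named above.

```python
import re

# Hypothetical per-language patterns for the same three concepts.
TERM_PATTERNS = {
    "en": {
        "coupon": r"\bcoupon\b",
        "yield": r"\byield\b",
        "cutoff_date": r"\bcut-off date\b",
    },
    "de": {
        "coupon": r"\bKupon(?:zinssatz)?\b",
        "yield": r"\bRendite\b",
        "cutoff_date": r"\bStichtag\b",
    },
}


def find_terms(text: str, lang: str) -> set:
    """Return the concept keys whose pattern for the given language appears in the text."""
    patterns = TERM_PATTERNS[lang]
    return {term for term, pattern in patterns.items() if re.search(pattern, text, re.IGNORECASE)}
```

Downstream checks can then work on the concept keys ("coupon", "yield") and stay unaware of which language produced them.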

How often should the catalogue be updated?

Quarterly, paired with a review of production error logs. New LLM behaviours emerge with each model update, so a static catalogue ages quickly. The current 12 modes are stable from 2025 to 2026 in our observation, but specific signals (which phrases trigger which mode) drift faster.

Footnotes

  1. Lopez de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.

  2. Anthropic (2024). "Responsible Scaling Policy and content classification." anthropic.com

  3. OpenAI (2024). "Model Spec and content classification." openai.com

  4. ESMA (2023). "Guidelines on certain aspects of the MiFID II suitability requirements." esma.europa.eu

  5. FTC (2023). "Endorsement Guides and substantiation requirements." ftc.gov

Verified engine output

Bond coupon-yield sentence against the 12-mode catalogue
Inputs
  text: The 10-year Treasury bond pays a 4.25% coupon, so its yield is 4.25%.
  ground_truth: Coupon is 4.25%; current yield depends on market price; yield-to-maturity depends on price plus time to maturity.

Result
  text: The 10-year Treasury bond pays a 4.25% coupon, so its yield is 4.25%.
  modes: (12 items) [...]

Computed live at build time.