TL;DR
No single LLM wins every finance workload. The right pick falls out of four inputs: task capability, latency budget, monthly cost envelope, and context footprint. This article sets out a four-step framework: (1) decompose the task into capability axes, (2) derive latency and cost envelopes from the workload shape, (3) filter the 2026-04 model catalogue against those envelopes using published pricing and context limits, (4) validate shortlist candidates with an in-house eval harness before production. Ten concrete finance scenarios each map to a recommended tier, from Haiku-class quick lookups to Opus-class peer comparison. No accuracy numbers appear anywhere below, because honest accuracy numbers require an eval harness run on the buyer's own documents. Published vendor pricing and documented capabilities only.
The framework in one paragraph
A model-selection decision is a filter over the current catalogue, not a ranking. Decompose the task into capability requirements along six axes, convert workload volume into latency and monthly cost envelopes, then filter the model catalogue to candidates that satisfy all constraints at 2026-04 published prices. The output is usually a shortlist of two or three. Pick the final winner with a small eval harness against the buyer's own documents, not against a public benchmark. The framework deliberately produces a tier (Haiku-class, Sonnet-class, Opus-class, GPT-5-mini-class, Gemini 2.5 Pro-class) rather than a specific model, because vendor pricing and context windows move quarterly and the tier is the durable decision.
The capability taxonomy
Finance workloads decompose cleanly into six capability axes. Each axis maps to model-family strengths that vendors publish directly. None of the statements below claim a measured accuracy rank; they map task requirement to documented capability.
Extraction fidelity
Pulling specific facts out of a specific document without inventing them. MD&A line items, footnote disclosures, restated GAAP reconciliations, guidance tables. The dominant failure mode is hallucination of a plausible-looking number that is not in the source. Models with strong instruction-following and structured-output modes (Claude Sonnet, GPT-5-mini, Gemini 2.5 Flash) are the documented fit. The cheap tier is often the right answer here because the task is narrow and the output is auditable against the source document.
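Because the dominant failure mode is an invented number, a cheap post-hoc audit pays for itself. A minimal sketch of that audit, assuming extracted values arrive as strings and that stripping commas and whitespace is an acceptable normalisation; the function name and example values are illustrative:

```python
import re

def unsupported_figures(source_text: str, extracted_values: list[str]) -> list[str]:
    """Return extracted values that cannot be found verbatim in the source filing."""
    normalised_source = re.sub(r"[,\s]", "", source_text)
    missing = []
    for value in extracted_values:
        needle = re.sub(r"[,\s]", "", value)
        if needle and needle not in normalised_source:
            missing.append(value)  # candidate hallucination: route to human review
    return missing

# e.g. unsupported_figures(mdna_text, ["412.7", "1,083.2", "9.4%"])
```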
Long-context reliability
Needle-in-haystack performance over 100k+ token inputs. Relevant to multi-filing peer comparison, full-transcript earnings-call analysis, and disclosure-diff detection across several years of 10-Ks. Anthropic publishes a 200k context window for the Claude 4 family, with a 1M-token beta tier; OpenAI publishes 400k for GPT-5; Google publishes 2M for Gemini 2.5 Pro [1][2][3]. Published window size and published retrieval accuracy at depth are different things, which is exactly why the framework routes this capability to the eval step.
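An in-house check does not need a public benchmark. A minimal sketch of a needle-at-depth probe, assuming a rough four-characters-per-token heuristic; the filler sentence, needle wording, and depth values are placeholders for the buyer's own documents:

```python
def build_needle_probe(filler_sentence: str, needle: str,
                       total_tokens: int, depth_fraction: float) -> str:
    """Build a synthetic document with one planted fact at a chosen depth."""
    approx_chars = total_tokens * 4                      # crude token-to-character estimate
    filler = (filler_sentence + " ") * (approx_chars // (len(filler_sentence) + 1) + 1)
    body = filler[:approx_chars]
    insert_at = int(approx_chars * depth_fraction)
    return body[:insert_at] + "\n" + needle + "\n" + body[insert_at:]

# One probe per (context size, depth) cell; the model is then asked to quote the needle back.
probe = build_needle_probe(
    filler_sentence="The registrant discusses general market conditions in this section.",
    needle="Segment X restated FY2023 revenue to $412.7 million.",
    total_tokens=120_000,
    depth_fraction=0.35,
)
```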
Structured-output adherence
Returning JSON that matches a schema without stray prose, without markdown fences, and without renaming fields. Anthropic's tool use, OpenAI's structured outputs, and Google's schema-constrained generation all address this, but published guarantees differ. Tasks that feed structured output into a downstream tabular pipeline (forecast feature extraction, guidance-delta tables) must treat schema adherence as a hard filter, not a nice-to-have.
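Treating adherence as a hard filter is mechanical: parse, check the field set, check the types, and reject anything else before it touches the downstream table. A minimal sketch, with illustrative field names for a guidance-delta row:

```python
import json

EXPECTED_FIELDS = {
    "ticker": str,
    "fiscal_quarter": str,
    "revenue_guidance_low": (int, float),
    "revenue_guidance_high": (int, float),
}

def parse_or_reject(raw_reply: str) -> dict | None:
    """Return the parsed payload only if it matches the schema exactly, else None."""
    try:
        payload = json.loads(raw_reply)          # stray prose or markdown fences fail here
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or set(payload) != set(EXPECTED_FIELDS):
        return None                              # renamed, missing, or extra fields fail here
    if not all(isinstance(payload[k], t) for k, t in EXPECTED_FIELDS.items()):
        return None
    return payload
```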
Reasoning depth
Multi-step chains where an error early in the chain compounds. Price-blind research, multi-quarter guidance-delta analysis, and compliance diff detection sit here. Vendor-published "thinking" or "extended reasoning" modes (Anthropic extended thinking, OpenAI o-series, Gemini 2.5 Pro thinking) are the documented option. These modes trade latency and token spend for chain-of-thought stability. See /articles/thinking-tokens-finance-tasks/ for the dedicated walk-through.
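A rough sense of the token-spend side of that trade, assuming reasoning tokens bill at the model's output rate (billing treatment varies by vendor and should be checked on the pricing page); the rate and budgets below are illustrative, not published figures:

```python
def reasoning_premium(out_price_per_mtok: float, visible_out_tok: int,
                      thinking_budget_tok: int) -> dict:
    """Compare output-side cost per call with and without a thinking-token budget."""
    base = visible_out_tok * out_price_per_mtok / 1_000_000
    with_thinking = (visible_out_tok + thinking_budget_tok) * out_price_per_mtok / 1_000_000
    return {"base_usd": round(base, 4),
            "with_thinking_usd": round(with_thinking, 4),
            "multiple": round(with_thinking / base, 1)}

# e.g. a 4k-token answer with a 16k-token thinking budget at $15/MTok output pricing
print(reasoning_premium(15.0, 4_000, 16_000))   # roughly 5x the output-side spend
```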
Multilingual
Non-English filings, cross-listed issuer disclosures, and German Jahresabschluss or Geschäftsbericht documents. All three major families publish multilingual support; the meaningful differences are in idiomatic financial phrasing. Finance-specific terminology (IFRS, HGB, US-GAAP reconciliation vocabulary) is where eval harnesses diverge from general multilingual benchmarks.
Vision
Chart extraction, scanned filing OCR, and tables rendered as images rather than HTML. Claude, GPT, and Gemini families all ship native vision on their current flagship tiers. Older cheap tiers sometimes skip vision; verify on the pricing page before routing an image-heavy workload to a Haiku-class endpoint.
How the axes interact
Capability axes do not combine additively. A task that needs long-context reliability and structured-output adherence at the same time is strictly harder than either axis alone, because models that tolerate 400k-token inputs can drift from a JSON schema under positional pressure. A task that needs reasoning depth and multilingual is strictly harder than either alone, because thinking-mode chains in non-English can regress to translationese. The axis list is a checklist for the filter step, not a weighted sum. The eval harness is where axis interaction gets measured on the buyer's own documents.
The ten scenarios
The centerpiece of the framework is a mapping from finance task archetype to recommended model tier. Each row below specifies input size, output size, latency envelope, the recommended tier, the rationale, and an order-of-magnitude monthly cost at 2026-04 published rates. The rates used for the cost column are the Anthropic public pricing for Haiku 4.6, Sonnet 4.6, and Opus 4.7, OpenAI public pricing for GPT-5 and GPT-5-mini, and Google public pricing for Gemini 2.5 Flash and Pro [1][2][3]. Batch API discounts (50 percent on Anthropic, 50 percent on OpenAI) apply where noted [4].
| # | Scenario | Input / Output tokens | Latency budget | Recommended tier | Rationale | Est. monthly cost |
|---|---|---|---|---|---|---|
| a | Nightly 10-K triage sweep, 500 filings | 80k / 2k | Overnight (batch) | Sonnet-class or GPT-5-mini, batch mode | Structured-output + long-context at lowest batch rate. Overnight latency lets 50 percent batch discount apply. | $120-220 |
| b | Real-time earnings reaction summariser | 6k / 400 | < 2s P95 | Haiku-class (Claude Haiku 4.6 or Gemini 2.5 Flash-Lite) | Small inputs, tight latency, low reasoning depth. Cheapest tier with vision optional. | $30-60 |
| c | Deep peer comparison across 5 filings | 450k / 6k | < 90s | Opus-class, or Sonnet-class on the 1M context tier | Long-context reliability at depth. Output is a ranked differential with citations. | $250-500 |
| d | Price-blind research synthesis | 40k / 4k | < 60s | Opus-class, GPT-5, or Gemini 2.5 Pro | Reasoning depth is the binding constraint. Price-blind prompt pattern forbids price leakage. | $180-320 |
| e | MD&A classification, 12 risk buckets | 18k / 120 | Overnight (batch) | Sonnet-class few-shot, or fine-tuned Haiku if volume > 100k/month | Structured output with short response. Fine-tune amortises above ~100k calls/month. | $40-110 |
| f | Sentiment tone shift across calls | 25k / 800 | < 20s | Sonnet-class | Balances nuance detection with cost. Haiku underfits on tonal gradation; Opus overspends for the task shape. | $70-130 |
| g | Forecast feature extraction for downstream model | 35k / 300 JSON | < 15s | Opus-class when precision matters, Sonnet-class otherwise | Downstream model quality is capped by extraction quality. Hallucinated features corrupt the training set. | $150-400 |
| h | Compliance / disclosure diff detection | 2x 90k / 3k | < 45s | Sonnet-class with retrieval layer | Retrieval narrows to diff regions; model verifies semantic meaning of the diff. | $90-170 |
| i | Multi-quarter guidance delta | 120k / 1.5k | < 30s | Sonnet-class structured output | Guidance tables need schema adherence and multi-document coverage. | $80-140 |
| j | Quick Q&A against a single filing | 40k / 200 | < 3s P95 | Haiku-class | Interactive. Cost per query matters more than reasoning depth because the human-in-the-loop catches errors. | $20-50 |
Cost columns assume moderate retail volumes: 500 runs per night for (a), 2,000 real-time queries per day for (b), 20 peer-comparison runs per week for (c). The Token Cost Optimizer rescales any row to a different volume and vendor mix.
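As a sanity check on how a cost cell is built, here is scenario (a) worked through at the GPT-5-mini rates used in the catalogue later in this article, assuming roughly thirty batch nights per month; the figures are an illustration of the arithmetic, not a quote:

```python
in_tok, out_tok = 80_000, 2_000               # scenario (a) per-filing shape
in_rate, out_rate = 0.25, 2.00                # GPT-5-mini list rates, USD per 1M tokens
runs_per_month = 500 * 30                     # 500 filings per night, roughly 30 nights

cost_per_run = (in_tok * in_rate + out_tok * out_rate) / 1_000_000   # $0.024 per filing
monthly = cost_per_run * runs_per_month                              # $360 at real-time rates
print(round(monthly * 0.5, 2))                # 180.0 with the batch discount, inside the $120-220 row
```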
Reading the table
Three patterns repeat across the ten rows. First, latency is the dominant filter. Scenario (b) forbids Opus-class because a 6-9 second P95 makes an earnings-reaction summariser unusable; scenario (a)'s overnight window removes the latency gate entirely, leaving cost as the only live constraint. Second, output tokens carry a premium: every vendor prices output at roughly five to eight times the input rate, so scenario (g)'s 300-token JSON looks negligible against 35k of input yet pays that multiple on every extra field the schema demands, and a response-heavy scenario like (d) puts roughly a third of its Opus-class per-call cost on the 4k-token output alone. Third, the batch-mode 50 percent discount reshapes the decision for any workload with tolerance for a 24-hour delivery window; scenarios (a) and (e) both assume batch, which drops the Sonnet-class quote into Haiku-class real-time territory.
The table lists a single recommended tier per row for readability, but every row in practice has a primary and a fallback candidate. Scenario (c) is Opus-class first, Sonnet-1M second; scenario (f) is Sonnet-class first, GPT-5-mini second. The fallback exists because vendor outages happen and because model-family updates can invalidate a primary choice mid-quarter.
The decision tree
The scenario table above is the human-readable version. The machine-readable version is a 30-line function that takes task shape and budget and returns a ranked shortlist. The function below reads a static catalogue of 2026-04 published rates and context limits and produces the shortlist deterministically. It is not a substitute for the eval step; it narrows the candidate set that the eval harness runs against.
from dataclasses import dataclass
@dataclass
class Model:
tier: str
vendor: str
in_price_per_mtok: float # USD per 1M input tokens
out_price_per_mtok: float # USD per 1M output tokens
context_tokens: int
p95_latency_ms: int
supports_batch: bool
CATALOGUE = [
Model("haiku", "anthropic", 1.00, 5.00, 200_000, 1200, True),
Model("sonnet", "anthropic", 3.00, 15.00, 200_000, 3500, True),
Model("sonnet-1m", "anthropic", 6.00, 22.50, 1_000_000, 4500, True),
Model("opus", "anthropic", 15.0, 75.00, 200_000, 9000, True),
Model("gpt5-mini", "openai", 0.25, 2.00, 400_000, 2500, True),
Model("gpt5", "openai", 1.25, 10.00, 400_000, 6000, True),
Model("gemini-flash", "google", 0.30, 2.50, 1_000_000, 2800, True),
Model("gemini-pro", "google", 1.25, 10.00, 2_000_000, 5500, True),
]
def select(task_in_tok: int, task_out_tok: int, latency_ms: int,
monthly_budget_usd: float, runs_per_month: int,
needs_structured: bool = False) -> list:
shortlist = []
for m in CATALOGUE:
if task_in_tok + task_out_tok > m.context_tokens:
continue
if m.p95_latency_ms > latency_ms:
continue
cost_per_run = (task_in_tok * m.in_price_per_mtok
+ task_out_tok * m.out_price_per_mtok) / 1_000_000
monthly_cost = cost_per_run * runs_per_month
        if m.supports_batch and latency_ms >= 86_400_000:  # batch pricing assumes tolerance for the 24-hour delivery window
            monthly_cost *= 0.5  # 50 percent batch discount
if monthly_cost > monthly_budget_usd:
continue
shortlist.append((m.tier, m.vendor, round(monthly_cost, 2)))
return sorted(shortlist, key=lambda x: x[2])
Example usage on scenario (c), deep peer comparison:
>>> select(task_in_tok=450_000, task_out_tok=6_000,
... latency_ms=90_000, monthly_budget_usd=500,
... runs_per_month=80)
[('gemini-flash', 'google', 12.0),
 ('gemini-pro', 'google', 49.8),
 ('sonnet-1m', 'anthropic', 226.8)]
Three candidates survive the filter. The shortlist then goes into the eval harness: twenty representative peer-comparison tasks with known-good answers, each model graded on citation accuracy, numerical precision, and structured-output adherence. The winner is whichever candidate passes the harness at acceptable cost, not whichever is cheapest.
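A minimal sketch of that harness loop, with `run_model` and `grade` left as placeholders for the buyer's own client wrapper and rubric scorer; nothing here reflects a published benchmark:

```python
from statistics import mean

RUBRIC = ("citation_accuracy", "numerical_precision", "schema_adherence")

def grade_candidate(model_id: str, tasks: list[dict], run_model, grade) -> dict:
    """Run every task through one candidate and average its rubric scores (0-1 each)."""
    per_task = [grade(run_model(model_id, t["prompt"]), t["expected"]) for t in tasks]
    return {axis: round(mean(scores[axis] for scores in per_task), 3) for axis in RUBRIC}

# shortlist comes from select() above; tasks are the twenty known-good peer comparisons
# results = {tier: grade_candidate(tier, tasks, run_model, grade) for tier, _, _ in shortlist}
```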
Fallback chains
Production stacks rarely run one model. A common pattern: Haiku-class for the first pass, Sonnet-class on disagreement or low-confidence output, Opus-class reserved for escalation. The Fallback Chain Simulator models the blended cost of a two- or three-stage chain. Fallback logic has to be designed against measured escalation rates, which again routes the decision to the eval harness.
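A sketch of the blended-cost arithmetic for a two-stage chain, assuming the escalation rate comes out of the harness rather than a guess; the per-call costs and the 12 percent rate below are placeholders:

```python
def blended_cost(first_pass_cost: float, escalation_cost: float,
                 escalation_rate: float) -> float:
    """Every call pays the first pass; only escalated calls also pay the second stage."""
    return first_pass_cost + escalation_rate * escalation_cost

# e.g. Haiku-class first pass at $0.02/call, Sonnet-class escalation at $0.09/call
print(round(blended_cost(0.02, 0.09, 0.12), 4))   # 0.0308 blended per call
```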
What the framework deliberately does NOT tell you
No accuracy numbers appear anywhere in this article. Not for extraction fidelity, not for long-context retrieval, not for structured-output adherence, not for reasoning. Accuracy statements depend on three buyer-specific factors that public benchmarks cannot capture.
First, document corpus. A 10-K triage score measured against Fortune 500 filings does not transfer to small-cap issuers with non-standard disclosure formats. Second, prompt shape. The same model, on the same document, returns different answers to a zero-shot prompt and a four-shot prompt with schema-pinned output. Vendor demos always use the best prompt shape; the buyer's production shape is rarely identical. Third, error cost. A 95 percent accuracy rate on extraction is excellent when the five percent failures are reviewer-flaggable and catastrophic when they silently corrupt a downstream model.
The honest answer is that model ranking on a finance task is empirical and buyer-specific. The framework narrows the candidate set to two or three; the eval harness picks the winner. /articles/eval-harness-finance-llm/ sets out the harness methodology: sample size, grading rubric, blinding, and the confidence intervals that make the comparison actually mean something.
Published benchmark numbers from vendor marketing pages fail on the same three axes. They are computed on vendor-selected corpora, with vendor-tuned prompts, and without error-cost weighting. They are useful as capability signals, not as procurement evidence.
The framework also does not tell the buyer which vendor relationship to lock in. Lock-in is a separate decision driven by billing, compliance review, procurement speed, and EU data-residency constraints that no accuracy table resolves. A practitioner in a BaFin-regulated context will pick differently from a practitioner running a solo research stack, even when the scenario table recommends the same tier. The BaFin + EU Guide for Retail AI Traders sets out the regulatory overlay.
The tier cheat sheet
The scenarios resolve to five working tiers. The table below names each tier and its natural finance fit without pinning a specific model, because the specific model list rotates quarterly.
| Tier | Typical input price / 1M tokens | Context window | Natural finance fit |
|---|---|---|---|
| Nano / Flash-Lite | $0.10-0.30 | 100k-1M | High-volume classification, interactive Q&A, real-time summarisation |
| Haiku / GPT-mini / Flash | $0.25-1.00 | 200k-1M | Structured extraction, batch MD&A classification, retrieval augmentation |
| Sonnet / GPT-5 / Gemini Pro | $1.25-3.00 | 200k-2M | Peer comparison with retrieval, guidance deltas, compliance diffs |
| Sonnet 1M / GPT-5 400k | $6-8 | 400k-1M | Single-prompt multi-document reasoning without retrieval engineering |
| Opus | $15 | 200k | Price-blind research, high-stakes forecast feature extraction |
Prices are input-side per million tokens at 2026-04 published rates. Output tokens run roughly 5-8x input across vendors. Batch mode cuts both sides by 50 percent where available [4].
Validation: always close with an eval
The framework's load-bearing step is the eval harness, not the catalogue filter. A shortlist of three candidates means three harness runs against the same twenty representative tasks, graded with the same rubric, reported with the same confidence interval. The alternative (picking on vibes, picking on Twitter screenshots, picking on a vendor benchmark) is how budgets get spent on models that underperform a cheaper tier for the specific workload.
The harness does not have to be elaborate. Twenty tasks, three candidates, one rubric sheet, one afternoon. /articles/eval-harness-finance-llm/ walks through a minimum viable version. The Model Selector for Finance wraps the scenario-to-tier mapping in a browser tool that outputs the shortlist; the harness turns the shortlist into a decision.
Connects to
- The 2026 Engineer's Guide to AI in Markets — pillar on model selection inside a broader AI-in-markets stack.
- Finetune vs RAG vs Long Context for Filings — sibling on context-strategy selection for the same workloads.
- Batch API Economics for Finance — when overnight batching flips the cost envelope.
- Thinking Tokens for Finance Tasks — when extended-reasoning modes earn their token cost.
- Eval Harness for Finance LLMs — the validation step this framework hands off to.
- Token Cost Reality for LLM Trading Research — grounds the monthly-cost column in real workload patterns.
- Model Selector for Finance — interactive tool wrapping the decision tree above.
- Token Cost Optimizer — rescales the cost column to a different volume and vendor mix.
- Financial Document Token Estimator — converts filing page counts to token counts for the input-size column.
References
- Anthropic. "Extended thinking." docs.anthropic.com/en/docs/build-with-claude/extended-thinking, accessed 2026-04-22.
- Anthropic. "Prompt caching." docs.anthropic.com/en/docs/build-with-claude/prompt-caching, accessed 2026-04-22.
- OpenAI. "Structured outputs." platform.openai.com/docs/guides/structured-outputs, accessed 2026-04-22.
- Google. "Long-context best practices." ai.google.dev/gemini-api/docs/long-context, accessed 2026-04-22.
- Liu, N. F., Lin, K., Hewitt, J., et al. (2024). "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the ACL 12. Methodological anchor for long-context retrieval eval design.
Footnotes
1. Anthropic. "Models overview" and "Pricing." docs.anthropic.com, accessed 2026-04-22. Context window and list pricing for Claude Haiku 4.6, Sonnet 4.6 (200k + 1M beta), and Opus 4.7.
2. OpenAI. "Models" and "Pricing." platform.openai.com, accessed 2026-04-22. Context window and list pricing for GPT-5 and GPT-5-mini.
3. Google. "Gemini API pricing" and "Models." ai.google.dev, accessed 2026-04-22. Context window and list pricing for Gemini 2.5 Flash, Flash-Lite, and Pro.
4. Anthropic. "Batch API." docs.anthropic.com/en/docs/build-with-claude/batch-processing, accessed 2026-04-22. 50 percent batch-mode discount, 24-hour delivery window. OpenAI publishes an equivalent Batch API at platform.openai.com/docs/guides/batch.