How Model Selector for Finance works
The Model Selector for Finance ranks eight LLMs against a task profile you provide. It scores every model on five axes — cost, latency, context, capability, and quality sensitivity — and returns a full ranking with per-axis pass/fail notes and plain-English rationale. It does not rank models by benchmark accuracy. That is a deliberate design choice explained below.
What the tool computes
You pick a task (extract, summarize, forecast, compare, rank, synthesize), a latency budget, a cost budget, a context-size need, and a quality sensitivity. The engine evaluates each model against those inputs and outputs a ranked list with:
- A top-3 callout with the strongest model highlighted.
- A full ranked list of all eight models with why-not notes.
- A per-axis comparison table showing published rates, context window, thinking-token support, a reference monthly dollar estimate, and pass/fail badges for cost, latency, context, and capability axes.
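For orientation, here is a minimal sketch (in Python) of what one row of that ranking could hold. The class and field names are illustrative placeholders, not the tool's actual schema.

from dataclasses import dataclass, field

@dataclass
class RankedModel:                    # hypothetical shape of one ranking row
    model: str                        # model name as shown in the comparison table
    score: int                        # sum of the five axis terms (see Scoring framework)
    gate_failed: bool                 # True if a cost, latency, or context gate was missed
    axis_badges: dict = field(default_factory=dict)   # e.g. {"cost": True, "latency": False}
    ref_monthly_usd: float = 0.0      # reference monthly dollar estimate
    why_not: str = ""                 # plain-English note for models outside the top 3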
Inputs and assumptions
- Task type. The engine treats each model's best_for list — derived from vendor positioning — as the capability signal, not accuracy scores.
- Latency budget. Tier latency conventions: Haiku-class under 1 second, Sonnet-class under 5 seconds, Opus-class under 30 seconds. These are rough published-positioning guides, not SLAs. Your own prompt length, reasoning-mode settings, and network path dominate real-world latency.
- Cost budget. Cost fit is evaluated against a reference monthly workload: 6,000 input tokens and 1,200 output tokens per call, 3,000 calls per month. If your workload is larger or smaller, scale the dollar estimate linearly.
- Context-size need. The engine gates on the model's published context window. Retrieval-augmented architectures may let a smaller window still work — the tool does not attempt to simulate RAG.
- Quality sensitivity. When you select "high", flagship tiers and models with vendor-documented thinking-tokens support receive a score boost. When you select "low", haiku-tier models receive a boost.
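Taken together, those five inputs form the task profile the engine scores against. A minimal sketch of that profile in Python, with hypothetical field names:

from dataclasses import dataclass

@dataclass
class TaskProfile:                    # hypothetical names mirroring the five inputs above
    task: str                         # "extract", "summarize", "forecast", "compare", "rank", or "synthesize"
    latency_budget_s: float           # e.g. 1, 5, or 30 seconds
    cost_budget_usd: float            # monthly ceiling compared against the reference estimate
    context_tokens_needed: int        # gated against each model's published context window
    quality_sensitivity: str          # "high" boosts flagship tiers, "low" boosts haiku-tier models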
Scoring framework
The score for each model is the sum of five terms:
score = cost_match + latency_match + context_match
+ capability_bonus + quality_boost
cost_match       : 0 if monthly estimate > budget ceiling, else 25 + headroom bonus
latency_match    : 0 if tier slower than latency budget, else base + haiku bonus
context_match    : 0 if context window < required, else base + large-context bonus
capability_bonus : bonus if task ∈ model.best_for
quality_boost    : boost flagship tiers when quality = high; boost haiku when low

A model that fails any hard gate (cost, latency, or context) is flagged "gate failed" and pushed below all qualifying models in the ranking. It is still displayed, with an axis note explaining which constraint it missed, so you can see what you would gain or lose by loosening a requirement.
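The engine's exact weights are not published here, so the sketch below is structural only: placeholder base values and bonus sizes, hypothetical model-record fields, but the same five terms and the same hard-gate behavior described above.

TIER_LATENCY_S = {"haiku": 1, "sonnet": 5, "opus": 30}       # per-tier conventions (see below)

def score_model(model, profile):
    """Illustrative scoring pass. 'model' and 'profile' are plain dicts
    with hypothetical keys; the weights are placeholders, not the tool's."""
    score, gate_failed = 0, False

    # cost_match: 0 if the reference monthly estimate exceeds the budget ceiling
    if model["ref_monthly_usd"] > profile["cost_budget_usd"]:
        gate_failed = True
    else:
        headroom = 1 - model["ref_monthly_usd"] / profile["cost_budget_usd"]
        score += 25 + round(10 * headroom)                    # 25 base + headroom bonus (placeholder weight)

    # latency_match: 0 if the tier convention is slower than the latency budget
    if TIER_LATENCY_S[model["tier"]] > profile["latency_budget_s"]:
        gate_failed = True
    else:
        score += 20 + (5 if model["tier"] == "haiku" else 0)  # base + haiku bonus (placeholders)

    # context_match: 0 if the published window is smaller than the requirement
    if model["context_window"] < profile["context_tokens_needed"]:
        gate_failed = True
    else:
        score += 15 + (5 if model["context_window"] >= 1_000_000 else 0)  # base + large-context bonus

    # capability bonus: the task appears in the vendor-positioned best_for list
    if profile["task"] in model["best_for"]:
        score += 10

    # quality boost: flagships and thinking-token models for "high", haiku tier for "low"
    if profile["quality_sensitivity"] == "high" and (model["tier"] == "opus" or model["thinking_tokens"]):
        score += 10
    elif profile["quality_sensitivity"] == "low" and model["tier"] == "haiku":
        score += 10

    return score, gate_failed

Gate-failed models keep whatever score they earned on the remaining axes, but the ranking sorts them below every model that passed all three gates, for example with sorted(results, key=lambda r: (r[1], -r[0])) where each result is a (score, gate_failed) pair.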
Why there are no accuracy numbers
Published LLM leaderboards drift, are gamed, and almost never match your finance workload. A selector that claims "Sonnet scored 89% on this benchmark" pretends those numbers transfer to your extraction, forecast, or comparison pipeline. They usually do not.
The alternative is honest: frame selection around pricing, context, latency, and vendor-documented capabilities — and insist that quality be measured in your harness, on your data. The related article Eval harness for finance LLMs walks through how to build one in a weekend.
So the tool does what it can ground: it picks the models that fit your budget, context, and latency gates, nudges you toward vendor-positioned tiers for your task, and then hands the accuracy question back to you.
Formulas and sources
Reference monthly dollar estimate, used only to compare models against a cost budget:
ref_monthly_usd =
(REF_INPUT_TOKENS_PER_CALL / 1e6) × input_rate × REF_CALLS_PER_MONTH
+ (REF_OUTPUT_TOKENS_PER_CALL / 1e6) × output_rate × REF_CALLS_PER_MONTH
REF_INPUT_TOKENS_PER_CALL = 6000
REF_OUTPUT_TOKENS_PER_CALL = 1200
REF_CALLS_PER_MONTH = 3000

Rates and context windows are sourced from vendor pricing pages as of 2026-04-23.
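The same estimate in runnable form, with made-up rates plugged in so the arithmetic is visible (the $3 / $15 per-million-token figures below are examples, not any vendor's published pricing):

REF_INPUT_TOKENS_PER_CALL = 6_000
REF_OUTPUT_TOKENS_PER_CALL = 1_200
REF_CALLS_PER_MONTH = 3_000

def ref_monthly_usd(input_rate_per_mtok, output_rate_per_mtok):
    # Rates are USD per million tokens, taken from vendor pricing pages.
    return (REF_INPUT_TOKENS_PER_CALL / 1e6) * input_rate_per_mtok * REF_CALLS_PER_MONTH \
         + (REF_OUTPUT_TOKENS_PER_CALL / 1e6) * output_rate_per_mtok * REF_CALLS_PER_MONTH

print(ref_monthly_usd(3.0, 15.0))    # example rates: 54 + 54 = 108 USD per month

Because the estimate is linear in call volume, a workload at twice the reference volume simply doubles the figure.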
Per-tier latency conventions
- Haiku / Flash tier: positioned for sub-1s response at typical prompt sizes. Good for extraction and filtering pipelines.
- Sonnet / mid tier: positioned for sub-5s response. Good for summarization, comparison, and most structured tasks.
- Opus / flagship tier: positioned for heavier reasoning loops. Thinking-tokens add latency. Plan for sub-30s per call or async batch.
These are vendor-positioning conventions, not guaranteed SLAs. Always measure latency in your own deployment before committing a production path.
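A minimal way to do that measurement, assuming you already have a client call wrapped in a zero-argument function; the helper and budget table below are illustrative, not part of the tool:

import time

TIER_LATENCY_BUDGET_S = {"haiku": 1.0, "sonnet": 5.0, "opus": 30.0}   # conventions above

def within_budget(call_fn, tier):
    # call_fn: your own API call with the prompt already bound, e.g. via functools.partial
    start = time.perf_counter()
    result = call_fn()
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= TIER_LATENCY_BUDGET_S[tier]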
Limitations
- This is a planning tool, not investment advice.
- No accuracy benchmarks are used — by design. Build your own eval harness.
- Numbers reflect published rates as of 2026-04-23; vendors change pricing frequently.
- Batch-API discounts, enterprise rates, and prompt-cache savings are not applied to the reference estimate.
- Latency conventions are rough tier-level defaults; real latency depends on prompt, thinking-mode, and network path.
- Capability tagging comes from vendor positioning, not from benchmarks.
Related articles
- Model selection framework for finance — the full decision framework behind this tool.
- Eval harness for finance LLMs — how to measure quality on your own data.
- Thinking tokens on finance tasks — when reasoning-mode is worth the extra cost and latency.
Changelog
- 2026-04-23 — Initial release with 8 models across Anthropic, OpenAI, Google.