For an extraction workload with a sub-5-second latency budget, a $50/month cost ceiling, a 200k–1M context need, and medium quality tolerance, the Model Selector for Finance returns Gemini 2.5 Flash-Lite (score 96.0, $3/mo, all five axes pass) at rank 1, with Gemini 2.5 Flash (score 92.7, $14/mo, all five axes pass) the only other qualifying model at rank 2. Claude Haiku 4.5 is disqualified on context (its 200k window does not clear the 200k–1M band), and Sonnet, Opus, and GPT-5.5 are all disqualified on budget. Lifting the latency budget to sub-30s, the cost ceiling to $200+/month, and quality sensitivity to high puts Claude Opus 4.7, Gemini 2.5 Pro, and Gemini 3.5 Flash in a three-way tie at score 98.0 ($180/mo, $58/mo, and $59/mo respectively). Tier selection is the discriminator on this workload; vendor swap is the price-performance lever.

TL;DR

Two scenario runs on the same task (extract), varying the other four axes:

| Inputs (latency / cost / context / quality) | Top pick | Budget estimate | Score | Why |
|---|---|---|---|---|
| sub_5s / $50 / 200k–1M / medium | Gemini 2.5 Flash-Lite | $3/mo | 96.0 | All five axes pass; cheapest qualifying model |
| sub_30s / $200+ / 200k–1M / high | Opus 4.7, Gemini 2.5 Pro, or Gemini 3.5 Flash | $180 / $58 / $59 | 98.0 (tied) | Different models at the same fitness score |

The medium-quality workload at the $50 ceiling has two qualifying models, both Gemini Flash-class. The high-quality workload has three tied on fitness and the choice flips to vendor-relationship and quality-tolerance factors that the engine cannot adjudicate.

The five-axis filter

The selector implements a deterministic filter across:

  1. Task fit. Each model has a vendor-positioned bestFor list. Gemini 2.5 Flash-Lite lists "extract" and "summarize", so the extract task matches.
  2. Latency budget. Mapped to tier: Haiku-class for sub-1s/sub-5s; Sonnet/GPT-5.5/Gemini Pro for sub-30s; Opus-class for batch.
  3. Cost budget. Mapped to monthlyBudgetEstimate at the selector's reference workload. The $50/mo ceiling excludes Sonnet ($108/mo at reference workload), Opus ($180/mo), and GPT-5.5 ($198/mo).
  4. Context need. Mapped to contextWindow. The 200k–1M requirement excludes Haiku 4.5: its 200k window sits exactly on the lower bound of the range, and the selector interprets 200k–1M as strictly above 200k.
  5. Quality sensitivity. Mapped to tier permissiveness. Medium-quality accepts Haiku-class through Opus; high-quality requires Sonnet+ and weights against Haiku.

For the canonical input, two models pass all five gates: Gemini 2.5 Flash-Lite (score 96.0) at rank 1 and Gemini 2.5 Flash (score 92.7) at rank 2. The Flash-Lite score reflects: cost-fit (well under budget at $3/mo), latency-fit (Haiku tier well under 5s), context-fit (1M window covers the requirement), capability-fit (vendor positions for extract), and quality-fit (medium acceptable for Haiku-class). Flash-Lite outranks Flash purely on the cost axis — its $0.10/$0.40 per-Mtok rate is the cheapest in the table.
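
The gate logic described above can be sketched as a set of per-axis checks that report which axes a model fails. This is a minimal illustration, not the selector's actual code: the `Model` record and `failed_axes` helper are hypothetical, the field names mirror bestFor, monthlyBudgetEstimate, and contextWindow, and only the task, cost, and context gates are modelled because those are the axes with per-model figures quoted here. The task lists for Flash and Pro and the 1M window for Pro are assumptions, chosen to be consistent with both models passing those gates in the scenarios described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    best_for: tuple[str, ...]   # mirrors bestFor
    monthly_estimate: float     # mirrors monthlyBudgetEstimate, USD/month
    context_window: int         # mirrors contextWindow, tokens


# Costs and Flash-tier windows are the article's figures.
# Assumed: task lists for Flash and Pro, and Pro's 1M window.
MODELS = [
    Model("Gemini 2.5 Flash-Lite", ("extract", "summarize"), 3, 1_000_000),
    Model("Gemini 2.5 Flash", ("extract",), 14, 1_000_000),
    Model("Gemini 2.5 Pro", ("extract",), 58, 1_000_000),
]


def failed_axes(model, task, cost_ceiling, context_lower_bound):
    """Return the names of the gates the model fails (empty list = qualifies)."""
    failures = []
    if task not in model.best_for:
        failures.append("task")
    if model.monthly_estimate > cost_ceiling:
        failures.append("cost")
    # Strict comparison: a window equal to the lower bound does not pass.
    if model.context_window <= context_lower_bound:
        failures.append("context")
    return failures


for m in MODELS:
    print(m.name, failed_axes(m, "extract", 50, 200_000) or "qualifies")
```

Returning the list of failed axes, rather than a single boolean, keeps the disqualification reason visible, which is what makes the output auditable.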

Why Claude Haiku 4.5 is disqualified

Anthropic's Claude Haiku 4.5 ships with a 200k context window. The selector interprets the context-need band "200k–1M" as "I need a context window strictly above the lower bound of this range, and ideally up to the upper bound". A 200k window does not exceed 200k, so the selector disqualifies Haiku 4.5 on the context axis.

That is a strict reading. In practice many extraction workloads fit comfortably in 200k tokens (most 10-Ks are ~20k tokens; even five-peer comparison fits). If the buyer's real context need is 100k–200k, they should pick context = "k32_200k" rather than "k200_1m", and Haiku 4.5 will then qualify with a score of ~76.
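
The strict reading can be illustrated with a small check. This is a sketch under stated assumptions: the band names "k32_200k" and "k200_1m" come from the selector's inputs, but the lower-bound values are read off the labels, and `passes_context_gate` is a hypothetical helper rather than the engine's function.

```python
# Lower bounds inferred from the two band labels (an assumption).
BAND_LOWER_BOUND = {
    "k32_200k": 32_000,
    "k200_1m": 200_000,
}


def passes_context_gate(context_window: int, band: str) -> bool:
    """Strict reading: the window must exceed the band's lower bound."""
    return context_window > BAND_LOWER_BOUND[band]


HAIKU_4_5_WINDOW = 200_000     # as stated in the article
FLASH_LITE_WINDOW = 1_000_000  # as stated in the article

print(passes_context_gate(HAIKU_4_5_WINDOW, "k200_1m"))   # False
print(passes_context_gate(HAIKU_4_5_WINDOW, "k32_200k"))  # True
print(passes_context_gate(FLASH_LITE_WINDOW, "k200_1m"))  # True
```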

The lesson is that the selector's gates are exact, not heuristic. A buyer who picks a context band one notch above what they need will exclude qualifying models. The audit-defensible workflow is to declare the actual extraction-size distribution (median tokens, 95th-percentile tokens) and pick the smallest context band that covers the 95th percentile.
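
One way to apply that workflow is to compute the 95th percentile of per-document token counts and pick the first band that covers it. The sketch below assumes only the two band names that appear in this article (the selector may define others), reads their upper bounds off the labels, and uses a nearest-rank percentile; the helper names and the sample token counts are hypothetical.

```python
# Upper bounds read off the two band labels named in the article (assumed).
BANDS = [("k32_200k", 200_000), ("k200_1m", 1_000_000)]


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


def smallest_covering_band(token_counts, pct=95):
    """Return (band, percentile value) for the first band covering the percentile."""
    need = percentile_nearest_rank(token_counts, pct)
    for name, upper in BANDS:
        if need <= upper:
            return name, need
    raise ValueError(f"p{pct} of {need} tokens exceeds every known band")


# Made-up distribution: mostly ~18k-token filings with two larger outliers.
sample = [18_000] * 18 + [60_000, 250_000]
print(smallest_covering_band(sample))
```

With this sample, a single 250k-token outlier sits above the 95th percentile, so it does not push the buyer into the larger band. Whether that outlier should be handled separately is a workload decision the sketch does not make.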

Why Sonnet, Opus, and GPT-5.5 are all disqualified on budget

The selector's reference workload (input/output token shape at default volume) produces:

  • Claude Opus 4.7: $180/mo (input rate $5/Mtok, output rate $25/Mtok)
  • Claude Sonnet 4.6: $108/mo (input $3/Mtok, output $15/Mtok)
  • GPT-5.5: $198/mo (input $5/Mtok, output $30/Mtok)
  • Gemini 3.5 Flash: $59/mo (input $1.50/Mtok, output $9/Mtok)
  • Gemini 2.5 Pro: $58/mo (input $1.25/Mtok at reference, output $10/Mtok)
  • Gemini 2.5 Flash: $14/mo (input $0.30/Mtok, output $2.50/Mtok)
  • Gemini 2.5 Flash-Lite: $3/mo (input $0.10/Mtok, output $0.40/Mtok)

The $50/mo ceiling excludes everything except the two cheap Flash tiers (Flash-Lite at $3/mo and Flash at $14/mo). Gemini 2.5 Pro at $58/mo and Gemini 3.5 Flash at $59/mo are just over the line, and the selector flags them as cost-disqualified in the broader output. Note that Gemini 3.5 Flash, despite the "Flash" name, prices like a frontier reasoning tier: its $9/Mtok output rate is ~3.6× that of Gemini 2.5 Flash, so it lands above the ceiling, not below it.
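
The monthly figures can be reproduced from the listed rates. The sketch below assumes a reference workload of 18 Mtok of input and 3.6 Mtok of output per month. That volume is inferred here by solving the listed rates against the listed totals; it is not a number the selector publishes. Under that assumption every figure matches to within rounding (Gemini 2.5 Pro comes out at $58.50 against the listed $58).

```python
# (input $/Mtok, output $/Mtok) as listed in the article.
RATES = {
    "Claude Opus 4.7": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-5.5": (5.00, 30.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Gemini 2.5 Flash-Lite": (0.10, 0.40),
}

# Assumption: volume inferred from the listed totals, not published by the selector.
REF_INPUT_MTOK = 18.0
REF_OUTPUT_MTOK = 3.6


def monthly_cost(name, input_mtok=REF_INPUT_MTOK, output_mtok=REF_OUTPUT_MTOK):
    """Linear cost model: tokens times rate, summed over input and output."""
    in_rate, out_rate = RATES[name]
    return input_mtok * in_rate + output_mtok * out_rate


def under_ceiling(ceiling):
    """Names of models whose monthly cost fits under the ceiling, sorted."""
    return sorted(name for name in RATES if monthly_cost(name) <= ceiling)


for name in RATES:
    print(f"{name}: ${monthly_cost(name):.2f}/mo")
print(under_ceiling(50))
```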

The selector's reference workload is a fixed assumption baked into the engine. A real buyer with lower call volume or smaller per-call token counts will see different absolute numbers; the cost ordering across models holds as long as the input/output mix is unchanged, because cost is linear in token volume. A 10% volume reduction lowers every model's monthly cost by 10%, so models just over the ceiling move closer to qualifying.
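
Because cost is linear in volume, the effect of a smaller workload on the cost gate can be checked by scaling the reference estimates. This is a rough sketch that uses the rounded monthly figures listed above and assumes the input/output mix stays the same; `qualifying` and `break_even_scale` are hypothetical helpers.

```python
# Rounded monthly estimates at the reference workload, from the list above.
REFERENCE_ESTIMATE = {
    "Gemini 2.5 Flash-Lite": 3,
    "Gemini 2.5 Flash": 14,
    "Gemini 2.5 Pro": 58,
    "Gemini 3.5 Flash": 59,
    "Claude Sonnet 4.6": 108,
    "Claude Opus 4.7": 180,
    "GPT-5.5": 198,
}


def qualifying(volume_scale, ceiling=50):
    """Models whose scaled monthly estimate fits under the ceiling."""
    return [
        name
        for name, usd in REFERENCE_ESTIMATE.items()
        if usd * volume_scale <= ceiling
    ]


def break_even_scale(name, ceiling=50):
    """Largest fraction of the reference volume at which the model fits."""
    return ceiling / REFERENCE_ESTIMATE[name]


print(qualifying(1.0))   # reference volume
print(qualifying(0.9))   # 10% less volume
print(qualifying(0.85))  # 15% less volume
print(round(break_even_scale("Gemini 2.5 Pro"), 3))
```

On these figures a 10% reduction is not enough to bring Gemini 2.5 Pro under $50, while a 15% reduction is; Gemini 3.5 Flash needs a slightly larger cut.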

The high-quality regime: where the choice splits

The second scenario (sub_30s / $200+ / 200k–1M / high) has different filter behaviour. Latency relaxes from sub-5s to sub-30s, opening the Sonnet, GPT-5.5, and Pro tiers. Cost relaxes to $200+, opening Opus. Quality lifts to high, which weights against Haiku and increases the score gap between the cheap and the flagship tiers.

The selector produces a three-way tie at score 98.0 between Opus 4.7 ($180/mo), Gemini 2.5 Pro ($58/mo), and Gemini 3.5 Flash ($59/mo). All three pass all five axes; all three rank as "primary-positioned" for the task; all three ship at the required quality tier. The selector cannot break this tie because it does not measure accuracy on the buyer's own documents. That requires an eval harness, which is outside the selector's scope.

The defensible workflow is: shortlist Opus, Gemini Pro, and Gemini 3.5 Flash, run a 50-task eval harness (see Eval Harness for Finance LLM) on the buyer's actual extraction corpus, and pick the winner. The ~3× cost gap between Opus ($180) and the two Gemini options ($58–$59) justifies the eval effort: if a Gemini option matches Opus on accuracy, the ~$120/month savings funds the eval many times over.

The cost-per-validated-extraction frame

Raw monthly cost is the first filter. The right second filter is cost-per-validated-extraction: the monthly cost divided by the number of extractions that pass quality review. For two models at the same monthly cost, each running N extractions per month:

  • Model A: $58/mo, 95% of extractions pass review → cost per validated extraction = $58 / (0.95 × N) = (1/0.95) × per-extraction cost
  • Model B: $58/mo, 80% of extractions pass review → cost per validated extraction = $58 / (0.80 × N) = (1/0.80) × per-extraction cost

Model B is 19% more expensive per validated extraction (0.95 / 0.80 ≈ 1.19) even though the headline budgets match. The selector does not compute this because it requires the eval-harness pass rate as an input. The Token Cost Optimizer provides the cost-per-validated-trade framing for the broader workflow.
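
The comparison can be worked through in a few lines. This is an illustrative sketch: `cost_per_validated` is a hypothetical helper, and the monthly extraction count is an arbitrary placeholder that cancels out of the ratio.

```python
def cost_per_validated(monthly_cost, extractions, pass_rate):
    """Monthly cost divided by the number of extractions that pass review."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return monthly_cost / (extractions * pass_rate)


N = 1_000  # hypothetical monthly extraction count; cancels in the ratio

model_a = cost_per_validated(58, N, 0.95)
model_b = cost_per_validated(58, N, 0.80)

print(f"Model A: ${model_a:.4f} per validated extraction")
print(f"Model B: ${model_b:.4f} per validated extraction")
print(f"B / A = {model_b / model_a:.4f}")  # 1.1875
```

The pass rates would come from the eval harness; the selector itself has no access to them.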

Where the selector breaks

The selector's reference workload is the largest single source of mis-application. Buyers whose real workload is smaller than the reference will see models disqualified on cost that would fit their actual budget; buyers whose workload is larger will see models qualify that would exceed it in practice. The fix is to use the Token Cost Optimizer with the buyer's actual (input_tokens_per_call, calls_per_idea, ideas_per_day) to compute the cost-fit independently.

The selector also treats the published context window as the binding constraint. In practice, model quality degrades before the context window's hard limit is reached: needle-in-haystack performance at 90% of the published window is materially worse than at 50%. The selector does not model this; the eval harness has to.

Finally, the selector treats "extract" as a single task type. In reality there are field-by-field extraction, free-form summarization, citation-required verbatim quotes, and contradiction-triangle cross-check (see LLM Prompt Patterns for 10-K and 8-K Extraction). The selector's bestFor list is coarse-grained; the buyer's actual prompt pattern matters more than the abstract task type.

References

  • Anthropic. "Pricing." anthropic.com, accessed 2026-05-21. https://www.anthropic.com/pricing
  • Google. "Gemini API pricing." ai.google.dev, accessed 2026-05-21. https://ai.google.dev/pricing
  • OpenAI. "API Pricing." openai.com/api/pricing, accessed 2026-05-21. https://openai.com/api/pricing/
  • Anthropic. "Claude models overview." docs.anthropic.com/en/docs/about-claude/models, accessed 2026-05-21. Context-window and capability documentation.
  • Google. "Gemini Models." ai.google.dev/gemini-api/docs/models/gemini, accessed 2026-05-21.
  • Liu, N. F., Lin, K., Hewitt, J., et al. (2024). "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the ACL 12. Methodological reference for context-window degradation.

Verified engine output

Inputs
task: extract
latency: sub_5s
cost: b50
context: k200_1m
quality: medium
Result
ranked (10 items): [...]

Computed live at build time.

Frequently asked questions

Why do the Gemini Flash models beat Claude Haiku 4.5 within the same Haiku-class tier?
Both Flash-Lite and Flash have a 1M context window, which passes the 200k–1M filter; Haiku 4.5 has a 200k window, which does not. They also cost a fraction of Haiku at the reference workload.
Can I trust the monthlyBudgetEstimate as an absolute cost prediction?
No, it is the cost at the selector's reference workload. Your actual cost depends on your actual input/output shape and call volume; use the Token Cost Optimizer for your own inputs.
Why are Opus 4.7, Gemini 2.5 Pro, and Gemini 3.5 Flash tied at score 98.0 in the high-quality scenario?
All three pass all five filter axes; the selector's scoring does not adjudicate cost as a tie-breaker. Once a model passes the gate, axis fit is identical at extract / high quality.
How do I break the three-way tie?
Run a 50-task eval on your own documents. The ~3× cost gap between Opus and the Gemini options should fund the eval many times over; if accuracy ties too, pick on cost and vendor-relationship factors.
Does the selector account for batch-API discounts?
No, it uses real-time published rates. For batch workloads with 24-hour SLA, Anthropic and OpenAI publish 50% discounts; use the Batch vs Realtime Cost Calculator.