
How Model Selector for Finance works

The Model Selector for Finance ranks eight LLMs against a task profile you provide. It scores every model on five axes — cost, latency, context, capability, and quality sensitivity — and returns a full ranking with per-axis pass/fail notes and plain-English rationale. It does not rank models by benchmark accuracy. That is a deliberate design choice explained below.

What the tool computes

You pick a task (extract, summarize, forecast, compare, rank, synthesize), a latency budget, a cost budget, a context-size need, and a quality sensitivity. The engine evaluates each model against those inputs and returns a ranked list in which every model carries its score, per-axis pass/fail notes, and a plain-English rationale, as sketched below.
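For concreteness, here is a minimal Python sketch of the kind of task profile the engine consumes. The field names, tier labels, and example values are illustrative assumptions, not the tool's actual schema.

from dataclasses import dataclass

@dataclass
class TaskProfile:
    task: str                       # one of: extract, summarize, forecast, compare, rank, synthesize
    latency_budget: str             # e.g. "interactive" or "batch" (illustrative tiers)
    monthly_cost_ceiling_usd: float # budget ceiling the cost gate checks against
    required_context_tokens: int    # context-size need
    quality_sensitivity: str        # "low", "medium", or "high"

# example profile (placeholder values)
profile = TaskProfile(
    task="extract",
    latency_budget="interactive",
    monthly_cost_ceiling_usd=500.0,
    required_context_tokens=32_000,
    quality_sensitivity="high",
)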

Inputs and assumptions

Scoring framework

The score for each model is the sum of five terms:

score = cost_match + latency_match + context_match
      + capability_bonus + quality_boost

cost_match       : 0 if monthly estimate > budget ceiling, else 25 + headroom bonus
latency_match    : 0 if tier slower than latency budget, else base + haiku bonus
context_match    : 0 if context window < required, else base + large-context bonus
capability_bonus : bonus if task ∈ model.best_for
quality_boost    : boost flagship tiers when quality = high; boost haiku when low

A model that fails any hard gate (cost, latency, or context) is flagged "gate failed" and pushed below all qualifying models in the ranking. It is still displayed, with an axis note explaining which constraint it missed, so you can see what you would gain or lose by loosening a requirement.
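As a rough illustration, the ranking logic can be reproduced with a sketch like the one below. The bonus magnitudes, model fields, and tier labels are placeholder assumptions; only the overall shape (a sum of five terms, with gate-failed models sorted below every qualifying model) follows the description above.

def score_model(model, profile):
    """Return (gate_failed, score, notes) for one model against a task profile."""
    notes, score, gate_failed = [], 0.0, False

    # cost_match: hard gate on the reference monthly estimate vs. the budget ceiling
    if model["ref_monthly_usd"] > profile["cost_ceiling_usd"]:
        gate_failed = True
        notes.append("cost: over budget")
    else:
        headroom = 1 - model["ref_monthly_usd"] / profile["cost_ceiling_usd"]
        score += 25 + 10 * headroom                                # placeholder headroom bonus

    # latency_match: hard gate on tier vs. latency budget (lower rank = faster)
    if model["latency_rank"] > profile["latency_budget_rank"]:
        gate_failed = True
        notes.append("latency: tier too slow")
    else:
        score += 20 + (5 if model["tier"] == "haiku" else 0)       # placeholder haiku bonus

    # context_match: hard gate on context window
    if model["context_window"] < profile["required_context_tokens"]:
        gate_failed = True
        notes.append("context: window too small")
    else:
        score += 15 + (5 if model["context_window"] >= 200_000 else 0)  # placeholder large-context bonus

    # capability_bonus: vendor positions this model for the task
    if profile["task"] in model["best_for"]:
        score += 10

    # quality_boost: flagship tiers when sensitivity is high, haiku when low
    if profile["quality_sensitivity"] == "high" and model["tier"] == "flagship":
        score += 10
    elif profile["quality_sensitivity"] == "low" and model["tier"] == "haiku":
        score += 5

    return gate_failed, score, notes

def rank(models, profile):
    # qualifying models first, then gate-failed models; each group by descending score
    scored = [(score_model(m, profile), m) for m in models]
    return sorted(scored, key=lambda item: (item[0][0], -item[0][1]))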

Why there are no accuracy numbers

Published LLM leaderboards drift, are gamed, and almost never match your finance workload. A selector that claims "Sonnet scored 89% on this benchmark" pretends those numbers transfer to your extraction, forecast, or comparison pipeline. They usually do not.

The alternative is honest: frame selection around pricing, context, latency, and vendor-documented capabilities — and insist that quality be measured in your harness, on your data. The related article Eval harness for finance LLMs walks through how to build one in a weekend.

So the tool sticks to what it can ground: it picks the models that fit your budget, context, and latency gates, nudges you toward vendor-positioned tiers for your task, and then hands the accuracy question back to you.

Formulas and sources

Reference monthly dollar estimate, used only to compare models against a cost budget:

ref_monthly_usd =
  (REF_INPUT_TOKENS_PER_CALL  / 1e6) × input_rate  × REF_CALLS_PER_MONTH
+ (REF_OUTPUT_TOKENS_PER_CALL / 1e6) × output_rate × REF_CALLS_PER_MONTH

REF_INPUT_TOKENS_PER_CALL  = 6000
REF_OUTPUT_TOKENS_PER_CALL = 1200
REF_CALLS_PER_MONTH        = 3000
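In code, the estimate is only a few lines. The rates in the usage comment are placeholders, not any vendor's actual pricing.

REF_INPUT_TOKENS_PER_CALL  = 6000
REF_OUTPUT_TOKENS_PER_CALL = 1200
REF_CALLS_PER_MONTH        = 3000

def ref_monthly_usd(input_rate_per_mtok, output_rate_per_mtok):
    """Reference monthly spend given $/million-token input and output rates."""
    input_cost  = (REF_INPUT_TOKENS_PER_CALL  / 1e6) * input_rate_per_mtok  * REF_CALLS_PER_MONTH
    output_cost = (REF_OUTPUT_TOKENS_PER_CALL / 1e6) * output_rate_per_mtok * REF_CALLS_PER_MONTH
    return input_cost + output_cost

# e.g. a model priced at $3 / 1M input tokens and $15 / 1M output tokens (placeholder rates):
# ref_monthly_usd(3, 15) -> 54.0 + 54.0 = 108.0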

Rates and context windows are sourced from vendor pricing pages as of 2026-04-23:

Per-tier latency conventions

These are vendor-positioning conventions, not guaranteed SLAs. Always measure latency in your own deployment before committing to a production path.
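A bare-bones way to take that measurement, assuming a hypothetical call_model() wrapper around whatever client you actually use:

import time
import statistics

def measure_latency(call_model, prompt, n=20):
    """Time n sequential calls and report rough p50/p95 wall-clock latency in seconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        call_model(prompt)                     # your real client call goes here
        samples.append(time.perf_counter() - start)
    samples.sort()
    p50 = statistics.median(samples)
    p95 = samples[max(0, int(0.95 * n) - 1)]
    return p50, p95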

Limitations

Related articles

Changelog

Planning estimates only — not financial, tax, or investment advice.