TL;DR

A public benchmark is a signal, not a decision. Published accuracy on FinQA, SEC-filing QA, or similar datasets tells a practitioner how a model performs on that dataset, not on the workload in front of them. Any team deploying an LLM for finance should build its own eval harness against ground truth specific to its tasks. The cost is modest: a day of labeling, 50 to 200 examples, and roughly five dollars of API spend. The payoff is reliable model selection across monthly releases, prompt changes, and schema revisions. This piece covers methodology for building that harness: four finance-relevant eval categories, ground-truth sources that are public and legally usable, a runnable skeleton in about sixty lines of Python, statistical rigor for comparing models, and integration with CI. It publishes zero accuracy numbers. Accuracy numbers belong in a practitioner's own harness, measured on their own corpus, in the month they deploy.

Why public benchmarks are insufficient

Public benchmarks answer a question no practitioner asked: how does model M perform on dataset D in evaluation month E? A finance team deploying a model for revenue extraction on unstructured 10-Q narratives has a different dataset, different prompts, different scoring rules, and a deployment window that arrives weeks before the next public leaderboard update. Four gaps, each independently fatal to benchmark-driven model selection:

Benchmark contamination. Frontier models are trained on web-scale text that likely includes portions of FinQA, TAT-QA, FinanceBench, and most question-answering sets scraped before 2024. Reported accuracy on a contaminated benchmark is an optimistic upper bound that reflects memorization as much as reasoning over unseen filings. Contamination biases reported scores upward, and practitioners who take them at face value over-estimate deployed performance.

Task-distribution drift. A benchmark question like "What was the company's total revenue in 2022?" is structured, single-fact, and uses canonical phrasing. A production question from an analyst workflow might be "Extract trailing-twelve-month recurring-revenue growth from the MD&A, decomposed by segment, including any segment reclassifications disclosed in footnote 12." Accuracy on the first does not predict accuracy on the second, and the gap grows with prompt complexity.

Scoring mismatch. FinQA scores exact numeric match. A practitioner might accept a 0.5% tolerance on reported revenue to absorb rounding and unit conventions (millions vs. thousands). Exact-match scoring penalizes the model for a correct answer expressed in a different unit; tolerance-based scoring does not. The same responses yield wildly different accuracies under the two rules.

Cold-start on new releases. A new frontier model ships on a Tuesday. Public benchmarks covering it do not exist until a community rerun lands, typically two to six weeks later. A practitioner who waits for published scores before deciding is making a deployment decision weeks late. A practitioner with their own harness runs the eval the same afternoon and ships on Wednesday.

The combined effect: public benchmarks are a useful filter for deciding which models are worth testing, but almost useless for the specific question of which model should go into this pipeline next Monday. That gap is filled by a local eval harness.

A secondary but often-ignored failure mode: public benchmarks are usually evaluated at temperature zero with a single sample. Production workflows often run at temperature 0.2 or higher to diversify chain-of-thought paths, or they draw multiple samples and aggregate. Variance across samples is invisible in leaderboard reporting, and a model with a lower mean but tighter distribution can outperform a higher-mean, higher-variance competitor on the same workflow. A local harness that records distribution shape alongside point accuracy is the only place that signal surfaces.
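
Recording that distribution costs only a few lines. A minimal sketch, assuming a model_client callable and a per-case metric function along the lines of the harness below; the sampling budget of five is illustrative:

from statistics import mean, pstdev

def sampled_scores(case_input: dict, expected, metric_fn, model_client,
                   n_samples: int = 5) -> dict:
    # Draw several samples per case and keep the whole distribution,
    # not just a single point score.
    scores = [metric_fn(model_client(case_input), expected) for _ in range(n_samples)]
    return {"mean": mean(scores), "std": pstdev(scores), "scores": scores}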

The four eval categories for finance

Most finance LLM workloads decompose into four categories. Each demands a different metric; mixing them under a single accuracy number averages away signal.

Numeric extraction

Pull a specific numeric field from a document. Revenue from an income statement, EPS from a press release, weighted-average diluted-share count from a 10-K filing footnote. The answer is a number. The metric is numeric tolerance: the extracted value matches ground truth within an absolute and relative tolerance. Typical tolerances for reported financials: absolute tolerance of 0.01 for ratios, relative tolerance of 0.1% to 1% for dollar amounts to absorb rounding and unit conversion.

Exact-match scoring is wrong for this category. A model returning 12,345.6 when ground truth is 12,345.67 has extracted the correct number at the disclosed precision; penalizing it conflates extraction accuracy with formatting fidelity.

Narrative summarization

Summarize the risk factors section, the MD&A, or a specific disclosure. The answer is a paragraph. Scoring is harder because multiple correct summaries exist. Two practical metrics: ROUGE-L against a gold reference (captures lexical overlap), and semantic similarity via sentence embeddings (captures paraphrase equivalence). Either metric, used alone, has known failure modes; reporting both narrows the confidence interval on whether a summary is actually faithful.
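
A minimal sketch of the embedding-similarity metric, assuming the sentence-transformers package; the model name is an illustrative choice, not a recommendation:

from sentence_transformers import SentenceTransformer
import numpy as np

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example model; swap for your own

def embedding_cosine(pred: str, exp: str) -> float:
    # Cosine similarity between prediction and reference embeddings, clipped to [0, 1].
    a, b = _encoder.encode([pred, exp])
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(0.0, cos)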

A third option, LLM-as-judge, is tempting but introduces a recursive calibration problem. The eval harness depends on the judge's accuracy, which itself requires evaluation. Practitioners who rely on LLM-as-judge should validate the judge on a held-out human-scored set before trusting aggregate numbers.

Comparative ranking

Given three or five companies and a metric, rank them. Ground truth comes from analyst consensus or computed values from the underlying filings. The metric is Spearman rank correlation or Kendall's tau between the model's ranking and the reference ranking. Exact-position accuracy is too strict; rank-correlation captures the signal that matters for screening and watchlist generation.
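
A short sketch of both rank metrics on a toy five-company ranking; the tickers and orderings are illustrative:

from scipy.stats import spearmanr, kendalltau

reference = ["AAA", "BBB", "CCC", "DDD", "EEE"]   # ground-truth order, best to worst
model_out = ["AAA", "CCC", "BBB", "DDD", "EEE"]   # model's order, one adjacent swap

ref_pos = {ticker: i for i, ticker in enumerate(reference)}
model_positions = list(range(len(model_out)))
reference_positions = [ref_pos[t] for t in model_out]

rho, _ = spearmanr(model_positions, reference_positions)   # 0.90
tau, _ = kendalltau(model_positions, reference_positions)  # 0.80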

Multi-step reasoning

Compute a derived metric that requires chaining: free cash flow, interest coverage ratio, operating-leverage decomposition. The answer is a number, but the path to it traverses multiple document sections. Scoring is numeric tolerance on the final answer plus, optionally, a rubric on intermediate steps. For critical workflows, intermediate-step scoring catches models that stumble into the right answer through incorrect arithmetic.
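
A hedged sketch of that scoring shape, reusing the numeric_tolerance function from the harness skeleton below; the field names and intermediate checkpoints are illustrative and assume the prompt asks the model to report them alongside the final answer:

def score_reasoning(pred: dict, exp: dict) -> dict:
    # Score the final answer and, separately, the intermediate values the
    # model was asked to report on its way there.
    steps = ["operating_cash_flow", "capex"]
    step_scores = {s: numeric_tolerance(pred.get(s), exp[s]) for s in steps}
    return {
        "final": numeric_tolerance(pred.get("free_cash_flow"), exp["free_cash_flow"]),
        "steps": step_scores,
        "path_ok": all(v == 1.0 for v in step_scores.values()),
    }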

A fifth category worth naming, though it folds into the four above for scoring, is classification, which covers yes/no questions like "does this 10-K disclose a material cybersecurity incident?" Classification reduces to exact-match on a discrete label, but the metric that matters is not accuracy; it is precision and recall at a chosen operating point, because the base rate of positive cases is often under 5%. A 95%-accurate classifier that always predicts "no" is worthless at that base rate. The harness should report confusion-matrix entries alongside accuracy whenever class balance is skewed.
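
A minimal sketch of that confusion-matrix report; the label values and choice of positive class are illustrative:

def confusion_report(preds: list[str], labels: list[str], positive: str = "yes") -> dict:
    # Count the four confusion-matrix cells, then derive precision and recall
    # so a skewed base rate cannot hide behind raw accuracy.
    tp = sum(p == positive and l == positive for p, l in zip(preds, labels))
    fp = sum(p == positive and l != positive for p, l in zip(preds, labels))
    fn = sum(p != positive and l == positive for p, l in zip(preds, labels))
    tn = sum(p != positive and l != positive for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall,
            "accuracy": (tp + tn) / len(labels)}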

The four categories map cleanly onto distinct metric families:

| Category | Metric | Tolerance / threshold |
| --- | --- | --- |
| Numeric extraction | numeric_tolerance | abs 0.01, rel 0.001–0.01 |
| Narrative summarization | ROUGE-L + embedding cosine | threshold 0.70–0.85 |
| Comparative ranking | Spearman rho | rho > 0.7 for pass |
| Multi-step reasoning | numeric_tolerance on final answer | rel 0.005 (tighter) |

A single aggregate accuracy across mixed categories obscures which category is failing. The harness should always report per-category breakdowns.

Ground-truth sources

Ground truth for finance evals is easier to assemble than in most domains because regulators require disclosure. Four sources carry a practitioner through most workloads.

SEC EDGAR filings. Any extractable fact in a 10-K, 10-Q, 8-K, DEF 14A, S-1, or 13F is ground truth. Revenue, EPS, filing date, cash position, number of shares outstanding, risk-factor presence, executive-compensation figures: all publicly disclosed, all auditable, all legally usable for eval construction. EDGAR full-text search finds filings by company and form type; the bulk-download endpoints allow full corpus ingestion.1

XBRL-tagged filings. Since 2009 the SEC has required registrants to tag financial statements in XBRL, a machine-readable format that maps every line item to a standardized concept. For pure financial-statement extraction, XBRL is the most efficient ground-truth source because the correct answer for "what was Q3 2024 revenue?" is a single tagged value, directly parseable without human labeling.2
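
A hedged sketch of turning a tagged value into ground truth via the SEC's XBRL company-concept endpoint; the CIK, the us-gaap tag (registrants vary in which revenue tag they use), and the User-Agent string are illustrative, and the SEC asks for a descriptive User-Agent on automated requests:

import requests

def xbrl_ground_truth(cik: int,
                      concept: str = "RevenueFromContractWithCustomerExcludingAssessedTax") -> list[dict]:
    # One record per reported period for the chosen concept; each carries the
    # tagged value, the period end date, and the form it came from.
    url = f"https://data.sec.gov/api/xbrl/companyconcept/CIK{cik:010d}/us-gaap/{concept}.json"
    resp = requests.get(url, headers={"User-Agent": "your-name your-email@example.com"})
    resp.raise_for_status()
    return [{"value": f["val"], "end": f["end"], "form": f["form"],
             "fy": f.get("fy"), "fp": f.get("fp")}
            for f in resp.json()["units"]["USD"]]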

Published consensus estimates. Where licensing permits (Refinitiv, FactSet, Bloomberg, or free tiers like Yahoo Finance), consensus estimates serve as ground truth for forecasting evals. Redistribution is generally restricted, so the eval corpus stays local; only the aggregate metric leaves.

Self-labeled eval sets. For workloads without a public ground-truth source (niche disclosures, proprietary taxonomy mapping, internal document types), a human-labeled set of 50 to 200 examples is the right tool. One labeler-day produces enough ground truth to discriminate between models at the 5% accuracy level.

| Source | Best for | Labeling cost |
| --- | --- | --- |
| EDGAR raw filings | Extraction, summarization | Low; free where XBRL overlaps |
| XBRL | Financial-statement extraction | Zero; machine-parseable |
| Consensus estimates | Forecast scoring | Zero; numeric |
| Self-labeled | Task-specific workflows | 1 day per ~100 examples |

The rule of thumb: start with XBRL for anything touching the statements, add 100 hand-labeled examples for anything else, never rely on a single source.

One caveat on XBRL: tag errors exist. Registrants occasionally mis-tag line items, and the SEC's automated checks catch only a subset. For eval ground truth, this means a small fraction of XBRL values disagree with the human-readable statement. Practitioners building extraction evals should spot-check XBRL against the PDF-rendered filing on a random 10% sample before declaring the tagged value canonical. Disagreements should be resolved in favor of the human-readable filing, which is the document the model sees at inference time.

Eval harness skeleton

The minimum useful harness fits in about sixty lines. A dataclass for cases, a runner, a handful of metric functions, a report. Anything more elaborate is premature optimization.

from dataclasses import dataclass, field
from typing import Callable, Any
from statistics import mean
import re

@dataclass
class EvalCase:
    case_id: str
    input: dict
    expected: Any
    metric: str  # "exact", "numeric", "rouge", "rank"
    category: str  # "extraction", "summarization", "ranking", "reasoning"

@dataclass
class EvalResult:
    per_case: list[dict] = field(default_factory=list)
    aggregate: dict = field(default_factory=dict)

def exact_match(pred: str, exp: str) -> float:
    return 1.0 if str(pred).strip() == str(exp).strip() else 0.0

def numeric_tolerance(pred, exp, abs_tol=0.01, rel_tol=0.005) -> float:
    try:
        p, e = float(pred), float(exp)
    except (TypeError, ValueError):
        return 0.0
    if abs(p - e) <= abs_tol:
        return 1.0
    if e != 0 and abs(p - e) / abs(e) <= rel_tol:
        return 1.0
    return 0.0

def rouge_l(pred: str, exp: str) -> float:
    # Simple longest-common-subsequence ROUGE-L
    p_tokens = re.findall(r"\w+", pred.lower())
    e_tokens = re.findall(r"\w+", exp.lower())
    if not p_tokens or not e_tokens:
        return 0.0
    m, n = len(p_tokens), len(e_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i+1][j+1] = (dp[i][j] + 1 if p_tokens[i] == e_tokens[j]
                            else max(dp[i+1][j], dp[i][j+1]))
    lcs = dp[m][n]
    prec, rec = lcs / m, lcs / n
    return 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0

def spearman_rank(pred: list, exp: list) -> float:
    from scipy.stats import spearmanr  # lazy import; scipy is only needed for ranking cases
    rho, _ = spearmanr(pred, exp)
    # Clip negative correlation to zero so the score stays in [0, 1] like the other metrics.
    return max(0.0, float(rho))

METRICS = {
    "exact": exact_match,
    "numeric": numeric_tolerance,
    "rouge": rouge_l,
    "rank": spearman_rank,
}

def run_eval(cases: list[EvalCase], model_client: Callable) -> EvalResult:
    result = EvalResult()
    for case in cases:
        pred = model_client(case.input)
        score = METRICS[case.metric](pred, case.expected)
        result.per_case.append({
            "case_id": case.case_id,
            "category": case.category,
            "score": score,
            "pred": pred,
        })
    by_cat = {}
    for row in result.per_case:
        by_cat.setdefault(row["category"], []).append(row["score"])
    result.aggregate = {cat: mean(s) for cat, s in by_cat.items()}
    result.aggregate["overall"] = mean([r["score"] for r in result.per_case])
    return result

def report(result: EvalResult) -> None:
    print(f"{'category':<20} {'n':>6} {'accuracy':>10}")
    by_cat = {}
    for row in result.per_case:
        by_cat.setdefault(row["category"], []).append(row["score"])
    for cat, scores in sorted(by_cat.items()):
        print(f"{cat:<20} {len(scores):>6} {mean(scores):>10.3f}")
    print(f"{'overall':<20} {len(result.per_case):>6} "
          f"{result.aggregate['overall']:>10.3f}")

model_client is a function that takes the case input dict and returns the model's answer. It is left abstract because the harness should be model-agnostic; the same cases run against Claude, GPT-5, Gemini, or a local Llama by swapping only the client.

The design choice worth defending: per-case scores are primary, aggregate is derived. Every analysis downstream (bootstrap CIs, paired comparison, error-pattern audits) needs per-case data, because the average alone discards the distribution.

Statistical rigor

An accuracy of 0.82 on 50 cases is not comparable to 0.84 on 50 cases without a confidence interval. Two statistical practices turn the harness from a measurement into a decision tool.

Bootstrap confidence intervals. For each metric, resample the per-case scores with replacement, compute the aggregate, repeat 1,000 to 10,000 times. The 2.5th and 97.5th percentiles of the resampled aggregates form a 95% CI. This is distribution-free and handles any metric, including ROUGE-L and rank correlation.

Paired testing between two models. Evaluating model A and model B on the same cases is a paired design; independent-sample tests throw away the pairing information and need larger samples. For binary outcomes (exact-match), McNemar's test on the 2x2 table of agreement/disagreement is the exact test. For continuous scores (ROUGE-L, numeric tolerance with partial credit), Wilcoxon signed-rank is distribution-free.

Sample-size guidance: conservative 95% CI half-widths for a measured accuracy near 0.80:

| n cases | 95% CI half-width |
| --- | --- |
| 25 | ± 18 points |
| 50 | ± 13 points |
| 100 | ± 9 points |
| 200 | ± 6 points |
| 500 | ± 4 points |

A 50-case eval discriminates models only when their true accuracy differs by 15 percentage points or more. A 200-case eval resolves 8-point gaps. The tradeoff is labeling cost; 200 hand-curated cases is a day of work.

import numpy as np
from scipy.stats import wilcoxon

def bootstrap_ci(scores: list[float], n_boot: int = 5000,
                 alpha: float = 0.05) -> tuple[float, float]:
    arr = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(42)
    boots = rng.choice(arr, size=(n_boot, len(arr)), replace=True).mean(axis=1)
    lo = float(np.quantile(boots, alpha / 2))
    hi = float(np.quantile(boots, 1 - alpha / 2))
    return lo, hi

def paired_compare(scores_a: list[float], scores_b: list[float]) -> dict:
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    if len(a) != len(b):
        raise ValueError("paired test requires equal-length score vectors")
    diff = a - b
    if np.all(diff == 0):
        return {"statistic": 0.0, "pvalue": 1.0, "mean_diff": 0.0}
    stat, p = wilcoxon(a, b, zero_method="wilcox", alternative="two-sided")
    return {"statistic": float(stat), "pvalue": float(p),
            "mean_diff": float(diff.mean())}

A practitioner reporting "model B beats model A by 3 points" without a paired p-value and a bootstrap CI on the gap is reporting noise dressed up as a finding.

Avoiding the selection-bias trap

Evaluating twenty models on the same hundred cases and publishing the winner inflates the reported accuracy in exactly the same way that backtesting twenty strategies and publishing the best Sharpe inflates the reported Sharpe. The statistical structure is identical: the expected maximum of N noisy draws grows with N.

Two mitigations, used together:

Held-out test set. Partition the labeled cases 70/30 into a development set and a held-out test set at the beginning. All model selection, prompt iteration, and harness tuning happens on the development set. The test set is touched exactly once, after the winning model is chosen. Accuracy on the test set is the unbiased estimate; development-set accuracy is a biased upper bound.
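
A minimal sketch of that one-time split, stratified by category so every metric family appears in both halves; the seed and test fraction are illustrative:

import random
from collections import defaultdict

def split_dev_test(cases: list[EvalCase], test_frac: float = 0.3, seed: int = 7):
    # Shuffle within each category with a fixed seed, then carve off the
    # held-out fraction; the test half is written down once and not revisited.
    by_cat = defaultdict(list)
    for c in cases:
        by_cat[c.category].append(c)
    dev, test = [], []
    rng = random.Random(seed)
    for cat_cases in by_cat.values():
        rng.shuffle(cat_cases)
        cut = int(len(cat_cases) * test_frac)
        test.extend(cat_cases[:cut])
        dev.extend(cat_cases[cut:])
    return dev, test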

Deflated-Sharpe-style correction for model comparison. When k models are compared, the expected maximum accuracy under the null (all models equal) can be computed analytically from the variance of the per-case binomial. Bailey & Lopez de Prado's 2014 deflation framework was written for Sharpe, but the multiple-testing correction generalizes directly to any bounded metric; the same intuition applies.3 For a finance eval specifically, see the companion methodology write-up: Did You Overfit? PBO and Deflated Sharpe.

The practical rule: if the development set has been consulted more than a handful of times, the cached winner is probably overfit to it. Refresh the test split or grow the labeled corpus before shipping.
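
The size of the selection effect is easy to see by simulation: draw k equally skilled models on n cases and look at the best of the batch. A hedged sketch; k, n, and the true accuracy are illustrative:

import numpy as np

def expected_max_accuracy(k: int = 20, n: int = 100, p: float = 0.8,
                          n_sim: int = 20_000) -> float:
    # Under the null (all k models equally accurate), the best observed
    # accuracy still sits well above p; that inflation is what deflation
    # corrections and held-out test sets exist to remove.
    rng = np.random.default_rng(0)
    draws = rng.binomial(n, p, size=(n_sim, k)) / n
    return float(draws.max(axis=1).mean())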

A third mitigation, cheaper than either of the above, is pre-registration. Before the comparison run, write down the hypothesis, the metric, the decision threshold, and the set of models under test. Deviations from the pre-registered plan (switching metrics, dropping poorly-performing models, changing tolerance thresholds after seeing results) are each opportunities for the reported winner to be an artifact. Pre-registration is a habit practitioners borrow from clinical trials; it costs nothing, catches a surprising amount of self-deception.

Operational integration

An eval that runs once during model selection becomes stale the moment someone edits a prompt. The harness earns its keep as a CI step: every prompt change, model-version bump, and schema revision triggers the eval, and a metric drift beyond a preset threshold blocks deployment.

A minimal pytest integration:

# test_eval_regression.py
import json
import pytest
from pathlib import Path
from harness import run_eval, report, EvalCase

CASES_PATH = Path("evals/cases.jsonl")
BASELINE_PATH = Path("evals/baseline_metrics.json")
THRESHOLDS = {
    "extraction": 0.03,  # accuracy may drop at most 3 points
    "summarization": 0.05,
    "ranking": 0.05,
    "reasoning": 0.03,
    "overall": 0.02,
}

def load_cases() -> list[EvalCase]:
    cases = []
    for line in CASES_PATH.read_text().splitlines():
        d = json.loads(line)
        cases.append(EvalCase(**d))
    return cases

@pytest.fixture(scope="module")
def result():
    from models import current_model_client  # swapped per deployment
    return run_eval(load_cases(), current_model_client)

def test_no_category_regression(result):
    baseline = json.loads(BASELINE_PATH.read_text())
    for category, threshold in THRESHOLDS.items():
        current = result.aggregate.get(category, 0.0)
        base = baseline.get(category, 0.0)
        drop = base - current
        assert drop <= threshold, (
            f"{category}: {current:.3f} vs baseline {base:.3f}, "
            f"drop {drop:.3f} exceeds threshold {threshold:.3f}"
        )
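
load_cases above assumes one JSON object per line in evals/cases.jsonl, with keys matching the EvalCase fields. An illustrative record (the document path, question, and value are hypothetical):

import json

example_case = {
    "case_id": "extraction-0001",
    "input": {"doc": "filings/EXAMPLE_10-Q_2024Q3.txt",
              "question": "Total revenue for the quarter, in USD millions?"},
    "expected": 1234.5,
    "metric": "numeric",
    "category": "extraction",
}
print(json.dumps(example_case))  # one such line per case in the jsonl file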

Cost is a variable that deserves explicit budgeting. A 200-case eval run against Claude Opus at 2026-04 rates lands around five dollars per run for typical 10-K extraction prompts. Running the smoke set on every pull request and the full set nightly adds roughly twenty to forty dollars per month per pipeline, an amount dwarfed by the cost of a single model-selection error that ships a regression into production. The only cost trap worth flagging: evals that fan out over many models in parallel can accidentally spike spend if a developer loops the comparison without a rate limiter. A hard cap on per-run spend, enforced in the harness itself, prevents that class of incident.
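
A minimal sketch of that hard cap, enforced inside the harness rather than left to billing alerts; the per-token rates and the token-count bookkeeping are illustrative and should come from the provider's actual usage metadata:

class SpendCapExceeded(RuntimeError):
    pass

class SpendTracker:
    # Accumulates estimated spend per eval run and refuses to continue past the cap.
    def __init__(self, max_usd: float = 10.0,
                 usd_per_1k_input: float = 0.003, usd_per_1k_output: float = 0.015):
        self.max_usd = max_usd
        self.rates = (usd_per_1k_input, usd_per_1k_output)
        self.spent = 0.0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        self.spent += (input_tokens / 1000) * self.rates[0] \
                    + (output_tokens / 1000) * self.rates[1]
        if self.spent > self.max_usd:
            raise SpendCapExceeded(f"eval run exceeded ${self.max_usd:.2f} cap")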

Three operational practices make this sustainable:

  1. Baseline refresh policy. Update the baseline metric file only after a signed-off promotion. Automatic baseline refresh turns the harness into a rubber stamp.
  2. Case versioning. The cases file is under version control. New cases append; old cases never silently mutate. Renumber through a migration when the schema changes.
  3. Fast vs. full splits. A 20-case smoke set runs on every PR in under a minute. The full 200-case set runs nightly and on release candidates. Slow CI is ignored CI.

What this piece does not do

This piece publishes no accuracy numbers for any model on any task. Any such number would require running a specific harness on a specific corpus in a specific month, and the result would age the moment the next model ships. Careful practitioners regenerate those numbers on their own data, with their own prompts, under their own tolerance rules. Publishing them here would invite a second practitioner to copy-paste rather than build, which is precisely the failure mode the piece is arguing against.

The value of the harness is not the accuracy it produces on any one test. The value is the capability to produce a defensible accuracy number on demand, in-house, for the specific question in front of the deployment team.

References

  • Chen, Z., Chen, W., Smiley, C., Shah, S., Borova, I., Langdon, D., Moussa, R., Beane, M., Huang, T., Routledge, B., & Wang, W. Y. (2021). "FinQA: A Dataset of Numerical Reasoning over Financial Data." Proceedings of EMNLP 2021. The canonical public benchmark the piece argues is necessary but insufficient.
  • Islam, P., Kannappan, A., Kiela, D., Qian, R., Scherrer, N., & Vidgen, B. (2023). "FinanceBench: A New Benchmark for Financial Question Answering." Patronus AI / Stanford. Second-generation finance QA benchmark tracking real filings.
  • Harvey, C. R., & Liu, Y. (2015). "Backtesting." Journal of Portfolio Management 42(1). Multiple-testing correction framework from quantitative finance, directly applicable to model-selection evals.
  • Lin, C.-Y. (2004). "ROUGE: A Package for Automatic Evaluation of Summaries." Text Summarization Branches Out workshop, ACL 2004. Reference for the ROUGE-L metric used in the summarization category.

Footnotes

  1. U.S. Securities and Exchange Commission. EDGAR Full-Text Search System. https://efts.sec.gov/LATEST/search-index and EDGAR Public Dissemination Service. Primary source for filings by form type, date, and filer.

  2. XBRL International. XBRL 2.1 Specification and Financial Reporting Taxonomies. https://www.xbrl.org/. Machine-readable tagging standard mandated by the SEC for financial-statement data since 2009.

  3. Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), pp. 94–107. The multiple-testing correction intuition generalizes directly to any bounded metric under comparison.