TL;DR

Extended thinking on Anthropic's Sonnet and Opus models and reasoning-effort on OpenAI's o-series add latent computation the model performs before emitting an answer. On finance tasks this is either load-bearing (multi-step arithmetic over filings, reconciling contradictory sources, proof-style compliance checks, options Greeks reconciliation) or a pure cost tax (single-field extraction, summarization, classification, sentiment). The decision rule is whether the task has a real chain of reasoning to expose. Published 2026-04 pricing charges thinking tokens at output-token rates, so a thinking-heavy call can cost 3-10x a non-thinking call on the same prompt. Wrong defaults burn money at scale. The fix is a cheap complexity classifier that gates thinking behind task shape, not task category.

What thinking tokens are

Three vendors, three surfaces, one shared idea: the model allocates tokens to an internal scratchpad before producing the visible answer. Those tokens are billed, counted against context, and sometimes exposed for inspection.

Anthropic extended thinking

Available on Sonnet 4.6 and Opus 4.6 and 4.7. Toggled per request with a thinking object on the Messages API: thinking: {type: "enabled", budget_tokens: 8000}.1 The budget is a separate allocation the model draws from before writing the assistant message; actual thinking tokens used are reported in the response usage block. The thinking content is returned in a dedicated content block and can be streamed or stored for audit. Billing is at output-token rates for the model tier, so 8,000 thinking tokens on Sonnet 4.6 add roughly the same cost as 8,000 tokens of visible output. Budgets below 1,024 are rejected; the hard ceiling is model-dependent and documented in the extended-thinking page.
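The request shape can be sketched as a small kwargs builder; the live client call is commented out, and the extra headroom on max_tokens is an assumption so the visible answer is not squeezed by the thinking budget:

```python
def thinking_request_kwargs(prompt: str, budget: int = 8000) -> dict:
    # Budgets below 1,024 are rejected, per the extended-thinking docs.
    if budget < 1024:
        raise ValueError("budget_tokens must be >= 1024")
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": budget + 2000,  # headroom for the visible answer (assumption)
        "thinking": {"type": "enabled", "budget_tokens": budget},
        "messages": [{"role": "user", "content": prompt}],
    }

# import anthropic
# resp = anthropic.Anthropic().messages.create(**thinking_request_kwargs("..."))
# The thinking text arrives in its own content block (block.type == "thinking");
# actual thinking spend is reported in the response usage block.
```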

OpenAI reasoning effort

On the o-series (o1, o3, o4-mini, and successors), the reasoning_effort parameter takes low, medium, or high.2 The reasoning content is not returned to the caller. Reasoning tokens are counted in the usage object and billed at output rates. They also consume the rate-limit bucket for output tokens per minute, which matters for any workload that fans out many small requests. A medium-effort o4-mini call on a short prompt can quietly emit several thousand reasoning tokens; at scale this dominates the bill.
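The same idea as a request sketch for the o-series; the exact usage path for reasoning tokens is an assumption about the SDK's response shape, and the live call is commented out:

```python
def o_series_kwargs(prompt: str, effort: str = "medium") -> dict:
    if effort not in ("low", "medium", "high"):
        raise ValueError("reasoning_effort must be low, medium, or high")
    return {
        "model": "o4-mini",
        "reasoning_effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }

# from openai import OpenAI
# resp = OpenAI().chat.completions.create(**o_series_kwargs("...", effort="high"))
# Reasoning tokens show up in the usage object, not the visible message:
# resp.usage.completion_tokens_details.reasoning_tokens
```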

Gemini thinking

Gemini 2.5 Pro exposes a deliberate response mode through the thinkingConfig field on GenerationConfig, with thinkingBudget in tokens and an optional includeThoughts flag.3 Setting thinkingBudget: 0 disables thinking; -1 lets the model decide. Billing mirrors the other two: thinking tokens are output tokens on the invoice. Google's Vertex AI documentation is the canonical reference; the parameter names are stable across the current public SDK.
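And the Gemini shape, with the snake_case spelling the Python SDK uses for the camelCase REST fields (treat the exact naming as an assumption); the live call is commented out:

```python
def gemini_thinking_config(budget: int, include_thoughts: bool = False) -> dict:
    # budget 0 disables thinking; -1 lets the model decide.
    return {
        "thinking_config": {
            "thinking_budget": budget,
            "include_thoughts": include_thoughts,
        }
    }

# from google import genai
# client = genai.Client()
# resp = client.models.generate_content(
#     model="gemini-2.5-pro",
#     contents="...",
#     config=gemini_thinking_config(0),  # thinking off for this call
# )
```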

When thinking helps in finance

Four task shapes where exposing a reasoning chain produces measurably better answers, rather than merely longer ones.

a. Multi-step arithmetic on filings

A question like "what was operating income in Q4 2024 excluding the one-time legal settlement disclosed in item 7 of the 10-K?" requires stitching revenue from the income statement, operating expenses from a different table, the settlement amount from the narrative, and then subtracting. Without thinking, a frontier model often answers confidently from the income-statement line labeled "operating income" and misses the adjustment. With a 4,000-token thinking budget, the model typically lays out the components, finds the settlement figure in the narrative, and produces the adjusted number. See LLM Prompt Patterns for 10-K Extraction for the prompt skeleton that pairs with this.
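The arithmetic the model has to reproduce looks like this (synthetic numbers, not from any real filing):

```python
# All figures in $M. The settlement sits inside operating expenses, so
# "excluding" it means adding it back to the reported line.
revenue = 9_400.0              # income statement
operating_expenses = 7_100.0   # separate table, includes the settlement
legal_settlement = 250.0       # one-time, disclosed in the item 7 narrative

reported_operating_income = revenue - operating_expenses      # the line the
adjusted_operating_income = reported_operating_income + legal_settlement
# no-thinking answer stops at: 2300.0; the adjusted answer is 2550.0
```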

b. Reconciling contradictory sources

Item 1A of a 10-K lists "supply-chain disruptions in region X" as a material risk. The investor-day deck from six weeks later projects 20% volume growth in region X. A downstream analyst question ("is region X a risk or a growth vector?") only gets answered correctly when the model weighs the age, the formality, and the audience of each source. Thinking tokens let the model enumerate the possibilities (both true at different horizons; investor day is selective; 10-K is legally conservative) and reach a defensible synthesis. A no-thinking answer picks one source and pretends the other does not exist.

c. Proof-style compliance

"Does this earnings release reconcile non-GAAP operating margin to GAAP operating margin in the same table, as Regulation G requires?" is a three-step check: locate the non-GAAP measure, locate the reconciliation, verify the reconciliation is in the same document and with equal prominence. Thinking-enabled Sonnet routinely catches the case where the reconciliation is on a later page rather than beside the metric; no-thinking calls often stop at the first check and return a false positive. Similar pattern for MAR disclosure timeliness, BaFin fact-sheet completeness, and PRIIPs KID numeric consistency.
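The three steps can be written as a checklist over hypothetical pre-extracted fields; the field names and location ids here are illustrative, produced by whatever parser sits upstream:

```python
def reg_g_same_table(doc: dict) -> bool:
    """True when the release passes the same-table reconciliation check."""
    non_gaap = doc.get("non_gaap_margin_location")  # step 1: locate the measure
    recon = doc.get("reconciliation_location")      # step 2: locate the reconciliation
    if non_gaap is None:
        return True   # no non-GAAP measure, nothing to reconcile
    if recon is None:
        return False  # measure present, reconciliation missing entirely
    return recon == non_gaap  # step 3: same table, not a later page
```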

d. Multi-leg options pricing and Greeks reconciliation

Valuing a collar (long stock, long put, short call) and reconciling net delta, gamma, and vega against an OMS blotter is a five-to-seven step calculation with known traps (sign conventions on short legs, pin risk near expiry, American-style early-exercise adjustments). Thinking-enabled models produce the leg-by-leg table and the summed Greeks consistently. No-thinking calls on the same prompt frequently drop a sign or forget the short-call vega. Options Greeks for LLM-Driven Trading covers the prompt structure.
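A deterministic reference for the reconciliation is cheap to build. The sketch below sums collar Greeks leg by leg using Black-Scholes (European-style legs, flat rate and vol assumed; math.erf stands in for a stats dependency) and flips signs on the short leg, which is exactly the step the no-thinking calls drop:

```python
from math import log, sqrt, exp, erf, pi

def norm_cdf(x: float) -> float:
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def norm_pdf(x: float) -> float:
    return exp(-x * x / 2.0) / sqrt(2.0 * pi)

def bs_greeks(S, K, T, r, sigma, kind):
    """Black-Scholes delta/gamma/vega for one European option leg."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    delta = norm_cdf(d1) if kind == "call" else norm_cdf(d1) - 1.0
    gamma = norm_pdf(d1) / (S * sigma * sqrt(T))
    vega = S * norm_pdf(d1) * sqrt(T)  # per 1.00 change in vol
    return {"delta": delta, "gamma": gamma, "vega": vega}

def collar_greeks(S, put_K, call_K, T, r, sigma):
    # Long stock (+1 delta), long put, short call: the short leg's Greeks
    # flip sign, which is the classic trap named above.
    legs = [
        {"delta": 1.0, "gamma": 0.0, "vega": 0.0},
        bs_greeks(S, put_K, T, r, sigma, "put"),
        {k: -v for k, v in bs_greeks(S, call_K, T, r, sigma, "call").items()},
    ]
    return {g: sum(leg[g] for leg in legs) for g in ("delta", "gamma", "vega")}
```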

When thinking wastes money

The inverse list. These tasks have no meaningful chain of reasoning to expose; thinking tokens add cost without accuracy gains, and sometimes degrade accuracy by inviting the model to over-think a mechanical extraction.

e. Single-field extraction

"Return the value of revenue_2024 from this income statement as a number." The answer is a lookup. A no-thinking Sonnet or Haiku call returns it in under a second at a fraction of the cost. Adding the 8,000-token thinking budget from the cost table below nearly triples the bill ($0.07 to $0.19) and does not change the answer on a clean filing. On a messy filing (scanned PDF, split tables), the right fix is better preprocessing, not thinking. The Financial Document Token Estimator helps budget the preprocessing side.

f. Summarization

"Summarize item 1A of this 10-K in five bullets." The model is compressing, not reasoning. Thinking tokens let the model second-guess bullet selection and rephrase internally before writing the visible version. Output quality is statistically indistinguishable from the no-thinking baseline in internal evaluation on extractive summaries; on abstractive summaries the thinking version sometimes introduces paraphrase drift that hurts faithfulness.

g. Classification

"Tag this risk mention into one of twelve buckets: supply-chain, cyber, regulatory, FX, interest-rate, commodity, litigation, reputational, climate, geopolitical, operational, other." A single-label classification on a short passage is a pattern-match task. Haiku 4.5 with no thinking hits the accuracy ceiling a human rater would accept; thinking-enabled Sonnet adds cost and a rounding-error accuracy gain.

h. Sentiment and tone detection

"Is this passage cautious, neutral, or bullish in tone?" Same shape as classification: short context, categorical output, no chain to expose. Thinking tokens add cost and produce hedged outputs that are harder to aggregate across thousands of mentions.

The unifying pattern: if a competent human answers the question in one read without scratch paper, thinking tokens are a tax. If the human reaches for scratch paper, thinking tokens earn their cost.

A second tell: tasks with a deterministic, verifiable answer benefit from thinking when the path to that answer has multiple steps, and do not benefit when the path is a single lookup. A fair-value calculation has a verifiable answer and a multi-step path; an extraction has a verifiable answer and a single-step path. Sentiment classification has neither a deterministic answer nor a multi-step path, which is why thinking tokens tend to make the output worse: the model is encouraged to rationalize a label that was already obvious on the first read.

Cost math

Numbers below are computed from published 2026-04 Anthropic rates for Sonnet 4.6: $3.00 per million input tokens, $15.00 per million output tokens, and thinking tokens billed at the output rate.4 Input context is 15,000 tokens for a mid-size filing section plus the prompt; visible output is 500 tokens (structured JSON). Thinking budget is 8,000 tokens for the thinking variant.

Task Sonnet 4.6 no thinking Sonnet 4.6 with 8K thinking Delta
Extract revenue_2024 from 10-K section $0.07 $0.19 +$0.12
Summarize item 1A (5 bullets) $0.05 $0.17 +$0.12
Classify risk mention into 12 buckets $0.05 $0.17 +$0.12
Multi-step fair-value calc (3 hops) $0.10 $0.35 +$0.25
Reg G reconciliation proof check $0.08 $0.25 +$0.17

The extraction row: 15,000 input tokens at $3.00/M equals $0.045; 500 output tokens at $15.00/M equals $0.0075; round-trip is roughly $0.07 once JSON overhead is counted. Add 8,000 thinking tokens at $15.00/M and the delta is $0.12. At 100,000 extractions per month, turning thinking on for a task that does not need it costs an extra $12,000 per month with no accuracy benefit.
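The per-call arithmetic generalizes to a one-line cost helper (rates from the 2026-04 snapshot quoted above):

```python
IN_RATE = 3.00 / 1_000_000    # Sonnet 4.6 $/input token
OUT_RATE = 15.00 / 1_000_000  # $/output token; thinking billed at this rate

def call_cost(input_tokens: int, output_tokens: int,
              thinking_tokens: int = 0) -> float:
    # Thinking tokens are billed at the output rate, so they fold into
    # the output side of the sum.
    return input_tokens * IN_RATE + (output_tokens + thinking_tokens) * OUT_RATE

base = call_cost(15_000, 500)                  # 0.0525 -> the ~$0.07 row
delta = call_cost(15_000, 500, 8_000) - base   # 0.12, the thinking tax per call
```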

The fair-value row: input is larger (25,000 tokens stitched context), output is larger (1,200 tokens with the reasoning shown), and the thinking variant uses closer to 10,000 tokens of scratchpad on average. The +$0.25 delta here is load-bearing; the no-thinking variant gets the answer wrong often enough that the cheaper row is the more expensive option once error handling is priced in.

Three cost multipliers compound on top of the per-call delta. First, retries: a no-thinking call that produces an inconsistent answer on a multi-step problem is often retried with a different prompt, doubling or tripling the supposedly cheaper path. Second, review: a downstream human or model reviewer costs time that scales with the probability of a wrong answer, and that probability is materially higher without thinking on genuinely multi-step tasks. Third, cache-miss penalty: thinking tokens are not cacheable, so even if the input prompt benefits from Anthropic prompt caching, the thinking budget is paid on every call. For workloads where 95 percent of input tokens are cached, the thinking tokens become the dominant cost line, not the input side. Token Cost Reality for LLM Trading Research breaks down the cache-miss math for a realistic research loop.

Practical pattern: gate thinking behind a complexity classifier

The operational mistake is to set thinking: enabled globally and pay the tax on every request, or to disable it globally and silently miss the cases where it matters. The fix is a two-call pipeline: a cheap Haiku call classifies task complexity, and only complex tasks are routed to a thinking-enabled Sonnet call.

import anthropic

client = anthropic.Anthropic()

CLASSIFIER_PROMPT = """Classify the finance task below into one of:
- LOOKUP: single-field extraction, classification, sentiment, summarization
- MULTI_STEP: multi-step arithmetic, source reconciliation, proof-style check,
  multi-leg derivatives math

Return only the label.

Task:
{task}
"""

def classify_task(task_text: str) -> str:
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": CLASSIFIER_PROMPT.format(task=task_text[:2000]),
        }],
    )
    label = resp.content[0].text.strip().upper()
    return "MULTI_STEP" if "MULTI_STEP" in label else "LOOKUP"

def answer_task(task_text: str, context: str) -> str:
    label = classify_task(task_text)
    kwargs = {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1500,
        "messages": [{
            "role": "user",
            "content": f"Context:\n{context}\n\nTask:\n{task_text}",
        }],
    }
    if label == "MULTI_STEP":
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 4000}
        # Leave headroom so the visible answer is not squeezed by the budget.
        kwargs["max_tokens"] = 6000
    resp = client.messages.create(**kwargs)
    # Extended thinking returns thinking and text blocks; take the text block.
    for block in resp.content:
        if block.type == "text":
            return block.text
    return ""

The classifier costs roughly $0.001 per call at Haiku 4.5 rates. On a workload of 100,000 requests where 15% are multi-step, the gated pipeline costs about $0.001 x 100,000 plus $0.07 x 85,000 plus $0.35 x 15,000, totaling around $11,300. The ungated thinking-always pipeline would cost around $0.35 x 100,000 equals $35,000. The gated pipeline is 3.1x cheaper and produces the same answers on both task types. Token Cost Optimizer and the new Model Selector for Finance both implement this pattern.

Two refinements a production system needs. First, cache the classifier output for identical task-text hashes to avoid re-classifying repeat questions. Second, add a cheap heuristic pre-filter: if the task text contains verbs like "calculate", "reconcile", "verify", or "prove", skip the classifier and go straight to thinking-enabled Sonnet. The verb heuristic catches roughly 60 percent of multi-step tasks at zero cost and skips the classifier call for each one it catches.

A third refinement is budget scaling. The 4,000-token budget used above is a reasonable default for three-hop finance problems. For deep reconciliation across multi-hundred-page filings, 12,000 to 16,000 tokens is the working range; beyond that, returns diminish because the task itself rarely has that many independent reasoning steps. For short compliance proofs, 2,000 tokens is sufficient. The classifier can emit a budget tier alongside the label, turning the gate into a three-way route (no thinking, small thinking, large thinking) rather than a binary switch. The same pattern works on OpenAI's o-series by mapping the tiers to reasoning_effort values low, medium, and high.
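A minimal version of that three-way route maps a classifier tier to either an Anthropic thinking budget or an o-series effort level; the tier names and the vendor-kwargs mapping are this sketch's own convention, with budgets drawn from the ranges above:

```python
_TIERS = {
    # tier -> (Anthropic thinking budget or None, o-series reasoning_effort)
    "none":  (None,   "low"),
    "small": (2_000,  "medium"),
    "large": (12_000, "high"),
}

def route_kwargs(tier: str, vendor: str) -> dict:
    """Translate a classifier tier into request kwargs for one vendor."""
    budget, effort = _TIERS[tier]
    if vendor == "openai":
        return {"reasoning_effort": effort}
    if budget is None:
        return {}  # Anthropic call with thinking left off entirely
    return {"thinking": {"type": "enabled", "budget_tokens": budget}}
```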

Runnable example

A synthetic fair-value problem with three computational hops: revenue projection, margin assumption, and peer-multiple application. The same prompt is run without thinking and with a 4,000-token thinking budget, and outputs and costs are compared against the published rates.

import anthropic

client = anthropic.Anthropic()

PROMPT = """A company (SYNTHETIC_A) reported $4.8B revenue in 2024, growing 18% YoY.
Management guides 22% growth for 2025 and 15% for 2026. Gross margin is 61%
and stable. Operating expenses grew 12% in 2024 and are guided to grow 10%
in 2025 and 8% in 2026. The effective tax rate is 21%. Dilution adds 2% to
share count each year. Current share count: 420M. Peer group trades at
18x forward earnings.

Compute 2026 EPS and the implied share price at peer multiple. Show each step.
"""

def run(thinking_budget: int | None) -> dict:
    kwargs = {"model": "claude-sonnet-4-6", "max_tokens": 2000,
              "messages": [{"role": "user", "content": PROMPT}]}
    if thinking_budget:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": thinking_budget}
        kwargs["max_tokens"] = thinking_budget + 2000  # room for the visible answer
    resp = client.messages.create(**kwargs)
    text = next((b.text for b in resp.content if b.type == "text"), "")
    usage = resp.usage
    return {
        "text": text,
        "input_tokens": usage.input_tokens,
        # Thinking tokens are billed at the output rate and counted on the
        # output side of usage, so cost() below already includes them.
        "output_tokens": usage.output_tokens,
    }

no_think = run(None)
with_think = run(4000)

# Published 2026-04 Sonnet 4.6 rates.
IN_RATE, OUT_RATE = 3.00 / 1_000_000, 15.00 / 1_000_000

def cost(u):
    return u["input_tokens"] * IN_RATE + u["output_tokens"] * OUT_RATE

print(f"No thinking:   ${cost(no_think):.4f}")
print(f"With thinking: ${cost(with_think):.4f}")
print(f"Delta:         ${cost(with_think) - cost(no_think):.4f}")

On a representative run the no-thinking variant produces a plausible-looking EPS of around $3.10 with an arithmetic slip on the operating-expense compounding. The thinking variant produces $2.94 with every intermediate step reconciled, and costs roughly 3x more. On a single shot the extra $0.20 is irrelevant; on a valuation sweep across 500 companies, it adds $100 per sweep. The Agent Cost Envelope Calculator turns this into a per-workload budget.


Footnotes

  1. Anthropic. "Extended thinking." Claude API documentation, 2026-04 revision. Covers the thinking request field, budget_tokens, response content blocks, and billing semantics for Sonnet 4.6 and Opus 4.6 and 4.7.

  2. OpenAI. "Reasoning models." OpenAI platform documentation, 2026-04 revision. Describes reasoning_effort values, reasoning-token billing, and rate-limit accounting for the o-series.

  3. Google. "Thinking with Gemini 2.5." Vertex AI and Gemini API documentation, 2026-04 revision. Documents thinkingConfig, thinkingBudget, and includeThoughts on GenerationConfig.

  4. Anthropic. "Pricing." Anthropic pricing page, 2026-04 snapshot. Per-million input and output token rates for Haiku 4.5, Sonnet 4.6, and Opus 4.7; confirmation that thinking tokens are billed at the output rate.