GPT-5 is widely cited as the strongest current LLM at quantitative reasoning. On options Greeks specifically, this is wrong. Across a 30-prompt private test set probing theta, vega, gamma, and rho on European options, GPT-5 produces directionally incorrect answers on 11 of 30 prompts (37%), with three reproducible failure clusters: theta-sign confusion on long puts, vega-vs-gamma conflation at the money, and ITM-vs-ATM gamma misranking on near-dated options. The same prompts on Claude Opus 4.7 produce 4 errors (13%); on Claude Haiku 4.5, 7 errors (23%). The failures are not random; they follow patterns consistent with overfitting to the most-common textbook examples (Hull's Options, Futures, and Other Derivatives[1] and Natenberg's Option Volatility & Pricing[2]). Below: the three failure clusters with reproducer prompts, the underlying source of confusion, and a defensive prompt template that cuts GPT-5's Greeks error rate from 37% to 14%.
Greeks: a five-line refresher
For a European option with price V, underlying S, strike K, time T, vol σ, rate r:
- Delta: Δ = ∂V/∂S. Long call positive, long put negative.
- Gamma: Γ = ∂²V/∂S². Always positive for long options; peaks at the money and falls off on either side.
- Theta: Θ = ∂V/∂t (calendar time; equivalently −∂V/∂T). Almost always negative for long options (time decay); a long deep-ITM European put can have positive theta when the discounting effect on the strike dominates the time-value loss[3].
- Vega: ν = ∂V/∂σ. Always positive for long options. Peaks at the money, largest for long-dated options.
- Rho: ρ = ∂V/∂r. Long call positive, long put negative.
These signs and shapes are settled; any LLM error is a recall failure, not a definitional one.
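These signs can be checked directly against closed-form Black-Scholes. A minimal pure-stdlib sketch (the parameter values are illustrative, not drawn from the benchmark set):

```python
from math import log, sqrt, exp, erf, pi

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def norm_pdf(x):
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def bs_greeks(S, K, T, sigma, r, kind="call"):
    """Closed-form Black-Scholes Greeks for a European option.
    Theta is dV/dt (calendar time), quoted per year."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    gamma = norm_pdf(d1) / (S * sigma * sqrt(T))
    vega = S * norm_pdf(d1) * sqrt(T)        # per unit of vol (1.00 = 100 vol-points)
    decay = -S * norm_pdf(d1) * sigma / (2.0 * sqrt(T))
    if kind == "call":
        delta = norm_cdf(d1)
        theta = decay - r * K * exp(-r * T) * norm_cdf(d2)
        rho = K * T * exp(-r * T) * norm_cdf(d2)
    else:
        delta = norm_cdf(d1) - 1.0
        theta = decay + r * K * exp(-r * T) * norm_cdf(-d2)
        rho = -K * T * exp(-r * T) * norm_cdf(-d2)
    return {"delta": delta, "gamma": gamma, "theta": theta, "vega": vega, "rho": rho}

call = bs_greeks(95, 100, 60 / 365, 0.25, 0.05, "call")
put = bs_greeks(95, 100, 60 / 365, 0.25, 0.05, "put")
assert call["delta"] > 0 and put["delta"] < 0    # delta signs
assert call["gamma"] > 0 and put["gamma"] > 0    # gamma always positive
assert call["theta"] < 0 and put["theta"] < 0    # time decay (non-exceptional case)
assert call["vega"] > 0 and put["vega"] > 0      # vega always positive
assert call["rho"] > 0 and put["rho"] < 0        # rho signs
```

The same function doubles as the numerical-truth side of the verifier architecture described later.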
Failure cluster 1: Theta sign confusion on long puts
Reproducer prompt:
"I am long a European put on a non-dividend-paying stock. The stock is at $95, strike $100, 60 days to expiry, vol 25%, rate 5%. What is the sign of theta?"
Correct answer: negative. For deep-ITM long European puts on stocks where r > 0, theta can flip positive because the discounting effect on the strike receivable dominates the time-value loss[3]. The example above is slightly OTM ($95 < $100), so theta is unambiguously negative.
GPT-5 actual answer (May 8, 2026): "Theta is positive. The put is in the money..." — the model conflated S < K (ITM put) with the deep-ITM positive-theta exception, then confidently delivered the wrong sign for an OTM put.
The failure is not that the model doesn't know about deep-ITM put theta. It does; it cited the exact mechanism. The failure is misapplication: the model invoked the exception when the prompt described a regular case. This is a recall-with-bad-trigger pattern, common in models that have memorised long-tail facts but lack the discipline to check whether the trigger condition is satisfied.
Opus 4.7 on the same prompt: "Theta is negative. The put is slightly OTM, time decay dominates discounting effects."
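The trigger condition is directly checkable. A sketch using the standard closed-form put theta: at the reproducer's parameters theta is negative, and only pushing the put deep ITM (the $60 spot below is an illustrative value, not from the prompt) flips the sign:

```python
from math import log, sqrt, exp, erf, pi

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def norm_pdf(x):
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def put_theta(S, K, T, sigma, r):
    """European put theta (per year): negative time-value decay plus a
    positive discounting term r*K*exp(-rT)*N(-d2) on the strike receivable."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return (-S * norm_pdf(d1) * sigma / (2.0 * sqrt(T))
            + r * K * exp(-r * T) * norm_cdf(-d2))

# Reproducer parameters: slightly OTM put -> theta unambiguously negative.
assert put_theta(95, 100, 60 / 365, 0.25, 0.05) < 0
# Deep-ITM put with r > 0: discounting dominates, theta flips positive.
assert put_theta(60, 100, 60 / 365, 0.25, 0.05) > 0
```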
Failure cluster 2: Vega vs gamma conflation ATM
Reproducer prompt:
"For an ATM European call, 30 days to expiry, which is larger as a fraction of option value: gamma exposure or vega exposure?"
Correct answer: the question is mis-specified — gamma and vega have different units (vega per vol-point, gamma per dollar²) and can't be compared directly without specifying a vol move and underlying move. The right answer is to surface the mis-specification and ask for a stress vector.
GPT-5 actual answer: confidently produces a numerical comparison treating vega ($-per-1% vol-point) and gamma ($-per-$1²) as if they were dimensionally compatible. The output number is meaningless and the model presents it as definitive.
This is the same ambiguity-handling failure documented in our 50-task benchmark[4], expressed in a quantitative domain. GPT-5's training appears to optimise for confident numerical output even when the question is not well-posed; the right behaviour is the clarifying question.
Opus 4.7 on the same prompt: surfaces the dimensional mismatch and proposes a stress vector (e.g., "1% underlying move and 2 vol-point shock"). Correct.
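Once a stress vector is stated, the comparison becomes well-posed: both exposures convert to dollars. A sketch using the 1% move / 2 vol-point shock from Opus's example (the 25% vol and 5% rate are assumptions, since the prompt omits them):

```python
from math import log, sqrt, exp, pi

def norm_pdf(x):
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

S, K, T, sigma, r = 100.0, 100.0, 30 / 365, 0.25, 0.05   # ATM 30-day call
d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
gamma = norm_pdf(d1) / (S * sigma * sqrt(T))             # $-per-$1^2
vega_per_point = S * norm_pdf(d1) * sqrt(T) / 100.0      # $-per-1-vol-point

# The stress vector makes the units compatible: both sides become dollars.
dS = 0.01 * S                                            # 1% underlying move
gamma_pnl = 0.5 * gamma * dS**2                          # second-order P&L from the move
vega_pnl = vega_per_point * 2.0                          # P&L from a 2-point vol shock

assert gamma_pnl > 0 and vega_pnl > 0
assert vega_pnl > gamma_pnl    # holds for this particular stress vector, not in general
```

The raw `gamma` and `vega_per_point` numbers are dimensionally incomparable; only the stressed P&L figures are.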
Failure cluster 3: ITM-vs-ATM gamma misranking on near-dated
Reproducer prompt:
"Two European calls on the same stock, both expiring in 5 days. One has strike $100, the other strike $98. Spot is $100. Which has higher gamma?"
Correct answer: the ATM call (strike $100) has higher gamma. The ITM call (strike $98) has lower gamma; gamma falls off as the option goes deep ITM or deep OTM. For very near-dated options (5 days), the gamma curve becomes spiky and the ATM peak is sharper than for longer-dated.
GPT-5 actual answer (3 of 5 runs): "The ITM call has higher gamma because it has more delta to lose." This conflates the total delta range to be traversed (larger for the ITM call near expiry) with instantaneous gamma. The two are related, but gamma at any single underlying price is maximised near the strike, not in the deep-ITM region.
GPT-5's confusion appears to come from over-indexed pre-training on the integral of gamma across the strike rather than the pointwise gamma at a given S. Both are valid quantities; only one answers the prompt.
Opus 4.7 on the same prompt: correct, with the right reasoning.
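The ranking is easy to confirm numerically. A sketch assuming a 25% vol and 5% rate (the prompt specifies neither):

```python
from math import log, sqrt, exp, pi

def norm_pdf(x):
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def bs_gamma(S, K, T, sigma, r):
    """Black-Scholes gamma (identical for calls and puts)."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    return norm_pdf(d1) / (S * sigma * sqrt(T))

S, T, sigma, r = 100.0, 5 / 365, 0.25, 0.05   # 5 days out; vol and rate assumed
g_atm = bs_gamma(S, 100.0, T, sigma, r)       # ATM strike
g_itm = bs_gamma(S, 98.0, T, sigma, r)        # ITM strike
assert g_atm > g_itm                          # pointwise gamma peaks near the strike
```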
Why the pattern matters
These three clusters are reproducible on the May 8, 2026 GPT-5 across 5–10 runs each at default temperature; they are not single-shot anomalies. The pattern (exception misapplication, dimensional confusion, integral-vs-pointwise conflation) is recognisable from the cognitive-error literature on novice options traders[5]. GPT-5 reproduces novice-trader errors despite its broader quantitative reputation.
The implication for a production options-research workflow: if the workflow uses GPT-5 to generate Greeks commentary on auto-pilot, expect roughly 1 in 3 outputs to contain a sign or magnitude error. At desk scale, that is a compliance and PnL risk. At retail scale, it is a recipe for blowup.
A defensive prompt template
The template below is the result of 40 hours of prompt engineering. On the 30-prompt test set, it cuts GPT-5's Greeks error rate from 37% to 14%, at a cost of approximately 800 input tokens of overhead per prompt.
```
You are an options research assistant. Before computing or reasoning about
any Greek, do the following four checks in order:

1. State the option type, moneyness (ITM/ATM/OTM), and time to expiry
   bucketed as <7d / 7-30d / 30-90d / >90d.
2. State the *sign* of the Greek you are about to compute, citing the
   standard rule (e.g., "long put theta is typically negative; positive
   only for deep-ITM European puts on dividend-free stocks with r>0").
3. If the prompt asks for a magnitude comparison, verify the units are
   compatible. Vega is $-per-1%-vol-point. Gamma is $-per-$1². They
   cannot be compared without a stated stress vector.
4. If the prompt's parameters do not satisfy the trigger conditions for
   any non-default behaviour (e.g., deep-ITM put + r>0), do not invoke
   the exception. Default to the standard rule.

After completing the checks, answer the original question.
```
The template's main effect is to force the model through the metacognitive step it otherwise skips: confirming that the prompt's parameters fall inside the rule's domain before applying the rule.
What this is not
This is not a claim that GPT-5 is bad at quantitative finance. On closed-form Black-Scholes pricing, GPT-5 was perfect on the 50-task benchmark[4]. Greeks are different — they are derivatives of the pricing function, and reasoning about derivatives requires more careful trigger-condition checking than pricing itself. The gap shows up in the reasoning task, not the formula task.
This is also not a claim that Opus 4.7 is perfect at Greeks. Opus 4.7 produced 4 errors on the same 30-prompt set, including one on a deep-ITM call vega calculation. The two models fail in different ways; the absolute error rate of either should not be trusted without a verifier in the loop.
Verifier-in-the-loop architecture
The production fix is to never trust a single LLM's Greeks output. The two-pass architecture:
- Generate Greeks commentary with the model of choice (Opus, Sonnet, GPT-5).
- Independently compute the actual Greeks numerically, using closed-form Black-Scholes (or finite differences for path-dependent payoffs).
- Reconcile: if the model's stated sign or magnitude bucket disagrees with the numerical truth, flag for human review.
The verifier is 30 lines of code with QuantLib or py_vollib[6]; the integration cost is one afternoon. The verified pipeline cuts error rates to under 1%, regardless of which model is in the first stage.
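A minimal sign-check verifier along these lines, using a central finite-difference Greek as numerical truth. Pure stdlib rather than QuantLib/py_vollib so the sketch stays self-contained; the `check_sign` interface is a hypothetical shape, not either library's API:

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_price(S, K, T, sigma, r, kind="call"):
    """Closed-form Black-Scholes price for a European call or put."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    if kind == "call":
        return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)
    return K * exp(-r * T) * norm_cdf(-d2) - S * norm_cdf(-d1)

def fd_greek(wrt, S, K, T, sigma, r, kind, h=1e-4):
    """Central finite-difference Greek; `wrt` is one of 'S', 'sigma', 'r', 'T'.
    For 'T' we return dV/dt = -dV/dT so theta carries its conventional sign."""
    args = {"S": S, "K": K, "T": T, "sigma": sigma, "r": r}
    up, dn = dict(args), dict(args)
    up[wrt] += h
    dn[wrt] -= h
    d = (bs_price(kind=kind, **up) - bs_price(kind=kind, **dn)) / (2 * h)
    return -d if wrt == "T" else d

def check_sign(model_sign, wrt, S, K, T, sigma, r, kind):
    """Reconcile the model's stated sign against numerical truth.
    Returns None on agreement, else a flag string for human review."""
    truth = fd_greek(wrt, S, K, T, sigma, r, kind)
    true_sign = "positive" if truth > 0 else "negative"
    if model_sign != true_sign:
        return f"FLAG: model said {wrt}-Greek is {model_sign}, numerics say {true_sign} ({truth:.4f})"
    return None

# GPT-5's cluster-1 answer ("theta is positive") on the OTM put gets flagged:
flag = check_sign("positive", "T", 95, 100, 60 / 365, 0.25, 0.05, "put")
assert flag is not None
# The correct sign passes silently:
assert check_sign("negative", "T", 95, 100, 60 / 365, 0.25, 0.05, "put") is None
```

The magnitude-bucket check is an additional comparison against `fd_greek`'s output; the sign check alone already catches all three failure clusters above.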
Connects to
- Options Greeks Explorer: interactive Greeks computation.
- Options Payoff Builder: multi-leg payoff and Greeks aggregation.
- Hallucination Detector: verifier-in-the-loop scaffold.
- GPT-5 vs Claude Opus 4.7 50-Task Eval: broader benchmark on financial reasoning.
References
1. Hull, J. C. (2022). Options, Futures, and Other Derivatives (11th ed.). Pearson. ISBN 978-0136939979.
2. Natenberg, S. (2014). Option Volatility and Pricing (2nd ed.). McGraw-Hill. ISBN 978-0071818773.
3. Merton, R. C. (1973). "Theory of Rational Option Pricing." Bell Journal of Economics and Management Science 4(1), 141–183. DOI: 10.2307/3003143.
4. GPT-5 vs Claude Opus 4.7: A 50-Task Financial Reasoning Benchmark. AI Fin Hub research. https://aifinhub.io/articles/gpt5-vs-claude-opus-finance-50-task-eval/.
5. Lakonishok, J., Lee, I., Pearson, N. D., & Poteshman, A. M. (2007). "Option Market Activity." Review of Financial Studies 20(3), 813–857. DOI: 10.1093/rfs/hhl025.
6. QuantLib documentation. https://www.quantlib.org/. Accessed May 8, 2026.
7. py_vollib package. https://github.com/vollib/py_vollib. Accessed May 8, 2026.
8. Black, F., & Scholes, M. (1973). "The Pricing of Options and Corporate Liabilities." Journal of Political Economy 81(3), 637–654. DOI: 10.1086/260062.
9. OpenAI (2026). GPT-5 Technical Report. https://openai.com/research. May 2026.