Risk-Adjusted Returns: Benchmark Choice Drives the Report

Swap the benchmark and a strategy's alpha can flip from positive to negative on the identical return series, which makes benchmark choice a load-bearing audit decision rather than a footnote. For a 20-observation daily return series from an equity strategy, benchmarked against a tightly-tracking sector index, the Risk-Adjusted Returns engine returns annualized Sharpe 6.36, Sortino 12.49, Calmar 99.22, information ratio 0.354, beta 1.336, alpha +0.0173 annualized, and tracking error 0.0157. Re-benchmarked against a dampened, near-flat broad index, the same series returns identical Sharpe, Sortino, and Calmar but tracking error 0.0489, information ratio 0.276, beta 5.57, and alpha −0.451 annualized. Sharpe is invariant to benchmark choice; alpha, beta, IR, and tracking error are not, and alpha even flips sign. Because the strategy series is identical across the two runs, every change in the relative-risk metrics here comes from the benchmark, which is why the report must document it. Both runs are computed live and shown in the Verified engine output section below.

TL;DR

Same strategy returns, two benchmark choices, two completely different risk reports:

Metric	Benchmark A (sector-weighted)	Benchmark B (broad, low-vol)
Sharpe annualized	6.36	6.36
Sortino annualized	12.49	12.49
Calmar	99.22	99.22
Annualized return	0.496	0.496
Tracking error	0.0157	0.0489
Information ratio	0.354	0.276
Beta	1.336	5.57
Alpha (annualized)	+0.0173	−0.451

The first four metrics are absolute (no benchmark dependency). The last four are relative and move materially — alpha swings from +1.7% to −45.1% on the identical strategy. An audit report that quotes a single alpha figure without naming the benchmark is unverifiable on its face.

The metrics that depend on benchmark

Four metrics in the engine output depend on the benchmark choice:

Tracking error (trackingError), annualized standard deviation of (strategy − benchmark) returns.
Information ratio (informationRatio), annualized mean of (strategy − benchmark) returns divided by tracking error.
Beta (beta), slope of strategy returns regressed on benchmark returns (CAPM definition).
Alpha (alphaAnn), annualized intercept of the same regression.
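The four definitions above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the engine's implementation: the function name relative_metrics, the 252-period annualization, the sample (n − 1) standard deviation, and the arithmetic annualization of alpha are all assumptions, so its output need not match the engine's figures digit for digit.

```python
import math
from statistics import mean, stdev

def relative_metrics(strategy, benchmark, periods_per_year=252):
    """Benchmark-relative metrics from two equal-length per-period return series.

    Assumptions (not confirmed against the engine): sample standard deviation,
    arithmetic annualization, no risk-free adjustment inside the regression.
    """
    if len(strategy) != len(benchmark) or len(strategy) < 3:
        raise ValueError("need two equal-length series with at least 3 observations")

    # Active (strategy minus benchmark) returns drive tracking error and IR.
    active = [s - b for s, b in zip(strategy, benchmark)]
    tracking_error = stdev(active) * math.sqrt(periods_per_year)
    information_ratio = mean(active) * periods_per_year / tracking_error

    # OLS slope and intercept of strategy returns on benchmark returns.
    mean_s, mean_b = mean(strategy), mean(benchmark)
    cov_sb = sum((s - mean_s) * (b - mean_b) for s, b in zip(strategy, benchmark))
    var_b = sum((b - mean_b) ** 2 for b in benchmark)
    beta = cov_sb / var_b  # fails loudly if the benchmark is constant
    alpha_ann = (mean_s - beta * mean_b) * periods_per_year

    return {
        "trackingError": tracking_error,
        "informationRatio": information_ratio,
        "beta": beta,
        "alphaAnn": alpha_ann,
    }

# Hypothetical six-day series, for illustration only.
strategy = [0.004, -0.002, 0.003, 0.001, -0.001, 0.005]
benchmark = [0.003, -0.002, 0.002, 0.001, -0.001, 0.004]
print(relative_metrics(strategy, benchmark))
```

Every one of the four outputs takes the benchmark as an input, which is the dependency the list describes.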

A defensible report names the benchmark, the start and end dates of the return series, and the rebalancing or weighting convention. If any of those three is missing, the four metrics are not reproducible.
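One way to make that requirement hard to skip is to refuse to build a report without the declaration. The record below is a hypothetical sketch: the class name BenchmarkDeclaration, its fields, and the example values are illustrative assumptions and not part of the engine's API.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class BenchmarkDeclaration:
    """Hypothetical record of what a report states about its benchmark."""
    name: str
    start: date
    end: date
    convention: str  # rebalancing or weighting convention, free text

    def __post_init__(self):
        # Reject declarations that leave any of the three items blank.
        if not self.name.strip():
            raise ValueError("benchmark name is required")
        if not self.convention.strip():
            raise ValueError("rebalancing or weighting convention is required")
        if self.end < self.start:
            raise ValueError("end date precedes start date")

# Placeholder values; the article does not give the real name or dates.
declaration = BenchmarkDeclaration(
    name="Sector index (placeholder)",
    start=date(2026, 1, 2),
    end=date(2026, 1, 30),
    convention="sector-weighted, rebalanced daily",
)
print(declaration)
```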

Why Sharpe is benchmark-invariant and IR is not

Sharpe = (mean − r_f) / σ. The risk-free rate r_f is a scalar; the strategy returns are a single series. The benchmark does not enter the formula. That is why the engine returns Sharpe 6.36, Sortino 12.49, and Calmar 99.22 unchanged across both benchmark runs above — they are functions of the strategy series alone.

Information ratio = mean(strategy − benchmark) / σ(strategy − benchmark). The benchmark enters both the numerator (mean of the spread) and the denominator (the spread's volatility). A "tighter" benchmark (one that tracks the strategy closely) keeps the spread small on both axes; a "looser" benchmark widens the spread.

For the two runs above, Benchmark A is closely correlated with the strategy and produces tracking error 0.0157, IR 0.354. Benchmark B is a near-flat low-volatility passive index and produces a much wider tracking error 0.0489 and a lower IR 0.276. The IR difference is not a property of the strategy; it is a property of the benchmark.
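The asymmetry is visible in the function signatures alone. In the sketch below (hypothetical helper names, made-up six-day series, and an assumed 252-day annualization with sample standard deviation), the Sharpe helper never receives a benchmark, while the information-ratio helper cannot be called without one.

```python
import math
from statistics import mean, stdev

def sharpe_ann(returns, rf_annual=0.0, periods=252):
    """Annualized Sharpe; the benchmark is not an argument."""
    rf_per_period = rf_annual / periods
    excess = [r - rf_per_period for r in returns]
    return mean(excess) / stdev(excess) * math.sqrt(periods)

def information_ratio_ann(returns, benchmark, periods=252):
    """Annualized IR; the benchmark enters numerator and denominator."""
    active = [r - b for r, b in zip(returns, benchmark)]
    return mean(active) / stdev(active) * math.sqrt(periods)

# Illustrative data only: one strategy, two candidate benchmarks.
strategy = [0.004, -0.002, 0.003, 0.001, -0.001, 0.005]
tight = [0.003, -0.002, 0.002, 0.001, -0.001, 0.004]      # tracks the strategy
flat = [0.0005, 0.0002, 0.0004, 0.0003, 0.0001, 0.0005]   # barely moves

sharpe = sharpe_ann(strategy, rf_annual=0.045)
ir_tight = information_ratio_ann(strategy, tight)
ir_flat = information_ratio_ann(strategy, flat)

print(f"Sharpe (either benchmark): {sharpe:.3f}")
print(f"IR vs tight benchmark:     {ir_tight:.3f}")
print(f"IR vs flat benchmark:      {ir_flat:.3f}")
```

The Sharpe line prints one value because there is only one value to print; the two IR lines differ although the strategy series is the same.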

The four-step benchmark audit

A defensible benchmark choice for an LLM-driven retail strategy passes four checks:

Step 1: declare the universe. The benchmark must be drawn from the same asset universe as the strategy. An equity long-short strategy on US large-caps cannot be benchmarked against a broad EM index. The universe-mismatch failure is the most common audit gap in retail backtest reports.

Step 2: declare the weighting. Market-cap, equal-weight, sector-weighted, and custom-tilted benchmarks produce different return paths. The benchmark's weighting convention has to match the strategy's implicit tilt, not its explicit tilt. A momentum strategy implicitly tilts toward high-momentum names; a sector-weighted benchmark that captures that tilt is more honest than a broad market-cap benchmark.

Step 3: declare the rebalancing. Daily-rebalanced benchmarks have different return characteristics from monthly-rebalanced ones. The engine accepts a raw return series; the rebalancing assumption is implicit in how the series was constructed. Document it.

Step 4: regress the strategy on the benchmark. Beta and alpha are the regression slope and intercept. Benchmark A produced beta 1.336 (the strategy moves ~1.34× as much as the benchmark per unit of benchmark return, on average) and alpha +0.0173 annualized — a defensible "the strategy delivers ~1.7% annualized excess return after accounting for the benchmark's 1.34 beta." Benchmark B, a near-flat index with almost no variance, produced beta 5.57 and alpha −0.451: the engine mechanically regresses the strategy on a benchmark that barely moves, the slope explodes, and the alpha intercept goes deeply negative. The beta 5.57 is the diagnostic flag that the benchmark is degenerate, not that the strategy changed.
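The mechanism behind the exploding slope can be reproduced with a toy construction. The sketch below assumes a dampened benchmark of the form damping × original + drift; that form, the helper name ols_beta_alpha, and all numbers are illustrative assumptions, since the article does not say how Benchmark B was built. Under this assumption beta scales by 1/damping, and the per-period alpha shifts by −beta × drift / damping, so even a small constant drift in a near-flat benchmark is multiplied by the inflated beta.

```python
from statistics import mean

def ols_beta_alpha(strategy, benchmark):
    """OLS slope (beta) and per-period intercept (alpha) of strategy on benchmark."""
    mean_s, mean_b = mean(strategy), mean(benchmark)
    cov_sb = sum((s - mean_s) * (b - mean_b) for s, b in zip(strategy, benchmark))
    var_b = sum((b - mean_b) ** 2 for b in benchmark)
    if var_b == 0:
        raise ValueError("benchmark has zero variance; beta is undefined")
    beta = cov_sb / var_b
    return beta, mean_s - beta * mean_b

# Illustrative series, not the engine's inputs.
benchmark = [0.002, -0.001, 0.003, 0.000, -0.002, 0.004]
strategy = [0.003, -0.001, 0.004, 0.001, -0.003, 0.005]

damping, drift = 0.25, 0.0005  # hypothetical dampening of the same benchmark
dampened = [damping * b + drift for b in benchmark]

beta_a, alpha_a = ols_beta_alpha(strategy, benchmark)
beta_b, alpha_b = ols_beta_alpha(strategy, dampened)

print(f"original benchmark: beta={beta_a:.3f}, alpha per period={alpha_a:+.6f}")
print(f"dampened benchmark: beta={beta_b:.3f}, alpha per period={alpha_b:+.6f}")
```

The strategy series is untouched between the two calls; only the benchmark's variance and level changed.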

Why the engine's annualized Sharpe is so high

The engine returns annualized Sharpe 6.36 on a 20-observation daily series. That is not a credible long-run Sharpe; it is the artefact of a 20-observation sample with positive mean and modest variance, annualized by √252. The engine reports the number faithfully, but the standard error of a Sharpe estimate from 20 observations is large: approximately √((1 + 0.5·SR²) / n), where SR is the per-period (here daily) Sharpe and n is the number of observations. On a per-period (non-annualized) basis the Sharpe is far smaller; the √252 scaling is what inflates the headline number.
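Plugging the reported figures into that approximation gives a sense of scale. The sketch below assumes the formula takes the per-period (daily) Sharpe, that returns are roughly independent and identically distributed, and that the standard error annualizes with the same √252 factor as the ratio itself; the function name is illustrative.

```python
import math

def sharpe_standard_error(sr_per_period, n_obs):
    """Approximate standard error of a per-period Sharpe estimate (i.i.d. assumption)."""
    return math.sqrt((1 + 0.5 * sr_per_period ** 2) / n_obs)

PERIODS = 252                       # assumed trading days per year
sharpe_annual = 6.364802768799225   # engine output for the worked example
n_obs = 20

sr_daily = sharpe_annual / math.sqrt(PERIODS)
se_daily = sharpe_standard_error(sr_daily, n_obs)
se_annual = se_daily * math.sqrt(PERIODS)

print(f"daily Sharpe:     {sr_daily:.3f}")
print(f"daily std error:  {se_daily:.3f}")
print(f"annual std error: {se_annual:.2f}")
```

Under these assumptions the daily Sharpe is about 0.40 with a standard error of about 0.23, which annualizes to roughly 3.7 against a point estimate of 6.36. The uncertainty is more than half the size of the headline number.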

A defensible report on a 20-observation series leads with the sample size and the standard error, not the annualized point estimate. The Deflated Sharpe Ratio engine handles the selection-bias correction; the small-sample standard-error correction is a separate adjustment that the risk-adjusted-returns engine does not apply directly.

Treynor and Jensen reconciliation

Treynor ratio = (mean strategy return − r_f) / beta. The engine surfaces the annualized return (0.496) and beta directly. For Benchmark A: Treynor ≈ (0.496 − 0.045) / 1.336 ≈ 0.338. For Benchmark B the same numerator divided by beta 5.57 collapses to ≈ 0.081 — the degenerate benchmark inflates beta and crushes the Treynor ratio. Jensen's alpha is the number the engine returns directly as alphaAnn: +0.0173 (Benchmark A) or −0.451 (Benchmark B).
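The Treynor arithmetic above can be checked against the full-precision engine outputs. This is a small sketch; the helper name treynor is illustrative and the inputs are copied from the Verified engine output section.

```python
def treynor(ann_return, rf_annual, beta):
    """Treynor ratio: annualized excess return per unit of beta."""
    return (ann_return - rf_annual) / beta

ann_return = 0.49612402967388114   # engine: ann return (same for both runs)
rf_annual = 0.045                  # engine input: risk_free_annual

treynor_a = treynor(ann_return, rf_annual, beta=1.3358633776091087)  # Benchmark A
treynor_b = treynor(ann_return, rf_annual, beta=5.56521739130435)    # Benchmark B

print(f"Treynor vs Benchmark A: {treynor_a:.3f}")
print(f"Treynor vs Benchmark B: {treynor_b:.3f}")
```

The numerator is identical in both calls, so the entire gap between the two ratios comes from beta.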

All three (IR, Treynor, Jensen) collapse to noise once the benchmark is wrong. The engine cannot tell you the benchmark is wrong; only the universe-and-weighting audit can. The relevance for an LLM-driven research report is that the LLM should be required to justify the benchmark choice before computing any relative metric. A bare "the strategy has IR 0.354" without benchmark justification is theatre.

What the engine cannot fix

The engine accepts two arrays of returns and computes the metrics. It does not:

Test whether the benchmark series is at the same frequency as the strategy series (daily vs weekly mismatches produce nonsense).
Test whether the benchmark return series is for the same trading days as the strategy (US-holiday vs EU-holiday mismatches drop or duplicate observations).
Compute the regression R² that would let the user assess whether the benchmark is even relevant.

A defensible audit pipeline pre-processes both series to identical date indices and computes R² before consuming the engine's metrics. The metrics that emerge are then trustworthy. Without the pre-processing the metrics are computable but uninterpretable — Benchmark B above is computable and entirely uninterpretable.
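A minimal version of that pre-processing step might look like the following. It is a sketch under stated assumptions: returns arrive as {ISO date string: return} mappings, the helper names align_on_dates and r_squared are hypothetical, and R² is computed as the squared sample correlation of the aligned series. Frequency mismatches still need a separate check; this only handles missing or extra dates.

```python
from statistics import mean

def align_on_dates(strategy_by_date, benchmark_by_date):
    """Keep only dates present in both mappings, in chronological order."""
    common = sorted(strategy_by_date.keys() & benchmark_by_date.keys())
    return (
        common,
        [strategy_by_date[d] for d in common],
        [benchmark_by_date[d] for d in common],
    )

def r_squared(y, x):
    """R² of a one-variable OLS regression of y on x (squared correlation)."""
    mean_y, mean_x = mean(y), mean(x)
    sxy = sum((a - mean_y) * (b - mean_x) for a, b in zip(y, x))
    sxx = sum((b - mean_x) ** 2 for b in x)
    syy = sum((a - mean_y) ** 2 for a in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined R²")
    return sxy ** 2 / (sxx * syy)

# Illustrative data: each series has one trading day the other lacks.
strategy_by_date = {
    "2026-01-02": 0.004, "2026-01-05": -0.002, "2026-01-06": 0.003,
    "2026-01-07": 0.001, "2026-01-08": -0.001,
}
benchmark_by_date = {
    "2026-01-02": 0.003, "2026-01-06": 0.002, "2026-01-07": 0.001,
    "2026-01-08": -0.001, "2026-01-09": 0.002,
}

dates, strategy, benchmark = align_on_dates(strategy_by_date, benchmark_by_date)
print(f"{len(dates)} common dates, R² = {r_squared(strategy, benchmark):.3f}")
```

Only the aligned lists would then be passed to the engine, with the R² figure reported next to the relative metrics.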

Connects to

The Sharpe Ratio Trap: why raw Sharpe is gameable and which complements to report.
Deflated Sharpe Derivation Worked Example: selection-bias adjustment for the Sharpe number.
How to Read a Backtest Report 2026: the broader template for evaluating risk reports.
Risk-Adjusted Returns: engine endpoint.
Correlation Matrix Visualizer: companion for benchmark-correlation diagnostics.
Efficient Frontier Builder: portfolio-level relative-metric context.

References

Sharpe, W. F. (1994). "The Sharpe Ratio." Journal of Portfolio Management 21(1), 49–58. The 1994 generalisation of the ratio for benchmark-relative reporting.
Treynor, J. L., & Black, F. (1973). "How to Use Security Analysis to Improve Portfolio Selection." Journal of Business 46(1), 66–86. The information ratio's theoretical foundation.
Jensen, M. C. (1968). "The Performance of Mutual Funds in the Period 1945–1964." Journal of Finance 23(2), 389–416. Original Jensen alpha derivation.
Sortino, F. A., & Price, L. N. (1994). "Performance Measurement in a Downside Risk Framework." Journal of Investing 3(3), 59–64.
BIS Working Paper 22 (2002). "Comparing Trading Strategies." Bank for International Settlements. Cross-strategy benchmark conventions.

Verified engine output


Benchmark A — strategy vs tightly-tracking sector index

Inputs
risk_free_annual	0.045
returns (20 items)	[...]
benchmark (20 items)	[...]

Result
count	20
mean daily	0.0014214285714285713
stdev daily	0.0035451968989932813
downside stdev	0.0018068569790819482
skewness	-0.26500572865331457
excess kurtosis	-1.2089777615043384
sharpe ann	6.364802768799225
sortino ann	12.488248544229396
omega	2.557729941291585
max drawdown	0.004999999999999893
calmar	99.22480593477835
ann return	0.49612402967388114
ann vol	0.056278256060961575
tracking error	0.015685393809126008
information ratio	0.3542198443244075
beta	1.3358633776091087
alpha ann	0.01732587720197909

Computed live at build time.

Benchmark B — same strategy vs dampened broad index

Inputs
risk_free_annual	0.045
returns (20 items)	[...]
benchmark (20 items)	[...]

Result
count	20
mean daily	0.0014214285714285713
stdev daily	0.0035451968989932813
downside stdev	0.0018068569790819482
skewness	-0.26500572865331457
excess kurtosis	-1.2089777615043384
sharpe ann	6.364802768799225
sortino ann	12.488248544229396
omega	2.557729941291585
max drawdown	0.004999999999999893
calmar	99.22480593477835
ann return	0.49612402967388114
ann vol	0.056278256060961575
tracking error	0.04893529562488345
information ratio	0.27573822767646866
beta	5.56521739130435
alpha ann	-0.45108485658852715

Computed live at build time.

Frequently asked questions

Why does the same strategy report alpha = +1.7% against one benchmark and −45.1% against another?

Alpha is the regression intercept after accounting for the benchmark's contribution. The tightly-tracking sector benchmark leaves a small positive alpha; the near-flat broad benchmark forces an enormous beta and a deeply negative intercept. Neither is 'the real alpha'; both are conditional, and the second benchmark is degenerate.

Is the information ratio comparable across strategies?

Only when the benchmark is identical. Best practice is to fix a single benchmark for the strategy family and quote IRs against it.

What R² is good enough for the regression-based metrics?

Between 0.3 and 0.7. Below 0.2 the benchmark is irrelevant; above 0.8 the benchmark is essentially the strategy and there is no alpha to find.

How does the engine handle a benchmark with almost no variance?

Beta explodes (5.57 in the worked example) because the regression divides by a near-zero benchmark variance, and the alpha intercept goes deeply negative. Report the benchmark as degenerate.

Should I report Treynor alongside Sharpe and Sortino?

Only when regression R² is above 0.5. For an LLM-driven retail strategy without a clean CAPM exposure, Treynor is misleading.