TL;DR
The Sharpe ratio is the most over-used single number in quantitative finance. A standalone Sharpe tells you very little because it silently assumes Gaussian returns, treats upside volatility as equivalent to downside volatility, and is trivially inflated by selection bias. Short-volatility strategies routinely show Sharpe 3 until one tail event wipes out a decade of returns. The fix is not to abandon Sharpe but to always report it alongside four companions: Sortino (downside-only denominator), Calmar (return divided by max drawdown), tail ratio (95th-percentile gain divided by 5th-percentile loss), and Deflated Sharpe (Bailey & Lopez de Prado 2014, which corrects for multiple testing and non-normal moments). Below: how each of Sharpe's assumptions breaks, when each companion catches what Sharpe misses, and a single Python function that computes all five from a returns series.
What Sharpe actually measures
Sharpe = (mean(R) − R_f) / stdev(R)
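The ratio is usually annualised by multiplying by √freq, where freq is the number of return periods per year (√252 for daily returns, √12 for monthly). R_f is the risk-free rate over the same period as R, which most retail analyses set to zero without comment.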
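As a minimal sketch of the formula and the annualisation step, the snippet below computes the ratio from per-period simple returns. The function name `annualised_sharpe`, its parameter names, and the sample returns are illustrative assumptions, not part of any library.

```python
import numpy as np


def annualised_sharpe(returns, rf_per_period=0.0, periods_per_year=252):
    """Annualised Sharpe ratio from per-period simple returns."""
    r = np.asarray(returns, dtype=float)
    excess = r - rf_per_period
    sd = excess.std(ddof=0)  # population standard deviation
    if sd == 0:
        return float("nan")
    # Per-period ratio scaled by the square root of periods per year.
    return float(np.sqrt(periods_per_year) * excess.mean() / sd)


# Hypothetical returns, for illustration only.
daily = np.array([0.004, -0.002, 0.003, 0.001, -0.001, 0.002])
print(annualised_sharpe(daily))                       # read as daily data
print(annualised_sharpe(daily, periods_per_year=12))  # same numbers read as monthly
```

The same numbers give a Sharpe that is √21 times larger when read as daily rather than monthly data, which is why the sampling frequency should always be stated next to the ratio.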
Three silent assumptions are baked in:
- Returns are Gaussian. The standard deviation fully characterises the distribution's spread. Skew and kurtosis are irrelevant.
- Upside and downside risk are symmetric. A volatile winner is penalised identically to a volatile loser.
- Moments are stationary. The mean and variance in the sample are good estimates of population mean and variance.
All three are violated by real return streams. The violations compound.
The three ways Sharpe lies
Lie 1: Non-Gaussian returns
Equity returns have fat tails. Option-selling strategies have pathologically fat tails. A short-volatility strategy collecting about 0.1% per day from selling SPX puts has daily returns that look like +0.1%, +0.1%, +0.1%, +0.1%, −15%. Computed on a sample that happens to contain no crash, the mean and standard deviation can show Sharpe 3. Once the crash is included, skew is deeply negative and kurtosis is elevated, so the Gaussian summary is meaningless.
The XIV ETN is the textbook case. It ran with reported Sharpe above 2 for years until February 5, 2018, when it lost about 96% of its value in a single session. The pre-event Sharpe was arithmetically correct. It just did not describe the distribution it was computed on.
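The effect can be reproduced with a small simulation. This is a sketch under stated assumptions: the return stream is synthetic (Gaussian noise around +0.1% per day, then one −15% day), and the helper name `ann_sharpe` is hypothetical.

```python
import numpy as np
from scipy.stats import kurtosis, skew


def ann_sharpe(r, periods_per_year=252):
    """Annualised Sharpe with a zero risk-free rate."""
    return float(np.sqrt(periods_per_year) * r.mean() / r.std(ddof=0))


rng = np.random.default_rng(0)
# Synthetic short-vol stream: about +0.1% per day with modest noise ...
calm = rng.normal(loc=0.001, scale=0.005, size=999)
# ... followed by a single -15% day.
with_crash = np.append(calm, -0.15)

for label, r in [("calm sample", calm), ("with crash", with_crash)]:
    # scipy's kurtosis() returns excess kurtosis by default (Gaussian = 0).
    print(f"{label:12s} Sharpe={ann_sharpe(r):5.2f} "
          f"skew={skew(r):7.2f} excess kurtosis={kurtosis(r):8.2f}")
```

One observation out of a thousand is enough to push skew strongly negative and excess kurtosis far above zero, while the Sharpe computed on the calm sample gives no hint of it.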
Lie 2: Symmetric volatility penalty
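Compare a strategy that alternates +30% and −5% months with one that alternates +10% and −5%. Both have the same downside, yet Sharpe charges the first for its larger upside swings: its mean return is five times higher (12.5% versus 2.5% per month) but its Sharpe is only about twice as high. Rational investors do not mind upside surprises; they mind downside surprises. Sharpe conflates the two.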
Lie 3: Gameability via selection bias
If 1,000 strategies are tested and the best-Sharpe one is published, that Sharpe is not an unbiased estimate: it is the maximum of 1,000 noisy draws. Even if every strategy has true Sharpe zero, the best observed Sharpe can easily exceed 2 over a few years of daily data. Bailey & Lopez de Prado (2014) quantify this as the expected maximum Sharpe under the null; see Did You Overfit? Running PBO and Deflated Sharpe on Your Backtest.
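A quick Monte Carlo makes the size of the inflation concrete. The sketch assumes the strategies are independent with i.i.d. Gaussian daily returns and a true Sharpe of exactly zero; real strategy variants are usually correlated, which reduces the effective number of trials. The function name and defaults are illustrative.

```python
import numpy as np


def best_sharpe_under_null(n_strategies=1000, n_days=756,
                           periods_per_year=252, seed=0):
    """Best and average annualised Sharpe among zero-edge strategies."""
    rng = np.random.default_rng(seed)
    # One row per strategy: zero-mean Gaussian daily returns (three years).
    r = rng.normal(loc=0.0, scale=0.01, size=(n_strategies, n_days))
    sharpes = (np.sqrt(periods_per_year)
               * r.mean(axis=1) / r.std(axis=1, ddof=0))
    return float(sharpes.max()), float(sharpes.mean())


best, average = best_sharpe_under_null()
print(f"best of 1000: {best:.2f}, average: {average:.2f}")
```

The average Sharpe across the batch sits near zero, as it should, while the best one looks like a publishable result.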
The four companions
Sortino ratio
Replace the denominator with downside deviation only:
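Sortino = (mean(R) − MAR) / √mean(min(R − MAR, 0)²)
The denominator is the root-mean-square shortfall below MAR, taken over all periods with gains counted as zero. Taking the standard deviation of only the losing periods instead would measure dispersion around the average loss rather than around MAR. MAR (minimum acceptable return) is usually zero for retail work. Sortino penalises only negative surprises. A strategy with big positive tails and small negative tails, which is exactly what investors want, looks better on Sortino than on Sharpe. Option-buying strategies (long-tail, lottery-ticket shape) are the clearest case.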
What Sortino catches that Sharpe misses: positive-skew strategies that Sharpe unfairly downgrades.
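The difference shows up on two hypothetical monthly streams that alternate +10%/−5% and +30%/−5%. The sketch assumes downside deviation is the root-mean-square shortfall below MAR (target semideviation) and a zero risk-free rate; the function name is illustrative and nothing is annualised.

```python
import numpy as np


def sharpe_and_sortino(returns, mar=0.0):
    """Per-period Sharpe (rf = 0) and Sortino, without annualisation."""
    r = np.asarray(returns, dtype=float)
    sharpe = r.mean() / r.std(ddof=0)
    # Shortfall below MAR; periods at or above MAR contribute zero.
    shortfall = np.minimum(r - mar, 0.0)
    downside_dev = np.sqrt(np.mean(shortfall ** 2))
    if downside_dev == 0:
        return float(sharpe), float("nan")
    sortino = (r.mean() - mar) / downside_dev
    return float(sharpe), float(sortino)


# Same losses, different gains (hypothetical data).
modest = np.array([0.10, -0.05] * 6)
explosive = np.array([0.30, -0.05] * 6)

for name, r in [("modest", modest), ("explosive", explosive)]:
    s, so = sharpe_and_sortino(r)
    print(f"{name:10s} Sharpe={s:.2f} Sortino={so:.2f}")
```

Sharpe rises from about 0.33 to 0.71, roughly doubling, while Sortino rises from about 0.71 to 3.54, a factor of five that matches the fivefold increase in mean return at unchanged downside.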
Calmar ratio
Calmar = annualised_return / |max_drawdown|
Max drawdown is the worst peak-to-trough decline in the equity curve. Calmar asks: how much annual return did the strategy earn per unit of maximum pain experienced?
Calmar is particularly useful for trend-following and momentum strategies, which typically have modest Sharpe (0.7–1.2) but long stretches of steady returns punctuated by sharp drawdowns. Assuming an annualised return of about 11%, a strategy with Sharpe 1.0 and max DD 15% (Calmar ≈ 0.7) is qualitatively different from one with Sharpe 1.0 and max DD 45% (Calmar ≈ 0.25), even though Sharpe is identical.
What Calmar catches that Sharpe misses: concentrated tail losses that stdev averages away.
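A minimal sketch of the two ingredients, maximum drawdown and geometric annualised return, is shown here. The function names and the four quarterly returns are hypothetical; the equity curve is started at 1.0 so that a loss in the first period counts as drawdown, which is a design choice of this sketch.

```python
import numpy as np


def max_drawdown(returns):
    """Worst peak-to-trough decline of the compounded equity curve (<= 0)."""
    r = np.asarray(returns, dtype=float)
    equity = np.concatenate(([1.0], np.cumprod(1.0 + r)))
    peak = np.maximum.accumulate(equity)  # running high-water mark
    return float((equity / peak - 1.0).min())


def calmar(returns, periods_per_year=252):
    """Geometric annualised return divided by |max drawdown|."""
    r = np.asarray(returns, dtype=float)
    years = len(r) / periods_per_year
    ann_ret = np.prod(1.0 + r) ** (1.0 / years) - 1.0
    mdd = max_drawdown(r)
    return float(ann_ret / abs(mdd)) if mdd < 0 else float("nan")


# One year of hypothetical quarterly returns.
quarterly = [0.10, -0.20, 0.15, 0.12]
print(max_drawdown(quarterly))
print(calmar(quarterly, periods_per_year=4))
```

For this series the equity curve ends the year up about 13.3% after a 20% drawdown in the second quarter, giving a Calmar of roughly 0.67.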
Tail ratio
tail_ratio = |95th percentile of R| / |5th percentile of R|
A ratio above 1.0 means upside tails are larger than downside tails; below 1.0 means the opposite. Short-vol strategies typically have tail ratios well below 1, around 0.3 (big left tail, small right tail). Long-vol and trend-following strategies typically have tail ratios above 1.5.
Tail ratio is the simplest non-Gaussian diagnostic: no distributional assumptions, no multi-test correction, just the raw distribution shape.
What tail ratio catches that Sharpe misses: directional asymmetry in the return distribution.
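The calculation is two percentiles and a division. In the sketch below both return streams are synthetic and the function name is illustrative; the left-skewed stream assumes losses occur on about 8% of days, often enough to reach the 5th percentile.

```python
import numpy as np


def tail_ratio(returns, upper=95, lower=5):
    """|upper percentile| / |lower percentile| of the return distribution."""
    r = np.asarray(returns, dtype=float)
    hi, lo = np.percentile(r, [upper, lower])
    return float(abs(hi) / abs(lo)) if lo != 0 else float("nan")


rng = np.random.default_rng(0)
# Symmetric benchmark: Gaussian daily returns.
symmetric = rng.normal(0.0, 0.01, 5000)
# Left-skewed stream: small steady gains plus occasional large losses.
crash = rng.random(5000) < 0.08
left_skewed = np.where(crash, -0.03, 0.002) + rng.normal(0.0, 0.001, 5000)

print(f"symmetric:   {tail_ratio(symmetric):.2f}")
print(f"left-skewed: {tail_ratio(left_skewed):.2f}")
```

The symmetric stream lands near 1 and the left-skewed one far below it. If large losses are rarer than about one day in twenty, they fall outside the 5th percentile and the ratio will not see them, so it is worth reading alongside skew and max drawdown.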
Deflated Sharpe Ratio
Bailey & Lopez de Prado (2014) wrote the definitive fix for selection-bias inflation:
DSR = Φ( (SR − E[max SR*]) · √(T − 1) / √(1 − γ₃·SR + ((γ₄ − 1)/4)·SR²) )
Here SR is the per-period (non-annualised) Sharpe and T is the number of return observations. Three inputs are needed beyond the observed SR: skewness γ₃, kurtosis γ₄ (raw kurtosis, which is 3 for Gaussian returns, so add 3 to an excess-kurtosis figure), and the number of independent trials N that went into the expected-maximum term E[max SR*]. The output is a probability: given the number of strategies tested and the non-Gaussian shape of this one's returns, what is the probability the true Sharpe is positive?
A DSR above 0.95 is the conventional threshold for treating an edge as statistically significant. A DSR below 0.5 is more likely noise than edge, even when raw Sharpe looks impressive. Did You Overfit? Running PBO and Deflated Sharpe on Your Backtest has a runnable DSR in 30 lines.
What DSR catches that Sharpe misses: multi-testing inflation and non-normality in one statistic.
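For reference, the expected-maximum term can be written out explicitly. The approximation below follows Bailey & Lopez de Prado (2014); the notation is chosen here for illustration, with γ (no subscript) the Euler-Mascheroni constant, about 0.5772, Φ⁻¹ the inverse standard normal CDF, and V[{SR_n}] the variance of the per-period Sharpe estimates across the N trials.

```latex
\mathbb{E}\!\left[\max_{n} \widehat{SR}_n\right]
\approx
\sqrt{\mathbb{V}\!\left[\{\widehat{SR}_n\}\right]}
\left(
  (1-\gamma)\,\Phi^{-1}\!\left[1-\frac{1}{N}\right]
  + \gamma\,\Phi^{-1}\!\left[1-\frac{1}{N e}\right]
\right)
```

The bracketed factor grows slowly with N (about 2.8 for N = 200), so the threshold a strategy must clear depends mostly on how dispersed the trial Sharpes are, which in turn shrinks as the sample length T grows.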
One function, all five
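import numpy as np
import pandas as pd
from scipy.stats import norm


def risk_adjusted_suite(
    returns: pd.Series,
    rf_daily: float = 0.0,
    mar_daily: float = 0.0,
    freq: int = 252,
    n_trials: int = 1,
) -> dict:
    r = returns.dropna().astype(float).values
    T = len(r)
    if T < 20:
        raise ValueError("need at least 20 observations")

    # Sharpe: per-period ratio first (DSR needs it), annualised for reporting.
    mu, sd = r.mean(), r.std(ddof=0)
    sr = (mu - rf_daily) / sd if sd > 0 else np.nan
    sharpe = np.sqrt(freq) * sr

    # Sortino: downside deviation is the root-mean-square shortfall below MAR
    # over all periods, with gains counted as zero shortfall.
    dd_std = np.sqrt(np.mean(np.minimum(r - mar_daily, 0.0) ** 2))
    sortino = np.sqrt(freq) * (mu - mar_daily) / dd_std if dd_std > 0 else np.nan

    # Calmar: geometric annualised return over the worst peak-to-trough decline.
    # Equity starts at 1.0 so a loss in the first period counts as drawdown.
    equity = pd.Series(np.concatenate(([1.0], np.cumprod(1 + r))))
    peak = equity.cummax()
    max_dd = float(((equity - peak) / peak).min())
    ann_ret = float(equity.iloc[-1] ** (freq / T) - 1)
    calmar = ann_ret / abs(max_dd) if max_dd < 0 else np.nan

    # Tail ratio: size of the 95th-percentile return relative to the 5th.
    p95, p5 = np.percentile(r, 95), np.percentile(r, 5)
    tail_ratio = abs(p95) / abs(p5) if p5 != 0 else np.nan

    # Deflated Sharpe, computed on the per-period Sharpe `sr`.
    g3 = float(pd.Series(r).skew())
    g4 = float(pd.Series(r).kurtosis())  # excess kurtosis (Gaussian = 0)
    gamma = 0.5772156649  # Euler-Mascheroni constant
    if n_trials > 1:
        # Expected maximum of n_trials standard normal draws, scaled by the
        # dispersion of Sharpe estimates across trials. Assumption: that
        # dispersion is approximated by the null sampling error 1/sqrt(T - 1);
        # substitute the measured cross-trial standard deviation if available.
        z_max = ((1 - gamma) * norm.ppf(1 - 1 / n_trials)
                 + gamma * norm.ppf(1 - 1 / (n_trials * np.e)))
        emax = z_max / np.sqrt(T - 1)
    else:
        emax = 0.0
    # The formula's (γ₄ − 1)/4 term takes raw kurtosis; with excess kurtosis
    # it becomes (g4 + 2)/4.
    var_term = 1 - g3 * sr + ((g4 + 2) / 4) * sr ** 2
    dsr = (float(norm.cdf((sr - emax) * np.sqrt(T - 1) / np.sqrt(var_term)))
           if var_term > 0 else np.nan)

    return {
        "sharpe": float(sharpe),
        "sortino": float(sortino),
        "calmar": float(calmar),
        "tail_ratio": float(tail_ratio),
        "deflated_sharpe": dsr,
        "max_drawdown": max_dd,
        "skew": g3,
        "excess_kurtosis": g4,
    }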
Usage:
>>> risk_adjusted_suite(daily_returns, n_trials=200)
{
"sharpe": 2.31,
"sortino": 3.05,
"calmar": 0.41, # red flag: big DD despite high Sharpe
"tail_ratio": 0.42, # red flag: left tail dominates
"deflated_sharpe": 0.38, # red flag: likely selection bias
"max_drawdown": -0.58,
"skew": -2.1,
"excess_kurtosis": 11.4,
}
That pattern — Sharpe 2.3 but Calmar 0.4, tail ratio 0.4, DSR 0.4, skew −2, kurtosis 11 — is the textbook short-volatility signature. Sharpe alone would have blessed the strategy.
When each metric bites
| Strategy type | Sharpe | Sortino | Calmar | Tail ratio | DSR |
|---|---|---|---|---|---|
| Short volatility | High | High | Low | Low | Low if many trials |
| Trend following | Medium | Medium | Medium | High | High if few trials |
| Long-only equity | Medium | Medium | Medium | Near 1 | High if few trials |
| Mean reversion, overfit | High | High | Medium | Near 1 | Low |
| Option buying | Low | Medium | Low | High | Medium |
Cells that disagree with the Sharpe column are where that metric surfaces something Sharpe alone would hide.
Practical reporting rule
Every backtest report should include Sharpe, Sortino, Calmar, tail ratio, max drawdown, skew, excess kurtosis, and Deflated Sharpe with n_trials set to the actual number of strategy variants tested. Any report that shows Sharpe alone is either incomplete or hiding something.
Connects to
- Risk-Adjusted Returns Calculator — browser tool that runs this suite on an uploaded returns CSV.
- Returns Distribution Analyzer — visualises skew, kurtosis, and QQ-plots against Gaussian and Student-t benchmarks.
- Backtest Overfitting Score — CSCV and PBO to complement DSR.
- Did You Overfit? Running PBO and Deflated Sharpe on Your Backtest — DSR derivation and implementation.
- How to Read a Backtest Report — Sharpe-plus-four is part of the five-question checklist.
- Walk-Forward Validation: A Cookbook for Honest Backtests — walk-forward generates the OOS returns series on which the suite should be computed.
References
- Sharpe, W. F. (1966). "Mutual Fund Performance." Journal of Business 39(1), pp. 119–138. The original ratio, then called the reward-to-variability ratio.
- Sortino, F. A., & Price, L. N. (1994). "Performance Measurement in a Downside Risk Framework." Journal of Investing 3(3).
- Young, T. W. (1991). "Calmar Ratio: A Smoother Tool." Futures magazine. Coined "Calmar" as California Managed Accounts Reports.
- Bailey, D. H., Borwein, J., Lopez de Prado, M., & Zhu, Q. J. (2014). "Pseudo-Mathematics and Financial Charlatanism." Notices of the AMS 61(5).
- Harvey, C. R., & Liu, Y. (2015). "Backtesting." Journal of Portfolio Management 42(1). Frames multiple testing in the cross-section of published anomalies.