Cohen's Kappa vs ICC for Agreement
When you validate an LLM judge against human annotators, or compare several graders, you need a number for how much they agree beyond chance. Raw percent agreement is misleading because two raters can agree often simply because both favor a common label. Cohen's kappa corrects for that on categorical data. The intraclass correlation coefficient (ICC) does the analogous job for continuous scores: it partitions variance to ask how much of the total variation comes from real differences between subjects rather than from raters and noise. The right choice hinges on whether your ratings are categories or numbers, and on how many raters you have. This matrix compares them for grading and annotation tasks.
Cohen's Kappa
A chance-corrected agreement statistic for two raters assigning categorical labels. Ranges from below zero up to one, where one is perfect agreement, zero is chance-level agreement, and negative values mean agreement worse than chance.
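For reference, the statistic is usually written in terms of observed agreement $p_o$ and chance-expected agreement $p_e$. The symbols below are notation introduced for this sketch, not taken from the text above:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_e = \sum_{c} p_{A,c}\, p_{B,c}
```

Here $p_{A,c}$ and $p_{B,c}$ are the fractions of items that raters A and B assign to category $c$. When $p_o = p_e$ the numerator vanishes and kappa is zero; when $p_o = 1$ kappa is one.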
Pros
- Corrects for chance agreement, unlike raw percent agreement which inflates on skewed categories
- The standard, well-understood metric for categorical inter-rater reliability
- Interpretable against established, if rough, benchmark ranges for agreement strength
- Extends to weighted kappa for ordinal categories where some disagreements are worse than others
Cons
- Basic Cohen's kappa handles only two raters; more need Fleiss' kappa
- Suffers from the kappa paradox: high raw agreement can yield low kappa when one category dominates
- For categorical data only, so it cannot score agreement on a continuous scale
- Sensitive to the marginal distribution of categories, complicating cross-study comparison
Best for: two raters on categorical labels, classification-style grading, and chance-corrected agreement on discrete decisions
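The kappa paradox from the cons list is easiest to see with numbers. The following is a minimal sketch, assuming two raters who labelled the same 100 items; the function name `cohen_kappa` and the toy label lists are made up for illustration. Both datasets have 90% raw agreement, but one is heavily skewed toward "pass".

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled independently at their own marginal rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Skewed case: 90 joint "pass", 10 disagreements, no joint "fail".
skewed_a = ["pass"] * 90 + ["pass"] * 5 + ["fail"] * 5
skewed_b = ["pass"] * 90 + ["fail"] * 5 + ["pass"] * 5

# Balanced case: 45 joint "pass", 45 joint "fail", 10 disagreements.
balanced_a = ["pass"] * 45 + ["fail"] * 45 + ["pass"] * 5 + ["fail"] * 5
balanced_b = ["pass"] * 45 + ["fail"] * 45 + ["fail"] * 5 + ["pass"] * 5

print(f"skewed:   {cohen_kappa(skewed_a, skewed_b):.3f}")      # -0.053
print(f"balanced: {cohen_kappa(balanced_a, balanced_b):.3f}")  # 0.800
```

In the skewed case chance agreement is already 0.905, so 90% observed agreement lands slightly below chance. This is why reporting the marginal distribution alongside kappa matters.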
Intraclass Correlation (ICC)
An agreement statistic for continuous or ordinal ratings that partitions variance to measure how consistently raters score the same subjects. Handles any number of raters.
Pros
- Designed for continuous and ordinal scores, where kappa does not apply
- Naturally accommodates any number of raters, not just two
- Multiple forms distinguish absolute agreement from consistency, and single from average raters
- Grounded in variance components, giving a principled decomposition of disagreement sources
Cons
- Requires choosing the correct ICC form, and the wrong choice changes the value meaningfully
- Treats ratings as interval-scale numbers, so applying it to true (nominal) categories is inappropriate
- Sensitive to the range of subjects: a narrow range of true values depresses the ICC
- Less intuitive to interpret than kappa for those used to categorical agreement
Best for: continuous or ordinal scores, more than two raters, and distinguishing absolute agreement from mere consistency
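The difference between absolute agreement and consistency can be sketched with a two-way ANOVA decomposition. This is an illustrative implementation of the single-rater forms commonly labelled ICC(2,1) (absolute agreement) and ICC(3,1) (consistency) in the Shrout and Fleiss naming; the function name `icc_two_way` and the toy scores are assumptions, and a vetted statistics package is the safer choice for real analyses.

```python
def icc_two_way(scores):
    """Single-rater ICCs from a table with one row per subject, one column per rater.

    Returns (agreement, consistency), i.e. ICC(2,1) and ICC(3,1).
    """
    n, k = len(scores), len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with >= 2 subjects and >= 2 raters")
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Sums of squares for subjects (rows), raters (columns), and the residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    # Consistency ignores systematic rater offsets; agreement penalises them.
    consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    return agreement, consistency


# Toy data: the judge always scores exactly one point above the human.
human = [1, 2, 3, 4, 5, 2, 4]
judge = [h + 1 for h in human]
agreement, consistency = icc_two_way([list(pair) for pair in zip(human, judge)])
print(f"agreement:   {agreement:.3f}")    # 0.800
print(f"consistency: {consistency:.3f}")  # 1.000
```

A judge with a constant offset ranks subjects identically to the human, so consistency is perfect, while absolute agreement drops because the scores themselves differ.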
Decision Table
See the tradeoffs side by side
| Criterion | Cohen's Kappa | Intraclass Correlation (ICC) |
|---|---|---|
| Rating type | Categorical | Continuous or ordinal |
| Number of raters | Two (Fleiss for more) | Any number |
| Chance correction | Yes | Via variance partition |
| Forms to choose | Unweighted or weighted | Several, by design and agreement type |
| Known pitfall | Kappa paradox on skewed categories | Wrong form, narrow subject range |
| LLM-judge output it suits | Discrete labels | Numeric scores |
Verdict
Let the rating scale decide. If your graders assign discrete labels (pass or fail, correct or incorrect, a category), use Cohen's kappa for two raters and Fleiss' kappa for more, because they correct for the chance agreement that raw percent agreement hides. If your graders assign scores on a numeric scale (a one-to-five quality rating, a confidence number), use the ICC, choosing the form that matches your design and whether you care about absolute agreement or mere consistency. Do not apply kappa to numbers by binning them, which throws away information, and do not apply the ICC to true categories, which assumes a scale that is not there. Watch each metric's trap: kappa can read paradoxically low when one category dominates even though the raters mostly agree, and the ICC can be depressed when the subjects span a narrow range of true values. For validating an LLM judge against humans, pick the metric that matches the judge's output type and report it with the rater count and, for kappa, the marginal distribution and, for the ICC, the form you chose, so others can interpret it.
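For the more-than-two-raters case, Fleiss' kappa follows the same observed-versus-expected pattern. The sketch below assumes every item was labelled by the same number of raters; the function name `fleiss_kappa` and the toy ratings are illustrative only.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings` holds one list of labels per item,
    each with the same number of raters (at least two)."""
    if not ratings:
        raise ValueError("need at least one item")
    m = len(ratings[0])
    if m < 2 or any(len(item) != m for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_items = len(ratings)
    totals = Counter()  # how often each category was used overall
    p_bar = 0.0         # running sum of per-item agreement
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Fraction of rater pairs that agree on this item.
        p_bar += (sum(c * c for c in counts.values()) - m) / (m * (m - 1))
    p_bar /= n_items
    # Chance agreement from the pooled category proportions.
    p_e = sum((t / (n_items * m)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return float("nan")  # only one category was ever used
    return (p_bar - p_e) / (1 - p_e)


# Three graders (for example two humans and one LLM judge) on four items.
ratings = [
    ["pass", "pass", "pass"],
    ["fail", "fail", "fail"],
    ["pass", "pass", "fail"],
    ["fail", "fail", "pass"],
]
print(f"{fleiss_kappa(ratings):.3f}")  # 0.333
```

Note that Fleiss' kappa pools category proportions across raters, so with exactly two raters it will generally not reproduce Cohen's kappa.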
Sources & References
- A Coefficient of Agreement for Nominal Scales — Jacob Cohen, Educational and Psychological Measurement (1960)
- A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research — Koo and Li, Journal of Chiropractic Medicine (2016)