Cohen's Kappa vs ICC for Agreement

When you validate an LLM judge against human annotators, or compare several graders, you need a number for how much they agree beyond chance. Raw percent agreement is misleading because two raters can agree often simply because both favor a common label. Cohen's kappa corrects for that on categorical data. The intraclass correlation coefficient (ICC) does the analogous job for continuous scores: it partitions variance to ask how much of the total variation comes from real differences between subjects rather than from raters and noise. The right choice hinges on whether your ratings are categories or numbers, and on how many raters you have. This matrix compares the two for grading and annotation tasks.

By AI Fin Hub Research · AI Fin Hub Team

Cohen's Kappa Option

A chance-corrected agreement statistic for two raters assigning categorical labels. It ranges from -1 to 1, where 1 is perfect agreement, 0 is chance-level agreement, and negative values mean the raters agree less often than chance would predict.

Pros

  • Corrects for chance agreement, unlike raw percent agreement which inflates on skewed categories
  • The standard, well-understood metric for categorical inter-rater reliability
  • Interpretable against established, if rough, benchmark ranges for agreement strength
  • Extends to weighted kappa for ordinal categories where some disagreements are worse than others

Cons

  • Basic Cohen's kappa handles only two raters; more need Fleiss' kappa
  • Suffers the kappa paradox: high agreement can yield low kappa when one category dominates
  • For categorical data only, so it cannot score agreement on a continuous scale
  • Sensitive to the marginal distribution of categories, complicating cross-study comparison

Best for: two raters on categorical labels, classification-style grading, and chance-corrected agreement on discrete decisions
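The definition above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `cohens_kappa`, the `weighted` flag, and the sample labels are all made up for the example. It computes kappa as one minus the ratio of observed to chance-expected disagreement, which is algebraically the same as the usual (p_o - p_e) / (1 - p_e) form, and the linear-weighted variant assumes the labels are ordered integers.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b, weighted=False):
    """Cohen's kappa for two raters (hypothetical helper for illustration).

    weighted=False: any disagreement counts fully (nominal labels).
    weighted=True:  linear weights; assumes labels are ordered integers,
                    so a 1-vs-3 disagreement costs twice a 1-vs-2 one.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)

    def penalty(x, y):
        if weighted:
            return abs(x - y)
        return 0.0 if x == y else 1.0

    # Observed disagreement: average penalty over the items actually rated.
    observed = sum(penalty(a, b) for a, b in zip(labels_a, labels_b)) / n

    # Expected disagreement if the raters were independent, using each
    # rater's own marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        penalty(c, d) * (count_a[c] / n) * (count_b[d] / n)
        for c in count_a
        for d in count_b
    )
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return 1.0 - observed / expected


# Made-up pass/fail labels from a human annotator and an LLM judge.
human = ["P", "P", "F", "P", "F", "F", "P", "P", "F", "P"]
judge = ["P", "P", "F", "F", "F", "P", "P", "P", "F", "P"]
print(round(cohens_kappa(human, judge), 3))  # 80% raw agreement, kappa about 0.583

# Made-up ordinal scores on a 1-to-3 scale.
human_scores = [1, 2, 3, 3, 2, 1]
judge_scores = [1, 3, 3, 2, 2, 1]
print(round(cohens_kappa(human_scores, judge_scores), 3))                 # 0.5
print(round(cohens_kappa(human_scores, judge_scores, weighted=True), 3))  # 0.625
```

In the ordinal example the weighted value is higher than the unweighted one because both disagreements are only one step apart on the scale.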

Intraclass Correlation (ICC) Option

An agreement statistic for continuous or ordinal ratings that partitions variance to measure how consistently raters score the same subjects. Handles any number of raters.

Pros

  • Designed for continuous scores, and for ordinal scores that can reasonably be treated as numeric, where unweighted kappa ignores how far apart two ratings are
  • Naturally accommodates any number of raters, not just two
  • Multiple forms distinguish absolute agreement from consistency, and single from average raters
  • Grounded in variance components, giving a principled decomposition of disagreement sources

Cons

  • Requires choosing the correct ICC form, and the wrong choice changes the value meaningfully
  • Assumes interval-scale ratings, so applying it to true categories is inappropriate
  • Sensitive to the range of subjects: a narrow range of true values depresses the ICC
  • Less intuitive to interpret than kappa for those used to categorical agreement

Best for: continuous or ordinal scores, more than two raters, and distinguishing absolute agreement from mere consistency
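To make the difference between absolute agreement and consistency concrete, here is a minimal sketch that computes two single-rater ICC forms from two-way ANOVA mean squares. The function name `icc_single_rater`, the dictionary keys, and the ratings matrix are assumptions for illustration. It assumes a complete design (every rater scores every subject, no missing values) and scores that vary across subjects; a real analysis would also need confidence intervals and a deliberate choice among the other ICC forms.

```python
def icc_single_rater(ratings):
    """Single-rater ICCs from a subjects-by-raters matrix (illustrative helper).

    ratings: list of rows, one row per subject, one column per rater.
    Returns the two-way consistency ICC and the two-way absolute-agreement ICC.
    """
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete matrix with >= 2 subjects and >= 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects, raters, residual.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_subjects - ss_raters

    ms_subjects = ss_subjects / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Consistency ignores systematic offsets between raters.
    consistency = (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)
    # Absolute agreement also penalizes raters who score higher or lower overall.
    agreement = (ms_subjects - ms_error) / (
        ms_subjects + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )
    return {"consistency": consistency, "absolute_agreement": agreement}


# Illustrative scores: 6 subjects (rows) rated by 4 graders (columns).
scores = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
result = icc_single_rater(scores)
print(round(result["consistency"], 2), round(result["absolute_agreement"], 2))
```

On this matrix the consistency form comes out near 0.71 while the absolute-agreement form is near 0.29: the graders rank the subjects similarly, but the second grader scores everything several points lower than the first, which only the absolute-agreement form counts against them.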

Decision Table

See the tradeoffs side by side

| Criterion | Cohen's Kappa | Intraclass Correlation (ICC) |
| --- | --- | --- |
| Rating type | Categorical | Continuous or ordinal |
| Number of raters | Two (Fleiss for more) | Any number |
| Chance correction | Yes | Via variance partition |
| Forms to choose | Unweighted or weighted | Several, by design and agreement type |
| Known pitfall | Kappa paradox on skewed categories | Wrong form, narrow subject range |
| Right for LLM-judge labels | Discrete labels | Numeric scores |

Verdict

Let the rating scale decide. If your graders assign discrete labels (pass or fail, correct or incorrect, a category), use Cohen's kappa for two raters and Fleiss' kappa for more, because both correct for the chance agreement that raw percent agreement hides. If your graders assign scores on a numeric scale (a one-to-five quality rating, a confidence number), use the ICC, choosing the form that matches your design and whether you care about absolute agreement or mere consistency. Do not bin numeric scores just to apply kappa, which throws away information, and do not apply the ICC to true categories, which assumes a scale that is not there. Watch each metric's trap: kappa can read paradoxically low when one category dominates even though the raters mostly agree, and the ICC can be depressed when the subjects span a narrow range of true values. For validating an LLM judge against humans, pick the metric that matches the judge's output type, and report it with the rater count and, for kappa, the marginal distribution so others can interpret it.
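For the case of more than two raters on discrete labels, Fleiss' kappa can be sketched as follows. The function name `fleiss_kappa` and the counts table are illustrative assumptions; the sketch assumes every subject was labeled by the same number of raters, which is the standard Fleiss setup.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from a subjects-by-categories count table (illustrative helper).

    counts[i][j] = number of raters who put subject i in category j.
    Every row must sum to the same number of raters m (m >= 2).
    """
    n = len(counts)          # number of subjects
    m = sum(counts[0])       # raters per subject
    if m < 2 or any(sum(row) != m for row in counts):
        raise ValueError("each subject needs the same number (>= 2) of ratings")
    k = len(counts[0])       # number of categories

    # Share of all ratings that fell in each category.
    p_cat = [sum(row[j] for row in counts) / (n * m) for j in range(k)]

    # Per-subject agreement: fraction of rater pairs that chose the same label.
    p_subject = [(sum(c * c for c in row) - m) / (m * (m - 1)) for row in counts]

    p_observed = sum(p_subject) / n
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1:
        raise ValueError("kappa is undefined when every rating is the same category")
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up data: 5 answers, each graded pass/fail by 3 graders.
# Columns are [pass count, fail count].
table = [[3, 0], [3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(table), 3))  # about 0.444
```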

FAQ

Questions people ask next

The short answers readers usually want after the first pass.

What is the kappa paradox?

When one category is far more common than the others, two raters can agree on a very high proportion of cases yet produce a low Cohen's kappa. This happens because kappa corrects for the high chance agreement that the skewed marginals imply, so most of the observed agreement is attributed to chance and little is left to count as genuine agreement. The practical lesson is to report the marginal distribution alongside kappa, and to be cautious interpreting a low kappa when raw agreement is high and categories are imbalanced.
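A small worked example shows the effect. The two confusion tables below are made up for illustration, and `kappa_from_table` is a hypothetical helper: both tables have 90% raw agreement on 100 items, but one has balanced categories and the other has a category that covers 95% of each rater's labels.

```python
def kappa_from_table(table):
    """Return (observed agreement, chance agreement, kappa) for a square table.

    table[i][j] = number of items rater A put in category i and rater B in j.
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    p_observed = sum(table[i][i] for i in range(k)) / n
    row_share = [sum(table[i]) / n for i in range(k)]                       # rater A marginals
    col_share = [sum(table[i][j] for i in range(k)) / n for j in range(k)]  # rater B marginals
    p_chance = sum(row_share[i] * col_share[i] for i in range(k))
    return p_observed, p_chance, (p_observed - p_chance) / (1 - p_chance)


# Rows: rater A says pass / fail. Columns: rater B says pass / fail.
balanced = [[45, 5], [5, 45]]   # each rater labels half the items "pass"
skewed = [[90, 5], [5, 0]]      # each rater labels 95% of the items "pass"

for name, table in [("balanced", balanced), ("skewed", skewed)]:
    p_o, p_e, kappa = kappa_from_table(table)
    print(f"{name}: agreement={p_o:.3f} chance={p_e:.3f} kappa={kappa:.3f}")
```

With balanced categories, chance agreement is 0.50 and kappa is 0.80. With the skewed table, chance agreement is already 0.905, so the same 90% raw agreement yields a kappa slightly below zero (about -0.05).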

Planning estimates only — not financial, tax, or investment advice.