Which ICC form should I use?

The choice depends on your design and goal. You decide whether raters are a fixed set or a random sample, whether you want absolute agreement or only consistency of ranking, and whether the reliability applies to a single rater or the average of several. These choices map to distinct ICC forms that can give noticeably different values on the same data, so the form must be selected deliberately and reported, not left to a default. Misreporting the form is a common source of incomparable ICC values across studies.
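To make the gap between forms concrete, here is a minimal sketch in plain Python that computes four common two-way ICC forms from ANOVA mean squares. The labels follow the Shrout and Fleiss naming convention, which is an assumption about notation rather than something this page specifies, and the five human and judge scores are invented for illustration. The hypothetical judge ranks items almost like the human but scores about one point higher.

```python
def mean_squares(ratings):
    """Two-way ANOVA mean squares for a complete subjects x raters table."""
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return n, k, msr, msc, mse


def icc_forms(ratings):
    """Four two-way ICC forms (Shrout and Fleiss labels, assumed notation)."""
    n, k, msr, msc, mse = mean_squares(ratings)
    return {
        "ICC(2,1) absolute, single": (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n),
        "ICC(3,1) consistency, single": (msr - mse) / (msr + (k - 1) * mse),
        "ICC(2,k) absolute, average": (msr - mse) / (msr + (msc - mse) / n),
        "ICC(3,k) consistency, average": (msr - mse) / msr,
    }


# Illustrative data only: each row is one item, columns are (human, LLM judge).
ratings = [[1, 2], [2, 3], [3, 5], [4, 5], [5, 6]]
forms = icc_forms(ratings)
for name, value in forms.items():
    print(f"{name}: {value:.3f}")
```

On this toy table the single-rater consistency form comes out near 0.96 while the single-rater absolute-agreement form is near 0.76, because only the latter penalizes the judge's systematic offset.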

Can I use kappa for an LLM that outputs a numeric score?

Not directly, and binning the scores into categories to force kappa discards the ordering and magnitude information that the numeric scale carries, which usually understates true agreement. If the LLM judge emits a continuous or ordinal score, the ICC is the appropriate metric because it is built for that scale and uses the full information. Reserve kappa for when the judge genuinely emits discrete labels; if it emits numbers, use the ICC and pick the correct form.
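A small sketch of the information that binning throws away. The pass threshold of 4 and the score pairs are hypothetical, chosen only to show how a one-point gap that straddles the cutoff becomes a full disagreement while a two-point gap below it becomes a full agreement.

```python
def to_label(score, threshold=4):
    """Collapse a numeric score into a binary label (hypothetical cutoff)."""
    return "pass" if score >= threshold else "fail"


# Illustrative (human, judge) score pairs on a one-to-five scale.
pairs = [(3, 4), (1, 3), (5, 5), (2, 2)]

rows = []
for human, judge in pairs:
    gap = abs(human - judge)                       # magnitude kept by the numeric scale
    same = to_label(human) == to_label(judge)      # all that survives binning
    rows.append((human, judge, gap, same))
    print(f"human={human} judge={judge} gap={gap} binned_agree={same}")
```

The pair (3, 4) differs by one point yet counts as a disagreement after binning, and the pair (1, 3) differs by two points yet counts as agreement.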


Cohen's Kappa vs ICC for Agreement

When you validate an LLM judge against human annotators, or compare several graders, you need a number for how much they agree beyond chance. Raw percent agreement is misleading because two raters can agree often just by both favoring a common label. Cohen's kappa corrects for that on categorical data. The ICC does the analogous job for continuous scores, partitioning variance to ask how much of the total variation is between subjects versus between raters. The right choice hinges on whether your ratings are categories or numbers, and on how many raters you have. This matrix compares them for grading and annotation tasks.

Published May 26, 2026

By AI Fin Hub Research · AI Fin Hub Team


Cohen's Kappa Option

A chance-corrected agreement statistic for two raters assigning categorical labels. Values run from below zero up to one: one is perfect agreement, zero is chance-level agreement, and negative values mean agreement worse than chance.
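For reference, the standard definition can be written as follows. The symbols are notation introduced here, not taken from this page: p_o is the observed proportion of agreement, p_e is the agreement expected by chance from the raters' marginals, p_{cc} is the proportion of items both raters place in category c, and p_{c\cdot} and p_{\cdot c} are each rater's marginal proportion for category c.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_o = \sum_{c} p_{cc},
\qquad
p_e = \sum_{c} p_{c\cdot}\, p_{\cdot c}
```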

Pros

Corrects for chance agreement, unlike raw percent agreement which inflates on skewed categories
The standard, well-understood metric for categorical inter-rater reliability
Interpretable against established, if rough, benchmark ranges for agreement strength
Extends to weighted kappa for ordinal categories where some disagreements are worse than others

Cons

Basic Cohen's kappa handles only two raters; three or more need Fleiss' kappa
Suffers the kappa paradox: high agreement can yield low kappa when one category dominates
For categorical data only, so it cannot score agreement on a continuous scale
Sensitive to the marginal distribution of categories, complicating cross-study comparison

Best for: two raters on categorical labels, classification-style grading, and chance-corrected agreement on discrete decisions

Intraclass Correlation (ICC) Option

An agreement statistic for continuous or ordinal ratings that partitions variance to measure how consistently raters score the same subjects. Handles any number of raters.
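The variance partition can be sketched under a two-way model, which is an assumption made here for illustration: each rating is x_{ij} = \mu + s_i + r_j + e_{ij}, with subject variance \sigma_s^2, rater variance \sigma_r^2, and residual variance \sigma_e^2. The absolute-agreement forms count rater variance as disagreement and the consistency form leaves it out.

```latex
\text{Single rater, absolute agreement:}\quad
\mathrm{ICC} = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_r^2 + \sigma_e^2}

\text{Single rater, consistency:}\quad
\mathrm{ICC} = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_e^2}

\text{Average of } k \text{ raters, absolute agreement:}\quad
\mathrm{ICC} = \frac{\sigma_s^2}{\sigma_s^2 + (\sigma_r^2 + \sigma_e^2)/k}
```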

Pros

Designed for continuous and ordinal scores, where kappa does not apply
Naturally accommodates any number of raters, not just two
Multiple forms distinguish absolute agreement from consistency, and single from average raters
Grounded in variance components, giving a principled decomposition of disagreement sources

Cons

Requires choosing the correct ICC form, and the wrong choice changes the value meaningfully
Treats ratings as numbers on an interval scale, so applying it to true nominal categories is inappropriate
Sensitive to the range of subjects: a narrow range of true values depresses the ICC
Less intuitive to interpret than kappa for those used to categorical agreement

Best for: continuous or ordinal scores, more than two raters, and distinguishing absolute agreement from mere consistency
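The range sensitivity listed among the cons can be illustrated with a short sketch. It applies the same hypothetical rater noise to a wide and a narrow set of invented true values and computes the single-rater consistency ICC from two-way mean squares. The data and the helper names are assumptions for illustration only.

```python
def icc_consistency_single(ratings):
    """Single-rater consistency ICC from two-way ANOVA mean squares."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    msr = ss_rows / (n - 1)
    return (msr - mse) / (msr + (k - 1) * mse)


# Fixed, made-up noise for two raters; identical in both scenarios.
noise = [(0.3, -0.3), (-0.2, 0.2), (0.4, -0.4), (-0.3, 0.3), (0.2, -0.2), (-0.4, 0.4)]


def rate(true_values):
    """Build a subjects x raters table by adding the fixed noise to true values."""
    return [[t + a, t + b] for t, (a, b) in zip(true_values, noise)]


wide_icc = icc_consistency_single(rate([1, 2, 3, 4, 5, 6]))
narrow_icc = icc_consistency_single(rate([3.0, 3.2, 3.4, 3.6, 3.8, 4.0]))
print(f"wide range:   {wide_icc:.3f}")
print(f"narrow range: {narrow_icc:.3f}")
```

The raters are exactly as noisy in both cases, yet the ICC falls from roughly 0.94 to roughly 0.09 when the subjects barely differ, because there is little between-subject variance left to explain.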

Decision Table

See the tradeoffs side by side

Criterion	Cohen's Kappa	Intraclass Correlation (ICC)
Rating type	Categorical	Continuous or ordinal
Number of raters	Two (Fleiss for more)	Any number
Chance correction	Yes	Via variance partition
Forms to choose	Unweighted or weighted	Several, by design and agreement type
Known pitfall	Kappa paradox on skewed categories	Wrong form, narrow subject range
LLM-judge output it fits	Discrete labels	Numeric scores

Verdict

Let the rating scale decide. If your graders assign discrete labels (pass or fail, correct or incorrect, a category), use Cohen's kappa for two raters and Fleiss' kappa for more, because they correct for the chance agreement that raw percent agreement hides. If your graders assign scores on a numeric scale, such as a one-to-five quality rating or a confidence number, use the ICC, choosing the form that matches your design and whether you care about absolute agreement or mere consistency. Do not apply kappa to numbers by binning them, which throws away information, and do not apply the ICC to true categories, which assumes a scale that is not there. Watch each metric's trap: kappa can read paradoxically low when one category dominates even though the raters mostly agree, and the ICC can be depressed when the subjects span a narrow range of true values. For validating an LLM judge against humans, pick the metric that matches the judge's output type and report it with the rater count, the marginal distribution for kappa, and the form used for the ICC, so others can interpret it.


FAQ

Questions people ask next

The short answers readers usually want after the first pass.

What is the kappa paradox?

When one category is far more common than the others, two raters can agree on a very high proportion of cases yet produce a low Cohen's kappa. Skewed marginals imply high chance agreement, so kappa attributes most of the observed agreement to chance and credits little of it as genuine. The practical lesson is to report the marginal distribution alongside kappa, and to be cautious about interpreting a low kappa when raw agreement is high and categories are imbalanced.
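A minimal sketch of the effect in plain Python, using invented counts for 100 items. Both scenarios have the same 92 percent raw agreement; only the balance between the two categories differs. The helper name cohen_kappa and the data are illustrative assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Return (observed agreement, chance agreement, kappa) for two raters.

    Undefined when chance agreement equals 1 (both raters use one label only).
    """
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


def build(both_pass, a_only, b_only, both_fail):
    """Expand a 2x2 table of counts into two aligned label lists."""
    a = ["pass"] * both_pass + ["pass"] * a_only + ["fail"] * b_only + ["fail"] * both_fail
    b = ["pass"] * both_pass + ["fail"] * a_only + ["pass"] * b_only + ["fail"] * both_fail
    return a, b


# Skewed: "pass" dominates. Balanced: categories split evenly.
po_skewed, pe_skewed, kappa_skewed = cohen_kappa(*build(90, 4, 4, 2))
po_balanced, pe_balanced, kappa_balanced = cohen_kappa(*build(46, 4, 4, 46))

print(f"skewed:   agreement={po_skewed:.2f} chance={pe_skewed:.4f} kappa={kappa_skewed:.3f}")
print(f"balanced: agreement={po_balanced:.2f} chance={pe_balanced:.4f} kappa={kappa_balanced:.3f}")
```

With the skewed counts, chance agreement is about 0.89 and kappa lands near 0.29; with the balanced counts, chance agreement is 0.50 and kappa is 0.84, even though the raters agree on 92 of 100 items in both cases.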

Sources & References

A Coefficient of Agreement for Nominal Scales — Jacob Cohen, Educational and Psychological Measurement (1960)
A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research — Koo and Li, Journal of Chiropractic Medicine (2016)
