Statistical Methods

Statistical Methods for Rater Agreement

Statistical Methods
for
Rater Agreement

Recep ÖZCAN

recepozcan06@gmail.com

http://recepozcan06.blogcu.com/

2009

1


INDEX
Page
1. Statistical Methods for Rater Agreement…………………………………………… 5
1.0 Basic Considerations……………………………………………...…...…………… 5
1.1 Know the goals……………………………………………………………..………. 5
1.2 Consider theory………………………………………………………..…………… 5
1.3 Reliability vs. validity…………………………………………………..………….. 6
1.4 Modeling vs. description…………………………………………………………… 6
1.5 Components of disagreement……………………………………….……………… 7
1.6 Keep it simple………………………………………………………………………. 7
1.6.1 An example……………………………………………………...……………… 7
1.7 Recommended Methods………………………………………….………………… 8
1.7.1 Dichotomous data………………………………………………………………. 8
1.7.2 Ordered-category data…………………………………………….……………. 9
1.7.3 Nominal data…………………………………………………………………… 9
1.7.4 Likert-type items………………………………………………….……………. 10
2. Raw Agreement Indices……………………………………………………...………. 12
2.0 Introduction ………………………………………………………………...……… 12
2.1 Two Raters, Dichotomous Ratings ………………………………………………… 12
2.2 Proportion of overall agreement ………………………………………...…………. 12
2.3 Positive agreement and negative agreement ……………………………..………… 13
2.4 Significance, standard errors, interval estimation …………………………………. 13
2.4.1 Proportion of overall agreement ………………………………….……………. 13
2.4.2 Positive agreement and negative agreement ………………………..………….. 14
2.5 Two Raters, Polytomous Ratings …………………………………….……………. 15
2.6 Overall Agreement ……………………………………………...…….…………… 16
2.7 Specific agreement …………………………………………………..…………….. 17
2.8 Generalized Case ………………………………………………………………….. 17
2.9 Specific agreement ……………………………………………………...…………. 17
2.10 Overall agreement …………………………………………………..……………. 18
2.11 Standard errors, interval estimation, significance …………………..……………. 19
3. Intraclass Correlation and Related Method……………………………...………… 21
3.0 Introduction …………………………………………………………..……………. 21
3.1 Different Types of ICC……………………………………………….……………. 23

2


3.2 Pros and Cons……………………………………………………………...……….. 23
3.2.1 Pros………………………………………………………………...…………… 23
3.2.2 Cons…………………………………………………………………………….. 24
3.3 The Comparability Issue……………………………………….…………………… 25
4. Kappa Coefficients ………………………………………………………..…………. 27
4.0 Summary …………………………………………………………………………… 27
5. Tests of Marginal Homogeneity ……………………………………………………. 28
5.0 Introduction …………………………………………………………….………….. 28
5.1 Graphical and descriptive methods………………………………………………… 29
5.2 Nonparametric tests ……………………………………………………..…………. 30
5.3 Bootstrapping………………………………………………………………………. 30
5.4 Loglinear, association and quasi-symmetry modeling ………………….…………. 31
5.5 Latent trait and related models ………………………………………..…………… 32
6. The Tetrachoric and Polychoric Correlation Coefficients………….…………….. 34
6.0 Introduction……………………………………………………………...…………. 34
6.0.1 Summary ……………………………………………………………….……… 34
6.1 Pros and Cons: Tetrachoric and Polychoric Correlation Coefficients……………… 34
6.1.1 Pros……………………………………………………………………..………. 34
6.1.2 Cons…………………………………………………………………….………. 35
6.2 Intuitive Explanation ………………………………………………………………. 35
7. Detailed Description …………………………………………………………………. 38
7.0 Introduction ………………………………………………………………...……… 38
7.1 Measurement Model ……………………………………………………….………. 38
7.2 Using the Polychoric Correlation to Measure Agreement …………………..…….. 40
7.3 Extensions and Generalizations ………………………………………….………… 42
7.3.1 Examples ……………………………………………………….……………… 42
7.4 Factor analysis and SEM……………………………………………...……………. 45
7.4.1 Programs for tetrachoric correlation……………………………………………. 45
7.4.2 Programs for polychoric and tetrachoric correlation………………...…………. 46
7.4.3 Generalized latent correlation…………………………………………...……… 47
8. Latent Trait Models for Rater Agreement………………………………………….. 49
8.0 Introduction ………………………………………………………………...……… 49
8.1 Measurement Model ……………………………………………………..………… 49
8.2 Evaluating the Assumptions ……………………………………………..………… 50

3


8.3 What the Model Provides ………………………………………………..………… 50
9. Odds Ratio and Yule's Q…………………………………………………….………. 52
9.0 Introduction …………………………………………………………….………….. 52
9.1 Intuitive explanation…………………………………………………….………….. 52
9.2 Yule's Q………………………………………………………………..…………… 53
9.3 Log-odds ratio……………………………………………………………………… 53
9.4 Pros and Cons: the Odds Ratio…………………………………………..…………. 54
9.4.1 Pros………………………………………………………………….………….. 54
9.4.2 Cons………………………………………………………………….…………. 54
9.5 Extensions and alternatives …………………………………………….………….. 55
9.5.1 Extensions ……………………………………………………………………… 55
9.5.2 Alternatives…………………………………………………………..…………. 55
10. Agreement on Interval-Level Ratings …………………………………..………… 57
10.0 Introduction ………………………………………………………….…………… 57
10.1 General Issues …………………………………………………………….………. 58
10.2 Rater Association ………………………………………………………...……….. 58
10.3 Rater Bias …………………………………………………………………..…….. 59
10.4 Rating Distribution ……………………………………………………………….. 59
10.5 Rater vs. Rater or Rater vs. Group ……………………………………………….. 59
10.6 Measuring Rater Agreement …………………………………………..………….. 60
10.7 Measuring Rater Association ………………………………………..…………… 60
10.8 Measuring Rater Bias …………………………………………………………….. 62
10.9 Rater Distribution Differences …………………………………...………………. 62
10.10 Using the Results ………………………………………………..………………. 63
10.11 The Delphi Method ………………………………………………...……………. 63
10.12 Rater Bias ………………………………………………………….……………. 63
10.13 Rater Association ……………………………………………………..…………. 63
10.14 Distribution of Ratings ………………………………………………..………… 64
10.15 Discussion of Ambiguous Cases ……………………………………...………… 64

4


1. Statistical Methods for Rater Agreement

1.0 Basic Considerations

In many fields it is common to study agreement among ratings of multiple judges, experts,
diagnostic tests, etc. We are concerned here with categorical ratings: dichotomous (Yes/No,
Present/Absent, etc.), ordered categorical (Low, Medium, High, etc.), and nominal
(Schizophrenic, Bi-Polar, Major Depression, etc.) ratings. Likert-type ratings--intermediate
between ordered-categorical and interval-level ratings, are also considered.

There is little consensus about what statistical methods are best to analyze rater agreement (we
will use the generic words quot;ratersquot; and quot;ratingsquot; here to include observers, judges, diagnostic
tests, etc. and their ratings/results.) To the non-statistician, the number of alternatives and lack
of consistency in the literature is no doubt cause for concern. This site aims to reduce confusion
and help researchers select appropriate methods for their applications.

Despite the many apparent options for analyzing agreement data, the basic issues are very
simple. Usually there are one or two methods best for a particular application. But it is
necessary to clearly identify the purpose of analysis and the substantive questions to be
answered.

1.1 Know the goals

The most common mistake made when analyzing agreement data is not having a explicit goal.
It is not enough for the goal to be quot;measuring agreementquot; or quot;finding out if raters agree.quot; There
is presumably some reason why one wants to measure agreement. Which statistical method is
best depends on this reason.

For example, rating agreement studies are often used to evaluate a new rating system or
instrument. If such a study is being conducted during the development phase of the instrument,
one may wish to analyze the data using methods that identify how the instrument could be
changed to improve agreement. However if an instrument is already in a final format, the same
methods might not be helpful.

Very often agreement studies are an indirect attempt to validate a new rating system or
instrument. That is, lacking a definitive criterion variable or quot;gold standard,quot; the accuracy of a
scale or instrument is assessed by comparing its results when used by different raters. Here one
may wish to use methods that address the issue of real concern--how well do ratings reflect the
true trait one wants to measure?

In other situations one may be considering combining the ratings of two or more raters to obtain
evaluations of suitable accuracy. If so, again, specific methods suitable for this purpose should
be used.

1.2 Consider theory

A second common problem in analyzing agreement is the failure to think about the data from
the standpoint of theory. Nearly all statistical methods for analyzing agreement make

5


assumptions. If one has not thought about the data from a theoretical point of view it will be
hard to select an appropriate method. The theoretical questions one asks do not need to be
complicated. Even simple questions, like quot;is the trait being measured really discrete, like
presence/absence of a pathogen, or is the trait really continuous and being divided into discrete
levels (e.g., quot;low,quot; quot;medium, quot;highquot;) for convenience? If the latter, is it reasonable to assume
that the trait is normally distributed? Or is some other distribution plausible?

Sometimes one will not know the answers to these questions. That is fine, too, because there are
methods suitable for that case also. The main point is to be inclined to think about data in this
way, and to be attuned to the issue of matching method and data on this basis.

These two issues--knowing ones goals and considering theory, are the main keys to successful
analysis of agreement data. Following are some other, more specific issues that pertain to the
selection of methods appropriate to a given study.

1.3 Reliability vs. validity

One can broadly distinguish two reasons for studying rating agreement. Sometimes the goal is
estimate the validity (accuracy) of ratings in the absence of a quot;gold standard.quot; This is a
reasonable use of agreement data: if two ratings disagree, then at least one of them must be
incorrect. Proper analysis of agreement data therefore permits certain inferences about how
likely a given rating is to be correct.

Other times one merely wants to know the consistency of ratings made by different raters. In
some cases, the issue of accuracy may even have no meaning--for example ratings may concern
opinions, attitudes, or values.

1.4 Modeling vs. description

One should also distinguish between modeling vs. describing agreement. Ultimately, there are
only a few simple ways to describe the amount of agreement: for example, the proportion of
times two ratings of the same case agree, the proportion of times raters agree on specific
categories, the proportions of times different raters use the various rating levels, etc.

The quantification of agreement in any other way inevitably involves a model about how ratings
are made and why raters agree or disagree. This model is either explicit, as with latent structure
models, or implicit, as with the kappa coefficient. With this in mind, two basic principles are
evident:

It is better to have a model that is explicitly understood than one which is only implicit
•
and potentially not understood.
The model should be testable.
•

Methods vary with respect to how well they meet the these two criteria.

1.5 Components of disagreement

6


Consider that disagreement has different components. With ordered-category (including
dichotomous) ratings, one can distinguish between two different sources of disagreement.
Raters may differ: (a) in the definition of the trait itself; or (b) in their definitions of specific
rating levels or categories.

A trait definition can be thought of as a weighted composite of several variables. Different
raters may define or understand the trait as different weighted combinations. For example, to
one rater Intelligence may mean 50% verbal skill and 50% mathematical skill; to another it may
mean 33% verbal skill, 33% mathematical skill, and 33% motor skill. Thus their essential
definitions of what the trait means differ. Similarity in raters' trait definitions can be assessed
with various estimates of the correlation of their ratings, or analogous measures of association.

Category definitions, on the other hand, differ because raters divide the trait into different
intervals. For example, by quot;low skillquot; one rater may mean subjects from the 1st to the 20th
percentile. Another rater, though, may take it to mean subjects from the 1st to the 10th
percentile. When this occurs, rater thresholds can usually be adjusted to improve agreement.
Similarity of category definitions is reflected as marginal homogeneity between raters. Marginal
homogeneity means that the frequencies (or, equivalently, the quot;base ratesquot;) with which two
raters use various rating categories are the same.

Because disagreement on trait definition and disagreement on rating category widths are distinct
components of disagreement, with different practical implications, a statistical approach to the
data should ideally quantify each separately.

1.6 Keep it simple

All other things being equal, a simpler statistical method is preferable to a more complicated
one. Very basic methods can reveal far more about agreement data than is commonly realized.
For the most part, advanced methods are complements to, not substitutes for simple methods.

1.6.1 An example:

To illustrate these principles, consider the example for rater agreement on screening
mammograms, a diagnostic imaging method for detecting possible breast cancer. Radiologists
often score mammograms on a scale such as quot;no cancer,quot; quot;benign cancer,quot; quot;possible
malignancy,quot; or quot;malignancy.quot; Many studies have examined rater agreement on applying these
categories to the same set of images.

In choosing a suitable statistical approach, one would first consider theoretical aspects of the
data. The trait being measured, degree of evidence for cancer, is continuous. So the actual rating
levels would be viewed as somewhat arbitrary discretizations of the underlying trait. A
reasonable view is that, in the mind of a rater, the overall weight of evidence for cancer is an
aggregate composed of various physical image features and weights attached to each feature.
Raters may vary in terms of which features they notice and the weights they associate with
each.

One would also consider the purpose of analyzing the data. In this application, the purpose of
studying rater agreement is not usually to estimate the accuracy of ratings by a single rater. That

7


can be done directly in a validity study, which compares ratings to a definitive diagnosis made
from a biopsy.

Instead, the aim is more to understand the factors that cause raters to disagree, with an ultimate
goal of improving their consistency and accuracy. For this, one should separately assess
whether raters have the same definition of the basic trait (that different raters weight various
image features similarly) and that they have similar widths for the various rating levels. The
former can be accomplished with, for example, latent trait models. Moreover, latent trait models
are consistent with the theoretical assumptions about the data noted above. Raters' rating
category widths can be studied by visually representing raters' rates of use for the different
rating levels and/or their thresholds for the various levels, and statistically comparing them with
tests of marginal homogeneity.

Another possibility would be to examine if some raters are biased such that they make generally
higher or lower ratings than other raters. One might also note which images are the subject of
the most disagreement and then to try identify the specific image features that are the cause of
the disagreement.

Such steps can help one identify specific ways to improve ratings. For example, raters who
seem to define the trait much differently than other raters, or use a particular category too often,
can have this pointed out to them, and this feedback may promote their making ratings in a way
more consistent with other raters.

1.7 Recommended Methods

This section suggests statistical methods suitable for various levels of measurement based on
the principles outlined above. These are general guidelines only--it follows from the discussion
that no one method is best for all applications. But these suggestions will at least give the reader
an idea of where to start.

1.7.1 Dichotomous data

Two raters

 Assess raw agreement, overall and specific to each category.
 Use Cohen's kappa: (a) from its p-value, establish that agreement exceeds that expected
under the null hypothesis of random ratings; (b) interpret the magnitude of kappa as an
intraclass correlation. If different raters are used for different subjects, use the Scott/Fleiss
kappa instead of Cohen's kappa.
 Alternatively, calculate the intraclass correlation directly instead of a kappa statistic.
 Use McNemar's test to evaluate marginal homogeneity.
 Use the tetrachoric correlation coefficient if its assumptions are sufficiently plausible.
 Possibly test association between raters with the log odds ratio;

Multiple raters

 Assess raw agreement, overall and specific to each category.
 Calculate the appropriate intraclass correlation for the data. If different raters are used for
each subject, an alternative is the Fleiss kappa.

8


 If the trait being rated is assumed to be latently discrete, consider use of latent class models.
 If the trait being rated can be interpreted as latently continuous, latent trait models can be
used to assess association among raters and to estimate the correlation of ratings with the
true trait; these models can also be used to assess marginal homogeneity.
 In some cases latent class and latent trait models can be used to estimate the accuracy (e.g.,
Sensitivity and Specificity) of diagnostic ratings even when a 'gold standard' is lacking.

1.7.2 Ordered-category data

Two raters

 Use weighted kappa with Fleiss-Cohen (quadratic) weights; note that quadratic weights are
not the default with SAS and you must specify (WT=FC) with the AGREE option in PROC
FREQ.

 Alternatively, estimate the intraclass correlation.

 Ordered rating levels often imply a latently continuous trait; if so, measure association
between the raters with the polychoric correlation or one of its generalizations.

 Test overall marginal homogeneity using the Stuart-Maxwell test or the Bhapkar test.

 Test (a) for differences in rater thresholds associated with each rating category and (b) for a
difference between the raters' overall bias using the respectively applicable McNemar tests.

 Optionally, use graphical displays to visually compare the proportion of times raters use
each category (base rates).

 Consider association models and related methods for ordered category data. (See Agresti A.,
Categorical Data Analysis, New York: Wiley, 2002).

Multiple raters

 Estimate the intraclass correlation.
 Test for differences in rater bias using ANOVA or the Friedman test.
 Use latent trait analysis as a multi-rater generalization of the polychoric correlation. Latent
trait models can also be used to test for differences among raters in individual rating
category thresholds.
 Graphically examine and compare rater base rates and/or thresholds for various rating
categories.
 Alternatively, consider each pair of raters and proceed as described for two raters.

1.7.3 Nominal data

 Two raters
 Assess raw agreement, overall and specific to each category.
 Use the p-value of Cohen's unweighted kappa to verify that raters agree more than
chance alone would predict.
Often (perhaps usually), disregard the actual magnitude of kappa here; it is problematic

with nominal data because ordinarily one can neither assume that all types of

9


disagreement are equally serious (unweighted kappa) nor choose an objective set of
differential disagreement weights (weighted kappa). If, however, it is genuinely true that
all pairs of rating categories are equally quot;disparatequot;, then the magnitude of Cohen's
unweighted kappa can be interpreted as a form of intraclass correlation.
 Test overall marginal homogeneity using the Stuart-Maxwell test or the Bhapkar test.
 Test marginal homogeneity relative to individual categories using McNemar tests.
 Consider use of latent class models.
 Another possibility is use of loglinear, association, or quasi- symmetry models.
 Multiple raters
 Assess raw agreement, overall and specific to each category.
 If different raters are used for different subjects, use the Fleiss kappa statistic; again, as
with nominal data/two raters, attend only to the p-value of the test unless one has a
genuine basis for regarding all pairs of rating categories as equally quot;disparatequot;.
 Use latent class modeling. Conditional tests of marginal homogeneity can be made
within the context of latent class modeling.
 Use graphical displays to visually compare the proportion of times raters use each
category (base rates).
 Alternatively, consider each pair of raters individually and proceed as described for two
raters.

1.7.4 Likert-type items

Very often, Likert-type items can be assumed to produce interval-level data. (By a quot;Likert-type
itemquot; here we mean one where the format clearly implies to the rater that rating levels are
evenly-spaced, such as

lowest highest
|-------|-------|-------|-------|-------|-------|
1 2 3 4 5 6 7
(circle level that applies)

 Two raters
 Assess association among raters using the regular Pearson correlation coefficient.
 Test for differences in rater bias using the t-test for dependent samples.
 Possibly estimate the intraclass correlation.
 Assess marginal homogeneity as with ordered-category data.
 See also methods listed in the section Methods for Likert-type or interval-level data.
 Multiple raters
 Perform a one-factor common factor analysis; examine/report the correlation of each
rater with the common factor (for details, see the section Methods for Likert-type or
interval-level data).
Test for differences in rater bias using two-way ANOVA models.

Possibly estimate the intraclass correlation.

Use histograms to describe raters' marginal distributions.

If greater detail is required, consider each pair of raters and proceed as described for two

raters

10


2. Raw Agreement Indices

11


2.0 Introduction

Much neglected, raw agreement indices are important descriptive statistics. They have unique
common-sense value. A study that reports only simple agreement rates can be very useful; a
study that omits them but reports complex statistics may fail to inform readers at a practical
level.

Raw agreement measures and their calculation are explained below. We examine first the case
of agreement between two raters on dichotomous ratings.

2.1 Two Raters, Dichotomous Ratings

Consider the ratings of two raters (or experts, judges, diagnostic procedures, etc.) summarized
by Table 1:

Rater 2

Rater 1 + - total

+ a b a+b

- c d c+d

total a+c b+d N

The values a, b, c and d here denote the observed frequencies for each possible combination of
ratings by Rater 1 and Rater 2.

2.2 Proportion of overall agreement

The proportion of overall agreement (po) is the proportion of cases for which Raters 1 and 2
agree. That is:

a+d a+d
po = ------------- = -----. (1)
a+b+c+d N

This proportion is informative and useful, but, taken by itself, has possible has limitations. One
is that it does not distinguish between agreement on positive ratings and agreement on negative
ratings.

Consider, for example, an epidemiological application where a positive rating corresponds to a
positive diagnosis for a very rare disease -- one, say, with a prevalence of 1 in 1,000,000. Here
we might not be much impressed if po is very high -- even above .99. This result would be due
almost entirely to agreement on disease absence; we are not directly informed as to whether
diagnosticians agree on disease presence.

12


Further, one may consider Cohen's (1960) criticism of po: that it can be high even with
hypothetical raters who randomly guess on each case according to probabilities equal to the
observed base rates. In this example, if both raters simply guessed quot;positivequot; the large majority
of times they would usually agree on the diagnosis. Cohen proposed to remedy this by
comparing po to a corresponding quantity, pc, the proportion of agreement expected by raters
who randomly guess. As described on the kappa coefficients page, this logic is questionable; in
particular, it is not clear what advantage there is to compare an actual level of agreement, po,
with a hypothetical value, pc, which would occur under an obviously unrealistic model.

A much simpler way to address this issue is described immediately below.

2.3 Positive agreement and negative agreement

We may also compute observed agreement relative to each rating category individually.
Generically the resulting indices are called the proportions of specific agreement (Spitzer &
Fleiss, 1974). With binary ratings, there are two such indices, positive agreement (PA) and
negative agreement (NA). They are calculated as follows:

2a 2d
PA = ----------; NA = ----------. (2)
2a + b + c 2d + b + c
.
PA, for example, estimates the conditional probability, given that one of the raters, randomly
selected, makes a positive rating, the other rater will also do so.

A joint consideration of PA and NA addresses the potential concern that, when base rates are
extreme, po is liable to chance-related inflation or bias. Such inflation, if it exists at all, would
affect only the more frequent category. Thus if both PA and NA are satisfactorily large, there is
arguably less need or purpose in comparing actual to chance- predicted agreement using a
kappa statistic. But in any case, PA and NA provide more information relevant to understanding
and improving ratings than a single omnibus index (see Cicchetti and Feinstein, 1990).

2.4 Significance, standard errors, interval estimation

2.4.1 Proportion of overall agreement

Statistical significance. In testing the significance of po, the null hypothesis is that raters are
independent, with their marginal assignment probabilities equal to the observed marginal
proportions. For a 2×2 table, the test is the same as a usual test of statistical independence in a
contingency table. Any of the following could potentially be used:

test of a nonzero kappa coefficient
•
test of a nonzero log-odds ratio
•
a Pearson chi-squared (X²) or likelihood-ratio chi-squared (G²) test of independence
•
the Fisher exact test
•
test of fit of a loglinear model with main effects only
•

13


A potential advantage of a kappa significance test is that the magnitude of kappa can be
interpreted as approximately an intra-class correlation coefficient. All of these tests, except the
last, can be done with SAS PROC FREQ.

Standard error. One can use standard methods applicable to proportions to estimate the
standard error and confidence limits of po. For a sample size N, the standard error of po is:

SE(po) = sqrt[po(1 - po)/N] (3.1)

One can alternatively estimate SE(po) using resampling methods, e.g., the nonparametric
bootstrap or the jackknife, as described the next section.

Confidence intervals. The Wald or quot;normal approximationquot; method estimates confidence limits
of a proportion as follows:

CL = po - SE × zcrit (3.2)
CU = po + SE × zcrit (3.3)

where SE here is SE(po) as estimated by Eq. (3.1), CL and CU are the lower and upper
confidence limits, and zcrit is the z-value associated with a confidence range with coverage
probability crit. For a 95% confidence range, zcrit = 1.96; for a 90% confidence range, zcrit =
1.645.

When po is either very large or very small (and especially with small sample sizes) the Wald
method may produce confidence limits less than 0 or greater than 1; in this case better
approximate methods (see Agresti, 1996), exact methods, or resampling methods (see below)
can be used instead.

2.4.2 Positive agreement and negative agreement

Statistical significance. Logically, there is only one test of independence in a 2×2 table;
therefore if PA significantly differs from chance, so too would NA, and vice versa. Spitzer and
Fleiss (1974) described kappa tests for specific rating levels; in a 2×2 there are two such
quot;specific kappasquot;, but both have the same value and statistical significance as the overall kappa.

Standard errors.

As shown by Mackinnon (2000; p. 130), asymptotic (large sample) standard errors of
•
PA and NA are estimated by the following formulas:
SE(PA) = sqrt[4a (c + b)(a + c + b)] / (2a + b + c)^2 (3.4)
•
SE(NA) = sqrt[4d (c + b)(d + c + b)] / (2d + b + c)^2 (3.5)
•
Alternatively, one can estimate standard errors using the nonparametric bootstrap or the
jackknife. These are described with reference to PA as follows:

With the nonparametric bootstrap (Efron & Tibshirani, 1993), one constructs a large
•
number of simulated data sets of size N by sampling with replacement from the
observed data. For a 2×2 table, this can be done simply by using random numbers to
assign simulated cases to cells with probabilities of a/N, b/N, c/N and d/N (however,
with large N, is more efficient algorithms are preferable.) One then computes the
proportion of positive agreement for each simulated data set -- which we denote PA*.

14


The standard deviation of PA* across all simulated data sets estimates the standard error
SE(PA).
The delete-1 (Efron, 1982) jackknife works by calculating PA for four alternative tables
•
where 1 is subtracted from each of the four cells of the original 2 × 2 table. A few
simple calculations then provide an estimate of the standard error SE(PA). The delete-1
jackknife requires less computation, but the nonparametric bootstrap is usually
considered more accurate.

Confidence intervals.

Asymptotic confidence limits for PA and NA can be obtained as in Eqs. 3.2 and 3.3.,
•
substituting PA and NA for po and using the asymptotic standard errors given by Eqs.
3.4 and 3.5.
Alternatively, the bootstrap can be used. Again, we describe the method for PA. As
•
with bootstrap standard error estimation, ones generate a large number (e.g., 100,000)
of simulated data sets, computing an estimate PA* for each one. Results are then sorted
by increasing value of PA*. Confidence limits of PA are obtained with reference to the
percentiles of this ranking. For example, the 95% confidence range of PA is estimated
by the values of PA* that correspond to the 2.5 and 97.5 percentiles of this distribution.

An advantage of bootstrapping is that one can use the same simulated data sets to estimate
not only the standard errors and confidence limits of PA and NA, but also those of po or any
other statistic defined for the 2×2 table.

A SAS program to estimate the asymptotic standard errors and asymptotic confidence limits of
PA and NA has been written. For a free standalone program that supplies both bootstrap and
asymptotic standard errors and confidence limits, please email the author.

Readers are referred to Graham and Bull (1998) for fuller coverage of this topic, including a
comparison of different methods for estimating confidence intervals for PA and NA.

2.5 Two Raters, Polytomous Ratings

We now consider results for two raters making polytomous (either ordered category or purely
nominal) ratings. Let C denote the number of rating categories or levels. Results for the two
raters may be summarized as a C × C table such as Table 2.

Table2
Summary of polytomous ratings by two raters

Rater 2

Rater 1 1 2 ... C total

15


1 n11 n12 ... n1C n1.

2 n21 n22 ... n2C n2.

. . . . .
...
. . . . .

C nC1 nC2 ... nCC nC.

total n.1 n.2 ... n.C N

Here nij denotes the number of cases assigned rating category i by Rater 1 and category j by
Rater j, with i, j = 1, ..., C. When a quot;.quot; appears in a subscript, it denotes a marginal sum over the
corresponding index; e.g., ni. is the sum of nij for j = 1, ..., c, or the row marginal sum for
category i; n.. = N denotes the total number of cases.

2.6 Overall Agreement

For this design, po is the sum of frequencies of the main diagonal of table {nij} divided by
sample size, or

C
po = 1/N SUM nii (4)
i=1

Statistical significance

One may test the statistical significance of po with Cohen's kappa. If kappa is
•
significant/nonsignificant, then po may be assumed significant/nonsignificant, and vice
versa. Note that the numerator of kappa is the difference between po and the level of
agreement expected under the null hypothesis of statistical independence.
The parametric bootstrap can also be used to test statistical significance. This is like the
•
nonparametric bootstrap already described, except that samples are generated from the
null hypothesis distribution. Specifically, one constructs many -- say 5000 -- simulated
samples of size N from the probability distribution {πij}, where

ni.n.j
πij = ------. (5)
N

and the tabulates overall agreement, denoted p*o, for each simulated sample. The po for
the actual data is considered statistically significant if it exceeds a specified percentage
(e.g., 5%) of the p*o values.

If one already has a computer program for nonparametric bootstrapping only slight
modifications are needed to adapt it to perform a parametric bootstrap significance test.

Standard error and confidence limits. Here the standard error and confidence intervals of po can
again be calculated with the methods described for 2×2 tables.

16


2.7 Specific agreement

With respect to Table 2, the proportion of agreement specific to category i is:

2nii
ps(i) = ---------. (6)
ni. + n.i

Statistical significance

Eq. (6) amounts to collapsing the C × C table into a 2×2 table relative to category i, considering
this category a 'positive' rating, and then computing the positive agreement (PA) index of Eq.
(2). This is done for each category i successively. In each reduced table one may perform a test
of statistical independence using Cohen's kappa, the odds ratio, or chi-squared, or use a Fisher
exact test.

Standard errors and confidence limits

Again, for each category i, we may collapse the original C × C table into a 2×2 table,
•
taking i as the 'positive' rating level. The asymptotic standard error formula Eq. (3.4) for
PA may then be used, and the Wald method confidence limits given by Eqs. (3.1) and
(3.2) may be computed.
Alternatively, one can use the nonparametric bootstrap to estimate standard errors
•
and/or confidence limits. Note that this does not require a successive collapsing of the
original table.
The delete-1 jackknife can be used to estimate standard errors, but this does require
•
successive collapsings of the C × C table.

2.8 Generalized Case

We now consider generalized formulas for the proportions of overall and specific agreement.
They apply to binary, ordered category, or nominal ratings and permit any number of raters,
with potentially different numbers of raters or different raters for each case.

2.9 Specific agreement

Let there be K rated cases indexed by k = 1, ..., K. The ratings made on case k are summarized
as:

{njk} (j = 1, ..., C) = {n1k, n2k, ..., nCk}

where njk is the number of times category j (j = 1, ..., C) is applied to case k. For example, if a
case k is rated five times and receives ratings of 1, 1, 1, 2, and 2, then n1k = 3, n2k = 2, and {njk}
= {3, 2}.

Let nk denote the total number of ratings made on case k; that is,

C
nk = SUM njk. (7)
j=1

17


For case k, the number of actual agreements on rating level j is

njk (njk - 1). (8)

The total number of agreements specifically on rating level j, across all cases is

K
S(j) = SUM njk (njk - 1). (9)
k=1

The number of possible agreements specifically on category j for case k is equal to

njk (nk - 1) (10)

and the number of possible agreements on category j across all cases is:

K
Sposs(j) = SUM njk (nk - 1). (11)
k=1

The proportion of agreement specific to category j is equal to the total number of agreements on
category j divided by the total number of opportunities for agreement on category j, or

S(j)
ps(j) = -------. (12)
Sposs(j)

2.10 Overall agreement

The total number of actual agreements, regardless of category, is equal to the sum of Eq. (9)
across all categories, or

C
O = SUM S(j). (13)
j=1

The total number of possible agreements is

K
Oposs = SUM nk (nk - 1). (14)
k=1

Dividing Eq. (13) by Eq. (14) gives the overall proportion of observed agreement, or

O
po = ------. (15)
Oposs

18


2.11 Standard errors, interval estimation, significance

The jackknife or, preferably, the nonparametric bootstrap can be used to estimate standard
errors of ps(j) and po in the generalized case. The bootstrap is uncomplicated if one assumes
cases are independent and identically distributed (iid). In general, this assumption will be
accepted when:

the same raters rate each case, and either there are no missing ratings or ratings are
•
missing completely at random.
the raters for each case are randomly sampled and the number of rating per case is
•
constant or random.
in a replicate rating (reproducibility) study, each case is rated by the procedure the same
•
number of times or else the number of replications for any case is completely random.

In these cases, one may construct each simulated sample by repeated random sampling with
replacement from the set of K cases.

If cases cannot be assumed iid (for example, if ratings are not missing at random, or, say, a
study systematically rotates raters), simple modifications of the bootstrap method--such as two-
stage sampling, can be made.

The parametric bootstrap can be used for significance testing. A variation of this method,
patterned after the Monte Carlo approach described by Uebersax (1982), is as follows:

Loop through s, where s indexes simulated data sets
Loop through all cases k
Loop through all ratings on case k

For each actual rating, generate a
random simulated rating, chosen such that:

Pr(Rating category=j|Rater=i) = base
rate of category j for Rater i.

If rater identities are unknown or for a
reproducibility study, the total base rate
for category j is used.

End loop through case k's ratings
End loop through cases
Calculate p*o and p*s(j)
(and any other statistics
of interest) for sample s.
End main loop

The significance of po, ps(j), or any other statistic calculated, is determined with reference to the
distribution of corresponding values in the simulated data sets. For example, po is significant at
the .05 level (1-tailed) if it exceeds 95% of the p*o values obtained for the simulated data sets.

References

19


Agresti A. An introduction to categorical data analysis. New York: Wiley, 1996.

Cicchetti DV. Feinstein AR. High agreement but low kappa: II. Resolving the
paradoxes. Journal of Clinical Epidemiology, 1990, 43, 551-558.

Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological
Measurement, 1960, 20, 37-46.

Cook RJ, Farewell VT. Conditional inference for subject-specific and marginal
agreement: two families on agreement measures. Canadian Journal on Statistics, 1995,
23, 333-344.

Efron B. The jackknife, the bootstrap and other resampling plans. Philadelphia: Society
for Industrial and Applied Mathematics, 1982.

Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman and Hall,
1993.

Fleiss JL. Measuring nominal scale agreement among many raters. Psychological
Bulletin, 1971, 76, 378-381.

Fleiss JL. Statistical methods for rates and proportions, 2nd Ed. New York: John Wiley,
1981.

Graham P, Bull B. Approximate standard errors and confidence intervals for indices of
positive and negative agreement. J Clin Epidemiol, 1998, 51(9), 763-771.

Mackinnon, A. A spreadsheet for the calculation of comprehensive statistics for the
assessment of diagnostic tests and inter-rater agreement. Computers in Biology and
Medicine, 2000, 30, 127-134.

Spitzer R, Fleiss J. A re-analysis of the reliability of psychiatric diagnosis. British
Journal on Psychiatry, 1974, 341-47.

Uebersax JS. A design-independent method for measuring the reliability of psychiatric
diagnosis. Journal on Psychiatric Research, 1982-1983, 17(4), 335-342.

20


3. Intraclass Correlation and Related Method

3.0 Introduction

The Intraclass Correlation (ICC) assesses rating reliability by comparing the variability of
different ratings of the same subject to the total variation across all ratings and all subjects.

The theoretical formula for the ICC is:

s 2(b)
ICC = ------------ [1]
s 2(b) + s 2 (w)

where s 2(w) is the pooled variance within subjects, and s 2(b) is the variance of the trait between
subjects.

21


It is easily shown that s 2(b) + s 2(w) = the total variance of ratings--i.e., the variance for all
ratings, regardless of whether they are for the same subject or not. Hence the interpretation of
the ICC as the proportion of total variance accounted for by within-subject variation.

Equation [1] would apply if we knew the true values, s 2 (w) and s 2(b). But we rarely do, and
must instead estimate them from sample data. For this we wish to use all available information;
this adds terms to Equation [1].

For example, s 2(b) is the variance of true trait levels between subjects. Since we do not know a
subject's true trait level, we estimate it from the subject's mean rating across the raters who rate
the subject. Each mean rating is subject to sampling variation--deviation from the subject's true
trait level, or it's surrogate, the mean rating that would be obtained from a very large number of
raters. Since the actual mean ratings are often based on two or a few ratings, these deviations
are appreciable and inflate the estimate of between-subject variance.

We can estimate the amount and correct for this extra, error variation. If all subjects have k
ratings, then for the Case 1 ICC (see definition below) the extra variation is estimated as (1/k) s
2
(w), where s 2(w) is the pooled estimate of within-subject variance. When all subjects have k
ratings, s2(w) equals the average variance of the k ratings of each subject (each calculated using
k-1 as denominator). To get the ICC we then:

1. Estimate s 2(b) as [s 2(b) - s 2(w)/k], where s2(b) is the variance of
subjects' mean ratings,
2. Estimate s 2(w) as s 2(w), and
3. Apply Equation [1]

For the various other types of ICC's, different corrections are used, each producing it's own
equation. Unfortunately, these formulas are usually expressed in their computational form--with
terms arranged in a way that facilitates calculation, rather than their derivational form--which
would make clear the nature and rationale of the correction terms.

3.1 Different Types of ICC

In their important paper, Shrout and Fleiss (1979) describe three classes of ICC for reliability,
which they term Case 1, Case 2 and Case 3. Each Case applies to a different rater agreement
study design.

Case 1 Raters for each subject are selected at random
Case 2 The same raters rate each case. These are a random sample.
Case 3 The same raters rate each case. These are the only raters.

Case 1. One has a pool of raters. For each subject, one randomly samples from the rater pool k
different raters to rate this subject. Therefore the raters who rate one subject are not necessarily
the same as those who rate another. This design corresponds to a 1-way Analysis of Variance
(ANOVA) in which Subject is a random effect, and Rater is viewed as measurement error.

Case 2. The same set of k raters rate each subject. This corresponds to a fully-crossed (Rater ×
Subject), 2-way ANOVA design in which both Subject and Rater are separate effects. In Case

22


2, Rater is considered a random effect; this means the k raters in the study are considered a
random sample from a population of potential raters. The Case 2 ICC estimates the reliability of
the larger population of raters.

Case 3. This is like Case 2--a fully-crossed, 2-way ANOVA design. But here one estimates the
ICC that applies only to the k raters in the study. Since this does not permit generalization to
other raters, the Case 3 ICC is not often used.

Shrout and Fleiss (1981) also show that for each of the three Cases above, one can use the ICC
in two ways:

To estimate the reliability of a single rating, or
•
To estimate the reliability of a mean of several ratings.
•

For each of the Cases, then, there are two forms, producing a total of 6 different versions of the
ICC.

3.2 Pros and Cons

3.2.1 Pros

Flexible
•

The ICC, and more broadly, ANOVA analysis of ratings, is very flexible. Besides the
six ICCs discussed above, one can consider more complex designs, such as a grouping
factor among raters (e.g., experts vs. nonexperts), or covariates. See Landis and Koch
(1977a,b) for examples.

Software
•

Software to estimate the ICC is readily available (e.g, SPSS and SAS). Output from
most any ANOVA software will contain the values needed to calculate the ICC.

Reliability of mean ratings
•

The ICC allows estimation of the reliability of both single and mean ratings. quot;Prophecyquot;
formulas let one predict the reliability of mean ratings based on any number of raters.

Combines information about bias and association.
•

An alternative to the ICC for Cases 2 and 3 is to calculate the Pearson correlation
between all pairs of rater. The Pearson correlation measures association between raters,
but is insensitive to rater mean differences (bias). The ICC decreases in response to both
lower correlation between raters and larger rater mean differences. Some may see this
advantage, but others (see Cons) as a limitation.

Number of categories
•

23


The ICC can be used to compare the reliability of different instruments. For example,
the reliability of a 3-level rating scale can be compared to the reliability of a 5-level
scale (provided they are assessed relative to the same sample or population; see Cons).

3.2.2 Cons

Comparability across populations
•

The ICC is strongly influenced by the variance of the trait in the sample/population in
which it is assessed. ICCs measured for different populations might not be comparable.

For example, suppose one has a depression rating scale. When applied to a random
sample of the adult population the scale might have a high ICC. However, if the scale is
applied to a very homogeneous population--such as patients hospitalized for acute
depression--it might have a low ICC.

This is evident from the definition of the ICC as s 2(b)/ [s 2(b)+s 2(w)]. In both
populations above, s 2(w), variance of different raters' opinions of the same subject, may
be the same. But between-subject variance, s 2(b), may be much smaller in the clinical
population than in the general population. Therefore the ICC would be smaller in the
clinical population.

The the same instrument may be judged
quot;reliablequot; or quot;unreliable,quot; depending on the
population in which it is assessed.

This issue is similar to, and just as much a concern as, the quot;base ratequot; problem of the
kappa coefficient. It means that:

1. One cannot compare ICCs for samples or populations with different between-
subject variance; and
2. The often-reproduced table which shows specific ranges for quot;acceptablequot; and
quot;unacceptablequot; ICC values should not be used.

For more discussion on the implications of this topic see, The Comparability Issue
below.

Assumes equal spacing
•

To use the ICC with ordered-category ratings, one must assign the rating categories
numeric values. Usually categories are assigned values 1, 2, ..., C, where C is the
number of rating categories; this assumes all categories are equally wide, which may not
be true. An alternative is to assign ordered categories numeric values from their
cumulative frequencies via probit (for a normally distributed trait) or ridit (for a
rectangularly distributed trait) scoring; see Fleiss (1981).

Association vs. bias
•

The ICC combines, or some might say, confounds, two ways in which raters differ: (1)
association, which concerns whether the raters understand the meaning of the trait in the

24


same way, and (2) bias, which concerns whether some raters' mean ratings are higher or
lower than others. If a goal is to give feedback to raters to improve future ratings, one
should distinguish between these two sources of disagreement. For discussion on
alternatives that separate these components, see the Likert Scale page of this website.

Reliability vs. agreement
•

With ordered-category or Likert-type data, the ICC discounts the fact that we have a
natural unit to evaluate rating consistency: the number or percent of agreements on each
rating category. Raw agreement is simple, intuitive, and clinically meaningful. With
ordered category data, it is not clear why one would prefer the ICC to raw agreement
rates, especially in light of the comparability issue discussed below. A good idea is to
report reliability using both the ICC and raw agreement rates.

3.3 The Comparability Issue

Above it was noted that the ICC is strongly dependent on the trait variance within the
population for which it is measured. This can complicate comparisons of ICCs measured in
different populations, or in generalizing results from a single population.

Some suggest avoiding this problem by eliminating or holding constant the quot;problematicquot; term,
s 2(b).

Holding the term constant would mean choosing some fixed value for s 2(b), and using this in
place of the different value estimated in each population. For example, one might pick as s 2(b)
the trait variance in the general adult population--regardless of what population the ICC is
measured in.

However, if one is going to hold s 2(b) constant, one may well question using it at all! Why not
simply report as the index of unreliability the value of s 2(w) for a study? Indeed, this has been
suggested, though not used in practice much.

But if one is going to disregard s 2(b) because it complicates comparisons, why not go a step
further and express reliability simply as raw agreement rates--for example, the percent of times
two raters agree on the exact same category, and the percent of time they are within on level of
one another?

An advantage of including s 2(b) is that it automatically controls for the scaling factor of an
instrument. Thus (at least within the same population), ICCs for instruments with different
numbers of categories can be meaningfully compared. Such is not the case with raw agreement
measures or with s 2 (w) alone. Therefore, someone reporting reliability of a new scale may wish
to include the ICC along with other measures if they expect later researchers might compare
their results to those of a new or different instrument with fewer or more categories.

25


4. Kappa Coefficients

4.0 Summary

There is wide disagreement about the usefulness of kappa statistics to assess rater agreement. At
the least, it can be said that (1) kappa statstics should not be viewed as the unequivocal standard
or default way to quantify agreement; (2) one should be concerned about using a statistic that is
the source of so much controversy; and (3) oneshould consider alternatives and make an
informed choice.

One can distinguish between two possible uses of kappa: as a way to test rater independence
(i.e. as a test statistic), and as a way to quantify the level of agreement (i.e., as an effect-size
measure). The first use involves testing the null hypothesis that there is no more agreement than
might occur by chance given random guessing; that is, one makes a qualitative, quot;yes or noquot;
decision about whether raters are independent or not. Kappa is appropriate for this purpose
(although to know that raters are not independent is not very informative; raters are dependent
by definition, inasmuch as they are rating the same cases).

It is the second use of kappa--quantifying actual levels of agreement--that is the source of
concern. Kappa's calculation uses a term called the proportion of chance (or expected)
agreement. This is interpreted as the proportion of times raters would agree by chance alone.

26


However, the term is relevant only under the conditions of statistical independence of raters.
Since raters are clearly not independent, the relevance of this term, and its appropriateness as a
correction to actual agreement levels, is very questionable.

Thus, the common statement that kappa is a quot;chance-corrected measure of agreementquot;
misleading. As a test statistic, kappa can verify that agreement exceeds chance levels. But as a
measure of the level of agreement, kappa is not quot;chance-correctedquot;; indeed, in the absence of
some explicit model of rater decisionmaking, it is by no means clear how chance affects the
decisions of actual raters and how one might correct for it.

A better case for using kappa to quantify rater agreement is that, under certain conditions, it
approximates the intra-class correlation. But this too is problematic in that (1) these conditions
are not always met, and (2) one could instead directly calculate the intraclass correlation.

5. Tests of Marginal Homogeneity

5.0 Introduction
Consider symptom ratings (1 = low, 2 = moderate, 3 = high) by two raters on the same sample
of subjects, summarized by a 3×3 table as follows:
Table 1. Summarization of ratings by Rater 1 (rows) and Rater 2
(columns).

1 2 3

1 p11 p12 p13 p1.

2 p21 p22 p23 p2.

3 p31 p32 p33 p3.

p.1 p.2 p.3 1.0

Here pij denotes the proportion of all cases assigned to category i Rater 1 and category j by
Rater 2. (The table elements could as easily be frequencies.) The terms p1., p2., and p3. denote the

27


marginal proportions for Rater 1--i.e. the total proportion of times Rater 1 uses categories 1, 2
and 3, respectively. Similarly, p.1, p.2, and p.3 are the marginal proportions for Rater 2.

Marginal homogeneity refers to equality (lack of significant difference) between one or more of
the row marginal proportions and the corresponding column proportion(s). Testing marginal
homogeneity is often useful in analyzing rater agreement. One reason raters disagree is because
of different propensities to use each rating category. When such differences are observed, it
may be possible to provide feedback or improve instructions to make raters' marginal
proportions more similar and improve agreement.

Differences in raters' marginal rates can be formally assessed with statistical tests of marginal
homogeneity (Barlow, 1998; Bishop, Fienberg & Holland, 1975; Ch. 8). If each rater rates
different cases, testing marginal homogeneity is straightforward: one can compare the marginal
frequencies of different raters with a simple chi-squared test. However this cannot be done
when different raters rate the same cases--the usual situation with rater agreement studies; then
the ratings of different raters are not statistically independent and this must be accounted for.

Several statistical approaches to this problem are available. Alternatives include:

Nonparametric tests
•
Bootstrap methods
•
Loglinear, association, and quasi-symmetry models
•

Latent trait and related models
•

These approaches are outlined here.

5.1 Graphical and descriptive methods

Before discussing formal statistical methods, non-statistical methods for comparing raters'
marginal distributions should be briefly mentioned. Simple descriptive methods can be very
useful. For example, a table might report each raters' rate of use for each category. Graphical
methods are especially helpful. A histogram can show the distribution of each raters' ratings
across categories. The following example is from the output of the MH program:

Marginal Distributions of Categories
for Rater 1 (**) and Rater 2 (==)

0.304 + **
| ** ==
| ** == ==
| ** == ** == ** ==
| ** == ** == ** ==
| ** == ** == ** ==
| ** == ** == ** == ** ==
| ** == ** == ** == ** == ** ==

28


| ** == ** == ** == ** == ** == ** ==
| ** == ** == ** == ** == ** == ** ==
0 +----+-------+-------+-------+-------+-------+----
1 2 3 4 5 6

Notes: x-axis is category number or level.
y-axis is proportion of cases.

Vertical or horizontal stacked-bar histograms are good ways to summarize the data. With
ordered-category ratings, a related type of figure shows the cumulative proportion of cases
below each rating level for each rater. An example, again from the MH program, is as follows:

Proportion of cases below each level

1 234 5 6
*---*-*-*-----*-------------------*-------------------------- Rater 1
*---*-*-*--------*------------*------------------------------ Rater 2
1 234 5 6

+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ Scale
0 .1 .2 .3 .4 .5 .6 .7 .8 .9 .1

These are merely examples. Many other ways to graphically compare marginal distributions are
possible.

5.2 Nonparametric tests

The main nonparametric test for assessing marginal homogeneity is the McNemar test. The
McNemar test assesses marginal homogeneity in a 2×2 table. Suppose, however, that one has an
N×N crossclassification frequency table that summarizes ratings by two raters for an N-category
rating system. By collapsing the N×N table into various 2×2 tables, one can use the McNemar
test to assess marginal homogeneity of each rating category. With ordered-category data one
can also collapse the N×N table in other ways to test rater equality of category thresholds, or test
raters for overall bias (i.e., a tendency to make higher or lower rating than other raters.)

The Stuart-Maxwell test can be used to test marginal homogeneity between two raters across all
categories simultaneously. It thus complements McNemar tests of individual categories by
providing an overall significance value.

Further explanation of these methods and their calculation can be found by clicking on the test
names above.

MH, a computer program for testing marginal homogeneity with these methods is available
online. For more information, click here.

29


These tests are remarkably easy to use and are usually just as effective as more complex
methods. Because the tests are nonparametric, they make few or no assumptions about the data.
While some of the methods described below are potentially more powerful, this comes at the
price of making assumptions which may or may not be true. The simplicity of the
nonparametric tests lends persuasiveness to their results.

A mild limitation is that these tests apply only for comparisons of two raters. With more than
two raters, of course, one can apply the tests for each pair of raters.

5.3 Bootstrapping

Bootstrap and related jackknife methods (Efron, 1982; Efron & Tibshirani, 1993) provide a
very general and flexible framework for testing marginal homogeneity. Again, suppose one has
an N×N crossclassification frequency table summarizing agreement between two raters on an N-
category rating. Using what is termed the nonparametric bootstrap, one would repeatedly
sample from this table to produce a large number (e.g., 500) of pseudo-tables, each with the
same total frequency as the original table.

Various measures of marginal homogeneity would be calculated for each pseudo-table; for
example, one might calculate the difference between the row marginal proportion and the
column marginal proportion for each category, or construct an overall measure of row vs.
column marginal differences.

Let d* denote such a measure calculated for a given pseudo-table, and let d denote the same
measure calculated for the original table. From the pseudo-tables, one can empirically calculate
the standard deviation of d*, or , d*. Let d' denote the true population value of d. Assuming that
d' = 0 corresponds to the null hypothesis of marginal homogeneity, one can test this null
hypothesis by calculating the z value:
*
z = d// d

and determining the significance of the standard normal deviate z by usual methods (e.g., a table
of z value probabilities).

The method above is merely an example. Many variations are possible within the framework of
bootstrap and jackknife methods.

An advantage of bootstrap and jackknife methods is their flexibility. For example, one could
potentially adapt them for simultaneous comparisons among more than two raters.

A potential disadvantage of these methods is that the user may need to write a computer
program to apply them. However, such a program could also be used for other purposes, such as
providing bootstrap significance tests and/or confidence intervals for various raw agreement
indices.

5.4 Loglinear, association and quasi-symmetry modeling

If one is using a loglinear, association or quasi-symmetry model to analyze agreement data, one
can adapt the model to test marginal homogeneity.

30


For each type of model the basic approach is the same. First one estimates a general form of the
model--that is, one without assuming marginal homogeneity; let this be termed the quot;unrestricted
model.quot; Next one adds the assumption of marginal homogeneity to the model. This is done by
applying equality restrictions to some model parameters so as to require homogeneity of one or
more marginal probabilities (Barlow, 1998). Let this be termed the quot;restricted model.quot;

Marginal homogeneity can then be tested using the difference G2 statistic, calculated as:

difference G2 = G2(restricted) - G2(unrestricted)

where G2(restricted) and G2(unrestricted) are the likelihood-ratio chi-squared model fit statistics
(Bishop, Fienberg & Holland, 1975) calculated for the restricted and unrestricted models.

The difference G2 can be interpreted as a chi-squared value and its significance determined from
a table of chi-squared probabilities. The df are equal to the difference in df for the unrestricted
and restricted models. A significant value implies that the rater marginal probabilities are not
homogeneous.

An advantage of this approach is that one can test marginal homogeneity for one category,
several categories, or all categories using a unified approach. Another is that, if one is already
analyzing the data with a loglinear, association, or quasi-symmetry model, the addition of
marginal homogeneity tests may require relatively little extra work.

A possible limitation is that loglinear, association, and quasi-symmetry models are only well-
developed for analysis of two-way tables. Another is that use of the difference G2 test typically
requires that the unrestricted model fit the data, which sometimes might not be the case.

For an excellent discussion of these and related models (including linear-by-linear models), see
Agresti (2002).

5.5 Latent trait and related models

Latent trait models and related methods such as the tetrachoric and polychoric correlation
coefficients can be used to test marginal homogeneity for dichotomous or ordered-category
ratings. The general strategy using these methods is similar to that described for loglinear and
related models. That is, one estimates both an unrestricted version of the model and a restricted
version that assumes marginal homogeneity, and compares the two models with a difference G2
test.

With latent trait and related models, the restricted models are usually constructed by assuming
that the thresholds for one or more rating levels are equal across raters.

A variation of this method tests overall rater bias. That is done by estimating a restricted model
in which the thresholds of one rater are equal to those of another plus a fixed constant. A
comparison of this restricted model with the corresponding unrestricted model tests the
hypothesis that the fixed constant, which corresponds to bias of a rater, is 0.

Another way to test marginal homogeneity using latent trait models is with the asymptotic
standard errors of estimated category thresholds. These can be used to estimate the standard

31


error of the difference between the thresholds of two raters for a given category, and this
standard error used to test the significance of the observed difference.

An advantage of the latent trait approach is that it can be used to assess marginal homogeneity
among any number of raters simultaneously. A disadvantage is that these methods require more
computation than nonparametric tests. If one is only interested in testing marginal homogeneity,
the nonparametric methods might be a better choice. However, if one is already using latent
trait models for other reasons, such as to estimate accuracy of individual raters or to estimate
the correlation of their ratings, one might also use them to examine marginal homogeneity;
however, even in this case, it might be simpler to use the nonparametric tests of marginal
homogeneity.

If there are many raters and categories, data may be sparse (i.e., many possible patterns of
ratings across raters with 0 observed frequencies). With very sparse data, the difference G2
statistic is no longer distributed as chi-squared, so that standard methods cannot be used to
determine its statistical significance.

References

Agresti A. Categorical data analysis. New York: Wiley, 2002.

Barlow W. Modeling of categorical agreement. The encyclopedia of biostatistics, P.
Armitage, T. Colton, eds., pp. 541-545. New York: Wiley, 1998.

Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and
practice. Cambridge, Massachusetts: MIT Press, 1975

Efron B. The jackknife, the bootstrap and other resampling plans. Philadelphia: Society
for Industrial and Applied Mathematics, 1982.

Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman and Hall,
1993.

32


6. The Tetrachoric and Polychoric Correlation Coefficients

6.0 Introduction

This page describes the tetrachoric and polychoric correlation coefficients, explains their
meaning and uses, gives examples and references, provides programs for their estimation, and
discusses other available software. While discussion is primarily oriented to rater agreement
problems, it is general enough to apply to most other uses of these statistics.

A clear, concise description of the tetrachoric and polychoric correlation coefficients, including
issues relating to their estimation, is found in Drasgow (1988). Olsson (1979) is also helpful.

What distinguishes the present discussion is the view that the tetrachoric and polychoric
correlation models are special cases of latent trait modeling. (This is not a new observation, but
it is sometimes overlooked). Recognizing this opens up important new possibilities. In
particular, it allows one to relax the distributional assumptions which are the most limiting
feature of the quot;classicalquot; tetrachoric and polychoric correlation models.

6.0.1 Summary

The tetrachoric correlation (Pearson, 1901), for binary data, and the polychoric correlation, for
ordered-category data, are excellent ways to measure rater agreement. They estimate what the
correlation between raters would be if ratings were made on a continuous scale; they are,
theoretically, invariant over changes in the number or quot;widthquot; of rating categories. The
tetrachoric and polychoric correlations also provide a framework that allows testing of marginal
homogeneity between raters. Thus, these statistics let one separately assess both components of

33


rater agreement: agreement on trait definition and agreement on definitions of specific
categories.

These statistics make certain assumptions, however. With the polychoric correlation, the
assumptions can be tested. The assumptions cannot be tested with the tetrachoric correlation if
there are only two raters; in some applications, though, theoretical considerations may justify
the use of the tetrachoric correlation without a test of model fit.

6.1 Pros and Cons: Tetrachoric and Polychoric Correlation Coefficients

6.1.1 Pros:

These statistics express rater association in a familiar form--a correlation coefficient.
•
They provide a way to separately quantify association and similarity of category
•
definitions.
They do not depend on number of rating levels; results can be compared for studies
•
where the number of rating levels is different.
They can be used even if different raters have different numbers of rating levels.
•
The assumptions can be easily tested for the polychoric correlation.
•
Estimation software is routinely available (e.g., SAS PROC FREQ, and PRELIS).
•

6.1.2 Cons:

Model assumptions not always appropriate--for example, if the latent trait is truly
•
discrete.
For only two raters, there is no way to test the assumptions of the tetrachoric
•
correlation.

6.2 Intuitive Explanation

Consider the example of two psychiatrists (Raters 1 and 2) making a diagnosis for
presence/absence of Major Depression. Though the diagnosis is dichotomous, we allow that
depression as a trait is continuously distributed in the population.

+---------------------------------------------------------------+
| |
| |
|| * |
|| * * |
|| * * |
|| * |* |
|| * |* |
|| ** | ** |
|| *** | *** |
|| *** | *** |
| | ***** | ***** |
| +--------------------------------+----------------> Y |
| not depressed t depressed |

34


| |
+---------------------------------------------------------------+

Figure 1 (draft). Latent continuous variable (depression
severity, Y); and discretizing threshold (t).

In diagnosing a given case, a rater considers the case's level of depression, Y, relative to some
threshold, t: if the judged level is above the threshold, a positive diagnosis is made; otherwise
the diagnosis is negative.

Figure 2 portrays the situation for two raters. It shows the distribution of cases in terms of
depression level as judged by Rater 1 and Rater 2.

Figure 2. Joint distribution (ellipse) of depression
severity as judged by two raters (Y1 and Y2); and
discretizing thresholds (t1 an t2)

a, b, c and d denote the proportion of cases that fall in each region defined by the two raters'
thresholds. For example, a is the proportion below both raters' thresholds and therefore
diagnosed negative by both.

These proportions correspond to a summary of data as a 2 x 2 cross-classification of the raters'
ratings.

+------------------------------------------------+
| |
| Rater 1 |
| - + |
| +-------+-------+ |
| -| a | b |a+b |
| Rater 2 +-------+-------+ |
| +| c | d |c+d |
| +-------+-------+ |

35


| a+c b+d 1 |
| |
+------------------------------------------------+

Figure 3 (draft). Crossclassification proportions
for binary ratings by two raters.

Again, a, b, c and d in Figure 3 represent proportions (not frequencies).

Once we know the observed cross-classification proportions a, b, c and d for a study, it is a
simple matter to estimate the model represented by Figure 2. Specifically, we estimate the
location of the discretizing thresholds, t1 and t2, and a third parameter, rho, which determines
the quot;fatnessquot; of the ellipse. Rho is the tetrachoric correlation, or r*. It can be interpreted here as
the correlation between judged disease severity (before application of thresholds) as viewed by
Rater 1 and Rater 2.

The principle of estimation is simple: basically, a computer program tries various combinations
for t1, t2 and r* until values are found for which the expected proportions for a, b, c and d in
Figure 2 are as close as possible to the observed proportions in Figure 3. The parameter values
that do so are regarded as (estimates of) the true, population values.

The polychoric correlation, used when there are more than two ordered rating levels is a
straightforward extension of the model above. The difference is that there are more thresholds,
more regions in Figure 2, and more cells in Figure 3. But again the idea is to find the values for
thresholds and r* that maximize similarity between model-expected and observed cross-
classification proportions.

36


7. Detailed Description

7.0 Introduction

In many situations, even though a trait may be continuous, it may be convenient to divide it into
ordered levels. For example, for research purposes, one may classify levels of headache pain
into the categories none, mild, moderate and severe. Even for trait usually viewed as discrete,
one might still consider continuous gradations--for example, people infected with the flu virus
exhibit varying levels of symptom intensity.

The tetrachoric correlation and polychoric correlation coefficients are appropriate when the
latent trait that forms the basis of ratings can be viewed as continuous. We will outline here the
measurement model and assumptions for the tetrachoric correlation. The model and
assumptions for the polychoric correlation are the same--the only difference is that there are
more threshold parameters for the polychoric correlations, corresponding to the greater number
ordered rating levels.

7.1 Measurement Model

We begin with some notation and definitions. Let:

X1 and X2 be the manifest (observed) ratings by Raters (or procedures, diagnostic tests,
etc.) 1 and 2; these are discrete-valued variables;

Y1, Y2 be latent continuous variables associated with X1 and X2; these are the pre-
discretized, continuous quot;impressionsquot; of the trait level, as judged by Raters 1 and 2;

T be the true, latent trait level of a case.

A rating or diagnosis of a case begins with the case's true trait level, T. This information, along
with quot;noisequot; (random error) and perhaps other information unrelated to the true trait which a
given rater may consider (unique variation), leads to each rater's impression of the case's trait
level (Y1 and Y2). Each rater applies discretizing thresholds to this judged trait level to yield a
dichotomous or ordered-category rating (X1 and X2).

37


Stated more formally, we have:

Y1 = bT + u1 + e1,
Y2 = bT + u2 + e2,

where b is a regression coefficient, u1 and u2 are the unique components of the raters'
impressions, and e1 and e2 represent random error or noise. It turns out that unique variation
and error variation behave more or less the same in the model, and the former can be subsumed
under the latter. Thus we may consider the simpler model:

Y1 = b1T + e1,
Y2 = b2T + e2.

The tetrachoric correlation assumes that the latent trait T is normally distributed. As scaling is
arbitrary, we specify that T ~ N(0, 1). Error is similarly assumed to be normally distributed (and
independent both between raters and across cases). For reasons we need not pursue here, the
model loses no generality by assuming that var(e1) = var(e2). We therefore stipulate that e1, e2
~ N(0, sigmae). A consequence of these assumptions is that Y1 and Y2 must also be normally
distributed. To fix the scale, we specify that var(Y1) = var(Y2) = 1. It follows that b1 = b2 = b =
the correlation of both Y1 and Y2 with the latent trait.

We define the tetrachoric correlation, r*, as

r* = b2

A simple quot;path diagramquot; may clarify this:

+-------------------------------------+
| |
| |
| bb |
| Y1 <--- T ---> Y2 |
| |
| |
+-------------------------------------+

Figure 4 (draft). Path diagram.

Here b is the path coefficient that reflects the influence of T on both Y1 and Y2. Those familiar
with the rules of path analysis will see that the correlation of Y1 and Y2 is simply the product
of their degree of dependence on T--that is b2.

As an aside, one might consider that the value of b is interesting in its own right, inasmuch as it
offers a measure of the association of ratings with the true latent trait--i.e., a measure of rating
validity or accuracy.

The tetrachoric correlation r* is readily interpretable as a measure of the association between
the ratings of Rater 1 and Rater 2. Because it estimates the correlation that exists between the
pre-discretized judgements of the raters, it is, in theory, not affected by (1) the number of rating

38


levels, or (2) the marginal proportions for rating levels (i.e., the 'base rates.') The fact that this
association is expressed in the familiar form of a correlation is also helpful.

The assumptions of the tetrachoric correlation coefficient may be expressed as follows:

• The trait on which ratings are based is continuous.
• The latent trait is normally distributed.
• Rating errors are normally distributed.
• Var(e) is homogeneous across levels of T.
• Errors are independent between raters.
• Errors are independent between cases.

Assumptions 1--4 can be alternatively expressed as the assumption that Y1 and Y2 follow a
bivariate normal distribution.

We will assume that the one has sufficient theoretical understanding of the application to accept
the assumption of latent continuity.

The second assumption--that of a normal distribution for T--is potentially more questionable.
Absolute normality, however, is probably not necessary; a unimodal, roughly symmetrical
distribution may be close enough. Also, the model implicitly allows for a monotonic
transformation of the latent continuous variables. That is, a more exact way to express
Assumptions 1-4 is that one can obtain a bivariate normal distribution by some monotonic
transformation of Y1 and Y2.

The model assumptions can be tested for the polychoric correlation. This is done by comparing
the observed numbers of cases for each combination of rating levels with those predicted by the
model. This is done with the likelihood ratio chi-squared test, G2 (Bishop, Fienberg & Holland,
1975), which is similar the usual Pearson chi-squared test (the Pearson chi-square test can also
be used; for more information on these tests, see the FAQ for testing model fit on the Latent
Class Analysis web site.

The G2 test is assessed by considering the associated p value, with the appropriate degrees of
freedom (df). The df are given by:

df = RC - R - C
where R is the number of levels used by the first rater and C is the number of levels used by the
second rater. As this is a quot;goodness-of-fitquot; test, it is standard practice to set the alpha level fairly
high (e.g., .10). A p value lower than the alpha level is evidence of model fit.

For the tetrachoric correlation R = C = 2, and there are no df with which to test the model. It is
possible to test the model, though, when there are more than two raters.

7.2 Using the Polychoric Correlation to Measure Agreement

39


Here are the steps one might follow to use the tetrachoric or polychoric correlation to assess
agreement in a study. For convenience, we will mainly refer to the polychoric correlation,
which includes the tetrachoric correlation as a special case.

i) Calculate the value of the polychoric correlation.

For this a computer program, such as those described in the software section, is required.

ii) Evaluate model fit.

The next step is to determine if the assumptions of the polychoric correlation are empirically
valid. This is done with the goodness-of-fit test that compares observed crossclassification
frequencies to model-predicted frequencies described previously. As noted, this test cannot be
done for the tetrachoric correlation.

PRELIS includes a test of model fit when estimating the polychoric correlation. It is unknown
whether SAS PROC FREQ includes such a test.

iii) Assess magnitude and significance of correlation.

Assuming that model fit is acceptable, the next step is to note is the magnitude of the polychoric
correlation. Its value is interpreted in the same way as a Pearson correlation. As the value
approaches 1.0, more agreement on the trait definition is indicated. Values near 0 indicate little
agreement on the trait definition.

One may wish to test the null hypothesis of no correlation between raters. There are at least two
ways to do this. The first makes use of the estimated standard error of the polychoric correlation
under the null hypothesis of r* = 0. At least for the tetrachoric correlation, there is a simple
closed-form expression for this standard error (Brown, 1977). Knowing this value, one may
calculate a z value as:

r*
z = -----------
sigmar*(0)

where the denominator is the standard error of r* where r* = 0. One may then assess statistical
significance by evaluating the z value in terms of the associated tail probabilities of the standard
normal curve.

The second method is via a chi-squared test. If r* = 0, the polychoric correlation model is the
same as the model of statistical independence. It therefore seems reasonable to test the null
hypothesis of r* = 0 by testing the statistical independence model. Either the Pearson (X2) or
likelihood-ratio (G2) chi-squared statistics can be used to test the independence model. The df
for either test is (R - 1)(C - 1). A significant chi-squared value implies that r* is not equal to 0.

[I now question whether the above is correct. For the polychoric correlation, data may fail the
test of independence even with when r* = 0 (i.e., there may be some other kind of 'structure' to
the data). If so, a better alternative would be to calculate a difference G2 statistic as:

G2H0 - G2H1,

40

Statistical Methods

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (16)

Similar to Statistical Methods

Similar to Statistical Methods (20)

Statistical Methods