Interrater reliability

A stratified permutation test for multi-rater inter-rater reliability.

There are S strata. There are N_s items in stratum s. There are N = \sum_{s=1}^S N_s items in all.

There are C non-exclusive categories to which each of the N items might belong; an item might belong to none of the categories. That is, each item might be “labeled” with any of the 2^C subsets of the C labels, including the empty set.

There are R “raters,” each of whom labels each of the N items with zero or more of the C labels.

Define L_{s,i,c,r} = 1, if rater r assigns label c to item i in stratum s; and L_{s,i,c,r} = 0 if not.

We observe \{ L_{s,i,c,r} \} for s=1...S; i=1, ..., N_s; c=1, ..., C; and r=1, ..., R.

We want to know whether the categorizations are “reliable,” in the sense that agreement among the raters is higher than would be expected “by chance.” The reliability of each category c is of interest, rather than an overall rating for all C categories.

Fix c, since we are considering only one category at a time.

The null hypothesis for category c is that, for each rater r, and each stratum s, the values \{ L_{s,i,c,r} \} are exchangeable; that for each rater r, the values \{ L_{s,i,c,r} \} for different strata s are independent; and that the values are independent across raters.

Our test conditions on the sets of labels each rater assigns within each stratum, but not on the items to which those labels are assigned. The null distribution involves permuting the assignments each given rater makes of category c to items within each stratum s, permuting independently across raters and across strata.
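The permutation step described above can be sketched in NumPy as follows. This is a minimal illustration for a single stratum and a single category, not the library's implementation; the helper name permute_within_rows and the example matrix are hypothetical. Each row holds one rater's 0/1 labels, so shuffling a row keeps that rater's number of assigned labels fixed, which is what the test conditions on.

```python
import numpy as np

def permute_within_rows(ratings, rng):
    """Shuffle each rater's labels across items, independently per rater.

    `ratings` has shape [R, Ns]: one row per rater, one column per item.
    (Hypothetical helper for illustration only.)
    """
    ratings = np.asarray(ratings)
    # rng.permutation returns a shuffled copy of each row; rows are
    # shuffled independently, so row sums are preserved.
    return np.array([rng.permutation(row) for row in ratings])

rng = np.random.default_rng(12345)
ratings = np.array([[1, 0, 1, 1, 0],
                    [1, 0, 0, 1, 0],
                    [0, 0, 1, 1, 1]])
shuffled = permute_within_rows(ratings, rng)
```

With several strata, the same step would be applied to each stratum's matrix separately, so that labels never move between strata.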

The test statistic within stratum s is

\rho_s \equiv \frac{1}{N_s {R \choose 2}} \sum_{i=1}^{N_s}
\sum_{r=1}^{R-1} \sum_{v=r+1}^R 1(L_{s,i,r} = L_{s,i,v})
= \frac{1}{N_s R(R-1)} \sum_{i=1}^{N_s}
  (y_{si}(y_{si}-1) + (R-y_{si})(R-y_{si}-1)).

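That is, within each stratum, we count the number of concordant pairs of assignments. Here y_{si} \equiv \sum_{r=1}^R L_{s,i,r} is the number of raters who assign category c to item i in stratum s; the index c is suppressed because c is fixed. If all R raters agree whether item i in stratum s belongs to category c, item i contributes a term {R \choose 2} to the sum. If the raters split evenly (R even, with half assigning the category and half not), item i contributes 2 {R/2 \choose 2} to the sum. The normalization makes perfect agreement within stratum s correspond to \rho_s = 1.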
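As a check on the two forms of \rho_s, the following sketch computes the statistic both by counting concordant pairs directly and by the closed form in the per-item counts y_{si}. The function names rho_pairwise and rho_closed_form and the example matrix are hypothetical and chosen only for illustration.

```python
import numpy as np
from itertools import combinations

def rho_pairwise(ratings):
    """Concordance by direct count over all pairs of raters."""
    ratings = np.asarray(ratings)
    R, Ns = ratings.shape
    agree = sum(np.sum(ratings[r] == ratings[v])
                for r, v in combinations(range(R), 2))
    return agree / (Ns * R * (R - 1) / 2)

def rho_closed_form(ratings):
    """Concordance from y_i, the number of raters labeling item i."""
    ratings = np.asarray(ratings)
    R, Ns = ratings.shape
    y = ratings.sum(axis=0)
    return np.sum(y * (y - 1) + (R - y) * (R - y - 1)) / (Ns * R * (R - 1))

ratings = np.array([[1, 0, 1, 1, 0],
                    [1, 0, 0, 1, 0],
                    [0, 0, 1, 1, 1]])
rho_a = rho_pairwise(ratings)      # 9 concordant pairs out of 15
rho_b = rho_closed_form(ratings)   # 18 / 30
```

Both forms give 0.6 for this matrix. The closed form avoids the loop over rater pairs, which matters when the statistic is recomputed for every permutation.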

To combine the results across strata to get an overall p-value, we could use any of the methods we’ve discussed, or the NPC (nonparametric combination of tests) methods described in Pesarin and Salmaso, based on the p-values in different strata. For instance, Fisher’s combination statistic is

\lambda = - \sum_{s=1}^S w_s \log \hat{p}_s,

where the nonnegative weights \{w_s\} are chosen in some sensible manner (e.g., w_s = N_s^{-1/2} would be reasonable).
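A minimal sketch of the weighted combination, assuming the example weights w_s = N_s^{-1/2} mentioned above. The function name weighted_fisher and the input values are hypothetical. The per-stratum p-values must be strictly positive for the logarithm to be finite.

```python
import numpy as np

def weighted_fisher(pvalues, sizes):
    """Weighted Fisher combination: -sum_s w_s * log(p_s), w_s = N_s**-0.5."""
    pvalues = np.asarray(pvalues, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    weights = sizes ** -0.5
    return -np.sum(weights * np.log(pvalues))

# Two strata with 4 and 16 items: weights are 0.5 and 0.25.
lam = weighted_fisher([0.5, 0.5], [4, 16])
```

Larger values of \lambda correspond to smaller per-stratum p-values, so the combined test rejects for large \lambda.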

permute.irr.compute_ts(ratings)

Compute the test statistic

\rho_s \equiv \frac{1}{N_s {R \choose 2}} \sum_{i=1}^{N_s}
\sum_{r=1}^{R-1} \sum_{v=r+1}^R 1(L_{s,i,r} = L_{s,i,v})
= \frac{1}{N_s R(R-1)} \sum_{i=1}^{N_s}
  (y_{si}(y_{si}-1) + (R-y_{si})(R-y_{si}-1)).

Parameters
ratings : array_like

Input array of dimension [R, Ns]. Each row corresponds to the ratings given by a single rater; columns correspond to items rated.

Returns
rho_s : float

concordance of the ratings, where perfect concordance is 1.0

permute.irr.simulate_npc_dist(perm_distr, size, obs_ts=None, pvalues=None, plus1=True)

Simulates the permutation distribution of the combined NPC test statistic for S matrices of ratings corresponding to S strata. The distribution comes from applying simulate_ts_dist to each of the S strata.

At least one of obs_ts and pvalues must be supplied: obs_ts gives the observed per-stratum concordances \rho_s, and pvalues gives the corresponding per-stratum p-values.

If keep_dist, return the distribution of values of the test statistic; otherwise, return only the number of permutations for which the value of the irr test statistic is at least as large as obs_ts.

Parameters
perm_distr : array_like

Input array of dimension [B, S]. Column s is the permutation distribution of \rho_s, for s=1,…,S.

size : array_like

Input array of dimension S. Each entry corresponds to the number of items, Ns, in the s-th stratum.

obs_ts : array_like

Optional input array of dimension S. The s-th entry is \rho_s, the concordance for the s-th stratum. If not input, pvalues must be specified.

pvalues : array_like

Optional input array of dimension S. The s-th entry is the p-value corresponding to \rho_s, the concordance for the s-th stratum. If not input, obs_ts must be specified.

plus1 : bool

flag for whether to add 1 to the numerator and denominator of the p-value based on the empirical permutation distribution. Default is True.

Returns
dict

A dictionary containing:

obs_npc : float

observed value of the combined test statistic for the input data, or the input value of obs_ts if obs_ts was given as input

pvalue : float

A single p-value for the global test, based on how often the values in the permutation distribution of the combined IRR statistic are at least as extreme as obs_npc.

num_perm : int

number of permutations
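One way such a combined distribution could be formed is sketched below. This is an assumption-laden illustration rather than the library's code: the function names npc_sketch and upper_tail_p are hypothetical, the combining function is assumed to be the weighted Fisher statistic with w_s = N_s^{-1/2}, and each permutation's per-stratum p-value is taken from the same column of perm_distr.

```python
import numpy as np

def upper_tail_p(values, column, add):
    """P-value(s) of `values`: share of `column` entries >= each value."""
    sorted_col = np.sort(column)
    # searchsorted(..., side="left") counts entries strictly below each value
    counts = column.size - np.searchsorted(sorted_col, values, side="left")
    return (counts + add) / (column.size + add)

def npc_sketch(perm_distr, sizes, obs_ts, plus1=True):
    """Combine per-stratum permutation distributions (hypothetical sketch)."""
    perm_distr = np.asarray(perm_distr, dtype=float)
    obs_ts = np.asarray(obs_ts, dtype=float)
    weights = np.asarray(sizes, dtype=float) ** -0.5  # assumed weights
    B, S = perm_distr.shape
    add = 1 if plus1 else 0  # with plus1=False a p-value of 0 is possible

    def combine(p):
        return -np.sum(weights * np.log(p), axis=-1)

    obs_p = np.array([upper_tail_p(obs_ts[s], perm_distr[:, s], add)
                      for s in range(S)])
    perm_p = np.column_stack([upper_tail_p(perm_distr[:, s], perm_distr[:, s], add)
                              for s in range(S)])
    obs_npc = combine(obs_p)
    perm_npc = combine(perm_p)
    pvalue = (np.sum(perm_npc >= obs_npc) + add) / (B + add)
    return {"obs_npc": obs_npc, "pvalue": pvalue, "num_perm": B}

rng = np.random.default_rng(0)
perm_distr = rng.uniform(0.0, 0.9, size=(99, 2))  # made-up distributions
result = npc_sketch(perm_distr, sizes=[10, 20], obs_ts=[1.0, 1.0])
```

In this made-up example the observed concordances exceed every permutation value in both strata, so the combined p-value is the smallest attainable with 99 permutations and plus1=True, namely 1/100.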

permute.irr.simulate_ts_dist(ratings, obs_ts=None, num_perm=10000, keep_dist=False, seed=None, plus1=True)

Simulates the permutation distribution of the irr test statistic for a matrix of ratings, ratings.

If obs_ts is None, computes the reference value of the test statistic from the original data before the first permutation. Otherwise, uses the supplied value of obs_ts for comparison.

If keep_dist, return the distribution of values of the test statistic; otherwise, return only the number of permutations for which the value of the irr test statistic is at least as large as obs_ts.

Parameters
ratings : array_like

Input array of dimension [R, Ns]

obs_ts : float

if None, obs_ts is calculated as the value of the test statistic for the original data

num_perm : int

number of random permutations of the elements of each row of ratings

keep_dist : bool

flag for whether to store and return the array of values of the irr test statistic

seed : RandomState instance or {None, int, RandomState instance}

If None, the pseudorandom number generator is the RandomState instance used by np.random; If int, seed is the seed used by the random number generator; If RandomState instance, seed is the pseudorandom number generator

plus1 : bool

flag for whether to add 1 to the numerator and denominator of the p-value based on the empirical permutation distribution. Default is True.

Returns
dict

A dictionary containing:

obs_ts : float

observed value of the test statistic for the input data, or the input value of obs_ts if obs_ts was given as input

geq : int

number of iterations for which the test statistic was greater than or equal to obs_ts

num_perm : int

number of permutations

pvalue : float

geq / num_perm, or (geq + 1) / (num_perm + 1) if plus1 is True

dist : array-like

if keep_dist, the array of values of the irr test statistic from the num_perm iterations. Otherwise, None.
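The procedure documented above can be sketched for one stratum as follows. This is an illustrative re-implementation under stated assumptions, not the library's code: the names concordance and permutation_pvalue are hypothetical, the generator is np.random.default_rng rather than the RandomState described above, and the distribution is always returned.

```python
import numpy as np

def concordance(ratings):
    """Closed-form rho_s for an [R, Ns] 0/1 matrix."""
    R, Ns = ratings.shape
    y = ratings.sum(axis=0)
    return np.sum(y * (y - 1) + (R - y) * (R - y - 1)) / (Ns * R * (R - 1))

def permutation_pvalue(ratings, num_perm=1000, seed=None, plus1=True):
    """Permutation test of concordance for one stratum (hypothetical sketch)."""
    ratings = np.asarray(ratings)
    rng = np.random.default_rng(seed)  # assumption: Generator, not RandomState
    obs_ts = concordance(ratings)
    dist = np.empty(num_perm)
    for b in range(num_perm):
        # shuffle each rater's labels across items, independently per rater
        shuffled = np.array([rng.permutation(row) for row in ratings])
        dist[b] = concordance(shuffled)
    geq = int(np.sum(dist >= obs_ts))
    add = 1 if plus1 else 0
    return {"obs_ts": obs_ts, "geq": geq, "num_perm": num_perm,
            "pvalue": (geq + add) / (num_perm + add), "dist": dist}

# Four raters in perfect agreement on ten items (made-up data).
row = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
ratings = np.tile(row, (4, 1))
res = permutation_pvalue(ratings, num_perm=200, seed=42)
```

Adding 1 to the numerator and denominator keeps the estimated p-value strictly positive, which also keeps its logarithm finite when it is later fed into a Fisher-type combination across strata.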