I found that the irr package has two significant bugs in its calculation of weighted kappa.
Please tell me whether the two bugs are really there or whether I have misunderstood something.
You can reproduce them with the following examples.
First bug: the labels in the confusion matrix are sorted incorrectly.
I have two sets of scores (label and pred) for disease extent (from 0 to 100, where 0 is healthy and 100 is extremely ill).
In label_test.csv (you can copy and paste the data to your disk to run the following test):
0
1
1
1
0
14
53
3
In pred_test.csv:
0
1
1
0
3
4
54
6
In script_r.R:
library(irr)
label <- read.csv('label_test.csv',header=FALSE)
pred <- read.csv('pred_test.csv',header=FALSE)
kapp <- kappa2(data.frame(label,pred),"unweighted")
kappa <- getElement(kapp,"value")
print(kappa) # output: 0.245283
w_kapp <- kappa2(data.frame(label,pred),"equal")
weighted_kappa <- getElement(w_kapp,"value")
print(weighted_kappa) # output: 0.443038
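As a sanity check on the unweighted value, here is a small hand computation of my own, reusing the label and pred data frames loaded in script_r.R:
lev <- sort(unique(c(label[[1]], pred[[1]])))     # categories observed in either column
r1 <- factor(label[[1]], levels = lev)
r2 <- factor(pred[[1]],  levels = lev)
cm <- table(r1, r2)                               # 8 x 8 confusion matrix
po <- sum(diag(cm)) / sum(cm)                     # observed agreement: 3/8
pe <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2  # chance agreement: 11/64
(po - pe) / (1 - pe)                              # 0.245283, matching the kappa2 output above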
When I use Python to calculate kappa and weighted kappa, in script_python.py:
from sklearn.metrics import cohen_kappa_score
import pandas as pd
import numpy as np

label = pd.read_csv('label_test.csv', header=None).to_numpy().ravel()
pred = pd.read_csv('pred_test.csv', header=None).to_numpy().ravel()
kappa = cohen_kappa_score(label.astype(int), pred.astype(int))
print(kappa) # output: 0.24528301886792447
weighted_kappa = cohen_kappa_score(label.astype(int), pred.astype(int), weights='linear', labels=np.array(list(range(100))))
print(weighted_kappa) # output: 0.8359908883826879
We can see that the kappa calculated by R and by Python is the same, but the weighted kappa from irr in R is far lower than the weighted kappa from sklearn in Python. Which one is wrong? After two days of research, I concluded that the weighted kappa from the irr package in R is wrong. Details follow.
During debugging, we can inspect the confusion matrix that irr builds internally and find that its labels are ordered [0, 1, 14, 3, 4, 53, 54, 6]. This order is wrong: it should be [0, 1, 3, 4, 6, 14, 53, 54], as in Python. It seems that the irr package sorts the labels as character strings instead of as numbers, which places 14 in front of 3. This mistake could and should be corrected easily.
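To illustrate the suspected mechanism (this is only a sketch of my own, not the irr source), sorting the labels after they have been converted to character strings reproduces exactly this ordering, and sorting them numerically before building the factor levels would fix it:
vals <- c(0, 1, 14, 3, 4, 53, 54, 6)
sort(vals)               # numeric sort: 0 1 3 4 6 14 53 54
sort(as.character(vals)) # character sort: "0" "1" "14" "3" "4" "53" "54" "6"
lev <- as.character(sort(unique(vals)))  # possible fix: numerically sorted levels
factor(vals, levels = lev)               # levels: 0 1 3 4 6 14 53 54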
Second bug: the confusion matrix built in R is not complete.
In my pred_test.csv and label_test.csv, the values do not cover all possible values from 0 to 100, so the confusion matrix that irr builds by default misses every value that does not appear in the data. This should be fixed.
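To see why the missing values matter for the equal (linear) weights, here is a small illustration of my own, using the standard linear-weight penalty |i - j| / (k - 1), where i and j are positions in the category list and k is its length. Take the real disagreement pair (53, 54) from the data above:
obs_levels  <- c(0, 1, 3, 4, 6, 14, 53, 54)  # only the values seen in the data
full_levels <- 0:100                         # the full 0..100 rating scale
abs(match(54, obs_levels)  - match(53, obs_levels))  / (length(obs_levels)  - 1)  # 1/7, about 0.14
abs(match(54, full_levels) - match(53, full_levels)) / (length(full_levels) - 1)  # 1/100 = 0.01
A one-point disagreement on the 0-100 scale is penalized about 14 times more heavily when only the observed categories are used, which is part of why the irr and sklearn results diverge so much.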
Let's see another example.
In pred_test.csv, let's change the value 54 to 99. Then we run script_r.R and script_python.py again. The results are:
In R:
kappa: 0.245283
weighted_kappa: 0.443038
In Python:
kappa: 0.24528301886792447
weighted_kappa: 0.592891760904685
We can see that the weighted kappa from irr in R does not change at all, while the weighted kappa from sklearn in Python drops from about 0.84 to 0.59. So irr gets it wrong again.
The reason is that sklearn lets us pass the full set of labels to the confusion matrix, so its shape becomes 100 x 100, whereas in irr the labels of the confusion matrix are derived only from the unique values found in label and pred, which misses many other possible values. As a result, changing 54 to 99 does not change the penalty at all: 99 simply takes 54's place as the category adjacent to 53 and receives the same weight. It would therefore be good if the irr package offered an option that lets users provide custom labels, as sklearn does in Python.
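As a workaround until such an option exists, here is a minimal sketch of my own (not irr code) that computes the linear weighted kappa on the full 0-100 scale in R, reusing label and pred from script_r.R and assuming the usual equal weights 1 - |i - j| / (k - 1):
full_levels <- 0:100
r1 <- factor(label[[1]], levels = full_levels)
r2 <- factor(pred[[1]],  levels = full_levels)
cm <- table(r1, r2)                                           # 101 x 101 confusion matrix
k  <- length(full_levels)
w  <- 1 - abs(outer(seq_len(k), seq_len(k), "-")) / (k - 1)   # linear (equal) agreement weights
po <- sum(w * cm) / sum(cm)                                   # observed weighted agreement
pe <- sum(w * outer(rowSums(cm), colSums(cm))) / sum(cm)^2    # expected weighted agreement
(po - pe) / (1 - pe)                                          # weighted kappa on the full scale
The value should be close to the sklearn number above (it may differ slightly because sklearn was given labels 0-99 rather than 0-100).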
