Is there a canonical 'correct' way to make calculations based on factor levels?

Question

OK, so I've read the question "Confusion between factor levels and factor labels", but I still feel like I am missing a lot. So this is maybe not a question per se, more a presentation of my frustration.

Sample data

sample <- dput(structure(list(Logistik_1 = structure(c(3L, 2L, 3L, 3L, 3L, 4L), .Label = c("I meget ringe grad", "I ringe grad", "I nogen grad", "I høj grad", "I meget høj grad"), class = "factor"),
                              Logistik_2 = structure(c(4L, 4L, 4L, 3L, 3L, 4L), .Label = c("I meget ringe grad", "I ringe grad", "I nogen grad", "I høj grad", "I meget høj grad"), class = "factor"),
                              Logistik_3 = structure(c(3L, 4L, 3L, 4L, 3L, 4L), .Label = c("I meget ringe grad", "I ringe grad", "I nogen grad", "I høj grad", "I meget høj grad"), class = "factor"),
                              Logistik_4 = structure(c(4L, 2L, 3L, 4L, 2L, 3L), .Label = c("I meget ringe grad", "I ringe grad", "I nogen grad", "I høj grad", "I meget høj grad"), class = "factor")),
                         .Names = c("Logistik_1","Logistik_2", "Logistik_3", "Logistik_4"), row.names = c(NA, 6L), class = "data.frame"))

The output of sample shows me the labels.

    Logistik_1   Logistik_2   Logistik_3   Logistik_4
1 I nogen grad   I høj grad I nogen grad   I høj grad
2 I ringe grad   I høj grad   I høj grad I ringe grad
3 I nogen grad   I høj grad I nogen grad I nogen grad
4 I nogen grad I nogen grad   I høj grad   I høj grad
5 I nogen grad I nogen grad I nogen grad I ringe grad
6   I høj grad   I høj grad   I høj grad I nogen grad

I cannot do calculations on these nominal data with rowSums(sample):

Error in rowSums(sample) : 'x' must be numeric

I can convert every single variable to numeric. E.g. if I want to add all the integer values I can do this: sample$test <- as.numeric(sample[[1]])+as.numeric(sample[[2]])+as.numeric(sample[[3]])+as.numeric(sample[[4]]) which works. But that is a lot of typing, I think?

However, if I cbind the columns, the output shows the underlying integer values instead of the labels. Output of with(sample, cbind(Logistik_1, Logistik_2)):

     Logistik_1 Logistik_2
[1,]          3          4
[2,]          2          4
[3,]          3          4
[4,]          3          3
[5,]          3          3
[6,]          4          4

And I can do calculations on these integer values. E.g. if I want to add all of them I can do this: sample$total_score <-with(sample, rowSums(cbind(Logistik_1, Logistik_2, Logistik_3, Logistik_4))) [a]

    Logistik_1   Logistik_2   Logistik_3   Logistik_4 total_score
1 I nogen grad   I høj grad I nogen grad   I høj grad          14
2 I ringe grad   I høj grad   I høj grad I ringe grad          12
3 I nogen grad   I høj grad I nogen grad I nogen grad          13
4 I nogen grad I nogen grad   I høj grad   I høj grad          14
5 I nogen grad I nogen grad I nogen grad I ringe grad          11
6   I høj grad   I høj grad   I høj grad I nogen grad          15

But I am confused, and I think I am making something simple too complicated. Is there a canonical 'correct' way to make calculations on factor levels? Is as.numeric more correct than cbind? And why does cbind work like this to begin with?

My hope was that something like this would work: sum(as.numeric(sample[1:4])) - but that returns Error: (list) object cannot be coerced to type 'double' (because I am calling as.numeric on a data frame).

[a] I am aware that most statisticians will frown upon the common practice of assigning integer values to survey responses (e.g. "Highly agree" = 5, "Agree somewhat" = 4 etc.) - but please just accept that's how we do it in the social sciences :-). The labels are responses in a survey and the levels are the integer values assigned to those responses.

I'd point this: http://stackoverflow.com/questions/1632772/appending-rows-to-a-dataframe-the-factor-problem as a common caveat regarding factors. :-) — Ferdinand.kraft, Aug 01 '13 at 14:36
hong and @Alexander. Thanks for the replies. As I stated in my footnote I am aware that 'real' statisticians cringe when people like me do this sort of thing. However - in social sciences, psychology, business research this is a perfectly accepted method to construct robust indexes of everything from political orientation, personality traits and customer satisfaction (of course these indexes are checked statistically). The sapply solution makes me very happy - but the 'data.matrix' was probably what I was looking for. Thanks again. — Andreas, Aug 01 '13 at 16:15
@Ferdinand.kraft thanks. stringsAsFactors=False - is not what I need here. But good to be reminded :-) — Andreas, Aug 01 '13 at 16:19
packages such as spss are very good at dealing with 'factors' that also have a numerical value. - Just saying... — Andreas, Aug 01 '13 at 16:21

score 4 · Answer 1 · answered Aug 01 '13 at 14:28

The fact that you can convert factor variables to integer isn't something you should consider as useful for analytical purposes. R stores factors internally as integers, with each number corresponding to a different level: this is simply more efficient than replicating the factor labels for every observation. But those numbers don't necessarily correspond to anything that makes sense in the outside world, and by default they're assigned simply by sorting the labels in alphabetical order.
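
To see how the codes depend on level order, here is a minimal sketch. The vector x and its labels are made up for illustration and are not taken from the question's data.

```r
# Hypothetical responses, not taken from the question's data
x <- c("low", "high", "medium", "low")

# Default: levels are the sorted unique values, so codes follow the alphabet
f_default <- factor(x)
levels(f_default)       # "high" "low" "medium"
as.integer(f_default)   # 2 1 3 2

# Stating the level order explicitly makes the codes follow the intended scale
f_explicit <- factor(x, levels = c("low", "medium", "high"))
as.integer(f_explicit)  # 1 3 2 1
```

So the integers are only meaningful if the levels were declared in the intended order when the factor was created.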

So yes, you can do arithmetic on factors by converting them to integers. That doesn't mean you should do it. If you want to analyse ordinal data like Likert scales, use functions designed for the purpose.
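
As one possible illustration (an assumption about what could be meant here, not a recommendation of a specific method), base R lets you declare a factor as ordered, after which comparisons and summaries respect the scale without any arithmetic on the codes. The scale labels below are hypothetical.

```r
# Hypothetical five-point scale; the labels are assumptions for illustration
scale_labels <- c("very low", "low", "some", "high", "very high")
resp <- factor(c("some", "low", "high", "some"),
               levels = scale_labels, ordered = TRUE)

resp >= "some"   # TRUE FALSE TRUE TRUE
max(resp)        # high
table(resp)      # counts per level, listed in scale order
```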

IRTFM · Accepted Answer · 2015-10-16T16:53:37.227

4

The other respondents have clearly laid out the case against doing arithmetic on factors, but if such coercion were meaningful (say by having some ordinal interpretation), then this code, which coerces to a matrix, would be reasonably compact:

> rowSums(data.matrix(sample))
 1  2  3  4  5  6 
14 12 13 14 11 15

It would not alter the value of sample. BTW there is a very useful function named sample, so it would be better to avoid using that particular name while coding.
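
A small self-contained sketch of the same idea, using a made-up data frame called survey (a hypothetical name, chosen so that it does not reuse the name of base::sample) with levels declared in scale order:

```r
# Hypothetical survey data; 'survey' avoids reusing the name of base::sample
lv <- c("low", "medium", "high")
survey <- data.frame(
  q1 = factor(c("medium", "low", "high"), levels = lv),
  q2 = factor(c("high", "high", "low"), levels = lv)
)

m <- data.matrix(survey)   # factor columns become their integer codes
m
rowSums(m)                 # 5 4 4
is.factor(survey$q1)       # TRUE: the original data frame is unchanged
```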

edited Oct 16 '15 at 16:53

answered Aug 01 '13 at 15:12


I didn't know about data.matrix. Thanks! – Andreas Aug 01 '13 at 16:07

score 3 · Answer 3 · answered Aug 01 '13 at 14:39

The theory is that if you're storing something as a factor, then you don't want to do calculations on it! What does it mean to add the numbers? Why should "Highly agree"+"Neither agree nor disagree" equal 8?

Instead of

sample$total_score <-with(sample, rowSums(cbind(Logistik_1, Logistik_2, Logistik_3, Logistik_4)))

you might prefer to use something like

sample$total_score <- sapply(1:nrow(sample),function(n) sum(as.numeric(sample[n,])))

so that you don't have to type the names of all the columns.
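
A column-wise variant of the same idea (a different technique from the row-wise sapply above, shown here only as a sketch) converts each selected column to its integer codes and then sums across rows. The data frame survey and the vector item_cols are hypothetical names for illustration.

```r
# Hypothetical survey data, named 'survey' for illustration
lv <- c("low", "medium", "high")
survey <- data.frame(
  q1 = factor(c("medium", "low", "high"), levels = lv),
  q2 = factor(c("high", "high", "low"), levels = lv)
)

item_cols <- c("q1", "q2")                      # columns to score
codes <- sapply(survey[item_cols], as.integer)  # matrix of integer codes
survey$total_score <- rowSums(codes)            # 5 4 4
```

Selecting the item columns by name first means that re-running the code will not fold an existing total_score column into the sum.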
