Numeric comparisons with NA values causing bad subsets in R

Question

Can somebody explain to me why logical evaluations that resolve to NA produce bogus rows in vector-comparison-based subsets? For example:

employee <- c("Big Shot CEO", "Programmer","Intern","Guy Who Got Fired Last Week")
salary <-   c(      10000000,        50000,       0,                           NA)
emp_salary <- data.frame(employee,salary)

# how many employees paid over 100K?
nrow(emp_salary[salary>100000,]) # Returns 2 instead of 1 -- why?

emp_salary[salary>100000,]
# returns a bogus row of all NA's (not "Guy Who Got Fired")
#        employee salary
# 1  Big Shot CEO  1e+07
# NA         <NA>     NA

salary[salary>100000]
# returns:
# [1] 1e+07    NA

NA > 100000 #returns NA

Given this unexpected behavior, what is the preferred way to count employees making over 100K in the above example?

Ben Bolker · Accepted Answer · 2014-06-03T17:32:34.660

First of all, you probably don't want to combine the columns with cbind() before building the data frame -- that will coerce all of your variables to character. Pass the vectors to data.frame() directly:

 emp_salary <- data.frame(employee,salary)

Two possible solutions:

subset() automatically excludes cases where the criterion is NA:

nrow(subset(emp_salary,salary>1e5))

Count the TRUE results directly with sum(), using na.rm=TRUE:

sum(salary>1e5,na.rm=TRUE)
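
If you would rather keep indexing with `[`, two other common idioms drop the NA cases as well. This is a minimal sketch that rebuilds the question's data; the names `over`, `n_which`, `n_notna` and `n_sum` are only illustrative and do not appear in the original post.

```r
employee <- c("Big Shot CEO", "Programmer", "Intern", "Guy Who Got Fired Last Week")
salary   <- c(10000000, 50000, 0, NA)
emp_salary <- data.frame(employee, salary)

# Logical test; the last element is NA because the salary is unknown
over <- emp_salary$salary > 1e5

# which() returns only the positions that are TRUE, dropping FALSE and NA
n_which <- nrow(emp_salary[which(over), ])

# Explicitly treat NA as "does not satisfy the criterion"
n_notna <- nrow(emp_salary[!is.na(over) & over, ])

# Counting the TRUE values directly, as in the second solution above
n_sum <- sum(over, na.rm = TRUE)
```

All three approaches count only the rows where the comparison is definitely TRUE.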

As for the logic behind the bogus rows:

bigsal <- salary>1e5 is a logical vector which contains NAs, as it must (because there is no way to know whether an NA value satisfies the criterion or not).
As for what happens when you index the rows of a data frame with a logical vector containing NAs, this is probably the most salient bit of documentation (from help("[")):

When extracting, a numerical, logical or character ‘NA’ index picks an unknown element and so returns ‘NA’ in the corresponding element of a logical, integer, numeric, complex or character result, and ‘NULL’ for a list.

(I searched help("[.data.frame") and couldn't see anything more useful.)
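
The quoted rule is easy to see on a plain vector and a list. A small sketch follows; `x`, `idx`, `vec_result` and `list_result` are made-up names for illustration only.

```r
x   <- c(10, 20, 30)
idx <- c(TRUE, NA, FALSE)

# Atomic vector: TRUE keeps 10, the NA index yields NA, FALSE drops 30
vec_result <- x[idx]

# List: the element picked by the NA index is NULL
list_result <- as.list(x)[idx]
```

In both cases the result has two elements, one for the TRUE index and one placeholder for the NA index.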

The thing to remember is that once the indexing is being done, R no longer has any knowledge that the logical vector was created from the salary column, so there's no way for it to do what you might want, which is to retain the values in the other columns. Here's one way to think about the seemingly strange behaviour of filling in all the columns in the NA row with NAs: if R leaves the row out entirely, that would correspond to the criterion being FALSE. If it retains it (and remember that it can't retain just a few columns and drop the others), then that would correspond to the criterion being TRUE. If the criterion is neither FALSE nor TRUE, then it's hard to see what other behaviour makes sense ...
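
One way to convince yourself that `[` only ever sees a bare logical vector is to type that vector in by hand. This sketch rebuilds the question's data; `idx`, `by_hand`, `by_test`, `same` and `all_na` are hypothetical names introduced here.

```r
employee <- c("Big Shot CEO", "Programmer", "Intern", "Guy Who Got Fired Last Week")
salary   <- c(10000000, 50000, 0, NA)
emp_salary <- data.frame(employee, salary)

# A hand-typed index with the same values that salary > 1e5 evaluates to
idx <- c(TRUE, FALSE, FALSE, NA)

by_hand <- emp_salary[idx, ]
by_test <- emp_salary[emp_salary$salary > 1e5, ]

# The index carries no trace of the salary column, so both results match
same <- identical(by_hand, by_test)

# The row produced by the NA index is NA in every column
all_na <- all(is.na(by_hand[2, ]))
```

Because the hand-typed index has no connection to any column, there is nothing R could use to decide which of the original values to keep in the NA row.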

Thanks, I edited the example to remove cbind, but it was getting un-coerced in the comparison anyway. In my actual problem I had not used cbind, just merged two data sets. Are you able to explain the problem with using the logical evaluation as an index? — C8H10N4O2, Jun 03 '14 at 17:09
FWIW I think it's not R's fault this time. Any time you start trying to deal with the logic of `NA`s in a consistent way things are likely to get weird. — Ben Bolker, Jun 03 '14 at 18:02
Great explanation of what things look like "from R's point of view". — Josh O'Brien, Jun 03 '14 at 18:47
