How to delete rows from a dataframe that contain n*NA

Question

I have a number of large datasets with ~10 columns and ~200,000 rows. Not every column contains a value for each row, although at least one column must contain a value for the row to be present. I would like to set a threshold for how many NAs are allowed in a row.

My Dataframe looks something like this:

 ID q  r  s  t  u  v  w  x  y  z
 A  1  5  NA 3  8  9  NA 8  6  4
 B  5  NA 4  6  1  9  7  4  9  3 
 C  NA 9  4  NA 4  8  4  NA 5  NA
 D  2  2  6  8  4  NA 3  7  1  32

I would like to delete the rows that contain more than 2 NA cells, to get:

ID q  r  s  t  u  v  w  x  y  z
 A 1  5  NA 3  8  9  NA 8  6  4
 B 5  NA 4  6  1  9  7  4  9  3 
 D 2  2  6  8  4  NA 3  7  1  32

complete.cases removes every row containing any NA, and I know one can delete rows that contain NA in certain columns. Is there a way to filter on how many NAs a row contains in total, regardless of which columns they are in?

Alternatively, this dataframe is generated by merging several dataframes using

    file1<-read.delim("~/file1.txt")
    file2<-read.delim(file=args[1])

    file1<-merge(file1,file2,by="chr.pos",all=TRUE)

Perhaps the merge function could be altered?

Thanks

Hugh · Answer 1 · 2013-08-08T01:40:57.820

17

Use rowSums. To remove rows from a data frame (df) that contain precisely n NA values:

df <- df[rowSums(is.na(df)) != n, ]

or to remove rows that contain n or more NA values:

df <- df[rowSums(is.na(df)) < n, ]

In both cases, replace n with the required number. For the example in the question (remove rows with more than 2 NAs), use the second form with n = 3.
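As a minimal sketch, here is the second form applied to the sample data from the question. The data frame is transcribed by hand from the question's table, and n = 3 is chosen because "more than 2 NAs" is the same as "3 or more NAs".

```r
# Sample data transcribed from the question's table.
df <- data.frame(
  ID = c("A", "B", "C", "D"),
  q  = c(1, 5, NA, 2),
  r  = c(5, NA, 9, 2),
  s  = c(NA, 4, 4, 6),
  t  = c(3, 6, NA, 8),
  u  = c(8, 1, 4, 4),
  v  = c(9, 9, 8, NA),
  w  = c(NA, 7, 4, 3),
  x  = c(8, 4, NA, 7),
  y  = c(6, 9, 5, 1),
  z  = c(4, 3, NA, 32)
)

n <- 3  # drop rows with n or more NAs, i.e. more than 2

# NA count per row: A has 2, B has 1, C has 4, D has 1.
na_per_row <- rowSums(is.na(df))

# Keep only the rows with fewer than n NAs (A, B and D).
result <- df[na_per_row < n, ]
```

Row C is the only one with more than 2 NAs, so it is the only row removed.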

edited Aug 08 '13 at 01:40

answered Aug 08 '13 at 01:25

Hugh

2

+1 for the use of `n`. You might want to explain what n is meant to represent though. – Ricardo Saporta Aug 08 '13 at 01:34
This generates a new column named `row.names` in `df`, why is that? This is one of the R phenomena that I just do not understand. Sometimes functions output extra stuff that I don't expect. – Zhubarb Dec 04 '13 at 10:17

Ricardo Saporta · Answer 2 · 2013-08-08T01:33:46.523

If dat is the name of your data.frame the following will return what you're looking for:

keep <- rowSums(is.na(dat)) <= 2
dat <- dat[keep, ]

What this is doing:

is.na(dat)
# returns a logical matrix (TRUE/FALSE)
# note that when adding logicals,
# TRUE counts as 1 and FALSE as 0

rowSums(.)
# quickly computes the total per row,
# which here is the number of NAs
# in each row

rowSums(.) <= 2
# for each row, determine whether the sum
# (the number of NAs) is at most 2,
# the limit asked for in the question.
# Returns TRUE/FALSE accordingly

We use the output of this last statement to identify which rows to keep. Note that it is not necessary to actually store this last logical.
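To make the intermediate steps concrete, here is a small sketch on a toy data frame. The column names and values are made up for illustration and are not from the question. The last line also shows the same filter written without storing the logical vector. The threshold of 2 reflects the question's requirement of keeping rows with at most 2 NAs.

```r
# Toy data frame; names and values are illustrative assumptions.
dat <- data.frame(
  x = c(1, NA, NA),
  y = c(NA, NA, 2),
  z = c(3, NA, 4)
)

# Step 1: logical matrix marking each missing cell.
na_matrix <- is.na(dat)

# Step 2: number of NAs in each row (1, 3 and 1 here).
na_counts <- rowSums(na_matrix)

# Steps 1-3 in a single expression, without an intermediate `keep` vector:
# keep rows with at most 2 NAs.
dat <- dat[rowSums(is.na(dat)) <= 2, ]
```

Only the second row, where every value is missing, is dropped.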

score 2 · Answer 3 · answered Aug 08 '13 at 01:25

If d is your data frame, try this:

d <- d[rowSums(is.na(d)) <= 2, ]
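If the same filter is needed for several datasets, or right after each merge, it could be wrapped in a small function. The sketch below is one possible way to do that. The name `drop_na_rows` and its `max_na` argument are hypothetical, not part of base R.

```r
# Hypothetical helper: keep rows with at most `max_na` missing values.
drop_na_rows <- function(d, max_na) {
  stopifnot(is.data.frame(d), is.numeric(max_na), length(max_na) == 1)
  keep <- rowSums(is.na(d)) <= max_na
  # drop = FALSE keeps the result a data frame even with a single column.
  d[keep, , drop = FALSE]
}

# Toy data; names and values are illustrative assumptions.
d <- data.frame(
  p = c(1, NA, NA, 4),
  q = c(NA, NA, 2, 5),
  r = c(NA, NA, 3, 6)
)

at_most_two <- drop_na_rows(d, 2)  # drops only the second row (3 NAs)
no_missing  <- drop_na_rows(d, 0)  # same rows as d[complete.cases(d), ]
```

Setting `max_na` to 0 reproduces the complete.cases behaviour mentioned in the question, so the threshold generalises it rather than replacing it.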

Blue Magister

score 1 · Answer 4 · answered Aug 08 '13 at 01:24

This will return a dataset where at most two values per row are missing:

dfrm[ apply(dfrm, 1, function(r) sum(is.na(r)) <= 2 ) , ]
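The row-wise apply version and the rowSums version should select the same rows. The sketch below checks this on a toy data frame whose names and values are made up for illustration. Because apply calls the function once per row, it is typically slower than rowSums on data with hundreds of thousands of rows, although no timing was measured here.

```r
# Toy data; names and values are illustrative assumptions.
dfrm <- data.frame(
  id = c("A", "B", "C"),
  u  = c(NA, 1, NA),
  v  = c(NA, 2, 3),
  w  = c(NA, NA, 4)
)

# Row-wise count with apply(): the function runs once per row.
keep_apply <- apply(dfrm, 1, function(r) sum(is.na(r)) <= 2)

# Vectorised count with rowSums().
keep_rowsums <- rowSums(is.na(dfrm)) <= 2

# Compare the two logical vectors, ignoring any names they carry.
same <- identical(unname(keep_apply), unname(keep_rowsums))

filtered <- dfrm[keep_rowsums, ]  # rows B and C remain
```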

IRTFM
