R data frame subsetting based on a column value frequency threshold

Question

I am a new R user and this is my first question submission (hopefully in compliance with the protocol).

I have a data frame with two columns.

library(dplyr)

df <- data.frame(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E"))
dfc <- df %>% count(v1)
df$n <- with(dfc, n[match(df$v1, v1)])

   v1 n  
1   A 2
2   A 2
3   B 4
4   B 4
5   B 4
6   B 4
7   C 1
8   D 2
9   D 2
10  E 1

I want to keep at most 3 rows for each value in v1. Once a value has occurred 3 times, any further rows with that value are deleted; values that occur 3 times or fewer keep all of their rows. In this example I want to delete row 6 and retain all remaining rows in a subset data frame.

The result would include the following values for v1:

  v1
1  A
2  A
3  B
4  B
5  B
6  C
7  D
8  D
9  E

Row 6 would have been deleted because it was the 4th occurrence of "B", but the 3 previous rows for "B" have been retained.

I have read multiple posts that demonstrate how to remove ALL rows for a variable with row totals less/greater than a cumulative frequency value, such as 4. For example, I have tried:

df1 <- df %>%
  group_by(v1) %>%
  filter(n() < 4)

This approach keeps only the rows whose v1 value occurs fewer than 4 times in total, so every "B" row is dropped. 6 rows are retained.

df2 <- df %>%
  group_by(v1) %>%
  filter(n() > 3)

This approach keeps only the rows whose v1 value occurs more than 3 times in total, i.e. only the "B" rows. 4 rows are retained.

df4 <- subset(df, v1 %in% names(table(df$v1))[table(df$v1) < 4])

This approach has the same result as the first approach.

None of these methods produce the result I need.

As previously stated, I need to retain the first three rows where v1 = "B" and delete only the rows beyond the third occurrence of that value.

Because I am new to R, it's possible I am overlooking a very simple solution. Any suggestions would be greatly appreciated.

Thanks.

score 1 · Accepted Answer · answered Nov 22 '16 at 15:22

Using dplyr's top_n:

df %>% group_by(v1) %>% top_n(3)
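
For readers on a current dplyr: as far as I know, top_n() ranks rows by a weighting column (the last column by default), its documentation notes that tied rows are all kept, and it has since been superseded by slice_min()/slice_max(). A position-based filter avoids depending on any of that. This is only a sketch of an alternative, not the answerer's code; the name `capped` is made up for illustration.

```r
library(dplyr)

df <- data.frame(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E"))

# Keep at most the first 3 rows of each v1 group, by position within the group.
capped <- df %>%
  group_by(v1) %>%
  filter(row_number() <= 3) %>%
  ungroup()

# Assuming dplyr >= 1.0.0, slice_head() should express the same thing:
# df %>% group_by(v1) %>% slice_head(n = 3) %>% ungroup()
```

Because the filter looks only at each row's position within its group, it does not matter whether the helper column n is present.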

Jacob


Well...that was a lot simpler than my solution. – William Nov 22 '16 at 15:33
Jacob - Thanks for the solution. It worked great. – danbret Nov 22 '16 at 17:10

akrun · Answer 2 · 2016-11-22T15:33:39.313

0

We can use data.table

library(data.table)
setDT(df)[, if (.N > 3) head(.SD, 3) else .SD, by = v1]
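
A possible variation on the same idea, assuming a data.table version that provides rowid() (1.9.8 or later, as far as I know): number the rows within each v1 value and keep those numbered 3 or below. The names `dt` and `capped` are illustrative only.

```r
library(data.table)

dt <- data.table(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E"))

# rowid(v1) numbers the rows within each v1 value (1, 2, 3, ...) in their original order.
capped <- dt[rowid(v1) <= 3]
```

This keeps the rows in their original order and works even when v1 is the only column. Note that setDT(df) in the one-liner above converts df to a data.table by reference, whereas this sketch builds a separate table.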

edited Nov 22 '16 at 15:33

answered Nov 22 '16 at 15:02


score 0 · Answer 3 · answered Nov 22 '16 at 15:19

This seems to do it:

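# index[i] stays 0 for rows to drop; df[index, ] ignores zero indices
index <- vector("numeric", nrow(df))

for (i in seq_len(nrow(df))) {
  # count how often the current v1 value has appeared in rows 1..i
  if (sum(df$v1[1:i] == as.character(df$v1[i])) <= 3) {
    index[i] <- i
  } else {
    cat(i)  # print the number of the row being dropped
  }
}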


df[index, ]
   v1 n
1   A 2
2   A 2
3   B 4
4   B 4
5   B 4
7   C 1
8   D 2
9   D 2
10  E 1
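
The loop's running count can also be computed without an explicit loop. The following is a minimal base R sketch of that idea using ave(); the names `occurrence` and `capped` are made up for illustration.

```r
df <- data.frame(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E"))

# occurrence[i] is how many times df$v1[i] has appeared up to and including row i.
occurrence <- ave(seq_along(df$v1), df$v1, FUN = seq_along)

# Keep rows that are the 1st, 2nd or 3rd occurrence of their v1 value.
capped <- df[occurrence <= 3, , drop = FALSE]
```

As with the loop, the original row names are preserved, so row 6 is simply absent from the result.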
