Fill missing values by rolling forward in each group using data.table

Question

My aim is to fill missing values by group by rolling forward.

Dummy data

library(data.table)

DT <- structure(list(CLASS = c("A", "A", "A", "A", "A", "A", "B", "B","B"),
VAL = c(NA, 1, NA, NA, 2, NA, 50, NA, 100)),
.Names = c("CLASS", "VAL"),
row.names = c(NA, -9L), class = c("data.table", "data.frame"))

> DT
   CLASS VAL
1:     A  NA
2:     A   1
3:     A  NA
4:     A  NA
5:     A   2
6:     A  NA
7:     B  50
8:     B  NA
9:     B 100

Desired result

   CLASS VAL
1:     A  NA
2:     A   1
3:     A   1
4:     A   1
5:     A   2
6:     A   2
7:     B  50
8:     B  50
9:     B 100

Note, the results from here are not applicable.

1) This assigns first non-missing value to every observation in a group

#1
DT[, VAL:= VAL[!is.na(VAL)][1L] , by = CLASS]
> DT
   CLASS VAL
1:     A   1
2:     A   1
3:     A   1
4:     A   1
5:     A   1
6:     A   1
7:     B  50
8:     B  50
9:     B  50

2) If rows to be assigned are filtered to missing values only in i, it fails to pick up any non-NA values when grouping in by. So nothing is changed in the result.

> DT[is.na(VAL), VAL:= VAL[!is.na(VAL)][1L] , by = CLASS]
> DT
   CLASS VAL
1:     A  NA
2:     A   1
3:     A  NA
4:     A  NA
5:     A   2
6:     A  NA
7:     B  50
8:     B  NA
9:     B 100
9:     B  50

3) The solution using fill() from tidyr works, but unfortunately using the real data with 3.5 million rows rows and 2 million groups; the running time is ~6 hrs. So I am looking for a more efficient data.table solution.

> DT <- DT %>% group_by(CLASS) %>% fill(VAL)
> DT
# A tibble: 9 x 2
# Groups:   CLASS [2]
  CLASS    VAL
  <chr>  <dbl>
1 A      NA   
2 A       1.00
3 A       1.00
4 A       1.00
5 A       2.00
6 A       2.00
7 B      50.0 
8 B      50.0 
9 B     100

mtoto · Accepted Answer · 2018-01-29T13:56:38.857

6

You can use na.locf() function from the zoo package:

DT[, VAL:=zoo::na.locf(VAL, na.rm = FALSE), "CLASS"]

edited Jan 29 '18 at 13:56

answered Jan 29 '18 at 13:35

mtoto

23,919
4
58
71

1

`fromLast=FALSE` is the default so it could be omitted. `na.locf0(VAL)` could also be used. – G. Grothendieck Jan 29 '18 at 13:53
1

That's absolutely brilliant! It ran in 1.2 min vs `fill()` that took 6 hrs – Mihail Jan 29 '18 at 14:17

Fill missing values by rolling forward in each group using data.table

1 Answers1