Rolling regression on irregular time series

Question

Summary (tldr)

I need to perform a rolling regression on an irregular time series (i.e. the sampling interval may not be constant, going from 0, 1, 2, 3... to ...7, 20, 24, 28...). The time variable is simply numeric and does not require a date/time format, but the rolling window needs to be defined by time. So if I have a time series that is irregularly sampled over 600 seconds and the window is 30, each regression should cover 30 seconds, not 30 samples.

I've read examples, and while I could replicate doing rolling sums and medians by time, I can't seem to figure it out for regression.

The problem

First of all, I have read some of the other questions with regards to performing rolling functions on irregular time series data, such as this: optimized rolling functions on irregular time series with time-based window, and this: Rolling window over irregular time series.

The issue is that the examples provided so far only cover simple functions like sum or median, and I have not yet figured out how to perform a simple rolling regression, i.e. using lm, under the same constraint that the window is defined by time on an irregular time series. Also, my time series is much, much simpler; no date is necessary, it's simply time "elapsed".

Anyway, getting this right is important to me because with irregular time sampling (for example, a skip in the time interval), a sample-based window will include additional time, which may give an over- or underestimate of the coefficients in the rolling regression.
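To make the difference concrete, here is a minimal base R sketch that compares a window of the last 10 samples with a window of the last 10 time units at the first gap in the sample data. The half-open window (t - 10, t] is an assumption taken from the expected output further down.

```r
# Time values from the sample data (gaps after 29 and 44).
x <- c(0:29, 32:44, 47:49)

i <- which(x == 32)                              # first row after a gap
row_window  <- x[(i - 9):i]                      # last 10 samples
time_window <- x[x > x[i] - 10 & x <= x[i]]      # last 10 time units

range(row_window)    # 21 32: the sample-based window spans 11 time units
range(time_window)   # 23 32: the time-based window spans 9 time units
length(time_window)  # 8 observations instead of 10
```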

So I was wondering if anyone can help me with creating a function that does this in the simplest way? The dataset is based on measuring a variable over time i.e. 2 variables: time, and response. Time is measured every x time elapsed units (seconds, minutes, so not date/time formatted), but once in a while it becomes irregular.

For every row, the function should perform a linear regression based on a width of n time units. The width should never exceed n units, but may be floored (i.e. reduced) to accommodate irregular time sampling. So for example, if the width is specified as 20 seconds but time is sampled every 6 seconds, then the window will be rounded down to 18, not up to 24 seconds.
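The flooring rule can be sketched in base R as follows. The helper name window_rows and the 6-second time vector are made up for illustration, and the window is assumed to be right-aligned and half-open, (t - width, t].

```r
# Hypothetical helper: for each time point t, the row indices that fall
# inside the right-aligned window (t - width, t].
window_rows <- function(time, width) {
  lapply(seq_along(time), function(i) {
    which(time > time[i] - width & time <= time[i])
  })
}

time <- c(0, 6, 12, 18, 24, 30)     # sampled every 6 seconds
rows <- window_rows(time, width = 20)

# Time actually covered by each window: it never exceeds the width of 20
# and settles at 18 once enough history is available.
span <- vapply(rows, function(r) diff(range(time[r])), numeric(1))
span  # 0 6 12 18 18 18
```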

I have looked at the question here: How to calculate the average slope within a moving window in R, and I tested that code on an irregular time series, but it looks like it's based on regular time series.

Sample data:

sample <- 
structure(list(x = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 
29, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 47, 48, 
49), y = c(50, 49, 48, 47, 46, 47, 46, 45, 44, 43, 44, 43, 42, 
41, 40, 41, 40, 39, 38, 37, 38, 37, 36, 35, 34, 35, 34, 33, 32, 
31, 30, 29, 28, 29, 28, 27, 26, 25, 26, 25, 24, 23, 22, 21, 20, 
19)), .Names = c("x", "y"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -46L))

My current code (based on a previous question I referred to). I know it's not subsetting by time:

library(zoo)
clm <- function(z) coef(lm(y ~ x, as.data.frame(z)))
rollme <- rollapplyr(zoo(sample), 10, clm, by.column = F, fill = NA)

The expected output (manually calculated) is below. The output is different from a regular rolling regression -- the numbers are different as soon as the time interval skips at 29 (secs):

    NA
    NA
    NA
    NA
    NA
    NA
    NA
    NA
    NA
    -0.696969697
    -0.6
    -0.551515152
    -0.551515152
    -0.6
    -0.696969697
    -0.6
    -0.551515152
    -0.551515152
    -0.6
    -0.696969697
    -0.6
    -0.551515152
    -0.551515152
    -0.6
    -0.696969697
    -0.6
    -0.551515152
    -0.551515152
    -0.6
    -0.696969697
    -0.605042017
    -0.638888889
    -0.716981132
    -0.597560976
    -0.528301887
    -0.5
    -0.521008403
    -0.642857143
    -0.566666667
    -0.551515152
    -0.551515152
    -0.6
    -0.696969697
    -0.605042017
    -0.638888889
    -0.716981132

I hope I'm providing enough information, but let me know if not (or point me to a good example somewhere that I can try).

Other things I have tried: I've tried converting the time to POSIXct format but I don't know how to perform lm on that:

require(lubridate)    
x <- as.POSIXct(strptime(sample$x, format = "%S"))

Update: Added tldr section.

To be clear, the task is to regress time, `y`, on covariate, `x`, over a rolling time window of say 20 units with non-equal time differences. — Benjamin Christoffersen, Oct 21 '17 at 08:16
Checking the first few rows of output from the code you posted it gives the same slope coefficient as you have listed as expected. Please clarify exactly what the problem is. — G. Grothendieck, Oct 22 '17 at 22:53
Sorry, I thought I was being clear enough (but maybe not). I'll edit and clarify the problem asap. — omgCat, Oct 23 '17 at 00:09

Robert · Answer 1 · 2017-10-21T13:56:15.737

2

Try this:

# time interval is 1
sz=10
pl2=list()
for ( i in 1:nrow(sample)){
  if (i<sz) period=sz else
  period=length(sample$x[sample$x>(sample$x[i]-sz) & sample$x<=sample$x[i]])-1
  pl2[[i]]=seq(-period,0)
}

#update for time interval > 1
sz=10
tint=1
pl2=list()
for ( i in 1:nrow(sample)){
  if (i<sz) period=sz else
  period=length(sample$x[sample$x>(sample$x[i]-sz*tint) & sample$x<=sample$x[i]])-1
  pl2[[i]]=seq(-period,0)
}

rollme3 <- rollapplyr(zoo(sample), pl2, clm, by.column = F, fill = NA)

> tail(rollme3)
   (Intercept)          x
41    47.38182 -0.5515152
42    49.20000 -0.6000000
43    53.03030 -0.6969697
44    49.26050 -0.6050420
45    50.72222 -0.6388889
46    54.22642 -0.7169811
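
As a cross-check that does not depend on zoo, the same time-based window can be sketched in base R. The helper name roll_lm is made up for illustration, the window is assumed to be (t - width, t], and leading partial windows are kept instead of being padded with NA.

```r
# Sample data from the question, rebuilt as a plain data frame.
dat <- data.frame(
  x = c(0:29, 32:44, 47:49),
  y = c(50, 49, 48, 47, 46, 47, 46, 45, 44, 43, 44, 43, 42, 41, 40, 41,
        40, 39, 38, 37, 38, 37, 36, 35, 34, 35, 34, 33, 32, 31, 30, 29,
        28, 29, 28, 27, 26, 25, 26, 25, 24, 23, 22, 21, 20, 19)
)

# Hypothetical helper: fit y ~ x on the rows whose time lies in
# (t - width, t] for every time point t.
roll_lm <- function(dat, width) {
  out <- t(sapply(seq_len(nrow(dat)), function(i) {
    in_window <- dat$x > dat$x[i] - width & dat$x <= dat$x[i]
    coef(lm(y ~ x, data = dat[in_window, ]))
  }))
  colnames(out) <- c("intercept", "slope")
  out
}

res <- roll_lm(dat, width = 10)
tail(res, 3)  # should agree with the last rows of the tail(rollme3) output
```

The first row contains a single observation, so its slope is NA; rows with short windows near the start return a fit on whatever data is available.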

edited Oct 21 '17 at 13:56

answered Oct 21 '17 at 11:23


Hi Robert, thanks for that! It works well when the time interval is 1, but there is an error when that is bigger than that (e.g. every 5 seconds): `Error in eval(predvars, data, env) : object 'y' not found`. Am wondering if that is a simple fix in the code... I'm reading it and trying to understand it right now. – omgCat Oct 21 '17 at 13:00
See the update, define `tint =5` and see if it works. – Robert Oct 21 '17 at 13:57
Hi Robert, thanks! I'll test it on a more aggressive irregular time series and get back to you soon. – omgCat Oct 23 '17 at 00:30

Uwe · Answer 2 · 2019-03-04T17:57:36.247

For the sake of completeness, here is an answer which uses data.table to aggregate in a non-equi join.

Although there are many similar questions, e.g., r calculating rolling average with window based on value (not number of rows or date/time variable), this question deserves an answer of its own as the OP is looking for the coefficients of a rolling regression.

library(data.table)
ws <- 10   # size of sliding window in time units
setDT(sample)[.(start = x - ws, end = x), on = .(x > start, x <= end),
              as.list(coef(lm(y ~ x.x))), by = .EACHI]

      x  x (Intercept)        x.x
 1: -10  0    50.00000         NA
 2:  -9  1    50.00000 -1.0000000
 3:  -8  2    50.00000 -1.0000000
 4:  -7  3    50.00000 -1.0000000
 5:  -6  4    50.00000 -1.0000000
 6:  -5  5    49.61905 -0.7142857
 7:  -4  6    49.50000 -0.6428571
 8:  -3  7    49.50000 -0.6428571
 9:  -2  8    49.55556 -0.6666667
10:  -1  9    49.63636 -0.6969697
11:   0 10    49.20000 -0.6000000
12:   1 11    48.88485 -0.5515152
13:   2 12    48.83636 -0.5515152
14:   3 13    49.20000 -0.6000000
15:   4 14    50.12121 -0.6969697
16:   5 15    49.20000 -0.6000000
17:   6 16    48.64242 -0.5515152
18:   7 17    48.59394 -0.5515152
19:   8 18    49.20000 -0.6000000
20:   9 19    50.60606 -0.6969697
21:  10 20    49.20000 -0.6000000
22:  11 21    48.40000 -0.5515152
23:  12 22    48.35152 -0.5515152
24:  13 23    49.20000 -0.6000000
25:  14 24    51.09091 -0.6969697
26:  15 25    49.20000 -0.6000000
27:  16 26    48.15758 -0.5515152
28:  17 27    48.10909 -0.5515152
29:  18 28    49.20000 -0.6000000
30:  19 29    51.57576 -0.6969697
31:  22 32    49.18487 -0.6050420
32:  23 33    50.13889 -0.6388889
33:  24 34    52.47170 -0.7169811
34:  25 35    48.97561 -0.5975610
35:  26 36    46.77358 -0.5283019
36:  27 37    45.75000 -0.5000000
37:  28 38    46.34454 -0.5210084
38:  29 39    50.57143 -0.6428571
39:  30 40    47.95556 -0.5666667
40:  31 41    47.43030 -0.5515152
41:  32 42    47.38182 -0.5515152
42:  33 43    49.20000 -0.6000000
43:  34 44    53.03030 -0.6969697
44:  37 47    49.26050 -0.6050420
45:  38 48    50.72222 -0.6388889
46:  39 49    54.22642 -0.7169811
      x  x (Intercept)        x.x

Please note that rows 10 to 30 where the time series is regularly spaced are identical to OP's rollme.

The call to as.list() forces the result of coef(lm(...)) to appear in separate columns.
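
A small base R illustration of that point, using made-up toy data: coef() returns one named numeric vector, while as.list() turns it into a list with one element per coefficient. data.table maps each element of a list returned in j to its own column, whereas a plain vector would be stacked in a single column.

```r
# Toy data, for illustration only.
toy <- data.frame(x = 1:5, y = c(2, 4, 6, 8, 10))
fit <- lm(y ~ x, data = toy)

cf <- coef(fit)           # one named numeric vector of length 2
cf_list <- as.list(cf)    # a list with elements "(Intercept)" and "x"

str(cf)
str(cf_list)
```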

The code above uses a right aligned rolling window. However, the code can be easily adapted to support a left aligned window as well:

# left aligned window
setDT(sample)[.(start = x, end = x + ws), on = .(x >= start, x < end),
              as.list(coef(lm(y ~ x.x))), by = .EACHI]

Is there any way to get this to work with GLS regression? `setDT(sample)[.(start = x, end = x + ws), on = .(x >= start, x < end), as.list(coef(gls(y ~ x.x))), by = .EACHI]` Throws an error about not being able to find `y` `Error in eval(predvars, data, env) : object 'y' not found` — KaanKaant, Nov 27 '19 at 00:01

score 0 · Answer 3 · answered Apr 13 '20 at 07:39

With runner, one can apply any R function to an irregular time series. The user passes the data to the x argument and a vector of dates to the idx argument (to make the windows time-dependent). The window width k can be an integer (k = 30) or a character string as in seq.POSIXt (k = "30 secs").

The first example shows how to obtain both parameters from the lm function; the output will be a matrix:

library(runner)

runner(
  x = sample,
  k = "30 secs",
  idx = sample$datetime,
  function(x) {
    coefficients(lm(y ~ x, data = x))
  }
)

Or one can execute runner separately for each parameter

library(runner)

sample$intercept <- runner(
  sample,
  k = "30 secs",
  idx = sample$datetime,
  function(x) {
    coefficients(lm(y ~ x, data = x))[1]
  }
)

sample$slope <- runner(
  sample,
  k = "30 secs",
  idx = sample$datetime,
  function(x) {
    coefficients(lm(y ~ x, data = x))[2]
  }
)

head(sample, 15)

#               datetime  x  y intercept      slope
# 1  2020-04-13 09:27:20  0 50  50.00000         NA
# 2  2020-04-13 09:27:21  1 49  50.00000 -1.0000000
# 3  2020-04-13 09:27:25  2 48  50.00000 -1.0000000
# 4  2020-04-13 09:27:29  3 47  50.00000 -1.0000000
# 5  2020-04-13 09:27:29  4 46  50.00000 -1.0000000
# 6  2020-04-13 09:27:32  5 47  49.61905 -0.7142857
# 7  2020-04-13 09:27:34  6 46  49.50000 -0.6428571
# 8  2020-04-13 09:27:38  7 45  49.50000 -0.6428571
# 9  2020-04-13 09:27:38  8 44  49.55556 -0.6666667
# 10 2020-04-13 09:27:41  9 43  49.63636 -0.6969697
# 11 2020-04-13 09:27:44 10 44  49.45455 -0.6363636
# 12 2020-04-13 09:27:47 11 43  49.38462 -0.6153846
# 13 2020-04-13 09:27:48 12 42  49.38462 -0.6153846
# 14 2020-04-13 09:27:49 13 41  49.42857 -0.6263736
# 15 2020-04-13 09:27:50 14 40  49.34066 -0.6263736

Data with datetime column

sample <- structure(
  list(
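    # cumulative offsets, so that the timestamps are non-decreasing
    # (this matches the datetime column in the printed output above)
    datetime = cumsum(c(3, 1, 4, 4, 0, 3, 2, 4, 0, 3, 3, 3, 1, 1, 1, 3, 0, 2, 4, 2, 2, 
                        3, 0, 1, 2, 4, 0, 1, 4, 4, 1, 2, 1, 3, 0, 4, 4, 1, 3, 0, 0, 2, 
                        1, 0, 2, 0)) + Sys.time(),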
    x = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
          20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 32, 33, 34, 35, 36, 37, 38, 
          39, 40, 41, 42, 43, 44, 47, 48, 49), 
    y = c(50, 49, 48, 47, 46, 47, 46, 45, 44, 43, 44, 43, 42, 41, 40, 41, 40, 39,
          38, 37, 38, 37, 36, 35, 34, 35, 34, 33, 32, 31, 30, 29, 28, 29, 28, 27, 
          26, 25, 26, 25, 24, 23, 22, 21, 20,19)
  ), 
  .Names = c("datetime", "x", "y"), 
  class = c("tbl_df", "tbl", "data.frame"), 
  row.names = c(NA, -46L)
)
