Assume I have data (t, y), where I expect a linear dependency y(t). Furthermore, each observation has attributes par1, par2, par3. Is there an algorithm or technique to decide whether one, several, or all of these parameters are relevant for the fit? I tried leaps::regsubsets(y ~ t + par1 + par2 + par3, data = mydata, nbest = 10) but was not able to extract the formula for the best fit.
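For reference, this is how far I got with inspecting the regsubsets result: summary() exposes which variables enter each candidate model and the usual criteria such as BIC, but I could not turn that into the formula of the best fit (a minimal sketch, using the data from the example below):
library(leaps)
fit <- regsubsets(y ~ t + par1 + par2 + par3, data = mydata, nbest = 10)
s <- summary(fit)
s$which[which.min(s$bic), ] # variables included in the candidate model with the lowest BIC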
The final result should look like this when plotted (for the data, see the reproducible example below):
- Adding par1 and par2 gives the best fit.
- The models are y_i = a_i * t_i + b_i with given a_i and b_i.
Reproducible example:
t <- seq(0, 10, length.out = 1000) # dense grid of t values to sample from
# Create 3 linear relations of the form y_i = a_i * t_i + b_i
a <- c(1, 0.3, 0.2) # slopes
b <- c(-0.5, 0.5, 0.1) # offsets
# create t_i, the noisy observations y_i and the true lines y_t_i
d <- list()
y <- list()
y_t <- list()
for (i in 1:3) {
  set.seed(33 * i)
  d[[i]] <- sort(sample(t, 50, replace = FALSE))
  set.seed(33 * i)
  noise <- rnorm(length(d[[i]])) # one noise term per point; rnorm(10) would be silently recycled
  y[[i]] <- a[i] * d[[i]] + b[i] + noise
  y_t[[i]] <- a[i] * d[[i]] + b[i]
}
# Final data set
df1 <- data.frame(t = d[[1]], y = y[[1]], par1 = 1, par2 = 10,
                  par3 = sample(c(100, 200, 300), length(d[[1]]), replace = TRUE))
df2 <- data.frame(t = d[[2]], y = y[[2]], par1 = 2, par2 = 20,
                  par3 = sample(c(100, 200, 300), length(d[[2]]), replace = TRUE))
df3 <- data.frame(t = d[[3]], y = y[[3]], par1 = 2, par2 = 30,
                  par3 = sample(c(100, 200, 300), length(d[[3]]), replace = TRUE))
mydata <- rbind(df1, df2, df3)
mydata <- mydata[sample(nrow(mydata)), ]
# That is what the data is looking like:
plot(mydata$t, mydata$y)
# This is the result I am looking for (ideally):
plot(d[[1]], y[[1]], col = "black", xlim = c(0, 10), ylim = c(-2, 10), xlab = "t", ylab = "y",
main = "Fit for three different groups")
points(d[[2]], y[[2]], col = "red")
points(d[[3]], y[[3]], col = "blue")
lines(d[[1]], y_t[[1]], col = "black")
lines(d[[2]], y_t[[2]], col = "red")
lines(d[[3]], y_t[[3]], col = "blue")
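For reference, if I already knew that par1 and par2 define the grouping, I could fit the three lines directly with an interaction model (a sketch that assumes the grouping is known, i.e. not the selection step I am asking about):
grp <- interaction(mydata$par1, mydata$par2, drop = TRUE) # the 3 observed groups
fit_known <- lm(y ~ 0 + grp + t:grp, data = mydata) # one intercept and one slope per group
coef(fit_known) # estimates of b_i (grp terms) and a_i (t:grp terms)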
Comment and question on @Roland's answer:
I understand that with the given three parameters the interaction has up to 2*3*3 = 18 factor levels. But I would expect only 2^3 = 8 relevant groups of models, since for each parameter I only have the choice "include parameter x or not". To me it does not make sense to only "include level x of parameter y".
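A quick check on my own data illustrates the distinction (using base R's interaction()):
g_all <- interaction(mydata$par1, mydata$par2, mydata$par3, drop = TRUE)
nlevels(g_all) # level combinations actually observed (at most 2*3*3 = 18 possible)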
I tried the following (g and pred.1se come from @Roland's answer):
g <- 0 # select one level of the grouping factor g
t_lin1 <- mydata$t[mydata$g == g]
y_lin1 <- mydata$y[mydata$g == g]
plot(mydata$t, mydata$y)
points(t_lin1, y_lin1, col = "red")
abline(lm(y_lin1 ~ t_lin1), col = "red")
points(pred.1se ~ t, data = mydata, col = as.integer(mydata$g), pch = 16)
and realized that the fit is off. In hindsight this is clear, because
- I include the wrong factor levels (most likely par3 is not relevant)
- and thus fit the regression to the wrong subset of the data.
So my last questions are:
- Where can I find the relevant groups included in the best model, and
- what are the corresponding fit parameters from the regression?
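For concreteness, this is the kind of extraction I am hoping for. Assuming the object behind pred.1se is a cv.glmnet fit (I call it cvfit here, a hypothetical name, as I do not know the exact code), something like:
cf <- coef(cvfit, s = "lambda.1se") # sparse column of coefficients at lambda.1se
rownames(cf)[which(cf[, 1] != 0)] # terms (groups) kept in the selected model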
Sorry if this is obvious, but to me it is a mystery.


