My question is about unnecessary predictors, namely variables that provide no new linear information because they are linear combinations of the other predictors. As you can see, the swiss dataset has six variables.
data(swiss)
names(swiss)
# "Fertility" "Agriculture" "Examination" "Education"
# "Catholic" "Infant.Mortality"
Now I introduce a new variable ec. It is the sum of Examination and Catholic.
ec <- swiss$Examination + swiss$Catholic
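As a quick sanity check (a sketch, assuming the swiss data from R's built-in datasets package), we can confirm that ec is exactly collinear with the two columns it was built from:

```r
# The three columns span only a 2-dimensional space, since ec is an
# exact linear combination of Examination and Catholic.
X <- cbind(Examination = swiss$Examination,
           Catholic    = swiss$Catholic,
           ec          = swiss$Examination + swiss$Catholic)
qr(X)$rank  # rank 2, not 3: the columns are linearly dependent
```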
When we run a linear regression with unnecessary variables, R drops terms that are linear combinations of other terms and returns NA for their coefficients. The command below illustrates the point.
lm(Fertility ~ . + ec, swiss)
# Coefficients:
#      (Intercept)       Agriculture       Examination         Education
#          66.9152           -0.1721           -0.2580           -0.8709
#         Catholic  Infant.Mortality                ec
#           0.1041            1.0770                NA
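For what it's worth, the alias() function in the stats package reports which term was dropped and how it depends on the remaining ones (a sketch, assuming the same fit as above):

```r
# alias() lists the "complete" linear dependencies among the terms:
# here it should show that ec is Examination + Catholic.
fit <- lm(Fertility ~ . + ec, swiss)
alias(fit)
```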
However, when we regress on ec first and then on all of the other regressors, as shown below,
lm(Fertility ~ ec + ., swiss)
# Coefficients:
#      (Intercept)                ec       Agriculture       Examination
#          66.9152            0.1041           -0.1721           -0.3621
#        Education          Catholic  Infant.Mortality
#          -0.8709                NA            1.0770
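To be clear, the design matrix is equally rank-deficient in both orderings; only the column order differs (a sketch using model.matrix, assuming the same formulas as above):

```r
# Both model matrices have 7 columns but rank 6; the formulas differ
# only in where the redundant column ec appears.
m1 <- model.matrix(Fertility ~ . + ec, swiss)
m2 <- model.matrix(Fertility ~ ec + ., swiss)
c(qr(m1)$rank, qr(m2)$rank)  # same rank for both orderings
```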
I would expect the coefficients of both Catholic and Examination to be NA: the variable ec is a linear combination of both of them, yet in the end the coefficient of Examination is not NA while that of Catholic is. Could anyone explain the reason for this?