I'm hoping this question isn't just a previous question in different words. I've tried the solutions from similar questions, but they haven't worked for me, so bear with me!
I'm having some trouble with the output of my linear regression model in R. I'm concerned that the model is using an incorrect referent group for the interaction term, and even though I've tried releveling the individual variables before they enter the interaction, I'm not getting the output I expected.
I have a dataset with continuous and categorical variables. Let's say that variables A and B are continuous and variables C, D, and E are categorical (0 = No, 1 = Yes). The referent group for each categorical variable has been set to "No" (0). Here's an example:
ID      A      B      C      D      E
1       53.6   25     No     Yes    No
2       51.1   12     Yes    No     Yes
3       50.9   NA     Yes    Yes    No
4       49.3   2      No     No     No
5       48.1   NA     No     Yes    No
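For reference, this is roughly how I've been building the data frame (the values here are just the illustrative rows above) and setting "No" as the reference level with relevel():

```r
# Sketch of the example data; the numeric values are illustrative only.
example <- data.frame(
  ID = 1:5,
  A  = c(53.6, 51.1, 50.9, 49.3, 48.1),
  B  = c(25, 12, NA, 2, NA),
  C  = factor(c("No", "Yes", "Yes", "No", "No")),
  D  = factor(c("Yes", "No", "Yes", "No", "Yes")),
  E  = factor(c("No", "Yes", "No", "No", "No"))
)

# Explicitly make "No" the reference (first) level of each factor,
# so "Yes" coefficients are estimated relative to "No".
example$C <- relevel(example$C, ref = "No")
example$D <- relevel(example$D, ref = "No")
example$E <- relevel(example$E, ref = "No")
```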
I've tried a couple of different ways to get at the interaction terms, so my models are set up as follows:
lm1 <- lm(A ~ C*D + E + B, data=example)
lm2 <- lm(A ~ C:D + E + B, data=example)
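My understanding is that `C*D` is shorthand for `C + D + C:D` (main effects plus the interaction), while `C:D` on its own requests only the interaction term without the C and D main effects. A small sketch with toy two-level factors shows the design-matrix columns each form produces (the `~ C:D` columns are left uncommented since they depend on R's coding rules for interactions without main effects):

```r
# Toy two-level factors, with "No" as the reference level.
C <- factor(c("No", "No", "Yes", "Yes"), levels = c("No", "Yes"))
D <- factor(c("No", "Yes", "No", "Yes"), levels = c("No", "Yes"))

# C*D expands to main effects plus the interaction:
colnames(model.matrix(~ C * D))
# "(Intercept)" "CYes" "DYes" "CYes:DYes"

# C:D alone requests only the interaction term; inspect what R builds:
colnames(model.matrix(~ C:D))
```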
I expected an output table listing the regression coefficient, standard error, etc. for the intercept, C alone, D alone, E, B, and then C * D broken down into three of the four possible combination groups of the interaction term, omitting the group made up of both referent levels ("No" for both C and D, i.e. "C_No:D_No").
EXPECTED:
Coefficient   Estimate   Std. Error   t value   Pr(>|t|)   
Intercept     90.76369   0.54308      167.127   < 2e-16  ***
C_Yes         -0.28639   0.62044      -0.462    0.644465    
D_Yes         -3.01242   1.14733      -2.626    0.008771 **
E_Yes         0.05865    0.01691      3.468     0.000544 ***
B             -0.20891   0.35982      -0.581    0.561634
C_No:D_Yes    -0.42116   0.47213      2.617     0.01674  *
C_Yes:D_Yes   2.01208    1.43154      1.406     0.160148
C_Yes:D_No    -0.02877   0.65271      -0.345    0.672531 
For the first model, I got output for the intercept, C alone, D alone, E, B, and then only one combination group of C * D.
ACTUAL:
Coefficient   Estimate   Std. Error   t value   Pr(>|t|)   
Intercept     90.76369   0.54308      167.127   < 2e-16  ***
C_Yes         -0.28639   0.62044      -0.462    0.644465    
D_Yes         -3.01242   1.14733      -2.626    0.008771 **
E_Yes         0.05865    0.01691      3.468     0.000544 ***
B             -0.20891   0.35982      -0.581    0.561634
C_No:D_Yes    -0.42116   0.47213      2.617     0.01674  *
For the second model, I got output for the intercept, E, B, and then all combination groups of C * D (one of which was NA).
ACTUAL:
Coefficient   Estimate   Std. Error   t value   Pr(>|t|)   
Intercept     90.76369   0.54308      167.127   < 2e-16  ***
E_Yes         0.05865    0.01691      3.468     0.000544 ***
B             -0.20891   0.35982      -0.581    0.561634
C_No:D_Yes    -0.42116   0.47213      2.617     0.01674  *
C_Yes:D_Yes   NA  (all not defined because of singularities)
C_Yes:D_No    -0.02877   0.65271      -0.345    0.672531 
So now my questions are:
1) Is there different code that will give me everything that I want in one model instead of two?
2) Is this model, as is, using C_Yes:D_Yes as the referent group instead of C_No:D_No, and is that why I'm getting the error about singularities? My variables are correlated, yes, but not perfectly, so I wasn't expecting multicollinearity to be an issue.
3) If the referent group is correct, why am I getting a coefficient estimate for C_No:D_No (the referent group)?