My dependent variable is a dummy variable (1 = disclosed, 0 = not disclosed), and my independent variable is categorical (five types of sectors).
With these data, can a linear regression model be used?
My objective is to identify which sectors do or do not disclose.
Is the following a good approach? For example:
summary(lm(Disclosed ~ 0 + Sectors, data = df_0))
I added "0 +" to the model to remove the intercept, so that the first sector is also returned. If I don't add it, the first sector is not returned, and I don't know why. I am very lost. Thanks!
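For reference, a minimal sketch of what "0 +" changes, using a made-up three-level factor (the data frame `toy` and its values are hypothetical, not my real data). As far as I understand, with an intercept R's default treatment contrasts use the first level as the reference, so it gets no coefficient of its own and the other coefficients are differences from it; with "0 +", every level gets its own indicator column.

```r
# Hypothetical toy data, only to illustrate the design matrix
toy <- data.frame(
  Disclosed = c(1, 0, 0, 1, 0, 0),
  Sectors   = factor(c("A", "A", "B", "B", "C", "C"))
)

# With an intercept: level "A" is the reference and has no column of its own
X_int <- model.matrix(Disclosed ~ Sectors, data = toy)
colnames(X_int)

# Without an intercept: one indicator column per level
X_noint <- model.matrix(Disclosed ~ 0 + Sectors, data = toy)
colnames(X_noint)

# Without an intercept, the lm() coefficients equal the per-sector means
fit_means    <- coef(lm(Disclosed ~ 0 + Sectors, data = toy))
sector_means <- tapply(toy$Disclosed, toy$Sectors, mean)
fit_means
sector_means
```

In this parameterisation each t-test compares a sector's mean with zero, not with the other sectors.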
If I use a binomial logistic regression, I don't know how to interpret the significance values and the signs of the estimates that I obtain.
Call:
glm(formula = Disclosed ~ 0 + Sectors, family = binomial(link = "logit"), 
    data = df_0)
Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.96954  -0.32029  -0.00005  -0.00005   2.48638  
Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
SectorsCOMMUNICATION      -0.5108     0.5164  -0.989  0.32256    
SectorsCONSIMERSTAPLES   -20.5661  6268.6324  -0.003  0.99738    
SectorsCONSUMERDISCRET    -3.0445     1.0235  -2.975  0.00293 ** 
SectorsENERGY            -20.5661  3780.1276  -0.005  0.99566    
SectorsFINANCIALS         -2.9444     0.7255  -4.059 4.94e-05 ***
SectorsHEALTHCARE        -20.5661  5345.9077  -0.004  0.99693    
SectorsINDUSTRIALS       -20.5661  2803.4176  -0.007  0.99415    
SectorsINDUSTRIALS       -20.5661 17730.3699  -0.001  0.99907    
SectorsINFORMATION        -1.0986     0.8165  -1.346  0.17846    
SectorsMATERIALS         -20.5661  3780.1276  -0.005  0.99566    
SectorsREALESTATE        -20.5661  8865.1850  -0.002  0.99815    
SectorsUTILITIES         -20.5661  7238.3932  -0.003  0.99773    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 277.259  on 200  degrees of freedom
Residual deviance:  54.185  on 188  degrees of freedom
AIC: 78.185
Number of Fisher Scoring iterations: 19
Does this mean that the financial and consumer discretionary sectors are the ones that disclose the least?
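One way to check this reading is to convert the estimates from the log-odds scale to probabilities. The sketch below only reuses numbers copied from the glm() output above; the vector name `log_odds` is my own. Without an intercept, each z-test asks whether a sector's log-odds differ from 0, that is, whether its disclosure probability differs from 0.5, so a small p-value does not by itself identify the sector that discloses least.

```r
# Estimates copied from the glm() output above (log-odds scale)
log_odds <- c(
  COMMUNICATION   = -0.5108,
  CONSUMERDISCRET = -3.0445,
  FINANCIALS      = -2.9444,
  INFORMATION     = -1.0986,
  ENERGY          = -20.5661
)

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
prob <- plogis(log_odds)
round(prob, 4)
```

The estimates of about -20.57 with very large standard errors map to probabilities that are essentially zero, which is what I would expect for sectors with no disclosing observations at all; their p-values near 1 presumably reflect that the standard errors cannot be estimated reliably in that case, rather than evidence that those sectors disclose.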
On the other hand, if I fit an lm, it returns more consistent results. The sectors that disclose the most are information and communication: their estimates are positive and significant.
Call:
lm(formula = Disclosed ~ 0 + Sectors, data = df_0)
Residuals:
    Min      1Q  Median      3Q     Max 
-0.3750 -0.0500  0.0000  0.0000  0.9546 
Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
SectorsCOMMUNICATION   3.750e-01  5.191e-02   7.224 1.22e-11 ***
SectorsCONSIMERSTAPLES 0.000e+00  7.341e-02   0.000 1.000000    
SectorsCONSUMERDISCRET 4.545e-02  4.427e-02   1.027 0.305815    
SectorsENERGY          0.000e+00  4.427e-02   0.000 1.000000    
SectorsFINANCIALS      5.000e-02  3.283e-02   1.523 0.129426    
SectorsHEALTHCARE      0.000e+00  6.260e-02   0.000 1.000000    
SectorsINDUSTRIALS     2.194e-18  3.283e-02   0.000 1.000000    
SectorsINDUSTRIALS     0.000e+00  2.076e-01   0.000 1.000000    
SectorsINFORMATION     2.500e-01  7.341e-02   3.406 0.000807 ***
SectorsMATERIALS       0.000e+00  4.427e-02   0.000 1.000000    
SectorsREALESTATE      0.000e+00  1.038e-01   0.000 1.000000    
SectorsUTILITIES       1.416e-17  8.476e-02   0.000 1.000000    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2076 on 188 degrees of freedom
Multiple R-squared:  0.2632,    Adjusted R-squared:  0.2162 
F-statistic: 5.597 on 12 and 188 DF,  p-value: 3.568e-08
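Since my goal is only to see which sectors do or do not disclose, a plain cross-tabulation may already answer it. This is a minimal sketch with a hypothetical data frame `toy` standing in for df_0 (the values are made up); the column names Disclosed and Sectors are the ones from my model.

```r
# Hypothetical data frame standing in for df_0 (values are made up)
toy <- data.frame(
  Disclosed = c(1, 0, 1, 0, 0, 0, 0, 1),
  Sectors   = factor(c("A", "A", "A", "B", "B", "C", "C", "C"))
)

# Counts of not disclosed (0) and disclosed (1) per sector
counts <- table(toy$Sectors, toy$Disclosed)
counts

# Row-wise proportions: share of each sector that disclosed
props <- prop.table(counts, margin = 1)
round(props, 3)
```

The "1" column of `props` should match the coefficients of the no-intercept lm on the same data, because those coefficients are the per-sector means of Disclosed.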