statistics - Why does regression in R delete index 1 of a factor variable?
I am trying to run a regression in R using the lm and glm functions.
My dependent variable is logit-transformed data based on the proportion of events to non-events within a given time period. The dependent variable is therefore continuous, whereas the independent variables are factor variables (dummies).
I have 2 independent variables, which can take the following values:
- Year: a range of years up to year m (the year variable)
- Month: month j to month n (the month variable)
The problem is that whenever I run the model and look at the summary of the results, April (index 1 for month) and 1998 (index 1 for year) do not appear in the results. If I change April to, let's say, "foo_bar", then August goes missing instead.
Please help! This is frustrating me, and I do not know how to search for a solution to this problem.
If R were to create a dummy variable for every level of a factor, the resulting set of variables would be linearly dependent (assuming there is also an intercept term). Therefore, one factor level is chosen as the baseline and has no dummy generated for it.
To illustrate this, let's consider a toy example:
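The linear dependence can be checked directly on a design matrix. The following is a minimal sketch on a made-up three-level factor; the names `f`, `dummies`, `X_full` and `X_default` are chosen here for illustration only.

```r
# Hypothetical three-level factor, used only to illustrate the point
f <- factor(c("a", "a", "b", "b", "c", "c"))

# One dummy column per level (the "- 1" removes the intercept column)
dummies <- model.matrix(~ f - 1)

# Put an intercept column of ones next to all three dummies
X_full <- cbind(intercept = 1, dummies)

# Every row has exactly one dummy set, so the dummies sum to the intercept
all(rowSums(dummies) == 1)

# 4 columns but only rank 3: the columns are linearly dependent
ncol(X_full)
qr(X_full)$rank

# The default design matrix drops the dummy for the first level
X_default <- model.matrix(~ f)
colnames(X_default)
qr(X_default)$rank
```

With all four columns present, the coefficients would not be uniquely determined, which is why the default design matrix leaves out the dummy for the first level.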
    > data <- data.frame(y=c(2, 3, 5, 7, 11, 25), f=as.factor(c('a', 'a', 'b', 'b', 'c', 'c')))
    > summary(lm(y ~ f, data))

    Call:
    lm(formula = y ~ f, data = data)

    Residuals:
       1    2    3    4    5    6
    -0.5  0.5 -1.0  1.0 -7.0  7.0

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    2.500      4.093   0.611   0.5845
    fb             3.500      5.788   0.605   0.5880
    fc            15.500      5.788   2.678   0.0752 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 5.788 on 3 degrees of freedom
    Multiple R-squared:  0.7245,    Adjusted R-squared:  0.5409
    F-statistic: 3.945 on 2 and 3 DF,  p-value: 0.1446
As you can see, there are three coefficients (the same as the number of levels in the factor). Here, a has been chosen as the baseline, so (Intercept) refers to the subset of the data where f is a. The coefficients for b and c (fb and fc) are the differences between the baseline intercept and the intercepts for the two other factor levels. Thus the intercept for b is 6 (2.500+3.500) and the intercept for c is 18 (2.500+15.500).
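As a quick check of this interpretation, the per-level intercepts implied by the coefficients can be compared with the plain group means of y. This is a small sketch on the same toy data; `fit`, `group_means` and `implied` are names introduced here.

```r
# Same toy data as in the example above
data <- data.frame(y = c(2, 3, 5, 7, 11, 25),
                   f = as.factor(c("a", "a", "b", "b", "c", "c")))

fit <- lm(y ~ f, data)
cf <- coef(fit)

# Mean of y within each factor level: a = 2.5, b = 6, c = 18
group_means <- tapply(data$y, data$f, mean)

# Intercept for each level: baseline, then baseline plus the level's offset
implied <- c(a = cf[["(Intercept)"]],
             b = cf[["(Intercept)"]] + cf[["fb"]],
             c = cf[["(Intercept)"]] + cf[["fc"]])

# The two agree up to floating-point error
all.equal(unname(implied), as.vector(group_means))
```

Nothing is lost by dropping the dummy for a: its mean is still in the model, carried by the intercept.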
If you don't like the automatic choice, you can pick another level as the baseline: How to force R to use a specified factor level as reference in a regression?
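One common way to do this is base R's relevel(), which moves a chosen level to the front of the factor's levels. The sketch below reuses the toy data and picks c as the reference purely for illustration; `fit_c` is a name introduced here.

```r
# Same toy data as in the example above
data <- data.frame(y = c(2, 3, 5, 7, 11, 25),
                   f = as.factor(c("a", "a", "b", "b", "c", "c")))

# Make "c" the first level, so it becomes the baseline
data$f <- relevel(data$f, ref = "c")
levels(data$f)

fit_c <- lm(y ~ f, data)

# The dummies are now for a and b; the intercept is the mean of y for c
names(coef(fit_c))
coef(fit_c)
```

By default, a factor's levels are the sorted unique values, so the baseline is simply whichever label sorts first. That is why April is the baseline in the question, and why August takes its place once April is renamed to "foo_bar".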