Multiple Regression
Review of points in linear regression analysis:
- When the regression line is calculated for raw data for X and Y:
  - the regression line passes through the means of both variables.
  - the "unstandardized" slope of the regression line, b, states the change in Y that is due to a change of one unit of X.
    - Obviously, interpreting the change in Y depends on the "units" used to measure both X and Y.
  - A slope of 0 (b = 0) would indicate the absence of a correlation between X and Y.
    - The correlation coefficient, r, would thus also be zero.
    - Therefore, testing for the significance of the slope (b) is equivalent to testing for the significance of the correlation coefficient (r).
  - However, the magnitude of a non-zero b-coefficient is not a reliable guide to the size of the correlation coefficient, because the slope depends on the ratio of the covariation of X and Y to the variation of X.
    - This is equivalent to the ratio of the covariance of X and Y to the variance of X.
      b = covariation of X and Y / variation of X
        = Sum(Xi - Xbar)(Yi - Ybar) / Sum(Xi - Xbar)^2
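A minimal numeric sketch of this formula (the X and Y values below are made up purely for illustration):

```python
# Sketch: the unstandardized slope b as the ratio of the covariation
# of X and Y to the variation of X (illustrative data).
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])   # hypothetical X values
y = np.array([1.0, 3.0, 4.0, 6.0, 8.0])   # hypothetical Y values

covariation_xy = np.sum((x - x.mean()) * (y - y.mean()))   # Sum(Xi - Xbar)(Yi - Ybar)
variation_x = np.sum((x - x.mean()) ** 2)                   # Sum(Xi - Xbar)^2

b = covariation_xy / variation_x           # unstandardized slope
a = y.mean() - b * x.mean()                # intercept: the line passes through (Xbar, Ybar)
print(b, a)
```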
- When the regression line is calculated for z-scores computed from the same data:
  - the regression line still passes through the means of both variables, which in the case of z-scores are both 0.
    - Therefore, the intercept is 0, and the a term in the unstandardized regression equation simply drops out.
  - Because the data are standardized, the slope of the equation is itself a standardized value, called a beta-coefficient.
    - The beta-coefficient also measures the change in Y coming from one unit change in X, but now the "units" are standard deviations.
  - The correlation coefficient, r, remains the same whether calculated for the same data in raw or standardized form.
    - Moreover, the correlation coefficient, r, is equal to the standardized beta-coefficient.
  - The great value of the beta-coefficient is that it expresses the "effect" of one variable on another without regard to how differently the variables are scaled.
    - It does this by expressing the change in Y in standard deviations produced by a change of one standard deviation in X.
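A small sketch of the z-score case, using the same hypothetical data as above, showing that the standardized slope equals r:

```python
# Sketch: regressing standardized Y on standardized X gives an intercept of 0
# and a slope (the beta-coefficient) equal to the Pearson r.
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])   # hypothetical raw data
y = np.array([1.0, 3.0, 4.0, 6.0, 8.0])

zx = (x - x.mean()) / x.std(ddof=1)        # standardize X
zy = (y - y.mean()) / y.std(ddof=1)        # standardize Y

beta = np.sum(zx * zy) / np.sum(zx ** 2)   # slope of zy on zx
r = np.corrcoef(x, y)[0, 1]                # Pearson r on the raw data
print(beta, r)                             # the two values match
```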
Multiple regression analysis:
- An extension of simple regression to the case of multiple independent variables, X1 to Xn, and a single dependent variable, Y:
  - It is most appropriate when Y is a continuous variable.
  - The classic model also assumes the X variables to be continuous.
    - They certainly cannot be polychotomous nominal variables,
    - but regression analysis can handle independent nominal variables in dichotomous form -- so-called "dummy" variables.
    - The effects of dummy variables are more easily interpreted when they are scored "0" and "1" to indicate the absence or presence of an attribute.
    - Then the presence of the attribute -- e.g., SOUTH indicating a southern state if "1" and non-southern if "0" -- turns on the associated coefficient, which shows the effect of SOUTH on Y.
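A brief sketch of how a 0/1 dummy works in a regression (the variable names and values here are hypothetical, not data from the course):

```python
# Sketch: with a 0/1 dummy, the b-coefficient on SOUTH is the shift in
# predicted Y for southern cases relative to non-southern cases.
import numpy as np

south = np.array([1, 0, 1, 0, 0, 1], dtype=float)    # hypothetical dummy predictor
income = np.array([30., 42., 28., 45., 40., 31.])    # hypothetical Y values

X = np.column_stack([np.ones_like(south), south])    # intercept + dummy
a, b_south = np.linalg.lstsq(X, income, rcond=None)[0]
print(a, b_south)   # b_south = mean(Y | SOUTH=1) - mean(Y | SOUTH=0)
```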
- Under the Analyze menu in SPSS 10, the Regression menu offers several choices; choose Linear.
  - The dialog box allows you to enter a single dependent variable but multiple independent variables.
  - The "Statistics" button offers several boxes to check; the most relevant for our class are
    - Regression Coefficient "Estimates"
    - Model fit
    - Descriptives
  - Unfortunately, the "Plots" button only produces plots for residuals, so it is not useful for us.
  - The "Method" box offers several options; the most useful for us are
    - Enter -- which automatically forces into the equation all the independent variables you listed.
    - Stepwise -- which proceeds step-by-step,
      - first entering the variable that explains the most variance -- if it is significant at .05,
      - then the variable that explains most of the remaining variance -- if it is significant at .05,
      - and so on.
- The mathematics of regression analysis operates on the intercorrelations among all the variables -- dependent and independent.
  - An intercorrelation matrix is a table of all possible bivariate Pearson correlations among the variables Y, X1, . . ., Xn.
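A quick sketch of such a matrix, computed from made-up values for Y, X1, and X2:

```python
# Sketch: an intercorrelation matrix is just all pairwise Pearson r's
# among Y and the X's (rows/columns ordered Y, X1, X2; data are hypothetical).
import numpy as np

y  = np.array([10., 12., 15., 11., 18., 20.])
x1 = np.array([ 1.,  2.,  3.,  2.,  4.,  5.])
x2 = np.array([ 3.,  5.,  4.,  6.,  7.,  9.])

corr = np.corrcoef(np.vstack([y, x1, x2]))   # 3 x 3 matrix of bivariate r's
print(np.round(corr, 2))
```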
- Regression analysis seeks the additive combination of independent variables that produces the best linear relationship between the observed Y values and the Y values predicted by the resulting regression equation.
  - The multiple correlation coefficient, R, is equal to the product moment correlation, r, that would be calculated if the observed values were correlated with the values computed from the regression equation.
  - Similarly, the multiple R-squared is equal to the proportion of variance in the dependent variable Y that is explained by the additive combination of effects of the independent variables, X1 to Xn.
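A sketch of these two points, using the same hypothetical Y, X1, and X2 as above: fit Y on X1 and X2 by least squares, then confirm that R equals the Pearson r between observed and predicted Y, and that R-squared equals the proportion of variance explained.

```python
# Sketch: multiple R as the correlation between observed Y and predicted Y.
import numpy as np

y  = np.array([10., 12., 15., 11., 18., 20.])
x1 = np.array([ 1.,  2.,  3.,  2.,  4.,  5.])
x2 = np.array([ 3.,  5.,  4.,  6.,  7.,  9.])

X = np.column_stack([np.ones_like(y), x1, x2])
coefs = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ coefs                                  # predicted Y values

R = np.corrcoef(y, y_hat)[0, 1]                    # multiple R
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(R, R ** 2, r_squared)                        # R^2 matches the variance explained
```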
- Multiple regression analysis assesses only direct, additive effects according to this model:
  - X1 and X2 are uncorrelated, such that r(X1,X2) = 0.
  - If this causal model actually operated for the variables, then one could obtain the same results by computing two separate, simple regression equations and adding together the variance in Y that is explained separately by X1 and X2.
      Y = beta(X1) = .5(X1)    r = .5    r-squared = .25
      Y = beta(X2) = .4(X2)    r = .4    r-squared = .16
      Variance explained by adding the variance of X1 and X2 = .25 + .16 = .41
  - This would be equivalent to constructing a regression equation from the beta-coefficients in the two independent regression equations:
    - EXAMPLE: Y = beta(X1) + beta(X2) = .5(X1) + .4(X2)
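A sketch of this special case with contrived numbers (constructed so that r(X1, X2) is exactly 0; the values are not the .5/.4 example above, just an illustration of the same principle):

```python
# Sketch: when X1 and X2 are exactly uncorrelated, the multiple R-squared
# equals r1^2 + r2^2 -- the separately explained variances simply add.
import numpy as np

x1 = np.array([ 1.,  1., -1., -1.])
x2 = np.array([ 1., -1.,  1., -1.])          # orthogonal to x1, so r(X1, X2) = 0
y  = np.array([ 4.,  0., -2., -2.])

r1 = np.corrcoef(x1, y)[0, 1]
r2 = np.corrcoef(x2, y)[0, 1]

X = np.column_stack([np.ones_like(y), x1, x2])
y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
R2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r1**2 + r2**2, R2)    # both are 0.833...; the separate r-squareds add up
```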
  - In most cases the simple model of uncorrelated independent variables does not hold, and the effects of variables computed from separate simple regressions cannot simply be added together.
  - The value of multiple regression analysis is that it discounts the overlapping explanation of Y among correlated independent variables and expresses the NET effect of each independent variable, controlling for any others in the equation.
- Multiple regression through the stepwise procedure in SPSS demonstrates these features.
  - The first variable entered in the equation is the one that has the highest simple correlation with the dependent variable Y -- that is, the program selects as the first variable the one that explains most of the variance in Y.
  - The next variable entered is the one that has the highest partial correlation with Y -- that is, the variable that explains most of the remaining variance in Y while controlling for the first -- assuming that the variable meets the stated level of significance (the default for entering a variable is .05).
  - With the addition of each new variable, a variable previously IN the equation may drop below significance (the default value for removing a variable already in the equation is .10) and be removed from the equation -- its explanatory power replaced by other variables now in the equation.
  - And so it goes with each remaining variable in the set of independent variables until one of two conditions is met:
    - All the variables are entered in the equation, which usually happens only when the number of variables is less than 5.
    - No additional variables would show "significant" effects IF added to the equation.
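A rough sketch of the entry side of this logic (not SPSS's actual algorithm): at each step the candidate with the most significant added explanation is entered, and the loop stops when no remaining candidate reaches p < .05. The removal rule (p > .10) is left out to keep the sketch short.

```python
# Sketch of forward stepwise entry using the partial F test for each candidate.
import numpy as np
from scipy import stats

def r_squared(cols, y):
    """R-squared of y regressed (with intercept) on the given predictor columns."""
    X = np.column_stack([np.ones(len(y))] + cols)
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def forward_stepwise(candidates, y, alpha=0.05):
    """candidates: dict mapping variable name -> 1-D array of values."""
    entered = []
    n = len(y)
    while True:
        r2_base = r_squared([candidates[v] for v in entered], y) if entered else 0.0
        best = None
        for name in candidates:
            if name in entered:
                continue
            r2_new = r_squared([candidates[v] for v in entered] + [candidates[name]], y)
            df2 = n - len(entered) - 2                    # error df of the fuller model
            F = (r2_new - r2_base) * df2 / (1 - r2_new)   # partial F for the candidate
            p = stats.f.sf(F, 1, df2)
            if p < alpha and (best is None or p < best[1]):
                best = (name, p)
        if best is None:
            return entered                                # no remaining candidate is significant
        entered.append(best[0])
```

For example, forward_stepwise({"X1": x1, "X2": x2}, y) would return the names of the entered variables in the order they were added.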
- Interpretation of the resulting regression equation can be done using either b-coefficients or beta-coefficients -- or both:
  - b-coefficients, which are unstandardized, show the net effect in Y that is associated with one unit change in X -- all in raw data values.
  - beta-coefficients, which are standardized, show the net effect in Y that is associated with one unit change in X -- but now the changes are in standard deviations of both variables.
  - Because b-coefficients deal with raw (or "original") values, the b-coefficients should be used to construct the prediction equation from the X variables to the Y variable.
  - Because beta-coefficients are standardized, they should be used to compare the "effects" of variables within equations.
  - Both b-coefficients and beta-coefficients can be interpreted as controlling for the effects of other variables.
  - If the b-coefficient is significant, as determined by applying the t-test to the ratio of the coefficient to its standard error, then the beta-coefficient is also significant.
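A sketch of how this coefficient output can be computed by hand, again with the hypothetical Y, X1, and X2 used earlier: unstandardized b's, their standard errors and t ratios, and the standardized betas.

```python
# Sketch: b-coefficients, standard errors, t ratios (b / SE), and beta-coefficients.
import numpy as np
from scipy import stats

y  = np.array([10., 12., 15., 11., 18., 20.])
x1 = np.array([ 1.,  2.,  3.,  2.,  4.,  5.])
x2 = np.array([ 3.,  5.,  4.,  6.,  7.,  9.])

X = np.column_stack([np.ones_like(y), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]              # a, b1, b2

resid = y - X @ b
n, k = X.shape
mse = np.sum(resid ** 2) / (n - k)                    # residual variance
se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))   # standard errors of a, b1, b2
t = b / se
p = 2 * stats.t.sf(np.abs(t), n - k)                  # two-tailed p-values

beta = b[1:] * np.array([x1.std(ddof=1), x2.std(ddof=1)]) / y.std(ddof=1)
print(b, se, t, p, beta)
```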
- The F value associated with a multiple regression equation tests for the significance of the multiple R for the entire equation.
  - It is possible to have a significant R and some variables that are not significant.
    - These probably should be removed from the equation, for they are not adding any appreciable explanation -- but there may be theoretical reasons for leaving them in.
- Evaluating the "fit" of a multiple regression equation, or the strength of the relationship:
  - The R-squared value for a multiple regression equation tends to increase with each variable added, reaching its maximum as the number of variables approaches the total number of cases, N.
  - The "adjusted" R-squared weighs the additional explanation contributed by a new variable against the loss of one degree of freedom for entering it in the equation.
  - If the adjusted R-squared goes down with the addition of a new variable, that variable ordinarily should not be included in the equation.
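A minimal sketch of the adjustment, using the standard formula and made-up inputs:

```python
# Sketch: adjusted R-squared penalizes R-squared for the degrees of freedom
# used up by the k predictors in an equation fit to n cases.
def adjusted_r_squared(r2, n, k):
    """r2: ordinary R-squared, n: number of cases, k: number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# e.g., an R-squared of .41 from 2 predictors and 20 cases
print(adjusted_r_squared(0.41, n=20, k=2))   # about .34
```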