Multiple Regression

Review of points in linear regression analysis:
 When the
regression line is calculated for raw data for X
and Y,
 the regression line
passes through the means of both
variables.
 The "unstandardized" slope of the regression line, b, gives the change in Y that is due to a change of one unit of X.
 Obviously,
interpreting the change in Y depends on
the "units" used to measure both X and
Y.
 A slope of 0 (b=0)
would indicate the absence of a correlation
between X and Y.
 The correlation
coefficient, r, would thus also be
zero.
 Therefore, testing
for the significance of the slope (b), is
equivalent to testing for the correlation coefficient
(r).
 However, the simple magnitude of a nonzero b-coefficient is not a reliable guide to the size of the correlation coefficient, for the slope depends on the ratio of the covariation of X and Y to the variation of X.
 This is equivalent to the ratio of the covariance of X and Y to the variance of X.
b = covariation of X and Y / variation of X

  = Sum(X_{i} - X̄)(Y_{i} - Ȳ) / Sum(X_{i} - X̄)^{2}
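To make the formula concrete, here is a minimal Python sketch that computes b directly from the covariation and variation sums; the X and Y values are made up purely for illustration.

```python
# Minimal sketch of the unstandardized slope b; the data are made up.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 6]

mean_x = sum(X) / len(X)
mean_y = sum(Y) / len(Y)

# b = covariation of X and Y / variation of X
covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
variation = sum((x - mean_x) ** 2 for x in X)
b = covariation / variation

# Because the regression line passes through the means of both
# variables, the intercept follows as a = mean(Y) - b * mean(X).
a = mean_y - b * mean_x
print(b, a)
```

For these values b works out to .8, so each one-unit increase in X is associated with a .8-unit increase in Y.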

 When the regression line is calculated for z-scores computed from the same data:
 the regression line still passes through the means of both variables, which in the case of z-scores are both 0.
 Therefore, the
intercept is 0, and the a term in the
unstandardized regression equation simply drops
out.
 Because the data are standardized, the slope of the equation is itself a standardized value, called a beta-coefficient.
 The beta-coefficient also measures the change in Y coming from a one-unit change in X, but now the "units" are in "standard deviations."
 The correlation
coefficient, r, remains the same whether
calculated for the same data in raw or standardized
form.
 Moreover, the correlation coefficient, r, is equal to the standardized beta-coefficient.
 The great value of the beta-coefficient is that it expresses the "effect" of one variable on another without regard to how differently the variables are scaled.
 It does this by expressing the change in Y in standard deviations produced by a change of one standard deviation in X.
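These claims about z-scores can be verified in a short Python sketch (with illustrative made-up data): regressing the z-scores of Y on the z-scores of X yields a slope equal to Pearson's r.

```python
import statistics

# Made-up data for illustration.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 6]

def z_scores(values):
    m = statistics.mean(values)
    s = statistics.pstdev(values)  # population SD, to match the sums below
    return [(v - m) / s for v in values]

zx, zy = z_scores(X), z_scores(Y)

# Slope of the regression of zy on zx: the beta-coefficient.
beta = sum(u * v for u, v in zip(zx, zy)) / sum(u * u for u in zx)

# Pearson's r computed directly from the raw data.
mx, my = statistics.mean(X), statistics.mean(Y)
r = (sum((x - mx) * (y - my) for x, y in zip(X, Y))
     / (sum((x - mx) ** 2 for x in X) * sum((y - my) ** 2 for y in Y)) ** 0.5)

print(beta, r)  # the two values are identical
```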
Multiple regression analysis:
 An extension of
simple regression to the case of multiple
independent variables, X_{1} to X_{n},
and a single dependent variable, Y:
 It is most
appropriate when Y is a continuous
variable.
 The classic model
also assumes the X variables to be continuous.
 They certainly cannot be polychotomous nominal variables,
 but regression analysis can handle nominal independent variables in dichotomous form, so-called "dummy" variables.
 The effects of
dummy variables are more easily interpreted when
they are scored "0" and "1" to indicate the absence
or presence of an attribute.
 Then the presence of the attribute (e.g., SOUTH indicating a southern state if "1" and non-southern if "0") turns on the associated coefficient, which shows the effect of SOUTH on Y.
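A small Python sketch can show how a 0/1 dummy works; the SOUTH coding and Y values below are hypothetical. With a single 0/1 dummy, the intercept is the mean of the "0" group and the b-coefficient is the difference between the two group means.

```python
# Hypothetical data: SOUTH = 1 for southern states, 0 otherwise.
south = [0, 0, 0, 1, 1, 1]
y     = [10, 12, 14, 20, 22, 24]

mean_s = sum(south) / len(south)
mean_y = sum(y) / len(y)

b = (sum((s - mean_s) * (v - mean_y) for s, v in zip(south, y))
     / sum((s - mean_s) ** 2 for s in south))
a = mean_y - b * mean_s

# a is the mean of the non-southern group; b is how much being
# southern "adds" to Y when the dummy turns on.
print(a, b)
```

Here the non-southern mean is 12 and the southern mean is 22, so a = 12 and b = 10.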
 Under the Analyze menu in SPSS 10, the Regression menu offers several choices; choose Linear.
 The dialog box allows you to enter a single dependent variable but multiple independent variables.
 The "Statistics" button offers several boxes to check; the most relevant for our class are:
 Regression Coefficient "Estimates"
 Model
fit
 Descriptives
 Unfortunately, the
"Plots" button only produces plots for
residuals, so it is not useful for
us.
 The "Method" box offers several options; the most useful for us are:
 Enter, which automatically forces in all the independent variables you listed.
 Stepwise, which proceeds step-by-step:
 first entering the variable that explains the most variance, if it is significant at .05,
 then the variable that explains most of the remaining variance, if it is significant at .05,
 and so on.
 The mathematics of regression analysis operates on the intercorrelations among all the variables, dependent and independent.
 An intercorrelation matrix is a table of all possible bivariate Pearson correlations among the variables Y, X_{1}, . . . , X_{n}.
 Regression analysis seeks the additive combination of independent variables that produces the best linear relationship between the observed Y values and the Y values predicted by the resulting regression equation.
 The multiple correlation coefficient, R, is equal to the product-moment correlation, r, that would be calculated if the observed values were correlated with the values computed from the regression equation.
 Similarly, the multiple R-squared is equal to the proportion of variance in the dependent variable Y that is explained by the additive combination of effects of the independent variables, X_{1} to X_{n}.
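These two definitions can be checked directly in Python. The data and the fitted coefficients below are invented for illustration; the point is only that the multiple R is the Pearson r between observed and predicted Y.

```python
# Made-up data and a hypothetical fitted equation Y-hat = a + b1*X1 + b2*X2.
X1 = [1, 2, 3, 4, 5, 6]
X2 = [2, 1, 4, 3, 6, 5]
Y  = [3, 4, 8, 9, 13, 14]
a, b1, b2 = 0.5, 1.0, 1.0

Y_hat = [a + b1 * u + b2 * v for u, v in zip(X1, X2)]

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((p - mx) * (q - my) for p, q in zip(x, y))
    den = (sum((p - mx) ** 2 for p in x) * sum((q - my) ** 2 for q in y)) ** 0.5
    return num / den

R = pearson_r(Y, Y_hat)  # multiple R
print(R, R ** 2)         # R-squared: proportion of variance explained
```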
 Multiple regression analysis assesses only direct, additive effects. In the simplest model, X_{1} and X_{2} each directly affect Y and are uncorrelated with each other, such that r_{x1x2} = 0.
 If this causal model
actually operated for the variables, then one could
obtain the same results by computing two separate,
simple regression equations and adding together the
variance in Y that is explained separately by
X_{1} and X_{2}.
 Y = beta(X_{1}) = .5(X_{1}); r = .5; r-squared = .25
 Y = beta(X_{2}) = .4(X_{2}); r = .4; r-squared = .16
 Variance explained by adding the variance explained by X_{1} and X_{2} = .25 + .16 = .41
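The additivity claim can be checked with a short Python sketch. The data below are constructed (not real) so that X1 and X2 are exactly uncorrelated; the separately explained variances then sum to the multiple R-squared.

```python
# Made-up data in which X1 and X2 are exactly uncorrelated and Y is
# 0.5*X1 + 0.4*X2 plus a residual orthogonal to both predictors.
X1 = [1, -1, 1, -1]
X2 = [1, 1, -1, -1]
Y  = [1.9, -1.1, -0.9, 0.1]

def r_squared(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((p - mx) * (q - my) for p, q in zip(x, y))
    den = sum((p - mx) ** 2 for p in x) * sum((q - my) ** 2 for q in y)
    return num * num / den

r1_sq, r2_sq = r_squared(X1, Y), r_squared(X2, Y)

# With orthogonal zero-mean predictors, the multiple-regression slopes
# equal the simple-regression slopes, so R-squared can be computed directly.
b1 = sum(p * q for p, q in zip(X1, Y)) / sum(p * p for p in X1)
b2 = sum(p * q for p, q in zip(X2, Y)) / sum(p * p for p in X2)
Y_hat = [b1 * p + b2 * q for p, q in zip(X1, X2)]
R_sq = sum(v ** 2 for v in Y_hat) / sum(v ** 2 for v in Y)

print(r1_sq, r2_sq, R_sq)  # R_sq equals r1_sq + r2_sq
```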
 This would be equivalent to constructing a regression equation from the beta-coefficients in the two independent regression equations:
 EXAMPLE: Y =
beta(X_{1}) + beta(X_{2}) =
.5(X_{1}) + .4(X_{2})
 In most cases
the simple model of uncorrelated independent
variables does not hold, and the effects of variables
computed from separate simple regressions
cannot simply be added together.
 The value of
multiple regression analysis is that it discounts for
overlapping explanation of Y between correlated
independent variables and expresses the NET effects of
each independent variable controlling for any
others in the equation.
 Multiple
regression through the stepwise procedure in SPSS
demonstrates these features.
 The first variable entered in the equation is the one that has the highest simple correlation with the dependent variable Y; that is, the program selects first the variable that explains most of the variance in Y.
 The next variable entered is the one that has the highest partial correlation with Y; that is, the variable that explains most of the remaining variance in Y while controlling for the first, assuming that the variable meets the stated level of significance (the default for entering a variable is .05).
 With the addition of each new variable, a variable previously IN the equation may drop below significance (the default value for removing a variable already in the equation is .10) and be removed from the equation; its explanatory power is replaced by other variables now in the equation.
 And so it goes with
each remaining variable in the set of independent
variables until one of two conditions is met:
 All the variables
are entered in the equation, which usually happens
only when the number of variables is less than
5.
 No additional
variables would show "significant" effects IF added
to the equation.
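The stepwise logic can be sketched in Python. This is schematic only: SPSS enters and removes variables based on partial F/t significance tests (.05 in, .10 out), while this sketch, for simplicity, enters whichever remaining variable most increases R-squared and stops when the gain falls below an arbitrary threshold. The data are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))                 # four candidate predictors
Y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(size=n)

def r_squared(cols):
    """R-squared of the regression of Y on the listed columns of X."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    resid = Y - A @ coef
    return 1.0 - resid.var() / Y.var()

entered, remaining, current = [], [0, 1, 2, 3], 0.0
while remaining:
    gains = {j: r_squared(entered + [j]) - current for j in remaining}
    best = max(gains, key=gains.get)
    if gains[best] < 0.02:                  # arbitrary entry threshold
        break
    entered.append(best)
    remaining.remove(best)
    current = r_squared(entered)

print(entered, round(current, 3))
```

The strongest predictor (column 0) is entered first, then column 2; the two noise variables add too little explained variance to enter.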
 Interpretation of the resulting regression equation can be done using either b-coefficients or beta-coefficients, or both:
 b-coefficients, which are unstandardized, show the net effect on Y associated with a one-unit change in X, all in raw data values.
 beta-coefficients, which are standardized, show the net effect on Y associated with a one-unit change in X, but now the changes are in standard deviations of both variables.
 Because b-coefficients deal with raw (or "original") values, they should be used to construct the prediction equation from the X variables to the Y variable.
 Because beta-coefficients are standardized, they should be used to compare the "effects" of variables within equations.
 Both b-coefficients and beta-coefficients can be interpreted as controlling for the effects of the other variables.
 If the b-coefficient is significant, as determined by applying the t-test to the ratio of the coefficient to its standard error, then the beta-coefficient is also significant.
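The t-test on a b-coefficient is just the ratio of the coefficient to its standard error, compared against the t distribution with n - 2 degrees of freedom (in simple regression). A Python sketch with made-up data:

```python
import math

# Made-up data for illustration.
X = [1, 2, 3, 4, 5, 6, 7, 8]
Y = [2, 3, 5, 4, 6, 7, 8, 9]
n = len(X)

mx, my = sum(X) / n, sum(Y) / n
Sxx = sum((x - mx) ** 2 for x in X)
Sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
b = Sxy / Sxx
a = my - b * mx

# Standard error of b from the residual sum of squares (n - 2 df).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
se_b = math.sqrt(sse / (n - 2) / Sxx)

t = b / se_b
print(b, se_b, t)  # compare t with the critical t for n - 2 df
```

Here t far exceeds the two-tailed .05 critical value for 6 df (2.447), so b is significant.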
 The F value
associated with a multiple regression equation tests
for the significance of the multiple R for the
entire equation.
 It is possible to
have a significant R and some variables that are
not significant.
 These probably should be removed from the equation, for they are not adding any appreciable explanation, but there may be theoretical reasons for leaving them in.
 Evaluating the "fit" of
a multiple regression equation, or the strength of
the relationship:
 The R-squared value for a multiple regression equation tends to increase with the addition of new variables up to the total number of cases, N.
 The "adjusted" R-squared weighs the additional explanation contributed by a new variable against the loss of one degree of freedom for entering it in the equation.
 If the adjusted R-squared goes down with the addition of a new variable, that variable ordinarily should not be included in the equation.
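The adjustment formula itself is simple; a Python sketch with hypothetical numbers shows how adding a weak variable can raise R-squared yet lower adjusted R-squared:

```python
# adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
# where n = number of cases and k = number of independent variables.
def adjusted_r_squared(r_sq, n, k):
    return 1 - (1 - r_sq) * (n - 1) / (n - k - 1)

# Hypothetical: a 4th variable nudges R-squared from .40 to .41
# with n = 30 cases, but adjusted R-squared goes down.
before = adjusted_r_squared(0.40, 30, 3)
after = adjusted_r_squared(0.41, 30, 4)
print(round(before, 3), round(after, 3))  # after < before
```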
