A problem for multiple regression
- One dependent variable: REAGAN84 -- % vote by state for Reagan in 1984
- Three independent variables:
  - % vote for Reagan in 1980 -- the past voting history of the state
  - % Black -- the "Jackson factor"
  - % Women -- the "Ferraro factor"
- Correlation matrix: a table showing intercorrelations among all variables
- Bivariate results
  - As shown previously, the vote for Reagan in 1980 is strongly related to the vote for Reagan in 1984.
  - The vote for Reagan in both 1980 and 1984 is strongly correlated with % Black and % Women -- but slightly LESS so in 1984 than in 1980.
  - The two demographic variables, % Black and % Women, are themselves highly correlated.
- Multiple regression is needed to assess the combined explanatory power of all three independent variables, magically (i.e., statistically) adjusting for their intercorrelations (a sketch of both steps follows below).
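Here is a minimal sketch of both steps in Python (numpy only). The five rows of state data are invented for illustration; they are not the actual election returns.

    import numpy as np

    # Hypothetical values for five states (NOT the real election data):
    # columns = REAGAN84, REAGAN80, % Black, % Women
    data = np.array([
        [62.0, 60.5, 10.2, 51.0],
        [58.3, 55.1, 15.8, 51.5],
        [49.9, 47.2, 25.6, 52.0],
        [66.1, 63.0,  3.1, 50.4],
        [55.4, 52.8, 18.9, 51.8],
    ])

    # Correlation matrix: intercorrelations among all four variables
    print(np.corrcoef(data, rowvar=False).round(2))

    # Multiple regression: REAGAN84 on the three independent variables,
    # by ordinary least squares (the column of 1s supplies the intercept a)
    y = data[:, 0]
    X = np.column_stack([np.ones(len(data)), data[:, 1:]])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(coefs)  # [a, b_REAGAN80, b_Black, b_Women]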
The explanatory model underlying multiple regression
- Here is a schematic presentation of the simplest model of additive effects:
- Here is the regression model expressed mathematically: Yi = a + b1X1i + b2X2i + b3X3i
- Assumptions of this simple additive model
  - Independent variables are not causes of each other.
  - They are not caused by other common variables.
  - Therefore the independent variables are uncorrelated.
- Departures from the additive nature of explanation:
  - In a strictly additive model, the variance explained by each variable can be added to that explained by the other variables.
  - In practice, strict additivity rarely obtains, because independent variables tend to be correlated (see the numeric check after this list).
  - In practice, each variable added to the equation increases its explanatory power, but by decreasing amounts.
  - Usually, there is little increase after adding 5 or 6 variables.
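A small numeric check of this point on synthetic data: when the independent variables are uncorrelated, R-squared comes out (almost exactly) as the sum of the squared bivariate correlations; once they are correlated, R-squared falls short of that sum.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    def r_squared(y, X):
        """R^2 from an OLS fit of y on X (intercept included)."""
        Xa = np.column_stack([np.ones(len(y)), X])
        yhat = Xa @ np.linalg.lstsq(Xa, y, rcond=None)[0]
        return 1 - np.sum((y - yhat)**2) / np.sum((y - y.mean())**2)

    # Case 1: uncorrelated predictors -- variance explained is additive
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    y = x1 + x2 + rng.normal(size=n)
    r1, r2 = np.corrcoef(x1, y)[0, 1], np.corrcoef(x2, y)[0, 1]
    print(r_squared(y, np.column_stack([x1, x2])), r1**2 + r2**2)   # ~equal

    # Case 2: correlated predictors -- R^2 is less than the sum
    x2c = 0.8*x1 + 0.6*rng.normal(size=n)
    yc = x1 + x2c + rng.normal(size=n)
    r1c, r2c = np.corrcoef(x1, yc)[0, 1], np.corrcoef(x2c, yc)[0, 1]
    print(r_squared(yc, np.column_stack([x1, x2c])), r1c**2 + r2c**2)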
Multiple regression as an extension of simple linear regression
- Linear regression model for one independent variable
  - Unstandardized or "raw" data: Yi = a + bXi
  - Standardized data (transformed into z-scores): Zyi = βZxi
    - where β is a standardized coefficient, not a population parameter
- Extension of linear model to multiple independent variables
  - Unstandardized model uses b-coefficients: Yi = a + b1X1i + b2X2i + b3X3i + ... + bnXni
  - Standardized model uses beta-coefficients: Zyi = β1Z1i + β2Z2i + β3Z3i + ... + βnZni (see the sketch at the end of this section)
- Conceptualization
  - Whereas in simple regression we calculated the best-fitting line that could pass through a series of points in two-dimensional space --
  - in multiple regression we seek the best-fitting "hyperplane" that can pass through a mass of points in (k+1)-dimensional space.
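Here is a brief sketch of the two forms on synthetic data: regressing the raw variables recovers the intercept a and the b's, while regressing the z-scored variables recovers the β's (the intercept vanishes because standardized variables have mean 0). All data values are made up.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500
    X = rng.normal(loc=[50, 10], scale=[8, 3], size=(n, 2))  # two fake predictors
    y = 5 + 0.4*X[:, 0] + 2.0*X[:, 1] + rng.normal(scale=4, size=n)

    def z(a):
        """Transform to z-scores (mean 0, standard deviation 1)."""
        return (a - a.mean(axis=0)) / a.std(axis=0)

    # Unstandardized ("raw") model: Yi = a + b1*X1i + b2*X2i
    Xa = np.column_stack([np.ones(n), X])
    print(np.linalg.lstsq(Xa, y, rcond=None)[0].round(3))       # [a, b1, b2]

    # Standardized model: Zy = beta1*Z1 + beta2*Z2
    print(np.linalg.lstsq(z(X), z(y), rcond=None)[0].round(3))  # [beta1, beta2]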
Computation and interpretation of β and b -- the REGRESSION COEFFICIENTS
- These values are computed by the computer, and in practice you will not have to calculate them yourself.
- Computation of β
- Relationship of β to b: βk = bk(sXk / sY), where sXk and sY are the standard deviations of Xk and Y
- Interpretation
  - bi is the expected change in Y with a one-unit change in Xi when all other variables are controlled.
    - See the example based on the SPSS Users' Guide, p. 198.
  - βi, however, represents the expected change in Y in standard deviation units with a one standard deviation change in Xi when the other independent variables are controlled.
- The b coefficients offer the advantage of interpretation in the original units of measurement.
- They have the disadvantage of making it difficult to compare the effects of independent variables, for variables may vary widely in means and standard deviations and thus in their b values.
- Consider the effect of income in dollars as an independent variable -- a one-unit change (i.e., $1) is likely to have a very small effect on any Y, and thus b would be tiny.
- If income were measured instead in thousands of dollars, the b coefficient would be larger -- for a one-unit change (i.e., $1,000) would have a larger effect on Y.
- β coefficients have the advantage of being directly comparable in the relative importance of their effects on Y, but they cannot be interpreted in the original measurement scale (both points are sketched below).
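A short sketch of both points on invented income data: β can be recovered from b as b(sX / sY), and rescaling income from dollars to thousands of dollars multiplies b by 1,000 while leaving β untouched.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 1_000
    income = rng.normal(40_000, 12_000, n)        # fake income in dollars
    y = 20 + 0.0005*income + rng.normal(0, 5, n)  # fake dependent variable

    def slope(y, x):
        """Slope b from a simple regression of y on x (with intercept)."""
        X = np.column_stack([np.ones(len(x)), x])
        return np.linalg.lstsq(X, y, rcond=None)[0][1]

    b_dollars = slope(y, income)
    b_thousands = slope(y, income / 1_000)
    print(b_dollars, b_thousands)                 # b_thousands ~ 1000 * b_dollars

    # beta is scale-free: b * (sd of X / sd of Y) is identical either way
    print(b_dollars * income.std() / y.std())
    print(b_thousands * (income / 1_000).std() / y.std())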
Interpretation of the multiple correlation coefficient, R
- R is the product-moment correlation between the dependent variable and the predicted values as generated by the multiple regression equation, using the b-coefficients.
  - See the example that illustrates the meaning of R using data to explain female life expectancy across nations.
- R² as a PRE (proportional-reduction-in-error) measure of association
- Adjusted R² -- adjusting for the number of variables in the equation:
  - According to the logic of multiple regression, each new variable adds "something" to the explanation.
  - After a point, the addition of each new variable adds less and less to the explanation -- until the addition of a new variable is not "worth" its contribution.
  - The formula for the "adjusted R²" allows for the number of variables involved in the equation (computed in the sketch below):

    Adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1)

    where: p = number of independent variables in the equation, n = number of cases
  - In general, one should not add variables beyond the point at which the adjusted R² begins to decrease.
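A minimal numeric illustration on synthetic data: R computed directly as the correlation between Y and the predicted values, R² as its square, and the adjusted R² from the formula above.

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 200, 3                                # n cases, p independent variables
    X = rng.normal(size=(n, p))
    y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

    Xa = np.column_stack([np.ones(n), X])
    yhat = Xa @ np.linalg.lstsq(Xa, y, rcond=None)[0]

    R = np.corrcoef(y, yhat)[0, 1]               # correlation of Y with predicted Y
    R2 = R**2                                    # proportional reduction in error
    R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
    print(R.round(3), R2.round(3), R2_adj.round(3))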
Assumptions of multiple regression
- Normality and equality of variance for the distributions of Y for each value of the X variables (homoscedasticity, discussed earlier)
- Independence of observations on Y (i.e., not repeated measures on the same unit, as in the paired-samples design for the T-Test)
- Linearity of relationships between Y and the X variables (a rough residual check is sketched below)
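These assumptions can be roughly checked from the residuals. A minimal sketch on synthetic data, slicing the residuals by fitted value: group means near 0 are consistent with linearity, and similar group spreads with homoscedasticity.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 1_000
    X = rng.normal(size=(n, 2))
    y = 2 + X @ np.array([1.0, -0.5]) + rng.normal(size=n)

    Xa = np.column_stack([np.ones(n), X])
    resid = y - Xa @ np.linalg.lstsq(Xa, y, rcond=None)[0]
    fitted = y - resid

    # Slice the cases into quartiles of the fitted values; print each
    # group's residual mean (linearity) and spread (homoscedasticity)
    order = np.argsort(fitted)
    for chunk in np.array_split(resid[order], 4):
        print(round(chunk.mean(), 3), round(chunk.std(), 3))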