
Modeling Relationships between Continuous Variables
Introduction to Multiple Regression

A problem for multiple regression

  • One dependent variable: REAGAN84 -- % vote by state for Reagan in 1984
  • Three independent variables:
    • % vote for Reagan in 1980 <------ past voting history of state
    • % Black <--------------------------- the "Jackson factor"
    • % Women <------------------------ the "Ferraro factor"
  • Correlation matrix: A table showing intercorrelations among all variables
  • Bivariate results
    • As shown previously, the vote for Reagan in 1980 is strongly related to the vote for Reagan in 1984.
    • The vote for Reagan in both 1980 and 1984 is strongly correlated with % Black and % Women -- but slightly LESS in 1984 than 1980.
    • The two demographic variables, Black and Women, are themselves highly correlated.
  • Multiple regression is needed to assess the combined explanatory power of all three independent variables, magically (i.e., statistically) adjusting for their intercorrelations 
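
A correlation matrix of the kind described above can be sketched in a few lines of Python. The numbers below are made-up values for five hypothetical cases, not the state voting data, and the names y, x1, x2, pearson_r and correlation_matrix are placeholders chosen for this illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return {name: {name: r}} for every pair of named columns."""
    return {a: {b: pearson_r(columns[a], columns[b]) for b in columns}
            for a in columns}

# Hypothetical illustrative data (NOT the 1984 election data).
data = {
    "y":  [2.0, 4.0, 5.0, 4.0, 5.0],
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [5.0, 3.0, 4.0, 2.0, 1.0],
}
matrix = correlation_matrix(data)

# Print one row per variable; the diagonal is 1 and the table is symmetric.
for name, row in matrix.items():
    print(name, [round(row[other], 2) for other in data])
```

Reading across a row shows how strongly that variable is related to every other one, which is how the intercorrelations among the independent variables are spotted before running the regression.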

The explanatory model underlying multiple regression

  • Schematic presentation of the simplest model of additive effects: [figure not reproduced]
  • The regression model expressed mathematically, for three independent variables: Y = a + b1X1 + b2X2 + b3X3
  • Assumptions of this simple additive model
    • Independent variables are not causes of each other.
    • They are not caused by other common variables.
    • Therefore the independent variables are uncorrelated.
  • Departures from the additive nature of explanation:
    • In a strict additive model, the variance explained by each variable can be added to that explained by the other variables.
    • In practice, strict additivity rarely obtains because independent variables tend to be correlated.
    • In practice, every variable added to the equation will increase its power of explanation but in decreasing amounts.
    • Usually, there is little increase after adding 5 or 6 variables. 
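
The point about additivity can be made concrete for the two-variable case. The sketch below uses the standard formula for R-squared with two independent variables, written in terms of the three pairwise correlations. The function name and the correlation values are assumptions chosen for this illustration only.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R-squared for Y regressed on X1 and X2, computed from the three
    pairwise correlations (standard two-predictor formula)."""
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

# Illustrative correlations, chosen for the example only.
uncorrelated = r_squared_two_predictors(0.6, 0.5, 0.0)
correlated = r_squared_two_predictors(0.6, 0.5, 0.5)

# With r_12 = 0 the result equals 0.6**2 + 0.5**2: strict additivity.
print(round(uncorrelated, 3))
# With r_12 = 0.5 the two variables overlap, and in this example the
# combined R-squared falls below the sum of the separate r-squared values.
print(round(correlated, 3))
```

When the independent variables are uncorrelated, each one's explained variance simply adds to the other's. Once they are correlated, part of what each explains is shared, so the second variable contributes less than its bivariate correlation alone would suggest.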

Multiple regression as an extension of simple linear regression

  • Linear regression model for one independent variable
    • Unstandardized or "raw" data: Yi = a + bXi
    • Standardized data (transformed into z-scores): zYi = β zXi
      where β is a standardized coefficient, not a population parameter
  • Extension of linear model to multiple independent variables
    • Unstandardized model uses b-coefficients: Yi = a + b1X1i + b2X2i + b3X3i + . . . + bnXni
    • Standardized model uses beta-coefficients: zYi = β1z1i + β2z2i + β3z3i + . . . + βnzni
  • Conceptualization
    • Whereas in simple regression, we calculated the best fitting line that could pass through a series of points in two-dimensional space--
    • in multiple regression, we seek the best fitting "hyperplane" that can pass through a mass of points in (k+1)-dimensional space (k independent variables plus the dependent variable).
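
Fitting the best "plane" through a mass of points can be sketched with ordinary least squares, solved here through the normal equations. This is a minimal standard-library illustration, not the routine a statistics package actually uses. The function names and the data are hypothetical, and the fitted model is assumed to include an intercept.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_multiple_regression(xs, y):
    """Least-squares fit of y on the predictors; each row of xs is one case.
    Returns [a, b1, ..., bk] with the intercept first."""
    design = [[1.0] + list(row) for row in xs]  # leading 1 for the intercept
    k = len(design[0])
    # Normal equations: (X'X) coefficients = X'y
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data generated exactly from Y = 1 + 2*X1 + 3*X2,
# so the fitted plane should recover those coefficients.
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in xs]
a, b1, b2 = fit_multiple_regression(xs, y)
print(round(a, 6), round(b1, 6), round(b2, 6))
```

With two independent variables the fitted surface is an ordinary plane in three-dimensional space. Adding more columns to each row of xs extends the same calculation to a hyperplane.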

Computation and interpretation of β and b -- the REGRESSION COEFFICIENTS

  • These values are computed by the computer, and in practice you will not have to calculate them yourself.
    • Computation of β
    • Relationship of β to b: βi = bi (sXi / sY), where sXi and sY are the standard deviations of Xi and Y

  • Interpretation 
    • bi is the expected change in Y with a one unit change in Xi when all other variables are controlled.
      See the example based on the SPSS Users' Guide, p. 198.
    • βi, however, represents the expected change in Y in standard deviation units with a one standard deviation change in Xi when the other independent variables are controlled.
    • The b coefficients offer the advantage of interpretation in the original units of measurement.
    • They have the disadvantage of making it difficult to compare the effects of independent variables, for variables may vary widely in means and standard deviations and thus in the size of their b values.
    • Consider the effect of income in dollars as an independent variable -- one unit change (i.e., $1) is likely to have a very small effect on any Y and thus b would be tiny.
    • If measured instead in thousands of dollars, the b coefficient would be larger -- for one unit change (i.e., $1,000) would have a larger effect on Y.
    • β coefficients have the advantage of being directly comparable in relative importance of effects on Y, but they can't be interpreted in the original measurement scale.
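
The income example above can be checked numerically. The sketch below uses a single independent variable to keep it short. It relies on the standard relationship beta = b * (s_x / s_y), which also holds coefficient by coefficient in multiple regression. The incomes, the outcome scores and the function names are hypothetical.

```python
from math import sqrt

def slope(x, y):
    """Unstandardized b for a simple regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def std_dev(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def beta_from_b(b, x, y):
    """Standardized coefficient: beta = b * (s_x / s_y)."""
    return b * std_dev(x) / std_dev(y)

# Hypothetical incomes and outcome scores, for illustration only.
income_dollars = [20000.0, 35000.0, 50000.0, 65000.0, 80000.0]
income_thousands = [d / 1000 for d in income_dollars]
y = [3.0, 4.0, 6.0, 7.0, 9.0]

b_dollars = slope(income_dollars, y)
b_thousands = slope(income_thousands, y)
beta_dollars = beta_from_b(b_dollars, income_dollars, y)
beta_thousands = beta_from_b(b_thousands, income_thousands, y)

print(b_dollars, b_thousands)        # b differs by a factor of 1,000
print(beta_dollars, beta_thousands)  # beta does not depend on the unit
```

Changing the unit of measurement rescales b but leaves the standardized coefficient untouched, which is why standardized coefficients are the ones to compare across variables measured on different scales.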

Interpretation of the multiple correlation coefficient, R

  • R is the product moment correlation between the dependent variable and the predicted values as generated by the multiple regression equation, using the b-coefficients. 
    See the example that illustrates the meaning of R using data to explain female life expectancy across nations.
  • R² as a PRE (proportional-reduction-in-error) measure of association
  • Adjusted R² -- adjusting for the number of variables in equation:
    • According to the logic of multiple regression, each new variable adds "something" to the explanation.
    • After a point, the addition of each new variable adds less and less to the explanation -- until the addition of a new variable is not "worth" its contribution.
    • The formula for the "adjusted R²" allows for the number of variables involved in the equation:
        Adjusted R² = 1 - (1 - R²)(N - 1) / (N - p - 1)
        where: p = number of independent variables in equation
               N = number of cases
    • In general, one should not add variables beyond the point that the adjusted R² begins to decrease.
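
Both ideas in this section can be sketched briefly: R-squared as a proportional reduction in error, and the adjustment for the number of independent variables. The adjustment uses the commonly cited formula 1 - (1 - R2)(N - 1)/(N - p - 1), assumed here to be the one intended. The function names, the data and the values of N and p are hypothetical.

```python
def r_squared(y, y_hat):
    """R-squared as proportional reduction in error:
    (error predicting from the mean - error predicting from the regression)
    divided by the error predicting from the mean."""
    mean_y = sum(y) / len(y)
    total_error = sum((yi - mean_y) ** 2 for yi in y)
    residual_error = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return (total_error - residual_error) / total_error

def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n cases and p independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical observed values and regression predictions, illustration only.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
r2 = r_squared(y, y_hat)
print(round(r2, 3))

# The same R-squared is penalized more heavily as p grows.
for p in (1, 3, 6):
    print(p, round(adjusted_r_squared(0.50, n=50, p=p), 3))
```

Because the penalty grows with p, a new variable raises the adjusted value only if its contribution to R-squared outweighs the cost of the extra term.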

Assumptions of multiple regression

  • Normality and equality of variance for distributions of Y for each value of the X variables (homoscedasticity, discussed earlier)
  • Independence of observations on Y (i.e., not repeated measures on the same unit, as in the paired samples design for the t-test)
  • Linearity of relationships between Y and X variables