
 Measures for
continuous variables: regression and
correlation

Theoretical statements are
statements of relationships between variables.
 They adopt the general
form: the greater A, the greater B (or the greater A, the
less B).
 In the social sciences,
these theoretical statements are probabilistic
rather than deterministic.
 Because the
relationships don't hold unfailingly, there is room to
doubt whether the relationship exceeds that due to chance
factors.
Regression and correlation
analysis allow us to measure the strength of the observed
relationship between variables that are hypothesized to be
related.
 Regression analysis
produces the BEST prediction (in a least-squares sense)
of a Y variable from knowledge of X.
 Given any
two-dimensional pattern of points, there is one line
that can be passed through the points to minimize the
"squared deviations" from that line.
 We will learn later
this week how to calculate that line.
 Correlation analysis
succinctly summarizes the "fit" of the observed values to
those predicted by the "regression line."
 The term "regression"
comes from Sir Francis Galton's pioneering analyses in
the 1880s on the relationship between parents' mean
heights and height of adult offspring.
 After plotting
parent height on the horizontal axis (abscissa) and
offspring height on the vertical axis (ordinate), he observed that the
relationship was linear and devised a procedure for
finding the straight line that best fit the plotted
points in the sense that it minimized the deviations
of the points from the line.
 He called this line
the "regression" or "reversion" line, because he noted
that short parents tended to have slightly taller
offspring while tall parents tended to have slightly
shorter offspring.
 It was left to Karl
Pearson to perfect the formula for summarizing the
closeness of the fit of points to the regression line
in the product-moment correlation
coefficient.
 Neither regression nor
correlation PROVES any causal connection between the
variables, but the techniques do permit tests for
non-chance relationships.
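The least-squares line mentioned above can be previewed with a small Python sketch. It uses the standard closed-form solution (slope = cross-product of deviations divided by the sum of squares of X; intercept chosen so the line passes through the two means), which is an assumption about the method to be covered later, not a reproduction of it. The function name and the five data points are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line that minimizes the
    sum of squared vertical deviations of the points from the line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of paired deviations from the means
    cross_product = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared deviations of X from its mean
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cross_product / ss_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data, invented for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(xs, ys)
print(f"predicted Y = {intercept:.2f} + {slope:.2f} * X")
```

For these made-up points the slope is 0.6 and the intercept is 2.2, so each one-unit increase in X predicts a 0.6-unit increase in Y.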
The Pearson Product-Moment CORRELATION
COEFFICIENT, r:
The product-moment
correlation, commonly expressed as r, indicates the
strength of a relationship between two variables that
are assumed to be measured on an interval or ratio
scale.
 Properties of the
correlation coefficient:
 r ranges from
-1.0 to +1.0, the sign indicating whether the
relationship is direct (+) or inverse (-).
 The absolute value of
the coefficient indicates its strength.
 Many ways to skin a
cat, and to compute the correlation coefficient,
r.
 The SPSS Guide
on page 178 gives as the formula for r
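 Personally, I regard
this formula as too difficult to explain, and I prefer
the following as a definitional formula:

$$ r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y}) / N}{\sqrt{\left[\sum (X_i - \bar{X})^2 / N\right]\left[\sum (Y_i - \bar{Y})^2 / N\right]}} $$
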
 Interpreting the
numerator in the definitional formula (above) for the
correlation coefficient:
 The top term,
$\sum (X_i - \bar{X})(Y_i - \bar{Y})$,
is the sum of the products of the paired deviations of the
individual X and Y scores from their respective
means. This term is called the cross-product sum
of squares or simply the
cross-product.
 The cross-product
gives the correlation its name as the
product-moment correlation, for it is the product
of the moments (deviations) of the X and Y values
from their respective means.
 Another common
name for the cross-product is covariation,
for the larger the value, the more the two
variables vary together or covary.
 As in the
formula above, the covariation can be adjusted for
the number of observations by dividing by N (number
of cases) to produce the covariance: the
average or mean amount that the paired observations
covary.
 Interpreting the
denominator in the formula for the correlation
coefficient:
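 Close examination
of this term, $\sqrt{\left[\sum (X_i - \bar{X})^2 / N\right]\left[\sum (Y_i - \bar{Y})^2 / N\right]}$,
reveals that it is the square root of the product of the variances of
the X and Y variables, which equals the product of their standard deviations.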
 This term is used
in the denominator to adjust for the overall
variation in the individual variables.
 In effect,
the complete formula expresses the extent to which
the X and Y variables covary as a proportion
of the product of the standard deviations of
the X and Y variables.
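As a rough illustration of the definitional formula described above (covariance divided by the product of the standard deviations), here is a minimal Python sketch. The function name `pearson_r` and the five data points are hypothetical, chosen only for the example. Every divisor is N, following the text.

```python
import math


def pearson_r(xs, ys):
    """Correlation as covariance over the product of standard deviations.
    All averages use N (the number of cases) as the divisor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance: average product of the paired deviations
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Standard deviations: square roots of the variances
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return covariance / (sd_x * sd_y)


# Hypothetical data, invented for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print(f"r = {r:.3f}")  # prints r = 0.775

# Reversing the direction of Y flips the sign but not the strength
r_inverse = pearson_r(xs, [-y for y in ys])
print(f"r = {r_inverse:.3f}")  # prints r = -0.775
```

The second call shows the two properties listed earlier: the sign carries the direction of the relationship, and the absolute value carries its strength.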
 NOTE:
Sometimes, the deviations of X and Y from their means
are expressed as lowercase x and y.
Then r is expressed as:
$$ r = \frac{\sum xy / N}{\sqrt{\left(\sum x^2 / N\right)\left(\sum y^2 / N\right)}} $$
Alternative ways to calculate the correlation
coefficient:
 The method above can be
described as the ratio of covariance over the product
of the standard deviations.
 This method can be
expressed in these alternative ways:
$$ r = \frac{\mathrm{cov}(X, Y)}{s_X s_Y} = \frac{\sum xy / N}{s_X s_Y} $$
 Alternatively, one can
compute r with a formula that factors out the N's
in the numerator and denominator, thus relying on only
the covariation and the square root of the product of
the sums of squares (variation).
$$ r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} $$
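The cancellation of the N's described above can be written out step by step. This is a sketch of the algebra, using lowercase x and y for deviations from the means as in the NOTE earlier:

```latex
\begin{align*}
r &= \frac{\sum xy / N}{\sqrt{\left(\sum x^2 / N\right)\left(\sum y^2 / N\right)}} \\
  &= \frac{\sum xy / N}{\frac{1}{N}\sqrt{\sum x^2 \sum y^2}} \\
  &= \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}
\end{align*}
```

With hypothetical sums of 6 for the covariation and 10 and 6 for the two sums of squares, r = 6 / sqrt(60), or about 0.775, whatever the value of N.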
REVIEW OF TERMS
Measures of
variation for a single variable
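SUM OF
SQUARES = SS (also known simply as VARIATION)
$$ SS_X = \sum (X_i - \bar{X})^2 $$
 VARIANCE
("mean" sum of squares, mean as in "average")
$$ s_X^2 = \frac{\sum (X_i - \bar{X})^2}{N} $$
 STANDARD
DEVIATION (square root of variance)
$$ s_X = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N}} $$
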
 Measures of variation
for two variables taken together
 COVARIATION
(sum of cross-product deviations = cross-product SS) =
$$ \sum (X_i - \bar{X})(Y_i - \bar{Y}) $$
 COVARIANCE
(average cross-product deviation) =
$$ \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{N} $$
 Comments on the suffixes
"ATION" and "ANCE"
 "ATION" refers
to "SUM OF SQUARES"
 "ANCE" refers
to average or mean sum of squares, so these
terms have a divisor of N.
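The terms reviewed above can be laid side by side in a short Python sketch. The function names and the data are hypothetical. The point is only to show that each "ANCE" measure is the matching "ATION" measure divided by N.

```python
import math


def variation(xs):
    """Sum of squares (SS): sum of squared deviations from the mean."""
    mean_x = sum(xs) / len(xs)
    return sum((x - mean_x) ** 2 for x in xs)


def variance(xs):
    """Mean sum of squares: variation divided by N."""
    return variation(xs) / len(xs)


def standard_deviation(xs):
    """Square root of the variance."""
    return math.sqrt(variance(xs))


def covariation(xs, ys):
    """Cross-product SS: sum of the products of paired deviations."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def covariance(xs, ys):
    """Average cross-product deviation: covariation divided by N."""
    return covariation(xs, ys) / len(xs)


# Hypothetical data, invented for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print("variation of X:    ", variation(xs))
print("variance of X:     ", variance(xs))
print("std. deviation of X:", round(standard_deviation(xs), 3))
print("covariation of X, Y:", covariation(xs, ys))
print("covariance of X, Y: ", covariance(xs, ys))
```

Note that some software divides by N - 1 rather than N when reporting variances and covariances, so printed output may differ slightly from these definitions.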
SPSS procedures for correlational
analysis:
 From the Analyze menu, choose Correlate
and then Bivariate.
 Transfer the variables you want to correlate into
the right-hand window.
 Before clicking on OK to run the correlation,
click on Options.
 Check off both boxes in the "Statistics"
area.
 Then click the "Continue" button
and then "OK" to run the correlation.
Your next research assignment:
 Formulate a
hypothesis about the effect of a socioeconomic
variable on any political variable at the state
level.

 Run Correlation
with the nustates2000 data to compute the
correlation.

 One possibility would
be to hypothesize about the causes of the Bush vote in
the 2000 election.
 Alternatively, you
may wish to hypothesize about the causes of voting
turnout in 2000.
 Compute the
correlation.
 How strong is the
relationship between your independent variable and
your dependent variable?
 I will give a
Valuable Prize to the student who produces
the strongest valid correlation coefficient.
 A valid
coefficient is one that reflects a genuine
association, one not based on artifacts of
measurement.

