L22-Pearson Correlation

Product-Moment Correlation

Measures for continuous variables: regression and correlation

Theoretical statements are statements of relationships between variables

They adopt the general form: the greater A, the greater B (or the greater A, the less B)

In the social sciences, these theoretical statements are probabilistic rather than deterministic

Because the relationships don't hold unfailingly, there is room to doubt whether an observed relationship exceeds what chance factors alone would produce

Regression and correlation analysis allow us to measure the strength of the observed relationship between variables that are hypothesized to be related.

Regression analysis produces the BEST prediction (in a least-squares sense) of a Y variable from knowledge of X.

Given any two-dimensional pattern of points, there is one line that can be passed through the points to minimize the "squared deviations" from that line.
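As a preview, here is a minimal Python sketch of the least-squares idea. It assumes the standard closed-form solution (slope = sum of cross-products divided by the sum of squares of X), which these notes have not yet introduced, and it uses made-up scores that are not from the course data. The function names are illustrative only.

```python
# Made-up illustrative scores; they are not from the course data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def mean(values):
    return sum(values) / len(values)


def least_squares_line(x, y):
    """Return (intercept, slope) of the line minimizing squared deviations in Y."""
    x_bar, y_bar = mean(x), mean(y)
    cross_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    slope = cross_products / ss_x
    intercept = y_bar - slope * x_bar
    return intercept, slope


def squared_deviations(x, y, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


a, b = least_squares_line(x, y)
best = squared_deviations(x, y, a, b)

# Nudging the intercept or slope in either direction increases the sum,
# which is what "best in a least-squares sense" means.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert squared_deviations(x, y, a + da, b + db) > best
```

Any other straight line through these points gives a larger sum of squared deviations than the one returned by `least_squares_line`.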

We will learn later this week how to calculate that line.

Correlation analysis succinctly summarizes the "fit" of the observed values to those predicted by the "regression line."

The term "regression" comes from Sir Francis Galton's pioneering analyses in the 1880s on the relationship between parents' mean heights and height of adult offspring.

After plotting parent height on the baseline and offspring height on the ordinate, he observed that the relationship was linear and devised a procedure for finding the straight line that best fit the plotted points, in the sense that it minimized the deviations of the points from the line.

He called this line the "regression" or "reversion" line, because he noted that short parents tended to have slightly taller offspring while tall parents tended to have slightly shorter offspring.

It was left to Karl Pearson to perfect the formula for summarizing the closeness of the fit of points to the regression line: the product-moment correlation coefficient.

Neither regression nor correlation PROVES any causal connection between the variables, but the techniques do permit tests for non-chance relationships.
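One way to see what a "test for a non-chance relationship" means is a permutation test, sketched below in Python. This is a swapped-in illustration, not the significance test that SPSS reports: it shuffles the Y scores many times and asks how often chance alone produces a correlation at least as strong as the observed one. The data are made up, and the function names, the number of shuffles, and the seed are all assumptions chosen for the example.

```python
import math
import random


def pearson_r(x, y):
    """Correlation as covariation over the root of the product of sums of squares."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    covariation = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    return covariation / math.sqrt(ss_x * ss_y)


def permutation_p_value(x, y, n_shuffles=2000, seed=0):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real pairing between X and Y
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (hits + 1) / (n_shuffles + 1)


# Made-up scores with a strong direct relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.3, 8.9, 10.2, 10.8]
p = permutation_p_value(x, y)
```

A small `p` says only that the relationship is unlikely to be due to chance; it says nothing about whether X causes Y.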

The Pearson Product-Moment CORRELATION COEFFICIENT, r:
The product-moment correlation, commonly expressed as r, indicates the strength of a relationship between two variables that are assumed to be measured on an interval or ratio scale.

Properties of the correlation coefficient:

r ranges from -1.0 to +1.0; the sign indicates whether the relationship is direct (+) or inverse (-).

The absolute value of the coefficient indicates its strength.
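These two properties can be checked with a few made-up scores. The Python sketch below assumes the standard sums-of-squares form of r; the function name `pearson_r` and the data are illustrative only.

```python
import math


def pearson_r(x, y):
    """Covariation divided by the square root of the product of the sums of squares."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    covariation = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    return covariation / math.sqrt(ss_x * ss_y)


x = [1, 2, 3, 4, 5]

# Every point falls exactly on a rising line: perfect direct relationship.
r_direct = pearson_r(x, [2 * xi + 1 for xi in x])

# Every point falls exactly on a falling line: perfect inverse relationship.
r_inverse = pearson_r(x, [10 - 3 * xi for xi in x])

# Scattered points: same sign as r_direct, but a much smaller absolute value.
r_weak = pearson_r(x, [3, 1, 4, 1, 5])
```

Here `r_direct` is +1.0, `r_inverse` is -1.0, and `r_weak` is about 0.35: direct like the first, but far weaker.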

Many ways to skin a cat--and to compute the correlation coefficient, r.

The SPSS Guide on page 178 gives as the formula for r

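Personally, I regard this formula as too difficult to explain, and I prefer the following as a definitional formula:

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y}) / N}{\sqrt{\sum (X_i - \bar{X})^2 / N}\,\sqrt{\sum (Y_i - \bar{Y})^2 / N}}$$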

Interpreting the numerator in the definitional formula (above) for the correlation coefficient:

The top term, $\sum (X_i - \bar{X})(Y_i - \bar{Y})$, is the sum of the products of the joint deviations of the individual X and Y scores from their respective means. This term is called the cross-product sum of squares or simply the cross-product.

The cross-product gives the correlation its name as the product-moment correlation, for it is the product of the moments (deviations) of the X and Y values from their respective means.

Another common name for the cross-product is covariation, for the larger the value, the more the two variables vary together or covary.

As in the formula above, the covariation can be adjusted for the number of observations by dividing by N (number of cases) to produce the covariance--the average or mean amount that the paired observations covary.

Interpreting the denominator in the formula for the correlation coefficient:

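Close examination of this term, $\sqrt{\sum (X_i - \bar{X})^2 / N}\,\sqrt{\sum (Y_i - \bar{Y})^2 / N}$, reveals that it is the product of the standard deviations (the square roots of the variances) of the X and Y variables.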

This term is used in the denominator to adjust for the overall variation in the individual variables.

In effect, the complete formula expresses the extent to which the X and Y variables covary as a proportion of the product of the standard deviations of the X and Y variables.
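The numerator and denominator just described can be put together in a short Python sketch. It follows the divisor-of-N definitions used in these notes; the function names and the scores are made up for illustration.

```python
import math


def covariance(x, y):
    """Numerator: average cross-product of deviations (covariation / N)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n


def standard_deviation(values):
    """Square root of the variance, using a divisor of N."""
    n = len(values)
    v_bar = sum(values) / n
    return math.sqrt(sum((v - v_bar) ** 2 for v in values) / n)


def r_definitional(x, y):
    """Covariance as a proportion of the product of the standard deviations."""
    return covariance(x, y) / (standard_deviation(x) * standard_deviation(y))


# Made-up scores, not from the course data.
x = [2, 4, 6, 8, 10]
y = [1, 3, 2, 5, 4]
r = r_definitional(x, y)
```

For these scores the covariance is 3.2 and the product of the standard deviations is 4.0, so r is 0.8.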

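NOTE: Sometimes the deviations of X and Y from their means are written as lowercase x and y, that is, $x = X_i - \bar{X}$ and $y = Y_i - \bar{Y}$. Then r is expressed as:

$$r = \frac{\sum xy / N}{\sqrt{\sum x^2 / N}\,\sqrt{\sum y^2 / N}}$$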

Alternative ways to calculate the correlation coefficient:

The method above can be described as the ratio of covariance over the product of the standard deviations.

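This method can be expressed in these alternative ways, where $s_X$ and $s_Y$ are the standard deviations of X and Y:

$$r = \frac{\text{covariance}(X, Y)}{s_X \, s_Y} = \frac{\text{covariance}(X, Y)}{\sqrt{\text{variance}(X)\,\text{variance}(Y)}}$$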

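Alternatively, one can compute r with a formula that factors out the N's in the numerator and denominator--thus relying on only the covariation and the product of the square roots of the sums of squares (variation).

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2}\,\sqrt{\sum (Y_i - \bar{Y})^2}}$$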
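To see why the N's can be factored out, here is a short worked derivation. It uses the lowercase deviation notation, with x and y standing for the deviations of X and Y from their means; the steps are ordinary algebra rather than anything specific to these notes.

```latex
\begin{aligned}
r &= \frac{\sum xy / N}{\sqrt{\sum x^2 / N}\,\sqrt{\sum y^2 / N}} \\
  &= \frac{\sum xy / N}{\frac{1}{N}\sqrt{\sum x^2 \sum y^2}} \\
  &= \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}
\end{aligned}
```

The two square roots in the denominator each contribute a factor of one over the square root of N, which together cancel the N in the numerator.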

REVIEW OF TERMS

Measures of variation for a single variable
SUM OF SQUARES = SS (also known simply as VARIATION) = $\sum (X_i - \bar{X})^2$

VARIANCE ("mean" sum of squares, mean as in "average") = $\sum (X_i - \bar{X})^2 / N$

STANDARD DEVIATION (square root of variance) = $\sqrt{\sum (X_i - \bar{X})^2 / N}$



Measures of variation for two variables taken together

COVARIATION (sum of cross-product deviation = cross-product SS) = $\sum (X_i - \bar{X})(Y_i - \bar{Y})$

COVARIANCE (average cross-product deviation) = $\sum (X_i - \bar{X})(Y_i - \bar{Y}) / N$
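The terms reviewed above translate almost word for word into Python. This is a minimal sketch that assumes the divisor-of-N definitions used in these notes; the function names and the scores are made up for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sum_of_squares(values):
    """VARIATION: sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def variance(values):
    """VARIANCE: the "mean" sum of squares, SS / N."""
    return sum_of_squares(values) / len(values)


def standard_deviation(values):
    """STANDARD DEVIATION: square root of the variance."""
    return math.sqrt(variance(values))


def covariation(x, y):
    """COVARIATION: sum of the cross-product deviations."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))


def covariance(x, y):
    """COVARIANCE: average cross-product deviation, covariation / N."""
    return covariation(x, y) / len(x)


# Made-up scores, not from the course data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
```

With these scores, `sum_of_squares(x)` is 10 and `variance(x)` is 2, while `covariation(x, y)` is 6 and `covariance(x, y)` is 1.2. In each pair the second value is the first divided by N = 5.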

Comments on the suffixes "ATION" and "ANCE"

"ATION" refers to "SUM OF SQUARES"

"ANCE" refers to average or mean sum of squares, so these terms have a divisor of N.

SPSS procedures for correlational analysis:

From the Analyze Menu choose Correlate and then Bivariate

Transfer the variables you want to correlate into the right hand window.

Before clicking on OK to run the correlation, click on Options

Check both boxes in the "Statistics" area ("Means and standard deviations" and "Cross-product deviations and covariances")

Then click the "Continue" button and then "OK" to run the correlation.
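One thing to keep in mind when reading the output: this sketch assumes that SPSS, like most statistics packages, computes its printed standard deviations and covariances with a divisor of N - 1, whereas these notes divide by N (worth confirming against your own output). The Python sketch below uses made-up scores to show that the choice changes the covariance and standard deviations but not r. The helper `covariance` and its `divisor` parameter are hypothetical names for this illustration; `statistics.pstdev` and `statistics.stdev` are the standard-library N and N - 1 versions.

```python
import statistics


def covariance(x, y, divisor):
    """Covariation divided by the chosen divisor (N or N - 1)."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    covariation = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return covariation / divisor


# Made-up scores, not from the course data.
x = [2, 4, 6, 8, 10]
y = [1, 3, 2, 5, 4]
n = len(x)

# Divisor of N throughout, as in these notes.
r_with_n = covariance(x, y, n) / (statistics.pstdev(x) * statistics.pstdev(y))

# Divisor of N - 1 throughout, as assumed here for the SPSS output.
r_with_n_minus_1 = covariance(x, y, n - 1) / (
    statistics.stdev(x) * statistics.stdev(y)
)
```

Both versions give r = 0.8 for these scores, because the divisor cancels between numerator and denominator as long as it is used consistently.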

Your next research assignment:

Formulate a hypothesis about the effect of a socioeconomic variable on any political variable at the state level.

Run Correlation with the nustates2000 data to compute the correlation.

One possibility would be to hypothesize about the causes of the Bush vote in the 2000 election.

Alternatively, you may wish to hypothesize about the causes of voting turnout in 2000.

Compute the correlation.

How strong is the relationship between your independent variable and your dependent variable?

I will give a Valuable Prize to the student who produces the strongest valid correlation coefficient.

A valid coefficient is one that reflects a genuine association, one not based on artifacts of measurement.