Statistical Inference and Hypothesis Testing

I. Statistical Inference: predicting population values, or PARAMETERS, from sample data:

  • Sampling theory
    • If N cases are randomly drawn from some unknown population, one can estimate characteristics of that population with stated levels of accuracy.
    • First step is to compute the sample statistic -- examples:
      • Mean
      • Standard deviation
      • Correlation between two variables in the sample
      • Regression coefficient
      • Intercept
    • Second step is to determine its sampling distribution.
      • Conceive of drawing an infinite number of samples of size N, for each of which you compute the sample statistic.
      • Due to sampling variation, these statistics would not all have the same values but would distribute symmetrically and usually "normally" around some "expected" value -- the mean of the distribution of sample values.
      • The exact shape of the sampling distribution depends on
        • the sample statistic involved -- mean, correlation, etc.
        • the number of cases in the sample -- N
        • the variation among the values in the population
    • Third step is to compute the sampling distribution's standard error.
      • The dispersion of values around the sampling mean can be measured by the standard deviation, which becomes known as the standard error of the sampling distribution.
      • For all statistics of interest to us, formulas exist for calculating the standard deviations of hypothetical sampling distributions -- which means that we can calculate their standard errors.
      • For example, the central limit theorem states:
      If repeated samples of N observations are drawn from a population with mean µ and variance σ², then as N grows large, the sample means will become normally distributed with mean µ, variance σ²/N, and standard deviation σ/√N.
      • Thus, the formula for the standard error of the mean is σ/√N.
    • Knowledge of the standard error of a statistic allows us to place bounds on the estimates of population parameters from sample data.
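The three steps above can be sketched with a small simulation (the uniform population, seed, and sizes here are illustrative assumptions, not from the notes):

```python
import random
import statistics

random.seed(42)

# Hypothetical population: uniform on [0, 10), so mu = 5
# and sigma^2 = 100/12.
N = 100         # cases per sample
SAMPLES = 2000  # number of repeated samples drawn

# Steps 1-2: draw repeated samples of size N and compute each sample mean,
# building an (approximate) sampling distribution.
sample_means = [
    statistics.mean(random.uniform(0, 10) for _ in range(N))
    for _ in range(SAMPLES)
]

# Step 3: compare the dispersion of the sampling distribution with the
# theoretical standard error of the mean, sigma / sqrt(N).
mu = 5.0
sigma = (100 / 12) ** 0.5
se_theory = sigma / N ** 0.5
se_observed = statistics.stdev(sample_means)

print(f"mean of sample means: {statistics.mean(sample_means):.3f} (mu = {mu})")
print(f"observed s.e.: {se_observed:.4f}; sigma/sqrt(N): {se_theory:.4f}")
```

The mean of the simulated sampling distribution lands near µ, and its standard deviation tracks σ/√N, as the central limit theorem predicts.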
  • Point v. interval estimates of population parameters:
    • Point estimates are simply the best guess of the parameter, and they are based on the observed sample statistic itself.
      • If the sample statistic is unbiased, the mean of the sampling distribution equals the population parameter and thus the observed statistic is the best estimate of the population parameter -- Example: the mean
      • Bias in a sample statistic exists when the mean of the sampling distribution does not equal the population parameter -- Example: the standard deviation, which tends to underestimate the population standard deviation and thus must be corrected by using N-1 in the denominator.
      • Corrections for bias are routinely incorporated into the statistical formulas that you encounter.
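The N-1 correction can be seen in a quick simulation (the discrete population and trial count are illustrative assumptions):

```python
import random
import statistics

random.seed(7)

# Hypothetical population: the integers 0..9, equally likely.
population = list(range(10))
pop_var = statistics.pvariance(population)  # population variance = 8.25

N = 5
TRIALS = 20_000

biased, corrected = [], []
for _ in range(TRIALS):
    sample = [random.choice(population) for _ in range(N)]
    m = statistics.mean(sample)
    ss = sum((x - m) ** 2 for x in sample)
    biased.append(ss / N)            # dividing by N underestimates on average
    corrected.append(ss / (N - 1))   # dividing by N-1 removes the bias

print(f"population variance:             {pop_var}")
print(f"mean of N-divided estimates:     {statistics.mean(biased):.3f}")
print(f"mean of (N-1)-divided estimates: {statistics.mean(corrected):.3f}")
```

Averaged over many samples, the N-divided estimates fall short of the population variance, while the N-1 correction centers the estimates on it.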
    • Interval estimates state the likely range of values in the population at a chosen level of confidence -- e.g., .95, .99.
      • A typical example of an interval estimate in political research is predicting to the likely proportion of vote for a candidate -- e.g., 95% sure that Reagan will win 52% of the vote, plus or minus 3 percentage points, which yields a 95% confidence interval of 49% to 55% of the vote.
      • Factors that determine the width of the confidence interval:
        • The chosen level of confidence: 1 - alpha
        • The standard error of the observed sample statistic -- in the case of predicting to the population proportion (a special case of predicting to the mean) -- the s.e. depends on
          • The sample size, N
          • The variation in the population
        • Sample size and population variance are in the formula for the standard error of the mean.
          • The proportion that the sample is of the population is not a major factor in the accuracy of a sample.
          • Assuming sampling without replacement, the complete formula for the standard error of the sample mean includes a "correction factor" to adjust for sample size relative to the size of the population.
            • The correction factor is approximately √(1 − p), where
              • p = the proportion that the sample is of the population
              • The correction factor is always less than 1.
              • So multiplying by the correction factor reduces the standard error.
              • But it is not important unless the sample approaches 20% of the population, when the correction factor begins to reduce the s.e. in any substantial way.
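The correction factor's behavior can be sketched numerically, using the exact finite-population form √((N_pop − n)/(N_pop − 1)), which is approximately √(1 − p); the population and sample sizes below are made up for illustration:

```python
import math

def correction_factor(sample_n, pop_n):
    """Finite population correction: sqrt((N_pop - n) / (N_pop - 1))."""
    return math.sqrt((pop_n - sample_n) / (pop_n - 1))

pop_n = 100_000
for sample_n in (1_000, 5_000, 20_000, 50_000):
    p = sample_n / pop_n  # proportion the sample is of the population
    cf = correction_factor(sample_n, pop_n)
    print(f"sample is {p:.0%} of population -> correction factor {cf:.4f}")
```

The factor stays near 1 when p is small, so for typical survey samples it barely changes the s.e.; only as p approaches 20% and beyond does multiplying by it shrink the standard error substantially.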
      • A "Rule of Thumb" for interval estimates:
        • Most sample statistics distribute normally, or approximately so, and standard errors can thus be interpreted as z-scores in a table of areas under the normal curve.
        • Plus or minus one s.e. would embrace about 68% of the occurrences in the hypothetical sampling distribution.
        • Plus or minus two s.e. would embrace about 95% of the sampling distribution.
        • Thus, doubling the standard error on either side of the mean approximates the 95% confidence interval for estimating the population mean from sample data.
        • This rule applies as well for other sample statistics -- e.g., estimating the confidence interval of a b-coefficient in a regression equation from knowledge of the s.e. of b.
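The rule of thumb can be checked by simulation: with repeated samples from a hypothetical normal population, intervals of the sample mean plus or minus two standard errors should cover the true mean about 95% of the time (all numbers below are illustrative):

```python
import random
import statistics

random.seed(1)

mu, sigma = 50.0, 10.0  # hypothetical normal population
N = 64
TRIALS = 2000

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(mu, sigma) for _ in range(N)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5  # estimated s.e. of the mean
    # Rule of thumb: mean +/- 2 s.e. approximates the 95% confidence interval.
    if mean - 2 * se <= mu <= mean + 2 * se:
        covered += 1

coverage = covered / TRIALS
print(f"share of intervals covering mu: {coverage:.1%}")
```

The observed coverage comes out close to 95%, matching the plus-or-minus-two-standard-errors rule.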

II. Hypothesis testing

  • Refer to the distinction between the research and the test or null hypothesis.
    • This has been discussed in several places; see the 2/3 Review for one.
  • Hypothesis testing typically translates into testing whether the observed value differs "significantly" from some specified value
    • How big is the difference in comparison with sampling fluctuation?
    • How does the test statistic distribute (i.e., what's its standard error)?
    • How "significant" a difference will you accept (i.e., your alpha value)?
    • Are you making a one-tailed or a two-tailed test?
  • General procedure in testing significance of a statistic:
    • Look at the value that you observe in the sample.
    • Subtract the value that you expected to find.
    • Compute a test statistic (e.g., z-score or t-value) according to the appropriate formula, usually dividing by the standard error of the statistic.
    • Illustrative standard errors:
      • Difference between means: s.e. = √(s₁²/N₁ + s₂²/N₂)
      • Correlation coefficient: typically, another approach is used to test r. Where rho is assumed to be 0.0 (i.e., r is tested against the null hypothesis of no correlation in the population), the t-test is used: t = r√((N − 2)/(1 − r²)).
      • b-coefficient: our texts did not discuss the s.e. of b, but SPSS 10 calculated it.
    • Enter the appropriate table of distribution of the statistic to determine its likelihood of occurrence (its significance).
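The general procedure can be sketched for a single mean (the sample values and the expected value of 50 are hypothetical; the critical t-value of 2.131 for df = 15, two-tailed at alpha = .05, is from a standard t-table):

```python
import math
import statistics

# Hypothetical sample of scores; test whether the observed mean
# differs significantly from an expected (null-hypothesis) value of 50.
sample = [62, 55, 47, 58, 65, 49, 53, 60, 57, 51, 66, 48, 59, 54, 61, 52]
expected = 50.0

n = len(sample)
observed = statistics.mean(sample)            # step 1: observed value
difference = observed - expected              # step 2: subtract expected value
se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
t = difference / se                           # step 3: test statistic
df = n - 1

print(f"observed mean = {observed:.2f}, t = {t:.2f}, df = {df}")
# Step 4: enter the t-table. For a two-tailed test at alpha = .05
# with df = 15, the critical value is about 2.131.
print("significant at .05" if abs(t) > 2.131 else "not significant at .05")
```

Here |t| exceeds the critical value, so the difference would be judged significant at the .05 level.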
  • Distribution of test statistics:
    • Normal distribution
    • t-distribution (df based on N-1 for means, or N-2 for r)
    • F-distribution (df based on k-1 and N-k)
    • chi-square distribution (df based on table size)
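As an example of the t-distribution with N − 2 degrees of freedom, the t-test for r mentioned above can be sketched (the r and N values are hypothetical; the critical value of 2.060 for df = 25 is from a standard t-table):

```python
import math

def t_for_r(r, n):
    """t-test for a correlation against rho = 0:
    t = r * sqrt((N - 2) / (1 - r^2)), with N - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical example: r = .40 observed in a sample of 27 cases.
r, n = 0.40, 27
t = t_for_r(r, n)
df = n - 2

print(f"t = {t:.2f} with df = {df}")
# Two-tailed critical t at alpha = .05 with df = 25 is about 2.060,
# so this r would be judged significantly different from zero.
```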