Path: janda.org/c10 > Syllabus > Review: Statistical Inference and Hypothesis Testing
Statistical Inference and Hypothesis Testing

I. Statistical Inference: predicting population values, or PARAMETERS, from sample data.

   A. Sampling theory: if N cases are randomly drawn from some unknown population, one can estimate characteristics of that population with stated levels of accuracy.

      1. The first step is to compute the sample statistic. Examples:
         Mean
         Standard deviation
         Correlation between two variables in the sample
         Regression coefficient
         Intercept

      2. The second step is to determine its sampling distribution. Conceive of drawing an infinite number of samples of size N, for each of which you compute the sample statistic. Due to sampling variation, these statistics would not all have the same values but would distribute symmetrically, and usually "normally," around some "expected" value -- the mean of the distribution of sample values. The exact shape of the sampling distribution depends on
         the sample statistic involved -- mean, correlation, etc.
         the number of cases in the sample -- N
         the variation among the values in the population

      3. The third step is to compute the sampling distribution's standard error. The dispersion of values around the sampling mean can be measured by the standard deviation, which is then known as the standard error of the sampling distribution. For all statistics of interest to us, formulas exist for calculating the standard deviations of hypothetical sampling distributions -- which means that we can calculate their standard errors. For example, the central limit theorem states: if repeated samples of N observations are drawn from a population with mean µ and variance σ², then as N grows large, the sample means will become normally distributed with mean µ, variance σ²/N, and standard deviation σ/√N. Thus, the formula for the standard error of the mean is σ/√N. Knowledge of the standard error of a statistic allows us to place bounds on estimates of population parameters from sample data.

   B. Point v. interval estimates of population parameters:

      1. Point estimates are simply the best guess of the parameter, and they are based on the observed sample statistic itself.
         If the sample statistic is unbiased, the mean of the sampling distribution equals the population parameter, and thus the observed statistic is the best estimate of the population parameter -- example: the mean.
         Bias in a sample statistic exists when the mean of the sampling distribution does not equal the population parameter -- example: the standard deviation, which tends to underestimate the population standard deviation and thus must be corrected by using N-1 in the denominator. Corrections for bias are routinely incorporated into the statistical formulas that you encounter.

      2. Interval estimates state the likely range of values in the population at a chosen level of confidence -- e.g., .95, .99. A typical example of an interval estimate in political research is predicting the likely proportion of the vote for a candidate -- e.g., 95% sure that Reagan will win 52% of the vote, plus or minus 3 percentage points, which yields a 95% confidence interval of 49% to 55% of the vote.

      3. Factors that determine the width of the confidence interval:
         The chosen level of confidence: 1 - alpha
         The standard error of the observed sample statistic -- in the case of predicting the population proportion (a special case of predicting the mean), the s.e. depends on
            the sample size, N
            the variation in the population
         Sample size and population variance are in the formula for the standard error of the mean. The proportion that the sample is of the population is not a major factor in the accuracy of a sample. Assuming sampling without replacement, the complete formula for the standard error of the sample mean includes a "correction factor" to adjust for sample size relative to the size of the population.
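The vote-share interval estimate above can be sketched in Python using the standard error of a proportion, √[p(1-p)/N]. The sample size n = 1000 is an assumed value (the text does not give one), chosen because it yields roughly the ±3-point margin in the example:

```python
# A minimal sketch of a 95% confidence interval for a proportion.
# n = 1000 is a hypothetical sample size; p = 0.52 is the observed
# candidate proportion from the example in the text.
import math

p = 0.52      # observed sample proportion
n = 1000      # hypothetical sample size
z = 1.96      # z-score for a 95% confidence level

se = math.sqrt(p * (1 - p) / n)    # standard error of a proportion
margin = z * se                    # margin of error
lower, upper = p - margin, p + margin

print(f"95% confidence interval: {lower:.3f} to {upper:.3f}")
```

With these assumed numbers the margin works out to about .031, reproducing the "52% plus or minus 3 points" interval of roughly 49% to 55%.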
         The correction factor is √(1 - p), where p = the proportion that the sample is of the population. The correction factor is always less than 1, so multiplying by it reduces the standard error. But it is not important unless the sample approaches 20% of the population, which is when the correction factor begins to reduce the s.e. in any substantial way.

   C. A "rule of thumb" for interval estimates: most sample statistics distribute normally, or approximately so, and standard errors can thus be interpreted as z-scores in a table of areas under the normal curve.
      1. Plus or minus one s.e. embraces about 68% of the occurrences in the hypothetical sampling distribution.
      2. Plus or minus two s.e. embraces about 95% of the sampling distribution.
      3. Thus, doubling the standard error on either side of the mean approximates the 95% confidence interval for estimating the population mean from sample data.
      4. This rule applies as well to other sample statistics -- e.g., estimating the confidence interval of a b-coefficient in a regression equation from knowledge of the s.e. of b.

II. Hypothesis testing

   A. Refer to the distinction between the research hypothesis and the test or null hypothesis. This has been discussed in several places; see the 2/3 Review for one.

   B. Hypothesis testing typically translates into testing whether the observed value differs "significantly" from some specified value:
      How big is the difference in comparison with sampling fluctuation?
      How does the test statistic distribute (i.e., what's its standard error)?
      How "significant" a difference will you accept (i.e., your alpha value)?
      Are you making a one-tailed or a two-tailed test?

   C. General procedure in testing the significance of a statistic:
      1. Look at the value that you observe in the sample.
      2. Subtract the value that you expected to find.
      3. Compute a test statistic (e.g., z-score or t-value) according to the appropriate formula, usually dividing by the standard error of the statistic.
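The observe-subtract-divide steps above can be sketched as a z-test of a sample mean against a hypothesized value; all of the numbers here are hypothetical, chosen only for illustration:

```python
# A minimal sketch of the general procedure: (observed - expected) / s.e.
# All values are hypothetical illustrations, not from the text.
import math
from statistics import NormalDist

observed_mean = 52.3   # value observed in the sample
expected_mean = 50.0   # value expected under the null hypothesis
sigma = 10.0           # population standard deviation (assumed known)
n = 100                # sample size

se = sigma / math.sqrt(n)                 # standard error of the mean
z = (observed_mean - expected_mean) / se  # test statistic

# Two-tailed significance from areas under the normal curve
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, two-tailed p = {p_value:.4f}")
```

Here z = 2.3 exceeds the two-tailed .05 cutoff of about 1.96, so the difference would be judged significant at alpha = .05.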
   D. Illustrative standard errors:
      Mean: σ/√N
      Proportion: √[p(1-p)/N]
      Difference between means: √[(σ1²/N1) + (σ2²/N2)]
      Correlation coefficient: typically, another approach is used to test for r. Where rho is assumed to be 0.0 (i.e., r is tested against the null hypothesis of no correlation in the population), the t-test is used:
         t = r√(N-2) / √(1-r²)
      b-coefficient: our texts did not discuss the s.e. of b, but SPSS 10 calculated it.

   E. Enter the appropriate table of the distribution of the test statistic to determine its likelihood of occurrence (its significance). Distributions of test statistics:
      Normal distribution
      t-distribution (df based on N-1 for means, or N-2 for r)
      F-distribution (df based on k-1 and N-k)
      chi-square distribution (df based on table size)
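The t-test of a correlation against rho = 0.0 can be sketched as follows; the observed r and N are hypothetical values for illustration, and the critical value comes from a standard t-table:

```python
# A minimal sketch of testing r against the null hypothesis rho = 0.
# r = 0.45 and N = 30 are hypothetical illustrative values.
import math

r = 0.45   # observed sample correlation (hypothetical)
N = 30     # number of cases (hypothetical)

# t-statistic with N-2 degrees of freedom
t = r * math.sqrt(N - 2) / math.sqrt(1 - r * r)

# Two-tailed critical value for alpha = .05 with 28 df is about 2.048
# (from a t-table).
critical_t = 2.048
significant = abs(t) > critical_t
print(f"t = {t:.2f}, df = {N - 2}, reject null: {significant}")
```

With these numbers t is about 2.67, which exceeds the critical value, so the null hypothesis of no correlation in the population would be rejected at the .05 level.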