Statistical Inference and Hypothesis Testing
I. Statistical Inference: predicting population values, or PARAMETERS, from sample data:
- Sampling theory
  - If N cases are randomly drawn from some unknown population, one can estimate characteristics of that population with stated levels of accuracy.
  - First step is to compute the sample statistic -- examples:
    - Mean
    - Standard deviation
    - Correlation between two variables in the sample
    - Regression coefficient
    - Intercept
  - Second step is to determine its sampling distribution.
    - Conceive of drawing an infinite number of samples of size N, for each of which you compute the sample statistic (approximated by simulation in the sketch below this list).
    - Due to sampling variation, these statistics would not all have the same values but would distribute symmetrically and usually "normally" around some "expected" value -- the mean of the distribution of sample values.
    - The exact shape of the sampling distribution depends on
      - the sample statistic involved -- mean, correlation, etc.
      - the number of cases in the sample -- N
      - the variation among the values in the population
  - Third step is to compute the sampling distribution's standard error.
    - The dispersion of values around the sampling mean can be measured by the standard deviation, which is known as the standard error of the sampling distribution.
    - For all statistics of interest to us, formulas exist for calculating the standard deviations of hypothetical sampling distributions -- which means that we can calculate their standard errors.
    - For example, the central limit theorem states:
      - If repeated samples of N observations are drawn from a population with mean $\mu$ and variance $\sigma^2$, then as N grows large, the sample means will become normally distributed with mean $\mu$, variance $\sigma^2/N$, and standard deviation $\sigma/\sqrt{N}$.
    - Thus, the formula for the standard error of the mean is $s.e._{\bar{x}} = \sigma/\sqrt{N}$.
  - Knowledge of the standard error of a statistic allows us to place bounds on the estimates of population parameters from sample data.
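As a concrete illustration (hypothetical data; NumPy assumed), here is a minimal sketch of placing bounds on a population mean using the standard error $s/\sqrt{N}$ from the formula above:

```python
import numpy as np

sample = np.array([23.0, 19.5, 31.2, 27.8, 22.4, 25.1, 29.9, 21.7])  # hypothetical measurements

n = len(sample)
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean, s / sqrt(N)

# Roughly 95% bounds: the mean plus or minus two standard errors
low, high = mean - 2 * se, mean + 2 * se
print(f"point estimate = {mean:.2f}, 95% interval roughly ({low:.2f}, {high:.2f})")
```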
- Point v. interval estimates of population parameters:
  - Point estimates are simply the best guess of the parameter, and they are based on the observed sample statistic itself.
    - If the sample statistic is unbiased, the mean of the sampling distribution equals the population parameter, and thus the observed statistic is the best estimate of the population parameter -- Example: the mean.
    - Bias in a sample statistic exists when the mean of the sampling distribution does not equal the population parameter -- Example: the standard deviation, which tends to underestimate the population standard deviation and thus must be corrected by using N-1 in the denominator (demonstrated in the sketch below).
    - Corrections for bias are routinely incorporated into the statistical formulas that you encounter.
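The N versus N-1 correction is easy to demonstrate by simulation. In this sketch (assuming NumPy; the population is invented), the uncorrected standard deviation systematically underestimates the population value, while the N-1 version comes closer on average:

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=50.0, scale=10.0, size=100_000)  # true s.d. is about 10
true_sd = population.std()

n = 5
biased, corrected = [], []
for _ in range(20_000):
    sample = rng.choice(population, size=n)
    biased.append(sample.std(ddof=0))     # divides by N
    corrected.append(sample.std(ddof=1))  # divides by N-1

print(f"population s.d. = {true_sd:.2f}")
print(f"average with N   in denominator: {np.mean(biased):.2f}   (too low)")
print(f"average with N-1 in denominator: {np.mean(corrected):.2f} (closer)")
```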
  - Interval estimates state the likely range of values in the population at a chosen level of confidence -- e.g., .95, .99.
    - A typical example of an interval estimate in political research is predicting the likely proportion of the vote for a candidate -- e.g., 95% sure that Reagan will win 52% of the vote, plus or minus 3 percentage points, which yields a 95% confidence interval of 49% to 55% of the vote (computed in the sketch after this list).
    - Factors that determine the width of the confidence interval:
      - The chosen level of confidence: 1 - alpha
      - The standard error of the observed sample statistic -- in the case of predicting the population proportion (a special case of predicting the mean), the s.e. depends on
        - The sample size, N
        - The variation in the population
      - Sample size and population variance are in the formula for the standard error of the mean.
      - The proportion that the sample is of the population is not a major factor in the accuracy of a sample.
        - Assuming sampling without replacement, the complete formula for the standard error of the sample mean includes a "correction factor" to adjust for sample size relative to the size of the population.
          - The correction factor is $\sqrt{1 - p}$, where p = the proportion that the sample is of the population.
          - The correction factor is always less than 1.
            - So multiplying by the correction factor reduces the standard error.
            - But it is not important unless the sample approaches 20% of the population, when the "correction factor" begins to reduce the s.e. in any substantial way.
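Both the vote-share interval and the correction factor can be sketched in a few lines. This is pure illustration: the 52% proportion, N = 1,000, and the population sizes are invented, and the $\sqrt{1-p}$ form of the correction factor is used as stated above.

```python
import math

p_hat = 0.52      # hypothetical sample proportion favoring the candidate
n = 1_000         # hypothetical sample size

se = math.sqrt(p_hat * (1 - p_hat) / n)       # s.e. of a proportion
low, high = p_hat - 2 * se, p_hat + 2 * se    # roughly a 95% confidence interval
print(f"95% CI: {low:.1%} to {high:.1%} (margin about {2 * se:.1%})")

# Finite-population "correction factor": sqrt(1 - p), where p is the
# fraction of the population that was sampled.
for pop in (1_000_000, 10_000, 5_000):
    frac = n / pop
    print(f"sample is {frac:.1%} of population -> correction factor {math.sqrt(1 - frac):.3f}")
```

Note how the factor stays near 1 until the sample reaches a sizable fraction of the population, which is why it is usually ignored.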
- A "Rule of Thumb" for
interval estimates:
  - Most sample statistics distribute normally, or approximately so, and standard errors can thus be interpreted as z-scores in a table of areas under the normal curve.
    - Plus or minus one s.e. would embrace about 68% of the occurrences in the hypothetical sampling distribution.
    - Plus or minus two s.e. would embrace about 95% of the sampling distribution.
  - Thus, doubling the standard error on either side of the mean approximates the 95% confidence interval for estimating the population mean from sample data (confirmed in the sketch below this list).
  - This rule applies as well to other sample statistics -- e.g., estimating the confidence interval of a b-coefficient in a regression equation from knowledge of the s.e. of b.
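Assuming SciPy is available, the 68% and 95% figures can be read straight off the normal curve rather than from a printed table; a quick check:

```python
from scipy.stats import norm

for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)  # area within +/- k standard errors
    print(f"+/- {k} s.e. covers {coverage:.1%} of the sampling distribution")
# +/- 1 s.e. -> about 68.3%; +/- 2 s.e. -> about 95.4%
```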
II. Hypothesis testing
- Refer to the distinction between the research hypothesis and the test or null hypothesis.
  - This has been discussed in several places; see the 2/3 Review for one.
- Hypothesis testing typically translates into testing whether the observed value differs "significantly" from some specified value.
  - How big is the difference in comparison with sampling fluctuation?
  - How does the test statistic distribute (i.e., what's its standard error)?
  - How "significant" a difference will you accept (i.e., your alpha value)?
  - Are you making a one-tailed or a two-tailed test?
- General procedure in testing the significance of a statistic (a complete worked sketch follows this list):
  - Look at the value that you observe in the sample.
  - Subtract the value that you expected to find.
  - Compute a test statistic (e.g., z-score or t-value) according to the appropriate formula, usually dividing by the standard error of the statistic.
  - Illustrative standard errors (worked numerically in the sketch after this list):
    - Mean: $s.e._{\bar{x}} = \sigma/\sqrt{N}$, estimated from the sample as $s/\sqrt{N}$
    - Proportion: $s.e._{p} = \sqrt{p(1-p)/N}$
    - Difference between means: $s.e._{\bar{x}_1 - \bar{x}_2} = \sqrt{s_1^2/N_1 + s_2^2/N_2}$
    - Correlation coefficient (typically, another approach is used to test r: where rho is assumed to be 0.0, i.e., tested against the null hypothesis of no correlation in the population, the t-test is used):
      - $t = \dfrac{r\sqrt{N-2}}{\sqrt{1-r^2}}$, with N - 2 degrees of freedom
    - b-coefficient (our texts did not discuss the s.e. of b, but SPSS 10 calculated it).
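A worked numeric sketch of the formulas above (all numbers hypothetical; SciPy is assumed for the t distribution):

```python
import math
from scipy.stats import t as t_dist

# s.e. of a mean: s / sqrt(N)
s, n = 12.0, 64                      # hypothetical sample s.d. and size
print(f"s.e. of mean       = {s / math.sqrt(n):.3f}")

# s.e. of a proportion: sqrt(p(1-p)/N)
p, n_p = 0.52, 1_000                 # hypothetical
print(f"s.e. of proportion = {math.sqrt(p * (1 - p) / n_p):.4f}")

# s.e. of a difference between means: sqrt(s1^2/N1 + s2^2/N2)
s1, n1, s2, n2 = 12.0, 64, 15.0, 81  # hypothetical
print(f"s.e. of difference = {math.sqrt(s1**2 / n1 + s2**2 / n2):.3f}")

# t-test for r against rho = 0: t = r * sqrt(N-2) / sqrt(1 - r^2), df = N-2
r, n_r = 0.35, 30                    # hypothetical correlation and sample size
t_stat = r * math.sqrt(n_r - 2) / math.sqrt(1 - r**2)
p_value = 2 * t_dist.sf(abs(t_stat), df=n_r - 2)   # two-tailed
print(f"t = {t_stat:.2f}, df = {n_r - 2}, two-tailed p = {p_value:.3f}")
```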
  - Enter the appropriate table of the distribution of the statistic to determine its likelihood of occurrence (its significance).
  - Distribution of test statistics:
    - Normal distribution
    - t-distribution (df based on N-1 for means, or N-2 for r)
    - F-distribution (df based on k-1 and N-k)
    - chi-square distribution (df based on table size)
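Putting the whole procedure together (observe, subtract the expected value, divide by the standard error, enter the table), here is a minimal end-to-end sketch with invented group summaries, assuming SciPy; it tests a difference between two means against a null hypothesis of no difference:

```python
import math
from scipy.stats import norm

# Hypothetical group summaries
mean1, s1, n1 = 74.2, 11.0, 120
mean2, s2, n2 = 70.8, 12.5, 135

observed = mean1 - mean2                   # step 1: the value observed in the samples
expected = 0.0                             # step 2: the value under the null hypothesis
se = math.sqrt(s1**2 / n1 + s2**2 / n2)    # s.e. of the difference between means
z = (observed - expected) / se             # step 3: the test statistic

# Step 4: "enter the table" -- here the normal table, two-tailed
p_value = 2 * norm.sf(abs(z))
print(f"z = {z:.2f}, two-tailed p = {p_value:.4f}")
print("reject H0 at alpha = .05" if p_value < 0.05 else "fail to reject H0 at alpha = .05")
```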