Theory, Measurement, and Univariate Statistics
I. Theoretical statements make assertions about broad social and political processes.
- They are concerned with the general situation rather than particular occurrences.
- They involve salient causal factors, but not the only causal factors.
- Therefore, they are probabilistic rather than deterministic statements.
- Statistical analysis offers a rigorous methodology for testing such theoretical statements.
- The first step in the methodology is to formalize the theory.
  - Define the concepts and interrelate them in an abstract propositional statement.
  - Operationalize the abstract concepts by linking them to measurable variables representing the concepts, separated into the
    - Dependent variable: the phenomenon to be explained, and the
    - Independent variables: the explanatory factors.
  - Using these variables, restate the proposition in the form of a research hypothesis, preferably a directional hypothesis.
    - Nondirectional hypotheses merely assert that two or more variables are "related."
    - Directional hypotheses specify whether they are positively or negatively related.
  - Formulate a contradictory or null hypothesis for testing.
    - If the research hypothesis is nondirectional, the corresponding null hypothesis requires a two-tailed test.
    - If it is directional, a one-tailed test is more likely to reject the null hypothesis.
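For illustration, here is a minimal Python sketch (using scipy rather than SPSS, with a made-up test statistic) of why a one-tailed test is more likely to reject the null: the same z statistic yields half the p-value under a directional hypothesis.

    from scipy.stats import norm

    z = 1.80  # hypothetical observed test statistic (standard normal under the null)

    # Two-tailed test: the rejection region is split between both tails.
    p_two_tailed = 2 * (1 - norm.cdf(abs(z)))

    # One-tailed test for a directional hypothesis predicting a positive effect:
    # the entire rejection region sits in one tail, so the p-value is halved.
    p_one_tailed = 1 - norm.cdf(z)

    print(f"two-tailed p = {p_two_tailed:.3f}")  # about .072 -- not significant at .05
    print(f"one-tailed p = {p_one_tailed:.3f}")  # about .036 -- significant at .05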
II. Measurement in statistical analysis:
- The process of operationalizing abstract concepts by linking them to concrete variables involves some form of measurement.
  - For example, the concept of "party identification" is measured by asking questions of respondents in a sample survey.
  - The result is a seven-point scale of "Republicanness" ranging from 0 for "Strong Democrat" to 6 for "Strong Republican."
  - Because measuring a concept typically involves a series of procedures or operations, the measurement process is known as operationalization.
- The types of statistical analysis that can be performed depend on the types of measurement used for the dependent and independent variables.
- There are several approaches to measurement; here are two of the most important views:
  - S.S. Stevens' "levels" of measurement:
    - Nominal: arbitrary numbers pinned to classes, with no magnitudes intended
    - Ordinal: the numbers are orderable in magnitude, but the distances between values are not known
    - Interval: the numbers are orderable and the distances are known, but there is no true zero point
    - Ratio: the numbers are orderable, the distances are known, and there is a true zero point
  - An overlapping distinction: discrete and continuous variables:
    - Discrete variables can assume only a countable (often small) number of values, which can be orderable or nonorderable.
      - Nonorderable discrete corresponds to "nominal" measurement.
      - Orderable discrete corresponds to "ordinal" measurement.
    - Continuous variables can assume any of an infinite number of values on a number line.
      - Time, which can be measured in infinitely small units, would qualify.
      - Strictly speaking, a variable like income, measured in small but discrete units like cents, would not be considered a "continuous" variable.
      - Practically speaking, variables with many "countable" units (e.g., income) are treated as continuous and are sometimes called "quasi-continuous."
- Relevance of these distinctions to statistical analysis:
  - The "purist" would always respect Stevens' four "levels" of measurement in choosing a statistical test.
  - The "pragmatist" would draw the line of permissible operations between orderable and nonorderable discrete variables.
    - The pragmatist would argue that people regularly treat "ordinal" data as interval anyway -- e.g., in computing GPA.
    - Labovitz's research on alternative scoring schemes for ordinal variables shows that -- as long as monotonicity is maintained in the scoring schemes -- even random numbers pinned on ordered categories produce very high intercorrelations among the alternative scoring schemes.
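A rough Python simulation in the spirit of Labovitz's finding (the data and scoring schemes below are invented): several monotone scorings of the same ordered categories, including a randomly generated one, correlate almost perfectly.

    import numpy as np

    rng = np.random.default_rng(0)
    cats = rng.integers(0, 5, size=1000)  # hypothetical ordinal variable, five ordered categories

    # Alternative scoring schemes; each is monotone (higher category -> higher score).
    equal = np.array([0, 1, 2, 3, 4])
    lumpy = np.array([1, 2, 5, 9, 20])
    random_monotone = np.sort(rng.uniform(0, 100, size=5))

    scored = [scheme[cats] for scheme in (equal, lumpy, random_monotone)]

    # The Pearson intercorrelations among the alternative scorings are typically above .95.
    print(np.corrcoef(scored))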
  - Much of what passes for "interval" data in statistical analysis is, in fact, "ordinal," due to the presence of substantial measurement error -- e.g., city population, military spending, percent Hispanic.
    - In 1980, Chicago had 3,005,000 people; New York had 7,072,000.
    - In fact, these figures are not exact, and the actual populations may be off by tens of thousands of people.
    - So the true "interval" between the populations of Chicago and New York is not 4,067,000 but unknown.
    - Like a number pinned on an ordinal category, the population interval is somewhere around 4 million.
  - Moreover, a great deal of professional research treats ordinal data as interval in practice.
  - There is a body of "log-linear" techniques of data analysis that take an entirely different approach to this problem, but they are beyond the scope of this course (see David Knoke and Peter J. Burke, Log-Linear Models (Sage Publications, 1980)).
III. Univariate statistics
- SPSS Procedures for computing univariate statistics:
  - Available under the Analyze menu, then under "Descriptive Statistics"
  - Frequencies for discrete variables
    - Gives counts and percents for each value of a variable.
    - A "Graph" option offers three types of graphs:
      - Bar charts for discrete variables
      - Pie charts, also for discrete variables
      - Histograms for continuous variables
    - A "Statistics" option offers various summary statistics:
      - Three measures of central tendency: mean, median, and mode
      - Three measures of dispersion: standard deviation, variance, and range
    - You can suppress the frequency table for continuous or "quasi-continuous" variables, or choose
  - Descriptives for continuous variables
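As a rough analogue (in Python/pandas, not SPSS; the data frame and column names are made up), Frequencies-style and Descriptives-style output can be produced like this:

    import pandas as pd

    # Hypothetical data: a discrete 0-6 party identification scale and quasi-continuous income.
    df = pd.DataFrame({"party_id": [0, 2, 6, 3, 3, 5, 1, 6, 4, 0],
                       "income": [21000, 35000, 80000, 44000, 52000,
                                  61000, 18000, 95000, 40000, 27000]})

    # Frequencies: counts and percents for each value of a discrete variable.
    counts = df["party_id"].value_counts().sort_index()
    print(pd.DataFrame({"count": counts, "percent": 100 * counts / len(df)}))

    # Descriptives: mean, standard deviation, minimum, maximum, etc. for a continuous variable.
    print(df["income"].describe())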
- Summarizing distributions of single variables
  - Measures of central tendency:
    - Mode: the most frequent value; applies to all types of data
    - Median: the value that divides the cases into two halves; applies to all but nominal data
    - Mean: the sum of all values divided by the number of cases; strictly speaking, it applies only to interval and ratio data, but in practice it is used for discrete orderable data also
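A quick sketch of the three measures on a small set of made-up values:

    from statistics import mean, median, mode

    x = [1, 2, 2, 3, 7]                 # hypothetical orderable values
    print(mode(x), median(x), mean(x))  # mode = 2, median = 2, mean = 15/5 = 3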
  - Measures of dispersion:
    - Measures for nominal data:
      - "Variation" in nominal variables is usually determined by inspecting the frequency distribution.
      - Summary measures of variation are typically based on the proportion of cases in the mode.
      - One such measure is the variation ratio.
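The variation ratio is simply 1 minus the proportion of cases in the modal category; a minimal sketch with hypothetical nominal data:

    from collections import Counter

    # Hypothetical nominal variable: region of residence for 20 respondents.
    region = ["South"] * 9 + ["West"] * 5 + ["Northeast"] * 3 + ["Midwest"] * 3

    counts = Counter(region)
    modal_share = max(counts.values()) / len(region)
    print(1 - modal_share)  # variation ratio = 1 - 9/20 = 0.55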
    - Easy-to-understand measures for continuous variables:
      - Range
      - (Semi-)interquartile range
      - Average (mean) deviation
    - Harder-to-understand measures, but the most important ones:
      - Sums of squares
      - Variance
      - Standard deviation
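A sketch of the computational chain from sums of squares to variance to standard deviation (the values are hypothetical; the sample formula with n - 1 matches what SPSS reports):

    import math

    x = [4, 8, 6, 5, 3, 7, 9, 6]  # hypothetical interval-level values
    n = len(x)
    xbar = sum(x) / n                                 # mean = 6

    data_range = max(x) - min(x)                      # range = 6
    mean_dev = sum(abs(v - xbar) for v in x) / n      # average (mean) deviation = 1.5
    ss = sum((v - xbar) ** 2 for v in x)              # sum of squared deviations = 28
    variance = ss / (n - 1)                           # sample variance = 4
    std_dev = math.sqrt(variance)                     # standard deviation = 2

    print(data_range, mean_dev, ss, variance, std_dev)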
  - Standardizing the dispersion among variables: expressing raw values in terms of their means and standard deviations to produce z-scores:
    - Computation: z = (X - X̄) / s, i.e., subtract the mean from each raw value and divide by the standard deviation.
    - Properties of z-scores: they have a mean of 0 and a standard deviation of 1 (and therefore a variance of 1).
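A minimal sketch of z-scoring and its two properties, using made-up values:

    import statistics

    x = [4, 8, 6, 5, 3, 7, 9, 6]      # hypothetical raw values
    xbar = statistics.mean(x)
    s = statistics.stdev(x)           # sample standard deviation

    z = [(v - xbar) / s for v in x]   # subtract the mean, divide by the standard deviation

    print(statistics.mean(z))         # 0.0 -> z-scores have a mean of 0
    print(statistics.stdev(z))        # 1.0 -> and a standard deviation (and variance) of 1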
IV. Shapes of univariate distributions
- The normal distribution
  - Bell-shaped
  - Properties: contains a known % of cases within specified standard deviations of the mean:
    ± 1 s.d. embraces _____ % of the cases
    ± 2 s.d. embraces _____ % of the cases
    ± 3 s.d. embraces _____ % of the cases
  - These percentages are reported in tables of areas under the normal curve, and they are used in testing observed statistics against expectations under the null hypothesis.
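The same percentages can be read from a normal table or computed directly; a short scipy sketch (Python, not SPSS):

    from scipy.stats import norm

    for k in (1, 2, 3):
        # Area under the standard normal curve between -k and +k standard deviations.
        area = norm.cdf(k) - norm.cdf(-k)
        print(f"within +/- {k} s.d.: {100 * area:.1f}% of the cases")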
- Non-normal distributions:
  - Asymmetrical distributions are skewed.
    - Defined as positively or negatively skewed by the location of the tail.
    - Measured by the "third moment" of deviation from the mean.
  - Symmetrical distributions that are too flat (platykurtic) or too peaked (leptokurtic) are measured by kurtosis.
    - Measured by the "fourth moment" of deviation from the mean.
  - Both skewness and kurtosis can be computed by Frequencies and are expressed as deviations from 0.
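A brief scipy sketch of both statistics on a made-up right-skewed variable (scipy's small-sample corrections differ slightly from SPSS, but both express kurtosis as a deviation from 0):

    import numpy as np
    from scipy.stats import skew, kurtosis

    rng = np.random.default_rng(1)
    x = rng.exponential(scale=1.0, size=5000)  # hypothetical right-skewed variable

    print(skew(x))      # positive -> long right tail (third moment)
    print(kurtosis(x))  # excess kurtosis, expressed as a deviation from 0 (fourth moment)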
  - Some types of nonlinear transformations help "normalize" a non-normal distribution.
    - Squaring X helps normalize a distribution skewed to the left.
    - Taking the square root of X helps normalize a distribution skewed to the right.
    - Taking the logarithm of X also helps normalize a right-skewed distribution.
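A final sketch (with simulated, hypothetical data) showing the square-root and log transformations pulling a right-skewed variable toward symmetry:

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(2)
    x = rng.lognormal(mean=10, sigma=1.0, size=5000)  # hypothetical right-skewed incomes

    print(skew(x))           # strongly positive
    print(skew(np.sqrt(x)))  # smaller: the square root reduces right skew
    print(skew(np.log(x)))   # near 0: the log roughly normalizes this particular variable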