 Theory, Measurement, and Univariate Statistics 


I. Theoretical statements make assertions about broad social and political processes.

  • They are concerned with the general situation rather than particular occurrences.
    • They involve salient causal factors, but not the only causal factors.
    • Therefore, they are probabilistic rather than deterministic statements.
    • Statistical analysis offers a rigorous methodology for testing such theoretical statements.
  • The first step in the methodology is to formalize the theory.
    • Define the concepts and interrelate them in an abstract propositional statement.
    • Operationalize the abstract concepts by linking them to measurable variables representing the concepts, separated into
      • Dependent variable: the phenomenon to be explained, and the
      • Independent variables: the explanatory factors.
    • Using these variables, restate the proposition in the form of a research hypothesis, preferably a directional hypothesis.
      • Nondirectional hypotheses merely assert that two or more variables are "related."
      • Directional hypotheses specify whether the variables are positively or negatively related.
    • Formulate a contradictory or null hypothesis for testing.
      • If the research hypothesis is nondirectional, the corresponding null hypothesis requires a two-tailed test.
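      • If it is directional, the null hypothesis is tested with a one-tailed test, which is more likely to reject the null hypothesis when the observed relationship lies in the predicted direction.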
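The difference between the two kinds of test can be sketched numerically. The example below is only an illustration: it assumes the test statistic is a z value referred to the normal curve, that the research hypothesis predicts a positive relationship, and that the observed z of 1.80 is a made-up figure. The function name `normal_p_values` is hypothetical.

```python
import math

def normal_p_values(z):
    """Return (one_tailed, two_tailed) p-values for a z statistic.

    Assumption: the directional hypothesis predicts a positive
    relationship, so only the upper tail counts as supporting evidence.
    """
    one_tailed = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z)
    two_tailed = math.erfc(abs(z) / math.sqrt(2))    # P(|Z| >= |z|)
    return one_tailed, two_tailed

# Hypothetical observed statistic, in the predicted (positive) direction.
one_tailed, two_tailed = normal_p_values(1.80)
print(f"one-tailed p = {one_tailed:.4f}")
print(f"two-tailed p = {two_tailed:.4f}")
```

With this z value the one-tailed probability falls below .05 while the two-tailed probability does not, which is the sense in which a one-tailed test rejects the null hypothesis more readily.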


II. Measurement in statistical analysis:

  • The process of operationalizing abstract concepts by linking them to concrete variables involves some form of measurement.
    • For example, the concept of "party identification" is measured by asking questions of respondents in a sample survey.
    • The result is a seven-point scale of "Republicanness" ranging from 0 for "Strong Democrat" to 6 for "Strong Republican."
    • Because measuring a concept typically involves a series of procedures or operations, the measurement process is known as operationalization.
  • The types of statistical analysis that can be performed depend on the types of measurement used for the dependent and independent variables.
  • There are several approaches to measurement; here are two of the most important views:
    • S.S. Stevens' "levels" of measurement:
      • nominal: arbitrary numbers pinned to classes, no magnitudes intended
      • ordinal: the numbers are orderable in magnitude, but the distances between values are not known
      • interval: the numbers are orderable and the distances are known, but there is no true zero point
      • ratio: the numbers are orderable, the distances are known, and there is a true zero point
    • An overlapping distinction: discrete and continuous variables:
      • discrete variables can assume only a countable (usually small) number of values, which can be orderable or nonorderable.
        • nonorderable discrete corresponds to "nominal" measurement.
        • orderable discrete corresponds to "ordinal" measurement.
      • continuous variables can assume any of an infinite number of values on a number line.
        • Time, which can be measured in infinitely small units, would qualify.
        • Strictly speaking, a variable like income, measured in small but discrete units like cents, would not be considered a "continuous" variable.
        • Practically speaking, variables with many "countable" units (e.g., income) are treated as continuous and sometimes called "quasi-continuous." 
  • Relevance of these distinctions to statistical analysis:
    • The "purist" would always respect Stevens' four "levels" of measurement in choosing a statistical test.
    • The "pragmatist" would draw the line of permissible operations between orderable and nonorderable discrete variables.
      • The pragmatist would argue that people regularly treat "ordinal" data as interval anyway -- e.g., in computing GPA.
      • Labovitz's research on alternative scoring schemes for ordinal variables shows that, as long as monotonicity is maintained in the scoring schemes, even random numbers pinned on ordered categories produce very high intercorrelations among the alternative scoring schemes.
    • Much of what passes for "interval" data in statistical analysis is, in fact, "ordinal," due to the presence of substantial measurement error -- e.g., city population, military spending, percent Hispanic.
      • In 1980, Chicago had 3,005,000 people; New York had 7,072,000.
      • In fact, these figures are not exact, and the actual populations may be off by tens of thousands of people.
      • So the true "interval" between the populations of Chicago and New York is not exactly 4,067,000; it is unknown.
      • Like a number pinned on an ordinal category, the population interval is known only to be somewhere around 4 million.
    • Moreover, a great deal of professional research treats ordinal data as interval in practice.
    • There is a body of "log-linear" techniques of data analysis that take an entirely different approach to this problem, but they are beyond the scope of this course (see David Knoke and Peter J. Burke, Log-Linear Models (Sage Publications, 1980)).  
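The pragmatist's point about monotonic scoring schemes can be tried out directly. The sketch below is not Labovitz's procedure, only a rough imitation of the idea: it takes made-up counts for the seven party identification categories, replaces the 0 to 6 scores with random but order-preserving scores, and correlates each rescoring with the original. The counts, the seed, the score range of 1 to 100, and the helper `pearson_r` are all assumptions for illustration.

```python
import math
import random
import statistics

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical counts of respondents in each of the seven ordered categories
# (0 = "Strong Democrat" ... 6 = "Strong Republican").
category_counts = [12, 15, 10, 14, 9, 13, 11]
responses = [cat for cat, count in enumerate(category_counts) for _ in range(count)]

random.seed(10)  # fixed seed so the run is repeatable
correlations = []
for _ in range(5):
    # Seven distinct random integers, sorted so the category order is preserved.
    scores = sorted(random.sample(range(1, 101), 7))
    rescored = [scores[cat] for cat in responses]
    correlations.append(pearson_r(responses, rescored))

for r in correlations:
    print(f"r with equal-interval scoring = {r:.3f}")
```

Because every rescoring keeps the categories in the same order, each correlation is necessarily positive; how close it comes to 1.0 depends on how unevenly the random scores happen to be spaced.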


III. Univariate statistics

  • SPSS procedures for computing univariate statistics
    • available under the Analyze menu, then under "Descriptive Statistics"
      • Frequencies for discrete variables
        • Gives counts and percents for each value of the variable.
        • A "Graph" option offers three types of graphs
          • Bar charts for discrete variables
          • Pie charts also for discrete variables
          • Histograms for continuous variables.
        • A "statistics" option offers various summary statistics:
          • Three measures of central tendency: mean, median, and mode
          • Three measures of dispersion: standard deviation, variance, and range
        • You can suppress the frequency table for continuous or "quasi-continuous" variables, or choose
      • Descriptives for continuous variables
  • Summarizing distributions of single variables
    • Measures of central tendency:
      • Mode: most frequent value, applies to all types of data
      • Median: value that divides the cases into two halves, applies to all but nominal data
      • Mean: sum of all values divided by the number of cases; strictly speaking, it applies only to interval and ratio data, but in practice it is also used for orderable discrete data
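As a small illustration of the three measures, the sketch below builds a frequency table (counts and percents, much as a frequencies listing would show) and then computes the mode, median, and mean. The eleven scores are invented values on the 0 to 6 party identification scale, and treating that scale as numeric for the mean follows the "pragmatist" position.

```python
import statistics
from collections import Counter

# Hypothetical responses on the seven-point party identification scale.
party_id = [0, 0, 0, 1, 1, 2, 3, 4, 5, 6, 6]

# Frequency table: count and percent for each observed value.
counts = Counter(party_id)
n = len(party_id)
for value in sorted(counts):
    print(f"{value}: {counts[value]} cases ({100 * counts[value] / n:.1f}%)")

mode = statistics.mode(party_id)      # most frequent value
median = statistics.median(party_id)  # middle case when the values are sorted
mean = statistics.fmean(party_id)     # sum of values divided by number of cases

print(f"mode = {mode}, median = {median}, mean = {mean:.2f}")
```

In this made-up distribution the three measures disagree (mode 0, median 2, mean about 2.55), which is typical when a distribution is not symmetrical.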
    • Measures of dispersion:
      • Measures for nominal data:
        • "Variation" in nominal variables is usually determined by inspecting the frequency distribution.
          • Summary measures of variation are typically based on the proportion of cases in the mode.
          • One such measure is the Variation Ratio
      • Easy to understand measures for continuous variables
        • Range
        • (Semi-) Interquartile Range
        • Average (mean) deviation
      • Harder to understand measures, but the most important ones
        • Sums of squares
        • Variance
        • Standard deviation
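The measures listed above can be sketched in a few lines using their usual textbook definitions. Several choices here are assumptions rather than anything fixed by these notes: the variation ratio is taken as 1 minus the proportion of cases in the mode, the variance divides the sum of squares by N (the sample version reported by packages such as SPSS divides by N - 1 instead), and the quartiles follow the default convention of Python's `statistics.quantiles`, which is only one of several in use. The data and function names are hypothetical.

```python
import math
import statistics
from collections import Counter

def variation_ratio(values):
    """Proportion of cases that fall outside the modal category."""
    modal_count = max(Counter(values).values())
    return 1 - modal_count / len(values)

def dispersion_summary(values):
    """Return common dispersion measures for interval-level data."""
    n = len(values)
    mean = statistics.fmean(values)
    # Quartile cut points; other quartile conventions give slightly different values.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    variance = sum_of_squares / n  # assumption: population formula (divide by N)
    return {
        "range": max(values) - min(values),
        "semi_interquartile_range": (q3 - q1) / 2,
        "mean_deviation": sum(abs(x - mean) for x in values) / n,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "standard_deviation": math.sqrt(variance),
    }

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with a mean of 5
summary = dispersion_summary(scores)
for name, value in summary.items():
    print(f"{name}: {value}")
print("variation ratio:", variation_ratio(scores))
```

For these eight scores the deviations from the mean square and sum to 32, so the variance is 4 and the standard deviation is 2; the mean deviation is 1.5, the range is 7, and the variation ratio is 0.625 because three of the eight cases sit in the mode.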


         
  • Standardizing the dispersion among variables: subtracting the mean from each raw value and dividing the difference by the standard deviation produces z-scores:
    • Computation:
      z-score = (raw value - mean) / standard deviation

    • Properties of z-scores: they
      • have a mean of 0 and a
      • standard deviation of 1 (therefore a variance of 1)
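The two properties of z-scores are easy to check on a small example. The sketch below assumes the standard deviation is computed with N in the denominator (the population formula), uses invented data, and the function name `z_scores` is hypothetical.

```python
import statistics

def z_scores(values):
    """Convert raw values to z-scores: (value - mean) / standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population s.d. (divide by N)
    return [(x - mean) / sd for x in values]

raw = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data: mean 5, s.d. 2
z = z_scores(raw)

print("z-scores:", z)
print("mean of z-scores:", statistics.fmean(z))
print("s.d. of z-scores:", statistics.pstdev(z))
```

The raw value 2 lies three points below the mean of 5, or 1.5 standard deviations, so its z-score is -1.5; the full set of z-scores has a mean of 0 and a standard deviation of 1.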


IV. Shapes of univariate distributions

  • The normal distribution
    • Bell-shaped
      • symmetrical
      • unimodal
    • Properties: contains a known % of cases within a specified number of standard deviations of the mean

      ± 1 s.d. embraces _____ % of the cases

      ± 2 s.d. embraces _____ % of the cases

      ± 3 s.d. embraces _____ % of the cases

    • These % are reported in tables of areas under the normal curve and they are used in testing observed statistics against expectations under the null hypothesis.
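The tabled areas can also be computed rather than looked up. This sketch uses the error function from Python's standard library; the function name `share_within` is hypothetical, and the results apply only to an exactly normal distribution.

```python
import math

def share_within(k):
    """Proportion of a normal distribution lying within k s.d. of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k} s.d.: {100 * share_within(k):.2f}% of the cases")
```

The computed areas are about 68.27%, 95.45%, and 99.73%; printed tables may differ in the last digit because of rounding.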
  • Non-normal distributions:
    • Asymmetrical distributions are skewed.
      • Defined as positively or negatively skewed by location of tail.
      • Measured by the "third moment" of deviation from the mean.
    • Symmetrical distributions that are too flat (platykurtic) or too peaked (leptokurtic) are measured by kurtosis.
      • Measured by the "fourth moment" of deviation from the mean.
      • Both skewness and kurtosis can be computed by the Frequencies procedure and are expressed as deviations from 0, the value for a normal distribution.
  • Some types of nonlinear transformations help "normalize" a non-normal distribution.
    • Squaring X helps normalize a distribution skewed to the left.
    • Taking the square root helps normalize a distribution skewed right.
    • Taking the logarithm of X also helps normalize a right-skewed distribution.
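The effect of the square root and logarithm on a right-skewed variable can be sketched with moment-based measures. The formulas below are the simple population versions (third and fourth moments divided by the appropriate power of the second moment, with 3 subtracted from kurtosis so that a normal curve scores 0); statistical packages usually apply small-sample corrections, so their figures will differ somewhat. The data and function names are hypothetical.

```python
import math
import statistics

def central_moment(values, power):
    """Average of the deviations from the mean raised to the given power."""
    mean = statistics.fmean(values)
    return sum((x - mean) ** power for x in values) / len(values)

def skewness(values):
    """Third moment standardized by the s.d. cubed; 0 for a symmetrical distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth moment standardized by the variance squared, minus 3; 0 for a normal curve."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical right-skewed data: most cases are small, a few are very large.
raw = [1, 1, 2, 2, 3, 4, 6, 10, 25, 60]
square_roots = [math.sqrt(x) for x in raw]
logs = [math.log(x) for x in raw]

skew_raw = skewness(raw)
skew_sqrt = skewness(square_roots)
skew_log = skewness(logs)

print(f"skewness of X:       {skew_raw:.2f}")
print(f"skewness of sqrt(X): {skew_sqrt:.2f}")
print(f"skewness of log(X):  {skew_log:.2f}")
print(f"excess kurtosis of X: {excess_kurtosis(raw):.2f}")
```

For these data the skewness shrinks toward 0 under both transformations, with the logarithm pulling in the long right tail more strongly than the square root. Note that the logarithm requires all values to be greater than 0.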