Lecture 5, Recording data

Frequency Distributions

Use of SPSS in statistical analysis: Frequency Distributions
The number of variables in the analysis determines the possible type of analysis.

Univariate refers to the analysis of a single variable.

Bivariate indicates analyzing the joint occurrence or covariation of two variables.

Multivariate describes the joint covariation of more than two variables.

Univariate statistics at the simplest level merely constitutes representations of distributions, which can be divided into

FULL representations of raw values of observations

GROUPED representations of raw values in collapsed categories

SUMMARY representations of raw values, which do not display complete information about the distribution of cases.

Full representations of distributions: raw counts of observed values

Frequency distributions: simple counts of observations pre-sorted into numerically ordered categories

These can be produced with the Frequencies procedure in SPSS, which lies under the Analyze Menu.

Frequencies is suitable for nominal-scale variables: i.e., non-orderable discrete variables

It is also suitable for most ordinal variables: i.e., orderable discrete variables with limited numbers of categories.

An example: FREQUENCIES for DIVISION in the states2000 data:

DIVISION   SUBDIVISIONS WITHIN REGIONS
                                                 Valid      Cum
Value Label         Value  Frequency  Percent  Percent  Percent
NEW ENGLAND             1          6     11.5     11.8     11.8
MIDDLE ATLANTIC         2          3      5.8      5.9     17.6
EAST NORTH CENTRAL      3          5      9.6      9.8     27.5
WEST NORTH CENTRAL      4          7     13.5     13.7     41.2
SOUTH ATLANTIC          5          9     17.3     17.6     58.8
EAST SOUTH CENTRAL      6          4      7.7      7.8     66.7
WEST SOUTH CENTRAL      7          4      7.7      7.8     74.5
MOUNTAIN                8          8     15.4     15.7     90.2
PACIFIC                 9          5      9.6      9.8    100.0
                        .          1      1.9  Missing
                             -------  -------  -------
Total                             52    100.0    100.0

Valid cases 51 Missing cases 1
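
The four numeric columns in the table above can be reproduced by hand. The following is a minimal Python sketch, not SPSS itself: the function name frequency_table is hypothetical, and the counts are copied from the DIVISION table. Percent divides by all cases (52), while Valid Percent and Cum Percent divide by the valid cases only (51).

```python
# Counts copied from the DIVISION frequency table; one case is missing.
counts = {
    "NEW ENGLAND": 6,
    "MIDDLE ATLANTIC": 3,
    "EAST NORTH CENTRAL": 5,
    "WEST NORTH CENTRAL": 7,
    "SOUTH ATLANTIC": 9,
    "EAST SOUTH CENTRAL": 4,
    "WEST SOUTH CENTRAL": 4,
    "MOUNTAIN": 8,
    "PACIFIC": 5,
}
n_missing = 1


def frequency_table(counts, n_missing=0):
    """Build rows of frequency, percent, valid percent, and cumulative percent."""
    n_valid = sum(counts.values())   # cases with a recorded value
    n_total = n_valid + n_missing    # all cases, including missing ones
    rows = []
    running = 0
    for label, freq in counts.items():
        running += freq
        rows.append({
            "label": label,
            "frequency": freq,
            "percent": round(100 * freq / n_total, 1),
            "valid_percent": round(100 * freq / n_valid, 1),
            "cum_percent": round(100 * running / n_valid, 1),
        })
    return rows


table = frequency_table(counts, n_missing)
for row in table:
    print(f"{row['label']:<20}{row['frequency']:>4}{row['percent']:>7}"
          f"{row['valid_percent']:>7}{row['cum_percent']:>7}")
```

Note that the cumulative column is computed from the running count, not by adding the rounded valid percents, which avoids accumulating rounding error.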

Frequencies is even suitable for interval or ratio-scaled variables IF the number of categories is not great

An example: bill96 (I'm using this in lieu of bush2000, but the point is the same).

Strictly speaking, bill96 is a discrete, orderable variable.

Practically speaking, bill96 can be considered a continuous variable--for it assumes so many values

or a ratio-scaled variable--for it has intervals of known width (percentage points) and an absolute zero (0 votes)

Frequencies for bill96:

BILL96   % of vote for Clinton in 1996
                                          Valid      Cum
Value Label  Value  Frequency  Percent  Percent  Percent
                33          2      3.8      3.9      3.9
                35          1      1.9      2.0      7.8
                36          1      1.9      2.0      9.8
                37          1      1.9      2.0     11.8
                40          2      3.8      3.9     15.7
                41          1      1.9      2.0     17.6
                42          1      1.9      2.0     19.6
                43          2      3.8      3.9     23.5
                44          6     11.5     11.8     35.3
                45          1      1.9      2.0     37.3
                46          2      3.8      3.9     41.2
                47          3      5.8      5.9     47.1
                48          3      5.8      5.9     52.9
                49          3      5.8      5.9     58.8
                50          2      3.8      3.9     62.7
                51          4      7.7      7.8     70.6
                52          5      9.6      9.8     80.4
                53          1      1.9      2.0     82.4
                54          4      7.7      7.8     90.2
                57          1      1.9      2.0     92.2
                59          1      1.9      2.0     94.1
                60          1      1.9      2.0     96.1
                62          1      1.9      2.0     98.0
                85          1      1.9      2.0    100.0
                 .          1      1.9  Missing
                      -------  -------  -------
Total                      52    100.0    100.0

FREQUENCIES is not useful for interval or ratio-scaled variables when the number of categories is large.

An example: billvote, the number of popular votes cast for Clinton in 1996, by state

Because each state cast a different number of votes for Clinton, there are 51 values--one for each state

BILLVOTE   Total vote for Clinton in 1996
                                            Valid      Cum
Value Label    Value  Frequency  Percent  Percent  Percent
               66508          1      1.9      2.0      2.0
               77897          1      1.9      2.0      3.9
              106405          1      1.9      2.0      5.9
              138400          1      1.9      2.0      7.8
              139295          1      1.9      2.0      9.8
              140209          1      1.9      2.0     11.8
              152031          1      1.9      2.0     13.7
              165545          1      1.9      2.0     15.7
              167169          1      1.9      2.0     17.6
              203388          1      1.9      2.0     19.6
              205012          1      1.9      2.0     21.6
              220197          1      1.9      2.0     23.5
              220592          1      1.9      2.0     25.5
              231906          1      1.9      2.0     27.5
              245260          1      1.9      2.0     29.4
              252215          1      1.9      2.0     31.4
              311092          1      1.9      2.0     33.3
              324394          1      1.9      2.0     35.3
              326099          1      1.9      2.0     37.3
              384399          1      1.9      2.0     39.2
              385005          1      1.9      2.0     41.2
              469164          1      1.9      2.0     43.1
              488102          1      1.9      2.0     45.1
              495878          1      1.9      2.0     47.1
              612412          1      1.9      2.0     49.0
              615732          1      1.9      2.0     51.0
              635804          1      1.9      2.0     52.9
              664503          1      1.9      2.0     54.9
              670854          1      1.9      2.0     56.9
              712603          1      1.9      2.0     58.8
              874668          1      1.9      2.0     60.8
              899645          1      1.9      2.0     62.7
              905599          1      1.9      2.0     64.7
              924284          1      1.9      2.0     66.7
              928983          1      1.9      2.0     68.6
             1024817          1      1.9      2.0     70.6
             1047214          1      1.9      2.0     72.5
             1070990          1      1.9      2.0     74.5
             1071859          1      1.9      2.0     76.5
             1096355          1      1.9      2.0     78.4
             1099132          1      1.9      2.0     80.4
             1567223          1      1.9      2.0     82.4
             1599932          1      1.9      2.0     84.3
             1941126          1      1.9      2.0     86.3
             2100690          1      1.9      2.0     88.2
             2206241          1      1.9      2.0     90.2
             2455735          1      1.9      2.0     94.1
             2533502          1      1.9      2.0     96.1
             3513191          1      1.9      2.0     98.0
             4639935          1      1.9      2.0    100.0
                   .          1      1.9  Missing
                        -------  -------  -------
Total                        52    100.0    100.0

This table has little value, for it simply says that each unique vote total occurs once.

The key point in using Frequencies and asking for the frequency table is whether the number of categories is large, with "large" somewhat a matter of judgment.
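
One rough way to anticipate whether a frequency table will be informative is to compare the number of distinct values with the number of cases. The sketch below is only an illustration: the helper distinct_share is hypothetical, the DIVISION codes are rebuilt from the first table above, and the billvote sample is just the first five values from the billvote table. No cutoff is proposed here, since "large" remains a matter of judgment.

```python
from collections import Counter


def distinct_share(values):
    """Number of distinct values divided by the number of cases (1.0 = all unique)."""
    return len(Counter(values)) / len(values)


# DIVISION codes rebuilt from the frequency table (51 valid cases, 9 categories).
division_counts = {1: 6, 2: 3, 3: 5, 4: 7, 5: 9, 6: 4, 7: 4, 8: 8, 9: 5}
division = [code for code, n in division_counts.items() for _ in range(n)]

# First five billvote values from the table, used as a small sample.
billvote_sample = [66508, 77897, 106405, 138400, 139295]

print(round(distinct_share(division), 2))  # few categories relative to cases
print(distinct_share(billvote_sample))     # every case has its own value
```

A share near 1.0 signals that the frequency table will list almost every case on its own row, as happens with billvote.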

GROUPED representations of raw values in collapsed categories

Used when the number of "raw" values is too large for easy comprehension

Most typically, grouping is suitable for continuous or "quasi-continuous" variables

Income

Votes won in elections

Population

Rules for grouping continuous variables

The number of intervals depends on the RANGE of the values between the low and high scores

From 6 to 20 intervals usually provides for adequate variation

Interval size is determined by dividing the range by the number of intervals

Remember that each interval is determined by its upper and lower TRUE LIMITS:

The distance on the measurement scale actually enclosed by an interval when grouping data

The upper true limit is half-way between the interval's apparent upper limit and the apparent lower limit of the next-higher interval

Example: Ages 21-25, 26-30 ... are actually 20.5 - 25.5 and 25.5 - 30.5
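
The two grouping rules above (interval size from the range, and true limits halfway between adjacent apparent limits) can be sketched in a few lines of Python. The function names interval_width and true_limits are hypothetical, and the sketch assumes values are recorded in whole units, so each true limit lies half a unit beyond the apparent limit. The choice of 10 intervals is also just an illustration within the 6-to-20 guideline.

```python
def interval_width(low, high, n_intervals):
    """Interval size: the range of the values divided by the number of intervals."""
    return (high - low) / n_intervals


def true_limits(apparent_lower, apparent_upper, unit=1):
    """True limits extend half a measurement unit beyond the apparent limits."""
    half = unit / 2
    return apparent_lower - half, apparent_upper + half


# Age intervals recorded in whole years.
print(true_limits(21, 25))  # (20.5, 25.5)
print(true_limits(26, 30))  # (25.5, 30.5)

# Lowest and highest billvote values from the table, split into 10 intervals.
width = interval_width(66508, 4639935, 10)
print(width)
```

Notice that the upper true limit of one interval equals the lower true limit of the next, so the intervals cover the scale without gaps.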

Discrete variables, whether ordered or not, can be grouped usefully together when the number of original categories is large:

Ethnic groups in the U.S.

Nations of the world grouped into regions
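
Grouping a discrete variable amounts to mapping each original category onto a broader one and adding up the counts. As a sketch, the nine DIVISION categories from the first table can be collapsed into four regions. The mapping below assumes the standard U.S. Census Bureau assignment of divisions to regions, which may not match the region variable in the states2000 data, and the helper collapse is hypothetical.

```python
# Frequencies copied from the DIVISION table.
division_counts = {
    "NEW ENGLAND": 6,
    "MIDDLE ATLANTIC": 3,
    "EAST NORTH CENTRAL": 5,
    "WEST NORTH CENTRAL": 7,
    "SOUTH ATLANTIC": 9,
    "EAST SOUTH CENTRAL": 4,
    "WEST SOUTH CENTRAL": 4,
    "MOUNTAIN": 8,
    "PACIFIC": 5,
}

# Assumption: standard Census Bureau grouping of divisions into regions.
division_to_region = {
    "NEW ENGLAND": "NORTHEAST",
    "MIDDLE ATLANTIC": "NORTHEAST",
    "EAST NORTH CENTRAL": "MIDWEST",
    "WEST NORTH CENTRAL": "MIDWEST",
    "SOUTH ATLANTIC": "SOUTH",
    "EAST SOUTH CENTRAL": "SOUTH",
    "WEST SOUTH CENTRAL": "SOUTH",
    "MOUNTAIN": "WEST",
    "PACIFIC": "WEST",
}


def collapse(counts, mapping):
    """Sum the frequencies of the original categories within each broader group."""
    grouped = {}
    for category, freq in counts.items():
        group = mapping[category]
        grouped[group] = grouped.get(group, 0) + freq
    return grouped


region_counts = collapse(division_counts, division_to_region)
for region, freq in region_counts.items():
    print(f"{region:<10}{freq:>4}")
```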

Grouped data are often displayed in graphs, which have distinct advantages over tables of numbers.

Graphs are visually striking and thus easier to interpret and remember.

Whereas numbers must be processed in digital fashion,

lines and areas can be interpreted spatially -- in analog fashion.

Good graphs are time-consuming to construct by hand, but they can be generated easily by computers.

Types of graphs available under Frequencies in SPSS:

HISTOGRAMS for grouped continuous data: bars should touch.

BAR GRAPHS for categorical data: bars should not touch.

PIE CHARTS are also for categorical data.

First, consider HISTOGRAMS for the two variables, billvote and pctblack:

By default, values are collected into several equal size intervals for plotting the histogram.
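
To show what collecting values into equal-size intervals means, here is a minimal text-only sketch in Python using the bill96 frequencies as printed in the table above. The starting point (30) and interval width (10) are illustrative assumptions; SPSS chooses its own defaults, which are not reproduced here. The helper grouped_counts is hypothetical.

```python
# Value -> frequency pairs copied from the bill96 table as printed.
bill96 = {
    33: 2, 35: 1, 36: 1, 37: 1, 40: 2, 41: 1, 42: 1, 43: 2,
    44: 6, 45: 1, 46: 2, 47: 3, 48: 3, 49: 3, 50: 2, 51: 4,
    52: 5, 53: 1, 54: 4, 57: 1, 59: 1, 60: 1, 62: 1, 85: 1,
}


def grouped_counts(freqs, start, width):
    """Collapse value -> frequency pairs into consecutive equal-width intervals."""
    n_bins = (max(freqs) - start) // width + 1
    bins = [0] * n_bins
    for value, freq in freqs.items():
        bins[(value - start) // width] += freq
    return bins


start, width = 30, 10  # assumed for illustration only
bins = grouped_counts(bill96, start, width)
for i, count in enumerate(bins):
    low = start + i * width
    print(f"{low}-{low + width - 1}  {'#' * count} ({count})")
```

Empty intervals are kept in the output, so a gap such as the one between the low 60s and 85 remains visible, just as it would in a histogram.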

BARCHART produces a graph suitable for DISCRETE variables.

Consider the example for the variable DIVISION

Note the spaces between the bars.

They suggest that the values are discrete, not continuous.

A pie chart for the same variable is more colorful, but may be too complex for easy understanding.

SUMMARY representations of raw values do not display complete information about the distribution of cases.

They provide only a single value which attempts to summarize the distribution.

Because any summary throws away information, summary measures are necessarily imperfect.

The two major classes of summary measures:

Measures of central tendency

Measures of dispersion

Both of these will be taken up later this week