Week VIII: Modeling Relationships between Continuous Variables

Lecture 27b: Dealing with Skewed Variables

Dealing with a Highly Skewed Variable: CIVILDOR in the POLITY Study

When you undertake a statistical analysis, be sure that you understand the variables you are analyzing. For example, the POLITY data set contains a variable, CIVILDOR, labeled as a "civil disorder index." If you consult the Nations of the World Database (by David Garson) in Government Publications, you will find this description of CIVILDOR: "Total number of incidents of civil disorder. (Taken from pages 61-62 of Charles L. Taylor, World Handbook of Political and Social Indicators.) Period covered is 1948-1977."

The "getinfo" printout reveals that the minimum value for CIVILDOR is 2, and the maximum is 8470. That range suggests that the variable has a large standard deviation. You can learn how the values are distributed by running the FREQUENCIES procedure with the subcommands

FORMAT=NOTABLE/HISTOGRAM/STATISTICS=STDDEV KURTOSIS SKEWNESS.
That produces this result for 109 of the 111 cases in the file:
CIVILDOR  CIVIL DISORDER INDEX
77          0 |***************************************
22       1000 |***********
 2       2000 |*
 2       3000 |*
 2       4000 |*
 2       5000 |*
 0       6000 |
 0       7000 |
 2       8000 |*
              +----+----+----+----+----+----+----+----+----+----+
              0        20        40        60        80       100
                              Histogram frequency
Std dev    1396.894      Kurtosis     16.338      Skewness      3.839
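To make the skewness statistic concrete, here is a minimal Python sketch (the course itself uses SPSS, not Python). It computes the bias-adjusted sample skewness that many statistical packages report; that this is the exact formula behind the printout above is an assumption. The function name `sample_skewness` and both data lists are hypothetical and are not taken from the POLITY file.

```python
import math

def sample_skewness(values):
    """Bias-adjusted sample skewness: n / ((n-1)(n-2)) * sum(z**3)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three values")
    mean = sum(values) / n
    # Sample standard deviation, with n - 1 in the denominator.
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if sd == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in values)

# Hypothetical data, for illustration only.
symmetric = [10, 20, 30, 40, 50]
right_skewed = [2, 5, 9, 14, 30, 120, 8470]

print(sample_skewness(symmetric))      # essentially zero
print(sample_skewness(right_skewed))   # positive: long tail to the right
```

A value near zero indicates a roughly symmetric distribution; a large positive value, such as the 3.839 reported for CIVILDOR, indicates a few cases far out in the right tail.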
                     
If values for CIVILDOR are total counts of incidents of civil disorder, populous countries should have more disorder. That this is largely true is shown by the plot of CIVILDOR with POPULA70.

    ++----+----+----+----+----+----+----+----+----+----+----+----+----++
     |                                                                 |
     |                1                                                |
C8000+                     1                                           +
I    |                                                                 |
V    |                                                                 |
I    |                                                                 |
L    |                                                                 |
 6000+                                                                 +
D    |                                                                 R
I    |                  1                                              |
S    |                1                                                |
O    |                         1                                       |
R4000+                                          1                      +
D    |                                                                 |
E    |                                                                 |
R    |               1                                         1       |
     |                                                                 |
I2000+                                                                 +
N    |               1111                                              |
D    |                312                                              |
E    |               2312                                              |
X    |               J941 2      1                                     |
    0+               *8 1                                              +
     |                                                                 |
     R                                                                 |
     ++----+----+----+----+----+----+----+----+----+----+----+----+---++
    -3.0E+08  -1.0E+08  1.00E+08  3.00E+08  5.00E+08  7.00E+08  9.00E+08
                           POPULATION SIZE IN 1970
                        
Correlation=.35398  R-Squared=.12530  S.E. of Est=1312.54267   Sig.=.0002
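With a single predictor, R-squared is simply the square of the correlation coefficient. The short Python check below (a sketch, not part of the SPSS run) confirms that the two statistics on the printout agree. It assumes the correlation is .35398, that is, that a leading decimal point belongs in front of the digits printed.

```python
# Values copied from the printout; the decimal point on the
# correlation is an assumption consistent with the R-squared shown.
r = 0.35398
r_squared_reported = 0.12530

r_squared = r ** 2
print(f"{r_squared:.5f}")          # 0.12530

# Share of the variance in CIVILDOR left unexplained by population size.
print(f"{1 - r_squared:.3f}")      # 0.875
```

In other words, population size accounts for only about one-eighth of the variance in the raw count of incidents.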
                  

Seeing that incidents of civil disorder are influenced by size of population, you might compute a new variable based on incidents of civil disorder per 1,000,000 people according to this formula:
compute civilcap=civildor*1000000/popula70.
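The same arithmetic can be sketched in Python to show what the COMPUTE statement does for a single case. The function name `per_million` and the example figures are hypothetical. Returning `None` when population is unavailable is meant to mimic the way a case with missing population ends up with a missing rate.

```python
def per_million(incidents, population):
    """Incidents per 1,000,000 people; None when the rate cannot be computed."""
    if incidents is None or population is None or population <= 0:
        return None
    return incidents * 1_000_000 / population

# Hypothetical country: 50 incidents among 10 million people.
print(per_million(50, 10_000_000))   # 5.0

# A case with no population figure gets no rate.
print(per_million(50, None))         # None
```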
You can then run the FREQUENCIES procedure again to examine the distribution of the new variable:

CIVILCAP
 77       0 |***************************************
 24     100 |************
  3     200 |**
  1     300 |*
  1     300 |*
  1     400 |*
  1     500 |*
  0     600 |
  0     700 |
  1     800 |*
  0     900 |
  0    1000 |
  1    1083 |*
            +----+----+----+----+----+----+----+----+----+----+
            0        20        40        60        80       100
                             Histogram frequency
    Std dev=139.591  Kurtosis=32.901     Skewness=5.350

Dividing by population removed the influence of population size from the measure of civil disorder, but the resulting distribution is even more skewed to the right: skewness rose from 3.839 to 5.350. Another technique for dealing with an extremely positively skewed distribution is to compute its logarithm. The logarithm of a number is the exponent to which another number, the base (in this case 10), must be raised to equal the original number.

In substantive terms, this means that incidents of disorder in one nation must be ten times the incidents in another nation to separate the nations by a full unit of measurement. This transformation can be justified by an argument similar to that in economics about the diminishing utility of a dollar at high income levels. Similarly, computing the logarithm of CIVILCAP implies that differences in incidents of disorder between two nations do not "register" as a full unit unless one rate is at least ten times the other. This approach to measurement is frequently used for many forms of political and social behavior. During the Korean and Vietnam wars, for example, public opposition to U.S. involvement was linked more closely to the logarithm of battlefield casualties than to a simple count of casualties. (In the histogram below, only the integer part of each logarithm, the characteristic, is listed, and not the decimal part, the mantissa.)

compute civillog=lg10(civilcap).
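A small Python sketch (again, not SPSS) shows what the base-10 logarithm does to individual values. The three CIVILCAP rates are copied from the case listing later in this lecture. The helper name `log10_or_none` is hypothetical. Returning `None` for missing or non-positive input is an assumption about how such cases should be treated, since the logarithm is undefined for them.

```python
import math

def log10_or_none(value):
    """Base-10 logarithm; None for missing or non-positive input."""
    if value is None or value <= 0:
        return None
    return math.log10(value)

# CIVILCAP values copied from the listing of nations in this lecture.
civilcap = {"AFGHANISTAN": 3.05, "ALGERIA": 340.39, "ANGOLA": 91.14}
for country, rate in civilcap.items():
    print(country, round(log10_or_none(rate), 3))

# A tenfold difference in the rate is exactly one unit on the log scale.
print(log10_or_none(100.0) - log10_or_none(10.0))
```

Algeria's rate is more than one hundred times Afghanistan's, yet the two countries are only about two units apart on the logged variable. That compression of the right tail is what removes the skew.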

CIVILLOG
        2          0 |*****
       14          1 |***********************************
        9          1 |***********************
       13          1 |*********************************
       20          1 |**************************************************
       11          2 |****************************
       20          2 |**************************************************
       10          2 |*************************
        4          2 |**********
        2          3 |*****
        1          3 |***
        2          3 |*****
                     +----+----+----+----+----+----+----+----+----+----+
                     0         4         8        12        16        20
                                     Histogram frequency
Std dev        .616      Kurtosis       .126      Skewness       .092

The distribution of the logarithm of CIVILCAP is very close to normal. The following command will list the individual nations and the relevant variables to help evaluate our reworking of the CIVILDOR variable:
LIST VARIABLES = COUNTRY CIVILDOR CIVILCAP CIVILLOG.
 COUNTRY                                  CIVILDOR CIVILCAP CIVILLOG
 AFGHANISTAN                                 38.00     3.05    .484
 ALGERIA                                   4679.00   340.39   2.532
 ANGOLA                                     541.00    91.14   1.960
 ARGENTINA                                 1137.00    47.88   1.680 
 AUSTRALIA                                  113.00     9.03    .956
 AUSTRIA                                    110.00    14.81   1.171
 BANGLADESH                                  41.00      .60   -.220
 BELGIUM                                    229.00    23.72   1.375
 BOLIVIA                                    468.00   108.21   2.034
 BRAZIL                                     364.00     3.80    .580
 BULGARIA                                    22.00     2.59    .414
 BURMA                                     1357.00    50.26   1.701
 CAMEROON                                   142.00    20.94   1.321
 CANADA                                     260.00    12.19   1.086
 CENTR.AFRICAN REPUBLIC                      99.99      .      .
 CHAD                                        52.00    14.27   1.155
 CHILE                                      297.00    31.70   1.501
 CHINA                                     2662.00     3.17    .501
 COLOMBIA                                   833.00    39.17   1.593