
Week VIII: Modeling Relationships between Continuous Variables
Lecture 27b: Dealing with Skewed Variables

Dealing with a Highly Skewed Variable: CIVILDOR in the POLITY Study

When you undertake a statistical analysis, be sure that you understand the variables you are analyzing. For example, the POLITY data set contains a variable, CIVILDOR, labeled as a "civil disorder index." If you consult the Nations of the World Database (by David Garson) in Government Publications, you will find this description of CIVILDOR: "Total number of incidents of civil disorder. (Taken from pages 61-62 of Charles L. Taylor, World Handbook of Political and Social Indicators.) Period covered is 1948-1977."

The "getinfo" printout reveals that the minimum value for CIVILDOR is 2, and the maximum is 8470. That range suggests that the variable has a large standard deviation. You can learn how the values are distributed by running the FREQUENCIES procedure with the subcommands

FORMAT=NOTABLE/HISTOGRAM/STATISTICS=STDDEV KURTOSIS SKEWNESS.
That produces this result for 109 of the 111 cases in the file:
```
CIVILDOR   CIVIL DISORDER INDEX

  Count      Value
     77          0 |***************************************
     22       1000 |***********
      2       2000 |*
      2       3000 |*
      2       4000 |*
      2       5000 |*
      0       6000 |
      0       7000 |
      2       8000 |*
                   +----+----+----+----+----+----+----+----+----+----+
                   0        20        40        60        80       100
                                  Histogram frequency

Std dev 1396.894    Kurtosis 16.338    Skewness 3.839
```
If values for CIVILDOR are total counts of incidents of civil disorder, populous countries should have more disorder. That this is largely true is shown by the plot of CIVILDOR with POPULA70.

```
[Scatterplot of CIVIL DISORDER INDEX (vertical axis, 0 to 8000) against POPULATION SIZE IN 1970 (horizontal axis); most cases fall in the lowest rows of the disorder index.]

Correlation=.35398   R-Squared=.12530   S.E. of Est=1312.54267   Sig.=.0002
```

Seeing that incidents of civil disorder are influenced by size of population, you might compute a new variable based on incidents of civil disorder per 1,000,000 people according to this formula:
compute civilcap=civildor*1000000/popula70.
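
For readers working outside SPSS, the same rate calculation can be sketched in Python. This is a minimal illustration and not part of the original analysis: the function name `incidents_per_million` is an assumption, and the population figure is a hypothetical value chosen only so that the result lands near Afghanistan's listed CIVILCAP of 3.05.

```python
def incidents_per_million(incidents, population):
    """Mirror of: compute civilcap=civildor*1000000/popula70."""
    if population is None or population <= 0:
        # Treat a missing or non-positive population as missing
        # rather than raising an error.
        return None
    return incidents * 1_000_000 / population


# CIVILDOR for Afghanistan is 38 in the listing; the population here is
# a hypothetical figure, not taken from the POLITY file.
rate = incidents_per_million(38, 12_460_000)
print(round(rate, 2))  # 3.05
```

Returning `None` for an unusable population is a design choice for this sketch; it keeps such cases out of later calculations instead of letting a division error stop the run.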
You can then run the FREQUENCIES procedure to examine the new distribution:

```
CIVILCAP

  Count      Value
     77          0 |***************************************
     24        100 |************
      3        200 |**
      1        300 |*
      1        300 |*
      1        400 |*
      1        500 |*
      0        600 |
      0        700 |
      1        800 |*
      0        900 |
      0       1000 |
      1       1083 |*
                   +----+----+----+----+----+----+----+----+----+----+
                   0        20        40        60        80       100
                                  Histogram frequency

Std dev=139.591    Kurtosis=32.901    Skewness=5.350
```

Dividing by population removed the influence of population size from the measure of civil disorder, but the resulting distribution is even more skewed to the right: skewness rose from 3.839 to 5.350. Another technique for dealing with an extremely positively skewed variable is to compute its logarithm, that is, the exponent to which another number, the base (in this case 10), must be raised to equal the original number.

In substantive terms, this means that incidents of disorder in one nation must be ten times the incidents in another nation to separate the nations by a full unit of measurement. This transformation can be justified by an argument similar to that in economics about the diminishing utility of a dollar at high income levels. Similarly, computing the logarithm of CIVILCAP implies that a difference in disorder between two nations does not "register" as a full unit unless one rate is at least ten times the other. This approach to measurement is frequently used for many forms of political and social behavior. During the Korean and Vietnam wars, for example, public opposition to U.S. involvement was linked more closely to the logarithm of battlefield casualties than to a simple count of casualties. (In the histogram below, only the integer part of each logarithm, the characteristic, is listed and not the decimal part, the mantissa.)

compute civillog=lg10(civilcap).

```
CIVILLOG

  Count      Value
      2          0 |*****
     14          1 |***********************************
      9          1 |***********************
     13          1 |*********************************
     20          1 |**************************************************
     11          2 |****************************
     20          2 |**************************************************
     10          2 |*************************
      4          2 |**********
      2          3 |*****
      1          3 |***
      2          3 |*****
                   +----+----+----+----+----+----+----+----+----+----+
                   0         4         8        12        16        20
                                  Histogram frequency

Std dev .616    Kurtosis .126    Skewness .092
```
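
The same transformation can be sketched in Python with the standard library's `math.log10`. The CIVILCAP values are copied from the country listing at the end of this section; the variable names are illustrative assumptions. The sketch also shows the property discussed above: a roughly tenfold ratio between two rates corresponds to a difference of about one unit on the log scale.

```python
import math

# CIVILCAP values copied from the country listing in this section.
civilcap = {"ALGERIA": 340.39, "CHILE": 31.70, "AFGHANISTAN": 3.05}

# Equivalent of: compute civillog=lg10(civilcap).
civillog = {name: math.log10(rate) for name, rate in civilcap.items()}

for name, value in civillog.items():
    print(f"{name:<12} {value:6.3f}")

# Algeria's rate is roughly 10.7 times Chile's, so the two nations sit
# about one unit apart on the log scale.
ratio = civilcap["ALGERIA"] / civilcap["CHILE"]
gap = civillog["ALGERIA"] - civillog["CHILE"]
print(f"ratio {ratio:.1f}, log gap {gap:.2f}")
```

The printed logarithms should agree with the CIVILLOG column of the listing to about three decimal places, since they are computed from the rounded CIVILCAP values shown there.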

The distribution of the logarithm of CIVILCAP is very close to normal. The following command will list the individual nations and the relevant variables to help evaluate our reworking of the CIVILDOR variable:
LIST VARIABLES = COUNTRY CIVILDOR CIVILCAP CIVILLOG.
```
COUNTRY                   CIVILDOR   CIVILCAP   CIVILLOG

AFGHANISTAN                  38.00       3.05       .484
ALGERIA                    4679.00     340.39      2.532
ANGOLA                      541.00      91.14      1.960
ARGENTINA                  1137.00      47.88      1.680
AUSTRALIA                   113.00       9.03       .956
AUSTRIA                     110.00      14.81      1.171
BANGLADESH                   41.00        .60      -.220
BELGIUM                     229.00      23.72      1.375
BOLIVIA                     468.00     108.21      2.034
BRAZIL                      364.00       3.80       .580
BULGARIA                     22.00       2.59       .414
BURMA                      1357.00      50.26      1.701
CAMEROON                    142.00      20.94      1.321
CANADA                      260.00      12.19      1.086
CENTR.AFRICAN REPUBLIC       99.99        .          .
CHAD                         52.00      14.27      1.155
CHILE                       297.00      31.70      1.501
CHINA                      2662.00       3.17       .501
COLOMBIA                    833.00      39.17      1.593
```
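
As a rough cross-check on the listing, the effect of the log transformation on skewness can be sketched in Python for just the nations shown here. This is an illustration under stated assumptions, not a reproduction of the SPSS output. It uses only the 18 listed nations with non-missing CIVILCAP, and the `skewness` helper is a plain moment-based formula written for this example. Statistical packages often apply a small-sample correction, so their figures would differ somewhat.

```python
import math

# CIVILCAP for the 18 listed nations with non-missing values
# (Central African Republic is omitted because its value is missing).
civilcap = [3.05, 340.39, 91.14, 47.88, 9.03, 14.81, 0.60, 23.72, 108.21,
            3.80, 2.59, 50.26, 20.94, 12.19, 14.27, 31.70, 3.17, 39.17]


def skewness(values):
    """Moment-based skewness, m3 / m2**1.5, with no small-sample correction."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Equivalent of: compute civillog=lg10(civilcap).
civillog = [math.log10(v) for v in civilcap]

print(f"skewness of CIVILCAP: {skewness(civilcap):.2f}")
print(f"skewness of CIVILLOG: {skewness(civillog):.2f}")
```

Even in this small subset, the raw rate is strongly skewed to the right (Algeria's value dominates), while the logged values are close to symmetric, which is the pattern the full-sample histograms show.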