Analysing a single scale variable

1c: Center and dispersion (mean and standard deviation)

Instead of (or in addition to) creating a table or a visualisation of the data, some statistical measures can provide a description of the sample data. The two most common types of statistical measures are those for central tendency and those for dispersion.

When doing proper research you should also look at the standard deviation. This is roughly the average distance from the mean. It is a measure of variation: a high standard deviation can indicate, for example, that respondents disagreed with each other. In the example it is 17.69. There is no fixed rule to determine if this is high or low. A variation of 2 degrees Celsius in body temperature can mean the difference between life and death, while a deviation of 2 years over a life span seems very low.

Central tendency (mean)

Measures of central tendency try to establish something like the ‘most typical’ value for the data. The most common measure of central tendency for a scale variable is the mean. Note that when people say ‘average’ they most often refer to the mean, although the term ‘average’ could refer to any measure of central tendency. Also, ‘mean’ most often refers to the ‘arithmetic mean’, but there are other means as well (for example the geometric mean and the harmonic mean).

Ask someone what the mean (or average) is, and most often you will hear the calculation: the sum divided by the number of items. However, this explains how to calculate it, not what it means conceptually. One conceptual definition is “the fulcrum that is unique to each distribution” (Weinberger & Schumacher as cited in Watier, Lamontagne, & Chartier, 2011, p. 3). This might sound a bit vague, so let’s use an example.

Let’s say we only have one score of 1, and place this on a scale as shown in Figure 1.

Figure 1. A scale with one block on it, tipping to the left

The scale is now tipping to the left. Note that the number in the block is not a weight; it only indicates the position. Note also that the scale has an infinite length. To balance the scale again, one obvious solution is to add something to the other side, as shown in Figure 2.

Figure 2. Balancing the scale by adding another block.

Another obvious solution would be to remove the block [1] again, but there is a third solution that requires neither adding nor removing anything, shown in Figure 3.

Figure 3. Balancing the scale by moving the fulcrum

In this case we simply moved the triangle (the fulcrum) so the scale is in balance again (note that the scale has an infinite length in both directions, so the ‘weight’ of the beam on which everything is placed is irrelevant).

If we add another score, let’s say 5, we get what is shown in Figure 4.

Figure 4. Adding another block to the scale makes it tip again.

The scale now tips to the right, so we can move the fulcrum (the triangle) again to make it balanced, as shown in Figure 5.

Figure 5. Balancing the scale again by moving the fulcrum.

The position of the fulcrum will be exactly at the mean. It shows the ‘balancing’ point of all the scores.
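The balancing idea can be checked numerically. The sketch below is only an illustration: the function names `balance_point` and `net_tilt` are made up for this example, and the 'tilt' is simply the sum of the signed distances of the scores from the fulcrum (negative tips left, positive tips right, zero is balanced).

```python
def balance_point(scores):
    """Arithmetic mean: the fulcrum position at which the scale balances."""
    return sum(scores) / len(scores)


def net_tilt(scores, fulcrum):
    """Sum of signed distances from the fulcrum (hypothetical helper).

    A negative result means the scale tips to the left, a positive result
    means it tips to the right, and zero means it is in balance.
    """
    return sum(x - fulcrum for x in scores)


scores = [1, 5]

# Fulcrum still under the first block (position 1): the new block at 5 tips it right.
print(net_tilt(scores, 1))    # 4

# Moving the fulcrum to the mean restores the balance.
fulcrum = balance_point(scores)
print(fulcrum)                    # 3.0
print(net_tilt(scores, fulcrum))  # 0.0
```

The deviations from the mean always sum to zero, which is exactly what 'balancing point' means here.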

Note that the mean is easily influenced by extreme values. Take for example the following scores: 1, 2, 2, 3, 4, 7, 8, 9. The mean is 4.5. If we change the last 9 to a 90, the mean becomes 14.625. Only one score is then actually above the mean, while all others are below it. In case of an extreme score, the median might be a better representation of the data than the mean. In the example the median remains unchanged at 3.5. For more information about the median see the section on center & dispersion for an ordinal variable.
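The effect of the extreme score can be reproduced with Python's standard `statistics` module. This is a small sketch using the scores from the example; the variable names are chosen for illustration only.

```python
from statistics import mean, median

scores = [1, 2, 2, 3, 4, 7, 8, 9]
with_extreme = scores[:-1] + [90]   # replace the last 9 with a 90

print(mean(scores), median(scores))              # 4.5 3.5
print(mean(with_extreme), median(with_extreme))  # 14.625 3.5

# Count how many scores lie above the mean once the extreme value is present.
above = [x for x in with_extreme if x > mean(with_extreme)]
print(above)                                     # [90]
```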

Dispersion or variety (standard deviation)

The centre alone does not give a good picture. If your head is in the oven and your feet in a refrigerator you’d be doing fine on average, but the deviation from the average is too high. That’s why besides a measure of centre, you should also report a measure of dispersion.

With a scale variable the most commonly used measure of dispersion is known as the standard deviation. It is roughly the average difference from the mean. It is not exactly the average difference from the mean because of how it is calculated: the differences are squared before they are averaged, and the square root is taken afterwards. There is another measure, known as the mean absolute deviation, that is exactly the average (absolute) difference from the mean, but this measure is less frequently used.
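To see that the two measures are close but not identical, the sketch below computes both for a small set of scores. The scores are reused from the earlier example and the helper name `mean_absolute_deviation` is made up for this illustration; the sample standard deviation comes from the standard `statistics.stdev` function.

```python
from statistics import mean, stdev


def mean_absolute_deviation(scores):
    """Average absolute distance of the scores from their mean."""
    m = mean(scores)
    return sum(abs(x - m) for x in scores) / len(scores)


scores = [1, 2, 2, 3, 4, 7, 8, 9]

mad = mean_absolute_deviation(scores)
sd = stdev(scores)   # sample standard deviation (divides by n - 1)

print(mad)           # 2.625
print(round(sd, 2))  # 3.07
```

Squaring gives large differences more influence, so for these scores the standard deviation comes out somewhat larger than the mean absolute deviation.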

The standard deviation gives information about the diversity of the scores. It could indicate how well people agreed with each other, how much variation there was, or how stable something is. Chebyshev’s inequality (Tchébychef, 1867) states that at least 75% of all scores will fall within two standard deviations from the mean, and at least (almost) 89% within three standard deviations. If for example the mean age is 23 and the standard deviation is 3, then we can expect that at least 75% of the respondents have an age between (23 – 2 x 3 =) 17 and (23 + 2 x 3 =) 29, and at least (almost) 89% between (23 – 3 x 3 =) 14 and (23 + 3 x 3 =) 32.
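The percentages follow from the bound 1 – 1/k², where k is the number of standard deviations. A minimal sketch of the age example, with function names that are chosen here purely for illustration:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of scores within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


def interval(mean, sd, k):
    """Lower and upper limit of the range mean +/- k standard deviations."""
    return (mean - k * sd, mean + k * sd)


mean_age, sd_age = 23, 3

for k in (2, 3):
    low, high = interval(mean_age, sd_age, k)
    share = chebyshev_lower_bound(k)
    print(f"at least {share:.1%} of the ages between {low} and {high}")
```

Because the inequality makes no assumption about the shape of the distribution, these are lower bounds; for many real data sets the actual percentages are higher.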

Unfortunately there is no fixed rule to determine if a standard deviation is big or small. A deviation of 7 is most likely fatal if it is about body temperature, but is almost nothing if it’s about revenue of top 100 companies. For more information about the standard deviation see the appendix below.

These descriptive measures can give you a first impression and can also be reported. For example:

The ages ranged from 18 to 89, with a mean age of 48 years, but with quite some variation between the customers (SD = 17.69).

How to determine the mean and standard deviation...

with Excel

Excel file from video: CEDI - Mean and Standard Deviation.xlsm.

with Flowgorithm

The Mean

A flowgorithm for the (arithmetic) mean is shown in Figure 1.

Flowgorithm (arithmetic) mean

It takes the scores as input parameters.

It uses a function MAsumReal (sum of real values)

Flowgorithm file: FL-CEmean.fprg.

The Standard Deviation:

A flowgorithm for the sample standard deviation is shown in Figure 1.

Flowgorithm sample standard deviation

It takes the scores as input parameters.

It uses the mean function (CEmean) and a function MAsumReal (sum of real values)

Flowgorithm file: FL-VAsd.fprg.

with Python

The Mean

Jupyter Notebook from video: CE - Mean.ipynb.

Data file from video: GSS2012a.csv.

The Standard Deviation:

Jupyter Notebook from video: DI - Standard Deviation.ipynb.

Data file from video: GSS2012a.csv.

with R

R script from video: CEDI - Mean and Standard Deviation.R.

Data file used in video: GSS2012-Adjusted.sav.

with SPSS

There are four different ways to determine the mean and standard deviation with SPSS.

using Frequencies

Data file used in video: GSS2012-Adjusted.sav.

using Descriptives

Data file used in video: GSS2012-Adjusted.sav.

using Explore

Data file used in video: GSS2012-Adjusted.sav.

using a shortcut

Data file used in video: GSS2012-Adjusted.sav.

Manually (Formulas)

The formula for the (arithmetic) mean is:

\(\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\)

The formula for the sample standard deviation (the square root of the unbiased sample variance) is:

\(s=\sqrt{\frac{\sum_{i=1}^n \left(x_i - \bar{x}\right)^2}{n-1}}\)

Where \(x_i\) is the i-th score, and \(n\) the sample size.

If your data is the entire population, the standard deviation is calculated by dividing by n instead of n - 1.
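The formulas translate almost directly into code. The following is a minimal sketch, not the code from the notebooks mentioned earlier: the function names and the `population` switch are assumptions made for this illustration, and the results are compared with the standard `statistics` module as a check.

```python
import math
import statistics


def arithmetic_mean(scores):
    """Sum of the scores divided by the number of scores."""
    return sum(scores) / len(scores)


def standard_deviation(scores, population=False):
    """Standard deviation following the formula above.

    Divides the sum of squared deviations by n - 1 for a sample,
    or by n when the data is the entire population.
    """
    n = len(scores)
    m = arithmetic_mean(scores)
    sum_of_squares = sum((x - m) ** 2 for x in scores)
    divisor = n if population else n - 1
    return math.sqrt(sum_of_squares / divisor)


scores = [1, 2, 2, 3, 4, 7, 8, 9]

print(arithmetic_mean(scores))                      # 4.5
print(standard_deviation(scores))                   # sample version (n - 1)
print(standard_deviation(scores, population=True))  # population version (n)

# Cross-check against the standard library implementations.
assert math.isclose(standard_deviation(scores), statistics.stdev(scores))
assert math.isclose(standard_deviation(scores, population=True),
                    statistics.pstdev(scores))
```

Note that the sample version needs at least two scores, since with a single score the divisor n - 1 would be zero.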

Now that we have a decent impression of the sample data it is time to see what it can tell us about the population. We'll start with this in the next section.

APPENDIX

to be added
