Analysing a single scale variable
2a: The population mean
We've seen in the previous part how to get an impression of our sample data, but what does this tell us about the population? We can obtain a confidence interval for the population mean (average), which gives a range (i.e. the population mean will most likely be somewhere between ... and ...), and/or, if we have a hypothesized mean, we can test whether the data shows a significant difference.
Test for specific mean
If we expect the population mean to be a certain value, we can perform a statistical test for this, known as a one-sample t-test. In the example, perhaps HRM thinks that the average age is 50; we can then use a one-sample t-test (Student, 1908) to test if this might be true based on the sample. If our sample mean is close to the expected population mean, the expected value might be correct, but if it is very different, it is probably wrong.
To determine if the sample result is 'wildly' different from the expected mean in the population, the test looks at the difference between the sample mean and the expected mean, but also takes the variation in the data (the standard deviation) and the sample size into consideration. This results in a so-called t-value.
Mr. Student (actually his name was William Sealy Gosset) showed that if you were to take all possible samples from a population and determine this t-value for each sample, a relative histogram of these t-values would form a so-called t-distribution, shown in Figure 1.
How this distribution looks depends on the sample size; the 'df' (degrees of freedom) is the sample size minus 1. With these kinds of distributions, the most important thing to remember is that it is all about areas under the curve, which give the probability of a t-value in that range. If, for example, we had a sample of 100 respondents (so df = 100 − 1 = 99) and a t-value of −2, we would be interested in the area to the left of −2, but also in the area to the right of +2 (since we are interested in the chance of a t-value at least as extreme as the one in the sample), as shown in Figure 2.
Figure 2. t-distribution with t = 2 and df = 99.
In the past, t-distribution tables were used to look up the size of this area, and the truly 'hard core' would attempt to calculate it manually, but these days this is usually left to a calculator or software program. This yields an area of 0.04824, or 4.824% of the total area. In other words: if the assumed population mean is correct, the chance of a t-value at least as extreme as the one in the sample is only 0.04824; this chance is known as the significance. In short, if this chance is very low, your expected population mean is probably wrong.
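For instance, this area can be computed in Python with scipy (assuming scipy is installed), using the values from the Figure 2 example:

```python
from scipy import stats

# Two-sided area: left of -2 plus right of +2, for a t-distribution with df = 99
p = 2 * stats.t.sf(2, df=99)  # sf(x) = 1 - cdf(x), i.e. the upper-tail area
print(round(p, 5))
```

Because the t-distribution is symmetric, doubling the upper-tail area gives the combined area in both tails.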
In the age example the significance is reported as .000, which is very low (usually below .05 is considered low), so the expected population mean of 50 is most likely wrong. We can be fairly confident that the population mean is significantly different from 50 and can add this to the report.
The mean age of customers was 48.19 years (SD = 17.69). The claim that the average age is 50 years old can be rejected, t(1968) = -4.53, p < .001.
Click here to see how you can perform such a test with SPSS, R (Studio), Excel, or manually.
Manually (formula and example)
The t-value can be determined using:

t = (x̄ − μH0) / SE

In this formula x̄ is the sample mean, μH0 the expected mean in the population (the mean according to the null hypothesis), and SE the standard error.
The standard error can be calculated using:

SE = s / √n

Where n is the sample size, and s the sample standard deviation.
The formula for the sample standard deviation is:

s = √( Σ(xᵢ − x̄)² / (n − 1) )

In this formula xᵢ is the i-th score, x̄ is the sample mean, and n is the sample size.
The sample mean can be calculated using:

x̄ = (Σ xᵢ) / n
The degrees of freedom is determined by:

df = n − 1

Where n is the sample size.
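The formulas above can be chained together in plain Python using only the standard library (the sample scores below are made up for illustration):

```python
import math
from statistics import mean, stdev  # stdev divides by n - 1, as in the formula above

x = [23, 45, 67, 34, 56]  # made-up sample scores
mu_h0 = 50                # hypothesized population mean

n = len(x)
x_bar = mean(x)               # sample mean: (sum of x_i) / n
s = stdev(x)                  # sample standard deviation
se = s / math.sqrt(n)         # standard error: s / sqrt(n)
t = (x_bar - mu_h0) / se      # t-value: (sample mean - mu_H0) / SE
df = n - 1                    # degrees of freedom
print(f"t({df}) = {t:.3f}")
```

Note that `statistics.stdev` already uses the n − 1 denominator, so it matches the sample standard deviation formula directly.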
Example (different example)
We are given the ages of five students, and have a hypothesized population mean of 24. The ages of the students are:
Since there are five students, we can set n = 5, and the hypothesized population mean of 24 gives μH0 = 24.
For the standard deviation, we first need to determine the sample mean:
Then we can determine the standard deviation:
The standard error then becomes:
The degrees of freedom is relatively simple:
The two-sided significance is then usually found by taking the t-value and the df and consulting a t-distribution table, or by using some software. If you really had to determine it manually, it would involve the formula for the t-distribution's cumulative distribution function:
One remark on the t-test concerns a frequently mentioned criterion for using it. Many textbooks state that in order to use a t-test the data should be roughly normally distributed; simplified, this means that the histogram should look a bit like a bell shape. There are tests to check if the data is 'normal', but there are also those who argue that this criterion (normality) is not even needed. For example, Lumley, Diehr, Emerson, and Chen (2002) conclude: “previous simulations studies show that “sufficiently large” is often under 100” (p. 166).
The confidence interval
We can also use this t-distribution to determine a so-called 95% confidence interval for the mean. In the example, this interval for the mean age runs from 47.41 to 48.98. If you want to round this, always round the lower value down and the upper value up, so in this example between 47 and 49.
The confidence interval could be reported for example like this:
The mean age of customers was 48.19 years (SD = 17.69), 95% CI [47.4, 49.0]. The claim that the average age is 50 years old can be rejected, t(1968) = -4.53, p < .001.
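Using the summary statistics from the example report, this interval can be reproduced with scipy's `t.interval` (a sketch; n = 1969 is inferred from df = 1968):

```python
import math
from scipy import stats

x_bar, s, n = 48.19, 17.69, 1969   # mean, SD, and sample size from the example
se = s / math.sqrt(n)              # standard error of the mean
lo, hi = stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=se)
print(f"95% CI [{lo:.2f}, {hi:.2f}]")
```

Small differences in the last decimal can occur because the summary statistics above are themselves rounded.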
Click here to see how you can determine such an interval with SPSS, R (Studio), Excel, or Python.
As a last thing to do, it is always good to add a so-called effect size when performing a statistical test, which will be discussed in the next section.