# Analysing a single scale variable

## 2a: The population mean

In the previous part we saw how to get an impression of our sample data, but what does this tell us about the population? We can obtain a confidence interval for the population mean (average), i.e. a range in which the population mean will most likely lie (somewhere between ... and ...), and/or, if we have a hypothesized average, test whether the data show a significant difference from it.

### Test for specific mean

If we expect the population mean to be a certain value, we can perform a statistical test for this, known as a **one-sample t-test**. In the example, perhaps HRM thinks that the average age is 50; we can then use a one-sample t-test (Student, 1908) to test whether this might be true based on the sample. If our sample mean is close to the expected population mean, the expectation might be correct, but if it is very different, the expectation might be wrong.

In the age example the significance (p-value) is shown as .000, i.e. below .001, which is very low (usually anything below .05 is considered low), so the expected population mean of 50 is most likely wrong. We can be fairly sure that the population mean is significantly different from 50 and can add this to the report.

The mean age of customers was 48.19 years (*SD* = 17.69). The claim that the average age is 50 years old can be rejected, *t*(1968) = -4.53, *p* < .001.
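As a rough check on the reported values, the t-statistic can be recomputed from the summary statistics alone, using t = (sample mean - hypothesized mean) / (SD / √n), the formula detailed in the manual section. The sketch below is in Python and assumes n = 1969, inferred from the reported degrees of freedom (df = n - 1 = 1968); the variable names are illustrative. Because the mean and SD are rounded to two decimals, the result can differ from the reported -4.53 in the last digit.

```python
import math

# Summary statistics as reported in the text (rounded values).
sample_mean = 48.19
sample_sd = 17.69
hypothesized_mean = 50
df = 1968

# Assumption: the sample size follows from df = n - 1.
n = df + 1

standard_error = sample_sd / math.sqrt(n)
t_from_summary = (sample_mean - hypothesized_mean) / standard_error

print(round(t_from_summary, 2))
```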

**How to perform a one-sample Student t-test**

**with Excel**

Excel file from video: TS - t-test (one-sample).xlsm.

**with Flowgorithm**

A flowgorithm for the one-sample Student t-test is shown in Figure 1.

It takes as input the scores, the hypothesized mean, and a string for which output to show (df, statistic, or p-value).

It uses a function for the mean (CEmean), standard deviation (VAsd) and the t cumulative distribution (DItcdf). These in turn require the MAsumReal function (sums an array of real values) and the standard normal cumulative distribution (DIsncdf).

Flowgorithm file: FL-TSstudentOS.fprg.

**with Python**

Jupyter Notebook used in video: TS - t test - one-sample.ipynb.

Data file used: GSS2012a.csv.

**with R**

R script used in video: TS - T-test (one-sample).R.

Datafile used in video: GSS2012-Adjusted.sav

**with SPSS**

Datafile used in video: GSS2012-Adjusted.sav

**Manually (formula and example)**

**Formulas**

The t-value can be determined using:

\(t=\frac{\bar{x}-\mu_{H_{0}}}{SE}\)

In this formula \(\bar{x}\) is the sample mean, \(\mu_{H_{0}}\) the expected mean in the population (the mean according to the null hypothesis), and \(SE\) the standard error.

The standard error can be calculated using:

\(SE=\frac{s}{\sqrt{n}}\)

Where *n* is the sample size, and *s* the sample standard deviation.

The formula for the sample standard deviation is:

\(s=\sqrt{\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}{n-1}}\)

In this formula \(x_{i}\) is the i-th score, \(\bar{x}\) is the sample mean, and \(n\) is the sample size.

The sample mean can be calculated using:

\(\bar{x}=\frac{\sum_{i=1}^{n}x_{i}}{n}\)

The degrees of freedom are determined by:

\(df=n-1\)

Where *n* is the sample size.

**Example** (different example)

We are given the ages of five students, and have a hypothesized population mean of 24. The ages of the students are:

\(X=\left\{18,21,22,19,25\right\}\)

Since there are five students, we can set \(n=5\), and the hypothesized population mean of 24 gives \(\mu_{H_{0}}=24\).

For the standard deviation, we first need to determine the sample mean:

\(\bar{x}=\frac{\sum_{i=1}^{n}x_{i}}{n}=\frac{\sum_{i=1}^{5}x_{i}}{5}=\frac{18+21+22+19+25}{5}\)

\(=\frac{105}{5}=21\)

Then we can determine the standard deviation:

\(s=\sqrt{\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}{n-1}}=\sqrt{\frac{\sum_{i=1}^{5}\left(x_{i}-21\right)^{2}}{5-1}}=\sqrt{\frac{\sum_{i=1}^{5}\left(x_{i}-21\right)^{2}}{4}}\)

\(=\sqrt{\frac{\left(18-21\right)^{2}}{4}+\frac{\left(21-21\right)^{2}}{4}+\frac{\left(22-21\right)^{2}}{4}+\frac{\left(19-21\right)^{2}}{4}+\frac{\left(25-21\right)^{2}}{4}}\)

\(=\sqrt{\frac{\left(-3\right)^{2}}{4}+\frac{\left(0\right)^{2}}{4}+\frac{\left(1\right)^{2}}{4}+\frac{\left(-2\right)^{2}}{4}+\frac{\left(4\right)^{2}}{4}}\)

\(=\sqrt{\frac{9}{4}+\frac{0}{4}+\frac{1}{4}+\frac{4}{4}+\frac{16}{4}}=\sqrt{\frac{9+0+1+4+16}{4}}=\sqrt{\frac{30}{4}}\)

\(=\sqrt{\frac{15}{2}}=\frac{1}{2}\sqrt{15\times2}=\frac{1}{2}\sqrt{30}\approx2.74\)

The standard error then becomes:

\(SE=\frac{s}{\sqrt{n}}=\frac{\frac{1}{2}\sqrt{30}}{\sqrt{5}}=\frac{\frac{\sqrt{30}}{2}}{\sqrt{5}}=\frac{\sqrt{30}}{2\times\sqrt{5}}\)

\(=\frac{1}{2}\times\frac{\sqrt{30}}{\sqrt{5}}=\frac{1}{2}\times\sqrt\frac{30}{5}=\frac{1}{2}\sqrt{6}\approx1.22\)

The t-value:

\(t=\frac{\bar{x}-\mu_{H_{0}}}{SE}=\frac{21-24}{\frac{1}{2}\sqrt{6}}=\frac{-3}{\frac{\sqrt{6}}{2}}=\frac{-3\times2}{\sqrt{6}}=\frac{-6}{\sqrt{6}}\)

\(=\frac{-6}{\sqrt{6}}\times\frac{\sqrt{6}}{\sqrt{6}}=\frac{-6\times\sqrt{6}}{\sqrt{6}\times\sqrt{6}}=\frac{-6\times\sqrt{6}}{6}=-\sqrt{6}\approx-2.45\)

The degrees of freedom are relatively simple:

\(df=n-1=5-1=4\)
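The hand calculation above can be checked with a few lines of code. The following is a minimal sketch in Python using only the standard library; the function name `one_sample_t` is an illustrative choice, not something taken from the notebook or files mentioned in this section.

```python
import math
import statistics


def one_sample_t(scores, hypothesized_mean):
    """Return (t, df) for a one-sample Student t-test."""
    n = len(scores)
    sample_mean = statistics.mean(scores)
    # statistics.stdev divides by n - 1, matching the formula for s.
    s = statistics.stdev(scores)
    standard_error = s / math.sqrt(n)
    t = (sample_mean - hypothesized_mean) / standard_error
    return t, n - 1


ages = [18, 21, 22, 19, 25]
t_value, df = one_sample_t(ages, 24)
print(round(t_value, 2), df)  # -2.45 4
```

The unrounded t-value equals \(-\sqrt{6}\), in line with the result of the manual derivation.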

The two-sided significance is then usually found by using the t-value and the df, and consulting a t-distribution table or using some software. If you really had to determine it manually, it would involve integrating the probability density function of the t-distribution over both tails, which by symmetry is twice the area of the upper tail:

\(p=2\times\int_{x=|t|}^{\infty}\frac{\Gamma\left(\frac{df+1}{2}\right)}{\sqrt{df\times\pi}\times\Gamma\left(\frac{df}{2}\right)}\times\left(1+\frac{x^2}{df}\right)^{-\frac{df+1}{2}}dx\)

See the Student t-distribution section for more details.
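To give an idea of what software does with this integral, the sketch below approximates the two-sided significance numerically in Python. It is only one possible approach and rests on a few assumptions: it uses Simpson's rule (a standard numerical integration method, not necessarily what any of the packages above use), it integrates the density from 0 to |t| and uses the symmetry of the t-distribution instead of integrating to infinity, and the function names are illustrative.

```python
import math


def t_pdf(x, df):
    """Density of the Student t-distribution (the integrand above)."""
    # lgamma avoids overflow of the gamma function for large df.
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # area between 0 and |t|
    # Each tail is 0.5 - central_area, so both tails give 1 - 2 * central_area.
    return 1 - 2 * central_area


p_value = two_sided_p(-math.sqrt(6), 4)
print(round(p_value, 3))
```

For the student example (t ≈ -2.45, df = 4) this gives a significance of roughly .07, which is above the usual .05 threshold.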

One remark on the t-test concerns a frequently mentioned criterion for using it. Many textbooks mention that in order to use a t-test the data should be roughly normally distributed. Simplified, this means that the histogram should look a bit like a bell shape. There are tests to check if the data is ‘normal’, but there are also those who argue that this criterion (normality) is not even needed when the sample is sufficiently large. For example, Lumley, Diehr, Emerson, and Chen (2002) conclude: “previous simulations studies show that “sufficiently large” is often under 100” (p. 166).

### The confidence interval

We can also use this t-distribution to determine a so-called 95% confidence interval for the mean. In the example, the interval for the mean age runs from 47.41 to 48.98. If you want to round this, always round the lower value down and the upper value up, so in this example between 47 and 49.

The confidence interval could be reported, for example, like this:

The mean age of customers was 48.19 years (*SD* = 17.69), 95% CI [47.4, 49.0]. The claim that the average age is 50 years old can be rejected, *t*(1968) = -4.53, *p* < .001.

**How to determine such an interval**

**with Excel**

You can either use Excel only, or use the Excel add-on Data Analysis.

*without add-on*

Excel file from video: TS - Confidence Interval (Mean).xlsm.

*with Data Analysis add-on*

**with Python**

Jupyter Notebook used in video: TS - Confidence Interval Mean.ipynb.

Data file used: GSS2012a.csv.

**with R (Studio)**

**with SPSS**

There are two ways to determine the confidence interval with SPSS.

*using Explore*

Datafile used in video: GSS2012-Adjusted.sav

*using One-sample T-test*

Datafile used in video: GSS2012-Adjusted.sav

**Manually (Formulas)**

The formula for a confidence interval for a mean is given by:

\(\bar{x} \pm ME\)

\(ME = t_{\alpha/2}\times SE\)

\(SE=\frac{s}{\sqrt{n}}\)

\(\alpha = 1 - CL\)

\(s=\sqrt{\frac{\sum_{i=1}^n \left(x_i - \bar{x}\right)^2}{n-1}}\)

\(\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\)

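In these formulas \(\bar{x}\) is the sample mean, \(ME\) the margin of error, \(t_{\alpha/2}\) the critical value of the t-distribution with \(n-1\) degrees of freedom that leaves a probability of \(\alpha/2\) in the upper tail, \(SE\) the standard error, \(\alpha\) the significance level (the p-value threshold, usually 0.05), \(n\) the sample size, \(s\) the sample standard deviation (based on the unbiased variance estimate), \(CL\) the confidence level (usually 0.95), and \(x_i\) the i-th score on x.

Note that \(\pm\) means the margin of error is both subtracted (for the lower bound) and added (for the upper bound).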
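The formulas can be combined into a small program. The sketch below is in Python and uses only the standard library, so the critical value \(t_{\alpha/2}\) is found numerically: the cumulative probability is approximated with Simpson's rule and then inverted by bisection. Both are choices made for this illustration, as are the function names; statistical software will typically use more refined routines. The data are the five student ages from the manual t-test example.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of the Student t-distribution."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_coef) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, via Simpson's rule on [0, x] (steps must be even)."""
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(alpha, df):
    """Value t with P(T <= t) = 1 - alpha / 2, found by bisection."""
    target = 1 - alpha / 2
    low, high = 0.0, 1.0
    while t_cdf(high, df) < target:  # widen the bracket until it contains t
        high *= 2
    for _ in range(60):
        mid = (low + high) / 2
        if t_cdf(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def mean_confidence_interval(scores, confidence_level=0.95):
    """Return (lower, upper) bounds of the confidence interval for the mean."""
    n = len(scores)
    sample_mean = statistics.mean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(n)
    alpha = 1 - confidence_level
    margin_of_error = t_critical(alpha, n - 1) * standard_error
    return sample_mean - margin_of_error, sample_mean + margin_of_error


lower, upper = mean_confidence_interval([18, 21, 22, 19, 25])
print(round(lower, 2), round(upper, 2))
```

For these five ages the interval runs from roughly 17.6 to 24.4, so it still contains the hypothesized mean of 24 used in the t-test example.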

Finally, it is always good to add a so-called effect size when performing a statistical test, which will be discussed in the next section.
