Analyse a Single Nominal Variable

The analysis of a single nominal variable can be done with the steps shown below. Click on each step to reveal how this step can be done.

Step 1: Impression

For a quick impression of a nominal variable, you can create a frequency table. The result will be something as shown in table 1.

Table 1
*Results of Marital Status*
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	Married	972	49.2	50.1	50.1
	Widowed	181	9.2	9.3	59.4
	Divorced	314	15.9	16.2	75.6
	Separated	79	4.0	4.1	79.6
	Never married	395	20.0	20.4	100.0
	Subtotal	1941	98.3	100.0
Missing	No answer	33	1.7
	Subtotal	33	1.7
Total		1974	100.0

Click here to see how to create a frequency table with Excel, Python, R, or SPSS.

with Excel

Excel file from videos available here.

with stikpetE add-in

without add-ins

with Python

Jupyter Notebook of video is available here.

with stikpetP library

without stikpetP library

with R (Studio)

with stikpetP library

Jupyter Notebook of video is available here.

without stikpetR library

R script of video is available here.

Datafile used in video: GSS2012-Adjusted.sav

with SPSS

There are a three different ways to create a frequency table with SPSS.

An SPSS workbook with instructions of the first two can be found here.

using Frequencies

watch the video below, or download the pdf instructions (via bitly, opens in new window/tab).

Datafile used in video: Holiday Fair.sav

using Custom Tables

watch the video below, or download the pdf instructions for versions before 24, or version 24 (via bitly, opens in new window/tab)

Datafile used in video: Holiday Fair.sav

using descriptive shortcut

watch the video below, or download the pdf instructions (via bitly, opens in new window/tab).

Datafile used in video: StudentStatistics.sav

See the explanation of frequency table for details on how to read this kind of table.

In the example we can see that Married has been chosen quite often compared to the others, and Separated the least. The table itself might not end up in the report, but gives a quick impression for yourself. Most would probably prefer a visualisation though.

Step 2: Visualisation

To visualise the results of a single nominal variable a simple bar-chart is often a good choice. An example is shown in Figure 1.

Figure 1
Results of marital status
bar chart example of marital status

Click here to see how to create a bar-chart

with Excel

Excel file from video VI - Bar Chart (Simple).xlsm.

with Python

Jupyter Notebook from video: VI - Bar Chart (Simple).ipynb.
Data file used: GSS2012a.csv.

with R (Studio)

R script from video VI - Bar Chart (Simple).R.
Data file used: GSS2012-Adjusted.sav.

with SPSS

There are a few different ways to obtain a (simple) bar-chart with SPSS. The end result for each will be the same.

using Chart builder

Watch the video below, or follow the instructions in this pdf (via bitly, opens in new window/tab).

Data file used: StudentStatistics.sav.

using Legacy dialogs

Watch the video below, or follow the instructions in this pdf (via bitly, opens in new window/tab).

Data file used: StudentStatistics.sav.

using Frequencies

Watch the video below, or follow the instructions in this pdf (via bitly, opens in new window/tab).

Data file used: StudentStatistics.sav.

using an existing table

Watch the video below, or follow the instructions in this pdf (via bitly, opens in new window/tab)

Data file used: StudentStatistics.sav.

When showing a bar-chart it is good to also talk a little bit about it. Usually the peak is mentioned (known as the mode) and perhaps a few of the very low ones. It always depends on the situation, but inform your reader what you notice from the graph or what you want to show.

In the report I recommend using a ‘Introduce – Show – Tell’ approach. Introduce the figure, then show the figure, then talk about what you notice. The example shows the same as we also saw from the frequency table; Married is chosen by far the most, and separated the least.

There are some alternatives for a bar-chart, these include pie-chart, (Cleveland) dotplot, and Pareto chart.

Step 3: Overall Test

The previous steps, gave an impression on the sample results, but are the results also significant? We can test if the percentages could be equal for all categories, in the population with a Pearson chi-square goodness-of-fit test.

Click here to see how to perform a Pearson chi-square goodness-of-fit test

with Excel

Excel file from video: TS - Pearson chi-square GoF.xlsm.

with Flowgorithm

A basic implementation for Pearson Chi-Square GoF test in the flowchart in figure 1

Figure 1
Flowgorithm for the Pearson Chi-Square GoF test
Flowgorithm the Pearson Chi-Square GoF test

It takes as input an array of integers with the observed frequencies, a string to indicate which correction to use (either 'none', 'pearson', 'williams', or 'yates') and an integer for which output to show (0 = sig., 1=chi-square value, 2 = degrees of freedom).

It uses the function for the right-tail probabilities of the chi-square distribution and a small helper function to sum an array of integers.

Flowgorithm file: FL-TSpearsonChiSqGoF.fprg.

with Python

Jupyter Notebook from video: TS - Pearson chi-square GoF.ipynb.

Data file from video: GSS2012a.csv.

with R (Studio)

Click or hover over the thumbnail below to see where to look in the output.

R output Chi-square goodness-of-fit

R script from video: TS - Pearson chi-square GoF.R.

Data file from video: Pearson Chi-square independence.csv.

with SPSS

via One-sample

Click or hover over the thumbnail below to see where to look in the output.

SPSS output Chi-square goodness-of-fit

Data file used in video: GSS2012-Adjusted.sav.

via Legacy dialogs

Click or hover over the thumbnail below to see where to look in the output.

SPSS output Chi-square goodness-of-fit

Data file used in video: GSS2012-Adjusted.sav.

Manually (Formula and example)

The formula's

The Pearson chi-square goodness-of-fit test statistic (χ²):

\( \chi^{2}=\sum_{i=1}^{k}\frac{\left(O_{i}-E_{i}\right)^{2}}{E_{i}}\)

In this formula O_i is the observed count in category i, E_i is the expected count in category i, and k is the number of categories.

If the expected frequencies, are expected to be equal, then:

\(E_{i}=\frac{\sum_{j=1}^{k}O_{i}}{k}\)

The degrees of freedom is given by:

\(df=k-1\)

The probability of such a chi-square value or more extreme, can then be found using the chi-square distribution.

Example

We have the following observed frequencies of five categories:

\(O=\left(972,181,314,79,395\right)\)

Note that since there are five categories, we have k = 5. If the expected frequency for each category is expected to be equal we can use the formula to determine:

\(E_{i}=\frac{\sum_{j=1}^{k}O_{i}}{k}=\frac{\sum_{j=1}^{5}O_{i}}{5} =\frac{972+181+314+79+395}{5}\)

\(=\frac{1941}{5}=388\frac{1}{5}=388.2\)

Then we can determine the Pearson chi-square value:

\(\chi^{2}=\sum_{i=1}^{k}\frac{\left(O_{i}-E_{i}\right)^{2}}{E_{i}}=\sum_{i=1}^{5}\frac{\left(O_{i}-\frac{1941}{5}\right)^{2}}{\frac{1941}{5}}\)

\(=\frac{\left(972-\frac{1941}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(181-\frac{1941}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(314-\frac{1941}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(79-\frac{1941}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(395-\frac{1941}{5}\right)^{2}}{\frac{1941}{5}}\)

\(=\frac{\left(\frac{972\times5}{5}-\frac{1941}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(\frac{181\times5}{5}-\frac{1941}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(\frac{314\times5}{5}-\frac{1941}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(\frac{79\times5}{5}-\frac{1941}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(\frac{395\times5}{5}-\frac{1941}{5}\right)^{2}}{\frac{1941}{5}}\)

\(=\frac{\left(\frac{972\times5-1941}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(\frac{181\times5-1941}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(\frac{314\times5-1941}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(\frac{79\times5-1941}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(\frac{395\times5-1941}{5}\right)^{2}}{\frac{1941}{5}}\)

\(=\frac{\left(\frac{4860-1941}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(\frac{905-1941}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(\frac{1570-1941}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(\frac{395-1941}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(\frac{1975-1941}{5}\right)^{2}}{\frac{1941}{5}}\)

\(=\frac{\left(\frac{2919}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(\frac{-1036}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(\frac{-371}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(\frac{-1546}{5}\right)^{2}}{\frac{1941}{5}}+\frac{\left(\frac{34}{5}\right)^{2}}{\frac{1941}{5}}\)

\(=\frac{\frac{\left(2919\right)^{2}}{5^{2}}}{\frac{1941}{5}}+\frac{\frac{\left(-1036\right)^{2}}{5^{2}}}{\frac{1941}{5}}+\frac{\frac{\left(-371\right)^{2}}{5^{2}}}{\frac{1941}{5}}+\frac{\frac{\left(-1546\right)^{2}}{5^{2}}}{\frac{1941}{5}}+\frac{\frac{\left(34\right)^{2}}{5^{2}}}{\frac{1941}{5}}\)

\(=\frac{\frac{8520561}{5^{2}}}{\frac{1941}{5}}+\frac{\frac{1073296}{5^{2}}}{\frac{1941}{5}}+\frac{\frac{137641}{5^{2}}}{\frac{1941}{5}}+\frac{\frac{2390116}{5^{2}}}{\frac{1941}{5}}+\frac{\frac{1156}{5^{2}}}{\frac{1941}{5}}\)

\(=\frac{8520561\times5}{1941\times5^{2}}+\frac{1073296\times5}{1941\times5^{2}}+\frac{137641\times5}{1941\times5^{2}}+\frac{2390116\times5}{1941\times5^{2}}+\frac{1156\times5}{1941\times5^{2}}\)

\(=\frac{8520561}{1941\times5}+\frac{1073296}{1941\times5}+\frac{137641}{1941\times5^{2}}+\frac{2390116}{1941\times5}+\frac{1156}{1941\times5}\)

\(=\frac{8520561+1073296+137641+2390116+1156}{1941\times5}\)

\(=\frac{12122770}{1941\times5}=\frac{2424554\times5}{1941\times5}=\frac{2424554}{1941}\approx1249.13\)

The degrees of freedom is:

\(df=k-1=5-1=4\)

To determine the signficance you then need to determine the area under the chi-square distribution curve, in formula notation:

\(\int_{x=0}^{\chi^{2}}\frac{x^{\frac{df}{2}-1}\times e^{-\frac{x}{2}}}{2^{\frac{df}{2}}\times\Gamma\left(\frac{df}{2}\right)}\)

This is usually done with the aid of either a distribution table, or some software. See the chi-square distribution section for more details.

The p-value (sig.) is the probability of the percentages as in the sample, or more extreme, if the assumption about the population (that they are all equal) would be true. If this is below the pre-defined threshold (usually .05), we would reject this assumption, and conclude there are at least two significantly differt, otherwise we would not reject the assumption.

When reporting the results, you need to show the degrees of freedom, the test statistic (a chi-square value) and the p-value. I usually also add the sample size. This then looks like χ²([degrees of freedom], N = [sample size]) = [test value], p = [p-value]. Prior to this, mention the test that was done, and the interpretation of the results. In the example this results in:

A Pearson chi-square test of goodness-of-fit showed that the marital status was not equally distributed in the population, χ²(4, N = 1941) = 1249.13, p < .001.

The Pearson chi-square test of goodness-of-fit should only be used if not too many cells have a low so-called expected count. For this test it is usually set that all cells should have an expected count of at least 5 (see for example Peck & Devore, 2012, p. 593). If you don't meet this criteria, you could use an exact multinomial test of goodness-of-fit.

Another alternative is the G-test (a.k.a. Likelihood Ratio or Wilks test) of goodness-of-fit.

Step 4: Effect Size for Overall Test

When a test has a significant result, the effect size is important to add. One possible effect size to go with a Pearson chi-square test is Cramér's V.

Click here to see how to obtain Cramér's V.

with Excel

Excel file from video: ES - Cramers V (GoF).xlsm.

with Flowgorithm

A basic implementation for Cramér's V in the flowchart in figure 1

Figure 1
Flowgorithm for Cramér's V
Flowgorithm Cramér's V

It takes as input the chi-square value, an array of integers with the observed frequencies, and a boolean to indicate to use the Bergsma correction.

It uses a small helper function to sum an array of integers.

Flowgorithm file: FL-EScramerVgof.fprg.

with Python

Jupyter Notebook from video: ES - Cramers V (GoF).ipynb.

Data file from video: GSS2012a.csv.

with R (Studio)

R script from video: ES - Cramers V (GoF).R.

Data file from video: Pearson Chi-square independence.csv.

with SPSS

Unfortunately SPSS does not have a method to determine Cramér's V directly from the GUI, however the calculation is not very difficult once you have the output from the previous part.

The video below shows how this could be done with a bit of help from Excel

Online calculator

Enter the requested information below:

Manually (formula and example)

Formula

The formula for Cramér's V is:

\(V=\sqrt\frac{\chi^{2}}{n\times df}\)

In the above formula \(\chi^2\) is the chi-square test value, \(n\) is the total sample size, and \(df\) is the degrees of freedom, determined by \(df=k-1\). \(k\) is the number of categories.

Example.

If we have a chi-square value of 1249.13, a total sample size of 1941, and had five categories, we can first determine the degrees of freedom (df):

\(df = k - 1 = 5 - 4 = 4\)

Then we can fill out all values in the formula for Cramér's V:

\(V=\sqrt\frac{\chi^{2}}{n\times df}=\sqrt\frac{1249.13}{1941\times4}=\sqrt\frac{1249.13}{7764}\approx\sqrt{0.1609}\approx0.4011\)

A correction can be applied using the procedure proposed by Bergsma (2013). This is actually for a Cramér’s V with a chi-square test of independence, but adapting it for a goodness-of-fit is possible.

Alternatives for Cramér's V as an effect size measure can also be Cohen's w, or Johnston-Berry-Mielke E.

One set of rule-of-thumb for Cramér's V with a goodness-of-fit test, is by converting it to Cohen's w and then use the rules of thumb from that. This will depend on the number of categories (k). In Table 1 the already converted rule of thumb.

Table 1
Interpretation for Cramér's V (GoF)
k	Negligible	Small	Medium	Large
2	0 < 0.100	0.100 < 0.300	0.300 < 0.500	≥ 0.500
3	0 < 0.071	0.071 < 0.212	0.212 < 0.354	≥ 0.354
4	0 < 0.058	0.058 < 0.173	0.173 < 0.289	≥ 0.289
5	0 < 0.050	0.050 < 0.150	0.150 < 0.250	≥ 0.250
k	0.1 / SQRT(k-1)	0.3 / SQRT(k-1)	0.5 / SQRT(k-1)
Note: Adapted from Statistical power analysis for the behavioral sciences (2nd ed., pp. 227) by J. Cohen, 1988, L. Erlbaum Associates.

In the example Cramér's V is 0.401 (see videos below on how to determine this), and we had 5 categories, which would indicate a large effect.

We could add this to our report:

A Pearson chi-square test of goodness-of-fit showed that the marital status was not equally distributed in the population, χ²(4, N = 1941) = 1249.13, p < .001, with a relatively strong (Cramér's V = .40) effect size according to conventions for Cramér's V (Cohen, 1988).

Step 5: Post-Hoc Analyses

The overall test can show if the proportions might be different for some categories in the population, but we often then would like to know, which categories are different. We can compare each possible pair of categories. In principle this makes it each time a binary variable, for which we can use a one-sample binomial test (or an alternative).

Because we are comparing multiple pairs, we run the risk of rejecting the assumption with a chance of 5% each time. This quickly adds up, and the chance of making the wrong decision at least once grows then rapidly. To counter this, we will need to adjust the pairwise test results a little. One common method is to use the Bonferroni procedure (Bonferroni, 1935). He simply suggested to divide the 5% by the number of tests that are being done, and use that then as the criteria.

Click here to see how to perform a pairwise binomial test

with Excel

Excel file from video: PH - Pairwise Binomial.xlsm.

with Python

Jupyter Notebook from video: PH - Pairwise Binomial.ipynb.

Data file from video: GSS2012a.csv.

with R (Studio)

R script from video: PH - Pairwise Binomial.R.

Data file from video: Pearson Chi-square independence.csv.

with SPSS

Data file used in video: GSS2012-Adjusted.sav.

Depending on the number of pairs tested and the results, summarizing the results can be easy or tricky. In the example all were significant, so we could add something like:

A post-hoc pairwise binomial test with Bonferroni correction of marital status showed that all proportions were significantly different from each other (p < .003).

In some situations, you might just want to show a table with all the results.

Step 6: Effect Size for Post-Hoc Analyses

Also for each of the pairwise comparisons, we need to report an appropriate effect size. Since the pairwise comparison was done using a one-sample binomial test, we can use the effect size, for that test: Cohen g.

Cohen g is simply the difference between the observed proportion, and 0.5. Cohen gave some rule of thumb to interpret this, shown in Table 2.

Table 2
Rule of thumb for Cohen’s g
\|g\|	Interpretation
0.00 < 0.05	Negligible
0.05 < 0.15	Small
0.15 < 0.25	Medium
0.25 or more	Large
Note: Adapted from Statistical power analysis for the behavioral sciences (2nd ed., pp. 147-149) by J. Cohen, 1988, L. Erlbaum Associates.

Click here to see how to determine Cohen's g...

with Excel

Excel file from video: ES - Cohen g.xlsm.

with Flowgorithm

A basic implementation for Cohen g is shown in the flowchart in figure 1

Figure 2
Flowgorithm for Cohen g

It takes as input the frequency of one of the categories (k) and the sample size (n).

Flowgorithm file: ES - Cohen g.fprg.

with Python

Jupyter Notebook from video: ES - Cohen g.ipynb.

Datafile used in video: StudentStatistics.csv

with R (Studio)

R script from video: binary - effect sizes.R.

Datafile used in video: StudentStatistics.sav

with SPSS

Datafile used in video: StudentStatistics.sav

Online calculator

Enter the number of cases of the first category, then the total sample size:

Manually (using Formula)

Given a sample proportion (p) and the expected proportion in the population (π), the formula for Cohen's g will be:

\(g=p-\pi\)

The sample proportion in the example was 0.26 and the expected proportion was 0.50, in the example this therefor gives:

\(g=0.26-0.50=-0.24\)

Often the absolute value is used (the so-called nondirectional Cohen's g):

\(g=|0.26-0.50|=|-0.24|=0.24\)

The example will have the following results:

Table 3
Example post-hoc effect sizes
Category 1	Category 2	n1	n2	total	proportion	Cohen g	size
married	widowed	972	181	1153	0,843	0,3430	large
married	divorced	972	314	1286	0,7558	0,2558	large
married	separated	972	79	1051	0,9248	0,4248	large
married	never married	972	395	1367	0,711	0,2110	medium
widowed	divorced	181	314	495	0,3657	0,1343	small
widowed	separated	181	79	260	0,6962	0,1962	medium
widowed	never married	181	395	576	0,3142	0,1858	medium
divorced	separated	314	79	393	0,799	0,2990	large
divorced	never married	314	395	709	0,4429	0,05712	small
separated	never married	79	395	474	0,1667	0,3333	large

Summarizing this for the report can be tricky. We could do something as:

Married had a large difference with the other categories (g > 0.25), except for never married, which was medium (g = .21). The same goes for separated, which had a large difference with the other categories (g > 0.29), except for widowed, which was medium (g = 0.19). The differences between divorced with widowed and never married were each small (g < 0.14). Divorced vs. separated was another large difference (g = 0.40) and widowed was medium different from separated (g = 0.20).

It is still a lot of text, so if your research is focussed on a particular category, you might want to just describe that category, and refer to a table as Table 3 in an appendix.

Alternatives to Cohen g could be Cohen h' or the Alternative Ratio.

Step 7: Reporting

In each step, we already discussed how it could be reported. For the example used, the final report could have somthing like the following:

The media often reports that people are getting less often married these days. Figure 1 shows the results of the marital status in the survey.

Figure 1
Results of marital status
bar chart example of marital status

As can be seen Figure 1 almost 50% of the respondents is married and only a few were separated.

A post-hoc pairwise binomial test with Bonferroni correction of marital status showed that all proportions were significantly different from each other (p < .003). Married had a large difference with the other categories (g > 0.25), except for never married, which was medium (g = .21). The same goes for separated, which had a large difference with the other categories (g > 0.29), except for widowed, which was medium (g = 0.19). The differences between divorced with widowed and never married were each small (g < 0.14). Divorced vs. separated was another large difference (g = 0.40) and widowed was medium different from separated (g = 0.20).

If you want to make things easy for yourself and are using Excel, Python or R, you can use my library/add-on to perform each step.

Using a stikpet Library/Add-On

Excel and the stikpetE add-on

Excel file from video: stikpetE - Single Nominal

Python and the stikpetP library

Notebook from video: stikpetP - Single Nominal.ipynb

R and the stikpetE library

Notebook from video: stikpetR - Single Nominal

Quick Analysis

Single Variable

binary

nominal

ordinal

Google adds