Nominal vs. Nominal (unpaired/independent)
Part 3a: Test for association (Pearson chi-square test of independence)
To test if two nominal variables have an association, the most commonly used test is the Pearson chi-square test of independence (Pearson, 1900). If the significance (p-value) of this test is below .05, the two nominal variables are considered to have a significant association.
One problem though is that the Pearson chi-square test should only be used if not too many cells have a so-called expected count of less than 5, and the minimum expected count is at least 1. So you will also have to check first if these conditions are met. Most often 'not too many cells' is fixed at no more than 20% of the cells. This is often referred to as the 'Cochran conditions', after Cochran (1954, p. 420). Note that, for example, Fisher (1925, p. 83) is stricter and requires all cells to have an expected count of at least 5.
Once you have checked the conditions and looked at the results, you can report the test results. In the example the percentage of cells with an expected count of less than 5 is actually 0%, so it is okay to use the test. The test results could then be reported as something like:
Gender and marital status were found to have a significant association, χ2(4, N = 1941) = 16.99, p < .001.
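As a quick illustration (a minimal sketch with made-up counts, not the page's own materials), both the test and the condition check can be done in Python with scipy's chi2_contingency:

```python
import numpy as np
from scipy.stats import chi2_contingency

# made-up 2x5 cross table (e.g. gender x marital status)
observed = np.array([[300, 150, 60, 240, 220],
                     [350, 210, 90, 200, 121]])

chi2, p, df, expected = chi2_contingency(observed, correction=False)

# Cochran conditions: minimum expected count at least 1,
# and no more than 20% of cells with an expected count below 5
pct_below_5 = (expected < 5).mean() * 100
print(f"chi2({df}, N = {observed.sum()}) = {chi2:.2f}, p = {p:.3f}")
print(f"minimum expected count: {expected.min():.1f}")
print(f"cells with expected count < 5: {pct_below_5:.0f}%")
```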
Click here to see how you can perform a Pearson chi-square test of independence...
with Excel
Excel file: TS - Pearson Chi square - Independence.xlsm
with Python
Jupyter Notebook: TS - Pearson Chi square - Independence.ipynb
Data file used: GSS2012a.csv
with R (Studio)
Click on the thumbnail below to see where you can find each of the values mentioned in the output of the software.
R Script: TS - Pearson Chi square - Independence.R
Data file used: GSS2012a.csv
with SPSS
Click on the thumbnail below to see where you can find each of the values mentioned in the output of the software.
Data file used: GSS2012-Adjusted.sav
with a TI-83
Manually (Formulas and example)
Formulas
The formula for the Pearson chi-square value is:
\(\chi^2=\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{i,j}-E_{i,j})^2}{E_{i,j}}\)
In this formula r is the number of rows, and c the number of columns. \(O_{i,j}\) is the observed frequency of row i and column j, and \(E_{i,j}\) is the expected frequency of row i and column j. The expected frequency can be determined using:
\(E_{i,j}=\frac{R_i\times{C_j}}{N}\)
In this formula \(R_i\) is the total of all observed frequencies in row i, \(C_j\) the total of all observed frequencies in column j, and N the grand total of all observed frequencies. In formula notation:
\(R_i=\sum_{j=1}^{c}O_{i,j}\)
\(C_j=\sum_{i=1}^{r}O_{i,j}\)
\(N=\sum_{i=1}^{r}R_i=\sum_{j=1}^{c}C_j=\sum_{i=1}^{r}\sum_{j=1}^{c}O_{i,j}\)
The number of degrees of freedom is the number of rows minus one, multiplied by the number of columns minus one. In formula notation:
\(df=(r-1)\times(c-1)\)
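Translating these formulas into code is straightforward; a minimal sketch in Python/NumPy (the function and variable names are my own, not from the page's materials):

```python
import numpy as np

def pearson_chi_square(observed):
    # observed: r x c table of observed frequencies O_ij
    O = np.asarray(observed, dtype=float)
    R = O.sum(axis=1)          # row totals R_i
    C = O.sum(axis=0)          # column totals C_j
    N = O.sum()                # grand total N
    E = np.outer(R, C) / N     # expected frequencies E_ij = R_i * C_j / N
    chi2 = ((O - E)**2 / E).sum()
    df = (O.shape[0] - 1) * (O.shape[1] - 1)
    return chi2, df
```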
Example
Note: this example is different from the one used in the rest of this page.
We are given the following table with observed frequencies.
Brand | Red | Blue |
---|---|---|
Nike | 10 | 8 |
Adidas | 6 | 4 |
Puma | 14 | 8 |
There are three rows, so r = 3, and two columns, so c = 2. Then we can determine the row totals:
\(R_1=\sum_{j=1}^{2}O_{1,j}=O_{1,1}+O_{1,2}=10+8=18\)
\(R_2=\sum_{j=1}^{2}O_{2,j}=O_{2,1}+O_{2,2}=6+4=10\)
\(R_3=\sum_{j=1}^{2}O_{3,j}=O_{3,1}+O_{3,2}=14+8=22\)
The column totals:
\(C_1=\sum_{i=1}^{3}O_{i,1}=O_{1,1}+O_{2,1}+O_{3,1}=10+6+14=30\)
\(C_2=\sum_{i=1}^{3}O_{i,2}=O_{1,2}+O_{2,2}+O_{3,2}=8+4+8=20\)
The grand total (all three formulas will give the same result):
\(N=\sum_{i=1}^{3}R_i=R_1+R_2+R_3=18+10+22=50\)
\(N=\sum_{j=1}^{2}C_j=C_1+C_2=30+20=50\)
\(N=\sum_{i=1}^{3}\sum_{j=1}^{2}O_{i,j}=O_{1,1}+O_{1,2}+O_{2,1}+O_{2,2}+O_{3,1}+O_{3,2}\)
\(=10+8+6+4+14+8=50\)
We can add the totals to our table:
Brand | Red | Blue | Total |
---|---|---|---|
Nike | 10 | 8 | 18 |
Adidas | 6 | 4 | 10 |
Puma | 14 | 8 | 22 |
Total | 30 | 20 | 50 |
Next we calculate the expected frequencies for each cell:
\(E_{1,1}=\frac{R_1\times{C_1}}{50} =\frac{18\times{30}}{50} =\frac{540}{50} =\frac{54}{5}=10.8\)
\(E_{1,2}=\frac{R_1\times{C_2}}{50} =\frac{18\times{20}}{50} =\frac{360}{50} =\frac{36}{5}=7.2\)
\(E_{2,1}=\frac{R_2\times{C_1}}{50} =\frac{10\times{30}}{50} =\frac{300}{50} =6\)
\(E_{2,2}=\frac{R_2\times{C_2}}{50} =\frac{10\times{20}}{50} =\frac{200}{50} =4\)
\(E_{3,1}=\frac{R_3\times{C_1}}{50} =\frac{22\times{30}}{50} =\frac{660}{50} =\frac{66}{5} =13.2\)
\(E_{3,2}=\frac{R_3\times{C_2}}{50} =\frac{22\times{20}}{50} =\frac{440}{50} =\frac{44}{5} =8.8\)
An overview of these in a table might be helpful:
Brand | Red | Blue | Total |
---|---|---|---|
Nike | 10.8 | 7.2 | 18 |
Adidas | 6 | 4 | 10 |
Puma | 13.2 | 8.8 | 22 |
Total | 30 | 20 | 50 |
Note that the totals remain the same. Now for the chi-square value. For each cell we need to determine:
\(\frac{(O_{i,j}-E_{i,j})^2}{E_{i,j}}\)
So again six times:
\(\frac{(O_{1,1}-E_{1,1})^2}{E_{1,1}} =\frac{(10-\frac{54}{5})^2}{\frac{54}{5}} =\frac{(\frac{50}{5}-\frac{54}{5})^2}{\frac{54}{5}} =\frac{(\frac{50-54}{5})^2}{\frac{54}{5}} =\frac{(\frac{-4}{5})^2}{\frac{54}{5}} =\frac{\frac{(-4)^2}{5^2}}{\frac{54}{5}} =\frac{\frac{16}{25}}{\frac{54}{5}} =\frac{16\times5}{25\times54} =\frac{8}{5\times27} =\frac{8}{135} \approx0.059\)
\(\frac{(O_{1,2}-E_{1,2})^2}{E_{1,2}} =\frac{(8-\frac{36}{5})^2}{\frac{36}{5}} =\frac{(\frac{40}{5}-\frac{36}{5})^2}{\frac{36}{5}} =\frac{(\frac{40-36}{5})^2}{\frac{36}{5}} =\frac{(\frac{4}{5})^2}{\frac{36}{5}} =\frac{\frac{(4)^2}{5^2}}{\frac{36}{5}} =\frac{\frac{16}{25}}{\frac{36}{5}} =\frac{16\times5}{25\times36} =\frac{4}{5\times9} =\frac{4}{45} \approx0.089\)
\(\frac{(O_{2,1}-E_{2,1})^2}{E_{2,1}} =\frac{(6-6)^2}{6} =\frac{(0)^2}{6} =\frac{0}{6} =0\)
\(\frac{(O_{2,2}-E_{2,2})^2}{E_{2,2}} =\frac{(4-4)^2}{4} =\frac{(0)^2}{4} =\frac{0}{4} =0\)
\(\frac{(O_{3,1}-E_{3,1})^2}{E_{3,1}} =\frac{(14-\frac{66}{5})^2}{\frac{66}{5}} =\frac{(\frac{70}{5}-\frac{66}{5})^2}{\frac{66}{5}} =\frac{(\frac{70-66}{5})^2}{\frac{66}{5}} =\frac{(\frac{4}{5})^2}{\frac{66}{5}} =\frac{\frac{4^2}{5^2}}{\frac{66}{5}} =\frac{\frac{16}{25}}{\frac{66}{5}} =\frac{16\times5}{25\times66} =\frac{8}{5\times33} =\frac{8}{165} \approx0.048\)
\(\frac{(O_{3,2}-E_{3,2})^2}{E_{3,2}} =\frac{(8-\frac{44}{5})^2}{\frac{44}{5}} =\frac{(\frac{40}{5}-\frac{44}{5})^2}{\frac{44}{5}} =\frac{(\frac{40-44}{5})^2}{\frac{44}{5}} =\frac{(\frac{-4}{5})^2}{\frac{44}{5}} =\frac{\frac{(-4)^2}{5^2}}{\frac{44}{5}} =\frac{\frac{16}{25}}{\frac{44}{5}} =\frac{16\times5}{25\times44} =\frac{4}{5\times11} =\frac{4}{55} \approx0.073\)
Then the chi-square value is the sum of all of these:
\(\chi^2=\sum_{i=1}^{3}\sum_{j=1}^{2}\frac{(O_{i,j}-E_{i,j})^2}{E_{i,j}} =\frac{8}{135}+\frac{4}{45}+0+0+\frac{8}{165}+\frac{4}{55}\)
\(=\frac{8\times11}{135\times11}+\frac{4\times33}{45\times33}+\frac{8\times9}{165\times9}+\frac{4\times27}{55\times27} =\frac{88}{1485}+\frac{132}{1485}+\frac{72}{1485}+\frac{108}{1485}\)
\(=\frac{88+132+72+108}{1485} =\frac{400}{1485} =\frac{80}{297} \approx0.269\)
The number of degrees of freedom is:
\(df=(3-1)\times(2-1)=(2)\times(1)=2\)
To determine the significance you then need the area under the chi-square distribution curve to the right of the found chi-square value, i.e. one minus the cumulative distribution. In formula notation:
\(p = 1 - \int_{x=0}^{\chi^{2}}\frac{x^{\frac{df}{2}-1}\times e^{-\frac{x}{2}}}{2^{\frac{df}{2}}\times\Gamma\left(\frac{df}{2}\right)}\,dx\)
This is usually done with the aid of either a distribution table, or some software.
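To check the worked example (a sketch of mine, not part of the original page's materials), the same calculation can be done with scipy:

```python
from scipy.stats import chi2, chi2_contingency

observed = [[10, 8],
            [6, 4],
            [14, 8]]

stat, p, df, expected = chi2_contingency(observed, correction=False)
print(stat, df)                      # 0.2693... (= 80/297), 2
print(p)                             # 0.874..., the same as:
print(1 - chi2.cdf(80 / 297, df=2))  # one minus the cumulative distribution
```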
You might now also wonder what the association actually is (e.g. which marital status is chosen differently by men and women). This will be the topic of the next page.
FAQs (click on the question to see the answer):
What if I do not meet the conditions? Fisher-Freeman-Halton Exact Test
If your data does not meet the two criteria, all is not lost. You could perhaps combine some categories that have a low count (e.g. combine all marital statuses other than 'married' into one), or you can perform a so-called Fisher exact test (for tables larger than 2x2 this is then also known as the Fisher-Freeman-Halton Exact Test).
Click here to see how you can perform a Fisher-Freeman-Halton Exact Test...
with Excel (not really possible)
There is no easy way to perform a Fisher-Freeman-Halton Exact Test for tables larger than 2x2 in Excel. A free add-on from Real Statistics can do the calculations for tables up to 2x3, 2x4, ..., 2x9, and 3x3, 3x4, and 3x5, each with a limit on the total sum of all the counts. An explanation and the add-on can be found at: https://www.real-statistics.com/chi-square-and-f-distributions/fishers-exact-test/
with Python (not really possible)
There is no easy way to perform a Fisher-Freeman-Halton Exact Test for tables larger than 2x2 in Python. One option, described here, is to call R from Python and use its function for this.
There is also a library 'FisherExact' that should work, but I couldn't get it to work.
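For a plain 2x2 table, though, the ordinary Fisher exact test is readily available in scipy; a minimal sketch with made-up counts:

```python
from scipy.stats import fisher_exact

# made-up 2x2 table; scipy's fisher_exact handles 2x2 tables
table = [[8, 2],
         [1, 5]]
odds_ratio, p = fisher_exact(table, alternative="two-sided")
print(p)
```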
with R
video to be uploaded
R script: TS - Fisher-Freeman-Halton.R
Data files: GSS2012a.csv and StudentStatistics.csv
with SPSS
Data file: GSS2012-Adjusted.sav.
With a Fisher exact test we only need to check the significance, and the interpretation is similar to that of the chi-square test. In the report this might go something like:
A two-sided Fisher exact test showed that gender and marital status have a significant association (N = 1941, p < .001).
What are these 'expected values'?
The expected values are the counts you would expect if the two variables were independent.
If for example I had 50 male and 50 female respondents, and 50 agreed with a statement while 50 disagreed, the expected value for each combination (male-agree, female-agree, male-disagree, and female-disagree) would be 25.
Note that if the actual survey results were that all males disagreed and all females agreed, there would be full dependency (i.e. gender fully decides if you agree or disagree), even though the row and column totals would still be 50. In essence the Pearson chi-square test checks whether your data is closer to the expected values (independence) or to this full dependency.
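The expected values follow directly from the row and column totals. A tiny sketch of the 50/50 example above (the variable names are my own):

```python
import numpy as np

row_totals = np.array([50, 50])   # male, female
col_totals = np.array([50, 50])   # agree, disagree
N = 100                           # grand total

expected = np.outer(row_totals, col_totals) / N
print(expected)                   # [[25. 25.]
                                  #  [25. 25.]]
```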
Are there any alternatives or variations? G, Freeman-Tukey, Cressie-Read, Neyman, and Mod-Log Likelihood test
The Pearson chi-square test and the Fisher exact test are probably the two most frequently used tests in this situation, but other tests exist as well, which some claim perform even better. Some alternatives for the Pearson chi-square test are the G-test (also known as a (Wilks) likelihood ratio test), the Freeman-Tukey test, the Neyman test, and the Mod-Log Likelihood test. Cressie and Read (1984) showed that all the mentioned tests are related by a single formula.
Another option worth mentioning is that for a chi-square test (such as the Pearson test and the G-test) some corrections have been suggested; these include the Yates correction (Yates, 1934), the Williams correction (Williams, 1976), and the E.S. Pearson correction (Pearson, 1947).
Click here to see how you can perform a G, Freeman-Tukey, Cressie-Read, Neyman, and Mod-Log Likelihood test of independence...
with Python
video to be uploaded
Jupyter Notebooks: TS - G-test (independence).ipynb, TS - Power Divergents (independence).ipynb
with R
video to be uploaded
R scripts: TS - G-test (independence).R, TS - Power Divergence (independence).R
Manually (Formulas)
Using the following settings:
\(r\) is the number of rows
\(c\) is the number of columns
\(F_{i,j}\) is the observed count in row i, column j
\(E_{i,j}\) is the expected count in row i, column j, calculated by:
\( E_{i,j} = \frac{R_i\times C_j}{n}\)
\(R_i\) is the row total, of row i: \(R_i = \sum_{j=1}^c F_{i,j}\)
\(C_j\) is the column total, of column j: \(C_j = \sum_{i=1}^r F_{i,j}\)
\(n\) is the sample size: \(n = \sum_{i=1}^r R_{i} = \sum_{j=1}^c C_{j} = \sum_{i=1}^r\sum_{j=1}^c F_{i,j}\)
The Tests
The formulas for the tests are then as follows.
G / (Wilks) Likelihood Ratio (1938, p. 62):
\( G = 2\times\sum_{i=1}^r\sum_{j=1}^c F_{i,j}\times\ln\left(\frac{F_{i,j}}{E_{i,j}}\right)\)
Mod-Log Likelihood
\( ML = 2\times\sum_{i=1}^r\sum_{j=1}^c E_{i,j}\times\ln\left(\frac{E_{i,j}}{F_{i,j}}\right)\)
Note that this is the same as the G-test, but with the observed and expected counts swapped.
Neyman (1949, p. 250)
\(N = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{\left(F_{i,j} - E_{i,j}\right)^2}{F_{i,j}}\)
Note that this is the same as the Pearson chi-square test, but with the observed and expected counts swapped.
Freeman-Tukey
\(T = 4\times\sum_{i=1}^{r}\sum_{j=1}^{c} \left(\sqrt{F_{i,j}}-\sqrt{E_{i,j}}\right)^2\)
The article all others point to is Freeman and Tukey (1950); however, I couldn't clearly find the formula anywhere in it. One source for the formula is Bishop et al. (2007, p. 513).
Cressie-Read (1984, p. 463)
\(CR = \frac{9}{5}\times\sum_{i=1}^{r}\sum_{j=1}^{c} F_{i,j}\times\left(\left(\frac{F_{i,j}}{E_{i,j}}\right)^{\frac{2}{3}} - 1\right)\)
As mentioned earlier, Cressie and Read derived a generic formula using 'power divergence'. They found a lambda of 2/3 to work best, which gives the above formula.
Their generic formula is:
\(\frac{2}{\lambda\times\left(\lambda+1\right)} \times\sum_{i=1}^{r}\sum_{j=1}^{c} F_{i,j}\times\left(\left(\frac{F_{i,j}}{E_{i,j}}\right)^{\lambda} - 1\right)\)
If λ = 0, the formula would give a division by zero, so in that case the formula from the G-test is used. If λ = -1 it would also lead to a division by zero, so then the mod-log likelihood formula is used.
Using λ = 1 gives the Pearson test, λ = -0.5 the Freeman-Tukey test, λ = -2 the Neyman test, and of course λ = 2/3 the Cressie-Read test.
The result of each of the above formulas is a chi-square value, i.e. it follows a chi-square distribution. The probability of such a value or more extreme (the sig./p-value) can then be calculated with the cumulative chi-square distribution, with degrees of freedom:
\(df = \left(r - 1\right)\times\left(c - 1\right) \)
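All of the tests above are available in Python through the lambda_ parameter of scipy's chi2_contingency; a minimal sketch (the table is made-up data):

```python
from scipy.stats import chi2_contingency

observed = [[10, 8],
            [6, 4],
            [14, 8]]

# lambda_ selects the member of the power divergence family
for lam in ["pearson", "log-likelihood", "mod-log-likelihood",
            "neyman", "freeman-tukey", "cressie-read"]:
    stat, p, df, _ = chi2_contingency(observed, correction=False, lambda_=lam)
    print(f"{lam:>18}: stat = {stat:.3f}, df = {df}, p = {p:.3f}")
```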
The Corrections
The Yates Continuity Correction (1934, p. 222)
WARNING: Only use this for 2x2 tables.
Adjust the observed counts using:
\(F_{i,j}^{'} = \begin{cases} F_{i,j}-0.5 & \text{ if } F_{i,j}>E_{i,j} \\ F_{i,j}+0.5 & \text{ if } F_{i,j}<E_{i,j} \\ F_{i,j} & \text{ if } F_{i,j}=E_{i,j} \end{cases}\)
Then use the adjusted counts instead of the observed counts.
Williams Correction (1976, p. 36)
\(q = 1 + \frac{\left(n\times\left(\sum_{i=1}^{r}\frac{1}{R_i}\right)-1\right) \times \left(n\times\left(\sum_{j=1}^{c}\frac{1}{C_j}\right)-1\right)}{6\times n\times\left(r - 1\right)\times\left(c - 1\right)}\)
Then adjust the original chi-square value by:
\(\chi_{adj}^2 = \frac{\chi_{original}^2}{q}\)
E.S. Pearson Correction (1947, p. 157)
Adjust the original chi-square value by:
\(\chi_{adj}^2 = \frac{n-1}{n} \times \chi_{original}^2\)
Note that E.S. Pearson is the son of Karl Pearson (from the Pearson chi-square test).
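A sketch of the three corrections as small Python helpers (my own functions, assuming the observed table, expected table, and chi-square value are already available):

```python
import numpy as np

def yates_adjusted_counts(O, E):
    # Yates: move each observed count 0.5 toward its expected count (2x2 only)
    return O - 0.5 * np.sign(O - E)

def williams_q(O):
    # Williams: correction factor q; divide the chi-square value by q
    O = np.asarray(O, dtype=float)
    n, r, c = O.sum(), O.shape[0], O.shape[1]
    R, C = O.sum(axis=1), O.sum(axis=0)
    num = (n * (1 / R).sum() - 1) * (n * (1 / C).sum() - 1)
    return 1 + num / (6 * n * (r - 1) * (c - 1))

def es_pearson_adjusted(chi2_original, n):
    # E.S. Pearson: scale the chi-square value by (n - 1) / n
    return (n - 1) / n * chi2_original
```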
For the Fisher exact test, alternatives are the Barnard test and the Boschloo test, although these two are for 2x2 tables only (i.e. two binary variables).
How do you get that chi symbol (χ) in Word?
Type in the letter 'c', then select it and change the font to 'Symbol'.