Analysing a binary vs. ordinal variable

2a: Test (Mann-Whitney U test)

The cross table and the multiple-compound bar chart from the example showed that males and females appear to think differently about how much material was available. Since “5.3 The amount of…” is an ordinal variable, we might be tempted to do something with the median of each group. The median is the score in the middle, i.e. at least 50% of all cases score equal to or higher than the median. However, if you look closely at Figure 1 from the previous section, you might notice that the median is the same for each group. Instead of testing whether the medians are equal, we therefore often check whether the two groups might have the same distribution.

The (Wilcoxon-)Mann-Whitney U test (Mann & Whitney, 1947) can do this by comparing the mean ranks of each group. Ranks are determined by first sorting all the scores on the ordinal variable; the lowest score then gets rank 1, the next one rank 2, etc. The highest possible rank is therefore the total number of cases. If two or more scores are the same, each of them gets the average of the ranks they would otherwise have received. For example, if the fourth, fifth and sixth scores are all a 9, then each of these three scores gets rank (4+5+6) / 3 = 5.
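The ranking rule can be sketched in a few lines of Python. This is only an illustration: the function name `average_ranks` and the list of scores are made up for this example, chosen so that the fourth, fifth and sixth sorted scores are all a 9.

```python
def average_ranks(scores):
    """Return the rank of each score, giving tied scores their average rank."""
    # Sort the positions by score, remembering where each score came from.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        # Find the run of equal scores that begins at sorted position `start`.
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        # Sorted positions start..end would get ranks start+1..end+1; average them.
        mean_rank = ((start + 1) + (end + 1)) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


# Hypothetical scores: sorted they are 2, 4, 7, 9, 9, 9, 11.
example = [9, 2, 9, 4, 11, 9, 7]
print(average_ranks(example))  # every 9 gets rank 5.0
```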

The exact p-value of the Mann-Whitney U test in the example is .008. This indicates that there is a .008 chance of having a Mann-Whitney U score as in the sample, or even more extreme, if the two groups have the same distribution in the population. Instead of ‘having a Mann-Whitney U score’ you can also read ‘having differences in mean ranks’.

If this chance is very low (usually below .05), we reject the assumption about the population and conclude that the two groups most likely have a different distribution. If the chance is high (usually .05 or above), we conclude that we do not have enough evidence to reject the assumption about the population.

In the example we already noticed from the visualisation that the females thought the amount of activities was sufficient or (far) too little. However, we can also see this from the mean ranks. The mean rank for the females was 14.05, and for the males 25.90. Since ‘5.3 The amount of …’ was coded from 1 = far too little to 5 = far too many, a higher mean rank indicates that the group tended more towards the higher end of the coding of the variable. In this case the higher mean rank for the males suggests that they tended more towards ‘far too many’ than the females.

We can add the results of the Mann-Whitney U test to our report as for example:

An exact Mann-Whitney U test indicated that the mean ranks for male and female were significantly different, U(n₁ = 11, n₂ = 34) = 285.5, p = .008.

If you do not have an exact p-value, then use the approximated one. In that case the test-statistic Z is actually used. The report would then go something like:

A Mann-Whitney U test indicated that the mean ranks for male and female were significantly different, Z(n₁ = 11, n₂ = 34) = 2.845, p = .004.

Click here to see how you can perform a Mann-Whitney U test...

With Excel

With Python

With R (studio)

With SPSS

Using non-parametric tests

Using Legacy dialogs

Manually (Formulas and Example)

Formula

The formula for the U statistic is:

\(U_i=R_i-\frac{n_i\times\left(n_i+1\right)}{2}\)

In this formula n_i is the number of scores in category i, and R_i the sum of the ranks from category i.

Often however there are ties, and we then need to adjust for those. We then need the z-statistic:

\(Z=\frac{2\times U_i-n_1\times n_2}{2\times SE}\)

The formula for SE (standard error) is:

\(SE=\sqrt{\frac{n_1\times n_2}{N\times\left(N-1\right)}\times\left(\frac{N^3-N}{12}-\sum{T_i}\right)}\)

Here N is the total number of scores (i.e. n₁ + n₂) and T_i the tie correction for tie i. For each unique rank, T_i is determined by:

\(T_i=\frac{t_i^3-t_i}{12}\)

Where t_i is the number of ranks tied for unique rank i.
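As a rough sketch, the formulas above can be combined into one Python function. The name `mann_whitney_z` and the two small groups of scores are hypothetical and only serve as an illustration; this is not the routine used by any particular statistics package.

```python
import math
from collections import Counter


def mann_whitney_z(group1, group2):
    """Return (U1, U2, SE, Z) using the tie-corrected normal approximation."""
    n1, n2 = len(group1), len(group2)
    N = n1 + n2
    # t_i: how often each distinct score occurs in the combined data.
    counts = Counter(list(group1) + list(group2))

    # Average rank of each distinct score:
    # the number of lower scores plus the middle of its own run of ties.
    rank_of = {}
    below = 0
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = below + (t + 1) / 2
        below += t

    R1 = sum(rank_of[v] for v in group1)
    R2 = sum(rank_of[v] for v in group2)
    U1 = R1 - n1 * (n1 + 1) / 2
    U2 = R2 - n2 * (n2 + 1) / 2

    # T_i = (t_i^3 - t_i) / 12, which is 0 for a score that occurs only once.
    tie_sum = sum((t**3 - t) / 12 for t in counts.values())
    SE = math.sqrt(n1 * n2 / (N * (N - 1)) * ((N**3 - N) / 12 - tie_sum))

    Z = (2 * U1 - n1 * n2) / (2 * SE)
    return U1, U2, SE, Z


# Hypothetical scores with one three-way tie (the 2's).
U1, U2, SE, Z = mann_whitney_z([1, 2, 2], [2, 3, 4, 5])
print(U1, U2, round(SE, 3), round(Z, 3))
```

Swapping the two groups only changes the sign of Z, so it does not matter for a two-tailed test which group is entered first.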

Example

Note: this is a different example than the one used in the rest of this section.

We are given the scores of one group of people:

\(X_1=(1,2,5,2,2)\)

And another group:

\(X_2=(4,3,5,5)\)

Note that the number of scores in the first group is five, and in the second four, so:

\(n_1=5, n_2=4, N=5+4=9\)

If we combine both groups we get:

\(C=(1,2,5,2,2,4,3,5,5)\)

The lowest score is a 1, so this gets a rank of 1. Then there are three 2's, so these get ranks 2, 3, and 4, or on average 3. There is only one 3, so this gets rank 5, only one 4 which gets rank 6, and there are three 5's, so these get ranks 7, 8, and 9, or on average 8. Replacing the original scores with the ranks (average ones), and summing them up we get for the first group:

\(R_1=1+3+8+3+3=18\)

And for the second:

\(R_2=6+5+8+8=27\)

The U statistic of the first group is:

\(U_1=R_1-\frac{n_1\times\left(n_1+1\right)}{2} =18-\frac{5\times\left(5+1\right)}{2} =18-\frac{30}{2}=18-15=3\)

And for the second group:

\(U_2=R_2-\frac{n_2\times\left(n_2+1\right)}{2} =27-\frac{4\times\left(4+1\right)}{2} =27-\frac{20}{2}=27-10=17\)
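A way to double-check these two values without ranks is to count pairs. The U statistic of a group equals the number of pairs (one score from each group) in which that group has the higher score, with a tied pair counting as a half. The helper name `u_by_pairs` below is made up for this sketch.

```python
x1 = [1, 2, 5, 2, 2]
x2 = [4, 3, 5, 5]


def u_by_pairs(a, b):
    """Count the pairs in which the score from `a` is higher; a tie counts as 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


u1 = u_by_pairs(x1, x2)
u2 = u_by_pairs(x2, x1)

# Every pair is won by one group or shared, so U1 + U2 equals n1 * n2.
print(u1, u2, u1 + u2 == len(x1) * len(x2))
```

This also shows why the two U values always add up to n₁ × n₂ (here 3 + 17 = 5 × 4 = 20).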

We had three 2's, and also three 5's. So for the frequencies of ties we get the sequence:

\(t=(3,3)\)

Now calculate the adjustment for each frequency of ties:

\(T_1=\frac{t_1^3-t_1}{12} =\frac{3^3-3}{12} =\frac{27-3}{12} =\frac{24}{12}=2\)

\(T_2=\frac{t_2^3-t_2}{12} =\frac{3^3-3}{12} =\frac{27-3}{12} =\frac{24}{12}=2\)

Then the standard error:

\(SE=\sqrt{\frac{n_1\times n_2}{N\times\left(N-1\right)}\times\left(\frac{N^3-N}{12}-\sum{T_i}\right)} =\sqrt{\frac{5\times 4}{9\times\left(9-1\right)}\times\left(\frac{9^3-9}{12}-(2+2)\right)}\)

\(=\sqrt{\frac{20}{72}\times\left(\frac{729-9}{12}-4\right)} =\sqrt{\frac{5}{18}\times\left(\frac{720}{12}-4\right)} =\sqrt{\frac{5}{18}\times\left(60-4\right)}\)

\(=\sqrt{\frac{5}{18}\times56} =\sqrt{\frac{5\times56}{18}} =\sqrt{\frac{5\times28}{9}} =\sqrt{\frac{140}{9}}\)

\(=\frac{\sqrt{140}}{\sqrt9} =\frac{\sqrt{4\times35}}{3} =\frac{\sqrt{4}\times\sqrt{35}}{3} =\frac{2}{3}\sqrt{35}\approx3.944\)
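The arithmetic can be verified with exact fractions, so that no rounding creeps in before the square root is taken. This is only a check of the numbers in this example; the variable names are chosen for this sketch.

```python
import math
from fractions import Fraction

n1, n2 = 5, 4
N = n1 + n2
tie_counts = [3, 3]  # three 2's and three 5's

# Sum of the tie corrections T_i = (t_i^3 - t_i) / 12.
tie_sum = sum(Fraction(t**3 - t, 12) for t in tie_counts)

# The part under the square root, kept as an exact fraction.
variance = Fraction(n1 * n2, N * (N - 1)) * (Fraction(N**3 - N, 12) - tie_sum)
se = math.sqrt(variance)

print(variance, round(se, 3))
```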

Finally the Z-score. If we use U₁:

\(Z=\frac{2\times U_1-n_1\times n_2}{2\times SE} =\frac{2\times 3-5\times 4}{2\times \frac{2}{3}\sqrt{35}} =\frac{6-20}{\frac{4}{3}\sqrt{35}} =\frac{-14}{\frac{4\sqrt{35}}{3}}\)

\(=\frac{-14\times3}{4\sqrt{35}} =\frac{-7\times3}{2\sqrt{35}} =\frac{-21}{2\sqrt{35}} =\frac{-21}{2\sqrt{35}}\times\frac{\sqrt{35}}{\sqrt{35}} =\frac{-21\times\sqrt{35}}{2\sqrt{35}\times\sqrt{35}}\)

\(=\frac{-21\sqrt{35}}{2\times35} =\frac{-21\sqrt{35}}{70} =\frac{-3\sqrt{35}}{10}\approx-1.775\)

If we use U₂:

\(Z=\frac{2\times U_2-n_1\times n_2}{2\times SE} =\frac{2\times 17-5\times 4}{2\times \frac{2}{3}\sqrt{35}} =\frac{34-20}{\frac{4}{3}\sqrt{35}} =\frac{14}{\frac{4\sqrt{35}}{3}}\)

\(=\frac{14\times3}{4\sqrt{35}} =\frac{7\times3}{2\sqrt{35}} =\frac{21}{2\sqrt{35}} =\frac{21}{2\sqrt{35}}\times\frac{\sqrt{35}}{\sqrt{35}} =\frac{21\times\sqrt{35}}{2\sqrt{35}\times\sqrt{35}}\)

\(=\frac{21\sqrt{35}}{2\times35} =\frac{21\sqrt{35}}{70} =\frac{3\sqrt{35}}{10}\approx1.775\)

For the two-tailed significance we can then use the standard normal distribution. Usually this is found either by using a table or some software, but if you must know, the formula would be:

\(2\times\int_{x=|Z|}^{\infty}\left(\frac{1}{\sqrt{2\times\pi}}\times e^{-\frac{x^2}{2}}\right)dx\)
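In software the integral does not have to be worked out by hand. A minimal sketch in Python, using only the standard library: the two-tailed area of the standard normal distribution beyond |Z| equals the complementary error function evaluated at |Z| / √2. The function name `two_tailed_p` is an assumption made for this illustration.

```python
import math


def two_tailed_p(z):
    """Two-tailed p-value for a standard normal test statistic z."""
    # P(|X| >= |z|) for X ~ N(0, 1) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


z = 3 * math.sqrt(35) / 10  # the Z-score from the example, about 1.775
p = two_tailed_p(z)
print(round(z, 3), round(p, 3))
```

For the example this gives a two-tailed significance of roughly .076. The same function applied to the Z = 2.845 from the report earlier in this section returns about .004.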

The term ‘exact’ might give you the impression that you should always use it, since ‘exact’ sounds better than ‘approximate’. Some might indeed argue for this (for example Berger (2017)), but the ‘exact’ test often requires a lot more computational power (even for computers today), and in some cases there are those who actually claim the approximate test is preferred (see for example Agresti and Coull (1998)).
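To get a feeling for why the exact test is more demanding, here is a sketch that works out an exact two-sided p-value for the small worked example by brute force. It tries every possible way of handing five of the nine ranks to the first group, and counts how often U₁ lies at least as far from its expected value n₁ × n₂ / 2 as the observed U₁. This is one common convention for a two-sided exact p-value and an assumption of this sketch; software packages may define it slightly differently.

```python
from itertools import combinations

# Average ranks of the combined scores (1, 2, 5, 2, 2, 4, 3, 5, 5) from the worked example.
ranks = [1, 3, 8, 3, 3, 6, 5, 8, 8]
n1, n2 = 5, 4
expected_u = n1 * n2 / 2


def u1_from(rank_subset):
    """U statistic of the first group, given the ranks assigned to it."""
    return sum(rank_subset) - n1 * (n1 + 1) / 2


# Distance of the observed U1 (first five ranks) from its expected value.
observed = abs(u1_from(ranks[:n1]) - expected_u)

total = 0
as_extreme = 0
for positions in combinations(range(len(ranks)), n1):
    total += 1
    u1 = u1_from([ranks[i] for i in positions])
    if abs(u1 - expected_u) >= observed:
        as_extreme += 1

exact_p = as_extreme / total
print(total, as_extreme, round(exact_p, 3))
```

With only nine scores there are 126 arrangements to check. The number of arrangements grows very quickly with the sample sizes, which is why enumeration like this becomes impractical for larger samples.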

An alternative to the Mann-Whitney U test is the Fligner-Policello test, or the independent-samples Mood median test. The Mood test can also be used with more than two groups, so it is discussed in the Nominal vs. Ordinal section. See the bottom of this page for more info on the Fligner-Policello test.

As a final note, the Mann-Whitney U test is sometimes also called the Wilcoxon-Mann-Whitney test, or Mann-Whitney-Wilcoxon test. This is because Mann and Whitney expanded on an idea from Wilcoxon. It is the same as the Wilcoxon Rank Sum test, but NOT the same as the Wilcoxon Signed Rank test.

We are almost done with our analysis, but there is one more thing to do. Besides testing for differences, we also need to indicate how big the differences are. This is done using a so-called effect size, which is the topic of the next section.

Appendix

Fligner-Policello test

The null hypothesis of the Mann-Whitney U test is that the scores of the two categories have the same distribution. Fligner and Policello (1981) adjusted the test so that it tests whether the two categories have the same median.

