Mann-Whitney U Test
Explanation
The Mann-Whitney U test can be used when you have a binary and an ordinal variable. It compares the mean ranks of the two groups (defined by the binary variable). Ranks are determined by first sorting all the scores on the ordinal variable: the lowest score gets rank 1, the next one rank 2, and so on, so the highest possible rank equals the total number of cases. If two or more scores are equal, each receives the average of the ranks they would otherwise have gotten. For example, if the fourth, fifth, and sixth scores are all 9, each of them gets rank (4 + 5 + 6) / 3 = 5.
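The tie-averaged ranking described above can be sketched in a few lines of Python (a minimal illustration, not a library implementation; the function name is just for this example):

```python
def average_ranks(scores):
    # Rank all scores; tied scores share the average of the ranks they span.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(scores):
        # Find the run of tied scores starting at sorted position i.
        j = i
        while j + 1 < len(scores) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        # Ranks are 1-based; every score in the run i..j shares the average rank.
        avg = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

# The three 9's occupy sorted positions 4, 5 and 6, so each gets (4+5+6)/3 = 5.
print(average_ranks([3, 5, 7, 9, 9, 9, 12]))  # [1.0, 2.0, 3.0, 5.0, 5.0, 5.0, 7.0]
```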
If we assume the two samples have the same shape, a Mann-Whitney U test can indeed be used to check whether the medians are equal. A difference would then only imply a location shift (the two distributions look the same, one is just shifted) (Chung & Romano, 2011, p. 5).
Divine et al. (2018, p. 279) show that without any assumptions about the distributions, the MWU test is a test of stochastic equality. It tests whether the probability that a randomly selected case from one category has a higher score than a randomly selected case from the other category is 50% (Divine et al., 2018, p. 286). In formula form: \(P(X < Y) + \frac{1}{2}\times P(X=Y) = \frac{1}{2}\). If this is rejected, one of the two samples stochastically dominates the other. However, Chung and Romano (2011, p. 5) note that the test then fails to control Type I errors, and a Brunner-Munzel test might be preferred.
Another interpretation of the MWU test is that it checks whether the two samples come from the same distribution (Fong & Huang, 2019). However, Divine et al. (2018, p. 279) note that the MWU test is really only a test of stochastic equality, and Chung and Romano (2011, p. 5) note that as a test of distributions the MWU does not have much power to detect differences.
It seems that for each scenario a better alternative is available:
- as a test of medians => the Fligner-Policello test only requires the distributions to be symmetric around their medians, not to have entirely the same shape. A bit more conservative, but without any assumptions, is the Mood median test. Another alternative might be the test developed by Schlag (2015).
- as a test of stochastic equivalence => the Brunner-Munzel test was designed for this, and the \(C^2\) test doesn't even have any restriction on sample sizes.
- as a test of distributions => the Kolmogorov-Smirnov test is probably better suited.
Despite this, the MWU test is still used very often. It is also referred to as the Wilcoxon-Mann-Whitney test or the Mann-Whitney-Wilcoxon test, because Mann and Whitney expanded on an idea from Wilcoxon. It is the same as the Wilcoxon rank-sum test.
The test can be done with an exact distribution (the Wilcoxon rank-sum distribution), but it is most often approximated using a normal distribution. The term ‘exact’ might give you the impression that you should always use it, since ‘exact’ sounds better than ‘approximate’. Some indeed argue for this (for example Berger (2017)), but the exact test often requires a lot more computational power (even for today's computers), and in some cases there are those who claim the approximate test is actually preferred (see for example Agresti and Coull (1998)).
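To make the exact/approximate distinction concrete, here is a brute-force Python sketch of an exact two-sided test: it enumerates every way the pooled ranks could be split over the two groups, which is exactly what makes the exact test computationally expensive for larger samples (a simplified illustration assuming no ties; the function name is hypothetical):

```python
from itertools import combinations
from math import comb

def exact_mwu_p(x, y):
    # Two-sided exact Mann-Whitney p-value by brute-force enumeration.
    # Feasible only for small, tie-free samples; real software uses recursions.
    pooled = sorted(x + y)
    n1, n2 = len(x), len(y)
    N = n1 + n2
    mu = n1 * n2 / 2                             # mean of U under the null

    def u_from_positions(pos):
        r1 = sum(p + 1 for p in pos)             # rank sum of group 1 (1-based)
        return r1 - n1 * (n1 + 1) / 2

    # Observed U: with no ties, a score's rank is its position in the sorted pool.
    u_obs = u_from_positions([pooled.index(v) for v in x])

    # Under the null, every assignment of n1 of the N ranks to group 1 is
    # equally likely; count the assignments at least as extreme as observed.
    extreme = sum(1 for pos in combinations(range(N), n1)
                  if abs(u_from_positions(pos) - mu) >= abs(u_obs - mu))
    return extreme / comb(N, n1)

# For x = (1, 2, 6), y = (3, 4, 7, 8) all C(7, 3) = 35 splits are enumerated.
print(exact_mwu_p([1, 2, 6], [3, 4, 7, 8]))  # 8/35, about 0.229
```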
Performing the Test
with SPSS
using Nonparametric
using Legacy Dialogs
Formulas
Formula
The formula for the U statistic is:
\(U_i=R_i-\frac{n_i\times\left(n_i+1\right)}{2}\)
In this formula, \(n_i\) is the number of scores in category i, and \(R_i\) the sum of the ranks from category i.
Often, however, there are ties, and we then need to adjust for them. We then use the z-statistic:
\(Z=\frac{2\times U_i-n_1\times n_2}{2\times SE}\)
The formula for SE (standard error) is:
\(SE=\sqrt{\frac{n_1\times n_2}{N\times\left(N-1\right)}\times\left(\frac{N^3-N}{12}-\sum{T_i}\right )}\)
The \(N\) is the total number of scores (i.e. \(n_1 + n_2\)), and \(T_i\) the tie correction for tie i. For each unique rank, \(T_i\) is determined by:
\(T_i=\frac{t_i^3-t_i}{12}\)
Where \(t_i\) is the number of scores tied at unique rank i.
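The formulas above can be combined into a short Python sketch (illustrative only, assuming the tie-corrected normal approximation described here; it reproduces the numbers from the example below):

```python
from collections import Counter
from math import sqrt

def mwu_z(x, y):
    # Mann-Whitney U for group 1 and the tie-corrected z-statistic.
    n1, n2 = len(x), len(y)
    N = n1 + n2
    pooled = sorted(x + y)

    # Average rank for each distinct value: ties share the mean of their ranks.
    rank_of = {}
    i = 0
    while i < N:
        j = i
        while j + 1 < N and pooled[j + 1] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = ((i + 1) + (j + 1)) / 2
        i = j + 1

    r1 = sum(rank_of[v] for v in x)              # rank sum of group 1
    u1 = r1 - n1 * (n1 + 1) / 2                  # U statistic

    # Tie correction: T_i = (t_i^3 - t_i) / 12 for each distinct value.
    ties = sum((t**3 - t) / 12 for t in Counter(pooled).values())
    se = sqrt(n1 * n2 / (N * (N - 1)) * ((N**3 - N) / 12 - ties))
    z = (2 * u1 - n1 * n2) / (2 * se)
    return u1, z

# The example data from this section: U1 = 3, z is about -1.775.
print(mwu_z([1, 2, 5, 2, 2], [4, 3, 5, 5]))
```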
Example
Note: this is a different example than the one used in the rest of this section.
We are given the scores of one group of people:
\(X_1=(1,2,5,2,2)\)
And another group:
\(X_2=(4,3,5,5)\)
Note that the number of scores in the first group is five, and in the second four, so:
\(n_1=5, n_2=4, N=5+4=9\)
If we combine both groups we get:
\(C=(1,2,5,2,2,4,3,5,5)\)
The lowest score is a 1, so this gets a rank of 1. Then there are three 2's, so these get ranks 2, 3, and 4, or on average 3. There is only one 3, so this gets rank 5, only one 4 which gets rank 6, and there are three 5's, so these get ranks 7, 8, and 9, or on average 8. Replacing the original scores with the ranks (average ones), and summing them up we get for the first group:
\(R_1=1+3+8+3+3=18\)
And for the second:
\(R_2=6+5+8+8=27\)
The U statistic of the first group is:
\(U_1=R_1-\frac{n_1\times\left(n_1+1\right)}{2} =18-\frac{5\times\left(5+1\right)}{2} =18-\frac{30}{2}=18-15=3\)
And for the second group:
\(U_2=R_2-\frac{n_2\times\left(n_2+1\right)}{2} =27-\frac{4\times\left(4+1\right)}{2} =27-\frac{20}{2}=27-10=17\)
We had three 2's, and also three 5's. So for the frequencies of ties we get the sequence:
\(T=(3,3)\)
Now we calculate the adjustment for each frequency of ties:
\(T_1=\frac{t_1^3-t_1}{12} =\frac{3^3-3}{12} =\frac{27-3}{12} =\frac{24}{12}=2\)
\(T_2=\frac{t_2^3-t_2}{12} =\frac{3^3-3}{12} =\frac{27-3}{12} =\frac{24}{12}=2\)
Then the standard error:
\(SE=\sqrt{\frac{n_1\times n_2}{N\times\left(N-1\right)}\times\left(\frac{N^3-N}{12}-\sum{T_i}\right)} =\sqrt{\frac{5\times 4}{9\times\left(9-1\right)}\times\left(\frac{9^3-9}{12}-(2+2)\right)}\)
\(=\sqrt{\frac{20}{72}\times\left(\frac{729-9}{12}-4\right)} =\sqrt{\frac{5}{18}\times\left(\frac{720}{12}-4\right)} =\sqrt{\frac{5}{18}\times\left(60-4\right)}\)
\(=\sqrt{\frac{5}{18}\times56} =\sqrt{\frac{5\times56}{18}} =\sqrt{\frac{5\times28}{9}} =\sqrt{\frac{140}{9}}\)
\(=\frac{\sqrt{140}}{\sqrt9} =\frac{\sqrt{4\times35}}{3} =\frac{\sqrt{4}\times\sqrt{35}}{3} =\frac{2}{3}\sqrt{35}\approx3.944\)
Finally the Z-score. If we use U1:
\(Z=\frac{2\times U_i-n_1\times n_2}{2\times SE} =\frac{2\times 3-5\times 4}{2\times \frac{2}{3}\sqrt{35}} =\frac{6-20}{\frac{4}{3}\sqrt{35}} =\frac{-14}{\frac{4\sqrt{35}}{3}}\)
\(=\frac{-14\times3}{4\sqrt{35}} =\frac{-7\times3}{2\sqrt{35}} =\frac{-21}{2\sqrt{35}} =\frac{-21}{2\sqrt{35}}\times\frac{\sqrt{35}}{\sqrt{35}} =\frac{-21\times\sqrt{35}}{2\sqrt{35}\times\sqrt{35}}\)
\(=\frac{-21\sqrt{35}}{2\times35} =\frac{-21\sqrt{35}}{70} =\frac{-3\sqrt{35}}{10}\approx-1.775\)
If we use U2:
\(Z=\frac{2\times U_i-n_1\times n_2}{2\times SE} =\frac{2\times 17-5\times 4}{2\times \frac{2}{3}\sqrt{35}} =\frac{34-20}{\frac{4}{3}\sqrt{35}} =\frac{14}{\frac{4\sqrt{35}}{3}}\)
\(=\frac{14\times3}{4\sqrt{35}} =\frac{7\times3}{2\sqrt{35}} =\frac{21}{2\sqrt{35}} =\frac{21}{2\sqrt{35}}\times\frac{\sqrt{35}}{\sqrt{35}} =\frac{21\times\sqrt{35}}{2\sqrt{35}\times\sqrt{35}}\)
\(=\frac{21\sqrt{35}}{2\times35} =\frac{21\sqrt{35}}{70} =\frac{3\sqrt{35}}{10}\approx1.775\)
For the two-tailed significance we can then use the standard normal distribution. Usually this is found either by using a table, or some software, but if you must know the formula would be:
\(2\times\int_{x=|Z|}^{\infty}\left(\frac{1}{\sqrt{2\times\pi}}\times e^{-\frac{x^2}{2}}\right)dx\)
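In Python this two-tailed probability can be obtained from the standard library, since \(2\times P(Z>|z|)\) equals \(\operatorname{erfc}(|z|/\sqrt{2})\) (a small sketch using the z-value from the example above):

```python
from math import erfc, sqrt

def two_tailed_p(z):
    # Two-tailed p-value from a standard-normal z: 2 * P(Z > |z|) = erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2))

z = -3 * sqrt(35) / 10            # the z from the example above, about -1.775
print(round(two_tailed_p(z), 3))  # about 0.076
```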
Interpreting the Result
The assumption about the population for this test (the null hypothesis) is that the medians are equal (if the two categories have similarly shaped distributions), or that the distributions are equal (if they look different).
The test provides a p-value: the probability of obtaining a test statistic as extreme as, or more extreme than, the one from the sample, if the assumption about the population were true. If this p-value (significance) is below a pre-defined threshold (the significance level \(\alpha\)), the assumption about the population is rejected. We then speak of a (statistically) significant result. The threshold is usually set at 0.05; anything below that is then considered low.
If the assumption is rejected, we conclude that the medians in the population are different, or that the distributions are different.
Note that if we do not reject the assumption, it does not mean we accept it, we simply state that there is insufficient evidence to reject it.
Writing the results
Writing up the results of the test uses the format (APA, 2019, p. 182):
U(n1 = <number of cases in 1st category>, n2 = <number of cases in 2nd category>) = <U-value>, p = <p-value>
So for example if an exact test is used:
An exact Mann-Whitney U test indicated that the mean ranks for male and female were significantly different, U(n1 = 11, n2 = 34) = 285.5, p = .008.
If you do not have an exact p-value, then use the approximated one. In that case the test-statistic Z is actually used. The report would then go something like:
A Mann-Whitney U test indicated that the mean ranks for male and female were significantly different, z(n1 = 11, n2 = 34) = 2.845, p = .004.
A few notes about reporting statistical results with APA:
- The p-value is shown with three decimal places, and no 0 before the decimal sign. If the p-value is below .0005, it can be reported as p < .001.
- Both U and z are standard abbreviations from APA for the Mann-Whitney U test statistic, and standardized score (see APA, 2019, table 6.5). They do not need to be explained.
- APA does not require references or formulas for statistical analyses that are in common use (2019, p. 181).
- APA (2019, p. 88) states to also report an effect size measure.
Next...
The next step is to determine an effect size measure. The Vargha-Delaney A, a Rosenthal correlation, or a (Glass) rank-biserial correlation (Cliff's delta) could be suitable for this.
Alternatives
Alternatives for testing stochastic equivalence:
- Brunner-Munzel test
- the Brunner-Munzel studentized permutation test
- C-square test, which is an improvement on the Brunner-Munzel test
- Cliff's delta, which according to Delaney and Vargha (2002) performs similarly to the Brunner-Munzel test.
If you only want to test whether the medians are equal:
- Mann-Whitney U, assuming distributions have the same shape
- Fligner-Policello, assuming distributions are symmetric around the median, and continuous data
- Mood median test, although according to Schlag (2015) this actually tests quantiles, and can lead to over-rejection.
- Schlag's test, but it can only be used to accept or reject; it gives no p-value.
