Module stikpetP.tests.test_brunner_munzel

import pandas as pd
from scipy.stats import norm, t

def ts_brunner_munzel(bin_field, ord_field, categories=None, levels=None, distribution='t'):
    '''
    Brunner-Munzel Test
    --------------------------------------------
    The Brunner-Munzel test.
    
    The Brunner-Munzel test tests so-called stochastic dominance: if you pick a random score from the first category, is there a higher (or lower) than 50% chance that it will be higher than a random score from the other category? Note this is not the same as a median test. A very short example:
    
    If we have the scores 40, 50 and 60 in group A, and the scores 30, 50, and 51 in group B, the median of each group is 50. However, if we pick 40 from group A, there is a 1/3 chance it is higher than a random value from group B. If we pick 50 there is also a 1/3 chance, and if we pick 60 there is a 3/3 chance. Together, there is a (1/3+1/3+3/3)/3 = 5/9 chance that a random value from group A is higher than a random value from group B. This is quite different from 50%.
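The 5/9 in this example can be checked directly by counting pairs; a minimal sketch using the same toy scores:

```python
from itertools import product

a = [40, 50, 60]  # group A
b = [30, 50, 51]  # group B

# count the pairs where the score from group A is strictly higher
wins = sum(x > y for x, y in product(a, b))
p = wins / (len(a) * len(b))
print(p)  # 5/9, approximately 0.5556
```

Ties (50 vs. 50) are counted as "not higher" here, matching the narrative above; the Brunner-Munzel formulation handles ties via midranks.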

    The test can therefore be used when we have a binary and an ordinal variable. The Mann-Whitney U test is probably the most famous test for this scenario, but it has some notable limitations, and the Brunner-Munzel test may be better suited in many cases. The Fligner-Policello test is another test that is sometimes used.
    
    Brunner and Munzel (2000, p. 21) indicate that the computed test statistic follows a standard normal distribution if each category has 50 or more data points. They also remark (p. 22) that the test is no longer accurate if sample sizes are less than 10 (Schüürhuis et al. (2025, p. 18) list 15 instead). Neubert and Brunner (2007) propose a studentized permutation test for these cases. Schüürhuis et al. (2025) also developed an improved version of this test, called the \(C^2\) test.

    This function is shown in this [YouTube video](https://youtu.be/qwugQCbXtnM) and the test is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Tests/BrunnerMunzelTest.html).
    
    Parameters
    ----------
    bin_field : pandas series
        data with categories for the rows
    ord_field : pandas series
        data with the scores (ordinal field)
    categories : list or dictionary, optional
        the two categories to use from bin_field. If not set the two most frequent categories will be used
    levels : list or dictionary, optional
        the scores in order
    distribution : {"t", "z"}, optional
        whether to use the Student t distribution ("t") or the standard normal z distribution ("z").

    Returns
    -------
    A dataframe with:
    
    * *var. est.*, the estimated overall variance
    * *min n*, the sample size of the category with the lowest count
    * *test statistic*, the Brunner-Munzel statistic
    * *df*, the degrees of freedom
    * *p-value*, the significance (p-value), two-tailed
    * *categories*, the two categories that were used
    * *test used*, description of the test used and distribution.
    
    Notes
    -----
    The test-statistic is calculated using (Brunner & Munzel, 2000, p. 21, eq. 4.8):
    $$W_n^{BM} = \\frac{\\bar{R}_2 - \\bar{R}_1}{\\sqrt{N \\times \\hat{\\sigma}_N^2}}$$
    
    with degrees of freedom (Brunner & Munzel, 2000, p. 22):
    $$df = \\frac{\\hat{\\sigma}_N^4}{N^2 \\times \\sum_{k=1}^2 \\frac{\\hat{\\sigma}_k^4}{\\left(n_k - 1\\right)\\times n_k^2}}$$

    The total estimated variance (Brunner & Munzel, 2000, p. 21, eq. 4.7):
    $$\\hat{\\sigma}_N^2 = N\\times\\left(\\frac{\\hat{\\sigma}_1^2}{n_1} + \\frac{\\hat{\\sigma}_2^2}{n_2}\\right)$$
    
    The estimated variance for each category (Brunner & Munzel, 2000, p. 20, eq. 4.6):
    $$\\hat{\\sigma}_k^2 = \\frac{s_{R_k^*}^2}{\\left(N - n_k\\right)^2}$$
    
    the sample variance of placement values (Brunner & Munzel, 2000, p. 20, eq. 4.5):
    $$s_{R_k^*}^2 = \\frac{\\sum_{i=1}^{n_k} \\left(R_{ik}^* - \\bar{R}_{k}^*\\right)^2}{n_k - 1}$$

    mean of the placement values:
    $$\\bar{R}_{k}^* = \\frac{\\sum_{i=1}^{n_k} R_{ik}^*}{n_k}$$

    the placement values:
    $$R_{ik}^* = R_{ik} - R_{ik}^{(k)}$$

    mean of the pooled ranks for each category:
    $$\\bar{R}_k = \\frac{\\sum_{i=1}^{n_k} R_{ik}}{n_k}$$

    *Symbols used:*
    
    * \\(R_{ik}\\), the rank of the i-th score in category k, when using all combined scores
    * \\(R_{ik}^{(k)}\\), the rank of the i-th score in category k, when using only scores from category k
    * \\(N\\), the total sample size
    * \\(n_{k}\\), the number of scores in category k
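    The chain of formulas above can be traced on the toy data from the introduction. This is a sketch of the computation only (not the package function itself), taking group A as category 1 and group B as category 2:

```python
import pandas as pd

# toy data from the example above: group A vs group B
df = pd.DataFrame({'cat': ['A']*3 + ['B']*3,
                   'score': [40, 50, 60, 30, 50, 51]})

# pooled mid-ranks (R_ik) and within-category mid-ranks (R_ik^(k))
df['rank'] = df['score'].rank()
df['rank_within'] = df.groupby('cat')['score'].rank()
df['placement'] = df['rank'] - df['rank_within']      # R_ik^*

nk = df['cat'].value_counts()                          # n_k
N = nk.sum()                                           # N
s2 = df.groupby('cat')['placement'].var(ddof=1)        # eq. 4.5
var_k = s2 / (N - nk)**2                               # eq. 4.6
var_N = N * (var_k / nk).sum()                         # eq. 4.7
barR = df.groupby('cat')['rank'].mean()                # mean pooled ranks

W = (barR['B'] - barR['A']) / (N * var_N)**0.5         # eq. 4.8
df_bm = var_N**2 / (N**2 * (var_k**2 / ((nk - 1) * nk**2)).sum())
print(round(W, 4), round(df_bm, 2))                    # -0.3922 4.0
```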

    Before, After and Alternatives
    ------------------------------
    Before running the test you might first want to get an impression using a cross table and/or some visualisation:
    
    * [tab_cross](../other/table_cross.html#tab_cross)
    * [vi_bar_stacked_multiple](../visualisations/vis_bar_stacked_multiple.html#vi_bar_stacked_multiple)

    After the test you might want an effect size measure:
    
    * [es_common_language_is](../effect_sizes/eff_size_common_language_is.html#es_common_language_is) for Common Language Effect Size
    * [me_hodges_lehmann_is](../measures/meas_hodges_lehmann_is.html#me_hodges_lehmann_is) for Hodges-Lehmann estimate
    * [r_rank_biserial_is](../correlations/cor_rank_biserial_is.html#r_rank_biserial_is) for (Glass) Rank Biserial (Cliff delta)

    Alternatives for this test could be:
    
    * [ts_mann_whitney](../tests/test_mann_whitney.html#ts_mann_whitney) for the Mann-Whitney U test
    * [ts_fligner_policello](../tests/test_fligner_policello.html#ts_fligner_policello) for the Fligner-Policello test
    * [ts_brunner_munzel_perm](../tests/test_brunner_munzel.html#ts_brunner_munzel_perm) for the Brunner-Munzel Permutation test
    * [ts_c_square](../tests/test_c_square.html#ts_c_square) for the \\(C^2\\) test

    References
    ----------
    Brunner, E., & Munzel, U. (2000). The nonparametric Behrens-Fisher problem: Asymptotic theory and a small-sample approximation. *Biometrical Journal, 42*(1), 17–25. https://doi.org/10.1002/(SICI)1521-4036(200001)42:1<17::AID-BIMJ17>3.0.CO;2-U

    Neubert, K., & Brunner, E. (2007). A studentized permutation test for the non-parametric Behrens–Fisher problem. *Computational Statistics & Data Analysis, 51*(10), 5192–5204. https://doi.org/10.1016/j.csda.2006.05.024
    
    Schüürhuis, S., Konietschke, F., & Brunner, E. (2025). A new approach to the Nonparametric Behrens-Fisher problem with compatible confidence intervals (arXiv:2504.01796). arXiv. https://doi.org/10.48550/arXiv.2504.01796

    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076

    Examples
    --------
    >>> pd.set_option('display.width',1000)
    >>> file1 = "https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv"
    >>> df1 = pd.read_csv(file1, sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> coding = {'Far too little': 1, 'too little': 2, 'Enough': 3, 'Too much': 4, 'Far too much': 5}
    >>> ts_brunner_munzel(df1['Gen_Gender'], df1['Mix_NrAct'], levels=coding)
       var. est.  min n  test statistic         df   p-value    categories                           test used
    0   0.156508     11        4.465838  36.980916  0.000073  Male, Female  Brunner-Munzel with t distribution
    
    '''
    # DATA PREPARATION
    #convert to pandas series if needed
    if type(bin_field) is list:
        bin_field = pd.Series(bin_field)
    
    if type(ord_field) is list:
        ord_field = pd.Series(ord_field)
    
    #combine as one dataframe
    df = pd.concat([bin_field, ord_field], axis=1)
    df = df.dropna()
    df.columns=['cat', 'score']
    
    #replace the ordinal values if levels is provided
    if levels is not None:
        df['score'] = df['score'].map(levels).astype('Int8')
    else:
        df['score'] = pd.to_numeric(df['score'])
    
    #the two categories
    if categories is not None:
        cat1 = categories[0]
        cat2 = categories[1]
    else:
        cat1 = df.iloc[:,0].value_counts().index[0]
        cat2 = df.iloc[:,0].value_counts().index[1]
    # remove scores not in either of the two categories
    df = df[df['cat'].isin([cat1, cat2])]

    # RANKING
    # get the mid-ranks of the pooled scores (R_ik)
    df['rank'] = df['score'].rank()
    # the mid-ranks within each category (R_ik^(k))
    df['rank_within'] = df.groupby('cat')['score'].rank()
    # the placement scores (R_ik^*)
    df['placement'] = df['rank'] - df['rank_within']

    # TEST
    # the unbiased variances for each category (s_Rk*^2)
    unbVarRast = df.groupby('cat')['placement'].var(ddof=1)
    # the counts
    nk = df['cat'].value_counts()
    # the overall sample size
    N = sum(nk)
    # estimated variance for each category (sigma_k^2)
    vark = unbVarRast / (N - nk)**2
    # the overall estimated variance
    varN = N*sum(vark/nk)
    # the mean mid-rank for each category (based on pooled ranks)
    barRk = df.groupby('cat')['rank'].mean()
    # the test statistic
    W = (barRk.iloc[1] - barRk.iloc[0])/(N*varN)**0.5
    
    # minimum n
    min_n = min(nk)
    if distribution == 't':
        if min_n < 10:
            test_used = 'Brunner-Munzel with t distribution (not recommended)'
        else:
            test_used = 'Brunner-Munzel with t distribution'
            
        # the degrees of freedom
        deg_fr = varN**2/(N**2*sum(vark**2/((nk - 1)*nk**2)))

        # the two-tailed p-value
        p_val = 2*(1 - t.cdf(abs(W), deg_fr))
        
    elif distribution == 'z':
        if min_n < 50:
            test_used = 'Brunner-Munzel with standard normal distribution (not recommended)'
        else:
            test_used = 'Brunner-Munzel with standard normal distribution'
        
        # the two-tailed p-value    
        p_val = 2*(1 - norm.cdf(abs(W)))
        deg_fr = 'n.a.'

    results = {'var. est.':varN, 'min n':min_n, 'test statistic':W, 'df':deg_fr, 'p-value':p_val, 'categories':cat1 + ', ' + cat2, 'test used':test_used}
    results_pd = pd.DataFrame([results])
    return results_pd


def ts_brunner_munzel_perm(bin_field, ord_field, categories=None, levels=None, n_iter=1000):
    '''
    Brunner-Munzel Permutation Test
    --------------------------------------------
    This function performs a studentized permutation version of the Brunner-Munzel test. It is recommended over the regular version when sample sizes are small (e.g. less than 10, or 15).
    
    Parameters
    ----------
    bin_field : pandas series
        data with categories for the rows
    ord_field : pandas series
        data with the scores (ordinal field)
    categories : list or dictionary, optional
        the two categories to use from bin_field. If not set the two most frequent categories will be used
    levels : list or dictionary, optional
        the scores in order
    n_iter : integer, optional
        number of iterations.

    Returns
    -------
    A dataframe with:
    
    * *iters*, the number of iterations
    * *observed*, the observed Brunner-Munzel statistic
    * *n above*, the number of permutations with a value above (not equal to) the observed statistic
    * *n below*, the number of permutations with a value below (not equal to) the observed statistic
    * *p-appr.*, the significance (p-value), two-tailed
    
    Notes
    -----
    The idea for a studentized permutation test was proposed by Neubert and Brunner (2007). The regular Brunner-Munzel test statistic is calculated, then the category labels are shuffled and the statistic is re-calculated. This is repeated for the requested number of iterations.
    
    The minimum of the number of permutation statistics above vs. below the observed statistic is multiplied by two and divided by the number of iterations to obtain the p-value (Schüürhuis et al., 2025, p. 7).
    
    See ts_brunner_munzel for more details.
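    The final p-value step can be sketched in a few lines; the permutation statistics below are hypothetical stand-ins for the re-computed W values, and `observed` is an illustrative value, not a result from real data:

```python
import random

random.seed(42)
observed = 1.8  # hypothetical observed Brunner-Munzel statistic

# hypothetical re-computed statistics from 1000 label shuffles
perms = [random.gauss(0, 1) for _ in range(1000)]

n_above = sum(w > observed for w in perms)
n_below = sum(w < observed for w in perms)
# two-tailed approximate p-value (Schüürhuis et al., 2025, p. 7)
p_appr = 2 * min(n_above, n_below) / len(perms)
```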
    
    Before, After and Alternatives
    ------------------------------
    Before running the test you might first want to get an impression using a cross table and/or some visualisation:
    * [tab_cross](../other/table_cross.html#tab_cross)
    * [vi_bar_stacked_multiple](../visualisations/vis_bar_stacked_multiple.html#vi_bar_stacked_multiple)

    After the test you might want an effect size measure:
    * [es_common_language_is](../effect_sizes/eff_size_common_language_is.html#es_common_language_is) for Common Language Effect Size
    * [es_hodges_lehmann_is](../effect_sizes/eff_size_hodges_lehmann_is.html#es_hodges_lehmann_is) for Hodges-Lehmann
    * [r_rank_biserial_is](../correlations/cor_rank_biserial_is.html#r_rank_biserial_is) for (Glass) Rank Biserial (Cliff delta)

    Alternatives for this test could be:
    * [ts_mann_whitney](../tests/test_mann_whitney.html#ts_mann_whitney) for the Mann-Whitney U test
    * [ts_fligner_policello](../tests/test_fligner_policello.html#ts_fligner_policello) for the Fligner-Policello test

    References
    ----------
    Neubert, K., & Brunner, E. (2007). A studentized permutation test for the non-parametric Behrens–Fisher problem. *Computational Statistics & Data Analysis, 51*(10), 5192–5204. https://doi.org/10.1016/j.csda.2006.05.024

    Schüürhuis, S., Konietschke, F., & Brunner, E. (2025). A new approach to the Nonparametric Behrens-Fisher problem with compatible confidence intervals (arXiv:2504.01796). arXiv. https://doi.org/10.48550/arXiv.2504.01796
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    '''
    # DATA PREPARATION
    #convert to pandas series if needed
    if type(bin_field) is list:
        bin_field = pd.Series(bin_field)
    
    if type(ord_field) is list:
        ord_field = pd.Series(ord_field)
    
    #combine as one dataframe
    df = pd.concat([bin_field, ord_field], axis=1)
    df = df.dropna()
    df.columns=['cat', 'score']
    
    #replace the ordinal values if levels is provided
    if levels is not None:
        df['score'] = df['score'].map(levels).astype('Int8')
    else:
        df['score'] = pd.to_numeric(df['score'])
    
    #the two categories
    if categories is not None:
        cat1 = categories[0]
        cat2 = categories[1]
    else:
        cat1 = df.iloc[:,0].value_counts().index[0]
        cat2 = df.iloc[:,0].value_counts().index[1]
    # remove scores not in either of the two categories
    df = df[df['cat'].isin([cat1, cat2])]

    # get the mid-ranks of the pooled scores (R_ik)
    df['rank'] = df['score'].rank()
   
    # the counts
    nk = df['cat'].value_counts()
    # the overall sample size
    N = sum(nk)

    Ws = []
    for i in range(n_iter+1):
        # the mid-ranks within each category (R_ik^(k))
        df['rank_within'] = df.groupby('cat')['score'].rank()
        # the placement scores (R_ik^*)
        df['placement'] = df['rank'] - df['rank_within']
        # the unbiased variances for each category (s_Rk*^2)
        unbVarRast = df.groupby('cat')['placement'].var(ddof=1)    
        # estimated variance for each category (sigma_k^2)
        vark = unbVarRast / (N - nk)**2
        # the overall estimated variance
        varN = N*sum(vark/nk)
        # the mean mid-rank for each category (based on pooled ranks)
        barRk = df.groupby('cat')['rank'].mean()
        # the test statistic
        W = (barRk.iloc[1] - barRk.iloc[0])/(N*varN)**0.5
        Ws.append(W)
        
        # shuffle data
        df['cat'] = df['cat'].sample(frac=1).values
        
    
    n_above = sum(i > Ws[0] for i in Ws[1:n_iter+1])
    n_below = sum(i < Ws[0] for i in Ws[1:n_iter+1])
    p_appr = 2*min(n_above, n_below)/(n_iter)
    results = {'iters':n_iter, 'observed':Ws[0], 'n above':n_above, 'n below':n_below, 'p-appr.':p_appr}
    results_pd = pd.DataFrame([results])
    return results_pd

Functions

def ts_brunner_munzel(bin_field, ord_field, categories=None, levels=None, distribution='t')

Brunner-Munzel Test

The Brunner-Munzel test.

The Brunner-Munzel test, tests so-called stochastic dominance. It tests if you pick a random score from the first category, will there be a higher (or lower) chance than 50% it will be higher than a random score from the other category. Note this is not the same as a median test. A very short example.

If we have the scores 40, 50 and 60 in group A, and the scores 30, 50, and 51 in group B, the median of each group is 50. However if we pick 40 from group A, there is a 1/3 chance it is higher than a value from group B. If we pick 50 there is also a 1/3 chance, and if we pick 60 there is a 3/3 chance. Together, there is a (1/3+1/3+3/3)/3 = 5/9 chance that if you pick a random value from group A and B, that the one in group A is higher. This is quite off from the 50%.

The test could therefor be used if we have a binary and an ordinal variable. The Mann-Whitney U test is probably the most famous one used in these scenarios, but actually has quite some limitations, and this Brunner-Munzel test might be better suited in many cases. The Fligner-Policello test is another test that is sometimes used.

Brunner and Munzel (p. 21) indicate the test-statistic that is computed follows a standard normal distribution, if each category has 50 or more data points. They also remark (p. 22) that the test is no longer accurate if sample sizes are less than 10, although in Schüürhuis et al. (2025, p. 18) 15 is listed. Neubert and Brunner (2007) propose to use a studentized permutation test in these cases. Schüürhuis et al. (2025) developed an improved version of this test as well, called a C^2 test.

This function is shown in this YouTube video and the test is also described at PeterStatistics.com.

Parameters

bin_field : pandas series
data with categories for the rows
ord_field : pandas series
data with the scores (ordinal field)
categories : list or dictionary, optional
the two categories to use from bin_field. If not set the first two found will be used
levels : list or dictionary, optional
the scores in order
distribution : {"t", "z"}, optional
the use of the student t-distribution or the standard normal z-distribution.

Returns

A dataframe with:
 
  • var. est., the estimated overall variance
  • min n, the sample size of the category with the lowest count
  • test-statistic, the Brunner-Munzel statistic
  • df, the degrees of freedom
  • p-value, the significance (p-value), two-tailed
  • categories, the two categories that were used
  • test used, description of the test used and distribution.

Notes

The test-statistic is calculated using (Brunner & Munzel, 2000, p. 21, eq. 4.8): W_n^{BM} = \frac{\bar{R}_2 - \bar{R}_1}{\sqrt{N \times \hat{\sigma}_N^2}}

with degrees of freedom (Brunner & Munzel, 2000, p. 22): df = \frac{\hat{\sigma}_N^4}{N^2 \times \sum_{k=1}^2 \frac{\hat{\sigma}_k^4}{\left(n_k - 1\right)\times n_k^2}}

The total estimated variance (Brunner & Munzel, 2000, p. 21, eq. 4.7): \hat{\sigma}_N^2 = N\times\left(\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}\right)

The estimated variance for each category (Brunner & Munzel, 2000, p. 20, eq. 4.6): \hat{\sigma}_k^2 = \frac{s_{R_k^*}^2}{\left(N - n_k\right)^2}

the sample variance of placement values (Brunner & Munzel, 2000, p. 20, eq. 4.5): s_{R_k^*}^2 = \frac{\sum_{i=1}^{n_k} \left(R_{ik}^* - \bar{R}_{k}^*\right)^2}{n_k - 1}

mean of the placement values: \bar{R}_{k}^* = \frac{\sum_{i=1}^{n_k} R_{ik}^*}{n_k}

the placement values: R_{ik}^* = R_{ik} - R_{ik}^{(k)}

mean of the pooled ranks for each category: \bar{R}_k = \frac{\sum_{i=1}^{n_k} R_{ik}}{n_k}

Symbols used:

  • R_{ik}, the rank of the i-th score in category k, when using all combined scores
  • R_{ik}^{(k)}, the rank of the i-th score in category k, when using only scores from category k
  • N, the total sample size
  • n_{k}, the number of scores in category k

Before, After and Alternatives

Before running the test you might first want to get an impression using a cross table and/or some visualisation:

After the test you might want an effect size measure:

Alternatives for this test could be:

References

Brunner, E., & Munzel, U. (2000). The nonparametric Behrens-Fisher problem: Asymptotic theory and a small-sample approximation. Biometrical Journal, 42(1), 17–25. https://doi.org/10.1002/(SICI)1521-4036(200001)42:1<17::AID-BIMJ17>3.0.CO;2-U

Neubert, K., & Brunner, E. (2007). A studentized permutation test for the non-parametric Behrens–Fisher problem. Computational Statistics & Data Analysis, 51(10), 5192–5204. https://doi.org/10.1016/j.csda.2006.05.024

Schüürhuis, S., Konietschke, F., & Brunner, E. (2025). A new approach to the Nonparametric Behrens-Fisher problem with compatible confidence intervals (arXiv:2504.01796). arXiv. https://doi.org/10.48550/arXiv.2504.01796

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples

>>> pd.set_option('display.width',1000)
>>> file1 = "https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv"
>>> df1 = pd.read_csv(file1, sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> coding = {'Far too little': 1, 'too little': 2, 'Enough': 3, 'Too much': 4, 'Far too much': 5}
>>> ts_brunner_munzel(df1['Gen_Gender'], df1['Mix_NrAct'], levels=coding)
   var. est.  min n  test statistic         df   p-value    categories                           test used
0   0.156508     11        4.465838  36.980916  0.000073  Male, Female  Brunner-Munzel with t distribution
Expand source code
def ts_brunner_munzel(bin_field, ord_field, categories=None, levels=None, distribution='t'):
    '''
    Brunner-Munzel Test
    --------------------------------------------
    The Brunner-Munzel test.
    
    The Brunner-Munzel test, tests so-called stochastic dominance. It tests if you pick a random score from the first category, will there be a higher (or lower) chance than 50% it will be higher than a random score from the other category. Note this is not the same as a median test. A very short example.
    
    If we have the scores 40, 50 and 60 in group A, and the scores 30, 50, and 51 in group B, the median of each group is 50. However if we pick 40 from group A, there is a 1/3 chance it is higher than a value from group B. If we pick 50 there is also a 1/3 chance, and if we pick 60 there is a 3/3 chance. Together, there is a (1/3+1/3+3/3)/3 = 5/9 chance that if you pick a random value from group A and B, that the one in group A is higher. This is quite off from the 50%.

    The test could therefor be used if we have a binary and an ordinal variable. The Mann-Whitney U test is probably the most famous one used in these scenarios, but actually has quite some limitations, and this Brunner-Munzel test might be better suited in many cases. The Fligner-Policello test is another test that is sometimes used.
    
    Brunner and Munzel (p. 21) indicate the test-statistic that is computed follows a standard normal distribution, if each category has 50 or more data points. They also remark (p. 22) that the test is no longer accurate if sample sizes are less than 10, although in Schüürhuis et al. (2025, p. 18) 15 is listed. Neubert and Brunner (2007) propose to use a studentized permutation test in these cases. Schüürhuis et al. (2025) developed an improved version of this test as well, called a \(C^2\) test.

    This function is shown in this [YouTube video](https://youtu.be/qwugQCbXtnM) and the test is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Tests/BrunnerMunzelTest.html).
    
    Parameters
    ----------
    bin_field : pandas series
        data with categories for the rows
    ord_field : pandas series
        data with the scores (ordinal field)
    categories : list or dictionary, optional
        the two categories to use from bin_field. If not set the first two found will be used
    levels : list or dictionary, optional
        the scores in order
    distribution : {"t", "z"}, optional
        the use of the student t-distribution or the standard normal z-distribution.

    Returns
    -------
    A dataframe with:
    
    * *var. est.*, the estimated overall variance
    * *min n*, the sample size of the category with the lowest count
    * *test-statistic*, the Brunner-Munzel statistic
    * *df*, the degrees of freedom
    * *p-value*, the significance (p-value), two-tailed
    * *categories*, the two categories that were used
    * *test used*, description of the test used and distribution.
    
    Notes
    -----
    The test-statistic is calculated using (Brunner & Munzel, 2000, p. 21, eq. 4.8):
    $$W_n^{BM} = \\frac{\\bar{R}_2 - \\bar{R}_1}{\\sqrt{N \\times \\hat{\\sigma}_N^2}}$$
    
    with degrees of freedom (Brunner & Munzel, 2000, p. 22):
    $$df = \\frac{\\hat{\\sigma}_N^4}{N^2 \\times \\sum_{k=1}^2 \\frac{\\hat{\\sigma}_k^4}{\\left(n_k - 1\\right)\\times n_k^2}}$$

    The total estimated variance (Brunner & Munzel, 2000, p. 21, eq. 4.7):
    $$\\hat{\\sigma}_N^2 = N\\times\\left(\\frac{\\hat{\\sigma}_1^2}{n_1} + \\frac{\\hat{\\sigma}_2^2}{n_2}\\right)$$
    
    The estimated variance for each category (Brunner & Munzel, 2000, p. 20, eq. 4.6):
    $$\\hat{\\sigma}_k^2 = \\frac{s_{R_k^*}^2}{\\left(N - n_k\\right)^2}$$
    
    the sample variance of placement values (Brunner & Munzel, 2000, p. 20, eq. 4.5):
    $$s_{R_k^*}^2 = \\frac{\\sum_{i=1}^{n_k} \\left(R_{ik}^* - \\bar{R}_{k}^*\\right)^2}{n_k - 1}$$

    mean of the placement values:
    $$\\bar{R}_{k}^* = \\frac{\\sum_{i=1}^{n_k} R_{ik}^*}{n_k}$$

    the placement values:
    $$R_{ik}^* = R_{ik} - R_{ik}^{(k)}$$

    mean of the pooled ranks for each category:
    $$\\bar{R}_k = \\frac{\\sum_{i=1}^{n_k} R_{ik}}{n_k}$$

    *Symbols used:*
    
    * \\(R_{ik}\\), the rank of the i-th score in category k, when using all combined scores
    * \\(R_{ik}^{(k)}\\), the rank of the i-th score in category k, when using only scores from category k
    * \\(N\\), the total sample size
    * \\(n_{k}\\), the number of scores in category k

    Before, After and Alternatives
    ------------------------------
    Before running the test you might first want to get an impression using a cross table and/or some visualisation:
    
    * [tab_cross](../other/table_cross.html#tab_cross)
    * [vi_bar_stacked_multiple](../visualisations/vis_bar_stacked_multiple.html#vi_bar_stacked_multiple)

    After the test you might want an effect size measure:
    
    * [es_common_language_is](../effect_sizes/eff_size_common_language_is.html#es_common_language_is) for Common Language Effect Size
    * [me_hodges_lehmann_is](../measures/meas_hodges_lehmann_is.html#me_hodges_lehmann_is) for Hodges-Lehmann estimate
    * [r_rank_biserial_is](../correlations/cor_rank_biserial_is.html#r_rank_biserial_is) for (Glass) Rank Biserial (Cliff delta)

    Alternatives for this test could be:
    
    * [ts_mann_whitney](../tests/test_mann_whitney.html#ts_mann_whitney) for the Mann-Whitney U test
    * [ts_fligner_policello](../tests/test_fligner_policello.html#ts_fligner_policello) for the Fligner-Policello test
    * [ts_brunner_munzel_perm](../tests/test_brunner_munzel.html#ts_brunner_munzel_perm) for the Brunner-Munzel Permutation test
    * [ts_c_square](../tests/test_c_square.html#ts_c_square) for the \\(C^2\\) test

    References
    ----------
    Brunner, E., & Munzel, U. (2000). The nonparametric Behrens-Fisher problem: Asymptotic theory and a small-sample approximation. *Biometrical Journal, 42*(1), 17–25. https://doi.org/10.1002/(SICI)1521-4036(200001)42:1<17::AID-BIMJ17>3.0.CO;2-U

    Neubert, K., & Brunner, E. (2007). A studentized permutation test for the non-parametric Behrens–Fisher problem. *Computational Statistics & Data Analysis, 51*(10), 5192–5204. https://doi.org/10.1016/j.csda.2006.05.024
    
    Schüürhuis, S., Konietschke, F., & Brunner, E. (2025). A new approach to the Nonparametric Behrens-Fisher problem with compatible confidence intervals (arXiv:2504.01796). arXiv. https://doi.org/10.48550/arXiv.2504.01796

    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076

    Examples
    --------
    >>> pd.set_option('display.width',1000)
    >>> file1 = "https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv"
    >>> df1 = pd.read_csv(file1, sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> coding = {'Far too little': 1, 'too little': 2, 'Enough': 3, 'Too much': 4, 'Far too much': 5}
    >>> ts_brunner_munzel(df1['Gen_Gender'], df1['Mix_NrAct'], levels=coding)
       var. est.  min n  test statistic         df   p-value    categories                           test used
    0   0.156508     11        4.465838  36.980916  0.000073  Male, Female  Brunner-Munzel with t distribution
    
    '''
    # DATA PREPARATION
    #convert to pandas series if needed
    if type(bin_field) is list:
        bin_field = pd.Series(bin_field)
    
    if type(ord_field) is list:
        ord_field = pd.Series(ord_field)
    
    #combine as one dataframe
    df = pd.concat([bin_field, ord_field], axis=1)
    df = df.dropna()
    df.columns=['cat', 'score']
    
    #replace the ordinal values if levels is provided
    if levels is not None:
        df['score'] = df['score'].map(levels).astype('Int8')
    else:
        df.iloc[:,1]  = pd.to_numeric(df.iloc[:,1] )
    
    #the two categories
    if categories is not None:
        cat1 = categories[0]
        cat2 = categories[1]
    else:
        cat1 = df.iloc[:,0].value_counts().index[0]
        cat2 = df.iloc[:,0].value_counts().index[1]
    # remove scores not in either of the two categories
    df = df[df['cat'].isin([cat1, cat2])]

    # RANKING
    # get the mid-ranks of the pooled scores (R_ik)
    df['rank'] = df['score'].rank()
    # the mid-ranks within each category (R_ik^(k))
    df['rank_within'] = df.groupby('cat')['score'].rank()
    # the placement scores (R_ik^*)
    df['placement'] = df['rank'] - df['rank_within']

    # TEST
    # the unbiased variances for each category (s_Rk*^2)
    unbVarRast = df.groupby('cat')['placement'].var(ddof=1)
    # the counts
    nk = df['cat'].value_counts()
    # the overall sample size
    N = sum(nk)
    # estimated variance for each category (sigma_k^2)
    vark = unbVarRast / (N - nk)**2
    # the overall estimated variance
    varN = N*sum(vark/nk)
    # the mean mid-rank for each category (based on pooled ranks)
    barRk = df.groupby('cat')['rank'].mean()
    # the test statistic
    W = (barRk.iloc[1] - barRk.iloc[0])/(N*varN)**0.5
    
    # minimum n
    min_n = min(nk)
    if distribution == 't':
        if min_n < 10:
            test_used = 'Brunner-Munzel with t distribution (not recommended)'
        else:
            test_used = 'Brunner-Munzel with t distribution'
            
        # the degrees of freedom
        deg_fr = varN**2/(N**2*sum(vark**2/((nk - 1)*nk**2)))

        # the two-tailed p-value
        p_val = 2*(1 - t.cdf(abs(W), deg_fr))
        
    elif distribution == 'z':
        if min_n < 50:
            test_used = 'Brunner-Munzel with standard normal distribution (not recommended)'
        else:
            test_used = 'Brunner-Munzel with standard normal distribution'
        
        # the two-tailed p-value    
        p_val = 2*(1 - norm.cdf(abs(W)))
        deg_fr = 'n.a.'

    results = {'var. est.':varN, 'min n':min_n, 'test statistic':W, 'df':deg_fr, 'p-value':p_val, 'categories':cat1 + ', ' + cat2, 'test used':test_used}
    results_pd = pd.DataFrame([results])
    return results_pd
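As a standalone sketch (pandas only, mirroring the computation in the source above), the W statistic and degrees of freedom can be re-derived for the small example from the description: scores 40, 50 and 60 in group A versus 30, 50 and 51 in group B.

```python
import pandas as pd

# the small example from the description: group A has 40, 50, 60; group B has 30, 50, 51
df = pd.DataFrame({'cat':   ['A', 'A', 'A', 'B', 'B', 'B'],
                   'score': [40, 50, 60, 30, 50, 51]})

# mid-ranks of the pooled scores (R_ik)
df['rank'] = df['score'].rank()
# mid-ranks within each category (R_ik^(k))
df['rank_within'] = df.groupby('cat')['score'].rank()
# placement scores (R_ik^*)
df['placement'] = df['rank'] - df['rank_within']

nk = df['cat'].value_counts()
N = nk.sum()
# estimated variance per category (sigma_k^2), from the unbiased placement variances
vark = df.groupby('cat')['placement'].var(ddof=1) / (N - nk)**2
# the overall estimated variance
varN = N * (vark / nk).sum()
# mean pooled mid-rank per category, and the test statistic
barRk = df.groupby('cat')['rank'].mean()
W = (barRk.iloc[1] - barRk.iloc[0]) / (N * varN)**0.5
# degrees of freedom for the t distribution
deg_fr = varN**2 / (N**2 * (vark**2 / ((nk - 1) * nk**2)).sum())
print(round(W, 3), round(deg_fr, 1))  # -0.392 4.0
```

For this balanced example the two placement variances happen to be equal, which makes the degrees of freedom come out to exactly 4.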
def ts_brunner_munzel_perm(bin_field, ord_field, categories=None, levels=None, n_iter=1000)

Brunner-Munzel Permutation Test

This function performs a studentized permutation test of the Brunner-Munzel test. It is recommended over the regular version when sample sizes are small (e.g. fewer than 10, or 15).

Parameters

bin_field : pandas series
data with categories for the rows
ord_field : pandas series
data with the scores (ordinal field)
categories : list or dictionary, optional
the two categories to use from bin_field. If not set the first two found will be used
levels : list or dictionary, optional
the scores in order
n_iter : integer, optional
number of iterations.

Returns

A dataframe with:
 
  • iters, the number of iterations
  • observed, the observed Brunner-Munzel statistic
  • n above, the number of permutations that had a value above, but not equal to, the observed statistic
  • n below, the number of permutations that had a value below, but not equal to, the observed statistic
  • p-appr., the significance (p-value), two-tailed

Notes

The idea for a studentized permutation test was proposed by Neubert and Brunner (2007). The regular Brunner-Munzel test statistic is calculated, then the category labels are shuffled and the statistic is re-calculated. This is repeated for the requested number of iterations.

The minimum of the number of permutations above vs. below the observed statistic is multiplied by two and then divided by the number of iterations, to obtain the p-value (Schüürhuis et al., 2025, p. 7).

See ts_brunner_munzel for more details.
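The shuffle-and-recompute scheme described above can be sketched standalone. The snippet below is an illustration, not the package's implementation: the helper bm_stat, the example data, and the seed are all made up for this sketch, and a guard is added for degenerate shuffles.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def bm_stat(df):
    # studentized Brunner-Munzel statistic (hypothetical helper for this sketch)
    df = df.copy()
    df['rank'] = df['score'].rank()
    df['rank_within'] = df.groupby('cat')['score'].rank()
    df['placement'] = df['rank'] - df['rank_within']
    nk = df['cat'].value_counts()
    N = nk.sum()
    vark = df.groupby('cat')['placement'].var(ddof=1) / (N - nk)**2
    varN = N * (vark / nk).sum()
    barRk = df.groupby('cat')['rank'].mean()
    if varN == 0:
        # degenerate shuffle (e.g. complete separation): treat as +/- infinity
        return np.inf if barRk.iloc[1] > barRk.iloc[0] else -np.inf
    return (barRk.iloc[1] - barRk.iloc[0]) / (N * varN)**0.5

# made-up example data: two groups of five scores
df = pd.DataFrame({'cat':   ['A']*5 + ['B']*5,
                   'score': [40, 50, 60, 45, 55, 30, 50, 51, 35, 42]})

observed = bm_stat(df)
n_iter = 200
perm_stats = []
for _ in range(n_iter):
    shuffled = df.copy()
    # shuffle the category labels, keeping the scores in place
    shuffled['cat'] = rng.permutation(df['cat'].values)
    perm_stats.append(bm_stat(shuffled))

n_above = sum(w > observed for w in perm_stats)
n_below = sum(w < observed for w in perm_stats)
# two-tailed approximate p-value (Schüürhuis et al., 2025, p. 7)
p_appr = 2 * min(n_above, n_below) / n_iter
```

With a small n_iter such as 200 the p-value is only a rough approximation; in practice more iterations give a more stable result.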

Before, After and Alternatives

Before running the test you might first want to get an impression using a cross table and/or some visualisation:

  • tab_cross
  • vi_bar_stacked_multiple

After the test you might want an effect size measure:

  • es_common_language_is for Common Language Effect Size
  • es_hodges_lehmann_is for Hodges-Lehmann
  • r_rank_biserial_is for (Glass) Rank Biserial (Cliff delta)

Alternatives for this test could be:

  • ts_mann_whitney for the Mann-Whitney U test
  • ts_fligner_policello for the Fligner-Policello test

References

Neubert, K., & Brunner, E. (2007). A studentized permutation test for the non-parametric Behrens–Fisher problem. Computational Statistics & Data Analysis, 51(10), 5192–5204. https://doi.org/10.1016/j.csda.2006.05.024

Schüürhuis, S., Konietschke, F., & Brunner, E. (2025). A new approach to the Nonparametric Behrens-Fisher problem with compatible confidence intervals (arXiv:2504.01796). arXiv. https://doi.org/10.48550/arXiv.2504.01796

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Expand source code
def ts_brunner_munzel_perm(bin_field, ord_field, categories=None, levels=None, n_iter=1000):
    '''
    Brunner-Munzel Permutation Test
    --------------------------------------------
    This function performs a studentized permutation test of the Brunner-Munzel test. It is recommended over the regular version when sample sizes are small (e.g. fewer than 10, or 15).
    
    Parameters
    ----------
    bin_field : pandas series
        data with categories for the rows
    ord_field : pandas series
        data with the scores (ordinal field)
    categories : list or dictionary, optional
        the two categories to use from bin_field. If not set the first two found will be used
    levels : list or dictionary, optional
        the scores in order
    n_iter : integer, optional
        number of iterations.

    Returns
    -------
    A dataframe with:
    
    * *iters*, the number of iterations
    * *observed*, the observed Brunner-Munzel statistic
    * *n above*, the number of permutations that had a value above, but not equal to, the observed statistic
    * *n below*, the number of permutations that had a value below, but not equal to, the observed statistic
    * *p-appr.*, the significance (p-value), two-tailed
    
    Notes
    -----
    The idea for a studentized permutation test was proposed by Neubert and Brunner (2007). The regular Brunner-Munzel test statistic is calculated, then the category labels are shuffled and the statistic is re-calculated. This is repeated for the requested number of iterations.
    
    The minimum of the number of permutations above vs. below the observed statistic is multiplied by two and then divided by the number of iterations, to obtain the p-value (Schüürhuis et al., 2025, p. 7).
    
    See ts_brunner_munzel for more details.
    
    Before, After and Alternatives
    ------------------------------
    Before running the test you might first want to get an impression using a cross table and/or some visualisation:
    * [tab_cross](../other/table_cross.html#tab_cross)
    * [vi_bar_stacked_multiple](../visualisations/vis_bar_stacked_multiple.html#vi_bar_stacked_multiple)

    After the test you might want an effect size measure:
    * [es_common_language_is](../effect_sizes/eff_size_common_language_is.html#es_common_language_is) for Common Language Effect Size
    * [es_hodges_lehmann_is](../effect_sizes/eff_size_hodges_lehmann_is.html#es_hodges_lehmann_is) for Hodges-Lehmann
    * [r_rank_biserial_is](../correlations/cor_rank_biserial_is.html#r_rank_biserial_is) for (Glass) Rank Biserial (Cliff delta)

    Alternatives for this test could be:
    * [ts_mann_whitney](../tests/test_mann_whitney.html#ts_mann_whitney) for the Mann-Whitney U test
    * [ts_fligner_policello](../tests/test_fligner_policello.html#ts_fligner_policello) for the Fligner-Policello test

    References
    ----------
    Neubert, K., & Brunner, E. (2007). A studentized permutation test for the non-parametric Behrens–Fisher problem. *Computational Statistics & Data Analysis, 51*(10), 5192–5204. https://doi.org/10.1016/j.csda.2006.05.024

    Schüürhuis, S., Konietschke, F., & Brunner, E. (2025). A new approach to the Nonparametric Behrens-Fisher problem with compatible confidence intervals (arXiv:2504.01796). arXiv. https://doi.org/10.48550/arXiv.2504.01796
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    '''
    # DATA PREPARATION
    #convert to pandas series if needed
    if type(bin_field) is list:
        bin_field = pd.Series(bin_field)
    
    if type(ord_field) is list:
        ord_field = pd.Series(ord_field)
    
    #combine as one dataframe
    df = pd.concat([bin_field, ord_field], axis=1)
    df = df.dropna()
    df.columns=['cat', 'score']
    
    #replace the ordinal values if levels is provided
    if levels is not None:
        df['score'] = df['score'].map(levels).astype('Int8')
    else:
        df.iloc[:,1]  = pd.to_numeric(df.iloc[:,1] )
    
    #the two categories
    if categories is not None:
        cat1 = categories[0]
        cat2 = categories[1]
    else:
        cat1 = df.iloc[:,0].value_counts().index[0]
        cat2 = df.iloc[:,0].value_counts().index[1]
    # remove scores not in either of the two categories
    df = df[df['cat'].isin([cat1, cat2])]

    # get the mid-ranks of the pooled scores (R_ik)
    df['rank'] = df['score'].rank()
   
    # the counts
    nk = df['cat'].value_counts()
    # the overall sample size
    N = sum(nk)

    # the first pass (with the original labels) yields the observed statistic;
    # the remaining n_iter passes use shuffled category labels
    Ws = []
    for i in range(n_iter+1):
        # the mid-ranks within each category (R_ik^(k))
        df['rank_within'] = df.groupby('cat')['score'].rank()
        # the placement scores (R_ik^*)
        df['placement'] = df['rank'] - df['rank_within']
        # the unbiased variances for each category (s_Rk*^2)
        unbVarRast = df.groupby('cat')['placement'].var(ddof=1)    
        # estimated variance for each category (sigma_k^2)
        vark = unbVarRast / (N - nk)**2
        # the overall estimated variance
        varN = N*sum(vark/nk)
        # the mean mid-rank for each category (based on pooled ranks)
        barRk = df.groupby('cat')['rank'].mean()
        # the test statistic
        W = (barRk.iloc[1] - barRk.iloc[0])/(N*varN)**0.5
        Ws.append(W)
        
        # shuffle data
        df['cat'] = df['cat'].sample(frac=1).values
        
    
    n_above = sum(i > Ws[0] for i in Ws[1:n_iter+1])
    n_below = sum(i < Ws[0] for i in Ws[1:n_iter+1])
    p_appr = 2*min(n_above, n_below)/(n_iter)
    results = {'iters':n_iter, 'observed':Ws[0], 'n above':n_above, 'n below':n_below, 'p-appr.':p_appr}
    results_pd = pd.DataFrame([results])
    return results_pd