Module stikpetP.tests.test_mood_median
import pandas as pd
from ..other.table_cross import tab_cross
from ..tests.test_fisher import ts_fisher
from ..tests.test_freeman_tukey_ind import ts_freeman_tukey_ind
from ..tests.test_g_ind import ts_g_ind
from ..tests.test_mod_log_likelihood_ind import ts_mod_log_likelihood_ind
from ..tests.test_neyman_ind import ts_neyman_ind
from ..tests.test_pearson_ind import ts_pearson_ind
from ..tests.test_powerdivergence_ind import ts_powerdivergence_ind
def ts_mood_median(catField, ordField, categories=None, levels=None, test="pearson", cc=None, lambd=2/3):
'''
Mood Median Test
----------------
This test checks whether the medians of different categories would be the same in the population. If not, at least one category differs from at least one other. A Kruskal-Wallis test (see ts_kruksal_wallis()) is very similar, but compares average ranks instead of medians.
The test only looks at the number of scores above the overall median and the number equal to or below it. A cross table is made with each category and the counts above and equal-or-below the overall median. From this table a test of independence can be used.
Parameters
----------
catField : pandas series
data with categories
ordField : pandas series
data with the scores
categories : list or dictionary, optional
the categories to use from catField
levels : list or dictionary, optional
the levels or order used in ordField.
test : {"pearson", "fisher", "freeman-tukey", "g", "mod-log", "neyman", "power"}, optional
the test of independence to use. Default is "pearson".
cc : {None, "yates", "pearson", "williams"}, optional
method for continuity correction
lambd : {float, "cressie-read", "likelihood-ratio", "mod-log", "pearson", "freeman-tukey", "neyman"}, optional
either the name of a test or a specific value. Default is "cressie-read", i.e. a lambda of 2/3. Only applies to the Power Divergence test.
Returns
-------
A dataframe with the results of the specified test.
Notes
-----
The Mood Median test creates a 2xk cross table, with k being the number of categories. The first row holds the number of scores in each category that are above the overall median, the second row the number of scores that are equal to or below it.
A chi-square test of independence can then be performed on this cross table. There are several options for this:
* "pearson", will perform a Pearson chi-square test of independence using the ts_pearson_ind() function.
* "fisher", will perform a Fisher exact test using the ts_fisher() function, but only if there are 2 categories, if there are more the test will be set to "pearson"
* "freeman-tukey", will perform a Freeman-Tukey test of independence using the ts_freeman_tukey_ind() function
* "g", will perform a G test of independence using the ts_g_ind() function
* "mod-log", will perform a Mod-Log Likelihood test of independence using the ts_mod_log_likelihood_ind() function
* "neyman", will perform a Neyman test of independence using the ts_neyman_ind() function
* "power", will perform a Power Divergence test of independence using the ts_powerdivergence_ind() function.
The formula using the default Pearson test is:
$$\\chi_{M}^2 = \\sum_{i=1}^2 \\sum_{j=1}^k \\frac{\\left(F_{i,j}-E_{i,j}\\right)^2}{E_{i,j}}$$
$$df = k - 1$$
$$sig. = 1 - \\chi^2\\left(\\chi_{M}^2, df\\right)$$
With:
$$E_{i,j} = \\frac{R_i \\times C_j}{n}$$
$$R_i = \\sum_{j=1}^k F_{i,j}$$
$$C_j = \\sum_{i=1}^2 F_{i,j}$$
$$n = \\sum_{i=1}^2 \\sum_{j=1}^k F_{i,j} = \\sum_{i=1}^2 R_i = \\sum_{j=1}^k C_j$$
The original source for the formula is most likely Mood (1950), but the ones shown are based on Brown and Mood (1951).
*Symbols used:*
* \\(k\\), the number of categories (columns)
* \\(F_{1,j}\\), the number of scores in category j that are above the overall median
* \\(F_{2,j}\\), the number of scores in category j that are equal to or below the overall median
* \\(E_{i,j}\\), the expected count in row i and column j.
* \\(R_i\\), the row total of row i
* \\(C_j\\), the column total of column j
* \\(n\\), the overall total.
* \\(df\\), the degrees of freedom
* \\(\\chi^2\\left(\\dots\\right)\\), the cumulative distribution function of the chi-square distribution.
References
----------
Brown, G. W., & Mood, A. M. (1951). On median tests for linear hypotheses. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 2, 159–167.
Mood, A. M. (1950). *Introduction to the theory of statistics*. McGraw-Hill.
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
'''
    if lambd is None:
        lambd = 2/3

    #create the cross table
    ct = tab_cross(ordField, catField, order1=levels, order2=categories, totals="include")

    #basic counts
    k = ct.shape[1] - 1
    nlvl = ct.shape[0] - 1
    n = ct.iloc[nlvl, k]

    #the overall median
    #note that this will not determine an exact in-between value for the median,
    #but that's okay since we only care whether the original values are above
    #the median, or equal to or below it
    medIndex = int((n + 1) / 2)
    cf = ct.iloc[0, k]
    med = 1
    while cf < medIndex:
        med = med + 1
        cf = cf + ct.iloc[med - 1, k]

    #observed counts below and above the overall median
    obs = pd.DataFrame()
    nbtot = 0
    natot = 0
    for j in range(0, k):
        nbelow = 0
        i = 1
        while i <= med:
            nbelow = nbelow + ct.iloc[i - 1, j]
            i = i + 1
        obs.at[0, j] = nbelow
        nbtot = nbtot + nbelow
        natot = natot + ct.iloc[nlvl, j] - nbelow
        obs.at[1, j] = ct.iloc[nlvl, j] - nbelow

    #convert the 2xk counts back to paired data series for the independence tests
    catArr = pd.Series(dtype="object")
    ordArr = pd.Series(dtype="object")
    arrRow = 0
    for j in range(0, k):
        for i in range(0, 2):
            for sc in range(0, int(obs.loc[i, j])):
                catArr.at[arrRow] = ct.columns[j]
                ordArr.at[arrRow] = i + 1
                arrRow = arrRow + 1

    #now for the test
    if test == "fisher":
        if k > 2:
            #the Fisher exact test only supports 2 categories, fall back to Pearson
            test = "pearson"
        else:
            res = ts_fisher(catArr, ordArr)

    if test == "freeman-tukey":
        res = ts_freeman_tukey_ind(catArr, ordArr, cc=cc)
    elif test == "g":
        res = ts_g_ind(catArr, ordArr, cc=cc)
    elif test == "mod-log":
        res = ts_mod_log_likelihood_ind(catArr, ordArr, cc=cc)
    elif test == "neyman":
        res = ts_neyman_ind(catArr, ordArr, cc=cc)
    elif test == "pearson":
        res = ts_pearson_ind(catArr, ordArr, cc=cc)
    elif test == "power":
        res = ts_powerdivergence_ind(catArr, ordArr, cc=cc, lambd=lambd)

    return res
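The overall-median step above walks the cumulative frequencies in the totals column of the cross table until it reaches the median position int((n + 1) / 2). A minimal standalone sketch of that search, using hypothetical level counts (not from the source):

```python
# Sketch of the cumulative-frequency median search: given counts per ordered
# level, find the first (1-based) level whose cumulative frequency reaches
# the median position int((n + 1) / 2).
level_totals = [3, 5, 4, 2]   # hypothetical: 14 scores over 4 ordered levels
n = sum(level_totals)
med_index = int((n + 1) / 2)  # position 7 of 14
cf = level_totals[0]
med = 1                       # 1-based index of the level holding the median
while cf < med_index:
    med = med + 1
    cf = cf + level_totals[med - 1]
print(med)                    # the median falls in the second level (3 + 5 >= 7)
```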
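With the default settings, the whole procedure reduces to a Pearson chi-square test on the 2xk table of above vs. equal-or-below counts, following the formulas in the notes. A self-contained sketch of that computation on hypothetical data for two groups; for k = 2 the table has df = 1, so the p-value can be obtained as erfc(sqrt(chi2/2)) with the standard library alone:

```python
import math

def mood_median_sketch(groups):
    """groups: dict mapping category name -> list of scores."""
    scores = sorted(s for vals in groups.values() for s in vals)
    n = len(scores)
    # overall median (midpoint of the two middle values for even n)
    if n % 2 == 1:
        med = scores[n // 2]
    else:
        med = (scores[n // 2 - 1] + scores[n // 2]) / 2
    # 2xk table: row 0 = above the median, row 1 = equal to or below it
    table = {cat: [sum(s > med for s in vals), sum(s <= med for s in vals)]
             for cat, vals in groups.items()}
    row_tot = [sum(col[i] for col in table.values()) for i in range(2)]
    chi2 = 0.0
    for col in table.values():
        c_j = sum(col)                  # column total C_j
        for i in range(2):
            e = row_tot[i] * c_j / n    # E_ij = R_i * C_j / n
            chi2 += (col[i] - e) ** 2 / e
    return chi2, len(table) - 1         # statistic and df = k - 1

# hypothetical data for two groups
chi2, df = mood_median_sketch({"A": [1, 2, 3, 4, 10], "B": [5, 6, 7, 8, 9]})
p = math.erfc(math.sqrt(chi2 / 2))      # chi-square survival fn, df = 1 only
print(chi2, df, p)
```

Here the overall median is 5.5, giving the table [[1, 4], [4, 1]], a statistic of 3.6 on 1 degree of freedom, and a p-value of about 0.058.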