Module stikpetP.tests.test_mood_median
import pandas as pd
from ..other.table_cross import tab_cross
from ..tests.test_fisher import ts_fisher
from ..tests.test_freeman_tukey_ind import ts_freeman_tukey_ind
from ..tests.test_g_ind import ts_g_ind
from ..tests.test_mod_log_likelihood_ind import ts_mod_log_likelihood_ind
from ..tests.test_neyman_ind import ts_neyman_ind
from ..tests.test_pearson_ind import ts_pearson_ind
from ..tests.test_powerdivergence_ind import ts_powerdivergence_ind
def ts_mood_median(catField, ordField, categories=None, levels=None, test="pearson", cc=None, lambd=2/3):
'''
Mood Median Test
----------------
This test checks whether the medians of different categories would be the same in the population. If not, at least one category differs from at least one other. A Kruskal-Wallis test (see ts_kruksal_wallis()) is very similar, but compares average ranks instead of medians.
The test only looks at the number of scores above the overall median and the number equal to or below it. A cross table is made with each category and the counts above and equal-or-below the overall median. From this table a test of independence can be used.
Parameters
----------
catField : pandas series
data with categories
ordField : pandas series
data with the scores
categories : list or dictionary, optional
the categories to use from catField
levels : list or dictionary, optional
the levels or order used in ordField.
test : {"pearson", "fisher", "freeman-tukey", "g", "mod-log", "neyman", "power"}, optional
the test of independence to use. Default is "pearson".
cc : {None, "yates", "pearson", "williams"}, optional
method for continuity correction
lambd : {float, "cressie-read", "likelihood-ratio", "mod-log", "pearson", "freeman-tukey", "neyman"}, optional
either the name of a test or a specific value. Default is "cressie-read", i.e. a lambda of 2/3. Only applies to the Power Divergence test.
Returns
-------
A dataframe with the results of the specified test.
Notes
-----
The Mood Median test creates a 2xk cross table, with k being the number of categories. The first row holds the number of scores in each category that are above the overall median, the second row the number of scores that are equal to or below it.
A chi-square test of independence can then be performed on this cross table. There are several options for this:
* "pearson", will perform a Pearson chi-square test of independence using the ts_pearson_ind() function.
* "fisher", will perform a Fisher exact test using the ts_fisher() function, but only if there are 2 categories, if there are more the test will be set to "pearson"
* "freeman-tukey", will perform a Freeman-Tukey test of independence using the ts_freeman_tukey_ind() function
* "g", will perform a G test of independence using the ts_g_ind() function
* "mod-log", will perform a Mod-Log Likelihood test of independence using the ts_mod_log_likelihood_ind() function
* "neyman", will perform a Neyman test of independence using the ts_neyman_ind() function
* "power", will perform a Power Divergence test of independence using the ts_powerdivergence_ind() function.
The formula using the default Pearson test is:
$$\\chi_{M}^2 = \\sum_{i=1}^2 \\sum_{j=1}^k \\frac{\\left(F_{i,j}-E_{i,j}\\right)^2}{E_{i,j}}$$
$$df = k - 1$$
$$sig. = 1 - \\chi^2\\left(\\chi_{M}^2, df\\right)$$
With:
$$E_{i,j} = \\frac{R_i \\times C_j}{n}$$
$$R_i = \\sum_{j=1}^k F_{i,j}$$
$$C_j = \\sum_{i=1}^2 F_{i,j}$$
$$n = \\sum_{i=1}^2 \\sum_{j=1}^k F_{i,j} = \\sum_{i=1}^2 R_i = \\sum_{j=1}^k C_j$$
The original source for the formula is most likely Mood (1950), but the ones shown are based on Brown and Mood (1951).
*Symbols used:*
* \\(k\\), the number of categories (columns)
* \\(F_{1,j}\\), the number of scores in category j that are above the overall median
* \\(F_{2,j}\\), the number of scores in category j that are equal to or below the overall median
* \\(E_{i,j}\\), the expected count in row i and column j.
* \\(R_i\\), the row total of row i
* \\(C_j\\), the column total of column j
* \\(n\\), the overall total.
* \\(df\\), the degrees of freedom
* \\(\\chi^2\\left(\\dots\\right)\\), the cumulative distribution function of the chi-square distribution.
References
----------
Brown, G. W., & Mood, A. M. (1951). On median tests for linear hypotheses. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 2, 159–167.
Mood, A. M. (1950). *Introduction to the theory of statistics*. McGraw-Hill.
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
'''
    if lambd is None:
        lambd = 2/3

    #create the cross table
    ct = tab_cross(ordField, catField, order1=levels, order2=categories, totals="include")

    #basic counts
    k = ct.shape[1] - 1
    nlvl = ct.shape[0] - 1
    n = ct.iloc[nlvl, k]

    #the overall median
    #note that this will not determine an exact in-between value for the median,
    #but that's okay since we only care whether the original values are above
    #the median, or equal to or below it
    medIndex = int((n + 1) / 2)
    cf = ct.iloc[0, k]
    med = 1
    while cf < medIndex:
        med = med + 1
        cf = cf + ct.iloc[med - 1, k]

    #observed counts below and above the overall median
    obs = pd.DataFrame()
    nbtot = 0
    natot = 0
    for j in range(0, k):
        nbelow = 0
        i = 1
        while i <= med:
            nbelow = nbelow + ct.iloc[i - 1, j]
            i = i + 1
        obs.at[0, j] = nbelow
        nbtot = nbtot + nbelow
        natot = natot + ct.iloc[nlvl, j] - nbelow
        obs.at[1, j] = ct.iloc[nlvl, j] - nbelow

    #convert the 2xk counts back to paired data series for the independence tests
    catArr = pd.Series(dtype="object")
    ordArr = pd.Series(dtype="object")
    arrRow = 0
    for j in range(0, k):
        for i in range(0, 2):
            for sc in range(0, int(obs.loc[i, j])):
                catArr.at[arrRow] = ct.columns[j]
                ordArr.at[arrRow] = i + 1
                arrRow = arrRow + 1

    #now for the test
    if test == "fisher":
        if k > 2:
            #the Fisher exact test only supports 2 categories, fall back to Pearson
            test = "pearson"
        else:
            res = ts_fisher(catArr, ordArr)

    if test == "freeman-tukey":
        res = ts_freeman_tukey_ind(catArr, ordArr, cc=cc)
    elif test == "g":
        res = ts_g_ind(catArr, ordArr, cc=cc)
    elif test == "mod-log":
        res = ts_mod_log_likelihood_ind(catArr, ordArr, cc=cc)
    elif test == "neyman":
        res = ts_neyman_ind(catArr, ordArr, cc=cc)
    elif test == "pearson":
        res = ts_pearson_ind(catArr, ordArr, cc=cc)
    elif test == "power":
        res = ts_powerdivergence_ind(catArr, ordArr, cc=cc, lambd=lambd)

    return res
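The overall-median step above walks the cumulative frequencies in the totals column of the cross table until it reaches the median position int((n + 1) / 2). A minimal standalone sketch of that search, using hypothetical level counts (not from the source):

```python
# Sketch of the cumulative-frequency median search: given counts per ordered
# level, find the first (1-based) level whose cumulative frequency reaches
# the median position int((n + 1) / 2).
level_totals = [3, 5, 4, 2]   # hypothetical: 14 scores over 4 ordered levels
n = sum(level_totals)
med_index = int((n + 1) / 2)  # position 7 of 14
cf = level_totals[0]
med = 1                       # 1-based index of the level holding the median
while cf < med_index:
    med = med + 1
    cf = cf + level_totals[med - 1]
print(med)                    # the median falls in the second level (3 + 5 >= 7)
```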
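With the default settings, the whole procedure reduces to a Pearson chi-square test on the 2xk table of above vs. equal-or-below counts, following the formulas in the notes. A self-contained sketch of that computation on hypothetical data for two groups; for k = 2 the table has df = 1, so the p-value can be obtained as erfc(sqrt(chi2/2)) with the standard library alone:

```python
import math

def mood_median_sketch(groups):
    """groups: dict mapping category name -> list of scores."""
    scores = sorted(s for vals in groups.values() for s in vals)
    n = len(scores)
    # overall median (midpoint of the two middle values for even n)
    if n % 2 == 1:
        med = scores[n // 2]
    else:
        med = (scores[n // 2 - 1] + scores[n // 2]) / 2
    # 2xk table: row 0 = above the median, row 1 = equal to or below it
    table = {cat: [sum(s > med for s in vals), sum(s <= med for s in vals)]
             for cat, vals in groups.items()}
    row_tot = [sum(col[i] for col in table.values()) for i in range(2)]
    chi2 = 0.0
    for col in table.values():
        c_j = sum(col)                  # column total C_j
        for i in range(2):
            e = row_tot[i] * c_j / n    # E_ij = R_i * C_j / n
            chi2 += (col[i] - e) ** 2 / e
    return chi2, len(table) - 1         # statistic and df = k - 1

# hypothetical data for two groups
chi2, df = mood_median_sketch({"A": [1, 2, 3, 4, 10], "B": [5, 6, 7, 8, 9]})
p = math.erfc(math.sqrt(chi2 / 2))      # chi-square survival fn, df = 1 only
print(chi2, df, p)
```

Here the overall median is 5.5, giving the table [[1, 4], [4, 1]], a statistic of 3.6 on 1 degree of freedom, and a p-value of about 0.058.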