Module stikpetP.tests.test_bhapkar
import pandas as pd
from scipy.stats import chi2
from numpy.linalg import inv
from numpy import matmul, array, transpose
from ..other.table_cross import tab_cross
def ts_bhapkar(field1, field2, categories=None):
'''
Bhapkar Test
------------
If you are only interested in whether the overall distribution changed (i.e. whether the percentages for each category changed), you can perform a marginal homogeneity test. Two tests seem to be quite popular for this: the Stuart-Maxwell test (Stuart, 1955; Maxwell, 1970) and the Bhapkar test (Bhapkar, 1961, 1966). According to Uebersax (2006), who also provides a nice example, the Bhapkar test is preferred.
Simply put, a marginal homogeneity test compares the row proportions with the column proportions. Since in a paired test the options are the same for both variables, equal row and column proportions mean nothing changed between the two variables.
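To illustrate the idea, the following sketch computes the two marginal proportion vectors for a small hypothetical 3x3 paired table (the frequencies are made up for illustration):

```python
import numpy as np

# Hypothetical paired table: rows = categories of the first variable,
# columns = the same categories of the second variable.
F = np.array([[20,  5,  3],
              [ 6, 18,  4],
              [ 2,  7, 15]])
n = F.sum()
row_marg = F.sum(axis=1) / n   # proportion per category in the first variable
col_marg = F.sum(axis=0) / n   # proportion per category in the second variable
# Marginal homogeneity asks whether these two proportion vectors are equal.
```

Here the first category has the same proportion in both margins, but the second and third do not, so the marginals are not perfectly homogeneous.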
Parameters
----------
field1 : list or pandas series
the first categorical field
field2 : list or pandas series
the second categorical field
categories : list or dictionary, optional
order and/or selection for categories of field1 and field2
Returns
-------
* *n*, the sample size
* *statistic*, the chi-squared value
* *df*, the degrees of freedom used in the test
* *p-value*, the significance (p-value)
Notes
-----
The formula used is:
$$\\chi_{B}^2 = n\\times d^{\\prime} \\times S^{-1} \\times d$$
With:
$$S_{i,i} = p_{i,.} + p_{.,i} - 2\\times p_{i,i} - \\left(p_{i,.} - p_{.,i}\\right)^2$$
$$S_{i,j} = -\\left(p_{i,j} + p_{j,i}\\right) - \\left(p_{i,.} - p_{.,i}\\right)\\times\\left(p_{j,.} - p_{.,j}\\right)$$
$$d_i = p_{i,.} - p_{.,i}$$
$$p_{i,j} = \\frac{F_{i,j}}{n}$$
$$d = \\begin{bmatrix} d_1 \\\\ d_2 \\\\ \\dots \\\\ d_{r-1} \\end{bmatrix}$$
$$S = \\begin{bmatrix} S_{1,1} & S_{1,2} & \\dots & S_{1,c-1} \\\\ S_{2,1} & S_{2,2} & \\dots & S_{2,c-1} \\\\ \\dots & \\dots & \\dots & \\dots \\\\ S_{r-1,1} & S_{r-1,2} & \\dots & S_{r-1,c-1} \\\\ \\end{bmatrix}$$
$$n = \\sum_{i=1}^r \\sum_{j=1}^c F_{i,j}$$
The p-value (sig.):
$$df = r - 1 = c - 1$$
$$sig. = 1 - \\chi^2\\left(\\chi_B^2, df\\right)$$
*Symbols used:*
* \\(r\\), is the number of rows (categories in the first variable)
* \\(c\\), is the number of columns (categories in the second variable)
* \\(n\\), is the total number of scores
* \\(F_{i,j}\\), is the frequency (count) of scores equal to the i-th category in the first variable, and the j-th category in the second.
* \\(p_{i,.}\\), is the sum of the proportions in row i
* \\(p_{.,i}\\), is the sum of the proportions in column i
* \\(d^{\\prime}\\), is the transpose of the d vector
* \\(S^{-1}\\), is the inverse of the S matrix.
* \\(\\chi^2\\left(\\dots\\right)\\), the cumulative distribution function for the chi-square distribution.
*Note*
* The d vector and S matrix have one row (and column) less than the number of categories.
* This test differs from the Stuart-Maxwell test only in the calculation of S.
* The test was introduced by Bhapkar (1961, 1966).
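As a sanity check on the formulas above, here is a standalone NumPy sketch that computes the Bhapkar statistic directly from a hypothetical 3x3 frequency table, without the package's own cross-table helper:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical paired frequencies F[i, j] (same categories on rows and columns)
F = np.array([[20,  5,  3],
              [ 6, 18,  4],
              [ 2,  7, 15]])
n = F.sum()
p = F / n                     # p_{i,j} = F_{i,j} / n
prow = p.sum(axis=1)          # p_{i,.}
pcol = p.sum(axis=0)          # p_{.,i}
r = F.shape[0]

# the d vector and S matrix use only the first r-1 categories
d = (prow - pcol)[:r - 1]
S = np.empty((r - 1, r - 1))
for i in range(r - 1):
    for j in range(r - 1):
        if i == j:
            S[i, j] = prow[i] + pcol[i] - 2 * p[i, i] - (prow[i] - pcol[i]) ** 2
        else:
            S[i, j] = -(p[i, j] + p[j, i]) - (prow[i] - pcol[i]) * (prow[j] - pcol[j])

chi2_b = n * d @ np.linalg.inv(S) @ d   # chi^2_B = n * d' S^-1 d
df = r - 1
pvalue = chi2.sf(chi2_b, df)            # 1 - chi-square CDF
```

Note that S comes out symmetric, as a variance-covariance matrix should, and the statistic is compared against a chi-square distribution with r - 1 degrees of freedom.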
References
----------
Bhapkar, V. P. (1961). Some tests for categorical data. *The Annals of Mathematical Statistics, 32*(1), 72–83. doi:10.1214/aoms/1177705140
Bhapkar, V. P. (1966). A note on the equivalence of two test criteria for hypotheses in categorical data. *Journal of the American Statistical Association, 61*(313), 228–235. doi:10.1080/01621459.1966.10502021
Maxwell, A. E. (1970). Comparing the classification of subjects by two independent judges. *The British Journal of Psychiatry, 116*(535), 651–655. doi:10.1192/bjp.116.535.651
Stuart, A. (1955). A test for homogeneity of the marginal distributions in a two-way classification. *Biometrika, 42*(3/4), 412–416. doi:10.2307/2333387
Uebersax, J. (2006, August 30). McNemar tests of marginal homogeneity. http://www.john-uebersax.com/stat/mcnemar.htm
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
'''
    #create the cross table
    ct = tab_cross(field1, field2, categories, categories, totals="include")
    #basic counts
    k = ct.shape[0] - 1
    n = ct.iloc[k, k]
    #STEP 1: convert to proportions based on the grand total
    p = pd.DataFrame()
    for i in range(0, k + 1):
        for j in range(0, k + 1):
            p.at[i, j] = ct.iloc[i, j] / n
    #STEP 2: determine the differences between the row and the column totals
    d = [0] * (k - 1)
    for i in range(0, k - 1):
        d[i] = p.iloc[i, k] - p.iloc[k, i]
    #STEP 3: create the variance-covariance matrix
    #For values on the diagonal: add the row and column p totals,
    #subtract twice the cell p, and then
    #subtract the squared difference between the row and column p totals.
    #For values off the diagonal: add the cell p and its mirrored cell p,
    #negate the result, then subtract the product of
    #the row-minus-column p total difference for index i
    #with the row-minus-column p total difference for index j.
    S = pd.DataFrame()
    for i in range(0, k - 1):
        for j in range(0, k - 1):
            if i == j:
                S.at[i, j] = p.iloc[i, k] + p.iloc[k, j] - 2 * p.iloc[i, j] - (p.iloc[i, k] - p.iloc[k, j])**2
            else:
                S.at[i, j] = -(p.iloc[i, j] + p.iloc[j, i]) - (p.iloc[i, k] - p.iloc[k, i]) * (p.iloc[j, k] - p.iloc[k, j])
    Sinv = inv(S)
    #chi-square value: n * d' * S^-1 * d
    chiVal = n * matmul(matmul(array(d), Sinv), array(d))
    #test
    df = k - 1
    pvalue = chi2.sf(chiVal, df)
    #results
    colNames = ["n", "statistic", "df", "p-value"]
    results = pd.DataFrame([[n, chiVal, df, pvalue]], columns=colNames)
    return results
Functions
def ts_bhapkar(field1, field2, categories=None)