Module stikpetP.tests.test_bhapkar

import pandas as pd
from scipy.stats import chi2
from numpy.linalg import inv
from numpy import matmul, array
from ..other.table_cross import tab_cross

def ts_bhapkar(field1, field2, categories=None):
    '''
    Bhapkar Test
    ------------
    If you are only interested in whether the overall distribution changed (i.e. whether the percentage in each category changed or not), you can perform a marginal homogeneity test. Two tests seem to be quite popular for this: the Stuart-Maxwell test (Stuart, 1955; Maxwell, 1970) and the Bhapkar test (Bhapkar, 1961, 1966). According to Uebersax (2006), which also gives a nice worked example, the Bhapkar test is preferred.
    
    Simply put, a marginal homogeneity test compares the row and column proportions. Since in a paired test the options are the same, if the row and column proportions are equal, nothing changed between the two variables.
    
    Parameters
    ----------
    field1 : list or pandas series
        the first categorical field
    field2 : list or pandas series
        the second categorical field
    categories : list or dictionary, optional
        order and/or selection for categories of field1 and field2
    
    Returns
    -------
    A pandas dataframe with:
    
    * *n*, the sample size
    * *statistic*, the chi-squared value
    * *df*, the degrees of freedom used in the test
    * *p-value*, the significance (p-value)
    
    Notes
    -----
    The formula used is:
    $$\\chi_{B}^2 = n\\times d^{\\prime} \\times S^{-1} \\times d$$
    
    With:
    $$S_{i,i} = p_{i,.} + p_{.,i} - 2\\times p_{i,i} - \\left(p_{i,.} - p_{.,i}\\right)^2$$
    $$S_{i,j} = -\\left(p_{i,j} + p_{j,i}\\right) - \\left(p_{i,.} - p_{.,i}\\right)\\times\\left(p_{j,.} - p_{.,j}\\right)$$
    $$d_i = p_{i,.} - p_{.,i}$$
    $$p_{i,j} = \\frac{F_{i,j}}{n}$$
    $$d = \\begin{bmatrix} d_1 \\\\ d_2 \\\\ \\vdots \\\\ d_{r-1} \\end{bmatrix}$$
    $$S = \\begin{bmatrix} S_{1,1} & S_{1,2} & \\dots & S_{1,c-1} \\\\ S_{2,1} & S_{2,2} & \\dots & S_{2,c-1} \\\\  \\dots & \\dots & \\dots & \\dots \\\\ S_{r-1,1} & S_{r-1,2} & \\dots & S_{r-1,c-1} \\\\ \\end{bmatrix}$$
    $$n = \\sum_{i=1}^r \\sum_{j=1}^c F_{i,j}$$
    
    The p-value (sig.):
    $$df = r - 1 = c - 1$$
    $$sig. = 1 - \\chi^2\\left(\\chi_B^2, df\\right)$$
    
    *Symbols used:*
    
    * \\(r\\), is the number of rows (categories in the first variable)
    * \\(c\\), is the number of columns (categories in the second variable)
    * \\(n\\), is the total number of scores
    * \\(F_{i,j}\\), is the frequency (count) of scores equal to the i-th category in the first variable, and the j-th category in the second.
    * \\(p_{i,.}\\), the sum of the proportions in row i
    * \\(p_{.,i}\\), the sum of the proportions in column i
    * \\(d^{\\prime}\\), is the transpose of the d vector
    * \\(S^{-1}\\), is the inverse of the S matrix.
    * \\(\\chi^2\\left(\\dots\\right)\\), the cumulative distribution function for the chi-square distribution.

    *Note*
        
    * The d vector and S matrix have one row (and column) less than the cross table; the last category is dropped.
    * This test only differs from the Stuart-Maxwell test in the calculation of S.
    * The test was introduced by Bhapkar (1961, 1966).
    
    References
    ----------
    Bhapkar, V. P. (1961). Some tests for categorical data. *The Annals of Mathematical Statistics, 32*(1), 72–83. doi:10.1214/aoms/1177705140
    
    Bhapkar, V. P. (1966). A note on the equivalence of two test criteria for hypotheses in categorical data. *Journal of the American Statistical Association, 61*(313), 228–235. doi:10.1080/01621459.1966.10502021
    
    Maxwell, A. E. (1970). Comparing the classification of subjects by two independent judges. *The British Journal of Psychiatry, 116*(535), 651–655. doi:10.1192/bjp.116.535.651
    
    Stuart, A. (1955). A test for homogeneity of the marginal distributions in a two-way classification. *Biometrika, 42*(3/4), 412–416. doi:10.2307/2333387
    
    Uebersax, J. (2006, August 30). McNemar tests of marginal homogeneity. http://www.john-uebersax.com/stat/mcnemar.htm
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    '''
    #create the cross table
    ct = tab_cross(field1, field2, categories, categories, totals="include")    
    
    #basic counts
    k = ct.shape[0]-1
    n = ct.iloc[k, k]
    
    #STEP 1: Convert to percentages based on grand total
    p = pd.DataFrame()
    for i in range(0, k+1):
        for j in range(0, k + 1):
            p.at[i, j] = ct.iloc[i, j] / n
    
    #STEP 2: Determine the differences between the row and the column totals
    d = [0]*(k - 1)
    for i in range(0, k - 1):
        d[i] = p.iloc[i, k] - p.iloc[k, i]
    
    #STEP 3: Create the variance and covariance matrix
    #For values on the diagonal add the row and column p total,
    #subtract twice the cell p and then
    #subtract the squared difference between the row and column p total.
    
    #For values not on the diagonal add the mirrored cell p and then
    #add a minus sign, then subtract the product of
    #the difference of the row p total of the current cell and the column p total of the mirrored cell,
    #with the difference of the row p total of the mirrored cell and column p total of the current cell.
    
    S = pd.DataFrame()
    for i in range(0, k - 1):
        for j in range(0, k - 1):
            if i == j:
                S.at[i, j] = p.iloc[i, k] + p.iloc[k, j] - 2 * p.iloc[i, j] - (p.iloc[i, k] - p.iloc[k, j])**2
            else:
                S.at[i, j] = -(p.iloc[i, j] + p.iloc[j, i]) - (p.iloc[i, k] - p.iloc[k, i]) * (p.iloc[j, k] - p.iloc[k, j])
    
    Sinv = inv(S)
    #chi-square statistic: n * d' * S^-1 * d
    chiVal = n * matmul(matmul(array(d), Sinv), array(d))
    
    #test
    df = k - 1
    pvalue = chi2.sf(chiVal, df)
    
    #results
    colNames = ["n", "statistic", "df", "p-value"]
    results = pd.DataFrame([[n, chiVal, df, pvalue]], columns=colNames)
    
    return results
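The formulas in the docstring can be illustrated with a self-contained sketch that skips `tab_cross` and starts from a cross table directly. The 3×3 counts and the names `F`, `pr`, and `pc` are made up purely for illustration:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical 3x3 cross table of paired ratings
# (rows: first measurement, columns: second measurement).
F = np.array([[20,  5,  3],
              [10, 30,  6],
              [ 2,  8, 16]])

n = F.sum()
P = F / n                 # p_{i,j}: proportions based on the grand total
pr = P.sum(axis=1)        # row marginals p_{i,.}
pc = P.sum(axis=0)        # column marginals p_{.,i}

k = F.shape[0]
d = (pr - pc)[:k - 1]     # d_i, with the last category dropped

# variance-covariance matrix S (one row and column less than the table)
S = np.empty((k - 1, k - 1))
for i in range(k - 1):
    for j in range(k - 1):
        if i == j:
            S[i, j] = pr[i] + pc[i] - 2 * P[i, i] - (pr[i] - pc[i]) ** 2
        else:
            S[i, j] = -(P[i, j] + P[j, i]) - (pr[i] - pc[i]) * (pr[j] - pc[j])

stat = n * d @ np.linalg.inv(S) @ d   # chi-square value: n * d' * S^-1 * d
df = k - 1
pvalue = chi2.sf(stat, df)
print(round(float(stat), 4), df, round(float(pvalue), 4))
```

The last category is dropped because the marginal proportions each sum to 1, so the full k-by-k version of S would be singular; using k - 1 categories keeps S invertible.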

Functions

def ts_bhapkar(field1, field2, categories=None)
