Module stikpetP.effect_sizes.eff_size_goodman_kruskal_tau

import pandas as pd
from scipy.stats import chi2
from ..other.table_cross import tab_cross

def es_goodman_kruskal_tau(field1, field2, categories1=None, categories2=None):
    '''
    Goodman-Kruskal Tau
    -------------------
    According to Minitab, the Goodman-Kruskal tau "measures the percentage improvement in predictability of the dependent variable (column or row variable) given the value of other variables (row or column variables)" (n.d.). It is an effect size measure that can be used with a cross table.
    
    Parameters
    ----------
    field1 : list or pandas series
        the first categorical field
    field2 : list or pandas series
        the second categorical field
    categories1 : list or dictionary, optional
        order and/or selection for categories of field1
    categories2 : list or dictionary, optional
        order and/or selection for categories of field2
        
    Returns
    -------
    A dataframe with:
    
    * *dependent*, the field used as dependent variable
    * *value*, the tau value
    * *statistic*, the chi-square value
    * *df*, the degrees of freedom
    * *p-value*, the significance (p-value)
    
    Notes
    -----
    The formula used (Goodman & Kruskal, 1954, p. 759):
    $$\\tau_{Y|X} = \\frac{n\\times\\sum_{i,j}\\frac{F_{i,j}^2}{R_i} - \\sum_j C_j^2}{n^2 - \\sum_j C_j^2}$$
    $$\\tau_{X|Y} = \\frac{n\\times\\sum_{i,j}\\frac{F_{i,j}^2}{C_j} - \\sum_i R_i^2}{n^2 - \\sum_i R_i^2}$$
    
    The p-value is then obtained by (Light & Margolin, 1971, p. 538; Särndal, 1974, p. 178):
    $$\\chi_{\\tau_{Y|X}}^2 = \\left(n - 1\\right)\\times\\left(c - 1\\right)\\times\\tau_{Y|X}$$
    $$\\chi_{\\tau_{X|Y}}^2 = \\left(n - 1\\right)\\times\\left(r - 1\\right)\\times\\tau_{X|Y}$$
    $$df = \\left(r - 1\\right)\\times\\left(c - 1\\right)$$
    
    Light and Margolin (1971) developed an \\(R^2\\) measure for categorical data and proposed a test for it, CATANOVA (Categorical ANOVA), which is a chi-square test (p. 538). Särndal (1974, p. 178) concluded that the \\(R^2\\) from Light and Margolin is the same as the Goodman-Kruskal tau, and used their test for tau. Margolin and Light (1974) reached the same conclusion and proved the equivalence.
    
    *Symbols used:*
    
    * \\(F_{i,j}\\), the absolute frequency (observed count) from row i and column j
    * \\(c\\), the number of columns
    * \\(r\\), the number of rows
    * \\(R_i\\), row total of row i, it can be calculated using \\(R_i=\\sum_{j=1}^{c}F_{i,j}\\)
    * \\(C_j\\), column total of column j, it can be calculated using \\(C_j=\\sum_{i=1}^{r}F_{i,j}\\)
    * \\(n\\), the total number of cases, it can be calculated in various ways, \\(n=\\sum_{j=1}^{c}C_j=\\sum_{i=1}^{r}R_i=\\sum_{i=1}^{r}\\sum_{j=1}^{c}F_{i,j}\\)
    
    References
    ----------
    Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. *Journal of the American Statistical Association, 49*(268), 732–764. doi:10.2307/2281536
    
    Light, R. J., & Margolin, B. H. (1971). An analysis of variance for categorical data. *Journal of the American Statistical Association, 66*(335), 534–544. doi:10.1080/01621459.1971.10482297
    
    Margolin, B. H., & Light, R. J. (1974). An analysis of variance for categorical data, II: Small sample comparisons with chi square and other competitors. *Journal of the American Statistical Association, 69*(347), 755–764. doi:10.1080/01621459.1974.10480201
    
    Minitab. (n.d.). What are the Goodman-Kruskal statistics? Minitab 20 Support. Retrieved October 11, 2023, from https://support.minitab.com/en-us/minitab/20/help-and-how-to/statistics/tables/supporting-topics/other-statistics-and-tests/what-are-the-goodman-kruskal-statistics/
    
    Särndal, C. E. (1974). A comparative study of association measures. *Psychometrika, 39*(2), 165–187. doi:10.1007/BF02291467
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    '''
    
    #create the cross table
    ct = tab_cross(field1, field2, categories1, categories2, totals="include")
    
    #basic counts
    nrows = ct.shape[0]-1
    ncols = ct.shape[1]-1
    n = ct.iloc[nrows, ncols]
    
    #the margin totals
    rs = ct.iloc[0:nrows, ncols]
    cs = ct.iloc[nrows, 0:ncols]
    
    #tau
    tauyx = 0
    tauxy = 0
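    #rs and cs keep the cross table's index, so access them by position (.iloc)
    for i in range(0, nrows):
        for j in range(0, ncols):
            tauyx = tauyx + ct.iloc[i, j]**2 / rs.iloc[i]
            tauxy = tauxy + ct.iloc[i, j]**2 / cs.iloc[j]
    #sum of squared column totals, used for tau with the column variable as dependent
    scs2 = 0
    for j in range(0, ncols):
        scs2 = scs2 + cs.iloc[j]**2
    tauyx = (n * tauyx - scs2) / (n**2 - scs2)
    
    #sum of squared row totals, used for tau with the row variable as dependent
    srs2 = 0
    for i in range(0, nrows):
        srs2 = srs2 + rs.iloc[i]**2
    tauxy = (n * tauxy - srs2) / (n**2 - srs2)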
    
    chi2yx = (n - 1) * (ncols - 1) * tauyx
    chi2xy = (n - 1) * (nrows - 1) * tauxy
    df = (nrows - 1) * (ncols - 1)
    pyx = chi2.sf(chi2yx, df)
    pxy = chi2.sf(chi2xy, df)
    
    #the results
    ver = ["field1", "field2"]
    tau = [tauxy, tauyx]
    statistic = [chi2xy, chi2yx]
    dfs = [df, df]
    pvals = [pxy, pyx]
    
    colNames = ["dependent", "value", "statistic", "df", "p-value"]
    results = pd.DataFrame(list(zip(ver, tau, statistic, dfs, pvals)), columns=colNames)
    
    return results

Functions

def es_goodman_kruskal_tau(field1, field2, categories1=None, categories2=None)

Goodman-Kruskal Tau

According to Minitab, the Goodman-Kruskal tau "measures the percentage improvement in predictability of the dependent variable (column or row variable) given the value of other variables (row or column variables)" (n.d.). It is an effect size measure that can be used with a cross table.

Parameters

field1 : list or pandas series
    the first categorical field
field2 : list or pandas series
    the second categorical field
categories1 : list or dictionary, optional
    order and/or selection for categories of field1
categories2 : list or dictionary, optional
    order and/or selection for categories of field2

Returns

A dataframe with:
 
  • dependent, the field used as dependent variable
  • value, the tau value
  • statistic, the chi-square value
  • df, the degrees of freedom
  • p-value, the significance (p-value)

Notes

The formula used (Goodman & Kruskal, 1954, p. 759):

$$\tau_{Y|X} = \frac{n\times\sum_{i,j}\frac{F_{i,j}^2}{R_i} - \sum_j C_j^2}{n^2 - \sum_j C_j^2}$$

$$\tau_{X|Y} = \frac{n\times\sum_{i,j}\frac{F_{i,j}^2}{C_j} - \sum_i R_i^2}{n^2 - \sum_i R_i^2}$$
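As a rough illustration of the two tau formulas, the following pure-Python sketch computes both directions from a table of observed counts given as a list of rows. The function name `goodman_kruskal_tau_sketch` and the example counts are made up for this illustration and are not part of stikpetP, which first builds the cross table with `tab_cross`. The sketch assumes every row and column total is greater than zero.

```python
def goodman_kruskal_tau_sketch(table):
    """Return (tau_yx, tau_xy) for a table of counts given as a list of rows.

    Hypothetical helper for illustration only; not part of stikpetP.
    """
    row_totals = [sum(row) for row in table]        # R_i
    col_totals = [sum(col) for col in zip(*table)]  # C_j
    n = sum(row_totals)

    # sums over all cells of F_ij^2 / R_i and of F_ij^2 / C_j
    sum_f2_over_r = sum(f**2 / row_totals[i]
                        for i, row in enumerate(table) for f in row)
    sum_f2_over_c = sum(f**2 / col_totals[j]
                        for row in table for j, f in enumerate(row))

    sum_c2 = sum(c**2 for c in col_totals)
    sum_r2 = sum(r**2 for r in row_totals)

    # column variable (Y) as dependent, row variable (X) as predictor
    tau_yx = (n * sum_f2_over_r - sum_c2) / (n**2 - sum_c2)
    # row variable (X) as dependent, column variable (Y) as predictor
    tau_xy = (n * sum_f2_over_c - sum_r2) / (n**2 - sum_r2)
    return tau_yx, tau_xy


# made-up 2x3 table of observed counts
counts = [[20, 5, 5],
          [5, 10, 15]]
tau_yx, tau_xy = goodman_kruskal_tau_sketch(counts)
print(tau_yx, tau_xy)
```

For these counts the row totals are 30 and 30, the column totals are 25, 15 and 20, and n = 60, so the formulas give 350/2350 (about 0.149) with the column variable as dependent and 470/1800 (about 0.261) with the row variable as dependent.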

The p-value is then obtained by (Light & Margolin, 1971, p. 538; Särndal, 1974, p. 178):

$$\chi_{\tau_{Y|X}}^2 = \left(n - 1\right)\times\left(c - 1\right)\times\tau_{Y|X}$$

$$\chi_{\tau_{X|Y}}^2 = \left(n - 1\right)\times\left(r - 1\right)\times\tau_{X|Y}$$

$$df = \left(r - 1\right)\times\left(c - 1\right)$$

Light and Margolin (1971) developed an \(R^2\) measure for categorical data and proposed a test for it, CATANOVA (Categorical ANOVA), which is a chi-square test (p. 538). Särndal (1974, p. 178) concluded that the \(R^2\) from Light and Margolin is the same as the Goodman-Kruskal tau, and used their test for tau. Margolin and Light (1974) reached the same conclusion and proved the equivalence.
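The chi-square test for tau can be sketched with only the standard library. The helper names `chi2_sf_int_df` and `tau_test_sketch` are hypothetical. The function documented here uses `scipy.stats.chi2.sf` for the p-value; the closed-form survival function below is a stand-in that only handles integer degrees of freedom. The tau value and table dimensions in the example are made up.

```python
import math


def chi2_sf_int_df(x, df):
    """Upper-tail probability of a chi-square distribution, integer df >= 1.

    Stand-in for scipy.stats.chi2.sf, for illustration only.
    """
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # odd df: erfc(sqrt(x/2)) + 2*phi(sqrt(x)) * sum_r x^(r-1/2) / (1*3*...*(2r-1))
    root = math.sqrt(x)
    phi = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term, total = 0.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        term = root if r == 1 else term * x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(x / 2)) + 2 * phi * total


def tau_test_sketch(tau, n, n_dependent_categories, n_rows, n_cols):
    """Chi-square statistic, df and p-value for one direction of tau."""
    statistic = (n - 1) * (n_dependent_categories - 1) * tau
    df = (n_rows - 1) * (n_cols - 1)
    return statistic, df, chi2_sf_int_df(statistic, df)


# made-up example: tau = 7/47 from a 2x3 table with n = 60;
# the column variable is the dependent one, so it has 3 categories
statistic, df, p_value = tau_test_sketch(7 / 47, 60, 3, 2, 3)
print(statistic, df, p_value)
```

In this example the statistic is 59 × 2 × 7/47, about 17.57, with 2 degrees of freedom. Note that the multiplier uses the number of categories of the dependent variable, while the degrees of freedom are the same for both directions.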

Symbols used:

  • \(F_{i,j}\), the absolute frequency (observed count) from row i and column j
  • \(c\), the number of columns
  • \(r\), the number of rows
  • \(R_i\), row total of row i, it can be calculated using \(R_i=\sum_{j=1}^{c}F_{i,j}\)
  • \(C_j\), column total of column j, it can be calculated using \(C_j=\sum_{i=1}^{r}F_{i,j}\)
  • \(n\), the total number of cases, it can be calculated in various ways, \(n=\sum_{j=1}^{c}C_j=\sum_{i=1}^{r}R_i=\sum_{i=1}^{r}\sum_{j=1}^{c}F_{i,j}\)

References

Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. Journal of the American Statistical Association, 49(268), 732–764. doi:10.2307/2281536

Light, R. J., & Margolin, B. H. (1971). An analysis of variance for categorical data. Journal of the American Statistical Association, 66(335), 534–544. doi:10.1080/01621459.1971.10482297

Margolin, B. H., & Light, R. J. (1974). An analysis of variance for categorical data, II: Small sample comparisons with chi square and other competitors. Journal of the American Statistical Association, 69(347), 755–764. doi:10.1080/01621459.1974.10480201

Minitab. (n.d.). What are the Goodman-Kruskal statistics? Minitab 20 Support. Retrieved October 11, 2023, from https://support.minitab.com/en-us/minitab/20/help-and-how-to/statistics/tables/supporting-topics/other-statistics-and-tests/what-are-the-goodman-kruskal-statistics/

Särndal, C. E. (1974). A comparative study of association measures. Psychometrika, 39(2), 165–187. doi:10.1007/BF02291467

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
