Module stikpetP.effect_sizes.eff_size_goodman_kruskal_tau
import pandas as pd
from scipy.stats import chi2
from ..other.table_cross import tab_cross
def es_goodman_kruskal_tau(field1, field2, categories1=None, categories2=None):
    '''
Goodman-Kruskal Tau
-------------------
According to Minitab (n.d.), the Goodman-Kruskal tau "measures the percentage improvement in predictability of the dependent variable (column or row variable) given the value of other variables (row or column variables)". It is an effect size measure that can be used with a cross table.
Parameters
----------
field1 : list or pandas series
the first categorical field
field2 : list or pandas series
the second categorical field
categories1 : list or dictionary, optional
order and/or selection for categories of field1
categories2 : list or dictionary, optional
order and/or selection for categories of field2
Returns
-------
A dataframe with:
* *dependent*, the field used as dependent variable
* *value*, the tau value
* *statistic*, the chi-square value
* *df*, the degrees of freedom
* *p-value*, the significance (p-value)
Notes
-----
The formula used (Goodman & Kruskal, 1954, p. 759):
$$\\tau_{Y|X} = \\frac{n\\times\\sum_{i,j}\\frac{F_{i,j}^2}{R_i} - \\sum_j C_j^2}{n^2 - \\sum_j C_j^2}$$
$$\\tau_{X|Y} = \\frac{n\\times\\sum_{i,j}\\frac{F_{i,j}^2}{C_j} - \\sum_i R_i^2}{n^2 - \\sum_i R_i^2}$$
The p-value is then obtained by (Light & Margolin, 1971, p. 538; Särndal, 1974, p. 178):
$$\\chi_{\\tau_{Y|X}}^2 = \\left(n - 1\\right)\\times\\left(c - 1\\right)\\times\\tau_{Y|X}$$
$$\\chi_{\\tau_{X|Y}}^2 = \\left(n - 1\\right)\\times\\left(r - 1\\right)\\times\\tau_{X|Y}$$
$$df = \\left(r - 1\\right)\\times\\left(c - 1\\right)$$
Light and Margolin (1971) developed an \\(R^2\\) measure for categorical data and proposed a test for it, CATANOVA (Categorical ANOVA), which is a chi-square test (p. 538). Särndal (1974, p. 178) concluded that the \\(R^2\\) from Light and Margolin is the same as the Goodman-Kruskal tau, and uses their test for tau. Margolin and Light (1974) reach the same conclusion and prove the equivalence.
*Symbols used:*
* \\(F_{i,j}\\), the absolute frequency (observed count) from row i and column j
* \\(c\\), the number of columns
* \\(r\\), the number of rows
* \\(R_i\\), row total of row i, it can be calculated using \\(R_i=\\sum_{j=1}^{c}F_{i,j}\\)
* \\(C_j\\), column total of column j, it can be calculated using \\(C_j=\\sum_{i=1}^{r}F_{i,j}\\)
* \\(n\\) = the total number of cases, it can be calculated in various ways, \\(n=\\sum_{j=1}^{c}C_j=\\sum_{i=1}^{r}R_i=\\sum_{i=1}^{r}\\sum_{j=1}^{c}F_{i,j}\\)
References
----------
Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. *Journal of the American Statistical Association, 49*(268), 732–764. doi:10.2307/2281536
Light, R. J., & Margolin, B. H. (1971). An analysis of variance for categorical data. *Journal of the American Statistical Association, 66*(335), 534–544. doi:10.1080/01621459.1971.10482297
Margolin, B. H., & Light, R. J. (1974). An analysis of variance for categorical data, II: Small sample comparisons with chi square and other competitors. *Journal of the American Statistical Association, 69*(347), 755–764. doi:10.1080/01621459.1974.10480201
Minitab. (n.d.). What are the Goodman-Kruskal statistics? Minitab 20 Support. Retrieved October 11, 2023, from https://support.minitab.com/en-us/minitab/20/help-and-how-to/statistics/tables/supporting-topics/other-statistics-and-tests/what-are-the-goodman-kruskal-statistics/
Särndal, C. E. (1974). A comparative study of association measures. *Psychometrika, 39*(2), 165–187. doi:10.1007/BF02291467
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
'''
    # create the cross table, including the margin totals
    ct = tab_cross(field1, field2, categories1, categories2, totals="include")
    # basic counts (the last row and the last column hold the totals)
    nrows = ct.shape[0] - 1
    ncols = ct.shape[1] - 1
    n = ct.iloc[nrows, ncols]
    # the margin totals
    rs = ct.iloc[0:nrows, ncols]
    cs = ct.iloc[nrows, 0:ncols]
    # tau: accumulate sum(F_ij^2 / R_i) and sum(F_ij^2 / C_j)
    # (.iloc is used so the totals are always read by position, not by label)
    tauyx = 0
    tauxy = 0
    for i in range(0, nrows):
        for j in range(0, ncols):
            tauyx = tauyx + ct.iloc[i, j]**2 / rs.iloc[i]
            tauxy = tauxy + ct.iloc[i, j]**2 / cs.iloc[j]
    # sum of squared column totals, used for tau with the columns as dependent
    scs2 = 0
    for j in range(0, ncols):
        scs2 = scs2 + cs.iloc[j]**2
    tauyx = (n * tauyx - scs2) / (n**2 - scs2)
    # sum of squared row totals, used for tau with the rows as dependent
    srs2 = 0
    for i in range(0, nrows):
        srs2 = srs2 + rs.iloc[i]**2
    tauxy = (n * tauxy - srs2) / (n**2 - srs2)
    # chi-square approximation and p-values
    chi2yx = (n - 1) * (ncols - 1) * tauyx
    chi2xy = (n - 1) * (nrows - 1) * tauxy
    df = (nrows - 1) * (ncols - 1)
    pyx = chi2.sf(chi2yx, df)
    pxy = chi2.sf(chi2xy, df)
    # the results
    ver = ["field1", "field2"]
    tau = [tauxy, tauyx]
    statistic = [chi2xy, chi2yx]
    dfs = [df, df]
    pvals = [pxy, pyx]
    colNames = ["dependent", "value", "statistic", "df", "p-value"]
    results = pd.DataFrame(list(zip(ver, tau, statistic, dfs, pvals)), columns=colNames)
    return results
Functions
def es_goodman_kruskal_tau(field1, field2, categories1=None, categories2=None)
Goodman-Kruskal Tau
According to Minitab (n.d.), the Goodman-Kruskal tau "measures the percentage improvement in predictability of the dependent variable (column or row variable) given the value of other variables (row or column variables)". It is an effect size measure that can be used with a cross table.
Parameters
field1 : list or pandas series
the first categorical field
field2 : list or pandas series
the second categorical field
categories1 : list or dictionary, optional
order and/or selection for categories of field1
categories2 : list or dictionary, optional
order and/or selection for categories of field2
Returns
A dataframe with:
- dependent, the field used as dependent variable
- value, the tau value
- statistic, the chi-square value
- df, the degrees of freedom
- p-value, the significance (p-value)
Notes
The formula used (Goodman & Kruskal, 1954, p. 759):
$$\tau_{Y|X} = \frac{n\times\sum_{i,j}\frac{F_{i,j}^2}{R_i} - \sum_j C_j^2}{n^2 - \sum_j C_j^2}$$
$$\tau_{X|Y} = \frac{n\times\sum_{i,j}\frac{F_{i,j}^2}{C_j} - \sum_i R_i^2}{n^2 - \sum_i R_i^2}$$
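The two tau values can be sketched in plain Python as follows. This is a minimal illustration, not the library's implementation: the function name `goodman_kruskal_tau` and the list-of-lists input are assumptions made for the example, and it assumes no row or column total is zero. As in the source code, the tau with the rows as dependent variable subtracts the squared row totals.

```python
def goodman_kruskal_tau(table):
    """Return (tau_yx, tau_xy) for a table of observed counts.

    `table` is a list of rows (hypothetical input format for this sketch).
    tau_yx treats the column variable as dependent, tau_xy the row variable.
    """
    row_totals = [sum(row) for row in table]        # R_i
    col_totals = [sum(col) for col in zip(*table)]  # C_j
    n = sum(row_totals)

    # sum over all cells of F_ij^2 / R_i and of F_ij^2 / C_j
    sum_f2_over_r = sum(f**2 / row_totals[i]
                        for i, row in enumerate(table) for f in row)
    sum_f2_over_c = sum(f**2 / col_totals[j]
                        for row in table for j, f in enumerate(row))

    sum_c2 = sum(c**2 for c in col_totals)
    sum_r2 = sum(r**2 for r in row_totals)

    tau_yx = (n * sum_f2_over_r - sum_c2) / (n**2 - sum_c2)
    tau_xy = (n * sum_f2_over_c - sum_r2) / (n**2 - sum_r2)
    return tau_yx, tau_xy


# Knowing the column fully determines the row here, but not the other way round.
tau_yx, tau_xy = goodman_kruskal_tau([[10, 0, 0],
                                      [0, 10, 10]])
print(tau_yx, tau_xy)  # 0.5 1.0
```

In this table the column category always identifies the row, so the tau with the rows as dependent variable is 1, while knowing the row only halves the prediction error for the columns.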
The p-value is then obtained by (Light & Margolin, 1971, p. 538; Särndal, 1974, p. 178):
$$\chi_{\tau_{Y|X}}^2 = \left(n - 1\right)\times\left(c - 1\right)\times\tau_{Y|X}$$
$$\chi_{\tau_{X|Y}}^2 = \left(n - 1\right)\times\left(r - 1\right)\times\tau_{X|Y}$$
$$df = \left(r - 1\right)\times\left(c - 1\right)$$
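The test step can be sketched as below. The source code obtains the p-value with `scipy.stats.chi2.sf`; to keep this example dependency-free, `chi2_sf` is a hand-rolled stand-in that is only meant for integer degrees of freedom of at least 1 and a non-negative statistic. Both function names and the parameter `k_dependent` (the number of categories of the dependent variable) are assumptions made for the illustration.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square distribution (integer df >= 1).

    Uses Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / gamma(a + 1), starting from
    Q(1, z) = exp(-z) for even df or Q(1/2, z) = erfc(sqrt(z)) for odd df.
    """
    z = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2:
        q += z**a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return q


def tau_chi2_test(tau, n, k_dependent, nrows, ncols):
    """Return (statistic, df, p-value) for one tau value."""
    statistic = (n - 1) * (k_dependent - 1) * tau
    df = (nrows - 1) * (ncols - 1)
    return statistic, df, chi2_sf(statistic, df)


# tau = 0.5 with the 3-category column variable as dependent, n = 30, 2 x 3 table
statistic, df, p_value = tau_chi2_test(0.5, n=30, k_dependent=3, nrows=2, ncols=3)
print(statistic, df)  # 29.0 2
```

With two degrees of freedom the tail probability reduces to exp(-statistic / 2), so the p-value in this example is exp(-14.5).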
Light and Margolin (1971) developed an R^2 measure for categorical data and proposed a test for it, CATANOVA (Categorical ANOVA), which is a chi-square test (p. 538). Särndal (1974, p. 178) concluded that the R^2 from Light and Margolin is the same as the Goodman-Kruskal tau, and uses their test for tau. Margolin and Light (1974) reach the same conclusion and prove the equivalence.
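The equivalence can be checked numerically. The sketch below assumes the CATANOVA sums of squares as they are usually stated, with the column variable as response and the rows as groups: a total sum of squares n/2 - ΣC_j²/(2n) and a within-group sum of squares n/2 - Σ_i (Σ_j F_ij²)/(2R_i). These definitions are not given in the text above, and the function names are hypothetical.

```python
def catanova_r2(table):
    """R^2 = BSS / TSS with the columns as response and the rows as groups."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    tss = n / 2 - sum(c**2 for c in col_totals) / (2 * n)
    wss = n / 2 - sum(sum(f**2 for f in row) / (2 * r_i)
                      for row, r_i in zip(table, row_totals))
    return (tss - wss) / tss  # BSS = TSS - WSS


def tau_y_given_x(table):
    """Goodman-Kruskal tau with the column variable as dependent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    sum_f2_over_r = sum(sum(f**2 for f in row) / r_i
                        for row, r_i in zip(table, row_totals))
    sum_c2 = sum(c**2 for c in col_totals)
    return (n * sum_f2_over_r - sum_c2) / (n**2 - sum_c2)


table = [[30, 10],
         [10, 50]]
# the two values agree up to floating-point rounding
print(catanova_r2(table), tau_y_given_x(table))
```

Multiplying the numerator and denominator of BSS / TSS by 2n gives the tau formula directly, which is why the two functions agree.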
Symbols used:
- \(F_{i,j}\), the absolute frequency (observed count) from row i and column j
- \(c\), the number of columns
- \(r\), the number of rows
- \(R_i\), row total of row i, it can be calculated using \(R_i=\sum_{j=1}^{c}F_{i,j}\)
- \(C_j\), column total of column j, it can be calculated using \(C_j=\sum_{i=1}^{r}F_{i,j}\)
- \(n\), the total number of cases, it can be calculated in various ways, \(n=\sum_{j=1}^{c}C_j=\sum_{i=1}^{r}R_i=\sum_{i=1}^{r}\sum_{j=1}^{c}F_{i,j}\)
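As a small illustration of these symbols, the margins and the three ways of obtaining n can be computed from a table of counts. The example table and variable names are made up for this sketch.

```python
# observed counts F_ij: 2 rows, 3 columns (made-up example data)
table = [[10, 0, 0],
         [0, 10, 10]]

r, c = len(table), len(table[0])                # number of rows and columns
row_totals = [sum(row) for row in table]        # R_i
col_totals = [sum(col) for col in zip(*table)]  # C_j

# n from the column totals, the row totals, and the individual cells
n_from_cols = sum(col_totals)
n_from_rows = sum(row_totals)
n_from_cells = sum(f for row in table for f in row)

print(row_totals, col_totals, n_from_cells)  # [10, 20] [10, 10, 10] 30
```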
References
Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. Journal of the American Statistical Association, 49(268), 732–764. doi:10.2307/2281536
Light, R. J., & Margolin, B. H. (1971). An analysis of variance for categorical data. Journal of the American Statistical Association, 66(335), 534–544. doi:10.1080/01621459.1971.10482297
Margolin, B. H., & Light, R. J. (1974). An analysis of variance for categorical data, II: Small sample comparisons with chi square and other competitors. Journal of the American Statistical Association, 69(347), 755–764. doi:10.1080/01621459.1974.10480201
Minitab. (n.d.). What are the Goodman-Kruskal statistics? Minitab 20 Support. Retrieved October 11, 2023, from https://support.minitab.com/en-us/minitab/20/help-and-how-to/statistics/tables/supporting-topics/other-statistics-and-tests/what-are-the-goodman-kruskal-statistics/
Särndal, C. E. (1974). A comparative study of association measures. Psychometrika, 39(2), 165–187. doi:10.1007/BF02291467
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076