Module stikpetP.correlations.cor_stuart_tau
Source code
import pandas as pd
from statistics import NormalDist
from ..other.table_cross import tab_cross
def r_stuart_tau(ordField1, ordField2, levels1=None, levels2=None, cc=False, useRanks=False):
    '''
Stuart(-Kendall) Tau c
----------------------
A rank correlation coefficient. It ranges from -1 (perfect negative association) to 1 (perfect positive association). A zero would indicate no correlation at all.
A positive correlation indicates that if someone scored high on the first field, they also likely score high on the second, while a negative correlation would indicate a high score on the first would give a low score on the second.
Alternatives for Stuart-Kendall Tau c are Goodman-Kruskal Gamma, Kendall Tau b, and Somers d; Spearman rho could also be considered.
Kendall Tau b looks at so-called concordant and discordant pairs and, unlike Gamma, does not ignore tied pairs. Stuart-Kendall Tau c does so as well, but also takes the size of the table into consideration. Somers d only corrects for tied pairs in one of the two directions. Spearman rho is more a variation on the Pearson correlation, applied to ranks. See Göktaş and İşçi (2011) for more information on the comparisons.
If there are no ties, Kendall Tau a gives the same result as Goodman-Kruskal Gamma.
Parameters
----------
ordField1 : pandas series
the ordinal or scale scores of the first variable
ordField2 : pandas series
the ordinal or scale scores of the second variable
levels1 : list or dictionary, optional
the categories to use from ordField1
levels2 : list or dictionary, optional
the categories to use from ordField2
cc : boolean, optional
apply a continuity correction to the test. Default is False
useRanks : boolean, optional
rank the data first or not. Default is False
Returns
-------
A dataframe with:
* *Stuart-Kendall Tau c*, the tau-c value
* *statistic*, the test statistic (z-value)
* *p-value*, the p-value (significance)
Notes
-----
The formula used (Stuart, 1953, p. 107):
$$\\tau_c = \\frac{P-Q}{n^2\\times\\frac{m-1}{m}}$$
And for the test (Brown & Benedetti, 1977, p. 311):
$$z_{\\tau_c} = \\frac{\\tau_c}{ASE_0}$$
$$sig. = 2\\times\\left(1 - \\Phi\\left(\\left|z_{\\tau_c}\\right|\\right)\\right)$$
With:
$$m = \\min\\left(r,c\\right)$$
$$P = \\sum_{i,j} P_{i,j}$$
$$Q = \\sum_{i,j} Q_{i,j}$$
$$P_{i,j} = F_{i,j}\\times C_{i,j}$$
$$Q_{i,j} = F_{i,j}\\times D_{i,j}$$
$$C_{i,j} = \\sum_{h<i}\\sum_{k<j} F_{h,k} + \\sum_{h>i}\\sum_{k>j} F_{h,k}$$
$$D_{i,j} = \\sum_{h<i}\\sum_{k>j} F_{h,k} + \\sum_{h>i}\\sum_{k<j} F_{h,k}$$
$$ASE_0 = \\frac{2\\times m}{\\left(m - 1\\right)\\times n^2}\\times\\sqrt{\\sum_{i=1}^r\\sum_{j=1}^c F_{i,j}\\times\\left(C_{i,j} - D_{i,j}\\right)^2 - \\frac{\\left(P-Q\\right)^2}{n}}$$
*Symbols Used*
* \\(F_{i,j}\\), the number of cases in row i, column j.
* \\(n\\), the total sample size
* \\(r\\), the number of rows
* \\(c\\), the number of columns
* \\(\\Phi\\left(\\dots\\right)\\), the cumulative distribution function of the standard normal distribution.
The continuity correction is applied as (Schaeffer & Levitt, 1956, p. 342):
$$\\tau_{cc} = \\left|\\tau\\right| - \\frac{2}{n\\times\\left(n - 1\\right)}$$
Note that this correction should actually be adjusted in case ties are present. Hopefully this can be implemented in a future update.
References
----------
Brown, M. B., & Benedetti, J. K. (1977). Sampling behavior of tests for correlation in two-way contingency tables. *Journal of the American Statistical Association, 72*(358), 309–315. doi:10.2307/2286793
Göktaş, A., & İşçi, Ö. (2011). A comparison of the most commonly used measures of association for doubly ordered square contingency tables via simulation. *Advances in Methodology and Statistics, 8*(1). doi:10.51936/milh5641
Schaeffer, M. S., & Levitt, E. E. (1956). Concerning Kendall’s tau, a nonparametric correlation coefficient. *Psychological Bulletin, 53*(4), 338–346. doi:10.1037/h0045013
Stuart, A. (1953). The estimation and comparison of strengths of association in contingency tables. *Biometrika, 40*(1/2), 105. doi:10.2307/2333101
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
'''
    ct = tab_cross(ordField1, ordField2, order1=levels1, order2=levels2)
    k1 = ct.shape[0]
    k2 = ct.shape[1]
    if not useRanks:
        if levels1 is not None:
            #replace row labels with numeric scores
            ct = ct.reset_index(drop=True)
        if levels2 is not None:
            ct.columns = list(range(k2))
    n = 0
    #k1-by-k2 tables holding the concordant and discordant counts per cell
    conc = pd.DataFrame([[0]*k2 for _ in range(k1)])
    disc = pd.DataFrame([[0]*k2 for _ in range(k1)])
    for i in range(k1):
        for j in range(k2):
            for h in range(k1):
                for k in range(k2):
                    if useRanks:
                        if h > i and k > j:
                            conc.iloc[i,j] = conc.iloc[i,j] + ct.iloc[h,k]
                        elif h < i and k < j:
                            conc.iloc[i,j] = conc.iloc[i,j] + ct.iloc[h,k]
                        elif h > i and k < j:
                            disc.iloc[i,j] = disc.iloc[i,j] + ct.iloc[h,k]
                        elif h < i and k > j:
                            disc.iloc[i,j] = disc.iloc[i,j] + ct.iloc[h,k]
                    else:
                        if ct.index[h] > ct.index[i] and ct.columns[k] > ct.columns[j]:
                            conc.iloc[i,j] = conc.iloc[i,j] + ct.iloc[h,k]
                        elif ct.index[h] < ct.index[i] and ct.columns[k] < ct.columns[j]:
                            conc.iloc[i,j] = conc.iloc[i,j] + ct.iloc[h,k]
                        elif ct.index[h] > ct.index[i] and ct.columns[k] < ct.columns[j]:
                            disc.iloc[i,j] = disc.iloc[i,j] + ct.iloc[h,k]
                        elif ct.index[h] < ct.index[i] and ct.columns[k] > ct.columns[j]:
                            disc.iloc[i,j] = disc.iloc[i,j] + ct.iloc[h,k]
            n = n + ct.iloc[i,j]
    ct = ct.reset_index(drop=True)
    ct.columns = list(range(k2))
    p = (ct*conc).sum().sum()
    q = (ct*disc).sum().sum()
    m = min(k1, k2)
    tc = (p - q) / (n**2 * (m - 1) / m)
    if cc:
        #continuity corrected tau for the test (Schaeffer & Levitt, 1956)
        tauTest = abs(tc) - 2/(n*(n - 1))
    else:
        tauTest = tc
    #asymptotic standard error under the null (Brown & Benedetti, 1977)
    ase0 = (ct*(conc - disc)**2).sum().sum()
    ase0 = 2*(m / (m - 1)) * (ase0 - (p - q)**2 / n)**0.5 / (n**2)
    z = tauTest/ase0
    pVal = 2 * (1 - NormalDist().cdf(abs(z)))
    res = pd.DataFrame([[tc, z, pVal]])
    res.columns = ["Stuart-Kendall Tau c", "statistic", "p-value"]
    return res
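As a minimal illustration of the formulas in the Notes (not the package implementation, and with made-up counts), Stuart-Kendall Tau c and its z-test can be computed directly from a contingency table using NumPy slices for the concordant and discordant sums:

```python
import numpy as np
from statistics import NormalDist

# hypothetical 3x3 cross table of observed counts F[i, j]
F = np.array([[10, 4, 2],
              [ 3, 8, 5],
              [ 1, 2, 9]])
r, c = F.shape
n = F.sum()
m = min(r, c)

# C[i,j]: cases below-left or above-right of cell (i,j); D[i,j]: the other two quadrants
C = np.zeros_like(F)
D = np.zeros_like(F)
for i in range(r):
    for j in range(c):
        C[i, j] = F[:i, :j].sum() + F[i+1:, j+1:].sum()
        D[i, j] = F[:i, j+1:].sum() + F[i+1:, :j].sum()

P = (F * C).sum()                       # sum of F_ij * C_ij
Q = (F * D).sum()                       # sum of F_ij * D_ij
tau_c = (P - Q) / (n**2 * (m - 1) / m)  # Stuart (1953) tau-c

# null-hypothesis ASE and z-test (Brown & Benedetti, 1977)
ase0 = 2 * m / ((m - 1) * n**2) * np.sqrt((F * (C - D)**2).sum() - (P - Q)**2 / n)
z = tau_c / ase0
p = 2 * (1 - NormalDist().cdf(abs(z)))
print(tau_c, z, p)   # tau_c is about 0.5176 for this table
```

For this diagonal-heavy table the association is positive and the two-sided p-value is well below 0.01, matching the intuition that high scores on one variable go with high scores on the other.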
Functions
def r_stuart_tau(ordField1, ordField2, levels1=None, levels2=None, cc=False, useRanks=False)
-
Stuart(-Kendall) Tau c, a rank correlation coefficient for two ordinal variables; see the full documentation above.