Module stikpetP.other.poho_dunn_q

Source code
import pandas as pd
from statistics import NormalDist

def ph_dunn_q(data, success=None):
    '''
    Post-Hoc Dunn Test (for Cochran Q test)
    ---------------------------------------
    An adaptation by IBM SPSS of the Dunn test, so it can be used as a post-hoc test after a Cochran Q test.
    
    Parameters
    ----------
    data : dataframe
        dataframe with a column for each category
    success : object, optional
        the value that represents a 'success'. If None (default), the value in the first row and column will be used as the success value.
        
    Returns
    -------
    res : dataframe
        test results with the following columns
    
    * *category 1*, label of first variable in comparison
    * *category 2*, label of second variable in comparison
    * *n suc. 1*, number of successes in first variable in comparison
    * *n suc. 2*, number of successes in second variable in comparison
    * *statistic*, test statistic
    * *z-value*, standardized test statistic (z-value)
    * *p-value*, p-value of the z-value
    * *adj. p-value*, Bonferroni corrected p-value
    
    Notes
    -----
    The formula used (IBM, 2021, p. 814):
    $$z_{1,2} = \\frac{\\bar{d}_{1,2}}{SE}$$
    $$sig. = 2\\times\\left(1-\\Phi\\left(\\left|z_{1,2}\\right|\\right)\\right)$$
    
    With:
    $$\\bar{d}_{1,2} = \\frac{ns_1 - ns_2}{n}$$
    $$SE = \\sqrt{2\\times\\frac{k\\times\\sum_{i=1}^n R_i - \\sum_{i=1}^n R_i^2}{n^2\\times k\\times\\left(k-1\\right)}}$$
    $$R_i = \\sum_{j=1}^k s_{i,j}$$
    $$ns_j = \\sum_{i=1}^n s_{i,j}$$    
    $$s_{i,j} = \\begin{cases} 1 & \\text{ if } x_{i,j}= \\text{success} \\\\ 0 & \\text{ if } x_{i,j} \\neq \\text{success} \\end{cases}$$
    
    IBM SPSS mentions this is an adaptation of Dunn (1964), which was originally a post-hoc test for the Kruskal-Wallis test.
    
    The Bonferroni adjustment is done using:
    $$sig._{adj} = \\min \\left(sig. \\times n_c, 1\\right)$$
    $$n_c = \\frac{k\\times\\left(k-1\\right)}{2}$$
    
    *Symbols used*
    
    * \\(x_{i,j}\\), the score in row i and column j
    * \\(k\\), the number of variables
    * \\(n\\), the total number of cases used
    * \\(ns_j\\), the total number of successes in column j
    * \\(R_i\\), the total number of successes in row i
    * \\(\\Phi\\left(\\dots\\right)\\), the standard normal cumulative distribution function.
    * \\(n_c\\), the number of comparisons (pairs)

    References
    ----------
    Dunn, O. J. (1964). Multiple comparisons using rank sums. *Technometrics, 6*(3), 241–252. doi:10.1080/00401706.1964.10490181
    
    IBM. (2021). IBM SPSS Statistics Algorithms. IBM.
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    '''
    
    df = data.dropna()
    varNames = df.columns
    k = len(varNames)
    n = df.shape[0]
    
    if success is None:
        suc = df.iloc[0,0]
    else:
        suc = success
    
    #recode the scores: 1 for a success, 0 for anything else
    df = (df == suc).astype(int)
    
    #row totals of successes (R_i), their sum and the sum of their squares
    df["rs"] = df.sum(axis=1)
    rst = sum(df["rs"])
    rs2t = sum(df["rs"]**2)
    
    #standard error
    se = (2 * (k * rst - rs2t) / (k * (k - 1) * n**2))**0.5
    
    #number of comparisons
    ncomp = k * (k - 1) / 2
    
    #the pairwise comparisons
    res = pd.DataFrame()    
    resRow=0
    for i in range(k-1):
        for j in range(i+1, k):
            # create pairs
            cat1 = varNames[i]
            cat2 = varNames[j]
            selDf = df[[cat1, cat2]]
            selDf = selDf.dropna()
            n1 = sum(selDf[cat1]==1)
            n2 = sum(selDf[cat2]==1)            
            t = (n1 - n2)/n
            z = t/se
            pVal = 2 * (1 - NormalDist().cdf(abs(z)))
            #Bonferroni adjustment, capped at 1
            pAdj = min(pVal*ncomp, 1)
            
            res.at[resRow, 0] = cat1
            res.at[resRow, 1] = cat2
            res.at[resRow, 2] = n1
            res.at[resRow, 3] = n2
            res.at[resRow, 4] = t
            res.at[resRow, 5] = z
            res.at[resRow, 6] = pVal
            res.at[resRow, 7] = pAdj
            
            resRow=resRow+1
            
    res.columns =["category 1", "category 2", "n suc. 1", "n suc. 2", "statistic", "z-value", "p-value", "adj. p-value"]
    
    return res

Functions

def ph_dunn_q(data, success=None)

Post-Hoc Dunn Test (for Cochran Q test)

An adaptation by IBM SPSS of the Dunn test, so it can be used as a post-hoc test after a Cochran Q test.

Parameters

data : dataframe
dataframe with a column for each category
success : object, optional
the value that represents a 'success'. If None (default), the value in the first row and column will be used as the success value.

Returns

res : dataframe
test results with the following columns
  • category 1, label of first variable in comparison
  • category 2, label of second variable in comparison
  • n suc. 1, number of successes in first variable in comparison
  • n suc. 2, number of successes in second variable in comparison
  • statistic, test statistic
  • z-value, standardized test statistic (z-value)
  • p-value, p-value of the z-value
  • adj. p-value, Bonferroni corrected p-value
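
A minimal usage sketch; the example data and column names are made up, and the import path assumes the function is exposed at this module path:

import pandas as pd
from stikpetP.other.poho_dunn_q import ph_dunn_q

#hypothetical data: five respondents answering three yes/no items
ex = pd.DataFrame({
    "item1": ["yes", "yes", "no", "yes", "no"],
    "item2": ["no", "yes", "no", "no", "no"],
    "item3": ["yes", "no", "no", "yes", "no"]})

#'yes' is treated as the success value
res = ph_dunn_q(ex, success="yes")
print(res)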

Notes

The formula used (IBM, 2021, p. 814):

$$z_{1,2} = \frac{\bar{d}_{1,2}}{SE}$$
$$sig. = 2\times\left(1-\Phi\left(\left|z_{1,2}\right|\right)\right)$$

With:

$$\bar{d}_{1,2} = \frac{ns_1 - ns_2}{n}$$
$$SE = \sqrt{2\times\frac{k\times\sum_{i=1}^n R_i - \sum_{i=1}^n R_i^2}{n^2\times k\times\left(k-1\right)}}$$
$$R_i = \sum_{j=1}^k s_{i,j}$$
$$ns_j = \sum_{i=1}^n s_{i,j}$$
$$s_{i,j} = \begin{cases} 1 & \text{ if } x_{i,j} = \text{success} \\ 0 & \text{ if } x_{i,j} \neq \text{success} \end{cases}$$

IBM SPSS mentions this is an adaptation of Dunn (1964), which was originally a post-hoc test for the Kruskal-Wallis test.

The Bonferroni adjustment is done using:

$$sig._{adj} = \min\left(sig. \times n_c, 1\right)$$
$$n_c = \frac{k\times\left(k-1\right)}{2}$$

Symbols used

  • \(x_{i,j}\), the score in row i and column j
  • \(k\), the number of variables
  • \(n\), the total number of cases used
  • \(ns_j\), the total number of successes in column j
  • \(R_i\), the total number of successes in row i
  • \(\Phi\left(\dots\right)\), the standard normal cumulative distribution function
  • \(n_c\), the number of comparisons (pairs)
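
For a single pair of variables the formulas above can be traced by hand. The following is a minimal sketch in plain Python, assuming the scores have already been recoded to 0 and 1; the small matrix is made up for illustration:

from statistics import NormalDist

#recoded scores s_ij: rows are cases, columns are the k variables
s = [[1, 0, 1],
     [1, 1, 0],
     [0, 0, 0],
     [1, 0, 1],
     [0, 0, 0]]
n, k = len(s), len(s[0])

R = [sum(row) for row in s]             #R_i, successes per row
ns = [sum(col) for col in zip(*s)]      #ns_j, successes per column

se = (2 * (k * sum(R) - sum(r**2 for r in R)) / (n**2 * k * (k - 1)))**0.5
ncomp = k * (k - 1) / 2                 #number of pairs

#comparison of the first and second variable
d = (ns[0] - ns[1]) / n                 #mean difference
z = d / se
p = 2 * (1 - NormalDist().cdf(abs(z)))
p_adj = min(p * ncomp, 1)               #Bonferroni corrected p-value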

References

Dunn, O. J. (1964). Multiple comparisons using rank sums. Technometrics, 6(3), 241–252. doi:10.1080/00401706.1964.10490181

IBM. (2021). IBM SPSS Statistics Algorithms. IBM.

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
