Module stikpetP.effect_sizes.eff_size_scott_pi
import pandas as pd
from statistics import NormalDist
from ..other.table_cross import tab_cross
def es_scott_pi(field1, field2, categories=None):
    '''
    Scott Pi
    -----------
    An effect size measure of how strongly two raters or variables agree with each other. Full agreement results in a pi of 1.
    The measure is very similar to Cohen's kappa. The difference lies in the calculation of the expected marginal proportions: Cohen's kappa uses a squared geometric mean, while Scott's pi uses a squared arithmetic mean.
    Scott developed this as a criticism of Bennett-Alpert-Goldstein's S (see es_bag_s()).
    Parameters
    ----------
    field1 : list or pandas series
        the first categorical field
    field2 : list or pandas series
        the second categorical field
    categories : list or dictionary, optional
        order and/or selection for categories of field1 and field2
    Returns
    -------
    A dataframe with:
    * *Scott pi*, the Scott pi value.
    * *n*, the sample size
    * *statistic*, the test statistic (z-value)
    * *p-value*, the p-value (significance)
    Notes
    -----
    The formula used (Scott, 1955, p. 323):
    $$\\pi = \\frac{p_0 - p_e}{1 - p_e}$$
    With:
    $$P = \\sum_{i=1}^r F_{i,i}$$
    $$p_0 = \\frac{P}{n}$$
    $$p_e = \\sum_{i=1}^r\\left(\\frac{R_i + C_i}{2\\times n}\\right)^2$$
    The asymptotic standard error is calculated using (Scott, 1955, p. 325):
    $$ASE = \\sqrt{\\left(\\frac{1}{1 - p_e}\\right)^2\\times\\frac{p_0\\times\\left(1 - p_0\\right)}{n - 1}}$$
    The p-value (significance) is then calculated using:
    $$z_{\\pi} = \\frac{\\pi}{ASE}$$
    $$sig. = 2\\times\\left(1 - \\Phi\\left(z_{\\pi}\\right)\\right)$$
    *Symbols used*
    * \\(F_{i,j}\\), the observed count in row i and column j.
    * \\(r\\), the number of rows (categories in the first variable)
    * \\(c\\), the number of columns (categories in the second variable)
    * \\(n\\), the total number of scores
    * \\(R_i\\), the row total of row i. \\(R_i = \\sum_{j=1}^c F_{i,j}\\)
    * \\(C_j\\), the column total of column j. \\(C_j = \\sum_{i=1}^r F_{i,j}\\)
    References
    ----------
    Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. *The Public Opinion Quarterly, 19*(3), 321–325.
    Author
    ------
    Made by P. Stikker
    Companion website: https://PeterStatistics.com
    YouTube channel: https://www.youtube.com/stikpet
    Donations: https://www.patreon.com/bePatron?u=19398076
    '''
    # create the cross table, with row/column totals included
    ct = tab_cross(field1, field2, categories, categories, totals="include")
    # basic counts (the last row and column hold the totals)
    k = ct.shape[0] - 1
    n = ct.iloc[k, k]
    # determine p0 (observed agreement) and pe (expected agreement)
    p0 = 0
    pe = 0
    for i in range(0, k):
        p0 = p0 + ct.iloc[i, i]
        pe = pe + ((ct.iloc[i, k] + ct.iloc[k, i]) / (2 * n))**2
    p0 = p0 / n
    # Scott's pi
    scottPi = (p0 - pe) / (1 - pe)
    # test statistic and two-sided p-value
    ase = ((1 / (1 - pe))**2 * p0 * (1 - p0) / (n - 1))**0.5
    z = scottPi / ase
    pValue = 2 * (1 - NormalDist().cdf(abs(z)))
    # the results
    colnames = ["Scott pi", "n", "statistic", "p-value"]
    results = pd.DataFrame([[scottPi, n, z, pValue]], columns=colnames)
    return results
Functions
def es_scott_pi(field1, field2, categories=None)
Scott Pi
An effect size measure of how strongly two raters or variables agree with each other. Full agreement results in a pi of 1.
The measure is very similar to Cohen's kappa. The difference lies in the calculation of the expected marginal proportions: Cohen's kappa uses a squared geometric mean, while Scott's pi uses a squared arithmetic mean.
Scott developed this as a criticism of Bennett-Alpert-Goldstein's S (see es_bag_s()).
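To make the contrast with Cohen's kappa concrete, here is a minimal sketch (hypothetical ratings, plain pandas, not part of this package) that computes both expected-agreement terms from the same cross table:

```python
import pandas as pd

# Hypothetical ratings from two raters (illustration only)
rater1 = ["a", "a", "b", "b"]
rater2 = ["a", "a", "a", "b"]

ct = pd.crosstab(pd.Series(rater1), pd.Series(rater2))
n = ct.to_numpy().sum()
p = ct.sum(axis=1) / n   # row marginal proportions
q = ct.sum(axis=0) / n   # column marginal proportions

# Cohen: sum of products p_i * q_i, i.e. squared geometric means
pe_cohen = (p * q).sum()
# Scott: sum of squared arithmetic means of the two marginals
pe_scott = (((p + q) / 2) ** 2).sum()

print(pe_cohen, pe_scott)  # 0.5 0.53125
```

By the AM-GM inequality each of Scott's terms is at least as large as Cohen's, so on the same table Scott's expected agreement is never smaller and Scott's pi is never larger than Cohen's kappa.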
Parameters
field1 : list or pandas series
- the first categorical field
field2 : list or pandas series
- the second categorical field
categories : list or dictionary, optional
- order and/or selection for categories of field1 and field2
Returns
A dataframe with:
- Scott pi, the Scott pi value.
- n, the sample size
- statistic, the test statistic (z-value)
- p-value, the p-value (significance)
Notes
The formula used (Scott, 1955, p. 323):
$$\pi = \frac{p_0 - p_e}{1 - p_e}$$
With:
$$P = \sum_{i=1}^r F_{i,i}$$
$$p_0 = \frac{P}{n}$$
$$p_e = \sum_{i=1}^r\left(\frac{R_i + C_i}{2\times n}\right)^2$$
The asymptotic standard error is calculated using (Scott, 1955, p. 325):
$$ASE = \sqrt{\left(\frac{1}{1 - p_e}\right)^2\times\frac{p_0\times\left(1 - p_0\right)}{n - 1}}$$
The p-value (significance) is then calculated using:
$$z_{\pi} = \frac{\pi}{ASE}$$
$$sig. = 2\times\left(1 - \Phi\left(z_{\pi}\right)\right)$$
Symbols used
- \(F_{i,j}\), the observed count in row i and column j.
- \(r\), the number of rows (categories in the first variable)
- \(c\), the number of columns (categories in the second variable)
- \(n\), the total number of scores
- \(R_i\), the row total of row i. \(R_i = \sum_{j=1}^c F_{i,j}\)
- \(C_j\), the column total of column j. \(C_j = \sum_{i=1}^r F_{i,j}\)
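The steps above can be sketched end to end without the package's tab_cross helper. This is a stand-alone illustration on hypothetical data (pandas and the standard library assumed), not the package's implementation:

```python
import pandas as pd
from statistics import NormalDist

# Hypothetical ratings from two raters (illustration only)
field1 = ["a", "a", "b", "b"]
field2 = ["a", "a", "a", "b"]

# Square cross table: reindex so both axes share one category order
cats = sorted(set(field1) | set(field2))
ct = pd.crosstab(pd.Series(field1), pd.Series(field2))
ct = ct.reindex(index=cats, columns=cats, fill_value=0)

n = ct.to_numpy().sum()
P = sum(ct.iloc[i, i] for i in range(len(cats)))  # diagonal agreement count
p0 = P / n                                        # observed agreement proportion
R = ct.sum(axis=1).to_numpy()                     # row totals R_i
C = ct.sum(axis=0).to_numpy()                     # column totals C_j
pe = (((R + C) / (2 * n)) ** 2).sum()             # expected agreement proportion

pi = (p0 - pe) / (1 - pe)                         # Scott's pi
ase = ((1 / (1 - pe)) ** 2 * p0 * (1 - p0) / (n - 1)) ** 0.5
z = pi / ase
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(pi, 4), round(z, 4))  # 0.4667 0.875
```

The packaged es_scott_pi(field1, field2) returns these same quantities as a one-row dataframe.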
References
Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. The Public Opinion Quarterly, 19(3), 321–325.
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076