Module stikpetP.effect_sizes.eff_size_phi

import pandas as pd
from ..other.table_cross import tab_cross

def es_phi(field1, field2, categories1=None, categories2=None):
    '''
    Pearson/Yule Phi Coefficient / Cole C2 / Mean Square Contingency
    -----------------------------
    
    After performing a chi-square test, the question of effect size comes up. An obvious candidate on which to base an effect size measure is the test statistic itself, \\(\\chi^2\\). One of the earliest and most frequently mentioned measures that does so is the phi coefficient (the square root of the mean square contingency). Both Yule (1912, p. 596) and Pearson (1900, p. 12) mention this measure, and Cole (1949, p. 415) refers to it as Cole C2. It is also the same as Cohen's w (Cohen, 1988, p. 216), although Cohen does not restrict w to 2x2 tables.
    
    Interestingly, phi gives the same result as coding the two categories of each variable as 0 and 1 and then calculating the regular (Pearson) correlation coefficient.
    
    Pearson (1904, p. 6) calls the squared value, \\(\\phi^2\\) (i.e. without taking the square root), the Mean Square Contingency.
    
    
    Parameters
    ----------
    field1 : pandas series
        data with categories for the rows
    field2 : pandas series
        data with categories for the columns
    categories1 : list or dictionary, optional
        the two categories to use from field1. If not set, the first two categories found are used
    categories2 : list or dictionary, optional
        the two categories to use from field2. If not set, the first two categories found are used

    Returns
    -------
    phi : float
        the phi coefficient
        
    Notes
    -----    
    The formula used is (Pearson, 1900, p. 12):
    $$\\phi = \\frac{a\\times d - b\\times c}{\\sqrt{R_1\\times R_2 \\times C_1 \\times C_2}}$$
    
    *Symbols used:*
    
    * \\(a\\) the count in the top-left cell of the cross table
    * \\(b\\) the count in the top-right cell of the cross table 
    * \\(c\\) the count in the bottom-left cell of the cross table 
    * \\(d\\) the count in the bottom-right cell of the cross table 
    * \\(R_i\\) the sum of counts in the i-th row 
    * \\(C_i\\) the sum of counts in the i-th column 
    
    The formula is also sometimes expressed with a \\(\\chi^2\\) value (Pearson, 1904, p. 6; Cohen, 1988, p. 216):
    $$\\phi = \\sqrt{\\frac{\\chi^2}{n}}$$
    
    Note that Cohen's w uses the same formula but is not limited to 2x2 tables.
    
    See Also
    --------
    stikpetP.other.thumb_cohen_w.th_cohen_w : rules of thumb for Cohen w

    References
    ----------
    Cohen, J. (1988). *Statistical power analysis for the behavioral sciences* (2nd ed.). L. Erlbaum Associates.
    
    Cole, L. C. (1949). The measurement of interspecific associaton. *Ecology, 30*(4), 411–424. https://doi.org/10.2307/1932444
    
    Pearson, K. (1900). Mathematical Contributions to the Theory of Evolution. VII. On the Correlation of Characters not Quantitatively Measurable. *Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character*, 195, 1–405.
    
    Pearson, K. (1904). *Contributions to the Mathematical Theory of Evolution. XIII. On the theory of contingency and its relation to association and normal correlation*. Dulau and Co.
    
    Yule, G. U. (1912). On the methods of measuring association between two attributes. *Journal of the Royal Statistical Society, 75*(6), 579–652. https://doi.org/10.2307/2340126
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    >>> pd.set_option('display.width',1000)
    >>> pd.set_option('display.max_columns', 1000)
    >>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
    >>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> es_phi(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"])
    np.float64(0.1293456121124377)
    
        
    '''
    # determine sample cross table
    tab = tab_cross(field1, field2, order1=categories1, order2=categories2, percent=None, totals="exclude")
    
    # cell values of sample cross table
    a = tab.iloc[0,0]
    b = tab.iloc[0,1]
    c = tab.iloc[1,0]
    d = tab.iloc[1,1]
    
    # row and column totals
    R1 = a+b
    R2 = c+d
    C1 = a+c
    C2 = b+d
    
    # phi coefficient (Pearson, 1900, p. 12)
    phi = (a*d - b*c)/(R1*R2*C1*C2)**0.5
    
    return phi

Functions

def es_phi(field1, field2, categories1=None, categories2=None)

Pearson/Yule Phi Coefficient / Cole C2 / Mean Square Contingency

After performing a chi-square test, the question of effect size comes up. An obvious candidate on which to base an effect size measure is the test statistic itself, \(\chi^2\). One of the earliest and most frequently mentioned measures that does so is the phi coefficient (the square root of the mean square contingency). Both Yule (1912, p. 596) and Pearson (1900, p. 12) mention this measure, and Cole (1949, p. 415) refers to it as Cole C2. It is also the same as Cohen's w (Cohen, 1988, p. 216), although Cohen does not restrict w to 2x2 tables.

Interestingly, phi gives the same result as coding the two categories of each variable as 0 and 1 and then calculating the regular (Pearson) correlation coefficient.
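The equivalence with the regular correlation coefficient can be checked with a small sketch. The counts are hypothetical, and `phi_from_counts` and `pearson_r` are illustrative helpers written for this example, not functions from stikpetP.

```python
import math

def phi_from_counts(a, b, c, d):
    """Phi for a 2x2 table [[a, b], [c, d]] using the cell-count formula."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def pearson_r(x, y):
    """Plain Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical counts, chosen only for illustration
a, b, c, d = 30, 10, 20, 40

# Expand the table to one (row, column) pair per case.
# Coding: first row/column category = 1, second category = 0.
x = [1] * (a + b) + [0] * (c + d)          # row variable
y = [1] * a + [0] * b + [1] * c + [0] * d  # column variable

phi = phi_from_counts(a, b, c, d)
r = pearson_r(x, y)
print(phi, r)  # both are about 0.408
```

Reversing the 0/1 coding of one variable (equivalently, swapping the two rows or the two columns of the table) flips the sign but not the magnitude.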

Pearson (1904, p. 6) calls the squared value, \(\phi^2\) (i.e. without taking the square root), the Mean Square Contingency.

Parameters

field1 : pandas series
data with categories for the rows
field2 : pandas series
data with categories for the columns
categories1 : list or dictionary, optional
the two categories to use from field1. If not set, the first two categories found are used
categories2 : list or dictionary, optional
the two categories to use from field2. If not set, the first two categories found are used

Returns

phi : float
the phi coefficient
 

Notes

The formula used is (Pearson, 1900, p. 12):
$$\phi = \frac{a\times d - b\times c}{\sqrt{R_1\times R_2 \times C_1 \times C_2}}$$

Symbols used:

  • a the count in the top-left cell of the cross table
  • b the count in the top-right cell of the cross table
  • c the count in the bottom-left cell of the cross table
  • d the count in the bottom-right cell of the cross table
  • R_i the sum of counts in the i-th row
  • C_i the sum of counts in the i-th column

The formula is also sometimes expressed with a \(\chi^2\) value (Pearson, 1904, p. 6; Cohen, 1988, p. 216):
$$\phi = \sqrt{\frac{\chi^2}{n}}$$
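A minimal sketch of how the two expressions relate, assuming the Pearson chi-square statistic is computed without a continuity correction. The counts are hypothetical and `chi_square` is an illustrative helper, not part of stikpetP. Because of the square root, the chi-square form returns only the magnitude of phi; the sign of the association is lost.

```python
import math

def chi_square(table):
    """Pearson chi-square (no continuity correction) for a list-of-rows table.

    Returns the statistic and the total sample size.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

# Hypothetical counts with a negative association
table = [[10, 30], [40, 20]]
(a, b), (c, d) = table

# Cell-count formula: keeps the sign
phi_signed = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Chi-square formula: magnitude only
chi2, n = chi_square(table)
phi_from_chi2 = math.sqrt(chi2 / n)

print(phi_signed, phi_from_chi2)  # about -0.408 and 0.408
```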

Note that Cohen's w uses the same formula but is not limited to 2x2 tables.
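As an illustration of the larger-table case, the sketch below applies the same \(\sqrt{\chi^2/n}\) formula to a hypothetical 2x3 table. The helper `cohen_w` is written for this example only and is not the stikpetP implementation; note that `es_phi` itself reads only the four cells of a 2x2 table.

```python
import math

def cohen_w(table):
    """Cohen's w = sqrt(chi2 / n) for an r x c table given as a list of rows."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / n)

# Hypothetical 2x3 counts, chosen only for illustration
table_2x3 = [[20, 15, 5],
             [10, 15, 35]]

w = cohen_w(table_2x3)
print(w)
```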

See Also

th_cohen_w()
rules of thumb for Cohen w

References

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). L. Erlbaum Associates.

Cole, L. C. (1949). The measurement of interspecific associaton. Ecology, 30(4), 411–424. https://doi.org/10.2307/1932444

Pearson, K. (1900). Mathematical Contributions to the Theory of Evolution. VII. On the Correlation of Characters not Quantitatively Measurable. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 195, 1–405.

Pearson, K. (1904). Contributions to the Mathematical Theory of Evolution. XIII. On the theory of contingency and its relation to association and normal correlation. Dulau and Co.

Yule, G. U. (1912). On the methods of measuring association between two attributes. Journal of the Royal Statistical Society, 75(6), 579–652. https://doi.org/10.2307/2340126

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples

>>> pd.set_option('display.width',1000)
>>> pd.set_option('display.max_columns', 1000)
>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> es_phi(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"])
np.float64(0.1293456121124377)