Module stikpetP.tests.test_fisher

import math
from ..other.table_cross import tab_cross

def ts_fisher(field1, field2, categories1=None, categories2=None):
    '''
    Fisher Exact Test
    -----------------
    
    Perhaps the most commonly used test when you have two binary variables is the Fisher (Exact) Test. It tests if "the relative proportions of one variable are independent of the second variable; in other words, the proportions at one variable are the same for different values of the second variable" (McDonald, 2014, p. 77).
    
    Note that for a 2x2 table there are quite a lot of different tests. Upton (1982) discusses 24 of them. For larger tables a Fisher-Freeman-Halton Exact Test could be used.

    It's important to note that the test assumes the margins are fixed (i.e. the row and column totals don't change), so only use this test if this assumption is valid for your data.

    As Hitchcock (2009, pp. 3–4) points out, the history is a bit murky. Some refer to Fisher (1922), who does seem to mention the exact distribution in a footnote on page 339, but the test is supposedly first fully discussed by Fisher in the fifth edition of his book (1934). Irwin (1935) notes that his own paper was already completed in 1933, but that its publication was delayed. He also refers to Yates (1934), who discusses the test as well and refers to personal communication with Fisher. Another paper by Fisher (1935b) is also sometimes referred to. Fisher (1935a, pp. 24–29) described an experiment and the exact test to use, commonly known as the Lady Tasting Tea experiment.

    This function is shown in this [YouTube video](https://youtu.be/gvjzU4FayMs) and the test is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Tests/FisherExactTest.html).
    
    Parameters
    ----------
    field1 : pandas series
        data with categories for the rows
    field2 : pandas series
        data with categories for the columns
    categories1 : list or dictionary, optional
        the two categories to use from field1. If not set the first two found will be used
    categories2 : list or dictionary, optional
        the two categories to use from field2. If not set the first two found will be used

    Returns
    -------
    pval : float
        the two-sided p-value (sig.)
    
    Notes
    -----    
    The formula used is from Fisher (1950, p. 96):
    $$p = \\sum_{i=a_{min}}^{a_{max}}\\begin{cases} p_i & \\text{if } p_i \\leq p_s \\\\  0 & \\text{ else } \\end{cases}$$
    
    With:
    $$p_x = \\frac{\\binom{R_1}{x}\\times \\binom{n - R_1}{C_1-x}}{\\binom{n}{C_1}}$$
    $$a_{min} = \\max\\left(0, C_1 + R_1 - n\\right)$$
    $$a_{max} = \\min\\left(R_1, C_1\\right)$$
    $$\\binom{x}{y}=\\frac{x!}{y!\\times\\left(x-y\\right)!}$$
    
    *Symbols used:*
    
    * \\(p_s\\), the probability of the sample cross table, i.e. \\(p_x\\) with x being the upper-left cell of the cross table from the sample data.
    * \\(R_1\\), the total of the first row,
    * \\(C_1\\), the total of the first column,
    * \\(n\\), the total sample size.
    
    The minimum value for 'a' follows from the cells being counts, so none of them can be negative. Zero would be the lowest possible value, but once 'a' is set and the margins are fixed, the other cells must also be non-negative. For 'b' this is no problem, since b = R1 - a, and neither is it for c = C1 - a. However, 'd' could be negative even if a = 0: d = n - R1 - c, and since c = C1 - a, this gives d = n - R1 - C1 + a, which could be negative if R1 + C1 > n. So 'a' must be at least C1 + R1 - n.
    
    The maximum for 'a' is simply the minimum of its row total and its column total.
    
    Note that \\(p_x\\) is the probability mass function of a hypergeometric distribution.

    
    Before, After and Alternatives
    ------------------------------
    Before running the test you might first want to get an impression using a cross table:
    [tab_cross](../other/table_cross.html#tab_cross)

    After this you might want an effect size measure; a lot of them are available via:
    [es_bin_bin](../effect_sizes/eff_size_bin_bin.html#es_bin_bin)
    

    References
    ----------
    Fisher, R. A. (1922). On the interpretation of χ2 from contingency tables, and the calculation of p. *Journal of the Royal Statistical Society, 85*(1), 87–94. https://doi.org/10.2307/2340521
    
    Fisher, R. A. (1934). *Statistical methods for research workers* (5th ed.). Oliver and Boyd.
    
    Fisher, R. A. (1935a). *The design of experiments*. Oliver and Boyd.
    
    Fisher, R. A. (1935b). The logic of inductive inference. *Journal of the Royal Statistical Society, 98*(1), 39–82. https://doi.org/10.2307/2342435
    
    Fisher, R. A. (1950). *Statistical methods for research workers* (11th rev.). Oliver and Boyd.
    
    Hitchcock, D. B. (2009). Yates and contingency tables: 75 years later. *Journal Électronique d’Histoire Des Probabilités et de La Statistique, 5*(2), 1–14.
    
    Irwin, J. O. (1935). Tests of significance for differences between percentages based on small numbers. *Metron, 12*(2), 83–94.
    
    McDonald, J. H. (2014). *Handbook of biological statistics* (3rd ed.). Sparky House Publishing.
    
    Upton, G. J. G. (1982). A comparison of alternative tests for the 2 x 2 comparative trial. *Journal of the Royal Statistical Society. Series A (General), 145*(1), 86–105. https://doi.org/10.2307/2981423
    
    Yates, F. (1934). Contingency tables involving small numbers and the chi square test. *Supplement to the Journal of the Royal Statistical Society, 1*(2), 217–235. https://doi.org/10.2307/2983604

    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    >>> import pandas as pd
    >>> pd.set_option('display.width',1000)
    >>> pd.set_option('display.max_columns', 1000)
    >>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
    >>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> ts_fisher(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"])
    0.004338292519487543
    
    '''
    
    # determine sample cross table
    tab = tab_cross(field1, field2, order1=categories1, order2=categories2, percent=None, totals="exclude")
    
    # cell values of sample cross table
    a = tab.iloc[0,0]
    b = tab.iloc[0,1]
    c = tab.iloc[1,0]
    d = tab.iloc[1,1]
    
    # row, column, and grand total
    R1 = a + b
    R2 = c + d
    C1 = a + c
    n = R1 + R2
    
    # probability of the sample table
    den = math.comb(n, C1)
    pSample = math.comb(R1, a) * math.comb(R2, c) / den
    
    # loop over all possible tables with the same margins
    pVal = 0
    for a in range(max(0, C1 + R1 - n), min(C1, R1) + 1):
        c = C1 - a
        pTable = math.comb(R1, a) * math.comb(R2, c) / den
        if pTable <= pSample:
            pVal += pTable
            
    return pVal
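As a standalone cross-check of the enumeration above, the same formulas can be applied to the cell counts of Fisher's Lady Tasting Tea table. This is a minimal stdlib-only sketch; nothing in it depends on this module:

```python
import math

def fisher_p(a, b, c, d):
    """Two-sided Fisher exact p-value for a 2x2 table: sum every table
    probability (with the same margins) that does not exceed the
    sample table's probability."""
    R1, R2 = a + b, c + d
    C1, n = a + c, a + b + c + d
    den = math.comb(n, C1)
    p_sample = math.comb(R1, a) * math.comb(R2, c) / den
    p_val = 0.0
    for x in range(max(0, C1 + R1 - n), min(R1, C1) + 1):
        p_x = math.comb(R1, x) * math.comb(R2, C1 - x) / den
        if p_x <= p_sample:
            p_val += p_x
    return p_val

# Lady Tasting Tea layout (rows 3,1 and 1,3): exact two-sided p = 34/70.
print(fisher_p(3, 1, 1, 3))
```

For this table the upper-left cell can range from 0 to 4, and only the middle table (probability 36/70) exceeds the sample table's probability of 16/70, so the sum is (1 + 16 + 16 + 1)/70 = 34/70 ≈ 0.486.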

Functions

def ts_fisher(field1, field2, categories1=None, categories2=None)
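The lower bound a_min = max(0, C1 + R1 - n) derived in the Notes can also be checked by brute force. A small sketch with hypothetical example margins chosen so that R1 + C1 > n:

```python
# Hypothetical fixed margins with R1 + C1 > n, so a = 0 is impossible.
R1, C1, n = 5, 4, 6            # first row total, first column total, grand total
R2 = n - R1                    # second row total

a_min = max(0, C1 + R1 - n)    # claimed lower bound for the upper-left cell
a_max = min(R1, C1)            # upper bound for the upper-left cell

valid = []                     # values of 'a' giving only non-negative cells
for a in range(0, a_max + 1):
    b, c = R1 - a, C1 - a
    d = R2 - c                 # equals n - R1 - C1 + a
    if min(b, c, d) >= 0:
        valid.append(a)

print(a_min, valid)            # the valid values start exactly at a_min
```

With these margins a_min is 3, and indeed only a = 3 and a = 4 produce tables without a negative cell.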
