Module stikpetP.tests.test_fisher
import math
from ..other.table_cross import tab_cross
def ts_fisher(field1, field2, categories1=None, categories2=None):
'''
Fisher Exact Test
-----------------
Perhaps the most commonly used test when you have two binary variables is the Fisher (Exact) Test. It tests if "the relative proportions of one variable are independent of the second variable; in other words, the proportions at one variable are the same for different values of the second variable" (McDonald, 2014, p. 77).
Note that for a 2x2 table there are quite a lot of different tests. Upton (1982) discusses 24 of them. For larger tables a Fisher-Freeman-Halton Exact Test could be used.
It's important to note that the test assumes the margins are fixed (i.e. the row and column totals don't change), so only use this test if this assumption is valid for your data.
As Hitchcock (2009, pp. 3–4) points out, the history is a bit murky. Some refer to Fisher (1922), who does seem to mention the exact distribution in a footnote on page 339, but the test is supposedly first fully discussed by Fisher in the fifth edition of his book (1934). Irwin (1935) notes that his paper was already completed in 1933, but publication was delayed. He also refers to Yates (1934), who also discusses the test and refers to personal communication with Fisher. Another paper from Fisher (1935b) is also sometimes referred to. Fisher (1935a, pp. 24–29) described an experiment and the exact test to use, commonly known as the Lady Tasting Tea Experiment.
This function is shown in this [YouTube video](https://youtu.be/gvjzU4FayMs) and the test is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Tests/FisherExactTest.html)
Parameters
----------
field1 : pandas series
data with categories for the rows
field2 : pandas series
data with categories for the columns
categories1 : list or dictionary, optional
the two categories to use from field1. If not set the first two found will be used
categories2 : list or dictionary, optional
the two categories to use from field2. If not set the first two found will be used
Returns
-------
pval : float
the two-sided p-value (sig.)
Notes
-----
The formula used is from Fisher (1950, p. 96):
$$p = \\sum_{i=a_{min}}^{a_{max}}\\begin{cases} p_i & \\text{if } p_i \\leq p_s \\\\ 0 & \\text{ else } \\end{cases}$$
With:
$$p_x = \\frac{\\binom{R_1}{x}\\times \\binom{n - R_1}{C_1-x}}{\\binom{n}{C_1}}$$
$$a_{min} = \\max\\left(0, C_1 + R_1 - n\\right)$$
$$a_{max} = \\min\\left(R_1, C_1\\right)$$
$$\\binom{x}{y}=\\frac{x!}{y!\\times\\left(x-y\\right)!}$$
*Symbols used:*
* \\(p_s\\), the probability of the sample cross table, i.e. \\(p_x\\) with x being the upper-left cell of the cross table from the sample data.
* \\(R_1\\), the total of the first row.
* \\(C_1\\), the total of the first column.
* \\(n\\), the total sample size.
The reason for the minimum value of 'a' is, first, that it cannot be negative, since these are counts, so 0 would be the lowest possible value. However, once 'a' is set and the totals are fixed, all other cells must also be positive (or zero). The value for 'b' is no issue: it is simply R1 - a. The value for 'c' is also no issue: it is simply C1 - a. However, 'd' might be negative, even if a = 0. The value for 'd' is n - R1 - c, and since c = C1 - a, we get d = n - R1 - C1 + a. This could be negative if R1 + C1 > n, so 'a' must be at least C1 + R1 - n.
The maximum for 'a' is simply the minimum of its row total and its column total.
Note that \\(p_x\\) is the probability mass function of a hypergeometric distribution.
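As a small numeric check of these definitions, the sketch below (plain Python, with margins chosen only for illustration) computes \\(p_x\\) via the binomial-coefficient formula and confirms that the probabilities over all admissible values of 'a' sum to 1, including a case where \\(a_{min} = C_1 + R_1 - n > 0\\):

```python
import math

def p_x(x, R1, C1, n):
    # p_x = C(R1, x) * C(n - R1, C1 - x) / C(n, C1)
    return math.comb(R1, x) * math.comb(n - R1, C1 - x) / math.comb(n, C1)

# illustrative margins where R1 + C1 > n, so a_min is not 0
R1, C1, n = 6, 5, 8
a_min = max(0, C1 + R1 - n)   # 3: fewer than 3 in cell 'a' would force d < 0
a_max = min(R1, C1)           # 5
total = sum(p_x(x, R1, C1, n) for x in range(a_min, a_max + 1))
# total is 1.0, as expected for a hypergeometric pmf over its full support
```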
Before, After and Alternatives
------------------------------
Before running the test you might first want to get an impression using a cross table:
[tab_cross](../other/table_cross.html#tab_cross)
After this you might want an effect size measure; a lot of them are available via:
[es_bin_bin](../effect_sizes/eff_size_bin_bin.html#es_bin_bin)
References
----------
Fisher, R. A. (1922). On the interpretation of χ2 from contingency tables, and the calculation of p. *Journal of the Royal Statistical Society, 85*(1), 87–94. https://doi.org/10.2307/2340521
Fisher, R. A. (1934). *Statistical methods for research workers* (5th ed.). Oliver and Boyd.
Fisher, R. A. (1935a). *The design of experiments*. Oliver and Boyd.
Fisher, R. A. (1935b). The logic of inductive inference. *Journal of the Royal Statistical Society, 98*(1), 39–82. https://doi.org/10.2307/2342435
Fisher, R. A. (1950). *Statistical methods for research workers* (11th rev.). Oliver and Boyd.
Hitchcock, D. B. (2009). Yates and contingency tables: 75 years later. *Journal Électronique d’Histoire Des Probabilités et de La Statistique, 5*(2), 1–14.
Irwin, J. O. (1935). Tests of significance for differences between percentages based on small numbers. *Metron, 12*(2), 83–94.
McDonald, J. H. (2014). *Handbook of biological statistics* (3rd ed.). Sparky House Publishing.
Upton, G. J. G. (1982). A comparison of alternative tests for the 2 x 2 comparative trial. *Journal of the Royal Statistical Society. Series A (General), 145*(1), 86–105. https://doi.org/10.2307/2981423
Yates, F. (1934). Contingency tables involving small numbers and the chi square test. *Supplement to the Journal of the Royal Statistical Society, 1*(2), 217–235. https://doi.org/10.2307/2983604
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
--------
>>> import pandas as pd
>>> pd.set_option('display.width',1000)
>>> pd.set_option('display.max_columns', 1000)
>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ts_fisher(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"])
0.004338292519487543
'''
# determine sample cross table
tab = tab_cross(field1, field2, order1=categories1, order2=categories2, percent=None, totals="exclude")
# cell values of sample cross table
a = tab.iloc[0,0]
b = tab.iloc[0,1]
c = tab.iloc[1,0]
d = tab.iloc[1,1]
# row, column, and grand total
R1 = a + b
R2 = c + d
C1 = a + c
n = R1 + R2
# probability of sample table
pSample = math.comb(R1, a) * math.comb(R2, c) / math.comb(n, C1)
# loop over all possible tables with same total
den = math.comb(n, C1)
pVal = 0
for a in range(max(0, C1 + R1 - n), min(C1, R1) + 1):
    b = R1 - a
    c = C1 - a
    d = R2 - c
    pTable = math.comb(R1, a) * math.comb(R2, c) / den
    if pTable <= pSample:
        pVal = pVal + pTable
return pVal
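For readers without pandas data at hand, the same summation can be sketched directly on raw 2x2 counts. `fisher_2x2` below is a hypothetical helper, not part of this package, shown with the counts from Fisher's Lady Tasting Tea experiment (3 of 4 cups classified correctly in each group):

```python
import math

def fisher_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for a 2x2 table: sum the probabilities
    of all tables with the same margins that are no more likely than the
    observed one."""
    R1, R2 = a + b, c + d
    C1 = a + c
    n = R1 + R2
    den = math.comb(n, C1)
    p_sample = math.comb(R1, a) * math.comb(R2, c) / den
    p_val = 0.0
    for x in range(max(0, C1 + R1 - n), min(C1, R1) + 1):
        p_table = math.comb(R1, x) * math.comb(R2, C1 - x) / den
        if p_table <= p_sample:
            p_val += p_table
    return p_val

p = fisher_2x2(3, 1, 1, 3)
# p = 34/70, approximately 0.4857
```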