Module `stikpetP.tests.test_freeman_tukey_ind`

import pandas as pd
from scipy.stats import chi2
from ..other.table_cross import tab_cross

def ts_freeman_tukey_ind(field1, field2, categories1=None, categories2=None, cc= None, version=1):
    '''
    Freeman-Tukey Test of Independence
    ----------------------------------
    To test if two nominal variables have an association, the most commonly used test is the Pearson chi-square test of independence (Pearson, 1900). If the significance of this test is below 0.05 (or another pre-defined threshold), the two nominal variables have a significant association.
    
    The test compares the observed counts of the cross table with the so-called expected counts. The expected counts are the number of respondents you would expect in each cell if the two variables were independent.
    
    The Freeman-Tukey test does the same, but compares the square roots of the observed and expected counts. The square-root transformation is meant to make binomial or Poisson counts approximately normally distributed.
    
    One problem, though, is that the test should only be used if not too many cells have an expected count of less than 5 and the minimum expected count is at least 1, so these conditions have to be checked first. Most often 'not too many cells' is fixed at no more than 20% of the cells. These are often referred to as the 'Cochran conditions', after Cochran (1954, p. 420). Note that, for example, Fisher (1925, p. 83) is stricter and requires all cells to have an expected count of at least 5.
    
    Parameters
    ----------
    field1 : list or pandas series
        the first categorical field
        
    field2 : list or pandas series
        the second categorical field
        
    categories1 : list or dictionary, optional
        order and/or selection for categories of field1
        
    categories2 : list or dictionary, optional
        order and/or selection for categories of field2
        
    cc : {None, "yates", "pearson", "williams"}, optional
        method for continuity correction
        
    version : {1, 2, 3}, optional
        which version of the test to use (see notes)
    
    Returns
    -------
    A dataframe with:
    
    * *n*, the sample size
    * *n rows*, number of categories used in first field
    * *n col.*, number of categories used in second field
    * *statistic*, the test statistic (chi-square value)
    * *df*, the degrees of freedom
    * *p-value*, the significance (p-value)
    * *min. exp.*, the minimum expected count
    * *prop. exp. below 5*, proportion of cells with expected count less than 5
    * *test*, description of the test used
    
    Notes
    -----
    The formula used for version 1 is (Bishop et al., 2007, p. 513):
    $$T^2=4\\times\\sum_{i=1}^r \\sum_{j=1}^c \\left(\\sqrt{F_{i,j}} - \\sqrt{E_{i,j}}\\right)^2$$
    
    The formula used for version 2 is (Bishop, 1969, p. 284; Lawal, 1984, p. 415):
    $$T^2=\\sum_{i=1}^r \\sum_{j=1}^c \\left(\\sqrt{F_{i,j}}+\\sqrt{F_{i,j}+1} - \\sqrt{4\\times E_{i,j}+1}\\right)^2$$

    The formula used for version 3 is (Read & Cressie, 1988, p. 82):
    $$T^2=\\sum_{i=1}^r \\sum_{j=1}^c \\left(\\sqrt{F_{i,j}}+\\sqrt{F_{i,j}+1} - \\sqrt{4\\times\\left(E_{i,j}+1\\right)}\\right)^2$$
    
    
    $$df = \\left(r - 1\\right)\\times\\left(c - 1\\right)$$
    $$sig. = 1 - \\chi^2\\left(T^2,df\\right)$$
    
    With:
    $$n = \\sum_{i=1}^r \\sum_{j=1}^c F_{i,j}$$
    $$E_{i,j} = \\frac{R_i\\times C_j}{n}$$
    $$R_i = \\sum_{j=1}^c F_{i,j}$$
    $$C_j = \\sum_{i=1}^r F_{i,j}$$
    
    *Symbols used:*
    
    * $r$, the number of categories in the first variable (the number of rows)
    * $c$, the number of categories in the second variable (the number of columns)
    * $F_{i,j}$, the observed count in row i and column j
    * $E_{i,j}$, the expected count in row i and column j
    * $R_i$, the i-th row total
    * $C_j$, the j-th column total
    * $n$, the sum of all counts
    * $\\chi^2\\left(\\dots\\right)$, the chi-square cumulative distribution function
    
    The test is attributed to Freeman and Tukey (1950), although I could not find the formula in that paper. Ayinde and Abidoye (2010) show the formula for version 1 in more modern notation, and another source for version 2 is Ozturk et al. (2023).
    
    The Pearson correction (pearson) is calculated using (E.S. Pearson, 1947, p. 157):
    $$\\chi_{PP}^2 = T^2\\times\\frac{n - 1}{n}$$
    
    The Williams correction (williams) is calculated using (Williams, 1976, p. 36):
    $$\\chi_{PW}^2 = \\frac{T^2}{q}$$
    
    With:
    $$q = 1 + \\frac{\\left(n\\times\\left(\\sum_{i=1}^r \\frac{1}{R_i}\\right)-1\\right) \\times \\left(n\\times\\left(\\sum_{j=1}^c \\frac{1}{C_j}\\right)-1\\right)}{6\\times n\\times df}$$
    
    References
    ----------
    Ayinde, K., & Abidoye, A. O. (2010). Simplified Freeman-Tukey test statistics for testing probabilities in contingency tables. *Science World Journal, 2*(2), 21–27. doi:10.4314/swj.v2i2.51730

    Bishop, Y. M. M. (1969). Calculating smoothed contingency tables. In J. P. Bunker, W. H. Forrest, F. Mosteller, & L. D. Vandam (Eds.), The national halothane study: A study of the possible association between halothane anesthesia and postoperative hepatic necrosis (pp. 273–286). National Institute of Health.
    
    Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (2007). *Discrete multivariate analysis*. Springer.
    
    Cochran, W. G. (1954). Some methods for strengthening the common χ2 tests. *Biometrics, 10*(4), 417. doi:10.2307/3001616
    
    Fisher, R. A. (1925). *Statistical methods for research workers*. Oliver and Boyd.
    
    Freeman, M. F., & Tukey, J. W. (1950). Transformations related to the angular and the square root. *The Annals of Mathematical Statistics, 21*(4), 607–611. doi:10.1214/aoms/1177729756
    
    Lawal, H. B. (1984). Comparisons of the X², Y², Freeman-Tukey and Williams's improved G² test statistics in small samples of one-way multinomials. *Biometrika, 71*(2), 415–418. doi:10.2307/2336263
    
    McDonald, J. H. (2014). *Handbook of biological statistics* (3rd ed.). Sparky House Publishing.
    
    Ozturk, E., Basol, M., Goksuluk, D., & Karahan, S. (2023). Performance comparison of independence tests in two-way contingency table. *REVSTAT-Statistical Journal, 21*(2), Article 2. doi:10.57805/revstat.v21i2.403
    
    Pearson, E. S. (1947). The choice of statistical tests illustrated on the interpretation of data classed in a 2 × 2 table. *Biometrika, 34*(1/2), 139–167. doi:10.2307/2332518
    
    Read, T. R. C., & Cressie, N. A. C. (1988). Goodness-of-fit statistics for discrete multivariate data. Springer-Verlag.
    
    Williams, D. A. (1976). Improved likelihood ratio tests for complete contingency tables. *Biometrika, 63*(1), 33–37. doi:10.2307/2335081
    
    Yates, F. (1934). Contingency tables involving small numbers and the chi square test. *Supplement to the Journal of the Royal Statistical Society, 1*(2), 217–235. doi:10.2307/2983604
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076

    
    '''
    testUsed = "Freeman-Tukey test of independence"
    if cc == "yates":
        testUsed = testUsed + ", with Yates continuity correction"
    
    #create the cross table
    ct = tab_cross(field1, field2, categories1, categories2, totals="include")
    
    #basic counts
    nrows = ct.shape[0] - 1
    ncols =  ct.shape[1] - 1
    n = ct.iloc[nrows, ncols]
    
    #determine the expected counts & chi-square value
    chi2Val = 0
    expMin = -1
    nExpBelow5 = 0    
    expC = pd.DataFrame()
    for i in range(0, nrows):
        for j in range(0, ncols):
            expC.at[i, j] = ct.iloc[nrows, j] * ct.iloc[i, ncols] / n
            
            #add or remove a half in case Yates correction
            if cc=="yates":
                if ct.iloc[i,j] > expC.iloc[i,j]:
                    ct.iloc[i,j] = ct.iloc[i,j] - 0.5
                elif ct.iloc[i,j] < expC.iloc[i,j]:
                    ct.iloc[i,j] = ct.iloc[i,j] + 0.5
            
            if version == 1:
                chi2Val = chi2Val + (ct.iloc[i, j]**0.5 - expC.iloc[i, j]**0.5)**2
            elif version == 2:
                chi2Val = chi2Val + (ct.iloc[i, j]**0.5+(ct.iloc[i, j]+1)**0.5 - (4*expC.iloc[i, j]+1)**0.5)**2
            elif version == 3:
                chi2Val = chi2Val + (ct.iloc[i, j]**0.5+(ct.iloc[i, j]+1)**0.5 - (4*(expC.iloc[i, j]+1))**0.5)**2
            
            #check if below 5
            if expMin < 0 or expC.iloc[i,j] < expMin:
                expMin = expC.iloc[i,j]            
            if expC.iloc[i,j] < 5:
                nExpBelow5 = nExpBelow5 + 1
    
    nExpBelow5 = nExpBelow5/(nrows*ncols)
    if version==1:
        chi2Val = 4*chi2Val
    
    #Degrees of freedom
    df = (nrows - 1)*(ncols - 1)
    
    #Williams and Pearson correction
    if cc == "williams":
        testUsed = testUsed + ", with Williams continuity correction"
        rTotInv = 0
        for i in range(0, nrows):
            rTotInv = rTotInv + 1 / ct.iloc[i, ncols]
        
        cTotInv = 0
        for j in range(0, ncols):
            cTotInv = cTotInv + 1 / ct.iloc[nrows, j]
        
        q = 1 + (n * rTotInv - 1) * (n * cTotInv - 1) / (6 * n * df)
        chi2Val = chi2Val / q
    elif cc == "pearson":
        testUsed = testUsed + ", with E.S. Pearson continuity correction"
        chi2Val = chi2Val * (n - 1) / n
    
    #The test
    pvalue = chi2.sf(chi2Val, df)
    
    #Prepare the results
    colNames = ["n", "n rows", "n col.", "statistic", "df", "p-value", "min. exp.", "prop. exp. below 5", "test"]
    testResults = pd.DataFrame([[n, nrows, ncols, chi2Val, df, pvalue, expMin, nExpBelow5, testUsed]], columns=colNames)
    pd.set_option('display.max_colwidth', None)
    
    return testResults

Functions

def ts_freeman_tukey_ind(field1, field2, categories1=None, categories2=None, cc=None, version=1)

Freeman-Tukey Test of Independence

To test if two nominal variables have an association, the most commonly used test is the Pearson chi-square test of independence (Pearson, 1900). If the significance of this test is below 0.05 (or another pre-defined threshold), the two nominal variables have a significant association.

The test compares the observed counts of the cross table with the so-called expected counts. The expected counts are the number of respondents you would expect in each cell if the two variables were independent.

The Freeman-Tukey test does the same, but compares the square roots of the observed and expected counts. The square-root transformation is meant to make binomial or Poisson counts approximately normally distributed.

One problem, though, is that the test should only be used if not too many cells have an expected count of less than 5 and the minimum expected count is at least 1, so these conditions have to be checked first. Most often 'not too many cells' is fixed at no more than 20% of the cells. These are often referred to as the 'Cochran conditions', after Cochran (1954, p. 420). Note that, for example, Fisher (1925, p. 83) is stricter and requires all cells to have an expected count of at least 5.
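
As a rough illustration of checking these conditions, the sketch below computes the expected counts for a small table and reports the minimum expected count and the proportion of cells below 5. The table values and the helper names `expected_counts` and `cochran_check` are made up for this example and are not part of the package; it uses plain Python lists rather than the cross table the function builds internally.

```python
# Hypothetical 2x3 table of observed counts (rows: field1, columns: field2).
observed = [[12, 5, 3],
            [8, 10, 2]]

def expected_counts(table):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def cochran_check(table, max_prop_below_5=0.20):
    """Return (minimum expected count, proportion of cells below 5, conditions met)."""
    expected = [e for row in expected_counts(table) for e in row]
    min_exp = min(expected)
    prop_below_5 = sum(e < 5 for e in expected) / len(expected)
    conditions_met = min_exp >= 1 and prop_below_5 <= max_prop_below_5
    return min_exp, prop_below_5, conditions_met

min_exp, prop_below_5, ok = cochran_check(observed)
print(min_exp, prop_below_5, ok)
```

The first two values correspond to the *min. exp.* and *prop. exp. below 5* entries that the function returns.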

Parameters

field1 : list or pandas series: the first categorical field
field2 : list or pandas series: the second categorical field
categories1 : list or dictionary, optional: order and/or selection for categories of field1
categories2 : list or dictionary, optional: order and/or selection for categories of field2
cc : {None, "yates", "pearson", "williams"}, optional: method for continuity correction
version : {1, 2, 3}, optional: which version of the test to use (see notes)

Returns

A dataframe with:

n, the sample size
n rows, number of categories used in first field
n col., number of categories used in second field
statistic, the test statistic (chi-square value)
df, the degrees of freedom
p-value, the significance (p-value)
min. exp., the minimum expected count
prop. exp. below 5, proportion of cells with expected count less than 5
test, description of the test used

Notes

The formula used for version 1 is (Bishop et al., 2007, p. 513): $T^2=4\times\sum_{i=1}^r \sum_{j=1}^c \left(\sqrt{F_{i,j}} - \sqrt{E_{i,j}}\right)^2$

The formula used for version 2 is (Bishop, 1969, p. 284; Lawal, 1984, p. 415): $T^2=\sum_{i=1}^r \sum_{j=1}^c \left(\sqrt{F_{i,j}}+\sqrt{F_{i,j}+1} - \sqrt{4\times E_{i,j}+1}\right)^2$

The formula used for version 3 is (Read & Cressie, 1988, p. 82): $T^2=\sum_{i=1}^r \sum_{j=1}^c \left(\sqrt{F_{i,j}}+\sqrt{F_{i,j}+1} - \sqrt{4\times\left(E_{i,j}+1\right)}\right)^2$

$df = \left(r - 1\right)\times\left(c - 1\right)$
$sig. = 1 - \chi^2\left(T^2,df\right)$

With:
$n = \sum_{i=1}^r \sum_{j=1}^c F_{i,j}$
$E_{i,j} = \frac{R_i\times C_j}{n}$
$R_i = \sum_{j=1}^c F_{i,j}$
$C_j = \sum_{i=1}^r F_{i,j}$

Symbols used:

$r$, the number of categories in the first variable (the number of rows)
$c$, the number of categories in the second variable (the number of columns)
$F_{i,j}$, the observed count in row i and column j
$E_{i,j}$, the expected count in row i and column j
$R_i$, the i-th row total
$C_j$, the j-th column total
$n$, the sum of all counts
$\chi^2\left(\dots\right)$, the chi-square cumulative distribution function
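
The three versions of the statistic can be sketched in plain Python as follows. This is only an illustration of the formulas above, not the package's implementation: the function name `freeman_tukey_statistic` and the example table are made up, and the p-value uses the closed form for the chi-square upper tail that holds only when df = 1 (for other df a library routine such as `scipy.stats.chi2.sf` would be needed).

```python
from math import erfc, sqrt

def freeman_tukey_statistic(table, version=1):
    """Return (T^2, df) for a table of observed counts given as a list of rows."""
    row_totals = [sum(row) for row in table]        # R_i
    col_totals = [sum(col) for col in zip(*table)]  # C_j
    n = sum(row_totals)
    t2 = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # E_ij
            if version == 1:
                t2 += 4 * (sqrt(f) - sqrt(e)) ** 2
            elif version == 2:
                t2 += (sqrt(f) + sqrt(f + 1) - sqrt(4 * e + 1)) ** 2
            elif version == 3:
                t2 += (sqrt(f) + sqrt(f + 1) - sqrt(4 * (e + 1))) ** 2
            else:
                raise ValueError("version must be 1, 2 or 3")
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return t2, df

# Hypothetical 2x2 table of observed counts.
observed = [[4, 1],
            [1, 4]]

for version in (1, 2, 3):
    t2, df = freeman_tukey_statistic(observed, version)
    # For df = 1 the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(t2 / 2))
    print(version, round(t2, 4), df, round(p_value, 4))
```

Raising an error for an unknown `version` is a choice made for this sketch, so that an unsupported value cannot silently produce a statistic of zero.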

The test is attributed to Freeman and Tukey (1950), although I could not find the formula in that paper. Ayinde and Abidoye (2010) show the formula for version 1 in more modern notation, and another source for version 2 is Ozturk et al. (2023).

The Pearson correction (pearson) is calculated using (E.S. Pearson, 1947, p. 157): $\chi_{PP}^2 = T^2\times\frac{n - 1}{n}$

The Williams correction (williams) is calculated using (Williams, 1976, p. 36): $\chi_{PW}^2 = \frac{T^2}{q}$

With: $q = 1 + \frac{\left(n\times\left(\sum_{i=1}^r \frac{1}{R_i}\right)-1\right) \times \left(n\times\left(\sum_{j=1}^c \frac{1}{C_j}\right)-1\right)}{6\times n\times df}$
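
A small sketch of both corrections, written directly from the formulas above. The helper names `pearson_correction` and `williams_q`, the example table, and the placeholder statistic are assumptions for illustration only. The sketch also assumes every row and column total is non-zero, since the Williams factor divides by them.

```python
def pearson_correction(t2, n):
    """E.S. Pearson correction: T^2 * (n - 1) / n."""
    return t2 * (n - 1) / n

def williams_q(table):
    """Williams factor q for a table of observed counts given as a list of rows."""
    row_totals = [sum(row) for row in table]        # R_i
    col_totals = [sum(col) for col in zip(*table)]  # C_j
    n = sum(row_totals)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    row_term = n * sum(1 / r for r in row_totals) - 1
    col_term = n * sum(1 / c for c in col_totals) - 1
    return 1 + row_term * col_term / (6 * n * df)

# Hypothetical 2x2 table and an arbitrary placeholder value for T^2.
observed = [[4, 1],
            [1, 4]]
n = sum(sum(row) for row in observed)
t2 = 4.6

q = williams_q(observed)
print("Williams:", t2 / q)
print("Pearson:", pearson_correction(t2, n))
```

Both corrections shrink the statistic, which makes the resulting p-value somewhat larger.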

References

Ayinde, K., & Abidoye, A. O. (2010). Simplified Freeman-Tukey test statistics for testing probabilities in contingency tables. Science World Journal, 2(2), 21–27. doi:10.4314/swj.v2i2.51730

Bishop, Y. M. M. (1969). Calculating smoothed contingency tables. In J. P. Bunker, W. H. Forrest, F. Mosteller, & L. D. Vandam (Eds.), The national halothane study: A study of the possible association between halothane anesthesia and postoperative hepatic necrosis (pp. 273–286). National Institute of Health.

Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (2007). Discrete multivariate analysis. Springer.

Cochran, W. G. (1954). Some methods for strengthening the common χ2 tests. Biometrics, 10(4), 417. doi:10.2307/3001616

Fisher, R. A. (1925). Statistical methods for research workers. Oliver and Boyd.

Freeman, M. F., & Tukey, J. W. (1950). Transformations related to the angular and the square root. The Annals of Mathematical Statistics, 21(4), 607–611. doi:10.1214/aoms/1177729756

Lawal, H. B. (1984). Comparisons of the X², Y², Freeman-Tukey and Williams's improved G² test statistics in small samples of one-way multinomials. Biometrika, 71(2), 415–418. doi:10.2307/2336263

McDonald, J. H. (2014). Handbook of biological statistics (3rd ed.). Sparky House Publishing.

Ozturk, E., Basol, M., Goksuluk, D., & Karahan, S. (2023). Performance comparison of independence tests in two-way contingency table. REVSTAT-Statistical Journal, 21(2), Article 2. doi:10.57805/revstat.v21i2.403

Pearson, E. S. (1947). The choice of statistical tests illustrated on the interpretation of data classed in a 2 × 2 table. Biometrika, 34(1/2), 139–167. doi:10.2307/2332518

Read, T. R. C., & Cressie, N. A. C. (1988). Goodness-of-fit statistics for discrete multivariate data. Springer-Verlag.

Williams, D. A. (1976). Improved likelihood ratio tests for complete contingency tables. Biometrika, 63(1), 33–37. doi:10.2307/2335081

Yates, F. (1934). Contingency tables involving small numbers and the chi square test. Supplement to the Journal of the Royal Statistical Society, 1(2), 217–235. doi:10.2307/2983604

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
