Module stikpetP.tests.test_pearson_ind
import pandas as pd
from scipy.stats import chi2
from ..other.table_cross import tab_cross
def ts_pearson_ind(field1, field2, categories1=None, categories2=None, cc=None):
    '''
Pearson Chi-Square Test of Independence
---------------------------------------
To test whether two nominal variables are associated, the most commonly used test is the Pearson chi-square test of independence (Pearson, 1900). If the significance (p-value) of this test is below the chosen threshold, usually 0.05, the association between the two nominal variables is considered statistically significant.
The test compares the observed counts of the cross table with the so-called expected counts. The expected counts are the number of respondents you would expect in each cell if the two variables were independent.
If, for example, I had 50 male and 50 female respondents, and 50 agreed with a statement and 50 disagreed, the expected count for each combination (male-agree, female-agree, male-disagree, and female-disagree) would be 25.
Note that if in the survey all males disagreed and all females agreed, there would be a full dependency (i.e. gender fully decides whether you agree or disagree), even though the row and column totals would still be 50. In essence, the Pearson chi-square test checks whether your data is closer to the expected counts (independence) or to full dependency.
One problem, though, is that the Pearson chi-square test should only be used if not too many cells have an expected count of less than 5, and the minimum expected count is at least 1. So you will have to check first whether these conditions are met. Most often 'not too many cells' is fixed at no more than 20% of the cells. These are often referred to as the 'Cochran conditions', after Cochran (1954, p. 420). Note that, for example, Fisher (1925, p. 83) is stricter and finds that all cells should have an expected count of at least 5.
Parameters
----------
field1 : list or pandas series
the first categorical field
field2 : list or pandas series
the second categorical field
categories1 : list or dictionary, optional
order and/or selection for categories of field1
categories2 : list or dictionary, optional
order and/or selection for categories of field2
cc : {None, "yates", "pearson", "williams"}, optional
method for continuity correction
Returns
-------
A dataframe with:
* *n*, the sample size
* *n rows*, number of categories used in first field
* *n col.*, number of categories used in second field
* *statistic*, the test statistic (chi-square value)
* *df*, the degrees of freedom
* *p-value*, the significance (p-value)
* *min. exp.*, the minimum expected count
* *prop. exp. below 5*, proportion of cells with expected count less than 5
* *test*, description of the test used
Notes
-----
The formula used is (Pearson, 1900, p. 165):
$$\\chi_p^2 = \\sum_{i=1}^r \\sum_{j=1}^c \\frac{\\left(F_{i,j} - E_{i,j}\\right)^2}{E_{i,j}}$$
$$df = \\left(r - 1\\right)\\times\\left(c - 1\\right)$$
$$sig. = 1 - \\chi^2\\left(\\chi_p^2, df\\right)$$
With:
$$E_{i,j} = \\frac{R_i \\times C_j}{n}$$
$$R_i = \\sum_{j=1}^c F_{i,j}$$
$$C_j = \\sum_{i=1}^r F_{i,j}$$
$$n = \\sum_{i=1}^r \\sum_{j=1}^c F_{i,j} = \\sum_{i=1}^r R_i = \\sum_{j=1}^c C_j$$
Symbols:
* \\(r\\), the number of rows
* \\(c\\), the number of columns
* \\(F_{i,j}\\), the observed count in row i and column j.
* \\(E_{i,j}\\), the expected count in row i and column j.
* \\(R_i\\), the row total of row i
* \\(C_j\\), the column total of column j
* \\(n\\), the overall total.
* \\(df\\), the degrees of freedom
The **Yates** correction uses \\(F_{i,j}'\\) instead of \\(F_{i,j}\\), defined as (Yates, 1934, p. 222):
$$F_{i,j}' = \\begin{cases} F_{i,j}-\\frac{1}{2} & \\text{ if } F_{i,j}> E_{i,j} \\\\ F_{i,j}+\\frac{1}{2} & \\text{ if } F_{i,j}< E_{i,j} \\\\ F_{i,j} & \\text{ if } F_{i,j}= E_{i,j} \\end{cases} $$
The **Williams** correction adjusts the Pearson chi-square value:
$$\\chi_{wil}^2 = \\frac{\\chi_p^2}{q}$$
With:
$$q = 1+\\frac{\\left(n\\times\\left(\\sum_{i=1}^r \\frac{1}{R_i}\\right) - 1\\right)\\times \\left(n\\times\\left(\\sum_{j=1}^c \\frac{1}{C_j}\\right) - 1\\right)}{6\\times n \\times\\left(r - 1\\right)\\times\\left(c - 1\\right)}$$
The formula is probably from Williams (1976), but the one shown here is taken from McDonald (2014, p. 36).
The **Pearson** correction also adjusts the Pearson chi-square value with (E.S. Pearson, 1947, p. 157):
$$\\chi_{epearson}^2 = \\frac{n - 1}{n}\\times \\chi_p^2$$
References
----------
Cochran, W. G. (1954). Some methods for strengthening the common χ2 tests. *Biometrics, 10*(4), 417. doi:10.2307/3001616
Fisher, R. A. (1925). *Statistical methods for research workers*. Oliver and Boyd.
McDonald, J. H. (2014). *Handbook of biological statistics* (3rd ed.). Sparky House Publishing.
Pearson, E. S. (1947). The choice of statistical tests illustrated on the Interpretation of data classed in a 2 × 2 table. *Biometrika, 34*(1/2), 139–167. doi:10.2307/2332518
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. *Philosophical Magazine Series 5, 50*(302), 157–175. doi:10.1080/14786440009463897
Williams, D. A. (1976). Improved likelihood ratio tests for complete contingency tables. *Biometrika, 63*(1), 33–37. doi:10.2307/2335081
Yates, F. (1934). Contingency tables involving small numbers and the chi square test. *Supplement to the Journal of the Royal Statistical Society, 1*(2), 217–235. doi:10.2307/2983604
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
'''
    testUsed = "Pearson chi-square test of independence"
    if cc == "yates":
        testUsed = testUsed + ", with Yates continuity correction"
    #create the cross table
    ct = tab_cross(field1, field2, categories1, categories2, totals="include")
    #basic counts
    nrows = ct.shape[0] - 1
    ncols = ct.shape[1] - 1
    n = ct.iloc[nrows, ncols]
    #determine the expected counts & chi-square value
    chi2Val = 0
    expMin = -1
    nExpBelow5 = 0
    expC = pd.DataFrame()
    for i in range(0, nrows):
        for j in range(0, ncols):
            expC.at[i, j] = ct.iloc[nrows, j] * ct.iloc[i, ncols] / n
            #observed count, kept in a local variable so the Yates shift
            #does not write a fractional value back into the cross table
            obs = ct.iloc[i, j]
            #add or remove a half in case Yates correction
            if cc == "yates":
                if obs > expC.iloc[i, j]:
                    obs = obs - 0.5
                elif obs < expC.iloc[i, j]:
                    obs = obs + 0.5
            chi2Val = chi2Val + (obs - expC.iloc[i, j])**2 / expC.iloc[i, j]
            #track the minimum expected count and the number of cells below 5
            if expMin < 0 or expC.iloc[i, j] < expMin:
                expMin = expC.iloc[i, j]
            if expC.iloc[i, j] < 5:
                nExpBelow5 = nExpBelow5 + 1
    #convert the count of cells below 5 to a proportion
    nExpBelow5 = nExpBelow5/(nrows*ncols)
    #Degrees of freedom
    df = (nrows - 1)*(ncols - 1)
    #Williams and Pearson correction
    if cc == "williams":
        testUsed = testUsed + ", with Williams continuity correction"
        rTotInv = 0
        for i in range(0, nrows):
            rTotInv = rTotInv + 1 / ct.iloc[i, ncols]
        cTotInv = 0
        for j in range(0, ncols):
            cTotInv = cTotInv + 1 / ct.iloc[nrows, j]
        q = 1 + (n * rTotInv - 1) * (n * cTotInv - 1) / (6 * n * df)
        chi2Val = chi2Val / q
    elif cc == "pearson":
        testUsed = testUsed + ", with E.S. Pearson continuity correction"
        chi2Val = chi2Val * (n - 1) / n
    #The test
    pvalue = chi2.sf(chi2Val, df)
    #Prepare the results
    colNames = ["n", "n rows", "n col.", "statistic", "df", "p-value", "min. exp.", "prop. exp. below 5", "test"]
    testResults = pd.DataFrame([[n, nrows, ncols, chi2Val, df, pvalue, expMin, nExpBelow5, testUsed]], columns=colNames)
    pd.set_option('display.max_colwidth', None)
    return testResults
Functions
def ts_pearson_ind(field1, field2, categories1=None, categories2=None, cc=None)
Pearson Chi-Square Test of Independence
To test whether two nominal variables are associated, the most commonly used test is the Pearson chi-square test of independence (Pearson, 1900). If the significance (p-value) of this test is below the chosen threshold, usually 0.05, the association between the two nominal variables is considered statistically significant.
The test compares the observed counts of the cross table with the so-called expected counts. The expected counts are the number of respondents you would expect in each cell if the two variables were independent.
If, for example, I had 50 male and 50 female respondents, and 50 agreed with a statement and 50 disagreed, the expected count for each combination (male-agree, female-agree, male-disagree, and female-disagree) would be 25.
Note that if in the survey all males disagreed and all females agreed, there would be a full dependency (i.e. gender fully decides whether you agree or disagree), even though the row and column totals would still be 50. In essence, the Pearson chi-square test checks whether your data is closer to the expected counts (independence) or to full dependency.
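The 50/50 example above can be worked through in a few lines. This is a minimal sketch in plain Python, not the library's own code; the helper names `expected_counts` and `pearson_chi2` are made up for illustration.

```python
def expected_counts(observed):
    """Expected counts under independence: E[i][j] = R_i * C_j / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi2(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# rows: male, female; columns: agree, disagree
independent = [[25, 25], [25, 25]]  # observed equals expected
dependent = [[0, 50], [50, 0]]      # all males disagree, all females agree

print(expected_counts(dependent))  # [[25.0, 25.0], [25.0, 25.0]]
print(pearson_chi2(independent))   # 0.0
print(pearson_chi2(dependent))     # 100.0
```

Both tables have the same row and column totals, so the expected counts are identical; only the chi-square value tells them apart.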
One problem, though, is that the Pearson chi-square test should only be used if not too many cells have an expected count of less than 5, and the minimum expected count is at least 1. So you will have to check first whether these conditions are met. Most often 'not too many cells' is fixed at no more than 20% of the cells. These are often referred to as the 'Cochran conditions', after Cochran (1954, p. 420). Note that, for example, Fisher (1925, p. 83) is stricter and finds that all cells should have an expected count of at least 5.
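A check of these conditions could look like the sketch below. It assumes the expected counts are already available as a list of rows; the function name `check_cochran` and the `max_prop_below_5` parameter are hypothetical and not part of the library.

```python
def check_cochran(expected, max_prop_below_5=0.20):
    """Return (minimum expected count, proportion of cells below 5, conditions met)."""
    cells = [e for row in expected for e in row]
    min_exp = min(cells)
    prop_below_5 = sum(1 for e in cells if e < 5) / len(cells)
    # conditions: minimum expected count at least 1,
    # and no more than 20% of the cells below 5
    ok = min_exp >= 1 and prop_below_5 <= max_prop_below_5
    return min_exp, prop_below_5, ok


# expected counts of a 2x2 table with row totals 10 and 40, column totals 15 and 35
expected = [[3.0, 7.0], [12.0, 28.0]]
min_exp, prop_below_5, ok = check_cochran(expected)
print(min_exp, prop_below_5, ok)  # 3.0 0.25 False
```

Here one of the four cells (25%) has an expected count below 5, so the conditions are not met even though the minimum is above 1. The function documented here reports the same two quantities as *min. exp.* and *prop. exp. below 5*, and leaves the decision to the user.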
Parameters
field1 : list or pandas series - the first categorical field
field2 : list or pandas series - the second categorical field
categories1 : list or dictionary, optional - order and/or selection for categories of field1
categories2 : list or dictionary, optional - order and/or selection for categories of field2
cc : {None, "yates", "pearson", "williams"}, optional - method for continuity correction
Returns
A dataframe with:
- n, the sample size
- n rows, number of categories used in first field
- n col., number of categories used in second field
- statistic, the test statistic (chi-square value)
- df, the degrees of freedom
- p-value, the significance (p-value)
- min. exp., the minimum expected count
- prop. exp. below 5, proportion of cells with expected count less than 5
- test, description of the test used
Notes
The formula used is (Pearson, 1900, p. 165):
$$\chi_p^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{\left(F_{i,j} - E_{i,j}\right)^2}{E_{i,j}}$$
$$df = \left(r - 1\right)\times\left(c - 1\right)$$
$$sig. = 1 - \chi^2\left(\chi_p^2, df\right)$$
With:
$$E_{i,j} = \frac{R_i \times C_j}{n}$$
$$R_i = \sum_{j=1}^c F_{i,j}$$
$$C_j = \sum_{i=1}^r F_{i,j}$$
$$n = \sum_{i=1}^r \sum_{j=1}^c F_{i,j} = \sum_{i=1}^r R_i = \sum_{j=1}^c C_j$$
Symbols:
- r, the number of rows
- c, the number of columns
- F_{i,j}, the observed count in row i and column j.
- E_{i,j}, the expected count in row i and column j.
- R_i, the row total of row i
- C_j, the column total of column j
- n, the overall total.
- df, the degrees of freedom
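The formulas and symbols above translate almost directly into array operations. The sketch below is an independent illustration using NumPy and SciPy rather than the library's loop-based code; the function name `pearson_independence` and the example table are assumptions. It compares the result with `scipy.stats.chi2_contingency` with the continuity correction switched off.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency


def pearson_independence(observed):
    """Uncorrected Pearson chi-square statistic, df, and p-value for an r x c table."""
    F = np.asarray(observed, dtype=float)
    R = F.sum(axis=1, keepdims=True)   # row totals R_i, shape (r, 1)
    C = F.sum(axis=0, keepdims=True)   # column totals C_j, shape (1, c)
    n = F.sum()                        # overall total
    E = R * C / n                      # expected counts E_ij
    stat = ((F - E) ** 2 / E).sum()
    df = (F.shape[0] - 1) * (F.shape[1] - 1)
    p_value = chi2.sf(stat, df)        # upper tail, i.e. 1 - CDF
    return stat, df, p_value


table = [[10, 20, 30], [20, 20, 20]]  # made-up counts
stat, df, p_value = pearson_independence(table)

# cross-check against SciPy's implementation without continuity correction
ref_stat, ref_p, ref_df, ref_expected = chi2_contingency(table, correction=False)
print(np.isclose(stat, ref_stat), df == ref_df, np.isclose(p_value, ref_p))
```

Using `chi2.sf` instead of `1 - chi2.cdf` gives the same upper-tail probability with better numerical accuracy for very small p-values.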
The Yates correction uses F_{i,j}' instead of F_{i,j}, defined as (Yates, 1934, p. 222):
$$F_{i,j}' = \begin{cases} F_{i,j}-\frac{1}{2} & \text{ if } F_{i,j}> E_{i,j} \\ F_{i,j}+\frac{1}{2} & \text{ if } F_{i,j}< E_{i,j} \\ F_{i,j} & \text{ if } F_{i,j}= E_{i,j} \end{cases}$$
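As a rough illustration of the Yates adjustment, the sketch below shifts every observed count half a unit toward its expected count before computing the statistic. It follows the piecewise definition literally, so the shift is always a full half unit; some implementations limit the shift so it cannot overshoot the expected count, and that refinement is deliberately left out here. The function name `yates_chi2` and the example table are assumptions.

```python
import numpy as np


def yates_chi2(observed):
    """Pearson chi-square statistic with each observed count moved 0.5 toward its expected count."""
    F = np.asarray(observed, dtype=float)
    E = F.sum(axis=1, keepdims=True) * F.sum(axis=0, keepdims=True) / F.sum()
    # np.sign gives +1 where F > E, -1 where F < E, and 0 where they are equal,
    # which matches the three cases of the definition
    F_adj = F - 0.5 * np.sign(F - E)
    return ((F_adj - E) ** 2 / E).sum()


table = [[10, 20], [30, 40]]  # made-up counts; every |F - E| equals 2
print(yates_chi2(table))
```

In this table each absolute difference shrinks from 2 to 1.5, so the corrected statistic is smaller than the uncorrected one.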
The Williams correction adjusts the Pearson chi-square value:
$$\chi_{wil}^2 = \frac{\chi_p^2}{q}$$
With:
$$q = 1+\frac{\left(n\times\left(\sum_{i=1}^r \frac{1}{R_i}\right) - 1\right)\times \left(n\times\left(\sum_{j=1}^c \frac{1}{C_j}\right) - 1\right)}{6\times n \times\left(r - 1\right)\times\left(c - 1\right)}$$
The formula is probably from Williams (1976), but the one shown here is taken from McDonald (2014, p. 36).
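The divisor q of the Williams correction depends only on the row totals, the column totals, and the table dimensions. A minimal sketch, with the hypothetical function name `williams_q` and a made-up table:

```python
import numpy as np


def williams_q(observed):
    """Williams divisor q computed from the row and column totals of an r x c table."""
    F = np.asarray(observed, dtype=float)
    R = F.sum(axis=1)  # row totals
    C = F.sum(axis=0)  # column totals
    n = F.sum()
    r, c = F.shape
    row_term = n * (1 / R).sum() - 1
    col_term = n * (1 / C).sum() - 1
    return 1 + row_term * col_term / (6 * n * (r - 1) * (c - 1))


table = [[10, 20], [30, 40]]
q = williams_q(table)
print(q)  # slightly above 1, so dividing by q lowers the chi-square value a little
```

Because n appears in the denominator of the fraction, q moves toward 1 as the sample grows, and the correction fades for large samples.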
The Pearson correction also adjusts the Pearson chi-square value with (E.S. Pearson, 1947, p. 157):
$$\chi_{epearson}^2 = \frac{n - 1}{n}\times \chi_p^2$$
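The E.S. Pearson adjustment is a single rescaling of the uncorrected statistic. A small sketch, with the function name `es_pearson_chi2` chosen for illustration only:

```python
def es_pearson_chi2(chi2_value, n):
    """Rescale an uncorrected Pearson chi-square value by (n - 1) / n."""
    return chi2_value * (n - 1) / n


# the fully dependent 50/50 example has an uncorrected value of 100 with n = 100
print(es_pearson_chi2(100.0, 100))  # 99.0
```

The factor (n - 1) / n is close to 1 unless the sample is small, so this adjustment matters mostly for small n.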
References
Cochran, W. G. (1954). Some methods for strengthening the common χ2 tests. Biometrics, 10(4), 417. doi:10.2307/3001616
Fisher, R. A. (1925). Statistical methods for research workers. Oliver and Boyd.
McDonald, J. H. (2014). Handbook of biological statistics (3rd ed.). Sparky House Publishing.
Pearson, E. S. (1947). The choice of statistical tests illustrated on the Interpretation of data classed in a 2 × 2 table. Biometrika, 34(1/2), 139–167. doi:10.2307/2332518
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine Series 5, 50(302), 157–175. doi:10.1080/14786440009463897
Williams, D. A. (1976). Improved likelihood ratio tests for complete contingency tables. Biometrika, 63(1), 33–37. doi:10.2307/2335081
Yates, F. (1934). Contingency tables involving small numbers and the chi square test. Supplement to the Journal of the Royal Statistical Society, 1(2), 217–235. doi:10.2307/2983604
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076