Module stikpetP.tests.test_alexander_govern_owa
import pandas as pd
from scipy.stats import chi2
from numpy import log
def ts_alexander_govern_owa(nomField, scaleField, categories=None):
    '''
    Alexander-Govern One-Way ANOVA
    ------------------------------
    Tests if the means (averages) of each category could be the same in the population.
    If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, and at least two categories will then have a different mean on the scaleField score in the population.
    Schneider and Penfield (1997) looked at the Welch, Alexander-Govern and the James test (they ignored the Brown-Forsythe since they found it to perform worse than Welch or James), and concluded: “Under variance heterogeneity, Alexander-Govern’s approximation was not only comparable to the Welch test and the James second-order test but was superior, in certain instances, when coupled with the power results for those tests” (p. 285).
    There are quite a few alternatives for this test; the stikpet library has Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes of ts_fisher_owa() for some discussion on the differences.
    Parameters
    ----------
    nomField : pandas series
        data with the categories
    scaleField : pandas series
        data with the scores
    categories : list or dictionary, optional
        the categories of nomField to use
    Returns
    -------
    Dataframe with:
    * *n*, the sample size
    * *statistic*, the test statistic (chi-square value)
    * *df*, the degrees of freedom
    * *p-value*, the p-value (significance)
    Notes
    -----
    The formula used (Alexander & Govern, 1994, pp. 92-94):
    $$ A = \\sum_{j=1}^k z_j^2 $$
    $$ df = k - 1 $$
    $$ A \\sim \\chi^2\\left(df\\right) $$
    With:
    $$ z_j = c_j + \\frac{c_j^3 + 3\\times c_j}{b_j} - \\frac{4\\times c_j^7 + 33\\times c_j^5 + 240\\times c_j^3 + 855\\times c_j}{10\\times b_j^2 + 8\\times b_j\\times c_j^4 + 1000\\times b_j} $$
    $$ c_j = \\sqrt{a_j\\times\\ln\\left(1 + \\frac{t_j^2}{n_j - 1}\\right)} $$
    $$ b_j = 48\\times a_j^2 $$
    $$ a_j = n_j - 1.5 $$
    $$ t_j = \\frac{\\bar{x}_j - \\bar{y}_w}{\\sqrt{\\frac{s_j^2}{n_j}}} $$
    $$ \\bar{y}_w = \\sum_{j=1}^k h_j\\times \\bar{x}_j$$
    $$ h_j = \\frac{w_j}{w}$$
    $$ w_j = \\frac{n_j}{s_j^2}$$
    $$ w = \\sum_{j=1}^k w_j$$
    $$ s_j^2 = \\frac{\\sum_{i=1}^{n_j} \\left(x_{i,j} - \\bar{x}_j\\right)^2}{n_j - 1}$$
    $$ \\bar{x}_j = \\frac{\\sum_{i=1}^{n_j} x_{i,j}}{n_j}$$
    *Symbols used:*
    * \\(k\\), the number of categories
    * \\(x_{i,j}\\), the i-th score in category j
    * \\(n_j\\), the sample size of category j
    * \\(\\bar{x}_j\\), the sample mean of category j
    * \\(s_j^2\\), the sample variance of the scores in category j
    * \\(n\\), the total sample size
    * \\(df\\), the degrees of freedom
    References
    ----------
    Alexander, R. A., & Govern, D. M. (1994). A new and simpler approximation for ANOVA under variance heterogeneity. *Journal of Educational Statistics, 19*(2), 91–101. doi:10.2307/1165140
    Schneider, P. J., & Penfield, D. A. (1997). Alexander and Govern’s approximation: Providing an alternative to ANOVA under variance heterogeneity. *The Journal of Experimental Education, 65*(3), 271–286. doi:10.1080/00220973.1997.9943459
    Author
    ------
    Made by P. Stikker
    Companion website: https://PeterStatistics.com
    YouTube channel: https://www.youtube.com/stikpet
    Donations: https://www.patreon.com/bePatron?u=19398076
    '''
    if isinstance(nomField, list):
        nomField = pd.Series(nomField)
    if isinstance(scaleField, list):
        scaleField = pd.Series(scaleField)
    data = pd.concat([nomField, scaleField], axis=1)
    data.columns = ["category", "score"]
    #remove unused categories
    if categories is not None:
        data = data[data.category.isin(categories)]
    #remove rows with missing values and reset the index
    data = data.dropna()
    data = data.reset_index(drop=True)
    #overall sample size
    n = len(data["category"])
    #sample sizes, variances and means per category
    nj = data.groupby('category').count()
    sj2 = data.groupby('category').var()
    mj = data.groupby('category').mean()
    #number of categories
    k = len(mj)
    #standard errors per category
    sej = (sj2/nj)**0.5
    #normalized weights per category (h_j in the formulas)
    ssej = (1/sej**2).sum()
    wj = 1/(sej**2 * ssej)
    #weighted overall mean
    ym = (wj*mj).sum()
    #t-values and the normalizing transformation to z-values
    tj = (mj - ym)/sej
    aj = nj - 1.5
    bj = 48*aj**2
    cj = (aj*log(1 + tj**2/(nj - 1)))**0.5
    zj = cj + (cj**3 + 3*cj)/bj - (4*cj**7 + 33*cj**5 + 240*cj**3 + 855*cj)/(10*bj**2 + 8*bj*cj**4 + 1000*bj)
    #test statistic, degrees of freedom and p-value
    a = float((zj**2)['score'].sum())
    df = k - 1
    pVal = chi2.sf(a, df)
    #results
    res = pd.DataFrame([[n, a, df, pVal]])
    res.columns = ["n", "statistic", "df", "p-value"]
    return res
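The formula chain in the notes (weights, weighted grand mean, per-category t-values, the normalizing transformation to z-values, and the chi-square test) can be traced step by step on a small example. This is a minimal NumPy sketch with toy data invented purely for illustration; as a cross-check it compares the result against `scipy.stats.alexandergovern`, which SciPy ships as of version 1.7:

```python
import numpy as np
from scipy.stats import chi2, alexandergovern

# toy data, invented purely for illustration
groups = [np.array([4.0, 5.0, 6.0, 7.0, 8.0]),
          np.array([10.0, 12.0, 11.0, 13.0]),
          np.array([1.0, 2.0, 3.0, 2.5])]

nj = np.array([len(g) for g in groups])          # sample sizes n_j
mj = np.array([g.mean() for g in groups])        # sample means x-bar_j
sj2 = np.array([g.var(ddof=1) for g in groups])  # sample variances s_j^2
wj = nj / sj2                                    # weights w_j = n_j / s_j^2
hj = wj / wj.sum()                               # normalized weights h_j
ym = (hj * mj).sum()                             # weighted grand mean y-bar_w
tj = (mj - ym) / np.sqrt(sj2 / nj)               # per-category t-values
aj = nj - 1.5
bj = 48 * aj**2
cj = np.sqrt(aj * np.log(1 + tj**2 / (nj - 1)))
zj = cj + (cj**3 + 3*cj)/bj - \
     (4*cj**7 + 33*cj**5 + 240*cj**3 + 855*cj) / (10*bj**2 + 8*bj*cj**4 + 1000*bj)
A = (zj**2).sum()                                # test statistic
p = chi2.sf(A, len(groups) - 1)                  # chi-square p-value, df = k - 1

ref = alexandergovern(*groups)                   # SciPy's implementation
print("A =", A, "p =", p)
```

Since only \(z_j^2\) enters the statistic, the sign of \(t_j\) drops out, which is why \(c_j\) can be taken as the nonnegative square root.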
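The preprocessing at the top of the function (the optional `categories` filter, listwise deletion of missing values, and the index reset) can be seen in isolation in this small pandas sketch; the data and category names are made up for illustration:

```python
import pandas as pd

# toy data, invented for illustration; None marks missing entries
nom = pd.Series(["a", "a", "b", "b", "c", "c", None])
score = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, None, 7.0])

data = pd.concat([nom, score], axis=1)
data.columns = ["category", "score"]
# keep only the requested categories (the `categories` argument)
data = data[data["category"].isin(["a", "b"])]
# drop rows with missing values and renumber the rows
data = data.dropna().reset_index(drop=True)
print(data)
```

Note that `reset_index(drop=True)` returns a new dataframe, so the result must be reassigned; calling it without reassignment leaves the dataframe unchanged.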