Module stikpetP.tests.test_box_owa
import pandas as pd
from scipy.stats import f
def ts_box_owa(nomField, scaleField, categories=None):
    '''
Box One-Way ANOVA
-----------------
Tests if the means (averages) of each category could be the same in the population.
Box proposed a correction to the original Fisher one-way ANOVA, on both the test-statistic and the degrees of freedom.
If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, and we conclude that at least two categories have a different mean scaleField score in the population.
There are quite a few alternatives to this test. The stikpet library offers Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes of ts_fisher_owa() for some discussion on the differences.
Parameters
----------
nomField : pandas series
data with categories
scaleField : pandas series
data with the scores
categories : list or dictionary, optional
the categories to use from nomField
Returns
-------
Dataframe with:
* *n*, the sample size
* *k*, the number of categories
* *statistic*, the test statistic (F value)
* *df1*, degrees of freedom 1
* *df2*, degrees of freedom 2
* *p-value*, the p-value (significance)
Notes
-----
The formula used (Box, 1954, p. 299):
$$ F_{Box} = \\frac{F_{Fisher}}{c} $$
$$ df_1^* = \\frac{\\left(\\sum_{j=1}^k\\left(n-n_j\\right)\\times s_j^2\\right)^2}{\\left(\\sum_{j=1}^k n_j\\times s_j^2\\right)^2 + n\\times\\sum_{j=1}^k\\left(n - 2\\times n_j\\right)\\times s_j^4} $$
$$ df_2^* = \\frac{\\left(\\sum_{j=1}^k \\left(n_j-1\\right)\\times s_j^2\\right)^2}{\\sum_{j=1}^k\\left(n_j-1\\right)\\times s_j^4}$$
$$ F_{Box} \\sim F\\left(df_1^*, df_2^*\\right) $$
With:
$$ c = \\frac{n-k}{n\\times\\left(k-1\\right)}\\times\\frac{\\sum_{j=1}^k\\left(n-n_j\\right)\\times s_j^2}{\\sum_{j=1}^k\\left(n_j-1\\right)\\times s_j^2} $$
$$ s_j^2 = \\frac{\\sum_{i=1}^{n_j} \\left(x_{i,j} - \\bar{x}_j\\right)^2}{n_j - 1}$$
$$ \\bar{x}_j = \\frac{\\sum_{i=1}^{n_j} x_{i,j}}{n_j}$$
*Symbols used:*
* \\(k\\), for the number of categories
* \\(x_{i,j}\\), for the i-th score in category j
* \\(n_j\\), the sample size of category j
* \\(\\bar{x}_j\\), the sample mean of category j
* \\(s_j^2\\), the sample variance of the scores in category j
* \\(n\\), the total sample size (the sum of all \\(n_j\\))
* \\(df_i^*\\), the i-th adjusted degrees of freedom
* \\(F_{Fisher}\\), the F-statistic from the regular one-way ANOVA
The \\(F_{Box}\\) value is the same as the test statistic of the Brown-Forsythe test for means. The R functions in the doex and onewaytests libraries actually use this statistic. They also use a different formula for the second degrees of freedom, which leads to a different result:
$$ df_2^* = \\frac{\\left(\\sum_{j=1}^k\\left(1 - \\frac{n_j}{n}\\right)\\times s_j^2\\right)^2}{\\frac{\\sum_{j=1}^k\\left(1 - \\frac{n_j}{n}\\right)^2\\times s_j^4}{n-k}} $$
Asiribo and Gurland (1990) derive the same correction as Box. Their notation for \\(df_1^*\\) is different, but it gives the same result.
References
----------
Asiribo, O., & Gurland, J. (1990). Coping with variance heterogeneity. *Communications in Statistics - Theory and Methods, 19*(11), 4029–4048. doi:10.1080/03610929008830427
Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems, I: Effect of inequality of variance in the one-way classification. *The Annals of Mathematical Statistics, 25*(2), 290–302. doi:10.1214/aoms/1177728786
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
    '''
    #accept plain lists as well as pandas series
    if isinstance(nomField, list):
        nomField = pd.Series(nomField)
    if isinstance(scaleField, list):
        scaleField = pd.Series(scaleField)
    data = pd.concat([nomField, scaleField], axis=1)
    data.columns = ["category", "score"]
    #remove unused categories
    if categories is not None:
        data = data[data.category.isin(categories)]
    #Remove rows with missing values and reset index
    data = data.dropna()
    data = data.reset_index(drop=True)
    #overall n, mean and ss
    n = len(data["category"])
    m = data.score.mean()
    sst = data.score.var()*(n-1)
    #sample sizes, variances and means per category (as series, so sums are scalars)
    nj = data.groupby('category')['score'].count()
    sj2 = data.groupby('category')['score'].var()
    mj = data.groupby('category')['score'].mean()
    #number of categories
    k = len(mj)
    #Fisher's regular F-statistic
    ssb = float((nj*(mj-m)**2).sum())
    ssw = sst - ssb
    dfb = k - 1
    dfw = n - k
    dft = n - 1
    msb = ssb/dfb
    msw = ssw/dfw
    fVal = msb/msw
    #Box correction:
    c = (n - k)/(n*(k-1)) * ((n - nj)*sj2).sum() / ((nj - 1)*sj2).sum()
    fVal = float(fVal / c)
    #Box degrees of freedom
    df1 = float(((n - nj)*sj2).sum()**2 / ((nj*sj2).sum()**2 + n * ((n - 2*nj)*sj2**2).sum()))
    df2 = float(((nj - 1)*sj2).sum()**2 / ((nj - 1)*sj2**2).sum())
    #right-tail probability of the F distribution
    pVal = f.sf(fVal, df1, df2)
    #results
    res = pd.DataFrame([[n, k, fVal, df1, df2, pVal]])
    res.columns = ["n", "k", "statistic", "df1", "df2", "p-value"]
    return res
Functions
def ts_box_owa(nomField, scaleField, categories=None)
Box One-Way ANOVA
Tests if the means (averages) of each category could be the same in the population.
Box proposed a correction to the original Fisher one-way ANOVA, on both the test-statistic and the degrees of freedom.
If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, and we conclude that at least two categories have a different mean scaleField score in the population.
There are quite a few alternatives to this test. The stikpet library offers Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes of ts_fisher_owa() for some discussion on the differences.
Parameters
nomField : pandas series
data with categories
scaleField : pandas series
data with the scores
categories : list or dictionary, optional
the categories to use from nomField
Returns
Dataframe with:
- n, the sample size
- k, the number of categories
- statistic, the test statistic (F value)
- df1, degrees of freedom 1
- df2, degrees of freedom 2
- p-value, the p-value (significance)
Notes
The formula used (Box, 1954, p. 299):
F_{Box} = \frac{F_{Fisher}}{c}
df_1^* = \frac{\left(\sum_{j=1}^k\left(n-n_j\right)\times s_j^2\right)^2}{\left(\sum_{j=1}^k n_j\times s_j^2\right)^2 + n\times\sum_{j=1}^k\left(n - 2\times n_j\right)\times s_j^4}
df_2^* = \frac{\left(\sum_{j=1}^k \left(n_j-1\right)\times s_j^2\right)^2}{\sum_{j=1}^k\left(n_j-1\right)\times s_j^4}
F_{Box} \sim F\left(df_1^*, df_2^*\right)
With:
c = \frac{n-k}{n\times\left(k-1\right)}\times\frac{\sum_{j=1}^k\left(n-n_j\right)\times s_j^2}{\sum_{j=1}^k\left(n_j-1\right)\times s_j^2}
s_j^2 = \frac{\sum_{i=1}^{n_j} \left(x_{i,j} - \bar{x}_j\right)^2}{n_j - 1}
\bar{x}_j = \frac{\sum_{i=1}^{n_j} x_{i,j}}{n_j}
Symbols used:
- k, for the number of categories
- x_{i,j}, for the i-th score in category j
- n_j, the sample size of category j
- \bar{x}_j, the sample mean of category j
- s_j^2, the sample variance of the scores in category j
- n, the total sample size (the sum of all n_j)
- df_i^*, the i-th adjusted degrees of freedom
- F_{Fisher}, the F-statistic from the regular one-way ANOVA
The F_{Box} value is the same as the test statistic of the Brown-Forsythe test for means. The R functions in the doex and onewaytests libraries actually use this statistic. They also use a different formula for the second degrees of freedom, which leads to a different result:
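The formulas and symbols above can be traced step by step in a small sketch that uses only the Python standard library. The function name `box_owa_sketch` and its list-of-lists input are assumptions made for illustration; they are not part of the stikpet library, and the p-value step (which the source obtains from `scipy.stats.f.sf`) is left out.

```python
from statistics import mean, variance


def box_owa_sketch(groups):
    """Return (F_Box, df1, df2) for a list of score lists, one list per category.

    Hypothetical helper for illustration only, not the library function.
    """
    k = len(groups)                       # number of categories
    nj = [len(g) for g in groups]         # sample size per category
    n = sum(nj)                           # total sample size
    mj = [mean(g) for g in groups]        # mean per category
    sj2 = [variance(g) for g in groups]   # sample variance per category (n_j - 1 denominator)
    grand_mean = sum(sum(g) for g in groups) / n

    # Fisher's regular F-statistic
    ssb = sum(n_j * (m_j - grand_mean) ** 2 for n_j, m_j in zip(nj, mj))
    ssw = sum((n_j - 1) * s2 for n_j, s2 in zip(nj, sj2))
    f_fisher = (ssb / (k - 1)) / (ssw / (n - k))

    # Box correction factor c and corrected statistic
    a = sum((n - n_j) * s2 for n_j, s2 in zip(nj, sj2))
    c = (n - k) / (n * (k - 1)) * a / ssw
    f_box = f_fisher / c

    # adjusted degrees of freedom
    df1 = a ** 2 / (
        sum(n_j * s2 for n_j, s2 in zip(nj, sj2)) ** 2
        + n * sum((n - 2 * n_j) * s2 ** 2 for n_j, s2 in zip(nj, sj2))
    )
    df2 = ssw ** 2 / sum((n_j - 1) * s2 ** 2 for n_j, s2 in zip(nj, sj2))
    return f_box, df1, df2


if __name__ == "__main__":
    # balanced groups with equal variances: the correction should leave Fisher's test unchanged
    print(box_owa_sketch([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

For balanced groups with equal variances the correction factor c equals 1, so the sketch should reproduce the regular Fisher result: F close to 27 with degrees of freedom close to k - 1 = 2 and n - k = 6.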
df_2^* = \frac{\left(\sum_{j=1}^k\left(1 - \frac{n_j}{n}\right)\times s_j^2\right)^2}{\frac{\sum_{j=1}^k\left(1 - \frac{n_j}{n}\right)^2\times s_j^4}{n-k}}
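To see that the two formulas for the second degrees of freedom really differ, both can be evaluated on the same category sizes and variances. This is a minimal sketch that implements the formulas exactly as written on this page; the function names `df2_box` and `df2_alternative` are hypothetical and do not exist in stikpet, doex or onewaytests.

```python
def df2_box(nj, sj2):
    """Second degrees of freedom as in the Box formula on this page."""
    num = sum((n_j - 1) * s2 for n_j, s2 in zip(nj, sj2)) ** 2
    den = sum((n_j - 1) * s2 ** 2 for n_j, s2 in zip(nj, sj2))
    return num / den


def df2_alternative(nj, sj2):
    """Second degrees of freedom following the alternative formula as written above."""
    n = sum(nj)   # total sample size
    k = len(nj)   # number of categories
    num = sum((1 - n_j / n) * s2 for n_j, s2 in zip(nj, sj2)) ** 2
    den = sum((1 - n_j / n) ** 2 * s2 ** 2 for n_j, s2 in zip(nj, sj2)) / (n - k)
    return num / den


if __name__ == "__main__":
    sizes = [3, 3, 3]          # n_j per category
    variances = [1.0, 1.0, 1.0]  # s_j^2 per category
    print(df2_box(sizes, variances), df2_alternative(sizes, variances))
```

Even in this balanced, equal-variance case the two formulas disagree: the Box version gives n - k = 6, while the alternative as written gives approximately 18, so the resulting p-values will generally differ as well.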
Asiribo and Gurland (1990) derive the same correction as Box. Their notation for df_1^* is different, but it gives the same result.
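The statement that F_{Box} matches the Brown-Forsythe statistic can be checked by substituting c into F_{Fisher}/c. The symbols SS_b, SS_w and \bar{x} (the overall mean) are notation assumed here for the between and within sums of squares; they correspond to ssb, ssw and m in the source code.

$$ F_{Fisher} = \frac{SS_b/\left(k-1\right)}{SS_w/\left(n-k\right)}, \quad SS_b = \sum_{j=1}^k n_j\times\left(\bar{x}_j - \bar{x}\right)^2, \quad SS_w = \sum_{j=1}^k\left(n_j-1\right)\times s_j^2 $$

$$ F_{Box} = \frac{F_{Fisher}}{c} = \frac{SS_b}{k-1}\times\frac{n-k}{SS_w}\times\frac{n\times\left(k-1\right)}{n-k}\times\frac{SS_w}{\sum_{j=1}^k\left(n-n_j\right)\times s_j^2} = \frac{n\times SS_b}{\sum_{j=1}^k\left(n-n_j\right)\times s_j^2} $$

$$ F_{Box} = \frac{\sum_{j=1}^k n_j\times\left(\bar{x}_j - \bar{x}\right)^2}{\sum_{j=1}^k\left(1 - \frac{n_j}{n}\right)\times s_j^2} $$

The factors k - 1, n - k and SS_w cancel, and dividing numerator and denominator by n leaves the last expression, which has the form of the Brown-Forsythe statistic for means.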
References
Asiribo, O., & Gurland, J. (1990). Coping with variance heterogeneity. Communications in Statistics - Theory and Methods, 19(11), 4029–4048. doi:10.1080/03610929008830427
Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems, I: Effect of inequality of variance in the one-way classification. The Annals of Mathematical Statistics, 25(2), 290–302. doi:10.1214/aoms/1177728786
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076