Module stikpetP.tests.test_scott_smith_owa
import pandas as pd
from scipy.stats import chi2
def ts_scott_smith_owa(nomField, scaleField, categories=None):
    '''
    Scott-Smith One-Way ANOVA
    -------------------------
    Tests if the means (averages) of each category could be the same in the population.

    If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, and at least two categories are then concluded to have a different mean on the scaleField score in the population.

    Yiğit and Gökpinar (2010, p. 32) concluded that, when there is heteroscedasticity (the variances in the groups are not the same), this test is inferior to some alternatives, which are then preferred (for example the Welch one-way ANOVA).

    There are quite a few alternatives for this test; the stikpet library has Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes of ts_fisher_owa() for some discussion on the differences.

    Parameters
    ----------
    nomField : pandas series
        data with categories
    scaleField : pandas series
        data with the scores
    categories : list or dictionary, optional
        the categories to use from nomField

    Returns
    -------
    Dataframe with:

    * *n*, the sample size
    * *statistic*, the test statistic (chi-square value)
    * *df*, degrees of freedom (equal to the number of categories *k*)
    * *p-value*, the p-value (significance)

    Notes
    -----
    The formula used (Scott & Smith, 1971, p. 277):
    $$ \\chi_{SS}^2 = \\sum_{j=1}^k z_j^2 $$
    $$ df = k $$
    $$ \\chi_{SS}^2 \\sim \\chi^2\\left(df\\right) $$

    With:
    $$ z_j = t_j\\times\\sqrt{\\frac{n_j-3}{n_j-1}} $$
    $$ t_j = \\frac{\\bar{x}_j - \\bar{x}}{\\sqrt{\\frac{s_j^2}{n_j}}} $$
    $$ s_j^2 = \\frac{\\sum_{i=1}^{n_j} \\left(x_{i,j} - \\bar{x}_j\\right)^2}{n_j - 1}$$
    $$ \\bar{x}_j = \\frac{\\sum_{i=1}^{n_j} x_{i,j}}{n_j}$$
    $$ \\bar{x} = \\frac{\\sum_{j=1}^{k}n_j\\times \\bar{x}_j}{n} = \\frac{\\sum_{j=1}^{k}\\sum_{i=1}^{n_j} x_{i,j}}{n}$$

    The formulas can also be found in Adepoju et al. (2016, p. 64), Cavus and Yazici (2020, p. 7), or Yiğit and Gökpinar (2010, p. 17).

    *Symbols used*

    * \\(k\\), for the number of categories
    * \\(x_{i,j}\\), for the i-th score in category j
    * \\(n_j\\), the sample size of category j
    * \\(\\bar{x}_j\\), the sample mean of category j
    * \\(s_j^2\\), the sample variance of the scores in category j
    * \\(\\bar{x}\\), the sample mean of all scores
    * \\(s_j\\), the sample standard deviation of the scores in category j
    * \\(n\\), the total sample size
    * \\(df\\), the degrees of freedom.

    References
    ----------
    Adepoju, K. A., Shittu, O. I., & Chukwu, A. U. (2016). On the development of an exponentiated F test for one-way ANOVA in the presence of outlier(s). *Mathematics and Statistics, 4*(2), 62–69. doi:10.13189/ms.2016.040203

    Cavus, M., & Yazici, B. (2020). Testing the equality of normal distributed and independent groups’ means under unequal variances by doex package. *The R Journal, 12*(2), 134. doi:10.32614/RJ-2021-008

    Scott, A. J., & Smith, T. M. F. (1971). Interval estimates for linear combinations of means. *Applied Statistics, 20*(3), 276–285. doi:10.2307/2346757

    Yiğit, E., & Gökpinar, F. (2010). A simulation study on tests for one-way ANOVA under the unequal variance assumption. *Communications, Faculty Of Science, University of Ankara*, 15–34. doi:10.1501/Commua1_0000000660

    Author
    ------
    Made by P. Stikker

    Companion website: https://PeterStatistics.com
    YouTube channel: https://www.youtube.com/stikpet
    Donations: https://www.patreon.com/bePatron?u=19398076
    '''
    #convert lists to pandas series
    if isinstance(nomField, list):
        nomField = pd.Series(nomField)
    if isinstance(scaleField, list):
        scaleField = pd.Series(scaleField)
    data = pd.concat([nomField, scaleField], axis=1)
    data.columns = ["category", "score"]
    #remove unused categories
    if categories is not None:
        data = data[data.category.isin(categories)]
    #Remove rows with missing values and reset index
    data = data.dropna()
    data = data.reset_index(drop=True)
    #overall n and mean
    n = len(data["category"])
    m = data.score.mean()
    #sample sizes, variances and means per category
    nj = data.groupby('category')['score'].count()
    sj2 = data.groupby('category')['score'].var()
    mj = data.groupby('category')['score'].mean()
    #number of categories
    k = len(mj)
    #t-value per category and its scaled version z_j
    sj = sj2**0.5
    tj = (mj - m)*nj**0.5 / sj
    dj = tj*((nj-3)/(nj-1))**0.5
    #test statistic, degrees of freedom and p-value
    chiVal = float((dj**2).sum())
    df = k
    pVal = chi2.sf(chiVal, df)
    #results
    res = pd.DataFrame([[n, chiVal, df, pVal]])
    res.columns = ["n", "statistic", "df", "p-value"]
    return res
Functions
def ts_scott_smith_owa(nomField, scaleField, categories=None)
Scott-Smith One-Way ANOVA
Tests if the means (averages) of each category could be the same in the population.
If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, and at least two categories are then concluded to have a different mean on the scaleField score in the population.
Yiğit and Gökpinar (2010, p. 32) concluded that, when there is heteroscedasticity (the variances in the groups are not the same), this test is inferior to some alternatives, which are then preferred (for example the Welch one-way ANOVA).
There are quite a few alternatives for this test; the stikpet library has Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes of ts_fisher_owa() for some discussion on the differences.
Parameters
nomField : pandas series - data with categories
scaleField : pandas series - data with the scores
categories : list or dictionary, optional - the categories to use from nomField
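To illustrate how the three parameters interact, here is a minimal plain-Python sketch of the preparation step: pair each category label with its score, keep only the requested categories, and drop pairs with a missing value. The helper name `pair_and_filter` is hypothetical and not part of the library, and the sketch assumes both inputs have the same length and order (the library itself combines the two fields with pandas).

```python
import math

def pair_and_filter(nom_field, scale_field, categories=None):
    """Hypothetical helper: pair labels with scores, filter, drop missing."""
    pairs = []
    for label, score in zip(nom_field, scale_field):
        # skip pairs where either value is missing (None or NaN score)
        if label is None or score is None:
            continue
        if isinstance(score, float) and math.isnan(score):
            continue
        # keep only the requested categories, if any were given
        if categories is not None and label not in categories:
            continue
        pairs.append((label, score))
    return pairs

labels = ["a", "a", "b", "c", None, "b"]
scores = [1.0, 2.0, float("nan"), 4.0, 5.0, 6.0]
print(pair_and_filter(labels, scores, categories=["a", "b"]))
# [('a', 1.0), ('a', 2.0), ('b', 6.0)]
```

Leaving `categories` at its default of `None` keeps every category that has at least one complete pair.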
Returns
Dataframe with:
- n, the sample size
- statistic, the test statistic (chi-square value)
- df, degrees of freedom (equal to the number of categories k)
- p-value, the p-value (significance)
Notes
The formula used (Scott & Smith, 1971, p. 277):
$$ \chi_{SS}^2 = \sum_{j=1}^k z_j^2 $$
$$ df = k $$
$$ \chi_{SS}^2 \sim \chi^2\left(df\right) $$
With:
$$ z_j = t_j\times\sqrt{\frac{n_j-3}{n_j-1}} $$
$$ t_j = \frac{\bar{x}_j - \bar{x}}{\sqrt{\frac{s_j^2}{n_j}}} $$
$$ s_j^2 = \frac{\sum_{i=1}^{n_j} \left(x_{i,j} - \bar{x}_j\right)^2}{n_j - 1} $$
$$ \bar{x}_j = \frac{\sum_{i=1}^{n_j} x_{i,j}}{n_j} $$
$$ \bar{x} = \frac{\sum_{j=1}^{k}n_j\times \bar{x}_j}{n} = \frac{\sum_{j=1}^{k}\sum_{i=1}^{n_j} x_{i,j}}{n} $$
The formulas can also be found in Adepoju et al. (2016, p. 64), Cavus and Yazici (2020, p. 7), or Yiğit and Gökpinar (2010, p. 17).
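The formulas can be traced step by step with a small worked example. The sketch below uses only the Python standard library instead of pandas; the function name `scott_smith_statistic` and the example data are made up for illustration. It also rejects categories with fewer than three scores, an added assumption of this sketch, since the factor (n_j - 3)/(n_j - 1) is negative or undefined in that case.

```python
from statistics import mean, variance

def scott_smith_statistic(groups):
    """Return (chi-square statistic, df) for a dict {category: list of scores}."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = mean(all_scores)
    stat = 0.0
    for name, scores in groups.items():
        n_j = len(scores)
        if n_j < 3:
            # (n_j - 3)/(n_j - 1) is negative or undefined for n_j < 3
            raise ValueError(f"category {name!r} needs at least 3 scores")
        s2_j = variance(scores)  # sample variance, divisor n_j - 1
        t_j = (mean(scores) - grand_mean) / (s2_j / n_j) ** 0.5
        z_j = t_j * ((n_j - 3) / (n_j - 1)) ** 0.5
        stat += z_j ** 2
    # the degrees of freedom equal the number of categories k
    return stat, len(groups)

groups = {"A": [1, 2, 3, 4, 5], "B": [2, 4, 6, 8, 10], "C": [3, 3, 4, 5, 5]}
stat, df = scott_smith_statistic(groups)
print(round(stat, 4), df)  # 2.75 3
```

For these data the squared z-values are 16/9, 25/36 and 10/36, which sum to 2.75.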
Symbols used
- \(k\), for the number of categories
- \(x_{i,j}\), for the i-th score in category j
- \(n_j\), the sample size of category j
- \(\bar{x}_j\), the sample mean of category j
- \(s_j^2\), the sample variance of the scores in category j
- \(\bar{x}\), the sample mean of all scores
- \(s_j\), the sample standard deviation of the scores in category j
- \(n\), the total sample size
- \(df\), the degrees of freedom.
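With the statistic and df in hand, the p-value is the upper-tail probability of a chi-square distribution with df degrees of freedom. The source code obtains it from `scipy.stats.chi2.sf`; the sketch below is only a standard-library stand-in for integer df, with the hypothetical name `chi2_sf`, built on the recurrence Q(a + 1, y) = Q(a, y) + y^a e^{-y} / Γ(a + 1) for the regularized upper incomplete gamma function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)             # start at Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # start at Q(1/2, y)
    # step up with Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

# statistic 2.75 with df = 3 gives a p-value of roughly 0.43
print(round(chi2_sf(2.75, 3), 3))
```

A p-value of roughly 0.43 is well above the usual 0.05 threshold, so in that case the null hypothesis of equal means would not be rejected.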
References
Adepoju, K. A., Shittu, O. I., & Chukwu, A. U. (2016). On the development of an exponentiated F test for one-way ANOVA in the presence of outlier(s). Mathematics and Statistics, 4(2), 62–69. doi:10.13189/ms.2016.040203
Cavus, M., & Yazici, B. (2020). Testing the equality of normal distributed and independent groups’ means under unequal variances by doex package. The R Journal, 12(2), 134. doi:10.32614/RJ-2021-008
Scott, A. J., & Smith, T. M. F. (1971). Interval estimates for linear combinations of means. Applied Statistics, 20(3), 276–285. doi:10.2307/2346757
Yiğit, E., & Gökpinar, F. (2010). A simulation study on tests for one-way ANOVA under the unequal variance assumption. Communications, Faculty Of Science, University of Ankara, 15–34. doi:10.1501/Commua1_0000000660
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076