Module stikpetP.tests.test_scott_smith_owa

import pandas as pd
from scipy.stats import chi2

def ts_scott_smith_owa(nomField, scaleField, categories=None):
    '''
    Scott-Smith One-Way ANOVA
    -----------------
    Tests if the means (averages) of each category could be the same in the population.
    
    If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, which indicates that at least two categories have a different mean scaleField score in the population.
    
    Yiğit and Gökpinar (2010, p. 32) concluded that this test is inferior to some other alternatives when there is heteroscedasticity (variances in the groups are not the same); in that case another test, for example the Welch one-way ANOVA, is preferred.
    
    There are quite a few alternatives to this test. The stikpet library has Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes from ts_fisher_owa() for some discussion on the differences.
    
    Parameters
    ----------
    nomField : pandas series
        data with categories
    scaleField : pandas series
        data with the scores
    categories : list or dictionary, optional
        the categories to use from nomField
    
    Returns
    -------
    Dataframe with:
    
    * *n*, the sample size
    * *k*, the number of categories
    * *statistic*, the test statistic (chi-square value)
    * *df*, degrees of freedom
    * *p-value*, the p-value (significance)
    
    Notes
    -----
    The formula used (Scott & Smith, 1971, p. 277):
    $$ \\chi_{SS}^2 = \\sum_{j=1}^k z_j^2 $$
    $$ df = k $$
    $$ \\chi_{SS}^2 \\sim \\chi^2\\left(df\\right) $$
    
    With:
    $$ z_j = t_j\\times\\sqrt{\\frac{n_j-3}{n_j-1}} $$
    $$ t_j = \\frac{\\bar{x}_j - \\bar{x}}{\\sqrt{\\frac{s_j^2}{n_j}}} $$
    $$ s_j^2 = \\frac{\\sum_{i=1}^{n_j} \\left(x_{i,j} - \\bar{x}_j\\right)^2}{n_j - 1}$$
    $$ \\bar{x}_j = \\frac{\\sum_{i=1}^{n_j} x_{i,j}}{n_j}$$
    $$ \\bar{x} = \\frac{\\sum_{j=1}^{k}n_j\\times \\bar{x}_j}{n} = \\frac{\\sum_{j=1}^{k}\\sum_{i=1}^{n_j} x_{i,j}}{n}$$
    
    The formulas can also be found in Adepoju et al. (2016, p. 64), Cavus and Yazici (2020, p. 7), or Yiğit and Gökpinar (2010, p. 17).
    
    *Symbols used* 
    
    * \\(k\\), for the number of categories
    * \\(x_{i,j}\\), for the i-th score in category j
    * \\(n_j\\), the sample size of category j
    * \\(\\bar{x}_j\\), the sample mean of category j
    * \\(s_j^2\\), the sample variance of the scores in category j
    * \\(\\bar{x}\\), the sample mean of all scores
    * \\(s_j\\), the sample standard deviation of the scores in category j
    * \\(n\\), the total sample size
    * \\(df\\), the degrees of freedom.
    
    References
    ----------
    Adepoju, K. A., Shittu, O. I., & Chukwu, A. U. (2016). On the development of an exponentiated F test for one-way ANOVA in the presence of outlier(s). *Mathematics and Statistics, 4*(2), 62–69. doi:10.13189/ms.2016.040203
    
    Cavus, M., & Yazici, B. (2020). Testing the equality of normal distributed and independent groups’ means under unequal variances by doex package. *The R Journal, 12*(2), 134. doi:10.32614/RJ-2021-008
    
    Scott, A. J., & Smith, T. M. F. (1971). Interval estimates for linear combinations of means. *Applied Statistics, 20*(3), 276–285. doi:10.2307/2346757
    
    Yiğit, E., & Gökpinar, F. (2010). A simulation study on tests for one-way ANOVA under the unequal variance assumption. *Communications, Faculty Of Science, University of Ankara*, 15–34. doi:10.1501/Commua1_0000000660
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    '''
    if isinstance(nomField, list):
        nomField = pd.Series(nomField)
        
    if isinstance(scaleField, list):
        scaleField = pd.Series(scaleField)
        
    data = pd.concat([nomField, scaleField], axis=1)
    data.columns = ["category", "score"]
    
    #remove unused categories
    if categories is not None:
        data = data[data.category.isin(categories)]
    
    #Remove rows with missing values and reset index
    data = data.dropna()
    data = data.reset_index(drop=True)
    
    #overall n and mean
    n = len(data["category"])
    m = data.score.mean()
    
    #sample sizes, variances and means per category (Series indexed by category)
    scores = data.groupby('category')['score']
    nj = scores.count()
    sj2 = scores.var()
    mj = scores.mean()
    
    #number of categories
    k = len(mj)
    
    #t_j per category and the adjusted z_j from the notes (named dj here)
    sj = sj2**0.5
    tj = (mj - m)*nj**0.5 / sj
    dj = tj*((nj-3)/(nj-1))**0.5
    
    #test statistic: sum of the squared z_j
    chiVal = float((dj**2).sum())
    df = k
    
    pVal = chi2.sf(chiVal, df)
    
    #results (k is included to match the documented return values)
    res = pd.DataFrame([[n, k, chiVal, df, pVal]])
    res.columns = ["n", "k", "statistic", "df", "p-value"]
    
    return res

Functions

def ts_scott_smith_owa(nomField, scaleField, categories=None)

Scott-Smith One-Way ANOVA

Tests if the means (averages) of each category could be the same in the population.

If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, which indicates that at least two categories have a different mean scaleField score in the population.

Yiğit and Gökpinar (2010, p. 32) concluded that this test is inferior to some other alternatives when there is heteroscedasticity (variances in the groups are not the same); in that case another test, for example the Welch one-way ANOVA, is preferred.

There are quite a few alternatives to this test. The stikpet library has Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes from ts_fisher_owa() for some discussion on the differences.

Parameters

nomField : pandas series
data with categories
scaleField : pandas series
data with the scores
categories : list or dictionary, optional
the categories to use from nomField

Returns

Dataframe with:
 
  • n, the sample size
  • k, the number of categories
  • statistic, the test statistic (chi-square value)
  • df, degrees of freedom
  • p-value, the p-value (significance)

Notes

The formula used (Scott & Smith, 1971, p. 277):
$$ \chi_{SS}^2 = \sum_{j=1}^k z_j^2 $$
$$ df = k $$
$$ \chi_{SS}^2 \sim \chi^2\left(df\right) $$

With:
$$ z_j = t_j\times\sqrt{\frac{n_j-3}{n_j-1}} $$
$$ t_j = \frac{\bar{x}_j - \bar{x}}{\sqrt{\frac{s_j^2}{n_j}}} $$
$$ s_j^2 = \frac{\sum_{i=1}^{n_j} \left(x_{i,j} - \bar{x}_j\right)^2}{n_j - 1} $$
$$ \bar{x}_j = \frac{\sum_{i=1}^{n_j} x_{i,j}}{n_j} $$
$$ \bar{x} = \frac{\sum_{j=1}^{k}n_j\times \bar{x}_j}{n} = \frac{\sum_{j=1}^{k}\sum_{i=1}^{n_j} x_{i,j}}{n} $$

The formulas can also be found in Adepoju et al. (2016, p. 64), Cavus and Yazici (2020, p. 7), or Yiğit and Gökpinar (2010, p. 17).
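
The formulas above can be traced step by step with a minimal sketch that uses only the Python standard library. The helper names `scott_smith_statistic` and `chi2_sf` and the sample scores are hypothetical and are not part of the stikpet library; `chi2_sf` is a closed-form stand-in, valid for positive integer df, for the chi-square survival function that the library takes from scipy. The check on category sizes is also an addition of this sketch.

```python
import math
from statistics import mean, variance


def scott_smith_statistic(groups):
    """Return (statistic, df) for a dict mapping category -> list of scores."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = mean(all_scores)  # x-bar, the mean of all scores
    statistic = 0.0
    for scores in groups.values():
        n_j = len(scores)
        if n_j < 3:
            # Check added for this sketch: (n_j - 3)/(n_j - 1) must not be negative.
            raise ValueError("each category needs at least 3 scores")
        s2_j = variance(scores)  # sample variance, divisor n_j - 1
        t_j = (mean(scores) - grand_mean) / math.sqrt(s2_j / n_j)
        z_j = t_j * math.sqrt((n_j - 3) / (n_j - 1))
        statistic += z_j ** 2
    return statistic, len(groups)  # df = k


def chi2_sf(x, df):
    """Upper-tail chi-square probability for a positive integer df (closed form)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        series = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * series
    series = sum(half ** (i - 0.5) / math.gamma(i + 0.5)
                 for i in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * series


# Made-up scores for three categories, five scores each.
groups = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10], "c": [5, 6, 7, 8, 9]}
statistic, df = scott_smith_statistic(groups)
p_value = chi2_sf(statistic, df)
print(statistic, df, p_value)
```

For these scores the squared z-values are 49/9, 1/9 and 25/9, so the statistic is 25/3 (about 8.33) with df = 3, and the p-value falls below 0.05.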

Symbols used

  • k, for the number of categories
  • x_{i,j}, for the i-th score in category j
  • n_j, the sample size of category j
  • \bar{x}_j, the sample mean of category j
  • s_j^2, the sample variance of the scores in category j
  • \bar{x}, the sample mean of all scores
  • s_j, the sample standard deviation of the scores in category j
  • n, the total sample size
  • df, the degrees of freedom.

References

Adepoju, K. A., Shittu, O. I., & Chukwu, A. U. (2016). On the development of an exponentiated F test for one-way ANOVA in the presence of outlier(s). Mathematics and Statistics, 4(2), 62–69. doi:10.13189/ms.2016.040203

Cavus, M., & Yazici, B. (2020). Testing the equality of normal distributed and independent groups’ means under unequal variances by doex package. The R Journal, 12(2), 134. doi:10.32614/RJ-2021-008

Scott, A. J., & Smith, T. M. F. (1971). Interval estimates for linear combinations of means. Applied Statistics, 20(3), 276–285. doi:10.2307/2346757

Yiğit, E., & Gökpinar, F. (2010). A simulation study on tests for one-way ANOVA under the unequal variance assumption. Communications, Faculty Of Science, University of Ankara, 15–34. doi:10.1501/Commua1_0000000660

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
