Module stikpetP.tests.test_alexander_govern_owa

import pandas as pd
from scipy.stats import chi2
from numpy import log

def ts_alexander_govern_owa(nomField, scaleField, categories=None):
    '''
    Alexander-Govern One-Way ANOVA
    ------------------------------
    Tests if the means (averages) of each category could be the same in the population.
        
    If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected; at least two categories will then have a different mean on the scaleField score in the population.
    
    Schneider and Penfield (1997) looked at the Welch, Alexander-Govern and the James test (they ignored the Brown-Forsythe since they found it to perform worse than Welch or James), and concluded: “Under variance heterogeneity, Alexander-Govern’s approximation was not only comparable to the Welch test and the James second-order test but was superior, in certain instances, when coupled with the power results for those tests” (p. 285).
    
    There are quite a few alternatives for this test; the stikpet library has Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes of ts_fisher_owa() for some discussion on the differences.
    
    Parameters
    ----------
    nomField : pandas series
        data with categories
    scaleField : pandas series
        data with the scores
    categories : list or dictionary, optional
        the categories to use from nomField
    
    Returns
    -------
    Dataframe with:
    
    * *n*, the sample size
    * *statistic*, the test statistic (chi-square value)
    * *df*, degrees of freedom
    * *p-value*, the p-value (significance)
    
    Notes
    -----
    The formula used (Alexander & Govern, 1994, pp. 92-94):
    $$ A = \\sum_{j=1}^k z_j^2 $$
    $$ df = k - 1 $$
    $$ A \\sim \\chi^2\\left(df\\right) $$
    
    With:
    $$ z_j = c_j + \\frac{c_j^3 + 3\\times c_j}{b_j} - \\frac{4\\times c_j^7 + 33\\times c_j^5 + 240\\times c_j^3 + 855\\times c_j}{10\\times b_j^2 + 8\\times b_j\\times c_j^4 + 1000\\times b_j} $$
    $$ c_j = \\sqrt{a_j\\times\\ln\\left(1 + \\frac{t_j^2}{n_j - 1}\\right)} $$
    $$ b_j = 48\\times a_j^2 $$
    $$ a_j = n_j - 1.5 $$
    $$ t_j = \\frac{\\bar{x}_j - \\bar{y}_w}{\\sqrt{\\frac{s_j^2}{n_j}}} $$
    $$ \\bar{y}_w = \\sum_{j=1}^k h_j\\times \\bar{x}_j$$
    $$ h_j = \\frac{w_j}{w}$$
    $$ w_j = \\frac{n_j}{s_j^2}$$
    $$ w = \\sum_{j=1}^k w_j$$
    $$ s_j^2 = \\frac{\\sum_{i=1}^{n_j} \\left(x_{i,j} - \\bar{x}_j\\right)^2}{n_j - 1}$$
    $$ \\bar{x}_j = \\frac{\\sum_{i=1}^{n_j} x_{i,j}}{n_j}$$
    
    *Symbols used:*
    
    * \\(k\\), for the number of categories
    * \\(x_{i,j}\\), for the i-th score in category j
    * \\(n_j\\), the sample size of category j
    * \\(\\bar{x}_j\\), the sample mean of category j
    * \\(s_j^2\\), the sample variance of the scores in category j
    * \\(n\\), the total sample size
    * \\(df\\), the degrees of freedom.

    References
    ----------
    Alexander, R. A., & Govern, D. M. (1994). A new and simpler approximation for ANOVA under variance heterogeneity. *Journal of Educational Statistics, 19*(2), 91–101. doi:10.2307/1165140
    
    Schneider, P. J., & Penfield, D. A. (1997). Alexander and Govern’s approximation: Providing an alternative to ANOVA under variance heterogeneity. *The Journal of Experimental Education, 65*(3), 271–286. doi:10.1080/00220973.1997.9943459

    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    '''
    if isinstance(nomField, list):
        nomField = pd.Series(nomField)
        
    if isinstance(scaleField, list):
        scaleField = pd.Series(scaleField)
        
    data = pd.concat([nomField, scaleField], axis=1)
    data.columns = ["category", "score"]
    
    #remove unused categories
    if categories is not None:
        data = data[data.category.isin(categories)]
    
    #remove rows with missing values and reset the index
    data = data.dropna()
    data = data.reset_index(drop=True)
    
    #overall sample size
    n = len(data)
    
    #sample sizes, variances and means per category
    byCat = data.groupby('category')['score']
    nj = byCat.count()
    sj2 = byCat.var()
    mj = byCat.mean()
    
    #number of categories
    k = len(mj)
    
    #standard error per category
    sej = (sj2/nj)**0.5
    #weights h_j = w_j / w, with w_j = n_j / s_j^2
    ssej = (1/sej**2).sum()
    wj = 1/(sej**2 * ssej)
    #variance-weighted grand mean
    ym = (wj*mj).sum()
    #one-sample t-statistic per category
    tj = (mj - ym)/sej
    #normalizing transformation of t_j into z_j
    aj = nj - 1.5
    bj = 48*aj**2
    cj = (aj*log(1+tj**2/(nj - 1)))**0.5
    zj = cj + (cj**3 + 3*cj)/bj - (4*cj**7 + 33*cj**5 + 240*cj**3 + 855*cj)/(10*bj**2 + 8*bj*cj**4 + 1000*bj)
    
    a = float((zj**2).sum())
    df = k - 1
    
    pVal = chi2.sf(a, df)
    
    #results
    res = pd.DataFrame([[n, a, df, pVal]])
    res.columns = ["n", "statistic", "df", "p-value"]
    
    return res
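The formulas from the Notes section can be sanity-checked by computing the statistic step by step with plain NumPy and comparing against SciPy's own implementation of this test (`scipy.stats.alexandergovern`, available since SciPy 1.7). The three samples below are made up purely for illustration:

```python
import numpy as np
from scipy.stats import alexandergovern, chi2

# three made-up samples, one per category
samples = [np.array([1.0, 2.0, 3.0, 4.0]),
           np.array([2.0, 4.0, 6.0, 8.0]),
           np.array([10.0, 12.0, 11.0, 9.0])]

nj = np.array([len(s) for s in samples])          # n_j
mj = np.array([s.mean() for s in samples])        # sample mean per category
sj2 = np.array([s.var(ddof=1) for s in samples])  # s_j^2

wj = nj / sj2                       # w_j
hj = wj / wj.sum()                  # h_j = w_j / w
ym = (hj * mj).sum()                # variance-weighted grand mean
tj = (mj - ym) / np.sqrt(sj2 / nj)  # one-sample t per category

# normalizing transformation of t_j into z_j
aj = nj - 1.5
bj = 48 * aj**2
cj = np.sqrt(aj * np.log(1 + tj**2 / (nj - 1)))
zj = (cj + (cj**3 + 3*cj) / bj
         - (4*cj**7 + 33*cj**5 + 240*cj**3 + 855*cj)
           / (10*bj**2 + 8*bj*cj**4 + 1000*bj))

A = (zj**2).sum()                   # test statistic
pVal = chi2.sf(A, len(samples) - 1)

ref = alexandergovern(*samples)     # SciPy's implementation
```

If the step-by-step `A` and `pVal` match `ref.statistic` and `ref.pvalue`, the formulas were applied consistently. Note that \(c_j\) enters only through \(z_j^2\), so taking the non-negative square root does not affect the statistic.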

Functions

def ts_alexander_govern_owa(nomField, scaleField, categories=None)
