Module stikpetP.tests.test_box_owa

import pandas as pd
from scipy.stats import f

def ts_box_owa(nomField, scaleField, categories=None):
    '''
    Box One-Way ANOVA
    -----------------
    Tests if the means (averages) of each category could be the same in the population.
    
    Box proposed a correction to the original Fisher one-way ANOVA, on both the test-statistic and the degrees of freedom.
    
    If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, and at least two categories are then expected to have a different mean on the scaleField score in the population.

    There are quite a few alternatives for this test. The stikpet library offers Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes of ts_fisher_owa() for some discussion of the differences.
    
    Parameters
    ----------
    nomField : pandas series or list
        data with categories
    scaleField : pandas series or list
        data with the scores
    categories : list or dictionary, optional
        the categories to use from nomField
    
    Returns
    -------
    Dataframe with:
    
    * *n*, the sample size
    * *k*, the number of categories
    * *statistic*, the test statistic (F value)
    * *df1*, degrees of freedom 1
    * *df2*, degrees of freedom 2
    * *p-value*, the p-value (significance)
    
    Notes
    -----
    The formula used (Box, 1954, p. 299):
    $$ F_{Box} = \\frac{F_{Fisher}}{c} $$
    $$ df_1^* = \\frac{\\left(\\sum_{j=1}^k\\left(n-n_j\\right)\\times s_j^2\\right)^2}{\\left(\\sum_{j=1}^k n_j\\times s_j^2\\right)^2 + n\\times\\sum_{j=1}^k\\left(n - 2\\times n_j\\right)\\times s_j^4} $$
    $$ df_2^* = \\frac{\\left(\\sum_{j=1}^k \\left(n_j-1\\right)\\times s_j^2\\right)^2}{\\sum_{j=1}^k\\left(n_j-1\\right)\\times s_j^4}$$
    $$ F_{Box} \\sim F\\left(df_1^*, df_2^*\\right) $$
    
    With:
    $$ c = \\frac{n-k}{n\\times\\left(k-1\\right)}\\times\\frac{\\sum_{j=1}^k\\left(n-n_j\\right)\\times s_j^2}{\\sum_{j=1}^k\\left(n_j-1\\right)\\times s_j^2} $$
    $$ s_j^2 = \\frac{\\sum_{i=1}^{n_j} \\left(x_{i,j} - \\bar{x}_j\\right)^2}{n_j - 1}$$
    $$ \\bar{x}_j = \\frac{\\sum_{i=1}^{n_j} x_{i,j}}{n_j}$$
    
    *Symbols used:*
    
    * \\(k\\), for the number of categories
    * \\(x_{i,j}\\), for the i-th score in category j
    * \\(n_j\\), the sample size of category j
    * \\(\\bar{x}_j\\), the sample mean of category j
    * \\(s_j^2\\), the sample variance of the scores in category j
    * \\(n\\), the total sample size
    * \\(df_i^*\\), the i-th adjusted degrees of freedom
    * \\(F_{Fisher}\\), the F-statistic from the regular one-way ANOVA
    
    The \\(F_{Box}\\) value is the same as the test statistic of the Brown-Forsythe test for means. The R functions in the doex and onewaytests libraries actually use this statistic. They also use a different formula for the second degrees of freedom, which leads to a different result:
    
    $$ df_2^* = \\frac{\\left(\\sum_{j=1}^k\\left(1 - \\frac{n_j}{n}\\right)\\times s_j^2\\right)^2}{\\frac{\\sum_{j=1}^k\\left(1 - \\frac{n_j}{n}\\right)^2\\times s_j^4}{n-k}} $$
    
    Asiribo and Gurland (1990) derive the same correction as Box. Their notation for \\(df_1^*\\) is different, but gives the same result.
    
    References
    ----------
    Asiribo, O., & Gurland, J. (1990). Coping with variance heterogeneity. *Communications in Statistics - Theory and Methods, 19*(11), 4029–4048. doi:10.1080/03610929008830427
    
    Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems, I: Effect of inequality of variance in the one-way classification. *The Annals of Mathematical Statistics, 25*(2), 290–302. doi:10.1214/aoms/1177728786
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    '''
    #convert plain lists to pandas series
    if isinstance(nomField, list):
        nomField = pd.Series(nomField)

    if isinstance(scaleField, list):
        scaleField = pd.Series(scaleField)
        
    data = pd.concat([nomField, scaleField], axis=1)
    data.columns = ["category", "score"]
    
    #remove unused categories
    if categories is not None:
        data = data[data.category.isin(categories)]
    
    #Remove rows with missing values and reset index
    data = data.dropna()
    data = data.reset_index(drop=True)
    
    #overall n, mean and ss
    n = len(data["category"])
    m = data.score.mean()
    sst = data.score.var()*(n-1)
    
    #sample sizes, variances and means per category (each a Series indexed by category)
    nj = data.groupby('category')['score'].count()
    sj2 = data.groupby('category')['score'].var()
    mj = data.groupby('category')['score'].mean()
    
    #number of categories
    k = len(mj)
    
    #Fisher's regular F-statistic
    ssb = float((nj*(mj-m)**2).sum())
    ssw = sst - ssb
    dfb = k - 1
    dfw = n - k
    dft = n - 1    
    msb = ssb/dfb
    msw = ssw/dfw    
    fVal = msb/msw
    
    #Box correction:
    c = (n - k)/(n*(k-1)) * ((n - nj)*sj2).sum() / ((nj - 1)*sj2).sum()
    fVal = float(fVal / c)
    
    #Box degrees of freedom
    df1 = float(((n - nj)*sj2).sum()**2 / ((nj*sj2).sum()**2 + n * ((n - 2*nj)*sj2**2).sum()))
    df2 = float(((nj - 1)*sj2).sum()**2 / ((nj - 1)*sj2**2).sum())
    
    pVal = f.sf(fVal, df1, df2)
    
    #results
    res = pd.DataFrame([[n, k, fVal, df1, df2, pVal]])
    res.columns = ["n", "k", "statistic", "df1", "df2", "p-value"]
    
    return res

Functions

def ts_box_owa(nomField, scaleField, categories=None)

Box One-Way ANOVA

Tests if the means (averages) of each category could be the same in the population.

Box proposed a correction to the original Fisher one-way ANOVA, on both the test-statistic and the degrees of freedom.

If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, and at least two categories are then expected to have a different mean on the scaleField score in the population.

There are quite a few alternatives for this test. The stikpet library offers Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes of ts_fisher_owa() for some discussion of the differences.

Parameters

nomField : pandas series or list
data with categories
scaleField : pandas series or list
data with the scores
categories : list or dictionary, optional
the categories to use from nomField

Returns

Dataframe with:
 
  • n, the sample size
  • k, the number of categories
  • statistic, the test statistic (F value)
  • df1, degrees of freedom 1
  • df2, degrees of freedom 2
  • p-value, the p-value (significance)

Notes

The formula used (Box, 1954, p. 299):

$$ F_{Box} = \frac{F_{Fisher}}{c} $$

$$ df_1^* = \frac{\left(\sum_{j=1}^k\left(n-n_j\right)\times s_j^2\right)^2}{\left(\sum_{j=1}^k n_j\times s_j^2\right)^2 + n\times\sum_{j=1}^k\left(n - 2\times n_j\right)\times s_j^4} $$

$$ df_2^* = \frac{\left(\sum_{j=1}^k \left(n_j-1\right)\times s_j^2\right)^2}{\sum_{j=1}^k\left(n_j-1\right)\times s_j^4} $$

$$ F_{Box} \sim F\left(df_1^*, df_2^*\right) $$

With:

$$ c = \frac{n-k}{n\times\left(k-1\right)}\times\frac{\sum_{j=1}^k\left(n-n_j\right)\times s_j^2}{\sum_{j=1}^k\left(n_j-1\right)\times s_j^2} $$

$$ s_j^2 = \frac{\sum_{i=1}^{n_j} \left(x_{i,j} - \bar{x}_j\right)^2}{n_j - 1} $$

$$ \bar{x}_j = \frac{\sum_{i=1}^{n_j} x_{i,j}}{n_j} $$

Symbols used:

  • \(k\), for the number of categories
  • \(x_{i,j}\), for the i-th score in category j
  • \(n_j\), the sample size of category j
  • \(n\), the total sample size
  • \(\bar{x}_j\), the sample mean of category j
  • \(s_j^2\), the sample variance of the scores in category j
  • \(df_i^*\), the i-th adjusted degrees of freedom
  • \(F_{Fisher}\), the F-statistic from the regular one-way ANOVA
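
As a rough illustration of the formulas above, the following sketch traces the Box correction by hand on a small made-up data set, using only the Python standard library. The helper name `box_correction` and the data values are assumptions for illustration and are not part of stikpetP. The p-value step is left out because it needs an F-distribution such as `scipy.stats.f`.

```python
from statistics import mean, variance


def box_correction(groups):
    """Return (F_box, df1, df2, c) for a list of score lists, one per category.

    Hypothetical helper that follows the formulas in the notes.
    """
    nj = [len(g) for g in groups]        # n_j, sample size per category
    mj = [mean(g) for g in groups]       # sample mean per category
    sj2 = [variance(g) for g in groups]  # s_j^2, with n_j - 1 in the denominator
    n, k = sum(nj), len(groups)
    grand_mean = sum(sum(g) for g in groups) / n

    # Fisher's regular one-way ANOVA F-statistic
    ssb = sum(n_j * (m_j - grand_mean) ** 2 for n_j, m_j in zip(nj, mj))
    ssw = sum((n_j - 1) * s2 for n_j, s2 in zip(nj, sj2))
    f_fisher = (ssb / (k - 1)) / (ssw / (n - k))

    # Box correction factor c and corrected statistic
    weighted = sum((n - n_j) * s2 for n_j, s2 in zip(nj, sj2))
    c = (n - k) / (n * (k - 1)) * weighted / ssw
    f_box = f_fisher / c

    # Adjusted degrees of freedom
    df1 = weighted ** 2 / (
        sum(n_j * s2 for n_j, s2 in zip(nj, sj2)) ** 2
        + n * sum((n - 2 * n_j) * s2 ** 2 for n_j, s2 in zip(nj, sj2))
    )
    df2 = ssw ** 2 / sum((n_j - 1) * s2 ** 2 for n_j, s2 in zip(nj, sj2))
    return f_box, df1, df2, c


groups = [[1, 2, 3], [2, 4, 6], [0, 3, 3, 6]]  # made-up scores for three categories
f_box, df1, df2, c = box_correction(groups)
print(f"c={c:.4f}  F_box={f_box:.4f}  df1={df1:.4f}  df2={df2:.4f}")
# prints: c=0.8875  F_box=0.8451  df1=1.7258  df2=5.5211
```

Note that both adjusted degrees of freedom are non-integers here, which is why the p-value has to come from an F-distribution that accepts fractional degrees of freedom.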

The \(F_{Box}\) value is the same as the test statistic of the Brown-Forsythe test for means. The R functions in the doex and onewaytests libraries actually use this statistic. They also use a different formula for the second degrees of freedom, which leads to a different result:

$$ df_2^* = \frac{\left(\sum_{j=1}^k\left(1 - \frac{n_j}{n}\right)\times s_j^2\right)^2}{\frac{\sum_{j=1}^k\left(1 - \frac{n_j}{n}\right)^2\times s_j^4}{n-k}} $$
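
To make the difference concrete, the sketch below evaluates both expressions for the second degrees of freedom on made-up group summaries. The function names and the numbers are assumptions for illustration. The alternative formula is transcribed from the expression above, not taken from the source of the R packages.

```python
def df2_box(nj, sj2):
    """Second degrees of freedom according to the Box formula."""
    num = sum((n_j - 1) * s2 for n_j, s2 in zip(nj, sj2)) ** 2
    den = sum((n_j - 1) * s2 ** 2 for n_j, s2 in zip(nj, sj2))
    return num / den


def df2_alternative(nj, sj2):
    """Second degrees of freedom according to the alternative expression."""
    n, k = sum(nj), len(nj)
    num = sum((1 - n_j / n) * s2 for n_j, s2 in zip(nj, sj2)) ** 2
    den = sum((1 - n_j / n) ** 2 * s2 ** 2 for n_j, s2 in zip(nj, sj2)) / (n - k)
    return num / den


nj = [3, 3, 4]         # made-up sample sizes per category
sj2 = [1.0, 4.0, 6.0]  # made-up sample variances per category

box_value = df2_box(nj, sj2)
alt_value = df2_alternative(nj, sj2)
print(f"Box df2 = {box_value:.4f}, alternative df2 = {alt_value:.4f}")
```

With these inputs the two values differ considerably, so the resulting p-values would differ as well even though the test statistic is identical.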

Asiribo and Gurland (1990) derive the same correction as Box. Their notation for \(df_1^*\) is different, but gives the same result.
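
The equality between \(F_{Box}\) and the Brown-Forsythe statistic mentioned above can be checked by substituting \(c\) into \(F_{Box} = F_{Fisher}/c\). The short derivation below is a sketch. The symbols \(SS_b\), \(SS_w\) and \(\bar{x}\) (between-groups sum of squares, within-groups sum of squares and overall mean) are introduced here for illustration and do not appear in the original notes.

```latex
F_{Fisher} = \frac{SS_b / (k-1)}{SS_w / (n-k)}, \qquad
SS_b = \sum_{j=1}^k n_j \left(\bar{x}_j - \bar{x}\right)^2, \qquad
SS_w = \sum_{j=1}^k \left(n_j - 1\right) s_j^2

F_{Box} = \frac{F_{Fisher}}{c}
        = \frac{SS_b \left(n-k\right)}{\left(k-1\right) SS_w}
          \times \frac{n \left(k-1\right) SS_w}{\left(n-k\right) \sum_{j=1}^k \left(n - n_j\right) s_j^2}
        = \frac{SS_b}{\sum_{j=1}^k \left(1 - \frac{n_j}{n}\right) s_j^2}
```

The final expression is the usual form of the Brown-Forsythe test statistic for means.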

References

Asiribo, O., & Gurland, J. (1990). Coping with variance heterogeneity. Communications in Statistics - Theory and Methods, 19(11), 4029–4048. doi:10.1080/03610929008830427

Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems, I: Effect of inequality of variance in the one-way classification. The Annals of Mathematical Statistics, 25(2), 290–302. doi:10.1214/aoms/1177728786

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
