Module stikpetP.tests.test_ozdemir_kurt_owa
Expand source code
import pandas as pd
from scipy.stats import chi2
from scipy.stats import norm
from numpy import log
def ts_ozdemir_kurt_owa(nomField, scaleField, categories=None):
'''
Özdemir-Kurt B2
------------------------------
Tests if the means (averages) of each category could be the same in the population.
If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, and at least two categories will then have a different mean on the scaleField score in the population.
There are quite a few alternatives for this; the stikpet library has Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes of ts_fisher_owa() for some discussion of the differences.
Parameters
----------
nomField : pandas series
data with categories
scaleField : pandas series
data with the scores
categories : list or dictionary, optional
the categories to use from nomField
Returns
-------
Dataframe with:
* *n*, the sample size
* *statistic*, the B2 value
* *df*, degrees of freedom
* *p-value*, the p-value (significance)
Notes
-----
The formula used (Özdemir & Kurt, 2006, pp. 85-86):
$$ B^2 = \\sum_{j=1}^k\\left(c_j\\times\\sqrt{\\ln\\left(1+\\frac{t_j^2}{v_j}\\right)}\\right)^2 $$
Reject null hypothesis if \\( B^2 \\gt \\chi_{crit}^2\\)
$$ df = k - 1 $$
$$ B^2 \\sim \\chi^2\\left(df\\right)$$
With:
$$ \\bar{y}_w = \\sum_{j=1}^k h_j\\times \\bar{x}_j$$
$$ h_j = \\frac{w_j}{w}$$
$$ w_j = \\frac{n_j}{s_j^2}$$
$$ w = \\sum_{j=1}^k w_j$$
$$ \\bar{x}_j = \\frac{\\sum_{i=1}^{n_j} x_{i,j}}{n_j}$$
$$ s_j^2 = \\frac{\\sum_{i=1}^{n_j} \\left(x_{i,j} - \\bar{x}_j\\right)^2}{n_j - 1}$$
$$ t_j = \\frac{\\bar{x}_j - \\bar{y}_w}{\\sqrt{\\frac{s_j^2}{n_j}}} $$
$$ v_j = n_j - 1$$
$$ c_j = \\frac{4\\times v_j^2 + \\frac{5\\times\\left(2\\times z_{crit}^2+3\\right)}{24}}{4\\times v_j^2+v_j+\\frac{4\\times z_{crit}^2+9}{12}}\\times\\sqrt{v_j} $$
$$ z_{crit} = Q\\left(1 - \\frac{\\alpha}{2}\\right) $$
$$ \\chi_{crit}^2 = \\chi_{crit}^2\\left(1 - \\alpha, df\\right) $$
*Symbols used:*
* \\(k\\), for the number of categories
* \\(x_{i,j}\\), for the i-th score in category j
* \\(n_j\\), the sample size of category j
* \\(\\bar{x}_j\\), the sample mean of category j
* \\(s_j^2\\), the sample variance of the scores in category j
* \\(w_j\\), the weight for category j
* \\(h_j\\), the adjusted weight for category j
* \\(df\\), the degrees of freedom
* \\(Q\\left(x\\right)\\), the quantile function (= inverse cumulative distribution = percentile function = percent-point function).
The function uses a simple iteration search for an approximate p-value.
References
----------
Özdemir, A. F., & Kurt, S. (2006). One way fixed effect analysis of variance under variance heterogeneity and a solution proposal. *Selçuk Journal of Applied Mathematics, 7*(2), 81–90.
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
'''
if isinstance(nomField, list):
nomField = pd.Series(nomField)
if isinstance(scaleField, list):
scaleField = pd.Series(scaleField)
data = pd.concat([nomField, scaleField], axis=1)
data.columns = ["category", "score"]
#remove unused categories
if categories is not None:
data = data[data.category.isin(categories)]
#Remove rows with missing values and reset index
data = data.dropna()
data = data.reset_index(drop=True)
#overall sample size
n = len(data["category"])
#sample sizes, variances and means per category
nj = data.groupby('category').count()
sj2 = data.groupby('category').var()
mj = data.groupby('category').mean()
#number of categories
k = len(mj)
sej = (sj2/nj)**0.5
wj = nj/sj2
w = float(wj.sum())
hj = wj/w
wm = float((hj*mj).sum())
vj = nj - 1
tj = (mj - wm)/sej
pVal = 0.05
zcrit = norm.ppf(1-pVal/2)
cj = (4*vj**2 + 5*(2*zcrit**2 + 3)/24)/(4*vj**2 + vj + (4*zcrit**2+9)/12) * vj**0.5
b2 = float(((cj**2*log(1+tj**2/vj))).sum())
df = k - 1
chiCrit = chi2.ppf(1 - pVal, df)
nIter = 0
pLow = 0
pHigh = 1
while abs(b2 - chiCrit) > 1e-12 and nIter < 100:
if chiCrit < b2:
pHigh = pVal
pVal = (pLow + pVal)/2
elif chiCrit > b2:
pLow = pVal
pVal = (pHigh + pVal)/2
zcrit = norm.ppf(1-pVal/2)
cj = (4*vj**2 + 5*(2*zcrit**2 + 3)/24)/(4*vj**2 + vj + (4*zcrit**2+9)/12) * vj**0.5
b2 = float(((cj**2*log(1+tj**2/vj))).sum())
chiCrit = chi2.ppf(1 - pVal, df)
nIter = nIter + 1
#approximate p-value from the chi-square survival function at the converged B2
pVal = chi2.sf(b2, df)
#results
res = pd.DataFrame([[n, b2, df, pVal]])
res.columns = ["n", "statistic", "df", "p-value"]
return res
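The core of the computation can be sketched directly from the formulas in the notes. The group labels and scores below are made up for illustration, and the p-value here is taken straight from the χ² survival function rather than from the iterative critical-value search used in the function itself:

```python
import pandas as pd
from numpy import log
from scipy.stats import chi2, norm

# made-up example data: three groups with unequal variances
data = pd.DataFrame({
    "category": ["a"]*5 + ["b"]*5 + ["c"]*5,
    "score": [20, 20, 21, 22, 25, 18, 22, 24, 29, 30, 10, 12, 19, 25, 40],
})

g = data.groupby("category")["score"]
nj = g.count()                     # n_j, sample size per category
mj = g.mean()                      # sample mean per category
sj2 = g.var()                      # sample variance per category
vj = nj - 1                        # v_j = n_j - 1

wj = nj / sj2                      # w_j = n_j / s_j^2
hj = wj / wj.sum()                 # h_j, adjusted weights
yw = (hj * mj).sum()               # weighted overall mean
tj = (mj - yw) / (sj2 / nj)**0.5   # t_j

alpha = 0.05
zcrit = norm.ppf(1 - alpha/2)
cj = (4*vj**2 + 5*(2*zcrit**2 + 3)/24) / (4*vj**2 + vj + (4*zcrit**2 + 9)/12) * vj**0.5
b2 = float((cj**2 * log(1 + tj**2 / vj)).sum())
df = len(nj) - 1                   # k - 1
p = float(chi2.sf(b2, df))         # right-tail area of chi-square(df)
print(b2, df, p)
```

Since the weights w_j shrink for high-variance groups, the group "c" above contributes little to the weighted mean despite its extreme scores.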
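The "simple iteration search" mentioned in the notes is in effect a bisection on α: because c_j depends on z_crit(α), the statistic B² must be recomputed each step until it meets the χ² critical value at that α. A minimal standalone sketch, with hypothetical t_j and v_j values:

```python
import numpy as np
from numpy import log
from scipy.stats import chi2, norm

def b2_stat(tj, vj, alpha):
    # B^2 recomputed for a given alpha, since c_j depends on z_crit(alpha)
    z = norm.ppf(1 - alpha/2)
    cj = (4*vj**2 + 5*(2*z**2 + 3)/24) / (4*vj**2 + vj + (4*z**2 + 9)/12) * vj**0.5
    return float((cj**2 * log(1 + tj**2 / vj)).sum())

tj = np.array([-0.46, 1.16, -0.15])   # hypothetical t_j values
vj = np.array([4, 4, 4])              # hypothetical v_j = n_j - 1
df = len(tj) - 1

# bisect on alpha until B^2(alpha) meets the chi-square critical value
pLow, pHigh, alpha = 0.0, 1.0, 0.05
for _ in range(100):
    b2 = b2_stat(tj, vj, alpha)
    crit = chi2.ppf(1 - alpha, df)
    if abs(b2 - crit) < 1e-12:
        break
    if b2 > crit:                     # statistic past the critical value: p is lower
        pHigh = alpha
        alpha = (pLow + alpha) / 2
    else:                             # statistic below the critical value: p is higher
        pLow = alpha
        alpha = (pHigh + alpha) / 2
print(alpha)
```

Because B² varies only mildly with α while the critical value is monotone in α, the bisection converges quickly; 100 iterations is far more than needed.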