Module stikpetP.effect_sizes.eff_size_hedges_g_is
Expand source code
from statistics import mean, variance
from math import gamma
import pandas as pd
def es_hedges_g_is(catField, scaleField, categories=None, dmu=0, varWeighted=True, corr=None):
'''
Hedges g / Cohen ds (independent samples)
-----------------------------------------
An effect size measure when comparing two means. A few different variations are available. See the details for more information on them.
The measure is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/EffectSizes/HedgesG.html)
Parameters
----------
catField : dataframe or list
the categorical data
scaleField : dataframe or list
the scores
categories : list, optional
to indicate which two categories of catField to use; otherwise the two most frequently occurring categories will be used.
dmu : float, optional
difference according to null hypothesis (default is 0)
varWeighted : boolean, optional
to indicate the use of weighted variances or not. Default is True.
corr : {None, 'exact', 'hedges', 'durlak', 'xue'}, optional
bias correction to use; 'exact' uses the gamma-function correction, the others are approximations. Default is None (no correction).
Returns
-------
A dataframe with:
* *g*, the effect size value
* *version*, description of the effect size calculated
Notes
------
The formula used is (Hedges, 1981, p. 110):
$$g = \\frac{\\bar{x}_1 - \\bar{x}_2}{s_p}$$
With:
$$s_p = \\sqrt{\\frac{SS_1 + SS_2}{n - 2}}$$
$$SS_i = \\sum_{j=1}^{n_i} \\left(x_{i,j} - \\bar{x}_i\\right)^2$$
$$\\bar{x}_i = \\frac{\\sum_{j=1}^{n_i} x_{i,j}}{n_i}$$
*Symbols used:*
* \\(x_{i,j}\\) the j-th score in category i
* \\(n_i\\) the number of scores in category i
This is also what Cohen refers to as \\(d_s\\) (Cohen, 1988, p. 66). If *dmu* is not 0, it is subtracted from the difference in means in the numerator.
By default the weighted formula shown above is used for \\(s_p\\). Sometimes, however, the unweighted version is used; if *varWeighted=False* the following will be used instead:
$$s_p = \\sqrt{\\frac{s_1^2 + s_2^2}{2}}$$
Hedges proposes the following exact bias correction (Hedges, 1981, p. 111):
$$g_{c} = g \\times\\frac{\\Gamma\\left(m\\right)}{\\Gamma\\left(m - \\frac{1}{2}\\right)\\times\\sqrt{m}}$$
With:
$$m = \\frac{df}{2}$$
$$df = n_1 + n_2 - 2 = n - 2$$
*Symbols used:*
* \\(df\\) the degrees of freedom
* \\(n\\) the sample size (i.e. the number of scores)
* \\(\\Gamma\\left(\\dots\\right)\\) the gamma function
The formula used for the approximation of this correction from Hedges (1981, p. 114) (corr="hedges"):
$$g_c = g \\times\\left(1 - \\frac{3}{4\\times df - 1}\\right)$$
This approximation can also be found in Hedges and Olkin (1985, p. 81) and Cohen (1988, p. 66).
The formula used for the approximation from Durlak (2009, p. 927) (corr="durlak"):
$$g_c = g \\times\\frac{n - 3}{n - 2.25} \\times\\sqrt{\\frac{n - 2}{n}}$$
The formula used for the approximation from Xue (2020, p. 3) (corr="xue"):
$$g_c = g \\times \\sqrt[12]{1 - \\frac{9}{df} + \\frac{69}{2\\times df^2} - \\frac{72}{df^3} + \\frac{687}{8\\times df^4} - \\frac{441}{8\\times df^5} + \\frac{247}{16\\times df^6}}$$
Before, After and Alternatives
------------------------------
Before the effect size you might want to run a test. Various options include [ts_student_t_os](../tests/test_student_t_os.html#ts_student_t_os) for One-Sample Student t-Test, [ts_trimmed_mean_os](../tests/test_trimmed_mean_os.html#ts_trimmed_mean_os) for One-Sample Trimmed (Yuen or Yuen-Welch) Mean Test, or [ts_z_os](../tests/test_z_os.html#ts_z_os) for One-Sample Z Test.
After obtaining the measure, you might want to use the rules-of-thumb for Cohen d<sub>s</sub>: [th_cohen_d()](../other/thumb_cohen_d.html).
Alternative effect sizes include: [Common Language](../effect_sizes/eff_size_common_language_is.html), [Cohen d_s](../effect_sizes/eff_size_hedges_g_is.html), [Cohen U](../effect_sizes/eff_size_cohen_u.html), [Hedges g](../effect_sizes/eff_size_hedges_g_is.html), [Glass delta](../effect_sizes/eff_size_glass_delta.html)
or the correlation coefficients: [biserial](../correlations/cor_biserial.html), [point-biserial](../effect_sizes/cor_point_biserial.html)
References
----------
Cohen, J. (1988). *Statistical power analysis for the behavioral sciences* (2nd ed.). L. Erlbaum Associates.
Durlak, J. A. (2009). How to select, calculate, and interpret effect sizes. *Journal of Pediatric Psychology, 34*(9), 917–928. https://doi.org/10.1093/jpepsy/jsp004
Hedges, L. V. (1981). Distribution Theory for Glass’s Estimator of Effect Size and Related Estimators. *Journal of Educational Statistics, 6*(2), 107–128. https://doi.org/10.2307/1164588
Hedges, L. V., & Olkin, I. (1985). *Statistical methods for meta-analysis*. Academic Press.
Xue, X. (2020). Improved approximations of Hedges’ g*. https://doi.org/10.48550/arXiv.2003.06675
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
--------
Example 1: Dataframe
>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df1['age']
>>> ex1 = ex1.replace("89 OR OLDER", "90")
>>> print(es_hedges_g_is(df1['sex'], ex1, categories=["MALE", "FEMALE"]))
g version
0 -0.045224 Cohen ds / Hedges g (uncorrected)
>>> print(es_hedges_g_is(df1['sex'], ex1, categories=["MALE", "FEMALE"], corr="hedges"))
g version
0 -0.045206 Hedges g (approximation)
>>> print(es_hedges_g_is(df1['sex'], ex1, categories=["MALE", "FEMALE"], corr="durlak"))
g version
0 -0.045183 Hedges g with Durlak approximation
>>> print(es_hedges_g_is(df1['sex'], ex1, categories=["MALE", "FEMALE"], corr="xue"))
g version
0 -0.045206 Hedges g with Xue approximation
Example 2: List
>>> scores = [20,50,80,15,40,85,30,45,70,60, None, 90,25,40,70,65, None, 70,98,40]
>>> groups = ["nat.","int.","int.","nat.","int.", "int.","nat.","nat.","int.","int.","int.","int.","int.","int.","nat.", "int." ,None,"nat.","int.","int."]
>>> es_hedges_g_is(groups, scores)
g version
0 0.858201 Cohen ds / Hedges g (uncorrected)
'''
#convert to pandas series if needed
if type(catField) is list:
catField = pd.Series(catField)
if type(scaleField) is list:
scaleField = pd.Series(scaleField)
#combine as one dataframe
df = pd.concat([catField, scaleField], axis=1)
df = df.dropna()
#the two categories
if categories is not None:
cat1 = categories[0]
cat2 = categories[1]
else:
cat1 = df.iloc[:,0].value_counts().index[0]
cat2 = df.iloc[:,0].value_counts().index[1]
#separate the scores for each category
x1 = list(df.iloc[:,1][df.iloc[:,0] == cat1])
x2 = list(df.iloc[:,1][df.iloc[:,0] == cat2])
#make sure they are floats
x1 = [float(x) for x in x1]
x2 = [float(x) for x in x2]
n1 = len(x1)
n2 = len(x2)
n = n1 + n2
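#statistics.variance uses the sample (n - 1) denominator, so var_i = SS_i/(n_i - 1)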
var1 = variance(x1)
var2 = variance(x2)
m1 = mean(x1)
m2 = mean(x2)
#determine the sum of squared deviations from the mean per category
ss1 = var1*(n1 - 1)
ss2 = var2*(n2 - 1)
if varWeighted:
sp = ((ss1 + ss2)/(n - 2))**0.5
else:
sp = ((var1 + var2)/2)**0.5
#determine Hedges g (Cohen's d_s)
g = (m1 - m2 - dmu)/sp
c = 1
comment = "Cohen ds (Hedges g (uncorrected)"
if corr is not None:
if (corr=="exact"):
if (n - 2 < 171):
c = gamma((n - 2)/2)/(((n - 2)/2)**0.5 * gamma((n - 3)/2))
comment = "Hedges g (exact method)"
else:
print("WARNING: exact method could not be computed due to large sample size, approximation used instead")
c = 1 - 3/(4*(n - 2) - 1)
comment = "Hedges g (approximation)"
elif(corr=="hedges"):
c = 1 - 3/(4*(n - 2) - 1)
comment = "Hedges g (approximation)"
elif(corr=="durlak"):
c = (n - 3)/(n - 2.25)*((n - 2)/n)**0.5
comment = "Hedges g with Durlak approximation"
elif(corr=="xue"):
# Xue (2020, p. 3) approximation, in terms of the degrees of freedom
dof = n - 2
c = (1 - 9/dof + 69/(2*dof**2) - 72/(dof**3) + 687/(8*dof**4) - 441/(8*dof**5) + 247/(16*dof**6))**(1/12)
comment = "Hedges g with Xue approximation"
g = g*c
#the results
colnames = ["g", "version"]
results = pd.DataFrame([[g, comment]], columns=colnames)
return(results)
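As a quick sanity check of the formulas in the notes, the following minimal sketch (an illustration, not part of the package) recomputes the uncorrected g by hand for the list data of Example 2, dropping the two pairs that contain a missing value; it should reproduce the 0.858201 reported there.
from statistics import mean, variance

scores = [20, 50, 80, 15, 40, 85, 30, 45, 70, 60, None, 90, 25, 40, 70, 65, None, 70, 98, 40]
groups = ["nat.", "int.", "int.", "nat.", "int.", "int.", "nat.", "nat.", "int.", "int.",
          "int.", "int.", "int.", "int.", "nat.", "int.", None, "nat.", "int.", "int."]

#keep only the pairs where both the group and the score are present
pairs = [(grp, val) for grp, val in zip(groups, scores) if grp is not None and val is not None]

#es_hedges_g_is picks the two most frequent categories when none are given;
#here that is "int." (12 scores) followed by "nat." (6 scores)
x1 = [float(val) for grp, val in pairs if grp == "int."]
x2 = [float(val) for grp, val in pairs if grp == "nat."]
n1, n2 = len(x1), len(x2)
n = n1 + n2

#SS_i = s_i^2 * (n_i - 1), pooled (weighted) standard deviation s_p
ss1 = variance(x1)*(n1 - 1)
ss2 = variance(x2)*(n2 - 1)
sp = ((ss1 + ss2)/(n - 2))**0.5

g = (mean(x1) - mean(x2))/sp
print(round(g, 6))  #expected to agree with Example 2: 0.858201
Setting varWeighted=False in the function would simply replace the pooled line with sp = ((variance(x1) + variance(x2))/2)**0.5.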
Functions
def es_hedges_g_is(catField, scaleField, categories=None, dmu=0, varWeighted=True, corr=None)
Hedges g / Cohen ds (independent samples). An effect size measure when comparing two means; a few different variations are available (see the notes in the docstring above). The measure is also described at PeterStatistics.com.
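As an illustration of how the corr options relate, the sketch below (again an illustration, not part of the package) evaluates the exact gamma-based correction factor next to the Hedges, Durlak, and Xue approximations for a few total sample sizes; all four factors are below 1 and move toward 1 as the sample grows.
from math import gamma

def c_exact(df):
    #Hedges (1981, p. 111): Gamma(m) / (Gamma(m - 1/2) * sqrt(m)), with m = df/2
    m = df/2
    return gamma(m)/(gamma(m - 0.5)*m**0.5)

def c_hedges(df):
    #Hedges (1981, p. 114) approximation
    return 1 - 3/(4*df - 1)

def c_durlak(n):
    #Durlak (2009, p. 927), written in terms of the total sample size n
    return (n - 3)/(n - 2.25)*((n - 2)/n)**0.5

def c_xue(df):
    #Xue (2020, p. 3) twelfth-root approximation
    return (1 - 9/df + 69/(2*df**2) - 72/df**3 + 687/(8*df**4)
            - 441/(8*df**5) + 247/(16*df**6))**(1/12)

for n in (10, 20, 50, 200):
    df = n - 2
    print(n, round(c_exact(df), 5), round(c_hedges(df), 5),
          round(c_durlak(n), 5), round(c_xue(df), 5))
With samples as large as the GSS data in Example 1 the factors are essentially 1, which is why the corrected values there differ from the uncorrected g only from the fifth decimal onward.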