Module stikpetP.effect_sizes.eff_size_hodges_lehmann_is
import pandas as pd
from statistics import median
def es_hodges_lehmann_is(catField, scores, categories=None, levels=None):
    '''
    Hodges-Lehmann Estimator (Independent Samples)
    -----------------------------------------------
    The Hodges-Lehmann estimate is the median of all possible differences between the scores of two sets of data. The authors (Hodges & Lehmann, 1963) describe it as the location shift needed to align two similarly shaped distributions as closely as possible (p. 599).
    It is sometimes described as the difference between the two medians, but that is incorrect: the Hodges-Lehmann estimate often differs from the difference between the two medians.
    This measure is sometimes mentioned as an effect size for the Mann-Whitney U / Wilcoxon rank-sum test (van Geloven, 2018). However, since it is a median of the possible differences, it is not standardized (i.e. it does not range between two fixed values and therefore depends on the scale of the data).

    Parameters
    ----------
    catField : dataframe or list
        the categorical data
    scores : dataframe or list
        the scores
    categories : list, optional
        to indicate which two categories of catField to use; otherwise the two most frequent categories are used.
    levels : list or dictionary, optional
        the scores in order

    Returns
    -------
    HL : float
        the Hodges-Lehmann estimate

    Notes
    -----
    The formula for the Hodges-Lehmann estimator with two samples is (Hodges & Lehmann, 1963, p. 602):
    $$HL = \\text{median}\\left(y_j - x_i | 1 \\leq i \\leq n_x, 1 \\leq j \\leq n_y\\right)$$

    *Symbols used:*

    * \\(x_i\\) the i-th score in category x
    * \\(y_j\\) the j-th score in category y
    * \\(n_x\\), \\(n_y\\) the number of scores in category x and in category y

    A faster method to determine this might be Algorithm 616 (Monahan, 1984), but its Fortran code has not been translated to Python here.

    References
    ----------
    Hodges, J. L., & Lehmann, E. L. (1963). Estimates of location based on rank tests. *The Annals of Mathematical Statistics, 34*(2), 598–611. doi:10.1214/aoms/1177704172

    Monahan, J. F. (1984). Algorithm 616: Fast computation of the Hodges-Lehmann location estimator. *ACM Transactions on Mathematical Software, 10*(3), 265–270. doi:10.1145/1271.319414

    van Geloven, N. (2018, March 13). Mann-Whitney U toets [Wiki]. Wikistatistiek. https://wikistatistiek.amc.nl/Mann-Whitney_U_toets

    Author
    ------
    Made by P. Stikker

    Companion website: https://PeterStatistics.com
    YouTube channel: https://www.youtube.com/stikpet
    Donations: https://www.patreon.com/bePatron?u=19398076
    '''
    #convert to pandas series if needed
    if type(catField) is list:
        catField = pd.Series(catField)
    if type(scores) is list:
        scores = pd.Series(scores)

    #combine as one dataframe and drop rows with a missing category or score
    df = pd.concat([catField, scores], axis=1)
    df = df.dropna()

    #replace the ordinal values if levels is provided, then convert to numeric
    if levels is not None:
        df.iloc[:,1] = df.iloc[:,1].replace(levels)
    df.iloc[:,1] = pd.to_numeric(df.iloc[:,1])

    #the two categories (value_counts orders them by frequency, most frequent first)
    if categories is not None:
        cat1 = categories[0]
        cat2 = categories[1]
    else:
        cat1 = df.iloc[:,0].value_counts().index[0]
        cat2 = df.iloc[:,0].value_counts().index[1]

    #separate the scores for each category
    x1 = list(df.iloc[:,1][df.iloc[:,0] == cat1])
    x2 = list(df.iloc[:,1][df.iloc[:,0] == cat2])

    #make sure they are floats
    x1 = [float(x) for x in x1]
    x2 = [float(x) for x in x2]

    #all of that just so we can determine HL in one line:
    #the median of every difference (second category score minus first category score)
    hl = median([j - i for i in x1 for j in x2])
    return hl
Functions
def es_hodges_lehmann_is(catField, scores, categories=None, levels=None)
Hodges-Lehmann Estimator (Independent Samples)
The Hodges-Lehmann estimate is the median of all possible differences between the scores of two sets of data. The authors (Hodges & Lehmann, 1963) describe it as the location shift needed to align two similarly shaped distributions as closely as possible (p. 599).
It is sometimes described as the difference between the two medians, but that is incorrect: the Hodges-Lehmann estimate often differs from the difference between the two medians.
This measure is sometimes mentioned as an effect size for the Mann-Whitney U / Wilcoxon rank-sum test (van Geloven, 2018). However, since it is a median of the possible differences, it is not standardized (i.e. it does not range between two fixed values and therefore depends on the scale of the data).
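To make the distinction from the difference of medians concrete, here is a minimal sketch using only the standard library. The two small samples are made up for illustration and are not taken from the package or its documentation.

```python
from statistics import median

# Illustrative data only (an assumption for this sketch)
x = [1, 5, 6]
y = [2, 3, 10]

# Hodges-Lehmann estimate: median of every pairwise difference y_j - x_i
hl = median([yj - xi for xi in x for yj in y])

# Naive alternative: difference between the two sample medians
diff_of_medians = median(y) - median(x)

print(hl)               # 1
print(diff_of_medians)  # -2
```

With these values the two quantities even have opposite signs, which shows why one should not be reported as the other.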
Parameters
catField : dataframe or list
- the categorical data
scores : dataframe or list
- the scores
categories : list, optional
- to indicate which two categories of catField to use; otherwise the two most frequent categories are used.
levels : list or dictionary, optional
- the scores in order
Returns
HL : float
- the Hodges-Lehmann estimate
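The way the parameters interact can be sketched without pandas. The function below is a hypothetical, stdlib-only approximation of the steps (drop incomplete pairs, recode with `levels`, pick two categories, take the median of the differences). The name `hl_sketch`, the sample data, the use of `None` as the missing-value marker, and the assumption that `levels` is a dictionary are all choices made for this illustration, not part of the package.

```python
from collections import Counter
from statistics import median


def hl_sketch(cat_field, scores, categories=None, levels=None):
    """Hypothetical stdlib-only sketch of the steps; not the package code."""
    # keep only pairs where both the category and the score are present
    # (assumption: missing values are marked with None)
    pairs = [(c, s) for c, s in zip(cat_field, scores)
             if c is not None and s is not None]
    # recode ordinal labels to numbers (assumption: levels is a dict)
    if levels is not None:
        pairs = [(c, levels.get(s, s)) for c, s in pairs]
    # use the requested categories, else the two most frequent ones
    if categories is not None:
        cat1, cat2 = categories[0], categories[1]
    else:
        (cat1, _), (cat2, _) = Counter(c for c, _ in pairs).most_common(2)
    x1 = [float(s) for c, s in pairs if c == cat1]
    x2 = [float(s) for c, s in pairs if c == cat2]
    # median of all differences: second category minus first category
    return median([j - i for i in x1 for j in x2])


# made-up example data; the last row is dropped because its category is missing
groups = ["a", "a", "a", "b", "b", None]
ratings = ["low", "low", "mid", "mid", "high", "low"]
coding = {"low": 1, "mid": 2, "high": 3}

print(hl_sketch(groups, ratings, categories=["a", "b"], levels=coding))  # 1.0
print(hl_sketch(groups, ratings, categories=["b", "a"], levels=coding))  # -1.0
```

The order of `categories` matters: the differences are taken as second category minus first, so reversing the order flips the sign of the result.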
Notes
The formula for the Hodges-Lehmann estimator with two samples is (Hodges & Lehmann, 1963, p. 602):
$$HL = \text{median}\left(y_j - x_i | 1 \leq i \leq n_x, 1 \leq j \leq n_y\right)$$
Symbols used:
- \(x_i\) the i-th score in category x
- \(y_j\) the j-th score in category y
- \(n_x\), \(n_y\) the number of scores in category x and in category y
A faster method to determine this might be Algorithm 616 (Monahan, 1984), but its Fortran code has not been translated to Python here.
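The formula can be checked directly with a brute-force computation over all \(n_x \times n_y\) differences. The sketch below uses a hypothetical helper `hl` and made-up data; it is not Monahan's Algorithm 616. It illustrates two properties that follow from the definition: swapping the roles of the samples flips the sign, and shifting every score in y by a constant shifts the estimate by the same constant, which matches the location-shift interpretation.

```python
from statistics import median


def hl(x, y):
    # brute force: median of all n_x * n_y differences y_j - x_i
    return median([yj - xi for xi in x for yj in y])


# made-up samples for illustration
x = [3, 7, 8, 12]
y = [5, 6, 14]

base = hl(x, y)
print(base)  # 0.5

# swapping the two samples flips the sign of the estimate
print(hl(y, x) == -base)  # True

# adding 10 to every y score moves the estimate by exactly 10
shifted = [v + 10 for v in y]
print(hl(x, shifted) == base + 10)  # True
```

The brute-force approach builds a list of \(n_x \times n_y\) values, so memory and time grow with the product of the sample sizes.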
References
Hodges, J. L., & Lehmann, E. L. (1963). Estimates of location based on rank tests. The Annals of Mathematical Statistics, 34(2), 598–611. doi:10.1214/aoms/1177704172
Monahan, J. F. (1984). Algorithm 616: Fast computation of the Hodges-Lehmann location estimator. ACM Transactions on Mathematical Software, 10(3), 265–270. doi:10.1145/1271.319414
van Geloven, N. (2018, March 13). Mann-Whitney U toets [Wiki]. Wikistatistiek. https://wikistatistiek.amc.nl/Mann-Whitney_U_toets
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076