Module `stikpetP.effect_sizes.eff_size_hodges_lehmann_is`

import pandas as pd
from statistics import median

def es_hodges_lehmann_is(catField, scores, categories=None, levels=None):
    '''
    Hodges-Lehmann Estimator (Independent Samples)
    -----------------------------------------------
    The Hodges-Lehmann estimate is the median of all possible differences between the scores of two samples. The authors (Hodges & Lehmann, 1963) describe it as the location shift needed to align two distributions of similar shape as closely as possible (p. 599).
    
    It is sometimes described as the difference between the two medians, but that is incorrect: the Hodges-Lehmann estimate often differs from the difference between the two medians.
    
    This measure is sometimes mentioned as an effect size measure for a Mann-Whitney U / Wilcoxon rank-sum test (van Geloven, 2018). However, since it is a median of the possible differences, it is not standardized: it does not range between two fixed values, and its size therefore depends on the data.

    Parameters
    ----------
    catField : dataframe or list 
        the categorical data
    scores : dataframe or list
        the scores
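    categories : list, optional
        the two categories of catField to compare; if omitted, the two most frequent categories are used.
    levels : list or dictionary, optional
        the ordinal levels of the scores, in order. If given, it is used to recode the scores (via pandas `replace`) before they are converted to numbers, e.g. a dictionary mapping each label to its numeric value.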
        
    Returns
    -------
    HL : float, the Hodges-Lehmann estimate
    
    Notes
    ------
    The formula for the Hodges-Lehmann estimator with two samples is (Hodges & Lehmann, 1963, p. 602):
    
    $$HL = \\text{median}\\left(y_j - x_i | 1 \\leq i \\leq n_x, 1 \\leq j \\leq n_y\\right)$$
        
    *Symbols used:*
    
    * \\(x_i\\) the i-th score in category x
    * \\(y_j\\) the j-th score in category y
    * \\(n_x, n_y\\) the number of scores in category x and in category y

    Every pairwise difference is computed, so the work grows with \\(n_x \\times n_y\\). A faster method might exist: Algorithm 616 (Monahan, 1984). It is not used here because the Fortran code has not been translated to Python.
    
    References
    ----------
    Hodges, J. L., & Lehmann, E. L. (1963). Estimates of location based on rank tests. *The Annals of Mathematical Statistics, 34*(2), 598–611. doi:10.1214/aoms/1177704172
    
    Monahan, J. F. (1984). Algorithm 616: Fast computation of the Hodges-Lehmann location estimator. *ACM Transactions on Mathematical Software, 10*(3), 265–270. doi:10.1145/1271.319414
    
    van Geloven, N. (2018, March 13). Mann-Whitney U toets [Wiki]. Wikistatistiek. https://wikistatistiek.amc.nl/Mann-Whitney_U_toets
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    '''
    #convert to pandas series if needed
    if type(catField) is list:
        catField = pd.Series(catField)
    
    if type(scores) is list:
        scores = pd.Series(scores)
    
    #combine as one dataframe
    df = pd.concat([catField, scores], axis=1)
    df = df.dropna()

    #replace the ordinal values if levels is provided
    if levels is not None:
        df.iloc[:,1] = df.iloc[:,1].replace(levels)
    #convert the scores to numeric values
    df.iloc[:,1] = pd.to_numeric(df.iloc[:,1])
    
    #the two categories
    if categories is not None:
        cat1 = categories[0]
        cat2 = categories[1]
    else:
        #default to the two most frequent categories
        counts = df.iloc[:,0].value_counts()
        cat1 = counts.index[0]
        cat2 = counts.index[1]
    
    #separate the scores for each category
    x1 = list(df.iloc[:,1][df.iloc[:,0] == cat1])
    x2 = list(df.iloc[:,1][df.iloc[:,0] == cat2])
    
    #make sure they are floats
    x1 = [float(x) for x in x1]
    x2 = [float(x) for x in x2]

    #all of that just so we can determine HL in one line:
    hl = median([j - i for i in x1 for j in x2])

    return hl

Functions

def es_hodges_lehmann_is(catField, scores, categories=None, levels=None)

Hodges-Lehmann Estimator (Independent Samples)

The Hodges-Lehmann estimate is the median of all possible differences between the scores of two samples. The authors (Hodges & Lehmann, 1963) describe it as the location shift needed to align two distributions of similar shape as closely as possible (p. 599).

It is sometimes described as the difference between the two medians, but that is incorrect: the Hodges-Lehmann estimate often differs from the difference between the two medians.
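
To make the distinction concrete, here is a minimal sketch in plain Python. The two score lists are made-up illustrative data, and the brute-force list comprehension is only meant to mirror the definition, not the library's own code path.

```python
from statistics import median

# Hypothetical scores for two independent groups (illustrative data only)
x = [1, 5, 6]
y = [2, 3, 10]

# Every pairwise difference y_j - x_i (3 x 3 = 9 values)
differences = [yj - xi for xi in x for yj in y]

# Hodges-Lehmann estimate: the median of those differences
hl = median(differences)

# The quantity it is often confused with: the difference of the two medians
median_difference = median(y) - median(x)

print(hl)                 # 1
print(median_difference)  # -2
```

With these numbers the two quantities even have opposite signs, so they should not be used interchangeably.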

This measure is sometimes mentioned as an effect size measure for a Mann-Whitney U / Wilcoxon rank-sum test (van Geloven, 2018). However, since it is a median of the possible differences, it is not standardized: it does not range between two fixed values, and its size therefore depends on the data.

Parameters

catField : dataframe or list: the categorical data
scores : dataframe or list: the scores
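categories : list, optional: the two categories of catField to compare; if omitted, the two most frequent categories are used.
levels : list or dictionary, optional: the ordinal levels of the scores, in order. If given, it is used to recode the scores (via pandas `replace`) before they are converted to numbers, e.g. a dictionary mapping each label to its numeric value.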
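
The effect of `levels` and of leaving out `categories` can be sketched in plain Python. This is an assumption-laden illustration, not the library's pandas code: the data, the label-to-number dictionary, and the use of `collections.Counter` to find the two most frequent groups are all stand-ins chosen for the example.

```python
from collections import Counter
from statistics import median

# Hypothetical raw data: a grouping column and ordinal text scores
groups = ["a", "b", "a", "c", "b", "a", "a"]
answers = ["low", "high", "medium", "low", "medium", "low", "high"]

# A dictionary in the role of `levels`: each ordinal label mapped to a number
levels = {"low": 1, "medium": 2, "high": 3}
numeric = [float(levels[answer]) for answer in answers]

# With no `categories` given, the two most frequent groups are compared
(cat1, _), (cat2, _) = Counter(groups).most_common(2)

# Split the recoded scores by group
x1 = [score for group, score in zip(groups, numeric) if group == cat1]
x2 = [score for group, score in zip(groups, numeric) if group == cat2]

# Median of all differences (second category minus first category)
hl = median([j - i for i in x1 for j in x2])

print(cat1, cat2, hl)  # a b 1.0
```

Group "c" is ignored here because it is not among the two most frequent groups; passing `categories` explicitly avoids relying on that default.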

Returns

HL : float, the Hodges-Lehmann estimate

Notes

The formula for the Hodges-Lehmann estimator with two samples is (Hodges & Lehmann, 1963, p. 602):

$HL = \text{median}\left(y_j - x_i | 1 \leq i \leq n_x, 1 \leq j \leq n_y\right)$

Symbols used:

$x_i$ the i-th score in category x
$y_j$ the j-th score in category y
$n_x, n_y$ the number of scores in category x and in category y

Every pairwise difference is computed, so the work grows with $n_x \times n_y$. A faster method might exist: Algorithm 616 (Monahan, 1984). It is not used here because the Fortran code has not been translated to Python.

References

Hodges, J. L., & Lehmann, E. L. (1963). Estimates of location based on rank tests. The Annals of Mathematical Statistics, 34(2), 598–611. doi:10.1214/aoms/1177704172

Monahan, J. F. (1984). Algorithm 616: Fast computation of the Hodges-Lehmann location estimator. ACM Transactions on Mathematical Software, 10(3), 265–270. doi:10.1145/1271.319414

van Geloven, N. (2018, March 13). Mann-Whitney U toets [Wiki]. Wikistatistiek. https://wikistatistiek.amc.nl/Mann-Whitney_U_toets

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
