Module `stikpetP.correlations.cor_rank_biserial_is`

import pandas as pd
from scipy.stats import rankdata

def r_rank_biserial_is(catField, ordField, categories=None, levels=None):
    '''
    (Glass) Rank Biserial Correlation / Cliff Delta
    -----------------------------------------------
    Calculates the rank biserial correlation coefficient for two independent samples.
    
    Parameters
    ----------
    catField : list or pandas series
        data with categories for the rows
    ordField : list or pandas series
        data with the scores (ordinal field)
    categories : list or dictionary, optional
        the two categories to use from catField. If not set, the two most frequent categories are used
    levels : list or dictionary, optional
        the levels of ordField in order, e.g. a dictionary that maps each label to its numeric score
        
    Returns
    -------
    rb : float
        the (Glass) rank biserial correlation / Cliff delta value
    
    Notes
    -----
    The formula used is (Glass, 1966, p. 626):
    $$r_b = \\frac{2\\times\\left(\\bar{R}_1 - \\bar{R}_2\\right)}{n}$$
    
    With:
    $$\\bar{R}_i=\\frac{R_i}{n_i}$$
    
    *Symbols used:*
    
    * \\(\\bar{R}_i\\) the average of ranks in category i
    * \\(R_i\\) the sum of ranks in category i
    * \\(n\\) the total sample size
    * \\(n_i\\) the number of scores in category i
    
    Glass (1966) showed that this formula is equivalent to the rank biserial of Cureton (1956). Cliff's delta (Cliff, 1993, p. 495) yields the same value.
    
    The rank biserial can be converted to Cohen's d with the **es_convert()** function, after which the rules of thumb for Cohen's d (**th_cohen_d()**) can be applied.
    
    See Also
    --------
    stikpetP.effect_sizes.convert_es.es_convert : to convert to Cohen d, use `fr="rb", to="cohend"`.
    
    stikpetP.other.thumb_cohen_d.th_cohen_d : rules of thumb for Cohen d
    
    
    References 
    ----------
    Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. *Psychological Bulletin, 114*(3), 494–509. https://doi.org/10.1037/0033-2909.114.3.494
    
    Cureton, E. E. (1956). Rank-biserial correlation. *Psychometrika, 21*(3), 287–290. https://doi.org/10.1007/BF02289138
    
    Glass, G. V. (1966). Note on rank biserial correlation. *Educational and Psychological Measurement, 26*(3), 623–631. https://doi.org/10.1177/001316446602600307
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    >>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
    >>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> myLevels = {'Not scientific at all': 1, 'Not too scientific': 2, 'Pretty scientific': 3, 'Very scientific': 4}
    >>> r_rank_biserial_is(df1['sex'], df1['accntsci'], levels=myLevels)
    -0.018712759581144402
    
    >>> binary = ["apple", "apple", "apple", "peer", "peer", "peer", "peer"]
    >>> ordinal = [4, 3, 1, 6, 5, 7, 2]
    >>> r_rank_biserial_is(binary, ordinal)
    0.6666666666666667
    
    
    '''
    #convert plain lists to pandas series if needed
    if isinstance(catField, list):
        catField = pd.Series(catField)
    
    if isinstance(ordField, list):
        ordField = pd.Series(ordField)
    
    #combine as one dataframe
    df = pd.concat([catField, ordField], axis=1)
    df = df.dropna()
    
    #replace the ordinal labels with their scores if levels is provided
    if levels is not None:
        pd.set_option('future.no_silent_downcasting', True)
        df.iloc[:,1] = df.iloc[:,1].replace(levels)
    
    #make sure the scores are numeric
    df.iloc[:,1] = pd.to_numeric(df.iloc[:,1])
    
    #the two categories: as given, otherwise the two most frequent
    if categories is not None:
        cat1 = categories[0]
        cat2 = categories[1]
    else:
        counts = df.iloc[:,0].value_counts()
        cat1 = counts.index[0]
        cat2 = counts.index[1]
    
    #separate the scores for each category
    scoresCat1 = list(df.iloc[:,1][df.iloc[:,0] == cat1])
    scoresCat2 = list(df.iloc[:,1][df.iloc[:,0] == cat2])
    
    n1 = len(scoresCat1)
    n2 = len(scoresCat2)
    n = n1 + n2
    
    #combine this into one long list
    allScores = scoresCat1 + scoresCat2
    
    #get the ranks
    allRanks = rankdata(allScores)
    
    #get the ranks per category
    cat1Ranks = allRanks[0:n1]
    cat2Ranks = allRanks[n1:n]
    
    r1 = sum(cat1Ranks)
    r2 = sum(cat2Ranks)
    
    r1Avg = r1/n1
    r2Avg = r2/n2
    
    rb = 2*(r1Avg - r2Avg)/n  
    
    return rb

Functions

def r_rank_biserial_is(catField, ordField, categories=None, levels=None)

(Glass) Rank Biserial Correlation / Cliff Delta

Calculates the rank biserial correlation coefficient for two independent samples.

Parameters

catField : list or pandas series: data with categories for the rows
ordField : list or pandas series: data with the scores (ordinal field)
categories : list or dictionary, optional: the two categories to use from catField. If not set, the two most frequent categories are used
levels : list or dictionary, optional: the levels of ordField in order, e.g. a dictionary that maps each label to its numeric score
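
As a rough illustration of the two optional parameters, the sketch below mimics their effect in plain Python. It is not the code the function runs (that relies on pandas), it ignores missing values, and the helper names `pick_categories` and `apply_levels` are hypothetical, not part of stikpetP.

```python
from collections import Counter

def pick_categories(cat_values, categories=None):
    """Return the two categories to compare (hypothetical helper)."""
    if categories is not None:
        # an explicit choice is used in the order given
        return categories[0], categories[1]
    # otherwise take the two most frequent categories, most frequent first
    (cat1, _), (cat2, _) = Counter(cat_values).most_common(2)
    return cat1, cat2

def apply_levels(ord_values, levels=None):
    """Map ordinal labels to numeric scores (hypothetical helper)."""
    if levels is None:
        return [float(v) for v in ord_values]
    # labels missing from the dictionary are passed through unchanged
    return [float(levels.get(v, v)) for v in ord_values]

binary = ["apple", "apple", "apple", "peer", "peer", "peer", "peer"]
default_pair = pick_categories(binary)                     # "peer" is most frequent
explicit_pair = pick_categories(binary, ["apple", "peer"])

my_levels = {"low": 1, "mid": 2, "high": 3}                # made-up labels
scores = apply_levels(["mid", "high", "low"], my_levels)

print(default_pair, explicit_pair, scores)
```

Because the coefficient is based on the difference between the first and the second category, swapping the order in `categories` flips the sign of the result.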

Returns

rb : float: the (Glass) rank biserial correlation / Cliff delta value

Notes

The formula used is (Glass, 1966, p. 626): $r_b = \frac{2\times\left(\bar{R}_1 - \bar{R}_2\right)}{n}$

With: $\bar{R}_i=\frac{R_i}{n_i}$

Symbols used:

$\bar{R}_i$ the average of ranks in category i
$R_i$ the sum of ranks in category i
$n$ the total sample size
$n_i$ the number of scores in category i
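
To make the formula concrete, here is a minimal sketch that applies it to the small apple/peer data set from the Examples section. It uses a hand-written mid-rank helper (tied scores receive the average of their ranks) instead of `scipy.stats.rankdata`, so it is an illustration under that assumption rather than the package's implementation. The names `mid_ranks` and `rank_biserial` are hypothetical.

```python
def mid_ranks(values):
    """Rank values from 1 to n; tied values get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the block while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # average of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def rank_biserial(scores1, scores2):
    """r_b = 2 * (mean rank of group 1 - mean rank of group 2) / n."""
    n1, n2 = len(scores1), len(scores2)
    n = n1 + n2
    ranks = mid_ranks(list(scores1) + list(scores2))
    r1_avg = sum(ranks[:n1]) / n1   # R_1 / n_1
    r2_avg = sum(ranks[n1:]) / n2   # R_2 / n_2
    return 2 * (r1_avg - r2_avg) / n

peer = [6, 5, 7, 2]    # the more frequent category, so it acts as category 1
apple = [4, 3, 1]
rb = rank_biserial(peer, apple)
print(round(rb, 4))
```

Here the rank sums are 20 and 8, giving mean ranks of 5 and about 2.67, so the result is roughly 0.667, in line with the second example.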

Glass (1966) showed that this formula is equivalent to the rank biserial of Cureton (1956). Cliff's delta (Cliff, 1993, p. 495) yields the same value.

The rank biserial can be converted to Cohen's d with the es_convert() function, after which the rules of thumb for Cohen's d (th_cohen_d()) can be applied.
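
The link with Cliff's delta can be checked on a small case. The sketch below assumes the usual dominance form of the statistic: the number of pairs in which a score from the first group exceeds one from the second, minus the number of pairs in which it is lower, divided by the number of pairs. The function name `cliff_delta` is hypothetical and not part of stikpetP.

```python
def cliff_delta(scores1, scores2):
    """(pairs with x > y minus pairs with x < y) / (n1 * n2); ties count as zero."""
    greater = sum(1 for x in scores1 for y in scores2 if x > y)
    less = sum(1 for x in scores1 for y in scores2 if x < y)
    return (greater - less) / (len(scores1) * len(scores2))

peer = [6, 5, 7, 2]
apple = [4, 3, 1]
delta = cliff_delta(peer, apple)   # 10 pairs higher, 2 pairs lower, 12 pairs in total
print(round(delta, 4))
```

For these data the value is (10 - 2) / 12, about 0.667, which is what the rank-based formula returns for the same two groups.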

References

Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114(3), 494–509. https://doi.org/10.1037/0033-2909.114.3.494

Cureton, E. E. (1956). Rank-biserial correlation. Psychometrika, 21(3), 287–290. https://doi.org/10.1007/BF02289138

Glass, G. V. (1966). Note on rank biserial correlation. Educational and Psychological Measurement, 26(3), 623–631. https://doi.org/10.1177/001316446602600307

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples

>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> myLevels = {'Not scientific at all': 1, 'Not too scientific': 2, 'Pretty scientific': 3, 'Very scientific': 4}
>>> r_rank_biserial_is(df1['sex'], df1['accntsci'], levels=myLevels)
-0.018712759581144402

>>> binary = ["apple", "apple", "apple", "peer", "peer", "peer", "peer"]
>>> ordinal = [4, 3, 1, 6, 5, 7, 2]
>>> r_rank_biserial_is(binary, ordinal)
0.6666666666666667
