Module stikpetP.correlations.cor_rank_biserial_is

import pandas as pd
from scipy.stats import rankdata

def r_rank_biserial_is(catField, ordField, categories=None, levels=None, version="cureton"):
    '''
    Rank Biserial Correlation
    -------------------------
    Calculates the rank-biserial correlation coefficient for two independent samples.

    Cureton (1956) was perhaps the first to use this term and provided a formula. His formula yields the same result as the Goodman-Kruskal gamma (Goodman & Kruskal, 1954). Glass (1965; 1966) also developed a formula, but only for cases without ties between the two categories. His formula yields the same result as Somers' d (1962) and Cliff's delta (1993). Cureton (1968) responded to Glass and gave his formula in an alternative form. Willson (1976) showed the link between Cureton's formula and the Mann-Whitney U statistic. For more details, see the article by Rubia (2022).

    The function is shown in this [YouTube video](https://youtu.be/pq7Fv0yc9uU) and the coefficient is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Correlations/RankBiserialCorrelation.html)
    
    Parameters
    ----------
    catField : list or pandas series
        data with categories for the rows
    ordField : list or pandas series
        data with the scores (ordinal field)
    categories : list or dictionary, optional
        the two categories to use from catField. If not set, the two most frequent categories are used
    levels : dictionary, optional
        mapping of each label in ordField to its numeric score, in order. If not set, ordField must already be numeric
    version : {"cureton", "glass"}, optional
        the method to use to calculate the rank-biserial correlation. Default is "cureton"
        
    Returns
    -------
    rb : float
        the rank-biserial correlation. With version='glass' this equals Cliff's delta, with version='cureton' it equals the Goodman-Kruskal gamma
    
    Notes
    -----
    If version='cureton', the formula from Cureton (1968, p. 68) is used:
    $$r_{rb} = \\frac{\\bar{R}_1 - \\left(n + 1\\right)/2}{n_2/2 - B/n_1}$$
    
    If version='glass', the formula from Glass (1965, p. 91; 1966, p. 626) is used:
    $$r_b = \\frac{2\\times\\left(\\bar{R}_1 - \\bar{R}_2\\right)}{n}$$
    
    With:
    $$B = \\frac{\\sum_{i=1}^c t_{i,1} \\times t_{i,2}}{2}$$
    $$\\bar{R}_i=\\frac{R_i}{n_i}$$
    
    *Symbols used:*
    
    * \\(\\bar{R}_i\\) the average of ranks in category i
    * \\(R_i\\) the sum of ranks in category i
    * \\(n\\) the total sample size
    * \\(n_i\\) the number of scores in category i
    * \\(t_{i,j}\\) the number of scores in category j that are tied at the i-th score value
    * \\(c\\) the number of distinct score values
    
    For example, if one category has two scores of 3 and the other has three scores of 3, then \\(t_{1,1} = 2, t_{1,2} = 3\\). If the first category also has one score of 4 and the second has two scores of 4, then \\(t_{2,1} = 1, t_{2,2} = 2\\), etc.

    Cureton's version is the same as the Goodman-Kruskal gamma, while Glass's version is the same as Somers' d (1962, p. 804) and Cliff's delta (1993, p. 495).

    Before, After and Alternatives
    ------------------------------
    Before determining this effect size measure, you might want to run a test:
    
    * [ts_mann_whitney](../tests/test_mann_whitney.html#ts_mann_whitney) for the Mann-Whitney U test
    * [ts_fligner_policello](../tests/test_fligner_policello.html#ts_fligner_policello) for the Fligner-Policello test
    * [ts_brunner_munzel](../tests/test_brunner_munzel.html#ts_brunner_munzel) for the Brunner-Munzel test
    * [ts_brunner_munzel_perm](../tests/test_brunner_munzel.html#ts_brunner_munzel_perm) for the Brunner-Munzel Permutation test
    * [ts_c_square](../tests/test_c_square.html#ts_c_square) for the \\(C^2\\) test

    After obtaining the coefficient you might want a rule-of-thumb:
    
    * [th_rank_biserial](../other/thumb_rank_biserial.html#th_rank_biserial) for rules-of-thumb for the rank-biserial correlation
    
    Alternative effect size measures could be:
    
    * [es_common_language_is](../effect_sizes/eff_size_common_language_is.html#es_common_language_is) for Common Language Effect Size
    * [me_hodges_lehmann_is](../measures/meas_hodges_lehmann_is.html#me_hodges_lehmann_is) for Hodges-Lehmann
    
    References 
    ----------
    Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. *Psychological Bulletin, 114*(3), 494–509. doi:10.1037/0033-2909.114.3.494
    
    Cureton, E. E. (1956). Rank-biserial correlation. *Psychometrika, 21*(3), 287–290. doi:10.1007/BF02289138
    
    Cureton, E. E. (1968). Rank-biserial correlation when ties are present. *Educational and Psychological Measurement, 28*(1), 77–79. doi:10.1177/001316446802800107
    
    Glass, G. V. (1965). A ranking variable analogue of biserial correlation: Implications for short-cut item analysis. *Journal of Educational Measurement, 2*(1), 91–95. doi:10.1111/j.1745-3984.1965.tb00396.x
    
    Glass, G. V. (1966). Note on rank biserial correlation. *Educational and Psychological Measurement, 26*(3), 623–631. doi:10.1177/001316446602600307
    
    Rubia, J. M. de la. (2022). Note on rank-biserial correlation when there are ties. *Open Journal of Statistics, 12*(5), 597–622. doi:10.4236/ojs.2022.125036
    
    Somers, R. H. (1962). A new asymmetric measure of association for ordinal variables. *American Sociological Review, 27*(6), 799–811. doi:10.2307/2090408
    
    Willson, V. L. (1976). Critical values of the rank-biserial correlation coefficient. *Educational and Psychological Measurement, 36*(2), 297–300. doi:10.1177/001316447603600207
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    >>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
    >>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> myLevels = {'Not scientific at all': 1, 'Not too scientific': 2, 'Pretty scientific': 3, 'Very scientific': 4}
    >>> r_rank_biserial_is(df1['sex'], df1['accntsci'], levels=myLevels)
    -0.018712759581144402
    
    >>> binary = ["apple", "apple", "apple", "peer", "peer", "peer", "peer"]
    >>> ordinal = [4, 3, 1, 6, 5, 7, 2]
    >>> r_rank_biserial_is(binary, ordinal, version='glass')
    0.6666666666666667
    
    '''
    #convert to pandas series if needed
    if isinstance(catField, list):
        catField = pd.Series(catField)
    
    if isinstance(ordField, list):
        ordField = pd.Series(ordField)
    
    #combine as one dataframe
    df = pd.concat([catField, ordField], axis=1)
    df = df.dropna()
    df.columns=['cat', 'score']
    
    #replace the ordinal labels with their numeric values if levels is provided
    if levels is not None:
        df['score'] = df['score'].map(levels)
    #make sure the scores are numeric and drop labels that levels did not map
    df['score'] = pd.to_numeric(df['score'])
    df = df.dropna()
    
    #the two categories
    if categories is not None:
        cat1 = categories[0]
        cat2 = categories[1]
    else:
        #default to the two most frequent categories
        topCats = df.iloc[:,0].value_counts().index
        cat1 = topCats[0]
        cat2 = topCats[1]
    
    #separate the scores for each category
    scoresCat1 = list(df.iloc[:,1][df.iloc[:,0] == cat1])
    scoresCat2 = list(df.iloc[:,1][df.iloc[:,0] == cat2])
    
    n1 = len(scoresCat1)
    n2 = len(scoresCat2)
    n = n1 + n2
    
    #combine this into one long list
    allScores = scoresCat1 + scoresCat2
    
    #get the ranks
    allRanks = rankdata(allScores)
    
    #get the ranks per category
    cat1Ranks = allRanks[0:n1]
    cat2Ranks = allRanks[n1:n]
    
    r1 = sum(cat1Ranks)
    r2 = sum(cat2Ranks)
    
    r1Avg = r1/n1
    r2Avg = r2/n2

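    if version=='glass':
        rb = 2*(r1Avg - r2Avg)/n
    elif version=='cureton':
        #bracket ties: count the cross-category pairs that share a rank
        b = 0
        for i in set(cat1Ranks):
            b += sum(cat2Ranks == i) * sum(cat1Ranks == i)
        # rb using Cureton, with B = b / 2
        rb = (r1Avg - (n + 1) / 2) / (n2 / 2 - (b / 2) / n1)
    else:
        #without this branch an unknown version would leave rb undefined
        raise ValueError("version must be 'cureton' or 'glass'")
    
    return rb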

Functions

def r_rank_biserial_is(catField, ordField, categories=None, levels=None, version='cureton')

Rank Biserial Correlation

Calculates the rank-biserial correlation coefficient for two independent samples.

Cureton (1956) was perhaps the first to use this term and provided a formula. His formula yields the same result as the Goodman-Kruskal gamma (Goodman & Kruskal, 1954). Glass (1965; 1966) also developed a formula, but only for cases without ties between the two categories. His formula yields the same result as Somers' d (1962) and Cliff's delta (1993). Cureton (1968) responded to Glass and gave his formula in an alternative form. Willson (1976) showed the link between Cureton's formula and the Mann-Whitney U statistic. For more details, see the article by Rubia (2022).

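The function is shown in this [YouTube video](https://youtu.be/pq7Fv0yc9uU) and the coefficient is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Correlations/RankBiserialCorrelation.html).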
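The link with the Mann-Whitney U statistic can be illustrated with a small sketch. It uses the standard identity for Glass's version, r = 2·U1/(n1·n2) − 1, which is not necessarily the form Willson (1976) printed. The toy data and variable names are made up for illustration, and the sketch assumes SciPy 1.7 or later, where `mannwhitneyu` reports the U statistic of the first sample.

```python
from scipy.stats import mannwhitneyu, rankdata

# Hypothetical toy data: two independent groups of ordinal scores, with ties.
group1 = [3, 3, 4, 5, 5]
group2 = [1, 2, 3, 3, 4, 4]
n1, n2 = len(group1), len(group2)
n = n1 + n2

# Glass's rank-based formula: 2 * (mean rank 1 - mean rank 2) / n.
ranks = rankdata(group1 + group2)  # tied scores receive their average rank
r_glass = 2 * (ranks[:n1].mean() - ranks[n1:].mean()) / n

# The same value from the Mann-Whitney U statistic of the first group
# (assumes SciPy >= 1.7, where .statistic is U for the first sample).
u1 = mannwhitneyu(group1, group2).statistic
r_from_u = 2 * u1 / (n1 * n2) - 1

print(r_glass, r_from_u)
```

Both printed values should agree up to floating-point rounding.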

Parameters

catField : list or pandas series
    data with categories for the rows
ordField : list or pandas series
    data with the scores (ordinal field)
categories : list or dictionary, optional
    the two categories to use from catField. If not set, the two most frequent categories are used
levels : dictionary, optional
    mapping of each label in ordField to its numeric score, in order. If not set, ordField must already be numeric
version : {"cureton", "glass"}, optional
    the method to use to calculate the rank-biserial correlation. Default is "cureton"
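As a rough illustration of how these parameters behave in the source shown on this page: `levels` is passed to pandas `Series.map`, and when `categories` is omitted the two categories come from `value_counts()`, which orders by frequency. The label "Don't know" and all variable names below are hypothetical.

```python
import pandas as pd

# Hypothetical ordinal answers; "Don't know" is deliberately not in the mapping.
answers = pd.Series(["Very scientific", "Not too scientific",
                     "Pretty scientific", "Don't know"])
my_levels = {"Not scientific at all": 1, "Not too scientific": 2,
             "Pretty scientific": 3, "Very scientific": 4}

# Series.map looks each label up in the dictionary;
# labels that are not keys of the dictionary become NaN.
scores = answers.map(my_levels)
print(scores.tolist())

# Without `categories`, the two most frequent categories are picked,
# the most frequent one first.
groups = pd.Series(["apple", "apple", "apple", "peer", "peer", "peer", "peer"])
default_pair = list(groups.value_counts().index[:2])
print(default_pair)  # ['peer', 'apple']
```

Because the first category determines the sign of the coefficient, passing `categories` explicitly is a way to make the direction of the result predictable.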

Returns

rb : float
    the rank-biserial correlation. With version='glass' this equals Cliff's delta, with version='cureton' it equals the Goodman-Kruskal gamma

Notes

If version='cureton', the formula from Cureton (1968, p. 68) is used:

$$r_{rb} = \frac{\bar{R}_1 - \left(n + 1\right)/2}{n_2/2 - B/n_1}$$

If version='glass', the formula from Glass (1965, p. 91; 1966, p. 626) is used:

$$r_b = \frac{2\times\left(\bar{R}_1 - \bar{R}_2\right)}{n}$$

With:

$$B = \frac{\sum_{i=1}^c t_{i,1} \times t_{i,2}}{2}$$

$$\bar{R}_i=\frac{R_i}{n_i}$$
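The two formulas above can be sketched directly from the ranks. This is a minimal standalone version, not the package's own code path: the helper name `rank_biserial_sketch` is hypothetical, and it takes the two groups as separate lists instead of a category field and a score field.

```python
from scipy.stats import rankdata

def rank_biserial_sketch(scores1, scores2, version="cureton"):
    """Hypothetical helper: rank-biserial correlation from two lists of scores."""
    scores1, scores2 = list(scores1), list(scores2)
    n1, n2 = len(scores1), len(scores2)
    n = n1 + n2
    # Rank all scores together; tied scores receive their average rank.
    ranks = rankdata(scores1 + scores2)
    r1_avg = ranks[:n1].mean()
    r2_avg = ranks[n1:].mean()
    if version == "glass":
        return 2 * (r1_avg - r2_avg) / n
    if version == "cureton":
        # B: half the number of cross-category pairs with equal scores.
        shared = set(scores1) & set(scores2)
        b = sum(scores1.count(v) * scores2.count(v) for v in shared) / 2
        return (r1_avg - (n + 1) / 2) / (n2 / 2 - b / n1)
    raise ValueError("version must be 'cureton' or 'glass'")

# The second docstring example: "peer" is the larger group, so it comes first.
peer, apple = [6, 5, 7, 2], [4, 3, 1]
print(rank_biserial_sketch(peer, apple, version="glass"))    # about 0.667
print(rank_biserial_sketch(peer, apple, version="cureton"))  # about 0.667
```

Without ties between the groups, B is zero and both versions give the same value, which is why the two calls agree here.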

Symbols used:

  • \(\bar{R}_i\) the average of ranks in category i
  • \(R_i\) the sum of ranks in category i
  • \(n\) the total sample size
  • \(n_i\) the number of scores in category i
  • \(t_{i,j}\) the number of scores in category j that are tied at the i-th score value
  • \(c\) the number of distinct score values

For example, if one category has two scores of 3 and the other has three scores of 3, then \(t_{1,1} = 2, t_{1,2} = 3\). If the first category also has one score of 4 and the second has two scores of 4, then \(t_{2,1} = 1, t_{2,2} = 2\), etc.
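The tie term B for exactly this situation can be worked out with a few lines of Python. The variable names are illustrative only, and the lists contain just the tied scores from the example.

```python
from collections import Counter

# First category: two scores of 3 and one score of 4.
cat1_scores = [3, 3, 4]
# Second category: three scores of 3 and two scores of 4.
cat2_scores = [3, 3, 3, 4, 4]

t1 = Counter(cat1_scores)  # t_{i,1}: how often each value occurs in category 1
t2 = Counter(cat2_scores)  # t_{i,2}: how often each value occurs in category 2

# Only values present in both categories contribute to the sum.
B = sum(t1[v] * t2[v] for v in t1.keys() & t2.keys()) / 2
print(B)  # 4.0, i.e. (2*3 + 1*2) / 2
```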

Cureton's version is the same as Goodman-Kruskal gamma, while Glass's version is the same as Somers' d (1962, p. 804) and Cliff Delta (1993, p. 495).
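These equivalences can be checked numerically by counting pairs. The sketch below uses made-up data with ties and compares the rank formulas with the usual pair-count definitions: Cliff's delta as (wins − losses) / (n1·n2) and gamma as (wins − losses) / (wins + losses). It is an illustration under those assumed definitions, not a proof.

```python
from itertools import product
from scipy.stats import rankdata

# Hypothetical toy data with ties within and between the groups.
group1 = [3, 3, 4, 5, 5]
group2 = [1, 2, 3, 3, 4, 4]
n1, n2 = len(group1), len(group2)
n = n1 + n2

# Compare every score in group1 with every score in group2.
wins = sum(x > y for x, y in product(group1, group2))
losses = sum(x < y for x, y in product(group1, group2))
tied_pairs = n1 * n2 - wins - losses

cliff_delta = (wins - losses) / (n1 * n2)   # tied pairs stay in the denominator
gamma = (wins - losses) / (wins + losses)   # tied pairs are dropped

# The rank-based formulas from the Notes; B is half the number of tied pairs.
ranks = rankdata(group1 + group2)
r1_avg, r2_avg = ranks[:n1].mean(), ranks[n1:].mean()
glass = 2 * (r1_avg - r2_avg) / n
cureton = (r1_avg - (n + 1) / 2) / (n2 / 2 - (tied_pairs / 2) / n1)

print(glass, cliff_delta)  # both about 0.533
print(cureton, gamma)      # both about 0.667
```

The only difference between the two versions is whether tied cross-category pairs count in the denominator, so they coincide when there are no such ties.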

Before, After and Alternatives

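Before determining this effect size measure, you might want to run a test:

  • ts_mann_whitney for the Mann-Whitney U test
  • ts_fligner_policello for the Fligner-Policello test
  • ts_brunner_munzel for the Brunner-Munzel test
  • ts_brunner_munzel_perm for the Brunner-Munzel Permutation test
  • ts_c_square for the \(C^2\) test

After obtaining the coefficient you might want a rule-of-thumb:

  • th_rank_biserial for rules-of-thumb for the rank-biserial correlation

Alternative effect size measures could be:

  • es_common_language_is for Common Language Effect Size
  • me_hodges_lehmann_is for Hodges-Lehmann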

References

Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114(3), 494–509. doi:10.1037/0033-2909.114.3.494

Cureton, E. E. (1956). Rank-biserial correlation. Psychometrika, 21(3), 287–290. doi:10.1007/BF02289138

Cureton, E. E. (1968). Rank-biserial correlation when ties are present. Educational and Psychological Measurement, 28(1), 77–79. doi:10.1177/001316446802800107

Glass, G. V. (1965). A ranking variable analogue of biserial correlation: Implications for short-cut item analysis. Journal of Educational Measurement, 2(1), 91–95. doi:10.1111/j.1745-3984.1965.tb00396.x

Glass, G. V. (1966). Note on rank biserial correlation. Educational and Psychological Measurement, 26(3), 623–631. doi:10.1177/001316446602600307

Rubia, J. M. de la. (2022). Note on rank-biserial correlation when there are ties. Open Journal of Statistics, 12(5), 597–622. doi:10.4236/ojs.2022.125036

Somers, R. H. (1962). A new asymmetric measure of association for ordinal variables. American Sociological Review, 27(6), 799–811. doi:10.2307/2090408

Willson, V. L. (1976). Critical values of the rank-biserial correlation coefficient. Educational and Psychological Measurement, 36(2), 297–300. doi:10.1177/001316447603600207

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples

>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> myLevels = {'Not scientific at all': 1, 'Not too scientific': 2, 'Pretty scientific': 3, 'Very scientific': 4}
>>> r_rank_biserial_is(df1['sex'], df1['accntsci'], levels=myLevels)
-0.018712759581144402
>>> binary = ["apple", "apple", "apple", "peer", "peer", "peer", "peer"]
>>> ordinal = [4, 3, 1, 6, 5, 7, 2]
>>> r_rank_biserial_is(binary, ordinal, version='glass')
0.6666666666666667