Module stikpetP.correlations.cor_rank_biserial_is
import pandas as pd
from scipy.stats import rankdata
def r_rank_biserial_is(catField, ordField, categories=None, levels=None, version="cureton"):
'''
Rank Biserial Correlation
-------------------------
Calculates the rank-biserial correlation coefficient for independent samples.
Cureton (1956) was perhaps the first to use this term and provided a formula, which yields the same result as Goodman-Kruskal gamma (Goodman & Kruskal, 1954). Glass (1965; 1966) also developed a formula, but only for cases without ties between the two categories; it yields the same result as Somers' d (1962) and Cliff's delta (1993). Cureton (1968) responded to Glass and gave his formula in an alternative form. Willson (1976) showed the link between Cureton's formula and the Mann-Whitney U statistic. For more details, see Rubia (2022).
The function is shown in this [YouTube video](https://youtu.be/pq7Fv0yc9uU) and the coefficient is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Correlations/RankBiserialCorrelation.html)
Parameters
----------
catField : pandas series
data with categories for the rows
ordField : pandas series
data with the scores (ordinal field)
categories : list or dictionary, optional
the two categories to use from catField. If not set, the two most frequent categories are used
levels : list or dictionary, optional
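mapping from each label in ordField to its numeric position in the order, e.g. {'low': 1, 'high': 2}. If not set, the scores must be convertible to numbers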
version : {"cureton", "glass"}, optional
the method to use to calculate rank-biserial correlation.
Returns
-------
rb : float
the rank-biserial correlation coefficient (Cureton's or Glass's version, depending on version)
Notes
-----
If version='cureton', the formula from Cureton (1968, p. 68) is used:
$$r_{rb} = \\frac{\\bar{R}_1 - \\left(n + 1\\right)/2}{n_2/2 - B/n_1}$$
If version='glass', the formula from Glass (1965, p. 91; 1966, p. 626) is used:
$$r_b = \\frac{2\\times\\left(\\bar{R}_1 - \\bar{R}_2\\right)}{n}$$
With:
$$B = \\frac{\\sum_{i=1}^c t_{i,1} \\times t_{i,2}}{2}$$
$$\\bar{R}_i=\\frac{R_i}{n_i}$$
*Symbols used:*
* \\(\\bar{R}_i\\) the average of ranks in category i
* \\(R_i\\) the sum of ranks in category i
* \\(n\\) the total sample size
* \\(n_i\\) the number of scores in category i
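* \\(t_{i,j}\\) the number of scores in category j that equal the i-th score value occurring in both categories
* \\(c\\) the number of distinct score values that occur in both categories
For example, if one category has two scores of 3 and the other has three scores of 3, then \\(t_{1,1} = 2, t_{1,2} = 3\\). If the first category also has one score of 4 and the second has two scores of 4, then \\(t_{2,1} = 1, t_{2,2} = 2\\), etc.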
Cureton's version is the same as Goodman-Kruskal gamma, while Glass's version is the same as Somers' d (1962, p. 804) and Cliff Delta (1993, p. 495).
Before, After and Alternatives
------------------------------
Before determining this effect size measure, you might want to run a test:
* [ts_mann_whitney](../tests/test_mann_whitney.html#ts_mann_whitney) for the Mann-Whitney U test
* [ts_fligner_policello](../tests/test_fligner_policello.html#ts_fligner_policello) for the Fligner-Policello test
* [ts_brunner_munzel](../tests/test_brunner_munzel.html#ts_brunner_munzel) for the Brunner-Munzel test
* [ts_brunner_munzel_perm](../tests/test_brunner_munzel.html#ts_brunner_munzel_perm) for the Brunner-Munzel Permutation test
* [ts_c_square](../tests/test_c_square.html#ts_c_square) for the \\(C^2\\) test
After obtaining the coefficient you might want a rule-of-thumb:
* [th_rank_biserial](../other/thumb_rank_biserial.html#th_rank_biserial) for rules-of-thumb for the rank-biserial correlation
Alternative effect size measures could be:
* [es_common_language_is](../effect_sizes/eff_size_common_language_is.html#es_common_language_is) for Common Language Effect Size
* [me_hodges_lehmann_is](../measures/meas_hodges_lehmann_is.html#me_hodges_lehmann_is) for Hodges-Lehmann
References
----------
Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. *Psychological Bulletin, 114*(3), 494–509. doi:10.1037/0033-2909.114.3.494
Cureton, E. E. (1956). Rank-biserial correlation. *Psychometrika, 21*(3), 287–290. doi:10.1007/BF02289138
Cureton, E. E. (1968). Rank-biserial correlation when ties are present. *Educational and Psychological Measurement, 28*(1), 77–79. doi:10.1177/001316446802800107
Glass, G. V. (1965). A ranking variable analogue of biserial correlation: Implications for short-cut item analysis. *Journal of Educational Measurement, 2*(1), 91–95. doi:10.1111/j.1745-3984.1965.tb00396.x
Glass, G. V. (1966). Note on rank biserial correlation. *Educational and Psychological Measurement, 26*(3), 623–631. doi:10.1177/001316446602600307
Rubia, J. M. de la. (2022). Note on rank-biserial correlation when there are ties. *Open Journal of Statistics, 12*(5), 597–622. doi:10.4236/ojs.2022.125036
Somers, R. H. (1962). A new asymmetric measure of association for ordinal variables. *American Sociological Review, 27*(6), 799–811. doi:10.2307/2090408
Willson, V. L. (1976). Critical values of the rank-biserial correlation coefficient. *Educational and Psychological Measurement, 36*(2), 297–300. doi:10.1177/001316447603600207
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
--------
>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> myLevels = {'Not scientific at all': 1, 'Not too scientific': 2, 'Pretty scientific': 3, 'Very scientific': 4}
>>> r_rank_biserial_is(df1['sex'], df1['accntsci'], levels=myLevels)
-0.018712759581144402
>>> binary = ["apple", "apple", "apple", "peer", "peer", "peer", "peer"]
>>> ordinal = [4, 3, 1, 6, 5, 7, 2]
>>> r_rank_biserial_is(binary, ordinal, version='glass')
0.6666666666666667
'''
    # convert lists to pandas series if needed
    if isinstance(catField, list):
        catField = pd.Series(catField)
    if isinstance(ordField, list):
        ordField = pd.Series(ordField)

    # combine as one dataframe and drop rows with a missing value
    df = pd.concat([catField, ordField], axis=1)
    df = df.dropna()
    df.columns = ['cat', 'score']

    # replace the ordinal labels with their numeric values if levels is provided
    if levels is not None:
        df['score'] = df['score'].map(levels).astype('Int8')
    df.iloc[:, 1] = pd.to_numeric(df.iloc[:, 1])

    # the two categories (default: the two most frequent ones)
    if categories is not None:
        cat1 = categories[0]
        cat2 = categories[1]
    else:
        cat1 = df.iloc[:, 0].value_counts().index[0]
        cat2 = df.iloc[:, 0].value_counts().index[1]

    # separate the scores for each category
    scoresCat1 = list(df.iloc[:, 1][df.iloc[:, 0] == cat1])
    scoresCat2 = list(df.iloc[:, 1][df.iloc[:, 0] == cat2])
    n1 = len(scoresCat1)
    n2 = len(scoresCat2)
    n = n1 + n2

    # combine this into one long list
    allScores = scoresCat1 + scoresCat2

    # get the ranks (tied scores receive their average rank)
    allRanks = rankdata(allScores)

    # get the ranks per category
    cat1Ranks = allRanks[0:n1]
    cat2Ranks = allRanks[n1:n]
    r1 = sum(cat1Ranks)
    r2 = sum(cat2Ranks)
    r1Avg = r1 / n1
    r2Avg = r2 / n2

    if version == 'glass':
        rb = 2 * (r1Avg - r2Avg) / n
    elif version == 'cureton':
        # count the tied pairs between the two categories (b / 2 is B in the Notes)
        b = 0
        for i in set(cat1Ranks):
            b += sum(cat2Ranks == i) * sum(cat1Ranks == i)
        # rb using Cureton
        rb = (r1Avg - (n + 1) / 2) / (n2 / 2 - (b / 2) / n1)
    else:
        # without this branch an unknown version would leave rb undefined
        raise ValueError("version must be 'cureton' or 'glass'")

    return rb
Functions
def r_rank_biserial_is(catField, ordField, categories=None, levels=None, version='cureton')
Rank Biserial Correlation
Calculates the rank-biserial correlation coefficient for independent samples.
Cureton (1956) was perhaps the first to use this term and provided a formula, which yields the same result as Goodman-Kruskal gamma (Goodman & Kruskal, 1954). Glass (1965; 1966) also developed a formula, but only for cases without ties between the two categories; it yields the same result as Somers' d (1962) and Cliff's delta (1993). Cureton (1968) responded to Glass and gave his formula in an alternative form. Willson (1976) showed the link between Cureton's formula and the Mann-Whitney U statistic. For more details, see Rubia (2022).
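The function is shown in this YouTube video (https://youtu.be/pq7Fv0yc9uU) and the coefficient is also described at PeterStatistics.com (https://peterstatistics.com/Terms/Correlations/RankBiserialCorrelation.html)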
Parameters
catField : pandas series
data with categories for the rows
ordField : pandas series
data with the scores (ordinal field)
categories : list or dictionary, optional
the two categories to use from catField. If not set, the two most frequent categories are used
levels : list or dictionary, optional
mapping from each label in ordField to its numeric position in the order, e.g. {'low': 1, 'high': 2}. If not set, the scores must be convertible to numbers
version : {"cureton", "glass"}, optional
the method to use to calculate rank-biserial correlation.
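How `levels` and the default `categories` behave can be sketched with plain pandas. This is a minimal illustration, assuming the labels are mapped with `Series.map` and the two most frequent categories are picked with `value_counts`, as the source code of the function does. The data and the names `cats`, `scores` and `my_levels` are made up for this example.

```python
import pandas as pd

# hypothetical data, only for illustration
cats = pd.Series(["a", "b", "b", "c", "b", "a"])
scores = pd.Series(["low", "high", "mid", "low", "high", "mid"])

# levels: each label is replaced by its numeric position in the order
my_levels = {"low": 1, "mid": 2, "high": 3}
numeric_scores = scores.map(my_levels)

# default categories: the two most frequent values of the categorical field
counts = cats.value_counts()
cat1, cat2 = counts.index[0], counts.index[1]

print(list(numeric_scores))  # [1, 3, 2, 1, 3, 2]
print(cat1, cat2)            # b a
```

Category "c" is ignored here because only the two selected categories enter the calculation.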
Returns
rb : float
the rank-biserial correlation coefficient (Cureton's or Glass's version, depending on version)
Notes
If version='cureton', the formula from Cureton (1968, p. 68) is used:
$$r_{rb} = \frac{\bar{R}_1 - \left(n + 1\right)/2}{n_2/2 - B/n_1}$$
If version='glass', the formula from Glass (1965, p. 91; 1966, p. 626) is used:
$$r_b = \frac{2\times\left(\bar{R}_1 - \bar{R}_2\right)}{n}$$
With:
$$B = \frac{\sum_{i=1}^c t_{i,1} \times t_{i,2}}{2}$$
$$\bar{R}_i=\frac{R_i}{n_i}$$
Symbols used:
- \(\bar{R}_i\) the average of ranks in category i
- \(R_i\) the sum of ranks in category i
- \(n\) the total sample size
- \(n_i\) the number of scores in category i
- \(t_{i,j}\) the number of scores in category j that equal the i-th score value occurring in both categories
- \(c\) the number of distinct score values that occur in both categories
For example, if one category has two scores of 3 and the other has three scores of 3, then \(t_{1,1} = 2, t_{1,2} = 3\). If the first category also has one score of 4 and the second has two scores of 4, then \(t_{2,1} = 1, t_{2,2} = 2\), etc.
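The tie counts in the example above can be tallied as follows. This is only a sketch with made-up scores (`scores_cat1`, `scores_cat2`) chosen to match the example; it is not how the function itself counts ties.

```python
from collections import Counter

# hypothetical scores for the two categories, matching the example in the text
scores_cat1 = [3, 3, 4, 1]
scores_cat2 = [3, 3, 3, 4, 4, 5]

counts1 = Counter(scores_cat1)
counts2 = Counter(scores_cat2)

# only values that occur in both categories produce ties between the categories
shared = sorted(set(counts1) & set(counts2))
t = [(counts1[v], counts2[v]) for v in shared]

# B is half the sum of the products of the tie counts
B = sum(t1 * t2 for t1, t2 in t) / 2

print(t)  # [(2, 3), (1, 2)]
print(B)  # 4.0
```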
Cureton's version is the same as Goodman-Kruskal gamma, while Glass's version is the same as Somers' d (1962, p. 804) and Cliff Delta (1993, p. 495).
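These equivalences can be checked numerically by counting pairs. The sketch below assumes made-up scores `x` and `y` and a hypothetical helper `pair_counts`; it computes both formulas from average ranks and compares them with Cliff's delta, (P - Q) / (n1 n2), and Goodman-Kruskal gamma, (P - Q) / (P + Q), where P and Q are the numbers of pairs in which the score from the first category is higher or lower.

```python
import math
from scipy.stats import rankdata

def pair_counts(x, y):
    """Count pairs (a from x, b from y) with a > b, a < b and a == b."""
    greater = sum(a > b for a in x for b in y)
    less = sum(a < b for a in x for b in y)
    tied = sum(a == b for a in x for b in y)
    return greater, less, tied

# hypothetical scores for category 1 (x) and category 2 (y)
x = [3, 3, 4, 1]
y = [3, 3, 3, 4, 4, 5]
n1, n2 = len(x), len(y)
n = n1 + n2

# average ranks per category, ties get their mean rank
ranks = rankdata(x + y)
r1_avg = ranks[:n1].mean()
r2_avg = ranks[n1:].mean()

P, Q, T = pair_counts(x, y)

# rank-based formulas from the Notes (B equals T / 2)
glass = 2 * (r1_avg - r2_avg) / n
cureton = (r1_avg - (n + 1) / 2) / (n2 / 2 - (T / 2) / n1)

# pair-based definitions
cliff_delta = (P - Q) / (n1 * n2)
gamma = (P - Q) / (P + Q)

print(math.isclose(glass, cliff_delta))  # True
print(math.isclose(cureton, gamma))      # True
```

With ties present the two versions differ: here Glass's version gives about -0.4167 and Cureton's gives -0.625.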
Before, After and Alternatives
Before determining this effect size measure, you might want to run a test:
- ts_mann_whitney for the Mann-Whitney U test
- ts_fligner_policello for the Fligner-Policello test
- ts_brunner_munzel for the Brunner-Munzel test
- ts_brunner_munzel_perm for the Brunner-Munzel Permutation test
- ts_c_square for the C^2 test
After obtaining the coefficient you might want a rule-of-thumb:
- th_rank_biserial for rules-of-thumb for the rank-biserial correlation
Alternative effect size measures could be:
- es_common_language_is for Common Language Effect Size
- me_hodges_lehmann_is for Hodges-Lehmann
References
Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114(3), 494–509. doi:10.1037/0033-2909.114.3.494
Cureton, E. E. (1956). Rank-biserial correlation. Psychometrika, 21(3), 287–290. doi:10.1007/BF02289138
Cureton, E. E. (1968). Rank-biserial correlation when ties are present. Educational and Psychological Measurement, 28(1), 77–79. doi:10.1177/001316446802800107
Glass, G. V. (1965). A ranking variable analogue of biserial correlation: Implications for short-cut item analysis. Journal of Educational Measurement, 2(1), 91–95. doi:10.1111/j.1745-3984.1965.tb00396.x
Glass, G. V. (1966). Note on rank biserial correlation. Educational and Psychological Measurement, 26(3), 623–631. doi:10.1177/001316446602600307
Rubia, J. M. de la. (2022). Note on rank-biserial correlation when there are ties. Open Journal of Statistics, 12(5), 597–622. doi:10.4236/ojs.2022.125036
Somers, R. H. (1962). A new asymmetric measure of association for ordinal variables. American Sociological Review, 27(6), 799–811. doi:10.2307/2090408
Willson, V. L. (1976). Critical values of the rank-biserial correlation coefficient. Educational and Psychological Measurement, 36(2), 297–300. doi:10.1177/001316447603600207
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> myLevels = {'Not scientific at all': 1, 'Not too scientific': 2, 'Pretty scientific': 3, 'Very scientific': 4}
>>> r_rank_biserial_is(df1['sex'], df1['accntsci'], levels=myLevels)
-0.018712759581144402
>>> binary = ["apple", "apple", "apple", "peer", "peer", "peer", "peer"]
>>> ordinal = [4, 3, 1, 6, 5, 7, 2]
>>> r_rank_biserial_is(binary, ordinal, version='glass')
0.6666666666666667
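The second example can be reproduced by hand through the Mann-Whitney U statistic. This is a sketch, not the function's own code path: it assumes "peer" acts as the first category because it is the more frequent one, and uses the identity that, without ties, Glass's formula equals 2 U1 / (n1 n2) - 1 with U1 = R1 - n1 (n1 + 1) / 2.

```python
from scipy.stats import rankdata

# scores from the second example, split by category
peer = [6, 5, 7, 2]   # the more frequent category, so it acts as category 1
apple = [4, 3, 1]
n1, n2 = len(peer), len(apple)

# rank all scores together and sum the ranks of category 1
ranks = rankdata(peer + apple)
r1 = ranks[:n1].sum()

# Mann-Whitney U for category 1, then the rank-biserial correlation
u1 = r1 - n1 * (n1 + 1) / 2
rb_glass = 2 * u1 / (n1 * n2) - 1

print(u1)                  # 10.0
print(round(rb_glass, 4))  # 0.6667
```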