Module stikpetP.correlations.cor_rank_biserial_is
import pandas as pd
from scipy.stats import rankdata

def r_rank_biserial_is(catField, ordField, categories=None, levels=None):
    '''
    (Glass) Rank Biserial Correlation / Cliff Delta
    -----------------------------------------------

    This function calculates the rank biserial correlation coefficient (independent samples).

    Parameters
    ----------
    catField : pandas series
        data with categories for the rows
    ordField : pandas series
        data with the scores (ordinal field)
    categories : list or dictionary, optional
        the two categories to use from catField. If not set, the two most frequent categories are used
    levels : list or dictionary, optional
        the scores in order; the example below passes a dictionary that maps each label to a numeric value

    Returns
    -------
    rb : float
        the (Glass) Rank Biserial Correlation / Cliff Delta value

    Notes
    -----
    The formula used is (Glass, 1966, p. 626):
    $$r_b = \\frac{2\\times\\left(\\bar{R}_1 - \\bar{R}_2\\right)}{n}$$

    With:
    $$\\bar{R}_i=\\frac{R_i}{n_i}$$

    *Symbols used:*

    * \\(\\bar{R}_i\\) the average of ranks in category i
    * \\(R_i\\) the sum of ranks in category i
    * \\(n\\) the total sample size
    * \\(n_i\\) the number of scores in category i

    Glass (1966) showed that this formula is the same as the rank biserial of Cureton (1956). Cliff's delta (Cliff, 1993, p. 495) is also equivalent.

    The rank biserial can be converted to Cohen's d (using the **es_convert()** function), after which the rules of thumb for Cohen's d can be used (**th_cohen_d()**).

    See Also
    --------
    stikpetP.effect_sizes.convert_es.es_convert : to convert to Cohen d, use `fr="rb", to="cohend"`.
    stikpetP.other.thumb_cohen_d.th_cohen_d : rules of thumb for Cohen d

    References
    ----------
    Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. *Psychological Bulletin, 114*(3), 494–509. https://doi.org/10.1037/0033-2909.114.3.494

    Cureton, E. E. (1956). Rank-biserial correlation. *Psychometrika, 21*(3), 287–290. https://doi.org/10.1007/BF02289138

    Glass, G. V. (1966). Note on rank biserial correlation. *Educational and Psychological Measurement, 26*(3), 623–631. https://doi.org/10.1177/001316446602600307

    Author
    ------
    Made by P. Stikker

    Companion website: https://PeterStatistics.com
    YouTube channel: https://www.youtube.com/stikpet
    Donations: https://www.patreon.com/bePatron?u=19398076

    Examples
    --------
    >>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
    >>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> myLevels = {'Not scientific at all': 1, 'Not too scientific': 2, 'Pretty scientific': 3, 'Very scientific': 4}
    >>> r_rank_biserial_is(df1['sex'], df1['accntsci'], levels=myLevels)
    -0.018712759581144402

    >>> binary = ["apple", "apple", "apple", "peer", "peer", "peer", "peer"]
    >>> ordinal = [4, 3, 1, 6, 5, 7, 2]
    >>> r_rank_biserial_is(binary, ordinal)
    0.6666666666666667

    '''
    # convert to pandas series if needed
    if isinstance(catField, list):
        catField = pd.Series(catField)
    if isinstance(ordField, list):
        ordField = pd.Series(ordField)

    # combine as one dataframe and drop rows with a missing value
    df = pd.concat([catField, ordField], axis=1)
    df = df.dropna()

    # replace the ordinal values if levels is provided, then make them numeric
    if levels is not None:
        pd.set_option('future.no_silent_downcasting', True)
        df.iloc[:,1] = df.iloc[:,1].replace(levels)
        df.iloc[:,1] = pd.to_numeric(df.iloc[:,1])
    else:
        df.iloc[:,1] = pd.to_numeric(df.iloc[:,1])

    # the two categories (value_counts() sorts by frequency, most frequent first)
    if categories is not None:
        cat1 = categories[0]
        cat2 = categories[1]
    else:
        cat1 = df.iloc[:,0].value_counts().index[0]
        cat2 = df.iloc[:,0].value_counts().index[1]

    # separate the scores for each category
    scoresCat1 = list(df.iloc[:,1][df.iloc[:,0] == cat1])
    scoresCat2 = list(df.iloc[:,1][df.iloc[:,0] == cat2])
    n1 = len(scoresCat1)
    n2 = len(scoresCat2)
    n = n1 + n2

    # combine this into one long list
    allScores = scoresCat1 + scoresCat2

    # get the ranks (ties receive the average rank)
    allRanks = rankdata(allScores)

    # get the ranks per category
    cat1Ranks = allRanks[0:n1]
    cat2Ranks = allRanks[n1:n]

    # sum and average of the ranks per category
    r1 = sum(cat1Ranks)
    r2 = sum(cat2Ranks)
    r1Avg = r1/n1
    r2Avg = r2/n2

    # rank biserial correlation (Glass, 1966)
    rb = 2*(r1Avg - r2Avg)/n

    return rb
Functions
def r_rank_biserial_is(catField, ordField, categories=None, levels=None)
(Glass) Rank Biserial Correlation / Cliff Delta
Calculates the rank biserial correlation coefficient for independent samples.
Parameters
catField : pandas series
    data with categories for the rows
ordField : pandas series
    data with the scores (ordinal field)
categories : list or dictionary, optional
    the two categories to use from catField. If not set, the two most frequent categories are used
levels : list or dictionary, optional
    the scores in order
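As a rough illustration of how these parameters shape the input, the sketch below mimics the preparation steps in the source (dropping missing pairs, recoding labels through `levels`, and picking the two most frequent categories when `categories` is not given) using only the standard library. The toy data and all variable names are hypothetical and are not part of the package.

```python
from collections import Counter

# Hypothetical toy data, not taken from the package's example file.
groups = ["a", "b", "a", "b", "b", None, "a", "b", "b"]
answers = ["Very scientific", "Not too scientific", "Pretty scientific",
           "Very scientific", None, "Pretty scientific",
           "Not scientific at all", "Pretty scientific", "Not too scientific"]
levels = {"Not scientific at all": 1, "Not too scientific": 2,
          "Pretty scientific": 3, "Very scientific": 4}

# Drop pairs with a missing value, similar to df.dropna() in the source.
pairs = [(g, a) for g, a in zip(groups, answers) if g is not None and a is not None]

# Recode labels to numbers, similar to .replace(levels) plus pd.to_numeric.
coded = [(g, levels[a]) for g, a in pairs]

# Without `categories`, the two most frequent categories are selected.
cat1, cat2 = [c for c, _ in Counter(g for g, _ in coded).most_common(2)]

scores1 = [s for g, s in coded if g == cat1]
scores2 = [s for g, s in coded if g == cat2]
print(cat1, scores1)
print(cat2, scores2)
```

Because the sign of the coefficient depends on which category is treated as the first one, passing `categories` explicitly is the safer choice when the direction of the effect matters.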
Returns
rb : float
    the (Glass) rank biserial correlation / Cliff delta value
Notes
The formula used is (Glass, 1966, p. 626):
$$r_b = \frac{2\times\left(\bar{R}_1 - \bar{R}_2\right)}{n}$$
With:
$$\bar{R}_i=\frac{R_i}{n_i}$$
Symbols used:
- \(\bar{R}_i\) the average of ranks in category i
- \(R_i\) the sum of ranks in category i
- \(n\) the total sample size
- \(n_i\) the number of scores in category i
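The formula can be traced by hand with a small standard-library sketch. The helper names `average_ranks` and `rank_biserial` are hypothetical; the package itself relies on `scipy.stats.rankdata`, whose default is also to give tied values their average rank. The data are the scores from the second docstring example, with "peer" as category 1 because it is the more frequent category.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mid_rank = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = mid_rank
        start = end + 1
    return ranks


def rank_biserial(scores1, scores2):
    """r_b = 2 * (mean rank of group 1 - mean rank of group 2) / n."""
    n1, n2 = len(scores1), len(scores2)
    ranks = average_ranks(list(scores1) + list(scores2))
    mean_rank1 = sum(ranks[:n1]) / n1
    mean_rank2 = sum(ranks[n1:]) / n2
    return 2 * (mean_rank1 - mean_rank2) / (n1 + n2)


# Scores from the docstring example: "peer" is category 1, "apple" category 2.
peer = [6, 5, 7, 2]
apple = [4, 3, 1]
rb = rank_biserial(peer, apple)
print(round(rb, 4))  # approximately 0.6667
```

Here the mean ranks are 20/4 = 5 and 8/3, so r_b = 2 × (5 − 8/3) / 7 = 2/3, which agrees with the documented output of 0.6666666666666667.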
Glass (1966) showed that this formula is the same as the rank biserial of Cureton (1956). Cliff's delta (Cliff, 1993, p. 495) is also equivalent.
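The equivalence with Cliff's delta can be checked on the docstring example with a pairwise count. This is only a sketch: `cliff_delta` is a hypothetical helper, not a function from the package, and it uses the usual dominance form of delta (pairs where the first group scores higher, minus pairs where it scores lower, divided by the number of cross-group pairs).

```python
def cliff_delta(scores1, scores2):
    """(#(x1 > x2) - #(x1 < x2)) / (n1 * n2) over all cross-group pairs."""
    greater = sum(1 for a in scores1 for b in scores2 if a > b)
    less = sum(1 for a in scores1 for b in scores2 if a < b)
    return (greater - less) / (len(scores1) * len(scores2))


# Same scores as the docstring example, "peer" as category 1.
peer = [6, 5, 7, 2]
apple = [4, 3, 1]
delta = cliff_delta(peer, apple)
print(round(delta, 4))  # 10 pairs favour "peer", 2 favour "apple": (10 - 2) / 12
```

The result, 8/12 = 2/3, matches the rank-based value reported in the Examples section up to floating-point rounding.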
The rank biserial can be converted to Cohen's d with the es_convert() function, after which the rules of thumb for Cohen's d can be applied with th_cohen_d().
See Also
stikpetP.effect_sizes.convert_es.es_convert : to convert to Cohen d, use fr="rb", to="cohend".
stikpetP.other.thumb_cohen_d.th_cohen_d : rules of thumb for Cohen d
References
Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114(3), 494–509. https://doi.org/10.1037/0033-2909.114.3.494
Cureton, E. E. (1956). Rank-biserial correlation. Psychometrika, 21(3), 287–290. https://doi.org/10.1007/BF02289138
Glass, G. V. (1966). Note on rank biserial correlation. Educational and Psychological Measurement, 26(3), 623–631. https://doi.org/10.1177/001316446602600307
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> myLevels = {'Not scientific at all': 1, 'Not too scientific': 2, 'Pretty scientific': 3, 'Very scientific': 4}
>>> r_rank_biserial_is(df1['sex'], df1['accntsci'], levels=myLevels)
-0.018712759581144402
>>> binary = ["apple", "apple", "apple", "peer", "peer", "peer", "peer"]
>>> ordinal = [4, 3, 1, 6, 5, 7, 2]
>>> r_rank_biserial_is(binary, ordinal)
0.6666666666666667