Module stikpetP.correlations.cor_biserial
import pandas as pd
from statistics import NormalDist
def r_biserial(catField, scaleField, categories=None):
'''
Biserial Correlation Coefficient
--------------------------------
This is an extension of the point-biserial correlation coefficient, used when the categories stem from a so-called latent, normally distributed scale. This is the case when scores were categorized and then compared to some other numeric scores (e.g. grades being categorized into pass/fail, and this pass/fail then being correlated with age).
As the name implies, a correlation coefficient indicates how two variables co-relate, i.e. if one goes up, is the other likely to go up or down. A zero indicates no (linear) relation, a -1 a perfect negative correlation (if one goes up, the other goes down, and vice versa), and a +1 a perfect positive correlation (if one goes up, the other also goes up, and vice versa).
With two categories we can read this as: if the scores go up and there is a positive correlation, it is more likely the score came from a category 1 case than from a category 0 case.
There is a warning though: if one of the two categories has a very small sample size compared to the other, this coefficient will not be very accurate (Soper, 1914, p. 390; Jacobs & Viechtbauer, 2017, p. 165). Soper (1914, p. 390) warns against using it if one category makes up 4% or less of the combined sample size. On the website ChangingMinds someone posted 10% as the limit (ChangingMinds, n.d.), unfortunately without a source.
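As an illustration of that warning, a quick pre-check could flag samples whose minority category falls below the 10% limit mentioned above (made-up data, not part of this package):

```python
# Illustrative check with made-up data: flag a minority category
# below the 10% limit posted on ChangingMinds (n.d.).
from collections import Counter

cats = ['pass'] * 46 + ['fail'] * 4
counts = Counter(cats)
minority_prop = min(counts.values()) / sum(counts.values())
if minority_prop < 0.10:
    print(f"minority category is only {minority_prop:.0%} of the sample; "
          f"r_b may be inaccurate")
```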
The coefficient is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Correlations/Biserial.html)
Parameters
----------
catField : dataframe or list
the categorical data
scaleField : dataframe or list
the scores
categories : list, optional
to indicate which two categories of catField to use; otherwise the two most frequently occurring categories will be used.
Returns
-------
Pandas dataframe with:
* *cat. 0*, the category that was used as category 0
* *cat. 1*, the category that was used as category 1
* *n1/n*, the proportion of scores in category 1
* *mean 0*, the arithmetic mean of the scores from category 0
* *mean 1*, the arithmetic mean of the scores from category 1
* *r_b*, the biserial correlation coefficient
Notes
-----
The formula used is (Tate, 1955a, p. 1087):
$$r_b = \\frac{p \\times q \\times \\left(\\bar{x}_1 - \\bar{x}_0\\right)}{\\sigma_x \\times p_{z_p}}$$
With:
$$p = \\frac{n_1}{n}, q = \\frac{n_0}{n}$$
$$\\bar{x}_0 = \\frac{\\sum_{i=1}^{n_0} x_{i,0}}{n_0}$$
$$\\bar{x}_1 = \\frac{\\sum_{i=1}^{n_1} x_{i,1}}{n_1}$$
$$\\sigma_x = \\sqrt{\\frac{SS}{n}}$$
$$SS = \\sum_{j=1}^{2} \\sum_{i=1}^{n_j} \\left(x_{i,j} - \\bar{x}\\right)^2$$
$$z_p = \\Phi^{-1}\\left(p\\right)$$
$$p_{z_p} = \\phi\\left(z_p\\right)$$
Symbols used:
* \\(n_0\\), the sample size of the first category
* \\(n_1\\), the sample size of the second category
* \\(n\\), the total sample size, i.e. \\(n = n_0 + n_1\\)
* \\(x_{i,j}\\) is the \\(i\\)-th score in category \\(j\\)
* \\(\\bar{x}\\), the arithmetic mean of all scores combined
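The formula above can be sketched term by term in plain Python using only the standard library (the scores below are made up for illustration):

```python
# Minimal sketch of the Tate (1955a) formula with made-up scores.
from statistics import NormalDist, mean, pstdev

x0 = [1, 2, 3, 4, 5]          # scores in category 0
x1 = [3, 4, 5, 6, 7]          # scores in category 1
n0, n1 = len(x0), len(x1)
n = n0 + n1
p, q = n1 / n, n0 / n                  # p = n1/n, q = n0/n
s = pstdev(x0 + x1)                    # sigma_x, population sd of all scores
z_p = NormalDist().inv_cdf(p)          # z_p = Phi^{-1}(p)
p_zp = NormalDist().pdf(z_p)           # p_{z_p} = phi(z_p)
r_b = p * q * (mean(x1) - mean(x0)) / (s * p_zp)
```

This mirrors what r_biserial() computes internally, without the pandas bookkeeping.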
The oldest formula I could find is from Pearson (1909, p. 97), which somewhat re-written is:
$$r_b = \\frac{\\frac{\\bar{x}_1 - \\bar{x}}{\\sigma_x}}{\\frac{p_{z_p}}{p}}$$
Since dividing by a fraction is the same as multiplying by its inverse, Soper (1914, p. 384) has:
$$r_b = \\frac{\\bar{x}_1 - \\bar{x}}{\\sigma_x} \\times \\frac{p}{p_{z_p}}$$
If we were to create binary values of the categories, then Tate (1955a, p. 1079; 1955b, p. 207) used the covariance between these and the scores:
$$r_b = \\frac{\\sigma_{bx}}{\\sigma_x \\times p_{z_p}}$$
Not too surprising, since it can be shown that \\(\\sigma_{bx} = p \\times q \\times \\left(\\bar{x}_1 - \\bar{x}_0\\right)\\)
Tate (1955a, p. 1087; 1955b, p. 207) also shows a conversion using the point-biserial correlation:
$$r_b = r_{pb} \\times \\frac{\\sigma_b}{p_{z_p}}$$
Where \\(\\sigma_b = \\sqrt{p \\times q}\\), the population standard deviation of the binary coded categories.
Note that all of these should give the same result.
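As a quick sanity check on that equivalence, the mean-difference form, the Soper form, and the covariance form can be computed side by side on made-up data; all three should coincide:

```python
# Verify on made-up data that three of the forms above give the same r_b.
from statistics import NormalDist, mean, pstdev

x0, x1 = [1, 2, 3, 4, 5], [3, 4, 5, 6, 7]
b = [0] * len(x0) + [1] * len(x1)      # binary coding of the categories
x = x0 + x1
n = len(x)
p, q = len(x1) / n, len(x0) / n
s = pstdev(x)
p_zp = NormalDist().pdf(NormalDist().inv_cdf(p))
# Tate (1955a): via the mean difference
r1 = p * q * (mean(x1) - mean(x0)) / (s * p_zp)
# Soper (1914): via the grand mean
r2 = (mean(x1) - mean(x)) / s * p / p_zp
# Tate (1955b): via the population covariance of b and x
cov_bx = sum((bi - mean(b)) * (xi - mean(x)) for bi, xi in zip(b, x)) / n
r3 = cov_bx / (s * p_zp)
```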
Before, After and Alternatives
------------------------------
Before the effect size you might want to run a test. Various options include [ts_student_t_os](../tests/test_student_t_os.html#ts_student_t_os) for One-Sample Student t-Test, [ts_trimmed_mean_os](../tests/test_trimmed_mean_os.html#ts_trimmed_mean_os) for One-Sample Trimmed (Yuen or Yuen-Welch) Mean Test, or [ts_z_os](../tests/test_z_os.html#ts_z_os) for One-Sample Z Test.
To get some rule-of-thumb use [th_biserial()](../other/thumb_biserial.html).
Alternative effect sizes include: [Common Language](../effect_sizes/eff_size_common_language_is.html), [Cohen d_s](../effect_sizes/eff_size_hedges_g_is.html), [Cohen U](../effect_sizes/eff_size_cohen_u.html), [Hedges g](../effect_sizes/eff_size_hedges_g_is.html), [Glass delta](../effect_sizes/eff_size_glass_delta.html)
or the correlation coefficients: [biserial](../correlations/cor_biserial.html), [point-biserial](../effect_sizes/cor_point_biserial.html)
References
----------
ChangingMinds. (n.d.). Biserial Correlation Coefficient. Retrieved July 18, 2025, from https://changingminds.org/explanations/research/analysis/biserial.htm
Jacobs, P., & Viechtbauer, W. (2017). Estimation of the biserial correlation and its sampling variance for use in meta‐analysis. *Research Synthesis Methods, 8*(2), 161–180. https://doi.org/10.1002/jrsm.1218
Pearson, K. (1909). On a new method of determining correlation between a measured character A, and a character B. *Biometrika, 7*(1/2), 96–105. https://doi.org/10.2307/2345365
Soper, H. E. (1914). On the probable error of the bi-serial expression for the correlation coefficient. *Biometrika, 10*(2/3), 384–390. https://doi.org/10.2307/2331789
Tate, R. F. (1955a). Applications of correlation models for biserial data. *Journal of the American Statistical Association, 50*(272), 1078–1095. https://doi.org/10.2307/2281207
Tate, R. F. (1955b). The theory of correlation between two continuous variables when one is dichotomized. *Biometrika, 42*(1/2), 205–216. https://doi.org/10.2307/2333437
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
--------
>>> # WARNING: Example is only to show results, this example is actually not suitable for biserial correlation
>>> import pandas as pd
>>> dfr = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> r_biserial(dfr['Gen_Gender'], dfr['Over_Grade'])
cat. 0 cat. 1 n1/n mean 0 mean 1 r_b
0 Male Female 0.268293 59.766667 53.727273 -0.170971
'''
#convert to pandas series if needed
if isinstance(catField, list):
catField = pd.Series(catField)
if isinstance(scaleField, list):
scaleField = pd.Series(scaleField)
#combine as one dataframe
df = pd.concat([catField, scaleField], axis=1)
df = df.dropna()
#the two categories
if categories is not None:
cat0 = categories[0]
cat1 = categories[1]
else:
cat0 = df.iloc[:,0].value_counts().index[0]
cat1 = df.iloc[:,0].value_counts().index[1]
#separate the scores for each category
x0 = df.iloc[:,1][df.iloc[:,0] == cat0]
x1 = df.iloc[:,1][df.iloc[:,0] == cat1]
A = pd.concat([x0, x1])
# sample sizes
n0 = len(x0)
n1 = len(x1)
n = len(A)
# sample proportions
p = n1/n
q = n0/n
# means and overall population standard deviation
m0 = x0.mean()
m1 = x1.mean()
s = A.std(ddof=0)
# the normal distribution part
z_p = NormalDist().inv_cdf(p)
p_zp = NormalDist().pdf(z_p)
# biserial correlation
r_b = p*q*(m1 - m0)/(s * p_zp)
#the results
colnames = ["cat. 0", "cat. 1", 'n1/n', 'mean 0', 'mean 1', 'r_b']
results = pd.DataFrame([[cat0, cat1, p, m0, m1, r_b]], columns=colnames)
return results