Module stikpetP.correlations.cor_biserial

Source code
import pandas as pd
from statistics import NormalDist

def r_biserial(catField, scaleField, categories=None):
    '''
    Biserial Correlation Coefficient
    --------------------------------
    This is an extension of the point-biserial correlation coefficient for when the categories come from a so-called latent, normally distributed scale. This is the case if scores were categorized and then compared to some other numeric scores (e.g. grades being categorized into pass/fail, and this pass/fail then being correlated with age).
    
    As the name implies, a correlation coefficient indicates how two variables co-relate, i.e. if one goes up, is the other likely to go up or down. A zero indicates no (linear) relation, a -1 a perfect negative correlation (if one goes up, the other goes down, and vice versa), and a +1 a perfect positive correlation (if one goes up, the other also goes up, and vice versa).
    
    With two categories we can read this as: if the scores go up and there is a positive correlation, it is more likely that they came from a category 1 case than from a category 0 case.
    
    There is a warning though: if one of the two categories has a very small sample size compared to the other, this coefficient will not be very accurate (Soper, 1914, p. 390; Jacobs & Viechtbauer, 2017, p. 165). Soper (1914, p. 390) warns against using it if one category is 4% or less of the combined sample size. On the website ChangingMinds a limit of 10% was posted (ChangingMinds, n.d.), unfortunately without a source.

    The coefficient is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Correlations/Biserial.html)
 
    Parameters
    ----------
    catField : dataframe or list 
        the categorical data
    scaleField : dataframe or list
        the scores
    categories : list, optional 
        to indicate which two categories of catField to use; otherwise the two most frequent categories will be used.
        
    Returns
    -------
    Pandas dataframe with:

    * *cat. 0*, the category that was used as category 0
    * *cat. 1*, the category that was used as category 1
    * *n1/n*, the proportion of scores in category 1
    * *mean 0*, the arithmetic mean of the scores from category 0
    * *mean 1*, the arithmetic mean of the scores from category 1
    * *r_b*, the biserial correlation coefficient

    Notes
    -----
    The formula used is (Tate, 1955a, p. 1087):
    $$r_b = \\frac{p \\times q \\times \\left(\\bar{x}_1 - \\bar{x}_0\\right)}{\\sigma_x \\times p_{z_p}}$$

    With:
    $$p = \\frac{n_1}{n}, q = \\frac{n_0}{n}$$
    $$\\bar{x}_0 = \\frac{\\sum_{i=1}^{n_0} x_{i,0}}{n_0}$$
    $$\\bar{x}_1 = \\frac{\\sum_{i=1}^{n_1} x_{i,1}}{n_1}$$
    $$\\sigma_x = \\sqrt{\\frac{SS}{n}}$$
    $$SS = \\sum_{j=1}^{2} \\sum_{i=1}^{n_j} \\left(x_{i,j} - \\bar{x}\\right)^2$$
    $$z_p = \\Phi^{-1}\\left(p\\right)$$
    $$p_{z_p} = \\phi\\left(z_p\\right)$$

    Symbols used:

    * \\(n_0\\), the sample size of the first category
    * \\(n_1\\), the sample size of the second category
    * \\(n\\), the total sample size, i.e. \\(n = n_0 + n_1\\)
    * \\(x_{i,j}\\) is the \\(i\\)-th score in category \\(j\\)
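
    As a purely illustrative sketch (with hypothetical scores, not from the example data), the chain of definitions above can be computed step by step with Python's `statistics.NormalDist`:

    ```python
    from statistics import NormalDist

    # hypothetical scores per category
    x0 = [1.0, 2.0, 3.0]           # category 0, so n0 = 3
    x1 = [4.0, 5.0]                # category 1, so n1 = 2
    n0, n1 = len(x0), len(x1)
    n = n0 + n1

    p, q = n1 / n, n0 / n          # proportions of each category
    m0 = sum(x0) / n0              # arithmetic mean of category 0
    m1 = sum(x1) / n1              # arithmetic mean of category 1

    # population standard deviation over all scores combined
    xs = x0 + x1
    m = sum(xs) / n
    sigma = (sum((x - m) ** 2 for x in xs) / n) ** 0.5

    z_p = NormalDist().inv_cdf(p)  # z_p = Phi^{-1}(p)
    p_zp = NormalDist().pdf(z_p)   # p_{z_p} = phi(z_p)

    r_b = p * q * (m1 - m0) / (sigma * p_zp)
    ```

    Note that, unlike a Pearson correlation, the sample biserial coefficient can fall outside the \\([-1, 1]\\) range for some data (as it does for this tiny hypothetical sample).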

    The oldest formula I could find is from Pearson (1909, p. 97), which, somewhat rewritten, is:
    $$r_b = \\frac{\\frac{\\bar{x}_1 - \\bar{x}}{\\sigma_x}}{\\frac{p_{z_p}}{p}}$$

    Since dividing by a fraction is the same as multiplying by its inverse, Soper (1914, p. 384) has:
    $$r_b = \\frac{\\bar{x}_1 - \\bar{x}}{\\sigma_x} \\times \\frac{p}{p_{z_p}}$$

    If we were to create binary values of the categories, then Tate (1955a, p. 1079; 1955b, p. 207) used the covariance between these and the scores:
    $$r_b = \\frac{\\sigma_{bx}}{\\sigma_x \\times p_{z_p}}$$

    This is not too surprising, since it can be shown that \\(\\sigma_{bx} = p \\times q \\times \\left(\\bar{x}_1 - \\bar{x}_0\\right)\\).

    Tate (1955a, p. 1087; 1955b, p. 207) also shows a conversion using the point-biserial:
    $$r_b = r_{pb} \\times \\frac{\\sigma_b}{p_{z_p}}$$

    Note that all of these should give the same result.
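
    As a small numeric check (with hypothetical 0/1 codes and scores, chosen only for illustration), the identity \\(\\sigma_{bx} = p \\times q \\times \\left(\\bar{x}_1 - \\bar{x}_0\\right)\\) behind the covariance form can be verified directly:

    ```python
    # hypothetical binary codes (0 = category 0, 1 = category 1) and scores
    b = [0, 0, 0, 1, 1]
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    n = len(b)

    # population covariance between the binary codes and the scores
    mb, mx = sum(b) / n, sum(x) / n
    cov_bx = sum((bi - mb) * (xi - mx) for bi, xi in zip(b, x)) / n

    # product form: p * q * (mean of category 1 - mean of category 0)
    p = sum(b) / n
    q = 1 - p
    m1 = sum(xi for bi, xi in zip(b, x) if bi == 1) / sum(b)
    m0 = sum(xi for bi, xi in zip(b, x) if bi == 0) / (n - sum(b))

    # both evaluate to 0.6 here, so the two expressions for r_b coincide
    assert abs(cov_bx - p * q * (m1 - m0)) < 1e-12
    ```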

    Before, After and Alternatives
    ------------------------------
    Before the effect size you might want to run a test. Various options include [ts_student_t_os](../tests/test_student_t_os.html#ts_student_t_os) for One-Sample Student t-Test, [ts_trimmed_mean_os](../tests/test_trimmed_mean_os.html#ts_trimmed_mean_os) for One-Sample Trimmed (Yuen or Yuen-Welch) Mean Test, or [ts_z_os](../tests/test_z_os.html#ts_z_os) for One-Sample Z Test.

    For some rule-of-thumb interpretations use [th_biserial()](../other/thumb_biserial.html).
    
    Alternative effect sizes include: [Common Language](../effect_sizes/eff_size_common_language_is.html), [Cohen d_s](../effect_sizes/eff_size_hedges_g_is.html), [Cohen U](../effect_sizes/eff_size_cohen_u.html), [Hedges g](../effect_sizes/eff_size_hedges_g_is.html), [Glass delta](../effect_sizes/eff_size_glass_delta.html)
    
    or the correlation coefficients: [biserial](../correlations/cor_biserial.html), [point-biserial](../effect_sizes/cor_point_biserial.html)
    
    References
    ----------
    ChangingMinds. (n.d.). Biserial Correlation Coefficient. Retrieved July 18, 2025, from https://changingminds.org/explanations/research/analysis/biserial.htm

    Jacobs, P., & Viechtbauer, W. (2017). Estimation of the biserial correlation and its sampling variance for use in meta‐analysis. *Research Synthesis Methods, 8*(2), 161–180. https://doi.org/10.1002/jrsm.1218
    
    Pearson, K. (1909). On a new method of determining correlation between a measured character A, and a character B. *Biometrika, 7*(1/2), 96–105. https://doi.org/10.2307/2345365
    
    Soper, H. E. (1914). On the probable error of the bi-serial expression for the correlation coefficient. *Biometrika, 10*(2/3), 384–390. https://doi.org/10.2307/2331789
    
    Tate, R. F. (1955a). Applications of correlation models for biserial data. *Journal of the American Statistical Association, 50*(272), 1078–1095. https://doi.org/10.2307/2281207
    
    Tate, R. F. (1955b). The theory of correlation between two continuous variables when one is dichotomized. *Biometrika, 42*(1/2), 205–216. https://doi.org/10.2307/2333437

    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    >>> # WARNING: Example is only to show results, this example is actually not suitable for biserial correlation
    >>> import pandas as pd
    >>> dfr = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> r_biserial(dfr['Gen_Gender'], dfr['Over_Grade'])
      cat. 0  cat. 1      n1/n     mean 0     mean 1       r_b
    0   Male  Female  0.268293  59.766667  53.727273 -0.170971
    
    '''
    
    # convert to pandas series if needed
    if isinstance(catField, list):
        catField = pd.Series(catField)
    
    if isinstance(scaleField, list):
        scaleField = pd.Series(scaleField)
    
    #combine as one dataframe
    df = pd.concat([catField, scaleField], axis=1)
    df = df.dropna()
    
    #the two categories
    if categories is not None:
        cat0 = categories[0]
        cat1 = categories[1]
    else:
        # default: the two most frequent categories
        cat0 = df.iloc[:,0].value_counts().index[0]
        cat1 = df.iloc[:,0].value_counts().index[1]
    
    # separate the scores for each category
    x0 = df.iloc[:,1][df.iloc[:,0] == cat0]
    x1 = df.iloc[:,1][df.iloc[:,0] == cat1]
    A = pd.concat([x0, x1])

    # sample sizes
    n0 = len(x0)
    n1 = len(x1)
    n = len(A)

    # sample proportions
    p = n1/n
    q = n0/n

    # means and overall population standard deviation
    m0 = x0.mean()
    m1 = x1.mean()
    s = A.std(ddof=0)

    # the normal distribution part
    z_p = NormalDist().inv_cdf(p)
    p_zp = NormalDist().pdf(z_p)
    
    # biserial correlation
    r_b = p*q*(m1 - m0)/(s * p_zp)

    #the results
    colnames = ["cat. 0", "cat. 1", 'n1/n', 'mean 0', 'mean 1', 'r_b']
    results = pd.DataFrame([[cat0, cat1, p, m0, m1, r_b]], columns=colnames)
    
    return results
