Module stikpetP.correlations.cor_point_biserial

Source code
import pandas as pd

def r_point_biserial(catField, scaleField, categories=None):
    '''
    Point-Biserial Correlation Coefficient
    -------------------------------------
    This can be seen as coding the two groups of a binary variable as 0 and 1, and then calculating a (Pearson) correlation coefficient between those values and the scores (Tate, 1954, p. 603).
    
    As the name implies, a correlation coefficient indicates how two variables co-relate, i.e. if one goes up, is the other likely to go up or down. A zero would indicate there is no (linear) relation, a -1 would mean a perfect negative correlation (if one goes up, the other goes down, and vice versa), and a +1 a perfect positive correlation (if one goes up, the other also goes up, and vice versa).
    
    With two categories we could read this more as: if the scores go up and there is a positive correlation, it is more likely that the score came from a category 1 case than from a category 0 case.
    
    Note that if the two categories come from a so-called latent normally distributed variable, the *biserial correlation* might be better. This is the case if scores were categorized and then compared to some other numeric scores (e.g. grades categorized into pass/fail, which is then correlated with age). A separate function is available for the biserial correlation.

    The coefficient is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Correlations/PointBiserialCorrelation.html)

    Parameters
    ----------
    catField : pandas series or list 
        the categorical data
    scaleField : pandas series or list
        the scores
    categories : list, optional 
        to indicate which two categories of catField to use; otherwise the two most frequent categories will be used.
        
    Returns
    -------
    Pandas dataframe with:

    * *cat. 1*, the category that was used as category 1
    * *cat. 2*, the category that was used as category 2
    * *mean 1*, the arithmetic mean of the scores from category 1
    * *mean 2*, the arithmetic mean of the scores from category 2
    * *r_pb*, the point-biserial correlation coefficient

    Notes
    -----
    The formula used is (Tate, 1955, p. 1081):
    $$r_{pb} = \\frac{\\bar{x}_2 - \\bar{x}_1}{\\sigma_x} \\times \\sqrt{p \\times q}$$

    With:
    $$p = \\frac{n_1}{n}, q = \\frac{n_2}{n}$$
    $$\\bar{x}_1 = \\frac{\\sum_{i=1}^{n_1} x_{i,1}}{n_1}$$
    $$\\bar{x}_2 = \\frac{\\sum_{i=1}^{n_2} x_{i,2}}{n_2}$$
    $$\\sigma_x = \\sqrt{\\frac{SS}{n}}$$
    $$SS = \\sum_{j=1}^{2} \\sum_{i=1}^{n_j} \\left(x_{i,j} - \\bar{x}\\right)^2$$

    Symbols used:

    * \\(n_1\\), the sample size of the first category
    * \\(n_2\\), the sample size of the second category
    * \\(n\\), the total sample size, i.e. \\(n = n_1 + n_2\\)
    * \\(x_{i,j}\\) is the \\(i\\)-th score in category \\(j\\)
    * \\(\\bar{x}\\), the arithmetic mean of all the scores combined

    The oldest formula I could find is from Soper (1914, p. 384), which, somewhat rewritten, is:
    $$r_{pb} = \\frac{\\bar{x}_2 - \\bar{x}}{\\sigma_x} \\times \\frac{\\sqrt{p \\times q}}{p}$$

    Tate also gave another formula (Tate, 1954, p. 606):
    $$r_{pb} = \\frac{\\bar{x}_2 - \\bar{x}_1}{\\sqrt{SS}} \\times \\sqrt{\\frac{n_1 \\times n_2}{n}}$$

    Friedman (1968, p. 245) uses the degrees of freedom and test statistic from the Student t-test for independent samples (this yields the absolute value; the sign follows the direction of the mean difference):
    $$r_{pb} = \\sqrt{\\frac{t^2}{t^2 + df}}$$

    As mentioned in the introduction, it can also be calculated by converting the categories to binary values, and then determining the Pearson product-moment correlation coefficient between these binary values and the scores.

    Note that all of these should give the same result.

    Before, After and Alternatives
    ------------------------------
    Before computing the effect size you might want to run a test. Various options include [ts_student_t_os](../tests/test_student_t_os.html#ts_student_t_os) for a One-Sample Student t-Test, [ts_trimmed_mean_os](../tests/test_trimmed_mean_os.html#ts_trimmed_mean_os) for a One-Sample Trimmed (Yuen or Yuen-Welch) Mean Test, or [ts_z_os](../tests/test_z_os.html#ts_z_os) for a One-Sample Z Test.

    For a rule-of-thumb interpretation use [th_point_biserial()](../other/thumb_point_biserial.html).
    
    Alternative effect sizes include: [Common Language](../effect_sizes/eff_size_common_language_is.html), [Cohen d_s](../effect_sizes/eff_size_hedges_g_is.html), [Cohen U](../effect_sizes/eff_size_cohen_u.html), [Hedges g](../effect_sizes/eff_size_hedges_g_is.html), [Glass delta](../effect_sizes/eff_size_glass_delta.html)
    
    or the correlation coefficients: [biserial](../correlations/cor_biserial.html), [point-biserial](../correlations/cor_point_biserial.html)
    
    References
    ----------
    Friedman, H. (1968). Magnitude of experimental effect and a table for its rapid estimation. *Psychological Bulletin, 70*(4), 245–251. https://doi.org/10.1037/h0026258
    
    Soper, H. E. (1914). On the probable error of the bi-serial expression for the correlation coefficient. *Biometrika, 10*(2/3), 384–390. https://doi.org/10.2307/2331789
    
    Tate, R. F. (1954). Correlation between a discrete and a continuous variable. Point-biserial correlation. *The Annals of Mathematical Statistics, 25*(3), 603–607. https://doi.org/10.1214/aoms/1177728730
    
    Tate, R. F. (1955). Applications of correlation models for biserial data. *Journal of the American Statistical Association, 50*(272), 1078–1095. https://doi.org/10.2307/2281207

    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    >>> import pandas as pd
    >>> dfr = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> r_point_biserial(dfr['Gen_Gender'], dfr['Over_Grade'])
      cat. 1  cat. 2     mean 1     mean 2      r_pb
    0   Male  Female  59.766667  53.727273 -0.127183
    
    '''
    
    # convert to pandas series if needed
    if isinstance(catField, list):
        catField = pd.Series(catField)
    
    if isinstance(scaleField, list):
        scaleField = pd.Series(scaleField)
    
    #combine as one dataframe
    df = pd.concat([catField, scaleField], axis=1)
    df = df.dropna()
    
    #the two categories
    if categories is not None:
        cat1 = categories[0]
        cat2 = categories[1]
    else:
        cat1 = df.iloc[:,0].value_counts().index[0]
        cat2 = df.iloc[:,0].value_counts().index[1]
    
    # separate the scores for each category
    x1 = df.iloc[:,1][df.iloc[:,0] == cat1]
    x2 = df.iloc[:,1][df.iloc[:,0] == cat2]    
    combined = pd.concat([x1, x2])

    # sample sizes
    n1 = len(x1)
    n2 = len(x2)
    n = len(combined)

    # sample proportions
    p = n1/n
    q = n2/n

    # means and overall population standard deviation
    m1 = x1.mean()
    m2 = x2.mean()
    s = combined.std(ddof=0)

    # point-biserial correlation
    r_pb = (m2 - m1)/s * (p*q)**0.5

    #the results
    colnames = ["cat. 1", "cat. 2", 'mean 1', 'mean 2', 'r_pb']
    results = pd.DataFrame([[cat1, cat2, m1, m2, r_pb]], columns=colnames)
    
    return results
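
As a quick sanity check of the notes in the docstring, the sketch below computes the coefficient along three of the routes described: the Tate formula, the Pearson correlation on 0/1 dummy-coded categories, and the Friedman t-based conversion. The data and the names `cats` and `scores` are made up for illustration, not part of the package.

```python
# Sanity check on made-up data: the Tate formula, the Pearson correlation
# on 0/1 dummy-coded categories, and the Friedman t-based conversion
# should all give the same point-biserial correlation.
import pandas as pd

cats = pd.Series(["a", "a", "a", "b", "b", "b", "b"])
scores = pd.Series([4.0, 5.0, 6.0, 6.0, 7.0, 8.0, 9.0])

x1 = scores[cats == "a"]
x2 = scores[cats == "b"]
n1, n2 = len(x1), len(x2)
n = n1 + n2
m1, m2 = x1.mean(), x2.mean()

# Tate (1955): r_pb = (m2 - m1) / sigma_x * sqrt(p * q)
sigma = scores.std(ddof=0)  # population standard deviation of all scores
r_tate = (m2 - m1) / sigma * ((n1 / n) * (n2 / n)) ** 0.5

# Pearson product-moment correlation with the categories coded 0/1
r_pearson = (cats == "b").astype(int).corr(scores)

# Friedman (1968): from the independent-samples t statistic; the square
# root drops the sign, which would have to be restored from m2 - m1
ss_within = ((x1 - m1) ** 2).sum() + ((x2 - m2) ** 2).sum()
sp2 = ss_within / (n - 2)  # pooled variance
t = (m2 - m1) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5
r_friedman = (t ** 2 / (t ** 2 + (n - 2))) ** 0.5

print(r_tate, r_pearson, r_friedman)
```

Since the Friedman route only needs \\(t\\) and \\(df\\), it can also recover \\(\left|r_{pb}\right|\\) from reported test results alone, e.g. \\(t = 2.0\\) with \\(df = 30\\) gives \\(\sqrt{4/34} \approx 0.343\\).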
