Module stikpetP.correlations.cor_point_biserial
import pandas as pd
def r_point_biserial(catField, scaleField, categories=None):
'''
Point-Biserial Correlation Coefficient
-------------------------------------
This can be seen as coding a binary variable with the groups as 0 and 1, and then calculating a (Pearson) correlation coefficient between those values and the scores (Tate, 1954, p. 603).
As the name implies, a correlation coefficient indicates how two variables co-relate, i.e. whether, if one goes up, the other is likely to go up or down. A zero indicates there is no (linear) relation, a -1 a perfect negative correlation (if one goes up, the other goes down, and vice versa), and a +1 a perfect positive correlation (if one goes up, the other also goes up, and vice versa).
With two categories we could read this as follows: if there is a positive correlation and the scores go up, it is more likely that a case came from category 1 than from category 0.
Note that if the two categories come from a so-called latent normally distributed variable, the *biserial correlation* might be better. This is the case if scores were categorized and then compared to some other numeric scores (e.g. grades being categorized into pass/fail, and then using this pass/fail to correlate with age). A separate function is available for the biserial correlation.
The coefficient is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Correlations/PointBiserialCorrelation.html)
Parameters
----------
catField : dataframe or list
the categorical data
scaleField : dataframe or list
the scores
categories : list, optional
to indicate which two categories of catField to use; otherwise the two most frequently occurring categories will be used.
Returns
-------
Pandas dataframe with:
* *cat. 1*, the category that was used as category 1
* *cat. 2*, the category that was used as category 2
* *mean 1*, the arithmetic mean of the scores from category 1
* *mean 2*, the arithmetic mean of the scores from category 2
* *r_pb*, the point-biserial correlation coefficient
Notes
-----
The formula used is (Tate, 1955, p. 1081):
$$r_{pb} = \\frac{\\bar{x}_2 - \\bar{x}_1}{\\sigma_x} \\times \\sqrt{p \\times q}$$
With:
$$p = \\frac{n_1}{n}, q = \\frac{n_2}{n}$$
$$\\bar{x}_1 = \\frac{\\sum_{i=1}^{n_1} x_{i,1}}{n_1}$$
$$\\bar{x}_2 = \\frac{\\sum_{i=1}^{n_2} x_{i,2}}{n_2}$$
$$\\sigma_x = \\sqrt{\\frac{SS}{n}}$$
$$SS = \\sum_{j=1}^{2} \\sum_{i=1}^{n_j} \\left(x_{i,j} - \\bar{x}\\right)^2$$
Symbols used:
* \\(n_1\\), the sample size of the first category
* \\(n_2\\), the sample size of the second category
* \\(n\\), the total sample size, i.e. \\(n = n_1 + n_2\\)
* \\(x_{i,j}\\) is the \\(i\\)-th score in category \\(j\\)
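The Tate (1955) formula above can be sketched directly with numpy; the tiny arrays here are made-up illustrative data, and the variable names mirror the symbols in the text:

```python
# Minimal sketch of the Tate (1955) formula, using only numpy.
import numpy as np

x1 = np.array([1.0, 2.0])   # scores in category 1 (made-up data)
x2 = np.array([3.0])        # scores in category 2

n1, n2 = len(x1), len(x2)
n = n1 + n2
p, q = n1 / n, n2 / n

x = np.concatenate([x1, x2])
sigma = x.std(ddof=0)       # population standard deviation, i.e. sqrt(SS/n)

r_pb = (x2.mean() - x1.mean()) / sigma * np.sqrt(p * q)
```

Note the use of `ddof=0`: the formula divides the sum of squares by \\(n\\), not \\(n - 1\\).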
The oldest formula I could find is from Soper (1914, p. 384), which somewhat re-written is:
$$r_{pb} = \\frac{\\bar{x}_2 - \\bar{x}}{\\sigma_x} \\times \\frac{\\sqrt{p \\times q}}{p}$$
Tate also gave another formula (Tate, 1954, p. 606):
$$r_{pb} = \\frac{\\bar{x}_2 - \\bar{x}_1}{\\sqrt{SS}} \\times \\sqrt{\\frac{n_1 \\times n_2}{n}}$$
Friedman (1968, p. 245) uses the degrees of freedom and test-statistic from the Student t-test for independent samples:
$$r_{pb} = \\sqrt{\\frac{t^2}{t^2 + df}}$$
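Friedman's identity can be checked numerically: compute the equal-variance two-sample t statistic by hand (numpy only, on made-up data) and recover \\(|r_{pb}|\\) from \\(t\\) and \\(df\\):

```python
# Sketch of Friedman's identity: recover r_pb from the Student t statistic.
import numpy as np

x1 = np.array([1.0, 2.0, 4.0])  # made-up scores, category 1
x2 = np.array([3.0, 5.0, 6.0])  # made-up scores, category 2
n1, n2 = len(x1), len(x2)

# pooled-variance Student t for independent samples
sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
t = (x2.mean() - x1.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2
r_from_t = np.sqrt(t**2 / (t**2 + df))

# the Tate (1955) formula gives the same value on the same data
x = np.concatenate([x1, x2])
p = n1 / len(x)
r_direct = (x2.mean() - x1.mean()) / x.std(ddof=0) * np.sqrt(p * (1 - p))
```

Since the square root is taken, this route only yields the magnitude of \\(r_{pb}\\); the sign has to come from the direction of the mean difference.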
As mentioned in the introduction, it can also be calculated by converting the categories to binary values and then determining the Pearson product-moment correlation coefficient between these binary values and the scores.
Note that all of these should give the same result.
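The binary-coding route can be verified with numpy's `corrcoef` on small made-up data; it should match the direct formula to floating-point precision:

```python
# Binary-coding route: Pearson correlation of 0/1 category codes vs scores.
import numpy as np

cats   = np.array([0, 0, 0, 1, 1, 1])             # category coded 0/1
scores = np.array([1.0, 2.0, 4.0, 3.0, 5.0, 6.0]) # made-up scores

r_pearson = np.corrcoef(cats, scores)[0, 1]

# direct point-biserial formula on the same data
x1, x2 = scores[cats == 0], scores[cats == 1]
p = len(x1) / len(scores)
r_pb = (x2.mean() - x1.mean()) / scores.std(ddof=0) * np.sqrt(p * (1 - p))
```

The ddof choice cancels in the Pearson ratio, which is why the `ddof=0` formula agrees with `corrcoef` exactly.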
Before, After and Alternatives
------------------------------
Before the effect size you might want to run a test. Various options include [ts_student_t_os](../tests/test_student_t_os.html#ts_student_t_os) for One-Sample Student t-Test, [ts_trimmed_mean_os](../tests/test_trimmed_mean_os.html#ts_trimmed_mean_os) for One-Sample Trimmed (Yuen or Yuen-Welch) Mean Test, or [ts_z_os](../tests/test_z_os.html#ts_z_os) for One-Sample Z Test.
To get some rule-of-thumb use [th_point_biserial()](../other/thumb_point_biserial.html).
Alternative effect sizes include: [Common Language](../effect_sizes/eff_size_common_language_is.html), [Cohen d_s](../effect_sizes/eff_size_hedges_g_is.html), [Cohen U](../effect_sizes/eff_size_cohen_u.html), [Hedges g](../effect_sizes/eff_size_hedges_g_is.html), [Glass delta](../effect_sizes/eff_size_glass_delta.html)
or the correlation coefficients: [biserial](../correlations/cor_biserial.html), [point-biserial](../effect_sizes/cor_point_biserial.html)
References
----------
Friedman, H. (1968). Magnitude of experimental effect and a table for its rapid estimation. *Psychological Bulletin, 70*(4), 245–251. https://doi.org/10.1037/h0026258
Soper, H. E. (1914). On the probable error of the bi-serial expression for the correlation coefficient. *Biometrika, 10*(2/3), 384–390. https://doi.org/10.2307/2331789
Tate, R. F. (1954). Correlation between a discrete and a continuous variable. Point-biserial correlation. *The Annals of Mathematical Statistics, 25*(3), 603–607. https://doi.org/10.1214/aoms/1177728730
Tate, R. F. (1955). Applications of correlation models for biserial data. *Journal of the American Statistical Association, 50*(272), 1078–1095. https://doi.org/10.2307/2281207
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
--------
>>> import pandas as pd
>>> dfr = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> r_point_biserial(dfr['Gen_Gender'], dfr['Over_Grade'])
cat. 1 cat. 2 mean 1 mean 2 r_pb
0 Male Female 59.766667 53.727273 -0.127183
'''
#convert to pandas series if needed
if isinstance(catField, list):
catField = pd.Series(catField)
if isinstance(scaleField, list):
scaleField = pd.Series(scaleField)
#combine as one dataframe
df = pd.concat([catField, scaleField], axis=1)
df = df.dropna()
#the two categories (if not provided, use the two most frequent)
if categories is not None:
cat1 = categories[0]
cat2 = categories[1]
else:
cat1 = df.iloc[:,0].value_counts().index[0]
cat2 = df.iloc[:,0].value_counts().index[1]
#separate the scores for each category
x1 = df.iloc[:,1][df.iloc[:,0] == cat1]
x2 = df.iloc[:,1][df.iloc[:,0] == cat2]
combined = pd.concat([x1, x2])
# sample sizes
n1 = len(x1)
n2 = len(x2)
n = len(combined)
# sample proportions
p = n1/n
q = n2/n
# means and overall population standard deviation
m1 = x1.mean()
m2 = x2.mean()
s = combined.std(ddof=0)
# point-biserial correlation
r_pb = (m2 - m1)/s * (p*q)**0.5
#the results
colnames = ["cat. 1", "cat. 2", 'mean 1', 'mean 2', 'r_pb']
results = pd.DataFrame([[cat1, cat2, m1, m2, r_pb]], columns=colnames)
return results