Module stikpetP.correlations.cor_tetrachoric
Expand source code
import numpy as np
from scipy.stats import multivariate_normal
from statistics import NormalDist
import math
import pandas as pd
from ..other.table_cross import tab_cross
def r_tetrachoric(field1, field2, categories1=None, categories2=None, method="divgi"):
'''
Tetrachoric Correlation Coefficient
-----------------------------------
In essence this attempts to mimic a correlation coefficient between two scale variables. It can be defined as "An estimate of the correlation between two random variables having a bivariate normal distribution, obtained from the information from a double dichotomy of their bivariate distribution" (Everitt, 2004, p. 372).
This assumes the two binary variables have ‘hidden’ underlying normal distribution. If so, the combination of the two forms a bivariate normal distribution with a specific correlation between them. The quest is then to find the correlation, such that the cumulative density function of the z-values of the two marginal totals of the top-left cell (a) match that value.
This is quite tricky to do, so a few have proposed an approximation for this. These include Yule r, Pearson Q4 and Q5, Camp, Becker and Clogg, and Bonett and Price.
Besides closed form approximation formula's, various algorithms have been designed as well. The three most often mentioned are Brown (1977), Kirk (1973), and Divgi (1979), available in this function.
Parameters
----------
field1 : pandas series
data with categories for the rows
field2 : pandas series
data with categories for the columns
categories1 : list or dictionary, optional
the two categories to use from field1. If not set the first two found will be used
categories2 : list or dictionary, optional
the two categories to use from field2. If not set the first two found will be used
method : {"divgi", "search", "brown", "kirk"}, optional
method to use (see details)
Returns
-------
Tetrachoric Correlation Coefficient
Notes
-----
The "search" method does a binary search for rt using the bivariate normal distribution from the *scipy's* function **multivariate_normal()**.
"kirk" will use Kirk (1973) Fortran TET8 procedure, adapted by stikpet
"brown" will use Brown (1977) - Algorithm AS 116
"divgi" will use Divgi (1979) algorithm
Flow charts of these algorithms can be found at https://peterstatistics.com/CrashCourse/3-TwoVarUnpair/BinBin/BinBin-2b-EffectSize.html
A separate function for each method is also available, which each require as input the four cell values.
Before, After and Alternatives
-----------------------
Before determining the effect size measure, you might want to use:
* [tab_cross](../other/table_cross.html#tab_cross) to get a cross table and/or
* [ts_fisher](../tests/test_fisher.html#ts_fisher) to perform a Fisher exact test
Alternative effect size measures can be found in the [es_bin_bin](../effect_sizes/eff_size_bin_bin.html#es_bin_bin) documentation.
Faul et al. (2009, p. 1150) write: “Cohen’s (1988) conventions for correlations in the framework of the bivariate normal model may serve as rough reference points” (p. 1150). This most likely suggests the thresholds from Cohen for the Pearson correlation of 0.10, 0.30 and 0.50. The [th_pearson_r](../other/thumb_pearson_r.html#th_pearson_r) function could be used for this.
References
----------
Brown, M. B. (1977). Algorithm AS 116: The tetrachoric correlation and its asymptotic standard error. *Applied Statistics, 26*(3), 343. https://doi.org/10.2307/2346985
Divgi, D. R. (1979). Calculation of the tetrachoric correlation coefficient. *Psychometrika, 44*(2), 169–172. https://doi.org/10.1007/BF02293968
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. *Behavior Research Methods, 41*(4), 1149–1160. https://doi.org/10.3758/BRM.41.4.1149
Kirk, D. B. (1973). On the numerical approximation of the bivariate normal (tetrachoric) correlation coefficient. *Psychometrika, 38*(2), 259–268. https://doi.org/10.1007/BF02291118
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
--------
>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> r_tetrachoric(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"])
-0.2104257288112121
>>> r_tetrachoric(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"], method="brown")
0.21042572904332063
>>> r_tetrachoric(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"], method="kirk")
-0.20502692146733825
>>> r_tetrachoric(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"], method="divgi")
-0.2104257288112121
'''
# determine sample cross table
tab = tab_cross(field1, field2, order1=categories1, order2=categories2, percent=None, totals="exclude")
# cell values of sample cross table
a = tab.iloc[0,0]
b = tab.iloc[0,1]
c = tab.iloc[1,0]
d = tab.iloc[1,1]
if method=="search":
rho = r_tetrachor_iter(a, b, c, d)
elif method=="brown":
rho = r_tetrachor_brown(a, b, c, d)
elif method=="kirk":
rho = r_tetrachor_kirk(a, b, c, d)
elif method=="divgi":
rho = r_tetrachor_divgi(a, b, c, d)
return rho
def r_tetrachor_iter(a, b, c, d, niters=50):
#The totals
R1 = a + b
R2 = c + d
C1 = a + c
C2 = b + d
n = R1 + R2
aProp = a/n
r1Prop = R1/n
c1Prop = C1/n
r1Zscore = NormalDist().inv_cdf(r1Prop)
c1Zscore = NormalDist().inv_cdf(c1Prop)
step = 0.1
#Initial start
rhoBest = -0.9
rho = rhoBest
cov = np.array([[1, rho], [rho, 1]])
p = multivariate_normal.cdf(np.array([r1Zscore, c1Zscore]), cov=cov)
dif1 = abs(aProp - p)
dif1
#First iteration
rho = rho + step
cov = np.array([[1, rho], [rho, 1]])
p = multivariate_normal.cdf(np.array([r1Zscore, c1Zscore]), cov=cov)
dif2 = abs(aProp - p)
dif2
#iterations
for i in range(0, niters):
if dif2<dif1:
rhoBest = rho
dif1 = dif2
else:
rho = rho - 2*step
step = step/2
cov = np.array([[1, rho], [rho, 1]])
p = multivariate_normal.cdf(np.array([r1Zscore, c1Zscore]), cov=cov)
dif1 = abs(aProp - p)
rho = rho + step
cov = np.array([[1, rho], [rho, 1]])
p = multivariate_normal.cdf(np.array([r1Zscore, c1Zscore]), cov=cov)
dif2 = abs(aProp - p)
return (rho)
def r_tetrachor_brown(a, b, c, d):
# P. Stikker adaptation of Fortran code from:
# Brown, M. B. (1977). Algorithm AS 116: The tetrachoric correlation and its asymptotic standard error. Applied Statistics, 26(3), 343. doi: 10.2307/2346985
x = [0.9972638618, 0.9856115115, 0.9647622556, 0.9349060759,
0.8963211558, 0.8493676137, 0.794483796, 0.7321821187,
0.6630442669, 0.5877157572, 0.5068999089, 0.4213512761,
0.3318686023, 0.2392873623, 0.1444719616, 0.0483076657]
w = [0.00701861, 0.0162743947, 0.0253920653, 0.0342738629,
0.042835898, 0.0509980593, 0.0586840935, 0.0658222228,
0.0723457941, 0.0781938958, 0.0833119242, 0.087652093,
0.0911738787, 0.0938443991, 0.0956387201, 0.096540085]
sqt2pi = 2.50662827
rlimit = 0.9999
rcut = 0.95
uplim = 5
const = 1 * 10 ** (-60)
chalf = 1 * 10 ** (-30)
conv = 1 * 10 ** (-8)
citer = 1 * 10 ** (-6)
niter = 25
# INITIALIZATION
r = 0
itype = 0
# CHECK IF ANY CELL FREQUENCY IS NEGATIVE
if a < 0 or b < 0 or c < 0 or d < 0:
print("error in input")
r = 2
else:
# CHECK IF ANY FREQUENCY IS ZERO AND SET KDELTA
kdelta = 1
delta = 0
if a == 0 or d == 0:
kdelta = 2
if b == 0 or c == 0:
kdelta = kdelta + 2
# KDELTA=4 MEANS TABLE HAS ZERO ROW OR COLUMN, RUN IS TERMINATED
# DELTA IS 0.0, 0.5 OR -0.5 ACCORDING TO WHICH CELL IS ZERO
if kdelta == 4:
# GO TO 92
print("error row or column total is zero")
r = 2
else:
if kdelta != 1:
if kdelta == 2:
# GOTO 1
# 1
delta = 0.5
if a == 0 and d == 0:
r = -1
# GOTO 4
else:
if kdelta == 3:
# GOTO 2
# 2
delta = -0.5
if b == 0 and c == 0:
r = 1
# 4
if r != 0:
itype = 3
# STORE FREQUENCIES IN AA, BB, CC AND DD
aa = a + delta
bb = b - delta
cc = c - delta
dd = d + delta
tot = aa + bb + cc + dd
# CHECK IF CORREIATION IS NEGATIVE, ZERO, POSITIVE
# COMPUTE PROBILITIES OF QUADRANT AND OF MARGINALS PRBMAA AND PROBAC CHOSEN SO THAT CORREIATION IS POSITIVE. KSIGN INDICATES WHETHER QUADRANTS HAVE BEEN SWITCHED
if aa * dd - bb * cc < 0:
# GO TO 7
# 7
probaa = bb / tot
probac = (bb + dd) / tot
ksign = 2
else:
if aa * dd - bb * cc == 0:
# GO TO 5
# 5
itype = 4
# 6
probaa = aa / tot
probac = (aa + cc) / tot
ksign = 1
# GO TO 8
# 8
probab = (aa + bb) / tot
# COMPUTE NORMALDEVIATES FOR THE MARGINAL FREQUENCIES
# SINCE NO MARGINAL CAN BE ZERO, IE IS NOT CHECKED
zac = NormalDist().inv_cdf(probac)
zab = NormalDist().inv_cdf(probab)
ss = math.exp(-0.5 * (zac ** 2 + zab ** 2)) / (2 * math.pi)
# WHEN R IS 0.0, 1.0 OR -1.0, TRANSFER TO COMUTE SDZERO
if r != 0 or itype > 0:
# GOTO 85
skip85 = False
else:
# WHEN MARGINALS ARE EQUAL, COSINE EVALUATION IS USED
if a == d and b == c:
# GOTO 60
# 60
rr = -1 * cos(2 * math.pi * probaa)
itype = 2
else:
# INITIAL ESTIMATE OF CORRELATION IS YULES Y
rr = (math.sqrt(aa * dd) - math.sqrt(bb * cc)) ** 2 / abs(aa * dd - bb * cc)
iter = 0
# IF RR EXCEEDS RCUT, GAUSSIANQUADRATURE IS USED
loop10 = True
while loop10:
loop10 = False
# 10
if rr > rcut:
# GOTO 40
skip40 = False
else:
# TETRACHORIC SERIES IS COMPUTED
# INITIALIALIZATION
va = 1
vb = zac
wa = 1
wb = zab
term = 1
iterm = 0
sum = probab * probac
deriv = 0
sr = ss
# 15
loop15 = True
while loop15:
loop15 = False
if abs(sr) <= const:
# RESCALE TERMS TO AVOID OVERFLOWS AND UNDERFLOWS
sr = sr / const
va = va * chalf
vb = vb * chalf
wa = wa * chalf
wb = wb * chalf
# FORM SUM AND DERIV OF SERIES
# 20
dr = sr * va * wa
sr = sr * rr / term
cof = sr * va * wa
# ITERM COUNTS NO. OF CONSECUTIVE TERMS .LT. CONV
iterm = iterm + 1
if abs(cof) > conv:
iterm = 0
sum = sum + cof
deriv = deriv + dr
vaa = va
waa = wa
va = vb
wa = wb
vb = zac * va - term * vaa
wb = zab * wa - term * waa
term = term + 1
if iterm < 2 or term < 6:
loop15 = True
# CHECK IF ITERATION CONVERGED
if abs(sum - probaa) > citer:
# GOTO 25
# CALCULATE NEXT ESTIMATE OF CORRELATION
# 25
iter = iter + 1
# IF TOO MANY ITERATIONS, RUN IS TERMINATED
if iter >= niter:
# GOTO 93
skip40 = True
skip70 = True
skip85 = False
else:
delta = (sum - probaa) / deriv
rrprev = rr
rr = rr - delta
if iter == 1:
rr = rr + 0.5 * delta
if rr > rlimit:
rr = rlimit
if rr < 0:
rr = 0
# GOTO 10
loop10 = True
skip40 = False
else:
# ITERATION HAS CONVERGED,SET ITYPE
itype = term
# GOTO 70
skip40 = True
skip70 = False
if not skip40:
# GAUSSIAN QUADRATURE
# 40
if iter <= 0:
# INITIALIZATION IF THIS IS FIRST ITERATION
sum = probab * probac
rrprev = 0
# INITIALIZATION
# 41
sumprv = probab - sum
prob = bb / tot
if ksign == 2:
prob = aa / tot
itype = 1
# LOOP TO FIND ESTIMATE OF CORRELATION
# COMPUTATION OF INTEGRAL(SUM) BY QUADRATURE
loop42 = True
while loop42:
loop42 = False
# 42
rrsq = math.sqrt(1 - rr ** 2)
amid = 0.5 * (uplim + zac)
xlen = uplim - amid
sum = 0
for iquad in range(1, 16 + 1, 1):
xla = amid + x[iquad - 1] * xlen
xlb = amid - x[iquad - 1] * xlen
# TO AVOID UNDERFLOWS, TEMPA AND TEMPB ARE USED
tempa = (zab - rr * xla) / rrsq
if tempa >= -6:
sum = sum + w[iquad - 1] * math.exp(-0.5 * xla ** 2) * NormalDist().cdf(tempa)
tempb = (zab - rr * xlb) / rrsq
if tempb >= -6:
sum = sum + w[iquad - 1] * math.exp(-0.5 * xlb ** 2) * NormalDist().cdf(tempb)
# 44
sum = sum * xlen / sqt2pi
# CHECX IF ITERATION HAS CONVERGED
if abs(prob - sum) <= citer:
# GOTO 70
skip70 = False
else:
iter = iter + 1
# IF TOO MANY ITERATIONS, RUN IS TERMINATED
if iter >= niter:
# GOTO 93
skip70 = True
skip85 = False
# ESTIMATE CORRELATIONFOR NEXT ITERATION BY LINEAR INTERPOLATION
rrest = ((prob - sum) * rrprev - (prob - sumprv) * rr) / (sumprv - sum)
# IS ESTIMATE POSITIVE AND LESS THAN UPPER LIMIT
if rrest > rlimit:
rrest = rlimit
if rrest < 0:
rrest = 0
rrprev = rr
rr = rrest
sumprv = sum
# IF ESTIMATE HAS SAME VALUE ON TWO ITERATIONS, STOP ITERATION
if rr == rrprev:
# GOTO 70
skip70 = False
else:
# GOTO 42
loop42 = True
if not skip70:
# COMPUTE SDR
# 70
r = rr
rrsq = math.sqrt(1 - r ** 2)
if kdelta > 1:
itype = -itype
if ksign != 1:
r = -r
zac = -zac
# 71
pdf = math.exp(-0.5 * (zac ** 2 - 2 * r * zac * zab + zab ** 2) / rrsq ** 2) / (2 * math.pi * rrsq)
pac = NormalDist().cdf((zac - r * zab) / rrsq) - 0.5
pab = NormalDist().cdf((zab - r * zac) / rrsq) - 0.5
sdr = (aa + dd) * (bb + cc) / 4 + pab ** 2 * (aa + cc) * (bb + dd) + pac ** 2 * (aa + bb) * (cc + dd) + 2 * pab * pac * (aa * dd - bb * cc) - pab * (aa * bb - cc * dd) - pac * (aa * cc - bb * dd)
if sdr < 0:
sdr = 0
sdr = math.sqrt(sdr) / (tot * pdf * math.sqrt(tot))
skip85 = False
if not skip85:
# 85
sdzero = math.sqrt((aa + bb) * (aa + cc) * (bb + dd) * (cc + dd) / tot) / (tot ** 2 * ss)
if r == 0:
sdr = 0
return r
def r_tetrachor_kirk(af, bf, cf, df):
# P. Stikker adaptation of Fortran IV code from:
# Kirk, D. B. (1973). On the numerical approximation of the bivariate normal (tetrachoric) correlation coefficient. Psychometrika, 38(2), 259–268. doi: 10.1007/BF02291118
# Since:
# P = the joint proportion for which both marginal values are < .5
# We might have to swop the rows and/or columns
if af + bf > cf + df:
# swop rows
oldTemp = af
af = cf
cf = oldTemp
oldTemp = bf
bf = df
df = oldTemp
if af + cf > bf + df:
# swop columns
oldTemp = af
af = bf
bf = oldTemp
oldTemp = cf
cf = df
df = oldTemp
# input parameters of original function
n = af + bf + cf + df
p = af / n
fm1 = (af + bf) / n
fm2 = (af + cf) / n
# value settings
c1 = 0.019855071751232
c2 = 0.101666761293187
c3 = 0.237233795041835
c4 = 0.408282678752175
c5 = 0.591717321247825
c6 = 0.762766204958165
c7 = 0.898333238706813
c8 = 0.980144928248768
c1sq = 0.000394223874247
c2sq = 0.010336130351846
c3sq = 0.056279873509951
c4sq = 0.166694745769052
c5sq = 0.350129388264702
c6sq = 0.581812283426281
c7sq = 0.807002607765472
c8sq = 0.960684080371783
w1 = 0.050614268145188
w2 = 0.111190517226687
w3 = 0.156853322938943
w4 = 0.181341891689181
rpi = 0.398942280401433
rtpi = 2.506628274631
r = 0
# EVALUATION OF H AND K
j = 1
# CONVERGENCE CRITERION FOR H AND K
eps = 1 * 10 ** (-5)
x = fm1
con = (0.5 - fm1) * rtpi
# GO TO 200
loop200 = True
while loop200:
loop200 = False
if x > 0.5:
# GO TO 600
# 600
r = 4
# GO TO 1000
else:
# HASTINGS' APPROXIMATION
# 250
e = math.sqrt(-2 * math.log(x))
e1 = (0.010328 * e + 0.802853) * e + 2.515517
e2 = ((0.001308 * e + 0.189269) * e + 1.432788) * e + 1
est = e - e1 / e2
loop300 = True
while loop300:
loop300 = False
# 300
old = est
# ITERATION LOOP
loop350 = True
i = 1
while loop350 and i <= 20:
loop350 = False
# 350
xx = old * old
if j == 3 and abs(old) >= 1:
# GO TO 550
# 550
if r != 2:
r = 2
old = 0.97 * sgn(a)
# GO TO 350
i = i + 1
loop350 = True
else:
if j == 3:
# GO TO 400
# 400
# R INTEGRAND EVALUATION
fnum = old * (w1 * (hEfn2(h2k2, hk2, x, xx, c1, c1sq) + hEfn2(h2k2, hk2, x, xx, c8, c8sq)) + w2 * (hEfn2(h2k2, hk2, x, xx, c2, c2sq) + hEfn2(h2k2, hk2, x, xx, c7, c7sq)) + w3 * (hEfn2(h2k2, hk2, x, xx, c3, c3sq) + hEfn2(h2k2, hk2, x, xx, c6, c6sq)) + w4 * (hEfn2(h2k2, hk2, x, xx, c4, c4sq) + hEfn2(h2k2, hk2, x, xx, c5, c5sq)))
fnew = old - (fnum - con) / hEfn2(h2k2, hk2, x, xx, 1, 1)
else:
# H AND K INTEGRAND EVALUATION FOR GAUSSIAN QUADRATURE
fnum = old * (w1 * (hEfn1(xx, c1sq) + hEfn1(xx, c8sq)) + w2 * (hEfn1(xx, c2sq) + hEfn1(xx, c7sq)) + w3 * (hEfn1(xx, c3sq) + hEfn1(xx, c6sq)) + w4 * (hEfn1(xx, c4sq) + hEfn1(xx, c5sq)))
fnew = old - (fnum - con) / hEfn1(xx, 1)
# GO TO 450
# TEST FOR CONVERGENCE
# 450
if abs(old - fnew) > eps:
# GO TO 500
# 500
old = fnew
i = i + 1
if i > 20:
r = 3
# GO TO 1000
else:
loop350 = True
else:
if j == 1:
# GO TO 100
# 100
j = 2
h = fnew
h2 = h * h
zh = rpi * math.exp(-1 * h2 / 2)
x = fm2
con = (0.5 - fm2) * rtpi
# GO TO 200
loop200 = True
else:
if j == 2:
# GO TO 150
# 150
fk = fnew
fk2 = fk * fk
h2k2 = h2 + fk2
hk2 = h * fk * 2
zk = rpi * math.exp(-1 * fk2 / 2)
j = 3
# STARTING ESTIMATE FOR R
a = p - fm1 * fm2
con = a * 6.28318530717959
zhk = zh * zk
est = 2 * (math.sqrt(abs(hk2 * a / zhk + 1)) - 1) / hk2
if abs(hk2) <= 1 * 10 ** (-8):
est = a / zhk
if abs(est) > 0.8:
est = 0.8 * sgn(a)
# CONVERGENCE CRITERION FOR R
eps = 1 * 10 ** (-4)
# GO TO 300
loop300 = True
else:
# GO TO 650
r = fnew
# GO TO 1000
# 200
# 1000
rt = r
return rt
def hEfn1(xx, w):
f = math.exp(-1 * xx * w / 2)
return f
def hEfn2(h2k2, hk2, x, xx, v, w):
f = math.exp((-1 * h2k2 + hk2 * v * x) / (2 * (1 - w * xx))) / math.sqrt(1 - w * xx)
return f
def sgn(n):
if n == 0: return 0
elif n < 0: return -1
else: return 1
def r_tetrachor_divgi(a, b, c, d, iter=10):
# Divgi Algorithm
# Divgi, D. R. (1979). Calculation of the tetrachoric correlation coefficient. Psychometrika, 44(2), 169–172. doi: 10.1007/BF02293968
r1 = a + b
c1 = a + c
n = r1 + c + d
p1 = r1 / n
p2 = c1 / n
h = NormalDist().inv_cdf(p1)
k = NormalDist().inv_cdf(p2)
if abs(h) > abs(k):
hAdj = abs(h)
kAdj = abs(k)
else:
hAdj = abs(k)
kAdj = abs(h)
odRat = (a * d) / (b * c)
dA = 0.5 / (1 + (hAdj ** 2 + kAdj ** 2) * (0.12454 - 0.27102 * (1 - hAdj / math.sqrt(hAdj ** 2 + kAdj ** 2))))
dB = 0.5 / (1 + (hAdj ** 2 + kAdj ** 2) * (0.82281 - 1.03514 * kAdj / math.sqrt(hAdj ** 2 + kAdj ** 2)))
dC = 0.07557 * hAdj + (hAdj - kAdj) ** 2 * (0.51141 / (hAdj + 2.05793) - 0.07557 / hAdj)
dD = kAdj * (0.79289 + 4.28981 / (1 + 3.30231 * hAdj))
alp = dA + dB * (-1 + 1 / (1 + dC * (math.log(odRat) - dD) ** 2))
rt = math.cos(math.pi / (1 + odRat ** alp))
for i in range(1, iter + 1, 1):
cov = np.array([[1, rt], [rt, 1]])
l = multivariate_normal.cdf(np.array([h, k]), cov=cov)
ld = math.exp(-(h ** 2 - 2 * rt * h * k + k ** 2) / (2 * (1 - rt ** 2))) / (2 * math.pi * math.sqrt(1 - rt ** 2))
rt = rt - (l - float(a) / n) / ld
rt = rt * sgn(h) * sgn(k)
return rt
Functions
def hEfn1(xx, w)
-
Expand source code
def hEfn1(xx, w): f = math.exp(-1 * xx * w / 2) return f
def hEfn2(h2k2, hk2, x, xx, v, w)
-
Expand source code
def hEfn2(h2k2, hk2, x, xx, v, w): f = math.exp((-1 * h2k2 + hk2 * v * x) / (2 * (1 - w * xx))) / math.sqrt(1 - w * xx) return f
def r_tetrachor_brown(a, b, c, d)
-
Expand source code
def r_tetrachor_brown(a, b, c, d): # P. Stikker adaptation of Fortran code from: # Brown, M. B. (1977). Algorithm AS 116: The tetrachoric correlation and its asymptotic standard error. Applied Statistics, 26(3), 343. doi: 10.2307/2346985 x = [0.9972638618, 0.9856115115, 0.9647622556, 0.9349060759, 0.8963211558, 0.8493676137, 0.794483796, 0.7321821187, 0.6630442669, 0.5877157572, 0.5068999089, 0.4213512761, 0.3318686023, 0.2392873623, 0.1444719616, 0.0483076657] w = [0.00701861, 0.0162743947, 0.0253920653, 0.0342738629, 0.042835898, 0.0509980593, 0.0586840935, 0.0658222228, 0.0723457941, 0.0781938958, 0.0833119242, 0.087652093, 0.0911738787, 0.0938443991, 0.0956387201, 0.096540085] sqt2pi = 2.50662827 rlimit = 0.9999 rcut = 0.95 uplim = 5 const = 1 * 10 ** (-60) chalf = 1 * 10 ** (-30) conv = 1 * 10 ** (-8) citer = 1 * 10 ** (-6) niter = 25 # INITIALIZATION r = 0 itype = 0 # CHECK IF ANY CELL FREQUENCY IS NEGATIVE if a < 0 or b < 0 or c < 0 or d < 0: print("error in input") r = 2 else: # CHECK IF ANY FREQUENCY IS ZERO AND SET KDELTA kdelta = 1 delta = 0 if a == 0 or d == 0: kdelta = 2 if b == 0 or c == 0: kdelta = kdelta + 2 # KDELTA=4 MEANS TABLE HAS ZERO ROW OR COLUMN, RUN IS TERMINATED # DELTA IS 0.0, 0.5 OR -0.5 ACCORDING TO WHICH CELL IS ZERO if kdelta == 4: # GO TO 92 print("error row or column total is zero") r = 2 else: if kdelta != 1: if kdelta == 2: # GOTO 1 # 1 delta = 0.5 if a == 0 and d == 0: r = -1 # GOTO 4 else: if kdelta == 3: # GOTO 2 # 2 delta = -0.5 if b == 0 and c == 0: r = 1 # 4 if r != 0: itype = 3 # STORE FREQUENCIES IN AA, BB, CC AND DD aa = a + delta bb = b - delta cc = c - delta dd = d + delta tot = aa + bb + cc + dd # CHECK IF CORREIATION IS NEGATIVE, ZERO, POSITIVE # COMPUTE PROBILITIES OF QUADRANT AND OF MARGINALS PRBMAA AND PROBAC CHOSEN SO THAT CORREIATION IS POSITIVE. KSIGN INDICATES WHETHER QUADRANTS HAVE BEEN SWITCHED if aa * dd - bb * cc < 0: # GO TO 7 # 7 probaa = bb / tot probac = (bb + dd) / tot ksign = 2 else: if aa * dd - bb * cc == 0: # GO TO 5 # 5 itype = 4 # 6 probaa = aa / tot probac = (aa + cc) / tot ksign = 1 # GO TO 8 # 8 probab = (aa + bb) / tot # COMPUTE NORMALDEVIATES FOR THE MARGINAL FREQUENCIES # SINCE NO MARGINAL CAN BE ZERO, IE IS NOT CHECKED zac = NormalDist().inv_cdf(probac) zab = NormalDist().inv_cdf(probab) ss = math.exp(-0.5 * (zac ** 2 + zab ** 2)) / (2 * math.pi) # WHEN R IS 0.0, 1.0 OR -1.0, TRANSFER TO COMUTE SDZERO if r != 0 or itype > 0: # GOTO 85 skip85 = False else: # WHEN MARGINALS ARE EQUAL, COSINE EVALUATION IS USED if a == d and b == c: # GOTO 60 # 60 rr = -1 * cos(2 * math.pi * probaa) itype = 2 else: # INITIAL ESTIMATE OF CORRELATION IS YULES Y rr = (math.sqrt(aa * dd) - math.sqrt(bb * cc)) ** 2 / abs(aa * dd - bb * cc) iter = 0 # IF RR EXCEEDS RCUT, GAUSSIANQUADRATURE IS USED loop10 = True while loop10: loop10 = False # 10 if rr > rcut: # GOTO 40 skip40 = False else: # TETRACHORIC SERIES IS COMPUTED # INITIALIALIZATION va = 1 vb = zac wa = 1 wb = zab term = 1 iterm = 0 sum = probab * probac deriv = 0 sr = ss # 15 loop15 = True while loop15: loop15 = False if abs(sr) <= const: # RESCALE TERMS TO AVOID OVERFLOWS AND UNDERFLOWS sr = sr / const va = va * chalf vb = vb * chalf wa = wa * chalf wb = wb * chalf # FORM SUM AND DERIV OF SERIES # 20 dr = sr * va * wa sr = sr * rr / term cof = sr * va * wa # ITERM COUNTS NO. OF CONSECUTIVE TERMS .LT. CONV iterm = iterm + 1 if abs(cof) > conv: iterm = 0 sum = sum + cof deriv = deriv + dr vaa = va waa = wa va = vb wa = wb vb = zac * va - term * vaa wb = zab * wa - term * waa term = term + 1 if iterm < 2 or term < 6: loop15 = True # CHECK IF ITERATION CONVERGED if abs(sum - probaa) > citer: # GOTO 25 # CALCULATE NEXT ESTIMATE OF CORRELATION # 25 iter = iter + 1 # IF TOO MANY ITERATIONS, RUN IS TERMINATED if iter >= niter: # GOTO 93 skip40 = True skip70 = True skip85 = False else: delta = (sum - probaa) / deriv rrprev = rr rr = rr - delta if iter == 1: rr = rr + 0.5 * delta if rr > rlimit: rr = rlimit if rr < 0: rr = 0 # GOTO 10 loop10 = True skip40 = False else: # ITERATION HAS CONVERGED,SET ITYPE itype = term # GOTO 70 skip40 = True skip70 = False if not skip40: # GAUSSIAN QUADRATURE # 40 if iter <= 0: # INITIALIZATION IF THIS IS FIRST ITERATION sum = probab * probac rrprev = 0 # INITIALIZATION # 41 sumprv = probab - sum prob = bb / tot if ksign == 2: prob = aa / tot itype = 1 # LOOP TO FIND ESTIMATE OF CORRELATION # COMPUTATION OF INTEGRAL(SUM) BY QUADRATURE loop42 = True while loop42: loop42 = False # 42 rrsq = math.sqrt(1 - rr ** 2) amid = 0.5 * (uplim + zac) xlen = uplim - amid sum = 0 for iquad in range(1, 16 + 1, 1): xla = amid + x[iquad - 1] * xlen xlb = amid - x[iquad - 1] * xlen # TO AVOID UNDERFLOWS, TEMPA AND TEMPB ARE USED tempa = (zab - rr * xla) / rrsq if tempa >= -6: sum = sum + w[iquad - 1] * math.exp(-0.5 * xla ** 2) * NormalDist().cdf(tempa) tempb = (zab - rr * xlb) / rrsq if tempb >= -6: sum = sum + w[iquad - 1] * math.exp(-0.5 * xlb ** 2) * NormalDist().cdf(tempb) # 44 sum = sum * xlen / sqt2pi # CHECX IF ITERATION HAS CONVERGED if abs(prob - sum) <= citer: # GOTO 70 skip70 = False else: iter = iter + 1 # IF TOO MANY ITERATIONS, RUN IS TERMINATED if iter >= niter: # GOTO 93 skip70 = True skip85 = False # ESTIMATE CORRELATIONFOR NEXT ITERATION BY LINEAR INTERPOLATION rrest = ((prob - sum) * rrprev - (prob - sumprv) * rr) / (sumprv - sum) # IS ESTIMATE POSITIVE AND LESS THAN UPPER LIMIT if rrest > rlimit: rrest = rlimit if rrest < 0: rrest = 0 rrprev = rr rr = rrest sumprv = sum # IF ESTIMATE HAS SAME VALUE ON TWO ITERATIONS, STOP ITERATION if rr == rrprev: # GOTO 70 skip70 = False else: # GOTO 42 loop42 = True if not skip70: # COMPUTE SDR # 70 r = rr rrsq = math.sqrt(1 - r ** 2) if kdelta > 1: itype = -itype if ksign != 1: r = -r zac = -zac # 71 pdf = math.exp(-0.5 * (zac ** 2 - 2 * r * zac * zab + zab ** 2) / rrsq ** 2) / (2 * math.pi * rrsq) pac = NormalDist().cdf((zac - r * zab) / rrsq) - 0.5 pab = NormalDist().cdf((zab - r * zac) / rrsq) - 0.5 sdr = (aa + dd) * (bb + cc) / 4 + pab ** 2 * (aa + cc) * (bb + dd) + pac ** 2 * (aa + bb) * (cc + dd) + 2 * pab * pac * (aa * dd - bb * cc) - pab * (aa * bb - cc * dd) - pac * (aa * cc - bb * dd) if sdr < 0: sdr = 0 sdr = math.sqrt(sdr) / (tot * pdf * math.sqrt(tot)) skip85 = False if not skip85: # 85 sdzero = math.sqrt((aa + bb) * (aa + cc) * (bb + dd) * (cc + dd) / tot) / (tot ** 2 * ss) if r == 0: sdr = 0 return r
def r_tetrachor_divgi(a, b, c, d, iter=10)
-
Expand source code
def r_tetrachor_divgi(a, b, c, d, iter=10): # Divgi Algorithm # Divgi, D. R. (1979). Calculation of the tetrachoric correlation coefficient. Psychometrika, 44(2), 169–172. doi: 10.1007/BF02293968 r1 = a + b c1 = a + c n = r1 + c + d p1 = r1 / n p2 = c1 / n h = NormalDist().inv_cdf(p1) k = NormalDist().inv_cdf(p2) if abs(h) > abs(k): hAdj = abs(h) kAdj = abs(k) else: hAdj = abs(k) kAdj = abs(h) odRat = (a * d) / (b * c) dA = 0.5 / (1 + (hAdj ** 2 + kAdj ** 2) * (0.12454 - 0.27102 * (1 - hAdj / math.sqrt(hAdj ** 2 + kAdj ** 2)))) dB = 0.5 / (1 + (hAdj ** 2 + kAdj ** 2) * (0.82281 - 1.03514 * kAdj / math.sqrt(hAdj ** 2 + kAdj ** 2))) dC = 0.07557 * hAdj + (hAdj - kAdj) ** 2 * (0.51141 / (hAdj + 2.05793) - 0.07557 / hAdj) dD = kAdj * (0.79289 + 4.28981 / (1 + 3.30231 * hAdj)) alp = dA + dB * (-1 + 1 / (1 + dC * (math.log(odRat) - dD) ** 2)) rt = math.cos(math.pi / (1 + odRat ** alp)) for i in range(1, iter + 1, 1): cov = np.array([[1, rt], [rt, 1]]) l = multivariate_normal.cdf(np.array([h, k]), cov=cov) ld = math.exp(-(h ** 2 - 2 * rt * h * k + k ** 2) / (2 * (1 - rt ** 2))) / (2 * math.pi * math.sqrt(1 - rt ** 2)) rt = rt - (l - float(a) / n) / ld rt = rt * sgn(h) * sgn(k) return rt
def r_tetrachor_iter(a, b, c, d, niters=50)
-
Expand source code
def r_tetrachor_iter(a, b, c, d, niters=50): #The totals R1 = a + b R2 = c + d C1 = a + c C2 = b + d n = R1 + R2 aProp = a/n r1Prop = R1/n c1Prop = C1/n r1Zscore = NormalDist().inv_cdf(r1Prop) c1Zscore = NormalDist().inv_cdf(c1Prop) step = 0.1 #Initial start rhoBest = -0.9 rho = rhoBest cov = np.array([[1, rho], [rho, 1]]) p = multivariate_normal.cdf(np.array([r1Zscore, c1Zscore]), cov=cov) dif1 = abs(aProp - p) dif1 #First iteration rho = rho + step cov = np.array([[1, rho], [rho, 1]]) p = multivariate_normal.cdf(np.array([r1Zscore, c1Zscore]), cov=cov) dif2 = abs(aProp - p) dif2 #iterations for i in range(0, niters): if dif2<dif1: rhoBest = rho dif1 = dif2 else: rho = rho - 2*step step = step/2 cov = np.array([[1, rho], [rho, 1]]) p = multivariate_normal.cdf(np.array([r1Zscore, c1Zscore]), cov=cov) dif1 = abs(aProp - p) rho = rho + step cov = np.array([[1, rho], [rho, 1]]) p = multivariate_normal.cdf(np.array([r1Zscore, c1Zscore]), cov=cov) dif2 = abs(aProp - p) return (rho)
def r_tetrachor_kirk(af, bf, cf, df)
-
Expand source code
def r_tetrachor_kirk(af, bf, cf, df): # P. Stikker adaptation of Fortran IV code from: # Kirk, D. B. (1973). On the numerical approximation of the bivariate normal (tetrachoric) correlation coefficient. Psychometrika, 38(2), 259–268. doi: 10.1007/BF02291118 # Since: # P = the joint proportion for which both marginal values are < .5 # We might have to swop the rows and/or columns if af + bf > cf + df: # swop rows oldTemp = af af = cf cf = oldTemp oldTemp = bf bf = df df = oldTemp if af + cf > bf + df: # swop columns oldTemp = af af = bf bf = oldTemp oldTemp = cf cf = df df = oldTemp # input parameters of original function n = af + bf + cf + df p = af / n fm1 = (af + bf) / n fm2 = (af + cf) / n # value settings c1 = 0.019855071751232 c2 = 0.101666761293187 c3 = 0.237233795041835 c4 = 0.408282678752175 c5 = 0.591717321247825 c6 = 0.762766204958165 c7 = 0.898333238706813 c8 = 0.980144928248768 c1sq = 0.000394223874247 c2sq = 0.010336130351846 c3sq = 0.056279873509951 c4sq = 0.166694745769052 c5sq = 0.350129388264702 c6sq = 0.581812283426281 c7sq = 0.807002607765472 c8sq = 0.960684080371783 w1 = 0.050614268145188 w2 = 0.111190517226687 w3 = 0.156853322938943 w4 = 0.181341891689181 rpi = 0.398942280401433 rtpi = 2.506628274631 r = 0 # EVALUATION OF H AND K j = 1 # CONVERGENCE CRITERION FOR H AND K eps = 1 * 10 ** (-5) x = fm1 con = (0.5 - fm1) * rtpi # GO TO 200 loop200 = True while loop200: loop200 = False if x > 0.5: # GO TO 600 # 600 r = 4 # GO TO 1000 else: # HASTINGS' APPROXIMATION # 250 e = math.sqrt(-2 * math.log(x)) e1 = (0.010328 * e + 0.802853) * e + 2.515517 e2 = ((0.001308 * e + 0.189269) * e + 1.432788) * e + 1 est = e - e1 / e2 loop300 = True while loop300: loop300 = False # 300 old = est # ITERATION LOOP loop350 = True i = 1 while loop350 and i <= 20: loop350 = False # 350 xx = old * old if j == 3 and abs(old) >= 1: # GO TO 550 # 550 if r != 2: r = 2 old = 0.97 * sgn(a) # GO TO 350 i = i + 1 loop350 = True else: if j == 3: # GO TO 400 # 400 # R INTEGRAND EVALUATION fnum = old * (w1 * (hEfn2(h2k2, hk2, x, xx, c1, c1sq) + hEfn2(h2k2, hk2, x, xx, c8, c8sq)) + w2 * (hEfn2(h2k2, hk2, x, xx, c2, c2sq) + hEfn2(h2k2, hk2, x, xx, c7, c7sq)) + w3 * (hEfn2(h2k2, hk2, x, xx, c3, c3sq) + hEfn2(h2k2, hk2, x, xx, c6, c6sq)) + w4 * (hEfn2(h2k2, hk2, x, xx, c4, c4sq) + hEfn2(h2k2, hk2, x, xx, c5, c5sq))) fnew = old - (fnum - con) / hEfn2(h2k2, hk2, x, xx, 1, 1) else: # H AND K INTEGRAND EVALUATION FOR GAUSSIAN QUADRATURE fnum = old * (w1 * (hEfn1(xx, c1sq) + hEfn1(xx, c8sq)) + w2 * (hEfn1(xx, c2sq) + hEfn1(xx, c7sq)) + w3 * (hEfn1(xx, c3sq) + hEfn1(xx, c6sq)) + w4 * (hEfn1(xx, c4sq) + hEfn1(xx, c5sq))) fnew = old - (fnum - con) / hEfn1(xx, 1) # GO TO 450 # TEST FOR CONVERGENCE # 450 if abs(old - fnew) > eps: # GO TO 500 # 500 old = fnew i = i + 1 if i > 20: r = 3 # GO TO 1000 else: loop350 = True else: if j == 1: # GO TO 100 # 100 j = 2 h = fnew h2 = h * h zh = rpi * math.exp(-1 * h2 / 2) x = fm2 con = (0.5 - fm2) * rtpi # GO TO 200 loop200 = True else: if j == 2: # GO TO 150 # 150 fk = fnew fk2 = fk * fk h2k2 = h2 + fk2 hk2 = h * fk * 2 zk = rpi * math.exp(-1 * fk2 / 2) j = 3 # STARTING ESTIMATE FOR R a = p - fm1 * fm2 con = a * 6.28318530717959 zhk = zh * zk est = 2 * (math.sqrt(abs(hk2 * a / zhk + 1)) - 1) / hk2 if abs(hk2) <= 1 * 10 ** (-8): est = a / zhk if abs(est) > 0.8: est = 0.8 * sgn(a) # CONVERGENCE CRITERION FOR R eps = 1 * 10 ** (-4) # GO TO 300 loop300 = True else: # GO TO 650 r = fnew # GO TO 1000 # 200 # 1000 rt = r return rt
def r_tetrachoric(field1, field2, categories1=None, categories2=None, method='divgi')
-
Tetrachoric Correlation Coefficient
In essence this attempts to mimic a correlation coefficient between two scale variables. It can be defined as "An estimate of the correlation between two random variables having a bivariate normal distribution, obtained from the information from a double dichotomy of their bivariate distribution" (Everitt, 2004, p. 372).
This assumes the two binary variables have ‘hidden’ underlying normal distribution. If so, the combination of the two forms a bivariate normal distribution with a specific correlation between them. The quest is then to find the correlation, such that the cumulative density function of the z-values of the two marginal totals of the top-left cell (a) match that value.
This is quite tricky to do, so a few have proposed an approximation for this. These include Yule r, Pearson Q4 and Q5, Camp, Becker and Clogg, and Bonett and Price.
Besides closed form approximation formula's, various algorithms have been designed as well. The three most often mentioned are Brown (1977), Kirk (1973), and Divgi (1979), available in this function.
Parameters
field1
:pandas series
- data with categories for the rows
field2
:pandas series
- data with categories for the columns
categories1
:list
ordictionary
, optional- the two categories to use from field1. If not set the first two found will be used
categories2
:list
ordictionary
, optional- the two categories to use from field2. If not set the first two found will be used
method
:{"divgi", "search", "brown", "kirk"}
, optional- method to use (see details)
Returns
Tetrachoric Correlation Coefficient
Notes
The "search" method does a binary search for rt using the bivariate normal distribution from the scipy's function multivariate_normal().
"kirk" will use Kirk (1973) Fortran TET8 procedure, adapted by stikpet
"brown" will use Brown (1977) - Algorithm AS 116
"divgi" will use Divgi (1979) algorithm
Flow charts of these algorithms can be found at https://peterstatistics.com/CrashCourse/3-TwoVarUnpair/BinBin/BinBin-2b-EffectSize.html
A separate function for each method is also available, which each require as input the four cell values.
Before, After and Alternatives
Before determining the effect size measure, you might want to use: * tab_cross to get a cross table and/or * ts_fisher to perform a Fisher exact test
Alternative effect size measures can be found in the es_bin_bin documentation.
Faul et al. (2009, p. 1150) write: “Cohen’s (1988) conventions for correlations in the framework of the bivariate normal model may serve as rough reference points” (p. 1150). This most likely suggests the thresholds from Cohen for the Pearson correlation of 0.10, 0.30 and 0.50. The th_pearson_r function could be used for this.
References
Brown, M. B. (1977). Algorithm AS 116: The tetrachoric correlation and its asymptotic standard error. Applied Statistics, 26(3), 343. https://doi.org/10.2307/2346985
Divgi, D. R. (1979). Calculation of the tetrachoric correlation coefficient. Psychometrika, 44(2), 169–172. https://doi.org/10.1007/BF02293968
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.G. (2009). Statistical power analyses using GPower 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41*(4), 1149–1160. https://doi.org/10.3758/BRM.41.4.1149
Kirk, D. B. (1973). On the numerical approximation of the bivariate normal (tetrachoric) correlation coefficient. Psychometrika, 38(2), 259–268. https://doi.org/10.1007/BF02291118
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076Examples
>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv" >>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'}) >>> r_tetrachoric(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"]) -0.2104257288112121
>>> r_tetrachoric(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"], method="brown") 0.21042572904332063
>>> r_tetrachoric(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"], method="kirk") -0.20502692146733825
>>> r_tetrachoric(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"], method="divgi") -0.2104257288112121
Expand source code
def r_tetrachoric(field1, field2, categories1=None, categories2=None, method="divgi"): ''' Tetrachoric Correlation Coefficient ----------------------------------- In essence this attempts to mimic a correlation coefficient between two scale variables. It can be defined as "An estimate of the correlation between two random variables having a bivariate normal distribution, obtained from the information from a double dichotomy of their bivariate distribution" (Everitt, 2004, p. 372). This assumes the two binary variables have ‘hidden’ underlying normal distribution. If so, the combination of the two forms a bivariate normal distribution with a specific correlation between them. The quest is then to find the correlation, such that the cumulative density function of the z-values of the two marginal totals of the top-left cell (a) match that value. This is quite tricky to do, so a few have proposed an approximation for this. These include Yule r, Pearson Q4 and Q5, Camp, Becker and Clogg, and Bonett and Price. Besides closed form approximation formula's, various algorithms have been designed as well. The three most often mentioned are Brown (1977), Kirk (1973), and Divgi (1979), available in this function. Parameters ---------- field1 : pandas series data with categories for the rows field2 : pandas series data with categories for the columns categories1 : list or dictionary, optional the two categories to use from field1. If not set the first two found will be used categories2 : list or dictionary, optional the two categories to use from field2. If not set the first two found will be used method : {"divgi", "search", "brown", "kirk"}, optional method to use (see details) Returns ------- Tetrachoric Correlation Coefficient Notes ----- The "search" method does a binary search for rt using the bivariate normal distribution from the *scipy's* function **multivariate_normal()**. "kirk" will use Kirk (1973) Fortran TET8 procedure, adapted by stikpet "brown" will use Brown (1977) - Algorithm AS 116 "divgi" will use Divgi (1979) algorithm Flow charts of these algorithms can be found at https://peterstatistics.com/CrashCourse/3-TwoVarUnpair/BinBin/BinBin-2b-EffectSize.html A separate function for each method is also available, which each require as input the four cell values. Before, After and Alternatives ----------------------- Before determining the effect size measure, you might want to use: * [tab_cross](../other/table_cross.html#tab_cross) to get a cross table and/or * [ts_fisher](../tests/test_fisher.html#ts_fisher) to perform a Fisher exact test Alternative effect size measures can be found in the [es_bin_bin](../effect_sizes/eff_size_bin_bin.html#es_bin_bin) documentation. Faul et al. (2009, p. 1150) write: “Cohen’s (1988) conventions for correlations in the framework of the bivariate normal model may serve as rough reference points” (p. 1150). This most likely suggests the thresholds from Cohen for the Pearson correlation of 0.10, 0.30 and 0.50. The [th_pearson_r](../other/thumb_pearson_r.html#th_pearson_r) function could be used for this. References ---------- Brown, M. B. (1977). Algorithm AS 116: The tetrachoric correlation and its asymptotic standard error. *Applied Statistics, 26*(3), 343. https://doi.org/10.2307/2346985 Divgi, D. R. (1979). Calculation of the tetrachoric correlation coefficient. *Psychometrika, 44*(2), 169–172. https://doi.org/10.1007/BF02293968 Faul, F., Erdfelder, E., Buchner, A., & Lang, A.G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. *Behavior Research Methods, 41*(4), 1149–1160. https://doi.org/10.3758/BRM.41.4.1149 Kirk, D. B. (1973). On the numerical approximation of the bivariate normal (tetrachoric) correlation coefficient. *Psychometrika, 38*(2), 259–268. https://doi.org/10.1007/BF02291118 Author ------ Made by P. Stikker Companion website: https://PeterStatistics.com YouTube channel: https://www.youtube.com/stikpet Donations: https://www.patreon.com/bePatron?u=19398076 Examples -------- >>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv" >>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'}) >>> r_tetrachoric(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"]) -0.2104257288112121 >>> r_tetrachoric(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"], method="brown") 0.21042572904332063 >>> r_tetrachoric(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"], method="kirk") -0.20502692146733825 >>> r_tetrachoric(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"], method="divgi") -0.2104257288112121 ''' # determine sample cross table tab = tab_cross(field1, field2, order1=categories1, order2=categories2, percent=None, totals="exclude") # cell values of sample cross table a = tab.iloc[0,0] b = tab.iloc[0,1] c = tab.iloc[1,0] d = tab.iloc[1,1] if method=="search": rho = r_tetrachor_iter(a, b, c, d) elif method=="brown": rho = r_tetrachor_brown(a, b, c, d) elif method=="kirk": rho = r_tetrachor_kirk(a, b, c, d) elif method=="divgi": rho = r_tetrachor_divgi(a, b, c, d) return rho
def sgn(n)
-
Expand source code
def sgn(n): if n == 0: return 0 elif n < 0: return -1 else: return 1