Module stikpetP.correlations.cor_tetrachoric

import numpy as np
from scipy.stats import multivariate_normal
from statistics import NormalDist
import math
import pandas as pd
from ..other.table_cross import tab_cross

def r_tetrachoric(field1, field2, categories1=None, categories2=None, method="divgi"):
    '''
    Tetrachoric Correlation Coefficient
    -----------------------------------
    In essence this attempts to mimic a correlation coefficient between two scale variables. It can be defined as "An estimate of the correlation between two random variables having a bivariate normal distribution, obtained from the information from a double dichotomy of their bivariate distribution" (Everitt, 2004, p. 372).
    
    This assumes the two binary variables have a 'hidden' underlying normal distribution. If so, the two together form a bivariate normal distribution with a specific correlation between them. The task is then to find the correlation such that the bivariate cumulative distribution function, evaluated at the z-values of the two marginal proportions, equals the proportion in the top-left cell (a).
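    
    For illustration, this defining condition can be checked directly with *scipy* (a minimal sketch; the marginal proportions 0.4 and 0.6 and the correlation 0.5 below are placeholder values, not estimates):
    
    >>> z1 = NormalDist().inv_cdf(0.4)    # z-value of a placeholder row proportion
    >>> z2 = NormalDist().inv_cdf(0.6)    # z-value of a placeholder column proportion
    >>> p = multivariate_normal.cdf([z1, z2], cov=[[1, 0.5], [0.5, 1]])    # joint proportion implied by a correlation of 0.5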
    
    This is quite tricky to do, so several closed-form approximations have been proposed. These include Yule r, Pearson Q4 and Q5, Camp, Becker and Clogg, and Bonett and Price.
    
    Besides closed-form approximation formulas, various iterative algorithms have been designed as well. The three most often mentioned are Brown (1977), Kirk (1973), and Divgi (1979), all available in this function.
    
    Parameters
    ----------
    field1 : pandas series
        data with categories for the rows
    field2 : pandas series
        data with categories for the columns
    categories1 : list or dictionary, optional
        the two categories to use from field1. If not set the first two found will be used
    categories2 : list or dictionary, optional
        the two categories to use from field2. If not set the first two found will be used
    method : {"divgi", "search", "brown", "kirk"}, optional
        method to use (see details)
    
    Returns
    -------
    rt : float
        the tetrachoric correlation coefficient
    
    Notes
    -----
    The "search" method does a binary search for rt using the bivariate normal distribution from the *scipy's* function **multivariate_normal()**.
    
    "kirk" will use Kirk (1973) Fortran TET8 procedure, adapted by stikpet
    
    "brown" will use Brown (1977) - Algorithm AS 116
    
    "divgi" will use Divgi (1979) algorithm
    
    Flow charts of these algorithms can be found at https://peterstatistics.com/CrashCourse/3-TwoVarUnpair/BinBin/BinBin-2b-EffectSize.html
    
    A separate function for each method is also available; each requires the four cell values as input.
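
    For example, each helper can be called directly with the four cell counts of a 2x2 table (the counts below are purely hypothetical, for illustration only):
    
    >>> rt = r_tetrachor_divgi(10, 20, 30, 40)    # hypothetical cell counts a, b, c, d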

    Before, After and Alternatives
    -------------------------------
    Before determining the effect size measure, you might want to use:
    * [tab_cross](../other/table_cross.html#tab_cross) to get a cross table and/or 
    * [ts_fisher](../tests/test_fisher.html#ts_fisher) to perform a Fisher exact test

    Alternative effect size measures can be found in the [es_bin_bin](../effect_sizes/eff_size_bin_bin.html#es_bin_bin) documentation.

    Faul et al. (2009) write: “Cohen’s (1988) conventions for correlations in the framework of the bivariate normal model may serve as rough reference points” (p. 1150). This most likely refers to Cohen’s thresholds for the Pearson correlation of 0.10, 0.30 and 0.50. The [th_pearson_r](../other/thumb_pearson_r.html#th_pearson_r) function could be used for this.
    
    References
    ----------
    Brown, M. B. (1977). Algorithm AS 116: The tetrachoric correlation and its asymptotic standard error. *Applied Statistics, 26*(3), 343. https://doi.org/10.2307/2346985
    
    Divgi, D. R. (1979). Calculation of the tetrachoric correlation coefficient. *Psychometrika, 44*(2), 169–172. https://doi.org/10.1007/BF02293968

    Faul, F., Erdfelder, E., Buchner, A., & Lang, A.G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. *Behavior Research Methods, 41*(4), 1149–1160. https://doi.org/10.3758/BRM.41.4.1149
    
    Kirk, D. B. (1973). On the numerical approximation of the bivariate normal (tetrachoric) correlation coefficient. *Psychometrika, 38*(2), 259–268. https://doi.org/10.1007/BF02291118
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    >>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
    >>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> r_tetrachoric(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"])
    -0.2104257288112121
    
    >>> r_tetrachoric(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"], method="brown")
    0.21042572904332063
    
    >>> r_tetrachoric(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"], method="kirk")
    -0.20502692146733825
    
    >>> r_tetrachoric(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"], method="divgi")
    -0.2104257288112121
    
    '''
    # determine sample cross table
    tab = tab_cross(field1, field2, order1=categories1, order2=categories2, percent=None, totals="exclude")
    
    # cell values of sample cross table
    a = tab.iloc[0,0]
    b = tab.iloc[0,1]
    c = tab.iloc[1,0]
    d = tab.iloc[1,1]
    
    
    if method=="search":
        rho = r_tetrachor_iter(a, b, c, d)
    elif method=="brown":
        rho = r_tetrachor_brown(a, b, c, d)
    elif method=="kirk":
        rho = r_tetrachor_kirk(a, b, c, d)
    elif method=="divgi":
        rho = r_tetrachor_divgi(a, b, c, d)
    else:
        raise ValueError("method must be one of 'divgi', 'search', 'brown' or 'kirk'")
    return rho

def r_tetrachor_iter(a, b, c, d, niters=50):
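    # Step-halving search: start at rho = -0.9 with step 0.1; whenever the fit
    # |a/n - bivariate normal cdf at the marginal z-values| gets worse, step back
    # and halve the step size, for niters iterations.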
    
    #The totals
    R1 = a + b
    R2 = c + d
    C1 = a + c
    C2 = b + d
    n = R1 + R2
    
    
    aProp = a/n
    r1Prop = R1/n
    c1Prop = C1/n

    r1Zscore = NormalDist().inv_cdf(r1Prop)
    c1Zscore = NormalDist().inv_cdf(c1Prop)


    step = 0.1

    #Initial start
    rhoBest = -0.9
    rho = rhoBest
    cov = np.array([[1, rho], [rho, 1]])
    p = multivariate_normal.cdf(np.array([r1Zscore, c1Zscore]), cov=cov)
    dif1 = abs(aProp - p)

    #First iteration
    rho = rho + step
    cov = np.array([[1, rho], [rho, 1]])
    p = multivariate_normal.cdf(np.array([r1Zscore, c1Zscore]), cov=cov)
    dif2 = abs(aProp - p)

    #iterations
    for i in range(0, niters):
        if dif2<dif1:
            rhoBest = rho
            dif1 = dif2
        else:
            rho = rho - 2*step
            step = step/2

            cov = np.array([[1, rho], [rho, 1]])
            p = multivariate_normal.cdf(np.array([r1Zscore, c1Zscore]), cov=cov)
            dif1 = abs(aProp - p)

        rho = rho + step
        cov = np.array([[1, rho], [rho, 1]])
        p = multivariate_normal.cdf(np.array([r1Zscore, c1Zscore]), cov=cov)
        dif2 = abs(aProp - p)
    
    return (rho)

def r_tetrachor_brown(a, b, c, d):
    # P. Stikker adaptation of Fortran code from:
    # Brown, M. B. (1977). Algorithm AS 116: The tetrachoric correlation and its asymptotic standard error. Applied Statistics, 26(3), 343. doi: 10.2307/2346985
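    # x and w hold the positive abscissae and weights of a Gauss-Legendre
    # quadrature rule (each point is used at +x and -x, giving 32 nodes).
    # The routine starts from Yule's Y, refines it with the tetrachoric series
    # (Newton steps), and switches to Gaussian quadrature when r exceeds rcut.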
    
    x = [0.9972638618, 0.9856115115, 0.9647622556, 0.9349060759, 
         0.8963211558, 0.8493676137, 0.794483796, 0.7321821187,
         0.6630442669, 0.5877157572, 0.5068999089, 0.4213512761,
         0.3318686023, 0.2392873623, 0.1444719616, 0.0483076657]
    w = [0.00701861, 0.0162743947, 0.0253920653, 0.0342738629,
         0.042835898, 0.0509980593, 0.0586840935, 0.0658222228,
         0.0723457941, 0.0781938958, 0.0833119242, 0.087652093,
         0.0911738787, 0.0938443991, 0.0956387201, 0.096540085]
    sqt2pi = 2.50662827
    rlimit = 0.9999
    rcut = 0.95
    uplim = 5
    const = 1 * 10 ** (-60)
    chalf = 1 * 10 ** (-30)
    conv = 1 * 10 ** (-8)
    citer = 1 * 10 ** (-6)
    niter = 25
    
    # INITIALIZATION
    r = 0
    itype = 0
    
    # CHECK IF ANY CELL FREQUENCY IS NEGATIVE
    if a < 0 or b < 0 or c < 0 or d < 0:
        print("error in input")
        r = 2
    else:
        # CHECK IF ANY FREQUENCY IS ZERO AND SET KDELTA
        kdelta = 1
        delta = 0
        if a == 0 or d == 0:
            kdelta = 2
        if b == 0 or c == 0:
            kdelta = kdelta + 2
        
        # KDELTA=4 MEANS TABLE HAS ZERO ROW OR COLUMN, RUN IS TERMINATED
        # DELTA IS 0.0, 0.5 OR -0.5 ACCORDING TO WHICH CELL IS ZERO
        if kdelta == 4:
            
            # GO TO 92
            print("error row or column total is zero")
            r = 2
        else:
            if kdelta != 1:
                if kdelta == 2:
                    
                    # GOTO 1
                    # 1
                    delta = 0.5
                    if a == 0 and d == 0:
                        r = -1
                    
                    # GOTO 4
                else:
                    if kdelta == 3:
                        
                        # GOTO 2
                        # 2
                        delta = -0.5
                        if b == 0 and c == 0:
                            r = 1
            
            # 4
            if r != 0:
                itype = 3
            
            # STORE FREQUENCIES IN AA, BB, CC AND DD
            aa = a + delta
            bb = b - delta
            cc = c - delta
            dd = d + delta
            tot = aa + bb + cc + dd
            
            # CHECK IF CORREIATION IS NEGATIVE, ZERO, POSITIVE
            # COMPUTE PROBILITIES OF QUADRANT AND OF MARGINALS PRBMAA AND PROBAC CHOSEN SO THAT CORREIATION IS POSITIVE.  KSIGN INDICATES WHETHER QUADRANTS HAVE BEEN SWITCHED
            if aa * dd - bb * cc < 0:
                
                # GO TO 7
                # 7
                probaa = bb / tot
                probac = (bb + dd) / tot
                ksign = 2
            else:
                if aa * dd - bb * cc == 0:
                    
                    # GO TO 5
                    # 5
                    itype = 4
                
                # 6
                probaa = aa / tot
                probac = (aa + cc) / tot
                ksign = 1
                
                # GO TO 8
            # 8
            probab = (aa + bb) / tot
            
            # COMPUTE NORMALDEVIATES FOR THE MARGINAL FREQUENCIES
            # SINCE NO MARGINAL CAN BE ZERO, IE IS NOT CHECKED
            zac = NormalDist().inv_cdf(probac)
            zab = NormalDist().inv_cdf(probab)
            ss = math.exp(-0.5 * (zac ** 2 + zab ** 2)) / (2 * math.pi)
            
            # WHEN R IS 0.0, 1.0 OR -1.0, TRANSFER TO COMUTE SDZERO
            if r != 0 or itype > 0:
                
                # GOTO 85
                skip85 = False
            else:
                
                # WHEN MARGINALS ARE EQUAL, COSINE EVALUATION IS USED
                if a == d and b == c:
                    
                    # GOTO 60
                    # 60
                    rr = -1 * math.cos(2 * math.pi * probaa)
                    itype = 2
                    
                    # GOTO 70 (continue with the SDR computation)
                    skip70 = False
                else:
                    
                    # INITIAL ESTIMATE OF CORRELATION IS YULES Y
                    rr = (math.sqrt(aa * dd) - math.sqrt(bb * cc)) ** 2 / abs(aa * dd - bb * cc)
                    iter = 0
                    
                    # IF RR EXCEEDS RCUT, GAUSSIANQUADRATURE IS USED
                    loop10 = True
                    while loop10:
                        loop10 = False
                        
                        # 10
                        if rr > rcut:
                            
                            # GOTO 40
                            skip40 = False
                        else:
                            
                            # TETRACHORIC SERIES IS COMPUTED
                            # INITIALIALIZATION
                            va = 1
                            vb = zac
                            wa = 1
                            wb = zab
                            term = 1
                            iterm = 0
                            sum = probab * probac
                            deriv = 0
                            sr = ss
                            
                            # 15
                            loop15 = True
                            while loop15:
                                loop15 = False
                                if abs(sr) <= const:
                                    # RESCALE TERMS TO AVOID OVERFLOWS AND UNDERFLOWS
                                    sr = sr / const
                                    va = va * chalf
                                    vb = vb * chalf
                                    wa = wa * chalf
                                    wb = wb * chalf
                                
                                # FORM SUM AND DERIV OF SERIES
                                # 20
                                dr = sr * va * wa
                                sr = sr * rr / term
                                cof = sr * va * wa
                                
                                # ITERM COUNTS NO. OF CONSECUTIVE TERMS .LT. CONV
                                iterm = iterm + 1
                                if abs(cof) > conv:
                                    iterm = 0
                                sum = sum + cof
                                deriv = deriv + dr
                                vaa = va
                                waa = wa
                                va = vb
                                wa = wb
                                vb = zac * va - term * vaa
                                wb = zab * wa - term * waa
                                term = term + 1
                                if iterm < 2 or term < 6:
                                    loop15 = True
                            
                            # CHECK IF ITERATION CONVERGED
                            if abs(sum - probaa) > citer:
                                
                                # GOTO 25
                                # CALCULATE NEXT ESTIMATE OF CORRELATION
                                # 25
                                iter = iter + 1
                                
                                # IF TOO MANY ITERATIONS, RUN IS TERMINATED
                                if iter >= niter:
                                    
                                    # GOTO 93
                                    skip40 = True
                                    skip70 = True
                                    skip85 = False
                                else:
                                    delta = (sum - probaa) / deriv
                                    rrprev = rr
                                    rr = rr - delta
                                    if iter == 1:
                                        rr = rr + 0.5 * delta
                                    if rr > rlimit:
                                        rr = rlimit
                                    if rr < 0:
                                        rr = 0
                                    
                                    # GOTO 10
                                    loop10 = True
                                    skip40 = False
                            else:
                                
                                # ITERATION HAS CONVERGED,SET ITYPE
                                itype = term
                                
                                # GOTO 70
                                skip40 = True
                                skip70 = False
                    if not skip40:
                        
                        # GAUSSIAN QUADRATURE
                        # 40
                        if iter <= 0:
                            # INITIALIZATION IF THIS IS FIRST ITERATION
                            sum = probab * probac
                            rrprev = 0
                        
                        # INITIALIZATION
                        # 41
                        sumprv = probab - sum
                        prob = bb / tot
                        
                        if ksign == 2:
                            prob = aa / tot
                        itype = 1
                        
                        # LOOP TO FIND ESTIMATE OF CORRELATION
                        # COMPUTATION OF INTEGRAL(SUM) BY QUADRATURE
                        loop42 = True
                        while loop42:
                            loop42 = False
                            
                            # 42
                            rrsq = math.sqrt(1 - rr ** 2)
                            amid = 0.5 * (uplim + zac)
                            xlen = uplim - amid
                            sum = 0
                            for iquad in range(1, 16 + 1, 1):
                                xla = amid + x[iquad - 1] * xlen
                                xlb = amid - x[iquad - 1] * xlen
                                
                                # TO AVOID UNDERFLOWS, TEMPA AND TEMPB ARE USED
                                tempa = (zab - rr * xla) / rrsq
                                if tempa >= -6:
                                    sum = sum + w[iquad - 1] * math.exp(-0.5 * xla ** 2) * NormalDist().cdf(tempa)
                                tempb = (zab - rr * xlb) / rrsq
                                if tempb >= -6:
                                    sum = sum + w[iquad - 1] * math.exp(-0.5 * xlb ** 2) * NormalDist().cdf(tempb)
                            
                            # 44
                            sum = sum * xlen / sqt2pi
                            
                            # CHECX IF ITERATION HAS CONVERGED
                            if abs(prob - sum) <= citer:
                                
                                # GOTO 70
                                skip70 = False
                            else:
                                iter = iter + 1
                            
                            # IF TOO MANY ITERATIONS, RUN IS TERMINATED
                            if iter >= niter:
                                
                                # GOTO 93
                                skip70 = True
                                skip85 = False
                            
                            # ESTIMATE CORRELATIONFOR NEXT ITERATION BY LINEAR INTERPOLATION
                            rrest = ((prob - sum) * rrprev - (prob - sumprv) * rr) / (sumprv - sum)
                            
                            # IS ESTIMATE POSITIVE AND LESS THAN UPPER LIMIT
                            if rrest > rlimit:
                                rrest = rlimit
                            if rrest < 0:
                                rrest = 0
                            rrprev = rr
                            rr = rrest
                            sumprv = sum
                            
                            # IF ESTIMATE HAS SAME VALUE ON TWO ITERATIONS, STOP ITERATION
                            if rr == rrprev:
                                
                                # GOTO 70
                                skip70 = False
                            else:
                                
                                # GOTO 42
                                loop42 = True
                if not skip70:
                    
                    # COMPUTE SDR
                    # 70
                    r = rr
                    rrsq = math.sqrt(1 - r ** 2)
                    if kdelta > 1:
                        itype = -itype
                    if ksign != 1:
                        
                        r = -r
                        zac = -zac
                    
                    # 71
                    pdf = math.exp(-0.5 * (zac ** 2 - 2 * r * zac * zab + zab ** 2) / rrsq ** 2) / (2 * math.pi * rrsq)
                    pac = NormalDist().cdf((zac - r * zab) / rrsq) - 0.5
                    pab = NormalDist().cdf((zab - r * zac) / rrsq) - 0.5
                    sdr = (aa + dd) * (bb + cc) / 4 + pab ** 2 * (aa + cc) * (bb + dd) + pac ** 2 * (aa + bb) * (cc + dd) + 2 * pab * pac * (aa * dd - bb * cc) - pab * (aa * bb - cc * dd) - pac * (aa * cc - bb * dd)
                    if sdr < 0:
                        sdr = 0
                    sdr = math.sqrt(sdr) / (tot * pdf * math.sqrt(tot))
                    skip85 = False
            if not skip85:
                
                # 85
                sdzero = math.sqrt((aa + bb) * (aa + cc) * (bb + dd) * (cc + dd) / tot) / (tot ** 2 * ss)
                if r == 0:
                    sdr = 0
    
    return r

def r_tetrachor_kirk(af, bf, cf, df):
    # P. Stikker adaptation of Fortran IV code from:
    # Kirk, D. B. (1973). On the numerical approximation of the bivariate normal (tetrachoric) correlation coefficient. Psychometrika, 38(2), 259–268. doi: 10.1007/BF02291118
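    # The routine first solves for the normal deviates h and k of the two
    # marginal proportions, then for r itself; each is found by Newton-type
    # iterations on a Gauss-Legendre quadrature approximation of the integral,
    # with Hastings' approximation supplying the starting values for h and k.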

    # Since:
    # P = the joint proportion for which both marginal values are < .5
    # We might have to swop the rows and/or columns
    if af + bf > cf + df:
        
        # swop rows
        af, cf = cf, af
        bf, df = df, bf
    if af + cf > bf + df:
        
        # swop columns
        af, bf = bf, af
        cf, df = df, cf
    
    # input parameters of original function
    n = af + bf + cf + df
    p = af / n
    fm1 = (af + bf) / n
    fm2 = (af + cf) / n
    
    # value settings
    c1 = 0.019855071751232
    c2 = 0.101666761293187
    c3 = 0.237233795041835
    c4 = 0.408282678752175
    c5 = 0.591717321247825
    c6 = 0.762766204958165
    c7 = 0.898333238706813
    c8 = 0.980144928248768
    c1sq = 0.000394223874247
    c2sq = 0.010336130351846
    c3sq = 0.056279873509951
    c4sq = 0.166694745769052
    c5sq = 0.350129388264702
    c6sq = 0.581812283426281
    c7sq = 0.807002607765472
    c8sq = 0.960684080371783
    w1 = 0.050614268145188
    w2 = 0.111190517226687
    w3 = 0.156853322938943
    w4 = 0.181341891689181
    rpi = 0.398942280401433
    rtpi = 2.506628274631
    r = 0
    
    # EVALUATION OF H AND K
    j = 1
    
    # CONVERGENCE CRITERION FOR H AND K
    eps = 1 * 10 ** (-5)
    x = fm1
    con = (0.5 - fm1) * rtpi
    
    # GO TO 200
    loop200 = True
    while loop200:
        loop200 = False
        if x > 0.5:
            
            # GO TO 600
            # 600
            r = 4
            
            # GO TO 1000
        else:
            
            # HASTINGS' APPROXIMATION
            # 250
            e = math.sqrt(-2 * math.log(x))
            e1 = (0.010328 * e + 0.802853) * e + 2.515517
            e2 = ((0.001308 * e + 0.189269) * e + 1.432788) * e + 1
            est = e - e1 / e2
            loop300 = True
            while loop300:
                loop300 = False
                
                # 300
                old = est
                
                # ITERATION LOOP
                loop350 = True
                i = 1
                while loop350 and i <= 20:
                    loop350 = False
                    
                    # 350
                    xx = old * old
                    if j == 3 and abs(old) >= 1:
                        
                        # GO TO 550
                        # 550
                        if r != 2:
                            r = 2
                            old = 0.97 * sgn(a)
                            
                            # GO TO 350
                            i = i + 1
                            loop350 = True
                    else:
                        if j == 3:
                            
                            # GO TO 400
                            # 400
                            # R INTEGRAND EVALUATION
                            fnum = old * (w1 * (hEfn2(h2k2, hk2, x, xx, c1, c1sq) + hEfn2(h2k2, hk2, x, xx, c8, c8sq)) + w2 * (hEfn2(h2k2, hk2, x, xx, c2, c2sq) + hEfn2(h2k2, hk2, x, xx, c7, c7sq)) + w3 * (hEfn2(h2k2, hk2, x, xx, c3, c3sq) + hEfn2(h2k2, hk2, x, xx, c6, c6sq)) + w4 * (hEfn2(h2k2, hk2, x, xx, c4, c4sq) + hEfn2(h2k2, hk2, x, xx, c5, c5sq)))
                            fnew = old - (fnum - con) / hEfn2(h2k2, hk2, x, xx, 1, 1)
                        else:
                            
                            # H AND K INTEGRAND EVALUATION FOR GAUSSIAN QUADRATURE
                            fnum = old * (w1 * (hEfn1(xx, c1sq) + hEfn1(xx, c8sq)) + w2 * (hEfn1(xx, c2sq) + hEfn1(xx, c7sq)) + w3 * (hEfn1(xx, c3sq) + hEfn1(xx, c6sq)) + w4 * (hEfn1(xx, c4sq) + hEfn1(xx, c5sq)))
                            fnew = old - (fnum - con) / hEfn1(xx, 1)
                            
                            # GO TO 450
                        # TEST FOR CONVERGENCE
                        # 450
                        if abs(old - fnew) > eps:
                            
                            # GO TO 500
                            # 500
                            old = fnew
                            i = i + 1
                            if i > 20:
                                r = 3
                                
                                # GO TO 1000
                            else:
                                loop350 = True
                        else:
                            if j == 1:
                                
                                # GO TO 100
                                # 100
                                j = 2
                                h = fnew
                                h2 = h * h
                                zh = rpi * math.exp(-1 * h2 / 2)
                                x = fm2
                                con = (0.5 - fm2) * rtpi
                                
                                # GO TO 200
                                loop200 = True
                            else:
                                if j == 2:
                                    
                                    # GO TO 150
                                    # 150
                                    fk = fnew
                                    fk2 = fk * fk
                                    h2k2 = h2 + fk2
                                    hk2 = h * fk * 2
                                    zk = rpi * math.exp(-1 * fk2 / 2)
                                    j = 3
                                    
                                    # STARTING ESTIMATE FOR R
                                    a = p - fm1 * fm2
                                    con = a * 6.28318530717959
                                    zhk = zh * zk
                                    est = 2 * (math.sqrt(abs(hk2 * a / zhk + 1)) - 1) / hk2
                                    if abs(hk2) <= 1 * 10 ** (-8):
                                        est = a / zhk
                                    if abs(est) > 0.8:
                                        est = 0.8 * sgn(a)
                                    
                                    # CONVERGENCE CRITERION FOR R
                                    eps = 1 * 10 ** (-4)
                                    
                                    # GO TO 300
                                    loop300 = True
                                else:
                                    
                                    # GO TO 650
                                    r = fnew
                                    
                                    # GO TO 1000
    # 200
    # 1000
    rt = r
    
    return rt

def hEfn1(xx, w):
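    # integrand helper for Kirk's quadrature when solving for h and k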
    f = math.exp(-1 * xx * w / 2)    
    return f

def hEfn2(h2k2, hk2, x, xx, v, w):
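    # integrand helper for Kirk's quadrature when solving for r (j == 3)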
    f = math.exp((-1 * h2k2 + hk2 * v * x) / (2 * (1 - w * xx))) / math.sqrt(1 - w * xx)    
    return f

def sgn(n):
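    # sign of n: -1, 0 or +1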
    if n == 0: return 0
    elif n < 0: return -1
    else: return 1

def r_tetrachor_divgi(a, b, c, d, iter=10):
    # Divgi Algorithm
    # Divgi, D. R. (1979). Calculation of the tetrachoric correlation coefficient. Psychometrika, 44(2), 169–172. doi: 10.1007/BF02293968
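    # Starting estimate from Divgi's closed-form approximation based on the odds
    # ratio, followed by Newton-Raphson refinement: l is the bivariate normal cdf
    # at the marginal z-values (h, k) and ld the corresponding bivariate density.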

    r1 = a + b
    c1 = a + c
    n = r1 + c + d
    p1 = r1 / n
    p2 = c1 / n
    h = NormalDist().inv_cdf(p1)
    k = NormalDist().inv_cdf(p2)
    if abs(h) > abs(k):
        hAdj = abs(h)
        kAdj = abs(k)
    else:
        hAdj = abs(k)
        kAdj = abs(h)
    odRat = (a * d) / (b * c)
    dA = 0.5 / (1 + (hAdj ** 2 + kAdj ** 2) * (0.12454 - 0.27102 * (1 - hAdj / math.sqrt(hAdj ** 2 + kAdj ** 2))))
    dB = 0.5 / (1 + (hAdj ** 2 + kAdj ** 2) * (0.82281 - 1.03514 * kAdj / math.sqrt(hAdj ** 2 + kAdj ** 2)))
    dC = 0.07557 * hAdj + (hAdj - kAdj) ** 2 * (0.51141 / (hAdj + 2.05793) - 0.07557 / hAdj)
    dD = kAdj * (0.79289 + 4.28981 / (1 + 3.30231 * hAdj))
    alp = dA + dB * (-1 + 1 / (1 + dC * (math.log(odRat) - dD) ** 2))
    rt = math.cos(math.pi / (1 + odRat ** alp))
    for i in range(1, iter + 1, 1):
        cov = np.array([[1, rt], [rt, 1]])
        l = multivariate_normal.cdf(np.array([h, k]), cov=cov)
        ld = math.exp(-(h ** 2 - 2 * rt * h * k + k ** 2) / (2 * (1 - rt ** 2))) / (2 * math.pi * math.sqrt(1 - rt ** 2))
        rt = rt - (l - float(a) / n) / ld
    rt = rt * sgn(h) * sgn(k)
    
    return rt

Functions

def hEfn1(xx, w)
def hEfn2(h2k2, hk2, x, xx, v, w)
def r_tetrachor_brown(a, b, c, d)
def r_tetrachor_divgi(a, b, c, d, iter=10)
def r_tetrachor_iter(a, b, c, d, niters=50)
def r_tetrachor_kirk(af, bf, cf, df)
def r_tetrachoric(field1, field2, categories1=None, categories2=None, method='divgi')
def sgn(n)