Module `stikpetP.tests.test_powerdivergence_gof`

Expand source code

import pandas as pd
import numpy as np
from scipy.stats import chi2

def ts_powerdivergence_gof(data, expCounts=None, cc=None, lambd=2/3):
    '''
    Power Divergence GoF Test
    -------------------------
    
    A test that can be used with a single nominal variable, to test if the probabilities in all the categories are equal (the null hypothesis).
    
    There are quite a few tests that can do this. Perhaps the most commonly used is the Pearson chi-square test ( $\\chi^2$ ), but also an exact multinomial, G-test ( $G^2$ ), Freeman-Tukey ( $T^2$ ), Neyman ( $NM^2$ ), Mod-Log Likelihood ( $GM^2$ ), and Freeman-Tukey-Read test are possible.
    
    Cressie and Read (1984, p. 463) noticed how the $\\chi^2$, $G^2$, $T^2$, $NM^2$ and $GM^2$ can all be captured with one general formula. The additional variable lambda ( $\\lambda$ ) was then investigated, and they settled on a $\\lambda$ of 2/3.
    
    By setting $\\lambda$ to different values, we get the different tests:
    
    * $\\lambda = 1$, Pearson chi-square
    * $\\lambda = 0$, G/Wilks/Likelihood-Ratio
    * $\\lambda = -\\frac{1}{2}$, Freeman-Tukey
    * $\\lambda = -1$, Mod-Log-Likelihood
    * $\\lambda = -2$, Neyman
    * $\\lambda = \\frac{2}{3}$, Cressie-Read

    This function is shown in this [YouTube video](https://youtu.be/bvlDT3LdC8Y) and the test is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Tests/PowerDivergence.html)
    
    Parameters
    ----------
    data :  list or pandas data series 
        the data
        
    expCount : pandas dataframe, optional 
        the categories and expected counts
        
    cc : {None, "yates", "yates2", "pearson", "williams"}, optional 
        which continuity correction to use. Default is None
        
    lambd : {"cressie-read", float, "g", "mod-log", "freeman-tukey", "neyman", "pearson"}, optional 
        either name of test or specific value. Default is "cressie-read" i.e. lambda of 2/3
    
    Returns
    -------
    pandas.DataFrame
        A dataframe with the following columns:
    
        * *n*, the sample size
        * *k*, the number of categories
        * *statistic*, the test statistic (chi-square value)
        * *df*, degrees of freedom
        * *p-value*, significance (p-value)
        * *minExp*, the minimum expected count
        * *percBelow5*, the percentage of categories with an expected count below 5
        * *test used*, description of the test used
    
    Notes
    -----
    The formula used is (Cressie & Read, 1984, p. 442):
    $$\\chi_{C}^{2} = \\begin{cases} 2\\times\\sum_{i=1}^{k}F_{i}\\times ln\\left(\\frac{F_{i}}{E_{i}}\\right) & \\text{ if } \\lambda=0 \\\\ 2\\times\\sum_{i=1}^{k} E_{i}\\times ln\\left(\\frac{E_{i}}{F_{i}}\\right) & \\text{ if } \\lambda=-1 \\\\ \\frac{2}{\\lambda\\times\\left(\\lambda + 1\\right)} \\times \\sum_{i=1}^{k} F_{i}\\times\\left(\\left(\\frac{F_{i}}{E_{i}}\\right)^{\\lambda} - 1\\right) & \\text{ else } \\end{cases}$$
    $$df = k-1$$
    $$sig. = 1 - \\chi^2\\left(\\chi_{C}^{2},df\\right)$$
    
    With:
    $$n = \\sum_{i=1}^r \\sum_{j=1}^c F_{i,j}$$
    $$E_{i} = \\frac{n}{k}$$
    
    *Symbols used:*
    
    * $k$ the number of categories
    * $F_{i}$ the observed count of category i
    * $E_{i}$ the expected count of category i
    * $n$ the sum of all counts
    * $\\chi^2\\left(\\dots\\right)$ the chi-square cumulative density function
    
    Cressie and Read (1984, p. 463) suggest to use $\\lambda = \\frac{2}{3}$,  which is therefor the default in this function.
    
    The **Pearson chi-square statistic** can be obtained by setting $\\lambda = 1$. 
    
    The **Freeman-Tukey test** will be same as setting lambda to $-\\frac{1}{2}$. 
    
    **Neyman test** formula will be same as setting lambda to $-2$.
    
    The Yates continuity correction (cc="yates") is calculated using (Yates, 1934, p. 222):
    $$F_i^\\ast  = \\begin{cases} F_i - 0.5 & \\text{ if } F_i > E_i \\\\ F_i + 0.5 & \\text{ if } F_i < E_i \\\\ F_i & \\text{ if } F_i = E_i \\end{cases}$$

    In some cases the Yates correction is slightly changed to (yates2) (Allen, 1990, p. 523):
    $$F_i^\\ast  = \\begin{cases} F_i - 0.5 & \\text{ if } F_i - 0.5 > E_i \\\\ F_i + 0.5 & \\text{ if } F_i + 0.5 < E_i \\\\ F_i & \\text{ else } \\end{cases}$$
    
    Note that the Yates correction is usually only considered if there are only two categories. Some also argue this correction is too conservative (see for details Haviland (1990)).
    
    The Pearson correction (cc="pearson") is calculated using (E.S. Pearson, 1947, p. 157):
    $$\\chi_{adj}^2 = \\chi_{C}^{2}\\times\\frac{n - 1}{n}$$
    
    The Williams correction (cc="williams") is calculated using (Williams, 1976, p. 36):
    $$\\chi_{adj}^2 = \\frac{\\chi_{C}^2}{q}$$
    
    With:
    $$q = 1 + \\frac{k^2 - 1}{6\\times n\\times df}$$
    
    The formula is also used by McDonald (2014, p. 87)
    
    Before, After and Alternatives
    ------------------------------
    Before this an impression using a frequency table or a visualisation might be helpful:
    * [tab_frequency](../other/table_frequency.html#tab_frequency)
    * [vi_bar_simple](../visualisations/vis_bar_simple.html#vi_bar_simple) for Simple Bar Chart
    * [vi_cleveland_dot_plot](../visualisations/vis_cleveland_dot_plot.html#vi_cleveland_dot_plot) for Cleveland Dot Plot
    * [vi_dot_plot](../visualisations/vis_dot_plot.html#vi_dot_plot) for Dot Plot
    * [vi_pareto_chart](../visualisations/vis_pareto_chart.html#vi_pareto_chart) for Pareto Chart
    * [vi_pie](../visualisations/vis_pie.html#vi_pie) for Pie Chart
    
    After this you might an effect size measure:
    * [es_cohen_w](../effect_sizes/eff_size_cohen_w.html#es_cohen_w) for Cohen w
    * [es_cramer_v_gof](../effect_sizes/eff_size_cramer_v_gof.html#es_cramer_v_gof) for Cramer's V for Goodness-of-Fit
    * [es_fei](../effect_sizes/eff_size_fei.html#es_fei) for Fei
    * [es_jbm_e](../effect_sizes/eff_size_jbm_e.html#es_jbm_e) for Johnston-Berry-Mielke E

    or perform a post-hoc test:
    * [ph_pairwise_bin](../other/poho_pairwise_bin.html#ph_pairwise_bin) for Pairwise Binary Test
    * [ph_pairwise_gof](../other/poho_pairwise_gof.html#ph_pairwise_gof) for Pairwise Goodness-of-Fit Tests
    * [ph_residual_gof_bin](../other/poho_residual_gof_bin.html#ph_residual_gof_bin) for Residuals Tests
    * [ph_residual_gof_gof](../other/poho_residual_gof_gof.html#ph_residual_gof_gof) for Residuals Using Goodness-of-Fit Tests

    Alternative tests:
    * [ts_pearson_gof](../tests/test_pearson_gof.html#ts_pearson_gof) for Pearson Chi-Square Goodness-of-Fit Test
    * [ts_freeman_tukey_gof](../tests/test_freeman_tukey_gof.html#ts_freeman_tukey_gof) for Freeman-Tukey Test of Goodness-of-Fit
    * [ts_freeman_tukey_read](../tests/test_freeman_tukey_read.html#ts_freeman_tukey_read) for Freeman-Tukey-Read Test of Goodness-of-Fit
    * [ts_g_gof](../tests/test_g_gof.html#ts_g_gof) for G (Likelihood Ratio) Goodness-of-Fit Test
    * [ts_mod_log_likelihood_gof](../tests/test_mod_log_likelihood_gof.html#ts_mod_log_likelihood_gof) for Mod-Log Likelihood Test of Goodness-of-Fit
    * [ts_multinomial_gof](../tests/test_multinomial_gof.html#ts_multinomial_gof) for Multinomial Goodness-of-Fit Test
    * [ts_neyman_gof](../tests/test_neyman_gof.html#ts_neyman_gof) for Neyman Test of Goodness-of-Fit
    
    References 
    ----------
    Allen, A. O. (1990). *Probability, statistics, and queueing theory with computer science applications* (2nd ed.). Academic Press.
    
    Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (2007). *Discrete multivariate analysis*. Springer.
    
    Cressie, N., & Read, T. R. C. (1984). Multinomial goodness-of-fit tests. *Journal of the Royal Statistical Society: Series B (Methodological), 46*(3), 440–464. doi:10.1111/j.2517-6161.1984.tb01318.x
    
    Freeman, M. F., & Tukey, J. W. (1950). Transformations related to the angular and the square root. *The Annals of Mathematical Statistics, 21*(4), 607–611. doi:10.1214/aoms/1177729756
    
    Haviland, M. G. (1990). Yates’s correction for continuity and the analysis of 2 × 2 contingency tables. *Statistics in Medicine, 9*(4), 363–367. doi:10.1002/sim.4780090403
    
    Neyman, J. (1949). Contribution to the theory of the chi-square test. Berkeley Symposium on Math. Stat, and Prob, 239–273. doi:10.1525/9780520327016-030
    
    Pearson, E. S. (1947). The choice of statistical tests illustrated on the Interpretation of data classed in a 2 × 2 table. *Biometrika, 34*(1/2), 139–167. doi:10.2307/2332518
    
    Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. *Philosophical Magazine Series 5, 50*(302), 157–175. doi:10.1080/14786440009463897
    
    Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. *The Annals of Mathematical Statistics, 9*(1), 60–62. doi:10.1214/aoms/1177732360
    
    Williams, D. A. (1976). Improved likelihood ratio tests for complete contingency tables. *Biometrika, 63*(1), 33–37. doi:10.2307/2335081
    
    Yates, F. (1934). Contingency tables involving small numbers and the chi square test. *Supplement to the Journal of the Royal Statistical Society, 1*(2), 217–235. doi:10.2307/2983604
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    ---------
    >>> pd.set_option('display.width',1000)
    >>> pd.set_option('display.max_columns', 1000)
    
    Example 1: pandas series
    >>> df1 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv', sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> ex1 = df1['mar1']
    >>> ts_powerdivergence_gof(ex1)
          n  k    statistic  df        p-value  minExp  percBelow5                          test used
    0  1941  5  1187.351716   4  8.792881e-256   388.2         0.0  Cressie-Read goodness-of-fit test
    
    Example 2: pandas series with various settings
    >>> ex2 = df1['mar1']
    >>> eCounts = pd.DataFrame({'category' : ["MARRIED", "DIVORCED", "NEVER MARRIED", "SEPARATED"], 'count' : [5,5,5,5]})
    >>> ts_powerdivergence_gof(ex2, expCounts=eCounts)
          n  k   statistic  df        p-value  minExp  percBelow5                          test used
    0  1760  4  957.932468   3  2.402524e-207   440.0         0.0  Cressie-Read goodness-of-fit test
    >>> ts_powerdivergence_gof(ex2, expCounts=eCounts, cc="pearson")
          n  k   statistic  df        p-value  minExp  percBelow5                                                  test used
    0  1760  4  957.388188   3  3.153068e-207   440.0         0.0  Cressie-Read goodness-of-fit test, and Pearson correction
    >>> ts_powerdivergence_gof(ex2, expCounts=eCounts, cc="williams")
          n  k   statistic  df        p-value  minExp  percBelow5                                                   test used
    0  1760  4  957.479116   3  3.013070e-207   440.0         0.0  Cressie-Read goodness-of-fit test, and Williams correction
    
    Example 3: a list
    >>> ex3 = ["MARRIED", "DIVORCED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "NEVER MARRIED", "MARRIED", "MARRIED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "MARRIED"]
    >>> ts_powerdivergence_gof(ex3)
        n  k  statistic  df   p-value  minExp  percBelow5                          test used
    0  19  4   3.180061   3  0.364688    4.75       100.0  Cressie-Read goodness-of-fit test
    
    '''
    #Set correction factor to 1 (no correction)
    corFactor = 1
    
    #Test Used
    if lambd == 2/3 or lambd == "cressie-read":
        lambd = 2/3
        testUsed = "Cressie-Read goodness-of-fit test"
    
    elif lambd==0 or lambd == "g":
        lambd=0
        testUsed = "likelihood-ratio goodness-of-fit test"
        
    elif lambd==-1 or lambd == "mod-log":
        lambd=-1
        testUsed = "mod-log likelihood ratio goodness-of-fit test"
        
    elif lambd==1 or lambd=="pearson":
        lambd=1
        testUsed = "Pearson goodness-of-fit test"
    
    elif lambd==-0.5 or lambd=="freeman-tukey":
        lambd=-0.5
        testUsed = "Freeman-Tukey goodness-of-fit test"
    elif lambd==-2 or lambd=="neyman":
        lambd=-2
        testUsed = "Neyman goodness-of-fit test"
    else:
        testUsed = "power divergence goodness-of-fit test with lambda = " + str(lambd)
        
        
    if type(data) == list:
        data = pd.Series(data)
        
    #Set correction factor to 1 (no correction)
    corFactor = 1    
    
    #The test itself        
    freqs = data.value_counts()
    k = len(freqs)

    #Determine expected counts if not provided
    if expCounts is None:
        expCounts = [sum(freqs)/len(freqs)]* k
        expCounts = pd.Series(expCounts, index=list(freqs.index.values))
    
    else:
        #if expected counts are provided
        ne = 0
        k = len(expCounts)
        #determine sample size of expected counts
        for i in range(0,k):
            ne = ne + expCounts.iloc[i,1]

        #remove categories not provided from observed counts
        for i in freqs.index:
            if i not in list(expCounts.iloc[:,0]):
                freqs = freqs.drop(i)
        # and sort based on the index
        freqs = freqs.sort_index()

        #set the column names
        expCounts.columns = ["category", "count"]
        #sort the expected counts
        expCounts.sort_values(by="category", inplace=True)
        #adjust based on observed count total
        expCounts['count'] = expCounts['count'].astype('float64')
        n = sum(freqs)
        for i in range(0,k):
            expCounts.at[i, 'count'] = float(expCounts.at[i, 'count'] / ne * n)
        
        expCounts = pd.Series(expCounts.iloc[:, 1])

    n = sum(freqs)
    df = k - 1

    #set williams correction factor
    if cc=="williams":
        corFactor = 1/(1 + (k**2 - 1)/(6*n*df))
        testUsed = testUsed + ", and Williams correction"
    
    #adjust frequencies if Yates correction is requested
    if cc=="yates":
        k = len(freqs)
        adjFreq = list(freqs).copy()
        for i in range(0, k):
            if adjFreq[i] > expCounts.iloc[i]:
                adjFreq[i] = adjFreq[i] - 0.5
            elif adjFreq[i] < expCounts.iloc[i]:
                adjFreq[i] = adjFreq[i] + 0.5

        freqs = pd.Series(adjFreq, index=list(freqs.index.values))
        testUsed = testUsed + ", and Yates correction"

    if cc=="yates2":
        k = len(freqs)
        adjFreq = list(freqs).copy()
        for i in range(0, k):
            if adjFreq[i] - 0.5 > expCounts.iloc[i]:
                adjFreq[i] = adjFreq[i] - 0.5
            elif adjFreq[i] + 0.5 < expCounts.iloc[i]:
                adjFreq[i] = adjFreq[i] + 0.5

        freqs = pd.Series(adjFreq, index=list(freqs.index.values))
        testUsed = testUsed + ", and Yates correction"
        
    #determine the test statistic
    
    freqs=list(freqs)
    if lambd==0:
        ts = 2*sum(freqs*np.log(freqs/expCounts))
    elif lambd==-1:
        ts = 2*sum(expCounts*np.log(expCounts/freqs))
    else:
        ts = 2*sum(freqs*((freqs/expCounts)**(lambd) - 1))/(lambd*(lambd + 1))
    
    #set E.S. Pearson correction
    if cc=="pearson":
        corFactor = (n - 1)/n
        testUsed = testUsed + ", and Pearson correction"
    
    #Adjust test statistic
    ts = ts*corFactor
    
    #Determine p-value
    pVal = chi2.sf(ts, df)
    
    #Check minimum expected counts
    #Cells with expected count less than 5
    nbelow = len([x for x in expCounts if x < 5])
    #Number of cells
    ncells = len(expCounts)
    #As proportion
    pBelow = nbelow/ncells
    #the minimum expected count
    minExp = min(expCounts)
    
    #prepare results
    testResults = pd.DataFrame([[n, k, ts, df, pVal, minExp, pBelow*100, testUsed]], columns=["n", "k","statistic", "df", "p-value", "minExp", "percBelow5", "test used"])        
    pd.set_option('display.max_colwidth', None)
    
    return testResults

Functions

def ts_powerdivergence_gof(data, expCounts=None, cc=None, lambd=0.6666666666666666)

Power Divergence Gof Test

A test that can be used with a single nominal variable, to test if the probabilities in all the categories are equal (the null hypothesis).

There are quite a few tests that can do this. Perhaps the most commonly used is the Pearson chi-square test ( $\chi^2$ ), but also an exact multinomial, G-test ( $G^2$ ), Freeman-Tukey ( $T^2$ ), Neyman ( $NM^2$ ), Mod-Log Likelihood ( $GM^2$ ), and Freeman-Tukey-Read test are possible.

Cressie and Read (1984, p. 463) noticed how the $\chi^2$, $G^2$, $T^2$, $NM^2$ and $GM^2$ can all be captured with one general formula. The additional variable lambda ( $\lambda$ ) was then investigated, and they settled on a $\lambda$ of 2/3.

By setting $\lambda$ to different values, we get the different tests:

$\lambda = 1$, Pearson chi-square
$\lambda = 0$, G/Wilks/Likelihood-Ratio
$\lambda = -\frac{1}{2}$, Freeman-Tukey
$\lambda = -1$, Mod-Log-Likelihood
$\lambda = -2$, Neyman
$\lambda = \frac{2}{3}$, Cressie-Read

This function is shown in this YouTube video and the test is also described at PeterStatistics.com

Parameters

data : list or pandas data series: the data
expCount : pandas dataframe, optional: the categories and expected counts
cc : {None, "yates", "yates2", "pearson", "williams"}, optional: which continuity correction to use. Default is None
lambd : {"cressie-read", float, "g", "mod-log", "freeman-tukey", "neyman", "pearson"}, optional: either name of test or specific value. Default is "cressie-read" i.e. lambda of 2/3

Returns

pandas.DataFrame

A dataframe with the following columns:

n, the sample size
k, the number of categories
statistic, the test statistic (chi-square value)
df, degrees of freedom
p-value, significance (p-value)
minExp, the minimum expected count
percBelow5, the percentage of categories with an expected count below 5
test used, description of the test used

Notes

The formula used is (Cressie & Read, 1984, p. 442): $\chi_{C}^{2} = \begin{cases} 2\times\sum_{i=1}^{k}F_{i}\times ln\left(\frac{F_{i}}{E_{i}}\right) & \text{ if } \lambda=0 \\ 2\times\sum_{i=1}^{k} E_{i}\times ln\left(\frac{E_{i}}{F_{i}}\right) & \text{ if } \lambda=-1 \\ \frac{2}{\lambda\times\left(\lambda + 1\right)} \times \sum_{i=1}^{k} F_{i}\times\left(\left(\frac{F_{i}}{E_{i}}\right)^{\lambda} - 1\right) & \text{ else } \end{cases}$ $df = k-1$ $sig. = 1 - \chi^2\left(\chi_{C}^{2},df\right)$

With: $n = \sum_{i=1}^r \sum_{j=1}^c F_{i,j}$ $E_{i} = \frac{n}{k}$

Symbols used:

$k$ the number of categories
$F_{i}$ the observed count of category i
$E_{i}$ the expected count of category i
$n$ the sum of all counts
$\chi^2\left(\dots\right)$ the chi-square cumulative density function

Cressie and Read (1984, p. 463) suggest to use $\lambda = \frac{2}{3}$, which is therefor the default in this function.

The Pearson chi-square statistic can be obtained by setting $\lambda = 1$.

The Freeman-Tukey test will be same as setting lambda to $-\frac{1}{2}$.

Neyman test formula will be same as setting lambda to $-2$.

The Yates continuity correction (cc="yates") is calculated using (Yates, 1934, p. 222): $F_i^\ast = \begin{cases} F_i - 0.5 & \text{ if } F_i > E_i \\ F_i + 0.5 & \text{ if } F_i < E_i \\ F_i & \text{ if } F_i = E_i \end{cases}$

In some cases the Yates correction is slightly changed to (yates2) (Allen, 1990, p. 523): $F_i^\ast = \begin{cases} F_i - 0.5 & \text{ if } F_i - 0.5 > E_i \\ F_i + 0.5 & \text{ if } F_i + 0.5 < E_i \\ F_i & \text{ else } \end{cases}$

Note that the Yates correction is usually only considered if there are only two categories. Some also argue this correction is too conservative (see for details Haviland (1990)).

The Pearson correction (cc="pearson") is calculated using (E.S. Pearson, 1947, p. 157): $\chi_{adj}^2 = \chi_{C}^{2}\times\frac{n - 1}{n}$

The Williams correction (cc="williams") is calculated using (Williams, 1976, p. 36): $\chi_{adj}^2 = \frac{\chi_{C}^2}{q}$

With: $q = 1 + \frac{k^2 - 1}{6\times n\times df}$

The formula is also used by McDonald (2014, p. 87)

Before, After and Alternatives

Before this an impression using a frequency table or a visualisation might be helpful: * tab_frequency * vi_bar_simple for Simple Bar Chart * vi_cleveland_dot_plot for Cleveland Dot Plot * vi_dot_plot for Dot Plot * vi_pareto_chart for Pareto Chart * vi_pie for Pie Chart

After this you might an effect size measure: * es_cohen_w for Cohen w * es_cramer_v_gof for Cramer's V for Goodness-of-Fit * es_fei for Fei * es_jbm_e for Johnston-Berry-Mielke E

or perform a post-hoc test: * ph_pairwise_bin for Pairwise Binary Test * ph_pairwise_gof for Pairwise Goodness-of-Fit Tests * ph_residual_gof_bin for Residuals Tests * ph_residual_gof_gof for Residuals Using Goodness-of-Fit Tests

Alternative tests: * ts_pearson_gof for Pearson Chi-Square Goodness-of-Fit Test * ts_freeman_tukey_gof for Freeman-Tukey Test of Goodness-of-Fit * ts_freeman_tukey_read for Freeman-Tukey-Read Test of Goodness-of-Fit * ts_g_gof for G (Likelihood Ratio) Goodness-of-Fit Test * ts_mod_log_likelihood_gof for Mod-Log Likelihood Test of Goodness-of-Fit * ts_multinomial_gof for Multinomial Goodness-of-Fit Test * ts_neyman_gof for Neyman Test of Goodness-of-Fit

References

Allen, A. O. (1990). Probability, statistics, and queueing theory with computer science applications (2nd ed.). Academic Press.

Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (2007). Discrete multivariate analysis. Springer.

Cressie, N., & Read, T. R. C. (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society: Series B (Methodological), 46(3), 440–464. doi:10.1111/j.2517-6161.1984.tb01318.x

Freeman, M. F., & Tukey, J. W. (1950). Transformations related to the angular and the square root. The Annals of Mathematical Statistics, 21(4), 607–611. doi:10.1214/aoms/1177729756

Haviland, M. G. (1990). Yates’s correction for continuity and the analysis of 2 × 2 contingency tables. Statistics in Medicine, 9(4), 363–367. doi:10.1002/sim.4780090403

Neyman, J. (1949). Contribution to the theory of the chi-square test. Berkeley Symposium on Math. Stat, and Prob, 239–273. doi:10.1525/9780520327016-030

Pearson, E. S. (1947). The choice of statistical tests illustrated on the Interpretation of data classed in a 2 × 2 table. Biometrika, 34(1/2), 139–167. doi:10.2307/2332518

Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine Series 5, 50(302), 157–175. doi:10.1080/14786440009463897

Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics, 9(1), 60–62. doi:10.1214/aoms/1177732360

Williams, D. A. (1976). Improved likelihood ratio tests for complete contingency tables. Biometrika, 63(1), 33–37. doi:10.2307/2335081

Yates, F. (1934). Contingency tables involving small numbers and the chi square test. Supplement to the Journal of the Royal Statistical Society, 1(2), 217–235. doi:10.2307/2983604

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples

>>> pd.set_option('display.width',1000)
>>> pd.set_option('display.max_columns', 1000)

Example 1: pandas series

>>> df1 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv', sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df1['mar1']
>>> ts_powerdivergence_gof(ex1)
      n  k    statistic  df        p-value  minExp  percBelow5                          test used
0  1941  5  1187.351716   4  8.792881e-256   388.2         0.0  Cressie-Read goodness-of-fit test

Example 2: pandas series with various settings

>>> ex2 = df1['mar1']
>>> eCounts = pd.DataFrame({'category' : ["MARRIED", "DIVORCED", "NEVER MARRIED", "SEPARATED"], 'count' : [5,5,5,5]})
>>> ts_powerdivergence_gof(ex2, expCounts=eCounts)
      n  k   statistic  df        p-value  minExp  percBelow5                          test used
0  1760  4  957.932468   3  2.402524e-207   440.0         0.0  Cressie-Read goodness-of-fit test
>>> ts_powerdivergence_gof(ex2, expCounts=eCounts, cc="pearson")
      n  k   statistic  df        p-value  minExp  percBelow5                                                  test used
0  1760  4  957.388188   3  3.153068e-207   440.0         0.0  Cressie-Read goodness-of-fit test, and Pearson correction
>>> ts_powerdivergence_gof(ex2, expCounts=eCounts, cc="williams")
      n  k   statistic  df        p-value  minExp  percBelow5                                                   test used
0  1760  4  957.479116   3  3.013070e-207   440.0         0.0  Cressie-Read goodness-of-fit test, and Williams correction

Example 3: a list

>>> ex3 = ["MARRIED", "DIVORCED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "NEVER MARRIED", "MARRIED", "MARRIED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "MARRIED"]
>>> ts_powerdivergence_gof(ex3)
    n  k  statistic  df   p-value  minExp  percBelow5                          test used
0  19  4   3.180061   3  0.364688    4.75       100.0  Cressie-Read goodness-of-fit test

Expand source code

def ts_powerdivergence_gof(data, expCounts=None, cc=None, lambd=2/3):
    '''
    Power Divergence GoF Test
    -------------------------
    
    A test that can be used with a single nominal variable, to test if the probabilities in all the categories are equal (the null hypothesis).
    
    There are quite a few tests that can do this. Perhaps the most commonly used is the Pearson chi-square test ( $\\chi^2$ ), but also an exact multinomial, G-test ( $G^2$ ), Freeman-Tukey ( $T^2$ ), Neyman ( $NM^2$ ), Mod-Log Likelihood ( $GM^2$ ), and Freeman-Tukey-Read test are possible.
    
    Cressie and Read (1984, p. 463) noticed how the $\\chi^2$, $G^2$, $T^2$, $NM^2$ and $GM^2$ can all be captured with one general formula. The additional variable lambda ( $\\lambda$ ) was then investigated, and they settled on a $\\lambda$ of 2/3.
    
    By setting $\\lambda$ to different values, we get the different tests:
    
    * $\\lambda = 1$, Pearson chi-square
    * $\\lambda = 0$, G/Wilks/Likelihood-Ratio
    * $\\lambda = -\\frac{1}{2}$, Freeman-Tukey
    * $\\lambda = -1$, Mod-Log-Likelihood
    * $\\lambda = -2$, Neyman
    * $\\lambda = \\frac{2}{3}$, Cressie-Read

    This function is shown in this [YouTube video](https://youtu.be/bvlDT3LdC8Y) and the test is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Tests/PowerDivergence.html)
    
    Parameters
    ----------
    data :  list or pandas data series 
        the data
        
    expCount : pandas dataframe, optional 
        the categories and expected counts
        
    cc : {None, "yates", "yates2", "pearson", "williams"}, optional 
        which continuity correction to use. Default is None
        
    lambd : {"cressie-read", float, "g", "mod-log", "freeman-tukey", "neyman", "pearson"}, optional 
        either name of test or specific value. Default is "cressie-read" i.e. lambda of 2/3
    
    Returns
    -------
    pandas.DataFrame
        A dataframe with the following columns:
    
        * *n*, the sample size
        * *k*, the number of categories
        * *statistic*, the test statistic (chi-square value)
        * *df*, degrees of freedom
        * *p-value*, significance (p-value)
        * *minExp*, the minimum expected count
        * *percBelow5*, the percentage of categories with an expected count below 5
        * *test used*, description of the test used
    
    Notes
    -----
    The formula used is (Cressie & Read, 1984, p. 442):
    $$\\chi_{C}^{2} = \\begin{cases} 2\\times\\sum_{i=1}^{k}F_{i}\\times ln\\left(\\frac{F_{i}}{E_{i}}\\right) & \\text{ if } \\lambda=0 \\\\ 2\\times\\sum_{i=1}^{k} E_{i}\\times ln\\left(\\frac{E_{i}}{F_{i}}\\right) & \\text{ if } \\lambda=-1 \\\\ \\frac{2}{\\lambda\\times\\left(\\lambda + 1\\right)} \\times \\sum_{i=1}^{k} F_{i}\\times\\left(\\left(\\frac{F_{i}}{E_{i}}\\right)^{\\lambda} - 1\\right) & \\text{ else } \\end{cases}$$
    $$df = k-1$$
    $$sig. = 1 - \\chi^2\\left(\\chi_{C}^{2},df\\right)$$
    
    With:
    $$n = \\sum_{i=1}^r \\sum_{j=1}^c F_{i,j}$$
    $$E_{i} = \\frac{n}{k}$$
    
    *Symbols used:*
    
    * $k$ the number of categories
    * $F_{i}$ the observed count of category i
    * $E_{i}$ the expected count of category i
    * $n$ the sum of all counts
    * $\\chi^2\\left(\\dots\\right)$ the chi-square cumulative density function
    
    Cressie and Read (1984, p. 463) suggest to use $\\lambda = \\frac{2}{3}$,  which is therefor the default in this function.
    
    The **Pearson chi-square statistic** can be obtained by setting $\\lambda = 1$. 
    
    The **Freeman-Tukey test** will be same as setting lambda to $-\\frac{1}{2}$. 
    
    **Neyman test** formula will be same as setting lambda to $-2$.
    
    The Yates continuity correction (cc="yates") is calculated using (Yates, 1934, p. 222):
    $$F_i^\\ast  = \\begin{cases} F_i - 0.5 & \\text{ if } F_i > E_i \\\\ F_i + 0.5 & \\text{ if } F_i < E_i \\\\ F_i & \\text{ if } F_i = E_i \\end{cases}$$

    In some cases the Yates correction is slightly changed to (yates2) (Allen, 1990, p. 523):
    $$F_i^\\ast  = \\begin{cases} F_i - 0.5 & \\text{ if } F_i - 0.5 > E_i \\\\ F_i + 0.5 & \\text{ if } F_i + 0.5 < E_i \\\\ F_i & \\text{ else } \\end{cases}$$
    
    Note that the Yates correction is usually only considered if there are only two categories. Some also argue this correction is too conservative (see for details Haviland (1990)).
    
    The Pearson correction (cc="pearson") is calculated using (E.S. Pearson, 1947, p. 157):
    $$\\chi_{adj}^2 = \\chi_{C}^{2}\\times\\frac{n - 1}{n}$$
    
    The Williams correction (cc="williams") is calculated using (Williams, 1976, p. 36):
    $$\\chi_{adj}^2 = \\frac{\\chi_{C}^2}{q}$$
    
    With:
    $$q = 1 + \\frac{k^2 - 1}{6\\times n\\times df}$$
    
    The formula is also used by McDonald (2014, p. 87)
    
    Before, After and Alternatives
    ------------------------------
    Before this an impression using a frequency table or a visualisation might be helpful:
    * [tab_frequency](../other/table_frequency.html#tab_frequency)
    * [vi_bar_simple](../visualisations/vis_bar_simple.html#vi_bar_simple) for Simple Bar Chart
    * [vi_cleveland_dot_plot](../visualisations/vis_cleveland_dot_plot.html#vi_cleveland_dot_plot) for Cleveland Dot Plot
    * [vi_dot_plot](../visualisations/vis_dot_plot.html#vi_dot_plot) for Dot Plot
    * [vi_pareto_chart](../visualisations/vis_pareto_chart.html#vi_pareto_chart) for Pareto Chart
    * [vi_pie](../visualisations/vis_pie.html#vi_pie) for Pie Chart
    
    After this you might an effect size measure:
    * [es_cohen_w](../effect_sizes/eff_size_cohen_w.html#es_cohen_w) for Cohen w
    * [es_cramer_v_gof](../effect_sizes/eff_size_cramer_v_gof.html#es_cramer_v_gof) for Cramer's V for Goodness-of-Fit
    * [es_fei](../effect_sizes/eff_size_fei.html#es_fei) for Fei
    * [es_jbm_e](../effect_sizes/eff_size_jbm_e.html#es_jbm_e) for Johnston-Berry-Mielke E

    or perform a post-hoc test:
    * [ph_pairwise_bin](../other/poho_pairwise_bin.html#ph_pairwise_bin) for Pairwise Binary Test
    * [ph_pairwise_gof](../other/poho_pairwise_gof.html#ph_pairwise_gof) for Pairwise Goodness-of-Fit Tests
    * [ph_residual_gof_bin](../other/poho_residual_gof_bin.html#ph_residual_gof_bin) for Residuals Tests
    * [ph_residual_gof_gof](../other/poho_residual_gof_gof.html#ph_residual_gof_gof) for Residuals Using Goodness-of-Fit Tests

    Alternative tests:
    * [ts_pearson_gof](../tests/test_pearson_gof.html#ts_pearson_gof) for Pearson Chi-Square Goodness-of-Fit Test
    * [ts_freeman_tukey_gof](../tests/test_freeman_tukey_gof.html#ts_freeman_tukey_gof) for Freeman-Tukey Test of Goodness-of-Fit
    * [ts_freeman_tukey_read](../tests/test_freeman_tukey_read.html#ts_freeman_tukey_read) for Freeman-Tukey-Read Test of Goodness-of-Fit
    * [ts_g_gof](../tests/test_g_gof.html#ts_g_gof) for G (Likelihood Ratio) Goodness-of-Fit Test
    * [ts_mod_log_likelihood_gof](../tests/test_mod_log_likelihood_gof.html#ts_mod_log_likelihood_gof) for Mod-Log Likelihood Test of Goodness-of-Fit
    * [ts_multinomial_gof](../tests/test_multinomial_gof.html#ts_multinomial_gof) for Multinomial Goodness-of-Fit Test
    * [ts_neyman_gof](../tests/test_neyman_gof.html#ts_neyman_gof) for Neyman Test of Goodness-of-Fit
    
    References 
    ----------
    Allen, A. O. (1990). *Probability, statistics, and queueing theory with computer science applications* (2nd ed.). Academic Press.
    
    Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (2007). *Discrete multivariate analysis*. Springer.
    
    Cressie, N., & Read, T. R. C. (1984). Multinomial goodness-of-fit tests. *Journal of the Royal Statistical Society: Series B (Methodological), 46*(3), 440–464. doi:10.1111/j.2517-6161.1984.tb01318.x
    
    Freeman, M. F., & Tukey, J. W. (1950). Transformations related to the angular and the square root. *The Annals of Mathematical Statistics, 21*(4), 607–611. doi:10.1214/aoms/1177729756
    
    Haviland, M. G. (1990). Yates’s correction for continuity and the analysis of 2 × 2 contingency tables. *Statistics in Medicine, 9*(4), 363–367. doi:10.1002/sim.4780090403
    
    Neyman, J. (1949). Contribution to the theory of the chi-square test. Berkeley Symposium on Math. Stat, and Prob, 239–273. doi:10.1525/9780520327016-030
    
    Pearson, E. S. (1947). The choice of statistical tests illustrated on the Interpretation of data classed in a 2 × 2 table. *Biometrika, 34*(1/2), 139–167. doi:10.2307/2332518
    
    Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. *Philosophical Magazine Series 5, 50*(302), 157–175. doi:10.1080/14786440009463897
    
    Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. *The Annals of Mathematical Statistics, 9*(1), 60–62. doi:10.1214/aoms/1177732360
    
    Williams, D. A. (1976). Improved likelihood ratio tests for complete contingency tables. *Biometrika, 63*(1), 33–37. doi:10.2307/2335081
    
    Yates, F. (1934). Contingency tables involving small numbers and the chi square test. *Supplement to the Journal of the Royal Statistical Society, 1*(2), 217–235. doi:10.2307/2983604
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    ---------
    >>> pd.set_option('display.width',1000)
    >>> pd.set_option('display.max_columns', 1000)
    
    Example 1: pandas series
    >>> df1 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv', sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> ex1 = df1['mar1']
    >>> ts_powerdivergence_gof(ex1)
          n  k    statistic  df        p-value  minExp  percBelow5                          test used
    0  1941  5  1187.351716   4  8.792881e-256   388.2         0.0  Cressie-Read goodness-of-fit test
    
    Example 2: pandas series with various settings
    >>> ex2 = df1['mar1']
    >>> eCounts = pd.DataFrame({'category' : ["MARRIED", "DIVORCED", "NEVER MARRIED", "SEPARATED"], 'count' : [5,5,5,5]})
    >>> ts_powerdivergence_gof(ex2, expCounts=eCounts)
          n  k   statistic  df        p-value  minExp  percBelow5                          test used
    0  1760  4  957.932468   3  2.402524e-207   440.0         0.0  Cressie-Read goodness-of-fit test
    >>> ts_powerdivergence_gof(ex2, expCounts=eCounts, cc="pearson")
          n  k   statistic  df        p-value  minExp  percBelow5                                                  test used
    0  1760  4  957.388188   3  3.153068e-207   440.0         0.0  Cressie-Read goodness-of-fit test, and Pearson correction
    >>> ts_powerdivergence_gof(ex2, expCounts=eCounts, cc="williams")
          n  k   statistic  df        p-value  minExp  percBelow5                                                   test used
    0  1760  4  957.479116   3  3.013070e-207   440.0         0.0  Cressie-Read goodness-of-fit test, and Williams correction
    
    Example 3: a list
    >>> ex3 = ["MARRIED", "DIVORCED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "NEVER MARRIED", "MARRIED", "MARRIED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "MARRIED"]
    >>> ts_powerdivergence_gof(ex3)
        n  k  statistic  df   p-value  minExp  percBelow5                          test used
    0  19  4   3.180061   3  0.364688    4.75       100.0  Cressie-Read goodness-of-fit test
    
    '''
    #Set correction factor to 1 (no correction)
    corFactor = 1
    
    #Test Used
    if lambd == 2/3 or lambd == "cressie-read":
        lambd = 2/3
        testUsed = "Cressie-Read goodness-of-fit test"
    
    elif lambd==0 or lambd == "g":
        lambd=0
        testUsed = "likelihood-ratio goodness-of-fit test"
        
    elif lambd==-1 or lambd == "mod-log":
        lambd=-1
        testUsed = "mod-log likelihood ratio goodness-of-fit test"
        
    elif lambd==1 or lambd=="pearson":
        lambd=1
        testUsed = "Pearson goodness-of-fit test"
    
    elif lambd==-0.5 or lambd=="freeman-tukey":
        lambd=-0.5
        testUsed = "Freeman-Tukey goodness-of-fit test"
    elif lambd==-2 or lambd=="neyman":
        lambd=-2
        testUsed = "Neyman goodness-of-fit test"
    else:
        testUsed = "power divergence goodness-of-fit test with lambda = " + str(lambd)
        
        
    if type(data) == list:
        data = pd.Series(data)
        
    #Set correction factor to 1 (no correction)
    corFactor = 1    
    
    #The test itself        
    freqs = data.value_counts()
    k = len(freqs)

    #Determine expected counts if not provided
    if expCounts is None:
        expCounts = [sum(freqs)/len(freqs)]* k
        expCounts = pd.Series(expCounts, index=list(freqs.index.values))
    
    else:
        #if expected counts are provided
        ne = 0
        k = len(expCounts)
        #determine sample size of expected counts
        for i in range(0,k):
            ne = ne + expCounts.iloc[i,1]

        #remove categories not provided from observed counts
        for i in freqs.index:
            if i not in list(expCounts.iloc[:,0]):
                freqs = freqs.drop(i)
        # and sort based on the index
        freqs = freqs.sort_index()

        #set the column names
        expCounts.columns = ["category", "count"]
        #sort the expected counts
        expCounts.sort_values(by="category", inplace=True)
        #adjust based on observed count total
        expCounts['count'] = expCounts['count'].astype('float64')
        n = sum(freqs)
        for i in range(0,k):
            expCounts.at[i, 'count'] = float(expCounts.at[i, 'count'] / ne * n)
        
        expCounts = pd.Series(expCounts.iloc[:, 1])

    n = sum(freqs)
    df = k - 1

    #set williams correction factor
    if cc=="williams":
        corFactor = 1/(1 + (k**2 - 1)/(6*n*df))
        testUsed = testUsed + ", and Williams correction"
    
    #adjust frequencies if Yates correction is requested
    if cc=="yates":
        k = len(freqs)
        adjFreq = list(freqs).copy()
        for i in range(0, k):
            if adjFreq[i] > expCounts.iloc[i]:
                adjFreq[i] = adjFreq[i] - 0.5
            elif adjFreq[i] < expCounts.iloc[i]:
                adjFreq[i] = adjFreq[i] + 0.5

        freqs = pd.Series(adjFreq, index=list(freqs.index.values))
        testUsed = testUsed + ", and Yates correction"

    if cc=="yates2":
        k = len(freqs)
        adjFreq = list(freqs).copy()
        for i in range(0, k):
            if adjFreq[i] - 0.5 > expCounts.iloc[i]:
                adjFreq[i] = adjFreq[i] - 0.5
            elif adjFreq[i] + 0.5 < expCounts.iloc[i]:
                adjFreq[i] = adjFreq[i] + 0.5

        freqs = pd.Series(adjFreq, index=list(freqs.index.values))
        testUsed = testUsed + ", and Yates correction"
        
    #determine the test statistic
    
    freqs=list(freqs)
    if lambd==0:
        ts = 2*sum(freqs*np.log(freqs/expCounts))
    elif lambd==-1:
        ts = 2*sum(expCounts*np.log(expCounts/freqs))
    else:
        ts = 2*sum(freqs*((freqs/expCounts)**(lambd) - 1))/(lambd*(lambd + 1))
    
    #set E.S. Pearson correction
    if cc=="pearson":
        corFactor = (n - 1)/n
        testUsed = testUsed + ", and Pearson correction"
    
    #Adjust test statistic
    ts = ts*corFactor
    
    #Determine p-value
    pVal = chi2.sf(ts, df)
    
    #Check minimum expected counts
    #Cells with expected count less than 5
    nbelow = len([x for x in expCounts if x < 5])
    #Number of cells
    ncells = len(expCounts)
    #As proportion
    pBelow = nbelow/ncells
    #the minimum expected count
    minExp = min(expCounts)
    
    #prepare results
    testResults = pd.DataFrame([[n, k, ts, df, pVal, minExp, pBelow*100, testUsed]], columns=["n", "k","statistic", "df", "p-value", "minExp", "percBelow5", "test used"])        
    pd.set_option('display.max_colwidth', None)
    
    return testResults