Module stikpetP.other.poho_residual_gof_gof
import pandas as pd
from ..tests.test_multinomial_gof import ts_multinomial_gof
from ..tests.test_powerdivergence_gof import ts_powerdivergence_gof
from ..tests.test_neyman_gof import ts_neyman_gof
from ..tests.test_mod_log_likelihood_gof import ts_mod_log_likelihood_gof
from ..tests.test_g_gof import ts_g_gof
from ..tests.test_freeman_tukey_read import ts_freeman_tukey_read
from ..tests.test_freeman_tukey_gof import ts_freeman_tukey_gof
from ..tests.test_pearson_gof import ts_pearson_gof
from ..other.p_adjustments import p_adjust
def ph_residual_gof_gof(data, test="pearson", expCount=None, mtc='bonferroni', **kwargs):
    '''
Post-Hoc Residuals Using GoF for GoF
----------------------------------------
    This function performs a goodness-of-fit test for each category, collapsing all other categories into a single group.
    Both the unadjusted p-values and the p-values adjusted for multiple testing (Bonferroni by default) are reported.
This function is shown in this [YouTube video](https://youtu.be/PF4Iuh0BDtQ) and the test is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Tests/PostHocAfterGoF.html)
Parameters
----------
data : list or pandas series
test : {"pearson", "freeman-tukey", "freeman-tukey-read", "g", "mod-log-g", "neyman", "powerdivergence", "multinomial"}, optional
test to use
expCount : pandas dataframe, optional
categories and expected counts
mtc : string, optional
any of the methods available in p_adjust() to correct for multiple tests
**kwargs : optional
additional arguments for the specific test that are passed along.
Returns
-------
pandas.DataFrame
A dataframe with the following columns:
- *category*, the label of the category
- *obs. count*, the observed count
- *exp. count*, the expected count
- *statistic*, the chi-square test statistic
    - *df*, the degrees of freedom
- *p-value*, the unadjusted significance
- *adj. p-value*, the adjusted significance
- *minExp*, the minimum expected count
- *percBelow5*, the percentage of cells with an expected count below 5
    - *test used*, a description of the test used
- In case of a multinomial test, the same columns except
- *p obs* instead of *statistic*, showing the probability of the observed sample table
- *n combs.*, instead of *df*, showing the number of possible tables
    - no *minExp* and *percBelow5* columns.
Notes
-----
none.
Before, After and Alternatives
------------------------------
Before this an omnibus test might be helpful, these are also the tests used on each category:
* [ts_pearson_gof](../tests/test_pearson_gof.html#ts_pearson_gof) for Pearson Chi-Square Goodness-of-Fit Test
* [ts_freeman_tukey_gof](../tests/test_freeman_tukey_gof.html#ts_freeman_tukey_gof) for Freeman-Tukey Test of Goodness-of-Fit
* [ts_freeman_tukey_read](../tests/test_freeman_tukey_read.html#ts_freeman_tukey_read) for Freeman-Tukey-Read Test of Goodness-of-Fit
* [ts_g_gof](../tests/test_g_gof.html#ts_g_gof) for G (Likelihood Ratio) Goodness-of-Fit Test
* [ts_mod_log_likelihood_gof](../tests/test_mod_log_likelihood_gof.html#ts_mod_log_likelihood_gof) for Mod-Log Likelihood Test of Goodness-of-Fit
* [ts_multinomial_gof](../tests/test_multinomial_gof.html#ts_multinomial_gof) for Multinomial Goodness-of-Fit Test
* [ts_neyman_gof](../tests/test_neyman_gof.html#ts_neyman_gof) for Neyman Test of Goodness-of-Fit
* [ts_powerdivergence_gof](../tests/test_powerdivergence_gof.html#ts_powerdivergence_gof) for Power Divergence GoF Test
After this you might want to add an effect size measure:
* [es_post_hoc_gof](../effect_sizes/eff_size_post_hoc_gof.html#es_post_hoc_gof) for various effect sizes
Alternative post-hoc tests:
* [ph_pairwise_bin](../other/poho_pairwise_bin.html#ph_pairwise_bin) for Pairwise Binary Test
* [ph_pairwise_gof](../other/poho_pairwise_gof.html#ph_pairwise_gof) for Pairwise Goodness-of-Fit Tests
* [ph_residual_gof_bin](../other/poho_residual_gof_bin.html#ph_residual_gof_bin) for Residuals Tests
More info on the adjustment for multiple testing:
* [p_adjust](../other/p_adjustments.html#p_adjust)
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
--------
Examples: get data
>>> import pandas as pd
>>> pd.set_option('display.width',1000)
>>> pd.set_option('display.max_columns', 1000)
>>> gss_df = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv', sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> ex1 = gss_df['mar1']
Example 1 using default settings:
>>> ph_residual_gof_gof(ex1)
category obs. count exp. count statistic df p-value adj. p-value minExp percBelow5 test used
0 MARRIED 972.0 388.2 1097.444745 1.0 1.186466e-240 5.932332e-240 388.2 0.0 Pearson chi-square test of goodness-of-fit
1 NEVER MARRIED 395.0 388.2 0.148892 1.0 6.995961e-01 1.000000e+00 388.2 0.0 Pearson chi-square test of goodness-of-fit
2 DIVORCED 314.0 388.2 17.728104 1.0 2.548337e-05 1.274169e-04 388.2 0.0 Pearson chi-square test of goodness-of-fit
3 WIDOWED 181.0 388.2 138.240082 1.0 6.457789e-32 3.228895e-31 388.2 0.0 Pearson chi-square test of goodness-of-fit
4 SEPARATED 79.0 388.2 307.845956 1.0 6.433895e-69 3.216947e-68 388.2 0.0 Pearson chi-square test of goodness-of-fit
Example 2 using a G test and Holm correction:
>>> ph_residual_gof_gof(ex1, test="g", mtc='holm')
category obs. count exp. count statistic df p-value adj. p-value minExp percBelow5 test used
0 MARRIED 972.0 388.2 870.406786 1.0 2.661640e-191 1.330820e-190 388.2 0.0 G test of goodness-of-fit
1 NEVER MARRIED 395.0 388.2 0.148246 1.0 7.002168e-01 7.002168e-01 388.2 0.0 G test of goodness-of-fit
2 DIVORCED 314.0 388.2 18.674271 1.0 1.550608e-05 3.101215e-05 388.2 0.0 G test of goodness-of-fit
3 WIDOWED 181.0 388.2 164.679720 1.0 1.074669e-37 3.224007e-37 388.2 0.0 G test of goodness-of-fit
4 SEPARATED 79.0 388.2 424.698965 1.0 2.315657e-94 9.262630e-94 388.2 0.0 G test of goodness-of-fit
    '''
    if type(data) is list:
        data = pd.Series(data)
    freq = data.value_counts()
    if expCount is None:
        # no expected counts given: assume all categories are equally likely
        n = sum(freq)
        k = len(freq)
        categories = list(freq.index)
        expC = [n/k] * k
    else:
        # rescale the provided expected counts to the observed sample size
        nE = 0
        n = 0
        for i in range(0, len(expCount)):
            nE = nE + expCount.iloc[i, 1]
            n = n + freq[expCount.iloc[i, 0]]
        expC = []
        for i in range(0, len(expCount)):
            expC.append(expCount.iloc[i, 1]/nE*n)
        k = len(expC)
        categories = list(expCount.iloc[:, 0])
    results = pd.DataFrame()
    resRow = 0
    for i in range(0, k):
        cat = categories[i]
        results.at[resRow, 0] = cat
        n1 = freq[categories[i]]
        results.at[resRow, 1] = n1
        e1 = expC[i]
        results.at[resRow, 2] = e1
        # collapse all remaining categories into a single 'all other' group
        tempA = [categories[i]]*n1
        tempB = ['all other']*(n - n1)
        temp_data = tempA + tempB
        exP = pd.DataFrame([[categories[i], e1], ['all other', n - e1]], columns=['category', 'count'])
        if test == "pearson":
            pair_test_result = ts_pearson_gof(temp_data, expCounts=exP, **kwargs)
        elif test == "freeman-tukey":
            pair_test_result = ts_freeman_tukey_gof(temp_data, expCounts=exP, **kwargs)
        elif test == "freeman-tukey-read":
            pair_test_result = ts_freeman_tukey_read(temp_data, expCounts=exP, **kwargs)
        elif test == "g":
            pair_test_result = ts_g_gof(temp_data, expCounts=exP, **kwargs)
        elif test == "mod-log-g":
            pair_test_result = ts_mod_log_likelihood_gof(temp_data, expCounts=exP, **kwargs)
        elif test == "neyman":
            pair_test_result = ts_neyman_gof(temp_data, expCounts=exP, **kwargs)
        elif test == "powerdivergence":
            pair_test_result = ts_powerdivergence_gof(temp_data, expCounts=exP, **kwargs)
        if test == "multinomial":
            # the multinomial test returns p obs and n combs. instead of statistic and df
            pair_test_result = ts_multinomial_gof(temp_data, expCounts=exP, **kwargs)
            results.at[resRow, 3] = pair_test_result.iloc[0, 0]
            results.at[resRow, 4] = pair_test_result.iloc[0, 1]
            results.at[resRow, 5] = pair_test_result.iloc[0, 2]
            results.at[resRow, 6] = results.at[resRow, 5]
            results.at[resRow, 7] = pair_test_result.iloc[0, 3]
        else:
            results.at[resRow, 3] = pair_test_result.iloc[0, 2]
            results.at[resRow, 4] = pair_test_result.iloc[0, 3]
            results.at[resRow, 5] = pair_test_result.iloc[0, 4]
            results.at[resRow, 6] = results.at[resRow, 5]
            results.at[resRow, 7] = pair_test_result.iloc[0, 5]
            results.at[resRow, 8] = pair_test_result.iloc[0, 6]
            results.at[resRow, 9] = pair_test_result.iloc[0, 7]
        resRow = resRow + 1
    # adjust the p-values in column 6 for multiple testing
    results.iloc[:, 6] = p_adjust(results.iloc[:, 6], method=mtc)
    if test == "multinomial":
        # set columns for the multinomial case
        results.columns = [
            "category", "obs. count", "exp. count", "p obs", "n combs.",
            "p-value", "adj. p-value", "test used"
        ]
    else:
        # set columns for the other tests
        results.columns = [
            "category", "obs. count", "exp. count", "statistic", "df",
            "p-value", "adj. p-value", "minExp", "percBelow5", "test used"
        ]
    return results
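The per-category collapse that ph_residual_gof_gof performs can be illustrated independently. Below is a minimal, self-contained sketch (not the package implementation) of the default Pearson variant with a Bonferroni correction, applied to the counts from Example 1; the helper `pearson_binary_gof` is purely illustrative:

```python
import math

def pearson_binary_gof(obs, exp_cat, n):
    """Pearson chi-square for one category vs. 'all other' (df = 1)."""
    observed = [obs, n - obs]
    expected = [exp_cat, n - exp_cat]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

counts = {"MARRIED": 972, "NEVER MARRIED": 395, "DIVORCED": 314,
          "WIDOWED": 181, "SEPARATED": 79}
n = sum(counts.values())
k = len(counts)
exp_each = n / k  # 388.2, the equal expected count used when expCount=None

raw = {c: pearson_binary_gof(o, exp_each, n) for c, o in counts.items()}
# Bonferroni: multiply each p-value by the number of tests, capped at 1
adj = {c: min(1.0, p * k) for c, (chi2, p) in raw.items()}
```

The chi-square values reproduce those in Example 1 (e.g. about 1097.44 for MARRIED); the real function additionally reports minExp and percBelow5 and supports the other test variants.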
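The mtc argument delegates the correction to p_adjust(). As a rough, self-contained illustration of the Holm step-down procedure used in Example 2, here is a hypothetical stand-in (not the package's p_adjust):

```python
def holm_adjust(pvals):
    """Holm step-down adjustment (illustrative stand-in for p_adjust)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # multiply by the number of remaining hypotheses, cap at 1,
        # and enforce monotonicity along the sorted p-values
        candidate = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, candidate)
        adj[i] = running_max
    return adj

# unadjusted G-test p-values from Example 2, in category order
p_vals = [2.661640e-191, 7.002168e-01, 1.550608e-05, 1.074669e-37, 2.315657e-94]
adj = holm_adjust(p_vals)
```

The adjusted values match the adj. p-value column of Example 2: the smallest p-value is multiplied by 5, the next by 4, and so on, while the largest is left unchanged.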