Module stikpetP.tests.test_binomial_os
import pandas as pd
from scipy.stats import binom
# This function is used in ph_binomial()
def ts_binomial_os(data, p0=0.5, p0Cat=None, codes=None, twoSidedMethod="eqdist"):
'''
One-sample Binomial Test
------------------------
Performs a one-sample (exact) binomial test.
    This test can be useful with a single binary variable as input. The null hypothesis is usually that the proportions of the two categories in the population are equal (i.e. 0.5 for each). If the p-value of the test is below the pre-defined alpha level (usually 5% = 0.05), the null hypothesis is rejected and the two categories differ significantly in proportion.
    The input for the function doesn't have to be a binary variable. A nominal variable can also be used, with the two categories to compare indicated.
    A significance (p-value) is, in general, the probability of a result as in the sample, or one more extreme, if the null hypothesis is true. For a two-tailed binomial test the 'or more extreme' part causes a bit of a complication. There are different methods to approach this problem; see the details for more information.
This function is shown in this [YouTube video](https://youtu.be/CzysWqVZzT0) and the binomial test is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Tests/binomial-one-sample.html)
Parameters
----------
data : list or pandas data series
the data
p0 : float, optional
hypothesized proportion for the first category (default is 0.5)
p0Cat : optional
the category for which p0 was used
codes : list, optional
the two codes to use
twoSidedMethod : {"eqdist", "double", "smallp"}, optional
method to be used for 2-sided significance. Default is "eqdist".
Returns
-------
pandas.DataFrame
A dataframe with the following columns:
- *p-value (2-sided)* : the two-sided significance (p-value)
- *test* : description of the test used
Notes
-------
    To decide which category is associated with p0, the following is used:
    * If codes are provided, the first code is assumed to be the category for p0.
    * If p0Cat is specified, that category is used for p0 and all other categories are treated as category 2; if there are more than two categories, all categories besides p0Cat are merged into one (see Example 3 below).
    * If neither codes nor p0Cat is specified and the data has more than two categories, a warning is printed and no result is returned.
    * If neither codes nor p0Cat is specified and there are two categories, p0 is assumed to be for the category closest matching the p0 value (i.e. if p0 is above 0.5, the category with the highest count is assumed to be used for p0).
    It uses scipy.stats' binom. For the formulas below it is assumed that the observed proportion is less than the expected proportion; if this isn't the case, the right-tail probabilities are used.
    A one-sided p-value is calculated first:
$$sig_{one-tail} = B\\left(n, n_{min}, p_0^{\\ast}\\right)$$
With:
$$n_{min} = \\min \\left(n_s, n_f\\right)$$
$$p_0^{\\ast} = \\begin{cases}p_0 & \\text{ if } n_{min}=n_s \\\\ 1 - p_0 & \\text{ if } n_{min}= n_f\\end{cases}$$
Where:
* \\(n\\) is the number of cases
* \\(n_s\\) is the number of successes
* \\(n_f\\) is the number of failures
* \\(p_0\\) is the probability of a success according to the null hypothesis
* \\(p_0^{\\ast}\\) is the probability adjusted in case failures is used
* \\(B\\left(\\dots\\right)\\) the binomial cumulative distribution function
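    As a worked illustration, take the data from Example 1 below: \\(n=8\\), \\(n_s=5\\), \\(n_f=3\\) and \\(p_0=0.5\\), so \\(n_{min}=3\\) and \\(p_0^{\\ast}=0.5\\). Then:
    $$sig_{one-tail} = B\\left(8, 3, 0.5\\right) = \\frac{1+8+28+56}{2^8} = \\frac{93}{256} \\approx 0.3633$$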
    For the two-sided significance, three options can be used.
Option 1: Equal Distance Method (eqdist)
    $$sig_{two-tail} = B\\left(n, n_{min}, p_0^{\\ast}\\right) + 1 - B\\left(n, 2 \\times n_0 - n_{min} - 1, p_0^{\\ast}\\right)$$
With:
\\(n_0 = \\lfloor n\\times p_0\\rfloor\\)
    This method looks at the number of cases. In a sample of \\(n\\) people, we’d expect \\(n_0 = \\lfloor n\\times p_0\\rfloor\\) successes (rounding down to the nearest integer). We only had \\(n_{min}\\), so a difference of \\(n_0-n_{min}\\). The ‘equal distance method’ now means looking for the chance of having \\(n_{min}\\) or less, and \\(n_0+\\left(n_0-n_{min}\\right)=2\\times n_0-n_{min}\\) or more. Each of these two probabilities can be found using a binomial distribution; adding the two together then gives the two-sided significance.
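    For the running illustration: \\(n_0 = \\lfloor 8\\times 0.5\\rfloor = 4\\), so the other tail starts at \\(2\\times 4-3=5\\), giving:
    $$sig_{two-tail} = B\\left(8, 3, 0.5\\right) + 1 - B\\left(8, 4, 0.5\\right) = \\frac{93}{256} + \\frac{93}{256} = \\frac{186}{256} \\approx 0.7266$$
    which matches the result shown in Example 1 below.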
Option 2: Small p-method
    $$sig_{two-tail} = B\\left(n, n_{min}, p_0^{\\ast}\\right) + \\sum_{i=n_{min}+1}^n \\begin{cases} 0 & \\text{ if } b\\left(n, i, p_0^{\\ast}\\right) > b\\left(n, n_{min}, p_0^{\\ast}\\right) \\\\ b\\left(n, i, p_0^{\\ast}\\right) & \\text{ if } b\\left(n, i, p_0^{\\ast}\\right) \\leq b\\left(n, n_{min}, p_0^{\\ast}\\right) \\end{cases}$$
With:
    \\(b\\left(\\dots\\right)\\) as the binomial probability mass function.
    This method looks at the probabilities themselves. \\(b\\left(n, n_{min}, p_0^{\\ast}\\right)\\) is the probability of having exactly \\(n_{min}\\) out of a group of \\(n\\), with a chance \\(p_0^{\\ast}\\) each time. The method of small p-values now considers ‘or more extreme’ to be any number between 0 and \\(n\\) (the sample size) that has a probability less than or equal to this. This means we need to go over each option, determine its probability, and check whether it is less than or equal to the observed one. So, the probability of 0 successes, the probability of 1 success, etc. The sum of all of those gives the two-sided significance. We can reduce the work a little, since any value below \\(n_{min}\\) will also have a lower probability: we only need to sum over the values above \\(n_{min}\\) and add the one-sided significance to that sum.
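    For the running illustration: \\(b\\left(8, 3, 0.5\\right) = 56/256\\), and the counts above \\(n_{min}=3\\) with a probability less than or equal to this are 5, 6, 7 and 8 (with probabilities 56, 28, 8 and 1, each over 256), giving:
    $$sig_{two-tail} = \\frac{93}{256} + \\frac{56+28+8+1}{256} = \\frac{186}{256} \\approx 0.7266$$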
    Option 3: Double Single-Sided (double)
    $$sig_{two-tail} = 2\\times sig_{one-tail}$$
    Fairly straightforward: just double the one-sided significance.
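    For the running illustration: \\(2\\times\\frac{93}{256} = \\frac{186}{256} \\approx 0.7266\\). For this particular data all three methods happen to agree, but in general they can give different results.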
Before, After and Alternatives
------------------------------
Before running the test you might first want to get an impression using a frequency table:
[tab_frequency](../other/table_frequency.html#tab_frequency)
After the test you might want an effect size measure:
* [es_cohen_g](../effect_sizes/eff_size_cohen_g.html#es_cohen_g) for Cohen g
* [es_cohen_h_os](../effect_sizes/eff_size_cohen_h_os.html#es_cohen_h_os) for Cohen h'
* [es_alt_ratio](../effect_sizes/eff_size_alt_ratio.html#es_alt_ratio) for Alternative Ratio
Alternatives for this test could be:
* [ts_score_os](../tests/test_score_os.html#ts_score_os) for One-Sample Score Test
* [ts_wald_os](../tests/test_wald_os.html#ts_wald_os) for One-Sample Wald Test
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
---------
>>> pd.set_option('display.width',1000)
>>> pd.set_option('display.max_columns', 1000)
Example 1: Numeric list
>>> ex1 = [1, 1, 2, 1, 2, 1, 2, 1]
>>> ts_binomial_os(ex1)
p-value (2-sided) test
0 0.726562 one-sample binomial, with equal-distance method (assuming p0 for 1)
    Setting a different hypothesized proportion, and going over the different methods to determine the two-sided significance:
>>> ts_binomial_os(ex1, p0=0.3)
p-value (2-sided) test
0 0.313266 one-sample binomial, with equal-distance method (assuming p0 for 1)
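    The other two-sided methods on the same data (back at the default p0 of 0.5; for this data all three methods happen to agree, in general they can differ):
    >>> ts_binomial_os(ex1, twoSidedMethod="double")
    p-value (2-sided) test
    0 0.726562 one-sample binomial, with double one-sided method (assuming p0 for 1)
    >>> ts_binomial_os(ex1, twoSidedMethod="smallp")
    p-value (2-sided) test
    0 0.726562 one-sample binomial, with small p method (assuming p0 for 1)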
Example 2: pandas Series
>>> gss_df = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv', sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ts_binomial_os(gss_df['sex'])
p-value (2-sided) test
0 0.000006 one-sample binomial, with equal-distance method (assuming p0 for FEMALE)
>>> ts_binomial_os(gss_df['mar1'], codes=["DIVORCED", "NEVER MARRIED"])
p-value (2-sided) test
0 0.003002 one-sample binomial, with equal-distance method (with p0 for DIVORCED)
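    Example 3: nominal data with more than two categories, using p0Cat (the other categories are merged, as described in the notes; a small illustrative sample):
    >>> ex3 = ["a", "b", "c", "a"]
    >>> ts_binomial_os(ex3, p0Cat="a")
    p-value (2-sided) test
    0 1.0 one-sample binomial, with equal-distance method (with p0 for a)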
'''
    if isinstance(data, list):
        data = pd.Series(data)
#remove missing values
data = data.dropna()
testUsed = "one-sample binomial"
#Determine number of successes, failures, and total sample size
if codes is None:
#create a frequency table
freq = data.value_counts()
if p0Cat is None:
#check if there were exactly two categories or not
if len(freq) != 2:
# unable to determine which category p0 would belong to, so print warning and end
print("WARNING: data does not have two unique categories, please specify two categories using codes parameter")
return
else:
#simply select the two categories as cat1 and cat2
n1 = freq.values[0]
n2 = freq.values[1]
n = n1 + n2
#determine p0 was for which category
p0_cat = freq.index[0]
                if p0 > 0.5 and n1 < n2:
                    n1, n2 = n2, n1
                    p0_cat = freq.index[1]
cat_used = " (assuming p0 for " + str(p0_cat) + ")"
else:
n = sum(freq.values)
n1 = sum(data==p0Cat)
n2 = n - n1
p0_cat = p0Cat
cat_used = " (with p0 for " + str(p0Cat) + ")"
else:
n1 = sum(data==codes[0])
n2 = sum(data==codes[1])
n = n1 + n2
cat_used = " (with p0 for " + str(codes[0]) + ")"
minCount = n1
ExpProp = p0
ObsProp = n1/n
if n2 < n1:
minCount = n2
ExpProp = 1 - p0
ObsProp = n2/n
#one sided test
if ExpProp < ObsProp:
sig1 = 1 - binom.cdf(minCount-1,n,ExpProp)
else:
sig1 = binom.cdf(minCount,n,ExpProp)
#two sided
if twoSidedMethod=="double":
sig2 = sig1
testUsed = testUsed + ", with double one-sided method"
elif twoSidedMethod=="eqdist":
#Equal distance
        #expected count, rounded up when the observed count lies above a fractional expectation
        ExpCount = int(n * ExpProp)
        if minCount > ExpCount and n*ExpProp - ExpCount != 0:
            ExpCount = ExpCount + 1
        Dist = ExpCount - minCount
        if Dist == 0:
            #observed count equals the (rounded) expected count, so the other tail mirrors the first
            if ExpProp < ObsProp:
                sig2 = binom.cdf(minCount - 1, n, ExpProp)
            else:
                sig2 = 1 - binom.cdf(minCount, n, ExpProp)
        else:
            #the count at the same distance on the other side of the expected count
            OtherCount = ExpCount + Dist
            if ExpProp < ObsProp:
                #observed above expected, so the other tail is the lower one
                if OtherCount < 0:
                    sig2 = 0
                else:
                    sig2 = binom.cdf(OtherCount, n, ExpProp)
            else:
                #observed below expected, so the other tail is the upper one
                if OtherCount > n:
                    sig2 = 0
                else:
                    sig2 = 1 - binom.cdf(OtherCount - 1, n, ExpProp)
testUsed = testUsed + ", with equal-distance method"
else:
#Method of small p
        #find the first count in the other direction with a probability less than
        #or equal to the observed one; ties (equal pmf) count as equally extreme
        binSmall = binom.pmf(minCount, n, ExpProp)
        binDist = binSmall + 1
        if ExpProp < ObsProp:
            #walk down from just below the observed count
            i = minCount - 1
            while binDist > binSmall and i >= 0:
                binDist = binom.pmf(i, n, ExpProp)
                i = i - 1
            if binDist > binSmall:
                #no count below the observed one qualifies
                sig2 = 0
            else:
                sig2 = binom.cdf(i + 1, n, ExpProp)
        else:
            #walk up from just above the observed count
            i = minCount + 1
            while binDist > binSmall and i <= n:
                binDist = binom.pmf(i, n, ExpProp)
                i = i + 1
            if binDist > binSmall:
                #no count above the observed one qualifies
                sig2 = 0
            else:
                sig2 = 1 - binom.cdf(i - 2, n, ExpProp)
testUsed = testUsed + ", with small p method"
testUsed = testUsed + cat_used
sigT = sig1 + sig2
testResults = pd.DataFrame([[sigT, testUsed]], columns=["p-value (2-sided)", "test"])
return testResults
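# Illustrative sketch, not part of the original module's examples: compare the
# three two-sided methods on the small sample from the docstring.
if __name__ == "__main__":
    ex = [1, 1, 2, 1, 2, 1, 2, 1]
    for method in ("eqdist", "double", "smallp"):
        print(ts_binomial_os(ex, twoSidedMethod=method))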