Module stikpetP.tests.test_z_is
from statistics import mean, variance, NormalDist
import pandas as pd
def ts_z_is(catField, scaleField, categories=None, dmu=0, sigma1=None, sigma2=None):
    '''
    Independent Samples Z Test
    --------------------------
    A test to compare two means. It requires the population variances; if these are unknown but the sample sizes are large enough, the sample variances can be used instead.

    For smaller sample sizes a t-test (Student, Welch or Trimmed Means) could be used instead.

    Parameters
    ----------
    catField : dataframe or list
        the categorical data
    scaleField : dataframe or list
        the scores
    categories : list, optional
        the two categories of catField to use; otherwise the two most frequent categories are used.
    dmu : float, optional
        difference according to null hypothesis (default is 0)
    sigma1 : float, optional
        population standard deviation of the first group, if None the sample variance will be used
    sigma2 : float, optional
        population standard deviation of the second group, if None the sample variance will be used

    Returns
    -------
    A dataframe with (in the column names, *cat. 1* and *cat. 2* are replaced by the names of the two categories):

    * *n cat. 1*, the sample size of the first category
    * *n cat. 2*, the sample size of the second category
    * *mean cat. 1*, the sample mean of the first category
    * *mean cat. 2*, the sample mean of the second category
    * *diff.*, difference between the two sample means
    * *hyp. diff.*, hypothesized difference between the two population means
    * *statistic*, the test statistic (z-value)
    * *p-value*, the significance (p-value)
    * *test*, name of test used

    Notes
    -----
    The formula used is:
    $$z = \\frac{\\bar{x}_1 - \\bar{x}_2 - d_{H_0}}{SE}$$
    $$sig. = 2\\times\\left(1 - \\Phi\\left(\\left|z\\right|\\right)\\right)$$

    With:
    $$SE = \\sqrt{\\frac{\\sigma_1^2}{n_1} + \\frac{\\sigma_2^2}{n_2}}$$
    $$\\sigma_i^2 \\approx s_i^2 = \\frac{\\sum_{j=1}^{n_i} \\left(x_{i,j} - \\bar{x}_i\\right)^2}{n_i - 1}$$
    $$\\bar{x}_i = \\frac{\\sum_{j=1}^{n_i} x_{i,j}}{n_i}$$

    *Symbols used:*

    * \\(x_{i,j}\\) the j-th score in category i
    * \\(n_i\\) the number of scores in category i
    * \\(d_{H_0}\\) the difference between the two population means according to the null hypothesis (dmu)
    * \\(\\Phi\\left(\\dots\\right)\\) the cumulative distribution function of the standard normal distribution

    Before, After and Alternatives
    ------------------------------
    Before this you might want some descriptive measures. Use [me_mode_bin](../measures/meas_mode_bin.html#me_mode_bin) for the Mode for Binned Data, [me_mean](../measures/meas_mean.html#me_mean) for different types of mean, and/or [me_variation](../measures/meas_variation.html#me_variation) for different Measures of Quantitative Variation.

    Visualisation options are [vi_boxplot_single](../visualisations/vis_boxplot_single.html#vi_boxplot_single) for a Box (and Whisker) Plot and [vi_histogram](../visualisations/vis_histogram.html#vi_histogram) for a Histogram.

    After the test you might want an effect size measure. Options include: [Common Language](../effect_sizes/eff_size_common_language_is.html), [Cohen d_s](../effect_sizes/eff_size_hedges_g_is.html), [Cohen U](../effect_sizes/eff_size_cohen_u.html), [Hedges g](../effect_sizes/eff_size_hedges_g_is.html), [Glass delta](../effect_sizes/eff_size_glass_delta.html), [biserial correlation](../correlations/cor_biserial.html), [point-biserial correlation](../effect_sizes/cor_point_biserial.html).

    There are four similar tests, with different assumptions.

    |test|equal variance|normality|
    |-------|-----------|---------|
    |[Student t](../tests/test_student_t_is.html)| yes | yes|
    |[Welch t](../tests/test_welch_t_is.html) | no | yes|
    |[Trimmed means](../tests/test_trimmed_mean_is.html) | yes | no |
    |[Yuen-Welch](../tests/test_trimmed_mean_is.html)|no | no |

    Another test that in some cases could be used is the [Z test](../tests/test_z_is.html)

    Author
    ------
    Made by P. Stikker

    Companion website: https://PeterStatistics.com
    YouTube channel: https://www.youtube.com/stikpet
    Donations: https://www.patreon.com/bePatron?u=19398076

    Examples
    --------
    Example 1: Dataframe
    >>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
    >>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> ex1 = df1['age']
    >>> ex1 = ex1.replace("89 OR OLDER", "90")
    >>> ts_z_is(df1['sex'], ex1)
       n FEMALE  n MALE  mean FEMALE  mean MALE     diff.  hyp. diff.  statistic   p-value                        test
    0      1083     886    48.561404  47.760722  0.800681           0   0.998958  0.317815  independent samples z-test

    Example 2: List
    >>> scores = [20,50,80,15,40,85,30,45,70,60, None, 90,25,40,70,65, None, 70,98,40]
    >>> groups = ["nat.","int.","int.","nat.","int.", "int.","nat.","nat.","int.","int.","int.","int.","int.","int.","nat.", "int." ,None,"nat.","int.","int."]
    >>> ts_z_is(groups, scores)
       n int.  n nat.  mean int.  mean nat.  diff.  hyp. diff.  statistic   p-value                        test
    0      12       6  61.916667  41.666667  20.25           0    1.69314  0.090429  independent samples z-test

    '''
    # convert to pandas series if needed
    if isinstance(catField, list):
        catField = pd.Series(catField)
    if isinstance(scaleField, list):
        scaleField = pd.Series(scaleField)

    # combine as one dataframe and drop rows with a missing value in either field
    df = pd.concat([catField, scaleField], axis=1)
    df = df.dropna()

    # the two categories: as given, otherwise the two most frequent ones
    if categories is not None:
        cat1 = categories[0]
        cat2 = categories[1]
    else:
        counts = df.iloc[:, 0].value_counts()
        cat1 = counts.index[0]
        cat2 = counts.index[1]

    # separate the scores for each category and make sure they are floats
    x1 = [float(x) for x in df.iloc[:, 1][df.iloc[:, 0] == cat1]]
    x2 = [float(x) for x in df.iloc[:, 1][df.iloc[:, 0] == cat2]]
    n1 = len(x1)
    n2 = len(x2)

    # population variances if given, otherwise the sample variances
    if sigma1 is None:
        var1 = variance(x1)
    else:
        var1 = sigma1**2
    if sigma2 is None:
        var2 = variance(x2)
    else:
        var2 = sigma2**2

    # standard error of the difference between the two means
    se = (var1/n1 + var2/n2)**0.5

    # test statistic and two-sided p-value
    m1 = mean(x1)
    m2 = mean(x2)
    z = (m1 - m2 - dmu)/se
    pValue = 2 * (1 - NormalDist().cdf(abs(z)))
    statistic = z
    testUsed = "independent samples z-test"

    # the results (str() so that non-text category labels also work)
    colnames = ["n " + str(cat1), "n " + str(cat2), "mean " + str(cat1), "mean " + str(cat2), "diff.", "hyp. diff.", "statistic", "p-value", "test"]
    results = pd.DataFrame([[n1, n2, m1, m2, m1 - m2, dmu, statistic, pValue, testUsed]], columns=colnames)
    return results

Functions
def ts_z_is(catField, scaleField, categories=None, dmu=0, sigma1=None, sigma2=None)
Independent Samples Z Test
A test to compare two means. It requires the population variances; if these are unknown but the sample sizes are large enough, the sample variances can be used instead.
For smaller sample sizes a t-test (Student, Welch or Trimmed Means) could be used instead.
Parameters
catField : dataframe or list
    the categorical data
scaleField : dataframe or list
    the scores
categories : list, optional
    the two categories of catField to use; otherwise the two most frequent categories are used.
dmu : float, optional
    difference according to null hypothesis (default is 0)
sigma1 : float, optional
    population standard deviation of the first group, if None the sample variance will be used
sigma2 : float, optional
    population standard deviation of the second group, if None the sample variance will be used
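How the inputs are prepared can be sketched without pandas. This is only an illustration of the rule applied in the source code: pairs with a missing value are dropped, and when `categories` is not given the two most frequent categories are used. The helper name `split_two_groups` is hypothetical and not part of stikpetP, and the sketch assumes missing values are given as `None`.

```python
from collections import Counter


def split_two_groups(cat_values, scale_values, categories=None):
    """Hypothetical helper (not part of stikpetP): split the scores into two groups."""
    # keep only the pairs where both the category and the score are present
    pairs = [(c, float(s)) for c, s in zip(cat_values, scale_values)
             if c is not None and s is not None]
    if categories is not None:
        cat1, cat2 = categories[0], categories[1]
    else:
        # the two most frequent categories, most frequent first
        (cat1, _), (cat2, _) = Counter(c for c, _ in pairs).most_common(2)
    x1 = [s for c, s in pairs if c == cat1]
    x2 = [s for c, s in pairs if c == cat2]
    return cat1, cat2, x1, x2


# the data of Example 2 on this page
scores = [20, 50, 80, 15, 40, 85, 30, 45, 70, 60, None, 90, 25, 40, 70, 65, None, 70, 98, 40]
groups = ["nat.", "int.", "int.", "nat.", "int.", "int.", "nat.", "nat.", "int.", "int.",
          "int.", "int.", "int.", "int.", "nat.", "int.", None, "nat.", "int.", "int."]

cat1, cat2, x1, x2 = split_two_groups(groups, scores)
print(cat1, len(x1), cat2, len(x2))
```

With these data "int." comes first because it has 12 complete pairs against 6 for "nat.", which matches the column order in Example 2. Passing `categories=["nat.", "int."]` would reverse the order and therefore the sign of the difference.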
Returns
A dataframe with (in the column names, cat. 1 and cat. 2 are replaced by the names of the two categories):
- n cat. 1, the sample size of the first category
- n cat. 2, the sample size of the second category
- mean cat. 1, the sample mean of the first category
- mean cat. 2, the sample mean of the second category
- diff., difference between the two sample means
- hyp. diff., hypothesized difference between the two population means
- statistic, the test statistic (z-value)
- p-value, the significance (p-value)
- test, name of test used
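Notes
The formula used is:
$$z = \frac{\bar{x}_1 - \bar{x}_2 - d_{H_0}}{SE}$$
$$sig. = 2\times\left(1 - \Phi\left(\left|z\right|\right)\right)$$
With:
$$SE = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
$$\sigma_i^2 \approx s_i^2 = \frac{\sum_{j=1}^{n_i} \left(x_{i,j} - \bar{x}_i\right)^2}{n_i - 1}$$
$$\bar{x}_i = \frac{\sum_{j=1}^{n_i} x_{i,j}}{n_i}$$
Symbols used:
- \(x_{i,j}\) the j-th score in category i
- \(n_i\) the number of scores in category i
- \(d_{H_0}\) the difference between the two population means according to the null hypothesis (dmu)
- \(\Phi\left(\dots\right)\) the cumulative distribution function of the standard normal distribution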
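As a rough check, the formulas can be evaluated with only the standard library. The sketch below is not the package code: the function name `z_is_sketch` is hypothetical, and it subtracts the hypothesized difference `dmu` in the numerator, as the source code of `ts_z_is` does. The two lists are the scores of Example 2 after removing the pairs with a missing value.

```python
from statistics import NormalDist, mean, variance


def z_is_sketch(x1, x2, dmu=0, sigma1=None, sigma2=None):
    """Hypothetical helper: z statistic and two-sided p-value for two independent samples."""
    # population variances when supplied, otherwise the sample variances (n - 1 denominator)
    var1 = variance(x1) if sigma1 is None else sigma1 ** 2
    var2 = variance(x2) if sigma2 is None else sigma2 ** 2
    # standard error of the difference between the two means
    se = (var1 / len(x1) + var2 / len(x2)) ** 0.5
    z = (mean(x1) - mean(x2) - dmu) / se
    # two-sided significance from the standard normal distribution
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


x_int = [50, 80, 40, 85, 70, 60, 90, 25, 40, 65, 98, 40]
x_nat = [20, 15, 30, 45, 70, 70]

z, p = z_is_sketch(x_int, x_nat)
print(z, p)
```

The printed values should agree with the statistic and p-value reported in Example 2 (about 1.693 and 0.090). Supplying `sigma1` and `sigma2` replaces the sample variances by the squared population standard deviations, and a non-zero `dmu` shifts the numerator.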
Before, After and Alternatives
Before this you might want some descriptive measures. Use me_mode_bin for the Mode for Binned Data, me_mean for different types of mean, and/or me_variation for different Measures of Quantitative Variation.
Visualisation options are vi_boxplot_single for a Box (and Whisker) Plot and vi_histogram for a Histogram.
After the test you might want an effect size measure. Options include: Common Language, Cohen d_s, Cohen U, Hedges g, Glass delta, biserial correlation, point-biserial correlation.
There are four similar tests, with different assumptions.
|test|equal variance|normality|
|-------|-----------|---------|
|Student t| yes | yes|
|Welch t | no | yes|
|Trimmed means | yes | no |
|Yuen-Welch|no | no |
Another test that in some cases could be used is the Z test
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
Example 1: Dataframe
>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df1['age']
>>> ex1 = ex1.replace("89 OR OLDER", "90")
>>> ts_z_is(df1['sex'], ex1)
   n FEMALE  n MALE  mean FEMALE  mean MALE     diff.  hyp. diff.  statistic   p-value                        test
0      1083     886    48.561404  47.760722  0.800681           0   0.998958  0.317815  independent samples z-test
Example 2: List
>>> scores = [20,50,80,15,40,85,30,45,70,60, None, 90,25,40,70,65, None, 70,98,40]
>>> groups = ["nat.","int.","int.","nat.","int.", "int.","nat.","nat.","int.","int.","int.","int.","int.","int.","nat.", "int." ,None,"nat.","int.","int."]
>>> ts_z_is(groups, scores)
   n int.  n nat.  mean int.  mean nat.  diff.  hyp. diff.  statistic   p-value                        test
0      12       6  61.916667  41.666667  20.25           0    1.69314  0.090429  independent samples z-test