Module stikpetP.tests.test_z_is

from statistics import mean, variance, NormalDist
import pandas as pd

def ts_z_is(catField, scaleField, categories=None, dmu=0, sigma1=None, sigma2=None):
    '''
    Independent Samples Z Test
    --------------------------
    A test to compare the means of two independent groups. It requires the population variances; if these are unknown and the sample sizes are large enough, the sample variances can be used instead.
    
    For smaller sample sizes a t-test (Student, Welch or Trimmed Means) could be used instead.
    
    Parameters
    ----------
    catField : dataframe or list 
        the categorical data
    scaleField : dataframe or list
        the scores
    categories : list, optional 
        the two categories of catField to use; if None, the two most frequent categories (after removing missing values) are used.
    dmu : float, optional 
        difference according to null hypothesis (default is 0)
    sigma1 : float, optional 
        population standard deviation of the first group, if None sample results will be used
    sigma2 : float, optional 
        population standard deviation of the second group, if None sample results will be used
        
    Returns
    -------
    A dataframe with:
    
    * *n cat. 1*, the sample size of the first category
    * *n cat. 2*, the sample size of the second category
    * *mean cat. 1*, the sample mean of the first category
    * *mean cat. 2*, the sample mean of the second category
    * *diff.*, difference between the two sample means
    * *hyp. diff.*, hypothesized difference between the two population means
    * *statistic*, the test statistic (z-value)
    * *p-value*, the significance (p-value)
    * *test*, name of test used
    
    Notes
    -----
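    The formula used is:
    $$z = \\frac{\\bar{x}_1 - \\bar{x}_2 - d_{H_0}}{SE}$$
    $$sig. = 2\\times\\left(1 - \\Phi\\left(\\left|z\\right|\\right)\\right)$$
    
    With:
    $$SE = \\sqrt{\\frac{\\sigma_1^2}{n_1} + \\frac{\\sigma_2^2}{n_2}}$$
    $$\\sigma_i^2 \\approx s_i^2 = \\frac{\\sum_{j=1}^{n_i} \\left(x_{i,j} - \\bar{x}_i\\right)^2}{n_i - 1}$$
    $$\\bar{x}_i = \\frac{\\sum_{j=1}^{n_i} x_{i,j}}{n_i}$$
    
    *Symbols used:*
    
    * \\(x_{i,j}\\) the j-th score in category i
    * \\(n_i\\) the number of scores in category i
    * \\(d_{H_0}\\) the difference between the two population means according to the null hypothesis (dmu)
    * \\(\\sigma_i^2\\) the population variance of category i, estimated by the sample variance \\(s_i^2\\) if not provided
    * \\(\\Phi\\left(\\dots\\right)\\) the cumulative distribution function of the standard normal distribution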

    Before, After and Alternatives
    ------------------------------
    Before this you might want some descriptive measures. Use [me_mode_bin](../measures/meas_mode_bin.html#me_mode_bin) for the Mode for Binned Data, [me_mean](../measures/meas_mean.html#me_mean) for different types of mean, and/or [me_variation](../measures/meas_variation.html#me_variation) for different Measures of Quantitative Variation.
    
    Useful visualisations are [vi_boxplot_single](../visualisations/vis_boxplot_single.html#vi_boxplot_single) for a Box (and Whisker) Plot and [vi_histogram](../visualisations/vis_histogram.html#vi_histogram) for a Histogram.
    
    After the test you might want an effect size measure, options include: [Common Language](../effect_sizes/eff_size_common_language_is.html), [Cohen d_s](../effect_sizes/eff_size_hedges_g_is.html), [Cohen U](../effect_sizes/eff_size_cohen_u.html), [Hedges g](../effect_sizes/eff_size_hedges_g_is.html), [Glass delta](../effect_sizes/eff_size_glass_delta.html), [biserial correlation](../correlations/cor_biserial.html), [point-biserial correlation](../effect_sizes/cor_point_biserial.html)
    
    There are four similar tests, with different assumptions. 
    
    |test|equal variance|normality|
    |-------|-----------|---------|
    |[Student t](../tests/test_student_t_is.html)| yes | yes|
    |[Welch t](../tests/test_welch_t_is.html) | no | yes|
    |[Trimmed means](../tests/test_trimmed_mean_is.html) | yes | no | 
    |[Yuen-Welch](../tests/test_trimmed_mean_is.html)|no | no |

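    Unlike these four, the [Z test](../tests/test_z_is.html) documented here assumes that the population variances are known, or that the samples are large enough to use the sample variances instead.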
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    Example 1: Dataframe
    >>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
    >>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> ex1 = df1['age']
    >>> ex1 = ex1.replace("89 OR OLDER", "90")
    >>> ts_z_is(df1['sex'], ex1)
       n FEMALE  n MALE  mean FEMALE  mean MALE     diff.  hyp. diff.  statistic   p-value                        test
    0      1083     886    48.561404  47.760722  0.800681           0   0.998958  0.317815  independent samples z-test
    
    Example 2: List
    >>> scores = [20,50,80,15,40,85,30,45,70,60, None, 90,25,40,70,65, None, 70,98,40]
    >>> groups = ["nat.","int.","int.","nat.","int.", "int.","nat.","nat.","int.","int.","int.","int.","int.","int.","nat.", "int." ,None,"nat.","int.","int."]
    >>> ts_z_is(groups, scores)
       n int.  n nat.  mean int.  mean nat.  diff.  hyp. diff.  statistic   p-value                        test
    0      12       6  61.916667  41.666667  20.25           0    1.69314  0.090429  independent samples z-test
    
    
    '''
    #convert lists to pandas series so both inputs can be combined
    if isinstance(catField, list):
        catField = pd.Series(catField)
    
    if isinstance(scaleField, list):
        scaleField = pd.Series(scaleField)
    
    #combine as one dataframe
    df = pd.concat([catField, scaleField], axis=1)
    df = df.dropna()
    
    #the two categories: as given, otherwise the two most frequent ones
    if categories is not None:
        cat1 = categories[0]
        cat2 = categories[1]
    else:
        counts = df.iloc[:,0].value_counts()
        cat1 = counts.index[0]
        cat2 = counts.index[1]
    
    #separate the scores for each category
    x1 = list(df.iloc[:,1][df.iloc[:,0] == cat1])
    x2 = list(df.iloc[:,1][df.iloc[:,0] == cat2])
    
    #make sure they are floats
    x1 = [float(x) for x in x1]
    x2 = [float(x) for x in x2]
    
    #sample sizes
    n1 = len(x1)
    n2 = len(x2)
    
    if sigma1 is None:
        var1 = variance(x1)
    else:
        var1 = sigma1**2
    
    if sigma2 is None:
        var2 = variance(x2)
    else:
        var2 = sigma2**2
        
    sse = var1/n1 + var2/n2
    se = (sse)**0.5
    
    m1 = mean(x1)
    m2 = mean(x2)
    
    z = (m1 - m2 - dmu)/se
    pValue = 2 * (1 - NormalDist().cdf(abs(z))) 
    statistic = z
    testUsed = "independent samples z-test"
    
    #the results (str() so that non-string category labels also work)
    colnames = ["n " + str(cat1), "n " + str(cat2), "mean " + str(cat1), "mean " + str(cat2), "diff.", "hyp. diff.", "statistic", "p-value", "test"]
    results = pd.DataFrame([[n1, n2, m1, m2, m1 - m2, dmu, statistic, pValue, testUsed]], columns=colnames)
    
    return results

Functions

def ts_z_is(catField, scaleField, categories=None, dmu=0, sigma1=None, sigma2=None)

Independent Samples Z Test

A test to compare the means of two independent groups. It requires the population variances; if these are unknown and the sample sizes are large enough, the sample variances can be used instead.

For smaller sample sizes a t-test (Student, Welch or Trimmed Means) could be used instead.

Parameters

catField : dataframe or list
the categorical data
scaleField : dataframe or list
the scores
categories : list, optional
the two categories of catField to use; if None, the two most frequent categories (after removing missing values) are used.
dmu : float, optional
difference according to null hypothesis (default is 0)
sigma1 : float, optional
population standard deviation of the first group, if None sample results will be used
sigma2 : float, optional
population standard deviation of the second group, if None sample results will be used
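
When `categories` is not given, the source code drops rows with a missing value and then takes the first two entries of pandas `value_counts()`, which sorts by frequency. The block below is a minimal sketch of that selection using only the standard library as a stand-in for the pandas calls. The helper name `pick_two_categories` is hypothetical and not part of the package; the data are those of Example 2.

```python
from collections import Counter

def pick_two_categories(groups, scores):
    """Drop incomplete pairs, then return the two most frequent labels.

    Hypothetical helper: a stdlib stand-in for the dropna() and
    value_counts() steps in the source code, treating None as missing.
    """
    # keep only pairs where both the label and the score are present
    pairs = [(g, s) for g, s in zip(groups, scores)
             if g is not None and s is not None]
    counts = Counter(g for g, _ in pairs)
    # most_common(2) gives the two labels with the highest counts
    cats = [label for label, _ in counts.most_common(2)]
    return cats, pairs

scores = [20, 50, 80, 15, 40, 85, 30, 45, 70, 60, None, 90, 25, 40, 70, 65,
          None, 70, 98, 40]
groups = ["nat.", "int.", "int.", "nat.", "int.", "int.", "nat.", "nat.",
          "int.", "int.", "int.", "int.", "int.", "int.", "nat.", "int.",
          None, "nat.", "int.", "int."]

cats, pairs = pick_two_categories(groups, scores)
print(cats)        # ['int.', 'nat.']
print(len(pairs))  # 18
```

This is why Example 2 reports "int." (12 scores) before "nat." (6 scores), even though "nat." appears first in the list. Pass `categories` explicitly to control which group is treated as the first one.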

Returns

A dataframe with:
 
  • n cat. 1, the sample size of the first category
  • n cat. 2, the sample size of the second category
  • mean cat. 1, the sample mean of the first category
  • mean cat. 2, the sample mean of the second category
  • diff., difference between the two sample means
  • hyp. diff., hypothesized difference between the two population means
  • statistic, the test statistic (z-value)
  • p-value, the significance (p-value)
  • test, name of test used

Notes

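The formula used is:

$$z = \frac{\bar{x}_1 - \bar{x}_2 - d_{H_0}}{SE}$$
$$sig. = 2\times\left(1 - \Phi\left(\left|z\right|\right)\right)$$

With:

$$SE = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
$$\sigma_i^2 \approx s_i^2 = \frac{\sum_{j=1}^{n_i} \left(x_{i,j} - \bar{x}_i\right)^2}{n_i - 1}$$
$$\bar{x}_i = \frac{\sum_{j=1}^{n_i} x_{i,j}}{n_i}$$

Symbols used:

  • x_{i,j} the j-th score in category i
  • n_i the number of scores in category i
  • d_{H_0} the difference between the two population means according to the null hypothesis (dmu)
  • \sigma_i^2 the population variance of category i, estimated by the sample variance s_i^2 if not provided
  • \Phi(\dots) the cumulative distribution function of the standard normal distribution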
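
To make the formulas concrete, the block below is a minimal sketch of the same calculation on two plain lists. It subtracts the hypothesized difference `dmu` in the numerator, as the source code does. The function name `z_independent` is hypothetical; the two lists are the "int." and "nat." scores from Example 2 after removing missing values, and the population standard deviations of 20 in the second call are made-up values for illustration only.

```python
from statistics import NormalDist, mean, variance

def z_independent(x1, x2, dmu=0.0, sigma1=None, sigma2=None):
    """Return (z, two-sided p-value) for two independent samples.

    Hypothetical helper that mirrors the formulas in the Notes section.
    """
    # known population variance if given, otherwise the sample variance
    var1 = variance(x1) if sigma1 is None else sigma1 ** 2
    var2 = variance(x2) if sigma2 is None else sigma2 ** 2
    # standard error of the difference between the two means
    se = (var1 / len(x1) + var2 / len(x2)) ** 0.5
    z = (mean(x1) - mean(x2) - dmu) / se
    # two-sided significance from the standard normal distribution
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

int_scores = [50, 80, 40, 85, 70, 60, 90, 25, 40, 65, 98, 40]
nat_scores = [20, 15, 30, 45, 70, 70]

# sample variances as estimates; should agree with Example 2
# (statistic about 1.69314, p-value about 0.090429)
z, p = z_independent(int_scores, nat_scores)
print(z, p)

# made-up population standard deviations of 20 for both groups:
# SE = sqrt(400/12 + 400/6) = 10, so z = 20.25 / 10 = 2.025
z_known, p_known = z_independent(int_scores, nat_scores, sigma1=20, sigma2=20)
print(z_known, p_known)
```

With only 12 and 6 scores the sample variances are rough estimates, so this data set mainly serves to check the arithmetic; for samples this small one of the t-tests is usually the safer choice.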

Before, After and Alternatives

Before this you might want some descriptive measures. Use me_mode_bin for the Mode for Binned Data, me_mean for different types of mean, and/or me_variation for different Measures of Quantitative Variation.

Useful visualisations are vi_boxplot_single for a Box (and Whisker) Plot and vi_histogram for a Histogram.

After the test you might want an effect size measure, options include: Common Language, Cohen d_s, Cohen U, Hedges g, Glass delta, biserial correlation, point-biserial correlation

There are four similar tests, with different assumptions.

|test|equal variance|normality|
|-------|-----------|---------|
|Student t| yes | yes |
|Welch t| no | yes |
|Trimmed means| yes | no |
|Yuen-Welch| no | no |

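Unlike these four, the Z test documented here assumes that the population variances are known, or that the samples are large enough to use the sample variances instead.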

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples

Example 1: Dataframe

>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df1['age']
>>> ex1 = ex1.replace("89 OR OLDER", "90")
>>> ts_z_is(df1['sex'], ex1)
   n FEMALE  n MALE  mean FEMALE  mean MALE     diff.  hyp. diff.  statistic   p-value                        test
0      1083     886    48.561404  47.760722  0.800681           0   0.998958  0.317815  independent samples z-test

Example 2: List

>>> scores = [20,50,80,15,40,85,30,45,70,60, None, 90,25,40,70,65, None, 70,98,40]
>>> groups = ["nat.","int.","int.","nat.","int.", "int.","nat.","nat.","int.","int.","int.","int.","int.","int.","nat.", "int." ,None,"nat.","int.","int."]
>>> ts_z_is(groups, scores)
   n int.  n nat.  mean int.  mean nat.  diff.  hyp. diff.  statistic   p-value                        test
0      12       6  61.916667  41.666667  20.25           0    1.69314  0.090429  independent samples z-test