Module `stikpetP.tests.test_z_ps`

import pandas as pd
from statistics import NormalDist

def ts_z_ps(field1, field2, dmu=0, dsigma=None):
    '''
    Z-test (Paired Samples)
    -----------------------
    This test is typically used when the sample size is large. For smaller sample sizes, a paired-samples Student t-test is usually used instead.
    
    The null hypothesis for this test is that the difference between the two population means equals a pre-defined value, usually zero (i.e. the (arithmetic) means are the same in the population). If the p-value (significance) is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected.
    
    Parameters
    ----------
    field1 : pandas series
        the ordinal or scale scores of the first variable
    field2 : pandas series
        the ordinal or scale scores of the second variable
    dmu : float, optional 
        hypothesized difference. Default is zero
    dsigma : float, optional
        population standard deviation of the differences. Default is None, in which case the sample standard deviation is used.
        
    Returns
    -------
    res : dataframe with 
    
    * *n*, the number of scores
    * *z*, the test statistic (z-value)
    * *p-value*, significance (p-value)
    
    Notes
    -----
    The formula used is:
    $$z = \\frac{\\bar{d} - d_{H0}}{SE}$$
    $$sig. = 2\\times\\left(1 - \\Phi\\left(\\left|z\\right|\\right)\\right)$$
    
    With:
    $$\\bar{d} = \\mu_1 - \\mu_2 \\approx \\bar{x}_1 - \\bar{x}_2$$
    $$SE = \\sqrt{\\frac{\\sigma_d^2}{n}}\\approx\\sqrt{\\frac{s_d^2}{n}}$$
    $$s_d^2 = \\frac{\\sum_{i=1}^n \\left(d_i -\\bar{d}\\right)^2}{n-1}$$
    $$d_i = x_{i,1} - x_{i,2}$$
    $$\\bar{d}=\\frac{\\sum_{i=1}^n d_i}{n}$$
    
    *Symbols used*
    
    * \\(x_{i,1}\\), is the i-th score from the first variable
    * \\(x_{i,2}\\), is the i-th score from the second variable
    * \\(d_{H0}\\), difference according to null hypothesis (dmu parameter)
    * \\(\\Phi\\left(\\dots\\right)\\), cumulative distribution function of the standard normal distribution.
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    
    '''
    
    # Accept plain lists by converting them to pandas Series
    if isinstance(field1, list):
        field1 = pd.Series(field1)
        
    if isinstance(field2, list):
        field2 = pd.Series(field2)
    
    data = pd.concat([field1, field2], axis=1)
    data.columns = ["field1", "field2"]
    # Remove rows with missing values and reset the index
    # (reset_index returns a new frame, so the result must be assigned)
    data = data.dropna().reset_index(drop=True)
    
    #overall n
    n = len(data["field1"])
    
    data["diffs"] = data["field1"] - data["field2"]
    if dsigma is None:
        dsigma = data["diffs"].std()
        
    dm = data["diffs"].mean()
    se = dsigma/n**0.5
    z = (dm - dmu)/se
    pvalue = 2 * (1 - NormalDist().cdf(abs(z)))
    
    res = pd.DataFrame([[n, z, pvalue]])
    res.columns = ["n", "z", "p-value"]
    
    return res

Functions

def ts_z_ps(field1, field2, dmu=0, dsigma=None)

Z-test (Paired Samples)

This test is typically used when the sample size is large. For smaller sample sizes, a paired-samples Student t-test is usually used instead.

The null hypothesis for this test is that the difference between the two population means equals a pre-defined value, usually zero (i.e. the (arithmetic) means are the same in the population). If the p-value (significance) is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected.

Parameters

field1 : pandas series: the ordinal or scale scores of the first variable
field2 : pandas series: the ordinal or scale scores of the second variable
dmu : float, optional: hypothesized difference. Default is zero
dsigma : float, optional: population standard deviation of the differences. Default is None, in which case the sample standard deviation is used.
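
In the source shown on this page the standard error is computed as `dsigma/n**0.5`, so `dsigma` acts as a standard deviation, not a variance. The sketch below illustrates the difference using only the standard library. The data and the value `known_variance = 4.0` are made up for illustration and are not part of the package.

```python
from math import sqrt
from statistics import stdev

# Made-up paired differences, for illustration only
diffs = [2, 1, 0, 2, 1, 3, 1, 1]
n = len(diffs)

# dsigma=None: fall back to the sample standard deviation (n - 1 denominator)
se_sample = stdev(diffs) / sqrt(n)

# A known population variance (assumed to be 4.0 here) has to be converted
# to a standard deviation before it can play the role of dsigma
known_variance = 4.0
dsigma = sqrt(known_variance)
se_known = dsigma / sqrt(n)

print(se_sample, se_known)
```

Passing the variance itself would, under this reading of the source, inflate the standard error whenever the variance exceeds 1 and shrink it otherwise.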

Returns

res : dataframe with

n, the number of scores
z, the test statistic (z-value)
p-value, significance (p-value)

Notes

The formula used is:

$$z = \frac{\bar{d} - d_{H0}}{SE}$$
$$sig. = 2\times\left(1 - \Phi\left(\left|z\right|\right)\right)$$

With:

$$\bar{d} = \mu_1 - \mu_2 \approx \bar{x}_1 - \bar{x}_2$$
$$SE = \sqrt{\frac{\sigma_d^2}{n}}\approx\sqrt{\frac{s_d^2}{n}}$$
$$s_d^2 = \frac{\sum_{i=1}^n \left(d_i -\bar{d}\right)^2}{n-1}$$
$$d_i = x_{i,1} - x_{i,2}$$
$$\bar{d}=\frac{\sum_{i=1}^n d_i}{n}$$

Symbols used

$x_{i,1}$, the i-th score from the first variable
$x_{i,2}$, the i-th score from the second variable
$d_{H0}$, the difference according to the null hypothesis (dmu parameter)
$\Phi\left(\dots\right)$, the cumulative distribution function of the standard normal distribution.
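
The formulas above can be followed step by step with the standard library alone. This is a minimal sketch, assuming complete pairs (no missing values) and a sample-based standard error. The helper name `paired_z` and the scores are hypothetical and not part of the package.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def paired_z(x1, x2, dmu=0.0):
    """Hypothetical helper following the formulas in the Notes section."""
    # d_i = x_{i,1} - x_{i,2}
    diffs = [a - b for a, b in zip(x1, x2)]
    n = len(diffs)
    # d-bar, the mean of the differences
    d_bar = mean(diffs)
    # SE = sqrt(s_d^2 / n); stdev uses the n - 1 denominator
    se = stdev(diffs) / sqrt(n)
    z = (d_bar - dmu) / se
    # Two-sided significance from the standard normal CDF
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return n, z, p

# Made-up scores, for illustration only
before = [12, 15, 11, 14, 13, 16, 12, 15]
after = [10, 14, 11, 12, 12, 13, 11, 14]
n, z, p = paired_z(before, after)
print(n, z, p)
```

With only eight pairs a t-test would normally be preferred. The small sample is used here only to keep the arithmetic easy to check by hand.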

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
