Module stikpetP.tests.test_z_ps
import pandas as pd
from statistics import NormalDist
def ts_z_ps(field1, field2, dmu=0, dsigma=None):
    '''
    Z-test (Paired Samples)
    -----------------------
    This test is often used if there is a large sample size. For smaller sample sizes, a Student t-test is usually used.

    The assumption about the population (null hypothesis) for this test is a pre-defined difference between two means, usually zero (i.e. the difference between the (arithmetic) means is zero, they are the same in the population). If the p-value (significance) is then below a pre-defined threshold (usually 0.05), the assumption is rejected.

    Parameters
    ----------
    field1 : pandas series
        the ordinal or scale scores of the first variable
    field2 : pandas series
        the ordinal or scale scores of the second variable
    dmu : float, optional
        hypothesized difference. Default is zero
    dsigma : float, optional
        population standard deviation of the differences. Default is None, in which case the sample standard deviation of the differences is used.

    Returns
    -------
    res : dataframe with
    * *n*, the number of scores
    * *z*, the test statistic (z-value)
    * *p-value*, significance (p-value)

    Notes
    -----
    The formula used is:
    $$z = \\frac{\\bar{d} - d_{H0}}{SE}$$
    $$sig. = 2\\times\\left(1 - \\Phi\\left(\\left|z\\right|\\right)\\right)$$

    With:
    $$\\bar{d} = \\mu_1 - \\mu_2 \\approx \\bar{x}_1 - \\bar{x}_2$$
    $$SE = \\sqrt{\\frac{\\sigma_d^2}{n}}\\approx\\sqrt{\\frac{s_d^2}{n}}$$
    $$s_d^2 = \\frac{\\sum_{i=1}^n \\left(d_i -\\bar{d}\\right)^2}{n-1}$$
    $$d_i = x_{i,1} - x_{i,2}$$
    $$\\bar{d}=\\frac{\\sum_{i=1}^n d_i}{n}$$

    *Symbols used*
    * \\(x_{i,1}\\), is the i-th score from the first variable
    * \\(x_{i,2}\\), is the i-th score from the second variable
    * \\(d_{H0}\\), difference according to null hypothesis (dmu parameter)
    * \\(\\Phi\\left(\\dots\\right)\\), cumulative distribution function of the standard normal distribution.

    Author
    ------
    Made by P. Stikker

    Companion website: https://PeterStatistics.com
    YouTube channel: https://www.youtube.com/stikpet
    Donations: https://www.patreon.com/bePatron?u=19398076
    '''
    # Accept plain lists as well as pandas Series
    if isinstance(field1, list):
        field1 = pd.Series(field1)
    if isinstance(field2, list):
        field2 = pd.Series(field2)

    data = pd.concat([field1, field2], axis=1)
    data.columns = ["field1", "field2"]

    # Remove rows with missing values and reset index
    data = data.dropna()
    data = data.reset_index(drop=True)

    # overall n (number of complete pairs)
    n = len(data["field1"])

    # paired differences
    data["diffs"] = data["field1"] - data["field2"]

    # without a population value, use the sample standard deviation
    if dsigma is None:
        dsigma = data["diffs"].std()

    dm = data["diffs"].mean()
    se = dsigma/n**0.5
    z = (dm - dmu)/se
    pvalue = 2 * (1 - NormalDist().cdf(abs(z)))

    res = pd.DataFrame([[n, z, pvalue]])
    res.columns = ["n", "z", "p-value"]
    return res
Functions
def ts_z_ps(field1, field2, dmu=0, dsigma=None)
Z-test (Paired Samples)
This test is often used if there is a large sample size. For smaller sample sizes, a Student t-test is usually used.
The assumption about the population (null hypothesis) for this test is a pre-defined difference between two means, usually zero (i.e. the difference between the (arithmetic) means is zero, they are the same in the population). If the p-value (significance) is then below a pre-defined threshold (usually 0.05), the assumption is rejected.
Parameters
field1 : pandas series
the ordinal or scale scores of the first variable
field2 : pandas series
the ordinal or scale scores of the second variable
dmu : float, optional
hypothesized difference. Default is zero
dsigma : float, optional
population standard deviation of the differences. Default is None, in which case the sample standard deviation of the differences is used.
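How `dmu` and `dsigma` enter the calculation may be easier to see in isolation. The sketch below is a standard-library stand-in for the arithmetic in the source listing, not the package function itself: `paired_z` and the difference scores are hypothetical names and made-up data. It assumes, as the listing does, that `dsigma` is divided by the square root of n, so it is treated as a standard deviation.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_z(diffs, dmu=0.0, dsigma=None):
    """Hypothetical helper mirroring the arithmetic of the source listing."""
    n = len(diffs)
    if dsigma is None:
        # no population value supplied: use the sample standard deviation
        dsigma = stdev(diffs)
    se = dsigma / sqrt(n)            # dsigma acts as a standard deviation here
    z = (mean(diffs) - dmu) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Made-up paired differences (first score minus second score)
diffs = [1, 1, 0, 1, 2, 1, 0, 2]

z_est, p_est = paired_z(diffs)                 # sample standard deviation
z_known, p_known = paired_z(diffs, dsigma=1.0) # assumed known population value
z_shift, p_shift = paired_z(diffs, dmu=1.0)    # H0: mean difference equals 1

print(z_est, z_known, z_shift)
```

With these numbers the mean difference is exactly 1, so testing against `dmu=1.0` gives a z-value of zero and a two-sided p-value of 1.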
Returns
res : dataframe with
- n, the number of scores
- z, the test statistic (z-value)
- p-value, significance (p-value)
Notes
The formula used is:
$$z = \frac{\bar{d} - d_{H0}}{SE}$$
$$sig. = 2\times\left(1 - \Phi\left(\left|z\right|\right)\right)$$
With:
$$\bar{d} = \mu_1 - \mu_2 \approx \bar{x}_1 - \bar{x}_2$$
$$SE = \sqrt{\frac{\sigma_d^2}{n}}\approx\sqrt{\frac{s_d^2}{n}}$$
$$s_d^2 = \frac{\sum_{i=1}^n \left(d_i -\bar{d}\right)^2}{n-1}$$
$$d_i = x_{i,1} - x_{i,2}$$
$$\bar{d}=\frac{\sum_{i=1}^n d_i}{n}$$
Symbols used
- x_{i,1}, is the i-th score from the first variable
- x_{i,2}, is the i-th score from the second variable
- d_{H0}, difference according to null hypothesis (dmu parameter)
- \Phi\left(\dots\right), cumulative distribution function of the standard normal distribution.
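The formulas in the Notes can be traced step by step on a small data set. The following is a minimal sketch using only the Python standard library. The two score lists are invented for illustration, and the variable names simply follow the symbols above. With only eight pairs a t-test would normally be preferred, so the numbers serve purely to show the arithmetic.

```python
from math import sqrt
from statistics import NormalDist

# Invented paired scores, for illustration only
x1 = [5, 7, 6, 8, 9, 6, 7, 8]
x2 = [4, 6, 6, 7, 7, 5, 7, 6]
d_h0 = 0  # difference according to the null hypothesis

# d_i = x_{i,1} - x_{i,2}
d = [a - b for a, b in zip(x1, x2)]
n = len(d)

# \bar{d} = sum(d_i) / n
d_bar = sum(d) / n

# s_d^2 = sum((d_i - \bar{d})^2) / (n - 1)
s2_d = sum((di - d_bar) ** 2 for di in d) / (n - 1)

# SE is approximated by sqrt(s_d^2 / n)
se = sqrt(s2_d / n)

# z = (\bar{d} - d_{H0}) / SE
z = (d_bar - d_h0) / se

# sig. = 2 * (1 - Phi(|z|))
sig = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"n={n}, d_bar={d_bar}, z={z:.4f}, sig={sig:.6f}")
```

Here the differences sum to 8, so the mean difference is 1, the sample variance is 4/7, and z equals the square root of 14 (about 3.74).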
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076