Module stikpetP.tests.test_student_t_is
from statistics import mean, variance
from scipy.stats import t
import pandas as pd
def ts_student_t_is(catField, scaleField, categories=None, dmu=0):
'''
Student t Test (Independent Samples)
------------------------------------
A test to compare two means. The null hypothesis would be that the means of each category are equal in the population.
The test assumes that the variances of the scores are equal in the two populations. If this is not the case, a Welch t-test could be used. Ruxton (2006) even argues that the Welch t-test should always be preferred over the Student t-test.
The test is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Tests/Student-t.html)
Parameters
----------
catField : dataframe or list
the categorical data
scaleField : dataframe or list
the scores
categories : list, optional
to indicate which two categories of catField to use, otherwise the two most frequent categories will be used.
dmu : float, optional
difference according to null hypothesis (default is 0)
Returns
-------
A dataframe with:
* *n cat. 1*, the sample size of the first category
* *n cat. 2*, the sample size of the second category
* *mean cat. 1*, the sample mean of the first category
* *mean cat. 2*, the sample mean of the second category
* *diff.*, difference between the two sample means
* *hyp. diff.*, hypothesized difference between the two population means
* *statistic*, the test statistic (t-value)
* *df*, the degrees of freedom
* *p-value*, the significance (two-sided p-value)
* *test*, name of test used
Notes
-----
The formula used is (with \\(d_{H_0}\\) the hypothesized difference, `dmu`):
$$t = \\frac{\\bar{x}_1 - \\bar{x}_2 - d_{H_0}}{SE}$$
$$df = n_1 + n_2 - 2$$
$$sig. = 2\\times\\left(1 - T\\left(\\left|t\\right|, df\\right)\\right)$$
With:
$$SE = s_p\\times\\sqrt{\\frac{1}{n_1} + \\frac{1}{n_2}}$$
$$s_p = \\sqrt{\\frac{\\left(n_1 - 1\\right)\\times s_1^2 + \\left(n_2 - 1\\right)\\times s_2^2}{df}}$$
$$s_i^2 = \\frac{\\sum_{j=1}^{n_i} \\left(x_{i,j} - \\bar{x}_i\\right)^2}{n_i - 1}$$
$$\\bar{x}_i = \\frac{\\sum_{j=1}^{n_i} x_{i,j}}{n_i}$$
*Symbols used:*
* \\(x_{i,j}\\) the j-th score in category i
* \\(n_i\\) the number of scores in category i
Before, After and Alternatives
------------------------------
Before this you might want some descriptive measures. Use [me_mode_bin](../measures/meas_mode_bin.html#me_mode_bin) for Mode for Binned Data, [me_mean](../measures/meas_mean.html#me_mean) for different types of mean, and/or [me_variation](../measures/meas_variation.html#me_variation) for different Measures of Quantitative Variation
For a visualisation, use [vi_boxplot_single](../visualisations/vis_boxplot_single.html#vi_boxplot_single) for a Box (and Whisker) Plot or [vi_histogram](../visualisations/vis_histogram.html#vi_histogram) for a Histogram.
After the test you might want an effect size measure, options include: [Common Language](../effect_sizes/eff_size_common_language_is.html), [Cohen d_s](../effect_sizes/eff_size_hedges_g_is.html), [Cohen U](../effect_sizes/eff_size_cohen_u.html), [Hedges g](../effect_sizes/eff_size_hedges_g_is.html), [Glass delta](../effect_sizes/eff_size_glass_delta.html), [biserial correlation](../correlations/cor_biserial.html), [point-biserial correlation](../effect_sizes/cor_point_biserial.html)
There are four similar tests, with different assumptions.
|test|equal variance|normality|
|-------|-----------|---------|
|[Student t](../tests/test_student_t_is.html)| yes | yes|
|[Welch t](../tests/test_welch_t_is.html) | no | yes|
|[Trimmed means](../tests/test_trimmed_mean_is.html) | yes | no |
|[Yuen-Welch](../tests/test_trimmed_mean_is.html)|no | no |
Another test that in some cases could be used is the [Z test](../tests/test_z_is.html)
References
----------
Ruxton, G. D. (2006). The unequal variance t-test is an underused alternative to Student’s t-test and the Mann–Whitney U test. *Behavioral Ecology, 17*(4), 688–690. https://doi.org/10.1093/beheco/ark016
Student. (1908). The probable error of a mean. *Biometrika, 6*(1), 1–25. https://doi.org/10.1093/biomet/6.1.1
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
--------
Example 1: Dataframe
>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df1['age']
>>> ex1 = ex1.replace("89 OR OLDER", "90")
>>> ts_student_t_is(df1['sex'], ex1)
n FEMALE n MALE mean FEMALE mean MALE diff. hyp. diff. statistic df p-value test
0 1083 886 48.561404 47.760722 0.800681 0 0.99833 1967 0.318242 Student independent samples t-test
Example 2: List
>>> scores = [20,50,80,15,40,85,30,45,70,60, None, 90,25,40,70,65, None, 70,98,40]
>>> groups = ["nat.","int.","int.","nat.","int.", "int.","nat.","nat.","int.","int.","int.","int.","int.","int.","nat.", "int." ,None,"nat.","int.","int."]
>>> ts_student_t_is(groups, scores)
n int. n nat. mean int. mean nat. diff. hyp. diff. statistic df p-value test
0 12 6 61.916667 41.666667 20.25 0 1.716401 16 0.105382 Student independent samples t-test
'''
    #convert to pandas series if needed
    if type(catField) is list:
        catField = pd.Series(catField)
    if type(scaleField) is list:
        scaleField = pd.Series(scaleField)
    #combine as one dataframe
    df = pd.concat([catField, scaleField], axis=1)
    df = df.dropna()
    #the two categories
    if categories is not None:
        cat1 = categories[0]
        cat2 = categories[1]
    else:
        cat1 = df.iloc[:,0].value_counts().index[0]
        cat2 = df.iloc[:,0].value_counts().index[1]
    #separate the scores for each category
    x1 = list(df.iloc[:,1][df.iloc[:,0] == cat1])
    x2 = list(df.iloc[:,1][df.iloc[:,0] == cat2])
    #make sure they are floats
    x1 = [float(x) for x in x1]
    x2 = [float(x) for x in x2]
    n1 = len(x1)
    n2 = len(x2)
    n = n1 + n2
    var1 = variance(x1)
    var2 = variance(x2)
    sp = (((n1 - 1)*var1 + (n2 - 1)*var2)/(n1 + n2 - 2))**0.5
    se = sp*(1/n1 + 1/n2)**0.5
    m1 = mean(x1)
    m2 = mean(x2)
    tValue = (m1 - m2 - dmu)/se
    df = n - 2
    pValue = 2*(1-t.cdf(abs(tValue), df))
    statistic = tValue
    testUsed = "Student independent samples t-test"
    colnames = ["n "+cat1, "n "+cat2, "mean "+cat1, "mean "+cat2, "diff.", "hyp. diff.", "statistic", "df", "p-value", "test"]
    results = pd.DataFrame([[n1, n2, m1, m2, m1 - m2, dmu, statistic, df, pValue, testUsed]], columns=colnames)
    return(results)
Functions
def ts_student_t_is(catField, scaleField, categories=None, dmu=0)
Student t Test (Independent Samples)
A test to compare two means. The null hypothesis would be that the means of each category are equal in the population.
The test assumes that the variances of the scores are equal in the two populations. If this is not the case, a Welch t-test could be used. Ruxton (2006) even argues that the Welch t-test should always be preferred over the Student t-test.
The test is also described at PeterStatistics.com
Parameters
catField : dataframe or list
the categorical data
scaleField : dataframe or list
the scores
categories : list, optional
to indicate which two categories of catField to use, otherwise the two most frequent categories will be used.
dmu : float, optional
difference according to null hypothesis (default is 0)
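As a rough illustration of how the two inputs are combined, the sketch below mirrors the preprocessing steps visible in the source code on this page: both lists become pandas Series, rows with a missing value in either field are dropped, and, when `categories` is not given, the two categories with the highest counts are taken. The variable names (`paired`, `counts`) are illustrative only and are not part of the package.

```python
import pandas as pd

# Data from Example 2 of this page
scores = [20, 50, 80, 15, 40, 85, 30, 45, 70, 60, None, 90, 25, 40, 70, 65, None, 70, 98, 40]
groups = ["nat.", "int.", "int.", "nat.", "int.", "int.", "nat.", "nat.", "int.", "int.",
          "int.", "int.", "int.", "int.", "nat.", "int.", None, "nat.", "int.", "int."]

# Pair the two fields and drop every row where either value is missing
paired = pd.concat([pd.Series(groups), pd.Series(scores)], axis=1).dropna()

# value_counts() sorts by frequency, so these are the two most frequent categories
counts = paired.iloc[:, 0].value_counts()
cat1, cat2 = counts.index[0], counts.index[1]

# Scores belonging to each of the two categories, as floats
x1 = paired.loc[paired.iloc[:, 0] == cat1].iloc[:, 1].astype(float).tolist()
x2 = paired.loc[paired.iloc[:, 0] == cat2].iloc[:, 1].astype(float).tolist()

print(cat1, len(x1), cat2, len(x2))
```

Passing `categories` explicitly is the safer choice when the field holds more than two categories, or when the order of the two groups (and hence the sign of the difference) matters.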
Returns
A dataframe with:
- n cat. 1, the sample size of the first category
- n cat. 2, the sample size of the second category
- mean cat. 1, the sample mean of the first category
- mean cat. 2, the sample mean of the second category
- diff., difference between the two sample means
- hyp. diff., hypothesized difference between the two population means
- statistic, the test statistic (t-value)
- df, the degrees of freedom
- p-value, the significance (two-sided p-value)
- test, name of test used
Notes
The formula used is (with d_{H_0} the hypothesized difference, dmu):
$$t = \frac{\bar{x}_1 - \bar{x}_2 - d_{H_0}}{SE}$$
$$df = n_1 + n_2 - 2$$
$$sig. = 2\times\left(1 - T\left(\left|t\right|, df\right)\right)$$
With:
$$SE = s_p\times\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
$$s_p = \sqrt{\frac{\left(n_1 - 1\right)\times s_1^2 + \left(n_2 - 1\right)\times s_2^2}{df}}$$
$$s_i^2 = \frac{\sum_{j=1}^{n_i} \left(x_{i,j} - \bar{x}_i\right)^2}{n_i - 1}$$
$$\bar{x}_i = \frac{\sum_{j=1}^{n_i} x_{i,j}}{n_i}$$
Symbols used:
- x_{i,j} the j-th score in category i
- n_i the number of scores in category i
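The formulas above can be traced step by step with the standard library alone. The sketch below is a minimal hand computation for the two groups of Example 2 (after removing the missing values); the helper name `pooled_t` is hypothetical and not part of the package. It stops at the t-value and degrees of freedom, since the p-value additionally needs the cumulative t distribution.

```python
from statistics import mean, variance

def pooled_t(x1, x2, dmu=0.0):
    """Return (t, df, se) for the pooled-variance (Student) t statistic."""
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    # pooled standard deviation s_p, built from the two sample variances
    sp = (((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / df) ** 0.5
    # standard error of the difference between the two means
    se = sp * (1 / n1 + 1 / n2) ** 0.5
    # subtract the hypothesized difference before standardizing
    t_value = (mean(x1) - mean(x2) - dmu) / se
    return t_value, df, se

# Scores of Example 2, split by group, missing values already removed
intl = [50, 80, 40, 85, 70, 60, 90, 25, 40, 65, 98, 40]
natl = [20, 15, 30, 45, 70, 70]

t_value, df, se = pooled_t(intl, natl)
print(t_value, df)  # should agree with the statistic and df reported in Example 2

# With the hypothesized difference set to the observed difference, t is (numerically) zero
t_shifted, _, _ = pooled_t(intl, natl, dmu=20.25)
print(t_shifted)
```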
Before, After and Alternatives
Before this you might want some descriptive measures. Use me_mode_bin for Mode for Binned Data, me_mean for different types of mean, and/or me_variation for different Measures of Quantitative Variation
For a visualisation, use vi_boxplot_single for a Box (and Whisker) Plot or vi_histogram for a Histogram.
After the test you might want an effect size measure, options include: Common Language, Cohen d_s, Cohen U, Hedges g, Glass delta, biserial correlation, point-biserial correlation
There are four similar tests, with different assumptions.
|test|equal variance|normality|
|-------|-----------|---------|
|Student t| yes | yes |
|Welch t| no | yes |
|Trimmed means| yes | no |
|Yuen-Welch| no | no |
Another test that in some cases could be used is the Z test.
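To see how much the equal-variance assumption matters for a given data set, the Student and Welch results can be put side by side. The sketch below uses `scipy.stats.ttest_ind` rather than the functions of this package, so it is an independent cross-check, not the package's own implementation; the data are the two groups of Example 2.

```python
from scipy.stats import ttest_ind

# Scores of Example 2, split by group, missing values already removed
intl = [50, 80, 40, 85, 70, 60, 90, 25, 40, 65, 98, 40]
natl = [20, 15, 30, 45, 70, 70]

# equal_var=True gives the pooled-variance Student test described on this page
student = ttest_ind(intl, natl, equal_var=True)

# equal_var=False gives the Welch test, which does not assume equal variances
welch = ttest_ind(intl, natl, equal_var=False)

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```

The Student line should reproduce the statistic and p-value shown in Example 2. If the two results differ noticeably, that is a hint that the equal-variance assumption deserves a closer look.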
References
Ruxton, G. D. (2006). The unequal variance t-test is an underused alternative to Student’s t-test and the Mann–Whitney U test. Behavioral Ecology, 17(4), 688–690. https://doi.org/10.1093/beheco/ark016
Student. (1908). The probable error of a mean. Biometrika, 6(1), 1–25. https://doi.org/10.1093/biomet/6.1.1
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
Example 1: Dataframe
>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df1['age']
>>> ex1 = ex1.replace("89 OR OLDER", "90")
>>> ts_student_t_is(df1['sex'], ex1)
n FEMALE n MALE mean FEMALE mean MALE diff. hyp. diff. statistic df p-value test
0 1083 886 48.561404 47.760722 0.800681 0 0.99833 1967 0.318242 Student independent samples t-test
Example 2: List
>>> scores = [20,50,80,15,40,85,30,45,70,60, None, 90,25,40,70,65, None, 70,98,40]
>>> groups = ["nat.","int.","int.","nat.","int.", "int.","nat.","nat.","int.","int.","int.","int.","int.","int.","nat.", "int." ,None,"nat.","int.","int."]
>>> ts_student_t_is(groups, scores)
n int. n nat. mean int. mean nat. diff. hyp. diff. statistic df p-value test
0 12 6 61.916667 41.666667 20.25 0 1.716401 16 0.105382 Student independent samples t-test