Module `stikpetP.tests.test_trimmed_mean_os`

import pandas as pd
import math
from scipy.stats import t

def ts_trimmed_mean_os(data, mu=None, trimProp=0.1, se="yuen"):
    '''
    One-Sample Trimmed (Yuen or Yuen-Welch) Mean Test
    -------------------------------------------------
    
    A variation on a one-sample Student t-test where the data is first trimmed, and the Winsorized variance is used.
    
    The assumption about the population for this test is that the mean in the population is equal to the provided mu value. The test shows the probability of the found test statistic, or a more extreme one, if this assumption were true. If this probability is below a chosen threshold (usually 0.05), the assumption is rejected.

    This function is shown in this [YouTube video](https://youtu.be/jh2IYmhwctg) and the test is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Tests/TrimmedMeanOneSample.html)
    
    Parameters
    ----------
    data : list or pandas data series 
        the data as numbers
    mu : float, optional 
        hypothesized mean, otherwise the midrange will be used
    trimProp : float, optional 
        proportion to trim in total. Default is 0.1 (i.e. 0.05 from each side)
    se : {"yuen", "wilcox"}, optional 
        method to use to determine the standard error. Default is "yuen"
    
    Returns
    -------
    pandas.DataFrame
        A dataframe with the following columns:
    
        * *trim. mean*, the sample trimmed mean
        * *mu*, hypothesized mean
        * *SE*, the standard error
        * *statistic*, the test statistic (t-value)
        * *df*, degrees of freedom
        * *p-value*, p-value (sig.)
        * *test used*, name of test used
    
    Notes
    -----
    The formula used is:
    $$t = \\frac{\\bar{x}_t - \\mu_{H_0}}{SE}$$
    $$sig = 2\\times\\left(1 - T\\left(\\left|t\\right|, df\\right)\\right)$$
    
    With:
    $$\\bar{x}_t = \\frac{\\sum_{i=g+1}^{n - g}y_i}{m}$$
    $$g = \\lfloor n\\times p_t\\rfloor$$
    $$m = n - 2\\times g$$
    $$SE = \\sqrt{\\frac{SSD_w}{m\\times\\left(m - 1\\right)}}$$
    $$SSD_w = g\\times\\left(y_{g+1} - \\bar{x}_w\\right)^2 + g\\times\\left(y_{n-g} - \\bar{x}_w\\right)^2 + \\sum_{i=g+1}^{n - g} \\left(y_i - \\bar{x}_w\\right)^2$$
    $$\\bar{x}_w = \\frac{\\bar{x}_t\\times m + g\\times\\left(y_{g+1} + y_{n-g}\\right)}{n}$$
    
    If *se="wilcox"* is used, the formula for SE will be adjusted to:
    $$SE = \\frac{\\sqrt{\\frac{SSD_w}{n - 1}}}{\\left(1 - 2\\times p_t\\right)\\times\\sqrt{n}}$$
    
    *Symbols used:*
    
    * $\\bar{x}_t$ the trimmed mean of the scores
    * $\\bar{x}_w$ the Winsorized mean
    * $SSD_w$ the sum of squared deviations from the Winsorized mean
    * $g$ the number of scores trimmed from each side
    * $m$ the number of scores in the trimmed data set
    * $y_i$ the i-th score after the scores are sorted from low to high
    * $p_t$ the proportion of scores trimmed from each side, i.e. half of *trimProp*
    * $df$ the degrees of freedom, $m - 1$
    
    The test is often also referred to as a Yuen test, or Yuen-Welch test.
    
    The standard error can be calculated in two ways. The first SE formula can be found, for example, in Tukey and McLaughlin (1963, p. 342), and appears to match the independent-samples version of this test as proposed by Yuen (1974, p. 167).
    
    The second version is the one used in libraries for the software R, and can be found in Wilcox (2012, p. 157) or Peró-Cebollero and Guàrdia-Olmos (2013, p. 409).

    Before, After and Alternatives
    ------------------------------
    Before this you might want to create a binned frequency table or a visualisation:
    * [tab_frequency_bins](../other/table_frequency_bins.html#tab_frequency_bins) to create a binned frequency table
    * [vi_boxplot_single](../visualisations/vis_boxplot_single.html#vi_boxplot_single) for a Box (and Whisker) Plot
    * [vi_histogram](../visualisations/vis_histogram.html#vi_histogram) for a Histogram
    * [vi_stem_and_leaf](../visualisations/vis_stem_and_leaf.html#vi_stem_and_leaf) for a Stem-and-Leaf Display

    After this you might want an effect size measure:
    * [es_cohen_d_os](../effect_sizes/eff_size_cohen_d_os.html#es_cohen_d_os) for Cohen d'
    * [es_hedges_g_os](../effect_sizes/eff_size_hedges_g_os.html#es_hedges_g_os) for Hedges g
    * [es_common_language_os](../eff_size_common_language_os/meas_variation.html#es_common_language_os) for the Common Language Effect Size
    
    Alternative Tests:
    * [ts_student_t_os](../tests/test_student_t_os.html#ts_student_t_os) for One-Sample Student t-Test
    * [ts_z_os](../tests/test_z_os.html#ts_z_os) for One-Sample Z Test
    
    References 
    ----------
    Peró-Cebollero, M., & Guàrdia-Olmos, J. (2013). The adequacy of different robust statistical tests in comparing two independent groups. *Psicológica*, 34, 407–424.
    
    Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. *Sankhyā: The Indian Journal of Statistics, 25*(3), 331–352.
    
    Wilcox, R. R. (2012). *Introduction to robust estimation and hypothesis testing* (3rd ed.). Academic Press.
    
    Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. *Biometrika, 61*(1), 165–170. doi:10.1093/biomet/61.1.165
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    ---------
    >>> pd.set_option('display.width',1000)
    >>> pd.set_option('display.max_columns', 1000)
    
    Example 1: pandas series
    >>> df2 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> ex1 = df2['Gen_Age']
    >>> ts_trimmed_mean_os(ex1)
       trim. mean    mu        SE  statistic  df  p-value                     test used
    0        22.1  68.5  0.629778 -73.676782  39      0.0  one-sample trimmed mean test
    >>> ts_trimmed_mean_os(ex1, mu=23, trimProp=0.15, se="wilcox")
       trim. mean  mu        SE  statistic  df   p-value                     test used
    0        22.0  23  0.648656  -1.541649  37  0.131669  one-sample trimmed mean test
    
    Example 2: Numeric list
    >>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
    >>> ts_trimmed_mean_os(ex2)
       trim. mean   mu        SE  statistic  df   p-value                     test used
    0    3.444444  3.0  0.372434    1.19335  17  0.249121  one-sample trimmed mean test
    
    '''
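    # Accept plain lists by converting them to a pandas Series
    if isinstance(data, list):
        data = pd.Series(data)
        
    # Drop missing values, coerce to numbers, and sort from low to high
    data = data.dropna()
    data = pd.to_numeric(data)
    data = data.sort_values()
    data = data.reset_index(drop=True)
    
    # Default hypothesized mean: the midrange of the data
    if mu is None:
        mu = (max(data)+min(data))/2
    
    n = len(data)
    # trimProp is the total proportion, so half of it is trimmed from each side
    nt = n*trimProp/2
    nl = math.floor(nt)                 # g, scores trimmed per side
    mt = data[nl:(n - nl)].mean()       # trimmed mean
    nat = n - 2*nl                      # m, scores left after trimming
    # Winsorized mean and sum of squared deviations from it
    mw = (mt*nat + nl*(data[nl] + data[nl+nat-1]))/n
    ssdw = nl*(data[nl] - mw)**2 + nl*(data[nl+nat-1] - mw)**2 + sum((data[nl:(nl+nat)] - mw)**2)
    varw = ssdw/(n - 1)
    
    if se=="yuen":
        SE = (ssdw/(nat*(nat - 1)))**0.5
    elif se=="wilcox":
        SE = (varw)**0.5/((1 - trimProp)*(n**0.5))
    else:
        # Previously an unknown value left SE undefined and failed later with a NameError
        raise ValueError('se must be "yuen" or "wilcox"')
    
    # Two-sided test against the t distribution with m - 1 degrees of freedom
    tValue = (mt - mu)/SE
    df = nat - 1
    pValue = 2 * (1 - t.cdf(abs(tValue), df))
    
    testUsed = "one-sample trimmed mean test"
    results = pd.DataFrame([[mt, mu, SE, tValue, df, pValue, testUsed]], columns=["trim. mean", "mu", "SE", "statistic", "df", "p-value", "test used"])
    
    return results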

Functions

def ts_trimmed_mean_os(data, mu=None, trimProp=0.1, se='yuen')

One-Sample Trimmed (Yuen or Yuen-Welch) Mean Test

A variation on a one-sample Student t-test where the data is first trimmed, and the Winsorized variance is used.

The assumption about the population for this test is that the mean in the population is equal to the provided mu value. The test shows the probability of the found test statistic, or a more extreme one, if this assumption were true. If this probability is below a chosen threshold (usually 0.05), the assumption is rejected.
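The decision rule in the paragraph above can be sketched with the standard library alone. This is a minimal illustration, not the function's own code: the function itself relies on scipy's t distribution, while the helper names `t_pdf` and `two_sided_p` are hypothetical and approximate the two-sided probability by numerical integration (Simpson's rule). The statistic and degrees of freedom are taken from Example 2 on this page.

```python
import math

def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + x * x / df))

def two_sided_p(t_value, df, steps=2000):
    """Two-sided p-value using Simpson's rule on [0, |t|]; steps must be even."""
    b = abs(t_value)
    h = b / steps
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3      # P(0 < T < |t|)
    return 1 - 2 * central       # P(|T| > |t|)

# Values reported in Example 2: statistic = 1.19335, df = 17
p = two_sided_p(1.19335, 17)
print(round(p, 4))               # about 0.2491, in line with Example 2
print("reject" if p < 0.05 else "do not reject")
```

With the usual 0.05 threshold, the assumption that the population mean equals the midrange of 3.0 would not be rejected for the Example 2 data.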

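This function is shown in this YouTube video (https://youtu.be/jh2IYmhwctg) and the test is also described at PeterStatistics.com (https://peterstatistics.com/Terms/Tests/TrimmedMeanOneSample.html).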

Parameters

data : list or pandas data series: the data as numbers
mu : float, optional: hypothesized mean, otherwise the midrange will be used
trimProp : float, optional: proportion to trim in total. Default is 0.1 (i.e. 0.05 from each side)
se : {"yuen", "wilcox"}, optional: method to use to determine the standard error. Default is "yuen"
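How *trimProp* and the default for *mu* translate into numbers can be sketched as follows. The helper names `trim_counts` and `midrange` are hypothetical and only mirror the arithmetic in the source code. The sample size of 44 is an assumption, inferred from the degrees of freedom reported in Example 1 (39 and 37) rather than stated anywhere on this page.

```python
import math

def trim_counts(n, trim_prop=0.1):
    """Return (g, m): scores trimmed from each side, and scores remaining."""
    g = math.floor(n * trim_prop / 2)   # half of the total proportion per side
    m = n - 2 * g
    return g, m

def midrange(values):
    """Value used for mu when no hypothesized mean is given."""
    return (max(values) + min(values)) / 2

print(trim_counts(18, 0.1))    # (0, 18): too few scores for any trimming
print(trim_counts(44, 0.1))    # (2, 40)
print(trim_counts(44, 0.15))   # (3, 38)
print(midrange([1, 2, 5]))     # 3.0
```

Because the count per side is rounded down, small samples combined with a small *trimProp* are not trimmed at all, and the midrange default can lie far from the centre of skewed data, so passing *mu* explicitly is usually the safer choice.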

Returns

pandas.DataFrame

A dataframe with the following columns:

trim. mean, the sample trimmed mean
mu, hypothesized mean
SE, the standard error
statistic, the test statistic (t-value)
df, degrees of freedom
p-value, p-value (sig.)
test used, name of test used

Notes

The formula used is:

$$t = \frac{\bar{x}_t - \mu_{H_0}}{SE}$$
$$sig = 2\times\left(1 - T\left(\left|t\right|, df\right)\right)$$

With:

$$\bar{x}_t = \frac{\sum_{i=g+1}^{n - g}y_i}{m}$$
$$g = \lfloor n\times p_t\rfloor$$
$$m = n - 2\times g$$
$$SE = \sqrt{\frac{SSD_w}{m\times\left(m - 1\right)}}$$
$$SSD_w = g\times\left(y_{g+1} - \bar{x}_w\right)^2 + g\times\left(y_{n-g} - \bar{x}_w\right)^2 + \sum_{i=g+1}^{n - g} \left(y_i - \bar{x}_w\right)^2$$
$$\bar{x}_w = \frac{\bar{x}_t\times m + g\times\left(y_{g+1} + y_{n-g}\right)}{n}$$

If se="wilcox" is used, the formula for SE will be adjusted to:

$$SE = \frac{\sqrt{\frac{SSD_w}{n - 1}}}{\left(1 - 2\times p_t\right)\times\sqrt{n}}$$

Symbols used:

$\bar{x}_t$ the trimmed mean of the scores
$\bar{x}_w$ the Winsorized mean
$SSD_w$ the sum of squared deviations from the Winsorized mean
$g$ the number of scores trimmed from each side
$m$ the number of scores in the trimmed data set
$y_i$ the i-th score after the scores are sorted from low to high
$p_t$ the proportion of scores trimmed from each side, i.e. half of trimProp
$df$ the degrees of freedom, $m - 1$
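The quantities in the formulas above can be traced on a small data set. The sketch below is plain Python rather than the pandas-based implementation; the helper name `trimmed_pieces` and the ten example scores are made up for illustration. Here `trim_prop` is the total proportion, so `1 - trim_prop` plays the role of $1 - 2\times p_t$.

```python
import math

def trimmed_pieces(values, trim_prop):
    """Compute the building blocks of the one-sample trimmed mean test."""
    y = sorted(values)
    n = len(y)
    g = math.floor(n * trim_prop / 2)       # scores trimmed per side
    m = n - 2 * g                           # scores kept
    kept = y[g:n - g]                       # y_{g+1} ... y_{n-g}
    x_t = sum(kept) / m                     # trimmed mean
    x_w = (x_t * m + g * (kept[0] + kept[-1])) / n   # Winsorized mean
    ssd_w = (g * (kept[0] - x_w) ** 2 + g * (kept[-1] - x_w) ** 2
             + sum((v - x_w) ** 2 for v in kept))
    se_yuen = math.sqrt(ssd_w / (m * (m - 1)))
    se_wilcox = math.sqrt(ssd_w / (n - 1)) / ((1 - trim_prop) * math.sqrt(n))
    return {"g": g, "m": m, "x_t": x_t, "x_w": x_w, "ssd_w": ssd_w,
            "se_yuen": se_yuen, "se_wilcox": se_wilcox}

# Hypothetical scores with one extreme value; 20% trimmed in total
res = trimmed_pieces([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], 0.2)
print(res["g"], res["m"])                  # 1 8
print(res["x_t"], res["x_w"])              # 5.5 5.5
print(res["ssd_w"])                        # 66.5
print(round(res["se_yuen"], 4), round(res["se_wilcox"], 4))
```

The extreme score of 100 is replaced by 9 in the Winsorized calculations, so it no longer inflates the standard error the way it would in an ordinary t-test.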

The test is often also referred to as a Yuen test, or Yuen-Welch test.

The standard error can be calculated in two ways. The first SE formula can be found, for example, in Tukey and McLaughlin (1963, p. 342), and appears to match the independent-samples version of this test as proposed by Yuen (1974, p. 167).

The second version is the one used in libraries for the software R, and can be found in Wilcox (2012, p. 157) or Peró-Cebollero and Guàrdia-Olmos (2013, p. 409).
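
How far the two versions differ can be worked out from the formulas in the Notes. The derivation below is a sketch under one assumption: $n\times p_t$ is a whole number, so that no rounding occurs and $1 - 2\times p_t = \frac{m}{n}$. The labels $SE_{Yuen}$ and $SE_{Wilcox}$ are introduced here only to tell the two versions apart.

$$SE_{Wilcox} = \frac{\sqrt{\frac{SSD_w}{n - 1}}}{\frac{m}{n}\times\sqrt{n}} = \frac{1}{m}\times\sqrt{\frac{n\times SSD_w}{n - 1}}$$
$$\frac{SE_{Yuen}}{SE_{Wilcox}} = \sqrt{\frac{m\times\left(n - 1\right)}{n\times\left(m - 1\right)}} \geq 1$$

Under that assumption the first version is never smaller than the second, and the two coincide when nothing is trimmed ($m = n$). For example, with $n = 10$ and $m = 8$ the ratio is $\sqrt{72/70} \approx 1.014$. When $n\times p_t$ is rounded down, $\frac{m}{n}$ is larger than $1 - 2\times p_t$ and this ordering is no longer guaranteed.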

Before, After and Alternatives

Before this you might want to create a binned frequency table or a visualisation:

tab_frequency_bins to create a binned frequency table
vi_boxplot_single for a Box (and Whisker) Plot
vi_histogram for a Histogram
vi_stem_and_leaf for a Stem-and-Leaf Display

After this you might want an effect size measure:

es_cohen_d_os for Cohen d'
es_hedges_g_os for Hedges g
es_common_language_os for the Common Language Effect Size

Alternative Tests:

ts_student_t_os for One-Sample Student t-Test
ts_z_os for One-Sample Z Test

References

Peró-Cebollero, M., & Guàrdia-Olmos, J. (2013). The adequacy of different robust statistical tests in comparing two independent groups. Psicológica, 34, 407–424.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhyā: The Indian Journal of Statistics, 25(3), 331–352.

Wilcox, R. R. (2012). Introduction to robust estimation and hypothesis testing (3rd ed.). Academic Press.

Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61(1), 165–170. doi:10.1093/biomet/61.1.165

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples

>>> pd.set_option('display.width',1000)
>>> pd.set_option('display.max_columns', 1000)

Example 1: pandas series

>>> df2 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df2['Gen_Age']
>>> ts_trimmed_mean_os(ex1)
   trim. mean    mu        SE  statistic  df  p-value                     test used
0        22.1  68.5  0.629778 -73.676782  39      0.0  one-sample trimmed mean test
>>> ts_trimmed_mean_os(ex1, mu=23, trimProp=0.15, se="wilcox")
   trim. mean  mu        SE  statistic  df   p-value                     test used
0        22.0  23  0.648656  -1.541649  37  0.131669  one-sample trimmed mean test

Example 2: Numeric list

>>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
>>> ts_trimmed_mean_os(ex2)
   trim. mean   mu        SE  statistic  df   p-value                     test used
0    3.444444  3.0  0.372434    1.19335  17  0.249121  one-sample trimmed mean test
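
The figures in Example 2 can be cross-checked by hand. With 18 scores and the default *trimProp* of 0.1, no score is trimmed, so the test reduces to an ordinary one-sample t-test against the midrange. The sketch below uses only the standard library and is a check on the documented output, not part of the package.

```python
import math
import statistics

ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
n = len(ex2)
g = math.floor(n * 0.1 / 2)             # 0, so no scores are trimmed
mean = statistics.mean(ex2)             # equals the trimmed mean when g = 0
mu = (max(ex2) + min(ex2)) / 2          # midrange default
se = statistics.stdev(ex2) / math.sqrt(n)
t_value = (mean - mu) / se

print(g, round(mean, 6), mu, round(se, 6), round(t_value, 5))
# 0 3.444444 3.0 0.372434 1.19335
```

The degrees of freedom are then n - 1 = 17, as in the output above.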
