Module stikpetP.measures.meas_variation

import pandas as pd
import numpy as np
from statistics import mode
from .meas_mean import me_mean

def me_variation(data, levels=None, measure="std", ddof=1, center="mean", azs="square"):
    '''
    Measures of Quantitative Variation
    ----------------------------------
    
    Probably the most famous measure of dispersion is the standard deviation, but there are more. This function provides a variety of measures and allows the creation of your own version.

    This function is shown in this [YouTube video](https://youtu.be/fV8W3cJSpyc) and the measure is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Measures/QuantitativeVariation.html)
    
    Parameters
    ----------
    data : list or pandas data series 
        numeric data
    levels : dictionary, optional 
        coding to use
    measure : {"std", "var", "mad", "madmed", "medad", "stddm", "cv", "cd", "own"}, optional
        the measure to determine. Default is "std"
    ddof : float, optional
        offset d subtracted from n in the divisor (n - d) of the standard deviation and variance; also used by "cv", "stddm" and "cd". Default is 1.
    center : {"mean", "median", "mode"} or float, optional
        if measure is "own" the value to use as center. Default is "mean"
    azs : {"square", "abs"}, optional
        if measure is "own" the way to avoid a zero sum. Either by squaring or absolute value
        
    Returns
    -------
    pandas.DataFrame
        A dataframe with the following columns:
    
        * *value*, the value of the measure
        * *measure*, description of the measure
    
    Notes
    -----
    
    **Standard Deviation** (std)
    
    The formula used is:
    $$s = \\sqrt{\\frac{\\sum_{i=1}^n \\left(x_i - \\bar{x}\\right)^2}{n - d}}$$
    
    Where $d$ is the offset specified at *ddof*. By default this is 1, giving the sample standard deviation.
    
    **Variance** (var)
    
    The formula used is:
    $$s^2 = \\frac{\\sum_{i=1}^n \\left(x_i - \\bar{x}\\right)^2}{n - d}$$
    
    Where $d$ is the offset specified at *ddof*. By default this is 1, giving the sample variance.
    
    **Mean Absolute Deviation** (mad)
    
    The formula used is:
    $$MAD = \\frac{\\sum_{i=1}^n \\left| x_i - \\bar{x}\\right|}{n}$$
    
    **Mean Absolute Deviation from the Median** (madmed)
    
    The formula used is:
    $$MAD = \\frac{\\sum_{i=1}^n \\left| x_i - \\tilde{x}\\right|}{n}$$
    
    Where $\\tilde{x}$ is the median
    
    **Median Absolute Deviation** (medad)
    
    The formula used is:
    $$MAD = MED\\left(\\left| x_i - \\tilde{x}\\right|\\right)$$
    
    **Decile Standard Deviation** (stddm)
    
    The formula used is (Siraj-Ud-Doulah, 2018, p. 310):
    $$s_{dm} = \\sqrt{\\frac{\\sum_{i=1}^n \\left(x_i - DM\\right)^2}{n - d}}$$
    
    Where DM is the decile mean.
    
    **Coefficient of Variation** (cv)
    
    The formula used is (Pearson, 1896, p. 277):
    $$CV = \\frac{s}{\\bar{x}}$$
    
    **Coefficient of Diversity** (cd)
    
    The formula used is (Siraj-Ud-Doulah, 2018, p. 310):
    $$CD = \\frac{s_{dm}}{DM}$$
    
    **Own** (own)
    
    It is possible to create your own measure. First decide on a center: the built-in options are the mean, median and mode, or pass a number of your own. Then choose whether to sum the squared deviations or the absolute deviations.

    Before, After and Alternatives
    ------------------------------
    Before this you might want to create a binned frequency table or a visualisation:
    * [tab_frequency_bins](../other/table_frequency_bins.html#tab_frequency_bins) to create a binned frequency table
    * [vi_boxplot_single](../visualisations/vis_boxplot_single.html#vi_boxplot_single) for a Box (and Whisker) Plot
    * [vi_histogram](../visualisations/vis_histogram.html#vi_histogram) for a Histogram
    * [vi_stem_and_leaf](../visualisations/vis_stem_and_leaf.html#vi_stem_and_leaf) for a Stem-and-Leaf Display

    After this you might want some other descriptive measures:
    * [me_mode_bin](../measures/meas_mode_bin.html#me_mode_bin) for Mode for Binned Data
    * [me_mean](../measures/meas_mean.html#me_mean) for different types of mean
    
    Or perform a test:
    * [ts_student_t_os](../tests/test_student_t_os.html#ts_student_t_os) for One-Sample Student t-Test
    * [ts_trimmed_mean_os](../tests/test_trimmed_mean_os.html#ts_trimmed_mean_os) for One-Sample Trimmed (Yuen or Yuen-Welch) Mean Test
    * [ts_z_os](../tests/test_z_os.html#ts_z_os) for One-Sample Z Test
    
    References
    ----------
    Pearson, K. (1896). Contributions to the mathematical theory of evolution. III. Regression, Heredity, and Panmixia. *Philosophical Transactions of the Royal Society of London*. (A.), 1896, 253–318.
    
    Siraj-Ud-Doulah, M. (2018). Alternative measures of standard deviation coefficient of variation and standard error. *International Journal of Statistics and Applications, 8*(6), 309–315. doi:10.5923/j.statistics.20180806.04
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076

    Examples
    --------
    Example 1: Sample Standard Deviation of a Numeric Pandas Series
    >>> import pandas as pd
    >>> student_df = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> ex1 = student_df['Gen_Age']
    >>> me_variation(ex1)
           value                      measure
    0  15.144965  standard deviation (sample)
    
    Example 2: Mean Absolute Deviation of a Numeric list
    >>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
    >>> me_variation(ex2, measure='mad')
          value                  measure
    0  1.395062  mean absolute deviation
    
    '''
    if isinstance(data, list):
        data = pd.Series(data)
        
    data = data.dropna()
    if levels is not None:
        dataN = data.replace(levels)
        dataN = pd.to_numeric(dataN)
    else:
        dataN = pd.to_numeric(data)
    
    dataN = dataN.sort_values().reset_index(drop=True)    
    n = len(dataN)
    
    if measure=="std" and ddof==1:
        lbl = "standard deviation (sample)"
        res = np.std(dataN, ddof=1)
    elif measure=="std" and ddof==0:
        lbl = "standard deviation (population)"
        res = np.std(dataN, ddof=0)
    elif measure=="std":
        lbl = "standard deviation corrected with " + str(ddof)
        res = np.std(dataN, ddof=ddof)
    elif measure=="var" and ddof==1:
        lbl = "variance (sample)"
        res = np.var(dataN, ddof=1)
    elif measure=="var" and ddof==0:
        lbl = "variance (population)"
        res = np.var(dataN, ddof=0)
    elif measure=="var":
        lbl = "variance corrected with " + str(ddof)
        res = np.var(dataN, ddof=ddof)
    elif measure=="mad":
        lbl = "mean absolute deviation"
        mu = np.mean(dataN)
        res = sum(abs(dataN - mu))/n
    elif measure=="madmed":
        lbl = "mean absolute deviation around median"
        mu = np.median(dataN)
        res = sum(abs(dataN - mu))/n
    elif measure=="medad":
        lbl = "median absolute deviation"
        mu = np.median(dataN)
        res = np.median(abs(dataN - mu))    
    elif measure=="cv":
        lbl = "coefficient of variation"
        mu = np.mean(dataN)
        s = np.std(dataN, ddof=ddof)
        res = s/mu
    elif measure=="stddm":
        lbl = "standard deviation with decile mean"
        mu = me_mean(dataN, version="decile")
        res = (sum((dataN - mu)**2)/(n-ddof))**0.5        
    elif measure=="cd":
        lbl = "coefficient of deviation"
        mu = me_mean(dataN, version="decile")
        s = (sum((dataN - mu)**2)/(n-ddof))**0.5
        res = s/mu
    else:
        if center=="mean":
            lbl = "mean"
            mu = np.mean(dataN)
        elif center=="median":
            lbl = "median"
            mu = np.median(dataN)
        elif center=="mode":
            lbl = "mode"
            mu = mode(dataN)
        else:
            lbl = str(center)
            mu = center
            
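        if azs=="square":
            lbl = "sum squared deviation around " + str(lbl)
            res = sum((dataN - mu)**2)
        elif azs=="abs":
            lbl = "sum absolute deviation around " + str(lbl)
            res = sum(abs(dataN - mu))
        else:
            # without this guard an unknown azs left res undefined (NameError below)
            raise ValueError('azs must be "square" or "abs"')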
    
    results = pd.DataFrame([[res, lbl]], columns=["value", "measure"])

    return results

Functions

def me_variation(data, levels=None, measure='std', ddof=1, center='mean', azs='square')

Measures Of Quantitative Variation

Probably the most famous measure of dispersion is the standard deviation, but there are more. This function provides a variety of measures and allows the creation of your own version.

This function is shown in this YouTube video and the measure is also described at PeterStatistics.com

Parameters

data : list or pandas data series
numeric data
levels : dictionary, optional
coding to use
measure : {"std", "var", "mad", "madmed", "medad", "stddm", "cv", "cd", "own"}, optional
the measure to determine. Default is "std"
ddof : float, optional
offset d subtracted from n in the divisor (n - d) of the standard deviation and variance; also used by "cv", "stddm" and "cd". Default is 1.
center : {"mean", "median", "mode"} or float, optional
if measure is "own" the value to use as center. Default is "mean"
azs : {"square", "abs"}, optional
if measure is "own" the way to avoid a zero sum. Either by squaring or absolute value
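The levels coding can be illustrated with a small sketch. It mirrors the recoding steps visible in the source (drop missing values, replace labels through the dictionary, convert to numeric); the example responses and the `coding` dictionary are made up for illustration.

```python
import pandas as pd

# Hypothetical ordinal responses; None marks a missing answer.
responses = pd.Series(["low", "high", "medium", None, "high"], dtype=object)

# Hypothetical coding dictionary, as it could be passed to `levels`.
coding = {"low": 1, "medium": 2, "high": 3}

# Drop missing values, recode the labels, then convert to numbers.
numeric = pd.to_numeric(responses.dropna().replace(coding))

# Sample standard deviation (divisor n - 1) of the recoded values.
sample_sd = numeric.std(ddof=1)
print(numeric.tolist(), sample_sd)
```

Labels that are missing from the dictionary stay as text, so the numeric conversion would fail for them; the coding should cover every label in the data.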

Returns

pandas.DataFrame

A dataframe with the following columns:

  • value, the value of the measure
  • measure, description of the measure

Notes

Standard Deviation (std)

The formula used is:
$$s = \sqrt{\frac{\sum_{i=1}^n \left(x_i - \bar{x}\right)^2}{n - d}}$$

Where $d$ is the offset specified at ddof. By default this is 1, giving the sample standard deviation.
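As a minimal sketch of how the offset $d$ changes the result, the formula can be written out by hand and compared with NumPy. The helper name `std_with_ddof` and the sample values are illustrative only and are not part of the package.

```python
import numpy as np

def std_with_ddof(values, ddof=1):
    """Standard deviation with divisor n - ddof (illustrative helper)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    squared_dev = (x - x.mean()) ** 2
    return float(np.sqrt(squared_dev.sum() / (n - ddof)))

scores = [1, 2, 3, 4, 5]
sample_sd = std_with_ddof(scores, ddof=1)      # divisor n - 1
population_sd = std_with_ddof(scores, ddof=0)  # divisor n

# The hand-written version should agree with NumPy for the same ddof.
print(sample_sd, np.std(scores, ddof=1))
print(population_sd, np.std(scores, ddof=0))
```

The sample version is always at least as large as the population version, because it divides the same sum of squares by a smaller number.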

Variance (var)

The formula used is:
$$s^2 = \frac{\sum_{i=1}^n \left(x_i - \bar{x}\right)^2}{n - d}$$

Where $d$ is the offset specified at ddof. By default this is 1, giving the sample variance.

Mean Absolute Deviation (mad)

The formula used is:
$$MAD = \frac{\sum_{i=1}^n \left| x_i - \bar{x}\right|}{n}$$

Mean Absolute Deviation from the Median (madmed)

The formula used is:
$$MAD = \frac{\sum_{i=1}^n \left| x_i - \tilde{x}\right|}{n}$$

Where $\tilde{x}$ is the median

Median Absolute Deviation (medad)

The formula used is:
$$MAD = MED\left(\left| x_i - \tilde{x}\right|\right)$$
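The three measures that share the abbreviation MAD differ only in the center and in how the absolute deviations are summarised. The sketch below computes all three directly with NumPy on the list from Example 2; the variable names are illustrative.

```python
import numpy as np

x = np.array([1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5], dtype=float)

# mad: mean of the absolute deviations from the mean
mean_abs_dev = np.mean(np.abs(x - np.mean(x)))

# madmed: mean of the absolute deviations from the median
mean_abs_dev_median = np.mean(np.abs(x - np.median(x)))

# medad: median of the absolute deviations from the median
median_abs_dev = np.median(np.abs(x - np.median(x)))

print(mean_abs_dev, mean_abs_dev_median, median_abs_dev)
```

For this list the median is 4, so the three results differ: roughly 1.395 for mad, 1.333 for madmed and exactly 1 for medad.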

Decile Standard Deviation (stddm)

The formula used is (Siraj-Ud-Doulah, 2018, p. 310):
$$s_{dm} = \sqrt{\frac{\sum_{i=1}^n \left(x_i - DM\right)^2}{n - d}}$$

Where DM is the decile mean.

Coefficient of Variation (cv)

The formula used is (Pearson, 1896, p. 277):
$$CV = \frac{s}{\bar{x}}$$
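A short sketch of the coefficient of variation: the standard deviation divided by the mean, which makes the spread relative to the level of the data. The helper name and the example heights are illustrative; the `ddof` argument follows the source, where the same offset is passed to the standard deviation.

```python
import numpy as np

def coefficient_of_variation(values, ddof=1):
    """Standard deviation divided by the mean (illustrative helper)."""
    x = np.asarray(values, dtype=float)
    return float(np.std(x, ddof=ddof) / np.mean(x))

heights_cm = [150, 160, 170, 180, 190]
cv = coefficient_of_variation(heights_cm)

# Rescaling the data (cm to m) leaves the coefficient unchanged.
cv_m = coefficient_of_variation([h / 100 for h in heights_cm])
print(cv, cv_m)
```

Because the mean is in the denominator, the measure is undefined for a mean of zero and hard to interpret when the mean is close to zero.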

Coefficient of Diversity (cd)

The formula used is (Siraj-Ud-Doulah, 2018, p. 310):
$$CD = \frac{s_{dm}}{DM}$$
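The two decile-based measures can be sketched as follows. This is an assumption-laden illustration: it takes the decile mean to be the average of the nine deciles and computes those with NumPy's default linear interpolation, whereas the package delegates to `me_mean(..., version="decile")`, whose quantile method may differ. The helper names `decile_mean` and `decile_std` are hypothetical.

```python
import numpy as np

def decile_mean(values):
    """Average of the nine deciles D1..D9 (assumed definition)."""
    x = np.asarray(values, dtype=float)
    deciles = np.quantile(x, np.arange(1, 10) / 10)  # linear interpolation
    return float(deciles.mean())

def decile_std(values, ddof=1):
    """Root of the squared deviations from the decile mean over n - ddof."""
    x = np.asarray(values, dtype=float)
    dm = decile_mean(x)
    return float(np.sqrt(np.sum((x - dm) ** 2) / (x.size - ddof)))

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
dm = decile_mean(data)    # deciles are 2, 3, ..., 10 for this data
s_dm = decile_std(data)   # stddm
cd = s_dm / dm            # cd: decile standard deviation over decile mean
print(dm, s_dm, cd)
```

For symmetric data such as this the decile mean equals the ordinary mean, so the results match the usual standard deviation and coefficient of variation; the measures only diverge when the tails are uneven.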

Own

It is possible to create your own measure. First decide on a center: the built-in options are the mean, median and mode, or pass a number of your own. Then choose whether to sum the squared deviations or the absolute deviations.
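The two choices can be sketched in a stand-alone function. The name `own_variation` is hypothetical and this is not the package's code; it only reproduces the idea of picking a center and then summing either squared or absolute deviations, and it raises an error for an unknown `azs` value.

```python
import numpy as np
from statistics import mode

def own_variation(values, center="mean", azs="square"):
    """Sum of squared or absolute deviations around a chosen center."""
    x = np.asarray(values, dtype=float)
    if center == "mean":
        c = x.mean()
    elif center == "median":
        c = np.median(x)
    elif center == "mode":
        c = mode(x.tolist())
    else:
        c = float(center)  # any other value is treated as a numeric center

    deviations = x - c
    if azs == "square":
        return float(np.sum(deviations ** 2))
    if azs == "abs":
        return float(np.sum(np.abs(deviations)))
    raise ValueError('azs must be "square" or "abs"')

data = [1, 2, 2, 3, 7]
ss_mean = own_variation(data, center="mean", azs="square")
sa_median = own_variation(data, center="median", azs="abs")
ss_custom = own_variation(data, center=4, azs="square")
print(ss_mean, sa_median, ss_custom)
```

Note that the result is a plain sum, not an average, so it grows with the number of observations.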

Before, After and Alternatives

Before this you might want to create a binned frequency table or a visualisation:

  • tab_frequency_bins to create a binned frequency table
  • vi_boxplot_single for a Box (and Whisker) Plot
  • vi_histogram for a Histogram
  • vi_stem_and_leaf for a Stem-and-Leaf Display

After this you might want some other descriptive measures:

  • me_mode_bin for Mode for Binned Data
  • me_mean for different types of mean

Or perform a test:

  • ts_student_t_os for One-Sample Student t-Test
  • ts_trimmed_mean_os for One-Sample Trimmed (Yuen or Yuen-Welch) Mean Test
  • ts_z_os for One-Sample Z Test

References

Pearson, K. (1896). Contributions to the mathematical theory of evolution. III. Regression, Heredity, and Panmixia. Philosophical Transactions of the Royal Society of London. (A.), 1896, 253–318.

Siraj-Ud-Doulah, M. (2018). Alternative measures of standard deviation coefficient of variation and standard error. International Journal of Statistics and Applications, 8(6), 309–315. doi:10.5923/j.statistics.20180806.04

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples

Example 1: Sample Standard Deviation of a Numeric Pandas Series

>>> import pandas as pd
>>> student_df = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = student_df['Gen_Age']
>>> me_variation(ex1)
       value                      measure
0  15.144965  standard deviation (sample)

Example 2: Mean Absolute Deviation of a Numeric list

>>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
>>> me_variation(ex2, measure='mad')
      value                  measure
0  1.395062  mean absolute deviation