Module stikpetP.measures.meas_variation
import pandas as pd
import numpy as np
from statistics import mode
from .meas_mean import me_mean
def me_variation(data, levels=None, measure="std", ddof=1, center="mean", azs="square"):
'''
Measures of Quantitative Variation
----------------------------------
Probably the most famous measure of dispersion is the standard deviation, but there are more. This function provides a variety of measures and allows the creation of your own version.
This function is shown in this [YouTube video](https://youtu.be/fV8W3cJSpyc) and the measure is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Measures/QuantitativeVariation.html)
Parameters
----------
data : list or pandas data series
numeric data
levels : dictionary, optional
mapping from the labels in the data to numeric values, applied before the calculation
measure : {"std", "var", "mad", "madmed", "medad", "stddm", "cv", "cd", "own"}, optional
the measure to determine. Default is "std"
ddof : float, optional
the value $d$ subtracted from $n$ in the denominator of the standard deviation and variance (also used by "cv", "stddm" and "cd"). Default is 1.
center : {"mean", "median", "mode"} or float, optional
if measure is "own" the value to use as center. Default is "mean"
azs : {"square", "abs"}, optional
if measure is "own" the way to avoid a zero sum. Either by squaring or absolute value
Returns
-------
pandas.DataFrame
A dataframe with the following columns:
* *value*, the value of the measure
* *measure*, description of the measure
Notes
-----
**Standard Deviation** (std)
The formula used is:
$$s = \\sqrt{\\frac{\\sum_{i=1}^n \\left(x_i - \\bar{x}\\right)^2}{n - d}}$$
Where $d$ is the offset specified at *ddof*. By default this is 1, giving the sample standard deviation.
**Variance** (var)
The formula used is:
$$s^2 = \\frac{\\sum_{i=1}^n \\left(x_i - \\bar{x}\\right)^2}{n - d}$$
Where $d$ is the offset specified at *ddof*. By default this is 1, giving the sample variance.
**Mean Absolute Deviation** (mad)
The formula used is:
$$MAD = \\frac{\\sum_{i=1}^n \\left| x_i - \\bar{x}\\right|}{n}$$
**Mean Absolute Deviation from the Median** (madmed)
The formula used is:
$$MAD = \\frac{\\sum_{i=1}^n \\left| x_i - \\tilde{x}\\right|}{n}$$
Where $\\tilde{x}$ is the median
**Median Absolute Deviation** (medad)
The formula used is:
$$MAD = MED\\left(\\left| x_i - \\tilde{x}\\right|\\right)$$
**Decile Standard Deviation** (stddm)
The formula used is (Siraj-Ud-Doulah, 2018, p. 310):
$$s_{dm} = \\sqrt{\\frac{\\sum_{i=1}^n \\left(x_i - DM\\right)^2}{n - d}}$$
Where DM is the decile mean.
**Coefficient of Variation** (cv)
The formula used is (Pearson, 1896, p. 277):
$$CV = \\frac{s}{\\bar{x}}$$
**Coefficient of Diversity** (cd)
The formula used is (Siraj-Ud-Doulah, 2018, p. 310):
$$CD = \\frac{s_{dm}}{DM}$$
**Own**
It is also possible to define your own measure. First choose a center: the mean, median, mode, or any numeric value. Then choose whether to sum the squared deviations (azs="square") or the absolute deviations (azs="abs"). The result is the sum itself; it is not divided by $n$.
Before, After and Alternatives
------------------------------
Before this you might want to create a binned frequency table or a visualisation:
* [tab_frequency_bins](../other/table_frequency_bins.html#tab_frequency_bins) to create a binned frequency table
* [vi_boxplot_single](../visualisations/vis_boxplot_single.html#vi_boxplot_single) for a Box (and Whisker) Plot
* [vi_histogram](../visualisations/vis_histogram.html#vi_histogram) for a Histogram
* [vi_stem_and_leaf](../visualisations/vis_stem_and_leaf.html#vi_stem_and_leaf) for a Stem-and-Leaf Display
After this you might want some other descriptive measures:
* [me_mode_bin](../measures/meas_mode_bin.html#me_mode_bin) for Mode for Binned Data
* [me_mean](../measures/meas_mean.html#me_mean) for different types of mean
Or perform a test:
* [ts_student_t_os](../tests/test_student_t_os.html#ts_student_t_os) for One-Sample Student t-Test
* [ts_trimmed_mean_os](../tests/test_trimmed_mean_os.html#ts_trimmed_mean_os) for One-Sample Trimmed (Yuen or Yuen-Welch) Mean Test
* [ts_z_os](../tests/test_z_os.html#ts_z_os) for One-Sample Z Test
References
----------
Pearson, K. (1896). Contributions to the mathematical theory of evolution. III. Regression, Heredity, and Panmixia. *Philosophical Transactions of the Royal Society of London*. (A.), 1896, 253–318.
Siraj-Ud-Doulah, M. (2018). Alternative measures of standard deviation coefficient of variation and standard error. *International Journal of Statistics and Applications, 8*(6), 309–315. doi:10.5923/j.statistics.20180806.04
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
--------
Example 1: Sample Standard Deviation of a Numeric Pandas Series
>>> import pandas as pd
>>> student_df = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = student_df['Gen_Age']
>>> me_variation(ex1)
value measure
0 15.144965 standard deviation (sample)
Example 2: Mean Absolute Deviation of a Numeric list
>>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
>>> me_variation(ex2, measure='mad')
value measure
0 1.395062 mean absolute deviation
'''
    if type(data) is list:
        data = pd.Series(data)
    data = data.dropna()
    if levels is not None:
        dataN = data.replace(levels)
        dataN = pd.to_numeric(dataN)
    else:
        dataN = pd.to_numeric(data)
    dataN = dataN.sort_values().reset_index(drop=True)
    n = len(dataN)
    if measure=="std" and ddof==1:
        lbl = "standard deviation (sample)"
        res = np.std(dataN, ddof=1)
    elif measure=="std" and ddof==0:
        lbl = "standard deviation (population)"
        res = np.std(dataN, ddof=0)
    elif measure=="std":
        lbl = "standard deviation corrected with " + str(ddof)
        res = np.std(dataN, ddof=ddof)
    elif measure=="var" and ddof==1:
        lbl = "variance (sample)"
        res = np.var(dataN, ddof=1)
    elif measure=="var" and ddof==0:
        lbl = "variance (population)"
        res = np.var(dataN, ddof=0)
    elif measure=="var":
        lbl = "variance corrected with " + str(ddof)
        res = np.var(dataN, ddof=ddof)
    elif measure=="mad":
        lbl = "mean absolute deviation"
        mu = np.mean(dataN)
        res = sum(abs(dataN - mu))/n
    elif measure=="madmed":
        lbl = "mean absolute deviation around median"
        mu = np.median(dataN)
        res = sum(abs(dataN - mu))/n
    elif measure=="medad":
        lbl = "median absolute deviation"
        mu = np.median(dataN)
        res = np.median(abs(dataN - mu))
    elif measure=="cv":
        lbl = "coefficient of variation"
        mu = np.mean(dataN)
        s = np.std(dataN, ddof=ddof)
        res = s/mu
    elif measure=="stddm":
        lbl = "standard deviation with decile mean"
        mu = me_mean(dataN, version="decile")
        res = (sum((dataN - mu)**2)/(n-ddof))**0.5
    elif measure=="cd":
        lbl = "coefficient of deviation"
        mu = me_mean(dataN, version="decile")
        s = (sum((dataN - mu)**2)/(n-ddof))**0.5
        res = s/mu
    else:
        if center=="mean":
            lbl = "mean"
            mu = np.mean(dataN)
        elif center=="median":
            lbl = "median"
            mu = np.median(dataN)
        elif center=="mode":
            lbl = "mode"
            mu = mode(dataN)
        else:
            lbl = str(center)
            mu = center
        if azs=="square":
            lbl = "sum squared deviation around " + str(lbl)
            res = sum((dataN - mu)**2)
        elif azs=="abs":
            lbl = "sum absolute deviation around " + str(lbl)
            res = sum(abs(dataN - mu))
    results = pd.DataFrame(list([[res, lbl]]), columns = ["value", "measure"])
    return (results)
Functions
def me_variation(data, levels=None, measure='std', ddof=1, center='mean', azs='square')
Measures Of Quantitative Variation
Probably the most famous measure of dispersion is the standard deviation, but there are more. This function provides a variety of measures and allows the creation of your own version.
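This function is shown in this [YouTube video](https://youtu.be/fV8W3cJSpyc) and the measure is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Measures/QuantitativeVariation.html)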
Parameters
data : list or pandas data series
numeric data
levels : dictionary, optional
mapping from the labels in the data to numeric values, applied before the calculation
measure : {"std", "var", "mad", "madmed", "medad", "stddm", "cv", "cd", "own"}, optional
the measure to determine. Default is "std"
ddof : float, optional
the value $d$ subtracted from $n$ in the denominator of the standard deviation and variance (also used by "cv", "stddm" and "cd"). Default is 1.
center : {"mean", "median", "mode"} or float, optional
if measure is "own" the value to use as center. Default is "mean"
azs : {"square", "abs"}, optional
if measure is "own" the way to avoid a zero sum. Either by squaring or absolute value
Returns
pandas.DataFrame
A dataframe with the following columns:
- value, the value of the measure
- measure, description of the measure
Notes
Standard Deviation (std)
The formula used is: $$s = \sqrt{\frac{\sum_{i=1}^n \left(x_i - \bar{x}\right)^2}{n - d}}$$
Where $d$ is the offset specified at ddof. By default this is 1, giving the sample standard deviation.
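As a rough illustration of the role of $d$, the sketch below evaluates the formula directly for the list used in Example 2 and compares the result with `np.std`. The helper name `sd_with_offset` is made up for this illustration and is not part of the package.

```python
import numpy as np

def sd_with_offset(x, d=1):
    """Root of the summed squared deviations from the mean, divided by n - d."""
    x = np.asarray(x, dtype=float)
    ss = np.sum((x - x.mean()) ** 2)  # sum of squared deviations
    return float(np.sqrt(ss / (len(x) - d)))

scores = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
sample_sd = sd_with_offset(scores, d=1)      # divides by n - 1
population_sd = sd_with_offset(scores, d=0)  # divides by n
```

Because the denominator shrinks as $d$ grows, the sample version is always at least as large as the population version for the same data.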
Variance (var)
The formula used is: $$s^2 = \frac{\sum_{i=1}^n \left(x_i - \bar{x}\right)^2}{n - d}$$
Where $d$ is the offset specified at ddof. By default this is 1, giving the sample variance.
Mean Absolute Deviation (mad)
The formula used is: $$MAD = \frac{\sum_{i=1}^n \left| x_i - \bar{x}\right|}{n}$$
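A minimal sketch of this formula in plain Python, using the list from Example 2. The function name `mean_abs_dev` is hypothetical; it only mirrors the formula above.

```python
def mean_abs_dev(x):
    """Average absolute distance of each score from the arithmetic mean."""
    x = list(x)
    n = len(x)
    centre = sum(x) / n
    return sum(abs(v - centre) for v in x) / n

scores = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
mad = mean_abs_dev(scores)  # about 1.395062, the value reported in Example 2
```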
Mean Absolute Deviation from the Median (madmed)
The formula used is: $$MAD = \frac{\sum_{i=1}^n \left| x_i - \tilde{x}\right|}{n}$$
Where $\tilde{x}$ is the median
Median Absolute Deviation (medad)
The formula used is: $$MAD = MED\left(\left| x_i - \tilde{x}\right|\right)$$
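The two median-based measures share the abbreviation MAD but can give quite different values. The sketch below contrasts them on the list from Example 2; the function names are invented for this illustration.

```python
from statistics import median

def mean_abs_dev_from_median(x):
    """Mean of the absolute deviations from the median (madmed)."""
    med = median(x)
    return sum(abs(v - med) for v in x) / len(x)

def median_abs_dev(x):
    """Median of the absolute deviations from the median (medad)."""
    med = median(x)
    return median(abs(v - med) for v in x)

scores = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
madmed = mean_abs_dev_from_median(scores)
medad = median_abs_dev(scores)
```

For this list the median is 4, so the mean version averages deviations that sum to 24 over 18 scores, while the median version only looks at the middle of the sorted deviations.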
Decile Standard Deviation (stddm)
The formula used is (Siraj-Ud-Doulah, 2018, p. 310): $$s_{dm} = \sqrt{\frac{\sum_{i=1}^n \left(x_i - DM\right)^2}{n - d}}$$
Where DM is the decile mean.
Coefficient of Variation (cv)
The formula used is (Pearson, 1896, p. 277): $$CV = \frac{s}{\bar{x}}$$
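Since the coefficient of variation divides the standard deviation by the mean, it has no unit and does not change when all scores are multiplied by the same positive constant. The sketch below checks this; `coefficient_of_variation` is a hypothetical helper, and it assumes the mean is not zero.

```python
import numpy as np

def coefficient_of_variation(x, d=1):
    """Standard deviation (with offset d) divided by the arithmetic mean."""
    x = np.asarray(x, dtype=float)
    return float(np.std(x, ddof=d) / np.mean(x))

scores = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
cv_original = coefficient_of_variation(scores)
cv_rescaled = coefficient_of_variation([10 * v for v in scores])  # same CV
```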
Coefficient of Diversity (cd)
The formula used is (Siraj-Ud-Doulah, 2018, p. 310): $$CD = \frac{s_{dm}}{DM}$$
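The two decile-based measures can be sketched as follows. This is an illustration under assumptions: the decile mean is taken here as the average of the nine deciles, and the deciles are computed with NumPy's default linear interpolation. The package's own `me_mean(..., version="decile")` may determine the deciles differently, so its results need not match. The names `decile_mean` and `decile_sd` are made up for this sketch.

```python
import numpy as np

def decile_mean(x):
    """Assumed definition: the average of the nine deciles D1..D9."""
    deciles = np.percentile(np.asarray(x, dtype=float), np.arange(10, 100, 10))
    return float(deciles.mean())

def decile_sd(x, d=1):
    """Standard deviation formula with the decile mean as the center."""
    x = np.asarray(x, dtype=float)
    dm = decile_mean(x)
    return float(np.sqrt(np.sum((x - dm) ** 2) / (len(x) - d)))

scores = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
dm = decile_mean(scores)
s_dm = decile_sd(scores)
cd = s_dm / dm  # the ratio from the formula above
```

Whatever decile definition is used, $s_{dm}$ cannot be smaller than $s$ with the same $d$, because the sum of squared deviations is smallest around the arithmetic mean.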
Own
It is also possible to define your own measure. First choose a center: the mean, median, mode, or any numeric value. Then choose whether to sum the squared deviations (azs="square") or the absolute deviations (azs="abs"). The result is the sum itself; it is not divided by $n$.
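The two choices behind the "own" option can be sketched in plain Python as below. The function `own_variation` is a hypothetical stand-in, not the package code; raising a `ValueError` for an unknown `azs` value is a choice made in this sketch only.

```python
from statistics import mean, median, mode

def own_variation(x, center="mean", azs="square"):
    """Sum of squared or absolute deviations around a chosen center."""
    named_centers = {"mean": mean, "median": median, "mode": mode}
    if center in named_centers:
        c = named_centers[center](x)
    else:
        c = float(center)  # any other value is treated as a numeric center
    if azs == "square":
        return sum((v - c) ** 2 for v in x)
    if azs == "abs":
        return sum(abs(v - c) for v in x)
    raise ValueError("azs should be 'square' or 'abs'")

scores = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
abs_around_mode = own_variation(scores, center="mode", azs="abs")
sq_around_mode = own_variation(scores, center="mode", azs="square")
sq_around_three = own_variation(scores, center=3, azs="square")
```

For this list the mode is 5, so the first two calls measure how far the scores fall from the most frequent value.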
Before, After and Alternatives
Before this you might want to create a binned frequency table or a visualisation:
* tab_frequency_bins to create a binned frequency table
* vi_boxplot_single for a Box (and Whisker) Plot
* vi_histogram for a Histogram
* vi_stem_and_leaf for a Stem-and-Leaf Display
After this you might want some other descriptive measures:
* me_mode_bin for Mode for Binned Data
* me_mean for different types of mean
Or perform a test:
* ts_student_t_os for One-Sample Student t-Test
* ts_trimmed_mean_os for One-Sample Trimmed (Yuen or Yuen-Welch) Mean Test
* ts_z_os for One-Sample Z Test
References
Pearson, K. (1896). Contributions to the mathematical theory of evolution. III. Regression, Heredity, and Panmixia. Philosophical Transactions of the Royal Society of London. (A.), 1896, 253–318.
Siraj-Ud-Doulah, M. (2018). Alternative measures of standard deviation coefficient of variation and standard error. International Journal of Statistics and Applications, 8(6), 309–315. doi:10.5923/j.statistics.20180806.04
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
Example 1: Sample Standard Deviation of a Numeric Pandas Series
>>> import pandas as pd
>>> student_df = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = student_df['Gen_Age']
>>> me_variation(ex1)
value measure
0 15.144965 standard deviation (sample)
Example 2: Mean Absolute Deviation of a Numeric list
>>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
>>> me_variation(ex2, measure='mad')
value measure
0 1.395062 mean absolute deviation