Module stikpetP.measures.meas_quantiles
import pandas as pd
import math
from ..helper.help_quantileIndex import he_quantileIndex
def me_quantiles(data, levels=None, k=4, method="own", indexMethod="sas1", qLfrac="linear", qLint="int", qHfrac="linear", qHint="int"):
    '''
Quantiles
---------
Quantiles split the data into k sections, each containing n/k scores. They can be seen as a generalisation of various 'tiles'. For example 4-quantiles is the same as the quartiles, 5-quantiles the same as quintiles, 100-quantiles the same as percentiles, etc.
Quite a few different methods exist to determine these. See the notes for more information.
This function is shown in this [YouTube video](https://youtu.be/iI07nJ3wlOQ) and the measure is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Measures/Quantiles.html)
Parameters
----------
data : list or pandas series
levels : dictionary, optional
coding to use
k : int, optional
number of quantiles. Default is 4
method : string, optional
which method to use to calculate quantiles
indexMethod : {"sas1", "sas4", "excel", "hl", "hf8", "hf9"}, optional
to indicate which type of indexing to use. Default is "sas1"
qLfrac : {"linear", "down", "up", "bankers", "nearest", "halfdown", "midpoint"}, optional
to indicate what type of rounding to use for quantiles below 50 percent. Default is "linear"
qLint : {"int", "midpoint"}, optional
to indicate the use of the integer or the midpoint method for quantiles below 50 percent. Default is "int"
qHfrac : {"linear", "down", "up", "bankers", "nearest", "halfdown", "midpoint"}, optional
to indicate what type of rounding to use for quantiles equal or above 50 percent. Default is "linear"
qHint : {"int", "midpoint"}, optional
to indicate the use of the integer or the midpoint method for quantiles equal or above 50 percent. Default is "int"
Set method to "own" to use the indexMethod, qLfrac, qLint, qHfrac and qHint parameters as given, or to any of the methods listed in the notes, in which case these five parameters are ignored.
Returns
-------
results : pandas series with the quantiles; if levels is used, a tuple with the quantiles and a list of their text versions
Notes
-----
To determine the quantiles a specific indexing method can be used. See **he_quantileIndexing()** for details on the different methods to choose from.
Then based on the indexes either linear interpolation or different rounding methods (bankers, nearest, down, up, half-down) can be used, or the midpoint between the two values. If the index is an integer either the integer or the midpoint is used.
See **he_quantileIndex()** for details on this.
Note that the rounding method can even vary per quantile, i.e. the one used for quantiles below the median can differ from the one used for those equal to or above it.
I've come across the following methods:
|method|indexing|q1 integer|q1 fractional|q3 integer|q3 fractional|
|------|--------|----------|-------------|----------|-------------|
|sas1|sas1|use int|linear|use int|linear|
|sas2|sas1|use int|bankers|use int|bankers|
|sas3|sas1|use int|up|use int|up|
|sas5|sas1|midpoint|up|midpoint|up|
|hf3b|sas1|use int|nearest|use int|halfdown|
|sas4|sas4|use int|linear|use int|linear|
|ms|sas4|use int|nearest|use int|halfdown|
|lohninger|sas4|use int|nearest|use int|nearest|
|hl2|hl|use int|linear|use int|linear|
|hl1|hl|use int|midpoint|use int|midpoint|
|excel|excel|use int|linear|use int|linear|
|pd2|excel|use int|down|use int|down|
|pd3|excel|use int|up|use int|up|
|pd4|excel|use int|halfdown|use int|nearest|
|pd5|excel|use int|midpoint|use int|midpoint|
|hf8|hf8|use int|linear|use int|linear|
|hf9|hf9|use int|linear|use int|linear|
The following values can be used for the *method* parameter:
1. sas1 = parzen = hf4 = interpolated_inverted_cdf = maple3 = r4. (Parzen, 1979, p. 108; SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 363)
1. sas2 = hf3 = r3. (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
1. sas3 = hf1 = inverted_cdf = maple1 = r1 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
1. sas4 = hf6 = minitab = snedecor = weibull = maple5 = r6 (Hyndman & Fan, 1996, p. 363; Weibull, 1939, p. ?; Snedecor, 1940, p. 43; SAS, 1990, p. 626)
1. sas5 = hf2 = cdf = averaged_inverted_cdf = r2 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
1. hf3b = closest_observation
1. ms (Mendenhall & Sincich, 1992, p. 35)
1. lohninger (Lohninger, n.d.)
1. hl1 (Hogg & Ledolter, 1992, p. 21)
1. hl2 = hf5 = hazen = maple4 = r5 (Hogg & Ledolter, 1992, p. 21; Hazen, 1914, p. ?)
1. maple2
1. excel = hf7 = pd1 = linear = gumbel = maple6 = r7 (Hyndman & Fan, 1996, p. 363; Freund & Perles, 1987, p. 201; Gumbel, 1939, p. ?)
1. pd2 = lower
1. pd3 = higher
1. pd4 = nearest
1. pd5 = midpoint
1. hf8 = median_unbiased = maple7 = r8 (Hyndman & Fan, 1996, p. 363)
1. hf9 = normal_unbiased = maple8 = r9 (Hyndman & Fan, 1996, p. 363)
*hf* is short for Hyndman and Fan who wrote an article showcasing many different methods, *hl* is short for Hogg and Ledolter, *ms* is short for Mendenhall and Sincich, *jf* is short for Joarder and Firozzaman. *sas* refers to the software package SAS, *maple* to Maple, *pd* to Python's pandas library, and *r* to R.
The names *linear*, *lower*, *higher*, *nearest* and *midpoint* are all used by pandas quantile function and numpy percentile function. Numpy also uses *inverted_cdf*, *averaged_inverted_cdf*, *closest_observation*, *interpolated_inverted_cdf*, *hazen*, *weibull*, *median_unbiased*, and *normal_unbiased*.
Before, After and Alternatives
------------------------------
Before this measure you might want an impression using a frequency table or a visualisation:
* [tab_frequency](../other/table_frequency.html#tab_frequency) for a frequency table
* [vi_bar_stacked_single](../visualisations/vis_bar_stacked_single.html#vi_bar_stacked_single) for Single Stacked Bar-Chart
* [vi_bar_dual_axis](../visualisations/vis_bar_dual_axis.html#vi_bar_dual_axis) for Dual-Axis Bar Chart
After this you might want some other descriptive measures:
* [me_consensus](../measures/meas_consensus.html#me_consensus) for the Consensus
* [me_hodges_lehmann_os](../measures/meas_hodges_lehmann_os.html#me_hodges_lehmann_os) for the Hodges-Lehmann Estimate (One-Sample)
* [me_median](../measures/meas_median.html#me_median) for the Median
* [me_quartiles](../measures/meas_quartiles.html#me_quantiles) for Quartiles / Hinges
* [me_quartile_range](../measures/meas_quartile_range.html#me_quartile_range) for Interquartile Range, Semi-Interquartile Range and Mid-Quartile Range
or perform a test:
* [ts_sign_os](../tests/test_sign_os.html#ts_sign_os) for One-Sample Sign Test
* [ts_trinomial_os](../tests/test_trinomial_os.html#ts_trinomial_os) for One-Sample Trinomial Test
* [ts_wilcoxon_os](../tests/test_wilcoxon_os.html#ts_wilcoxon_os) for Wilcoxon Signed Rank Test (One-Sample)
For more information on the quantile indexing methods and the index itself:
* [he_quantileIndexing](../helper/help_quantileIndexing.html#he_quartileIndexing)
* [he_quantileIndex](../helper/help_quantileIndex.html#he_quartilesIndex)
References
----------
Freund, J. E., & Perles, B. M. (1987). A new look at quartiles of ungrouped data. *The American Statistician, 41*(3), 200–203. doi:10.1080/00031305.1987.10475479
Galton, F. (1881). Report of the anthropometric committee. *Report of the British Association for the Advancement of Science, 51*, 225–272.
Gumbel, E. J. (1939). La Probabilité des Hypothèses. *Comptes Rendus de l’Académie des Sciences, 209*, 645–647.
Hazen, A. (1914). Storage to be provided in impounding municipal water supply. *Transactions of the American Society of Civil Engineers, 77*(1), 1539–1640. doi:10.1061/taceat.0002563
Hogg, R. V., & Ledolter, J. (1992). *Applied statistics for engineers and physical scientists* (2nd int.). Macmillan.
Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in statistical packages. *The American Statistician, 50*(4), 361–365. doi:10.2307/2684934
Langford, E. (2006). Quartiles in elementary statistics. *Journal of Statistics Education, 14*(3), 1–17. doi:10.1080/10691898.2006.11910589
Lohninger, H. (n.d.). Quartile. Fundamentals of Statistics. Retrieved April 7, 2023, from http://www.statistics4u.com/fundstat_eng/cc_quartile.html
McAlister, D. (1879). The law of the geometric mean. *Proceedings of the Royal Society of London, 29*(196–199), 367–376. doi:10.1098/rspl.1879.0061
Mendenhall, W., & Sincich, T. (1992). *Statistics for engineering and the sciences* (3rd ed.). Dellen Publishing Company.
Parzen, E. (1979). Nonparametric statistical data modeling. *Journal of the American Statistical Association, 74*(365), 105–121. doi:10.1080/01621459.1979.10481621
SAS. (1990). SAS procedures guide: Version 6 (3rd ed.). SAS Institute.
Siegel, A. F., & Morgan, C. J. (1996). *Statistics and data analysis: An introduction* (2nd ed.). J. Wiley.
Snedecor, G. W. (1940). *Statistical methods applied to experiments in agriculture and biology* (3rd ed.). The Iowa State College Press.
Vining, G. G. (1998). *Statistical methods for engineers*. Duxbury Press.
Weibull, W. (1939). *The phenomenon of rupture in solids*. Ingeniörs Vetenskaps Akademien, 153, 1–55.
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
--------
Example 1: Text Pandas Series
>>> import pandas as pd
>>> student_df = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = student_df['Teach_Motivate']
>>> order = {"Fully Disagree":1, "Disagree":2, "Neither disagree nor agree":3, "Agree":4, "Fully agree":5}
>>> me_quantiles(ex1, levels=order)
(0    1.0
1    1.0
2    2.0
3    3.0
4    5.0
dtype: float64, ['Fully Disagree', 'Fully Disagree', 'Disagree', 'Neither disagree nor agree', 'Fully agree'])
Example 2: Numeric data
>>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
>>> me_quantiles(ex2)
0    1.0
1    2.0
2    4.0
3    5.0
4    5.0
dtype: float64
Example 3: Text data
>>> ex3 = ["a", "b", "f", "d", "e", "c"]
>>> order = {"a":1, "b":2, "c":3, "d":4, "e":5, "f":6}
>>> me_quantiles(ex3, levels=order)
(0    1.0
1    1.5
2    3.0
3    4.5
4    6.0
dtype: float64, ['a', 'between a and b', 'c', 'between d and e', 'f'])
    '''
    # accept plain lists as well as pandas series
    if type(data) is list:
        data = pd.Series(data)

    # remove missing values and convert the scores to numbers
    data = data.dropna()
    if levels is not None:
        pd.set_option('future.no_silent_downcasting', True)
        dataN = data.map(levels).astype('Int8')
    else:
        dataN = pd.to_numeric(data)

    dataN = dataN.sort_values().reset_index(drop=True)

    # alternative namings
    if method in ["cdf", "sas5", "hf2", "averaged_inverted_cdf", "r2"]:
        method = "sas5"
    elif method in ["sas4", "minitab", "hf6", "snedecor", "weibull", "maple5", "r6"]:
        method = "sas4"
    elif method in ["excel", "hf7", "pd1", "linear", "gumbel", "maple6", "r7"]:
        method = "excel"
    elif method in ["sas1", "parzen", "hf4", "interpolated_inverted_cdf", "maple3", "r4"]:
        method = "sas1"
    elif method in ["sas2", "hf3", "r3"]:
        method = "sas2"
    elif method in ["sas3", "hf1", "inverted_cdf", "maple1", "r1"]:
        method = "sas3"
    elif method in ["hf3b", "closest_observation"]:
        method = "hf3b"
    elif method in ["hl2", "hazen", "hf5", "maple4", "r5"]:
        method = "hl2"
    elif method in ["np", "midpoint", "pd5"]:
        method = "pd5"
    elif method in ["hf8", "median_unbiased", "maple7", "r8"]:
        method = "hf8"
    elif method in ["hf9", "normal_unbiased", "maple8", "r9"]:
        method = "hf9"
    elif method in ["pd2", "lower"]:
        method = "pd2"
    elif method in ["pd3", "higher"]:
        method = "pd3"
    elif method in ["pd4", "nearest"]:
        method = "pd4"

    # settings: the user supplied values are kept when method is "own"
    settings = [indexMethod, qLfrac, qLint, qHfrac, qHint]
    if method == "sas1":
        settings = ["sas1", "linear", "int", "linear", "int"]
    elif method == "sas2":
        settings = ["sas1", "bankers", "int", "bankers", "int"]
    elif method == "sas3":
        settings = ["sas1", "up", "int", "up", "int"]
    elif method == "sas5":
        settings = ["sas1", "up", "midpoint", "up", "midpoint"]
    elif method == "sas4":
        settings = ["sas4", "linear", "int", "linear", "int"]
    elif method == "ms":
        settings = ["sas4", "nearest", "int", "halfdown", "int"]
    elif method == "lohninger":
        settings = ["sas4", "nearest", "int", "nearest", "int"]
    elif method == "hl2":
        settings = ["hl", "linear", "int", "linear", "int"]
    elif method == "hl1":
        settings = ["hl", "midpoint", "int", "midpoint", "int"]
    elif method == "excel":
        settings = ["excel", "linear", "int", "linear", "int"]
    elif method == "pd2":
        settings = ["excel", "down", "int", "down", "int"]
    elif method == "pd3":
        settings = ["excel", "up", "int", "up", "int"]
    elif method == "pd4":
        settings = ["excel", "halfdown", "int", "nearest", "int"]
    elif method == "hf3b":
        settings = ["sas1", "nearest", "int", "halfdown", "int"]
    elif method == "pd5":
        settings = ["excel", "midpoint", "int", "midpoint", "int"]
    elif method == "hf8":
        settings = ["hf8", "linear", "int", "linear", "int"]
    elif method == "hf9":
        settings = ["hf9", "linear", "int", "linear", "int"]
    elif method == "maple2":
        settings = ["hl", "down", "int", "down", "int"]

    # signature: he_quantileIndex(data, k=4, indexMethod="sas1", qLfrac="linear", qLint="int", qHfrac="linear", qHint="int")
    quantiles = he_quantileIndex(dataN, k, settings[0], settings[1], settings[2], settings[3], settings[4])

    # find the text representatives
    if levels is not None:
        quantilesText = []
        for i in range(k+1):
            if quantiles[i] == round(quantiles[i]):
                qT = list(levels.keys())[list(levels.values()).index(quantiles[i])]
            else:
                qT = "between " + list(levels.keys())[list(levels.values()).index(math.floor(quantiles[i]))] + " and " + list(levels.keys())[list(levels.values()).index(math.ceil(quantiles[i]))]
            quantilesText.append(qT)
        results = quantiles, quantilesText
    else:
        results = quantiles

    return results
Functions
def me_quantiles(data, levels=None, k=4, method='own', indexMethod='sas1', qLfrac='linear', qLint='int', qHfrac='linear', qHint='int')
Quantiles
Quantiles split the data into k sections, each containing n/k scores. They can be seen as a generalisation of various 'tiles'. For example 4-quantiles is the same as the quartiles, 5-quantiles the same as quintiles, 100-quantiles the same as percentiles, etc.
Quite a few different methods exist to determine these. See the notes for more information.
This function is shown in this YouTube video and the measure is also described at PeterStatistics.com
Parameters
data : list or pandas series
levels : dictionary, optional
    coding to use
k : int, optional
    number of quantiles. Default is 4
method : string, optional
    which method to use to calculate quantiles
indexMethod : {"sas1", "sas4", "excel", "hl", "hf8", "hf9"}, optional
    to indicate which type of indexing to use. Default is "sas1"
qLfrac : {"linear", "down", "up", "bankers", "nearest", "halfdown", "midpoint"}, optional
    to indicate what type of rounding to use for quantiles below 50 percent. Default is "linear"
qLint : {"int", "midpoint"}, optional
    to indicate the use of the integer or the midpoint method for quantiles below 50 percent. Default is "int"
qHfrac : {"linear", "down", "up", "bankers", "nearest", "halfdown", "midpoint"}, optional
    to indicate what type of rounding to use for quantiles equal or above 50 percent. Default is "linear"
qHint : {"int", "midpoint"}, optional
    to indicate the use of the integer or the midpoint method for quantiles equal or above 50 percent. Default is "int"
Set method to "own" to use the indexMethod, qLfrac, qLint, qHfrac and qHint parameters as given, or to any of the methods listed in the notes, in which case these five parameters are ignored.
Returns
results : pandas series with the quantiles; if levels is used, a tuple with the quantiles and a list of their text versions
Notes
To determine the quantiles a specific indexing method can be used. See he_quantileIndexing() for details on the different methods to choose from.
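As a rough illustration of what an indexing method does, the sketch below turns a proportion p and a sample size n into a 1-based position in the sorted data. The formulas are those given by Hyndman and Fan (1996) for their types 4 to 9, matched to the indexMethod names through the equivalences listed in these notes. The function name `quantile_index` is hypothetical and this is not the code of he_quantileIndexing() itself.

```python
def quantile_index(n, p, index_method="sas1"):
    """Return the 1-based (possibly fractional) position for proportion p.

    Hypothetical helper: formulas follow Hyndman & Fan (1996), types 4 to 9.
    """
    formulas = {
        "sas1": lambda: n * p,                    # hf4
        "hl": lambda: n * p + 1 / 2,              # hf5
        "sas4": lambda: (n + 1) * p,              # hf6
        "excel": lambda: (n - 1) * p + 1,         # hf7
        "hf8": lambda: (n + 1 / 3) * p + 1 / 3,
        "hf9": lambda: (n + 1 / 4) * p + 3 / 8,
    }
    return formulas[index_method]()


# Position of the first quartile (p = 0.25) in a sample of six scores
for name in ["sas1", "hl", "sas4", "excel", "hf8", "hf9"]:
    print(name, round(quantile_index(6, 0.25, name), 4))
```

For six scores the position of the first quartile ranges from 1.5 (sas1) to 2.25 (excel), which is why the methods can return different values for the same data.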
Then based on the indexes either linear interpolation or different rounding methods (bankers, nearest, down, up, half-down) can be used, or the midpoint between the two values. If the index is an integer either the integer or the midpoint is used.
See he_quantileIndex() for details on this.
Note that the rounding method can even vary per quantile, i.e. the one used for quantiles below the median can differ from the one used for those equal to or above it.
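To make the fractional options concrete, here is a minimal sketch of how a fractional 1-based index could be turned into a score. It assumes that nearest rounds halves up, halfdown rounds halves down, and bankers rounds halves to the even index. The function `value_at_index` is hypothetical; the actual rules are those of he_quantileIndex().

```python
import math


def value_at_index(sorted_data, idx, frac="linear"):
    """Pick or interpolate a score for a 1-based index (hypothetical helper)."""
    lower = math.floor(idx)
    upper = math.ceil(idx)
    if lower == upper:                      # integer index: take that score
        return sorted_data[lower - 1]
    low_val = sorted_data[lower - 1]
    high_val = sorted_data[upper - 1]
    part = idx - lower                      # fractional part of the index
    if frac == "linear":
        return low_val + part * (high_val - low_val)
    if frac == "midpoint":
        return (low_val + high_val) / 2
    if frac == "down":
        return low_val
    if frac == "up":
        return high_val
    if frac == "nearest":                   # assumed: halves go up
        return high_val if part >= 0.5 else low_val
    if frac == "halfdown":                  # assumed: halves go down
        return high_val if part > 0.5 else low_val
    if frac == "bankers":                   # assumed: halves go to the even index
        if part == 0.5:
            return low_val if lower % 2 == 0 else high_val
        return high_val if part > 0.5 else low_val
    raise ValueError(f"unknown rounding method: {frac}")


scores = [10, 20, 30, 40]
for rule in ["linear", "midpoint", "down", "up", "nearest", "halfdown", "bankers"]:
    print(rule, value_at_index(scores, 2.5, rule))
```

With an index of 2.5 the candidates are the second and third score, so the rules only disagree on whether to return 20, 30 or the value halfway between them.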
I've come across the following methods:
|method|indexing|q1 integer|q1 fractional|q3 integer|q3 fractional|
|------|--------|----------|-------------|----------|-------------|
|sas1|sas1|use int|linear|use int|linear|
|sas2|sas1|use int|bankers|use int|bankers|
|sas3|sas1|use int|up|use int|up|
|sas5|sas1|midpoint|up|midpoint|up|
|hf3b|sas1|use int|nearest|use int|halfdown|
|sas4|sas4|use int|linear|use int|linear|
|ms|sas4|use int|nearest|use int|halfdown|
|lohninger|sas4|use int|nearest|use int|nearest|
|hl2|hl|use int|linear|use int|linear|
|hl1|hl|use int|midpoint|use int|midpoint|
|excel|excel|use int|linear|use int|linear|
|pd2|excel|use int|down|use int|down|
|pd3|excel|use int|up|use int|up|
|pd4|excel|use int|halfdown|use int|nearest|
|pd5|excel|use int|midpoint|use int|midpoint|
|hf8|hf8|use int|linear|use int|linear|
|hf9|hf9|use int|linear|use int|linear|
The following values can be used for the method parameter:
- sas1 = parzen = hf4 = interpolated_inverted_cdf = maple3 = r4. (Parzen, 1979, p. 108; SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 363)
- sas2 = hf3 = r3. (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
- sas3 = hf1 = inverted_cdf = maple1 = r1 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
- sas4 = hf6 = minitab = snedecor = weibull = maple5 = r6 (Hyndman & Fan, 1996, p. 363; Weibull, 1939, p. ?; Snedecor, 1940, p. 43; SAS, 1990, p. 626)
- sas5 = hf2 = cdf = averaged_inverted_cdf = r2 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
- hf3b = closest_observation
- ms (Mendenhall & Sincich, 1992, p. 35)
- lohninger (Lohninger, n.d.)
- hl1 (Hogg & Ledolter, 1992, p. 21)
- hl2 = hf5 = hazen = maple4 = r5 (Hogg & Ledolter, 1992, p. 21; Hazen, 1914, p. ?)
- maple2
- excel = hf7 = pd1 = linear = gumbel = maple6 = r7 (Hyndman & Fan, 1996, p. 363; Freund & Perles, 1987, p. 201; Gumbel, 1939, p. ?)
- pd2 = lower
- pd3 = higher
- pd4 = nearest
- pd5 = midpoint
- hf8 = median_unbiased = maple7 = r8 (Hyndman & Fan, 1996, p. 363)
- hf9 = normal_unbiased = maple8 = r9 (Hyndman & Fan, 1996, p. 363)
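The many alternative names boil down to a small set of canonical methods. The sketch below shows one way such a lookup could be organised, using only names from the list above. The dictionary and the function `canonical_method` are hypothetical, and lower-casing the input is an added convenience rather than documented behaviour.

```python
# Canonical method name -> alternative names taken from the list above
ALIASES = {
    "sas1": ["parzen", "hf4", "interpolated_inverted_cdf", "maple3", "r4"],
    "sas2": ["hf3", "r3"],
    "sas3": ["hf1", "inverted_cdf", "maple1", "r1"],
    "sas4": ["hf6", "minitab", "snedecor", "weibull", "maple5", "r6"],
    "sas5": ["hf2", "cdf", "averaged_inverted_cdf", "r2"],
    "hf3b": ["closest_observation"],
    "hl2": ["hf5", "hazen", "maple4", "r5"],
    "excel": ["hf7", "pd1", "linear", "gumbel", "maple6", "r7"],
    "pd2": ["lower"],
    "pd3": ["higher"],
    "pd4": ["nearest"],
    "pd5": ["midpoint"],
    "hf8": ["median_unbiased", "maple7", "r8"],
    "hf9": ["normal_unbiased", "maple8", "r9"],
}

# Invert to a flat alias -> canonical lookup
LOOKUP = {alias: name for name, aliases in ALIASES.items() for alias in aliases}


def canonical_method(method):
    """Return the canonical name for a method alias (hypothetical helper)."""
    key = method.lower()
    # Names that are not aliases (e.g. "ms", "own") are returned unchanged
    return LOOKUP.get(key, key)


print(canonical_method("r7"))      # excel
print(canonical_method("Hazen"))   # hl2
print(canonical_method("ms"))      # ms
```

A single dictionary keeps the alias list and the lookup in one place, so adding a new alias does not require touching any branching logic.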
hf is short for Hyndman and Fan who wrote an article showcasing many different methods, hl is short for Hogg and Ledolter, ms is short for Mendenhall and Sincich, jf is short for Joarder and Firozzaman. sas refers to the software package SAS, maple to Maple, pd to Python's pandas library, and r to R.
The names linear, lower, higher, nearest and midpoint are all used by pandas quantile function and numpy percentile function. Numpy also uses inverted_cdf, averaged_inverted_cdf, closest_observation, interpolated_inverted_cdf, hazen, weibull, median_unbiased, and normal_unbiased.
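Python's standard library also offers two of these conventions. Going by its documentation, statistics.quantiles with method="exclusive" places the i-th of n sorted points at i/(n+1), which appears to correspond to the sas4 (hf6) indexing, while method="inclusive" uses (i-1)/(n-1), which appears to correspond to excel (hf7). Treat that mapping as an assumption. A small comparison on six scores, using nothing from stikpetP:

```python
from statistics import quantiles

scores = [1, 2, 3, 4, 5, 6]

# 'exclusive' places the i-th of n sorted points at i / (n + 1)
q_exclusive = quantiles(scores, n=4, method="exclusive")

# 'inclusive' treats the minimum and maximum as the 0th and 100th percentile
q_inclusive = quantiles(scores, n=4, method="inclusive")

print(q_exclusive)  # [1.75, 3.5, 5.25]
print(q_inclusive)  # [2.25, 3.5, 4.75]
```

Note that statistics.quantiles returns only the k-1 cut points, whereas the examples on this page show k+1 values that also include the first and last score.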
Before, After and Alternatives
Before this measure you might want an impression using a frequency table or a visualisation:
* tab_frequency for a frequency table
* vi_bar_stacked_single for Single Stacked Bar-Chart
* vi_bar_dual_axis for Dual-Axis Bar Chart
After this you might want some other descriptive measures:
* me_consensus for the Consensus
* me_hodges_lehmann_os for the Hodges-Lehmann Estimate (One-Sample)
* me_median for the Median
* me_quartiles for Quartiles / Hinges
* me_quartile_range for Interquartile Range, Semi-Interquartile Range and Mid-Quartile Range
or perform a test:
* ts_sign_os for One-Sample Sign Test
* ts_trinomial_os for One-Sample Trinomial Test
* ts_wilcoxon_os for Wilcoxon Signed Rank Test (One-Sample)
For more information on the quantile indexing methods and the index itself:
* he_quantileIndexing
* he_quantileIndex
References
Freund, J. E., & Perles, B. M. (1987). A new look at quartiles of ungrouped data. The American Statistician, 41(3), 200–203. doi:10.1080/00031305.1987.10475479
Galton, F. (1881). Report of the anthropometric committee. Report of the British Association for the Advancement of Science, 51, 225–272.
Gumbel, E. J. (1939). La Probabilité des Hypothèses. Comptes Rendus de l’Académie des Sciences, 209, 645–647.
Hazen, A. (1914). Storage to be provided in impounding municipal water supply. Transactions of the American Society of Civil Engineers, 77(1), 1539–1640. doi:10.1061/taceat.0002563
Hogg, R. V., & Ledolter, J. (1992). Applied statistics for engineers and physical scientists (2nd int.). Macmillan.
Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in statistical packages. The American Statistician, 50(4), 361–365. doi:10.2307/2684934
Langford, E. (2006). Quartiles in elementary statistics. Journal of Statistics Education, 14(3), 1–17. doi:10.1080/10691898.2006.11910589
Lohninger, H. (n.d.). Quartile. Fundamentals of Statistics. Retrieved April 7, 2023, from http://www.statistics4u.com/fundstat_eng/cc_quartile.html
McAlister, D. (1879). The law of the geometric mean. Proceedings of the Royal Society of London, 29(196–199), 367–376. doi:10.1098/rspl.1879.0061
Mendenhall, W., & Sincich, T. (1992). Statistics for engineering and the sciences (3rd ed.). Dellen Publishing Company.
Parzen, E. (1979). Nonparametric statistical data modeling. Journal of the American Statistical Association, 74(365), 105–121. doi:10.1080/01621459.1979.10481621
SAS. (1990). SAS procedures guide: Version 6 (3rd ed.). SAS Institute.
Siegel, A. F., & Morgan, C. J. (1996). Statistics and data analysis: An introduction (2nd ed.). J. Wiley.
Snedecor, G. W. (1940). Statistical methods applied to experiments in agriculture and biology (3rd ed.). The Iowa State College Press.
Vining, G. G. (1998). Statistical methods for engineers. Duxbury Press.
Weibull, W. (1939). The phenomenon of rupture in solids. Ingeniörs Vetenskaps Akademien, 153, 1–55.
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
Example 1: Text Pandas Series
>>> import pandas as pd
>>> student_df = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = student_df['Teach_Motivate']
>>> order = {"Fully Disagree":1, "Disagree":2, "Neither disagree nor agree":3, "Agree":4, "Fully agree":5}
>>> me_quantiles(ex1, levels=order)
(0    1.0
1    1.0
2    2.0
3    3.0
4    5.0
dtype: float64, ['Fully Disagree', 'Fully Disagree', 'Disagree', 'Neither disagree nor agree', 'Fully agree'])
Example 2: Numeric data
>>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
>>> me_quantiles(ex2)
0    1.0
1    2.0
2    4.0
3    5.0
4    5.0
dtype: float64
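The output of Example 2 can be traced by hand. The sketch below assumes the default settings behave like sas1 indexing (position n × p) with linear interpolation, and that positions are clamped to the first and last score. `quantiles_sas1_linear` is a hypothetical stand-in, not the package's implementation.

```python
import math


def quantiles_sas1_linear(data, k=4):
    """k-quantiles via position n*p and linear interpolation (illustrative only)."""
    values = sorted(data)
    n = len(values)
    result = []
    for i in range(k + 1):
        pos = n * i / k                     # 1-based position, may be fractional
        pos = min(max(pos, 1), n)           # assumed: clamp to the data range
        lower = math.floor(pos)
        upper = math.ceil(pos)
        low_val = values[lower - 1]
        high_val = values[upper - 1]
        result.append(float(low_val + (pos - lower) * (high_val - low_val)))
    return result


ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
print(quantiles_sas1_linear(ex2))  # [1.0, 2.0, 4.0, 5.0, 5.0]
```

With 18 scores the positions are 4.5, 9 and 13.5. The 4th and 5th scores are both 2, the 9th score is 4, and the 13th and 14th are both 5, which gives the three middle values shown above.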
Example 3: Text data
>>> ex3 = ["a", "b", "f", "d", "e", "c"]
>>> order = {"a":1, "b":2, "c":3, "d":4, "e":5, "f":6}
>>> me_quantiles(ex3, levels=order)
(0    1.0
1    1.5
2    3.0
3    4.5
4    6.0
dtype: float64, ['a', 'between a and b', 'c', 'between d and e', 'f'])
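The text versions in Examples 1 and 3 come from mapping each numeric quantile back to its label. A minimal sketch of that step, assuming every code in levels is a distinct integer; `label_for` is a hypothetical name:

```python
import math


def label_for(value, levels):
    """Translate a numeric quantile into a label via the levels coding."""
    by_code = {code: label for label, code in levels.items()}
    if value == round(value):
        return by_code[int(value)]
    # A fractional quantile falls between two adjacent categories
    return f"between {by_code[math.floor(value)]} and {by_code[math.ceil(value)]}"


order = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6}
print([label_for(q, order) for q in [1.0, 1.5, 3.0, 4.5, 6.0]])
# ['a', 'between a and b', 'c', 'between d and e', 'f']
```

Inverting the dictionary once avoids searching the list of values for every quantile.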