Module `stikpetP.measures.meas_quartiles`

import pandas as pd
import math
from ..helper.help_quartileIndex import he_quartileIndex

#This function is used in me_quartile_range

def me_quartiles(data, levels=None, method="own", indexMethod="sas1", q1Frac="linear", q1Int="int", q3Frac="linear", q3Int="int"):
    '''
    Quartiles and Hinges
    -------------------- 
    The quartiles are at quarters of the data (McAlister, 1879, p. 374; Galton, 1881, p. 245). The median is at 50%, and the quartiles at 25% and 75%. Note that there are five quartiles: the minimum value is the 0-quartile, the first (or lower) quartile is at 25%, the median (a.k.a. the second quartile) is at 50%, the third (or upper) quartile is at 75%, and the maximum is the fourth quartile.
    
    Tukey (1977) also introduced the term Hinges and sorted the values in a W shape, where the bottom parts of the W are then the hinges.
    
    There are quite a few different methods to determine the quartiles. This function offers 20 named methods. See the notes for a description.

    This function is shown in this [YouTube video](https://youtu.be/iI07nJ3wlOQ) and the measure is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Measures/Quantiles.html)
    
    Parameters
    ----------
    data : list or pandas series
    levels : dictionary, optional 
        coding to use, mapping each text label to its numeric code
    method : string, optional 
        which method to use to calculate the quartiles. Default is "own"
    indexMethod : {"sas1", "inclusive", "exclusive", "sas4", "excel", "hl", "hf8", "hf9"}, optional 
        which type of indexing to use. Default is "sas1"
    q1Frac : {"linear", "down", "up", "bankers", "nearest", "halfdown", "midpoint"}, optional 
        which type of rounding to use for the first quartile when its index is fractional. Default is "linear"
    q1Int : {"int", "midpoint"}, optional 
        whether to use the integer or the midpoint method for the first quartile when its index is an integer. Default is "int"
    q3Frac : {"linear", "down", "up", "bankers", "nearest", "halfdown", "midpoint"}, optional 
        which type of rounding to use for the third quartile when its index is fractional. Default is "linear"
    q3Int : {"int", "midpoint"}, optional  
        whether to use the integer or the midpoint method for the third quartile when its index is an integer. Default is "int"
    
    Set method to "own" to use the indexMethod, q1Frac, q1Int, q3Frac and q3Int parameters, or to any of the method names listed in the notes. A named method overrides those five parameters.
    
    Returns
    -------
    pandas.DataFrame
        A dataframe with the following columns:
    
        * Q1, the numeric value of the first quartile
        * Q3, the numeric value of the third quartile
        * Q1 text, text version of the first quartile (only if levels are used)
        * Q3 text, text version of the third quartile (only if levels are used)
    
    Notes
    -----
    To determine the quartiles, an indexing method is first used to find the index (position) of each quartile in the sorted data. See **he_quartileIndexing()** for details on the different methods to choose from.
    
    If that index is fractional, either linear interpolation, one of the rounding methods (bankers, nearest, down, up, half-down), or the midpoint between the two neighbouring values is used. If the index is an integer, either the value at that index or the midpoint is used.
    
    See **he_quartileIndex()** for details on this.
    
    Note that the rounding method can even vary per quartile, i.e. the one used for the first quartile can differ from the one used for the third quartile.

    I've come across the following methods:

    |method|indexing|q1 integer|q1 fractional|q3 integer|q3 fractional|
    |------|--------|----------|-------------|----------|-------------|
    |sas1|sas1|use int|linear|use int|linear|
    |sas2|sas1|use int|bankers|use int|bankers|
    |sas3|sas1|use int|up|use int|up|
    |sas5|sas1|midpoint|up|midpoint|up|
    |hf3b|sas1|use int|nearest|use int|halfdown|
    |sas4|sas4|use int|linear|use int|linear|
    |ms|sas4|use int|nearest|use int|halfdown|
    |lohninger|sas4|use int|nearest|use int|nearest|
    |hl2|hl|use int|linear|use int|linear|
    |hl1|hl|use int|midpoint|use int|midpoint|
    |excel|excel|use int|linear|use int|linear|
    |pd2|excel|use int|down|use int|down|
    |pd3|excel|use int|up|use int|up|
    |pd4|excel|use int|halfdown|use int|nearest|
    |pd5|excel|use int|midpoint|use int|midpoint|
    |hf8|hf8|use int|linear|use int|linear|
    |hf9|hf9|use int|linear|use int|linear|

    The following values can be used for the *method* parameter (names are matched exactly, in lower case):

    1. inclusive = tukey = hinges = vining (Tukey, 1977, p. 32; Siegel & Morgan, 1996, p. 77; Vining, 1998, p. 44)
    1. exclusive = jf (Moore & McCabe, 1989, p. 33; Joarder & Firozzaman, 2001, p. 88)
    1. sas1 = parzen = hf4 = interpolated_inverted_cdf = maple3 = r4 (Parzen, 1979, p. 108; SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 363)
    1. sas2 = hf3 = r3 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
    1. sas3 = hf1 = inverted_cdf = maple1 = r1 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
    1. sas4 = hf6 = minitab = snedecor = weibull = maple5 = r6 (Hyndman & Fan, 1996, p. 363; Weibull, 1939, p. ?; Snedecor, 1940, p. 43; SAS, 1990, p. 626)
    1. sas5 = hf2 = cdf = averaged_inverted_cdf = r2 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
    1. hf3b = closest_observation
    1. ms (Mendenhall & Sincich, 1992, p. 35)
    1. lohninger (Lohninger, n.d.)
    1. hl1 (Hogg & Ledolter, 1992, p. 21)
    1. hl2 = hf5 = hazen = maple4 = r5 (Hogg & Ledolter, 1992, p. 21; Hazen, 1914, p. ?)
    1. maple2
    1. excel = hf7 = pd1 = linear = gumbel = maple6 = r7 (Hyndman & Fan, 1996, p. 363; Freund & Perles, 1987, p. 201; Gumbel, 1939, p. ?)
    1. pd2 = lower
    1. pd3 = higher
    1. pd4 = nearest
    1. pd5 = midpoint
    1. hf8 = median_unbiased = maple7 = r8 (Hyndman & Fan, 1996, p. 363)
    1. hf9 = normal_unbiased = maple8 = r9 (Hyndman & Fan, 1996, p. 363)

    *hf* is short for Hyndman and Fan, who wrote an article showcasing many different methods; *hl* is short for Hogg and Ledolter, *ms* for Mendenhall and Sincich, and *jf* for Joarder and Firozzaman. *sas* refers to the software package SAS, *maple* to Maple, *pd* to Python's pandas library, and *r* to R.
    
    The names *linear*, *lower*, *higher*, *nearest* and *midpoint* are all used by pandas quantile function and numpy percentile function. Numpy also uses *inverted_cdf*, *averaged_inverted_cdf*, *closest_observation*, *interpolated_inverted_cdf*, *hazen*, *weibull*, *median_unbiased*, and *normal_unbiased*. 

    Before, After and Alternatives
    ------------------------------
    Before this measure you might want an impression using a frequency table or a visualisation:
    * [tab_frequency](../other/table_frequency.html#tab_frequency) for a frequency table
    * [vi_bar_stacked_single](../visualisations/vis_bar_stacked_single.html#vi_bar_stacked_single) for Single Stacked Bar-Chart
    * [vi_bar_dual_axis](../visualisations/vis_bar_dual_axis.html#vi_bar_dual_axis) for Dual-Axis Bar Chart

    After this you might want some other descriptive measures:
    * [me_consensus](../measures/meas_consensus.html#me_consensus) for the Consensus
    * [me_hodges_lehmann_os](../measures/meas_hodges_lehmann_os.html#me_hodges_lehmann_os) for the Hodges-Lehmann Estimate (One-Sample)
    * [me_median](../measures/meas_median.html#me_median) for the Median
    * [me_quantiles](../measures/meas_quantiles.html#me_quantiles) for Quantiles
    * [me_quartile_range](../measures/meas_quartile_range.html#me_quartile_range) for Interquartile Range, Semi-Interquartile Range and Mid-Quartile Range
    
    or perform a test:
    * [ts_sign_os](../tests/test_sign_os.html#ts_sign_os) for One-Sample Sign Test
    * [ts_trinomial_os](../tests/test_trinomial_os.html#ts_trinomial_os) for One-Sample Trinomial Test
    * [ts_wilcoxon_os](../tests/test_wilcoxon_os.html#ts_wilcoxon_os) for Wilcoxon Signed Rank Test (One-Sample)

    For more information on the quartile indexing methods and index itself:
    * [he_quartileIndexing](../helper/help_quartileIndexing.html#he_quartileIndexing)
    * [he_quartileIndex](../helper/help_quartileIndex.html#he_quartilesIndex)
    
    References
    ----------
    Freund, J. E., & Perles, B. M. (1987). A new look at quartiles of ungrouped data. *The American Statistician, 41*(3), 200–203. doi:10.1080/00031305.1987.10475479

    Galton, F. (1881). Report of the anthropometric committee. *Report of the British Association for the Advancement of Science, 51*, 225–272.

    Gumbel, E. J. (1939). La Probabilité des Hypothèses. *Comptes Rendus de l’Académie des Sciences, 209*, 645–647.

    Hazen, A. (1914). Storage to be provided in impounding municipal water supply. *Transactions of the American Society of Civil Engineers, 77*(1), 1539–1640. doi:10.1061/taceat.0002563

    Hogg, R. V., & Ledolter, J. (1992). *Applied statistics for engineers and physical scientists* (2nd int.). Macmillan.

    Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in statistical packages. *The American Statistician, 50*(4), 361–365. doi:10.2307/2684934

    Joarder, A. H., & Firozzaman, M. (2001). Quartiles for discrete data. *Teaching Statistics, 23*(3), 86–89. doi:10.1111/1467-9639.00063

    Langford, E. (2006). Quartiles in elementary statistics. *Journal of Statistics Education, 14*(3), 1–17. doi:10.1080/10691898.2006.11910589

    Lohninger, H. (n.d.). Quartile. Fundamentals of Statistics. Retrieved April 7, 2023, from http://www.statistics4u.com/fundstat_eng/cc_quartile.html

    McAlister, D. (1879). The law of the geometric mean. *Proceedings of the Royal Society of London, 29*(196–199), 367–376. doi:10.1098/rspl.1879.0061

    Mendenhall, W., & Sincich, T. (1992). *Statistics for engineering and the sciences* (3rd ed.). Dellen Publishing Company.

    Moore, D. S., & McCabe, G. P. (1989). *Introduction to the practice of statistics*. W.H. Freeman.

    Parzen, E. (1979). Nonparametric statistical data modeling. *Journal of the American Statistical Association, 74*(365), 105–121. doi:10.1080/01621459.1979.10481621

    SAS. (1990). SAS procedures guide: Version 6 (3rd ed.). SAS Institute.

    Siegel, A. F., & Morgan, C. J. (1996). *Statistics and data analysis: An introduction* (2nd ed.). J. Wiley.

    Snedecor, G. W. (1940). *Statistical methods applied to experiments in agriculture and biology* (3rd ed.). The Iowa State College Press.

    Tukey, J. W. (1977). *Exploratory data analysis*. Addison-Wesley Pub. Co.

    Vining, G. G. (1998). *Statistical methods for engineers*. Duxbury Press.

    Weibull, W. (1939). *The phenomenon of rupture in solids*. Ingeniörs Vetenskaps Akademien, 153, 1–55.
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    Example 1: Text Pandas Series
    >>> import pandas as pd
    >>> df2 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> ex1 = df2['Teach_Motivate']
    >>> order = {"Fully Disagree":1, "Disagree":2, "Neither disagree nor agree":3, "Agree":4, "Fully agree":5}
    >>> me_quartiles(ex1, levels=order)
        Q1   Q3         Q1 text                     Q3 text
    0  1.0  3.0  Fully Disagree  Neither disagree nor agree
    
    Example 2: Numeric data
    >>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
    >>> me_quartiles(ex2)
        Q1   Q3
    0  2.0  5.0
    
    '''
    if isinstance(data, list):
        data = pd.Series(data)
        
    data = data.dropna()
    if levels is not None:
        dataN = data.map(levels).astype('Int8')
    else:
        dataN = pd.to_numeric(data)
    
    dataN = dataN.sort_values().reset_index(drop=True)
    #dataN = list(dataN)
    
    #alternative namings
    if method in ["inclusive", "tukey", "vining", "hinges"]:
        method="inclusive"
    elif method in ["exclusive", "jf"]:
        method ="exclusive"
    elif method in ["cdf", "sas5", "hf2", "averaged_inverted_cdf", "r2"]:
        method = "sas5"
    elif method in ["sas4", "minitab", "hf6", "snedecor", "weibull", "maple5", "r6"]:
        method = "sas4"
    elif method in ["excel", "hf7", "pd1", "linear", "gumbel", "maple6", "r7"]:
        method = "excel"
    elif method in ["sas1", "parzen", "hf4", "interpolated_inverted_cdf", "maple3", "r4"]:
        method = "sas1"
    elif method in ["sas2", "hf3", "r3"]:
        method = "sas2"
    elif method in ["sas3", "hf1", "inverted_cdf", "maple1", "r1"]:
        method = "sas3"
    elif method in ["hf3b", "closest_observation"]:
        method = "hf3b"
    elif method in ["hl2", "hazen", "hf5", "maple4", "r5"]:
        method = "hl2"
    elif method in ["np", "midpoint", "pd5"]:
        method = "pd5"
    elif method in ["hf8", "median_unbiased", "maple7", "r8"]:
        method = "hf8"
    elif method in ["hf9", "normal_unbiased", "maple8", "r9"]:
        method = "hf9"
    elif method in ["pd2", "lower"]:
        method = "pd2"
    elif method in ["pd3", "higher"]:
        method = "pd3"
    elif method in ["pd4", "nearest"]:
        method = "pd4"
    
    #settings
    settings = [indexMethod, q1Frac, q1Int, q3Frac, q3Int]
    if method=="inclusive":
        settings = ["inclusive", "linear","int","linear","int"]
    elif method=="exclusive":
        settings = ["exclusive", "linear","int","linear","int"]
    elif method=="sas1":
        settings = ["sas1","linear","int","linear","int"]
    elif method=="sas2":
        settings = ["sas1","bankers","int","bankers" ,"int"]
    elif method=="sas3":
        settings = ["sas1","up","int","up","int"]
    elif method=="sas5":
        settings = ["sas1","up","midpoint","up","midpoint"]
    elif method=="sas4":    
        settings = ["sas4","linear", "int","linear","int"]
    elif method=="ms": 
        settings = ["sas4", "nearest","int", "halfdown","int"]
    elif method=="lohninger":
        settings = ["sas4", "nearest", "int","nearest","int"]
    elif method=="hl2":
        settings = ["hl", "linear", "int","linear","int"]
    elif method=="hl1":
        settings = ["hl", "midpoint","int", "midpoint","int"]
    elif method=="excel":
        settings = ["excel", "linear","int","linear", "int"]
    elif method=="pd2":
        settings = ["excel", "down", "int", "down","int"]
    elif method=="pd3":
        settings = ["excel", "up","int","up","int"]
    elif method=="pd4":
        settings = ["excel", "halfdown",  "int","nearest", "int"]
    elif method=="hf3b":
        settings = ["sas1", "nearest","int","halfdown","int"]
    elif method=="pd5":
        settings = ["excel", "midpoint","int","midpoint","int"]
    elif method=="hf8":
        settings = ["hf8", "linear","int","linear", "int"]
    elif method=="hf9":
        settings = ["hf9", "linear","int","linear", "int"]
    elif method=="maple2":
        settings = ["hl", "down","int","down", "int"]
    
    q1, q3 = he_quartileIndex(dataN, settings[0], settings[1], settings[2], settings[3], settings[4])
    
    #find the text representatives
    
    if levels is not None:
        if q1 == round(q1):
            q1T = list(levels.keys())[list(levels.values()).index(q1)]

        else:
            #q1 is a value, not a position in dataN, so look up the neighbouring codes directly (as is done for q3)
            q1T = "between " + list(levels.keys())[list(levels.values()).index(math.floor(q1))] + " and " + list(levels.keys())[list(levels.values()).index(math.ceil(q1))]

        if q3 == round(q3):
            q3T = list(levels.keys())[list(levels.values()).index(q3)]

        else:
            q3T = "between " + list(levels.keys())[list(levels.values()).index(math.floor(q3))] + " and " + list(levels.keys())[list(levels.values()).index(math.ceil(q3))]
        
        
        results = pd.DataFrame([[q1, q3, q1T, q3T]], columns=["Q1", "Q3", "Q1 text", "Q3 text"])
    else:
        results = pd.DataFrame([[q1, q3]], columns=["Q1", "Q3"])
        
    pd.set_option('display.max_colwidth', None)
    
    return results

Functions

def me_quartiles(data, levels=None, method='own', indexMethod='sas1', q1Frac='linear', q1Int='int', q3Frac='linear', q3Int='int')

Quartiles and Hinges

The quartiles are at quarters of the data (McAlister, 1879, p. 374; Galton, 1881, p. 245). The median is at 50%, and the quartiles at 25% and 75%. Note that there are five quartiles: the minimum value is the 0-quartile, the first (or lower) quartile is at 25%, the median (a.k.a. the second quartile) is at 50%, the third (or upper) quartile is at 75%, and the maximum is the fourth quartile.

Tukey (1977) also introduced the term Hinges and sorted the values in a W shape, where the bottom parts of the W are then the hinges.
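
As a rough illustration of the hinge idea, the sketch below applies Tukey's depth rule: the median sits at depth (n + 1) / 2, and each hinge at depth (floor(median depth) + 1) / 2, counted from either end of the sorted data. The helper name `tukey_hinges` is hypothetical and not part of this package, and the package's own "inclusive" method may be implemented differently.

```python
import math


def tukey_hinges(values):
    """Return (lower hinge, upper hinge) using Tukey's depth rule.

    Hypothetical helper for illustration only; not part of stikpetP.
    """
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one value")

    median_depth = (n + 1) / 2
    # The hinge depth can end in .5, meaning "average two neighbours".
    hinge_depth = (math.floor(median_depth) + 1) / 2
    near, far = math.floor(hinge_depth), math.ceil(hinge_depth)

    # Depths are 1-based and counted inwards from each end.
    lower = (xs[near - 1] + xs[far - 1]) / 2
    upper = (xs[n - near] + xs[n - far]) / 2
    return lower, upper


print(tukey_hinges([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]))  # (30.0, 80.0)
print(tukey_hinges([10, 20, 30, 40, 50, 60, 70]))               # (25.0, 55.0)
```

With an odd number of values the median is counted in both halves, which is why the seven-value example averages two neighbours for each hinge.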

There are quite a few different methods to determine the quartiles. This function offers 20 named methods. See the notes for a description.

This function is shown in this YouTube video and the measure is also described at PeterStatistics.com

Parameters

data : list or pandas series
levels : dictionary, optional: coding to use, mapping each text label to its numeric code
method : string, optional: which method to use to calculate the quartiles. Default is "own"
indexMethod : {"sas1", "inclusive", "exclusive", "sas4", "excel", "hl", "hf8", "hf9"}, optional: which type of indexing to use. Default is "sas1"
q1Frac : {"linear", "down", "up", "bankers", "nearest", "halfdown", "midpoint"}, optional: which type of rounding to use for the first quartile when its index is fractional. Default is "linear"
q1Int : {"int", "midpoint"}, optional: whether to use the integer or the midpoint method for the first quartile when its index is an integer. Default is "int"
q3Frac : {"linear", "down", "up", "bankers", "nearest", "halfdown", "midpoint"}, optional: which type of rounding to use for the third quartile when its index is fractional. Default is "linear"
q3Int : {"int", "midpoint"}, optional: whether to use the integer or the midpoint method for the third quartile when its index is an integer. Default is "int"

Set method to "own" to use the indexMethod, q1Frac, q1Int, q3Frac and q3Int parameters, or to any of the method names listed in the notes. A named method overrides those five parameters.

Returns

pandas.DataFrame

A dataframe with the following columns:

Q1, the numeric value of the first quartile
Q3, the numeric value of the third quartile
Q1 text, text version of the first quartile (only if levels are used)
Q3 text, text version of the third quartile (only if levels are used)
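
The text columns come from looking the numeric quartile up in the levels coding. The sketch below shows one way such a lookup could work, modelled loosely on the function's source; the helper name `quartile_label` is hypothetical and not part of the package.

```python
import math


def quartile_label(value, levels):
    """Map a numeric quartile back to its text label(s).

    Hypothetical helper for illustration only; not part of stikpetP.
    """
    # Reverse the coding: numeric code -> text label.
    by_code = {code: label for label, code in levels.items()}
    if value == round(value):
        return by_code[int(value)]
    # A quartile that falls between two codes is described by both labels.
    return "between " + by_code[math.floor(value)] + " and " + by_code[math.ceil(value)]


order = {"Fully Disagree": 1, "Disagree": 2, "Neither disagree nor agree": 3,
         "Agree": 4, "Fully agree": 5}
print(quartile_label(1.0, order))  # Fully Disagree
print(quartile_label(2.5, order))  # between Disagree and Neither disagree nor agree
```

A fractional quartile can only arise when the chosen method interpolates or takes a midpoint; the rounding options always land on an existing code.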

Notes

To determine the quartiles, an indexing method is first used to find the index (position) of each quartile in the sorted data. See he_quartileIndexing() for details on the different methods to choose from.

If that index is fractional, either linear interpolation, one of the rounding methods (bankers, nearest, down, up, half-down), or the midpoint between the two neighbouring values is used. If the index is an integer, either the value at that index or the midpoint is used.

See he_quartileIndex() for details on this.

Note that the rounding method can even vary per quartile, i.e. the one used for the first quartile can differ from the one used for the third quartile.
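
The two steps described above (find an index, then turn it into a value) can be sketched as follows. This is a minimal illustration, not the package's helper: `value_at_index` is a hypothetical name, the index n × p is assumed for "sas1" (following Hyndman and Fan's definition 4, which the notes below equate with sas1), and the integer "midpoint" option is assumed to average the value at the index with the next one, as in SAS definition 5. Only four of the fractional options are covered.

```python
import math


def value_at_index(sorted_values, index, frac="linear", whole="int"):
    """Convert a 1-based, possibly fractional index into a quartile value.

    Hypothetical helper for illustration only; not part of stikpetP.
    """
    n = len(sorted_values)
    index = min(max(index, 1), n)  # keep the index inside the data
    lower, upper = math.floor(index), math.ceil(index)

    if lower == upper:  # the index is an integer
        if whole == "midpoint" and lower < n:
            # Assumption: average the value at the index and the next value.
            return (sorted_values[lower - 1] + sorted_values[lower]) / 2
        return float(sorted_values[lower - 1])

    low_val, high_val = sorted_values[lower - 1], sorted_values[upper - 1]
    if frac == "linear":
        return low_val + (index - lower) * (high_val - low_val)
    if frac == "down":
        return float(low_val)
    if frac == "up":
        return float(high_val)
    if frac == "midpoint":
        return (low_val + high_val) / 2
    raise ValueError(f"option not covered by this sketch: {frac}")


data = sorted([1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])
n = len(data)
q1 = value_at_index(data, n * 0.25)  # assumed sas1 index: 18 * 0.25 = 4.5
q3 = value_at_index(data, n * 0.75)  # assumed sas1 index: 18 * 0.75 = 13.5
print(q1, q3)  # 2.0 5.0
```

Under these assumptions the result agrees with the numeric example further down, where the default settings give Q1 = 2.0 and Q3 = 5.0.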

I've come across the following methods:

method	indexing	q1 integer	q1 fractional	q3 integer	q3 fractional
sas1	sas1	use int	linear	use int	linear
sas2	sas1	use int	bankers	use int	bankers
sas3	sas1	use int	up	use int	up
sas5	sas1	midpoint	up	midpoint	up
hf3b	sas1	use int	nearest	use int	halfdown
sas4	sas4	use int	linear	use int	linear
ms	sas4	use int	nearest	use int	halfdown
lohninger	sas4	use int	nearest	use int	nearest
hl2	hl	use int	linear	use int	linear
hl1	hl	use int	midpoint	use int	midpoint
excel	excel	use int	linear	use int	linear
pd2	excel	use int	down	use int	down
pd3	excel	use int	up	use int	up
pd4	excel	use int	halfdown	use int	nearest
pd5	excel	use int	midpoint	use int	midpoint
hf8	hf8	use int	linear	use int	linear
hf9	hf9	use int	linear	use int	linear
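
One way to read the table is as a lookup from a method name to the five settings that the "own" method exposes as parameters. The sketch below copies the rows above into a dictionary; `QuartileSettings`, `PRESETS` and `resolve_settings` are hypothetical names used for illustration, and "use int" in the table is written as the parameter value "int".

```python
from typing import NamedTuple


class QuartileSettings(NamedTuple):
    """One table row; field names follow the function's parameters."""
    indexMethod: str
    q1Int: str
    q1Frac: str
    q3Int: str
    q3Frac: str


S = QuartileSettings  # short alias to keep the rows readable

# Rows copied from the table above, in the same column order.
PRESETS = {
    "sas1": S("sas1", "int", "linear", "int", "linear"),
    "sas2": S("sas1", "int", "bankers", "int", "bankers"),
    "sas3": S("sas1", "int", "up", "int", "up"),
    "sas5": S("sas1", "midpoint", "up", "midpoint", "up"),
    "hf3b": S("sas1", "int", "nearest", "int", "halfdown"),
    "sas4": S("sas4", "int", "linear", "int", "linear"),
    "ms": S("sas4", "int", "nearest", "int", "halfdown"),
    "lohninger": S("sas4", "int", "nearest", "int", "nearest"),
    "hl2": S("hl", "int", "linear", "int", "linear"),
    "hl1": S("hl", "int", "midpoint", "int", "midpoint"),
    "excel": S("excel", "int", "linear", "int", "linear"),
    "pd2": S("excel", "int", "down", "int", "down"),
    "pd3": S("excel", "int", "up", "int", "up"),
    "pd4": S("excel", "int", "halfdown", "int", "nearest"),
    "pd5": S("excel", "int", "midpoint", "int", "midpoint"),
    "hf8": S("hf8", "int", "linear", "int", "linear"),
    "hf9": S("hf9", "int", "linear", "int", "linear"),
}

# The function's documented defaults for the "own" method.
OWN_DEFAULTS = S("sas1", "int", "linear", "int", "linear")


def resolve_settings(method="own", own=OWN_DEFAULTS):
    """Return the preset for a named method, else the caller's own settings."""
    return PRESETS.get(method, own)


print(resolve_settings("sas5"))
print(resolve_settings("own"))
```

Keeping the presets in one dictionary makes it easy to see that several named methods share an indexing rule and differ only in how a fractional or integer index is handled.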

The following values can be used for the method parameter (names are matched exactly, in lower case):

inclusive = tukey = hinges = vining (Tukey, 1977, p. 32; Siegel & Morgan, 1996, p. 77; Vining, 1998, p. 44)
exclusive = jf (Moore & McCabe, 1989, p. 33; Joarder & Firozzaman, 2001, p. 88)
sas1 = parzen = hf4 = interpolated_inverted_cdf = maple3 = r4 (Parzen, 1979, p. 108; SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 363)
sas2 = hf3 = r3 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
sas3 = hf1 = inverted_cdf = maple1 = r1 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
sas4 = hf6 = minitab = snedecor = weibull = maple5 = r6 (Hyndman & Fan, 1996, p. 363; Weibull, 1939, p. ?; Snedecor, 1940, p. 43; SAS, 1990, p. 626)
sas5 = hf2 = cdf = averaged_inverted_cdf = r2 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
hf3b = closest_observation
ms (Mendenhall & Sincich, 1992, p. 35)
lohninger (Lohninger, n.d.)
hl1 (Hogg & Ledolter, 1992, p. 21)
hl2 = hf5 = hazen = maple4 = r5 (Hogg & Ledolter, 1992, p. 21; Hazen, 1914, p. ?)
maple2
excel = hf7 = pd1 = linear = gumbel = maple6 = r7 (Hyndman & Fan, 1996, p. 363; Freund & Perles, 1987, p. 201; Gumbel, 1939, p. ?)
pd2 = lower
pd3 = higher
pd4 = nearest
pd5 = midpoint
hf8 = median_unbiased = maple7 = r8 (Hyndman & Fan, 1996, p. 363)
hf9 = normal_unbiased = maple8 = r9 (Hyndman & Fan, 1996, p. 363)

hf is short for Hyndman and Fan, who wrote an article showcasing many different methods; hl is short for Hogg and Ledolter, ms for Mendenhall and Sincich, and jf for Joarder and Firozzaman. sas refers to the software package SAS, maple to Maple, pd to Python's pandas library, and r to R.

The names linear, lower, higher, nearest and midpoint are all used by pandas quantile function and numpy percentile function. Numpy also uses inverted_cdf, averaged_inverted_cdf, closest_observation, interpolated_inverted_cdf, hazen, weibull, median_unbiased, and normal_unbiased.
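
pandas and NumPy are not the only libraries with named quantile methods: Python's standard library has `statistics.quantiles` with the methods "exclusive" and "inclusive". The sketch below compares those two against linear interpolation on the index rules commonly given for Hyndman and Fan's definitions 6 and 7, (n + 1)p and (n - 1)p + 1. Those index formulas and the helper names are assumptions made for illustration, not taken from this package. Note also that the standard library's "inclusive" and "exclusive" labels need not mean the same as this function's method names of the same spelling.

```python
import math
import statistics


def interpolated_quartiles(values, index_rule):
    """Return (Q1, Q3) by linear interpolation on a 1-based index rule.

    Hypothetical helper for illustration only; not part of stikpetP.
    """
    xs = sorted(values)
    n = len(xs)
    result = []
    for p in (0.25, 0.75):
        h = min(max(index_rule(n, p), 1), n)  # clamp the index to 1..n
        lower, upper = math.floor(h), math.ceil(h)
        low_val, high_val = xs[lower - 1], xs[upper - 1]
        result.append(low_val + (h - lower) * (high_val - low_val))
    return tuple(result)


def hf6_index(n, p):
    # Assumed rule for hf6 (listed above with sas4, weibull and r6).
    return (n + 1) * p


def hf7_index(n, p):
    # Assumed rule for hf7 (listed above with excel, linear and r7).
    return (n - 1) * p + 1


data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

std_exclusive = statistics.quantiles(data, n=4, method="exclusive")
std_inclusive = statistics.quantiles(data, n=4, method="inclusive")

print(interpolated_quartiles(data, hf6_index), (std_exclusive[0], std_exclusive[2]))
print(interpolated_quartiles(data, hf7_index), (std_inclusive[0], std_inclusive[2]))
```

On this data the first line prints the pair 27.5 and 82.5 twice and the second the pair 32.5 and 77.5 twice, so the standard library's "exclusive" lines up with the hf6 rule and its "inclusive" with the hf7 rule.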

Before, After and Alternatives

Before this measure you might want an impression using a frequency table or a visualisation:

* tab_frequency for a frequency table
* vi_bar_stacked_single for Single Stacked Bar-Chart
* vi_bar_dual_axis for Dual-Axis Bar Chart

After this you might want some other descriptive measures:

* me_consensus for the Consensus
* me_hodges_lehmann_os for the Hodges-Lehmann Estimate (One-Sample)
* me_median for the Median
* me_quantiles for Quantiles
* me_quartile_range for Interquartile Range, Semi-Interquartile Range and Mid-Quartile Range

or perform a test:

* ts_sign_os for One-Sample Sign Test
* ts_trinomial_os for One-Sample Trinomial Test
* ts_wilcoxon_os for Wilcoxon Signed Rank Test (One-Sample)

For more information on the quartile indexing methods and index itself:

* he_quartileIndexing
* he_quartileIndex

References

Freund, J. E., & Perles, B. M. (1987). A new look at quartiles of ungrouped data. The American Statistician, 41(3), 200–203. doi:10.1080/00031305.1987.10475479

Galton, F. (1881). Report of the anthropometric committee. Report of the British Association for the Advancement of Science, 51, 225–272.

Gumbel, E. J. (1939). La Probabilité des Hypothèses. Comptes Rendus de l’Académie des Sciences, 209, 645–647.

Hazen, A. (1914). Storage to be provided in impounding municipal water supply. Transactions of the American Society of Civil Engineers, 77(1), 1539–1640. doi:10.1061/taceat.0002563

Hogg, R. V., & Ledolter, J. (1992). Applied statistics for engineers and physical scientists (2nd int.). Macmillan.

Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in statistical packages. The American Statistician, 50(4), 361–365. doi:10.2307/2684934

Joarder, A. H., & Firozzaman, M. (2001). Quartiles for discrete data. Teaching Statistics, 23(3), 86–89. doi:10.1111/1467-9639.00063

Langford, E. (2006). Quartiles in elementary statistics. Journal of Statistics Education, 14(3), 1–17. doi:10.1080/10691898.2006.11910589

Lohninger, H. (n.d.). Quartile. Fundamentals of Statistics. Retrieved April 7, 2023, from http://www.statistics4u.com/fundstat_eng/cc_quartile.html

McAlister, D. (1879). The law of the geometric mean. Proceedings of the Royal Society of London, 29(196–199), 367–376. doi:10.1098/rspl.1879.0061

Mendenhall, W., & Sincich, T. (1992). Statistics for engineering and the sciences (3rd ed.). Dellen Publishing Company.

Moore, D. S., & McCabe, G. P. (1989). Introduction to the practice of statistics. W.H. Freeman.

Parzen, E. (1979). Nonparametric statistical data modeling. Journal of the American Statistical Association, 74(365), 105–121. doi:10.1080/01621459.1979.10481621

SAS. (1990). SAS procedures guide: Version 6 (3rd ed.). SAS Institute.

Siegel, A. F., & Morgan, C. J. (1996). Statistics and data analysis: An introduction (2nd ed.). J. Wiley.

Snedecor, G. W. (1940). Statistical methods applied to experiments in agriculture and biology (3rd ed.). The Iowa State College Press.

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley Pub. Co.

Vining, G. G. (1998). Statistical methods for engineers. Duxbury Press.

Weibull, W. (1939). The phenomenon of rupture in solids. Ingeniörs Vetenskaps Akademien, 153, 1–55.

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples

Example 1: Text Pandas Series

>>> import pandas as pd
>>> df2 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df2['Teach_Motivate']
>>> order = {"Fully Disagree":1, "Disagree":2, "Neither disagree nor agree":3, "Agree":4, "Fully agree":5}
>>> me_quartiles(ex1, levels=order)
    Q1   Q3         Q1 text                     Q3 text
0  1.0  3.0  Fully Disagree  Neither disagree nor agree

Example 2: Numeric data

>>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
>>> me_quartiles(ex2)
    Q1   Q3
0  2.0  5.0

def me_quartiles(data, levels=None, method="own", indexMethod="sas1", q1Frac="linear", q1Int="int", q3Frac="linear", q3Int="int"):
    '''
    Quartiles and Hinges
    -------------------- 
    The quartiles are at quarters of the data (McAlister, 1879, p. 374; Galton, 1881, p. 245). The median is at 50%, and the quartiles at 25% and 75%. Note that there are five quartiles: the minimum value is the 0-quartile, the first (or lower) quartile is at 25%, the median (a.k.a. the second quartile) is at 50%, the third (or upper) quartile is at 75%, and the maximum is the fourth quartile.
    
    Tukey (1977) also introduced the term Hinges and sorted the values in a W shape, where the bottom parts of the W are then the hinges.
    
    There are quite a few different methods to determine the quartiles. This function offers 20 named methods. See the notes for a description.

    This function is shown in this [YouTube video](https://youtu.be/iI07nJ3wlOQ) and the measure is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Measures/Quantiles.html)
    
    Parameters
    ----------
    data : list or pandas series
    levels : dictionary, optional 
        coding to use, mapping each text label to its numeric code
    method : string, optional 
        which method to use to calculate the quartiles. Default is "own"
    indexMethod : {"sas1", "inclusive", "exclusive", "sas4", "excel", "hl", "hf8", "hf9"}, optional 
        which type of indexing to use. Default is "sas1"
    q1Frac : {"linear", "down", "up", "bankers", "nearest", "halfdown", "midpoint"}, optional 
        which type of rounding to use for the first quartile when its index is fractional. Default is "linear"
    q1Int : {"int", "midpoint"}, optional 
        whether to use the integer or the midpoint method for the first quartile when its index is an integer. Default is "int"
    q3Frac : {"linear", "down", "up", "bankers", "nearest", "halfdown", "midpoint"}, optional 
        which type of rounding to use for the third quartile when its index is fractional. Default is "linear"
    q3Int : {"int", "midpoint"}, optional  
        whether to use the integer or the midpoint method for the third quartile when its index is an integer. Default is "int"
    
    Set method to "own" to use the indexMethod, q1Frac, q1Int, q3Frac and q3Int parameters, or to any of the method names listed in the notes. A named method overrides those five parameters.
    
    Returns
    -------
    pandas.DataFrame
        A dataframe with the following columns:
    
        * Q1, the numeric value of the first quartile
        * Q3, the numeric value of the third quartile
        * Q1 text, text version of the first quartile (only if levels are used)
        * Q3 text, text version of the third quartile (only if levels are used)
    
    Notes
    -----
    To determine the quartiles, an indexing method is first used to find the index (position) of each quartile in the sorted data. See **he_quartileIndexing()** for details on the different methods to choose from.
    
    If that index is fractional, either linear interpolation, one of the rounding methods (bankers, nearest, down, up, half-down), or the midpoint between the two neighbouring values is used. If the index is an integer, either the value at that index or the midpoint is used.
    
    See **he_quartileIndex()** for details on this.
    
    Note that the rounding method can even vary per quartile, i.e. the one used for the first quartile can differ from the one used for the third quartile.

    I've come across the following methods:

    |method|indexing|q1 integer|q1 fractional|q3 integer|q3 fractional|
    |------|--------|----------|-------------|----------|-------------|
    |sas1|sas1|use int|linear|use int|linear|
    |sas2|sas1|use int|bankers|use int|bankers|
    |sas3|sas1|use int|up|use int|up|
    |sas5|sas1|midpoint|up|midpoint|up|
    |hf3b|sas1|use int|nearest|use int|halfdown|
    |sas4|sas4|use int|linear|use int|linear|
    |ms|sas4|use int|nearest|use int|halfdown|
    |lohninger|sas4|use int|nearest|use int|nearest|
    |hl2|hl|use int|linear|use int|linear|
    |hl1|hl|use int|midpoint|use int|midpoint|
    |excel|excel|use int|linear|use int|linear|
    |pd2|excel|use int|down|use int|down|
    |pd3|excel|use int|up|use int|up|
    |pd4|excel|use int|halfdown|use int|nearest|
    |pd5|excel|use int|midpoint|use int|midpoint|
    |hf8|hf8|use int|linear|use int|linear|
    |hf9|hf9|use int|linear|use int|linear|

    The following values can be used for the *method* parameter (names are matched exactly, in lower case):

    1. inclusive = tukey = hinges = vining (Tukey, 1977, p. 32; Siegel & Morgan, 1996, p. 77; Vining, 1998, p. 44)
    1. exclusive = jf (Moore & McCabe, 1989, p. 33; Joarder & Firozzaman, 2001, p. 88)
    1. sas1 = parzen = hf4 = interpolated_inverted_cdf = maple3 = r4 (Parzen, 1979, p. 108; SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 363)
    1. sas2 = hf3 = r3 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
    1. sas3 = hf1 = inverted_cdf = maple1 = r1 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
    1. sas4 = hf6 = minitab = snedecor = weibull = maple5 = r6 (Hyndman & Fan, 1996, p. 363; Weibull, 1939, p. ?; Snedecor, 1940, p. 43; SAS, 1990, p. 626)
    1. sas5 = hf2 = cdf = averaged_inverted_cdf = r2 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
    1. hf3b = closest_observation
    1. ms (Mendenhall & Sincich, 1992, p. 35)
    1. lohninger (Lohninger, n.d.)
    1. hl1 (Hogg & Ledolter, 1992, p. 21)
    1. hl2 = hf5 = hazen = maple4 = r5 (Hogg & Ledolter, 1992, p. 21; Hazen, 1914, p. ?)
    1. maple2
    1. excel = hf7 = pd1 = linear = gumbel = maple6 = r7 (Hyndman & Fan, 1996, p. 363; Freund & Perles, 1987, p. 201; Gumbel, 1939, p. ?)
    1. pd2 = lower
    1. pd3 = higher
    1. pd4 = nearest
    1. pd5 = midpoint
    1. hf8 = median_unbiased = maple7 = r8 (Hyndman & Fan, 1996, p. 363)
    1. hf9 = normal_unbiased = maple8 = r9 (Hyndman & Fan, 1996, p. 363)

    *hf* is short for Hyndman and Fan, who wrote an article showcasing many different methods; *hl* is short for Hogg and Ledolter, *ms* for Mendenhall and Sincich, and *jf* for Joarder and Firozzaman. *sas* refers to the software package SAS, *maple* to Maple, *pd* to Python's pandas library, and *r* to R.
    
    The names *linear*, *lower*, *higher*, *nearest* and *midpoint* are all used by pandas quantile function and numpy percentile function. Numpy also uses *inverted_cdf*, *averaged_inverted_cdf*, *closest_observation*, *interpolated_inverted_cdf*, *hazen*, *weibull*, *median_unbiased*, and *normal_unbiased*. 

    Before, After and Alternatives
    ------------------------------
    Before this measure you might want an impression using a frequency table or a visualisation:
    * [tab_frequency](../other/table_frequency.html#tab_frequency) for a frequency table
    * [vi_bar_stacked_single](../visualisations/vis_bar_stacked_single.html#vi_bar_stacked_single) for Single Stacked Bar-Chart
    * [vi_bar_dual_axis](../visualisations/vis_bar_dual_axis.html#vi_bar_dual_axis) for Dual-Axis Bar Chart

    After this you might want some other descriptive measures:
    * [me_consensus](../measures/meas_consensus.html#me_consensus) for the Consensus
    * [me_hodges_lehmann_os](../measures/meas_hodges_lehmann_os.html#me_hodges_lehmann_os) for the Hodges-Lehmann Estimate (One-Sample)
    * [me_median](../measures/meas_median.html#me_median) for the Median
    * [me_quantiles](../measures/meas_quantiles.html#me_quantiles) for Quantiles
    * [me_quartile_range](../measures/meas_quartile_range.html#me_quartile_range) for Interquartile Range, Semi-Interquartile Range and Mid-Quartile Range
    
    or perform a test:
    * [ts_sign_os](../tests/test_sign_os.html#ts_sign_os) for One-Sample Sign Test
    * [ts_trinomial_os](../tests/test_trinomial_os.html#ts_trinomial_os) for One-Sample Trinomial Test
    * [ts_wilcoxon_os](../tests/test_wilcoxon_os.html#ts_wilcoxon_os) for Wilcoxon Signed Rank Test (One-Sample)

    For more information on the quartile indexing methods and index itself:
    * [he_quartileIndexing](../helper/help_quartileIndexing.html#he_quartileIndexing)
    * [he_quartileIndex](../helper/help_quartileIndex.html#he_quartileIndex)
    
    References
    ----------
    Freund, J. E., & Perles, B. M. (1987). A new look at quartiles of ungrouped data. *The American Statistician, 41*(3), 200–203. doi:10.1080/00031305.1987.10475479

    Galton, F. (1881). Report of the anthropometric committee. *Report of the British Association for the Advancement of Science, 51*, 225–272.

    Gumbel, E. J. (1939). La Probabilité des Hypothèses. *Comptes Rendus de l’Académie des Sciences, 209*, 645–647.

    Hazen, A. (1914). Storage to be provided in impounding municipal water supply. *Transactions of the American Society of Civil Engineers, 77*(1), 1539–1640. doi:10.1061/taceat.0002563

    Hogg, R. V., & Ledolter, J. (1992). *Applied statistics for engineers and physical scientists* (2nd int.). Macmillan.

    Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in statistical packages. *The American Statistician, 50*(4), 361–365. doi:10.2307/2684934

    Joarder, A. H., & Firozzaman, M. (2001). Quartiles for discrete data. *Teaching Statistics, 23*(3), 86–89. doi:10.1111/1467-9639.00063

    Langford, E. (2006). Quartiles in elementary statistics. *Journal of Statistics Education, 14*(3), 1–17. doi:10.1080/10691898.2006.11910589

    Lohninger, H. (n.d.). Quartile. Fundamentals of Statistics. Retrieved April 7, 2023, from http://www.statistics4u.com/fundstat_eng/cc_quartile.html

    McAlister, D. (1879). The law of the geometric mean. *Proceedings of the Royal Society of London, 29*(196–199), 367–376. doi:10.1098/rspl.1879.0061

    Mendenhall, W., & Sincich, T. (1992). *Statistics for engineering and the sciences* (3rd ed.). Dellen Publishing Company.

    Moore, D. S., & McCabe, G. P. (1989). *Introduction to the practice of statistics*. W.H. Freeman.

    Parzen, E. (1979). Nonparametric statistical data modeling. *Journal of the American Statistical Association, 74*(365), 105–121. doi:10.1080/01621459.1979.10481621

    SAS. (1990). *SAS procedures guide: Version 6* (3rd ed.). SAS Institute.

    Siegel, A. F., & Morgan, C. J. (1996). *Statistics and data analysis: An introduction* (2nd ed.). J. Wiley.

    Snedecor, G. W. (1940). *Statistical methods applied to experiments in agriculture and biology* (3rd ed.). The Iowa State College Press.

    Tukey, J. W. (1977). *Exploratory data analysis*. Addison-Wesley Pub. Co.

    Vining, G. G. (1998). *Statistical methods for engineers*. Duxbury Press.

    Weibull, W. (1939). *The phenomenon of rupture in solids*. Ingeniörs Vetenskaps Akademien, 153, 1–55.
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    Example 1: Text Pandas Series
    >>> import pandas as pd
    >>> df2 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> ex1 = df2['Teach_Motivate']
    >>> order = {"Fully Disagree":1, "Disagree":2, "Neither disagree nor agree":3, "Agree":4, "Fully agree":5}
    >>> me_quartiles(ex1, levels=order)
        Q1   Q3         Q1 text                     Q3 text
    0  1.0  3.0  Fully Disagree  Neither disagree nor agree
    
    Example 2: Numeric data
    >>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
    >>> me_quartiles(ex2)
        Q1   Q3
    0  2.0  5.0
    
    '''
    if isinstance(data, list):
        data = pd.Series(data)
        
    data = data.dropna()
    if levels is not None:
        dataN = data.map(levels).astype('Int8')
    else:
        dataN = pd.to_numeric(data)
    
    dataN = dataN.sort_values().reset_index(drop=True)
    #dataN = list(dataN)
    
    #alternative namings
    if method in ["inclusive", "tukey", "vining", "hinges"]:
        method="inclusive"
    elif method in ["exclusive", "jf"]:
        method ="exclusive"
    elif method in ["cdf", "sas5", "hf2", "averaged_inverted_cdf", "r2"]:
        method = "sas5"
    elif method in ["sas4", "minitab", "hf6", "snedecor", "weibull", "maple5", "r6"]:
        method = "sas4"
    elif method in ["excel", "hf7", "pd1", "linear", "gumbel", "maple6", "r7"]:
        method = "excel"
    elif method in ["sas1", "parzen", "hf4", "interpolated_inverted_cdf", "maple3", "r4"]:
        method = "sas1"
    elif method in ["sas2", "hf3", "r3"]:
        method = "sas2"
    elif method in ["sas3", "hf1", "inverted_cdf", "maple1", "r1"]:
        method = "sas3"
    elif method in ["hf3b", "closest_observation"]:
        method = "hf3b"
    elif method in ["hl2", "hazen", "hf5", "maple4", "r5"]:
        method = "hl2"
    elif method in ["np", "midpoint", "pd5"]:
        method = "pd5"
    elif method in ["hf8", "median_unbiased", "maple7", "r8"]:
        method = "hf8"
    elif method in ["hf9", "normal_unbiased", "maple8", "r9"]:
        method = "hf9"
    elif method in ["pd2", "lower"]:
        method = "pd2"
    elif method in ["pd3", "higher"]:
        method = "pd3"
    elif method in ["pd4", "nearest"]:
        method = "pd4"
    
    #settings
    settings = [indexMethod, q1Frac, q1Int, q3Frac, q3Int]
    if method=="inclusive":
        settings = ["inclusive", "linear","int","linear","int"]
    elif method=="exclusive":
        settings = ["exclusive", "linear","int","linear","int"]
    elif method=="sas1":
        settings = ["sas1","linear","int","linear","int"]
    elif method=="sas2":
        settings = ["sas1","bankers","int","bankers" ,"int"]
    elif method=="sas3":
        settings = ["sas1","up","int","up","int"]
    elif method=="sas5":
        settings = ["sas1","up","midpoint","up","midpoint"]
    elif method=="sas4":    
        settings = ["sas4","linear", "int","linear","int"]
    elif method=="ms": 
        settings = ["sas4", "nearest","int", "halfdown","int"]
    elif method=="lohninger":
        settings = ["sas4", "nearest", "int","nearest","int"]
    elif method=="hl2":
        settings = ["hl", "linear", "int","linear","int"]
    elif method=="hl1":
        settings = ["hl", "midpoint","int", "midpoint","int"]
    elif method=="excel":
        settings = ["excel", "linear","int","linear", "int"]
    elif method=="pd2":
        settings = ["excel", "down", "int", "down","int"]
    elif method=="pd3":
        settings = ["excel", "up","int","up","int"]
    elif method=="pd4":
        settings = ["excel", "halfdown",  "int","nearest", "int"]
    elif method=="hf3b":
        settings = ["sas1", "nearest","int","halfdown","int"]
    elif method=="pd5":
        settings = ["excel", "midpoint","int","midpoint","int"]
    elif method=="hf8":
        settings = ["hf8", "linear","int","linear", "int"]
    elif method=="hf9":
        settings = ["hf9", "linear","int","linear", "int"]
    elif method=="maple2":
        settings = ["hl", "down","int","down", "int"]
    
    q1, q3 = he_quartileIndex(dataN, settings[0], settings[1], settings[2], settings[3], settings[4])
    
    #find the text representatives
    
    if levels is not None:
        if q1 == round(q1):
            q1T = list(levels.keys())[list(levels.values()).index(q1)]

        else:
            #q1 lies between two adjacent level codes, so look up the labels of those codes (same approach as for q3)
            q1T = "between " + list(levels.keys())[list(levels.values()).index(math.floor(q1))] + " and " + list(levels.keys())[list(levels.values()).index(math.ceil(q1))]

        if q3 == round(q3):
            q3T = list(levels.keys())[list(levels.values()).index(q3)]

        else:
            q3T = "between " + list(levels.keys())[list(levels.values()).index(math.floor(q3))] + " and " + list(levels.keys())[list(levels.values()).index(math.ceil(q3))]
        
        
        results = pd.DataFrame([[q1, q3, q1T, q3T]], columns=["Q1", "Q3", "Q1 text", "Q3 text"])
    else:
        results = pd.DataFrame([[q1, q3]], columns=["Q1", "Q3"])
        
    pd.set_option('display.max_colwidth', None)
    
    return results
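
The alias handling and the per-method settings in `me_quartiles` follow a fixed pattern: a user-facing name is reduced to a canonical name, which then selects the five arguments passed to `he_quartileIndex`. As a minimal sketch, and not the package's own implementation, the same mapping could be expressed with dictionaries. The names `ALIASES`, `SETTINGS`, `CANONICAL` and `resolve_settings` are hypothetical, and only a few of the methods are included to keep the example short.

```python
# Hypothetical table-driven version of the alias and settings lookup.
# Only a subset of the methods is listed; the values are copied from the
# if/elif chains in me_quartiles.

# canonical name -> alternative names
ALIASES = {
    "inclusive": ["tukey", "vining", "hinges"],
    "exclusive": ["jf"],
    "sas1": ["parzen", "hf4", "interpolated_inverted_cdf", "maple3", "r4"],
    "excel": ["hf7", "pd1", "linear", "gumbel", "maple6", "r7"],
    "pd2": ["lower"],
}

# canonical name -> [indexMethod, q1Frac, q1Int, q3Frac, q3Int]
SETTINGS = {
    "inclusive": ["inclusive", "linear", "int", "linear", "int"],
    "exclusive": ["exclusive", "linear", "int", "linear", "int"],
    "sas1": ["sas1", "linear", "int", "linear", "int"],
    "excel": ["excel", "linear", "int", "linear", "int"],
    "pd2": ["excel", "down", "int", "down", "int"],
}

# Invert ALIASES once so that each alias points at its canonical name.
CANONICAL = {
    alias: name for name, aliases in ALIASES.items() for alias in aliases
}


def resolve_settings(method, own_settings):
    """Return the five settings for `method`.

    Unknown names (such as the default "own") fall back to `own_settings`,
    mirroring how me_quartiles keeps the caller's arguments in that case.
    """
    canonical = CANONICAL.get(method, method)
    return SETTINGS.get(canonical, own_settings)


own = ["sas1", "linear", "int", "linear", "int"]
print(resolve_settings("tukey", own))  # settings of the "inclusive" method
print(resolve_settings("lower", own))  # settings of the "pd2" method
print(resolve_settings("own", own))    # the caller's own settings
```

A lookup table like this keeps each alias next to its canonical name, which makes it easier to check the code against the list of names in the docstring.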