Module stikpetP.measures.meas_quartiles

import pandas as pd
import math
from ..helper.help_quartileIndex import he_quartileIndex

#This function is used in me_quartile_range

def me_quartiles(data, levels=None, method="own", indexMethod="sas1", q1Frac="linear", q1Int="int", q3Frac="linear", q3Int="int"):
    '''
    Quartiles and Hinges
    -------------------- 
    The quartiles are at quarters of the data (McAlister, 1879, p. 374; Galton, 1881, p. 245). The median is at 50%, and the quartiles at 25% and 75%. Strictly speaking there are five quartiles: the minimum is the zeroth quartile, the value at 25% is the first (or lower) quartile, the median at 50% is the second quartile, the value at 75% is the third (or upper) quartile, and the maximum is the fourth quartile.
    
    Tukey (1977) also introduced the term Hinges and sorted the values in a W shape, where the bottom parts of the W are then the hinges.
    
    There are quite a few different methods to determine the quartiles. This function supports the 20 named methods listed in the notes, as well as a custom combination of settings.

    This function is shown in this [YouTube video](https://youtu.be/iI07nJ3wlOQ) and the measure is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Measures/Quantiles.html)
    
    Parameters
    ----------
    data : list or pandas series
    levels : dictionary, optional 
        coding to use
    method : string, optional 
        which method to use to calculate the quartiles. Either "own" (the default), which uses the five settings below, or any of the method names listed in the notes, which then override those settings
    indexMethod : {"sas1", "inclusive", "exclusive", "sas4", "excel", "hl", "hf8", "hf9"}, optional 
        which type of indexing to use. Default is "sas1"
    q1Frac : {"linear", "down", "up", "bankers", "nearest", "halfdown", "midpoint"}, optional 
        how to handle a fractional index for the first quartile (interpolation or type of rounding). Default is "linear"
    q1Int : {"int", "midpoint"}, optional 
        whether to use the integer or the midpoint method when the index for the first quartile is an integer. Default is "int"
    q3Frac : {"linear", "down", "up", "bankers", "nearest", "halfdown", "midpoint"}, optional 
        how to handle a fractional index for the third quartile (interpolation or type of rounding). Default is "linear"
    q3Int : {"int", "midpoint"}, optional  
        whether to use the integer or the midpoint method when the index for the third quartile is an integer. Default is "int"
    
    Returns
    -------
    pandas.DataFrame
        A dataframe with the following columns:
    
        * Q1, the numeric value of the first quartile
        * Q3, the numeric value of the third quartile
        * Q1 text, text version of the first quartile (only if levels are used)
        * Q3 text, text version of the third quartile (only if levels are used)
    
    Notes
    -----
    To determine the quartiles, an indexing method first locates the position of each quartile in the sorted data; this position can be fractional. See **he_quartileIndexing()** for details on the different methods to choose from.
    
    If the index is fractional, the quartile is found by linear interpolation, by rounding the index (bankers, nearest, down, up, half-down), or by taking the midpoint between the two values. If the index is an integer, either the value at that integer index or the midpoint is used. 
    
    See **he_quartileIndex()** for details on this.
    
    Note that the rounding method can even vary per quartile, i.e. the one used for the first quartile can differ from the one used for the third.

    I've come across the following methods:

    |method|indexing|q1 integer|q1 fractional|q3 integer|q3 fractional|
    |------|--------|----------|-------------|----------|-------------|
    |sas1|sas1|use int|linear|use int|linear|
    |sas2|sas1|use int|bankers|use int|bankers|
    |sas3|sas1|use int|up|use int|up|
    |sas5|sas1|midpoint|up|midpoint|up|
    |hf3b|sas1|use int|nearest|use int|halfdown|
    |sas4|sas4|use int|linear|use int|linear|
    |ms|sas4|use int|nearest|use int|halfdown|
    |lohninger|sas4|use int|nearest|use int|nearest|
    |hl2|hl|use int|linear|use int|linear|
    |hl1|hl|use int|midpoint|use int|midpoint|
    |excel|excel|use int|linear|use int|linear|
    |pd2|excel|use int|down|use int|down|
    |pd3|excel|use int|up|use int|up|
    |pd4|excel|use int|halfdown|use int|nearest|
    |pd5|excel|use int|midpoint|use int|midpoint|
    |hf8|hf8|use int|linear|use int|linear|
    |hf9|hf9|use int|linear|use int|linear|

    The following values can be used for the *method* parameter:

    1. inclusive = tukey = hinges = vining. (Tukey, 1977, p. 32; Siegel & Morgan, 1996, p. 77; Vining, 1998, p. 44).
    1. exclusive = jf. (Moore & McCabe, 1989, p. 33; Joarder & Firozzaman, 2001, p. 88).
    1. sas1 = parzen = hf4 = interpolated_inverted_cdf = maple3 = r4. (Parzen, 1979, p. 108; SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 363)
    1. sas2 = hf3 = r3. (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
    1. sas3 = hf1 = inverted_cdf = maple1 = r1 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
    1. sas4 = hf6 = minitab = snedecor = weibull = maple5 = r6 (Hyndman & Fan, 1996, p. 363; Weibull, 1939, p. ?; Snedecor, 1940, p. 43; SAS, 1990, p. 626)
    1. sas5 = hf2 = cdf = averaged_inverted_cdf = r2 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
    1. hf3b = closest_observation 
    1. ms (Mendenhall & Sincich, 1992, p. 35)
    1. lohninger (Lohninger, n.d.)
    1. hl1 (Hogg & Ledolter, 1992, p. 21)
    1. hl2 = hf5 = hazen = maple4 = r5 (Hogg & Ledolter, 1992, p. 21; Hazen, 1914, p. ?)
    1. maple2
    1. excel = hf7 = pd1 = linear = gumbel = maple6 = r7 (Hyndman & Fan, 1996, p. 363; Freund & Perles, 1987, p. 201; Gumbel, 1939, p. ?)
    1. pd2 = lower
    1. pd3 = higher
    1. pd4 = nearest
    1. pd5 = midpoint
    1. hf8 = median_unbiased = maple7 = r8 (Hyndman & Fan, 1996, p. 363)
    1. hf9 = normal_unbiased = maple8 = r9 (Hyndman & Fan, 1996, p. 363)

    *hf* is short for Hyndman and Fan, who wrote an article showcasing many different methods; *hl* is short for Hogg and Ledolter, *ms* for Mendenhall and Sincich, and *jf* for Joarder and Firozzaman. *sas* refers to the software package SAS, *maple* to Maple, *pd* to Python's pandas library, and *r* to R.
    
    The names *linear*, *lower*, *higher*, *nearest* and *midpoint* are all used by pandas quantile function and numpy percentile function. Numpy also uses *inverted_cdf*, *averaged_inverted_cdf*, *closest_observation*, *interpolated_inverted_cdf*, *hazen*, *weibull*, *median_unbiased*, and *normal_unbiased*. 

    Before, After and Alternatives
    ------------------------------
    Before this measure you might want an impression using a frequency table or a visualisation:
    * [tab_frequency](../other/table_frequency.html#tab_frequency) for a frequency table
    * [vi_bar_stacked_single](../visualisations/vis_bar_stacked_single.html#vi_bar_stacked_single) for Single Stacked Bar-Chart
    * [vi_bar_dual_axis](../visualisations/vis_bar_dual_axis.html#vi_bar_dual_axis) for Dual-Axis Bar Chart

    After this you might want some other descriptive measures:
    * [me_consensus](../measures/meas_consensus.html#me_consensus) for the Consensus
    * [me_hodges_lehmann_os](../measures/meas_hodges_lehmann_os.html#me_hodges_lehmann_os) for the Hodges-Lehmann Estimate (One-Sample)
    * [me_median](../measures/meas_median.html#me_median) for the Median
    * [me_quantiles](../measures/meas_quantiles.html#me_quantiles) for Quantiles
    * [me_quartile_range](../measures/meas_quartile_range.html#me_quartile_range) for Interquartile Range, Semi-Interquartile Range and Mid-Quartile Range
    
    or perform a test:
    * [ts_sign_os](../tests/test_sign_os.html#ts_sign_os) for One-Sample Sign Test
    * [ts_trinomial_os](../tests/test_trinomial_os.html#ts_trinomial_os) for One-Sample Trinomial Test
    * [ts_wilcoxon_os](../tests/test_wilcoxon_os.html#ts_wilcoxon_os) for Wilcoxon Signed Rank Test (One-Sample)

    For more information on the quartile indexing methods and index itself:
    * [he_quartileIndexing](../helper/help_quartileIndexing.html#he_quartileIndexing)
    * [he_quartileIndex](../helper/help_quartileIndex.html#he_quartileIndex)
    
    References
    ----------
    Freund, J. E., & Perles, B. M. (1987). A new look at quartiles of ungrouped data. *The American Statistician, 41*(3), 200–203. doi:10.1080/00031305.1987.10475479

    Galton, F. (1881). Report of the anthropometric committee. *Report of the British Association for the Advancement of Science, 51*, 225–272.

    Gumbel, E. J. (1939). La Probabilité des Hypothèses. *Comptes Rendus de l'Académie des Sciences, 209*, 645–647.

    Hazen, A. (1914). Storage to be provided in impounding municipal water supply. *Transactions of the American Society of Civil Engineers, 77*(1), 1539–1640. doi:10.1061/taceat.0002563

    Hogg, R. V., & Ledolter, J. (1992). *Applied statistics for engineers and physical scientists* (2nd int.). Macmillan.

    Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in statistical packages. *The American Statistician, 50*(4), 361–365. doi:10.2307/2684934

    Joarder, A. H., & Firozzaman, M. (2001). Quartiles for discrete data. *Teaching Statistics, 23*(3), 86–89. doi:10.1111/1467-9639.00063

    Langford, E. (2006). Quartiles in elementary statistics. *Journal of Statistics Education, 14*(3), 1–17. doi:10.1080/10691898.2006.11910589

    Lohninger, H. (n.d.). Quartile. Fundamentals of Statistics. Retrieved April 7, 2023, from http://www.statistics4u.com/fundstat_eng/cc_quartile.html

    McAlister, D. (1879). The law of the geometric mean. *Proceedings of the Royal Society of London, 29*(196–199), 367–376. doi:10.1098/rspl.1879.0061

    Mendenhall, W., & Sincich, T. (1992). *Statistics for engineering and the sciences* (3rd ed.). Dellen Publishing Company.

    Moore, D. S., & McCabe, G. P. (1989). *Introduction to the practice of statistics*. W.H. Freeman.

    Parzen, E. (1979). Nonparametric statistical data modeling. *Journal of the American Statistical Association, 74*(365), 105–121. doi:10.1080/01621459.1979.10481621

    SAS. (1990). SAS procedures guide: Version 6 (3rd ed.). SAS Institute.

    Siegel, A. F., & Morgan, C. J. (1996). *Statistics and data analysis: An introduction* (2nd ed.). J. Wiley.

    Snedecor, G. W. (1940). *Statistical methods applied to experiments in agriculture and biology* (3rd ed.). The Iowa State College Press.

    Tukey, J. W. (1977). *Exploratory data analysis*. Addison-Wesley Pub. Co.

    Vining, G. G. (1998). *Statistical methods for engineers*. Duxbury Press.

    Weibull, W. (1939). *The phenomenon of rupture in solids*. Ingeniörs Vetenskaps Akademien, 153, 1–55.
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    Example 1: Text Pandas Series
    >>> import pandas as pd
    >>> df2 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> ex1 = df2['Teach_Motivate']
    >>> order = {"Fully Disagree":1, "Disagree":2, "Neither disagree nor agree":3, "Agree":4, "Fully agree":5}
    >>> me_quartiles(ex1, levels=order)
        Q1   Q3         Q1 text                     Q3 text
    0  1.0  3.0  Fully Disagree  Neither disagree nor agree
    
    Example 2: Numeric data
    >>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
    >>> me_quartiles(ex2)
        Q1   Q3
    0  2.0  5.0
    
    '''
    if isinstance(data, list):
        data = pd.Series(data)
        
    data = data.dropna()
    if levels is not None:
        dataN = data.map(levels).astype('Int8')
    else:
        dataN = pd.to_numeric(data)
    
    dataN = dataN.sort_values().reset_index(drop=True)
    #dataN = list(dataN)
    
    #alternative namings
    if method in ["inclusive", "tukey", "vining", "hinges"]:
        method="inclusive"
    elif method in ["exclusive", "jf"]:
        method ="exclusive"
    elif method in ["cdf", "sas5", "hf2", "averaged_inverted_cdf", "r2"]:
        method = "sas5"
    elif method in ["sas4", "minitab", "hf6", "snedecor", "weibull", "maple5", "r6"]:
        method = "sas4"
    elif method in ["excel", "hf7", "pd1", "linear", "gumbel", "maple6", "r7"]:
        method = "excel"
    elif method in ["sas1", "parzen", "hf4", "interpolated_inverted_cdf", "maple3", "r4"]:
        method = "sas1"
    elif method in ["sas2", "hf3", "r3"]:
        method = "sas2"
    elif method in ["sas3", "hf1", "inverted_cdf", "maple1", "r1"]:
        method = "sas3"
    elif method in ["hf3b", "closest_observation"]:
        method = "hf3b"
    elif method in ["hl2", "hazen", "hf5", "maple4", "r5"]:
        method = "hl2"
    elif method in ["np", "midpoint", "pd5"]:
        method = "pd5"
    elif method in ["hf8", "median_unbiased", "maple7", "r8"]:
        method = "hf8"
    elif method in ["hf9", "normal_unbiased", "maple8", "r9"]:
        method = "hf9"
    elif method in ["pd2", "lower"]:
        method = "pd2"
    elif method in ["pd3", "higher"]:
        method = "pd3"
    elif method in ["pd4", "nearest"]:
        method = "pd4"
    
    #settings
    settings = [indexMethod, q1Frac, q1Int, q3Frac, q3Int]
    if method=="inclusive":
        settings = ["inclusive", "linear","int","linear","int"]
    elif method=="exclusive":
        settings = ["exclusive", "linear","int","linear","int"]
    elif method=="sas1":
        settings = ["sas1","linear","int","linear","int"]
    elif method=="sas2":
        settings = ["sas1","bankers","int","bankers" ,"int"]
    elif method=="sas3":
        settings = ["sas1","up","int","up","int"]
    elif method=="sas5":
        settings = ["sas1","up","midpoint","up","midpoint"]
    elif method=="sas4":    
        settings = ["sas4","linear", "int","linear","int"]
    elif method=="ms": 
        settings = ["sas4", "nearest","int", "halfdown","int"]
    elif method=="lohninger":
        settings = ["sas4", "nearest", "int","nearest","int"]
    elif method=="hl2":
        settings = ["hl", "linear", "int","linear","int"]
    elif method=="hl1":
        settings = ["hl", "midpoint","int", "midpoint","int"]
    elif method=="excel":
        settings = ["excel", "linear","int","linear", "int"]
    elif method=="pd2":
        settings = ["excel", "down", "int", "down","int"]
    elif method=="pd3":
        settings = ["excel", "up","int","up","int"]
    elif method=="pd4":
        settings = ["excel", "halfdown",  "int","nearest", "int"]
    elif method=="hf3b":
        settings = ["sas1", "nearest","int","halfdown","int"]
    elif method=="pd5":
        settings = ["excel", "midpoint","int","midpoint","int"]
    elif method=="hf8":
        settings = ["hf8", "linear","int","linear", "int"]
    elif method=="hf9":
        settings = ["hf9", "linear","int","linear", "int"]
    elif method=="maple2":
        settings = ["hl", "down","int","down", "int"]
    
    q1, q3 = he_quartileIndex(dataN, settings[0], settings[1], settings[2], settings[3], settings[4])
    
    #find the text representatives
    
    if levels is not None:
        #lookup lists to go from a numeric code back to its label
        codes = list(levels.values())
        labels = list(levels.keys())
        
        if q1 == round(q1):
            q1T = labels[codes.index(q1)]
        else:
            #a fractional quartile lies between two adjacent codes
            q1T = "between " + labels[codes.index(math.floor(q1))] + " and " + labels[codes.index(math.ceil(q1))]

        if q3 == round(q3):
            q3T = labels[codes.index(q3)]
        else:
            q3T = "between " + labels[codes.index(math.floor(q3))] + " and " + labels[codes.index(math.ceil(q3))]
        
        
        results = pd.DataFrame([[q1, q3, q1T, q3T]], columns=["Q1", "Q3", "Q1 text", "Q3 text"])
    else:
        results = pd.DataFrame([[q1, q3]], columns=["Q1", "Q3"])
        
    pd.set_option('display.max_colwidth', None)
    
    return results

Functions

def me_quartiles(data, levels=None, method='own', indexMethod='sas1', q1Frac='linear', q1Int='int', q3Frac='linear', q3Int='int')

Quartiles and Hinges

The quartiles are at quarters of the data (McAlister, 1879, p. 374; Galton, 1881, p. 245). The median is at 50%, and the quartiles at 25% and 75%. Strictly speaking there are five quartiles: the minimum is the zeroth quartile, the value at 25% is the first (or lower) quartile, the median at 50% is the second quartile, the value at 75% is the third (or upper) quartile, and the maximum is the fourth quartile.

Tukey (1977) also introduced the term Hinges and sorted the values in a W shape, where the bottom parts of the W are then the hinges.

There are quite a few different methods to determine the quartiles. This function supports the 20 named methods listed in the notes, as well as a custom combination of settings.

This function is shown in this YouTube video and the measure is also described at PeterStatistics.com

Parameters

data : list or pandas series
 
levels : dictionary, optional
    coding to use
method : string, optional
    which method to use to calculate the quartiles. Either "own" (the default), which uses the five settings below, or any of the method names listed in the notes, which then override those settings
indexMethod : {"sas1", "inclusive", "exclusive", "sas4", "excel", "hl", "hf8", "hf9"}, optional
    which type of indexing to use. Default is "sas1"
q1Frac : {"linear", "down", "up", "bankers", "nearest", "halfdown", "midpoint"}, optional
    how to handle a fractional index for the first quartile (interpolation or type of rounding). Default is "linear"
q1Int : {"int", "midpoint"}, optional
    whether to use the integer or the midpoint method when the index for the first quartile is an integer. Default is "int"
q3Frac : {"linear", "down", "up", "bankers", "nearest", "halfdown", "midpoint"}, optional
    how to handle a fractional index for the third quartile (interpolation or type of rounding). Default is "linear"
q3Int : {"int", "midpoint"}, optional
    whether to use the integer or the midpoint method when the index for the third quartile is an integer. Default is "int"

Returns

pandas.DataFrame

A dataframe with the following columns:

  • Q1, the numeric value of the first quartile
  • Q3, the numeric value of the third quartile
  • Q1 text, text version of the first quartile (only if levels are used)
  • Q3 text, text version of the third quartile (only if levels are used)
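
As a rough illustration of how the text columns relate to the numeric ones, the sketch below maps a quartile value back to its label, and reports "between ... and ..." when the value falls between two codes. The helper name `label_for` is hypothetical and only mirrors the idea; it is not part of stikpetP.

```python
import math

def label_for(value, levels):
    """Hypothetical helper: turn a numeric quartile into a label."""
    # invert the coding so numeric codes point back to their labels
    inverse = {code: label for label, code in levels.items()}
    if value == round(value):
        return inverse[int(value)]
    # fractional value: report the two neighbouring labels
    return f"between {inverse[math.floor(value)]} and {inverse[math.ceil(value)]}"

order = {"Fully Disagree": 1, "Disagree": 2, "Neither disagree nor agree": 3,
         "Agree": 4, "Fully agree": 5}
print(label_for(3.0, order))  # Neither disagree nor agree
print(label_for(1.5, order))  # between Fully Disagree and Disagree
```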

Notes

To determine the quartiles, an indexing method first locates the position of each quartile in the sorted data; this position can be fractional. See he_quartileIndexing() for details on the different methods to choose from.

If the index is fractional, the quartile is found by linear interpolation, by rounding the index (bankers, nearest, down, up, half-down), or by taking the midpoint between the two values. If the index is an integer, either the value at that integer index or the midpoint is used.

See he_quartileIndex() for details on this.

Note that the rounding method can even vary per quartile, i.e. the one used for the first quartile can differ from the one used for the third.
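
To make the fractional-index options more concrete, here is a minimal sketch of how a value might be picked once a 1-based index is known. The function `value_at_index` is a hypothetical illustration, not the stikpetP helper, and the reading of "nearest" as round-half-up, "halfdown" as round-half-down and "bankers" as round-half-to-even is an assumption based on the usual meaning of those terms. Indexes outside the range of the data are not handled.

```python
import math

def value_at_index(sorted_values, idx, frac="linear"):
    """Hypothetical helper: value at a 1-based, possibly fractional, index."""
    lo, hi = math.floor(idx), math.ceil(idx)
    if lo == hi:                      # integer index: take that value
        return sorted_values[lo - 1]
    below, above = sorted_values[lo - 1], sorted_values[hi - 1]
    if frac == "linear":              # interpolate between the two neighbours
        return below + (idx - lo) * (above - below)
    if frac == "midpoint":            # average of the two neighbours
        return (below + above) / 2
    if frac == "down":
        pos = lo
    elif frac == "up":
        pos = hi
    elif frac == "nearest":           # assumed: round half up
        pos = math.floor(idx + 0.5)
    elif frac == "halfdown":          # assumed: round half down
        pos = math.ceil(idx - 0.5)
    elif frac == "bankers":           # assumed: round half to even
        pos = round(idx)
    else:
        raise ValueError(f"unknown option: {frac}")
    return sorted_values[pos - 1]

data = [10, 20, 30, 40, 50, 60]
for option in ["linear", "midpoint", "down", "up", "nearest", "halfdown", "bankers"]:
    print(option, value_at_index(data, 2.5, option))
```

With an index of 2.5 the options only differ in which neighbour (20 or 30) they pick, or whether they land in between, which is why the choice matters most for small samples.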

I've come across the following methods:

|method|indexing|q1 integer|q1 fractional|q3 integer|q3 fractional|
|------|--------|----------|-------------|----------|-------------|
|sas1|sas1|use int|linear|use int|linear|
|sas2|sas1|use int|bankers|use int|bankers|
|sas3|sas1|use int|up|use int|up|
|sas5|sas1|midpoint|up|midpoint|up|
|hf3b|sas1|use int|nearest|use int|halfdown|
|sas4|sas4|use int|linear|use int|linear|
|ms|sas4|use int|nearest|use int|halfdown|
|lohninger|sas4|use int|nearest|use int|nearest|
|hl2|hl|use int|linear|use int|linear|
|hl1|hl|use int|midpoint|use int|midpoint|
|excel|excel|use int|linear|use int|linear|
|pd2|excel|use int|down|use int|down|
|pd3|excel|use int|up|use int|up|
|pd4|excel|use int|halfdown|use int|nearest|
|pd5|excel|use int|midpoint|use int|midpoint|
|hf8|hf8|use int|linear|use int|linear|
|hf9|hf9|use int|linear|use int|linear|
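
Each row of this table can be read as a preset for the five settings. The sketch below shows one possible way to express a few of the rows as a lookup, falling back to the caller's own settings for any other name. `PRESETS` and `resolve_settings` are hypothetical names used for illustration only; the tuple order follows the function's parameters (indexMethod, q1Frac, q1Int, q3Frac, q3Int).

```python
# (indexMethod, q1Frac, q1Int, q3Frac, q3Int) for a few rows of the table
PRESETS = {
    "sas1": ("sas1", "linear", "int", "linear", "int"),
    "sas2": ("sas1", "bankers", "int", "bankers", "int"),
    "sas5": ("sas1", "up", "midpoint", "up", "midpoint"),
    "ms": ("sas4", "nearest", "int", "halfdown", "int"),
    "excel": ("excel", "linear", "int", "linear", "int"),
    "pd4": ("excel", "halfdown", "int", "nearest", "int"),
}

def resolve_settings(method, own_settings):
    """Return the preset for a named method, else the caller's own settings."""
    return PRESETS.get(method, own_settings)

own = ("sas4", "up", "int", "down", "int")
print(resolve_settings("sas5", own))  # ('sas1', 'up', 'midpoint', 'up', 'midpoint')
print(resolve_settings("own", own))   # ('sas4', 'up', 'int', 'down', 'int')
```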

The following values can be used for the method parameter:

  1. inclusive = tukey = hinges = vining. (Tukey, 1977, p. 32; Siegel & Morgan, 1996, p. 77; Vining, 1998, p. 44).
  2. exclusive = jf. (Moore & McCabe, 1989, p. 33; Joarder & Firozzaman, 2001, p. 88).
  3. sas1 = parzen = hf4 = interpolated_inverted_cdf = maple3 = r4. (Parzen, 1979, p. 108; SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 363)
  4. sas2 = hf3 = r3. (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
  5. sas3 = hf1 = inverted_cdf = maple1 = r1 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
  6. sas4 = hf6 = minitab = snedecor = weibull = maple5 = r6 (Hyndman & Fan, 1996, p. 363; Weibull, 1939, p. ?; Snedecor, 1940, p. 43; SAS, 1990, p. 626)
  7. sas5 = hf2 = cdf = averaged_inverted_cdf = r2 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
  8. hf3b = closest_observation
  9. ms (Mendenhall & Sincich, 1992, p. 35)
  10. lohninger (Lohninger, n.d.)
  11. hl1 (Hogg & Ledolter, 1992, p. 21)
  12. hl2 = hf5 = hazen = maple4 = r5 (Hogg & Ledolter, 1992, p. 21; Hazen, 1914, p. ?)
  13. maple2
  14. excel = hf7 = pd1 = linear = gumbel = maple6 = r7 (Hyndman & Fan, 1996, p. 363; Freund & Perles, 1987, p. 201; Gumbel, 1939, p. ?)
  15. pd2 = lower
  16. pd3 = higher
  17. pd4 = nearest
  18. pd5 = midpoint
  19. hf8 = median_unbiased = maple7 = r8 (Hyndman & Fan, 1996, p. 363)
  20. hf9 = normal_unbiased = maple8 = r9 (Hyndman & Fan, 1996, p. 363)

hf is short for Hyndman and Fan, who wrote an article showcasing many different methods; hl is short for Hogg and Ledolter, ms for Mendenhall and Sincich, and jf for Joarder and Firozzaman. sas refers to the software package SAS, maple to Maple, pd to Python's pandas library, and r to R.

The names linear, lower, higher, nearest and midpoint are all used by pandas quantile function and numpy percentile function. Numpy also uses inverted_cdf, averaged_inverted_cdf, closest_observation, interpolated_inverted_cdf, hazen, weibull, median_unbiased, and normal_unbiased.
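
Naming differs between libraries, so a quick cross-check can help. The sketch below uses Python's standard library `statistics.quantiles`, whose documented `'exclusive'` method places the i-th of m sorted points at i / (m + 1) and whose `'inclusive'` method places it at (i - 1) / (m - 1). Assuming the usual Hyndman and Fan definitions, these appear to correspond to hf6 (sas4) and hf7 (excel) respectively, and not to the inclusive and exclusive methods listed here.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# statistics.quantiles returns the three cut points for n=4: Q1, median, Q3
q_excl = statistics.quantiles(data, n=4, method="exclusive")
q_incl = statistics.quantiles(data, n=4, method="inclusive")

print("exclusive:", q_excl[0], q_excl[2])  # 2.75 8.25
print("inclusive:", q_incl[0], q_incl[2])  # 3.25 7.75
```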

Before, After and Alternatives

Before this measure you might want an impression using a frequency table or a visualisation:

  • tab_frequency for a frequency table
  • vi_bar_stacked_single for Single Stacked Bar-Chart
  • vi_bar_dual_axis for Dual-Axis Bar Chart

After this you might want some other descriptive measures:

  • me_consensus for the Consensus
  • me_hodges_lehmann_os for the Hodges-Lehmann Estimate (One-Sample)
  • me_median for the Median
  • me_quantiles for Quantiles
  • me_quartile_range for Interquartile Range, Semi-Interquartile Range and Mid-Quartile Range

or perform a test:

  • ts_sign_os for One-Sample Sign Test
  • ts_trinomial_os for One-Sample Trinomial Test
  • ts_wilcoxon_os for Wilcoxon Signed Rank Test (One-Sample)

For more information on the quartile indexing methods and index itself:

  • he_quartileIndexing
  • he_quartileIndex

References

Freund, J. E., & Perles, B. M. (1987). A new look at quartiles of ungrouped data. The American Statistician, 41(3), 200–203. doi:10.1080/00031305.1987.10475479

Galton, F. (1881). Report of the anthropometric committee. Report of the British Association for the Advancement of Science, 51, 225–272.

Gumbel, E. J. (1939). La Probabilité des Hypothèses. Comptes Rendus de l'Académie des Sciences, 209, 645–647.

Hazen, A. (1914). Storage to be provided in impounding municipal water supply. Transactions of the American Society of Civil Engineers, 77(1), 1539–1640. doi:10.1061/taceat.0002563

Hogg, R. V., & Ledolter, J. (1992). Applied statistics for engineers and physical scientists (2nd int.). Macmillan.

Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in statistical packages. The American Statistician, 50(4), 361–365. doi:10.2307/2684934

Joarder, A. H., & Firozzaman, M. (2001). Quartiles for discrete data. Teaching Statistics, 23(3), 86–89. doi:10.1111/1467-9639.00063

Langford, E. (2006). Quartiles in elementary statistics. Journal of Statistics Education, 14(3), 1–17. doi:10.1080/10691898.2006.11910589

Lohninger, H. (n.d.). Quartile. Fundamentals of Statistics. Retrieved April 7, 2023, from http://www.statistics4u.com/fundstat_eng/cc_quartile.html

McAlister, D. (1879). The law of the geometric mean. Proceedings of the Royal Society of London, 29(196–199), 367–376. doi:10.1098/rspl.1879.0061

Mendenhall, W., & Sincich, T. (1992). Statistics for engineering and the sciences (3rd ed.). Dellen Publishing Company.

Moore, D. S., & McCabe, G. P. (1989). Introduction to the practice of statistics. W.H. Freeman.

Parzen, E. (1979). Nonparametric statistical data modeling. Journal of the American Statistical Association, 74(365), 105–121. doi:10.1080/01621459.1979.10481621

SAS. (1990). SAS procedures guide: Version 6 (3rd ed.). SAS Institute.

Siegel, A. F., & Morgan, C. J. (1996). Statistics and data analysis: An introduction (2nd ed.). J. Wiley.

Snedecor, G. W. (1940). Statistical methods applied to experiments in agriculture and biology (3rd ed.). The Iowa State College Press.

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley Pub. Co.

Vining, G. G. (1998). Statistical methods for engineers. Duxbury Press.

Weibull, W. (1939). The phenomenon of rupture in solids. Ingeniörs Vetenskaps Akademien, 153, 1–55.

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples

Example 1: Text Pandas Series

>>> import pandas as pd
>>> df2 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df2['Teach_Motivate']
>>> order = {"Fully Disagree":1, "Disagree":2, "Neither disagree nor agree":3, "Agree":4, "Fully agree":5}
>>> me_quartiles(ex1, levels=order)
    Q1   Q3         Q1 text                     Q3 text
0  1.0  3.0  Fully Disagree  Neither disagree nor agree

Example 2: Numeric data

>>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
>>> me_quartiles(ex2)
    Q1   Q3
0  2.0  5.0
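
The result of Example 2 can be checked by hand. The sketch below assumes that the default sas1 indexing follows Hyndman and Fan's definition 4, where the quartile sits at position n·p in the sorted data (counted from 1) and fractional positions are linearly interpolated. The function `quartile_np_linear` is a hypothetical illustration and not the stikpetP implementation.

```python
import math

def quartile_np_linear(values, p):
    """Hypothetical check: position n*p (1-based) with linear interpolation."""
    xs = sorted(values)
    n = len(xs)
    pos = n * p
    j = math.floor(pos)   # rank just below the position
    g = pos - j           # fractional part
    if j < 1:             # position before the first value
        return xs[0]
    if j >= n:            # position at or beyond the last value
        return xs[-1]
    return xs[j - 1] + g * (xs[j] - xs[j - 1])

ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
print(quartile_np_linear(ex2, 0.25), quartile_np_linear(ex2, 0.75))  # 2.0 5.0
```

With 18 values the positions are 4.5 and 13.5; both fall between two equal values (2 and 2, 5 and 5), so the interpolation has no visible effect here.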
def me_quartiles(data, levels=None, method="own", indexMethod="sas1", q1Frac="linear", q1Int="int", q3Frac="linear", q3Int="int"):
    '''
    Quartiles and Hinges
    -------------------- 
    The quartiles are at quarters of the data (McAlister, 1879, p. 374; Galton, 1881, p. 245). The median is at 50%, and the quartiles at 25% and 75%. Strictly speaking there are five quartiles: the minimum is the zeroth quartile, the value at 25% is the first (or lower) quartile, the median at 50% is the second quartile, the value at 75% is the third (or upper) quartile, and the maximum is the fourth quartile.
    
    Tukey (1977) also introduced the term Hinges and sorted the values in a W shape, where the bottom parts of the W are then the hinges.
    
    There are quite a few different methods to determine the quartiles. This function supports the 20 named methods listed in the notes, as well as a custom combination of settings.

    This function is shown in this [YouTube video](https://youtu.be/iI07nJ3wlOQ) and the measure is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Measures/Quantiles.html)
    
    Parameters
    ----------
    data : list or pandas series
    levels : dictionary, optional 
        coding to use
    method : string, optional 
        which method to use to calculate the quartiles. Either "own" (the default), which uses the five settings below, or any of the method names listed in the notes, which then override those settings
    indexMethod : {"sas1", "inclusive", "exclusive", "sas4", "excel", "hl", "hf8", "hf9"}, optional 
        which type of indexing to use. Default is "sas1"
    q1Frac : {"linear", "down", "up", "bankers", "nearest", "halfdown", "midpoint"}, optional 
        how to handle a fractional index for the first quartile (interpolation or type of rounding). Default is "linear"
    q1Int : {"int", "midpoint"}, optional 
        whether to use the integer or the midpoint method when the index for the first quartile is an integer. Default is "int"
    q3Frac : {"linear", "down", "up", "bankers", "nearest", "halfdown", "midpoint"}, optional 
        how to handle a fractional index for the third quartile (interpolation or type of rounding). Default is "linear"
    q3Int : {"int", "midpoint"}, optional  
        whether to use the integer or the midpoint method when the index for the third quartile is an integer. Default is "int"
    
    Returns
    -------
    pandas.DataFrame
        A dataframe with the following columns:
    
        * Q1, the numeric value of the first quartile
        * Q3, the numeric value of the third quartile
        * Q1 text, text version of the first quartile (only if levels are used)
        * Q3 text, text version of the third quartile (only if levels are used)
    
    Notes
    -----
    To determine the quartiles, an indexing method first locates the position of each quartile in the sorted data; this position can be fractional. See **he_quartileIndexing()** for details on the different methods to choose from.
    
    If the index is fractional, the quartile is found by linear interpolation, by rounding the index (bankers, nearest, down, up, half-down), or by taking the midpoint between the two values. If the index is an integer, either the value at that integer index or the midpoint is used. 
    
    See **he_quartileIndex()** for details on this.
    
    Note that the rounding method can even vary per quartile, i.e. the one used for the first quartile can differ from the one used for the third.

    I've come across the following methods:

    |method|indexing|q1 integer|q1 fractional|q3 integer|q3 fractional|
    |------|--------|----------|-------------|----------|-------------|
    |sas1|sas1|use int|linear|use int|linear|
    |sas2|sas1|use int|bankers|use int|bankers|
    |sas3|sas1|use int|up|use int|up|
    |sas5|sas1|midpoint|up|midpoint|up|
    |hf3b|sas1|use int|nearest|use int|halfdown|
    |sas4|sas4|use int|linear|use int|linear|
    |ms|sas4|use int|nearest|use int|halfdown|
    |lohninger|sas4|use int|nearest|use int|nearest|
    |hl2|hl|use int|linear|use int|linear|
    |hl1|hl|use int|midpoint|use int|midpoint|
    |excel|excel|use int|linear|use int|linear|
    |pd2|excel|use int|down|use int|down|
    |pd3|excel|use int|up|use int|up|
    |pd4|excel|use int|halfdown|use int|nearest|
    |pd5|excel|use int|midpoint|use int|midpoint|
    |hf8|hf8|use int|linear|use int|linear|
    |hf9|hf9|use int|linear|use int|linear|

    The following values can be used for the *method* parameter:

    1. inclusive = tukey = hinges = vining. (Tukey, 1977, p. 32; Siegel & Morgan, 1996, p. 77; Vining, 1998, p. 44).
    1. exclusive = jf. (Moore & McCabe, 1989, p. 33; Joarder & Firozzaman, 2001, p. 88).
    1. sas1 = parzen = hf4 = interpolated_inverted_cdf = maple3 = r4. (Parzen, 1979, p. 108; SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 363)
    1. sas2 = hf3 = r3. (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
    1. sas3 = hf1 = inverted_cdf = maple1 = r1 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
    1. sas4 = hf6 = minitab = snedecor = weibull = maple5 = r6 (Hyndman & Fan, 1996, p. 363; Weibull, 1939, p. ?; Snedecor, 1940, p. 43; SAS, 1990, p. 626)
    1. sas5 = hf2 = cdf = averaged_inverted_cdf = r2 (SAS, 1990, p. 626; Hyndman & Fan, 1996, p. 362)
    1. hf3b = closest_observation 
    1. ms (Mendenhall & Sincich, 1992, p. 35)
    1. lohninger (Lohninger, n.d.)
    1. hl1 (Hogg & Ledolter, 1992, p. 21)
    1. hl2 = hf5 = hazen = maple4 = r5 (Hogg & Ledolter, 1992, p. 21; Hazen, 1914, p. ?)
    1. maple2
    1. excel = hf7 = pd1 = linear = gumbel = maple6 = r7 (Hyndman & Fan, 1996, p. 363; Freund & Perles, 1987, p. 201; Gumbel, 1939, p. ?)
    1. pd2 = lower
    1. pd3 = higher
    1. pd4 = nearest
    1. pd5 = midpoint
    1. hf8 = median_unbiased = maple7 = r8 (Hyndman & Fan, 1996, p. 363)
    1. hf9 = normal_unbiased = maple8 = r9 (Hyndman & Fan, 1996, p. 363)

    *hf* is short for Hyndman and Fan, who wrote an article showcasing many different methods; *hl* is short for Hogg and Ledolter, *ms* for Mendenhall and Sincich, and *jf* for Joarder and Firozzaman. *sas* refers to the software package SAS, *maple* to Maple, *pd* to Python's pandas library, and *r* to R.
    
    The names *linear*, *lower*, *higher*, *nearest* and *midpoint* are all used by pandas quantile function and numpy percentile function. Numpy also uses *inverted_cdf*, *averaged_inverted_cdf*, *closest_observation*, *interpolated_inverted_cdf*, *hazen*, *weibull*, *median_unbiased*, and *normal_unbiased*. 

    Before, After and Alternatives
    ------------------------------
    Before this measure you might want an impression using a frequency table or a visualisation:
    * [tab_frequency](../other/table_frequency.html#tab_frequency) for a frequency table
    * [vi_bar_stacked_single](../visualisations/vis_bar_stacked_single.html#vi_bar_stacked_single) for Single Stacked Bar-Chart
    * [vi_bar_dual_axis](../visualisations/vis_bar_dual_axis.html#vi_bar_dual_axis) for Dual-Axis Bar Chart

    After this you might want some other descriptive measures:
    * [me_consensus](../measures/meas_consensus.html#me_consensus) for the Consensus
    * [me_hodges_lehmann_os](../measures/meas_hodges_lehmann_os.html#me_hodges_lehmann_os) for the Hodges-Lehmann Estimate (One-Sample)
    * [me_median](../measures/meas_median.html#me_median) for the Median
    * [me_quantiles](../measures/meas_quantiles.html#me_quantiles) for Quantiles
    * [me_quartile_range](../measures/meas_quartile_range.html#me_quartile_range) for Interquartile Range, Semi-Interquartile Range and Mid-Quartile Range
    
    or perform a test:
    * [ts_sign_os](../tests/test_sign_os.html#ts_sign_os) for One-Sample Sign Test
    * [ts_trinomial_os](../tests/test_trinomial_os.html#ts_trinomial_os) for One-Sample Trinomial Test
    * [ts_wilcoxon_os](../tests/test_wilcoxon_os.html#ts_wilcoxon_os) for Wilcoxon Signed Rank Test (One-Sample)

    For more information on the quartile indexing methods and index itself:
    * [he_quartileIndexing](../helper/help_quartileIndexing.html#he_quartileIndexing)
    * [he_quartileIndex](../helper/help_quartileIndex.html#he_quartileIndex)
    
    References
    ----------
    Freund, J. E., & Perles, B. M. (1987). A new look at quartiles of ungrouped data. *The American Statistician, 41*(3), 200–203. doi:10.1080/00031305.1987.10475479

    Galton, F. (1881). Report of the anthropometric committee. *Report of the British Association for the Advancement of Science, 51*, 225–272.

    Gumbel, E. J. (1939). La Probabilité des Hypothèses. *Comptes Rendus de l’Académie des Sciences, 209*, 645–647.

    Hazen, A. (1914). Storage to be provided in impounding municipal water supply. *Transactions of the American Society of Civil Engineers, 77*(1), 1539–1640. doi:10.1061/taceat.0002563

    Hogg, R. V., & Ledolter, J. (1992). *Applied statistics for engineers and physical scientists* (2nd int.). Macmillan.

    Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in statistical packages. *The American Statistician, 50*(4), 361–365. doi:10.2307/2684934

    Joarder, A. H., & Firozzaman, M. (2001). Quartiles for discrete data. *Teaching Statistics, 23*(3), 86–89. doi:10.1111/1467-9639.00063

    Langford, E. (2006). Quartiles in elementary statistics. *Journal of Statistics Education, 14*(3), 1–17. doi:10.1080/10691898.2006.11910589

    Lohninger, H. (n.d.). Quartile. Fundamentals of Statistics. Retrieved April 7, 2023, from http://www.statistics4u.com/fundstat_eng/cc_quartile.html

    McAlister, D. (1879). The law of the geometric mean. *Proceedings of the Royal Society of London, 29*(196–199), 367–376. doi:10.1098/rspl.1879.0061

    Mendenhall, W., & Sincich, T. (1992). *Statistics for engineering and the sciences* (3rd ed.). Dellen Publishing Company.

    Moore, D. S., & McCabe, G. P. (1989). *Introduction to the practice of statistics*. W.H. Freeman.

    Parzen, E. (1979). Nonparametric statistical data modeling. *Journal of the American Statistical Association, 74*(365), 105–121. doi:10.1080/01621459.1979.10481621

    SAS. (1990). *SAS procedures guide: Version 6* (3rd ed.). SAS Institute.

    Siegel, A. F., & Morgan, C. J. (1996). *Statistics and data analysis: An introduction* (2nd ed.). J. Wiley.

    Snedecor, G. W. (1940). *Statistical methods applied to experiments in agriculture and biology* (3rd ed.). The Iowa State College Press.

    Tukey, J. W. (1977). *Exploratory data analysis*. Addison-Wesley Pub. Co.

    Vining, G. G. (1998). *Statistical methods for engineers*. Duxbury Press.

    Weibull, W. (1939). *The phenomenon of rupture in solids*. Ingeniörs Vetenskaps Akademien, 153, 1–55.
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    Example 1: Text Pandas Series
    >>> import pandas as pd
    >>> df2 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> ex1 = df2['Teach_Motivate']
    >>> order = {"Fully Disagree":1, "Disagree":2, "Neither disagree nor agree":3, "Agree":4, "Fully agree":5}
    >>> me_quartiles(ex1, levels=order)
        Q1   Q3         Q1 text                     Q3 text
    0  1.0  3.0  Fully Disagree  Neither disagree nor agree
    
    Example 2: Numeric data
    >>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
    >>> me_quartiles(ex2)
        Q1   Q3
    0  2.0  5.0
    
    '''
    if isinstance(data, list):
        data = pd.Series(data)
        
    data = data.dropna()
    if levels is not None:
        dataN = data.map(levels).astype('Int8')
    else:
        dataN = pd.to_numeric(data)
    
    dataN = dataN.sort_values().reset_index(drop=True)
    #dataN = list(dataN)
    
    #alternative namings
    if method in ["inclusive", "tukey", "vining", "hinges"]:
        method="inclusive"
    elif method in ["exclusive", "jf"]:
        method ="exclusive"
    elif method in ["cdf", "sas5", "hf2", "averaged_inverted_cdf", "r2"]:
        method = "sas5"
    elif method in ["sas4", "minitab", "hf6", "snedecor", "weibull", "maple5", "r6"]:
        method = "sas4"
    elif method in ["excel", "hf7", "pd1", "linear", "gumbel", "maple6", "r7"]:
        method = "excel"
    elif method in ["sas1", "parzen", "hf4", "interpolated_inverted_cdf", "maple3", "r4"]:
        method = "sas1"
    elif method in ["sas2", "hf3", "r3"]:
        method = "sas2"
    elif method in ["sas3", "hf1", "inverted_cdf", "maple1", "r1"]:
        method = "sas3"
    elif method in ["hf3b", "closest_observation"]:
        method = "hf3b"
    elif method in ["hl2", "hazen", "hf5", "maple4", "r5"]:
        method = "hl2"
    elif method in ["np", "midpoint", "pd5"]:
        method = "pd5"
    elif method in ["hf8", "median_unbiased", "maple7", "r8"]:
        method = "hf8"
    elif method in ["hf9", "normal_unbiased", "maple8", "r9"]:
        method = "hf9"
    elif method in ["pd2", "lower"]:
        method = "pd2"
    elif method in ["pd3", "higher"]:
        method = "pd3"
    elif method in ["pd4", "nearest"]:
        method = "pd4"
    
    #settings
    settings = [indexMethod, q1Frac, q1Int, q3Frac, q3Int]
    if method=="inclusive":
        settings = ["inclusive", "linear","int","linear","int"]
    elif method=="exclusive":
        settings = ["exclusive", "linear","int","linear","int"]
    elif method=="sas1":
        settings = ["sas1","linear","int","linear","int"]
    elif method=="sas2":
        settings = ["sas1","bankers","int","bankers" ,"int"]
    elif method=="sas3":
        settings = ["sas1","up","int","up","int"]
    elif method=="sas5":
        settings = ["sas1","up","midpoint","up","midpoint"]
    elif method=="sas4":    
        settings = ["sas4","linear", "int","linear","int"]
    elif method=="ms": 
        settings = ["sas4", "nearest","int", "halfdown","int"]
    elif method=="lohninger":
        settings = ["sas4", "nearest", "int","nearest","int"]
    elif method=="hl2":
        settings = ["hl", "linear", "int","linear","int"]
    elif method=="hl1":
        settings = ["hl", "midpoint","int", "midpoint","int"]
    elif method=="excel":
        settings = ["excel", "linear","int","linear", "int"]
    elif method=="pd2":
        settings = ["excel", "down", "int", "down","int"]
    elif method=="pd3":
        settings = ["excel", "up","int","up","int"]
    elif method=="pd4":
        settings = ["excel", "halfdown",  "int","nearest", "int"]
    elif method=="hf3b":
        settings = ["sas1", "nearest","int","halfdown","int"]
    elif method=="pd5":
        settings = ["excel", "midpoint","int","midpoint","int"]
    elif method=="hf8":
        settings = ["hf8", "linear","int","linear", "int"]
    elif method=="hf9":
        settings = ["hf9", "linear","int","linear", "int"]
    elif method=="maple2":
        settings = ["hl", "down","int","down", "int"]
    
    q1, q3 = he_quartileIndex(dataN, settings[0], settings[1], settings[2], settings[3], settings[4])
    
    #find the text representatives
    
    if levels is not None:
        if q1 == round(q1):
            q1T = list(levels.keys())[list(levels.values()).index(q1)]

        else:
            #q1 is a value, not a position in the data, so look up the neighbouring codes directly (as done for q3)
            q1T = "between " + list(levels.keys())[list(levels.values()).index(math.floor(q1))] + " and " + list(levels.keys())[list(levels.values()).index(math.ceil(q1))]

        if q3 == round(q3):
            q3T = list(levels.keys())[list(levels.values()).index(q3)]

        else:
            q3T = "between " + list(levels.keys())[list(levels.values()).index(math.floor(q3))] + " and " + list(levels.keys())[list(levels.values()).index(math.ceil(q3))]
        
        
        results = pd.DataFrame([[q1, q3, q1T, q3T]], columns=["Q1", "Q3", "Q1 text", "Q3 text"])
    else:
        results = pd.DataFrame([[q1, q3]], columns=["Q1", "Q3"])
        
    pd.set_option('display.max_colwidth', None)
    
    return results
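
The chain of `elif` branches under "alternative namings" could also be written as a lookup table, which keeps the accepted aliases in one place and makes them easier to compare against the list in the notes. The following is only a minimal sketch of that idea, not part of stikpetP. The names `_ALIAS_GROUPS`, `_ALIAS_TO_CANONICAL` and `canonical_method` are hypothetical, only a few of the alias groups are reproduced, and the case-insensitive matching is an added assumption (the function above compares names exactly as given).

```python
# Hypothetical helper, not part of stikpetP: canonical name -> accepted aliases.
# Only a subset of the alias groups from the notes is reproduced here.
_ALIAS_GROUPS = {
    "inclusive": ["tukey", "hinges", "vining"],
    "exclusive": ["jf"],
    "sas1": ["parzen", "hf4", "interpolated_inverted_cdf", "maple3", "r4"],
    "sas4": ["hf6", "minitab", "snedecor", "weibull", "maple5", "r6"],
    "hl2": ["hf5", "hazen", "maple4", "r5"],
    "excel": ["hf7", "pd1", "linear", "gumbel", "maple6", "r7"],
}

# Flatten into alias -> canonical; each canonical name also maps to itself.
_ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in _ALIAS_GROUPS.items()
    for alias in [canonical, *aliases]
}


def canonical_method(method):
    """Return the canonical method name; unknown names pass through unchanged."""
    key = method.lower()  # assumption: aliases are matched case-insensitively
    return _ALIAS_TO_CANONICAL.get(key, method)


print(canonical_method("tukey"))  # inclusive
print(canonical_method("Hazen"))  # hl2
print(canonical_method("own"))    # own
```

Names that match no alias are returned unchanged, mirroring how the function above leaves `method` untouched (for example the default "own") so that the user-supplied index and rounding settings are used.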