Module `stikpetP.measures.meas_mode_bin`

import pandas as pd
from ..other.table_frequency_bins import tab_frequency_bins

def me_mode_bin(data, nbins="sturges", bins=None, incl_lower=True, adjust=1, allEq="none", value="none"):
    '''
    Mode for Binned Data
    --------------------
    
    The mode is a measure of central tendency and defined as “the abscissa corresponding to the ordinate of maximum frequency” (Pearson, 1895, p. 345). A more modern definition would be “the most common value obtained in a set of observations” (Weisstein, 2002).

    For binned data the mode is the bin with the highest frequency density. This will have the same result as using the highest frequency if all bins are of equal size. A frequency density is the frequency divided by the bin size (Zedeck, 2014, pp. 144-145). Different methods exist to narrow this down to a single value. See the notes for more info on this.

    The word mode might even come from the French word 'mode' which means fashion. Fashion is what most people wear, so the mode is the option most people chose.

    If one category has the highest frequency this category will be the modal category, and if two or more categories have the same highest frequency each of them will be the mode. If there is only one mode the set is sometimes called unimodal, if there are two it is called bimodal, with three trimodal, etc. For two or more, the term multimodal can also be used.

    An advantage of the mode over many other measures of central tendency (like the median and mean) is that it can already be determined for nominal data.

    This function is shown in this [YouTube video](https://youtu.be/_-ht6yFKBDI) and the measure is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Measures/Mode.html)

    Parameters
    ----------
    data : list or pandas data series
    nbins : int or string, optional
        either the number of bins to create, or a specific method from the *tab_nbins()* function. Default is "sturges"
    bins : list of tuples, optional
        the bins to use, each as a (lower bound, upper bound) tuple. Default is None
    incl_lower : boolean, optional
        to include the lower bound, otherwise the upper bound is included. Default is True
    adjust : float, optional
        value to add or subtract to guarantee all scores will fit in a bin. Default is 1
    allEq : {"none", "all"}, optional
        indicator on what to do if all bins have the same frequency density. Default is "none"
    value : {"none", "midpoint", "quadratic"}, optional
        which value to show in the output. Default is "none"

    Returns
    -------
    A pandas dataframe with:

    * *mode*, the mode(s), or "none" if no mode was found
    * *mode fd.*, frequency density of the mode

    Notes
    -----
    **Value to return**

    If *value="midpoint"* is used the modal bin(s) midpoints are shown, using:
    $$MP_m = \\frac{UB_m + LB_m}{2}$$
    Where $UB_m$ is the upper bound of the modal bin, and $LB_m$ the lower bound.

    If *value="quadratic"* is used a quadratic curve is made from the midpoint of the bin prior to the modal bin, to 
    the midpoint of the bin after the modal bin. This is done using:
    $$M = LB_{m} + \\frac{d_1}{d_1 + d_2}\\times\\left(UB_m - LB_m\\right)$$
    With:
    $$d_1 = FD_m - FD_{m -1}$$
    $$d_2 = FD_m - FD_{m + 1}$$

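    Where $FD_m$ is the frequency density of the modal category. If the modal bin is the first or the last bin, the missing neighbour is treated as having a frequency density of 0.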

    **Multimode**

    One small controversy exists if all categories have the same frequency. In this case none of them has a higher 
    occurrence than the others, so none of them would be the mode (see for example Spiegel & Stephens, 2008, p. 64; 
    Larson & Farber, 2014, p. 69). This is used when *allEq="none"*, which is the default.

    On a rare occasion someone might argue that if all categories have the same frequency, then all categories are 
    part of the mode since they all have the highest frequency. This is used when *allEq="all"*.

    The function can return the bins that are the modal bins, by setting *value="none"*.

    Before, After and Alternatives
    ------------------------------
    Before this you might want to create a binned frequency table or a visualisation:
    * [tab_frequency_bins](../other/table_frequency_bins.html#tab_frequency_bins) to create a binned frequency table
    * [vi_boxplot_single](../visualisations/vis_boxplot_single.html#vi_boxplot_single) for a Box (and Whisker) Plot
    * [vi_histogram](../visualisations/vis_histogram.html#vi_histogram) for a Histogram
    * [vi_stem_and_leaf](../visualisations/vis_stem_and_leaf.html#vi_stem_and_leaf) for a Stem-and-Leaf Display

    After this you might want some other descriptive measures:
    * [me_mean](../measures/meas_mean.html#me_mean) for different types of mean
    * [me_variation](../measures/meas_variation.html#me_variation) for different Measures of Quantitative Variation
    
    Or perform a test:
    * [ts_student_t_os](../tests/test_student_t_os.html#ts_student_t_os) for One-Sample Student t-Test
    * [ts_trimmed_mean_os](../tests/test_trimmed_mean_os.html#ts_trimmed_mean_os) for One-Sample Trimmed (Yuen or Yuen-Welch) Mean Test
    * [ts_z_os](../tests/test_z_os.html#ts_z_os) for One-Sample Z Test

    The mode for non-binned data can be determined using:
    * [me_mode](../measures/meas_mode.html#me_mode) for Mode 

    References
    ----------
    Larson, R., & Farber, E. (2014). *Elementary statistics: Picturing the world* (6th ed.). Pearson.

    Pearson, K. (1895). Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material. *Philosophical Transactions of the Royal Society of London. (A.), 186*, 343–414. doi:10.1098/rsta.1895.0010

    Spiegel, M. R., & Stephens, L. J. (2008). *Schaum’s outline of theory and problems of statistics* (4th ed.). McGraw-Hill.

    Weisstein, E. W. (2002). *CRC concise encyclopedia of mathematics* (2nd ed.). Chapman & Hall/CRC.

    Zedeck, S. (Ed.). (2014). *APA dictionary of statistics and research methods*. American Psychological Association.

    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076

    Examples
    --------
    Example 1: pandas series
    >>> student_df = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> ex1 = student_df['Gen_Age']
    >>> myBins = [(0, 20), (20, 25), (25, 30), (30, 120)]
    >>> me_mode_bin(ex1, bins=myBins)
              mode  mode fd.
    0  20.0 < 25.0       4.2
    
    Example 2: Numeric list unimodal
    >>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5]
    >>> myBins = [(0, 3), (3, 5), (5, 6)]
    >>> me_mode_bin(ex2, bins=myBins)
            mode  mode fd.
    0  5.0 < 6.0       6.0
    
    Example 3: Numeric list bimodal and using midpoint
    >>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6]
    >>> myBins = [(1, 3), (3, 5), (5, 7)]
    >>> me_mode_bin(ex2, bins=myBins, value='midpoint')
           mode  mode fd.
    0  2.0, 6.0       3.0
    
    '''
    # accept a plain list as well as a pandas series
    if type(data) is list:
        data = pd.Series(data)
    
    if value not in ("none", "midpoint", "quadratic"):
        raise ValueError('value should be "none", "midpoint", or "quadratic"')
        
    # binned frequency table: columns 0 and 1 hold the bin bounds, column 3 the frequency density
    freq = tab_frequency_bins(data, nbins, bins, incl_lower, adjust)
    modeFD = max(freq['frequency density'])
    nModes = sum(freq['frequency density']==modeFD)
    k = len(freq)
    
    if nModes==k and allEq=="none":
        # every bin has the same frequency density, so no bin stands out as the mode
        mode = "none"
        modeFD = "none"
    else:
        modes = []
        for i in range(0,k):
            if freq.iloc[i, 3] != modeFD:
                continue
            lb, ub = freq.iloc[i, 0], freq.iloc[i, 1]
            if value=="midpoint":
                newMode = (ub + lb)/2
            elif value=="quadratic":
                # a missing neighbour (first or last bin) counts as frequency density 0
                fdPrev = freq.iloc[i-1, 3] if i > 0 else 0
                fdNext = freq.iloc[i+1, 3] if i < (k-1) else 0
                d1 = modeFD - fdPrev
                d2 = modeFD - fdNext
                newMode = lb + d1/(d1 + d2) * (ub - lb)
            else:
                newMode = str(lb) + " < " + str(ub)
            modes.append(newMode)
        
        # a single mode is kept as is, multiple modes are joined into one string
        mode = modes[0] if len(modes)==1 else ", ".join(str(m) for m in modes)
                    
    res = pd.DataFrame(list([[mode, modeFD]]), columns = ["mode", "mode fd."])
    
    return (res)

Functions

def me_mode_bin(data, nbins='sturges', bins=None, incl_lower=True, adjust=1, allEq='none', value='none')

Mode For Binned Data

The mode is a measure of central tendency and defined as “the abscissa corresponding to the ordinate of maximum frequency” (Pearson, 1895, p. 345). A more modern definition would be “the most common value obtained in a set of observations” (Weisstein, 2002).

For binned data the mode is the bin with the highest frequency density. This will have the same result as using the highest frequency if all bins are of equal size. A frequency density is the frequency divided by the bin size (Zedeck, 2014, pp. 144-145). Different methods exist to narrow this down to a single value. See the notes for more info on this.
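
As a rough illustration of frequency density, the sketch below counts the scores per bin and divides each count by the bin width. The helper `frequency_densities` is hypothetical and not part of stikpetP; it assumes each bin includes its lower bound and excludes its upper bound.

```python
def frequency_densities(scores, bins):
    """Return (lower, upper, frequency, frequency density) for each bin.

    Hypothetical helper: a bin is assumed to include its lower bound
    and to exclude its upper bound.
    """
    table = []
    for lower, upper in bins:
        freq = sum(1 for x in scores if lower <= x < upper)
        table.append((lower, upper, freq, freq / (upper - lower)))
    return table


scores = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5]
bins = [(0, 3), (3, 5), (5, 6)]
table = frequency_densities(scores, bins)

# the modal bin is the one with the highest frequency density
modal_bin = max(table, key=lambda row: row[3])
print(modal_bin)  # (5, 6, 6, 6.0)
```

Here the first and the last bin both hold six scores, yet only the narrower last bin is modal once the frequencies are divided by the bin widths.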

The word mode might even come from the French word 'mode' which means fashion. Fashion is what most people wear, so the mode is the option most people chose.

If one category has the highest frequency this category will be the modal category, and if two or more categories have the same highest frequency each of them will be the mode. If there is only one mode the set is sometimes called unimodal, if there are two it is called bimodal, with three trimodal, etc. For two or more, the term multimodal can also be used.

An advantage of the mode over many other measures of central tendency (like the median and mean) is that it can already be determined for nominal data.

This function is shown in this YouTube video (https://youtu.be/_-ht6yFKBDI) and the measure is also described at PeterStatistics.com (https://peterstatistics.com/Terms/Measures/Mode.html).

Parameters

data : list or pandas data series
nbins : int or string, optional: either the number of bins to create, or a specific method from the tab_nbins() function. Default is "sturges"
bins : list of tuples, optional: the bins to use, each as a (lower bound, upper bound) tuple. Default is None
incl_lower : boolean, optional: to include the lower bound, otherwise the upper bound is included. Default is True
adjust : float, optional: value to add or subtract to guarantee all scores will fit in a bin. Default is 1
allEq : {"none", "all"}, optional: indicator on what to do if all bins have the same frequency density. Default is "none"
value : {"none", "midpoint", "quadratic"}, optional: which value to show in the output. Default is "none"
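
One way to read incl_lower is sketched below. The helper `in_bin` is hypothetical; how tab_frequency_bins treats bin edges internally is assumed here from the parameter description, not taken from its source.

```python
def in_bin(score, lower, upper, incl_lower=True):
    """Check bin membership under the assumed edge rule (hypothetical helper)."""
    if incl_lower:
        return lower <= score < upper   # [lower, upper)
    return lower < score <= upper       # (lower, upper]


bins = [(0, 20), (20, 25)]

# a score of exactly 20 sits on the edge shared by both bins
with_lower = [b for b in bins if in_bin(20, *b, incl_lower=True)]
with_upper = [b for b in bins if in_bin(20, *b, incl_lower=False)]
print(with_lower)  # [(20, 25)]
print(with_upper)  # [(0, 20)]
```

Under this reading a score on a shared edge is counted exactly once, and the setting only decides which of the two neighbouring bins receives it.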

Returns

A pandas dataframe with:

* mode, the mode(s), or "none" if no mode was found
* mode fd., frequency density of the mode

Notes

Value to return

If value="midpoint" is used the modal bin(s) midpoints are shown, using:
$$MP_m = \frac{UB_m + LB_m}{2}$$
Where $UB_m$ is the upper bound of the modal bin, and $LB_m$ the lower bound.

If value="quadratic" is used a quadratic curve is made from the midpoint of the bin prior to the modal bin, to the midpoint of the bin after the modal bin. This is done using:
$$M = LB_{m} + \frac{d_1}{d_1 + d_2}\times\left(UB_m - LB_m\right)$$
With:
$$d_1 = FD_m - FD_{m -1}$$
$$d_2 = FD_m - FD_{m + 1}$$

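Where $FD_m$ is the frequency density of the modal category. If the modal bin is the first or the last bin, the missing neighbour is treated as having a frequency density of 0.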
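
The two formulas can be tried out with a small standalone sketch. The function names `midpoint` and `quadratic_mode` are hypothetical and only mirror the notation above; the densities in the example are made up for illustration.

```python
def midpoint(lower, upper):
    """Midpoint of a bin: (UB + LB) / 2."""
    return (upper + lower) / 2


def quadratic_mode(lower, upper, fd_mode, fd_prev=0, fd_next=0):
    """Interpolated mode: LB + d1 / (d1 + d2) * (UB - LB).

    A missing neighbour (first or last bin) is given a density of 0,
    which corresponds to d1 = FD_m or d2 = FD_m in the function's source.
    """
    d1 = fd_mode - fd_prev
    d2 = fd_mode - fd_next
    return lower + d1 / (d1 + d2) * (upper - lower)


# modal bin 10 < 20 with density 4, neighbours with densities 1 and 3
print(midpoint(10, 20))                                 # 15.0
print(quadratic_mode(10, 20, 4, fd_prev=1, fd_next=3))  # 17.5
```

The interpolated value lies above the midpoint because the bin after the modal bin is denser than the bin before it, which pulls the estimate towards the upper bound.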

Multimode

One small controversy exists if all categories have the same frequency. In this case none of them has a higher occurrence than the others, so none of them would be the mode (see for example Spiegel & Stephens, 2008, p. 64; Larson & Farber, 2014, p. 69). This is used when allEq="none", which is the default.

On a rare occasion someone might argue that if all categories have the same frequency, then all categories are part of the mode since they all have the highest frequency. This is used when allEq="all".

The function can return the bins that are the modal bins, by setting value="none".
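
The effect of allEq can be sketched as follows. The helper `modal_bins` is hypothetical and works on a plain list of frequency densities; it returns an empty list for "no mode", whereas me_mode_bin itself reports the string "none".

```python
def modal_bins(densities, all_eq="none"):
    """Return the positions of the bins with the highest frequency density.

    An empty list stands in for 'no mode' (an assumption of this sketch).
    """
    top = max(densities)
    modes = [i for i, fd in enumerate(densities) if fd == top]
    # if every bin ties for the maximum, all_eq decides the outcome
    if len(modes) == len(densities) and all_eq == "none":
        return []
    return modes


print(modal_bins([3.0, 2.5, 3.0]))                # [0, 2]
print(modal_bins([2.0, 2.0, 2.0]))                # []
print(modal_bins([2.0, 2.0, 2.0], all_eq="all"))  # [0, 1, 2]
```

A tie between some but not all bins is always reported as multiple modes; the setting only matters when every bin shares the same density.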

Before, After and Alternatives

Before this you might want to create a binned frequency table or a visualisation:
* tab_frequency_bins to create a binned frequency table
* vi_boxplot_single for a Box (and Whisker) Plot
* vi_histogram for a Histogram
* vi_stem_and_leaf for a Stem-and-Leaf Display

After this you might want some other descriptive measures:
* me_mean for different types of mean
* me_variation for different Measures of Quantitative Variation

Or perform a test:
* ts_student_t_os for One-Sample Student t-Test
* ts_trimmed_mean_os for One-Sample Trimmed (Yuen or Yuen-Welch) Mean Test
* ts_z_os for One-Sample Z Test

The mode for non-binned data can be determined using:
* me_mode for Mode

References

Larson, R., & Farber, E. (2014). Elementary statistics: Picturing the world (6th ed.). Pearson.

Pearson, K. (1895). Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London. (A.), 186, 343–414. doi:10.1098/rsta.1895.0010

Spiegel, M. R., & Stephens, L. J. (2008). Schaum’s outline of theory and problems of statistics (4th ed.). McGraw-Hill.

Weisstein, E. W. (2002). CRC concise encyclopedia of mathematics (2nd ed.). Chapman & Hall/CRC.

Zedeck, S. (Ed.). (2014). APA dictionary of statistics and research methods. American Psychological Association.

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples

Example 1: pandas series

>>> student_df = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = student_df['Gen_Age']
>>> myBins = [(0, 20), (20, 25), (25, 30), (30, 120)]
>>> me_mode_bin(ex1, bins=myBins)
          mode  mode fd.
0  20.0 < 25.0       4.2

Example 2: Numeric list unimodal

>>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5]
>>> myBins = [(0, 3), (3, 5), (5, 6)]
>>> me_mode_bin(ex2, bins=myBins)
        mode  mode fd.
0  5.0 < 6.0       6.0

Example 3: Numeric list bimodal and using midpoint

>>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6]
>>> myBins = [(1, 3), (3, 5), (5, 7)]
>>> me_mode_bin(ex2, bins=myBins, value='midpoint')
       mode  mode fd.
0  2.0, 6.0       3.0
