Module stikpetP.measures.meas_mode_bin
import pandas as pd
from ..other.table_frequency_bins import tab_frequency_bins
def me_mode_bin(data, nbins="sturges", bins=None, incl_lower=True, adjust=1, allEq="none", value="none"):
'''
Mode for Binned Data
--------------------
The mode is a measure of central tendency and defined as “the abscissa corresponding to the ordinate of maximum frequency” (Pearson, 1895, p. 345). A more modern definition would be “the most common value obtained in a set of observations” (Weisstein, 2002).
For binned data the mode is the bin with the highest frequency density. This will have the same result as using the highest frequency if all bins are of equal size. A frequency density is the frequency divided by the bin size (Zedeck, 2014, pp. 144-145). Different methods exist to narrow this down to a single value. See the notes for more info on this.
The word mode might even come from the French word 'mode' which means fashion. Fashion is what most people wear, so the mode is the option most people chose.
If one category has the highest frequency this category will be the modal category, and if two or more categories have the same highest frequency each of them will be the mode. If there is only one mode the set is sometimes called unimodal, if there are two it is called bimodal, with three trimodal, etc. For two or more, the term multimodal can also be used.
An advantage of the mode over many other measures of central tendency (like the median and mean) is that it can already be determined for nominal data.
This function is shown in this [YouTube video](https://youtu.be/_-ht6yFKBDI) and the measure is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Measures/Mode.html)
Parameters
----------
data : list or pandas data series
nbins : int or string, optional
either the number of bins to create, or a specific method from the *tab_nbins()* function. Default is "sturges"
bins : list of tuples, optional
incl_lower : boolean, optional
to include the lower bound, otherwise the upper bound is included. Default is True
adjust : float, optional
value to add or subtract to guarantee all scores will fit in a bin
allEq : {"none", "all"}, optional
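indicator on what to do if all bins have the same frequency density: "none" reports no mode, "all" reports every bin as a mode. Default is "none"
value : {"none", "midpoint", "quadratic"}, optional
which value to show in the output: the modal bin(s) themselves ("none"), their midpoints, or a quadratic estimate. Default is "none"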
Returns
-------
A pandas dataframe with:
* *mode*, the mode(s)
* *mode fd.*, frequency density of the mode
Notes
-----
**Value to return**
If *value="midpoint"* is used the modal bin(s) midpoints are shown, using:
$$MP_m = \\frac{UB_m + LB_m}{2}$$
Where $UB_m$ is the upper bound of the modal bin, and $LB_m$ the lower bound.
If *value="quadratic"* is used a quadratic curve is made from the midpoint of the bin prior to the modal bin, to
the midpoint of the bin after the modal bin. This is done using:
$$M = LB_{m} + \\frac{d_1}{d_1 + d_2}\\times\\left(UB_m - LB_m\\right)$$
With:
$$d_1 = FD_m - FD_{m -1}$$
$$d_2 = FD_m - FD_{m + 1}$$
Where $FD_m$ is the frequency density of the modal category.
**Multimode**
One small controversy exists if all categories have the same frequency. In this case none of them has a higher
occurrence than the others, so none of them would be the mode (see for example Spiegel & Stephens, 2008, p. 64,
Larson & Farber, 2014, p. 69). This is used when *allEq="none"*, which is the default.
On a rare occasion someone might argue that if all categories have the same frequency, then all categories are
part of the mode since they all have the highest frequency. This is used when *allEq="all"*.
The function can return the bins that are the modal bins, by setting *value="none"*.
Before, After and Alternatives
------------------------------
Before this you might want to create a binned frequency table or a visualisation:
* [tab_frequency_bins](../other/table_frequency_bins.html#tab_frequency_bins) to create a binned frequency table
* [vi_boxplot_single](../visualisations/vis_boxplot_single.html#vi_boxplot_single) for a Box (and Whisker) Plot
* [vi_histogram](../visualisations/vis_histogram.html#vi_histogram) for a Histogram
* [vi_stem_and_leaf](../visualisations/vis_stem_and_leaf.html#vi_stem_and_leaf) for a Stem-and-Leaf Display
After this you might want some other descriptive measures:
* [me_mean](../measures/meas_mean.html#me_mean) for different types of mean
* [me_variation](../measures/meas_variation.html#me_variation) for different Measures of Quantitative Variation
Or perform a test:
* [ts_student_t_os](../tests/test_student_t_os.html#ts_student_t_os) for One-Sample Student t-Test
* [ts_trimmed_mean_os](../tests/test_trimmed_mean_os.html#ts_trimmed_mean_os) for One-Sample Trimmed (Yuen or Yuen-Welch) Mean Test
* [ts_z_os](../tests/test_z_os.html#ts_z_os) for One-Sample Z Test
The mode for non-binned data can be determined using:
* [me_mode](../measures/meas_mode.html#me_mode) for Mode
References
----------
Larson, R., & Farber, E. (2014). *Elementary statistics: Picturing the world* (6th ed.). Pearson.
Pearson, K. (1895). Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material. *Philosophical Transactions of the Royal Society of London. (A.), 186*, 343–414. doi:10.1098/rsta.1895.0010
Spiegel, M. R., & Stephens, L. J. (2008). *Schaum’s outline of theory and problems of statistics* (4th ed.). McGraw-Hill.
Weisstein, E. W. (2002). *CRC concise encyclopedia of mathematics* (2nd ed.). Chapman & Hall/CRC.
Zedeck, S. (Ed.). (2014). *APA dictionary of statistics and research methods*. American Psychological Association.
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
--------
Example 1: pandas series
>>> student_df = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = student_df['Gen_Age']
>>> myBins = [(0, 20), (20, 25), (25, 30), (30, 120)]
>>> me_mode_bin(ex1, bins=myBins)
mode mode fd.
0 20.0 < 25.0 4.2
Example 2: Numeric list unimodal
>>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5]
>>> myBins = [(0, 3), (3, 5), (5, 6)]
>>> me_mode_bin(ex2, bins=myBins)
mode mode fd.
0 5.0 < 6.0 6.0
Example 3: Numeric list bimodal and using midpoint
>>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6]
>>> myBins = [(1, 3), (3, 5), (5, 7)]
>>> me_mode_bin(ex2, bins=myBins, value='midpoint')
mode mode fd.
0 2.0, 6.0 3.0
'''
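    # Convert a plain list to a pandas Series.
    if type(data) is list:
        data = pd.Series(data)
    # Binned frequency table. The code below reads it by position:
    # column 0 = lower bound, 1 = upper bound, 3 = frequency density.
    freq = tab_frequency_bins(data, nbins, bins, incl_lower, adjust)
    # Highest frequency density, and how many of the k bins reach it.
    modeFD = max(freq['frequency density'])
    nModes = sum(freq['frequency density']==modeFD)
    k = len(freq)
    if nModes==k and allEq=="none":
        # Every bin ties for the maximum: report no mode.
        mode = "none"
        modeFD = "none"
    else:
        if value=="midpoint":
            # Report the midpoint of each modal bin.
            ff = 0
            for i in range(0,k):
                if freq.iloc[i, 3] == modeFD:
                    newMode = (freq.iloc[i, 1] + freq.iloc[i, 0])/2
                    if ff==0:
                        mode = newMode
                        ff = ff + 1
                    else:
                        mode = str(mode) + ", " + str(newMode)
        elif value=="quadratic":
            # Interpolate within each modal bin using the neighbouring densities.
            ff = 0
            for i in range(0,k):
                if freq.iloc[i, 3] == modeFD:
                    if i==0:
                        # No bin before the modal bin: its density is taken as 0.
                        d1 = modeFD
                        d2 = modeFD - freq.iloc[i+1, 3]
                    elif i==(k-1):
                        # No bin after the modal bin: its density is taken as 0.
                        d1 = modeFD - freq.iloc[i-1, 3]
                        d2 = modeFD
                    else:
                        d1 = modeFD - freq.iloc[i-1, 3]
                        d2 = modeFD - freq.iloc[i+1, 3]
                    newMode = freq.iloc[i, 0] + d1/(d1 + d2) * (freq.iloc[i, 1] - freq.iloc[i, 0])
                    if ff==0:
                        mode = newMode
                        ff = ff + 1
                    else:
                        mode = str(mode) + ", " + str(newMode)
        elif value == "none":
            # Report each modal bin as "lower < upper".
            ff = 0
            for i in range(0,k):
                if freq.iloc[i, 3] == modeFD:
                    newMode = str(freq.iloc[i, 0]) + " < " + str(freq.iloc[i, 1])
                    if ff==0:
                        mode = newMode
                        ff = ff + 1
                    else:
                        mode = str(mode) + ", " + str(newMode)
    # One-row result: the mode(s) and the modal frequency density.
    res = pd.DataFrame(list([[mode, modeFD]]), columns = ["mode", "mode fd."])
    return (res)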
Functions
def me_mode_bin(data, nbins='sturges', bins=None, incl_lower=True, adjust=1, allEq='none', value='none')
-
Mode For Binned Data
The mode is a measure of central tendency and defined as “the abscissa corresponding to the ordinate of maximum frequency” (Pearson, 1895, p. 345). A more modern definition would be “the most common value obtained in a set of observations” (Weisstein, 2002).
For binned data the mode is the bin with the highest frequency density. This will have the same result as using the highest frequency if all bins are of equal size. A frequency density is the frequency divided by the bin size (Zedeck, 2014, pp. 144-145). Different methods exist to narrow this down to a single value. See the notes for more info on this.
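The difference between highest frequency and highest frequency density can be illustrated with a small sketch. This is not the package's implementation: `frequency_densities` is a hypothetical helper written in plain Python, and it assumes the lower bound of each bin is included (as with `incl_lower=True`). The data and bins are those of Example 2.

```python
def frequency_densities(data, bins):
    """Return (lower, upper, frequency, frequency density) for each bin.

    Hypothetical helper for illustration; lower bounds are inclusive.
    """
    rows = []
    for lower, upper in bins:
        count = sum(1 for x in data if lower <= x < upper)
        # frequency density = frequency divided by the bin size
        rows.append((lower, upper, count, count / (upper - lower)))
    return rows


scores = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5]
table = frequency_densities(scores, [(0, 3), (3, 5), (5, 6)])

# The modal bin is the one with the highest frequency density.
modal_bin = max(table, key=lambda row: row[3])
print(modal_bin)  # (5, 6, 6, 6.0)
```

The bins 0 < 3 and 5 < 6 both contain six scores, but their sizes differ (3 versus 1), so only 5 < 6 has the highest frequency density.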
The word mode might even come from the French word 'mode' which means fashion. Fashion is what most people wear, so the mode is the option most people chose.
If one category has the highest frequency this category will be the modal category, and if two or more categories have the same highest frequency each of them will be the mode. If there is only one mode the set is sometimes called unimodal, if there are two it is called bimodal, with three trimodal, etc. For two or more, the term multimodal can also be used.
An advantage of the mode over many other measures of central tendency (like the median and mean) is that it can already be determined for nominal data.
This function is shown in this YouTube video and the measure is also described at PeterStatistics.com
Parameters
data : list or pandas data series
nbins : int or string, optional
either the number of bins to create, or a specific method from the tab_nbins() function. Default is "sturges"
bins : list of tuples, optional
incl_lower : boolean, optional
to include the lower bound, otherwise the upper bound is included. Default is True
adjust : float, optional
value to add or subtract to guarantee all scores will fit in a bin
allEq : {"none", "all"}, optional
indicator on what to do if all bins have the same frequency density: "none" reports no mode, "all" reports every bin as a mode. Default is "none"
value : {"none", "midpoint", "quadratic"}, optional
which value to show in the output: the modal bin(s) themselves ("none"), their midpoints, or a quadratic estimate. Default is "none"
Returns
A pandas dataframe with:
- mode, the mode(s)
- mode fd., frequency density of the mode
Notes
Value to return
If value="midpoint" is used the modal bin(s) midpoints are shown, using:
$$MP_m = \frac{UB_m + LB_m}{2}$$
Where $UB_m$ is the upper bound of the modal bin, and $LB_m$ the lower bound.
If value="quadratic" is used a quadratic curve is made from the midpoint of the bin prior to the modal bin, to the midpoint of the bin after the modal bin. This is done using:
$$M = LB_{m} + \frac{d_1}{d_1 + d_2}\times\left(UB_m - LB_m\right)$$
With:
$$d_1 = FD_m - FD_{m -1}$$
$$d_2 = FD_m - FD_{m + 1}$$
Where $FD_m$ is the frequency density of the modal category.
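As a worked illustration of the two formulas, the sketch below computes both values for a modal bin. The function names `midpoint_mode` and `quadratic_mode` and the example densities are hypothetical, chosen only for this illustration. For a modal bin at either end of the table, the missing neighbour is treated as having a frequency density of 0, which mirrors what the source code of this function does.

```python
def midpoint_mode(lower, upper):
    """Midpoint of the modal bin: (UB_m + LB_m) / 2."""
    return (upper + lower) / 2


def quadratic_mode(bounds, densities, m):
    """Quadratic estimate of the mode for the modal bin at index m.

    bounds is a list of (lower, upper) tuples, densities the matching
    frequency densities. A missing neighbour counts as density 0.
    Raises ZeroDivisionError if d1 + d2 is 0 (both neighbours tie
    with the modal bin).
    """
    lower, upper = bounds[m]
    fd_prev = densities[m - 1] if m > 0 else 0
    fd_next = densities[m + 1] if m < len(densities) - 1 else 0
    d1 = densities[m] - fd_prev
    d2 = densities[m] - fd_next
    return lower + d1 / (d1 + d2) * (upper - lower)


bounds = [(0, 10), (10, 20), (20, 30)]
densities = [4.0, 5.0, 2.0]  # the middle bin is the modal bin

print(midpoint_mode(*bounds[1]))             # 15.0
print(quadratic_mode(bounds, densities, 1))  # 12.5
```

Here $d_1 = 5 - 4 = 1$ and $d_2 = 5 - 2 = 3$, so the estimate lies a quarter of the way into the modal bin: the denser neighbour on the left pulls the mode towards the lower bound, whereas the midpoint ignores the neighbours.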
Multimode
One small controversy exists if all categories have the same frequency. In this case none of them has a higher occurrence than the others, so none of them would be the mode (see for example Spiegel & Stephens, 2008, p. 64, Larson & Farber, 2014, p. 69). This is used when allEq="none", which is the default.
On a rare occasion someone might argue that if all categories have the same frequency, then all categories are part of the mode since they all have the highest frequency. This is used when allEq="all".
The function can return the bins that are the modal bins, by setting value="none".
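The effect of the allEq setting can be sketched in a few lines of plain Python. The helper `modal_bins` and its `all_eq` argument are hypothetical names for this illustration and not part of the package. The first set of densities is what the data and bins of Example 3 give (frequencies 6, 5 and 6 over bins of size 2).

```python
def modal_bins(bounds, densities, all_eq="none"):
    """Return the bins whose frequency density equals the maximum.

    If every bin ties and all_eq is "none", no bin is reported.
    """
    top = max(densities)
    modes = [b for b, fd in zip(bounds, densities) if fd == top]
    if len(modes) == len(densities) and all_eq == "none":
        return []
    return modes


bounds = [(1, 3), (3, 5), (5, 7)]

# Two bins tie for the highest density: both are modal (bimodal).
print(modal_bins(bounds, [3.0, 2.5, 3.0]))                # [(1, 3), (5, 7)]

# All bins tie: no mode by default, every bin with all_eq="all".
print(modal_bins(bounds, [2.0, 2.0, 2.0]))                # []
print(modal_bins(bounds, [2.0, 2.0, 2.0], all_eq="all"))  # [(1, 3), (3, 5), (5, 7)]
```

Note that the setting only matters when every bin ties. A tie between some, but not all, bins always reports each tied bin as a mode.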
Before, After and Alternatives
Before this you might want to create a binned frequency table or a visualisation:
* tab_frequency_bins to create a binned frequency table
* vi_boxplot_single for a Box (and Whisker) Plot
* vi_histogram for a Histogram
* vi_stem_and_leaf for a Stem-and-Leaf Display
After this you might want some other descriptive measures:
* me_mean for different types of mean
* me_variation for different Measures of Quantitative Variation
Or perform a test:
* ts_student_t_os for One-Sample Student t-Test
* ts_trimmed_mean_os for One-Sample Trimmed (Yuen or Yuen-Welch) Mean Test
* ts_z_os for One-Sample Z Test
The mode for non-binned data can be determined using:
* me_mode for Mode
References
Larson, R., & Farber, E. (2014). Elementary statistics: Picturing the world (6th ed.). Pearson.
Pearson, K. (1895). Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London. (A.), 186, 343–414. doi:10.1098/rsta.1895.0010
Spiegel, M. R., & Stephens, L. J. (2008). Schaum’s outline of theory and problems of statistics (4th ed.). McGraw-Hill.
Weisstein, E. W. (2002). CRC concise encyclopedia of mathematics (2nd ed.). Chapman & Hall/CRC.
Zedeck, S. (Ed.). (2014). APA dictionary of statistics and research methods. American Psychological Association.
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
Example 1: pandas series
>>> student_df = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/StudentStatistics.csv', sep=';', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = student_df['Gen_Age']
>>> myBins = [(0, 20), (20, 25), (25, 30), (30, 120)]
>>> me_mode_bin(ex1, bins=myBins)
mode mode fd.
0 20.0 < 25.0 4.2
Example 2: Numeric list unimodal
>>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5]
>>> myBins = [(0, 3), (3, 5), (5, 6)]
>>> me_mode_bin(ex2, bins=myBins)
mode mode fd.
0 5.0 < 6.0 6.0
Example 3: Numeric list bimodal and using midpoint
>>> ex2 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6]
>>> myBins = [(1, 3), (3, 5), (5, 7)]
>>> me_mode_bin(ex2, bins=myBins, value='midpoint')
mode mode fd.
0 2.0, 6.0 3.0