Module stikpetP.measures.meas_qv
import pandas as pd
import math
from numpy import log
def me_qv(data, measure="vr", var1=2, var2=1):
    '''
Measures of Qualitative Variation
---------------------------------
The mode is the measure of central tendency for categorical data, much as the arithmetic mean is for numeric data. As with numeric data, the center alone is not always informative. If your head is in a burning oven and your feet are in a freezer, you are on average fine.
This is one of the reasons why it is often recommended to add a measure of dispersion. It gives a clearer picture of the data and indicates how diverse the data are (how much variation there is).
For categorical data many different measures have been proposed, but I don't often see them being used. The most common one is probably the Variation Ratio, which is simply the proportion of cases that are not in the modal category.
The name for this type of measure varies quite a lot. Authors speak of dominance, differentiation, evenness, entropy, equitability, diversity, and apportionment.
I've tried to categorise the measures a bit, based on the calculations. Below is an overview of all measures available in this function.
|nr.|group|measure|source|original type|
|---|-----|-------|------|-------------|
|1|mode|Freeman Variation Ratio|(Freeman, 1965)| |
|2|mode|Berger-Parker Index|(Berger & Parker, 1970, p. 1345)|dominance|
|3|mode|Wilcox MODVR|(Wilcox, 1973, p. 7)| |
|4|mode|Wilcox RANVR|(Wilcox, 1973, p. 8)| |
|5|mean|Wilcox AVDEV|(Wilcox, 1973, p. 9)| |
|6|mean|Gibbs-Poston M4|(Gibbs & Poston, 1975, p. 473)|differentiation|
|7|mean|Gibbs-Poston M5|(Gibbs & Poston, 1975, p. 474)|differentiation|
|8|mean|Gibbs-Poston M6|(Gibbs & Poston, 1975, p. 474)|differentiation|
|9|mean|Wilcox VARNC = |(Wilcox, 1973, p. 11)| |
|9|mean|Gibbs-Poston M2 = |(Gibbs & Poston, 1975, p. 472)|differentiation|
|9|mean|Smith-Wilson E1*|(Smith & Wilson, 1996, p. 71)|evenness|
|10|mean|Wilcox STDEV|(Wilcox, 1973, p. 14)| |
|11|entropy|Shannon-Weaver Entropy|(Shannon & Weaver, 1949, p. 20)|entropy|
|12|entropy|Rényi Entropy|(Rényi, 1961, p. 549)|entropy|
|13|entropy|Wilcox HREL = |(Wilcox, 1973, p. 16)| |
|13|entropy|Pielou J|(Pielou, 1966, p. 141)|diversity|
|14|entropy|Sheldon Index|(Sheldon, 1969, p. 467)|equitability = relative diversity|
|15|entropy|Heip Evenness|(Heip, 1974, p. 555)|evenness|
|16|evenness|Hill Diversity|(Hill, 1973, p. 428)|diversity|
|17|evenness|Hill Evenness|(Hill, 1973, p. 429)|evenness|
|18|evenness|Bulla E|(Bulla, 1994, pp. 168-169)|evenness|
|19|evenness|Bulla D|(Bulla, 1994, p. 169)|diversity|
|20a|evenness|Simpson D|(Simpson, 1949, p. 688)|diversity|
|20b|evenness|Simpson D biased|(Smith & Wilson, 1996, p. 71)| |
|20c|evenness|Simpson D as diversity|(Wikipedia, n.d.)| |
|20d|evenness|Simpson D as diversity biased =|(Berger & Parker, 1970, p. 1345)| |
|20d|evenness|Gibbs-Poston M1|(Gibbs & Poston, 1975, p. 471)|differentiation|
|21|evenness|Gibbs-Poston M3|(Gibbs & Poston, 1975, p. 472)|differentiation|
|22|evenness|Smith-Wilson E2|(Smith & Wilson, 1996, p. 71)|evenness|
|23|evenness|Smith-Wilson E3|(Smith & Wilson, 1996, p. 71)|evenness|
|24|evenness|Fisher alpha|(Fisher et al., 1943, p. 55)|diversity|
|25|other|Wilcox MNDIF|(Wilcox, 1973, p. 9)| |
|26|other|Kaiser b|(Kaiser, 1968, p. 211)|apportionment|
\* Smith-Wilson E1 is listed with the mean group, since it uses the average frequency. It could of course also be placed in the evenness group.
This function is shown in this [YouTube video](https://bit.ly/47uYXPe) and the measures are also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Measures/QualitativeVariation.html)
Parameters
----------
data : list or pandas series
the categorical scores, one entry per case
measure : string, optional
which measure to calculate. Either "vr" (default), "modvr", "ranvr", "avdev", "mndif", "varnc", "stdev", "hrel", "b", "m1", "m2", "m3", "m4", "m5", "m6", "d1", "d2", "d3", "d4", "bpi", "hd", "he", "swe", "re", "sw1", "sw2", "sw3", "hi", "si", "j", "be", "bd", "fisher"
var1 : float, optional
order parameter for some measures: \\(a\\) for "hd" and "he", \\(q\\) for "re". Default is 2.
var2 : float, optional
second order parameter, \\(b\\) for "he". Default is 1.
Returns
-------
pandas.DataFrame
A dataframe with the following columns:
* *value*, the value of the requested measure
* *measure*, description of the measure calculated
* *source*, source used for calculation
Notes
-----
The following measures can be determined:
* *"modvr"*, Wilcox MODVR
* *"ranvr"*, Wilcox RANVR
* *"avdev"*, Wilcox AVDEV
* *"mndif"*, Wilcox MNDIF
* *"varnc"*, Wilcox VARNC (equal to Gibbs-Poston M2 and Smith-Wilson E1)
* *"stdev"*, Wilcox STDEV
* *"hrel"*, Wilcox HREL (equal to Pielou J)
* *"m1"*, Gibbs-Poston M1
* *"m2"*, Gibbs-Poston M2 (equal to Wilcox VARNC and Smith-Wilson E1)
* *"m3"*, Gibbs-Poston M3
* *"m4"*, Gibbs-Poston M4
* *"m5"*, Gibbs-Poston M5
* *"m6"*, Gibbs-Poston M6
* *"b"*, Kaiser b
* *"bd"*, Bulla D
* *"be"*, Bulla E
* *"bpi"*, Berger-Parker index
* *"d1"*, *"d2"*, *"d3"*, *"d4"*, Simpson D and variations
* *"hd"*, Hill Diversity, requires a value for *var1*
* *"he"*, Hill Evenness, requires a value for *var1* and *var2*
* *"hi"*, Heip Index
* *"j"*, Pielou J (equal to Wilcox HREL)
* *"si"*, Sheldon Index
* *"sw1"*, Smith & Wilson E1 (equal to Wilcox VARNC and Gibbs-Poston M2)
* *"sw2"*, Smith-Wilson E2
* *"sw3"*, Smith-Wilson E3
* *"swe"*, Shannon-Weaver Entropy
* *"re"*, Rényi entropy, requires a value for *var1*
* *"vr"*, Freeman's variation ratio
* *"fisher"*, Fisher alpha
**MODE BASED MEASURES**
Dispersion can be seen as how much variation there is, using the center as the norm. For nominal data the measure of central tendency is the mode, and therefore some measures of qualitative variation use the mode as the starting point.
The frequency of the modal category is then useful. This is simply the maximum of the frequencies.
**Freeman Variation Ratio** ("vr")
Perhaps the most popular measure of qualitative variation is the (Freeman) Variation Ratio. It is simply the proportion of scores that do not belong to the modal category.
The formula used is (Freeman, 1965, p. 41):
$$v = 1 - \\frac{F_{mode}}{n}$$
A value of 0 (0%) means that all cases are in the modal category and every other category is empty. A value of 1 (100%) would mean that no cases are in the modal category, which cannot occur: the modal category has the highest frequency, so its frequency can only be 0 if there are no cases at all.
**Berger–Parker index** ("bpi")
The Berger-Parker index is the complement of the variation ratio: it is the proportion of scores that do fall in the modal category. In formula notation (Berger & Parker, 1970, p. 1345):
$$BPI = \\frac{F_{mode}}{n}$$
Berger and Parker refer to this as a dominance measure, to indicate how "dominant" the modal category is.
A value of 1 (100%) means that all cases are in the modal category. A value of 0 (0%) would mean that no cases are in the modal category, which cannot occur: the modal category has the highest frequency, so its frequency can only be 0 if there are no cases at all.
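As a rough illustration of the two mode-based proportions above, the sketch below computes both from a small made-up list. The helper name `modal_proportion` and the sample data are assumptions for this example, not part of the package.

```python
from collections import Counter

def modal_proportion(values):
    # hypothetical helper: share of cases in the most frequent category
    counts = Counter(values)
    return max(counts.values()) / len(values)

sample = ["a", "a", "a", "b", "b", "c"]  # made-up data, mode "a" (3 of 6)
bpi = modal_proportion(sample)           # Berger-Parker index
vr = 1 - bpi                             # Freeman variation ratio
```

Because the modal frequency is at least n/k, the variation ratio cannot exceed 1 - 1/k.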
**Wilcox MODVR** ("modvr")
This looks at the difference between the modal frequency and the frequency of each category. The sum of these differences is divided by \\(n\\times \\left(k -1\\right)\\) to standardize the result to the range 0 to 1.
It is a modification of the Freeman Variation Ratio, hence the name MODVR. Wilcox noted that the Freeman VR can never reach the maximum value of 1.
The formula used is (Wilcox, 1973, p. 7):
$$\\text{MODVR} = \\frac{\\sum_{i=1}^k \\left(F_{mode} - F_i\\right)}{n\\times \\left(k - 1\\right)} = \\frac{k\\times F_{mode}-n}{n\\times \\left(k - 1\\right)}$$
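The two forms of the MODVR formula above can be checked against each other numerically. This is only a sketch; the counts are illustrative and the variable names are made up for this example.

```python
import math

freqs = [7, 6, 4, 2]              # illustrative category counts
n, k = sum(freqs), len(freqs)
f_mode = max(freqs)

# sum of differences with the modal frequency, standardized
modvr_sum = sum(f_mode - f for f in freqs) / (n * (k - 1))
# closed form: the sum equals k * F_mode - n
modvr_short = (k * f_mode - n) / (n * (k - 1))

same = math.isclose(modvr_sum, modvr_short)
```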
**Wilcox RANVR** ("ranvr")
Short for 'range variation ratio', this measure is very similar to Freeman's VR. Instead of looking only at the modal frequency, it looks at the range of the frequencies.
The formula used is (Wilcox, 1973, p. 8):
$$\\text{RANVR} = 1 - \\frac{F_{mode} - F_{min}}{F_{mode}}$$
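A small sketch of the RANVR formula above, with illustrative counts. It also shows that the expression reduces to the lowest frequency divided by the modal frequency.

```python
freqs = [7, 6, 4, 2]                      # illustrative category counts
f_mode, f_min = max(freqs), min(freqs)

ranvr = 1 - (f_mode - f_min) / f_mode     # formula as written above
ratio = f_min / f_mode                    # algebraically the same value
```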
**MEAN BASED MEASURES**
The following measures use the average count to determine the variation, i.e.
$$\\bar{F} = \\frac{\\sum_{i=1}^k F_i}{k} = \\frac{n}{k}$$
**Wilcox AVDEV** ("avdev")
This is the analogue of the mean absolute deviation, applied to the frequencies and then standardized.
The formula used is (Wilcox, 1973, p. 9):
$$\\text{AVDEV} = 1-\\frac{\\sum_{i=1}^k \\left|F_i-\\bar{F}\\right|}{2\\times \\frac{n}{k}\\times \\left(k-1\\right)}= 1-\\frac{k\\times \\sum_{i=1}^k \\left|F_i-\\bar{F}\\right|}{2\\times n \\times \\left(k-1\\right)}$$
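A minimal sketch, with illustrative counts, that evaluates both forms of the AVDEV formula above and confirms they agree.

```python
import math

freqs = [7, 6, 4, 2]                       # illustrative category counts
n, k = sum(freqs), len(freqs)
f_bar = n / k                              # average count, 4.75 here
abs_dev = sum(abs(f - f_bar) for f in freqs)

avdev_a = 1 - abs_dev / (2 * f_bar * (k - 1))   # first form
avdev_b = 1 - k * abs_dev / (2 * n * (k - 1))   # second form
```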
**Gibbs-Poston M4** ("m4")
The formula used (Gibbs & Poston, 1975, p. 473):
$$\\text{M4} = 1-\\frac{\\sum_{i=1}^k \\left|F_i-\\bar{F}\\right|}{2\\times n}$$
**Gibbs-Poston M5** ("m5")
A problem with M4 is that it can never be 0. M5 adjusts for this, at the cost of a slightly more involved computation.
The formula used (Gibbs & Poston, 1975, p. 474):
$$\\text{M5} = 1-\\frac{\\sum_{i=1}^k \\left|F_i-\\bar{F}\\right|}{2\\times\\left(n-k+1-\\bar{F}\\right)}$$
**Gibbs-Poston M6** ("m6")
The formula used (Gibbs & Poston, 1975, p. 474):
$$\\text{M6} = k\\times\\left(1-\\frac{\\sum_{i=1}^k \\left|F_i-\\bar{F}\\right|}{2\\times n}\\right) = k\\times\\text{M4}$$
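The three Gibbs-Poston formulas above share the same sum of absolute deviations, so they can be sketched together. The helper name and the counts are assumptions for this example. The second call uses the most uneven split that leaves no category empty (one large count, all others 1), for which M5 reaches 0 while M4 does not.

```python
def gibbs_poston_m456(freqs):
    # hypothetical helper returning (M4, M5, M6) for a list of counts
    n, k = sum(freqs), len(freqs)
    f_bar = n / k
    abs_dev = sum(abs(f - f_bar) for f in freqs)
    m4 = 1 - abs_dev / (2 * n)
    m5 = 1 - abs_dev / (2 * (n - k + 1 - f_bar))
    m6 = k * m4
    return m4, m5, m6

m4, m5, m6 = gibbs_poston_m456([7, 6, 4, 2])
m4_low, m5_low, _ = gibbs_poston_m456([16, 1, 1, 1])
```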
**Wilcox VARNC** ("varnc"), **Gibbs-Poston M2** ("m2"), and **Smith & Wilson E1** ("sw1")
This is the analogue of the variance for scale variables.
The formula used is (Wilcox, 1973, p. 11):
$$\\text{VARNC} = 1-\\frac{\\sum_{i=1}^{k}\\left(F_i-\\bar{F}\\right)^2}{\\frac{n^2\\times\\left(k-1\\right)}{k}} = \\frac{k\\times\\left(n^2-\\sum_{i=1}^k F_i^2\\right)}{n^2\\times\\left(k-1\\right)}$$
This is the same as Gibbs and Poston's **M2** ("m2"). Their formula looks different but gives the same result (Gibbs & Poston, 1975, p. 472):
$$\\text{M2} = \\frac{1-\\sum_{i=1}^k p_i^2}{1-\\frac{1}{k}} = \\frac{\\text{M1}}{1-\\frac{1}{k}} = \\frac{k}{k-1}\\times\\text{M1}$$
It is also the same as Smith and Wilson's first evenness measure ("sw1").
The formula used (Smith & Wilson, 1996, p. 71):
$$E_1 = \\frac{1 - D_s}{1 - \\frac{1}{k}}$$
With \\(D_s\\) being Simpson's D, but defined as:
$$D_s = \\sum_{i=1}^k\\left(\\frac{F_i}{n}\\right)^2$$
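That the three formulas above give the same value can be checked numerically. The sketch below uses illustrative counts; the variable names are made up for this example.

```python
import math

freqs = [7, 6, 4, 2]                       # illustrative category counts
n, k = sum(freqs), len(freqs)
f_bar = n / k
props = [f / n for f in freqs]

# Wilcox VARNC, from squared deviations of the counts
varnc = 1 - sum((f - f_bar) ** 2 for f in freqs) / (n ** 2 * (k - 1) / k)
# Gibbs-Poston M2, from M1
m1 = 1 - sum(p ** 2 for p in props)
m2 = m1 / (1 - 1 / k)
# Smith-Wilson E1, from the biased Simpson D
d_s = sum(p ** 2 for p in props)
e1 = (1 - d_s) / (1 - 1 / k)
```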
**Wilcox STDEV** ("stdev")
As with the variance for scale variables, we can take the square root to obtain the standard deviation.
The formula can be written with deviations from the mean, as in VARNC, or with pairwise differences, as in MNDIF (Wilcox, 1973, p. 14):
$$\\text{STDEV} = 1-\\sqrt{\\frac{\\sum_{i=1}^k \\left(F_i-\\bar{F}\\right)^2}{\\left(n-\\bar{F}\\right)^2+\\left(k-1\\right)\\bar{F}^2}}= 1-\\sqrt{\\frac{\\sum_{i=1}^{k-1}\\sum_{j=i+1}^k \\left(F_i-F_j\\right)^2}{n^2\\times\\left(k-1\\right)}}$$
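A sketch that evaluates both forms of the STDEV formula above on illustrative counts. The pairwise form uses `itertools.combinations` to visit each pair once.

```python
import math
from itertools import combinations

freqs = [7, 6, 4, 2]                       # illustrative category counts
n, k = sum(freqs), len(freqs)
f_bar = n / k

# form 1: squared deviations from the average count
ss_mean = sum((f - f_bar) ** 2 for f in freqs)
stdev_a = 1 - math.sqrt(ss_mean / ((n - f_bar) ** 2 + (k - 1) * f_bar ** 2))

# form 2: squared pairwise differences
ss_pairs = sum((a - b) ** 2 for a, b in combinations(freqs, 2))
stdev_b = 1 - math.sqrt(ss_pairs / (n ** 2 * (k - 1)))
```

Both forms equal 1 minus the square root of (1 - VARNC).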
**ENTROPY**
Entropy is sometimes described as the expected value of the surprise. It tells us how surprised we might be, on average, about the outcome, and it is also used as a measure of variation for qualitative data.
I enjoyed the simple explanation of entropy from StatQuest; their video is available [here](https://www.youtube.com/watch?v=YtebGVx-Fxw).
Entropy measures work with the proportions \\(p_i = F_i / n\\) rather than the counts themselves.
**Shannon-Weaver Entropy** ("swe")
The formula used (Shannon & Weaver, 1949, p. 20):
$$H_{sw}=-\\sum_{i=1}^k p_i\\times\\ln\\left(p_i\\right)$$
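A minimal sketch of the entropy formula above. The helper name is an assumption. The counts 7, 6, 4 and 2 are those of the list in Example 2 (divorced, married, never married, separated), so the result should be close to the 1.296892 reported there.

```python
import math

def shannon_entropy(freqs):
    # hypothetical helper; natural logarithm, zero counts are skipped
    n = sum(freqs)
    return -sum((f / n) * math.log(f / n) for f in freqs if f > 0)

h = shannon_entropy([7, 6, 4, 2])
h_max = shannon_entropy([5, 5, 5, 5])   # equal counts give ln(k)
```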
**Rényi entropy** ("re")
This is a generalisation of Shannon entropy.
The formula used is (Rényi, 1961, p. 549):
$$H_q = \\frac{1}{1 - q}\\times\\log_2\\left(\\sum_{i=1}^k p_i^q\\right)$$
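The formula above is undefined at q = 1, but as q approaches 1 it tends to the Shannon entropy expressed in bits (base-2 logarithm). The sketch below illustrates this with illustrative proportions; the helper name is an assumption.

```python
import math

def renyi_entropy(props, q):
    # hypothetical helper following the formula above; q must not equal 1
    return math.log2(sum(p ** q for p in props)) / (1 - q)

props = [7 / 19, 6 / 19, 4 / 19, 2 / 19]   # illustrative proportions
h2 = renyi_entropy(props, 2)
h_near_1 = renyi_entropy(props, 1.000001)
shannon_bits = -sum(p * math.log2(p) for p in props)
```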
**Wilcox HREL** ("hrel") and **Pielou J** ("j")
This uses Shannon's entropy but divides it by the maximum possible uncertainty.
The formula used (Wilcox, 1973, p. 16):
$$\\text{HREL} = \\frac{-\\sum_{i=1}^k p_i \\times \\log_2 p_i}{\\log_2 k}$$
This is the same as Pielou J ("j"): the base of the logarithm cancels in the ratio, so using the natural logarithm gives the same value.
The formula used (Pielou, 1966, p. 141):
$$J=\\frac{H_{sw}}{\\ln\\left(k\\right)}$$
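A short numerical check, on illustrative proportions, that the base-2 and natural-logarithm versions above coincide.

```python
import math

props = [7 / 19, 6 / 19, 4 / 19, 2 / 19]   # illustrative proportions
k = len(props)

hrel = -sum(p * math.log2(p) for p in props) / math.log2(k)   # Wilcox HREL
j = -sum(p * math.log(p) for p in props) / math.log(k)        # Pielou J
```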
**Sheldon Index** ("si")
The formula used (Sheldon, 1969, p. 467):
$$E = \\frac{e^{H_{sw}}}{k}$$
**Heip Index** ("hi")
The formula used is (Heip, 1974, p. 555):
$$E_h = \\frac{e^{H_{sw}} - 1}{k - 1}$$
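The Sheldon and Heip formulas above both start from the exponential of the Shannon entropy. A minimal sketch, with a hypothetical helper and illustrative counts; with equal counts both indices come out as 1.

```python
import math

def sheldon_heip(freqs):
    # hypothetical helper returning (Sheldon index, Heip index)
    n, k = sum(freqs), len(freqs)
    h = -sum((f / n) * math.log(f / n) for f in freqs)
    sheldon = math.exp(h) / k
    heip = (math.exp(h) - 1) / (k - 1)
    return sheldon, heip

sheldon, heip = sheldon_heip([7, 6, 4, 2])
even_sheldon, even_heip = sheldon_heip([5, 5, 5, 5])
```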
**EVENNESS and DIVERSITY**
**Hill Diversity** ("hd")
The formula used is (Hill, 1973, p. 428):
$$N_a = \\begin{cases}\\left(\\sum_{i=1}^k p_i^a\\right)^{\\frac{1}{1-a}} & \\text{ if } a\\neq 1 \\\\ e^{H_{sw}} & \\text{ if } a = 1 \\end{cases}$$
**Hill Evenness** ("he")
The formula used is (Hill, 1973, p. 429):
$$E_{a,b} = \\frac{N_a}{N_b}$$
Where \\(N_a\\) and \\(N_b\\) are Hill's diversity values for a and b.
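A sketch of Hill's diversity numbers and the evenness ratio, following the formulas above. The helper name and proportions are assumptions. With the function defaults (var1=2, var2=1) the evenness corresponds to N_2 divided by N_1.

```python
import math

def hill_number(props, a):
    # hypothetical helper for Hill's N_a
    if a == 1:
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** a for p in props) ** (1 / (1 - a))

props = [7 / 19, 6 / 19, 4 / 19, 2 / 19]   # illustrative proportions
n0 = hill_number(props, 0)   # number of categories
n1 = hill_number(props, 1)   # exponential of the Shannon entropy
n2 = hill_number(props, 2)   # reciprocal of the sum of squared proportions
e_21 = n2 / n1               # Hill evenness for a = 2, b = 1
```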
**Bulla E** ("be")
Bulla's evenness measure.
The formula used is (Bulla, 1994, pp. 168-169):
$$E_b = \\frac{O - \\frac{1}{k} - \\frac{k - 1}{n}}{1 - \\frac{1}{k} - \\frac{k - 1}{n}}$$
With:
$$O = \\sum_{i=1}^k \\min\\left(p_i, \\frac{1}{k}\\right)$$
**Bulla D** ("bd")
Bulla's Evenness measure converted to a diversity measure.
The formula used is (Bulla, 1994, p. 169):
$$D_b = E_b\\times k$$
Where \\(E_b\\) is the Bulla E value, with:
$$O = \\sum_{i=1}^k \\min\\left(p_i, \\frac{1}{k}\\right)$$
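A minimal sketch of Bulla E and Bulla D as given in the formulas above. The helper name and counts are assumptions. The term 1/k + (k - 1)/n is the smallest value O can take when no category is empty, so E runs from 0 (one large category, all others with a single case) to 1 (equal counts).

```python
def bulla(freqs):
    # hypothetical helper returning (E_b, D_b), following the formulas above
    n, k = sum(freqs), len(freqs)
    o = sum(min(f / n, 1 / k) for f in freqs)
    floor = 1 / k + (k - 1) / n   # smallest possible O without empty categories
    e_b = (o - floor) / (1 - floor)
    return e_b, e_b * k

e_b, d_b = bulla([7, 6, 4, 2])
e_even, _ = bulla([5, 5, 5, 5])
e_low, _ = bulla([16, 1, 1, 1])
```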
**Simpson D** ("d1", "d2", "d3", "d4" = Gibbs-Poston M1)
The formula used is based on Simpson (1949, p. 688):
$$D_1 = \\frac{\\sum_{i=1}^k F_i\\times\\left(F_i-1\\right)}{n\\times\\left(n-1\\right)}$$
An alternative, for a population, is:
$$D_2 = \\sum_{i=1}^k\\left(\\frac{F_i}{n}\\right)^2$$
Often the result is subtracted from 1 to reverse the scale:
$$D_3 = 1-\\frac{\\sum_{i=1}^k F_i\\times\\left(F_i-1\\right)}{n\\times\\left(n-1\\right)}$$
and
$$D_4 = 1 - \\sum_{i=1}^k\\left(\\frac{F_i}{n}\\right)^2$$
This last one is the same as Gibbs-Poston M1 (Gibbs & Poston, 1975, p. 471):
$$\\text{M1} = 1 - \\sum_{i=1}^k p_i^2$$
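The four Simpson variants above can be sketched side by side. The counts are illustrative. Since the sum of squared counts equals the sum of F(F - 1) plus n, the two base versions are linked by D2 = (D1 (n - 1) + 1) / n.

```python
import math

freqs = [7, 6, 4, 2]                       # illustrative category counts
n = sum(freqs)

d1 = sum(f * (f - 1) for f in freqs) / (n * (n - 1))   # sample version
d2 = sum((f / n) ** 2 for f in freqs)                  # population version
d3 = 1 - d1                                            # reversed scale
d4 = 1 - d2                                            # same as Gibbs-Poston M1
```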
**Gibbs-Poston M3** ("m3")
The formula used (Gibbs & Poston, 1975, p. 472):
$$\\text{M3} = \\frac{1-\\sum_{i=1}^k p_i^2-p_{min}}{1-\\frac{1}{k}-p_{min}}$$
With \\(p_{min}\\) the lowest proportion.
**Smith & Wilson E2** ("sw2")
The formula used (Smith & Wilson, 1996, p. 71):
$$E_2 = \\frac{-\\ln\\left(D_s\\right)}{\\ln\\left(k\\right)}$$
With \\(D_s\\) being Simpson's D, but defined as:
$$D_s = \\sum_{i=1}^k\\left(\\frac{F_i}{n}\\right)^2$$
**Smith & Wilson E3** ("sw3")
The formula used (Smith & Wilson, 1996, p. 71):
$$E_3 = \\frac{1}{D_s \\times k}$$
With \\(D_s\\) being Simpson's D, but defined as:
$$D_s = \\sum_{i=1}^k\\left(\\frac{F_i}{n}\\right)^2$$
**Fisher alpha** ("fisher")
The formula used (Fisher et al., 1943, p. 55):
$$k = \\alpha \\times \\ln\\left(1 + \\frac{n}{\\alpha}\\right)$$
The function uses a simple binary search to find the value for \\(\\alpha\\) such that the result of the above formula will produce the number of categories ( \\(k\\) ).
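The idea of searching for alpha can be sketched with a plain bisection. This is not the search routine used by the function itself, only an illustration under the assumption that 0 < k < n, in which case the right-hand side of the formula increases with alpha and a single root exists. The helper name and tolerances are made up for this example.

```python
import math

def fisher_alpha(n, k, tol=1e-10, max_iter=200):
    # hypothetical helper: solves k = alpha * ln(1 + n / alpha) by bisection
    def richness(alpha):
        return alpha * math.log(1 + n / alpha)

    lo, hi = 1e-9, 1.0
    while richness(hi) < k:        # widen the bracket until it holds the root
        hi *= 2
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        if richness(mid) < k:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2

alpha = fisher_alpha(n=19, k=4)
```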
**OTHER**
**Wilcox MNDIF** ("mndif")
Analog of the mean difference measure for scale variables.
The formula used is (Wilcox, 1973, p. 9):
$$\\text{MNDIF} = 1-\\frac{\\sum_{i=1}^{k-1}\\sum_{j=i+1}^k \\left|F_i-F_j\\right|}{n\\times\\left(k-1\\right)}$$
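A small sketch of the MNDIF formula above, with illustrative counts. `itertools.combinations` yields each pair of categories once, which matches the double sum over i < j.

```python
from itertools import combinations

freqs = [7, 6, 4, 2]                       # illustrative category counts
n, k = sum(freqs), len(freqs)

pair_diffs = sum(abs(a - b) for a, b in combinations(freqs, 2))
mndif = 1 - pair_diffs / (n * (k - 1))
```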
**Kaiser b** ("b")
The formula used (Kaiser, 1968, p. 211):
$$B = 1 - \\sqrt{1 - \\left(\\sqrt[k]{\\prod_{i=1}^k\\frac{F_i\\times k}{n}}\\right)^2}$$
Kaiser also provides rules-of-thumb for interpretation. See **th_kaiser_b()** for more details on this.
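A minimal sketch of the Kaiser b formula above. The helper name and counts are assumptions. The k-th root of the product is the geometric mean of the counts relative to the average count, which is 1 for equal counts, so b is then 1.

```python
import math

def kaiser_b(freqs):
    # hypothetical helper following the formula above
    n, k = sum(freqs), len(freqs)
    geo_mean = math.prod(f * k / n for f in freqs) ** (1 / k)
    # max() guards against tiny negative values caused by rounding
    return 1 - math.sqrt(max(0.0, 1 - geo_mean ** 2))

b = kaiser_b([7, 6, 4, 2])
b_even = kaiser_b([5, 5, 5, 5])
```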
Before, After and Alternatives
------------------------------
Before this, a frequency table or a visualisation might help to get an impression of the data:
* [tab_frequency](../other/table_frequency.html#tab_frequency)
* [vi_bar_simple](../visualisations/vis_bar_simple.html#vi_bar_simple) for Simple Bar Chart
* [vi_cleveland_dot_plot](../visualisations/vis_cleveland_dot_plot.html#vi_cleveland_dot_plot) for Cleveland Dot Plot
* [vi_dot_plot](../visualisations/vis_dot_plot.html#vi_dot_plot) for Dot Plot
* [vi_pareto_chart](../visualisations/vis_pareto_chart.html#vi_pareto_chart) for Pareto Chart
* [vi_pie](../visualisations/vis_pie.html#vi_pie) for Pie Chart
After this you might want to perform a test:
* [ts_pearson_gof](../tests/test_pearson_gof.html#ts_pearson_gof) for Pearson Chi-Square Goodness-of-Fit Test
* [ts_freeman_tukey_gof](../tests/test_freeman_tukey_gof.html#ts_freeman_tukey_gof) for Freeman-Tukey Test of Goodness-of-Fit
* [ts_freeman_tukey_read](../tests/test_freeman_tukey_read.html#ts_freeman_tukey_read) for Freeman-Tukey-Read Test of Goodness-of-Fit
* [ts_g_gof](../tests/test_g_gof.html#ts_g_gof) for G (Likelihood Ratio) Goodness-of-Fit Test
* [ts_mod_log_likelihood_gof](../tests/test_mod_log_likelihood_gof.html#ts_mod_log_likelihood_gof) for Mod-Log Likelihood Test of Goodness-of-Fit
* [ts_multinomial_gof](../tests/test_multinomial_gof.html#ts_multinomial_gof) for Multinomial Goodness-of-Fit Test
* [ts_neyman_gof](../tests/test_neyman_gof.html#ts_neyman_gof) for Neyman Test of Goodness-of-Fit
* [ts_powerdivergence_gof](../tests/test_powerdivergence_gof.html#ts_powerdivergence_gof) for Power Divergence GoF Test
References
----------
Berger, W. H., & Parker, F. L. (1970). Diversity of planktonic foraminifera in deep-sea sediments. *Science, 168*(3937), 1345–1347. doi:10.1126/science.168.3937.1345
Bulla, L. (1994). An index of evenness and its associated diversity measure. *Oikos, 70*(1), 167–171. doi:10.2307/3545713
Fisher, R. A., Corbet, A. S., & Williams, C. B. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. *The Journal of Animal Ecology, 12*(1), 42–58. doi:10.2307/1411
Freeman, L. C. (1965). *Elementary applied statistics: For students in behavioral science*. Wiley.
Gibbs, J. P., & Poston, D. L. (1975). The division of labor: Conceptualization and related measures. *Social Forces, 53*(3), 468. doi:10.2307/2576589
Heip, C. (1974). A new index measuring evenness. *Journal of the Marine Biological Association of the United Kingdom, 54*(3), 555–557. doi:10.1017/S0025315400022736
Hill, M. O. (1973). Diversity and evenness: A unifying notation and its consequences. *Ecology, 54*(2), 427–432. doi:10.2307/1934352
Kaiser, H. F. (1968). A measure of the population quality of legislative apportionment. *American Political Science Review, 62*(1), 208–215. doi:10.2307/1953335
Pielou, E. C. (1966). The measurement of diversity in different types of biological collections. *Journal of Theoretical Biology, 13*, 131–144. doi:10.1016/0022-5193(66)90013-0
Rényi, A. (1961). On measures of entropy and information. *Contributions to the Theory of Statistics, 1*, 547–562.
Shannon, C. E., & Weaver, W. (1949). *The mathematical theory of communication*. The University of Illinois Press.
Sheldon, A. L. (1969). Equitability indices: Dependence on the species count. *Ecology, 50*(3), 466–467. doi:10.2307/1933900
Simpson, E. H. (1949). Measurement of diversity. *Nature, 163*(4148), 688. doi:10.1038/163688a0
Smith, B., & Wilson, J. B. (1996). A consumer’s guide to evenness indices. *Oikos, 76*(1), 70–82. doi:10.2307/3545749
Wilcox, A. R. (1973). Indices of qualitative variation and political measurement. *Political Research Quarterly, 26*(2), 325–343. doi:10.1177/106591297302600209
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
--------
Example 1: pandas series
>>> df1 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv', sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df1['mar1']
>>> me_qv(ex1)
value measure source
0 0.499227 Freeman Variation Ratio (Freeman, 1965)
Example 2: a list
>>> ex2 = ["MARRIED", "DIVORCED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "NEVER MARRIED", "MARRIED", "MARRIED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "MARRIED"]
>>> me_qv(ex2, "swe")
value measure source
0 1.296892 Shannon-Weaver Entropy (Shannon & Weaver, 1949, p. 20)
    '''
    if isinstance(data, list):
        data = pd.Series(data)

    # frequencies per category, number of categories, sample size, modal frequency
    freqs = data.value_counts().values
    k = len(freqs)
    n = sum(freqs)
    fm = max(freqs)
    props = freqs/n

    if measure=="modvr":
        #Modified Variation Ratio
        src = "(Wilcox, 1973, p. 7)"
        lbl = "Wilcox MODVR"
        qv = sum(fm - freqs)/(n*(k - 1))
    elif measure=="ranvr":
        #Range Variation Ratio
        src = "(Wilcox, 1973, p. 8)"
        lbl = "Wilcox RANVR"
        fl = min(freqs)
        qv = 1 - (fm - fl)/fm
    elif measure=="avdev":
        #Average Deviation
        src = "(Wilcox, 1973, p. 9)"
        lbl = "Wilcox AVDEV"
        qv = 1-sum(abs(freqs-n/k)) / (2*n/k*(k-1))
    elif measure=="mndif":
        #MNDif
        src = "(Wilcox, 1973, p. 9)"
        lbl = "Wilcox MNDIF"
        mndif = 0
        for i in range(0, k-1):
            for j in range(i+1,k):
                mndif = mndif + abs(freqs[i]-freqs[j])
        qv = 1 - mndif/(n*(k-1))
    elif measure=="varnc":
        #VarNC
        src = "(Wilcox, 1973, p. 11)"
        lbl = "Wilcox VARNC"
        qv = 1 - sum((freqs-n/k)**2)/(n**2*(k-1)/k)
    elif measure=="stdev":
        src = "(Wilcox, 1973, p. 14)"
        lbl = "Wilcox STDEV"
        qv = 1 - (sum((freqs-n/k)**2)/((n-n/k)**2+(k-1)*(n/k)**2))**0.5
    elif measure=="hrel":
        #HRel
        src = "(Wilcox, 1973, p. 16)"
        lbl = "Wilcox HREL"
        hrel = 0
        for i in range(k):
            hrel = hrel + props[i]*math.log2(props[i])
        qv = -hrel/math.log2(k)
    elif measure=="m1":
        src = "(Gibbs & Poston, 1975, p. 471)"
        lbl = "Gibbs-Poston M1"
        qv = 1 - sum(props**2)
    elif measure=="m2":
        #equal to varnc
        src = "(Gibbs & Poston, 1975, p. 472)"
        lbl = "Gibbs-Poston M2"
        qv = (1 - sum(props**2)) / (1-1/k)
    elif measure=="m3":
        src = "(Gibbs & Poston, 1975, p. 472)"
        lbl = "Gibbs-Poston M3"
        pl = min(props)
        qv = (1 - sum(props**2) - pl) / (1-1/k - pl)
    elif measure=="m4":
        src = "(Gibbs & Poston, 1975, p. 473)"
        lbl = "Gibbs-Poston M4"
        fmean = n/k
        qv = 1 - sum(abs(freqs-fmean))/(2*n)
    elif measure=="m5":
        src = "(Gibbs & Poston, 1975, p. 474)"
        lbl = "Gibbs-Poston M5"
        fmean = n/k
        qv = 1 - sum(abs(freqs-fmean))/(2*(n-k+1-fmean))
    elif measure=="m6":
        src = "(Gibbs & Poston, 1975, p. 474)"
        lbl = "Gibbs-Poston M6"
        fmean = n/k
        qv = k*(1 - sum(abs(freqs-fmean))/(2*n))
    elif measure=="b":
        #Kaiser B index
        src = "(Kaiser, 1968, p. 211)"
        lbl = "Kaiser b"
        qv = 1 - (1 - ((math.prod(freqs*k/n))**(1/k))**2)**0.5
    elif measure=="bd":
        #Bulla D
        src = "(Bulla, 1994, p. 169)"
        lbl = "Bulla D"
        o = 0
        for p in props:
            o = o + min(p, 1/k)
        #signs follow the formula in the docstring: (k - 1)/n is subtracted
        qv = k*(o - 1/k - (k - 1)/n)/(1 - 1/k - (k - 1)/n)
    elif measure=="be":
        #Bulla E
        src = "(Bulla, 1994, pp. 168-169)"
        lbl = "Bulla E"
        o = 0
        for p in props:
            o = o + min(p, 1/k)
        #signs follow the formula in the docstring: (k - 1)/n is subtracted
        qv = (o - 1/k - (k - 1)/n)/(1 - 1/k - (k - 1)/n)
    elif measure=="bpi":
        #Berger-Parker Index
        src = "(Berger & Parker, 1970, p. 1345)"
        lbl = "Berger-Parker D"
        qv = fm/n
    elif measure=="d1":
        #Simpson's D
        src = "(Simpson, 1949, p. 688)"
        lbl = "Simpson D"
        qv = sum(freqs*(freqs-1))/(n*(n-1))
    elif measure=="d2":
        #Simpson's D, biased version
        src = "(Smith & Wilson, 1996, p. 71)"
        lbl = "Simpson D biased"
        qv = sum((freqs/n)**2)
    elif measure=="d3":
        #Simpson's D, reversed scale
        src = "(Wikipedia, n.d.)"
        lbl = "Simpson D as diversity"
        qv = 1 - sum(freqs*(freqs-1))/(n*(n-1))
    elif measure=="d4":
        #Simpson's D, biased version with reversed scale
        src = "(Berger & Parker, 1970, p. 1345)"
        lbl = "Simpson D as diversity biased"
        qv = 1 - sum((freqs/n)**2)
    elif measure=="hd":
        #Hill's Diversity
        src = "(Hill, 1973, p. 428)"
        lbl = "Hill Diversity"
        if var1 == 1:
            qv = math.exp(-1*sum(props*log(props)))
        else:
            qv = (sum(props**var1)**(1/(1-var1)))
    elif measure=="he":
        #Hill's Evenness
        src = "(Hill, 1973, p. 429)"
        lbl = "Hill Evenness"
        qv = me_qv(data, measure="hd", var1=var1)['value']/me_qv(data, measure="hd", var1=var2)['value']
        qv = qv.values
    elif measure=="hi":
        #Heip Index
        src = "(Heip, 1974, p. 555)"
        lbl = "Heip Evenness"
        h = -1*sum(props*log(props))
        qv = (math.exp(h) - 1)/(k - 1)
    elif measure=="j":
        #Pielou J
        src = "(Pielou, 1966, p. 141)"
        lbl = "Pielou J"
        h = -1*sum(props*log(props))
        qv = h/log(k)
    elif measure=="si":
        #Sheldon Index
        src = "(Sheldon, 1969, p. 467)"
        lbl = "Sheldon Evenness"
        h = -1*sum(props*log(props))
        qv = math.exp(h)/k
    elif measure=="sw1":
        #Smith and Wilson Index 1
        src = "(Smith & Wilson, 1996, p. 71)"
        lbl = "Smith-Wilson Evenness Index 1"
        d = sum(props**2)
        qv = (1 - d)/(1 - 1/k)
    elif measure=="sw2":
        #Smith and Wilson Index 2
        src = "(Smith & Wilson, 1996, p. 71)"
        lbl = "Smith-Wilson Evenness Index 2"
        d = sum(props**2)
        qv = -log(d)/log(k)
    elif measure=="sw3":
        #Smith and Wilson Index 3
        src = "(Smith & Wilson, 1996, p. 71)"
        lbl = "Smith-Wilson Evenness Index 3"
        d = sum(props**2)
        qv = 1/(d*k)
    elif measure=="swe":
        #Shannon-Weaver Entropy
        src = "(Shannon & Weaver, 1949, p. 20)"
        lbl = "Shannon-Weaver Entropy"
        qv = -1*sum(props*log(props))
    elif measure=="re":
        #Rényi Entropy
        src = "(Rényi, 1961, p. 549)"
        lbl = "Rényi Entropy"
        qv = 1/(1 - var1)*math.log2(sum(props**var1))
    elif measure=="vr":
        #Variation Ratio
        src = "(Freeman, 1965)"
        lbl = "Freeman Variation Ratio"
        pm = fm/n
        qv = 1 - pm
    elif measure=="fisher":
        #Fisher alpha: search for a such that a*ln(1 + n/a) equals k
        src ="(Fisher et al., 1943, p. 55)"
        lbl = "Fisher alpha"
        maxIter=100
        a1 = 1
        k1 = a1 * log(1 + n/a1)
        if k1 != k:
            if k1 > k:
                a2 = 0.5
            else:
                a2 = 2
            k2 = a2 * log(1 + n / a2)
            if k2 != k:
                k3 = k2
                iters = 0
                while iters < maxIter and k3 != k:
                    iters = iters + 1
                    if k2 > k:
                        if k1 > k:
                            a3 = a2 - abs(a2 - a1)
                        else:
                            a3 = a2 - abs(a2 - a1) / 2
                    else:
                        if k1 < k:
                            a3 = a2 + abs(a2 - a1)
                        else:
                            a3 = a2 + abs(a2 - a1) / 2
                    if a3 == 0:
                        a3 = a2 - abs(a2 - a1) / 2
                    k3 = a3 * log(1 + n / a3)
                    a1 = a2
                    a2 = a3
                    k1 = k2
                    k2 = k3
            else:
                a3 = a2
        else:
            a3 = a1
        qv = a3
    else:
        #without this branch an unknown code would fail later with an unbound variable
        raise ValueError("unknown measure: " + str(measure))

    results = pd.DataFrame([[qv, lbl, src]], columns=["value", "measure", "source"])
    pd.set_option('display.max_colwidth', None)
    return (results)
Functions
def me_qv(data, measure='vr', var1=2, var2=1)
Measures Of Qualitative Variation
The mode is the measure of central tendency for categorical data, much as the arithmetic mean is for numeric data. As with numeric data, the center alone is not always informative. If your head is in a burning oven and your feet are in a freezer, you are on average fine.
This is one of the reasons why it is often recommended to add a measure of dispersion. It gives a clearer picture of the data and indicates how diverse the data are (how much variation there is).
For categorical data many different measures have been proposed, but I don't often see them being used. The most common one is probably the Variation Ratio, which is simply the proportion of cases that are not in the modal category.
The name for this type of measure varies quite a lot. Authors speak of dominance, differentiation, evenness, entropy, equitability, diversity, and apportionment.
I've tried to categorise the measures a bit, based on the calculations. Below is an overview of all measures available in this function.
|nr.|group|measure|source|original type|
|---|-----|-------|------|-------------|
|1|mode|Freeman Variation Ratio|(Freeman, 1965)| |
|2|mode|Berger-Parker Index|(Berger & Parker, 1970, p. 1345)|dominance|
|3|mode|Wilcox MODVR|(Wilcox, 1973, p. 7)| |
|4|mode|Wilcox RANVR|(Wilcox, 1973, p. 8)| |
|5|mean|Wilcox AVDEV|(Wilcox, 1973, p. 9)| |
|6|mean|Gibbs-Poston M4|(Gibbs & Poston, 1975, p. 473)|differentiation|
|7|mean|Gibbs-Poston M5|(Gibbs & Poston, 1975, p. 474)|differentiation|
|8|mean|Gibbs-Poston M6|(Gibbs & Poston, 1975, p. 474)|differentiation|
|9|mean|Wilcox VARNC = |(Wilcox, 1973, p. 11)| |
|9|mean|Gibbs-Poston M2 = |(Gibbs & Poston, 1975, p. 472)|differentiation|
|9|mean|Smith-Wilson E1*|(Smith & Wilson, 1996, p. 71)|evenness|
|10|mean|Wilcox STDEV|(Wilcox, 1973, p. 14)| |
|11|entropy|Shannon-Weaver Entropy|(Shannon & Weaver, 1949, p. 20)|entropy|
|12|entropy|Rényi Entropy|(Rényi, 1961, p. 549)|entropy|
|13|entropy|Wilcox HREL = |(Wilcox, 1973, p. 16)| |
|13|entropy|Pielou J|(Pielou, 1966, p. 141)|diversity|
|14|entropy|Sheldon Index|(Sheldon, 1969, p. 467)|equitability = relative diversity|
|15|entropy|Heip Evenness|(Heip, 1974, p. 555)|evenness|
|16|evenness|Hill Diversity|(Hill, 1973, p. 428)|diversity|
|17|evenness|Hill Evenness|(Hill, 1973, p. 429)|evenness|
|18|evenness|Bulla E|(Bulla, 1994, pp. 168-169)|evenness|
|19|evenness|Bulla D|(Bulla, 1994, p. 169)|diversity|
|20a|evenness|Simpson D|(Simpson, 1949, p. 688)|diversity|
|20b|evenness|Simpson D biased|(Smith & Wilson, 1996, p. 71)| |
|20c|evenness|Simpson D as diversity|(Wikipedia, n.d.)| |
|20d|evenness|Simpson D as diversity biased =|(Berger & Parker, 1970, p. 1345)| |
|20d|evenness|Gibbs-Poston M1|(Gibbs & Poston, 1975, p. 471)|differentiation|
|21|evenness|Gibbs-Poston M3|(Gibbs & Poston, 1975, p. 472)|differentiation|
|22|evenness|Smith-Wilson E2|(Smith & Wilson, 1996, p. 71)|evenness|
|23|evenness|Smith-Wilson E3|(Smith & Wilson, 1996, p. 71)|evenness|
|24|evenness|Fisher alpha|(Fisher et al., 1943, p. 55)|diversity|
|25|other|Wilcox MNDIF|(Wilcox, 1973, p. 9)| |
|26|other|Kaiser b|(Kaiser, 1968, p. 211)|apportionment|
\* Smith-Wilson E1 is listed with the mean group, since it uses the average frequency. It could of course also be placed in the evenness group.
This function is shown in this [YouTube video](https://bit.ly/47uYXPe) and the measures are also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Measures/QualitativeVariation.html)
Parameters
data : list or pandas series
the categorical scores, one entry per case
measure : string, optional
which measure to calculate. Either "vr" (default), "modvr", "ranvr", "avdev", "mndif", "varnc", "stdev", "hrel", "b", "m1", "m2", "m3", "m4", "m5", "m6", "d1", "d2", "d3", "d4", "bpi", "hd", "he", "swe", "re", "sw1", "sw2", "sw3", "hi", "si", "j", "be", "bd", "fisher"
var1 : float, optional
order parameter for some measures: \(a\) for "hd" and "he", \(q\) for "re". Default is 2.
var2 : float, optional
second order parameter, \(b\) for "he". Default is 1.
Returns
pandas.DataFrame
A dataframe with the following columns:
- value, the value of the requested measure
- measure, description of the measure calculated
- source, source used for calculation
Notes
The following measures can be determined:
- "modvr", Wilcox MODVR
- "ranvr", Wilcox RANVR
- "avdev", Wilcox AVDEV
- "mndif", Wilcox MNDIF
- "varnc", Wilcox VARNC (equal to Gibbs-Poston M2 and Smith-Wilson E1)
- "stdev", Wilcox STDEV
- "hrel", Wilcox HREL (equal to Pielou J)
- "m1", Gibbs-Poston M1
- "m2", Gibbs-Poston M2 (equal to Wilcox VARNC and Smith-Wilson E1)
- "m3", Gibbs-Poston M3
- "m4", Gibbs-Poston M4
- "m5", Gibbs-Poston M5
- "m6", Gibbs-Poston M6
- "b", Kaiser b
- "bd", Bulla D
- "be", Bulla E
- "bpi", Berger-Parker index
- "d1", "d2", "d3", "d4", Simpson D and variations
- "hd", Hill Diversity, requires a value for var1
- "he", Hill Evenness, requires a value for var1 and var2
- "hi", Heip Index
- "j", Pielou J (equal to Wilcox HREL)
- "si", Sheldon Index
- "sw1", Smith & Wilson E1 (equal to Wilcox VARNC and Gibbs-Poston M2)
- "sw2", Smith-Wilson E2
- "sw3", Smith-Wilson E3
- "swe", Shannon-Weaver Entropy
- "re", Rényi entropy, requires a value for var1
- "vr", Freeman's variation ratio
- "fisher", Fisher alpha
MODE BASED MEASURES
Dispersion can be seen as how much variation there is, using the center as the norm. For nominal data the measure of central tendency is the mode, and therefore some measures of qualitative variation use the mode as the starting point.
The frequency of the modal category is then useful. This is simply the maximum of the frequencies.
Freeman Variation Ratio ("vr")
Perhaps the most popular measure of qualitative variation is the (Freeman) Variation Ratio. It is simply the proportion of scores that do not belong to the modal category.
The formula used is (Freeman, 1965, p. 41):
$$v = 1 - \frac{F_{mode}}{n}$$
A value of 0 (0%) means that all cases are in the modal category and every other category is empty. A value of 1 (100%) would mean that no cases are in the modal category, which cannot occur: the modal category has the highest frequency, so its frequency can only be 0 if there are no cases at all.
Berger–Parker index ("bpi")
The Berger-Parker index is the complement of the variation ratio: it is the proportion of scores that do fall in the modal category. In formula notation (Berger & Parker, 1970, p. 1345):
$$BPI = \frac{F_{mode}}{n}$$
Berger and Parker refer to this as a dominance measure, to indicate how "dominant" the modal category is.
A value of 1 (100%) means that all cases are in the modal category. A value of 0 (0%) would mean that no cases are in the modal category, which cannot occur: the modal category has the highest frequency, so its frequency can only be 0 if there are no cases at all.
Wilcox MODVR ("modvr")
This looks at the difference between the modal frequency and the frequency of each category. The sum of these differences is then divided by n\times \left(k -1\right) to standardize the result to the range 0 to 1.
It is a modification of the Freeman Variation Ratio, hence the name MODVR. Wilcox noted that the Freeman VR can never reach the maximum value of 1.
The formula used is (Wilcox, 1973, p. 7): \text{MODVR} = \frac{\sum_{i=1}^k \left(F_{mode} - F_i\right)}{n\times \left(k - 1\right)} = \frac{k\times F_{mode}-n}{n\times \left(k - 1\right)}
Wilcox RANVR ("ranvr")
Short for 'range variation ratio', this measure is very similar to Freeman's VR. Instead of looking only at the mode, it looks at the range between the modal frequency and the lowest frequency (F_{min}).
The formula used is (Wilcox, 1973, p. 8): \text{RANVR} = 1 - \frac{F_{mode} - F_{min}}{F_{mode}} = \frac{F_{min}}{F_{mode}}
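The MODVR and RANVR formulas as written above can be checked with a few lines of plain Python. This is only a sketch with my own variable names, using the category counts from Example 2.

```python
freqs = [7, 6, 4, 2]   # category counts from Example 2
n = sum(freqs)
k = len(freqs)
f_mode = max(freqs)
f_min = min(freqs)

# MODVR: the summation form and the simplified form of the formula above.
modvr_sum = sum(f_mode - f for f in freqs) / (n * (k - 1))
modvr_short = (k * f_mode - n) / (n * (k - 1))

# RANVR: one minus the range relative to the modal frequency.
ranvr = 1 - (f_mode - f_min) / f_mode

print(modvr_sum, modvr_short, ranvr)
```

Both MODVR forms give the same value, and RANVR reduces to the lowest frequency divided by the modal frequency.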
MEAN BASED MEASURES
The following measures use the average count to determine the variation, i.e. \bar{F} = \frac{\sum_{i=1}^k F_i}{k} = \frac{n}{k}
Wilcox AVDEV ("avdev")
This is the analogue of the mean absolute deviation for scale variables, but using the frequencies. It is again standardized to range from 0 to 1.
The formula used is (Wilcox, 1973, p. 9): \text{AVDEV} = 1-\frac{\sum_{i=1}^k \left|F_i-\bar{F}\right|}{2\times \frac{n}{k}\times \left(k-1\right)}= 1-\frac{k\times \sum_{i=1}^k \left|F_i-\bar{F}\right|}{2\times n \times \left(k-1\right)}
Gibbs-Poston M4 ("m4")
The formula used (Gibbs & Poston, 1975, p. 473): \text{M4} = 1-\frac{\sum_{i=1}^k \left|F_i-\bar{F}\right|}{2\times n}
Gibbs-Poston M5 ("m5")
The problem with M4 is that it can never be 0. M5 adjusts for this, but is computationally more involved.
The formula used (Gibbs & Poston, 1975, p. 474): \text{M5} = 1-\frac{\sum_{i=1}^k \left|F_i-\bar{F}\right|}{2\times\left(n-k+1-\bar{F}\right)}
Gibbs-Poston M6 ("m6")
The formula used (Gibbs & Poston, 1975, p. 474): \text{M6} = k\times\left(1-\frac{\sum_{i=1}^k \left|F_i-\bar{F}\right|}{2\times n}\right) = k\times\text{M4}
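To see how the four measures based on absolute deviations relate, here is a sketch in plain Python of the formulas above (the function name is hypothetical, not part of the package). It also shows why M5 was introduced: for the most concentrated distribution in which every category still has one case, M5 becomes 0 while M4 does not.

```python
def abs_dev_measures(freqs):
    """Return AVDEV, M4, M5 and M6 for a list of category counts."""
    n, k = sum(freqs), len(freqs)
    f_mean = n / k
    total_dev = sum(abs(f - f_mean) for f in freqs)
    avdev = 1 - total_dev / (2 * f_mean * (k - 1))
    m4 = 1 - total_dev / (2 * n)
    m5 = 1 - total_dev / (2 * (n - k + 1 - f_mean))
    m6 = k * m4
    return avdev, m4, m5, m6

# Category counts from Example 2.
avdev, m4, m5, m6 = abs_dev_measures([7, 6, 4, 2])

# Most concentrated case with the same n and k, each category non-empty.
_, m4_extreme, m5_extreme, _ = abs_dev_measures([16, 1, 1, 1])

print(avdev, m4, m5, m6)
print(m4_extreme, m5_extreme)
```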
Wilcox VARNC ("varnc"), Gibbs-Poston M2 ("m2"), and Smith & Wilson E1 ("sw1")
This is analogous to the variance for scale variables.
The formula used is (Wilcox, 1973, p. 11): \text{VARNC} = 1-\frac{\sum_{i=1}^{k}\left(F_i-\bar{F}\right)^2}{\frac{n^2\times\left(k-1\right)}{k}} = \frac{k\times\left(n^2-\sum_{i=1}^k F_i^2\right)}{n^2\times\left(k-1\right)}
This is the same as Gibbs and Poston's M2 ("m2"). Their formula looks different but has the same result (Gibbs & Poston, 1975, p. 472) \text{M2} = \frac{1-\sum_{i=1}^k p_i^2}{1-\frac{1}{k}} = \frac{\text{M1}}{1-\frac{1}{k}} = \frac{k}{k-1}\times\text{M1}
It is also the same as Smith and Wilson's first evenness measure ("sw1").
The formula used (Smith & Wilson, 1996, p. 71): E_1 = \frac{1 - D_s}{1 - \frac{1}{k}}
With D_s being Simpson's D, but defined as: D_s = \sum_{i=1}^k\left(\frac{F_i}{n}\right)^2
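The claim that VARNC, M2 and E1 are the same measure can be verified numerically. The sketch below, in plain Python with my own variable names, evaluates the three formulas above on the category counts from Example 2.

```python
freqs = [7, 6, 4, 2]   # category counts from Example 2
n = sum(freqs)
k = len(freqs)
f_mean = n / k
props = [f / n for f in freqs]

# Wilcox VARNC, using deviations from the mean frequency.
varnc = 1 - sum((f - f_mean) ** 2 for f in freqs) / (n ** 2 * (k - 1) / k)

# Gibbs-Poston M2, using proportions.
m2 = (1 - sum(p ** 2 for p in props)) / (1 - 1 / k)

# Smith-Wilson E1, using Simpson's D (sum of squared proportions).
d_s = sum((f / n) ** 2 for f in freqs)
e1 = (1 - d_s) / (1 - 1 / k)

print(varnc, m2, e1)
```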
Wilcox STDEV ("stdev")
As with the variance for scale variables, we can take the square root to obtain the standard deviation.
The formula can be written using the deviations from the mean frequency, as in VARNC, or using the pairwise differences, as in MNDIF (Wilcox, 1973, p. 14): \text{STDEV} = 1-\sqrt{\frac{\sum_{i=1}^k \left(F_i-\bar{F}\right)^2}{\left(n-\bar{F}\right)^2+\left(k-1\right)\bar{F}^2}}= 1-\sqrt{\frac{\sum_{i=1}^{k-1}\sum_{j=i+1}^k \left(F_i-F_j\right)^2}{n^2\times\left(k-1\right)}}
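The two forms of the STDEV formula look quite different, so a quick numerical check may help. This is a sketch in plain Python with my own variable names, again using the counts from Example 2.

```python
from itertools import combinations
from math import sqrt

freqs = [7, 6, 4, 2]   # category counts from Example 2
n = sum(freqs)
k = len(freqs)
f_mean = n / k

# Form 1: squared deviations from the mean frequency.
num_mean = sum((f - f_mean) ** 2 for f in freqs)
den_mean = (n - f_mean) ** 2 + (k - 1) * f_mean ** 2
stdev_mean = 1 - sqrt(num_mean / den_mean)

# Form 2: squared differences over all pairs i < j.
num_pairs = sum((fi - fj) ** 2 for fi, fj in combinations(freqs, 2))
stdev_pairs = 1 - sqrt(num_pairs / (n ** 2 * (k - 1)))

print(stdev_mean, stdev_pairs)
```

The agreement follows because the sum of squared pairwise differences equals k times the sum of squared deviations from the mean.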
ENTROPY
Entropy is sometimes referred to as the expected value of the surprise. It tells us, on average, how surprised we might be about the outcome, and is also used as a measure with qualitative data.
I enjoyed the simple explanation of entropy from StatQuest; their video is available at https://www.youtube.com/watch?v=YtebGVx-Fxw.
Entropy deals a lot with proportions rather than the counts themselves.
Shannon-Weaver Entropy ("swe")
The formula used (Shannon & Weaver, 1949, p. 20): H_{sw}=-\sum_{i=1}^k p_i\times\ln\left(p_i\right)
Rényi entropy ("re")
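This is a generalisation of Shannon entropy.
The formula used is (Rényi, 1961, p. 549): H_q = \frac{1}{1 - q}\times\log_2\left(\sum_{i=1}^k p_i^q\right)
The value of q is set with var1 and should not be equal to 1, since the formula would then divide by zero.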
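To illustrate in what sense Rényi entropy generalises Shannon entropy, the sketch below (plain Python, function name my own) evaluates the formula above for a few values of q. Because the formula uses a base 2 logarithm, the comparison is with Shannon entropy in bits rather than with the natural-log version used for "swe".

```python
from math import log2

freqs = [7, 6, 4, 2]   # category counts from Example 2
n = sum(freqs)
props = [f / n for f in freqs]

def renyi_entropy(props, q):
    """Renyi entropy in bits; q must not be equal to 1."""
    return log2(sum(p ** q for p in props)) / (1 - q)

shannon_bits = -sum(p * log2(p) for p in props)

h0 = renyi_entropy(props, 0)            # log2 of the number of categories
h2 = renyi_entropy(props, 2)            # based on the sum of squared proportions
h_near_1 = renyi_entropy(props, 1 + 1e-6)

print(h0, h_near_1, shannon_bits, h2)
```

As q gets close to 1 the result approaches the Shannon entropy in bits, and the values do not increase as q increases.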
Wilcox HREL ("hrel") and Pielou J ("j")
This uses Shannon's entropy but divides it by the maximum possible uncertainty.
The formula used (Wilcox, 1973, p. 16): \text{HREL} = \frac{-\sum_{i=1}^k p_i \times \log_2 p_i}{\log_2 k}
This is the same as Pielou J ("j"). Wilcox uses a base 2 logarithm and Pielou the natural logarithm, but the base cancels out in the ratio.
The formula used (Pielou, 1966, p. 141): J=\frac{H_{sw}}{\ln\left(k\right)}
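That HREL and Pielou J give the same value, despite the different logarithm bases, can be checked with this sketch in plain Python (variable names are my own).

```python
from math import log, log2

freqs = [7, 6, 4, 2]   # category counts from Example 2
n = sum(freqs)
k = len(freqs)
props = [f / n for f in freqs]

# Wilcox HREL: base 2 logarithms in numerator and denominator.
hrel = -sum(p * log2(p) for p in props) / log2(k)

# Pielou J: Shannon-Weaver entropy (natural log) over ln(k).
h_sw = -sum(p * log(p) for p in props)
pielou_j = h_sw / log(k)

print(h_sw, hrel, pielou_j)
```

The entropy value matches the 1.296892 shown in Example 2 on this page.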
Sheldon Index ("si")
The formula used (Sheldon, 1969, p. 467): E = \frac{e^{H_{sw}}}{k}
Heip Index ("hi")
The formula used is (Heip, 1974, p. 555): E_h = \frac{e^{H_{sw}} - 1}{k - 1}
EVENNESS and DIVERSITY
Hill Diversity ("hd")
The formula used is (Hill, 1973, p. 428): N_a = \begin{cases}\left(\sum_{i=1}^k p_i^a\right)^{\frac{1}{1-a}} & \text{ if } a\neq 1 \\ e^{H_{sw}} & \text{ if } a = 1 \end{cases}
The value of a is set with var1.
Hill Evenness ("he")
The formula used is (Hill, 1973, p. 429): E_{a,b} = \frac{N_a}{N_b}
Where N_a and N_b are Hill's diversity values for a (set with var1) and b (set with var2).
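A sketch of Hill's diversity and evenness in plain Python may make the role of a and b clearer. The function names are hypothetical and this is not the package code. It also shows that N_0 is the number of categories, N_2 is the reciprocal of the sum of squared proportions, and E_{1,0} equals the Sheldon index.

```python
from math import exp, log

freqs = [7, 6, 4, 2]   # category counts from Example 2
n = sum(freqs)
k = len(freqs)
props = [f / n for f in freqs]

def hill_diversity(props, a):
    """Hill's N_a, with the exponential of Shannon entropy for a = 1."""
    if a == 1:
        return exp(-sum(p * log(p) for p in props))
    return sum(p ** a for p in props) ** (1 / (1 - a))

def hill_evenness(props, a, b):
    """Hill's E_{a,b}: ratio of two Hill diversity values."""
    return hill_diversity(props, a) / hill_diversity(props, b)

n0 = hill_diversity(props, 0)
n1 = hill_diversity(props, 1)
n2 = hill_diversity(props, 2)
sheldon = exp(-sum(p * log(p) for p in props)) / k

print(n0, n1, n2, hill_evenness(props, 1, 0), sheldon)
```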
Bulla E ("be")
Bulla's evenness measure.
The formula used is (Bulla, 1994, pp. 168-169): E_b = \frac{O - \frac{1}{k} - \frac{k - 1}{n}}{1 - \frac{1}{k} - \frac{k - 1}{n}}
With: O = \sum_{i=1}^k \min\left(p_i, \frac{1}{k}\right)
Bulla D ("bd")
Bulla's Evenness measure converted to a diversity measure.
The formula used is (Bulla, 1994, p. 169): D_b = E_b\times k
Where E_b is the Bulla E value, which uses: O = \sum_{i=1}^k \min\left(p_i, \frac{1}{k}\right)
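The sketch below follows the E_b and D_b formulas exactly as written above, in plain Python with hypothetical function names. It assumes there are more cases than categories (n > k), since otherwise the denominator becomes 0.

```python
def bulla_e(freqs):
    """Bulla E as in the formula above; assumes n > k."""
    n, k = sum(freqs), len(freqs)
    o = sum(min(f / n, 1 / k) for f in freqs)   # overlap with an even spread
    o_min = 1 / k + (k - 1) / n                 # the two terms subtracted from O
    return (o - o_min) / (1 - o_min)

def bulla_d(freqs):
    """Bulla D: Bulla E multiplied by the number of categories."""
    return bulla_e(freqs) * len(freqs)

e_example = bulla_e([7, 6, 4, 2])    # category counts from Example 2
e_even = bulla_e([5, 5, 5, 5])       # perfectly even distribution
e_extreme = bulla_e([16, 1, 1, 1])   # most concentrated, no empty category

print(e_example, e_even, e_extreme, bulla_d([7, 6, 4, 2]))
```

With these formulas a perfectly even distribution gives 1, and the most concentrated distribution without empty categories gives 0.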
Simpson D ("d1", "d2", "d3", "d4" = Gibbs-Poston M1)
The formula used is based on Simpson (1949, p. 688): D_1 = \frac{\sum_{i=1}^k F_i\times\left(F_i-1\right)}{n\times\left(n-1\right)}
An alternative, for a population, is: D_2 = \sum_{i=1}^k\left(\frac{F_i}{n}\right)^2
Often the result is subtracted from 1 to reverse the scale: D_3 = 1-\frac{\sum_{i=1}^k F_i\times\left(F_i-1\right)}{n\times\left(n-1\right)}
and D_4 = 1 - \sum_{i=1}^k\left(\frac{F_i}{n}\right)^2
This last one is then the same as Gibbs-Poston M1 (Gibbs & Poston, 1975, p. 471): \text{M1} = 1 - \sum_{i=1}^k p_i^2
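The four Simpson variations differ only in whether the sample or population form is used and whether the scale is reversed. A sketch in plain Python (variable names are my own) on the counts from Example 2:

```python
freqs = [7, 6, 4, 2]   # category counts from Example 2
n = sum(freqs)
props = [f / n for f in freqs]

d1 = sum(f * (f - 1) for f in freqs) / (n * (n - 1))   # sample version
d2 = sum((f / n) ** 2 for f in freqs)                  # population version
d3 = 1 - d1                                            # reversed sample version
d4 = 1 - d2                                            # reversed population version
m1 = 1 - sum(p ** 2 for p in props)                    # Gibbs-Poston M1

print(d1, d2, d3, d4, m1)
```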
Gibbs-Poston M3 ("m3")
The formula used (Gibbs & Poston, 1975, p. 472): \text{M3} = \frac{1-\sum_{i=1}^k p_i^2-p_{min}}{1-\frac{1}{k}-p_{min}}
With p_{min} the lowest proportion.
Smith & Wilson E2 ("sw2")
The formula used (Smith & Wilson, 1996, p. 71): E_2 = \frac{-\ln\left(D_s\right)}{\ln\left(k\right)}
With D_s being Simpson's D, but defined as: D_s = \sum_{i=1}^k\left(\frac{F_i}{n}\right)^2
Smith & Wilson E3 ("sw3")
The formula used (Smith & Wilson, 1996, p. 71): E_3 = \frac{1}{D_s \times k}
With D_s being Simpson's D, but defined as: D_s = \sum_{i=1}^k\left(\frac{F_i}{n}\right)^2
Fisher alpha ("fisher")
The formula used (Fisher et al., 1943, p. 55): k = \alpha \times \ln\left(1 + \frac{n}{\alpha}\right)
The function uses a simple binary search, with a maximum of 100 iterations, to find the value for \alpha such that the above formula produces the number of categories (k).
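Since the equation cannot be solved for \alpha directly, a numerical search is needed. The sketch below uses a plain bisection, which is not necessarily the exact search the function performs; the function name, tolerance and bracketing strategy are my own assumptions. It relies on \alpha \times \ln\left(1 + \frac{n}{\alpha}\right) increasing with \alpha and staying below n, so a solution only exists for k < n.

```python
from math import log

def fisher_alpha(n, k, tol=1e-12, max_iter=200):
    """Solve k = alpha * ln(1 + n / alpha) for alpha by bisection."""
    if not 0 < k < n:
        raise ValueError("a solution requires 0 < k < n")

    def expected_k(alpha):
        return alpha * log(1 + n / alpha)

    lo, hi = 1e-12, 1.0
    while expected_k(hi) < k:   # widen the bracket until it contains k
        hi *= 2
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        if expected_k(mid) < k:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2

alpha = fisher_alpha(n=19, k=4)   # n and k from Example 2
print(alpha, alpha * log(1 + 19 / alpha))
```

Plugging the result back into the formula should return the number of categories.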
OTHER
Wilcox MNDIF ("mndif")
Analog of the mean difference measure for scale variables.
The formula used is (Wilcox, 1973, p. 9): \text{MNDIF} = 1-\frac{\sum_{i=1}^{k-1}\sum_{j=i+1}^k \left|F_i-F_j\right|}{n\times\left(k-1\right)}
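The double sum in the MNDIF formula runs over every pair of categories once. A sketch in plain Python (function name my own) using itertools.combinations:

```python
from itertools import combinations

def mndif(freqs):
    """Wilcox MNDIF from a list of category counts."""
    n, k = sum(freqs), len(freqs)
    pair_diffs = sum(abs(fi - fj) for fi, fj in combinations(freqs, 2))
    return 1 - pair_diffs / (n * (k - 1))

mndif_example = mndif([7, 6, 4, 2])   # category counts from Example 2
mndif_even = mndif([5, 5, 5, 5])      # no differences between categories
mndif_single = mndif([19, 0, 0, 0])   # all cases in one category

print(mndif_example, mndif_even, mndif_single)
```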
Kaiser b ("b")
The formula used (Kaiser, 1968, p. 211): B = 1 - \sqrt{1 - \left(\sqrt[k]{\prod_{i=1}^k\frac{F_i\times k}{n}}\right)^2}
Kaiser also provides rules-of-thumb for interpretation. See th_kaiser_b() for more details on this.
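The k-th root of the product in the formula above is a geometric mean of the frequencies relative to the mean frequency. A sketch in plain Python (function name my own, requires Python 3.8+ for math.prod):

```python
from math import prod, sqrt

def kaiser_b(freqs):
    """Kaiser b from a list of category counts."""
    n, k = sum(freqs), len(freqs)
    geo_mean = prod(f * k / n for f in freqs) ** (1 / k)
    return 1 - sqrt(1 - geo_mean ** 2)

b_example = kaiser_b([7, 6, 4, 2])   # category counts from Example 2
b_even = kaiser_b([5, 5, 5, 5])      # every relative frequency equals 1

print(b_example, b_even)
```

For a perfectly even distribution the geometric mean is 1, so the measure is 1. A single empty category makes the product 0, and the measure is then 0.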
Before, After and Alternatives
Before this, an impression using a frequency table or a visualisation might be helpful:
- tab_frequency
- vi_bar_simple for Simple Bar Chart
- vi_cleveland_dot_plot for Cleveland Dot Plot
- vi_dot_plot for Dot Plot
- vi_pareto_chart for Pareto Chart
- vi_pie for Pie Chart
After this you might want to perform a test:
- ts_pearson_gof for Pearson Chi-Square Goodness-of-Fit Test
- ts_freeman_tukey_gof for Freeman-Tukey Test of Goodness-of-Fit
- ts_freeman_tukey_read for Freeman-Tukey-Read Test of Goodness-of-Fit
- ts_g_gof for G (Likelihood Ratio) Goodness-of-Fit Test
- ts_mod_log_likelihood_gof for Mod-Log Likelihood Test of Goodness-of-Fit
- ts_multinomial_gof for Multinomial Goodness-of-Fit Test
- ts_neyman_gof for Neyman Test of Goodness-of-Fit
- ts_powerdivergence_gof for Power Divergence GoF Test
References
Berger, W. H., & Parker, F. L. (1970). Diversity of planktonic foraminifera in deep-sea sediments. Science, 168(3937), 1345–1347. doi:10.1126/science.168.3937.1345
Bulla, L. (1994). An index of evenness and its associated diversity measure. Oikos, 70(1), 167–171. doi:10.2307/3545713
Fisher, R. A., Corbet, A. S., & Williams, C. B. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. The Journal of Animal Ecology, 12(1), 42–58. doi:10.2307/1411
Freeman, L. C. (1965). Elementary applied statistics: For students in behavioral science. Wiley.
Gibbs, J. P., & Poston, D. L. (1975). The division of labor: Conceptualization and related measures. Social Forces, 53(3), 468. doi:10.2307/2576589
Heip, C. (1974). A new index measuring evenness. Journal of the Marine Biological Association of the United Kingdom, 54(3), 555–557. doi:10.1017/S0025315400022736
Hill, M. O. (1973). Diversity and evenness: A unifying notation and its consequences. Ecology, 54(2), 427–432. doi:10.2307/1934352
Kaiser, H. F. (1968). A measure of the population quality of legislative apportionment. American Political Science Review, 62(1), 208–215. doi:10.2307/1953335
Pielou, E. C. (1966). The measurement of diversity in different types of biological collections. Journal of Theoretical Biology, 13, 131–144. doi:10.1016/0022-5193(66)90013-0
Rényi, A. (1961). On measures of entropy and information. Contributions to the Theory of Statistics, 1, 547–562.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. University of Illinois Press.
Sheldon, A. L. (1969). Equitability indices: Dependence on the species count. Ecology, 50(3), 466–467. doi:10.2307/1933900
Simpson, E. H. (1949). Measurement of diversity. Nature, 163(4148), 688. doi:10.1038/163688a0
Smith, B., & Wilson, J. B. (1996). A consumer’s guide to evenness indices. Oikos, 76(1), 70–82. doi:10.2307/3545749
Wilcox, A. R. (1973). Indices of qualitative variation and political measurement. Political Research Quarterly, 26(2), 325–343. doi:10.1177/106591297302600209
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
Example 1: pandas series
>>> df1 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv', sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df1['mar1']
>>> me_qv(ex1)
      value                  measure           source
0  0.499227  Freeman Variation Ratio  (Freeman, 1965)
Example 2: a list
>>> ex2 = ["MARRIED", "DIVORCED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "NEVER MARRIED", "MARRIED", "MARRIED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "MARRIED"]
>>> me_qv(ex2, "swe")
      value                 measure                           source
0  1.296892  Shannon-Weaver Entropy  (Shannon & Weaver, 1949, p. 20)
Expand source code
def me_qv(data, measure="vr", var1=2, var2=1): ''' Measures of Qualitative Variation --------------------------------- The mode is the measure of central tendancy, to indicate the center for categorical data. Similar as the arithmetic mean is for numeric data. As with numeric data, the center alone is not always so informative. If your head is in a burning oven, and your feet are in a freezer, you are on average fine. This is one of the reasons, why it is often recommended to add a measure of dispersion. It gives a clearer picture of the data, and can indicate how diverse it was (how much variation). For categorical data there are a lot of different measures proposed, but I don't often see them being used. The most common one is probably the Variation Ratio. This is simply the percentage of cases that were not in the modal category. The specific name of the type of measure for this qualitative variation can vary quite a lot. Some talk about dominance, differentiation, evenness, entropy, equitability, diversity, and apportionment. I've tried to categorise the measures a bit, based on the calculations. Below is the overview of all measures available in this function. |nr.|group|measure|source|original type| |---|-----|-------|------|-------------| |1|mode|Freeman Variation Ratio|(Freeman, 1965)| | |2|mode|Berger-Parker Index|(Berger & Parker, 1970, p. 1345)|dominance| |3|mode|Wilcox MODVR|(Wilcox, 1973, p. 7)| | |4|mode|Wilcox RANVR|(Wilcox, 1973, p. 8)| | |5|mean|Wilcox AVDEV|(Wilcox, 1973, p. 9)| | |6|mean|Gibbs-Poston M4|(Gibbs & Poston, 1975, p. 473)|differentiation| |7|mean|Gibbs-Poston M5|(Gibbs & Poston, 1975, p. 474)|differentiation| |8|mean|Gibbs-Poston M6|(Gibbs & Poston, 1975, p. 474)|differentiation| |9|mean|Wilcox VARNC = |(Wilcox, 1973, p. 11)| | |9|mean|Gibbs-Poston M2 = |(Gibbs & Poston, 1975, p. 472)|differentiation| |9|mean|Smith-Wilson E1*|(Smith & Wilson, 1996, p. 71)|evenness| |10|mean|Wilcox STDEV|(Wilcox, 1973, p. 
14)| | |11|entropy|Shannon-Weaver Entropy|(Shannon & Weaver, 1949, p. 20)|entropy| |12|entropy|Rényi Entropy|(Rényi, 1961, p. 549)|entropy| |13|entropy|Wilcox HREL = |(Wilcox, 1973, p. 16)| | |13|entropy|Pielou J|(Pielou, 1966, p. 141)|diversity| |14|entropy|Sheldon Index|(Sheldon, 1969, p. 467)|equitability = relative diversity| |15|entropy|Heip Evenness|(Heip, 1974, p. 555)|evenness| |16|evenness|Hill Diversity|(Hill, 1973, p. 428)|diversity| |17|evenness|Hill Evenness|(Hill, 1973, p. 429)|evenness| |18|evenness|Bulla E|(Bulla, 1994, pp. 168-169)|evenness| |19|evenness|Bulla D|(Bulla, 1994, p. 169)|diversity| |20a|evenness|Simpson D|(Simpson, 1949, p. 688)|diversity| |20b|evenness|Simpson D biased|(Smith & Wilson, 1996, p. 71)| | |20c|evenness|Simpson D as diversity|(Wikipedia, n.d.)| | |20d|evenness|Simpson D as diversity biased =|(Berger & Parker, 1970, p. 1345)| | |20d|evenness|Gibbs-Poston M1|(Gibbs & Poston, 1975, p. 471)|differentiation| |21|evenness|Gibbs-Poston M3|(Gibbs & Poston, 1975, p. 472)|differentiation| |22|evenness|Smith-Wilson E2|(Smith & Wilson, 1996, p. 71)|evenness| |23|evenness|Smith-Wilson E3|(Smith & Wilson, 1996, p. 71)|evenness| |24|evenness|Fisher alpha|(Fisher et al., 1943, p. 55)|diversity| |25|other|Wilcox MNDIF|(Wilcox, 1973, p. 9)| | |26|other|Kaiser b|(Kaiser, 1968, p. 211)|apportionment| \* Smith-Wilson E1 is listed with the mean group, since it uses the average frequency. It could of course also be placed in the evenness group. This function is shown in this [YouTube video](https://bit.ly/47uYXPe) and the measures are also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Measures/QualitativeVariation.html) Parameters ---------- data : list or pandas series measure : string, optional to indicate which method to use. 
Either "vr" (default), "modvr", "ranvr", "avdev", "mndif", "varnc", "stdev", "hrel", "b", "m1", "m2", "m3", "m4", "m5", "m6", "d1", "d2", "d3", "d4", "bpi", "hd", "he", "swe", "re", "sw1", "sw2", "sw3", "hi", "si", "j", "b", "be", "bd", "fisher" var1 : float, optional additional value for some measures var2 : float, optional additional value for some measures Returns ------- pandas.DataFrame A dataframe with the following columns: * *value*, the value of the requested measure * *measure*, description of the measure calculated * *source*, source used for calculation Notes ----- The following measures can be determined: * *"modvr"*, Wilcox MODVR * *"ranvr"*, Wilcox RANVR * *"avdev"*, Wilcox AVDEV * *"mndif"*, Wilcox MNDIF * *"varnc"*, Wilcox VARNC (equal to Gibbs-Poston M2 and Smith-Wilson E1) * *"stdev"*, Wilcox STDEV * *"hrel"*, Wilcox HREL (equal to Pielou J) * *"m1"*, Gibbs-Poston M1 * *"m2"*, Gibbs-Poston M2 (equal to Wilcox VARNC and Smith-Wilson E1) * *"m3"*, Gibbs-Poston M3 * *"m4"*, Gibbs-Poston M4 * *"m5"*, Gibbs-Poston M5 * *"m6"*, Gibbs-Poston M6 * *"b"*, Kaiser b * *"bd"*, Bulla D * *"be"*, Bulla E * *"bpi"*, Berger-Parker index * *"d1"*, *"d2"*, *"d3"*, *"d4"*, Simpson D and variations * *"hd"*, Hill Diversity, requires a value for *var1* * *"he"*, Hill Eveness, requires a value for *var1* and *var2* * *"hi"*, Heip Index * *"j"*, Pielou J (equal to Wilcox HREL) * *"si"*, Sheldon Index * *"sw1"*, Smith & Wilson E1 (equal to Wilcox VARNC and Gibbs-Poston M2) * *"sw2"*, Smith-Wilson E2 * *"sw3"*, Smith-Wilson E3 * *"swe"*, Shannon-Weaver Entropy * *"re"*, Rényi entropy, requires a value for *var1* * *"vr"*, Freeman's variation ratio * *"fisher"*, Fisher alpha **MODE BASED MEASURES** Dispersion can be seen as how much variation there is, using as a norm the center. For nominal data the measure of central tendancy is the mode, and therefor some measures of qualitative variation use the mode as the starting point. 
The frequency of the modal category is then useful. This is simply the maximum of the frequencies. **Freeman Variation Ratio** ("vr") Perhaps one of the most popular measures of qualitative variation uses the mode. The (Freeman) Variation Ratio. It is simply the proportion of scores that do not belong to the modal category. In formula notation (Freeman, 1965, p. 41): Formula used from Freeman (1965, p. 41): $$v = 1 - \\frac{F_{mode}}{n}$$ This variation ratio would become 0% if all cases fitted in the modal category, and all other categories don't have any cases. A 0 (0%) would mean that all cases were in the modal category. A 1 (100%) would indicate that no cases were in the modal category. However, this seems impossible to ever occur, since the modal category is the category with the highest frequency, which is impossible to be 0, unless there are no cases at all. **Berger–Parker index** ("bpi") The variation ratio is the opposite of the Berger-Parker Index, which is simply the proportion of scores that did fit in the modal category. In formula notation (Berger & Parker, 1970, p. 1345): $$BPI = \\frac{F_{mode}}{n}$$ Berger and Parker refer to this as a dominance measure, to indicate how "dominant" the modal category is. A 1 (100%) would mean that all cases were in the modal category. A 0 (0%) would indicate that no cases were in the modal category. However, this seems impossible to ever occur, since the modal category is the category with the highest frequency, which is impossible to be 0, unless there are no cases at all. **Wilcox MODVR** ("modvr") This looks at the difference of the frequency for each category with the modal frequency. This then gets divided by \\(n\\times \\left(k -1\\right)\\) to standardize the results to 0 to 1. It is a modification of the Freeman Variation Ratio, hence the name MODVR. Wilcox noted that the Freeman VR can never reach the maximum value of 1. The formula used is (Wilcox, 1973, p. 
7): $$\\text{MODVR} = \\frac{\\sum_{i=1}^k F_{mode} - F_i}{n\\times \\left(k - 1\\right)} = \\frac{k\\times F_{mode}-n}{n\\times \\left(k - 1\\right)}$$ **Wilcox RANVR** ("ranvr") Short for 'range variation ratio' this measure is very similar to Freeman's VR. Instead of looking simply at the mode, it looks at the range. The formula used is (Wilcox, 1973, p. 8): $$\\text{RANVR} = 1 - \\frac{F_{mode} - F_{min}}{F_{mode}}$$ **MEAN BASED MEASURES** The following measures use the average count to determine the variation. i.e. $$\\bar{F} = \\frac{\\sum_{i=1}^k F_i}{k} = \\frac{n}{k}$$ **Wilcox AVDEV** ("avdev") This simply follows the mean absolute deviation analogue but then using frequencies. Again this is then standardized. The formula used is (Wilcox, 1973, p. 9): $$\\text{AVDEV} = 1-\\frac{\\sum_{i=1}^k \\left|F_i-\\bar{F}\\right|}{2\\times \\frac{n}{k}\\times \\left(k-1\\right)}= 1-\\frac{k\\times \\sum_{i=1}^k \\left|F_i-\\bar{F}\\right|}{2\\times n \\times \\left(k-1\\right)}$$ **Gibbs-Poston M4** ("m4") The formula used (Gibbs & Poston, 1975, p. 473): $$\\text{M4} = 1-\\frac{\\sum_{i=1}^k \\left|F_i-\\bar{F}\\right|}{2\\times n}$$ **Gibbs-Poston M5** ("m5") The problem with M4 is that it can never be 0, so to adjust for this M5 could be used but is computationally then more difficult. The formula used (Gibbs & Poston, 1975, p. 474): $$\\text{M5} = 1-\\frac{\\sum_{i=1}^k \\left|F_i-\\bar{F}\\right|}{2\\times\\left(n-k+1-\\bar{F}\\right)}$$ **Gibbs-Poston M6** ("m6") The formula used (Gibbs & Poston, 1975, p. 474): $$\\text{M6} = k\\times\\left(1-\\frac{\\sum_{i=1}^k \\left|F_i-\\bar{F}\\right|}{2\\times n}\\right) = k\\times\\text{M4}$$ **Wilcox VARNC** ("varnc"), **Gibbs-Poston M2** ("m2"), and **Smith & Wilson E1** ("sw1") This is similar as the variance for scale variables. The formula used is (Wilcox, 1973, p. 
11): $$\\text{VARNC} = 1-\\frac{\\sum_{i=1}^{k}\\left(F_i-\\bar{F}\\right)^2}{\\frac{n^2\\times\\left(k-1\\right)}{k}} = \\frac{k\\times\\left(n^2-\\sum_{i=1}^k F_i^2\\right)}{n^2\\times\\left(k-1\\right)}$$ This is the same as Gibbs and Poston's **M2** ("m2"). Their formula looks different but has the same result (Gibbs & Poston, 1975, p. 472) $$\\text{M2} = \\frac{1-\\sum_{i=1}^k p_i^2}{1-\\frac{1}{k}} = \\frac{\\text{M1}}{1-\\frac{1}{k}} = \\frac{k}{k-1}\\times\\text{M1}$$ It is also the same as Smith and Wilson's first evenness measure ("sw1"). The formula used (Smith & Wilson, 1996, p. 71): $$E_1 = \\frac{1 - D_s}{1 - \\frac{1}{k}}$$ With \\(D_s\\) being Simpson's D, but defined as: $$D_s = \\sum_{i=1}^k\\left(\\frac{F_i}{n}\\right)^2$$ **Wilcox STDEV** ("stdev") As with the variance for scale variables, we can take the square root to obtain the standard deviation. The formula used can be from the VARNC or the MNDIF (Wilcox, 1973, p. 14): $$\\text{STDEV} = 1-\\sqrt{\\frac{\\sum_{i=1}^k \\left(F_i-\\bar{F}\\right)^2}{\\left(n-\\bar{F}\\right)^2+\\left(k-1\\right)\\bar{F}^2}}= 1-\\sqrt{\\frac{\\sum_{i=1}^{k-1}\\sum_{j=i+1}^k \\left(F_i-F_j\\right)^2}{n^2\\times\\left(k-1\\right)}}$$ **ENTROPY** Entropy is sometimes referred to as the expected value of the surprise. It tells on average how surprised we might be about the outcome, and is also used as a measure with qualitative data. I enjoyed the simple explanation on entropy from StatQuest, their video is available <a href="https://www.youtube.com/watch?v=YtebGVx-Fxw">here</a>. It deals a lot with proportions rather than the counts themselves **Shannon-Weaver Entropy** ("swe") The formula used (Shannon & Weaver, 1949, p. 20): $$H_{sw}=-\\sum_{i=1}^k p_i\\times\\ln\\left(p_i\\right)$$ **Rényi entropy** ("re") This is a generalisation for Shannon entropy. The formula used is (Rényi, 1961, p. 
549): $$H_q = \\frac{1}{1 - q}\\times\\log_2\\left(\\sum_{i=1}^k p_i^q\\right)$$ **Wilcox HREL** ("hrel") and **Pielou J** ("j") This uses Shannon's entropy but divides it over the maximum possible uncertainty. The formula used (Wilcox, 1973, p. 16): $$\\text{HREL} = \\frac{-\\sum_{i=1}^k p_i \\times \\text{log}_2 p_i}{\\text{log}_2 k}$$ This is the same as Pielou J. ("j") The formula used (Pielou, 1966, p. 141): $$J=\\frac{H_{sw}}{\\ln\\left(k\\right)}$$ **Sheldon Index** ("si") The formula used (Sheldon, 1969, p. 467): $$E = \\frac{e^{H_{sw}}}{k}$$ **Heip Index** ("hi") The formula used is (Heip, 1974, p. 555): $$E_h = \\frac{e^{H_{sw}} - 1}{k - 1}$$ **EVENNESS and DIVERSITY** **Hill Diversity** ("hd") The formula used is (Hill, 1973, p. 428): $$N_a = \\begin{cases}\\left(\\sum_{i=1}^k p_i^a\\right)^{\\frac{1}{1-a}} & \\text{ if } a\\neq 1 \\\\ e^{H_{sw}} & \\text{ if } =1 \\end{cases}$$ **Hill Eveness** ("he") The formula used is (Hill, 1973, p. 429): $$E_{a,b} = \\frac{N_a}{N_b}$$ Where \\(N_a\\) and \\(N_b\\) are Hill's diversity values for a and b. **Bulla E** ("be") Bulla's evenness measure. The formula used is (Bulla, 1994, pp. 168-169): $$E_b = \\frac{O - \\frac{1}{k} - \\frac{k - 1}{n}}{1 - \\frac{1}{k} - \\frac{k - 1}{n}}$$ With: $$O = \\sum_{i=1}^k \\min\\left(p_i, \\frac{1}{k}\\right)$$ **Bulla D** ("bd") Bulla's Evenness measure converted to a diversity measure. The formula used is (Bulla, 1994, p. 169): $$D_b = E_b\\times k$$ Where \\(E_b\\) is Bulla E value. With: $$O = \\sum_{i=1}^k \\min\\left(p_i, \\frac{1}{k}\\right)$$ **Simpson D** ("d1", "d2", "d3", "d4" = Gibbs-Poston M1) The formula used is based on Simpson (1949, p. 688): $$D_1 = \\frac{\\sum_{i=1}^k F_i\\times\\left(F_i-1\\right)}{n\\times\\left(n-1\\right)}$$ Another alternative is for a population: $$D_2 = \\sum_{i=1}^k\\left(\\frac{F_i}{n}\\right)^2$$ Often the result is subtracted from 1 to reverse the scale. 
$$D_3 = 1-\\frac{\\sum_{i=1}^k F_i\\times\\left(F_i-1\\right)}{n\\times\\left(n-1\\right)}$$ and $$D_4 = 1 - \\sum_{i=1}^k\\left(\\frac{F_i}{n}\\right)^2$$ This last one is then the same as Gibb-Poston M1 (Gibbs & Poston, 1975, p. 471): $$\\text{M1} = 1 - \\sum_{i=1}^k p_i^2$$ **Gibbs-Poston M3** ("m3") The formula used (Gibbs & Poston, 1975, p. 472): $$\\text{M3} = \\frac{1-\\sum_{i=1}^k p_i^2-p_{min}}{1-\\frac{1}{k}-p_{min}}$$ With \\(p_{min}\\) the lowest proportion **Smith & Wilson E2** ("sw2") The formula used (Smith & Wilson, 1996, p. 71): $$E_2 = \\frac{\\ln\\left(D_s\\right)}{\\ln\\left(k\\right)}$$ With \\(D_s\\) being Simpson's D, but defined as: $$D_s = \\sum_{i=1}^k\\left(\\frac{F_i}{n}\\right)^2$$ **Smith & Wilson E3** ("sw3") The formula used (Smith & Wilson, 1996, p. 71): $$E_3 = \\frac{1}{D_s \\times k}$$ With \\(D_s\\) being Simpson's D, but defined as: $$D_s = \\sum_{i=1}^k\\left(\\frac{F_i}{n}\\right)^2$$ **Fisher alpha** ("fisher") The formula used (Fisher et al., 1943, p. 55): $$k = \\alpha \\times \\ln\\left(1 + \\frac{n}{\\alpha}\\right)$$ The function uses a simple binary search to find the value for \\(\\alpha\\) such that the result of the above formula will produce the number of categories ( \\(k\\) ). **OTHER*** **Wilcox MNDIF** ("mndif") Analog of the mean difference measure for scale variables. The formula used is (Wilcox, 1973, p. 9): $$\\text{MNDIF} = 1-\\frac{\\sum_{i=1}^{k-1}\\sum_{j=i+1}^k \\left|F_i-F_j\\right|}{n\\times\\left(k-1\\right)}$$ **Kaiser b** The formula used (Kaiser, 1968, p. 211): $$B = 1 - \\sqrt{1 - \\left(\\sqrt[k]{\\prod_{i=1}^k\\frac{f_i\\times k}{n}}\\right)^2}$$ Kaiser also provides rules-of-thumb for interpretation. See **th_kaiser_b()** for more details on this. 
Before, After and Alternatives ------------------------------ Before this an impression using a frequency table or a visualisation might be helpful: * [tab_frequency](../other/table_frequency.html#tab_frequency) * [vi_bar_simple](../visualisations/vis_bar_simple.html#vi_bar_simple) for Simple Bar Chart * [vi_cleveland_dot_plot](../visualisations/vis_cleveland_dot_plot.html#vi_cleveland_dot_plot) for Cleveland Dot Plot * [vi_dot_plot](../visualisations/vis_dot_plot.html#vi_dot_plot) for Dot Plot * [vi_pareto_chart](../visualisations/vis_pareto_chart.html#vi_pareto_chart) for Pareto Chart * [vi_pie](../visualisations/vis_pie.html#vi_pie) for Pie Chart After this you might want to perform a test: * [ts_pearson_gof](../tests/test_pearson_gof.html#ts_pearson_gof) for Pearson Chi-Square Goodness-of-Fit Test * [ts_freeman_tukey_gof](../tests/test_freeman_tukey_gof.html#ts_freeman_tukey_gof) for Freeman-Tukey Test of Goodness-of-Fit * [ts_freeman_tukey_read](../tests/test_freeman_tukey_read.html#ts_freeman_tukey_read) for Freeman-Tukey-Read Test of Goodness-of-Fit * [ts_g_gof](../tests/test_g_gof.html#ts_g_gof) for G (Likelihood Ratio) Goodness-of-Fit Test * [ts_mod_log_likelihood_gof](../tests/test_mod_log_likelihood_gof.html#ts_mod_log_likelihood_gof) for Mod-Log Likelihood Test of Goodness-of-Fit * [ts_multinomial_gof](../tests/test_multinomial_gof.html#ts_multinomial_gof) for Multinomial Goodness-of-Fit Test * [ts_neyman_gof](../tests/test_neyman_gof.html#ts_neyman_gof) for Neyman Test of Goodness-of-Fit * [ts_powerdivergence_gof](../tests/test_powerdivergence_gof.html#ts_powerdivergence_gof) for Power Divergence GoF Test References ---------- Berger, W. H., & Parker, F. L. (1970). Diversity of planktonic foraminifera in deep-sea sediments. *Science, 168*(3937), 1345–1347. doi:10.1126/science.168.3937.1345 Bulla, L. (1994). An index of evenness and its associated diversity measure. *Oikos, 70*(1), 167–171. doi:10.2307/3545713 Fisher, R. A., Corbet, A. 
S., & Williams, C. B. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. *The Journal of Animal Ecology, 12*(1), 42–58. doi:10.2307/1411 Freeman, L. C. (1965). *Elementary applied statistics: For students in behavioral science*. Wiley. Gibbs, J. P., & Poston, D. L. (1975). The division of labor: Conceptualization and related measures. *Social Forces, 53*(3), 468. doi:10.2307/2576589 Heip, C. (1974). A new index measuring evenness. *Journal of the Marine Biological Association of the United Kingdom, 54*(3), 555–557. doi:10.1017/S0025315400022736 Hill, M. O. (1973). Diversity and evenness: A unifying notation and its consequences. *Ecology, 54*(2), 427–432. doi:10.2307/1934352 Kaiser, H. F. (1968). A measure of the population quality of legislative apportionment. *American Political Science Review, 62*(1), 208–215. doi:10.2307/1953335 Pielou, E. C. (1966). The measurement of diversity in different types of biological collections. *Journal of Theoretical Biology, 13*, 131–144. doi:10.1016/0022-5193(66)90013-0 Rényi, A. (1961). On measures of entropy and information. *Contributions to the Theory of Statistics, 1*, 547–562. Shannon, C. E., & Weaver, W. (1949). *The mathematical theory of communication*. The university of Illinois press. Sheldon, A. L. (1969). Equitability indices: Dependence on the species count. *Ecology, 50*(3), 466–467. doi:10.2307/1933900 Simpson, E. H. (1949). Measurement of diversity. *Nature, 163*(4148), Article 4148. doi:10.1038/163688a0 Smith, B., & Wilson, J. B. (1996). A consumer’s guide to evenness indices. *Oikos, 76*(1), 70–82. doi:10.2307/3545749 Wilcox, A. R. (1973). Indices of qualitative variation and political measurement. *Political Research Quarterly, 26*(2), 325–343. doi:10.1177/106591297302600209 Author ------ Made by P. 
Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples
--------
Example 1: pandas series
>>> df1 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv', sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df1['mar1']
>>> me_qv(ex1)
      value                  measure           source
0  0.499227  Freeman Variation Ratio  (Freeman, 1965)

Example 2: a list
>>> ex2 = ["MARRIED", "DIVORCED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "NEVER MARRIED", "MARRIED", "MARRIED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "MARRIED"]
>>> me_qv(ex2, "swe")
      value                 measure                           source
0  1.296892  Shannon-Weaver Entropy  (Shannon & Weaver, 1949, p. 20)

'''
    if type(data) is list:
        data = pd.Series(data)

    # observed frequencies, number of categories, sample size,
    # modal frequency, and proportions
    freqs = data.value_counts().values
    k = len(freqs)
    n = sum(freqs)
    fm = max(freqs)
    props = freqs/n

    if measure=="modvr":
        #Modified Variation Ratio
        src = "(Wilcox, 1973, p. 7)"
        lbl = "Wilcox MODVR"
        # equals k*(n - fm)/(n*(k - 1)); 1 when all categories are equally frequent
        qv = 1 - sum(fm - freqs)/(n*(k - 1))
    elif measure=="ranvr":
        #Range Variation Ratio
        src = "(Wilcox, 1973, p. 8)"
        lbl = "Wilcox RANVR"
        fl = min(freqs)
        qv = 1 - (fm - fl)/fm
    elif measure=="avdev":
        #Average Deviation
        src = "(Wilcox, 1973, p. 9)"
        lbl = "Wilcox AVDEV"
        qv = 1-sum(abs(freqs-n/k)) / (2*n/k*(k-1))
    elif measure=="mndif":
        #MNDif
        src = "(Wilcox, 1973, p. 9)"
        lbl = "Wilcox MNDIF"
        mndif = 0
        for i in range(0, k-1):
            for j in range(i+1,k):
                mndif = mndif + abs(freqs[i]-freqs[j])
        qv = 1 - mndif/(n*(k-1))
    elif measure=="varnc":
        #VarNC
        src = "(Wilcox, 1973, p. 11)"
        lbl = "Wilcox VARNC"
        qv = 1 - sum((freqs-n/k)**2)/(n**2*(k-1)/k)
    elif measure=="stdev":
        src = "(Wilcox, 1973, p. 14)"
        lbl = "Wilcox STDEV"
        qv = 1 - (sum((freqs-n/k)**2)/((n-n/k)**2+(k-1)*(n/k)**2))**0.5
    elif measure=="hrel":
        #HRel
        src = "(Wilcox, 1973, p. 16)"
        lbl = "Wilcox HREL"
        hrel = 0
        for i in range(k):
            hrel = hrel + props[i]*math.log2(props[i])
        qv = -hrel/math.log2(k)
    elif measure=="m1":
        src = "(Gibbs & Poston, 1975, p. 471)"
        lbl = "Gibbs-Poston M1"
        qv = 1 - sum(props**2)
    elif measure=="m2":
        #equal to varnc
        src = "(Gibbs & Poston, 1975, p. 472)"
        lbl = "Gibbs-Poston M2"
        qv = (1 - sum(props**2)) / (1-1/k)
    elif measure=="m3":
        src = "(Gibbs & Poston, 1975, p. 472)"
        lbl = "Gibbs-Poston M3"
        pl = min(props)
        qv = (1 - sum(props**2) - pl) / (1-1/k - pl)
    elif measure=="m4":
        src = "(Gibbs & Poston, 1975, p. 473)"
        lbl = "Gibbs-Poston M4"
        fmean = n/k
        qv = 1 - sum(abs(freqs-fmean))/(2*n)
    elif measure=="m5":
        src = "(Gibbs & Poston, 1975, p. 474)"
        lbl = "Gibbs-Poston M5"
        fmean = n/k
        qv = 1 - sum(abs(freqs-fmean))/(2*(n-k+1-fmean))
    elif measure=="m6":
        src = "(Gibbs & Poston, 1975, p. 474)"
        lbl = "Gibbs-Poston M6"
        fmean = n/k
        qv = k*(1 - sum(abs(freqs-fmean))/(2*n))
    elif measure=="b":
        #Kaiser B index
        src = "(Kaiser, 1968, p. 211)"
        lbl = "Kaiser b"
        qv = 1 - (1 - ((math.prod(freqs*k/n))**(1/k))**2)**0.5
    elif measure=="bd":
        #Bulla D
        src = "(Bulla, 1994, p. 169)"
        lbl = "Bulla D"
        o = 0
        for p in props:
            o = o + min(p, 1/k)
        qv = k*(o - 1/k + (k - 1)/n)/(1 - 1/k + (k-1)/n)
    elif measure=="be":
        #Bulla E
        src = "(Bulla, 1994, pp. 168-169)"
        lbl = "Bulla E"
        o = 0
        for p in props:
            o = o + min(p, 1/k)
        qv = (o - 1/k + (k - 1)/n)/(1 - 1/k + (k-1)/n)
    elif measure=="bpi":
        #Berger-Parker Index
        src = "(Berger & Parker, 1970, p. 1345)"
        lbl = "Berger-Parker D"
        qv = fm/n
    elif measure=="d1":
        #Simpson's D
        src = "(Simpson, 1949, p. 688)"
        lbl = "Simpson D"
        qv = sum(freqs*(freqs-1))/(n*(n-1))
    elif measure=="d2":
        #Simpson's D, biased version
        src = "(Smith & Wilson, 1996, p. 71)"
        lbl = "Simpson D biased"
        qv = sum((freqs/n)**2)
    elif measure=="d3":
        #Simpson's D as diversity
        src = "(Wikipedia, n.d.)"
        lbl = "Simpson D as diversity"
        qv = 1 - sum(freqs*(freqs-1))/(n*(n-1))
    elif measure=="d4":
        #Simpson's D as diversity, biased version
        src = "(Berger & Parker, 1970, p. 1345)"
        lbl = "Simpson D as diversity biased"
        qv = 1 - sum((freqs/n)**2)
    elif measure=="hd":
        #Hill's Diversity (var1 is the order; order 1 is the limiting case)
        src = "(Hill, 1973, p. 428)"
        lbl = "Hill Diversity"
        if var1 == 1:
            qv = math.exp(-1*sum(props*log(props)))
        else:
            qv = (sum(props**var1)**(1/(1-var1)))
    elif measure=="he":
        #Hill's Evenness (ratio of Hill diversities of order var1 and var2)
        src = "(Hill, 1973, p. 429)"
        lbl = "Hill Evenness"
        qv = me_qv(data, measure="hd", var1=var1)['value']/me_qv(data, measure="hd", var1=var2)['value']
        qv = qv.values
    elif measure=="hi":
        #Heip Index
        src = "(Heip, 1974, p. 555)"
        lbl = "Heip Evenness"
        h = -1*sum(props*log(props))
        qv = (math.exp(h) - 1)/(k - 1)
    elif measure=="j":
        #Pielou J
        src = "(Pielou, 1966, p. 141)"
        lbl = "Pielou J"
        h = -1*sum(props*log(props))
        qv = h/log(k)
    elif measure=="si":
        #Sheldon Index
        src = "(Sheldon, 1969, p. 467)"
        lbl = "Sheldon Evenness"
        h = -1*sum(props*log(props))
        qv = math.exp(h)/k
    elif measure=="sw1":
        #Smith and Wilson Index 1
        src = "(Smith & Wilson, 1996, p. 71)"
        lbl = "Smith-Wilson Evenness Index 1"
        d = sum(props**2)
        qv = (1 - d)/(1 - 1/k)
    elif measure=="sw2":
        #Smith and Wilson Index 2
        src = "(Smith & Wilson, 1996, p. 71)"
        lbl = "Smith-Wilson Evenness Index 2"
        d = sum(props**2)
        qv = -log(d)/log(k)
    elif measure=="sw3":
        #Smith and Wilson Index 3
        src = "(Smith & Wilson, 1996, p. 71)"
        lbl = "Smith-Wilson Evenness Index 3"
        d = sum(props**2)
        qv = 1/(d*k)
    elif measure=="swe":
        #Shannon-Weaver Entropy
        src = "(Shannon & Weaver, 1949, p. 20)"
        lbl = "Shannon-Weaver Entropy"
        qv = -1*sum(props*log(props))
    elif measure=="re":
        #Rényi Entropy (var1 is the order)
        src = "(Rényi, 1961, p. 549)"
        lbl = "Rényi Entropy"
        qv = 1/(1 - var1)*math.log2(sum(props**var1))
    elif measure=="vr":
        #Variation Ratio
        src = "(Freeman, 1965)"
        lbl = "Freeman Variation Ratio"
        pm = fm/n
        qv = 1 - pm
    elif measure=="fisher":
        #Fisher alpha: iteratively search for a such that a*log(1 + n/a) equals k
        src ="(Fisher et al., 1943, p. 55)"
        lbl = "Fisher alpha"
        maxIter=100
        a1 = 1
        k1 = a1 * log(1 + n/a1)
        if k1 != k:
            # second starting value, chosen on the other side of a1
            if k1 > k:
                a2 = 0.5
            else:
                a2 = 2
            k2 = a2 * log(1 + n / a2)
            if k2 != k:
                k3 = k2
                iters = 0
                while iters < maxIter and k3 != k:
                    iters = iters + 1
                    if k2 > k:
                        if k1 > k:
                            a3 = a2 - abs(a2 - a1)
                        else:
                            a3 = a2 - abs(a2 - a1) / 2
                    else:
                        if k1 < k:
                            a3 = a2 + abs(a2 - a1)
                        else:
                            a3 = a2 + abs(a2 - a1) / 2
                    if a3 == 0:
                        a3 = a2 - abs(a2 - a1) / 2
                    k3 = a3 * log(1 + n / a3)
                    a1 = a2
                    a2 = a3
                    k1 = k2
                    k2 = k3
            else:
                a3 = a2
        else:
            a3 = a1
        qv = a3

    results = pd.DataFrame([[qv, lbl, src]], columns=["value", "measure", "source"])
    pd.set_option('display.max_colwidth', None)

    return (results)
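
To make a few of the formulas in the source above easier to check by hand, here is a minimal sketch that recomputes them with the standard library only. The helper name `qv_sketch` and its dictionary keys are hypothetical and not part of stikpetP. It assumes natural logarithms for the entropy-based measures, as in the `swe` and `j` branches, and it only covers the variation ratio, Shannon-Weaver entropy, Pielou J, Gibbs-Poston M1 and Hill diversity of order 2.

```python
import math
from collections import Counter


def qv_sketch(data):
    """Hypothetical helper: recompute a few measures without pandas or numpy."""
    freqs = list(Counter(data).values())   # observed frequency per category
    n = sum(freqs)                         # sample size
    k = len(freqs)                         # number of categories
    props = [f / n for f in freqs]         # proportions

    h = -sum(p * math.log(p) for p in props)   # Shannon-Weaver entropy
    d = sum(p ** 2 for p in props)             # sum of squared proportions

    return {
        "vr": 1 - max(freqs) / n,   # Freeman Variation Ratio
        "swe": h,                   # Shannon-Weaver Entropy
        "j": h / math.log(k),       # Pielou J
        "m1": 1 - d,                # Gibbs-Poston M1
        "hd2": 1 / d,               # Hill diversity of order 2
    }


# same data as Example 2 in the docstring
ex2 = ["MARRIED", "DIVORCED", "MARRIED", "SEPARATED", "DIVORCED",
       "NEVER MARRIED", "DIVORCED", "DIVORCED", "NEVER MARRIED", "MARRIED",
       "MARRIED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED",
       "NEVER MARRIED", "DIVORCED", "DIVORCED", "MARRIED"]

out = qv_sketch(ex2)
print(round(out["swe"], 6))  # 1.296892, the value reported in Example 2
```

With 7 of the 19 cases in the modal category (DIVORCED), the variation ratio in this sketch is 12/19, so a little under two thirds of the cases fall outside the mode.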