Module stikpetP.visualisations.vis_pareto_chart
import matplotlib.pyplot as plt
import pandas as pd
def vi_pareto_chart(data, varname=None):
    '''
Pareto Chart
------------
The Pareto Chart gets its name from the Pareto Principle, which is named after Vilfredo Pareto. This principle states that roughly 80% of consequences come from 20% of causes (Pareto, 1896).
Unfortunately, there is no generally agreed-upon definition of a Pareto diagram. The most general description I’ve found is by Kemp and Kemp (2004), who mention it is a name for a bar chart when the order of the bars has no meaning (i.e. for a nominal variable); they only add that the bars are then often placed in decreasing order.
According to some authors a Pareto diagram is any diagram with the bars in order of size (Joiner, 1995; WhatIs.com, n.d.), while others suggest that a line representing the cumulative relative frequencies should also be included (Weisstein, 2002). Upton and Cook (2014) also add that the bars should not have any gaps, but many other authors ignore this.
The following definition by the author is used: a bar chart where the bars are placed in descending order of frequency. Usually an ogive is added in the chart as well. An ogive (oh-jive) is: "the graphs of cumulative frequencies" (Kenney, 1939).
A video on Pareto charts is available [here](https://youtu.be/kDp5zPfK-Po).
This function is shown in this [YouTube video](https://youtu.be/lj25-nNRyBM) and the visualisation is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Visualisations/ParetoChart.html)
Parameters
----------
data : list or pandas series
varname : string, optional
a name for the data, if not provided the name of the data variable is used
Notes
-----
Uses matplotlib pyplot
See Also
--------
Before the visualisation you might first want to get an impression using a frequency table:
* [tab_frequency](../other/table_frequency.html#tab_frequency)
After visualisation you might want some descriptive measures:
* [me_mode](../measures/meas_mode.html#me_mode) for the mode
* [me_qv](../measures/meas_qv.html#me_qv) for Measures of Qualitative Variation
or perform a test:
* [ts_pearson_gof](../tests/test_pearson_gof.html#ts_pearson_gof) for Pearson Chi-Square Goodness-of-Fit Test
* [ts_freeman_tukey_gof](../tests/test_freeman_tukey_gof.html#ts_freeman_tukey_gof) for Freeman-Tukey Test of Goodness-of-Fit
* [ts_freeman_tukey_read](../tests/test_freeman_tukey_read.html#ts_freeman_tukey_read) for Freeman-Tukey-Read Test of Goodness-of-Fit
* [ts_g_gof](../tests/test_g_gof.html#ts_g_gof) for G (Likelihood Ratio) Goodness-of-Fit Test
* [ts_mod_log_likelihood_gof](../tests/test_mod_log_likelihood_gof.html#ts_mod_log_likelihood_gof) for Mod-Log Likelihood Test of Goodness-of-Fit
* [ts_multinomial_gof](../tests/test_multinomial_gof.html#ts_multinomial_gof) for Multinomial Goodness-of-Fit Test
* [ts_neyman_gof](../tests/test_neyman_gof.html#ts_neyman_gof) for Neyman Test of Goodness-of-Fit
* [ts_powerdivergence_gof](../tests/test_powerdivergence_gof.html#ts_powerdivergence_gof) for Power Divergence GoF Test
Alternatives for this visualisation could be:
* [vi_bar_simple](../visualisations/vis_bar_simple.html#vi_bar_simple) for Simple Bar Chart
* [vi_cleveland_dot_plot](../visualisations/vis_cleveland_dot_plot.html#vi_cleveland_dot_plot) for Cleveland Dot Plot
* [vi_dot_plot](../visualisations/vis_dot_plot.html#vi_dot_plot) for Dot Plot
* [vi_pie](../visualisations/vis_pie.html#vi_pie) for Pie Chart
References
----------
Joiner. (1995). Pareto charts: Plain & simple. Joiner Associates.
Kemp, S. M., & Kemp, S. (2004). *Business statistics demystified*. McGraw-Hill.
Kenney, J. F. (1939). *Mathematics of statistics; Part one*. Chapman & Hall.
Pareto, V. (1896). *Cours d’économie politique* (Vol. 1). Lausanne.
Upton, G. J. G., & Cook, I. (2014). *Dictionary of statistics* (3rd ed.). Oxford University Press.
Weisstein, E. W. (2002). *CRC concise encyclopedia of mathematics* (2nd ed.). Chapman & Hall/CRC.
WhatIs.com. (n.d.). What is Pareto chart (Pareto distribution diagram)? - Definition from WhatIs.com. Retrieved April 20, 2014, from http://whatis.techtarget.com/definition/Pareto-chart-Pareto-distribution-diagram
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
--------
Example 1: pandas series
>>> df1 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv', sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df1['mar1']
>>> vi_pareto_chart(ex1);
Example 2: a list
>>> ex2 = ["MARRIED", "DIVORCED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "NEVER MARRIED", "MARRIED", "MARRIED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "MARRIED"]
>>> vi_pareto_chart(ex2);
'''
    # Accept plain lists by converting them to a pandas Series
    if isinstance(data, list):
        data = pd.Series(data)
    # Fall back to the Series name when no label is given (may still be None)
    if varname is None:
        varname = data.name

    # Frequency table, sorted from most to least frequent category
    freq = data.value_counts()
    myKeys = freq.keys()
    myVals = freq.values
    myFreqTable = pd.DataFrame({'category': myKeys, 'Frequency': myVals})
    myFreqTable['Percent'] = myFreqTable['Frequency']/myFreqTable['Frequency'].sum()*100
    myFreqTable = myFreqTable.sort_values(by=['Frequency'], ascending=False)
    myFreqTable = myFreqTable.reset_index(drop=True)
    myFreqTable['Cumulative Percent'] = myFreqTable['Frequency'].cumsum() / myFreqTable['Frequency'].sum() * 100

    # Bars for the counts on the left axis
    fig, ax = plt.subplots()
    ax.set_xlabel(varname)
    ax.bar('category', 'Frequency', data=myFreqTable)
    ax.set_ylabel("count")
    ax.set_ylim(bottom=0)

    # Ogive (cumulative percent) on a secondary axis
    ax2 = ax.twinx()
    ax2.plot('category', 'Cumulative Percent', data=myFreqTable, marker='o', color='red')
    ax2.set_ylabel("cumulative percent")
    ax2.set_ylim(bottom=0)

    plt.show()
    return
Functions
def vi_pareto_chart(data, varname=None)
Pareto Chart
The Pareto Chart gets its name from the Pareto Principle, which is named after Vilfredo Pareto. This principle states that roughly 80% of consequences come from 20% of causes (Pareto, 1896).
Unfortunately, there is no generally agreed-upon definition of a Pareto diagram. The most general description I’ve found is by Kemp and Kemp (2004), who mention it is a name for a bar chart when the order of the bars has no meaning (i.e. for a nominal variable); they only add that the bars are then often placed in decreasing order.
According to some authors a Pareto diagram is any diagram with the bars in order of size (Joiner, 1995; WhatIs.com, n.d.), while others suggest that a line representing the cumulative relative frequencies should also be included (Weisstein, 2002). Upton and Cook (2014) also add that the bars should not have any gaps, but many other authors ignore this.
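The "no gaps" variant mentioned by Upton and Cook could be drawn in matplotlib roughly as follows. This is only a sketch with illustrative counts; it is not what `vi_pareto_chart` does, and the choice of `width=1.0` with a black edge is an assumption about how one might make the bars touch.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Illustrative categories, already in descending order of frequency
categories = ["DIVORCED", "MARRIED", "NEVER MARRIED", "SEPARATED"]
counts = [7, 6, 4, 2]

fig, ax = plt.subplots()
# width=1.0 lets each bar fill its whole slot, so neighbouring bars touch;
# the edge colour keeps the individual bars distinguishable.
bars = ax.bar(categories, counts, width=1.0, edgecolor="black")
ax.set_ylabel("count")
ax.set_ylim(bottom=0)
```

With matplotlib's default width of 0.8 the bars are separated by small gaps, which is the layout most other authors appear to accept.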
The following definition by the author is used: a bar chart where the bars are placed in descending order of frequency. Usually an ogive is added in the chart as well. An ogive (oh-jive) is: "the graphs of cumulative frequencies" (Kenney, 1939).
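The numbers behind such a chart can be sketched with pandas: frequencies in descending order plus the cumulative percentages that form the ogive. The helper name `pareto_table` and its column names are hypothetical and only serve this illustration.

```python
import pandas as pd

def pareto_table(values):
    """Hypothetical helper: frequencies in descending order with cumulative percent."""
    counts = pd.Series(values).value_counts()  # sorted from most to least frequent
    table = pd.DataFrame({
        "category": counts.index,
        "frequency": counts.to_numpy(),
    })
    # Running total of the frequencies, expressed as a percentage of all cases
    table["cumulative_percent"] = (
        table["frequency"].cumsum() / table["frequency"].sum() * 100
    )
    return table

table = pareto_table(["a", "b", "a", "c", "a", "b"])
print(table)
```

The bars of the chart show the `frequency` column and the ogive connects the `cumulative_percent` values, which always end at 100.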
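A video on Pareto charts is available [here](https://youtu.be/kDp5zPfK-Po).
This function is shown in this [YouTube video](https://youtu.be/lj25-nNRyBM) and the visualisation is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Visualisations/ParetoChart.html).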
Parameters
data : list or pandas series
varname : string, optional
a name for the data, if not provided the name of the data variable is used
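One way the fallback for `varname` could work is sketched below. The helper `resolve_label` is hypothetical and not part of the package; it assumes that "the name of the data variable" means the `name` attribute of a pandas Series, which a plain list does not have.

```python
import pandas as pd

def resolve_label(data, varname=None):
    """Hypothetical helper: choose an axis label for the data."""
    if varname is not None:
        return varname
    # A pandas Series may carry a name (for example the column it came from);
    # a plain list has none, so fall back to an empty label.
    name = getattr(data, "name", None)
    return "" if name is None else str(name)

df = pd.DataFrame({"mar1": ["MARRIED", "DIVORCED", "MARRIED"]})
label_from_series = resolve_label(df["mar1"])
label_from_list = resolve_label(["MARRIED", "DIVORCED"])
label_explicit = resolve_label(df["mar1"], varname="marital status")
```

Under this assumption a column taken from a DataFrame is labelled with its column name, while a list needs an explicit `varname` to get a label.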
Notes
Uses matplotlib pyplot
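The general pyplot pattern for combining bars with an ogive is a second y-axis created with `twinx`. The following is a minimal sketch with illustrative counts, assuming the categories are already sorted by descending frequency; it uses the non-interactive `Agg` backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for on-screen plots
import matplotlib.pyplot as plt

categories = ["DIVORCED", "MARRIED", "NEVER MARRIED", "SEPARATED"]
counts = [7, 6, 4, 2]

# Cumulative percentages for the ogive
total = sum(counts)
cumulative_percent = []
running = 0
for c in counts:
    running += c
    cumulative_percent.append(running / total * 100)

fig, ax = plt.subplots()
ax.bar(categories, counts)          # counts on the left axis
ax.set_ylabel("count")
ax.set_ylim(bottom=0)

ax2 = ax.twinx()                    # second y-axis sharing the same x-axis
(line,) = ax2.plot(categories, cumulative_percent, marker="o", color="red")
ax2.set_ylabel("cumulative percent")
ax2.set_ylim(bottom=0)
```

Keeping the percentages on their own axis lets both the counts and the 0 to 100 scale stay readable in one figure.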
See Also
Before the visualisation you might first want to get an impression using a frequency table:
* [tab_frequency](../other/table_frequency.html#tab_frequency)
After visualisation you might want some descriptive measures:
* [me_mode](../measures/meas_mode.html#me_mode) for the mode
* [me_qv](../measures/meas_qv.html#me_qv) for Measures of Qualitative Variation
or perform a test:
* [ts_pearson_gof](../tests/test_pearson_gof.html#ts_pearson_gof) for Pearson Chi-Square Goodness-of-Fit Test
* [ts_freeman_tukey_gof](../tests/test_freeman_tukey_gof.html#ts_freeman_tukey_gof) for Freeman-Tukey Test of Goodness-of-Fit
* [ts_freeman_tukey_read](../tests/test_freeman_tukey_read.html#ts_freeman_tukey_read) for Freeman-Tukey-Read Test of Goodness-of-Fit
* [ts_g_gof](../tests/test_g_gof.html#ts_g_gof) for G (Likelihood Ratio) Goodness-of-Fit Test
* [ts_mod_log_likelihood_gof](../tests/test_mod_log_likelihood_gof.html#ts_mod_log_likelihood_gof) for Mod-Log Likelihood Test of Goodness-of-Fit
* [ts_multinomial_gof](../tests/test_multinomial_gof.html#ts_multinomial_gof) for Multinomial Goodness-of-Fit Test
* [ts_neyman_gof](../tests/test_neyman_gof.html#ts_neyman_gof) for Neyman Test of Goodness-of-Fit
* [ts_powerdivergence_gof](../tests/test_powerdivergence_gof.html#ts_powerdivergence_gof) for Power Divergence GoF Test
Alternatives for this visualisation could be:
* [vi_bar_simple](../visualisations/vis_bar_simple.html#vi_bar_simple) for Simple Bar Chart
* [vi_cleveland_dot_plot](../visualisations/vis_cleveland_dot_plot.html#vi_cleveland_dot_plot) for Cleveland Dot Plot
* [vi_dot_plot](../visualisations/vis_dot_plot.html#vi_dot_plot) for Dot Plot
* [vi_pie](../visualisations/vis_pie.html#vi_pie) for Pie Chart
References
Joiner. (1995). Pareto charts: Plain & simple. Joiner Associates.
Kemp, S. M., & Kemp, S. (2004). Business statistics demystified. McGraw-Hill.
Kenney, J. F. (1939). Mathematics of statistics; Part one. Chapman & Hall.
Pareto, V. (1896). Cours d’économie politique (Vol. 1). Lausanne.
Upton, G. J. G., & Cook, I. (2014). Dictionary of statistics (3rd ed.). Oxford University Press.
Weisstein, E. W. (2002). CRC concise encyclopedia of mathematics (2nd ed.). Chapman & Hall/CRC.
WhatIs.com. (n.d.). What is Pareto chart (Pareto distribution diagram)? - Definition from WhatIs.com. Retrieved April 20, 2014, from http://whatis.techtarget.com/definition/Pareto-chart-Pareto-distribution-diagram
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
Example 1: pandas series
>>> df1 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv', sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df1['mar1']
>>> vi_pareto_chart(ex1);
Example 2: a list
>>> ex2 = ["MARRIED", "DIVORCED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "NEVER MARRIED", "MARRIED", "MARRIED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "MARRIED"]
>>> vi_pareto_chart(ex2);