Module stikpetP.visualisations.vis_pareto_chart

import matplotlib.pyplot as plt
import pandas as pd

def vi_pareto_chart(data, varname=None):
    '''
    Pareto Chart
    ------------
    
    The Pareto Chart gets its name from the Pareto Principle, which is named after Vilfredo Pareto. This principle states that roughly 80% of consequences come from 20% of causes (Pareto, 1896).
    
    Unfortunately, there is no generally agreed-upon definition of a Pareto diagram. The most general description I’ve found is by Kemp and Kemp (2004), who mention that it is a name for a bar chart if the order of the bars has no meaning (i.e. for a nominal variable), and they only mention that the bars are then often placed in decreasing order.
    
    According to some authors a Pareto diagram is any diagram with the bars in order of size (Joiner, 1995; WhatIs.com, n.d.), while others suggest that a line representing the cumulative relative frequencies should also be included (Weisstein, 2002). Upton and Cook (2014) also add that the bars should not have any gaps, but many other authors ignore this.
    
    The following definition by the author is used: a bar chart where the bars are placed in descending order of frequency. Usually an ogive is added in the chart as well. An ogive (oh-jive) is: "the graphs of cumulative frequencies" (Kenney, 1939).
    
    A video on Pareto charts is available [here](https://youtu.be/kDp5zPfK-Po).

    This function is shown in this [YouTube video](https://youtu.be/lj25-nNRyBM) and the visualisation is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Visualisations/ParetoChart.html)
    
    Parameters
    ----------
    data : list or pandas series
        the data to visualise
    varname : string, optional 
        a name for the data, used as the x-axis label; if not provided, the name of the data variable is used
    
    Notes
    -----
    Uses matplotlib pyplot

    See Also
    --------
    Before the visualisation you might first want to get an impression using a frequency table:
    * [tab_frequency](../other/table_frequency.html#tab_frequency)

    After visualisation you might want some descriptive measures:
    * [me_mode](../measures/meas_mode.html#me_mode) for the mode
    * [me_qv](../measures/meas_qv.html#me_qv) for Measures of Qualitative Variation

    or perform a test:
    * [ts_pearson_gof](../tests/test_pearson_gof.html#ts_pearson_gof) for Pearson Chi-Square Goodness-of-Fit Test
    * [ts_freeman_tukey_gof](../tests/test_freeman_tukey_gof.html#ts_freeman_tukey_gof) for Freeman-Tukey Test of Goodness-of-Fit
    * [ts_freeman_tukey_read](../tests/test_freeman_tukey_read.html#ts_freeman_tukey_read) for Freeman-Tukey-Read Test of Goodness-of-Fit
    * [ts_g_gof](../tests/test_g_gof.html#ts_g_gof) for G (Likelihood Ratio) Goodness-of-Fit Test
    * [ts_mod_log_likelihood_gof](../tests/test_mod_log_likelihood_gof.html#ts_mod_log_likelihood_gof) for Mod-Log Likelihood Test of Goodness-of-Fit
    * [ts_multinomial_gof](../tests/test_multinomial_gof.html#ts_multinomial_gof) for Multinomial Goodness-of-Fit Test
    * [ts_neyman_gof](../tests/test_neyman_gof.html#ts_neyman_gof) for Neyman Test of Goodness-of-Fit
    * [ts_powerdivergence_gof](../tests/test_powerdivergence_gof.html#ts_powerdivergence_gof) for Power Divergence GoF Test
    
    Alternatives for this visualisation could be:
    * [vi_bar_simple](../visualisations/vis_bar_simple.html#vi_bar_simple) for Simple Bar Chart
    * [vi_cleveland_dot_plot](../visualisations/vis_cleveland_dot_plot.html#vi_cleveland_dot_plot) for Cleveland Dot Plot
    * [vi_dot_plot](../visualisations/vis_dot_plot.html#vi_dot_plot) for Dot Plot
    * [vi_pie](../visualisations/vis_pie.html#vi_pie) for Pie Chart
    
    References 
    ----------
    Joiner. (1995). Pareto charts: Plain & simple. Joiner Associates.
    
    Kemp, S. M., & Kemp, S. (2004). *Business statistics demystified*. McGraw-Hill.
    
    Kenney, J. F. (1939). *Mathematics of statistics; Part one*. Chapman & Hall.
    
    Pareto, V. (1896). *Cours d’économie politique* (Vol. 1). Lausanne.
    
    Upton, G. J. G., & Cook, I. (2014). *Dictionary of statistics* (3rd ed.). Oxford University Press.
    
    Weisstein, E. W. (2002). *CRC concise encyclopedia of mathematics* (2nd ed.). Chapman & Hall/CRC.
    
    WhatIs.com. (n.d.). What is Pareto chart (Pareto distribution diagram)? - Definition from WhatIs.com. Retrieved April 20, 2014, from http://whatis.techtarget.com/definition/Pareto-chart-Pareto-distribution-diagram
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    Example 1: pandas series
    >>> df1 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv', sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> ex1 = df1['mar1']
    >>> vi_pareto_chart(ex1);
    
    Example 2: a list
    >>> ex2 = ["MARRIED", "DIVORCED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "NEVER MARRIED", "MARRIED", "MARRIED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "MARRIED"]
    >>> vi_pareto_chart(ex2);
    
    '''
    
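    # Accept a plain list by converting it to a pandas Series.
    if isinstance(data, list):
        data = pd.Series(data)

    # Fall back to the name of the data (the Series name) when no label is given,
    # as the docstring describes; this stays None for an unnamed list.
    if varname is None:
        varname = data.name
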
    freq = data.value_counts()
    myKeys = freq.keys()
    myVals = freq.values

    myFreqTable = pd.DataFrame({'category': myKeys, 'Frequency': myVals})

    myFreqTable['Percent'] = myFreqTable['Frequency']/myFreqTable['Frequency'].sum()*100
    myFreqTable = myFreqTable.sort_values(by=['Frequency'], ascending=False)
    myFreqTable = myFreqTable.reset_index(drop=True)
    myFreqTable['Cumulative Percent'] = myFreqTable['Frequency'].cumsum() / myFreqTable['Frequency'].sum() * 100
    # Bars: frequencies on the left y-axis.
    fig, ax = plt.subplots()
    ax.set_xlabel(varname)
    ax.bar('category', 'Frequency', data=myFreqTable)
    ax.set_ylabel("count")
    ax.set_ylim(bottom=0)

    # Ogive: cumulative percentages on a second y-axis that shares the x-axis.
    ax2 = ax.twinx()
    ax2.plot('category', 'Cumulative Percent', data=myFreqTable, marker='o', color='red')
    ax2.set_ylabel("cumulative percent")
    ax2.set_ylim(bottom=0)

    plt.show()
    return

Functions

def vi_pareto_chart(data, varname=None)

Pareto Chart

The Pareto Chart gets its name from the Pareto Principle, which is named after Vilfredo Pareto. This principle states that roughly 80% of consequences come from 20% of causes (Pareto, 1896).

Unfortunately, there is no generally agreed-upon definition of a Pareto diagram. The most general description I’ve found is by Kemp and Kemp (2004), who mention that it is a name for a bar chart if the order of the bars has no meaning (i.e. for a nominal variable), and they only mention that the bars are then often placed in decreasing order.

According to some authors a Pareto diagram is any diagram with the bars in order of size (Joiner, 1995; WhatIs.com, n.d.), while others suggest that a line representing the cumulative relative frequencies should also be included (Weisstein, 2002). Upton and Cook (2014) also add that the bars should not have any gaps, but many other authors ignore this.

The following definition by the author is used: a bar chart where the bars are placed in descending order of frequency. Usually an ogive is added in the chart as well. An ogive (oh-jive) is: "the graphs of cumulative frequencies" (Kenney, 1939).
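The definition above (bars in descending order of frequency, plus an ogive of cumulative frequencies) can be illustrated without any plotting. The following is a minimal sketch using only the standard library, not the code of `vi_pareto_chart` itself; the helper name `pareto_table` is hypothetical.

```python
from collections import Counter
from itertools import accumulate


def pareto_table(values):
    """Return (category, count, cumulative percent) rows, largest count first."""
    # most_common() sorts by count in descending order.
    counts = Counter(values).most_common()
    total = sum(count for _, count in counts)
    # Running totals of the sorted counts give the ogive values.
    running = accumulate(count for _, count in counts)
    return [(category, count, 100 * cumulative / total)
            for (category, count), cumulative in zip(counts, running)]


rows = pareto_table(["A", "B", "A", "C", "A", "B"])
for category, count, cum_pct in rows:
    print(f"{category}: {count} ({cum_pct:.1f}%)")
# A: 3 (50.0%)
# B: 2 (83.3%)
# C: 1 (100.0%)
```

The counts are the bar heights and the cumulative percentages are the points of the ogive; the last cumulative value is always 100%.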

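A video on Pareto charts is available [here](https://youtu.be/kDp5zPfK-Po).

This function is shown in this [YouTube video](https://youtu.be/lj25-nNRyBM) and the visualisation is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Visualisations/ParetoChart.html).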

Parameters

data : list or pandas series
the data to visualise

varname : string, optional
a name for the data, used as the x-axis label; if not provided, the name of the data variable is used
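One way to read the `varname` fallback is sketched below. This is an illustration only, not the package's code: `resolve_label` is a hypothetical helper, and it assumes that "the name of the data variable" means the `name` attribute of a pandas series. A plain list has no such attribute, so its label stays `None`. The `NamedData` class is a stand-in for a series so that the sketch needs no imports.

```python
def resolve_label(data, varname=None):
    """Pick an axis label: an explicit varname first, then data.name if present."""
    if varname is not None:
        return varname
    # A pandas Series exposes a `name` attribute; a plain list does not.
    return getattr(data, "name", None)


class NamedData(list):
    """Stand-in for a pandas Series (hypothetical, for illustration only)."""
    name = "mar1"


print(resolve_label(NamedData(["MARRIED", "DIVORCED"])))      # mar1
print(resolve_label(["MARRIED", "DIVORCED"]))                  # None
print(resolve_label(["MARRIED"], varname="marital status"))    # marital status
```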

Notes

Uses matplotlib pyplot
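The two-axis layout can be sketched directly in matplotlib: bars for the counts on the left axis and a line for the cumulative percentages on a twin axis. This is a minimal sketch, not the function's implementation. It assumes the off-screen `Agg` backend and writes the image to an in-memory buffer instead of calling `plt.show()`; the counts are those of the list in Example 2, already sorted from largest to smallest.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt

# Counts from the Example 2 list, in descending order.
categories = ["DIVORCED", "MARRIED", "NEVER MARRIED", "SEPARATED"]
counts = [7, 6, 4, 2]

# Cumulative percentages for the ogive.
total = sum(counts)
cum_pct = []
running = 0
for count in counts:
    running += count
    cum_pct.append(100 * running / total)

# Bars on the left y-axis.
fig, ax = plt.subplots()
ax.bar(categories, counts)
ax.set_ylabel("count")
ax.set_ylim(bottom=0)

# Ogive on a second y-axis that shares the x-axis.
ax2 = ax.twinx()
ax2.plot(categories, cum_pct, marker="o", color="red")
ax2.set_ylabel("cumulative percent")
ax2.set_ylim(bottom=0)

# Save to memory rather than opening a window.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

Using `twinx()` keeps the two scales separate, so the count axis and the 0 to 100 percent axis do not distort each other.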

See Also

Before the visualisation you might first want to get an impression using a frequency table:

* [tab_frequency](../other/table_frequency.html#tab_frequency)

After visualisation you might want some descriptive measures:

* [me_mode](../measures/meas_mode.html#me_mode) for the mode
* [me_qv](../measures/meas_qv.html#me_qv) for Measures of Qualitative Variation

or perform a test:

* [ts_pearson_gof](../tests/test_pearson_gof.html#ts_pearson_gof) for Pearson Chi-Square Goodness-of-Fit Test
* [ts_freeman_tukey_gof](../tests/test_freeman_tukey_gof.html#ts_freeman_tukey_gof) for Freeman-Tukey Test of Goodness-of-Fit
* [ts_freeman_tukey_read](../tests/test_freeman_tukey_read.html#ts_freeman_tukey_read) for Freeman-Tukey-Read Test of Goodness-of-Fit
* [ts_g_gof](../tests/test_g_gof.html#ts_g_gof) for G (Likelihood Ratio) Goodness-of-Fit Test
* [ts_mod_log_likelihood_gof](../tests/test_mod_log_likelihood_gof.html#ts_mod_log_likelihood_gof) for Mod-Log Likelihood Test of Goodness-of-Fit
* [ts_multinomial_gof](../tests/test_multinomial_gof.html#ts_multinomial_gof) for Multinomial Goodness-of-Fit Test
* [ts_neyman_gof](../tests/test_neyman_gof.html#ts_neyman_gof) for Neyman Test of Goodness-of-Fit
* [ts_powerdivergence_gof](../tests/test_powerdivergence_gof.html#ts_powerdivergence_gof) for Power Divergence GoF Test

Alternatives for this visualisation could be:

* [vi_bar_simple](../visualisations/vis_bar_simple.html#vi_bar_simple) for Simple Bar Chart
* [vi_cleveland_dot_plot](../visualisations/vis_cleveland_dot_plot.html#vi_cleveland_dot_plot) for Cleveland Dot Plot
* [vi_dot_plot](../visualisations/vis_dot_plot.html#vi_dot_plot) for Dot Plot
* [vi_pie](../visualisations/vis_pie.html#vi_pie) for Pie Chart

References

Joiner. (1995). Pareto charts: Plain & simple. Joiner Associates.

Kemp, S. M., & Kemp, S. (2004). Business statistics demystified. McGraw-Hill.

Kenney, J. F. (1939). Mathematics of statistics; Part one. Chapman & Hall.

Pareto, V. (1896). Cours d’économie politique (Vol. 1). Lausanne.

Upton, G. J. G., & Cook, I. (2014). Dictionary of statistics (3rd ed.). Oxford University Press.

Weisstein, E. W. (2002). CRC concise encyclopedia of mathematics (2nd ed.). Chapman & Hall/CRC.

WhatIs.com. (n.d.). What is Pareto chart (Pareto distribution diagram)? - Definition from WhatIs.com. Retrieved April 20, 2014, from http://whatis.techtarget.com/definition/Pareto-chart-Pareto-distribution-diagram

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples

Example 1: pandas series

>>> df1 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv', sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df1['mar1']
>>> vi_pareto_chart(ex1);

Example 2: a list

>>> ex2 = ["MARRIED", "DIVORCED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "NEVER MARRIED", "MARRIED", "MARRIED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "MARRIED"]
>>> vi_pareto_chart(ex2);