Module stikpetP.visualisations.vis_butterfly_chart

import matplotlib.pyplot as plt
import pandas as pd
from ..other.table_cross import tab_cross

def vi_butterfly_chart(field1, field2, categories1=None, categories2=None, variation='butterfly'):
    '''
    Butterfly Chart / Tornado Chart / Pyramid Chart
    -----------------------------------------------
    A special case of a diverging bar chart that compares only two categories.
    
    Different names exist depending on how the bars are ordered. I've chosen to use 'butterfly' if no ordering is done, 'pyramid' if the bars are ordered from small (top) to large (bottom), and 'tornado' if they are ordered from large (top) to small (bottom).

    This function is shown in this [YouTube video](https://youtu.be/f_5dTS5gb-4) and the diagram is also discussed at [PeterStatistics.com](https://peterstatistics.com/Terms/Visualisations/PyramidChart.html).
    
    Parameters
    ----------
    field1 : pandas series
        data with categories for the rows
    field2 : pandas series
        data with categories for the columns
    categories1 : list or dictionary, optional
        the categories to use from field1. 
    categories2 : list or dictionary, optional
        the two categories to use from field2.
    variation : {"butterfly", "tornado", "pyramid"}, optional
        order of the bars
        
    Returns
    -------
    None
        the chart is displayed with `plt.show()`; nothing is returned
    
    Notes
    -----
    The term *butterfly chart* can for example be found in Hwang and Yoon (2021, p. 25). 
    
    The term *tornado diagram* can be found in the guide from the Project Management Institute (2013, p. 338). The term *funnel chart* is also sometimes used (for example, Jamsa, 2020, p. 135), but that name is also used for a different, more analytical scatterplot.
    
    The term *pyramid chart* can for example be found in Schwabish (2021, p. 185). It is very often used for comparing age distributions.
    
    References
    ----------
    Hwang, J., & Yoon, Y. (2021). Data analytics and visualization in quality analysis using Tableau. CRC Press.
    
    Jamsa, K. (2020). Introduction to data mining and analytics: With machine learning in R and Python. Jones & Bartlett Learning.
    
    Project Management Institute (Ed.). (2013). A guide to the project management body of knowledge (5th ed.). Project Management Institute, Inc.
    
    Schwabish, J. (2021). Better data visualizations: A guide for scholars, researchers, and wonks. Columbia University Press.
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    >>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
    >>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> vi_butterfly_chart(df1['mar1'], df1['sex'])
    
    >>> vi_butterfly_chart(df1['mar1'], df1['sex'], variation="pyramid")
    
    >>> vi_butterfly_chart(df1['mar1'], df1['sex'], variation="tornado")
    
    '''
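    # cross table of counts: field1 categories as rows, field2 categories as columns
    ct = tab_cross(field1, field2, order1=categories1, order2=categories2, percent=None, totals="exclude")
    k = len(ct.index)  # number of row categories
    if variation=='tornado':
        # redo the table with totals so the rows can be sorted on 'Total'
        ct = tab_cross(field1, field2, order1=categories1, order2=categories2, percent=None, totals="include")
        # ascending sort: barh draws the first row at the bottom, so the largest ends up on top
        ct = ct.sort_values(by=['Total'])
        # the 'Total' row sorts last; keep the k category rows and the two category columns
        ct = ct.iloc[0:k, 0:2]
    
    if variation=='pyramid':
        ct = tab_cross(field1, field2, order1=categories1, order2=categories2, percent=None, totals="include")
        # descending sort: the largest category ends up at the bottom
        ct = ct.sort_values(by=['Total'], ascending=False)
        # the 'Total' row sorts first, so skip it
        ct = ct.iloc[1:1+k, 0:2]
        
    y = ct.index
    scores1 = ct.iloc[:,0]
    scores2 = ct.iloc[:,1]
    # one shared axis limit so bar lengths are comparable between the two panels
    maxCount = max(ct.max(axis=1))
    xLim = maxCount + 0.5
    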
    
    fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(9, 6))

    axes[0].barh(y, scores1, align='center', color='royalblue')
    axes[0].set(title=ct.columns[0])

    axes[1].barh(y, scores2, align='center', color='orange')
    axes[1].set(title=ct.columns[1])
    axes[1].grid()

    axes[0].invert_xaxis()
    axes[0].grid()

    axes[0].set_xlim([xLim,0])
    axes[1].set_xlim([0,xLim])

    plt.subplots_adjust(wspace=0, hspace=0)
    plt.show()

Functions

def vi_butterfly_chart(field1, field2, categories1=None, categories2=None, variation='butterfly')

Butterfly Chart / Tornado Chart / Pyramid Chart

A special case of a diverging bar chart that compares only two categories.

Different names exist depending on how the bars are ordered. I've chosen to use 'butterfly' if no ordering is done, 'pyramid' if the bars are ordered from small (top) to large (bottom), and 'tornado' if they are ordered from large (top) to small (bottom).
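
The naming convention can be sketched without any plotting. The snippet below is a rough illustration, assuming the bars are ordered by the combined count of both groups and read from top to bottom; `order_rows` and the example counts are hypothetical and are not taken from the package.

```python
def order_rows(counts, variation="butterfly"):
    """Return row labels in the order implied by the chosen variation.

    counts maps a row label to a (left, right) pair of counts.
    Hypothetical helper for illustration only, not part of stikpetP.
    """
    labels = list(counts)
    if variation == "butterfly":
        # no ordering: keep the labels as given
        return labels
    totals = {label: sum(counts[label]) for label in labels}
    if variation == "pyramid":
        # smallest total first (top), largest last (bottom)
        return sorted(labels, key=totals.get)
    if variation == "tornado":
        # largest total first (top), smallest last (bottom)
        return sorted(labels, key=totals.get, reverse=True)
    raise ValueError(f"unknown variation: {variation!r}")


example = {"a": (5, 7), "b": (1, 2), "c": (10, 3)}
for name in ("butterfly", "pyramid", "tornado"):
    print(name, order_rows(example, name))
```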

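This function is shown in this YouTube video (https://youtu.be/f_5dTS5gb-4) and the diagram is also discussed at PeterStatistics.com (https://peterstatistics.com/Terms/Visualisations/PyramidChart.html).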

Parameters

field1 : pandas series
    data with categories for the rows
field2 : pandas series
    data with categories for the columns
categories1 : list or dictionary, optional
    the categories to use from field1.
categories2 : list or dictionary, optional
    the two categories to use from field2.
variation : {"butterfly", "tornado", "pyramid"}, optional
    order of the bars
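
To make the roles of the parameters concrete, here is a minimal sketch of the kind of count table the chart is built from: the values of `field1` become rows, the values of `field2` become the two columns, and the optional category lists select and order them. This is only an assumption-laden illustration in plain Python; `cross_counts` is a hypothetical name, the example data is made up, and the package's own `tab_cross` may behave differently (for instance with dictionaries or missing values).

```python
from collections import Counter


def cross_counts(rows, cols, row_cats=None, col_cats=None):
    """Count (row, column) pairs and return {row_label: (count, ...)}.

    Hypothetical helper for illustration only, not part of stikpetP.
    """
    # skip pairs where either value is missing (assumption made for this sketch)
    pairs = Counter((r, c) for r, c in zip(rows, cols)
                    if r is not None and c is not None)
    if row_cats is None:
        row_cats = sorted({r for r, _ in pairs})
    if col_cats is None:
        col_cats = sorted({c for _, c in pairs})
    # a Counter returns 0 for combinations that never occur
    return {r: tuple(pairs[(r, c)] for c in col_cats) for r in row_cats}


status = ["single", "married", "married", "single", "divorced", None]
sex = ["f", "m", "f", "f", "m", "m"]

print(cross_counts(status, sex))
print(cross_counts(status, sex, row_cats=["single", "married"], col_cats=["m", "f"]))
```

The second call mirrors what passing `categories1` and `categories2` is meant to do: restrict the rows and fix which group is drawn on which side.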

Returns

None
    the chart is displayed with plt.show(); nothing is returned

Notes

The term butterfly chart can for example be found in Hwang and Yoon (2021, p. 25).

The term tornado diagram can be found in the guide from the Project Management Institute (2013, p. 338). The term funnel chart is also sometimes used (for example, Jamsa, 2020, p. 135), but that name is also used for a different, more analytical scatterplot.

The term pyramid chart can for example be found in Schwabish (2021, p. 185). It is very often used for comparing age distributions.

References

Hwang, J., & Yoon, Y. (2021). Data analytics and visualization in quality analysis using Tableau. CRC Press.

Jamsa, K. (2020). Introduction to data mining and analytics: With machine learning in R and Python. Jones & Bartlett Learning.

Project Management Institute (Ed.). (2013). A guide to the project management body of knowledge (5th ed.). Project Management Institute, Inc.

Schwabish, J. (2021). Better data visualizations: A guide for scholars, researchers, and wonks. Columbia University Press.

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples

>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> vi_butterfly_chart(df1['mar1'], df1['sex'])
>>> vi_butterfly_chart(df1['mar1'], df1['sex'], variation="pyramid")
>>> vi_butterfly_chart(df1['mar1'], df1['sex'], variation="tornado")