Module `stikpetP.visualisations.vis_spine_plot`

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from ..other.table_cross import tab_cross

def vi_spine_plot(field1, field2, categories1=None, categories2=None):
    '''
    Spine Plot / Marimekko Chart / Mosaic Plot
    ------------------------------------------
    A spine plot is similar to a multiple stacked bar-chart, but "the difference is that the bars fill the plot vertically so the shading gives us proportions instead of counts. Also, the width of each bar varies, reflecting the marginal proportion of observations in each workshop" (Muenchen, 2009, p. 286).
    
    It is a chart you could use when you have two nominal variables and no clear independent and dependent variable. Otherwise a multiple/clustered bar-chart might be preferred.
    
    Parameters
    ----------
    field1 : pandas series
        data with categories for the rows
    field2 : pandas series
        data with categories for the columns
    categories1 : list or dictionary, optional
        the categories to use from field1. 
    categories2 : list or dictionary, optional
        the categories to use from field2. 
        
    Returns
    -------
    None
        The spine plot is drawn and shown with matplotlib (`plt.show()`); no object is returned.
    
    Notes
    -----
    The naming of this diagram is unfortunately not very clear. I use the term 'spine plot' for a special case of a Mosaic Plot. Mosaic Plots are often attributed to Hartigan and Kleiner (for example by Friendly, 2002, p. 90), but earlier versions are known, for example Walker (1874, p. PI XX). Hartigan and Kleiner (1981) start their paper with a Mosaic Plot for a cross table and end it by showing Mosaic Plots for multidimensional cross tables.
    
    A Marimekko Chart is simply an alternative name for the Mosaic Plot, although according to Wikipedia "mosaic plots can be colored and shaded according to deviations from independence, whereas Marimekko charts are colored according to the category levels" (Wikipedia, 2022).
    
    The term 'Spine Plot' itself is often attributed to Hummel, but I've been unable to hunt down his original article: Linked bar charts: Analysing categorical data graphically. Computational Statistics 11: 23–33.
    
    References
    ----------
    Carvalho, T. (2021, April 10). Marimekko Charts with Python’s Matplotlib. Medium. https://towardsdatascience.com/marimekko-charts-with-pythons-matplotlib-6b9784ae73a1
    
    Friendly, M. (2002). A brief history of the mosaic display. *Journal of Computational and Graphical Statistics, 11*(1), 89–107. https://doi.org/10.1198/106186002317375631
    
    Hartigan, J. A., & Kleiner, B. (1981). Mosaics for contingency tables. In W. F. Eddy (Ed.), Proceedings of the 13th Symposium on the Interface (pp. 268–273). Springer. https://doi.org/10.1007/978-1-4613-9464-8_37
    
    Muenchen, R. A. (2009). *R for SAS and SPSS Users*. Springer.
    
    Walker, F. A. (1874). *Statistical atlas of the United States based on the results of the ninth census 1870*. Census Office.
    
    Wikipedia. (2022). Mosaic plot. In Wikipedia. https://en.wikipedia.org/w/index.php?title=Mosaic_plot&oldid=1089465331
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    >>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
    >>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> vi_spine_plot(df1['mar1'], df1['sex'])
    
    '''
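    # Cross table of counts: the rows of ct become the bars, the columns the stacked segments
    ct = tab_cross(field2, field1, order1=categories2, order2=categories1, percent=None, totals="exclude")
    
    # Bar widths are the marginal proportions of the row categories
    x = np.array(ct.sum(axis=1))
    x_label = np.array(ct.index)
    k1 = len(x_label)
    width = x/sum(x)
    
    # Left edge of each bar: cumulative width of the bars before it
    adjusted_x, temp = [0], 0
    for i in width[:-1]:
        temp += i
        adjusted_x.append(temp)
        
    # Row percentages rescaled to proportions, so the segments of each bar sum to 1
    ct_rowProp = tab_cross(field2, field1, order1=categories2, order2=categories1, percent="row", totals="exclude")/100
    legend_labels = list(ct.columns)
    k2 = len(legend_labels)
    
    # Segment heights per column category, preceded by a row of zeros
    ys = [np.zeros(k1)]
    for i in range(0,k2):
        ys.append(np.array(ct_rowProp.iloc[:,i]))
    
    # Cumulative heights: y_bottom[i] is where segment i starts in each bar
    y_bottom = np.array(ys).cumsum(axis=0)
    
    # Draw one set of stacked segments per column category
    fig, ax = plt.subplots(1)
    for i in range(0,k2):
        plt.bar(adjusted_x, ys[i+1], bottom=y_bottom[i], width=width, align='edge', edgecolor='black')
        
    # Both axes run from 0% to 100%
    ax.set_yticks([0, 0.25, 0.5, 0.75, 1])
    ax.set_yticklabels(['0%', '25%', '50%', '75%', '100%'])
    ax.set_xticks([0, 0.25, 0.5, 0.75, 1])
    ax.set_xticklabels(['0%', '25%', '50%', '75%', '100%'])
    plt.ylim(0,1)
    plt.xlim(0,1)
    plt.legend(legend_labels)

    # Second x-axis on top, with the row category labels centred above each bar
    axy = ax.twiny()
    axy.set_xticks([(width[i]/2)+ v for i, v in enumerate(adjusted_x)])
    axy.set_xticklabels(x_label, fontsize=14)

    plt.show()
    
    return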

Functions

def vi_spine_plot(field1, field2, categories1=None, categories2=None)

Spine Plot / Marimekko Chart / Mosaic Plot

A spine plot is similar to a multiple stacked bar-chart, but "the difference is that the bars fill the plot vertically so the shading gives us proportions instead of counts. Also, the width of each bar varies, reflecting the marginal proportion of observations in each workshop" (Muenchen, 2009, p. 286).

It is a chart you could use when you have two nominal variables and no clear independent and dependent variable. Otherwise a multiple/clustered bar-chart might be preferred.
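
As a rough illustration of the geometry described above, the sketch below derives the bar widths, left edges and segment heights from a small made-up cross table. It uses plain pandas (`pd.crosstab`) rather than the package's own `tab_cross`, and the data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two nominal variables.
df = pd.DataFrame({
    "sex": ["F", "F", "F", "F", "F", "F", "M", "M", "M", "M"],
    "status": ["single", "married", "married", "married", "single", "married",
               "single", "single", "married", "single"],
})

# Counts with one row per bar (sex) and one column per stacked segment (status).
counts = pd.crosstab(df["sex"], df["status"])

# Bar widths: marginal proportion of each row category.
widths = counts.sum(axis=1) / counts.to_numpy().sum()

# Left edge of each bar: cumulative width of the bars before it.
left_edges = widths.cumsum() - widths

# Segment heights: proportions within each bar, so every bar sums to 1.
heights = counts.div(counts.sum(axis=1), axis=0)

print(widths)
print(left_edges)
print(heights)
```

Because the widths sum to 1 and the heights within each bar sum to 1, the rectangles tile the unit square, and the area of each rectangle equals the joint proportion of that combination of categories.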

Parameters

field1 : pandas series: data with categories for the rows
field2 : pandas series: data with categories for the columns
categories1 : list or dictionary, optional: the categories to use from field1.
categories2 : list or dictionary, optional: the categories to use from field2.
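
How `tab_cross` handles these two arguments internally is not shown on this page, so the following is only a sketch of the general idea of selecting and ordering categories before cross-tabulating. It uses plain pandas, and the data and the chosen category lists are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
field1 = pd.Series(["married", "single", "divorced", "married", "single", "married"])
field2 = pd.Series(["F", "M", "F", "M", "F", "F"])

# Hypothetical selections: the categories to keep, in the order they should appear.
categories1 = ["single", "married"]
categories2 = ["M", "F"]

# Drop cases that fall outside the selected categories.
keep = field1.isin(categories1) & field2.isin(categories2)

# Cross-tabulate the remaining cases and impose the requested order.
ct = pd.crosstab(field2[keep], field1[keep])
ct = ct.reindex(index=categories2, columns=categories1, fill_value=0)
print(ct)
```

Here the single "divorced" case is left out, so the table is based on five of the six cases.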

Returns

None: the spine plot is drawn and shown with matplotlib (plt.show()); no object is returned.

Notes

The naming of this diagram is unfortunately not very clear. I use the term 'spine plot' for a special case of a Mosaic Plot. Mosaic Plots are often attributed to Hartigan and Kleiner (for example by Friendly, 2002, p. 90), but earlier versions are known, for example Walker (1874, p. PI XX). Hartigan and Kleiner (1981) start their paper with a Mosaic Plot for a cross table and end it by showing Mosaic Plots for multidimensional cross tables.

A Marimekko Chart is simply an alternative name for the Mosaic Plot, although according to Wikipedia "mosaic plots can be colored and shaded according to deviations from independence, whereas Marimekko charts are colored according to the category levels" (Wikipedia, 2022).

The term 'Spine Plot' itself is often attributed to Hummel, but I've been unable to hunt down his original article: Linked bar charts: Analysing categorical data graphically. Computational Statistics 11: 23–33.

References

Carvalho, T. (2021, April 10). Marimekko Charts with Python’s Matplotlib. Medium. https://towardsdatascience.com/marimekko-charts-with-pythons-matplotlib-6b9784ae73a1

Friendly, M. (2002). A brief history of the mosaic display. Journal of Computational and Graphical Statistics, 11(1), 89–107. https://doi.org/10.1198/106186002317375631

Hartigan, J. A., & Kleiner, B. (1981). Mosaics for contingency tables. In W. F. Eddy (Ed.), Proceedings of the 13th Symposium on the Interface (pp. 268–273). Springer. https://doi.org/10.1007/978-1-4613-9464-8_37

Muenchen, R. A. (2009). R for SAS and SPSS Users. Springer.

Walker, F. A. (1874). Statistical atlas of the United States based on the results of the ninth census 1870. Census Office.

Wikipedia. (2022). Mosaic plot. In Wikipedia. https://en.wikipedia.org/w/index.php?title=Mosaic_plot&oldid=1089465331

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples

>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> vi_spine_plot(df1['mar1'], df1['sex'])
