Module stikpetP.effect_sizes.eff_size_odds_ratio

import math
from statistics import NormalDist
import pandas as pd
from ..other.table_cross import tab_cross

def es_odds_ratio(field1, field2, categories1=None, categories2=None):
    '''
    Odds Ratio
    ----------
    
    Determines the odds ratio from a 2x2 table.
    
    Odds are sometimes reported as 'a one in five odds' and sometimes as 1 : 4. The latter notation is seen less often; it means that for every one event on the left side, there will be four on the right side.
    
    The odds are the ratio of the probability that something will happen over the probability that it will not. For the odds ratio, we compare the odds of the first category with the odds of the second category.
    
    If the result is 1, it indicates that one variable has no influence on the other. A result higher than 1 indicates the odds are higher for the first category. A result lower than 1 indicates the odds are lower for the first category.
    
    Parameters
    ----------
    field1 : pandas series
        data with categories for the rows
    field2 : pandas series
        data with categories for the columns
    categories1 : list or dictionary, optional
        the two categories to use from field1. If not set the first two found will be used
    categories2 : list or dictionary, optional
        the two categories to use from field2. If not set the first two found will be used

    Returns
    -------
    A dataframe with:
    
    * *OR*, the odds ratio
    * *n*, the sample size
    * *statistic*, the test statistic (z-value)
    * *p-value*, the significance (p-value)
    
    Notes
    -----    
    The formula used is (Fisher, 1935, p. 50):
    $$OR = \\frac{a/c}{b/d} = \\frac{a\\times d}{b\\times c}$$
    
    *Symbols used:*
    
    * \\(a\\) the count in the top-left cell of the cross table
    * \\(b\\) the count in the top-right cell of the cross table 
    * \\(c\\) the count in the bottom-left cell of the cross table 
    * \\(d\\) the count in the bottom-right cell of the cross table
    * \\(\\Phi\\left(\\dots\\right)\\) the cumulative distribution function of the standard normal distribution
    
    As for the test (McHugh, 2009, p. 123):
    $$sig. = 2\\times\\left(1 - \\Phi\\left(\\left|z\\right|\\right)\\right)$$
    
    With:
    $$SE = \\sqrt{\\frac{1}{a} + \\frac{1}{b} + \\frac{1}{c} + \\frac{1}{d}}$$
    $$z = \\frac{\\ln{\\left(OR\\right)}}{SE}$$
    
    The p-value is for the null-hypothesis that the population OR is 1.
    
    The term Odds Ratio can for example be found in Cox (1958, p. 222).
    
    See Also
    --------
    stikpetP.other.thumb_odds_ratio.th_odds_ratio : rules of thumb for odds ratio
    
    stikpetP.other.convert_es.es_convert : to convert an odds ratio to Yule Q, Yule Y, or Cohen d.

    References
    ----------
    Cox, D. R. (1958). The regression analysis of binary sequences. *Journal of the Royal Statistical Society: Series B (Methodological), 20*(2), 215–232. https://doi.org/10.1111/j.2517-6161.1958.tb00292.x
    
    Fisher, R. A. (1935). The logic of inductive inference. *Journal of the Royal Statistical Society, 98*(1), 39–82. https://doi.org/10.2307/2342435
    
    McHugh, M. (2009). The odds ratio: Calculation, usage, and interpretation. *Biochemia Medica, 19*(2), 120–126. https://doi.org/10.11613/BM.2009.011
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    >>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
    >>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> es_odds_ratio(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"])
             OR    n  statistic   p-value
    0  1.750802  495    2.86455  0.004176
    
        
    '''
    # determine sample cross table
    tab = tab_cross(field1, field2, order1=categories1, order2=categories2, percent=None, totals="exclude")
    
    # cell values of sample cross table
    a = tab.iloc[0,0]
    b = tab.iloc[0,1]
    c = tab.iloc[1,0]
    d = tab.iloc[1,1]
    
    # odds ratio
    oddsRatio = a*d/(b*c)
    
    # significance
    se = (1/a + 1/b + 1/c + 1/d)**0.5
    z = math.log(oddsRatio)/se
    pValue = 2 * (1 - NormalDist().cdf(abs(z))) 
    n = a + b + c + d
    
    #the results
    colNames=["OR", "n", "statistic", "p-value"]
    results = pd.DataFrame([[oddsRatio, n, z, pValue]], columns=colNames)
    
    return results

Functions

def es_odds_ratio(field1, field2, categories1=None, categories2=None)

Odds Ratio

Determines the odds ratio from a 2x2 table.

Odds are sometimes reported as 'a one in five odds' and sometimes as 1 : 4. The latter notation is seen less often; it means that for every one event on the left side, there will be four on the right side.

The odds are the ratio of the probability that something will happen over the probability that it will not. For the odds ratio, we compare the odds of the first category with the odds of the second category.

If the result is 1, it indicates that one variable has no influence on the other. A result higher than 1 indicates the odds are higher for the first category. A result lower than 1 indicates the odds are lower for the first category.
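
The relation between a probability, the odds, and a ratio of two odds can be illustrated with a small sketch. The helper name `odds_from_probability` and the two probabilities are assumptions made up for this illustration; they are not part of stikpetP.

```python
from fractions import Fraction

def odds_from_probability(p):
    """Return the odds p / (1 - p) for a probability p with 0 <= p < 1.

    Illustrative helper only, not part of stikpetP.
    """
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)

# 'one in five' is a probability of 1/5, which corresponds to odds of 1 : 4
odds_first = odds_from_probability(Fraction(1, 5))
print(odds_first)  # 1/4

# a second, made-up group with a 'one in nine' chance has odds of 1 : 8
odds_second = odds_from_probability(Fraction(1, 9))

# the odds ratio compares the two odds; above 1 means higher odds for the first
ratio = odds_first / odds_second
print(ratio)  # 2
```

Exact fractions are used here only to keep the arithmetic readable; floats would work the same way.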

Parameters

field1 : pandas series
    data with categories for the rows
field2 : pandas series
    data with categories for the columns
categories1 : list or dictionary, optional
    the two categories to use from field1. If not set, the first two found will be used
categories2 : list or dictionary, optional
    the two categories to use from field2. If not set, the first two found will be used
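
To make the role of the category arguments concrete, the sketch below counts the four cells of a 2x2 table from two plain Python lists. It is a rough stand-in and not the library's `tab_cross` implementation: the helper `two_by_two`, its argument names, and the toy data are all hypothetical, and it handles only list-style categories. It assumes that "the first two found" means the first two distinct values in order of appearance.

```python
from collections import Counter

def two_by_two(rows, cols, row_cats=None, col_cats=None):
    """Count a, b, c, d for a 2x2 table (illustrative helper, not part of stikpetP)."""
    # assumption: default to the first two distinct values in order of appearance
    if row_cats is None:
        row_cats = list(dict.fromkeys(rows))[:2]
    if col_cats is None:
        col_cats = list(dict.fromkeys(cols))[:2]
    # count only the pairs that fall inside the selected categories
    counts = Counter(
        (r, c) for r, c in zip(rows, cols)
        if r in row_cats and c in col_cats
    )
    (r1, r2), (c1, c2) = row_cats, col_cats
    # a = top-left, b = top-right, c = bottom-left, d = bottom-right
    return counts[(r1, c1)], counts[(r1, c2)], counts[(r2, c1)], counts[(r2, c2)]

# toy data invented for this example
status = ["WIDOWED", "DIVORCED", "MARRIED", "WIDOWED", "DIVORCED", "WIDOWED"]
answer = ["yes", "no", "no", "yes", "yes", "no"]

a, b, c, d = two_by_two(status, answer, row_cats=["WIDOWED", "DIVORCED"])
print(a, b, c, d)  # 2 1 1 1
```

The "MARRIED" record is dropped because it is not one of the two selected row categories, which is why the four counts add up to five rather than six.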

Returns

A dataframe with:
 
  • OR, the odds ratio
  • n, the sample size
  • statistic, the test statistic (z-value)
  • p-value, the significance (p-value)

Notes

The formula used is (Fisher, 1935, p. 50):
$$OR = \frac{a/c}{b/d} = \frac{a\times d}{b\times c}$$

Symbols used:

  • a the count in the top-left cell of the cross table
  • b the count in the top-right cell of the cross table
  • c the count in the bottom-left cell of the cross table
  • d the count in the bottom-right cell of the cross table
  • \Phi\left(\dots\right) the cumulative distribution function of the standard normal distribution

As for the test (McHugh, 2009, p. 123):
$$sig. = 2\times\left(1 - \Phi\left(\left|z\right|\right)\right)$$

With:
$$SE = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$
$$z = \frac{\ln{\left(OR\right)}}{SE}$$

The p-value is for the null-hypothesis that the population OR is 1.
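
As a worked illustration of the formulas above, the following sketch computes OR, SE, z, and the two-sided p-value directly from four cell counts, using only the standard library. The function name `odds_ratio_test` and the counts are made up for this example, and the explicit check for zero cells is an addition of this sketch, not something the formulas above prescribe.

```python
import math
from statistics import NormalDist

def odds_ratio_test(a, b, c, d):
    """Return (OR, SE, z, p) for a 2x2 table with cell counts a, b, c, d.

    Illustrative helper only, not part of stikpetP.
    """
    # assumption of this sketch: refuse zero cells, since OR or SE would be undefined
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = math.log(odds_ratio) / se
    # two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return odds_ratio, se, z, p_value

# made-up counts: a=20, b=10, c=10, d=20
or_, se, z, p = odds_ratio_test(20, 10, 10, 20)
print(or_)  # 4.0

# swapping the two rows inverts the odds ratio and flips the sign of z,
# while the two-sided p-value is unchanged
or_swapped, _, z_swapped, p_swapped = odds_ratio_test(10, 20, 20, 10)
print(or_swapped)  # 0.25
```

With these counts, SE is the square root of 0.3, so z is ln(4) divided by roughly 0.548.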

The term Odds Ratio can for example be found in Cox (1958, p. 222).

See Also

stikpetP.other.thumb_odds_ratio.th_odds_ratio : rules of thumb for odds ratio

stikpetP.other.convert_es.es_convert : to convert an odds ratio to Yule Q, Yule Y, or Cohen d.

References

Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological), 20(2), 215–232. https://doi.org/10.1111/j.2517-6161.1958.tb00292.x

Fisher, R. A. (1935). The logic of inductive inference. Journal of the Royal Statistical Society, 98(1), 39–82. https://doi.org/10.2307/2342435

McHugh, M. (2009). The odds ratio: Calculation, usage, and interpretation. Biochemia Medica, 19(2), 120–126. https://doi.org/10.11613/BM.2009.011

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples

>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> es_odds_ratio(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"])
         OR    n  statistic   p-value
0  1.750802  495    2.86455  0.004176