Analyse a Single Binary Variable
The analysis of a single binary variable can be done with the steps shown below. Click on each step to reveal how it can be done.
Step 1: Impression
For a quick impression of a binary variable, you can create a frequency table. The result will be something like the table shown below.
|  |  | Frequency | Percent | Valid Percent |
|---|---|---|---|---|
| Valid | Female | 12 | 22 | 26 |
|  | Male | 34 | 62 | 74 |
|  | Subtotal | 46 | 84 | 100 |
| Missing | No response | 9 | 16 |  |
|  | Subtotal | 9 | 16 |  |
| Total |  | 55 | 100 |  |
Click here to see how to create a frequency table with Excel, Python, R, or SPSS.
with Python
Jupyter Notebook of video is available here.
with stikpetP library
without stikpetP library
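A minimal sketch without the stikpetP library could look as follows (the CSV path and the 'Gen_Gender' column are assumptions, based on the data file used in the later examples; pandas is still needed):

# library needed
import pandas as pd

# some data
myDf = pd.read_csv('../../Data/csv/StudentStatistics.csv', sep=';')
myVar = myDf['Gen_Gender']

# frequencies and percentages (including the missing values)
freq = myVar.value_counts(dropna=False)
percent = freq / len(myVar) * 100

# valid percentages (excluding the missing values)
validPercent = myVar.value_counts() / myVar.count() * 100

# combine everything into one frequency table
pd.DataFrame({'Frequency': freq, 'Percent': percent, 'Valid Percent': validPercent})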
with R (Studio)
with stikpetR library
Jupyter Notebook of video is available here.
without stikpetR library
R script of video is available here.
Datafile used in video: GSS2012-Adjusted.sav
with SPSS
There are three different ways to create a frequency table with SPSS.
An SPSS workbook with instructions for the first two can be found here.
using Frequencies
watch the video below, or download the pdf instructions (via bitly, opens in new window/tab).
Datafile used in video: Holiday Fair.sav
using Custom Tables
watch the video below, or download the pdf instructions for versions before 24, or version 24 (via bitly, opens in new window/tab)
Datafile used in video: Holiday Fair.sav
using descriptive shortcut
watch the video below, or download the pdf instructions (via bitly, opens in new window/tab).
Datafile used in video: StudentStatistics.sav
See the explanation of the frequency table for details on how to read this kind of table.
The table itself might not end up in the report, but it gives you a quick impression of the data.
We could add something like the following in the report based on the example:
There appear to be relatively many male workers (N = 34, 74%) compared to female workers (N = 12, 26%).
Step 2: Testing
With a single binary variable, you are probably interested in comparing the percentages of the two categories. An exact one-sample binomial test can do this for you. You can check whether the two percentages are significantly different, by using the assumption that they are equal in the population. The p-value (or significance) of the test can then show whether this is the case or not.
Click here to see how to perform a binomial test with Excel, Flowgorithm, Python, R, SPSS, or Manually
with Excel
Excel file from videos TS - Binomial (one-sample) (E).xlsm.
with stikpetE
without stikpetE
with Flowgorithm
A basic implementation of a one-sample binomial test is shown in the flowchart in figure 1.
Figure 1
Flowgorithm for one-sample binomial test
It takes as input the frequency of one of the categories (k) and the sample size (n), and makes use of the cumulative distribution function of the binomial distribution.
Flowgorithm file: TS - Binomial (one-sample).fprg.
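A rough Python sketch of the same logic (this is not the Flowgorithm file itself, and it assumes a two-sided test against an expected proportion of 0.5):

# sketch of a two-sided one-sample binomial test, using the binomial cumulative distribution function
from scipy.stats import binom

def binomial_test_onesample(k, n):
    # take the smaller of the two counts and double its tail probability
    # (this shortcut is only valid for an expected proportion of 0.5)
    kLow = min(k, n - k)
    return min(1, 2 * binom.cdf(kLow, n, 0.5))

# example with the frequencies from the frequency table (12 females out of 46 valid cases)
print(binomial_test_onesample(12, 46))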
with Python
Jupyter Notebook from videos TS - Binomial Exact Test.ipynb.
with stikpetP
with other libraries
without libraries
or without using any libraries:
Basic code example:
# libraries needed
import pandas as pd
from scipy.stats import binomtest

# some data
myDf = pd.read_csv('../../Data/csv/StudentStatistics.csv', sep=';')
myCd = myDf['Gen_Gender'].value_counts()

# the exact two-sided test against an expected proportion of 1/2
# (binomtest replaces the older binom_test, which was removed in recent SciPy versions)
binomtest(myCd.values[0], sum(myCd.values), 1/2, alternative='two-sided').pvalue
with R (Studio)
with stikpetR
Jupyter Notebook from video TS - Binomial (one-sample) (R).ipynb.
without stikpetR
R script from video TS - Binomial Exact Test.R.
Datafile used in video: StudentStatistics.sav
Basic code example:
#one-sample binomial test
#Preparation
#Getting some data
#install.packages("foreign")
library(foreign)
myData <- read.spss("../Data Files/StudentStatistics.sav", to.data.frame = TRUE)
#Remove NAs
myVar <- na.omit(myData$Gen_Gender)
#Determine the number of successes (the count of the first category)
k <- sum(myVar == myVar[1])
#Determine the total sample size
n <- length(myVar)
#Test with the default expectation that both groups are equal in the population
#Perform the binomial test
binom.test(k, n)
#Or use the binomial distribution directly:
#double the smaller tail, so this also works if k happens to be the larger count
min(1, 2 * pbinom(min(k, n - k), n, 0.5))
with SPSS
using non-parametric tests
Datafile used in video: StudentStatistics.sav
using Legacy Dialogs
Datafile used in video: StudentStatistics.sav
using compare means
Datafile used in video: StudentStatistics.sav
Manually (Formulas)
A one-sample binomial test is almost 'just' the same as using the binomial distribution directly.
Given a probability of success (p), which for the binomial test is the expected proportion in the population, the number of trials (n), which for the binomial test is the total sample size, and the number of successes (k), which for the binomial test is the number of occurrences in one of the categories, the formula for the cumulative binomial distribution (F(k; n, p)) is:
\(F\left(k;n,p\right)=\sum_{i=0}^{\left\lfloor k\right\rfloor}\binom{n}{i}\times p^{i}\times\left(1-p\right)^{n-i}\)
If p = 0.5 the formula can be simplified into:
\(F\left(k;n,0.5\right)=0.5^{n}\times\sum_{i=0}^{\left\lfloor k\right\rfloor}\binom{n}{i}\)
In the formula ⌊k⌋ is the 'floor' function. This gives the greatest integer (whole number) less than or equal to k. So for example ⌊2.8⌋ = 2, and ⌊-2.2⌋=-3.
\(\binom{n}{i}\) is the binomial coefficient, this can be calculated using:
\(\binom{n}{i}=\frac{n!}{i!\times\left(n-i\right)!}\)
In this formula the ! indicates the factorial operation:
\(n!=\prod_{i=1}^{n}i\), and 0! is defined as 0! = 1.
These formulas are discussed in more detail in the binomial distribution section.
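As a small check of these formulas with the example data (k = 12 females, n = 46 valid cases, p = 0.5), a sketch in plain Python could be:

# cumulative binomial distribution F(k; n, p), built directly from the formulas above
from math import comb, floor

def binom_cdf(k, n, p):
    # sum the binomial probabilities for i = 0, 1, ..., floor(k)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(floor(k) + 1))

# two-sided p-value: double the lower tail (12 is the smaller of the two counts)
k, n = 12, 46
pValue = 2 * binom_cdf(k, n, 0.5)
print(round(pValue, 3))  # 0.002, matching the reported result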
The p-value (sig.) is the probability of obtaining percentages as in the sample, or more extreme, if the assumption about the population (that the two categories are equal) were true. If this is below the pre-defined threshold (usually .05), we reject this assumption and conclude there is a significant difference; otherwise we do not reject the assumption.
When reporting the result of a one-sample binomial test, the only thing to show is the p-value (sig.), so for example:
An exact binomial test indicated that the percentage of females (Nf = 12, 26%) was significantly different from the percentage of males (Nm = 34, 74%), p = .002.
Note that the p-value is usually reported with three decimal places. If the p-value is below .0005, it is reported as p < .001. Note that older versions of SPSS often showed p-values below .0005 as .000.
Besides the one-sample binomial test, there are other tests that could be used. The binomial distribution can be approximated with a normal distribution, which leads to a one-sample proportion test. Another approach is to use a so-called goodness-of-fit test (either Pearson or Likelihood Ratio).
Statstest.com recommends using the exact binomial test if the sample size is below 1,000, and a Likelihood Ratio test with Yates correction otherwise. The Likelihood Ratio test is less well known, so if you prefer a more familiar test, the Pearson version (with Yates correction) should be fine as well.
Step 3: Effect Size
Each test should be accompanied by an effect size, according to APA (2019, p. 88). One possible effect size for a one-sample binomial test, where the assumption was that both categories were equal, is Cohen g.
Click here to see how to determine Cohen's g with Excel, Flowgorithm, Python, R, SPSS, an online calculator, or Manually
with Flowgorithm
A basic implementation of Cohen's g is shown in the flowchart in figure 2.
Figure 2
Flowgorithm for Cohen g
It takes as input the frequency of one of the categories (k) and the sample size (n).
Flowgorithm file: ES - Cohen g.fprg.
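A minimal sketch of the same calculation in Python (assuming the expected proportion is 0.5, as in the example):

# Cohen's g: the difference between the observed proportion and the expected proportion of 0.5
def cohen_g(k, n):
    return k / n - 0.5

# example with the frequencies from the frequency table (12 females out of 46 valid cases)
print(cohen_g(12, 46))       # about -0.24
print(abs(cohen_g(12, 46)))  # nondirectional version, about 0.24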
with R (Studio)
with stikpetR
Jupyter Notebook from video ES - Cohen g (R).ipynb.
without stikpetR
R script from video: binary - effect sizes.R.
Datafile used in video: StudentStatistics.sav
with SPSS
Datafile used in video: StudentStatistics.sav
Online calculator
Enter the number of cases of the first category, then the total sample size:
Manually (using Formula)
Given a sample proportion (p) and the expected proportion in the population (π), the formula for Cohen's g will be:
\(g=p-\pi\)
The sample proportion in the example was 0.26 and the expected proportion was 0.50, so in the example this gives:
\(g=0.26-0.50=-0.24\)
Often the absolute value is used (the so-called nondirectional Cohen's g):
\(g=|0.26-0.50|=|-0.24|=0.24\)
Cohen's g is simply the difference between the observed proportion and the expected proportion of 0.5. Cohen gave some rules of thumb to interpret this, shown in Table 1.
| \|g\| | Interpretation |
|---|---|
| 0.00 to < 0.05 | Negligible |
| 0.05 to < 0.15 | Small |
| 0.15 to < 0.25 | Medium |
| 0.25 or more | Large |

Note: Adapted from Statistical power analysis for the behavioral sciences (2nd ed., pp. 147-149), by J. Cohen, 1988, L. Erlbaum Associates.
The 0.24 would fall in the Medium category (but is very close to the Large category). We could add this to our findings:
An exact binomial test indicated that the percentage of females (Nf = 12, 26%) was significantly different from the percentage of males (Nm = 34, 74%), p = .002. Cohen's g suggests that the difference can be classified as medium, g = .24.
Alternatives to Cohen g could be Cohen h' or the Alternative Ratio.
Step 4: Reporting
In each step, we already discussed how it could be reported. For the example used, the final report could have something like the following:
There appear to be relatively many male workers (N = 34, 74%) compared to female workers (N = 12, 26%).
An exact binomial test indicated that the percentages were significantly different, p = .002. Cohen’s g suggests that the difference can be classified as medium, g = .24.
If you want to make things easy for yourself and are using Excel, Python or R, you can use my library/add-on to perform each step.
Using a stikpet Library/Add-On
Python and the stikpetP library
Jupyter Notebook from video: stikpetP - Single Binary.ipynb.
R and the stikpetR library
Jupyter Notebook from video: stikpetR - Single Binary.ipynb.