Analysing a single scale variable

1a: impression from sample data

All the frequency types discussed for a nominal and ordinal variable can also be applied to a scale variable. One complication, however, is that a scale variable often has so many distinct values that the table becomes very long. Since the point of a table is to give a clear overview, and a long table often isn’t very clear, this creates a problem. The solution is to create bins (or classes), as shown in Table 1.

Table 1
*Frequency table of a binned scale variable*
Age	Frequency
15 < 25	170
25 < 35	363
35 < 45	358
45 < 55	360
55 < 65	324
65 < 75	222
75 < 85	125
85 < 95	47

Click here to see how you can create bins...

with Excel

The Data Analysis Toolkit that comes with Excel could be used to create the bins, but those bins will then include the upper bound. It is also possible to do it without the toolkit.

without the Data Analysis Toolkit

Excel file: IM - Binning.xlsm.

with the Data Analysis Toolkit

Excel file: IM - Binning.xlsm.

with Python

Jupyter Notebook used in video: IM - Binning (Create).ipynb.

Data file used: GSS2012a.csv.

with R (Studio)


R script: IM - Binning (create).R.

Data file used: GSS2012a.csv.

with SPSS

Data file used: Holiday Fair.sav.

The table shows that there were 170 respondents in the age bin of 15 < 25. The symbol '<' is used for 'but under', so someone aged 25 would fit into 25 < 35, but not into 15 < 25. Sometimes ≤ is used, which stands for 'up to and including'. A more technical method is the use of [ or ] to indicate ‘including’ and ( or ) to indicate ‘excluding’. The interval 15 < 25 is then the same as [15,25), and the interval 15 ≤ 24 is the same as [15,24]. Another symbol often used is a hyphen (-). It is, however, sometimes used as < (Chaudhary, Kumar, & Alka, 2009; Sharma, 2007), and sometimes as ≤ (Beri, 2010; Haighton, Haworth, & Wake, 2003).

The lower end of a bin is called the lower bound and the upper end the upper bound (e.g. the bin 15 < 25 has as a lower bound 15 and as an upper bound 25).

When creating these bins, two important rules should be met:

Bins should not overlap.
So do not use 15 < 25 and 20 < 35, since a person who is 22 years old would then fit into both. This sometimes goes wrong when people use ≤ instead of <.
Each score should fit into a bin.
This means that the lower bound of the first bin should be no higher than the lowest score, and the upper bound of the last bin should be higher than the highest score.

These two rules can be combined into one: each score should fit into exactly one bin.
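As an illustration of the combined rule, the sketch below counts scores in bins of the form lower < upper, i.e. [lower, upper). Because consecutive bins share their bounds, each score lands in exactly one bin, and a score outside the first lower bound or last upper bound raises an error. The function name `bin_scores` and the list of ages are hypothetical and only serve the example.

```python
from bisect import bisect_right
from collections import Counter


def bin_scores(scores, bounds):
    """Count scores per bin [bounds[i], bounds[i + 1]).

    bounds must be sorted; consecutive values act as lower and upper bound.
    """
    counts = Counter()
    for x in scores:
        # Rule 2: every score has to fit into some bin.
        if not (bounds[0] <= x < bounds[-1]):
            raise ValueError(f"score {x} does not fit into any bin")
        # Index of the last bound that is <= x, i.e. the bin's lower bound.
        i = bisect_right(bounds, x) - 1
        counts[(bounds[i], bounds[i + 1])] += 1
    # Rule 1: bins built from shared bounds cannot overlap.
    return [((lo, hi), counts[(lo, hi)]) for lo, hi in zip(bounds, bounds[1:])]


ages = [15, 24, 25, 34, 35, 60, 94]   # hypothetical ages, not the GSS data
bounds = list(range(15, 96, 10))      # 15, 25, ..., 95 as in Table 1
table = bin_scores(ages, bounds)

for (lo, hi), frequency in table:
    print(f"{lo} < {hi}\t{frequency}")
```

Note that an age of 25 is counted in 25 < 35 and not in 15 < 25, in line with the 'but under' reading of the < symbol.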

There are also various formulas to help decide how many bins to use, or how wide each bin should be. This is important because, depending on how the bins are set up, the results might look different. Some formulas determine the number of bins (e.g. Sturges’ rule (Sturges, 1926, p. 65) or the Square Root Choice (Duda & Hart, 1973)), and some authors simply use a ‘rule of thumb’. One such rule of thumb is from Herkenhoff and Fogli (2013, p. 58), who recommend between 5 and 15 bins. Anything more than 15 might cause the table to become unclear (which is exactly what we are trying to avoid), and with anything less than 5 we might lose too much information.

Click here to see how to determine the number of bins...

with Excel

Excel file: IM - Binning.xlsm.

with Python

Jupyter Notebook used in video: IM - Binning (nr of bins).ipynb.

Data file used in video and notebook: GSS2012a.csv.

with R (Studio)


R script: IM - Binning (nr of bins).R.

Data file used: GSS2012a.csv.

manually (formulas)

For the formulas below, \(k\) is the number of bins and \(n\) the sample size. The \(\left \lceil \dots \right \rceil\) is the ceiling function, which means rounding up to the nearest integer.

Square-Root Choice (Duda & Hart, 1973)

\( k = \left \lceil\sqrt{n}\right \rceil \)

Sturges (1926, p. 65)

\( k = \left\lceil\log_2\left(n\right)\right\rceil+1 \)

QuadRoot (anonymous as cited in Lohaka, 2007, p. 87)

\( k = 2.5\times\sqrt[4]{n} \)

Rice (Lane, n.d.)

\( k = \left\lceil 2\times\sqrt[3]{n} \right\rceil \)

Terrell and Scott (1985, p. 209)

\( k = \sqrt[3]{2\times n} \)

Exponential (Iman & Conover as cited in Lohaka, 2007, p. 87)

\( k = \left\lceil\log_2\left(n\right)\right\rceil \)

Velleman (Velleman, 1976, as cited in Lohaka, 2007)

\( k = \begin{cases}2\times\sqrt{n} & \text{ if } n\leq 100 \\ 10\times\log_{10}\left(n\right) & \text{ if } n > 100\end{cases} \)
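The rules so far depend only on the sample size, so they are easy to compute by hand or in a few lines of code. The sketch below is one possible transcription of the formulas above into Python; the function names are made up for this example, and the sample size used is the total of the frequencies in Table 1 (1969). Rules that are written without a ceiling function above are left unrounded here.

```python
import math


def square_root_choice(n):
    return math.ceil(math.sqrt(n))


def sturges(n):
    return math.ceil(math.log2(n)) + 1


def quad_root(n):
    # Written without a ceiling in the formula, so the result is not rounded.
    return 2.5 * n ** 0.25


def rice(n):
    return math.ceil(2 * n ** (1 / 3))


def terrell_scott(n):
    # Also not rounded, following the formula as written.
    return (2 * n) ** (1 / 3)


def exponential(n):
    return math.ceil(math.log2(n))


def velleman(n):
    return 2 * math.sqrt(n) if n <= 100 else 10 * math.log10(n)


n = 1969  # sum of the frequencies in Table 1
rules = [square_root_choice, sturges, quad_root, rice,
         terrell_scott, exponential, velleman]
for rule in rules:
    print(f"{rule.__name__}: {rule(n):.2f}")
```

Running this shows how far the rules can be apart for the same sample size, which is one reason to also keep a rule of thumb such as 5 to 15 bins in mind.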

Doane (1976, p. 182)

\( k = 1 + \log_2\left(n\right) + \log_2\left(1+\frac{\left|g_1\right|}{\sigma_{g_1}}\right) \)

In this formula, \(g_1\) is the third-moment skewness:

\( g_1 = \frac{\sum_{i=1}^n\left(x_i-\bar{x}\right)^3} {n\times\sigma^3} = \frac{1}{n}\times\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{\sigma}\right)^3 \)

With \(\sigma = \sqrt{\frac{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2}{n}}\)

The \(\sigma_{g_1}\) is defined using the formula:

\( \sigma_{g_1}=\sqrt{\frac{6\times\left(n-2\right)}{\left(n+1\right)\left(n+3\right)}} \)
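Doane's formula needs the data itself and not only the sample size. A minimal sketch, assuming the data is a plain list of numbers with more than two values that are not all equal (the function name `doane` is chosen for this example, and the result is left unrounded as in the formula):

```python
import math


def doane(data):
    n = len(data)
    mean = sum(data) / n
    # Population standard deviation (division by n, not n - 1).
    sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    # Third-moment skewness g1.
    g1 = sum(((x - mean) / sigma) ** 3 for x in data) / n
    # Standard deviation of g1 as given in the formula above.
    sigma_g1 = math.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    return 1 + math.log2(n) + math.log2(1 + abs(g1) / sigma_g1)


symmetric = [1, 2, 3, 4, 5]   # hypothetical scores, skewness of zero
skewed = [1, 1, 1, 2, 10]     # hypothetical scores, clearly skewed
print(doane(symmetric), doane(skewed))
```

For perfectly symmetric data the skewness term vanishes and the result reduces to \(1 + \log_2\left(n\right)\); the more skewed the data, the more bins the formula suggests.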

Formulas that determine the width (h) of the bins

Combined with the range of the data, the width can be used to determine the number of bins:

\( k = \frac{\text{max}\left(x\right)-\text{min}\left(x\right)}{h} \)

Scott (1979, p. 608)

\( h = \frac{3.49\times s}{\sqrt[3]{n}} \)

Where \(s\) is the sample standard deviation:

\(s = \sqrt{\frac{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2}{n-1}}\)

Freedman-Diaconis (1981, p. 3)

\( h = 2\times\frac{\text{IQR}\left(x\right)}{\sqrt[3]{n}} \)

Where \( \text{IQR}\) is the inter-quartile range.
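The two width formulas can be sketched with Python's standard library as shown below. Two assumptions are made that are not in the formulas themselves: the quartiles are taken from `statistics.quantiles` with its default method (other quartile definitions give a slightly different IQR), and the number of bins is rounded up so that the bins cover the full range. The function names are chosen for this example.

```python
import math
import statistics


def scott_width(data):
    s = statistics.stdev(data)  # sample standard deviation (n - 1)
    return 3.49 * s / len(data) ** (1 / 3)


def freedman_diaconis_width(data):
    # Default quartile method of the statistics module (an assumption).
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 2 * (q3 - q1) / len(data) ** (1 / 3)


def bins_from_width(data, h):
    # Rounding up is an assumption, so that all scores are covered.
    return math.ceil((max(data) - min(data)) / h)


scores = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical scores
for width in (scott_width(scores), freedman_diaconis_width(scores)):
    print(round(width, 3), bins_from_width(scores, width))
```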

A more complex technique is for example from Shimazaki and Shinomoto (2007). For this, we define a 'cost function' that needs to be minimized:

\( C_k = \frac{2\times\bar{f_k}-\sigma_{f_k}}{h^2} \)

With \(\bar{f_k}\) being the average of the frequencies when using \(k\) bins, and \(\sigma_{f_k}\) the population variance of those frequencies. In formula notation:

\(\bar{f_k}=\frac{\sum_{i=1}^k f_{i,k}}{k}\)

\(\sigma_{f_k}=\frac{\sum_{i=1}^k\left(f_{i,k}-\bar{f_k}\right)^2}{k}\)

Where \(f_{i,k}\) is the frequency of the i-th bin when using k bins.

Note that if the data are integers, it is recommended to also use integer bin widths.
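The cost function of Shimazaki and Shinomoto can be evaluated for a range of candidate values of \(k\), after which the value with the lowest cost is chosen. The sketch below assumes equal-width bins that run from the lowest to the highest score, with the highest score placed in the last bin, and data that are not all equal; these choices and the function names are assumptions for this example and not taken from the original paper.

```python
def histogram_counts(data, k):
    """Frequencies of k equal-width bins between min(data) and max(data)."""
    lowest, highest = min(data), max(data)
    h = (highest - lowest) / k
    counts = [0] * k
    for x in data:
        # The highest score would fall just outside, so put it in the last bin.
        i = min(int((x - lowest) / h), k - 1)
        counts[i] += 1
    return counts, h


def shimazaki_cost(data, k):
    counts, h = histogram_counts(data, k)
    mean_f = sum(counts) / k
    var_f = sum((f - mean_f) ** 2 for f in counts) / k  # population variance
    return (2 * mean_f - var_f) / h ** 2


def best_k(data, candidates):
    # The k with the lowest cost among the candidates.
    return min(candidates, key=lambda k: shimazaki_cost(data, k))


scores = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical scores
print(histogram_counts(scores, 2), shimazaki_cost(scores, 2))
print(best_k(scores, range(1, 5)))
```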

Stone (1984, p. 3) is similar, but uses as a cost function:

\(C_k = \frac{1}{h}\times\left(\frac{2}{n-1}-\frac{n+1}{n-1}\times\sum_{i=1}^k\left(\frac{f_i}{n}\right)^2\right)\)

Knuth (2019, p. 8) suggests using the k-value that maximizes:

\(P_k=n\times\ln\left(k\right) + \ln\Gamma\left(\frac{k}{2}\right) - k\times\ln\Gamma\left(\frac{1}{2}\right) - \ln\Gamma\left(n+\frac{k}{2}\right) + \sum_{i=1}^k\ln\Gamma\left(f_i+\frac{1}{2}\right)\)
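Both Stone's cost function and Knuth's expression can be transcribed fairly directly, using `math.lgamma` for \(\ln\Gamma\). The sketch below again assumes equal-width bins from the lowest to the highest score, with the highest score placed in the last bin; that binning choice and the function names are assumptions for this example. Stone's cost is to be minimized over \(k\), while Knuth's expression is to be maximized.

```python
import math


def histogram_counts(data, k):
    """Frequencies of k equal-width bins between min(data) and max(data)."""
    lowest, highest = min(data), max(data)
    h = (highest - lowest) / k
    counts = [0] * k
    for x in data:
        i = min(int((x - lowest) / h), k - 1)  # highest score in last bin
        counts[i] += 1
    return counts, h


def stone_cost(data, k):
    counts, h = histogram_counts(data, k)
    n = len(data)
    sum_sq = sum((f / n) ** 2 for f in counts)
    return (2 / (n - 1) - (n + 1) / (n - 1) * sum_sq) / h


def knuth_log_posterior(data, k):
    counts, _ = histogram_counts(data, k)
    n = len(data)
    return (n * math.log(k)
            + math.lgamma(k / 2)
            - k * math.lgamma(0.5)
            - math.lgamma(n + k / 2)
            + sum(math.lgamma(f + 0.5) for f in counts))


scores = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical scores
candidates = range(1, 6)
k_stone = min(candidates, key=lambda k: stone_cost(scores, k))
k_knuth = max(candidates, key=lambda k: knuth_log_posterior(scores, k))
print(k_stone, k_knuth)
```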

By creating bins we lose some information, since we can no longer see, for example, the exact ages of the 170 people in the 15 < 25 bin.

By binning a scale variable, we actually convert it into an ordinal variable, and we could use all the types of frequencies discussed there as well. If the bin sizes are all the same this is fine, but if some bin sizes are different than others it might actually distort the truth. If bin sizes are not equal, we should use something known as ‘frequency density’. This is discussed in the appendix below.
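To give a first idea of why unequal bin sizes matter: the frequency density of a bin is its frequency divided by its width. The small sketch below merges the last two bins of Table 1 into one bin of 75 < 95 as a hypothetical example; the function name is chosen for this illustration.

```python
def frequency_density(lower, upper, frequency):
    # Frequency per unit of bin width.
    return frequency / (upper - lower)


# (lower bound, upper bound, frequency); the last bin merges 75 < 85 and 85 < 95.
bins = [(15, 25, 170), (65, 75, 222), (75, 95, 125 + 47)]
densities = [frequency_density(*b) for b in bins]

for (lower, upper, frequency), density in zip(bins, densities):
    print(f"{lower} < {upper}\t{frequency}\t{density:.1f}")
```

The merged bin has a frequency of 172, slightly more than the 170 of the 15 < 25 bin, yet its frequency density is only about half as large because the bin is twice as wide.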

A visualisation of the sample data might also give a good impression. This is discussed in the next section.

Appendix

Single scale variable
