# Analysing two scale variables

## Part 3: Test and effect size

On the previous page we got a first impression from the sample data and noticed there might be a relation between the current and the beginning salary. To formally test whether the two variables are associated, we can perform what is known as a regression analysis. Since we keep things simple, we will limit ourselves to predicting based on a straight line (linear regression), and since we only have two variables the full name would be bivariate linear regression. The most commonly used measure to test whether this relation also exists in the population is the **Pearson Correlation Coefficient** (or, in its full name, the Pearson product-moment correlation coefficient) (Pearson, 1896).

The Pearson correlation varies between -1 and +1. At -1 there is a perfect negative linear relationship, at 0 there is no linear relationship, and at +1 there is a perfect positive linear relationship.

A positive relation means that if one variable goes up, the other also goes up (for example, the number of ice creams sold versus temperature); a negative relation means that if one goes up, the other goes down (for example, the number of winter jackets sold versus temperature).

We can test whether the Pearson correlation might be significantly different from 0 in the population. In the example the significance of this test is .000. This is the chance of finding a correlation coefficient of .880 or even higher in a sample if in the population it were 0 (no association). This chance is so low that we can say the correlation coefficient in the population is indeed different from zero, and conclude that there is a significant linear association between the two variables.

To determine the strength, we look only at the absolute value (which means ignoring any minus sign; the absolute value of, for example, -0.4 is simply 0.4).

Unfortunately there is no formal way to determine whether 0.880 is high or low (although almost everyone would agree this is pretty high), and the rules of thumb floating around on the internet vary quite a lot, often depending on the field (e.g. biology, medicine, business, etc.). For example, the rule of thumb from Rea and Parker (1992):

| Absolute value | Interpretation |
|----------------|-------------------|
| 0.00 to < 0.10 | Negligible |
| 0.10 to < 0.20 | Weak |
| 0.20 to < 0.40 | Moderate |
| 0.40 to < 0.60 | Relatively strong |
| 0.60 to < 0.80 | Strong |
| 0.80 to 1.00 | Very strong |

In this example we can therefore speak of a very strong effect size.
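As an illustration, these rule-of-thumb thresholds can be turned into a small helper function (a sketch; the function name and the test values are my own, not part of the original page):

```python
def interpret_correlation(r):
    """Return the Rea and Parker (1992) rule-of-thumb label for a correlation.

    Only the absolute value matters, so the minus sign is ignored.
    """
    strength = abs(r)
    if strength < 0.10:
        return "Negligible"
    elif strength < 0.20:
        return "Weak"
    elif strength < 0.40:
        return "Moderate"
    elif strength < 0.60:
        return "Relatively strong"
    elif strength < 0.80:
        return "Strong"
    else:
        return "Very strong"

print(interpret_correlation(0.880))   # the coefficient from this page -> Very strong
print(interpret_correlation(-0.35))   # sign is ignored -> Moderate
```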

Often you will find the determination coefficient reported instead of the correlation. The determination coefficient is the square of the correlation coefficient and indicates the proportion of the variance in the dependent variable that is predictable from the independent variable. In the example 0.880^{2} ≈ 0.775, which indicates that 77.5% of the variance in the current salary can be predicted from the beginning salary.

We can now write up the results of this regression analysis in our report:

The result of the regression indicated that the two variables have a significant very strong correlation, *F*(1, 472) = 1622.12, *p* < .001, *R*^{2} = .775.

**Below you can see how to perform the analysis with SPSS, R (Studio), Excel, Python, or manually.**

**with SPSS**

*the correlation coefficient and significance only*

This method immediately shows whether the coefficient is negative, and you can perform multiple tests at once. Unfortunately, it does not show the determination coefficient or any of the other values needed to report the results.

**with R (Studio)**

*the correlation coefficient*

**with Excel**

**with Python**
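One way to obtain the correlation coefficient, its significance, and the determination coefficient in Python is with SciPy (a sketch, assuming SciPy is installed; the salary figures below are made up for illustration and are not the dataset used on this page):

```python
from scipy import stats

# Made-up salaries for illustration (not the dataset used on this page)
beginning_salary = [17000, 20000, 21000, 25000, 30000, 34000]
current_salary = [25000, 32000, 30000, 41000, 49000, 55000]

# Pearson correlation coefficient and the two-sided p-value
# for testing whether it differs from 0 in the population
r, p_value = stats.pearsonr(beginning_salary, current_salary)

# Determination coefficient: proportion of explained variance
r_squared = r ** 2

print(f"r = {r:.3f}, p = {p_value:.4f}, R^2 = {r_squared:.3f}")
```

`scipy.stats.linregress` gives the same *r* and *p*-value, and additionally returns the slope and intercept of the regression line.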

**Manually (formulas and example)**

**Formulas**

There are a few different notations for the formula, but all produce the same result. Two of them, for example:

$$r = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}} \qquad \text{and} \qquad r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}$$

Where:

$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{n - 1}$$

With:

$$s_x = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n - 1}}, \qquad s_y = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}{n - 1}}$$

In these formulas $x_i$ and $y_i$ are the *i*-th scores on variables *x* and *y*, and $\bar{x}$ and $\bar{y}$ are the means of *x* and *y*.

Note that only the cases for which a score is known on both variables are used.

The test statistic is a t-value, determined by:

$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$

And with degrees of freedom of:

$$df = n - 2$$
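The manual computation can be sketched in plain Python, without any statistics library (the scores used here are made up for illustration and are not the ones from the worked example below):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation via the sum-of-squares notation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of cross-products of the deviations from the means
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square roots of the two sums of squares, multiplied
    den = sqrt(sum((xi - mean_x) ** 2 for xi in x)) \
        * sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return num / den

def t_test_r(r, n):
    """t-value and degrees of freedom for testing r against zero."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2), n - 2

# Made-up scores for illustration
x = [4, 3, 5, 2, 4]
y = [3, 2, 5, 1, 4]
r = pearson_r(x, y)
t, df = t_test_r(r, len(x))
print(round(r, 3), round(t, 3), df)  # → 0.971 7.0 3
```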

**Example**

Note: this example uses different data than the rest of this page.

Given are the scores on two variables:

So the first case scored a 4 on X and a 3 on Y. Note that there are 5 scores, so *n* = 5.

First we determine the means (average):

Second the sum of squares:

Third is the covariance:

Fourth the Pearson correlation coefficient:

Fifth we can determine the t-statistic:

And sixth the degrees of freedom:

We can now complete the report by combining all the parts on the next page.
