Correlation: Measuring the degree of interdependence between two variables
This section is about different kinds of correlations. Correlations can be
calculated when two sets of measurements are made on the same entities. If the sample
consists of scores from subjects, then in order to get started, each subject's score on one
condition has to be lined up with their score on the other condition. For example, we can
look at the correlation between reading times for two kinds of sentence types. The data
file has each person's scores side by side, in two columns.
Note that the scores we are correlating can be single token scores, or they can be
totals or means from a larger sample. This makes no difference to the way in which the
correlation is calculated, but of course if the data points are means of multiple tokens then
they are more likely to be normally distributed (by the Central Limit Theorem). They will
also be good point estimators and thus the results of the correlations will be cleaner than
if the correlations were based on only one token per person. In particular the correlation
is likely to be higher because there is less item-to-item variation. Remember that the s.d.
of a sampling distribution is smaller as n increases, so the numbers going into the
correlations will be less variable if based on means of n rather than on n=1.
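To see the point about means being less variable, here is a small simulation sketch in Python. Everything in it is made up for illustration: the population mean and s.d., the number of subjects, the choice of 16 tokens, and the helper name simulated_scores are all assumptions, not values from the homework data.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch gives the same numbers every run


def simulated_scores(n_tokens, n_subjects=200, mu=500.0, sigma=100.0):
    """Return one score per subject: the mean of n_tokens normal draws.

    mu and sigma are hypothetical population values (think reading times in ms).
    """
    return [
        statistics.mean(random.gauss(mu, sigma) for _ in range(n_tokens))
        for _ in range(n_subjects)
    ]


# s.d. across subjects when each score is a single token (n = 1)
sd_single = statistics.stdev(simulated_scores(n_tokens=1))
# s.d. across subjects when each score is a mean of 16 tokens
sd_means = statistics.stdev(simulated_scores(n_tokens=16))

print(f"s.d. of single-token scores: {sd_single:.1f}")
print(f"s.d. of 16-token means:      {sd_means:.1f}")
```

The second s.d. should come out at roughly a quarter of the first, since the s.d. of a mean of n scores shrinks in proportion to the square root of n.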
As in anova, the notion of analyses by subjects vs. by items holds for correlations.
In homework 3 I ask for correlations by subjects, using just the two columns of data with
no other complications. But in that assignment you also created an items file, with 16
items. However, in that case the items were all different, so there is no way to run a
correlation. There is no way to match up pairs of sentences that are somehow ‘the
same’. In the subjects analysis, though, you could ask if participants show a correlation
between how they responded to type 1 and type 2.
1. Eyeballing the data
Run descriptives and look at the correlations in terms of the dispersion of scores
in a graph. Each subject is represented by a point on the graph, which represents their
score on Type 1 and Type 2. At this stage you are looking to see how spread-out the
scores are from the trend line, which can slope either up or down. Be careful to look for
extreme values which can ruin everything, as I will show in class. One outlier can really
change the correlation a lot.
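As a rough illustration of how much one extreme value can matter, the sketch below computes the correlation coefficient with and without a single outlier. The scores are invented for the example, and pearson_r is a hypothetical helper written out by hand (sum of deviation products over the square root of the product of the sums of squared deviations), not a function from any particular package.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation for two equal-length lists of paired scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of paired deviations from the two means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up scores for six subjects on Type 1 and Type 2: no clear trend
type1 = [20, 22, 24, 26, 28, 30]
type2 = [31, 25, 29, 24, 30, 27]

r_without = pearson_r(type1, type2)

# Add one hypothetical subject who is extremely slow on both types
r_with = pearson_r(type1 + [60], type2 + [65])

print(f"r without the outlier: {r_without:.2f}")
print(f"r with the outlier:    {r_with:.2f}")
```

With these invented numbers the six ordinary subjects show a weak negative correlation, but adding the one extreme subject produces a strong positive one. This is why you look at the graph first.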
2. Calculate the covariance.
Consider each data point on the scattergram. Note that each is really the
intersection of two data points, one on the x axis for variable 1, and the other on the y
axis for variable 2. Thus a single data point can be examined in terms of how far its x
value is from the mean of the x values and how far its y value is from the mean of the y
values. These are simply two deviations from two means. They are needed for the
calculation of the correlation coefficient.
So if the mean of x is 25 and the mean of y is 29, then a point with values of
(22, 22) deviates from the x mean by 22 - 25 = -3 and from the y mean by 22 - 29 = -7.
Notice that you start with the score and subtract the mean from it; these two deviations
are therefore negative.
So we have two deviation scores per data point. The covariance is calculated by
multiplying the two deviations for each data point together (e.g. -3 x -7 = +21), adding
up the products across all subjects (or items, depending on what you're analyzing), and
then taking an average product by dividing by n-1 (which makes little difference from
dividing by n when the sample is large).
Notice that n is defined as the total number of subjects (or items). The two scores have to
be from the same subjects (or items), so df is defined just on the number of different
people or items. If there are 32 subjects, and 64 data points, then df = 32-1=31. Notice
that correlation is inherently a within-subjects design. If you had data from two groups of
subjects there would be no way to line them up in the data file; would you put Mary’s number next to Juan’s or next to Jacqueline’s? Neither – there is no sense to the question
of whether the two groups of subjects are correlated.
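The covariance calculation described above can be sketched in a few lines of Python. The five pairs of scores are invented so that the means come out to 25 and 29 and the first point is the (22, 22) example; the function name covariance is just a label for this sketch.

```python
def covariance(xs, ys):
    """Sample covariance: sum of deviation products divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("each x score needs a paired y score")
    n = len(xs)  # number of subjects (or items), not number of data points
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # One product per subject: (x deviation) times (y deviation)
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


# Hypothetical paired scores; mean of x is 25, mean of y is 29
x = [22, 24, 25, 26, 28]
y = [22, 27, 30, 32, 34]

# Deviation products: 21, 2, 0, 3, 15, which sum to 41
cov_xy = covariance(x, y)
print(cov_xy)  # 41 / 4 = 10.25
```

Note that the two lists have to be the same length and in the same subject order, which is the code version of lining up each person's two scores side by side.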
If two sets of scores are interdependent, such that one goes up along with the
other, or one goes up while the other goes down, we would like to develop a statistic that
is sensitive to this lockstep pattern. Of course we won't expect perfect correlations from
samples, since there will also be error and other sources of variance, so this statistic has
to be sensitive to the amount of variability ('error') in the experiment, and still be able to
tell us if the correlation is significant at a given level (e.g. p < .05 or better).
There are two steps. First, as I said above, we calculate the covariance: the
product of each x and y deviation summed and divided by n-1. I will expand on each part
of this formula.
First note that if there is a positive correlation, i.e. as x increases y increases, then
the two deviations will generally have the same sign, so most of the products will be
positive and their total will be greater than zero. If there is a negative correlation,
i.e. as x increases y decreases or vice versa, then one deviation will usually be
positive and the other negative, so most of the products will be negative and their
total will be less than zero. But if there is no relationship between the variables,
there will be no tendency; the two deviations will both be sometimes positive and
sometimes negative. About half the time the products will be positive (multiplying two
positive or two negative numbers) and the other half they will be negative (multiplying
pos x neg or neg x pos), and the result will be a total close to zero.
Note that on a graph, positive correlations have an upward slope toward the right,
and negative correlations have a downward slope to the right. If slope = rise/run, then
a rise of 2 over a run of 1 gives a positive slope of 2/1 = 2, whereas a rise of 2 over
a run of -1 (or a rise of -2 over a run of 1) gives a negative slope of -2.
The only problem with the covariance as a test statistic is that it is scale-sensitive.
For example, using the same data you would get different covariances if you change from, say, centimetres to metres, or pounds to kilos, although the shape of the
relationship would be the same no matter which scale you used. If we wanted to run a
hypothesis test against tabled values of the covariance, to see if our observed covariance
is bigger than some fairly improbable cutoff under the null hypothesis, then we would
need a different set of tables for every combination of variables, and that is impossible
because no statistician would work out an infinite number of tables.
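The scale sensitivity is easy to check for yourself. In this sketch the heights and weights are invented, and covariance is the same hand-written helper idea as before (sum of deviation products over n - 1); the only thing that changes between the two calls is whether height is recorded in centimetres or metres.

```python
def covariance(xs, ys):
    """Sample covariance: sum of deviation products divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Made-up data for five people
height_cm = [150, 160, 170, 180, 190]
weight_kg = [50, 58, 65, 72, 85]

# Same heights, different unit
height_m = [h / 100 for h in height_cm]

cov_cm = covariance(height_cm, weight_kg)
cov_m = covariance(height_m, weight_kg)

print(f"covariance with height in cm: {cov_cm:.2f}")
print(f"covariance with height in m:  {cov_m:.2f}")
```

The data and the relationship are identical, yet the first covariance is 100 times the second, simply because a centimetre is a hundredth of a metre.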
Luckily we note that the product of the s.d.s will come out in the same units as
the products of the deviations used in the covariance. If you divide the covariance by this
product you eliminate the units and get just a proportion. This proportion comes out
to the same thing regardless of which of the equivalent units you used. In fact you could
use weight in kilos and height in inches or angstroms or astronomical units or lightyears
and still get exactly the same r! The bigger or smaller the units, the bigger or smaller
both the covariance and the product of the s.d.s, so the change cancels out.
r, otherwise known as the Pearson Product-Moment Correlation (or Pearson’s r),
can range between -