Published on 28 Sep 2011

Department: Political Science

Course: POL 201

Professor

A Summary of the Key Points about Correlation and Regression

At the interval level, any relationship can be analyzed in terms of both its nature and its strength.

Nature

–we can visualize the shape or nature of the relationship by plotting our cases in a scatter diagram or scatterplot

–the vertical axis represents the dependent variable (Y) and the horizontal axis represents the independent variable (X), and each case is plotted according to its Y and X scores

–to summarize the nature of the relationship (if any) between X and Y, we can fit a line as closely as possible to the cluster of points in the scattergram

–this line is called a regression line and is defined by its Y-intercept, a, and a slope, b:

Y = a + bX

–‘b’ is known as the regression coefficient

–it tells us how much Y changes, on average, for a one-unit change in X

–its formula is:

$$b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

–‘a’ is the intercept, the value of Y when X = 0

–its formula is:

$$a = \bar{Y} - b\bar{X}$$
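The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function name `fit_line` and the five data points are hypothetical and are not taken from the class example.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator of b: sum of cross-products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator of b: sum of squared deviations of X from its mean.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # The intercept follows from the slope and the two means.
    a = mean_y - b * mean_x
    return a, b


# Hypothetical illustrative data (NOT the data used in class).
xs = [0, 10, 20, 30, 40]
ys = [34, 28, 26, 20, 17]

a, b = fit_line(xs, ys)
print(f"Y = {a:.2f} + ({b:.2f})X")
```

With these made-up scores the slope comes out negative, meaning that higher X-scores go with lower Y-scores.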

–the regression equation also allows us to make predictions of Y by substituting in a value for X and calculating what Y-value would be produced

–for example, the regression equation linking extremist party presence to government durability in the example used in class is

Y = 33.0 – 0.39X


–thus, if a particular country has an extremist vote of 31%, we substitute in the X-score of 31 and get a predicted average duration of:

Y = 33.0 – 0.39(31) = 20.9 months
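The substitution can be checked with a short Python sketch. The coefficients are the ones quoted above from the class example; the function name `predict_duration` is just a label chosen for this illustration.

```python
# Coefficients from the class example: Y = 33.0 - 0.39X
A = 33.0   # intercept: predicted duration (months) when the extremist vote is 0%
B = -0.39  # slope: change in predicted duration per 1-point rise in extremist vote


def predict_duration(extremist_vote_pct):
    """Predicted average government duration in months."""
    return A + B * extremist_vote_pct


print(round(predict_duration(31), 1))  # 20.9
```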

Strength

–The correlation coefficient (r) tells us how strong the association is between two variables and whether it is positive or negative

–it ranges from -1 (perfect negative association) through 0 (no association at all) to +1 (perfect positive association).
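To make the range of r concrete, here is a minimal Python sketch. Note the assumption: these notes do not give a formula for r, so the sketch uses the standard Pearson product-moment formula, and the small data sets are invented to show the three landmark values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient (standard formula, not given in the notes)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented data chosen to hit the three landmark values of r.
r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # points rise along a straight line
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # points fall along a straight line
r_none = pearson_r([1, 2, 3, 4], [1, 2, 2, 1])      # no linear trend at all

print(r_positive, r_negative, r_none)
```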

–The interpretation of the correlation is based on the principle of ‘proportional reduction in error’ (PRE)

–All PRE measures record the proportion of our errors in guessing Y-scores that are eliminated if we use our knowledge of the X-scores

–they are based on the idea that the more our knowledge of one variable helps us guess values of the other variable, the more strongly the two variables must be related

–with a PRE measure, we always start with an initial guess of each case’s Y-score without knowledge of its X-score and calculate how much error we have made; then we make a final guess using knowledge of each case’s X-score and calculate the remaining error

–if the variables are related, the final guess should be more accurate than the initial guess (i.e. the error should be reduced)

–with interval data, our initial guess is simply the mean Y-score for the group of cases, since the mean lies at the center of the distribution and is therefore the best single guess when nothing else is known about a case

–the error we make for each case is the deviation of its actual Y-score from our guess, i.e. from the mean of Y

–in the example used in class, for instance, Swiss governments lasted an average of 36 months and the mean for all 27 countries was 28.9 months

–therefore our initial error for Switzerland is (36 – 28.9) = 7.1 months

–we need to sum all these deviations to get the total error, but there is a problem: the sum is always 0, which isn’t very informative

–so we square the deviations first, then sum them up:

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
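The PRE logic can be sketched in Python as follows. Two assumptions should be flagged: the data are hypothetical (not the 27 countries from class), and because these notes break off before the final-error step, the sketch assumes the usual approach of measuring the remaining error as the squared deviations of the actual Y-scores from the regression-line predictions.

```python
def pre_from_regression(xs, ys):
    """Return (raw_sum, initial_error, final_error, pre) for a bivariate regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Initial guess: the mean of Y for every case.
    deviations = [y - mean_y for y in ys]
    raw_sum = sum(deviations)                        # 0, apart from rounding
    initial_error = sum(d ** 2 for d in deviations)  # squared, then summed

    # Final guess (assumed step): the regression-line prediction a + bX.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    final_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

    # Proportion of the initial error eliminated by using X.
    pre = (initial_error - final_error) / initial_error
    return raw_sum, initial_error, final_error, pre


# Hypothetical illustrative data (NOT the class data).
xs = [0, 10, 20, 30, 40]
ys = [34, 28, 26, 20, 17]

raw_sum, initial_error, final_error, pre = pre_from_regression(xs, ys)
print(raw_sum, initial_error, final_error, pre)
```

For these invented scores the raw deviations cancel out, while the squared deviations do not, which is why squaring is needed before summing. Under the assumption above, the resulting proportion equals the square of r.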
