Statistics Part Two – Exploring Relationships

Between Two Variables

Chapter Seven – Scatterplots, Association and

Correlation

Scatterplots

•Scatterplots may be the most common and most effective display for

data

○In a scatterplot, you can see patterns, trends, relationships, and

even the occasional extraordinary value sitting apart from the

others

•Scatterplots are the best way to start observing the relationship and

the ideal way to picture associations between two quantitative

variables

Roles for Variables

•It is important to determine which of the two quantitative variables

goes on the x-axis and which on the y-axis

•This determination is made based on the roles played by the variables

•When the roles are clear, the explanatory or predictor variable goes on

the x-axis and the response variable goes on the y-axis

•The roles that we choose for variables are more about how we think

about them than about the variables themselves

•Just placing a variable on the x-axis doesn’t necessarily mean that it

explains or predicts anything. And the variable on the y-axis may not

respond to it in any way

More on Scatterplots

•When looking at scatterplots, we will look for direction, form, strength,

and unusual features.

•Direction:

○A pattern that runs from the upper left to the lower right is said

to have negative direction

○A trend running the other way is said to have positive direction

•Form:

○If there is a straight line (linear) relationship, it will appear as a

cloud or swarm of points stretched out in a generally consistent,

straight form

○If the relationship isn’t straight, but curves gently (exponential

growth for example), while still increasing or decreasing

steadily, we can often find ways to make it more nearly straight

○If the relationship curves sharply, however, these straightening

methods will not work

•Strength:

○At one extreme, the points appear to follow a single stream

(whether straight, curved, or bending all over)

○At the other extreme, the points appear as a vague cloud with

no discernible trend or pattern

•Unusual Features:

○Look for the unexpected

○Often the most interesting thing to see in a scatterplot is the

thing you never thought to look for

○One example of such a surprise is an outlier standing away from

the overall pattern of the scatterplot

○Clusters or subgroups should also raise questions

Correlation: Quantifying the Strength of Linear

Association

•Data collected from students in stats classes

included their heights and weights

•The scatterplot of these data shows a positive association and a fairly

straight form, with one high outlier.

•So how strong is the association between

weight and height of stats students?

•If we had to put a number on the strength,

we would not want it to depend on the units

we used because no matter the units, the pattern is

the same

•So since units do not matter, why not remove them?

•We could standardize both variables and write the

coordinates of a point as (z_x, z_y)

•Here is a scatterplot of the standardized weights and

heights

○Note that the underlying linear pattern seems

steeper in the standardized plot than in the

original

○That’s because we made the scales of the axes

the same

○Equal scaling gives a neutral way of drawing

the scatterplot and a fairer impression of the

strength of association

•Points whose two z-scores have the same sign strengthen the impression

of a positive association, points whose z-scores have opposite signs

weaken it, and points with a z-score of zero on either variable don’t

vote either way

•The correlation coefficient (r) gives us a numerical measurement of the

strength of the linear relationship between the explanatory and the

response variables

$$r = \frac{\sum z_x z_y}{n - 1}$$

•(So the formula means: multiply each z_x by its paired z_y, add up all

those products, then divide by the number of data points minus one)
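
The calculation just described can be sketched in a few lines of Python. This is only an illustration, not part of the original notes: the function names `z_scores` and `correlation` are hypothetical, the height and weight values are invented, and the standard deviation is assumed to be the sample version (dividing by n - 1).

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - m) / s for v in values]

def correlation(x, y):
    """r = sum of z_x * z_y over all points, divided by n - 1."""
    zx = z_scores(x)
    zy = z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

# Invented heights (inches) and weights (pounds), for illustration only
heights = [62, 65, 67, 70, 72, 74]
weights = [115, 130, 150, 160, 175, 190]

r = correlation(heights, weights)
print(round(r, 3))  # a value close to +1: strong, positive, linear
```

Because the invented points lie close to a rising straight line, the resulting r is close to +1.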

Correlation Conditions

•Correlation measures the strength of the linear association between

two quantitative variables

•Before you use correlation, you must check several conditions

○Quantitative Variables Condition

Correlation applies only to quantitative variables

Don’t apply correlation to categorical data masquerading

as quantitative

Check that you know the variables’ units and what they

measure

○Straight Enough Condition

You can calculate a correlation coefficient for any pair of

variables

But correlation measures the strength only of the linear

association, and will be misleading if the relationship is

not linear

○Outlier Condition

Outliers can distort the correlation dramatically

An outlier can make an otherwise small correlation look

big or hide a large correlation

It can even give an otherwise positive association a

negative correlation coefficient (and vice versa)

When you see an outlier, it’s often a good idea to report

the correlations with and without that point
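
The Straight Enough and Outlier conditions can be illustrated with a small Python sketch. It is not from the original notes: the helper `correlation` is a hypothetical name, and both data sets are invented purely to show the effect.

```python
from statistics import mean, stdev

def correlation(x, y):
    """r = sum(z_x * z_y) / (n - 1), using sample standard deviations."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (len(x) - 1)

# Straight Enough Condition: a perfect but curved (quadratic) relationship.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [v * v for v in x_curve]
r_curve = correlation(x_curve, y_curve)  # essentially 0 despite the perfect pattern

# Outlier Condition: a weak association, then the same data plus one extreme point.
x = [1, 2, 3, 4, 5, 6]
y = [3, 1, 4, 2, 3, 2]
r_without = correlation(x, y)
r_with = correlation(x + [20], y + [20])

print("curved data:", round(r_curve, 3))
print("without outlier:", round(r_without, 3))
print("with outlier:", round(r_with, 3))
```

Reporting `r_without` and `r_with` side by side follows the advice above: the single added point turns a weak correlation into a strong one.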

Correlation Properties

•The sign of a correlation coefficient gives the direction of the

association

•Correlation is always between -1 and +1

○Correlation can be exactly equal to -1 or +1, but these values

are unusual in real data because they mean that all the data

points fall exactly on a single straight line

○A correlation near zero corresponds to a weak linear association

•Correlation treats x and y symmetrically:

○The correlation of x with y is the same as the correlation of y

with x

•Correlation has no units

•Correlation is not affected by changes in the center or scale of either

variable

○Correlation depends only on the z-scores, and they are

unaffected by changes in center or scale
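
These properties can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original notes: `correlation` is a hypothetical helper, the data are invented, and the rescaling uses ordinary unit conversions (positive scale factors).

```python
from statistics import mean, stdev

def correlation(x, y):
    """r = sum(z_x * z_y) / (n - 1), using sample standard deviations."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (len(x) - 1)

# Invented heights (inches) and weights (pounds), for illustration only
heights_in = [62, 65, 67, 70, 72, 74]
weights_lb = [115, 130, 150, 160, 175, 190]

r = correlation(heights_in, weights_lb)

# Symmetry: swapping the roles of x and y leaves r unchanged
r_swapped = correlation(weights_lb, heights_in)

# Change of scale: convert to centimeters and kilograms
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]
r_metric = correlation(heights_cm, weights_kg)

# Change of center: measure height as inches above 60
r_shifted = correlation([h - 60 for h in heights_in], weights_lb)

print(round(r, 6), round(r_swapped, 6), round(r_metric, 6), round(r_shifted, 6))
```

All four printed values agree up to rounding, since each version of the data produces the same z-scores.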

Correlation DOES NOT EQUAL Causation

•Whenever we have a strong correlation, it is tempting to explain it by

imagining that the predictor variable has caused the response to change

•Scatterplots and correlation coefficients never prove causation

•A hidden variable that stands behind a relationship and determines it

by simultaneously affecting the other two variables is called a lurking

variable
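
A small simulation can show how a lurking variable produces a strong correlation between two variables that do not influence each other. Everything here is an assumption made for illustration: the data are randomly generated, and the names `lurking`, `x`, `y`, and `correlation` are hypothetical.

```python
import random
from statistics import mean, stdev

def correlation(x, y):
    """r = sum(z_x * z_y) / (n - 1), using sample standard deviations."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (len(x) - 1)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 500

# Simulated lurking variable that drives both x and y
lurking = [rng.gauss(0, 1) for _ in range(n)]
x = [z + rng.gauss(0, 0.5) for z in lurking]  # x depends on lurking, not on y
y = [z + rng.gauss(0, 0.5) for z in lurking]  # y depends on lurking, not on x

r_xy = correlation(x, y)

# Subtract the lurking variable's contribution from each variable
x_rest = [a - z for a, z in zip(x, lurking)]
y_rest = [b - z for b, z in zip(y, lurking)]
r_rest = correlation(x_rest, y_rest)

print("x with y:", round(r_xy, 2))
print("after removing the lurking variable:", round(r_rest, 2))
```

In this setup x and y are strongly correlated only because both depend on `lurking`. Once its contribution is subtracted, the remaining correlation is close to zero.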

Also:
