Ch 7 Scatterplots, Association, and Correlation
We will be investigating the relationship and association between
two quantitative variables (bivariate data), such as height and
weight, the concentration of an injected drug and heart rate, or the
consumption level of some nutrient and weight gain.
Sometimes the purpose of a study is to show that one variable can
explain the outcome of another variable.
- Response (or dependent) variable (symbol: y) - measures
an outcome of a study
- Explanatory (or independent) variable (symbol: x) - explains
or causes changes in the response variable
Example 1: Distinguish the x and y variables
a) What is the effect of rainfall on crop yield?
b) What is the effect of the midterm score on the final grade?
Data:
- we measure x and y for each individual
- observations are recorded in the form (x, y)
- our sample of n bivariate observations is
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
Scatterplot:
- is the best way to start observing the relationship and the
ideal way to picture associations between two quantitative
variables
- is a plot of pairs of observed values of two different
quantitative variables. It helps to evaluate the quality of the
- The x-axis is the horizontal axis and the y-axis is the vertical
axis.
- Each observation is then plotted according to its value from
the x variable and its value from the y variable.
Does the number of years invested in schooling pay off in the job
market?
Thought: the better educated you are, the more money you will
earn.
The data in the following table give the median annual income of
full-time workers age 25 or older by the number of years of
schooling.
x = Years of Schooling y = Salary (dollars)
Create a scatterplot for x and y.
[Figure: Scatterplot for salary vs. years of schooling]
NOTE: If you want to make a scatterplot with more than 1 group,
then use different symbols for each group.
NOTE: The axes need not intersect at (0, 0).
Examining a Scatterplot:
In any graph of data, look for the overall pattern and for striking
deviations (e.g., outliers) from this pattern. You can describe the
overall pattern of a scatterplot by the form, direction, and strength
of the relationship.
1) Form of relationship
- linear – where the points roughly follow a straight line
- curved relationship and clusters
2) Strength of the Relationship
- determined by how close the points in the scatterplot lie to a
simple form such as a line
- the closer the observations appear to fit a line, the stronger the
relationship
3) Direction (positive and negative associations)
- 2 variables are positively associated when, as x increases, y also
increases
- 2 variables are negatively associated when, as x increases, y
decreases
4) Outliers or unusual observations
- look for any striking deviations from the overall pattern
Describe the pattern of the scatterplot above.
If the scatterplot shows a reasonable linear relationship, calculate
the correlation coefficient to evaluate the direction and strength of
the linear relationship between two numerical variables.
Correlation coefficient r:
- a numerical measurement of the strength of the linear
relationship between the explanatory and response variables
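$$r = \frac{\sum z_x z_y}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$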
- This is the sum of the products of the standardized values
for each paired observation, all divided by n – 1.
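The formula above can be sketched in a few lines of Python. This is only an illustration: the function names and the small data set (hours studied vs. test score) are made up for the example and are not the salary data from these notes.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """r = (sum of products of paired z-scores) / (n - 1)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    z_products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)  # z_x * z_y for one pair
        for x, y in zip(xs, ys)
    ]
    return sum(z_products) / (n - 1)


# Hypothetical data, invented for this sketch
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 77]

r = correlation(hours, score)
print(f"r = {r:.3f}")
```

A value of r close to 1 here would be read as a strong positive linear association, provided the scatterplot looks roughly linear.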
Example: Calculate the correlation coefficient between years of
schooling and salary. What does this number imply?
x = Years of Schooling y = Salary (dollars)
NOTE: Summary statistics:
Column   n   Mean         Variance      Std. Dev.   Sum
x        6   13.166667    16.166666     4.020779    79
y        6   27633.334    6.8718664E7   8289.672    165800
Facts about Pearson's correlation coefficient (r):
1) Correlation measures the strength of a linear relationship
between two quantitative variables. Check a scatterplot first.
a. Correlation requires both variables to be numerical; it
cannot be applied to categorical data
b. Correlation does NOT apply to nonlinear relations
c. Outliers can distort the correlation dramatically
2) Correlation makes no distinction between explanatory and
response variables, i.e., the correlation of x with y is the
same as the correlation of y with x.
3) Correlation has no units
4) Correlation is a number between –1 and 1
5) The absolute value of the coefficient measures how closely
the variables are linearly related.
The closer | r | is to 1, the stronger the linear relationship.
| r | > 0.8 indicates a strong correlation between the variables.
r ≈ 0 indicates a weak linear association
6) Like the mean and standard deviation, the correlation is
strongly affected by outliers.
7) Correlation is not affected by changes in the center or scale
of either variable.
- Correlation depends only on the z-scores, and they are
unaffected by changes in center or scale.
8) The sign of the correlation coefficient tells you the direction of
the trend in the relationship.
r > 0 indicates a positive relation between the variables
r < 0 indicates a negative relation between the variables
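Several of the facts above (2, 4, 7, and 8) can be checked numerically. The sketch below uses made-up data (temperature vs. sales) and a compact form of r in which the n – 1 factors have been cancelled; it is algebraically the same as the z-score formula. Note the last check: rescaling by a negative constant flips the sign of r, so "unaffected by scale" holds for positive scale changes.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson's r, written with sums of squares (the n - 1 terms cancel)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical data, invented for this sketch
temp_c = [10, 15, 20, 25, 30]
sales = [12, 18, 21, 30, 33]

r_xy = correlation(temp_c, sales)          # x with y
r_yx = correlation(sales, temp_c)          # y with x (fact 2: same value)

# Fact 7: change center and scale (Celsius to Fahrenheit); r is unchanged
temp_f = [1.8 * c + 32 for c in temp_c]
r_f = correlation(temp_f, sales)

# Fact 8: reversing the direction of one variable flips the sign of r
r_neg = correlation(temp_c, [-s for s in sales])

print(f"r(x, y)         = {r_xy:.4f}")
print(f"r(y, x)         = {r_yx:.4f}")
print(f"r(Fahrenheit)   = {r_f:.4f}")
print(f"r(x, -y)        = {r_neg:.4f}")
```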
Straightening Scatterplots (Ch 10)
- Correlation measures the strength of straight-line (linear)
relationships only. When a scatterplot shows a bent form that
consistently increases or decreases, we can often straighten the
form of the plot by re-expressing one or both variables.
- We can often find transformations that straighten the
scatterplot's form.
[Figure: two panels, y vs x and ln(y) vs x]
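To see the effect of re-expression on r, here is a small sketch. The data are artificial: y is generated as an exact exponential function of x (the constants 2.0 and 0.8 are arbitrary choices for the example), so ln(y) is exactly linear in x. Real data would straighten only approximately.

```python
from math import exp, log, sqrt


def correlation(xs, ys):
    """Pearson's r, written with sums of squares (the n - 1 terms cancel)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Artificial curved data: y grows exponentially with x
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [2.0 * exp(0.8 * x) for x in xs]

# Re-express the response: replace y with ln(y)
log_ys = [log(y) for y in ys]

r_raw = correlation(xs, ys)      # bent form, so r understates the association
r_log = correlation(xs, log_ys)  # straightened form

print(f"r for y vs x     = {r_raw:.3f}")
print(f"r for ln(y) vs x = {r_log:.3f}")
```

The association between x and y is perfect in both cases; only after straightening does r reflect that, because r measures linear association only.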
It is common in some fields to compute the correlations between
every pair of variables in a collection of variables and arrange
these correlations in a table.
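A correlation table of this kind could be built as follows. The variable names and values are invented for the illustration. Every diagonal entry is 1 (each variable correlates perfectly with itself), and the table is symmetric because the correlation of x with y equals the correlation of y with x.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson's r, written with sums of squares (the n - 1 terms cancel)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical measurements on five individuals
variables = {
    "height": [150, 160, 165, 172, 180],
    "weight": [52, 58, 63, 70, 80],
    "age": [34, 25, 41, 29, 38],
}
names = list(variables)

# table[a][b] holds the correlation between variable a and variable b
table = {
    a: {b: correlation(variables[a], variables[b]) for b in names}
    for a in names
}

# Print the table with one row and one column per variable
print(" " * 8 + "".join(f"{name:>8}" for name in names))
for a in names:
    print(f"{a:<8}" + "".join(f"{table[a][b]:8.3f}" for b in names))
```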
Ch 8 Linear Regression & Ch 9 Regression Wisdom
Idea: To fit a straight line through the data so that we can predict
values of the response at specified values of x.
When we have one dependent variable and one independent
variable and the relationship between two variables follows a
linear pattern, it is possible to describe the relationship by a
straight line and by an equation of the form:
$y = b_0 + b_1 x$
where $b_0$ is called the y-intercept and $b_1$ the slope of the equation.
The b’s are called the coefficients of the linear model.
The slope is the amount by which y increases when x increases by
1 unit.
[Figure: Salary vs Years of Schooling, with Years of Schooling on the x-axis]
How do we find the line that best describes the linear
relationship?
Estimate: $\hat{y} = b_0 + b_1 x$
- gives an estimate (predicted response) for y for a given
value of x
- $\hat{y} = b_0 + b_1 x$ is called the line of best fit or the least squares
line.
Note 1: In general, $\hat{y} \neq y$. The vertical distance from a data point (x, y) to
the line is called the error of prediction or deviation or
residual.
Deviation of the i-th data point $(x_i, y_i)$ is:
$$y_i - \hat{y}_i = y_i - b_0 - b_1 x_i$$
- A negative residual means the predicted value is too big
(an overestimate)
- A positive residual means the predicted value is too
small (an underestimate)
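The residual calculation and the sign convention can be sketched as follows. The data and the line (b0 = 1, b1 = 2) are made up for the illustration; this line is an arbitrary choice, not a least squares fit.

```python
def residuals(xs, ys, b0, b1):
    """Residual for each point: observed y minus predicted y-hat."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Hypothetical data and an arbitrary (not fitted) line y-hat = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 6, 10]
b0, b1 = 1.0, 2.0

res = residuals(xs, ys, b0, b1)

for x, y, e in zip(xs, ys, res):
    if e > 0:
        verdict = "prediction too small (underestimate)"
    elif e < 0:
        verdict = "prediction too big (overestimate)"
    else:
        verdict = "prediction exact"
    print(f"x = {x}, y = {y}, y-hat = {b0 + b1 * x}, residual = {e}: {verdict}")
```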
Note 2: For the least squares line, the sum of the residuals is
always 0. Thus, we can’t assess how well the line fits by adding
up all the residuals.
Note 3: Similar to what we did with deviations, we square the
residuals and add the squares.
Note 4: the smaller the sum, the better the fit.
Conclusion: The best fitted line is the one that minimizes the
sum of the squared differences between the data points and the
line:
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$
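The least squares idea can be checked numerically. The sketch below uses the standard closed-form solution, b1 = Sxy / Sxx and b0 = y-bar – b1 · x-bar, which is not derived in this section and is assumed here. The data are made up. The code then confirms two things from the notes: the residuals of the fitted line sum to 0, and nudging either coefficient makes the sum of squared residuals larger.

```python
def least_squares(xs, ys):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept: the line passes through (x-bar, y-bar)
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """Sum over all points of (y - (b0 + b1 * x)) squared."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for this sketch
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(xs, ys)
residual_sum = sum(y - (b0 + b1 * x) for x, y in zip(xs, ys))
best = sum_squared_residuals(xs, ys, b0, b1)

# Compare against lines with slightly different intercept and/or slope
offsets = [(d0, d1) for d0 in (-0.1, 0.0, 0.1) for d1 in (-0.1, 0.0, 0.1)
           if (d0, d1) != (0.0, 0.0)]
nudged = [sum_squared_residuals(xs, ys, b0 + d0, b1 + d1) for d0, d1 in offsets]

print(f"fitted line: y-hat = {b0:.3f} + {b1:.3f} x")
print(f"sum of residuals = {residual_sum:.2e}")
print(f"sum of squared residuals = {best:.4f}")
print(f"smallest value among nudged lines = {min(nudged):.4f}")
```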