# Ch3.pdf


University of Alberta

Statistics

STAT151

Paul Cartledge

Fall


Ch. 3 – Intro to Correlation and Regression
Ch. 2 deals with univariate data. This chapter, however, considers bivariate data and
how two numerical variables are related. Methods of description are introduced here and
formalized in Ch. 11.
Terminology:
- x: explanatory variable, independent variable, predictor variable
- y: response variable, dependent variable, predicted variable
Notation:
- bivariate sample of size n: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
- sample means: $\bar{x}$, $\bar{y}$
- sample std dev.: $s_x$, $s_y$
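As a small illustration of this notation, the sketch below computes n, the sample means, and the sample standard deviations for the four-point x/y table that these notes use for the scatterplot discussion. The variable names (`sample`, `x_bar`, `s_x`, and so on) are chosen here for readability and are not part of the notes.

```python
from statistics import mean, stdev

# Four-point example sample (the x/y table used for the scatterplot in these notes).
sample = [(1, 1), (2, 2), (4, 1), (3, 2)]

xs = [x for x, _ in sample]
ys = [y for _, y in sample]

n = len(sample)                   # sample size
x_bar, y_bar = mean(xs), mean(ys) # sample means
s_x, s_y = stdev(xs), stdev(ys)   # sample std devs (statistics.stdev divides by n - 1)

print(f"n = {n}")
print(f"x-bar = {x_bar}, y-bar = {y_bar}")
print(f"s_x = {s_x:.3f}, s_y = {s_y:.3f}")
```

Note that `statistics.stdev` is the sample version (divisor n − 1), which matches the n − 1 used in the correlation formula later in the chapter.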
Displaying relationships:
Def’n: An association exists between two variables if a particular value for one variable
is more likely to occur with certain values of the other variable.
A scatterplot is a graphical display of two quantitative variables.
- x-variable goes on the x-axis, y-variable on the y-axis
- origin (0,0) may be included
Look for:
- form of relationship (i.e. any obvious pattern, such as a line or a curve)
- strength of relationship (i.e. how closely the points follow a line)
- direction of relationship (i.e. positive or negative association)
- any unusual observations or outliers
x y
1 1
2 2
4 1
3 2
(graph of above data used to discuss scatterplot traits further)
Correlation:
Def’n: Pearson’s Sample Correlation Coefficient r is given by
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} z_{y_i}$$
where $z_{x_i}$ is the “standardized” observation for $x_i$ and $z_{y_i}$ is the “standardized”
observation for $y_i$, for i = 1, …, n.
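The formula can be checked on the four-point table from the scatterplot discussion. A minimal sketch follows; the helper name `pearson_r` is chosen here and does not come from the notes. For these points the products of deviations are 0.75, −0.25, −0.75 and 0.25, which sum to 0, so r should come out as 0 up to floating-point rounding.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)          # sample std devs (divisor n - 1)
    z_x = [(x - x_bar) / s_x for x in xs]    # standardized x observations
    z_y = [(y - y_bar) / s_y for y in ys]    # standardized y observations
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

# The four (x, y) points from the scatterplot table in these notes.
xs = [1, 2, 4, 3]
ys = [1, 2, 1, 2]

r = pearson_r(xs, ys)
print(f"r = {r:.4f}")
```

An r of 0 here says only that there is no linear trend in these four points; it does not rule out other kinds of pattern.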
(example graphs of correlation drawn in class: 1. strong positive linear; 2. weak positive
linear; 3. strong negative linear; 4. no pattern; 5. parabola; 6. exponential)
Properties of r:
• Can only be calculated for numerical data.
• A measure of the LINEAR relationship between two variables.
• -1 ≤ r ≤ 1
• The magnitude of r (or absolute value) measures the strength of the relationship:
o If r = ±1, then the points fall exactly on a straight line.
o If r = 0, then the pattern of scatter suggests no linear relationship.
• The sign of r indicates the nature of the relationship:
o Positive association if r > 0,
o Negative association if r < 0.
• The two variables x and y play symmetric roles.
• Location and scale invariance (unitless).
• We can have r = 0, even when the data reveal a strong nonlinear relationship.
o e.g. y = x², with x values symmetric about 0
• Correlation does not imply causation (or vice versa).
• Since r depends on the mean and std. dev., it is sensitive to outliers.
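Several of these properties can be checked numerically. The sketch below is illustrative only: the data values are invented for the demonstration and do not come from the course, and `pearson_r` is a helper name chosen here.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation via standardized observations (divisor n - 1)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical, roughly linear data invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

r_xy = pearson_r(xs, ys)
r_yx = pearson_r(ys, xs)                       # symmetric roles of x and y

# Location and scale invariance: shift and rescale by positive factors.
r_rescaled = pearson_r([10 * x + 3 for x in xs],
                       [0.5 * y - 7 for y in ys])

# Exact nonlinear relationship y = x^2 with x symmetric about 0.
par_x = [-2, -1, 0, 1, 2]
par_y = [x ** 2 for x in par_x]
r_parabola = pearson_r(par_x, par_y)

# Sensitivity to outliers: add one point far below the trend.
r_outlier = pearson_r(xs + [6.0], ys + [0.0])

print(f"r(x, y)      = {r_xy:.4f}")
print(f"r(y, x)      = {r_yx:.4f}")
print(f"r rescaled   = {r_rescaled:.4f}")
print(f"r parabola   = {r_parabola:.4f}")
print(f"r w/ outlier = {r_outlier:.4f}")
```

The rescaling uses positive multipliers on purpose: multiplying one variable by a negative constant keeps the magnitude of r but flips its sign.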
3.3 Intro to Simple Linear Regression
Ex3.1) Suppose you had 4 variables for the Oilers roster: height, weight, jersey, age
- which relationships might be valid?
- how can we describe the relationship between any pair?
- how do we use the description to make predictions?
- how do we quantify errors in estimates and predictions?
Def’n: The regression line predicts the value for the response variable y as a straight-line
function of the value x of the explanatory variable.
Equation for the regression line: ŷ = a + bx
- a is the intercept: the height of the line at x = 0.
- b is the slope: the amount by which ŷ changes when x increases by 1 unit (an increase if b > 0, a decrease if b < 0).
- ŷ (“y-hat”) denotes the predicted value of y (or mean y for a given value of x).
What about a new student who gets a mark of 80.1%? There is no such observation, so can we estimate
the final mark based on the pattern of the other observations? Try to fit a line through
the data and use it as a model for final percentage given midterm percentage; then, use
the line to estimate (or, interpolate) the final percentage for a student that gets 8
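The idea of fitting a line and reading a prediction off it can be sketched in code. This excerpt does not say how a and b are chosen, so the sketch assumes the usual least-squares line, where b is the sum of cross-deviations divided by the sum of squared x-deviations and a = ȳ − b·x̄. The midterm and final marks below are hypothetical, invented only to make the example run; `fit_line` and `predict` are names chosen here.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x (assumed method)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept: line passes through (x-bar, y-bar)
    return a, b

def predict(a, b, x):
    """Predicted response y-hat at explanatory value x."""
    return a + b * x

# Hypothetical marks (percent), NOT taken from the course data.
midterm = [55.0, 62.5, 71.0, 78.0, 84.5, 91.0]
final = [58.0, 60.0, 74.5, 75.0, 88.0, 90.5]

a, b = fit_line(midterm, final)
y_hat = predict(a, b, 80.1)  # 80.1 lies inside the observed midterm range

print(f"y-hat = {a:.2f} + {b:.3f} x")
print(f"predicted final for midterm 80.1%: {y_hat:.1f}%")
```

Because 80.1 falls between the smallest and largest observed midterm marks, this is interpolation; using the same line far outside that range would be extrapolation and is much less trustworthy.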
