
Lecture 22


University of Ottawa

Administration

ADM2304

Tony Quon

Fall

Description

Multicollinearity:
Multicollinearity occurs when at least one predictor is, or is close to being, a linear combination of the other predictor variables.
Pairwise collinearity occurs when two predictors are highly correlated (as on the previous slide).
Multicollinearity affects our interpretation of the coefficients but does not affect our predictions.
The easiest way to deal with multicollinearity is simply to drop some predictors; being near-linear combinations of the others, they add little to the model anyway.
Detecting Multicollinearity:
Multicollinearity can be detected by regressing each predictor variable Xj on the other predictor variables.
Make Xj the response variable and determine whether it can be derived as a linear combination of the other predictors.
We then calculate the Variance Inflation Factor, VIFj = 1/(1 − Rj²), where Rj² is the multiple coefficient of determination in the model with Xj as the response variable.
If the VIF is 100 or higher (i.e., Rj² ≥ .99), we say that multicollinearity is severe.
Any VIF greater than 10 suggests that multicollinearity may be a problem.
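The VIF calculation above can be sketched in NumPy; this is an illustration, not part of the notes (the function name `vif` and the synthetic data are my own, and the notes themselves use Minitab). Each column is regressed on the others with an intercept, and VIFj = 1/(1 − Rj²) is computed from the resulting Rj².

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of predictor matrix X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining predictors (with an intercept).
    """
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]                                  # make X_j the response
        others = np.delete(X, j, axis=1)             # remaining predictors
        A = np.column_stack([np.ones(n), others])    # add intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data: two nearly collinear predictors plus an unrelated one
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.05, size=50)    # x2 is almost exactly x1
x3 = rng.normal(size=50)
X = np.column_stack([x1, x2, x3])
print(vif(X))    # first two VIFs are large (> 10), third is near 1
```

With x2 almost equal to x1, their VIFs exceed the rule-of-thumb cutoff of 10, while the independent x3 stays near 1.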
Leverage and Influence:
The leverage of an observation measures how far its x-value is from the centre of the x-values.
The farther away it is, the more “leverage” it has on the resulting regression line.
For a simple linear model, the leverage of observation i is:
hi = 1/n + (xi − x̄)² / Σj (xj − x̄)²
Hard to Notice!
High-leverage points are often masked if we look only at the residual plot, since these cases tend to have such high “influence” on the regression line that the resulting residuals are not outliers.
Leverage Values:
Leverage values range from 0 to 1.
A boxplot of the leverage values helps to determine if any observations have particularly outlying values.
An observation has “high leverage” if its value exceeds 2(k+1)/n, where k is the number of predictor variables in the model.
High leverage values indicate observations with potentially high influence on the regression model, changing its coefficients or affecting the predictions.
Measures of Influence:
For each observation i, Cook’s Distance adds up the squared changes in all the predicted values when observation i is excluded from the regression estimation, and normalizes the sum by (k + 1) times the residual variance.
Formula:
Di = Σj (Ŷj − Ŷj(i))² / ((k + 1)se²)
Ŷj(i) is the predicted value of Yj based on the regression that omits the ith observation.
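Cook’s Distance as defined above can be computed directly by leave-one-out refits; this NumPy sketch is my own illustration (the function name `cooks_distance` and the planted outlier are hypothetical, and statistical software computes the same quantity without refitting n times).

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D_i = Σ_j (ŷ_j - ŷ_j(i))² / ((k+1)·s_e²), via leave-one-out refits."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])          # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    yhat = A @ beta                               # fitted values, full model
    se2 = np.sum((y - yhat) ** 2) / (n - k - 1)   # residual variance
    D = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b_i, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        yhat_i = A @ b_i          # predictions for all n points, model without i
        D[i] = np.sum((yhat - yhat_i) ** 2) / ((k + 1) * se2)
    return D

# Synthetic data with one planted influential point
rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 2 + 3 * x + rng.normal(scale=0.5, size=30)
x[0], y[0] = 6.0, -10.0           # far from the centre AND far off the line
D = cooks_distance(x.reshape(-1, 1), y)
print(D.round(3))                 # D[0] dwarfs all the others
```

The planted point combines high leverage (x = 6, far from the other x-values) with a large deviation from the true line, so its Cook’s Distance dominates.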
Use of Minitab:
In the Regression dialog box, select the “Storage” button, and click on “Hi leverages” or “Cook’s Distance”.
You will see the values calculated in the worksheet for each observation.
Recall that Minitab will flag them for you, using “R” to denote points with high residuals and “X” to denote points whose x-value makes them influential.
Example: GPA of Computer Science Majors:
We want to predict Y = GPA of computer science majors given:
X1 = SATM (math SAT score)
X2 = SATV (verbal SAT score)
X3 = HSM (High school math mark)
X4 = HSS (High school socials mark)
X5 = HSE (High school English mark)
What subset of these predictors best predicts Y?
Summary Statistics:
[Summary statistics table (Variable, N, …) not recoverable from the extract]
