Class Notes: Lecture 22
5 Pages

School: University of Ottawa
Professor: Tony Quon
Semester: Fall
Description
Multicollinearity:
- Multicollinearity occurs when at least one predictor is, or is close to being, a linear combination of the other predictor variables.
- Pairwise collinearity occurs when two predictors are highly correlated (as in the previous slide).
- Multicollinearity affects our interpretation of the coefficients but does not affect our predictions.
- The easiest way to deal with multicollinearity is simply to drop some predictors; they don't add anything to the model anyway.

Detecting Multicollinearity:
- Multicollinearity can be detected by regressing each predictor variable Xj on the other predictor variables: make Xj the response variable and determine whether it can be derived as a linear combination of the other predictors.
- We then calculate the Variance Inflation Factor, VIFj = 1/(1 - Rj^2), where Rj^2 is the multiple coefficient of determination of the model with Xj as the response variable.
- If the VIF is 100 or higher (i.e., Rj^2 > 0.99), we say that multicollinearity is severe; any VIF greater than 10 suggests that multicollinearity may be a problem.

Leverage and Influence:
- The leverage of an observation measures how far its x-value is from the centre of the x-values; the farther away it is, the more "leverage" it has on the resulting regression line.
- For a simple linear model, the leverage of observation i is hi = 1/n + (xi - x̄)^2 / Σj (xj - x̄)^2.

Hard to Notice!
- High leverage points are often masked if we look only at the residual plot, since these cases tend to have such high "influence" on the regression line that the resulting residuals are not outliers.

Leverage Values:
- Leverage values range from 0 to 1.
- A boxplot of the leverage values helps to determine whether any observations have particularly outlying values.
- An observation has "high leverage" if its leverage value exceeds 2(k+1)/n, where k is the number of predictor variables in the model.
- High leverage values indicate observations with potentially high influence on the regression model, changing its coefficients or affecting its predictions.

Measures of Influence:
- For each observation i, Cook's Distance adds up the squared changes in all the predicted values when observation i is excluded from the regression estimation, and normalizes by the total variance of the fitted values.
- Formula: Di = Σj (Ŷj - Ŷj(i))^2 / ((k + 1) se^2), where Ŷj(i) is the predicted value of Yj based on the regression that omits the ith observation.

Use of Minitab:
- In the Regression dialog box, select the "Storage" button and click on "Hi leverages" or "Cook's Distance"; the calculated values for each observation will appear in the worksheet.
- Recall that Minitab will flag points for you, using "R" to denote points with high residuals and "X" to denote points whose x-value makes them influential.

Example: GPA of Computer Science Majors:
- We want to predict Y = GPA of computer science majors given:
  X1 = SATM (math SAT score)
  X2 = SATV (verbal SAT score)
  X3 = HSM (high school math mark)
  X4 = HSS (high school socials mark)
  X5 = HSE (high school English mark)
- What subset of these predictors best predicts Y?

Summary Statistics:
Variable  N
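The VIF diagnostic described in the notes (regress each Xj on the other predictors, then compute 1/(1 - Rj^2)) can be sketched in a few lines of Python. This is a minimal illustration using only numpy; the function name `vif` is my own, not something from the course.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of a predictor matrix X.

    For each column j, regress X[:, j] on the remaining columns (plus an
    intercept), take the R^2 of that auxiliary regression, and return
    VIF_j = 1 / (1 - R_j^2).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]                                   # X_j as the response
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        sst = (y - y.mean()) @ (y - y.mean())         # total (centered) SS
        r2 = 1.0 - (resid @ resid) / sst              # R_j^2
        vifs.append(1.0 / (1.0 - r2))
    return vifs
```

On data where one column is (nearly) the sum of two others, its VIF comes out in the thousands, well past the "severe" threshold of 100; roughly uncorrelated predictors give VIFs close to 1.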
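The leverages and Cook's distances above can both be read off the hat matrix H = A(A'A)^(-1)A' of the fitted model. Here is a minimal numpy sketch; the function name `leverage_cooks` is mine, and it uses the standard closed form Di = ei^2 hi / ((k+1) se^2 (1 - hi)^2), which is algebraically equivalent to the sum-of-squared-prediction-changes definition in the notes, rather than refitting the regression n times.

```python
import numpy as np

def leverage_cooks(X, y):
    """Leverages h_i and Cook's distances D_i for OLS with an intercept."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])        # design matrix with intercept
    H = A @ np.linalg.inv(A.T @ A) @ A.T        # hat matrix
    h = np.diag(H)                              # leverages, each in (0, 1)
    e = y - H @ y                               # residuals
    p = A.shape[1]                              # k + 1 estimated coefficients
    s2 = (e @ e) / (n - p)                      # s_e^2, the mean squared error
    D = (e**2 / (p * s2)) * h / (1 - h)**2      # Cook's distance
    return h, D
```

For a simple linear model, the diagonal of H reproduces the formula from the notes, hi = 1/n + (xi - x̄)^2 / Σj (xj - x̄)^2, and an observation far from x̄ with a large residual dominates both diagnostics.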