ADM2304 Lecture 22

University of Ottawa
Tony Quon

Multicollinearity
- Multicollinearity occurs when at least one predictor is, or is close to being, a linear combination of the other predictor variables.
- Pairwise collinearity occurs when two predictors are highly correlated (as in the previous slide).
- Multicollinearity affects our interpretation of the coefficients but does not affect our predictions.
- The easiest way to deal with multicollinearity is simply to drop some predictors; they don't add anything to the model anyway.

Detecting Multicollinearity
- Multicollinearity can be detected by regressing each predictor variable Xj on the other predictor variables: make Xj the response variable and determine whether it can be derived as a linear combination of the other predictors.
- We then calculate the Variance Inflation Factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 is the multiple coefficient of determination in the model with Xj as the response variable.
- If the VIF is 100 or higher (i.e., R_j^2 >= 0.99), we say that multicollinearity is severe; any VIF greater than 10 suggests that multicollinearity may be a problem. (A small sketch of this calculation appears after these notes.)

Leverage and Influence
- The leverage of an observation measures how far its x-value is from the centre of the x-values; the farther away it is, the more "leverage" it has on the resulting regression line.
- For a simple linear model, the leverage of observation i is: h_i = 1/n + (x_i - x̄)^2 / Σ_j (x_j - x̄)^2
- Hard to notice! High-leverage points are often masked if we look only at the residual plot, since these cases tend to have such high "influence" on the regression line that the resulting residuals are not outliers.

Leverage Values
- Leverage values range from 0 to 1.
- A boxplot of the leverage values helps to determine whether any observations have particularly outlying values.
- An observation has "high leverage" if its value exceeds 2(k + 1)/n, where k is the number of predictor variables in the model.
- High leverage values indicate observations with potentially high influence on the regression model, changing its coefficients or affecting the predictions.

Measures of Influence
- For each observation i, Cook's Distance adds up the squared changes in all the predicted values when observation i is excluded from the regression estimation, and normalizes this sum by the total variance of the fitted values.
- Formula: Cook's D_i = Σ_j (Ŷ_j - Ŷ_j(i))^2 / ((k + 1) s_e^2), where Ŷ_j(i) is the predicted value of Y_j based on the regression that omits the ith observation. (The second sketch after these notes computes leverage and Cook's distance together.)

Use of Minitab
- In the Regression dialog box, select the "Storage" button and click on "Hi leverages" or "Cook's Distance"; the calculated values appear in the worksheet for each observation.
- Recall that Minitab will flag points for you, using "R" to denote points with high residuals and "X" to denote points whose x-value makes them influential.

Example: GPA of Computer Science Majors
- We want to predict Y = GPA of computer science majors given:
  X1 = SATM (math SAT score)
  X2 = SATV (verbal SAT score)
  X3 = HSM (high school math mark)
  X4 = HSS (high school social studies mark)
  X5 = HSE (high school English mark)
- What subset of these predictors best predicts Y?

Summary Statistics
- [Summary statistics table truncated in the source.]
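The VIF check described above can also be carried out outside of Minitab. Below is a minimal sketch in Python (an illustrative choice, not the tool used in the course), assuming simulated predictors and the statsmodels OLS routine; the variable names x1, x2, x3 are made up for illustration.

```python
# Sketch of the VIF diagnostic: regress each predictor on the others, take R_j^2,
# and compute VIF_j = 1 / (1 - R_j^2). Data are simulated purely for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2 * x1 - x2 + rng.normal(scale=0.1, size=n)  # nearly a linear combination of x1 and x2
X = np.column_stack([x1, x2, x3])

for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)                          # all predictors except X_j
    r2_j = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
    print(f"VIF_{j + 1} = {1 / (1 - r2_j):.1f}")              # all three are inflated here
```

By the rule of thumb in the notes, any VIF above 10 printed here would flag a potential multicollinearity problem, and values of 100 or more would indicate severe multicollinearity.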
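Likewise, a sketch of the leverage and Cook's distance diagnostics, again in Python on simulated data (not the Minitab workflow from the lecture); the get_influence() accessor in statsmodels supplies both quantities, and the 2(k + 1)/n cutoff is the rule of thumb quoted above.

```python
# Sketch of the leverage and Cook's distance checks for a simple linear model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
x[0] = 6.0                                   # one point far from the centre of the x-values
y = 2 + 1.5 * x + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(x)).fit()
infl = fit.get_influence()

leverage = infl.hat_matrix_diag              # h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
cooks_d, _ = infl.cooks_distance             # Cook's D_i for each observation

k = 1                                        # number of predictors in the model
threshold = 2 * (k + 1) / n                  # "high leverage" rule of thumb from the notes
print("High-leverage observations:", np.where(leverage > threshold)[0])
print("Largest Cook's distance:", cooks_d.max(), "at observation", cooks_d.argmax())
```

The deliberately extreme x-value at observation 0 shows up with both high leverage and the largest Cook's distance, mirroring the point in the notes that such cases may not stand out in a residual plot.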