MISY261 Final: Exam 4 Prep

Management Information Systems
Alex Chaplin

MISY262 Exam 4 Prep

Overfitting and Variable Selection

Overfitting is adding too many variables
  o Basically overcustomizing the model
  o Reduces out-of-sample performance in test data
  o Can adversely affect all types of models

Sure Signs of Overfitting:
  o Really high R^2 (close to 1)
  o Much worse performance in test set than training set
      o Poor stability
      o Fixing overfitting can improve stability without necessarily losing accuracy

How to Avoid Overfitting

Linear/Logistic Regression
  o Only add variables that make a substantial improvement in the explanatory power of the model (R^2)
  o Use variable selection to eliminate unnecessary variables

Hierarchical Selection
  o Add variables known to be useful
  o Add variables one at a time, best first
  o Use R^2 to determine how much each addition improves the model

Forced Entry
  o Create model based on best guess
  o Enter all variables at the same time
  o Run the model once and assume it is accurate

All-Subsets Selection
  o Run models for every subset of variables
  o Choose the model that fits the data best
  o Problem: exhausting potential
      o The best one may just look the best by chance

Forward Stepwise Regression
  o Start with no variables
  o At each step, choose the addition that will most improve R^2
  o End when no major improvement comes from an addition

Backward Stepwise Selection
  o Start with all variables
  o At each step, remove the variable that contributes the least to model fit

Issues with Stepwise Selection:
  o Relies too much on model fit
      o We want a generalized relationship
      o This R^2 relies on this specific data
  o Researchers prefer it
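
Example (not from the course materials): the forward stepwise steps above can be sketched in plain Python. This is only an illustration under stated assumptions. The data is synthetic, the helper names (solve, r_squared, forward_stepwise) are made up for this sketch, and the min_gain cutoff of 0.01 is an arbitrary stand-in for "no major improvement".

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def r_squared(X, y, cols):
    """Training R^2 of a least-squares fit of y on an intercept plus columns `cols`."""
    rows = [[1.0] + [row[c] for c in cols] for row in X]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    pred = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def forward_stepwise(X, y, min_gain=0.01):
    """Start with no variables; add the best one until R^2 gains fall below min_gain."""
    selected, best_r2 = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        scores = {c: r_squared(X, y, selected + [c]) for c in remaining}
        best = max(scores, key=scores.get)
        if scores[best] - best_r2 < min_gain:
            break  # no major improvement from any addition
        selected.append(best)
        remaining.remove(best)
        best_r2 = scores[best]
    return selected, best_r2

# Synthetic data (assumption): only columns 0 and 2 actually drive y.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [3 * row[0] - 2 * row[2] + random.gauss(0, 0.5) for row in X]

selected, r2 = forward_stepwise(X, y)
print("selected columns:", selected, "training R^2:", round(r2, 3))
```

Note that training R^2 never goes down when a variable is added, even a useless one, so the stopping rule has to ask for a meaningful gain rather than any gain. That is the same reason the notes warn against relying only on model fit: this R^2 is measured on the data used to build the model.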
