# ART111 Lecture Notes - Lecture 2: Bayes Error Rate, Overfitting, Tikhonov Regularization


Homework 3, due date: 10/25/2018

Question 1: Suppose we estimate the regression coefficients in a linear regression model by minimizing

∑ᵢ₌₁ⁿ (yᵢ − β₀ − ∑ⱼ₌₁ᵖ βⱼxᵢⱼ)²

subject to ∑ⱼ₌₁ᵖ |βⱼ| ≤ s for a particular value of s.

(a) As we increase s from 0, the training RSS will: Steadily decrease. As s increases from 0, the βs grow from 0 toward their least squares estimates. Training RSS is at its maximum when all βs are 0 and steadily decreases to the ordinary least squares (OLS) RSS.

(b) As we increase s from 0, the test RSS will: Decrease initially, then eventually start increasing in a U shape. When s = 0, all βs are 0, so the model is extremely simple and has a high test RSS. As we increase s, the βs take non-zero values and the model starts capturing the underlying signal, so test RSS decreases. Eventually, as the βs approach their full OLS values, the model starts overfitting the training data and test RSS increases.

(c) As we increase s from 0, variance will: Steadily increase. When s = 0, the model effectively predicts a constant and has almost no variance. As we increase s, more βs enter the model and their values grow. The coefficient estimates become increasingly dependent on the particular training data, so variance increases.

(d) As we increase s from 0, (squared) bias will: Steadily decrease. When s = 0, the model effectively predicts a constant, so predictions are far from the actual values and bias is high. As s increases, more βs become non-zero and the model can fit the underlying relationship better, so bias decreases.

(e) As we increase s from 0, the Bayes error rate will: Remain constant. By definition, the irreducible error does not depend on the model, so it remains constant irrespective of the choice of s.
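The claim in (a) can be checked numerically. The sketch below uses ridge regression as a stand-in for the constrained fit, since it has a closed form; the lasso behaves the same way qualitatively (a smaller penalty λ corresponds to a larger budget s). All variable names and values here are illustrative, not part of the assignment.

```r
# Illustration of (a): as the constraint loosens (s up, penalty lambda down),
# training RSS falls toward the OLS RSS. Ridge is used because it has a
# closed-form solution: beta = (X'X + lambda I)^{-1} X'y.
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1, 0, 0)
y <- X %*% beta + rnorm(n)

ridge_rss <- function(lambda) {
  b <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
  sum((y - X %*% b)^2)
}

lambdas <- c(100, 10, 1, 0.1, 0)   # shrinking penalty = growing budget s
rss <- sapply(lambdas, ridge_rss)
print(round(rss, 2))               # training RSS steadily decreases
```

The last entry (λ = 0) is exactly the OLS RSS, the floor that the training error approaches as s grows.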

Question 2: It is well-known that ridge regression tends to give similar coefficient values to correlated

variables, whereas the lasso may give quite different coefficient values to correlated variables. We will

now explore this property in a very simple setting. Suppose that n = 2, p = 2, x11 = x12, x21 = x22.

Furthermore, suppose that y1+y2 = 0 and x11+x21 = 0 and x12+x22 = 0, so that the estimate for the intercept

in a least squares, ridge regression, or lasso model is zero: β̂₀ = 0.
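A quick numeric check of the ridge half of the claim, using one concrete choice of values satisfying the constraints above (x11 = x12 = 1, x21 = x22 = −1, y1 = 2, y2 = −2; these particular numbers are my own choice, not part of the question):

```r
# Two identical predictor columns, responses summing to zero.
X <- matrix(c(1, -1,    # column 1: x11, x21
              1, -1),   # column 2: x12, x22
            nrow = 2)
y <- c(2, -2)

lambda <- 1
# Closed-form ridge solution (no intercept, since beta0-hat = 0):
beta_ridge <- solve(t(X) %*% X + lambda * diag(2), t(X) %*% y)
print(beta_ridge)   # both coefficients equal 0.8 for lambda = 1
```

Because the two columns are identical, the ridge penalty β₁² + β₂² is minimized by splitting the fit equally, so β̂₁ = β̂₂. The lasso penalty |β₁| + |β₂| is indifferent to how a given sum β₁ + β₂ is split between same-signed coefficients, which is why the lasso coefficients need not be equal.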

Question 3: In this exercise, we will generate simulated data, and will then use this data to perform best

subset selection.

(a) Use the rnorm() function to generate a predictor x of length n = 100, as well as a noise vector ǫ of

length n = 100.

> set.seed(1)

> x <- rnorm(100)

> eps <- rnorm(100)

(b) Generate a response vector Y of length n = 100 according to the model Y = β₀ + β₁x + β₂x² + β₃x³ + ǫ, where β₀, β₁, β₂, and β₃ are constants of your choice.

Assume β0=1, β1=2, β2=−1, and β3=0.5

> b0 <- 1

> b1 <- 2

> b2 <- -1

> b3 <- 0.5

> y <- b0 + b1 * x + b2 * x^2 + b3 * x^3 + eps

(c) Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors x, x², . . ., x¹⁰. What is the best model obtained according to Cp, BIC, AIC, and adjusted R²? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained.

> library(leaps)

> data.full <- data.frame(y = y, x = x)

> regfit.full <- regsubsets(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full, nvmax = 10)

> reg.summary <- summary(regfit.full)
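From here, the model size favored by each criterion is typically read off the summary object; the sketch below shows the usual next steps (the data-generating code from parts (a)-(b) is repeated so the block runs on its own).

```r
library(leaps)

# Re-create the data from parts (a)-(b) so this block is self-contained.
set.seed(1)
x <- rnorm(100)
eps <- rnorm(100)
y <- 1 + 2 * x - x^2 + 0.5 * x^3 + eps

data.full <- data.frame(y = y, x = x)
regfit.full <- regsubsets(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) +
                            I(x^7) + I(x^8) + I(x^9) + I(x^10),
                          data = data.full, nvmax = 10)
reg.summary <- summary(regfit.full)

# Model size favored by each criterion (minimize Cp/BIC, maximize adjusted R^2):
size.cp    <- which.min(reg.summary$cp)
size.bic   <- which.min(reg.summary$bic)
size.adjr2 <- which.max(reg.summary$adjr2)

# Plots of each criterion against model size, as the question asks:
plot(reg.summary$cp,    xlab = "Number of variables", ylab = "Cp",     type = "l")
plot(reg.summary$bic,   xlab = "Number of variables", ylab = "BIC",    type = "l")
plot(reg.summary$adjr2, xlab = "Number of variables", ylab = "Adj R2", type = "l")

# Coefficients of, e.g., the model size chosen by BIC:
coef(regfit.full, size.bic)
```

Note that regsubsets() does not report AIC directly; Cp, BIC, and adjusted R² are available in the summary object.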