STA302H1 Study Guide - Final Guide: Simple Linear Regression, Statistical Inference, Type I And Type Ii Errors

Chapter 5: Multiple Linear Regression
Estimation and Inference in Multiple Linear
Regression
1. Model: $Y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + e_i$, $e_i \sim \text{iid } N(0, \sigma^2)$, $i = 1, 2, \ldots, n$
$e_i$: random fluctuation (error) in $Y_i$ such that $E(e_i \mid X) = 0$
p + 2 parameters: $\beta_0, \beta_1, \ldots, \beta_p, \sigma^2$
p coefficients: $\beta_1, \beta_2, \ldots, \beta_p$
$E(Y_i \mid X) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}$
2. Matrix Formulation of Least Squares Estimates
$\mathbf{Y} = (y_1, y_2, \ldots, y_n)'$, $\mathbf{X}$: the $n \times (p+1)$ matrix whose ith row is $(1, x_{1i}, \ldots, x_{pi})$, $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)'$, $\mathbf{e} = (e_1, e_2, \ldots, e_n)'$
$\Rightarrow \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}$, where $\mathrm{var}(\mathbf{e}) = \sigma^2 I_n$
i. $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$
$\Rightarrow E(\hat{\boldsymbol{\beta}} \mid X) = \boldsymbol{\beta}$, $\mathrm{var}(\hat{\boldsymbol{\beta}} \mid X) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$
ii. $\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$, $\hat{\mathbf{e}} = \mathbf{Y} - \hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}$
$s^2 = \frac{RSS}{n - p - 1} = \frac{1}{n - p - 1}\sum_{i=1}^{n} \hat{e}_i^2$ (unbiased estimator of $\sigma^2$)
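As a check on these formulas, here is a minimal R sketch on simulated data (all variable names are made up for illustration), computing $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$ by hand and comparing it with lm():

# Hand-computed least squares vs. lm(), hypothetical simulated data
set.seed(1)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2*x1 - 0.5*x2 + rnorm(n)            # true coefficients known only because we simulated
X  <- cbind(1, x1, x2)                        # design matrix with a column of 1s
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # (X'X)^{-1} X'Y
fit <- lm(y ~ x1 + x2)
cbind(beta_hat, coef(fit))                    # the two columns should agree
s2 <- sum(resid(fit)^2) / (n - 2 - 1)         # s^2 = RSS/(n - p - 1), here p = 2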
3. Tests of Linearity
i. Test whether there is a linear association between Y and all $x_i$:
$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$
$H_1$: at least one of the $\beta_i \neq 0$ $(i = 1, 2, \ldots, p)$
ii. T-test: $T_i = \frac{\hat{\beta}_i}{se(\hat{\beta}_i)} \sim t_{n-p-1}$ if $H_0: \beta_i = 0$ is true (test each $\beta_i$ one at a time)
iii. F-test: $F = \frac{SSreg / p}{RSS / (n - p - 1)} \sim F_{p,\, n-p-1}$ if $H_0$ is true (test all $\beta_i$ at once)
• Total sample variability: $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$
• Variability explained by the model: $SSreg = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
• Residual sum of squares: $RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$SST = SSreg + RSS$; $SSreg$ is large relative to $RSS$ if there is a linear relationship between Y and all the $x_i$
$R^2 = \frac{SSreg}{SST} = 1 - \frac{RSS}{SST}$: how much variation in y can be explained by the model
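The t-tests, the overall F-test, and $R^2$ can all be read off R's standard output; a short sketch on the same kind of simulated data (hypothetical example, not course data):

# Refit the hypothetical simulated example from the previous sketch
set.seed(1)
n <- 50; x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2*x1 - 0.5*x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
summary(fit)             # t-statistic and p-value per coefficient; F-statistic and R^2 at the bottom
anova(lm(y ~ 1), fit)    # overall F-test: intercept-only model vs. full model
summary(fit)$r.squared   # R^2 = 1 - RSS/SST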
Chapter 6: Diagnostics and Transformations
for Multiple Linear Regression
I. Regression Diagnostics for Multiple Regression
1. Regression Diagnostics: (i) The validity of the model: standardized residual vs. fitted value ($\hat{y}$) plot, standardized residual vs. predictor variable ($x_j$) plots, marginal model plots; (ii) Determine whether there are leverage points; (iii) Determine whether there are outliers; (iv) The effect of each predictor variable on the response variable: added-variable plots; (v) The extent of collinearity among the predictor variables: variance inflation factors; (vi) Determine whether the error variance is constant; (vii) If the data are collected over time, examine whether the data are correlated over time.
2. Leverage Points in Multiple Regression
i. Hat Matrix: $H = X(X'X)^{-1}X'$ ($\hat{Y} = HY$)
ii. If $h_{ii} > 2 \cdot \frac{p+1}{n}$ (in multiple regression with p predictors), the ith point is a leverage point
$\hat{y}_i = \sum_{j=1}^{n} h_{ij} y_j = h_{ii} y_i + \sum_{j \neq i} h_{ij} y_j$, where
* $h_{ij}$: the (i, j)th element of H, $h_{ii}$: the ith diagonal element of H
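A small R sketch of the $2(p+1)/n$ rule of thumb, assuming fit is any fitted lm object (for example the one from the sketch above):

h <- hatvalues(fit)                # diagonal elements h_ii of the hat matrix H
p <- length(coef(fit)) - 1         # number of predictors
cutoff <- 2 * (p + 1) / length(h)  # twice the average leverage (p+1)/n
which(h > cutoff)                  # indices of points flagged as leverage points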
3. Properties of Residuals in Multiple Regression
i. $\hat{\mathbf{e}} = \mathbf{Y} - \hat{\mathbf{Y}} = (I - H)\mathbf{Y}$, $\mathrm{var}(\hat{e}_i \mid X) = \sigma^2(1 - h_{ii})$
ii. Standardized Residual: $r_i = \frac{\hat{e}_i}{s\sqrt{1 - h_{ii}}}$,
where $s = \sqrt{\frac{RSS}{n - p - 1}}$
iii. Using Residuals and Standardized
Residuals for Model Checking
(a) When a valid model has been fit, a plot of the standardized residuals against the fitted values (or against any linear combination of the predictors) will have the following features:
• A random scatter of points around the horizontal axis ($r_i = 0$)
• Constant variability as we look along the horizontal axis
(b) Any non-random (deterministic) pattern in plots of the residuals indicates that an invalid model has been fit to the data
(c) In multiple regression, plots of the residuals provide direct information on how the model is misspecified when the following two conditions hold:
• Y vs. $\hat{y}$ plot: the mean of Y depends on the predictors only through a single linear combination, $E(Y \mid X = x) = g(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)$ for some function $g$
• $x_i$ vs. $x_j$ plots: the predictors show (approximately) linear relationships with one another
*If these conditions do not both hold, then a pattern in a residual plot indicates that an incorrect model has been fit, but the pattern itself does not provide direct information on how the model is misspecified.
*Premise: we already know that the model is invalid; we then use the conditions to check whether the residual plots can tell us how to improve the model.
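A brief R sketch of these residual checks, again assuming fit is an existing lm object (hypothetical example):

r <- rstandard(fit)                        # standardized residuals e_i / (s * sqrt(1 - h_ii))
plot(fitted(fit), r,
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)                     # valid model: random scatter about 0, constant spread
plot(model.matrix(fit)[, 2], r,            # repeat against each predictor column
     xlab = "First predictor", ylab = "Standardized residuals")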
II. Using Transformation to Overcome Nonlinearity: Transforming Only the Response Variable Using Inverse Regression
1. Suppose that the true regression model between Y and $x_1, \ldots, x_p$ is given by:
$Y = g^{-1}(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + e)$
Turn the model into a multiple regression model by transforming Y by $g$:
$g(Y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + e$
e.g. if $Y = \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + e)$, then $g(Y) = \log(Y)$ is a multiple regression model
2. If we want to estimate $g$, we plot $\hat{y}$ against $y$ (the inverse response plot); if we want to estimate $g^{-1}$, we plot $y$ against $\hat{y}$, where $\hat{y}$ are the fitted values from regressing the untransformed Y on the predictors.
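A minimal R sketch of this idea on simulated data where the true transformation g is known to be the log (all names are made up); a plain scatter plot is enough to show the shape of g, and the car package offers an automated inverse response plot as well:

set.seed(2)
x  <- runif(100, 1, 5)
y  <- exp(0.5 + 0.4 * x + rnorm(100, sd = 0.1))  # Y = g^{-1}(linear + e) with g = log
fit_y <- lm(y ~ x)                               # regress the untransformed response
plot(y, fitted(fit_y),
     xlab = "y", ylab = "Fitted values")         # shape of this curve estimates g (here, log)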
III. Multicollinearity and Variance Inflation
Factors
1. Multicollinearity: A number of important
issues arise when strong correlations exist
among the predictor variables
2. Variance Inflation Factors
i. First, consider a multiple regression model with two predictors: $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + e$
* $r_{12}$: Pearson correlation coefficient between $x_1$ and $x_2$
* $s_{x_j}$: the standard deviation of $x_j$
$\mathrm{var}(\hat{\beta}_j \mid X) = \frac{\sigma^2}{(n-1)\, s_{x_j}^2} \cdot \frac{1}{1 - r_{12}^2}$ $(j = 1, 2)$
$\frac{1}{1 - r_{12}^2}$: variance inflation factor
Correlation amongst the predictors ($r_{12} \neq 0$) increases the variance of the estimated regression coefficients
ii. Next consider the general regression model: $Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + e$
* $R_j^2$: the value of $R^2$ obtained from the regression of $x_j$ on the other predictors (variability in $x_j$ explained by that model)
$\mathrm{var}(\hat{\beta}_j \mid X) = \frac{\sigma^2}{(n-1)\, s_{x_j}^2} \cdot \frac{1}{1 - R_j^2}$ $(j = 1, \ldots, p)$
$\frac{1}{1 - R_j^2}$: the jth variance inflation factor
If the predictor variables are correlated, $R_j^2$ will be close to 1. Then $\mathrm{var}(\hat{\beta}_j \mid X)$ will be very large, the p-value of the t-test for $\beta_j$ will be very large (statistically insignificant), and the confidence interval for $\beta_j$ will be wide.
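A short R sketch of the VIF definition, computing $1/(1 - R_j^2)$ directly on deliberately collinear simulated data (hypothetical example; car::vif() returns the same values for all predictors):

set.seed(3)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.3)              # deliberately correlated with x1
x3 <- rnorm(100)
y  <- 1 + x1 + x2 + x3 + rnorm(100)
fit <- lm(y ~ x1 + x2 + x3)
R2_1 <- summary(lm(x1 ~ x2 + x3))$r.squared  # R_j^2 from regressing x1 on the other predictors
1 / (1 - R2_1)                               # VIF for x1
# car::vif(fit)                              # same idea for every predictor, if car is installed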
Chapter 7: Variable Selection
I. Evaluating Potential Subsets of Predictor
Variables–AIC (Akaike’s Information
Criterion)
1. Definition: AIC is an estimator of the relative
quality of statistical models for a given set of
data
2. Derivation:
Suppose $y_1, y_2, \ldots, y_n$ are the observed values of independent normal random variables $Y_i \sim N(\mu_i, \sigma^2)$, where $\mu_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}$.
Likelihood: $L(\boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right)$
Let $\hat{\mu}_i$ be the least squares fitted values and $\hat{\sigma}^2 = \frac{RSS}{n}$ (the maximum likelihood estimate of $\sigma^2$).
Maximized log-likelihood: $\log L = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\!\left(\frac{RSS}{n}\right) - \frac{n}{2}$
$AIC = -2\log L + 2K$, where $K$ is the number of estimated parameters
$\Rightarrow AIC = n\log\!\left(\frac{RSS}{n}\right) + 2K + \text{constant}$
(R-output: for a linear model, extractAIC() reports $n\log(RSS/n) + 2(p+1)$, dropping the constant)
*The smaller the AIC, the better the model.
3. Use: model selection. Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models.
4. AIC says nothing about testing a null hypothesis or about the absolute quality of a model; it only measures quality relative to the other models considered.
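A hedged R sketch comparing two candidate models by AIC on simulated data (all names hypothetical); extractAIC() is the quantity step() uses, while AIC() includes the constants but gives the same ranking:

set.seed(4)
dat <- data.frame(x1 = rnorm(80), x2 = rnorm(80))
dat$y <- 1 + 2 * dat$x1 + rnorm(80)          # x2 is irrelevant in the true model
fit_small <- lm(y ~ x1,      data = dat)
fit_big   <- lm(y ~ x1 + x2, data = dat)
extractAIC(fit_small)                        # c(edf, n*log(RSS/n) + 2*edf); smaller is better
extractAIC(fit_big)
AIC(fit_small, fit_big)                      # full-likelihood version; same ranking of models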
II. Deciding on the Collection of Potential Subsets
of Predictor Variables
1. "Best": for a given number of predictors, the best choice is the set of predictors with the smallest value of RSS
* for a fixed number of predictors: max $R^2$ (or maximized log-likelihood) $\Leftrightarrow$ min RSS $\Leftrightarrow$ min AIC
2. Forward Stepwise Regression
i. Definition: Forward stepwise regression starts with no potential predictor variables in the regression equation. At each step, it adds the predictor such that the resulting model has the lowest value of an information criterion. This process continues until all variables have been added, or until adding any remaining variable no longer lowers the criterion.
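A brief R sketch of forward stepwise selection with AIC using base R's step(), on a hypothetical simulated data frame:

set.seed(5)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(100)            # x3 is pure noise
null_fit <- lm(y ~ 1, data = dat)                        # start with no predictors
full_fit <- lm(y ~ x1 + x2 + x3, data = dat)             # largest model considered
step(null_fit, scope = formula(full_fit), direction = "forward")  # adds predictors while AIC drops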