Chapter 18: Inference for Regression
18.1 The Population and the Sample
• The OLS line was used to describe data:
ŷ = b0 + b1x
• Since samples will vary (resulting in different OLS lines), to make an inference
about the true relationship between x and y, we must use a different formula:
μy = β0 + β1x
• μ is used because the means of the y values for each value of x fall exactly on the line.
o The y values at each x are distributed around the means that lie on the line.
• ε is used to soak up the deviation at each point so the model gives a value of y
for each value of x: y = β0 + β1x + ε.
• To account for the uncertainty in our estimates, we use CI’s.
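As a sketch of why inference is needed: simulating many samples from a hypothetical true line μy = β0 + β1x shows that the fitted slopes b1 vary from sample to sample while scattering around the true β1. All numbers here (β0 = 2, β1 = 0.5, the x grid, the error SD) are made up for the demo.

```python
import random
import statistics

def ols(x, y):
    """Least-squares fit: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

random.seed(1)
beta0, beta1 = 2.0, 0.5           # hypothetical "true" population line
x = [float(i) for i in range(30)]

slopes = []
for _ in range(1000):
    # each new sample adds fresh Normal errors ε around the same true line
    y = [beta0 + beta1 * xi + random.gauss(0, 1) for xi in x]
    slopes.append(ols(x, y)[1])

# the fitted slopes b1 scatter around the true β1 = 0.5
print(statistics.fmean(slopes), statistics.stdev(slopes))
```

Because each sample's errors differ, each sample gives a different b1; the spread of the simulated slopes is exactly the sampling variability that the CIs in this chapter account for.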
18.2 Assumptions and Conditions
• Check for each assumption and condition in the following order:
1. Linearity Assumption
The linearity condition is satisfied if a scatterplot of the variables looks straight enough.
• Can also check a scatterplot of the residuals against x and if
no pattern emerges, then it is straight enough.
2. Independence Assumption
The errors in the true underlying regression model (the ε ’s)
must be independent.
Are the samples collected independently?
For time-series data, check for autocorrelation of the errors.
• Is the error our model makes today similar to the one it made yesterday?
• To check this, plot the residuals against time.
3. Equal Variance Assumption
There should be homoscedasticity (not heteroscedasticity).
To make CI’s, the SD of the residuals needs to be used.
• The SD of the residuals “pools” information across all the
individual distributions of y at each x-value, and pooled
estimates are appropriate only when they combine
information for groups with the same variance.
4. Normal Population Assumption
The errors around the idealized regression line at each value of x
are assumed to follow a Normal model.
• Needed to use the Student’s t-model for inference.
Must check the Nearly Normal Condition.
• As the sample size grows, the assumption becomes less
important as the CLT will begin to kick in.
• You can look at the histogram of the data or look at a
Normal probability plot to check this condition.
o Normal probability plot compares each value (in
this case, each residual) with the value we would
have expected to get if we’d just drawn a sample of
the same size from a standard Normal model.
If it is straight, then the condition passes.
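The "is the plot straight?" judgment can also be made numerically: pair each sorted residual with the standard-Normal quantile expected at its plotting position and compute their correlation; a value near 1 means the Normal probability plot would look nearly straight. This is a minimal sketch with hypothetical residuals and a plain-Python Pearson correlation (no external libraries); the plotting position (i + 0.5)/n is one common convention.

```python
import random
import statistics

def pearson(a, b):
    """Plain Pearson correlation coefficient."""
    abar, bbar = statistics.fmean(a), statistics.fmean(b)
    num = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    den = (sum((ai - abar) ** 2 for ai in a)
           * sum((bi - bbar) ** 2 for bi in b)) ** 0.5
    return num / den

def npp_straightness(residuals):
    """Correlate sorted residuals with the standard-Normal quantiles
    expected at plotting positions (i + 0.5)/n; near 1 = nearly straight
    Normal probability plot."""
    n = len(residuals)
    nd = statistics.NormalDist()
    expected = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    return pearson(sorted(residuals), expected)

random.seed(4)
normal_resids = [random.gauss(0, 1) for _ in range(200)]
print(npp_straightness(normal_resids))  # close to 1 for Normal residuals
```

A big outlier or strong skew in the residuals pulls this correlation well below 1, the numeric counterpart of a bent probability plot.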
(Figure: At each x-value, there is a distribution of y-values that follows a Normal model. Each of these Normal models is centered on the line μy = β0 + β1x and has the same standard deviation.)
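The SD of the residuals mentioned under the Equal Variance Assumption is computed as s_e = sqrt(Σe² / (n − 2)), dividing by n − 2 because two parameters (intercept and slope) were estimated. A minimal sketch with hypothetical residuals:

```python
import math

def residual_sd(residuals):
    """s_e = sqrt(sum(e^2) / (n - 2)); n - 2 degrees of freedom because
    two parameters (b0 and b1) were estimated from the data."""
    n = len(residuals)
    return math.sqrt(sum(e * e for e in residuals) / (n - 2))

e = [0.5, -1.0, 0.25, 0.75, -0.5]   # hypothetical residuals from a fit
print(residual_sd(e))
```

This single s_e stands in for the common standard deviation of all the Normal models along the line, which is why equal variance must hold for it to be a sensible pooled estimate.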
• Steps are as follows to check the conditions:
o Make a scatterplot of the data (is it linear?)
o If the data are straight enough, fit a regression and find the residuals, e,
and predicted values, ŷ.
o If the data are a time series, plot the residuals against time to check independence.
o Make scatterplot of the residuals against x or the predicted values.
Ensure that no pattern exists and that variance is constant.
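The time-plot step in the checklist can be supplemented with a number: the lag-1 autocorrelation of the residuals, which is near 0 when today's error tells us nothing about tomorrow's. This is a sketch with a hypothetical residual series, not part of the checklist itself.

```python
import statistics

def lag1_autocorr(resids):
    """Sample lag-1 autocorrelation of a residual series; values far from 0
    suggest the errors are not independent over time."""
    rbar = statistics.fmean(resids)
    num = sum((resids[t] - rbar) * (resids[t - 1] - rbar)
              for t in range(1, len(resids)))
    den = sum((r - rbar) ** 2 for r in resids)
    return num / den

# hypothetical residual series: a steady upward drift over time
trending = [0.1 * t for t in range(20)]
print(lag1_autocorr(trending))   # strongly positive: independence is suspect
```

A trend in the time plot shows up as a large positive value; regular alternation shows up as a large negative one. Either way, the Independence Assumption is in doubt.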