ADMS 3330 WINTER 2014 MIDTERM #1
Chapter 16: Simple Linear Regression and Correlation
Regression Analysis: used to predict the value of one variable (dependent) on the basis of the other variable
Type 1: Deterministic Model: a set of equations that allows us to fully
determine the value of the dependent variable from the values of the
independent variables.
Type 2: Probabilistic Model: a method used to capture the
randomness that is part of a real-life process.
- To create this model, we start with deterministic since it
approximates the relationship we want and then add the random
term which measures the error of the deterministic.
The cost of building a new house is about $75 per square foot and
most lots sell for about $25,000. Hence the approximate selling price (y)
would be:
y = $25,000 + ($75/ft²)(x)
(where x is the size of the house in square feet)
- House size is the independent variable and house price is the dependent variable
- Probabilistic model: y = 25,000 + 75x + ɛ
ɛ is the error variable. It is the difference between the actual selling price
and the estimated price based on the size of the house. It will vary even if
x is the same.
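A minimal sketch of the two model types, using the house-price figures above. The function names and the error standard deviation of $5,000 are assumptions made only for illustration; the notes do not give a value for the spread of ɛ.

```python
import random

def deterministic_price(size_sqft):
    """Deterministic model: price is fully determined by house size."""
    return 25_000 + 75 * size_sqft

def probabilistic_price(size_sqft, rng, error_sd=5_000):
    """Probabilistic model: deterministic part plus a random error term.

    error_sd is a made-up value; the error is drawn from a normal
    distribution with mean 0, matching the required conditions in 16.3.
    """
    epsilon = rng.gauss(0, error_sd)
    return deterministic_price(size_sqft) + epsilon

rng = random.Random(0)  # fixed seed so the run is repeatable
print(deterministic_price(2000))  # 175000
for _ in range(3):
    # Same x every time, yet the price differs because of the error term
    print(round(probabilistic_price(2000, rng)))
```

Three houses of the same size get three different simulated prices, which is the point made about ɛ varying even when x is the same.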
Simple Linear Regression Model (First Order Linear model): straight line model with one independent variable
y = B0 + B1x + ɛ
B0 (the y-intercept) and B1 (the slope) are population parameters,
which are usually
unknown and estimated
from the data.
16.2 Estimating the Coefficients
We estimate B0 with b0 and B1 with b1, the y-intercept and slope of the
least squares (regression) line ŷ = b0 + b1x, by:
b1 = Sxy / Sx²
b0 = ybar − b1·xbar
b0 is the y-intercept and b1 is the slope
Sum of Squares for Error (SSE) and Standard error of estimate (Sɛ)
- SSE is used in the calculation of the standard error of estimate (Sɛ)
- If Sɛ is zero, all points fall on the regression line
- Compare Sɛ with ybar to judge whether it is large or small
Example: Sɛ = 4.5 and ybar = 8.3
Interpretation: Sɛ appears large relative to ybar, which suggests the linear regression model fits poorly
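The sketch below shows how SSE and Sɛ could be computed for a line that has already been fitted. It uses the standard simple-regression formula Sɛ = √(SSE / (n − 2)). The five data points and the function name are invented for illustration; b0 = 2.2 and b1 = 0.6 are the least squares coefficients for that toy data.

```python
import math

def standard_error_of_estimate(x, y, b0, b1):
    """Return (SSE, S_e) for the line y-hat = b0 + b1*x."""
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(r * r for r in residuals)   # sum of squared errors
    s_e = math.sqrt(sse / (len(x) - 2))   # n - 2 degrees of freedom
    return sse, s_e

# Toy data (hypothetical)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sse, s_e = standard_error_of_estimate(x, y, b0=2.2, b1=0.6)
y_bar = sum(y) / len(y)
print(f"SSE = {sse:.2f}, S_e = {s_e:.3f}, y-bar = {y_bar:.1f}")
```

Judging Sɛ against ybar is a rough, subjective comparison, so the ratio is best read alongside the other fit measures in 16.4.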
16.3 Error Variable: Required Conditions
To have a valid regression: 4 conditions for the error variable (ɛ)
• The probability distribution of ɛ is normal.
• The mean of the distribution is 0; that is, E(ɛ ) = 0.
• The standard deviation of ɛ is STDVɛ , which is a constant regardless of the value of x.
• The value of ɛ associated with any particular value of y is independent of ɛ associated with any other value of y.
16.4 Assessing the Model
The least squares method will always produce a straight line, even if there is no relationship between the variables, or if
the relationship is something other than linear.
Hence, in addition to determining the coefficients of the least squares line, we need to assess it to see how well it “fits” the
data. We’ll see these evaluation methods now. They’re based on the sum of squares for errors (SSE).
Testing the Slope: If no linear relationship exists between the two variables, we would expect the regression line to be
horizontal, that is, to have a slope of zero.
1. We want to see if there is a linear relationship, i.e. we want to see if the slope (B1) is something other than zero.
Our research hypothesis becomes:
H1: B1 ≠ 0
Null hypothesis becomes:
H0: B1 = 0
Test statistic: t = (b1 − B1) / Sb1
where Sb1 is the STDV (standard error) of b1: Sb1 = Sɛ / √((n−1)Sx²)
Degrees of freedom: n-2
We can also estimate (to some level of confidence) an
interval for the slope parameter B1:
b1 ± t(α/2) · Sb1
2. If we want to test for positive or negative linear relationships, we conduct one-tail tests, i.e. our research hypothesis becomes one-sided:
H0: B1 = 0
H1: B1 < 0 for negative slope or H1: B1 > 0 for positive slope
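A rough sketch of the two-tail slope test and the interval estimate for B1, assuming the standard formulas t = b1 / Sb1 (under H0: B1 = 0) and Sb1 = Sɛ / √((n−1)Sx²). The data are invented, the function name is hypothetical, and the critical value 3.182 is the t-table value for α/2 = .025 with n − 2 = 3 degrees of freedom; it is passed in by hand because the Python standard library has no t distribution.

```python
import math

def slope_t_test(x, y, t_crit):
    """Return (t statistic for H0: B1 = 0, confidence interval for B1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)            # equals (n-1)*Sx^2
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))                      # standard error of estimate
    s_b1 = s_e / math.sqrt(sxx)                         # standard error of b1
    t_stat = b1 / s_b1
    interval = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
    return t_stat, interval

x = [1, 2, 3, 4, 5]          # toy data (hypothetical)
y = [2, 4, 5, 4, 5]
t_stat, interval = slope_t_test(x, y, t_crit=3.182)
reject_h0 = abs(t_stat) > 3.182   # two-tail rejection rule
print(f"t = {t_stat:.3f}, interval = ({interval[0]:.2f}, {interval[1]:.2f}), reject H0: {reject_h0}")
```

For a one-tail test the same t statistic is compared with the one-tail critical value (t(α) rather than t(α/2)), on the side named in H1.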
Coefficient of Determination
Tests thus far have shown if a linear relationship exists; it is also
useful to measure the strength of the relationship. This is done by
calculating the coefficient of determination – R².
- The coefficient of determination is the square of the
coefficient of correlation (r), hence R² = (r)²
Variation in y = SSE + SSR
SSE – Sum of Squares for Error – measures the amount of variation in y that remains unexplained (i.e. due to error)
SSR – Sum of Squares for Regression – measures the amount of variation in y explained by variation in the independent
variable x
Example: R² has a value of .49. This means 49% of the variation in y is explained by the variation in x. The remaining 51% is
unexplained, i.e. due to error.
Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to
draw conclusions.
- The higher the value of R², the better the model fits the data.
R² = 1: Perfect match between the line and the data points.
R² = 0: There is no linear relationship between x and y.
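As an illustration of how the variation in y splits into SSE and SSR, the sketch below computes both for a fitted line and then R² = 1 − SSE / (total variation in y). The data and function name are made up; b0 = 2.2 and b1 = 0.6 are the least squares coefficients for this toy data, which is why SSE + SSR adds up to the total exactly.

```python
def variation_breakdown(x, y, b0, b1):
    """Return (SSE, SSR, R^2) for the line y-hat = b0 + b1*x."""
    y_bar = sum(y) / len(y)
    fitted = [b0 + b1 * xi for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # explained
    total = sum((yi - y_bar) ** 2 for yi in y)               # variation in y
    r_squared = 1 - sse / total
    return sse, ssr, r_squared

x = [1, 2, 3, 4, 5]          # toy data (hypothetical)
y = [2, 4, 5, 4, 5]
sse, ssr, r_squared = variation_breakdown(x, y, b0=2.2, b1=0.6)
print(f"SSE = {sse:.1f}, SSR = {ssr:.1f}, R^2 = {r_squared:.2f}")
```

Here 60% of the variation in y is explained by x and 40% is left unexplained, read the same way as the .49 example above.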
Coefficient of Correlation
We can use the coefficient of correlation to test for a linear relationship between two variables.
The coefficient of correlation’s range is between –1 and +1.
• If r = –1 (negative association) or r = +1 (positive association) every point falls on the regression line.
• If r = 0 there is no linear pattern
The population coefficient of correlation is denoted ρ (rho)
We estimate its value from sample data with the sample coefficient of correlation: r = Sxy / (Sx · Sy)
The test statistic for testing if ρ = 0 is: t = r · √((n−2) / (1−r²))
Degrees of freedom: n-2
The t-test of the coefficient of correlation is an alternate means to determine whether two variables are linearly related.
Our research hypothesis is:
H1: ρ ≠ 0 (i.e. there is a linear relationship) and our null hypothesis is:
H0: ρ = 0 (i.e. there is no linear relationship when rho = 0)
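A small sketch of the correlation test, assuming the usual formulas r = Sxy / (Sx · Sy) and t = r · √((n−2) / (1−r²)). The data and the function name are hypothetical. The formula breaks down when r is exactly ±1 (division by zero), which this sketch does not guard against.

```python
import math

def correlation_t_test(x, y):
    """Return (sample correlation r, t statistic for H0: rho = 0)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    r = s_xy / (s_x * s_y)
    t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))   # n - 2 degrees of freedom
    return r, t_stat

x = [1, 2, 3, 4, 5]          # toy data (hypothetical)
y = [2, 4, 5, 4, 5]
r, t_stat = correlation_t_test(x, y)
print(f"r = {r:.4f}, t = {t_stat:.4f}")
```

In simple linear regression this t statistic equals the one from the test of the slope on the same data, so the two tests lead to the same conclusion.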
16.5 Using the Regression Equation
Point prediction: ŷ (y-hat) is calculated from the regression equation when x is given. An interval can then be built around this point in two ways.
Prediction Interval: used when we want to predict one particular value of the dependent variable, given a specific value
of the independent variable:
ŷ ± t(α/2, n−2) · Sɛ · √(1 + 1/n + (xg − xbar)² / ((n−1)Sx²))
xg is the given value of x we’re interested in
Confidence Interval Estimator of the expected value
of y: we are estimating the mean of y given a value of x:
ŷ ± t(α/2, n−2) · Sɛ · √(1/n + (xg − xbar)² / ((n−1)Sx²))
- used for infinitely large populations.
- The confidence interval estimate of the expected value
of y will be narrower than the prediction interval for the
same given value of x and confidence level because
there is less error in estimating a mean value as opposed
to predicting an individual value.
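To see why the confidence interval for the mean is narrower than the prediction interval, the sketch below computes both half-widths at the same x, assuming the standard formulas (the prediction interval has an extra "1 +" under the square root). The data, the function name, and the choice xg = 4 are invented; 3.182 is the t-table value for 95% confidence with 3 degrees of freedom.

```python
import math

def intervals_at(x, y, x_g, t_crit):
    """Return (y-hat at x_g, prediction half-width, confidence half-width)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)            # equals (n-1)*Sx^2
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x_g                               # point prediction
    core = 1 / n + (x_g - x_bar) ** 2 / sxx
    pred_half = t_crit * s_e * math.sqrt(1 + core)      # one individual value of y
    conf_half = t_crit * s_e * math.sqrt(core)          # expected (mean) value of y
    return y_hat, pred_half, conf_half

x = [1, 2, 3, 4, 5]          # toy data (hypothetical)
y = [2, 4, 5, 4, 5]
y_hat, pred_half, conf_half = intervals_at(x, y, x_g=4, t_crit=3.182)
print(f"y-hat = {y_hat:.1f}")
print(f"prediction interval: +/- {pred_half:.3f}")
print(f"confidence interval: +/- {conf_half:.3f}")
```

Both intervals widen as xg moves away from xbar, because of the (xg − xbar)² term.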
16.6 Regression Diagnostics 1
Residual Analysis: examine the differences between the
actual data points and those predicted by the linear equation
- Residual for point i: ri = yi − ŷi
- Standardized residual for point i = ri / Sɛ
- Standardized residual for point i using Minitab = ri / s(ri)
- Where the standard deviation of the ith residual is s(ri) = Sɛ · √(1 − hi)
- Where hi = 1/n + (xi − xbar)² / ((n−1)Sx²)
Nonnormality: put the residuals into a histogram and look for a bell shape with the mean close to zero.
Heteroscedasticity: when the requirement of a constant variance is violated; the residual plot typically fans out in a funnel shape.
Nonindependence of the Error Variable: when the data are time series, the errors
are often correlated. Error terms that are correlated over time are said to be autocorrelated
or serially correlated.
- We can detect autocorrelation by graphing residuals against time periods. If a pattern emerges, it is likely that the
independence requirement is violated
Patterns in the appearance of the residuals over time
indicate that autocorrelation exists.
Outliers: an observation that is unusually small or
unusually large.
Possible reasons for the existence of outliers include:
• There was an error in recording the value
• The point should not have been included in the
sample
• Perhaps the observation is indeed valid.
- can be identified from a scatter plot
If the absolute value of the standardized residual is > 2, we suspect the
point may be an outlier and investigate further, since outliers can easily
influence the least squares line.
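The standardized residuals from the residual analysis and the "absolute value > 2" rule can be sketched as follows. This assumes the leverage-adjusted form ri / (Sɛ · √(1 − hi)) with hi = 1/n + (xi − xbar)² / ((n−1)Sx²). The data set is invented, with one value deliberately pushed far off the line, and the function name is hypothetical.

```python
import math

def standardized_residuals(x, y):
    """Return leverage-adjusted standardized residuals for a least squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)            # equals (n-1)*Sx^2
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    result = []
    for xi, ri in zip(x, residuals):
        h_i = 1 / n + (xi - x_bar) ** 2 / sxx           # leverage of point i
        result.append(ri / (s_e * math.sqrt(1 - h_i)))
    return result

x = list(range(1, 11))                    # toy data (hypothetical)
y = [1, 2, 3, 4, 5, 16, 7, 8, 9, 10]      # the sixth value is far off the line
std_res = standardized_residuals(x, y)
suspects = [i for i, value in enumerate(std_res) if abs(value) > 2]
print(suspects)  # zero-based positions of suspected outliers
```

A flagged point is only a suspect: as listed above, it may be a recording error, a point that does not belong in the sample, or a valid observation.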
Steps in Calculating the Least Squares Line
1. Calculate the covariance of x and y: Sxy
2. Calculate the variance of x: Sx²
3. Calculate average x and average y
4. Calculate b1 = Sxy / Sx² and b0 = ybar − b1·xbar
5. Write the least squares line (sample regression line): ŷ = b0 + b1x
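The five steps can be followed one by one in code. This is a minimal sketch with invented data and a hypothetical function name; it uses the sample covariance and sample variance (dividing by n − 1), with b1 = Sxy / Sx² and b0 = ybar − b1·xbar.

```python
def least_squares_line(x, y):
    """Return (b0, b1) following the five calculation steps."""
    n = len(x)
    x_bar = sum(x) / n                                   # step 3: averages
    y_bar = sum(y) / n
    # step 1: sample covariance of x and y
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    # step 2: sample variance of x
    s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    # step 4: slope, then intercept
    b1 = s_xy / s_x2
    b0 = y_bar - b1 * x_bar
    return b0, b1

x = [1, 2, 3, 4, 5]          # toy data (hypothetical)
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(x, y)
# step 5: the sample regression line
print(f"y-hat = {b0:.1f} + {b1:.1f}x")
```

The averages are computed first in the code because the covariance and variance both need them, even though they appear as step 3 in the list.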
Procedure for Regression Diagnostics
1. Develop a model that has a theoretical basis.
2. Gather data for the two variables in the model.
3. Draw the scatter diagram to determine whether a linear model appears to be appropriate. Identify possible
outliers.
4. Determine the regression equation.
5. Calculate the residuals and check the required conditions
6. Assess the model’s fit.
7. If the model fits the data, use the regression equation to predict a particular value of the dependent variable
and/or estimate its mean.
Chapter 17: Multiple Regression
Multiple Regression: allows for any number of independent variables
17.1 Model and Required Conditions
- We now have k independent variables potentially related to the one dependent variable.
First-Order linear equation: y = B0 + B1x1 + B2x2 + … + Bkxk + ɛ
For these regression methods to be valid the following four conditions
for the error variable (ɛ) must be met:
• The probability distribution of the error variable (ɛ) is normal.
• The mean of the error variable is 0.
• The standard deviation of ɛ is STDVɛ, which is a constant.
• The errors are independent.
17.2 Estimating the Coefficients and Assessing the Model
Sample regression equation is expressed as: ŷ = b0 + b1x1 + b2x2 + … + bkxk
We will use computer output to:
Assess the model…
- How well it fits the data
- Is it useful?
- Are any required conditions violated?
Employ the model…
- Interpreting the coefficients
- Predictions using the prediction equation
- Estimating the expected value of the dependent variable
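As a rough idea of what the software does when it generates the coefficients, the sketch below fits b0, b1, …, bk by solving the least squares normal equations with plain Gauss-Jordan elimination. This is only an illustration: the function name and data are invented, real statistical packages use more numerically stable methods, and the code does not handle perfectly collinear independent variables (it would divide by zero).

```python
def fit_multiple_regression(rows, y):
    """rows: list of [x1, ..., xk] observations. Returns [b0, b1, ..., bk]."""
    X = [[1.0] + list(r) for r in rows]      # leading 1 for the intercept b0
    p = len(X[0])
    n = len(X)
    # Augmented normal equations [X'X | X'y]
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         + [sum(X[i][a] * y[i] for i in range(n))]
         for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Toy data (hypothetical), built so that y = 1 + 2*x1 + 3*x2 exactly
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]
y = [9, 8, 19, 18, 29]
coefficients = fit_multiple_regression(rows, y)
print([round(b, 6) for b in coefficients])
```

Because the toy data contain no error term, the fitted coefficients recover the values used to build y; with real data each bi estimates the change in y per unit change in xi with the other variables held fixed.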
Regression Analysis Steps
1. Use a computer and software to generate the coefficients and the statistics used to assess the model.
2. Diagnose violations of required conditions. If there are problems, attempt to remedy them.
3. Assess the model’s fit:
- Standard error of estimate,
- Coefficient of determination,
- F-test of the analysis of variance. (Page 705)
4. If steps 1, 2, and 3 are OK, use the model to predict or estimate the expected value of the dependent variable.
We assess the model in three ways:
1. Standard error of estimate,
2. Coefficient of determination, and
3. F-test of the analysis of variance.
Standard Error of Estimate
Sɛ = √(SSE / (n − k − 1))
n = sample size
k = # of independent variables
Compare Sɛ to ybar (average of y)
Coefficient of Determination
Interpretation: This means that 33.74% of the variation in income is explained by the eight independent variables, but
66.26% remains unexplained.
Adjusted R² value: the coefficient of determination adjusted for degrees of freedom. It takes into account the sample size
n, and k, the number of independent variables, and is given by:
Adjusted R² = 1 − [SSE / (n − k − 1)] / [Σ(yi − ybar)² / (n − 1)]
In this model the coefficient of determination adjusted for degrees of
freedom is .3180.
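A short sketch of the fit statistics for a multiple regression, assuming the standard formulas Sɛ = √(SSE / (n − k − 1)), R² = 1 − SSE / (total variation in y), and adjusted R² = 1 − [SSE / (n − k − 1)] / [total variation / (n − 1)]. The input numbers are invented and are not the income example from these notes, whose sample size is not given here; the function name is hypothetical.

```python
import math

def multiple_regression_fit(sse, total_variation, n, k):
    """Return (S_e, R^2, adjusted R^2) for n observations and k independent variables."""
    s_e = math.sqrt(sse / (n - k - 1))
    r_squared = 1 - sse / total_variation
    adjusted = 1 - (sse / (n - k - 1)) / (total_variation / (n - 1))
    return s_e, r_squared, adjusted

# Hypothetical inputs: SSE = 40, total variation in y = 100, n = 25, k = 4
s_e, r_squared, adjusted = multiple_regression_fit(40, 100, n=25, k=4)
print(f"S_e = {s_e:.4f}, R^2 = {r_squared:.2f}, adjusted R^2 = {adjusted:.2f}")
```

The adjusted value is always at most R², and the gap grows when k is large relative to n, which is why it is the safer figure to quote when many independent variables are used.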
Testing the Validity of the Model
In a multiple regression model (i.e. more than one independent