York University, Administrative Studies, ADMS 3330
ADMS 3330 WINTER 2014 MIDTERM #1

Chapter 16: Simple Linear Regression and Correlation

Regression Analysis: used to predict the value of one variable (the dependent variable) on the basis of another variable (the independent variable).

16.1 Model

Type 1: Deterministic Model: a set of equations that allows us to fully determine the value of the dependent variable from the values of the independent variables.

Type 2: Probabilistic Model: a method used to capture the randomness that is part of a real-life process.
- To create this model, we start with a deterministic model, since it approximates the relationship we want, and then add a random term that measures the error of the deterministic component.

Example: The cost of building a new house is about $75 per square foot and most lots sell for about $25,000. Hence the approximate selling price (y) would be:

y = $25,000 + ($75/ft²)(x)    (where x is the size of the house in square feet)

- House size (x) is the independent variable and house price (y) is the dependent variable.
- Probabilistic model: y = 25,000 + 75x + ε, with ε as the error variable. It is the difference between the actual selling price and the estimated price based on the size of the house. It will vary even if x stays the same.

Simple Linear Regression Model (First-Order Linear Model): a straight-line model with one independent variable:

y = β0 + β1x + ε

β0 and β1 are population parameters, which are usually unknown and are estimated from the data.
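The two model types can be sketched in Python. The error term's standard deviation (8,000 here) is an assumed value for illustration only; the notes do not specify it.

```python
import random

# Deterministic component: price = 25,000 + 75 * square footage
def deterministic_price(sqft):
    return 25_000 + 75 * sqft

# Probabilistic model adds a random error term epsilon.
# sigma = 8_000 is a hypothetical value, not given in the notes.
def simulated_price(sqft, sigma=8_000):
    epsilon = random.gauss(0, sigma)  # epsilon ~ N(0, sigma^2)
    return deterministic_price(sqft) + epsilon

print(deterministic_price(2_000))  # 175000
```

Running `simulated_price(2_000)` repeatedly gives different selling prices for the same house size, which is exactly the role of ε in the probabilistic model.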
16.2 Estimating the Coefficients

We estimate β0 with b0 and β1 with b1, the y-intercept and slope of the least squares (regression) line:

b1 = s_xy / s_x²    and    b0 = ȳ − b1·x̄

b0 is the y-intercept and b1 is the slope.

Sum of Squares for Error (SSE) and Standard Error of Estimate (Sε)
- SSE = Σ(y_i − ŷ_i)² is used in the calculation of the standard error of estimate: Sε = √(SSE / (n − 2)).
- If Sε is zero, all points fall on the regression line.
- Compare Sε to ȳ to judge the model's value.

Example: Sε = 4.5 and ȳ = 8.3. Interpretation: Sε appears large relative to ȳ, meaning the linear regression model fits poorly.

16.3 Error Variable: Required Conditions

To have a valid regression, four conditions must hold for the error variable (ε):
• The probability distribution of ε is normal.
• The mean of the distribution is 0; that is, E(ε) = 0.
• The standard deviation of ε is σε, which is a constant regardless of the value of x.
• The value of ε associated with any particular value of y is independent of the ε associated with any other value of y.

16.4 Assessing the Model

The least squares method will always produce a straight line, even if there is no relationship between the variables, or if the relationship is something other than linear. Hence, in addition to determining the coefficients of the least squares line, we need to assess it to see how well it "fits" the data. These evaluation methods are based on the sum of squares for error (SSE).

Testing the Slope

If no linear relationship exists between the two variables, we would expect the regression line to be horizontal, that is, to have a slope of zero.

1. We want to see if there is a linear relationship, i.e. whether the slope (β1) is something other than zero. The hypotheses are:
   H0: β1 = 0   and   H1: β1 ≠ 0
   t-test statistic: t = (b1 − β1) / s_b1, where s_b1 = Sε / √((n − 1)·s_x²) is the standard deviation of b1. Degrees of freedom: n − 2.
   We can also estimate (to some level of confidence) an interval for the slope parameter β1: b1 ± t(α/2)·s_b1.

2. If we want to test for positive or negative linear relationships, we conduct one-tail tests.
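A minimal sketch of the coefficient estimates, standard error of estimate, and slope t statistic; the sample data are hypothetical.

```python
import math

def simple_regression(x, y):
    """Least squares estimates b0 and b1, the standard error of
    estimate, and the t statistic for H0: beta1 = 0 (df = n - 2)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
    s2_x = sum((a - xbar) ** 2 for a in x) / (n - 1)
    b1 = s_xy / s2_x                  # slope
    b0 = ybar - b1 * xbar             # y-intercept
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    s_eps = math.sqrt(sse / (n - 2))  # standard error of estimate
    s_b1 = s_eps / math.sqrt((n - 1) * s2_x)
    t = b1 / s_b1                     # compare to t(alpha/2), df = n - 2
    return b0, b1, s_eps, t

# Hypothetical sample data
b0, b1, s_eps, t = simple_regression([1, 2, 3, 4, 5],
                                     [2.1, 3.9, 6.2, 7.8, 10.1])
```

A large t (relative to the t critical value with n − 2 degrees of freedom) leads us to reject H0: β1 = 0 and conclude a linear relationship exists.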
For one-tail tests, the hypotheses become: H0: β1 = 0, with H1: β1 < 0 for a negative slope or H1: β1 > 0 for a positive slope.

Coefficient of Determination

Tests thus far have shown whether a linear relationship exists; it is also useful to measure the strength of the relationship. This is done by calculating the coefficient of determination, R².
- The coefficient of determination is the square of the coefficient of correlation (r); hence R² = r².

Variation in y = SSE + SSR
- SSE (Sum of Squares for Error) measures the amount of variation in y that remains unexplained (i.e. due to error).
- SSR (Sum of Squares for Regression) measures the amount of variation in y explained by variation in the independent variable x.

R² = SSR / (SSE + SSR)

Interpretation: R² has a value of .49. This means 49% of the variation in y is explained by the variation in x. The remaining 51% is unexplained, i.e. due to error.

Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions.
- The higher the value of R², the better the model fits the data.
- R² = 1: perfect match between the line and the data points.
- R² = 0: there is no linear relationship between x and y.

Coefficient of Correlation

We can use the coefficient of correlation to test for a linear relationship between two variables. The coefficient of correlation's range is between −1 and +1.
• If r = −1 (negative association) or r = +1 (positive association), every point falls on the regression line.
• If r = 0, there is no linear pattern.

The population coefficient of correlation is denoted ρ (rho). We estimate its value from sample data with the sample coefficient of correlation:

r = s_xy / (s_x·s_y)

The test statistic for testing H0: ρ = 0 is:

t = r·√((n − 2) / (1 − r²)),   degrees of freedom: n − 2

The t-test of the coefficient of correlation is an alternative means of determining whether two variables are linearly related. Our research hypothesis is H1: ρ ≠ 0 (i.e. there is a linear relationship) and our null hypothesis is H0: ρ = 0 (i.e. there is no linear relationship).
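The sample coefficient of correlation and its t test can be sketched as follows; the data are hypothetical.

```python
import math

def correlation_test(x, y):
    """Sample coefficient of correlation r, R^2 = r^2, and the
    t statistic for H0: rho = 0 (df = n - 2)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)                # coefficient of correlation
    t = r * math.sqrt((n - 2) / (1 - r * r))      # test statistic
    return r, r * r, t

# Hypothetical sample data
r, r2, t = correlation_test([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
```

Note that `r2` here is the coefficient of determination, confirming the relationship R² = r² stated above.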
16.5 Using the Regression Equation

Point prediction: ŷ calculated for a given value of x. Point predictions can also be expressed through interval estimates.

Prediction Interval: used when we want to predict one particular value of the dependent variable, given a specific value of the independent variable (x_g is the given value of x we're interested in):

ŷ ± t(α/2)·Sε·√(1 + 1/n + (x_g − x̄)² / ((n − 1)s_x²))

Confidence Interval Estimator of the expected value of y: we are estimating the mean of y given a value of x:

ŷ ± t(α/2)·Sε·√(1/n + (x_g − x̄)² / ((n − 1)s_x²))

- Used for infinitely large populations.
- The confidence interval estimate of the expected value of y will be narrower than the prediction interval for the same given value of x and confidence level, because there is less error in estimating a mean value than in predicting an individual value.

16.6 Regression Diagnostics I

Residual Analysis: examine the differences between the actual data points and those predicted by the linear equation.
- Residual for point i: e_i = y_i − ŷ_i
- Standardized residual for point i: e_i / Sε
- Using Minitab, the standardized residual is e_i / s_{e_i}, where the standard deviation of the ith residual is s_{e_i} = Sε·√(1 − h_i), with h_i = 1/n + (x_i − x̄)² / ((n − 1)s_x²).

Nonnormality: put the residuals into a histogram and look for a bell shape with a mean close to zero.

Heteroscedasticity: when the requirement of a constant variance is violated; the residual plot shows a funnel shape.

Nonindependence of the Error Variable: when the data form a time series, the errors are often correlated. Error terms that are correlated over time are said to be autocorrelated or serially correlated.
- We can detect autocorrelation by graphing residuals against time periods. If a pattern emerges, it is likely that the independence requirement is violated; patterns in the appearance of the residuals over time indicate that autocorrelation exists.

Outliers: an observation that is unusually small or unusually large.
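The residual formulas above can be sketched in Python, including the |standardized residual| > 2 rule of thumb for flagging suspect points. The data and fitted coefficients are hypothetical.

```python
import math

def standardized_residuals(x, y, b0, b1):
    """Leverage-adjusted standardized residuals:
    r_i = e_i / (s_eps * sqrt(1 - h_i)),
    h_i = 1/n + (x_i - xbar)^2 / ((n - 1) * s_x^2)."""
    n = len(x)
    xbar = sum(x) / n
    s2_x = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s_eps = math.sqrt(sum(e * e for e in resid) / (n - 2))
    out = []
    for xi, ei in zip(x, resid):
        h = 1 / n + (xi - xbar) ** 2 / ((n - 1) * s2_x)
        out.append(ei / (s_eps * math.sqrt(1 - h)))
    return out

# Hypothetical sample; b0 and b1 are taken as given here.
srs = standardized_residuals([1, 2, 3, 4, 5],
                             [2.1, 3.9, 6.2, 7.8, 10.1],
                             b0=0.05, b1=1.99)
suspects = [i for i, r in enumerate(srs) if abs(r) > 2]  # outlier rule of thumb
```

For this well-behaved sample no point exceeds the threshold, so `suspects` is empty.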
Possible reasons for the existence of outliers include:
• There was an error in recording the value.
• The point should not have been included in the sample.
• Perhaps the observation is indeed valid.

- Outliers can be identified from a scatter plot.
- If the absolute value of the standardized residual is > 2, we suspect the point may be an outlier and investigate further, since outliers can easily influence the least squares line.

Remember: Steps in Calculating the Least Squares Line
1. Calculate the covariance: s_xy
2. Calculate the variance of x: s_x²
3. Calculate the averages x̄ and ȳ
4. Calculate b1 and b0
5. Write the least squares (sample regression) line: ŷ = b0 + b1x

Procedure for Regression Diagnostics
1. Develop a model that has a theoretical basis.
2. Gather data for the two variables in the model.
3. Draw the scatter diagram to determine whether a linear model appears to be appropriate. Identify possible outliers.
4. Determine the regression equation.
5. Calculate the residuals and check the required conditions.
6. Assess the model's fit.
7. If the model fits the data, use the regression equation to predict a particular value of the dependent variable and/or estimate its mean.

Chapter 17: Multiple Regression

Multiple Regression: allows for any number of independent variables.

17.1 Model and Required Conditions

We now have k independent variables potentially related to the one dependent variable. The first-order linear equation is:

y = β0 + β1x1 + β2x2 + … + βkxk + ε

Required Conditions: For these regression methods to be valid, the following four conditions for the error variable (ε) must be met:
• The probability distribution of the error variable (ε) is normal.
• The mean of the error variable is 0.
• The standard deviation of ε is σε, which is a constant.
• The errors are independent.

17.2 Estimating the Coefficients and Assessing the Model

The sample regression equation is expressed as:

ŷ = b0 + b1x1 + b2x2 + … + bkxk

We will use computer output to assess the model (how well it fits the data; is it useful?) and to employ the model (interpreting the coefficients).
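The sample multiple regression equation can be sketched as a simple prediction function; the coefficient values and inputs below are hypothetical.

```python
def predict(b0, coefs, xs):
    """Sample multiple regression equation:
    yhat = b0 + b1*x1 + b2*x2 + ... + bk*xk."""
    return b0 + sum(b * x for b, x in zip(coefs, xs))

# Hypothetical coefficients b1 = 2.0, b2 = -1.5 with intercept b0 = 10.0,
# evaluated at x1 = 3.0, x2 = 4.0:
print(predict(10.0, [2.0, -1.5], [3.0, 4.0]))  # 10.0
```

In practice the coefficients come from the computer output described below, not from hand calculation.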
Employing the model also includes predictions using the prediction equation and estimating the expected value of the dependent variable; assessing it also checks whether any required conditions are violated.

Regression Analysis Steps
1. Use a computer and software to generate the coefficients and the statistics used to assess the model.
2. Diagnose violations of required conditions. If there are problems, attempt to remedy them.
3. Assess the model's fit: standard error of estimate, coefficient of determination, F-test of the analysis of variance. (Page 705)
4. If steps 1, 2, and 3 are OK, use the model to predict or estimate the expected value of the dependent variable.

Example: Assess the model using
1. the standard error of estimate,
2. the coefficient of determination, and
3. the F-test of the analysis of variance.

Standard Error of Estimate

Sε = √(SSE / (n − k − 1)), where n = sample size and k = number of independent variables. Compare Sε to ȳ (the average of y).

Coefficient of Determination

Interpretation: R² = .3374 means that 33.74% of the variation in income is explained by the eight independent variables, but 66.26% remains unexplained.

Adjusted R² value: the coefficient of determination adjusted for degrees of freedom. It takes into account the sample size n and the number of independent variables k:

Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)

In this model, the coefficient of determination adjusted for degrees of freedom is .3180.

Testing the Validity of the Model

In a multiple regression model (i.e. more than one independent
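The multiple regression standard error of estimate and adjusted R² formulas can be sketched as follows; the SSE, n, k, and R² inputs in the example are hypothetical.

```python
import math

def multiple_regression_fit(sse, n, k, r2):
    """Standard error of estimate and adjusted R^2 for a model with
    n observations and k independent variables."""
    s_eps = math.sqrt(sse / (n - k - 1))
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return s_eps, adj_r2

# Hypothetical: SSE = 64.0, n = 20 observations, k = 3 predictors, R^2 = 0.5
s_eps, adj_r2 = multiple_regression_fit(64.0, 20, 3, 0.5)
```

Because the adjustment divides by n − k − 1, adding predictors that explain little variation can lower the adjusted R² even as the raw R² rises.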