PSY248 Lecture Notes - Lecture 4: Error Bar, Multicollinearity, Linear Regression
PSY248: Week 4 – multiple regression
→ What we have done so far is simple linear regression, where we fit a straight line to
describe the relationship between a predictor variable and an outcome variable
→ Y = a + bX + e
• It is possible to have multiple predictor variables (so multiple X’s)
• Y = dependent variable
• You will still go through the 3 stages: univariate, bivariate, perform regression
& check assumptions
• Only difference is we have more than one predictor
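As a sketch of how the simple model above extends to more than one predictor (the subscripted names b_1, b_2, ..., b_p are notation assumed here, not taken from the slides), the two-predictor model and its general form with p predictors can be written as:

```latex
Y = a + b_1 X_1 + b_2 X_2 + e
\qquad\text{and, in general,}\qquad
Y = a + b_1 X_1 + b_2 X_2 + \dots + b_p X_p + e
```

Each b is the slope for its own predictor, a is the intercept, and e is the error term, exactly as in the simple case.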
Multiple regression for mental impairment
• Outcome (DV) is a measure of Mental Impairment, general psychiatric
symptoms
• Possible IVs are the two predictor variables:
o Life events score
o Socioeconomic status
• We start to recognize that our variable of primary interest may not be the only
relevant IV
• DV is a measure of general mental impairment (psychiatric symptoms
including depression and anxiety)
• The two IVs used here are X1 = life events and X2 = socioeconomic status
• Life events refers to score on a life events index, including both number of life
events and severity of events experienced in the past 3 years
• Life events is our IV of primary interest. The research question is whether
more frequent (and severe) life events predict higher mental impairment
Steps in doing this multiple regression
• 1. Recognise problem as a multiple regression
• 2. Remember RQ
• 3. Univariate data description – graphical and numerical
• 4. Bivariate Graphical data description
• 5. Produce correlation matrix (Pearson’s r)
• 6. Fit full model (if appropriate to 2.)
• 7. Reduce Full Model (if appropriate)
• 8. Fit Final Model and Report
Steps 1-2:
• Recognize problem as a multiple regression
o 1 numeric DV, 2 numeric IVs
• Consider theory: more frequent and severe life events should generate more
psychiatric symptoms
• Write (draw) RQ: do life events and socioeconomic status together predict
mental impairment? If so, are both predictors required?
• Y = mental impairment
• X1 = life events
• X2 = SES
find more resources at oneclass.com
Understand population and data
• Understand sampling population: Florida adults
• Understand unit of analysis: general community members
• Check all IVs numeric → yes
• Ordinal variables checked before upgrading → no ordinal variable
• Consider the cases-to-predictor (N-to-IV) ratio rule of thumb:
o Count the number of IVs (p) and multiply it by each of two rule-of-thumb
factors, 5 and 10
o N > 5*p is bare minimum
o N > 10*p more desirable
o Here we have a sample of N = 40 > 10*2 = 20
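The rule of thumb above can be sketched as a small check. The function name and the three labels it returns are hypothetical, chosen only for illustration; the thresholds (5*p and 10*p) and the values N = 40, p = 2 come from the notes.

```python
def sample_size_check(n_cases: int, n_predictors: int) -> str:
    """Classify a sample size against the N > 5*p and N > 10*p rules of thumb."""
    if n_cases > 10 * n_predictors:
        return "desirable"
    if n_cases > 5 * n_predictors:
        return "bare minimum"
    return "too small"

# Values from the notes: N = 40 cases, p = 2 IVs.
print(sample_size_check(40, 2))  # desirable
```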
3. Univariate data description
• Produce graphical summaries (histogram, error bar plot)
• Comment on distributions for BOTH IVs (central tendency, variability, skew,
kurtosis etc.)
• Summarise with appropriate numerical values
• Write global summary statement of what you have found
• SPSS menu: Graphs → Legacy Dialogs → Histogram (tick Display normal curve)
• Descriptive statistics
o You can see the three means of the variables and the sample sizes (N)
o Listwise N = number of cases that have valid values for all variables in
the table
• The three variables were approx. normally distributed. The dependent
variable, Mental Impairment, ranged from 17 to 41 (mean 27, SD 5.5). For
the two predictors, life events ranged from 3 to 97 and SES ranged from
3 to 96
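A minimal sketch of the numerical part of this step, outside SPSS. The records below are hypothetical values made up for illustration (they are NOT the lecture data), and the helper names are assumptions. It shows how per-variable N can differ from listwise N when a value is missing.

```python
from statistics import mean, stdev

# Hypothetical illustrative records (NOT the lecture data); None marks a missing value.
records = [
    {"impairment": 20, "life_events": 10, "ses": 80},
    {"impairment": 25, "life_events": 30, "ses": None},
    {"impairment": 30, "life_events": 50, "ses": 40},
    {"impairment": 35, "life_events": 70, "ses": 20},
]

def describe(rows, name):
    """Return N, min, max, mean and sample SD for one variable, skipping missing values."""
    values = [row[name] for row in rows if row[name] is not None]
    return {"n": len(values), "min": min(values), "max": max(values),
            "mean": mean(values), "sd": stdev(values)}

def listwise_n(rows):
    """Count cases that have valid values on every variable."""
    return sum(all(v is not None for v in row.values()) for row in rows)

for variable in ("impairment", "life_events", "ses"):
    print(variable, describe(records, variable))
print("Listwise N:", listwise_n(records))
```

In this made-up example the second case is missing SES, so SES has N = 3 and the listwise N is also 3, while the other two variables have N = 4.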
4. Bivariate data description
• Plot DV against each IV
• Comment on scatterplots (7 points)
• Consider outliers
• Write global summary statement of what you have found
• The scatterplots for mental impairment against life events and SES show a
positive and a negative linear relationship respectively. Both relationships
appear to be of low-to-moderate strength, with correspondingly low
correlations. The graphs show no unusual characteristics, although there may
be one outlier for the relationship between MI and SES
5. Produce correlation matrix
• Consider collinearity, multicollinearity
• Consider statistical significance of DV correlation with each IV
• Consider correlations between the two IVs
• Write summary statement on what you found
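To make the correlation matrix concrete, here is a minimal sketch of Pearson's r computed by hand for every pair of variables. The scores are hypothetical values made up for illustration (NOT the lecture data), and the function name is an assumption; in the lecture this table is produced by SPSS.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical illustrative scores (NOT the lecture data).
data = {
    "impairment": [20, 25, 30, 35, 28],
    "life_events": [10, 40, 35, 70, 55],
    "ses": [80, 60, 45, 30, 50],
}

# Print one row of the correlation matrix per variable.
names = list(data)
for row in names:
    print(row, [round(pearson_r(data[row], data[col]), 2) for col in names])
```

The diagonal is always 1 (each variable with itself). The off-diagonal entries are what you inspect: the DV row for the DV-IV correlations, and the IV-IV entry for signs of collinearity.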
Collinearity (multicollinearity) occurs when two or more IVs are so correlated that
one can be predicted (almost) exactly from one or more of the others
• Made possible by a combination of
o Multiple IVs
o Non-orthogonality due to observational design
• Requires strong correlations between IVs to occur
• Multicollinearity exists whenever an IV can be exactly/nearly calculated from
a linear combination of other IVs
• Indicators
o Large correlations among IVs
o Large changes in coefficients and/or SE, when a new IV is added to
the model
Scatterplot of IVs to look at correlation/collinearity
• The scatterplot of the two IVs confirms they are only weakly correlated (look
at diagram), thus no collinearity
• This plot and the earlier correlation work for 2 IVs, but if there are >2 IVs
there may be a more complex pattern of IV correlations
Correlation table:
• You can see that mental impairment (DV) is positively correlated with life
events
• But negatively correlated with SES
• Correlation between life events and SES = positive but very small
• Correlation of each X with Y (each IV with the DV) = both moderate and
statistically significant (at the 0.05 level)
• Thus, the chance of obtaining a Pearson correlation at least as extreme as
.372 in a sample of 40, when the null hypothesis is true, is 1.8% (.018) → ask
yourself whether that is sufficiently unlikely
• You can define "sufficiently unlikely" with whatever cut-off percentage
(significance level) you choose; 5% is the conventional choice
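As a rough check on where a p-value like .018 comes from, the correlation can be converted to a t statistic with N - 2 degrees of freedom using the standard formula t = r * sqrt(N - 2) / sqrt(1 - r^2). This is a sketch under the assumption that SPSS reports the usual two-tailed test of a zero population correlation; the function name is hypothetical.

```python
from math import sqrt

def t_for_correlation(r: float, n: int) -> float:
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Values from the notes: r = .372, N = 40.
t = t_for_correlation(0.372, 40)
print(round(t, 2))  # 2.47, on 38 degrees of freedom
```

The two-tailed 5% critical value of t with 38 degrees of freedom is about 2.02, so t = 2.47 exceeds it, which is consistent with the reported p = .018 being below .05.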
Rule of thumb – if the correlation between two predictors is above 0.7 in absolute
value, collinearity is possible. If it is above 0.8, we have definite collinearity
(we find these values in the Pearson correlation table)
• A change in the sample will help change stat to collinearity (e.g. changing
from 40 to 40,000)
But… bivariate correlations may not be sufficient to identify collinearity
• One IV may be a non-obvious linear combination of several other IVs
• SPSS calculates tolerance and the variance inflation factor (VIF); these are
found in the Coefficients table under "Collinearity Statistics". They show
the degree to which each IV is correlated with the other IVs (how much variance
the IVs share with each other in the model predicting mental impairment)
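A minimal sketch of what these two statistics are. Tolerance for an IV is 1 minus the R-squared from regressing that IV on the other IVs, and VIF is 1 divided by tolerance. With exactly two IVs, that R-squared is just the squared correlation between them, which is the special case coded here. The function names are hypothetical; r = 0.12 is the life events / SES correlation reported in these notes.

```python
def tolerance_two_ivs(r_between_ivs: float) -> float:
    """Tolerance of each IV when the model has exactly two IVs: 1 - r^2."""
    return 1 - r_between_ivs ** 2

def vif(tolerance: float) -> float:
    """Variance inflation factor is the reciprocal of tolerance."""
    return 1 / tolerance

# r = 0.12 is the life events / SES correlation reported in these notes.
tol = tolerance_two_ivs(0.12)
print(round(tol, 4), round(vif(tol), 4))  # 0.9856 1.0146
```

Tolerance close to 1 (and VIF close to 1) means the IVs share almost no variance, which matches the conclusion of no collinearity. For comparison, a correlation of 0.8 between two IVs would give a tolerance of 0.36 and a VIF of about 2.78.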
Summary of step 5 (correlation)
• Both predictors are statistically significantly correlated with the DV. Life
events is positively correlated with Mental impairment (r = 0.37; p = 0.018)
while SES has a correlation of similar degree, but negative, with Mental
impairment (r = -0.40, p = 0.011). The two predictors are very weakly
correlated (r = 0.12, p = 0.45) and thus give us no reason to expect
collinearity.
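To illustrate why weak correlation between the IVs matters, the sketch below computes the standardized regression coefficients for a two-predictor model directly from the three rounded correlations in the summary above, using the standard two-predictor formula beta1 = (r_y1 - r_y2 * r_12) / (1 - r_12^2). This is an illustration built from rounded values, not the SPSS output from the lecture, and the function name is hypothetical.

```python
def standardized_betas(r_y1: float, r_y2: float, r_12: float):
    """Standardized coefficients and R^2 for a two-predictor regression."""
    denom = 1 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    r_squared = beta1 * r_y1 + beta2 * r_y2
    return beta1, beta2, r_squared

# Rounded correlations from the step 5 summary:
# MI with life events = 0.37, MI with SES = -0.40, life events with SES = 0.12.
b1, b2, r2 = standardized_betas(0.37, -0.40, 0.12)
print(round(b1, 2), round(b2, 2), round(r2, 2))  # 0.42 -0.45 0.34
```

Because the two IVs are almost uncorrelated, each standardized coefficient stays close to its bivariate correlation (0.37 and -0.40). Large shifts here when the second IV is added would be one of the collinearity indicators listed earlier.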