Lecture 8

University of Toronto Mississauga

Sociology

SOC350H5

David Pettinicchio

Winter

Index Variable
Count data
Does not follow the normal distribution that OLS assumes
How to create an index variable
People call it an index or a scale
It is a count measure
When creating a scale
You ask SPSS to create the scale, and it counts whatever you have coded as 1
If the victimization items are coded inconsistently, recode them in the same direction so that 1 in one variable means the same thing as 1 in the others
Why OLS is problematic: with too few variables added together you won't get statistical significance
It produces a small range
Not enough variation in that range, due to fewer cases
Only use an additive index if it has at least five possible outcomes
Have to have enough variables
The things you add have to be coded the same way
How to create index
Start with clean and consistent variables
If one variable is coded 1 and 2, SPSS will just add the 2
When you add variables together they have to be coded exactly the same way, otherwise you are adding things that are not comparable
Everything has to be the same
Compute: list the variables you are going to add into the total variable and use the arithmetic operators to add them
Will end up with a range
If you have 15 possibilities then the max you can have is 15
Must not exceed the maximum
Have to also look at the distribution
Always double check
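The steps above can be sketched in Python rather than SPSS. This is only an illustration: the item names (theft, assault, vandalism), their codings, and the three respondents are made up and do not come from the lecture.

```python
from collections import Counter

# Hypothetical yes/no victimization items (names and values are invented).
# theft and assault are coded 0 = no, 1 = yes.
# vandalism is coded 1 = no, 2 = yes, so it must be recoded before adding.
respondents = [
    {"theft": 1, "assault": 0, "vandalism": 2},
    {"theft": 0, "assault": 0, "vandalism": 1},
    {"theft": 1, "assault": 1, "vandalism": 2},
]
items = ["theft", "assault", "vandalism"]


def recode_vandalism(value):
    """Recode the 1/2 item to 0/1 so that 1 means 'yes' on every item."""
    return {1: 0, 2: 1}[value]


def build_index(row):
    """Add the consistently coded items into one count."""
    cleaned = dict(row)
    cleaned["vandalism"] = recode_vandalism(row["vandalism"])
    return sum(cleaned[name] for name in items)


index_scores = [build_index(row) for row in respondents]

# Double check: the index can never exceed the number of items added.
assert all(0 <= score <= len(items) for score in index_scores)

# Look at the distribution of the index, not just its range.
distribution = Counter(index_scores)
print(index_scores, dict(distribution))
```

Without the recode step, a "yes" on vandalism would add 2 instead of 1 and the index could exceed its maximum of 3.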
For OLS it is problematic when there is little variation in that spread
Caveat: you shouldn't be doing OLS with count data
Because it is not a normal distribution
Can fix it to get significance
People who create an index use Cronbach's alpha to check its reliability
An alpha below .7 means you have an unreliable scale
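A minimal sketch of the Cronbach's alpha check, using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of the total score). The three items and five respondents are invented for illustration.

```python
from statistics import variance

# Hypothetical 0/1 items for five respondents (invented values).
items = {
    "item1": [1, 0, 1, 1, 0],
    "item2": [1, 0, 1, 0, 0],
    "item3": [1, 0, 1, 1, 1],
}

k = len(items)  # number of items in the scale

# Total index score per respondent (sum across the items).
totals = [sum(scores) for scores in zip(*items.values())]

# Sample variance of each item, added up.
sum_item_variances = sum(variance(scores) for scores in items.values())

# Cronbach's alpha from the item variances and the variance of the total.
alpha = (k / (k - 1)) * (1 - sum_item_variances / variance(totals))

# The .7 rule of thumb from the notes.
reliable = alpha >= 0.7
print(round(alpha, 3), reliable)
```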
Outliers
Normally distributed data: symmetric, no influential outliers
Line of fit is based on a model that attempts to minimize error
Outliers have undue influence on the model (line of fit)
Outliers have large errors, and that affects the slope
Example model: mother's and father's occupational prestige and education determining respondent's occupational prestige
Doing a scatter plot in SPSS: most data is clustered
Everything that starts to move away could be problematic
Standardized residuals
Just because a case seems to be an outlier doesn't mean it is
Use the standardized residual tool: we want to compare errors
We want everything to be standardized
We are talking about the distance of points from the line of fit; that is what we standardize
Allows us to talk about outliers in terms of standard scores
Allows you to use the normal distribution to see if the outliers are beyond certain z-scores
Those are definite cases for concern: beyond 2.5 is a problem
The idea is that you can now use the properties of distributions to see how many standard deviations away a case is; cases beyond 2.5 are problems
After creating a new variable for the standardized residuals, you can make a histogram
This is a distribution of the errors, not cases
0 is the average error and the tails are standard deviations away
When you go into SPSS, create a new variable
This one has large outliers on the positive end
Mean residual was 44.84
Lowest error: -35.68
Highest error: 56.901
Z-scores range from -2.942 to 4.689
There is more of a problem on the right
The case with the highest error is 4.689 standard deviations away from the mean
Cases of homicide - **
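To make the standardized-residual idea concrete, here is a rough Python sketch. It assumes a simple one-predictor OLS line and standardizes each residual by the standard error of the estimate, sqrt(SSE / (n - 2)). The data are a toy example invented for illustration, not the occupational prestige data from the lecture.

```python
import math

# Toy data (invented): y equals 2*x exactly, except for one case at x = 0.
x = list(range(-6, 7))
y = [2 * xi for xi in x]
y[6] += 13  # give the case at x = 0 a large positive error


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(x, y)

# Raw errors: observed value minus the value predicted by the line of fit.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Standard error of the estimate, used to put the errors on a z-score scale.
s = math.sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))
z_resid = [e / s for e in residuals]

# Cases whose standardized residual is beyond 2.5 in either direction.
flagged = [i for i, z in enumerate(z_resid) if abs(z) > 2.5]
print(flagged, [round(z, 2) for z in z_resid])
```

A histogram of `z_resid` would be the distribution of errors described above, centered on 0.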
Cook’s D
Cook's D is another, more systematic tool for seeing outliers
A measure of both distance and leverage: how much influence the case is having on the model
The higher the Cook's D value, the more influence that case has on the slope and statistical significance
Values greater than 1 are a problem; smaller values are not
Look at the structure of the data: using these tools, are there cases that really stand out and might be influencing your model
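A small sketch of Cook's D for a one-predictor OLS model, using the textbook formula D = e^2 / (p * MSE) * h / (1 - h)^2, where h is the case's leverage and p = 2 is the number of estimated coefficients. The six data points are invented. The last case sits far from the others on x, so it has high leverage.

```python
# Toy data (invented): five cases on a perfect line plus one high-leverage case.
x = [1, 2, 3, 4, 5, 10]
y = [1, 2, 3, 4, 5, 2]

n = len(x)
p = 2  # coefficients estimated: intercept and slope

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Distance: how far each case is from the line of fit.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
mse = sum(e ** 2 for e in residuals) / (n - p)

# Leverage: how unusual each case is on the predictor.
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Cook's D combines distance and leverage into one influence measure.
cooks_d = [
    (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
    for e, h in zip(residuals, leverage)
]

# Cases above the cutoff of 1.
influential = [i for i, d in enumerate(cooks_d) if d > 1]
print(influential, [round(d, 3) for d in cooks_d])
```

In this toy example the last case does not have the largest raw error, but its leverage gives it by far the largest Cook's D. A residual check alone can miss this.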
Cook's D: estimation diagnostics
Want to diagnose
DFBETA, ranked in order
It looks at the difference in the regression coefficient if you were to drop a given case
It gives an indicator of how much influence those cases have, based on what changes when you drop them
Cutoff: absolute value greater than 2/sqrt(n), where n is the sample size, OR less than -1 or greater than 1
Look at it comprehensively with all tests
The 0.04 cutoff was established by 2/sqrt(n)
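The drop-one-case logic behind DFBETA can be sketched as follows. This assumes a one-predictor model and the usual standardized form: the change in the slope when case i is dropped, divided by the slope's standard error computed with that case removed. The data are invented and the helper name `fit` is hypothetical.

```python
import math

# Toy data (invented): the last case is far from the rest on x.
x = [1, 2, 3, 4, 5, 10]
y = [1, 3, 2, 5, 4, 2]


def fit(xs, ys):
    """Return (slope, residual standard error) of a simple OLS line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(xs, ys))
    return slope, math.sqrt(sse / (n - 2))


n = len(x)
x_bar = sum(x) / n
sxx_full = sum((xi - x_bar) ** 2 for xi in x)
slope_full, _ = fit(x, y)

dfbetas = []
for i in range(n):
    # Refit the model with case i dropped.
    xs = x[:i] + x[i + 1:]
    ys = y[:i] + y[i + 1:]
    slope_drop, s_drop = fit(xs, ys)
    # Change in the slope, scaled by its standard error without case i.
    dfbetas.append((slope_full - slope_drop) / (s_drop / math.sqrt(sxx_full)))

# Size-adjusted cutoff from the notes: 2 / sqrt(n).
cutoff = 2 / math.sqrt(n)
flagged = [i for i, d in enumerate(dfbetas) if abs(d) > cutoff]
print(round(cutoff, 3), flagged, [round(d, 2) for d in dfbetas])
```

With only six cases the cutoff is large (about 0.82). A cutoff of 0.04 from the same formula corresponds to a much larger sample.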
Solutions
Delete the observation that is the outlier
Delete the variable if it has a lot of outliers
Transform a v

