
# SOC202H1 Lecture Notes - Linear Regression, Null Hypothesis, Analysis Of Variance

Department
Sociology
Course Code
SOC202H1
Professor
Scott Schieman

SOC202
PROF SCHIEMAN
MAR 20 2012
TG. SP.
Today’s topic: Regression Analysis
Imagine you collected a great data set.
You have all these data and a large sample.
The last few classes have been about ‘which test do you use?’, e.g. why would you use a t-test?
The formulas are all different, but in the end we are all trying to say something.
Regression serves a variety of purposes:
Predicting future values of a variable, based on known values
Describing the patterns in complicated data in a simple way
Evaluating and refining theories about how changes in one variable cause change in another
o Does one more unit of education really matter? Suppose average income increases by 10,000 with each increase in education. If you knew that, you would know education pays off.
Regression: The Basics (#1)
Q: “As the level of X increases, _____________?”
A: “As the level of X increases, what happens to levels of Y?”
How much better does the regression line fit the data than a single mean does? E.g. how well did you do on the research paper? The prof summarized everyone with one mean (average).
Regression: The Basics (#2)
People get more income because of their education, but people also go back to school to get more income. In that second case, income is the one doing the affecting first.
We regress the dependent variable on the independent variable. Keeping this in mind will orient you when you look at the table.
Regression: The Basics
Regression takes ANOVA and expands it out (e.g. more education = more income).
How far is that line from the grand mean? Imagine a scatterplot where height had nothing to do with weight. How would you draw a best-fitting straight line? It would just be a flat line at the mean.
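To see the flat-line case concretely, here is a minimal sketch in Python. The five data points are made up for illustration (they are not from the lecture) and were chosen so that x and y are completely unrelated; the helper name `least_squares` is also just an illustrative choice.

```python
# Hypothetical x and y values, picked so that the two are unrelated
# (their covariance is exactly zero). Not data from the lecture.
x_values = [1, 2, 3, 4, 5]
y_values = [2, 1, 4, 1, 2]

def least_squares(x, y):
    """Return (intercept, slope) of the best-fitting straight line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations, and spread of x around its mean
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = least_squares(x_values, y_values)
grand_mean = sum(y_values) / len(y_values)

print(f"slope = {slope:.2f}")          # slope = 0.00
print(f"intercept = {intercept:.2f}")  # intercept = 2.00
print(f"grand mean = {grand_mean:.2f}")  # grand mean = 2.00
```

With a slope of 0, the predicted y is the grand mean no matter what x is, which is the "line is just the mean" picture.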
Assuming linearity, the regression line goes through the mean of y at every value of x. Each value of x has its own mean of y and its own spread.
What is the null hypothesis?
Formal: the regression coefficient (slope) equals 0 in the population
Informal: there is no relationship between X and Y in the population
Example:
If you get this example, you can plug anything into it.
Imagine you are interested in the research question: how does eating peeps affect aggression?
Let’s say you observed kids eating peeps and then observed them at recess.
Aggression is the dependent variable. We are regressing aggression on peeps.
Null hypothesis: formally, the slope is 0 in the population
o 265 is the number of observations. Imagine you randomly selected them from a large pool of kids. Those 265 carry a burden: they represent all the kids you are interested in.
o If you scooped a sample and by mistake picked up too many rough kids, those 265 don’t represent the population. Then everything else is crap (garbage in, garbage out: it doesn’t mean anything).
o If the measure of peeps or the measure of aggression is bad, everything falls apart as well.
What is the regression equation?
o Predicted aggression = 33.17 + 3.08 × (peeps)
Other variables can be added, such as the number of hours of sleep the kid got last night, age, gender, etc. (the model becomes multivariable)
Interpret the slope
o The slope is 3.08
o The way to interpret it: with each one-unit increase in peeps, the score on the aggression index increases by 3.08
o Where the slope comes from: you have a kid in front of you. He ate 4 peeps. He slapped 6 times. Multiply those together, do that for all the kids, and average it out over peeps. What’s the overall
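The "multiply, add up, average out" description can be sketched as the usual least-squares slope. One detail the informal description skips: the standard calculation works with each kid's distance from the mean of peeps and the mean of aggression, not the raw numbers. The five kids below are hypothetical, invented only to show the arithmetic.

```python
# Hypothetical (peeps eaten, aggression score) values for five kids.
# Invented for illustration; not the lecture's 265 observations.
peeps = [0, 1, 2, 3, 4]
aggression = [33, 37, 39, 43, 45]

mean_peeps = sum(peeps) / len(peeps)
mean_aggression = sum(aggression) / len(aggression)

# For each kid: (distance from mean peeps) * (distance from mean aggression)
cross_products = [
    (x - mean_peeps) * (y - mean_aggression)
    for x, y in zip(peeps, aggression)
]
# How spread out peeps is around its own mean
squared_deviations = [(x - mean_peeps) ** 2 for x in peeps]

# Slope: total cross-product "averaged out" over the spread in peeps
slope = sum(cross_products) / sum(squared_deviations)
# Intercept: makes the line pass through (mean peeps, mean aggression)
intercept = mean_aggression - slope * mean_peeps

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

For this toy data the slope works out to 3.0: each extra peep goes with a 3-point rise in predicted aggression.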
What would be the score on the aggression index when peeps = 0? 1? 5?
10?
o Aggression = 33.17 + (3.08)0 = 33.17
o Aggression = 33.17 + (3.08)1 = 36.25 (as you move up one peep, predicted aggression moves up by 3.08)
o If the slope were 0, then as you move up in x, nothing is happening in y. There can be a negative association as well.
o At high values of peeps, predicted aggression is through the roof
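The four predictions asked for (peeps = 0, 1, 5, 10) can be worked out by plugging into the equation from the notes. A short sketch; the function name `predicted_aggression` is just an illustrative choice.

```python
# Coefficients taken from the regression equation in the notes
INTERCEPT = 33.17
SLOPE = 3.08

def predicted_aggression(peeps):
    """Predicted aggression index for a given number of peeps."""
    return INTERCEPT + SLOPE * peeps

for peeps in (0, 1, 5, 10):
    print(f"peeps = {peeps:>2}: predicted aggression = {predicted_aggression(peeps):.2f}")
# peeps =  0: predicted aggression = 33.17
# peeps =  1: predicted aggression = 36.25
# peeps =  5: predicted aggression = 48.57
# peeps = 10: predicted aggression = 63.97
```

Each step of one peep adds the same 3.08, which is what it means for the relationship to be linear.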
What is the t-statistic?
o The t-statistic is the slope divided by its standard error
o The standard error is very straightforward: it is sample-to-sample variability. Not exciting, but there’s no other way to say it. If the standard error is large, you could scoop another 265 kids and find that the slope is zero. It basically tells you how much error there is in the estimate, i.e. how accurate the estimate is.
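Sample-to-sample variability is easier to believe once you watch it happen. The sketch below simulates scooping 500 different samples of 265 kids and fitting a slope to each. Everything about the simulated population is an assumption made for illustration: peeps eaten is a random whole number from 0 to 10, the true line is the one from the notes (33.17 + 3.08 × peeps), and the individual noise has a standard deviation of 27, picked so the result lands near the 0.52 standard error in the notes.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def fit_slope(x, y):
    """Least-squares slope of y on x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def scoop_sample(n=265):
    """Draw one hypothetical sample of n kids from the assumed population."""
    peeps = [random.randint(0, 10) for _ in range(n)]
    # Assumed population line plus individual noise (sd = 27 is an assumption)
    aggression = [33.17 + 3.08 * p + random.gauss(0, 27) for p in peeps]
    return peeps, aggression

# Scoop 500 separate samples and record the slope from each one
slopes = [fit_slope(*scoop_sample()) for _ in range(500)]

print(f"average slope across samples: {statistics.fmean(slopes):.2f}")
print(f"spread of slopes (empirical standard error): {statistics.stdev(slopes):.2f}")
```

The slopes bounce around the true value from sample to sample, and the standard deviation of those slopes is what the standard error is estimating.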
How do you obtain it?
Slope divided by standard error: 3.08/0.52 = 5.92
o 5.92 is very unusual if the null hypothesis is true.
o E.g. Gwen Stefani is so different. 5.92 is like Gwen Stefani.
Reflect on what things would look like in terms of the coefficient and standard error:
A t-statistic of 1.5 would not allow you to reject the null hypothesis.
That would mean that as kids moved up in levels of peep consumption, not much happened in terms of aggression.
It gets much worse if there is a lot of residual variation and the kids are all over the place.
Standard error of a slope
o Keep in mind that the t-statistic is the slope divided by the standard error, so you need a large t-statistic to reject the null hypothesis
o The whole idea is deviation from some kind of value
o The denominator is like averaging out how the residuals look overall
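One way to make "averaging out how the residuals look overall" concrete is the textbook formula for the standard error of a slope in simple regression: the square root of (residual variance divided by the spread of x), where residual variance is the sum of squared residuals over n − 2. The notes do not spell this formula out, so treat the sketch below as the standard calculation rather than the lecture's own. The six kids are hypothetical.

```python
import math

# Hypothetical data for six kids; invented for illustration only.
peeps = [0, 1, 2, 3, 4, 5]
aggression = [30, 40, 36, 46, 42, 52]

n = len(peeps)
mean_x = sum(peeps) / n
mean_y = sum(aggression) / n

# Fit the line
sxx = sum((x - mean_x) ** 2 for x in peeps)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(peeps, aggression))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed score minus the score the line predicts
residuals = [y - (intercept + slope * x) for x, y in zip(peeps, aggression)]

# "Average" squared residual, using n - 2 because two coefficients were estimated
residual_variance = sum(r ** 2 for r in residuals) / (n - 2)

# Standard error of the slope, then the t-statistic
se_slope = math.sqrt(residual_variance / sxx)
t_stat = slope / se_slope

print(f"slope = {slope:.2f}, SE = {se_slope:.2f}, t = {t_stat:.2f}")
```

Bigger residuals (kids all over the place) inflate the standard error and shrink the t-statistic; more spread in peeps or more kids does the opposite.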
What is the p-value?
o The computer will give you the precise value
o p < 0.001
The t-statistic is 5.92 and the associated p-value is less than 0.001. Reject the null.
o About 1.96 is the size of t-statistic you need to obtain to reject the null hypothesis at the 0.05 level (two-tailed).
o The best way to think this through is to write a brief essay: pick a few t-statistics and look at the sample-to-sample variation
o If the t-statistic were 1.97, we could reject the null hypothesis, but we would be unsure: if we drew another whole sample of 265, we could possibly fail to reject the null hypothesis. Then we’d have to say ‘the association isn’t strong enough’.
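To compare the t-statistics mentioned here (5.92, 1.97, 1.5), the sketch below computes an approximate two-tailed p-value for each. Two assumptions are built in: the standard normal curve stands in for the t distribution (reasonable with 265 observations, but a computer would use the exact t distribution), and the cutoff is the conventional two-tailed value of about 1.96 for the 0.05 level.

```python
import math

slope = 3.08
standard_error = 0.52
t_stat = slope / standard_error  # about 5.92

def two_sided_p(t):
    """Approximate two-tailed p-value, using the normal curve in place of t."""
    return math.erfc(abs(t) / math.sqrt(2))

CRITICAL_VALUE = 1.96  # assumed two-tailed cutoff at the 0.05 level

for t in (t_stat, 1.97, 1.5):
    p = two_sided_p(t)
    decision = "reject null" if abs(t) > CRITICAL_VALUE else "fail to reject null"
    print(f"t = {t:.2f}: p is about {p:.2g}, {decision}")
```

A t of 5.92 gives a p-value far below 0.001, 1.97 squeaks in just under 0.05, and 1.5 does not come close.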
Confidence interval
o LCL = 2.06, UCL = 4.10
o The confidence interval doesn’t contain 0, so we reject the null; we are 95% confident that the true population slope falls somewhere between 2.06 and 4.10
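The limits can be reproduced from the slope and standard error given earlier. The sketch assumes the usual 95% multiplier of 1.96, which the notes do not state explicitly but which matches the reported limits.

```python
slope = 3.08
standard_error = 0.52
critical_value = 1.96  # assumed multiplier for a 95% interval

lcl = slope - critical_value * standard_error  # lower confidence limit
ucl = slope + critical_value * standard_error  # upper confidence limit
contains_zero = lcl <= 0 <= ucl

print(f"95% CI: [{lcl:.2f}, {ucl:.2f}]")  # 95% CI: [2.06, 4.10]
print(f"contains 0: {contains_zero}")     # contains 0: False
```

An interval that excludes 0 leads to the same decision as the t-test: reject the null at the 0.05 level.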