(Covers the end of chapter 7 and the beginning of chapter 8)
SCATTERPLOT OF MARIJUANA USE VS. OTHER DRUG USE
- Scatterplot is display showing relationship (if any) of one var to another
- if we know one var, can we predict the other?
- Ask yourself these questions when looking at scatterplot
Note - will refer to example mentioned here by (ex).
(ex) of the below scatterplot:
- data from many diff. countries showing relationship of % of people in country who use
marijuana plotted against % of people who used other drugs at least once in their lifetime
- if we know marij use % in country (ie, a certain x-val), then can predict % of other drug use (ie. can
predict the corresponding y-val.)
- prediction is hoped to be close to ideal (ie. the truth)
1. Describe what is seen on scatterplot
- straight line or curve?
- don't want it to be curved, but want trend that is reasonably well described by straight line
(ex) goes up; a positive association: as one goes up, the other goes up
- fairly linear
=> reasonably strong relationship
(ex) - can't predict other drug use perfectly
- knowing one is high gives good idea other is high too
2. Does it make sense to get correlation here?
- yes: the trend goes straight up; it doesn't go up and then level off, and it doesn't go up and down
3. Can a conclusion be made that the x-variable (predictor variable) causes the y-variable (response variable)?
(ex) does marijuana use lead to use of other drugs?
- No, b/c correlation doesn't imply causation
Can causation be retrieved from this data?
- claim: marij is a "gateway" drug, ie. if they try this, then they will try other drugs (does data prove this?)
- if marijuana use is high, other drug use is high too, but scatterplot doesn't tell us that one is
the cause of another, nor does it tell us which direction the cause and effect goes
- ie. first of all, it does not tell us what causes what
- ie. second of all, it does not tell us x causes y, or y causes x.
- all scatterplot tells us is that if one var is high, the other is high too
- so a correlation alone does not show that one var is CAUSING the other
(EX) STREAM EXAMPLE
"In a study of streams in the Adirondack mountains, the following association was found between
the pH of the water and its hardness:"
- 2 quantitative var's are being related here
- hardness of water (how much Ca is in it); x-variable (variable on horizontal axis)
- pH (scale to assess the extent of acidity of the water); y-variable (var. on vertical axis)
- +ve association
- NOT a straight line
Is it appropriate to use correlation as summary val for this relationship?
- No, b/c correlation is only used for linear (straight-line) relationships
- Q: if you know the x-variable, what does it tell you about the other (ie. y-variable)?
- if the hardness is low, then the pH is also low
- but, we cannot use a linear relationship for hardness val's that are just over ~150 because the
relationship is no longer behaving like a straight line
- it levels off instead of continually going straight up
- if this were straight line, then as water hardness goes up, pH should continue going up
- instead, what is happening is that it is going up, then leveling off
1. look at scatterplot
2. ask yourself: is it a straight line?
- if it is, can calc. 'r' (correlation coefficient), and other summary val's for linear relationships
- if it is not a straight line, then don't bother calculating them
For this example:
- it is +ve association, but it's NOT a straight line
- thus correlation should not be used to summarize this relationship b/c correlation is only
useful in LINEAR relationships
What can we do with this?
- so far, the most we can do with "non-linear" relationships is to draw a picture
- when we get to chapter 10, we will learn ways to "unbend curves" so that we can get a straight line
again, and then apply methods of linear relationships to relationships that initially were not so linear
at all (re-expression)
Draw a picture, Draw a picture, Draw a picture
- Notice that if we only looked at 'r' without seeing the picture, we might be misled into thinking that
association is such-and-such based on the value of the 'r'
- have to first check for linearity to see whether it is valid to use 'r' at all
- this is analogous to idea of how shape can only be assessed by picture
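To see why the picture has to come first, here is a minimal Python sketch. The hardness and pH numbers are made up for illustration (they are NOT the Adirondack data), and `pearson_r` is a helper name chosen here, not something from the course. The data rise and then level off, yet 'r' still comes out high, so 'r' alone would hide the bend.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r of two equal-length lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up values (NOT the real stream data): pH rises, then levels off.
hardness = [25, 50, 75, 100, 125, 150, 175, 200, 225, 250]
ph = [6.0, 6.6, 7.1, 7.5, 7.8, 8.0, 8.0, 8.1, 8.1, 8.1]

r = pearson_r(hardness, ph)

# Average rise in pH per unit of hardness, early vs. late in the range.
early_rise = (ph[4] - ph[0]) / (hardness[4] - hardness[0])
late_rise = (ph[9] - ph[5]) / (hardness[9] - hardness[5])

print(f"r = {r:.2f}")
print(f"early rise = {early_rise:.4f}, late rise = {late_rise:.4f}")
```

The early rise is many times larger than the late rise, which is the levelling off that only the scatterplot (or a check like this one) reveals.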
What variable for x, which variable for y?
- when it comes to calculating 'r' (correlation), it doesn't matter which variable is x and which is y
- so for correlation per se, it does not matter whether you put such and such var. on x-axis, or
on y-axis
- but to help explain the relationship better, we do as follows
- the variable that is the "outcome" is placed on the y-axis, and is called response variable
=> response var. (y) on y-axis
- the other var. that helps to explain the response goes on the x-axis, and is called predictor variable
=> predictor var. (x) on x-axis
(ex) pH of water is response, and so if we know how hard water is, then able to predict what pH will be
=> pH as response, water hardness as explanatory var (one that explains the response)
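The claim that 'r' does not care which variable is x and which is y can be checked with a short Python sketch. The numbers are hypothetical, and `pearson_r` is just a helper written for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r of two equal-length lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, chosen only to illustrate the symmetry.
hardness = [20, 40, 60, 80, 100]
ph = [6.1, 6.5, 7.0, 7.2, 7.7]

r_xy = pearson_r(hardness, ph)  # hardness as x, pH as y
r_yx = pearson_r(ph, hardness)  # roles swapped

print(r_xy, r_yx)
```

Both calls give the same value, because the formula treats the two lists identically. The choice of axes matters for explaining and predicting, not for 'r'.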
Linear Regression - finding the BEST line
What is this chapter about?
- we have data that can be modelled by a straight line relationship but now we have to find out which
line, out of the many lines we can choose, best describes the relationship
- Straight-line relationship is given by: y = a + bx
- x and y are the var's
- a = "intercept"; value of y where x = 0.
- if explanatory var. (x) equals 0, then what is response var. (y)?
- b = "slope"; if you increase x by ___, how much do you increase or decrease y by?
slope = 2
- increasing x by one increases y by 2
slope = -3
- increasing x by 1 decreases y by 3 - as x goes up, y goes down
- response var. (y) dep. on explanatory var. (x) in a straight-line relationship
- if y = a + bx, then the relationship is straight line
=> if want to find out straight line, then it’s a matter of finding out what is a (intercept) and b (slope)
- intercept: where it starts
- slope: how fast it goes up and down
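The meaning of intercept and slope in y = a + bx can be sketched in a few lines of Python. The function name `line` and the intercept value 1 are assumptions made for this illustration only.

```python
def line(a, b, x):
    """Value of y on the straight line y = a + b*x."""
    return a + b * x

a = 1  # hypothetical intercept: the value of y where x = 0

# Slope = 2: increasing x by 1 increases y by 2.
change_pos = line(a, 2, 5) - line(a, 2, 4)

# Slope = -3: increasing x by 1 decreases y by 3.
change_neg = line(a, -3, 5) - line(a, -3, 4)

print(line(a, 2, 0))  # y at x = 0 is the intercept
print(change_pos, change_neg)
```

The change in y for a one-unit step in x equals the slope no matter where on the line the step is taken, which is what makes the relationship a straight line.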
Correlation and slope
- if slope is -ve, means that as x goes up, y goes down
=> corresponds to -ve correlation
- if slope is +ve, means that as x goes up, y goes up
=> corresponds to +ve correlation
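The link between the sign of the slope and the sign of 'r' can be illustrated with the sketch below. It assumes the line is fitted by the usual least-squares formula (slope = Sxy / Sxx), which these notes have not introduced yet. The data and the name `slope_and_r` are hypothetical.

```python
import math

def slope_and_r(xs, ys):
    """Least-squares slope and correlation r for two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx                 # assumed least-squares slope
    r = sxy / math.sqrt(sxx * syy)    # correlation coefficient
    return slope, r

x = [1, 2, 3, 4, 5]
y_up = [2.1, 3.9, 6.2, 7.8, 10.1]    # made-up data: y rises with x
y_down = [9.8, 8.1, 6.0, 4.2, 1.9]   # made-up data: y falls as x rises

slope_up, r_up = slope_and_r(x, y_up)
slope_down, r_down = slope_and_r(x, y_down)

print(slope_up, r_up)
print(slope_down, r_down)
```

Under this assumption the slope and 'r' share the same numerator (Sxy) and have positive denominators, so they always have the same sign.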
More about straight lines
- goal is to find a straight line given by an intercept and slope that best describes data
- it's only a model, in the sense that it won't be perfectly accurate, but MAY be useful in that it
gives us a relationship b/ween the 2 var's
- pts may not be perfectly on the line, but the line of best fit is the one that is the
most reasonably close to them
RESIDUALS = how far off the line an observation (ie. pt of data) is
- residual value (this is a y-value on a separate plot, the residual plot) = actual value - predicted value
(ex) Drug abuse example
- our model is ŷ = -3 + 0.5x
- we use ŷ because this stands for predicted value
- The point for England (ie. the % of marijuana use (x), and the % of other drug use (y) for
England) was found to be x = 40, y = 21
=> marijuana usage for England was 40% (x = 40), other drug usage was 21% (y = 21)
- we use "y" here because this stands for actual value
- Sub in x = 40 to see what y-val. model outputs for this
=> ŷ = -3 + 0.5(40) = 17
- Then, the residual is:
=> residual = y - ŷ = 21 - 17 = 4
=> our prediction was 4 units below the actual value.
=> the other drug use f
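The England calculation can be sketched in Python as follows. The model ŷ = -3 + 0.5x and the point (x = 40, y = 21) come from the example above; the function name `predict` is an assumption made for this illustration, and the residual is taken as actual minus predicted (the standard convention).

```python
def predict(x):
    """Predicted % of other drug use from the model y-hat = -3 + 0.5x."""
    return -3 + 0.5 * x

x_england = 40   # % marijuana use (from the example)
y_england = 21   # actual % other drug use (from the example)

y_hat = predict(x_england)       # predicted value
residual = y_england - y_hat     # actual minus predicted

print(y_hat, residual)
```

A positive residual means the actual value sits above the line, ie. the model predicted too low for that point.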