Chapter 2- Looking at Data- Relationships
-associated: term used to describe the relationship between two variables ex. breed and life
What individuals or cases do the data describe?
What variables are present? How are they measured?
Which variables are quantitative and which are categorical?
Ex. on page 85
-response variable: measures an outcome of a study
-explanatory variable: explains or causes changes in the response variable. Many
explanatory-response relationships do not involve direct causation. Ex. SAT scores of high school
students help predict future college grades, but high SAT scores don't CAUSE high college grades.
-independent variables: another name for explanatory variables
-dependent variables: another name for response variables
Response variables depend on explanatory variables
-scatterplots: show the relationship between two quantitative variables measured on the
same individuals
-explanatory variable (if there is one) goes on the x axis and is called x. (If there is no
explanatory variable, either variable can go on either axis.)
Response variable goes on the y axis and is called y.
-Interpreting scatterplots :
Look for the overall pattern and for deviations from that pattern, ex. outliers: an outlier falls
outside the overall pattern of the relationship.
Describe the overall pattern by the form, direction, and strength of the relationship.
-form: ex. clusters
Pg. 87 fig. 2.1 has two clusters
Clusters: groups of points on the graph. They suggest that the data describe several distinct
kinds of individuals.
-positively associated: two variables are positively associated when above average values of one
tend to accompany above average values of the other, and below average values also tend to
occur together
-negatively associated: two variables are negatively associated when above average values of one
tend to accompany below average values of the other, and vice versa.
-linear relationship: points roughly follow a straight line
Strength of relationship: determined by how closely the points follow a clear form
-to add a categorical variable to a scatterplot, use a different plot colour or symbol for each
category
-smoothing: systematic methods of extracting the overall pattern of a scatterplot. They use
resistant calculations, so they are not affected by outliers in the plot.
-to display a relationship between a categorical explanatory variable and a quantitative response
variable, make a side by side comparison of the distributions of the response for each category.
-data analysis: supplement the graph with a numerical measure (since our eyes are not good
judges of how strong a relationship is)
-correlation r: r is positive when there is a positive association between the variables.
Ex. height and weight have a positive correlation.
Correlation: measures the direction and strength of the linear relationship between two
quantitative variables. It is usually written as r.
Ex. suppose we have data on variables x and y for n individuals. The means and standard
deviations of the two variables are x̄ and s_x for the x values and ȳ and s_y for the y values.
The correlation r between x and y is
\[ r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]
∑ means: add these terms for all the individuals
This formula helps us see what correlation is but is not convenient for actually calculating r.
The formula starts by standardizing the observations.
-ex. x̄ is the mean and s_x is the standard deviation of the n heights, both in centimeters. The
value (x_i - x̄) / s_x
is the standardized height of the ith person. The standardized height says how many SDs above or
below the mean a person's height lies. Standardized values have no units; they are no longer
measured in centimeters. The correlation r is an average of the products of the standardized
height and the standardized weight for the n people.
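A minimal Python sketch of this calculation, standardizing each observation and then averaging the products (dividing by n - 1). The function name `correlation` and the height and weight numbers are made up for illustration; they are not from the textbook.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)  # standardized x times standardized y
        for x, y in zip(xs, ys)
    ]
    return sum(products) / (n - 1)

# Hypothetical heights (cm) and weights (kg), for illustration only
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 88]

r = correlation(heights, weights)
print(round(r, 3))  # positive and close to 1 for this made-up data
```

Because taller people in this made-up data are also heavier, most products of standardized values are positive, which is why r comes out positive.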
-properties of correlation:
•Doesn't make a difference which variable you call x and which you call y when calculating the
correlation
•Requires that both variables be quantitative, so that it makes sense to do the arithmetic
indicated by the formula for r. Ex. a correlation can't be calculated for city because it's
categorical.
•Because r uses the standardized values of the observations, r does not change when we
change the units of measurement of x, y, or both. Ex. for weight and height, converting
cm -> inches or kg -> lbs doesn't change the correlation between weight and height.
Correlation r has no unit of measurement.
•Positive r indicates positive association between the variables and negative r indicates
negative association
•Correlation r is always a number between -1 and 1. Values of r near 0 mean a very weak
linear relationship. Strength of relationship increases as r moves away from 0 toward
either -1 or 1. Values of r close to -1 or 1 mean that the points lie close to a straight line.
The extreme values r = -1 and r = 1 occur only when the points in a scatterplot lie exactly
along a straight line.
•Measures the strength of only the linear relationship between two variables. Correlation
does not describe curved relationships between variables, no matter how strong they are.
•Like the mean and SD, the correlation is not resistant: r is strongly affected by a few
outlying observations. Use r with caution when outliers appear in the scatterplot.
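A short Python sketch that checks several of these properties on made-up numbers: swapping x and y, changing units, a curved relationship, and a single outlier. The helper `correlation` and all data values are assumptions for illustration, not from the textbook.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r from standardized values (divide by n - 1)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, for illustration only
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 88]
r = correlation(heights_cm, weights_kg)

# Swapping x and y does not change r
r_swapped = correlation(weights_kg, heights_cm)

# Changing units (cm -> inches, kg -> lbs) does not change r
heights_in = [h / 2.54 for h in heights_cm]
weights_lb = [w * 2.2046 for w in weights_kg]
r_units = correlation(heights_in, weights_lb)

# A strong but curved relationship (y = x squared) can still give r = 0
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
r_curved = correlation(xs, ys)

# One outlying observation (tall but very light) pulls r far down
r_outlier = correlation(heights_cm + [200], weights_kg + [40])

print(r, r_swapped, r_units, r_curved, r_outlier)
```

The last value shows why r is not resistant: adding a single unusual point changes a correlation near 1 into a weak one.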
2.3- Least Squares Regression
-Correlation: measures the direction and strength of the linear (straight line) relationship
between two quantitative variables
-Regression line: summarizes the relationship between two variables, but only in a specific
setting: when one of the variables helps explain or predict the other. A regression line is a
straight line that describes how a response variable (y) changes as an explanatory variable (x)
changes. We usually use it to predict the value of y for a given value of x. Unlike correlation,
regression requires an explanatory variable and a response variable.
The accuracy of the predictions depends on how much scatter about the line the data show.
-Fitting a line: fitting a line to data means drawing a line that comes as close as possible to
the points. The line gives a description of the dependence of the response variable y on the
explanatory variable x.
-Straight line equation: y = b0 + b1x
b1 is the slope: the amount by which y changes when x increases by one unit, i.e. the rate of
change in the response y as the explanatory variable x changes.
b0 is the intercept: the value of y when x = 0
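A tiny Python sketch of what the slope and intercept mean. The function name `predict` and the values of b0 and b1 are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def predict(b0, b1, x):
    """Value of y on the line y = b0 + b1*x for a given x."""
    return b0 + b1 * x

# Hypothetical intercept and slope, for illustration only
b0, b1 = 2.0, 0.5

at_zero = predict(b0, b1, 0)                           # equals the intercept b0
change = predict(b0, b1, 11) - predict(b0, b1, 10)     # equals the slope b1

print(at_zero, change)
```

Raising x by one unit changes the prediction by exactly b1, and the prediction at x = 0 is exactly b0.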
-Extrapolation: use of a regression line for prediction far outside the range of values of the
explanatory variable x used to obtain the line. Such predictions are often not accurate.
-Pg. 112 Fig. 2.12: error = observed gain - predicted gain
-Errors are positive if the observed response lies above the line, and negative if the response
lies below the line.
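A minimal Python sketch of these errors, computed as observed response minus predicted response for each point. The function name `prediction_errors`, the data, and the line (b0 = 1, b1 = 2) are all made up for illustration and are not the textbook's figures.

```python
def prediction_errors(xs, ys, b0, b1):
    """Error for each point: observed y minus the y predicted by the line b0 + b1*x."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Hypothetical data and line, for illustration only
xs = [1, 2, 3, 4]
ys = [3.0, 4.5, 7.5, 9.0]
b0, b1 = 1.0, 2.0  # the line predicts 3, 5, 7, 9

errors = prediction_errors(xs, ys, b0, b1)
print(errors)
```

The second point lies below the line, so its error is negative; the third lies above the line, so its error is positive; points exactly on the line have error 0.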
-GOOD regression line: makes the vertical distances of the data points from the line as
small as possible, which is the least squares idea. Ex. pg. 113 Fig. 2.13
-Least Squares Regression Line (of y on x): line that makes the sum of the squares of
the vertical distances of the data points from the line as small as possible
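A Python sketch of the least squares idea, under some assumptions. The slope and intercept formulas used here (slope = sum of (x - x̄)(y - ȳ) divided by sum of (x - x̄)², intercept = ȳ - slope × x̄) are the standard closed-form results and do not appear in these notes. The function names and the data are made up for illustration.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Intercept b0 and slope b1 of the least squares line of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # standard closed-form slope (assumption: not from the notes)
    b0 = y_bar - b1 * x_bar   # the line passes through the point (x_bar, y_bar)
    return b0, b1

def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared vertical distances of the points from the line b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares_line(xs, ys)
best = sum_squared_errors(xs, ys, b0, b1)

# Nudging the intercept or slope in any direction makes the sum of squares larger
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]
nudged = [sum_squared_errors(xs, ys, b0 + d0, b1 + d1) for d0, d1 in nudges]

print(b0, b1, best, nudged)
```

Comparing `best` with the values in `nudged` is a quick check of the definition: no nearby line has a smaller sum of squared vertical distances.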