Chapter 2- Looking at Data- Relationships

-associated: term used to describe a relationship between two variables, ex. breed and life span

-examining relationships:

What individuals or cases do the data describe?

What variables are present? How are they measured?

Which variables are quantitative and which are categorical?

Ex. on page 85

-response variable: measures an outcome of a study

-explanatory variable: explains or causes changes in the response variable. Many explanatory-response relationships do not involve direct causation. Ex. SAT scores of high school students help predict future college grades, but high SAT scores don’t CAUSE high college grades.

-independent variables: another name for explanatory variables

-dependent variables: another name for response variables

Response variables depend on explanatory variables

2.1- Scatterplots

-scatterplots: show the relationship between two quantitative variables measured on the same individuals.

-explanatory variable (if there is one) goes on the x axis and is called x. (If there is no explanatory variable, either variable can go on either axis.)

Response variable goes on the y axis and is called y.

-Interpreting scatterplots:

Look for the overall pattern and for deviations from that pattern. Ex. an outlier is an individual value that falls outside the overall pattern of the relationship.

Describe overall pattern by the form, direction, and strength of the relationship.

-form: ex. clusters

Pg. 87 fig. 2.1 has two clusters

Clusters: groups of points on the graph. They suggest that the data describe several distinct kinds of individuals.

-positively associated: two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together

-negatively associated: two variables are negatively associated when above-average values of one tend to accompany below-average values of the other, and vice versa
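The two definitions above can be checked by hand on a small data set. The sketch below is only an illustration: it counts how many individuals fall on the same side of both means versus opposite sides. The function name `association_direction` and the heights and weights are made up for this example, and the counting rule is a rough check, not a standard statistic.

```python
from statistics import mean

def association_direction(xs, ys):
    """Count pairs on the same side vs. opposite sides of the two means."""
    x_bar, y_bar = mean(xs), mean(ys)
    same = opposite = 0
    for x, y in zip(xs, ys):
        product = (x - x_bar) * (y - y_bar)
        if product > 0:
            same += 1        # both above average, or both below average
        elif product < 0:
            opposite += 1    # one above average, the other below average
    if same > opposite:
        return "positive"
    if opposite > same:
        return "negative"
    return "unclear"

# Made-up heights (cm) and weights (kg), for illustration only.
heights = [150, 160, 165, 172, 180, 188]
weights = [52, 58, 63, 70, 79, 90]
print(association_direction(heights, weights))
```

Reversing one of the lists pairs tall heights with low weights, so the same function reports a negative association.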

-linear relationship: points roughly follow a straight line

Strength of relationship: determined by how closely the points follow a clear form

-to add a categorical variable to a scatterplot, use a different plot colour or symbol for each category

-smoothing: systematic methods for extracting the overall pattern of a scatterplot. They use resistant calculations, so they are not affected by outliers in the plot.

-to display a relationship between a categorical explanatory variable and a quantitative response variable, make a side-by-side comparison of the distributions of the response for each category.
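A side-by-side comparison can also be done numerically before drawing anything. The sketch below groups a quantitative response by category and reports the minimum, median, and maximum for each group. The function name `summarize_by_category` and the (breed size, life span) pairs are made up for this example; a plot such as side-by-side boxplots would show the same comparison graphically.

```python
from collections import defaultdict
from statistics import median

def summarize_by_category(pairs):
    """Group (category, response) pairs; return (min, median, max) per category."""
    groups = defaultdict(list)
    for category, response in pairs:
        groups[category].append(response)
    return {
        category: (min(values), median(values), max(values))
        for category, values in groups.items()
    }

# Made-up (breed size, life span in years) pairs, for illustration only.
data = [("small", 14), ("small", 15), ("small", 13),
        ("large", 9), ("large", 10), ("large", 11)]
for category, summary in summarize_by_category(data).items():
    print(category, summary)
```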

2.2- Correlation

-correlation is a numerical measure used to supplement the graph (since our eyes are not good judges of how strong a relationship is)

-correlation r: r is positive when there is a positive association between the variables. Ex. height and weight have a positive correlation.

Correlation: measures the direction and strength of the linear relationship between two quantitative variables. It is usually written as r.

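Ex. suppose we have data on variables x and y for n individuals. The means and standard deviations of the two variables are $\bar{x}$ and $s_x$ for the x values and $\bar{y}$ and $s_y$ for the y values. The correlation r between x and y is

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$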

∑ means: add these terms for all the individuals

This formula helps us see what correlation is but is not convenient for actually calculating r.

The formula starts by standardizing the observations.

-ex. $\bar{x}$ is the mean and $s_x$ is the standard deviation of the n heights, both in centimeters. The value $(x_i - \bar{x})/s_x$ is the standardized height of the ith person. The standardized height says how many SDs above or below the mean a person’s height lies. Standardized values have no units; they are no longer measured in centimeters. The correlation r is an average of the products of the standardized height and the standardized weight for the n people.
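The steps described above (standardize each variable, multiply the standardized values, add, divide by n - 1) can be sketched directly in code. This is a minimal illustration using only the Python standard library. The function name `correlation` and the tiny data set are made up for this example.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r as the sum of products of standardized values over n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)          # sample SDs (divide by n - 1)
    zx = [(x - x_bar) / s_x for x in xs]     # standardized x values, no units
    zy = [(y - y_bar) / s_y for y in ys]     # standardized y values, no units
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Made-up data where y is exactly 2x, so the points lie on a straight line.
print(correlation([1, 2, 3], [2, 4, 6]))
```

Because the points lie exactly on a line with positive slope, r comes out as 1 (up to floating-point rounding).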

-properties of correlation:

Correlation has the following properties:

•It makes no difference which variable you call x and which you call y when calculating the correlation

•Requires that both variables be quantitative, so that it makes sense to do the arithmetic indicated by the formula for r. Ex. we can’t calculate a correlation involving city, because city is categorical.

•Because r uses the standardized values of the observations, r does not change when we change the units of measurement of x, y, or both. Ex. for height and weight, converting cm -> inches or kg -> lbs doesn’t change the correlation between weight and height. Correlation r has no unit of measurement.

•Positive r indicates positive association between the variables and negative r indicates negative association

•Correlation r is always a number between -1 and 1. Values of r near 0 indicate a very weak linear relationship. The strength of the relationship increases as r moves away from 0 toward either -1 or 1. Values of r close to -1 or 1 indicate that the points lie close to a straight line. The extreme values r = -1 and r = 1 occur only when the points in a scatterplot lie exactly along a straight line.

•Measures the strength of only the linear relationship between two variables. Correlation does not describe curved relationships between variables, no matter how strong they are.

•Like the mean and SD, the correlation is not resistant: r is strongly affected by a few outlying observations. Use r with caution when outliers appear in the scatterplot.
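Several of these properties can be checked numerically. The sketch below is only an illustration on made-up heights and weights. The helper `correlation` and all variable names are assumptions for this example, not something from the textbook.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r from standardized values (same formula as in the notes)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Made-up heights (cm) and weights (kg), for illustration only.
heights_cm = [150, 160, 165, 172, 180, 188]
weights_kg = [52, 58, 63, 70, 79, 90]
r = correlation(heights_cm, weights_kg)

# Swapping x and y does not change r.
r_swapped = correlation(weights_kg, heights_cm)

# Changing units (cm -> inches, kg -> lbs) does not change r.
heights_in = [h / 2.54 for h in heights_cm]
weights_lb = [w * 2.2046 for w in weights_kg]
r_units = correlation(heights_in, weights_lb)

# A strong curved relationship (y = x squared) can still give r near 0.
xs = [-2, -1, 0, 1, 2]
r_curved = correlation(xs, [x * x for x in xs])

# r is not resistant: one made-up outlier (short but very heavy) pulls r down.
r_outlier = correlation(heights_cm + [150], weights_kg + [120])

print(r, r_swapped, r_units, r_curved, r_outlier)
```

The first three values agree up to rounding, the curved example gives r essentially equal to 0, and the single outlier lowers r sharply.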

2.3- Least Squares Regression

-Correlation: measures the direction and strength of the linear (straight line) relationship between two quantitative variables

-Regression line: summarizes the relationship between two variables, but only in a specific setting: when one of the variables helps explain or predict the other. A regression line is a straight line that describes how a response variable (y) changes as an explanatory variable (x) changes. Unlike correlation, regression requires an explanatory variable and a response variable. We often use a regression line to predict the value of y for a given value of x. The accuracy of the predictions depends on how much scatter about the line the data show.

-Fitting a line: fitting a line to data means drawing a line that comes as close as possible to the points. The line gives a description of the dependence of the response variable y on the explanatory variable x.

-Straight line equation: y = b0 + b1x

b1 is the slope: the amount by which y changes when x increases by one unit, i.e. the rate of change in the response y as the explanatory variable x changes.

b0 is the intercept: the value of y when x = 0

-Extrapolation: use of a regression line for prediction far outside the range of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate.
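Prediction from a line y = b0 + b1x, together with a simple extrapolation check, can be sketched as below. The function name `predict`, the coefficients, and the x range are all made up for this example. Note that the check flags any x outside the observed range, which is stricter than the "far outside" wording in the notes.

```python
def predict(b0, b1, x, x_min, x_max):
    """Return the predicted y and whether x lies outside the range used to fit the line."""
    y_hat = b0 + b1 * x
    is_extrapolation = x < x_min or x > x_max
    return y_hat, is_extrapolation

# Hypothetical line y = 2.0 + 0.5x, fitted on x values between 0 and 10.
print(predict(2.0, 0.5, 4, 0, 10))    # x inside the observed range
print(predict(2.0, 0.5, 40, 0, 10))   # x well outside the range: treat with caution
```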

-Pg. 112 Fig. 2.12: error = observed gain - predicted gain

-Errors are positive if the observed response lies above the line, and negative if the response lies below the line.
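The sign rule for errors follows directly from error = observed - predicted. A minimal sketch, assuming a hypothetical line and made-up data (the function name `prediction_errors` is also just for this example):

```python
def prediction_errors(xs, ys, b0, b1):
    """Error for each point: observed y minus the y predicted by the line."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Hypothetical line y = 2.0 + 0.5x and made-up observations.
xs = [1, 2, 3]
ys = [3.0, 2.5, 4.0]
print(prediction_errors(xs, ys, 2.0, 0.5))
```

The first and third points lie above the line, so their errors are positive. The second point lies below the line, so its error is negative.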

-GOOD regression line: makes the vertical distances of the data points from the line as small as possible; this is the least-squares idea. Ex. pg. 113 Fig. 2.13

-Least Squares Regression Line (of y on x): the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible
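The least-squares idea can be tried out on a small data set. The sketch below uses the standard closed-form solution for the slope and intercept, which is not stated in these notes and is included here as an assumption for illustration. The function names and the data are made up.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (b0, b1) minimizing the sum of squared vertical distances."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx               # slope
    b0 = y_bar - b1 * x_bar      # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared vertical distances of the points from the line y = b0 + b1x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data, for illustration only.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]
b0, b1 = least_squares_line(xs, ys)
best = sum_squared_errors(xs, ys, b0, b1)

# Any other line, e.g. one with a slightly different slope, does worse.
nudged = sum_squared_errors(xs, ys, b0, b1 + 0.1)
print(b0, b1, best, nudged)
```

Comparing `best` with `nudged` shows the defining property: changing the fitted line increases the sum of squared errors.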