# Chapter 9.docx


Western University

Statistical Sciences

Statistical Sciences 2244A/B

Jennifer Waugh

Spring

9.1 Overview
● This chapter introduces important methods for making inferences based on sample data
that come in pairs
● The objective is to determine whether there is an association between two variables
and, if such an association exists, to describe it with an equation that can be used
for predictions
9.2 Correlation
● The main objective of this section is to analyze a collection of paired sample data and
determine whether there appears to be an association between the two variables
○ we refer to such an association as a correlation
Part 1: Basic Concepts of Correlation
● A correlation exists between two variables when one of them is related to the other in
some way
Exploring the Data
● We should always begin an investigation into the association between two variables by
constructing a graph called a scatterplot or scatter diagram
● A scatterplot is a graph in which the paired (x,y) sample data are plotted with a horizontal
x-axis and a vertical y-axis; each individual (x,y) pair is plotted as a single point
● When examining a scatterplot, we should study the overall pattern of the plotted points;
note its direction and if there are any outliers
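The exploration step above can be sketched in code. This is a minimal example using matplotlib (assumed to be available); the data values are hypothetical, not from the textbook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display window needed
import matplotlib.pyplot as plt

# Hypothetical paired (x, y) sample data
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, 5.9, 8.3, 9.8, 12.2]

fig, ax = plt.subplots()
ax.scatter(x, y)                  # each (x, y) pair is plotted as a single point
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Scatterplot of paired sample data")
fig.savefig("scatterplot.png")    # inspect for direction, overall pattern, outliers
```

Inspecting the saved figure shows the overall direction of the pattern and any outliers before any numerical measure is computed.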
Linear Correlation Coefficient
● Because visual examinations of scatterplots are largely subjective, we need more
objective measures; we use the linear correlation coefficient r, which is useful for
detecting straight-line patterns
● The linear correlation coefficient r measures the strength of the linear association
between the paired x and y quantitative values in a sample
● Because the linear correlation coefficient r is calculated using sample data, it is a sample
statistic used to measure the strength of the linear correlation between x and y
○ if we had every pair of population values for x and y, the result of the linear
correlation coefficient would be a population parameter, ρ
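The definition of r as a sample statistic can be made concrete with a short sketch. The helper function and data below are hypothetical illustrations, not the textbook's own example:

```python
import math

# Hypothetical paired sample data (illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def linear_correlation(x, y):
    """Sample linear correlation coefficient r."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # numerator: sum of products of deviations from the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

r = linear_correlation(x, y)
print(round(r, 3))  # rounded to three decimal places for comparison with Table A-6
```

The same computation carried out over every (x, y) pair in the population would yield the parameter ρ instead of the statistic r.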
Rounding the Linear Correlation Coefficient
● Round the linear correlation coefficient r to three decimal places so that its value can be
directly compared to critical values in Table A-6
Interpreting the Linear Correlation Coefficient
● Given the way that the formula for calculating r is constructed, the value of r must
always fall between -1 and +1 inclusive
○ if r is close to 0, we conclude there is no significant linear correlation between x
and y
○ if r is close to -1 or +1, we conclude that there is a significant linear correlation
between x and y
● When there really is no linear correlation between x and y, table A-6 lists values that are
critical in this sense: they separate usual values of r from those that are unusual
● Properties of the Linear Correlation Coefficient r
○ The value of r is always between -1 and +1 inclusive
○ The value of r does not change if all values of either variable are converted to a
different scale
○ The value of r is not affected by the choice of x or y; interchange all x and y
values and the value of r will not change
○ r measures the strength of a linear association; not designed to measure strength
of an association that is not linear
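Two of the properties above, invariance under rescaling and under interchanging x and y, can be checked directly. A minimal sketch with hypothetical data:

```python
import math

def r_of(x, y):
    """Sample linear correlation coefficient r."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    sxx = sum((a - xb) ** 2 for a in x)
    syy = sum((b - yb) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]               # hypothetical data
y = [2.0, 4.1, 5.9, 8.2, 9.8]

r = r_of(x, y)
r_scaled = r_of([xi * 2.54 for xi in x], y)   # rescale x (say, inches to cm): r unchanged
r_swapped = r_of(y, x)                        # interchange x and y: r unchanged
assert math.isclose(r, r_scaled) and math.isclose(r, r_swapped)
```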
Interpreting r: Explained Variation
● If we conclude that there is a significant linear correlation between x and y, we can find a
linear equation that expresses y in terms of x and that equation can be used to predict
values of y for given values of x
● In Section 9-3 we will describe a procedure for finding such equations and how to
predict values of y when given x
● However, a predicted value of y will not necessarily be the exact result because in
addition to x, there are other factors affecting y such as random variation and other
characteristics not included in the study
Common Errors Involving Correlation
● A common source of error involves concluding that correlation implies causation
○ a lurking variable is one that affects the variables being studied but is not included
in the study
● Another source of error arises with data based on averages
○ averages suppress individual variation and may inflate the correlation coefficient
● A third source of error involves the property of linearity
○ an association may exist between x and y even when there is no significant linear
correlation
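The third error is easy to demonstrate: a perfect nonlinear association can produce r = 0. A small sketch with hypothetical data:

```python
import math

def r_of(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - xb) ** 2 for a in x)
                           * sum((b - yb) ** 2 for b in y))

# A perfect quadratic association: y = x^2 on symmetric x values
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]   # [4, 1, 0, 1, 4]
print(r_of(x, y))           # 0.0: no linear correlation despite a perfect association
```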
Part 2: Beyond the Basic Concepts of Correlation
Formal Hypothesis Test
● We present two methods for using a formal hypothesis test to determine whether there is
a significant linear correlation between two variables
● Method 1 uses the Student t distribution with a test statistic of the form t = r/s_r,
where s_r denotes the sample standard deviation of r values; here
s_r = sqrt((1 - r^2)/(n - 2)), and the test statistic has n - 2 degrees of freedom
● Generally the hypothesis tests in this section will involve two tailed tests where the null
and alternative hypothesis are as follows:
○ H0: ρ = 0
○ H1: ρ ≠ 0
● However, often we will see one tailed tests with a claim of a positive linear correlation or
a claim of a negative linear correlation:
○ H0: ρ = 0
○ H1: ρ < 0 or H1: ρ > 0
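Method 1 can be sketched as follows. The data are hypothetical, and the critical value 3.182 is the standard two-tailed t value for α = 0.05 with 3 degrees of freedom:

```python
import math

# Hypothetical paired data; H0: ρ = 0 vs H1: ρ ≠ 0
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# r as defined in Section 9-2 (inlined so the sketch is self-contained)
xb, yb = sum(x) / n, sum(y) / n
sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
r = sxy / math.sqrt(sum((a - xb) ** 2 for a in x)
                    * sum((b - yb) ** 2 for b in y))

# Test statistic t = r / s_r, with s_r = sqrt((1 - r^2) / (n - 2))
s_r = math.sqrt((1 - r ** 2) / (n - 2))
t = r / s_r

# Two-tailed critical value for alpha = 0.05 with n - 2 = 3 degrees of freedom
t_crit = 3.182
print(abs(t) > t_crit)  # True: reject H0, conclude a significant linear correlation
```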
● Given a collection of paired (x,y) data, the point (x(bar), y(bar)) is called the centroid
● The statistic r is based on the sum of the products (x - x(bar))(y - y(bar))
● In any scatterplot, vertical and horizontal lines through the centroid (x(bar), y(bar)) divide
the diagram into four quadrants
● If the points of the scatterplot tend to approximate an uphill line, individual values of
the product (x - x(bar))(y - y(bar)) tend to be positive as most of the points are found
in the first and third quadrants
● If the points of the scatterplot approximate a downhill line, most of the points are in
the second and fourth quadrants where (x - x(bar)) and (y - y(bar)) are opposite in
sign, thus the sum of the products (x - x(bar))(y - y(bar)) is negative
● Points that follow a nonlinear pattern tend to be scattered among the four quadrants, so
the value of the sum of (x - x(bar))(y - y(bar)) tends to be close to 0
● Therefore, we can use the sum of (x - x(bar))(y - y(bar)) as a measure of how the points
are arranged
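The quadrant argument above can be verified numerically: the sum of products is positive for an uphill pattern, negative for a downhill one, and near 0 for a symmetric nonlinear one. A sketch with hypothetical data:

```python
def dev_product_sum(x, y):
    """Sum of (x - x(bar))(y - y(bar)) over the paired data."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    return sum((a - xb) * (b - yb) for a, b in zip(x, y))

x = [1, 2, 3, 4, 5]            # hypothetical data
uphill = [1, 2, 3, 4, 5]       # rising pattern: points in first and third quadrants
downhill = [5, 4, 3, 2, 1]     # falling pattern: points in second and fourth quadrants
curved = [4, 1, 0, 1, 4]       # symmetric U-shape: points spread over all quadrants

print(dev_product_sum(x, uphill))    # positive (10.0)
print(dev_product_sum(x, downhill))  # negative (-10.0)
print(dev_product_sum(x, curved))    # close to 0 (0.0)
```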
Confidence Intervals
● In preceding chapters we discussed methods of inferential statistics by addressing
methods of hypothesis testing and methods for constructing confidence interval estimates
● A similar procedure may be used to construct confidence intervals for ρ; however, it
involves complicated transformations, so we do not pursue it here
9.3 Regression
● In Section 9-2 we analyzed paired data with the goal of determining whether there is a
significant linear correlation between two variables
● The main objective of this section is to describe the association between two variables by
finding the graph and equation of the straight line that represents the association
● This straight line is called the regression line and its equation is called the regression
equation
Part 1 Basic Concepts of Regression
● The regression equation expresses an association between x (called the independent
variable, or predictor variable, or explanatory variable) and y(hat) (called the dependent
variable, or response variable)
● The typical equation of a straight line y = mx + b is expressed in the form
y(hat) = b0 + b1x
○ where b0 is the y-intercept and b1 is the slope
○ the given notation shows that b0 and b1 are sample statistics used to estimate the
population parameters β0 and β1
● We will use paired sample data to estimate the regression equation
● Once we have evaluated b0 and b1, we can identify the estimated regression equation,
y(hat) = b0 + b1x
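Under the usual least-squares criterion, the estimates can be computed as b1 = Σ(x - x(bar))(y - y(bar)) / Σ(x - x(bar))² and b0 = y(bar) - b1·x(bar). A minimal sketch with hypothetical data:

```python
# Hypothetical paired data (illustration only)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

xb, yb = sum(x) / n, sum(y) / n
sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
sxx = sum((a - xb) ** 2 for a in x)

# Least-squares estimates for y(hat) = b0 + b1*x
b1 = sxy / sxx        # slope
b0 = yb - b1 * xb     # y-intercept; the fitted line passes through the centroid

def y_hat(xv):
    """Predicted value of y for a given x."""
    return b0 + b1 * xv

print(round(b1, 3), round(b0, 3))
```

For these data the fitted line is roughly y(hat) = 0.05 + 1.99x, and predicting at x = x(bar) returns y(bar), since the least-squares line always passes through the centroid.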