University of Guelph

# Soan 3120 Chapter Notes.docx

Sociology and Anthropology

SOAN 3120

Michelle Dumas

Fall

Soan 3120 Chapter Notes
Chapter 1: Picturing Distributions with Graphs
Individuals and Variables
Individuals: the objects described by a set of data; may be people, animals or things
Variables: any characteristic of an individual. A variable can take different values for different individuals
Categorical Variable: places an individual into one of several groups or categories
Ex. sex, major
Quantitative Variable: takes numerical values for which arithmetic operations make sense.
These values are usually recorded in a unit of measurement
Ex. height in inches, age in years
Categorical Variables: Pie Charts and Bar Graphs
Exploratory Data Analysis: examining data to describe its main features
• Begin by examining each variable by itself, then study the relationship among the variables
• Begin with a graph(s) and add numerical summaries of specific aspects of the data
To examine a single variable, display its distribution
Distribution: tells us what values the variable takes and how often it takes these values
The Distribution of a categorical variable lists the categories and gives the count/percent of individuals who fall
in each category
Roundoff Error: when the percentages in a distribution don’t equal 100% because they were rounded. This error
doesn’t actually point to a mistake, just the effect of rounding.
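A minimal sketch of how roundoff error arises, assuming a made-up list of three majors (the data and the helper name `percent_distribution` are illustrative, not from the course):

```python
from collections import Counter

def percent_distribution(values, digits=1):
    """Percent of individuals in each category, rounded to `digits` decimals."""
    counts = Counter(values)
    n = len(values)
    return {cat: round(100 * count / n, digits) for cat, count in counts.items()}

# Hypothetical data: one individual in each of three categories.
majors = ["arts", "science", "business"]
percents = percent_distribution(majors)
total = round(sum(percents.values()), 1)
print(percents, total)  # each share is rounded to 33.3, so the total falls short of 100
```

Nothing is wrong with the counts here; the shortfall comes only from rounding each percentage.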
Pie Chart: used to display the distribution of a categorical variable
Must include all the categories to make up a whole
Used to emphasize each category’s relation to the whole
Bar Graph: represents each category as a bar
Can compare any set of quantities that are measured in the same units
Quantitative Variables: Histograms
Histogram: the most common graph of the distribution of one quantitative variable
To create a histogram:
1. Choose the Classes: divide the range of the data into classes of equal width.
2. Count the Individuals in each class
3. Draw the Histogram
Choosing too few classes will cause all values to fall into a few classes (Skyscraper)
Choosing too many classes will cause many classes to have few or no values (Pancake)
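The three steps can be sketched in code. This is an illustration under stated assumptions: the data values, the class boundaries and the function name `histogram_counts` are all made up for the example.

```python
def histogram_counts(data, low, width, n_classes):
    """Count observations in equal-width classes [low + k*width, low + (k+1)*width)."""
    counts = [0] * n_classes
    for x in data:
        k = int((x - low) // width)  # index of the class that contains x
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

# Illustrative data only.
data = [1.3, 2.4, 2.6, 3.2, 3.5, 3.6, 4.1, 4.3, 5.7]
counts = histogram_counts(data, low=1, width=1, n_classes=5)

# Draw a rough text histogram, one row per class.
for k, c in enumerate(counts):
    print(f"{1 + k} to <{2 + k}: {'*' * c}")
```

Calling `histogram_counts(data, low=1, width=5, n_classes=1)` puts every value in a single class, which is the "skyscraper" effect of choosing too few classes.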
When examining a histogram, look for the overall pattern and for striking deviations from that pattern
Use shape, center, spread to describe the pattern
Outlier: an individual value that falls outside the overall pattern
Midpoint: the value with roughly half the observations taking smaller values and half taking larger values
Spread: described by giving the smallest and largest values of a distribution
Symmetric: a distribution in which the left and right sides of the histogram are approximately mirror images
Skewed to the right: the right side of the histogram extends much farther than the left side
Skewed to the left: the left side of the histogram extends much farther than the right side
Note: the direction of skewness is the direction of the long tail, NOT the direction where more observations
are clustered (the hump)
Quantitative Variables: Stemplots
For small data sets, a stemplot is quicker to make and presents more detailed information
To make a stemplot:
1. Separate each observation into a stem (all but the final digit) and a leaf (the final digit). Stems
can have as many digits as needed, but a leaf has only 1 per observation
2. Write the stems in a vertical column, smallest to largest, and draw a line separating them from
the leaves. Include all the stems, even if they have no leaf
3. Write each leaf in the row to the right of its stem, in increasing order
Ex.
1 | 3
2 | 4 6
3 | 2 5 6
4 | 1 3
5 | 7
This would represent the data: 1.3, 2.4, 2.6, 3.2, 3.5, 3.6, 4.1, 4.3, 5.7
A stemplot looks like a histogram turned on end
Unlike a histogram, a stemplot preserves the actual value of each observation
Note: Stemplots do NOT work well for large data sets, where each stem must hold a large number of leaves
Split Stems: you can split the stems in a stemplot to double the number of stems and reduce the number of leaves
on each stem
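The stemplot procedure can be sketched as follows. The function name `stemplot` is hypothetical, and the sketch assumes non-negative values recorded to one decimal place, so that the leaf is the tenths digit.

```python
from collections import defaultdict

def stemplot(data):
    """Return stemplot rows for non-negative values with one decimal place."""
    leaves = defaultdict(list)
    for x in sorted(data):
        tenths = round(x * 10)            # e.g. 2.4 -> 24
        stem, leaf = divmod(tenths, 10)   # stem 2, leaf 4
        leaves[stem].append(leaf)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):  # include stems with no leaves
        row = f"{stem} | " + " ".join(str(leaf) for leaf in leaves[stem])
        rows.append(row.rstrip())
    return rows

data = [1.3, 2.4, 2.6, 3.2, 3.5, 3.6, 4.1, 4.3, 5.7]
print("\n".join(stemplot(data)))
```

Because the values are sorted first, the leaves on each stem come out in increasing order, and an empty stem still gets its own row.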
Time Plots
Time Plot: plots each observation against the time at which it was measured.
Always put time on the horizontal scale of your plot and the variable you are measuring on the vertical scale
Cycles: regular up and down movements
Trend: a long term upward/downward movement over time
Time Series Data: a time plot shows the change in one variable over time
Cross-Sectional Data: a histogram displays one variable for many individuals at the same point in time
Chapter 2: Describing Distributions with Numbers
Measures of Center: The Mean $\bar{x}$
To find the mean of a set of observations, add their values and divide by the number of observations:
$\bar{x} = \frac{1}{n} \sum x_i$
The mean is sensitive to the influence of a few extreme observations (outliers) as well as skewed distributions,
which pull the mean toward the tail
The mean is NOT considered a resistant measure of center
Measures of Center: The Median M
The Median is the midpoint of a distribution, the number such that half the observations are smaller and the
other half are larger. To find the median of a distribution:
Arrange all the observations in order of size, smallest to largest
If n is odd, M is the center observation in the list
If n is even, M is midway between the two center observations in the list
You can always locate M by counting up $(n+1)/2$ observations from the start of the list (this does not give M, just its position)
The median is a resistant measure of center
Comparing the Mean and the Median
Symmetric Distribution: Mean = Median
Skewed Distribution: Mean goes towards the tail
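A small illustration of resistance, using made-up numbers and Python's standard `statistics` module: replacing one value with an extreme observation moves the mean a long way but leaves the median where it was.

```python
from statistics import mean, median

# Hypothetical data, invented for illustration.
values = [2, 3, 4, 5, 6]           # symmetric: mean and median agree
with_outlier = [2, 3, 4, 5, 60]    # same data with one extreme observation

print(mean(values), median(values))
print(mean(with_outlier), median(with_outlier))  # mean is pulled toward the outlier
```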
Measuring Spread: The Quartiles
To calculate the quartiles:
Arrange observations in order and locate M
$Q_1$ = the median of the observations below M in the ordered list
$Q_3$ = the median of the observations above M in the ordered list
Five Number Summary and Box Plots
Five Number Summary of a distribution consists of:
Min, $Q_1$, M, $Q_3$, Max
Offers a reasonably complete description of the center and spread of a distribution
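A sketch of the five-number summary, assuming the quartiles are taken as the medians of the two halves of the ordered list on either side of the median's position (software packages often use slightly different quartile rules). The function name and data are illustrative.

```python
from statistics import median

def five_number_summary(data):
    """Return (Min, Q1, M, Q3, Max), with quartiles as medians of the two halves."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                               # observations left of the median's position
    upper = xs[half + 1:] if n % 2 else xs[half:]   # observations right of it
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

# Hypothetical data with n = 7, so the median itself is excluded from both halves.
print(five_number_summary([10, 12, 15, 18, 21, 24, 30]))
```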
Box Plot: a graph of the five-number summary
A central box spans the quartiles $Q_1$ and $Q_3$
A line in the box marks the median M
Lines extend from the box to the smallest and largest observations
Best used for side by side comparisons of more than one distribution
Spotting Suspected Outliers
IQR: The Inter-Quartile Range, measures the distance between the first and third quartiles
$IQR = Q_3 - Q_1$
The 1.5×IQR Rule
Call an observation a suspected outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile
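The rule can be sketched as a small function. The data are made up, and the quartiles passed in (13.5 and 27) are the medians of the lower and upper halves of that data; the function name is hypothetical.

```python
def suspected_outliers(data, q1, q3):
    """Flag observations more than 1.5 x IQR below Q1 or above Q3."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

# Hypothetical data: Q1 = 13.5 and Q3 = 27, so IQR = 13.5 and the fences are -6.75 and 47.25.
data = [10, 12, 15, 18, 21, 24, 30, 60]
print(suspected_outliers(data, q1=13.5, q3=27))
```

Flagged values are only suspects; whether they are errors or genuine extreme observations still takes judgment.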
Measuring Spread: Standard Deviation
Measures spread by looking at how far the observations are from their mean
To Calculate: first calculate the variance $s^2$
$s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2$
The Standard deviation $s$ is the square root of the variance
$s = \sqrt{\frac{1}{n-1} \sum (x_i - \bar{x})^2}$
The Standard Deviation
Measures spread around the mean – should only be used when mean is used as the measure of center
Is always ≥ 0. s = 0 only when all the observations have the same value
Has the same units of measurement as the original observations
It is NOT resistant, a few outliers will change it
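A sketch of the calculation, with made-up data and hypothetical function names: compute the mean, average the squared deviations using n − 1, then take the square root.

```python
from math import sqrt

def variance(data):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)

def std_dev(data):
    """Sample standard deviation: square root of the variance."""
    return sqrt(variance(data))

# Hypothetical data: mean 5, squared deviations sum to 32, so the variance is 32/7.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data), std_dev(data))
print(std_dev([3, 3, 3]))  # no spread at all gives s = 0
```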
Choosing Measures of Center and Spread
Five-Number Summary OR Mean and standard deviation
Choose: Five-Number Summary: usually better than mean and standard deviation for describing a skewed
distribution or a distribution with strong outliers
Use Mean and standard deviation only when the distribution is relatively symmetric and free of outliers
Chapter 3: The Normal Distributions
Density Curves: Sometimes the overall pattern of a large number of observations is so regular that we can
describe it by a smooth curve
Areas under the curve give a good approximation of the actual distribution of the data
A Density Curve:
Is always on or above the horizontal axis
Has an area of exactly 1 underneath it
Describes only the overall pattern; outliers are not described by the curve
Describing Density Curves
The median is the equal areas point, dividing the area under the curve in half
The mean is the “balance” point of the distribution
Symmetric Density Curve: Mean and Median are equal
Skewed: Mean is pulled away from the median, towards the tail
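Because the density curve is just an idealized description of a distribution, its mean and standard deviation have
to be distinguished from those actually computed from the real data
The mean of a density curve is written μ (instead of $\bar{x}$) and its standard deviation is written σ (instead of s)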
Normal Distributions
A normal curve describes a normal distribution
All normal curves have the same overall shape
Symmetric
Single-peaked
Bell-shaped
Any Normal curve is completely described by giving its mean μ and standard deviation σ
The mean and median are both at the center.
Changing μ without changing σ moves the curve along the horizontal axis without changing the spread
σ changes the spread of the curve
You can usually measure σ by eye on a normal curve by finding the points at which the change of curvature
takes place
The 68-95-99.7 Rule:
Approximately 68% of observations fall within 1σ of μ
Approximately 95% of observations fall within 2σ of μ
Approximately 99.7% of observations fall within 3σ of μ
Example:
Given a normal distribution with μ = 6.84 and σ = 1.55, find the range that contains the middle 95% of observations:
95% of observations fall within 2σ of μ, so calculate:
μ − 2σ = 6.84 − (2)(1.55) = 6.84 − 3.10 = 3.74
μ + 2σ = 6.84 + (2)(1.55) = 6.84 + 3.10 = 9.94
Thus, 95% of observations fall between 3.74 and 9.94; 2.5% of observations fall above 9.94 and 2.5% fall below 3.74.
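The same arithmetic as a short sketch (the function name `empirical_rule_intervals` is made up for this example):

```python
def empirical_rule_intervals(mu, sigma):
    """Intervals that hold about 68%, 95% and 99.7% of a Normal distribution."""
    return {pct: (mu - k * sigma, mu + k * sigma)
            for k, pct in [(1, 68), (2, 95), (3, 99.7)]}

intervals = empirical_rule_intervals(6.84, 1.55)
low, high = intervals[95]
print(round(low, 2), round(high, 2))  # the 95% range from the worked example
```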
Standard Normal Distribution
All normal distributions are the same if we measure in units of size σ about the mean μ
Changing to these units is called standardizing
If x is an observation from a normal distribution N(μ, σ), the standardized value of x is:
$z = \frac{x - \mu}{\sigma}$
A standardized value is often called a z-score
A z-score tells us how many standard deviations an original observation falls from the mean, and in which
direction
Standardizing a variable that has any normal distribution produces a new variable that has the standard normal
distribution
Standard Normal Distribution is the distribution N(0,1) with mean 0 and standard deviation 1
If a variable x has any Normal Distribution N(μ, σ) with mean μ and standard deviation σ, then
the standardized variable
$z = \frac{x - \mu}{\sigma}$
has the standard normal distribution
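Standardizing is a one-line calculation. The sketch below uses N(504, 111), the distribution from the notes' later example, with observation values chosen only for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean (the sign gives the direction)."""
    return (x - mu) / sigma

print(z_score(615, 504, 111))  # one standard deviation above the mean
print(z_score(393, 504, 111))  # one standard deviation below the mean
```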
Finding Normal Proportions
Cumulative Proportions: the cumulative proportion for a value x in a distribution is the proportion of
observations in the distribution that are less than or equal to x. This can be found using the Standard Normal
Table
Using the Standard Normal Table
To find cumulative proportions from a table, we must first standardize to express the problem in the standard
scale of z-scores
This allows us to use one table, a table of standard Normal cumulative proportions
The table entries are cumulative proportions, areas under the curve to the left of a value z
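In place of a printed table, the standard library's `statistics.NormalDist` can supply the cumulative proportion. This is a sketch, not the course's method; the function name and the example values are assumptions.

```python
from statistics import NormalDist

def cumulative_proportion(x, mu, sigma):
    """Proportion of an N(mu, sigma) distribution at or below x."""
    z = (x - mu) / sigma            # standardize first
    return NormalDist().cdf(z)      # area to the left of z under N(0, 1)

below = cumulative_proportion(615, 504, 111)   # z = 1
above = 1 - below                              # proportion above x
between = below - cumulative_proportion(393, 504, 111)  # between z = -1 and z = 1
print(round(below, 4), round(above, 4), round(between, 4))
```

Proportions above a value or between two values come from subtraction, exactly as with table entries.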
Finding a Value when given a Proportion
Use the Table backward, find the given proportion in the body of the table and then find the corresponding z-
score
Once we have a z-score, we un-standardize it back to the original x scale
Example: Given N(504, 111), we want to know what observation or higher would be in the top 10%
of the distribution
The x-value that puts an observation in the top 10% is the same as the x-value for which 90% of the
observations are to its left
Use the table, locate 0.9, the corresponding z-score is 1.28
Un-standardize this value and transform z back to the original x
If z = 1.28, that means x lies 1.28 standard deviations above the mean, therefore:
x = mean + (1.28)(standard deviation) = 504 + (1.28)(111) = 646.08
Thus, anything larger than 646.08 would be considered the top 10% of the distribution.
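The backward table lookup can be checked with `statistics.NormalDist.inv_cdf` from the standard library. This is a sketch: the software uses an unrounded z, so its answer differs slightly from the table-based 646.08.

```python
from statistics import NormalDist

# z-score with 90% of the standard Normal distribution to its left.
z = NormalDist().inv_cdf(0.90)

# Un-standardize back to the original scale of N(504, 111).
x = 504 + z * 111

# The table value z = 1.28 is this z rounded to two decimals.
x_table = 504 + 1.28 * 111
print(round(z, 4), round(x, 2), round(x_table, 2))
```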
Chapter 4: Scatterplots and Correlation
Explanatory and Response Variables
Response variable: measures the outcome of a study
Often called the dependent variable
Explanatory variable: may explain or influence changes in a response variable
Often called the independent or predictor variable
Displaying Relationships: Scatterplots
The most useful graph for displaying the relationship between two quantitative variables is a
scatterplot
Shows the relationship between two quantitative variables measured on the same individuals
Always plot the explanatory variable (x) on the horizontal axis, and the response variable (y) on
the vertical axis
Interpreting Scatterplots
First: look at the overall pattern and for any striking deviations from that pattern
Next: describe the pattern by the direction, form and strength of the relationship
An important deviation is an outlier
Positive Association: when above average values of one variable tend to accompany above
average values for the other variable
Negative Association: when above average values of one variable tend to accompany below
average values for the other variable
Adding Categorical Variables to Scatterplots
To add a categorical variable, use a different plot colour or symbol for each category
Measuring Linear Association: Correlation
The Correlation r measures the direction and strength of the linear relationship between two
quantitative variables
The equation for r is:
$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
The values for the first individual are $x_1$ and $y_1$, the values for the second individual are $x_2$ and $y_2$, and so on
The mean and standard deviation of x are $\bar{x}$ and $s_x$
The mean and standard deviation of y are $\bar{y}$ and $s_y$
The formula starts by standardizing the observations (from chapter 3)
Standardize the values of x and y by: $\frac{x_i - \bar{x}}{s_x}$ and $\frac{y_i - \bar{y}}{s_y}$
The correlation r is an average of the products of the standardized x and y for all the individuals
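A direct translation of the formula, as a sketch with made-up data: standardize each x and y, multiply the pairs, add them up and divide by n − 1.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of products of standardized x and y, divided by n - 1."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)   # sample standard deviations (n - 1 divisor)
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 4))

# Changing the units of x (say inches to centimetres) leaves r unchanged.
r_cm = correlation([2.54 * x for x in xs], ys)
```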
Facts about Correlation
It makes no distinction between explanatory and response variables
Because r uses standardized values, r does not change when we change the units of measurement
of x and y
Positive r indicates positive association, and negative r indicates negative association
The Correlation r is always a number between -1 and +1
Values close to 0 indicate a weak linear relationship, while values close to -1 and +1 indicate a
strong linear relationship
Correlation requires that both variables be quantitative
Correlation measures only the strength of linear relationships between two variables
Correlation r is strongly affected by a few outlying observations
Chapter 5: Regression
Regression Lines
A regression line is a straight line that describes how a response variable y changes as an
explanatory variable x changes
We often use a regression line to predict the value of y for a given value of x
Suppose that y is a response variable, plotted on the vertical axis, and x is an explanatory
variable, plotted on the horizontal axis. A straight line relating y to x has the equation:
y = a + bx
In this equation, b is the slope: the amount by which y changes when x increases by one unit
Slope is the rate of change in the response variable, on average, as the explanatory variable
increases
In this equation, a is the intercept: the value of y when x = 0
To plot this line on a scatterplot, use the equation to find the predicted y for 2 values of x, one
near each end of the range of x. Plot each y above its x value, and draw a line through the two
points
The Least Squares Regression Line
We need a way to draw a line that doesn’t require guessing where it should go, and a good
regression line makes the vertical distances of the points from the line as small as possible
The Least Squares Regression Line of y on x is the line that makes the sum of the squares of the
vertical distance of the data points from the line as small as possible
To get the equation for the least squares regression line:
Calculate the means $\bar{x}$ and $\bar{y}$
Calculate the standard deviations sx and sy
Calculate the correlation r
Thus, the least squares regression line is:
$\hat{y} = a + bx$
With a slope:
$b = r \frac{s_y}{s_x}$
And Intercept:
$a = \bar{y} - b\bar{x}$
It is important to remember that this line gives us a predicted value $\hat{y}$ for x, that's why there is
a hat on y
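The recipe above can be sketched in a few lines. The data are made up and the function name `least_squares_line` is hypothetical.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x, using b = r * sy / sx and a = ybar - b * xbar."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b = r * sy / sx
    a = ybar - b * xbar
    return a, b

# Hypothetical data, invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(round(a, 2), round(b, 2))

# Predicted response at the mean of x; the line passes through the point of means.
y_hat_at_mean = a + b * mean(xs)
```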
Facts about the Least Squares Regression Line
The distinction between explanatory x and response y variables is essential in regression
There is a close connection between correlation and the slope of the least squares regression line
They always have the same sign
The least squares regression line always passes through the point ($\bar{x}$, $\bar{y}$), which is the mean of
both x and y
The square of the correlation, $r^2$, is the fraction of the variation in the values of y that is
explained by the least squares regression of y on x
Residuals
A residual is the difference between an observed value of the response variable and the value
predicted by the regression line.
A residual is a prediction error that remains after we have chosen the regression line
Residual = observed y – predicted y
= $y - \hat{y}$
There is a residual for each data point, so finding the residuals means computing the
predicted response $\hat{y}$ for every x
Residuals calculated from a least squares regression line have a special property: the mean of the
least squares residuals is always 0
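A sketch of the residual calculation with made-up data. The intercept 2.2 and slope 0.6 are the least-squares values for this particular data set (slope = 6/10 from the sums of products and squares, intercept = 4 − 0.6 × 3); the function name is hypothetical.

```python
def residuals(xs, ys, a, b):
    """Observed y minus predicted y-hat = a + b*x, for every data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data; a and b are the least-squares intercept and slope for it.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
res = residuals(xs, ys, a=2.2, b=0.6)
mean_residual = sum(res) / len(res)
print([round(e, 2) for e in res], round(mean_residual, 10))
```

Plotting `res` against `xs` would give the residual plot; for any other choice of `a` and `b` the residuals generally do not average to zero.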
Residual Plots: a scatterplot of the regression residuals against the explanatory variable, this
helps us assess how well a regression line fits the data
Influential Observations
An observation is influential for a statistical calculation if removing it would markedly change
the result of the calculation
Points that are outliers in the x or y direction of a scatterplot are sometimes influential
If the outlier does not lie close to the regression line, it will be influential, but if it does, it won't
be as influential
Cautions about Correlation and Regression
Correlation and Regression lines describe only linear patterns, and these lines are NOT resistant
Ecological Correlations: correlations based on averages rather than on individuals are misleading
if they aren’t interpreted properly
Extrapolation: the use of a regression line to predict y for a value of x far outside the range of
x values used to obtain the line. Such predictions are usually not accurate
Lurking Variables: the relationship between two variables can often be understood only by taking
other variables into account
Association does not Imply Causation
A strong association between two variables is not enough to draw conclusions about cause-and-effect
Just because an explanatory variable x and a response variable y have a very strong correlation,
this is not evidence that changes in x actually cause changes in y
Chapter 6: Two-Way Tables
Two Way Table: describes two categorical variables
Marginal Distributions
The marginal distribution of one of the categorical variables in a two-way table is the distribution of values of
that variable among all individuals described in the table
Each of these distributions, taken alone, is called a marginal distribution (because they appear at the
right and bottom margins of the two-way table)
Roundoff Error: when adding percentages, it may come up sometimes as 99.9% because each percentage had
been rounded to the nearest tenth.
Each marginal distribution from a two-way table is a distribution for a single categorical variable. We can use a
bar graph or a pie chart to display it
Conditional Distributions
A conditional distribution is the distribution of values of one variable among only individuals who have a
certain value of the other variable. There is a separate conditional distribution for each value of the other
variable
It is called conditional because the distribution describes only the individuals who satisfy a condition
No single graph portrays the form of the relationship between categorical variables. No single numerical
measure summarizes the strength of the association
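A sketch of marginal and conditional distributions computed from a two-way table of counts. The table, its labels and the function names are all invented for illustration.

```python
# Hypothetical counts: rows are groups, columns are outcomes.
table = {
    "A": {"yes": 30, "no": 20},
    "B": {"yes": 10, "no": 40},
}

def row_marginal(table):
    """Distribution of the row variable among all individuals."""
    total = sum(sum(row.values()) for row in table.values())
    return {r: sum(row.values()) / total for r, row in table.items()}

def column_marginal(table):
    """Distribution of the column variable among all individuals."""
    total = sum(sum(row.values()) for row in table.values())
    columns = next(iter(table.values())).keys()
    return {c: sum(row[c] for row in table.values()) / total for c in columns}

def conditional_given_row(table, r):
    """Distribution of the column variable among individuals in row r only."""
    row_total = sum(table[r].values())
    return {c: count / row_total for c, count in table[r].items()}

print(row_marginal(table), column_marginal(table))
print(conditional_given_row(table, "A"), conditional_given_row(table, "B"))
```

Comparing the two conditional distributions (60% "yes" in group A against 20% in group B) is how the relationship between the variables is described.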
Simpson’s Paradox
An association or comparison that holds for all of several groups can reverse direction when the data are
combined to form a single group.
The lurking variable in Simpson’s Paradox is categorical. It breaks the individuals into groups, in which the true
relationship is then visible
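A small numerical illustration of the reversal, using counts invented purely for this example: option A has the higher success rate within each group, yet option B has the higher rate once the groups are combined, because most of A's trials fall in the group where success is rare.

```python
# Hypothetical (successes, trials) counts, invented purely for illustration.
results = {
    "A": {"group 1": (9, 10), "group 2": (30, 100)},
    "B": {"group 1": (80, 100), "group 2": (2, 10)},
}

def group_rate(option, group):
    """Success rate for one option within one group."""
    successes, trials = results[option][group]
    return successes / trials

def combined_rate(option):
    """Success rate for one option with the groups pooled together."""
    successes = sum(s for s, _ in results[option].values())
    trials = sum(t for _, t in results[option].values())
    return successes / trials

for group in ("group 1", "group 2"):
    print(group, group_rate("A", group), group_rate("B", group))
print("combined", round(combined_rate("A"), 3), round(combined_rate("B"), 3))
```

Here the group is the categorical lurking variable: ignoring it makes B look better even though A does better in every group.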
Chapter 7: Exploring Data, Part 1 Review
A list of the skills that should be acquired after reading chapters 1-6:
Data
Identify the individuals and variables in a set of data
Identify each variable as categorical or quantitative. Identify the units in which each quantitative variable is
measured
Identify the explanatory and response variables in situations where one variable explains or influences another
Displaying Distributions
Recognize when a pie chart can or cannot be used
Make a bar graph of the distribution of a categorical variable
Interpret pie charts and bar graphs
Make a histogram of the distribution of a quantitative variable
Make a stemplot of the distribution of a small set of observations. Round leaves or split stems as needed
Make a time plot of a quantitative variable over time
Describing Distributions (Quantitative Variable)
Look for the overall pattern and for major deviations from that pattern
Assess from a histogram or stemplot whether the shape of a distribution is symmetric, skewed, or neither, as
well as how many peaks there are
Describe the overall pattern by giving numerical measures of center and spread in addition to a verbal
description of shape
Decide which measures of center and spread are more appropriate: the mean and standard deviation, or the 5
number summary
Recognize outliers and give plausible explanations for them
Numerical Summaries of Distributions
Find the median and the quartiles for a set of observations
Find the 5-number summary and draw a boxplot, assess center, spread, symmetry and skew
Find the mean and the standard deviation for a set of observations
Understand that the median is more resistant than the mean. Recognize that skew in a distribution moves the
mean away from the median toward the tail
Know the basic properties of the standard deviation
Density Curves
Know that areas under a density curve represent proportions of all observations and that the total area under a
density curve is 1
Approximately locate the median and the mean on a density curve
Know that the mean and median both lie at the center of a symmetric density curve
Recognize the shape of Normal Curves and estimate by eye both the mean and standard deviation of such a
curve
Use the 68-95-99.7 rule and symmetry to state what percent of the observations from a normal distribution fall
between two points
Find the standardized value (z-score) of an observation. Interpret z-scores and understand that any normal
distribution becomes the standard normal N (0, 1) distribution when standardized
Given that a variable has a normal distribution with a stated mean and standard deviation, calculate the
proportion of values above, below or between stated number(s)
Given that a variable has a normal distribution, calculate the point having a stated proportion of all values above
it or below it
Scatterplots and Correlations
Make a scatterplot to display the relationship between two quantitative variables measured on the same subjects
Add a categorical variable to a scatterplot
Describe the direction, form and strength of the overall pattern, in particular, recognize positive or negative
association and linear patterns. Recognize outliers
Judge whether it is appropriate to use correlation to describe the relationship between two quantitative variables.
Find the correlation r
Know the basic properties of correlation r
Regression Lines
Understand that regression requires an explanatory variable and a response variable. Correctly identify which
variable is the explanatory variable and which is the response variable.
Explain what the slope b and intercept a mean in the equation y = a + bx
Draw a graph of a regression line when you are given its equation
Use a regression line to predict y for a given x
Find the slope and intercept of the least-squares regression line from the means and standard deviations of x and
y and r
Use r squared to describe how much of the variation in one variable can be accounted for by a straight line
relationship
Recognize outliers and potentially influential observations from a scatterplot with the regression line drawn on
it
Calculate the residuals and plot them against the explanatory variable x. Recognize that a residual plot
magnifies the pattern of the scatterplot of y vs. x
Cautions about correlation and Regression
Understand that both r and the least squares regression line can be strongly influenced by a few extreme observations
Recognize possible lurking variables that may explain the observed association between two variables
Understand that even a strong correlation does not mean that there is a cause and effect relationship
Give plausible explanations for an observed association between two variables: direct cause and effect, the
influence of lurking variables etc.
Categorical Data
From a two-way table, find the marginal distributions of both variables by obtaining the row sums and column
sums
Express any distribution in percentages by dividing the category counts by their total
Describe the relationship between two categorical variables by computing and comparing percents
Recognize Simpson’s Paradox
Questions to Do:
3.10 (pg 86)
3.28 (pg 91)
7.1 (pg 180)
7.7 – 7.10 (pg 181)
7.18 (pg 183)
7.20 – 7.23 (pg 184-185)
7.25-7.26 (pg 185-186)
7.31-7.33 (pg 188-189)
7.37 (pg 189)
7.38 (pg 190-191)
Chapter 8: Producing Data: Sampling
Population versus sample
Population: the entire group of individuals about which we want information
Sample: part of the population from which we actually collect information. We use a sample to draw
conclusions about the entire population
Sampling Design: describes exactly how to choose a sample from the population
The first step in planning a sample survey is to figure out exactly what population we want to describe, and then
figure out exactly what we want to measure
How to Sample Badly
Convenience Sample: a sample selected by taking the members of the population that are easiest to reach. Often
produces unrepresentative data
Bias: the design of a statistical study is biased if it systematically favours certain outcomes
Voluntary response sampling: consists of people who choose themselves by responding to a broad appeal. They
are biased because people who respond are likely to have strong opinions
People who take the trouble to respond to an open invitation are usuall
