Unit 1 11/10/2012 3:08:00 PM
Statistics is the scientific method for obtaining, organizing, summarizing and analyzing data.
Descriptive statistics: organizing, summarizing and presenting collected data.
Inferential statistics: making generalization from the sample to the population.
Population: the totality of units about which we want information.
Sample: a subset of the units in a population.
Individuals: objects described by a set of data.
A variable is a characteristic that varies among individuals in a population.
o Example: hair color
Quantitative variable: something that can be counted or measured for each individual and then
added, subtracted, averaged etc…
Discrete data – the variable takes isolated numbers.
Continuous data - no gaps among its numerical values; arises when observations involve measurement on a continuous scale.
Categorical variable: place individuals into one of several groups for which arithmetic operations such
as adding and averaging do not make sense.
Nominal: a label or name and the categories cannot be ordered in any sense.
Ordinal: the categories can be ordered by their relative attribute or quality.
The distribution of a variable tells us what values it takes and how often it takes these values.
Value does not have to be quantitative.
Bar chart: variable values on one axis and frequency on the other.
Pie Chart: gives us a visual representation of the relative frequency of the observed values for a categorical variable.
Histograms/Stem Plots: These are summary graphs for a single variable. They are very useful to
understand the pattern of variability in the data.
o Form of bar graph
o Variable on the x/y axes
o Displays the count of the data values falling into categories.
o Describes the distribution of the data set.
o Used only for quantitative variables.
o Is normally applied for large data sets.
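The binning step behind a histogram can be sketched in a few lines of Python. This is a minimal illustration with hypothetical quiz scores (not data from the notes), using class intervals of width 10:

```python
from collections import Counter

# Hypothetical quiz scores (illustrative only)
scores = [62, 71, 75, 75, 78, 80, 83, 85, 85, 85, 88, 90, 94]

# Bin each score into a class interval of width 10 (60-69, 70-79, ...)
bins = Counter((s // 10) * 10 for s in scores)

# Print a crude text histogram: one * per observation in the class
for low in sorted(bins):
    print(f"{low}-{low + 9}: {'*' * bins[low]}")
```

The counts in `bins` are exactly the bar heights a histogram would display for these class intervals.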
Time Plot: use when there is a meaningful sequence, such as time.
A distribution is symmetric if the right and left sides of the histogram are approximately mirror
images of each other.
A distribution is skewed to the right if the right side of the histogram (side with larger values)
extends much farther out than the left side. It is skewed to the left if the left side of the histogram
extends much farther out than the right side.
Not all distributions have a simple overall shape, especially when there are few observations.
Outliers are an important kind of deviation: observations that lie outside the overall pattern of a distribution.
What to look for on the histogram
Overall shape: symmetric or roughly symmetric. Unimodal or bimodal or multimodal.
Location of centre and spread: centre (mean, median). Spread (range, standard deviation).
Outliers: any values outside the overall pattern
Stem and leaf plot
The leaf consists of only the last (rightmost) digit.
The stem consists of the remaining digits.
When plotting a moderate number of observations, you can split each stem.
When observed values have too many digits, trim the numbers before making a stem plot.
The purpose of back to back stem plots is to compare two population distributions by allowing
leaves to extend to the right for one population and to the left for another.
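The stem-and-leaf construction above (last digit as leaf, remaining digits as stem) can be sketched as a short Python function, shown here on hypothetical exam marks:

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Group each value's last digit (leaf) under its remaining digits (stem)."""
    plot = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)  # e.g. 75 -> stem 7, leaf 5
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical exam marks (illustrative only)
marks = [67, 72, 75, 75, 81, 88, 93]
for stem, leaves in sorted(stem_and_leaf(marks).items()):
    print(stem, "|", " ".join(str(l) for l in leaves))
```

Sorting the data first guarantees the leaves on each stem appear in increasing order, as in a hand-drawn stem plot.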
Location is determined by where the centre of our data falls.
Mean (arithmetic average)
Add all values then divide by the number of individuals.
The mean is the balance point.
Median
Number such that half of the observations are smaller and half are larger.
Number such that, when the data are ordered, there are an equal number of observations
on each side of it.
Order the data, count the number of data and compute ½(n+1), find the data point at
position ½(n+1) from the smallest value.
If n is an odd #, the median is the (n+1)/2 th value.
If n is an even #, the median is the average of the two values on either side of the
(n+1)/2 th position.
The median is not equal to (n+1)/2 ; it is the data value in the (n+1)/2 th position.
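The steps above (order the data, find the (n+1)/2-th position, average the two middle values when n is even) can be sketched as:

```python
def median(data):
    """Median: the data value in the (n+1)/2-th position of the ordered list."""
    ordered = sorted(data)
    n = len(ordered)
    mid = (n + 1) // 2  # 1-based position of the middle
    if n % 2 == 1:
        return ordered[mid - 1]
    # even n: average the two values on either side of the middle
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 4, 1, 5]))     # odd n: the middle value
print(median([3, 1, 4, 1, 5, 9]))  # even n: average of the two middle values
```

Note the `mid - 1` index conversion: the formula (n+1)/2 is a 1-based position, while Python lists are 0-indexed.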
The mean and median are the same only if the distribution is symmetric. If the distribution is skewed, the mean is pulled toward the long tail (extreme values).
Mode: the most frequently occurring measurement in the sample.
There may be no mode at all.
There may be more than one mode.
Weighted mean = (x1·w1 + x2·w2 + … + xn·wn) / (w1 + w2 + … + wn): the sum of each value times its weight, divided by the total of the weights.
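A minimal sketch of the weighted mean, using hypothetical course grades and weights (not from the notes):

```python
def weighted_mean(values, weights):
    """Sum of value*weight over the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical grades: assignments 80, midterm 70, final 90,
# weighted 20%, 30%, 50% respectively
print(weighted_mean([80, 70, 90], [0.2, 0.3, 0.5]))
```

Dividing by the sum of the weights means the weights need not sum to 1; percentages or raw counts both work.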
The first quartile, Q1, is the value that has 25% of the data below it.
Q2 is the Median
Q3 is the value that has 75% of the data below it.
Q3-Q1 is called the IQR, Interquartile Range
*five number summary: minimum, Q1, median, Q3, maximum
Median-Q1 = Q2-Q1 and Q3-median = Q3-Q2
Will tend to be the same if a distribution is symmetric but unequal if the distribution is skewed.
The distribution is skewed to the right:
Q3-Q2 > Q2-Q1
The distribution is skewed to the left:
Q3-Q2 < Q2-Q1
An outlier is an observation that does not follow the general pattern of the data. They are troublesome
data points, and it is important to be able to identify them.
One way to raise the flag for a suspected outlier is to compare the distance from the suspicious data
point to the nearest quartile.
IQR = Q3 – Q1
Suspected outlier: we call an observation a suspected outlier if it falls more than 1.5 times the
interquartile range beyond the nearest quartile, i.e. below Q1 - 1.5×IQR or above Q3 + 1.5×IQR. This is the 1.5 × IQR rule.
The pth percentile of a list of numbers is such that p% of the numbers in the ordered list have a value
less than or equal to the pth percentile value.
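The quartile and 1.5 × IQR calculations can be sketched as follows, on hypothetical data. Note there are several conventions for computing quartiles; this sketch uses one common one, taking Q1 and Q3 as the medians of the lower and upper halves (excluding the overall median when n is odd):

```python
def quartiles(data):
    """Return (Q1, median, Q3), with Q1/Q3 as medians of the two halves."""
    ordered = sorted(data)
    n = len(ordered)

    def med(vals):
        m = len(vals)
        mid = m // 2
        return vals[mid] if m % 2 else (vals[mid - 1] + vals[mid]) / 2

    half = n // 2
    return med(ordered[:half]), med(ordered), med(ordered[-half:])

# Hypothetical data; 25 looks suspicious
data = [1, 3, 4, 5, 5, 6, 7, 11, 25]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

# 1.5 x IQR rule: flag points beyond the fences
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(q1, q2, q3, iqr, outliers)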
Consider each value's deviation from the mean.
Standard deviation steps:
Find the mean
Find the deviation
Square the deviations
Sum up all the squared deviations
Divide the sum of the squared deviations by n-1, where n is the total sample size
To find s we take the square root of what we have found
S measures spread about the mean and should be used only when the mean is the measure of centre.
S=0 only when all observations have the same value and there is no spread. Otherwise S > 0.
S is not resistant to outliers.
S gets larger, as the observations spread out about their mean.
S has the same units of measurement as the original observations.
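The standard deviation steps listed above can be sketched directly in Python (hypothetical data):

```python
from math import sqrt

def sample_sd(data):
    """Follow the steps: mean, deviations, square, sum, divide by n-1, root."""
    n = len(data)
    mean = sum(data) / n                            # step 1: find the mean
    squared_devs = [(x - mean) ** 2 for x in data]  # steps 2-3: deviations, squared
    return sqrt(sum(squared_devs) / (n - 1))        # steps 4-6: sum, divide, root

print(sample_sd([2, 4, 4, 4, 5, 5, 7, 9]))
```

Dividing by n-1 rather than n gives the sample standard deviation, matching the steps in the notes.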
Variables can be recorded in different units of measurement.
A linear transformation changes the original variable x into the new variable x(new) given by an
equation of the form:
x(new) = a + bx
Then the new mean is a + b(old mean), and the new standard deviation is |b|(old standard deviation).
Linear transformations do not change the basic shape of a distribution.
Multiplying each observation by a positive number b multiplies both measures of centre
and spread by b.
Adding the same number a (positive or negative) to each observation adds a to measures
of centre and to quartiles, but it does not change measures of spread.

Unit 2
Response variable: measures or records an outcome of a study.
Denoted by Y.
Examples: time of recovery, test score, weight.
Explanatory variable: explains changes in the response variable.
Denoted by X.
Examples: amount of medicine, age, hours of study.
The explanatory variable is the one that the researcher controls.
The response variable is the one that the researcher measures to determine the effect of the explanatory variable.
Scatterplots: a two dimensional plot of (x,y) for two quantitative variables being measured on the
same individual or unit.
One axis is used to represent each of the variables.
A graphical tool to examine the association between two quantitative variables.
Form: linear, curved, clusters or no pattern.
Direction: positive, negative or no direction.
Positive association: high values of one variable tend to occur together with high values
of the other variables.
Negative association: high values of one variable tend to occur together with low values
of the other variable.
No relationship: X and Y vary independently. Knowing X tells you nothing about Y.
Strength: how closely the points fit the “form”.
Strength of the relationship between the two variables can be seen by how much
variation, or scatter there is around the main form.
With a strong relationship you can get a pretty good estimate of y if you know x.
With a weak relationship for any x you might get a wide range of y values.
Outliers: any deviation from the pattern.
An outlier is a data value that has a very low probability of occurrence.
In a scatter plot, outliers are points that fall outside of the overall pattern of the relationship.
The correlation coefficient, denoted as r, measures both the direction and strength of a linear relationship between two quantitative variables.
Correlation can only be used to describe quantitative variables.
r only describes linear relationships: no matter how strong the association, r does not describe curved
relationships. r does not distinguish x and y; the correlation coefficient r treats x and y symmetrically.
sign of r and the direction of the linear relationship
r > 0 positive linear association
r < 0 negative linear association
magnitude of r and the strength of the linearity
r large = strong linear association
r small = weak linear association
r = 0 no linear association
r has no unit: it is unit free and does not depend on the units of measurement of x or y.
r is strongly affected by the presence of a few outliers.
changing the units of variables does not change the correlation coefficient “r”.
r ranges from -1 to +1.
r is sensitive to extreme values.
If there is a positive association, r is the positive square root of r^2, and if there is a negative
association, r is the negative square root.
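The properties of r above can be checked with a short computation. This is a sketch of the standard Pearson formula on hypothetical hours-studied vs. test-score data:

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson r: sum of cross-deviations over (n-1) * sx * sy."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

# Hypothetical data (illustrative only)
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 78]
print(correlation(hours, score))  # strong positive linear association
```

Because each deviation is divided by a standard deviation, changing the units of x or y leaves r unchanged, matching the unit-free property noted above.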
A regression line is a straight line that describes how a response variable y changes as an
explanatory variable x changes.
We often use a regression line to predict the value of y for a given value of x.
The least squares regression line is the unique line such that the sum of the squared
vertical y distances between the data points and the line is as small as possible.
How to find b0 and b1
We calculate the slope of the line, b1
o b1 = r(sy/sx)
Where r is the correlation coefficient, sy is the sample standard deviation of the
response variable Y, and sx is the sample standard deviation of the explanatory variable X.
Once we know b1, the slope, we can calculate b0, the y-intercept.
o b0 = ybar - b1(xbar)
o Where xbar and ybar are the sample means of X and Y.
The slope b1 and the correlation coefficient r are of the same sign.
The regression line always passes through the sample means of x and y.
Regression examines the distance of all points from the line in y direction only.
The distinction between explanatory and response variables is crucial in regression. If you exchange y for x in calculating the regression line, you will get the wrong line.
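The least-squares computation described above can be sketched as follows (hypothetical data). The slope formula r(sy/sx) reduces algebraically to the sum of cross-deviations over the sum of squared x-deviations:

```python
def least_squares(xs, ys):
    """Return (b0, b1): intercept and slope of the least-squares line."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx          # slope: equivalent to r * (sy / sx)
    b0 = ybar - b1 * xbar   # intercept: line passes through (xbar, ybar)
    return b0, b1

# Hypothetical data (illustrative only)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = least_squares(xs, ys)
print(b0, b1)
```

Because b0 is defined as ybar - b1·xbar, the fitted line necessarily passes through the point of sample means, as the notes state.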
Correlation vs. Regression
The correlation is a measure of spread in both the x and y directions in the linear
relationship, while in regression we examine the variation in the response variable (y)
given change in the explanatory variable (x).
y = b0 + b1x
b0, the y-intercept
the value of y when x=0
sometimes the y-intercept is not practically meaningful.
b1, the slope
if b1 > 0 the response variable is expected to increase by b1 units for each additional unit
increase in the explanatory variable.
If b1 < 0 the response variable is expected to decrease by b1 units for each additional unit
increase in the explanatory variable x.
The equation of the least-squares regression allows you to predict y for any x within the range
studied, even if such x is not from the sample.
Extrapolation is the use of a regression line for predictions outside the range of x values used to
obtain the line.
The distances from each point to the least-squares regression line give us potentially useful
information about the contribution of individual data points to the overall pattern of scatter.
These are called “residuals”.
Residual = observed y − predicted y = y − ŷ
Residuals are the distances between y-observed and y-predicted. We plot them in a residual plot.
If the residuals are scattered randomly around 0, chances are your data fit a linear model, are
roughly normally distributed, and contain no outliers.
r^2 represents the percentage of the variance in y that can be explained by changes in x.
percentage of variation think r^2
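Residuals and r^2 can be computed directly from a fitted line. This sketch reuses hypothetical data and its least-squares estimates (b0 = 2.2, b1 = 0.6 for these points):

```python
# Hypothetical data (illustrative only) and its least-squares line
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

predicted = [b0 + b1 * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]  # observed - predicted

ybar = sum(ys) / len(ys)
ss_res = sum(r ** 2 for r in residuals)             # unexplained variation
ss_tot = sum((y - ybar) ** 2 for y in ys)           # total variation in y
r_squared = 1 - ss_res / ss_tot                     # fraction explained by x

print(residuals, r_squared)
```

For a least-squares fit the residuals sum to zero, and r_squared here equals the square of the correlation between x and y.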
Outlier: observation that lies outside the overall pattern of observations.
Influential individual: observation that markedly changes the regression if removed. This is often an
outlier in the x-direction.
A lurking variable is a variable not included in the study design that does have an effect on the response variable.
A lurking variable is neither the explanatory variable nor the response variable.
Lurking variables can falsely suggest a relationship or mask a real one.
Two variables are confounded when their effects on a response variable cannot be distinguished from each other.
Lurking variables and confounded variables sometimes create a correlation between the
explanatory variable x and the response y, but they can also hide a true relationship between x and y.
*easiest way to collect evidence that x causes y is to do an experiment*
If we switch the x and y variables and fit a line, the equation changes.
This is because least-squares regression looks only at the vertical (y) distances.
A least-squares regression line always passes through the point (xbar, ybar) on the graph of y
versus x. That is, it passes through the averages of both variables.
If there is a linear relationship between y and x, y changes (varies) as x does. But there is
added variation not captured by the relationship specified in the regression line.
The square of the correlation r^2, is the fraction of the variation in the values of y
explained by the least squares regression of y on x.
r^2 indicates how well the regression explains the response y.
So square a correlation to get a better idea of the strength of the association.

Unit 3
Population: the entire group of individuals in which we are interested but can’t usually assess directly.
Parameter is a number describing a characteristic of the population.
Sample is the part of the population we actually examine and for which we do have data.
A statistic is a number describing a characteristic of a sample.
Sample is a “subset” of population.
We estimate the parameter using the statistics.
Observational study: record data on individuals without attempting to influence the responses.
Essential sources of data on a variety of topics.
Experimental study: deliberately impose a treatment on individuals and record their responses.
Influential factors can be controlled.
They allow us to draw conclusions about the effect of one variable on another.
We can study many variables at once.
Lurking variables and confounded variables might have some consequences to the response variable
in an observation.
Well designed experiments take steps to defeat confounding.
Individuals in an experiment are experimental units.
Humans are called subjects.
The explanatory variables in an experiment are often called factors.