STAB22H3 Study Guide - Midterm Guide: Time Series, Standard Deviation, Linear Map

27 Nov 2012
Exploratory Data Analysis: Uses graphs and numerical summaries to describe the variables in a
data set and the relations among them.
1. Begin by examining each variable by itself. Then move on to study the relationships among the variables.
2. Begin with graphs. Then add numerical summaries of specific aspects of the data.
Distribution of a Categorical Variable: lists the categories and gives either the count or
the percent of individuals who fall in each category.
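As a rough illustration of the definition above, the sketch below tabulates the count and percent for each category. The function name `categorical_distribution` and the `majors` data are made up for this example.

```python
from collections import Counter

def categorical_distribution(values):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(values)
    total = len(values)
    return {category: (count, 100 * count / total)
            for category, count in counts.items()}

# Hypothetical data: the major of six students.
majors = ["Biology", "Math", "Biology", "Psychology", "Biology", "Math"]

for category, (count, percent) in categorical_distribution(majors).items():
    print(f"{category}: count={count}, percent={percent:.1f}%")
```

The percents add up to 100 because every individual falls in exactly one category.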
Bar Graph (describe the distribution of categorical variables): The heights of the bars compare
the percents of each category.
-easier to read and more flexible
Pie Chart (describe the distribution of categorical variables): Helps to see what part of the
group each category forms.
-Because pie charts lack scales, add the percents to the labels for each slice.
-Require that you include all the categories that make up a whole.
-Use them only when you want to emphasize each category's relation to the whole.
Tails: The extreme values of a distribution are in the tails of the distribution.
Stemplots (describe the distribution of quantitative variables): Gives a quick picture of the
shape of a distribution while including the actual numerical values in the graph.
-also called a stem-and-leaf plot
-work best for small numbers of observations that are all greater than 0.
-do not work well for large data sets.
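A minimal sketch of how a stemplot could be built, assuming non-negative whole-number observations with the tens digit(s) as the stem and the ones digit as the leaf. The function name `stemplot` and the data are hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return the lines of a stemplot (stem = tens, leaf = ones digit)."""
    stems = defaultdict(list)
    for x in sorted(values):
        stem, leaf = divmod(x, 10)
        stems[stem].append(leaf)
    lines = []
    # Include every stem between the smallest and largest, even if empty,
    # so that gaps in the distribution stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

# Hypothetical observations.
scores = [12, 15, 21, 23, 23, 38, 40]
for line in stemplot(scores):
    print(line)
```

Each leaf is an actual data value, which is why the plot shows the shape of the distribution without losing the numbers.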
Back-to-back stemplot: when comparing two related distributions using a common stem and the
leaves for one on the right and the leaves for the other on the left.
Histograms (describe the distribution of quantitative variables): Break the range of values of a
variable into classes and display only the count or percent of the observations that fall into each class.
-A histogram shows the distribution of counts or percents among the values of a single variable. A
bar graph compares the size of different items.
-plot the frequencies (counts) or the percents of equal-width classes of values.
Frequencies: The number of individuals in each class.
- the base of the bar covers the class and the height is the class count.
- the area of each bar is determined by its height, since the widths of all the bars are equal.
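The class counts behind a histogram can be sketched as follows. This assumes equal-width classes that include their left endpoint but not their right endpoint. The function name `class_counts` and the data are hypothetical.

```python
def class_counts(values, start, width, num_classes):
    """Count observations in equal-width classes [start + k*width, start + (k+1)*width)."""
    counts = [0] * num_classes
    for x in values:
        index = int((x - start) // width)
        if 0 <= index < num_classes:
            counts[index] += 1
    return counts

# Hypothetical observations, grouped into classes 0-5, 5-10, 10-15.
observations = [1, 2, 2, 3, 5, 7, 8, 8, 9, 12]
counts = class_counts(observations, start=0, width=5, num_classes=3)

for k, count in enumerate(counts):
    left = 5 * k
    print(f"{left:>2} to {left + 5:<2} | {'*' * count} ({count})")
```

The row of stars plays the role of the bar: its length is the class count.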
Outlier: an individual observation that falls outside the overall pattern of a distribution.
Midpoint of a Distribution: the value with roughly half the observations taking smaller values and
half taking larger values.
Modes: major peaks in the graph of a distribution
-a distribution with one major peak is called unimodal
Symmetric Distribution: when the values smaller and larger than its midpoint are mirror images of
each other.
Skewed to the right: if the right tail (larger values) of the distribution is much longer than the left tail
(smaller values)
Time Plots: the plotting of each observation against the time at which it was measured. Always put
time on the horizontal scale of your plot and the variable you are measuring on the vertical scale.
Connecting the data points by lines helps emphasize any changes over time.
Time Series: measurements of a variable taken at regular intervals over time.
Trend: in a time series, a persistent, long-term rise or fall.
Seasonal Variation: a pattern in a time series that repeats itself at known regular intervals of time
Statistics: The science of learning from data.
Data: Data are numerical facts.
Data are numbers with a context, and we need to understand the context if we are to
make sense of the numbers.
Individuals: are the objects described in a set of data.
Individuals are sometimes people.
When the objects that we want to study are not people, we often call them cases.
Cases: another name for individuals, used when the objects being studied are not people.
Variable: is any characteristic of an individual.
A variable can take different values for different individuals.
Categorical Variable: places an individual into one of two or more groups or categories.
Quantitative Variable: takes numerical values for which arithmetic operations, such
as adding and averaging, make sense.
Distribution: The distribution of a variable tells us what values it takes and how often
it takes these values.
Rate: Often, the rate at which something occurs is more meaningful than the simple
count of occurrences.
Mean: The average value.
-a measure of the center of a distribution.
-To find the mean of a set of observations, add their values and divide by the number of observations.
Median: The middle value or midpoint.
-a measure of the center of a distribution.
-half of the observations are smaller than the median and half are larger.
-first arrange all the values from smallest to largest
-(n+1)/2 = position of median
-Also known as the 50th percentile
-(see notation)
Resistant Measure: A measure that is not sensitive to the influence of a few extreme
observations (e.g. outliers).
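The sketch below illustrates the (n+1)/2 position rule for the median and shows why the median is resistant while the mean is not. The helper `median_by_position` and the data are hypothetical.

```python
from statistics import mean

def median_by_position(values):
    """Find the median using the (n + 1) / 2 position rule (1-based positions)."""
    ordered = sorted(values)
    n = len(ordered)
    pos = (n + 1) / 2
    if pos.is_integer():
        # n is odd: the median is the single middle observation.
        return ordered[int(pos) - 1]
    # n is even: the position falls halfway between two observations,
    # so average the two middle values.
    return (ordered[int(pos) - 1] + ordered[int(pos)]) / 2

data = [2, 3, 5, 7, 8]          # hypothetical observations
with_outlier = data + [100]     # the same data plus one extreme value

print("mean:", mean(data), "->", mean(with_outlier))
print("median:", median_by_position(data), "->", median_by_position(with_outlier))
```

One extreme observation pulls the mean from 5 to about 20.8, while the median only moves from 5 to 6.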
Spread: The simplest useful numerical description of a distribution consists of both a measure of
center and a measure of spread.
First Quartile - Q1: the median of the observations whose position in the ordered list is to the
left of the location of the overall median.
-Remember to first arrange the observations in increasing order.
-(n)(0.25) = position of Q1
Third Quartile - Q3: the median of the observations whose position in the ordered list is to the
right of the location of the overall median.
-Remember to first arrange the observations in increasing order.
-(n)(0.75) = position of Q3
Five-Number Summary: consists of the smallest observation, the first quartile, the median, the
third quartile, and the largest observation, written in order from smallest to largest.
-In symbols:
Minimum Q1 M Q3 Maximum
-leads to another visual representation of a distribution, the boxplot.
Box Plot: a graph of the five number summary
-a central box spans the quartiles Q1 and Q3
-a line in the box marks the median M.
-Lines extend from the box out to the smallest and largest observations.
-effective for comparing distributions
Interquartile Range (IQR): The distance between the first and third quartiles,
IQR = Q3 - Q1
The 1.5 x IQR Rule for outliers: An observation is a suspected outlier if it falls more than 1.5 x
IQR above the third quartile or below the first quartile.
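Variance, s^2: the variance of a set of observations is the average of the squares of the deviations of the
observations from their mean (dividing by n - 1 rather than n).
In symbols,
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Standard Deviation, s: the square root of the variance, s^2.
In symbols:
$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$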
-s measures spread about the mean and should be used only when the mean is chosen as the
measure of center.
-s=0 only when there is no spread. This happens only when all observations have the same
value. Otherwise, s > 0. As the observations become more spread out about their mean, s gets larger.
-s, like the mean (x bar), is not resistant. A few outliers can make s very large.
-s is the natural measure of spread for normal distributions.
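As a worked check of the definitions of s^2 and s, the sketch below computes them by hand, assuming the usual n - 1 divisor, and compares the result with Python's `statistics` module. The helper `sample_variance` and the data are hypothetical.

```python
from math import sqrt
from statistics import stdev, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical observations, mean = 5

s2 = sample_variance(data)          # squared deviations sum to 32, so s^2 = 32 / 7
s = sqrt(s2)
print(round(s2, 4), round(s, 4))
print(round(variance(data), 4), round(stdev(data), 4))   # library versions agree

# No spread at all: every observation equals the mean, so s = 0.
print(sample_variance([3, 3, 3]))
```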
Choosing a summary: The five-number summary is usually better than the mean and standard
deviation for describing a skewed distribution or a distribution with strong outliers. Use the mean
and standard deviation only for reasonably symmetric distributions that are free of outliers.
Linear transformations: a linear transformation changes the original variable x into the new variable x(new) given by
an equation of the form:
$$x_{\text{new}} = a + bx$$
-Adding the constant a shifts all values of x upward or downward by the same amount. In
particular, such a shift changes the origin (zero point) of the variable. Multiplying by the positive
constant b changes the size of the unit of measurement.
-A linear transformation multiplies a measure of spread by b and changes a percentile or
measure of center m into a + bm.
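The effect of a linear transformation on center and spread can be checked numerically. The sketch below uses a Celsius to Fahrenheit conversion (a = 32, b = 1.8) as an assumed example; the data and the helper `linear_transform` are hypothetical.

```python
from statistics import mean, median, stdev

def linear_transform(values, a, b):
    """Apply x_new = a + b * x to every observation."""
    return [a + b * x for x in values]

celsius = [10.0, 15.0, 20.0, 25.0, 40.0]            # hypothetical temperatures
fahrenheit = linear_transform(celsius, a=32, b=1.8)

# Measures of center m become a + b * m.
print(mean(fahrenheit), 32 + 1.8 * mean(celsius))
print(median(fahrenheit), 32 + 1.8 * median(celsius))

# Measures of spread are multiplied by b; adding a does not change them.
print(stdev(fahrenheit), 1.8 * stdev(celsius))
```

Each printed pair should agree up to floating-point rounding.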
Normal Curves: -a type of density curve
-describes a normal distribution N( μ , σ )
-are symmetrical, unimodal, and bell shaped.
-the exact density curve for a particular normal distribution is specified by giving its mean μ and its standard
deviation σ .
-in normal curves, mean = median
-the standard deviation controls the spread of a normal curve
-The points at which the change of curvature takes place on a normal curve are located at distance σ on either
side of the mean μ .
68-95-99.7 rule: In a normal distribution with mean μ and standard deviation σ:
- approximately 68% of the observations fall within σ of the mean μ.
- approximately 95% of the observations fall within 2σ of the mean μ.
- approximately 99.7% of the observations fall within 3σ of the mean μ.
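The rule can be checked against exact Normal areas. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper `proportion_within` and the N(100, 15) example are hypothetical.

```python
from statistics import NormalDist

def proportion_within(k, mu=0.0, sigma=1.0):
    """Proportion of a N(mu, sigma) distribution within k standard deviations of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, round(proportion_within(k), 4))

# The proportions do not depend on the particular mean and standard deviation.
print(round(proportion_within(2, mu=100, sigma=15), 4))
```

The exact areas are about 0.6827, 0.9545 and 0.9973, which the rule rounds to 68%, 95% and 99.7%.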
Standardizing: -all normal distributions are the same if we measure in units of size σ about the mean μ as
center. Changing to these units is called standardizing.
-If x is an observed value from a distribution that has mean μ and standard deviation σ, the standardized
value of x is:
$$z = \frac{x - \mu}{\sigma}$$
-the standardized values for any distribution always have mean 0 and standard deviation 1.
z-score: -a standardized value is often called a z-score.
-a z-score tells us how many standard deviations the original observation falls away from the mean, and in
which direction.
-Observations larger than the mean are positive when standardized, and observations smaller than the mean are
negative.
The Standard Normal Distribution: is the Normal distribution N( 0 , 1 ) with mean 0 and standard
deviation 1.
-if a variable X has any normal distribution N( μ , σ ) with mean μ and standard deviation σ, then the
standardized variable
$$Z = \frac{X - \mu}{\sigma}$$
has the standard Normal distribution.
Cumulative Proportion: the proportion of observations in a distribution that lie at or below a given value.
-When the distribution is given by a density curve, the cumulative proportion is the area under the curve to the
left of a given value.
-Table A in the back of the textbook gives cumulative proportions for the standard Normal distribution.
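A short sketch of standardizing a value and looking up its cumulative proportion. Here `statistics.NormalDist().cdf` stands in for a printed table of standard Normal areas; the N(100, 15) example and the function names are hypothetical.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardized value: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

def cumulative_proportion(x, mu, sigma):
    """Area under the N(mu, sigma) curve at or below x, via the standard Normal."""
    return NormalDist().cdf(z_score(x, mu, sigma))

# Hypothetical example: a value of 130 in a N(100, 15) distribution.
z = z_score(130, mu=100, sigma=15)
print(z)                                              # 2 standard deviations above the mean
print(round(cumulative_proportion(130, 100, 15), 4))  # proportion at or below 130
print(cumulative_proportion(100, 100, 15))            # the mean splits the area in half
```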
Normal Quantile Plots:
1. Arrange observed data values from smallest to largest. Record what percentile of the data each value occupies.
2. Do Normal distribution calculations to find the values of z corresponding to these same percentiles. We call
these values of z Normal scores.
3. Plot each data point x against the corresponding Normal score. If the data distribution is close to any normal
distribution, the plotted points will lie close to a straight line. Outliers appear as points far away from the overall pattern of the plot.
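The three steps above can be sketched as follows. The percentile assigned to the i-th ordered value is taken to be (i - 0.5) / n, which is one common convention and an assumption here; software packages differ on this detail. The function name `normal_scores` and the data are hypothetical.

```python
from statistics import NormalDist

def normal_scores(values):
    """Return (Normal score, data value) pairs for a Normal quantile plot."""
    ordered = sorted(values)                     # step 1: order the data
    n = len(ordered)
    std_normal = NormalDist()                    # the standard Normal N(0, 1)
    pairs = []
    for i, x in enumerate(ordered, start=1):
        percentile = (i - 0.5) / n               # assumed percentile convention
        z = std_normal.inv_cdf(percentile)       # step 2: matching Normal score
        pairs.append((z, x))                     # step 3: point to plot
    return pairs

data = [4.1, 5.0, 5.6, 6.2, 7.3]                 # hypothetical observations
for z, x in normal_scores(data):
    print(f"z = {z:6.3f}   x = {x}")
```

Plotting x against z for these pairs gives the Normal quantile plot; points close to a straight line suggest the data are close to Normal.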
Density Curve:-a curve that is always on or above the horizontal axis.
-has an area of exactly 1 underneath it.
-describes the overall pattern of a distribution. The area under the curve and above any range of values is
the proportion of all observations that fall in that range.
-outliers are not described by the curve.
Mode of Density Curve: a peak point of the curve, the location where the curve is highest.
Median of a Density Curve: the equal-area point, the point that divides the area under the curve in half.
Mean of a Density Curve: -the balance point,
-the point at which the curve would balance on a pivot if made of solid material