# STAB22H3 Study Guide - Midterm Guide: Time Series, Standard Deviation, Linear Map

Exploratory Data Analysis: Uses graphs and numerical summaries to describe the variables in a

data set and the relations among them.

1. Begin by examining each variable by itself. Then move on to study the relationships among the

variables.

2. Begin with graphs. Then add numerical summaries of specific aspects of the data.

Distribution of a Categorical Variable: lists the categories and gives either the count or

the percent of individuals who fall in each category.

Bar Graph (describe the distribution of categorical variables): The heights of the bars compare

the percents of each category.

-easier to read and more flexible than pie charts

Pie Chart (describe the distribution of categorical variables): Helps to see what part of the

group each category forms.

-Because pie charts lack scales, add the percents to the labels for each slice.

-Require that you include all the categories that make up a whole.

-Use them only when you want to emphasize each category's relation to the whole.

Tails: The extreme values of a distribution are in the tails of the distribution.

Stemplots (describe the distribution of quantitative variables): Gives a quick picture of the

shape of a distribution while including the actual numerical values in the graph.

-also called a stem-and-leaf plot

-work best for small numbers of observations that are all greater than 0.

-do not work well for large data sets.
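
As a rough illustration of how a stemplot is built, the Python sketch below splits each observation into a stem (all digits but the last) and a leaf (the last digit). The function name `stemplot` and the sample data are made up for this example, and the sketch assumes non-negative whole numbers.

```python
from collections import defaultdict


def stemplot(values):
    """Return the rows of a stemplot for non-negative whole numbers."""
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):
        # Stem = all digits but the last; leaf = the last digit.
        leaves_by_stem[v // 10].append(v % 10)

    rows = []
    # Include stems with no leaves so gaps in the data stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows


# Hypothetical data, for illustration only.
for row in stemplot([12, 15, 21, 27, 27, 43]):
    print(row)
```

Each row keeps the actual values readable: the row for stem 2 lists the leaves 1, 7, 7, standing for 21, 27, and 27.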

Back-to-back stemplot: compares two related distributions using common stems, with the leaves for one distribution on the right and the leaves for the other on the left.

Histograms (describe the distribution of quantitative variables): Break the range of values of a

variable into classes and display only the count or percent of the observations that fall into each

class.

-A histogram shows the distribution of counts or percents among the values of a single variable. A

bar graph compares the size of different items.

-plot the frequencies (counts) or the percents of equal-width classes of values.

Frequencies: The number of individuals in each class.

- the base of the bar covers the class and the height is the class count.

- the area of each bar is determined by its height, since all the bars have equal width.
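
The class counts behind a histogram can be sketched in a few lines of Python. This is only an illustration: the function name `class_counts`, the choice of classes, and the data are assumptions made for the example, and each class is taken to include its lower endpoint but not its upper one.

```python
def class_counts(values, start, width, n_classes):
    """Count the observations falling in each equal-width class.

    Class k covers the interval [start + k*width, start + (k+1)*width).
    """
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)   # index of the class containing v
        if 0 <= k < n_classes:          # ignore values outside all classes
            counts[k] += 1
    return counts


# Hypothetical test scores, grouped into classes 50-59, 60-69, ..., 90-99.
scores = [52, 67, 71, 74, 78, 81, 85, 88, 93, 99]
counts = class_counts(scores, start=50, width=10, n_classes=5)
print(counts)
```

The resulting counts are the bar heights; dividing each by the number of observations gives the percents instead.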

Outlier: an individual observation that lies outside the overall pattern of a distribution.

Midpoint of a Distribution: the value with roughly half the observations taking smaller values and

half taking larger values.

Modes: major peaks in the graph of a distribution

-a distribution with one major peak is called unimodal

Symmetric Distribution: when the values smaller and larger than its midpoint are mirror images of

each other.

Skewed to the right: if the right tail (larger values) of the distribution is much longer than the left tail

(smaller values)

Time Plots: the plotting of each observation against the time at which it was measured. Always put

time on the horizontal scale of your plot and the variable you are measuring on the vertical scale.

Connecting the data points by lines helps emphasize any changes over time.

Time Series: measurements of a variable taken at regular intervals over time.

Trend: in a time series, is a persistent, long-term rise or fall.

Seasonal Variation: a pattern in a time series that repeats itself at known regular intervals of time

Statistics: The science of learning from data.

Data: Data are numerical facts.

Data are numbers with a context, and we need to understand the context if we are to

make sense of the numbers.

Individuals: are the objects described in a set of data.

Individuals are sometimes people.

Cases: When the objects that we want to study are not people, we often call them

cases.

Variable: is any characteristic of an individual.

A variable can take different values for different individuals.

Categorical Variable: places an individual into one of two or more groups or

categories.

Quantitative Variable: takes numerical values for which arithmetic operations, such

as adding and averaging, make sense.

Distribution: The distribution of a variable tells us what values it takes and how often

it takes these values.

Rate: Often, the rate at which something occurs is more meaningful than the simple

count of occurrences.

Mean: The average value.

-a measure of the center of a distribution.

-To find the mean of a set of observations, add their values and divide by the number of

observations.

Median: The middle value or midpoint.

-a measure of the center of a distribution.

-half of the observations are smaller than the median and half are larger.

-first arrange all the values from smallest to largest

-(n+1)/2 = position of median

-Also known as the 50th percentile

-(see notation)

Resistant Measure: A measure that is not sensitive to the influence of a few extreme

observations (e.g., outliers).
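
A small Python sketch can make the difference between the mean and the median concrete. The data are hypothetical, and `median_by_position` is a helper written for this example that applies the (n+1)/2 position rule, averaging the two middle values when n is even.

```python
from statistics import mean, median

data = [2, 3, 5, 7, 8]            # hypothetical observations
with_outlier = [2, 3, 5, 7, 80]   # same data with the largest value made extreme


def median_by_position(values):
    """Median via the (n + 1) / 2 position rule on the sorted list."""
    xs = sorted(values)
    n = len(xs)
    pos = (n + 1) / 2                  # 1-based position of the median
    if pos.is_integer():
        return xs[int(pos) - 1]        # odd n: the single middle value
    lo = int(pos) - 1                  # even n: average the two middle values
    return (xs[lo] + xs[lo + 1]) / 2


print(mean(data), median(data))                   # both measures of center
print(mean(with_outlier), median(with_outlier))   # only the mean moves
print(median_by_position(data))
```

Replacing 8 with 80 pulls the mean far upward while the median stays put, which is what it means for the median to be a resistant measure.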

Spread: The simplest useful numerical description of a distribution consists of both a measure of

center and a measure of spread.

First Quartile - Q1: the median of the observations whose position in the ordered list is to the

left of the location of the overall median.

-Remember to first arrange the observations in increasing order.

-(n)(0.25) = position of Q1

Third Quartile - Q3: the median of the observations whose position in the ordered list is to the

right of the location of the overall median.

-Remember to first arrange the observations in increasing order.

-(n)(0.75) = position of Q3

Five-Number Summary: consists of the smallest observation, the first quartile, the median, the

third quartile, and the largest observation, written in order from smallest to largest.

-In symbols:

Minimum Q1 M Q3 Maximum

-leads to another visual representation of a distribution, the boxplot.

Box Plot: a graph of the five-number summary

-a central box spans the quartiles Q1 and Q3

-a line in the box marks the median M.

-Lines extend from the box out to the smallest and largest observations.

-effective for comparing distributions

Interquartile Range (IQR): The distance between the first and third quartiles,

IQR = Q3 - Q1

The 1.5 x IQR Rule for outliers: An observation is a suspected outlier if it falls more than 1.5 x

IQR above the third quartile or below the first quartile.

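Variance, s^2: the variance of a set of observations is the average of the squares of the deviations of the

observations from their mean, with the sum of squared deviations divided by n - 1.

In symbols,

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Standard Deviation, s: the square root of the variance, s^2.

In symbols:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$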

-s measures spread about the mean and should be used only when the mean is chosen as the

measure of center.

-s=0 only when there is no spread. This happens only when all observations have the same

value. Otherwise, s > 0. As the observations become more spread out about their mean, s gets

larger.

-s, like the mean (x bar), is not resistant. A few outliers can make s very large.

-s is the natural measure of spread for normal distributions.
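
The properties of s listed above can be checked with a minimal Python sketch. It assumes the usual sample convention of dividing the sum of squared deviations by n - 1; the function names and data are chosen only for illustration.

```python
import math


def sample_variance(values):
    """s^2: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """s: the square root of the variance."""
    return math.sqrt(sample_variance(values))


print(sample_variance([1, 2, 3, 4, 5]), sample_sd([1, 2, 3, 4, 5]))
print(sample_sd([4, 4, 4, 4]))        # no spread, so s = 0
print(sample_sd([1, 2, 3, 4, 50]))    # one outlier makes s much larger
```

For 1, 2, 3, 4, 5 the deviations from the mean 3 are -2, -1, 0, 1, 2, their squares sum to 10, and dividing by n - 1 = 4 gives s^2 = 2.5.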

Choosing a summary: The five-number summary is usually better than the mean and standard

deviation for describing a skewed distribution or a distribution with strong outliers. Use the mean

and standard deviation only for reasonably symmetric distributions that are free of outliers.

Linear transformations: a linear transformation changes the original variable x into the new variable x(new) given by

an equation of the form:

$$x_{\text{new}} = a + bx$$

-Adding the constant a shifts all values of x upward or downward by the same amount. In

particular, such a shift changes the origin (zero point) of the variable. Multiplying by the positive

constant b changes the size of the unit of measurement.

-A linear transformation multiplies a measure of spread by b and changes a percentile or

measure of center m into a + bm.
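
A quick numerical check of these effects, using the conversion from Celsius to Fahrenheit (a = 32, b = 1.8) as the linear transformation. The helper name `linear_transform` and the temperatures are assumptions made for this sketch.

```python
from statistics import mean, median, stdev


def linear_transform(values, a, b):
    """Apply x_new = a + b * x to every observation."""
    return [a + b * x for x in values]


celsius = [10, 15, 20, 25, 30]                       # hypothetical temperatures
fahrenheit = linear_transform(celsius, a=32, b=1.8)

# Measures of center m become a + b*m.
print(mean(fahrenheit), 32 + 1.8 * mean(celsius))
print(median(fahrenheit), 32 + 1.8 * median(celsius))

# Measures of spread are multiplied by b; the shift a has no effect on them.
print(stdev(fahrenheit), 1.8 * stdev(celsius))
```

Each printed pair should agree up to floating-point rounding.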

Normal Curves: -a type of density curve

-describes a normal distribution N( μ , σ )

-are symmetrical, unimodal, and bell shaped.

-the exact density curve for a particular normal distribution is specified by giving its mean μ and its standard

deviation σ .

-in normal curves, mean = median

-the standard deviation controls the spread of a normal curve

-The points at which the change of curvature takes place on a normal curve are located at distance σ on either

side of the mean μ .

-Equation:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$

68-95-99.7 rule: In a normal distribution with mean μ and standard deviation σ:

- approximately 68% of the observations fall within σ of the mean μ.

- approximately 95% of the observations fall within 2σ of the mean μ.

- approximately 99.7% of the observations fall within 3σ of the mean μ.
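
The three percentages can be checked numerically with `NormalDist` from Python's standard `statistics` module. This is just a sanity check of the rule; the helper name `proportion_within` is made up for the example.

```python
from statistics import NormalDist


def proportion_within(k, mu=0.0, sigma=1.0):
    """Proportion of a N(mu, sigma) distribution within k*sigma of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(k, round(proportion_within(k), 4))
```

The results do not depend on the particular μ and σ, which is why one rule covers every normal distribution.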

Standardizing: -all normal distributions are the same if we measure in units of size σ about the mean μ as

center. Changing to these units is called standardizing.

-If x is an observed value from a distribution that has mean μ and standard deviation σ, the standardized

value of x is:

$$z = \frac{x - \mu}{\sigma}$$

-the standardized values for any distribution always have mean 0 and standard deviation 1.

z-score: -a standardized value is often called a z-score.

-a z-score tells us how many standard deviations the original observation falls away from the mean, and in

which direction.

-Observations larger than the mean are positive when standardized, and observations smaller than the mean are

negative.
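
A short sketch of standardizing in Python. The numbers (mean 64.5, standard deviation 2.5) are hypothetical, and `standardize` uses the sample mean and sample standard deviation of the data it is given, which is an assumption of this example.

```python
from statistics import mean, stdev


def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean, with sign."""
    return (x - mu) / sigma


def standardize(values):
    """Standardize a data set using its own mean and standard deviation."""
    m, s = mean(values), stdev(values)
    return [z_score(x, m, s) for x in values]


print(z_score(68, mu=64.5, sigma=2.5))   # above the mean: positive z
print(z_score(60, mu=64.5, sigma=2.5))   # below the mean: negative z

zs = standardize([1, 2, 3, 4, 5])
print(mean(zs), stdev(zs))               # about 0 and 1 after standardizing
```

An observation of 68 is 3.5 units above the mean of 64.5, and 3.5 / 2.5 = 1.4, so it lies 1.4 standard deviations above the mean.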

The Standard Normal Distribution: is the Normal distribution N( 0 , 1 ) with mean 0 and standard

deviation 1.

-if a variable X has any normal distribution N( μ , σ ) with mean μ and standard deviation σ, then the

standardized variable

$$Z = \frac{X - \mu}{\sigma}$$

has the standard Normal distribution.

Cumulative Proportion: the proportion of observations in a distribution that lie at or below a given value.

-When the distribution is given by a density curve, the cumulative proportion is the area under the curve to the

left of a given value.

-Table A in the back of the textbook gives cumulative proportions for the standard Normal distribution.
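
In place of a printed table, cumulative proportions can be computed with `NormalDist` from Python's standard `statistics` module. The sketch below standardizes first and then looks up the area to the left under the standard Normal curve; the function names and the example values are assumptions made for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(0, 1)


def cumulative_proportion(x, mu=0.0, sigma=1.0):
    """Area under the N(mu, sigma) density curve to the left of x."""
    z = (x - mu) / sigma              # standardize, then use N(0, 1)
    return standard_normal.cdf(z)


def proportion_between(lo, hi, mu=0.0, sigma=1.0):
    """Area between lo and hi: a difference of two cumulative proportions."""
    return cumulative_proportion(hi, mu, sigma) - cumulative_proportion(lo, mu, sigma)


print(round(cumulative_proportion(1.4), 4))                  # area left of z = 1.4
print(round(cumulative_proportion(68, mu=64.5, sigma=2.5), 4))  # same z, same area
print(round(proportion_between(-1, 1), 4))
```

Areas to the right of a value follow from the total area of 1: the proportion above x is 1 minus the cumulative proportion at x.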

Normal Quantile Plots:

1. Arrange observed data values from smallest to largest. Record what percentile of the data each value

occupies.

2. Do Normal distribution calculations to find the values of z corresponding to these same percentiles. We call

these values of z Normal scores.

3. Plot each data point x against the corresponding Normal score. If the data distribution is close to any normal

distribution, the plotted points will lie close to a straight line. Outliers appear as points far away from the overall

pattern.
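
The three steps can be sketched in Python as follows. The percentile assigned to the i-th smallest of n observations is taken here to be (i - 0.5)/n, which is one common convention and an assumption of this example (the notes do not say which rule is used); the function name and data are also made up.

```python
from statistics import NormalDist


def normal_scores(values):
    """Return (normal score, data value) pairs for a Normal quantile plot."""
    xs = sorted(values)                    # step 1: order the data
    n = len(xs)
    std_normal = NormalDist()              # standard Normal, N(0, 1)
    pairs = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n                  # assumed percentile of the i-th value
        z = std_normal.inv_cdf(p)          # step 2: z with that cumulative proportion
        pairs.append((z, x))               # step 3: plot x against z
    return pairs


# Hypothetical data, for illustration only.
for z, x in normal_scores([5, 1, 4, 2, 3]):
    print(round(z, 2), x)
```

Plotting the second coordinate against the first gives the Normal quantile plot; points that fall near a straight line suggest the data are close to Normal.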

Density Curve: -a curve that is always on or above the horizontal axis.

-has an area of exactly 1 underneath it.

-describes the overall pattern of a distribution. The area under the curve and above any range of values is

the proportion of all observations that fall in that range.

-outliers are not described by the curve.

Mode of Density Curve: a peak point of the curve, the location where the curve is highest.

Median of a Density Curve: the equal-area point, the point that divides the area under the curve in half.

Mean of a Density Curve: -the balance point,

-the point at which the curve would balance on a pivot if made of solid material