• A data set contains information on a number of individuals. Individuals may be people,
animals, or things. For each individual, the data gives values for one or more
variables. A variable describes some characteristic of an individual, such as a
person's height, sex or salary.
• Some variables are categorical and others are quantitative. A categorical variable places
each individual into a category, such as male or female. A quantitative variable
has numerical values that measure some characteristic of each individual, such
as height in centimetres or salary in dollars.
• Exploratory data analysis uses graphs and numerical summaries to describe the
variables in a data set and the relations among them.
• After you understand the background of your data (individuals, variables, units of
measurement), the first thing to do is almost always to plot your data.
• The distribution of a variable describes what values the variable takes and how often it
takes these values. Pie charts and bar graphs display the distribution of a
categorical variable. Bar graphs can also compare any set of quantities
measured in the same units. Histograms and stemplots graph the distribution of a
quantitative variable.
• When examining any graph, look for an overall pattern and for notable deviations from
the pattern.
• Shape, centre, and spread describe the overall pattern of the distribution of a quantitative
variable. Some distributions have simple shapes, such as symmetric or skewed.
Not all distributions have a simple overall shape, especially when there are few
observations.
• Outliers are observations that lie outside the overall pattern of a distribution. Always
look for outliers and try to explain them.
• When observations on a variable are taken over time, make a time plot that graphs
time horizontally and the values of the variable vertically. A time plot can reveal
trends, cycles, or other changes over time.
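The idea of a distribution (what values a variable takes and how often) can be sketched in a few lines of Python. This is only an illustration: the function names `categorical_distribution` and `stemplot` are hypothetical, and the stemplot assumes non-negative whole numbers with the final digit used as the leaf.

```python
from collections import Counter, defaultdict

def categorical_distribution(categories):
    """Return {category: (count, proportion)} for a categorical variable."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: (count, count / total) for category, count in counts.items()}

def stemplot(values):
    """Return the lines of a text stemplot.

    Assumption: values are non-negative integers; the stem is everything
    except the final digit and the leaf is the final digit.
    """
    leaves = defaultdict(list)
    for value in sorted(values):
        leaves[value // 10].append(value % 10)
    lines = []
    # Include stems with no leaves so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        leaf_text = "".join(str(leaf) for leaf in leaves[stem])
        lines.append(f"{stem:>2} | {leaf_text}")
    return lines

# Made-up values, for illustration only.
for line in stemplot([12, 15, 21, 21, 40]):
    print(line)
```

Keeping the empty stem in the output is deliberate: a gap or an isolated value is exactly the kind of deviation from the overall pattern that a plot is meant to expose.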
• A numerical summary of a distribution should report at least its centre and its spread or
variability.
• The mean x-bar and the median M describe the centre of a distribution in different ways.
The mean is the arithmetic average of the observations, and the median is the
midpoint of the values.
• When you use the median to indicate the centre of the distribution, describe its spread
by giving the quartiles. The first quartile Q1 has one-quarter of the observations below it,
and the third quartile Q3 has three-quarters of the observations below it.
• The five-number summary, consisting of the median, the quartiles, and the smallest and
largest individual observations, provides a quick overall description of the
distribution. The median describes the centre, and the quartiles and extremes
show the spread.
• Boxplots based on the five-number summary are useful for comparing several
distributions. The box spans the quartiles and shows the spread of the central
half of the distribution. The median is marked within the box. Lines extend from
the box to the extremes and show the full spread of the data.
• The variance s^2 and especially its square root, the standard deviation s, are common
measures of spread about the mean as centre. The standard deviation s is zero
when there is no spread and gets larger as the spread increases.
• A resistant measure of any aspect of a distribution is relatively unaffected by changes in
the numerical value of a small proportion of the total number of observations, no
matter how large these changes are. The median and quartiles are resistant, but
the mean and the standard deviation are not.
• The mean and standard deviation are good descriptions for symmetric distributions
without outliers. They are most useful for the Normal distributions introduced in
Chapter 3. The five-number summary is a better description for skewed
distributions.
• Numerical summaries do not fully describe the shape of a distribution. Always plot
your data.
• A statistical problem has a real-world setting. You can organize many problems using
the four steps state, plan, solve, and conclude.
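The numerical summaries described above, and the difference between resistant and non-resistant measures, can be checked with a short Python sketch. The data are made up for illustration, the function name `five_number_summary` is hypothetical, and the quartile rule used here (each quartile is the median of the observations on one side of the overall median's position) is an assumption, since the summary does not spell out a rule and software packages differ.

```python
from statistics import mean, median, stdev

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 is the median of the observations below the position of
    the overall median, and Q3 is the median of those above it.
    Requires at least two observations.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # observations left of the median's position
    upper = data[half + n % 2:]   # observations right of the median's position
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical observations, for illustration only.
observations = [5, 7, 8, 9, 10, 12, 13, 15, 18]
# Replace the largest observation with an extreme value.
with_outlier = observations[:-1] + [180]

summary = five_number_summary(observations)
summary_outlier = five_number_summary(with_outlier)

print("five-number summary:", summary)
print("with outlier:       ", summary_outlier)
print("mean:", mean(observations), "->", mean(with_outlier))
print("s:   ", stdev(observations), "->", stdev(with_outlier))
```

In this sketch the median and quartiles are identical for both data sets, while the mean and the standard deviation both grow once the extreme value is introduced.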
• We can sometimes describe the overall pattern of a distribution by a density curve. A
density curve has total area 1 underneath it. An area under a density curve gives
the proportion of observations that fall in a range of values.
• A density curve is an idealized description of the overall pattern of a distribution that
smooths out the irregularities in the actual data. We write the mean of a density
curve as mu and the standard deviation of a density curve as sigma to distinguish them
from the mean (x-bar) and standard deviation (s) of the actual data.
• The mean, the median and the quartiles of a density curve can be located by eye.
The mean is the balance point of the curve. The median divides the area under
the curve in half. The quartiles and the median divide the area under the curve
into quarters. The standard deviation sigma cannot be located by eye on most
density curves.
• The mean and median are equal for symmetric density curves. The mean of a
skewed curve is located farther toward the long tail than is the median.
• The Normal distributions are described by a special family of bell-shaped, symmetric
density curves, called Normal curves. Mu and sigma completely specify a Normal
distribution N(mu,sigma). The mean is the centre of the curve, and sigma is the
distance from mu to the change-of-curvature points on either side.
• To standardize any observation x, subtract the mean of the distribution and then divide
by the standard deviation. The resulting z-score, z = (x - mu)/sigma, says how many
standard deviations x lies from the distribution mean.
• All Normal distributions are the same when measurements are transformed to the
standardized scale. In particular, all Normal distributions satisfy the 68-95-99.7 rule,
which describes what percent of observations lie within one, two, and three
standard deviations of the mean.
• If x has the N(mu, sigma) distribution, then the standardized variable z = (x - mu)/sigma
has the standard Normal distribution N(0, 1) with mean 0 and standard deviation 1.
Table A gives the cumulative proportions of standard Normal observations that are
less than z for many values of z. By standardizing, we can use Table A for any
Normal distribution.
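Standardizing and looking up a cumulative proportion can be sketched with Python's standard library, where `statistics.NormalDist` stands in for Table A. The helper names (`z_score`, `proportion_below`, `proportion_within`) and the N(64.5, 2.5) example are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

# The standard Normal distribution N(0, 1); its cdf plays the role of Table A.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mu) / sigma

def proportion_below(x, mu, sigma):
    """Cumulative proportion of N(mu, sigma) observations less than x."""
    return standard_normal.cdf(z_score(x, mu, sigma))

def proportion_within(k):
    """Proportion of any Normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Hypothetical example: x = 70 in an N(64.5, 2.5) distribution.
z = z_score(70, 64.5, 2.5)
print("z-score:", round(z, 2))
print("proportion below 70:", round(proportion_below(70, 64.5, 2.5), 4))

# Check the 68-95-99.7 rule on the standardized scale.
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {proportion_within(k):.4f}")
```

Because every Normal distribution becomes N(0, 1) after standardizing, `proportion_within` needs no mean or standard deviation as arguments.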
• To study relationships between variables, we must measure the variables on the same
group of individuals.
• If we think that a variable x may explain or even cause changes in another variable y,
we call x an explanatory variable and y a response variable.
• A scatterplot displays the relationship between two quantitative variables measured on
the same individuals. Mark values of one variable on the horizontal axis (x axis)
and values of the other variable on the vertical axis (y axis). Plot each individual's
data as a point on the graph. Always plot the explanatory variable, if there is one,
on the x axis of a scatterplot.
• Plot points with different colors or symbols to see the effect of a categorical variable in
a scatterplot.
• In examining a scatterplot, look for an overall pattern showing the direction, form, and
strength of the relationship, and then for outliers or other deviations from this
pattern.
• Direction: If the relationship has a clear direction, we speak of either positive association
(high values of the two variables tend to occur together) or negative association
(high values of one variable tend to occur with low values of the other variable).
• Form: Linear relationships, where the points show a straight-line pattern, are an
important form of relationship between two variables. Curved relationships and
clusters are other forms to watch for.
• Strength: The strength of a relationship is determined by how close the points in the
scatterplot lie to a simple form such as a line.
• The correlation r measures the direction and strength of the linear association between
two quantitative variables x and y. Although you can calculate a correlation for
any scatterplot, r measures only straight-line relationships.
• Correlation indicates the direction of a linear relationship by its sign: r > 0 for a positive
association and r < 0 for a negative association. Correlation always satisfies
-1 ≤ r ≤ +1 and indicates the strength of a
relationship by how close it is to -1 or +1. Perfect correlation, r = +/- 1, occurs
only when the points on a scatterplot lie exactly on a straight line.
• Correlation ignores the distinction between explanatory and response variables. The
value of r is not affected by changes in the unit of measurement of either
variable. Correlation is not resistant, so outliers can greatly change the value of r.
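The properties of r listed above can be checked numerically. The summary does not state a formula, so the sketch below assumes the usual definition: r is the sum of the products of the standardized x and y values, divided by n - 1. The function name `correlation` and the data are hypothetical.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r between two quantitative variables.

    Assumed formula (not given in the summary): the sum of products of
    standardized values, divided by n - 1.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Made-up measurements on the same five individuals.
heights_cm = [150, 160, 165, 172, 180]
weights_kg = [52, 58, 63, 70, 79]

r = correlation(heights_cm, weights_kg)
r_swapped = correlation(weights_kg, heights_cm)          # explanatory/response swapped
heights_in = [h / 2.54 for h in heights_cm]              # change of units
r_inches = correlation(heights_in, weights_kg)
r_line = correlation(heights_cm, [2 * h + 1 for h in heights_cm])   # exact rising line
r_falling = correlation(heights_cm, [-3 * h for h in heights_cm])   # exact falling line

print("r:", r)
print("swapped:", r_swapped, " in inches:", r_inches)
print("exact lines:", r_line, r_falling)
```

Swapping the two variables or changing the units leaves r unchanged apart from floating-point rounding, and points lying exactly on a line give r of +1 or -1 depending on the slope.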
Chapter 5: Summary
• A