COMMERCE 291 – Lecture Notes – Jonathan Berkowitz (copyright, 2013)
Summary of Lectures 4 and 5
Chapter 5: Displaying and Describing Quantitative Data
With quantitative data, there are many values and little repetition (e.g. age of fans at a
Canucks game). Here it is better to group the data into classes or intervals and find the
frequency of each class.
Example: Sales records – the number of product units sold by each salesperson
Class    Frequency    Relative Frequency    Percentage
0-10         3              .06                  6%
11-20       10              .20                 20%
21-30       17              .34                 34%
31-40       15              .30                 30%
41-50        2              .04                  4%
51-60        3              .06                  6%
Total       50             1.00                100%
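The relative frequency and percentage columns follow directly from the frequency column. A minimal Python sketch of that arithmetic, using the class labels and frequencies from the table (the variable names are illustrative only):

```python
# Class labels and frequencies copied from the sales table.
classes = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60"]
frequencies = [3, 10, 17, 15, 2, 3]

total = sum(frequencies)  # number of salespeople

# Relative frequency = class frequency / total.
# Percentage = 100 * relative frequency.
relative = [f / total for f in frequencies]
percent = [100 * r for r in relative]

for label, f, r, p in zip(classes, frequencies, relative, percent):
    print(f"{label:>6} {f:>4} {r:>6.2f} {p:>5.0f}%")
print(f"{'Total':>6} {total:>4} {sum(relative):>6.2f} {sum(percent):>5.0f}%")
```

The relative frequencies always sum to 1 and the percentages to 100%, which is a quick check on any hand-built table.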
A useful type of graph of a frequency table for a quantitative variable is a “histogram”.
Here is a histogram (created in Excel) for the above frequency table.
Note that in a histogram the bars are touching; in fact “histogram” means “picture of
[Figure: histogram of units sold per salesperson. Horizontal axis: # of units, with classes 0-10, 11-20, 21-30, 31-40, 41-50, 51-60.]
Note: The labeling says 0-10, 11-20, etc., so that there is no ambiguity on the borders.
The 11-20 category is really >10 to 20.
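As a rough stand-in for the Excel chart, the same frequency table can be drawn as a text histogram. This is only a sketch: the helper name text_histogram is hypothetical, and each star represents one salesperson.

```python
classes = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60"]
frequencies = [3, 10, 17, 15, 2, 3]

def text_histogram(labels, counts):
    """Return one line per class: the label, then one '*' per observation."""
    return [f"{label:>5} | {'*' * count}" for label, count in zip(labels, counts)]

for line in text_histogram(classes, frequencies):
    print(line)
```

The rows sit directly against one another with no gaps, mirroring the touching bars of a histogram.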
Remember: The difference between a bar chart and a histogram: bar charts are for
categorical data where the categories are not contiguous (i.e. touching); histograms are
for quantitative data where the underlying continuum is divided into contiguous class
intervals.
An alternative to the histogram is the stem-and-leaf plot (also called, simply, a stemplot).
Example: The following data are final exam marks (ranked in ascending order) from a
rather difficult math exam.
33 48 54 60 64 68 71 74 76 82
34 49 54 60 64 70 71 74 77 83
38 50 55 61 65 70 71 74 78 85
40 50 56 62 65 70 71 75 80 86
42 51 58 64 65 71 72 76 81 90
42 52 58 64 67 71 74 76 81 91
Use the “10s-digit” as the stems, and the “1s-digits” as the leaves.
3 | 3 4 8
4 | 0 2 2 8 9
5 | 0 0 1 2 4 4 5 6 8 8
6 | 0 0 1 2 4 4 4 4 5 5 5 7 8
7 | 0 0 0 1 1 1 1 1 1 2 4 4 4 4 5 6 6 6 7 8
8 | 0 1 1 2 3 5 6
9 | 0 1
This looks like a histogram, on its side, with class intervals 30-39, 40-49, 50-59, etc.
The advantage of the stemplot is that it retains the original data values. The
disadvantage is that it is limited in what the class intervals can be, and is limited to the
two leading significant digits.
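A stemplot is easy to build by machine. The sketch below assumes two-digit whole-number marks and uses the 60 exam marks listed above; the function name stemplot is illustrative, not from the text.

```python
from collections import defaultdict

# The 60 final exam marks, row by row as listed in the notes.
marks = [
    33, 48, 54, 60, 64, 68, 71, 74, 76, 82,
    34, 49, 54, 60, 64, 70, 71, 74, 77, 83,
    38, 50, 55, 61, 65, 70, 71, 74, 78, 85,
    40, 50, 56, 62, 65, 70, 71, 75, 80, 86,
    42, 51, 58, 64, 65, 71, 72, 76, 81, 90,
    42, 52, 58, 64, 67, 71, 74, 76, 81, 91,
]

def stemplot(values):
    """Use the 10s-digit as the stem and the sorted 1s-digits as the leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 74 -> stem 7, leaf 4
        leaves[stem].append(leaf)
    # Include every stem from smallest to largest so gaps stay visible.
    stems = range(min(leaves), max(leaves) + 1)
    return [f"{s} | {' '.join(str(d) for d in leaves[s])}" for s in stems]

for row in stemplot(marks):
    print(row)
```

Because every leaf is an actual digit from the data, the original values can be read back off the plot, which is the advantage noted above.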
Quantitative Data Condition: data values have known units
Categorical Data Condition: data are counts of individuals in categories
What to look for in a histogram:
• Is there a single peak/mode (unimodal) or multiple peaks/modes (bimodal, multimodal)?
• Is the shape symmetric or skewed (asymmetric)?
• Are there outliers (unusual observations a long way from the main body of data)?
Numerical Summaries of Centre and Spread:
Note: Categorical data can really only be summarized by counts and relative
frequencies, so the numerical summaries discussed here apply only to quantitative data.
Measures of centre and spread cannot easily detect gaps, multiple peaks or extreme
values (i.e. outliers) in a distribution. That is why analysis should begin with a graph!
Measures of Centre (also called Location):
• Mean (denoted by $\bar{x}$): sum of the data values divided by the number of data values
• Median: the middle value, after arranging the data in ascending order
• Mode: the highest point in the histogram (by far the least useful of the three
summaries and not often used as a summary measure)
The mean can be thought of as the “centre of gravity” of the distribution. That is, if the
histogram (since we have measurement data!) were sitting on a seesaw, the mean
would be the point at which the seesaw was in balance.
The median can be thought of as the middle of the data; 50% are on each side of the
median. Note: If there is an even number of data values, the median is the average of
the two middle values.
For symmetric distributions, the mean equals the median.
For asymmetric or skewed distributions, the mean and the median are not equal.
If the distribution has a long right-hand tail (i.e. skewed right), the mean is greater than
the median.
If the distribution has a long left-hand tail (i.e. skewed left), the mean is less than the
median.
The median is a more “robust” measure than the mean; that is, it is not highly affected by
the presence of outliers.
Example: For data values: 1,2,3,4,5,6,7,8,9, both the mean and median are 5;
For data values: 1,2,3,4,5,6,7,8,99, the mean is 15 but the median is still 5.
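The robustness of the median can be checked with Python's standard statistics module. The sketch below reuses the two data sets from the example; the variable names are illustrative.

```python
from statistics import mean, median

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 99]  # 9 replaced by the outlier 99

# The outlier drags the mean upward but leaves the median where it was.
print("mean:  ", mean(clean), "->", mean(with_outlier))
print("median:", median(clean), "->", median(with_outlier))

# With an even number of values, the median is the average of the two middle values.
print("median of 1,2,3,4:", median([1, 2, 3, 4]))
```

Changing one value out of nine triples the mean, while the median does not move at all.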
Example (adapted from Martin Gardner): The management team of ABC Company
consisted of Mr. X, his brother and six relatives. The work force consisted of five
forepersons and ten workers. When the company was ready to expand, Mr. X
interviewed applicants and told them that the average weekly salary at ABC was $600. A
new worker, Y, was hired and, after a few days, complained that Mr. X had misled him. Y
had checked with the other workers and discovered that none was getting more than
$200 a week, so how could the average be $600? Mr. X explained the salary structure:
Each week Mr. X got $4800, his brother got $2000, each of the six relatives made $500,
the forepersons each made $400 and the ten workers each got $200. That made a
weekly total payroll of $13,800 for 23 people, leading to an average of $600. What about
the median and mode? The median is the middle value, in this case, the 12th value,
which is $400. The mode is the most frequently occurring value, which is $200.
So here we have three measures of location: $200, $400 and $600. Which is the
preferred one? Since the distribution is so skewed and also has a large extreme value,
the median is preferred. Management would, of course, want to report the mean, but the
Union would probably prefer to report the mode during contract negotiations!
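The three measures for the ABC Company payroll can be verified with a short Python sketch. The salary list is built from the figures in the example; the variable name salaries is illustrative.

```python
from statistics import mean, median, mode

# Weekly salaries: Mr. X, his brother, six relatives, five forepersons, ten workers.
salaries = [4800] + [2000] + [500] * 6 + [400] * 5 + [200] * 10

print("people: ", len(salaries))   # size of the payroll
print("payroll:", sum(salaries))   # total weekly payroll
print("mean:   ", mean(salaries))
print("median: ", median(salaries))  # the 12th of the 23 sorted values
print("mode:   ", mode(salaries))    # the most frequently occurring salary
```

The two salaries at the top of the list account for almost half of the payroll, which is why the mean sits so far above what a typical employee earns.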
* * * * *
There are actually three kinds of means – arithmetic, geometric, and harmonic.
Arithmetic mean (the familiar one, $\bar{x}$) – for non-related values
(Note that “arithmetic” is pronounced with the emphasis on the “met” because it is an
adjective here – arith-MET-ic)
Harmonic mean – for rates which are NOT dependent on each other
$$HM = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$$
Example: Drive at 40 kph from point A to point B and 60 kph from point B to point A.
What is the average speed for the round trip?
$$HM = \frac{2}{\frac{1}{40} + \frac{1}{60}} = 48 \text{ kph}$$
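To see why the harmonic mean is the right average here, compare it with total distance divided by total time. In the sketch below, the one-way distance of 120 km is a hypothetical value chosen only to make the arithmetic easy; any distance gives the same answer.

```python
from statistics import harmonic_mean

def harmonic_mean_by_formula(values):
    """n divided by the sum of the reciprocals of the values."""
    return len(values) / sum(1 / x for x in values)

speeds = [40, 60]  # kph from A to B, then from B to A

# Direct check: average speed = total distance / total time.
distance = 120  # hypothetical one-way distance in km (assumption)
total_time = distance / 40 + distance / 60  # hours for the round trip
average_speed = 2 * distance / total_time

print(harmonic_mean_by_formula(speeds))
print(harmonic_mean(speeds))  # the standard library version
print(average_speed)
```

The arithmetic mean of the two speeds, 50 kph, overstates the answer because more time is spent travelling on the slower leg.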
Geometric mean – for rates where each measurement depends on the previous one.
Denote the rates by $R_1, R_2, \ldots, R_n$ and the geometric mean by $R_g$. Then
$$(1 + R_g)^n = (1 + R_1)(1 + R_2) \cdots (1 + R_n)$$
Then
$$R_g = \left[(1 + R_1)(1 + R_2) \cdots (1 + R_n)\right]^{1/n} - 1$$
Example: Suppose an investment experiences 100% growth in the first year and then a
50% loss in the second year. What is the average increase? (The arithmetic mean
would be [100%+(-50%)]/2 = 25% but that’s nonsense since you would actually be right
back where you started from!)
$$R_g = \left[(1 + 1.0)(1 + (-0.5))\right]^{1/2} - 1 = 1 - 1 = 0$$
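The investment example can be checked with a short Python sketch. Rates are written as decimals (1.0 for 100% growth, -0.5 for a 50% loss), and the function name geometric_mean_rate is illustrative.

```python
import math

def geometric_mean_rate(rates):
    """Average per-period rate R_g with (1 + R_g)^n = product of (1 + R_i)."""
    n = len(rates)
    growth = math.prod(1 + r for r in rates)  # overall growth factor
    return growth ** (1 / n) - 1

rates = [1.0, -0.5]  # +100% in year 1, -50% in year 2

arithmetic = sum(rates) / len(rates)
geometric = geometric_mean_rate(rates)

print("arithmetic mean rate:", arithmetic)
print("geometric mean rate: ", geometric)
```

The overall growth factor is 2 × 0.5 = 1, so the investment ends where it started and the geometric mean rate is zero, while the arithmetic mean misleadingly reports 25%.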
End of extra material
* * * * *
Measures of Spread (also called Scale or Dispersion):
Range = Maximum value – Minimum value
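Variance ($s^2$) and Standard Deviation ($s$ or SD)
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
A computational formula is:
$$s^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1}$$
The standard deviation $s$ is the square root of the variance.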
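The definitional and computational formulas for the sample variance give the same result. A minimal Python sketch, using the values 1 to 9 as illustrative data and checking both against the standard library (the function names are hypothetical):

```python
import math
from statistics import variance

def variance_definition(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def variance_computational(xs):
    """(Sum of squares minus (sum)^2 / n), divided by n - 1."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

s2 = variance_definition(data)
print("variance (definition):   ", s2)
print("variance (computational):", variance_computational(data))
print("variance (statistics):   ", variance(data))
print("standard deviation:      ", math.sqrt(s2))
```

For these values the squared deviations sum to 60, so the variance is 60/8 = 7.5 and the standard deviation is about 2.74.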
The standard deviation (sometimes referred to as the SD) can best be interpreted as
“the typical distance from a data value to the mean”.
Neither the mean nor the standard deviation is resistant to outliers. They are also a poor
choice of summary if the distribution is highly skewed.
Percentiles: The pth percentile is a value below which p% of the data values fall. Some
percentiles have special names:
75th percentile = 3rd or upper quartile = Q3
25th percentile = 1st or lower quartile = Q1
50th percentile = Median
The Interquartile Range (IQR) = Q3 – Q1.
Note: If there are an odd number of observations, the median is indeed the middle one;
but if there are an even number of observations the convention is to take the average of
the two middle ones as the median. This can be extended to computing the quartiles; if a
quartile lies between two observations, take the average.
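One way to carry out this calculation is sketched below in Python. It assumes a particular convention: split the sorted data at the median (leaving the middle value out of both halves when the number of observations is odd) and take the median of each half. Other conventions exist and can give slightly different quartiles; the function name quartiles is illustrative.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-each-half convention (an assumption)."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]  # values below the median position
    # For odd n, skip the middle value itself; for even n, take the top half.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return median(lower), median(xs), median(upper)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, med, q3 = quartiles(data)
iqr = q3 - q1

print("Q1:", q1, "Median:", med, "Q3:", q3, "IQR:", iqr)
```

Here each half has four values, so each quartile falls between two observations and is taken as their average.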
The text explains that the first quartile is the “median” of the values to the left of the full
data set median and the third quartile is the “median” of the values to the right of the full
data set median. Beware, however, that different software packages (including Excel)
have different conventions for