AMS 5 Lecture 4
4/10/2017
(8:00-9:05)
Continuing from Friday: How to Construct Histograms
Example: Histogram for US household incomes from 2015
Table:
Income Level             Frequency      Relative Frequency
$0 - $14,999             14,595,004     11.6%
$15,000 - $24,999        13,210,995     10.5%
$25,000 - $34,999        12,581,900     10.0%
$35,000 - $49,999        15,979,013     12.7%
$50,000 - $74,999        21,011,773     16.7%
$75,000 - $99,999        15,224,099     12.1%
$100,000 - $149,999      17,740,479     14.1%
$150,000 - $199,999       7,800,778      6.2%
$200,000 and over         7,674,959      6.1%
Example: Starting with the table of income distribution we saw earlier, we first draw the
horizontal axis…
[Figure: horizontal axis marked at 0, 50, 100, 150, 200, 250 (income in $1000s)]
…Using a density scale, we draw rectangles over each class interval whose areas equal
the percentages of the families in those intervals.
Note: The height of each rectangle is equal to the percentage of the observations in the
corresponding class interval divided by the length of the class interval (the width of the
rectangle).
Next we divide each relative frequency (percentage) by the width of its class interval
(in $1000s):
11.6/15 = .773
10.5/10 = 1.05
10/10 = 1
12.7/15 = .847
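…
[Figure: histogram of household income on a density scale; horizontal axis marked from 0
to 250, vertical axis marked in steps of .2]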
The vertical scale here is percent per $1000 – i.e., it is the relative frequency (percentage)
divided by the width of the intervals (which in this case are measured in $1000s). It’s
always a good idea to label the axes.
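The height calculation can be sketched in a few lines of Python. This is only an illustration: the percentages come from the table above, but the last class ("$200,000 and over") is open-ended, so closing it at 250 (in $1000s) is an assumption made here just so a rectangle can be drawn. The names edges, percents, widths, heights and areas are made up for this sketch.

```python
# Class boundaries in $1000s and relative frequencies (percent) from the table.
# ASSUMPTION: the open-ended last class is closed at 250 so it has a width.
edges = [0, 15, 25, 35, 50, 75, 100, 150, 200, 250]
percents = [11.6, 10.5, 10.0, 12.7, 16.7, 12.1, 14.1, 6.2, 6.1]

# Width of each class interval, in $1000s.
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

# Density-scale height: percent per $1000.
heights = [p / w for p, w in zip(percents, widths)]

for lo, hi, h in zip(edges, edges[1:], heights):
    print(f"{lo:>3}-{hi:<3}  height = {h:.3f}")

# Height times width gives back the percentage in each class.
areas = [h * w for h, w in zip(heights, widths)]
```

Because each area equals the percentage in its class, the areas of all the rectangles add up to 100%.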
Why density scale instead of percentages/frequency scale?
o On a percentage or frequency scale the size of the bars would be misleading. The bins
for the higher incomes are wider, so their bars would seem much bigger than the bars
for the narrower bins.
o If bins have different widths, use the density scale.
Comment: If all the bins in the distribution have the same width, then the appearance of
the histogram will be the same for all three scales. Only the units (and numbers) on the
vertical scale will change.
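A quick way to check this comment is to compare the relative bar heights on each scale. The sketch below uses made-up counts and an assumed common bin width (they are not data from the lecture); the helper shape is also hypothetical.

```python
# HYPOTHETICAL frequencies for five bins of equal width (made-up numbers).
counts = [4, 10, 14, 8, 4]
bin_width = 2.0  # assumed common width, in arbitrary units

total = sum(counts)
percents = [100 * c / total for c in counts]   # percentage scale
densities = [p / bin_width for p in percents]  # density scale


def shape(heights):
    """Return each bar height as a fraction of the tallest bar."""
    tallest = max(heights)
    return [h / tallest for h in heights]


# With equal widths, all three scales give the same relative heights.
print(shape(counts))
print(shape(percents))
print(shape(densities))
```

The three printed lists agree (up to rounding), which is why only the numbers on the vertical axis change.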
Example: Distribution of coal (by weight) in Christmas stockings of 40 children at
Wool’s orphanage.
In this case the histogram looks the same on the density scale and the frequency scale,
since the intervals all have the same length; the only difference is what is measured on
the y-axis.
Statistics and parameters
Tables, histograms, and other charts are used to summarize large amounts of data. Often,
an even more extreme summary is desirable.
o A number that summarizes population data is called a parameter.
o A number that summarizes sample data is called a statistic.
Average, Mean, Median
Average = mean
Median = middle
Observations
Population parameters are (more or less) constant.
Sample statistics vary with the sample, i.e., their values depend on the particular sample
chosen. A sample statistic can be thought of as a variable.
o Ex: Find the average income for 10,000 households. Depending on the 10,000
you collect data on, the number will change.
Sample statistics are known because we can compute them from the (available) sample
data, while population parameters are often unknown, because data for the entire
population is often unavailable.
One of the most common uses of sample statistics is to estimate population parameters.
o If you compute the average of a sample (assuming the data is collected correctly), it
should be close to the average of the larger population.
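These observations can be illustrated with a small simulation. Everything in it is hypothetical: the "population" is 100,000 made-up income values (in $1000s) drawn from a lognormal distribution, chosen only because it is skewed like incomes tend to be. It is not real data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# HYPOTHETICAL population: 100,000 made-up "incomes" in $1000s, not real data.
population = [random.lognormvariate(4.0, 0.8) for _ in range(100_000)]

# The parameter: one fixed number for the whole population.
population_mean = statistics.fmean(population)

# Three different samples of 10,000 households give three different statistics.
sample_means = [
    statistics.fmean(random.sample(population, 10_000)) for _ in range(3)
]

print(f"population mean: {population_mean:.2f}")
for m in sample_means:
    print(f"sample mean:     {m:.2f}")
```

The population mean stays put, while each sample mean lands near it but at a slightly different value.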
Quartiles
1. First quartile: the number that separates the lowest 25% from the highest 75%.
2. Median: the number that separates the lowest 50% from the highest 50%.
3. Third quartile: the number that separates the lowest 75% from the highest 25%.
You could split the data into more intervals (in general, “quantiles”), but quartiles are
more commonly used.
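As a sketch, Python's standard statistics module can compute the three quartiles. The data are the twelve numbers from the mean/median/mode example in these notes. The "inclusive" method is just one of several conventions for interpolating between values; others can give slightly different first and third quartiles on a small data set.

```python
import statistics

# The twelve numbers from the mean/median/mode example in these notes.
data = [12, 5, 6, 8, 12, 17, 7, 6, 14, 6, 5, 16]

# n=4 asks for the three cut points that split the data into four parts.
# method="inclusive" is one convention among several (an assumption here).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)  # 6.0 7.5 12.5
```

The middle cut point, q2, is the median.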
Measures of central tendency
The most extreme way to summarize a list of numbers is with a single, typical value. The
most common choices are the mean and median.
o The mean (average) of a set of numbers is the sum of all the values divided by
the number of values in the set.
o The median of a set of numbers is the middle number, when the numbers are listed
in increasing (or decreasing) order. The median splits the data into two equally
sized sets: 50% of the data lies below the median and 50% lies above.
o If the number of numbers in the set is even, then the median will be the average of
the middle two values.
The mean and median are different ways of describing the center of the data. Another
statistic that is often used to describe the typical value is the mode, which is the most
frequently occurring value in the data.
Example: Find the mean, median, and mode of the following set of numbers:
{12, 5, 6, 8, 12, 17, 7, 6, 14, 6, 5, 16}
o The mean (average):
(12+5+6+8+12+17+7+6+14+6+5+16)/12 = 114/12 = 9.5
o The median. Arrange the data in ascending order, and find the average of the
middle two values in this case, since there are an even number of values (12).
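The three measures for this set can be checked with Python's standard statistics module. This is a sketch for verification only; the variable names are made up here.

```python
import statistics

data = [12, 5, 6, 8, 12, 17, 7, 6, 14, 6, 5, 16]

mean = statistics.mean(data)      # 114 / 12
ordered = sorted(data)            # [5, 5, 6, 6, 6, 7, 8, 12, 12, 14, 16, 17]
median = statistics.median(data)  # average of the two middle values, 7 and 8
mode = statistics.mode(data)      # 6 occurs three times, more than any other value

print(mean, median, mode)  # 9.5 7.5 6
```

Note that the mean (9.5) is larger than the median (7.5) here, because the few large values pull the mean up.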