1 This sample is drawn from the Survey of Consumer Finance (SCF), 1996. This large survey
questionnaire was completed by almost 100,000 adult Canadians and provides information on sources of
income, hours of work and family characteristics during 1995. Further information about the SCF is
available at http://trex.econ.uoguelph.ca/dprescot/courses/scf_info.htm. The subsample of 3,921
individuals used here is a random subset of the full sample. Some restrictions were imposed when the
sample was drawn. In particular, only workers who stated they worked full-time throughout the year
2 Refer to the appendix of this chapter for details on the properties of the summation operator,
1 Descriptive Statistics
The most basic application of statistical concepts is to describe data. In many situations large
quantities of data are available to researchers, and typically the most urgent problem is to find a way of
presenting the data so that the most important features can be highlighted. One useful approach is to
construct a diagram known as a histogram for each variable. Figure 1.1 is a histogram that was
constructed from 3,921 observations on the hourly pay earned by full-time Canadian workers in 1995 (see footnote 1).
The data have been sorted into 10 bins. The centre of each bin is recorded on the horizontal axis. For
example, the first bin contains all the wage rates in the sample that lie between $2.00 and $6.00 per hour
- its centre is at $4.00 per hour. The number of observations within a bin is called the frequency and this
type of histogram is known as a frequency distribution because it shows how the frequencies are
distributed amongst the bins. Since each observation falls in only one bin, the sum of the frequencies is
the sample size, 3,921. By rescaling the vertical axis, the heights of the bars in Figure 1.1 can also be
interpreted as the relative frequencies, which are obtained by dividing each frequency by the sample size.
For example, the relative frequency of the first bin is 177/3921 = 0.045. In other words, 4.5% of the
sample falls in the first bin. Clearly, the sum of the relative frequencies (or shares) must be unity.
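The construction described above can be sketched in a few lines of Python. This is only an illustration: the wage values are invented (the SCF subsample itself is not reproduced here), the name frequency_distribution is hypothetical, and the bin edges are an assumption based on the boundaries quoted in the text, with each bin taken to include its lower edge and exclude its upper edge.

```python
from bisect import bisect_right

# Assumed bin edges: nine $4.00-wide bins from $2.00 to $38.00,
# plus one wide bin from $38.00 up to (but not including) $98.00.
edges = [2.0 + 4.0 * k for k in range(10)] + [98.0]

# Hypothetical wages ($/hour), invented for illustration only.
wages = [4.50, 5.25, 7.00, 9.80, 11.10, 12.40,
         13.75, 15.00, 18.20, 22.60, 27.30, 41.00]


def frequency_distribution(values, edges):
    """Count how many values fall in each bin [edges[j], edges[j+1])."""
    freqs = [0] * (len(edges) - 1)
    for v in values:
        if not edges[0] <= v < edges[-1]:
            raise ValueError(f"{v} lies outside the bins")
        # bisect_right finds the first edge strictly greater than v,
        # so subtracting 1 gives the index of the bin containing v.
        freqs[bisect_right(edges, v) - 1] += 1
    return freqs


freqs = frequency_distribution(wages, edges)
n = len(wages)
rel_freqs = [f / n for f in freqs]                      # r_j = f_j / n
centres = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]

for c, f, r in zip(centres, freqs, rel_freqs):
    print(f"centre {c:6.2f}  frequency {f:2d}  relative frequency {r:.3f}")

# Each observation falls in exactly one bin, so the frequencies sum to n
# and the relative frequencies sum to one (up to rounding error).
print(sum(freqs) == n, abs(sum(rel_freqs) - 1.0) < 1e-12)
```

With the real data, the same calculation would give a frequency of 177 and a relative frequency of about 0.045 for the first bin.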
It will be useful if some notation is used to refer to key concepts. The size of the entire sample is
defined to be n (n = 3,921 in the example). The number of bins is m, where m < n and in the wage
example m = 10. The frequency of observations in the jth bin is denoted by fj for j = 1, 2, ..., m. In the
example, f1 = 177. The sum of the frequencies must equal the total number of observations in the
sample:
    \sum_{j=1}^{m} f_j = n    [1.1]
Econometrics Text by D M Prescott © Chapter 1, 2
[Figure 1.1: Distribution of Wages. Horizontal axis: Wage Rates ($/hour).]
[Figure 1.2: Distribution of Wages. Horizontal axis: Wage Rates ($/hour). Density = Relative Freq / bin width.]
By dividing each frequency by the sample size n, we obtain the relative frequencies rj = fj /n for
j = 1, 2, ..., m. The fact that the sum of the relative frequencies is unity can be confirmed by dividing
all the terms in equation [1.1] by the sample size n:
    \sum_{j=1}^{m} r_j = \sum_{j=1}^{m} \frac{f_j}{n} = \frac{n}{n} = 1
The picture of the data that we get from the histogram depends on how the bin boundaries are
defined and there is no unique way to do this. Over most of the data's range a bin width of $4.00 per hour
seems to present a useful picture of the data. However, at higher wages the data are very thinly spread.
If all bin widths were $4.00 per hour, 24 bins would be needed to cover the entire sample (since the
maximum hourly wage in this sample is $96.15 per hour) and many bins would be empty. To
accommodate the thinness of data and the large maximum wage, a single bin has been created that spans
wages from $38.00 per hour to $97.99. The centre of this bin is $68.00. There are 88 observations in the
last bin, compared to 59 observations in the bin that spans wages between $34.00 and $37.99 per hour.
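The bin arithmetic in this paragraph can be checked with a short calculation. The sketch below uses only figures quoted in the text. The variable names are illustrative, and the wide bin is treated as running from $38.00 to $98.00, which is an assumption consistent with the stated centre of $68.00.

```python
import math

lowest_edge = 2.00   # lower boundary of the first bin ($/hour)
bin_width = 4.00     # common bin width ($/hour)
max_wage = 96.15     # largest wage in the sample ($/hour)

# Number of equal-width bins needed for the last bin to contain max_wage.
bins_needed = math.ceil((max_wage - lowest_edge) / bin_width)
print(bins_needed)  # 24

# Centre and width of the single wide bin (assumed to span $38.00 to $98.00).
wide_lo, wide_hi = 38.00, 98.00
print((wide_lo + wide_hi) / 2)  # 68.0
print(wide_hi - wide_lo)        # 60.0
```

The wide bin is therefore fifteen times as wide as each of the other nine bins, which matters when the heights of the bars are compared.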
Although Figure 1.1 provides a useful representation of the data, it does have some deficiencies.
3 Consider the widely encountered normal distribution, which is reviewed later in this chapter. It
is areas under the bell-shaped normal density function that represent probabilities.
It is natural to judge the relative importance of a bin not so much by the bin's height as by its area. This
creates a problem when the bin widths vary in size. The difficulty is that the areas of wider bins are too
large in relation to the narrower bins - wider bins simply look more important than they should. Figure
1.2 corrects this problem by using the area of a bin to represent its relative frequency. When the area is
used to represent the relative frequency, the height of the bar is referred to as the density, d. To calculate
the density of the data in the jth bin (dj) we use the fact that the area of the bar (the relative frequency rj)
must equal the width of the bin (wj ) times the height or density (dj ):
    r_j = d_j \cdot w_j
The density is therefore
    d_j = r_j / w_j
Figure 1.2 is similar to a probability density function in which areas rather than heights represent
probabilities (see footnote 3). Notice that the total area under the distribution in Figure 1.2 is exactly unity because the
areas represent relative frequencies and, as noted earlier, the relative frequencies add up to unity.
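As a worked illustration of the density formula, the sketch below applies d_j = r_j / w_j to the three bins whose frequencies are quoted in the text (177, 59 and 88 observations). The bin labels and the width of 60 for the last bin are assumptions based on the boundaries given earlier. The remaining seven frequencies are not quoted, so the total area cannot be verified here.

```python
n = 3921  # sample size, from the text

# (label, frequency f_j, assumed bin width w_j) for the bins quoted in the text.
bins = [
    ("$2-$6", 177, 4.0),
    ("$34-$38", 59, 4.0),
    ("$38-$98", 88, 60.0),
]

densities = {}
for label, f, w in bins:
    r = f / n    # relative frequency r_j = f_j / n
    d = r / w    # density d_j = r_j / w_j
    densities[label] = d
    # The area of the bar, d * w, recovers the relative frequency.
    print(f"{label:>8}: r = {r:.4f}  d = {d:.6f}  area = {d * w:.4f}")

# The wide bin holds more observations than its neighbour (88 against 59),
# yet its density is roughly one tenth as large.
print(densities["$34-$38"] / densities["$38-$98"])
```

This is the correction that Figure 1.2 makes: in a frequency histogram the last bar would stand taller than the one before it, whereas on the density scale it is much lower because its 88 observations are spread over a far wider range of wages.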
Several characteristics of the wage data are evident in Figure 1.2. First, not all wages are the
same; rather they are spread out or "dispersed". Second, most wage rates are close to the "central" wage,
but further away the frequencies diminish. Third, wage rates are not distributed symmetrically about the
"centre" - the maximum wage is much further away from the "central" wage than the minimum wage.
The distribution is stretched out or "skewed" to the right. Each of the concepts: "centre", "dispersion"
and "skewness" can be quantified using specific formulae. However, there are often several ways to
measure each concept, each having its own way of capturing some essential aspect of “centre”,
1.1 Measures of the "Centre" of a Distribution
The mid-range or half way point between the minimum and maximum values can be used to
define the "centre" of a distribution, but this is clearly unsatisfactory when the data are severely skewed.
In Figure 1.2, 99.5% of the wages are below the mid-range of $49.6, so this number hardly represents the
"central" wage. Often, the bin with the greatest frequency, referred to as the mode, is a useful measure of