18 Apr 2012


1 This sample is drawn from the Survey of Consumer Finance (SCF), 1996. This large survey

questionnaire was completed by almost 100,000 adult Canadians and provides information on sources of

income, hours of work and family characteristics during 1995. Further information about the SCF is

available at http://trex.econ.uoguelph.ca/dprescot/courses/scf_info.htm. The subsample of 3,921

individuals used here is a random subset of the full sample. Some restrictions were imposed when the

sample was drawn. In particular, only workers who stated they worked full-time throughout the year

were included.

2 Refer to the appendix of this chapter for details on the properties of the summation operator,

Chapter 1

Univariate Distributions

1 Descriptive Statistics

The most basic application of statistical concepts is to describe data. In many situations large

quantities of data are available to researchers and, typically, the most urgent problem is to find a way of

presenting the data so that the most important features can be highlighted. One useful approach is to

construct a diagram known as a histogram for each variable. Figure 1.1 is a histogram that was

constructed from 3,921 observations on the hourly pay earned by full-time Canadian workers in 1995.1

The data have been sorted into 10 bins. The centre of each bin is recorded on the horizontal axis. For

example, the first bin contains all the wage rates in the sample that lie between $2.00 and $6.00 per hour;

its centre is at $4.00 per hour. The number of observations within a bin is called the frequency and this

type of histogram is known as a frequency distribution because it shows how the frequencies are

distributed amongst the bins. Since each observation falls in only one bin, the sum of the frequencies is

the sample size, 3,921. By rescaling the vertical axis, the heights of the bars in Figure 1.1 can also be

interpreted as the relative frequencies, which are obtained by dividing each frequency by the sample size.

For example, the relative frequency of the first bin is 177/3921 = 0.045. In other words, 4.5% of the

sample falls in the first bin. Clearly, the sum of the relative frequencies (or shares) must be unity.
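The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to build Figure 1.1: the wage values and bin edges are invented for the example, and `frequency_distribution` is a hypothetical helper name. Each bin is assumed to include its lower edge and exclude its upper edge.

```python
from bisect import bisect_right

def frequency_distribution(values, edges):
    """Count how many values fall in each bin [edges[j], edges[j+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Ignore anything outside the range covered by the bins.
        if edges[0] <= v < edges[-1]:
            counts[bisect_right(edges, v) - 1] += 1
    return counts

# Hypothetical hourly wages, invented for illustration (not the SCF sample).
wages = [3.50, 5.25, 7.00, 9.80, 9.95, 11.40, 13.10, 15.75, 21.00, 40.20]
# Hypothetical bin edges: five $4 bins followed by one wide bin.
edges = [2.0, 6.0, 10.0, 14.0, 18.0, 22.0, 98.0]

freq = frequency_distribution(wages, edges)
rel_freq = [f / len(wages) for f in freq]

print("frequencies:", freq)
print("relative frequencies:", rel_freq)
print("sum of frequencies:", sum(freq))
print("sum of relative frequencies:", sum(rel_freq))
```

Because every observation lands in exactly one bin, the frequencies sum to the number of observations and the relative frequencies sum to one (up to floating-point rounding).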

It will be useful if some notation is used to refer to key concepts. The size of the entire sample is

defined to be n (n = 3,921 in the example). The number of bins is m, where m < n and in the wage

example m = 10. The frequency of observations in the jth bin is denoted by fj for j = 1, 2, ..., m. In the

example, f1 = 177. The sum of the frequencies must equal the total number of observations in the

sample2:

Econometrics Text by D M Prescott © Chapter 1, 2

$$f_1 + f_2 + \cdots + f_m = \sum_{j=1}^{m} f_j = n \qquad [1.1]$$

By dividing each frequency by the sample size n, we obtain the relative frequencies rj = fj/n for

j = 1, 2, ..., m. The fact that the sum of the relative frequencies is unity can be confirmed by dividing

all the terms in equation [1.1] by the sample size n:

$$\frac{f_1}{n} + \frac{f_2}{n} + \cdots + \frac{f_m}{n} = \sum_{j=1}^{m} \frac{f_j}{n} = \frac{1}{n} \sum_{j=1}^{m} f_j = \frac{n}{n} = 1$$

$$r_1 + r_2 + \cdots + r_m = \sum_{j=1}^{m} r_j = 1$$

[Figure 1.1: Distribution of Wages. Vertical axis: Relative Frequency; horizontal axis: Wage Rates ($/hour).]

[Figure 1.2: Distribution of Wages. Vertical axis: Density; horizontal axis: Wage Rates ($/hour). Density = Relative Freq / bin width.]

The picture of the data that we get from the histogram depends on how the bin boundaries are

defined and there is no unique way to do this. Over most of the data's range a bin width of $4.00 per hour

seems to present a useful picture of the data. However, at higher wages the data are very thinly spread.

If all bin widths were $4.00 per hour, 24 bins would be needed to cover the entire sample (since the

maximum hourly wage in this sample is $96.15 per hour) and many bins would be empty. To

accommodate the thinness of data and the large maximum wage, a single bin has been created that spans

wages from $38.00 per hour to $97.99. The centre of this bin is $68.00. There are 88 observations in the

last bin, compared to 59 observations in the bin that spans wages between $34.00 and $37.99 per hour.
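The bin arithmetic in this paragraph can be checked with a short sketch. It assumes the bins start at $2.00 and that the wide bin's upper edge is $98.00 (the text gives $97.99 as the largest wage it holds); the variable names are illustrative only.

```python
import math

bin_width = 4.0            # $/hour, the width used over most of the range
lower, upper = 2.0, 98.0   # assumed outer edges of the histogram
max_wage = 96.15           # maximum hourly wage reported in the text

# Number of equal-width $4 bins needed to reach the maximum wage.
n_equal_bins = math.ceil((max_wage - lower) / bin_width)

# Layout described in the text: nine $4 bins from $2 to $38,
# followed by a single wide bin from $38 to $98.
edges = [lower + bin_width * j for j in range(10)] + [upper]
centres = [(a + b) / 2 for a, b in zip(edges, edges[1:])]
widths = [b - a for a, b in zip(edges, edges[1:])]

print("equal-width bins needed:", n_equal_bins)
print("bin centres:", centres)
print("bin widths:", widths)
```

Under these assumptions the layout has 10 bins, the last of which is 60 dollars wide with its centre at $68.00, in line with the figures quoted above.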

Although Figure 1.1 provides a useful representation of the data, it does have some deficiencies.


3 Consider the widely encountered normal distribution, which is reviewed later in this chapter. It

is areas under the bell-shaped normal density function that represent probabilities.

It is natural to judge the relative importance of a bin not so much by the bin's height as by its area. This

creates a problem when the bin widths vary in size. The difficulty is that the areas of wider bins are too

large in relation to the narrower bins: wider bins simply look more important than they should. Figure

1.2 corrects this problem by using the area of a bin to represent its relative frequency. When the area is

used to represent the relative frequency, the height of the bar is referred to as the density, d. To calculate

the density of the data in the jth bin (dj) we use the fact that the area of the bar (the relative frequency rj)

must equal the width of the bin (wj ) times the height or density (dj ):

$$r_j = d_j \, w_j$$

The density is therefore

$$d_j = \frac{r_j}{w_j}$$
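As a rough numerical illustration, the density formula can be applied to the three bins whose frequencies are quoted in the text (177, 59 and 88 observations out of n = 3,921). The width of the last bin is taken to be 60 dollars, an assumption that treats the bin as running from $38 to $98; the function name `density` is illustrative.

```python
n = 3921  # sample size reported in the text

# (frequency, bin width in $/hour) for the three bins whose counts the text quotes.
# The last width treats the bin as running from $38 to $98, an assumption
# that rounds the stated upper limit of $97.99.
bins = {
    "$2-$6": (177, 4.0),
    "$34-$38": (59, 4.0),
    "$38-$98": (88, 60.0),
}

def density(frequency, width, sample_size):
    """Return d_j = r_j / w_j, where r_j = f_j / n."""
    relative_frequency = frequency / sample_size
    return relative_frequency / width

rel_freq = {label: f / n for label, (f, w) in bins.items()}
dens = {label: density(f, w, n) for label, (f, w) in bins.items()}

for label in bins:
    print(f"{label}: r = {rel_freq[label]:.4f}, d = {dens[label]:.5f}")
```

The last bin holds more observations than the $34 to $38 bin, so its relative frequency is larger, yet its density is much smaller because those observations are spread over a far wider interval.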

Figure 1.2 is similar to a probability density function in which areas rather than heights represent

probabilities.3 Notice that the total area under the distribution in Figure 1.2 is exactly unity because the

areas represent relative frequencies and, as noted in equation [2], the relative frequencies add up to unity.
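The unit-area property follows in one line from the definition of density. The derivation below uses only symbols already defined in the text (dj, wj, rj and m):

```latex
\text{Total area} \;=\; \sum_{j=1}^{m} d_j \, w_j
\;=\; \sum_{j=1}^{m} \frac{r_j}{w_j} \, w_j
\;=\; \sum_{j=1}^{m} r_j
\;=\; 1
```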

Several characteristics of the wage data are evident in Figure 1.2. First, not all wages are the

same; rather they are spread out or "dispersed". Second, most wage rates are close to the "central" wage,

but further away the frequencies diminish. Third, wage rates are not distributed symmetrically about the

"centre": the maximum wage is much further away from the "central" wage than the minimum wage.

The distribution is stretched out or "skewed" to the right. Each of the concepts: "centre", "dispersion"

and "skewness" can be quantified using specific formulae. However, there are often several ways to

measure each concept, each having its own way of capturing some essential aspect of "centre",

"dispersion" etc.

1.1 Measures of the "Centre" of a Distribution

The mid-range or half way point between the minimum and maximum values can be used to

define the "centre" of a distribution, but this is clearly unsatisfactory when the data are severely skewed.

In Figure 1.2, 99.5% of the wages are below the mid-range of $49.60, so this number hardly represents the

"central" wage. Often, the bin with the greatest frequency, referred to as the mode, is a useful measure of