University of Alberta
Statistics, STAT151
Susan Kamp
Fall

Ch 3: Displaying and Describing Categorical Data
It is very unlikely that we can draw conclusions about a variable
simply by looking at raw data. Thus, it is beneficial to
summarize the raw data into a more manageable form before
drawing any conclusions. In this chapter, we will use graphs for
initial data exploration.
NOTE: The proper choice of graph depends on the nature of the
variable. Different types of graphs for different types of data!
Graphical Displays
Categorical Variables    Numerical Variables
- Bar Chart              - Dot Plots
- Pie Chart              - Stem Plots
                         - Histograms
                         - Time Plots
                         - Box Plots
                         - Scatterplots
Graphs for Categorical Data
After categorical data has been sampled it should be summarized to
provide the following information:
1. What values have been observed?
Ex: Gender: Female or Male
Ex: Car Color: red, white, blue, black, other
Ex: Smoker: Yes or No
2. How often did every value occur?
The distribution of a categorical variable is given in the form of a
table providing the following information:
- each possible category (and)
- frequency (or number) of individuals who fall into each
category or
- relative frequency (or percentage) of individuals who fall into
each category
- The relative frequency for a particular category is the
proportion (often expressed as a percentage) of the observations
in the data set that fall into that category. It is calculated as
relative frequency = frequency / number of observations
Steps to Construct a Frequency or Relative Frequency
Distribution Table
1) Define the levels of the variable (category)
2) Count the number of observations in the data set corresponding
to each class (frequencies). If preferred, calculate the relative
frequency of each category.
3) Summarize the results in a table (known as the frequency or
relative frequency distribution table).
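The three steps above can be sketched in Python. This is a minimal illustration, not part of the original notes: the function name frequency_table and the sample responses are hypothetical and made up purely to show the mechanics.

```python
from collections import Counter

def frequency_table(observations):
    """Return (category, frequency, relative frequency) rows plus a Total row."""
    counts = Counter(observations)   # step 2: count observations per category
    n = len(observations)            # sample size
    rows = [(category, freq, freq / n) for category, freq in counts.items()]
    rows.append(("Total", n, 1.0))   # step 3: summarize, including totals
    return rows

# Hypothetical responses, made up for illustration only
sample = ["Chocolate", "Vanilla", "Chocolate", "Other",
          "Strawberry", "Chocolate", "Vanilla", "Chocolate"]

for category, freq, rel in frequency_table(sample):
    print(f"{category:<12}{freq:>5}{rel:>8.1%}")
```

The relative frequencies always sum to 1 (100%), which is a quick check on the table.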
Example:
The distribution of a categorical variable (Favourite Ice-Cream
Flavour) for this Stat 151 Class:
sample size = number of observations = n =
Category      Frequency   Relative Frequency
Chocolate
Strawberry
Vanilla
Other
Total
Once the data is summarized in a frequency distribution table, the
data can be displayed in a bar chart or pie chart. The bar chart will
effectively show the frequencies or percentages in the different
categories, whereas the pie chart will show the relationship between
the parts and the whole.
Bar chart
- a graph of the distribution of a categorical variable
Steps to make a bar chart:
1) Put every category (or levels of the categorical variable)
evenly on the x-axis (can be marked with a tick).
2) Each category is represented by a bar of equal width, and
the height of the bar is proportional to the corresponding
frequency (relative frequency) of that category.
3) Label the y-axis (frequency or relative frequency).
Example:
The frequency distribution of favourite ice-cream flavour is:
Category      Frequency   Relative Frequency
Chocolate
Strawberry
Vanilla
Other
Construct a bar chart.
Pie charts
- provide an alternative kind of graph for categorical data.
- a circle is used to represent the sample.
- The size of the slice representing a particular category is
proportional to the corresponding frequency (or relative
frequency).
NOTE: Use a pie chart only when you want to emphasize each
category’s relation to the whole. It is useful when there are a
relatively small number of classes involved.
Steps to create a pie chart:
1) Draw a circle
2) Calculate the slice size (angle)
slice size = category relative frequency × 360°
(the relative frequency is the fraction of the circle for the category)
3) Use a protractor to mark the angles
NOTE: In a pie chart, the proportions shown by each slice of the pie
must add up to 100% and each individual must fall into only 1
category.
NOTE: Be sure to use enough individuals.
Example:
Using the data from previous example:
Category      Frequency   Relative Frequency (%)   Angle
Chocolate
Strawberry
Vanilla
Other
Example 5 (please read on your own):
On the M&M's webpage the following information on the
distribution of colors in peanut M&M's is provided:
Color brown yellow red blue orange green
Percent 12% 15% 12% 23% 23% 15%
In order to check if this distribution is a "true" description of what is
in a bag, someone bought a bag with 200 peanut M&M's and wants
to describe the colors of the contents.
Color is a categorical variable, so a relative frequency table shall be
obtained:
color Count Rel. Freq
brown 50 25%
yellow 28 14%
red 14 7%
blue 52 26%
orange 36 18%
green 20 10%
Total 200 100%
A bar chart would look like this:
[Figure: Bar Chart of Peanut M&M's Distribution of Colors. x-axis: Color (brown, yellow, red, blue, orange, green); y-axis: Count (or Frequency)]
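A rough text version of this bar chart can be sketched in Python using the counts from the table above. The function name text_bar_chart and the scale of one "#" per 2 candies are arbitrary choices for illustration, and the bars run horizontally rather than vertically.

```python
# Counts from the bag of 200 peanut M&M's
counts = {"brown": 50, "yellow": 28, "red": 14,
          "blue": 52, "orange": 36, "green": 20}

def text_bar_chart(counts, unit=2):
    """One bar per category; bar length is proportional to the count."""
    lines = []
    for category, count in counts.items():
        bar = "#" * round(count / unit)   # same scale for every bar
        lines.append(f"{category:<8}{bar} {count}")
    return lines

print("\n".join(text_bar_chart(counts)))
```

Every bar uses the same scale, so the bar lengths can be compared directly, just as bar heights are in the chart.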
For the pie chart, the angles of the slices have to be determined:
color    Count   Rel. Freq   Angle
brown    50      25%         90.0°
yellow   28      14%         50.4°
red      14      7%          25.2°
blue     52      26%         93.6°
orange   36      18%         64.8°
green    20      10%         36.0°
Total    200     100%        360.0°
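The angles in the table can be reproduced with a few lines of Python, applying slice size = relative frequency × 360 to each count. The function name slice_angles is a hypothetical helper for this sketch.

```python
# Counts from the bag of 200 peanut M&M's
counts = {"brown": 50, "yellow": 28, "red": 14,
          "blue": 52, "orange": 36, "green": 20}

def slice_angles(counts):
    """Angle of each pie slice in degrees: relative frequency times 360."""
    n = sum(counts.values())
    return {color: count / n * 360 for color, count in counts.items()}

angles = slice_angles(counts)
for color, angle in angles.items():
    print(f"{color:<8}{angle:>6.1f} degrees")
print(f"{'Total':<8}{sum(angles.values()):>6.1f} degrees")
```

The angles must sum to 360 degrees, just as the relative frequencies sum to 100%.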
This results in the following pie chart:
[Figure: Pie Chart of Peanut M&M's Distribution of Colors. Slices: brown 25%, yellow 14%, red 7%, blue 26%, orange 18%, green 10%]
Contingency Table
- allows us to look at 2 categorical variables together
- shows how individuals are distributed along each variable,
contingent on the value of the other variable.
- The margins of the table (both on the right and on the bottom)
give totals and the frequency distributions for each of the
variables.
- Each frequency distribution is called a marginal distribution
of its respective variable.
- Each cell of the table gives the count for a combination of
values of the two variables.
- A conditional distribution shows the distribution of one
variable for just the individuals who satisfy some condition on
another variable.
- The variables are considered independent when the
distribution of one variable in a contingency table is the same
for all categories of the other variable.
Heart Disease Example:
A study was set up to investigate whether smoking is a risk factor
for heart disease. The results are given in the following table:
                     Smoker
                     Yes    No     Total
Heart disease  Yes   23     15     38
               No    69     259    328
Total                92     274    366
The following is the conditional distribution of Heart Disease,
conditional on smoking:
                     Smoker: Yes   % of column
Heart disease  Yes   23            25%
               No    69            75%
Total                92            100%
The following is the conditional distribution of Heart Disease,
conditional on nonsmoking:
                     Smoker: No    % of column
Heart disease  Yes   15            5.5%
               No    259           94.5%
Total                274           100%
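The column percentages above come from dividing each cell by its column total. A minimal Python sketch follows, using the counts from the heart disease table. The nested-dictionary layout and the function name conditional_on_column are assumptions made for this illustration.

```python
# Counts from the heart disease study: table[heart_disease][smoker]
table = {"Yes": {"Yes": 23, "No": 15},
         "No":  {"Yes": 69, "No": 259}}

def conditional_on_column(table, column):
    """Distribution of the row variable among individuals in one column."""
    column_total = sum(row[column] for row in table.values())
    return {label: row[column] / column_total for label, row in table.items()}

smokers = conditional_on_column(table, "Yes")
nonsmokers = conditional_on_column(table, "No")

for label in table:
    print(f"Heart disease {label:<4}"
          f"smokers {smokers[label]:>6.1%}   nonsmokers {nonsmokers[label]:>6.1%}")
```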
NOTE: Do not confuse similar-sounding percentages:
- The percentage of people who have heart disease and smoked
- The percentage of smokers who have heart disease
- The percentage of people with heart disease who smoked
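The three percentages share the same numerator (the 23 smokers with heart disease) but use different denominators. A short Python sketch with the counts from the heart disease table makes the difference concrete; the variable names are chosen only for this illustration.

```python
# Counts from the heart disease study table
both = 23        # smoked AND have heart disease
smokers = 92     # column total for Smoker = Yes
diseased = 38    # row total for Heart disease = Yes
everyone = 366   # grand total

pct_disease_and_smoked = both / everyone    # out of all people
pct_smokers_with_disease = both / smokers   # out of smokers only
pct_diseased_who_smoked = both / diseased   # out of heart disease cases only

print(f"Have heart disease and smoked:     {pct_disease_and_smoked:.1%}")
print(f"Smokers who have heart disease:    {pct_smokers_with_disease:.1%}")
print(f"Heart disease patients who smoked: {pct_diseased_who_smoked:.1%}")
```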
What do the conditional distributions tell us?
- The proportion with heart disease differs between those who
smoked and those who did not.
- This is better shown with pie charts of the two distributions:
- This leads us to believe that heart disease and smoking status
are associated, i.e., they are not independent.
Independent variables
- Variables are said to be independent if the conditional
distribution of one variable is the same for each category of
another.
- In other words, there is no association between these variables.
Example of independent variables:
                     Smoker
                     Yes    No     Total
Heart disease  Yes   30     15     45
               No    90     45     135
Total                120    60     180
The following is the conditional distribution of Heart Disease,
conditional on the factor smoker:
                     Smoker
                     Yes    No     Total
Heart disease  Yes   25%    25%    25%
               No    75%    75%    75%
Total                100%   100%   100%
- NOTE: We see that the distribution of having heart disease for
the smokers is not different from that of the nonsmokers, so the
two variables are independent.
- It is rare for 2 variables to be entirely independent.
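The definition of independence can be checked mechanically by comparing the conditional distributions column by column. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and exact equality is a descriptive check on the table itself, not a statistical test. With real sample data the distributions are rarely exactly equal.

```python
from fractions import Fraction

def column_distributions(table):
    """Conditional distribution of the row variable within each column."""
    columns = list(zip(*table))   # transpose: rows -> columns
    return [[Fraction(cell, sum(col)) for cell in col] for col in columns]

def looks_independent(table):
    """True when every column has exactly the same conditional distribution."""
    dists = column_distributions(table)
    return all(dist == dists[0] for dist in dists)

# Rows: heart disease Yes / No; columns: smoker Yes / No
independent_example = [[30, 15], [90, 45]]
study_example = [[23, 15], [69, 259]]

print(looks_independent(independent_example))
print(looks_independent(study_example))
```

Fractions are used instead of floating-point numbers so that the equality comparison is exact.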
Example: Has the percentage of young girls drinking milk changed
over time? The following table is consistent with the results from
“Beverage Choices of Young Females: Changes and Impact on
Nutrient Intakes” (Shanthy A. Bowman, Journal of the American
Dietetic Association, 102(9), pp. 1234-1239):
1. Find the following:
a. What percent of the young girls reported that they drink milk?
b. What percent of the young girls were in the 1989-1991 survey?
c. What percent of the young girls who reported that they drink milk
were in the 1989-1991 survey?
d. What percent of the young girls in 1989-1991 reported that they
drink milk?
2. What is the marginal distribution of milk consumption?
3. Do you think that milk consumption by young girls is independent
of the nationwide survey year? Use statistics to justify your
reasoning.
4. Consider the following pie charts for a subset of the data above:
Do the pie charts above indicate that milk consumption by young
girls is independent of the nationwide survey year? Explain.
Segmented Bar Chart
- Displays the same info as a pie chart, but in the form of bars
instead of circles.
- The following is a segmented bar chart for heart disease by
smoking status:
Chapter 4: Displaying and Summarizing Quantitative Data
Graphs for Numerical Data
Numerical variables often take many values. We need to introduce
other types of graphs to display the data for a quantitative variable in
a fashion so that the distribution of the data becomes apparent.
Describing a distribution of a plot:
1) shapes:
a) nature of distribution (unimodal, bimodal, multimodal)
- One characterization of general shape relates to the number of
humps, or modes.
o unimodal – a single peak
o bimodal – two peaks; can occur when the data set consists
of observations on two quite different kinds of individuals
or objects
o multimodal – more than 2 peaks; rarely occurs
o uniform – no modes
b) symmetrical or skewed to the right/left.
Symmetry: if you can draw a vertical line so that the part to
the left is a mirror image of the part to the right, then it is
symmetric.
Nonsymmetric graphs are skewed.
o If the upper tail of the histogram stretches out farther than
the lower tail, then the histogram is positively skewed, or
skewed to the right.
o If the lower tail is longer than the upper tail, the histogram is
negatively skewed, or skewed to the left.
c) unusual values or deviations from the overall pattern.
- An important kind of deviation is an outlier, an individual
value that falls outside the overall pattern.
2) center – the value that splits the data in half or a typical range of
values at the center of the graph
- mean, median, mode
3) spread – the range of values; concentration; are most of the
values close to or far from the center?
- Range, standard deviation, IQR
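The measures of center and spread listed above can be computed with Python's standard statistics module. As a sketch, the data used here are the 17 walking shoe prices from the dot plot example later in these notes. Note that textbooks and software use slightly different conventions for quartiles, so the IQR from statistics.quantiles (default "exclusive" method) may differ from a hand calculation.

```python
import statistics

# Prices of 17 walking shoes (in $), from the dot plot example in these notes
prices = [90, 70, 70, 70, 75, 70, 65, 68, 60, 74, 70, 95, 75, 68, 85, 40, 65]

center = {
    "mean": statistics.mean(prices),
    "median": statistics.median(prices),
    "mode": statistics.mode(prices),
}

q1, _, q3 = statistics.quantiles(prices, n=4)   # quartiles, default method
spread = {
    "range": max(prices) - min(prices),
    "standard deviation": statistics.stdev(prices),   # sample standard deviation
    "IQR": q3 - q1,
}

for name, value in {**center, **spread}.items():
    print(f"{name:<20}{value:>8.2f}")
```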
Dotplots
A dot plot is a plot that portrays the individual observations.
To construct a dot plot:
- draw a horizontal (or vertical) line
- label the line with the name of the variable, and mark regular
values of the variable on it
- for each observation, place a dot above (or next to) its value on the
number line
NOTE1: The number of dots above a value on the number line
represents the frequency of occurrence of that value.
NOTE2: The dot plots work well for small sets of data (n ≤ 50).
Example 6a:
Construct a dotplot for the prices of 17 walking shoes (in $): 90 70
70 70 75 70 65 68 60 74 70 95 75 68 85 40 65
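A text dot plot for these prices can be sketched in Python. This version draws the axis vertically, with the dots placed next to each value, and for brevity it lists only the values that were observed rather than a regular scale. The function name text_dot_plot is hypothetical.

```python
from collections import Counter

# Prices of 17 walking shoes (in $)
prices = [90, 70, 70, 70, 75, 70, 65, 68, 60, 74, 70, 95, 75, 68, 85, 40, 65]

def text_dot_plot(values):
    """One line per distinct value, with one dot per observation."""
    counts = Counter(values)
    return [f"{value:>3} " + "." * counts[value] for value in sorted(counts)]

print("\n".join(text_dot_plot(prices)))
```

The number of dots next to a value is its frequency, so the five dots at 70 show it is the most common price.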
Stem-and-Leaf Displays/Stemplot
Another way to portray the individual observations of quantitative
data is a stem-and-leaf display, which works well for small sets of
data (n ≤ 50).
Each observed number is broken into two pieces called the stem and
the leaf.
How to make a stemplot:
1. Divide each data value into two parts:
- The leading digits of the number are the stems.
- The rest of the digits of the number are the leaves.
- NOTE: use the stems to label the bins (the equal-width
interval)
- NOTE 2: use only one digit for each leaf – either round or
truncate the data values to one decimal place after the stem.
2. List the stems in a column (with the smallest at the bottom), and
place a vertical line to the right of this column.
3. For each measurement, record the leaf portion in the same row
as its corresponding stem.
4. Order the leaves from lowest to highest in each stem.
5. Provide a key to your stem and leaf coding.
Example 6b: Draw a stemplot for the prices of walking shoes.
Prices of walking shoes in $:
90 70 70 70 75 70 65 68 60 74 70 95 75 68 85 40 65
Order them from smallest to largest:
40 60 65 65 68 68 70 70 70 70 70 74 75 75 85 90 95
10 |
 9 | 05
 8 | 5
 7 | 00000455
 6 | 05588
 5 |
 4 | 0
Prices of Walking Shoes
Stems: Tens (10)
Leaf: Ones (1)
Or 9|0 means $90
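The stemplot steps can be sketched in Python for whole-number data, where the stem is the tens part and the leaf is the ones digit. The function name stemplot is hypothetical, and this sketch only prints stems between the smallest and largest observed stem, so it does not print the empty stem 10 shown in the display.

```python
from collections import defaultdict

# Prices of 17 walking shoes (in $)
prices = [90, 70, 70, 70, 75, 70, 65, 68, 60, 74, 70, 95, 75, 68, 85, 40, 65]

def stemplot(values):
    """Stem = tens, leaf = ones digit; smallest stem at the bottom."""
    leaves = defaultdict(list)
    for value in sorted(values):               # sorting orders the leaves
        leaves[value // 10].append(value % 10)
    stems = range(max(leaves), min(leaves) - 1, -1)   # keep empty stems
    return [f"{stem:>2} | " + "".join(str(d) for d in leaves.get(stem, []))
            for stem in stems]

print("\n".join(stemplot(prices)))
print("Key: 9|0 means $90")
```

Empty stems (such as 5 here) are kept on purpose, because gaps in the data are part of the shape of the distribution.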
Example 7a: Draw a stemplot for the prices of running shoes.
Prices of running shoes:
56, 60, 64, 64, 64, 68, 68, 68, 68, 72, 72, 72, 72, 76, 76, 76, 76, 80,
80, 80, 80, 84, 84, 88
8 | 0000448
7 | 22226666
6 | 04448888
5 | 6
Prices of Running Shoes
Stems: tens (10)
Leaf: ones (1)
OR: 8|8 means $88
Sometimes the available stem choices result in a plot that contains
too few stems and a large number of leaves within each stem, so the
display will look a little crowded. In this situation, you can stretch
the stems by dividing each into several lines.
The two common choices for dividing stems are:
- Into two lines, with leaves 0 to 4 and 5 to 9
- Into 5 lines, with leaves 0-1, 2-3, 4-5, 6-7, 8-9
Example: Make a stemplot with the Running Shoes Data by
stretching the stems into 2 lines:
8 | 8
8 | 000044
7 | 6666
7 | 2222
6 | 8888
6 | 0444
5 | 6
Prices of Running Shoes
(8|8 means $88)
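Stretching each stem into two lines can be sketched in Python as follows. The function name split_stemplot is hypothetical. Lines with no leaves at the very top or bottom are dropped so the output matches the display above.

```python
# Prices of running shoes (in $)
running = [56, 60, 64, 64, 64, 68, 68, 68, 68, 72, 72, 72, 72,
           76, 76, 76, 76, 80, 80, 80, 80, 84, 84, 88]

def split_stemplot(values):
    """Two lines per stem: leaves 5-9 on the upper line, 0-4 on the lower."""
    values = sorted(values)
    rows = []
    for stem in range(values[-1] // 10, values[0] // 10 - 1, -1):
        ones = [v % 10 for v in values if v // 10 == stem]
        rows.append((stem, [d for d in ones if d >= 5]))
        rows.append((stem, [d for d in ones if d <= 4]))
    # Drop empty lines at the top and bottom of the display
    while rows and not rows[0][1]:
        rows.pop(0)
    while rows and not rows[-1][1]:
        rows.pop()
    return [f"{stem} | " + "".join(map(str, leaves)) for stem, leaves in rows]

print("\n".join(split_stemplot(running)))
```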
Sometimes there are too many stems and a small number of leaves
within each stem, usually the case with 3 or more digits. For this
situation, you can truncate (or round) the number to two places,
using the first digit as the stem and the second as the leaf.
Example: Make a stemplot for the following data of acceptance
rates at some business schools:
16.3, 17.0, 19.5, 20.3, 20.5, 21.7, 21.9, 22.1, 22.3, 23.8, 23.9, 25.2,
27.1, 28.9, 30.3, 32.5, 33.7, 35.6
35 | 6
34 |
33 | 7
32 | 5
31 |
30 | 3
29 |
28 | 9
27 | 1
26 |
25 | 2
24 |
23 | 89
22 | 13
21 | 79
20 | 35
19 | 5
18 |
17 | 0
16 | 3
Acceptance Rates at some business schools
Stems: Ones (1)
Leaf: Tenths (0.1)
OR (16|3 means 16.3)
Example:
The stem and leaf plot in the previous example looks a little too
spread out, and it has 3 digits, so try to truncate the data and redo the
stemplot.
After the data is truncated, it looks like:
16, 17, 19, 20, 20, 21, 21, 22, 22, 23, 23, 25, 27, 28, 30, 32, 33, 35
3 | 0235
2 | 00112233578
1 | 679
Acceptance Rates at some business schools
Stems: Tens (10)
Leaf: Ones (1)
OR (1|6 means 16)
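The truncate-then-plot procedure can be reproduced in a few lines of Python. This is a sketch using the acceptance rates from the example. It truncates (drops the decimal part) rather than rounds, which is what the truncated list in the notes does.

```python
import math

# Acceptance rates at some business schools (already in increasing order)
rates = [16.3, 17.0, 19.5, 20.3, 20.5, 21.7, 21.9, 22.1, 22.3,
         23.8, 23.9, 25.2, 27.1, 28.9, 30.3, 32.5, 33.7, 35.6]

truncated = [math.trunc(rate) for rate in rates]   # drop the tenths digit

# Group the ones digits (leaves) under their tens digit (stem)
leaves = {}
for value in truncated:
    leaves.setdefault(value // 10, []).append(value % 10)

plot = [f"{stem} | " + "".join(map(str, leaves[stem]))
        for stem in sorted(leaves, reverse=True)]

print(truncated)
print("\n".join(plot))
print("Key: 1|6 means 16")
```

Rounding instead of truncating would move some values (19.5 would become 20, for example), so the two choices can give slightly different displays.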
You also can use stem-and-leaf plots for the comparison of the
distribution of two groups (back-to-back stemplot)
Histogram
The most common graph for describing numerical data is the
histogram. It helps to visualize the distribution of the underlying
variable very well, especially for large data sets.
Definition:
A histogram for a quantitative variable is a graph that uses bars to
show "how often" (measured as frequency or relative frequency)
measurements falls in a par
