3, 4, 5.pdf

Department
Statistics
Course
STAT151
Professor
Susan Kamp
Semester
Fall

Description
Ch 3: Displaying and Describing Categorical Data

It is very unlikely that we can draw conclusions about a variable simply by looking at raw data. It is therefore beneficial to summarize the raw data in a more manageable form before drawing any conclusions. In this chapter, we will use graphs for initial data exploration.

NOTE: The proper choice of graph depends on the nature of the variable. Different types of graphs for different types of data!

Graphical Displays

Categorical Variables    Numerical Variables
- Bar Chart              - Dot Plots
- Pie Chart              - Stem Plots
                         - Histograms
                         - Time Plots
                         - Box Plots
                         - Scatterplots

Graphs for Categorical Data

After categorical data has been sampled, it should be summarized to provide the following information:
1. What values have been observed?
   - Ex: Gender: Female or Male
   - Ex: Car Color: red, white, blue, black, other
   - Ex: Smoker: Yes or No
2. How often did each value occur?

The distribution of a categorical variable is given in the form of a table that provides the following information:
- each possible category, and
- the frequency (or number) of individuals who fall into each category, or
- the relative frequency (or percentage) of individuals who fall into each category.

The relative frequency of a category is the fraction of all observations in the data set that fall into that category; multiplied by 100, it gives a percentage. It is calculated as

relative frequency = frequency / number of observations

Steps to Construct a Frequency or Relative Frequency Distribution Table
1) Define the levels (categories) of the variable.
2) Count the number of observations in the data set corresponding to each category (frequencies). If preferred, calculate the relative frequency of each category.
3) Summarize the results in a table (known as the frequency or relative frequency distribution table).
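The three steps above can be sketched in a few lines of Python. This is only an illustration: the function name frequency_table and the list of responses are made up for the example and are not the class data.

```python
from collections import Counter


def frequency_table(observations):
    """Map each category to (frequency, relative frequency)."""
    n = len(observations)            # number of observations
    counts = Counter(observations)   # step 2: count each category
    return {category: (count, count / n) for category, count in counts.items()}


# Hypothetical survey responses, invented for illustration only.
responses = ["Chocolate", "Vanilla", "Chocolate", "Other", "Strawberry",
             "Chocolate", "Vanilla", "Chocolate"]

table = frequency_table(responses)

# Step 3: summarize the results in a table.
print(f"{'Category':<12}{'Freq':>6}{'Rel. Freq':>12}")
for category, (freq, rel) in sorted(table.items()):
    print(f"{category:<12}{freq:>6}{rel:>12.1%}")
print(f"{'Total':<12}{len(responses):>6}{1.0:>12.1%}")
```

The relative frequencies always add up to 1 (100%), which is a quick check that every observation was counted exactly once.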
Example: The distribution of a categorical variable (Favourite Ice-Cream Flavour) for this Stat 151 class:

sample size = number of observations = n =

Category      Frequency    Relative Frequency
Chocolate
Strawberry
Vanilla
Other
Total

Once the data is summarized in a frequency distribution table, it can be displayed in a bar chart or pie chart. The bar chart shows the frequencies or percentages in the different categories, whereas the pie chart shows the relationship between the parts and the whole.

Bar chart
- a graph of the distribution of a categorical variable

Steps to make a bar chart:
1) Put every category (or level of the categorical variable) evenly spaced on the x-axis (each can be marked with a tick).
2) Represent each category by a bar of equal width, with the height of the bar proportional to the frequency (or relative frequency) of that category.
3) Label the y-axis (frequency or relative frequency).

Example: The frequency distribution of favourite ice-cream flavour is:

Category      Frequency    Relative Frequency
Chocolate
Strawberry
Vanilla
Other

Construct a bar chart.

Pie charts
- provide an alternative kind of graph for categorical data.
- A circle is used to represent the sample.
- The size of the slice representing a particular category is proportional to the corresponding frequency (or relative frequency).

NOTE: Use a pie chart only when you want to emphasize each category’s relation to the whole. It is useful when there is a relatively small number of categories involved.

Steps to create a pie chart:
1) Draw a circle.
2) Calculate the slice size (angle):
   slice size = category relative frequency × 360° (the fraction of the circle for the category)
3) Use a protractor to mark the angles.

NOTE: In a pie chart, the proportions shown by the slices must add up to 100%, and each individual must fall into only 1 category.
NOTE: Be sure to use enough individuals.
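As a small illustration of step 2 above, the slice angles can be computed directly from the frequencies. The function name slice_angles and the counts for categories A, B and C are hypothetical, chosen only to show the arithmetic.

```python
def slice_angles(frequencies):
    """Return the pie-slice angle in degrees for each category.

    angle = relative frequency * 360 = frequency / total * 360
    """
    total = sum(frequencies.values())
    return {category: 360 * count / total
            for category, count in frequencies.items()}


# Hypothetical counts, invented for illustration only.
counts = {"A": 10, "B": 30, "C": 60}
angles = slice_angles(counts)

for category, angle in angles.items():
    print(f"{category}: {angle:.1f} degrees")
```

Because the relative frequencies add up to 100%, the angles should add up to 360°; if they do not, a category was missed or an individual was counted twice.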
Example: Using the data from the previous example:

Category      Frequency    Relative Frequency (%)    Angle
Chocolate
Strawberry
Vanilla
Other

Example 5 (please read on your own): The M&M's webpage provides the following information on the distribution of colors in peanut M&M's:

Color      brown    yellow    red    blue    orange    green
Percent    12%      15%       12%    23%     23%       15%

To check whether this distribution is a "true" description of what is in a bag, someone bought a bag with 200 peanut M&M's and wants to describe the colors of the contents. Color is a categorical variable, so a relative frequency table is obtained:

Color     Count    Rel. Freq
brown        50       25%
yellow       28       14%
red          14        7%
blue         52       26%
orange       36       18%
green        20       10%
Total       200      100%

A bar chart would look like this:

[Figure: Bar Chart of Peanut M&M's Distribution of Colors; x-axis: Color (brown, yellow, red, blue, orange, green); y-axis: Count (or Frequency)]

For the pie chart, the angles of the slices have to be determined:

Color     Count    Rel. Freq    Angle
brown        50       25%        90.0°
yellow       28       14%        50.4°
red          14        7%        25.2°
blue         52       26%        93.6°
orange       36       18%        64.8°
green        20       10%        36.0°
Total       200      100%       360.0°

This results in the following pie chart:

[Figure: Pie Chart of Peanut M&M's Distribution of Colors; slices: brown 25%, yellow 14%, red 7%, blue 26%, orange 18%, green 10%]

Contingency Table
- allows us to look at 2 categorical variables together
- shows how individuals are distributed along each variable, contingent on the value of the other variable.
- The margins of the table (both on the right and on the bottom) give totals and the frequency distributions for each of the variables.
- Each frequency distribution is called a marginal distribution of its respective variable.
- Each cell of the table gives the count for a combination of values of the two variables.
- A conditional distribution shows the distribution of one variable for just the individuals who satisfy some condition on another variable.
- The variables are considered independent when the distribution of one variable in a contingency table is the same for all categories of the other variable.

Heart Disease Example: A study was set up to examine whether smoking is a risk factor for heart disease. The result is given in the following table:

                         Smoker
                         Yes      No      Total
Heart disease    Yes      23      15        38
                 No       69     259       328
Total                     92     274       366

The following is the conditional distribution of Heart Disease, conditional on smoking:

                         Smoker: Yes    % of column
Heart disease    Yes          23            25%
                 No           69            75%
Total                         92           100%

The following is the conditional distribution of Heart Disease, conditional on nonsmoking:

                         Smoker: No     % of column
Heart disease    Yes          15            5.5%
                 No          259           94.5%
Total                        274          100%

NOTE: Do not confuse similar-sounding percentages:
- The percentage of people who have heart disease and smoked
- The percentage of smokers who have heart disease
- The percentage of people with heart disease who smoked

What do the conditional distributions tell us?
- The rate of heart disease differs between those who smoked and those who did not.
- This is better shown with pie charts of the two distributions:
- This leads us to believe that heart disease and smoking status are associated, i.e. they are not independent.

Independent variables
- Variables are said to be independent if the conditional distribution of one variable is the same for each category of another.
- In other words, there is no association between these variables.

Example of independent variables:

                         Smoker
                         Yes      No      Total
Heart disease    Yes      30      15        45
                 No       90      45       135
Total                    120      60       180

The following is the conditional distribution of Heart Disease, conditional on the factor smoker:

                         Smoker
                         Yes      No      Total
Heart disease    Yes     25%     25%       25%
                 No      75%     75%       75%
Total                   100%    100%      100%

NOTE: We see that the distribution of heart disease for the smokers is no different from that of the nonsmokers, so the two variables are independent.
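The conditional distributions in the two heart disease tables can be reproduced with a short Python sketch. The counts are taken from the tables above; the function names and the dictionary layout are assumptions made for the illustration. The comparison is a plain equality check on the percentages, not a statistical test.

```python
def conditional_distribution(column):
    """Turn the counts in one column into proportions that sum to 1."""
    total = sum(column.values())
    return {level: count / total for level, count in column.items()}


def same_conditional_distributions(table, tol=1e-9):
    """True if every column has the same conditional distribution."""
    dists = [conditional_distribution(col) for col in table.values()]
    first = dists[0]
    return all(abs(d[level] - first[level]) <= tol
               for d in dists for level in first)


# Counts of heart disease (yes/no) within each smoking group.
study = {
    "smoker":    {"yes": 23, "no": 69},
    "nonsmoker": {"yes": 15, "no": 259},
}
independent_example = {
    "smoker":    {"yes": 30, "no": 90},
    "nonsmoker": {"yes": 15, "no": 45},
}

for group, column in study.items():
    dist = conditional_distribution(column)
    print(group, {level: f"{p:.1%}" for level, p in dist.items()})

print(same_conditional_distributions(study))                # False
print(same_conditional_distributions(independent_example))  # True
```

With real sample data the percentages almost never match exactly, so in practice the question is whether they are close, which is a judgment this simple check does not make.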
NOTE: It is rare for two variables to be entirely independent.

Example: Has the percentage of young girls drinking milk changed over time? The following table is consistent with the results from “Beverage Choices of Young Females: Changes and Impact on Nutrient Intakes” (Shanthy A. Bowman, Journal of the American Dietetic Association, 102(9), pp. 1234-1239):

1. Find the following:
   a. What percent of the young girls reported that they drink milk?
   b. What percent of the young girls were in the 1989-1991 survey?
   c. What percent of the young girls who reported that they drink milk were in the 1989-1991 survey?
   d. What percent of the young girls in 1989-1991 reported that they drink milk?
2. What is the marginal distribution of milk consumption?
3. Do you think that milk consumption by young girls is independent of the nationwide survey year? Use statistics to justify your reasoning.
4. Consider the following pie charts for a subset of the data above: Do the pie charts above indicate that milk consumption by young girls is independent of the nationwide survey year? Explain.

Segmented Bar Chart
- Displays the same information as a pie chart, but in the form of bars instead of circles.
- The following is a segmented bar chart for heart disease by smoking status:

Chapter 4: Displaying and Summarizing Quantitative Data

Graphs for Numerical Data

Numerical variables often take many values. We need other types of graphs to display the data for a quantitative variable so that the distribution of the data becomes apparent.

Describing the distribution shown in a plot:
1) shape:
   a) nature of the distribution (unimodal, bimodal, multimodal)
      - One characterization of general shape relates to the number of humps, or modes.
        o unimodal: a single peak
        o bimodal: two peaks; can occur when the data set consists of observations on two quite different kinds of individuals or objects
        o multimodal: more than 2 peaks; rarely occurs
        o uniform: no modes
   b) symmetric or skewed to the right/left
      - Symmetry: if you can draw a vertical line so that the part to the left is a mirror image of the part to the right, then the graph is symmetric.
      - Nonsymmetric graphs are skewed.
        o If the upper tail of the histogram stretches out farther than the lower tail, then the histogram is positively skewed, or skewed to the right.
        o If the lower tail is longer than the upper tail, the histogram is negatively skewed, or skewed to the left.
   c) unusual values or deviations from the overall pattern
      - An important kind of deviation is an outlier, an individual value that falls outside the overall pattern.
2) center: the value that splits the data in half, or a typical range of values at the center of the graph
   - mean, median, mode
3) spread: the range of values; concentration; are most of the values close to or far from the center?
   - range, standard deviation, IQR

Dotplots

A dot plot is a plot that portrays the individual observations. To construct a dot plot:
- draw a horizontal (or vertical) line
- label the line with the name of the variable, and mark regular values of the variable on it
- for each observation, place a dot above (or next to) its value on the number line

NOTE 1: The number of dots above a value on the number line represents the frequency of occurrence of that value.
NOTE 2: Dot plots work well for small sets of data (n ≤ 50).

Example 6a: Construct a dotplot for the prices of 17 walking shoes (in $):
90 70 70 70 75 70 65 68 60 74 70 95 75 68 85 40 65

Stem-and-Leaf Displays/Stemplot

Another way to portray the individual observations of quantitative data is a stem-and-leaf display, which works well for small sets of data (n ≤ 50).
Each observed number is broken into two pieces called the stem and the leaf.

How to make a stemplot:
1. Divide each data value into two parts:
   - The leading digits of the number are the stems.
   - The rest of the digits of the number are the leaves.
   - NOTE: use the stems to label the bins (the equal-width intervals)
   - NOTE 2: use only one digit for each leaf; either round or truncate the data values to one decimal place after the stem.
2. List the stems in a column (with the smallest at the bottom), and place a vertical line to the right of this column.
3. For each measurement, record the leaf portion in the same row as its corresponding stem.
4. Order the leaves from lowest to highest in each stem.
5. Provide a key to your stem and leaf coding.

Example 6b: Draw a stemplot for the prices of walking shoes.

Prices of walking shoes in $:
90 70 70 70 75 70 65 68 60 74 70 95 75 68 85 40 65

Order them from smallest to largest:
40 60 65 65 68 68 70 70 70 70 70 74 75 75 85 90 95

10 |
 9 | 05
 8 | 5
 7 | 00000455
 6 | 05588
 5 |
 4 | 0

Prices of Walking Shoes
Stems: Tens (10)  Leaf: Ones (1)
Or: 9|0 means $90

Example 7a: Draw a stemplot for the prices of running shoes.

Prices of running shoes:
56, 60, 64, 64, 64, 68, 68, 68, 68, 72, 72, 72, 72, 76, 76, 76, 76, 80, 80, 80, 80, 84, 84, 88

8 | 0000448
7 | 22226666
6 | 04448888
5 | 6

Prices of Running Shoes
Stems: Tens (10)  Leaf: Ones (1)
Or: 8|8 means $88

Sometimes the available stem choices result in a plot that contains too few stems and a large number of leaves within each stem, so the display will look a little crowded. In this situation, you can stretch the stems by dividing each into several lines.
The two common choices for dividing stems are:
- Into two lines, with leaves 0 to 4 and 5 to 9
- Into 5 lines, with leaves 0-1, 2-3, 4-5, 6-7, 8-9

Example: Make a stemplot with the Running Shoes Data by stretching the stems into 2 lines:

8 | 8
8 | 000044
7 | 6666
7 | 2222
6 | 8888
6 | 0444
5 | 6

Prices of Running Shoes (8|8 means $88)

Sometimes there are too many stems and a small number of leaves within each stem, which is usually the case with data of 3 or more digits. In this situation, you can truncate (or round) the number to two places, using the first digit as the stem and the second as the leaf.

Example: Make a stemplot for the following data of acceptance rates at some business schools:
16.3, 17.0, 19.5, 20.3, 20.5, 21.7, 21.9, 22.1, 22.3, 23.8, 23.9, 25.2, 27.1, 28.9, 30.3, 32.5, 33.7, 35.6

35 | 6
34 |
33 | 7
32 | 5
31 |
30 | 3
29 |
28 | 9
27 | 1
26 |
25 | 2
24 |
23 | 89
22 | 13
21 | 79
20 | 35
19 | 5
18 |
17 | 0
16 | 3

Acceptance Rates at some business schools
Stems: Ones (1)  Leaf: Tenths (0.1)
Or: 16|3 means 16.3

Example: The stem-and-leaf plot in the previous example looks a little too spread out, and the data has 3 digits, so truncate the data and redo the stemplot.

After the data is truncated, it looks like:
16, 17, 19, 20, 20, 21, 21, 22, 22, 23, 23, 25, 27, 28, 30, 32, 33, 35

3 | 0235
2 | 00112233578
1 | 679

Acceptance Rates at some business schools
Stems: Tens (10)  Leaf: Ones (1)
Or: 1|6 means 16

You can also use stem-and-leaf plots to compare the distributions of two groups (back-to-back stemplot).

Histogram

The most common graph for describing numerical data is the histogram. It helps to visualize the distribution of the underlying variable very well, especially for large data sets.

Definition: A histogram for a quantitative variable is a graph that uses bars to show "how often" (measured as frequency or relative frequency) measurements fall in a par