COMM 291 Lecture 4: Lectures_4_5


University of British Columbia
COMM 291
Jonathan Berkowitz

COMMERCE 291 – Lecture Notes – Jonathan Berkowitz (copyright, 2013)

Summary of Lectures 4 and 5

Chapter 5: Displaying and Describing Quantitative Data

With quantitative data, there are many values and little repetition (e.g. age of fans at a Canucks game). Here it is better to group the data into classes or intervals and find the frequency of each class.

Example: Sales records – the number of product units sold by each salesperson

Class    Frequency    Relative Frequency    Percentage
0-10         3              .06                  6%
11-20       10              .20                 20%
21-30       17              .34                 34%
31-40       15              .30                 30%
41-50        2              .04                  4%
51-60        3              .06                  6%
Total       50             1.00                100%

A useful type of graph of a frequency table for a quantitative variable is a “histogram”. Here is a histogram (created in Excel) for the above frequency table. Note that in a histogram the bars are touching; in fact “histogram” means “picture of layers”.

[Histogram: “Units sold per salesperson”; horizontal axis: # of units (classes 0-10, 11-20, 21-30, 31-40, 41-50, 51-60); vertical axis: Frequency]

Note: The labeling says 0-10, 11-20, etc., so that there is no ambiguity on the borders. The 11-20 category is really >10 to 20.

Remember: The difference between a bar chart and a histogram: bar charts are for categorical data where the categories are not contiguous (i.e. touching); histograms are for quantitative data where the underlying continuum is divided into contiguous class intervals.

An alternative to the histogram is the stem-and-leaf plot (also called, simply, a stemplot).

Example: The following data are final exam marks (ranked in ascending order, reading down the columns) from a rather difficult math exam.

33  48  54  60  64  68  71  74  76  82
34  49  54  60  64  70  71  74  77  83
38  50  55  61  65  70  71  74  78  85
40  50  56  62  65  70  71  75  80  86
42  51  58  64  65  71  72  76  81  90
42  52  58  64  67  71  74  76  81  91  95

Use the “10s-digit” as the stems, and the “1s-digits” as the leaves.
3 | 3 4 8
4 | 0 2 2 8 9
5 | 0 0 1 2 4 4 5 6 8 8
6 | 0 0 1 2 4 4 4 4 5 5 5 7 8
7 | 0 0 0 1 1 1 1 1 1 2 4 4 4 4 5 6 6 6 7 8
8 | 0 1 1 2 3 5 6
9 | 0 1 5

This looks like a histogram, on its side, with class intervals 30-39, 40-49, 50-59, etc. The advantage of the stemplot is that it retains the original data values. The disadvantage is that it is limited in what the class intervals can be, and is limited to the two leading significant digits.

Remember:
Quantitative Data Condition: data values have known units
vs.
Categorical Data Condition: data are counts of individuals in categories

What to look for in a histogram:
• Is there a single peak/mode (unimodal) or multiple peaks/modes (bimodal, multimodal)?
• Is the shape symmetric or skewed (asymmetric)?
• Are there outliers (unusual observations a long way from the main body of data)?

Numerical Summaries of Centre and Spread:

Note: Categorical data can really only be summarized by counts and relative frequencies, so the numerical summaries discussed here apply only to quantitative data.

Measures of centre and spread cannot easily detect gaps, multiple peaks or extreme values (i.e. outliers) in a distribution. That is why analysis should begin with a graph!

Measures of Centre (also called Location):
• Mean (denoted by x̄): sum of the data values divided by the number of data values
• Median: the middle value, after arranging the data in ascending order
• Mode: the highest point in the histogram (by far the least useful of the three summaries and not often used as a summary measure)

The mean can be thought of as the “centre of gravity” of the distribution. That is, if the histogram (since we have measurement data!) were sitting on a seesaw, the mean would be the point at which the seesaw was in balance. The median can be thought of as the middle of the data; 50% are on each side of the median.

Note: If there is an even number of data values, the median is the average of the two middle values.
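The definitions of the mean and the median above can be sketched in a few lines of Python. This is only an illustration: the function names and the two small data sets are hypothetical and do not come from the lecture notes.

```python
def mean(values):
    """Sum of the data values divided by the number of data values."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data.

    If the number of values is even, return the average of the
    two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical data sets, chosen only to show the odd and even cases.
odd_sample = [7, 3, 9, 1, 5]        # n = 5, one middle value
even_sample = [7, 3, 9, 1, 5, 11]   # n = 6, two middle values

print("odd sample:  mean =", mean(odd_sample), " median =", median(odd_sample))
print("even sample: mean =", mean(even_sample), " median =", median(even_sample))
```

In the even case the sorted values are 1, 3, 5, 7, 9, 11, so the median is the average of 5 and 7.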
For symmetric distributions, the mean equals the median. For asymmetric or skewed distributions, the mean and the median are generally not equal. If the distribution has a long right-hand tail (i.e. skewed right), the mean is greater than the median. If the distribution has a long left-hand tail (i.e. skewed left), the mean is less than the median.

The median is a more “robust” measure than the mean; that is, it is not highly affected by the presence of outliers.

Example: For data values 1,2,3,4,5,6,7,8,9, both the mean and median are 5. For data values 1,2,3,4,5,6,7,8,99, the mean is 15 but the median is still 5.

Example (adapted from Martin Gardner): The management team of ABC Company consisted of Mr. X, his brother and six relatives. The work force consisted of five forepersons and ten workers. When the company was ready to expand, Mr. X interviewed applicants and told them that the average weekly salary at ABC was $600. A new worker, Y, was hired and, after a few days, complained that Mr. X had misled him. Y had checked with the other workers and discovered that none was getting more than $200 a week, so how could the average be $600?

Mr. X explained the salary structure: Each week Mr. X got $4800, his brother got $2000, each of the six relatives made $500, the forepersons each made $400 and the ten workers each got $200. That made a weekly total payroll of $13,800 for 23 people, leading to an average of $600.

What about the median and mode? The median is the middle value, in this case the 12th value, which is $400. The mode is the most frequently occurring value, which is $200.

So here we have three measures of location: $200, $400 and $600. Which is the preferred one? Since the distribution is so skewed and also has a large extreme value, the median is preferred. Management would, of course, want to report the mean, but the Union would probably prefer to report the mode during contract negotiations!
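The salary example can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative, and the salary list is built directly from the figures Mr. X quotes above.

```python
from statistics import mean, median, mode

# Weekly salaries at ABC Company, as described in the example:
# Mr. X, his brother, six relatives, five forepersons, ten workers.
salaries = [4800] + [2000] + [500] * 6 + [400] * 5 + [200] * 10

headcount = len(salaries)   # 23 people
payroll = sum(salaries)     # total weekly payroll of 13,800

print("people: ", headcount)
print("payroll:", payroll)
print("mean:   ", mean(salaries))    # what management quotes
print("median: ", median(salaries))  # the 12th of 23 sorted values
print("mode:   ", mode(salaries))    # most frequently occurring salary

# The robustness example: one extreme value moves the mean, not the median.
plain = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 99]
print(mean(plain), median(plain))
print(mean(with_outlier), median(with_outlier))
```

The three summaries come out as $600, $400 and $200, matching the figures in the example, and replacing 9 with 99 raises the mean from 5 to 15 while the median stays at 5.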
* * * * *

Extra material: There are actually three kinds of means – arithmetic, geometric, and harmonic.

Arithmetic mean (the familiar one, x̄) – for non-related values
(Note that “arithmetic” is pronounced with the emphasis on the “met” because it is an adjective here – arith-MET-ic)

Harmonic mean – for rates which are NOT dependent on each other

\[ HM = \frac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \cdots + \dfrac{1}{x_n}} \]

Example: Drive at 40 kph from point A to point B and 60 kph from point B to point A. What is the average speed for the round trip?

\[ HM = \frac{2}{\dfrac{1}{40} + \dfrac{1}{60}} = 48 \text{ kph} \]

Geometric mean – for rates where each measurement depends on the previous one. Denote the rates by R_1, R_2, ..., R_n and the geometric mean by R_g. Then

\[ (1 + R_g)^n = (1 + R_1)(1 + R_2) \cdots (1 + R_n) \]

Then

\[ R_g = \left[ (1 + R_1)(1 + R_2) \cdots (1 + R_n) \right]^{1/n} - 1 \]

Example: Suppose an investment experiences 100% growth in the first year and then a 50% loss in the second year. What is the average increase? (The arithmetic mean would be [100%+(-50%)]/2 = 25% but that’s nonsense since you would actually be right back where you started from!)

\[ R_g = \left[ (1 + 1.0)(1 + (-0.5)) \right]^{1/2} - 1 = 1 - 1 = 0 \]

End of extra material

* * * * *

Measures of Spread (also called Scale or Dispersion):

Range = Maximum value – Minimum value

Variance (s²) and Standard Deviation (s or SD)

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]

A computational formula is:

\[ s^2 = \frac{\sum x_i^2 - \dfrac{\left( \sum x_i \right)^2}{n}}{n - 1} \]

The standard deviation (sometimes referred to as the SD) can best be interpreted as “the typical distance from a data value to the mean”.

Neither the mean nor the standard deviation is resistant to outliers. They are also a poor choice of summary if the distribution is highly skewed.

Percentiles: The pth percentile is a value below which p% of the data values fall.

Some percentiles have special names:
75th percentile = 3rd or upper quartile = Q3
25th percentile = 1st or lower quartile = Q1
50th percentile = Median

The Interquartile Range (IQR) = Q3 – Q1.
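The range, variance and standard deviation formulas above can be sketched in Python as follows. The function names are hypothetical, and the data are the values 1 to 9 (and 1 to 8 plus 99) from the earlier robustness example. Both variance formulas are computed so they can be compared with each other and with the standard library's `statistics.variance`.

```python
import math
from statistics import variance


def sample_variance(values):
    """Definitional formula: sum of squared deviations over (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Computational formula: (sum of x^2 - (sum of x)^2 / n) over (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


plain = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 99]

for label, data in [("plain", plain), ("with outlier", with_outlier)]:
    data_range = max(data) - min(data)
    s2 = sample_variance(data)
    s = math.sqrt(s2)  # standard deviation is the square root of the variance
    print(label, "range =", data_range, "variance =", s2, "SD =", round(s, 2))
    # The shortcut and the library function should agree with the definition.
    assert math.isclose(s2, sample_variance_shortcut(data))
    assert math.isclose(s2, variance(data))
```

For the values 1 to 9 the variance is 7.5; replacing 9 with 99 raises it to 997.5, which illustrates why the standard deviation is not resistant to outliers.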
Note: If there are an odd number of observations, the median is indeed the middle one; but if there are an even number of observations the convention is to take the average of the two middle ones as the median. This can be extended to computing the quartiles; if a quartile lies between two observations, take the average. The text explains that the first quartile is the “median” of the values to the left of the full data set median and the third quartile is the “median” of the values to the right of the full data set median. Beware, however, that different software packages (including Excel) have different conventions for