Examining Distributions
-in any graph of data, look at the overall pattern and for dramatic deviations from that pattern
-describe the pattern by its shape, center and spread
-an important kind of deviation is an outlier, an individual value that falls outside the overall pattern
-describe the center of a distribution by its midpoint, the value with roughly half the observations
taking smaller values and half taking larger values
-describe the spread of a distribution by giving the smallest and largest values (the quartiles Q1 and Q3
from section 1.2 give a more detailed description of spread)
Stemplots and histograms display the shape. A stemplot turned on its side, with the larger values lying to the right, reads like a histogram.
•Does the distribution have one or several major peaks, called modes? Unimodal: one major peak.
•Is it symmetric or skewed? Symmetric: the values smaller and larger than its midpoint
are mirror images of each other. Ex. heights of young women. Skewed: one tail extends much
farther out than the other. Ex. money amounts, which are usually skewed to the right.
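To look at shape by hand, a stemplot can be built in a few lines. This is a minimal sketch that assumes non-negative whole numbers, with the tens as stems and the final digit as the leaf. The function name `stemplot` and the data are made up for illustration.

```python
from collections import defaultdict


def stemplot(values):
    """Return a text stemplot: stem = all digits but the last, leaf = last digit.

    Assumes non-negative integers (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)

    lines = []
    # List every stem between the smallest and largest, even empty ones,
    # so that gaps in the distribution stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return "\n".join(lines)


# Hypothetical data, for illustration only.
plot = stemplot([12, 15, 21, 34, 38, 38])
print(plot)
```

Reading down the stems shows where the peaks are and whether one tail is longer than the other.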
-outliers: look for points that are clearly apart from the body of the data, not just the most
extreme observations in a distribution. Sometimes they point to errors made in recording the data
-time plots (pg. 19): a time plot of a variable plots each observation against the time at which it was
measured. Always put time on the horizontal scale (x axis) of your plot and the variable
you are measuring on the vertical scale (y axis). Connecting the points helps show change over time.
When data are collected over time, plot the observations in time order. Displays such as stemplots and
histograms ignore time order, so they can be misleading when there is systematic change over time.
-time series: measurements of a variable taken at regular intervals over time. Government,
economic, and social data are often published as time series. Ex. the monthly unemployment rate and
the quarterly gross domestic product. Time plots reveal the main features of a time series.
-in a time series:
•Trend: is a persistent, long term rise or fall
•Seasonal variation: a pattern that repeats itself at known regular intervals of time
-many economic time series show strong seasonal variation. Government agencies often adjust for
this variation before releasing economic data. The data are then called seasonally adjusted (helps avoid
misinterpretation)
-residuals: what remains after the patterns (trend and seasonal variation) are removed
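The trend, seasonal variation, and residuals can be separated with a short calculation. The sketch below makes two assumptions that are not in the notes: the trend is a straight line fitted by least squares, and the pieces add together. The function name `decompose` and the quarterly data are hypothetical.

```python
def decompose(series, period):
    """Split a series into a linear trend, a seasonal part, and residuals.

    Assumes an additive model: series = trend + seasonal + residual.
    """
    n = len(series)
    times = list(range(n))
    t_bar = sum(times) / n
    y_bar = sum(series) / n

    # Least-squares straight line through (time, value) points.
    slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, series)) / sum(
        (t - t_bar) ** 2 for t in times
    )
    intercept = y_bar - slope * t_bar
    trend = [intercept + slope * t for t in times]

    # Seasonal part: average detrended value for each position in the cycle.
    detrended = [y - tr for y, tr in zip(series, trend)]
    season_means = []
    for s in range(period):
        same_season = detrended[s::period]
        season_means.append(sum(same_season) / len(same_season))
    seasonal = [season_means[i % period] for i in range(n)]

    # Residuals: what is left after both patterns are removed.
    residuals = [d - s for d, s in zip(detrended, seasonal)]
    return trend, seasonal, residuals


# Hypothetical quarterly data (three years), for illustration only.
quarterly = [10.2, 14.1, 9.0, 12.3, 12.1, 15.8, 11.2, 14.0, 13.9, 18.2, 13.1, 16.1]
trend, seasonal, residuals = decompose(quarterly, period=4)
print([round(r, 2) for r in residuals])
```

Agencies that publish seasonally adjusted data use more elaborate methods. This sketch only shows the idea of removing patterns and looking at what remains.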
-exploratory data analysis: uses graphs and numerical summaries to describe the variables
in a data set and the relations among them
-distribution of a variable: tells us what values it takes and how often it takes these values
1.2- Describing Distributions with Numbers
-numerical summaries make comparisons more specific
-brief description should include its shape and numbers describing its center and spread,
based on inspection of the histogram or stemplot
-graphs are an aid to understanding, not the answer
-measures of center are the mean (average value) and the median (middle value)
-to figure out the mean $\bar{x}$: add the values and divide by the number of observations. If
the n observations are $x_1, x_2, \ldots, x_n$, their mean is
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$ or in more compact notation: $\bar{x} = \frac{1}{n} \sum x_i$
$\sum$ is the capital Greek letter sigma. It is short for "add them all up."
$\bar{x}$: the bar on top indicates the mean of all the x values.
$x_i$: the subscripts keep the n observations separate. They do not necessarily indicate order or any other special
facts about the data
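The "add them all up and divide by n" recipe translates directly into code. This is a minimal sketch; the function name `mean` and the data are chosen for illustration.

```python
def mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / n


# Hypothetical observations, for illustration only.
observations = [2, 4, 4, 5, 10]
x_bar = mean(observations)  # (2 + 4 + 4 + 5 + 10) / 5
print(x_bar)
```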
-the mean is sensitive to the influence of a few extreme observations ex. outliers. Since
mean can’t resist the influence of extreme values, it’s not a resistant measure of center.
-median: formal version of the midpoint of a distribution. Half the observations are smaller
than the median and the other half are larger than the median. Rule for finding the median:
1. arrange all values in order of size, from smallest to largest.
2. if the number of observations n is odd, the median M is the center value in the ordered
list. Find the location of the median by counting (n + 1)/2 observations up from the bottom
of the list
3. if the number of observations n is even, the median M is the mean of the two center
observations in the ordered list. The location of the median is again (n+1)/2 from the bottom
of the list.
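The three-step rule above can be sketched as follows. The function name `median` is chosen for illustration. Python lists count from 0, so location (n + 1)/2 in the notes corresponds to index n // 2 when n is odd.

```python
def median(values):
    """Return the median M using the sort-then-pick-the-center rule."""
    ordered = sorted(values)  # step 1: arrange from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")

    mid = n // 2
    if n % 2 == 1:
        # step 2: n is odd, so M is the single center value
        return ordered[mid]
    # step 3: n is even, so M is the mean of the two center values
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical data, for illustration only.
print(median([5, 1, 3]))      # odd n
print(median([4, 1, 3, 2]))   # even n
```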
- if the distribution is exactly symmetric, the mean and median are exactly the same
-don’t confuse the “average” value of a variable (the mean) with its “typical” value, which we
might describe by the median
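A quick comparison shows why the mean is not resistant while the median is. The sketch uses Python's standard `statistics` module; the income figures (in thousands) are made up for illustration.

```python
from statistics import mean, median

# Hypothetical incomes in thousands, for illustration only.
incomes = [30, 32, 35, 38, 40]
# Replace the largest value with one extreme observation (an outlier).
with_outlier = incomes[:-1] + [400]

print(mean(incomes), median(incomes))
print(mean(with_outlier), median(with_outlier))
```

One extreme value pulls the mean far above every other observation, while the median does not move.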
-quartiles: describe the spread or variability of a distribution (ex. incomes and drug potencies) in
addition to its center.
-the most useful descriptions include both a measure of center and a measure of spread
-describe spread or variability, by giving several percentiles
-the median divides the data in two, so we call the median the 50th percentile. The upper quartile is
the median of the upper half of the data (same for the lower quartile and the lower half).
-quartiles divide the data into 4 equal parts
-pth percentile of a distribution is the value that has p percent of the observations fall at or below it
-to calculate percentile, arrange values in increasing order and count up the required
percent from the bottom of the list. There is not always a value with exactly p percent of the
data at or below it.
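The counting procedure can be sketched in code. Because there is not always a value with exactly p percent of the data at or below it, some convention is needed. This sketch assumes the "nearest rank" convention: take the smallest value with at least p percent of the observations at or below it. Other conventions exist and give slightly different answers. The function name `percentile` and the scores are hypothetical.

```python
import math


def percentile(values, p):
    """Return the pth percentile using the nearest-rank convention.

    The result is the smallest data value with at least p percent of
    the observations at or below it (an assumption of this sketch).
    """
    if not values:
        raise ValueError("percentile of an empty data set is undefined")
    if not 0 < p <= 100:
        raise ValueError("p must be greater than 0 and at most 100")

    ordered = sorted(values)          # arrange in increasing order
    n = len(ordered)
    rank = math.ceil(p * n / 100)     # how far to count up from the bottom
    return ordered[rank - 1]          # convert 1-based rank to 0-based index


# Hypothetical test scores, for illustration only.
scores = [55, 60, 62, 70, 71, 75, 80, 84, 90, 98]
print(percentile(scores, 50))
print(percentile(scores, 25))
```

With ten scores there is no value with exactly 25 percent of the data at or below it, so this convention counts up to the third value.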
-quartiles Q1 and Q3: to calculate the quartiles:
1. arrange values in increasing order and locate median M in the ordered list.
2. first quartile Q1 is the median of the values whose position in the ordered list is to the
left of the location of the overall median.
3. third quartile Q3 is the median of the values whose position in the ordered list is to the
right of the location of the overall median.
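The quartile rule above can be sketched as follows, using `statistics.median` from the standard library. The function name `quartiles` is chosen for illustration. When n is odd, the overall median is left out of both halves, matching "to the left of" and "to the right of" its location.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-each-half rule."""
    ordered = sorted(values)  # step 1: arrange in increasing order
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")

    half = n // 2
    lower = ordered[:half]  # values to the left of the median's location
    # For odd n, skip the center value itself; for even n, the median
    # falls between two values, so the upper half starts at index half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]

    return median(lower), median(ordered), median(upper)


# Hypothetical data, for illustration only.
print(quartiles([1, 2, 3, 4, 5, 6, 7]))      # odd n
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))   # even n
```

Software packages often use slightly different quartile rules, so their results may differ a little from this hand calculation.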