Chapter 1
Cases: the objects described by a set of data (customers, companies, subjects in a study)
Label: special variable used in some data sets to distinguish the different cases
Variable: a characteristic of a case
Categorical variable: places a case into one of several groups/categories; displayed with bar graphs, pie charts
Quantitative variable: takes numerical values for which arithmetic operations make sense; displayed with stemplots, histograms, boxplots
Distribution of a variable: tells what values it takes and how often it takes these values
Distribution of a categorical variable: lists the categories and gives either the count or the percent of cases that
fall in each category
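A minimal sketch of the distribution of a categorical variable as counts and percents. The function name and the payment data are made up for illustration.

```python
from collections import Counter

def categorical_distribution(values):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

# Hypothetical data: preferred payment method of six customers.
payments = ["card", "cash", "card", "card", "online", "cash"]
dist = categorical_distribution(payments)
for category, (count, percent) in dist.items():
    print(f"{category}: count={count}, percent={percent:.1f}%")
```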
Describe the overall pattern of a histogram (vertical axis: frequency, percent/relative frequency, or density) by its shape
(e.g. symmetric or skewed), centre (midpoint) and spread; outliers are deviations from that pattern
Outlier: individual value that falls outside overall pattern
Mean: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
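Median M: order the numbers from smallest to largest. If the number of observations n is odd, M is the centre observation, at position (n+1)/2 in the ordered list; if n is
even, M is the mean of the two centre observations (these two then fall in the halves used to find Q1 and Q3). M is the 50th percentile.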
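The mean and median rules above can be sketched in a few lines. The function names are illustrative, and the data in the tests are made up.

```python
def mean(xs):
    """Sum of the observations divided by how many there are."""
    return sum(xs) / len(xs)

def median(xs):
    """Centre of the ordered list; mean of the two centre values if n is even."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # position (n+1)/2 when counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2
```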
pth percentile: the value of a distribution such that p percent of the observations fall at or below it
First Quartile Q1: median of the observations whose position in the ordered list is to the left of the location of the overall median
Third Quartile Q3: median of the observations whose position in the ordered list is to the right of the location of the overall median
Five number summary: Minimum Q1 M Q3 Maximum
boxplot: graph of the five number summary
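A sketch of the five number summary, assuming the quartile rule in these notes (Q1 and Q3 are medians of the observations to the left and right of the overall median's location). Software packages often use slightly different quartile rules, so results may differ from library output.

```python
def median(xs):
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(xs):
    """Return (minimum, Q1, M, Q3, maximum); needs at least 2 observations."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # left of the median's location
    upper = ordered[half + n % 2:]    # right of the median's location
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])
```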
IQR = Q3-Q1
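The 1.5 x IQR rule for outliers: an observation is a suspected outlier if it falls more than 1.5 x IQR above Q3 or below Q1.
Example: Q3 = 87, Q1 = 52, IQR = 87 - 52 = 35, 1.5 x 35 = 52.5
upper limit = Q3 + 52.5 = 87 + 52.5 = 139.5
lower limit = Q1 - 52.5 = 52 - 52.5 = -0.5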
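A short check of the 1.5 x IQR arithmetic, using the quartile values 52 and 87 from the example. The function names and the list of observations in the tests are hypothetical.

```python
def outlier_limits(q1, q3):
    """Return (lower limit, upper limit) under the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def suspected_outliers(xs, q1, q3):
    """Observations falling outside the two limits."""
    low, high = outlier_limits(q1, q3)
    return [x for x in xs if x < low or x > high]

low, high = outlier_limits(52, 87)
print(low, high)
```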
Modified boxplot: suspected outliers identified individually
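The variance $s^2$ of a set of observations is the average of the squares of the deviations of the observations from
their mean (dividing by n-1 rather than n): $s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$
Standard deviation $s = \sqrt{s^2}$: measures spread about the mean; s = 0 only when there is no spread (all observations equal), otherwise s > 0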
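The variance and standard deviation formulas can be sketched directly, dividing by n-1 as in the notes. The function names are illustrative.

```python
import math

def variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def std_dev(xs):
    """Square root of the variance; 0 when all observations are equal."""
    return math.sqrt(variance(xs))
```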
Density curve: always on or above the horizontal axis; has area exactly 1 underneath it
Mode: location where the curve is highest (peak point)
The usual notation for the mean of an idealized distribution is mu (μ); for the standard deviation it is sigma (σ)
The 68-95-99.7 rule
− Approx. 68% of the observations fall within 1 sigma of the mean mu
− Approx. 95% of the observations fall within 2 sigma of mu
− Approx. 99.7% of the observations fall within 3 sigma of mu
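The three percentages can be checked numerically for a normal distribution. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8+); the helper name `share_within` is made up.

```python
from statistics import NormalDist

def share_within(k, mu=0.0, sigma=1.0):
    """Proportion of a normal distribution within k sigma of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))
```

The proportions do not depend on the particular mu and sigma, which is why the rule can be stated once for all normal distributions.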
z-score: $z = \frac{x - \mu}{\sigma}$
Standardized normal distribution: if X is N(μ, σ), then $Z = \frac{X - \mu}{\sigma}$ is N(0,1), with mean 0 and standard deviation 1
Proportion of a normal distribution above a value (here 24): ex. N(22, 0.7): z = (24 - 22)/0.7 = 2.86 --> table area to the left is 0.9979 --> area above is 1 - 0.9979 = 0.0021 = 0.21%
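The N(22, 0.7) example can be reproduced without a table. This sketch uses `statistics.NormalDist` (Python 3.8+), and the function names are illustrative. The exact answer differs slightly from the table value because the table rounds z to 2.86.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mu) / sigma

def proportion_above(x, mu, sigma):
    """Area under the N(mu, sigma) density curve to the right of x."""
    return 1 - NormalDist(mu, sigma).cdf(x)

z = z_score(24, 22, 0.7)
tail = proportion_above(24, 22, 0.7)
print(round(z, 2), round(100 * tail, 2))
```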
Bimodal distributions – two peaks
Chapter 2:
Response variable: measures outcome of a study
Explanatory (independent, x) variable: explains or causes changes in the response variable; the variable you can
manipulate
Scatterplot: shows the relationship between two quantitative variables
--> Overall pattern: form (e.g. linear), direction (positive or negative), and strength (weak, moderate, strong)
Two variables are positively associated when above-average values of one tend to accompany above-average
values of the other and below-average values also tend to occur together
Two variables are negatively associated when above-average values of one tend to accompany below-average
values of the other and vice versa
Correlation r: measures the direction and strength of the linear relationship between two quantitative variables.
− r is always a number between -1 and 1; its sign gives the direction and its size the strength of the linear relationship
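A sketch of how r can be computed, assuming the usual textbook formula: the sum of products of standardized x and y values, divided by n-1. The function name and the data in the tests are made up.

```python
import math

def correlation(xs, ys):
    """Correlation r between two equally long lists of numbers."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # Standard deviations, dividing by n - 1.
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    # Average (over n - 1) of the products of standardized values.
    products = (((x - xbar) / sx) * ((y - ybar) / sy)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)
```

Points lying exactly on an upward-sloping line give r = 1, and points on a downward-sloping line give r = -1.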
