Descriptive statistics utilizes numerical and graphical methods to look for patterns
in data sets.
Inferential statistics utilizes sample data to make estimates, decisions, predictions, or
other generalizations about a larger set of data.
Experimental unit: something about which we collect data.
Population: a set of units to study.
Variable: a characteristic or property of an individual unit.
Sample: a subset of the units of the population.
Measure of reliability is a statement (usually quantitative) about the degree of
uncertainty associated with a statistical inference.
Four Elements of Descriptive Statistical Problems:
1. Population or sample of interest
2. One or more variables that are to be investigated
3. Tables, graphs, or numerical summary tools
4. Identification of patterns in the data.
Five Elements of Inferential Statistical Problems:
1. Population of interest
2. One or more variables that are to be investigated
3. The sample of population units
4. The inference about the population based on information contained in the sample
5. A measure of the reliability of the inference.
Quantitative and Qualitative:
Quantitative data are measurements recorded on a numerical scale.
Qualitative (or categorical) data are measurements that cannot be recorded on a
natural numerical scale; they can only be classified into one of a group of categories.
A designed experiment is a data collection method where the researcher exerts full
control over the characteristics of the experimental units sampled. These experiments
typically involve a group of experimental units that are assigned the treatment and an
untreated (or control) group. (The comparison can also be between 2 different treatment groups.)
An observational study is a data collection method where the experimental units
sampled are observed in their natural setting. No attempt is made to control the
characteristics of the experimental units sampled. (E.g., surveys.)
If we wish to infer something from sample data, the sample should be a representative
sample: a sample that exhibits characteristics typical of those possessed by the
population of interest.
How do we get a representative sample?
A random sample of n experimental units is a sample selected from the population in
such a way that every different sample of size n has an equal chance of selection.
Types of Error:
Selection bias results when a subset of the experimental units in the population is
excluded so that these units have no chance of being selected in the sample.
Nonresponse bias results when the researchers conducting a survey or study are
unable to obtain data on all experimental units selected for the sample.
Measurement error refers to inaccuracies in the values of the data recorded. In
surveys, this kind of error may be due to ambiguous or leading questions and the
interviewer's effect on the respondent.
A class is one of the categories into which qualitative data can be classified.
The class frequency is the number of observations in the data set that fall into a
particular class.
The class relative frequency is the class frequency divided by the total number of
observations in the data set.
The class percentage is the class relative frequency multiplied by 100.
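The three definitions above can be sketched in R as follows. The vector `eye_color` and its values are made up purely for illustration (they are not from the notes); `table()` and `prop.table()` are base R functions.

```r
# Hypothetical qualitative data (illustration only): eye color of 10 people
eye_color <- c("brown", "blue", "brown", "green", "brown",
               "blue", "brown", "blue", "green", "brown")

class_freq <- table(eye_color)        # class frequency: count per class
class_rel  <- prop.table(class_freq)  # class relative frequency: count / n
class_pct  <- 100 * class_rel         # class percentage

print(class_freq)
print(class_rel)
print(class_pct)
```

The relative frequencies always sum to 1 and the percentages to 100, which makes a quick sanity check on any frequency table.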
Ways to represent qualitative data:
- Bar graph
- Pie Chart
- Pareto diagram: a bar graph with the classes (bars) arranged in decreasing order of frequency, from left to right.
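A rough sketch of the Pareto ordering in R, assuming the class frequencies are already known. The vector `defects` and its counts are hypothetical; `sort()` and `barplot()` are base R.

```r
# Hypothetical class frequencies (illustration only)
defects <- c(scratch = 12, dent = 30, crack = 5, stain = 18)

# Pareto ordering: classes sorted by frequency, largest first
pareto <- sort(defects, decreasing = TRUE)
print(pareto)

# Draw the bars only in an interactive session, so the script also runs in batch mode
if (interactive()) {
  barplot(pareto, ylab = "Frequency", main = "Pareto diagram")
}
```

Calling `barplot(defects)` on the unsorted vector gives an ordinary bar graph; the sorting step is the only difference.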
Ways to represent quantitative data:
- A stem-and-leaf display presents data in a convenient format. We'll take the stem to
be the portion of the value left of the decimal point, and the rest (to the right of the
decimal point) is called the leaf. The stems are listed in one column, and the leaf for
each observation in another column.
E.g., a data set of these numbers: 31.5, 31.7, 33.6, 35.0, 37.1, 37.2, 37.8 would be
shown as ( | represents the decimal point):
31 | 57
33 | 6
35 | 0
37 | 128
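As a minimal sketch, the display for the seven example values can be built by hand in R, splitting each value at the decimal point exactly as described above. The variable names are illustrative. Base R also provides `stem()`, but it chooses its own stems and may group the values differently.

```r
# The seven values from the example above
x <- c(31.5, 31.7, 33.6, 35.0, 37.1, 37.2, 37.8)

stems  <- floor(x)                  # part left of the decimal point
leaves <- round((x - stems) * 10)   # first digit right of the decimal point

# Paste together the leaves that share a stem
display <- tapply(leaves, stems, paste, collapse = "")

cat(paste(names(display), "|", display), sep = "\n")
# 31 | 57
# 33 | 6
# 35 | 0
# 37 | 128
```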
Chebyshev's Rule: For any data set and any number k > 1, at least (1-1/k^2) of the
measurements will fall within k standard deviations of the mean.
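To get a feel for the bound (1-1/k^2), usually credited to Chebyshev, it can be tabulated for a few values of k. The helper name `chebyshev_bound` is made up for this sketch.

```r
# Lower bound on the fraction of measurements within k standard deviations
# (hypothetical helper, not part of base R)
chebyshev_bound <- function(k) {
  stopifnot(all(k > 1))   # the rule is stated for k > 1
  1 - 1 / k^2
}

k <- c(1.5, 2, 3)
print(data.frame(k = k, at_least = chebyshev_bound(k)))
```

For k = 2 the bound is 0.75 and for k = 3 it is about 0.889, so at least 75% and about 89% of the measurements lie within 2 and 3 standard deviations of the mean, whatever the shape of the distribution.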
The Empirical Rule: For data sets with frequency distributions that are mound shaped and
symmetric (like a bell curve, so the mean, median and mode are roughly the same):
- Approximately 68% of the measurements will fall within 1 standard deviation of the mean.
- Approximately 95% of the measurements will fall within 2 standard deviations of the mean.
- Approximately 99.7% of the measurements will fall within 3 standard deviations of the mean.
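One way to see where the three percentages come from is to take the idealized case of an exactly normal (bell-shaped) distribution. That is an assumption of this sketch, since real data are only approximately mound shaped. The base R function `pnorm()` gives the area under the standard normal curve to the left of a point.

```r
k <- 1:3

# Area under the standard normal curve between -k and +k
within_k <- pnorm(k) - pnorm(-k)

print(round(100 * within_k, 1))   # 68.3 95.4 99.7
```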
Rats run through a maze. Thirty times are recorded and stored in a file, RATMAZE.
We wish to determine what percentage of measurements fall within xbar +/- s, xbar +/- 2s,
and xbar +/- 3s.
> ratmaze ratmaze
> str(ratmaze)
'data.frame': 30 obs. of 1 variable:
 $ RUNTIME: num 1.97 1.74 3.77 0.6 2.75 5.36 4.02 3.81 1.06 3.2 ...
> ratmaze$RUNTIME
 1.97 1.74 3.77 0.60 2.75 5.36 4.02 3.81 1.06 3.20 9.70 1.71 1.15 8.29 2.47 6.06
 5.63 4.25 4.44 5.21 1.93 2.02 4.55 5.15 3.37 7.60 2.06 3.65 3.16 1.65
> # This is a comment.
> var(ratmaze$RUNTIME)
> # That's how you can find the variance
> sd(ratmaze$RUNTIME)
> # That's the standard deviation
> # To check that sd is the square root of the variance:
> sqrt(var(ratmaze$RUNTIME))
> # It's all good!
> # To avoid retyping all that, name your values:
> xbar <- mean(ratmaze$RUNTIME)
> s <- sd(ratmaze$RUNTIME)
> # So let's compute:
> xbar + sd
Error in xbar + sd : non-numeric argument to binary operator
> # oops
> xbar + s
> xbar - s
> # So now, we ask how many measurements (runtimes) fall within (xbar - s, xbar + s)
> # Let's just see the measurements that are less than xbar + s
> ratmaze$RUNTIME[ratmaze$RUNTIME < xbar + s]
 1.97 1.74 3.77 0.60 2.75 5.36 4.02 3.81 1.06 3.20 1.71 1.15 2.47 5.63 4.25 4.44
5.21 1.93 2.02 4.55 5.15 3.37 2.06 3.65 3.16 1.65
Frequency distribution: a graph of the frequency of measurements (e.g., a histogram).
Graphically: the median divides the graph into two equal areas; the mean is the
balancing point (a little trickier to visualize).
*** We can use this in R to check the Empirical Rule:
> ratmaze$RUNTIME[ratmaze$RUNTIME < xbar + s & ratmaze$RUNTIME > xbar - s]
 1.97 1.74 3.77 2.75 5.36 4.02 3.81 3.20 1.71 2.47 5.63 4.25 4.44 5.21
 1.93 2.02 4.55 5.15 3.37 2.06 3.65 3.16 1.65
> length(ratmaze$RUNTIME[ratmaze$RUNTIME < xbar + s & ratmaze$RUNTIME > xbar - s])
> # percentage of the values between xbar +/- s
> # ~77% of the values fall within one standard deviation of the mean. GREAT!
> length(ratmaze$RUNTIME[ratmaze$RUNTIME < xbar + 2*s & ratmaze$RUNTIME > xbar - 2*s])
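The same check can be written compactly for 1, 2 and 3 standard deviations at once. This is only a sketch: the vector `runtime` is typed in from the 30 values printed earlier in these notes so that the script does not need the RATMAZE file, and the names `runtime` and `counts` are chosen for illustration.

```r
# The 30 run times printed earlier in the notes
runtime <- c(1.97, 1.74, 3.77, 0.60, 2.75, 5.36, 4.02, 3.81, 1.06, 3.20,
             9.70, 1.71, 1.15, 8.29, 2.47, 6.06, 5.63, 4.25, 4.44, 5.21,
             1.93, 2.02, 4.55, 5.15, 3.37, 7.60, 2.06, 3.65, 3.16, 1.65)

xbar <- mean(runtime)
s    <- sd(runtime)

# For each k, count the measurements strictly inside (xbar - k*s, xbar + k*s)
k <- 1:3
counts <- sapply(k, function(kk) sum(abs(runtime - xbar) < kk * s))

print(data.frame(k = k, count = counts,
                 percent = round(100 * counts / length(runtime), 1)))
```

With these values the counts are 23, 28 and 30 out of 30, i.e. about 77%, 93% and 100%, which is in rough agreement with the 68%, 95% and 99.7% of the Empirical Rule for a sample this small.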
For any set of n measurements (arranged in order), the pth percentile is a number
such that p% of measurements fall below it, and (100-p)% fall above it.
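As a hedged illustration of this definition, the sketch below computes the 90th percentile of the 30 maze run times and then checks what fraction of the measurements falls below and above it. The vector `runtime` is typed in from the values printed earlier in the notes. `quantile()` is base R and, by default, interpolates between neighboring ordered values, so for small samples the fractions match p% and (100-p)% only approximately in general.

```r
# The 30 run times printed earlier in the notes
runtime <- c(1.97, 1.74, 3.77, 0.60, 2.75, 5.36, 4.02, 3.81, 1.06, 3.20,
             9.70, 1.71, 1.15, 8.29, 2.47, 6.06, 5.63, 4.25, 4.44, 5.21,
             1.93, 2.02, 4.55, 5.15, 3.37, 7.60, 2.06, 3.65, 3.16, 1.65)

# 90th percentile, using R's default interpolation rule
p90 <- quantile(runtime, probs = 0.90)
print(p90)

below <- mean(runtime < p90)   # fraction of measurements below the percentile
above <- mean(runtime > p90)   # fraction of measurements above it
print(c(below = below, above = above))
```

Here 27 of the 30 values lie below the 90th percentile and 3 lie above it, so the fractions come out as 0.9 and 0.1.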
Quartiles partition the data set into 4 categories, each containing 25% of the
measurements.
Lower quartile (Ql): the 25th percentile
Middle quartile (Qm or M): the 50th percentile, aka the median
Upper quartile (Qu): the 75th percentile
> quantile(ratmaze$RUNTIME)
0% 25% 50% 75% 100%
0.6000 1.9825 3.5100 5