Statistics Notes: Chapters 1 – 7.1
Chapter 1: Introduction to Statistical Data
1.1 Why Study Statistics?
- Statistics can be defined as the art of collecting, classifying, interpreting, and reporting numerical information
related to a particular subject.
- Population: The total set of objects or measurements that are of interest to a decision maker.
- Descriptive Statistics: Focused on summarizing and presenting information.
Examples include pie graphs and bar graphs comparing the populations of different provinces.
- Data set: The set of all observations for a given project or purpose.
- Inferential Statistics: Goes beyond the data set that is at hand.
Designed for making estimates, or inferences, about the characteristics of a
population, based on information found in one or more samples.
- Samples: Representative subsets of the population.
Two main types of problems call for inferential statistics:
1) Estimate a characteristic of a population, based on data from a sample.
a. Example: Survey finds that 52% of questioned voters support a certain piece of legislation. Estimate the
proportion of all voters that support the legislation.
2) Test a claim or assumption about a characteristic of a population, based on data from the sample.
a. Example: Report claims that “a majority” of voters support a certain piece of legislation. A survey finds
that 52% of questioned voters support that legislation.
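The two problem types above can be illustrated with a minimal Python sketch. The sample size of 1,000 and the count of 520 supporters are assumptions chosen to match the 52% figure; the notes do not give a sample size. The simulation is only an informal illustration of why a sample result of 52% does not by itself settle a claim of “a majority”; it is not a formal test.

```python
import random

def sample_proportion(responses):
    """Fraction of True values in a list of yes/no responses."""
    return sum(responses) / len(responses)

# Hypothetical survey (assumed numbers): 1,000 voters questioned, 520 in favour.
survey = [True] * 520 + [False] * 480

# Problem type 1: the sample proportion is the estimate of the population proportion.
p_hat = sample_proportion(survey)

# Problem type 2: how often would an evenly split population (50% support)
# produce a sample showing 52% or more, purely by chance?
rng = random.Random(1)  # fixed seed so the run is repeatable
trials = 2000
n = len(survey)
at_least_52 = 0
for _ in range(trials):
    simulated = [rng.random() < 0.5 for _ in range(n)]
    if sample_proportion(simulated) >= 0.52:
        at_least_52 += 1
chance_rate = at_least_52 / trials

print(f"Estimated support among all voters: {p_hat:.0%}")
print(f"Share of 50/50 samples reaching 52% or more: {chance_rate:.1%}")
```

If the second figure is not small, a 52% sample result could plausibly come from a population without a majority, which is why the claim has to be tested rather than read directly off the sample.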
1.2 Populations, Samples, and Inference
- Infinite population: In practice, it is not realistically possible to identify and count every member of the
population (for example, all cars for sale anywhere in the world, including on the internet).
- Finite population: It is possible and realistic to count every member of the population.
- Census: A set of observations taken from every member of a population.
o Example: Statistics Canada takes a census of all households in Canada.
o Possible only for a finite population.
- Parameter: A value for summarizing the measurements of a quantifiable feature of the population.
Example: 9 out of 10, where 10 is the total size of the population.
- Statistic: A value for summarizing the measurements in a sample. We estimate or infer that the sample statistic is “close to” or “approximately equal to” the population
parameter.
Example: 1 out of 12 in the sample, so we estimate about 8 out of 100 in the population.
- Several valid reasons for using samples:
1) Time constraints: It may be too time-consuming to survey or observe all the members of the
population. Several possible reasons for this include:
a. Decision Deadlines: Sampling allows people to make their close/don’t-close decisions for
each situation in a timely fashion and to post necessary signs.
b. Trends: For example, if you’re expanding your business internationally, it would take a long time
to conduct a census, and the population parameters would be changing even as you collect the
data.
c. Seasonality: Some variables increase and decrease in value over time in a cyclical pattern.
2) Cost Constraints: It costs less to survey fewer people or make fewer observations than to survey or
observe a whole population.
3) Unknown Population Size: Infinite populations are in this category.
4) Destructive tests: When testing destroys the item being tested, a census would destroy the entire population.
5) Greater accuracy of some samples: Can possibly yield more accurate results than a census.
- A good sample should be representative of the population that it is drawn from.
1.3 Data Types and Levels of Measurement
- Datum: A single observation.
- Data: Raw materials for analysis. They are sets of numeric or nonnumeric facts that represent records of
observations.
o Constant: An observed characteristic whose value does not change over time or from one observation
to another.
o Variable: An observed characteristic whose values can change from observation to
observation.
Random variable: A variable whose value arises by chance or is unplanned, and therefore
cannot be known in advance, prior to its discovery by observation or experiment.
Qualitative (categorical) random variable: Does not result in a numerical value.
Observes a trait or characteristic that can be classified into one of a number of
categories. (Ex. Age group, such as child, adult or senior.)
Quantitative (numerical) random variable: A random variable that takes values that vary in
magnitude. (Ex. Number of hits at a ball game.)
Discrete random variable: There are restrictions on the values that could be
observed. (Ex. When counting cars, you cannot have an in-between number
such as 1.5 cars.)
Continuous random variable: Can possibly assume any value over a
particular range of values. (Ex. A person’s height.)
- If a measuring instrument is capable of precision only to 4 significant digits, then in practice there is a small gap
between the measured values that could be observed.
Levels of Measurement
- The higher the level of measurement of data, the more information these data contain.
- Nominal-Level Measurement: The lowest level of measurement.
o Qualitative data are generally measured at this level, unless the data values
imply some kind of ranking.
o Example: Mines classified as surface or underground.
o There is no order.
o Two important rules must be observed:
The full set of categories should cover all possible outcomes.
All of the categories should be mutually exclusive.
- Ordinal-Level Measurement: The second level of measurement; conveys more information.
o Data are both categorized and ranked.
o Example: Gold, Silver, Bronze.
o The mathematical differences between rank numbers have no interpretation;
only the relative order of the numbers conveys information.
- Interval-Level Measurement: These numbers can be compared meaningfully to distinguish higher from lower
values.
o Equal distances on the measurement scale correspond to equal differences between the real-
world characteristics being measured.
o Example: Temperature
o The relative orders of the numbers and the mathematical difference between
the numbers convey information.
o The ratios between numbers on these scales have no interpretation.
- Ratio-Level Measurement: The relative orders of the numbers, the mathematical differences between the
numbers, and the ratios between numbers on the scale all apply meaningfully to the characteristic being
measured.
o Example: 20 grams, 40 grams, 60 grams.
o This last property of meaningful ratios can apply only when a scale originates at
a “true” zero.
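The difference between interval-level and ratio-level scales can be checked numerically. The sketch below is an illustration under assumed values (the temperatures and the approximate grams-per-ounce factor are not from the notes): a ratio of masses stays the same when the unit changes, while a ratio of temperatures does not, even though equal temperature differences stay equal.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a Celsius reading to Fahrenheit."""
    return celsius * 9 / 5 + 32

GRAMS_PER_OUNCE = 28.35  # approximate conversion factor (assumed for illustration)

# Ratio level: mass starts at a true zero, so ratios survive a change of units.
mass_ratio_grams = 40 / 20
mass_ratio_ounces = (40 / GRAMS_PER_OUNCE) / (20 / GRAMS_PER_OUNCE)

# Interval level: Celsius and Fahrenheit each have an arbitrary zero point,
# so the ratio between two readings depends on the scale used.
temp_ratio_celsius = 20 / 10
temp_ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Equal differences remain equal on either temperature scale.
step_low = celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)
step_high = celsius_to_fahrenheit(30) - celsius_to_fahrenheit(20)

print("Mass ratio (grams, ounces):", mass_ratio_grams, mass_ratio_ounces)
print("Temperature ratio (C, F):", temp_ratio_celsius, temp_ratio_fahrenheit)
print("Fahrenheit size of two equal 10-degree Celsius steps:", step_low, step_high)
```

Here 20 °C and 10 °C convert to 68 °F and 50 °F, so the ratio drops from 2 to 68/50 = 1.36. That is why “twice as hot” has no interpretation on these scales, while “twice as heavy” does.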
Chapter 2: Obtaining the Data
2.1 Sources of Data
- Scientific approach: The search for data begins with the recognition of a problem to investigate.
1) Determine the objectives of the research
2) Determine the sources of the data
3) Design the data-gathering instruments
4) Develop a sampling plan
5) Collect and analyze the data
6) Report and follow up on the findings.
Types of Research
1) Exploratory Research: Preliminary research, conducted when very little is known as yet about the problem.
1. Example: Pilot studies or focus groups.
2. These projects or interactive sessions can help in better formulating the research
problems, fine-tuning the questions in survey instruments.
2) Conclusive research Can be conducted when the research objectives are clear and the problems are
i. Descriptive Research: Used to describe the characteristics of a population. The goal is to make
predictions or inferences about a population.
ii. Causal Research Designs (experimental research designs): Go beyond describing a population as
it is, to exploring possible cause-and-effect relationships among the observed factors.
Types of Data
- Primary Sources
o Observation (personal, mechanical)
o Survey (personal interview, mail)
o Experiment (laboratory, test market)
- Secondary Sources
o Internal (accounting records, previous studies)
o External (government reports, published research)
- Primary data: Not available prior to the research.
o Observation: Involves actually observing what is happening in order to gather the data.
o Surveys: Obtain data by asking people questions about their experiences, opinions, preferences,
backgrounds, and many other variables.
o Experiments: Formal testing of a causal hypothesis requires an experiment.
Not always feasible in studies of human behaviour or in studies where human beings are
affected by the outcomes.
- Secondary data: Previously collected data that you use as input to your own research.
o May not exactly match your own objectives.
o Internal secondary data: Originally collected in one’s own organization for other purposes.
o External secondary data: Originally collected outside of one’s own organization.
2.2 Designing a Sampling Plan
- Two basic reasons for sampling:
To estimate the values of characteristics in a population based on their values in the sample.
To evaluate previous assumptions or hypotheses that have been made about a population.
- Statistical Sampling (probability sampling): Uses random selection to best ensure that the collected samples
are representative of the population.
o If sampling is conducted using a random selection, then the likelihood or probability of obtaining a non-
representative sample can be estimated and taken into account.
- Sampling frame: A list of all individuals or objects from which the sample will be drawn.
- Simple random sampling: A sample is selected in such a way that:
Each object in the sampling frame has an equal chance of being selected.
Each possible combination of objects of a given size has an equal chance of being
selected from the sampling frame.
- Stratified Random Sampling: Used when the population contains subpopulations that are relatively small and you want to
ensure that all subpopulations have reasonable representation in the sample.
- Simple Cluster Sampling: The population is divided into clusters or groups.
- Systematic Sampling: When a list (such as names ordered by last name) is available, the procedure is to determine a
sampling interval (k).
Next, randomly choose a starting position in the list between 1 and the kth position on
the list, then select every kth item from that point on.
Example: There are 54 fish, and she decides to collect a sample of 6. k = 54/6 = 9. She starts at position 4
and then selects every 9th fish.
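Simple random sampling and systematic sampling can be sketched in a few lines of Python. This is a minimal illustration, assuming a sampling frame of 54 numbered fish so that the interval works out to k = 54/6 = 9 as in the example's arithmetic; the function name `systematic_sample` and the ID numbers are hypothetical.

```python
import random

def systematic_sample(frame, sample_size, start=None, rng=random):
    """Select every k-th item after a random start between 1 and k (1-based)."""
    k = len(frame) // sample_size            # sampling interval
    if start is None:
        start = rng.randint(1, k)            # random position among the first k
    positions = range(start, len(frame) + 1, k)
    # Slice in case the frame size is not an exact multiple of the sample size.
    return [frame[pos - 1] for pos in positions][:sample_size]

# Hypothetical sampling frame: fish numbered 1 to 54.
sampling_frame = list(range(1, 55))

# Simple random sample: every combination of 6 fish is equally likely.
rng = random.Random(42)                      # fixed seed so the run is repeatable
simple_random = rng.sample(sampling_frame, k=6)

# Systematic sample reproducing the worked example: start at position 4, k = 9.
systematic_fixed = systematic_sample(sampling_frame, 6, start=4)

# Systematic sample with a randomly chosen starting position.
systematic_random = systematic_sample(sampling_frame, 6, rng=rng)

print("Simple random sample:", sorted(simple_random))
print("Systematic sample from position 4:", systematic_fixed)
print("Systematic sample, random start:", systematic_random)
```

Starting at position 4 with k = 9 selects positions 4, 13, 22, 31, 40 and 49. Note that only the starting position is random in systematic sampling, whereas `random.sample` draws every member of the sample at random without replacement.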
Nonstatistical Sampling techniques
- Data can be collected in many ways without using the randomizing strategies that are fundamental for statistical sampling.
- From the viewpoint of quantitative statistics, these alternatives are, at best, compromises to be used with
caution and, at worst, error-prone techniques.
- These methods can play a role in qualitative statistics, which aims for a deeper view of how people are thinking
in certain contexts and how they relate their ideas.
Convenience Sampling
- Example: A crowd mingles at a big sports playoff and a reporter questions some individuals. There is no way to know whether these
convenient-to-get samples are representative of the entire population.
Judgment Sampling
- Has many variations, with names like critical case sampling, typical case sampling and extreme case sampling.
- These particular individuals can provide some key information that might be missed or diluted in a broader
sample.
- Only as good as the judgment of the researcher.
Quota Sampling
- A special case of judgment sampling.
- With quota sampling, the researcher deliberately chooses a number of individuals to question from each gender, income
group, and so on, based on expectations of their exposure to the ad being studied.
Errors associated with sampling
Sampling Error
- Best defined as the difference between the information in the sample and the information in the population
that occurs simply because the sample is a subset of the population.
Nonsampling Error
- The difference between information in the sample and information in the population that is due to missing data
or incorrect measurement.
- Can also be called measurement error.
Nonresponse Error
- A different class of error.
- Many people who receive questionnaires choose not to reply, or refuse to take part in a telephone survey.
- Such response patterns could bias the results; that is, cause the collected data to tend misleadingly more toward
one possible conclusion than another.
Chapter 3: Displaying Data Distributions
3.1 Constructing Distribution Tables
- Univariate descriptive statistics: Descriptive statistics for one variable.
- Descriptive statistics: Characterize available data in terms of patterns, clusters and other features.
- Clean data: Recorded numbers or computer codes must be correct.
- Outliers: Values that appear remote from all or most of the other values for that variable.
- Frequency distribution: Counts how many times (the frequency) a specific data value appears in
the data set; or, if the data values are grouped into ranges, counts how many values fall into each
range.
Discrete Quantitative Frequency Table
- Point spread data are quantitative and discrete, because the projections are limited to specific numeric values –
those ending in “.0” or “.5”.
- Any specific data value that occurred can be examined to determine how often it appeared in the overall data
set.
- Percentage of x = (frequency of x / n) × 100%, where n is the total number of data values.
- Cumulative percent frequency distribution: The cumulative percent is the percentage of all cases having
values up to or including the value displayed at the left of that row.
- Cumulative percent for the first data row = percent frequency for that row
- Cumulative percent for any subsequent row = cumulative percent for the previous row + percent frequency for
the current row.
Continuous Quantitative Frequency Tables
- When the data are continuous, the number of possible values is potentially infinite.
- Values displayed in a continuous quantitative frequency table are grouped into ranges (called classes).
- The frequencies in the table represent how many individual data values fall into each of the classes.
- The basic principles of constructing and interpreting this table are the same as for a discrete frequency table.
Guidelines for Manually Constructing the tables
1) The displayed classes should be mutually exclusive: Any specific data value should fall into one, and only one,
of the classes.
2) The classes should be collectively exhaustive: There is a class to which every data value can be assigned.
3) Try to use the same class width for all classes.
4) Try to include all classes, even if the frequency is zero.
5) Select convenient numbers for class limits: Select limits that appear to be reasonable and easy to interpret yet
are consistent with the other guidelines.
6) Choose an appropriate number of classes: Typically the number of classes is between 5 and 15, depending on
the amount of data available.
7) The sum of the class frequencies must equal the number of original data values: This serves as a check on your
work.
8) Combine these guidelines with a touch of common sense.
Steps for Manually Constructing Tables from Raw Data
1) Sort the data: The first step is to put the data into an ordered array, which means that the data are arranged
in either ascending order (lowest to highest) or descending order (highest to lowest).
a. We can also immediately identify the range of the full data set, defined as the spread or difference
between the highest and lowest values.
2) Choose a tentative number of classes: As a general rule, more classes are needed when more raw data are
available.
a. Sturges’ Rule: Estimated number of classes = 1 + 3.3 log10(n)
b. n = the number of values in the data set.
c. Example: Given a data set of 100 points, the formula gives 1 + 3.3 log10(100) = 1 + 3.3 × 2 = 7.6, which rounds to 8 classes.
3) Choose a tentative class width: Defined as the difference between the lower limit of one class and the lower
limit of the next class.
a. Tentative class width = Range of full data set / tentative number of classes.
4) Define all classes: Considering the results from steps 1-3 and keeping the guidelines in mind, establish specific limits
for all of the classes.
a. Once you have determined the lower limit for the first class, each successive class begins exactly one class
width higher in value.
b. Midpoint = (Lower limit of class + Upper limit of class) / 2
5) Construct the table: Group and count the data to determine the total count of values assigned to each
class.
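The five steps above can be sketched in Python. This is a minimal illustration, not a definitive procedure: the 20 data values are hypothetical, the helper names `sturges_classes` and `build_frequency_table` are made up for this sketch, and each class is assumed to run from its lower limit up to (but not including) the next lower limit so that the classes stay mutually exclusive.

```python
import math

def sturges_classes(n):
    """Tentative number of classes: 1 + 3.3 * log10(n), rounded to the nearest integer."""
    return round(1 + 3.3 * math.log10(n))

def build_frequency_table(data, first_lower, width):
    """Return rows of (lower limit, upper limit, midpoint, frequency)."""
    ordered = sorted(data)                   # step 1: ordered array
    rows = []
    lower = first_lower
    while lower <= ordered[-1]:              # keep adding classes until the maximum is covered
        upper = lower + width                # next class starts exactly one width higher
        frequency = sum(1 for x in ordered if lower <= x < upper)
        rows.append((lower, upper, (lower + upper) / 2, frequency))
        lower = upper
    return rows

# Hypothetical raw data (20 values).
data = [31, 12, 45, 22, 17, 38, 25, 52, 28, 15,
        41, 33, 24, 48, 21, 36, 27, 43, 30, 34]

n = len(data)
data_range = max(data) - min(data)                  # step 1a: range of the full data set
tentative_classes = sturges_classes(n)              # step 2: Sturges' rule
tentative_width = data_range / tentative_classes    # step 3: tentative class width

# Step 4: round the tentative width up to a convenient value (10) and start at 10.
table = build_frequency_table(data, first_lower=10, width=10)

# Guideline 7: the class frequencies must add up to the number of data values.
assert sum(row[3] for row in table) == n

print(f"n = {n}, range = {data_range}, tentative classes = {tentative_classes}, "
      f"tentative width = {tentative_width}")
for lower, upper, midpoint, frequency in table:     # step 5: the finished table
    print(f"{lower} to under {upper}: midpoint {midpoint}, frequency {frequency}")
```

The tentative width here is 8, but the sketch uses a width of 10 and a first lower limit of 10 because those limits are easier to read, in line with the guideline on convenient class limits.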
Qualitative Frequency Table
- Qualitative data: Nonnumerical in character.
- Qualitative frequency distribution table can help you visualize how the data values are distributed and shows
how often each particular value or category of the qualitative variable has occurred.
- The column of frequencies in a qualitative frequency distribution table can be supplemented by additional
columns for percent and cumulative percent frequencies.
- Percentage of x = (frequency of x / n) × 100%, where n is the total number of data values.
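A frequency table with percent and cumulative percent columns can be built as in the following Python sketch. The survey responses and the function name `frequency_table` are hypothetical, and listing the categories alphabetically is an arbitrary choice made for this illustration.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (category, frequency, percent, cumulative percent)."""
    n = len(values)
    counts = Counter(values)
    rows = []
    cumulative = 0.0
    for category in sorted(counts):              # alphabetical order, chosen arbitrarily
        frequency = counts[category]
        percent = frequency * 100 / n            # (frequency / n) x 100%
        cumulative += percent                    # previous cumulative + current percent
        rows.append((category, frequency, percent, cumulative))
    return rows

# Hypothetical responses to "How do you usually commute?" (20 respondents).
responses = ["Car"] * 8 + ["Bus"] * 6 + ["Bike"] * 4 + ["Walk"] * 2

rows = frequency_table(responses)
for category, frequency, percent, cumulative in rows:
    print(f"{category:<5} {frequency:>3} {percent:>6.1f}% {cumulative:>6.1f}%")
```

The cumulative percent in the last row should always reach 100%, which gives a quick check on the table. For nominal categories the cumulative column depends on the order in which the categories are listed, so it is most informative when the categories have a natural ranking.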
Constructing Qualitative Frequency Tables
1) Define and code the categories: For fixed-answer variables, this step occurs before the data are collected.
i. It is generally advisable not to have too many categories, which would thin out the frequencies
for particular choices.
ii. The advantage of open-ended variables is that they may elicit some unexpected responses that add to the
researcher’s understanding of the variable.
2) Construct the table: Constructing the qualitative frequency table is in practice very similar to constructing a
discrete quantitative frequency table.
3.2 Graphing Quantitative Data
- A graph is especially useful for revealing the shape of the distribution of quantitative data, which (if the future is
like the past) may give a sense of approximately what values the data tend “naturally” to fall into.
- Histogram: A graph that uses the lengths of bars to represent the frequency of values in each class of a frequency
distribution.
- Percent Frequency histogram analogous to converting the frequencies in a freque