# Statistics 151 - Notes - Chapters 1-15.docx

University of Alberta

Statistics

STAT151

Helen Frost

Winter

Description

Clinton Richardson Home
Statistics 151 – Introduction to Applied
Statistics
Course Outline
http://www.stat.ualberta.ca/statslabs/stat151/i
https://app.tophat.com/login
Chapter 1: Stats Starts Here
Chapter 2: Data
Chapter 3: Displaying and Describing Categorical Data
Chapter 4: Displaying and Summarizing Quantitative Data
Chapter 5: Understanding and Comparing Distributions
Chapter 6: The Standard Deviation as a Ruler and the Normal Model
Chapter 7: Scatterplots, Association, and Correlation
Chapter 8: Linear Regression
Chapter 9: Regression Wisdom
Chapter 10: Re-expressing Data: Get It Straight!
Chapter 11: Understanding Randomness
Chapter 12: Sample Surveys
Chapter 13: Experiments and Observational Studies
Chapter 11-13: Gathering Data (Assigned Reading)
Chapter 14: From Randomness to Probability
Chapter 15: Probability Rules!
Chapter 16: Random Variables
Chapter 17: Probability Models
Chapter 18: Sampling Distribution Models
Chapter 19: Confidence Intervals for Proportions
Chapter 20: Testing Hypotheses about Proportions
Chapter 21: More about Tests
Chapter 22: Comparing Two Proportions
Chapter 23: Inferences about Means
Chapter 24: Comparing Means
Chapter 25: Paired Samples and Blocks
Chapter 26: Comparing Counts
Chapter 27: Inferences for Regression
Chapter 28: Analysis of Variance
Chapter 11: Understanding Randomness
Textbook Notes
Terms
1. Random – an outcome is random if we know the possible values it can have, but
not which particular value it takes
2. Generating random numbers – random numbers are hard to generate.
Nevertheless, several web sites offer an unlimited supply of equally likely random
values
3. Simulation – a simulation models a real-world situation by using random-digit
outcomes to mimic the uncertainty of a response variable of interest
4. Trial – the sequence of several components representing events that we are
pretending will take place
5. Simulation component – a component uses equally likely random digits to model
simple random occurrences whose outcomes may not be equally likely
6. Response variable – values of the response variable record the results of each
trial with respect to what we were interested in.
It’s Not Easy Being Random
• The best ways we know to generate data that give a fair and accurate picture of
the world rely on randomness, and how we draw conclusions from those data
depends on randomness too
A Simulation
• We call each time we obtain a simulated answer to our question a trial
• In fact, the reason we’re doing the simulation is that it’s hard to guess how many
boxes we’d expect to open
Building a Simulation
• Opening one box is the basic building block, called a component of our
simulation
1. Identify the component to be repeated
2. Explain how you will model the component’s outcome
3. State clearly what the response variable is
4. Explain how you will combine the components into a trial to model the
response variable
5. Run several trials
6. Collect and summarize the results of all the trials
7. State your conclusion
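The seven steps above can be sketched in code. This is only an illustration under assumed numbers: the notes do not say what is inside the boxes, so the three cards and their probabilities (`CARD_PROBS`) are hypothetical, as are the function names.

```python
import random

# Hypothetical setup (not from the notes): each box holds one of three
# cards, with these assumed probabilities.
CARD_PROBS = {"A": 0.2, "B": 0.3, "C": 0.5}

def open_box(rng):
    """Component: open one box and return the card found inside."""
    cards = list(CARD_PROBS)
    weights = list(CARD_PROBS.values())
    return rng.choices(cards, weights=weights, k=1)[0]

def run_trial(rng):
    """Trial: keep opening boxes until every card has been seen.

    The response variable is the number of boxes opened.
    """
    seen = set()
    boxes = 0
    while len(seen) < len(CARD_PROBS):
        seen.add(open_box(rng))
        boxes += 1
    return boxes

def simulate(n_trials, seed=151):
    """Run several trials and collect the response values."""
    rng = random.Random(seed)
    return [run_trial(rng) for _ in range(n_trials)]

results = simulate(1000)
mean_boxes = sum(results) / len(results)
print(f"Average boxes opened over {len(results)} trials: {mean_boxes:.2f}")
```

The printed average is the summary of all the trials; the conclusion would be stated in terms of the original question, for example how many boxes we expect to open.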
What Have We Learned?
• We’ve learned to harness the power of randomness. We’ve learned that a
simulation model can help us investigate a question for which many outcomes
are possible, for which we can’t (or don’t want to) collect data, and for which a
mathematical answer is hard to calculate. We’ve learned how to base our
simulation on random values generated by a computer, generated by a
randomizing device such as a die or spinner, or found on the internet. Like all
models, simulations can provide us with useful insights about the real world.
Chapter 12: Sample Surveys
Textbook Notes
Terms:
1. Population – the entire group of individuals or instances about whom we hope to
learn
2. Sample – a (representative) subset of a population, examined in hope of learning
about the population
3. Sample survey – a study that asks questions of a sample drawn from some
population in the hope of learning something about the entire population
4. Bias – any systematic failure of a sampling method to represent its population is
bias. It is almost impossible to recover from bias, so efforts to avoid it are well
spent
5. Randomization – the best defence against bias is randomization, in which each
individual is given a fair, random chance of selection
6. Sample size – the number of individuals in a sample. The sample size
determines how well the sample represents the population, not the fraction of the
population sampled
7. Census – a sample that consists of the entire population is called a census
8. Population parameter – a numerically valued attribute of a model for a population
9. Statistic, sample statistic – statistics are values calculated for sampled data.
Those that correspond to, and thus estimate, a population parameter are of
particular interest
10. Representative – a sample is said to be representative if the statistics computed
from it accurately reflect the corresponding population parameters
11. Simple random sample (SRS) – a simple random sample of sample size (n) is a
sample in which each set of (n) elements in the population has an equal chance
of selection
12. Sampling frame – a list of individuals from whom the sample is drawn is called
the sampling frame. Individuals who may be in the population of interest, but who
are not in the sampling frame, cannot be included in any sample
13. Sampling variability – the natural tendency of randomly drawn samples to differ,
one from another
14. Stratified random sample – a sampling design in which the population is divided
into several subpopulations, or strata, and random samples are then drawn from
each stratum
15. Cluster sample – a sampling design in which entire groups, or clusters, are
chosen at random
16. Multistage sample – a sampling scheme that involves multiple stages of random
sampling, where at each successive stage, we sample from lists of ever smaller
clusters (hierarchical in nature).
17. Systematic sample – a sample drawn by selecting individuals systematically from
a sampling frame.
18. Pilot – a small trial run of a survey to check whether questions are clear. A pilot
study can reduce errors due to ambiguous questions
19. Voluntary response bias – bias introduced to a sample when individuals can
choose on their own whether to participate in the sample. Samples based on
voluntary response are always invalid and cannot be recovered, no matter how
large the sample size
20. Convenience sample – a convenience sample consists of individuals who are
conveniently available. They often fail to be representative because every
individual in the population is not equally convenient to sample.
21. Undercoverage – a sampling scheme that biases the sample in a way that gives
a part of the population less representation than it has in the population suffers
from undercoverage
22. Nonresponse bias – bias introduced when a large fraction of those sampled fails
to respond. Those who do respond are likely to not represent the entire
population
23. Response bias – anything in a survey design that influences responses falls
under the heading of response bias. One typical response bias arises from the
wording of questions, which may suggest a favoured response.
Idea 1: Examine a Part of the Whole
• We’d like to know about an entire population of individuals, but examining all of
them is usually impractical, if not impossible. So we settle for examining a
smaller group of individuals – a sample – selected from the population
• Polls are examples of sample surveys, designed to ask questions of a small
group of people in the hope of learning something about the entire population
Bias
• Sampling methods that, by their nature, tend to over- or underemphasize
some characteristics of the population are said to be biased
• Solution: They select individuals to sample at random. The importance of
deliberately using randomness is one of the great insights of statistics
Idea 2: Randomize
• Best way is to select at random. Randomizing protects us from the influences
of all the features of our population by making sure that, on average, the sample
looks like the rest of the population
Idea 3: It’s the Sample Size
• The fraction of the population that you’ve sampled doesn’t matter. It’s the sample
size itself that’s important
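A small simulation can make this idea concrete. The sketch below assumes a made-up population in which 40% of individuals would answer "yes"; the function name, population sizes, and repetition counts are all illustrative choices, not values from the course.

```python
import random
import statistics

def sample_proportions(pop_size, n, reps, rng, true_p=0.4):
    """Draw `reps` simple random samples of size n; return the sample proportions.

    Individuals are numbered 0..pop_size-1, and the first true_p fraction
    of them count as "yes" (an assumed, made-up population).
    """
    cutoff = int(true_p * pop_size)
    props = []
    for _ in range(reps):
        sample = rng.sample(range(pop_size), n)
        props.append(sum(i < cutoff for i in sample) / n)
    return props

rng = random.Random(12)
# Same sample size (100), very different sampled fractions (1% vs 0.01%).
spread_small_pop = statistics.stdev(sample_proportions(10_000, 100, 500, rng))
spread_large_pop = statistics.stdev(sample_proportions(1_000_000, 100, 500, rng))
# Larger sample size (400) from the large population.
spread_bigger_n = statistics.stdev(sample_proportions(1_000_000, 400, 500, rng))

print(f"n=100, N=10,000:    spread {spread_small_pop:.3f}")
print(f"n=100, N=1,000,000: spread {spread_large_pop:.3f}")
print(f"n=400, N=1,000,000: spread {spread_bigger_n:.3f}")
```

The first two spreads should come out about the same even though the sampled fractions differ by a factor of 100, while quadrupling the sample size should noticeably shrink the spread.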
Does a Census Make Sense?
• Wouldn’t it be better to just include everyone and “sample” the entire population?
Such a special sample is called a census
• It can be difficult to complete a census. A census takes longer to complete and can be
more complex than sampling
Populations and Parameters
• Models use mathematics to represent reality
• A parameter used in a model for a population is sometimes called a population
parameter
• We use summaries of the data to estimate the population parameters
• Any summary found from the data is a statistic. Sometimes you’ll see the term
sample statistic
• When the statistics we compute from the sample accurately reflect the
corresponding parameters, then the sample is said to be representative
Simple Random Samples
• Each combination of people has an equal chance of being selected as well. A
sample drawn in this way is called a simple random sample (SRS)
• The sampling frame is a list of individuals from which the sample is drawn
• Samples drawn at random generally differ one from another. Each draw of
random numbers selects different people for our sample. These differences lead
to different values for the variables we measure. We call these sample-to-sample
differences sampling variability.
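As a rough sketch of an SRS and of sampling variability, the example below uses a hypothetical sampling frame of 200 student ID numbers and a made-up yes/no variable (`is_commuter`); none of these names or values come from the notes.

```python
import random

# Hypothetical sampling frame: 200 student ID numbers.
sampling_frame = list(range(1, 201))
# Made-up variable for illustration: IDs divisible by 4 are "commuters" (25%).
is_commuter = {sid: sid % 4 == 0 for sid in sampling_frame}

def draw_srs(frame, n, rng):
    """Draw n individuals so that every set of n is equally likely."""
    return rng.sample(frame, n)

rng = random.Random(2024)
srs_props = []
for _ in range(5):
    srs = draw_srs(sampling_frame, 20, rng)
    srs_props.append(sum(is_commuter[sid] for sid in srs) / len(srs))

# Each sample gives its own proportion; the differences between them
# are the sampling variability.
print(srs_props)
```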
Stratified Sampling
• All statistical sampling designs have in common the idea that chance, rather than
human choice, is used to select the sample
• Sometimes the population is first sliced into groups called strata, before the
sample is selected
• We usually try to create these strata so that they are relatively homogeneous
• Sampling is used within each stratum, after which all the results from the various
strata are combined to produce population estimates. This common sampling
technique is called stratified random sampling
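One possible way to code a stratified random sample is shown below. It assumes proportional allocation (the same sampling fraction in every stratum), which is only one of several reasonable choices; the strata and the helper `stratified_sample` are hypothetical.

```python
import random

def stratified_sample(strata, fraction, rng):
    """Draw an SRS of the same fraction from every stratum.

    `strata` maps a stratum name to the list of its members.
    """
    sample = {}
    for name, members in strata.items():
        n = max(1, round(fraction * len(members)))
        sample[name] = rng.sample(members, n)
    return sample

# Hypothetical strata: students grouped by year of study.
strata = {
    "first_year": [f"F{i}" for i in range(300)],
    "second_year": [f"S{i}" for i in range(200)],
    "upper_year": [f"U{i}" for i in range(100)],
}
chosen = stratified_sample(strata, 0.10, random.Random(7))
sizes = {name: len(members) for name, members in chosen.items()}
print(sizes)
```

Results from each stratum would then be combined, weighted by stratum size, to produce the population estimate.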
Cluster and Multistage Sampling
• Splitting the population into clusters can make sampling more practical. Then we
could simply select a number of clusters at random and perform a census within
each of them. This sampling design is called simple cluster sampling
• With proper random selection of clusters (and within clusters, if there are multiple
stages of sampling), cluster sampling will give us an unbiased sample
• Clusters are randomly selected; strata are not
• What we care about is maximizing the precision obtained for some total
expenditure, not for some fixed sample size, and the same expenditure can yield
a much bigger cluster sample than a simple random sample
• When we subsample and study just a portion of each selected cluster, we are
introducing a second stage of sampling
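The difference between simple cluster sampling and a two-stage design can be sketched as follows. The clusters here (12 lab sections of 25 students) and both function names are assumptions made for the example.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and take a census within each."""
    picked = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in picked}

def two_stage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Pick clusters at random, then draw an SRS within each picked cluster."""
    picked = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in picked}

# Hypothetical clusters: 12 lab sections of 25 students each.
sections = {
    f"lab{k:02d}": [f"lab{k:02d}-s{i}" for i in range(25)]
    for k in range(12)
}
rng = random.Random(3)
census_in_clusters = cluster_sample(sections, 3, rng)
subsampled = two_stage_sample(sections, 3, 5, rng)
print({name: len(members) for name, members in census_in_clusters.items()})
print({name: len(members) for name, members in subsampled.items()})
```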
Systematic Samples
• You might survey every tenth person on an alphabetical list of students. To make
it random, you must start the systematic selection from a randomly selected
individual. When the order of the list is not associated in any way with the
responses sought, systematic sampling is similar to drawing an SRS
• It will give more accurate (less variable) results than an SRS when the list order
is associated to some extent with the response variable, as long as the
systematic rule doesn’t coincide with some “cycle” in the sampling frame
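A minimal sketch of a systematic sample with a random start, assuming a hypothetical alphabetical roster of 100 students and a step of 10:

```python
import random

def systematic_sample(frame, k, rng):
    """Take every k-th individual, starting at a random position among the first k."""
    start = rng.randrange(k)
    return frame[start::k]

# Hypothetical alphabetical list of 100 students.
roster = [f"student{i:03d}" for i in range(1, 101)]
every_tenth = systematic_sample(roster, 10, random.Random(5))
print(every_tenth)
```

The random start is what gives every individual a chance of selection; without it, the same ten people would always be chosen.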
Defining the “Who”: You Can’t Always Get What You Want
• Even if the population seems well defined, it may not be a practical group from
which to draw a sample
• Specify a sampling frame
• The target sample consists of the individuals for whom you intend to measure responses
• The actual respondents are the individuals about whom you do get data and
can draw conclusions
The Valid Survey
• A valid survey yields the information we are seeking about the population we are
interested in.
o Know what you want to know – understand what you hope to learn
o Use the right frame – appropriate respondents
o Tune your instrument – survey, questionnaire
o Ask specific rather than general questions
o Ask for quantitative results when possible
o Be careful in phrasing questions
o Be careful in phrasing answers
• A pilot is a trial run of the survey you eventually plan to give to a larger group,
using a draft of your survey questions administered to a small sample drawn from
the same sampling frame you intend to use
Sample Badly with Volunteers
• In a voluntary response sample, a large group of individuals is invited to respond,
and all who do respond are counted
• The sample is not representative, even though every individual in the population
may have been offered the chance to respond. The resulting voluntary response
bias invalidates the survey
Sample Badly, but Conveniently
• In convenience sampling we simply include the individuals who are convenient
for us to sample
Undercoverage
• Many survey designs suffer from undercoverage, in which some portion of the
population is not sampled at all or has a smaller representation in the sample
than it has in the population
What Else Can Go Wrong?
• A common and serious potential source of bias for most surveys is nonresponse
bias. No survey succeeds in getting responses from everyone. The problem is
that those who don’t respond may differ from those who do. And they may differ
on just the variables we care about.
• Response bias refers to anything in the survey design that influences the
responses
How to Think About Biases
• There’s no way to recover from a biased sample or survey that asks biased
questions
Chapter 13: Experiments and Observational Studies
Observational Studies
• In observational studies, researchers don’t assign choices; they simply observe
them. The study in this example was also a retrospective study, because researchers first
identified subjects who studied music and then collected data on their past
grades
• Observational studies are valuable for discovering trends and possible
relationships
• Because retrospective records are based on historical data, they can have errors
• Identifying subjects in advance and collecting data as events unfold would make
this a prospective study
Randomized, Comparative Experiments
• An experiment requires a random assignment of subjects to treatments
• Experiments study the relationship between two or more variables.
• An experimenter must identify at least one explanatory variable, called a factor,
to manipulate, and at least one response variable to measure. What
distinguishes an experiment from other types of investigation is that the
experimenter actively and deliberately manipulates the factors to control the
details of the possible treatments, and assigns the subjects to those treatments
at random
• The individuals on whom or which we experiment are referred to as experimental
units; however, humans who are experimented on are commonly called subjects
or participants
• The specific values that the experimenter chooses for a factor are called the
levels of the factor
• The combination of specific levels from all the factors that an experimental unit
receives is known as its treatment
The Four Principles of Experimental Design
1. Control
• We control sources of variation other than the factors we are testing by
making conditions as similar as possible for all treatment groups
• We control other sources of variation to prevent them from changing and
affecting the response variable
2. Randomize
• Randomization allows us to equalize the effects of unknown or
uncontrollable sources of variation
• If experimental units were not assigned to treatments at random, we would
not be able to use statistics to draw conclusions from an experiment.
• Reduces bias
3. Replicate
• The outcome of an experiment on a single subject is an anecdote, not
data
• Replication of an entire experiment with the controlled sources of variation
at different levels is an essential step in science
4. Blocking
• Sometimes, attributes of the experimental units that we are not studying
and that we can’t control may nevertheless affect the outcomes of an
experiment. If we group similar individuals together and then randomize
within each of these blocks, we can remove much of the variability due to
the difference among the blocks.
Diagrams
• The diagram emphasizes the random allocation of subjects to treatment groups,
the separate treatments applied to these groups, and the ultimate comparison of
results
Does the Difference Make a Difference?
• The real question is whether the differences we observed are about as big as
we’d get just from the randomization alone, or whether they’re bigger than that. If
we decide that they’re bigger, we’ll attribute the differences to the treatments. In
that case we say the differences are statistically significant
• A difference is statistically significant if we don’t believe that it’s likely to have
occurred only by chance
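One way to ask whether a difference is bigger than randomization alone would produce is a randomization (permutation) test. This technique is not named in the notes and is swapped in here purely as an illustration; the response values and the function `randomization_test` are made up.

```python
import random

def randomization_test(group_a, group_b, reps, rng):
    """Estimate how often random relabelling alone gives a difference in
    means at least as large as the one actually observed."""
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # re-deal the responses to the two groups at random
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return observed, extreme / reps

# Made-up responses for two treatment groups (illustration only).
treatment = [24, 27, 31, 29, 33, 30, 28, 32]
control = [22, 25, 21, 26, 24, 23, 27, 20]
observed_diff, share_as_large = randomization_test(
    treatment, control, 2000, random.Random(42))
print(f"Observed difference in means: {observed_diff:.2f}")
print(f"Share of shuffles at least as large: {share_as_large:.4f}")
```

If only a small share of the random shuffles match the observed difference, chance alone is an unlikely explanation, which is the sense of "statistically significant" described above.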
Experiments and Samples
• Both experiments and sample surveys use randomization to get unbiased data
• Sample surveys try to estimate population parameters
• Experiments try to assess the effects of treatments
• Unless the experimental units are chosen from the population at random, you
should be cautious about generalizing experiment results to larger populations
until the experiment has been repeated under different circumstances
Control Treatments
• Control group – the experimental units assigned to a baseline treatment level,
typically either the default treatment, which is well understood, or a null, placebo
treatment. Their responses provide a basis for comparison
Blinding
• Blinding helps eliminate bias by hiding from participants which treatment they
are assigned to, so that loyalties they might have to a treatment cannot influence the results
• There are two classes of individuals who can affect an experiment: those who could
influence the results and those who evaluate the results. When all the individuals in
either one of these classes are blinded, an experiment is said to be single-blind
• When everyone in both of the above classes is blinded, we call the experiment
double-blind
Placebos
• A fake treatment that looks just like the treatments being tested is called a sham
treatment or placebo
• The placebo effect highlights both the importance of effective blinding and the
importance of comparing treatments with a control
• The best experiments are usually: randomized, double-blind, comparative,
placebo-controlled
Blocking
• When groups of experimental units are similar, it is often a good idea to gather
them together into blocks. By blocking, we isolate the variability attributable to the
differences between the blocks, so that we can see the differences caused by the
treatments more clearly.
• Because the randomization occurs only within the blocks, we call this a
randomized block design
• In a retrospective or prospective study, subjects are sometimes paired because
they are similar in ways not under study. Matching subjects in this way can
reduce variation in much the same way as blocking
• We use blocks to reduce variability so we can see the effects of the factors; we’re
not usually interested in studying the effects of the blocks themselves
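A randomized block design can be sketched as random assignment carried out separately inside each block. The blocks, unit labels, and treatment names below are hypothetical, and dealing treatments in rotation over a shuffled block is just one simple way to keep group sizes balanced.

```python
import random

def randomized_block_assignment(blocks, treatments, rng):
    """Randomly assign treatments separately within each block.

    Each block is shuffled, then treatments are dealt out in rotation,
    so group sizes within a block are as equal as possible.
    """
    assignment = {}
    for members in blocks.values():
        shuffled = list(members)
        rng.shuffle(shuffled)
        for position, unit in enumerate(shuffled):
            assignment[unit] = treatments[position % len(treatments)]
    return assignment

# Hypothetical blocks of similar experimental units.
blocks = {
    "block_1": ["u1", "u2", "u3", "u4"],
    "block_2": ["u5", "u6", "u7", "u8"],
}
plan = randomized_block_assignment(blocks, ["new", "placebo"], random.Random(9))
print(plan)
```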
Adding More Factors
• This experiment is a completely randomized two-factor experiment because any
plant could end up assigned at random to any of the six treatments (and we have
two factors)
• Experiments with more than one factor are both more efficient and provide more
information than one-at-a-time experiments
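To illustrate how two factors combine into six treatments, the sketch below crosses a two-level factor with a three-level factor and assigns plants completely at random. The factor names, levels, and the number of plants are assumptions; the notes only state that there are two factors and six treatments.

```python
import itertools
import random
from collections import Counter

# Hypothetical factors and levels: 2 x 3 = 6 treatments.
factors = {
    "fertilizer": ["none", "standard"],
    "watering": ["low", "medium", "high"],
}
treatments = list(itertools.product(*factors.values()))

# 18 hypothetical plants, shuffled so any plant can land in any treatment.
plants = [f"plant{i:02d}" for i in range(1, 19)]
shuffled_plants = plants[:]
random.Random(11).shuffle(shuffled_plants)

# Deal the shuffled plants to treatments in rotation: 3 plants per treatment.
design = {plant: treatments[i % len(treatments)]
          for i, plant in enumerate(shuffled_plants)}
per_treatment = Counter(design.values())
print(per_treatment)
```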
Confounding
• When the levels of one factor are associated with the levels of another factor, we
say that these two factors are confounded
• A confounding variable, then, is a variable associated with the experimental
factor that may also have some effect on the response. Because of the
confounding, we find that we can’t tell whether any effect we see was caused by
our factor or by the confounding variable – or even by both working together.
Chapter 11-13: Gathering Data
Lecture Notes
Does a Census Make Sense?
• Census is a special sample that includes everyone in the entire population
• Problems: cost, undercount, time, complexity
Populations and Parameters
• Population parameter – a parameter that is part of a model for a population (often
unknown)
• Sample statistic – a summary that is found from data in a sample (known once
data are observed)
Two Types of Conclusions (Inferences)
• Population inference – results are generalized (estimate)
• Causal (cause and effect) Inference – the difference in the responses is caused
by the difference in treatments when comparing the results from two treatment
groups
Population Inference
• We should only make population inferences when we have random sampling
• Randomizing protects us from the influences of all the features of our population
o Makes sure that on the average the sample looks like the rest of the
population
• Non-random sampling leads to biased results
o No way to fix a biased sample
• We should not generalize our results from a non-random sample
Random Sampling Methods
1. Simple Random Samples (SRS) – each sample of size (n) in the population has
the same chance of being selected
• Samples drawn at random generally differ from one another
• Sampling variability – sample-to-sample differences
2. Stratified Random Sampling
• Population is divided into homogeneous groups called strata, then take
SRS within each stratum before the results are combined
• Reduces bias and variability of results
3. Systematic Random Samples
• Start the systematic selection from a randomly selected individual, then
sample every kth person
• Systematic sampling is much less expensive than true random sampling
4. Cluster Random Sampling
• Split the population into clusters, select one or a few clusters at
random, and perform a census within each of them
• Should be unbiased
• Not the same as stratified sampling
Sources of Bias:
The tendency for a sample to differ from the corresponding population in some
systematic way
1. Selection Bias (Undercoverage)
• When some portion of the population is not sampled at all or has a smaller
representation in the sample than it has in the population
2. Response Bias
• Refers to anything in the survey design that influences the responses
• Respondents may lie, especially if asked about illegal or unpopular
behaviour
3. Nonresponse Bias
• A common and serious potential source of bias for most surveys
• Occurs when responses are not obtained from individuals selected for the
study
Causal (cause-and-effect) Inference
• We should only make cause and effect inferences when we have random
allocation
• Lurking variables are variables that are related to both group membership and to
the response
Study Designs
1. Observational Study
• The investigator observes individuals and measures variables of interest
but does not attempt to influence the responses
• Retrospective study – useful when one outcome is rare, based on
historical data
• Prospective study – observe explanatory and response variables, more
costly, but better data
• Valuable for discovering trends and possible relationships, but not
possible to demonstrate a causal relationship
Randomized, Comparative Experiments
• An experiment is a study design that allows us to prove a cause-and-effect
relationship
• An experiment:
o Manipulates factor levels to create treatments
o Randomly assigns subjects to these treatment levels
o Compares the responses of the subject groups across treatment levels
Summary
1. Random Selection of Individuals (Random Sampling):
• The individuals in the sample are selected randomly from the population
• Population inferences are allowed
2. Random Allocation:
• The individuals are randomly assigned to different treatment groups
• Causal inferences are allowed
• We cannot make causal inferences from observational studies
                          Random Selection?
                          Yes                     No
Random Assignment?  Yes   Both                    Causal Inference
                    No    Population Inference    None
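The summary table can also be written as a small lookup function. This is only a restatement of the table in code; the function name and the labels "population" and "causal" are illustrative.

```python
def allowed_inferences(random_selection, random_assignment):
    """Return the kinds of inference the summary table permits."""
    allowed = set()
    if random_selection:
        allowed.add("population")  # random sampling supports generalizing
    if random_assignment:
        allowed.add("causal")      # random allocation supports cause and effect
    return allowed

for selection in (True, False):
    for assignment in (True, False):
        print(selection, assignment,
              sorted(allowed_inferences(selection, assignment)))
```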
Chapter 1: Stats Starts Here
Lecture Notes
What is (are) Statistics?
• Statistics (the discipline) is a way of reasoning, a collection of tools and methods,
designed to help us understand the world with the use of data
• Statistics (plural) are particular calculations made from data
Data
• Values with a context
• Numbers, record names, or other labels
Why We Need to Collect Data
• To find answers to the questions that cannot be answered otherwise
• Results in different outcomes (variability)
Statistical Methods
• Personal life, politics and economics, biology, medicine, psychology, social
science, engineering, dentistry, business
The Process of Making a Data Driven Decision:
1. State the question
2. Decide how to collect the data and which statistical methods have to be applied
to answer the question or to make a decision
3. Collect the data – draw a sample from the population of interest
4. Describe and summarize the data. Find the statistics
5. Use appropriate statistical tools to find an answer to the question. Use formal
data analysis and make statistical inference. Generalize the finding from the
sample to the entire population.
NOTE: in order to understand and apply inferential procedures, we need to
understand Probability Theory
Goal of this Class
• Learn appropriate methods for data collection
• Learn how to formulate, carry out, and interpret simple statistical analysis
• Learn how to follow basic statistical arguments
• Develop a critical attitude toward quantitative claims
• Develop statistical reasoning
• Summary: Become capable of making decisions based on data
Definition:
• Population of Interest – the entire collection of individuals
• Sample – a subset of the population, selected for study in some prescribed
manner
• Parameter – a number that describes a characteristic of the population (often
unknown) (“p” in parameter describes population)
• Statistic – a number that describes a characteristic of the sample (known
once the data are observed) (“s” in statistic describes sample)
• Greek letters denote parameters and Latin letters denote statistics
Example: I want to estimate the proportion of male students for this Stat 151 Lecture by
randomly selecting 10 students in this class.
Population of Interest: All the students in this Stats 151 Lecture
Sample: the 10 randomly selected students
Parameter: the proportion of male students in this class
Statistic: the proportion of male students in the sample
Example: An investigator wants to estimate the average height of Canadian females by
measuring the height of 1000 randomly picked Canadian females.
Population of Interest: All Canadian females
Sample: 1000 randomly selected Canadian females
Parameter: the average height of Canadian females
Statistic: the average height of the Canadian females in the sample
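The first example (proportion of male students) can be mimicked in code to show the difference between a parameter and a statistic. The class size and the number of male students below are made-up numbers chosen only so the calculation can run.

```python
import random

# Made-up class list: 120 students, 54 of them male (assumed numbers).
class_list = ["M"] * 54 + ["F"] * 66

# Parameter: describes the whole population (usually unknown in practice).
p = class_list.count("M") / len(class_list)

# Statistic: computed from the 10 randomly selected students.
sample = random.Random(151).sample(class_list, 10)
p_hat = sample.count("M") / len(sample)

print(f"Parameter p = {p:.2f}, statistic p-hat = {p_hat:.2f}")
```

A different random sample would generally give a different value of the statistic, while the parameter stays fixed.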
Textbook Notes
What is (Are?) Statistics?
• Whenever there are data and a need for understanding the world, we need
statistics
Statistics in a Word
• Statistics is about variation
Chapter 2: Data
Terms:
1. Data – systematically recorded information, whether numbers or labels, together
with its context
2. Context – the context ideally tells who was measured, what was measured, how
the data were collected, where the data were collected, and when and why the
study was performed
3. Data Table – an arrangement of data in which each row represents a case and
each column represents a variable
4. Case – a case is an individual about whom or which we have data
5. Variable – a variable holds information about the same characteristic for many
cases
6. Unit – a quantity or amount adopted as a standard of measurement, such as
dollars, hours, or grams
7. Categorical Variable – a variable that names categories (whether with words or
numerals) is called categorical
8. Quantitative Variable – a variable in which the numbers act as numerical values
is called quantitative. Quantitative variables always have units
9. Identifier Variable – a variable holding a unique name, ID number, or other
identification for a case. Identifiers are particularly useful in matching data from
two different databases or relations
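The terms above can be tied together with a tiny data table, where each row is a case and each field is a variable. The variable names and values are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Case:
    """One row of the data table (one case)."""
    student_id: str      # identifier variable
    program: str         # categorical variable
    study_hours: float   # quantitative variable, units: hours per week

# Made-up rows for illustration only.
data_table = [
    Case("S001", "Science", 12.0),
    Case("S002", "Arts", 8.5),
    Case("S003", "Science", 15.5),
    Case("S004", "Business", 10.0),
]

# Categorical variables are summarized by counts per category.
program_counts = Counter(case.program for case in data_table)
# Quantitative variables have units and can be averaged.
mean_hours = sum(case.study_hours for case in data_table) / len(data_table)

print(program_counts)
print(f"Mean study time: {mean_hours} hours per week")
```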
Lecture Notes
Data – can be numbers, record names, or other labels.
• Not all data are represented by numbers
• It must have: who, what, when, where, and why
Who: who are you interested in, target population?
• Subjects or participants – people on whom we experiment
o Population – entire set of subjects
o Sample – set of subjects you observe
• Respondents – individuals who answer a survey
• Experimental units – animals, plants, and inanimate subjects
What: what characteristic (or variable) are you measuring?
• Variable – characteristic recorded about each individual
o Can take different values for different individuals
o Tell how each value has been measured and tell the scale of the
measurement
Variables
• Categorical
o Nominal
o Ordinal
• Quantitative
o Discrete
o Continuous
Definition:
1. A categorical variable places a subject into one of several groups or categories
(or levels)
o Usually we determine the counts of cases that fall into each category
o Two types:
1. Nominal: the levels have no order
2. Ordinal: the levels have some order
o Example:
a) Gender (M or F) - nominal
b) Hair colour (blond, white, black, red, etc…) - nominal
c) Nationality (Canadian, American, German, French, Chinese, etc…)
– nominal
d) Letter Grade (A+, A, A-, etc…) – ordinal
e) Car Manufacturer (Dodge, Honda, etc…) – nominal
f) Opinion (strongly agree, agree, neutral, etc…) – ordinal
g) Education level (high school diploma, undergraduate degree,
graduate degree) – ordinal
2. A Quantitative variable measures a numerical quantity or amount in each subject
o Two types:
1. Discrete: can only take on distinct values
2. Continuous: can take on any value in a given interval
o Example:
a) Age (years) – discrete variable
b) Race (Asian, Black, White, or other) – categorical (nominal), not quantitative
c) Smoker (yes or no) – categorical (nominal), not quantitative
d) Systolic blood pressure (millimeters of mercury) – continuous
quantitative variable
e) Level of calcium in blood (micrograms per millilitre) – continuous
Why: Why are you doing this? What is the reason for collecting data?
• The questions of interest shape what we think about and how we treat the variable
• We need the who, what, and why to analyze data
When and Where: gives us some nice info about the context
• Example: values recorded at the U of A in 1960s mean something different than
similar values recorded last year. (eg. Average pr
