Statistical Sciences 1024A/B Final Exam Study Guide
Western University
Instructor: Sohail Khan (Summer)
Module One: Sampling
Population: the whole group of individuals
Sample: part of population we collect information from
Unit: an individual member of the population
Parameter: characteristics of the population we want to learn about
Statistic: the sample version of a parameter
Variable: a characteristic of an individual
Sampling Design: describes exactly how to choose a sample from the population
Census: collects information from the entire population (no sample)
Bias: systematic errors in the way the sample represents the population
Undercoverage: occurs when some groups in the population are left out of the sampling process
Nonresponse: when individuals can’t be contacted or won’t participate
Response Bias: behaviour of the respondent or interviewer alters answers
Voluntary Response Sample: people who chose themselves by responding to a broad appeal- biased
Convenience Sample: takes members of the population that are easiest to reach- unrepresentative
Random Sampling: uses chance to select a sample
Simple Random Sample (SRS): every set of n individuals has the same chance of being the sample selected (so each individual is equally likely to be chosen)
Stratified Random Sample: divide the population into groups of similar individuals (strata), then take a separate SRS within each stratum
Cluster Sample: divide the population into groups (clusters), then select one or more entire clusters at random- all individuals in the chosen clusters form the sample
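The three random sampling designs above can be sketched in Python; the population of 100 labeled units, the two strata, and the ten clusters are all hypothetical:

```python
import random

random.seed(1)  # for reproducibility

population = list(range(1, 101))  # hypothetical population of 100 unit labels

# Simple random sample: every set of 10 units is equally likely to be chosen.
srs = random.sample(population, 10)

# Stratified random sample: split into strata, then take an SRS within each.
strata = {"low": population[:50], "high": population[50:]}
stratified = [u for s in strata.values() for u in random.sample(s, 5)]

# Cluster sample: group units into clusters, then select whole clusters.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
chosen_clusters = random.sample(clusters, 2)
cluster_sample = [u for c in chosen_clusters for u in c]
```

Note the difference in what chance selects: an SRS picks individual units, a stratified sample picks units within every stratum, and a cluster sample picks whole groups.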
Multistage Sample: sampling carried out in stages, e.g. a random sample of telephone exchanges stratified by region, then an SRS of telephone #s, then a random adult from each

Module Two: Study Design
Observational Study: variables are observed/measured and recorded- does nothing to influence
Confounding: when effects of variables can’t be distinguished from each other
Response Variable: a particular quantity that we ask a question about
Explanatory Variable/Factor: a factor that influences the response variable
Experiment: individuals are influenced and then observed
Levels: the specific values or settings of a factor used in the experiment
Treatment: a specific experimental condition applied to individuals
Randomization: use chance to assign treatments
Replication: use enough subjects in each group so that chance variation in the results is reduced
Control: hold other variables constant or compare treatments directly, so that lurking variables don't confound the results
Completely Randomized Design: all subjects are allocated at random among all the treatments- can compare
any number of treatments
Randomized Block Design: the random assignment of individuals to treatments is done by block
Block: a group of individuals known to be similar before the experiment
Placebo: Dummy treatment
Blinding: subjects don't know which treatment they are receiving.
Double Blind: the subjects and the people who interact with them don’t know which treatment they’re
receiving.
Statistical Significance: an observed effect so large that it would rarely occur by chance alone
Matched Pairs Design: closely matched pairs are given different treatments so the treatments can be compared within pairs

Module Three: Descriptive Stats- One Variable
(Qualitative) Categorical Variable: places an individual in categories- not meaningful numbers
Pie chart: shows the distribution of the categorical variable as a “pie”- use when you want to emphasize
each category’s relation to the whole
Bar graph: represents each category as a bar- show category counts or percents- more flexible than pie
charts
Quantitative variables and data: deal with units of measurement- numerical values for which averaging
makes sense
Continuous: can take any value within an interval (measurements)
Discrete: takes only separate, countable values (counts)
Histogram: most common graph of the distribution of one quantitative variable- the horizontal axis shows the variable's values and adjacent bars touch, unlike a bar graph
Stemplot: for small data sets
Boxplot: graph of the five number summary- a central box spans the quartiles, with a line at the median and whiskers out to the extremes
Frequency: the number of observations in a class or category
Relative frequency: the frequency divided by the total number of observations (a proportion or percent)
Measures of centre:
Mean: average of n values
Median: value that divides the dataset in half
Measures of position- quartiles: Q1 and Q3 mark the 25% and 75% points of the data- each quartile is the median of one half of the ordered data
Measures of spread:
Range: maximum-minimum
Interquartile range: Q3-Q1
Variance: roughly the average squared deviation between the observations and their mean (for a sample, divide by n - 1)
Standard deviation: square root of sample variance
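The measures of centre, position, and spread above can all be computed with Python's standard statistics module (the data values here are hypothetical):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample, already sorted

mean = statistics.mean(data)          # average of n values
median = statistics.median(data)      # value that divides the dataset in half
variance = statistics.variance(data)  # sample variance (divides by n - 1)
sd = statistics.stdev(data)           # square root of the sample variance

# Quartiles: statistics.quantiles with n=4 returns Q1, the median, and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                         # interquartile range
data_range = max(data) - min(data)    # maximum - minimum
```

Note that different textbooks (and different `method` arguments to `quantiles`) compute quartiles slightly differently, so hand calculations may differ by a small amount.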
Symmetry: left and right side of histogram approximately mirror each other
Skewness: one side extends farther than the other
Outliers: individual that falls outside the overall pattern
Five number summary: Min Q1 Med Q3 Max

Module Four: Descriptive Stats: Two Variables
Side-by-side boxplots: provide visual comparison for the distribution of a quantitative variable across a
qualitative one
Two-way table: two qualitative variables- each cell represents the count of individuals or units that have a
particular combination of characteristics
Marginal distribution: the distribution of one qualitative variable on its own, computed from the row or column totals in the table's margins- it provides no information about an association between the variables
Conditional distribution: the distribution of one variable restricted to a specific value of the other variable
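A minimal sketch of marginal and conditional distributions computed from a two-way table; the smoker/exercise counts below are invented for illustration:

```python
# Hypothetical two-way table: counts of individuals by smoking status (rows)
# and exercise level (columns).
table = {
    "smoker":     {"low": 30, "high": 10},
    "non-smoker": {"low": 20, "high": 40},
}
total = sum(sum(row.values()) for row in table.values())  # grand total: 100

# Marginal distribution of smoking status: row totals / grand total.
marginal_smoker = {k: sum(v.values()) / total for k, v in table.items()}

# Conditional distribution of exercise given "smoker": one row / its total.
row = table["smoker"]
conditional_given_smoker = {k: v / sum(row.values()) for k, v in row.items()}
```

Each conditional distribution sums to 1 within its row, which is what makes comparisons across rows meaningful.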
Simpson’s paradox: an association holds consistently for all groups, but then when the data is considered
together the direction of association is reversed due to a lurking variable
Scatterplot: visual of the association between two quantitative variables that have been measured on the same units or individuals
Time plot: plots each observation against the time at which it was measured
Trend: positive or negative if the data shows a consistent upward or downward tendency over time
Cycle: when the data shows regular up and down fluctuations over time
Change points: changes in the overall pattern- striking deviations that suggest something happened around the time of change that impacted the variable

Module Five: Linear Relationships
Correlation: measures the direction and strength of the linear association between two quantitative variables
Causation: one variable causes another, NOT ALWAYS THE CASE
Explanatory (x) variable: may explain or influence the outcome of interest
Response (y) variable: outcome of interest
Least-squares regression: fits the line that minimizes the sum of the squared vertical distances between the data points and the line
Ecological correlation: correlations between the average values for x and y- tend to be higher than correlations
based on x and y
Influential point or observation: outliers, if the point was not there the results would be dramatically different
Extrapolation: don’t use what you have in one chart to predict the correlation of something outside the chart (if
you know BAC vs. # of beers for 1-9 don’t guess about 12 or 20 because it could be different)
Lurking variable: a variable that is not among the explanatory or response variables, yet may influence the relationship between them
Predicted (or fitted) value: the value of y given by the regression line for a particular x
Coefficient of determination: the proportion or fraction of total variation in y that is explained by the least-squares regression line- the square of the correlation- r² = explained variation / (explained + unexplained variation)
Residual: observed value minus predicted value (the vertical gap between the actual point and the line)
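A worked sketch of least-squares regression, residuals, and r² in plain Python (the x and y data are hypothetical):

```python
import statistics

x = [1, 2, 3, 4, 5]            # hypothetical explanatory values
y = [2.1, 3.9, 6.2, 8.0, 9.8]  # hypothetical responses

xbar, ybar = statistics.mean(x), statistics.mean(y)

# Least-squares slope: sum of cross-deviations over sum of squared x-deviations.
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar                        # intercept: line passes through means

predicted = [a + b * xi for xi in x]       # fitted values on the line
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]

# Coefficient of determination: explained variation / total variation.
ss_total = sum((yi - ybar) ** 2 for yi in y)
ss_resid = sum(r ** 2 for r in residuals)
r_squared = 1 - ss_resid / ss_total
```

The residuals always sum to zero for a least-squares fit, which is a quick sanity check on the computation.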
Residual plot: scatterplot with the same x-axis as the original data but with the residuals on the y-axis

Module Six: Quantifying Uncertainty
Randomness: a property of a phenomenon whose outcomes cannot be predicted in short run but that behaves in
a certain way in the long run
Experiment: a process for which a single outcome occurs but in which there are more than one possible
outcomes. Thus, we are uncertain which outcome will occur and cannot predict this outcome in advance.
Sample space: The sample space, denoted by S, of an experiment is the set of all possible outcomes of the
experiment. Let us consider sample space for an experiment when we toss three coins:
S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}.
Simple event and event:
An event A is a subset of the sample space S. It is a set containing a subset of the outcomes of a particular
interest. In roll of a die ("die" is the singular form of "dice"), our interest might be if die showed an even
number. This event will have 3 outcomes. Each outcome is called a simple event.
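The three-coin sample space, and an event such as "exactly two heads," can be enumerated directly (a sketch using Python's itertools):

```python
import itertools
from fractions import Fraction

# Sample space for tossing three coins: all 2**3 = 8 equally likely outcomes.
S = ["".join(t) for t in itertools.product("HT", repeat=3)]

# Event A: exactly two heads- a subset of S containing 3 simple events.
A = [outcome for outcome in S if outcome.count("H") == 2]

# With equally likely outcomes, P(A) = (outcomes in A) / (outcomes in S).
p_A = Fraction(len(A), len(S))
```

Counting outcomes this way only gives probabilities when every outcome in S is equally likely, as it is for fair coins.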
Venn diagram: used to represent sample spaces and events

Rules of Probability:
Conditional probability: the probability that one event occurs given that another event has occurred- P(A | B) = P(A and B)/P(B)
Disjoint (i.e., mutually exclusive) events: events which cannot occur together. In other words, one stops the occurrence of the other
Independent events: Independence implies that occurrence of one event does not affect the occurrence (or non-
occurrence) of the other event.
Two-way tables and probability trees: tools for organizing the probabilities of combinations of events and for computing conditional probabilities
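A minimal sketch of conditional probability and an independence check from a two-way table of counts (the 60 people and the events A and B are invented for illustration):

```python
from fractions import Fraction

# Hypothetical counts for 60 people, classified by two events A and B:
#                 B occurs   B does not
# A occurs            12          18
# A does not           8          22
n = {"AB": 12, "AnotB": 18, "notAB": 8, "notAnotB": 22}
total = sum(n.values())  # 60

p_A = Fraction(n["AB"] + n["AnotB"], total)   # P(A) = 30/60
p_B = Fraction(n["AB"] + n["notAB"], total)   # P(B) = 20/60
p_A_and_B = Fraction(n["AB"], total)          # P(A and B) = 12/60

# Conditional probability: P(A | B) = P(A and B) / P(B).
p_A_given_B = p_A_and_B / p_B

# Independence check: A and B are independent iff P(A and B) = P(A) * P(B).
independent = p_A_and_B == p_A * p_B
```

Here P(A | B) = 3/5 differs from P(A) = 1/2, so knowing B occurred changes the probability of A: the events are not independent.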
Random variable: variable which takes numerical values corresponding to outcomes of a random phenomenon
Probability distribution: a rule, formula or function which tells us which values the random variable can take
on and provides us a method to assign probabilities to these values of the random variable.
Discrete probability models: probabilities associated with fixed outcomes in a sample space
Continuous probability models: describe the pattern of a random phenomenon using a density curve

Module Seven: Variables and Distributions
Random variables: a variable which takes numerical values corresponding to outcomes of a random
phenomenon
Histograms: summarize observed quantitative data. Density curves are often used to model (or describe) results
of random phenomena. For the theory (i.e., the model) to coincide with the data, then, the histogram and the
density curve should be similar.
Density curves: A density curve describes the theoretical pattern or distribution of a random variable and this
description is in terms of a mathematical function. A density curve always sits on or above the horizontal axis
and has an area of exactly 1 underneath it. The area under the curve for a given range of values is the
probability that the random variable takes on values in that range. We often use density curves to model (or
describe) results of random phenomena.
Probability distributions: a rule, formula or function that tells us which values a random variable can take on
and provides us a method to assign probabilities to the values of the random variable
Skewness: one side is more spread out than the other- positive/to the right if mean>median
Symmetry: no skewness
Normal distribution: Normal distributions are defined by two parameters (mean and standard deviation). The
mean \(\mu\) of the distribution determines where the curve is centered and the standard deviation (SD),
\(\sigma\), determines how spread out the distribution is. All normal distributions have the same shape - they are
symmetric and bell-shaped
68%-95%-99.7%/ Empirical Rule: It states that if your distribution is bell-shaped and symmetrical then you can expect about 68% of the observations (data) within one standard deviation of the mean (\(\mu\) ± \(\sigma\)), about 95% of the observations within two standard deviations of the mean (\(\mu\) ± 2\(\sigma\)) and almost all (about 99.7%) of the observations within three standard deviations of the mean (\(\mu\) ± 3\(\sigma\)). These percentages are properties of normal distributions.
Standard normal distribution: the normal distribution with mean \(\mu = 0\) and standard deviation \(\sigma = 1\)- any normal variable can be standardized with \(z = (x - \mu)/\sigma\)
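The 68-95-99.7 percentages can be verified with Python's statistics.NormalDist; the mean of 100 and SD of 15 below are arbitrary choices, since the percentages hold for every normal distribution:

```python
from statistics import NormalDist

mu, sigma = 100, 15          # hypothetical mean and standard deviation
X = NormalDist(mu=mu, sigma=sigma)

# P(mu - k*sigma < X < mu + k*sigma) for k = 1, 2, 3 via the cdf.
within = [X.cdf(mu + k * sigma) - X.cdf(mu - k * sigma) for k in (1, 2, 3)]

# Standardizing: z = (x - mu) / sigma converts any normal value to the
# standard normal scale; 130 is 2 SDs above the mean, so z = 2.0.
z = (130 - mu) / sigma
```

Changing mu and sigma changes nothing about `within`, which is exactly the sense in which the Empirical Rule is a property of the normal shape rather than of any particular dataset.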
Uniform/Rectangular distribution: The continuous uniform distribution or rectangular distribution is a
symmetric probability distribution. A uniformly distributed random variable takes values which are equally
probable. Thus, the height of the rectangle for all values of the random variable is constant. Suppose, we like to
define a uniform probability distribution over an interval “a” and “b”. The density curve of this uniform random
variable will look like a flat rectangle of constant height 1/(b - a) over the interval.

Triangular distribution: The triangular distribution is a symmetric probability distribution. As the name suggests, the shape of its density curve looks like a triangle. To solve problems about a triangular distribution, the key is to sketch the curve and use the formula for the area of a triangle to find the required probabilities.

Module 8: Sampling Distributions
Parameter: characteristics of the population we want to learn about
Statistic: the sample version of the parameter
Sampling variability or sampling error: not a mistake, it’s a natural consequence of sampling- the difference
between the statistic and the parameter it estimates- doesn’t quite represent the whole because the sample is
very small- when you take the same sample size with different units you will get a different statistic- this is
sampling variability
Law of Large Numbers: as the number of observations randomly chosen from a population with finite mean \(\mu\) increases, the mean of the observations \(\bar{x}\) gets closer and closer to the mean of the population
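A quick simulation of the Law of Large Numbers using rolls of a fair die, whose population mean is \(\mu = 3.5\); the seed and sample sizes are arbitrary:

```python
import random

random.seed(42)  # for reproducibility

# Roll a fair six-sided die many times; the running mean should settle
# toward the population mean of (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5.
rolls = [random.randint(1, 6) for _ in range(100_000)]

mean_100 = sum(rolls[:100]) / 100          # mean after only 100 rolls
mean_all = sum(rolls) / len(rolls)         # mean after all 100,000 rolls
```

With only 100 rolls the mean can wander noticeably; with 100,000 it sits very close to 3.5, which is the long-run regularity the law describes.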
Population Distribution: summarizes the variable values for the whole population
Sampling Distribution: summarizes the values that a statistic takes across all possible samples of the same size from the population
Central Limit Theorem: when n is large, the sampling distribution of \(\bar{x}\) will be approximately normally distributed with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\).
Sampling Distribution of the sample mean: summarizes the values that the sample mean takes on across all
the possible SRS’s of the same size from the population
Sampling Distribution of the sample proportion: summarizes the values that the sample proportion takes on
across all the possible SRS’s of the same size from the population
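A simulation sketch of the Central Limit Theorem: sample means of Uniform(0, 1) draws should be roughly normal with mean \(\mu = 0.5\) and standard deviation \(\sigma/\sqrt{n}\); the sample size, number of replications, and seed are arbitrary:

```python
import random
import statistics

random.seed(7)  # for reproducibility

# Population: Uniform(0, 1), so mu = 0.5 and sigma = sqrt(1/12) ~ 0.2887.
n = 30       # size of each sample
reps = 2000  # number of samples drawn

# Sample mean of each of the 2000 samples of size 30.
xbars = [statistics.mean(random.random() for _ in range(n)) for _ in range(reps)]

mean_of_xbars = statistics.mean(xbars)
sd_of_xbars = statistics.stdev(xbars)
theoretical_sd = (1 / 12) ** 0.5 / n ** 0.5  # sigma / sqrt(n), about 0.0527
```

Even though the population is flat (uniform), the histogram of the `xbars` is bell-shaped, which is the content of the theorem.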
A few conclusions about the Sampling Distribution of \(\bar{X}\) in general: knowing how \(\bar{X}\) varies from sample to sample provides some insight into how well an \(\bar{x}\) from a sample will estimate \(\mu\).
