# PSYC 301 Final Exam Study Guide


Queen's University

Psychology

PSYC 301

Jill A Jacobson

Fall

Description

PSYC301 Advanced Statistical Inference
Final Exam Review
Weeks 1 – 12
Week #1: The Crisis in Science
P-Hacking: “data dredging”; the use of data mining to uncover patterns in data that can be presented as
statistically significant
• Replication studies are rarely published
• Work is much more likely to be published if it tests a novel/unexpected hypothesis (and if it presents significant findings/statistics)
• Examples:
o Collecting data until significant effect is found
o Excluding some observations, measures, conditions
o Combining conditions
o Adding control variables
o Combining/transforming specific measures
o Optional Stopping
HARKing (Kerr, 1998) hypothesizing after results are known
• Post-hoc hypotheses are presented as a priori
• Examples:
o Poorly specified moderators
o Too perfect a confirmation of the theory
o Research design and research question are equivalent
• Problematic:
o Sample-dependent, not generalizable
o Type 1 errors
o Promotes statistical abuse
• Solutions = preregistration, replication, proper labeling
Decline Effect (Lehrer, 2010) when the effects observed in an original study become weaker/harder
to detect in subsequent studies
• Examples: drug studies, Schooler Verbal Overshadowing, Attraction/mating success & physical
symmetry
• Why?
o Publication Bias = initially positive results are more likely to be published but later
null/negative results are more novel
o Type 1 error in initial study could be perpetuated
o Original study could have involved smaller N/larger effect size
WEIRD Samples (Henrich et al., 2010)
• Samples tend to be Western, Educated, Industrialized, Rich, and Democratic (WEIRD)
• Problem = research assumes that findings are universal but some effects don’t generalize
• All experimental psychology samples are unusual, not just human adults (infants, children, rats, etc.)
Some Examples of Fraud:
- Marc Hauser (2010) – 8 counts of scientific misconduct; biologist
- Bem (2011) – problems with 8/9 experiments
- Diederik Stapel (2011) – created data (ratted out by his grad students)
- Simmons, Nelson, and Simonsohn (2011) – False-Positive Psychology paper (deliberately engaged in p-hacking to show how easily it produces false positives)
- Doyen et al. (2012) – failed to replicate Bargh et al. (1996) social priming study
- Smeesters (2012) – engaged in QRPs
- Stapel – fabrication of data
- Simonsohn – cherry-picking papers to report
Guidelines for Authors
• Decide on rules for data collection before it begins and report in article/paper
• Minimum of 20 observations per cell
• List all variables collected in the study
• Report all experimental conditions (including ones that failed)
• Report statistical results with and without observations eliminated
• Report results with and without covariates included
• 21-word solution to increase credibility (in Methods section)
Schimmack Recommendations
• Instead of multiple studies, we need larger studies
• Require researchers with multiple papers to calculate total power and justify sample sizes
• Allow publication of failed results
• Publish replications
• Adopt new statistics instead of p values (confidence intervals and effect sizes)
Week #2: The Crisis Continued…
Replication ability to produce the same findings using a different group
Reproducibility ability to produce the same results using the same data (same group, same design,
etc.)
The Allure of Neuro
• McCabe and Castel showed that adding a brain image made research seem more compelling
• Quality of the brain picture influenced perceptions about quality of research
• Bias to believe that hard sciences are more credible
• Many researchers do not use correction w/ imaging (fMRI, etc.)
Voodoo Correlations looking to see what is significant and then only keeping what is significant
“double dipping”
• Example: Yarkoni (2009)
SHARKing Selecting hypothesized brain areas after results are known
• One of the most QRPed areas of research
Button et al. (2013): Low Power
• Problems that contribute to low reliability of findings:
o Low Power
▪ The less power you have, the lower the likelihood that a significant result is a true effect rather than a false positive
o PPV (positive predictive value)
▪ Less power = lower PPV
o Low Power means that the Effect Size is inflated
Overall:
• Neuroscience is often more persuasive but has very high false positive rates (even in the absence of QRPs)
• QRP rate is very high
• Usually small Ns, low power
• Many researcher degrees of freedom
• Software errors
• Multiple comparisons now standard but buggy software is often used for corrections
• Insufficient reporting
• Lack of replication
Week #3: p-Values
History of NHST
• RA Fisher
o First person to propose p-Values
o Not proposed to be formal hypothesis testing (meant to be a measure of surprise)
• Neyman-Pearson
o Wished to ensure that false positives occurred only at a predefined rate
o Called this “alpha” – generally set at .05
• Alpha and p-Value were never designed to work together
• The p-Value is not equal to the false positive rate
So how did we get here? (Hubbard and Ryan, 2000)
• The p-value was simple and appealing (didn’t need to do comparisons)
• Unawareness of the test’s limitation
• No simple replacement
• Failure of statisticians to debunk
Base Rate Fallacy
• Actual error rate is almost certainly higher than .05/5%
• Using alpha/p does not tell you anything about how many true effects there really are
• Falsely conclude that p < .05 means that there is a 95% chance that the result is true
o But ACTUAL RATES depend on the BASE RATE
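The base-rate point can be made concrete with a little arithmetic. The numbers below are hypothetical (10% of tested hypotheses true, alpha = .05, power = .80), chosen only to illustrate the fallacy:

```python
# PPV: of all significant results, what fraction reflect true effects?
base_rate, alpha, power = 0.10, 0.05, 0.80   # assumed for illustration

true_positives = power * base_rate           # true effects found significant
false_positives = alpha * (1 - base_rate)    # null effects found significant
ppv = true_positives / (true_positives + false_positives)
print(ppv)   # ~.64: over a third of "significant" findings are false positives
```

Even with a nominal alpha of .05, over a third of significant results here are false, because most tested hypotheses were false to begin with.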
The p-Value: Probability, under the assumption that there is no true effect/difference, of collecting data
that shows a difference equal to or more extreme than the observed difference
• Like a measure of surprise
• NOT a measure of:
o Truth
o Effect Size
o Practical significance (statistically significant doesn’t necessarily mean important)
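As a sketch of the definition, a p value is just a tail probability computed under the assumption of no true effect. The observed statistic below is hypothetical:

```python
from statistics import NormalDist

z_obs = 2.1   # hypothetical observed z statistic
# Probability, if H0 were true, of a result as or more extreme than z_obs:
p = 2 * (1 - NormalDist().cdf(abs(z_obs)))   # two-sided
print(round(p, 3))   # ~.036: "surprising" under H0, but says nothing about truth
```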
ASA Statement on p-Values (2016)
• Can be used to indicate how incompatible the data are with a specified statistical model
• Do not measure the probability that the hypothesis is true or the probability that the data is the
result of chance alone
• Conclusions/policies/decisions should not be based on p-Value only
• Proper inference requires full reporting and transparency
• A p-value does not measure the size of an effect or the importance of a result
Pitfalls of p-Values:
• Statistically Significant does not mean important, just that it is worthy of more study
• Cannot prove whether a hypothesis is true/not
o Most common misconception: p-Value of .05 means that there is only a 5% chance that
the null hypothesis is true
• More is not necessarily better = more data/more questions increases risk of p-Value problems
• A p value that is not statistically significant could be the absence of evidence, not evidence of an
absence
• Some pitfalls are deliberately hidden
Intro Stats Method integration of p-Value and alpha; reliance on statistical testing
• Journal editors used significance testing to decide whether to publish or reject a paper/study
• Mechanically applied; seems less subjective
• Common language/identity
Inverse Probability Error falsely believing that p values indicate how right or wrong you are, i.e. the
probability that the null hypothesis is true
Sizeless Science focus on p values while ignoring effect size
Costs with Significance Testing:
• When power is low, clinically significant differences are often statistically nonsignificant, so we may miss identifying beneficial treatments
• Focus on significance led to disregard for psychometrics
In Defense of p-Values
• NHST doesn’t kill science, people kill science – shouldn’t be banned just because people tend to
misuse/misinterpret
• While NHST is not effective in establishing effect size, it is good at determining relative
differences
• Most theories and hypotheses are about comparisons or relative differences rather than specific
values
• Effect Size is always important in real world application, but p-Values might help identify potential treatments for calculating effect size
• Bayesian analysis and p-Values often lead to similar conclusions, but Bayesian does not allow for error control of multiple tests
• P-Values help distinguish real effects from artifacts
• P-Values play a definite but limited role in statistical inference, shouldn’t try to replace
completely
• Problem is not P-Values but failure to adjust for them
Week #4: Likelihood & Bayes Method
Lindley’s Paradox even when a p value is very small, there can still be a high probability that the null
hypothesis is true
• A result can be unlikely when the null hypothesis is true, but can be even more unlikely when
the alternative hypothesis is true and power is high
Likelihood is not a probability but it is proportional to a probability
• Likelihoods do not need to sum to 1
• With Likelihood, data are fixed or given but hypothesis can vary
• Study with small N provides more evidence against the null hypothesis because you must have
had a stronger effect size to achieve the same p value
o But we know that large N is better
Likelihood Axiom probability of observed data (D), conditional on some hypothesis (H) being true is
equal to P(D|H)
Law of Likelihood if an event is more probable under hypothesis A than B, then occurrence of the
event provides more support for Hypothesis A
Likelihood Ratio tool of inference; the value of a single likelihood is meaningless in isolation, only in
comparing likelihoods do we find meaning
• Quantifies evidence in favor of one hypothesis relative to another
• A Likelihood Ratio of 1 suggests equal support for 2 hypotheses
• A Likelihood Ratio of 10 means the data are 10 times more probable under Hypothesis A than under Hypothesis B
• Conditional on observed data
Advantages of Likelihood Ratios
• Depend only on the probability model and observed data, so optional stopping and multiple testing don't influence them
• Can combine Likelihood ratios for 2 similar but independent studies by multiplying them
together
• Unlike p-value, likelihood ratio has direct interpretation in terms of strength of evidence and can
be compared across similar experiments with different Ns
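A minimal sketch of a likelihood ratio for binomial data; the counts and the two hypothesized rates are invented for illustration:

```python
from math import comb

def binom_lik(k, n, p):
    """Likelihood of k successes in n trials if the success probability is p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical data: 8 successes in 10 trials.
# Hypothesis A: rate = .8; Hypothesis B: rate = .5.
lr = binom_lik(8, 10, 0.8) / binom_lik(8, 10, 0.5)
print(round(lr, 1))   # ~6.9: the data are ~7x more probable under A than B

# Two similar, independent studies combine by multiplying their ratios:
lr2 = binom_lik(7, 10, 0.8) / binom_lik(7, 10, 0.5)
combined = lr * lr2
```

By the rule of thumb below, the single-study ratio of about 6.9 falls just short of the "fairly strong" cutoff of 8, while the combined ratio exceeds it.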
Rule of Thumb for interpreting Likelihood (Dienes 2008, Royall 2000)
• Likelihood Ratio ≥ 8 is moderate/fairly strong evidence for hypothesis A over B
• Likelihood Ratio ≥ 32 is strong evidence for hypothesis A over B
With Likelihood Theory:
• Evidence cannot be in error but it can be misleading
• If hypothesis parameter values are close, it’s hard to find evidence for one over the other
Relationship between N and the Likelihood Curve
• Bigger N = narrower likelihood curve meaning there is less chance for misleading evidence
Potential Advantage with Bayes Method
• Estimate probabilities of hypotheses
• Compare likelihoods of competing hypotheses
• Estimate inferential Confidence Intervals under explicit assumptions
• Systematic framework for revising plausibility of hypotheses as new data collected (can update)
Principles Supported by Bayesian
• Initial differences in perceived hypothesis plausibility become less important as results
accumulate
o Different initial beliefs are usually driven towards the same conclusion
o Skeptics will require more data to reach same level of belief as those who initially
supported
• Data that are not precise have less sway on subsequent plausibility of a hypothesis than data
that are more precise
Probability represents the degree of belief in a fact or prediction (between 0 and 1)
Conditional Probability probability based on some background information
Conjoint Probability probability that two things are true
• Probability of A and B
Bayes theorem describes the probability of an event, based on prior knowledge of conditions that might
be related to the event
p(H|D) = p(H)p(D|H)/p(D)
• p(H|D) = posterior (probability of hypothesis after seeing data)
• p(H) = prior (probability of hypothesis before seeing data – subjective)
• p(D|H) = likelihood (probability of the data under the hypothesis)
• p(D) = normalizing constant (probability of the data under any hypothesis)
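A minimal numeric sketch of the theorem with two exhaustive hypotheses; the prior and the likelihoods are invented purely for illustration:

```python
# Two exhaustive hypotheses: H and not-H.
prior_H = 0.5          # p(H): belief before seeing the data
lik_H = 0.30           # p(D|H): probability of the data if H is true
lik_notH = 0.04        # p(D|not-H)

p_D = prior_H * lik_H + (1 - prior_H) * lik_notH   # normalizing constant p(D)
posterior_H = prior_H * lik_H / p_D                # p(H|D)
print(round(posterior_H, 2))   # ~0.88: the data shift belief strongly toward H
```

Collecting more data and reapplying the same update (using today's posterior as tomorrow's prior) is exactly the "can update" property noted above.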
Priors
• Bayesian Analysis requires prior belief about a certain effect, which combines with data
observed (likelihood) to update one’s belief (posterior)
• Most common criticism of Bayesian is that the priors are too subjective
• Prior should be based on data we have seen before
• Extraordinary claims require extraordinary evidence
• If you are concerned that your prior is wrong, collect more data
• Must be able to justify your prior to a skeptic
• Priors are not fixed
• More N = more likely that the posterior will match the likelihood of the data
Jarosz and Wiley – Interpretation of Bayes
• 1 - .33 = weak support for Hypothesis
• .33 - .05 = positive support for Hypothesis
• .05 - .0067 = strong support for Hypothesis
• < .0067 = very strong support for Hypothesis
Summary of Three Types of Inference (Dienes, 2008)
NHST
• Aim: provide a reliable decision procedure with controlled long-term error rates
• Meaning of probability: objective frequencies
• Inference: black and white decisions
• Long-run error rates: controlled at fixed values
• Sensitive to: stopping rules, what counts as the family of tests, timing of explanation relative to data
Bayes
• Aim: indicate how prior probabilities should be changed by the data
• Meaning of probability: subjective
• Inference: continuous degree of posterior belief
• Long-run error rates: not guaranteed to be controlled
• Sensitive to: prior opinion
Likelihood
• Aim: indicate relative strength of evidence
• Meaning of probability: can be used with either interpretation
• Inference: continuous degree of evidence
• Long-run error rates: errors of weak and misleading evidence controlled
• Sensitive to: none of the above
Week #6: Power
Hypothesis testing statistical procedure that allows researchers to use sample data to make decisions about a population
1. State the Hypothesis
o Null Hypothesis
o Alternative Hypothesis
▪ Non-Directional Two-Tailed Hypothesis – some type of change
▪ Directional One-Tailed Hypothesis – specifies direction of change/effect
2. Set Decision Criteria
o Define exactly what is meant by low and high probability by selecting a specific probability value (aka set alpha)
o Critical Region = extremely unlikely values that an alpha level defines, very unlikely to occur if H0 is true
3. Collect Data and Compute Sample Statistics
o Select random sample
o Summarize raw sample data with appropriate statistics
o Compute sample mean and compare by computing a z-score (see where the sample
mean falls relative to hypothesized population mean)
4. Make a decision
o Use z-score to make a decision about null hypothesis
1. Reject Null Hypothesis because sample data falls in critical region; indicating a large
discrepancy between sample and null hypothesis
OR
2. Fail to Reject the Null Hypothesis because sample data is not in the critical region;
indicating that the data is reasonably close to the null hypothesis
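The four steps can be sketched with a z-test on hypothetical numbers (H0 says mu = 50 with sigma = 10; a sample of n = 25 has mean M = 54):

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, m = 50, 10, 25, 54          # hypothetical numbers
z = (m - mu0) / (sigma / sqrt(n))          # step 3: locate M under H0
p = 2 * (1 - NormalDist().cdf(abs(z)))     # two-tailed p value

# Step 4: z = 2.0 exceeds the 1.96 cutoff for alpha = .05, so reject H0.
print(z, round(p, 4))
```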
Errors in Hypothesis Testing
• Type 1 Error reject null hypothesis when it is actually true
• Type 2 Error fail to reject null hypothesis when it is actually false
Power probability of correctly rejecting a false Null Hypothesis
• Power is a probability
• Refers to correct rejection of Null Hypothesis
• BEFORE data is collected
o But can be important after data has been collected if fail to reject the null hypothesis
Factors that Affect Power:
1. N (Sample Size) bigger N = more power (and less standard error and bigger z score)
2. Effect Size bigger effect size = more power (and bigger z score; less difference between
observed/expected)
3. Underlying Standard Deviation less standard deviation = more power (and less standard
error)
o Standard deviation can be decreased by making sample homogenous or by increasing
reliability of measures
4. Alpha Level bigger alpha = more power (more of distribution will be in the critical region)
5. One vs. Two-Tailed Tests one-tailed (directional) test = more power than two-tailed (non-directional) test
Summary of Factors that Affect Power:
• 2 factors narrow the widths of the sampling distribution of means
o Bigger N
o Less Standard Deviation
• 1 factor widens distance between means
o Bigger effect size
• 2 factors increase size of rejection region
o Bigger Alpha
o One-Tailed (Directional) Test
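The factors above can be checked with a quick power calculation for a one-sample z-test. This is a sketch with invented effect sizes and Ns (real analyses would typically use a t-test or a tool like G*Power):

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(d, n, alpha=0.05, two_tailed=True):
    """Approximate power of a one-sample z-test for standardized effect d."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / (2 if two_tailed else 1))
    # Chance the statistic clears the critical value when the effect is real
    # (ignores the negligible probability of rejecting in the wrong tail)
    return 1 - norm.cdf(z_crit - d * sqrt(n))

baseline = z_test_power(0.5, 30)
print(round(baseline, 2))
print(round(z_test_power(0.5, 60), 2))               # bigger N -> more power
print(round(z_test_power(0.8, 30), 2))               # bigger effect -> more power
print(round(z_test_power(0.5, 30, alpha=0.10), 2))   # bigger alpha -> more power
print(round(z_test_power(0.5, 30, two_tailed=False), 2))  # one-tailed -> more
```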
Retrospective Power power calculated after experiment completed based on the experiment’s
results (aka a post-hoc power)
• Don’t use to explain results
Reinhart (2015) discussed underpowered studies
Kanazawa (2005) when small Ns yield big Effect Sizes, be wary! They are likely underpowered
With alpha = 5% and Power = 80%, most likely outcome of a study is a TRUE NEGATIVE
If we increase Power, we increase the likelihood of finding a TRUE POSITIVE
If we reduce alpha, TRUE NEGATIVE is the highest likelihood
Improving the Prior = best outcome yet, strong likelihood of TRUE POSITIVE
***Choose a likely hypothesis (good theory, sound priors)
Null Hypothesis and Alternative Hypothesis & Normal Distribution
When setting the error rate, must consider goals, discipline, resources
• Best to try and minimize both types of errors
If there is a true effect, the p-Values you can expect to observe are completely determined by the statistical power
4 Common Power Fallacies
1. Experiments have Power
NO: studies do not have power, only statistical tests have power (power curves)
2. Power is a conditional probability
NO: power and type 1 error rates are hypothetical probabilities, NOT conditional
3. Sample size choice should be based on previous results
NO: won’t have sufficient power to detect ES that are smaller than the one in previous study
4. Sample size choice should be based on what the effect is believed to be
NO: theories aren’t specified precisely enough in terms of effect size
Week #7: Sample Size & Optional Stopping
Problems with Small Sample Size:
• More Type 2 Errors
• Less Power
• Large variation so inaccurate estimates (poor generalizability)
Why are Studies Underpowered?
• People use heuristics to plan sample size (but intuitions about the needed N are usually wrong)
o Don’t use Heuristics! Be systematic
• Don’t think about how to design study
o Within-subjects designs have more power so needs smaller N to get same level of power
as Between-subjects designs (use within-subjects if recruitment is a problem)
▪ But not always possible
Selecting N:
• Plan for N based on the width of the confidence interval you want to achieve
o Easy if you know the Standard Deviation and the range within which means should fall (though this is rarely known)
• Select N based on desired probability that you find p < .05
• Some suggest basing N on previous effect size (but studies may not be equivalent + publication
bias inflates effect size)
o Allows data peeking but controls Type 1 error of multiple tests
o If possible, use an unbiased effect size estimate
o It is risky to base power analysis on a single study
Power for Interactions
• Normally you need more participants for an interaction than a simple effect or main effect
Why have so many interactions been found?
• P-Hacking
• Bad inferences (fail to test interaction effect)
• Cross-over interactions
• Error
• Bigger Ns
Optional Stopping collecting data until p < .05
• Problem: increases rate of Type I errors
• Effects of Optional Stopping when the Null Hypothesis is true: the p distribution should be uniform (each 1% bin of p values is equally likely)
o True Effect Size is 0, so you can only observe true negatives and false positives
• False Positive Rate is 5% in the long run
• When performing a single test, Type 1 error rate is equal to the probability of finding a p value
lower than alpha when there is no effect
• With optional stopping, the Type 1 error rate at each look is approximately alpha, but the overall
Type 1 error rate will be greater than alpha
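The inflation is easy to see by simulation. A sketch under assumed conditions (true effect of zero, known sigma = 1, five hypothetical looks, z-test at each look):

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(1)  # for reproducibility
norm = NormalDist()

def p_value(sample):
    """Two-sided z-test against mu = 0, with sigma known to be 1."""
    stat = mean(sample) * sqrt(len(sample))
    return 2 * (1 - norm.cdf(abs(stat)))

looks = [20, 40, 60, 80, 100]   # peek at the data at these sample sizes
sims = 2000
stopped_sig = 0
for _ in range(sims):
    data = []
    for n in looks:
        data += [random.gauss(0, 1) for _ in range(n - len(data))]
        if p_value(data) < 0.05:   # stop as soon as p dips below .05
            stopped_sig += 1
            break

rate = stopped_sig / sims
print(rate)   # well above the nominal .05, even though H0 is true
```

Every rejection here is a Type 1 error, since the simulated effect is exactly zero.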
Sequential Analysis
• Controls Type 1 error rates
• Divide .05 by the # of looks and use that new p value as the cut off
• There are ethical reasons to look at the data during collection, just need to declare # of looks
before starting
Pocock Boundary easy way to control for Type 1 error in sequential analysis
• Use a smaller critical p value at every look
• Under optional stopping without such a correction, very small p values become less likely than p values just under .05
Preregistration
• You can’t test a hypothesis on the data used to generate it (because you can’t tell if it’s a Type 1 error)
• Can do one-tailed tests
• Controls error rates
• Prevents publication bias
1. Justify your sample size (stopping rule + specify type 1 and 2 error controls)
2. List all IVs and DVs for each test
3. Describe analysis plan
Gelman and Carlin
• Propose performing design analysis rather than power analysis
• Don’t focus on Type 1 and 2 errors but instead on errors in:
o Type S sign or direction of effect size
o Type M magnitude or strength of effect size
Design Analysis can reveal 3 Problems:
1. A study with low power is unlikely to yield statistically significant results
2. Possible for a study to have p < .05 and for there to be a high chance that the effect is in the
opposite direction
3. Using p < .05 as a screener can lead researchers to drastically overestimate the magnitude of an
effect
Overall Goals for Determining Sample Size:
• Plan for Accuracy
• Plan for desired statistical power
• Plan for feasibility
• Plan for low Type M/Type S Error Rates
Week #8: Effect Size, Confidence Intervals, and Meta-Analysis
The “New Statistics”: (but actually none of them are new!)
***Goal of replacing p-Values and NHST
- Effect Size (ES)
- Confidence Intervals (CIs)
- Meta-Analysis
New Statistics Strategy: The 8 Step Process
1. Formulate research question(s) in estimation terms – avoid dichotomous thinking!
2. Identify the ES that will best answer the research question
o Difference between 2 means? Then difference = ES
o Does a model describe data? Then ES is measure of goodness of fit
3. Fully declare details of intended procedure/data analysis (preregistration)
4. Calculate point estimate & CIs for chosen ES
5. Make figures that include the CIs
6. Interpret the ES and CIs
7. Use Meta-Analytic Thinking to present and integrate findings
8. Report – be fully transparent
Effect Size amount of anything of interest
• NOT p value
• Can be means, differences between means, frequencies, correlations, etc.
• Sample Effect Size = the point estimate of a population effect size
3 Main Goals of Effect Size
1. Communicate Practical Significance of Results
2. Draw Meta-Analytic Conclusions
3. Perform Power Analyses
Unstandardized Effect Size
• Scale in original metric
• Reports difference between 2 groups
• Very easy to understand because clear meaning
• But very few measures in psychology have meaningful metrics
Standardized Effect Size
• Converted to a common metric
• ES ranges from −1 to 1
Different Families of Effect Size
• “d” Family – standardized mean differences
o Cohen’s d, Hedges’ g
• “r” Family – strength of association
o r, R², eta-squared, omega-squared, epsilon-squared
Cohen’s d:
• Ranges from 0 to infinity
• Mean Difference / SD
• “d” tends to overestimate true Effect Size especially with smaller Ns
o Hedges’ g is the unbiased version (always use it if you can!)
• Cohen’s d can be converted to r
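A sketch of computing d, its Hedges' g correction, and the d-to-r conversion. The scores are hypothetical, and the g correction uses the common approximate J factor:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(a, b):
    """Mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a)**2 + (nb - 1) * stdev(b)**2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

def hedges_g(a, b):
    """Approximate small-sample bias correction of d."""
    df = len(a) + len(b) - 2
    return cohens_d(a, b) * (1 - 3 / (4 * df - 1))

def d_to_r(d):
    """Convert d to r (equal-n approximation)."""
    return d / sqrt(d**2 + 4)

treat = [5.1, 6.2, 5.8, 6.5, 5.9, 6.1]   # hypothetical condition scores
ctrl = [4.8, 5.0, 5.5, 4.9, 5.2, 5.3]
d = cohens_d(treat, ctrl)
print(round(d, 2), round(hedges_g(treat, ctrl), 2), round(d_to_r(d), 2))
```

Note that g comes out smaller than d, consistent with d overestimating the true effect at small Ns.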
The r Family:
• Proportion of total variance that can be explained by an effect
• How much does the relationship between X and Y reduce the error?
Cohen’s Effect Size Guidelines (“T-Shirt” sizes – use with caution!)
• They were made up by Cohen with no empirical basis (roughly: d = .2 small, .5 medium, .8 large; r = .1, .3, .5)
• It is better to relate your study’s Effect Size to known real-world effects to help people understand
Criticisms of Effect Size Approach
• Often insightful to show that minimal manipulation of the independent variable still accounts
for some variance of the dependent variable
• Tajfel’s Minimal Group Paradigm
o Even when individuals were randomly/meaninglessly assigned to a group, they still
favored it/fellow members (demonstrating fundamental in-group favoritism)
• Can also show that an easy-to-change IV can affect a hard-to-change DV
o Such as judgments of a defendant (attractiveness as IV)
• Effect Sizes are often inflated or biased
Effect Size CANNOT be Estimated in Lab
• N in psychology studies is usually too small to accurately distinguish among small, medium, and
large effect size
• Without much larger Ns than typically affordable, Confidence Intervals will span the range of
small to large, meaning there will be overlap/inclusion of multiple categories
Confidence Intervals
• Statement about what % of Confidence Intervals contain the true parameter value in the long
run
• Can calculate Confidence Interval around any estimate
• Bigger N = Less Standard Error = Narrower Confidence Interval
• Used to indicate precision of Effect Size estimates (New Statistics Strategy)
• Long Term: 95% of 95% Confidence Intervals will contain the true population value
• Smaller N = More Variation of CI width
o Variation can be so large that it does not yield a meaningful CI
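A sketch of a z-based 95% CI around a mean (hypothetical scores; a t interval would be more appropriate at small N):

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def ci_95(sample):
    """Approximate 95% confidence interval for the mean."""
    se = stdev(sample) / sqrt(len(sample))   # bigger N -> smaller SE
    z = NormalDist().inv_cdf(0.975)          # ~1.96
    m = mean(sample)
    return m - z * se, m + z * se

scores = [4.8, 5.0, 5.5, 4.9, 5.2, 5.3, 5.1, 4.7]   # hypothetical data
low, high = ci_95(scores)
print(round(low, 2), round(high, 2))

# Quadrupling N (same spread) roughly halves the interval width:
low4, high4 = ci_95(scores * 4)
print(round(high4 - low4, 2), "<", round(high - low, 2))
```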
Confidence Intervals & p Values
• Directly related
• When the overlap between 2 confidence intervals is approximately half of one side of the
confidence interval, the group difference will be p < .05 for INDEPENDENT GROUPS
Common Misinterpretations of Confidence Intervals
1. If 2 CIs overlap, the difference between the 2 estimates is Non-Significant
NO: they can overlap and still yield p < .05
It is true that if 2 CIs fail to overlap, p < .05 & that if one CI contains point estimate of the other,
then p > .05
