PSYC 301 Final: PSYC 301 Final Exam Study Guide

Queen's University | PSYC 301 | Jill A Jacobson

PSYC 301 Advanced Statistical Inference Final Exam Review, Weeks 1–12

Week #1: The Crisis in Science

P-Hacking → "data dredging"; the use of data mining to uncover patterns in data that can be presented as statistically significant
• Replication studies are rarely published
• Work is much more likely to be published if it tests a novel/unexpected hypothesis (and if it presents significant findings/statistics)
• Examples (illustrated in the simulation sketch at the end of this week's notes):
o Collecting data until a significant effect is found
o Excluding some observations, measures, or conditions
o Combining conditions
o Adding control variables
o Combining/transforming specific measures
o Optional stopping

HARKing (Kerr, 1998) → Hypothesizing After the Results are Known
• Post hoc hypotheses are presented as a priori
• Examples:
o Poorly specified moderators
o Too perfect a confirmation of the theory
o The research design and the research question are equivalent
• Problematic because:
o Sample-dependent, not generalizable
o Type 1 errors
o Promotes statistical abuse
• Solutions = preregistration, replication, proper labeling

Decline Effect (Lehrer, 2010) → when the effects observed in an original study become weaker/harder to detect in subsequent studies
• Examples: drug studies, Schooler's verbal overshadowing, attraction/mating success & physical symmetry
• Why?
o Publication bias = initially positive results are more likely to be published, but later null/negative results are more novel
o A Type 1 error in the initial study could be perpetuated
o The original study could have involved a smaller N/larger effect size

WEIRD Samples (Henrich et al., 2010)
• Samples tend to be Western, Educated, Industrialized, Rich, and Democratic
• Problem = research assumes that findings are universal, but some effects don't generalize
• All experimental psychology samples are unusual, not just human adults (infants, children, rats, etc.)

Some Examples of Fraud & Questionable Practices:
- Marc Hauser (2010) – 8 counts of scientific misconduct; biologist
- Bem (2011) – problems with 8/9 experiments
- Diederik Stapel (2011) – fabricated data (reported by his grad students)
- Simmons, Nelson, and Simonsohn (2011) – False-Positive Psychology paper (deliberately engaged in p-hacking)
- Doyen et al. (2012) – failed to replicate Bargh et al.'s (1996) social priming study
- Smeesters (2012) – engaged in QRPs
- Simonsohn – cherry-picking papers to report

Guidelines for Authors
• Decide on rules for data collection before it begins and report them in the article/paper
• Minimum of 20 observations per cell
• List all variables collected in the study
• Report all experimental conditions (including ones that failed)
• Report statistical results with and without eliminated observations
• Report results with and without covariates included
• 21-word solution to increase credibility (in the Methods section)

Schimmack Recommendations
• Instead of multiple small studies, we need larger studies
• Require researchers with multiple papers to calculate total power and justify sample sizes
• Allow publication of failed (null) results
• Publish replications
• Adopt new statistics instead of p-values (confidence intervals and effect sizes)
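The QRP examples above are easy to demonstrate by simulation. Below is a minimal sketch (not from the course; the choice of three DVs and all variable names are illustrative assumptions) showing how testing several outcome measures and keeping whichever is significant inflates the false-positive rate well past the nominal 5%.

```python
# Minimal sketch (illustrative assumptions): trying several analyses per
# dataset inflates the false-positive rate even when no true effect exists.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_per_group, alpha = 10_000, 20, 0.05
false_positives = 0

for _ in range(n_experiments):
    # Null is true: both groups come from the same population, three independent DVs.
    control = rng.normal(0, 1, size=(n_per_group, 3))
    treatment = rng.normal(0, 1, size=(n_per_group, 3))
    # "Researcher degrees of freedom": test each DV and keep the best p value.
    p_values = [stats.ttest_ind(treatment[:, j], control[:, j]).pvalue for j in range(3)]
    if min(p_values) < alpha:
        false_positives += 1

print(f"Nominal alpha: {alpha:.2f}")
print(f"Observed false-positive rate with 3 tries per study: {false_positives / n_experiments:.3f}")
# Expect roughly 1 - (1 - .05)**3 ≈ .14 rather than .05.
```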
Week #2: The Crisis Continued…

Replication → the ability to produce the same findings using a different group
Reproducibility → the ability to produce the same results using the same data (same group, same design, etc.)

The Allure of Neuro
• McCabe and Castel showed that adding a brain image made research seem more compelling
• The quality of the brain picture influenced perceptions about the quality of the research
• Bias to believe that "hard" sciences are more credible
• Many researchers do not use correction with imaging (fMRI, etc.)

Voodoo Correlations → looking to see what is significant and then keeping only what is significant ("double dipping")
• Example: Yarkoni (2009)

SHARKing → Selecting Hypothesized brain Areas after the Results are Known
• One of the most QRP-ridden areas of research

Button et al. (2013): Low Power
• Problems that contribute to the low reliability of findings:
o Low power
▪ The less power you have, the lower the likelihood that a significant result is a true effect rather than a false positive
o PPV (positive predictive value)
▪ Less power = lower PPV
o Low power means that the effect size estimate is inflated

Overall:
• Neuroscience is often more persuasive but has very high false-positive rates (even in the absence of QRPs)
• The QRP rate is very high
• Usually small Ns and low power
• High researcher degrees of freedom
• Software errors
• Multiple comparisons are now standard, but buggy software is often used for the corrections
• Insufficient reporting
• Lack of replication

Week #3: p-Values

History of NHST
• R.A. Fisher
o First person to propose p-values
o Not proposed as formal hypothesis testing (meant to be a measure of surprise)
• Neyman–Pearson
o Wished to ensure that false positives occurred only at a predefined rate
o Called this rate "alpha" – generally set at .05
• Alpha and the p-value were never designed to work together
• The p-value is not equal to the false-positive rate

So how did we get here? (Hubbard and Ryan, 2000)
• The p-value was simple and appealing (no need to do comparisons)
• Unawareness of the test's limitations
• No simple replacement
• Failure of statisticians to debunk it

Base Rate Fallacy
• The actual error rate is almost certainly higher than .05/5%
• Using alpha/p does not tell you anything about how many true effects there really are
• People falsely conclude that p < .05 means there is a 95% chance that the result is true
o But the ACTUAL RATE depends on the BASE RATE of true hypotheses (see the PPV sketch below)

The p-Value → the probability, under the assumption that there is no true effect/difference, of collecting data that show a difference equal to or more extreme than the observed difference
• Like a measure of surprise
• NOT a measure of:
o Truth
o Effect size
o Practical significance (statistically significant doesn't necessarily mean important)

ASA Statement on p-Values (2016)
• p-values can be used to indicate how incompatible the data are with a specified statistical model
• They do not measure the probability that the hypothesis is true or the probability that the data are the result of chance alone
• Conclusions/policies/decisions should not be based on p-values alone
• Proper inference requires full reporting and transparency
• A p-value does not measure the size of an effect or the importance of a result

Pitfalls of p-Values:
• Statistically significant does not mean important, just that it is worthy of more study
• Cannot prove whether a hypothesis is true or not
o Most common misconception: a p-value of .05 means that there is only a 5% chance that the null hypothesis is true
• More is not necessarily better = more data/more questions increases the risk of p-value problems
• A p-value that is not statistically significant could be absence of evidence, not evidence of absence
• Some potholes are deliberately hidden
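To connect the Base Rate Fallacy with the PPV point from Button et al. (2013), here is a minimal numeric sketch (the 10% base rate and the power values are assumed purely for illustration): it computes the positive predictive value, i.e., the share of significant results that reflect true effects.

```python
# Minimal sketch (assumed numbers): PPV = P(true effect | significant result).
# With a low base rate of true hypotheses, far more than 5% of significant
# findings can be false positives even when alpha = .05.
def ppv(base_rate: float, alpha: float, power: float) -> float:
    true_positives = base_rate * power
    false_positives = (1 - base_rate) * alpha
    return true_positives / (true_positives + false_positives)

for power in (0.35, 0.80):
    value = ppv(base_rate=0.10, alpha=0.05, power=power)
    print(f"base rate 10%, alpha .05, power {power:.0%}: PPV = {value:.2f}")
# Lower power -> lower PPV, matching the Button et al. (2013) point above.
```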
Intro Stats Method → the integration of the p-value and alpha; reliance on statistical significance testing
• Journal editors used significance testing to decide whether to publish or reject a paper/study
• Mechanically applied; seems less subjective
• Provides a common language/identity

Inverse Probability Error → falsely believing that p-values indicate how right or wrong you are, i.e., the probability that the null hypothesis is true

Sizeless Science → a focus on p-values while ignoring effect size

Costs of Significance Testing:
• When power is low, clinically significant differences are often statistically nonsignificant, so we may miss identifying beneficial treatments
• The focus on significance led to a disregard for psychometrics

In Defense of p-Values
• NHST doesn't kill science, people kill science – it shouldn't be banned just because people tend to misuse/misinterpret it
• While NHST is not effective in establishing effect size, it is good at determining relative differences
• Most theories and hypotheses are about comparisons or relative differences rather than specific values
• Effect size is always important in real-world applications, but p-values might help identify potential treatments for which to calculate effect sizes
• Bayesian analysis and p-values often lead to similar conclusions, but Bayesian analysis does not allow for error control over multiple tests
• p-values help distinguish real effects from artifacts
• p-values play a definite but limited role in statistical inference and shouldn't be replaced completely
• The problem is not p-values but the failure to adjust for them

Week #4: Likelihood & Bayes Method

Lindley's Paradox → even when a p-value is very small, there can still be a high probability that the null hypothesis is true
• A result can be unlikely when the null hypothesis is true, but it can be even more unlikely when the alternative hypothesis is true and power is high

Likelihood
• Likelihood is not a probability, but it is proportional to a probability
• Likelihoods do not need to sum to 1
• With likelihood, the data are fixed (given) but the hypothesis can vary
• For the same p-value, a study with a small N provides more evidence against the null hypothesis, because you must have had a stronger effect size to achieve that p-value
o But we know that a large N is still better

Likelihood Axiom → the probability of the observed data (D), conditional on some hypothesis (H) being true, is P(D|H)
Law of Likelihood → if an event is more probable under hypothesis A than hypothesis B, then the occurrence of the event provides more support for hypothesis A

Likelihood Ratio → the tool of inference; the value of a single likelihood is meaningless in isolation – only in comparing likelihoods do we find meaning
• Quantifies the evidence in favor of one hypothesis relative to another
• A likelihood ratio of 1 suggests equal support for the 2 hypotheses
• A likelihood ratio of 10 suggests that hypothesis A is 10 times more probable than hypothesis B, given the data
• Conditional on the observed data

Advantages of Likelihood Ratios
• Depend only on the probability model and the observed data, so optional stopping and multiple testing don't influence them
• Likelihood ratios from 2 similar but independent studies can be combined by multiplying them together
• Unlike the p-value, the likelihood ratio has a direct interpretation in terms of strength of evidence and can be compared across similar experiments with different Ns

Rules of Thumb for Interpreting Likelihood Ratios (Dienes, 2008; Royall, 2000)
• A likelihood ratio ≥ 8 is moderate/fairly strong evidence for hypothesis A over B
• A likelihood ratio ≥ 32 is strong evidence for hypothesis A over B
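As a concrete illustration of the likelihood-ratio ideas above, here is a minimal sketch (the binomial setup, the hypothesized values 0.7 vs. 0.5, and the data are assumptions for illustration): it computes the likelihood ratio for two point hypotheses and shows that independent studies combine by multiplication.

```python
# Minimal sketch (assumed data and hypotheses): likelihood ratio for
# H_A: p = 0.7 versus H_B: p = 0.5 after observing k successes in n trials.
from scipy.stats import binom

def likelihood_ratio(k: int, n: int, p_a: float = 0.7, p_b: float = 0.5) -> float:
    return binom.pmf(k, n, p_a) / binom.pmf(k, n, p_b)

lr_study1 = likelihood_ratio(k=14, n=20)
lr_study2 = likelihood_ratio(k=13, n=20)

print(f"Study 1 LR: {lr_study1:.2f}")   # evidence for H_A over H_B from study 1
print(f"Study 2 LR: {lr_study2:.2f}")
print(f"Combined LR (independent studies multiply): {lr_study1 * lr_study2:.2f}")
# Using the Dienes/Royall thresholds above: >= 8 moderately strong, >= 32 strong.
```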
With Likelihood Theory:
• Evidence cannot be in error, but it can be misleading
• If the hypotheses' parameter values are close, it is hard to find evidence for one over the other

Relationship between N and the Likelihood Curve
• Bigger N = narrower likelihood curve, meaning there is less chance of misleading evidence

Potential Advantages of the Bayesian Method
• Estimate probabilities of hypotheses
• Compare likelihoods of competing hypotheses
• Estimate inferential confidence intervals under explicit assumptions
• Provides a systematic framework for revising the plausibility of hypotheses as new data are collected (can update)

Principles Supported by the Bayesian Approach
• Initial differences in the perceived plausibility of a hypothesis become less important as results accumulate
o Different initial beliefs are usually driven towards the same conclusion
o Skeptics will require more data to reach the same level of belief as those who initially supported the hypothesis
• Data that are not precise have less sway on the subsequent plausibility of a hypothesis than data that are more precise

Probability → the degree of belief in a fact or prediction (between 0 and 1)
Conditional Probability → a probability based on some background information
Conjoint Probability → the probability that two things are both true (the probability of A and B)

Bayes' Theorem → describes the probability of an event based on prior knowledge of conditions that might be related to the event:
p(H|D) = p(H) p(D|H) / p(D)
• p(H|D) = posterior (probability of the hypothesis after seeing the data)
• p(H) = prior (probability of the hypothesis before seeing the data – subjective)
• p(D|H) = likelihood (probability of the data under the hypothesis)
• p(D) = normalizing constant (probability of the data under any hypothesis)

Priors
• Bayesian analysis requires a prior belief about an effect, which combines with the observed data (the likelihood) to update one's belief (the posterior)
• The most common criticism of Bayesian analysis is that priors are too subjective
• A prior should be based on data we have seen before
• Extraordinary claims require extraordinary evidence
• If you are concerned that your prior is wrong, collect more data
• You must be able to justify your prior to a skeptic
• Priors are not fixed
• Bigger N = more likely that the posterior will match the likelihood of the data

Jarosz and Wiley – Interpreting Bayes
• 1 – .33 = weak support for the hypothesis
• .33 – .05 = positive support for the hypothesis
• .05 – .0067 = strong support for the hypothesis
• < .0067 = very strong support for the hypothesis

Summary of Three Types of Inference (Dienes, 2008)
• NHST – meaning of probability: objective frequencies; aim: provide a reliable decision procedure with controlled long-term error rates; inference: black-and-white decisions; long-run error rates: controlled at fixed values; sensitive to: stopping rules, what counts as the family of tests, timing of explanation relative to data
• Bayes – meaning of probability: subjective; aim: indicate how prior probabilities should be changed by the data; inference: continuous degree of posterior belief; long-run error rates: not guaranteed to be controlled; sensitive to: prior opinion
• Likelihood – meaning of probability: can be used with either interpretation; aim: indicate relative strength of evidence; inference: continuous degree of evidence; long-run error rates: errors of weak and misleading evidence controlled; sensitive to: none of the above
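Here is a minimal worked sketch of the Bayes update defined above (the prior of 0.2 and the binomial data are assumed for illustration, not course numbers): a skeptical prior on a point hypothesis is revised after seeing the data.

```python
# Minimal sketch (assumed prior and data): p(H|D) = p(H) p(D|H) / p(D),
# comparing H1: p = 0.7 against H0: p = 0.5 after 14 successes in 20 trials.
from scipy.stats import binom

prior_h1 = 0.2                      # subjective prior belief in H1 before seeing data
prior_h0 = 1 - prior_h1
like_h1 = binom.pmf(14, 20, 0.7)    # p(D|H1)
like_h0 = binom.pmf(14, 20, 0.5)    # p(D|H0)

p_data = prior_h1 * like_h1 + prior_h0 * like_h0   # normalizing constant p(D)
posterior_h1 = prior_h1 * like_h1 / p_data

print(f"Prior p(H1) = {prior_h1:.2f}")
print(f"Posterior p(H1|D) = {posterior_h1:.2f}")
# A skeptical prior is pulled toward H1 by the data; more data moves it further.
```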
Week #6: Power

Hypothesis Testing → a statistical procedure that allows researchers to use sample data to make decisions about a population
1. State the Hypotheses
o Null hypothesis
o Alternative hypothesis
▪ Non-directional (two-tailed) hypothesis – some type of change
▪ Directional (one-tailed) hypothesis – specifies the direction of the change/effect
2. Set the Decision Criteria
o Define exactly what is meant by low and high probability by selecting a specific probability value (i.e., set alpha)
o Critical region = the extremely unlikely values that the alpha level defines; very unlikely to occur if H0 is true
3. Collect Data and Compute Sample Statistics
o Select a random sample
o Summarize the raw sample data with appropriate statistics
o Compute the sample mean and compare it by computing a z-score (see where the sample mean falls relative to the hypothesized population mean)
4. Make a Decision
o Use the z-score to make a decision about the null hypothesis:
1. Reject the null hypothesis because the sample data fall in the critical region, indicating a large discrepancy between the sample and the null hypothesis, OR
2. Fail to reject the null hypothesis because the sample data are not in the critical region, indicating that the data are reasonably close to the null hypothesis

Errors in Hypothesis Testing
• Type 1 error → rejecting the null hypothesis when it is actually true
• Type 2 error → failing to reject the null hypothesis when it is actually false

Power → the probability of correctly rejecting a false null hypothesis
• Power is a probability
• Refers to the correct rejection of the null hypothesis
• Calculated BEFORE data are collected
o But can be important after data have been collected if you fail to reject the null hypothesis

Factors that Affect Power (see the simulation sketch below):
1. N (sample size) → bigger N = more power (less standard error and a bigger z-score)
2. Effect size → bigger effect size = more power (bigger z-score; bigger difference between the observed and expected means)
3. Underlying standard deviation → smaller standard deviation = more power (less standard error)
o The standard deviation can be decreased by making the sample more homogeneous or by increasing the reliability of the measures
4. Alpha level → bigger alpha = more power (more of the distribution falls in the critical region)
5. One- vs. two-tailed tests → a one-tailed (directional) test has more power than a two-tailed (non-directional) test

Summary of Factors that Affect Power:
• 2 factors narrow the width of the sampling distribution of means:
o Bigger N
o Smaller standard deviation
• 1 factor widens the distance between means:
o Bigger effect size
• 2 factors increase the size of the rejection region:
o Bigger alpha
o A one-tailed, directional test

Retrospective Power → power calculated after the experiment is completed, based on the experiment's results (a.k.a. post-hoc power)
• Don't use it to explain results

Reinhart (2015) discussed underpowered studies.
Kanazawa (2005) → when small Ns yield big effect sizes, be wary! They are likely underpowered.

• With alpha = 5% and power = 80%, the most likely outcome of a study is a TRUE NEGATIVE
• If we increase power, we increase the likelihood of finding a TRUE POSITIVE
• If we reduce alpha, a TRUE NEGATIVE remains the most likely outcome
• Improving the prior = the best outcome yet; a strong likelihood of a TRUE POSITIVE
***Choose a likely hypothesis (good theory, sound priors)

(Figure: null-hypothesis and alternative-hypothesis sampling distributions under the normal curve)

When setting the error rate, you must consider your goals, discipline, and resources
• It is best to try to minimize both types of errors
If there is a true effect, the p-values you can expect to observe are completely determined by the statistical power
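A minimal simulation sketch of statistical power (the effect size d = 0.5, the alpha level, and the sample sizes are illustrative assumptions): it estimates power for an independent-samples t test and shows how power climbs with N, as in the list of factors above.

```python
# Minimal sketch (assumed effect size and Ns): simulation-based power for an
# independent-samples t test. Power = P(reject H0 | H0 is false).
import numpy as np
from scipy import stats

def simulated_power(n_per_group: int, d: float = 0.5, alpha: float = 0.05,
                    n_sims: int = 5_000, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(d, 1.0, n_per_group)   # true difference of d SDs
        if stats.ttest_ind(treatment, control).pvalue < alpha:
            rejections += 1
    return rejections / n_sims

for n in (20, 64, 128):
    print(f"n per group = {n:3d}: estimated power ≈ {simulated_power(n):.2f}")
# For d = 0.5 and alpha = .05, roughly n = 64 per group is needed for ~80% power.
```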
4 Common Power Fallacies
1. "Experiments have power"
NO: studies do not have power; only statistical tests have power (power curves)
2. "Power is a conditional probability"
NO: power and Type 1 error rates are hypothetical probabilities, NOT conditional probabilities
3. "Sample size choice should be based on previous results"
NO: you won't have sufficient power to detect effect sizes that are smaller than the one in the previous study
4. "Sample size choice should be based on what the effect is believed to be"
NO: theories aren't specified precisely enough in terms of effect size

Week #7: Sample Size & Optional Stopping

Problems with Small Sample Sizes:
• More Type 2 errors
• Less power
• Large variation, so inaccurate estimates (poor generalizability)

Why are Studies Underpowered?
• People use heuristics to plan sample size (but intuitions about the needed N are usually wrong)
o Don't use heuristics! Be systematic
• People don't think about how to design the study
o Within-subjects designs have more power, so they need a smaller N to reach the same level of power as between-subjects designs (use within-subjects if recruitment is a problem)
▪ But not always possible

Selecting N:
• Plan N based on the width of the confidence interval you want to achieve
o Easy if you know the standard deviation and what % of means should fall within the desired range (but these are rarely known)
• Select N based on the desired probability of finding p < .05
• Some suggest basing N on a previous effect size (but studies may not be equivalent, and publication bias inflates effect sizes)
o Allows data peeking but controls the Type 1 error of multiple tests
o If possible, use an unbiased effect size estimate
o It is risky to base a power analysis on a single study

Power for Interactions
• Normally you need more participants to detect an interaction than a simple effect or main effect

Why have so many interactions been found?
• P-hacking
• Bad inferences (failing to test the interaction effect itself)
• Cross-over interactions
• Error
• Bigger Ns

Optional Stopping → collecting data until p < .05
• Problem: it increases the rate of Type 1 errors
• When the null hypothesis is true, the p distribution should be uniform (each 1% bin equally likely)
o The true effect size is 0, so you can only observe true negatives and false positives
• The false-positive rate is 5% in the long run
• When performing a single test, the Type 1 error rate is equal to the probability of finding a p-value lower than alpha when there is no effect
• With optional stopping, the Type 1 error rate at each look is approximately alpha, but the overall Type 1 error rate will be greater than alpha

Sequential Analysis
• Controls Type 1 error rates
• Divide .05 by the number of looks and use that new p-value as the cut-off
• There are ethical reasons to look at the data during collection; you just need to declare the number of looks before starting

Pocock Boundary → an easy way to control the Type 1 error rate in sequential analysis
• Use a smaller p-value threshold when there are additional looks
• Without the Pocock boundary, small p-values are less likely than slightly higher p-values
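A minimal simulation sketch of optional stopping and the "divide alpha by the number of looks" correction described above (the look schedule and group sizes are illustrative assumptions):

```python
# Minimal sketch (assumed Ns and look schedule): optional stopping inflates the
# Type 1 error rate; a corrected per-look alpha (alpha / number of looks) tames it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_experiments, looks, alpha = 5_000, (10, 20, 30, 40), 0.05

def ran_false_positive(per_look_alpha: float) -> bool:
    control = rng.normal(0, 1, max(looks))      # null is true: no real difference
    treatment = rng.normal(0, 1, max(looks))
    # "Peek" after each batch of participants and stop as soon as p < threshold.
    return any(
        stats.ttest_ind(treatment[:n], control[:n]).pvalue < per_look_alpha
        for n in looks
    )

naive = sum(ran_false_positive(alpha) for _ in range(n_experiments))
corrected = sum(ran_false_positive(alpha / len(looks)) for _ in range(n_experiments))

print(f"Type 1 error with optional stopping at alpha = .05: {naive / n_experiments:.3f}")
print(f"Type 1 error with per-look alpha = .05 / {len(looks)}: {corrected / n_experiments:.3f}")
# The naive rate lands well above .05; the corrected rate stays near or below it.
```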
Preregistration
• You can't test a hypothesis on the data used to generate it (because you can't tell whether it's a Type 1 error)
• Allows one-tailed tests
• Controls error rates
• Prevents publication bias
Steps:
1. Justify your sample size (stopping rule + specify Type 1 and Type 2 error control)
2. List all IVs and DVs for each test
3. Describe the analysis plan

Gelman and Carlin
• Propose performing a design analysis rather than a power analysis
• Don't focus on Type 1 and Type 2 errors but instead on errors in:
o Type S → the sign or direction of the effect
o Type M → the magnitude or strength of the effect

Design Analysis Can Reveal 3 Problems:
1. A study with low power is unlikely to yield statistically significant results
2. It is possible for a study to have p < .05 and for there still to be a high chance that the effect is in the opposite direction
3. Using p < .05 as a screener can lead researchers to drastically overestimate the magnitude of an effect

Overall Goals for Determining Sample Size:
• Plan for accuracy
• Plan for the desired statistical power
• Plan for feasibility
• Plan for low Type M/Type S error rates

Week #8: Effect Size, Confidence Intervals, and Meta-Analysis

The "New Statistics" (though none of them are actually new!)
***Goal: replace p-values and NHST
- Effect sizes (ES)
- Confidence intervals (CIs)
- Meta-analysis

New Statistics Strategy: The 8-Step Process
1. Formulate research question(s) in estimation terms – avoid dichotomous thinking!
2. Identify the ES that will best answer the research question
o Difference between 2 means? Then the difference is the ES
o Does a model describe the data? Then the ES is a measure of goodness of fit
3. Declare full details of the intended procedure/data analysis (preregistration)
4. Calculate point estimates & CIs for the chosen ES
5. Make figures that include the CIs
6. Interpret the ES and CIs
7. Use meta-analytic thinking to present and integrate findings
8. Report – be fully transparent

Effect Size → the amount of anything of interest
• NOT a p-value
• Can be means, differences between means, frequencies, correlations, etc.
• Sample effect size = the point estimate of a population effect size

3 Main Goals of Effect Sizes
1. Communicate the practical significance of results
2. Draw meta-analytic conclusions
3. Perform power analyses

Unstandardized Effect Size
• On the scale of the original metric
• Reports the difference between 2 groups
• Very easy to understand because it has a clear meaning
• But very few measures in psychology have meaningful metrics

Standardized Effect Size
• Converted to a common metric
• ES ranges from −1 to +1

Different Families of Effect Sizes
• The "d" family – standardized mean differences
o Cohen's d, Hedges' g
• The "r" family – strength of association
o r, R², eta-squared, omega-squared, epsilon-squared

Cohen's d:
• Ranges from 0 to infinity
• Mean difference / SD
• d tends to overestimate the true effect size, especially with smaller Ns
o Hedges' g is the unbiased version (always use it if you can!)
• Cohen's d can be converted to r

The r Family:
• The proportion of total variance that can be explained by an effect
• How much does the relationship between X and Y reduce the error?
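A minimal computational sketch of the d-family measures above (the data are simulated under assumed population values; the small-sample correction and the d-to-r conversion are common textbook formulas used here as assumptions, not necessarily the course's exact formulas):

```python
# Minimal sketch (assumed data): Cohen's d from two independent groups, a
# Hedges' g small-sample correction, and an approximate conversion of d to r.
import numpy as np

rng = np.random.default_rng(11)
treatment = rng.normal(0.5, 1.0, 25)   # assumed true difference of 0.5 SD
control = rng.normal(0.0, 1.0, 25)

n1, n2 = len(treatment), len(control)
sd_pooled = np.sqrt(((n1 - 1) * treatment.var(ddof=1) +
                     (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))

d = (treatment.mean() - control.mean()) / sd_pooled    # Cohen's d
g = d * (1 - 3 / (4 * (n1 + n2 - 2) - 1))              # Hedges' g (bias-corrected)
r = d / np.sqrt(d ** 2 + 4)                            # approx. conversion, equal ns

print(f"Cohen's d  = {d:.3f}")
print(f"Hedges' g  = {g:.3f}   (slightly smaller: d overestimates with small N)")
print(f"d -> r     ≈ {r:.3f}")
```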
Cohen's "T-Shirt" Effect Size Guidelines (Cautions!) – conventionally d ≈ .2 small, .5 medium, .8 large
• They were made up by Cohen with no empirical basis
• It is better to relate your study's effect size to a known real-world effect to help people understand it

Criticisms of the Effect Size Approach
• It is often insightful to show that a minimal manipulation of the independent variable still accounts for some variance in the dependent variable
o Tajfel's minimal group paradigm: even when individuals were randomly/meaninglessly assigned to a group, they still favored it and its fellow members (demonstrating fundamental in-group favoritism)
• It can also be meaningful to show that an IV can affect a hard-to-change DV
o Such as judgments of a defendant (with attractiveness as the IV)
• Effect sizes are often inflated or biased

Effect Sizes CANNOT Be Estimated Precisely in the Lab
• The N in psychology studies is usually too small to accurately distinguish among small, medium, and large effect sizes
• Without much larger Ns than are typically affordable, confidence intervals will span the range from small to large, meaning there will be overlap/inclusion of multiple categories

Confidence Intervals
• A statement about what % of confidence intervals contain the true parameter value in the long run
• A confidence interval can be calculated around any estimate
• Bigger N = less standard error = narrower confidence interval
• Used to indicate the precision of effect size estimates (New Statistics strategy)
• Long run: 95% of 95% confidence intervals will contain the true population value
• Smaller N = more variation in CI width
o The variation can be so large that the CI is not meaningful

Confidence Intervals & p-Values
• Directly related
• For INDEPENDENT GROUPS, when the overlap between 2 confidence intervals is approximately half of one arm of a confidence interval, the group difference will be p < .05

Common Misinterpretations of Confidence Intervals
1. "If 2 CIs overlap, the difference between the 2 estimates is non-significant"
NO: they can overlap and still yield p < .05
It IS true that if 2 CIs fail to overlap, then p < .05, and that if one CI contains the point estimate of the other, then p > .05
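Finally, a minimal simulation sketch of the long-run interpretation of confidence intervals above (the population values, N, and number of replications are assumptions for illustration): it draws many samples and checks how often the 95% CI for the mean captures the true population mean.

```python
# Minimal sketch (assumed population and N): long-run coverage of 95% CIs.
# "95% of 95% confidence intervals will contain the true population value."
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_mean, sd, n, n_samples = 50.0, 10.0, 30, 10_000
covered = 0
t_crit = stats.t.ppf(0.975, df=n - 1)

for _ in range(n_samples):
    sample = rng.normal(true_mean, sd, n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lower, upper = sample.mean() - t_crit * se, sample.mean() + t_crit * se
    if lower <= true_mean <= upper:
        covered += 1

print(f"Coverage over {n_samples} samples: {covered / n_samples:.3f}")   # ≈ 0.95
# Note the interpretation: the procedure has 95% coverage in the long run;
# it is not a 95% probability that any single interval contains the mean.
```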