Statistics 151 - Notes - Chapters 1-15.docx

University of Alberta
Helen Frost

Clinton Richardson Home

Statistics 151 – Introduction to Applied Statistics

Course Outline
Chapter 1: Stats Starts Here
Chapter 2: Data
Chapter 3: Displaying and Describing Categorical Data
Chapter 4: Displaying and Summarizing Quantitative Data
Chapter 5: Understanding and Comparing Distributions
Chapter 6: The Standard Deviation as a Ruler and the Normal Model
Chapter 7: Scatterplots, Association, and Correlation
Chapter 8: Linear Regression
Chapter 9: Regression Wisdom
Chapter 10: Re-expressing Data: Get It Straight!
Chapter 11: Understanding Randomness
Chapter 12: Sample Surveys
Chapter 13: Experiments and Observational Studies
Chapter 11-13: Gathering Data (Assigned Reading)
Chapter 14: From Randomness to Probability
Chapter 15: Probability Rules!
Chapter 16: Random Variables
Chapter 17: Probability Models
Chapter 18: Sampling Distribution Models
Chapter 19: Confidence Intervals for Proportions
Chapter 20: Testing Hypotheses about Proportions
Chapter 21: More about Tests
Chapter 22: Comparing Two Proportions
Chapter 23: Inferences about Means
Chapter 24: Comparing Means
Chapter 25: Paired Samples and Blocks
Chapter 26: Comparing Counts
Chapter 27: Inferences for Regression
Chapter 28: Analysis of Variance

Chapter 11: Understanding Randomness

Textbook Notes

Terms
1. Random – an outcome is random if we know the possible values it can have, but not which particular value it takes
2. Generating random numbers – random numbers are hard to generate. Nevertheless, several web sites offer an unlimited supply of equally likely random values
3. Simulation – a simulation models a real-world situation by using random-digit outcomes to mimic the uncertainty of a response variable of interest
4. Trial – the sequence of several components representing events that we are pretending will take place
5. Simulation component – a component uses equally likely random digits to model simple random occurrences whose outcomes may not be equally likely
6. Response variable – values of the response variable record the results of each trial with respect to what we were interested in

It’s Not Easy Being Random
• The best ways we know to generate data that give a fair and accurate picture of the world rely on randomness, and how we draw conclusions from those data depends on randomness too

A Simulation
• We call each time we obtain a simulated answer to our question a trial
• In fact, the reason we’re doing the simulation is that it’s hard to guess how many boxes we’d expect to open

Building a Simulation
• Opening one box is the basic building block, called a component of our simulation
1. Identify the component to be repeated
2. Explain how you will model the component’s outcome
3. State clearly what the response variable is
4. Explain how you will combine the components into a trial to model the response variable
5. Run several trials
6. Collect and summarize the results of all the trials
7. State your conclusion

What Have We Learned?
• We’ve learned to harness the power of randomness. We’ve learned that a simulation model can help us investigate a question for which many outcomes are possible, for which we can’t (or don’t want to) collect data, and for which a mathematical answer is hard to calculate. We’ve learned how to base our simulation on random values generated by a computer, generated by a randomizing device such as a die or spinner, or found on the internet. Like all models, simulations can provide us with useful insights about the real world.

Chapter 12: Sample Surveys

Textbook Notes

Terms:
1. Population – the entire group of individuals or instances about whom we hope to learn
2. Sample – a (representative) subset of a population, examined in the hope of learning about the population
3. Sample survey – a study that asks questions of a sample drawn from some population in the hope of learning something about the entire population
4. Bias – any systematic failure of a sampling method to represent its population. It is almost impossible to recover from bias, so efforts to avoid it are well spent
5. Randomization – the best defence against bias is randomization, in which each individual is given a fair, random chance of selection
6. Sample size – the number of individuals in a sample. The sample size, not the fraction of the population sampled, determines how well the sample represents the population
7. Census – a sample that consists of the entire population
8. Population parameter – a numerically valued attribute of a model for a population
9. Statistic, sample statistic – statistics are values calculated from sampled data. Those that correspond to, and thus estimate, a population parameter are of particular interest
10. Representative – a sample is said to be representative if the statistics computed from it accurately reflect the corresponding population parameters
11. Simple random sample (SRS) – a simple random sample of size n is a sample in which each set of n elements in the population has an equal chance of selection
12. Sampling frame – a list of individuals from whom the sample is drawn. Individuals who may be in the population of interest, but who are not in the sampling frame, cannot be included in any sample
13. Sampling variability – the natural tendency of randomly drawn samples to differ, one from another
14. Stratified random sample – a sampling design in which the population is divided into several subpopulations, or strata, and random samples are then drawn from each stratum
15. Cluster sample – a sampling design in which entire groups, or clusters, are chosen at random
16. Multistage sample – a sampling scheme that involves multiple stages of random sampling, where at each successive stage we sample from lists of ever smaller clusters (hierarchical in nature)
17. Systematic sample – a sample drawn by selecting individuals systematically from a sampling frame
18. Pilot – a small trial run of a survey to check whether questions are clear. A pilot study can reduce errors due to ambiguous questions
19. Voluntary response bias – bias introduced to a sample when individuals can choose on their own whether to participate in the sample. Samples based on voluntary response are always invalid and cannot be recovered, no matter how large the sample size
20. Convenience sample – a convenience sample consists of individuals who are conveniently available. Convenience samples often fail to be representative because not every individual in the population is equally convenient to sample
21. Undercoverage – a sampling scheme suffers from undercoverage when it biases the sample in a way that gives a part of the population less representation than it has in the population
22. Nonresponse bias – bias introduced when a large fraction of those sampled fails to respond. Those who do respond are likely not to represent the entire population
23. Response bias – anything in a survey design that influences responses falls under the heading of response bias. One typical response bias arises from the wording of questions, which may suggest a favoured response

Idea 1: Examine a Part of the Whole
• We’d like to know about an entire population of individuals, but examining all of them is usually impractical, if not impossible. So we settle for examining a smaller group of individuals – a sample – selected from the population
• Polls are examples of sample surveys, designed to ask questions of a small group of people in the hope of learning something about the entire population

Bias
• Sampling methods that, by their nature, tend to over- or underemphasize some characteristics of the population are said to be biased
• Solution: select the individuals to sample at random.
The importance of deliberately using randomness is one of the great insights of statistics

Idea 2: Randomize
• The best way to choose a sample is to select at random. Randomizing protects us from the influences of all the features of our population by making sure that, on average, the sample looks like the rest of the population

Idea 3: It’s the Sample Size
• The fraction of the population that you’ve sampled doesn’t matter. It’s the sample size itself that’s important

Does a Census Make Sense?
• Wouldn’t it be better to just include everyone and “sample” the entire population? Such a special sample is called a census
• It can be difficult to complete a census. A census takes longer to complete and can be more complex than sampling

Populations and Parameters
• Models use mathematics to represent reality
• A parameter used in a model for a population is sometimes called a population parameter
• We use summaries of the data to estimate the population parameters
• Any summary found from the data is a statistic. Sometimes you’ll see the term sample statistic
• When the statistics we compute from the sample accurately reflect the corresponding parameters, the sample is said to be representative

Simple Random Samples
• Not only does each individual have an equal chance of being selected; each combination of people has an equal chance as well. A sample drawn in this way is called a simple random sample (SRS)
• The sampling frame is a list of individuals from which the sample is drawn
• Samples drawn at random generally differ one from another. Each draw of random numbers selects different people for our sample. These differences lead to different values for the variables we measure. We call these sample-to-sample differences sampling variability.
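The idea of a simple random sample and of sampling variability can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling frame, the heights, the seed, and the helper name `draw_srs` are all made up for the example and do not come from the course.

```python
import random
import statistics


def draw_srs(frame, n, rng):
    """Draw a simple random sample of size n, without replacement, from the frame."""
    # Random.sample gives every subset of size n the same chance of selection.
    return rng.sample(frame, n)


# Hypothetical sampling frame: 1,000 individuals, each with a made-up height in cm.
rng = random.Random(151)  # fixed seed so the sketch is repeatable
frame = [rng.gauss(165, 7) for _ in range(1000)]

# The population parameter (usually unknown in practice).
population_mean = statistics.mean(frame)

# Five independent samples of size 50 give five different sample statistics.
sample_means = [statistics.mean(draw_srs(frame, 50, rng)) for _ in range(5)]

print(f"population mean: {population_mean:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean: {m:.2f}")
```

Each run of `draw_srs` selects different individuals, so the five sample means differ from one another and from the population mean; that spread is the sampling variability described above.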
Stratified Sampling
• All statistical sampling designs have in common the idea that chance, rather than human choice, is used to select the sample
• Sometimes the population is first sliced into groups, called strata, before the sample is selected
• We usually try to create these strata so that they are relatively homogeneous
• Random sampling is used within each stratum, after which the results from the various strata are combined to produce population estimates. This common sampling technique is called stratified random sampling

Cluster and Multistage Sampling
• Splitting the population into clusters can make sampling more practical. We could then simply select a number of clusters at random and perform a census within each of them. This sampling design is called simple cluster sampling
• With proper random selection of clusters (and within clusters, if there are multiple stages of sampling), cluster sampling will give us an unbiased sample
• Clusters are randomly selected; strata are not
• What we care about is maximizing the precision obtained for some total expenditure, not for some fixed sample size, and the same expenditure can yield a much bigger cluster sample than a simple random sample
• When we subsample and study just a portion of each selected cluster, we are introducing a second stage of sampling

Systematic Samples
• You might survey every tenth person on an alphabetical list of students. To make it random, you must start the systematic selection from a randomly selected individual.
When the order of the list is not associated in any way with the responses sought, systematic sampling is similar to drawing an SRS
• It will give more accurate (less variable) results than an SRS when the list order is associated to some extent with the response variable, as long as the systematic rule doesn’t coincide with some “cycle” in the sampling frame

Defining the “Who”: You Can’t Always Get What You Want
• Even if the population seems well defined, it may not be a practical group from which to draw a sample
• Specify a sampling frame
• The target sample consists of the individuals for whom you intend to measure responses
• The actual respondents are the individuals about whom you do get data and can draw conclusions

The Valid Survey
• A valid survey yields the information we are seeking about the population we are interested in.
   o Know what you want to know – understand what you hope to learn
   o Use the right frame – appropriate respondents
   o Tune your instrument – survey, questionnaire
   o Ask specific rather than general questions
   o Ask for quantitative results when possible
   o Be careful in phrasing questions
   o Be careful in phrasing answers
• A pilot is a trial run of the survey you eventually plan to give to a larger group, using a draft of your survey questions administered to a small sample drawn from the same sampling frame you intend to use

Sample Badly with Volunteers
• In a voluntary response sample, a large group of individuals is invited to respond, and all who do respond are counted
• The sample is not representative, even though every individual in the population may have been offered the chance to respond.
The resulting voluntary response bias invalidates the survey

Sample Badly, but Conveniently
• In convenience sampling we simply include the individuals who are convenient for us to sample

Undercoverage
• Many survey designs suffer from undercoverage, in which some portion of the population is not sampled at all or has a smaller representation in the sample than it has in the population

What Else Can Go Wrong?
• A common and serious potential source of bias for most surveys is nonresponse bias. No survey succeeds in getting responses from everyone. The problem is that those who don’t respond may differ from those who do, and they may differ on just the variables we care about
• Response bias refers to anything in the survey design that influences the responses

How to Think About Biases
• There’s no way to recover from a biased sample or from a survey that asks biased questions

Chapter 13: Experiments and Observational Studies

Observational Studies
• In observational studies, researchers don’t assign choices; they simply observe them
• A study is retrospective when researchers first identify the subjects (for example, students who studied music) and then collect data on their past (for example, their past grades)
• Observational studies are valuable for discovering trends and possible relationships
• Because retrospective records are based on historical data, they can have errors
• Identifying subjects in advance and collecting data as events unfold makes a study prospective

Randomized, Comparative Experiments
• An experiment requires a random assignment of subjects to treatments
• Experiments study the relationship between two or more variables
• An experimenter must identify at least one explanatory variable, called a factor, to manipulate, and at least one response variable to measure.
What distinguishes an experiment from other types of investigation is that the experimenter actively and deliberately manipulates the factors to control the details of the possible treatments, and assigns the subjects to those treatments at random
• The individuals on whom or which we experiment are referred to as experimental units; however, humans who are experimented on are commonly called subjects or participants
• The specific values that the experimenter chooses for a factor are called the levels of the factor
• The combination of specific levels from all the factors that an experimental unit receives is known as its treatment

The Four Principles of Experimental Design
1. Control
• We control sources of variation other than the factors we are testing by making conditions as similar as possible for all treatment groups
• We control other sources of variation to prevent them from changing and affecting the response variable
2. Randomize
• Randomization allows us to equalize the effects of unknown or uncontrollable sources of variation
• If experimental units were not assigned to treatments at random, we would not be able to use statistics to draw conclusions from an experiment
• Reduces bias
3. Replicate
• The outcome of an experiment on a single subject is an anecdote, not data
• Replication of an entire experiment with the controlled sources of variation at different levels is an essential step in science
4. Blocking
• Sometimes, attributes of the experimental units that we are not studying and that we can’t control may nevertheless affect the outcomes of an experiment. If we group similar individuals together and then randomize within each of these blocks, we can remove much of the variability due to the difference among the blocks.
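Randomizing within blocks can be sketched as follows. This is a minimal illustration, not a method prescribed by the course: the subjects, the blocking attribute (age group), the treatment names, and the helper `assign_within_blocks` are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def assign_within_blocks(block_of, treatments, rng):
    """Randomly assign treatments to units separately inside each block.

    block_of maps each experimental unit to its block label.
    Returns a dict mapping each unit to a treatment.
    """
    blocks = defaultdict(list)
    for unit, label in block_of.items():
        blocks[label].append(unit)

    assignment = {}
    for label in sorted(blocks):
        members = blocks[label][:]
        rng.shuffle(members)  # randomize: chance decides the order
        # Deal treatments out in rotation so group sizes stay balanced (replication).
        for position, unit in enumerate(members):
            assignment[unit] = treatments[position % len(treatments)]
    return assignment


# Hypothetical data: 12 subjects, blocked by a made-up attribute (age group).
block_of = {f"subject_{i:02d}": ("younger" if i < 6 else "older") for i in range(12)}
treatments = ["control", "new treatment"]

assignment = assign_within_blocks(block_of, treatments, random.Random(13))
counts = Counter((block_of[u], t) for u, t in assignment.items())
for key in sorted(counts):
    print(key, counts[key])
```

Because the shuffle happens separately inside each block, every block contributes the same number of units to every treatment, so differences between blocks cannot be mistaken for differences between treatments.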
Diagrams
• A diagram of an experiment emphasizes the random allocation of subjects to treatment groups, the separate treatments applied to these groups, and the ultimate comparison of results

Does the Difference Make a Difference?
• The real question is whether the differences we observed are about as big as we’d get just from the randomization alone, or whether they’re bigger than that. If we decide that they’re bigger, we’ll attribute the differences to the treatments. In that case we say the differences are statistically significant
• A difference is statistically significant if we don’t believe that it’s likely to have occurred only by chance

Experiments and Samples
• Both experiments and sample surveys use randomization to get unbiased data
• Sample surveys try to estimate population parameters
• Experiments try to assess the effects of treatments
• Unless the experimental units are chosen from the population at random, you should be cautious about generalizing experiment results to larger populations until the experiment has been repeated under different circumstances

Control Treatments
• Control group – the experimental units assigned to a baseline treatment level, typically either the default treatment, which is well understood, or a null, placebo treatment. Their responses provide a basis for comparison

Blinding
• Blinding helps eliminate bias that could arise from loyalties participants might have toward a treatment
• Two classes of individuals can affect the outcome of an experiment: those who could influence the results and those who evaluate the results. When all the individuals in either one of these classes are blinded, the experiment is said to be single-blind.
• When everyone in both of the above classes is blinded, we call the experiment double-blind

Placebos
• A fake treatment that looks just like the treatments being tested is called a sham treatment or placebo
• The placebo effect highlights both the importance of effective blinding and the importance of comparing treatments with a control
• The best experiments are usually: randomized, double-blind, comparative, placebo-controlled

Blocking
• When groups of experimental units are similar, it is often a good idea to gather them together into blocks. By blocking, we isolate the variability attributable to the differences between the blocks, so that we can see the differences caused by the treatments more clearly
• Because the randomization occurs only within the blocks, we call this a randomized block design
• In a retrospective or prospective study, subjects are sometimes paired because they are similar in ways not under study. Matching subjects in this way can reduce variation in much the same way as blocking
• We use blocks to reduce variability so we can see the effects of the factors; we’re not usually interested in studying the effects of the blocks themselves

Adding More Factors
• An experiment is a completely randomized two-factor experiment when it has two factors and any experimental unit (in the textbook example, any plant) could end up assigned at random to any of the treatments (six in that example)
• Experiments with more than one factor are both more efficient and provide more information than one-at-a-time experiments

Confounding
• When the levels of one factor are associated with the levels of another factor, we say that these two factors are confounded
• A confounding variable, then, is a variable associated with the experimental factor that may also have some effect on the response. Because of the confounding, we can’t tell whether any effect we see was caused by our factor or by the confounding variable – or even by both working together.
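A completely randomized two-factor design can be sketched like this. The notes do not name the two factors or their levels, so the factor names, level names, plant labels, and the helper `completely_randomized_design` below are hypothetical, chosen only so that 3 x 2 levels give six treatments.

```python
import itertools
import random
from collections import Counter


def completely_randomized_design(units, factors, rng):
    """Assign each unit at random to one combination of factor levels."""
    # A treatment is one level from every factor, so the set of treatments
    # is the Cartesian product of the level lists.
    treatments = list(itertools.product(*factors.values()))
    shuffled = list(units)
    rng.shuffle(shuffled)  # any unit can land in any treatment
    return {unit: treatments[i % len(treatments)] for i, unit in enumerate(shuffled)}


# Hypothetical factors and levels: 3 x 2 = 6 treatments.
factors = {
    "fertilizer": ["none", "half dose", "full dose"],
    "watering": ["daily", "weekly"],
}
plants = [f"plant_{i:02d}" for i in range(24)]

design = completely_randomized_design(plants, factors, random.Random(24))
group_sizes = Counter(design.values())
for treatment in sorted(group_sizes):
    print(treatment, group_sizes[treatment])
```

Crossing the levels of both factors means every level of one factor appears with every level of the other, which is what keeps the two factors from being confounded with each other in this sketch.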
Chapter 11-13: Gathering Data

Lecture Notes

Does a Census Make Sense?
• A census is a special sample that includes everyone in the entire population
• Problems: cost, undercount, time, complexity

Populations and Parameters
• Population parameter – a parameter that is part of a model for a population (often unknown)
• Sample statistic – a summary that is found from data in a sample (known once the data are observed)

Two Types of Conclusions (Inferences)
• Population inference – results from the sample are generalized to the population (estimation)
• Causal (cause-and-effect) inference – when comparing the results from two treatment groups, the difference in the responses is caused by the difference in treatments

Population Inference
• We should only make population inferences when we have random sampling
• Randomizing protects us from the influences of all the features of our population
   o Makes sure that, on average, the sample looks like the rest of the population
• Non-random sampling leads to biased results
   o There is no way to fix a biased sample
• We should not generalize the results from a non-random sample

Random Sampling Methods
1. Simple Random Samples (SRS) – each sample of size n in the population has the same chance of being selected
• Samples drawn at random generally differ from one another
• Sampling variability – sample-to-sample differences
2. Stratified Random Sampling
• The population is divided into homogeneous groups called strata, then an SRS is taken within each stratum, and the results are combined
• Reduces bias and variability of results
3. Systematic Random Samples
• Start the systematic selection from a randomly selected individual, then sample every k-th person
• Systematic sampling is much less expensive than true random sampling
4. Cluster Random Sampling
• Split the population into clusters, select one or a few clusters at random, and perform a census within each of them
• Should be unbiased
• Not the same as stratified sampling

Sources of Bias: the tendency for a sample to differ from the corresponding population in some systematic way
1. Selection Bias (Undercoverage)
• When some portion of the population is not sampled at all or has a smaller representation in the sample than it has in the population
2. Response Bias
• Refers to anything in the survey design that influences the responses
• Respondents may lie, especially if asked about illegal or unpopular behaviour
3. Nonresponse Bias
• A common and serious potential source of bias for most surveys
• Occurs when responses are not obtained from individuals selected for the study

Causal (Cause-and-Effect) Inference
• We should only make cause-and-effect inferences when we have random allocation
• Lurking variables are variables that are related to both group membership and the response

Study Designs
1. Observational Study
• The investigator observes individuals and measures variables of interest but does not attempt to influence the responses
• Retrospective study – based on historical data; useful when one outcome is rare
• Prospective study – observes the explanatory and response variables as events unfold; more costly, but gives better data
• Valuable for discovering trends and possible relationships, but cannot demonstrate a causal relationship
2. Randomized, Comparative Experiments
• An experiment is a study design that allows us to prove a cause-and-effect relationship
• An experiment:
   o Manipulates factor levels to create treatments
   o Randomly assigns subjects to these treatment levels
   o Compares the responses of the subject groups across treatment levels

Summary
1. Random Selection of Individuals (Random Sampling):
• The individuals in the sample are selected randomly from the population
• Population inferences are allowed
2. Random Allocation:
• The individuals are randomly assigned to different treatment groups
• Causal inferences are allowed
• We cannot make causal inferences from observational studies

                           Random Selection?
                           Yes                     No
Random Assignment?  Yes    Both                    Causal Inference
Random Assignment?  No     Population Inference    None

Chapter 1: Stats Starts Here

Lecture Notes

What is (are) Statistics?
• Statistics (the discipline) is a way of reasoning, together with a collection of tools and methods, designed to help us understand the world with the use of data
• Statistics (the plural) are particular calculations made from data

Data
• Values with a context
• Numbers, record names, or other labels

Why We Need to Collect Data
• To find answers to questions that cannot be answered otherwise
• Collecting data results in different outcomes (variability)

Statistical Methods
• Applied in personal life, politics and economics, biology, medicine, psychology, social science, engineering, dentistry, business

The Process of Making a Data-Driven Decision:
1. State the question
2. Decide how to collect the data and which statistical methods have to be applied to answer the question or to make a decision
3. Collect the data – draw a sample from the population of interest
4. Describe and summarize the data. Find the statistics
5. Use appropriate statistical tools to find an answer to the question. Use formal data analysis and make statistical inferences. Generalize the findings from the sample to the entire population.
NOTE: in order to understand and apply inferential procedures, we need to understand Probability Theory

Goal of this Class
• Learn appropriate methods for data collection
• Learn how to formulate, carry out, and interpret simple statistical analyses
• Learn how to follow basic statistical arguments
• Develop a critical attitude toward quantitative claims
• Develop statistical reasoning
• Summary: become capable of making decisions based on data

Definition:
• Population of Interest – the entire collection of individuals
• Sample – a subset of the population, selected for study in some prescribed manner
• Parameter – a number that describes a characteristic of the population (often unknown) (the “p” in parameter goes with population)
• Statistic – a number that describes a characteristic of the sample (known once the data are observed) (the “s” in statistic goes with sample)
• Greek letters denote parameters and Latin letters denote statistics

Example: I want to estimate the proportion of male students in this Stat 151 lecture by randomly selecting 10 students in this class.
Population of Interest: all the students in this Stat 151 lecture
Sample: the 10 randomly selected students
Parameter: the proportion of male students in this class
Statistic: the proportion of male students in the sample

Example: An investigator wants to estimate the average height of Canadian females by measuring the height of 1000 randomly picked Canadian females.
Population of Interest: all Canadian females
Sample: the 1000 randomly selected Canadian females
Parameter: the average height of Canadian females
Statistic: the average height of the Canadian females in the sample

Textbook Notes

What is (Are?) Statistics?
• Whenever there are data and a need for understanding the world, we need statistics

Statistics in a Word
• Statistics is about variation

Chapter 2: Data

Terms:
1. Data – systematically recorded information, whether numbers or labels, together with its context
2. Context – the context ideally tells who was measured, what was measured, how the data were collected, where the data were collected, and when and why the study was performed
3. Data Table – an arrangement of data in which each row represents a case and each column represents a variable
4. Case – a case is an individual about whom or which we have data
5. Variable – a variable holds information about the same characteristic for many cases
6. Unit – a quantity or amount adopted as a standard of measurement, such as dollars, hours, or grams
7. Categorical Variable – a variable that names categories (whether with words or numerals) is called categorical
8. Quantitative Variable – a variable in which the numbers act as numerical values is called quantitative. Quantitative variables always have units
9. Identifier Variable – a variable holding a unique name, ID number, or other identification for a case. Identifiers are particularly useful in matching data from two different databases or relations

Lecture Notes

Data – can be numbers, record names, or other labels.
• Not all data are represented by numbers
• Data must come with: who, what, when, where, and why

Who: who are you interested in (the target population)?
• Subjects or participants – people on whom we experiment
   o Population – entire set of subjects
   o Sample – set of subjects you observe
• Respondents – individuals who answer a survey
• Experimental units – animals, plants, and inanimate subjects

What: what characteristic (or variable) are you measuring?
• Variable – characteristic recorded about each individual
   o Can take different values for different individuals
   o Tell how each value has been measured and tell the scale of the measurement

Variables
• Categorical
   o Nominal
   o Ordinal
• Quantitative
   o Discrete
   o Continuous

Definition:
1. A categorical variable places a subject into one of several groups or categories (or levels)
   o Usually we determine the counts of cases that fall into each category
   o Two types:
      1. Nominal: the levels have no order
      2. Ordinal: the levels have some order
   o Example:
      a) Gender (M or F) – nominal
      b) Hair colour (blond, white, black, red, etc.) – nominal
      c) Nationality (Canadian, American, German, French, Chinese, etc.) – nominal
      d) Letter grade (A+, A, A-, etc.) – ordinal
      e) Car manufacturer (Dodge, Honda, etc.) – nominal
      f) Opinion (strongly agree, agree, neutral, etc.) – ordinal
      g) Education level (high school diploma, undergraduate degree, graduate degree) – ordinal
2. A quantitative variable measures a numerical quantity or amount in each subject
   o Two types:
      1. Discrete: can only take on distinct values
      2. Continuous: can take on any value in a given interval
   o Example:
      a) Age (years) – discrete
      b) Race (Asian, Black, White, or other) – not quantitative; this is a categorical (nominal) variable
      c) Smoker (yes or no) – not quantitative; this is a categorical (nominal) variable
      d) Systolic blood pressure (millimetres of mercury) – continuous
      e) Level of calcium in blood (micrograms per millilitre) – continuous

Why: why are you doing this? What is the reason for collecting data?
• The questions of interest shape what we think about and how we treat the variable
• We need the who, what, and why to analyze data

When and Where: give us some useful information about the context
• Example: values recorded at the U of A in the 1960s mean something different than similar values recorded last year. (e.g., Average pr