Statistical Sciences 1024A/B Final: Vocabulary/Definitions

Sohail Khan

Module One: Sampling
Population: the whole group of individuals
Sample: the part of the population we collect information from
Unit: an individual member of the population
Parameter: a characteristic of the population we want to learn about
Statistic: the sample version of a parameter
Variable: a characteristic of an individual
Sampling Design: describes exactly how to choose a sample from the population
Census: collects information from the entire population (no sample)
Bias: systematic error in the way the sample represents the population
Undercoverage: occurs when some groups in the population are left out
Nonresponse: occurs when individuals can’t be contacted or won’t participate
Response Bias: the behaviour of the respondent or interviewer alters the answers
Voluntary Response Sample: people choose themselves by responding to a broad appeal; biased
Convenience Sample: takes the members of the population that are easiest to reach; unrepresentative
Random Sampling: uses chance to select a sample
Simple Random Sample (SRS): every possible sample of the same size has the same chance of being selected, so every individual is equally likely to be chosen
Stratified Random Sample: group the population into strata, then take an SRS within each stratum
Cluster Sample: group the population into clusters, then use one or more entire clusters (but not the others) as the sample
Multistage Sample: Ex. a random sample of telephone exchanges stratified by region, then an SRS of telephone numbers, then a random adult from each

Module Two: Study Design
Observational Study: variables are observed, measured, and recorded; the researcher does nothing to influence the responses
Confounding: occurs when the effects of variables can’t be distinguished from each other
Response Variable: a particular quantity that we ask a question about
Explanatory Variable/Factor: a variable that may influence the response variable
Experiment: individuals are deliberately influenced (given a treatment) and then observed
Levels:
Treatment: a specific experimental condition applied to individuals
Randomization: use chance to assign treatments
Replication:
Control: control the effects of other variables on the response
Completely Randomized Design: all subjects are allocated at random among all the treatments; can compare any number of treatments
Randomized Block Design: the random assignment of individuals to treatments is carried out separately within each block
Block: a group of individuals known to be similar before the experiment
Placebo: a dummy treatment
Blinding: subjects don’t know which treatment they are receiving
Double Blind: neither the subjects nor the people who interact with them know which treatment is being received
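
As a rough illustration of the two randomized designs defined above, here is a minimal Python sketch. The subject labels, the block names ("women", "men"), the treatment names, and both function names are made up for the example; they are not from the course notes.

```python
import random

def completely_randomized(subjects, treatments, seed=0):
    """Shuffle all subjects by chance, then deal them out to the treatments in turn."""
    rng = random.Random(seed)          # fixed seed only so the example is repeatable
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

def randomized_block(blocks, treatments, seed=0):
    """Carry out a separate random assignment inside each block."""
    rng = random.Random(seed)
    return {
        name: completely_randomized(members, treatments, seed=rng.randrange(10**9))
        for name, members in blocks.items()
    }

# Hypothetical data: 12 subjects, 2 treatments, 2 blocks of similar individuals.
subjects = [f"S{i}" for i in range(1, 13)]
treatments = ["drug", "placebo"]
blocks = {"women": subjects[:6], "men": subjects[6:]}

crd = completely_randomized(subjects, treatments)
rbd = randomized_block(blocks, treatments)
print(crd)
print(rbd)
```

In the block design every block contributes equally to each treatment group, which is the point of blocking: the comparison between treatments is made among similar individuals.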
Statistical Significance: an observed effect so large that it would rarely occur by chance alone
Matched Pairs Design: closely matched pairs of individuals are given different treatments to compare

Module Three: Descriptive Stats: One Variable (Qualitative)
Categorical Variable: places an individual into one of several categories; its values are not meaningful numbers
Pie chart: shows the distribution of a categorical variable as a “pie”; use when you want to emphasize each category’s relation to the whole
Bar graph: represents each category as a bar; shows category counts or percents; more flexible than pie charts
Quantitative variables and data: deal with units of measurement; numerical values for which averaging makes sense
Continuous: can take any value in an interval
Discrete: takes separate, countable values
Histogram: the most common graph of the distribution of one quantitative variable; unlike a bar graph, the horizontal axis is a numerical scale divided into classes
Stemplot: for small data sets
Boxplot: a graph of the five-number summary, drawn as a box with whiskers
Frequency: the number of observations in a category or class
Relative frequency:
Measures of centre:
Mean: average of n values
Median: value that divides the ordered dataset in half
Measures of position:
Quartiles: Q1 (25%) and Q3 (75%); the medians of the lower and upper halves of the data
Measures of spread:
Range: maximum - minimum
Interquartile range: Q3 - Q1
Variance: typical squared deviation between the observations and their mean
Standard deviation: square root of the variance
Symmetry: the left and right sides of the histogram approximately mirror each other
Skewness: one side of the distribution extends farther than the other
Outliers: individuals that fall outside the overall pattern
Five number summary: Min, Q1, Med, Q3, Max

Module Four: Descriptive Stats: Two Variables
Side-by-side boxplots: provide a visual comparison of the distribution of a quantitative variable across the categories of a qualitative one
Two-way table: summarizes two qualitative variables; each cell gives the count of individuals or units that have a particular combination of characteristics
Marginal distribution: the distribution of one qualitative variable alone, read from the row or column of totals (the margin); provides no information about an association between the variables
Conditional distribution: the distribution of one variable for a specific value of the other
Simpson’s paradox: an association holds consistently within every group, but when the data are combined the direction of the association is reversed because of a lurking variable
Scatterplot: a visual of the association between two quantitative variables that have been measured on the same units or individuals
Time plot: plots each observation of a variable against the time at which it was measured
Trend: a consistent upward (positive) or downward (negative) tendency in the data over time
Cycle: regular up-and-down fluctuations in the data over time
Change points: striking deviations from the overall pattern; suggest that something happened around the time of the change that affected the variable

Module Five: Linear Relationships
Correlation: measures the direction and strength of the linear association between two quantitative variables
Causation: one variable causes a change in another; an association does NOT always mean causation
Explanatory (x) variable: may explain or influence the outcome of interest
Response (y) variable: the outcome of interest
Least-squares regression: fits the line that minimizes the sum of the squared vertical distances between the data points and the line
Ecological correlation: a correlation between average values of x and y; tends to be higher than the correlation based on individual x and y values
Influential point or observation: often an outlier; if the point were removed, the results would be dramatically different
Extrapolation: using a regression line to predict outside the range of the observed x values; avoid it (if you know BAC vs. # of beers for 1-9 beers, don’t guess about 12 or 20 because the relationship could be different)
Lurking variable: another variable, not among those studied, that may influence the relationship between them
Predicted (or fitted) value: the value of y given by the regression line at a particular x
Coefficient of determination: the proportion of the total variation in y that is explained by the least-squares regression line; equal to the square of the correlation, \(r^2\) = explained variation / (explained variation + unexplained variation)
Residual: observed value - predicted value (the vertical distance between the actual point and the line)
Residual plot: plots the residuals (vertical axis) against the same x values (horizontal axis)

Module Six: Quantifying Uncertainty
Randomness: a property of a phenomenon whose outcomes cannot be predicted in the short run but that behaves in a certain way in the long run
Experiment: a process in which a single outcome occurs but for which there is more than one possible outcome. Thus, we are uncertain which outcome will occur and cannot predict it in advance.
Sample space: The sample space, denoted by S, of an experiment is the set of all possible outcomes of the experiment. For example, when we toss three coins: S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}.
Simple event and event: An event A is a subset of the sample space S; it contains the outcomes of particular interest. In a roll of a die ("die" is the singular form of "dice"), we might be interested in whether the die shows an even number. This event has 3 outcomes. Each outcome is called a simple event.
Venn diagram: used to represent sample spaces and events
Rules of Probability:
Conditional probability:
Disjoint (i.e., mutually exclusive) events: events that cannot occur together. In other words, the occurrence of one rules out the occurrence of the other.
Independent events: Independence implies that the occurrence of one event does not affect the probability of occurrence (or non-occurrence) of the other event.
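
The three-coin sample space above is small enough to enumerate, which makes the ideas of event, disjoint events, and independent events concrete. The sketch below is only an illustration: the events A, B, C, D are chosen for the example, the coins are assumed fair (all 8 outcomes equally likely), and the conditional probability uses the standard definition P(B | A) = P(A and B) / P(A), which the notes do not spell out.

```python
from fractions import Fraction
from itertools import product

# Sample space for three coin tosses: HHH, HHT, ..., TTT (assumed equally likely).
S = ["".join(t) for t in product("HT", repeat=3)]

def prob(event):
    """Probability of an event (a set of outcomes) under equally likely outcomes."""
    return Fraction(len(event), len(S))

A = {s for s in S if s[0] == "H"}          # first toss is heads
B = {s for s in S if s.count("H") == 2}    # exactly two heads
C = {s for s in S if s[1] == "H"}          # second toss is heads
D = {"TTT"}                                # no heads at all

# Conditional probability of B given A (standard definition).
p_B_given_A = prob(A & B) / prob(A)

# Independent: P(A and C) equals P(A) * P(C).
a_c_independent = prob(A & C) == prob(A) * prob(C)

# Disjoint: B and D share no outcomes, so P(B or D) = P(B) + P(D).
b_d_disjoint = len(B & D) == 0

print(prob(A), prob(B), p_B_given_A)       # 1/2 3/8 1/2
print(a_c_independent, b_d_disjoint)       # True True
print(prob(B | D) == prob(B) + prob(D))    # True
```

Note that B and D are disjoint but not independent: knowing that D occurred makes B impossible, so P(B and D) = 0 while P(B) * P(D) is not 0.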
Two-way tables and probability trees:
Random variable: a variable which takes numerical values corresponding to outcomes of a random phenomenon
Probability distribution: a rule, formula or function which tells us which values the random variable can take on and provides us a method to assign probabilities to these values
Discrete probability models: assign probabilities to separate, fixed outcomes in a sample space
Continuous probability models: describe the pattern of a random phenomenon using a density curve

Module Seven: Variables and Distributions
Random variables: a variable which takes numerical values corresponding to outcomes of a random phenomenon
Histograms: summarize observed quantitative data. Density curves are often used to model (or describe) results of random phenomena. For the theory (i.e., the model) to agree with the data, the histogram and the density curve should be similar.
Density curves: A density curve describes the theoretical pattern or distribution of a random variable, and this description is in terms of a mathematical function. A density curve always sits on or above the horizontal axis and has an area of exactly 1 underneath it. The area under the curve for a given range of values is the probability that the random variable takes on values in that range.
Probability distributions: a rule, formula or function that tells us which values a random variable can take on and provides us a method to assign probabilities to the values of the random variable
Skewness: one side is more spread out than the other; positive (skewed to the right) if mean > median
Symmetry: no skewness
Normal distribution: Normal distributions are defined by two parameters (mean and standard deviation). The mean \(\mu\) of the distribution determines where the curve is centered and the standard deviation (SD), \(\sigma\), determines how spread out the distribution is. All normal distributions have the same shape: they are symmetric and bell-shaped.
68%-95%-99.7%/Empirical Rule: It states that if your distribution is bell-shaped and symmetrical, then you can expect about 68% of the observations (data) within one standard deviation of the mean (\(\mu\) ± \(\sigma\)), about 95% of the observations within two standard deviations of the mean (\(\mu\) ± 2\(\sigma\)) and almost all (about 99.7%) of the observations within three standard deviations of the mean (\(\mu\) ± 3\(\sigma\)). These percentages are properties of normal distributions.
Standard normal distribution:
Uniform/Rectangular distribution: The continuous uniform distribution or rectangular distribution is a symmetric probability distribution. A uniformly distributed random variable takes values which are equally probable. Thus, the height of the rectangle is constant for all values of the random variable. Suppose we define a uniform probability distribution over the interval from a to b. Because the total area must be 1, the density curve is a rectangle of height 1/(b - a) over that interval.
Triangular distribution: The triangular distribution is a symmetric probability distribution. As the name suggests, the shape of the density curve for the triangular distribution looks like a triangle. To solve problems about a triangular distribution, the key is to sketch the curve and use the formula for the area of a triangle to find the required probabilities.
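
Since every probability in this module is an area under a density curve, a short Python sketch can check the numbers. It is a sketch under stated assumptions: the function names and example intervals are invented for illustration, the triangular density is assumed symmetric with its peak at the midpoint of [a, b], and `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist

def normal_within(k, mu=0.0, sigma=1.0):
    """Area under a normal curve within k standard deviations of the mean."""
    d = NormalDist(mu, sigma)
    return d.cdf(mu + k * sigma) - d.cdf(mu - k * sigma)

def uniform_prob(lo, hi, a, b):
    """P(lo <= X <= hi) for X uniform on [a, b]: rectangle width times height 1/(b - a)."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0.0) / (b - a)

def triangular_cdf(x, a, b):
    """P(X <= x) for a symmetric triangular density on [a, b] (peak at the midpoint)."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    mid = (a + b) / 2
    if x <= mid:
        # Left of the peak the region is itself a triangle: (1/2) * base * height.
        return 2 * (x - a) ** 2 / (b - a) ** 2
    # Right of the peak: 1 minus the small triangle remaining on the right.
    return 1 - 2 * (b - x) ** 2 / (b - a) ** 2

for k in (1, 2, 3):
    print(k, round(normal_within(k), 4))   # about 0.68, 0.95, 0.997

print(uniform_prob(2, 5, 0, 10))           # 0.3
print(triangular_cdf(0.5, 0, 2))           # 0.125
```

The normal areas do not depend on the particular values of the mean and standard deviation, which is why the 68%-95%-99.7% rule applies to every normal distribution.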
Module Eight: Sampling Distributions
Parameter: a characteristic of the population we want to learn about
Statistic: the sample version of the parameter
Sampling variability or sampling error: not a mistake but a natural consequence of sampling; the difference between the statistic and the parameter it estimates. A sample never quite represents the whole population because it is only a part of it, and different samples of the same size, made up of different units, will give different values of the statistic. This is sampling variability.
Law of Large Numbers: as the number of observations randomly chosen from a population with finite mean \(\mu\) increases, the mean of the observations \(\bar{x}\) gets closer and closer to the mean of the population
Population Distribution: summarizes the variable values for the whole population
Sampling Distribution: summarizes the values that a statistic takes on across all the possible samples of the same size from the population
Central Limit Theorem: when n is large, the sampling distribution of \(\bar{x}\) will be approximately normal with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\)
Sampling Distribution of the sample mean: summarizes the values that the sample mean takes on across all the possible SRS’s of the same size from the population
Sampling Distribution of the sample proportion: summarizes the values that the sample proportion takes on across all the possible SRS’s of the same size from the population
• A few conclusions about the Sampling Distribution of \(\bar{X}\) in general: In summary, knowing how \(\bar{X}\) varies from sample to sample provides some insight into how well an \(\bar{x}\) from a sample will estimate \(\mu\) which will help us with
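
The Central Limit Theorem entry above can be checked with a small simulation. This is a sketch, not part of the course material: the population is assumed to be exponential with mean 1 and standard deviation 1 (chosen because it is strongly skewed), and the sample size, number of repetitions, and seed are arbitrary.

```python
import random
import statistics
from math import sqrt

def sample_means(n, reps, seed=1):
    """Draw `reps` random samples of size n and return the mean of each one."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]

mu, sigma = 1.0, 1.0            # mean and SD of the assumed exponential population
n, reps = 40, 4000

means = sample_means(n, reps)
mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)

print("mean of the sample means:", round(mean_of_means, 3), "vs mu =", mu)
print("SD of the sample means:  ", round(sd_of_means, 3),
      "vs sigma/sqrt(n) =", round(sigma / sqrt(n), 3))
```

The simulated sample means should centre near \(\mu\) with a spread near \(\sigma/\sqrt{n}\), even though the population itself is far from normal. A histogram of `means` would look roughly bell-shaped.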