Final cheatsheet.docx

5 Pages

University of British Columbia
Land & Food Systems
LFS 252
Erin Friesen

Numerical: quantities // Categorical: qualities that describe things, e.g. car color.
A variable is a characteristic of people or things, e.g. weight, eye color.
Population: the collection of all data values.
Sample: a subset of the population -> used to get a partial picture that represents a portion of a big group.
Coding categorical data: categorical data sometimes contains numbers, but it still counts as categorical. Reason: numbers are easier to enter into a computer file, e.g. How do you like the course? 1 = ok, 2 = good, 3 = dislike -> Results: 1,2,3,2,3,2,1 -> still categorical.
Categorical data in two-way tables: counts and frequencies.
Understanding causality: designing experiments
1. Observational study: uses groups that already exist and records the differences -> we can only say the treatment and the outcome are associated, not that one causes the other. Why? Individuals in each group may not be identical (people who eat garlic may also eat more ginger) -> confounding variable: another variable may actually be causing the result, e.g. maybe smoking is actually driving the disease outcome, not garlic eating.
2. Controlled experiment -> establishes causality: shows that an outcome is affected by some treatment -> divide subjects into a treatment group and a control group. # The sample sizes have to be large enough to account for variability // # Assignment to groups should be done randomly.
Placebo effect: people react after being told they are receiving a treatment, even if there was no actual treatment.
Blind study: a controlled study where the participants do not know whether they are receiving the treatment.
Double-blind study (ideal, but not necessary for the experiment): the person collecting the data also doesn't know which treatment the participants are receiving.
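The random-assignment step of a controlled experiment can be sketched in a few lines. This is only an illustration: the subject IDs, the function name `assign_groups`, and the even 50/50 split are assumptions, not something the notes prescribe.

```python
import random

def assign_groups(subjects, seed=None):
    """Randomly split subjects into a treatment group and a control group."""
    rng = random.Random(seed)   # the seed only makes the example repeatable
    shuffled = list(subjects)   # copy so the caller's list is left untouched
    rng.shuffle(shuffled)       # random order -> random assignment
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject IDs, made up for illustration
subjects = [f"subject_{i}" for i in range(1, 21)]
treatment, control = assign_groups(subjects, seed=252)
print(len(treatment), len(control))
```

Because chance decides who lands in which group, confounding variables (such as smoking in the garlic example) tend to be balanced across the two groups on average.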
Distribution: describes the values, frequencies and shape of the data.
Visualizing numerical data:
1. Dot plot. Pros: shows the individual data values, easy to spot outliers, describes the distribution visually. Cons: not very common, and not good for data with too many individual values.
2. Stem-and-leaf plot. Stem: all digits before the last digit // Leaf: the last digit. Shows individual data values, but also groups the data into bins with a width of 10.
3. Histogram (horizontal: numerical data, vertical: frequency)
- Groups data into bins (also called intervals or classes)
- The widths of the bars are meaningful: they represent ranges of numbers and have to be the same size
- No gaps between bars, except where no data falls in that bin
- Only one possible order
Pros: can display larger amounts of data, easy to analyze // Different bin widths change how the chart looks -> bins that are too narrow show too much detail.
Relative frequency histogram (RFH): the vertical axis represents relative frequency. It has the same shape as the frequency histogram (FH); only the scale of the vertical axis changes. RFH: shows what portion of the total falls in each bin (a percentage); FH: shows the quantity (count).
Aspects of a distribution
1. Shape: Skewed right: tail on the right (frequencies get lower toward the right). Skewed left: tail on the left. Symmetric: about the same amount of values on the left and right sides. Number of mounds: unimodal (1), bimodal (2 bumps), multimodal (more than 2).
Outliers: may indicate an error in the data, or may not be an error, e.g. one person has a high salary because he is the owner while his employees are low-waged -> even though it is an outlier, it is a fact, not an error.
Center: typical value: the highest frequencies are in the middle -> normal distribution (bell shape). No typical value: lower frequencies in the middle -> bimodal or skewed distributions.
Variability: Low: the values fall only in certain intervals (e.g. 5 intervals in total but most values fall in 2 of them) // High: the values are spread out evenly.
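To see why a frequency histogram and a relative frequency histogram have the same shape, here is a small sketch that bins a made-up data set into equal-width bins and then rescales the counts. The data values and the helper name `histogram_counts` are assumptions for illustration only.

```python
from collections import Counter

def histogram_counts(values, bin_width, start):
    """Count how many values fall in each equal-width bin, keyed by left edge."""
    counts = Counter()
    for v in values:
        left_edge = start + ((v - start) // bin_width) * bin_width
        counts[left_edge] += 1
    return dict(sorted(counts.items()))

# Made-up scores, for illustration only
data = [52, 55, 61, 63, 64, 68, 70, 71, 75, 88]

freq = histogram_counts(data, bin_width=10, start=50)
# Relative frequency = count / total; only the vertical scale changes
rel_freq = {edge: count / len(data) for edge, count in freq.items()}

print(freq)      # counts per bin (FH)
print(rel_freq)  # share of the total per bin (RFH)
```

Dividing every count by the same total keeps the bar heights in the same proportions, so the shape is unchanged.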
Visualizing categorical data:
1. Bar charts: like a histogram, but the horizontal axis represents categorical data // The bars can be put in different orders // There are gaps between bars // The width of the bars is meaningless, but every category's bar has to be the same width.
2. Pie charts: a circle divided into slices, where each slice represents the share of the total frequency for one outcome // Better at showing how much of the whole each category has.
3. Pareto charts: bar charts with categories ordered from largest to smallest (left to right).
Mode: the category that occurs with the highest frequency (can be thought of as the typical outcome).
Variability: means diversity across the categories, not a high frequency.
Side-by-side bar chart: each category contains two bars, e.g. in a book-reading chart, readers are separated into females and males.
Misleading graphs: frequency scale not starting at 0 // Symbols used instead of bars -> confuses readers, who can't tell the difference between the numbers // Bars of unequal width.
Mean: describes the center. For a right-skewed histogram, the mean is to the right of the typical value.
SD: describes the spread. Large SD: the distribution (bell shape) is wide and short in the center; small SD: the distribution is narrow and tall in the center => N(50, 8): mean = 50, SD = 8, data is normally distributed.
Define majority?
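One way to make "majority" concrete for the N(50, 8) example is to compute how much of a normal model lies within 1, 2 and 3 SDs of the mean, and to express a single observation in standard units. This is a sketch using Python's standard-library `statistics.NormalDist`; the observation 66 and the helper names are assumptions for illustration.

```python
from statistics import NormalDist

model = NormalDist(mu=50, sigma=8)   # the N(50, 8) example from the notes

def coverage(k):
    """Return (low, high, share) for the interval mean +/- k SDs."""
    low = model.mean - k * model.stdev
    high = model.mean + k * model.stdev
    return low, high, model.cdf(high) - model.cdf(low)

for k in (1, 2, 3):
    low, high, share = coverage(k)
    print(f"within {k} SD: {low:.0f} to {high:.0f} covers {share:.1%}")

def z_score(x, mean, sd):
    """How many SDs the observation x lies from the mean (standard units)."""
    return (x - mean) / sd

z = z_score(66, 50, 8)   # hypothetical observation of 66
print(z)
```

The shares come out close to the 68 / 95 / 99.7 figures usually quoted for the Empirical Rule.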
Use the Empirical Rule (approximate): about 68% of values fall within 1 SD of the mean, about 95% within 2 SD, and about 99.7% (nearly all) within 3 SD.
Z-score: standardizes an observation -> how many SDs it is away from the mean // The resulting units are called standard units.
Skewed distribution: the values pile up in certain intervals and the mean is pulled toward the tail of the graph (NOT IDEAL FOR DESCRIBING THE CENTER); the median is a better representation.
Median: the middle number, or the average of the two middle numbers if the sample size is even /// Symmetric distribution: median and mean are similar. Skewed distribution: median and mean are not similar.
Quartiles: used to measure the spread of a skewed numerical distribution. Below Q1: 25% of the data; below Q3: 75%; IQR (interquartile range) = Q3 - Q1, which covers the middle 50%. >> An outlier affects the mean, SD and range, but not the median and IQR.
Boxplots: show fewer details (median, Q1, Q3) -> useful for comparing different distributions and for finding potential outliers. Potential outliers: data values more than 1.5 interquartile ranges below Q1 or above Q3 /// Boxplots show: the typical range of values, the quartiles, the variation // They do not show: the mode (can't tell which value occurs most), the mean, or anything useful for small data sets, especially n < 5.
Regression analysis: examines the relationship (association) between two variables.
Scatterplots: used to investigate a positive, negative or no association between two numerical variables // Strength of association: strong: small spread of y values; weak: large spread of y values /// Linear trends: a trend is linear if there is a line that the data generally does not stray far from.
Correlation coefficient (r): the strength of the linear association between two numerical variables. -1 <= r <= 1. Close to 1: strong positive linear association // close to -1: strong negative // close to 0: weak or no linear association.
> Calculation of r: needs every observation (both x and y) and the number of observations.
>> Note: switching x and y has no effect on r -- r is unitless -- an outlier can strongly influence r (and the regression line).
Modeling linear trends: the regression line, used for predicting values.
X: predictor, independent variable /// Y: predicted, dependent variable.
* Regression line: y = a + bx, where b = slope and a = y-intercept / r = Σ(z_x · z_y) / (n - 1) / b = r · (s_y / s_x), where s_y and s_x are the SDs of y and x.
The y-intercept is the predicted value of y at x = 0; it is only meaningful when a value of 0 for x makes sense //// The slope (b) is only meaningful when the data follows a linear model. The equation changes when x and y are switched, but r doesn't. If the linear model is a "good fit" for the data, then the mean value of y for a given x will lie nearly on the regression line.
Aggregate data: calculate the mean of the y values at each x value. May increase r.
r²: measures how much of the variation in the response variable y is explained by x. A model with a high r² is best for making predictions about the response variable /// If r² is high: a large share of the variation in y can be explained by x -> the percentage of the variation in y accounted for by x.
Randomness: no predictable pattern occurs, e.g. dice, heads or tails.
Probability: theoretical and empirical.
Theoretical probability: long-run relative frequencies based on theory, e.g. a flipped coin: eventually the relative frequencies approach 50% and 50%. A probability is always between 0 and 1.
Complement: the event does not happen (represents all the other outcomes).
Equally likely outcomes: each outcome has the same chance of occurring, e.g. each number on a die.
Probability rules: And: the probability that both events occur. Or: the probability that at least one of the events occurs. Mutually exclusive events: the probability that both events occur is 0. Conditional probability: the probability of event A given that event B has occurred; it matters when A and B are associated.
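Going back to the regression line notes: the slope is r times the ratio of the SDs. The sketch below assumes the usual z-score formula for r, r = Σ(z_x · z_y) / (n - 1), and the usual intercept a = ȳ - b·x̄, which the notes do not state. The data points and function names are made up for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum of products of z-scores, divided by (n - 1)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)          # sample SDs
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

def regression_line(xs, ys):
    """Return (a, b, r) for the line y = a + b*x."""
    r = correlation(xs, ys)
    b = r * stdev(ys) / stdev(xs)          # slope: b = r * s_y / s_x
    a = mean(ys) - b * mean(xs)            # intercept (assumed standard formula)
    return a, b, r

# Made-up data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b, r = regression_line(xs, ys)
print(f"y = {a:.2f} + {b:.2f}x, r = {r:.3f}, r^2 = {r * r:.2f}")

# Switching x and y changes the line but leaves r unchanged
r_swapped = correlation(ys, xs)
```

Here r² is 0.6, so under this model about 60% of the variation in y is accounted for by x.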
Independent events: events A and B are independent -> B occurring doesn't influence the probability of A.
Empirical probability: short-run relative frequencies (from actual trials).
Law of large numbers: with a large number of trials, the empirical probability approaches the theoretical probability.
Sample space: a list that contains all possible (and equally likely) outcomes.
Random variables can be discrete (no decimals, can be listed or counted) or continuous (over a range).
Use a probability model to predict the likelihood of an event: Normal model or Binomial model.
Sample: used to make decisions about the population. Samples have to be: 1. part of the whole population 2. randomized: so that on average the sample represents the whole population and avoids bias 3. sample size: makes a difference in sampling.
Population: a group of objects being studied.
Parameter: a numerical value that characterizes some aspect of the population, e.g. a probability.
Census: a survey in which every member of the population is measured (may be too expensive, take too much time, or be destructive in nature, i.e. something has to be destroyed to get the data: asking someone to drink 100 cans of beer, or killing all the animals, etc.).
Statistical inference: drawing conclusions about a population on the basis of observing only a small subset of the population; involves uncertainty (use terms like: predict, maybe).
Sampling bias: occurs because the sample doesn't represent the population.
Voluntary-response bias: e.g. an online survey: only people with strong feelings fill in the survey, so it only provides information about the people who chose to respond and is not a scientific poll.
Non-response bias: people fail to answer a question or respond to a survey, or, for whatever reason, choose to give wrong answers or no response.
Measurement bias: asking questions that do not produce a true answer: people may overestimate or underestimate; leading questions (too much info) guide people toward an answer (be aware of the phrasing of the question too); double-barreled questions (e.g. are you satisfied with UBC and your faculty?).
How to know there is bias? Only a small group of the sample replied to the survey; the researchers chose the participants; the researchers left out feedback that is totally different from the rest of the population.
Problems with surveys:
1. Convenience sampling: e.g. standing at the UBC bus loop asking students whether the U-Pass is a good policy -> forgets to include the people who drive.
2. Undercoverage: e.g. phoning only during specific times -> a group ends up with a smaller representation in the sample than it has in the population.
Simple random sampling (SRS): the population should be at least 10x the sample size. Minimizes bias. 1st step: first define where the sample will come from (the sampling frame).
Systematic sampling: collect data on every nth individual, e.g. students whose student number ends in 9 do the survey -> produces a random sample if done correctly > include representatives from all times and locations, and make sure the population doesn't have a cycle.
Stratified sampling: individuals within a stratum are about homogeneous; strata are different from one another. The population is first sliced into homogeneous groups (strata) before the sample is selected, e.g. group LFS students according to specialization -> then take a systematic random sample within each stratum.
Cluster sampling: within a cluster the individuals are different / the clusters are about alike // Divides the population into distinct groups using some sort of natural or convenient distinction, then looks at every member of the selected groups // Split the population into similar (smaller) groups, then select one or a few clusters at random.
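The four sampling methods differ mainly in where the randomness enters. The sketch below draws each kind of sample from a hypothetical sampling frame of 100 student numbers; the frame, the two strata, the ten clusters and all function names are assumptions made up for illustration.

```python
import random

def simple_random_sample(frame, n, rng):
    """SRS: every individual in the frame has the same chance of selection."""
    return rng.sample(frame, n)

def systematic_sample(frame, step, rng):
    """Random starting point, then every step-th individual."""
    start = rng.randrange(step)
    return frame[start::step]

def stratified_sample(strata, step, rng):
    """Sample inside every stratum (systematically here, as in the notes)."""
    return {name: systematic_sample(members, step, rng)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random, then keep every member of them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]

rng = random.Random(252)                 # seed only for repeatability
frame = list(range(1, 101))              # hypothetical student numbers 1..100

srs = simple_random_sample(frame, 10, rng)
systematic = systematic_sample(frame, 10, rng)
# Hypothetical strata (e.g. two specializations) and clusters (e.g. ten classes)
strata = {"spec_A": frame[:60], "spec_B": frame[60:]}
stratified = stratified_sample(strata, 10, rng)
clusters = {f"cluster_{i}": frame[i * 10:(i + 1) * 10] for i in range(10)}
clustered = cluster_sample(clusters, 2, rng)

print(len(srs), len(systematic), len(clustered))
```

Stratified sampling takes some individuals from every group, while cluster sampling takes every individual from a few randomly chosen groups.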
Accurate / precise: for a sampling distribution, the SD is called the Standard Error (SE). Precision can be improved by using a larger sample size. Population size has no influence on bias or on precision (standard error). As the sample size increases, the SE decreases while the mean stays the same. No bias: the mean of the sampling distribution of the sample proportion equals the population proportion.
Simple random sampling distribution: the true value (the parameter) stays the same, but the statistic, the sample value used to estimate the population parameter, changes from sample to sample. We don't take many samples in the real world, because a single sample will rarely land at the extreme high or low end of the distribution.
Sample distribution: take 1 sample and plot a histogram or bar chart of all observations.
Sampling distribution: take many samples from the population, calculate a statistic for each sample, and plot the frequency or probability.
Treatment variable: may be the "cause" of something; purposefully changed for the experiment.
Response variable: the variable we are interested in, to see whether it has changed due to the treatment.
Geometric mean: log-transform the data to make it normal.
Population distribution: the distribution of all individuals that exist.
Population: the individuals/objects that make up the larger group that we are usually interested in.
Parameters: numerical summaries of the population; mean = µ; SD = σ; population proportion = p.
Sample distribution: the distribution of the individuals that were surveyed.
Sample: when we sample only a group of individuals or objects from the population.
Statistics: numerical summaries of the sample; mean = x̄; SD = s; sample proportion = p̂.
Using sample proportions to make inferences about the population proportion: if the sample size is large enough and the conditions are met (i.e. number of successes and failures), then the sampling distribution of sample proportions will be normally distributed.
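A small simulation can illustrate the claims about sampling distributions: the sample proportions center on the population proportion, and their spread (the SE) shrinks as the sample size grows. This is a sketch under assumptions: the population proportion 0.4, the sample sizes, and the 2000 repetitions are made up, and the theoretical SE uses the standard formula sqrt(p(1 - p)/n).

```python
import random
from statistics import mean, pstdev

def simulate_sample_proportions(p, n, num_samples, rng):
    """Draw many samples of size n and return the sample proportion of each."""
    return [sum(rng.random() < p for _ in range(n)) / n
            for _ in range(num_samples)]

rng = random.Random(252)   # seed only for repeatability
p = 0.4                    # hypothetical population proportion

results = {}
for n in (25, 100, 400):
    proportions = simulate_sample_proportions(p, n, 2000, rng)
    # Center and spread of the simulated sampling distribution
    results[n] = (mean(proportions), pstdev(proportions))
    theoretical_se = (p * (1 - p) / n) ** 0.5
    print(n, round(results[n][0], 3), round(results[n][1], 3),
          round(theoretical_se, 3))
```

In practice only one sample is taken, so the simulation is a thinking tool rather than something done with real data.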
N(p, √(p(1 - p)/n)): where p is the mean and √(p(1 - p)/n) is the SE of the distribution of p̂.
When you don't know the true population proportion (which is most of the time), you can estimate the SE by: SEest = √(p̂(1 - p̂)/n).
Since the center of the sampling distribution is p, when we collect a sample from the population, we can be fairly confident our sample statistic falls within a certain distance of p.
MOE and sample size for a proportion: n = (Z*/m)² · p̂(1 - p̂); n = sample size; Z* = 1.96 for 95% CL; m = MOE.
Sample or Population Distribution. What is it: a plot of all observations in a sample or a population and how frequently each observation (x) occurs. Wha
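The estimated SE, the margin of error and the required sample size can be worked through numerically. This sketch assumes the standard formulas SEest = √(p̂(1 - p̂)/n) and MOE m = Z* · SEest, solved for n; the default p̂ = 0.5 in `required_sample_size` is a conservative assumption (it gives the largest n), and the numbers plugged in are made up.

```python
import math

def estimated_se(p_hat, n):
    """Estimated standard error of a sample proportion."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

def margin_of_error(p_hat, n, z_star=1.96):
    """MOE = Z* times the estimated SE (Z* = 1.96 for a 95% confidence level)."""
    return z_star * estimated_se(p_hat, n)

def required_sample_size(m, p_hat=0.5, z_star=1.96):
    """Smallest n giving a margin of error of at most m.

    p_hat = 0.5 is an assumed worst case, used when no estimate is available.
    """
    return math.ceil((z_star / m) ** 2 * p_hat * (1 - p_hat))

# Hypothetical survey: 400 respondents, 60% answer yes
se = estimated_se(0.6, 400)
moe = margin_of_error(0.6, 400)
print(round(se, 4), round(moe, 4))

# Hypothetical target: margin of error of 3 percentage points at 95% CL
n_needed = required_sample_size(0.03)
print(n_needed)
```

Rounding n up (rather than to the nearest integer) keeps the margin of error at or below the target.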