# Final cheatsheet.docx

University of British Columbia

Land & Food Systems

LFS 252

Erin Friesen

Winter

Description

Numerical data: quantities. Categorical data: qualities that describe things, e.g. car color.
A variable is a characteristic of people or things, e.g. weight, eye color.
Population: the collection of all data values. Sample: a subset of the population, used to get a partial picture that represents a portion of a big group.

Coding categorical data: categorical data sometimes contains numbers, but it still counts as categorical. Reason: numbers are easier to enter into a computer file. E.g. "How do you like the course?" 1 = ok, 2 = good, 3 = dislike -> results 1, 2, 3, 2, 3, 2, 1 are still categorical.
Categorical data in two-way tables: counts and frequencies.

Understanding causality: designing experiments
1. Observational study: uses groups that already exist and records the differences between them -> we can only say the treatment and outcome are associated, not that one causes the other. Why? Individuals in each group may not be identical (people who eat garlic may also eat more ginger). Confounding variable: another variable may actually be causing the result, e.g. maybe smoking, rather than garlic intake, is actually causing the disease.
2. Controlled experiment: establishes causality by showing that an outcome is affected by some treatment. Divide the subjects into a treatment group and a control group. The sample sizes have to be large enough to account for variability, and the assignment to groups should be done randomly.
Placebo effect: people react after being told they received a treatment even if there was no actual treatment.
Blind study: a controlled study in which the participants do not know whether they are receiving the treatment.
Double-blind study (ideal, but not necessary for the experiment): the person collecting the data also doesn't know which treatment each participant is taking.

Distribution: describes the values, frequencies, and shape of the data.

Visualizing numerical data:
1. Dot plot. Pros: shows the individual data values, easy to spot outliers, describes the distribution visually. Cons: not very common, and not good for data with too many individual values.
2. Stem-and-leaf plot. Stem: all digits before the last digit; leaf: the last digit. Shows individual data values, but also groups the data into bins with a width of 10.
3. Histogram (horizontal axis: numerical data; vertical axis: frequency). Groups data into bins (also called intervals or classes). The widths of the bars are meaningful: they represent numeric ranges and have to be the same size. No gaps between bars, except where no data falls in a bin. The bars have only one possible order. Pros: can display larger amounts of data, easy to analyze. Different bin widths change how the chart looks: bins that are too narrow show too much detail.
Relative frequency histogram (RFH): the vertical axis represents relative frequency. It has the same shape as the frequency histogram (FH); only the scale of the vertical axis changes. RFH: shows what portion (percentage) of the total falls in each bin; FH: shows the quantity.

Aspects of a distribution:
Shape. Skewed right: the tail (low frequencies) is on the right. Skewed left: the tail is on the left. Symmetric: about the same amount of values on the left and right sides. Number of mounds: unimodal (1), bimodal (2 bumps), multimodal (more than 2).
Outliers: may indicate an error in the data, or no error at all, e.g. one person has a high salary because he is the owner while his employees earn low wages -> even though it is an outlier, it is a fact, not an error.
Center: there is a typical value when the counts are highest in the middle -> normal distribution (bell shape); there is no single typical value when the counts are lower in the middle -> bimodal or skewed distributions.
Variability: low when the values fall only in certain intervals (e.g. 5 intervals in total but the values mostly fall in 2 of them); high when the values are spread out evenly.
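The difference between a frequency histogram and a relative frequency histogram described above can be sketched in a few lines of Python. This is only an illustration: the data values, the bin width of 10, and the helper name `histogram_counts` are made up for the example and do not come from the course.

```python
from collections import Counter

def histogram_counts(values, bin_width, start):
    """Count how many values fall in each bin [lo, lo + bin_width)."""
    counts = Counter((v - start) // bin_width for v in values)
    n_bins = max(counts) + 1
    # Empty bins are kept with a count of 0, which shows up as a gap.
    return [(start + i * bin_width, counts.get(i, 0)) for i in range(n_bins)]

# Hypothetical data set (made up for illustration).
data = [52, 55, 61, 63, 64, 68, 70, 71, 75, 88]

freq = histogram_counts(data, bin_width=10, start=50)        # FH: counts
rel_freq = [(lo, count / len(data)) for lo, count in freq]   # RFH: shares

for (lo, count), (_, share) in zip(freq, rel_freq):
    print(f"{lo}-{lo + 10}: {'*' * count:<5} count={count} share={share:.0%}")
```

Both lists describe the same shape; only the vertical scale differs, and the relative frequencies add up to 1.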
Visualizing categorical data:
1. Bar chart: like a histogram, but the horizontal axis represents categorical data. The bars can be put in different orders, there are gaps between bars, and the width of the bars is meaningless, although every category's bar has to be the same width.
2. Pie chart: a circle divided into slices, each slice proportional to the frequency of that outcome. Better at showing how much of a share each category has of the whole.
3. Pareto chart: a bar chart with categories ordered from largest to smallest (left to right).
Mode: the category that occurs with the highest frequency (one idea of a typical outcome).
Variability: means diversity across the categories, not high frequencies.
Side-by-side bar chart: each category contains two bars, e.g. in a chart of book reading, readers are separated into females and males.
Misleading graphs: frequency scale not starting at 0; symbols used rather than bars (confuses readers, who can't tell the difference between the numbers); bars of unequal width.

Mean: describes the center. For a right-skewed histogram, the mean is to the right of the typical value.
SD: describes the spread. Large SD: the distribution (bell shape) is wide and short in the center; small SD: the distribution is narrow and high in the center. N(50, 8): mean 50, SD 8, normally distributed.
What counts as the majority? Use the Empirical Rule (approximate): about 68% of values fall within 1 SD of the mean, about 95% within 2 SD, and about 99.7% within 3 SD.
Z-score: standardizes an observation by measuring how many SDs it is away from the mean, z = (value - mean) / SD. The resulting units are called standard units.
Skewed distribution: the values pile up in one group of intervals and the mean is pulled toward the tail of the graph (NOT IDEAL FOR DESCRIBING THE CENTER); the median is a better representation.
Median: the middle number, or the average of the two middle numbers if the sample size is even. Symmetric distribution: median and mean are similar. Skewed distribution: median and mean are not similar.
Quartiles: used to measure the spread of a skewed numerical distribution. 25% of values are below Q1 and 75% are below Q3; IQR (interquartile range) = Q3 - Q1 covers the middle 50%. An outlier affects the mean, SD, and range, but not the median and IQR.
Boxplots: less detail (median, Q1, Q3) -> useful for comparing different distributions and for finding potential outliers. Potential outlier: a data value more than 1.5 interquartile ranges below Q1 or above Q3.
Boxplots show: the typical range of values, the quartiles, possible outliers, variation. They do not show: the mode (you can't tell which value occurs most), the mean, or anything useful for small
data sets, especially those with fewer than 5 values.

Regression analysis: examines the relationship (association) between two numerical variables.
Scatterplots: used to investigate a positive, negative, or no association between two numerical variables. Strength of association: strong = small spread of y values; weak = large spread of y values.
Linear trend: a trend is linear if there is a line that the data generally do not stray far from.
Correlation coefficient (r): measures the strength of the linear association between two numerical variables, -1 <= r <= 1. Close to 1: strong positive linear association; close to -1: strong negative; close to 0: weak or no association. Calculating r needs every observation (both x and y) and the number of observations. Notes: switching x and y has no effect on r; r is unitless; an outlier can strongly influence r (and the regression line).
Modeling linear trends, the regression line: used to predict values. x = predictor (independent variable); y = predicted (dependent variable).
Regression line: y = a + bx, where b = slope and a = y-intercept; $r = \frac{\sum z_x z_y}{n - 1}$; $b = r \frac{s_y}{s_x}$, where $s_x$ and $s_y$ are the SDs of x and y.
The y-intercept is the predicted value of y at x = 0, so it is only meaningful when it makes sense for x to have a value of 0. The slope (b) is only meaningful when the data follow a linear model. The equation changes when x and y are switched, but r doesn't. If the linear model is a “good fit” for the data, then the mean value of y for a given x will lie nearly on the regression line.
Aggregate data: calculate the mean of the y values at each x value; this may increase r.
r²: measures how much of the variation in the response variable y is explained by x, i.e. roughly what percentage of the variation in y is accounted for by x. A high r² is best for making predictions about the response variable.

Randomness: no predictable pattern occurs, e.g. dice, heads or tails.
Probability: theoretical and empirical.
Theoretical probability: long-run relative frequencies based on theory, e.g. a flipped coin will eventually come up heads 50% and tails 50% of the time. A probability is always between 0 and 1.
Complement: the event does not happen (all the other outcomes).
Equally likely outcomes: each outcome is equally likely to occur, e.g. each number on a die has an equal chance of occurring.
Probability rules:
And: the probability that both events occur.
Or: the probability that at least one of the events occurs.
Mutually exclusive events: the probability that both events occur is 0.
Conditional probability: the probability of A given that B has occurred; relevant when events A and B are associated.
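The probability rules above can be checked on a small sample space. The sketch below assumes one roll of a fair six-sided die; the events ("even", "greater than 3", "one") and the helper `prob` are chosen only for illustration.

```python
from fractions import Fraction

# Sample space: all equally likely outcomes of rolling one fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Theoretical probability of an event given as a set of outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}             # hypothetical event A
greater_than_3 = {4, 5, 6}   # hypothetical event B
one = {1}                    # shares no outcome with "even"

p_and = prob(even & greater_than_3)   # both occur: {4, 6} -> 1/3
p_or = prob(even | greater_than_3)    # at least one occurs: {2, 4, 5, 6} -> 2/3
p_not_even = 1 - prob(even)           # complement -> 1/2
p_exclusive = prob(even & one)        # mutually exclusive events -> 0

# Conditional probability: P(A given B) = P(A and B) / P(B) -> 2/3
p_even_given_gt3 = p_and / prob(greater_than_3)

print(p_and, p_or, p_not_even, p_exclusive, p_even_given_gt3)
```

Here P(even given greater than 3) = 2/3 differs from P(even) = 1/2, which is what it means for the two events to be associated.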
Independent events: A and B are independent -> B doesn't influence A.
Empirical probability: short-run relative frequencies.
Law of large numbers: with a large number of trials, the empirical probability approaches the theoretical probability.
Sample space: a list that contains all possible (and equally likely) outcomes.
Random variables can be discrete (no decimals, can be listed or counted) or continuous (take values over a range).
Use a probability model to predict the likelihood of an event: Normal model or Binomial model.

Sample: used to make decisions about the population. A sample has to be: 1. part of the whole population; 2. randomized (so that on average it represents the whole population and avoids bias); 3. of suitable size (sample size makes a difference in sampling).
Population: the group of objects being studied.
Parameter: a numerical value that characterizes some aspect of the population, e.g. a probability.
Census: a survey in which every member of the population is measured. It may be too expensive, take too much time, or be destructive in nature (something has to be destroyed to get the data), e.g. asking someone to drink 100 cans of beer or killing all the animals.
Statistical inference: drawing conclusions about a population on the basis of observing only a small subset of it. It involves uncertainty (use terms like "predict", "maybe").

Sampling bias: occurs because the sample doesn't represent the population.
Voluntary-response bias: e.g. an online survey, where only people with strong feelings fill it in; it only provides information about the people who chose to respond and is not a scientific poll.
Non-response bias: people fail to answer a question or to respond to a survey, or, for whatever reason, choose to give wrong answers.
Measurement bias: asking questions that do not produce a true answer. People may overestimate or underestimate; questions (with too much information) may guide people toward an answer, so be aware of the phrasing; double-barreled questions (e.g. "Are you satisfied with UBC and your faculty?").
How to know there is bias? Only a small group of the sample replied to the survey; the researchers chose the participants; the researchers left out feedback that was totally different from the rest of the population.
Problems with surveys: 1. Convenience sampling: e.g. standing at the UBC bus loop asking students whether they think the U-Pass is a good policy -> forgets to include the people who drive. 2. Undercoverage: e.g. phoning only during specific times -> a group ends up with only a small
representation in the sample compared with what it has in the population.

Simple random sampling (SRS): the population should be at least 10 times the sample size. Minimizes bias. 1st step: define where the sample will come from (the sampling frame).
Systematic sampling: collect data on every nth individual, e.g. students whose student number ends in 9 do the survey. Produces a random sample if done correctly: include representatives from all times and locations, and make sure the population doesn't have a cycle.
Stratified sampling: individuals within a stratum are roughly homogeneous, and the strata differ from one another. The population is first sliced into homogeneous groups (strata) before the sample is selected, e.g. group LFS students according to specialization; then conduct a systematic random sample within each stratum.
Cluster sampling: individuals within a cluster are different, and the clusters are roughly alike. Split the population into similar, smaller groups using some natural or convenient distinction, select one or a few clusters at random, and then look at every member of the selected clusters.

Accuracy and precision: for a sampling distribution, the SD is called the standard error (SE). Precision can be improved by using a larger sample size. Population size has no influence on bias or on precision (standard error). As the sample size increases, the SE decreases and the mean stays the same. No bias: the mean of the sampling distribution of the sample proportion equals the population proportion.
Sampling distribution under simple random sampling: the true value (the population parameter) stays the same, but the statistic we use to estimate it changes from sample to sample. We don't take many samples in the real world, because a sample will rarely land at the extreme high or low end of the distribution.
Sample distribution: take 1 sample and plot a histogram or bar chart of all observations.
Sampling distribution: take many samples from the population, calculate a statistic for each sample, and plot the frequency or probability.
Treatment variable: may be the “cause” of something; purposefully changed for the experiment.
Response variable: the variable we are interested in, to see whether it has changed due to the treatment.
Geometric mean: for data that are approximately normal after a log transform.
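The claims about sampling distributions above (the SE shrinks as the sample size grows, while the mean stays at the population proportion) can be illustrated with a small simulation. This is a sketch under stated assumptions: the population proportion 0.3, the sample sizes 25 and 400, and the 2000 repetitions are made-up values, and each draw is treated as independent, which is reasonable when the population is much larger than the sample.

```python
import random
import statistics

def sampling_distribution(p, n, num_samples, rng):
    """Draw num_samples random samples of size n; return each sample proportion."""
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(num_samples)]

rng = random.Random(0)   # fixed seed so the run is repeatable
p = 0.3                  # hypothetical population proportion

results = {}
for n in (25, 400):
    props = sampling_distribution(p, n, num_samples=2000, rng=rng)
    # Mean and SD of the simulated sampling distribution (the SD is the SE).
    results[n] = (statistics.mean(props), statistics.stdev(props))
    theory_se = (p * (1 - p) / n) ** 0.5
    print(f"n={n}: mean={results[n][0]:.3f} SE={results[n][1]:.3f} theory={theory_se:.3f}")
```

Both simulated means should sit close to 0.3, while the spread for n = 400 should be roughly a quarter of the spread for n = 25, since the SE scales with 1/√n.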
Population distribution: the distribution of all individuals that exist.
Population: the individuals/objects that make up the larger group that we are usually interested in.
Parameters: numerical summaries about the population. Mean = µ; SD = σ; population proportion = p.
Sample distribution: the distribution of the individuals that were surveyed.
Sample: when we sample only a group of individuals or objects from the population.
Statistics: numerical summaries about the sample. Mean = x̄; SD = s; sample proportion = p̂.
Using sample proportions to make inferences about the population proportion.
If the sample size is large enough and the conditions are met (i.e., enough successes and failures), then the sampling distribution of sample proportions will be Normally distributed.
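A minimal sketch of the success/failure check mentioned above. The notes do not say how many successes and failures are enough; the cutoff of 10 used here is a common textbook convention and is an assumption, as is the function name.

```python
def normal_approximation_ok(n, p, threshold=10):
    """Check the success/failure condition for a sample proportion.

    threshold=10 is an assumed convention, not a value taken from these notes.
    """
    expected_successes = n * p
    expected_failures = n * (1 - p)
    return expected_successes >= threshold and expected_failures >= threshold

print(normal_approximation_ok(100, 0.3))  # 30 successes and 70 failures expected
print(normal_approximation_ok(20, 0.1))   # only 2 successes expected
```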
$N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$: where p is the mean and $\sqrt{\frac{p(1-p)}{n}}$ is the SE of the distribution.
When you don’t know the true population proportion (which is most of the time), you can estimate SE by: $SE_{est} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
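Since the center of the sampling distribution is p, when we collect a sample from the population, we can be fairly confident that our sample statistic falls within a certain distance of p; that distance is the margin of error (MOE).
MOE and sample size for a proportion: $n = \left(\frac{z^*}{m}\right)^2 \hat{p}(1-\hat{p})$, which becomes $n = \left(\frac{z^*}{2m}\right)^2$ for the conservative choice $\hat{p} = 0.5$. n = sample size; z* = 1.96 for 95% CL; m = MOE.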
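The estimated SE, margin of error, and sample-size calculations above can be put together in a short sketch. The survey numbers (120 "yes" answers out of 400) and the target MOE of 0.03 are hypothetical, the function names are illustrative, and using 0.5 as the default guess for the proportion is an assumption (it is the conservative choice because it gives the largest SE).

```python
import math

Z_STAR_95 = 1.96  # critical value for a 95% confidence level, as in the notes

def estimated_se(p_hat, n):
    """Estimated standard error of a sample proportion."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

def margin_of_error(p_hat, n, z_star=Z_STAR_95):
    """Margin of error: z* times the estimated SE."""
    return z_star * estimated_se(p_hat, n)

def required_sample_size(moe, p_guess=0.5, z_star=Z_STAR_95):
    """Smallest n whose margin of error is at most moe (p_guess=0.5 is conservative)."""
    return math.ceil((z_star / moe) ** 2 * p_guess * (1 - p_guess))

# Hypothetical survey: 120 of 400 respondents answer "yes".
p_hat = 120 / 400
moe = margin_of_error(p_hat, 400)
interval = (p_hat - moe, p_hat + moe)   # 95% confidence interval for p
n_needed = required_sample_size(0.03)   # n for a 3-point margin of error

print(f"p_hat={p_hat:.2f} MOE={moe:.3f} CI=({interval[0]:.3f}, {interval[1]:.3f})")
print(f"n needed for MOE 0.03: {n_needed}")
```

With these inputs the required sample size comes out to 1068; rounding up rather than to the nearest integer keeps the margin of error at or below the target.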
Sample or Population Distribution
What is it: A Plot of all observations in a sample or a population and how frequent each observation occurs (x).
Wha
