STAB22H3 (Statistics) Textbook Notes: Chapter 1-14


Mahinda Samarakoon

Chap.1 – What is Statistics
- Statistics (the field): a way of reasoning, along with a collection of tools and methods, designed to help us understand the world
  o statistics (plural): particular calculations made from data
  o data: values with context
- Variation: data vary because what we measure/see is measured imperfectly (the data we base our information on give an imperfect picture of the world)

Chap.2 – Understanding Data
- Context: the 5 W's and H
- Who: (leftmost column of data tables)
  o Case: an individual about whom or which we have data (usually a sample of cases from a larger population)
- What: the characteristics recorded for each individual = variables (each variable holds information about the same characteristic for many cases)
  o Units: tell us how much of something we have or how far apart two values are → how each value is measured
- Why: why the data are needed
  o Categorical variable: a variable that names categories (in words or numerals)
    ▪ Identifier variables: variables with exactly one individual in each category
  o Quantitative variable: a variable whose numbers act as numerical values, measured in units
- Where: data vary depending on place
- When: data vary depending on time
- How: how the data are collected / how many variables

Chap.3 – Categorical Variables
- Frequency tables (relative frequency tables → percentages): list the categories and give the count (or percentage) of observations for each category
- The distribution of a variable gives:
  1. the possible values of the variable
  2. the relative frequency of each value
- Area Principle: the area occupied by a part of the graph should correspond to the magnitude of the value it represents (the Titanic ship graphic = bad example of displaying data)
- Bar graph
  o More flexible and easier to read
  o Shows the counts for each category next to each other for easy comparison
  o To draw attention to relative proportions → relative frequency bar chart
  o Segmented bar graphs: treat each bar as a "whole" (= 100%) and divide it proportionally into segments
- Pie graphs
  o Lack a scale
  o Must include all categories that make up a whole
  o Percentages cannot total more than 100% (a case cannot fall under more than one category)
  o Harder to make comparisons
- Contingency tables
  o Look at two categorical variables together (two-way table)
  o Reveal possible patterns in one variable that may be contingent on the category of the other
  o Joint distribution: divide each cell entry by the total sample size (e.g. count of male binge drinkers / total count of males and females)
  o Marginal distribution: examine a single variable in a two-way table (e.g. total female count / total count of males and females); the proportions sum to 1
  o Conditional distribution: condition on the value of one variable and calculate the distribution of the other (e.g. count of binge drinkers of one gender / total count for that gender)
    ▪ Conditional distributions reveal relationships because they show the distribution of one variable for just those cases that satisfy a condition on another variable
    ▪ When the distribution of one variable is the same for all categories of another, we say that the variables are INDEPENDENT (the result does not DEPEND on the category) → no association between these variables
- Simpson's Paradox: unfair averaging → relationships among proportions taken within different groups or subsets can appear to contradict relationships among the overall proportions

Chap.4 – Quantitative Variables
- Histograms
  o A histogram slices up all possible values of a quantitative variable into equal-width bins → the counts of values falling in each bin are plotted
  o Gaps in a histogram = no data values in that interval (unlike a bar graph, where gaps only separate categories)
  o Relative frequency histogram = y-axis shows %
  o Don't show the data values themselves
- Stem-and-Leaf Displays
  o Like a histogram, but shows individual values
  o Values are truncated or rounded to fit the leaf digit
  o Can split the data of the same stem into two lines
- Dotplots
  o A simple display that puts a dot along an axis for each case in the data
  o Good for small data sets
  o Count on x-axis
- Check the Quantitative Data Condition → data are values of a quantitative variable whose units are known
- Describe a distribution by: Shape, Centre, Spread
- Shape:
  o Number of modes → unimodal, bimodal, multimodal, uniform
  o Symmetric? → tails? Skewed to the left (tail on the left) = negatively skewed; skewed to the right (tail on the right) = positively skewed
  o Outliers
- Centre:
  o Mean (see pg 57)
    ▪ Balancing point
    ▪ Not a resistant measure of centre → not resistant to outliers
    ▪ Trimmed mean: trim 5-10% of the biggest and smallest values before averaging (makes the mean a more resistant measure)
  o Median
    ▪ Midpoint of the distribution: half the observations are smaller, half are bigger
    ▪ Resistant measure of centre → resistant to outliers
  o Symmetric → mean and median are roughly the same
  o Skewed to the right → median < mean
  o Skewed to the left → median > mean
- Spread
  o Range = max - min → difference between the extremes
    ▪ Sensitive to outliers
  o Interquartile Range (IQR) = Q3 - Q1 (goes with the median)
    ▪ Ignores the extremes and concentrates on the middle of the data
    ▪ Q1 - M - Q3
    ▪ Q1 = 25th percentile → 25% of the data fall below Q1
    ▪ Median = 50th percentile
    ▪ Q3 = 75th percentile
    ▪ Resistant measure → resistant to outliers
    ▪ Ignores how individual values vary
  o Standard Deviation (s) (goes with the mean)
    ▪ Measures how far each data value is from the mean
    ▪ ONLY for symmetric data
    ▪ s^2 is the variance
    ▪ For unimodal, symmetric data: within 1s of the mean = ~68% of data, within 2s = ~95%, within 3s = ~99.7%
- VARIANCE
  o Measures of spread help us to be precise about what we don't know
  o Small spread = data close to the centre
  o Spread = 0 when all values are the same
- 5 Number Summary → min, Q1, M, Q3, max → helps us see if the data are left or right skewed too!
- For unimodal symmetric data, the IQR is usually a bit larger than the standard deviation
- Split data into separate groups if there are multiple modes (AND you can explain the reason for the separate modes)

Chap.5 - Outliers
- Investigating suspected outliers
  o Upper fence = Q3 + 1.5 × IQR
  o Lower fence = Q1 - 1.5 × IQR
  o Values beyond the fences are suspected outliers
  o If the median is roughly centred between the quartiles, then the middle half of the data is roughly symmetric
- Compare boxplots through shape, centre, and spread, and check for outliers
- Outliers:
  o Understand the context of why there is an outlier (error, special scenario etc.)
  o If the outlier is a correct value, analyze the data with and without the outlier

Chap.6
- To compare things measured on different scales, we must STANDARDIZE the results → base values → mean and standard deviation
- Standardized values (pg 122 for formula): z = (value - mean) / SD
  o Denoted as "z" or z-scores
  o z-scores have no units
  o Measure only the distance of each data value from the mean, in standard deviations
  o Negative when below the mean, positive when above the mean
  o Farther away from the mean = more unusual → e.g. for running times, negative z-scores are better
- Linear transformations of data
  o Shifting
    ▪ Adding a constant to each observation
    ▪ Only affects max, min, quartiles, and centres BECAUSE THE SHAPE AND SPREAD DON'T CHANGE
  o Rescaling
    ▪ Multiplying each observation by a constant → multiplies max, min, quartiles, centres, and measures of spread BUT THE SHAPE STAYS THE SAME
  o Linear transformations do not change the shape of the distribution; they change the measures of centre and spread in a predictable manner
  o NON-linear transformations of data (logs, square roots etc.) change shape, centre, and spread in an unpredictable manner
- Z-scores
  o Standardizing into z-scores = shifting the data by the mean and rescaling by the standard deviation
  o Standardizing into z-scores does NOT change the shape of the distribution of the variable
  o Eliminates units
  o Changes the centre by making the mean = 0
  o Changes the spread by making the SD = 1
- Density curves: a model for the frequency distribution of data, using area under the curve to represent relative frequency
  o Must be positive or zero
  o Total area under the curve = 1
  o Describes the overall pattern of a distribution
  o Idealized normal frequency distribution
- Normal Model: unimodal, symmetric distribution model
- Standard Normal Model = N(0,1) → (refer to page 146) MUST BE A NORMAL MODEL
  o Check the Nearly Normal Condition: must check that the distribution is unimodal and nearly symmetric (make a histogram or Normal Probability Plot)
    ▪ Normal Probability Plot: data are roughly Normal = the plot is roughly a diagonal straight line (deviations from a straight line = the distribution is not Normal)
    ▪ IMPORTANT: be able to draw the left- and right-skewed versions
      • Smiling curve = right skewed
      • Frowning curve = left skewed
    ▪ z-scores on the x-axis, variable values on the y-axis (done with technology)
  o Normal models ARE ASSUMPTIONS/IDEALIZED
- 68-95-99.7 Rule of Normal Models
  o Within 1 SD of the mean = 68%
  o Within 2 SD of the mean = 95%
  o Within 3 SD of the mean = 99.7%
- Looking for Normal percentiles (when the value doesn't fall exactly 1, 2, or 3 SD from the mean)
  o Convert the data value to a z-score → find the percentile in the table (pg 135)
  o From percentiles to z-scores
    ▪ Look at the table first to find the closest percentile, then read off the corresponding z-score
    ▪ DON'T ROUND RESULTS IN THE MIDDLE OF CALCULATIONS

Chap.7
- Relationship between two quantitative variables (scatterplot)
- Lurking variables: hidden variables that lurk or stand behind an analysis but may have an important influence on the relationship studied
- Finding an association between two variables
- Scatterplots:
  o Direction:
    ▪ Upper left to lower right = negative
    ▪ Lower left to upper right = positive
  o Form:
    ▪ Association is linear if the points stretch out in a generally consistent, straight form
    ▪ Can be curved gently
    ▪ If it curves sharply up then down → multiple regression methods
  o Strength
    ▪ Clustered in a single stream (straight, curved, bending all over) or spread out like a cloud
    ▪ Is there a trend or pattern?
  o Outlier
    ▪ Stands away from the overall pattern of the scatterplot
      • Clusters or subgroups that stand away from the rest of the plot should raise questions about why they are different (e.g. splitting primates and carnivores)
- Variable of interest = response variable (y-axis) → dependent variable
- Predictor variable = explanatory variable (x-axis) → independent variable
- Strength of correlation = r → -1 ≤ r ≤ 1
  o Changing the units does not change the shape of the pattern
  o Standardizing each variable → each point is plotted as a pair of z-scores (Zx, Zy) rather than the original values → the plot may look steeper, but that is because the x and y axes now share the same z-score scale
  o Positive association → z-scores of x and y tend to have the same sign (+/+ or -/-)
  o Negative association → z-scores of x and y tend to have opposite signs (+/-)
  o The correlation formula summarizes the direction and strength of the association (pg 175): r = sum(Zx × Zy) / (n - 1) → dividing by n - 1 accounts for the fact that the size of the sum gets bigger the more data we have
- To use correlation, check:
  o Quantitative Variables Condition: correlation only applies to quantitative variables
  o Straight Enough Condition: correlation measures the strength only of a LINEAR ASSOCIATION
  o Outlier Condition: correlation is not resistant to outliers (report the correlation with and without the outlier)
- Correlation:
  o Between -1 and 1
  o Has no units
  o Treats x and y symmetrically
  o Not affected by changes in the centre or scale of either variable → depends only on z-scores
  o Measures the strength of LINEAR association; variables can be strongly associated but still have a small correlation if the association is not linear
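
The correlation points above can be checked with a short Python sketch. It assumes r is computed as the sum of the products of the z-scores divided by n - 1, which is the usual textbook form (the notes only point to pg 175 for the formula). The data values and the helper names z_scores and correlation are made up for illustration and are not from the notes.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize: shift by the mean, then rescale by the sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """Assumed formula: r = sum(z_x * z_y) / (n - 1)."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

r = correlation(x, y)

# Shifting and rescaling x (a change of units) should leave r unchanged,
# because r depends only on the z-scores.
x_rescaled = [2.54 * v + 10 for v in x]
r_rescaled = correlation(x_rescaled, y)

# Correlation treats x and y symmetrically.
r_swapped = correlation(y, x)

# A strong but non-linear association (y = x^2 on symmetric x values)
# can still have a correlation near zero.
x_sym = [-3, -2, -1, 0, 1, 2, 3]
y_curved = [v ** 2 for v in x_sym]
r_curved = correlation(x_sym, y_curved)

print(round(r, 3), round(r_rescaled, 3), round(r_swapped, 3), round(r_curved, 3))
```

The first three printed values should agree with each other, while the last should be essentially zero even though y_curved is completely determined by x_sym. This is why the Straight Enough Condition needs to be checked on a scatterplot before reporting r.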