Chapter 1-14


Department: Statistics
Course: STAB22H3
Professor: Mahinda Samarakoon
Semester: Fall

Chap. 1 – What Is Statistics
- Stats: a way of reasoning, along with a collection of tools and methods, designed to help us understand the world
- Statistics: particular calculations made from data
- Data: values with context
- Variation: data vary because whatever we measure or observe is measured imperfectly (the data we look at give an imperfect picture of the world)

Chap. 2 – Understanding Data
- Context: the 5 W's and H
- Who: the cases (usually the leftmost column of a data table); a case is an individual about whom or which we have data (usually a sample of cases from a larger population)
- What: the variables, the characteristics recorded for each individual (each variable holds information about the same characteristic for many cases)
  o Units tell us how much of something we have or how far apart two values are, i.e. how each value is measured
  o Categorical variable: a variable that names categories (in words or numerals)
    ▪ Identifier variables: categorical variables with exactly one individual in each category
  o Quantitative variable: a variable whose numbers act as numerical values and carry units
- Why: the reason the data are needed
- Where: data vary depending on place; When: data vary depending on time; How: how the data were collected and how many variables were recorded

Chap. 3 – Categorical Variables
- Frequency tables (relative frequency tables use percentages) list the categories and give the count (or percentage) of observations for each category
- The distribution of a variable gives: 1. the possible values of the variable, 2. the relative frequency of each value
- Area Principle: the area occupied by a part of the graph should correspond to the magnitude of the value it represents (the Titanic ship graphic = a bad example of displaying data)
- Bar graphs
  o More flexible and easier to read
  o Show the counts for each category next to each other for easy comparison
  o A relative frequency bar chart draws attention to the relative proportions
  o Segmented bar graphs divide each bar, treated as a whole (100%), into proportionally sized segments
- Pie graphs
  o Lack a scale
  o Must include all categories that make up a whole; the slices cannot exceed 100% (a data entry cannot fall under more than one category)
  o Harder to make comparisons with
- Contingency tables
  o Look at two categorical variables together (a two-way table)
  o Reveal possible patterns in one variable that may be contingent on the category of the other
  o Joint distribution: divide each cell entry by the total sample size (e.g. male binge drinkers / total count of males and females)
  o Marginal distribution: examine a single variable in the two-way table using its totals (e.g. total females / total count of males and females); the proportions sum to 1
  o Conditional distribution: condition on the value of one variable and calculate the distribution of the other (e.g. binge drinkers of one gender / total for that gender); see the sketch after these Chap. 3 notes
    ▪ Useful for finding relationships, because they show the distribution of one variable for just those cases that satisfy a condition on the other variable
    ▪ When the distribution of one variable is the same for all categories of another, we say the variables are INDEPENDENT (the result does not DEPEND on the category): no association between the variables
- Simpson's Paradox: unfair averaging; relationships among proportions taken within different groups or subsets can appear to contradict the relationships among the overall proportions
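The joint, marginal, and conditional distributions above are just different ways of dividing the same cell counts. Below is a minimal Python sketch using a small gender/binge-drinking table; the counts are made up for illustration, not the textbook's data.

```python
# Minimal sketch: joint, marginal, and conditional distributions from a
# two-way (contingency) table. The counts below are hypothetical.
counts = {
    ("male", "binge"): 980,   ("male", "no binge"): 3120,
    ("female", "binge"): 790, ("female", "no binge"): 4110,
}
total = sum(counts.values())

# Joint distribution: each cell divided by the grand total.
joint = {cell: n / total for cell, n in counts.items()}

# Marginal distribution of gender: each gender's total divided by the grand total.
genders = ("male", "female")
marginal_gender = {
    g: sum(n for (gg, _), n in counts.items() if gg == g) / total
    for g in genders
}

# Conditional distribution of drinking status given gender:
# cell count divided by that gender's total, not the grand total.
conditional = {}
for g in genders:
    g_total = sum(n for (gg, _), n in counts.items() if gg == g)
    for (gg, status), n in counts.items():
        if gg == g:
            conditional[(g, status)] = n / g_total

# If the conditional distributions were the same for every gender,
# the two variables would be independent (no association).
print(joint)
print(marginal_gender)
print(conditional)
```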
Chap. 4 – Quantitative Variables
- Histograms
  o The distribution of a quantitative variable is sliced into bins of equal width, and the count falling into each bin is plotted
  o Gaps in a histogram indicate regions with no data (unlike the gaps between bars in a bar graph)
  o Relative frequency histogram: the y-axis shows percentages rather than counts
  o Histograms don't show the data values themselves
- Stem-and-Leaf Displays
  o Like a histogram, but shows the individual values
  o Leaves may be truncated (extra digits cut off) rather than rounded
  o Data with the same stem can be split onto two lines
- Dotplots
  o A simple display that puts a dot along an axis for each case in the data
  o Good for small data sets
  o Count on the x-axis (for a horizontal dotplot)
- Check the Quantitative Data Condition: the data are values of a quantitative variable whose units are known
- Describe a distribution by its Shape, Centre, and Spread
- Shape:
  o Number of modes: unimodal, bimodal, multimodal, or uniform
  o Symmetry and tails: skewed to the left (tail on the left) = negatively skewed; skewed to the right (tail on the right) = positively skewed
  o Outliers
- Centre:
  o Mean (see pg 57): the balancing point
    ▪ Not a resistant measure of centre (not resistant to outliers)
    ▪ Trimmed mean: trim 5-10% off the biggest and smallest values before averaging (makes the mean a more resistant measure)
  o Median: the midpoint of the distribution; half the observations are smaller, half are bigger
    ▪ A resistant measure of centre (resistant to outliers)
  o Symmetric distribution: the mean and median are essentially the same
  o Skewed to the right: median < mean
  o Skewed to the left: median > mean
- Spread
  o Range = max − min, the difference between the extremes
    ▪ Sensitive to outliers
  o Interquartile Range: IQR = Q3 − Q1 (reported alongside the median)
    ▪ Ignores the extremes and concentrates on the middle of the data (Q1 – M – Q3)
    ▪ Q1 = 25th percentile (25% of the data fall below it); Median = 50th percentile; Q3 = 75th percentile
    ▪ A resistant measure (resistant to outliers)
    ▪ Ignores how individual values vary
  o Standard deviation (s), reported alongside the mean: measures how far the data values are from the mean
    ▪ Appropriate ONLY for symmetric data
    ▪ s^2 is the variance
    ▪ About 68% of the data fall within 1s of the mean, ~95% within 2s, ~99.7% within 3s
- Variance and spread in general
  o Measures of spread help us to be precise about what we don't know
  o Small spread = data close to the centre
  o Spread = 0 when all the values are the same
- 5-Number Summary: min – Q1 – M – Q3 – max (also helps us see whether the data are left or right skewed)
- For unimodal symmetric data, the IQR is usually a bit larger than the standard deviation
- Split the data into separate groups if there are multiple modes (AND you can explain the reason for the separate modes)

Chap. 5 – Outliers
- Investigating suspected outliers (a short computational sketch follows these Chap. 5 notes):
  o Values above Q3 + 1.5·IQR or below Q1 − 1.5·IQR are suspected outliers
  o If the median is roughly centred between the quartiles, the middle half of the data is roughly symmetric
- Compare boxplots by shape, centre, and spread, and check for outliers
- Outliers:
  o Understand the context of why there is an outlier (error, special scenario, etc.)
  o If the outlier is a correct value, analyze the data with and without it
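As a quick illustration of the measures above and the 1.5·IQR fences, here is a minimal Python sketch on a small made-up data set (the values are hypothetical). Note that statistics.quantiles uses one particular quartile convention, so its Q1 and Q3 may differ slightly from hand calculations or the textbook's method.

```python
# Minimal sketch: centre, spread, the 5-number summary, and the 1.5*IQR
# outlier fences for a small, made-up data set.
import statistics

data = [2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 23]   # 23 looks like a suspected outlier

mean = statistics.mean(data)
median = statistics.median(data)
s = statistics.stdev(data)                   # sample standard deviation; s**2 is the variance

# quantiles(..., n=4) returns the three cut points [Q1, median, Q3].
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

five_number = (min(data), q1, median, q3, max(data))   # min - Q1 - M - Q3 - max

# Suspected outliers lie below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
suspected = [x for x in data if x < low_fence or x > high_fence]

print("5-number summary:", five_number)
print("mean:", round(mean, 2), " s:", round(s, 2))
print("fences:", (low_fence, high_fence), " suspected outliers:", suspected)
```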
Chap. 6
- To compare things measured on different scales, we must STANDARDIZE the results, using the mean and standard deviation as the base values
- Standardized values (pg 122 for the formula)
  o Denoted as "z", or z-scores
  o z-scores have no units
  o A z-score measures only the distance of a data value from the mean, in standard deviations
  o Negative when below the mean, positive when above the mean
  o The farther a value is from the mean, the more unusual it is
    ▪ (For running times, negative z-scores are better, because smaller times are better)
- Linear transformations of data
  o Shifting: adding a constant to each observation
    ▪ Affects only the max, min, quartiles, and measures of centre, BECAUSE THE SHAPE AND SPREAD DON'T CHANGE
  o Rescaling: multiplying each observation by a constant
    ▪ Multiplies the max, min, quartiles, measures of centre, and measures of spread, BUT THE SHAPE STAYS THE SAME
  o Linear transformations do not change the shape of the distribution, and they change the measures of centre and spread in a predictable manner
  o NON-linear transformations (logs, square roots, etc.) change the shape, centre, and spread in an unpredictable manner
- Z-scores
  o Standardizing into z-scores = shifting by the mean and rescaling by the standard deviation
  o Standardizing into z-scores does NOT change the shape of the distribution of the variable
  o Eliminates the units
  o Changes the centre by making the mean = 0
  o Changes the spread by making the SD = 1
- Density curves: a model for the frequency distribution of data that uses area under the curve to represent relative frequency
  o Must be positive or zero everywhere
  o Total area under the curve = 1
  o Describes the overall pattern of a distribution
  o e.g. the idealized Normal frequency distribution
- Normal models: unimodal, symmetric distribution models
- Standard Normal Model = N(0, 1) (refer to pg 146); this applies only if the data follow a Normal model
  o Check the Nearly Normal Condition: the distribution must be unimodal and nearly symmetric (make a histogram or a Normal probability plot)
    ▪ Normal probability plot: if the data are roughly Normal, the plot is roughly a diagonal straight line (deviations from the straight line mean the distribution is not Normal)
    ▪ Know how to sketch the left- and right-skewed versions:
      • Smiling (concave-up) curve = right skewed
      • Frowning (concave-down) curve = left skewed
    ▪ z-scores on the x-axis, the variable's values on the y-axis (done with technology)
  o Normal models are idealized assumptions
- 68-95-99.7 Rule for Normal models
  o About 68% of values fall within 1 SD of the mean
  o About 95% within 2 SD
  o About 99.7% within 3 SD
- Finding Normal percentiles (when a value doesn't fall exactly a whole number of SDs from the mean); a computational sketch follows these Chap. 6 notes
  o Convert the data value to a z-score, then find its percentile in the table (pg 135)
  o From percentiles to z-scores: find the closest percentile in the table, then read off the corresponding z-score
  o DON'T ROUND RESULTS IN THE MIDDLE OF CALCULATIONS
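Below is a minimal Python sketch of the standardizing and percentile steps, assuming a hypothetical N(70, 8) model for exam scores (the numbers are made up). It uses math.erf to evaluate the standard Normal CDF instead of a printed table; z_score and normal_cdf are illustrative helper names, not textbook functions.

```python
# Minimal sketch: z-scores and Normal percentiles without a printed table.
import math

def z_score(x, mu, sigma):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mu) / sigma

def normal_cdf(z):
    """P(Z <= z) under the Standard Normal model N(0, 1)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical example: scores modelled as N(mu=70, sigma=8). What percentile is 82?
z = z_score(82, 70, 8)            # z = 1.5, i.e. 1.5 SDs above the mean
pct = normal_cdf(z)               # about 0.933, roughly the 93rd percentile

# 68-95-99.7 check: fraction of a Normal model within 1, 2, and 3 SDs of the mean.
for k in (1, 2, 3):
    print(f"within {k} SD: {normal_cdf(k) - normal_cdf(-k):.4f}")
print("z =", z, " percentile =", round(pct, 4))
```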
Chap. 7
- Looks at the relationship between two quantitative variables (scatterplots)
- Lurking variables: hidden variables that stand behind an analysis but may have an important influence on the relationship being studied
- Goal: finding association between two variables
- Scatterplots:
  o Direction: upper left to lower right = negative; lower left to upper right = positive
  o Form: the association is linear if the points stretch out in a generally consistent, straight form; it can also be gently curved; if it curves sharply up and then down, other (multiple regression) methods are needed
  o Strength: how tightly the points cluster in a single stream (straight, curved, or bending all over, versus a loose cloud); is there a trend or pattern?
  o Outliers: points that stand away from the overall pattern of the scatterplot
    ▪ Clusters or subgroups that stand away from the rest of the plot should raise questions about why they are different (e.g. splitting primates and carnivores)
- Variable of interest = response variable (y-axis), the dependent variable
- Predictor variable = explanatory variable (x-axis), the independent variable
- Strength of a linear association is measured by the correlation r, where −1 ≤ r ≤ 1
  o Changing the units does not change the shape of the pattern
  o Standardizing each variable turns each point into a pair of z-scores (zx, zy) rather than the original values
    ▪ The plot may look steeper, but only because both axes are now on the same z-score scale
  o Positive association: zx and zy tend to have the same sign (+/+); negative association: opposite signs (+/−)
  o The correlation formula summarizes the direction and strength of the association (pg 175); dividing by n − 1 reflects the fact that the size of the sum grows as we collect more data (a computational sketch follows these notes)
- Conditions for using correlation:
  o Quantitative Variables Condition: correlation applies only to quantitative variables
  o Straight Enough Condition: correlation measures strength only for a LINEAR association
  o Outlier Condition: correlation is not resistant to outliers (report the correlation with and without the outlier)
- Properties of correlation:
  o Always between −1 and 1
  o Has no units
  o Treats x and y symmetrically
  o Not affected by changes in the centre or scale of either variable (it depends only on the z-scores)
  o Measures the strength of LINEAR association only; variables can be strongly associated but still have a small correlation if the association is not linear
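To make the z-score description of correlation concrete, here is a minimal Python sketch computing r as the sum of zx·zy products divided by n − 1 on a small made-up (x, y) data set; the correlation() helper name and the data values are illustrative only.

```python
# Minimal sketch: correlation as the sum of z-score products divided by n - 1.
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]          # made-up, roughly linear, positive association

def correlation(xs, ys):
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    zx = [(v - mx) / sx for v in xs]    # standardize x
    zy = [(v - my) / sy for v in ys]    # standardize y
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

r = correlation(x, y)
print(round(r, 4))                      # close to +1: strong, positive, linear association
```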