University of Toronto Scarborough
Chapter 1-14

Statistics

STAB22H3

Mahinda Samarakoon

Fall

Chap.1 – What is Statistics
- Stats: a way of reasoning, along with a collection of tools and methods, designed to help us understand the world
- statistics: particular calculations made from data
- data: values with context
- Variation: data vary because what we measure/see is measured imperfectly (the data we look at/base our info on = an imperfect picture of the world)
Chap.2 – Understanding Data
- Context: the 5 W’s and H
- Who: (leftmost column of data tables) Case: an individual about whom or which we have data (usually a sample of cases from a larger population)
- What: the characteristics recorded for each individual = variables (each variable holds information about the same characteristic for many cases)
o Units: tell us how each value is measured, i.e. how much of something we have or how far apart two values are
- Why: why the data are needed
o Categorical variable: a variable that names categories (in words or numerals)
Identifier variables: variables with exactly one individual in each category
o Quantitative variable: a variable whose numbers act as numerical values and have units
- Where: data vary depending on place; When: data vary depending on time; How: how the data are collected / how many variables
Chap.3 – Categorical Variables
- Frequency tables (relative frequency tables use percentages): list the categories and give the count (or percentage) of observations for each category.
- The distribution of a variable gives: 1. the possible values of the variable 2. the relative frequency of each value
- Area Principle: the area occupied by a part of the graph should correspond to the magnitude of the value it represents (the Titanic ship picture = a bad example of displaying data)
- Bar graph
o More flexible and easier to read
o Shows the counts for each category next to each other for easy comparison
o To draw attention to relative proportions, use a relative frequency bar chart
o Segmented bar graphs: treat one bar as the “whole” and divide it into proportional segments (whole bar = 100%)
- Pie graphs
o Lack a scale
o Must include all categories that make up a whole
o Cannot exceed 100% (data entry cannot fall under more than 1 category)
o Harder to make comparisons
- Contingency Tables
o Looks at two categorical variables together (two-way table)
o Reveals possible patterns in one variable that may be contingent on the category of the other
o Joint distribution: each cell entry divided by the total sample size (e.g. count of male binge drinkers / total count of both genders)
o Marginal distribution: the distribution of a single variable in a two-way table (e.g. total female count / total count of both genders); the proportions sum to 1
o Conditional distribution: condition on the value of one variable and calculate the distribution of the other variable (e.g. cell count for that gender / total sample size of that gender)
Useful for finding relationships because they show the distribution of one variable for just those cases that satisfy a condition on another variable
When the distribution of one variable is the same for all categories of another, we say that the variables are INDEPENDENT (the result does not DEPEND on the category) = no association between these variables
- Simpson’s Paradox: unfair averaging; relationships among proportions taken within different groups or subsets can appear to contradict relationships among the overall proportions.
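The joint, marginal, and conditional distributions above can be sketched in a few lines of Python. This is only an illustration: the counts, the group labels, and the helper names (`looks_independent`, `rate`, `overall`) are all made up for the example and do not come from the course material. The second half shows a small invented case of Simpson's Paradox, where A beats B inside every group but B beats A once the groups are pooled.

```python
# Hypothetical two-way table of counts (gender x drinking status),
# invented for illustration; these are not numbers from the notes.
counts = {
    "male": {"binge": 30, "non-binge": 20},
    "female": {"binge": 10, "non-binge": 40},
}
total = sum(sum(row.values()) for row in counts.values())

# Joint distribution: each cell count / total sample size.
joint = {(g, d): n / total for g, row in counts.items() for d, n in row.items()}

# Marginal distribution of gender: row total / total sample size.
marginal_gender = {g: sum(row.values()) / total for g, row in counts.items()}

# Conditional distribution of drinking status given gender:
# each cell count / the total for that gender only.
conditional = {
    g: {d: n / sum(row.values()) for d, n in row.items()}
    for g, row in counts.items()
}


def looks_independent(cond, tol=1e-9):
    """True if the conditional distribution is the same in every row."""
    rows = list(cond.values())
    return all(abs(r[k] - rows[0][k]) <= tol for r in rows for k in rows[0])


# Simpson's Paradox with made-up (successes, attempts) pairs per group.
a = {"day": (90, 100), "night": (500, 1000)}
b = {"day": (800, 1000), "night": (40, 100)}


def rate(successes, attempts):
    return successes / attempts


def overall(groups):
    """Pooled success rate across all groups."""
    return sum(s for s, _ in groups.values()) / sum(n for _, n in groups.values())


a_wins_each_group = all(rate(*a[g]) > rate(*b[g]) for g in a)
b_wins_overall = overall(b) > overall(a)
```

In this made-up table the conditional distributions differ by gender, so the two variables would not be called independent. In the paradox example the reversal happens because A has most of its attempts in the harder "night" group.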
Chap.4 – Quantitative Variables
- Histograms
o Slices up all possible values of a quantitative variable into equal-width bins; the count of values falling into each bin gives the height of its bar
o Gaps in a histogram = ranges with no data values (unlike a bar graph, where gaps only separate categories)
o Relative frequency histogram = y-axis in %
o Don’t show the data values themselves
- Stem-and-Leaf Displays
o Is like a histogram, but shows individual values
o Values may be truncated (extra digits dropped) or rounded to fit the display
o Can split data of same stem into two lines
- Dotplots
o A simple display that puts dots along an axis for each case in the data
o Good for small data set
o Count on x-axis
- Check the Quantitative Data Condition: the data are values of a quantitative variable whose units are known
- Distribution: Shape, Spread, Centre
- Shape:
o # of modes: unimodal, bimodal, multimodal, uniform
o Symmetric? Tails? Skewed to the left (tail on the left) = negatively skewed; skewed to the right (tail on the right) = positively skewed
o Outliers
- Centre:
o Mean (see pg 57): the balancing point
Not a resistant measure of centre: it is pulled by outliers
Trimmed mean: trim off the biggest and smallest 5-10% of values before averaging (makes the mean a more resistant measure)
o Median: the midpoint of the distribution; half the observations are smaller, half are bigger
Resistant measure of centre: resistant to outliers
o Symmetric: mean and median are approximately the same
o Skewed to the right: median < mean
o Skewed to the left: median > mean
- Spread
o Range = max-min difference between the extremes
Sensitive to outliers
o Interquartile Range (IQR) = Q3 - Q1 (the measure of spread that goes with the median)
Ignores extremes and concentrates on the middle half of the data
Q1, M, Q3
Q1 = 25th percentile (Q1 falls above 25% of the data)
Median = 50th percentile
Q3 = 75th percentile
Resistant measure: resistant to outliers
Ignores how individual values vary
o Standard Deviation (s) (the measure of spread that goes with the mean)
Takes into account how far each data value is from the mean
Appropriate ONLY for roughly symmetric data
s^2 is the variance
For roughly bell-shaped data: within 1s of the mean = ~68% of data, within 2s = ~95%, within 3s = ~99.7%
- VARIATION
o Measures of spread help us to be precise about what we don’t know
o Small spread = data close to centre
o Spread = 0 when all values are the same
- 5 Number Summary: min, Q1, M, Q3, max (helps us see if the distribution is left or right skewed too!)
- For unimodal symmetric data, the IQR is usually a bit larger than the standard deviation
- Split the data into separate groups if there are multiple modes (AND if you can explain the reason for the separate modes)
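A small Python sketch can make the "resistant vs. not resistant" idea concrete. The data set is invented (the 40 is a deliberate outlier), and the helper names `quartiles`, `five_number_summary`, and `trimmed_mean` are hypothetical. Quartiles are computed here as the medians of the lower and upper halves, which is one common convention; software packages may use slightly different rules.

```python
import statistics


def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (one convention)."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    # With an odd count, the median itself is left out of both halves.
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return statistics.median(lower), statistics.median(upper)


def five_number_summary(values):
    q1, q3 = quartiles(values)
    return min(values), q1, statistics.median(values), q3, max(values)


def trimmed_mean(values, fraction=0.10):
    """Mean after dropping `fraction` of the values from each end."""
    s = sorted(values)
    k = int(len(s) * fraction)
    return statistics.mean(s[k:len(s) - k])


# Hypothetical data; the last value is a deliberate outlier.
data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 40]
no_outlier = data[:-1]

mean_with, mean_without = statistics.mean(data), statistics.mean(no_outlier)
median_with, median_without = statistics.median(data), statistics.median(no_outlier)

q1, q3 = quartiles(data)
iqr = q3 - q1                       # resistant measure of spread
data_range = max(data) - min(data)  # sensitive to the outlier
sd_with = statistics.stdev(data)    # sample SD (divides by n - 1)
sd_without = statistics.stdev(no_outlier)
```

With this data the outlier moves the mean from 4 to 7.6 while the median stays at 4, and the standard deviation grows far more than the IQR does.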
Chap.5 - Outliers
- Investigating suspected outliers (values beyond either fence are suspected outliers)
o Upper fence = Q3 + 1.5 IQR
o Lower fence = Q1 - 1.5 IQR
o If the median is roughly centred between the quartiles, then the middle half of the data is roughly symmetric
- Compare boxplots through shape, centre, and spread, and check for outliers
- Outliers:
o Understand the context of why there is an outlier (error, special scenario etc.)
o If outlier is a correct value, analyze data with and without the outlier
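The 1.5 IQR rule above can be sketched as follows. The data and the function names (`quartiles`, `outlier_fences`, `suspected_outliers`) are hypothetical, and the quartile convention (medians of the lower and upper halves) is an assumption, since the notes do not say which rule is used.

```python
import statistics


def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (assumed convention)."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return statistics.median(lower), statistics.median(upper)


def outlier_fences(values):
    """Return (lower fence, upper fence) = (Q1 - 1.5 IQR, Q3 + 1.5 IQR)."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def suspected_outliers(values):
    """Values lying strictly outside the fences."""
    low, high = outlier_fences(values)
    return [v for v in values if v < low or v > high]


# Hypothetical data; 40 is a deliberate outlier.
data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 40]
fences = outlier_fences(data)
flagged = suspected_outliers(data)
```

A flagged value is only a suspect: as the notes say, check the context, and if the value is correct, analyze the data both with and without it.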
Chap.6
- To compare things measured on different scales, we must STANDARDISE the values, using the mean and standard deviation as the base
- Standardized values (pg 122 for formula): z = (value - mean) / standard deviation
o denoted as “z” or z-scores
o z-scores have no units
o measure only the distance of each data value from the mean, in standard deviations
o negative when below the mean, positive when above the mean
o Farther away from the mean = more unusual the value is
** for running times, negative z-scores are better (a lower time is faster)
- Linear transformations of Data
o Shifting
Adding a constant to each observation adds that constant to the max, min, quartiles, and centres; THE SHAPE AND SPREAD DON’T CHANGE
o Rescaling
Multiplying each observation by a constant multiplies the max, min, quartiles, centres, and measures of spread by that constant, BUT THE SHAPE STAYS THE SAME
o Linear transformations do not change the shape of the distribution; they change the measures of centre and spread in a predictable manner
o NON-linear transformations of data (logs, square roots etc.) change shape, centre, and spread in an unpredictable manner
- Z-scores
o Standardizing into z-scores = shifting the data by the mean and rescaling by the standard deviation
o Standardizing into z-scores does NOT change the shape of the distribution of the variable
o Eliminates units
o Changes the centre by making the mean = 0
o Changes the spread by making the SD = 1
- Density curves: a model for the frequency distribution of data, using area under the curve to represent relative frequency
o Curve height must be positive or zero
o Total area under the curve = 1
o Describes the overall pattern of a distribution
o An idealized description of the frequency distribution
- Normal Model: unimodal, symmetric (bell-shaped) distribution models
- Standard Normal Model = N(0,1), i.e. mean 0 and SD 1 (refer to page 146); the data MUST be well described by a Normal model
o Check the Nearly Normal Condition: must check that the distribution is unimodal and nearly symmetric (make a histogram or Normal Probability Plot)
Normal Probability Plot: data are roughly Normal = the plot is roughly a diagonal straight line (deviations from a straight line = the distribution is not Normal)
IMPORTANT: NEED TO BE ABLE TO DRAW THE LEFT AND RIGHT SKEWED VERSIONS
• Smiling curve = right skewed
• Frowning curve = left skewed
z-scores on the x-axis, variable values on the y-axis (done with technology)
o Normal models ARE ASSUMPTIONS/IDEALIZED
- 68-95-99.7 Rule of Normal Models
o Within 1 SD of the mean = ~68% of values
o Within 2 SD of the mean = ~95% of values
o Within 3 SD of the mean = ~99.7% of values
- Looking for Normal Percentiles (when a value doesn’t fall exactly 1, 2, or 3 SDs from the mean)
o From values to percentiles: convert the data value to a z-score, then find the percentile in the table (pg 135)
o From percentiles to z-scores
Look at the table first to find the closest percentile, then read off the corresponding z-score
DON’T ROUND RESULTS IN THE MIDDLE OF CALCULATIONS
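The ideas in this chapter (z-scores, shifting and rescaling, the 68-95-99.7 rule, and Normal percentiles) can be checked with Python's standard-library `statistics.NormalDist`, used here as a stand-in for the printed table. The data values and the N(500, 100) model are invented for illustration, and `z_scores` is a hypothetical helper name.

```python
from statistics import NormalDist, mean, stdev


def z_scores(values):
    """Standardize: shift by the mean, rescale by the (sample) SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical data set.
data = [10.0, 12.0, 14.0, 16.0, 18.0]
z = z_scores(data)                 # mean becomes 0, SD becomes 1

shifted = [v + 5 for v in data]    # centre moves by 5, spread unchanged
rescaled = [v * 2 for v in data]   # centre and spread are both doubled

# 68-95-99.7 rule, read off the standard Normal model N(0, 1).
std_normal = NormalDist(mu=0, sigma=1)
within_1 = std_normal.cdf(1) - std_normal.cdf(-1)
within_2 = std_normal.cdf(2) - std_normal.cdf(-2)
within_3 = std_normal.cdf(3) - std_normal.cdf(-3)

# Value -> z-score -> percentile, for an assumed model N(500, 100).
model = NormalDist(mu=500, sigma=100)
z_600 = (600 - model.mean) / model.stdev   # 1 SD above the mean
percentile_600 = std_normal.cdf(z_600)     # fraction of values below 600

# Percentile -> z-score (the reverse table lookup).
z_at_90th = std_normal.inv_cdf(0.90)
```

Working through `cdf` and `inv_cdf` keeps full precision at every step, which is the software version of "don't round results in the middle of calculations".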
Chap.7
- relationship between two quantitative variables (scatterplot)
- Lurking Variables : hidden variables that lurk or stand behind an analysis but may have an important influence on
the relationship studied
- Finding association between two variables
- Scatterplots:
o Direction:
upper left to lower right = negative
lower left to upper right = positive
o Form:
Association is linear if the points stretch out in a generally consistent, straight form
Can be gently curved
If it curves sharply up and then down, a single straight line will not describe it (multiple regression methods)
o Strength
Strong: points tightly clustered in a single stream (whether straight, curved, or bending all over)
Weak: points form a vague cloud with barely any trend or pattern
o Outlier
Stands away from the overall pattern of the scatterplot
• Clusters or subgroups that stand away from the rest of the plot should raise questions
about why they are different (e.g. splitting primates and carnivores)
- Variable of interest = response variable (y-axis) = dependent variable
- Predictor variable = explanatory variable (x-axis) = independent variable
- Strength of correlation = r, where -1 ≤ r ≤ 1 (r = 0 means no linear association)
o Changing the units does not change the shape of the pattern
o Standardizing each variable: each point becomes a pair of z-scores (Zx, Zy) rather than raw values
May look steeper, but that is because the x and y axes now share the same z-score scale
o Positive association: Zx and Zy tend to have the same sign (+/+ or -/-)
o Negative association: Zx and Zy tend to have opposite signs (+/-)
o Correlation formula summarizes the direction and strength of the association (pg 175): r = sum of (Zx × Zy) / (n-1)
Dividing by n-1 accounts for the fact that the size of the sum gets bigger the more data we have
- Conditions for using correlation:
o Quantitative Variables Condition: correlation only applies to quantitative variables
o Straight Enough Condition: the correlation measures the strength only when it is a LINEAR ASSOCIATION
o Outlier Condition: correlation is not resistant to outliers (report the correlation with and without the outlier)
- Correlation:
o Between -1 and 1
o Has no units
o Correlation treats x and y symmetrically
o Correlation is not affected by changes in the centre or scale of either variable; it depends only on the z-scores
o Measures strength of LINEAR association; variables can be strongly associated but still have a small correlation if the association is not linear
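The correlation properties listed above can be verified with a short Python sketch that computes r directly from z-scores, as the sum of Zx × Zy divided by n-1. The data values are invented and `correlation` is a hypothetical helper written for this example, not a function from the course or textbook.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """r = sum(zx * zy) / (n - 1), using sample standard deviations."""
    n = len(xs)
    mx, sx = mean(xs), stdev(xs)
    my, sy = mean(ys), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical, nearly linear data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
r = correlation(x, y)

# Changing units (rescaling x) or shifting y leaves r unchanged.
x_rescaled = [v * 100 for v in x]
y_shifted = [v + 7 for v in y]
r_transformed = correlation(x_rescaled, y_shifted)

# Correlation treats x and y symmetrically.
r_swapped = correlation(y, x)

# A perfect but NON-linear association: y = x^2 on a symmetric range.
curve_x = [-2.0, -1.0, 0.0, 1.0, 2.0]
curve_y = [v * v for v in curve_x]
r_curve = correlation(curve_x, curve_y)
```

In the last case y is completely determined by x, yet r comes out as zero, which is why the Straight Enough Condition has to be checked with a scatterplot before r is interpreted.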
