University of Toronto St. George
Midterm, by OneClass838

Economics

ECO220Y1

Jennifer Murdock

Fall

ECO220 Notes
Lecture 1: Sampling Errors & Non-Sampling Errors
Goal → to make inferences about population parameters from sample statistics
Probability: foundation for statistics
Statistics: descriptive and inferential
o Descriptive → describes what happened (ex. Class avg)
o Inferential → draws conclusions from data that are not 100% sure
o Describes sample (data) using statistics
o Make inferences about population and its parameters using observed data (sample)
Population = set of all items of interest (ex. All students @ UofT for evaluations)
Parameter = descriptive measure of a population (something that describes the
population, ex. What %/fraction of the population)
Sample = subset of the population (ex. Small group)
Statistic = descriptive measure of a sample (ex. Of 200 in sample, response __)
Sampling Error (‘white noise’, ‘sample noise’, ‘sampling variability’) = the purely random differences
between a sample and the population that arise b/c the sample is a random subset of the population
As sample size gets larger the sampling error tends to get smaller
Ex. Pick 200 out of 60,000; two samples could come out extremely different (due to chance)
o Not wrong b/c it is a random sample; nothing is wrong with the survey itself
Size of sample determines the size of sampling error
o Larger samples = less sampling error
Example: bag of m&m, choose using a spoon, n = # of m&m in the spoonful, y = % of yellow in n
n     y
4     0/4 = 0%
13    2/13 = 15.4%
Population: whole bag of m&m (all)
Parameter: what %/portion are yellow
Sampling was done 2 times
Sample statistic → % yellow
Law of Large Numbers
Larger samples = smaller sampling errors
o Sampling error decreases as n (sample size) increases
o There is no law of small numbers (the supposed ‘law’ = small samples represent the population)
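The shrinking of sampling error with n can be checked with a small simulation. This is only a sketch: the bag size (10,000), the true share of yellow (20%), the seed, and the function names are all assumptions made for illustration, not values from the lecture.

```python
import random

def sample_proportion(population, n, rng):
    """Draw a simple random sample of size n and return the share of 'yellow'."""
    sample = rng.sample(population, n)
    return sample.count("yellow") / n

def mean_abs_error(population, n, trials, rng):
    """Average |sample proportion - population proportion| over repeated samples."""
    true_p = population.count("yellow") / len(population)
    errors = [abs(sample_proportion(population, n, rng) - true_p) for _ in range(trials)]
    return sum(errors) / trials

rng = random.Random(220)  # fixed seed so the run is repeatable
# Hypothetical bag: 10,000 candies, 20% yellow (the parameter, normally unknown).
bag = ["yellow"] * 2000 + ["other"] * 8000

# Spoonfuls of 4 and 13 match the example above; 100 and 1000 show the trend.
for n in (4, 13, 100, 1000):
    print(n, round(mean_abs_error(bag, n, trials=500, rng=rng), 4))
```

The printed average error should fall as n grows, while any single small sample (like 0/4) can still land far from the parameter.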
Example: movie on demand, should a rural company offer a new channel? Randomly select 100 ppl, ask 2
different questions
Population: the company's rural customer base
Sample: 100 customers
Sample statistics: mean = 2.3, proportion = 0.45
Population parameters not known exactly b/c of sampling error
The Types of Information
Variables = characteristic recorded about each individual or case (types of info)
Quantitative = numerical measurements of a quantity or amount
o Ex. a 10% decrease in prices will lead to a 20% increase in QD
Qualitative = some assessment of quality or kind
o Ex. An increase in price tends to lead to a decrease in QD
Identifier variable = unique code for each product/customer
Data
Rows of data table correspond to individual cases
People who answer a survey = respondents, people experimented on =
subjects/participants/experimental units
# of observations = sample size
‘Data’ is plural (‘these data are flawed’); ‘datum’ is singular
3 Types of Data
Interval = numerical measurements, real numbers that are quantitative/numerical (ex. How
many marriages?)
Ordinal = ranking of categories (ex. How would you rank marital status?)
Nominal = un-ranked categories that are qualitative/categorical (only use names)
Hierarchy of Data
1. Interval - real numbers -> all calculations are valid
2. Ordinal - numbers must represent the ranked order -> calculations based on the ordering are valid
3. Nominal - arbitrary numbers that represent categories -> only calculations based on frequency are valid
3 Types of Data Sets
Cross-sectional = a snapshot of different units taken in the same time period
o Ex. Annual GDP for 2010 for 20 countries (20 observations)
Time Series = track something over time
o Stationary time series = without a strong trend or change in variability (then a
histogram can be used with the time series)
o Ex. Annual Canadian GDP from 2001 until 2010 (10 observations)
Panel (Longitudinal) = a cross-section of units where each is followed over time
o Ex. Annual GDP of 20 countries from 2001 until 2010 (200 observations)
Sampling
Stratified Sampling = a sampling design in which the population is divided into several
homogeneous subpopulations, or strata, and random samples are then drawn from each stratum
o Strata = subsets of a population that are internally homogeneous but may differ from one
another
Systematic Sampling = a sample drawn by selecting individuals systematically from a sampling
frame
Convenience Sampling = a sampling technique that selects individuals who are conveniently
available
o May not represent population
Cluster sampling = a sampling design in which groups, or clusters, representative of the population
are chosen at random and a census is then taken of each
Multistage sampling = sampling schemes that combine several sampling methods
Sample size determines what can be concluded from the data regardless of the size of the
population
Voluntary Response Sample
o Hard to define sample frame, doesn’t correspond to population
o Bias toward those with strong opinions (especially negative opinions)
Simple random sample (SRS) = a sample in which each set of n individuals in the population has
an equal chance of selection
Sample Frame = list of individuals from which the sample is drawn
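The four random designs above can be sketched in a few lines. The sampling frame here (1,000 students with a year and a tutorial section), the function names, and the sample sizes are hypothetical and chosen only to make the designs easy to compare.

```python
import random

def simple_random_sample(frame, n, rng):
    """SRS: every set of n individuals is equally likely."""
    return rng.sample(frame, n)

def systematic_sample(frame, step, rng):
    """Systematic: random starting point, then every step-th individual."""
    start = rng.randrange(step)
    return frame[start::step]

def stratified_sample(frame, stratum_of, n_per_stratum, rng):
    """Stratified: split the frame into strata, then take an SRS within each."""
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(frame, cluster_of, n_clusters, rng):
    """Cluster: pick whole clusters at random, then take a census of each."""
    clusters = {}
    for unit in frame:
        clusters.setdefault(cluster_of(unit), []).append(unit)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for c in chosen for unit in clusters[c]]

rng = random.Random(220)
# Hypothetical sampling frame: 1,000 students with a year (1-4) and a section (0-49).
frame = [{"id": i, "year": i % 4 + 1, "section": i % 50} for i in range(1000)]

srs = simple_random_sample(frame, 40, rng)
sys_s = systematic_sample(frame, 25, rng)
strat = stratified_sample(frame, lambda s: s["year"], 10, rng)
clus = cluster_sample(frame, lambda s: s["section"], 2, rng)
print(len(srs), len(sys_s), len(strat), len(clus))
```

All four samples have 40 students here, but only the stratified one is guaranteed to contain exactly 10 from each year, and the cluster sample covers just 2 sections.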
Sampling Vs. Non-sampling Errors
Sampling Error
o Pure chance (random) difference between sample & population (aka ‘white noise’)
o Random: no one can guess the outcome, but some underlying set of outcomes will be
equally likely
o It is impossible to match sample to population b/c too many characteristics to think of
and match
o Undercoverage = not all portions of population sampled (a systematic gap, so it causes bias
rather than pure-chance error)
Non-sampling Errors
o Systematic (not random) difference between sample & population
o Biased estimate = statistic is systematically higher or lower than the parameter
o Systematic errors in data collection:
Systematic lying (ex. Ppl over estimate income)
Poor survey instrument design (ex. Unclear)
o Non-response bias:
Low response rate and non-responders are non-random (ex. Selection)
o Sampling frame differs from target population
o Sampling variability = the sample-to-sample differences
Sample → used to calculate → Statistic → used to estimate → Parameter → tells us about → Population
Population Parameter = a numerically valued attribute of a model for a population (hope to estimate
from sample data)
Bias = any systematic failure of a sampling method to represent its population
Measurement error = intentional or unintentional inaccurate response to a survey question
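The difference between sampling error and a non-sampling error can be seen in a simulation. This is a rough sketch under made-up assumptions: incomes are coded 1 to 1000, and the flawed sampling frame is assumed to miss the top 200 incomes.

```python
import random
import statistics

rng = random.Random(220)
# Hypothetical population: incomes coded 1..1000, so the parameter (mean) is 500.5.
population = list(range(1, 1001))
# Hypothetical flawed sampling frame: the top 200 incomes are missing.
flawed_frame = population[:800]

def average_sample_mean(frame, n, trials):
    """Mean of the sample means over many repeated random samples of size n."""
    return statistics.mean(statistics.mean(rng.sample(frame, n)) for _ in range(trials))

for n in (25, 100, 400):
    good = average_sample_mean(population, n, trials=200)
    bad = average_sample_mean(flawed_frame, n, trials=200)
    print(n, round(good, 1), round(bad, 1))
```

Samples from the full population average out near 500.5, while samples from the flawed frame stay near 400.5 however large n gets: a bigger sample reduces sampling error but does not remove bias.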
Valid Survey
Know what you want to know
Use the right sampling frame
Ask Specific rather than general question
Watch for biases
o Nonresponse bias = bias introduced to a sample when a large fraction of those
sampled fail to respond
o Voluntary Response Bias
o Response Bias = tendency of respondents to tailor their responses to please the interviewer;
can also be a consequence of slanted question wording
Be careful with question phrasing
Be careful with answer phrasing
o Measurement errors
o Pilot Test = a small trial run of a study to check that the methods of the study are sound
Be sure you really want a representative sample
Lecture 2: Tabulations, Bar/pie Charts, Histograms & Centre
Describing 1 variable (with few unique values)
Bar charts = display the distribution of a categorical variable, showing the counts for each
category next to each other for easy comparison
Pie Charts = show the whole group of cases as a circle (great for ½, ¼, 1/8 comparisons)
Segmented/Stacked Bar Charts = a bar chart that treats each bar as the “whole” and divides it
proportionally into segments corresponding to the % in each group
Stem & Leaf Display = like a histogram but also gives individual values (but requires quantitative
data)
Tabulation = list all unique values in data & relative frequency (aka, frequency table, relative
frequency table)
o Basis of bar/pie chart
o Interval, ordinal or nominal data
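A tabulation is straightforward to build by counting. The sketch below uses ten made-up marital-status responses; the data and the function name `tabulate` are illustrative assumptions.

```python
from collections import Counter

def tabulate(values):
    """Return {value: (frequency, relative frequency)} for each unique value."""
    counts = Counter(values)
    n = len(values)
    return {v: (c, c / n) for v, c in sorted(counts.items())}

# Hypothetical nominal data: marital status of 10 respondents.
status = ["single", "married", "single", "divorced", "married",
          "single", "married", "single", "widowed", "single"]

for value, (freq, rel) in tabulate(status).items():
    print(f"{value:10s} {freq:3d} {rel:6.1%}")
```

The frequencies are the bar heights of a bar chart and the relative frequencies are the slice sizes of a pie chart.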
One variable with interval data
o Histograms = a graph that shows how the data are distributed
Frequency Histogram = Bar height measures number of observations in bin
Relative Frequency Histogram = Bar height measures fraction of observations in
bin relative to total number
Density Histogram = Bar area measures the fraction of observations in bin
relative to total number
o Classes (bins) = non-overlapping and equal sized intervals that cover the range
Number of bins selected changes the appearance of the histogram
Sturges’ formula: # of bins = 1 + 3.3*log10(n)
o Shape of Things **Review Lecture, slide 18-21**
Histograms give overview of a variable with a picture (can make informal
inferences about the shape of population)
Symmetric = split equally to left and right
Positively Skewed = long tail to the right (skewed to right)
Negatively Skewed = long tail to the left (skewed to left)
Modality = # of major peaks
Bell/Normal/Gaussian (means unimodal, symmetrical)
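The three histogram types differ only in how bar height is scaled. The sketch below bins 20 made-up exam marks; it assumes the log in Sturges' formula is base 10 and that the result is rounded up to a whole number of bins, which is one common convention rather than something stated in the notes.

```python
import math

def sturges_bins(n):
    """Sturges' rule of thumb, rounded up to a whole number of bins."""
    return math.ceil(1 + 3.3 * math.log10(n))

def histogram(data, k):
    """Split the range of data into k equal-width bins; return three bar heights per bin."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        i = min(int((x - lo) / width), k - 1)  # the maximum goes in the last bin
        counts[i] += 1
    n = len(data)
    rel = [c / n for c in counts]        # relative frequency: heights sum to 1
    density = [r / width for r in rel]   # density: bar areas (height * width) sum to 1
    return width, counts, rel, density

# Hypothetical interval data: 20 exam marks.
marks = [52, 55, 58, 60, 61, 63, 64, 66, 67, 68,
         70, 71, 72, 74, 75, 78, 80, 83, 88, 95]
k = sturges_bins(len(marks))
width, counts, rel, density = histogram(marks, k)
print(k, counts)
```

Calling `histogram(marks, k)` with a different k gives a different picture of the same data, which is why the number of bins matters.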
Describing 2 variables (with few unique pairs)
Cross tabulation = measures frequency that two variables take each possible pair of values (any
kind of data) (aka contingency table or two-way table)
o Basis of pie/bar chart
o Shows relationships between two variables
o Interval, ordinal or nominal data
o Creates Contingency tables
Marginal distribution = frequency distribution of either one of the variables
Conditional distributions = the distribution of a variable restricting the WHO to
consider only a smaller group of individuals
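A cross tabulation, its marginal distributions, and a conditional distribution can all be built by counting pairs. The 12 survey responses below are made up for illustration.

```python
from collections import Counter

# Hypothetical survey: (year of study, prefers online lectures?) for 12 students.
pairs = [("1st", "yes"), ("1st", "yes"), ("1st", "no"), ("1st", "yes"),
         ("2nd", "no"), ("2nd", "no"), ("2nd", "yes"), ("2nd", "no"),
         ("3rd", "yes"), ("3rd", "no"), ("3rd", "no"), ("3rd", "no")]

cross_tab = Counter(pairs)                      # frequency of each (year, answer) pair
marginal_year = Counter(y for y, _ in pairs)    # marginal distribution of year
marginal_answer = Counter(a for _, a in pairs)  # marginal distribution of answer

def conditional_on_year(year):
    """Distribution of the answer among students in one year only."""
    total = marginal_year[year]
    return {a: cross_tab[(year, a)] / total for a in marginal_answer}

print(dict(cross_tab))
print(dict(marginal_answer))
print(conditional_on_year("1st"))
```

Overall 5 of 12 said yes, but conditioning on first-year students gives 3 of 4, which is the kind of relationship a contingency table is meant to reveal.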
Sample vs. Population
Sample contains only a subset of observations in a population (so it has sampling error)
Sampling noise = difference between population and sample simply due to random chance
o Driven by size of the sample (and not the size of the sample relative to the size of the
population, which is usually assumed infinite)
Sampling error always present
o Statistics is the study of how to make inferences in light of sampling error
o Never see distributions in perfect form
o Consider sample size (n) when making informal inferences (larger = more accurate)
Measures of Central Tendency
Sample statistics often called summary statistics b/c they are meant to give a concise idea of
what data “looks like”
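Three sample statistics that provide numeric measures of central tendency: mean, median, mode
Mean = sum of the observations divided by the number of observations
o Population mean = a parameter, no sampling error
o Sample mean = a statistic, has sampling error (what we see in real life)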
Median = middle observation after sorting
o If n is an even #, calculate median by averaging the two middle observations
o Better choice for skewed data than mean
Mode = the value that occurs with the greatest frequency
o With interval data often use modal class
o Modal Class = class with most observations
Sensitivity to Outliers
Outliers = extremely large or small values different from the bulk of the data
Robust = not sensitive to outliers
o Mean is not robust b/c an outlier can pull it up or down (the mean is the balance point)
o Median is robust b/c it only looks at the middle observation(s), not the first and last,
but it is more subject to sampling error
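A quick check of robustness, using Python's standard `statistics` module on a tiny made-up sample with and without one extreme value:

```python
import statistics

data = [1, 2, 3, 4, 5]             # hypothetical sample
with_outlier = [1, 2, 3, 4, 500]   # same sample with one extreme value

for sample in (data, with_outlier):
    print(statistics.mean(sample), statistics.median(sample))

# Even n: the median averages the two middle observations.
print(statistics.median([1, 2, 3, 4]))
# Mode: the value that occurs most often.
print(statistics.mode([2, 3, 3, 5, 7]))
```

The outlier moves the mean from 3 to 102 while the median stays at 3.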
**REVIEW LECTURE DIAGRAMS*****
Graph Problems
(violate) Area principle = a principle that helps to interpret statistical information by insisting
that in a statistical display each data value be represented by the same amount of area (ex. 3D
pie chart)
Keep it honest (all % add up to 100%)
Look at data separately too when more than 1 variable or created contingency table
Use large enough sample size (especially for pie chart)
Don’t overstate case
Simpson’s Paradox = a phenomenon that arises when averages, or percentages, are taken across
different groups, and these group averages appear to contradict the overall averages
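Simpson's Paradox is easiest to see with numbers. The counts below are illustrative only (two treatments, A and B, each tried on a "small" and a "large" group); the names and structure are assumptions for this sketch.

```python
# Illustrative counts: (successes, attempts) for two treatments in two groups.
results = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, attempts):
    """Success rate as a fraction."""
    return successes / attempts

def overall(treatment):
    """Pool the groups: total successes over total attempts."""
    total_s = sum(s for s, _ in results[treatment].values())
    total_a = sum(a for _, a in results[treatment].values())
    return rate(total_s, total_a)

for group in ("small", "large"):
    print(group, round(rate(*results["A"][group]), 3), round(rate(*results["B"][group]), 3))
print("overall", round(overall("A"), 3), round(overall("B"), 3))
```

A has the higher success rate within each group, yet B has the higher rate overall, b/c A was mostly tried on the harder "large" group. This is why the notes say to look at the data separately too.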
Lecture 3: Describing Interval data (beyond a histogram & mean/median/mode)
4 Measures of Variability (spread)
Summarize data variability with statistics:
o Range = the difference between the largest and smallest observations
Measure of variability: as the difference becomes bigger, the data are more variable
Sample range subject to sample error
Use 2 observations (biggest & smallest)
Very sensitive to outliers
o Variance = sum of the squared deviations from the mean divided by the degrees of
freedom: s^2 = [sum of (x_i - x-bar)^2] / (n - 1)
Always ≥ 0
Numerator: total sum of squares (TSS)
An observation far from the mean increases
TSS a lot
Denominator: degrees of freedom (df, ν, ‘nu’)
Only n-1 free observations left after
calculating the mean
o Standard Deviation = the square root of the variance
Measured in same units as original variable
Variance measured in units squared
Differently shaped graphs can all have the same standard deviation b/c, like the
mean, it uses all the data (all data in these graphs are centered around the middle)
S.d. depends on shape of graph and units of measure
Possible for: range = 0 & S.D. = 0 (when all observations are identical)
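The range, variance, and standard deviation can be computed step by step. The 8 observations are made up; the calculation follows the definitions above (TSS divided by n - 1) and is compared with Python's `statistics` module, which uses the same n - 1 denominator for `variance` and `stdev`.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample, n = 8

n = len(data)
mean = sum(data) / n
tss = sum((x - mean) ** 2 for x in data)   # numerator: total sum of squares
df = n - 1                                  # denominator: degrees of freedom
variance = tss / df                         # measured in squared units
std_dev = math.sqrt(variance)               # back in the original units
data_range = max(data) - min(data)          # uses only the 2 extreme observations

print(data_range, tss, round(variance, 3), round(std_dev, 3))
print(statistics.variance(data), statistics.stdev(data))
```

Here the mean is 5, TSS is 32, and the variance is 32/7; the single observation at 9 contributes 16 of the 32, which shows how much a far-away value adds to TSS.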
**Review Lecture, slide 6**
Empirical Rule (Normal/Bell) ~ if sample from normal population
About 68.3% of all obs. within 1 s.d. of mean
About 95.4% of all obs. within 2 s.d. of mean
About 99.7% of all obs. within 3 s.d. of mean
Chebysheff’s Theorem (always true, for any shape)
At least 100*(1 - 1/k^2)% of observations lie within k s.d.’s of the mean for
k > 1
o At least 75% of the obs. lie within 2 s.d. of mean (1 - 1/2^2 = 3/4)
o At least 88.9% of the obs. lie within 3 s.d. of mean (1 - 1/3^2 = 8/9)
o Can be applied to all samples no matter how population is
distributed
o Notes: k does not have to be integer (ex. 1.5), difference
between this and Empirical is that K >1 and K ≠ 0 for
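
Both rules can be checked numerically. The sketch below gets the Empirical Rule percentages from the standard normal distribution (`statistics.NormalDist`) and tests Chebysheff's bound on a made-up, strongly right-skewed sample; the data and function names are assumptions for illustration.

```python
import statistics
from statistics import NormalDist

def chebyshev_bound(k):
    """Minimum fraction of observations within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2

def fraction_within(data, k):
    """Observed fraction of data within k sample standard deviations of the mean."""
    m, s = statistics.mean(data), statistics.stdev(data)
    return sum(abs(x - m) <= k * s for x in data) / len(data)

# Empirical rule: normal probabilities within 1, 2, 3 s.d. of the mean.
z = NormalDist()
normal_shares = [z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)]
print([round(p, 4) for p in normal_shares])

# Hypothetical, strongly right-skewed sample: the bound still holds.
skewed = [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 12, 20, 60]
for k in (1.5, 2, 3):
    print(k, round(chebyshev_bound(k), 3), round(fraction_within(skewed, k), 3))
```

The bound is a guaranteed minimum for any shape, so the observed fractions are usually well above it; the Empirical Rule gives sharper numbers but only for bell-shaped data.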
