University of Toronto St. George
Damian Dupuy

GGR270 – Introductory Analytical Methods: The Basics

Contact: [email protected] – use email only for questions needing 1-2 sentence replies. Office hour is right after class. Tutorials are held weekly, starting Sept. 23rd.

Statistics – any collection of numerical data: vital data, economic indicators, social statistics. Also a methodology for collecting, presenting and analyzing data, used to summarize findings, validate theory, forecast (e.g. before investing in something like a subway system, you need to estimate what demand will be once it is built and afterwards, because you want to recover the investment), evaluate, and select among alternatives.

Descriptive statistics – organize and summarize data; replace a large set of numbers with a few summary measures (e.g. summarizing 230 exam marks with the average, standard deviation, etc.). This can also be done with graphs. The goal is to minimize information loss.

Inferential statistics – links descriptive statistics to probability theory; generalizes results from a smaller group to a much larger one, i.e. draws a sample from a population and uses it to make statements about the whole population.

Population – the total set of elements (objects, people, regions) under examination, e.g. all potential voters in an urban area. Denoted N.

Sample – a subset of elements from the population, used to make inferences about characteristics of the population. We try to predict the behaviour of the population by looking closely at the sample. Sample size is denoted n. Different samples from one population can be denoted n1, n2, n3.

Week 2: Variables and Data

Most people treat "data" as plural.

Variable – a characteristic of the population that changes or varies, e.g. temperature, income, education.

Bivariate statistics – how do two variables influence each other? Is there a correlation? What is the relationship? E.g. how does education affect income?

Two key categories of variables:
1. Quantitative – numerical, e.g. number of students; can be discrete (1, 2, 3, 4) or continuous (1.5, 2.76, 3.413).
2. Qualitative – non-numerical, e.g. male/female, plant species.

Data – the results of measuring variables; a set of measurements. Categories: univariate, bivariate, multivariate.

Variables – Scales of Measurement

The scale defines how much information a variable contains and which statistical techniques can be used. On the exam: we will be given a statistical problem and asked which test should be used; the answer depends on the scale of measurement and the number of samples.

Four scales:
1. Nominal – lowest scale of measurement, no numerical value attached (least information). Classifies observations into mutually exclusive groups (an observation can be in only one group) that are collectively exhaustive (every observation has a group it fits into). Simply the name or category of the variable; often called "categorical data", e.g. occupation type, gender.
2. Ordinal – stronger scale, as it allows data to be ordered or ranked, e.g. the 12 largest towns in a region, or income by group (high to low), such as being asked which income group you fall into (e.g. 0-10,000; 10,000-30,000; 30,000-50,000; 50,000+).
3. Interval – the unit distance separating numbers is meaningful, e.g. temperature (°F or °C). Does not allow ratios and has no true zero.
4. Ratio – strongest scale of measurement (most information). Ratios of values on the scale are meaningful (you can say someone earns twice as much as someone else), because there is an absolute zero, e.g. temperature in Kelvin, income.

In practice, interval and ratio scales are considered together.

Describing Data 1 – Graphs

Pie chart – circular graph where measurements are distributed among categories. Good for seeing how many observations fall into each group.

Bar graph – graph where measurements are distributed among categories. Also good for seeing how many observations fall into each group.
Relative frequency histogram – graphs quantitative rather than qualitative data. The vertical (Y) axis shows "how often" measurements fall into a particular class or subinterval (frequency); classes are plotted on the horizontal (X) axis.

Rules of thumb: use 5 to 12 intervals or categories. To find the number of classes, use k = 1 + 3.3 log10(n), where n is the number of observations. Classes must be mutually exclusive and collectively exhaustive, and all intervals should be the same width.

*** When drawing your histogram, do not put spaces between the bars. Also document and explain everything you do. ***

Example 1:
Observations: 1, 11, 14, 21, 23, 27, 28, 33, 35, 50
Number of classes: k = 1 + 3.3 log10(10) = 4.3, rounded (always up) to 5.
Class width: (largest − smallest) / number of classes = (50 − 1)/5 = 9.8, rounded (normally) to 10.

Income (000's)   Frequency   Relative Frequency
0-9.9            1           10%
10-19.9          2           20%
20-29.9          4           40%
30-39.9          2           20%
40-50            1           10%

Observing the graph – skewness: Is the distribution symmetrical? If it is, it is called normal. With a large sample, you usually end up with a roughly normal distribution. The direction of the skew is the direction of the tail: if the distribution is positively skewed, the crest is on the left side of the graph and the tail trails off to the right. A positive skew means there is an "outlier" group with much higher values than the rest of the data.

Observing the graph – mode: How many peaks are there? A distribution with two peaks/modes is called bimodal.

Observing the graph – kurtosis: How spread out are the values? How peaked is the distribution? A normal distribution is called mesokurtic. A very flat distribution, where the values are spread over a wide range, is called platykurtic. A distribution where the values cluster tightly around a narrow range is called leptokurtic.
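The class-count and class-width rules of thumb above can be sketched in Python. This is a minimal illustration of Example 1; the function name is mine, not from the notes.

```python
import math

def histogram_classes(observations):
    """Number of classes via k = 1 + 3.3*log10(n), rounded up,
    and a common class width, rounded conventionally."""
    n = len(observations)
    k = math.ceil(1 + 3.3 * math.log10(n))   # 4.3 -> always round up -> 5
    width = round((max(observations) - min(observations)) / k)  # 9.8 -> 10
    return k, width

obs = [1, 11, 14, 21, 23, 27, 28, 33, 35, 50]
k, width = histogram_classes(obs)
print(k, width)   # 5 classes of width 10, as in Example 1
```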
Describing Data 2 – Measures of the centre and measures of variability. The standard deviation is the square root of the variance.

Statistics and parameters: Graphs are limited in what they can tell us, and it is difficult to make inferences about a population from a subset or sample using graphs alone. Therefore, we use numerical measures. Measures associated with the population are called parameters; measures associated with samples are called statistics.

Measures of the centre:

Mean – the most commonly used measure of central tendency: the sum of all values or observations divided by the number of observations. The denominator of the population mean is N, whereas for the sample mean it is n. The sample mean is written x̄ (x bar); the population mean is written μ (mu). In summation (Σ) notation, the index below the sigma is the value you start at, and the value above it is the value you end at.

Example 2: Temperature data: 7.3, 10.7, 9.1, 8.4, 13.9, 9.4, 8.2.
The sum of all observations is 67.0, divided by 7: x̄ = 67/7 = 9.57 ≈ 9.6.
Rule of thumb for rounding: round to the number of decimal places in your observation data. Less than five, round down; five or more, round up.

Median – the value occupying the middle position in an ordered set of observations. Order the observations from lowest to highest and locate the middle position, expressed as 0.5(n + 1).

Example 3 (odd number of observations): Temperature data: 7.3, 10.7, 9.1, 8.4, 13.9, 9.4, 8.2.
Ordered: 7.3, 8.2, 8.4, 9.1, 9.4, 10.7, 13.9.
Using the formula, 0.5(7 + 1) = 4th position in the ordered set, so the median is 9.1.

Even number of observations: Temperature data: 7.3, 8.4, 9.1, 9.4, 10.7, 13.9.
Using the formula, 0.5(6 + 1) = 3.5th position, so average the 3rd and 4th values: (9.1 + 9.4)/2 = 18.5/2 = 9.25 ≈ 9.3.
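A minimal sketch of the mean and median calculations from Examples 2 and 3 (the helper functions are mine, not from the notes):

```python
def mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the ordered observations (0.5*(n+1) position)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                  # odd n: middle position
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2    # even n: average middle pair

temps = [7.3, 10.7, 9.1, 8.4, 13.9, 9.4, 8.2]
print(round(mean(temps), 1))    # 9.6
print(median(temps))            # 9.1
print(median([7.3, 8.4, 9.1, 9.4, 10.7, 13.9]))   # 9.25, as in Example 3
```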
Week 3: Measures of the Centre (continued)

Mode – the value that occurs with the highest frequency; it lets you locate the peak of a relative frequency histogram.

Choosing an appropriate measure: The mean is usually the best measure, as it is sensitive to change in a single observation. The mean is not a good measure when the distribution is bimodal (two modes), when the distribution is skewed, or when outliers (extreme values) are present in the data set, because an outlier pulls the mean in whichever direction it lies.

Measures of dispersion:

Range – the simplest measure: the difference between the smallest and largest values (max value − min value), for interval/ratio scale data. Influenced by outliers.

Quartiles – yield more information and are less affected by outliers. The observations are arranged in increasing order and divided into four groups (quartiles); with five groups they are called quintiles.

Standard deviation and variance – the two most commonly used measures of dispersion. Both compare the value of each observation to the mean (xi − x̄). Two key properties of these deviations: the sum of the deviations always adds up to zero, and the sum of the squared deviations is the minimum sum possible, called the "least squares" property. The least squares property carries over into the calculation of the standard deviation and variance.

Variance: s² = Σ(xi − x̄)² / (n − 1)

Constructing a worksheet: A worksheet is a table where each column represents a component of the statistical formula.
1st column = xi; 2nd column = x̄; 3rd column = the difference (xi − x̄); 4th column = the difference squared. The sum of the 4th column divided by (n − 1) is the variance, and its square root is the standard deviation. To check your work, the 3rd column should sum to 0.

The standard deviation gives a standard picture of how far values typically lie from the mean.

Skewness: Measures the degree of symmetry in a frequency distribution.
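The variance worksheet above can be sketched in Python, reusing the Example 2 temperature data (column names follow the worksheet layout in the notes):

```python
temps = [7.3, 10.7, 9.1, 8.4, 13.9, 9.4, 8.2]
n = len(temps)
xbar = sum(temps) / n                      # 2nd column: the mean

diffs = [x - xbar for x in temps]          # 3rd column: xi - xbar, sums to ~0
sq_diffs = [d ** 2 for d in diffs]         # 4th column: (xi - xbar)^2

variance = sum(sq_diffs) / (n - 1)         # sample variance, s^2
std_dev = variance ** 0.5                  # standard deviation, s

print(round(variance, 2), round(std_dev, 2))
```

The check column (`diffs`) summing to zero is exactly the "sum of differences" property described above.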
Skewness determines how evenly (or unevenly) the values are distributed on either side of the mean. Pearson's skewness coefficient: 3(x̄ − median) / s.

Coefficient of variation: Allows comparison of variability across spatial samples; tests which sample has the greatest variability. The standard deviation and variance are absolute measures, so they are influenced by the size of the values in the dataset. To compare variation across two or more geographic samples, use a relative measure of dispersion called the coefficient of variation:

CV = s / x̄

Example: Annual rainfall

       Station A   Station B   Station C
x̄      92.6        97.3        38.8
s      16.6        12.8        9.1
CV     0.179       0.132       0.235

Station C has the greatest degree of variability.

Practical significance of the standard deviation – the empirical rule: For a roughly normal distribution, a fixed percentage of the data lies within a given distance of the mean: about 68% of values lie within ±1 standard deviation, about 95% within ±2, and about 99.7% within ±3. If s is 5, then one standard deviation is ±5 and two standard deviations are ±10.

Tutorial: Why do we need to consider both measures of centre and measures of dispersion when describing the shape of a distribution? Measures of centre show us where the most common value is, whereas measures of dispersion show how far most values stray from the mean. The variance shows you how honest your mean is.

Three things to know about relative frequency histograms: the first bar should be placed right against the y-axis; there should be no gaps between bars; label your axes.

Week Four:

Z scores – standard scores are referred to as z scores. A z score indicates how many standard deviations separate a particular value from the mean. Z scores can be positive or negative, depending on whether the value is greater or less than the mean. The z score of the mean itself is 0, and one standard deviation corresponds to ±1. The table of normal values provides probability information on this standardized scale.
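The coefficient-of-variation comparison from the rainfall example can be sketched as follows (station means and standard deviations taken directly from the table above):

```python
# (xbar, s) for each rainfall station, from the notes' example
stations = {"A": (92.6, 16.6), "B": (97.3, 12.8), "C": (38.8, 9.1)}

cv = {name: s / xbar for name, (xbar, s) in stations.items()}   # CV = s / xbar
most_variable = max(cv, key=cv.get)

for name, value in cv.items():
    print(name, round(value, 3))
print("Greatest variability:", most_variable)   # Station C
```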
The formula for calculating the z score compares a value to the mean and divides by the standard deviation: z = (x − x̄) / s. The result is interpreted as the number of standard deviations an observation lies above or below the mean.

Example 1: Rainfall in Toronto. Mean = 39.95 inches of rainfall, s = 7.5 inches. Z score for 48 inches?
z = (48 − 39.95) / 7.5 = 8.05 / 7.5 = 1.07
Therefore, 48 inches is 1.07 standard deviations above the mean.

Describing Bivariate Data

Graphs: Simple bivariate graphs include comparative pie charts and stacked bar graphs.

Correlation: Allows us to observe, statistically, the relationship between two variables. Just because two variables are correlated does not mean that one causes the other; all correlation gives you is the strength and direction of the relationship. The most common graphic technique is the scatterplot, in which each observation has a value for the x variable and a value for the y variable. Scatterplots show both the strength and the direction of the relationship: the more narrowly the points cluster around a line, the stronger the bivariate relationship.

Correlation coefficients: A more rigorous approach to observing and measuring the strength and direction of a bivariate relationship. Most coefficients have a maximum absolute value of 1.0 and can be positive or negative; ±1 is a perfect positive/negative relationship. The most common measure is Pearson's product-moment correlation, or Pearson's r, used for interval/ratio scale data. Spearman's rank correlation is the corresponding coefficient for ordinal (ranked) data.

Pearson's r and covariance: Covariance measures the degree to which two variables vary together. It begins with the deviations around the means of both variables, (x − x̄) and (y − ȳ):

cov(x, y) = Σ(x − x̄)(y − ȳ) / (n − 1)

That is, the sum of the paired deviations from the means multiplied together, divided by the number of values minus 1.
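The covariance and Pearson's r calculations can be sketched together in Python. The education/income numbers below are hypothetical illustrative data (the notes only name the education-income relationship as an example, not these values):

```python
def covariance(xs, ys):
    """Sum of paired deviations from the means, divided by n - 1."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

def sample_sd(values):
    """Sample standard deviation (n - 1 denominator)."""
    n = len(values)
    vbar = sum(values) / n
    return (sum((v - vbar) ** 2 for v in values) / (n - 1)) ** 0.5

def pearsons_r(xs, ys):
    """Ratio of the covariance to the product of the standard deviations."""
    return covariance(xs, ys) / (sample_sd(xs) * sample_sd(ys))

education = [10, 12, 12, 14, 16, 18]   # hypothetical years of schooling
income = [28, 34, 31, 40, 47, 55]      # hypothetical income, 000's
r = pearsons_r(education, income)
print(round(r, 2))   # close to +1: a strong positive relationship
```

A positive covariance here matches the interpretation in the notes: both variables tend to sit above (or below) their means together.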
If the covariance is negative, there is a negative correlation; if it is positive, there is a positive correlation. Any time you are dealing with a sample, you divide by n − 1 to control for bias.

Covariance worksheet (each row is one observation):

Xi   Yi   Xi − X̄   Yi − Ȳ   (Xi − X̄)(Yi − Ȳ)
*    *    *         *         *
*    *    *         *         *
                              Sum = Σ(x − x̄)(y − ȳ)

Pearson's r: Expressed as the ratio of the covariance of x and y to the product of the standard deviations of x and y:

r = [Σ(x − x̄)(y − ȳ) / (n − 1)] / (Sx Sy)

If there is no relationship between the variables, Pearson's r will be 0. A value around 0.5 is a moderate relationship; the closer to 1, the stronger.

Probability and Probability Distributions

Probability: Studying spatial patterns is a key concern of geographers. We try to understand what has led to those patterns, and from that, to predict what patterns will emerge in the future. Geography is about describing, explaining, and predicting geographic patterns and processes. Probability is used in situations where patterns have some degree of uncertainty, e.g. weather forecasts and the probability of precipitation.

Probability focuses on the occurrence of an event, where one of several possible outcomes could result. These outcomes are (and must be) mutually exclusive. Probability can be thought of as the frequency of an event occurring relative to all possible outcomes:

P(A) = F(A) / F(E)

where P(A) is the probability of outcome A occurring, F(A) is the absolute frequency of A, and F(E) is the frequency of all outcomes. The maximum probability of an outcome is 1, and all probabilities add up to 1.

Tutorial: With a skewed distribution, which is the better measure of central tendency, the mean or the median? The median, because the mean is influenced by outliers, and the median gives a more accurate representation of the central tendency. The mean is more sensitive to extreme values.

Tchebysheff's (Chebyshev's) theorem: At least 0% of values lie within ±1 standard deviation of the mean, at least 75% within ±2 standard deviations,
and at least 89% of values lie within ±3 standard deviations. General rule: for k standard deviations, the minimum proportion of values is 1 − (1/k²).

Empirical rule: Does not allow "point estimates", since it deals with intervals; no x/y-axis. It only works when you have a normal distribution. Using this rule makes it easy to identify outliers.

Week 5:

On the midterm: everything up to the multiplication rule.

Probability rules (continued):

Addition rule – used to find the probability that one of several mutually exclusive events occurs: P(A or B) = P(A) + P(B). E.g. the probability of rolling a 5 is 0.167 and of rolling a 6 is 0.167, so adding them gives 0.334, the chance of rolling either one.

Multiplication rule – used for the joint probability of independent events: P(A and B) = P(A) × P(B). E.g. rolling two sixes in a row: 0.167 × 0.167 = 0.028.

Probability distributions: We often see consistent or typical patterns of probabilities in certain situations, called probability distributions. They are similar to frequency distributions, except the y-axis shows the probability of outcomes rather than the frequency of outcomes. They can be discrete or continuous. Three key types: binomial, Poisson, normal.

Binomial – discrete (whole numbers, not continuous) probability distribution. Used to determine the probability of multiple events in independent trials, where each independent trial has only two possible outcomes (e.g. rain or no rain, flood or no flood). The probability of the event occurring is p; the probability of it not occurring is q = 1 − p.

Poisson – discrete probability distribution. Used for events that occur randomly in space and time, especially for distributions over space, particularly quadrat analysis of point patterns. Also used when the probability of an event occurring is much smaller than the probability of it not occurring (rare events, e.g. tornadoes).

Normal – the most commonly applied distribution; it provides the theoretical basis for sampling theory and statistical inference. We need to look at the "area under the curve": the total area under the curve represents 100% of possible outcomes.
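The addition and multiplication rules above can be sketched with a fair die. Note this uses the exact probability 1/6, so the first result is 0.333 rather than the 0.334 obtained from the rounded 0.167 values:

```python
p_five = p_six = 1 / 6          # fair die: each face has probability 1/6

# Addition rule (mutually exclusive outcomes): P(5 or 6)
p_five_or_six = p_five + p_six

# Multiplication rule (independent trials): P(six, then six again)
p_two_sixes = p_six * p_six

print(round(p_five_or_six, 3))   # 0.333
print(round(p_two_sixes, 3))     # 0.028
```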
50% of values lie to the right of the mean and 50% to the left. We need a methodology to effectively determine the probability of values on the distribution: we could use integral calculus, but it is easier to use the table of normal values. Observations must be standardized (converted to z scores) to use the table.

Sampling and Estimation

The aim of inferential statistics is to generalize about the characteristics of a larger population, so we need a process for obtaining a sample. Sampling can be spatial or non-spatial; census tracts are not always spatial.

Why sample? Samples are necessary in cases of extremely large populations, and are efficient and cost-effective. Highly detailed information can be obtained easily, and sampling allows for follow-up activity or repetition.

Sampling error: If a sample is representative, it will accurately reflect the characteristics of the population, without bias. An element of randomness must be introduced to preserve the representative sample (selecting observations through a randomized process). You can never eliminate bias, only minimize it; reducing bias means reducing error. Precision (a larger sample is more precise) and accuracy help categorize sources of error.

Week 6: Sampling Designs

A number of different sampling designs exist: simple random, systematic, stratified, etc. There are also spatial sampling designs, using Cartesian coordinates: simple random, stratified random, transect.

Stratified random sampling: Impose a grid onto a map, then choose random points within the grid squares.

Transect sampling: Take two y-axes and two x-axes, then use endpoints to draw lines. Figure out how long the lines are, find out how much land is used along each line, then add up all the land use.

Sampling distribution: Sample statistics will change or vary for each random sample selected.
Sampling distributions – probability distributions for statistics. A sampling distribution is the distribution of a statistic drawn from all possible samples of a given size n. It can be developed for any statistic, not just the mean.

Central Limit Theorem: A sampling distribution has its own mean and standard deviation, but the mean of a sampling distribution has important properties, summarized by the Central Limit Theorem: if all samples are randomly drawn and independent, then the mean of the sampling distribution of sample means will be the population mean, μ.

Week 7: Central Limit Theorem (continued)

The frequency distribution of sample means will be normally distributed. What this means for us: when the sample size is large, the sample mean is likely to be quite close to the population mean. The mean of a large sample is more likely to be close to the true population mean than the mean of a smaller sample.

Central Limit Theorem – variability: The standard deviation of the sampling distribution is equal to the sample standard deviation divided by the square root of the sample size. This is called the standard error of the mean:

σx̄ = s / √n

It indicates how much a typical sample mean is likely to differ from the true population mean, i.e. it measures the amount of sampling error. The larger the sample, the smaller the amount of sampling error. Anything over a sample size of 30 counts as a large sample.

The standard error of a proportion is SEp = √(pq / n).

Sample Estimation

Statistical inference is concerned with making decisions or predictions about population parameters using samples. There are two ways to do this: estimation and hypothesis testing. Estimators are calculated using information from samples and are usually expressed as a formula. Two types: point estimates and interval estimates. Practically, several statistics exist that could serve as point estimates; the question is how the estimate behaves in repeated sampling.
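The two standard-error formulas above can be sketched as follows. The example values are illustrative: s = 105 and n = 50 echo the margin-of-error example later in these notes, and p = 0.15, n = 2001 echo the survey example:

```python
import math

def standard_error_mean(s, n):
    """Standard error of the mean: s / sqrt(n)."""
    return s / math.sqrt(n)

def standard_error_proportion(p, n):
    """Standard error of a proportion: sqrt(pq / n), where q = 1 - p."""
    return math.sqrt(p * (1 - p) / n)

sem = standard_error_mean(105, 50)
sep = standard_error_proportion(0.15, 2001)
print(round(sem, 1))    # 14.8
print(round(sep, 4))    # about 0.008
```

Doubling n shrinks the standard error only by a factor of √2, which is why large gains in precision require much larger samples.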
Two valuable characteristics of the best estimator: it is unbiased and has small variance.

Error of estimation: Under the empirical rule, 95% of all point estimates will lie within 2 (or more precisely, 1.96) standard deviations of the mean. If the estimate is unbiased, the difference between the point estimate and the true parameter value will be less than 1.96 standard deviations, or standard errors. We can call this the 95% margin of error, calculated as 1.96 × standard error.

Margin of error example: n = 50, x̄ = 980 lbs, s = 105.
95% margin of error = 1.96(s / √n) = 1.96(105 / √50) = 29.1 lbs.
Therefore, we can say with 95% confidence that the sample estimate of 980 lbs is within ±29 lbs of the population parameter.
As the confidence level decreases, the margin of error falls: using the 90% level, 1.65(105 / √50) = 24.5, or ±24.5 lbs.
It works with proportions too, e.g. 1.96 √(pq / n).

Estimation example: "15% of Canadians would rather vote in the US election: Survey." The poll asked 2001 Canadians over the age of 15 questions about how they perceive their role and Canada's role in the world. The survey has a margin of error of 2.2 percent, 19 times out of 20.

Confidence intervals: Most often you don't know how precise the single sample mean is as an estimator, e.g. with smaller sample sizes. Instead, place an interval around the sample mean and calculate the probability of the true population mean falling within this interval.

General formula: x̄ ± z(σx̄)
The z value associated with, say, a 90% confidence level is ±1.65, so the 90% confidence interval is x̄ ± 1.65(σx̄): the upper bound of the interval is x̄ + 1.65(σx̄) and the lower bound is x̄ − 1.65(σx̄).

What does it mean to be 95% confident? If you constructed 20 intervals, each with different sample information, 19 out of 20 would contain the population parameter μ, and 1 would not. But you can never be sure whether a particular interval contains μ.
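The margin-of-error and confidence-interval calculations above can be sketched with the 980 lbs example (n = 50, x̄ = 980, s = 105; the function name is mine):

```python
import math

def confidence_interval(xbar, s, n, z):
    """Interval xbar +/- z * (s / sqrt(n)); returns (lower, upper, margin)."""
    se = s / math.sqrt(n)        # standard error of the mean
    margin = z * se              # margin of error at the chosen z
    return xbar - margin, xbar + margin, margin

low, high, margin = confidence_interval(980, 105, 50, 1.96)   # 95% level
print(round(margin, 1))                  # 29.1 lbs
print(round(low, 1), round(high, 1))     # 950.9 1009.1

_, _, margin90 = confidence_interval(980, 105, 50, 1.65)      # 90% level
print(round(margin90, 1))                # 24.5 lbs: lower confidence,
                                         # narrower interval
```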
Confidence intervals: the confidence level is written 1 − α.