Soan 3120 Chapter Notes.docx

32 Pages

Sociology and Anthropology
SOAN 3120
Michelle Dumas

Soan 3120 Chapter Notes

Chapter 1: Picturing Distributions with Graphs

Individuals and Variables
• Individuals: the objects described by a set of data; they may be people, animals or things.
• Variable: any characteristic of an individual. A variable can take different values for different individuals.
• Categorical Variable: places an individual into one of several groups or categories. Ex. sex, major
• Quantitative Variable: takes numerical values for which arithmetic operations make sense. These values are usually recorded in a unit of measurement. Ex. height in inches, age in years

Categorical Variables: Pie Charts and Bar Graphs
• Exploratory Data Analysis: examining data to describe its main features.
• Begin by examining each variable by itself, then study the relationships among the variables.
• Begin with one or more graphs, then add numerical summaries of specific aspects of the data.
• To examine a single variable, display its distribution.
• Distribution: tells us what values the variable takes and how often it takes these values.
• The distribution of a categorical variable lists the categories and gives the count or percent of individuals who fall in each category.
• Roundoff Error: when the percentages in a distribution don't add up to exactly 100% because each one was rounded. This doesn't point to a mistake; it is just the effect of rounding.
• Pie Chart: used to display the distribution of a categorical variable.
  • Must include all the categories that make up a whole.
  • Used to emphasize each category's relation to the whole.
• Bar Graph: represents each category as a bar.
  • Can compare any set of quantities that are measured in the same units.

Quantitative Variables: Histograms
• Histogram: the most common graph of the distribution of one quantitative variable.
• To create a histogram:
  1. Choose the classes: divide the range of the data into classes of equal width.
  2. Count the individuals in each class.
  3. Draw the histogram.
• Choosing too few classes puts all the values into a few tall bars (a "skyscraper" graph).
• Choosing too many classes leaves many classes with few or no values (a "pancake" graph).
• When examining a histogram, look for the overall pattern and for striking deviations from that pattern.
• Use shape, center and spread to describe the pattern.
• Outlier: an individual value that falls outside the overall pattern.
• Midpoint (center): the value with roughly half the observations taking smaller values and half taking larger values.
• Spread: described by giving the smallest and largest values of a distribution.
• Symmetric: a distribution in which the left and right sides of the histogram are approximately mirror images.
• Skewed to the right: the right side of the histogram extends much farther out than the left side.
• Skewed to the left: the left side of the histogram extends much farther out than the right side.
• Note: the direction of skewness is the direction of the long tail, NOT the direction where most observations are clustered (the hump).

Quantitative Variables: Stemplots
• For small data sets, a stemplot is quicker to make and presents more detailed information.
• To make a stemplot:
  1. Separate each observation into a stem (all but the final digit) and a leaf (the final digit). Stems can have as many digits as needed, but each observation has only one leaf digit.
  2. Write the stems in a vertical column, smallest to largest, and draw a line separating them from the leaves. Include all the stems, even if they have no leaves.
  3. Write each leaf in the row to the right of its stem, in increasing order.
• Ex.
  1 | 3
  2 | 4 6
  3 | 2 5 6
  4 | 1 3
  5 | 7
  This represents the data: 1.3, 2.4, 2.6, 3.2, 3.5, 3.6, 4.1, 4.3, 5.7
• A stemplot looks like a histogram turned on end.
• Unlike a histogram, a stemplot preserves the actual value of each observation.
• Note: stemplots do NOT work well for large data sets, where each stem must hold a large number of leaves.
• Split Stems: you can split each stem in two to double the number of stems and reduce the number of leaves on each stem.

Time Plots
• Time Plot: plots each observation against the time at which it was measured. Always put time on the horizontal scale of your plot and the variable you are measuring on the vertical scale.
• Cycles: regular up and down movements.
• Trend: a long-term upward or downward movement over time.
• Time Series Data: time plots show the change in one variable over time.
• Cross-Sectional Data: histograms and stemplots display one variable for many individuals at a single time.

Chapter 2: Describing Distributions with Numbers

Measures of Center: The Mean $\bar{x}$
• To find the mean of a set of observations, add their values and divide by the number of observations: $\bar{x} = \frac{1}{n}\sum x_i$
• The mean is sensitive to the influence of a few extreme observations (outliers) and to skewed distributions, both of which pull the mean toward the tail.
• The mean is NOT a resistant measure of center.

Measures of Center: The Median M
• The median is the midpoint of a distribution: the number such that half the observations are smaller and the other half are larger.
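A minimal Python sketch of the mean and median as defined above, using the nine observations from the stemplot example. The extra value 57.0 is a hypothetical outlier added only for illustration; it does not appear in the notes. The sketch suggests how one extreme value moves the mean much more than the median.

```python
from statistics import mean, median

# The nine observations read off the stemplot example in these notes.
data = [1.3, 2.4, 2.6, 3.2, 3.5, 3.6, 4.1, 4.3, 5.7]

mean_before = mean(data)      # add the values, divide by n
median_before = median(data)  # middle value of the ordered list (n is odd)

# Hypothetical outlier, NOT in the notes (e.g. 5.7 mistyped as 57.0).
with_outlier = data + [57.0]
mean_after = mean(with_outlier)
median_after = median(with_outlier)  # n is now even: midway between the two center values

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

Under these assumptions the mean should more than double while the median barely moves, which is what "not resistant" means for the mean.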
• To find the median of a distribution:
  1. Arrange all the observations in order of size, smallest to largest.
  2. If n is odd, M is the center observation in the list.
  3. If n is even, M is midway between the two center observations in the list.
• You can always locate M by counting $(n+1)/2$ observations up from the start of the ordered list (this gives the position of M, not its value).
• The median is a resistant measure of center.

Comparing the Mean and the Median
• Symmetric Distribution: Mean = Median
• Skewed Distribution: the mean is pulled toward the tail.

Measuring Spread: The Quartiles
• To calculate the quartiles:
  1. Arrange the observations in order and locate M.
  2. $Q_1$ = the median of the observations below M.
  3. $Q_3$ = the median of the observations above M.

Five Number Summary and Box Plots
• The Five Number Summary of a distribution consists of: Min, $Q_1$, M, $Q_3$, Max
• It offers a reasonably complete description of the center and spread of a distribution.
• Box Plot: a graph of the five-number summary.
  • A central box spans the quartiles $Q_1$ and $Q_3$.
  • A line in the box marks the median M.
  • Lines extend from the box to the smallest and largest observations.
  • Best used for side by side comparisons of more than one distribution.

Spotting Suspected Outliers
• IQR: the Inter-Quartile Range measures the distance between the first and third quartiles: $IQR = Q_3 - Q_1$
• The 1.5 × IQR Rule: call an observation a suspected outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.

Measuring Spread: Standard Deviation
• Measures spread by looking at how far the observations are from their mean.
• To calculate, first find the variance: $s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$
• The standard deviation is the square root of the variance: $s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$
• The Standard Deviation:
• Measures spread around the mean, so it should only be used when the mean is used as the measure of center.
• Is always ≥ 0.
• s = 0 only when all the observations have the same value.
• Has the same units of measurement as the original observations.
• Is NOT resistant; a few outliers can change it a lot.

Choosing Measures of Center and Spread
• Choose either the five-number summary OR the mean and standard deviation.
• Five-Number Summary: usually better than the mean and standard deviation for describing a skewed distribution or a distribution with strong outliers.
• Use the mean and standard deviation only when the distribution is reasonably symmetric.

Chapter 3: The Normal Distributions

Density Curves
• Sometimes the overall pattern of a large number of observations is so regular that we can describe it by a smooth curve.
• Areas under the curve give a good approximation of the actual distribution of the data.
• A Density Curve:
  • Is always on or above the horizontal axis.
  • Has an area of exactly 1 underneath it.
  • Describes the overall pattern only; it does not describe outliers.

Describing Density Curves
• The median is the equal-areas point, dividing the area under the curve in half.
• The mean is the "balance" point of the distribution.
• Symmetric Density Curve: the mean and median are equal.
• Skewed: the mean is pulled away from the median, towards the tail.
• Because a density curve is an idealized description of a distribution, its mean and standard deviation have to be distinguished from those actually computed from the real data. For a density curve we write μ instead of $\bar{x}$ and σ instead of s.

Normal Distributions
• A Normal curve describes a Normal distribution.
• All Normal curves have the same overall shape:
  • Symmetric
  • Single-peaked
  • Bell-shaped
• Any Normal curve is completely described by giving its mean μ and standard deviation σ.
• The mean and median are both at the center.
• Changing μ without changing σ moves the curve along the horizontal axis without changing the spread.
• Changing σ changes the spread of the curve.
• You can usually estimate σ by eye on a Normal curve by finding the points at which the curvature changes; these points lie a distance σ on either side of μ.
• The 68-95-99.7 Rule:
  • Approximately 68% of observations fall within 1σ of μ.
  • Approximately 95% of observations fall within 2σ of μ.
  • Approximately 99.7% of observations fall within 3σ of μ.
• Example: Given a Normal distribution with μ = 6.84 and σ = 1.55, find the range that contains the middle 95% of observations. Since 95% of observations fall within 2σ of μ:
  μ − 2σ = 6.84 − (2)(1.55) = 6.84 − 3.10 = 3.74
  μ + 2σ = 6.84 + (2)(1.55) = 6.84 + 3.10 = 9.94
  Thus 95% of observations fall between 3.74 and 9.94, 2.5% fall above 9.94 and 2.5% fall below 3.74.

Standard Normal Distribution
• All Normal distributions are the same if we measure in units of size σ about the mean μ.
• Changing to these units is called standardizing.
• If x is an observation from a Normal distribution N(μ, σ), the standardized value of x is: $z = \frac{x - \mu}{\sigma}$
• A standardized value is often called a z-score.
• A z-score tells us how many standard deviations the original observation falls from the mean, and in which direction.
• Standardizing a variable that has any Normal distribution produces a new variable that has the standard Normal distribution.
• The Standard Normal Distribution is the distribution N(0, 1), with mean 0 and standard deviation 1.
• If a variable x has any Normal distribution N(μ, σ) with mean μ and standard deviation σ, then the standardized variable $z = \frac{x - \mu}{\sigma}$ has the standard Normal distribution.

Finding Normal Proportions
• Cumulative Proportion: the cumulative proportion for a value x in a distribution is the proportion of observations in the distribution that are less than or equal to x.
• This can be found using the Standard Normal Table.

Using the Standard Normal Table
• To find cumulative proportions from a table, we must first standardize, to express the problem in the standard scale of z-scores.
• This allows us to use one table, a table of standard Normal cumulative proportions.
• The table entries are cumulative proportions: areas under the curve to the left of a value z.

Finding a Value when given a Proportion
• Use the table backward: find the given proportion in the body of the table, then read off the corresponding z-score.
• Once we have a z-score, we un-standardize it back to the original x scale.
• Example: Given N(504, 111), we want to know what observation or higher would be in the top 10% of the distribution.
  • The x-value that puts an observation in the top 10% is the same as the x-value for which 90% of the observations are to its left.
  • Use the table: locate 0.9 in the body; the corresponding z-score is 1.28.
  • Un-standardize this value to transform z back to the original x scale.
  • If z = 1.28, then x lies 1.28 standard deviations above the mean, therefore:
    x = mean + (1.28)(standard deviation) = 504 + (1.28)(111) = 646.08
  • Thus, anything larger than 646.08 is in the top 10% of the distribution.
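The two worked examples from this chapter can be checked with a short Python sketch. It assumes Python's standard-library `statistics.NormalDist` in place of the printed Standard Normal Table, and the variable names (`first`, `second`, `z_cut`, and so on) are illustrative only.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

# First example in the notes: N(6.84, 1.55), range within 2 sigma of the mean.
first = NormalDist(mu=6.84, sigma=1.55)
low = first.mean - 2 * first.stdev
high = first.mean + 2 * first.stdev
share_within = first.cdf(high) - first.cdf(low)  # cumulative proportions play the role of table entries

# Second example in the notes: N(504, 111), cutoff for the top 10%.
second = NormalDist(mu=504, sigma=111)
z_cut = NormalDist().inv_cdf(0.90)          # "use the table backward" on N(0, 1)
x_cut = second.mean + z_cut * second.stdev  # un-standardize back to the x scale

print(f"95% range: {low:.2f} to {high:.2f} (exact share {share_within:.4f})")
print(f"top 10% cutoff: z = {z_cut:.4f}, x = {x_cut:.2f}")
```

The software cutoff should come out slightly above the 646.08 found by hand, because the table rounds z to 1.28 while `inv_cdf` keeps more decimal places. Likewise, the exact share within 2σ is a little over 95%, since the 68-95-99.7 rule is only approximate.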
Chapter 4: Scatterplots and Correlation

Explanatory and Response Variables
• Response variable: measures the outcome of a study.
  • Often called a dependent variable.
• Explanatory variable: may explain or influence changes in a response variable.
  • Often called an independent or predictor variable.

Displaying Relationships: Scatterplots
• The most useful graph for displaying the relationship between two quantitative variables is a scatterplot.
• It shows the relationship between two quantitative variables measured on the same individuals.
• Always plot the explanatory variable (x) on the horizontal axis and the response variable (y) on the vertical axis.

Interpreting Scatterplots
• First: look at the overall pattern and for any striking deviations from that pattern.
• Next: describe the pattern by the direction, form and strength of the relationship.
• An important kind of deviation is an outlier.
• Positive Association: when above-average values of one variable tend to accompany above-average values of the other variable.
• Negative Association: when above-average values of one variable tend to accompany below-average values of the other variable.

Adding Categorical Variables to Scatterplots
• To add a categorical variable, use a different plot colour or symbol for each category.

Measuring Linear Association: Correlation
• The correlation r measures the direction and strength of the linear relationship between two quantitative variables.
• The equation for r is: $r = \frac{1}{n-1}\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
• The values for the first individual are $x_1$ and $y_1$, the values for the second individual are $x_2$ and $y_2$, and so on.
• The mean and standard deviation of x are $\bar{x}$ and $s_x$; the mean and standard deviation of y are $\bar{y}$ and $s_y$.
• The formula starts by standardizing the observations (from Chapter 3). Standardize the values of x and y by: $\frac{x_i - \bar{x}}{s_x}$ and $\frac{y_i - \bar{y}}{s_y}$
• The correlation r is an average of the products of the standardized x and y values for all the individuals.

Facts about Correlation
• It makes no distinction between explanatory and response variables.
• Because r uses standardized values, r does not change when we change the units of measurement of x and y.
• Positive r indicates positive association, and negative r indicates negative association.
• The correlation r is always a number between −1 and +1.
• Values close to 0 indicate a weak linear relationship, while values close to −1 or +1 indicate a strong linear relationship.
• Correlation requires that both variables be quantitative.
• Correlation measures only the strength of linear relationships between two variables.
• The correlation r is strongly affected by a few outlying observations.

Chapter 5: Regression

Regression Lines
• A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.
• We often use a regression line to predict the value of y for a given value of x.
• Suppose that y is a response variable, plotted on the vertical axis, and x is an explanatory variable, plotted on the horizontal axis. A straight line relating y to x has the equation: $y = a + bx$
• In this equation, b is the slope: the amount by which y changes when x increases by one unit.
• Slope is the rate of change in the response variable, on average, as the explanatory variable increases.
• In this equation, a is the intercept: the value of y when x = 0.
• To plot this line on a scatterplot, use the equation to find the predicted y for 2 values of x, one near each end of the range of x.
• Plot each predicted y above its x value, and draw a line through the two points.

The Least Squares Regression Line
• We need a way to draw a line that doesn't require guessing where it should go. A good regression line makes the vertical distances of the points from the line as small as possible.
• The Least Squares Regression Line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
• To get the equation for the least squares regression line:
  1. Calculate the means $\bar{x}$ and $\bar{y}$.
  2. Calculate the standard deviations $s_x$ and $s_y$.
  3. Calculate the correlation r.
• The least squares regression line is then: $\hat{y} = a + bx$
  with slope: $b = r\,\frac{s_y}{s_x}$
  and intercept: $a = \bar{y} - b\bar{x}$
• It is important to remember that this line gives us a predicted value of y for each x; that's why there is a hat on y.

Facts about the Least Squares Regression Line
• The distinction between the explanatory variable x and the response variable y is essential in regression.
• There is a close connection between the correlation and the slope of the least squares regression line: they always have the same sign.
• The least squares regression line always passes through the point $(\bar{x}, \bar{y})$, the means of x and y.
• The square of the correlation, $r^2$, is the fraction of the variation in the values of y that is explained by the least squares regression of y on x.

Residuals
• A residual is the difference between an observed value of the response variable and the value predicted by the regression line.
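The correlation formula from Chapter 4 and the slope, intercept and residual definitions above can be sketched in a few lines of Python. The five (x, y) pairs are hypothetical values made up for illustration; they do not come from the notes, and the variable names are likewise illustrative.

```python
from statistics import mean, stdev

# Hypothetical data, NOT from the notes; chosen only to illustrate the formulas.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (divide by n - 1)

# Correlation: sum of products of standardized x and y, divided by n - 1.
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)

# Least squares regression line: y_hat = a + b * x.
b = r * s_y / s_x        # slope
a = y_bar - b * x_bar    # intercept

# Residual = observed y - predicted y, one per data point.
predicted = [a + b * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]

print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
print(f"y_hat = {a:.2f} + {b:.2f} x")
print("residuals:", [round(e, 3) for e in residuals])
```

With this made-up data the slope and r share the same (positive) sign, and plugging x̄ into the fitted line returns ȳ, in line with the facts listed above.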
• A residual is a prediction error that remains after we have chosen the regression line: Residual = observed y − predicted y = $y - \hat{y}$
• There is a residual for each data point, which makes residuals tedious to compute by hand because you have to find the predicted response $\hat{y}$ for every x.
• Residuals calculated from a least squares regression line have a special property: the mean of the least squares residuals is always 0.
• Residual Plot: a scatterplot of the regression residuals against the explanatory variable. It helps us assess how well a regression line fits the data.

Influential Observations
• An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation.
• Points that are outliers in the x or y direction of a scatterplot are sometimes influential.
• If the outlier does not lie close to the regression line, it will be influential; if it does lie close to the line, it won't be as influential.

Cautions about Correlation and Regression
• Correlation and regression lines describe only linear patterns, and they are NOT resistant.
• Ecological Correlations: correlations based on averages rather than on individuals are misleading if they aren't interpreted properly.
• Extrapolation: the use of a regression line to predict a value of y for an x far outside the range of x values used to fit the line.
• Such predictions are usually not accurate.
• Lurking Variables: the relationship between two variables can often be understood only by taking other variables into account.

Association does not Imply Causation
• A strong association between two variables is not enough to draw conclusions about cause and effect.
• Just because an explanatory variable x and a response variable y have a very strong correlation, this is not evidence that changes in x actually cause changes in y.

Chapter 6: Two-Way Tables
• Two-Way Table: describes two categorical variables.

Marginal Distributions
• The marginal distribution of one of the categorical variables in a two-way table is the distribution of values of that variable among all individuals described in the table.
• The distribution of each variable on its own is called a marginal distribution because it appears at the right and bottom margins of the two-way table.
• Roundoff Error: when percentages are added, the total may come out as, say, 99.9% because each percentage was rounded to the nearest tenth.
• Each marginal distribution from a two-way table is a distribution for a single categorical variable. We can use a bar graph or a pie chart to display it.

Conditional Distributions
• A conditional distribution is the distribution of values of one variable among only those individuals who have a certain value of the other variable. There is a separate conditional distribution for each value of the other variable.
• It is called conditional because the distribution describes only the individuals who satisfy a condition.
• No single graph portrays the form of the relationship between categorical variables. No single numerical measure summarizes the strength of the association.

Simpson's Paradox
• An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group.
• The lurking variable in Simpson's Paradox is categorical.
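A small Python sketch of the reversal described above. The counts are hypothetical, invented for illustration only (two made-up hospitals, with patients grouped by a made-up "condition" variable); nothing here comes from the notes. Rates within each group are conditional distributions; the combined rates ignore the grouping variable.

```python
# Hypothetical counts, NOT from the notes: (deaths, patients) for each
# hospital, split by the patient's condition on arrival.
counts = {
    "A": {"good": (6, 600), "poor": (57, 1500)},
    "B": {"good": (8, 600), "poor": (8, 200)},
}

# Conditional death rates: computed separately within each condition group.
by_condition = {
    hospital: {cond: deaths / total for cond, (deaths, total) in groups.items()}
    for hospital, groups in counts.items()
}

# Combined death rates: the condition groups are pooled into a single group.
overall = {
    hospital: sum(d for d, _ in groups.values()) / sum(t for _, t in groups.values())
    for hospital, groups in counts.items()
}

for hospital in counts:
    good = by_condition[hospital]["good"]
    poor = by_condition[hospital]["poor"]
    print(f"Hospital {hospital}: good {good:.1%}, poor {poor:.1%}, overall {overall[hospital]:.1%}")
```

With these made-up counts, hospital A has the lower death rate within each condition group but the higher rate once the groups are combined. In this sketch the lurking variable is the patient's condition.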
• It breaks the individuals into groups, in which the true relationship is then visible.

Chapter 7: Exploring Data, Part 1 Review
A list of the skills that should be acquired after reading Chapters 1-6:

Data
• Identify the individuals and variables in a set of data.
• Identify each variable as categorical or quantitative. Identify the units in which each quantitative variable is measured.
• Identify the explanatory and response variables in situations where one variable explains or influences another.

Displaying Distributions
• Recognize when a pie chart can or cannot be used.
• Make a bar graph of the distribution of a categorical variable.
• Interpret pie charts and bar graphs.
• Make a histogram of the distribution of a quantitative variable.
• Make a stemplot of the distribution of a small set of observations. Round leaves or split stems as needed.
• Make a time plot of a quantitative variable over time.

Describing Distributions (Quantitative Variable)
• Look for the overall pattern and for major deviations from that pattern.
• Assess from a histogram or stemplot whether the shape of a distribution is symmetric, skewed, or neither, as well as how many peaks there are.
• Describe the overall pattern by giving numerical measures of center and spread in addition to a verbal description of shape.
• Decide which measures of center and spread are more appropriate: the mean and standard deviation, or the five-number summary.
• Recognize outliers and give plausible explanations for them.

Numerical Summaries of Distributions
• Find the median and the quartiles for a set of observations.
• Find the five-number summary and draw a boxplot; assess center, spread, symmetry and skew.
• Find the mean and the standard deviation for a set of observations.
• Understand that the median is more resistant than the mean.
• Recognize that skew in a distribution moves the mean away from the median toward the tail.
• Know the basic properties of the standard deviation.

Density Curves
• Know that areas under a density curve represent proportions of all observations and that the total area under a density curve is 1.
• Approximately locate the median and the mean on a density curve.
• Know that the mean and median both lie at the center of a symmetric density curve.
• Recognize the shape of Normal curves and estimate by eye both the mean and standard deviation of such a curve.
• Use the 68-95-99.7 rule and symmetry to state what percent of the observations from a Normal distribution fall between two points.
• Find the standardized value (z-score) of an observation. Interpret z-scores and understand that any Normal distribution becomes the standard Normal N(0, 1) distribution when standardized.
• Given that a variable has a Normal distribution with a stated mean and standard deviation, calculate the proportion of values above, below or between stated number(s).
• Given that a variable has a Normal distribution, calculate the point having a stated proportion of all values above it or below it.

Scatterplots and Correlations
• Make a scatterplot to display the relationship between two quantitative variables measured on the same subjects.
• Add a categorical variable to a scatterplot.
• Describe the direction, form and strength of the overall pattern; in particular, recognize positive or negative association and linear patterns. Recognize outliers.
• Judge whether it is appropriate to use correlation to describe the relationship between two quantitative variables. Find the correlation r.
• Know the basic properties of the correlation r.

Regression Lines
• Understand that regression requires an explanatory variable and a response variable. Correctly identify which variable is the explanatory variable and which is the response variable.
• Explain what the slope b and intercept a mean in the equation y = a + bx.
• Draw a graph of a regression line when you are given its equation.
• Use a regression line to predict y for a given x.
• Find the slope and intercept of the least-squares regression line from the means and standard deviations of x and y and the correlation r.
• Use r squared to describe how much of the variation in one variable can be accounted for by a straight-line relationship with the other.
• Recognize outliers and potentially influential observations from a scatterplot with the regression line drawn on it.
• Calculate the residuals and plot them against the explanatory variable x. Recognize that a residual plot magnifies the pattern of the scatterplot of y vs. x.

Cautions about Correlation and Regression
• Understand that both r and the least squares regression line can be strongly influenced by a few extreme observations.
• Recognize possible lurking variables that may explain the observed association between two variables.
• Understand that even a strong correlation does not mean that there is a cause-and-effect relationship.
• Give plausible explanations for an observed association between two variables: direct cause and effect, the influence of lurking variables, etc.

Categorical Data
• From a two-way table, find the marginal distributions of both variables by obtaining the row sums and column sums.
• Express any distribution in percentages by dividing the category counts by their total.
• Describe the relationship between two categorical variables by computing and comparing percents.
• Recognize Simpson's Paradox.

Questions to Do:
• 3.10 (pg 86)
• 3.28 (pg 91)
• 7.1 (pg 180)
• 7.7 – 7.10 (pg 181)
• 7.18 (pg 183)
• 7.20 – 7.23 (pg 184-185)
• 7.25 – 7.26 (pg 185-186)
• 7.31 – 7.33 (pg 188-189)
• 7.37 (pg 189)
• 7.38 (pg 190-191)

Chapter 8: Producing Data: Sampling

Population versus Sample
• Population: the entire group of individuals about which we want information.
• Sample: the part of the population from which we actually collect information.
• We use a sample to draw conclusions about the entire population.
• Sampling Design: describes exactly how to choose a sample from the population.
• The first step in planning a sample survey is to figure out exactly what population we want to describe, and then figure out exactly what we want to measure.

How to Sample Badly
• Convenience Sample: a sample selected by taking the members of the population that are easiest to reach. It often produces unrepresentative data.
• Bias: the design of a statistical study is biased if it systematically favours certain outcomes.
• Voluntary Response Sample: consists of people who choose themselves by responding to a broad appeal. Such samples are biased because people who respond are likely to have strong opinions.
• People who take the trouble to respond to an open invitation are usuall