midterm notes.docx

20 Pages

Statistical Sciences
Augustin Vukov

Chapter 2 Data

What are data?
W5H (Who, What, When, Where, Why, and How) provides the context for data values.
Who and What are essential; without them, there is no useful information.
Who: the cases.
What: the variables. A variable gives information about each of the cases.
The Why helps us decide which way to treat the variables.

Data tables
Organizing data into a data table makes the information clearer.

Who
The rows of a data table correspond to individual cases about whom (or about which, if they're not people) we record some characteristics.
Often, the cases are a sample selected from some larger population that we'd like to understand.

What and why
The characteristics recorded about each individual are called variables.
When a variable names categories and answers questions about how cases fall into those categories, it is a categorical (qualitative) variable.
When a measured variable with units answers questions about the quantity of what is measured, it is a quantitative variable.
Sometimes we treat a variable as categorical or quantitative depending on what we want to learn from it, which means some variables can't be pigeon-holed as one type or the other.

Counts count
When we count the cases in each category of a categorical variable, the category labels are the What and the individuals counted are the Who of our data.

Identifying identifiers
Identifier variables don't tell us anything useful about the categories because we know there is exactly one individual in each.

Where, when, and how
How data are collected can make the difference between insight and nonsense.

What can go wrong
Don't label a variable as categorical or quantitative without thinking about the question you want it to answer.
Just because a variable's values are numbers, don't assume that it's quantitative.
Always be sceptical. One reason to analyze data is to discover the truth.

Terms
Data: systematically recorded information, whether numbers or labels, together with its context.
Context: the context ideally tells who was measured, what was measured, how the data were collected, where the data were collected, and when and why the study was performed.
Data table: an arrangement of data in which each row represents a case and each column represents a variable.
Case: an individual about whom or which we have data.
Variable: a variable holds information about the same characteristic for many cases.
Unit: a quantity or amount adopted as a standard of measurement, such as dollars, hours, or grams.
Categorical variable: a variable that names categories (whether with words or numerals).
Quantitative variable: a variable in which the numbers act as numerical values. Quantitative variables always have units.
Identifier variable: a variable holding a unique name, ID number, or other identification for a case.
Identifiers are particularly useful in matching data from two different databases or relations.

Chapter 3 Displaying and Describing Categorical Data

Frequency tables: making piles
We can organize counts into a frequency table, which records the totals and the category names.
Counts are useful, but sometimes we want to know the fraction or proportion of the data in each category, so we divide the counts by the total number of cases.
Often we multiply by 100 to express these proportions as percentages.
A relative frequency table displays the proportions or percentages, rather than the counts, of the values in each category.
Both types of tables show how the cases are distributed across the categories.
In this way, they describe the distribution of a categorical variable because they name the possible categories and tell how frequently each occurs.

The area principle
The best data displays observe a fundamental principle of graphing called the area principle.
The area principle says that the area occupied by a part of the graph should correspond to the magnitude of the value it represents.

Bar charts
A bar chart displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison.
If we want to draw attention to the relative proportion of cases falling into each category (in the example, passengers in each class), we can replace the counts with percentages and use a relative frequency bar chart.

Pie charts
Pie charts show the whole group of cases as a circle.
They slice the circle into pieces whose size is proportional to the fraction of the whole in each category.
Before you make a bar chart or a pie chart, always check the categorical data condition: the data are summary counts or percentages of individuals in categories.

Contingency tables
Because the table shows how the individuals are distributed along each variable, contingent on the value of the other variable, such a table is called a contingency table.
The margins of the table, on the right and at the bottom, give totals.
When presented like this, in the margins of a contingency table, the frequency distribution of one of the variables is called its marginal distribution.

Conditional distributions
By focusing on each row separately, we see the distribution under a condition restricting the Who.
Conditional distributions show the distribution of one variable for just those cases that satisfy a condition on another variable.
In a contingency table, when the distribution of one variable is the same for all categories of another, we say that the variables are independent: there's no association between these variables.
In practice, we look at whether the conditional distributions of one variable are (roughly) the same for every category of the other variable.

Segmented bar charts
A segmented bar chart treats each bar as the whole and divides it proportionally into segments corresponding to the percentage in each group.

What can go wrong
Don't violate the area principle.
Keep it honest.
Don't confuse similar-sounding percentages.
Don't forget to look at the variables separately.
Be sure to use enough individuals.
Don't overstate your case.
Don't use unfair or silly averages.

Terms
Frequency table (relative frequency table): a frequency table lists the categories in a categorical variable and gives the count (or percentage) of observations for each category.
Distribution: the distribution of a variable gives the possible values of the variable and the relative frequency of each value.
Area principle: in a statistical display, each data value should be represented by the same amount of area.
Bar chart (relative frequency bar chart): bar charts show a bar whose area represents the count (or percentage) of observations for each category of a categorical variable.
Pie chart: pie charts show how a whole divides into categories by showing a wedge of a circle whose area corresponds to the proportion in each category.
Categorical data condition: the methods in this chapter are appropriate for displaying and describing categorical data. Be careful not to use them with quantitative data.
Contingency table: a contingency table displays counts and, sometimes, percentages of individuals falling into named categories on two or more variables. The table categorizes the individuals on all variables at once to reveal possible patterns in one variable that may be contingent on the category of the other.
Marginal distribution: in a contingency table, the distribution of either variable alone is called the marginal distribution. The counts or percentages are the totals found in the margins (last row or column) of the table.
Conditional distribution: the distribution of a variable restricting the Who to consider only a smaller group of individuals is called a conditional distribution.
Independence: variables are said to be independent if the conditional distribution of one variable is the same for each category of the other.
Segmented bar chart: a segmented bar chart displays the conditional distribution of a categorical variable within each category of another variable.
Simpson's paradox: relationships among proportions taken within different groups or subsets can appear to contradict relationships among the grand or overall proportions.
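The conditional proportions and Simpson's paradox defined above may be easier to see with numbers. The following is a minimal Python sketch, not taken from the notes: the counts, the group names ("easy", "hard"), the option names ("A", "B"), and the helper `success_rate` are all invented for illustration.

```python
from fractions import Fraction

# Hypothetical counts (made up for illustration):
# (group, option) -> (number of successes, number of cases)
counts = {
    ("easy", "A"): (9, 10),
    ("easy", "B"): (80, 100),
    ("hard", "A"): (30, 100),
    ("hard", "B"): (2, 10),
}

def success_rate(option, group=None):
    """Proportion of successes for one option.

    With group=None this is the overall proportion; with a group given,
    it is the conditional proportion, restricting the Who to that group.
    """
    successes = cases = 0
    for (g, o), (s, n) in counts.items():
        if o == option and (group is None or g == group):
            successes += s
            cases += n
    return Fraction(successes, cases)

# Conditional proportions within each group, then the overall proportions.
for group in ("easy", "hard"):
    print(group, float(success_rate("A", group)), float(success_rate("B", group)))
print("overall", float(success_rate("A")), float(success_rate("B")))
```

In this made-up table, option A has the higher success proportion within each group, yet option B has the higher proportion overall, because most of B's cases fall in the "easy" group and most of A's fall in the "hard" group.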
Chapter 4 Displaying and Summarizing Quantitative Data

Histograms
The bins, together with the counts of cases falling in each, give the distribution of the quantitative variable and provide the building blocks for the histogram.
By representing the counts as bars and plotting them against the bin values, the histogram displays the distribution at a glance.
In a histogram, the bins slice up all the values of the quantitative variable, so any spaces in a histogram are actual gaps in the data, indicating a region where there are no observed values.
A relative frequency histogram replaces the counts on the vertical axis with the percentage or proportion of the total number of cases falling in each bin.

Stem-and-leaf displays
A stem-and-leaf display is like a histogram, but it shows the individual values.

Dotplots
A dotplot is a simple display. It just places a dot along an axis for each case in the data.

Think before you draw
Quantitative data condition: the data are values of a quantitative variable whose units are known.
When you describe a distribution, you should always discuss 3 things: shape, centre, and spread.

The shape of a distribution
1. Humps are called modes
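The binning behind a histogram, as described under Histograms above, can be sketched in a few lines of Python. This is only an illustration under assumptions: the data values, the bin width of 10, the starting point of 10, and the helper name `histogram_counts` are all hypothetical and do not come from the notes.

```python
import math

def histogram_counts(values, bin_width, start):
    """Count the values falling in each bin [start + k*w, start + (k+1)*w)."""
    n_bins = math.floor((max(values) - start) / bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[math.floor((v - start) / bin_width)] += 1
    return counts

# Hypothetical data values (made up for illustration).
values = [12, 15, 17, 21, 22, 22, 28, 41, 43]
bin_width, start = 10, 10

counts = histogram_counts(values, bin_width, start)
# Relative frequency histogram: proportion of all cases in each bin.
proportions = [c / len(values) for c in counts]

# Text version of the histogram: one row per bin, one '*' per case.
for k, (c, p) in enumerate(zip(counts, proportions)):
    low = start + k * bin_width
    print(f"{low:>3} to <{low + bin_width:<3} {'*' * c:<5} {p:.2f}")
```

With these made-up values, the bin from 30 to under 40 is empty, which shows up as a gap between the bars: a region with no observed values.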

Related notes for STA220H1
