University of Toronto St. George
Midterm

by
OneClass4850

Statistical Sciences

STA220H1

Augustin Vukov

Fall

Description

Chapter 2 Data
What are data?
W5H (who, what, when, where, why, and how) provides the context for data values.
Who and what are essential; without them, we have no useful information
Who -> cases
What -> variables
A variable gives information about each of the cases
The why helps us decide which way to treat the variables
Data tables
Organizing data into a data table makes the information clearer
Who
The rows of a data table correspond to individual cases about whom (or about which, if they're not people) we
record some characteristics
Often, the cases are a sample of cases selected from some larger population that we'd like to understand
What and why
The characteristics recorded about each individual are called variables
When a variable names categories and answers questions about how cases fall into those categories -> categorical
(qualitative) variable
When a measured variable with units answers questions about the quantity of what is measured -> quantitative
variable
Sometimes we treat a variable as categorical or quantitative depending on what we want to learn from it, which
means some variables can't be pigeon-holed as one type or the other
Counts count
When we count the cases in each category of a categorical variable, the category labels are the what and the
individuals counted are the who of our data
Identifying identifiers
Identifier variables don't tell us anything useful about the categories because we know there is exactly one
individual in each
Where, when, and how
How data are collected can make the difference between insight and nonsense
What can go wrong
Don't label a variable as categorical or quantitative without thinking about the question you want it to answer
Just because your variable's values are numbers, don't assume that it's quantitative
Always be sceptical. One reason to analyze data is to discover the truth
Terms
Data: systematically recorded information, whether numbers or labels, together with its context
Context: the context ideally tells who was measured, what was measured, how the data were collected, where the data
were collected, and when and why the study was performed
Data table: an arrangement of data in which each row represents a case and each column represents a variable
Case: a case is an individual about whom or which we have data
Variable: a variable holds information about the same characteristic for many cases
Unit: a quantity or amount adopted as a standard of measurement, such as dollars, hours, or grams
Categorical variable: a variable that names categories (whether with words or numerals) is called categorical
Quantitative variable: a variable in which the numbers act as numerical values is called quantitative. Quantitative
variables always have units
Identifier variable: a variable holding a unique name, ID number, or other identification for a case. Identifiers are
particularly useful in matching data from two different databases or relations
Chapter 3 Displaying and Describing Categorical Data
Frequency tables: making piles
We can organize counts into a frequency table, which records the totals and the category names
Counts are useful, but sometimes we want to know the fraction or proportion of the data in each category, so
we divide the counts by the total number of cases
Often we multiply by 100 to express these proportions as percentages
A relative frequency table displays the proportions or percentages, rather than the counts, of the values in each
category
Both types of tables show how the cases are distributed across the categories
In this way, they describe the distribution of a categorical variable because they name the possible categories
and tell how frequently each occurs
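The steps above (count the cases in each category, divide by the total, multiply by 100) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the ten ticket-class values are made up for the example and do not come from the course data.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: count} for a list of category labels."""
    return dict(Counter(values))

def relative_frequency_table(values):
    """Return {category: percentage}: each count divided by the total, times 100."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: 100 * count / total for category, count in counts.items()}

# Hypothetical data: one categorical variable recorded for ten cases.
ticket_class = ["First", "Third", "Crew", "Third", "Second",
                "Crew", "Third", "First", "Crew", "Crew"]

counts = frequency_table(ticket_class)
percents = relative_frequency_table(ticket_class)

# Print the two tables side by side: category, count, percentage.
for category in counts:
    print(f"{category:<8}{counts[category]:>3}{percents[category]:>7.1f}%")
```

Both tables name the same categories; only the second column changes from counts to percentages, which is why either one describes the distribution.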
The area principle
The best data displays observe a fundamental principle of graphing called the area principle
The area principle says that the area occupied by a part of the graph should correspond to the magnitude of the
value it represents
Bar charts
A bar chart displays the distribution of a categorical variable, showing the counts for each category next to each
other for easy comparison
If we really want to draw attention to the relative proportion of cases (e.g., passengers) falling into each
category, we could replace the counts with percentages and use a relative frequency bar chart
Pie charts
Pie charts show the whole group of cases as a circle
They slice the circle into pieces whose size is proportional to the fraction of the whole in each category
Before you make a bar chart or a pie chart, always check the categorical data condition: the data are summary
counts or percentages of individuals in categories
Contingency tables
Because the table shows how the individuals are distributed along each variable, contingent on the value of the
other variable, such a table is called a contingency table
The margins of the table, on the right and at the bottom, give totals
When presented like this, in the margins of a contingency table, the frequency distribution of one of the
variables is called its marginal distribution
Conditional distributions
By focusing on each row separately, we see the distribution under a condition restricting the who
Conditional distributions show the distribution of one variable for just those cases that satisfy a condition on
another variable
In a contingency table, when the distribution of one variable is the same for all categories of another, we say
that the variables are independent: there's no association between these variables
If the conditional distributions of one variable are (roughly) the same for every category of the other variable,
we treat the variables as independent
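As a rough sketch of how marginal and conditional distributions come out of a contingency table, the Python below builds the cell counts from a list of cases. The eight cases, the variable names, and the 0.05 tolerance used for "roughly the same" are all assumptions chosen for illustration, not values from the notes.

```python
from collections import Counter

# Hypothetical cases: (row variable, column variable) pairs.
cases = [("First", "Alive"), ("First", "Alive"), ("First", "Dead"),
         ("Third", "Alive"), ("Third", "Dead"), ("Third", "Dead"),
         ("Third", "Dead"), ("First", "Alive")]

cell_counts = Counter(cases)                 # counts in the body of the table
rows = sorted({r for r, _ in cases})
cols = sorted({c for _, c in cases})

# Marginal distributions: the totals in the right and bottom margins.
row_totals = {r: sum(cell_counts[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(cell_counts[(r, c)] for r in rows) for c in cols}

# Conditional distribution of the column variable within each row:
# restrict the "who" to one row, then divide by that row's total.
conditional = {r: {c: cell_counts[(r, c)] / row_totals[r] for c in cols}
               for r in rows}

def roughly_independent(conditional, tolerance=0.05):
    """True if every row's conditional distribution is within `tolerance`
    of the first row's. The tolerance is an arbitrary illustrative choice."""
    dists = list(conditional.values())
    first = dists[0]
    return all(abs(d[c] - first[c]) <= tolerance for d in dists for c in first)

print(row_totals, col_totals)
print(conditional)
print("independent?", roughly_independent(conditional))
```

With these made-up counts the two rows have very different conditional distributions, so the check reports an association rather than independence.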
Segmented bar charts
The resulting segmented bar chart treats each bar as the whole and divides it proportionally into segments
corresponding to the percentage in each group
What can go wrong
Don't violate the area principle
Keep it honest
Don't confuse similar-sounding percentages
Don't forget to look at the variables separately
Be sure to use enough individuals
Don't overstate your case
Don't use unfair or silly averages
Terms
Frequency table (relative frequency table): a frequency table lists the categories in a categorical variable and gives the
count (or percentage) of observations for each category
Distribution: the distribution of a variable gives:
The possible values of the variable and
The relative frequency of each value
Area principle: in a statistical display, each data value should be represented by the same amount of area
Bar chart (relative frequency bar chart): bar charts show a bar whose area represents the count (or percentage) of
observations for each category of a categorical variable
Pie chart: pie charts show how a whole divides into categories by showing a wedge of a circle whose area
corresponds to the proportion in each category
Categorical data condition: the methods in this chapter are appropriate for displaying and describing categorical data.
Be careful not to use them with quantitative data
Contingency table: a contingency table displays counts and, sometimes, percentages of individuals falling into named
categories on two or more variables. The table categorizes the individuals on all variables at once to reveal
possible patterns in one variable that may be contingent on the category of the other
Marginal distribution: in a contingency table, the distribution of either variable alone is called the marginal distribution.
The counts or percentages are the totals found in the margins (last row or column) of the table
Conditional distribution: the distribution of a variable restricting the who to consider only a smaller group of individuals
is called a conditional distribution
Independence: variables are said to be independent if the conditional distribution of one variable is the same for each
category of the other
Segmented bar chart: a segmented bar chart displays the conditional distribution of a categorical variable within each
category of another variable
Simpson's paradox: relationships among proportions taken within different groups or subsets can appear to contradict
relationships among the grand or overall proportions.
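A small numerical illustration may make Simpson's paradox concrete. The counts below are invented for the purpose: B has the higher success rate within each condition, yet A has the higher overall rate, because A's attempts are concentrated in the easier condition. The names `results`, `rate`, and `overall_rate` are hypothetical.

```python
# Hypothetical (successes, attempts) for two performers under two conditions.
results = {
    "A": {"easy": (90, 100), "hard": (10, 20)},
    "B": {"easy": (19, 20), "hard": (75, 100)},
}

def rate(successes, attempts):
    """Proportion of successes within one group."""
    return successes / attempts

def overall_rate(groups):
    """Pool successes and attempts over all groups, then take the proportion."""
    total_successes = sum(s for s, _ in groups.values())
    total_attempts = sum(n for _, n in groups.values())
    return total_successes / total_attempts

for name, groups in results.items():
    within = {g: round(100 * rate(*sa), 1) for g, sa in groups.items()}
    print(name, "within groups (%):", within,
          "overall (%):", round(100 * overall_rate(groups), 1))
```

The lesson matches the "unfair or silly averages" warning: pooling across groups of very different sizes can reverse the comparison seen inside every group.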
Chapter 4 Displaying and Summarizing Quantitative Data
Histograms
The bins, together with these counts, give the distribution of the quantitative variable and provide the building
blocks for the histogram
By representing the counts as bars and plotting them against the bin values, the histogram displays the
distribution at a glance
In a histogram, the bins slice up all the values of the quantitative variable, so any spaces in a histogram are
actual gaps in the data, indicating a region where there are no observed values
A relative frequency histogram replaces the counts on the vertical axis with the percentage or proportion of the
total number of cases falling in each bin
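The binning behind a histogram can be sketched as follows. This is a minimal sketch that assumes equal-width bins, each including its left edge, and that no value falls below `start`; the eight data values are made up. Empty bins are kept so that gaps in the data stay visible.

```python
import math
from collections import Counter

def bin_counts(values, bin_width, start):
    """Count values in equal-width bins [start + k*w, start + (k+1)*w).
    Assumes every value is >= start. Returns (left edge, count) pairs,
    including empty bins between the first and last occupied bin."""
    counts = Counter(math.floor((v - start) / bin_width) for v in values)
    last = max(counts)
    return [(start + k * bin_width, counts.get(k, 0)) for k in range(last + 1)]

# Hypothetical quantitative data (units would be known in a real study).
values = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 7.0]
bins = bin_counts(values, bin_width=1, start=1)

# Relative frequency version: percentage of all cases in each bin.
relative = [(left, 100 * count / len(values)) for left, count in bins]

# A crude text histogram: one '#' per case in the bin.
for left, count in bins:
    print(f"{left:>3} | {'#' * count}")
```

The bins starting at 4, 5, and 6 print as empty rows, which is the text equivalent of a gap in the histogram.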
Stem-and-leaf displays
A stem-and-leaf display is like a histogram, but it shows the individual values
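To see how a stem-and-leaf display keeps the individual values, here is a small sketch that assumes non-negative integer data, with the tens digit as the stem and the ones digit as the leaf. The ten values and the function name `stem_and_leaf` are made up for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (value // 10) to its sorted leaves (value % 10).
    Assumes non-negative integers; stems with no leaves are kept so
    gaps remain visible, as in a histogram."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    low, high = min(stems), max(stems)
    return {s: stems.get(s, []) for s in range(low, high + 1)}

# Hypothetical data values.
data = [88, 80, 76, 72, 68, 64, 92, 84, 72, 56]
display = stem_and_leaf(data)

for stem, leaves in display.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

Each row has the same shape as a histogram bar of width 10, but the leaves let you read back every original value.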
Dotplots
A dotplot is a simple display
It just places a dot along an axis for each case in the data
Think before you draw
Quantitative data condition: the data are values of a quantitative variable whose units are known
When you describe a distribution, you should always discuss 3 things: shape, centre, and spread
The shape of a distribution
1. Humps are called modes
