# STAB22H3 Chapter Notes - Chapter 1-6: The Leaves, Bar Chart, Contingency Table

STAB22: TEXTBOOK NOTES

PART 1: CHAPTERS 1-6

Naomi

Alphonsus

CHAPTER 1: STATS STARTS HERE

Statistics (the discipline) is a way of reasoning, along with a collection of tools and methods, designed to help us understand the world

Statistics (plural) are particular calculations made from data (values along with their context)

Data vary, so statistics is about variation. Data vary b/c we can’t see everything, let alone measure it all, and even what we do see and measure, we measure imperfectly

Statistics makes sense of the world by allowing us to understand and model the variation so that we can see the underlying patterns

Best way to understand statistics is to see it at work posing questions about the world

CHAPTER 2: DATA

Collecting data on their customers, transactions and sales lets companies track their inventory and helps them predict what their customers prefer

These data can be used to help companies predict what their customers will likely buy in the future so they can determine how much of each item to stock. Information in the data can also be used to improve customer service

Data are useless without their context. The context can be set by answering the 5 W’s & H questions. The two most important questions are “Who” and “What”: if you can’t answer “Who” and “What”, you don’t have data and you don’t have any useful information

The data in Table 2.1 on page 8 have no context, which is why we can’t understand what the figures mean. We can make the meaning clear if we organize the values into a data table (Table 2.2, p. 9). The rows in a data table answer the “Who” question

Who

The rows of a data table correspond to individual cases about whom/which we record some characteristics

Cases go by different names:

Respondents: individuals who answer a survey

Subjects/participants: people on whom we experiment

Experimental units: animals, plants, web sites and other inanimate subjects

Records/cases: the rows in a data table

We often refer to data values as observations w/o being clear about the “Who”

o Unless you know the “Who” of the data, you won’t be able to understand what the data mean

To be able to generalize from the sample of cases selected from some larger population that we’d like to understand, we want the sample to be representative of that population

What and Why

Variables: the characteristics recorded about each individual, usually shown as the columns of a data table. Each has a name that identifies what has been measured

Variables play different roles, and you can’t tell a variable’s role just by looking at it. Some variables just tell us what group or category each individual belongs to

Some variables have units, which tell how each value has been measured. The units can tell us how much of something we have, or how far apart two values are. Without units, the values of a measured variable have no meaning

Two types of variables that we need to understand:

Categorical/qualitative variables: variables that name categories and answer questions about how cases fall into those categories

o If the values of a variable are words rather than numbers, it’s highly likely that it is a categorical variable

Quantitative variables: measured variables with units that answer questions about the quantity of what is measured

Some variables can be treated as either categorical or quantitative, depending on the question being asked

o Ex. Amazon.com could ask your age in years (seems quantitative). It would be quantitative if they wanted to know the average age of those customers who visit their site after 3am, but if they want to figure out what CD to offer you in a special deal, then thinking of your age as belonging to one of the categories of child, teen, adult or senior would be more useful

You must always look to the “Why” of your study to decide whether to treat a variable as categorical or quantitative
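
A minimal Python sketch of the age example above. The ages, the cutoffs for child/teen/adult/senior and the function name `age_group` are all assumptions made for illustration; the notes do not specify them.

```python
from collections import Counter
from statistics import mean

# Hypothetical ages in years (made-up values, not real customer data)
ages = [12, 15, 17, 24, 35, 41, 68, 70]

# Quantitative use: the units (years) matter, so an average makes sense
average_age = mean(ages)

def age_group(age):
    """Map an age in years to a category. The cutoffs are assumed."""
    if age < 13:
        return "child"
    if age < 20:
        return "teen"
    if age < 65:
        return "adult"
    return "senior"

# Categorical use: only the group each customer falls into matters
group_counts = Counter(age_group(a) for a in ages)

print(average_age)
print(dict(group_counts))
```

The same recorded variable supports both treatments; the “Why” of the study decides which one is useful.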

Variables that report order without natural units are called ordinal variables

The how of data refers to the methods used to collect the data

Counts Count

Counting is a natural way to summarize the categorical variable shipping method (refers to the example of Amazon’s special offer or free shipping), but the word “counts” doesn’t necessarily mean categorical

We use counts in two different ways:

When we count the cases in each category of a categorical variable, the category labels are the “what” and the individuals counted are the “who” of our data

When we use counts to measure the amount of something (ex. how many songs are on your iPod), the count itself is a quantitative “what”

Identifying Identifiers

Identifier variables themselves don’t tell us anything useful about the categories b/c we know there is exactly one individual in each, but they are crucial in this age of large data sets

They make it possible to combine data from different sources, to protect confidentiality and to provide unique labels. They are categorical variables with just one individual in each category (ex. UPS tracking number, SIN, student number)

It is important to recognize when a variable plays the role of an identifier so you don’t analyze it

Just because a variable has one case per category doesn’t limit it to being an identifier
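
One way to spot a possible identifier is to check whether every value in a column occurs exactly once. The sketch below assumes a tiny made-up table stored as a list of dicts; the column names and the helper `has_one_case_per_value` are hypothetical. As the last point above warns, passing this check does not prove a variable is only an identifier, so the result is a prompt to think about the variable’s role, not a verdict.

```python
# Hypothetical data table: each dict is one case (row)
rows = [
    {"student_number": "1001", "program": "Stats", "age": 19},
    {"student_number": "1002", "program": "Math", "age": 20},
    {"student_number": "1003", "program": "Stats", "age": 21},
]

def has_one_case_per_value(rows, column):
    """Return True if every value in the column appears exactly once."""
    values = [row[column] for row in rows]
    return len(set(values)) == len(values)

for column in rows[0]:
    print(column, has_one_case_per_value(rows, column))
```

In this made-up table `age` also has one case per value, yet it is a quantitative variable rather than an identifier.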

CHAPTER 3: DISPLAYING AND DESCRIBING CATEGORICAL DATA

Recall: in a data table, the rows represent cases and the columns represent variables

The problem with data tables is that you can’t see what’s going on: you can’t really identify patterns, relationships, trends or exceptions

The Three Rules of Data Analysis

1. Make a picture that will display your data in a way that reveals things you aren’t likely to see in a table of numbers

2. Make a picture that will show the important features and patterns in your data, including things you did not expect to see (unexpected patterns or extraordinary data values)

3. Make a picture that best communicates your data to others

Frequency Tables: Making Piles

To make a picture of data, the first step is to make piles -> pile together things that seem to go together to see how the cases distribute across different categories

For categorical data you just count the number of cases corresponding to each category and pile them up

Frequency table records the totals and the category names

Relative frequency table: displays the proportions or percentages, rather than the counts of the values in each category

Both types of frequency tables describe the distribution of a categorical variable b/c they name the possible categories and tell how frequently each occurs
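
The two kinds of table described above can be sketched in a few lines of Python. The ten category values below are made up for illustration (they are not the textbook’s data), and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical ticket classes for ten cases (made-up data)
classes = ["Third", "Crew", "First", "Third", "Crew",
           "Second", "Third", "Crew", "First", "Crew"]

# Frequency table: category name -> count
frequency = Counter(classes)

# Relative frequency table: category name -> percentage of all cases
total = sum(frequency.values())
relative_frequency = {cat: 100 * n / total for cat, n in frequency.items()}

# Print category, count and percentage side by side
for cat, n in frequency.items():
    print(f"{cat:<8}{n:>3}{relative_frequency[cat]:>7.1f}%")
```

Both tables list the same categories; only the second column changes from counts to percentages.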

The Area Principle

Experience and psychological tests show that our eyes tend to be more impressed by the area than by other aspects of each image (refer to Figure 3.2 on p. 22)

Area principle: a fundamental principle of graphing data that states that the area occupied by a part of the graph should correspond to the magnitude of the value it represents. Basically, a bigger value corresponds to a bigger area and a smaller value to a smaller area
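
A small arithmetic sketch of why the principle matters, using assumed values. If one value is twice another, a drawing that scales only the height doubles the area, while a picture scaled in both width and height quadruples it and so overstates the difference.

```python
# Hypothetical values: the second is twice the first
small_value, large_value = 10, 20
value_ratio = large_value / small_value

# Bar-style drawing: equal widths, height proportional to the value
width = 1.0
bar_area_ratio = (width * large_value) / (width * small_value)

# Picture-style drawing: both width and height are scaled by the value ratio
small_area = width * small_value
scaled_area = (width * value_ratio) * (small_value * value_ratio)
picture_area_ratio = scaled_area / small_area

print(value_ratio, bar_area_ratio, picture_area_ratio)
```

Only the bar-style drawing keeps the area ratio equal to the value ratio.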

Bar Charts

A bar chart obeys the area principle and gives an accurate visual impression of the distribution of values

The height of each bar shows the count for its category. The bars have equal widths, so their heights determine their areas, and the areas are proportional to the counts

Bar chart: displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison. Bar charts have small spaces b/w the bars to show that the freestanding bars can be rearranged in any order

You can also use a relative frequency bar chart to draw attention to the relative proportion of cases in each category (ex. the percentage of passengers aboard the Titanic falling into each class category). Simply replace the counts with percentages and graph that
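
A text-only sketch of both kinds of bar chart, using made-up counts and a hypothetical helper `bar_lines`. Every bar has the same thickness (one line of text), so its length alone carries the value, in line with the area principle.

```python
# Hypothetical counts per category (made-up data)
counts = {"First": 2, "Second": 1, "Third": 3, "Crew": 4}

def bar_lines(values, unit=1.0, suffix=""):
    """Build one text bar per category; bar length is proportional to the value."""
    lines = []
    for category, value in values.items():
        bar = "#" * round(value / unit)
        lines.append(f"{category:<7} {bar} {value:g}{suffix}")
    return lines

# Bar chart of counts: one '#' per case
print("\n".join(bar_lines(counts)))

# Relative frequency bar chart: same bars, percentages replace the counts
total = sum(counts.values())
percentages = {cat: 100 * n / total for cat, n in counts.items()}
print("\n".join(bar_lines(percentages, unit=10, suffix="%")))
```

The two charts have identical shapes; only the labels on the scale differ.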

Pie Charts

Pie charts: show the whole group of cases as a circle, sliced into pieces whose size is proportional to the fraction of the whole in each category

Pie charts give a quick impression of how a whole group is portioned off into smaller groups. If you are comparing many categories, it is easier to display and communicate the data in a bar chart
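
The slicing rule for a pie chart can be sketched as a short calculation: each slice’s angle is its category’s fraction of the whole times 360 degrees. The counts are made up for illustration.

```python
# Hypothetical counts per category (made-up data)
counts = {"First": 2, "Second": 1, "Third": 3, "Crew": 4}

total = sum(counts.values())

# Each slice's angle is its fraction of the whole times 360 degrees
slice_angles = {cat: 360 * n / total for cat, n in counts.items()}

for cat, angle in slice_angles.items():
    print(f"{cat}: {angle:.0f} degrees")
```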