October 3rd 2017

WEEK 4

Categorical Data

- best way to organize categorical data is through contingency tables

- with these tables we can look at subsets of the data

- purpose is to help display and organize categorical data

- dimensionality of a contingency table depends on how many categorical variables can

be measured on the observational unit

o one-way: measuring one variable, looking at it with columns

▪ total at the bottom of the chart is marginal distribution

o two-way: measuring 2 variables, looking at them with columns and rows

▪ rows distinguish the different variables

▪ we have 2 types of totals, one for each column and one for each row

▪ row totals and column totals are both marginal distributions

- sometimes the data is presented as proportions; that's done by dividing the number of

individuals in each cell by the total amount of people

- the total column and total row hold the marginal proportions; all of those are less than

one, the only thing that's 1 is the grand total amount of people used in the survey
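
A minimal Python sketch of the two-way table ideas above. The counts and category names (pet preference by gender) are made up for illustration and are not data from the lecture.

```python
# Hypothetical two-way table: rows = gender, columns = preferred pet.
# All counts are invented for illustration.
counts = {
    "male":   {"cat": 10, "dog": 30},
    "female": {"cat": 25, "dog": 35},
}
columns = ["cat", "dog"]

# Marginal totals: one total per row and one total per column.
row_totals = {row: sum(cells.values()) for row, cells in counts.items()}
col_totals = {col: sum(counts[row][col] for row in counts) for col in columns}
grand_total = sum(row_totals.values())

# Proportions: divide every cell by the grand total.
proportions = {
    row: {col: n / grand_total for col, n in cells.items()}
    for row, cells in counts.items()
}

# Marginal proportions: row and column totals divided by the grand total.
row_marginals = {row: t / grand_total for row, t in row_totals.items()}
col_marginals = {col: t / grand_total for col, t in col_totals.items()}

print("row totals:", row_totals)
print("column totals:", col_totals)
print("grand total:", grand_total)
print("row marginal proportions:", row_marginals)
print("column marginal proportions:", col_marginals)
```

Each marginal proportion is below 1 here; only the grand total divided by itself gives exactly 1.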

conditional distributions

- look at the effects of just one category while controlling for the effects of the other,

rather than looking at them at the same time

- you start with your raw data and you convert everything to percentages

- the way you convert it depends on if you're more interested in looking at the columns or

rows

- columns: we'll calculate the total for the columns as a conditional distribution

- rows: row total becomes conditional distribution

- since conditional distribution is only looking at column or a row it will always sum to

100%

- marginal distribution will not sum to 100% because you're comparing it to the total

number of people surveyed

- will be on the exam*****
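
A rough Python sketch of converting raw counts to conditional percentages, by row or by column. The table and the helper names `row_conditional` and `column_conditional` are hypothetical, chosen only for this example.

```python
# Hypothetical raw counts: rows = gender, columns = preferred pet.
counts = {
    "male":   {"cat": 10, "dog": 30},
    "female": {"cat": 25, "dog": 35},
}

def row_conditional(table):
    """Express each cell as a percentage of its own row total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: 100 * n / total for col, n in cells.items()}
    return result

def column_conditional(table):
    """Express each cell as a percentage of its own column total."""
    columns = list(next(iter(table.values())).keys())
    result = {}
    for col in columns:
        total = sum(table[row][col] for row in table)
        result[col] = {row: 100 * table[row][col] / total for row in table}
    return result

by_row = row_conditional(counts)
by_col = column_conditional(counts)

# Every conditional distribution sums to 100% (up to rounding error).
for name, dist in {**by_row, **by_col}.items():
    print(name, dist, "sum =", round(sum(dist.values()), 6))
```

Which function to use depends on whether the rows or the columns are the groups you want to compare.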

Bar Charts

- y-axis shows the number of cases

- x-axis shows each of the categories

- gap between the bars on the x-axis because it shows the first major categorical division-

gap between shows there's a category that's different

- pie charts are also used to display categorical data

o you can't read the raw counts off of a pie chart

o they contain much less information than we have from a bar chart

Grouped bar charts with 2-way information

- take one category and make it the primary category

- then within each grouping show the data for the 2 observation units ie) male and female

- gap between the primary groups but the secondary groupings are touching and do not

have a gap

- emphasizes the compounds first, then the gender ratios second

- grouped bar chart still contains all the info from the table

- the secondary groups can have many variables ie) 4 age groups
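
A small text-only sketch of the grouped layout described above, in Python. The counts, the labels "compound A" and "compound B", and the helper `grouped_bar_lines` are hypothetical; a real figure would normally come from a plotting library.

```python
# Hypothetical counts: primary category = compound, secondary = gender.
counts = {
    "compound A": {"male": 4, "female": 6},
    "compound B": {"male": 7, "female": 3},
}

def grouped_bar_lines(table):
    """Return text lines: one heading per primary group, one bar per secondary group."""
    lines = []
    for primary, groups in table.items():
        lines.append(primary)
        for secondary, n in groups.items():
            # Bars inside a primary group sit on adjacent lines (no gap).
            lines.append(f"  {secondary:<7}{'#' * n} {n}")
        lines.append("")  # blank line = gap between primary groups
    return lines

lines = grouped_bar_lines(counts)
print("\n".join(lines))
```

Every raw count is still readable from the output, which matches the point that a grouped bar chart keeps all the information from the table.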

Histograms

- how to visualize continuous and discrete numerical data

- plot of your sorted and binned data

- first step is to sort data from smallest to largest

- second step is to break that data down into different bins

- ex) breaking each bin into units of 5, there will be varying measures within each

grouping of the range of 5 units

- histogram plots the number of observations we have within each bin range

- no gaps between the bars- the x-axis is a binning of a numerical variable so they are supposed

to be continuous from one bin to the next

- gives us more richness about the data that we have, beyond measuring central tendency

and variation

- to compare 2 sets of units (male and female, US and Canada) you can plot the 2

histograms right on top of each other

- plot of the frequency for each bin
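
The sort-then-bin steps above can be sketched in Python as follows. The data values and the helper `bin_counts` are made up for illustration, and the choice of half-open bins (lower edge included, upper edge excluded) is an assumption, not something stated in the lecture.

```python
from collections import Counter

def bin_counts(values, width, start=0):
    """Count how many values fall in each bin [lo, lo + width)."""
    counts = Counter()
    for v in sorted(values):                # step 1: sort smallest to largest
        index = int((v - start) // width)   # step 2: find which bin v belongs to
        counts[start + index * width] += 1
    # Only bins that contain at least one value appear in the result.
    return dict(sorted(counts.items()))

# Hypothetical measurements, binned in units of 5.
data = [12, 3, 7, 18, 9, 14, 1, 6, 11, 13]
hist = bin_counts(data, width=5)

# Text histogram: one row per bin, one '#' per observation.
for lo, n in hist.items():
    print(f"[{lo:>2}, {lo + 5:>2})  {'#' * n}")
```

The bin frequencies always add up to the number of observations, which is a quick check that no value was dropped.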

Cumulative histograms

- plot the cumulative frequencies as you cross the different bins on the x-axis

- instead of plotting the frequency you sum the frequency going from the left to the right

on the x-axis

- first value in the cumulative histogram is the first value from your raw histogram

- the second value of the cumulative histogram is the sum of the first and second values from your

original histogram

- 3rd level is the sum of the first 3 bins

- the left hand side is always 0 and the right hand side is the total number of data points

- easy to compare multiple groups- draw them as different lines on the same figure
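
A short Python sketch of the running sum described above. The per-bin frequencies are hypothetical.

```python
from itertools import accumulate

# Hypothetical histogram frequencies, listed from the leftmost bin to the rightmost.
frequencies = [2, 3, 4, 1]

# Running total from left to right across the bins.
cumulative = list(accumulate(frequencies))

print(cumulative)
```

The first cumulative value equals the first bin's frequency, and the last equals the total number of data points.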

Box plots

- a way to visualize numerical data

- not quite as informative as a histogram

- but good visual description when you have large groups

- box plots visualize the quartiles of your data set

- y-axis is your numerical variable

- draw a black line to distinguish the median then do your first and 3rd quartile and draw a

line-that box has 50% of your data by definition and spans the IQR

- then draw on whiskers, they represent most of the remaining data but not all of it
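
A rough Python sketch of the numbers behind a box plot. The data are made up, and the whisker rule used here (whiskers reach the most extreme points within 1.5 x IQR of the box) is a common convention assumed for this example; the notes do not say which rule the lecture used. Quartile values also depend on the interpolation method, so other tools may give slightly different numbers.

```python
import statistics

# Hypothetical data with one unusually large value.
data = [1, 3, 6, 7, 9, 11, 12, 13, 14, 18, 40]

# Quartiles via the standard library (default "exclusive" method).
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # the box spans the IQR and holds the middle 50% of the data

# Assumed whisker rule: 1.5 * IQR fences beyond the box.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
inside = [x for x in data if low_fence <= x <= high_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Points beyond the whiskers are the data the whiskers do not cover.
outliers = [x for x in data if x < low_fence or x > high_fence]

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
print("whiskers:", whisker_low, whisker_high)
print("outside the whiskers:", outliers)
```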

