CHAPTER 22 - COMPARING TWO PROPORTIONS
WHERE ARE WE GOING?
- comparing 2 prop's
- want to see whether 2 groups are diff, or do they vary by chance
main text, p585
(EX- MASSA Drivers)
- 6971 male drivers
- seatbelt use
- highway safety
- n = 161 loc's in Massachusetts, using SRS
- F drivers wore belt >70% of the time, regardless of gender of passenger(s)
- out of 4,208 M drivers w/ F passengers, 2777 (66%) wore belts
- out of 2,763 M drivers w/ M passengers, only 1363 (49.3%) wore belts
p586
WHY COMPARE BETWEEN TWO PROPORTIONS (aka PERCENTAGES)?
- interested in finding out how 2 groups differ
(ex) is exptl treatment better than placebo?
(EX- MASSA Drivers)
- know: diff. in prop's of men wearing seatbelts from sample
- 66% - 49.3% = 16.7%
- more interested in: true difference for ALL men?
- it is not likely that the diff. we obtained is the truth, b/c prop's will vary from sample to sample
- to do this, req. a new ruler: SD of sampling distribution model for diff. in prop's
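The arithmetic above can be re-checked from the raw counts. A minimal sketch; the variable names are made up for illustration, the counts are the ones quoted in these notes:

```python
# Seatbelt counts from the notes (male drivers in the Massachusetts sample).
belted_f_pass, n_f_pass = 2777, 4208   # M drivers w/ F passengers
belted_m_pass, n_m_pass = 1363, 2763   # M drivers w/ M passengers

p_hat_f = belted_f_pass / n_f_pass     # sample prop., F passengers
p_hat_m = belted_m_pass / n_m_pass     # sample prop., M passengers
diff = p_hat_f - p_hat_m               # observed diff. in sample prop's

print(f"p_hat_f = {p_hat_f:.3f}")
print(f"p_hat_m = {p_hat_m:.3f}")
print(f"diff    = {diff:.3f}")
```

This only describes the sample; it says nothing yet about how far the diff. could sit from the true diff. for all men.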
The variance of sum or diff. of 2 indep. random var's is sum of their variances
=> aka for indep. random var's, variances always add (regardless of whether you are adding or
subtracting the 2 random var's)
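In symbols, the rule above can be sketched as follows (X and Y are placeholder names for any 2 indep. random var's):

```latex
\[
\mathrm{Var}(X \pm Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)
\qquad\Longrightarrow\qquad
SD(X \pm Y) = \sqrt{SD^2(X) + SD^2(Y)}
\]
```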
WHY DOES VARIATION INCREASE, DESPITE SUBTRACTING TWO RANDOM VARIABLES?
(An.) - bowl of cereal
- cereal box claims that there is 16oz of cereal in it
- this is not exact: b/c there is small variation from box to box
- when portion of cereal is poured into bowl (we want 2oz serving), we know that it will not be
exact; there is variation assoc. w/ this too
- qn: how much cereal is left in the box?
- is this guess as close as the guess for the full box?
- AFTER cereal is poured into bowl, amt of cereal in box still remains a random quantity (but
smaller mean now), BUT it is even more variable b/c of additional variation in amt that was
poured
- variance in amt of cereal remaining in box = sum of 2 variances
- becomes more variable, now that it has been distributed into two containers
- this formula for SD ONLY works for INDEPENDENT RANDOM VARIABLES
=> must check for independence b4 using it
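The cereal analogy can be tried out with a quick simulation. This is only a sketch: the two SDs (0.2oz for the box, 0.3oz for the pour) and the Normal shape are assumptions made up for illustration, not values from the txtbook.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical spreads (not from the notes): box fill and poured serving
# are modelled as independent Normal amounts, in ounces.
BOX_MEAN, BOX_SD = 16.0, 0.2
POUR_MEAN, POUR_SD = 2.0, 0.3

N = 50_000
box = [random.gauss(BOX_MEAN, BOX_SD) for _ in range(N)]
pour = [random.gauss(POUR_MEAN, POUR_SD) for _ in range(N)]
left = [b - p for b, p in zip(box, pour)]   # cereal remaining in box

mean_left = statistics.mean(left)
var_left = statistics.pvariance(left)       # simulated variance of remainder
var_sum = BOX_SD**2 + POUR_SD**2            # variances add, even when subtracting

print(f"mean left     = {mean_left:.3f}")
print(f"simulated var = {var_left:.4f}")
print(f"sum of var's  = {var_sum:.4f}")
```

The mean of what is left drops (about 16 - 2), but the simulated variance should land near the SUM of the two variances, not their difference. This relies on the box and the pour being generated independently.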
p587
THE STANDARD DEVIATION OF THE DIFFERENCE BETWEEN TWO PROPORTIONS
- b/c prop's observed in indep. random samples ARE indep, can use formula
=> can put in prop's for X and Y and add variances
So then, recall that SD is sqrt of variance:
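A sketch of the formulas in standard notation (the hats for sample prop's, q = 1 - p, and n1, n2 for the group sizes are notational assumptions, not copied from the txtbook):

```latex
\begin{align*}
\mathrm{Var}(\hat{p}_1 - \hat{p}_2)
  &= \mathrm{Var}(\hat{p}_1) + \mathrm{Var}(\hat{p}_2)
   = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2},
  \qquad q_i = 1 - p_i \\
SD(\hat{p}_1 - \hat{p}_2)
  &= \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}
\end{align*}
```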
Typically, p1 and p2 unknown
- when have sample prop's (from data), can use those to estimate variances
- whenever we sub in sample prop's for the true prop's to solve for SD, we are getting SE
(standard error), the SD coming from subbing in sample prop's
p588
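The SE for the seatbelt data can be worked out as a check. A minimal sketch; the function name `se_diff_props` is made up for illustration, and the counts are the ones from the MASSA Drivers example:

```python
import math

def se_diff_props(x1, n1, x2, n2):
    """Standard error of p1_hat - p2_hat for two indep. samples."""
    p1, p2 = x1 / n1, x2 / n2            # sample prop's stand in for true prop's
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# M drivers w/ F passengers vs. M drivers w/ M passengers
se = se_diff_props(2777, 4208, 1363, 2763)
print(f"SE(p1_hat - p2_hat) = {se:.4f}")
```

The SE comes out at roughly 1.2 percentage points, which is the ruler for judging the 16.7% observed diff.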
ASSUMPTIONS AND CONDITIONS
-> INDEPENDENCE ASSUMPTION
- w/in each group, data based on results for indep. individuals.
- THIS cannot be checked with certainty, but instead check assoc. conditions
1. RANDOMIZATION CONDITION
- data from each group is drawn indep'ly & at random from homogen popn or gen'ed by
randomized comparative expt
- if not, then sample should be reasonably rep'ive of some large popn if we still want conclusions
to carry over to that popn
2. 10% CONDITION
- if data sampled w/out replacement, then sample mustn't make up greater than 10% of popn
-> INDEPENDENT GROUPS ASSUMPTION
- 2 groups that're being compared MUST be indep. of each other
- if this assumption is violated, then methods will not work
- whether groups are indep. is usually evident from how data were collected
Why is IGA so impt?
- cannot apply Pythagorean-style variance formula if groups are somehow related to each other,
or dep. on each other => prop's not indep
(ex) - subj's performance prior to treatment can be related to performance AFTER this treatment
- ie. same group of subj's before and after some treatment
- other analytical methods (not discussed in this txtbook) will be req. if individuals from one
group are somehow linked to those in the comparison group
SAMPLE SIZE CONDITION
- each group must be sufficiently large
- req. larger groups to est. prop's that're close to 0% or 100%
- check (for each group)
-> SUCCESS/FAILURE CONDITION
- both groups are sufficiently large s.t. at least 10 successes and 10 failures are observed from
each
What if this condition is not met?
- then, just like how we resolved this prob. for working with one prop, can use the +4 adjustment
- add 1 fake success and 1 fake failure to each of the samples (4 fake obs. in total)
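The Success/Failure check and the +4 fix can be sketched in code. The function names and the two small groups below are hypothetical, chosen only to show a case where the condition fails:

```python
def success_failure_ok(successes, n, minimum=10):
    """True if one group has at least `minimum` successes AND failures."""
    return successes >= minimum and (n - successes) >= minimum

def adjusted_prop(successes, n):
    """Adjusted sample prop.: 1 fake success and 1 fake failure added."""
    return (successes + 1) / (n + 2)

# Hypothetical small groups: (successes, group size).
groups = {"A": (7, 40), "B": (12, 45)}

# Condition must hold for BOTH groups; if not, adjust each sample.
all_ok = all(success_failure_ok(x, n) for x, n in groups.values())
for name, (x, n) in groups.items():
    p = x / n if all_ok else adjusted_prop(x, n)
    print(name, round(p, 3))
```

Group A has only 7 successes, so the condition fails and both prop's are computed from the adjusted counts.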
- know: for sufficiently large samples, each prop. has approx. Normal sampling distribution
- this is also true for their DIFFERENCE (ie. p1 - p2)
- b/c we do not know true val's, work with observed prop's instead, and use standard error
SE(p1 - p2) to est. true standard dev., SD(p1 - p2)
PREJUDICE IN THE PENITENTIARY?
- women serving time in federal jail
- security risk classification
- Federal jails in Canada
- compare treatment of Aboriginal and non-Aboriginal women inmates
1. classified as medium security risk?
>- 41 of 68 (Aboriginal, 60%)
>- 112 of 266 (Non-Aboriginal, 42%)
2. how many, from the whole, committed infractions?
>- 21 of 68 (Aboriginal, about 31%)
>- 36 of 68 (Non-Aboriginal, about 53%)
=> 18% observed diff. in medium security risk classification (60% - 42%)
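The 18% figure and the Success/Failure condition for the classification counts can be checked together. A small sketch using only the counts quoted above; the variable names are made up:

```python
# Medium security risk counts from the notes: (classified, group size).
aboriginal = (41, 68)
non_aboriginal = (112, 266)

p_ab = aboriginal[0] / aboriginal[1]
p_non = non_aboriginal[0] / non_aboriginal[1]
diff = p_ab - p_non                      # observed diff. in sample prop's

# Success/Failure check: at least 10 successes and 10 failures per group.
sf_ok = all(x >= 10 and n - x >= 10 for x, n in (aboriginal, non_aboriginal))

print(f"p_ab  = {p_ab:.3f}")
print(f"p_non = {p_non:.3f}")
print(f"diff  = {diff:.3f}")
print("success/failure condition met:", sf_ok)
```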
INTERPRETING THE DIFFERENCE
- is it either:
- real: (ie. indicative of real diff. in gen medium risk classification rates b/ween Aborig.
vs. non-Aborig. women)
- by chance: (ie. result of some random var. that naturally occurs from sample to sample)
=> have to rule out random sampling variation b4 can conclude that there exists real diff.
HOW TO SOLVE?
- use HYPO TEST
- parameter of interest (p1 - p2) = true difference between prop's from the two groups
- true diff. b/ween medium security risk classification rates of 2 groups
- H0: p1 - p2 = 0
=> there is no diff. in the prop's => diff. in prop's is 0
- aka: H0: p1 = p2
EVERYONE