Mon 3/20: The Normal Distribution

Reading: 12.3, 12.4 (SD and Normal Curve, CLT)

New Methods Today:
np.random.normal(): gives a random number drawn as a sample from an idealized normal distribution. With no arguments that means mean = 0 and SD = 1, so the result almost always lands between -5 and 5.

Lecture

SD = root mean square of deviations from average.

How can we say something more numerical than "just a few standard deviations from the average"? Once we've computed the mean and the standard deviation, we can use Chebyshev's Inequality to tell us where the values are.

Chebyshev's Inequality: no matter what shape the distribution is, the proportion of values in the range average ± z SDs is at least $1 - 1/z^2$.

Notice that this gives no useful bound for 1 standard deviation (the bound is $1 - 1/1^2 = 0$). That range could contain arbitrarily few values.

Demo: in the births dataset, 94.88% of maternal ages are within 2 standard deviations of the average, and 99% are within 3. Chebyshev's bounds for those ranges are 75% and about 88.9%, so the observed numbers differ from the bounds, but the important thing to notice is that the numbers WE got are higher than Chebyshev's bounds. It's not uncommon to find proportions well above Chebyshev's lower bounds. Chebyshev holds for binary variables as well.

Standard Units: a standard scale that answers "how many SDs above average?"

$z = (\text{value} - \text{mean}) / \text{SD}$

Negative z = below avg
Positive z = above avg
z = 0: exactly average

When values are in standard units (z-scored), average = 0 and np.std = 1. This holds for EVERY standardized distribution, whatever scale the values were in before!

Chebyshev says that most values of z are between -5 and 5 (at least $1 - 1/5^2 = 96\%$ of them).

And if you graph all the standardized values?

The NORMAL DISTRIBUTION

It is hard to estimate the SD visually, except when the histogram is bell shaped.

Demo: we are using the maternal height histogram, which is roughly bell shaped. We look at the inflection point of the curve. It should be one standard deviation away from the average. You know most of the values are close to the average.

Exactly normal data is hard to find, but some distributions are close. Some are not close, so we have to check before assuming normal.
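The standard-units conversion and Chebyshev's bound from earlier in these notes can be sketched in code. This is only an illustration: the helper names `standard_units`, `chebyshev_bound`, and `proportion_within` are made up for this sketch, and the data is a hypothetical skewed sample from NumPy, not the births dataset used in the demo.

```python
import numpy as np

def standard_units(values):
    """Convert values to standard units: (value - mean) / SD."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

def chebyshev_bound(z):
    """Chebyshev's lower bound on the proportion within z SDs of average."""
    return 1 - 1 / z**2

def proportion_within(values, z):
    """Observed proportion of values within z SDs of the average."""
    return float(np.mean(np.abs(standard_units(values)) <= z))

# Hypothetical, deliberately non-normal (right-skewed) data.
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=10_000)

# In standard units the average is 0 and the SD is 1 (up to rounding error).
su = standard_units(skewed)
print("average in standard units:", float(np.mean(su)))
print("SD in standard units:", float(np.std(su)))

# Compare the observed proportions with Chebyshev's lower bounds.
for z in (2, 3, 4, 5):
    print(f"z={z}: observed {proportion_within(skewed, z):.4f}, "
          f"bound {chebyshev_bound(z):.4f}")
```

Even for this skewed sample, the observed proportions should sit at or above the bounds, which is the same pattern the births demo showed.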
If it's a little bit off normal, you can still use the normal distribution.

If a histogram is bell shaped, then: the avg is at the center, and the SD is the distance between the avg and the inflection points on either side. Also, we can say that almost all data are in the range average ± 3 SDs. We also get to impose stricter bounds than the ones Chebyshev gives.
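To see how much stricter the bounds get for bell-shaped data, here is a minimal sketch that draws a large sample with np.random.normal() and compares the proportion within 1, 2, and 3 SDs against Chebyshev's lower bound. The sample size, the seed, and the `results` dictionary are assumptions chosen for this illustration.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable
sample = np.random.normal(size=100_000)  # defaults: mean 0, SD 1

avg = np.mean(sample)
sd = np.std(sample)

results = {}
for z in (1, 2, 3):
    # Proportion of the sample within z SDs of the average.
    observed = float(np.mean(np.abs(sample - avg) <= z * sd))
    # Chebyshev's lower bound; it is 0 (no information) when z = 1.
    bound = 1 - 1 / z**2
    results[z] = (observed, bound)
    print(f"within {z} SD(s): observed {observed:.4f}, Chebyshev bound {bound:.4f}")
```

For a normal sample the observed proportions come out near 68%, 95%, and 99.7%, well above the Chebyshev bounds of 0%, 75%, and about 88.9%.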