Statistics full notes.docx

Statistical Sciences
Statistical Sciences 1024A/B
Mary Millard

Statistics

• datasets
  o collection of data, usually presented in table form
  o rows are individuals or cases (people, things)
  o columns are variables (characteristics of a person/thing)
• data can be rounded or truncated
  o rounded: 6.3976 -> 6.4
  o truncated: 6.3976 -> 6.3
• categorical variables
  o category or tag
  o qualitative
    - displayed with pie charts, bar graphs; usually described by words or letters
• quantitative variables
  o numerical value
  o always have a unit of measurement
    - displayed with histograms, stem and leaf plots, time plots
• data context considerations
  o Who: what are the cases/individuals being observed?
  o What: what are the variables being observed?
  o Why: what was the purpose of the study?
  o examples: how many people were studied? where was the study published? what questions were asked? how were the people selected? etc.
• exploratory data analysis
  o examining data for their main features (slenderman example)
• distribution of a variable
  o tells us what values the variable takes and how often it takes them

A) Categorical displays

1) Frequency tables
• usually show the count and the percentage for each category

2) Pie chart
• shows the fraction of cases in each category
• emphasizes how each category compares to the whole (slices add up to 100%)

3) Bar charts
• each bar corresponds to a category of data
• generally more flexible than pie charts
• vertical scale starts at 0
• maximum value should be near the top of the chart
• common problems with bar graphs
  - scale selection shouldn't distort the graph
  - width of bars should be equal (ex. bad graph with the houses as bars)

B) Quantitative variable displays

1) Histogram (most common form of quantitative display)
• tool to describe the distribution of a variable
• each bar represents a class
• height of bar represents the number (or %) of cases in a range
• no horizontal space between bars unless a class is empty (bars are adjoining)
• histogram interpretation
  a) Shape
    - symmetry (left and right sides are approximate mirror images)
    - skewness: "skewed to the right" means that the right side extends further out than the left
    - outliers
    - local peaks: unimodal (1 peak) or bimodal (2 peaks)
  b) Center
    - median
    - approximately where the data can be split into two halves
  c) Spread
    - clustered or spread out?

2) Stem and leaf plots
• display the distribution of a single quantitative variable (numerical values)
• best for small data sets

    18 | 1
    19 | 0 2 8
    20 | 0 3
    21 |
    22 | 4

  18 | 1 represents 181
• stem and leaf plots can be split

    13 | 0 2 2 3 4 6 7 8 8 9

  becomes ->

    13 | 0 2 2 3 4
    13 | 6 7 8 8 9

• back-to-back stem plots
  o used to compare two distributions
  o left side of the stem represents one distribution while the right side represents the other
  o it is normal for one side to have more data than the other

            | 18 | 1
    0 2 3 4 | 19 | 0 2 8
        5 7 | 20 | 0 3
        1 2 | 21 |
            | 22 | 4

3) Time plot
• used when looking at variables measured over time
• time is on the x-axis
• points are connected (connect-the-dots)

              Mon   Tues   Wed   Thurs   Fri
    week 1     10      7     6       8    11
    week 2     14      5    10       8     7
    week 3      9      3     6       4     6

  (Data set: number of times students are late)
  [Figure: time plot of the data set above]

• cross-sectional data (quantitative variable display)
  o show variation of a variable at a fixed time (ex. histogram and stem plot)
• time series data (quantitative variable display)
  o show change in a variable over time (x-axis) (ex. time plots)
• resistance
  o a measure is resistant if its value changes only slightly when observations change, no matter how large the change in those observations is

    resistant               non-resistant
    median                  mean
    interquartile range     range
                            standard deviation

  ex. 1, 3, 4, 6, 8 -> change 3 to 100 -> mean increases drastically (4.4 to 23.8), while the median only moves from 4 to 6, so the median is resistant

Describing distributions with numbers

A) Measures of center
• mean
  o most common measure
• median
  o values arranged in order
  o for an odd number of observations, the median is the middle number
  o for an even number of observations, the median is the average of the two middle numbers
    ex. 1, 2, 3, 5, 5, 8 -> median is (3+5)/2 = 4
• midrange/midpoint
  o average of the maximum and minimum values
    - (max + min)/2
  o not a resistant measure
  o rare
• mode
  o most frequently occurring value
  o there may be more than one mode
  o useful for describing categorical data

B) Measures of spread
• range
  o highest value - lowest value = range
  o full spread of the data
    - can include outliers
  o is NOT resistant
• interquartile range
  o is the measure of the spread of the middle half of the data
  o difference between the 3rd quartile and 1st quartile when values are lined up in numerical order
    - interquartile range = Q3 - Q1
    - 1st quartile is the value that is larger than 1/4 of the observations
    - 3rd quartile is the value that is larger than 3/4 of the observations
    - 2nd quartile is the median
  o is a resistant measure
  o if there are two numbers for the 3rd or 1st quartile, then their average is taken to determine the 3rd or 1st quartile
• standard deviation
  o measures average distance from the mean
    - if all observations are the same, then the standard deviation will be 0
    - value cannot be negative
    - a large standard deviation implies that the data are more spread out
  o measured in the same units as the original observations
    ex. 90, 87, 95, 86, 81, 102, 105, 83, 88, 79
    average = 89.6
    variance = [(90-89.6)^2 + (87-89.6)^2 + (95-89.6)^2 + (86-89.6)^2 + (81-89.6)^2 + ... + (79-89.6)^2] / (n-1), with n = 10
    standard deviation = square root of variance
  o standard deviation is not resistant
• numerical summaries
  o measure of center and spread using either 1) or 2)
  1) Mean and standard deviation
    - best for symmetric distributions with no outliers
  2) Five number summary
    - gives a reasonably complete description of centre and spread
    - best used for skewed distributions or those with outliers
    - (min) (Q1) (M or Q2) (Q3) (max)
• outliers
  o observations above Q3 + 1.5 x interquartile range
  o observations below Q1 - 1.5 x interquartile range
    ex.   50     105     110     135     175
         (min)   (Q1)    (Q2)    (Q3)    (max)
• box plot
  o most useful for comparing several distributions (many sets of data)
  o gives an indication of the symmetry or skewness of a distribution
• boxplots are used to
  o compare medians
  o compare spread using IQR or range
  o check for indication of skewness
    - right skew: when Q3 is further above the median than Q1 is below it
    - left skew: when Q1 is further below the median than Q3 is abo